SH | Diensttagebuch

In the middle of the desert you can say anything you want

20 Sep 2021

Day 993

Nvidia GPU/eGPU drivers blues

I thought I had already set up nvidia-smi and friends (Day 850 | Diensttagebuch (Work journal)), then didn’t use it for months. Now when I tried it, it didn’t work anymore: nvidia-smi said “No devices were found”.

dkms status said the correct drivers were installed, boltctl showed the device as connected and authorized, lsmod showed that the nvidia modules were loaded, and prime-select said nvidia was selected.

(11:53:23/10181)~/$ dkms status
nvidia, 460.73.01, 5.4.0-73-generic, x86_64: installed
nvidia, 460.73.01, 5.4.0-74-generic, x86_64: installed

(11:53:49/10182)~/$ boltctl
[snip]
 ● Lenovo ThinkPad Thunderbolt 3 Dock #2
   ├─ type:          peripheral
   ├─ name:          ThinkPad Thunderbolt 3 Dock
   ├─ vendor:        Lenovo
   ├─ uuid:          xxx
   ├─ status:        authorized
   │  ├─ domain:     domain0
   │  └─ authflags:  none
   ├─ authorized:    Mo 20 Sep 2021 09:41:16 UTC
   ├─ connected:     Mo 20 Sep 2021 09:41:16 UTC
   └─ stored:        no

 ● GIGABYTE GV-N1070IXEB-8GD
   ├─ type:          peripheral
   ├─ name:          GV-N1070IXEB-8GD
   ├─ vendor:        GIGABYTE
   ├─ uuid:          xxx
   ├─ status:        authorized
   │  ├─ domain:     domain0
   │  └─ authflags:  none
   ├─ authorized:    Mo 20 Sep 2021 09:42:35 UTC
   ├─ connected:     Mo 20 Sep 2021 09:42:35 UTC
   └─ stored:        Mo 20 Sep 2021 09:31:09 UTC
      ├─ policy:     manual
      └─ key:        no

(11:54:54/10188)~/$ lsmod
Module                  Size  Used by
nvidia_uvm           1015808  0
nvidia_drm             57344  1
nvidia_modeset       1228800  1 nvidia_drm
nvidia              34123776  17 nvidia_uvm,nvidia_modeset

(11:55:54/10192)~/$ sudo prime-select query
nvidia
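
The same checks can be collected in one go, which makes it easier to compare a working boot with a broken one. A minimal sketch, assuming the tools above are on PATH; run_checks and the CHECKS table are hypothetical names, and prime-select is queried without sudo here, which may not be enough on every system:

```python
import shutil
import subprocess

# Commands taken from the session above; the table name and layout are illustrative.
CHECKS = {
    "dkms": ["dkms", "status"],
    "boltctl": ["boltctl"],
    "lsmod": ["lsmod"],
    "prime-select": ["prime-select", "query"],  # assumption: works without sudo
    "nvidia-smi": ["nvidia-smi"],
}


def run_checks(checks):
    """Run each command and return {name: (returncode, output)}.

    Commands that are not installed are reported as (None, "not found")
    instead of raising.
    """
    results = {}
    for name, cmd in checks.items():
        if shutil.which(cmd[0]) is None:
            results[name] = (None, "not found")
            continue
        proc = subprocess.run(cmd, capture_output=True, text=True)
        # Some tools report errors on stderr only, so keep both streams.
        results[name] = (proc.returncode, (proc.stdout + proc.stderr).strip())
    return results
```

Calling run_checks(CHECKS) and saving the result gives one snapshot per boot that can be diffed later.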

What didn’t work:

  • prime-select cycling to default and then back to nvidia and rebooting
  • power-cycling the CPU
  • Connecting it directly, not through the dock, the exact same setup I had when it was working (link above)

What worked:

  • Honestly no idea
  • logging into gnome, opening the driver config window, logging back into i3, rebooting?…

Offtopic, when I was googling these issues I found my own serhii.net link above on the first page of Google for the key '“nvidia-smi “no devices were found” authorized', which is both nice and sad at the same time :)

EDIT: the next morning it didn’t work again. None of the same magic steps helped, in any order. I think it might be an issue with the eGPU or dock or something at that level. The best way to check this would be the nuclear option: uninstall all drivers and install everything from scratch, but I think my monthly quota of GPU stuff is full five times over now.

Diensttagebuch / Meta

We’re on day 993 (!) of Diensttagebuch! Freaking awesome.

python pip “advanced” requirements.txt creation

Was creating a requirements.txt for detectron2, official install instructions were:

python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu102/torch1.9/index.html

Answer specifically about this: python - How to format requirements.txt when package source is from specific websites? - Stack Overflow:

requirements.txt format is:

[[--option]...]
<requirement specifier> [; markers] [[--option]...]
<archive url/path>
[-e] <local project path>
[-e] <vcs project url>

<requirement specifier> is:

SomeProject
SomeProject == 1.3
SomeProject >=1.2,<2.0
SomeProject[foo, bar]
SomeProject~=1.4.2

The --option (such as -f/--find-links) is the same as the pip install options you would use if you were doing pip install from the command line.

Therefore, in requirements.txt it ended up literally as this:

--find-links https://dl.fbaipublicfiles.com/detectron2/wheels/cu102/torch1.9/index.html
detectron2
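
To see which lines count as options and which as requirements, here is a rough sketch. It is not pip's actual parser, just a simplification that looks at how each line starts; classify_line is a hypothetical helper:

```python
def classify_line(line):
    """Roughly classify one requirements.txt line.

    Simplification (not pip's real parser): '#' starts a comment,
    '-e'/'--editable' marks an editable requirement, any other line
    starting with '-' is an option, and the rest is a requirement.
    """
    stripped = line.strip()
    if not stripped:
        return "blank"
    if stripped.startswith("#"):
        return "comment"
    if stripped.startswith(("-e ", "--editable")):
        return "editable"
    if stripped.startswith("-"):
        return "option"
    return "requirement"


# The option goes on its own line, the requirement specifier on the next one.
example = """\
--find-links https://dl.fbaipublicfiles.com/detectron2/wheels/cu102/torch1.9/index.html
detectron2
"""

for line in example.splitlines():
    print(classify_line(line), "|", line)
```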

And by the way, detectron2’s own requirements.txt nicely demonstrates part of the above.

My own requirements.txt for CUDA 11.1:

opencv-python==4.2.0.32

# torch 1.9 for cuda 10.2 (for this config https://pytorch.org/get-started/locally/ has no versions in the command)
# getting both exact versions from pip freeze
-f https://download.pytorch.org/whl/torch_stable.html
torch==1.9.0+cu111
torchvision==0.10.0+cu111
#torch==1.7.1
#torchvision==0.8.2

# python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu102/torch1.9/index.html
-f https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.9/index.html
detectron2
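
A quick way to check that the installed wheels really are the CUDA build the file pins is to look at the local version suffix (the part after the +). A minimal sketch using only the standard library; cuda_tag and installed_cuda_tags are hypothetical helper names, and it assumes Python 3.8+ for importlib.metadata:

```python
from importlib import metadata


def cuda_tag(version):
    """Return the local version label, e.g. 'cu111' for '1.9.0+cu111', else None."""
    _, sep, local = version.partition("+")
    return local if sep else None


def installed_cuda_tags(packages=("torch", "torchvision")):
    """Map each package name to its local version label (None if absent or untagged)."""
    tags = {}
    for name in packages:
        try:
            tags[name] = cuda_tag(metadata.version(name))
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            tags[name] = None
    return tags
```

If torch and torchvision report different tags, pip most likely resolved one of them from the default index instead of the -f page.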

grep/ag

Best part about ag is that I don’t need to escape anything with its default settings:

pip freeze | ag "(detectron|torch)"
pip freeze | grep "\(detectron\|torch\)"
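
The same filter in Python, where the regex syntax is the unescaped one that ag accepts. A small sketch that works on any list of pip-freeze-style lines; filter_lines is a hypothetical name and the sample lines are taken from the requirements file above:

```python
import re


def filter_lines(lines, pattern=r"(detectron|torch)"):
    """Keep the lines that match the pattern anywhere, like grep/ag do."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]


# Sample input; in practice this would be the output of pip freeze.
frozen = [
    "detectron2",
    "opencv-python==4.2.0.32",
    "torch==1.9.0+cu111",
    "torchvision==0.10.0+cu111",
]
print(filter_lines(frozen))
```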

pycharm test “teamcity” output bug

Suddenly stopped getting readable output. Fix is to add the env variable JB_DISABLE_BUFFERING, without any value, to the env of the test. teamcity - no output in console for unittests in pycharm 2017 - Stack Overflow
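
To confirm from inside a test that the variable actually reached the test process, a check along these lines should do. It only tests for presence, since the variable is set without a value; how PyCharm's runner itself reads the variable is an assumption here, and buffering_disabled is a hypothetical helper:

```python
import os


def buffering_disabled(env=None):
    """True if JB_DISABLE_BUFFERING is present, even with an empty value."""
    env = os.environ if env is None else env
    # Presence check on purpose: an empty string value still counts as "set".
    return "JB_DISABLE_BUFFERING" in env
```

Printing buffering_disabled() at the top of a test shows whether the run configuration picked the variable up.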
