Day 993
Nvidia GPU/eGPU drivers blues
I thought I had already set up nvidia-smi and friends (Day 850 | Diensttagebuch (Work journal)), then didn’t use it for months. Now that I tried it again, it didn’t work anymore: nvidia-smi said “No devices were found”. boltctl showed the device as connected and authorized, prime-select said nvidia was selected, lsmod showed that the correct drivers were loaded, and dkms status said the correct drivers were installed.
(11:53:23/10181)~/$ dkms status
nvidia, 460.73.01, 5.4.0-73-generic, x86_64: installed
nvidia, 460.73.01, 5.4.0-74-generic, x86_64: installed
(11:53:49/10182)~/$ boltctl
[snip]
● Lenovo ThinkPad Thunderbolt 3 Dock #2
├─ type: peripheral
├─ name: ThinkPad Thunderbolt 3 Dock
├─ vendor: Lenovo
├─ uuid: xxx
├─ status: authorized
│ ├─ domain: domain0
│ └─ authflags: none
├─ authorized: Mo 20 Sep 2021 09:41:16 UTC
├─ connected: Mo 20 Sep 2021 09:41:16 UTC
└─ stored: no
● GIGABYTE GV-N1070IXEB-8GD
├─ type: peripheral
├─ name: GV-N1070IXEB-8GD
├─ vendor: GIGABYTE
├─ uuid: xxx
├─ status: authorized
│ ├─ domain: domain0
│ └─ authflags: none
├─ authorized: Mo 20 Sep 2021 09:42:35 UTC
├─ connected: Mo 20 Sep 2021 09:42:35 UTC
└─ stored: Mo 20 Sep 2021 09:31:09 UTC
├─ policy: manual
└─ key: no
(11:54:54/10188)~/$ lsmod
Module Size Used by
nvidia_uvm 1015808 0
nvidia_drm 57344 1
nvidia_modeset 1228800 1 nvidia_drm
nvidia 34123776 17 nvidia_uvm,nvidia_modeset
(11:55:54/10192)~/$ sudo prime-select query
nvidia
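One thing the listings above don’t show directly is whether the kernel that is actually running is one of the kernels dkms built the module for. A minimal sketch of that check, assuming the comma-separated `dkms status` format shown above (other dkms versions may format it differently); `parse_dkms_status` and `installed_for_kernel` are hypothetical helper names, not part of any tool:

```python
import platform


def parse_dkms_status(text):
    """Parse lines of the form 'module, version, kernel, arch: status'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        fields, status = line.rsplit(":", 1)
        parts = [p.strip() for p in fields.split(",")]
        if len(parts) != 4:
            continue  # unknown layout, skip instead of guessing
        module, version, kernel, arch = parts
        entries.append({
            "module": module,
            "version": version,
            "kernel": kernel,
            "arch": arch,
            "status": status.strip(),
        })
    return entries


def installed_for_kernel(entries, module, kernel):
    """True if `module` is reported as installed for exactly this kernel."""
    return any(
        e["module"] == module
        and e["kernel"] == kernel
        and e["status"] == "installed"
        for e in entries
    )


# The two lines from the dkms listing above.
SAMPLE = """\
nvidia, 460.73.01, 5.4.0-73-generic, x86_64: installed
nvidia, 460.73.01, 5.4.0-74-generic, x86_64: installed
"""

entries = parse_dkms_status(SAMPLE)
running = platform.release()  # kernel release of the machine running this
print(running, installed_for_kernel(entries, "nvidia", running))
```

On the real machine, the output of `dkms status` would replace `SAMPLE`. A `False` here would point at a kernel update the module was never built for.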
What didn’t work:
- prime-select cycling to default and then back to nvidia and rebooting
- power-cycling the CPU
- Connecting it directly, not through the dock, i.e. the exact same setup I had when it was working (link above)
What worked:
- Honestly no idea
- logging into gnome, opening the driver config window, logging back into i3, rebooting?…
Off-topic: while googling these issues I found my own serhii.net link above on the first page of Google for the query 'nvidia-smi "no devices were found" authorized', which is both nice and sad at the same time :)
EDIT: the next morning it didn’t work again. None of the same magic steps helped, in any order. I think it might be an issue with the eGPU or dock or something at that level. The best way to check this would be the nuclear option: uninstall all drivers and install everything from scratch, but I think my monthly quota of GPU stuff is full five times over now.
Diensttagebuch / Meta
We’re on day 993 (!) of Diensttagebuch! Freaking awesome.
python pip “advanced” requirements.txt creation
Was creating a requirements.txt for detectron2, official install instructions were:
python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu102/torch1.9/index.html
Answer specifically about this: python - How to format requirements.txt when package source is from specific websites? - Stack Overflow:
requirements.txt format is:
[[--option]...]
<requirement specifier> [; markers] [[--option]...]
<archive url/path>
[-e] <local project path>
[-e] <vcs project url>
where <requirement specifier> is, for example:
SomeProject
SomeProject == 1.3
SomeProject >=1.2,<2.0
SomeProject[foo, bar]
SomeProject~=1.4.2
The --option (such as -f/--find-links) is the same as the pip install option you would use when running pip install from the command line.
Therefore, in requirements.txt it ended up literally as this:
--find-links https://dl.fbaipublicfiles.com/detectron2/wheels/cu102/torch1.9/index.html
detectron2
And by the way, detectron2’s own requirements.txt demonstrates nicely part of the above.
My own requirements.txt for CUDA 11.1:
opencv-python==4.2.0.32
# torch 1.9 for cuda 10.2 (for this config https://pytorch.org/get-started/locally/ has no versions in the command)
# getting both exact versions from pip freeze
-f https://download.pytorch.org/whl/torch_stable.html
torch==1.9.0+cu111
torchvision==0.10.0+cu111
#torch==1.7.1
#torchvision==0.8.2
# python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu102/torch1.9/index.html
-f https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.9/index.html
detectron2
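To make the option-line versus requirement-line distinction concrete, here is a rough sketch that splits a requirements.txt into the two groups. It is a simplification and not pip’s own parser: every line starting with `-` counts as an option (so `-e` lines would land there too), and inline comments and line continuations are not handled. `split_requirements` is a hypothetical helper name.

```python
def split_requirements(text):
    """Split requirements.txt content into (option lines, requirement lines)."""
    options, requirements = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and full-line comments
        if line.startswith("-"):
            options.append(line)  # e.g. -f / --find-links
        else:
            requirements.append(line)  # requirement specifiers
    return options, requirements


# The active lines of the requirements.txt above.
REQUIREMENTS = """\
opencv-python==4.2.0.32
-f https://download.pytorch.org/whl/torch_stable.html
torch==1.9.0+cu111
torchvision==0.10.0+cu111
#torch==1.7.1
-f https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.9/index.html
detectron2
"""

options, requirements = split_requirements(REQUIREMENTS)
for line in options:
    print("option:     ", line)
for line in requirements:
    print("requirement:", line)
```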
grep/ag
The best part about ag is that I don’t need to escape anything with its default settings:
pip freeze | ag "(detectron|torch)"
pip freeze | grep "\(detectron\|torch\)"
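For comparison, Python’s `re` module treats parentheses and `|` the same way ag does, so the unescaped pattern works there as well. A small sketch of the same filter; the input list is illustrative, not real `pip freeze` output, and `filter_lines` is a hypothetical helper name:

```python
import re


def filter_lines(lines, pattern):
    """Keep the lines in which `pattern` matches anywhere."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]


# Illustrative package lines, loosely based on the requirements above.
FREEZE = [
    "opencv-python==4.2.0.32",
    "torch==1.9.0+cu111",
    "torchvision==0.10.0+cu111",
    "detectron2",
]

# Same unescaped pattern as in the ag call.
for line in filter_lines(FREEZE, r"(detectron|torch)"):
    print(line)
```

With grep itself, `grep -E "(detectron|torch)"` accepts the same unescaped pattern.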
pycharm test “teamcity” output bug
Suddenly I stopped getting readable test output.
The fix is to add the env variable JB_DISABLE_BUFFERING, without any value, to the env of the test.
teamcity - no output in console for unittests in pycharm 2017 - Stack Overflow
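Because the variable is added without a value, a truthiness check such as `os.environ.get(...)` would read it as unset. A small sketch for confirming from inside a test that the variable actually reached the process. `env_state` is a hypothetical helper, and nothing here says how PyCharm itself reads the variable:

```python
import os


def env_state(name, env=None):
    """Return 'missing', 'empty', or 'set' for an environment variable."""
    env = os.environ if env is None else env
    if name not in env:
        return "missing"
    return "empty" if env[name] == "" else "set"


# 'empty' is the expected state when the variable was added without a value.
print(env_state("JB_DISABLE_BUFFERING"))
```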