The cuda-drivers version that lets RTX 4090s run in parallel for deep learning
When you try to run multiple GeForce RTX 4090s in parallel for deep learning and similar workloads, they do not work properly unless you pin the driver to a specific version (525.105.17).
Running nvidia-smi and deviceQuery on a machine fitted with the GeForce RTX 4090s gives the following:
(base) dl@dl-machine:~$ nvidia-smi
Sun Jun 18 10:15:43 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf           Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090         On | 00000000:01:00.0 Off |                  Off |
| 33%   34C    P8              16W / 450W |      6MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090         On | 00000000:61:00.0 Off |                  Off |
| 33%   34C    P8              13W / 450W |      6MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090         On | 00000000:A1:00.0 Off |                  Off |
| 34%   34C    P8              16W / 450W |      6MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090         On | 00000000:C1:00.0 Off |                  Off |
| 34%   33C    P8              21W / 450W |      6MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1377      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      1377      G   /usr/lib/xorg/Xorg                            4MiB |
|    2   N/A  N/A      1377      G   /usr/lib/xorg/Xorg                            4MiB |
|    3   N/A  N/A      1377      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
(base) dl@dl-machine:~$ deviceQuery
deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version 12.1 / 12.1
  CUDA Capability Major/Minor version number: 8.9
  Total amount of global memory: 24217 MBytes (25393692672 bytes)
  (128) Multiprocessors, (128) CUDA Cores/MP: 16384 CUDA Cores
  GPU Max Clock rate: 2520 MHz (2.52 GHz)
  Memory Clock rate: 10501 Mhz
  Memory Bus Width: 384-bit
  L2 Cache Size: 75497472 bytes
  Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total shared memory per multiprocessor: 102400 bytes
  Total number of registers available per block: 65536
  Warp size: 32
  Maximum number of threads per multiprocessor: 1536
  Maximum number of threads per block: 1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 2 copy engine(s)
  Run time limit on kernels: Yes
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device supports Managed Memory: Yes
  Device supports Compute Preemption: Yes
  Supports Cooperative Kernel Launch: Yes
  Supports MultiDevice Co-op Kernel Launch: Yes
  Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version 12.1 / 12.1
  CUDA Capability Major/Minor version number: 8.9
  Total amount of global memory: 24217 MBytes (25393692672 bytes)
  (128) Multiprocessors, (128) CUDA Cores/MP: 16384 CUDA Cores
  GPU Max Clock rate: 2520 MHz (2.52 GHz)
  Memory Clock rate: 10501 Mhz
  Memory Bus Width: 384-bit
  L2 Cache Size: 75497472 bytes
  Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total shared memory per multiprocessor: 102400 bytes
  Total number of registers available per block: 65536
  Warp size: 32
  Maximum number of threads per multiprocessor: 1536
  Maximum number of threads per block: 1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 2 copy engine(s)
  Run time limit on kernels: Yes
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device supports Managed Memory: Yes
  Device supports Compute Preemption: Yes
  Supports Cooperative Kernel Launch: Yes
  Supports MultiDevice Co-op Kernel Launch: Yes
  Device PCI Domain ID / Bus ID / location ID: 0 / 97 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 2: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version 12.1 / 12.1
  CUDA Capability Major/Minor version number: 8.9
  Total amount of global memory: 24217 MBytes (25393692672 bytes)
  (128) Multiprocessors, (128) CUDA Cores/MP: 16384 CUDA Cores
  GPU Max Clock rate: 2520 MHz (2.52 GHz)
  Memory Clock rate: 10501 Mhz
  Memory Bus Width: 384-bit
  L2 Cache Size: 75497472 bytes
  Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total shared memory per multiprocessor: 102400 bytes
  Total number of registers available per block: 65536
  Warp size: 32
  Maximum number of threads per multiprocessor: 1536
  Maximum number of threads per block: 1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 2 copy engine(s)
  Run time limit on kernels: Yes
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device supports Managed Memory: Yes
  Device supports Compute Preemption: Yes
  Supports Cooperative Kernel Launch: Yes
  Supports MultiDevice Co-op Kernel Launch: Yes
  Device PCI Domain ID / Bus ID / location ID: 0 / 161 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 3: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version 12.1 / 12.1
  CUDA Capability Major/Minor version number: 8.9
  Total amount of global memory: 24217 MBytes (25393692672 bytes)
  (128) Multiprocessors, (128) CUDA Cores/MP: 16384 CUDA Cores
  GPU Max Clock rate: 2520 MHz (2.52 GHz)
  Memory Clock rate: 10501 Mhz
  Memory Bus Width: 384-bit
  L2 Cache Size: 75497472 bytes
  Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total shared memory per multiprocessor: 102400 bytes
  Total number of registers available per block: 65536
  Warp size: 32
  Maximum number of threads per multiprocessor: 1536
  Maximum number of threads per block: 1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 2 copy engine(s)
  Run time limit on kernels: Yes
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device supports Managed Memory: Yes
  Device supports Compute Preemption: Yes
  Supports Cooperative Kernel Launch: Yes
  Supports MultiDevice Co-op Kernel Launch: Yes
  Device PCI Domain ID / Bus ID / location ID: 0 / 193 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.1, CUDA Runtime Version = 12.1, NumDevs = 4
Result = PASS
(base) dl@dl-machine:~$
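For reference, deviceQuery and the simpleP2P sample used below are not part of the driver package; they come from NVIDIA's cuda-samples repository. A rough sketch of building them (assuming git, make, and a matching CUDA toolkit are installed; the directory layout and build system can differ between cuda-samples releases):

$ git clone https://github.com/NVIDIA/cuda-samples.git
$ cd cuda-samples/Samples/1_Utilities/deviceQuery && make    # builds ./deviceQuery
$ cd ../../0_Introduction/simpleP2P && make                  # builds ./simpleP2P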
At the end of the deviceQuery output, peer-to-peer access between the GPUs is reported as Yes.
Peer-to-peer access is a feature that lets GPUs transfer memory data directly between each other by DMA, without the CPU getting involved; when it is Yes, deep-learning training can run faster. Expensive GPUs such as the A100, H100 and RTX A6000 naturally support it. On the GeForce side it was supported up to the 1080 Ti, but, perhaps because those cards cut too deeply into the market for the pricey P100, it was set to No from the next generation, the 2080 Ti, onward, and the 3090 was of course No as well. I suspected that the Yes reported for the 4090 had to be some kind of mistake, so I tried running simpleP2P:
(base) dl@dl-machine:~/cuda-samples/Samples/0_Introduction/simpleP2P$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 4
Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 23.95GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 1: val = 0.000000, ref = 4.000000
Verification error @ element 2: val = 0.000000, ref = 8.000000
Verification error @ element 3: val = 0.000000, ref = 12.000000
Verification error @ element 4: val = 0.000000, ref = 16.000000
Verification error @ element 5: val = 0.000000, ref = 20.000000
Verification error @ element 6: val = 0.000000, ref = 24.000000
Verification error @ element 7: val = 0.000000, ref = 28.000000
Verification error @ element 8: val = 0.000000, ref = 32.000000
Verification error @ element 9: val = 0.000000, ref = 36.000000
Verification error @ element 10: val = 0.000000, ref = 40.000000
Verification error @ element 11: val = 0.000000, ref = 44.000000
Verification error @ element 12: val = 0.000000, ref = 48.000000
Disabling peer access...
Shutting down...
Test failed!
As this shows, it does fail with errors after all, and peer-to-peer communication does not actually work; the Yes reported above is apparently just for show. Left like this it would interfere with parallel training and the like, which is a nasty kind of failure, but I then found a note here saying that NVIDIA driver 525.105.17 solves the problem.
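Before touching the driver, it can be worth confirming what is actually loaded and how the GPUs are connected. These are standard nvidia-smi invocations, shown here only as a sketch (nothing specific to this machine):

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader   # currently loaded driver version
$ nvidia-smi topo -m                                            # PCIe/NVLink topology between the GPUs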
Checking which cuda-drivers versions are available to install:
root@dl-machine:/home/dl# apt list cuda-drivers -a
Listing... Done
cuda-drivers/unknown,now 530.30.02-1 amd64 [installed]
cuda-drivers/unknown 525.105.17-1 amd64
cuda-drivers/unknown 525.85.12-1 amd64
cuda-drivers/unknown 525.60.13-1 amd64
cuda-drivers/unknown 520.61.05-1 amd64
cuda-drivers/unknown 515.105.01-1 amd64
cuda-drivers/unknown 515.86.01-1 amd64
cuda-drivers/unknown 515.65.07-1 amd64
cuda-drivers/unknown 515.65.01-1 amd64
cuda-drivers/unknown 515.48.07-1 amd64
cuda-drivers/unknown 515.43.04-1 amd64
cuda-drivers/unknown 510.108.03-1 amd64
cuda-drivers/unknown 510.85.02-1 amd64
cuda-drivers/unknown 510.84-1 amd64
cuda-drivers/unknown 510.73.08-1 amd64
cuda-drivers/unknown 510.47.03-1 amd64
cuda-drivers/unknown 510.39.01-1 amd64
cuda-drivers/unknown 495.29.05-1 amd64
cuda-drivers/unknown 470.182.03-1 amd64
cuda-drivers/unknown 470.161.03-1 amd64
cuda-drivers/unknown 470.141.10-1 amd64
cuda-drivers/unknown 470.141.03-1 amd64
cuda-drivers/unknown 470.129.06-1 amd64
cuda-drivers/unknown 470.103.01-1 amd64
cuda-drivers/unknown 470.82.01-1 amd64
cuda-drivers/unknown 470.57.02-1 amd64
cuda-drivers/unknown 470.42.01-1 amd64
cuda-drivers/unknown 465.19.01-1 amd64
cuda-drivers/unknown 460.106.00-1 amd64
cuda-drivers/unknown 460.91.03-1 amd64
cuda-drivers/unknown 460.73.01-1 amd64
cuda-drivers/unknown 460.32.03-1 amd64
cuda-drivers/unknown 460.27.04-1 amd64
cuda-drivers/unknown 455.45.01-1 amd64
cuda-drivers/unknown 455.32.00-1 amd64
cuda-drivers/unknown 455.23.05-1 amd64
cuda-drivers/unknown 450.236.01-1 amd64
cuda-drivers/unknown 450.216.04-1 amd64
cuda-drivers/unknown 450.203.08-1 amd64
cuda-drivers/unknown 450.203.03-1 amd64
cuda-drivers/unknown 450.191.01-1 amd64
cuda-drivers/unknown 450.172.01-1 amd64
cuda-drivers/unknown 450.156.00-1 amd64
cuda-drivers/unknown 450.142.00-1 amd64
cuda-drivers/unknown 450.119.04-1 amd64
cuda-drivers/unknown 450.119.03-1 amd64
cuda-drivers/unknown 450.102.04-1 amd64
cuda-drivers/unknown 450.80.02-1 amd64
cuda-drivers/unknown 450.51.06-1 amd64
cuda-drivers/unknown 450.51.05-1 amd64
root@dl-machine:/home/dl#
So let's go ahead and switch to 525.105.17-1:
root@dl-machine:/home/dl# apt install cuda-drivers=525.105.17-1
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-530
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
  cuda-drivers-525 libnvidia-cfg1-525 libnvidia-common-525 libnvidia-compute-525 libnvidia-compute-525:i386
  libnvidia-decode-525 libnvidia-decode-525:i386 libnvidia-encode-525 libnvidia-encode-525:i386
  libnvidia-extra-525 libnvidia-fbc1-525 libnvidia-fbc1-525:i386 libnvidia-gl-525 libnvidia-gl-525:i386
  nvidia-compute-utils-525 nvidia-dkms-525 nvidia-driver-525 nvidia-kernel-common-525
  nvidia-kernel-source-525 nvidia-utils-525 xserver-xorg-video-nvidia-525
The following packages will be REMOVED:
  cuda-drivers-530 libnvidia-cfg1-530 libnvidia-compute-530 libnvidia-compute-530:i386 libnvidia-decode-530
  libnvidia-decode-530:i386 libnvidia-encode-530 libnvidia-encode-530:i386 libnvidia-extra-530
  libnvidia-fbc1-530 libnvidia-fbc1-530:i386 libnvidia-gl-530 libnvidia-gl-530:i386 nvidia-compute-utils-530
  nvidia-dkms-530 nvidia-driver-530 nvidia-kernel-common-530 nvidia-kernel-source-530 nvidia-utils-530
  xserver-xorg-video-nvidia-530
The following NEW packages will be installed:
  cuda-drivers-525 libnvidia-cfg1-525 libnvidia-common-525 libnvidia-compute-525 libnvidia-compute-525:i386
  libnvidia-decode-525 libnvidia-decode-525:i386 libnvidia-encode-525 libnvidia-encode-525:i386
  libnvidia-extra-525 libnvidia-fbc1-525 libnvidia-fbc1-525:i386 libnvidia-gl-525 libnvidia-gl-525:i386
  nvidia-compute-utils-525 nvidia-dkms-525 nvidia-driver-525 nvidia-kernel-common-525
  nvidia-kernel-source-525 nvidia-utils-525 xserver-xorg-video-nvidia-525
The following packages will be DOWNGRADED:
  cuda-drivers
0 upgraded, 21 newly installed, 1 downgraded, 20 to remove and 0 not upgraded.
Need to get 382 MB of archives.
After this operation, 25.4 MB disk space will be freed.
Do you want to continue? [Y/n]
Type y, and once the installation has finished, reboot, then run nvidia-smi and deviceQuery again:
(base) dl@dl-machine:~$ nvidia-smi
Sun Jun 18 11:31:26 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  Off |
| 33%   35C    P8    16W / 450W |      6MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:61:00.0 Off |                  Off |
| 33%   35C    P8    12W / 450W |      6MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:A1:00.0 Off |                  Off |
| 34%   34C    P8    12W / 450W |      6MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:C1:00.0 Off |                  Off |
| 34%   33C    P8    21W / 450W |      6MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1363      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1363      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      1363      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      1363      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
(base) dl@dl-machine:~$ deviceQuery
deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version 12.0 / 12.1
  CUDA Capability Major/Minor version number: 8.9
  Total amount of global memory: 24217 MBytes (25393692672 bytes)
  (128) Multiprocessors, (128) CUDA Cores/MP: 16384 CUDA Cores
  GPU Max Clock rate: 2520 MHz (2.52 GHz)
  Memory Clock rate: 10501 Mhz
  Memory Bus Width: 384-bit
  L2 Cache Size: 75497472 bytes
  Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total shared memory per multiprocessor: 102400 bytes
  Total number of registers available per block: 65536
  Warp size: 32
  Maximum number of threads per multiprocessor: 1536
  Maximum number of threads per block: 1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 2 copy engine(s)
  Run time limit on kernels: Yes
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device supports Managed Memory: Yes
  Device supports Compute Preemption: Yes
  Supports Cooperative Kernel Launch: Yes
  Supports MultiDevice Co-op Kernel Launch: Yes
  Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version 12.0 / 12.1
  CUDA Capability Major/Minor version number: 8.9
  Total amount of global memory: 24217 MBytes (25393692672 bytes)
  (128) Multiprocessors, (128) CUDA Cores/MP: 16384 CUDA Cores
  GPU Max Clock rate: 2520 MHz (2.52 GHz)
  Memory Clock rate: 10501 Mhz
  Memory Bus Width: 384-bit
  L2 Cache Size: 75497472 bytes
  Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total shared memory per multiprocessor: 102400 bytes
  Total number of registers available per block: 65536
  Warp size: 32
  Maximum number of threads per multiprocessor: 1536
  Maximum number of threads per block: 1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 2 copy engine(s)
  Run time limit on kernels: Yes
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device supports Managed Memory: Yes
  Device supports Compute Preemption: Yes
  Supports Cooperative Kernel Launch: Yes
  Supports MultiDevice Co-op Kernel Launch: Yes
  Device PCI Domain ID / Bus ID / location ID: 0 / 97 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 2: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version 12.0 / 12.1
  CUDA Capability Major/Minor version number: 8.9
  Total amount of global memory: 24217 MBytes (25393692672 bytes)
  (128) Multiprocessors, (128) CUDA Cores/MP: 16384 CUDA Cores
  GPU Max Clock rate: 2520 MHz (2.52 GHz)
  Memory Clock rate: 10501 Mhz
  Memory Bus Width: 384-bit
  L2 Cache Size: 75497472 bytes
  Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total shared memory per multiprocessor: 102400 bytes
  Total number of registers available per block: 65536
  Warp size: 32
  Maximum number of threads per multiprocessor: 1536
  Maximum number of threads per block: 1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 2 copy engine(s)
  Run time limit on kernels: Yes
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device supports Managed Memory: Yes
  Device supports Compute Preemption: Yes
  Supports Cooperative Kernel Launch: Yes
  Supports MultiDevice Co-op Kernel Launch: Yes
  Device PCI Domain ID / Bus ID / location ID: 0 / 161 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 3: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version 12.0 / 12.1
  CUDA Capability Major/Minor version number: 8.9
  Total amount of global memory: 24217 MBytes (25393692672 bytes)
  (128) Multiprocessors, (128) CUDA Cores/MP: 16384 CUDA Cores
  GPU Max Clock rate: 2520 MHz (2.52 GHz)
  Memory Clock rate: 10501 Mhz
  Memory Bus Width: 384-bit
  L2 Cache Size: 75497472 bytes
  Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total shared memory per multiprocessor: 102400 bytes
  Total number of registers available per block: 65536
  Warp size: 32
  Maximum number of threads per multiprocessor: 1536
  Maximum number of threads per block: 1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 2 copy engine(s)
  Run time limit on kernels: Yes
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device supports Managed Memory: Yes
  Device supports Compute Preemption: Yes
  Supports Cooperative Kernel Launch: Yes
  Supports MultiDevice Co-op Kernel Launch: Yes
  Device PCI Domain ID / Bus ID / location ID: 0 / 193 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU2) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.0, CUDA Runtime Version = 12.1, NumDevs = 4
Result = PASS
(base) dl@dl-machine:~$
With that, P2P is now reported as No. Deep-learning training using multiple RTX 4090s can now run correctly (an evaluation of parallel training is here). To keep cuda-drivers from being upgraded automatically, I also run
root@dl-machine:/home/dl# apt-mark hold cuda-drivers
cuda-drivers set on hold.
root@dl-machine:/home/dl#
to hold the package at this version.
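If you later want to confirm that the hold is still in place, or release it once a fixed driver becomes available, the usual apt-mark subcommands apply; for example:

$ sudo apt-mark showhold              # should list cuda-drivers
$ sudo apt-mark unhold cuda-drivers   # only when you deliberately want to allow upgrades again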