Introducing an RTX 4090 8-GPU Server
Compared with cards such as the RTX A6000, the GeForce RTX 4090 is physically much larger and, at 450 W, draws far more power, so building a server that can hold eight RTX 4090s is generally considered difficult. In this article we introduce a server we sell that takes eight RTX 4090s on a dual-CPU AMD EPYC Genoa platform.

The RTX 4090 cards we sell have the same physical dimensions as the RTX A6000, so eight of them fit in a 4U server. They use a rear-exhaust design: the cooling fans vent only out the back of the server, never into the chassis, so GPU heat does not raise the internal temperature. Eight 12VHPWR/600 W auxiliary power cables are included, so powering all eight GPUs is not a problem. To customize this server and request a quote, click here.
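As a rough sanity check of the power budget, a minimal sketch: the 450 W figure is the per-card cap quoted above, while the 360 W CPU TDP and the 500 W overhead allowance are assumptions of ours, not vendor numbers.

```python
GPU_W, N_GPU = 450, 8   # per-card power cap (from nvidia-smi below)
CPU_W, N_CPU = 360, 2   # assumed nominal TDP of an EPYC 9654
OTHER_W = 500           # rough allowance for RAM, NVMe, fans, PSU losses

total = GPU_W * N_GPU + CPU_W * N_CPU + OTHER_W
print(total)  # 4820
```

Even this back-of-the-envelope total shows why the chassis needs multi-kilowatt redundant power supplies.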
For this article we installed eight RTX 4090s in this server. nvidia-smi reports:
(base) dl@dl-machine:~$ nvidia-smi
Sat Sep 9 13:59:26 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 33% 33C P8 12W / 450W | 13MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 On | 00000000:21:00.0 Off | Off |
| 34% 34C P8 18W / 450W | 13MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce RTX 4090 On | 00000000:41:00.0 Off | Off |
| 33% 33C P8 19W / 450W | 13MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce RTX 4090 On | 00000000:61:00.0 Off | Off |
| 33% 32C P8 11W / 450W | 13MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA GeForce RTX 4090 On | 00000000:81:00.0 Off | Off |
| 34% 32C P8 12W / 450W | 13MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA GeForce RTX 4090 On | 00000000:A1:00.0 Off | Off |
| 34% 33C P8 21W / 450W | 13MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA GeForce RTX 4090 On | 00000000:C1:00.0 Off | Off |
| 34% 31C P8 19W / 450W | 13MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA GeForce RTX 4090 On | 00000000:E1:00.0 Off | Off |
| 34% 30C P8 24W / 450W | 13MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 4151 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 4151 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 4151 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 4151 G /usr/lib/xorg/Xorg 4MiB |
| 4 N/A N/A 4151 G /usr/lib/xorg/Xorg 4MiB |
| 5 N/A N/A 4151 G /usr/lib/xorg/Xorg 4MiB |
| 6 N/A N/A 4151 G /usr/lib/xorg/Xorg 4MiB |
| 7 N/A N/A 4151 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
(base) dl@dl-machine:~$
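For monitoring during a benchmark run, the full table above can be reduced to one CSV line per GPU; the query fields below are standard nvidia-smi query keys.

```shell
QUERY="index,name,temperature.gpu,power.draw,memory.used"
# Run only where the NVIDIA driver is installed.
if command -v nvidia-smi >/dev/null; then
  nvidia-smi --query-gpu="$QUERY" --format=csv,noheader
fi
```

Adding `-l 1` repeats the query every second, which is handy while jobs are running.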
The CPUs:
(base) dl@dl-machine:~/slurm$ lscpu | grep -i "model name"
Model name: AMD EPYC 9654 96-Core Processor
(base) dl@dl-machine:~/slurm$ nproc
192
(base) dl@dl-machine:~/slurm$
Two AMD EPYC 9654 96-core CPUs, for 192 cores in total.
The memory:
(base) dl@dl-machine:~/slurm$ sudo dmidecode -t memory | grep '\sVolatile Size'
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
Volatile Size: 64 GB
(base) dl@dl-machine:~/slurm$ sudo dmidecode -t memory | grep '\sVolatile Size'|wc
24 96 528
(base) dl@dl-machine:~/slurm$
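The `wc` count above confirms 24 matching lines; a small sketch that sums the dmidecode output directly (the helper name and the embedded sample are ours):

```python
def total_dimm_gb(dmidecode_output: str) -> int:
    """Sum the 'Volatile Size: N GB' lines from `dmidecode -t memory`."""
    total = 0
    for line in dmidecode_output.splitlines():
        line = line.strip()
        if line.startswith("Volatile Size:") and line.endswith("GB"):
            total += int(line.split(":")[1].split()[0])
    return total

# This server prints 24 lines of "Volatile Size: 64 GB":
sample = "\n".join(["\tVolatile Size: 64 GB"] * 24)
print(total_dimm_gb(sample))  # 1536
```

1536 GB is the 1.5 TB quoted below.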
Twenty-four 64 GB DIMMs, for 1.5 TB in total.
The SSD:
(base) dl@dl-machine:/etc/slurm-llnl$ sudo lshw -c disk
*-namespace
description: NVMe namespace
physical id: 1
logical name: /dev/nvme0n1
size: 7153GiB (7681GB)
capabilities: gpt-1.00 partitioned partitioned:gpt
configuration: guid=c97becbf-ddf5-4758-81b4-a8a15482b0ce logicalsectorsize=512 sectorsize=4096
(base) dl@dl-machine:~/slurm$
A single 7.68 TB NVMe SSD.
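lshw prints the capacity twice because it reports both binary (GiB) and decimal (GB) units; the two figures above describe the same drive:

```python
bytes_total = 7_681_000_000_000  # 7681 GB (decimal), as reported by lshw
print(bytes_total // 2**30)      # 7153 (GiB, binary)
```

So "7153GiB (7681GB)" is one 7.68 TB drive, not two sizes.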
Now let's run tf_cnn_benchmarks, a convenient way to gauge GPU performance, on this server. Clone the repository from this link first.
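For a single manual run outside the scheduler, the benchmark can be invoked directly; the flag values below are illustrative choices, not the exact ones used in our sweep.

```shell
# After cloning the benchmarks repository, run from scripts/tf_cnn_benchmarks.
BENCH=./tf_cnn_benchmarks.py
ARGS="--model=resnet50 --batch_size=64 --num_gpus=1 --use_fp16"
# Guarded so the snippet is harmless on a machine without the repo or TF1.
if [ -f "$BENCH" ]; then
  python "$BENCH" $ARGS
fi
```

Raising `--num_gpus` to 8 exercises all eight cards at once.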
It requires TensorFlow 1, so we pull NGC's TensorFlow 1 image with Singularity. We use Singularity rather than Docker because we submit jobs through Slurm: jobs that use Docker cannot be submitted to Slurm (or to most other job schedulers).
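The pull might look like the following sketch; the tag matches the .sif filename used in the job script later in this article, and the destination path is an assumption.

```shell
# NGC tag corresponding to tensorflow_23.02-tf1-py3.sif.
TAG="23.02-tf1-py3"
SIF="$HOME/singularity/tensorflow_${TAG}.sif"
# Pull only if singularity is installed and the image is not already present.
if command -v singularity >/dev/null && [ ! -f "$SIF" ]; then
  singularity pull "$SIF" "docker://nvcr.io/nvidia/tensorflow:${TAG}"
fi
```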
The NVIDIA driver is already installed, but parallel training on RTX 4090s requires a driver that reports GPU-to-GPU peer access as No, so we verify that with deviceQuery.
(base) dl@dl-machine:/etc/slurm-llnl$ deviceQuery
deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 8 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 4090"
CUDA Driver Version / Runtime Version 12.2 / 11.1
CUDA Capability Major/Minor version number: 8.9
Total amount of global memory: 24217 MBytes (25393692672 bytes)
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
(128) Multiprocessors, (128) CUDA Cores/MP: 16384 CUDA Cores
GPU Max Clock rate: 2520 MHz (2.52 GHz)
Memory Clock rate: 10501 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 75497472 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
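The "MapSMtoCores for SM 8.9 is undefined" warning only means this deviceQuery binary predates the Ada architecture; its fallback of 128 cores per SM happens to match Ada, so the reported total is still correct:

```python
sms = 128           # multiprocessors reported by deviceQuery
cores_per_sm = 128  # deviceQuery's fallback, which matches SM 8.9 (Ada)
print(sms * cores_per_sm)  # 16384
```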
(Devices 1 through 7 report the same specifications as Device 0, differing only in PCI Bus ID: 33, 65, 97, 129, 161, 193, 225. Their output is omitted here.)
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU4) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU5) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU6) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU7) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU4) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU5) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU6) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU7) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU4) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU5) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU6) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU7) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU4) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU5) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU6) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU7) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU5) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU6) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU7) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU4) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU6) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU7) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU4) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU5) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU7) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU7) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU7) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU7) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU7) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU7) -> NVIDIA GeForce RTX 4090 (GPU4) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU7) -> NVIDIA GeForce RTX 4090 (GPU5) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU7) -> NVIDIA GeForce RTX 4090 (GPU6) : No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 11.1, NumDevs = 8
Result = PASS
(base) dl@dl-machine:/etc/slurm-llnl$
With eight GPUs installed the output is lengthy, but every Peer access line at the end shows No on the right, so this driver is the correct one.
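The same check can be scripted; a minimal sketch using PyTorch's `torch.cuda.can_device_access_peer`, written (as an assumption of ours) to return None wherever PyTorch or CUDA is unavailable:

```python
def p2p_possible(a: int = 0, b: int = 1):
    """True/False if CUDA P2P between GPUs a and b can be queried, else None."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available() or torch.cuda.device_count() <= max(a, b):
        return None
    return torch.cuda.can_device_access_peer(a, b)

print(p2p_possible(0, 1))
```

On this server the expectation is False for every pair, matching the deviceQuery output above.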
The following confirms that the Slurm job scheduler is running:
(base) dl@dl-machine:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle dl-machine
(base) dl@dl-machine:~$ scontrol show nodes
NodeName=dl-machine Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=192 CPULoad=0.72
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:rtx4090:8(S:0-1)
   NodeAddr=dl-machine NodeHostName=dl-machine Version=19.05.5
   OS=Linux 5.15.0-83-generic #92~20.04.1-Ubuntu SMP Mon Aug 21 14:00:49 UTC 2023
   RealMemory=1547856 AllocMem=0 FreeMem=1534739 Sockets=192 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2023-09-07T15:27:15 SlurmdStartTime=2023-09-07T16:01:09
   CfgTRES=cpu=192,mem=1547856M,billing=192
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
(base) dl@dl-machine:~$
The Gres=gpu:rtx4090:8 line confirms that all eight GPUs are registered with the scheduler.
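A minimal job script for this node might look like the following sketch; the Gres name gpu:rtx4090 comes from the scontrol output above, while the job name and output pattern are placeholders of ours.

```shell
#!/bin/bash
#SBATCH -J gpu-smoke-test        # placeholder job name
#SBATCH --gres=gpu:rtx4090:1     # Gres name from `scontrol show nodes`
#SBATCH -o %x-%j.out             # one log file per job
nvidia-smi -L                    # list the GPU(s) Slurm granted this job
```

Submitting it with `sbatch` and checking the log is a quick end-to-end test of the GPU scheduling setup.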
The script that submits the exhaustive tf_cnn_benchmarks sweep to Slurm is:
(base) dl@dl-machine:~/slurm$ cat 4090x8.sh
#!/bin/bash
bench=/home/dl/tf_cnn_benchmarks
tf1=/home/dl/singularity/tensorflow_23.02-tf1-py3.sif
logdir=logdir-4090x8
errdir=errdir-4090x8
# $1 = GPU count, $2 = batch size, $3 = model, $4 = precision (fp16 or fp32)
sbatch_tf_cnn () {
if [ "$4" = "fp16" ]; then
acc="--use_fp16"
else
acc=""
fi
rm -rf ${logdir} ${errdir}
mkdir -p ${logdir} ${errdir}
sbatch -J "$3_gpu$1_bs$2_$4" <
Running this script queues the full matrix of models, GPU counts, batch sizes, and precisions:
JOBID PARTI NAME USER ST TIME NODES NODELIST(REASON)
6979 debug inception4_gpu1_bs128_fp16 dl PD 0:00 1 (Resources)
6980 debug resnet50_gpu1_bs256_fp16 dl PD 0:00 1 (Priority)
6981 debug inception3_gpu1_bs256_fp16 dl PD 0:00 1 (Priority)
6982 debug vgg16_gpu1_bs256_fp16 dl PD 0:00 1 (Priority)
6983 debug nasnet_gpu1_bs256_fp16 dl PD 0:00 1 (Priority)
6984 debug resnet152_gpu1_bs256_fp16 dl PD 0:00 1 (Priority)
6985 debug inception4_gpu1_bs256_fp16 dl PD 0:00 1 (Priority)
6986 debug resnet50_gpu1_bs512_fp16 dl PD 0:00 1 (Priority)
6987 debug inception3_gpu1_bs512_fp16 dl PD 0:00 1 (Priority)
6988 debug vgg16_gpu1_bs512_fp16 dl PD 0:00 1 (Priority)
6989 debug nasnet_gpu1_bs512_fp16 dl PD 0:00 1 (Priority)
6990 debug resnet152_gpu1_bs512_fp16 dl PD 0:00 1 (Priority)
6991 debug inception4_gpu1_bs512_fp16 dl PD 0:00 1 (Priority)
6992 debug resnet50_gpu2_bs64_fp16 dl PD 0:00 1 (Priority)
6993 debug inception3_gpu2_bs64_fp16 dl PD 0:00 1 (Priority)
6994 debug vgg16_gpu2_bs64_fp16 dl PD 0:00 1 (Priority)
6995 debug nasnet_gpu2_bs64_fp16 dl PD 0:00 1 (Priority)
6996 debug resnet152_gpu2_bs64_fp16 dl PD 0:00 1 (Priority)
6997 debug inception4_gpu2_bs64_fp16 dl PD 0:00 1 (Priority)
6998 debug resnet50_gpu2_bs128_fp16 dl PD 0:00 1 (Priority)
6999 debug inception3_gpu2_bs128_fp16 dl PD 0:00 1 (Priority)
7000 debug vgg16_gpu2_bs128_fp16 dl PD 0:00 1 (Priority)
7001 debug nasnet_gpu2_bs128_fp16 dl PD 0:00 1 (Priority)
7002 debug resnet152_gpu2_bs128_fp16 dl PD 0:00 1 (Priority)
7003 debug inception4_gpu2_bs128_fp16 dl PD 0:00 1 (Priority)
7004 debug resnet50_gpu2_bs256_fp16 dl PD 0:00 1 (Priority)
7005 debug inception3_gpu2_bs256_fp16 dl PD 0:00 1 (Priority)
7006 debug vgg16_gpu2_bs256_fp16 dl PD 0:00 1 (Priority)
7007 debug nasnet_gpu2_bs256_fp16 dl PD 0:00 1 (Priority)
7008 debug resnet152_gpu2_bs256_fp16 dl PD 0:00 1 (Priority)
7009 debug inception4_gpu2_bs256_fp16 dl PD 0:00 1 (Priority)
7010 debug resnet50_gpu2_bs512_fp16 dl PD 0:00 1 (Priority)
7011 debug inception3_gpu2_bs512_fp16 dl PD 0:00 1 (Priority)
7012 debug vgg16_gpu2_bs512_fp16 dl PD 0:00 1 (Priority)
7013 debug nasnet_gpu2_bs512_fp16 dl PD 0:00 1 (Priority)
7014 debug resnet152_gpu2_bs512_fp16 dl PD 0:00 1 (Priority)
7015 debug inception4_gpu2_bs512_fp16 dl PD 0:00 1 (Priority)
7016 debug resnet50_gpu4_bs64_fp16 dl PD 0:00 1 (Priority)
7017 debug inception3_gpu4_bs64_fp16 dl PD 0:00 1 (Priority)
7018 debug vgg16_gpu4_bs64_fp16 dl PD 0:00 1 (Priority)
7019 debug nasnet_gpu4_bs64_fp16 dl PD 0:00 1 (Priority)
7020 debug resnet152_gpu4_bs64_fp16 dl PD 0:00 1 (Priority)
7021 debug inception4_gpu4_bs64_fp16 dl PD 0:00 1 (Priority)
7022 debug resnet50_gpu4_bs128_fp16 dl PD 0:00 1 (Priority)
7023 debug inception3_gpu4_bs128_fp16 dl PD 0:00 1 (Priority)
7024 debug vgg16_gpu4_bs128_fp16 dl PD 0:00 1 (Priority)
7025 debug nasnet_gpu4_bs128_fp16 dl PD 0:00 1 (Priority)
7026 debug resnet152_gpu4_bs128_fp16 dl PD 0:00 1 (Priority)
7027 debug inception4_gpu4_bs128_fp16 dl PD 0:00 1 (Priority)
7028 debug resnet50_gpu4_bs256_fp16 dl PD 0:00 1 (Priority)
7029 debug inception3_gpu4_bs256_fp16 dl PD 0:00 1 (Priority)
7030 debug vgg16_gpu4_bs256_fp16 dl PD 0:00 1 (Priority)
7031 debug nasnet_gpu4_bs256_fp16 dl PD 0:00 1 (Priority)
7032 debug resnet152_gpu4_bs256_fp16 dl PD 0:00 1 (Priority)
7033 debug inception4_gpu4_bs256_fp16 dl PD 0:00 1 (Priority)
7034 debug resnet50_gpu4_bs512_fp16 dl PD 0:00 1 (Priority)
7035 debug inception3_gpu4_bs512_fp16 dl PD 0:00 1 (Priority)
7036 debug vgg16_gpu4_bs512_fp16 dl PD 0:00 1 (Priority)
7037 debug nasnet_gpu4_bs512_fp16 dl PD 0:00 1 (Priority)
7038 debug resnet152_gpu4_bs512_fp16 dl PD 0:00 1 (Priority)
7039 debug inception4_gpu4_bs512_fp16 dl PD 0:00 1 (Priority)
7040 debug resnet50_gpu8_bs64_fp16 dl PD 0:00 1 (Priority)
7041 debug inception3_gpu8_bs64_fp16 dl PD 0:00 1 (Priority)
7042 debug vgg16_gpu8_bs64_fp16 dl PD 0:00 1 (Priority)
7043 debug nasnet_gpu8_bs64_fp16 dl PD 0:00 1 (Priority)
7044 debug resnet152_gpu8_bs64_fp16 dl PD 0:00 1 (Priority)
7045 debug inception4_gpu8_bs64_fp16 dl PD 0:00 1 (Priority)
7046 debug resnet50_gpu8_bs128_fp16 dl PD 0:00 1 (Priority)
7047 debug inception3_gpu8_bs128_fp16 dl PD 0:00 1 (Priority)
7048 debug vgg16_gpu8_bs128_fp16 dl PD 0:00 1 (Priority)
7049 debug nasnet_gpu8_bs128_fp16 dl PD 0:00 1 (Priority)
7050 debug resnet152_gpu8_bs128_fp16 dl PD 0:00 1 (Priority)
7051 debug inception4_gpu8_bs128_fp16 dl PD 0:00 1 (Priority)
7052 debug resnet50_gpu8_bs256_fp16 dl PD 0:00 1 (Priority)
7053 debug inception3_gpu8_bs256_fp16 dl PD 0:00 1 (Priority)
7054 debug vgg16_gpu8_bs256_fp16 dl PD 0:00 1 (Priority)
7055 debug nasnet_gpu8_bs256_fp16 dl PD 0:00 1 (Priority)
7056 debug resnet152_gpu8_bs256_fp16 dl PD 0:00 1 (Priority)
7057 debug inception4_gpu8_bs256_fp16 dl PD 0:00 1 (Priority)
7058 debug resnet50_gpu8_bs512_fp16 dl PD 0:00 1 (Priority)
7059 debug inception3_gpu8_bs512_fp16 dl PD 0:00 1 (Priority)
7060 debug vgg16_gpu8_bs512_fp16 dl PD 0:00 1 (Priority)
7061 debug nasnet_gpu8_bs512_fp16 dl PD 0:00 1 (Priority)
7062 debug resnet152_gpu8_bs512_fp16 dl PD 0:00 1 (Priority)
7063 debug inception4_gpu8_bs512_fp16 dl PD 0:00 1 (Priority)
7064 debug resnet50_gpu1_bs64_fp32 dl PD 0:00 1 (Priority)
7065 debug inception3_gpu1_bs64_fp32 dl PD 0:00 1 (Priority)
7066 debug vgg16_gpu1_bs64_fp32 dl PD 0:00 1 (Priority)
7067 debug nasnet_gpu1_bs64_fp32 dl PD 0:00 1 (Priority)
7068 debug resnet152_gpu1_bs64_fp32 dl PD 0:00 1 (Priority)
7069 debug inception4_gpu1_bs64_fp32 dl PD 0:00 1 (Priority)
7070 debug resnet50_gpu1_bs128_fp32 dl PD 0:00 1 (Priority)
7071 debug inception3_gpu1_bs128_fp32 dl PD 0:00 1 (Priority)
7072 debug vgg16_gpu1_bs128_fp32 dl PD 0:00 1 (Priority)
7073 debug nasnet_gpu1_bs128_fp32 dl PD 0:00 1 (Priority)
7074 debug resnet152_gpu1_bs128_fp32 dl PD 0:00 1 (Priority)
7075 debug inception4_gpu1_bs128_fp32 dl PD 0:00 1 (Priority)
7076 debug resnet50_gpu1_bs256_fp32 dl PD 0:00 1 (Priority)
7077 debug inception3_gpu1_bs256_fp32 dl PD 0:00 1 (Priority)
7078 debug vgg16_gpu1_bs256_fp32 dl PD 0:00 1 (Priority)
7079 debug nasnet_gpu1_bs256_fp32 dl PD 0:00 1 (Priority)
7080 debug resnet152_gpu1_bs256_fp32 dl PD 0:00 1 (Priority)
7081 debug inception4_gpu1_bs256_fp32 dl PD 0:00 1 (Priority)
7082 debug resnet50_gpu1_bs512_fp32 dl PD 0:00 1 (Priority)
7083 debug inception3_gpu1_bs512_fp32 dl PD 0:00 1 (Priority)
7084 debug vgg16_gpu1_bs512_fp32 dl PD 0:00 1 (Priority)
7085 debug nasnet_gpu1_bs512_fp32 dl PD 0:00 1 (Priority)
7086 debug resnet152_gpu1_bs512_fp32 dl PD 0:00 1 (Priority)
7087 debug inception4_gpu1_bs512_fp32 dl PD 0:00 1 (Priority)
7088 debug resnet50_gpu2_bs64_fp32 dl PD 0:00 1 (Priority)
7089 debug inception3_gpu2_bs64_fp32 dl PD 0:00 1 (Priority)
7090 debug vgg16_gpu2_bs64_fp32 dl PD 0:00 1 (Priority)
7091 debug nasnet_gpu2_bs64_fp32 dl PD 0:00 1 (Priority)
7092 debug resnet152_gpu2_bs64_fp32 dl PD 0:00 1 (Priority)
7093 debug inception4_gpu2_bs64_fp32 dl PD 0:00 1 (Priority)
7094 debug resnet50_gpu2_bs128_fp32 dl PD 0:00 1 (Priority)
7095 debug inception3_gpu2_bs128_fp32 dl PD 0:00 1 (Priority)
7096 debug vgg16_gpu2_bs128_fp32 dl PD 0:00 1 (Priority)
7097 debug nasnet_gpu2_bs128_fp32 dl PD 0:00 1 (Priority)
7098 debug resnet152_gpu2_bs128_fp32 dl PD 0:00 1 (Priority)
7099 debug inception4_gpu2_bs128_fp32 dl PD 0:00 1 (Priority)
7100 debug resnet50_gpu2_bs256_fp32 dl PD 0:00 1 (Priority)
7101 debug inception3_gpu2_bs256_fp32 dl PD 0:00 1 (Priority)
7102 debug vgg16_gpu2_bs256_fp32 dl PD 0:00 1 (Priority)
7103 debug nasnet_gpu2_bs256_fp32 dl PD 0:00 1 (Priority)
7104 debug resnet152_gpu2_bs256_fp32 dl PD 0:00 1 (Priority)
7105 debug inception4_gpu2_bs256_fp32 dl PD 0:00 1 (Priority)
7106 debug resnet50_gpu2_bs512_fp32 dl PD 0:00 1 (Priority)
7107 debug inception3_gpu2_bs512_fp32 dl PD 0:00 1 (Priority)
7108 debug vgg16_gpu2_bs512_fp32 dl PD 0:00 1 (Priority)
7109 debug nasnet_gpu2_bs512_fp32 dl PD 0:00 1 (Priority)
7110 debug resnet152_gpu2_bs512_fp32 dl PD 0:00 1 (Priority)
7111 debug inception4_gpu2_bs512_fp32 dl PD 0:00 1 (Priority)
7112 debug resnet50_gpu4_bs64_fp32 dl PD 0:00 1 (Priority)
7113 debug inception3_gpu4_bs64_fp32 dl PD 0:00 1 (Priority)
7114 debug vgg16_gpu4_bs64_fp32 dl PD 0:00 1 (Priority)
7115 debug nasnet_gpu4_bs64_fp32 dl PD 0:00 1 (Priority)
7116 debug resnet152_gpu4_bs64_fp32 dl PD 0:00 1 (Priority)
7117 debug inception4_gpu4_bs64_fp32 dl PD 0:00 1 (Priority)
7118 debug resnet50_gpu4_bs128_fp32 dl PD 0:00 1 (Priority)
7119 debug inception3_gpu4_bs128_fp32 dl PD 0:00 1 (Priority)
7120 debug vgg16_gpu4_bs128_fp32 dl PD 0:00 1 (Priority)
7121 debug nasnet_gpu4_bs128_fp32 dl PD 0:00 1 (Priority)
7122 debug resnet152_gpu4_bs128_fp32 dl PD 0:00 1 (Priority)
7123 debug inception4_gpu4_bs128_fp32 dl PD 0:00 1 (Priority)
7124 debug resnet50_gpu4_bs256_fp32 dl PD 0:00 1 (Priority)
7125 debug inception3_gpu4_bs256_fp32 dl PD 0:00 1 (Priority)
7126 debug vgg16_gpu4_bs256_fp32 dl PD 0:00 1 (Priority)
7127 debug nasnet_gpu4_bs256_fp32 dl PD 0:00 1 (Priority)
7128 debug resnet152_gpu4_bs256_fp32 dl PD 0:00 1 (Priority)
7129 debug inception4_gpu4_bs256_fp32 dl PD 0:00 1 (Priority)
7130 debug resnet50_gpu4_bs512_fp32 dl PD 0:00 1 (Priority)
7131 debug inception3_gpu4_bs512_fp32 dl PD 0:00 1 (Priority)
7132 debug vgg16_gpu4_bs512_fp32 dl PD 0:00 1 (Priority)
7133 debug nasnet_gpu4_bs512_fp32 dl PD 0:00 1 (Priority)
7134 debug resnet152_gpu4_bs512_fp32 dl PD 0:00 1 (Priority)
7135 debug inception4_gpu4_bs512_fp32 dl PD 0:00 1 (Priority)
7136 debug resnet50_gpu8_bs64_fp32 dl PD 0:00 1 (Priority)
7137 debug inception3_gpu8_bs64_fp32 dl PD 0:00 1 (Priority)
7138 debug vgg16_gpu8_bs64_fp32 dl PD 0:00 1 (Priority)
7139 debug nasnet_gpu8_bs64_fp32 dl PD 0:00 1 (Priority)
7140 debug resnet152_gpu8_bs64_fp32 dl PD 0:00 1 (Priority)
7141 debug inception4_gpu8_bs64_fp32 dl PD 0:00 1 (Priority)
7142 debug resnet50_gpu8_bs128_fp32 dl PD 0:00 1 (Priority)
7143 debug inception3_gpu8_bs128_fp32 dl PD 0:00 1 (Priority)
7144 debug vgg16_gpu8_bs128_fp32 dl PD 0:00 1 (Priority)
7145 debug nasnet_gpu8_bs128_fp32 dl PD 0:00 1 (Priority)
7146 debug resnet152_gpu8_bs128_fp32 dl PD 0:00 1 (Priority)
7147 debug inception4_gpu8_bs128_fp32 dl PD 0:00 1 (Priority)
7148 debug resnet50_gpu8_bs256_fp32 dl PD 0:00 1 (Priority)
7149 debug inception3_gpu8_bs256_fp32 dl PD 0:00 1 (Priority)
7150 debug vgg16_gpu8_bs256_fp32 dl PD 0:00 1 (Priority)
7151 debug nasnet_gpu8_bs256_fp32 dl PD 0:00 1 (Priority)
7152 debug resnet152_gpu8_bs256_fp32 dl PD 0:00 1 (Priority)
7153 debug inception4_gpu8_bs256_fp32 dl PD 0:00 1 (Priority)
7154 debug resnet50_gpu8_bs512_fp32 dl PD 0:00 1 (Priority)
7155 debug inception3_gpu8_bs512_fp32 dl PD 0:00 1 (Priority)
7156 debug vgg16_gpu8_bs512_fp32 dl PD 0:00 1 (Priority)
7157 debug nasnet_gpu8_bs512_fp32 dl PD 0:00 1 (Priority)
7158 debug resnet152_gpu8_bs512_fp32 dl PD 0:00 1 (Priority)
7159 debug inception4_gpu8_bs512_fp32 dl PD 0:00 1 (Priority)
6968 debug resnet50_gpu1_bs64_fp16 dl R 0:12 1 dl-machine
6970 debug vgg16_gpu1_bs64_fp16 dl R 0:12 1 dl-machine
6971 debug nasnet_gpu1_bs64_fp16 dl R 0:12 1 dl-machine
6972 debug resnet152_gpu1_bs64_fp16 dl R 0:12 1 dl-machine
6975 debug inception3_gpu1_bs128_fp16 dl R 0:12 1 dl-machine
6976 debug vgg16_gpu1_bs128_fp16 dl R 0:12 1 dl-machine
6977 debug nasnet_gpu1_bs128_fp16 dl R 0:12 1 dl-machine
6978 debug resnet152_gpu1_bs128_fp16 dl R 0:12 1 dl-machine
(base) dl@dl-machine:~/slurm$
As shown above, the jobs begin executing: the first batch runs on the node while the rest wait in the queue with a `(Priority)` reason.
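The queue above covers every combination of model, GPU count, batch size, and precision. As a rough sketch (the actual submission script is not shown in this article, so the lists below are only reconstructed from the job names in the `squeue` output), the full set of job names can be generated like this:

```python
from itertools import product

# Combinations reconstructed from the squeue listing above
# (assumption: the real submission script is not shown here).
models = ["resnet50", "inception3", "vgg16", "nasnet", "resnet152", "inception4"]
gpus = [1, 2, 4, 8]
batch_sizes = [64, 128, 256, 512]
precisions = ["fp16", "fp32"]

job_names = [
    f"{m}_gpu{g}_bs{b}_{p}"
    for p, g, b, m in product(precisions, gpus, batch_sizes, models)
]

# 6 models x 4 GPU counts x 4 batch sizes x 2 precisions = 192 jobs
print(len(job_names))   # 192
print(job_names[0])     # resnet50_gpu1_bs64_fp16
```

This also explains why the job IDs in the listing run in long consecutive blocks: one job is submitted per combination.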
The results of each run are written to log files:
(base) dl@dl-machine:~/slurm$ ls -lt logdir-4090x8
total 136
-rw-rw-r-- 1 dl dl   11  9月  9 16:32 nasnet_gpux2_bs256_fp16.log
-rw-rw-r-- 1 dl dl 1171  9月  9 16:32 inception3_gpux2_bs256_fp16.log
-rw-rw-r-- 1 dl dl   11  9月  9 16:31 vgg16_gpux2_bs256_fp16.log
-rw-rw-r-- 1 dl dl 1159  9月  9 16:31 inception4_gpux2_bs128_fp16.log
-rw-rw-r-- 1 dl dl 1167  9月  9 16:31 resnet50_gpux2_bs256_fp16.log
-rw-rw-r-- 1 dl dl 1168  9月  9 16:30 resnet152_gpux2_bs128_fp16.log
-rw-rw-r-- 1 dl dl 1165  9月  9 16:30 nasnet_gpux2_bs128_fp16.log
-rw-rw-r-- 1 dl dl 1159  9月  9 16:29 inception3_gpux2_bs128_fp16.log
-rw-rw-r-- 1 dl dl 1134  9月  9 16:29 resnet152_gpux1_bs64_fp32.log
-rw-rw-r-- 1 dl dl   11  9月  9 16:28 vgg16_gpux2_bs128_fp16.log
-rw-rw-r-- 1 dl dl 1169  9月  9 16:28 resnet50_gpux2_bs128_fp16.log
-rw-rw-r-- 1 dl dl 1159  9月  9 16:28 inception4_gpux2_bs64_fp16.log
-rw-rw-r-- 1 dl dl 1138  9月  9 16:28 nasnet_gpux1_bs64_fp32.log
-rw-rw-r-- 1 dl dl 1158  9月  9 16:28 vgg16_gpux2_bs64_fp16.log
-rw-rw-r-- 1 dl dl 1130  9月  9 16:27 vgg16_gpux1_bs64_fp32.log
-rw-rw-r-- 1 dl dl 1165  9月  9 16:27 resnet152_gpux2_bs64_fp16.log
-rw-rw-r-- 1 dl dl 1165  9月  9 16:26 nasnet_gpux2_bs64_fp16.log
-rw-rw-r-- 1 dl dl 1135  9月  9 16:26 inception3_gpux1_bs64_fp32.log
-rw-rw-r-- 1 dl dl 1168  9月  9 16:26 inception3_gpux2_bs64_fp16.log
-rw-rw-r-- 1 dl dl 1133  9月  9 16:25 resnet50_gpux1_bs64_fp32.log
-rw-rw-r-- 1 dl dl 1169  9月  9 16:25 resnet50_gpux2_bs64_fp16.log
-rw-rw-r-- 1 dl dl 1147  9月  9 16:25 resnet50_gpux1_bs512_fp16.log
-rw-rw-r-- 1 dl dl  391  9月  9 16:25 nasnet_gpux1_bs512_fp16.log
-rw-rw-r-- 1 dl dl 1137  9月  9 16:25 inception4_gpux1_bs256_fp16.log
-rw-rw-r-- 1 dl dl  395  9月  9 16:25 inception4_gpux1_bs512_fp16.log
-rw-rw-r-- 1 dl dl  394  9月  9 16:25 resnet152_gpux1_bs512_fp16.log
-rw-rw-r-- 1 dl dl 1136  9月  9 16:24 resnet152_gpux1_bs256_fp16.log
-rw-rw-r-- 1 dl dl  395  9月  9 16:24 inception3_gpux1_bs512_fp16.log
-rw-rw-r-- 1 dl dl   11  9月  9 16:23 vgg16_gpux1_bs512_fp16.log
-rw-rw-r-- 1 dl dl 1149  9月  9 16:23 inception3_gpux1_bs256_fp16.log
-rw-rw-r-- 1 dl dl 1133  9月  9 16:23 nasnet_gpux1_bs256_fp16.log
-rw-rw-r-- 1 dl dl 1147  9月  9 16:23 resnet50_gpux1_bs256_fp16.log
-rw-rw-r-- 1 dl dl 1137  9月  9 16:23 inception4_gpux1_bs128_fp16.log
-rw-rw-r-- 1 dl dl   11  9月  9 16:22 vgg16_gpux1_bs256_fp16.log
(base) dl@dl-machine:~/slurm$
One log file is written per job, as shown above. The contents of a log look like this:
(base) dl@dl-machine:~/slurm$ cat logdir-4090x8/inception4_gpux1_bs128_fp16.log
dl-machine
TensorFlow: 1.15
Model: inception4
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 128 global
128 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0']
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 693.9 +/- 0.0 (jitter = 0.0) 7.725
10 images/sec: 694.9 +/- 1.3 (jitter = 1.3) 7.662
20 images/sec: 696.2 +/- 1.0 (jitter = 1.8) 7.695
30 images/sec: 696.2 +/- 0.8 (jitter = 1.8) 7.589
40 images/sec: 695.6 +/- 0.6 (jitter = 1.4) 7.618
50 images/sec: 695.4 +/- 0.5 (jitter = 1.0) 7.659
60 images/sec: 695.0 +/- 0.5 (jitter = 0.7) 7.555
70 images/sec: 694.9 +/- 0.4 (jitter = 0.7) 7.638
80 images/sec: 694.6 +/- 0.5 (jitter = 0.7) 7.656
90 images/sec: 694.4 +/- 0.5 (jitter = 0.8) 7.696
100 images/sec: 694.4 +/- 0.4 (jitter = 0.8) 7.634
----------------------------------------------------------------
total images/sec: 694.06
----------------------------------------------------------------
(base) dl@dl-machine:~/slurm$
Each log records the benchmark configuration (model, batch size, devices) followed by the per-step throughput, and ends with the final `total images/sec` figure for that run.
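Because every log ends with a `total images/sec:` line, the whole benchmark sweep can be collected into one summary with a small script. A minimal sketch, assuming the directory layout shown above (`logdir-4090x8` with one `*.log` file per job; the helper names here are my own, not from the article):

```python
import re
from pathlib import Path
from typing import Optional

def parse_total(log_text: str) -> Optional[float]:
    """Extract the final throughput from a tf_cnn_benchmarks log, if present."""
    m = re.search(r"total images/sec:\s*([\d.]+)", log_text)
    return float(m.group(1)) if m else None

def summarize(logdir: str) -> dict:
    """Map each job name (the filename stem) to its total images/sec."""
    results = {}
    for path in sorted(Path(logdir).glob("*.log")):
        total = parse_total(path.read_text())
        if total is not None:  # skip empty or failed runs (e.g. the 11-byte logs)
            results[path.stem] = total
    return results

# Example against the log excerpt shown above:
sample = "-------------\ntotal images/sec: 694.06\n-------------\n"
print(parse_total(sample))  # 694.06
```

Running `summarize("logdir-4090x8")` after all jobs finish would yield a job-name-to-throughput table ready for comparison across GPU counts and batch sizes.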