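Compared with cards such as the RTX A6000, the GeForce RTX 4090 is physically very large and draws as much as 450 W, so building a server that holds eight RTX 4090 GPUs is generally considered difficult. This article introduces a server we sell, with dual AMD EPYC Genoa CPUs, that can hold eight RTX 4090 GPUs.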

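The RTX 4090 we sell has the same physical size as cards such as the RTX A6000, which is why eight GPUs fit in a 4U server. The card uses an external-exhaust (blower) design: the cooling fan exhausts only out of the rear of the server and not into the chassis, so GPU heat does not raise the temperature inside the server. Eight 12VHPWR/600W auxiliary GPU power cables are included, so supplying power to up to eight GPUs is not a problem. To customize this server and request a quote, click here.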

For this article, the server is fitted with eight RTX 4090 GPUs:

(base) dl@dl-machine:~$ nvidia-smi
Sat Sep  9 13:59:26 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0 Off |                  Off |
| 33%   33C    P8              12W / 450W |     13MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  | 00000000:21:00.0 Off |                  Off |
| 34%   34C    P8              18W / 450W |     13MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        On  | 00000000:41:00.0 Off |                  Off |
| 33%   33C    P8              19W / 450W |     13MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090        On  | 00000000:61:00.0 Off |                  Off |
| 33%   32C    P8              11W / 450W |     13MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090        On  | 00000000:81:00.0 Off |                  Off |
| 34%   32C    P8              12W / 450W |     13MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce RTX 4090        On  | 00000000:A1:00.0 Off |                  Off |
| 34%   33C    P8              21W / 450W |     13MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce RTX 4090        On  | 00000000:C1:00.0 Off |                  Off |
| 34%   31C    P8              19W / 450W |     13MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce RTX 4090        On  | 00000000:E1:00.0 Off |                  Off |
| 34%   30C    P8              24W / 450W |     13MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      4151      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      4151      G   /usr/lib/xorg/Xorg                            4MiB |
|    2   N/A  N/A      4151      G   /usr/lib/xorg/Xorg                            4MiB |
|    3   N/A  N/A      4151      G   /usr/lib/xorg/Xorg                            4MiB |
|    4   N/A  N/A      4151      G   /usr/lib/xorg/Xorg                            4MiB |
|    5   N/A  N/A      4151      G   /usr/lib/xorg/Xorg                            4MiB |
|    6   N/A  N/A      4151      G   /usr/lib/xorg/Xorg                            4MiB |
|    7   N/A  N/A      4151      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
(base) dl@dl-machine:~$
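
The per-GPU figures in the table above can also be collected in machine-readable form. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH. It uses the CSV query mode (`--query-gpu` with `--format=csv,noheader,nounits`). The helper names and the `SAMPLE` rows are illustrative: the sample is built from the 450 W cap and 24564 MiB per GPU shown above, not captured from the machine.

```python
import subprocess


def parse_gpu_csv(text):
    """Parse 'index, name, power.limit, memory.total' CSV rows into dicts."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, power_w, mem_mib = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(index),
            "name": name,
            "power_limit_w": float(power_w),
            "memory_mib": int(mem_mib),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi in CSV query mode (needs the NVIDIA driver installed)."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,name,power.limit,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)


def summarize(gpus):
    """Return (GPU count, summed power cap in W, summed memory in MiB)."""
    total_power_w = sum(g["power_limit_w"] for g in gpus)
    total_mem_mib = sum(g["memory_mib"] for g in gpus)
    return len(gpus), total_power_w, total_mem_mib


# Illustrative rows assembled from the values in the table above.
SAMPLE = "\n".join(f"{i}, NVIDIA GeForce RTX 4090, 450.00, 24564" for i in range(8))
count, total_power_w, total_mem_mib = summarize(parse_gpu_csv(SAMPLE))
print(f"{count} GPUs, {total_power_w:.0f} W total power cap, {total_mem_mib} MiB total memory")
```

On the real machine, `summarize(query_gpus())` would replace the sample rows. With eight cards capped at 450 W, the GPUs alone account for a 3600 W power budget.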

The CPUs are:

(base) dl@dl-machine:~/slurm$ lscpu|grep -i "model name"
Model name:                         AMD EPYC 9654 96-Core Processor
(base) dl@dl-machine:~/slurm$ nproc
192
(base) dl@dl-machine:~/slurm$

Two AMD EPYC 9654 96-core CPUs give 192 cores in total.

The memory is:

(base) dl@dl-machine:~/slurm$ sudo dmidecode -t memory | grep '\sVolatile Size'
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
	Volatile Size: 64 GB
(base) dl@dl-machine:~/slurm$ sudo dmidecode -t memory | grep '\sVolatile Size'|wc
     24      96     528
(base) dl@dl-machine:~/slurm$ 

Twenty-four 64 GB DIMMs give 1.5 TB in total.
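
The DIMM count and total can be derived from the same `dmidecode` output instead of counting lines by hand. This is a minimal sketch: the function name is hypothetical, and `SAMPLE` simply repeats the line format shown above 24 times rather than being read from the machine.

```python
import re


def total_volatile_gb(dmidecode_text):
    """Return (number of DIMMs, total GB) from 'Volatile Size: N GB' lines.

    The pattern is anchored at the start of the line so that
    'Non-Volatile Size' entries are not counted.
    """
    sizes = [int(s) for s in re.findall(
        r"^\s*Volatile Size:\s*(\d+)\s*GB", dmidecode_text, flags=re.MULTILINE)]
    return len(sizes), sum(sizes)


# Same line format as the dmidecode output above, repeated for 24 DIMMs.
SAMPLE = "\tVolatile Size: 64 GB\n" * 24
dimms, total_gb = total_volatile_gb(SAMPLE)
print(f"{dimms} DIMMs, {total_gb} GB = {total_gb / 1024} TB")
```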

The SSD is:

(base) dl@dl-machine:/etc/slurm-llnl$ sudo lshw -c disk
  *-namespace               
       description: NVMe namespace
       physical id: 1
       logical name: /dev/nvme0n1
       size: 7153GiB (7681GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: guid=c97becbf-ddf5-4758-81b4-a8a15482b0ce logicalsectorsize=512 sectorsize=4096
(base) dl@dl-machine:~/slurm$ 
 

There is a single 7.68 TB NVMe SSD.
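
The two sizes reported by `lshw` (7153GiB and 7681GB) are the same capacity in binary and decimal units. A short worked conversion, with a hypothetical helper name:

```python
def gb_to_gib(gb):
    """Convert decimal gigabytes (10**9 bytes) to binary gibibytes (2**30 bytes)."""
    return gb * 10**9 / 2**30


# 7681 GB is the decimal size reported by lshw for /dev/nvme0n1.
print(int(gb_to_gib(7681)), "GiB")
```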

Now let's run tf_cnn_benchmarks on this server, which is an easy way to evaluate GPU performance. First, clone the repository from this link.

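TensorFlow 1 is required, so we pull the NGC TensorFlow 1 image with Singularity beforehand. We use Singularity rather than Docker because the jobs are submitted through Slurm: Docker-based jobs are generally not practical to submit to Slurm (or to most other job schedulers).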

The NVIDIA driver is already installed, but for multi-GPU parallel training on the RTX 4090 the driver must report peer-to-peer access between the RTX 4090 GPUs as No, so we check that first.

 
(base) dl@dl-machine:/etc/slurm-llnl$ deviceQuery 
deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 8 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version          12.2 / 11.1
  CUDA Capability Major/Minor version number:    8.9
  Total amount of global memory:                 24217 MBytes (25393692672 bytes)
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
  (128) Multiprocessors, (128) CUDA Cores/MP:     16384 CUDA Cores
  GPU Max Clock rate:                            2520 MHz (2.52 GHz)
  Memory Clock rate:                             10501 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 75497472 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version          12.2 / 11.1
  CUDA Capability Major/Minor version number:    8.9
  Total amount of global memory:                 24217 MBytes (25393692672 bytes)
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
  (128) Multiprocessors, (128) CUDA Cores/MP:     16384 CUDA Cores
  GPU Max Clock rate:                            2520 MHz (2.52 GHz)
  Memory Clock rate:                             10501 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 75497472 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 33 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 2: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version          12.2 / 11.1
  CUDA Capability Major/Minor version number:    8.9
  Total amount of global memory:                 24217 MBytes (25393692672 bytes)
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
  (128) Multiprocessors, (128) CUDA Cores/MP:     16384 CUDA Cores
  GPU Max Clock rate:                            2520 MHz (2.52 GHz)
  Memory Clock rate:                             10501 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 75497472 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 65 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 3: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version          12.2 / 11.1
  CUDA Capability Major/Minor version number:    8.9
  Total amount of global memory:                 24217 MBytes (25393692672 bytes)
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
  (128) Multiprocessors, (128) CUDA Cores/MP:     16384 CUDA Cores
  GPU Max Clock rate:                            2520 MHz (2.52 GHz)
  Memory Clock rate:                             10501 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 75497472 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 97 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 4: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version          12.2 / 11.1
  CUDA Capability Major/Minor version number:    8.9
  Total amount of global memory:                 24217 MBytes (25393692672 bytes)
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
  (128) Multiprocessors, (128) CUDA Cores/MP:     16384 CUDA Cores
  GPU Max Clock rate:                            2520 MHz (2.52 GHz)
  Memory Clock rate:                             10501 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 75497472 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 129 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 5: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version          12.2 / 11.1
  CUDA Capability Major/Minor version number:    8.9
  Total amount of global memory:                 24217 MBytes (25393692672 bytes)
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
  (128) Multiprocessors, (128) CUDA Cores/MP:     16384 CUDA Cores
  GPU Max Clock rate:                            2520 MHz (2.52 GHz)
  Memory Clock rate:                             10501 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 75497472 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 161 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 6: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version          12.2 / 11.1
  CUDA Capability Major/Minor version number:    8.9
  Total amount of global memory:                 24217 MBytes (25393692672 bytes)
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
  (128) Multiprocessors, (128) CUDA Cores/MP:     16384 CUDA Cores
  GPU Max Clock rate:                            2520 MHz (2.52 GHz)
  Memory Clock rate:                             10501 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 75497472 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 193 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 7: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version          12.2 / 11.1
  CUDA Capability Major/Minor version number:    8.9
  Total amount of global memory:                 24217 MBytes (25393692672 bytes)
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
  (128) Multiprocessors, (128) CUDA Cores/MP:     16384 CUDA Cores
  GPU Max Clock rate:                            2520 MHz (2.52 GHz)
  Memory Clock rate:                             10501 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 75497472 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 225 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU4) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU5) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU6) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU7) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU4) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU5) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU6) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU7) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU4) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU5) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU6) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU7) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU4) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU5) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU6) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU7) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU5) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU6) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU7) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU4) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU6) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU7) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU4) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU5) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU7) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU7) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU7) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU7) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU7) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU7) -> NVIDIA GeForce RTX 4090 (GPU4) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU7) -> NVIDIA GeForce RTX 4090 (GPU5) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU7) -> NVIDIA GeForce RTX 4090 (GPU6) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 11.1, NumDevs = 8
Result = PASS
(base) dl@dl-machine:/etc/slurm-llnl$ 

The output is long because eight GPUs are installed, but every Peer access line at the end shows No in the rightmost column, so this driver is fine.
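
With eight GPUs there are 8 × 7 = 56 Peer access lines to read, so the check is easy to automate. This is a minimal sketch that parses saved `deviceQuery` output. The function names are hypothetical, and `SAMPLE` is generated in the same line format as the output above rather than captured from the machine.

```python
import re

# Matches lines such as:
# > Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : No
PEER_RE = re.compile(r"Peer access from .*\(GPU(\d+)\) -> .*\(GPU(\d+)\) : (Yes|No)")


def peer_access_matrix(text):
    """Map (source GPU, destination GPU) to True when peer access is reported as Yes."""
    return {(int(src), int(dst)): answer == "Yes"
            for src, dst, answer in PEER_RE.findall(text)}


def p2p_disabled_everywhere(text, num_gpus):
    """True when every ordered GPU pair is listed and all of them report No."""
    matrix = peer_access_matrix(text)
    expected_pairs = num_gpus * (num_gpus - 1)
    return len(matrix) == expected_pairs and not any(matrix.values())


# Generated sample in the deviceQuery line format for 8 GPUs.
NAME = "NVIDIA GeForce RTX 4090"
SAMPLE = "\n".join(
    f"> Peer access from {NAME} (GPU{a}) -> {NAME} (GPU{b}) : No"
    for a in range(8) for b in range(8) if a != b
)
print(p2p_disabled_everywhere(SAMPLE, 8))
```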

Whether the Slurm job scheduler is running can be checked with sinfo and scontrol show nodes:

(base) dl@dl-machine:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle dl-machine
(base) dl@dl-machine:~$ scontrol show nodes
NodeName=dl-machine Arch=x86_64 CoresPerSocket=1 
   CPUAlloc=0 CPUTot=192 CPULoad=0.72
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:rtx4090:8(S:0-1)
   NodeAddr=dl-machine NodeHostName=dl-machine Version=19.05.5
   OS=Linux 5.15.0-83-generic #92~20.04.1-Ubuntu SMP Mon Aug 21 14:00:49 UTC 2023 
   RealMemory=1547856 AllocMem=0 FreeMem=1534739 Sockets=192 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug 
   BootTime=2023-09-07T15:27:15 SlurmdStartTime=2023-09-07T16:01:09
   CfgTRES=cpu=192,mem=1547856M,billing=192
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
(base) dl@dl-machine:~$ 

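The node is reported as idle with all eight GPUs registered (Gres=gpu:rtx4090:8), so the scheduler is working.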
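The same check can be done from a script. This is a minimal sketch, assuming sinfo prints its columns in the layout shown above (PARTITION, AVAIL, TIMELIMIT, NODES, STATE, NODELIST); the function name idle_nodes is hypothetical.

```shell
#!/bin/bash
# Reads sinfo output on stdin and prints the NODELIST entry of every
# row whose STATE column is exactly "idle". The header row is skipped.
idle_nodes() {
    awk 'NR > 1 && $5 == "idle" { print $6 }'
}

# Usage:
#   sinfo | idle_nodes
```

On the machine above this would print dl-machine. Empty output means that no partition currently has an idle node.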

The script that submits an exhaustive set of tf_cnn_benchmarks jobs to Slurm is:

(base) dl@dl-machine:~/slurm$ cat 4090x8.sh 
#!/bin/bash
bench=/home/dl/tf_cnn_benchmarks
tf1=/home/dl/singularity/tensorflow_23.02-tf1-py3.sif
logdir=logdir-4090x8
errdir=errdir-4090x8
sbatch_tf_cnn () {
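	# $4 selects the precision: "fp16" adds --use_fp16, anything else runs in FP32.
	if [ "$4" = "fp16" ]; then
		acc="--use_fp16"
	else
		acc=""
	fi
	# NOTE: this cleanup runs on every call, so it appears to also remove the logs
	# of jobs that have already started; clearing the directories once before the
	# first submission would avoid that.
	rm -rf ${logdir} ${errdir}
	mkdir -p ${logdir} ${errdir}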
sbatch  -J "$3_gpu$1_bs$2_$4" <

Each call to sbatch_tf_cnn submits one job named after its model, GPU count, batch size, and precision.

Running this script submits a job for every combination:

   JOBID PARTI NAME                           USER  ST       TIME  NODES NODELIST(REASON)
    6979 debug inception4_gpu1_bs128_fp16     dl    PD       0:00      1 (Resources)
    6980 debug resnet50_gpu1_bs256_fp16       dl    PD       0:00      1 (Priority)
    6981 debug inception3_gpu1_bs256_fp16     dl    PD       0:00      1 (Priority)
    6982 debug vgg16_gpu1_bs256_fp16          dl    PD       0:00      1 (Priority)
    6983 debug nasnet_gpu1_bs256_fp16         dl    PD       0:00      1 (Priority)
    6984 debug resnet152_gpu1_bs256_fp16      dl    PD       0:00      1 (Priority)
    6985 debug inception4_gpu1_bs256_fp16     dl    PD       0:00      1 (Priority)
    6986 debug resnet50_gpu1_bs512_fp16       dl    PD       0:00      1 (Priority)
    6987 debug inception3_gpu1_bs512_fp16     dl    PD       0:00      1 (Priority)
    6988 debug vgg16_gpu1_bs512_fp16          dl    PD       0:00      1 (Priority)
    6989 debug nasnet_gpu1_bs512_fp16         dl    PD       0:00      1 (Priority)
    6990 debug resnet152_gpu1_bs512_fp16      dl    PD       0:00      1 (Priority)
    6991 debug inception4_gpu1_bs512_fp16     dl    PD       0:00      1 (Priority)
    6992 debug resnet50_gpu2_bs64_fp16        dl    PD       0:00      1 (Priority)
    6993 debug inception3_gpu2_bs64_fp16      dl    PD       0:00      1 (Priority)
    6994 debug vgg16_gpu2_bs64_fp16           dl    PD       0:00      1 (Priority)
    6995 debug nasnet_gpu2_bs64_fp16          dl    PD       0:00      1 (Priority)
    6996 debug resnet152_gpu2_bs64_fp16       dl    PD       0:00      1 (Priority)
    6997 debug inception4_gpu2_bs64_fp16      dl    PD       0:00      1 (Priority)
    6998 debug resnet50_gpu2_bs128_fp16       dl    PD       0:00      1 (Priority)
    6999 debug inception3_gpu2_bs128_fp16     dl    PD       0:00      1 (Priority)
    7000 debug vgg16_gpu2_bs128_fp16          dl    PD       0:00      1 (Priority)
    7001 debug nasnet_gpu2_bs128_fp16         dl    PD       0:00      1 (Priority)
    7002 debug resnet152_gpu2_bs128_fp16      dl    PD       0:00      1 (Priority)
    7003 debug inception4_gpu2_bs128_fp16     dl    PD       0:00      1 (Priority)
    7004 debug resnet50_gpu2_bs256_fp16       dl    PD       0:00      1 (Priority)
    7005 debug inception3_gpu2_bs256_fp16     dl    PD       0:00      1 (Priority)
    7006 debug vgg16_gpu2_bs256_fp16          dl    PD       0:00      1 (Priority)
    7007 debug nasnet_gpu2_bs256_fp16         dl    PD       0:00      1 (Priority)
    7008 debug resnet152_gpu2_bs256_fp16      dl    PD       0:00      1 (Priority)
    7009 debug inception4_gpu2_bs256_fp16     dl    PD       0:00      1 (Priority)
    7010 debug resnet50_gpu2_bs512_fp16       dl    PD       0:00      1 (Priority)
    7011 debug inception3_gpu2_bs512_fp16     dl    PD       0:00      1 (Priority)
    7012 debug vgg16_gpu2_bs512_fp16          dl    PD       0:00      1 (Priority)
    7013 debug nasnet_gpu2_bs512_fp16         dl    PD       0:00      1 (Priority)
    7014 debug resnet152_gpu2_bs512_fp16      dl    PD       0:00      1 (Priority)
    7015 debug inception4_gpu2_bs512_fp16     dl    PD       0:00      1 (Priority)
    7016 debug resnet50_gpu4_bs64_fp16        dl    PD       0:00      1 (Priority)
    7017 debug inception3_gpu4_bs64_fp16      dl    PD       0:00      1 (Priority)
    7018 debug vgg16_gpu4_bs64_fp16           dl    PD       0:00      1 (Priority)
    7019 debug nasnet_gpu4_bs64_fp16          dl    PD       0:00      1 (Priority)
    7020 debug resnet152_gpu4_bs64_fp16       dl    PD       0:00      1 (Priority)
    7021 debug inception4_gpu4_bs64_fp16      dl    PD       0:00      1 (Priority)
    7022 debug resnet50_gpu4_bs128_fp16       dl    PD       0:00      1 (Priority)
    7023 debug inception3_gpu4_bs128_fp16     dl    PD       0:00      1 (Priority)
    7024 debug vgg16_gpu4_bs128_fp16          dl    PD       0:00      1 (Priority)
    7025 debug nasnet_gpu4_bs128_fp16         dl    PD       0:00      1 (Priority)
    7026 debug resnet152_gpu4_bs128_fp16      dl    PD       0:00      1 (Priority)
    7027 debug inception4_gpu4_bs128_fp16     dl    PD       0:00      1 (Priority)
    7028 debug resnet50_gpu4_bs256_fp16       dl    PD       0:00      1 (Priority)
    7029 debug inception3_gpu4_bs256_fp16     dl    PD       0:00      1 (Priority)
    7030 debug vgg16_gpu4_bs256_fp16          dl    PD       0:00      1 (Priority)
    7031 debug nasnet_gpu4_bs256_fp16         dl    PD       0:00      1 (Priority)
    7032 debug resnet152_gpu4_bs256_fp16      dl    PD       0:00      1 (Priority)
    7033 debug inception4_gpu4_bs256_fp16     dl    PD       0:00      1 (Priority)
    7034 debug resnet50_gpu4_bs512_fp16       dl    PD       0:00      1 (Priority)
    7035 debug inception3_gpu4_bs512_fp16     dl    PD       0:00      1 (Priority)
    7036 debug vgg16_gpu4_bs512_fp16          dl    PD       0:00      1 (Priority)
    7037 debug nasnet_gpu4_bs512_fp16         dl    PD       0:00      1 (Priority)
    7038 debug resnet152_gpu4_bs512_fp16      dl    PD       0:00      1 (Priority)
    7039 debug inception4_gpu4_bs512_fp16     dl    PD       0:00      1 (Priority)
    7040 debug resnet50_gpu8_bs64_fp16        dl    PD       0:00      1 (Priority)
    7041 debug inception3_gpu8_bs64_fp16      dl    PD       0:00      1 (Priority)
    7042 debug vgg16_gpu8_bs64_fp16           dl    PD       0:00      1 (Priority)
    7043 debug nasnet_gpu8_bs64_fp16          dl    PD       0:00      1 (Priority)
    7044 debug resnet152_gpu8_bs64_fp16       dl    PD       0:00      1 (Priority)
    7045 debug inception4_gpu8_bs64_fp16      dl    PD       0:00      1 (Priority)
    7046 debug resnet50_gpu8_bs128_fp16       dl    PD       0:00      1 (Priority)
    7047 debug inception3_gpu8_bs128_fp16     dl    PD       0:00      1 (Priority)
    7048 debug vgg16_gpu8_bs128_fp16          dl    PD       0:00      1 (Priority)
    7049 debug nasnet_gpu8_bs128_fp16         dl    PD       0:00      1 (Priority)
    7050 debug resnet152_gpu8_bs128_fp16      dl    PD       0:00      1 (Priority)
    7051 debug inception4_gpu8_bs128_fp16     dl    PD       0:00      1 (Priority)
    7052 debug resnet50_gpu8_bs256_fp16       dl    PD       0:00      1 (Priority)
    7053 debug inception3_gpu8_bs256_fp16     dl    PD       0:00      1 (Priority)
    7054 debug vgg16_gpu8_bs256_fp16          dl    PD       0:00      1 (Priority)
    7055 debug nasnet_gpu8_bs256_fp16         dl    PD       0:00      1 (Priority)
    7056 debug resnet152_gpu8_bs256_fp16      dl    PD       0:00      1 (Priority)
    7057 debug inception4_gpu8_bs256_fp16     dl    PD       0:00      1 (Priority)
    7058 debug resnet50_gpu8_bs512_fp16       dl    PD       0:00      1 (Priority)
    7059 debug inception3_gpu8_bs512_fp16     dl    PD       0:00      1 (Priority)
    7060 debug vgg16_gpu8_bs512_fp16          dl    PD       0:00      1 (Priority)
    7061 debug nasnet_gpu8_bs512_fp16         dl    PD       0:00      1 (Priority)
    7062 debug resnet152_gpu8_bs512_fp16      dl    PD       0:00      1 (Priority)
    7063 debug inception4_gpu8_bs512_fp16     dl    PD       0:00      1 (Priority)
    7064 debug resnet50_gpu1_bs64_fp32        dl    PD       0:00      1 (Priority)
    7065 debug inception3_gpu1_bs64_fp32      dl    PD       0:00      1 (Priority)
    7066 debug vgg16_gpu1_bs64_fp32           dl    PD       0:00      1 (Priority)
    7067 debug nasnet_gpu1_bs64_fp32          dl    PD       0:00      1 (Priority)
    7068 debug resnet152_gpu1_bs64_fp32       dl    PD       0:00      1 (Priority)
    7069 debug inception4_gpu1_bs64_fp32      dl    PD       0:00      1 (Priority)
    7070 debug resnet50_gpu1_bs128_fp32       dl    PD       0:00      1 (Priority)
    7071 debug inception3_gpu1_bs128_fp32     dl    PD       0:00      1 (Priority)
    7072 debug vgg16_gpu1_bs128_fp32          dl    PD       0:00      1 (Priority)
    7073 debug nasnet_gpu1_bs128_fp32         dl    PD       0:00      1 (Priority)
    7074 debug resnet152_gpu1_bs128_fp32      dl    PD       0:00      1 (Priority)
    7075 debug inception4_gpu1_bs128_fp32     dl    PD       0:00      1 (Priority)
    7076 debug resnet50_gpu1_bs256_fp32       dl    PD       0:00      1 (Priority)
    7077 debug inception3_gpu1_bs256_fp32     dl    PD       0:00      1 (Priority)
    7078 debug vgg16_gpu1_bs256_fp32          dl    PD       0:00      1 (Priority)
    7079 debug nasnet_gpu1_bs256_fp32         dl    PD       0:00      1 (Priority)
    7080 debug resnet152_gpu1_bs256_fp32      dl    PD       0:00      1 (Priority)
    7081 debug inception4_gpu1_bs256_fp32     dl    PD       0:00      1 (Priority)
    7082 debug resnet50_gpu1_bs512_fp32       dl    PD       0:00      1 (Priority)
    7083 debug inception3_gpu1_bs512_fp32     dl    PD       0:00      1 (Priority)
    7084 debug vgg16_gpu1_bs512_fp32          dl    PD       0:00      1 (Priority)
    7085 debug nasnet_gpu1_bs512_fp32         dl    PD       0:00      1 (Priority)
    7086 debug resnet152_gpu1_bs512_fp32      dl    PD       0:00      1 (Priority)
    7087 debug inception4_gpu1_bs512_fp32     dl    PD       0:00      1 (Priority)
    7088 debug resnet50_gpu2_bs64_fp32        dl    PD       0:00      1 (Priority)
    7089 debug inception3_gpu2_bs64_fp32      dl    PD       0:00      1 (Priority)
    7090 debug vgg16_gpu2_bs64_fp32           dl    PD       0:00      1 (Priority)
    7091 debug nasnet_gpu2_bs64_fp32          dl    PD       0:00      1 (Priority)
    7092 debug resnet152_gpu2_bs64_fp32       dl    PD       0:00      1 (Priority)
    7093 debug inception4_gpu2_bs64_fp32      dl    PD       0:00      1 (Priority)
    7094 debug resnet50_gpu2_bs128_fp32       dl    PD       0:00      1 (Priority)
    7095 debug inception3_gpu2_bs128_fp32     dl    PD       0:00      1 (Priority)
    7096 debug vgg16_gpu2_bs128_fp32          dl    PD       0:00      1 (Priority)
    7097 debug nasnet_gpu2_bs128_fp32         dl    PD       0:00      1 (Priority)
    7098 debug resnet152_gpu2_bs128_fp32      dl    PD       0:00      1 (Priority)
    7099 debug inception4_gpu2_bs128_fp32     dl    PD       0:00      1 (Priority)
    7100 debug resnet50_gpu2_bs256_fp32       dl    PD       0:00      1 (Priority)
    7101 debug inception3_gpu2_bs256_fp32     dl    PD       0:00      1 (Priority)
    7102 debug vgg16_gpu2_bs256_fp32          dl    PD       0:00      1 (Priority)
    7103 debug nasnet_gpu2_bs256_fp32         dl    PD       0:00      1 (Priority)
    7104 debug resnet152_gpu2_bs256_fp32      dl    PD       0:00      1 (Priority)
    7105 debug inception4_gpu2_bs256_fp32     dl    PD       0:00      1 (Priority)
    7106 debug resnet50_gpu2_bs512_fp32       dl    PD       0:00      1 (Priority)
    7107 debug inception3_gpu2_bs512_fp32     dl    PD       0:00      1 (Priority)
    7108 debug vgg16_gpu2_bs512_fp32          dl    PD       0:00      1 (Priority)
    7109 debug nasnet_gpu2_bs512_fp32         dl    PD       0:00      1 (Priority)
    7110 debug resnet152_gpu2_bs512_fp32      dl    PD       0:00      1 (Priority)
    7111 debug inception4_gpu2_bs512_fp32     dl    PD       0:00      1 (Priority)
    7112 debug resnet50_gpu4_bs64_fp32        dl    PD       0:00      1 (Priority)
    7113 debug inception3_gpu4_bs64_fp32      dl    PD       0:00      1 (Priority)
    7114 debug vgg16_gpu4_bs64_fp32           dl    PD       0:00      1 (Priority)
    7115 debug nasnet_gpu4_bs64_fp32          dl    PD       0:00      1 (Priority)
    7116 debug resnet152_gpu4_bs64_fp32       dl    PD       0:00      1 (Priority)
    7117 debug inception4_gpu4_bs64_fp32      dl    PD       0:00      1 (Priority)
    7118 debug resnet50_gpu4_bs128_fp32       dl    PD       0:00      1 (Priority)
    7119 debug inception3_gpu4_bs128_fp32     dl    PD       0:00      1 (Priority)
    7120 debug vgg16_gpu4_bs128_fp32          dl    PD       0:00      1 (Priority)
    7121 debug nasnet_gpu4_bs128_fp32         dl    PD       0:00      1 (Priority)
    7122 debug resnet152_gpu4_bs128_fp32      dl    PD       0:00      1 (Priority)
    7123 debug inception4_gpu4_bs128_fp32     dl    PD       0:00      1 (Priority)
    7124 debug resnet50_gpu4_bs256_fp32       dl    PD       0:00      1 (Priority)
    7125 debug inception3_gpu4_bs256_fp32     dl    PD       0:00      1 (Priority)
    7126 debug vgg16_gpu4_bs256_fp32          dl    PD       0:00      1 (Priority)
    7127 debug nasnet_gpu4_bs256_fp32         dl    PD       0:00      1 (Priority)
    7128 debug resnet152_gpu4_bs256_fp32      dl    PD       0:00      1 (Priority)
    7129 debug inception4_gpu4_bs256_fp32     dl    PD       0:00      1 (Priority)
    7130 debug resnet50_gpu4_bs512_fp32       dl    PD       0:00      1 (Priority)
    7131 debug inception3_gpu4_bs512_fp32     dl    PD       0:00      1 (Priority)
    7132 debug vgg16_gpu4_bs512_fp32          dl    PD       0:00      1 (Priority)
    7133 debug nasnet_gpu4_bs512_fp32         dl    PD       0:00      1 (Priority)
    7134 debug resnet152_gpu4_bs512_fp32      dl    PD       0:00      1 (Priority)
    7135 debug inception4_gpu4_bs512_fp32     dl    PD       0:00      1 (Priority)
    7136 debug resnet50_gpu8_bs64_fp32        dl    PD       0:00      1 (Priority)
    7137 debug inception3_gpu8_bs64_fp32      dl    PD       0:00      1 (Priority)
    7138 debug vgg16_gpu8_bs64_fp32           dl    PD       0:00      1 (Priority)
    7139 debug nasnet_gpu8_bs64_fp32          dl    PD       0:00      1 (Priority)
    7140 debug resnet152_gpu8_bs64_fp32       dl    PD       0:00      1 (Priority)
    7141 debug inception4_gpu8_bs64_fp32      dl    PD       0:00      1 (Priority)
    7142 debug resnet50_gpu8_bs128_fp32       dl    PD       0:00      1 (Priority)
    7143 debug inception3_gpu8_bs128_fp32     dl    PD       0:00      1 (Priority)
    7144 debug vgg16_gpu8_bs128_fp32          dl    PD       0:00      1 (Priority)
    7145 debug nasnet_gpu8_bs128_fp32         dl    PD       0:00      1 (Priority)
    7146 debug resnet152_gpu8_bs128_fp32      dl    PD       0:00      1 (Priority)
    7147 debug inception4_gpu8_bs128_fp32     dl    PD       0:00      1 (Priority)
    7148 debug resnet50_gpu8_bs256_fp32       dl    PD       0:00      1 (Priority)
    7149 debug inception3_gpu8_bs256_fp32     dl    PD       0:00      1 (Priority)
    7150 debug vgg16_gpu8_bs256_fp32          dl    PD       0:00      1 (Priority)
    7151 debug nasnet_gpu8_bs256_fp32         dl    PD       0:00      1 (Priority)
    7152 debug resnet152_gpu8_bs256_fp32      dl    PD       0:00      1 (Priority)
    7153 debug inception4_gpu8_bs256_fp32     dl    PD       0:00      1 (Priority)
    7154 debug resnet50_gpu8_bs512_fp32       dl    PD       0:00      1 (Priority)
    7155 debug inception3_gpu8_bs512_fp32     dl    PD       0:00      1 (Priority)
    7156 debug vgg16_gpu8_bs512_fp32          dl    PD       0:00      1 (Priority)
    7157 debug nasnet_gpu8_bs512_fp32         dl    PD       0:00      1 (Priority)
    7158 debug resnet152_gpu8_bs512_fp32      dl    PD       0:00      1 (Priority)
    7159 debug inception4_gpu8_bs512_fp32     dl    PD       0:00      1 (Priority)
    6968 debug resnet50_gpu1_bs64_fp16        dl     R       0:12      1 dl-machine
    6970 debug vgg16_gpu1_bs64_fp16           dl     R       0:12      1 dl-machine
    6971 debug nasnet_gpu1_bs64_fp16          dl     R       0:12      1 dl-machine
    6972 debug resnet152_gpu1_bs64_fp16       dl     R       0:12      1 dl-machine
    6975 debug inception3_gpu1_bs128_fp16     dl     R       0:12      1 dl-machine
    6976 debug vgg16_gpu1_bs128_fp16          dl     R       0:12      1 dl-machine
    6977 debug nasnet_gpu1_bs128_fp16         dl     R       0:12      1 dl-machine
    6978 debug resnet152_gpu1_bs128_fp16      dl     R       0:12      1 dl-machine
(base) dl@dl-machine:~/slurm$

The jobs then start running, as the R entries at the end of the listing show.
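The job names and consecutive job IDs in the queue listing imply the order in which the combinations were submitted: precision first, then GPU count, then batch size, with the model varying fastest. The sketch below reproduces only that enumeration. It is an assumption reconstructed from the listing, not the original script's loop, and list_job_names is a hypothetical name.

```shell
#!/bin/bash
# Prints one job name per combination, in the order implied by the
# queue listing: precision, then GPU count, then batch size, then model.
list_job_names() {
    local prec gpus bs model
    for prec in fp16 fp32; do
        for gpus in 1 2 4 8; do
            for bs in 64 128 256 512; do
                for model in resnet50 inception3 vgg16 nasnet resnet152 inception4; do
                    echo "${model}_gpu${gpus}_bs${bs}_${prec}"
                done
            done
        done
    done
}

# Usage:
#   list_job_names | wc -l    # counts the 2 * 4 * 4 * 6 = 192 combinations
```

The count of 192 matches the job ID range in the listing, 6968 through 7159.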

Listing the log directory shows the results:

(base) dl@dl-machine:~/slurm$ ls -lt logdir-4090x8
total 136
-rw-rw-r-- 1 dl dl   11  9月  9 16:32 nasnet_gpux2_bs256_fp16.log
-rw-rw-r-- 1 dl dl 1171  9月  9 16:32 inception3_gpux2_bs256_fp16.log
-rw-rw-r-- 1 dl dl   11  9月  9 16:31 vgg16_gpux2_bs256_fp16.log
-rw-rw-r-- 1 dl dl 1159  9月  9 16:31 inception4_gpux2_bs128_fp16.log
-rw-rw-r-- 1 dl dl 1167  9月  9 16:31 resnet50_gpux2_bs256_fp16.log
-rw-rw-r-- 1 dl dl 1168  9月  9 16:30 resnet152_gpux2_bs128_fp16.log
-rw-rw-r-- 1 dl dl 1165  9月  9 16:30 nasnet_gpux2_bs128_fp16.log
-rw-rw-r-- 1 dl dl 1159  9月  9 16:29 inception3_gpux2_bs128_fp16.log
-rw-rw-r-- 1 dl dl 1134  9月  9 16:29 resnet152_gpux1_bs64_fp32.log
-rw-rw-r-- 1 dl dl   11  9月  9 16:28 vgg16_gpux2_bs128_fp16.log
-rw-rw-r-- 1 dl dl 1169  9月  9 16:28 resnet50_gpux2_bs128_fp16.log
-rw-rw-r-- 1 dl dl 1159  9月  9 16:28 inception4_gpux2_bs64_fp16.log
-rw-rw-r-- 1 dl dl 1138  9月  9 16:28 nasnet_gpux1_bs64_fp32.log
-rw-rw-r-- 1 dl dl 1158  9月  9 16:28 vgg16_gpux2_bs64_fp16.log
-rw-rw-r-- 1 dl dl 1130  9月  9 16:27 vgg16_gpux1_bs64_fp32.log
-rw-rw-r-- 1 dl dl 1165  9月  9 16:27 resnet152_gpux2_bs64_fp16.log
-rw-rw-r-- 1 dl dl 1165  9月  9 16:26 nasnet_gpux2_bs64_fp16.log
-rw-rw-r-- 1 dl dl 1135  9月  9 16:26 inception3_gpux1_bs64_fp32.log
-rw-rw-r-- 1 dl dl 1168  9月  9 16:26 inception3_gpux2_bs64_fp16.log
-rw-rw-r-- 1 dl dl 1133  9月  9 16:25 resnet50_gpux1_bs64_fp32.log
-rw-rw-r-- 1 dl dl 1169  9月  9 16:25 resnet50_gpux2_bs64_fp16.log
-rw-rw-r-- 1 dl dl 1147  9月  9 16:25 resnet50_gpux1_bs512_fp16.log
-rw-rw-r-- 1 dl dl  391  9月  9 16:25 nasnet_gpux1_bs512_fp16.log
-rw-rw-r-- 1 dl dl 1137  9月  9 16:25 inception4_gpux1_bs256_fp16.log
-rw-rw-r-- 1 dl dl  395  9月  9 16:25 inception4_gpux1_bs512_fp16.log
-rw-rw-r-- 1 dl dl  394  9月  9 16:25 resnet152_gpux1_bs512_fp16.log
-rw-rw-r-- 1 dl dl 1136  9月  9 16:24 resnet152_gpux1_bs256_fp16.log
-rw-rw-r-- 1 dl dl  395  9月  9 16:24 inception3_gpux1_bs512_fp16.log
-rw-rw-r-- 1 dl dl   11  9月  9 16:23 vgg16_gpux1_bs512_fp16.log
-rw-rw-r-- 1 dl dl 1149  9月  9 16:23 inception3_gpux1_bs256_fp16.log
-rw-rw-r-- 1 dl dl 1133  9月  9 16:23 nasnet_gpux1_bs256_fp16.log
-rw-rw-r-- 1 dl dl 1147  9月  9 16:23 resnet50_gpux1_bs256_fp16.log
-rw-rw-r-- 1 dl dl 1137  9月  9 16:23 inception4_gpux1_bs128_fp16.log
-rw-rw-r-- 1 dl dl   11  9月  9 16:22 vgg16_gpux1_bs256_fp16.log
(base) dl@dl-machine:~/slurm$ 

Each result is written to its own file, as listed above. The contents of one file:

(base) dl@dl-machine:~/slurm$ cat logdir-4090x8/inception4_gpux1_bs128_fp16.log
dl-machine
TensorFlow:  1.15
Model:       inception4
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             128 per device
Num batches: 100
Num epochs:  0.01
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step	Img/sec	total_loss
1	images/sec: 693.9 +/- 0.0 (jitter = 0.0)	7.725
10	images/sec: 694.9 +/- 1.3 (jitter = 1.3)	7.662
20	images/sec: 696.2 +/- 1.0 (jitter = 1.8)	7.695
30	images/sec: 696.2 +/- 0.8 (jitter = 1.8)	7.589
40	images/sec: 695.6 +/- 0.6 (jitter = 1.4)	7.618
50	images/sec: 695.4 +/- 0.5 (jitter = 1.0)	7.659
60	images/sec: 695.0 +/- 0.5 (jitter = 0.7)	7.555
70	images/sec: 694.9 +/- 0.4 (jitter = 0.7)	7.638
80	images/sec: 694.6 +/- 0.5 (jitter = 0.7)	7.656
90	images/sec: 694.4 +/- 0.5 (jitter = 0.8)	7.696
100	images/sec: 694.4 +/- 0.4 (jitter = 0.8)	7.634
----------------------------------------------------------------
total images/sec: 694.06
----------------------------------------------------------------
(base) dl@dl-machine:~/slurm$ 

The total images/sec line near the end of the log reports the overall throughput.
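With this many logs, collecting the total images/sec values by hand is tedious. Some logs in the directory listing are only 11 bytes, which matches the length of the hostname line alone, so a summary also has to cope with logs that have no total line. This is a minimal sketch under those assumptions; the function names total_images_per_sec and summarize_logs are hypothetical.

```shell
#!/bin/bash
# Prints the value from the "total images/sec:" line of one log file.
# Prints nothing if the log has no such line (an unfinished or failed run).
total_images_per_sec() {
    awk '/^total images\/sec:/ { print $3 }' "$1"
}

# Prints "<log name> <total images/sec>" for every finished run in a
# directory, and "<log name> incomplete" for logs without a total line.
summarize_logs() {
    local f total
    for f in "$1"/*.log; do
        [ -e "$f" ] || continue   # the glob matched nothing
        total=$(total_images_per_sec "$f")
        echo "$(basename "$f" .log) ${total:-incomplete}"
    done
}

# Usage:
#   summarize_logs logdir-4090x8
```

For the log shown above, total_images_per_sec prints 694.06.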