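We measured training throughput with tf_cnn_benchmarks on 1, 2, 4, and 8 GeForce RTX 3090 GPUs, varying the batch size over 64, 128, 256, and 512.
The models are resnet50, inception3, vgg16, nasnet, resnet152, and inception4.
Throughput was measured at both fp16 and fp32 precision.
In the tables, the number in parentheses after each throughput value (images/sec) is the speedup relative to the 1-GPU result in the same row.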
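
As a minimal sketch of how the parenthesized values can be reproduced, the snippet below divides each multi-GPU throughput by the 1-GPU throughput of the same row. The numbers are copied from the resnet50, fp16, batch size 64 row. The function name `scaling` and the per-GPU "efficiency" figure (speedup divided by GPU count) are additions for illustration and do not appear in the original measurements.

```python
# Throughput (images/sec) by GPU count for resnet50, fp16, batch size 64,
# copied from the results table.
throughput = {1: 1026.83, 2: 1828.21, 4: 2798.55, 8: 4039.44}


def scaling(throughput_by_gpus):
    """Return {num_gpus: (speedup, efficiency)} relative to the 1-GPU result.

    speedup    = throughput(n GPUs) / throughput(1 GPU)
    efficiency = speedup / n  (illustrative metric, not in the original tables)
    """
    base = throughput_by_gpus[1]
    return {
        n: (value / base, value / base / n)
        for n, value in throughput_by_gpus.items()
    }


for n, (speedup, efficiency) in scaling(throughput).items():
    print(f"{n} GPU(s): speedup {speedup:.3f}, efficiency {efficiency:.1%}")
```

The speedups computed this way round to the same three decimals as the values in parentheses for that row.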
The hardware used was the HPCDIY-ERMGPU8R4S, configured as follows:
CPU: AMD EPYC Rome 7352 DP/UP 24C/48T 2.3G 128M 155W
Memory: 32GB x 16 = 512GB
SSD: Samsung PM983 7.68TB NVMePCIe3x4 V4TLC 2.5"7mm(1.3 DWPD) x 1
The benchmark software was tf_cnn_benchmarks, and TensorFlow was provided by the nvcr.io/nvidia/tensorflow:20.12-tf1-py3 container image.
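
The exact command lines used for these runs are not given here, so the following is only a sketch of how the sweep over models, GPU counts, batch sizes, and precisions could be enumerated. It assumes the script is invoked as `tf_cnn_benchmarks.py` from the current directory (a hypothetical path) and uses the `--model`, `--num_gpus`, `--batch_size`, and `--use_fp16` flags of tf_cnn_benchmarks. Any other flags the original runs may have used are unknown. The helper `build_command` is made up for illustration, and the snippet only builds the argument lists without executing anything.

```python
from itertools import product

# Sweep parameters taken from the description of the measurements.
MODELS = ["resnet50", "inception3", "vgg16", "nasnet", "resnet152", "inception4"]
GPU_COUNTS = [1, 2, 4, 8]
BATCH_SIZES = [64, 128, 256, 512]
PRECISIONS = ["fp16", "fp32"]


def build_command(model, num_gpus, batch_size, precision):
    """Build one benchmark command as an argument list (hypothetical helper)."""
    cmd = [
        "python",
        "tf_cnn_benchmarks.py",  # assumed script location
        f"--model={model}",
        f"--num_gpus={num_gpus}",
        f"--batch_size={batch_size}",
    ]
    if precision == "fp16":
        cmd.append("--use_fp16=True")
    return cmd


commands = [
    build_command(model, num_gpus, batch_size, precision)
    for precision, model, batch_size, num_gpus in product(
        PRECISIONS, MODELS, BATCH_SIZES, GPU_COUNTS
    )
]

print(len(commands))
print(" ".join(commands[0]))
```

The full grid is larger than the set of rows reported in the tables, since some models have no results at the larger batch sizes.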

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| rtx3090 | fp16 | resnet50 | 64 | 1026.83 | 1828.21(1.780) | 2798.55(2.725) | 4039.44(3.934) |
| rtx3090 | fp16 | resnet50 | 128 | 1126.66 | 2103.69(1.867) | 3596.61(3.192) | 6005.06(5.330) |
| rtx3090 | fp16 | resnet50 | 256 | 1181.90 | 2283.06(1.932) | 4266.43(3.610) | 7362.66(6.230) |
| rtx3090 | fp16 | resnet50 | 512 | 1205.02 | 2371.21(1.968) | 4635.48(3.847) | 8756.00(7.266) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| rtx3090 | fp16 | inception3 | 64 | 709.40 | 1304.37(1.839) | 2247.00(3.167) | 3549.19(5.003) |
| rtx3090 | fp16 | inception3 | 128 | 764.99 | 1382.01(1.807) | 2692.60(3.520) | 4865.46(6.360) |
| rtx3090 | fp16 | inception3 | 256 | 806.46 | 1573.15(1.951) | 2979.08(3.694) | 5598.28(6.942) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| rtx3090 | fp16 | vgg16 | 64 | 419.83 | 589.37(1.404) | 703.38(1.675) | 634.67(1.512) |
| rtx3090 | fp16 | vgg16 | 128 | 439.85 | 756.71(1.720) | 949.62(2.159) | 1093.66(2.486) |
| rtx3090 | fp16 | vgg16 | 256 | 455.41 | 821.34(1.804) | 1303.34(2.862) | 1695.73(3.724) |
| rtx3090 | fp16 | vgg16 | 512 | 440.25 | 838.75(1.905) | 1356.53(3.081) | 2171.86(4.933) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| rtx3090 | fp16 | nasnet | 64 | 343.50 | 524.19(1.526) | 813.83(2.369) | 1362.27(3.966) |
| rtx3090 | fp16 | nasnet | 128 | 406.73 | 726.86(1.787) | 1314.13(3.231) | 2298.17(5.650) |
| rtx3090 | fp16 | nasnet | 256 | 442.83 | 833.64(1.883) | 1576.29(3.560) | 2943.63(6.647) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| rtx3090 | fp16 | resnet152 | 64 | 427.40 | 766.60(1.794) | 1151.20(2.693) | 1777.87(4.160) |
| rtx3090 | fp16 | resnet152 | 128 | 452.36 | 856.83(1.894) | 1528.42(3.379) | 2509.74(5.548) |
| rtx3090 | fp16 | resnet152 | 256 | 486.50 | 943.83(1.940) | 1741.87(3.580) | 3157.88(6.491) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| rtx3090 | fp16 | inception4 | 64 | 340.48 | 616.14(1.810) | 1118.29(3.284) | 1928.91(5.665) |
| rtx3090 | fp16 | inception4 | 128 | 363.79 | 708.04(1.946) | 1293.56(3.556) | 2458.25(6.757) |
| rtx3090 | fp16 | inception4 | 256 | 399.16 | 786.27(1.970) | 1524.64(3.820) | 2975.40(7.454) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| rtx3090 | fp32 | resnet50 | 64 | 490.86 | 920.09(1.874) | 1591.11(3.241) | 2717.38(5.536) |
| rtx3090 | fp32 | resnet50 | 128 | 535.17 | 1028.42(1.922) | 1902.45(3.555) | 3472.79(6.489) |
| rtx3090 | fp32 | resnet50 | 256 | 549.28 | 1078.30(1.963) | 2095.68(3.815) | 3973.04(7.233) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| rtx3090 | fp32 | inception3 | 64 | 343.02 | 642.50(1.873) | 1193.98(3.481) | 2177.34(6.348) |
| rtx3090 | fp32 | inception3 | 128 | 361.84 | 715.59(1.978) | 1362.60(3.766) | 2572.47(7.109) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| rtx3090 | fp32 | vgg16 | 64 | 313.28 | 471.89(1.506) | 576.93(1.842) | 643.78(2.055) |
| rtx3090 | fp32 | vgg16 | 128 | 322.07 | 571.30(1.774) | 737.77(2.291) | 1019.74(3.166) |
| rtx3090 | fp32 | vgg16 | 256 | 325.21 | 615.41(1.892) | 947.14(2.912) | 1330.58(4.091) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| rtx3090 | fp32 | nasnet | 64 | 329.72 | 551.90(1.674) | 828.22(2.512) | 1382.13(4.192) |
| rtx3090 | fp32 | nasnet | 128 | 385.95 | 697.19(1.806) | 1266.13(3.281) | 2292.71(5.940) |
| rtx3090 | fp32 | nasnet | 256 | 413.63 | 782.42(1.892) | 1493.11(3.610) | 2787.66(6.740) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| rtx3090 | fp32 | resnet152 | 64 | 205.71 | 384.59(1.870) | 671.86(3.266) | 1115.93(5.425) |
| rtx3090 | fp32 | resnet152 | 128 | 222.55 | 428.55(1.926) | 796.43(3.579) | 1447.99(6.506) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| rtx3090 | fp32 | inception4 | 64 | 167.74 | 321.31(1.916) | 602.30(3.591) | 1108.22(6.607) |
| rtx3090 | fp32 | inception4 | 128 | 176.45 | 344.66(1.953) | 664.47(3.766) | 1289.53(7.308) |
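
Because the same models were measured at both precisions, the fp16 and fp32 results can be compared directly. The sketch below computes the fp16-to-fp32 throughput ratio on a single GPU at batch size 64, using values copied from the tables. The ratio itself is a derived figure added for illustration, not something reported in the original measurements.

```python
# 1-GPU throughput (images/sec) at batch size 64, copied from the tables.
fp16 = {
    "resnet50": 1026.83,
    "inception3": 709.40,
    "vgg16": 419.83,
    "nasnet": 343.50,
    "resnet152": 427.40,
    "inception4": 340.48,
}
fp32 = {
    "resnet50": 490.86,
    "inception3": 343.02,
    "vgg16": 313.28,
    "nasnet": 329.72,
    "resnet152": 205.71,
    "inception4": 167.74,
}

# Ratio > 1 means fp16 trained faster than fp32 for that model.
ratio = {model: fp16[model] / fp32[model] for model in fp16}

for model, value in sorted(ratio.items(), key=lambda item: item[1], reverse=True):
    print(f"{model:<11} fp16/fp32 = {value:.2f}x")
```

With these inputs, resnet50, inception3, resnet152, and inception4 come out at roughly 2x, while vgg16 and nasnet gain noticeably less from fp16.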