Comprehensive tf_cnn_benchmarks measurements on 1, 2, 4, and 8 RTX 3090 GPUs


We measured training speed with tf_cnn_benchmarks on 1, 2, 4, and 8 GeForce RTX 3090 GPUs, varying the batch size across 64, 128, 256, and 512.

The models are resnet50, inception3, vgg16, nasnet, resnet152, and inception4.

Training speed was measured in both fp16 and fp32.

In the tables, the number in parentheses after each training speed (images/sec) is the speedup relative to the 1-GPU result.
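As a small illustration of how the parenthesized values can be reproduced, the sketch below divides a multi-GPU rate by the 1-GPU rate. The function names are hypothetical, and `scaling_efficiency` (speedup divided by GPU count) is an extra metric that does not appear in the original tables. The numbers come from the resnet50, fp16, batch size 64 row.

```python
def speedup(multi_gpu_rate: float, single_gpu_rate: float) -> float:
    """Ratio of multi-GPU throughput to 1-GPU throughput (both in images/sec)."""
    return multi_gpu_rate / single_gpu_rate


def scaling_efficiency(multi_gpu_rate: float, single_gpu_rate: float, num_gpus: int) -> float:
    """Speedup divided by GPU count; 1.0 would mean perfectly linear scaling.

    This metric is an addition for illustration and is not in the tables.
    """
    return speedup(multi_gpu_rate, single_gpu_rate) / num_gpus


# Values from the resnet50 / fp16 / batch size 64 row.
one_gpu_rate = 1026.83
eight_gpu_rate = 4039.44

# The first value should agree with the table's parenthesized 3.934.
print(f"speedup on 8 GPUs: {speedup(eight_gpu_rate, one_gpu_rate):.3f}")
print(f"scaling efficiency: {scaling_efficiency(eight_gpu_rate, one_gpu_rate, 8):.3f}")
```

An efficiency well below 1.0, as in this row, means that adding GPUs yields much less than a proportional gain at that batch size.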

The hardware used was an HPCDIY-ERMGPU8R4S:
CPU: AMD EPYC Rome 7352 DP/UP 24C/48T 2.3G 128M 155W
Memory: 32GB x 16 = 512GB
SSD: Samsung PM983 7.68TB NVMe PCIe3x4 V4 TLC 2.5" 7mm (1.3 DWPD) x 1

The software used was tf_cnn_benchmarks, and TensorFlow came from the nvcr.io/nvidia/tensorflow:20.12-tf1-py3 container image.
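The parameter sweep described above could be generated along the following lines. This is only a sketch: the exact command lines used for these measurements are not given in the post, the script location `tf_cnn_benchmarks.py` is an assumed path, and the helper names are hypothetical. The flags `--model`, `--batch_size`, `--num_gpus`, and `--use_fp16` are standard tf_cnn_benchmarks options; any other options the original runs may have used are unknown.

```python
import itertools

MODELS = ["resnet50", "inception3", "vgg16", "nasnet", "resnet152", "inception4"]
BATCH_SIZES = [64, 128, 256, 512]
GPU_COUNTS = [1, 2, 4, 8]
PRECISIONS = ["fp16", "fp32"]

# Assumption: the script is in the current directory; adjust to your checkout.
SCRIPT = "tf_cnn_benchmarks.py"


def build_command(model, batch_size, num_gpus, precision):
    """Return the argv list for one benchmark run (hypothetical helper)."""
    return [
        "python", SCRIPT,
        f"--model={model}",
        f"--batch_size={batch_size}",
        f"--num_gpus={num_gpus}",
        f"--use_fp16={precision == 'fp16'}",
    ]


def sweep():
    """Yield one command per (precision, model, batch size, GPU count)."""
    for precision, model, batch_size, num_gpus in itertools.product(
        PRECISIONS, MODELS, BATCH_SIZES, GPU_COUNTS
    ):
        yield build_command(model, batch_size, num_gpus, precision)


commands = list(sweep())
print(len(commands))          # size of the full grid
print(" ".join(commands[0]))  # first command in the sweep
```

Each list can be passed to `subprocess.run` inside the container. The tables report fewer rows than this full grid, so not every combination has a result.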

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090	fp16	resnet50	64	1026.83	1828.21(1.780)	2798.55(2.725)	4039.44(3.934)
rtx3090	fp16	resnet50	128	1126.66	2103.69(1.867)	3596.61(3.192)	6005.06(5.330)
rtx3090	fp16	resnet50	256	1181.90	2283.06(1.932)	4266.43(3.610)	7362.66(6.230)
rtx3090	fp16	resnet50	512	1205.02	2371.21(1.968)	4635.48(3.847)	8756.00(7.266)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090	fp16	inception3	64	709.40	1304.37(1.839)	2247.00(3.167)	3549.19(5.003)
rtx3090	fp16	inception3	128	764.99	1382.01(1.807)	2692.60(3.520)	4865.46(6.360)
rtx3090	fp16	inception3	256	806.46	1573.15(1.951)	2979.08(3.694)	5598.28(6.942)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090	fp16	vgg16	64	419.83	589.37(1.404)	703.38(1.675)	634.67(1.512)
rtx3090	fp16	vgg16	128	439.85	756.71(1.720)	949.62(2.159)	1093.66(2.486)
rtx3090	fp16	vgg16	256	455.41	821.34(1.804)	1303.34(2.862)	1695.73(3.724)
rtx3090	fp16	vgg16	512	440.25	838.75(1.905)	1356.53(3.081)	2171.86(4.933)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090	fp16	nasnet	64	343.50	524.19(1.526)	813.83(2.369)	1362.27(3.966)
rtx3090	fp16	nasnet	128	406.73	726.86(1.787)	1314.13(3.231)	2298.17(5.650)
rtx3090	fp16	nasnet	256	442.83	833.64(1.883)	1576.29(3.560)	2943.63(6.647)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090	fp16	resnet152	64	427.40	766.60(1.794)	1151.20(2.693)	1777.87(4.160)
rtx3090	fp16	resnet152	128	452.36	856.83(1.894)	1528.42(3.379)	2509.74(5.548)
rtx3090	fp16	resnet152	256	486.50	943.83(1.940)	1741.87(3.580)	3157.88(6.491)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090	fp16	inception4	64	340.48	616.14(1.810)	1118.29(3.284)	1928.91(5.665)
rtx3090	fp16	inception4	128	363.79	708.04(1.946)	1293.56(3.556)	2458.25(6.757)
rtx3090	fp16	inception4	256	399.16	786.27(1.970)	1524.64(3.820)	2975.40(7.454)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090	fp32	resnet50	64	490.86	920.09(1.874)	1591.11(3.241)	2717.38(5.536)
rtx3090	fp32	resnet50	128	535.17	1028.42(1.922)	1902.45(3.555)	3472.79(6.489)
rtx3090	fp32	resnet50	256	549.28	1078.30(1.963)	2095.68(3.815)	3973.04(7.233)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090	fp32	inception3	64	343.02	642.50(1.873)	1193.98(3.481)	2177.34(6.348)
rtx3090	fp32	inception3	128	361.84	715.59(1.978)	1362.60(3.766)	2572.47(7.109)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090	fp32	vgg16	64	313.28	471.89(1.506)	576.93(1.842)	643.78(2.055)
rtx3090	fp32	vgg16	128	322.07	571.30(1.774)	737.77(2.291)	1019.74(3.166)
rtx3090	fp32	vgg16	256	325.21	615.41(1.892)	947.14(2.912)	1330.58(4.091)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090	fp32	nasnet	64	329.72	551.90(1.674)	828.22(2.512)	1382.13(4.192)
rtx3090	fp32	nasnet	128	385.95	697.19(1.806)	1266.13(3.281)	2292.71(5.940)
rtx3090	fp32	nasnet	256	413.63	782.42(1.892)	1493.11(3.610)	2787.66(6.740)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090	fp32	resnet152	64	205.71	384.59(1.870)	671.86(3.266)	1115.93(5.425)
rtx3090	fp32	resnet152	128	222.55	428.55(1.926)	796.43(3.579)	1447.99(6.506)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090	fp32	inception4	64	167.74	321.31(1.916)	602.30(3.591)	1108.22(6.607)
rtx3090	fp32	inception4	128	176.45	344.66(1.953)	664.47(3.766)	1289.53(7.308)
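For readers who want to work with the rows above programmatically, here is a minimal parsing sketch. It assumes each row is tab-separated and that multi-GPU cells have the form `rate(speedup)`, as in the tables. The function name and dictionary keys are hypothetical. The sample row is the vgg16, fp16, batch size 64 row.

```python
import re

# A cell is either "419.83" or "589.37(1.404)".
CELL = re.compile(r"^([0-9.]+)(?:\(([0-9.]+)\))?$")


def parse_row(line):
    """Split one tab-separated table row into typed fields (hypothetical helper)."""
    gpu, precision, model, batch_size, *cells = line.rstrip("\n").split("\t")
    rates, reported = [], []
    for cell in cells:
        match = CELL.match(cell.strip())
        if match is None:
            raise ValueError(f"unexpected cell format: {cell!r}")
        rates.append(float(match.group(1)))
        # The 1-GPU cell has no parenthesized value; its speedup is 1 by definition.
        reported.append(float(match.group(2)) if match.group(2) else 1.0)
    return {
        "gpu": gpu,
        "precision": precision,
        "model": model,
        "batch_size": int(batch_size),
        "rates": rates,                  # images/sec for 1, 2, 4, 8 GPUs
        "reported_speedups": reported,   # parenthesized values from the table
    }


row = parse_row("rtx3090\tfp16\tvgg16\t64\t419.83\t589.37(1.404)\t703.38(1.675)\t634.67(1.512)")
for num_gpus, rate, reported in zip([1, 2, 4, 8], row["rates"], row["reported_speedups"]):
    recomputed = rate / row["rates"][0]
    print(f"{num_gpus} GPU: {rate} images/sec, reported {reported}, recomputed {recomputed:.3f}")
```

In this particular row the 8-GPU rate (634.67) is lower than the 4-GPU rate (703.38), so parsing the tables this way makes such non-monotonic cases easy to spot.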
