RTX4090 1, 2, 4 GPU vs RTX3090 1, 2, 4 GPU for Deep Learning

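We ran tf_cnn_benchmarks on 1, 2, and 4 GeForce RTX 4090 GPUs and measured training speed while varying the batch size over 64, 128, 256, and 512. A comparison with the RTX3090 is also included.

The models are resnet50, inception3, vgg16, nasnet, resnet152, and inception4.

Training speed was measured in both fp16 and fp32.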

In the tables, the number in parentheses after a training speed (images/sec) shows how many times faster that result is than the 1-GPU result.
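The parenthesized scaling factor can be reproduced by dividing each multi-GPU throughput by the 1-GPU throughput. The following is a minimal sketch; the function name `scaling_factors` and the dictionary layout are illustrative choices, not part of the benchmark tooling, and the sample values are the RTX4090 fp16 resnet50 batch size 64 results from the tables.

```python
def scaling_factors(images_per_sec):
    """Map each multi-GPU count to its throughput relative to the 1-GPU run.

    images_per_sec: dict of {number of GPUs: images/sec}; must contain key 1.
    """
    baseline = images_per_sec[1]
    return {
        n: round(ips / baseline, 3)
        for n, ips in images_per_sec.items()
        if n != 1
    }


# RTX4090, fp16, resnet50, batch size 64 (values copied from the tables)
rtx4090_resnet50 = {1: 1688.03, 2: 2794.01, 4: 4233.10}
print(scaling_factors(rtx4090_resnet50))  # {2: 1.655, 4: 2.508}
```

A factor equal to the GPU count would mean perfectly linear scaling; here 4 GPUs deliver about 2.5 times the 1-GPU throughput.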

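In each cell, the RTX3090 measurement is listed first (upper row) and the RTX4090 measurement second. For the RTX4090 1-GPU value, the number in parentheses to its right is instead the ratio to the RTX3090 1-GPU value. The images/sec(8gpu) column contains RTX3090 values only.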
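To make the cell layout concrete, here is a small sketch that splits one table cell into its RTX3090 and RTX4090 parts and recomputes the RTX4090-to-RTX3090 ratio for the 1-GPU column. The helper names `parse_cell` and `rtx4090_speedup` are hypothetical, and the regular expression assumes cells look like `1026.83 1688.03(1.644)`, as they do in the tables.

```python
import re

# A value such as "1688.03", optionally followed by a ratio in parentheses.
CELL_RE = re.compile(r"(\d+(?:\.\d+)?)(?:\((\d+(?:\.\d+)?)\))?")


def parse_cell(cell):
    """Return a list of (images_per_sec, ratio_or_None), in the order written."""
    return [
        (float(value), float(ratio) if ratio else None)
        for value, ratio in CELL_RE.findall(cell)
    ]


def rtx4090_speedup(cell_1gpu):
    """Recompute RTX4090 / RTX3090 from a 1-GPU cell (RTX3090 is listed first)."""
    (v3090, _), (v4090, _) = parse_cell(cell_1gpu)
    return round(v4090 / v3090, 3)


cell = "1026.83 1688.03(1.644)"  # fp16, resnet50, batch size 64, 1 GPU
print(parse_cell(cell))       # [(1026.83, None), (1688.03, 1.644)]
print(rtx4090_speedup(cell))  # 1.644
```

In the 2-GPU and 4-GPU columns both values carry a parenthesized number, and there it is the scaling factor relative to the same card's 1-GPU result rather than a cross-card ratio.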

The hardware used was the HPCDIY-ERMGPU8R4S:
CPU: 2 x AMD EPYC Rome 7252 DP/UP 8C/16T 3.1G 64M 120W
Memory: 32GB x 16 = 512GB
SSD: 1 x Micron 7450 PRO 960GB NVMe PCIe 4.0 3DTLC U.3 7mm, 1DWPD

The software used was tf_cnn_benchmarks, run with the TensorFlow container nvcr.io/nvidia/tensorflow:23.02-tf1-py3.
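The exact command lines used for the measurements are not given here. As a rough sketch of how such a sweep could be driven, the script below only builds (and prints) one tf_cnn_benchmarks command per combination of precision, model, batch size, and GPU count. `--model`, `--batch_size`, `--num_gpus`, and `--use_fp16` are flags of tf_cnn_benchmarks; the script location `tf_cnn_benchmarks.py`, the helper names, and the assumption that no further flags were set are guesses for illustration only.

```python
from itertools import product

MODELS = ["resnet50", "inception3", "vgg16", "nasnet", "resnet152", "inception4"]
BATCH_SIZES = [64, 128, 256, 512]
GPU_COUNTS = [1, 2, 4]
PRECISIONS = ["fp16", "fp32"]


def build_command(model, batch_size, num_gpus, precision):
    """Build one tf_cnn_benchmarks invocation as an argument list (not executed)."""
    return [
        "python",
        "tf_cnn_benchmarks.py",  # assumed path inside the container
        f"--model={model}",
        f"--batch_size={batch_size}",
        f"--num_gpus={num_gpus}",
        f"--use_fp16={precision == 'fp16'}",
    ]


def sweep():
    """All combinations of precision, model, batch size, and GPU count."""
    return [
        build_command(model, batch_size, num_gpus, precision)
        for precision, model, batch_size, num_gpus in product(
            PRECISIONS, MODELS, BATCH_SIZES, GPU_COUNTS
        )
    ]


if __name__ == "__main__":
    for command in sweep():
        print(" ".join(command))
```

The full grid is an upper bound: the tables do not list every model at every batch size, so some combinations were evidently not reported.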

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090 rtx4090	fp16	resnet50	64	1026.83 1688.03(1.644)	1828.21(1.780) 2794.01(1.655)	2798.55(2.725) 4233.10(2.508)	4039.44(3.934)
rtx3090 rtx4090	fp16	resnet50	128	1126.66 1748.98(1.552)	2103.69(1.867) 3145.92(1.799)	3596.61(3.192) 5548.25(3.172)	6005.06(5.330)
rtx3090 rtx4090	fp16	resnet50	256	1181.90 1776.41(1.503)	2283.06(1.932) 3357.26(1.890)	4266.43(3.610) 6142.79(3.458)	7362.66(6.230)
rtx3090 rtx4090	fp16	resnet50	512	1205.02 1760.89(1.461)	2371.21(1.968) 3408.95(1.936)	4635.48(3.847) 6582.35(3.740)	8756.00(7.266)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090 rtx4090	fp16	inception3	64	709.40 1243.98(1.754)	1304.37(1.839) 2046.64(1.645)	2247.00(3.167) 3274.80(2.633)	3549.19(5.003)
rtx3090 rtx4090	fp16	inception3	128	764.99 1340.23(1.752)	1382.01(1.807) 2497.32(1.863)	2692.60(3.520) 4334.59(3.234)	4865.46(6.360)
rtx3090 rtx4090	fp16	inception3	256	806.46 1391.36(1.725)	1573.15(1.951) 2515.63(1.808)	2979.08(3.694) 5004.93(3.597)	5598.28(6.942)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090 rtx4090	fp16	vgg16	64	419.83 782.81(1.865)	589.37(1.404) 1202.66(1.536)	703.38(1.675) 1093.21(1.397)	634.67(1.512)
rtx3090 rtx4090	fp16	vgg16	128	439.85 831.85(1.891)	756.71(1.720) 1311.02(1.576)	949.62(2.159) 1714.44(2.061)	1093.66(2.486)
rtx3090 rtx4090	fp16	vgg16	256	455.41 862.27(1.893)	821.34(1.804) 1671.98(1.939)	1303.34(2.862) 2429.87(2.818)	1695.73(3.724)
rtx3090 rtx4090	fp16	vgg16	512	440.25 875.55(1.989)	838.75(1.905) 1651.70(1.886)	1356.53(3.081) 2879.19(3.288)	2171.86(4.933)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090 rtx4090	fp16	nasnet	64	343.50 636.89(1.854)	524.19(1.526) 740.25(1.162)	813.83(2.369) 1203.70(1.890)	1362.27(3.966)
rtx3090 rtx4090	fp16	nasnet	128	406.73 809.24(1.990)	726.86(1.787) 1242.0(1.535)	1314.13(3.231) 2026.16(2.504)	2298.17(5.650)
rtx3090 rtx4090	fp16	nasnet	256	442.83 844.10(1.906)	833.64(1.883) 1530.35(1.813)	1576.29(3.560) 2877.05(3.408)	2943.63(6.647)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090 rtx4090	fp16	resnet152	64	427.40 752.43(1.760)	766.60(1.794) 1177.31(1.565)	1151.20(2.693) 1644.11(2.185)	1777.87(4.160)
rtx3090 rtx4090	fp16	resnet152	128	452.36 765.31(1.692)	856.83(1.894) 1372.78(1.794)	1528.42(3.379) 2346.98(3.067)	2509.74(5.548)
rtx3090 rtx4090	fp16	resnet152	256	486.50 774.11(1.591)	943.83(1.940) 1467.87(1.896)	1741.87(3.580) 2769.61(3.578)	3157.88(6.491)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090 rtx4090	fp16	inception4	64	340.48 647.39(1.901)	616.14(1.810) 1165.49(1.800)	1118.29(3.284) 1849.91(2.857)	1928.91(5.665)
rtx3090 rtx4090	fp16	inception4	128	363.79 695.43(1.912)	708.04(1.946) 1313.02(1.888)	1293.56(3.556) 2398.77(3.449)	2458.25(6.757)
rtx3090 rtx4090	fp16	inception4	256	399.16 727.43(1.822)	786.27(1.970) 1370.57(1.884)	1524.64(3.820) 2704.61(3.718)	2975.40(7.454)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090 rtx4090	fp32	resnet50	64	490.86 822.14(1.675)	920.09(1.874) 1522.85(1.852)	1591.11(3.241) 2620.04(3.187)	2717.38(5.536)
rtx3090 rtx4090	fp32	resnet50	128	535.17 832.78(1.556)	1028.42(1.922) 1584.19(1.902)	1902.45(3.555) 2970.26(3.567)	3472.79(6.489)
rtx3090 rtx4090	fp32	resnet50	256	549.28 833.72(1.518)	1078.30(1.963) 1630.34(1.956)	2095.68(3.815) 3112.87(3.734)	3973.04(7.233)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090 rtx4090	fp32	inception3	64	343.02 632.08(1.843)	642.50(1.873) 1170.18(1.851)	1193.98(3.481) 2082.66(3.295)	2177.34(6.348)
rtx3090 rtx4090	fp32	inception3	128	361.84 632.24(1.747)	715.59(1.978) 1198.78(1.896)	1362.60(3.766) 2300.95(3.639)	2572.47(7.109)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090 rtx4090	fp32	vgg16	64	313.28 471.91(1.506)	471.89(1.506) 797.67(1.690)	576.93(1.842) 915.77(1.941)	643.78(2.055)
rtx3090 rtx4090	fp32	vgg16	128	322.07 483.77(1.502)	571.30(1.774) 887.10(1.834)	737.77(2.291) 1213.92(2.509)	1019.74(3.166)
rtx3090 rtx4090	fp32	vgg16	256	325.21 465.02(1.430)	615.41(1.892) 908.37(1.953)	947.14(2.912) 1560.53(3.356)	1330.58(4.091)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090 rtx4090	fp32	nasnet	64	329.72 626.03(1.899)	551.90(1.674) 763.24(1.219)	828.22(2.512) 1239.61(1.980)	1382.13(4.192)
rtx3090 rtx4090	fp32	nasnet	128	385.95 710.80(1.842)	697.19(1.806) 1220.52(1.717)	1266.13(3.281) 2039.47(2.869)	2292.71(5.940)
rtx3090 rtx4090	fp32	nasnet	256	413.63 726.10(1.755)	782.42(1.892) 1323.51(1.823)	1493.11(3.610) 2569.41(3.539)	2787.66(6.740)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090 rtx4090	fp32	resnet152	64	205.71 360.20(1.751)	384.59(1.870) 659.77(1.832)	671.86(3.266) 1094.79(3.039)	1115.93(5.425)
rtx3090 rtx4090	fp32	resnet152	128	222.55 363.10(1.632)	428.55(1.926) 687.65(1.894)	796.43(3.579) 1239.57(3.414)	1447.99(6.506)

gpu	precision	model	batch size	images/sec(1gpu)	images/sec(2gpu)	images/sec(4gpu)	images/sec(8gpu)
rtx3090 rtx4090	fp32	inception4	64	167.74 323.53(1.929)	321.31(1.916) 547.17(1.691)	602.30(3.591) 1113.07(3.440)	1108.22(6.607)
rtx3090 rtx4090	fp32	inception4	128	176.45 327.04(1.853)	344.66(1.953) 630.26(1.927)	664.47(3.766) 1210.97(3.703)	1289.53(7.308)
