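We measured training speed with tf_cnn_benchmarks on 1, 2, 4, and 8 A6000 GPUs, varying the batch size across 64, 128, 256, 512, and 1024. Not every model has a result at every batch size; the tables list the combinations that were measured.
The models are resnet50, inception3, vgg16, nasnet, resnet152, and inception4.
Training speed was measured in both fp16 and fp32.
In the tables, the number in parentheses after each training speed (images/sec) is the speedup relative to the 1-GPU result in the same row.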
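
As a worked example of the parenthesized values, the sketch below recomputes the speedup for one row (resnet50, fp16, batch size 64) from the throughput numbers in the results. The `efficiency` helper (speedup divided by GPU count) is an extra illustration that the original tables do not report, and the function names are hypothetical.

```python
# Throughput (images/sec) for resnet50, fp16, batch size 64,
# copied from the results table. Keys are GPU counts.
throughput = {1: 1061.92, 2: 1969.59, 4: 2794.37, 8: 6274.37}


def speedup(images_per_sec: float, baseline: float) -> float:
    """Speedup relative to the 1-GPU throughput (the value in parentheses)."""
    return images_per_sec / baseline


def efficiency(images_per_sec: float, baseline: float, num_gpus: int) -> float:
    """Scaling efficiency: speedup / GPU count. Not part of the original tables."""
    return speedup(images_per_sec, baseline) / num_gpus


baseline = throughput[1]
for num_gpus, ips in throughput.items():
    print(
        f"{num_gpus} GPU: {ips:8.2f} images/sec, "
        f"speedup {speedup(ips, baseline):.3f}, "
        f"efficiency {efficiency(ips, baseline, num_gpus):.1%}"
    )
```

A speedup close to the GPU count (efficiency near 100%) means near-linear scaling; in this row the 8-GPU speedup of about 5.9 falls noticeably short of 8.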
The hardware used was the HPCDIY-ERMGPU8R4S, configured as follows:
CPU: AMD EPYC Rome 7352 DP/UP 24C/48T 2.3G 128M 155W
Memory: 32GB x 16 = 512GB
SSD: Samsung PM983 7.68TB NVMePCIe3x4 V4TLC 2.5"7mm(1.3 DWPD) x 1
The benchmark software was tf_cnn_benchmarks, and TensorFlow came from the nvcr.io/nvidia/tensorflow:20.12-tf1-py3 container image.
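
The exact command lines are not given here, so the following is only a sketch of how such a sweep could be generated. It assumes the script is invoked as `tf_cnn_benchmarks.py` from the current directory and uses the `--model`, `--batch_size`, `--num_gpus`, and `--use_fp16` flags that tf_cnn_benchmarks provides; any other options used in the actual runs are unknown. Note that `--batch_size` in tf_cnn_benchmarks is a per-device value, and the text does not say whether the batch sizes in the tables are per GPU or total.

```python
import itertools
import shlex

# Sweep parameters taken from the text.
MODELS = ["resnet50", "inception3", "vgg16", "nasnet", "resnet152", "inception4"]
BATCH_SIZES = [64, 128, 256, 512, 1024]
GPU_COUNTS = [1, 2, 4, 8]


def build_command(model, batch_size, num_gpus, use_fp16):
    """Build one benchmark invocation as an argument list.

    The script path is an assumption; adjust it to where the benchmark
    checkout lives inside the container.
    """
    return [
        "python",
        "tf_cnn_benchmarks.py",
        f"--model={model}",
        f"--batch_size={batch_size}",
        f"--num_gpus={num_gpus}",
        f"--use_fp16={use_fp16}",
    ]


def sweep():
    """Yield a command for every precision/model/batch/GPU combination."""
    for use_fp16, model, batch_size, num_gpus in itertools.product(
        [True, False], MODELS, BATCH_SIZES, GPU_COUNTS
    ):
        yield build_command(model, batch_size, num_gpus, use_fp16)


commands = list(sweep())
print(f"{len(commands)} runs in the full sweep")
print(shlex.join(commands[0]))
```

The full cross product is larger than what the tables report: several model and precision combinations have no results at the largest batch sizes, so a real driver script would need to skip or tolerate those runs.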

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
|---|---|---|---|---|---|---|---|
| a6000 | fp16 | resnet50 | 64 | 1061.92 | 1969.59 (1.855) | 2794.37 (2.631) | 6274.37 (5.909) |
| a6000 | fp16 | resnet50 | 128 | 1156.52 | 2181.27 (1.886) | 4176.46 (3.611) | 7302.03 (6.314) |
| a6000 | fp16 | resnet50 | 256 | 1220.76 | 2382.99 (1.952) | 4634.09 (3.796) | 8816.47 (7.222) |
| a6000 | fp16 | resnet50 | 512 | 1242.82 | 2438.71 (1.962) | 4793.78 (3.857) | 9425.05 (7.584) |
| a6000 | fp16 | resnet50 | 1024 | 1253.40 | 2508.13 (2.001) | 4944.17 (3.945) | 9686.33 (7.728) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
|---|---|---|---|---|---|---|---|
| a6000 | fp16 | inception3 | 64 | 784.08 | 1420.48 (1.812) | 2672.08 (3.408) | 4687.96 (5.979) |
| a6000 | fp16 | inception3 | 128 | 858.16 | 1685.34 (1.964) | 3025.44 (3.525) | 5977.10 (6.965) |
| a6000 | fp16 | inception3 | 256 | 911.90 | 1799.25 (1.973) | 3465.63 (3.800) | 6522.25 (7.152) |
| a6000 | fp16 | inception3 | 512 | 899.23 | 1758.40 (1.955) | 3477.81 (3.868) | 6849.92 (7.618) |
| a6000 | fp16 | inception3 | 1024 | 858.57 | 1648.26 (1.920) | 3249.41 (3.785) | 6652.44 (7.748) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
|---|---|---|---|---|---|---|---|
| a6000 | fp16 | vgg16 | 64 | 506.68 | 824.63 (1.628) | 1398.37 (2.760) | 1719.74 (3.394) |
| a6000 | fp16 | vgg16 | 128 | 535.68 | 985.47 (1.840) | 1726.85 (3.224) | 2764.99 (5.162) |
| a6000 | fp16 | vgg16 | 256 | 554.17 | 1056.00 (1.906) | 1967.01 (3.549) | 3651.36 (6.589) |
| a6000 | fp16 | vgg16 | 512 | 540.55 | 1053.56 (1.949) | 2040.88 (3.776) | 3926.31 (7.264) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
|---|---|---|---|---|---|---|---|
| a6000 | fp16 | nasnet | 64 | 326.86 | 527.01 (1.612) | 894.13 (2.736) | 1548.88 (4.739) |
| a6000 | fp16 | nasnet | 128 | 385.16 | 726.34 (1.886) | 1347.12 (3.498) | 2434.67 (6.321) |
| a6000 | fp16 | nasnet | 256 | 411.98 | 788.20 (1.913) | 1531.37 (3.717) | 2822.33 (6.851) |
| a6000 | fp16 | nasnet | 512 | 409.85 | 761.09 (1.857) | 1582.06 (3.860) | 3068.16 (7.486) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
|---|---|---|---|---|---|---|---|
| a6000 | fp16 | resnet152 | 64 | 454.27 | 840.27 (1.850) | 1540.17 (3.390) | 2574.31 (5.667) |
| a6000 | fp16 | resnet152 | 128 | 507.12 | 963.78 (1.900) | 1809.60 (3.568) | 3320.63 (6.548) |
| a6000 | fp16 | resnet152 | 256 | 543.78 | 1054.73 (1.940) | 2035.65 (3.744) | 3868.09 (7.113) |
| a6000 | fp16 | resnet152 | 512 | 559.22 | 1099.23 (1.966) | 2052.58 (3.670) | 4169.76 (7.456) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
|---|---|---|---|---|---|---|---|
| a6000 | fp16 | inception4 | 64 | 379.13 | 723.55 (1.908) | 1343.56 (3.544) | 2538.98 (6.697) |
| a6000 | fp16 | inception4 | 128 | 407.79 | 788.35 (1.933) | 1548.75 (3.798) | 2941.88 (7.214) |
| a6000 | fp16 | inception4 | 256 | 469.59 | 883.94 (1.882) | 1776.91 (3.784) | 3393.76 (7.227) |
| a6000 | fp16 | inception4 | 512 | 473.03 | 930.52 (1.967) | 1844.26 (3.899) | 3602.49 (7.616) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
|---|---|---|---|---|---|---|---|
| a6000 | fp32 | resnet50 | 64 | 477.96 | 910.68 (1.905) | 1715.91 (3.590) | 3167.85 (6.628) |
| a6000 | fp32 | resnet50 | 128 | 506.40 | 993.42 (1.962) | 1939.24 (3.829) | 3697.20 (7.301) |
| a6000 | fp32 | resnet50 | 256 | 520.85 | 1036.15 (1.989) | 2042.98 (3.922) | 3980.10 (7.642) |
| a6000 | fp32 | resnet50 | 512 | 517.09 | 1029.55 (1.991) | 2049.59 (3.964) | 4043.83 (7.820) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
|---|---|---|---|---|---|---|---|
| a6000 | fp32 | inception3 | 64 | 350.54 | 662.44 (1.890) | 1256.14 (3.583) | 2272.07 (6.482) |
| a6000 | fp32 | inception3 | 128 | 378.27 | 742.52 (1.963) | 1461.00 (3.862) | 2749.77 (7.269) |
| a6000 | fp32 | inception3 | 256 | 390.65 | 774.44 (1.982) | 1503.87 (3.850) | 2913.61 (7.458) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
|---|---|---|---|---|---|---|---|
| a6000 | fp32 | vgg16 | 64 | 312.46 | 558.06 (1.786) | 997.03 (3.191) | 1623.46 (5.196) |
| a6000 | fp32 | vgg16 | 128 | 321.10 | 617.09 (1.922) | 1127.00 (3.510) | 1947.73 (6.066) |
| a6000 | fp32 | vgg16 | 256 | 323.93 | 638.49 (1.971) | 1223.70 (3.778) | 2232.47 (6.892) |
| a6000 | fp32 | vgg16 | 512 | 304.09 | 605.69 (1.992) | 1192.28 (3.921) | 2290.20 (7.531) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
|---|---|---|---|---|---|---|---|
| a6000 | fp32 | nasnet | 64 | 313.64 | 561.60 (1.791) | 909.26 (2.899) | 1612.12 (5.140) |
| a6000 | fp32 | nasnet | 128 | 363.06 | 671.66 (1.850) | 1304.10 (3.592) | 2313.57 (6.372) |
| a6000 | fp32 | nasnet | 256 | 382.46 | 740.62 (1.936) | 1439.74 (3.764) | 2672.53 (6.988) |
| a6000 | fp32 | nasnet | 512 | 384.38 | 731.87 (1.904) | 1477.58 (3.844) | 2870.50 (7.468) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
|---|---|---|---|---|---|---|---|
| a6000 | fp32 | resnet152 | 64 | 201.06 | 386.65 (1.923) | 718.98 (3.576) | 1301.79 (6.475) |
| a6000 | fp32 | resnet152 | 128 | 218.29 | 430.75 (1.973) | 836.49 (3.832) | 1574.48 (7.213) |
| a6000 | fp32 | resnet152 | 256 | 227.44 | 450.95 (1.983) | 885.69 (3.894) | 1721.80 (7.570) |

| gpu | precision | model | batch size | images/sec (1 GPU) | images/sec (2 GPUs) | images/sec (4 GPUs) | images/sec (8 GPUs) |
|---|---|---|---|---|---|---|---|
| a6000 | fp32 | inception4 | 64 | 173.69 | 335.13 (1.929) | 646.08 (3.720) | 1198.78 (6.902) |
| a6000 | fp32 | inception4 | 128 | 186.24 | 364.38 (1.957) | 719.10 (3.861) | 1382.76 (7.425) |
| a6000 | fp32 | inception4 | 256 | 186.57 | 353.50 (1.895) | 730.18 (3.914) | 1428.19 (7.655) |
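
Because fp16 and fp32 were measured for the same models, the gain from fp16 can be read off by dividing matching cells. The sketch below does this for the 1-GPU results at batch size 256, with the numbers copied from the tables; the ratio itself is not reported in the original results, and the names `single_gpu_256` and `fp16_gain` are hypothetical.

```python
# 1-GPU throughput (images/sec) at batch size 256, copied from the tables.
# Each value is a (fp16, fp32) pair.
single_gpu_256 = {
    "resnet50": (1220.76, 520.85),
    "inception3": (911.90, 390.65),
    "vgg16": (554.17, 323.93),
    "nasnet": (411.98, 382.46),
    "resnet152": (543.78, 227.44),
    "inception4": (469.59, 186.57),
}


def fp16_gain(fp16_ips: float, fp32_ips: float) -> float:
    """Ratio of fp16 throughput to fp32 throughput for the same configuration."""
    return fp16_ips / fp32_ips


for model, (fp16_ips, fp32_ips) in single_gpu_256.items():
    print(f"{model:<11} fp16/fp32 = {fp16_gain(fp16_ips, fp32_ips):.2f}x")
```

On these numbers most models run a little over twice as fast in fp16, while nasnet gains only a few percent.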