Benchmarking is crucial for evaluating GPU performance in AI and machine learning. This study measures training speed and scalability across various GPU types to help users choose the best fit for their workloads.
Beyond internal assessments, we compare FPT GPU Cloud’s performance with that of similar vendors to highlight key advantages in processing power, memory bandwidth, and scalability. These insights help customers select the most efficient GPU cloud services for their AI needs.
Check out FPT AI Factory’s fork of the Optimum Habana trainer code. The H100 benchmarks can be reproduced by following the instructions in that repository.
The following benchmarks utilize Habana’s Optimum Habana v1.7 trainer code to evaluate the performance of the NVIDIA HGX H100 and HGX H200 against similar vendors.
H100 results (samples per second): FPT’s Metal Cloud, K8S, DGX, and VM at batch size 54
| Model | 1 GPU | 2 GPUs | 3 GPUs | 4 GPUs | 6 GPUs | 8 GPUs |
| --- | --- | --- | --- | --- | --- | --- |
| Similar Vendor’s H100 80GB SXM | 142.3 | 275 | 400.6 | 521.8 | 740.3 | 962.2 |
| Compared to 1 GPU (times faster) | | 1.93 | 2.82 | 3.67 | 5.20 | 6.76 |
| Metal Cloud – Bare Metal H100 80GB SXM | 144.2 | 283.4 | 418.9 | 550.7 | 799.4 | 1056.3 |
| Compared to Similar Vendor | 101% | 103% | 105% | 106% | 108% | 110% |
| Compared to 1 GPU (times faster) | | 1.97 | 2.91 | 3.82 | 5.54 | 7.33 |
| FPT K8S H100 80GB SXM | 143.8 | 282.4 | 417.0 | 546.7 | 792.8 | 1046.5 |
| Compared to Similar Vendor | 101% | 103% | 104% | 105% | 107% | 109% |
| Compared to 1 GPU (times faster) | | 1.96 | 2.90 | 3.80 | 5.51 | 7.28 |
| DGX H100 80GB SXM | 143.8 | 282.2 | 417.2 | 547.7 | 793.4 | 1047.0 |
| Compared to Similar Vendor | 101% | 103% | 104% | 105% | 107% | 109% |
| Compared to 1 GPU (times faster) | | 1.96 | 2.90 | 3.81 | 5.52 | 7.28 |
| FPT VM H100 80GB SXM (no NVLink) | 143.0 | 261.7 | 376.6 | 459.5 | | |
| Compared to Similar Vendor | 101% | 95% | 94% | 88% | | |
| Compared to 1 GPU (times faster) | | 1.83 | 2.63 | 3.21 | | |
| Compared to Metal Cloud | 99% | 92% | 90% | 83% | | |
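The derived rows in the H100 table can be recomputed from the throughput rows. The following is a minimal sketch, assuming "Compared to 1 GPU" means throughput divided by the same row's 1-GPU throughput and "Compared to Similar Vendor" means throughput as a percentage of the vendor's figure at the same GPU count. The function names are illustrative and not part of any FPT tooling. Because the inputs are the rounded figures from the table, an occasional result may differ from the published value in the last digit.

```python
# Throughput in samples per second, copied from the H100 table (batch size 54).
GPU_COUNTS = [1, 2, 3, 4, 6, 8]
SIMILAR_VENDOR = [142.3, 275.0, 400.6, 521.8, 740.3, 962.2]
METAL_CLOUD = [144.2, 283.4, 418.9, 550.7, 799.4, 1056.3]


def speedup_vs_one_gpu(throughput):
    """Divide each throughput value by the row's 1-GPU throughput."""
    baseline = throughput[0]
    return [round(value / baseline, 2) for value in throughput]


def percent_of_reference(throughput, reference):
    """Express each throughput value as a whole-number percentage of a reference row."""
    return [round(100 * value / ref) for value, ref in zip(throughput, reference)]


if __name__ == "__main__":
    speedups = speedup_vs_one_gpu(METAL_CLOUD)
    percents = percent_of_reference(METAL_CLOUD, SIMILAR_VENDOR)
    for n, s, p in zip(GPU_COUNTS, speedups, percents):
        print(f"{n} GPU(s): {s:.2f}x vs 1 GPU, {p}% of similar vendor")
```

Swapping in the K8S, DGX, or VM row for `METAL_CLOUD` gives the corresponding derived rows; for the VM row, the lists would stop at 4 GPUs.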
H200 results (samples per second): FPT’s Metal Cloud at batch sizes 54, 95, and 110
| Model | 1 GPU | 2 GPUs | 3 GPUs | 4 GPUs | 6 GPUs | 8 GPUs |
| --- | --- | --- | --- | --- | --- | --- |
| Metal Cloud – Bare Metal H200 141GB SXM (bz54) | 158.8 | 312.4 | 460.7 | 600.9 | 881.4 | 1165.1 |
| Compared to Similar Vendor’s H100 | 112% | 114% | 115% | 115% | 119% | 121% |
| Compared to Metal Cloud H100 | 110% | 110% | 110% | 109% | 110% | 110% |
| Compared to Similar Vendor’s Baremetal H200 | 101% | 101% | 102% | 101% | 104% | 105% |
| Compared to 1 GPU (times faster) | | 1.84 | 2.71 | 3.53 | 5.18 | 6.85 |
| Metal Cloud – Bare Metal H200 141GB SXM (bz95) | 169.4 | 332.9 | 489.2 | 649.7 | 917.4 | 1238.1 |
| Compared to Similar Vendor’s H100 | 119% | 121% | 122% | 125% | 124% | 129% |
| Compared to Metal Cloud H100 | 117% | 117% | 117% | 118% | 115% | 117% |
| Compared to Similar Vendor’s Baremetal H200 | 107% | 108% | 108% | 110% | 108% | 112% |
| Compared to 1 GPU (times faster) | | 1.96 | 2.87 | 3.82 | 5.39 | 7.28 |
| Metal Cloud – Bare Metal H200 141GB SXM (bz110) | 173.9 | 341.4 | 505.8 | 651.0 | 973.7 | 1190.0 |
| Compared to Similar Vendor’s H100 | 122% | 124% | 126% | 125% | 132% | 124% |
| Compared to Metal Cloud H100 | 121% | 120% | 121% | 118% | 122% | 113% |
| Compared to Similar Vendor’s Baremetal H200 | 110% | 111% | 112% | 110% | 115% | 107% |
| Compared to 1 GPU (times faster) | | 2.01 | 2.97 | 3.83 | 5.72 | 6.99 |
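To see how the H200 gain over the H100 changes with batch size, the "Compared to Metal Cloud H100" rows can be summarized as a range per batch size. The sketch below assumes the H100 reference is the Metal Cloud row measured at batch size 54, since that is the only H100 batch size reported here, so the bz95 and bz110 comparisons mix batch sizes. The helper name `gain_range` is illustrative.

```python
# Throughput in samples per second at 1, 2, 3, 4, 6, and 8 GPUs, copied from the tables.
H100_BZ54 = [144.2, 283.4, 418.9, 550.7, 799.4, 1056.3]
H200 = {
    54: [158.8, 312.4, 460.7, 600.9, 881.4, 1165.1],
    95: [169.4, 332.9, 489.2, 649.7, 917.4, 1238.1],
    110: [173.9, 341.4, 505.8, 651.0, 973.7, 1190.0],
}


def gain_range(h200_row, h100_row):
    """Return the (min, max) percentage gain of H200 over H100 across GPU counts."""
    gains = [round(100 * (h200 / h100 - 1)) for h200, h100 in zip(h200_row, h100_row)]
    return min(gains), max(gains)


if __name__ == "__main__":
    for batch_size, row in H200.items():
        low, high = gain_range(row, H100_BZ54)
        print(f"batch size {batch_size}: +{low}% to +{high}% over Metal Cloud H100")
```

With the figures above, the gain is 9% to 10% at batch size 54, 15% to 18% at batch size 95, and 13% to 22% at batch size 110.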
FPT AI Factory optimizes GPU performance through advanced infrastructure and software enhancements.
- Metal Cloud delivers the highest H100 throughput at every GPU count, and its lead over the similar vendor widens as GPUs are added, from 101% of the vendor’s throughput at 1 GPU to 110% at 8 GPUs.
- FPT K8S trails Metal Cloud by about 1% (1046.5 vs. 1056.3 samples per second at 8 GPUs) due to additional overhead, but still reaches 109% of the similar vendor’s throughput at 8 GPUs.
- FPT VM (without NVLink) comes within 1% of Metal Cloud on 1 GPU but falls behind as GPUs are added, dropping to 83% of Metal Cloud’s throughput at 4 GPUs, which reinforces NVLink’s role in scaling efficiency.
- Across all configurations, performance scaling is sublinear, with diminishing returns as the number of GPUs increases, though Metal Cloud scales the best among the H100 platforms (7.33× at 8 GPUs).
- The HGX H200, with its larger VRAM (141GB vs. 80GB) and higher memory bandwidth (4.8TB/s vs. 3.35TB/s), enables larger batch sizes. Compared with Metal Cloud H100 at batch size 54, it delivers 15% to 18% more throughput at batch size 95 and up to 22% more at batch size 110.
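The sublinear scaling noted above can be quantified as parallel scaling efficiency: measured throughput divided by the ideal of N times the 1-GPU throughput. The sketch below applies that standard definition to two rows of the H100 table; the function name is illustrative, and the inputs are the rounded published figures.

```python
# Throughput in samples per second, copied from the H100 table (batch size 54).
METAL_CLOUD = {"gpus": [1, 2, 3, 4, 6, 8],
               "throughput": [144.2, 283.4, 418.9, 550.7, 799.4, 1056.3]}
FPT_VM_NO_NVLINK = {"gpus": [1, 2, 3, 4],
                    "throughput": [143.0, 261.7, 376.6, 459.5]}


def scaling_efficiency(throughput, gpu_counts):
    """Return measured throughput as a percentage of ideal linear scaling."""
    baseline = throughput[0]
    return [round(100 * value / (baseline * n), 1)
            for value, n in zip(throughput, gpu_counts)]


if __name__ == "__main__":
    for name, row in [("Metal Cloud", METAL_CLOUD), ("FPT VM (no NVLink)", FPT_VM_NO_NVLINK)]:
        efficiency = scaling_efficiency(row["throughput"], row["gpus"])
        for n, e in zip(row["gpus"], efficiency):
            print(f"{name}, {n} GPU(s): {e}% of linear scaling")
```

On these figures, Metal Cloud retains about 91.6% of linear scaling at 8 GPUs, while the VM without NVLink is already down to about 80.3% at 4 GPUs.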
Learn more about FPT AI Factory’s services HERE.
For more information and consultancy about FPT AI Factory, please contact:
- Hotline: 1900 638 399
- Email: support@fptcloud.com
- Support: m.me/fptsmartcloud
