The choice of hardware components and software configuration drives the time to solution and scalability of any application. The variety of codes that run on the DoD High Performance Computing Modernization Program (HPCMP) machines, together with the differences among platforms, forces researchers into a classic economics exercise: how to allocate limited HPC resources to achieve the most in the least amount of time.
Our results below focus on the following codes: Adaptive Mesh Refinement (AMR), Air Vehicles Unstructured Solver (AVUS), CTH, General Atomic and Molecular Electronic Structure System (GAMESS), HYbrid Coordinate Ocean Model (HYCOM), Improved Concurrent Electromagnetic Particle In Cell (ICEPIC), and Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS).
Generally, the results from benchmarking runs can serve two purposes. First, at a fixed CPU count, relative machine performance can be judged on a per-application basis. This is normally done at what is termed a distinguished processor count: 256 CPUs for the standard test case and 1024 CPUs for the large test case.

Second, the times for a particular test case of an application on a particular machine at varying CPU counts make it possible to judge the scalability of the application. This is especially true if the same runs are made across widely disparate types of machines. The timings will certainly differ because of variations in OS, compilers, interconnects, memory layout, and so on. However, if the runs are carried out to a large enough number of CPUs, a pattern should emerge that gives a sense of how well the application scales.

For our purposes here, we focus exclusively on the former, hoping to help steer applications to their most appropriate computational platforms, and we disregard application scalability. Note that the standard test case is built to emulate an average application run, while the large test case reflects a much more stressful computational load. In the matrix summary table below, we use a red-to-green scale to denote each application's relative performance on a machine-by-machine basis, with red highlighting poor performance and green the best relative performance.
| Architecture | AMR Standard | AMR Large | AVUS Large | CTH Standard | CTH Large | GAMESS Standard | GAMESS Large | HYCOM Standard | HYCOM Large | ICEPIC Standard | ICEPIC Large | LAMMPS Standard | LAMMPS Large |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ERDC Diamond | 83 | 89 | 100 | 100 | 89 | 87 | 54 | 87 | 89 | 100 | 100 | 100 | 100 |
| ARL Harold | 78 | 100 | 99 | 93 | 96 | 84 | 53 | 88 | 89 | 92 | 93 | 86 | 80 |
| MHPCC Mana | 100 | 97 | 82 | 74 | 89 | 60 | 38 | 86 | 90 | 93 | 85 | 86 | 75 |
| NAVY Davinci | 75 | 72 | 79 | 84 | 100 | 56 | 31 | 100 | 100 | 44 | 53 | 91 | 99 |
| NAVY Einstein | 55 | 68 | 48 | 48 | 70 | 100 | 100 | 40 | 41 | 40 | 44 | 48 | 47 |
| ERDC Jade | 48 | 59 | 44 | 38 | 64 | 92 | 94 | 35 | 36 | 37 | 40 | 46 | 44 |
| ARL MJM | 43 | 49 | 42 | 39 | 63 | 48 | na | 40 | 46 | 66 | 55 | 69 | 56 |
| ERDC Sapphire | 44 | 56 | 46 | 41 | 66 | 45 | 26 | 35 | 31 | 37 | 42 | 52 | 50 |
| NAVY Babbage | 37 | 40 | 41 | 49 | 47 | 51 | na | 31 | 23 | 23 | 22 | 40 | 36 |
| AFRL Hawk | 49 | 40 | 33 | 26 | 43 | 21 | na | 40 | 36 | 21 | 18 | 23 | 30 |
For more information on how these values are normalized, see the Summary of Application Runtimes below.
| 0-19 | 20-39 | 40-59 | 60-79 | 80-100 |
|---|---|---|---|---|
| DSRC | Name | Decomm Date | Make | Model | Chip Set | Processor Speed (GHz) | Interconnect | Number of Cores | Cores per Node | Operating System |
|---|---|---|---|---|---|---|---|---|---|---|
| ERDC | Diamond | 12/31/2013 | SGI | Altix ICE | Intel Xeon QC | 2.8 | DDR4 InfiniBand | 15,360 | 8 | SUSE Linux |
| ARL | Harold | 12/1/2013 | SGI | Altix ICE | Intel Xeon QC | 2.8 | DDR4 InfiniBand | 10,752 | 8 | SUSE Linux |
| MHPCC | Mana | 08/2013 | Dell | PowerEdge M610 | Intel Xeon QC | 2.8 | DDR InfiniBand | 9216 | 8 | Linux |
| NAVY | DaVinci | 01/2013 | IBM | Power6 | IBM P6 DC | 4.7 | InfiniBand | 4800 | 32 | AIX |
| NAVY | Einstein | 12/2012 | Cray | XT5 | Cray Opteron QC | 2.3 | SeaStar2+ | 12736 | 8 | CNL |
| ERDC | Jade | 06/30/2012 | Cray | XT4 | Cray Opteron QC | 2.1 | SeaStar2 | 8608 | 4 | CNL |
| ARL | MJM | 5/31/2011 | LNX | ATC | Intel Xeon DC | 3 | InfiniBand | 4400 | 4 | SLES 9 |
| ERDC | Sapphire | 03/31/2011 | Cray | XT3 | Cray Opteron DC | 2.6 | SeaStar | 8192 | 2 | CNL |
| NAVY | Babbage | 10/01/2010 | IBM | Power5+ | IBM p575 DC | 1.9 | Federation | 2912 | 16 | AIX |
| AFRL | Hawk | 9/30/2011 | SGI | Altix 4700 | Intel Itanium DC | 1.6 | NUMALink4 | 9216 | 512 | SLES 10 - Pro Pack 5 |
Visit the Center for Computational Sciences and Engineering Web site for information on obtaining or using AMR.
| | Binary | ASCII |
|---|---|---|
| input | 0 | 4 KBytes |
| output | 48 GBytes | 30 MBytes |
| scratch | 0 | 0 |
| | Binary | ASCII |
|---|---|---|
| input | 0 | 4 KBytes |
| output | 65 GBytes | 100 MBytes |
| scratch | 0 | 0 |
Contact the Consolidated Customer Assistance Center for information on obtaining or using AVUS.
| | Binary | ASCII |
|---|---|---|
| input | 4 GBytes | 486 bytes |
| output | 5.1 GBytes | 100 KBytes |
| scratch | 0 | 0 |
All I/O is done by the root process, MPI-0.
Contact the Consolidated Customer Assistance Center for information on obtaining or using CTH.
| | Binary | ASCII |
|---|---|---|
| input | 0 | 10 KBytes |
| output | 2-4 GBytes | 160 KBytes |
| scratch | 2-4 GBytes | 0 |
| | Binary | ASCII |
|---|---|---|
| input | 0 | 10 KBytes |
| output | 2-3 GBytes | 110 KBytes |
| scratch | 2-3 GBytes | 0 |
GAMESS may be obtained at http://www.msg.ameslab.gov/GAMESS/GAMESS.html. Please contact Mike Schmidt at Ames Laboratory (Iowa State University) to get instructions for downloading the correct version.
| | Binary | ASCII |
|---|---|---|
| input | 0 | 22 MBytes |
| output | 20 MBytes | |
| scratch | 9.7 GBytes | |
| | Binary | ASCII |
|---|---|---|
| input | 0 | 15 MBytes |
| output | 0 | 15 MBytes |
| scratch | 353 MBytes | 0 |
Visit the HYCOM Web site for information on obtaining or using HYCOM.
| | Binary | ASCII |
|---|---|---|
| input | 270 MBytes | 7 KBytes |
| output | 10.4 GBytes | 300 KBytes |
| scratch | 4 GBytes | 0 |
| | Binary | ASCII |
|---|---|---|
| input | 270 MBytes | 1.9 GBytes |
| output | 23 GBytes | 300 KBytes |
| scratch | 23 GBytes | 0 |
Contact the Consolidated Customer Assistance Center for information on obtaining or using ICEPIC.
| | Binary | ASCII |
|---|---|---|
| input | 0 | 36,598 bytes |
| output | 105 MBytes | 170 MBytes |
| scratch | 0 | 0 |
| | Binary | ASCII |
|---|---|---|
| input | 0 | 19,048 bytes |
| output | 166 MBytes | 3.26 MBytes |
| scratch | 0 | 0 |
Visit the LAMMPS Web site for information on obtaining or using LAMMPS.
| | Binary | ASCII |
|---|---|---|
| input | N/A | 6.3 MBytes 599 bytes |
| output | N/A | 15 KBytes |
| scratch | N/A | N/A |
| | Binary | ASCII |
|---|---|---|
| input | N/A | 36 KBytes 426 KBytes |
| output | N/A | 21 KBytes |
| scratch | N/A | N/A |
The table below shows the duration (in seconds) of the standard test case run for the AMR (adaptive mesh refinement) application across several of the DSRC machines. The minimum duration is 1069 seconds, recorded on the MHPCC_Mana machine. Because this is the best time across all machines, it receives a normed score of 100, and every other machine is judged relative to it: divide the minimum duration by the duration recorded on each machine, then multiply the result by 100.
| Architecture | Duration | Normalized |
|---|---|---|
| ARL_mjm | 2482 | 43 |
| NAVY_Babbage | 2836 | 37 |
| AFRL_Hawk | 2176 | 49 |
| ERDC_Sapphire | 2418 | 44 |
| ERDC_Jade | 2222 | 48 |
| NAVY_Einstein | 1914 | 55 |
| NAVY_Davinci | 1411 | 75 |
| ARL_Harold | 1361 | 78 |
| MHPCC_Mana | 1069 | 100 |
| ERDC_Diamond | 1283 | 83 |
| Minimum Duration | 1069 | |
normedValue = (minimum duration over all machines / duration on a single machine) * 100
For the table, drop the decimal portion of normedValue, i.e., retain only the integer part of the calculation.
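The normalization described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used to produce the published tables: the function name `normalize` and the dictionary layout are assumptions made for the example, while the durations are copied from the AMR standard test case table above. Integer floor division is used so that only the integer part of each score is retained.

```python
# Durations (seconds) for the AMR standard test case, copied from the table above.
durations = {
    "ARL_mjm": 2482,
    "NAVY_Babbage": 2836,
    "AFRL_Hawk": 2176,
    "ERDC_Sapphire": 2418,
    "ERDC_Jade": 2222,
    "NAVY_Einstein": 1914,
    "NAVY_Davinci": 1411,
    "ARL_Harold": 1361,
    "MHPCC_Mana": 1069,
    "ERDC_Diamond": 1283,
}


def normalize(durations):
    """Score each machine as the integer part of 100 * fastest / duration.

    `normalize` is a hypothetical helper name chosen for this sketch.
    """
    fastest = min(durations.values())
    # Floor division on integers drops the decimal portion without
    # introducing any floating-point rounding.
    return {name: (100 * fastest) // secs for name, secs in durations.items()}


scores = normalize(durations)

# Print machines from best to worst normalized score.
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name:15s} {durations[name]:5d} {scores[name]:4d}")
```

Note that the scores are truncated rather than rounded: NAVY_Babbage, for example, works out to about 37.7 and is reported as 37.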
From the above, for instance, we can see that running the AMR standard test case on AFRL_Hawk takes slightly more than twice as long as our fastest result on MHPCC_Mana. The above calculations are performed for each test case (standard and large) for each code across the DSRC architectures under study and finally summarized in the performance matrix. The results are then binned into one of five categories and highlighted by the associated color for ease of reading. The best results are those closest to 100, while poorer performers score progressively lower.
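The binning step can be sketched as follows, assuming the five categories are the ranges listed in the legend under the performance matrix (0-19, 20-39, 40-59, 60-79, 80-100). The function name `score_bin` is hypothetical, and the sketch returns the range label only, since the text specifies just the red and green ends of the color scale.

```python
def score_bin(score):
    """Return the legend range that an integer normalized score falls into.

    `score_bin` is a hypothetical helper; the ranges are taken from the
    legend shown with the performance matrix.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    # Each bin is 20 wide, except the top bin, which also includes 100.
    low = min(score // 20, 4) * 20
    high = 100 if low == 80 else low + 19
    return f"{low}-{high}"


# Example: a few AMR standard test case scores from the matrix.
for machine, score in [("MHPCC Mana", 100), ("ERDC Diamond", 83), ("AFRL Hawk", 49)]:
    print(f"{machine:13s} {score:3d} -> {score_bin(score)}")
```

Mapping each range to a display color is left out here and would depend on the exact palette used for the matrix.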