
Making the Most of Your Allocation

The choice of hardware components and software configuration drives the time to solution and scalability of any application. The variety of codes that run on the DoD High Performance Computing Modernization Program (HPCMP) machines and the differences in platforms force researchers into a classic economics exercise: how to allocate limited HPC resources to achieve the most in the least amount of time.

Our results below focus on the following codes: Adaptive Mesh Refinement (AMR), Air Vehicles Unstructured Solver (AVUS), CTH, General Atomic and Molecular Electronic Structure System (GAMESS), HYbrid Coordinate Ocean Model (HYCOM), Improved Concurrent Electromagnetic Particle In Cell (ICEPIC), and Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS).

Generally, the results from benchmarking runs can serve two purposes. First, at a fixed CPU count, relative machine performance can be judged on a per-application basis. This is normally done at what is termed a distinguished processor count: 256 CPUs for the standard test case and 1024 CPUs for the large test case. Second, the times for a particular test case of an application on a particular machine at varying CPU counts make it possible to judge the scalability of the application. This is especially true if the same runs are made across widely disparate types of machines. The timings will certainly differ because of variations in OS, compilers, interconnects, memory layout, etc. However, if the runs are carried out to a large enough number of CPUs, a pattern should emerge that gives a sense of how well the application scales.

For our purposes here, we focus exclusively on the former, with the aim of steering applications to their most appropriate computational platforms; application scalability is not considered. Note that the standard test case is built to emulate an average application run, while the large test case reflects a much more stressful computational load. In the matrix summary table below, a red-to-green scale denotes each application's relative performance on a machine-by-machine basis, with red marking the poorest relative performance and green the best.

Matrix Summary of all Application Runtimes
Architecture    AMR       AMR    AVUS   CTH       CTH    GAMESS    GAMESS  HYCOM     HYCOM  ICEPIC    ICEPIC  LAMMPS    LAMMPS
                Standard  Large  Large  Standard  Large  Standard  Large   Standard  Large  Standard  Large   Standard  Large
ERDC Diamond    83        89     100    100       89     87        54      87        89     100       100     100       100
ARL Harold      78        100    99     93        96     84        53      88        89     92        93      86        80
MHPCC Mana      100       97     82     74        89     60        38      86        90     93        85      86        75
NAVY Davinci    75        72     79     84        100    56        31      100       100    44        53      91        99
NAVY Einstein   55        68     48     48        70     100       100     40        41     40        44      48        47
ERDC Jade       48        59     44     38        64     92        94      35        36     37        40      46        44
ARL MJM         43        49     42     39        63     48        na      40        46     66        55      69        56
ERDC Sapphire   44        56     46     41        66     45        26      35        31     37        42      52        50
NAVY Babbage    37        40     41     49        47     51        na      31        23     23        22      40        36
AFRL Hawk       49        40     33     26        43     21        na      40        36     21        18      23        30

For more information on how these values are normalized, see Normalization Calculation for the Performance Matrix below.

Color scale (red to green): 0-19, 20-39, 40-59, 60-79, 80-100

Architectures Used in Study
DSRC   Name      Decomm Date  Make  Model           Chip Set          Processor Speed (GHz)  Interconnect     Number of Cores  Cores per Node  Operating System
ERDC   Diamond   12/31/2013   SGI   Altix ICE       Intel Xeon QC     2.8                    DDR4 InfiniBand  15,360           8               SUSE Linux
ARL    Harold    12/1/2013    SGI   Altix ICE       Intel Xeon QC     2.8                    DDR4 InfiniBand  10,752           8               SUSE Linux
MHPCC  Mana      08/2013      Dell  PowerEdge M610  Intel Xeon QC     2.8                    DDR InfiniBand   9,216            8               Linux
NAVY   DaVinci   01/2013      IBM   Power6          IBM P6 DC         4.7                    InfiniBand       4,800            32              AIX
NAVY   Einstein  12/2012      Cray  XT5             Cray Opteron QC   2.3                    SeaStar2+        12,736           8               CNL
ERDC   Jade      06/30/2012   Cray  XT4             Cray Opteron QC   2.1                    SeaStar2         8,608            4               CNL
ARL    MJM       5/31/2011    LNX   ATC             Intel Xeon DC     3                      InfiniBand       4,400            4               SLES 9
ERDC   Sapphire  03/31/2011   Cray  XT3             Cray Opteron DC   2.6                    SeaStar          8,192            2               CNL
NAVY   Babbage   10/01/2010   IBM   Power5+         IBM p575 DC       1.9                    Federation       2,912            16              AIX
AFRL   Hawk      9/30/2011    SGI   Altix 4700      Intel Itanium DC  1.6                    NUMALink4        9,216            512             SLES 10 - Pro Pack 5

Adaptive Mesh Refinement (AMR)

Description

  • CTA: CFD
  • Solves hyperbolic equations on adaptive, structured grids.
  • Parallelization via MPI
  • Mixture of about 80% C++ and 20% FORTRAN.
  • Stand alone code except for MPI libraries

Visit the Center for Computational Sciences and Engineering Web site for information on obtaining or using AMR.

Standard Case

bubble
  • Simulates a gasdynamic shock contacting a helium bubble
    - Important problem for combustion research
  • 2 levels of refinement allowed.
  • 80 time steps representing about 0.4 seconds.
  • Problem size is very similar to production work loads. A real production problem would run for thousands of time steps.

Standard Case I/O
         Binary      ASCII
input    0           4 KBytes
output   48 GBytes   30 MBytes
scratch  0           0

Large Case

  • Simulates a gasdynamic shock contacting a helium bubble
    - Important problem for combustion research
  • 4 levels of refinement allowed.
  • 40 time steps representing about half a second.
  • Problem size is very similar to production work loads. A real production problem would run for thousands of time steps.

Large Case I/O
         Binary      ASCII
input    0           4 KBytes
output   65 GBytes   100 MBytes
scratch  0           0


AVUS

Description

  • CFD code, formerly COBALT_60
  • Simulates 3-D turbulent viscous flow over irregular geometries
  • Grid-based, reads a large grid file
  • AVUS: 29,000 lines of Fortran 90 code
  • Uses ParMETIS: 12,000 lines of C code
  • Parallelism via MPI, no OpenMP
  • Runs on Cray XT, IBM Power, SGI Altix, Linux clusters

Contact the Consolidated Customer Assistance Center for information on obtaining or using AVUS.

Large Case

  • Waverider: models a supersonic or hypersonic vehicle "riding" a shock wave that forms below the vehicle.
  • Uses 31,080,000 cells and runs for 400 time steps

Large Case I/O
         Binary      ASCII
input    4 GBytes    486 bytes
output   5.1 GBytes  100 KBytes
scratch  0           0

All I/O is done by root process MPI-0


CTH

Description

  • CTA: CSM
  • Shock Physics
  • Two-step, 2nd order accurate Eulerian algorithm is used to solve the mass, momentum, and energy conservation equations.
  • This is an explicit approach that does not require solving a linear system.
  • Has both static and adaptive mesh capabilities
  • Parallelism via MPI
  • 900,000 lines of code: 58% Fortran and 42% C
  • Uses NetCDF, supplied with distribution

Contact the Consolidated Customer Assistance Center for information on obtaining or using CTH.

Standard Case

  • The model describes a long rod made of ten materials impacting a plate made of eight materials at an oblique angle of 73.5 degrees.
  • rod length = 7.67 cm, rod diameter = 0.767 cm
  • rod velocity = 1210 m/s
  • rod subdivided through length into ten materials
  • plate thickness = 0.64 cm, plate velocity = 217 m/s
  • plate subdivided through thickness into 8 materials
  • AMR (Adaptive Mesh Refinement)
  • Maximum of six levels of refinement
  • 350 time steps
  • Uses 1 GByte memory per process regardless of the number of processes used

Standard Case I/O
         Binary      ASCII
input    0           10 KBytes
output   2-4 GBytes  160 KBytes
scratch  2-4 GBytes  0

Large Case

  • Same model as the Standard case but using a static grid and domain decomposition.
  • Memory requirement per process decreases with increasing processor count.
  • 300 time steps
  • The Standard and Large test cases write large (~10 MBytes) restart files for each MPI process at the start of the execution and again at the end.

Large Case I/O
         Binary      ASCII
input    0           10 KBytes
output   2-3 GBytes  110 KBytes
scratch  2-3 GBytes  0


GAMESS

Description

  • CTA: Computational Biology, Chemistry and Materials Science (CCM)
  • Ab Initio Quantum chemistry
  • Computes many energy integrals with molecular data in form of atom positions and electron orbitals.
  • The communication layer depends on the platform: LAPI, sockets, SHMEM, or MPI.
  • 99% FORTRAN, 1% C

GAMESS may be obtained at http://www.msg.ameslab.gov/GAMESS/GAMESS.html. Please contact Mike Schmidt at Ames Laboratory (Iowa State University) to get instructions for downloading the correct version.

Standard Case

  • Performs a DFT computation to find the nuclear gradient vector of a "POSS" molecule using a Restricted Hartree-Fock calculation with self-consistent field wave functions
  • Similar to actual workload
  • Standard case memory usage 800 MBytes

Standard Case I/O
Binary ASCII
input 0 22 MBytes
output 20 MBytes
scratch 9.7 GBytes

Large Case

  • Performs an MP2 computation that finds the nuclear gradient vector of a "BC4" molecule using a Restricted Hartree-Fock calculation with self-consistent field wave functions
  • Stressful workload
  • 1.8 GBytes memory usage

Large Case I/O
         Binary      ASCII
input    0           15 MBytes
output   0           15 MBytes
scratch  353 MBytes  0

In layman's terms...

  • The Standard test case is a sequence of iterations consisting primarily of three steps.
    1. Compute approximately N^4 two-electron integrals
    2. Diagonalize a dense NxN matrix
    3. Evaluate the gradient vector, which involves evaluating two-electron integral gradients (similar to what is done in step 1)
  • Here N is the number of basis functions in the calculation. Note that steps 1 and 2 above are repeated until convergence is achieved, after which step 3 is done.
  • The Large test case begins in much the same way, with steps 1 and 2 iterated to convergence. This is followed by a rather complicated integral transformation step, which has an O(N^5) computational requirement.
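
To give a feel for what these exponents mean in practice, the short Python sketch below computes how much the work of each step grows when the number of basis functions N increases. It is only an illustration of the stated scaling: the function name `relative_cost` and the dictionary `STEP_EXPONENTS` are hypothetical, constant factors are ignored, and nothing here is taken from GAMESS itself.

```python
# Illustrative only: growth in work when N (number of basis functions)
# increases, using the exponents quoted in the text above.
# Names are hypothetical; constant factors are ignored.

STEP_EXPONENTS = {
    "two-electron integrals (about N^4)": 4,
    "integral transformation, Large case (O(N^5))": 5,
}


def relative_cost(size_ratio, exponent):
    """Return how many times more work a step needs when N grows by size_ratio."""
    return size_ratio ** exponent


for ratio in (2, 3):
    for step, exponent in STEP_EXPONENTS.items():
        growth = relative_cost(ratio, exponent)
        print(f"N x{ratio}: {step} needs about {growth}x the work")
```

Doubling N therefore multiplies the integral work by 16 and the transformation work by 32, which is one reason the Large case is the more stressful of the two.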


HYCOM

Description

  • CTA: CWO
  • A primitive equation ocean general circulation model
  • Communication is MPI (MPI-2 is available)
  • 100% FORTRAN
  • Version 2.2

Visit the HYCOM Web site for information on obtaining or using HYCOM.

Standard Case

  • A 26-layer 1/4 degree global model that simulates 1 day
  • 2160 time steps
  • Requires about 0.75 GByte of memory per processor and about 4 GBytes of globally accessible scratch disk

Standard Case I/O
         Binary       ASCII
input    270 MBytes   7 KBytes
output   10.4 GBytes  300 KBytes
scratch  4 GBytes     0

Large Case

  • A 26-layer 1/12 degree global model that simulates 3/4 day
  • 864 time steps
  • Requires about 0.9 GByte of memory per processor and about 23 GBytes of globally accessible scratch disk

Large Case I/O
         Binary      ASCII
input    270 MBytes  1.9 GBytes
output   23 GBytes   300 KBytes
scratch  23 GBytes   0


ICEPIC

Description

  • CTA: Computational Electromagnetics and Acoustics (CEA)
  • Particle-in-cell plasma physics code
  • Ions and electrons move under influence of electromagnetic fields
  • Fields calculated on a structured, static grid according to Maxwell's Equations
  • Can simulate plasmas contained in complex geometries
  • Used in electromagnetic device design
  • Current svn version 4016 has 334,260 lines of code, 100% C++, C

Contact the Consolidated Customer Assistance Center for information on obtaining or using ICEPIC.

Standard Case

  • Simulation of a magnetron for a high power microwave source during startup
  • Features many transients with particles representing electrons being created and moving in a grid-free way throughout the domain
  • 7267 time steps simulate 10 nanoseconds

Standard Case I/O
         Binary      ASCII
input    0           36,598 bytes
output   105 MBytes  170 MBytes
scratch  0           0

Large Case

  • A large simulation of the gyrotron source of the airborne version of the active denial system (ADS)
  • 3000 time steps simulate roughly 355 picoseconds

Large Case I/O
         Binary      ASCII
input    0           19,048 bytes
output   166 MBytes  3.26 MBytes
scratch  0           0


LAMMPS

Description

  • CTA: Computational Biology, Chemistry, and Materials Science (CCM)
  • Classical molecular dynamics code that models particles in a liquid, solid, or gaseous state
  • Calculates atomic velocities, positions, system energy, and temperature
    • After equilibration: surface tension, radial pressure, and phase change
    • Post-processing: pair-correlation function and diffusion coefficients
  • All actions occur within box (usually orthogonal)
  • Distributed-memory message-passing parallelism (MPI)
  • Highly-portable C++
  • Libraries needed:
    • MPI
    • Single-processor FFT

Visit the LAMMPS Web site for information on obtaining or using LAMMPS.

Standard Case

  • Adapted from the LAMMPS Rhodopsin benchmark model
    - simulates an all-atom Rhodopsin protein in a solvated lipid bi-layer
  • Input file
    • Contains molecular topology, force parameters, and initial positions for 32,000 atoms
    • The computational domain is replicated 7 times in each dimension
      • After replication: 10,976,000 atoms; 9,508,989 bonds; 13,880,181 angles; 19,497,149 dihedrals; and 361,522 impropers within an orthogonal box [-27.5, -38.5, -36.3646] to [27.5, 38.5, 36.3615]
    • 1500 time steps, with each representing 2 femtoseconds
  • This case requires an FFT library
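
The problem size quoted above follows directly from the replication factor, as this small Python check shows. It is a sketch using only the figures listed for this case; the variable names are hypothetical and it does not read or generate any LAMMPS input.

```python
# Illustrative arithmetic for the Standard case, using figures from the list above.
# Variable names are hypothetical.

atoms_in_input = 32_000      # atoms described by the input file
replicas_per_dimension = 7   # domain replicated 7 times in x, y, and z

# Replicating in three dimensions multiplies the atom count by 7 * 7 * 7.
total_atoms = atoms_in_input * replicas_per_dimension ** 3

time_steps = 1500
femtoseconds_per_step = 2
simulated_picoseconds = time_steps * femtoseconds_per_step / 1000

print(f"atoms after replication: {total_atoms:,}")
print(f"simulated time: {simulated_picoseconds} ps")
```

The result matches the 10,976,000 atoms stated for the replicated system, and the 1500 steps cover 3 picoseconds of simulated time.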

Standard Case I/O
         Binary  ASCII
input    N/A     6.3 MBytes, 599 bytes
output   N/A     15 KBytes
scratch  N/A     N/A

Large Case

  • Adapted from the LAMMPS Embedded Atom Model (EAM) benchmark simulating a copper metallic solid
  • Input file
    • Contains specifications for the inter-atomic potentials
    • Creates 108 M atoms within orthogonal box [0,0,0] to [1084.5, 1084.5, 1084.5]
    • 2500 time steps, with each representing 0.005 femtoseconds

Large Case I/O
         Binary  ASCII
input    N/A     36 KBytes, 426 KBytes
output   N/A     21 KBytes
scratch  N/A     N/A


Normalization Calculation for the Performance Matrix

The table below shows the duration (in seconds) of the standard test case run for the AMR (Adaptive Mesh Refinement) application across several of the DSRC machines. The minimum duration, 1069 seconds, occurs on the MHPCC_Mana machine. Because this is the best time across all machines, it receives a normalized score of 100, and every other machine is judged relative to it. To compute each machine's score, divide the minimum duration by the duration recorded on that machine, then multiply the result by 100.

An Example Using the AMR Standard Case
Architecture    Duration  Normalized
ARL_mjm         2482      43
NAVY_Babbage    2836      37
AFRL_Hawk       2176      49
ERDC_Sapphire   2418      44
ERDC_Jade       2222      48
NAVY_Einstein   1914      55
NAVY_Davinci    1411      75
ARL_Harold      1361      78
MHPCC_Mana      1069      100
ERDC_Diamond    1283      83
Minimum Duration: 1069


In summary, the function is:

normedValue = (minimum duration over all machines / duration on a single machine) * 100

For the table, drop the decimal portion of normedValue, i.e., retain only the integer part.
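
The calculation can be sketched in a few lines of Python. This is a minimal illustration, not the tooling actually used to build the matrix: the function name `normalize` and the dictionary layout are assumptions, and the durations are copied from the AMR standard case example table. Integer division is used so that the decimal portion is dropped without floating-point rounding.

```python
# Minimal sketch of the normalization described above.
# Durations (seconds) are copied from the AMR standard case example table;
# the function and variable names are hypothetical.

durations = {
    "ARL_mjm": 2482,
    "NAVY_Babbage": 2836,
    "AFRL_Hawk": 2176,
    "ERDC_Sapphire": 2418,
    "ERDC_Jade": 2222,
    "NAVY_Einstein": 1914,
    "NAVY_Davinci": 1411,
    "ARL_Harold": 1361,
    "MHPCC_Mana": 1069,
    "ERDC_Diamond": 1283,
}


def normalize(durations_by_machine):
    """Score each machine against the fastest one, keeping only the integer part."""
    fastest = min(durations_by_machine.values())
    # (fastest / duration) * 100, truncated; done in integer arithmetic.
    return {
        machine: (fastest * 100) // duration
        for machine, duration in durations_by_machine.items()
    }


scores = normalize(durations)
for machine, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{machine:15s} {durations[machine]:5d} {score:4d}")
```

Running this reproduces the Normalized column of the example table, with MHPCC_Mana at 100.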

From the above, for instance, we can see that running the AMR standard test case on AFRL_Hawk takes slightly more than twice as long as our fastest result on MHPCC_Mana. These calculations are performed for each test case (standard and large) of each code across the DSRC architectures under study and are summarized in the performance matrix. The results are then binned into one of five categories and highlighted with the associated color for ease of reading. The best results are those closest to 100; the further a score falls below 100, the poorer the relative performance.
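
The binning step can be illustrated in the same way. The bin boundaries are those listed in the legend beneath the matrix (0-19, 20-39, 40-59, 60-79, 80-100); the names `BIN_LABELS` and `score_bin` are hypothetical. The exact color assigned to each intermediate bin is not stated here, so this sketch returns only the bin label, with the lowest bin at the red end of the scale and the highest at the green end.

```python
# Minimal sketch of binning a normalized score into one of five categories.
# Bin ranges come from the matrix legend; names are hypothetical.

BIN_LABELS = ["0-19", "20-39", "40-59", "60-79", "80-100"]


def score_bin(score):
    """Return the legend bin that contains an integer score from 0 to 100."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    # Each bin is 20 wide, except the last, which also includes 100.
    index = min(score // 20, len(BIN_LABELS) - 1)
    return BIN_LABELS[index]


for machine, score in [("MHPCC_Mana", 100), ("ERDC_Diamond", 83),
                       ("AFRL_Hawk", 49), ("NAVY_Babbage", 37)]:
    print(f"{machine:15s} {score:4d}  bin {score_bin(score)}")
```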