OpenRadioss HMPP Development Insights

 

Amdahl’s Law

Amdahl’s law is a very famous law in parallel programming. It states that the speed-up S of a parallel algorithm is bound by the sequential fraction of the run time, Tseq (with the total run time normalized to 1):

S(P) = 1 / (Tseq + (1 - Tseq) / P),   hence   S(∞) = 1 / Tseq

To be efficient, the aim is to achieve at least 99% parallelism.

So everything needs to be parallelized efficiently! The table below gives the speed-up as a function of the percentage of parallelism and the number of processors (P):
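
As an illustration, here is a minimal C sketch (not part of OpenRadioss) that computes such a table from Amdahl’s law; the parallel fractions and processor counts are arbitrary examples:

#include <stdio.h>

/* Amdahl's law: S(P) = 1 / (Tseq + (1 - Tseq) / P), with Tseq the
   sequential fraction of the run time (total normalized to 1). */
int main(void)
{
    const double par[]  = { 0.90, 0.99, 0.999 };   /* parallel fraction  */
    const int    proc[] = { 16, 128, 1024 };       /* processor counts P */

    printf("%10s %8s %8s %8s\n", "parallel", "P=16", "P=128", "P=1024");
    for (int i = 0; i < 3; i++) {
        printf("%9.1f%%", 100.0 * par[i]);
        for (int j = 0; j < 3; j++) {
            double tseq = 1.0 - par[i];
            printf(" %8.1f", 1.0 / (tseq + par[i] / proc[j]));
        }
        printf("\n");
    }
    return 0;
}

Even with 99% parallelism, the speed-up saturates around 100 whatever the number of processors, which is why every part of the code has to be parallelized.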

 

Domain Decomposition

Two approaches:

  • Graph Partitioning

    • Based on element connectivities

    • Several popular libraries: Metis, Scotch, …

  • Geometrical decomposition

    • Based on spatial criteria (a sketch of this approach follows the list)

    • Fewer packages available (e.g. Zoltan)
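
A minimal sketch of the geometric idea (illustrative only, not how Zoltan or OpenRadioss implement it): split the elements into two domains at the median of their centroid coordinate along one axis, and recurse to obtain more domains.

#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

int main(void)
{
    /* x-coordinates of 8 element centroids (illustrative values) */
    double x[8] = { 0.1, 0.9, 0.4, 0.6, 0.2, 0.8, 0.3, 0.7 };
    int nelem = 8, part[8];

    /* find the median coordinate */
    double sorted[8];
    for (int i = 0; i < nelem; i++) sorted[i] = x[i];
    qsort(sorted, nelem, sizeof(double), cmp_double);
    double median = sorted[nelem / 2];

    /* elements below the median go to domain 0, the others to domain 1 */
    for (int i = 0; i < nelem; i++) {
        part[i] = (x[i] < median) ? 0 : 1;
        printf("element %d -> domain %d\n", i, part[i]);
    }
    return 0;
}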

Metis

  • Current algorithm used in OpenRadioss

  • Quality & speed of decomposition

  • Possibility to optimize load-balancing and communication

  • Multi-constraint partitioning: load balancing using several criteria (see the sketch after this list)

  • Trade-off between number of constraints and quality of decomposition

  • Static decomposition
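
Below is a minimal sketch of a multi-constraint partitioning call using the METIS 5 C API. The graph (a small 2x3 grid) and the vertex weights are illustrative only and not taken from OpenRadioss:

#include <stdio.h>
#include <metis.h>

int main(void)
{
    /* 2x3 grid graph: 6 vertices, 7 edges, stored in CSR form */
    idx_t nvtxs = 6, ncon = 2, nparts = 2, objval;
    idx_t xadj[7]    = { 0, 2, 5, 7, 9, 12, 14 };
    idx_t adjncy[14] = { 1, 3,  0, 2, 4,  1, 5,  0, 4,  1, 3, 5,  2, 4 };

    /* two balancing constraints per vertex, e.g. element count and an
       estimated element cost (values chosen for the example) */
    idx_t vwgt[12] = { 1, 3,  1, 1,  1, 2,  1, 1,  1, 3,  1, 1 };

    idx_t part[6], options[METIS_NOPTIONS];
    METIS_SetDefaultOptions(options);

    if (METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy, vwgt, NULL, NULL,
                            &nparts, NULL, NULL, options, &objval,
                            part) == METIS_OK)
        for (idx_t i = 0; i < nvtxs; i++)
            printf("vertex %d -> domain %d\n", (int)i, (int)part[i]);
    return 0;
}

Each additional constraint (each column of vwgt) has to be balanced across the domains, which is why adding constraints tends to degrade the edge-cut quality of the decomposition.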

 

 

Domain Splitting

  • General idea

    • Domain decomposition is done by a separate process (Starter)

    • Local data is built before the start of the MPI processes

  • Purposes

    • Compute local entities from the global data and the domain decomposition scheme (with the help of OpenMP parallelization)

    • Local renumbering

      • Local structures are renumbered inside every MPI domain, e.g. nodes from 1 to NUMNOD_L(P)

      • Renumbering is needed to avoid indirections of the type N_L = NUMNOD_LOC(I)

      • Renumbering improves data locality: it is based on the element connectivities of each local domain (a sketch of such a renumbering follows this list)

    • Optimize start-up time and memory consumption of MPI processes

      • Parallel read from every MPI process

      • Memory allocation based on local memory requirement only
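
A minimal sketch of such a local renumbering (illustrative only, not the Starter implementation): nodes are given local numbers 1..NUMNOD_L in the order they first appear in the local element connectivity, which keeps the nodes of neighbouring elements close together in memory.

#include <stdio.h>

/* conn holds the global node ids (1-based) of the domain's elements,
   nnode_per_elem ids per element; loc_of_glob and glob_of_loc are the
   resulting global-to-local and local-to-global maps. */
int renumber_local(const int *conn, int nelem, int nnode_per_elem,
                   int numnod_glob, int *loc_of_glob, int *glob_of_loc)
{
    int numnod_l = 0;
    for (int i = 0; i < numnod_glob; i++) loc_of_glob[i] = 0;  /* 0 = unseen */
    for (int e = 0; e < nelem; e++)
        for (int k = 0; k < nnode_per_elem; k++) {
            int n = conn[e * nnode_per_elem + k];
            if (loc_of_glob[n - 1] == 0) {
                loc_of_glob[n - 1] = ++numnod_l;   /* next local number */
                glob_of_loc[numnod_l - 1] = n;     /* reverse map       */
            }
        }
    return numnod_l;   /* NUMNOD_L for this domain */
}

int main(void)
{
    /* two quads sharing an edge, global nodes 1..6 */
    int conn[8] = { 1, 2, 5, 4,  2, 3, 6, 5 };
    int loc_of_glob[6], glob_of_loc[6];
    int numnod_l = renumber_local(conn, 2, 4, 6, loc_of_glob, glob_of_loc);
    for (int i = 0; i < numnod_l; i++)
        printf("local %d -> global %d\n", i + 1, glob_of_loc[i]);
    return 0;
}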

Radioss Workflow

The general workflow between the Starter and the Engine processes, which allows parallel reading, is described below.
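
As an illustration of the parallel read, each Engine MPI process opens only the file describing its own domain, as written by the Starter. The file name used below is hypothetical; the actual naming convention of the Starter output may differ:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* hypothetical per-domain file written by the Starter */
    char fname[64];
    snprintf(fname, sizeof(fname), "run_%04d.rst", rank + 1);

    /* each process reads only its own domain, so start-up time and
       memory consumption scale with the local domain size */
    FILE *f = fopen(fname, "rb");
    if (f != NULL) {
        /* ... read the local data ... */
        fclose(f);
    } else {
        fprintf(stderr, "rank %d: cannot open %s\n", rank, fname);
    }

    MPI_Finalize();
    return 0;
}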

 

 

General Explicit Algorithm & Communication Scheme

Here is the pseudo-code of an explicit code as it is programmed in OpenRadioss, in the routine engine/source/engine/resol.F of the OpenRadioss/OpenRadioss repository:

While T < Tend do

  Contact_sorting_criteria + Communication

  If (needed) contact sorting with extra Communication

  Compute_contacts + element_forces

  Communication (Forces)

  Assembling

  Kinematic_Conditions with extra Communication

  Compute_I/O + Communication to P0

  If (P0) output

  dt = globmin(dt_local)

  Compute_acceleration

  Integration_V, X

  Communication X, V

  T = T + dt

End do
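
The global time-step computation dt = globmin(dt_local) is a typical reduction; below is a minimal C sketch of how such a step can be done with MPI (not the actual OpenRadioss implementation):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* hypothetical stable time step computed on this domain */
    double dt_local = 1.0e-6 * (rank + 1);
    double dt;

    /* every process receives the global minimum, so all domains
       advance with the same time step */
    MPI_Allreduce(&dt_local, &dt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global dt = %e\n", dt);

    MPI_Finalize();
    return 0;
}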

 

 

Contact Sorting Criteria in Parallel

Contact sorting requires complex communication,

but it is performed only a few times, roughly every 50 to 100 iterations.

MPI Communications

Notion of Parallel Arithmetic

With or Without Parallel Arithmetic
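
In this context, parallel arithmetic refers to the reproducibility of parallel results: floating-point addition is not associative, so the order in which force contributions are assembled, which depends on the domain decomposition, can change the last bits of the result. With parallel arithmetic the assembly order is made independent of the number of domains, so results are reproducible; without it, results may differ slightly from one decomposition to another. A minimal illustration in C (not OpenRadioss code):

#include <stdio.h>

int main(void)
{
    /* four force contributions summed in two different orders */
    double f[4] = { 1.0e16, 1.0, -1.0e16, 1.0 };

    /* order of a 1-domain run */
    double s1 = ((f[0] + f[1]) + f[2]) + f[3];

    /* order that could result from a 2-domain run: each domain sums its
       own contributions, then the partial sums are combined */
    double s2 = (f[0] + f[2]) + (f[1] + f[3]);

    printf("s1 = %g  s2 = %g\n", s1, s2);   /* prints 1 and 2 */
    return 0;
}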

 

 Related articles

OpenRadioss Coding Standards

OpenRadioss Coding Recommendations

OpenRadioss Performance Aspects: Vectorization and Optimization

OpenRadioss Reader (Radioss Block Format)