OpenRadioss HMPP Development Insights

 

Amdahl’s Law

Amdahl’s law is a well-known law in parallel programming. It states that the speed-up S of a parallel algorithm is bounded by its sequential fraction Tseq (the share of the run time that cannot be parallelized):

S(P) = 1 / (Tseq + (1 - Tseq) / P),  and therefore  S(∞) = 1 / Tseq

To be efficient, the aim is to achieve at least 99% parallelism.

So, everything needs to be parallelized efficiently! The achievable speed-up depends on the percentage of parallelism and on the number of processors P, as illustrated by the sketch below:
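As an illustration, the short sketch below (in C; the parallel fractions and processor counts are arbitrary example values, not figures from a benchmark) evaluates the formula above:

/* Speed-up predicted by Amdahl's law: S(P) = 1 / (Tseq + (1 - Tseq) / P). */
#include <stdio.h>

static double amdahl(double parallel_fraction, double nproc)
{
    double tseq = 1.0 - parallel_fraction;      /* sequential fraction */
    return 1.0 / (tseq + parallel_fraction / nproc);
}

int main(void)
{
    double fractions[] = { 0.95, 0.99, 0.999 }; /* fraction of parallelism   */
    int    procs[]     = { 16, 128, 1024 };     /* number of processors P    */

    printf("parallel %%      P   speed-up\n");
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            printf("   %6.1f  %5d   %8.1f\n",
                   100.0 * fractions[i], procs[j],
                   amdahl(fractions[i], (double)procs[j]));
    return 0;
}

For instance, at 99% parallelism the speed-up can never exceed 1/0.01 = 100, whatever the number of processors.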

 

Domain Decomposition

Two approaches:

  • Graph Partitioning

    • Based on element connectivities

    • Several popular libraries: Metis, Scotch, …

  • Geometrical decomposition

    • Based on spatial criteria

    • Fewer packages available (e.g. Zoltan)

Metis

  • Current algorithm used in OpenRadioss

  • Quality & speed of decomposition

  • Possibility to optimize load-balancing and communication

  • Multi-constraint partitioning: load-balancing using several criteria

  • Trade-off between number of constraints and quality of decomposition

  • Static decomposition (computed once by the Starter; a minimal partitioning call is sketched below)
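As an illustration of the graph-partitioning step, here is a minimal sketch of a METIS 5 call (this is not OpenRadioss code; the tiny 4-element chain graph, the default options and the choice of METIS_PartGraphKway are assumptions made for the example):

/* Partition a small element graph (4 elements in a row) into 2 domains. */
#include <stdio.h>
#include <metis.h>

int main(void)
{
    idx_t nvtxs = 4;                        /* graph vertices = elements     */
    idx_t ncon  = 1;                        /* number of balance constraints */
    idx_t xadj[]   = {0, 1, 3, 5, 6};       /* CSR adjacency structure       */
    idx_t adjncy[] = {1, 0, 2, 1, 3, 2};    /* neighbours of each element    */
    idx_t nparts = 2;                       /* number of MPI domains         */
    idx_t objval;                           /* edge-cut returned by METIS    */
    idx_t part[4];                          /* resulting domain per element  */

    int rc = METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                                 NULL, NULL, NULL,      /* vwgt, vsize, adjwgt */
                                 &nparts, NULL, NULL, NULL,
                                 &objval, part);
    if (rc != METIS_OK)
        return 1;

    for (int i = 0; i < (int)nvtxs; i++)
        printf("element %d -> domain %d\n", i, (int)part[i]);
    return 0;
}

The ncon argument and the (here NULL) vertex-weight array are where several load-balancing criteria would be supplied for a multi-constraint decomposition.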

 

 

Domain Splitting

  • General idea

    • Domain decomposition is done by a separate process (Starter)

    • Local data is built before the start of the MPI processes

  • Purposes

    • Compute local entities from the global data and the domain decomposition scheme (with the help of OpenMP parallelization)

    • Local renumbering

      • Local structures are renumbered inside every MPI domain, e.g. nodes from 1 to NUMNOD_L(P)

      • Renumbering is needed to avoid indirections of the type N_L = NUMNOD_LOC(I)

      • Renumbering improves data locality: it is based on the element connectivities of each local domain (see the sketch after this list)

    • Optimize start-up time and memory consumption of MPI processes

      • Parallel read from every MPI process

      • Memory allocation based on local memory requirement only
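A minimal sketch of the local renumbering idea (hypothetical data, not the actual Starter code): the global node IDs referenced by the local element connectivities are mapped, in order of first occurrence, to contiguous local IDs 1..NUMNOD_L:

/* Build a global-to-local node numbering for one MPI domain. */
#include <stdio.h>

#define NUMNOD_G 8   /* global number of nodes (assumed)        */
#define NEL_L    2   /* local number of elements on this domain */
#define NNE      4   /* nodes per element (e.g. quad shells)    */

int main(void)
{
    /* local elements expressed with global node IDs (1-based, assumed) */
    int connect_g[NEL_L][NNE] = { {1, 2, 6, 5}, {2, 3, 7, 6} };

    int glob2loc[NUMNOD_G + 1] = {0};   /* 0 means "not on this domain" */
    int connect_l[NEL_L][NNE];
    int numnod_l = 0;

    /* the order of first occurrence defines the local numbering */
    for (int e = 0; e < NEL_L; e++)
        for (int k = 0; k < NNE; k++) {
            int ng = connect_g[e][k];
            if (glob2loc[ng] == 0)
                glob2loc[ng] = ++numnod_l;    /* assign next local ID  */
            connect_l[e][k] = glob2loc[ng];   /* rewrite connectivity  */
        }

    printf("NUMNOD_L = %d\n", numnod_l);
    for (int e = 0; e < NEL_L; e++)
        printf("element %d: %d %d %d %d\n", e + 1,
               connect_l[e][0], connect_l[e][1],
               connect_l[e][2], connect_l[e][3]);
    return 0;
}

After this step the element loops address purely local, contiguous arrays, which removes one level of indirection and improves cache locality.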

Radioss Workflow

The general workflow between the Starter and the Engine enables parallel reading: the Starter performs the domain decomposition and builds the local data of every domain, and each Engine MPI process then reads only its own local data; a minimal sketch follows.
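A minimal sketch of the parallel read step (the one-file-per-domain layout and the file name pattern are hypothetical, not the actual OpenRadioss restart format):

/* Each Engine MPI process reads only the data of its own domain. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char fname[64];
    snprintf(fname, sizeof fname, "restart_domain_%04d.dat", rank + 1);

    FILE *f = fopen(fname, "rb");       /* every rank opens its own file */
    if (f != NULL) {
        /* ... read only the local entities of this domain ... */
        fclose(f);
    } else {
        fprintf(stderr, "rank %d: could not open %s\n", rank, fname);
    }

    MPI_Finalize();
    return 0;
}

Because every process reads and allocates only its local data, both the start-up time and the per-process memory footprint scale with the domain size rather than with the full model.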

 

 

General Explicit Algorithm & Communication Scheme

Below is the pseudo-code of the explicit time-integration loop as programmed in OpenRadioss, in routine OpenRadioss/engine/source/engine/resol.F:

While T < Tend do
  Contact_sorting_criteria + Communication
  If (needed) Contact_sorting with extra Communication
  Compute_contacts + Compute_element_forces
  Communication (Forces)
  Assembling
  Kinematic_Conditions with extra Communication
  Compute_I/O + Communication to P0
  If (P0) Output
  dt = globmin(dt_local)
  Compute_acceleration
  Integration_V, X
  Communication (X, V)
  T = T + dt
End do
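The global time step dt = globmin(dt_local) maps naturally onto an MPI reduction; a minimal sketch (not the actual resol.F code; the local dt values are made up) is:

/* Global time step = minimum of the local stable time steps. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double dt_local = 1.0e-6 * (rank + 1);  /* hypothetical local stable dt */
    double dt;

    /* dt = globmin(dt_local) */
    MPI_Allreduce(&dt_local, &dt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global dt = %e\n", dt);

    MPI_Finalize();
    return 0;
}

With MPI-3, the same reduction can be posted as a non-blocking MPI_Iallreduce so that it overlaps with computation (see the MPI Communications section below).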

 

 

Contact Sorting Criteria in Parallel

Contact sorting requires complex communication, but it is performed only a few times, roughly every 50 to 100 iterations.

MPI Communications

  • Asynchronous communication

  • isend / irecv + waitany (see the sketch after this list)

  • Non-blocking collectives (MPI-3)

  • Hide communication time behind computation time

  • Avoid global synchronization

  • Interaction between MPI & OpenMP

  • In OpenRadioss today, MPI communications are performed outside of OpenMP regions
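A minimal sketch of the isend/irecv + waitany pattern (not OpenRadioss code; the ring of neighbours and the one-value buffers are assumptions made for the example):

/* Exchange one value with two neighbours; process each message as it arrives. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int nb[2] = { (rank - 1 + size) % size,   /* left neighbour  */
                  (rank + 1) % size };        /* right neighbour */

    double sendbuf[2] = { (double)rank, (double)rank };
    double recvbuf[2];
    MPI_Request req[4];

    /* post all receives first, then all sends (all non-blocking) */
    for (int i = 0; i < 2; i++)
        MPI_Irecv(&recvbuf[i], 1, MPI_DOUBLE, nb[i], 0, MPI_COMM_WORLD, &req[i]);
    for (int i = 0; i < 2; i++)
        MPI_Isend(&sendbuf[i], 1, MPI_DOUBLE, nb[i], 0, MPI_COMM_WORLD, &req[2 + i]);

    /* ... useful computation can overlap with the communication here ... */

    /* process the received messages in completion order */
    for (int i = 0; i < 2; i++) {
        int idx;
        MPI_Waitany(2, req, &idx, MPI_STATUS_IGNORE);
        printf("rank %d: received %g from rank %d\n", rank, recvbuf[idx], nb[idx]);
    }
    MPI_Waitall(2, &req[2], MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}

Posting the receives first and handling the messages in completion order is what allows communication time to be hidden behind computation and avoids global synchronization points.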

 

Notion of Parallel Arithmetic

Round-off depends on the order of operations, so in floating-point arithmetic:

F1 + F2 + F3 + F4 ≠ F1 + F4 + F2 + F3

OpenRadioss implements a mechanism, the so-called “Parallel Arithmetic”, to guarantee that summations are performed in a unique order, independent of the MPI grouping and of the OpenMP task scheduling (a sketch of the idea follows).
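A minimal sketch of the underlying idea (not the actual OpenRadioss implementation; the contribution values and the arrival order are made up): contributions to a shared quantity are first buffered per contributing domain, then summed in a fixed, deterministic order:

/* Fixed-order summation: the result does not depend on message arrival order. */
#include <stdio.h>

#define NDOM 4

int main(void)
{
    /* contributions of each domain to one shared nodal force (made up)  */
    double contrib[NDOM] = { 0.1, 1.0e16, -1.0e16, 0.2 };

    /* the order in which the messages arrive may change from run to run */
    int arrival[NDOM] = { 2, 0, 3, 1 };

    /* naive approach: accumulate in arrival order -> result may vary    */
    double f_naive = 0.0;
    for (int k = 0; k < NDOM; k++)
        f_naive += contrib[arrival[k]];

    /* fixed-order approach: store per domain, then sum in domain order  */
    double slot[NDOM];
    for (int k = 0; k < NDOM; k++)
        slot[arrival[k]] = contrib[arrival[k]];   /* store, do not add   */

    double f_fixed = 0.0;
    for (int d = 0; d < NDOM; d++)                /* deterministic order */
        f_fixed += slot[d];

    printf("arrival-order sum: %.17g\n", f_naive);
    printf("fixed-order sum  : %.17g\n", f_fixed);
    return 0;
}

The two printed sums differ, which is exactly the round-off effect described above; only the fixed-order sum is reproducible from run to run.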

With or Without Parallel Arithmetic

Results comparison: results with and without parallel arithmetic on 1 and 4 CPUs. With parallel arithmetic, bitwise reproducibility is guaranteed for any number of MPI processes, any number of OpenMP threads, and any combination of the two.

 

 Related articles

OpenRadioss Coding Standards

OpenRadioss Coding Recommendations

OpenRadioss Performance Aspects: Vectorization and Optimization

OpenRadioss Reader (Radioss Block Format)