OpenRadioss HMPP Development Insights


Amdahl's Law

Amdahl's law is a well-known law in parallel programming: the speedup S of a parallel algorithm is bounded by its sequential fraction Tseq (the share of the run time that cannot be parallelized). With P processors:

S(P) = 1 / (Tseq + (1 - Tseq) / P), so S(∞) = 1 / Tseq

To be efficient, the aim is to achieve at least 99% parallelism.

So, everything needs to be parallelized efficiently! The speedup for a given percentage of parallelism and number of processors (P) can be tabulated directly from the formula above, as in the sketch below:
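
As a minimal illustration (not OpenRadioss code), the following C program prints Amdahl's bound S(P) = 1 / ((1 - f) + f / P) for a few assumed parallel fractions f and processor counts P:

#include <stdio.h>

/* Amdahl's bound: speedup on p processors when a fraction f of the work
 * is parallel and (1 - f) is sequential. */
static double amdahl(double f, double p)
{
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void)
{
    const double fractions[] = { 0.90, 0.95, 0.99, 0.999 };  /* parallel fraction f */
    const double procs[]     = { 8.0, 64.0, 512.0, 4096.0 }; /* processor counts P  */

    printf("%10s", "parallel");
    for (size_t j = 0; j < sizeof procs / sizeof procs[0]; j++)
        printf("   P=%-6.0f", procs[j]);
    printf("  P=inf\n");

    for (size_t i = 0; i < sizeof fractions / sizeof fractions[0]; i++) {
        printf("%9.1f%%", 100.0 * fractions[i]);
        for (size_t j = 0; j < sizeof procs / sizeof procs[0]; j++)
            printf("  %8.1f", amdahl(fractions[i], procs[j]));
        printf("  %6.1f\n", 1.0 / (1.0 - fractions[i]));  /* S(inf) = 1 / Tseq */
    }
    return 0;
}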


Domain Decomposition

Two approaches:

  • Graph Partitioning

    • Based on element connectivities

    • Several popular libraries: Metis, Scotch, …

  • Geometrical decomposition

    • Based on spatial criteria (a coordinate-bisection sketch follows this list)

    • Few packages (Zoltan)
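
A hedged sketch of the geometric approach (the underlying idea of coordinate bisection, not the Zoltan API): nodes are split into two domains at the median of one coordinate, and real packages recurse on the sub-domains:

#include <stdlib.h>

/* One step of recursive coordinate bisection (RCB): split the nodes into two
 * domains at the median of their x-coordinates.  Illustrative only; real
 * packages such as Zoltan recurse, choose the longest axis and handle weights. */
static int cmp_double(const void *a, const void *b)
{
    double da = *(const double *)a, db = *(const double *)b;
    return (da > db) - (da < db);
}

void bisect_x(int nnod, const double *x, int *domain /* out: 0 or 1 per node */)
{
    double *sorted = malloc((size_t)nnod * sizeof *sorted);
    for (int i = 0; i < nnod; i++)
        sorted[i] = x[i];
    qsort(sorted, (size_t)nnod, sizeof *sorted, cmp_double);

    double median = sorted[nnod / 2];
    for (int i = 0; i < nnod; i++)
        domain[i] = (x[i] < median) ? 0 : 1;

    free(sorted);
}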

Metis

  • Current algorithm used in OpenRadioss

  • Quality & speed of decomposition

  • Possibility to optimize load-balancing and communication

  • Multi-constraint: load balancing using several criteria (see the sketch after this list)

  • Trade-off between number of constraints and quality of decomposition

  • Static decomposition
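
As a hedged sketch (not the actual Starter code), a multi-constraint METIS 5 partitioning call could look like the following, assuming two weights per element (e.g. CPU cost and contact cost) as an example of "several criteria":

#include <metis.h>

/* Partition the element graph (CSR arrays xadj/adjncy) into nparts domains,
 * balancing ncon = 2 weights per vertex (e.g. element cost and contact cost).
 * Illustrative sketch only; OpenRadioss builds these arrays in the Starter. */
int partition_elements(idx_t nelem, idx_t *xadj, idx_t *adjncy,
                       idx_t *vwgt /* nelem * 2 weights */,
                       idx_t nparts, idx_t *part /* out: domain per element */)
{
    idx_t ncon = 2;                    /* number of balancing constraints */
    idx_t objval;                      /* edge-cut returned by METIS      */
    real_t ubvec[2] = { 1.03, 1.05 };  /* allowed imbalance per constraint */

    idx_t options[METIS_NOPTIONS];
    METIS_SetDefaultOptions(options);
    options[METIS_OPTION_OBJTYPE] = METIS_OBJTYPE_CUT;  /* minimize edge-cut */

    return METIS_PartGraphKway(&nelem, &ncon, xadj, adjncy,
                               vwgt, NULL, NULL,        /* vwgt, vsize, adjwgt */
                               &nparts, NULL, ubvec,    /* tpwgts, ubvec       */
                               options, &objval, part);
}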


Domain Splitting


  • General idea

    • Domain decomposition is done by a separate process (Starter)

    • Local data is built before the start of the MPI processes

  • Purposes

    • Compute local entities from the global data and the domain decomposition scheme (with the help of OpenMP parallelization)

    • Local renumbering

      • Local structures are renumbered inside every MPI domain, e.g. nodes are numbered from 1 to NUMNOD_L(P)

      • Renumbering is needed to avoid indirections of the type N_L = NUMNOD_LOC(I)

      • Renumbering improves data locality: it is based on the element connectivities of each local domain (see the sketch after this list)

    • Optimize start-up time and memory consumption of MPI processes

      • Parallel read from every MPI process

      • Memory allocation based on local memory requirements only
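
A minimal sketch of such a global-to-local renumbering (illustrative only; the names are hypothetical, not the Starter implementation): local node IDs are assigned in the order in which nodes are first met while walking the local element connectivities, which keeps nodes of the same elements close together:

/* Build the local node numbering of one MPI domain from its element
 * connectivities.  Nodes receive local IDs 1..numnod_l in first-visit order,
 * so nodes shared by the same elements get nearby local numbers.
 * Illustrative sketch only: names are hypothetical, not OpenRadioss code. */
int renumber_local_nodes(int nelem_l, int nnode_per_elem,
                         const int *connect,  /* local elements, global node IDs (1-based) */
                         int numnod_g,
                         int *glob2loc,       /* out: size numnod_g, 0 = node not on this domain */
                         int *connect_l)      /* out: connectivities in local numbering */
{
    int numnod_l = 0;

    for (int i = 0; i < numnod_g; i++)
        glob2loc[i] = 0;

    for (int e = 0; e < nelem_l; e++) {
        for (int k = 0; k < nnode_per_elem; k++) {
            int ng = connect[e * nnode_per_elem + k];   /* global node ID */
            if (glob2loc[ng - 1] == 0)                  /* first time this node is met */
                glob2loc[ng - 1] = ++numnod_l;          /* assign next local ID */
            connect_l[e * nnode_per_elem + k] = glob2loc[ng - 1];
        }
    }
    return numnod_l;   /* NUMNOD_L(P) for this domain */
}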

Radioss Workflow

Described here is the general workflow between the Starter and the Engine processes, which enables parallel reading.


General Explicit Algorithm & Communication Scheme

Here is the pseudo-code of an explicit solver as it is programmed in OpenRadioss, in the routine OpenRadioss/engine/source/engine/resol.F of the OpenRadioss/OpenRadioss repository (main branch):

While T < Tend do
  Contact_sorting_criteria + Communication
  If (needed) contact sorting with extra Communication
  Compute_contacts + element forces
  Communication (Forces)
  Assembling
  Kinematic_Conditions with extra Communication
  Compute_I/O + Communication to P0
  If (P0) output
  dt = globmin(dt_local)
  Compute_acceleration
  Integration_V, X
  Communication X, V
  T = T + dt
End do
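
The step dt = globmin(dt_local) maps directly onto an MPI reduction. A minimal sketch in C (the actual OpenRadioss implementation is in Fortran; this only illustrates the communication pattern):

#include <mpi.h>

/* Each MPI domain computes its own stable time step (dt_local) from its
 * local elements; the global time step is the minimum over all domains.
 * Sketch of the "dt = globmin(dt_local)" step of the loop above. */
double globmin_dt(double dt_local, MPI_Comm comm)
{
    double dt_global;
    MPI_Allreduce(&dt_local, &dt_global, 1, MPI_DOUBLE, MPI_MIN, comm);
    return dt_global;
}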


Contact sorting criteria in parallel

Contact sorting requires complex communication, but it is performed only a few times, roughly once every 50 to 100 iterations.
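
A common way to amortize this cost, shown here as a hedged illustration rather than the exact OpenRadioss criterion, is to re-sort only when the accumulated node motion since the last sort could invalidate the candidate lists that were built with a safety margin:

#include <math.h>

/* Decide whether the contact candidate lists must be rebuilt (re-sorted).
 * Hedged illustration: track the largest node displacement since the last
 * sort; while it stays below the safety margin used to build the candidate
 * lists, the old lists are still valid and the expensive (communication-
 * heavy) sort can be skipped. */
int need_contact_resort(int nnod, const double *x, const double *x_at_sort,
                        double margin)
{
    double dmax2 = 0.0;
    for (int i = 0; i < nnod; i++) {
        double dx = x[3*i]   - x_at_sort[3*i];
        double dy = x[3*i+1] - x_at_sort[3*i+1];
        double dz = x[3*i+2] - x_at_sort[3*i+2];
        double d2 = dx*dx + dy*dy + dz*dz;
        if (d2 > dmax2) dmax2 = d2;
    }
    return sqrt(dmax2) > 0.5 * margin;  /* both sides may move: half margin each */
}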

MPI Communications

Notion of parallel arithmetic

With or without parallel arithmetic: with parallel arithmetic, force contributions are summed in a fixed order that does not depend on the domain decomposition, so results are reproducible bit-for-bit whatever the number of MPI domains; without it, the summation order (and therefore the floating-point round-off) changes with the decomposition.
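
The underlying issue is that floating-point addition is not associative, so summing the same force contributions in a decomposition-dependent order changes the round-off. A minimal illustration:

#include <stdio.h>

/* Floating-point addition is not associative: summing the same force
 * contributions in a different order (e.g. because the domain decomposition
 * changed) gives a slightly different result.  Reproducibility in the style
 * of parallel arithmetic is obtained by always summing contributions in the
 * same, decomposition-independent order (e.g. sorted by a global ID). */
int main(void)
{
    float f[3] = { 1.0e8f, 1.0f, -1.0e8f };  /* three force contributions */

    float order_a = (f[0] + f[1]) + f[2];    /* one decomposition's order     */
    float order_b = (f[0] + f[2]) + f[1];    /* another decomposition's order */

    printf("order A: %g\norder B: %g\n", order_a, order_b);
    /* Prints 0 and 1: same data, different round-off.  A fixed global order
     * makes the result independent of how the work was split. */
    return 0;
}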


Related articles

OpenRadioss Coding Standards

OpenRadioss Coding Recommendations

OpenRadioss Performance Aspects: Vectorization and Optimization

OpenRadioss Reader (Radioss Block Format)
