OpenRadioss HMPP Development Insights

 

Amdahl’s Law

Amdahl’s law is a well-known law in parallel programming. It states that the speed-up S of a parallel algorithm is bounded by its sequential fraction Tseq (the share of the run time that cannot be parallelized):

S(P) = 1 / (Tseq + (1 - Tseq) / P),  and therefore  S(∞) = 1 / Tseq

To be efficient, the aim is to achieve at least 99% parallelism.

So, everything needs to be parallelized efficiently! The achievable speed-up depends on the percentage of parallelism and on the number of processors P, as illustrated by the sketch below:
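As an illustration, the short sketch below (in C; the parallel fractions and processor counts are arbitrary example values, not figures from a benchmark) evaluates the formula above:

/* Speed-up predicted by Amdahl's law: S(P) = 1 / (Tseq + (1 - Tseq) / P). */
#include <stdio.h>

static double amdahl(double parallel_fraction, double nproc)
{
    double tseq = 1.0 - parallel_fraction;      /* sequential fraction */
    return 1.0 / (tseq + parallel_fraction / nproc);
}

int main(void)
{
    double fractions[] = { 0.95, 0.99, 0.999 }; /* fraction of parallelism   */
    int    procs[]     = { 16, 128, 1024 };     /* number of processors P    */

    printf("parallel %%      P   speed-up\n");
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            printf("   %6.1f  %5d   %8.1f\n",
                   100.0 * fractions[i], procs[j],
                   amdahl(fractions[i], (double)procs[j]));
    return 0;
}

For instance, at 99% parallelism the speed-up can never exceed 1/0.01 = 100, whatever the number of processors.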

 

Domain Decomposition

Two approaches:

  • Graph Partitioning

    • Based on element connectivities

    • Several popular libraries: Metis, Scotch, …

  • Geometrical decomposition

    • Based on spatial criteria

    • Fewer packages available (e.g. Zoltan)

Metis

  • Current algorithm used in OpenRadioss

  • Quality & speed of decomposition

  • Possibility to optimize load-balancing and communication

  • Multi-constraint partitioning: load-balancing using several criteria

  • Trade-off between number of constraints and quality of decomposition

  • Static decomposition (computed once by the Starter; a minimal partitioning call is sketched below)
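As an illustration of the graph-partitioning step, here is a minimal sketch of a METIS 5 call (this is not OpenRadioss code; the tiny 4-element chain graph, the default options and the choice of METIS_PartGraphKway are assumptions made for the example):

/* Partition a small element graph (4 elements in a row) into 2 domains. */
#include <stdio.h>
#include <metis.h>

int main(void)
{
    idx_t nvtxs = 4;                        /* graph vertices = elements     */
    idx_t ncon  = 1;                        /* number of balance constraints */
    idx_t xadj[]   = {0, 1, 3, 5, 6};       /* CSR adjacency structure       */
    idx_t adjncy[] = {1, 0, 2, 1, 3, 2};    /* neighbours of each element    */
    idx_t nparts = 2;                       /* number of MPI domains         */
    idx_t objval;                           /* edge-cut returned by METIS    */
    idx_t part[4];                          /* resulting domain per element  */

    int rc = METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                                 NULL, NULL, NULL,      /* vwgt, vsize, adjwgt */
                                 &nparts, NULL, NULL, NULL,
                                 &objval, part);
    if (rc != METIS_OK)
        return 1;

    for (int i = 0; i < (int)nvtxs; i++)
        printf("element %d -> domain %d\n", i, (int)part[i]);
    return 0;
}

The ncon argument and the (here NULL) vertex-weight array are where several load-balancing criteria would be supplied for a multi-constraint decomposition.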

 

 

Domain Splitting

  • General idea

    • Domain decomposition is done by a separate process (Starter)

    • Local data is built before the start of the MPI processes

  • Purposes

    • Compute local entities from the global data and the domain decomposition scheme (with the help of OpenMP parallelization)

    • Local renumbering

      • Local structures are renumbered inside every MPI domain, e.g. nodes from 1 to NUMNOD_L(P)

      • Renumbering is needed to avoid indirections of the type N_L = NUMNOD_LOC(I)

      • Renumbering improves data locality: it is based on the element connectivities of each local domain (see the sketch after this list)

    • Optimize start-up time and memory consumption of MPI processes

      • Parallel read from every MPI process

      • Memory allocation based on local memory requirement only
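A minimal sketch of the local renumbering idea (hypothetical data, not the actual Starter code): the global node IDs referenced by the local element connectivities are mapped, in order of first occurrence, to contiguous local IDs 1..NUMNOD_L:

/* Build a global-to-local node numbering for one MPI domain. */
#include <stdio.h>

#define NUMNOD_G 8   /* global number of nodes (assumed)        */
#define NEL_L    2   /* local number of elements on this domain */
#define NNE      4   /* nodes per element (e.g. quad shells)    */

int main(void)
{
    /* local elements expressed with global node IDs (1-based, assumed) */
    int connect_g[NEL_L][NNE] = { {1, 2, 6, 5}, {2, 3, 7, 6} };

    int glob2loc[NUMNOD_G + 1] = {0};   /* 0 means "not on this domain" */
    int connect_l[NEL_L][NNE];
    int numnod_l = 0;

    /* the order of first occurrence defines the local numbering */
    for (int e = 0; e < NEL_L; e++)
        for (int k = 0; k < NNE; k++) {
            int ng = connect_g[e][k];
            if (glob2loc[ng] == 0)
                glob2loc[ng] = ++numnod_l;    /* assign next local ID  */
            connect_l[e][k] = glob2loc[ng];   /* rewrite connectivity  */
        }

    printf("NUMNOD_L = %d\n", numnod_l);
    for (int e = 0; e < NEL_L; e++)
        printf("element %d: %d %d %d %d\n", e + 1,
               connect_l[e][0], connect_l[e][1],
               connect_l[e][2], connect_l[e][3]);
    return 0;
}

After this step the element loops address purely local, contiguous arrays, which removes one level of indirection and improves cache locality.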

Radioss Workflow

The general workflow between the Starter and the Engine enables parallel reading: the Starter performs the domain decomposition and builds the local data of every domain, and each Engine MPI process then reads only its own local data; a minimal sketch follows.
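A minimal sketch of the parallel read step (the one-file-per-domain layout and the file name pattern are hypothetical, not the actual OpenRadioss restart format):

/* Each Engine MPI process reads only the data of its own domain. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char fname[64];
    snprintf(fname, sizeof fname, "restart_domain_%04d.dat", rank + 1);

    FILE *f = fopen(fname, "rb");       /* every rank opens its own file */
    if (f != NULL) {
        /* ... read only the local entities of this domain ... */
        fclose(f);
    } else {
        fprintf(stderr, "rank %d: could not open %s\n", rank, fname);
    }

    MPI_Finalize();
    return 0;
}

Because every process reads and allocates only its local data, both the start-up time and the per-process memory footprint scale with the domain size rather than with the full model.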

 

 

General Explicit Algorithm & Communication Scheme

Below is the pseudo-code of the explicit time-integration loop as programmed in OpenRadioss, in routine OpenRadioss/engine/source/engine/resol.F:

While T < Tend do
  Contact_sorting_criteria + Communication
  If (needed) Contact_sorting with extra Communication
  Compute_contacts + Compute_element_forces
  Communication (Forces)
  Assembling
  Kinematic_Conditions with extra Communication
  Compute_I/O + Communication to P0
  If (P0) Output
  dt = globmin(dt_local)
  Compute_acceleration
  Integration_V, X
  Communication (X, V)
  T = T + dt
End do
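The global time step dt = globmin(dt_local) maps naturally onto an MPI reduction; a minimal sketch (not the actual resol.F code; the local dt values are made up) is:

/* Global time step = minimum of the local stable time steps. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double dt_local = 1.0e-6 * (rank + 1);  /* hypothetical local stable dt */
    double dt;

    /* dt = globmin(dt_local) */
    MPI_Allreduce(&dt_local, &dt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global dt = %e\n", dt);

    MPI_Finalize();
    return 0;
}

With MPI-3, the same reduction can be posted as a non-blocking MPI_Iallreduce so that it overlaps with computation (see the MPI Communications section below).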

 

 

Contact Sorting Criteria in Parallel

Contact sorting requires complex communication, but it is performed only a few times, roughly every 50 to 100 iterations.

MPI Communications

  • Asynchronous communication

  • isend / irecv + waitany (see the sketch after this list)

  • Non-blocking collectives (MPI-3)

  • Hide communication time behind computation time

  • Avoid global synchronization

  • Interaction between MPI & OpenMP

  • In OpenRadioss today, MPI communications are performed outside of OpenMP regions
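A minimal sketch of the isend/irecv + waitany pattern (not OpenRadioss code; the ring of neighbours and the one-value buffers are assumptions made for the example):

/* Exchange one value with two neighbours; process each message as it arrives. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int nb[2] = { (rank - 1 + size) % size,   /* left neighbour  */
                  (rank + 1) % size };        /* right neighbour */

    double sendbuf[2] = { (double)rank, (double)rank };
    double recvbuf[2];
    MPI_Request req[4];

    /* post all receives first, then all sends (all non-blocking) */
    for (int i = 0; i < 2; i++)
        MPI_Irecv(&recvbuf[i], 1, MPI_DOUBLE, nb[i], 0, MPI_COMM_WORLD, &req[i]);
    for (int i = 0; i < 2; i++)
        MPI_Isend(&sendbuf[i], 1, MPI_DOUBLE, nb[i], 0, MPI_COMM_WORLD, &req[2 + i]);

    /* ... useful computation can overlap with the communication here ... */

    /* process the received messages in completion order */
    for (int i = 0; i < 2; i++) {
        int idx;
        MPI_Waitany(2, req, &idx, MPI_STATUS_IGNORE);
        printf("rank %d: received %g from rank %d\n", rank, recvbuf[idx], nb[idx]);
    }
    MPI_Waitall(2, &req[2], MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}

Posting the receives first and handling the messages in completion order is what allows communication time to be hidden behind computation and avoids global synchronization points.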

 

Notion of Parallel Arithmetic

Round-off depends on the order of operations, so in floating-point arithmetic:

F1 + F2 + F3 + F4 ≠ F1 + F4 + F2 + F3

OpenRadioss implements a mechanism, the so-called “Parallel Arithmetic”, to guarantee that summations are performed in a unique order, independent of the MPI grouping and of the OpenMP task scheduling (a sketch of the idea follows).
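A minimal sketch of the underlying idea (not the actual OpenRadioss implementation; the contribution values and the arrival order are made up): contributions to a shared quantity are first buffered per contributing domain, then summed in a fixed, deterministic order:

/* Fixed-order summation: the result does not depend on message arrival order. */
#include <stdio.h>

#define NDOM 4

int main(void)
{
    /* contributions of each domain to one shared nodal force (made up)  */
    double contrib[NDOM] = { 0.1, 1.0e16, -1.0e16, 0.2 };

    /* the order in which the messages arrive may change from run to run */
    int arrival[NDOM] = { 2, 0, 3, 1 };

    /* naive approach: accumulate in arrival order -> result may vary    */
    double f_naive = 0.0;
    for (int k = 0; k < NDOM; k++)
        f_naive += contrib[arrival[k]];

    /* fixed-order approach: store per domain, then sum in domain order  */
    double slot[NDOM];
    for (int k = 0; k < NDOM; k++)
        slot[arrival[k]] = contrib[arrival[k]];   /* store, do not add   */

    double f_fixed = 0.0;
    for (int d = 0; d < NDOM; d++)                /* deterministic order */
        f_fixed += slot[d];

    printf("arrival-order sum: %.17g\n", f_naive);
    printf("fixed-order sum  : %.17g\n", f_fixed);
    return 0;
}

The two printed sums differ, which is exactly the round-off effect described above; only the fixed-order sum is reproducible from run to run.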

With or Without Parallel Arithmetic

Results comparison: results with and without parallel arithmetic on 1 and 4 CPUs. With parallel arithmetic, bitwise reproducibility is guaranteed for any number of MPI processes, any number of OpenMP threads, and any combination of the two.

 

 Related articles

OpenRadioss Coding Standards

OpenRadioss Coding Recommendations

OpenRadioss Performance Aspects: Vectorization and Optimization

OpenRadioss Reader (Radioss Block Format)