Introduction
This page covers vectorization and optimization of Radioss Fortran code. It is a fundamental aspect of the code that new Radioss programmers need to learn and understand well. Degrading the performance of existing code is not allowed, and new functionality should be developed with the same level of care for performance.
Vectorization concerns the execution of computational loops. It allows a processor to compute several loop iterations in the same cycle by using vector registers. The concept was first introduced on vector supercomputers (CRAY, NEC, FUJITSU…).
Although classic vector supercomputers have largely disappeared, vectorization has been reintroduced on modern CPUs such as Intel Xeon processors with AVX and AVX512.
For instance, AVX512, first introduced in Intel Xeon Phi Knights Landing and Xeon Skylake, handles 8 64-bit double precision reals or 16 single precision reals at the same time.
Vectorization
Vector Length
Most computations in Radioss, such as element or contact forces, are performed by packets of fixed size, the vector length. New treatments need to respect this programming model, which splits the loop over the number of elements or nodes into packets of that size.
Loop Control
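As a sketch of the packet model described under Vector Length, the loop over all entries can be split into packets, with the inner loop bounded by the number of entries actually present in the current packet. The names `VLEN`, `NEL` and `SCALE_BY_PACKETS` are hypothetical and chosen for illustration only; the packet-size constant actually used in Radioss is not shown here.

```fortran
MODULE PACKET_DEMO
  IMPLICIT NONE
  ! Hypothetical packet size (assumption, not the Radioss constant).
  INTEGER, PARAMETER :: VLEN = 128
CONTAINS
  ! Scale N values by FAC, processing them packet by packet.
  SUBROUTINE SCALE_BY_PACKETS(N, FAC, X)
    INTEGER, INTENT(IN) :: N
    DOUBLE PRECISION, INTENT(IN) :: FAC
    DOUBLE PRECISION, INTENT(INOUT) :: X(N)
    INTEGER :: IFIRST, NEL, I
    DO IFIRST = 1, N, VLEN
      ! The last packet may hold fewer than VLEN entries.
      NEL = MIN(VLEN, N - IFIRST + 1)
      ! Short inner loop bounded by NEL: the candidate for vectorization.
      DO I = 1, NEL
        X(IFIRST + I - 1) = FAC * X(IFIRST + I - 1)
      END DO
    END DO
  END SUBROUTINE SCALE_BY_PACKETS
END MODULE PACKET_DEMO
```

Bounding the inner loop by `NEL` rather than by `VLEN` keeps the last, partially filled packet from reading or writing past the end of the data.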
Loop Size
Inside a loop, it is recommended to keep the number of instructions reasonable: 20 instructions or fewer is good. Very long loops should be split to preserve cache efficiency. Most compilers can fuse short loops, whereas they will probably fail to vectorize long, complex loops.
Data Dependency
The loop below is not vectorized because of a possible dependence (same value of …)
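A minimal sketch of a loop with a possible dependence, assuming indirect addressing through a hypothetical index array `IDX` (the original loop is not recoverable here, so this is an illustration, not the author's example). When two iterations share the same `IDX` value they update the same entry of `A`, so the compiler cannot safely execute them at the same time. The second routine emulates what a naive vector execution would compute, to show how the result changes when indices repeat.

```fortran
MODULE DEPENDENCE_DEMO
  IMPLICIT NONE
CONTAINS
  ! Sequential accumulation: correct even when IDX holds repeated values.
  SUBROUTINE SCATTER_ADD(N, IDX, V, A)
    INTEGER, INTENT(IN) :: N
    INTEGER, INTENT(IN) :: IDX(N)
    DOUBLE PRECISION, INTENT(IN) :: V(N)
    DOUBLE PRECISION, INTENT(INOUT) :: A(:)
    INTEGER :: I
    DO I = 1, N
      ! True dependence whenever IDX(I) equals IDX(J) for some J < I.
      A(IDX(I)) = A(IDX(I)) + V(I)
    END DO
  END SUBROUTINE SCATTER_ADD

  ! Emulation of a naive vector execution: gather and add for all
  ! iterations first, then scatter the results back.
  SUBROUTINE SCATTER_ADD_AS_VECTOR(N, IDX, V, A)
    INTEGER, INTENT(IN) :: N
    INTEGER, INTENT(IN) :: IDX(N)
    DOUBLE PRECISION, INTENT(IN) :: V(N)
    DOUBLE PRECISION, INTENT(INOUT) :: A(:)
    DOUBLE PRECISION :: TMP(N)
    INTEGER :: I
    TMP = A(IDX) + V
    DO I = 1, N
      ! With repeated indices, later writes overwrite earlier ones.
      A(IDX(I)) = TMP(I)
    END DO
  END SUBROUTINE SCATTER_ADD_AS_VECTOR
END MODULE DEPENDENCE_DEMO
```

With distinct indices both routines give the same result; with repeated indices the emulated vector version loses contributions, which is why the compiler refuses to vectorize unless told the dependence cannot occur.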
When there is no true dependence, vectorization needs to be forced by adding a compiler directive. To keep portability across platforms and compilers, an architecture-specific include file named vectorize.inc manages the vectorization directives. The programmer just needs to add this include file immediately before the loop.
Note that another include file, simd.inc, forces unconditional vectorization even if the compiler detects a true dependence. It is recommended to use only vectorize.inc, which is more conservative regarding correctness.
For Intel compiler:
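The contents of vectorize.inc are not shown here. As a hedged illustration, one Intel Fortran directive that matches the conservative behavior described is `!DIR$ IVDEP`, which asks the compiler to ignore assumed (but not proven) dependences; other compilers treat the line as a plain comment. That the include file expands to this directive is an assumption, and `SCATTER_ADD_UNIQUE` and `IDX` are hypothetical names.

```fortran
MODULE IVDEP_DEMO
  IMPLICIT NONE
CONTAINS
  ! Add V(I) to A(IDX(I)). The caller guarantees that all IDX values
  ! are distinct, so there is no true dependence between iterations.
  SUBROUTINE SCATTER_ADD_UNIQUE(N, IDX, V, A)
    INTEGER, INTENT(IN) :: N
    INTEGER, INTENT(IN) :: IDX(N)
    DOUBLE PRECISION, INTENT(IN) :: V(N)
    DOUBLE PRECISION, INTENT(INOUT) :: A(:)
    INTEGER :: I
    ! Assumption: in Radioss this line would come from the include file
    ! placed just before the loop rather than being written by hand.
!DIR$ IVDEP
    DO I = 1, N
      A(IDX(I)) = A(IDX(I)) + V(I)
    END DO
  END SUBROUTINE SCATTER_ADD_UNIQUE
END MODULE IVDEP_DEMO
```

The directive is only safe because the uniqueness of the indices is guaranteed by the caller; it does not make a loop with repeated indices correct.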
Procedure CALL
Calling a procedure inside a loop inhibits vectorization, so it is not allowed. Radioss provides vectorized versions of procedures, in which the loop is placed inside the procedure rather than around the call:
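A minimal sketch of the idea, using hypothetical routines that are not taken from the Radioss source: the scalar version would have to be called once per element from inside a loop, while the vectorized version receives the whole packet and contains the loop itself.

```fortran
MODULE CALL_DEMO
  IMPLICIT NONE
CONTAINS
  ! Scalar version: one call per element, so the caller's loop
  ! contains a CALL and is unlikely to vectorize.
  SUBROUTINE KINETIC_ENERGY_ONE(M, V, E)
    DOUBLE PRECISION, INTENT(IN) :: M, V
    DOUBLE PRECISION, INTENT(OUT) :: E
    E = 0.5D0 * M * V * V
  END SUBROUTINE KINETIC_ENERGY_ONE

  ! Vectorized version: one call per packet of NEL elements,
  ! with the loop inside the procedure.
  SUBROUTINE KINETIC_ENERGY_VEC(NEL, M, V, E)
    INTEGER, INTENT(IN) :: NEL
    DOUBLE PRECISION, INTENT(IN) :: M(NEL), V(NEL)
    DOUBLE PRECISION, INTENT(OUT) :: E(NEL)
    INTEGER :: I
    DO I = 1, NEL
      E(I) = 0.5D0 * M(I) * V(I) * V(I)
    END DO
  END SUBROUTINE KINETIC_ENERGY_VEC
END MODULE CALL_DEMO
```

A caller would replace a loop of `CALL KINETIC_ENERGY_ONE(M(I), V(I), E(I))` statements with a single `CALL KINETIC_ENERGY_VEC(NEL, M, V, E)`.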
Nested Loops
In practice, only the innermost loop is vectorized, so the innermost loop needs to be the largest one. For fixed-size loops, it is possible to unroll them by hand or to use Fortran90 array notation; the compiler is then able to vectorize the outer loop.
Example nested loop:
    DO I = 1, NEL
      DO J = 1, DIM
        A(I,J) = B(I,J) + C(I,J)
      END DO
    END DO
To be transformed to:
    DO J = 1, DIM
      DO I = 1, NEL
        A(I,J) = B(I,J) + C(I,J)
      END DO
    END DO
Or using Fortran90 notation:
    DO I = 1, NEL
      A(I,:DIM) = B(I,:DIM) + C(I,:DIM)
    END DO
Or for this simple case:
    A(:NEL,:DIM) = B(:NEL,:DIM) + C(:NEL,:DIM)
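The text also mentions unrolling a fixed-size loop by hand, which the variants above do not show. A sketch under the assumption that the small dimension is fixed at 3 (the routine name `ADD_UNROLLED` is hypothetical):

```fortran
MODULE UNROLL_DEMO
  IMPLICIT NONE
CONTAINS
  ! The loop over J = 1..3 is unrolled by hand, leaving the loop over
  ! I as the only loop, so it becomes the one the compiler vectorizes.
  SUBROUTINE ADD_UNROLLED(NEL, A, B, C)
    INTEGER, INTENT(IN) :: NEL
    DOUBLE PRECISION, INTENT(OUT) :: A(NEL,3)
    DOUBLE PRECISION, INTENT(IN) :: B(NEL,3), C(NEL,3)
    INTEGER :: I
    DO I = 1, NEL
      A(I,1) = B(I,1) + C(I,1)
      A(I,2) = B(I,2) + C(I,2)
      A(I,3) = B(I,3) + C(I,3)
    END DO
  END SUBROUTINE ADD_UNROLLED
END MODULE UNROLL_DEMO
```

Each of the three statements accesses its arrays with unit stride in `I`, which matches the column-major storage order of Fortran arrays.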
Arithmetic Functions
Power
Never use a real variable as the exponent of an integer power, because of the cost of real power arithmetic. Take care not to use a real variable defined in constant.inc when an integer is enough.
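A short sketch of the two forms. The real constant `TWO` below is a hypothetical stand-in for a constant of the kind defined in an include file; its actual name and definition in constant.inc are not shown here.

```fortran
MODULE POWER_DEMO
  IMPLICIT NONE
  ! Hypothetical real constant (assumption for illustration).
  DOUBLE PRECISION, PARAMETER :: TWO = 2.0D0
CONTAINS
  ! To avoid: real exponent, which may be evaluated with a general
  ! power routine; also not valid Fortran for negative X.
  SUBROUTINE SQUARE_REAL_EXP(N, X, Y)
    INTEGER, INTENT(IN) :: N
    DOUBLE PRECISION, INTENT(IN) :: X(N)
    DOUBLE PRECISION, INTENT(OUT) :: Y(N)
    INTEGER :: I
    DO I = 1, N
      Y(I) = X(I)**TWO
    END DO
  END SUBROUTINE SQUARE_REAL_EXP

  ! Preferred: integer exponent, which reduces to a multiplication.
  SUBROUTINE SQUARE_INT_EXP(N, X, Y)
    INTEGER, INTENT(IN) :: N
    DOUBLE PRECISION, INTENT(IN) :: X(N)
    DOUBLE PRECISION, INTENT(OUT) :: Y(N)
    INTEGER :: I
    DO I = 1, N
      Y(I) = X(I)**2
    END DO
  END SUBROUTINE SQUARE_INT_EXP
END MODULE POWER_DEMO
```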
Div
For a loop-invariant divisor, it is advised to multiply by its inverse instead of dividing by a constant inside the loop.
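A minimal sketch of this transformation, with hypothetical names: the single division is hoisted out of the loop and the loop body uses a multiplication.

```fortran
MODULE DIV_DEMO
  IMPLICIT NONE
CONTAINS
  ! Divide every entry of X by the loop-invariant value D.
  SUBROUTINE SCALE_BY_INVERSE(N, D, X)
    INTEGER, INTENT(IN) :: N
    DOUBLE PRECISION, INTENT(IN) :: D
    DOUBLE PRECISION, INTENT(INOUT) :: X(N)
    DOUBLE PRECISION :: DINV
    INTEGER :: I
    ! One division, performed once outside the loop.
    DINV = 1.0D0 / D
    DO I = 1, N
      ! Multiplication replaces X(I) / D inside the loop.
      X(I) = X(I) * DINV
    END DO
  END SUBROUTINE SCALE_BY_INVERSE
END MODULE DIV_DEMO
```

The result can differ from `X(I) / D` in the last bit because of rounding, which is the usual trade-off accepted for this optimization.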
Arrays
Fortran90 Array Operations
Use of Fortran90 array operations is encouraged as long as code readability is kept: always specify array bounds to avoid confusion between variable and array arithmetic. An operation written without bounds can be confused with a variable operation; when only the lower bound is omitted, the default lower bound (1) is used.
Multidimensional Arrays
Data Locality
Large arrays over the number of nodes or elements are defined to maximize data locality and therefore have the smallest dimension first, like in the example below:
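A hedged sketch of such a layout, assuming a nodal coordinate array dimensioned `X(3,NUMNOD)`; the names and the routine are illustrative and not taken from the Radioss source. With the small dimension first, the three coordinates of a node are adjacent in memory, so accessing a node at an arbitrary index touches one small contiguous region.

```fortran
MODULE LOCALITY_DEMO
  IMPLICIT NONE
CONTAINS
  ! Squared distance between nodes N1 and N2.
  FUNCTION NODE_DIST2(NUMNOD, X, N1, N2) RESULT(D2)
    INTEGER, INTENT(IN) :: NUMNOD, N1, N2
    ! Smallest dimension first: X(1:3,N) is contiguous for each node N.
    DOUBLE PRECISION, INTENT(IN) :: X(3,NUMNOD)
    DOUBLE PRECISION :: D2
    D2 = (X(1,N1) - X(1,N2))**2 &
       + (X(2,N1) - X(2,N2))**2 &
       + (X(3,N1) - X(3,N2))**2
  END FUNCTION NODE_DIST2
END MODULE LOCALITY_DEMO
```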
Leading Dimension for Vectorization
For vectorization on Xeon, it is better to have the dimension indexed by the vectorized loop as the leading dimension. So, depending on array size and access pattern, a compromise needs to be found:
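One possible form of this compromise, sketched with hypothetical names: a large global array keeps the small dimension first for locality, while a small local work array for one packet puts the packet index first so that the compute loop runs with unit stride. This is an illustration of the trade-off, not a description of how Radioss organizes its arrays.

```fortran
MODULE LEADING_DIM_DEMO
  IMPLICIT NONE
CONTAINS
  ! Gather the coordinates of NEL nodes from the global array X(3,NUMNOD)
  ! into a local array XL(NEL,3) whose leading dimension is the packet index.
  SUBROUTINE GATHER_COORDS(NEL, NODES, NUMNOD, X, XL)
    INTEGER, INTENT(IN) :: NEL, NUMNOD
    INTEGER, INTENT(IN) :: NODES(NEL)
    DOUBLE PRECISION, INTENT(IN) :: X(3,NUMNOD)
    DOUBLE PRECISION, INTENT(OUT) :: XL(NEL,3)
    INTEGER :: I, N
    DO I = 1, NEL
      N = NODES(I)
      XL(I,1) = X(1,N)
      XL(I,2) = X(2,N)
      XL(I,3) = X(3,N)
    END DO
  END SUBROUTINE GATHER_COORDS

  ! Compute loop: every array is accessed with unit stride in I.
  SUBROUTINE RADIUS2(NEL, XL, R2)
    INTEGER, INTENT(IN) :: NEL
    DOUBLE PRECISION, INTENT(IN) :: XL(NEL,3)
    DOUBLE PRECISION, INTENT(OUT) :: R2(NEL)
    INTEGER :: I
    DO I = 1, NEL
      R2(I) = XL(I,1)**2 + XL(I,2)**2 + XL(I,3)**2
    END DO
  END SUBROUTINE RADIUS2
END MODULE LEADING_DIM_DEMO
```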
According to tests with a recent Intel compiler, Fortran90 array notation can also improve the generated code:
Structure Of Arrays
Use a structure of arrays (left example) rather than an array of structures (right example).
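The two layouts can be sketched with Fortran derived types as follows. The type and routine names are hypothetical and only illustrate the difference in memory access: in the structure of arrays each field is a contiguous array, while in the array of structures the same field of consecutive entries is separated by the other fields.

```fortran
MODULE SOA_DEMO
  IMPLICIT NONE
  ! Array of structures: one small record per node (layout to avoid).
  TYPE :: NODE_AOS
    DOUBLE PRECISION :: MASS
    DOUBLE PRECISION :: VEL
  END TYPE NODE_AOS
  ! Structure of arrays: one contiguous array per field (preferred).
  TYPE :: NODES_SOA
    DOUBLE PRECISION, ALLOCATABLE :: MASS(:)
    DOUBLE PRECISION, ALLOCATABLE :: VEL(:)
  END TYPE NODES_SOA
CONTAINS
  ! Strided access: VEL of consecutive nodes is interleaved with MASS.
  SUBROUTINE DAMP_AOS(N, FAC, P)
    INTEGER, INTENT(IN) :: N
    DOUBLE PRECISION, INTENT(IN) :: FAC
    TYPE(NODE_AOS), INTENT(INOUT) :: P(N)
    INTEGER :: I
    DO I = 1, N
      P(I)%VEL = FAC * P(I)%VEL
    END DO
  END SUBROUTINE DAMP_AOS

  ! Unit-stride access: VEL is a contiguous array.
  SUBROUTINE DAMP_SOA(N, FAC, P)
    INTEGER, INTENT(IN) :: N
    DOUBLE PRECISION, INTENT(IN) :: FAC
    TYPE(NODES_SOA), INTENT(INOUT) :: P
    INTEGER :: I
    DO I = 1, N
      P%VEL(I) = FAC * P%VEL(I)
    END DO
  END SUBROUTINE DAMP_SOA
END MODULE SOA_DEMO
```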
Warning On Large Array Initialization
Especially inside Radioss Starter, it is common to find code using a large flag array over …
Here is a method to avoid such an issue: left is the original, poorly performing code; right is the optimized version.
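The original examples are not available here, so the following is only a sketch under a stated assumption: that the issue is a large flag array being fully re-initialized inside an outer loop, which costs the array size times the number of outer iterations. In that case one remedy is to reset only the entries that were actually touched. All names (`ITAG`, `NODES`, `COUNT_DISTINCT`) are hypothetical.

```fortran
MODULE FLAG_DEMO
  IMPLICIT NONE
CONTAINS
  ! Count the distinct nodes of each group. ITAG must be all zero on
  ! entry and is returned all zero.
  SUBROUTINE COUNT_DISTINCT(NGROUP, NPER, NODES, NUMNOD, ITAG, NDIST)
    INTEGER, INTENT(IN) :: NGROUP, NPER, NUMNOD
    INTEGER, INTENT(IN) :: NODES(NPER,NGROUP)
    INTEGER, INTENT(INOUT) :: ITAG(NUMNOD)
    INTEGER, INTENT(OUT) :: NDIST(NGROUP)
    INTEGER :: IG, K, N
    DO IG = 1, NGROUP
      ! A slow variant would reset ITAG(1:NUMNOD) = 0 here, at a total
      ! cost proportional to NGROUP * NUMNOD.
      NDIST(IG) = 0
      DO K = 1, NPER
        N = NODES(K,IG)
        IF (ITAG(N) == 0) THEN
          ITAG(N) = 1
          NDIST(IG) = NDIST(IG) + 1
        END IF
      END DO
      ! Reset only the entries touched by this group: cost NPER.
      DO K = 1, NPER
        ITAG(NODES(K,IG)) = 0
      END DO
    END DO
  END SUBROUTINE COUNT_DISTINCT
END MODULE FLAG_DEMO
```

The flag array is initialized once by the caller; each outer iteration then costs a number of operations proportional to the group size instead of the total number of nodes.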
Additional Fortran90 Restrictions Regarding Efficiency
Operator Overloading
Use of object-oriented features such as operator overloading, while convenient for writing clear code, is not recommended because of its performance cost.
Related articles
OpenRadioss HMPP Development Insights
OpenRadioss Coding Recommendations
OpenRadioss Performance Aspects: Parallelism
OpenRadioss Coding: Additional Considerations
Reader (Radioss Block Format)