Introduction
This page deals with the vectorization and optimization of RADIOSS Fortran code. This is a fundamental aspect of the code that must be well understood by new RADIOSS programmers. Degrading the performance of existing code is not allowed, and new functionality must be developed with the same level of care regarding performance.
Vectorization deals with the execution of computational loops: it allows the computer to process several loop indexes during the same cycle by leveraging vector registers. This concept was first introduced on vector supercomputers (CRAY, NEC, FUJITSU, …).
Although vector supercomputers no longer exist, vectorization has been reintroduced on modern CPUs such as Intel Xeon processors with AVX and AVX-512.
For instance, AVX-512, first introduced with Intel Xeon Phi Knights Landing and Xeon Skylake, handles 8 64-bit double precision reals (or 16 single precision reals) at the same time.
Vectorization
Vector Length

Most computations in OpenRadioss, such as element or contact forces, are performed by packets of a fixed vector length. New treatments need to respect this programming model, which splits the loop over the number of elements or nodes into packets of that length.
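As a minimal sketch of this packet model (the packet-size parameter LVEC and the array names below are illustrative, not the actual names used in the code):

      NPACK = (NUMEL + LVEC - 1) / LVEC          ! number of packets
      DO IP = 1, NPACK
        IFIRST = (IP-1)*LVEC + 1                 ! first element of this packet
        NEL    = MIN(LVEC, NUMEL - IFIRST + 1)   ! elements in this packet
        DO I = 1, NEL                            ! vectorizable inner loop
          J = IFIRST + I - 1
          FOR(J) = FOR(J) + DT*V(J)              ! update over one packet
        END DO
      END DO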
Loop Control

Loop Size

Inside a loop it is recommended to keep the number of instructions reasonable; 20 instructions or fewer is good. Very long loops should be split to preserve cache efficiency. Most compilers are able to fuse short loops, whereas they will probably fail to vectorize long, complex loops.

Data Dependency

The loop below is not vectorized because of a possible dependence (the same value of the index array may address the same memory location in different iterations).
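A minimal sketch of such a loop, assuming an indirect addressing array IDX (all names are illustrative). Because two iterations may reference the same IDX(I), the compiler has to assume a dependence and does not vectorize:

      DO I = 1, NEL
        A(IDX(I)) = A(IDX(I)) + F(I)   ! possible write conflict on A
      END DO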
When there is no true dependence, vectorization needs to be forced by adding a compiler directive. To keep portability across different platforms and compilers, an architecture-specific include file named vectorize.inc manages the vectorization directives. The programmer just needs to add this include file immediately before the loop.
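A minimal usage sketch, assuming the cpp-style #include used for other include files of the code, and assuming the loop above is known to be free of true dependences (all IDX(I) values distinct):

#include "vectorize.inc"
      DO I = 1, NEL
        A(IDX(I)) = A(IDX(I)) + F(I)
      END DO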
Notice there is another include file named simd.inc which forces unconditional vectorization, even if a true dependence is detected by the compiler. It is recommended to only use vectorize.inc, which is more conservative regarding correctness. For the Intel compiler:
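As an assumption (the authoritative content lives in the include files themselves), for the Intel compiler the two files are expected to resolve to directives of the following kind:

!DIR$ IVDEP    ! vectorize.inc: ignore only assumed (unproven) dependences
!DIR$ SIMD     ! simd.inc: vectorize unconditionally, even with a true dependence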
Procedure Call

Calling a procedure inside a loop inhibits vectorization, therefore it is not allowed. Inside RADIOSS there are vectorized versions of procedures; basically, the loop is put inside the procedure rather than outside it:
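A minimal sketch of the two forms (routine and array names are illustrative):

      ! Not allowed: the call inside the loop inhibits vectorization
      DO I = 1, NEL
        CALL UPDATE_ONE(A(I), B(I))
      END DO

      ! Vectorized version: the loop is moved inside the procedure
      CALL UPDATE_ALL(A, B, NEL)

      ! ... defined elsewhere, in its own file:
      SUBROUTINE UPDATE_ALL(A, B, NEL)
        INTEGER NEL, I
        DOUBLE PRECISION A(NEL), B(NEL)
        DO I = 1, NEL
          A(I) = A(I) + B(I)
        END DO
      END SUBROUTINE UPDATE_ALL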
Nested Loops

In practice, only the innermost loop is vectorized, so the innermost loop needs to be the largest one. Fixed-size loops can be unrolled by hand, or Fortran90 array notation can be used; the compiler is then able to vectorize the outer loop.
Example of a nested loop with a small fixed inner dimension DIM:
      DO I = 1, NEL
        DO J = 1, DIM
          A(I,J) = B(I,J) + C(I,J)
        END DO
      END DO
To be transformed to:
      DO J = 1, DIM
        DO I = 1, NEL
          A(I,J) = B(I,J) + C(I,J)
        END DO
      END DO
Or using Fortran90 notation:
      DO I = 1, NEL
        A(I,:DIM) = B(I,:DIM) + C(I,:DIM)
      END DO
Routine & file organisation
By default, one file contains only one subroutine or function, except when understanding the code is made easier by grouping a few procedures. The name of the subroutine (or function) and the name of the file need to match. The same two rules apply to modules: one module per file, with the same name. In a module, several subroutines can be defined under the CONTAINS statement.
Header definition
Each procedure needs to have a conforming header containing, in order of appearance:
Notes:
Example:
Fortran routine generic example:
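A hedged sketch of such a routine with a documented header (the names and the header fields shown here are illustrative, not the project's mandated list):

      !=====================================================================
      ! MY_FORCE (illustrative name)
      ! Purpose   : accumulate a force contribution over one element packet
      ! Arguments : NEL (in)    number of elements in the packet
      !             A   (inout) accumulated force
      !             B   (in)    force contribution
      !=====================================================================
      SUBROUTINE MY_FORCE(A, B, NEL)
      INTEGER, INTENT(IN) :: NEL
      DOUBLE PRECISION, INTENT(INOUT) :: A(NEL)
      DOUBLE PRECISION, INTENT(IN)    :: B(NEL)
      INTEGER :: I
      DO I = 1, NEL
        A(I) = A(I) + B(I)
      END DO
      END SUBROUTINE MY_FORCE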
Comments
It is important to comment important algorithms, especially when non-straightforward coding is used. Comments are written in English. Comments respect the Fortran90 standard, i.e. they are introduced with the "!" character. Except for precompiler directives, no other character should be used in the first column to introduce a comment. Example:
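A hedged illustration of the expected style (the commented algorithm is only an example):

      ! Accumulate the internal force of the packet.
      ! Non-obvious point: the stress contribution is scaled by the
      ! element volume before accumulation.
      DO I = 1, NEL
        FOR(I) = FOR(I) + SIG(I)*VOL(I)   ! stress times volume
      END DO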
Modules
Module Format

The generic format of a Fortran90 module is as follows:
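A minimal sketch, with illustrative names:

      MODULE EXAMPLE_MOD
        IMPLICIT NONE
        ! declarations: parameters, derived types, module variables
      CONTAINS
        ! module procedures
        SUBROUTINE EXAMPLE_SUB()
        END SUBROUTINE EXAMPLE_SUB
      END MODULE EXAMPLE_MOD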
Naming Convention

The module name follows the file name: a module MODULENAME_MOD is stored in the file modulename_mod.F. The module file is placed at the same location as the other files used by the option.

Module Usage

There are 3 types of module usage.
Good practice is to split the type declaration and the variable declaration into 2 different modules. This way it is possible to pass variables defined in modules at the upper level (resol) down the calling tree to lower levels. Derived data type variables passed as procedure arguments can then be declared using the module which defines their type, which keeps such variables traceable throughout the code. The procedure that uses such variables passed by argument needs to USE the module that defines the derived data types. Example of derived data types (comments compliant with Doxygen):
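A hedged sketch of such a type-definition module (the type and member names are illustrative):

      !> \brief Element buffer structure (illustrative example)
      MODULE ELBUF_TYPE_MOD
        IMPLICIT NONE
        TYPE ELBUF_STRUCT_
          INTEGER :: NEL                                        !< number of elements
          DOUBLE PRECISION, DIMENSION(:), ALLOCATABLE :: SIG    !< stress
          DOUBLE PRECISION, DIMENSION(:), ALLOCATABLE :: EINT   !< internal energy
        END TYPE ELBUF_STRUCT_
      END MODULE ELBUF_TYPE_MOD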
Restart Variables

All the variables communicated between Starter and Engine are declared in a dedicated restart module.

Interface Definition

A Fortran90 interface allows the compiler to perform additional checks, such as the coherency of argument types, attributes and number between the calling and the callee routines. It is required in some cases, for example when a procedure has a dummy argument with the POINTER or OPTIONAL attribute. In practice, it was introduced in a few places of the code, for routines called from several different places. Such coherency is automatically tested by QA static analysis tools (Forcheck), and for pointers the good practice is to use derived data types instead of pointers directly. Therefore, the remaining use of interfaces concerns routines with optional arguments. This feature should be spread through the code instead of adding extra "dummy" arguments.

Interface Example

Example of an interface for a routine called at several places in the code. The routine is put in a module to guarantee automatic update of the interface and recompilation of all routines using it in case of change (the dependence is automatically found by the compiler).
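A hedged sketch of this pattern (names are illustrative): the routine lives in a module, so every caller that USEs the module gets an explicit interface, including the check of the OPTIONAL argument.

      MODULE PRINT_FORCES_MOD
        IMPLICIT NONE
      CONTAINS
        SUBROUTINE PRINT_FORCES(FOR, NEL, UNIT_ID)
          INTEGER, INTENT(IN) :: NEL
          DOUBLE PRECISION, INTENT(IN) :: FOR(NEL)
          INTEGER, INTENT(IN), OPTIONAL :: UNIT_ID   ! optional output unit
          INTEGER :: IU, I
          IU = 6
          IF (PRESENT(UNIT_ID)) IU = UNIT_ID
          DO I = 1, NEL
            WRITE(IU,*) FOR(I)
          END DO
        END SUBROUTINE PRINT_FORCES
      END MODULE PRINT_FORCES_MOD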
Memory Allocation
Dynamic Memory Allocation Mechanism

Dynamic memory allocation is done directly at the Fortran90 level using a dedicated allocation macro. This macro adds automatic error checks by encapsulating the call to the Fortran90 ALLOCATE statement. Previously, the allocation error check had to be written by hand by the developer:
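A hedged sketch of such a hand-written check (array name and error handling are illustrative):

      ALLOCATE(A(NEL), STAT=IERR)
      IF (IERR /= 0) THEN
        WRITE(*,*) 'ERROR: allocation of array A failed'
        STOP                        ! or a call to the code's error routine
      END IF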
In practice, error checking was missing in many places. Therefore a macro was introduced to automatically control the allocation, print an error message and stop the execution in case of failure.
The previous code becomes:
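With such a macro (the name MY_ALLOC below is hypothetical, used only for illustration), the hand-written block above reduces to a single line:

      MY_ALLOC(A, NEL)     ! allocation plus automatic error handling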
Developers are required to check the success of every allocation. The message printed by this macro in case of allocation failure is rather generic; for large arrays it is preferable to print a specific message, with advice for the user, or at least the name of the option concerned by the failure.

Global Memory

Memory allocation of global data structures, arrays and derived data types should be done at the highest level of the calling tree. It is advised to use derived data types organized as structures of arrays. This way it is possible to declare the variable at the upper level, gather the allocation of the array members in a dedicated subroutine, then use the variable in procedures called at lower levels without losing traceability.

Local Memory

In a procedure, the method used to allocate a local variable depends on its size:
Automatic arrays go into the stack, while ALLOCATE'd arrays go into the heap. One should take care to keep stack usage to a reasonable size; the stack size is hardcoded under Windows. Small local arrays may be declared as automatic arrays, whereas large arrays should be ALLOCATE'd on the heap.

Local Memory Example:
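A hedged sketch of the two kinds of local arrays (sizes and names are illustrative):

      SUBROUTINE LOCAL_MEM_EXAMPLE(NEL, NBIG)
        INTEGER, INTENT(IN) :: NEL, NBIG
        DOUBLE PRECISION :: WORK(NEL)                       ! automatic array -> stack
        DOUBLE PRECISION, DIMENSION(:), ALLOCATABLE :: BIG  ! large array -> heap
        ALLOCATE(BIG(NBIG))        ! in the real code, via the allocation macro
        ! ... computations ...
        DEALLOCATE(BIG)
      END SUBROUTINE LOCAL_MEM_EXAMPLE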
Shared Memory Programming (SMP) and Memory Allocation

For the RADIOSS Engine, the OpenMP programming model is used for the second level of parallelization. By default, any memory allocation done outside of a parallel section is shared between threads; most of the parallel sections are started from resol. In the same way, any variable defined in a common block or a module is shared by default. For a pointer, notice that a single thread needs to allocate and deallocate it: the programmer has to manage synchronization in order to ensure such a variable is allocated before being used by any other thread, and no longer used before it is deallocated. The !$OMP THREADPRIVATE directive overrides the default behavior by creating thread-local storage variables.
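A minimal sketch of THREADPRIVATE usage (module and variable names are illustrative):

      MODULE SMP_WORK_MOD
        IMPLICIT NONE
        DOUBLE PRECISION, DIMENSION(:), ALLOCATABLE :: WORK   ! per-thread scratch
!$OMP THREADPRIVATE(WORK)
      END MODULE SMP_WORK_MOD

      ! In a routine that USEs SMP_WORK_MOD, each thread handles its own copy:
!$OMP PARALLEL
      ALLOCATE(WORK(NEL))
      ! ... thread-local computations ...
      DEALLOCATE(WORK)
!$OMP END PARALLEL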
Array Aliasing
Description

Here we discuss different arguments of a procedure referencing the same memory locations. Inside the procedure, the compiler is not able to detect that different argument variables reference one or more identical memory locations. Such a situation is particularly dangerous because of compiler optimization: even though compilers do not forbid it, if both variables are modified inside the procedure this can lead to unpredictable results, since potential conflicts or dependencies are not detected.

Code Example:
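A hedged sketch of the problem, with illustrative names: the same array is passed for both arguments, so A and B alias inside the routine, and the result depends on whether the compiler reads the old values or the already-updated ones.

      SUBROUTINE SHIFT_ADD(A, B, N)
        INTEGER N, I
        DOUBLE PRECISION A(N), B(N)
        DO I = 2, N
          A(I) = B(I-1) + 1.0D0   ! with aliasing, B(I-1) is the A(I-1) just written
        END DO
      END SUBROUTINE SHIFT_ADD

      ! Dangerous call: X aliases both dummy arguments A and B
      CALL SHIFT_ADD(X, X, N)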
The original code example was tested on an SGI O200 under IRIX 6.4.