Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Table of Contents
minLevel1
maxLevel7
stylenone

Introduction

This page deals with vectorization and optimization of Radioss OpenRadioss Fortran code. This is a fundamental aspect of the code that needs to be well understood and learned by new Radioss programmers. Breaking performance of current code is not allowed. Furthermore, new OpenRadioss contributors. New functionality should be developed taking into account the same level of care regarding performance.

Vectorization

deals with the execution of computational loops. It allows a computer to compute several loop indexes during the same cycle leveraging vector registers. This concept was first introduced on vector supercomputers (CRAY, NEC, FUJITSU…)

Nonetheless, though there are no more vector supercomputers, vectorization is reintroduced on modern CPUs like Xeon processor with AVX and AVX512

For instance, AVX512 first introduced into Intel Xeon Phi Knights Landing and Xeon Skylake allows the handling of 8x 64-bit double precision real at the same time or 16x single precision  

Vectorization

Expand

Vector Length

Most computations in Radioss, like element or contact forces, are performed by packets of MVSIZ. This parameter is adjustable to match so-called vector length. This parameter is also important for cache locality. It is optimized by parallel development team according to hardware characteristics

New treatments need to respect this programming model which is to split the loop over number of elements or nodes by packets of MVSIZ. This will ensure optimal vector length and cache size as well as minimal local storage (local arrays of size MVSIZ instead of number of elements or nodes)

Loop Control

IF/THEN/ELSE

It is recommended to minimize the use of IF/THEN/ELSE instruction inside computational loop

Every time a test does not depend on loop index value, it is asked to perform it outside of such loop

GOTO

GOTO is absolutely forbidden inside computational loops as it inhibits vectorization and optimization

EXIT/CYCLE

EXIT and CYCLE need to be minimized and avoided in computational loops

Loop Size

Inside a loop it is recommended to keep the number of instructions reasonable. 20 instructions or less is good. Very long loops should be split to keep cache efficiencyMost compilers will be able to fuse short loops, while they will probably fail to vectorize long complex loops

Calling a procedure inside a loop inhibits vectorization

Data Dependency

The loop below is not vectorized due to possible dependence (same value of INDEX(I) for different I): 

Code Block
languagefortran
DO I = 1, N
         K = INDEX(I)
         A(K) = B(K)
         B(K) = 2*A(K)
END DO

In case of no true dependence, vectorization needs to be forced by adding a compiler directive

To keep portability across different platforms and compilers, an architecture specific include file exists named vectorize.inc that manages vectorization directives. The programmer just needs to add this include file just before the DO loop:

Code Block
languagefortran
#include "vectorize.inc"
       DO I = 1, N
            K = INDEX(I)
            A(K) = B(K)
            B(K) = 2*A(K)
       END DO

Notice there is another include file named simd.inc which makes unconditional vectorization, even if a true dependence is detected by the compiler. It is recommended to only use vectorise.inc which is more conservative regarding correctness

For Intel compiler:

  • vectorize.inc corresponds to !DIR$ IVDEP

  • simd.inc corresponds to !DIR$ SIMD

Procedure CALL

Calling a procedure inside a loop inhibits vectorization therefore it is not authorized

Inside Radioss, there are vectorized versions of procedures, basically the loop is put inside the procedure rather than outside:

  • VALPVEC to replace JACOBIEW

  • VINTER to replace FINTER

Nested Loops

In practice, only the inner most loop will be vectorized. So the inner most loop needs to be the largest one.

For fixed size loops it is possible to unroll them by hand or to use Fortran90 enhancement. Then the compiler is able to vectorize the outer loop

Note

Unroll by hand is not enough on AVX512 or SSE architectures

Only Fortran 90 syntax helps in this case

Example Nested Loop with NEL >> DIM

Code Block
languagefortran
DO I = 1, NEL
    DO J=1, DIM
      A(I,J) = B(I,J) + C(I,J)
    END DO
END DO 

To be transformed to:

Code Block
languagefortran
DO J=1, DIM
    DO I = 1, NEL
     A(I,J) = B(I,J) + C(I,J)
    END DO
END DO 

Or using Fortran90 notation:

Code Block
languagefortran
DO I=1,NEL
    A(I,:DIM)=B(I,:DIM) + C(I,:DIM)
END DO

Or for this simple case:

Code Block
languagefortran
A(:NEL,:DIM)=B(:NEL,:DIM) + C(:NEL,:DIM)

Arithmetic Functions

Expand

Power

Never use real variable for integer power because of the cost of real power arithmetic. Take care to not use real variable defined in constant.inc when integer is enough

A**2

Allowed

A**DEUXTWO

Forbidden as DEUX TWO is a my_real variable defined in constant.inc

A**DTIERSTHIRD

here there is no other choice as a real power arithmetic is required

Div

For invariant, it is advised to multiply by invert instead of doing a division by a constant inside a loop

Arrays

Expand

Fortran90 Array Operations

Use of Fortran90 array operations is encouraged as long as code readability is kept , by always specifying array bounds to avoid confusion between variable and array arithmetic.

 Example:

Code Block
languagefortran
INTEGER, DIMENSION(NUMNOD) :: A, B, C
A = B + C

    ! confusion between variable and array operation

Code Block
languagefortran
                               ! confusion between variable and array operation  
A(:NUMNOD) = B(:NUMNOD) + C(:NUMNOD)
   ! default lower bound:1 
    ! default lower bound:1

Multidimensional Arrays

Data Locality

Large arrays over a number of nodes or elements are defined to maximize data locality and have therefore the smallest dimension first, like in the example below:

Code Block
languagefortran

Rule of thumb for data locality of 2D arrays:

  • If the large dimension is >= MVSIZ or 128 : it should be last (X(3,NUMNOD)

, V(3,NUMNOD
  • )

, A(3,NUMNOD)

Leading Dimension for Vectorization

For vectorization on Xeon, it is better to have leading dimension first. So, depending on array size and access pattern a compromise needs to be found:

  • For large arrays like X, V, A, it is better to keep locality

  • For data structure of few times MVSIZ, like new element buffer arrays or temporary arrays of size x times MVSIZ, having largest dimension first is better:    

HOUR (MVSIZ,5) better than HOUR(5,MVSIZ)

According to test with recent Intel compiler, Fortran90 array notation can also improve code generated:

A(I,1:3) = B(I,1:3) + C(I,1:3) no need to unroll loop
  • If the large dimension is <= MVSIZ or 128:  it should be first (C(MVSIZ,5))

Structure Of Arrays

Code Block
languagefortran
TYPE POINT_STRUCT_
  my_real ::  x, y, z
END TYPE POINT_STRUCT_
Code Block
languagefortran
TYPE(GRID_STRUCT_) MESH;
! a grid of 3 arrays x, y, z of size 100
Code Block
languagefortran
TYPE(

Use structure of arrays (left examplePOINT%X(1:NBPOINTS) ) rather than arrays of structure (right example)

Correct

Incorrect

Code Block
languagefortran
TYPE GRID_STRUCT_
 my_real ::  x(100),y(100),z(100)
END TYPE GRID_STRUCT_

POINT

_STRUCT_) P(100); ! an array of size 100 of 3 scalars x, y, z

Warning On Large Array Initialization

Especially inside Radioss Starter it is common to find code using large array flag over NUMNOD initialized to zero many times which is time consuming

Here is a method to avoid such an issue: left is original poorly performing code, right is optimized version

Original

(

Poor Performance)

Optimized Code

Code Block
languagefortran
DO I=

1

, IX_TIMES !IX_TIMES set to 0           TAG(1:NUMNOD)=0          DO J=1,FEW_ITEMS            N = INDEX(I,J)            IF(TAG(N) == 0)THEN ! but few values change              TAG(N)=1 ! additional treatments…              END IF             END DO        END DO

 

Code Block
languagefortran
! single init to 0
TAG(1:NUMNOD)=0
DO I=1, IX_TIMES
  DO J=1,FEW_ITEMS
    N = INDEX(I,J)
    IF(TAG(N) == 0)THEN
      TAG(N)=1
! additional treatments…
    END IF
  END DO
! set to 0 only if needed
  DO J=1 , FEW_ITEMS
    N = INDEX(I,J)
    TAG(N) = 0
  END DO
END DO

Additional Fortran90 Restrictions Regarding Efficiency

Operator Overloading

Usage of object oriented feature like operator overloading, while it is efficient in code writing and clarity is not recommended due to lack of performance efficiency:NBPOINTS)%X)

Object Oriented Programming

It is not recommended to use object-oriented features unless you can verify that it does not harm performance