Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Table of Contents
minLevel1
maxLevel7
stylenone

Introduction

This page deals with vectorization and optimization of Radioss Fortran code. This is a fundamental aspect of the code that needs to be well understood and learned by new Radioss programmers. Breaking performance of current code is not allowed. Furthermore, new functionality should be developed taking into account the same level of care regarding performance.

Vectorization deals with the execution of computational loops. It allows a computer to compute several loop indexes during the same cycle leveraging vector registers. This concept was first introduced on vector supercomputers (CRAY, NEC, FUJITSU…)

Nonetheless, though there are no more vector supercomputers, vectorization is reintroduced on modern CPUs like Xeon processor with AVX and AVX512

For instance, AVX512 first introduced into Intel Xeon Phi Knights Landing and Xeon Skylake allows the handling of 8x 64-bit double precision real at the same time or 16x single precision  

Vectorization

Expand

Vector Length

Most computations in Radioss, like element or contact forces, are performed by packets of MVSIZ. This parameter is adjustable to match so-called vector length. This parameter is also important for cache locality. It is optimized by parallel development team according to hardware characteristics

New treatments need to respect this programming model which is to split the loop over number of elements or nodes by packets of MVSIZ. This will ensure optimal vector length and cache size as well as minimal local storage (local arrays of size MVSIZ instead of number of elements or nodes)

Loop Control

IF/THEN/ELSE

It is recommended to minimize the use of IF/THEN/ELSE instruction inside computational loop

Every time a test does not depend on loop index value, it is asked to perform it outside of such loop

GOTO

GOTO is absolutely forbidden inside computational loops as it inhibits vectorization and optimization

EXIT/CYCLE

EXIT and CYCLE need to be minimized and avoided in computational loops

Loop Size

Inside a loop it is recommended to keep the number of instructions reasonable. 20 instructions or less is good. Very long loops should be split to keep cache efficiency

Most compilers will be able to fuse short loops, while they will probably fail to vectorize long complex loops

Data Dependency

The loop below is not vectorized due to possible dependence (same value of INDEX(I) for different I): 

Code Block
languagefortran
DO I = 1, N
         K = INDEX(I)
         A(K) = B(K)
         B(K) = 2*A(K)
END DO

In case of no true dependence, vectorization needs to be forced by adding a compiler directive

To keep portability across different platforms and compilers, an architecture specific include file exists named vectorize.inc that manages vectorization directives. The programmer just needs to add this include file just before the DO loop:

Code Block
languagefortran
#include "vectorize.inc"
       DO I = 1, N
            K = INDEX(I)
            A(K) = B(K)
            B(K) = 2*A(K)
       END DO

Notice there is another include file named simd.inc which makes unconditional vectorization, even if a true dependence is detected by the compiler. It is recommended to only use vectorise.inc which is more conservative regarding correctness

For Intel compiler:

  • vectorize.inc corresponds to !DIR$ IVDEP

  • simd.inc corresponds to !DIR$ SIMD

Procedure CALL

Calling a procedure inside a loop inhibits vectorization therefore it is not authorized

Inside Radioss, there are vectorized versions of procedures, basically the loop is put inside the procedure rather than outside:

  • VALPVEC to replace JACOBIEW

  • VINTER to replace FINTER

Nested Loops

In practice, only the inner most loop will be vectorized. So the inner most loop needs to be the largest one.

For fixed size loops it is possible to unroll them by hand or to use Fortran90 enhancement. Then the compiler is able to vectorize the outer loop

Note

Unroll by hand is not enough on AVX512 or SSE architectures

Only Fortran 90 syntax helps in this case

Example Nested Loop with NEL >> DIM

Code Block
languagefortran
DO I = 1, NEL
    DO J=1, DIM
      A(I,J) = B(I,J) + C(I,J)
    END DO
END DO 

To be transformed to:

Code Block
languagefortran
DO J=1, DIM
    DO I = 1, NEL
     A(I,J) = B(I,J) + C(I,J)
    END DO
END DO 

Or using Fortran90 notation:

Code Block
languagefortran
DO I=1,NEL
    A(I,:DIM)=B(I,:DIM) + C(I,:DIM)
END DO

Or for this simple case:

Code Block
languagefortran
A(:NEL,:DIM)=B(:NEL,:DIM) + C(:NEL,:DIM)

Arithmetic Functions

Expand

Power

Never use real variable for integer power because of the cost of real power arithmetic. Take

care to not use real variable defined in constant.inc when integer is enough

A**2

Allowed

A**DEUX

Forbidden as DEUX is a my_real variable defined in constant.inc

A**DTIERS

here there is no other choice as a real power arithmetic is required

Div

For invariant, it is advised to multiply by invert instead of doing a division by a constant inside a loop

Arrays

Expand

Fortran90 Array Operations

Use of Fortran90 array operations is encouraged as long as code readability is kept, by always specifying array bounds to avoid confusion between variable and array arithmetic.

 Example:

Code Block
languagefortran
INTEGER, DIMENSION(NUMNOD) :: A, B, C
A = B + C

    ! confusion between variable and array operation

Code Block
languagefortran
A(:NUMNOD) = B(:NUMNOD) + C(:NUMNOD)

   ! default lower bound:1 

Multidimensional Arrays

Data Locality

Large arrays over a number of nodes or elements are defined to maximize data locality and have therefore the smallest dimension first, like in the example below:

Code Block
languagefortran
X(3,NUMNOD), V(3,NUMNOD), A(3,NUMNOD)

Leading Dimension for Vectorization

For vectorization on Xeon, it is better to have leading dimension first. So, depending on array size and access pattern a compromise needs to be found:

  • For large arrays like X, V, A, it is better to keep locality

  • For data structure of few times MVSIZ, like new element buffer arrays or temporary arrays of size x times MVSIZ, having largest dimension first is better:    

HOUR (MVSIZ,5) better than HOUR(5,MVSIZ)

According to test with recent Intel compiler, Fortran90 array notation can also improve code generated:

A(I,1:3) = B(I,1:3) + C(I,1:3) no need to unroll loop

Structure Of Arrays

Use structure of arrays (left example) rather than arrays of structure (right example)

Correct

Incorrect

Code Block
languagefortran
TYPE GRID_STRUCT_
 my_real ::  x(100),y(100),z(100)
END TYPE GRID_STRUCT_
Code Block
languagefortran
TYPE POINT_STRUCT_
  my_real ::  x, y, z
END TYPE POINT_STRUCT_
Code Block
languagefortran
TYPE(GRID_STRUCT_) MESH;
! a grid of 3 arrays x, y, z of size 100
Code Block
languagefortran
TYPE(POINT_STRUCT_) P(100);
! an array of size 100 of 3 scalars x, y, z

Warning On Large Array Initialization

Especially inside Radioss Starter it is common to find code using large array flag over NUMNOD initialized to zero many times which is time consuming

Here is a method to avoid such an issue: left is original poorly performing code, right is optimized version

Original (Poor Performance)

Optimized Code

Code Block
languagefortran
DO I=1, IX_TIMES
!IX_TIMES set to 0 
         TAG(1:NUMNOD)=0
         DO J=1,FEW_ITEMS
           N = INDEX(I,J)
           IF(TAG(N) == 0)THEN
! but few values change
             TAG(N)=1
! additional treatments…  
           END IF   
         END DO
       END DO

 

Code Block
languagefortran
! single init to 0
TAG(1:NUMNOD)=0
DO I=1, IX_TIMES
  DO J=1,FEW_ITEMS
    N = INDEX(I,J)
    IF(TAG(N) == 0)THEN
      TAG(N)=1
! additional treatments…
    END IF
  END DO
! set to 0 only if needed
  DO J=1 , FEW_ITEMS
    N = INDEX(I,J)
    TAG(N) = 0
  END DO
END DO

 

Comments

Expand

It is important to comment important algorithms, especially when non straightforward coding is used

Comments are written in English

Comments respect Fortran90 standard. The use of "!" at beginning of line is preferred to "C" or "*"

Except for precompiler directives, the following characters ',", \, /*, */, # are forbidden, even inside comments. Especially "\" is dangerous as it is interpreted as a continuation line by the precompiler

 Example:

 

Code Block
languagefortran
! this is a legal comment
      A = A + B ! this one is also authorized in FORTRAN 90
      C = A + C ! next instruction ignored cause of this char \
      D = C + D

Modules

Expand

Module Format

Generic format of a Fortran90 module is as follows:     

Code Block
languagefortran
MODULE <module name>
  USE [other module list]
  #include "implicit_f.inc"
  <declaration section>
  CONTAINS
  <procedure definitions>
END MODULE <module name>

Naming Convention 

Module name is defined as follows: MODULENAME_MOD

With module file name: modulename_mod.F

Module file is placed at the same location as other files used by the option

Module Usage   

3 types of usage:

  • Derived data types definition

  • Variable names declaration (in replacement of commons)

  • Procedure interface (data type & argument list control,…)

Good practice is to split type declaration from variable declaration into 2 different modules.

This way it is possible to pass variables defined in modules at upper level (resol) into calling tree at lower levels

Then, such derived data type variables passed as argument of procedures can be defined using the module which defines them, allowing traceability of such variables throughout the code

The procedure that uses such variables passed by argument needs to include the module that defines derived data types

Example of derived data types (comments compliant with Doxygen):  

Code Block
languagefortran
C====================================================================
C  DUMMYDEF_MOD                     modules/dummydef_mod.F       
C-------------------------------------------------------------------
C> Description:
C> define DUMMY struture
C-------------------------------------------------------------------
C> called by:                                                     \n
C> @ref     DUMMY                 starter/src/test/dummy.F        \n
C> @ref     DINIT                 starter/src/test/dinit.F        \n
C> @ref     DSOLVE                starter/src/test/dsolve.F       \n   
C>\n calling:                                                     \n
C====================================================================
      MODULE DUMMYDEF_MOD
C----------------------------------------------------------------------- 
C-----------------------------------------------
C   M o d u l e s
C-----------------------------------------------
C-----------------------------------------------
C   m y _ r e a l
C-----------------------------------------------
C-----------------------------------------------
C   I m p l i c i t   T y p e s
C-----------------------------------------------
#include      "implicit_f.inc"
C=================================================
        TYPE DUMMY_STRUCT_
C=================================================
          INTEGER :: L_ITAB !< size of integer array ITAB
          INTEGER :: L_RTAB !< size of integer array RTAB
          INTEGER, DIMENSION(:) , POINTER ::  ITAB !< integer array ITAB
          my_real, DIMENSION(:) , POINTER ::  RTAB !< real array RTAB
        END TYPE DUMMY_STRUCT_
      END MODULE DUMMYDEF_MOD

C====================================================================
C  DINIT                 starter/src/test/dinit.F       
C-------------------------------------------------------------------
C> Description:                                                    \n
C> Allocate and initialize variable MY_DUM of type DUMMY_STRUCT_
C-------------------------------------------------------------------
C> called by:                                                      \n
C> @ref      DUMMY                  starter/src/test/dummy.F       \n
C>\n calling:                                                      \n
C====================================================================
      SUBROUTINE DINIT(MY_DUM, NITEMS)
C-----------------------------------------------
C   M o d u l e s
C----------------------------------------------- 
      USE DUMMYDEF_MOD
C-----------------------------------------------
C   I m p l i c i t   T y p e s
C-----------------------------------------------
#include      "implicit_f.inc"
C-----------------------------------------------
C-----------------------------------------------
C   G l o b a l   P a r a m e t e r s
C-----------------------------------------------
C-----------------------------------------------
C   C o m m o n   B l o c k s
C-----------------------------------------------
C-----------------------------------------------
C   D u m m y   A r g u m e n t s
C-----------------------------------------------
      INTEGER, INTENT(IN) :: NITEMS
      TYPE(DUMMY_STRUCT_), INTENT(INOUT) :: MY_DUM
C-----------------------------------------------
C   L o c a l   V a r i a b l e s
C-----------------------------------------------
      INTEGER :: N, IERROR
C-----------------------------------------------
      MY_DUM%L_ITAB = NITEMS
      MY_DUM%L_RTAB = NITEMS

      ALLOCATE(MY_DUM%ITAB(MY_DUM%L_ITAB),MY_DUM%RTAB(MY_DUM%L_RTAB),
     &         STAT=ierror)
      IF (IERROR /= 0) THEN ! better to use MY_ALLOCATE macro instead
         print*,'error:',IERROR
         stop 123
      END IF

      DO N = 1, NITEMS
        MY_DUM%ITAB(N) = N
        MY_DUM%RTAB(N) = N**2
      END DO

      RETURN
      END SUBROUTINE DINIT

C====================================================================
C  DSOLVE                 starter/src/test/dsolve.F       
C-------------------------------------------------------------------
C> Description:                                                    \n 
C> Solve some dummy problem
C-------------------------------------------------------------------
C> called by:                                                       \n
C> @ref DUMMY                  starter/src/test/dummy.F             \n
C>\n calling:                                                       \n
C|====================================================================
      SUBROUTINE DSOLVE(MY_DUM,RES)
C-----------------------------------------------
C   M o d u l e s
C----------------------------------------------- 
      USE DUMMYDEF_MOD
C----6------------------------------------------
C   I m p l i c i t   T y p e s
C-----------------------------------------------
#include      "implicit_f.inc"
C-----------------------------------------------
C   G l o b a l   P a r a m e t e r s
C-----------------------------------------------
C   C o m m o n   B l o c k s
C-----------------------------------------------
C-----------------------------------------------
C   D u m m y   A r g u m e n t s
C-----------------------------------------------
      TYPE(DUMMY_STRUCT_), INTENT(INOUT) :: MY_DUM
      my_real, INTENT(OUT) :: RES
C-----------------------------------------------
C   L o c a l   V a r i a b l e s
C-----------------------------------------------
      INTEGER :: N, NITEMS,IERROR
      my_real :: VERIF 
C-----------------------------------------------
      NITEMS = MY_DUM%L_RTAB
      RES = 0.
      VERIF = 0.
      DO N = 1, NITEMS
        RES = RES + MY_DUM%RTAB(N)
        VERIF = VERIF + N**2
      END DO
      print *,'res=',res,' verif=0?:',RES-VERIF
      DEALLOCATE(MY_DUM%ITAB,MY_DUM%RTAB,STAT=IERROR)
      print *,'solve terminated with error code:',IERROR
      RETURN
      END
	
!====================================================================
! DUMMY                 starter/src/test/dummy.F                   
!====================================================================
!> Description:                                                    \n
!> Main routine which initialize and use a variable
!> of type DUMMY_STRUCT_
!====================================================================
!>\n called by:                                                    \n
!> @ref       DUMMYEXT              starter/src/test/dummyext.F    \n
!>\n calling:                                                      \n
!> @ref       DINIT                 starter/src/test/dinit.F       \n
!> @ref       DSOLVE                starter/src/test/dsolve.F      \n
!====================================================================
      SUBROUTINE DUMMY(NITEMS)
C-----------------------------------------------
C   M o d u l e s
C----------------------------------------------- 
      USE DUMMYDEF_MOD
C-----------------------------------------------
C   I m p l i c i t   T y p e s
C-----------------------------------------------
#include      "implicit_f.inc"
C-----------------------------------------------
C-----------------------------------------------
C   G l o b a l   P a r a m e t e r s
C-----------------------------------------------
C-----------------------------------------------
C   C o m m o n   B l o c k s
C-----------------------------------------------
C-----------------------------------------------
C   D u m m y   A r g u m e n t s
C-----------------------------------------------
      INTEGER, INTENT(IN) :: NITEMS ! number of NITEMS
C-----------------------------------------------
C   L o c a l   V a r i a b l e s
C-----------------------------------------------
      TYPE(DUMMY_STRUCT_) MY_DUM ! dummy structure
      my_real RES ! result of dummy solve
C-----------------------------------------------
      CALL DINIT(MY_DUM,NITEMS)
      CALL DSOLVE(MY_DUM,RES)
      RETURN
      END SUBROUTINE DUMMY

!====================================================================
! DUMMYEXT                 starter/src/test/dummy.F                   
!====================================================================
!> Description:                                                    \n
!> main program calling dummy                                      \n
!>\n called by:                                                    \n
!>\n calling:                                                      \n
!> @ref       DUMMY                 starter/src/test/d.F           \n
!====================================================================
!> \warning program should not be put in the same file, just for
!> convenience and for the aim to put some warning message blablabla
!> \bug not good to call "dummy(10)", should be passed by address instead of a value
C====================================================================
      PROGRAM DUMMYEXT
      CALL DUMMY(10)
C> @note this comment is ignored or would be applied to next subroutine
C never do that with DOxygen !!!
C all Doxygen comments need to be put before the routine definition
      END PROGRAM DUMMYEXT

Restart Variables

All the variables communicated between Starter and Engine are declared in module RESTART_MOD, used by subroutine ARRALLOC for their allocation, by RDRESB for their reading from RESTART file, then by RESOL_HEAD in which they are used as argument for RESOL subroutine. All these argument variables are then passed by argument to procedures called from RESOL, ...)

Interface definition

Fortran90 interface allows the compiler to do additional checks like coherency between argument types, attributes and number between calling and callee routines. It is required in some cases like when a procedure has a dummy argument with attribute ALLOCATABLE, POINTER, TARGET

In practice, it was introduced in few places of the code for routines which were called at several different places

Such coherency is automatically tested by QA static analysis tools (Forcheck).

And for pointer, the good practice is to use derived data types instead of pointer directly

Therefore, the remaining use of interface is regarding routine with optional arguments

This feature should be spread in the code instead of adding additional “dummy” arguments

Interface Example

Example of an interface for a routine called at several places in the code. The routine is put in a module to guarantee automatic update of interfaces and recompilations of all routines using this routine in case of change (dependence automatically found by compiler)

Code Block
languagefortran
MODULE INITBUF_MOD
      CONTAINS
Cgw|============================================================
Cgw|  INITBUF                         src/resol/initbuf.F
Cgw|------------------------------------------------------------
Cgw| Description :
Cgw| Initialisation of vect01_c.inc variables
Cgw|-- called by -----------
Cgw|  FORINT                          src/resol/forint.F
Cgw|  FORINTS                         src/resol/forints.F
Cgw|  ALEMAIN                         priv/ale/alemain.F
Cgw|============================================================
      SUBROUTINE INITBUF (IPARG    ,NG      ,
     2   MTN     ,LLT     ,NFT     ,IAD     ,ITY     ,
 C  … 
     6   IREP    ,IINT    ,IGTYP   ,ISRAT   ,ISROT   ,
     7   ICSEN   ,ISORTH  ,ISORTHG ,IFAILURE)
C-----------------------------------------------
C   I m p l i c i t   T y p e s
C-----------------------------------------------
#include      "implicit_f.inc"
C-----------------------------------------------
C   C o m m o n   B l o c k s
C-----------------------------------------------
#include      "param_c.inc"
C-----------------------------------------------
C   D u m m y   A r g u m e n t s
C-----------------------------------------------
      INTEGER, INTENT (IN)  :: IPARG(NPARG,NGROUP),NG
      INTEGER, INTENT (OUT) :: MTN,LLT,NFT,IAD,ITY,NPT,JALE,ISMSTR,
     .   JEUL,JTUR,JTHE,JLAG,NVAUX,JMULT,JHBE,JIVF,JPOR,JPLA,JCLOSE,
     .   IREP,IINT,IGTYP,JCVT,ISROT,ISRAT,ISORTH,ISORTHG,ICSEN,IFAILURE
C-----------------------------------------------
C   S o u r c e  L i n e s
C======================================================================
      MTN     = IPARG(1,NG)    
      LLT     = IPARG(2,NG)    
…
      JCLOSE  = IPARG(33,NG)    
      IREP    = IPARG(35,NG)      
      IINT    = IPARG(36,NG)    
      JCVT    = IPARG(37,NG)
      IFAILURE = IPARG(43,NG)                  
C----
      RETURN
      END SUBROUTINE INITBUF
      END MODULE INITBUF_MOD

Cgw|============================================================
Cgw|  FORINT                          src/resol/forint.F
Cgw|------------------------------------------------------------
Cgw|-- called by -----------
Cgw|         RESOL                           src/resol/resol.F
Cgw|-- valls ---------------
Cgw|         INITBUF                         src/resol/initbuf.F
Cgw|============================================================
      SUBROUTINE FORINT(
     1    PM        ,GEO       ,X         ,A         ,AR        ,
     2    V         ,VR        ,MS        ,IN        ,W         ,
C… 
     K    MSNF      ,IGEO      ,IPM       ,XSEC      ,ITASK)
C-----------------------------------------------
C   M o d u l e s
C-----------------------------------------------
      USE INITBUF_MOD
C----6---------------------------------------------------------------7---------8
C   I m p l i c i t   T y p e s
C-----------------------------------------------
#include      "implicit_f.inc"
#include      "comlock.inc"
C-----------------------------------------------
C   G l o b a l   P a r a m e t e r s
C-----------------------------------------------
#include      "mvsiz_p.inc"
C-----------------------------------------------
C   C o m m o n   B l o c k s
C-----------------------------------------------
#include      "com01_c.inc"
#include      "com03_c.inc"
C-----------------------------------------------------------------
C   D u m m y   A r g u m e n t s
C-----------------------------------------------
      INTEGER IXS(NIXS,*),
     .   IXQ(NIXQ,*), IXT(NIXT,*), IXP(NIXP,*),
     .   IXR(NIXR,*), IELVS(*), IGEO(NPROPGI,*),
     .   IXS16(8,*),IADS16(8,*),ITASK
C     REAL ou REAL*8
      my_real
     .   X(3,*)    ,D(3,*)      ,V(3,*)   ,VR(3,*),
     .   MS(*)   ,IN(*)   ,PM(NPROPM,*),SKEW(9,*),GEO(NPROPG,*),
     .   BUFMAT(*) ,W(3,*)    ,VEUL(*),TF(*) ,FR_WAVE(*) 
C-----------------------------------------------
C   L o c a l   V a r i a b l e s
C-----------------------------------------------
      INTEGER INDXOF(MVSIZ)
      INTEGER I,II,J,N
      my_real
     .   FX(MVSIZ,20),FY(MVSIZ,20),FZ(MVSIZ,20),
     .   MX(MVSIZ,4),MY(MVSIZ,4),MZ(MVSIZ,4)
C======================================================================|
           CALL INITBUF (IPARG    ,NG      ,
     2        MLW     ,NEL     ,NFT     ,KAD     ,ITY     ,
     3        NPT     ,JALE    ,ISMSTR  ,JEUL    ,JTUR    ,
     4        JTHE    ,JLAG    ,JMULT   ,JHBE    ,JIVF    ,
     5        NVAUX   ,JPOR    ,JCVT    ,JCLOSE  ,IPLA    ,
     6        IREP    ,IINT    ,IGTYP   ,ISRAT   ,ISROT   ,
     7        ICSEN   ,ISORTH  ,ISORTHG ,IFAILURE)
C               
           ICNOD   = IPARG(11,NG)
           KADDSA  = 1+(KAD-1)*NDSAEXT
           NSG     = IPARG(10,NG)
C-----------
      RETURN
      END

Memory Allocation

Expand

Dynamic memory allocation mechanism

Dynamic memory allocation is directly done at the Fortran90 level using MY_ALLOCATE macro

This macro allows automatic error checks encapsulating call to Fortran90 ALLOCATE statement

Previously, allocation error check was done by hand by the:

Code Block
languagefortran
CALL ALLOCATE(ITAB(NUMNOD),STAT=IERR)
       IF(IERR/=0)THEN
               WRITE(ISTDO,’(A)’) ‘ERROR IN MEMORY ALLOCATION’
               WRITE(IOUT,’(A)’) ‘ERROR IN MEMORY ALLOCATION’
               CALL ARRET(2)
       END IF
C…
       CALL DEALLOCATE(ITAB)

In practice,  error checking was missing in many places. Therefore, the idea to use a macro to automatically control allocation, handle error message and execution stop in case of failure was implemented.

Here is the macro detail:

Code Block
languagefortran
#ifndef MY_ALLOCATE
#define MY_ALLOCATE(ARRAY,LENGTH)\
ALLOCATE(ARRAY(LENGTH),STAT=MY_IERR);\
IF(MY_IERR/=0) THEN;\
CALL ANCMSG(MSGID=268,MSGTYPE=MSGERROR,ANMODE=ANSTOP,C1=#ARRAY);\
ENDIF
#endif

The previous code becomes:

Code Block
languagefortran
       USE MESSAGE_MOD
C…
#include “my_allocate.inc”
C…
       CALL MY_ALLOCATE(ITAB,NUMNOD)
C…
       CALL DEALLOCATE(ITAB)

Developers are required to check the success of the allocation

The message printed by this macro in case of allocation failure is rather generic. For large arrays it is preferred to print a specific message, with advice for the user, or at least the option concerned by this failure

Global Memory

Memory allocation of global data structures, arrays and derived data types, should be done at the highest level, in LECTUR for Starter, in RADIOSS2 or RESOL for Engine

It is advised to use derived data type with structure of arrays. This way it is possible to declare the variable at the upper level, gather the allocation of array members in a dedicated subroutine, then use the variable in procedures called at lower level without losing traceability

Local Memory

In a procedure, local variable allocation method depends on its size:

The size is known and limited to a multiple of MVSIZ

Automatic allocation in the stack is ok

The size is not known or larger than times MVSIZ

It is then too large to use automatic allocation in the stack. It is therefore needed to use dynamic allocation in the heap

Automatic Allocated arrays go into Stack

ALLOCATED ARRAYS go into Heap

One should take care to reduce Stacksize usage to a reasonable size.

Stacksize is hardcoded under Windows

It is allowed to use MY_ALLOCATE at the beginning of a routine provided matching DEALLOCATE is done at the end of this routine

In case of multiple calls to MY_ALLOCATE, a call to DEALLOCATE must be applied to the variable used in the last previous call to MY_ALLOCATE to avoid a memory hole and need for a garbage collector

Local Memory Example:

Code Block
languagefortran
CALL MY_ALLOCATE(VAR1)
      CALL MY_ALLOCATE(VAR2)
C…
      DEALLOCATE(VAR2)
C…
      CALL MY_ALLOCATE(VAR3)
C…
      DEALLOCATE(VAR3)
C…
      DEALLOCATE(VAR1)

Shared Memory Programming (SMP) and memory allocation

For RADIOSS Engine, OpenMP programming model is used for second level parallelization

By default any memory allocation done outside of a parallel section is shared between threads

Most of the parallel sections are started from RESOL. So for the need of a shared memory array, the simplest way is to declare it at the level of RESOL. It will be shared inside the different parallel sections started from RESOL, possibly passed by argument to routines called from RESOL

The same way, any variable defined in a common or module is shared by default

For pointer, notice that a single thread needs to allocate and deallocate it. The programmer has to manage synchronization in order to insure such a variable is allocated before being used by any other thread and no longer used before it is deallocated

The !$OMP THREADPRIVATE directive overrides default behavior by creating thread local storage variables

Array Aliasing

Expand

Description

Here we discuss different arguments of a procedure referencing the same memory locations. The compiler won’t be able to detect in the procedure that different argument variables reference one or more identical memory locations. Such a situation is particularly dangerous because of compiler optimization. Even if compilers are not forbidden it, if both variables are modified inside the procedure this could lead to unpredictable results. Potential conflicts or dependencies won’t be detected

Code Example:

Code Block
languagefortran
PROGRAM OVERMAIN
C------------------------------------------
       IMPLICIT NONE
C------------------------------------------
       PARAMETER (MAXLEN = 40)
       INTEGER BUFFER(MAXLEN),I,I1,LEN
C------------------------------------------
       I1 = 10
       LEN = 22
       DO I = 1, LEN
         BUFFER(I1+I-1) = I
       ENDDO
       print *,’init : buffer =’,(BUFFER(I1+I-1),I=1,LEN)
       CALL SHIFTI1(BUFFER,BUFFER(I1),I1,LEN) ! potential overlap
       print *,’result: buffer =’,(BUFFER(I1+I-1),I=1,LEN)
       STOP
       END

       SUBROUTINE SHIFTI1(BUFFER,TAB,I1,LEN)
C------------------------------------------
       IMPLICIT NONE
C------------------------------------------
       INTEGER BUFFER(*),TAB(*),I1,LEN
C------------------------------------------
       INTEGER I,LEN0,I10
C------------------------------------------
       LEN0 = 19
       I10 = I1 - LEN + LEN0
       DO I = 1, LEN
         BUFFER(I10+I-1) = TAB(I) ! true overlap
       ENDDO
       I1 = I10
       RETURN
       END

Tested on SGI O200 IRIX 6.4: output:

Code Block
languagebash
skippy 61% f77 -O2 -c *.F
overmain.F:
shifti1.F:
skippy 62% f77 -O2 -o overlaps *.o
skippy 63% overlaps
init : buffer =
1 2 3 4 5 6
7 8 9 10 11 12
13 14 15 16 17 18
19 20 21 22
result: buffer =
1 2 6 4 5 6
10 8 9 10 14 12
13 14 18 16 17 18
22 20 21 22
Note

Conclusion: Array aliasing is forbidden when the variable is modified inside the procedure

\uD83D\uDCCB Related articles

OpenRadioss Coding Standards
Table of Contents
minLevel1
maxLevel7
stylenone

Introduction

This page deals with vectorization and optimization of Radioss Fortran code. This is a fundamental aspect of the code that needs to be well understood and learned by new Radioss programmers. Breaking performance of current code is not allowed. Furthermore, new functionality should be developed taking into account the same level of care regarding performance.

Vectorization deals with the execution of computational loops. It allows a computer to compute several loop indexes during the same cycle leveraging vector registers. This concept was first introduced on vector supercomputers (CRAY, NEC, FUJITSU…)

Nonetheless, though there are no more vector supercomputers, vectorization is reintroduced on modern CPUs like Xeon processor with AVX and AVX512

For instance, AVX512 first introduced into Intel Xeon Phi Knights Landing and Xeon Skylake allows the handling of 8x 64-bit double precision real at the same time or 16x single precision  

Vectorization

Expand

Vector Length

Most computations in Radioss, like element or contact forces, are performed by packets of MVSIZ. This parameter is adjustable to match so-called vector length. This parameter is also important for cache locality. It is optimized by parallel development team according to hardware characteristics

New treatments need to respect this programming model which is to split the loop over number of elements or nodes by packets of MVSIZ. This will ensure optimal vector length and cache size as well as minimal local storage (local arrays of size MVSIZ instead of number of elements or nodes)

Loop Control

IF/THEN/ELSE

It is recommended to minimize the use of IF/THEN/ELSE instruction inside computational loop

Every time a test does not depend on loop index value, it is asked to perform it outside of such loop

GOTO

GOTO is absolutely forbidden inside computational loops as it inhibits vectorization and optimization

EXIT/CYCLE

EXIT and CYCLE need to be minimized and avoided in computational loops

Loop Size

Inside a loop it is recommended to keep the number of instructions reasonable. 20 instructions or less is good. Very long loops should be split to keep cache efficiency

Most compilers will be able to fuse short loops, while they will probably fail to vectorize long complex loops

Data Dependency

The loop below is not vectorized due to possible dependence (same value of INDEX(I) for different I): 

Code Block
languagefortran
DO I = 1, N
         K = INDEX(I)
         A(K) = B(K)
         B(K) = 2*A(K)
END DO

In case of no true dependence, vectorization needs to be forced by adding a compiler directive

To keep portability across different platforms and compilers, an architecture specific include file exists named vectorize.inc that manages vectorization directives. The programmer just needs to add this include file just before the DO loop:

Code Block
languagefortran
#include "vectorize.inc"
       DO I = 1, N
            K = INDEX(I)
            A(K) = B(K)
            B(K) = 2*A(K)
       END DO

Notice there is another include file named simd.inc which makes unconditional vectorization, even if a true dependence is detected by the compiler. It is recommended to only use vectorise.inc which is more conservative regarding correctness

For Intel compiler:

  • vectorize.inc corresponds to !DIR$ IVDEP

  • simd.inc corresponds to !DIR$ SIMD

Procedure CALL

Calling a procedure inside a loop inhibits vectorization therefore it is not authorized

Inside Radioss, there are vectorized versions of procedures, basically the loop is put inside the procedure rather than outside:

  • VALPVEC to replace JACOBIEW

  • VINTER to replace FINTER

Nested Loops

In practice, only the inner most loop will be vectorized. So the inner most loop needs to be the largest one.

For fixed size loops it is possible to unroll them by hand or to use Fortran90 enhancement. Then the compiler is able to vectorize the outer loop

Note

Unroll by hand is not enough on AVX512 or SSE architectures

Only Fortran 90 syntax helps in this case

Example Nested Loop with NEL >> DIM

Code Block
languagefortran
DO I = 1, NEL
    DO J=1, DIM
      A(I,J) = B(I,J) + C(I,J)
    END DO
END DO 

To be transformed to:

Code Block
languagefortran
DO J=1, DIM
    DO I = 1, NEL
     A(I,J) = B(I,J) + C(I,J)
    END DO
END DO 

Or using Fortran90 notation:

Code Block
languagefortran
DO I=1,NEL
    A(I,:DIM)=B(I,:DIM) + C(I,:DIM)
END DO

Or for this simple case:

Code Block
languagefortran
A(:NEL,:DIM)=B(:NEL,:DIM) + C(:NEL,:DIM)

Arithmetic Functions

Expand

Power

Never use real variable for integer power because of the cost of real power arithmetic. Take

care to not use real variable defined in constant.inc when integer is enough

A**2

Allowed

A**DEUX

Forbidden as DEUX is a my_real variable defined in constant.inc

A**DTIERS

here there is no other choice as a real power arithmetic is required

Div

For invariant, it is advised to multiply by invert instead of doing a division by a constant inside a loop

Arrays

Expand

Fortran90 Array Operations

Use of Fortran90 array operations is encouraged as long as code readability is kept, by always specifying array bounds to avoid confusion between variable and array arithmetic.

 Example:

Code Block
languagefortran
INTEGER, DIMENSION(NUMNOD) :: A, B, C
A = B + C

    ! confusion between variable and array operation

Code Block
languagefortran
A(:NUMNOD) = B(:NUMNOD) + C(:NUMNOD)

   ! default lower bound:1 

Multidimensional Arrays

Data Locality

Large arrays over a number of nodes or elements are defined to maximize data locality and have therefore the smallest dimension first, like in the example below:

Code Block
languagefortran
X(3,NUMNOD), V(3,NUMNOD), A(3,NUMNOD)

Leading Dimension for Vectorization

For vectorization on Xeon, it is better to have leading dimension first. So, depending on array size and access pattern a compromise needs to be found:

  • For large arrays like X, V, A, it is better to keep locality

  • For data structure of few times MVSIZ, like new element buffer arrays or temporary arrays of size x times MVSIZ, having largest dimension first is better:    

HOUR (MVSIZ,5) better than HOUR(5,MVSIZ)

According to test with recent Intel compiler, Fortran90 array notation can also improve code generated:

A(I,1:3) = B(I,1:3) + C(I,1:3) no need to unroll loop

Structure Of Arrays

Use structure of arrays (left example) rather than arrays of structure (right example)

Correct

Incorrect

Code Block
languagefortran
TYPE GRID_STRUCT_
 my_real ::  x(100),y(100),z(100)
END TYPE GRID_STRUCT_

Code Block
languagefortran
TYPE POINT_STRUCT_
  my_real ::  x, y, z
END TYPE POINT_STRUCT_

Code Block
languagefortran
TYPE(GRID_STRUCT_) MESH;
! a grid of 3 arrays x, y, z of size 100

Code Block
languagefortran
TYPE(POINT_STRUCT_) P(100);
! an array of size 100 of 3 scalars x, y, z

Warning On Large Array Initialization

Especially inside Radioss Starter it is common to find code using large array flag over NUMNOD initialized to zero many times which is time consuming

Here is a method to avoid such an issue: left is original poorly performing code, right is optimized version

Original (Poor Performance)

Optimized Code

Code Block
languagefortran
DO I=1, IX_TIMES
!IX_TIMES set to 0 
         TAG(1:NUMNOD)=0
         DO J=1,FEW_ITEMS
           N = INDEX(I,J)
           IF(TAG(N) == 0)THEN
! but few values change
             TAG(N)=1
! additional treatments…  
           END IF   
         END DO
       END DO

 

Code Block
languagefortran
! single init to 0
TAG(1:NUMNOD)=0
DO I=1, IX_TIMES
  DO J=1,FEW_ITEMS
    N = INDEX(I,J)
    IF(TAG(N) == 0)THEN
      TAG(N)=1
! additional treatments…
    END IF
  END DO
! set to 0 only if needed
  DO J=1 , FEW_ITEMS
    N = INDEX(I,J)
    TAG(N) = 0
  END DO
END DO

Additional Fortran90 Restrictions Regarding Efficiency

Operator Overloading

Usage of object oriented feature like operator overloading, while it is efficient in code writing and clarity is not recommended due to lack of performance efficiency