Unsupported data type

Causes:
  • The loop assigns one struct variable to another, but the structure does not define an assignment operator, so the struct assignment has no translation in terms of scalars.
  • The compiler does not support certain data types because there is no corresponding SIMD instruction.
  • The compiler cannot vectorize a loop containing complex, long, numeric types that do not fit in the vector register width.
C++ Example:
struct char4 {
  char c1;
  char c2;
  char c3;
  char c4;
};

extern struct char4 *a;

void vecmsg_testcore003 () {
  int i;
  const struct char4 n = {0, 0, 0, 0};
  #pragma omp simd
  for(i = 0; i < 1024; i++) {
    a[i] = n;
  }
}

Recommendations:

  • Provide struct assignment operators in terms of scalars. For example:
    inline char4& operator=(const char4 &x) {
      c1 = x.c1;
      c2 = x.c2;
      c3 = x.c3;
      c4 = x.c4;
      return *this;
    }
  • Use standard data types.
  • Use instruction sets that support wider vectors.
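The "use standard data types" recommendation can be sketched as follows. This is only an illustration under stated assumptions: the function name fill_flat and the flat layout (element i of the original char4 array stored at a[4*i] through a[4*i + 3]) are hypothetical and do not come from the original example, and whether the loop vectorizes still depends on the compiler and target.

```cpp
#include <cstddef>

// Hypothetical flat layout: element i of the original char4 array occupies
// a[4*i] .. a[4*i + 3], so the whole buffer is a plain array of char.
void fill_flat(char* a, std::size_t count, char value) {
    for (std::size_t i = 0; i < 4 * count; i++) {
        a[i] = value;   // unit-stride store of a standard scalar type
    }
}
```

Calling fill_flat(buffer, 1024, 0) on a 4096-byte buffer would correspond to the loop in the example above.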

Not inner loop

Cause: In nested loop structures, the compiler targets the innermost loop for vectorization. The outer loop, by default, is not a target for vectorization; however, it may be a target for parallelization.
C++ Example:
#include <iostream>
#define N 25

int main() {
  int a[N][N], b[N], i;
  for(int j = 0; j < N; j++)
  {
    for(int i = 0; i < N; i++)
      a[j][i] = 0;
    b[j] = 1;
  }
  int sum = __sec_reduce_add(a[:][:]) + __sec_reduce_add(b[:]);
  return 0;
}

Recommendation:

In some cases it is possible to collapse a nested loop structure into a single loop structure using a directive before the outer loop. The n argument is an integer that specifies how many loops to collapse into one loop for vectorization.
Target | ICL/ICC/ICPC Directive | IFORT Directive
Outer loop | #pragma omp simd collapse(n), #pragma omp simd, or #pragma simd | !$OMP SIMD COLLAPSE(n), !$OMP SIMD, or !DIR$ SIMD
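A minimal sketch of the collapse directive, assuming OpenMP SIMD support is enabled at compile time (otherwise the pragma is ignored). The function name init_arrays is hypothetical. Because collapse is generally applied to perfectly nested loops, the b[j] store from the example above is moved into its own loop here.

```cpp
constexpr int N = 25;

// Hypothetical helper: initializes a to 0 and b to 1.
void init_arrays(int a[N][N], int b[N]) {
    // The two loops form one iteration space of N * N iterations.
    #pragma omp simd collapse(2)
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[j][i] = 0;

    // The store that made the original nest imperfect gets its own loop.
    for (int j = 0; j < N; j++)
        b[j] = 1;
}
```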

Remainder loop vectorization possible but seems inefficient

Cause: The compiler vectorizer determined the remainder loop will not benefit from vectorization.
C++ Example:
#include <iostream>
#define N 70

int main() {
  static short tab1[N], tab2[N];
  int i, j;
  static short const data[] = {-32768, -256, -255, -128, -127, -1, 0, 1, 127, 128, 255, 256, 32767};

  for (j = i = 0; i < N; i++) {
    tab1[i] = i;
    tab2[i] = data[j++];
    if (j > 12)
      j = 0;
  }
  int sum = __sec_reduce_add(tab1[:]) + __sec_reduce_add(tab2[:]);
  return 0;
}

Recommendations:

  • Force remainder vectorization using a directive before the loop:
    Target | ICL/ICC/ICPC Directive | IFORT Directive
    Source loop | #pragma vector vecremainder | !DIR$ SIMD VECREMAINDER
  • Disable remainder vectorization using a directive before the loop:
    Target | ICL/ICC/ICPC Directive | IFORT Directive
    Source loop | #pragma vector novecremainder | !DIR$ SIMD NOVECREMAINDER

Loop vectorization possible but seems inefficient

Cause: The compiler vectorizer determined the loop will not benefit from vectorization. Common reasons include:
  • Non-unit stride memory access
  • Indirect memory access
  • Low iteration count
C++ Example: The compiler vectorizer determines the cost of creating a vector operand (non-unit stride access in the vector operand creation) is significant when compared to the number/type of computations in which those vector operands are used.
#include <iostream>
#define N 100

struct s1 {
  int a, b, c;
};

int main() {
  s1 arr[N] = {}, sum = {};   // zero-initialize so the sums are well defined
  for(int i = 0; i < N; i++) {
    sum.a += arr[i].a;
    sum.b += arr[i].b;
    sum.c += arr[i].c;
  }
  std::cout << sum.a << "\t" << sum.b << "\t" << sum.c << "\n";
  return 0;
}

Recommendations:

  • If you still believe vectorization might result in a speedup, override the compiler cost model using a directive before the loop:
    Target | ICL/ICC/ICPC Directive | IFORT Directive
    Source loop | #pragma vector or #pragma vector always | !DIR$ VECTOR or !DIR$ VECTOR ALWAYS
    Alternatively, use a compiler option to always vectorize loops. The compiler will still test for dependencies and will not vectorize the loop unless it is safe.
    Windows* OS - ICL and IFORT Option | Linux* OS - ICC/ICPC and IFORT Option
    /Qvec-threshold0 | -vec-threshold0
  • Require vectorization using a directive before the loop. The compiler will not perform a dependency analysis; it is your responsibility to ensure vectorization is safe:
    Target | ICL/ICC/ICPC Directive | IFORT Directive
    Source loop | #pragma simd or #pragma omp simd | !DIR$ SIMD or !$OMP SIMD
  • Rewrite the data structure/loop to have more regular memory accesses.
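One way to get more regular memory accesses is to switch from an array of structures to a structure of arrays, so that each field is read with unit stride. The sketch below is an assumption-laden illustration: the type names s1_soa and sums and the function sum_fields are hypothetical and not part of the original example.

```cpp
#include <array>

constexpr int N = 100;

// Structure of arrays: each field is contiguous in memory.
struct s1_soa {
    std::array<int, N> a{};
    std::array<int, N> b{};
    std::array<int, N> c{};
};

struct sums { int a, b, c; };

// Each loop walks one contiguous array, a plain unit-stride reduction.
sums sum_fields(const s1_soa& arr) {
    sums s = {0, 0, 0};
    for (int i = 0; i < N; i++) s.a += arr.a[i];
    for (int i = 0; i < N; i++) s.b += arr.b[i];
    for (int i = 0; i < N; i++) s.c += arr.c[i];
    return s;
}
```

The trade-off is that code which touches all three fields of one element at a time now reads from three separate arrays.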

Conditional assignment to a scalar

Causes:
  • The loop has an assignment operation of a structure variable and there is a complex condition controlling this assignment.
  • The loop contains a conditional statement and one of the following is true:
    • The conditional statement controls the assignment of a scalar value and the value of this variable is used in any of the next iterations or after the loop executes. Exception: loops searching for max, min values and their indices in the array.
    • The value of the scalar when loop execution ends depends on the loop executing iterations in strict order.
C++ Example:
void foo(int *A, int *restrict B, int n, int* x) {
  int i;
  #pragma omp simd
  for (i = 0; i < n; i++)
  {
    if (A[i] > i)
      *x = i;
    else
      B[i] = *x;
  }
  B[i] = *x++;
}

Recommendations:

Simplify or remove conditions in the loop by:
  • Dividing the loop into a group of sequential loops
  • Or using multiple temporary variables instead of one scalar variable

Assumed dependence between lines

Causes:
  • Anti-dependency - Write after read (WAR) - is assumed in a loop.
  • True dependency - Read after write (RAW) - is assumed in a loop.
C++ Example: When the compiler tries to vectorize for SSE2 architecture, it chooses a vector length of 4 (because the data type it operates on is int). But when considering a vector operand instead of scalar operands for this loop, there is an overlap between the input vector and output vector. Anti-dependency occurs when the k value is positive; true dependency occurs when k value is negative.
#include <stdlib.h>
#define N 70

int main(int argc, char *argv[]) {
  int k = atoi(argv[1]);
  int a[N], i;
  for(i = abs(k); i < N; i++)
    a[i] = a[i+k] + 1;
  return 0;
}

Recommendations:

  • Rewrite code to remove dependencies.
  • Run a Dependencies analysis to check if the loop has real dependencies.
  • If no dependencies exist, use one of the following to tell the compiler it is safe to vectorize:
    • Directive to ignore all assumed dependencies in the loop
      Target | ICL/ICC/ICPC Directive | IFORT Directive
      Source Loop | #pragma simd or #pragma omp simd | !DIR$ SIMD or !$OMP SIMD
    • Directive to ignore only vector dependencies (which is safer)
      Target | ICL/ICC/ICPC Directive | IFORT Directive
      Source Loop | #pragma ivdep | !DIR$ IVDEP
    • restrict keyword
  • If anti-dependency exists, use a directive where k is smaller than the distance between dependent items in anti-dependency. This enables vectorization, as dependent items are put into different vectors:
    Target | ICL/ICC/ICPC Directive | IFORT Directive
    Source Loop | #pragma simd vectorlength(k) | !DIR$ SIMD VECTORLENGTH(k)
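As a sketch of "rewrite code to remove dependencies", the shifted update from the example can be made out-of-place so that reads and writes never touch the same storage. The function name shift_add and the clamped upper bound are assumptions added for illustration. For positive k the result matches the in-place loop; for negative k it does not, because the in-place loop then reads values written by earlier iterations, so this rewrite only applies when the original input values are what the algorithm needs.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical out-of-place variant of a[i] = a[i+k] + 1.
// Reads come from `in`, writes go to `out`, so no iteration can observe
// a value written by another iteration.
std::vector<int> shift_add(const std::vector<int>& in, int k) {
    const int n = static_cast<int>(in.size());
    std::vector<int> out(in);             // elements outside the loop range stay unchanged
    const int lo = std::abs(k);
    const int hi = (k > 0) ? n - k : n;   // keep in[i + k] inside the array
    for (int i = lo; i < hi; i++)
        out[i] = in[i + k] + 1;
    return out;
}
```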

Non-standard loop is not a vectorization candidate (C++)

SCENARIO 1 - THERE IS MORE THAN ONE LOOP EXIT POINT
C++ Example:
void no_vec(float a[], float b[], float c[]) {
  int i = 0;
  while (i < 100) {
    a[i] = b[i] * c[i];
    // this is a data-dependent exit condition:
    if (a[i] < 0.0)
      break;
    ++i;
  }
}
Exception: Loops searching for an array element, as in the example below, can be automatically vectorized when the array a is aligned.
for (i = 0; i < n; ++i) {
  if (a[i] == to_find) {
    index = i;
    break;
  }
}
Recommendation: Ensure loops have a single entry and a single exit point.
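One possible way to give the store loop a single exit is to split the work into a search loop that finds the exit index and a counted loop that does the stores. The sketch below is hypothetical (the name split_exit is not from the original) and assumes that a does not overlap b or c; whether either loop is vectorized remains up to the compiler.

```cpp
// Hypothetical rewrite of no_vec(): find the exit index first, then run
// a counted loop with a single exit.
void split_exit(float a[], const float b[], const float c[]) {
    const int n = 100;

    // Loop 1: search for the first index whose product is negative.
    int stop = n;
    for (int i = 0; i < n; ++i) {
        if (b[i] * c[i] < 0.0f) {
            stop = i + 1;   // the original stores a[i] before breaking
            break;
        }
    }

    // Loop 2: trip count is known on entry and there is no early exit.
    for (int i = 0; i < stop; ++i)
        a[i] = b[i] * c[i];
}
```

The products are computed twice for the elements before the exit point, which is the price paid for the regular second loop.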

SCENARIO 2 - A SIMD LOOP USES C++ EXCEPTION HANDLING OR AN OPENMP CRITICAL CONSTRUCT
Recommendation: Remove C++ exception handling and OpenMP critical sections from loops.

SCENARIO 3 - THE COMPILER CANNOT DETERMINE WHICH FUNCTION IS PASSED AS A FUNCTION PARAMETER
C++ Example:
#include <iostream>

int a[100];
int b[100];

int g(int i, int y) {
  return b[i]+y;
}

__declspec(noinline)
void doit1(int x(int,int), int y) {
  int i;
  #pragma parallel
  for(i = 0; i < 100; i++)
    a[i] = x(i,y);
}


Non-standard loop is not a vectorization candidate (Fortran)

SCENARIO 1 - THERE IS MORE THAN ONE LOOP EXIT POINT
Fortran Example:
subroutine d_15043(a,b,c,n)
  implicit none
  real, intent(in ), dimension(n) :: a, b
  real, intent(out), dimension(n) :: c
  integer, intent(in) :: n
  integer :: i

  do i=1,n
    if(a(i).lt.0.) exit
    c(i) = sqrt(a(i)) * b(i)
  enddo
end subroutine d_15043
Recommendation: Ensure:
  • The loop has a single entry and a single exit point.
  • The iteration count is constant and known to the loop on entry.
This loop can be vectorized if you replace exit with cycle, although the behavior is different.

SCENARIO 2 - THE ITERATION COUNT IS DATA DEPENDENT
Fortran Example:
subroutine d_15043_2(a,b,c,n)
  implicit none
  real, intent(in ), dimension(n) :: a, b
  real, intent(out), dimension(n) :: c
  integer, intent(in) :: n
  integer :: i

  i = 1
  do while (a(i) > 0.)
    c(i) = sqrt(a(i)) * b(i)
    i = i + 1
  enddo
end subroutine d_15043_2
Recommendation: Replace the do while construct with a counted do loop. For example:
do i=1,n
  if(a(i).ge.0.) c(i) = sqrt(a(i)) * b(i)
enddo
If necessary, the iteration count can be pre-computed.

SCENARIO 3 - THE LOOP CONTAINS A SUBROUTINE OR FUNCTION CALL THAT PREVENTS VECTORIZATION
Fortran Example:
subroutine d_15043_3(a,b,c,n)
  implicit none
  real, intent(in ), dimension(n) :: a, b
  real, intent(out), dimension(n) :: c
  integer, intent(in) :: n
  integer :: i

  do i=1,n
    call my_sub(a(i),b(i),c(i))
  enddo
end subroutine d_15043_3
Recommendation: Do one of the following:
  • Inline the subroutine. For example: Use interprocedural optimization.
  • Convert to a SIMD-enabled subroutine. For example: Use the !$OMP DECLARE SIMD directive.
SCENARIO 4 - THERE ARE OTHER COMPLEX CONTROL STRUCTURES
For example: There may be multiple GOTO statements.


Vector dependence prevents vectorization

Cause: The compiler detected or assumed a vector dependence in the loop.
C++ Example:
int foo(float *A, int n) {
  int inx = 0;
  float max = A[0];
  int i;
  for (i=0;i < n;i++) {
    if (max < A[i]) {
      max = A[i];
      inx = i*i;
    }
  }
  return inx;
}
Fortran Example:
integer function foo(a, n)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: a(n)
  real :: max
  integer :: inx, i

  inx = 0
  max = a(1)
  do i=1,n
    if (max < a(i)) then
      max = a(i)
      inx = i*i
    endif
  end do

  foo = inx

end function

Recommendations:

  • Rewrite code to remove dependencies.
  • Run a Dependencies analysis to check if the loop has real dependencies. There are two types of dependencies:
    • True dependency - Read after write (RAW)
    • Anti-dependency - Write after read (WAR)
  • If no dependencies exist, use one of the following to tell the compiler it is safe to vectorize:
    • Directive to ignore all assumed dependencies in the loop
      Target | ICL/ICC/ICPC Directive | IFORT Directive
      Source Loop | #pragma simd or #pragma omp simd | !DIR$ SIMD or !$OMP SIMD
    • Directive to ignore only vector dependencies (which is safer)
      Target | ICL/ICC/ICPC Directive | IFORT Directive
      Source Loop | #pragma ivdep | !DIR$ IVDEP
    • restrict keyword
  • If anti-dependency exists, use a directive where k is smaller than the distance between dependent items in anti-dependency. This enables vectorization, as dependent items are put into different vectors:
    Target | ICL/ICC/ICPC Directive | IFORT Directive
    Source Loop | #pragma simd vectorlength(k) | !DIR$ SIMD VECTORLENGTH(k)
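For the C++ example in this section, one possible rewrite is to track the plain index of the maximum inside the loop and square it once afterwards, which turns the loop into an ordinary search for a maximum and its index. This is a sketch under assumptions: the function name foo_index_then_square is hypothetical, and whether the compiler then vectorizes the loop is not guaranteed.

```cpp
// Hypothetical rewrite of foo(): the loop only records the index of the
// maximum; the i*i computation moves out of the loop.
int foo_index_then_square(const float* A, int n) {
    int best = 0;
    float max = A[0];
    for (int i = 0; i < n; i++) {
        if (max < A[i]) {
            max = A[i];
            best = i;      // plain max-with-index search
        }
    }
    return best * best;    // same value the original stores in inx
}
```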

Call to function cannot be vectorized (C++)

Causes:
  • The loop has a call to a function that has no vector version.
  • A user-defined vector function cannot be vectorized because the function body invokes other functions that cannot be vectorized.
C++ Example:
#include <iostream>
#include <complex>
using namespace std;

int main() {
  float c[10];
  c[:] = 0.f;
  for(int i = 0; i < 10; i++)
    cout << c[i] << "\n";
  return 0;
}

Recommendations:

If possible, define a vector version for the function using a construct:
Target | ICL/ICC/ICPC Construct | IFORT Construct
Source function | #pragma omp declare simd | !$OMP DECLARE SIMD
Source function | __declspec(vector) (Windows OS) or __attribute__((vector)) (Linux OS) | ELEMENTAL keyword or !DIR$ ATTRIBUTES VECTOR
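A minimal sketch of the OpenMP construct in C++, assuming OpenMP SIMD support is enabled at compile time (otherwise the pragmas are ignored and the code runs as scalar code). The function names scale_add and apply are hypothetical.

```cpp
// Hypothetical element function; the pragma asks the compiler to also
// generate a vector variant of it.
#pragma omp declare simd
float scale_add(float x, float y) {
    return 2.0f * x + y;
}

// The caller loop can then use the vector variant of scale_add().
void apply(float* out, const float* x, const float* y, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        out[i] = scale_add(x[i], y[i]);
}
```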

Call to function cannot be vectorized (Fortran)

Cause: A function call inside the loop is preventing auto-vectorization.
Fortran Example:
Program foo
  implicit none
  integer, parameter :: nx = 100000000
  real(8) :: x, xp, sumx
  integer :: i
  interface
    real(8) function bar(x, xp)
      real(8), intent(in) :: x, xp
    end
  end interface

  sumx = 0.
  xp = 1.
  do i = 1,nx
    x = 1.D-8*real(i,8)
    sumx = sumx + bar(x,xp)
  enddo
  print *, 'Sum =',sumx
end

real(8) function bar(x, xp)
  implicit none
  real(8), intent(in) :: x, xp

  bar = 1. - 2.*(x-xp) + 3.*(x-xp)**2 - 1.5*(x-xp)**3 + 0.2*(x-xp)**4
  bar = bar / sqrt(x**2 + xp**2)
end

Recommendations:

If possible, define a vector version for the function using a construct:
Target | ICL/ICC/ICPC Construct | IFORT Construct
Source function | #pragma omp declare simd | !$OMP DECLARE SIMD
Source function | __declspec(vector) (Windows OS) or __attribute__((vector)) (Linux OS) | ELEMENTAL keyword or !DIR$ ATTRIBUTES VECTOR
In this example you can vectorize the loop and function call using OpenMP* 4.0 or Intel® Cilk™ Plus explicit vector programming capabilities.

Add a !$OMP DECLARE SIMD directive to the function bar() and compile with the /Qopenmp-simd option to generate a vectorized version of bar(). Add the same directive to the interface block for bar() inside program foo. The UNIFORM clause specifies that xp is a non-varying argument and has the same value for each loop iteration in the caller being vectorized. Thus x is the only vector argument. Without UNIFORM, the compiler must determine if xp could also be a vector argument.
real(8) function bar(x, xp)
!$OMP DECLARE SIMD (bar) UNIFORM(xp)
  implicit none
  real(8), intent(in) :: x, xp

  bar = 1. - 2.*(x-xp) + 3.*(x-xp)**2 - 1.5*(x-xp)**3 + 0.2*(x-xp)**4
  bar = bar / sqrt(x**2 + xp**2)
end
The code now generates a vectorized version of function bar(); however, the loop inside foo is still not vectorized because the compiler sees dependencies between loop iterations carried by both x and sumx. Unaided, the compiler could determine how to auto-vectorize a loop with just these dependencies, or vectorize a loop with just the function call, but not both. We can tell the compiler to vectorize the loop with a !$OMP SIMD directive that specifies the properties of x and sumx:
Program foo
  implicit none
  integer, parameter :: nx = 100000000
  real(8) :: x, xp, sumx
  integer :: i

  interface
    real(8) function bar(x, xp)
    !$OMP DECLARE SIMD (bar) UNIFORM(xp)
      real(8), intent(in) :: x, xp
    end
  end interface

  sumx = 0.
  xp = 1.

  !$OMP SIMD private(x) reduction(+:sumx)
  do i = 1,nx
    x = 1.D-8*real(i,8)
    sumx = sumx + bar(x,xp)
  enddo
  print *, 'Sum =',sumx
end
The loop now vectorizes successfully, and running the application shows a performance speedup.

For small functions such as bar(), inlining may be a simpler and more efficient way to achieve vectorization of loops containing function calls. When the caller and callee are in separate source files, as above, build the application with interprocedural optimization (-ipo or /Qipo). When the caller and callee are in the same source file, inlining of small functions is enabled by default at optimization level O2 and above.


Cannot compute loop iteration count before executing the loop

Cause: The loop iteration count is not available before the loop executes.
C++ Example: The upper bound of the loop iteration count is controlled by bar(), whose implementation is available in this compilation unit. Because the compiler cannot determine the loop iteration count, it cannot decide:
  • How to map the loop to vector registers
  • If it needs to create peeled and remainder loops
  • Whether it has enough iterations to saturate at least one vector register
void foo(float *A) {
  int i;
  int OuterCount = 90;
  while(OuterCount > 0) {
    for (i=1; i < bar(int(A[0])); i++) {
      A[i] = i+4;
    }
    OuterCount--;
  }
}
Fortran Example:
subroutine foo(a, n)
  implicit none
  integer, intent(in) :: n
  double precision, intent(inout) :: a(n)
  integer :: bar
  integer :: i

  i=0
100 CONTINUE
  a(i)=0
  i=i+1
  if (i .lt. bar()) goto 100

end subroutine foo

Recommendation:

If the loop iteration count and iterations lower bound can be calculated for the whole loop:
  • Move the calculation outside the loop using an additional variable.
  • Rewrite the loop to avoid goto statements or other early exits from the loop.
  • Identify the loop iterations lower bound using a constant.
C++-Specific Recommendation
For example, introduce the new limit variable:
void foo(float *A) {
  int i;
  int OuterCount = 90;
  int limit = bar(int(A[0]));
  while(OuterCount > 0) {
    for (i=1; i < limit; i++) {
      A[i] = i+4;
    }
    OuterCount--;
  }
}
Fortran-Specific Recommendation
GOTO statements prevent vectorization; rewriting the code without GOTO allows this loop to be vectorized.

Volatile assignment was not vectorized

Cause: Any usage of volatile variables in the loop causes this diagnostic.
C++ Example:
volatile int32_t x;
int32_t a[c_size];
for (int32_t i = 0; i < c_size; ++i) {
  a[i] = exp(x + i);
  x = a[i];
}

Recommendation:

Avoid using volatile variables. For example, reassign them to regular variables.
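A minimal sketch of reassigning a volatile to a regular variable, using a simpler loop than the example above. The names gain and scale are hypothetical, and the rewrite assumes the program does not need the volatile value to be re-read on every iteration.

```cpp
#include <cstddef>
#include <cstdint>

volatile std::int32_t gain = 3;   // hypothetical volatile, e.g. updated elsewhere

void scale(std::int32_t* out, const std::int32_t* in, std::size_t n) {
    // Read the volatile once; the loop body then uses only regular variables.
    const std::int32_t g = gain;
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * g;
}
```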

Compile time constraints prevent loop optimization

Cause: Internal time limits for the optimization level prevented the compiler from determining a vectorization approach for this loop.

Recommendation:

When specifying code optimization, use the following compiler option to enable the compiler vectorization engine and provide detailed diagnostics about vectorization possibilities for this loop.
Windows* OS - ICL and IFORT Option | Linux* OS - ICC/ICPC and IFORT Option
/O3 | -O3

Inner loop throttling prevents vectorization of this outer loop

Cause: The inner loop has an irregular structure. For example, it may have non-constant lower and upper bounds, a non-constant iteration step, more than one entry, some assembly parts, volatile variables, long jumps, or complex switch clauses.

Recommendation:

See the inner loop message for more details and simplify the inner loop structure.

Outer loop was not auto-vectorized

Cause: The compiler vectorizer determined outer loop vectorization is not possible using auto-vectorization.
C++ Example:
void foo(float **a, float **b, int N) {
  int i, j;
#pragma ivdep
  for (i=0; i < N; i++) {
    float *ap = a[i];
    float *bp = b[i];
    for (j=0; j < N; j++) {
      ap[j] = bp[j];
    }
  }
}
Fortran Example:
subroutine foo(a, n1, n)
  implicit none
  integer, intent(in) :: n, n1
  real, intent(inout) :: a(n,n1)
  integer :: i, j
  do i=1,n
    do j=1,n
      a(j,i) = a(j-1,i)+1
    end do
  end do
end subroutine foo

Recommendations:

  • Run a Dependencies analysis to check if the loop has real dependencies. There are two types of dependencies:
    • True dependency - Read after write (RAW)
    • Anti-dependency - Write after read (WAR)
  • If no dependencies exist, use one of the following to tell the compiler it is safe to vectorize:
    • Directive to ignore all assumed dependencies in the loop
      Target | ICL/ICC/ICPC Directive | IFORT Directive
      Source Loop | #pragma simd or #pragma omp simd | !DIR$ SIMD or !$OMP SIMD
    • Directive to ignore only vector dependencies (which is safer)
      Target | ICL/ICC/ICPC Directive | IFORT Directive
      Source Loop | #pragma ivdep | !DIR$ IVDEP
    • restrict keyword
  • If anti-dependency exists, use a directive where k is smaller than the distance between dependent items in anti-dependency. This enables vectorization, as dependent items are put into different vectors:
    Target | ICL/ICC/ICPC Directive | IFORT Directive
    Source Loop | #pragma simd vectorlength(k) | !DIR$ SIMD VECTORLENGTH(k)
  • If using the O3 compiler option, use a directive before the inner and outer loops to request vectorization of the outer loop:
    Target | ICL/ICC/ICPC Directive | IFORT Directive
    Inner loop | #pragma novector | !DIR$ NOVECTOR
    Outer loop | #pragma vector always | !DIR$ VECTOR ALWAYS

Inner loop was already vectorized

Cause: The inner loop in a nested loop is vectorized.
C++ Example:
#define N 1000
float A[N][N];

void foo(int n) {
  int i,j;
  for (i=0; i < n; i++) {
    for (j=0; j < n; j++) {
      A[i][j]++;
    }
  }
}
Fortran Example:
subroutine foo(a, n1, n)
  implicit none
  integer, intent(in) :: n, n1
  real, intent(inout) :: a(n1,n1)
  integer :: i, j

  do i=1,n
    do j=1,n
      a(j,i) = a(j,i) + 1
    end do
  end do
end subroutine foo

Recommendations:

Force vectorization of the outer loop:
  • In some cases it is possible to collapse a nested loop structure into a single loop structure using a directive before the outer loop. The n argument is an integer that specifies how many loops to collapse into one loop for vectorization:
    Target | ICL/ICC/ICPC Directive | IFORT Directive
    Outer loop | #pragma omp simd collapse(n), #pragma omp simd, or #pragma simd | !$OMP SIMD COLLAPSE(n), !$OMP SIMD, or !DIR$ SIMD
  • If using the O3 compiler option, use a directive before the inner and outer loops to request vectorization of the outer loop:
    Target | ICL/ICC/ICPC Directive | IFORT Directive
    Inner loop | #pragma novector | !DIR$ NOVECTOR
    Outer loop | #pragma vector always | !DIR$ VECTOR ALWAYS

Low trip count

Cause: The loop lacks sufficient iterations to benefit from vectorization.
C++ Example:
#define TTT char
TTT A[15];

TTT foo(int n) {
  TTT sum=0;
  int i;
  for (i=0;i < n;i++) {
    sum+=A[i];
  }
  return sum;
}
Fortran Example:
integer (kind=1) :: A(15), sum, i
sum=0
do i=1,15
  sum=sum+A(i)
end do

Recommendations:

  • Rewrite your code to increase the number of loop iterations to fill at least one full vector.
  • Run a Trip Counts analysis to check the number of iterations and loop efficiency. A loop with iterations equal to a power of 2 can vectorize even if the trip count is low.
  • Do not vectorize a loop with so few iterations (because it incurs overhead).
  • Tell the compiler to enforce vectorization using a directive, and compare performance before and after vectorization.
    Target | ICL/ICC/ICPC Construct | IFORT Construct
    Source loop | #pragma omp simd or #pragma simd | !$OMP SIMD or !DIR$ SIMD

Intel, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© 2015 Intel Corporation