Unsupported data type

Causes:
  • The loop assigns one struct variable to another, but the structure does not define an assignment operator, so the compiler cannot translate the struct assignment into scalar assignments.
  • The compiler does not support certain data types because there is no corresponding SIMD instruction.
  • The compiler cannot vectorize a loop containing complex or long numeric types that do not fit in the vector register width.
C++ Example:
struct char4 { char c1; char c2; char c3; char c4; };
extern struct char4 *a;
void vecmsg_testcore003() {
    int i;
    const struct char4 n = {0, 0, 0, 0};
    #pragma omp simd
    for (i = 0; i < 1024; i++) {
        a[i] = n;
    }
}

Recommendations:

  • Provide a struct assignment operator written in terms of scalars. For example, declare it as a member of char4:
    char4 &operator=(const char4 &x) {
        c1 = x.c1; c2 = x.c2; c3 = x.c3; c4 = x.c4;
        return *this;
    }
  • Use standard data types.
  • Use instruction sets that support wider vectors.
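The first recommendation can be sketched by writing the struct assignment in the loop body member by member, so the loop contains only scalar char stores. This is a minimal sketch rather than a definitive fix. The function name fill_char4 is hypothetical, and #pragma omp simd only takes effect when OpenMP SIMD support is enabled (other compilers ignore it).

```cpp
#include <cstddef>

// Hypothetical 4-byte struct, mirroring char4 from the example above.
struct char4 {
    char c1, c2, c3, c4;
};

// Each member is assigned as a scalar, so the loop body contains only
// char stores that a vectorizer can map onto SIMD lanes.
void fill_char4(char4 *a, std::size_t n, const char4 &value) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; i++) {
        a[i].c1 = value.c1;
        a[i].c2 = value.c2;
        a[i].c3 = value.c3;
        a[i].c4 = value.c4;
    }
}
```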

Not inner loop

Cause: In nested loop structures, the compiler targets the innermost loop for vectorization. The outer loop, by default, is not a target for vectorization; however, it may be a target for parallelization.
C++ Example:
#include <iostream>
#define N 25
int main() {
    int a[N][N], b[N];
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < N; i++)
            a[j][i] = 0;
        b[j] = 1;
    }
    int sum = __sec_reduce_add(a[:][:]) + __sec_reduce_add(b[:]);
    return 0;
}

Recommendation:

In some cases, you can collapse a nested loop structure into a single loop structure by placing a directive before the outer loop. The n argument is an integer that specifies how many loops to collapse into one loop for vectorization.
Target     | ICL/ICC/ICPC Directive                                          | IFORT Directive
Outer loop | #pragma omp simd collapse(n), #pragma omp simd, or #pragma simd | !$OMP SIMD COLLAPSE(n), !$OMP SIMD, or !DIR$ SIMD
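Here is a minimal sketch of the collapse directive, assuming an OpenMP 4.0 compiler with SIMD support enabled. Because collapse requires perfectly nested loops, this version moves the b[j] = 1 assignment from the example above into its own loop. The function name init is hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t N = 25;

// collapse(2) asks the compiler to treat the two perfectly nested loops
// as a single N*N iteration space when vectorizing.
void init(int a[N][N], int b[N]) {
    #pragma omp simd collapse(2)
    for (std::size_t j = 0; j < N; j++)
        for (std::size_t i = 0; i < N; i++)
            a[j][i] = 0;

    // The b[j] = 1 statement would break perfect nesting, so it gets its own loop.
    #pragma omp simd
    for (std::size_t j = 0; j < N; j++)
        b[j] = 1;
}
```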

Remainder loop vectorization possible but seems inefficient

Cause: The compiler vectorizer determined the remainder loop will not benefit from vectorization.
C++ Example:
#include <iostream>
#define N 70
int main() {
    static short tab1[N], tab2[N];
    int i, j;
    static short const data[] = {-32768, -256, -255, -128, -127, -1, 0, 1, 127, 128, 255, 256, 32767};
    for (j = i = 0; i < N; i++) {
        tab1[i] = i;
        tab2[i] = data[j++];
        if (j > 12) j = 0;
    }
    int sum = __sec_reduce_add(tab1[:]) + __sec_reduce_add(tab2[:]);
    return 0;
}

Recommendations:

  • Force remainder vectorization using a directive before the loop:
    Target      | ICL/ICC/ICPC Directive      | IFORT Directive
    Source loop | #pragma vector vecremainder | !DIR$ SIMD VECREMAINDER
  • Disable remainder vectorization using a directive before the loop:
    Target      | ICL/ICC/ICPC Directive        | IFORT Directive
    Source loop | #pragma vector novecremainder | !DIR$ SIMD NOVECREMAINDER
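The following sketch makes the terms concrete by spelling out the split a vectorizer performs. A main loop processes VL elements per step, and a scalar remainder loop handles the final n % VL elements.

The sketch assumes a vector length of 8 shorts (one 128-bit register). With N = 70, that leaves 64 iterations for the main loop and 6 for the remainder loop, which is the tail that vecremainder and novecremainder control. The inner k loop only stands in for one vector operation, and the function name is hypothetical.

```cpp
#include <cstddef>

// Assumed vector length: 8 shorts, i.e. one 128-bit register.
constexpr std::size_t VL = 8;

// Conceptual split: a main loop handling VL elements per step and a
// scalar remainder loop for the last n % VL elements.
// Returns the number of remainder iterations.
std::size_t fill_with_index(short *tab, std::size_t n) {
    const std::size_t main_end = n - n % VL;
    std::size_t i = 0;
    for (; i < main_end; i += VL)              // stands in for one vector operation
        for (std::size_t k = 0; k < VL; k++)
            tab[i + k] = static_cast<short>(i + k);
    for (; i < n; i++)                         // remainder loop
        tab[i] = static_cast<short>(i);
    return n - main_end;
}
```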

Loop vectorization possible but seems inefficient

Cause: The compiler vectorizer determined the loop will not benefit from vectorization. Common reasons include:
  • Non-unit stride memory access
  • Indirect memory access
  • Low iteration count
C++ Example: The vectorizer determines that the cost of creating the vector operands (which requires non-unit-stride access) is significant compared to the number and type of computations that use those operands.
#include <iostream>
#define N 100
struct s1 { int a, b, c; };
int main() {
    s1 arr[N] = {}, sum = {0, 0, 0};
    for (int i = 0; i < N; i++) {
        sum.a += arr[i].a;
        sum.b += arr[i].b;
        sum.c += arr[i].c;
    }
    std::cout << sum.a << "\t" << sum.b << "\t" << sum.c << "\n";
    return 0;
}

Recommendations:

  • If you still believe vectorization might result in a speedup, override the compiler cost model using a directive before the loop:
    Target      | ICL/ICC/ICPC Directive                  | IFORT Directive
    Source loop | #pragma vector or #pragma vector always | !DIR$ VECTOR or !DIR$ VECTOR ALWAYS
    Alternatively, use a compiler option to always vectorize loops. The compiler still tests for dependencies and does not vectorize the loop unless it is safe.
    Windows* OS (ICL and IFORT) Option | Linux* OS (ICC/ICPC and IFORT) Option
    /Qvec-threshold0                   | -vec-threshold0
  • Require vectorization using a directive before the loop. The compiler does not perform a dependency analysis; it is your responsibility to ensure vectorization is safe:
    Target      | ICL/ICC/ICPC Directive           | IFORT Directive
    Source loop | #pragma simd or #pragma omp simd | !DIR$ SIMD or !$OMP SIMD
  • Rewrite the data structure/loop to have more regular memory accesses.
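One way to get more regular memory accesses for the s1 example is to switch from an array of structures to a structure of arrays (SoA), so each field is read with unit stride. This sketch assumes the rest of the program can adopt the new layout. The SoA and Sums types and sum_fields are hypothetical names.

```cpp
#include <cstddef>

// Structure of arrays: each field lives in its own contiguous array.
struct SoA {
    static constexpr std::size_t N = 100;
    int a[N];
    int b[N];
    int c[N];
};

struct Sums { int a, b, c; };

Sums sum_fields(const SoA &s) {
    int sa = 0, sb = 0, sc = 0;
    // Three independent reductions, each over unit-stride data.
    #pragma omp simd reduction(+:sa, sb, sc)
    for (std::size_t i = 0; i < SoA::N; i++) {
        sa += s.a[i];
        sb += s.b[i];
        sc += s.c[i];
    }
    return {sa, sb, sc};
}
```

The trade-off is that code handling one element at a time now touches three arrays instead of one contiguous struct.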

Conditional assignment to a scalar

Causes:
  • The loop assigns to a structure variable, and a complex condition controls that assignment.
  • The loop contains a conditional statement, and one of the following is true:
    • The conditional statement controls the assignment of a scalar variable, and that variable's value is used in a later iteration or after the loop finishes. Exception: loops that search an array for maximum or minimum values and their indices.
    • The value of the scalar when the loop ends depends on the iterations executing in strict order.
C++ Example:
void foo(int *A, int *restrict B, int n, int *x) {
    int i;
    #pragma omp simd
    for (i = 0; i < n; i++) {
        if (A[i] > i)
            *x = i;
        else
            B[i] = *x;
    }
    B[i] = *x++;
}

Recommendations:

Simplify or remove conditions in the loop by:
  • Dividing the loop into a group of sequential loops
  • Or using multiple temporary variables instead of one scalar variable
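Dividing the loop can be illustrated with a hypothetical loop that both scales an array and records the last index whose value exceeds a threshold. In this sketch (all names are assumptions), the array computation and the conditional scalar update sit in separate loops, so the first loop no longer contains a conditional scalar assignment. The second loop scans backward and stops at the first match, which yields the same last index as a forward scan.

```cpp
// Hypothetical example: scale A into B and report the last index i
// with A[i] > threshold (or -1 if there is none).
void scale_and_find_last(const float *A, float *B, int n,
                         float threshold, int *last) {
    // Loop 1: pure array work, no conditional scalar assignment.
    #pragma omp simd
    for (int i = 0; i < n; i++)
        B[i] = 2.0f * A[i];

    // Loop 2: the conditional scalar update, isolated in its own loop.
    int found = -1;
    for (int i = n - 1; i >= 0; i--) {
        if (A[i] > threshold) {
            found = i;
            break;
        }
    }
    *last = found;
}
```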

Assumed dependence between lines

Causes:
  • Anti-dependency - Write after read (WAR) - is assumed in a loop.
  • True dependency - Read after write (RAW) - is assumed in a loop.
C++ Example: When the compiler tries to vectorize for the SSE2 architecture, it chooses a vector length of 4 (because the loop operates on int data). When it considers vector operands instead of scalar operands for this loop, the input vector and the output vector overlap. An anti-dependency occurs when k is positive; a true dependency occurs when k is negative.
#include <stdlib.h>
#define N 70
int main(int argc, char *argv[]) {
    int k = (argc > 1) ? atoi(argv[1]) : 1;
    int a[N] = {0}, i;
    // The upper bound keeps a[i+k] inside the array for either sign of k.
    for (i = abs(k); i < N - abs(k); i++)
        a[i] = a[i+k] + 1;
    return 0;
}

Recommendations:

  • Rewrite code to remove dependencies.
  • Run a Dependencies analysis to check if the loop has real dependencies.
  • If no dependencies exist, use one of the following to tell the compiler it is safe to vectorize:
    • A directive that tells the compiler to skip dependency analysis for the loop:
      Target      | ICL/ICC/ICPC Directive           | IFORT Directive
      Source loop | #pragma simd or #pragma omp simd | !DIR$ SIMD or !$OMP SIMD
    • A directive that ignores only assumed vector dependencies (safer, because proven dependencies still prevent vectorization):
      Target      | ICL/ICC/ICPC Directive | IFORT Directive
      Source loop | #pragma ivdep          | !DIR$ IVDEP
    • The restrict keyword
  • If an anti-dependency exists, use a directive where k is smaller than the distance between the dependent items. This enables vectorization because the dependent items are placed in different vectors:
    Target      | ICL/ICC/ICPC Directive       | IFORT Directive
    Source loop | #pragma simd vectorlength(k) | !DIR$ SIMD VECTORLENGTH(k)
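The idea of bounding the vector length by the dependence distance can be sketched with the standard OpenMP 4.0 safelen clause, used here in place of the Intel-specific #pragma simd vectorlength(k). safelen(4) tells the compiler that at most 4 consecutive iterations may execute together in one SIMD chunk.

In this sketch, iteration i reads the value written by iteration i - 4 (a true dependency with a fixed distance). Every read therefore refers to a value written in an earlier chunk. The function name and the distance of 4 are assumptions for illustration.

```cpp
#include <cstddef>

// Dependence distance: iteration i reads the value written by iteration i - 4.
constexpr std::size_t DIST = 4;

void propagate(int *a, std::size_t n) {
    // safelen(4): at most 4 consecutive iterations run together, so no
    // SIMD chunk reads a value that the same chunk writes.
    #pragma omp simd safelen(4)
    for (std::size_t i = DIST; i < n; i++)
        a[i] = a[i - DIST] + 1;
}
```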

Non-standard loop is not a vectorization candidate (C++)

Causes:
  • There is more than one loop exit point.
  • A SIMD loop uses C++ exception handling or an OpenMP critical construct.
  • The compiler cannot determine which function is passed as a function parameter.
Below are examples for all three scenarios.
C++ Example 1: There is more than one loop exit point.
void no_vec(float a[], float b[], float c[]) {
    int i = 0;
    while (i < 100) {
        a[i] = b[i] * c[i];
        // this is a data-dependent exit condition:
        if (a[i] < 0.0)
            break;
        ++i;
    }
}
Exception: Loops searching for an array element, as in the example below, can be automatically vectorized when array a is aligned.
for (i = 0; i < n; ++i) {
    if (a[i] == to_find) {
        index = i;
        break;
    }
}
C++ Example 2: A SIMD loop uses C++ exception handling or an OpenMP critical construct.
#include <cstdio>
#include <cstdlib>
#define N 1000
int foo() {
    #pragma omp simd
    for (int i = 0; i < N; i++) {
        try {
            printf("throw exception 11\n");
            throw 11;
        } catch (int t) {
            printf("caught exception %d\n", t);
            if (t != 11) {
                #pragma omp critical
                {
                    printf("TEST FAILED\n");
                    exit(0);
                }
            }
        }
    }
    printf("TEST PASSED\n");
    exit(0);
}
C++ Example 3: The compiler cannot determine which function is passed as a function parameter.
#include <iostream>
int a[100];
int b[100];
int g(int i, int y) { return b[i] + y; }
__declspec(noinline) void doit1(int x(int, int), int y) {
    int i;
    #pragma parallel
    for (i = 0; i < 100; i++)
        a[i] = x(i, y);
}

Recommendations:

  • For Example 1, where there is more than one loop exit point: Ensure loops have a single entry and a single exit point.
  • For Example 2, where a SIMD loop uses C++ exception handling or an OpenMP critical construct: Remove C++ exception handling and OpenMP critical sections from loops.
  • For Example 3, where the compiler cannot determine which function is passed as a function parameter: There is no resolution unless you can tell the compiler at compile time which function will be called within the loop body.
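For Example 3, one way to make the callee known at compile time in C++ is to take the callable as a template parameter and pass a lambda. Each lambda has its own type, so each instantiation of doit has a fixed call target that the compiler can inline.

This sketch is a general C++ technique, not an Intel-documented fix. doit is a hypothetical template version of doit1, and #pragma omp simd replaces the Intel-specific #pragma parallel.

```cpp
int a[100];
int b[100];

int g(int i, int y) { return b[i] + y; }

// The callable's type is a template parameter, so each instantiation
// has a call target known at compile time and the call can be inlined
// before the loop is considered for vectorization.
template <typename F>
void doit(F x, int y) {
    #pragma omp simd
    for (int i = 0; i < 100; i++)
        a[i] = x(i, y);
}
```

A call looks like doit([](int i, int y) { return g(i, y); }, 5). Passing g directly would deduce F as a plain function pointer type, where the call target is again a runtime value, so the sketch wraps g in a lambda.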

Non-standard loop is not a vectorization candidate (Fortran)

Causes:
  • There is more than one loop exit point.
  • The iteration count is data dependent.
  • The loop contains a subroutine or function call that prevents vectorization.
  • There are other complex control structures. For example, there may be multiple GOTO statements.
Below are examples for the first three scenarios.
Fortran Example 1: There is more than one loop exit point.
subroutine d_15043(a,b,c,n)
  implicit none
  integer, intent(in) :: n
  real, intent(in), dimension(n) :: a, b
  real, intent(out), dimension(n) :: c
  integer :: i
  do i=1,n
    if(a(i) < 0.) exit
    c(i) = sqrt(a(i)) * b(i)
  enddo
end subroutine d_15043
Fortran Example 2: The iteration count is data dependent.
subroutine d_15043_2(a,b,c,n)
  implicit none
  integer, intent(in) :: n
  real, intent(in), dimension(n) :: a, b
  real, intent(out), dimension(n) :: c
  integer :: i
  i = 1
  do while (a(i) > 0.)
    c(i) = sqrt(a(i)) * b(i)
    i = i + 1
  enddo
end subroutine d_15043_2
Fortran Example 3: The loop contains a subroutine or function that prevents vectorization.
subroutine d_15043_3(a,b,c,n)
  implicit none
  integer, intent(in) :: n
  real, intent(in), dimension(n) :: a, b
  real, intent(out), dimension(n) :: c
  integer :: i
  do i=1,n
    call my_sub(a(i),b(i),c(i))
  enddo
end subroutine d_15043_3

Recommendations

  • For Example 1, where there is more than one loop exit point, ensure that:
    • The loop has a single entry and a single exit point.
    • The iteration count is constant and known to the loop on entry.
    This loop can be vectorized if you replace exit with cycle, although the behavior is different.
  • For Example 2, where the iteration count is data dependent: Replace the do while construct with a counted do loop. For example:
      do i=1,n
        if(a(i) > 0.) c(i) = sqrt(a(i)) * b(i)
      enddo
    If necessary, the iteration count can be pre-computed.
  • For Example 3, where the loop contains a subroutine or function call that prevents vectorization, do one of the following:
    • Inline the subroutine, for example by using interprocedural optimization.
    • Convert it to a SIMD-enabled subroutine, for example by using the !$OMP DECLARE SIMD directive.


Vector dependence prevents vectorization

Cause: The compiler detected or assumed a vector dependence in the loop.
C++ Example:
int foo(float *A, int n) {
    int inx = 0;
    float max = A[0];
    int i;
    for (i = 0; i < n; i++) {
        if (max < A[i]) {
            max = A[i];
            inx = i * i;
        }
    }
    return inx;
}
Fortran Example:
integer function foo(a, n)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: a(n)
  real :: max
  integer :: inx, i
  inx = 0
  max = a(1)
  do i=1,n
    if (max < a(i)) then
      max = a(i)
      inx = i*i
    endif
  end do
  foo = inx
end function

Recommendations:

  • Rewrite code to remove dependencies.
  • Run a Dependencies analysis to check if the loop has real dependencies. There are two types of dependencies:
    • True dependency - Read after write (RAW)
    • Anti-dependency - Write after read (WAR)
  • If no dependencies exist, use one of the following to tell the compiler it is safe to vectorize:
    • A directive that tells the compiler to skip dependency analysis for the loop:
      Target      | ICL/ICC/ICPC Directive           | IFORT Directive
      Source loop | #pragma simd or #pragma omp simd | !DIR$ SIMD or !$OMP SIMD
    • A directive that ignores only assumed vector dependencies (safer, because proven dependencies still prevent vectorization):
      Target      | ICL/ICC/ICPC Directive | IFORT Directive
      Source loop | #pragma ivdep          | !DIR$ IVDEP
    • The restrict keyword
  • If an anti-dependency exists, use a directive where k is smaller than the distance between the dependent items. This enables vectorization because the dependent items are placed in different vectors:
    Target      | ICL/ICC/ICPC Directive       | IFORT Directive
    Source loop | #pragma simd vectorlength(k) | !DIR$ SIMD VECTORLENGTH(k)
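For the C++ example, one rewrite tracks only the index of the maximum and computes i*i once after the loop. Tracking the index of a maximum is the max-with-index pattern this document lists as an exception for conditional scalar assignment.

The rewrite returns the same value as foo: the square of the index of the last strict maximum, or 0 if A[0] is never exceeded. This is a sketch; whether a particular compiler version recognizes this exact form as a vectorizable idiom is not guaranteed, and the function name foo_rewritten is hypothetical.

```cpp
// Same result as foo above, restructured around the index of the maximum.
int foo_rewritten(const float *A, int n) {
    float max = A[0];
    int max_idx = 0;   // A[0] never replaces itself, so 0 also means "no update"
    for (int i = 0; i < n; i++) {
        if (max < A[i]) {
            max = A[i];
            max_idx = i;
        }
    }
    return max_idx * max_idx;   // i*i computed once, outside the loop
}
```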

Call to function cannot be vectorized (C++)

Causes:
  • The loop has a call to a function that has no vector version.
  • A user-defined vector function cannot be vectorized because the function body invokes other functions that cannot be vectorized.
C++ Example:
#include <iostream>
using namespace std;
int main() {
    float c[10];
    c[:] = 0.f;
    for (int i = 0; i < 10; i++)
        cout << c[i] << "\n";
    return 0;
}

Recommendations:

If possible, define a vector version of the function using one of these constructs:
Target          | ICL/ICC/ICPC Construct
Source function | #pragma omp declare simd
Source function | __declspec(vector) (Windows OS) or __attribute__((vector)) (Linux OS)
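Here is a minimal sketch of the #pragma omp declare simd construct, assuming OpenMP SIMD support is enabled (for example, -fopenmp-simd with GCC or Clang, or /Qopenmp-simd with the Intel compiler on Windows). The uniform(scale) clause states that every SIMD lane receives the same scale value, so only x becomes a vector argument. The function names poly and apply_poly are hypothetical.

```cpp
#include <cstddef>

// uniform(scale): every SIMD lane receives the same scale value, so the
// compiler can generate a vector variant that takes only x as a vector.
#pragma omp declare simd uniform(scale)
float poly(float x, float scale) {
    return scale * (1.0f + x * (2.0f + x * 3.0f));
}

// With SIMD support enabled, the loop can call the vector variant of
// poly instead of making one scalar call per element.
void apply_poly(const float *in, float *out, std::size_t n, float scale) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; i++)
        out[i] = poly(in[i], scale);
}
```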

Call to function cannot be vectorized (Fortran)

Cause: A function call inside the loop is preventing auto-vectorization.
Fortran Example:
Program foo
  implicit none
  integer, parameter :: nx = 100000000
  real(8) :: x, xp, sumx
  integer :: i
  interface
    real(8) function bar(x, xp)
      real(8), intent(in) :: x, xp
    end
  end interface
  sumx = 0.
  xp = 1.
  do i = 1,nx
    x = 1.D-8*real(i,8)
    sumx = sumx + bar(x,xp)
  enddo
  print *, 'Sum =',sumx
end

real(8) function bar(x, xp)
  implicit none
  real(8), intent(in) :: x, xp
  bar = 1. - 2.*(x-xp) + 3.*(x-xp)**2 - 1.5*(x-xp)**3 + 0.2*(x-xp)**4
  bar = bar / sqrt(x**2 + xp**2)
end

Recommendations:

If possible, define a vector version of the function using one of these constructs:
Target          | IFORT Construct
Source function | !$OMP DECLARE SIMD
Source function | ELEMENTAL keyword or !DIR$ ATTRIBUTES VECTOR
In this example you can vectorize the loop and function call using OpenMP* 4.0 or Intel® Cilk™ Plus explicit vector programming capabilities.

Add a !$OMP DECLARE SIMD directive to the function bar() and compile with the /Qopenmp-simd option to generate a vectorized version of bar(). Add the same directive to the interface block for bar() inside program foo. The UNIFORM clause specifies that xp is a non-varying argument: it has the same value for every loop iteration in the caller being vectorized. Thus x is the only vector argument. Without UNIFORM, the compiler must determine whether xp could also be a vector argument.
real(8) function bar(x, xp)
  !$OMP DECLARE SIMD (bar) UNIFORM(xp)
  implicit none
  real(8), intent(in) :: x, xp
  bar = 1. - 2.*(x-xp) + 3.*(x-xp)**2 - 1.5*(x-xp)**3 + 0.2*(x-xp)**4
  bar = bar / sqrt(x**2 + xp**2)
end
The code now generates a vectorized version of function bar(); however, the loop inside foo is still not vectorized because the compiler sees dependencies between loop iterations carried by both x and sumx. Unaided, the compiler can auto-vectorize a loop with just these dependencies, or a loop with just the function call, but not both. We can tell the compiler to vectorize the loop with a !$OMP SIMD directive that specifies the properties of x and sumx:
Program foo
  implicit none
  integer, parameter :: nx = 100000000
  real(8) :: x, xp, sumx
  integer :: i
  interface
    real(8) function bar(x, xp)
      !$OMP DECLARE SIMD (bar) UNIFORM(xp)
      real(8), intent(in) :: x, xp
    end
  end interface
  sumx = 0.
  xp = 1.
  !$OMP SIMD private(x) reduction(+:sumx)
  do i = 1,nx
    x = 1.D-8*real(i,8)
    sumx = sumx + bar(x,xp)
  enddo
  print *, 'Sum =',sumx
end
The loop now vectorizes successfully, and running the application shows a performance speedup.

For small functions such as bar(), inlining may be a simpler and more efficient way to vectorize loops that contain function calls. When the caller and callee are in separate source files, as above, build the application with interprocedural optimization (-ipo or /Qipo). When the caller and callee are in the same source file, inlining of small functions is enabled by default at optimization level O2 and above.


Cannot compute loop iteration count before executing the loop (C++)

Causes:
  • The loop iteration count is not available before the loop executes.
  • The compiler cannot determine if there is aliasing between all the pointers used inside the loop and loop boundaries.
C++ Example 1: The upper bound of the loop iteration count is controlled by bar(), whose implementation is available in this compilation unit. Because the loop iteration count is not available before the loop executes, the compiler cannot determine:
  • How to map the loop to vector registers
  • Whether it needs to create peeled and remainder loops
  • Whether it has enough iterations to saturate at least one vector register
int bar(int);  // defined elsewhere in this compilation unit
void foo(float *A) {
    int i;
    int OuterCount = 90;
    while (OuterCount > 0) {
        for (i = 1; i < bar(int(A[0])); i++) {
            A[i] = i + 4;
        }
        OuterCount--;
    }
}
C++ Example 2: The compiler cannot determine if there is aliasing between all the pointers used inside the loop and loop boundaries.
struct Dim { int x, y, z; };
Dim dim;
double* B;
void foo(double* A) {
    for (int i = 0; i < dim.x; i++) {
        A[i] = B[i];
    }
}

Recommendations:

  • For Example 1, where the loop iteration count is not available before the loop executes: If the loop iteration count and the lower bound of the iterations can be calculated for the whole loop:
    • Move the calculation outside the loop using an additional variable.
    • Rewrite the loop to avoid goto statements or other early exits from the loop that prevent vectorization.
    • Use a constant for the lower bound of the loop iterations.
    For example, introduce the new limit variable:
      void foo(float *A) {
          int i;
          int OuterCount = 90;
          int limit = bar(int(A[0]));
          while (OuterCount > 0) {
              for (i = 1; i < limit; i++) {
                  A[i] = i + 4;
              }
              OuterCount--;
          }
      }
  • For Example 2, where the compiler cannot determine if there is aliasing between the pointers used inside the loop and the loop boundaries: Assign the loop boundary value to a local variable. In most cases, this is enough for the compiler to determine that aliasing cannot occur.

    You can use a directive to accomplish the same thing automatically:
    Target      | ICL/ICC/ICPC Directive
    Source loop | #pragma simd or #pragma omp simd
    Do not use global variables or indirect accesses as loop boundaries unless you also use one of the following:
    • A directive to ignore assumed vector dependencies:
      Target      | ICL/ICC/ICPC Directive
      Source loop | #pragma ivdep
    • The restrict keyword
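The local-variable fix for Example 2 can be sketched as follows. Because n is a local copy, stores through A cannot change the trip count, so the count is known on loop entry. The Dim, dim, and B declarations repeat Example 2; the rest of the sketch is illustrative.

```cpp
struct Dim { int x, y, z; };
Dim dim;
double *B;

void foo(double *A) {
    // Local copy of the bound: stores through A cannot change n,
    // so the trip count is known when the loop starts.
    const int n = dim.x;
    for (int i = 0; i < n; i++)
        A[i] = B[i];
}
```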

Cannot compute loop iteration count before executing the loop (Fortran)

Cause: The loop iteration count is not available before the loop executes.
Fortran Example:
subroutine foo(a, n)
  implicit none
  integer, intent(in) :: n
  double precision, intent(inout) :: a(n)
  integer :: bar
  integer :: i
  i=1
100 CONTINUE
  a(i)=0
  i=i+1
  if (i < bar()) goto 100
end subroutine foo

Recommendations:

If the loop iteration count and the lower bound of the iterations can be calculated for the whole loop:
  • Move the calculation outside the loop using an additional variable.
  • Rewrite the loop to avoid goto statements or other early exits from the loop that prevent vectorization.
  • Use a constant for the lower bound of the loop iterations.

Volatile assignment was not vectorized

Cause: Any usage of volatile variables in the loop causes this diagnostic.
C++ Example:
volatile int32_t x;
int32_t a[c_size];
for (int32_t i = 0; i < c_size; ++i) {
    a[i] = exp(x + i);
    x = a[i];
}

Recommendation:

Avoid using volatile variables in the loop. For example, copy their values into regular (non-volatile) local variables and use those inside the loop.
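Here is a minimal sketch of that recommendation, using a hypothetical volatile variable sensor_value: the loop reads a plain local copy instead of the volatile object. This changes behavior if another thread or device updates the volatile while the loop runs, so use it only when a single snapshot is acceptable. The example above also carries a dependence through x = a[i], which this sketch does not address.

```cpp
#include <cstddef>

volatile int sensor_value;   // hypothetical volatile input

// Read the volatile object once; the loop itself touches only
// non-volatile data.
void fill_from_snapshot(int *a, std::size_t n) {
    const int v = sensor_value;          // single volatile read
    for (std::size_t i = 0; i < n; i++)
        a[i] = v + static_cast<int>(i);
}
```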

Compile time constraints prevent loop optimization

Cause: Internal time limits for the optimization level prevented the compiler from determining a vectorization approach for this loop.

Recommendation:

When specifying code optimization, use the following compiler option to enable the compiler vectorization engine and provide detailed diagnostics about vectorization possibilities for this loop.
Windows* OS (ICL and IFORT) Option | Linux* OS (ICC/ICPC and IFORT) Option
/O3                                | -O3

Inner loop throttling prevents vectorization of this outer loop

Cause: The inner loop has an irregular structure. For example, it may have non-constant lower and upper bounds, a non-constant iteration step, more than one entry, inline assembly, volatile variables, long jumps, or complex switch statements.

Recommendation:

See the inner loop message for more details and simplify the inner loop structure.

Outer loop was not auto-vectorized

Cause: The compiler vectorizer determined outer loop vectorization is not possible using auto-vectorization.
C++ Example:
void foo(float **a, float **b, int N) {
    int i, j;
    #pragma ivdep
    for (i = 0; i < N; i++) {
        float *ap = a[i];
        float *bp = b[i];
        for (j = 0; j < N; j++) {
            ap[j] = bp[j];
        }
    }
}
Fortran Example:
subroutine foo(a, n1, n)
  implicit none
  integer, intent(in) :: n, n1
  real, intent(inout) :: a(n,n1)
  integer :: i, j
  do i=1,n1
    do j=2,n
      a(j,i) = a(j-1,i)+1
    end do
  end do
end subroutine foo

Recommendations:

  • Run a Dependencies analysis to check if the loop has real dependencies. There are two types of dependencies:
    • True dependency - Read after write (RAW)
    • Anti-dependency - Write after read (WAR)
  • If no dependencies exist, use one of the following to tell the compiler it is safe to vectorize:
    • A directive that tells the compiler to skip dependency analysis for the loop:
      Target      | ICL/ICC/ICPC Directive           | IFORT Directive
      Source loop | #pragma simd or #pragma omp simd | !DIR$ SIMD or !$OMP SIMD
    • A directive that ignores only assumed vector dependencies (safer, because proven dependencies still prevent vectorization):
      Target      | ICL/ICC/ICPC Directive | IFORT Directive
      Source loop | #pragma ivdep          | !DIR$ IVDEP
    • The restrict keyword
  • If an anti-dependency exists, use a directive where k is smaller than the distance between the dependent items. This enables vectorization because the dependent items are placed in different vectors:
    Target      | ICL/ICC/ICPC Directive       | IFORT Directive
    Source loop | #pragma simd vectorlength(k) | !DIR$ SIMD VECTORLENGTH(k)
  • If using the O3 compiler option, use a directive before the inner and outer loops to request vectorization of the outer loop:
    Target     | ICL/ICC/ICPC Directive | IFORT Directive
    Inner loop | #pragma novector       | !DIR$ NOVECTOR
    Outer loop | #pragma vector always  | !DIR$ VECTOR ALWAYS
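When the outer loop's iterations are independent, explicit outer loop vectorization can be sketched with #pragma omp simd on the outer loop. Each SIMD lane handles one column j and runs the inner loop sequentially. In this hypothetical column-sum example (all names are assumptions), lanes j through j+VL-1 read adjacent elements of a row-major matrix in each inner iteration, so the outer loop yields unit-stride vector accesses.

```cpp
// Column sums of a row-major rows x cols matrix. Iterations of the outer
// loop over columns are independent, so it is marked for SIMD; each lane
// runs the inner loop over rows sequentially.
void column_sums(const float *m, float *sums, int rows, int cols) {
    #pragma omp simd
    for (int j = 0; j < cols; j++) {
        float s = 0.0f;
        for (int i = 0; i < rows; i++)
            s += m[i * cols + j];
        sums[j] = s;
    }
}
```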

Inner loop was already vectorized

Cause: The inner loop in a nested loop is vectorized.
C++ Example:
#define N 1000
float A[N][N];
void foo(int n) {
    int i, j;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            A[i][j]++;
        }
    }
}
Fortran Example:
subroutine foo(a, n1, n)
  implicit none
  integer, intent(in) :: n, n1
  real, intent(inout) :: a(n1,n1)
  integer :: i, j
  do i=1,n
    do j=1,n
      a(j,i) = a(j,i) + 1
    end do
  end do
end subroutine foo

Recommendations:

Force vectorization of the outer loop:
  • In some cases, you can collapse a nested loop structure into a single loop structure by placing a directive before the outer loop. The n argument is an integer that specifies how many loops to collapse into one loop for vectorization:
    Target     | ICL/ICC/ICPC Directive                                          | IFORT Directive
    Outer loop | #pragma omp simd collapse(n), #pragma omp simd, or #pragma simd | !$OMP SIMD COLLAPSE(n), !$OMP SIMD, or !DIR$ SIMD
  • If using the O3 compiler option, use a directive before the inner and outer loops to request vectorization of the outer loop:
    Target     | ICL/ICC/ICPC Directive | IFORT Directive
    Inner loop | #pragma novector       | !DIR$ NOVECTOR
    Outer loop | #pragma vector always  | !DIR$ VECTOR ALWAYS

Low trip count

Cause: The loop lacks sufficient iterations to benefit from vectorization.
C++ Example:
#define TTT char
TTT A[15];
TTT foo(int n) {
    TTT sum = 0;
    int i;
    for (i = 0; i < n; i++) {
        sum += A[i];
    }
    return sum;
}
Fortran Example:
integer (kind=1) :: A(15), sum, i
sum=0
do i=1,15
  sum=sum+A(i)
end do

Recommendations:

  • Rewrite your code to increase the number of loop iterations to fill at least one full vector.
  • Run a Trip Counts analysis to check the number of iterations and loop efficiency. A loop whose iteration count is a power of 2 can vectorize even if the trip count is low.
  • Consider not vectorizing a loop with so few iterations, because vectorization incurs overhead.
  • Tell the compiler to enforce vectorization using a directive, and compare performance before and after vectorization:
    Target      | ICL/ICC/ICPC Construct           | IFORT Construct
    Source loop | #pragma omp simd or #pragma simd | !$OMP SIMD or !DIR$ SIMD

Loop with early exits cannot be vectorized unless it meets search loop idiom criteria

Cause: The compiler did not recognize a search idiom in a loop that may exit early. For example, the loop body contains:
  • A conditional exit or GOTO statement followed by calculations
  • A potential exception - the compiler considers an exception another possible exit (C++ only)
C++ Example:
Early exit
#include <cmath>
void c15520(float a[], float b[], float c[], int n) {
    int i;
    for (i = 0; i < n; i++) {
        if (a[i] < 0.)
            break;
        c[i] = sqrt(a[i]) * b[i];
    }
}

Exception
// For Compiler 16.1 and higher, this example generates Diagnostic 15333 instead.
__attribute__((vector)) void f1(double);
int main() {
    int n = 10000;
    double a[n];
    #pragma simd
    for (int i = 0; i < n; i++)
        f1(a[i]);
}

Fortran Example:
subroutine f15520(a,b,c,n)
  implicit none
  integer, intent(in) :: n
  real, intent(in), dimension(n) :: a, b
  real, intent(out), dimension(n) :: c
  integer :: i
  do i=1,n
    if(a(i).lt.0.) exit
    c(i) = sqrt(a(i)) * b(i)
  enddo
end subroutine f15520

Recommendations:

  • Split the loop into two loops:
    • A search loop that has an early exit but still meets the search idiom criteria
    • A computational loop without early exits
  • Ensure the loop has a single entry and a single exit point.
  • Avoid exceptions within the loop body by marking functions as nothrow.

C++ Example:
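Split the loop into a search loop and a computational loop.
#include <cmath>
void c15520(float a[], float b[], float c[], int n) {
    int i, j;
    // Search loop: find the first negative element (i == n if there is none).
    for (i = 0; i < n; i++) {
        if (a[i] < 0.)
            break;
    }
    // Computational loop: no early exit; covers the same elements as the original loop.
    for (j = 0; j < i; j++) {
        c[j] = sqrt(a[j]) * b[j];
    }
}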

Mark the function in the loop as nothrow.
__attribute__((vector, nothrow)) void f1(double);
int main() {
    int n = 10000;
    double a[n];
    #pragma simd
    for (int i = 0; i < n; i++)
        f1(a[i]);
}

Fortran Example:
Split the loop into a search loop and a computational loop.
subroutine f15520(a,b,c,n)
  implicit none
  integer, intent(in) :: n
  real, intent(in), dimension(n) :: a, b
  real, intent(out), dimension(n) :: c
  integer :: i, j
  do i=1,n
    if(a(i).lt.0.) exit
  enddo
  do j=1,i-1
    c(j) = sqrt(a(j)) * b(j)
  enddo
end subroutine f15520

Exception handling for a call prevents vectorization

Cause: The compiler automatically generates a try block for a program block (that is, code inside {}) when it detects a local object or array in the program block that could throw an exception.
C++ Example:
__attribute__((vector)) void f1(double);
int main() {
    int n = 10000;
    double a[n];
    #pragma simd
    for (int i = 0; i < n; i++)
        f1(a[i]);
}

Recommendations:

Tell the compiler that the called routine cannot throw an exception by marking it as nothrow:
__attribute__((vector, nothrow)) void f1(double);
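In standard C++11 and later, the portable way to state that a function never throws is noexcept, shown here as an alternative to the GNU-style nothrow attribute. A call to a noexcept function needs no exception-handling path, which may remove this obstacle. Whether a given compiler then vectorizes the loop is not guaranteed. The functions damp and damp_all are hypothetical.

```cpp
#include <cmath>
#include <cstddef>

// noexcept promises that damp never throws, so calls to it need no
// exception-handling path inside the loop.
inline double damp(double x) noexcept {
    return x * std::exp(-x * x);
}

void damp_all(double *a, std::size_t n) noexcept {
    #pragma omp simd
    for (std::size_t i = 0; i < n; i++)
        a[i] = damp(a[i]);
}
```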


Intel, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© 2016 Intel Corporation