Issue: Ineffective peeled/remainder loop(s) present

All or some source loop iterations are not executing in the loop body. Improve performance by moving source loop iterations from peeled/remainder loops to the loop body.

Recommendation: Specify the expected loop trip count Confidence: Low

The compiler cannot statically detect the trip count. To fix: Identify the expected number of iterations using a directive.
ICL/ICC/ICPC Directive    IFORT Directive
#pragma loop_count        !DIR$ LOOP COUNT
C++ Example: Iterate through a loop a minimum of three, maximum of ten, and average of five times:
    #include <stdio.h>

    int mysum(int start, int end, int a)
    {
        int iret = 0;
        #pragma loop_count min(3), max(10), avg(5)
        for (int i = start; i <= end; i++)
            iret += a;
        return iret;
    }

    int main()
    {
        int t;
        t = mysum(1, 10, 3); printf("t1=%d\r\n", t);
        t = mysum(2, 6, 2);  printf("t2=%d\r\n", t);
        t = mysum(5, 12, 1); printf("t3=%d\r\n", t);
    }

Recommendation: Disable unrolling Confidence: Medium

The trip count after loop unrolling is too small compared to the vector length. To fix: Prevent loop unrolling or decrease the unroll factor using a directive.
ICL/ICC/ICPC Directive    IFORT Directive
#pragma nounroll          !DIR$ NOUNROLL
#pragma unroll            !DIR$ UNROLL
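As a minimal sketch of the directive placement, assuming a compiler that recognizes #pragma nounroll (compilers that do not recognize it typically ignore it), the directive goes immediately before the loop. The function name sum_short is hypothetical and used only for illustration.

```cpp
#include <cstddef>

// Sums n floats. The directive asks the compiler not to unroll the loop,
// so a short trip count is not divided further by an unroll factor.
float sum_short(const float *data, std::size_t n) {
    float total = 0.0f;
#pragma nounroll
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

The directive does not change the result of the loop; it only influences the code the compiler generates for it.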

Recommendation: Use a smaller vector length Confidence: Medium

The compiler chose a vector length, but the trip count might be smaller than that vector length. To fix: Specify a smaller vector length using a directive.
ICL/ICC/ICPC Directive       IFORT Directive
#pragma simd vectorlength    !DIR$ SIMD VECTORLENGTH

Recommendation: Align data Confidence: Medium

One of the memory accesses in the source loop does not start at an optimally aligned address boundary. To fix: Align the data and tell the compiler the data is aligned.
Dynamic Data:
To align dynamic data, replace malloc() and free() with _mm_malloc() and _mm_free(). To tell the compiler the data is aligned, use __assume_aligned() before the source loop. Also consider using #include <aligned_new> to enable automatic allocation of aligned data.
Static Data:
To align static data, use __declspec(align()). To tell the compiler the data is aligned, use __assume_aligned() before the source loop.
C++ Example-Dynamic Data:
Align dynamic data using a 64-byte boundary and tell the compiler the data is aligned:
    float *array;
    array = (float *)_mm_malloc(ARRAY_SIZE * sizeof(float), 64);
    // Somewhere else
    __assume_aligned(array, 64);
    // Use array in loop
    _mm_free(array);
C++ Example-Static Data:
Align static data using a 64-byte boundary:
    __declspec(align(64)) float array[ARRAY_SIZE];
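For compilers without __declspec(align()), a rough equivalent of the static-data case can be sketched with the standard C++11 alignas specifier. This is a stand-in, not the directive named above; the names kArraySize, aligned_array, and is_aligned are hypothetical, and the sketch does not use __assume_aligned(), which is specific to the Intel compilers.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kArraySize = 1024;  // hypothetical array size

// Standard alignas used here in place of __declspec(align(64)).
alignas(64) static float aligned_array[kArraySize];

// Returns true when ptr lies on a multiple of boundary bytes.
bool is_aligned(const void *ptr, std::size_t boundary) {
    return reinterpret_cast<std::uintptr_t>(ptr) % boundary == 0;
}
```

A check such as is_aligned(aligned_array, 64) can be useful in a debug build to confirm that an alignment promise made to the compiler actually holds.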

Recommendation: Add data padding Confidence: Medium

The trip count is not a multiple of vector length. To fix: Do one of the following:
  • Increase size of objects and add iterations so the trip count is a multiple of vector length.
  • Increase the size of static and automatic objects, and use a compiler option to add data padding.
Windows* OS                  Linux* OS
/Qopt-assume-safe-padding    -qopt-assume-safe-padding
Note: These compiler options apply only to Intel® Many Integrated Core Architecture (Intel® MIC Architecture). Option -qopt-assume-safe-padding is the replacement compiler option for -opt-assume-safe-padding, which is deprecated.

When you use one of these compiler options, the compiler does not add any padding for static and automatic objects. Instead, it assumes that code can access up to 64 bytes beyond the end of the object, wherever the object appears in your application. To satisfy this assumption, you must increase the size of static and automatic objects in your application.
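For the first option (adding iterations so the trip count is a multiple of the vector length), the padded size can be computed as in the following minimal sketch. The function name padded_trip_count is hypothetical, and the vector length is assumed to be known and non-zero.

```cpp
#include <cstddef>

// Rounds trip_count up to the next multiple of vector_length.
// vector_length must be non-zero.
std::size_t padded_trip_count(std::size_t trip_count, std::size_t vector_length) {
    std::size_t remainder = trip_count % vector_length;
    return remainder == 0 ? trip_count : trip_count + (vector_length - remainder);
}
```

For example, a loop over 10 elements with a vector length of 8 would be padded to 16 iterations, so the objects it touches must hold at least 16 elements.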

Optional: Specify the trip count, if it is not constant, using a directive:
ICL/ICC/ICPC Directive    IFORT Directive
#pragma loop_count        !DIR$ LOOP COUNT

Recommendation: Collect trip counts data Confidence: Need more data

The Survey Report lacks trip counts data that might generate more precise recommendations. To fix: Run a Trip Counts analysis.

Issue: Data type conversions present

There are multiple data types within loops. Utilize hardware vectorization support more effectively by avoiding data type conversion.

Recommendation: Use the smallest data type Confidence: Low

The source loop contains data types of different widths. To fix: Use the smallest data type that gives the needed precision to use the entire vector register width.
Example: If only 16 bits are needed, using a short rather than an int can make the difference between eight-way and four-way SIMD parallelism, respectively.
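The arithmetic behind that example can be sketched as follows, assuming a 128-bit vector register (the width implied by the eight-way and four-way figures) and 8-bit bytes. The helper name simd_lanes is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Number of elements of type T that fit in one vector register
// of register_bits bits, assuming 8-bit bytes.
template <typename T>
constexpr std::size_t simd_lanes(std::size_t register_bits) {
    return register_bits / (sizeof(T) * 8);
}
```

Under these assumptions, simd_lanes<std::int16_t>(128) is 8 and simd_lanes<std::int32_t>(128) is 4.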

Issue: User function call(s) present

User-defined functions in the loop body are preventing the compiler from vectorizing the loop.

Recommendation: Enable inline expansion Confidence: Low

Inlining of user-defined functions is disabled by compiler option. To fix: When using the Ob or inline-level compiler option to control inline expansion, replace the 0 argument with the 1 argument to enable inlining when an inline keyword or attribute is specified, or with the 2 argument to enable inlining of any function at compiler discretion.
OS             Compiler    Option
Windows* OS    ICL         /Ob1 or /Ob2
Windows* OS    IFORT       /Ob1 or /Ob2
Linux* OS      ICC/ICPC    -inline-level=1 or -inline-level=2
Linux* OS      IFORT       -inline-level=1 or -inline-level=2

Recommendation: Vectorize user function(s) inside loop Confidence: Low

Some user-defined function(s) are not vectorized or inlined by the compiler. To fix: Do one of the following:
  • Enforce vectorization of the source loop by means of SIMD instructions and/or create a SIMD version of the function(s) using a directive:
    Target                                      ICL/ICC/ICPC Directive              IFORT Directive
    Source Loop                                 #pragma simd or #pragma omp simd    !DIR$ SIMD or !$OMP SIMD
    Inner function definition or declaration    #pragma omp declare simd            !$OMP DECLARE SIMD
  • If using the Ob or inline-level compiler option to control inline expansion with the 1 argument, use an inline keyword to enable inlining or replace the 1 argument with 2 to enable inlining of any function at compiler discretion.
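The first option might look like the following minimal sketch, which uses the OpenMP forms of the directives. It assumes OpenMP SIMD support is enabled at compile time (without it, the directives are typically ignored and the code still runs serially). The names scale_and_shift and apply_all are hypothetical.

```cpp
#include <cstddef>

// Hypothetical user function. The directive asks the compiler to
// generate a SIMD version in addition to the scalar one.
#pragma omp declare simd
float scale_and_shift(float x) {
    return 2.0f * x + 1.0f;
}

// Source loop that calls the user function on every element.
void apply_all(const float *in, float *out, std::size_t n) {
#pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = scale_and_shift(in[i]);
    }
}
```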

Issue: Serialized user function call(s) present

User-defined functions in the loop body are not vectorized.

Recommendation: Enable inline expansion Confidence: Low

Inlining of user-defined functions is disabled by compiler option. To fix: When using the Ob or inline-level compiler option to control inline expansion, replace the 0 argument with the 1 argument to enable inlining when an inline keyword or attribute is specified, or with the 2 argument to enable inlining of any function at compiler discretion.
OS             Compiler    Option
Windows* OS    ICL         /Ob1 or /Ob2
Windows* OS    IFORT       /Ob1 or /Ob2
Linux* OS      ICC/ICPC    -inline-level=1 or -inline-level=2
Linux* OS      IFORT       -inline-level=1 or -inline-level=2

Recommendation: Vectorize serialized function(s) inside loop Confidence: Medium

Some user-defined function(s) are not vectorized or inlined by the compiler. To fix: Do one of the following:
  • Enforce vectorization of the source loop by means of SIMD instructions and/or create a SIMD version of the function(s) using a directive:
    Target                                      ICL/ICC/ICPC Directive              IFORT Directive
    Source Loop                                 #pragma simd or #pragma omp simd    !DIR$ SIMD or !$OMP SIMD
    Inner function definition or declaration    #pragma omp declare simd            !$OMP DECLARE SIMD
  • If using the Ob or inline-level compiler option to control inline expansion with the 1 argument, use an inline keyword to enable inlining or replace the 1 argument with 2 to enable inlining of any function at compiler discretion.

Issue: Math function call(s) present

Math functions in the loop body may prevent the compiler from vectorizing the loop effectively. Improve performance by enabling vectorized math call(s).

Recommendation: Enable inline expansion Confidence: Low

Inlining is disabled by compiler option. To fix: When using the Ob or inline-level compiler option to control inline expansion, replace the 0 argument with the 1 argument to enable inlining when an inline keyword or attribute is specified, or with the 2 argument to enable inlining of any function at compiler discretion.
OS             Compiler    Option
Windows* OS    ICL         /Ob1 or /Ob2
Windows* OS    IFORT       /Ob1 or /Ob2
Linux* OS      ICC/ICPC    -inline-level=1 or -inline-level=2
Linux* OS      IFORT       -inline-level=1 or -inline-level=2
Alternatively, for C/C++ applications: Use the #include <mathimf.h> header instead of the standard #include <math.h> header to call highly optimized and accurate mathematical functions commonly used in applications that rely heavily on floating point computations.

Recommendation: Vectorize math function calls inside loops Confidence: Medium

Your application calls serialized versions of math functions when you use the 'precise' floating point model. To fix: Do one of the following:
  • Add the fast-transcendentals compiler option to replace calls to transcendental functions with faster calls.
    Windows* OS               Linux* OS
    /Qfast-transcendentals    -fast-transcendentals
    CAUTION: This may reduce floating point accuracy.
  • Enforce vectorization of the source loop by using a directive:
    ICL/ICC/ICPC Directive              IFORT Directive
    #pragma simd or #pragma omp simd    !DIR$ SIMD or !$OMP SIMD

Recommendation: Change the floating point model Confidence: Medium

Your application calls serialized versions of math functions when you use the 'strict' floating point model. To fix: Do one of the following:
  • Use the 'fast' floating point model to enable more aggressive optimizations or the 'precise' floating point model to disable optimizations that are not value-safe on fast transcendental functions.
    Windows* OS                           Linux* OS
    /fp:fast                              -fp-model fast
    /fp:precise /Qfast-transcendentals    -fp-model precise -fast-transcendentals
    CAUTION: This may reduce floating point accuracy.
  • Use the 'precise' floating point model and enforce vectorization of the source loop using a directive:
    ICL/ICC/ICPC Directive              IFORT Directive
    #pragma simd or #pragma omp simd    !DIR$ SIMD or !$OMP SIMD

Issue: System function call(s) present

System function call(s) in the loop body may prevent the compiler from vectorizing the loop.

Recommendation: Remove system function call(s) inside loop Confidence: Low

Typically system function or subroutine calls cannot be vectorized; even a print statement is sufficient to prevent vectorization. To fix: Avoid using system function calls in loops.
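One common way to follow this advice is to keep the computation and the output in separate loops, as in the following minimal sketch. The function names square_all and print_all are hypothetical; whether the compute loop is actually vectorized still depends on the compiler and options.

```cpp
#include <cstddef>
#include <cstdio>

// Compute loop: no system calls in the body, so nothing here
// blocks the compiler from vectorizing it.
void square_all(const float *in, float *out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * in[i];
    }
}

// Output loop: the print call is kept out of the compute loop.
void print_all(const float *values, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        std::printf("%f\n", values[i]);
    }
}
```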

Issue: Assumed dependency present

The compiler assumed there is an anti-dependency (Write after read - WAR) or a true dependency (Read after write - RAW) in the loop. Improve performance by investigating the assumption and handling accordingly.

Recommendation: Confirm dependency is real Confidence: Need More Data

There is no confirmation that a real dependency is present in the loop. To confirm: Run a Dependencies analysis.

Recommendation: Remove dependency Confidence: Low

The Dependencies analysis shows there is a real dependency in the loop. To fix: Do one of the following:
  • Rewrite the code to remove the dependency.
  • If there is an anti-dependency, enable vectorization using a directive where k is smaller than the distance between dependent iterations in anti-dependency.
    ICL/ICC/ICPC Directive          IFORT Directive
    #pragma simd vectorlength(k)    !DIR$ SIMD VECTORLENGTH(k)

Recommendation: Enable vectorization Confidence: Low

The Dependencies analysis shows there is no real dependency in the loop for the given workload. Tell the compiler it is safe to vectorize using the restrict keyword or a directive.
ICL/ICC/ICPC Directive              IFORT Directive             Outcome
#pragma simd or #pragma omp simd    !DIR$ SIMD or !$OMP SIMD    Ignores all dependencies in the loop
#pragma ivdep                       !DIR$ IVDEP                 Ignores only vector dependencies (which is safest)
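The restrict approach can be sketched as follows. In C, restrict is a standard keyword (C99 and later); in C++ it is not, so this sketch assumes the __restrict spelling, a compiler extension that several common compilers accept. The function name add_arrays is hypothetical.

```cpp
#include <cstddef>

// __restrict promises the compiler that dst, a, and b do not overlap,
// so writes through dst cannot change values later read through a or b.
// The caller is responsible for keeping that promise.
void add_arrays(float *__restrict dst, const float *__restrict a,
                const float *__restrict b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```

As with the directives, the assertion is not checked: passing overlapping arrays to such a function can produce wrong results.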

Issue: High vector register pressure

All vector registers are in use. This may result in spilling that negatively impacts performance. Improve performance by decreasing vector register pressure.

Recommendation: Decrease unroll factor Confidence: Low

The current directive unroll factor increases vector register pressure. To fix: Decrease unroll factor.
ICL/ICC/ICPC Directive    IFORT Directive
#pragma nounroll          !DIR$ NOUNROLL
#pragma unroll            !DIR$ UNROLL

Recommendation: Split loop into smaller loops Confidence: Low

High vector register pressure is preventing effective vectorization. To fix: Use a directive or rewrite your code to distribute the source loop. This can decrease register pressure as well as enable software pipelining and improve both instruction and data cache use.
ICL/ICC/ICPC Directive      IFORT Directive
#pragma distribute_point    !DIR$ DISTRIBUTE POINT
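Rewriting the code by hand, rather than using the directive, might look like the following sketch. The loop here is deliberately tiny and would not itself cause register pressure; it only shows the shape of the transformation. The function names fused_loop and distributed_loops are hypothetical.

```cpp
#include <cstddef>

// Before: one loop body computes two independent results, so the
// values needed for both are live at the same time.
void fused_loop(const float *a, const float *b, float *sum, float *prod,
                std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        sum[i] = a[i] + b[i];
        prod[i] = a[i] * b[i];
    }
}

// After: the loop is distributed into two smaller loops, each of
// which keeps fewer values live per iteration.
void distributed_loops(const float *a, const float *b, float *sum, float *prod,
                       std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        sum[i] = a[i] + b[i];
    }
    for (std::size_t i = 0; i < n; ++i) {
        prod[i] = a[i] * b[i];
    }
}
```

Distribution is only valid when the statements being separated do not depend on each other across the split, as is the case here.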

Issue: Possible inefficient memory access patterns present

Inefficient memory access patterns may result in significant vector code execution slowdown or block automatic vectorization by the compiler. Improve performance by investigating.

Recommendation: Confirm inefficient memory access patterns Confidence: Need More Data

There is no confirmation inefficient memory access patterns are present. To confirm: Run a Memory Access Patterns analysis.

Issue: Inefficient memory access patterns present

There is a high percentage of memory instructions with irregular (variable or random) stride accesses. Improve performance by investigating and handling accordingly.

Recommendation: Use SoA instead of AoS Confidence: Low

An array is the most common type of data structure containing a contiguous collection of data items that can be accessed by an ordinal index. You can organize this data as an array of structures (AoS) or as a structure of arrays (SoA). While AoS organization is excellent for encapsulation, it can hinder effective vector processing. To fix: Rewrite code to organize data using SoA instead of AoS.
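The difference between the two layouts can be sketched as follows, using a hypothetical three-field point type. All names (PointAoS, PointsSoA, to_soa, sum_x) are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// Array of structures (AoS): the x, y, and z of one point are adjacent,
// so consecutive x values are three floats apart in memory.
struct PointAoS {
    float x, y, z;
};

// Structure of arrays (SoA): all x values are contiguous, so a loop
// over x alone reads memory with unit stride.
struct PointsSoA {
    std::vector<float> x, y, z;
};

// Converts an AoS container to the SoA layout.
PointsSoA to_soa(const std::vector<PointAoS> &points) {
    PointsSoA soa;
    soa.x.reserve(points.size());
    soa.y.reserve(points.size());
    soa.z.reserve(points.size());
    for (const PointAoS &p : points) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
    }
    return soa;
}

// Unit-stride loop over a single field of the SoA layout.
float sum_x(const PointsSoA &points) {
    float total = 0.0f;
    for (std::size_t i = 0; i < points.x.size(); ++i) {
        total += points.x[i];
    }
    return total;
}
```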

Recommendation: Reorder loops Confidence: Low

This loop may have less efficient memory access patterns than a nearby outer loop. To fix: Run a Memory Access Patterns analysis on the outer loop. If the memory access patterns are more efficient for the outer loop, reorder the loops if possible.
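As a minimal sketch of loop reordering, assume a matrix stored in row-major order in a flat array. The function names sum_column_inner and sum_row_inner are hypothetical. Both functions visit the same elements; only the order of the loops differs.

```cpp
#include <cstddef>

// Inner loop walks down a column: consecutive accesses are
// cols elements apart (non-unit stride).
float sum_column_inner(const float *m, std::size_t rows, std::size_t cols) {
    float total = 0.0f;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            total += m[r * cols + c];
        }
    }
    return total;
}

// Loops reordered: the inner loop walks along a row, so
// consecutive accesses are adjacent in memory (unit stride).
float sum_row_inner(const float *m, std::size_t rows, std::size_t cols) {
    float total = 0.0f;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            total += m[r * cols + c];
        }
    }
    return total;
}
```

Reordering changes the order of the floating point additions, so results can differ slightly in general, and it is only valid when no dependency between iterations forbids it.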

Recommendation: Use the Fortran 2008 CONTIGUOUS attribute Confidence: Low

The loop is multi-versioned for unit and non-unit strides in assumed-shape arrays or pointers, but marked versions of the loop have unit stride access only. The CONTIGUOUS attribute specifies the target of a pointer or an assumed-shape array is contiguous. It can make it easier to enable optimizations that rely on the memory layout of an object occupying a contiguous block of memory. Note: The results are indeterminate and could result in wrong answers and segmentation faults if the user assertion is wrong and the data is not contiguous at runtime.
Example:
real, pointer, contiguous :: ptr(:)
real, contiguous :: arrayarg(:, :)

Intel, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© 2015 Intel Corporation