Contents

- Vectorization Advisor
- Threading Advisor
Intel Advisor
Vectorization Advisor is a vectorization analysis tool that lets you identify loops that will benefit most from vectorization, identify what is blocking effective vectorization, explore the benefit of alternative data reorganizations, and increase the confidence that vectorization is safe.
Threading Advisor is a threading design and prototyping tool that lets you analyze, design, tune, and check threading design options without disrupting your normal development.
Intel Advisor is available as part of the following suites:

- Intel® Parallel Studio XE Professional Edition
- Intel® Parallel Studio XE Cluster Edition
If you do not already have access to the Intel® Advisor XE 2016 or to Version 15.0 or higher of an Intel C++ or Fortran compiler, download an evaluation copy from http://software.intel.com/en-us/articles/intel-software-evaluation-center/. (Use an Intel compiler to get more benefit from the Vectorization Advisor Survey Report.)
Use the Vectorization Advisor to identify:

- Where vectorization will pay off the most
- If vectorized loops are providing benefit, and if not, why not
- Un-vectorized and under-vectorized loops, and the estimated performance gain of vectorization or better vectorization
- How data accessed by vectorized loops is organized, and the estimated performance gain of reorganization
Additional Vectorization Advisor analyses and reports include:

- Trip Counts analysis - Dynamically identifies the number of times loops are invoked and the number of times they execute (sometimes called call count or loop count, and iteration count, respectively). Use this information to make better decisions about your vectorization strategy for particular loops, as well as to optimize already-parallel loops.
- Dependencies Report - For safety purposes, the compiler is often conservative when assuming data dependencies. Use this report to check for real data dependencies in loops the compiler did not vectorize because of assumed dependencies. If real dependencies are detected, the report can provide additional details to help resolve them. Your objective: Identify and better characterize real data dependencies that could make forced vectorization unsafe.
- Memory Access Patterns (MAP) Report - Use this report to check for memory issues, such as non-contiguous memory accesses and unit-stride vs. non-unit-stride accesses (see the sketch following this list). Your objective: Eliminate issues that could lead to significant vector code execution slowdown or block automatic vectorization by the compiler.
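To make the stride terminology concrete, here is a small C++ sketch (the struct and function names are illustrative assumptions, not Intel Advisor APIs) contrasting an array-of-structures layout, which produces strided accesses, with a structure-of-arrays layout, which produces unit-stride accesses:

```cpp
#include <cstddef>

// Array-of-structures: reading only x touches every third float, so the
// loop accesses memory with a non-unit stride.
struct PointAoS { float x, y, z; };

float sum_x_aos(const PointAoS* pts, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += pts[i].x;      // strided access: consecutive x values are 12 bytes apart
    return sum;
}

// Structure-of-arrays: the x values are contiguous, so the same loop
// becomes a unit-stride access that vectorizes efficiently.
struct PointsSoA { const float* x; const float* y; const float* z; };

float sum_x_soa(const PointsSoA& pts, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += pts.x[i];      // unit-stride access: consecutive floats
    return sum;
}
```

A MAP report on the first loop would typically flag the non-unit-stride access; the second layout is the kind of data reorganization the report points you toward.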
Follow these steps (white blocks are optional) to get started using the Vectorization Advisor in the Intel Advisor.
Choose New Project… (or click it in the Welcome page) to open the Create a Project dialog box. Supply a name and location for your project, then click the Create Project button to open the Project Properties dialog box.
On the left side of the Analysis Target tab, ensure the Survey Hotspots/Suitability Analysis type is selected.
Set the appropriate parameters. (Setting the binary/symbol search and source search directories is optional for the Vectorization Advisor.)
After you click the OK button to close the Project Properties dialog box, the Intel Advisor displays an empty Survey Report and the Vectorization Workflow.
When necessary, click the control at the bottom of the Workflow to switch between the Vectorization Workflow and Threading Workflow.
If you plan to run other vectorization Analysis Types, set parameters for those Analysis Types now.
The Survey Trip Counts Analysis type has similar parameters to the Survey Hotspots/Suitability Analysis type.
The Dependencies Analysis and Memory Access Patterns Analysis types consume more resources than the Survey Hotspots/Suitability Analysis type. If these Refinement analyses take too long, consider decreasing the workload.
Select Track stack variables in the Dependencies Analysis type to detect all possible dependencies.
Under 1. Survey Target in the Vectorization Workflow, click the control to collect Survey data while your application executes.
After the Intel Advisor collects the data, it displays a Survey Report similar to the following:
| Callout | Action |
|---|---|
| 1 | Click the various Filter controls (buttons and drop-down lists) to temporarily limit displayed data based on your criteria. |
| 2 | Click the Search control to search for specific data. |
| 3 | Click the Expand/Collapse controls to show/hide sets of columns. |
| 4 | Click a loop data row in the top of the Survey Report to display more data specific to that loop in the bottom of the Survey Report, including source and assembly code, and any available code-specific how-can-I-fix-this-issue? information. Double-click a loop data row to display a Survey Source window. |
| 5 | Click a checkbox to mark a loop for deeper analysis. |
| 6 | Click a light bulb icon to display code-specific how-can-I-fix-this-issue? information in the Recommendations pane. |
| 7 | Click a book icon to display code-specific how-can-I-fix-this-issue? information in the Compiler Diagnostics pane. |
| 8 | Click the control to show/hide the Workflow pane. |
This step is optional.
Before running a Trip Counts analysis, make sure you set the appropriate Project Properties for the Survey Trip Counts Analysis type. (Use the same application, but a smaller input data set if possible.)
Under 1.1 Find Trip Counts in the Vectorization Workflow, click the control to collect Trip Counts data while your application executes.
After the Intel Advisor collects the data, it adds a Trip Counts column set to the Survey Report. Median data is shown by default. Min, Max, Call Count, and Iteration Duration data are shown when the column set is expanded.
The bottom of the Survey Report shows, for the data row selected at the top of the report:

- Key information from the Intel compiler vectorization and optimization reports
- Source and assembly code
- Code-specific how-can-I-fix-this-issue? recommendations, similar to the following:
Pay particular attention to the hottest loops in terms of Self Time and Total Time. Optimizing these loops provides the most benefit. Innermost loops and loops near innermost loops are often good candidates for vectorization. Outermost loops with significant Total Time are often good candidates for parallelization with threads.
Check whether your application uses the best available Vector Instruction Set, and whether vectorization requires heavy operations that might hurt performance, such as masking or gather operations.
Compare the modeled Estimated Achieved Gain with the gain expected from the Vector Instruction Set to ensure you are likely to get the optimal speed-up. For example: AVX2 processing of 32-bit integers should give an 8x performance gain. If the Estimated Achieved Gain is much lower than the expected gain for the Vector Instruction Set, consider optimizing an already vectorized loop by eliminating heavy vector operations, aligning data, or rewriting the loop to remove control-flow clauses.
A vectorized loop may not achieve the best performance when the compiler peels a source loop into peeled and remainder loops. If the peeled or remainder loop takes a significant portion of loop execution time, aligning data or changing the number of loop iterations may help.
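To illustrate the alignment and trip-count points above, here is a small C++ sketch (the 32-byte alignment assumes AVX/AVX2-width vectors; the array names and sizes are illustrative assumptions):

```cpp
#include <cstddef>

// Assumes AVX/AVX2: 32-byte vectors hold 8 floats.
constexpr std::size_t kAlign = 32;
constexpr std::size_t kLanes = kAlign / sizeof(float);

// Pad the element count up to a multiple of the vector width so the
// vectorized loop needs no scalar peel or remainder iterations.
constexpr std::size_t kN      = 1003;
constexpr std::size_t kPadded = ((kN + kLanes - 1) / kLanes) * kLanes;

// alignas places both arrays on a 32-byte boundary, so the compiler can
// use aligned vector loads/stores instead of peeling to reach alignment.
alignas(kAlign) static float a[kPadded];
alignas(kAlign) static float b[kPadded];

void add_arrays() {
    for (std::size_t i = 0; i < kPadded; ++i)
        a[i] += b[i];   // unit stride, aligned, padded trip count
}
```

Intel compilers also let you assert the alignment of pointer arguments (for example, with __assume_aligned) so the compiler can avoid generating a peel loop for data allocated elsewhere.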
This step is optional.
Set the appropriate Project Properties for the Dependencies Analysis type. (Use the same application, but a smaller input data set if possible. And select Track stack variables to detect all possible dependencies.)
Mark one or more un-vectorized loops for deeper analysis in the Survey Report.
Under 2.1 Check Dependencies in the Vectorization Workflow, click the control to collect Dependencies data while your application executes.
After the Intel Advisor collects the data, it displays a Dependencies-focused Refinement Report similar to the following:
To resolve a reported dependency, consider one of the following options:

- Rewrite the code to remove the dependency.
- If you are certain a reported dependency is not real, force vectorization with the #pragma simd ICL/ICC/ICPC directive, the #pragma omp simd OpenMP* 4.0 directive, or the !DIR$ SIMD or !$OMP SIMD IFORT directive to ignore all dependencies in the loop.
- Use the #pragma ivdep ICL/ICC/ICPC directive or the !DIR$ IVDEP IFORT directive to ignore only assumed vector dependencies (which is safest, but less powerful in certain cases).
- Resolve pointer-aliasing dependencies with the restrict keyword.
- If there is an anti-dependency (often called a write-after-read or WAR dependency), enable vectorization using the #pragma simd vectorlength(k) ICL/ICC/ICPC directive or the !DIR$ SIMD VECTORLENGTH(k) IFORT directive, where k is smaller than the distance between the dependent items, as in the sketch below:
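A minimal C/C++ sketch of the anti-dependency case (the arrays, the dependency distance of 8, and the vector length of 4 are illustrative assumptions):

```cpp
#include <cstddef>

// Anti-dependency (WAR): a[i + 8] is read in iteration i and written
// 8 iterations later, so the dependency distance is 8. Vectorizing with
// a vector length smaller than that distance is safe. (With OpenMP* 4.0,
// #pragma omp simd safelen(4) expresses a similar constraint.)
void shift_add(float* a, const float* b, std::size_t n) {
#pragma simd vectorlength(4)
    for (std::size_t i = 0; i + 8 < n; ++i)
        a[i] = a[i + 8] + b[i];
}
```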
This step is optional.
Set the appropriate Project Properties for the Memory Access Patterns Analysis type. (Use the same application, but a smaller input data set if possible.)
Mark one or more loops for deeper analysis in the Survey Report.
Under 2.2 Check Memory Access Patterns in the Vectorization Workflow, click the control to collect MAP data while your application executes.
After the Intel Advisor collects the data, it displays a MAP-focused Refinement Report similar to the following:
These compiler settings are useful when building an application for Vectorization Advisor analysis:

| To Do This | Optimal C/C++ Settings | Optimal Fortran Settings |
|---|---|---|
| Retrieve better compiler diagnostics. | Disable Interprocedural Optimization (IPO): -no-ipo | Disable Interprocedural Optimization (IPO): -no-ipo |
| Address any issues with source line matching. | | |
| Experiment with generating code for different instructions. | -xHost, -xSSE4.2, -xAVX, -axAVX, -xCORE-AVX2, or -axCORE-AVX2 | -xHost, -xSSE4.2, -xAVX, -axAVX, -xCORE-AVX2, or -axCORE-AVX2 |
The Threading Advisor provides the following analyses, reports, and tools:

- Survey Report - Shows the loops and functions where your application spends the most time. Use this information to discover candidates for parallelization with threads.
- Trip Counts analysis - Shows the minimum, maximum, and median number of times a loop body executes, as well as the number of times a loop is invoked. Use this information to make better decisions about your threading strategy for particular loops.
- Annotations - Insert these to mark places in your application that are good candidates for later replacement with parallel framework code that enables parallel execution. Annotations are subroutine calls or macros (depending on the programming language) that can be processed by your current compiler but do not change the computations of your application.
- Suitability Report - Predicts the maximum speed-up of your application based on the inserted annotations and a variety of what-if modeling parameters with which you can experiment. Use this information to choose the best candidates for parallelization with threads.
- Dependencies Report - Predicts parallel data sharing problems based on the inserted annotations. Use this information to fix the data sharing problems if the predicted maximum speed-up benefit justifies the effort.
To build applications that produce the most accurate and complete Threading Advisor analysis results, build an optimized binary of your application in release mode using these settings:
| To Do This | Optimal C/C++ Settings |
|---|---|
| Search additional directory related to Intel Advisor annotation definitions. | -I${ADVISOR_XE_2016_DIR}/include |
| Request full debug information (compiler and linker). | -g |
| Request moderate optimization. | -O2 or higher |
| Search for unresolved references in multithreaded, dynamically linked libraries. | -Bdynamic |
| Enable dynamic loading. | -ldl |
| To Do This | Optimal Fortran Settings |
|---|---|
| Search additional directory related to Intel Advisor annotation definitions. | |
| Request full debug information (compiler and linker). | -g |
| Request moderate optimization. | -O2 or higher |
| Search for unresolved references in multithreaded, dynamically linked libraries. | -shared-intel |
| Enable dynamic loading. | -ldl |
Follow these steps (white blocks are optional) to get started using the Threading Advisor in the Intel Advisor.
Choose New Project… (or click it in the Welcome page) to open the Create a Project dialog box. Supply a name and location for your project, then click the Create Project button to open the Project Properties dialog box.
On the left side of the Analysis Target tab, ensure the Survey Hotspots/Suitability Analysis type is selected.
Set the appropriate parameters, and binary/symbol search and source search directories.
After you click the OK button to close the Project Properties dialog box, the Intel Advisor displays an empty Survey Report and the Vectorization Workflow. Click the control at the bottom of the Workflow to switch between the Vectorization Workflow and Threading Workflow.
If possible, set parameters for all threading Analysis Types now.
The Survey Trip Counts Analysis type has similar parameters to the Survey Hotspots/Suitability Analysis type.
The Dependencies Analysis type consumes more resources than the Survey Hotspots/Suitability Analysis type. If the Dependencies analysis takes too long, consider decreasing the workload.
Under 1. Survey Target in the Threading Workflow, click the control to collect Survey data while your application executes. Use the resulting information to discover candidates for parallelization with threads.
This step is optional.
Before running a Trip Counts analysis, make sure you set the appropriate Project Properties for the Survey Trip Counts Analysis type.
Under 1.1 Find Trip Counts in the Threading Workflow, click the control to collect Trip Counts data while your application executes. Use the resulting information to make better decisions about your threading strategy for particular loops.
Pay particular attention to the hottest loops in terms of Self Time and Total Time. Optimizing these loops provides the most benefit. Outermost loops with significant Total Time are often good candidates for parallelization with threads. Innermost loops and loops near innermost loops are often good candidates for vectorization.
Insert annotations to mark places in your application that are good candidates for later replacement with parallel framework code that enables parallel execution. Annotate:

- A parallel site. A parallel site is a region of code that contains one or more tasks that may execute in one or more parallel threads to distribute work. An effective parallel site typically contains a hotspot that consumes application execution time. To distribute these frequently executed instructions to different tasks that can run at the same time, the best parallel site is usually located not at the hotspot itself, but higher in the call tree.
- One or more parallel tasks within a parallel site. A task is a portion of time-consuming code with data that can be executed in one or more parallel threads to distribute work.
- Locking synchronization, where mutual exclusion of data access must occur in the parallel application.
| Annotation Code Snippet | Purpose |
|---|---|
| Iteration Loop, Single Task | Create a simple loop structure, where the task code includes the entire loop body. This common task structure is useful when only a single task is needed within a parallel site. |
| Loop, One or More Tasks | Create loops where the task code does not include all of the loop body, or complex loops or code that requires specific task begin-end boundaries, including multiple task end annotations. This structure is also useful when multiple tasks are needed within a parallel site. |
| Function, One or More Tasks | Create code that calls multiple tasks within a parallel site. |
| Pause/Resume Collection | Temporarily pause data collection and later resume it, so you can skip uninteresting parts of application execution to minimize collected data and speed up analysis of large applications. Add these annotations outside a parallel site. |
| Build Settings | Set build (compiler and linker) settings specific to the language in use. |
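For illustration, here is a minimal C/C++ sketch of the Iteration Loop, Single Task structure from the table above, using the annotation macros declared in the Intel Advisor header advisor-annotate.h (the function, loop body, and site/task names are illustrative assumptions):

```cpp
#include <advisor-annotate.h>   // Intel Advisor annotation macros

void scale(float* data, int n, float factor) {
    ANNOTATE_SITE_BEGIN(scale_site);         // proposed parallel site
    for (int i = 0; i < n; ++i) {
        ANNOTATE_ITERATION_TASK(scale_task); // each iteration is one task
        data[i] *= factor;
    }
    ANNOTATE_SITE_END();                     // end of the proposed site
}
```

Remember to add the annotation include directory to your compiler options (see the build settings tables above) and rebuild in release mode before running the Suitability or Dependencies analysis.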
After you insert annotations into your source code, rebuild your application in release mode.
Under 3. Check Suitability in the Threading Workflow, click the control to collect Suitability data while your application executes.
After the Intel Advisor collects the data, it displays a Suitability Report similar to the following:
Experiment with the what-if modeling parameters to explore:

- Different hardware configurations and parallel frameworks
- Different trip counts and instance durations
- Any plans to address parallel overhead, lock contention, or task chunking when you implement your parallel framework code

Use the resulting information to choose the best candidates for parallelization with threads.
| A Bulls-Eye in This Area | Means This |
|---|---|
| Red | Parallelization with threads is not beneficial - and may even cause performance degradation. Consider removing or modifying annotations, or significantly refactoring the corresponding hotspot if you want to parallelize it at any cost. |
| Yellow | The predicted maximum speed-up may not be enough to justify the effort needed to refactor and maintain your application. Consider investigating. |
| Green | Parallel performance - and power efficiency - may improve significantly. |
Before running a Dependencies analysis, make sure you set the appropriate Project Properties for the Dependencies Analysis type. (Use the same application, but a smaller input data set if possible.)
Under 4. Check Dependencies in the Threading Workflow, click the control to collect Dependencies data while your application executes. Use the resulting information to fix the data sharing problems if the predicted maximum speed-up benefit justifies the effort.
This step is optional.
Complete developer/architect design and code reviews about the proposed parallel changes.
Choose one parallel programming framework (threading model) for your application, such as Intel® Threading Building Blocks (Intel® TBB), OpenMP*, Intel® Cilk™ Plus, or some other parallel framework.
Add the parallel framework to your build environment.
Add parallel framework code to synchronize access to the shared data resources, such as Intel TBB or OpenMP* locks or Intel Cilk Plus reducers.
Add parallel framework code to create parallel tasks.
As you add the appropriate parallel code from your chosen parallel framework in the synchronization and task-creation steps above, you can keep, comment out, or replace the Intel Advisor annotations.
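As a brief illustration of those two steps, here is a hedged C/C++ sketch (the function, data, and the choice of OpenMP* are assumptions, not a prescribed framework) of a task loop with a shared accumulator expressed as parallel framework code:

```cpp
// A loop whose iterations were annotated as tasks, now expressed as
// OpenMP* parallel framework code. The reduction clause provides the
// synchronized access to the shared sum that a lock annotation would
// have marked during design.
double sum_of_squares(const double* data, int n) {
    double total = 0.0;
    #pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < n; ++i) {
        total += data[i] * data[i];   // each iteration is one parallel task
    }
    return total;
}
```

Build with your compiler's OpenMP option (for example, -qopenmp with Intel compilers); once the framework code is verified, the Intel Advisor annotations can stay in place, be commented out, or be removed.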
You can use the Intel Advisor command line interface, advixe-cl, to run analyses and reports. This makes it possible to automate many tasks as well as analyze an application running on remote hosts. You can then view results using the Intel Advisor GUI or command line reports.
Before running advixe-cl, set up your command line environment. (The Intel Advisor Release Notes include instructions for setting up the command-line environment.)
| To Do This | Tool Applicability | Use This Command Line Model |
|---|---|---|
| View a full list of command line options. | Vectorization Advisor & Threading Advisor | advixe-cl -help |
| Run a Survey analysis. | Vectorization Advisor & Threading Advisor | advixe-cl -collect survey --project-dir ./myAdvisorProj -- myTargetApplication |
| Run a Trip Counts analysis. | Vectorization Advisor & Threading Advisor | advixe-cl -collect tripcounts --project-dir ./myAdvisorProj -- myTargetApplication |
| Print a Survey Report to identify loop IDs for Refinement analyses. | Vectorization Advisor | advixe-cl -report survey --project-dir ./myAdvisorProj |
| Run a Refinement analysis. | Vectorization Advisor | advixe-cl -collect [dependencies \| map] -mark-up-list=[loopID],[loopID] --project-dir ./myAdvisorProj -- myTargetApplication |
| Run a Dependencies analysis. | Threading Advisor | advixe-cl -collect dependencies -project-dir ./myAdvisorProj -- myTargetApplication |
| Report a top-down functions list instead of a loop list. | Vectorization Advisor & Threading Advisor | advixe-cl -report survey -top-down -display-callstack |
| Report all compiler opt-report and vec-report metrics. | Vectorization Advisor | advixe-cl -report survey -show-all-columns |
| Report the top five self-time hotspots that were not vectorized because of a "not inner loop" message. | Vectorization Advisor | advixe-cl -report survey -limit 5 -filter "Vectorization Message(s)"="loop was not vectorized: not inner loop" |
To view results collected in a cluster environment, do one of the following:

- If you have an Intel Advisor GUI in your cluster environment, open a result in the GUI.
- If you do not have an Intel Advisor GUI on your cluster node, copy the result directory to another machine with the Intel Advisor GUI and open the result there.
- Use the Intel Advisor command line reports to browse results on a cluster node.
Use mpirun, mpiexec, or your preferred MPI batch job manager with the advixe-cl command to start an analysis. You may also use the -gtool option of mpirun. See the Intel® MPI Library Reference Manual (available in the Intel® Software Documentation Library) for more information.
| To Do This | Use This Command Line Model |
|---|---|
| Run 10 MPI ranks (processes), and start an Intel Advisor analysis on each rank. | $ mpirun -n 10 advixe-cl -collect survey --project-dir ./my_proj ./your_app  Intel Advisor creates a number of result directories in the current directory, named rank.0, rank.1, ... rank.n, where n is the MPI process rank. Intel Advisor does not combine results from different ranks, so you must explore each rank result independently. |
| Run 10 MPI ranks, and start an Intel Advisor analysis only on rank #1. | $ mpirun -n 1 advixe-cl -collect survey --project-dir ./my_proj ./your_app : -np 9 ./your_app |
| Document/Resource | Description |
|---|---|
| Tutorials | Guide a new user through basic walkthrough operations with a short C/C++ or Fortran sample using the Intel Advisor GUI. This index of available tutorials is installed at <advisor-install-dir>/documentation/<locale>/tutorials/index.htm. Check Intel Advisor Tutorials online for updates to tutorials. |
| Release Notes | Contain up-to-date information about the Intel Advisor, including a description, technical support, and known limitations. This document also contains system requirements, installation instructions, and instructions for setting up the command-line environment. It is installed at <advisor-install-dir>/documentation/<locale>/<release_notes>.pdf. Check Intel Advisor Release Notes online for updates to release notes. |
| Samples | Help you learn to use the Intel Advisor. Samples are installed as individual compressed files under <advisor-install-dir>/samples/en/. After you copy a sample application compressed file to a writable directory, use a suitable tool to extract the contents. Check Vectorization Sample for the Intel® Advisor XE online for vectorization sample application READMEs. |
| Help | The Help is the primary documentation for the Intel Advisor. It is also accessible from the product Help menu. This document is installed at <advisor-install-dir>/documentation/<locale>/help/index.htm. Check Intel Advisor Help online for updates to Help. |
| More Local Resources | One of the key Vectorization Advisor features is a Survey Report that offers integrated compiler reports and performance data all in one place, including GUI-embedded advice on how to fix vectorization issues specific to your code. To help you quickly locate information that augments that GUI-embedded advice, the Intel Advisor provides Intel compiler mini-guides. You can also find complete recommendations and compiler_diagnostics advice libraries in the same location as the mini-guides. Each issue and recommendation in these HTML files is collapsible/expandable. These documents are installed below <advisor-install-dir>/documentation/<locale>/advice/. |
| Web Resources | Vectorization Advisor Glossary; Vectorization Resources for Intel® Advisor XE Users; Intel® Learning Lab (white papers, articles, and more) |