If possible, define a vector version of the function using one of the following constructs:
Target | ICL/ICC/ICPC Construct | IFORT Construct
Source function | #pragma omp declare simd | !$OMP DECLARE SIMD
Source function | __declspec(vector) (Windows OS) or __attribute__((vector)) (Linux OS) | ELEMENTAL keyword or !DIR$ ATTRIBUTES VECTOR
In this example, you can vectorize the loop and the function call using the explicit vector programming capabilities of OpenMP* 4.0 or Intel® Cilk™ Plus. Add a !$OMP DECLARE SIMD directive to the function bar() and compile with the /Qopenmp-simd option (-qopenmp-simd on Linux OS) to generate a vectorized version of bar(). Add the same directive to the interface block for bar() inside program foo. The UNIFORM clause specifies that xp is a non-varying argument: it has the same value for each iteration of the caller's loop being vectorized. Thus x is the only vector argument. Without UNIFORM, the compiler must allow for xp also being a vector argument.
    real(8) function bar(x, xp)
    !$OMP DECLARE SIMD (bar) UNIFORM(xp)
      implicit none
      real(8), intent(in) :: x, xp
      bar = 1. - 2.*(x-xp) + 3.*(x-xp)**2 - 1.5*(x-xp)**3 + 0.2*(x-xp)**4
      bar = bar / sqrt(x**2 + xp**2)
    end function bar
The code now generates a vectorized version of function bar(); however, the loop inside foo is still not vectorized because the compiler sees dependencies between loop iterations carried by both x and sumx. Unaided, the compiler could determine how to auto-vectorize a loop with just these dependencies, or vectorize a loop with just the function call, but not both. We can tell the compiler to vectorize the loop with a !$OMP SIMD directive that specifies the properties of x and sumx:
    program foo
      implicit none
      integer, parameter :: nx = 100000000
      real(8) :: x, xp, sumx
      integer :: i
      interface
        real(8) function bar(x, xp)
        !$OMP DECLARE SIMD (bar) UNIFORM(xp)
          real(8), intent(in) :: x, xp
        end function bar
      end interface
      sumx = 0.
      xp = 1.
      !$OMP SIMD private(x) reduction(+:sumx)
      do i = 1,nx
        x = 1.D-8*real(i,8)
        sumx = sumx + bar(x,xp)
      enddo
      print *, 'Sum =',sumx
    end program foo
The loop now vectorizes successfully, and running the application shows a performance speedup.
For small functions such as bar(), inlining may be a simpler and more efficient way to achieve vectorization of loops containing function calls. When the caller and callee are in separate source files, as above, build the application with interprocedural optimization (-ipo or /Qipo). When the caller and callee are in the same source file, inlining of small functions is enabled by default at optimization level O2 and above.