# How to optimize "u[0]*v[0] + u[2]*v[2]" code line with SSE or GLSL

I have the following function (from the opensource project "recast navigation"):

```
/// Derives the dot product of two vectors on the xz-plane. (@p u . @p v)
///  @param[in]     u       A vector [(x, y, z)]
///  @param[in]     v       A vector [(x, y, z)]
/// @return The dot product on the xz-plane.
///
/// The vectors are projected onto the xz-plane, so the y-values are ignored.
inline float dtVdot2D(const float* u, const float* v)
{
    return u[0]*v[0] + u[2]*v[2];
}
```

I ran it through the VS2010 CPU performance test, and it showed me that, out of the whole recast codebase, the line `u[0]*v[0] + u[2]*v[2]` in this function is the hottest spot.

How can I optimize this line for the CPU, for example via SSE, or via a GLSL-style library such as GLM (if that is easier or faster and appropriate in such a case)?

Edit: The context in which the calls appear:

```
bool dtClosestHeightPointTriangle(const float* p, const float* a, const float* b, const float* c, float& h) {
float v0[3], v1[3], v2[3];
dtVsub(v0, c, a);
dtVsub(v1, b, a);
dtVsub(v2, p, a);

const float dot00 = dtVdot2D(v0, v0);
const float dot01 = dtVdot2D(v0, v1);
const float dot02 = dtVdot2D(v0, v2);
const float dot11 = dtVdot2D(v1, v1);
const float dot12 = dtVdot2D(v1, v2);

// Compute barycentric coordinates
const float invDenom = 1.0f / (dot00 * dot11 - dot01 * dot01);
const float u = (dot11 * dot02 - dot01 * dot12) * invDenom;
const float v = (dot00 * dot12 - dot01 * dot02) * invDenom;
```

After trying a few things on paper I've come up with something that might work for you. It is a properly parallelized/vectorized implementation of the function in SSE.

It does however require data reorganization because we'll do parallel processing for 4 triangles at once.

I will break it up in steps and use instruction names here and there, but please use the C intrinsics (_mm_load_ps(), _mm_sub_ps() et al, they're in xmmintrin.h in VC) -- when I speak of registers this just means __m128.

STEP 1.

We do not need the Y coordinate at all, so we set up pointers to pairs of X and Z. Supply at least 4 pairs (i.e. 4 triangles total) per call. I'll call each X and Z pair a vertex.
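For concreteness, the rearranged input for one batch could look something like the struct below. The struct and field names are my own illustration, not part of recast:

```cpp
// Hypothetical packed layout for one batch of 4 triangles: the X/Z pair of
// each corner is stored contiguously, so one 16-byte load fetches 2 vertices.
// The Y values are simply not stored here, since this step ignores them.
struct TriBatchXZ {
    alignas(16) float aXZ[8]; // aXZ[2*i+0], aXZ[2*i+1] = X, Z of corner a of triangle i
    alignas(16) float bXZ[8];
    alignas(16) float cXZ[8];
    alignas(16) float pXZ[8]; // the query point for each triangle
};
```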

STEP 2.

Use MOVAPS (requires the pointers to be 16-byte aligned) to load the first two vertices pointed to by each pointer into registers.

A register loaded from a will look like this: [ a0.x, a0.z, a1.x, a1.z ]
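In intrinsics this is a single `_mm_load_ps`. A small sketch (the helper name and the packed layout are my assumptions):

```cpp
#include <xmmintrin.h>

// One aligned 16-byte load (MOVAPS) fetches two X/Z vertex pairs at once.
// 'xz' must be 16-byte aligned; 'lane' picks a component so the layout can
// be inspected.
inline float loadPairLane(const float* xz, int lane)
{
    __m128 reg = _mm_load_ps(xz);      // [v0.x, v0.z, v1.x, v1.z]
    alignas(16) float out[4];
    _mm_store_ps(out, reg);
    return out[lane];
}
```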

STEP 3.

Now, using a single subtract instruction, you can calculate the deltas (your v0, v1, v2) for 2 vertices at once.

Calculate v0, v1 and v2 not only for the first 2 triangles, but also for the latter 2! As I said you should supply a total of 4 vertices, or 8 floats, per input. Just repeat steps 2 and 3 for that data.

Now we have 2 pairs of vx registers, each pair containing the result for 2 triangles. I'll refer to them as vx_0 (first pair) and vx_1 (second pair) where x is from 0 to 2.
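A sketch of this subtraction step (helper name is mine): one SUBPS yields the X/Z deltas for two triangles at once.

```cpp
#include <xmmintrin.h>

// Computes (c - a) for two triangles' X/Z pairs with a single SUBPS.
// All pointers must be 16-byte aligned; layout is [x0, z0, x1, z1].
inline void deltaXZ2(const float* cXZ, const float* aXZ, float* v0)
{
    __m128 c = _mm_load_ps(cXZ);
    __m128 a = _mm_load_ps(aXZ);
    _mm_store_ps(v0, _mm_sub_ps(c, a)); // [v0_0.x, v0_0.z, v0_1.x, v0_1.z]
}
```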

STEP 4.

Dot products. In order to parallelize the barycentric calculation (later on) we require the result of each dot product for each of the 4 triangles, in 1 single register.

So where you'd calculate dot01 for example, we will do the same, but for 4 triangles at once. Each v-register contains the result for 2 vectors, so we begin by multiplying them.

Let us say that u and v -- parameters in your dot product function -- are now v0_0 and v1_0 (as to calculate dot01):

Multiply u and v to get: [ (v0_0.x0) * (v1_0.x0), (v0_0.z0) * (v1_0.z0), (v0_0.x1) * (v1_0.x1), (v0_0.z1) * (v1_0.z1) ]

This may look confusing because of .x0 / .z0 and .x1 / .z1, but look at what was loaded at step 2: a0, a1.

If by now this still feels fuzzy, pick up a piece of paper and write along from the start.

Next, still for the same dot product, do the multiplication for v0_1 and v1_1 (i.e. the second pair of triangles).

Now we have 2 registers with 2 X & Z pairs each (4 vertices total), multiplied and ready to be added together to form 4 separate dot products. SSE3 has an instruction to do exactly this, and it's called HADDPS:

xmm0 = [A, B, C, D]
xmm1 = [E, F, G, H]

HADDPS xmm0, xmm1 gives:

xmm0 = [A+B, C+D, E+F, G+H]

It will take the X & Z pairs from our first register, those from the second, add them together and store them in the first, second, third and fourth component of the destination register. Ergo: at this point we've got this particular dot product for all 4 triangles!
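As a sketch (the helper names are mine): the multiply-then-horizontal-add step for one dot product across 4 triangles. SSE3's `_mm_hadd_ps` does the horizontal add in one instruction; I've spelled it out with SSE1 shuffles here so the data movement is explicit and no SSE3 is required.

```cpp
#include <xmmintrin.h>

// Horizontal add exactly as HADDPS would do it:
// returns [a0+a1, a2+a3, b0+b1, b2+b3].
inline __m128 haddPs(__m128 a, __m128 b)
{
    __m128 even = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0)); // [a0, a2, b0, b2]
    __m128 odd  = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1)); // [a1, a3, b1, b3]
    return _mm_add_ps(even, odd);
}

// One 2D dot product for 4 triangles at once. Each input holds two
// [x, z] pairs (triangles 0,1 in the first pair of arrays, 2,3 in the
// second); 'out' receives the 4 dot products. All pointers 16-byte aligned.
inline void dot4(const float* u01, const float* v01,
                 const float* u23, const float* v23,
                 float* out)
{
    __m128 m0 = _mm_mul_ps(_mm_load_ps(u01), _mm_load_ps(v01));
    __m128 m1 = _mm_mul_ps(_mm_load_ps(u23), _mm_load_ps(v23));
    _mm_store_ps(out, haddPs(m0, m1)); // [dot_0, dot_1, dot_2, dot_3]
}
```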

**Now repeat this process for all the dot products: dot00 et cetera.**

STEP 5.

The last calculation (as far as I could see by the supplied code) is the barycentric stuff. This is a 100% scalar calculation in your code. But your inputs now are not scalar dot product results (i.e. single floats), they are SSE vectors/registers with a dot product for each of the 4 triangles.

So if you flat out vectorize this by using the parallel SSE operations that operate on all 4 floats, you'll eventually end up with 1 register (or result) carrying 4 heights, 1 for each triangle.

Since my lunch break is well past due I'm not going to spell this one out, but given the setup/idea I've given this is a last step and shouldn't be hard to figure out.
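For what it's worth, here is a sketch of that last step under the setup above, only up to the barycentric u/v (the height interpolation would follow the same pattern). Each input register carries one dot product per triangle, as produced by the horizontal-add step:

```cpp
#include <xmmintrin.h>

// Barycentric u/v for 4 triangles at once; a direct 4-wide transcription
// of the scalar code in the question. Each __m128 holds one dot product
// per triangle. 'u' and 'v' each receive 4 results.
inline void barycentric4(__m128 dot00, __m128 dot01, __m128 dot02,
                         __m128 dot11, __m128 dot12,
                         float* u, float* v)
{
    __m128 denom = _mm_sub_ps(_mm_mul_ps(dot00, dot11),
                              _mm_mul_ps(dot01, dot01));
    __m128 invDenom = _mm_div_ps(_mm_set1_ps(1.0f), denom);
    _mm_storeu_ps(u, _mm_mul_ps(_mm_sub_ps(_mm_mul_ps(dot11, dot02),
                                           _mm_mul_ps(dot01, dot12)), invDenom));
    _mm_storeu_ps(v, _mm_mul_ps(_mm_sub_ps(_mm_mul_ps(dot00, dot12),
                                           _mm_mul_ps(dot01, dot02)), invDenom));
}
```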

I know that this idea is a bit of a stretch and requires some loving from the code that sits above it and perhaps some quality time with pencil and paper, but it will be fast (and you can even add OpenMP afterwards if you'd like).

Good luck :)

(and forgive my fuzzy explanation, I can whip up the function if need be =))

UPDATE

I've written an implementation, but it did not go as I expected, mainly because the Y component is involved beyond the piece of code that you initially pasted in your question (I looked it up). What I've done here is not just ask you to rearrange all points into X/Z pairs and feed them 4 at a time, but also to feed 3 pointers (for points A, B and C) with the Y values for each of the 4 triangles. From a local perspective this is fastest. I can still modify it to require less intrusive changes on the caller's end, but please let me know what's desirable.

Then a disclaimer: this code is straightforward as hell (something I've found to work pretty well with compilers in terms of SSE... they can reorganize it as they see fit, and x86/x64 CPUs take their share in it too). As for the naming: it's not my style, it's not anyone's; just do with it what you see fit.

Hope it helps and if not I'll gladly go over it again. And if this is a commercial project there's also the option of getting me on board I guess ;)

Anyway, I've put it on pastebin: http://pastebin.com/20u8fMEb

You can implement your single dot product with SSE instructions but the result will not be significantly faster (and may even be slower) than the code as written now. Your rewrite may defeat compiler optimizations that are helping the current version.

In order to reap any benefit from rewriting that with SSE or CUDA you have to optimize the loop that's calling for that dot product. This is particularly true for CUDA where the overhead of executing one dot product would be huge. You could only see a speedup if you sent thousands of vectors into the GPU to compute thousands of dot products. The same idea holds for SSE on the CPU but you may be able to see an improvement over a smaller number of operations. It will still be more than one dot product, though.

The easiest thing to try might be g++ -ftree-vectorize. GCC will be able to inline your small function and then try to optimize the loop for you (in fact it probably already is, but without SSE instructions). The tree vectorizer will attempt to do automatically what you propose to do by hand. It's not always successful.
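A batched loop like the following is the shape the tree vectorizer handles well. The struct-of-arrays layout here is my assumption, not recast's current layout:

```cpp
// With g++ -O3 (or -O2 -ftree-vectorize) this loop typically compiles to
// packed SSE multiplies and adds, several dot products per iteration.
// ux/uz and vx/vz hold the x and z components of the two input vectors
// for n independent 2D dot products.
void dot2dBatch(const float* ux, const float* uz,
                const float* vx, const float* vz,
                float* out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = ux[i] * vx[i] + uz[i] * vz[i];
}
```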

SSE instructions are meant for optimizing algorithms that process large blocks of data represented as integers or floating-point numbers. Typical sizes are millions and billions of numbers that need to be processed somehow. It doesn't make sense to optimize a function that processes only four (or twenty) scalars: what you gain with SSE you can lose to function-call overhead. A reasonable number of values to process per function call is at least a thousand.

It is possible that you can obtain a huge performance gain by using SSE intrinsics, but it's hard to give you specific advice tailored to your needs based on the information you provided. You should edit your question and provide a more high-level view of your problem (the functions located deeper in your call stack). For example, it is not obvious how many times dtClosestHeightPointTriangle is called per second; this number is critical to objectively judge whether a transition to SSE would be of practical value.

The organization of the data is also very important. Ideally, your data should be stored in as few linear segments of memory as possible, to utilize the cache subsystem of the CPU efficiently.

You asked for an SSE version of your algorithm so here it is:

```// Copied and modified from xnamathvector.inl
XMFINLINE XMVECTOR XMVector2DotXZ
(
FXMVECTOR V1,
FXMVECTOR V2
)
{
#if defined(_XM_NO_INTRINSICS_)

XMVECTOR Result;

Result.vector4_f32[0] =
Result.vector4_f32[1] =
Result.vector4_f32[2] =
Result.vector4_f32[3] = V1.vector4_f32[0] * V2.vector4_f32[0] + V1.vector4_f32[2] * V2.vector4_f32[2];

return Result;

#elif defined(_XM_SSE_INTRINSICS_)
// Perform the dot product on x and z
XMVECTOR vLengthSq = _mm_mul_ps(V1,V2);
// vTemp has z splatted
XMVECTOR vTemp = _mm_shuffle_ps(vLengthSq,vLengthSq,_MM_SHUFFLE(2,2,2,2));
// x + z
vLengthSq = _mm_add_ss(vLengthSq,vTemp);
// Splat the result into all four components
vLengthSq = _mm_shuffle_ps(vLengthSq,vLengthSq,_MM_SHUFFLE(0,0,0,0));
return vLengthSq;
#else // _XM_VMX128_INTRINSICS_
#endif // _XM_VMX128_INTRINSICS_
}

bool dtClosestHeightPointTriangle(FXMVECTOR p, FXMVECTOR a, FXMVECTOR b, FXMVECTOR c, float& h)
{
XMVECTOR v0 = XMVectorSubtract(c,a);
XMVECTOR v1 = XMVectorSubtract(b,a);
XMVECTOR v2 = XMVectorSubtract(p,a);

XMVECTOR dot00 = XMVector2DotXZ(v0, v0);
XMVECTOR dot01 = XMVector2DotXZ(v0, v1);
XMVECTOR dot02 = XMVector2DotXZ(v0, v2);
XMVECTOR dot11 = XMVector2DotXZ(v1, v1);
XMVECTOR dot12 = XMVector2DotXZ(v1, v2);

// Compute barycentric coordinates
XMVECTOR invDenom = XMVectorDivide(XMVectorReplicate(1.0f), XMVectorSubtract(XMVectorMultiply(dot00, dot11), XMVectorMultiply(dot01, dot01)));

XMVECTOR u = XMVectorMultiply(XMVectorSubtract(XMVectorMultiply(dot11, dot02), XMVectorMultiply(dot01, dot12)), invDenom);
XMVECTOR v = XMVectorMultiply(XMVectorSubtract(XMVectorMultiply(dot00, dot12), XMVectorMultiply(dot01, dot02)), invDenom);

// ... the rest of the original function (deriving h from u and v and
// returning the bounds test) would follow here, converted the same way.
}
```

The XMVector2Dot is taken from xnamathvector.inl; I renamed it and modified it to operate on the X/Z coordinates.

XNAMath is a great vectorized cross-platform math library by Microsoft; I use it on OS X as well by importing the sal.h header (I'm not sure about licensing issues there, so watch out). In fact, any platform that supports SSE intrinsics should support it.

A couple of things to watch for:

• You will probably not see any performance improvement from this SSE code (profile!), as there is a performance penalty for loading unaligned floats into SSE registers.
• This is a brute-force conversion of the algorithm to SSE, you will have better luck by being smarter than me and actually trying to understand the algorithm and implementing an SSE friendly version.
• You will have better luck by converting the entire application to use XNA Math/SSE code rather than just that small portion. At least enforce the use of aligned vector types (XMFLOAT3A or struct __declspec(align(16)) myvectortype { };).
• Straight SSE assembly is discouraged especially in x64, in favor of intrinsics.