There aren’t many low-level GLSL optimisation resources out there, so I decided to share some thoughts from working on specific parts of my code.

## The basic, complete shader function

This one time I had four `vec3` vectors, each holding a texture coordinate in `xy` and a weight in `z`. The code to compute the final pixel value was:

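```glsl
uniform sampler2D tex; // the texture sampled by the functions in this post

vec4 getColor(vec3 a, vec3 b, vec3 c, vec3 d)
{
    // one lookup per vector, each scaled by its weight
    vec4 pa = texture2D(tex, a.xy) * a.z;
    vec4 pb = texture2D(tex, b.xy) * b.z;
    vec4 pc = texture2D(tex, c.xy) * c.z;
    vec4 pd = texture2D(tex, d.xy) * d.z;
    // weighted average of the four samples
    return (pa + pb + pc + pd) / (a.z + b.z + c.z + d.z);
}
```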

That is four texture lookups, which is expensive.

## The lightweight version for coarse levels of detail

If I wanted a more lightweight fragment shader, for instance when implementing **variable levels of shader complexity**, I would want to do only one texture lookup, and use the vector with the largest weight:

```glsl
vec4 getColorFast(vec3 a, vec3 b, vec3 c, vec3 d)
{
    if (a.z < c.z)  // These two tests are
        a = c;      // likely to be run in
    if (b.z < d.z)  // parallel because they use
        b = d;      // independent data.
    if (a.z < b.z)
        a = b;
    return texture2D(tex, a.xy);
}
```

Only one texture lookup, but three branches. Branches can be expensive on GPUs, especially on older hardware, so they are best avoided here.

Fortunately, GLSL provides `step()` and `mix()` (in HLSL or Cg, `step()` and `lerp()`) that let us do things similar to `fsel` on the PowerPC, or `vec_sel` in AltiVec: a branch-free select.

```glsl
vec4 getColorFaster(vec3 a, vec3 b, vec3 c, vec3 d)
{
    a = mix(a, c, step(a.z, c.z)); // Again, potentially good
    b = mix(b, d, step(b.z, d.z)); // scheduling between these lines
    a = mix(a, b, step(a.z, b.z));
    return texture2D(tex, a.xy);
}
```

Excellent! Only six instructions in addition to the texture lookup.
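To make the select explicit, here is a minimal sketch of what each `mix(…, step(…))` pair computes. The helper name `pickHeavier` is hypothetical and only serves the illustration; the behaviour of `step()` and `mix()` described in the comments is the standard GLSL one.

```glsl
// Hypothetical helper, not part of the shader above: returns whichever
// of p and q has the larger weight in z, without branching.
vec3 pickHeavier(vec3 p, vec3 q)
{
    // step(edge, x) is 0.0 when x < edge and 1.0 otherwise,
    // so w is 1.0 exactly when q.z >= p.z.
    float w = step(p.z, q.z);
    // mix(p, q, w) computes p * (1.0 - w) + q * w; with w restricted
    // to 0.0 or 1.0 and finite inputs, this yields either p or q.
    return mix(p, q, w);
}
```

With such a helper, the body of `getColorFaster()` would read `a = pickHeavier(a, c); b = pickHeavier(b, d); a = pickHeavier(a, b);`. On a tie the second argument wins, whereas the `if` version keeps the first; with equal weights either choice is as good.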

But if you are familiar with SSE or AltiVec-style SIMD programming on the CPU, you will know this is not the usual way to do it. Rather than 4 vectors of 3 elements, SIMD code prefers to work in parallel on 3 vectors of 4 elements, one each for the `X`, `Y` and `Z` components:

```glsl
vec4 getColorShuffled(vec4 allx, vec4 ally, vec4 allz)
{
    /* Now what do we do here? */
}
```

One nice thing to realise is that the equivalent of our previous `step(a.z, c.z)` and `step(b.z, d.z)` tests can be done in parallel:

```glsl
vec4 getColorShuffled(vec4 allx, vec4 ally, vec4 allz)
{
    // compare lane 0 with lane 1 and lane 2 with lane 3 in parallel
    // (any pairing of the four weights leads to the same maximum)
    vec2 t = step(vec2(allz[0], allz[2]), vec2(allz[1], allz[3]));
    // choose between lanes 0 and 1 using t[0], between lanes 2 and 3 using t[1]
    vec2 twox = mix(vec2(allx[0], allx[2]), vec2(allx[1], allx[3]), t);
    vec2 twoy = mix(vec2(ally[0], ally[2]), vec2(ally[1], ally[3]), t);
    vec2 twoz = mix(vec2(allz[0], allz[2]), vec2(allz[1], allz[3]), t);
    // compare the weights of the two survivors
    float s = step(twoz[0], twoz[1]);
    // now choose between the two survivors using s
    vec2 best = vec2(mix(twox[0], twox[1], s), mix(twoy[0], twoy[1], s));
    return texture2D(tex, best);
}
```

Wow, that’s a bit complex. And even if we’re doing two calls to `step()` instead of three, there are now five calls to `mix()` instead of three. Fortunately, thanks to swizzling, we can combine most of these calls to `mix()`:

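```glsl
vec4 getColorShuffledFast(vec4 allx, vec4 ally, vec4 allz)
{
    // compare lane 3 with lane 0 and lane 1 with lane 2 in parallel
    vec2 t = step(allz.ag, allz.rb);
    // twoxy = (x of first survivor, x of second, y of first, y of second)
    vec4 twoxy = mix(vec4(allx.ag, ally.ag), vec4(allx.rb, ally.rb), t.xyxy);
    vec2 twoz = mix(allz.ag, allz.rb, t);
    // a vec2 only has .r and .g; t2 is 1.0 when the first survivor wins
    float t2 = step(twoz.g, twoz.r);
    // .ga is the second survivor's (x, y), .rb is the first survivor's
    vec2 best = mix(twoxy.ga, twoxy.rb, t2);
    return texture2D(tex, best);
}
```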

That’s it! Only three `mix()` and two `step()` instructions. Quite a few swizzles, but these are extremely cheap on modern GPUs.
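For readers less used to swizzles, here is a small sketch spelling out which lanes the two swizzles used above pick. The helper names `lanes02` and `lanes31` are hypothetical and exist only for this illustration.

```glsl
// Illustrative only. GLSL lets x/y/z/w and r/g/b/a name the same
// four lanes of a vec4, in that order.
vec2 lanes02(vec4 v)
{
    return v.rb; // same as v.xz, or vec2(v[0], v[2])
}

vec2 lanes31(vec4 v)
{
    return v.ag; // same as v.wy, or vec2(v[3], v[1])
}
```

So `step(allz.ag, allz.rb)` compares lane 3 against lane 0 and lane 1 against lane 2. That is a different pairing from the explicit indexed version, which is fine because any pairing of the four weights leads to the same maximum.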

## Afterthoughts

The above transformation was at the “cost” of a big data layout change known as **array of structures to structure of arrays**. When working in parallel on similar data, it is very often a good idea, and the GPU was no exception here.
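To make the layout change concrete, here is a minimal sketch that packs the four original `vec3` values into the three `vec4` arguments. The a, b, c, d lane order and the name `getColorFromVec3s` are assumptions made for illustration; in a real shader the data would ideally arrive already in this layout rather than being repacked for every fragment.

```glsl
// Hypothetical wrapper: converts array-of-structures input (four vec3)
// to structure-of-arrays (one vec4 per component) and forwards it.
// Assumes lane order a, b, c, d.
vec4 getColorFromVec3s(vec3 a, vec3 b, vec3 c, vec3 d)
{
    vec4 allx = vec4(a.x, b.x, c.x, d.x); // all texture x coordinates
    vec4 ally = vec4(a.y, b.y, c.y, d.y); // all texture y coordinates
    vec4 allz = vec4(a.z, b.z, c.z, d.z); // all weights
    return getColorShuffledFast(allx, ally, allz);
}
```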

This was actually a life saver when trying to get a fallback version of a shader to work on an i915 card, where `mix` and `step` must be emulated using ALU instructions and a shader is limited to 64 of them. The result can be seen in this NaCl plugin.