# Announce: LolRemez 0.1 released

In my previous article about the Remez exchange algorithm I said I was working on a Remez exchange toolkit for everyone to play with. Though it’s far from being full-featured, I already use it and believe it is extremely useful. So I decided to release LolRemez 0.1 to the masses.

You can visit the software homepage to download LolRemez and, more importantly, the comprehensive documentation featuring a step-by-step tutorial.

# Better function approximations: Taylor vs. Remez

You may have once crossed this particular piece of magic:

    sin(a) = a − a^3/3! + a^5/5! − a^7/7! + a^9/9! − …

The right part is the Taylor series of sin around 0. It converges very quickly to the actual value of sin(a). This allows a computer to compute the sine of a number with arbitrary precision.

And when I say it’s magic, it’s because it is! Some functions, called entire functions, can be computed everywhere using one single formula! Other functions may require a different formula for different intervals; these are the analytic functions, a superset of the entire functions. In general, Taylor series are an extremely powerful tool to compute the value of a given function with very high accuracy, because for several common functions such as sin, tan or exp the terms of the series are easy to compute and, when implemented on a computer, can even be stored in a table at compile time.

## Approximating sin with Taylor series

This is how one would approximate sin using 7 terms of its Taylor series on the [-π/2, π/2] interval. The more terms, the better the precision, but we’ll stop at 7 for now:

static double taylorsin(double x)
{
    static const
    double a0 =  1.0,
           a1 = -1.666666666666666666666666666666e-1,  /* -1/3! */
           a2 =  8.333333333333333333333333333333e-3,  /*  1/5! */
           a3 = -1.984126984126984126984126984126e-4,  /* -1/7! */
           a4 =  2.755731922398589065255731922398e-6,  /*  1/9! */
           a5 = -2.505210838544171877505210838544e-8,  /* -1/11! */
           a6 =  1.605904383682161459939237717015e-10; /*  1/13! */
    double x2 = x * x;
    return x * (a0 + x2 * (a1 + x2 * (a2 + x2
             * (a3 + x2 * (a4 + x2 * (a5 + x2 * a6))))));
}


And you may think…

“Oh wow that is awesome! So simple for such a difficult function. Also, since I read your masterpiece about polynomial evaluation I know how to improve that function so that it is very fast!”

Well, actually, no.

If you are approximating a function over an interval using its Taylor series then either you or the person who taught you is a fucking idiot because a Taylor series approximates a function near a fucking point, not over a fucking interval, and if you don’t understand why it’s important then please read on because that shit is gonna blow your mind.

## Error measurement

Let’s have a look at how much error our approximation introduces. The formula for the absolute error is simple:

    error(x) = |sin(x) − taylorsin(x)|

And this is what it looks like over our interval:

You can see that the error skyrockets near the edges of the [-π/2, π/2] interval.

“Well the usual way to fix this is to split the interval in two or more parts, and use a different Taylor series for each interval.”

Oh, really? Well, let’s see the error on [-π/4, π/4] instead:

I see no difference! The error is indeed smaller, but again, it grows extremely large at the edges of the interval. And before you start suggesting an even smaller interval, here is the error on [-π/8, π/8]:

I hope this makes it clear that:

• the further from the centre of the interval, the larger the error
• the error distribution is very unbalanced
• the maximum error on [-π/2, π/2] is about 6.63e-10

And now I am going to show you why that maximum error value is pathetic.

## A better approximation

Consider this new function:

static double minimaxsin(double x)
{
    static const
    double a0 =  1.0,
           a1 = -1.666666666640169148537065260055e-1,
           a2 =  8.333333316490113523036717102793e-3,
           a3 = -1.984126600659171392655484413285e-4,
           a4 =  2.755690114917374804474016589137e-6,
           a5 = -2.502845227292692953118686710787e-8,
           a6 =  1.538730635926417598443354215485e-10;
    double x2 = x * x;
    return x * (a0 + x2 * (a1 + x2 * (a2 + x2
             * (a3 + x2 * (a4 + x2 * (a5 + x2 * a6))))));
}


It doesn’t look very different, right? Right. The values a0 to a6 are slightly different, but the rest of the code is strictly the same.

Yet what a difference it makes! Look at this error curve:

That new function makes it obvious that:

• the error distribution is better spread over the interval
• the maximum error on [-π/2, π/2] is about 4.96e-14

Check that last figure again. The new maximum error isn’t 10% better, or maybe twice as good. It is more than ten thousand times smaller!!

## The minimax polynomial

The above coefficients describe a minimax polynomial: that is, the polynomial that minimises a given error when approximating a given function. I will not go into the mathematical details, but just remember this: if the function is sufficiently well-suited (as sin, tan, exp etc. are), then the minimax polynomial can be found.

The problem? It’s hard to find. The most popular algorithm for finding it is the Remez exchange algorithm, and few people really seem to understand how it works (or there would be a lot fewer Taylor series around). I am not going to explain it right now. Usually you need professional maths tools such as Maple or Mathematica to compute a minimax polynomial. The Boost library is a notable exception, though.

But you saw the results: stop using Taylor series. Spending some time finding the minimax polynomial is definitely worth it. This is why I am working on a Remez framework that I will make public and free for everyone to use, modify and do what the fuck they want with. In the meantime, if you have functions to approximate numerically, or Taylor-based implementations that you would like to improve, let me know in the comments! These will be great use cases for me.

# C/C++ trick: static lookup table generation

There are two major ways of using lookup tables (LUTs) in C/C++ code:

• build them at runtime,
• embed them in the code.

One major advantage of runtime initialisation is the choice between static initialisation (at program startup) and lazy initialisation (on demand, to save memory). Also, the generating code can be complex, or use information that is only available at runtime.

In the case of an embedded table, the generation cost is paid once, at compile time, which can be very useful. Also, the compiler may take advantage of its early knowledge of the table contents to optimise code. However, quite often the content of embedded tables is abstruse and hardly useful to someone reading the code. Usually this is due to the use of an external program for generation, sometimes in a completely different language. But the generation can also often be done using the C/C++ preprocessor.

## Practical example

Consider the bit interleaving routine at Sean Eron Anderson's Bit Twiddling Hacks page (which, by the way, I recommend you read and bookmark). It uses the following LUT (shortened for brevity):

static const unsigned short MortonTable256[256] =
{
    0x0000, 0x0001, 0x0004, 0x0005, 0x0010, 0x0011, 0x0014, 0x0015,
    0x0040, 0x0041, 0x0044, 0x0045, 0x0050, 0x0051, 0x0054, 0x0055,
    0x0100, 0x0101, 0x0104, 0x0105, 0x0110, 0x0111, 0x0114, 0x0115,
    ... 32 lines in total ...
    0x5540, 0x5541, 0x5544, 0x5545, 0x5550, 0x5551, 0x5554, 0x5555
};


The MortonTable256 table has, as its name suggests, 256 elements. It was pregenerated by some external piece of code which probably looked like this:

for (int i = 0; i < 256; i++)
    MortonTable256[i] = (i & 1) | ((i & 2) << 1) | ((i & 4) << 2) | ((i & 8) << 3)
                      | ((i & 16) << 4) | ((i & 32) << 5) | ((i & 64) << 6) | ((i & 128) << 7);


The problem with that external piece of code is that it is external. You cannot write it in this form and have it fill the table at compile time.

If you only keep the output of this code, the information on how the table was created is lost. That makes it impractical to build another table that has, for instance, all values shifted one bit left. Even if such a table were created using a modified version of the above code, switching between the two tables would be a hassle unless both versions were kept around behind preprocessor conditionals.

## Preprocessor iterator

Here is one way to get the best of both worlds. First, declare the following iterator macros. They can be declared somewhere in a global .h, maybe with more descriptive names:

#define S4(i)    S1((i)),   S1((i)+1),     S1((i)+2),     S1((i)+3)
#define S16(i)   S4((i)),   S4((i)+4),     S4((i)+8),     S4((i)+12)
#define S64(i)   S16((i)),  S16((i)+16),   S16((i)+32),   S16((i)+48)
#define S256(i)  S64((i)),  S64((i)+64),   S64((i)+128),  S64((i)+192)
#define S1024(i) S256((i)), S256((i)+256), S256((i)+512), S256((i)+768)


Their purpose is simple: calling e.g. S16(i) will expand to S1(i), S1(i+1), …, S1(i+15). Similarly, S256(i) will call S1 with values from i to i + 255.

And this is how to use them in our example:

static const unsigned short MortonTable256[256] =
{
#define S1(i) (((i) & 1) | (((i) & 2) << 1) | (((i) & 4) << 2) | (((i) & 8) << 3) \
             | (((i) & 16) << 4) | (((i) & 32) << 5) | (((i) & 64) << 6) | (((i) & 128) << 7))
    S256(0)
#undef S1
};


That's it! The table will be built at compile time, and you get to keep the logic behind it.

## A more complex example

Jeroen van der Zijp's fast half float conversions paper describes table-based methods to convert between 16-bit and 32-bit floating point values. The construction of one of the LUTs is as follows:

void generatetables(){
    for(unsigned int i=0; i<256; ++i){
        int e=i-127;
        if(e<-24){                    // Very small numbers map to zero
            basetable[i|0x000]=0x0000;
            basetable[i|0x100]=0x8000;
        } else if(e<-14){             // Small numbers map to denorms
            basetable[i|0x000]=(0x0400>>(-e-14));
            basetable[i|0x100]=(0x0400>>(-e-14)) | 0x8000;
        } else if(e<=15){             // Normal numbers just lose precision
            basetable[i|0x000]=((e+15)<<10);
            basetable[i|0x100]=((e+15)<<10) | 0x8000;
        } else if(e<128){             // Large numbers map to Infinity
            basetable[i|0x000]=0x7C00;
            basetable[i|0x100]=0xFC00;
        } else{                       // Infinity and NaN's stay Infinity and NaN's
            basetable[i|0x000]=0x7C00;
            basetable[i|0x100]=0xFC00;
        }
    }
}


And this is the compile-time version:

static uint16_t const basetable[512] =
{
#define S1(i) (((i) < 103) ? 0x0000 : \
               ((i) < 113) ? 0x0400 >> (0x1f & (113 - (i))) : \
               ((i) < 143) ? ((i) - 112) << 10 : 0x7c00)
    S256(0),
#undef S1
#define S1(i) (0x8000 | basetable[i])
    S256(0),
#undef S1
};


In this case the macro code is slightly bigger and was slightly rewritten, but is no more complicated than the original code. Note also the elegant reuse of previous values in the second half of the table.

This trick is certainly not new, but since I have found practical uses for it, I thought you may find it useful, too.

# Understanding basic motion calculations in games: Euler vs. Verlet

During the past month, I have found myself having to explain the contents of this article to six different people, either at work or over the Internet. Though there are a lot of articles on the subject, it’s as if almost everyone still gets it wrong. I was still polishing this article when I had the opportunity to explain it a seventh time.

And two days ago a coworker told me the source code of a certain framework disagreed with me… The kind of framework that probably has three NDAs preventing me from even thinking about it.

Well, that framework got it wrong, too. So now I’m mad at the entire world for no rational reason other than the ever-recurring realisation of the amount of wrong out there, and this article is but a catharsis to deal with my uncontrollable rage.

## A simple example

Imagine a particle with position Pos and velocity Vel affected by acceleration Accel. Let’s say for the moment that the acceleration is constant. This is the case when only gravity is present.

A typical game engine loop will update position with regards to a timestep (often the duration of a frame) using the following method, known as Euler integration:

void Particle::Update(float dt)
{
    Accel = vec3(0, 0, -9.81); /* Constant acceleration: gravity */
    Vel = Vel + Accel * dt;    /* New, timestep-corrected velocity */
    Pos = Pos + Vel * dt;      /* New, timestep-corrected position */
}


This comes directly from the definitions of acceleration and velocity:

    Accel = dVel/dt
    Vel   = dPos/dt

Putting these two differential equations into Euler integration gives us the above code.

## Measuring accuracy

Typical particle trajectories would look a bit like this:

These are three runs of the above simulation with the same initial values:

• once with maximum accuracy,
• once at 60 frames per second,
• once at 30 frames per second.

You can notice the slight inaccuracy in the trajectories.

You may think…

“Oh, it could be worse; it’s just the expected inaccuracy with different framerate values.”

Well, no.

Just… no.

If you are updating positions this way and you do not have a really good reason for doing so then either you or the person who taught you is a fucking idiot and should not have been allowed to write so-called physics code in the first place and I most certainly hope to humbly bestow enlightenment upon you in the form of a massive cluebat and don’t you dare stop reading this sentence before I’m finished.

## Why this is wrong

When doing kinematics, computing position from acceleration is an integration process. First you integrate acceleration with respect to time to get velocity, then you integrate velocity to get position.

The integral of a function can be seen as the area below its curve. So, how do you properly get the integral of our velocity between t and t + dt, i.e. the green area below?

It’s not by doing new_velocity * dt (left image).

It’s not by doing old_velocity * dt either (middle image).

It’s obviously by doing (old_velocity + new_velocity) * 0.5 * dt (right image).

## And now for the correct code

This is what the update method should look like. It’s called Velocity Verlet integration (not strictly the same as Verlet integration, but with a similar error order) and it always gives the perfect, exact position of the particle in the case of constant acceleration, even with the nastiest framerate you can think of. Even at two frames per second.

void Particle::Update(float dt)
{
    Accel = vec3(0, 0, -9.81);
    vec3 OldVel = Vel;
    Vel = Vel + Accel * dt;
    Pos = Pos + (OldVel + Vel) * 0.5 * dt;
}


And the resulting trajectories at different framerates:

## Further reading

“Oh wow thank you. But what if acceleration is not constant, like in real life?”

Well I am glad you asked.

Euler integration and Verlet integration are part of a family of iterative methods known as the Runge-Kutta methods, respectively of first order and second order. There are many more for you to discover and study.

• Richard Lord did this nice and instructive animated presentation about several integration methods.
• Glenn Fiedler also explains in this article why idiots use Euler, and provides a nice introduction to RK4 together with source code.
• Florian Boesch did a thorough coverage of various integration methods for the specific application of gravitation (it is one of the rare cases where Euler seems to actually perform better).

In practice, Verlet will still only give you an approximation of your particle’s position. But it will almost always be a much better approximation than Euler. If you need even more accuracy, look at the fourth-order Runge-Kutta (RK4) method. Your physics will suck a lot less, I guarantee it.

## Acknowledgements

I would like to thank everyone cited in this article, explicitly or implicitly, as well as the commenters below who spotted mistakes and provided corrections or improvements.

# GLSL code snippet: choosing from 4 vectors by Z value

There aren’t many low-level GLSL optimisation resources out there, so I decided that I would share my thoughts when working on some specific parts of my code.

## The basic, complete shader function

This one time I had four vec3 vectors, each storing an xy texture coordinate and a weight in z. The code to compute the final pixel value was:

vec4 getColor(vec3 a, vec3 b, vec3 c, vec3 d)
{
    vec4 pa = texture2D(tex, a.xy) * a.z;
    vec4 pb = texture2D(tex, b.xy) * b.z;
    vec4 pc = texture2D(tex, c.xy) * c.z;
    vec4 pd = texture2D(tex, d.xy) * d.z;
    return (pa + pb + pc + pd) / (a.z + b.z + c.z + d.z);
}


That is four texture lookups, which is expensive.

## The lightweight version for coarse levels of detail

If I wanted a more lightweight fragment shader, for instance when implementing variable levels of shader complexity, I would want to do only one texture lookup, and use the vector with the largest weight:

vec4 getColorFast(vec3 a, vec3 b, vec3 c, vec3 d)
{
    if (a.z < c.z) // These two tests are
        a = c;     // likely to be run in
    if (b.z < d.z) // parallel because they use
        b = d;     // independent data.
    if (a.z < b.z)
        a = b;
    return texture2D(tex, a.xy);
}


Only one texture lookup, but three branches. Branches are expensive and should be avoided.

Fortunately, GLSL provides step() and mix() (in HLSL or Cg, step() and lerp()) that let us do things similar to fsel on the PowerPC, or vec_sel in AltiVec: a branch-free select.

vec4 getColorFaster(vec3 a, vec3 b, vec3 c, vec3 d)
{
    a = mix(a, c, step(a.z, c.z)); // Again, potentially good
    b = mix(b, d, step(b.z, d.z)); // scheduling between these lines
    a = mix(a, b, step(a.z, b.z));
    return texture2D(tex, a.xy);
}


Excellent! Only six instructions in addition to the texture lookup.

But if you are familiar with SSE or AltiVec-style SIMD programming on the CPU, you will know this is not the usual way to do things. Rather than four vectors of three elements, SIMD programming prefers to work in parallel on three vectors holding all four X, Y and Z components:

vec4 getColorShuffled(vec4 allx, vec4 ally, vec4 allz)
{
    /* Now what do we do here? */
}


One nice thing to realise is that the equivalent of our previous step(a.z, c.z) and step(b.z, d.z) tests can be done in parallel:

vec4 getColorShuffled(vec4 allx, vec4 ally, vec4 allz)
{
    // compare a.z >? b.z and c.z >? d.z in parallel
    vec2 t = step(vec2(allz[0], allz[2]), vec2(allz[1], allz[3]));
    // choose between a and b using t[0], between c and d using t[1]
    vec2 twox = mix(vec2(allx[0], allx[2]), vec2(allx[1], allx[3]), t);
    vec2 twoy = mix(vec2(ally[0], ally[2]), vec2(ally[1], ally[3]), t);
    vec2 twoz = mix(vec2(allz[0], allz[2]), vec2(allz[1], allz[3]), t);
    // compare the z values of the two winners
    float s = step(twoz[0], twoz[1]);
    // now choose between the two winners using s
    vec2 best = vec2(mix(twox[0], twox[1], s), mix(twoy[0], twoy[1], s));
    return texture2D(tex, best);
}


Wow, that’s a bit complex. And even though we are down to two calls to step() instead of three, there are now five calls to mix() instead of three. Fortunately, thanks to swizzling, we can combine most of these calls to mix():

vec4 getColorShuffledFast(vec4 allx, vec4 ally, vec4 allz)
{
    vec2 t = step(allz.ag, allz.rb);
    vec4 twoxy = mix(vec4(allx.ag, ally.ag), vec4(allx.rb, ally.rb), t.xyxy);
    vec2 twoz  = mix(allz.ag, allz.rb, t);
    float t2 = step(twoz.g, twoz.r);
    vec2 best = mix(twoxy.ga, twoxy.rb, t2);
    return texture2D(tex, best);
}


That’s it! Only three mix() and two step() instructions. Quite a few swizzles, but these are extremely cheap on modern GPUs.

## Afterthoughts

The above transformation came at the “cost” of a big data layout change, known as switching from array of structures to structure of arrays. When working in parallel on similar data, it is very often a good idea, and the GPU is no exception here.

This was actually a life saver when trying to get a fallback version of a shader to work on an i915 card, where mix and step must be emulated using ALU instructions, up to a maximum of 64. The result can be seen in this NaCl plugin.