Damping with delta-time

Just a quick tip on how to convert usual damping code to something framerate-independent.

Most of us have probably, at some point, written code resembling this:

// Perform velocity damping
velocity -= velocity * 0.01f;

… or probably the more correct:

// Per-second damping coefficient
float const D = 10.0f;
// Damp velocity according to timestep
velocity -= velocity * D * delta_time;

Yet this is not fully framerate-independent; results are slightly different at 30fps and 60fps, and more importantly, spikes in the framerate cause lots of weird artifacts, causing developers to attempt to fix the situation by clamping delta_time, which is not ideal.

The exponentiation method

Here is one way to fix it: assume that the code works correctly at 60 fps. This means that each frame, velocity is effectively multiplied by 1 - D / 60.

After one second, i.e. 60 frames, velocity has been multiplied by (1 - D / 60) ^ 60.

After two seconds, it has been multiplied by (1 - D / 60) ^ (60 * 2).

After N seconds, it has been multiplied by (1 - D / 60) ^ (60 * N).

So, there, we have a formula that tells us what happens after N seconds, and it’s a continuous function. We can therefore choose N as we like, and especially N = delta_time:

// Per-second damping coefficient
float const D = 10.0f;
// Damp velocity (framerate-independent)
velocity *= pow(1.f - D / 60.f, 60.f * delta_time);

Which can be conveniently rewritten as:

// Per-second damping coefficient
float const D = 10.0f;
// Exponentiation base for velocity damping
float const D2 = pow(1.f - D / 60.f, 60.f);
// Damp velocity (framerate-independent)
velocity *= pow(D2, delta_time);

The stolen bytes: Visual Studio, virtual methods and data alignment

This article describes a design choice in the C++ ABI of the Visual Studio compiler that I believe should be considered a bug. I propose a trivial workaround at the end.

TL;DR — if the topmost polymorphic class in a hierarchy has members with alignment requirement N where N > sizeof(void *), the Visual Studio compiler may add up to N bytes of useless padding to your objects.

Update: be sure to read the explanation by Jan Gray, who designed the relevant part of the MS C++ ABI some 22 years ago, in the comments section below.

My colleague Benlitz first hit the problem when trying to squeeze memory out of some of our game’s most often instantiated classes. I think it is best illustrated with the following minimal example:

class Foo
{
    virtual void Hello() {}
    float f;     /* 4 bytes */
};
class Bar
{
    virtual void Hello() {}
    float f;     /* 4 bytes */
    double d;    /* 8 bytes */
};

This is the size of Foo and Bar on various 32-bit platforms:

Platform sizeof(Foo) sizeof(Bar) Madness?
Linux x86 (gcc) 8 16 no
Linux ARMv9 (gcc) 8 16 no
Win32 (gcc) 8 16 no
Win32 (Visual Studio 2010) 8 24 yes
Xbox 360 (Visual Studio 2010) 8 24 yes
PlayStation 3 (gcc) 8 16 no
PlayStation 3 (SNC) 8 16 no
Mac OS X x86 (gcc) 8 16 no

There is no trick. This is by design. The Visual Studio compiler is literally stealing 8 bytes from us!

What the fuck is happening?

This is the memory layout of Foo on all observed platforms:

\begin{tabular}{|r|llll|llll|}
\hline
byte & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\
\hline
field & \multicolumn{4}{|c|}{\textit{vfptr}}
      & \multicolumn{4}{|c|}{\texttt{float f;}} \\
\hline
\end{tabular}

The vfptr field is a special pointer to the vtable. The vtable is probably the most widespread compiler-specific way to implement virtual methods. Since all the platforms studied here are 32-bit, this pointer requires 4 bytes. A float requires 4 bytes, too. The total size of the class is therefore 8 bytes.

This is the memory layout of Bar on eg. Linux using GCC:

\begin{tabular}{|r|llll|llll|llllllll|}
\hline
byte & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7
     & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\
\hline
field & \multicolumn{4}{|c|}{\textit{vfptr}}
      & \multicolumn{4}{|c|}{\texttt{float f;}}
      & \multicolumn{8}{|c|}{\texttt{double d;}} \\
\hline
\end{tabular}

The double type has an alignment requirement of 8 bytes, which makes it fit perfectly at byte offset 8.

And finally, this is the memory layout of Bar on Win32 using Visual Studio 2010:

\begin{tabular}{|r|llll|llll|llll|llll|llllllll|}
\hline
byte & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7
     & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15
     & 16 & 17 & 18 & 19 & 20 & 21 & 22 & 23 \\
\hline
field & \multicolumn{4}{|c|}{\textit{vfptr}}
      & \multicolumn{4}{|c|}{\textit{padding}}
      & \multicolumn{4}{|c|}{\texttt{float f;}}
      & \multicolumn{4}{|c|}{\textit{padding}}
      & \multicolumn{8}{|c|}{\texttt{double d;}} \\
\hline
\end{tabular}

This is madness! The requirement for the class to be 8-byte aligned causes the first element of the class to be 8-byte aligned, too! I demand a rational explanation for this design choice.

The problem is that the compiler decides to add the vtable pointer after it has aligned the class data, resulting in excessive realignment.

Compilers affected

The Visual Studio compilers for Win32, x64 and Xbox 360 all appear to create spurious padding in classes.

Though this article focuses on 32-bit platforms for the sake of simplicity, 64-bit Windows is affected, too.

The problem becomes even worse with larger alignment requirements, for instance with SSE3 or AltiVec types that require 16-byte storage alignment such as _FP128:

class Quux
{
    virtual void Hello() {}
    float f;     /* 4 bytes */
    _FP128 dd;   /* 16 bytes */
};

This is the GCC memory layout on both 32-bit and 64-bit platforms:

\begin{tabular}{|r|c|c|c|c|c|c|c|c|}
\hline
byte & 0--3 & 4--7 & 8--11 & 12--15
     & 16--19 & 20--23 & 24--27 & 28--31 \\
\hline
\hline
field (32-bit) & \textit{vfptr}
                & \texttt{float f;}
                & \multicolumn{2}{|c|}{\textit{padding}}
                & \multicolumn{4}{|c|}{\texttt{\_FP128 dd;}} \\
\hline
field (64-bit) & \multicolumn{2}{|c|}{\textit{vfptr}}
               & \texttt{float f;}
               & \textit{padding}
               & \multicolumn{4}{|c|}{\texttt{\_FP128 dd;}} \\
\hline
\end{tabular}

The padding there is perfectly normal and expected, because of the alignment requirements for dd.

But this is how Visual Studio decides to lay it out:

\begin{tabular}{|r|c|c|c|c|c|c|c|c|c|c|c|c|}
\hline
byte & 0--3 & 4--7 & 8--11 & 12--15
     & 16--19 & 20--23 & 24--27 & 28--31
     & 32--35 & 36--39 & 40--43 & 44--47 \\
\hline
\hline
field (32-bit) & \textit{vfptr}
               & \multicolumn{3}{|c|}{\textit{padding}}
               & \texttt{float f;}
               & \multicolumn{3}{|c|}{\textit{padding}}
               & \multicolumn{4}{|c|}{\texttt{\_FP128 dd;}} \\
\hline
field (64-bit) & \multicolumn{2}{|c|}{\textit{vfptr}}
               & \multicolumn{2}{|c|}{\textit{padding}}
               & \texttt{float f;}
               & \multicolumn{3}{|c|}{\textit{padding}}
               & \multicolumn{4}{|c|}{\texttt{\_FP128 dd;}} \\
\hline
\end{tabular}

That is 16 lost bytes, both on 32-bit and 64-bit versions of Windows.

Workaround

There is fortunately a workaround if you want to get rid of the useless padding. It is so trivial that it actually makes me angry that the problem exists in the first place.

This will get you your bytes back:

class EmptyBase
{
protected:
    virtual ~EmptyBase() {}
};
class Bar : public EmptyBase
{
    virtual void Hello() {}
    float f;     /* 4 bytes */
    double d;    /* 8 bytes */
};

And this is the size of Bar on the same 32-bit platforms:

Platform sizeof(Bar)
Linux x86 (gcc) 16
Linux ARMv9 (gcc) 16
Win32 (gcc) 16
Win32 (Visual Studio 2010) 16
Xbox 360 (Visual Studio 2010) 16
PlayStation 3 (gcc) 16
PlayStation 3 (SNC) 16
Mac OS X x86 (gcc) 16

Phew. Sanity restored.

\begin{tabular}{|r|cccc|cccc|cccccccc|}
\hline
byte & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7
     & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\
\hline
\texttt{EmptyBase} fields & \multicolumn{4}{|c|}{\textit{vfptr}}
                          & \multicolumn{12}{|c|}{} \\
\hline
\texttt{Bar} fields & \multicolumn{4}{c}{\texttt{EmptyBase}}
                    & \multicolumn{4}{|c|}{\texttt{float f;}}
                    & \multicolumn{8}{|c|}{\texttt{double d;}} \\
\hline
\end{tabular}

The compiler is a lot less confused now: it no longer has to create space for a vfptr in Bar since it is technically already part of EmptyBase.

Conclusion

Lessons learned:

  • The pointer to the vtable isn’t just like any other pointer.
  • Various C++ ABIs have different stances on padding and alignment.
  • Inheriting from an empty abstract class can make your objects smaller on Windows and Xbox 360!
  • Design decisions can haunt you for decades!

The workaround is so simple that it sounds like a good idea to always use it, preemptively.

Setting up a real Compose key on Mac OS X

The Compose key is my method of choice for international character input. It lets me type characters as diverse as é Â ẃ ṗ § … « ¿ ¥ ¹ ½ © × using simple and intuitive key combinations.

How intuitive exactly? Let’s see:

  • Compose + C + , gives me Ç.
  • Compose + l + / gives me ł.
  • Compose + < + < gives me «.
  • Compose + - + > gives me →.
  • Compose + < + 3 gives me ♥.
  • Compose + C + C + C + P gives me ☭ (no kidding).

You can of course set up your own rules. This shit is so powerful that I cannot imagine I could ever use any other input method.

So, one would think that with all its glorious Unix heritage, Mac OS X would let you get the most out of your keyboard like the good old X11 system does. Well it turns out it’s possible, but not straightforward.

Fortunately, other people already did all the work. I will just indicate how to put their stuff together.

Step 1: choose a Compose key

Choose the Compose key so that it is easily accessible but does not prevent you from doing anything you ordinarily do. Fortunately, modern keyboards come with more and more idiotic and useless keys.

I use the Right Alt key as my Compose key. I already have a Left Alt key so the right one is a bit useless to me. And it somehow matches the position of the Compose key on old Sun keyboards.

That would be Right Option on a Mac keyboard. I recommend that.

Step 2: remap the Compose key

The problem is that the Mac OS X keyboard preferences:

  • do not let you differentiate between Left and Right Option keys
  • only let you remap modifier keys to another modifier key (or to nothing)

Fortunately, there is KeyRemap4MacBook that lets you do very low level things with your keyboard. Install it.

We will now remap our compose key to something that the next layer will understand. I chose Shift-Control-F13 for that. It is very unlikely you will need that key combination.

In the file ~/Library/Application Support/KeyRemap4MacBook/private.xml put the following:

<?xml version="1.0"?>
<root>
  <item>
    <name>Send Shift-Ctrl-F13 for Right Option</name>
    <identifier>private.send_shift_ctrl_f13_for_ropt</identifier>
    <autogen>--KeyToKey-- KeyCode::OPTION_R,
                          KeyCode::F13, ModifierFlag::SHIFT_L
                                      | ModifierFlag::CONTROL_L
    </autogen>
  </item>
</root>

Finally, from the System Preferences, open the KeyRemap4MacBook settings and click on the ReloadXML button:

The new option should appear. Activate it:

Step 3: create compose bindings

The last step is the creation of the actual bindings. I chose to import the rules from /usr/share/X11/locale/en_US.UTF-8/Compose on my Debian system.

Bob Kåres wrote a script that lets you convert X11 compose rules into Cocoa key bindings.

Either convert a Compose file of your own using Bob’s script, or download my DefaultKeyBinding.dict. Save it in ~/Library/KeyBindings/DefaultKeyBinding.dict.

Be careful: by default Bob’s script uses F13 instead of Shift-Ctrl-F13 so in DefaultKeyBinding.dict you need to change:

    "\UF710"

into:

    "^$\UF710"

If for some reason you decided to go for another combination, check out this article by Xah Lee to find out the proper syntax.

Step 4: restart all applications

And that’s it! Your Mac OS X system is now slightly more usable.

LINK : fatal error LNK1104: cannot open file ‘XAPID.lib’

Ever got a link error for a library that was referenced nowhere in your Visual Studio project or even in the final link.exe command line? Here's a hint: check the contents of static libraries, too. They may be pulling unexpected dependencies behind your back!

If the static library is part of your solution, here is another hint: check that the [Configuration Properties] >> [C/C++] >> [Code Generation] >> [Runtime Library] configuration values match across projects.

Maths trick: doing fewer comparisons

Note: this is not an optimisation. It is just one more tool you should have in your toolbox when looking for optimisations. It may be useful.

This is the trick:

\[\min(x,y) = \dfrac{x + y - |x - y|}{2}\]
\[\max(x,y) = \dfrac{x + y + |x - y|}{2}\]

You can check for yourself that it is always true: when x > y, |x - y| is the same as x - y, etc.

What good is it for? There is often an implicit comparison in min or max. It might be interesting to replace it with a call to the branchless fabs.

Example usage

Consider the following code:

float a, b, c, d;
/* ... */
return (a > b) && (c > d);

That kind of code is often used eg. in collision checks, where a lot of tests can be done. This code does two comparisons. On some architectures, this means two branches. Not always something you want.

The test condition is equivalent to:

(a - b > 0) && (c - d > 0)

Now when are two given numbers both positive? That is if and only if the smallest is positive:

min(a - b, c - d) > 0

We may now use our trick:

(a - b) + (c - d) - |(a - b) - (c + d)| > 0

And so the code could be rewritten as such:

float a, b, c, d;
/* ... */
return (a - b) + (c - d) > fabsf((a - b) - (c - d));

We basically replaced the additional test with a call to fabsf and some additions/subtractions. It may be possible to reorganise the input data so that this second version performs better.

C++ trick: selectively restrict implicit conversions

TL;DR: given a class Foo with an implicit constructor from int, how to allow the implicit conversion in f(42); but not in g(42); where both f and g take a Foo const & argument?

Background

So I have this real class that performs numeric operations that I want use just like any other C++ numeric type. For instance, I can write the following:

float f = 15, g = 3.5;
int x = f / g;

If I decide that I need double precision, I can write:

double f = 15, g = 3.5;
int x = f / g;

And of course, using my real class for even higher precision:

real f = 15, g = 3.5;
int x = f / g;

I like that. I can just write code as usual, and when I need higher precision, I use real instead of double. It's transparent and convenient.

Implementation example

Here is a highly simplified example of a real class:

struct real
{
    inline real(double d) : m_value(d) {}
    inline operator int() const { return (int)m_value; }
    /* ... */
    long double m_value;
};

It is possible to write real f = 15 because of the implicit constructor. Actually, C++ constructors are always implicit unless specified otherwise.

It is possible to write int x = f / g because of the conversion operator.

So far, so good.

The problem with implicit promotion

Here is how fabs could be implemented:

real fabs(real const &r)
{
    return real(r.m_value < 0 ? -r.m_value : r.m_value);
}

But now we have a problem. A subtle problem. Consider the following code:

double x = fabs(-5.0);

What does this do? Well, it depends. It depends whether <cmath> was included or not. Because if <cmath> wasn’t included, then that code is going to automatically promote -5.0 to a real and call our custom function instead of the one provided by the math library! With no compile-time warning!

This is confusing. It should not happen. But it is a well known problem and there are several obvious workarounds:

  1. What most professional C++ programmers will tell you: use namespaces
  2. Mark the real(int) constructor explicit

The problem with 1. is that I am not a professional C++ programmer. I am a C programmer who uses C++. I use preprocessor macros and printf and memalign and goto. Try and stop me!

The problem with 2. is that I can no longer write real f = 15, I would need real f(15) or real f = real(15) instead. This is not acceptable, I want real to behave exactly like float and others, to the fullest extent of what the language allows.

Another solution

Fortunately, the C++ standard has a solution for us: “Implicit conversions will be performed [...] if the parameter type contains no template-parameters that participate in template argument deduction” (ISO/IEC 14882:1998, section 14.8.1.4). You cannot have implicit conversion and template argument deduction at the same time.

It means we just have to make fabs a template function! Which means making real a template class, too.

A quick way to fix real would be:

/* N is unused */
template<int N> struct real_base
{
    inline real_base(double d) : m_value(d) {}
    inline operator int() const { return (int)m_value; }
    /* ... */
    long double m_value;
};
typedef real_base<0> real;

The template argument is useless, unfortunately. It will just have to be here, forever. But who knows, you might find a use for it one day.

And to fix fabs:

/* A generic template declaration is needed */
template<int N> real_base<N> fabs(real_base<N> const &r);
/* Here we just add template<> to the previous version */
template<>
real fabs(real const &r)
{
    return real(r.m_value < 0 ? -r.m_value : r.m_value);
}

So, what happens with double x = fabs(-5.0); when we forget to include <cmath> now? Well, here is what GCC says:

In function ‘int main()’:
error: no matching function for call to ‘fabs(double)’
note: candidate is:
note: template<int N> real_base<N> fabs(const real_base<N>&)

It seems we’ve successfully managed to avoid the problematic implicit conversion, yet still allow it in places where it was useful!

So what is the rule? It’s simple: where implicit conversion should not be allowed, make the function a specialised template function.

C/C++ trick: static string hash generation

I am always interested in having the compiler do more things for me, without giving away code clarity or performance for the convenience. Today a colleague linked me to Pope Kim's Compile-Time Hash String Generation article which is a perfect example of the things I like: hidden syntactic sugar that does useful things.

Inline hash function

The goal: for a given hash function, write something like HASH_STRING("funny_bone") in the code, and have the compiler directly replace it with the result, 0xf1c6fd7f.

The solution: inline the function and hope that the compiler will be clever enough.

#include <string.h>
#define HASH(str) generateHash(str, strlen(str))

inline unsigned int generateHash(const char *string, size_t len)
{
    unsigned int hash = 0;
    for(size_t i = 0; i < len; ++i)
        hash = 65599 * hash + string[i];
    return hash ^ (hash >> 16);
}

Unfortunately Pope ran into several very problematic issues:

  • requires heavy optimisation flags (/O2 with Visual Studio, -O3 with g++)
  • limited to 10-character strings with Visual Studio
  • limited to 17-character strings with g++

I could personally reproduce the g++ limitations. I believe they are more related to loop unrolling limits than to the actual string size, but they indeed make the technique unusable in practice.

Macro-based hash function

If you read my previous article about C/C++ preprocessor LUT generation, you may remember that it used preprocessor tricks to do loop unrolling.

Hence the following implementation:

#include <string.h>
#include <stdint.h>
#include <stdio.h>

#define H1(s,i,x)   (x*65599u+(uint8_t)s[(i)<strlen(s)?strlen(s)-1-(i):strlen(s)])
#define H4(s,i,x)   H1(s,i,H1(s,i+1,H1(s,i+2,H1(s,i+3,x))))
#define H16(s,i,x)  H4(s,i,H4(s,i+4,H4(s,i+8,H4(s,i+12,x))))
#define H64(s,i,x)  H16(s,i,H16(s,i+16,H16(s,i+32,H16(s,i+48,x))))
#define H256(s,i,x) H64(s,i,H64(s,i+64,H64(s,i+128,H64(s,i+192,x))))

#define HASH(s)    ((uint32_t)(H256(s,0,0)^(H256(s,0,0)>>16)))

It has the following properties:

  • works in C in addition to C++
  • strings are always optimised away by gcc or g++ (but not always the computation itself)
  • hash computation is optimised away by gcc or g++ even with -O, -O1 or -Os
  • string size limit is 256 characters (probably more than enough for most uses) and can be manually increased or decreased

The following code:

int main(void)
{
    printf("%08x\n", HASH("funny_bone"));
    printf("%08x\n", HASH("incredibly_large_string_that_gcc_groks_easily"));
}

Is (correctly) optimised to this with gcc -Os:

  ...
  movl    $-238617217, %esi
  movl    $.LC0, %edi
  xorl    %eax, %eax
  call    printf
  movl    $-453669173, %esi
  movl    $.LC0, %edi
  xorl    %eax, %eax
  call    printf
  ...

I haven't tested it with Visual Studio. Feedback from this compiler would be very appreciated!

C/C++ trick: static lookup table generation

There are two major ways of using lookup tables (LUTs) in C/C++ code:

  • build them at runtime,
  • embed them in the code.

One major advantage of runtime initialisation is the choice between static initialisation (at program startup), or lazy initialisation (on demand) to save memory. Also, the generating code can be complex, or use information that is only available at runtime.

In the case of an embedded table, the generation cost is only at compile time, which can be very useful. Also, the compiler may take advantage of its early knowledge of the table contents to optimise code. However, quite often the content of embedded tables is abstruse and hardly useful to someone viewing the code. Usually this is due to the use of an external program for generation, sometimes in a completely different language. But the generation can also often be done using the C/C++ preprocessor.

Practical example

Consider the bit interleaving routine at Sean Eron Anderson's Bit Twiddling Hacks page (which, by the way, I recommend you read and bookmark). It uses the following LUT (shortened for brevity):

static const unsigned short MortonTable256[256] =
{
  0x0000, 0x0001, 0x0004, 0x0005, 0x0010, 0x0011, 0x0014, 0x0015,
  0x0040, 0x0041, 0x0044, 0x0045, 0x0050, 0x0051, 0x0054, 0x0055,
  0x0100, 0x0101, 0x0104, 0x0105, 0x0110, 0x0111, 0x0114, 0x0115,
  ... 32 lines in total ...
  0x5540, 0x5541, 0x5544, 0x5545, 0x5550, 0x5551, 0x5554, 0x5555
};

The MortonTable256 table has, as its name suggests, 256 elements. It was pregenerated by some external piece of code which probably looked like this:

for (int i = 0; i < 256; i++)
    MortonTable256[i] = (i & 1) | ((i & 2) << 1) | ((i & 4) << 2) | ((i & 8) << 3);

The problem with that external piece of code is that it is external. You cannot write it in this form and have it fill the table at compile time.

If you only take the output of this table, the information on how the table was created is lost. It makes it impractical to build another table that has, for instance, all values shifted one bit left. Even if such a table was created using a modified version of the above code, switching between the two tables would be a hassle unless both versions were kept between preprocessor tests.

Preprocessor iterator

Here is one way to get the best of both worlds. First, declare the following iterator macros. They can be declared somewhere in a global .h, maybe with more descriptive names:

#define S4(i)    S1((i)),   S1((i)+1),     S1((i)+2),     S1((i)+3)
#define S16(i)   S4((i)),   S4((i)+4),     S4((i)+8),     S4((i)+12)
#define S64(i)   S16((i)),  S16((i)+16),   S16((i)+32),   S16((i)+48)
#define S256(i)  S64((i)),  S64((i)+64),   S64((i)+128),  S64((i)+192)
#define S1024(i) S256((i)), S256((i)+256), S256((i)+512), S256((i)+768)

Their purpose is simple: calling eg. S16(i) will expand to S1(i), S1(i+1), …, S1(i+15). Similarly, S256(i) will call S1 with values from i to i + 255 times.

And this is how to use them in our example:

static const unsigned short MortonTable256[256] =
{
#define S1(i) ((i & 1) | ((i & 2) << 1) | ((i & 4) << 2) | ((i & 8) << 3))
    S256(0)
#undef S1
};

That's it! The table will be built at compile time, and you get to keep the logic behind it.

A more complex example

Jeroen van der Zijp's fast half float conversions paper describes table-based methods to convert between 16-bit and 32-bit floating point values. The construction of one of the LUTs is as follows:

void generatetables(){
  for(unsigned int i=0; i<256; ++i){
    int e=i-127;
    if(e<-24){                  // Very small numbers map to zero
      basetable[i|0x000]=0x0000;
      basetable[i|0x100]=0x8000;
    } else if(e<-14){             // Small numbers map to denorms
      basetable[i|0x000]=(0x0400>>(-e-14));
      basetable[i|0x100]=(0x0400>>(-e-14)) | 0x8000;
    } else if(e<=15){             // Normal numbers just lose precision
      basetable[i|0x000]=((e+15)<<10);
      basetable[i|0x100]=((e+15)<<10) | 0x8000;
    } else if(e<128){             // Large numbers map to Infinity
      basetable[i|0x000]=0x7C00;
      basetable[i|0x100]=0xFC00;
    } else{                       // Infinity and NaN's stay Infinity and NaN's
      basetable[i|0x000]=0x7C00;
      basetable[i|0x100]=0xFC00;
    }
  }
}

And this is the compile-time version :

static uint16_t const basetable[512] =
{
#define S1(i) (((i) < 103) ? 0x0000 : \
               ((i) < 113) ? 0x0400 >> (0x1f & (113 - (i))) : \
               ((i) < 143) ? ((i) - 112) << 10 : 0x7c00)
    S256(0),
#undef S1
#define S1(i) (0x8000 | basetable[i])
    S256(0),
#undef S1
};

In this case the macro code is slightly bigger and was slightly rewritten, but is no more complicated than the original code. Note also the elegant reuse of previous values in the second half of the table.

This trick is certainly not new, but since I have found practical uses for it, I thought you may find it useful, too.

Understanding fast float/integer conversions

If you are interested in micro-optimisation and have never studied IEEE-754 float representation, I suggest you have a look at the beast. It may give you ideas for interesting bitlevel operations. This article will cover the specific topic of conversions between integers and floats.

Note: unless you are coding for antique or very specific architectures such as the PDP-11, you may assume that the floating point storage endianness and the integer endianness match. The code presented here will therefore work flawlessly on modern CPU architectures such as x86, amd64, PowerPC or even ARM.

Introduction

Here are a few floating point values and their bitlevel representation. Notice how the different values affect the sign, exponent and mantissa fields:

sign exponent mantissa
1.0f 0 01111111 0000000 00000000 00000000
-1.0f 1 01111111 0000000 00000000 00000000
0.5f 0 01111110 0000000 00000000 00000000
0.25f 0 01111101 0000000 00000000 00000000
1.0f + 0.5f 0 01111111 1000000 00000000 00000000
1.0f + 0.25f 0 01111111 0100000 00000000 00000000

The core idea behind this article is the manipulation of the last field, the mantissa.

Byte to float conversion

A classical byte (0 - 255) to float (0.0f - 1.0f) conversion function is shown here:

float u8tofloat(uint8_t x)
{
    return (float)x * (1.0f / 255.0f);
}

This looks very simple: one conversion (fild on x86) and one multiplication (fmul on x86). However, the fild instruction has a latency such that the conversion may have a severe impact on performance.

But let’s look at these interesting floating point values:

sign exponent mantissa
32768.0f 0 10001110 0000000 00000000 00000000
32768.0f + 1.0f/256.0f 0 10001110 0000000 00000000 00000001
32768.0f + 2.0f/256.0f 0 10001110 0000000 00000000 00000010
32768.0f + 255.0f/256.0f 0 10001110 0000000 00000000 11111111

Notice the last eight bits? They look almost exactly like the input byte expected by u8tofloat. Taking advantage of the binary representation allows us to write the following conversion function:

float u8tofloat_trick(uint8_t x)
{
    union { float f; uint32_t i; } u; u.f = 32768.0f; u.i |= x;
    return u.f - 32768.0f;
}

When used in a CPU-intensive loop, this method can be up to twice as fast as the previous implementation, for instance on the amd64 architecture. On the x86 architecture, the difference is far less noticeable.

You probably noticed that the output range is 0.0f - 255.0f/256.0f instead of 0.0f - 1.0f. This may be preferred in some cases when the value is supposed to wrap around. However, colour coordinates will require exact 0.0f - 1.0f bounds. This is easily fixed with an additional multiplication:

float u8tofloat_trick2(uint8_t x)
{
    union { float f; uint32_t i; } u; u.f = 32768.0f; u.i |= x;
    return (u.f - 32768.0f) * (256.0f / 255.0f);
}

This can still be up to twice as fast than the original integer to float cast.

Short to float conversion

The usual way to convert a 16-bit integer to a float will be:

float u16tofloat(uint16_t x)
{
    return (float)x * (1.0f / 65535.0f);
}

Again, careful observation of the following floats will be useful:

sign exponent mantissa
16777216.0f 0 10010111 0000000 00000000 00000000
16777216.0f + 1.0f/65536.0f 0 10010111 0000000 00000000 00000001
16777216.0f + 2.0f/65536.0f 0 10010111 0000000 00000000 00000010
16777216.0f + 65535.0f/65536.0f 0 10010111 0000000 11111111 11111111

And the resulting conversion method:

float u16tofloat_trick(uint16_t x)
{
    union { float f; uint32_t i; } u; u.f = 16777216.0f; u.i |= x;
    return u.f - 16777216.0f; // optionally: (u.f - 16777216.0f) * (65536.0f / 65535.0f)
}

However, due to the size of the input data, the performance gain here can be much less visible. Be sure to properly benchmark.

Int to float conversion

The above techniques cannot be directly applied to 32-bit integers because floats only have a 23-bit mantissa. Several methods are possible:

  • Use the double type instead of float. They have a 52-bit mantissa.
  • Reduce the input int precision to 23 bits.

Float to int conversion

Finally, the exact same technique can be used for the inverse conversion. This is the naive implementation:

static inline uint8_t u8fromfloat(float x)
{
    return (int)(x * 255.0f);
}

Clamping is left as an exercise to the reader. Also note that a value such as 255.99999f will ensure better distribution and avoid singling out the 1.0f value.

And our now familiar bitlevel trick:

static inline uint8_t u8fromfloat_trick(float x)
{
    union { float f; uint32_t i; } u;
    u.f = 32768.0f + x * (255.0f / 256.0f);
    return (uint8_t)u.i;
}

Unfortunately, this is usually a performance hit on amd64. However, on x86, it is up to three time as fast as the original. Choose wisely!

The LUT strategy

Some will point out that using a lookup table is much faster.

float lut[256];
void fill_lut()
{
    for (int n = 0; n < 256; n++) lut[n] = (float)n / 255.0f;
}
float u8tofloat_lut(uint8_t x)
{
    return lut[x];
}

This is indeed faster in many cases and should not be overlooked. However, the following should be taken into account:

  • LUTs are fast, but if unlucky, the cache may get in your way and cause performance issues
  • the LUT approach is actually almost always slower with 16-bit input, because the size of the table starts messing with the cache
  • do not underestimate the time needed to fill the LUT, especially if different conversions need to be performed, requiring several LUTs
  • LUTs do not mix well with SIMD instructions
  • obviously, this method doesn’t work with float to int conversions

Last warnings

Many programmers will be tempted to write shorter code such as:

float u8tofloat_INVALID(uint8_t x)
{
    // NO! DON’T DO THIS! EVER!
    float f = 32768.0f; *(uint32_t *)&f |= x;
    return f - 32768.0f;
}

Do not do this, ever! I guarantee that this will break in very nasty and unexpected places. Strict C and C++ aliasing rules make it illegal to have a pointer to a float also be a pointer to an integer. The only legal way to do this is to use a union (actually, this is still not legal by the C++ standard but most real-life compilers allow this type punning through documented extensions).

Finally, one last, obvious tip: always measure the effects of an optimisation before deciding to use it!

Build and run Android NDK applications without Eclipse

If you already have a development environment and do not wish to use Eclipse, you can easily build and run your NDK application from makefiles or the command line.

First of all, you need to set the ANDROID_NDK_ROOT environment variable and ensure the SDK and NDK binary directories are in PATH. Here are my definitions:

ANDROID_NDK_ROOT=/home/sam/android/android-ndk-r8
PATH="$PATH:$ANDROID_NDK_ROOT"
PATH="$PATH:/home/sam/android/android-sdk-linux_x86/platform-tools"
PATH="$PATH:/home/sam/android/android-sdk-linux_x86/tools"

This is best defined in one of your shell’s startup scripts such as .zshenv.

Build and install package

Now, whenever you are in an NDK project’s directory, build the project using:

ndk-build && ant release

And to upload it to the emulator or to a connected device:

ant release install

That’s all! Those two simple commands can easily be launched from your preferred development environment.

Update: ant compile no longer exists in recent SDKs; replaced with ant release.

Run package

You can use adb to run any application remotely. For instance:

adb shell am start -a android.intent.action.MAIN -n $PACKAGENAME/.$ACTIVITYNAME

Both package name and activity name can be found in your AndroidManifest.xml.