Recently I needed a method for retrieving the binary logarithm of a 10bit number (for the curious, it was for the purpose of converting between 32bit and 16bit floating point numbers).
Computing the binary logarithm is equivalent to knowing the position of the highest order set bit. For instance, log2(0x1) is 0 and log2(0x100) is 8.
One well known method for fast binary logarithm is presented at Bit Twiddling Hacks. It is a twostep method where first all lower bits are set to 1 and then a De Bruijnlike sequence is used to perform a table lookup:
int fastlog2(uint32_t v)
{
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
};
v = v >> 1;
v = v >> 2;
v = v >> 4;
v = v >> 8;
v = v >> 16;
return MultiplyDeBruijnBitPosition[(uint32_t)(v * 0x07C4ACDDU) >> 27];
}
That is 12 integer operations and a table lookup.
Optimising
It should be obvious what the sequence of operations on v does: fill the integer with ones starting from the highest order bit. Here are a few examples of what happens to v at each step:
v  v = v >> 1  v = v >> 2  v = v >> 4  v = v >> 8  v = v >> 16

0x0001  0x0001  0x0001  0x0001  0x0001  0x0001

0x0002  0x0003  0x0003  0x0003  0x0003  0x0003

0x0003  0x0003  0x0003  0x0003  0x0003  0x0003

0x0004  0x0006  0x0007  0x0007  0x0007  0x0007

0x0100  0x0180  0x01e0  0x01fe  0x01ff  0x01ff

0x80000000  0xc0000000  0xf0000000  0xff000000  0xffff0000  0xffffffff

There is one obvious optimisation available: since the input is only 10bit, the last shift operation v = v >> 16 can be omitted because the final value was already reached.
int fastlog2(uint32_t v)
{
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
};
v = v >> 1;
v = v >> 2;
v = v >> 4;
v = v >> 8;
return MultiplyDeBruijnBitPosition[(uint32_t)(v * 0x07C4ACDDU) >> 27];
}
10 instructions instead of 12. Not really amazing, but worth mentioning.
Optimising more?
Could we do better? Now the last line is v = v >> 8; and it is only useful to propagate the 9th and 10th bits to positions 1 and 2. What happens if we omit that line? Let’s see:
 For most values of v, the expected value is obtained.
 For values of v with a highest order bit at 9th position, 0x1fe could be obtained instead of 0x1ff.
 For values of v with a highest order bit at 10th position, one of 0x3fc, 0x3fd or 0x3fe could be obtained instead of 0x3ff.
The list of possible output values would therefore be 0x1, 0x3, 0x7, 0xf, 0x1f, 0x3f, 0x7f, 0xff, 0x1fe, 0x1ff, 0x3fc, 0x3fd, 0x3fe, 0x3ff. What happens to these values when multiplying them with the De Bruijn sequence? Let's see:
v  v * 0x07C4ACDDU) >> 27

0x1  0

0x3  2

0x7  6

0xf  14

0x1f  30

0x3f  29

0x7f  27

0xff  23

0x1fe  15

0x1ff  16

0x3fc  30

0x3fd  31

0x3fe  0

0x3ff  1

Damn! Two values are colliding. It looks like we cannot omit the last line after all.
Beyond De Bruijn
Let’s give another try at the problem. Usually De Bruijn sequences are built using either nontrivial algorithms, or brute force. Maybe we could find another sequence that has no collision? Or a sequence that is not a De Bruijn sequence but that works for our problem?
Well, let’s just brute force!
(2 seconds later)
int fastlog2(uint32_t v)
{
static const int MagicTable[16] =
{
0, 1, 2, 8, 1, 3, 5, 9, 9, 7, 4, 1, 6, 1, 1, 1
};
v = v >> 1;
v = v >> 2;
v = v >> 4;
return MagicTable[(uint32_t)(v * 0x5a1a1a2u) >> 28];
}
Down to 8 instructions instead of 12. And the lookup table is now half the size!
Conclusion
It is possible for multiplyandshift techniques similar to the De Bruijn sequence algorithm to exist for a larger set of problems. Brute forcing the search is a totally valid method for 32bit multiplication.
The code used for this article is included in the attached file.