## Fast Square Root Approximation

Posted on Dec 31, 2022

Almost a year ago I came across an old article called Understanding Quake's Fast Inverse Square Root. It blew my mind: a fascinating approach that cuts execution time by a good margin.

```
$ ./a.out
number: 1337
+ result of math.h sqrt 36.565
* math.h took 15 ticks(1.5e-05s)
+ result of fast sqrt 36.5668
* fast sqrt took 1 ticks(1e-06s)
```

Note that this is an approximation, which was good enough for Quake. In this post, we'll re-create this algorithm from scratch.

The original code looks like the following.

```c
float Q_rsqrt( float number )
{
	long i;
	float x2, y;
	const float threehalfs = 1.5F;

	x2 = number * 0.5F;
	y  = number;
	i  = * ( long * ) &y;                       // evil floating point bit level hacking
	i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
	y  = * ( float * ) &i;
	y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//	y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

#ifndef Q3_VM
#ifdef __linux__
	assert( !isnan(y) ); // bk010122 - FPE?
#endif
#endif
	return y;
}
```

Not that obvious, huh?

## IEEE 754 Standard

"The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the diverse floating-point implementations that made them difficult to use reliably and portably" -Wikipedia

Floating point numbers follow the IEEE 754 standard for efficient memory utilization. Consider the number `43.3`. How do you think a computer would store this number in 32 bits of memory?

How about splitting it into `43` and `0.3`, 16 bits each? Well, that would only allow numbers up to `2**16 = 65536`, which is a waste of memory. But there's a more efficient way of storing a larger range of numbers with the IEEE 754 representation.

Consider the same number `43.3`.

```
[cling]$ float n = 43.3
(float) 43.3000f
[cling]$ *(long *)&n
(long) 1110258483
[cling]$ #include <bitset>
[cling]$ std::cout << std::bitset<32>(*(long *)&n);
01000010001011010011001100110011
```

`43.3` is represented as `1110258483` in memory, but why? The short answer: to be more efficient with space.

The idea is to represent a floating point number as a floating point arithmetic expression:

$n = s \times b^{e}$

Here `s` is the **Significand**, `b` is the **Base**, and `e` is the **Exponent**.

An example in base 10: $43.3 = 4.33 \times 10^{1}$.

So let's try this for the number `43.3` with the base as `2`.

```
>>> bin(43)
'0b101011'
```

```
>>> n = 0.3
>>> ''.join(map(str, [int((n := (n - (1 * (n > 1))) * 2) > 1) for i in range(32)]))
'01001100110011001100110011001100'
```

So the number in binary looks like the following.

```
# 43.3
101011.01001100110011001100110011001100
```

Now let's convert this to a floating point arithmetic expression by moving the decimal point to the left:

$101011.010011\ldots_2 = 1.01011010011\ldots_2 \times 2^{5}$

Where `s = 1.0101101001100110011001100110011001100`, `b = 2`, and `e = 5`.

If we want to store this representation in 32 bits, we can ignore the constants, like the base `b`. But we can also drop the first digit of `s`: since we are dealing with base 2, the first digit is always `1`, hence we can omit it.

So now `s - 1` becomes the **Mantissa** `M`, and `E` is the biased **Exponent**. In memory the 32 bit number looks like the following.

```
Sign(1 bit)   Exponent(8 bits)   Mantissa(23 bits)
     #            ########       #######################
```

- **Sign** `S` is `1` if the number is negative.
- **Exponent** `E` is `127 + e`. We add `127` so we can represent negative exponents. In our case, for the number `43.3` we had `e = 5`, and adding `127` gives `E = 132`.
- **Mantissa** `M` is `s - 1`, where `s` is the significand. Note that the Mantissa gets 23 bits in a 32 bit floating point number, so digits beyond 23 bits are truncated.

For the number `43.3`, the IEEE 754 representation looks like the following.

```
Sign(1 bit)   Exponent(8 bits)   Mantissa(23 bits)
     0            10000100       01011010011001100110011   // Binary
     0               132                 2962227           // Decimal
```

The decimal representation of this number is,

```
>>> 0b01000010001011010011001100110011
1110258483
```

We can confirm that in C++.

```
[cling]$ float n = 43.3
(float) 43.3000f
[cling]$ *(long *)&n
(long) 1110258483
```

With `S`, `M` and `E`, we can get the original floating point number back using the following expression:

$n = (-1)^S \times \left(1 + \frac{M}{2^{23}}\right) \times 2^{E - 127}$

Solving for `n`:

$n = (-1)^0 \times \left(1 + \frac{2962227}{2^{23}}\right) \times 2^{132 - 127} = 1.353125 \times 2^{5} \approx 43.3$

Note I won't be covering rounding or other nuances in the IEEE 754 standard, but this should give us a pretty good idea of what's going on under the hood.
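As a quick check, here's a sketch (my own, not from the post) that rebuilds the value from the three fields using the expression above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// n = (-1)^S * (1 + M / 2^23) * 2^(E - 127)
double rebuild(uint32_t S, uint32_t E, uint32_t M) {
    double significand = 1.0 + M / 8388608.0;                    // M / 2^23
    double value = significand * std::ldexp(1.0, (int)E - 127);  // * 2^(E - 127)
    return S ? -value : value;
}
```

`rebuild(0, 132, 2962227)` lands within float precision of `43.3`.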

We can also get the IEEE 754 bit representation `b` of the number with `S`, `M` and `E` using:

$b = (S \times 2^{31}) + (E \times 2^{23}) + M$

Solving for `b`:

$b = (0 \times 2^{31}) + (132 \times 2^{23}) + 2962227 = 1110258483$

Now that we have a general understanding of how floating point numbers are stored, we can start solving for square roots.

## Simplifying with logarithms

Let's simplify the following expression:

$y = \left(1 + \frac{M}{2^{23}}\right) \times 2^{E - 127}$

Note `S` is omitted because I'm considering only positive numbers; we are not dealing with imaginary numbers.

Before we find $\sqrt{y}$, we can simplify the equation a bit. Since we are dealing with exponents, we can try applying a logarithm and see what happens:

$\log_2(y) = \log_2\left(1 + \frac{M}{2^{23}}\right) + E - 127$

We simplified it to a degree, but there's still a $\log_2$ in the equation. As it turns out, for $0 < {M \over {2^{23}}} < 1$ there's an approximation:

$\log_2(1 + x) \approx x$

Here's a function plot for $f(x) = \log_2(1 + x)$, for $x \in [0,1)$.

Kinda looks linear, doesn't it? Let's plot one for $f(x) = x$.

It's slightly off, but we can add a bias $\sigma$ to offset $f(x) = x$ so that it best fits $\log_2(1 + x)$, meaning with the least error:

$\log_2(1 + x) \approx x + \sigma$

Plugging this back into our equation:

$\log_2(y) \approx \frac{M}{2^{23}} + \sigma + E - 127 = \frac{1}{2^{23}}\left((E \times 2^{23}) + M\right) + \sigma - 127$

$(E \times 2^{23}) + M$ should look familiar, because that's exactly how we get the IEEE 754 representation. It also turns out that $\sigma = 0.0430$ gives us the minimum error on average for $x \in [0,1)$.
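We can sanity-check the choice of $\sigma$ numerically. This is my own sketch, not from the post: it scans $[0,1)$ and reports the worst absolute error of the approximation $\log_2(1+x) \approx x + \sigma$.

```cpp
#include <cassert>
#include <cmath>

// Worst-case |log2(1+x) - (x + sigma)| sampled over [0, 1)
double max_error(double sigma) {
    double worst = 0.0;
    for (double x = 0.0; x < 1.0; x += 1e-4) {
        double err = std::fabs(std::log2(1.0 + x) - (x + sigma));
        if (err > worst) worst = err;
    }
    return worst;
}
```

With $\sigma = 0.0430$ the worst error stays around 0.043, roughly half the 0.086 we get with no bias at all.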

## Solving for Square Roots

Now let's solve for the square root, which we'll call $r$. Since $r = \sqrt{y}$, taking logarithms gives $\log_2(r) = \frac{1}{2}\log_2(y)$. Writing $I_y = (E_y \times 2^{23}) + M_y$ for the bit representation of $y$ (and likewise $I_r$ for $r$), and substituting the approximation from the previous section on both sides:

$\frac{I_r}{2^{23}} + \sigma - 127 = \frac{1}{2}\left(\frac{I_y}{2^{23}} + \sigma - 127\right)$

Solving for $I_r$:

$I_r = \frac{I_y}{2} + 2^{22}\,(127 - \sigma)$

The constant term evaluates to:

```
>>> round((127 - 0.0430) * (2**22))
532496253
>>> hex(_)
'0x1fbd3f7d'
```

The solution can be translated to the following code.

```c
i = 0x1fbd3f7d + (i >> 1);
```

Here, instead of the division `i/2`, we shift right by one (`i >> 1`) to halve the number, because division can be slow. These days compilers are intelligent enough to use bit shifts and multiplications to divide a number, as long as the divisor is known at compile time.

```c
int optimized(int n) {
    int x = n / 2, y = n / 3;
    return x + y;
}
```

```
optimized(int):
        mov     eax, edi
        shr     eax, 31
        add     eax, edi
        sar     eax
        mov     edx, eax
        movsx   rax, edi
        sar     edi, 31
        imul    rax, rax, 1431655766
        shr     rax, 32
        sub     eax, edi
        add     eax, edx
        ret
```

Let's test the square root solution for the number `43.3`.

```
[cling]$ float n = 43.3
(float) 43.3000f
[cling]$ long i = *(long *)&n
(long) 1110258483
[cling]$ i = 0x1fbd3f7d + (i >> 1)
(long) 1087625494
[cling]$ *(float *)&i
(float) 6.62025f
```

```
>>> 6.62025 ** 2
43.8277100625
```
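The whole guess can be wrapped in a function. This sketch is mine (using `memcpy` rather than the pointer cast) and applies only the magic constant, with no refinement yet.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// First-cut sqrt approximation: halve the bit pattern in "log space"
// and add the magic constant derived above.
float approx_sqrt(float n) {
    uint32_t i;
    std::memcpy(&i, &n, sizeof i);
    i = 0x1fbd3f7d + (i >> 1);
    float y;
    std::memcpy(&y, &i, sizeof y);
    return y;
}
```

For `43.3f` this reproduces the `6.62025f` seen in the cling session above.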

That's a pretty good approximation, but we can make it better with Newton–Raphson method.

## Error minimization with Newton–Raphson method

"In numerical analysis, Newton's method, also known as the Newton–Raphson method, named after Isaac Newton and Joseph Raphson, is a root-finding algorithm which produces successively better approximations to the roots (or zeroes) of a real-valued function." -Wikipedia

The idea is to get closer to the approximation iteratively: start with some initial guess $x_0$, calculate $f(x_0)$, then use the derivative to step towards the root:

$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$

Here's an example: consider $f(x) = (2x-1)^{3} + 0.3$ with the initial guess $x_0 = 0.76$. Then $f(x_0) = 0.52^3 + 0.3 = 0.440608$ and $f'(x_0) = 6 \times 0.52^2 = 1.6224$, so:

$x_{1} = 0.76 - \frac{0.440608}{1.6224} \approx 0.4884220907297829$

We can now plug $x_{1}$ back into Newton's method to get $x_{2}$, and so on; it's iterative.
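The iteration above can be sketched in a few lines (my own code; the function and its derivative are hard-coded to this example).

```cpp
#include <cassert>

// One Newton-Raphson step for f(x) = (2x-1)^3 + 0.3,
// whose derivative is f'(x) = 6(2x-1)^2.
double newton_step(double x) {
    double u = 2.0 * x - 1.0;
    double f = u * u * u + 0.3;
    double fp = 6.0 * u * u;
    return x - f / fp;
}
```

Starting from $x_0 = 0.76$, one step gives $x_1 \approx 0.48842$; calling `newton_step` again keeps shrinking the error.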

In our case, $f(y_a) = y_a^2 - n$, where $y_r = \sqrt{n}$ and $y_r \approx y_a$. Here $y_r$ is the exact result and $y_a$ is our approximate value. Since $f'(y_a) = 2 y_a$, one Newton-Raphson step gives:

$y_a' = y_a - \frac{y_a^2 - n}{2 y_a} = \frac{1}{2}\left(\frac{y_a^2 + n}{y_a}\right)$

We can confirm the minimization of the error like the following.

```
[cling]$ #include <cmath>
[cling]$ pow(6.62025f, 2) // 6.62025f is our initial approximation
(double) 43.827710
[cling]$ 0.5 * ((pow(6.62025f, 2) + 43.3)/6.62025f)
(double) 6.5803943
[cling]$ pow(6.5803943f, 2)
(double) 43.301589
```

There we go, we have a better approximation.

Now putting it all together, our final code looks like the following.

```cpp
#include <iostream>
#include <cstring>

int main() {
    float n = 43.3;

    long i;
    std::memcpy(&i, &n, sizeof(float));

    // Solved equation for square roots
    i = 0x1fbd3f7d + (i >> 1);

    float y;
    std::memcpy(&y, &i, sizeof(float));

    // Newton-Raphson iteration
    y = (((y * y) + n) / y) * 0.5;

    std::cout << y << std::endl;
}
```

**Note**: Instead of the pointer magic (which produced undefined behaviour), I've used `memcpy` to get the raw IEEE 754 representation of a floating point number.
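To see how close the final version gets, here's the same pipeline as a reusable function (my own packaging of the code above), compared against `std::sqrt` over a range of inputs.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Magic-constant guess followed by one Newton-Raphson refinement.
float fast_sqrt(float n) {
    uint32_t i;
    std::memcpy(&i, &n, sizeof i);
    i = 0x1fbd3f7d + (i >> 1);
    float y;
    std::memcpy(&y, &i, sizeof y);
    return 0.5f * ((y * y + n) / y);   // Newton-Raphson step
}

// Largest relative error vs std::sqrt for n in [0.5, 10000)
double worst_rel_error() {
    double worst = 0.0;
    for (float n = 0.5f; n < 10000.0f; n *= 1.1f) {
        double exact = std::sqrt((double)n);
        double err = std::fabs(fast_sqrt(n) - exact) / exact;
        if (err > worst) worst = err;
    }
    return worst;
}
```

One refinement step keeps the relative error well under a percent across this range, which is why the second iteration in the original Quake code could be commented out.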

Hope this was insightful.
