
Fast Square Root Approximation

Posted on Dec 31, 2022

Almost a year ago I came across an old article named Understanding Quake's Fast Inverse Square Root. It blew my mind: a fascinating approach to cutting execution time by a good margin.

# $ ./a.out
number: 1337

+ result of math.h sqrt 36.565
* math.h took 15 ticks(1.5e-05s)

+ result of fast sqrt 36.5668
* fast sqrt took 1 ticks(1e-06s)

Note that this is an approximation, which was good enough for Quake. In this post, we'll re-create this algorithm from scratch.

The original code looks like the following.

float Q_rsqrt( float number )
{
	long i;
	float x2, y;
	const float threehalfs = 1.5F;

	x2 = number * 0.5F;
	y  = number;
	i  = * ( long * ) &y;						// evil floating point bit level hacking
	i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
	y  = * ( float * ) &i;
	y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//	y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

#ifndef Q3_VM
#ifdef __linux__
	assert( !isnan(y) ); // bk010122 - FPE?
#endif
#endif
	return y;
}

Not that obvious, huh?

IEEE 754 Standard


"The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the diverse floating-point implementations that made them difficult to use reliably and portably" -Wikipedia

Floating point numbers follow the IEEE 754 standard for efficient memory utilization. Consider the number 43.3: how do you think a computer would store it in 32 bits of memory?

How about splitting the number into 43 and 0.3, 16 bits each? Well, that would only allow numbers up to 2**16 = 65536, which is a waste of memory. But there's a more efficient way of storing a larger range of numbers with the IEEE 754 representation.

Consider the same number 43.3.

[cling]$ float n = 43.3
(float) 43.3000f
[cling]$ *(long *)&n
(long) 1110258483
[cling]$ #include <bitset>
[cling]$ std::cout << std::bitset<32> (*(long *)&n);
01000010001011010011001100110011

43.3 is represented as 1110258483 in memory, but why? The short answer: to be more efficient with space.

The idea is to represent a floating point number as a floating point arithmetic expression.

n = s \times b^{e}

Here s is the Significand, b is the Base, and e is the Exponent.

An example:

12.345 = 12345 \times 10^{-3}

So let's try this for the number 43.3 with the base as 2.

>>> bin(43)
'0b101011'
>>> n = 0.3
>>> ''.join(map(str, [int((n := (n - (1 * (n > 1))) * 2) > 1) for i in range(32)]))
'01001100110011001100110011001100'
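If the Python one-liner above feels dense, here's the same multiply-by-two digit extraction written out as a C++ sketch (purely illustrative, not part of the final algorithm):

#include <iostream>
#include <string>

int main() {
  // Pull out the binary digits of the fractional part 0.3,
  // one bit per doubling
  double frac = 0.3;
  std::string bits;
  for (int i = 0; i < 32; ++i) {
    frac *= 2;
    if (frac >= 1) {  // a 1 crossed the binary point
      bits += '1';
      frac -= 1;
    } else {
      bits += '0';
    }
  }
  std::cout << bits << '\n';  // 01001100110011001100110011001100
}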

So the number in binary looks like the following.

#   43.3
101011.01001100110011001100110011001100

Now let's convert this to a floating point arithmetic expression by moving the radix point to the left.

1.0101101001100110011001100110011001100 \times 2^{5}

Where s = 1.0101101001100110011001100110011001100, b = 2, and e = 5.

If we want to store this representation in 32 bits, we can drop the constants, like the base b. We can also drop the first digit of s: since we are dealing with base 2, the first digit of a normalized significand is always 1, so it can be omitted.

So now s - 1 becomes the Mantissa M, and E is the Exponent. In memory, the 32-bit number looks like the following.

Sign(1 bit)  Exponent(8 bits)  Mantissa(23 bits)
#            ########          #######################
  • Sign S is 1 if the number is negative.
  • Exponent E is 127 + e. We add 127 (the bias) so we can represent negative exponents. In our case, for the number 43.3, we had e = 5; adding 127 gives E = 132.
  • Mantissa M is s - 1, where s is the significand. Note that the mantissa field is 23 bits for a 32-bit floating point number, so bits beyond the 23rd are truncated.

For the number 43.3, the IEEE 754 representation looks like the following.

Sign(1 bit)  Exponent(8 bits)  Mantissa(23 bits)
0            10000100          01011010011001100110011  // Binary
0            132               2962227                  // Decimal
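We can pull these three fields out in code. Here's a minimal C++ sketch (the shifts and masks follow the layout above; the variable names are mine):

#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
  float n = 43.3f;
  std::uint32_t bits;
  std::memcpy(&bits, &n, sizeof(float));  // raw IEEE 754 bits of n

  std::uint32_t S = bits >> 31;           // 1 sign bit
  std::uint32_t E = (bits >> 23) & 0xff;  // 8 exponent bits
  std::uint32_t M = bits & 0x7fffff;      // 23 mantissa bits

  std::cout << "S=" << S << " E=" << E << " M=" << M << '\n';
  // prints: S=0 E=132 M=2962227
}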

The decimal representation of this number is,

>>> 0b01000010001011010011001100110011
1110258483

We can confirm that in C++.

[cling]$ float n = 43.3
(float) 43.3000f
[cling]$ *(long *)&n
(long) 1110258483

With S, M and E, we can get the original floating point number using the following expression.

n = (-1)^{S} \times \left(1 + \frac{M}{2^{23}}\right) \times 2^{E-127}

Solving for n,

n = (-1)^{0} \times \left(1 + \frac{2962227}{2^{23}}\right) \times 2^{132-127}
n = 43.29999923706055
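We can sanity-check this decoding in code as well. A toy C++ sketch (it ignores special cases like zero, infinities, and NaN):

#include <cmath>
#include <iomanip>
#include <iostream>

int main() {
  // Fields of 43.3, taken from the table above
  int S = 0, E = 132;
  double M = 2962227;

  // n = (-1)^S * (1 + M / 2^23) * 2^(E - 127)
  double n = std::pow(-1.0, S) * (1.0 + M / std::pow(2.0, 23)) *
             std::pow(2.0, E - 127);

  std::cout << std::setprecision(16) << n << '\n';  // 43.29999923706055
}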

Note that I won't be covering rounding or other nuances of the IEEE 754 standard, but this should give us a pretty good idea of what's going on under the hood.

We can also get the IEEE 754 representation of the number with S, M and E using,

B = 2^{31} \times S + 2^{23} \times E + M

Solving for B,

B = 2^{31} \times 0 + 2^{23} \times 132 + 2962227
B = 1110258483

Now that we have a general understanding of how floating point numbers are stored, we can start solving for square roots.

Simplifying with logarithms


Let's simplify the following expression.

y = \left(1 + \frac{M}{2^{23}}\right) \times 2^{E-127}

Note that S is omitted because I'm considering only positive numbers; we are not dealing with imaginary numbers.

Before we find $\sqrt{y}$, we can simplify the equation a bit. Since we are dealing with exponents, we can try applying a logarithm and see what happens.

\log_2(y) = \log_2\left(\left(1 + \frac{M}{2^{23}}\right) \times 2^{E-127}\right)
\log_2(y) = \log_2\left(1 + \frac{M}{2^{23}}\right) + \log_2\left(2^{E-127}\right)
\log_2(y) = \log_2\left(1 + \frac{M}{2^{23}}\right) + E - 127

We've simplified it to a degree, but there's still a $\log_2$ in the equation. As it turns out, for $0 < \frac{M}{2^{23}} < 1$, there's an approximation.

\log_2(1 + x) \approx x, \quad 0 < x < 1

Here's a function plot for $f(x) = \log_2(1 + x)$, for $x \in [0, 1)$.

Kinda looks linear, doesn't it? Let's plot one for $f(x) = x$.

It's slightly off, but we can add a bias $\sigma$ to offset $f(x) = x$ such that it best fits $\log_2(1 + x)$, meaning with the least error.

\log_2(y) = \frac{M}{2^{23}} + \sigma + E - 127
\log_2(y) = \frac{1}{2^{23}}\left(E \times 2^{23} + M\right) + \sigma - 127

$E \times 2^{23} + M$ should look familiar, because that's how we get the IEEE 754 representation. It also turns out that $\sigma = 0.0430$ gives us the minimum error on average over $[0, 1)$.
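To get a feel for what the bias buys us, here's a quick numeric check in C++ (a coarse scan, and it measures worst-case rather than average error, so treat it as a sanity check, not a derivation of $\sigma$):

#include <algorithm>
#include <cmath>
#include <iostream>

int main() {
  const double sigma = 0.0430;
  double worst_plain = 0, worst_biased = 0;

  // Scan x over [0, 1) and track the worst-case error of both fits
  for (double x = 0; x < 1.0; x += 1e-4) {
    double exact = std::log2(1.0 + x);
    worst_plain = std::max(worst_plain, std::abs(exact - x));
    worst_biased = std::max(worst_biased, std::abs(exact - (x + sigma)));
  }

  std::cout << "max error without bias: " << worst_plain << '\n'    // ~0.0861
            << "max error with bias:    " << worst_biased << '\n';  // ~0.0431
}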

Solving for Square Roots


Now let's solve for the square root; we'll call it $r$.

r = \sqrt{y}
\log_2(r) = \log_2(\sqrt{y})
\log_2(r) = \frac{1}{2}\log_2(y)
\frac{1}{2^{23}} i_r + \sigma - 127 = \frac{1}{2}\left(\frac{1}{2^{23}} i_y + \sigma - 127\right), \quad i_x = E_x \times 2^{23} + M_x, \; x \in \{r, y\}
\frac{i_r}{2^{23}} = \frac{i_y}{2^{24}} + \frac{\sigma - 127}{2} - \sigma + 127
\frac{i_r}{2^{23}} = \frac{i_y}{2^{24}} + \frac{\sigma - 127 - 2\sigma + 254}{2}
i_r = 2^{23}\left(\frac{i_y}{2^{24}} + \frac{127 - \sigma}{2}\right)
i_r = (127 - \sigma) \, 2^{22} + \frac{i_y}{2}
i_r = (127 - 0.0430) \, 2^{22} + \frac{i_y}{2}, \quad \sigma = 0.0430
i_r = 532496253 + \frac{i_y}{2}
>>> round((127 - 0.0430) * (2**22))
532496253
>>> hex(_)
'0x1fbd3f7d'

The solution can be translated to the following code.

i = 0x1fbd3f7d + (i >> 1);

Here, instead of the division i/2, we shift right by one (i >> 1) to halve the number. We do this because division can be slow. These days, though, compilers are intelligent enough to use bit shifts and multiplications to divide a number when the divisor is known at compile time.

int optimized(int n) {
    int x = n / 2, y = n / 3;
    return x + y;
}
optimized(int):
    mov     eax, edi
    shr     eax, 31
    add     eax, edi
    sar     eax
    mov     edx, eax
    movsx   rax, edi
    sar     edi, 31
    imul    rax, rax, 1431655766
    shr     rax, 32
    sub     eax, edi
    add     eax, edx
    ret

Let's test the square root solution for the number 43.3.

[cling]$ float n = 43.3
(float) 43.3000f
[cling]$ long i = *(long *)&n
(long) 1110258483
[cling]$ i = 0x1fbd3f7d + (i >> 1)
(long) 1087625494
[cling]$
[cling]$ *(float *)&i
(float) 6.62025f
>>> 6.62025 ** 2
43.8277100625

That's a pretty good approximation, but we can make it better with the Newton–Raphson method.

Error minimization with Newton–Raphson method

"In numerical analysis, Newton's method, also known as the Newton–Raphson method, named after Isaac Newton and Joseph Raphson, is a root-finding algorithm which produces successively better approximations to the roots (or zeroes) of a real-valued function." -Wikipedia

The idea is to get successively closer approximations by calculating $f(x)$ for some initial value of $x$, then using its derivative to step closer to the root.

Here's an example: consider $f(x) = (2x-1)^{3} + 0.3$, with initial $x_0 = 0.76$.

x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}
x_1 = x_0 - \frac{(2x_0 - 1)^{3} + 0.3}{24x_0^{2} - 24x_0 + 6}
x_1 = 0.76 - \frac{0.440608}{1.6224}
x_1 = 0.4884220907297829

We can now plug $x_1 = 0.4884220907297829$ back into Newton's method to get $x_2$, and so on; it's iterative.
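Here's that single step in C++ (a small sketch; the names f and fp for the function and its derivative are mine):

#include <cmath>
#include <iostream>

// f(x) = (2x - 1)^3 + 0.3 and its derivative f'(x) = 24x^2 - 24x + 6
double f(double x) { return std::pow(2 * x - 1, 3) + 0.3; }
double fp(double x) { return 24 * x * x - 24 * x + 6; }

int main() {
  double x0 = 0.76;                 // initial guess
  double x1 = x0 - f(x0) / fp(x0);  // one Newton-Raphson step
  std::cout << x1 << '\n';          // 0.488422...
}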

In our case, $f(y_a) = y_a^2 - n$, where $y_r = \sqrt{n}$ and $y_r \approx y_a$. Here $y_r$ is the exact result, and $y_a$ is our approximate value.

y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)}
y_{n+1} = y_n - \frac{y_n^2 - n}{2y_n}
y_{n+1} = \frac{2y_n^2 - y_n^2 + n}{2y_n}
y_{n+1} = 0.5 \times \frac{y_n^2 + n}{y_n}
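As code, that update rule is a one-liner. A minimal sketch (refine is my own name for the step):

#include <iostream>

// One Newton-Raphson step for y ~ sqrt(n): y' = 0.5 * (y^2 + n) / y
float refine(float y, float n) { return 0.5f * ((y * y + n) / y); }

int main() {
  // Start from the bit-hack estimate for n = 43.3
  float y = refine(6.62025f, 43.3f);
  std::cout << y << '\n';  // ~6.58039
}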

We can confirm the minimization of the error like the following.

[cling]$ #include <cmath>
[cling]$ pow(6.62025f, 2) // 6.62025f is our initial approximation
(double) 43.827710
[cling]$ 0.5 * ((pow(6.62025f, 2) + 43.3)/6.62025f)
(double) 6.5803943
[cling]$
[cling]$ pow(6.5803943f, 2)
(double) 43.301589

There we go, we have a better approximation.

Now putting it all together, our final code looks like the following.

#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
  float n = 43.3f;
  std::uint32_t i;  // 32 bits wide to match float exactly (long may be 64 bits)

  // Grab the raw IEEE 754 bits of n
  std::memcpy(&i, &n, sizeof(float));

  // Solved equation for square roots
  i = 0x1fbd3f7d + (i >> 1);

  float y;
  std::memcpy(&y, &i, sizeof(float));

  // Newton-Raphson iteration
  y = (((y * y) + n) / y) * 0.5f;

  std::cout << y << std::endl;
}

Note: Instead of the pointer magic (which is undefined behaviour), I've used memcpy to get the raw IEEE 754 representation of a floating point number.

Hope this was insightful.
