AArch64 NEON has the URSQRTE instruction, which gets closer to the OP's question than you might think; view a 32-bit value as a fixed-point number with 32 fractional bits (so the representable range is evenly spaced from 0 through 1-ε, where ε = 2^-32), then URSQRTE computes the approximate inverse square root, halves it, and clamps it to the range 0 through 1-ε. Fixed-point numbers aren't quite integers, and approximate inverse square root isn't quite square root, but it might get you somewhere close.
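For concreteness, here is a scalar model of that description in C. This is just my reading of the comment above (value = x/2^32, halve, clamp), not the Armv8 pseudocode, and it computes the result exactly rather than as the coarse estimate the real instruction returns:

```c
#include <math.h>
#include <stdint.h>

/* Scalar sketch of the semantics described above: treat the 32-bit input as
   a fixed-point value x/2^32 in [0, 1), compute 1/(2*sqrt(x)), and clamp the
   result back into [0, 1 - 2^-32]. The real URSQRTE returns only an estimate. */
static uint32_t ursqrte_model(uint32_t in) {
    double x = (double)in / 4294967296.0;     /* reinterpret as 0.32 fixed point */
    if (x == 0.0) return UINT32_MAX;          /* 1/sqrt(0) overflows: clamp high */
    double r = 0.5 / sqrt(x);                 /* halved reciprocal square root   */
    if (r >= 1.0) return UINT32_MAX;          /* clamp to 1 - 2^-32              */
    return (uint32_t)(r * 4294967296.0);      /* back to 0.32 fixed point        */
}
```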
The related FRSQRTE instruction is much more conventional, operating on 32-bit floats, again giving approximate inverse square root.
NEON is SIMD, so I would presume these instructions let you vectorize those calculations and run them in parallel over a lot of data, more efficiently than breaking them down into simpler operations and doing them one by one.
Yes, but the part that got me was the halving of the result followed by the clamping. SIMD generally makes sense, but for something this specific to exist, there's usually a particular workload (a certain video codec, for example) that greatly benefits from such a complex instruction.
It's probably not about saving instructions for performance, but about making the range of the result more useful and avoiding overflow. In other words, the instruction might be useless if it didn't do these things.
The halving and clamping is nothing particularly remarkable in the context of usefully using fixed point numbers (scaled integers) to avoid overflow. Reciprocal square root itself is a fundamental operation for DSP algorithms and of course computer graphics.
This is a fairly generic instruction really, though FRSQRTE likely gets more real world use.
Is it possible in a single clock cycle? Yes, with a very large lookup table. The table can probably be shrunk, depending on how many serial logic gates can be evaluated within the clock cycle. Consider that, in binary, the square root of 10000 looks much like the square root of 100, just with a different number of zeros.
Floating-point reciprocal square root estimate (`frsqrte`) instructions are typically implemented as just such a table lookup, indexed by a few bits of the fraction and the LSB of the exponent.
The precision is typically limited to roughly bf16 (ARM, RISC-V) or fp16 (x86) levels, so programs are expected to do a few Newton-Raphson iterations afterwards if they want more accuracy.
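The refinement step itself is short; a sketch in C, assuming the initial estimate y comes from an frsqrte-style instruction or a table (here it is simply passed in):

```c
/* One or two Newton-Raphson steps on a coarse estimate y ~ 1/sqrt(x).
   Each step roughly doubles the number of correct bits. */
static float rsqrt_refine(float x, float y) {
    y = y * (1.5f - 0.5f * x * y * y);   /* first step                  */
    y = y * (1.5f - 0.5f * x * y * y);   /* second step, near fp32 here */
    return y;
}
```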
You can compute the integer square root in n/2 iterations, where n is the number of bits in the source, using just shifts and adds. At each step, to check whether the next bit b should be set in the partial result n_old (whose square n2_old you also track), compute the candidate square

n2_new = n2_old + (n_old << (b+1)) + (1 << 2b)

Then compare the source operand with n2_new, and if it's greater or equal: 1) set bit b in the result, 2) replace n2_old with n2_new.
It can be done in n/2 or perhaps n clock cycles with a suitable microcode instruction set and ALU. With some effort it can be optimized to reduce n to the index of the leftmost set bit in the operand.
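Written out in C, the shift-and-add loop might look like the following (one common way to do the bookkeeping from the description above; not anyone's exact code):

```c
#include <stdint.h>

/* Bit-by-bit integer square root: only shifts, adds and compares.
   root is n_old from the description above; root_sq is n2_old. */
static uint32_t isqrt32(uint32_t x) {
    uint32_t root = 0, root_sq = 0;
    for (int b = 15; b >= 0; b--) {
        /* candidate square if bit b is set:
           (root + 2^b)^2 = root^2 + root*2^(b+1) + 2^(2b) */
        uint32_t cand_sq = root_sq + (root << (b + 1)) + (1u << (2 * b));
        if (x >= cand_sq) {        /* source >= candidate: keep the bit */
            root |= 1u << b;
            root_sq = cand_sq;
        }
    }
    return root;                   /* floor(sqrt(x)), in 16 iterations */
}
```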
Compare the integer square root algorithm used in "Spacewar!" [1]. So even by 1960 it should have been possible to implement a square-root step instruction for each bit, much like division or multiplication shifts, and to progress from this to a full-fledged automatic operation by use of a sub-timing network. (I guess it really depends on the economics of the individual use case whether the effort pays off, since you would amass a few additional hardware modules to accomplish this.)
There definitely is a trade-off between memory size and how quickly it can be accessed.
IIRC IBM z/Arch processors (AFAIK internally similar to POWER) have their clock limited to around 5 GHz or so, so that an L1 cache lookup costs only one cycle (a design requirement).
For example, z14 has 5.2 GHz clock rate and 2x128 kB data and instruction L1 caches.
Yes. In theory, any pure function can be turned into a lookup table. And any lookup table that isn't just random numbers can be turned into a more compact algorithm that spends compute to save space.
Such tables may be infeasible, though. While an int8 -> int8 table only needs 256 bytes, an int32 -> int32 table needs 16 gigabytes.
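The feasible end of that trade-off is tiny; a sketch in C of the full-domain table for 8-bit inputs (names are mine, for illustration):

```c
#include <math.h>
#include <stdint.h>

/* Full uint8 -> uint8 integer square root table: 256 bytes total.
   The same approach over a 32-bit domain needs 2^32 entries (the 16 GB above). */
static uint8_t isqrt8_table[256];

static void build_isqrt8_table(void) {
    for (int i = 0; i < 256; i++)
        isqrt8_table[i] = (uint8_t)floor(sqrt((double)i));
}

/* After building, the "square root instruction" is a single load. */
static uint8_t isqrt8(uint8_t x) { return isqrt8_table[x]; }
```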
It isn't, because eventually your logic or table grows physically larger than the distance a signal can propagate in one clock tick. Before that, it presents practical issues (e.g., is it worth dedicating that much silicon?).
Yes, this solves the stated issue about huge lookup tables.
> A planet size CPU that runs at .5 hz but can work on impossibly large numbers.
This doesn't make much sense to me, though.
If your goal is "any algorithm", you'll often have to go a lot slower than .5Hz. A hash-calculating circuit that's built out of a single mountain of silicon could have a critical path that's light-years long.
But if your goal is just "work on impossibly large numbers", but it's okay to take multiple ticks, then there's no reason to drag the frequency down that low. You can run a planet-scale CPU at 1GHz. CPUs have no need for signals to go all the way across inside a single tick.
Related to the fallacy of comparing what's slow in our world with what's computable in a simulation of it: there's no requirement for time to tick similarly.
It's not as bad for integer square root; for inputs up to N you only need to store about sqrt(N) entries in a greater/less-than lookup table: the square k^2 of every possible answer k. Feasible for 16-bit integers, maybe for 32-bit, not for 64-bit.
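A sketch of the 16-bit case in C (256 entries holding the squares of every possible answer; the lookup is a search for the last square that doesn't exceed the input; names are mine):

```c
#include <stdint.h>

/* "Store only the answers" table: squares of 0..255 for 16-bit inputs. */
static uint32_t squares[256];

static void init_squares(void) {
    for (uint32_t k = 0; k < 256; k++)
        squares[k] = k * k;
}

/* Largest k with k^2 <= n, found by binary search over the table. */
static uint8_t isqrt16_table(uint16_t n) {
    int lo = 0, hi = 255;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (squares[mid] <= n) lo = mid;
        else                   hi = mid - 1;
    }
    return (uint8_t)lo;
}
```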
If you wanted to expand the definition of “processor” to electromechanical contraptions, the Friden SRQ could perform square roots using just additions and shifts, with not a single electronic component other than a motor. And since you had to position the decimal points manually, it would _technically_ be an integer operation…
Can't you use the sequence 1 + 3 + 5 + ... + (2k - 1) to get the integer square root of any integer? Its partial sums are the perfect squares k^2, so the answer is basically the k of the nearest partial sum at or below your number.
Can you explain your idea? Your algorithm is correct by definition, but doing it naively would be very slow (even for a 32-bit number). At that point it would be much faster to just binsearch it.
For an example of a binsearch algo: I recently dipped into this while switching some code from floats to fixed-point arithmetic (reducing overall wasm blob size).
Better might be to use the expansion (x+y)^2=x^2+2xy+y^2 along with the observation that in any base, the square root of a 2n-digit number is at most n digits, as in the common method for calculating a square root "by hand" with pen and paper. If you did this 8 bits at a time then you only need a lookup table for roots of 8bit numbers.
It's about sqrt(n) - 1 additions for the n you're trying to get the integer square root of. Memoization would make it constant time for any n smaller than the greatest n you've done this for. For a greater n it's roughly sqrt(new_big_n) - sqrt(prev_big_n) more additions to extend the memoized table.
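Written out (a sketch, with names of my choosing), the idea is just a running sum of odd numbers, so it's only sensible for small n or as a curiosity:

```c
#include <stdint.h>

/* Since 1 + 3 + 5 + ... + (2k-1) = k^2, keep adding odd numbers until the
   running sum would exceed n; the number of additions is floor(sqrt(n)).
   O(sqrt(n)) additions. */
static uint32_t isqrt_odd_sum(uint32_t n) {
    uint32_t k = 0;
    uint64_t sum = 0, odd = 1;     /* 64-bit to avoid overflow near UINT32_MAX */
    while (sum + odd <= n) {
        sum += odd;                /* sum is now (k+1)^2 */
        odd += 2;                  /* next odd number    */
        k++;
    }
    return k;                      /* largest k with k^2 <= n */
}
```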
You're right this isn't practical, but fun to think about. Good refresher for those out of school for a while.
Yes. On a desert island we can have the whole village construct this table for Newton-Raphson guesses.
Combined with a cutting tool attached to a worm drive we will precisely count our turns (big radius crank for extra precision!) and begin manufacture of slide rules. Can never have too many scales and this is just one we shall etch into them!
This bit in an answer further down made me chuckle:
> My implementation of square root using binary search, that doesn't depend on a multiplier. Only basic ALU instructions are used. It is vigorously undocumented. I have no idea what I wrote but it seems to work.
A fine reminder that if we write clever code, we're probably not going to remember how it works.
So many people assume that everything that came before their school days was primitive and barely chugged along :)
A little reading shows the opposite. Most of our smart ideas were already used in 1940s/50s/60s computers, and are recycled on our fab new chips!! Pipelining, out of order exec, multiple cores, etc.
That old-time hardware might have been a bit "chunky" but the architectures used some very smart techniques.
Another example: the virtualization that pretty much enabled the whole "cloud" thing came from mainframe architecture in the '60s. Intel and others later brought it to consumer-grade CPUs.
You can get a really rough approximation if you replace log2(x) with 'count leading zeroes'. With a better approximation of log2(x), you can get closer to the answer.
For an approximate (very rough) answer, as opposed to one accurate to the nearest integer, a right shift by half the bit position of the leading 1 will do, and of course nearly every processor has a shift instruction. I'm not sure how often processors have lacked something like an FLO (Find Leading One) or FFS (Find First Set) instruction; those seem ubiquitous as well.
The super rough approximation for some uses can be approximately as good as an accurate answer. When you just need a decent starting place for some further Newton-Raphson iteration, for example. (Of course the right-shift trick is a nice way to seed a more accurate square root calculation. :P)
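A sketch of that shift-then-refine idea in C, using GCC/Clang's __builtin_clz to find the leading 1 (other compilers have equivalents; the function name is mine):

```c
#include <stdint.h>

/* Seed with n >> (position_of_leading_1 / 2), which lands within roughly a
   factor of two of sqrt(n), then polish with a few Newton-Raphson steps. */
static uint32_t isqrt_clz_newton(uint32_t n) {
    if (n == 0) return 0;
    int top = 31 - __builtin_clz(n);   /* bit position of the leading 1 (~log2 n) */
    uint32_t y = n >> (top / 2);       /* rough square root                       */
    for (int i = 0; i < 4; i++)        /* Newton-Raphson: y <- (y + n/y)/2        */
        y = (y + n / y) / 2;
    while ((uint64_t)y * y > n)        /* integer Newton can land one above       */
        y--;                           /* floor(sqrt(n)); fix up                  */
    return y;
}
```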
I wondered about the name of that function: surely it isn't the inverse of the square root? That would be squaring. It's "1 over square root" or something. So, down the rabbit hole to see if it was always called that. Yup, in Wikipedia articles at least, but the first paper seems to be by Jim Blinn in 1997, without the term in the title. So let's read the paper... Frustratingly, although I am an IEEE member, and I did subscribe to Computer Graphics and Applications in 1997, it won't let me read the paper without paying again. So I'm curious to hear from folks knowledgeable in the space whether this was a mis-naming that stuck or I'm confused about the meaning of "inverse". In my universe we used "inverse" in the context of functions and their inverses, and "invert" in the context of binary values (the 1s complement of x is also called x inverted), never to describe the reciprocal.
This seems to have been common usage. I never really thought about it as it was just so normal to refer to reciprocal as "inverse" in this context.
> In my universe we used inverse in the context of functions and their inverses
Yes, but the other type of inverse that is so fundamental to CS in general, and especially to geometry, is the matrix inverse, which is again a multiplicative inverse, so it's not too surprising that this usage became assumed in many contexts.
I’ve heard inverse used to mean reciprocal often enough. And it’s technically accurate - a reciprocal is a multiplicative inverse. The problem is mainly that “inverse” is ambiguous, especially in this particular case (an inverse square root is a square!), whereas “reciprocal” is clear and unambiguous. Online you can find lots of uses of inverse, and questions about inverse vs reciprocal. So yes reciprocal is the better, more clear term to use here, but “inverse” to mean reciprocal does happen.
Not really; that's a very clever trick used on floating point numbers, and it computes the approximate reciprocal square root.
This right-shift thing is far simpler, not very clever, doesn't involve magic numbers, and is much better known than the "Quake trick". There are easy ways to see it. One is that multiplying a number by itself approximately doubles the number of digits, so halving the number of digits is approximately the square root. You can get more technical and precise by noting that FFS(n) =~ log2(n), and if you remember logs, exp(log(n)/2) = n^(1/2), so shifting right by FFS(n)/2 is, mathematically, approximately a square root.
They are more closely related than you suggest. Both methods use an approximation of log2 to get an initial guess. One gets it from FFS(n); the other gets it from the floating point representation of n, where you can roughly find log2(n) by looking at the exponent field. You can also use the mantissa to refine the guess further.
They are related a little, around the log2 (exponent), I totally agree. I guess I figured the magic number subtraction that turns it into n^(-1/2) is so wild that it puts the Quake trick in a bit of a different class. This right shift thing is a lot older and there’s probably a lineage of ideas from the right shift on integers to the Quake trick. I discovered another fun one on my own that is probably already well known in bit-math circles, but simply inverting the bits of a floating point exponent is a very rough approximation to the reciprocal, good for a Newton initial guess.
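For what it's worth, here is that last trick written out literally in C (the function name is mine; it just flips the eight exponent bits of a normal, nonzero IEEE-754 float, and per the comment it only gets you within a few powers of two of 1/x):

```c
#include <stdint.h>
#include <string.h>

/* Crude reciprocal approximation: invert only the exponent field of a float.
   Not a production technique; just the bit trick described above. */
static float rough_recip(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);     /* type-pun without aliasing issues */
    bits ^= 0x7F800000u;                /* flip the 8 exponent bits         */
    memcpy(&x, &bits, sizeof bits);
    return x;
}
```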