AArch64 NEON has the URSQRTE instruction, which gets closer to the OP's question than you might think; view a 32-bit value as a fixed-point number with 32 fractional bits (so the representable range is evenly spaced from 0 through 1-ε, where ε = 2^-32), then URSQRTE computes the approximate inverse square root, halves it, and clamps it to the range 0 through 1-ε. Fixed-point numbers aren't quite integers, and approximate inverse square root isn't quite square root, but it might get you somewhere close.
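For concreteness, here is a scalar model of that description in C. This is just my reading of the comment above (value = x/2^32, halve, clamp), not the Armv8 pseudocode, and it computes the result exactly rather than as the coarse estimate the real instruction returns:

```c
#include <math.h>
#include <stdint.h>

/* Scalar sketch of the semantics described above: treat the 32-bit input as
   a fixed-point value x/2^32 in [0, 1), compute 1/(2*sqrt(x)), and clamp the
   result back into [0, 1 - 2^-32]. The real URSQRTE returns only an estimate. */
static uint32_t ursqrte_model(uint32_t in) {
    double x = (double)in / 4294967296.0;     /* reinterpret as 0.32 fixed point */
    if (x == 0.0) return UINT32_MAX;          /* 1/sqrt(0) overflows: clamp high */
    double r = 0.5 / sqrt(x);                 /* halved reciprocal square root   */
    if (r >= 1.0) return UINT32_MAX;          /* clamp to 1 - 2^-32              */
    return (uint32_t)(r * 4294967296.0);      /* back to 0.32 fixed point        */
}
```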
The related FRSQRTE instruction is much more conventional, operating on 32-bit floats, again giving approximate inverse square root.
NEON is SIMD, so I would presume these instructions let you vectorize those calculations and run them in parallel over a lot of data, more efficiently than breaking them down into simpler operations and doing them one by one.
Yes, but the part that got me was the halving of the result followed by the clamping. SIMD generally makes sense, but for something this specific to exist, there's usually a particular workload (a certain video codec, for example) that greatly benefits from such a complex instruction.
It's probably not about saving instructions for performance, but about making the range of the result more useful and avoiding overflow. In other words, the instruction might be useless if it didn't do these things.
The halving and clamping is nothing particularly remarkable in the context of usefully using fixed point numbers (scaled integers) to avoid overflow. Reciprocal square root itself is a fundamental operation for DSP algorithms and of course computer graphics.
This is a fairly generic instruction really, though FRSQRTE likely gets more real world use.
Is it possible in a single clock cycle? Yes, with a very large lookup table. The table can probably be shrunk, depending on how many serial logic gates can be evaluated within the clock cycle. Consider that, in binary, the square root of 10000 looks much like the square root of 100, just with a different number of zeros.
Floating-point reciprocal square root estimate (`frsqrte`) instructions are typically implemented as just such a table lookup, indexed by a few bits of the fraction and the LSB of the exponent.
The precision is typically limited to roughly bf16 (ARM, RISC-V) or fp16 (x86) levels, so programs are expected to do a few Newton-Raphson iterations afterwards if they want more accuracy.
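The refinement step itself is short; a sketch in C, assuming the initial estimate y comes from an frsqrte-style instruction or a table (here it is simply passed in):

```c
/* One or two Newton-Raphson steps on a coarse estimate y ~ 1/sqrt(x).
   Each step roughly doubles the number of correct bits. */
static float rsqrt_refine(float x, float y) {
    y = y * (1.5f - 0.5f * x * y * y);   /* first step                  */
    y = y * (1.5f - 0.5f * x * y * y);   /* second step, near fp32 here */
    return y;
}
```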
You can compute the integer square root in n/2 iterations, where n is the number of bits in the source, using just shifts and adds. At each step, to check whether the next bit b should be set in the partial result n_old (whose square n2_old you also track), compute the candidate square

n2_new = n2_old + (n_old << (b+1)) + (1 << 2b)

Then compare the source operand with n2_new, and if it's greater or equal: 1) set bit b in the result, 2) replace n2_old with n2_new.
It can be done in n/2 or perhaps n clock cycles with a suitable microcode instruction set and ALU. With some effort it can be optimized to reduce n to the index of the leftmost set bit in the operand.
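Written out in C, the shift-and-add loop might look like the following (one common way to do the bookkeeping from the description above; not anyone's exact code):

```c
#include <stdint.h>

/* Bit-by-bit integer square root: only shifts, adds and compares.
   root is n_old from the description above; root_sq is n2_old. */
static uint32_t isqrt32(uint32_t x) {
    uint32_t root = 0, root_sq = 0;
    for (int b = 15; b >= 0; b--) {
        /* candidate square if bit b is set:
           (root + 2^b)^2 = root^2 + root*2^(b+1) + 2^(2b) */
        uint32_t cand_sq = root_sq + (root << (b + 1)) + (1u << (2 * b));
        if (x >= cand_sq) {        /* source >= candidate: keep the bit */
            root |= 1u << b;
            root_sq = cand_sq;
        }
    }
    return root;                   /* floor(sqrt(x)), in 16 iterations */
}
```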
Compare the integer square root algorithm used in "Spacewar!" [1]. So even by 1960 it should have been possible to implement a square-root step instruction for each bit, much like division or multiplication shifts, and to progress from this to a full-fledged automatic operation by use of a sub-timing network. (I guess it really depends on the economics of the individual use case whether the effort pays off, since you would amass a few additional hardware modules to accomplish this.)
There definitely is a trade-off between memory size and how quickly it can be accessed.
IIRC IBM z/Arch processors (AFAIK internally similar to POWER) have their clock limited to around 5 GHz or so, so that an L1 cache lookup costs only one cycle (a design requirement).
For example, z14 has 5.2 GHz clock rate and 2x128 kB data and instruction L1 caches.
Yes. In theory, any pure function can be turned into a lookup table. And any lookup table that isn't just random numbers can be turned into a more compact algorithm that spends compute to save space.
Such tables may be infeasible, though. While an int8 -> int8 table only needs 256 bytes, an int32 -> int32 table needs 16 gigabytes.
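The feasible end of that trade-off is tiny; a sketch in C of the full-domain table for 8-bit inputs (names are mine, for illustration):

```c
#include <math.h>
#include <stdint.h>

/* Full uint8 -> uint8 integer square root table: 256 bytes total.
   The same approach over a 32-bit domain needs 2^32 entries (the 16 GB above). */
static uint8_t isqrt8_table[256];

static void build_isqrt8_table(void) {
    for (int i = 0; i < 256; i++)
        isqrt8_table[i] = (uint8_t)floor(sqrt((double)i));
}

/* After building, the "square root instruction" is a single load. */
static uint8_t isqrt8(uint8_t x) { return isqrt8_table[x]; }
```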
It isn't, because eventually your logic or table grows physically larger than the distance a signal can propagate in one clock tick. Before that, it presents practical issues (e.g., is it worth dedicating that much silicon?).
Yes, this solves the stated issue about huge lookup tables.
> A planet size CPU that runs at .5 hz but can work on impossibly large numbers.
This doesn't make much sense to me, though.
If your goal is "any algorithm", you'll often have to go a lot slower than .5Hz. A hash-calculating circuit that's built out of a single mountain of silicon could have a critical path that's light-years long.
But if your goal is just "work on impossibly large numbers", but it's okay to take multiple ticks, then there's no reason to drag the frequency down that low. You can run a planet-scale CPU at 1GHz. CPUs have no need for signals to go all the way across inside a single tick.
Related to the fallacy of comparing what's slow in our world with what's computable in a simulation of it: there's no requirement for time to tick similarly.
It's not as bad for integer square root; for inputs up to N you only need to store about sqrt(N) entries in a greater/less-than lookup table: the square k^2 of every possible answer k. Feasible for 16-bit integers, maybe for 32-bit, not for 64-bit.
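A sketch of the 16-bit case in C (256 entries holding the squares of every possible answer; the lookup is a search for the last square that doesn't exceed the input; names are mine):

```c
#include <stdint.h>

/* "Store only the answers" table: squares of 0..255 for 16-bit inputs. */
static uint32_t squares[256];

static void init_squares(void) {
    for (uint32_t k = 0; k < 256; k++)
        squares[k] = k * k;
}

/* Largest k with k^2 <= n, found by binary search over the table. */
static uint8_t isqrt16_table(uint16_t n) {
    int lo = 0, hi = 255;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (squares[mid] <= n) lo = mid;
        else                   hi = mid - 1;
    }
    return (uint8_t)lo;
}
```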
If you wanted to expand the definition of “processor” to electromechanical contraptions, the Friden SRQ could perform square roots using just additions and shifts, with not a single electronic component other than a motor. And since you had to position the decimal points manually, it would _technically_ be an integer operation…
Can't you use the sequence 1 + 3 + 5 + ... + (2k - 1) to get the integer square root of any integer? Its partial sums are the perfect squares k^2, so the answer is basically the k of the nearest partial sum at or below your number.
Can you explain your idea? Your algorithm is correct by definition, but doing it naively would be very slow (even for a 32-bit number). At that point it would be much faster to just binsearch it.
For an example of a binsearch algo: I recently dipped into this while switching some code from floats to fixed-point arithmetic (reducing overall wasm blob size).
Better might be to use the expansion (x+y)^2=x^2+2xy+y^2 along with the observation that in any base, the square root of a 2n-digit number is at most n digits, as in the common method for calculating a square root "by hand" with pen and paper. If you did this 8 bits at a time then you only need a lookup table for roots of 8bit numbers.
It's about sqrt(n) - 1 additions for the n you're trying to get the integer square root of. Memoization would make it constant time for any n smaller than the greatest n you've done this for. For a greater n it's roughly sqrt(new_big_n) - sqrt(prev_big_n) more additions to extend the memoized table.
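Written out (a sketch, with names of my choosing), the idea is just a running sum of odd numbers, so it's only sensible for small n or as a curiosity:

```c
#include <stdint.h>

/* Since 1 + 3 + 5 + ... + (2k-1) = k^2, keep adding odd numbers until the
   running sum would exceed n; the number of additions is floor(sqrt(n)).
   O(sqrt(n)) additions. */
static uint32_t isqrt_odd_sum(uint32_t n) {
    uint32_t k = 0;
    uint64_t sum = 0, odd = 1;     /* 64-bit to avoid overflow near UINT32_MAX */
    while (sum + odd <= n) {
        sum += odd;                /* sum is now (k+1)^2 */
        odd += 2;                  /* next odd number    */
        k++;
    }
    return k;                      /* largest k with k^2 <= n */
}
```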
You're right this isn't practical, but fun to think about. Good refresher for those out of school for a while.
Yes. On a desert island we can have the whole village construct this table for Newton-Raphson guesses.
Combined with a cutting tool attached to a worm drive we will precisely count our turns (big radius crank for extra precision!) and begin manufacture of slide rules. Can never have too many scales and this is just one we shall etch into them!
This bit in an answer further down made me chuckle:
> My implementation of square root using binary search, that doesn't depend on a multiplier. Only basic ALU instructions are used. It is vigorously undocumented. I have no idea what I wrote but it seems to work.
A fine reminder that if we write clever code, we're probably not going to remember how it works.
So many people assume that everything that came before their school days was primitive and barely chugged along :)
A little reading shows the opposite. Most of our smart ideas were already used in 1940s/50s/60s computers, and are recycled on our fab new chips!! Pipelining, out of order exec, multiple cores, etc.
That old-time hardware might have been a bit "chunky" but the architectures used some very smart techniques.
Another example: the virtualization that pretty much enabled the whole "cloud" thing came from mainframe architecture in the '60s. Intel and others later brought it to consumer-grade CPUs.
You can get a really rough approximation if you replace log2(x) with 'count leading zeroes'. With a better approximation of log2(x), you can get closer to the answer.
For an approximate (very rough) answer, as opposed to one accurate to the nearest integer, a right shift by half the bit position of the leading 1 will do, and of course nearly every processor has a shift instruction. I'm not sure how often processors have lacked something like an FLO (Find Leading One) or FFS (Find First Set) instruction; those seem ubiquitous as well.
The super rough approximation for some uses can be approximately as good as an accurate answer. When you just need a decent starting place for some further Newton-Raphson iteration, for example. (Of course the right-shift trick is a nice way to seed a more accurate square root calculation. :P)
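A sketch of that shift-then-refine idea in C, using GCC/Clang's __builtin_clz to find the leading 1 (other compilers have equivalents; the function name is mine):

```c
#include <stdint.h>

/* Seed with n >> (position_of_leading_1 / 2), which lands within roughly a
   factor of two of sqrt(n), then polish with a few Newton-Raphson steps. */
static uint32_t isqrt_clz_newton(uint32_t n) {
    if (n == 0) return 0;
    int top = 31 - __builtin_clz(n);   /* bit position of the leading 1 (~log2 n) */
    uint32_t y = n >> (top / 2);       /* rough square root                       */
    for (int i = 0; i < 4; i++)        /* Newton-Raphson: y <- (y + n/y)/2        */
        y = (y + n / y) / 2;
    while ((uint64_t)y * y > n)        /* integer Newton can land one above       */
        y--;                           /* floor(sqrt(n)); fix up                  */
    return y;
}
```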
I wondered about the name of that function: surely it isn't the inverse of the square root? That would be squaring. It's "1 over square root" or something. So, down the rabbit hole to see if it was always called that. Yup, in Wikipedia articles at least, but the first paper seems to be by Jim Blinn in 1997, without the term in the title. So let's read the paper... Frustratingly, although I am an IEEE member, and I did subscribe to Computer Graphics and Applications in 1997, it won't let me read the paper without paying again. So I'm curious to hear from folks knowledgeable in the space whether this was a mis-naming that stuck or I'm confused about the meaning of "inverse". In my universe we used "inverse" in the context of functions and their inverses, and "invert" in the context of binary values (the 1s complement of x is also called x inverted), never to describe the reciprocal.
This seems to have been common usage. I never really thought about it as it was just so normal to refer to reciprocal as "inverse" in this context.
> In my universe we used inverse in the context of functions and their inverses
Yes, but the other type of inverse that is so fundamental to CS in general, and especially to geometry, is the matrix inverse, which is again a multiplicative inverse, so it's not too surprising that this usage became assumed in many contexts.
I’ve heard inverse used to mean reciprocal often enough. And it’s technically accurate - a reciprocal is a multiplicative inverse. The problem is mainly that “inverse” is ambiguous, especially in this particular case (an inverse square root is a square!), whereas “reciprocal” is clear and unambiguous. Online you can find lots of uses of inverse, and questions about inverse vs reciprocal. So yes reciprocal is the better, more clear term to use here, but “inverse” to mean reciprocal does happen.
Not really; that's a very clever trick used on floating point numbers, and it computes the approximate reciprocal square root.
This right-shift thing is far simpler, not very clever, doesn't involve magic numbers, and is much better known than the "Quake trick". There are easy ways to see it. One is that multiplying a number by itself approximately doubles the number of digits, so halving the number of digits is approximately the square root. You can get more technical and precise by noting that FFS(n) =~ log2(n), and if you remember logs, exp(log(n)/2) = n^(1/2), so shifting right by FFS(n)/2 is, mathematically, approximately a square root.
They are more closely related than you suggest. Both methods use an approximation of log2 to get an initial guess. One gets it from FFS(n); the other gets it from the floating point representation of n, where you can roughly find log2(n) by looking at the exponent field. You can also use the mantissa to refine the guess further.
They are related a little, around the log2 (exponent), I totally agree. I guess I figured the magic number subtraction that turns it into n^(-1/2) is so wild that it puts the Quake trick in a bit of a different class. This right shift thing is a lot older and there’s probably a lineage of ideas from the right shift on integers to the Quake trick. I discovered another fun one on my own that is probably already well known in bit-math circles, but simply inverting the bits of a floating point exponent is a very rough approximation to the reciprocal, good for a Newton initial guess.
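For what it's worth, here is that last trick written out literally in C (the function name is mine; it just flips the eight exponent bits of a normal, nonzero IEEE-754 float, and per the comment it only gets you within a few powers of two of 1/x):

```c
#include <stdint.h>
#include <string.h>

/* Crude reciprocal approximation: invert only the exponent field of a float.
   Not a production technique; just the bit trick described above. */
static float rough_recip(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);     /* type-pun without aliasing issues */
    bits ^= 0x7F800000u;                /* flip the 8 exponent bits         */
    memcpy(&x, &bits, sizeof bits);
    return x;
}
```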