Is it possible in a single clock cycle? Yes, with a very large lookup table. The table size can probably be reduced, depending on how many serial logic gates can be evaluated within the clock cycle. Consider that, in binary, the square root of 10000 is rather similar to that of 100, differing only in the number of zeros.
Floating-point reciprocal square root estimate (`frsqrte`) instructions are typically implemented as just such a table lookup, indexed by a few bits of the fraction and the LSB of the exponent.
The precision is typically limited to roughly that of bf16 (ARM, RISC-V) or fp16 (x86), so programs are expected to do a few Newton-Raphson iterations afterwards if they want more.
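As a minimal sketch in C of that estimate-then-refine pattern: the well-known bit-trick constant below is only a stand-in for a hardware `frsqrte`-style estimate (a real chip's table and precision differ), and the step count is illustrative.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Stand-in for a hardware reciprocal-sqrt estimate: the classic
 * bit-trick yields a low-precision initial guess of 1/sqrt(x). */
static float rsqrt_estimate(float x) {
    uint32_t i;
    memcpy(&i, &x, sizeof i);      /* reinterpret float bits as integer */
    i = 0x5f3759df - (i >> 1);
    float y;
    memcpy(&y, &i, sizeof y);
    return y;
}

int main(void) {
    float x = 2.0f;
    float y = rsqrt_estimate(x);          /* crude, few-bit estimate        */
    y = y * (1.5f - 0.5f * x * y * y);    /* one Newton-Raphson step        */
    y = y * (1.5f - 0.5f * x * y * y);    /* second step: near fp32 accuracy */
    printf("rsqrt(2) ~= %.7f\n", y);      /* expect ~0.7071068 */
    return 0;
}
```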
You can compute the integer square root in n/2 iterations, where n is the number of bits in the source, using just shifts and adds. At each step, check whether the next bit (bit k, scanning from high to low) has to be set in the partial result n_old by computing the square the result would have with that bit set: `n2_new = n2_old + (n_old << (k+1)) + (1 << 2*k)`.
Then compare it with the source operand, and if the source is greater or equal: 1) set the bit in the result, 2) replace n2_old with n2_new.
It can be done in n/2 or perhaps n clock cycles with a suitable microcode instruction set and ALU. With some effort it can be optimized to reduce n to the index of the leftmost set bit in the operand.
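A software sketch of that shift-and-add loop, assuming a 32-bit source and thus 16 iterations; the variable names follow the description above.

```c
#include <stdint.h>
#include <stdio.h>

/* Shift-and-add integer square root: one candidate result bit per
 * iteration, using only shifts, adds, and compares. */
static uint32_t isqrt32(uint32_t x) {
    uint32_t root  = 0;              /* result built up bit by bit (n_old) */
    uint32_t root2 = 0;              /* square of the result so far (n2_old) */
    for (int k = 15; k >= 0; k--) {
        /* (root + 2^k)^2 = root^2 + root*2^(k+1) + 2^(2k)   (n2_new) */
        uint32_t trial = root2 + (root << (k + 1)) + (1u << (2 * k));
        if (x >= trial) {            /* the new bit fits: accept it */
            root |= 1u << k;
            root2 = trial;
        }
    }
    return root;
}

int main(void) {
    printf("%u\n", isqrt32(10000)); /* 100 */
    printf("%u\n", isqrt32(99));    /* 9   */
    return 0;
}
```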
Compare the integer square root algorithm used in "Spacewar!" [1]. So even by 1960 it should have been possible to implement a square-root-step instruction for each bit, much like division or multiplication shifts, and to progress from this to a full-fledged automatic operation by means of a sub-timing network. (I guess it really depends on the economics of the individual use case whether the effort pays off, as you would amass a few additional hardware modules to accomplish this.)
There definitely is a trade-off between memory size and how quickly it can be accessed.
IIRC, IBM z/Arch processors (AFAIK they are internally similar to POWER) have their clock limited to around 5 GHz or so, so that an L1 cache lookup costs only one cycle (a design requirement).
For example, the z14 has a 5.2 GHz clock rate and 2x128 kB data and instruction L1 caches.
Yes. In theory, any pure function can be turned into a lookup table. And any lookup table that isn't just random numbers can be turned into a more compact algorithm that spends compute to save space.
Such tables may be infeasible, though. While an int8 -> int8 table only needs 256 bytes, an int32 -> int32 table needs 16 gigabytes.
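For illustration, here's the small end of that trade-off in C: a full int8 -> int8 square-root table, built up front (the names are made up for the sketch).

```c
#include <stdint.h>
#include <stdio.h>

/* A complete int8 -> int8 lookup table for isqrt fits in 256 bytes
 * and answers any query with a single load. */
static uint8_t isqrt_table[256];

static void build_table(void) {
    for (int n = 0; n < 256; n++) {
        uint8_t r = 0;
        while ((r + 1) * (r + 1) <= n) r++;  /* largest r with r^2 <= n */
        isqrt_table[n] = r;
    }
}

int main(void) {
    build_table();
    printf("%u\n", isqrt_table[200]); /* 14, since 14^2 = 196 <= 200 < 225 */
    return 0;
}
```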
It isn't, because eventually your logic or table grows physically larger than the distance a signal can propagate in one clock tick. Before that, it likely presents practical issues (e.g., is it worth dedicating that much silicon?).
Yes, this solves the stated issue about huge lookup tables.
> A planet size CPU that runs at .5 hz but can work on impossibly large numbers.
This doesn't make much sense to me, though.
If your goal is "any algorithm", you'll often have to go a lot slower than 0.5 Hz. A hash-calculating circuit that's built out of a single mountain of silicon could have a critical path that's light-years long.
But if your goal is just "work on impossibly large numbers", and it's okay to take multiple ticks, then there's no reason to drag the frequency down that low. You can run a planet-scale CPU at 1 GHz. CPUs have no need for signals to go all the way across inside a single tick.
This is related to the fallacy of comparing what's slow in our world with what's computable in a simulation of it: there's no requirement for time to tick similarly in both.
It's not as bad for integer square root: you only need to store about sqrt(max input) entries in a greater/less-than lookup table, namely the square N^2 for each candidate answer N; the result is the largest N whose square doesn't exceed the input. Feasible for 16-bit integers (256 entries), maybe for 32-bit (65,536 entries), not for 64-bit (2^32 entries).
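A sketch of that scheme for 16-bit inputs: 256 stored squares, with the greater/less-than comparison done here as a binary search (hardware could instead compare against all entries in parallel).

```c
#include <stdint.h>
#include <stdio.h>

/* Table of N^2 for every candidate answer N of a 16-bit input.
 * The answer is the largest N whose square does not exceed the input. */
static uint32_t squares[256];

int main(void) {
    for (uint32_t n = 0; n < 256; n++) squares[n] = n * n;

    uint32_t x = 40000;                 /* 16-bit input */
    uint32_t lo = 0, hi = 255;
    while (lo < hi) {                   /* find the largest n with n^2 <= x */
        uint32_t mid = (lo + hi + 1) / 2;
        if (squares[mid] <= x) lo = mid; else hi = mid - 1;
    }
    printf("isqrt(%u) = %u\n", x, lo);  /* 200, since 200^2 = 40000 */
    return 0;
}
```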