
Is it possible in a single clock cycle? Yes, with a very large lookup table. The table size can probably be reduced, depending on how many serial logic gates can be evaluated within the clock cycle. Consider that the binary square root of 10000 is quite similar to that of 100, differing only in the number of trailing zeros.
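A hedged sketch of that reduction in C (the 256-entry table size, the table contents, and the function names are my assumptions, not any real hardware design): normalize the input by an even shift so a small table indexed by the top bits covers the whole range, then halve the shift on the way out, since isqrt(x << 2k) == isqrt(x) << k.

    #include <math.h>
    #include <stdint.h>

    static uint8_t sqrt_table[256]; /* table[t] ~= floor(sqrt(t << 8)) */

    static void init_table(void) {
        for (int t = 0; t < 256; t++)
            sqrt_table[t] = (uint8_t)sqrt((double)(t << 8));
    }

    /* Approximate integer square root via one small table. */
    static uint32_t isqrt_estimate(uint32_t x) {
        if (x == 0) return 0;
        int shift = __builtin_clz(x) & ~1;  /* even shift; GCC/Clang builtin */
        uint32_t norm = x << shift;         /* normalized: norm >= 2^30 */
        uint32_t r = (uint32_t)sqrt_table[norm >> 24] << 8; /* ~sqrt(norm) */
        return r >> (shift / 2);            /* sqrt undoes half the shift */
    }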



Floating point reciprocal square root estimate (`frsqrte`) instructions are typically implemented as just such a table lookup, indexed by a few bits of the fraction and the LSB of the exponent. The precision is typically limited to roughly bf16 (ARM, RISC-V) or fp16 (x86), so programs are expected to do a few Newton-Raphson iterations afterwards if they want more.
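For a sense of how those follow-up iterations look, here's a hedged sketch in C, using the well-known 0x5f3759df bit trick as a stand-in for the hardware table-lookup estimate (not any particular ISA's actual sequence):

    #include <stdint.h>
    #include <string.h>

    float rsqrt(float x) {
        uint32_t i;
        float y;
        memcpy(&i, &x, sizeof i);          /* reinterpret the float's bits */
        i = 0x5f3759df - (i >> 1);         /* crude estimate of 1/sqrt(x) */
        memcpy(&y, &i, sizeof y);
        y = y * (1.5f - 0.5f * x * y * y); /* Newton-Raphson step */
        y = y * (1.5f - 0.5f * x * y * y); /* each step ~doubles correct bits */
        return y;
    }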


You can compute the integer square root in n/2 iterations, where n is the number of bits in the source, using just shifts and adds. For each step, check whether a new bit has to be set in the result n_old by computing

n2_new = (n_old + (1 << bit))^2 = n2_old + (n_old << (bit + 1)) + (1 << (bit*2))

Then compare it with the source operand; if the source is greater than or equal to n2_new: 1) set the bit in the result, 2) replace n2_old with n2_new.

It can be done in n/2 or perhaps n clock cycles with a suitable microcode instruction set and ALU. With some effort it can be optimized to reduce n to the index of the leftmost set bit in the operand.
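A hedged C sketch of the loop described above, for 32-bit operands (16 result bits, so 16 iterations):

    #include <stdint.h>

    uint32_t isqrt(uint32_t x) {
        uint32_t n = 0;   /* partial result */
        uint32_t n2 = 0;  /* n*n, maintained with shifts and adds only */
        for (int bit = 15; bit >= 0; bit--) {
            /* (n + 2^bit)^2 = n2 + (n << (bit+1)) + (1 << 2*bit) */
            uint32_t n2_new = n2 + (n << (bit + 1)) + (1u << (2 * bit));
            if (n2_new <= x) {   /* candidate square still fits: keep the bit */
                n |= 1u << bit;
                n2 = n2_new;
            }
        }
        return n;
    }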


Compare the integer square root algorithm used in "Spacewar!" [1]. So, even by 1960 it should have been possible to implement a square-root step instruction for each bit, much like division or multiplication shifts, and progress from this to a full-fledged automatic operation by the use of a sub-timing network. (I guess it really depends on the economics of the individual use case whether the effort pays off, as you would amass a few additional hardware modules to accomplish this.)

[1] https://www.masswerk.at/spacewar/inside/insidespacewar-pt6-g...


so, dumb question.

do lookups in large tables ever (practically, not theoretically) take one clock cycle?

If there's a large lookup table, it would have to come from memory, which might mean cache and memory hierarchy delays, right?


If the table is in silicon, you can avoid this. Not sure if that is done in practice though.


There definitely is a trade-off between memory size and how quickly it can be accessed.

IIRC IBM z/Arch processors (AFAIK they are internally similar to POWER) have their clock limited to around 5 GHz so that an L1 cache lookup costs only one cycle (a design requirement).

For example, z14 has 5.2 GHz clock rate and 2x128 kB data and instruction L1 caches.


Sounds like it's possible to run any algorithm in the world in 1 clock cycle.


Yes. In theory, any pure function can be turned into a lookup table. And any lookup table that isn't just random numbers can be turned into a more compact algorithm that spends compute to save space.

Such tables may be infeasible, though. While an int8 -> int8 table only needs 256 bytes, an int32 -> int32 table needs 16 gigabytes.
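A minimal sketch of the trade in C, assuming some placeholder pure function (the names here are made up for illustration):

    #include <stdint.h>

    static uint8_t table[256];

    static uint8_t f(uint8_t x) { return (uint8_t)(x * x); } /* any pure fn */

    static void build_table(void) {     /* spend compute once... */
        for (int i = 0; i < 256; i++)
            table[i] = f((uint8_t)i);
    }

    static uint8_t f_fast(uint8_t x) {  /* ...to make every call O(1) */
        return table[x];
    }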


Fractal functions are pure, but don't lend themselves well to memoization or lookup tables.


What does this statement even mean? Every function from a finite domain is just a lookup table, period.


Who said anything about finite domains?


Every function has finite domain on a computer.


Arbitrary-precision real-number libraries would disagree with you there.


False. A computer always has a finite amount of memory.


It isn't, because eventually your logic or table grows physically larger than the distance a signal can propagate in one clock tick. Before that, it likely presents practical issues (e.g., is it worth dedicating that much silicon?).


Have slower ticks. A planet-sized CPU that runs at 0.5 Hz but can work on impossibly large numbers.


> Have slower ticks.

Yes, this solves the stated issue about huge lookup tables.

> A planet-sized CPU that runs at 0.5 Hz but can work on impossibly large numbers.

This doesn't make much sense to me, though.

If your goal is "any algorithm", you'll often have to go a lot slower than .5Hz. A hash-calculating circuit that's built out of a single mountain of silicon could have a critical path that's light-years long.

But if your goal is just "work on impossibly large numbers", and it's okay to take multiple ticks, then there's no reason to drag the frequency down that low. You can run a planet-scale CPU at 1 GHz. CPUs have no need for signals to go all the way across inside a single tick.


You'd need way better clocks and synchronization circuits than exist now, but I don't see any purely physical barriers.


The whole thing doesn't need to be on the same clock domain. You can put clock crossings every inch.


And 27.7% [1] of the planet's crust is silicon already!

[1] Britannica: Silicon


That's actually a really fascinating science fiction idea!


https://en.wikipedia.org/wiki/Matrioshka_brain

:-D

It's a topic that has been explored quite a bit in science fiction literature.


It's sort of the plot of the Douglas Adams books.

https://hitchhikers.fandom.com/wiki/Earth


This is related to the fallacy of comparing what's slow in our world with what's computable in a simulation of it: there's no requirement for time to tick similarly.


> impossibly large numbers

Forty-two for example?


With a sufficiently large chip and a sufficiently slow clock, sure.


"Give me a lut large enough and an infinite period of time to execute, and I shall simulate the world"


“Every cycle”


I like the Quantum BogoSort as a proof of this /s


It's not as bad for integer square root: you only need to store sqrt(N) entries in a greater/lesser-than lookup table, namely the squares n^2 for every possible answer n. Feasible for 16-bit integers, maybe for 32-bit, not for 64-bit.
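A hedged C sketch for the 16-bit case (the function names are mine): store the 256 squares and binary-search for the largest n with n^2 <= x, which is the software analogue of that greater/lesser-than comparison table.

    #include <stdint.h>

    static uint32_t squares[256];

    static void init_squares(void) {
        for (uint32_t n = 0; n < 256; n++)
            squares[n] = n * n;
    }

    static uint8_t isqrt16(uint16_t x) {
        int lo = 0, hi = 255;        /* invariant: squares[lo] <= x */
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (squares[mid] <= x) lo = mid;
            else hi = mid - 1;
        }
        return (uint8_t)lo;
    }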



