Is it possible in a single clock cycle? Yes, with a very large lookup table. The table size can probably be reduced, depending on how many serial logic gates can be evaluated within the clock cycle. Consider that, in binary, the square root of 10000 is rather similar to that of 100, differing only in the number of zeros.
Floating-point reciprocal square root estimate (`frsqrte`) instructions are typically implemented as just such a table lookup, indexed by a few bits of the fraction and the LSB of the exponent.
The precision is typically limited to roughly that of bf16 (ARM, RISC-V) or fp16 (x86), so programs are expected to do a few Newton-Raphson iterations afterwards if they want more.
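As a minimal sketch in C of that estimate-then-refine pattern: the well-known bit-trick constant below is only a stand-in for a hardware `frsqrte`-style estimate (a real chip's table and precision differ), and the step count is illustrative.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Stand-in for a hardware reciprocal-sqrt estimate: the classic
 * bit-trick yields a low-precision initial guess of 1/sqrt(x). */
static float rsqrt_estimate(float x) {
    uint32_t i;
    memcpy(&i, &x, sizeof i);      /* reinterpret float bits as integer */
    i = 0x5f3759df - (i >> 1);
    float y;
    memcpy(&y, &i, sizeof y);
    return y;
}

int main(void) {
    float x = 2.0f;
    float y = rsqrt_estimate(x);          /* crude, few-bit estimate        */
    y = y * (1.5f - 0.5f * x * y * y);    /* one Newton-Raphson step        */
    y = y * (1.5f - 0.5f * x * y * y);    /* second step: near fp32 accuracy */
    printf("rsqrt(2) ~= %.7f\n", y);      /* expect ~0.7071068 */
    return 0;
}
```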
You can compute the integer square root in n/2 iterations, where n is the number of bits in the source, using just shifts and adds. At each step, check whether the next bit (bit k, scanning from high to low) has to be set in the partial result n_old by computing the square the result would have with that bit set: `n2_new = n2_old + (n_old << (k+1)) + (1 << 2*k)`.
Then compare it with the source operand, and if the source is greater or equal: 1) set the bit in the result, 2) replace n2_old with n2_new.
It can be done in n/2 or perhaps n clock cycles with a suitable microcode instruction set and ALU. With some effort it can be optimized to reduce n to the index of the leftmost set bit in the operand.
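A software sketch of that shift-and-add loop, assuming a 32-bit source and thus 16 iterations; the variable names follow the description above.

```c
#include <stdint.h>
#include <stdio.h>

/* Shift-and-add integer square root: one candidate result bit per
 * iteration, using only shifts, adds, and compares. */
static uint32_t isqrt32(uint32_t x) {
    uint32_t root  = 0;              /* result built up bit by bit (n_old) */
    uint32_t root2 = 0;              /* square of the result so far (n2_old) */
    for (int k = 15; k >= 0; k--) {
        /* (root + 2^k)^2 = root^2 + root*2^(k+1) + 2^(2k)   (n2_new) */
        uint32_t trial = root2 + (root << (k + 1)) + (1u << (2 * k));
        if (x >= trial) {            /* the new bit fits: accept it */
            root |= 1u << k;
            root2 = trial;
        }
    }
    return root;
}

int main(void) {
    printf("%u\n", isqrt32(10000)); /* 100 */
    printf("%u\n", isqrt32(99));    /* 9   */
    return 0;
}
```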
Compare the integer square root algorithm used in "Spacewar!" [1]. So even by 1960 it should have been possible to implement a square-root-step instruction for each bit, much like division or multiplication shifts, and to progress from this to a full-fledged automatic operation by means of a sub-timing network. (I guess it really depends on the economics of the individual use case whether the effort pays off, as you would amass a few additional hardware modules to accomplish this.)
There definitely is a trade-off between memory size and how quickly it can be accessed.
IIRC, IBM z/Arch processors (AFAIK they are internally similar to POWER) have their clock limited to around 5 GHz or so, so that an L1 cache lookup costs only one cycle (a design requirement).
For example, the z14 has a 5.2 GHz clock rate and 2x128 kB data and instruction L1 caches.
Yes. In theory, any pure function can be turned into a lookup table. And any lookup table that isn't just random numbers can be turned into a more compact algorithm that spends compute to save space.
Such tables may be infeasible, though. While an int8 -> int8 table only needs 256 bytes, an int32 -> int32 table needs 16 gigabytes.
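For illustration, here's the small end of that trade-off in C: a full int8 -> int8 square-root table, built up front (the names are made up for the sketch).

```c
#include <stdint.h>
#include <stdio.h>

/* A complete int8 -> int8 lookup table for isqrt fits in 256 bytes
 * and answers any query with a single load. */
static uint8_t isqrt_table[256];

static void build_table(void) {
    for (int n = 0; n < 256; n++) {
        uint8_t r = 0;
        while ((r + 1) * (r + 1) <= n) r++;  /* largest r with r^2 <= n */
        isqrt_table[n] = r;
    }
}

int main(void) {
    build_table();
    printf("%u\n", isqrt_table[200]); /* 14, since 14^2 = 196 <= 200 < 225 */
    return 0;
}
```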
It isn't, because eventually your logic or table grows physically larger than the distance a signal can propagate in one clock tick. Before that, it likely presents practical issues (e.g., is it worth dedicating that much silicon?).
Yes, this solves the stated issue about huge lookup tables.
> A planet size CPU that runs at .5 hz but can work on impossibly large numbers.
This doesn't make much sense to me, though.
If your goal is "any algorithm", you'll often have to go a lot slower than 0.5 Hz. A hash-calculating circuit that's built out of a single mountain of silicon could have a critical path that's light-years long.
But if your goal is just "work on impossibly large numbers", and it's okay to take multiple ticks, then there's no reason to drag the frequency down that low. You can run a planet-scale CPU at 1 GHz. CPUs have no need for signals to go all the way across inside a single tick.
This is related to the fallacy of comparing what's slow in our world with what's computable in a simulation of it: there's no requirement for time to tick similarly in both.
It's not as bad for integer square root: you only need to store about sqrt(max input) entries in a greater/less-than lookup table, namely the square N^2 for each candidate answer N; the result is the largest N whose square doesn't exceed the input. Feasible for 16-bit integers (256 entries), maybe for 32-bit (65,536 entries), not for 64-bit (2^32 entries).
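A sketch of that scheme for 16-bit inputs: 256 stored squares, with the greater/less-than comparison done here as a binary search (hardware could instead compare against all entries in parallel).

```c
#include <stdint.h>
#include <stdio.h>

/* Table of N^2 for every candidate answer N of a 16-bit input.
 * The answer is the largest N whose square does not exceed the input. */
static uint32_t squares[256];

int main(void) {
    for (uint32_t n = 0; n < 256; n++) squares[n] = n * n;

    uint32_t x = 40000;                 /* 16-bit input */
    uint32_t lo = 0, hi = 255;
    while (lo < hi) {                   /* find the largest n with n^2 <= x */
        uint32_t mid = (lo + hi + 1) / 2;
        if (squares[mid] <= x) lo = mid; else hi = mid - 1;
    }
    printf("isqrt(%u) = %u\n", x, lo);  /* 200, since 200^2 = 40000 */
    return 0;
}
```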