In my experiments, anything using lookup tables was slower than a naive, branching decoder on real-world data. Reading from a lookup table in L1 cache has ~4 cycles of latency, which is prohibitive for the simple case of mostly-ASCII bytes. You can easily get more than 1.5 GB/s with a naive decoder, while all the "smarter" approaches are capped at ~800 MB/s.
I would expect it depends on where the lookup table enters the picture. If the table lookup is at the end of a dependency chain, it's not going to stall anything, and meanwhile you're putting less strain on the branch predictor and the reordering machinery.
The article is using pext, which is different from a regular table lookup. While it has horrible latency and overhead on older AMD chips (where it's microcoded), on Zen3 it appears to decode to a single uop with a latency of 3, and the same has been true on Intel since Haswell. So it's not as bad as a regular table lookup, as long as you're on a new enough processor.
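For anyone unfamiliar with it: pext takes a mask, pulls out the source bits wherever the mask has ones, and packs them contiguously into the low bits of the result. I don't know exactly how the article wires it up, but the rough idea for a (pre-validated) 3-byte sequence could look like this sketch:

    #include <stdint.h>
    #include <immintrin.h>   /* _pext_u32 requires BMI2 (-mbmi2) */

    /* Sketch only: decode a validated 3-byte sequence 1110xxxx 10yyyyyy 10zzzzzz.
       Assemble the bytes big-endian so the lead byte's payload lands in the
       high extracted bits, then let pext strip the prefix bits in one go. */
    static inline uint32_t decode3_pext(const unsigned char *p)
    {
        uint32_t word = (uint32_t)p[0] << 16 | (uint32_t)p[1] << 8 | p[2];
        return _pext_u32(word, 0x0F3F3F);   /* -> xxxxyyyyyyzzzzzz */
    }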
It'd be interesting to benchmark this on texts in different languages. English text will be very friendly to the branch predictor because it's all 1-byte codepoints, and I think a language dominated by either 2-byte or 3-byte codepoints would be as well; it's mixing codepoint lengths that would trip things up.
Agreed. I saw measurable speed increases when defaulting to ASCII (via a macro) and only calling the decoder function when the byte was >= 128.
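Roughly this shape (a sketch; decode_multibyte is just a placeholder name for whatever full decoder you use):

    #include <stdint.h>

    /* Placeholder: full decoder for multibyte sequences; consumes the
       sequence and advances *pp past it. */
    uint32_t decode_multibyte(const unsigned char **pp);

    /* ASCII fast path as a macro: bytes below 128 are handled inline,
       only lead bytes >= 128 pay for the function call.
       usage: const unsigned char *p = ...; uint32_t cp = NEXT_CP(&p); */
    #define NEXT_CP(pp) \
        ((**(pp) < 0x80) ? (uint32_t)*(*(pp))++ : decode_multibyte(pp))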
An all-ASCII fast path is basically obligatory for anything that cares about performance. The "slow path" then gets a guarantee that it will see variable-length data, and since you'll want it to not actually be slow, you'll still need to avoid branching on the now-guaranteed-variable width.
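One way to avoid that branch is to derive the sequence length arithmetically from the lead byte, e.g. by counting its leading one bits (a sketch; GCC/Clang builtin, valid lead byte assumed):

    /* Number of bytes in the sequence, from the lead byte alone.
       Counting the leading ones of the lead byte gives 0 for ASCII and
       2/3/4 for multibyte lead bytes; map the ASCII case to 1. */
    static inline int utf8_seq_len(unsigned char lead)
    {
        int ones = __builtin_clz(~((unsigned)lead << 24));  /* leading 1s of lead */
        return ones + (ones == 0);   /* 0 -> 1, otherwise unchanged */
    }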
SIMD is great if you want to convert to UTF-32, but most of the time I want to iterate over UTF-8 codepoints directly. The initial test for a 1-byte sequence is just a single trivial branch, and it will be hit most of the time. Asian scripts are an exception, but there you get longer runs of 3-byte sequences. Testing with real-world data, I found the worst case for the branch predictor to be Eastern European scripts, which often alternate between 1- and 2-byte sequences. But even with Czech-language content, I got around 800 MB/s with the naive approach, IIRC.
What do you mean by "iterate over UTF-8 codepoints directly"? Is the requirement that you get a pointer to the start of each, or that you receive a stream of u32 whose content is the UTF-8 bytes, or something else? Basically I don't really understand how a UTF-8-to-UTF-32 converter is not trivially modifiable to do the required thing.
I mean there's the whole issue of quadrupling your storage space... plus you have to iterate twice... plus you'd have to write four bytes for every code point, even if most of the code points only require one byte...
You don't have to iterate twice, you don't have to use any more storage than a constant number of bytes chosen to fit in L1 alongside everything else, and any system of "iterating over UTF-8 codepoints directly" has to spend multiple bytes on each codepoint anyway (for example, an index into a byte array occupies multiple bytes, and so does the minimum storage needed to hold an arbitrary codepoint). So each part of your comment is either incorrect or applies to all possible implementations that meet the requirements. You also haven't proposed any particular meaning for "iterating over UTF-8 codepoints directly," but this is probably for the best, since it is someone else's requirement.
Okay, so, from my current perspective, everything in your comment is flat wrong. Assuming that it isn't, you and I must be talking about different things to have such divergent views.
Upthread was posted, "I want to iterate over UTF-8 codepoints directly. The initial test for a 1-byte sequence is just too simple with a branch and will be hit most of the time."
Compared to this, your proposals require iterating twice, in the context as I understand it: either (1) to convert to UCS-4 and (2) to process; or (1) to establish pointers to the starting UTF-8 code unit of each code point and (2) to process.
Additionally, if you convert to UCS-4/UTF-32, and most of the text is in ASCII, you will roughly quadruple the space required for this conversion. If you establish 32-bit pointers into the UTF-8 string for the starting code unit of each code point, you will also approximately quadruple the space required. Obviously, double that if you're using 64-bit pointers.
By contrast, iterating through a UTF-8 string directly has no such problems, at the expense of slightly more CPU time per iteration. Such iteration can take the form of a single decode-and-advance pass over the bytes.
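A minimal sketch of what I mean (validation omitted for brevity; one pass, no intermediate buffer, no index array):

    #include <stdint.h>

    /* Walk the string once, decoding in place; the only state is the
       current position. Assumes well-formed UTF-8. */
    void for_each_codepoint(const unsigned char *p, const unsigned char *end,
                            void (*fn)(uint32_t cp, void *ctx), void *ctx)
    {
        while (p < end) {
            uint32_t cp;
            unsigned char b0 = *p;
            if (b0 < 0x80)      { cp = b0; p += 1; }
            else if (b0 < 0xE0) { cp = (uint32_t)(b0 & 0x1F) << 6 | (p[1] & 0x3F); p += 2; }
            else if (b0 < 0xF0) { cp = (uint32_t)(b0 & 0x0F) << 12
                                     | (uint32_t)(p[1] & 0x3F) << 6 | (p[2] & 0x3F); p += 3; }
            else                { cp = (uint32_t)(b0 & 0x07) << 18 | (uint32_t)(p[1] & 0x3F) << 12
                                     | (uint32_t)(p[2] & 0x3F) << 6 | (p[3] & 0x3F); p += 4; }
            fn(cp, ctx);
        }
    }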
Instead of expanding all the data to UTF-32, you can expand in chunks of some fixed size, say 1024 characters: memory usage stays capped, the UTF-8 unpacking can be unrolled or vectorized however you like, and the code iterating over codepoints is trivially branchless (it can even be SIMD itself). The annoying downside is potentially some wasted computation if the iteration exits early, but even then you could do something like doubling the chunk size on each chunk, starting with 8 characters or so.
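A rough sketch of that shape (chunk size and names are arbitrary; the scalar expansion loop is just a stand-in for whatever unrolled/SIMD unpacking you'd actually use, and validation is omitted):

    #include <stdint.h>
    #include <stddef.h>

    #define CHUNK 1024   /* arbitrary; small enough to stay in L1 */

    /* Stand-in for the expansion step: decode up to 'cap' codepoints into
       'out', return how many were written, advance *pp. Assumes well-formed
       UTF-8. */
    static size_t expand_chunk(const unsigned char **pp, const unsigned char *end,
                               uint32_t *out, size_t cap)
    {
        const unsigned char *p = *pp;
        size_t n = 0;
        while (p < end && n < cap) {
            unsigned char b0 = *p;
            if (b0 < 0x80)      { out[n++] = b0; p += 1; }
            else if (b0 < 0xE0) { out[n++] = (uint32_t)(b0 & 0x1F) << 6 | (p[1] & 0x3F); p += 2; }
            else if (b0 < 0xF0) { out[n++] = (uint32_t)(b0 & 0x0F) << 12
                                           | (uint32_t)(p[1] & 0x3F) << 6 | (p[2] & 0x3F); p += 3; }
            else                { out[n++] = (uint32_t)(b0 & 0x07) << 18 | (uint32_t)(p[1] & 0x3F) << 12
                                           | (uint32_t)(p[2] & 0x3F) << 6 | (p[3] & 0x3F); p += 4; }
        }
        *pp = p;
        return n;
    }

    /* Driver: expand into a fixed buffer, hand it to the consumer, repeat.
       Memory stays capped at CHUNK codepoints regardless of input size. */
    void process_chunked(const unsigned char *p, const unsigned char *end,
                         void (*consume)(const uint32_t *cps, size_t n))
    {
        uint32_t buf[CHUNK];
        size_t n;
        while ((n = expand_chunk(&p, end, buf, CHUNK)) > 0)
            consume(buf, n);
    }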
The code is careful not to read past the end of the buffer, but it doesn't explicitly check that there are enough bytes available for the current multibyte character. However, this "end of buffer in middle of character" error is caught later by the check for valid continuation bytes. I thought that was quite neat.
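To sketch the idea (my own illustration, not the article's actual code): reads near the end of the buffer get clamped to a harmless value, and that value can never pass the continuation-byte check, so a sequence cut off by the buffer end is rejected like any other malformed input.

    /* Sketch of one way this pattern can be arranged: clamp out-of-range
       reads to 0, which never has the 10xxxxxx continuation prefix, so the
       normal validity check catches the truncated sequence. */
    static inline unsigned char byte_or_zero(const unsigned char *p,
                                             const unsigned char *end, int i)
    {
        return (p + i < end) ? p[i] : 0;
    }

    static inline int is_continuation(unsigned char b)
    {
        return (b & 0xC0) == 0x80;
    }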