In my experiments, anything using lookup tables was slower than a naive, branching decoder on real-world data. Reading from a lookup table in L1 cache has ~4 cycles of latency, which is prohibitive for the simple case of mostly-ASCII bytes. You can easily get more than 1.5 GB/s with a naive decoder, while all the "smarter" approaches are capped at ~800 MB/s.
I would expect it depends on where the lookup table enters the picture. If the table lookup is at the end of a dependency chain, it's not going to stall anything, and meanwhile you're putting less strain on the branch predictor and the reordering machinery.
The article is using pext, which is different from a regular table lookup. While it has horrible latency and overhead on older AMD chips (where it's microcoded), on Zen3 it appears to decode to a single uop with a latency of 3, and the same has been true on Intel since Haswell. So it's not as bad as a regular table lookup, as long as you're on a new enough processor.
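For anyone unfamiliar with it: pext takes a mask, pulls out the source bits wherever the mask has ones, and packs them contiguously into the low bits of the result. I don't know exactly how the article wires it up, but the rough idea for a (pre-validated) 3-byte sequence could look like this sketch:

    #include <stdint.h>
    #include <immintrin.h>   /* _pext_u32 requires BMI2 (-mbmi2) */

    /* Sketch only: decode a validated 3-byte sequence 1110xxxx 10yyyyyy 10zzzzzz.
       Assemble the bytes big-endian so the lead byte's payload lands in the
       high extracted bits, then let pext strip the prefix bits in one go. */
    static inline uint32_t decode3_pext(const unsigned char *p)
    {
        uint32_t word = (uint32_t)p[0] << 16 | (uint32_t)p[1] << 8 | p[2];
        return _pext_u32(word, 0x0F3F3F);   /* -> xxxxyyyyyyzzzzzz */
    }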
It'd be interesting to benchmark this on texts in different languages. English text will be very friendly to the branch predictor because it's all 1-byte codepoints, and I think a language dominated by either 2-byte or 3-byte codepoints would be as well; it's mixing codepoint lengths that would trip things up.
Agreed. I saw measurable speed increases when defaulting to ASCII (via a macro) and only calling the decoder function when the byte was >= 128.
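Roughly this shape (a sketch; decode_multibyte is just a placeholder name for whatever full decoder you use):

    #include <stdint.h>

    /* Placeholder: full decoder for multibyte sequences; consumes the
       sequence and advances *pp past it. */
    uint32_t decode_multibyte(const unsigned char **pp);

    /* ASCII fast path as a macro: bytes below 128 are handled inline,
       only lead bytes >= 128 pay for the function call.
       usage: const unsigned char *p = ...; uint32_t cp = NEXT_CP(&p); */
    #define NEXT_CP(pp) \
        ((**(pp) < 0x80) ? (uint32_t)*(*(pp))++ : decode_multibyte(pp))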
An all-ASCII fast path is basically obligatory for anything that cares about performance. The "slow path" then gets a guarantee that it will see variable-length data, and since you'll want it to not actually be slow, you'll still need to avoid branching on the now-guaranteed-variable width.
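One way to avoid that branch is to derive the sequence length arithmetically from the lead byte, e.g. by counting its leading one bits (a sketch; GCC/Clang builtin, valid lead byte assumed):

    /* Number of bytes in the sequence, from the lead byte alone.
       Counting the leading ones of the lead byte gives 0 for ASCII and
       2/3/4 for multibyte lead bytes; map the ASCII case to 1. */
    static inline int utf8_seq_len(unsigned char lead)
    {
        int ones = __builtin_clz(~((unsigned)lead << 24));  /* leading 1s of lead */
        return ones + (ones == 0);   /* 0 -> 1, otherwise unchanged */
    }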
SIMD is great if you want to convert to UTF-32, but most of the time I want to iterate over UTF-8 codepoints directly. The initial test for a 1-byte sequence is just a single trivial branch, and it will be hit most of the time. Asian scripts are an exception, but there you get longer runs of 3-byte sequences. Testing with real-world data, I found the worst case for the branch predictor to be Eastern European scripts, which often alternate between 1- and 2-byte sequences. But even with Czech-language content, I got around 800 MB/s with the naive approach, IIRC.
What do you mean by "iterate over UTF-8 codepoints directly"? Is the requirement that you get a pointer to the start of each, or that you receive a stream of u32 whose content is the UTF-8 bytes, or something else? Basically I don't really understand how a UTF-8-to-UTF-32 converter is not trivially modifiable to do the required thing.
I mean there's the whole issue of quadrupling your storage space... plus you have to iterate twice... plus you'd have to write four bytes for every code point, even if most of the code points only require one byte...
You don't have to iterate twice, you don't have to use any more storage than a constant number of bytes chosen to fit in L1 alongside everything else, and any system of "iterating over UTF-8 codepoints directly" has to spend multiple bytes on each codepoint anyway (for example, an index into a byte array occupies multiple bytes, and so does the minimum storage needed to hold an arbitrary codepoint). So each part of your comment is either incorrect or applies to all possible implementations that meet the requirements. You also haven't proposed any particular meaning for "iterating over UTF-8 codepoints directly," but this is probably for the best, since it is someone else's requirement.
Okay, so, from my current perspective, everything in your comment is flat wrong. Assuming that it isn't, you and I must be talking about different things to have such divergent views.
Upthread was posted, "I want to iterate over UTF-8 codepoints directly. The initial test for a 1-byte sequence is just too simple with a branch and will be hit most of the time."
Compared to this, your proposals require iterating twice, in the context as I understand it: either (1) to convert to UCS-4 and (2) to process; or (1) to establish pointers to the starting UTF-8 code unit of each code point and (2) to process.
Additionally, if you convert to UCS-4/UTF-32, and most of the text is in ASCII, you will roughly quadruple the space required for this conversion. If you establish 32-bit pointers into the UTF-8 string for the starting code unit of each code point, you will also approximately quadruple the space required. Obviously, double that if you're using 64-bit pointers.
By contrast, iterating through a UTF-8 string directly has no such problems, at the expense of slightly more CPU time per iteration. Such iteration can take the form of a single decode-and-advance pass over the bytes.
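A minimal sketch of what I mean (validation omitted for brevity; one pass, no intermediate buffer, no index array):

    #include <stdint.h>

    /* Walk the string once, decoding in place; the only state is the
       current position. Assumes well-formed UTF-8. */
    void for_each_codepoint(const unsigned char *p, const unsigned char *end,
                            void (*fn)(uint32_t cp, void *ctx), void *ctx)
    {
        while (p < end) {
            uint32_t cp;
            unsigned char b0 = *p;
            if (b0 < 0x80)      { cp = b0; p += 1; }
            else if (b0 < 0xE0) { cp = (uint32_t)(b0 & 0x1F) << 6 | (p[1] & 0x3F); p += 2; }
            else if (b0 < 0xF0) { cp = (uint32_t)(b0 & 0x0F) << 12
                                     | (uint32_t)(p[1] & 0x3F) << 6 | (p[2] & 0x3F); p += 3; }
            else                { cp = (uint32_t)(b0 & 0x07) << 18 | (uint32_t)(p[1] & 0x3F) << 12
                                     | (uint32_t)(p[2] & 0x3F) << 6 | (p[3] & 0x3F); p += 4; }
            fn(cp, ctx);
        }
    }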
Instead of expanding all the data to UTF-32, you can expand in chunks of some fixed size, say 1024 characters: memory usage stays capped, the UTF-8 unpacking can be unrolled or vectorized however you like, and the code iterating over codepoints is trivially branchless (it can even be SIMD itself). The annoying downside is potentially some wasted computation if the iteration exits early, but even then you could do something like doubling the chunk size on each chunk, starting with 8 characters or so.
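A rough sketch of that shape (chunk size and names are arbitrary; the scalar expansion loop is just a stand-in for whatever unrolled/SIMD unpacking you'd actually use, and validation is omitted):

    #include <stdint.h>
    #include <stddef.h>

    #define CHUNK 1024   /* arbitrary; small enough to stay in L1 */

    /* Stand-in for the expansion step: decode up to 'cap' codepoints into
       'out', return how many were written, advance *pp. Assumes well-formed
       UTF-8. */
    static size_t expand_chunk(const unsigned char **pp, const unsigned char *end,
                               uint32_t *out, size_t cap)
    {
        const unsigned char *p = *pp;
        size_t n = 0;
        while (p < end && n < cap) {
            unsigned char b0 = *p;
            if (b0 < 0x80)      { out[n++] = b0; p += 1; }
            else if (b0 < 0xE0) { out[n++] = (uint32_t)(b0 & 0x1F) << 6 | (p[1] & 0x3F); p += 2; }
            else if (b0 < 0xF0) { out[n++] = (uint32_t)(b0 & 0x0F) << 12
                                           | (uint32_t)(p[1] & 0x3F) << 6 | (p[2] & 0x3F); p += 3; }
            else                { out[n++] = (uint32_t)(b0 & 0x07) << 18 | (uint32_t)(p[1] & 0x3F) << 12
                                           | (uint32_t)(p[2] & 0x3F) << 6 | (p[3] & 0x3F); p += 4; }
        }
        *pp = p;
        return n;
    }

    /* Driver: expand into a fixed buffer, hand it to the consumer, repeat.
       Memory stays capped at CHUNK codepoints regardless of input size. */
    void process_chunked(const unsigned char *p, const unsigned char *end,
                         void (*consume)(const uint32_t *cps, size_t n))
    {
        uint32_t buf[CHUNK];
        size_t n;
        while ((n = expand_chunk(&p, end, buf, CHUNK)) > 0)
            consume(buf, n);
    }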
The code is careful not to read past the end of the buffer, but it doesn't explicitly check that there are enough bytes available for the current multibyte character. However, this "end of buffer in middle of character" error is caught later by the check for valid continuation bytes. I thought that was quite neat.
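To sketch the idea (my own illustration, not the article's actual code): reads near the end of the buffer get clamped to a harmless value, and that value can never pass the continuation-byte check, so a sequence cut off by the buffer end is rejected like any other malformed input.

    /* Sketch of one way this pattern can be arranged: clamp out-of-range
       reads to 0, which never has the 10xxxxxx continuation prefix, so the
       normal validity check catches the truncated sequence. */
    static inline unsigned char byte_or_zero(const unsigned char *p,
                                             const unsigned char *end, int i)
    {
        return (p + i < end) ? p[i] : 0;
    }

    static inline int is_continuation(unsigned char b)
    {
        return (b & 0xC0) == 0x80;
    }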