
You don't have to iterate twice, and you don't have to use more storage than a constant number of bytes chosen to fit in L1 alongside plenty of other data. Any system of "iterating over UTF-8 codepoints directly" has to spend multiple bytes on each codepoint anyway (for example, an index into a byte array occupies multiple bytes, and the minimum storage needed to hold an arbitrary codepoint is also multiple bytes). So each part of your comment is either incorrect or applies to all possible implementations that meet the requirements. You also haven't proposed any particular meaning for "iterating over UTF-8 codepoints directly," but that is probably for the best, since it is someone else's requirement.



Okay, so, from my current perspective, everything in your comment is flat wrong. Assuming that it isn't, you and I must be talking about different things to have such divergent views.

Upthread was posted, "I want to iterate over UTF-8 codepoints directly. The initial test for a 1-byte sequence is just too simple with a branch and will be hit most of the time."

Compared to this, as I understand the context, your proposals require iterating twice: either (1) converting to UCS-4 and then (2) processing; or (1) establishing pointers to the starting UTF-8 code unit of each code point and then (2) processing.

Additionally, if you convert to UCS-4/UTF-32 and most of the text is ASCII, you will roughly quadruple the space required for the conversion. If you instead store 32-bit pointers into the UTF-8 string for the starting code unit of each code point, you will also roughly quadruple the space required; double that again if you're using 64-bit pointers.

By contrast, iterating through a UTF-8 string directly has no such problems, at the expense of slightly more CPU time for each iteration. Such iteration can take the form of:

    // Extremely hot codepath. Handling well-formed UTF-8 by macro --> ~4% speedup.
    // (byte >> 6) == 2 marks a UTF-8 continuation byte (10xxxxxx), so next(R)
    // returns the first of the next four positions that is not a continuation
    // byte. Note R is evaluated several times, so pass a plain pointer.
    #define next(R) ((R[1]>>6!=2) ? (R+1) : (R[2]>>6!=2) ? (R+2) : \
                     (R[3]>>6!=2) ? (R+3) : (R[4]>>6!=2) ? (R+4) : next_func(R+4))

    // Slow path for ill-formed input: skip continuation bytes until a NUL
    // or a lead byte is found.
    static const unsigned char *next_func(const unsigned char *rex) {
        do { rex++; } while (*rex && rex[0]>>6 == 2); return rex;
    }
Are you talking about something else? Perhaps I have misunderstood something somewhere.


Instead of expanding all the data to UTF-32, you can expand it in chunks of some fixed size, say 1024 characters. Memory usage is then capped, the UTF-8 unpacking can be unrolled or SIMD'd, and the loop iterating over codepoints is trivially branchless (it can even be SIMD itself). The annoying downside is potentially some wasted computation if the iteration is early-exited, but even then you could double the chunk size on each chunk, starting with 8 chars or something.




