
You don't have to iterate twice, and you don't have to use more storage than a constant number of bytes chosen to fit in L1 alongside plenty of other data. Any system of "iterating over UTF-8 codepoints directly" has to spend multiple bytes on each codepoint anyway (for example, an index into a byte array occupies multiple bytes, and the minimum storage needed to hold an arbitrary codepoint is also multiple bytes). So each part of your comment is either incorrect or applies to all possible implementations that meet the requirements. You also haven't proposed any particular meaning for "iterating over UTF-8 codepoints directly," but that is probably for the best, since it is someone else's requirement.



Okay, so, from my current perspective, everything in your comment is flat wrong. Assuming that it isn't, you and I must be talking about different things to have such divergent views.

Upthread was posted, "I want to iterate over UTF-8 codepoints directly. The initial test for a 1-byte sequence is just too simple with a branch and will be hit most of the time."

Compared to this, as I understand the context, your proposals require iterating twice: either (1) converting to UCS-4 and then (2) processing; or (1) establishing pointers to the starting UTF-8 code unit of each code point and then (2) processing.

Additionally, if you convert to UCS-4/UTF-32 and most of the text is ASCII, you will roughly quadruple the space required for the conversion. If you instead store 32-bit pointers into the UTF-8 string for the starting code unit of each code point, you will also roughly quadruple the space required; double that again if you're using 64-bit pointers.

By contrast, iterating through a UTF-8 string directly has no such problems, at the expense of slightly more CPU time for each iteration. Such iteration can take the form of:

    // Extremely hot codepath. Handling well-formed UTF-8 by macro --> ~4% speedup.
    // (byte >> 6) == 2 marks a UTF-8 continuation byte (10xxxxxx), so next(R)
    // returns the first of the next four positions that is not a continuation
    // byte. Note R is evaluated several times, so pass a plain pointer.
    #define next(R) ((R[1]>>6!=2) ? (R+1) : (R[2]>>6!=2) ? (R+2) : \
                     (R[3]>>6!=2) ? (R+3) : (R[4]>>6!=2) ? (R+4) : next_func(R+4))

    // Slow path for ill-formed input: skip continuation bytes until a NUL
    // or a lead byte is found.
    static const unsigned char *next_func(const unsigned char *rex) {
        do { rex++; } while (*rex && rex[0]>>6 == 2); return rex;
    }
Are you talking about something else? Perhaps I have misunderstood something somewhere.


Instead of expanding all the data to UTF-32, you can expand it in chunks of some fixed size, say 1024 characters. Memory usage is then capped, the UTF-8 unpacking can be unrolled or SIMD'd, and the loop iterating over codepoints is trivially branchless (it can even be SIMD itself). The annoying downside is potentially some wasted computation if the iteration is early-exited, but even then you could double the chunk size on each chunk, starting with 8 chars or something.




