This seems odd. I mean, if their code was properly modular, they would have just one place where they "fetchUserIdByName(userName)", which returns one user ID or null if it's not used yet.
When a new user is created, it then gets assigned a unique user ID. The email address is assigned to that user ID.
Then, if they do a password reset on user = "bigbird", it should do the exact same lookup to find the email address.
The security bug was not really about having an improper function to do unicode translation. It was more about having different functions for the same check, simply because they were in different parts of the code.
Modular code is just so much better on all fronts, including security.
Based on their description of the bug, it sounded like the code was modular, but they called the function twice: once when the password reset request was generated, and again when the link in the email was clicked.
> However, when the link was used, canonical_username was once again applied
So after they sent the password reset link, they called "fetchUserIdByName" again, but they passed in a username that had already been canonicalized once. Because of this bug, I wonder if password resets worked at all for users with unicode characters in their names.
You should read the article. It prominently features a very interesting description of precisely why their `canonicalise` function turned out to not be idempotent, even though it was meant to be.
That's exactly what they describe as the cause of the bug. They intended for the function to be idempotent but it wasn't because of a misunderstanding with the Python library spec.
Worse, it /did/ work that way in Python 2.4 but Python 2.5 stopped throwing an exception for invalid codepoints which broke the Twisted library which broke their canonicalization function.
This is what I thought at first. If it's the exact same check (which it should be), why is there any possibility of the answer being different? But the real "bug" here is that they were treating the canonicalized username as if it were just as good as the original username, which is only true if the adjustment function is idempotent as they say.
Another possible solution would be to assume the username given to the password reset form is already canonicalized (which would be necessarily true, as far as I understand).
EDIT: However this wouldn't solve the other bug that's been discovered here, which is that "ᴮᴵᴳᴮᴵᴿᴰ" is canonicalized differently than "BIGBIRD", thus defeating the purpose of canonicalization (for that particular case) in the first place.
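To make the failure mode concrete, here is a minimal sketch using only the Python standard library. `naive_canonical` is a hypothetical stand-in that I made up for illustration; it is not Spotify's code and not `nodeprep.prepare()`. It only reproduces the same symptom: the first pass and the second pass disagree.

```python
import unicodedata


def naive_canonical(name: str) -> str:
    """Hypothetical canonicaliser (an assumption, not the article's code):
    lower-case first, then apply NFKC compatibility normalization."""
    return unicodedata.normalize("NFKC", name.lower())


# "BIGBIRD" written with modifier capital letters (U+1D2E, U+1D35, U+1D33, ...).
FANCY = "\u1d2e\u1d35\u1d33\u1d2e\u1d35\u1d3f\u1d30"

# First pass: lower() has no mapping for the modifier letters, so only NFKC
# acts on them and turns them into plain ASCII capitals.
once = naive_canonical(FANCY)

# Second pass: lower() now sees ASCII capitals and folds them.
twice = naive_canonical(once)

print(repr(once), repr(twice))
print("idempotent for this input:", once == twice)
```

The order of the two steps is what breaks it here. Normalizing first and case-folding afterwards would give the same answer on both passes for this input, though that alone does not prove idempotence for every input.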
What you're talking about has nothing to do with modularity. You're thinking of DRY. They have nothing to do with each other.
DRY tells us the code for looking up a user id by user name should be written in only one place, not that the code should be called from only one place.
The mistake was assuming the name->name function was idempotent, because it wasn't.
You are right to suggest using a name->id function instead. It would not suffer the same problem because the canonical name should not be stored... it's an implementation detail!
> They have nothing to do with each other.
> DRY tells us the code for looking up a user id by user name should be written in only one place, not that the code should be called from only one place.
If you write 'canonicalize(username)' in eight different places, you are not being DRY. If you need to write 'canonicalize(username)' in eight different places, your code probably doesn't separate responsibilities properly across separate modules.
As such, they have a lot to do with each other. After all, if you call code from more than one place, you are writing the calling code in more than one place. Lack of DRYness is about the fact that you are doing so. Lack of modularity is about why you need to do so.
Sounds like they are using the username as the key in their DBs, which sounds like the ultimate source of pain:
> Could the method for computing canonical usernames based on nodeprep.prepare() be salvaged? If not we would be in trouble since we use canonical usernames in various databases so that changing how to derive them in a non-backwards compatible way would be quite costly.
Well generally it makes sense to use that userid as a primary key everywhere rather than the username. You only need the mapping of username to userid in one place. Their current scheme also makes it sound hard to change your username.
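A rough sketch of that layout, with everything in it being my own assumption (the `UserDirectory` class, the in-memory dicts, and the `canonical` helper are all hypothetical). The canonical name lives in exactly one mapping, and the password reset token is bound to the user id, so the name is never canonicalised a second time.

```python
import secrets
import unicodedata
from typing import Dict, Optional


def canonical(name: str) -> str:
    # Assumed canonicalisation, for illustration only.
    return unicodedata.normalize("NFKC", name).casefold()


class UserDirectory:
    """Hypothetical in-memory store keyed by user id, not by username."""

    def __init__(self) -> None:
        self._id_by_name: Dict[str, int] = {}   # the only place names are mapped
        self._email_by_id: Dict[int, str] = {}
        self._id_by_token: Dict[str, int] = {}
        self._next_id = 1

    def fetch_user_id_by_name(self, name: str) -> Optional[int]:
        return self._id_by_name.get(canonical(name))

    def create_user(self, name: str, email: str) -> Optional[int]:
        if self.fetch_user_id_by_name(name) is not None:
            return None  # name (or a variant of it) is already taken
        user_id = self._next_id
        self._next_id += 1
        self._id_by_name[canonical(name)] = user_id
        self._email_by_id[user_id] = email
        return user_id

    def start_password_reset(self, name: str) -> Optional[str]:
        user_id = self.fetch_user_id_by_name(name)
        if user_id is None:
            return None
        token = secrets.token_urlsafe(16)
        self._id_by_token[token] = user_id  # the token carries the id, not the name
        return token

    def finish_password_reset(self, token: str) -> Optional[int]:
        # Single use: the token is removed as soon as it is redeemed.
        return self._id_by_token.pop(token, None)
```

With this shape, renaming a user only touches `_id_by_name`, and clicking the reset link never needs the username at all.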
Not a good excuse. There are plenty of enormous projects that provide only one basic interface to a bit of information. See every operating system API for examples.
And {Global,Heap,Local}{,re}{Alloc,Free,Lock,Unlock}. 4 sets of APIs just to allocate memory. Although Local* and Global* are mapped to Heap for a while now and (IIRC) (un)lock functions don't do anything (but they used to). This BC is still required today.
I don't see any real reason to rely on idempotence.
They could simply store two names: One is provided by the user (verbatim), and the second is its reduction to lowercase letters and digits (canonical). For all internal logic, they could use only the canonical name, and use the verbatim name in the front-end to make the user happy.
> Lower casing has the key property of being idempotent, i.e., that applying it more than once has no effect: x.lower() == x.lower().lower(). So if a username gets passed from service to service and you want to make sure it is in canonical form you can safely apply .lower() and if it was already in canonical form there is no harm done, and it is easy to stay safe.
Apparently, they thought that it's ok to use verbatim and canonical names interchangeably, relying on the idempotence property of the XMPP function.
Presumably you mean in the database? I don't see a reason to keep a copy of the lower() transformation of a string when it is incredibly cheap to transform a small string to lowercase.
What exactly is the point of that? I would just call lower() as needed, personally.
Oh, my point was that the two kinds of names should not be used interchangeably. Whether or not to store the .lower() result is a matter of taste. (Personally, I would store both names, just to avoid wasting computing time.)
I think the act of storing both names is bad, because you multiply the amount of data that could possibly become wrong by 2.
With lower(), we can expect we'll get the right transformation of string A each time. If instead, we store string A, and then store string B as A.lower() and copy it... A.lower() will always be A.lower(), but it's much easier for someone to come along, screw with the database, and change B.
Well, in that case you're still storing it, you're just letting the db store it for you.
But - when the issue here is the question of the reliability of the implementation of the canonicalisation function, having it done once in python, and then again by PG is going to be a huge issue.
Yeah, I see. I'm not a web developer, so maybe this is why I did not think about this way to break the data.
Well, I still think that it is better to have two (hopefully) correct fields in the database, rather than only one. (Consistency of the two fields can be checked once in a while).
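For what it's worth, the periodic consistency check could look something like the sketch below. The table layout, the column names, and the `canonical` helper are assumptions of mine, and I use an in-memory SQLite database only to keep the example self-contained.

```python
import sqlite3
import unicodedata
from typing import List


def canonical(name: str) -> str:
    # Assumed canonicalisation, for illustration only.
    return unicodedata.normalize("NFKC", name).casefold()


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE users ("
        " id INTEGER PRIMARY KEY,"
        " display_name TEXT NOT NULL,"            # verbatim, shown in the UI
        " canonical_name TEXT NOT NULL UNIQUE)"   # used for all lookups
    )
    return db


def add_user(db: sqlite3.Connection, display_name: str) -> int:
    cur = db.execute(
        "INSERT INTO users (display_name, canonical_name) VALUES (?, ?)",
        (display_name, canonical(display_name)),
    )
    return cur.lastrowid


def inconsistent_rows(db: sqlite3.Connection) -> List[int]:
    """Return ids whose stored canonical name no longer matches a fresh
    canonicalisation of the stored display name."""
    rows = db.execute(
        "SELECT id, display_name, canonical_name FROM users"
    ).fetchall()
    return [uid for uid, shown, stored in rows if canonical(shown) != stored]
```

The UNIQUE constraint rejects a second sign-up whose canonical form collides, and `inconsistent_rows` flags rows that someone edited by hand. Note that the canonical form is computed in one place only (Python), not once in Python and again in the database.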
Yes, I agree that it sounds great. But that does not mean you really have to use this property. It was not necessary for the task, and that is the point.
Well, in general, modular code should not make assumptions about other parts of the program (when possible).
You know, even if the function is idempotent by definition, that does not mean its implementation is. Unicode is changing too; new symbols are added.
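One way to catch this is to test the implementation rather than trust the definition. Below is a small sketch of such a check. The helper names are mine, and `lower_then_nfkc` is a deliberately flawed, made-up example rather than anything from the article.

```python
import unicodedata
from typing import Callable, Iterable, Iterator, List


def idempotence_failures(
    f: Callable[[str], str], samples: Iterable[str]
) -> List[str]:
    """Return every sample s for which f(f(s)) differs from f(s)."""
    failures = []
    for s in samples:
        once = f(s)
        if f(once) != once:
            failures.append(s)
    return failures


def all_code_points() -> Iterator[str]:
    """Yield every code point as a one-character string, skipping surrogates.
    Feeding this to idempotence_failures gives a brute-force check that can be
    re-run whenever the Unicode tables behind the implementation change."""
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue
        yield chr(cp)


def lower_then_nfkc(s: str) -> str:
    # Hypothetical, intentionally flawed canonicaliser.
    return unicodedata.normalize("NFKC", s.lower())


SAMPLES = ["BigBird", "bigbird", "\u1d2e\u1d35\u1d33\u1d2e\u1d35\u1d3f\u1d30"]

print(idempotence_failures(str.lower, SAMPLES))
print(idempotence_failures(lower_then_nfkc, SAMPLES))
```

Single code points will not catch every problem (some only show up in combinations), so this is a smoke test, not a proof.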
Very interesting post -- dealing with unicode gets tricky. Because dealing with a full character repertoire is tricky, not because unicode does it poorly (unicode actually does it well).
Or maybe the twisted algorithm really is UTR#30, just not labelled that?
As far as I can tell, UTR#30 did not make it to formally being part of the unicode spec, for reasons I'm not entirely clear on -- it is nonetheless quite useful, and this case is an example. Solr for instance still uses it. (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#...).
It might be a pain to find code implementing UTR#30 in your language of choice though (I am not sure if it's part of current ICU libraries or not).
It's also worth pointing out, that in addition to this kind of 'folding' of different-but-look-the-same graphemes, in this sort of use case you ABSOLUTELY need to do byte normalization as per UAX#15 http://unicode.org/reports/tr15/ . Probably NFKC for this sort of use case.
Huh, and another alternative would be using one of the unicode collation algorithms for normalization -- which, unlike UTR#30, did make it beyond draft stage to be an official part of the unicode spec.
> So if a username gets passed from service to service and you want to make sure it is in canonical form you can safely apply .lower() and if it was already in canonical form there is no harm done, and it is easy to stay safe.
Why?
Suppose you only pass the original name around instead. Then you don't require your canonicalization function to be idempotent, which might be good in your case since it wasn't.
Because they wanted "BigBird" to be the same as "bigbird" or "BIGBIRD". Hence if you wanted to rely on a userId, you would still need to ensure all the variations of BiGbirD map to the same userId.
"Unicode encodes the symbol as U+2126 Ω ohm sign, distinct from Greek omega among letterlike symbols, but it is only included for backwards compatibility and the Greek uppercase omega character U+03A9 Ω greek capital letter omega (HTML: Ω Ω) is preferred."
And from the Unicode Standards doc that is the source for that section:
"Greek Letters as Symbols: The use of Greek letters for mathematical variables and operators is well established. Characters from the Greek block may be used for these symbols.
For compatibility purposes, a few Greek letters are separately encoded as symbols in other character blocks. Examples include U+00B5 µ in the Latin-1 Supplement character block and U+2126 Ω in the Letterlike Symbols character block. The ohm sign is canonically equivalent to the capital omega, and normalization would remove any distinction. Its use is therefore discouraged in favor of capital omega. The same equivalence does not exist between micro sign and mu, and use of either character as micro sign is common; for Greek text, only the mu should be used."
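The difference between the two cases in that passage is easy to see from Python's `unicodedata` module. This is just a small check of the quoted claim; the variable names are mine.

```python
import unicodedata

OHM = "\u2126"    # OHM SIGN
OMEGA = "\u03a9"  # GREEK CAPITAL LETTER OMEGA
MICRO = "\u00b5"  # MICRO SIGN
MU = "\u03bc"     # GREEK SMALL LETTER MU

# Canonical equivalence: every normalization form, including NFC,
# replaces the ohm sign with capital omega.
ohm_nfc = unicodedata.normalize("NFC", OHM)

# The micro sign is only *compatibility* equivalent to mu, so NFC leaves
# it alone and only the K forms (NFKC/NFKD) map it to mu.
micro_nfc = unicodedata.normalize("NFC", MICRO)
micro_nfkc = unicodedata.normalize("NFKC", MICRO)

print(ohm_nfc == OMEGA, micro_nfc == MICRO, micro_nfkc == MU)
```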
If that were the case, we'd need separate codepoints for every letter used as a unit of measurement, from A for ampere onwards. In fact, it's just there for legacy reasons, as pointed out elsewhere: the convention for units of measurement is to use normal letters, regardless of whether those are Latin or Greek.
The parent was asking a rhetorical question, and drawing an analogy between the two: omega and ohm have different meanings, therefore it might be useful to be able to distinguish between the two. (Note that I say 'might'; when it comes to a character set, I'm not sure I 100% agree ...)
Because they have rather different meanings. Capital omega (Ω, U+03A9) is a Greek letter, with the lower-case form ω; ohm (Ω, U+2126) is a symbol used in electrical engineering with a related symbol "mho" (℧, U+2127).
FYI, all units of physical quantities named after scientists use a capitalized letter for their abbreviation[1]. So the ohm (named after Georg Ohm) is abbreviated as the uppercase omega. No idea why they are different unicode values, since they do not have different meanings.
Note that omega was probably used so that the 'O' wouldn't be confused with '0', e.g. 4O would be confusing, but 4Ω is not.
[1]The tesla is abbreviated 'T', joule is 'J', etc. etc.
Nothing new at all; even my stockbroker had a similar kind of bug. There should be only one function which handles usernames. When the same thing is implemented differently all over the source, this is exactly what happens.
I have also seen much worse solutions, where simply giving a username logs you in (sets a logged-in session cookie) and only then prompts for the password. When you enter an invalid password, you're logged out. But if you give the username and then change the URL, you're in. Business as usual. When you test it, it works: username + right password = ok, username + wrong password != ok. Tests passed, and that's it.
I'd like to point out that even though not every website accepts unicode characters on registration, some software such as vBulletin lets the administrator or user replace the username with such characters once the user is signed up.
You may not create as much havoc as in the original post, but some level of confusion at least.
How about "building" and "buiIding"? or "nodata" and "nodata"? You probably don't want people running around claiming to be admins because their name looks the same.
Similarly, do you really want to do tech support when someone forgets that their username is "nodata" instead of "Nodata", since their phone auto-capitalized things when they signed up? It happens. And how do you know they're "Nodata" and not "nOdata" or "nOdAtA"?
I'm confused about why they even had the need for canonicalising the usernames for whatever purpose? Why couldn't they just have stored and used them just as they are?
If they did as you suggest, the specific account-stealing flaw they had wouldn't happen, but since many unicode points have very similar glyphs, there would still be "copycat" accounts. That is, the strings "Oscar" and "Οscar" appear very similar (if one has the proper fonts installed), and one user could therefore pose as another.
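Note that ordinary normalization and case folding do not merge lookalikes such as Latin "O" and Greek "Ο", so copycats need a separate check. Here is a very rough heuristic sketch. Taking the first word of the Unicode character name as the "script" is my own shortcut, not a real script property lookup, and a production system would want proper confusable detection.

```python
import unicodedata
from typing import Set


def fold(name: str) -> str:
    # Assumed canonicalisation, for illustration only.
    return unicodedata.normalize("NFKC", name).casefold()


def rough_scripts(text: str) -> Set[str]:
    """Approximate the scripts in text by the first word of each letter's
    Unicode name, e.g. 'LATIN' or 'GREEK'. A heuristic, nothing more."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts


def looks_mixed_script(text: str) -> bool:
    return len(rough_scripts(text)) > 1


latin_oscar = "Oscar"
greek_oscar = "\u039fscar"  # starts with GREEK CAPITAL LETTER OMICRON

print(fold(latin_oscar) == fold(greek_oscar))
print(rough_scripts(greek_oscar), looks_mixed_script(greek_oscar))
```

A site could reject or flag names for which `looks_mixed_script` is true. This does nothing against a name written entirely in one lookalike script.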
It's true that many sites don't care about this, but I don't fault Spotify for trying to prevent it.
The letter À (A grave) can be written as the UTF-8 bytestream 0xC3 0x80 (i.e. a single "character"), or as À - i.e. a letter A, then a combining grave character i.e. 0x41 0xCC 0x80.
The two are identical. Except they have different byte representations. If you don't normalize your unicode you will run into major problems.
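A quick way to see this in Python, as a minimal sketch (the `same_text` helper is just a name I picked):

```python
import unicodedata

precomposed = "\u00c0"   # LATIN CAPITAL LETTER A WITH GRAVE, one code point
decomposed = "A\u0300"   # 'A' followed by COMBINING GRAVE ACCENT

# Different code points and different UTF-8 bytes...
print(precomposed.encode("utf-8"), decomposed.encode("utf-8"))
print(precomposed == decomposed)


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# ...but equal once both sides are normalized to the same form.
print(same_text(precomposed, decomposed))
```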
There are actually two kinds of normalization in play here.
The one you are talking about is the one unicode actually calls normalization, and is dealt with in UAX#15 (formerly UTR#15). http://unicode.org/reports/tr15/
You are absolutely right that, in almost any situation taking unicode input where you're ever going to need to compare strings (and in most where you're ever going to need to display them), you are going to need to apply one of the UAX#15 normalization forms. UAX#15 normalizes different byte representations of what are, in ALL circumstances, identical characters/graphemes. A lot of people don't take account of this.
Then there's the kind of canonicalization that OP talks about, which Unicode actually calls 'folding', and is about characters/graphemes which really ARE different characters but which, for _some_ but not all contexts may be treated as 'equivalent' (if not necessarily identical). The simplest example is case insensitivity, but there are other trickier ones in the complete repertoire, like those discussed in the OP.
This second kind of 'folding' canonicalization is a lot trickier, because it is contextual, not absolute. Which is maybe why Unicode started out trying to make an algorithm for it in UTR#30 but then abandoned it. Nonetheless, despite its trickiness and contextuality, you often still really do need to do it, as in OP.
They gave one reason in the post. They wanted usernames to be case insensitive, so that if there's a user named BigBird, somebody else can't sign up as bigbird. Case insensitive usernames are also helpful to minimize support issues when somebody forgets the exact case they used when they created their account.
I don't think you're insensitive, but on a site which wants to foster users who care about their profile and persona on the site, freedom to display a username how they like could be a valuable feature.
Yes, because lowercase ASCII does not meaningfully capture how names are written in most of the world. If you want to cater to a global audience, it is simply not sufficient.
As they say in the article, they are trying to serve a global audience. As is easily googlable, Spotify was developed in the Baltic; the developers' own names likely contain non-ASCII characters.
Interesting, I felt foolish about an hour after having written this post, realizing that in most scenarios, it really shouldn't ever be any amount of effort minus half a second of planning and felt foolish for writing off unicode. (Normally I go the other route, unicode all the things).
I suppose I feel that I was probably right to call myself arrogant then, to find out what you noted about Spotify's creators/creation. Interesting, thanks for the perspective check
I don't think you were all wrong - there's a difference between username and name. My name has a space in it, but they won't allow me to represent that in the username. And if my name is already used by someone else as their username, I can't use it as mine. This doesn't bother me, so long as my name is displayed properly in the UI (if at all).
Because then there can be multiple accounts that are variations of BigBird, all annoying for users to tell apart, such as Bigbird, bigBird, and BigBIrd. Combine this with the fact that you have unicode characters, and it becomes very easy to 'spoof' another user by simply changing the letter 'o' to a unicode circle, or by various other tricks.
So is there a need for a TRULY idempotent equivalent of XMPP's nodeprep? Or one that handles more Unicode points? Or is it a calculated decision to support Unicode 3.2 points only? (Sorry for the nooby questions, but this was very interesting and I don't know a lot about Unicode)
The decision to only support Unicode 3.2 is simply because the StringPrep framework [1] (which XMPP's nodeprep and various other protocols use) is forever tied to that version of Unicode.
Current work is on the PRECIS framework [2] which uses the metadata for Unicode code points to determine how to handle them during canonicalization instead of relying on a hard coded set of mapping tables. There's still a lot of work to be done, mainly to review that the process works reliably and doesn't introduce subtle new issues. Peter Saint-Andre (one of the authors of PRECIS) has just started on a Python tool for testing how a given version of Unicode is handled by PRECIS (https://github.com/stpeter/PrecisMaker).
Great information, even cooler to see someone's got some code working alongside it, I might have to adapt it to Go if it's fairly reasonable to understand given my relative lack of experience with Unicode (heh as I'd mentioned and is probably obvious given your knowledge on hand).
Correct. And it would seem to me that such a function should probably be in its own separate library, or perhaps the standard library. It seems like a fairly basic need.
Reminds me of the early GMail bug, where you could sign up with "john.smith", "jo.hn.smi.th" or "johnsmith", but all mail was delivered to the period-less version:
The link agrees - it's a feature not a bug and one that considered security:
> As it turns out, even though Gmail will act as if there is no period in a username when delivering mail, it will not permit users to register accounts whose only difference to other account is a period. That is, if bob.jones is registered, bobjones cannot get an account (he could get bobjones2, obviously).
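The behaviour described in that quote amounts to using the same key for delivery and for registration. A toy sketch of the idea follows; the names are mine, and this is certainly not how Gmail is actually implemented.

```python
from typing import Set


def mailbox_key(local_part: str) -> str:
    """Hypothetical canonical key: drop periods and lower-case."""
    return local_part.replace(".", "").lower()


class Registry:
    """Toy registry that refuses names differing only by periods or case."""

    def __init__(self) -> None:
        self._taken: Set[str] = set()

    def register(self, local_part: str) -> bool:
        key = mailbox_key(local_part)
        if key in self._taken:
            return False
        self._taken.add(key)
        return True


registry = Registry()
print(registry.register("bob.jones"))   # first registration of this key
print(registry.register("bobjones"))    # same key, so it is refused
print(registry.register("bobjones2"))   # different key
```

Because the same `mailbox_key` is applied at sign-up and at delivery, there is no window where two different accounts map to one mailbox.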