Robustly to what? The registrar doesn't and shouldn't have to know every possible consumer of its data, so looking at it and saying "that looks like code" is probably way, way more foolproof than any other solution (assuming that someone does actually look at each one).
It’s astonishing that handling and/or storing strings correctly is so hard, people actually suggest it’s somehow better to “just” stop such strings at administrative level.
I find it harmful to assume that some externally-sourced data will match any arbitrary format (e.g. contain only allowed characters), even if it’s really supposed to be so. (Inverse for outputs - one has to conform as strictly as they can.) Ignoring this leads to mental dismissal of validation and correct handling, and that’s how things start to crack at the seams. I have seen too many examples of “this can never be… oops”.
Add: The best one can safely assume when handling a string is that it’ll be composed of zero or more octets (because that’s what the OS/language would typically guarantee). Languages and frameworks usually provide a lot of tooling to ensure things are what they are expected to be. Ignoring the failure modes (even less probable ones, like a different Unicode collation than is conventional on a certain system) makes one sloppy, not practical.
Bad analogy. In the company name case, there’s a registry (list) with a gatekeeper (filter) in front of it rejecting very simple inputs (small strings) that don’t conform to their standards. You literally can’t get your company name on this list if you don’t pass muster. One might even say the list is “sanitized”.
You probably want to say "correctly handle arbitrary input" rather than "sanitize" inputs.
If everybody sanitizes their inputs (in undefined ways) then companies like the one mentioned would be randomly blocked from administrative processes.
This is not what we (as a society) want.
If Bobby Tables isn't a valid name the legislation should make it invalid, instead of rubber stamping it at the government registry and letting poor Bobby get random errors when making requests to various public bodies. ("Sorry, our school does not admit persons with semicolons in their names.")
> It’s astonishing that handling and/or storing strings correctly is so hard
Is it astonishing? "Don't sanitize your own strings; always use a library" is common advice for handling SQL and HTML, which implies to me that it is in fact pretty hard to do correctly.
Anything is hard if the bar is low enough. Basic language transformations with regular grammar (like escaping a string for use in an HTML document) are, IMHO, not particularly hard. The hardest part is actually recognizing what the language of your output is and whether there is a mismatch with the language of your string value.
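To make the "not particularly hard" part concrete, here is a minimal Python sketch of escaping for HTML text and quoted attributes. The function name `escape_html` and the sample company name are made up for illustration; in practice one would just call the standard library's `html.escape`, which this sketch is meant to agree with.

```python
import html

# The five characters that matter when a value is placed into HTML text
# or a quoted attribute. "&" must be replaced first so that the entities
# produced by the later replacements are not escaped a second time.
_HTML_REPLACEMENTS = [
    ("&", "&amp;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ('"', "&quot;"),
    ("'", "&#x27;"),
]


def escape_html(value: str) -> str:
    """Hypothetical hand-rolled equivalent of html.escape(value, quote=True)."""
    for raw, entity in _HTML_REPLACEMENTS:
        value = value.replace(raw, entity)
    return value


# Hypothetical company name, not the real one from the register.
name = '"><script src=https://example.invalid/x.js></script> LTD'
escaped = escape_html(name)

# The hand-rolled version and the standard library agree on this input.
same_as_stdlib = escaped == html.escape(name, quote=True)
```

The transformation itself is five replacements. The part that goes wrong in practice is the one named above: noticing that the output is HTML at all, and that an HTML attribute, a URL and a script block each need different rules.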
What's astonishing is the popularity of the way of thinking that producing the cheapest code possible that still works along the happy path (and simply doesn't fail too badly when it does) is considered not only a valid practice but even some business virtue that needs to be protected.
The more I think about it, the more I like the idea of EICAR-like records like this SCRIPT one - in the official database. It must be fully benign, of course (in a sense the script source should point to the same agency, and contain only a warning but no harmful code), and it must be well-known - effectively a test case for production systems. Rather than a pinky-swear "company names should be okay, don't worry" that allows neglect, it's a "hey, this is a special weird case - specially to make sure you're doing things right" friendly guidance.
The fact that so many people were impacted by left-pad leads me to believe that people aren't using libraries because a problem is pretty hard, but rather because they don't even want to think about the problem that a library supposedly addresses. It can also often be a way to hand off responsibility IMO.
I'm genuinely curious - where does this end? I once was curious about whether I should sanitize dynamodb inputs, and was surprised to see zero guidance for or against.
How about things like parsing strings for serializing to binary storage?
I think it's safe to put arbitrary data in DynamoDB (just use the proper API instead of concatenating it directly into a command string...) It's the systems interacting with it you have to be careful about. In general, there is no silver bullet beyond "understand your systems capabilities and limitations". Formal verification also comes to mind.
> Can everything be an injection attack?
What does this question even mean? I guess we must say "for any system accepting arbitrary input: yes". Not even sure if the "arbitrary" qualifier is necessary.
It never does, because abstractly speaking, there is no such thing as a secure computing system. This goes double for any computer that is switched on.
Practically speaking, it depends on how critical your application might be. If you're storing values for neurosurgery or automated dispersal of life-saving (or potentially life-ending) medication, you'd better be sanitizing on the way in, validating on the way out, and have some additional layers like audits and comparisons to known good values at rest. Look into defense in depth, and never trust the computer to make a decision, because the computer cannot be held accountable.
If you're storing quiz results for someone's favourite colour, or it's not internet connected, you can probably be a bit less paranoid about it.
> Can everything be an injection attack?
But yeah, anything and everything could be an injection attack if the attacker is determined enough. It's just a matter of how difficult you want to make it for them.
Is %q a JSON-compatible format? I have no idea without reading some source code! Almost certainly it won't \u-encode weird characters. That might be OK, I think the only stuff you really have to escape in JSON strings is newlines, backslashes, and double quotes? And %q probably handles those. Maybe it breaks on ASCII control characters...
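For reference, RFC 8259 requires escaping the double quote, the backslash, and every control character from U+0000 through U+001F, so the "newlines, backslashes, and double quotes" set is not quite enough. The sketch below is Python rather than Go and says nothing about what %q actually emits; `naive_quote` is a hypothetical helper that escapes only that smaller set, compared against `json.dumps`.

```python
import json


def naive_quote(value: str) -> str:
    """Hypothetical quoting that escapes only backslash, quote and newline."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


# Contains a tab and U+0001, both of which JSON requires to be escaped.
tricky = 'tab\there "quoted" \x01'

# The library output parses back to the original value.
library_round_trips = json.loads(json.dumps(tricky)) == tricky

# The naive output leaves raw control characters inside the string literal,
# which Python's default (strict) JSON parser rejects.
try:
    json.loads(naive_quote(tricky))
    naive_parses = True
except json.JSONDecodeError:
    naive_parses = False
```

So the guess about ASCII control characters is the right thing to worry about for any hand-rolled or borrowed quoting format.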
But yeah, we are meant to always use a library because we have deadlines and we are willing to compromise a whole lot of quality to deliver on them.
Both cases are the result of library/runtime/env designer not thinking about the crowd. If csv.esc(s) and json(x) were available right away, without imports even, you wouldn’t have to decide whether it’s fine. Fmt should just have %j.
Specifically json and unjson I make globally available in all my projects. If I used csv more often than once in a decade, I’d have csvesc(s) too.
Sometimes you read some stdlib reference and wonder what they were thinking with things like System.out.println and without one-line one-arg readtext(), tojson(), fetch() and so on. It’s like a kitchen with all appliances still in boxes and all utensils in a tight vacuum cover. Everything is there, but preparation friction makes it absolutely unusable.
I don't think the problem we are talking about is lazy programmers or the availability of libraries.
People think hard things should be easy and with less "friction". If I want to output a string why should I have to know what the difference between stdout and stderr is? If I write CSV to a file why do I need to know the difference between CRLF and LF, and UTF-8 and UTF-16 or what a BOM is? At the end of all of this you end up with a company named 'W""oopWoop;' crashing the banking industry.
So no, you should know all of that, and more or get the fuck out of my industry.
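For the CSV case specifically, a minimal Python sketch of letting the `csv` module do the quoting for that company name. The semicolon delimiter and CRLF terminator are assumptions picked to make the example awkward; the two-column layout is made up.

```python
import csv
import io

company = 'W""oopWoop;'  # the name from the comment above
rows = [["id", "name"], ["1", company]]

# newline="" stops Python from translating line endings, so the csv module
# controls them. For a real file this would be something like
# open(path, "w", newline="", encoding="utf-8").
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, delimiter=";", lineterminator="\r\n")
writer.writerows(rows)
text = buffer.getvalue()

# Reading the text back with the same dialect recovers the exact value,
# embedded quotes and delimiter included.
parsed = list(csv.reader(io.StringIO(text, newline=""), delimiter=";"))
round_trips = parsed == rows
```

The writer quotes the field and doubles the embedded quote characters, so the trailing semicolon is not read as a column separator. Whether the consumer on the other end expects a BOM or UTF-16 is still something someone has to know and decide.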
For me it is. I feel the friction and how it disrupts the parallel flow of multiple lines of thought on the code, cause you have to stop and implement a stupid method. Also have seen this many times in less experienced or less patient programmers, who inlined lots of code that should have been a library and cut corners in there due to time, mental and other pressures. Providing them a set of tools they could paste (poor platform) into a globally loaded module improved their jobs a lot.
I think the high horse here is a bad point cause it simply claims it must be hard for no good reason. It’s not even complexity-wise hard, you just have to (metaphorically) unpack your instruments every time you use them. That’s bs at all experience levels and it must be obvious to anyone who works in a shop. Ime, the problem isn’t knowledge, but inconvenience.
It's not hard to do correctly. If you employ people to write SQL who can't tell the difference between string concatenation and parameterised queries, then your bar is too low. This can be learned in under an hour[0], and is the most fundamental thing to bear in mind when writing a query.
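A minimal sketch of that difference, using Python's built-in `sqlite3` with an in-memory database. The table name and the sample value are made up; other drivers use different placeholder styles than `?`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT)")

name = "O'Brien & Sons"  # hypothetical value with a perfectly ordinary apostrophe

# Parameterised: the statement and the value are passed separately, so the
# value is never parsed as SQL.
conn.execute("INSERT INTO companies (name) VALUES (?)", (name,))
stored = conn.execute("SELECT name FROM companies").fetchone()[0]

# Concatenated: the value becomes part of the SQL text, and the apostrophe
# ends the string literal early.
unsafe_sql = "INSERT INTO companies (name) VALUES ('" + name + "')"
try:
    conn.execute(unsafe_sql)
    concatenation_failed = False
except sqlite3.Error:
    concatenation_failed = True
```

Here the concatenated version merely breaks on a syntax error. With a value crafted by an attacker, the same mechanism is what turns into an injection.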
Are we still passing SQL statements and data to the SQL back end as single string instead of passing them separately? Why would you even need to escape SQL data in 2024?
One example that I found is that some libraries/databases don't allow DDL statements to be parameterised - so if you are managing tables and columns from code and those names came from end users then you should be checking them.
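One way to do that checking, sketched in Python with `sqlite3`. The allow-list pattern (ASCII letters, digits, underscore, limited length) is an assumed policy, and `checked_identifier` and `add_text_column` are hypothetical helpers, not part of any library.

```python
import re
import sqlite3

# Assumed policy: ASCII letters, digits and underscores, not starting with
# a digit, at most 63 characters.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,62}")


def checked_identifier(name: str) -> str:
    """Return the name double-quoted for SQL, or raise if it is not allowed."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsupported identifier: {name!r}")
    return f'"{name}"'


def add_text_column(conn: sqlite3.Connection, table: str, column: str) -> None:
    # Identifiers cannot be bound as "?" parameters, so they are validated
    # first and only then interpolated into the statement.
    conn.execute(
        f"ALTER TABLE {checked_identifier(table)} "
        f"ADD COLUMN {checked_identifier(column)} TEXT"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT)")
add_text_column(conn, "companies", "registered_office")
columns = [row[1] for row in conn.execute("PRAGMA table_info(companies)")]
```

Rejecting anything outside a narrow pattern is cruder than quoting arbitrary identifiers correctly, but it is easy to reason about. Values going into those columns would still be passed as parameters.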
Every consumer of its data should be sanitizing its inputs before rendering them wherever they are using it. HTML, SQL, etc. Banning "computer code" as judged by a random bureaucrat from being inserted into the database is not a solution at all, much less a foolproof one.
The absolute best case scenario here is that the bureaucrats successfully block all possible actually-malicious injection attacks but the vulnerable consumers still get broken occasionally by a random apostrophe that gets thrown in.
> Every consumer of its data should be sanitizing its inputs before rendering them wherever they are using it.
This is not how the real world runs though. In the real world (outside the bubble of programmers) things are messy and a lot of stuff barely works, many people are incompetent etc.
Said otherwise, it's defense in depth.
"Should" doesn't factor in. You can't make everyone competent at the wave of a magic wand. But you can control what company names are allowed. You can't control how they will be parsed. There is one law about company names, but a myriad systems that may parse them.
It always barely works as much as you allow it to. Lower the bar even more and it will start barely working at it again.
This koolaid with protecting real world only helps perception (“I made it work now with this simple rule”), cause moving the bar down relaxes issues a bit and they don’t instantly accumulate at the new level.
It doesn’t matter where the bar is, they will always find enough competence and budget to follow it in a moment. You just have to hard-break what half-works in advance.
> You can't make everyone competent at the wave of a magic wand
You can make their incompetence fail by adding random honeypots like someone suggested above. That would be a smart move. Your “out of bubble” move is just an instant gratification button.
Whenever I see a python-requests user-agent I sometimes keep the connection open indefinitely without responding, to see if the developer was incompetent and forgot to set a timeout. Responding to other certain clients with 'Location: file:///dev/urandom' is also mildly entertaining.
My point would be, I'm not sure if this wouldn't be too damaging to the mental health of programmers if everyone was doing shit like that.
On balance, blocking such names makes sense. You can secure YOUR systems, and if that were the end of it I would agree, but unless you are going to pay to audit all consumers of the data worldwide, this solution is more pragmatic. I am not sure what we gain by letting company names have code.
You are right but best to do that on day 1, which was probably in the 1970s or whenever a database of company names first existed. In the case of HTML script exploits maybe the 1990s.
So you have a transitioning issue. If you suddenly allow a company name that loads a script from a domain they control, then it is too dangerous.
Test data like you mentioned is a great idea to increase resilience. However I don't think that raises the overall ecosystem of consumers of this data to the right level to release actual exploits into the dataset.
Downvoters are probably thinking like purists. They are thinking "everyone in the world should make their systems 100% secure against common exploits and let a company name be an arbitrary string".
The problem is that is not realistic.
It works at a corporate level but not across all actors who interact with this dataset and the global internet. You can "should" at them all you like but no one has control over this.
The government can choose: more exploits in the wild or fewer. Allowing script URLs they don't control in company names is the former.
For the register of companies in England & Wales, day 1 would have been the 5th of September, 1844.
I think we can forgive the young William Gladstone (who was President of the Board of Trade at the time) for not fully anticipating how difficult robust string handling would turn out to be!
So you're right, this could only ever be approached as a transitioning issue.
That doesn't test things in a useful way, and relies on having an official dataset lie. Good ingestion code should ignore those, and then you're not even testing the frontend of those systems.
By disallowing, we normalise deviance (security wise).
Also, there can be a problem with who decides what is code, and how.
There is already a myriad of programming languages, and for trolling or legal attack purposes, one could build an interpreter using arbitrary words as keywords (to make problems for an arbitrary company).
IMO, this is like making human names illegal because people with certain accents or native languages may struggle to pronounce them.
Our government officials are so stupid it's astounding. This doesn't make anybody safer, but there's now another minor charge after somebody has broken the law.
The issue isn’t the government systems executing it. Countless other systems use and trust these sources. And sure, the registry isn’t technically liable, but it’s good not to break your downstream consumers when possible.
> “A company was registered using characters that could have presented a security risk to a small number of our customers, if published on unprotected external websites.”
I'm confused why everybody keeps talking about sanitization when all you have to do is escape a string properly whenever you inject it verbatim into a language, be it HTML or SQL or whatever.
Because they have not understood the core issue. It's impossible to store / sanitize data correctly, when this is absolutely context / output dependent.
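A small Python illustration of why this is output-dependent: the same raw value needs four different encodings for four destinations, and none of them could have been applied once at storage time. The sample string is made up; every function used is from the standard library.

```python
import html
import json
import shlex
from urllib.parse import quote, unquote

raw = 'Tom & "Jerry" <ltd>; rm -rf'  # hypothetical stored value, kept verbatim

as_html = html.escape(raw, quote=True)  # for HTML text or a quoted attribute
as_json = json.dumps(raw)               # for a JSON document
as_url = quote(raw, safe="")            # for a URL path or query component
as_shell = shlex.quote(raw)             # for a POSIX shell command line

# Four contexts, four different encodings of the same value.
all_different = len({as_html, as_json, as_url, as_shell}) == 4

# Each encoding decodes back to the same raw value in its own context.
decoded = [
    html.unescape(as_html),
    json.loads(as_json),
    unquote(as_url),
    shlex.split(as_shell)[0],
]
all_round_trip = all(value == raw for value in decoded)
```

Storing the raw value and encoding at the point of output keeps every destination correct. "Sanitizing" once on the way in has to guess which of these contexts the data will eventually land in.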
I liked perl's taint mode. It seemed pretty good against the "oops, forgot to sanitise this and you used it as output" situation that probably accounts for a lot of these issues. It won't force you to correctly sanitise, but assuming you have that capability it lets you know about gaps so you can plug them.
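For anyone who hasn't used it, the idea can be approximated in Python with marker types. This is only a rough sketch under stated assumptions, not how Perl implements it: `Tainted`, `SafeHtml`, `escape_for_html` and `render_paragraph` are hypothetical names, and unlike Perl the marker is lost as soon as the value is concatenated or formatted, because those operations return a plain `str`.

```python
import html


class Tainted(str):
    """A string that came from outside and has not been escaped yet."""


class SafeHtml(str):
    """A string that has already been escaped for HTML output."""


def escape_for_html(value: str) -> SafeHtml:
    return SafeHtml(html.escape(value, quote=True))


def render_paragraph(fragment: str) -> str:
    # Refuse anything still marked as tainted instead of silently emitting it.
    if isinstance(fragment, Tainted):
        raise ValueError("tainted value reached HTML output without escaping")
    return f"<p>{fragment}</p>"


company = Tainted("<script>alert(1)</script> LTD")  # hypothetical input

# Forgetting to escape is reported as an error rather than rendered.
try:
    render_paragraph(company)
    gap_detected = False
except ValueError:
    gap_detected = True

# Escaping first produces a value the output function accepts.
rendered = render_paragraph(escape_for_html(company))
```

As with taint mode, this does not check that the escaping is right for the context. It only makes the "forgot to do anything at all" case loud.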