The problem with the answer is that it is wrong. The question is about identifying start-tags in XHTML. This is a question of tokenization and can be solved with a regular expression. Indeed, most parsers use regular expressions for the tokenization stage. It is exactly the right tool for the job!
Furthermore, the asker specifically needs to distinguish between start tags and self-closing start tags. This is a token-level difference which is typically not exposed by XHTML parsers. So saying "use a parser" is less than helpful.
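To make that concrete, here is a minimal sketch in Python (the tag grammar is deliberately simplified, and the pattern is my own illustration, not anything from the original question):

```python
import re

# Deliberately simplified sketch of an XHTML start-tag token: a tag
# name, zero or more attributes with double-quoted values, and an
# optional "/" just before ">" marking a self-closing tag.
START_TAG = re.compile(
    r'<(?P<name>[A-Za-z][\w.:-]*)'         # tag name
    r'(?:\s+[\w.:-]+\s*=\s*"[^"]*")*'      # attributes
    r'\s*(?P<selfclosing>/?)>'             # optional self-close marker
)

for source in ['<p>', '<br/>', '<input type="text" />']:
    m = START_TAG.match(source)
    kind = "self-closing tag" if m.group("selfclosing") else "start tag"
    print(m.group("name"), "->", kind)
```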
If you were looking for the reason why a regex cannot parse HTML, it is because HTML has matching nested tags, and regular expressions are equivalent to finite state machines (FSMs).
What this means is that a regex parser is like a goldfish. It only knows about the state it is currently in (what it just read) and which possible states it may transition to (what is legally allowed to come next). The fish never remembers where it was before; there is no option to have the legality of a transition depend on what it read _before_ the current state. But this memory is a requirement for recursively matching opening tags to their closing tags - you need to keep a stack of opening tags somewhere in order to then cross off their closing tags in reverse order. So regex cannot parse HTML.
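A quick illustration of that failure mode in Python (the pattern is deliberately naive, which is the point):

```python
import re

# A regex has no memory of how many tags it has opened, so a naive
# pattern pairs the OUTER <div> with the FIRST </div> it reaches.
html = "<div><div>inner</div></div>"
print(re.search(r"<div>.*?</div>", html).group(0))
# -> <div><div>inner</div>   (wrong pairing; the greedy variant
#    would instead overshoot across sibling divs)
```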
>What this means is that a regex parser is like a goldfish.
Like a drunk goldfish, or a sober goldfish?
'A method to study short-term memory (STM) in the goldfish.'
'Twenty-one common goldfish (13-15.5 cm long) were randomly divided into alcohol (A) and nonalcohol (NA) groups and were trained in an alcohol solution of 400 mg/100 ml or in water, respectively. All alcohol fish were placed in an alcohol solution of 400 mg/100 ml for 3 hr before training in the same alcohol concentration. Fish were trained on a position discrimination task for 2 consecutive days. The door used for training was that opposite to each fish's spontaneous preference. Savings in relearning on Day 2 was taken as a measure of long term memory strength. Only fish which reached criterion on both days were immediately given 10 forced reversal trials in the opposite direction (i.e., a fish trained on right door was forced to choose the left door.) A and NA subjects were then tested after a 5 min (STM) delay, respectively, in a free choice situation for 10 trials (i.e., neither door was blocked). The results suggest that alcohol facilitates the STM of the forced reversal information.'
No, you don't need more than a regular expression. If you want to extract elements, i.e. match start tags to the corresponding end tags, then you need a stack-based parser. But just to extract the start tags (which is the question) a regular expression is sufficient.
The original question is a question about tokenization, not parsing, which is why a regular expression is sufficient.
Perhaps something like "([^"]*)" could skip what is inside the string literal. But if there is "<input" inside the string literal, then where you start parsing becomes very important.
That pattern would indeed match a quoted string. I don't see how it would matter if the quoted string contains something like "<input". It can contain anything except a quote character.
Sure. But that would be true for any parsing technique. No parser known to man would be able to produce a valid parse if you start it in the middle of a quoted string!
theoretically I believe an end tag really requires a valid start tag.
anyway, you can probably answer any number of simple questions about a bit of HTML using regex, but as the code grows to handle more use cases there will come a time when the solution breaks down, and the code that was written to handle all the previous uses will need to be rewritten using something other than regex.
You have to distinguish between the different levels of parsing.
Regexes are appropriate for tokenization, which is the task of recognizing lexical units like start tags, end tags, comments and so on. The SO question is about selecting such tokens, so this can be solved with a regex.
If you have more complex use cases, like matching start tags to end tags, you might need a proper parser on top. But you still need tokenization as a stage in that parser! I don't see what you would gain by using something other than regexes for tokenization. I guess in some extreme cases a hand-written lexer could be more performant, but in the typical case a regex engine would probably be a lot faster than the alternatives, and certainly more maintainable.
I know it is possible to write a parser without a clear tokenization/parsing separation - but it is not clear to me this would be beneficial in any way.
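To make the separation concrete, here is a rough sketch in Python of the kind of regex-driven tokenizer that sits under many parsers (the token set and names are my own simplification; real XHTML lexing has many more cases):

```python
import re

# Illustrative token set only: comments, end tags, start tags
# (including self-closing ones), and character data.
TOKEN = re.compile(r"""
    (?P<COMMENT>   <!--.*?-->             )
  | (?P<END_TAG>   </[A-Za-z][\w.:-]*\s*> )
  | (?P<START_TAG> <[A-Za-z][^<>]*>       )
  | (?P<TEXT>      [^<]+                  )
""", re.VERBOSE | re.DOTALL)

def tokenize(xhtml):
    for m in TOKEN.finditer(xhtml):
        yield m.lastgroup, m.group(0)

for kind, text in tokenize('<p class="x">hi<br/></p><!-- done -->'):
    print(kind, repr(text))
```

A parser proper would then consume this token stream; the point is that the lexical level is regular, whatever sits on top.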
Had it not gotten extremely lucky, it would have a very negative score. Things intended to be funny on StackOverflow rarely go over well.
The answer was made in '09. StackOverflow was more accepting of some humour at the time. And it is making a valid point: regex isn't the right tool for this job.
More specifically, a stack lets you keep track of nesting. See an opening tag, push something onto a stack. See a closing tag, pop the stack. If the stack is empty at the end, the tags match.
Parsing XHTML in real life is of course much more complicated than this, but this is the basic idea.
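A bare-bones sketch of that idea in Python (the tag recognition is simplified, and the helper function is hypothetical):

```python
import re

def tags_balanced(xhtml):
    # Push on start tags, pop on end tags: names must match in
    # reverse order and the stack must be empty at the end.
    stack = []
    for m in re.finditer(r"<(/?)([A-Za-z][\w.:-]*)[^<>]*?(/?)>", xhtml):
        closing, name, selfclosing = m.groups()
        if selfclosing:              # <br/> neither pushes nor pops
            continue
        if closing:
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack

print(tags_balanced("<a><b></b></a>"))  # True
print(tags_balanced("<a><b></a></b>"))  # False
```

Note the stack itself is exactly the part a plain regex cannot supply; the regex here only does the tokenizing.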
I think there are a lot of knee-jerk answers because people see "XHTML" and "regex" in the same sentence and immediately think "not possible".
But the actual question is clearly not about matching start tags to end tags or building a DOM or anything like that - which indeed would require a stack. The question is about recognizing start and end tags. You can do that perfectly fine with regular expressions - indeed, many parsers use regular expressions to tokenize the input before parsing.
Furthermore, the question specifically needs to recognize the difference between start-tags and self-closing tags - a difference which is not exposed by most XHTML parsers, as far as I am aware.
Sorry, I misread. Indeed, actually tokenizing text is accomplished with regular expressions (although some parsers don’t need a tokenization pass, but details :).
If I were writing a limited parser, in answer to the narrow question being asked, I wouldn't be using regex at all. It's not suited to this particular problem. (For example it would get caught on things like <a attr="<b>"> which may well be valid input.)
So how would you tokenize without the use of regular expressions? What more appropriate technique would you use instead?
The example you provide is not XHTML, so it is not really relevant to the discussion. But in any case, a regular expression has no problem recognizing a quoted string.
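For illustration, a sketch in Python of a start-tag pattern whose alternation consumes quoted values before it ever looks for the closing '>' (the tag syntax is simplified and the pattern is my own):

```python
import re

# Quoted strings are tried first, so a '<' or '>' inside an attribute
# value can never terminate the tag. Simplified for clarity.
start_tag = re.compile(r"""
    <[A-Za-z][\w.:-]*      # tag name
    (?: "[^"]*"            # double-quoted attribute value
      | '[^']*'            # single-quoted attribute value
      | [^"'<>]            # any other character inside the tag
    )*>
""", re.VERBOSE)

print(start_tag.match('<a attr="<b>">').group(0))  # <a attr="<b>">
```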
> So how would you tokenize without the use of regular expressions?
Since this need doesn't appear to be an everyday one, with clearly defined targets, a simple hand-written lexer isn't hard to write, and will make fewer mistakes than a regex. Just use a scanning approach. As a bonus, you'll still be able to read it in 12 months' time.
Why would a hand-written lexer have fewer mistakes than a regular expression using an off-the-shelf regex engine? They would need to encode the same lexical grammar, so at that level there is the same amount of complexity.
Writing a lexer by hand is just trading ten lines of regex (a widely known declarative DSL) for hundreds of lines of custom code. I don't see how that would be more maintainable in the long run.
The author said at one point that it was written in a bit of a huff of frustration, and if I remember correctly, also after a pint or two. It's not the best answer by any means, but it is one of those small gems of the web, I think.
it's sad that today the StackOverflow majority shares your mindset, and an answer like that would drown in downvotes or be killed by moderation.
some people like to act super serious all the time, like they're playing a sitcom version of what they think adulthood is, in a quest to be the most boring person on earth, as if that's the goal of human interaction
The problem is it is funny and wrong. Apparently it has given a lot of people really confused ideas about what is and isn't possible with regular expressions.
If it had been funny and right I would not have a problem with it.
The job (the question asked) is about recognizing start and end tags in XHTML. These are lexical tokens, and therefore regular expressions are a perfectly fine tool for this. Indeed, many parsers use regular expressions to perform the lexical analysis (https://en.wikipedia.org/wiki/Lexical_analysis), aka tokenization.
Quoting from Wikipedia:
> The specification of a programming language often includes a set of rules, the lexical grammar, which defines the lexical syntax. The lexical syntax is usually a regular language, with the grammar rules consisting of regular expressions;
If you disagree that regular expressions are an appropriate tool for lexical analysis, can you articulate why? And what technique do you propose instead?
That is just a comment and can be easily scanned by a regex. The CDATA syntax does not have any particular meaning inside an XML comment if that is what you are suggesting.
And in any case, neither comments nor CDATA sections nest, so there is no problem handling them at the lexical analysis stage.
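For example, a sketch in Python (the patterns are illustrative; a real lexer would fold these into its full token alternation):

```python
import re

# Comments and CDATA sections don't nest, and each one runs to the
# first occurrence of a fixed terminator - a regular language.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
CDATA   = re.compile(r"<!\[CDATA\[.*?\]\]>", re.DOTALL)

# CDATA syntax inside a comment is just comment text...
print(COMMENT.match("<!-- <![CDATA[ still a comment ]]> -->").group(0))
# ...and comment syntax inside CDATA is just character data.
print(CDATA.match("<![CDATA[ <!-- still character data ]]>").group(0))
```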
As for "get a premade grammar" - what does that mean? Do you mean use a tool like Lex? AFAIK Lex uses regular expressions.
you're missing the point: you cannot nest them, but you can have that comment both around or within a CDATA, so that regex is starting to look quite complex to maintain - unless you want to have multiple stages of regex applied in sequence, at which point, congratulations, you're building a whole parser the hardest way
and no, I mean a tool like antlr or javacc, which builds an in-memory syntax tree you can query or observe.
this is different from querying the document as XML, since an empty node is equivalent to an empty element even though they differ in syntax; querying the syntax tree lets you tell them apart, because it produces a TAG_SLASH_CLOSE
yeah, and you can have a grammar with a single production using ALL : .* - you're missing the forest for the trees.
regexes are ok for text that needs no context, so they're a perfect match for the lexing°. once you need context you're back to wacky hacks, layers and loops of regexes, or a preparser that strips the hard part out of the content - basically you're rolling your own grammar, except manually, on the wrong toolchain, and on your own.
"can I build my raster engine to draw poligons on screen" "well yes but is going to cost a lot of time, be error prone and miss a lot of features from ready to use solutions"
°you can actually put token production in the lexer itself between expressions, so they can already handle more than plain regex even in lexing.
So it seems we agree that for the question the OP is actually asking a regex is the appropriate tool.
Great if you can recommend a tool which solves the task more easily. But saying it is "not possible" is unhelpful and confusing. It just gives lots of people really confused ideas about what is and isn't possible using regexes.
To use your own example: "Can I build my raster engine to draw polygons on screen?"
Not helpful: "No that is simply not logically possible due to the laws of geometry, and you are an idiot for even thinking such a thing is possible"
Helpful: "Yes, but it is a lot of work if you want to support anti-aliasing, smooth scaling and so on. Using a library like XYZ would probably save you a lot of work in the long run."