
It’s a famous post on StackOverflow, but I don’t find it particularly helpful.



I found it extremely helpful. The sheer emphasis of the reply made me very curious about why the idea of using regex on XML is so bonkers.

- I will never forget that regex can't parse XHTML

- the reason being, regex is insufficiently powerful

- when I first saw this post, I knew little about how regex works under the hood; it sent me down a wiki hole of FSMs, pushdown automata and Turing machines

- this misconception is apparently common enough to be madness-inducing to those that know better

- use a hecking xml parser instead

It almost reminds me of a Bill Nye sketch. Teaching through a bit of non-sequitur and absurdism.


The problem with the answer is that it is wrong. The question is about identifying start-tags in XHTML. That is a question of tokenization and can be solved with a regular expression. Indeed, most parsers use regular expressions for the tokenization stage. It is exactly the right tool for the job!

Furthermore, the asker specifically needs to distinguish between start tags and self-closing start tags. This is a token-level difference which is typically not exposed by XHTML parsers. So saying "use a parser" is less than helpful.

I have elaborated a bit in a blog post: https://www.cargocultcode.com/solving-the-zalgo-regex/
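
A rough sketch of what I mean, in Python (the pattern is simplified from the XML grammar; real tag names allow more characters, and malformed input is ignored), showing that telling a start-tag from a self-closing tag is a token-level match:

    import re

    # Simplified sketch of an XHTML start-tag token. Attribute values are
    # quoted, so a ">" or "/>" inside them is part of the value, not the tag.
    START_TAG = re.compile(
        r'<([A-Za-z_][\w.-]*)'            # tag name
        r'((?:\s+[\w.-]+\s*=\s*'          # attribute name =
        r'(?:"[^"]*"|\'[^\']*\'))*)'      # quoted attribute value
        r'\s*(/?)>'                       # a "/" here marks self-closing
    )

    for m in START_TAG.finditer('<p class="x"><br /><input value="/>"/>'):
        print(m.group(1), 'self-closing' if m.group(3) else 'start')
    # p start
    # br self-closing
    # input self-closing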


If you were looking for the reason why a regex cannot parse HTML, it is because HTML has matching nested tags and regex parsers are finite state machines (FSMs).

What this means is that a regex parser is like a goldfish. It only knows about the state it is currently in (what it just read) and which possible states it may transition to (what is legally allowed to come next). The fish never remembers where it was before; there is no way to make the legality of a transition depend on what it read _before_ the current state. But this memory is a requirement for recursively matching opening tags to their closing tags: you need to keep a stack of opening tags somewhere in order to then cross off their closing tags in reverse order. So regex cannot parse HTML.
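
You can watch the missing memory bite with a deliberately naive Python sketch (not something any real parser would use):

    import re

    # Naive attempt to grab a div's contents. The lazy ".*?" stops at the
    # FIRST closing tag, because the regex has no memory of how many
    # <div>s it has already opened.
    naive = re.compile(r'<div>(.*?)</div>', re.DOTALL)

    print(naive.search('<div>outer <div>inner</div> tail</div>').group(1))
    # outer <div>inner   <- wrong: the outer element is cut short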


>What this means is that a regex parser is like a goldfish.

Like a drunk goldfish, or a sober goldfish?

'A method to study short-term memory (STM) in the goldfish.'

'Twenty-one common goldfish (13-15.5 cm long) were randomly divided into alcohol (A) and nonalcohol (NA) groups and were trained in an alcohol solution of 400 mg/100 ml or in water, respectively. All alcohol fish were placed in an alcohol solution of 400 mg/100 ml for 3 hr before training in the same alcohol concentration. Fish were trained on a position discrimination task for 2 consecutive days. The door used for training was that opposite to each fish's spontaneous preference. Savings in relearning on Day 2 was taken as a measure of long term memory strength. Only fish which reached criterion on both days were immediately given 10 forced reversal trials in the opposite direction (i.e., a fish trained on right door was forced to choose the left door.) A and NA subjects were then tested after a 5 min (STM) delay, respectively, in a free choice situation for 10 trials (i.e., neither door was blocked). The results suggest that alcohol facilitates the STM of the forced reversal information.'

https://www.ncbi.nlm.nih.gov/pubmed/935220


The question is about identifying end-tags in XHTML. This is indeed possible with a regex.


<a href="</a>">try this</a>


That is not XHTML.


What about <a>this <!-- </a> --> </a> <!-- </a> -->?


Yes, you can tokenize this with a regular expression and extract the valid start and end tags.

If comments in XHTML could nest, you would have a problem. But this is not the case.
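
For instance, a simplified Python sketch that tokenizes the example above (attributes omitted for brevity):

    import re

    # Comments are tried first, so a "</a>" inside a comment is consumed
    # as comment content and never reported as an end tag.
    TOKEN = re.compile(
        r'(?P<comment><!--.*?-->)'
        r'|(?P<end></[\w.-]+\s*>)'
        r'|(?P<start><[\w.-]+\s*>)',
        re.DOTALL)

    for m in TOKEN.finditer('<a>this <!-- </a> --> </a> <!-- </a> -->'):
        if m.lastgroup != 'comment':
            print(m.lastgroup, m.group())
    # start <a>
    # end </a>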


> Yes you can tokenize this with a regular expression and extract the valid start and end tags.

So you need more than a regular expression, hence your premise is incorrect.


No, you don't need more than a regular expression. If you want to extract elements, i.e. match start tags to the corresponding end tags, then you need a stack-based parser. But just to extract the start tags (which is the question) a regular expression is sufficient.

The original question is a question about tokenization, not parsing, which is why a regular expression is sufficient.


<input value="how about this? />"/>


That is a valid XHTML tag (if I remember correctly) and can be matched perfectly fine by a regex.


Perhaps something like "([^"]*)" could skip what is inside the string literal. Unless there is "<input" in the string literal; then where you start parsing becomes very important.


That pattern would indeed match a quoted string. I don't see how it would matter if the quoted string contains something like "<input". It can contain anything except a quote character.
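
For example, a sketch that folds that quoted-string pattern into a tag token (Python; single-quoted values omitted for brevity). The "/>" inside the attribute value is consumed as part of the string rather than ending the tag:

    import re

    # A tag token: "<", runs of non-quote/non-bracket characters, with any
    # quoted strings swallowed whole along the way, then ">".
    TAG = re.compile(r'<[^<>"]*(?:"[^"]*"[^<>"]*)*>')

    print(TAG.findall('<input value="how about this? />"/> <input value="<input>">'))
    # ['<input value="how about this? />"/>', '<input value="<input>">']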


It just makes the starting offset of the regex input important.


Sure. But that would be true for any parsing technique. No parser known to man would be able to produce a valid parse if you start it in the middle of a quoted string!


theoretically I believe an end tag really requires a valid start tag.

anyway, you can probably answer any number of simple questions about a bit of HTML using regex, but as the code grows to handle more use cases there will come a time when the solution breaks down, and the code that was written to handle all the previous uses will need to be rewritten using something other than regex.


You have to distinguish between the different levels of parsing.

Regexes are appropriate for tokenization, which is the task of recognizing lexical units like start tags, end tags, comments and so on. The SO question is about selecting such tokens, so this can be solved with a regex.

If you have more complex use cases like matching start tags to end tags, you might need a proper parser on top. But you still need tokenization as a stage in that parser! I don't see what you would gain by using something other than regexes for tokenization. I guess in some extreme cases a hand-written lexer could be more performant, but in the typical case a regex engine would probably be a lot faster than the alternatives, and certainly more maintainable.

I know it is possible to write a parser without a clear tokenization/parsing separation - but it is not clear to me this would be beneficial in any way.
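
To sketch the layering I mean, in Python (simplified token patterns, nowhere near the full XML grammar): a regex lexer, with a small stack-based matcher on top of it.

    import re

    # Stage 1: regex lexer. Stage 2: a stack matches start tags to end tags.
    LEX = re.compile(
        r'(?P<end></(?P<ename>[\w.-]+)\s*>)'
        r'|(?P<start><(?P<sname>[\w.-]+)[^<>]*?(?P<selfclose>/?)>)')

    def well_formed(xhtml):
        stack = []
        for m in LEX.finditer(xhtml):
            if m.group('end'):
                if not stack or stack.pop() != m.group('ename'):
                    return False          # end tag without a matching start tag
            elif not m.group('selfclose'):
                stack.append(m.group('sname'))
        return not stack                  # every start tag was closed

    print(well_formed('<p><br/><b>x</b></p>'))  # True
    print(well_formed('<p><b>x</p></b>'))       # False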


An element requires a start and end tag, or a self-closing start tag.


Had it not gotten extremely lucky, it would have had a very negative score. Attempts at humour on StackOverflow rarely go over well.


The answer was made in '09. StackOverflow was more accepting of some humour at the time. And it is making a valid point: regex isn't the right tool for this job.


The question is about how to identify start and end-tags in XHTML. What would be an appropriate tool for that job?


A parser. Specifically, an XHTML parser.


How do you think an XHTML parser is written? In particular, how does an XHTML parser identify tokens like start and end tags?


With a pushdown automaton [1] or something like a linear bounded automaton [2].

[1] https://en.m.wikipedia.org/wiki/Pushdown_automaton

[2] https://en.m.wikipedia.org/wiki/Linear_bounded_automaton

More specifically, a stack lets you keep track of nesting. See an opening tag, push something onto a stack. See a closing tag, pop the stack. If the stack is empty at the end, the tags match.

Parsing XHTML in real life is of course much more complicated than this, but this is the basic idea.
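
A minimal Python sketch of that stack discipline, assuming the (kind, name) tokens come from a lexer stage:

    # Push on a start tag, pop on an end tag; an empty stack at the end
    # means the tags nest properly. This is the "pushdown" part.
    def tags_balanced(tokens):
        stack = []
        for kind, name in tokens:
            if kind == 'start':
                stack.append(name)
            elif kind == 'end':
                if not stack or stack.pop() != name:
                    return False
        return not stack

    print(tags_balanced([('start', 'a'), ('start', 'b'),
                         ('end', 'b'), ('end', 'a')]))    # True
    print(tags_balanced([('start', 'a'), ('end', 'b')]))  # False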


I think there are a lot of knee-jerk answers because people see "XHTML" and "regex" in the same sentence and immediately think "not possible".

But the actual question is clearly not about matching start tags to end tags or building a DOM or anything like that - which indeed would require a stack. The question is about recognizing start and end tags. You can do that perfectly fine with regular expressions - indeed many parsers use regular expressions to tokenize the input before parsing.

Furthermore, the question specifically needs to recognize the difference between start-tags and self-closing tags. A difference which is not exposed by most XHTML parsers, as far as I am aware.


Sorry, I misread. Indeed, actually tokenizing text is accomplished with regular expressions (although some parsers don’t need a tokenization pass, but details :).


It keeps track of state that a regular expression cannot?


You don't need to keep track of nesting state to match tokens like XHTML start or end tags.


If I were writing a limited parser, in answer to the narrow question being asked, I wouldn't be using regex at all. It's not suited to this particular problem. (For example it would get caught on things like <a attr="<b>"> which may well be valid input.)


So how would you tokenize without the use of regular expressions? What more appropriate technique would you use instead?

The example you provide is not XHTML, so it is not really relevant for the discussion. But in any case, a regular expression has no problem recognizing a quoted string.


> So how would you tokenize without the use of regular expressions?

Since this need doesn't appear to be an everyday one, with clearly defined targets, a simple hand-written lexer isn't hard to write, and will make fewer mistakes than a regex. Just use a scanning approach. As a bonus, you'll still be able to read it in 12 months' time.


Why would a hand-written lexer have fewer mistakes than a regular expression using an off-the-shelf regex engine? They would need to encode the same lexical grammar, so at that level there is the same amount of complexity.

Writing a lexer by hand is just trading ten lines of regex (a widely known declarative DSL) for hundreds of lines of custom code. I don't see how that would be more maintainable in the long run.


I don't think that answer was written with the intent of being particularly helpful, I think it was written with a different goal in mind.


It is helpful in that almost koan-like way though.


The author said at one point that it was written in a bit of a huff of frustration, and if I remember correctly, also after a pint or two. It's not the best answer by any means, but it is one of those small gems of the web, I think.


It is helpful. It may not provide the answer the OP wanted, but it provides the answer they needed.


it's sad that today the stack overflow majority shares your mindset and an answer like that would drown in downvotes or be killed by moderation.

some people like to act super serious all the time, like they're playing a sitcom version of what they think adulthood is, in a quest to be the most boring person on earth, as if that's the goal of human interaction


You can be funny and helpful at the same time.


which is exactly what that SO answer was


The problem is it is funny and wrong. Apparently it has given a lot of people really confused ideas about what is and is not possible with regular expressions.

If it had been funny and right I would not have a problem with it.


Stack Overflow is not a homework help group; the "Regex is the wrong tool for the job" answer is more correct than the answer with the working regex.


The job (the question asked) is about recognizing start and end tags in XHTML. These are lexical tokens, and therefore regular expressions are a perfectly fine tool for this. Indeed many parsers use regular expressions to perform the lexical analysis (https://en.wikipedia.org/wiki/Lexical_analysis), aka tokenization.

Quoting from wikipedia:

The specification of a programming language often includes a set of rules, the lexical grammar, which defines the lexical syntax. The lexical syntax is usually a regular language, with the grammar rules consisting of regular expressions;

If you disagree that regular expressions are an appropriate tool for lexical analysis, can you articulate why? And what technique do you propose instead?


because this <!-- <![CDATA[ -->

as for the solution, one can get a premade grammar and query the AST


That is just a comment and can be easily scanned by a regex. The CDATA syntax does not have any particular meaning inside an XML comment if that is what you are suggesting.

And in any case, neither comments nor CDATA sections nest, so there is no problem handling them at the lexical analysis stage.

As for "get a premade grammar" - what does that mean? Do you mean use a tool like Lex? AFAIK Lex uses regular expressions.


you're missing the point: you cannot nest them, but you can have that comment both around or within a CDATA section, so that regex is starting to look quite complex to maintain, unless you want to have multiple stages of regex applied in sequence, at which point congratulations, you're building a whole parser the hardest way

and no, I mean a tool like ANTLR or JavaCC, which builds an in-memory syntax tree you can query or observe.

https://github.com/antlr/grammars-v4/blob/72810b7c59bb481750...

this is different from querying the document as XML, since an empty node is equivalent to an empty element but they differ in syntax, so querying the syntax tree lets you tell them apart, because it produces a TAG_SLASH_CLOSE


I'm not familiar with ANTLR or JavaCC, but don't they use regular expressions for lexing?


yeah, and you can have a grammar with a single production using ALL : .* - you're missing the forest for the trees.

regexes are ok for text that doesn't depend on context, so they are a perfect match for the lexing°. once you need context you are back to whacky hacks, layers and loops of regexes, or a preparser that removes the hard part from the content - basically you're rolling your own grammar, except manually, on the wrong toolchain and on your own.

"can I build my raster engine to draw poligons on screen" "well yes but is going to cost a lot of time, be error prone and miss a lot of features from ready to use solutions"

°you can actually put token productions in the lexer itself between expressions, so they can already handle more than plain regex even in lexing.


So it seems we agree that for the question the OP is actually asking a regex is the appropriate tool.

Great if you can recommend a tool which can solve the task more easily. But saying it is "not possible" is unhelpful and confusing. It just gives lots of people really confused ideas about what is and is not possible using regexes.

To use your own example: "Can I build my raster engine to draw polygons on screen?"

Not helpful: "No that is simply not logically possible due to the laws of geometry, and you are an idiot for even thinking such a thing is possible"

Helpful: "Yeas, but it is a lot of work if you want to support anti-aliasing, smooth scaling and so on. Using a library like XYZ would probably save you a lot of work in the long run".



