The job (the question asked) is about recognizing start and end tags in XHTML. These are lexical tokens, so regular expressions are a perfectly fine tool for the task. Indeed, many parsers use regular expressions to perform the lexical analysis (https://en.wikipedia.org/wiki/Lexical_analysis), aka tokenization.
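For illustration, here is a minimal sketch in Python (the tag-name pattern is deliberately simplified, and `>` inside quoted attribute values is not handled) of recognizing start, end, and empty-element tags with a single regex:

```python
import re

# Simplified XHTML tag tokenizer: real XML names allow more characters,
# and '>' inside quoted attribute values is not handled here.
TAG = re.compile(r"""
    <(?P<end>/)?                 # a leading '/' marks an end tag
    (?P<name>[A-Za-z_][\w.-]*)   # simplified tag name
    (?P<attrs>[^<>]*?)           # attributes, not validated here
    (?P<empty>/)?>               # a trailing '/' marks an empty element
""", re.VERBOSE)

for m in TAG.finditer('<p class="x">text</p><br/>'):
    kind = "END" if m.group("end") else ("EMPTY" if m.group("empty") else "START")
    print(kind, m.group("name"))   # START p / END p / EMPTY br
```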
Quoting from Wikipedia:
The specification of a programming language often includes a set of rules, the lexical grammar, which defines the lexical syntax. The lexical syntax is usually a regular language, with the grammar rules consisting of regular expressions;
If you disagree that regular expressions are an appropriate tool for lexical analysis, can you articulate why? And what technique do you propose instead?
That is just a comment, and a regex can easily scan it. The CDATA syntax does not have any particular meaning inside an XML comment, if that is what you are suggesting.
And in any case, neither comments nor CDATA sections nest, so there is no problem handling them at the lexical analysis stage.
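As a sketch of that point (Python; it relies on XML's actual rules that a comment ends at the first `-->` and a CDATA section at the first `]]>`), a plain left-to-right scan disambiguates the two even when one contains the other's delimiters:

```python
import re

# Left-to-right lexical scan: at each position, try each token pattern.
# Because the scan is sequential, a '<![CDATA[' appearing inside a comment
# is never mistaken for a real CDATA section, and vice versa.
TOKENS = [
    ("COMMENT", re.compile(r"<!--.*?-->", re.DOTALL)),          # no nesting in XML
    ("CDATA",   re.compile(r"<!\[CDATA\[.*?\]\]>", re.DOTALL)), # no nesting either
    ("TEXT",    re.compile(r"[^<]+|<")),                        # tag rules omitted
]

def lex(s):
    pos = 0
    while pos < len(s):
        for kind, pat in TOKENS:
            m = pat.match(s, pos)
            if m:
                yield kind, m.group()
                pos = m.end()
                break

for tok in lex('<!-- has <![CDATA[ inside --><![CDATA[ has <!-- inside ]]>'):
    print(tok)   # one COMMENT token, then one CDATA token
```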
As for "get a premade grammar" - what does that mean? Do you mean use a tool like Lex? AFAIK Lex uses regular expressions.
You're missing the point: you cannot nest them, but you can have that comment both around and within a CDATA section, so the regex is starting to look quite complex to maintain, unless you want multiple stages of regexes applied in sequence, at which point, congratulations, you're building a whole parser the hardest way.
And no, I mean a tool like ANTLR or JavaCC, which builds an in-memory syntax tree you can query or observe.
This is different from querying the document as XML, since an empty node is equivalent to an empty element in the data model, but they differ in syntax; querying the syntax tree lets you tell them apart, because the self-closing form produces a TAG_SLASH_CLOSE token.
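For what it's worth, the distinction can be sketched without ANTLR (Python below; the crude SLASH_CLOSE regex is just a stand-in echoing the lexer token named above):

```python
import re
import xml.dom.minidom as minidom

# In the XML data model, <a/> and <a></a> are the same empty element:
for src in ("<a/>", "<a></a>"):
    print(src, "->", minidom.parseString(src).documentElement.toxml())  # both <a/>

# At the syntax level they differ: only the first form ends in '/>'.
SLASH_CLOSE = re.compile(r"<[^<>]*/>")  # crude stand-in for a lexer token
for src in ("<a/>", "<a></a>"):
    print(src, "self-closing:", bool(SLASH_CLOSE.match(src)))  # True / False
```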
Yeah, and you can have a grammar with a single production using ALL : .* - you're missing the forest for the trees.
Regexes are OK for text that doesn't depend on context, so they are a perfect match for lexing°. Once you need context, you are back to wacky hacks: layers and loops of regexes, or a pre-parser that strips the hard parts out of the content. Basically, you are rolling your own grammar, except manually, on the wrong toolchain, and on your own.
"can I build my raster engine to draw poligons on screen" "well yes but is going to cost a lot of time, be error prone and miss a lot of features from ready to use solutions"
°You can actually put token-production actions in the lexer itself between expressions, so lexers can already handle more than plain regexes even at the lexing stage.
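To make the "context" point concrete, here is a sketch: a regex finds the tags easily enough, but checking that elements nest properly needs a stack, and that stack is exactly the part a parser adds on top of the lexer:

```python
import re

# The tags themselves are regular; proper nesting is not.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")

def well_nested(s):
    stack = []
    for end, name, empty in TAG.findall(s):
        if empty:                 # <br/> opens and closes itself
            continue
        if end:                   # </p> must match the most recent open tag
            if not stack or stack.pop() != name:
                return False
        else:                     # <p> is pushed until its close tag arrives
            stack.append(name)
    return not stack

print(well_nested("<a><b></b></a>"))   # True
print(well_nested("<a><b></a></b>"))   # False
```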
So it seems we agree that, for the question the OP is actually asking, a regex is the appropriate tool.
Great, if you can recommend a tool which solves the task more easily. But saying it is "not possible" is unhelpful and confusing. It just gives lots of people really confused ideas about what is and is not possible using regexes.
To use your own example: "Can I build my raster engine to draw polygons on screen?"
Not helpful: "No, that is simply not logically possible due to the laws of geometry, and you are an idiot for even thinking such a thing is possible."
Helpful: "Yeas, but it is a lot of work if you want to support anti-aliasing, smooth scaling and so on. Using a library like XYZ would probably save you a lot of work in the long run".