More precisely, it is arbitrarily deeply nested clauses that cannot be parsed, on account of true REs being unable to either use recursion or keep count of how deep they are.
Can regex engines actually process arbitrarily deep nestings themselves though?
It seems to me it would start getting computationally expensive very quickly to match with those kinds of regexes, so is 'arbitrarily deep' a practical concern on today's hardware?
Yes, regex engines that are designed to do so run in linear time on the size of the regex (as well as linear time with respect to the length of the input).
It can process arbitrarily deep nestings? I'm not familiar with it, but I had a look and I can't see anything in there that would allow it to do that. Can you point to syntax that allows it?
In general, regexes that run in linear time are based on DFAs (as opposed to NFAs, which is what most use).
The Chomsky hierarchy of languages has DFAs at the bottom (aka regexes - least powerful, can't handle recursive structures), then a DFA coupled with an infinite stack for memory (aka LR parsers - can't handle context-sensitive grammars, also runs in very close to linear time), and finally a DFA coupled with an infinite tape (aka a Turing Machine - can do any sort of computation).
In each of these cases the DFA (regex) is essentially a computer program and all they are doing is allowing it to store stuff on various kinds of memory (none, stack, tape with arbitrary movements). An interesting corollary of this is any computer program can be expressed as a DFA.
The GP is presumably citing it as an example of a linear time regex, and not as something that can parse arbitrarily deep nestings. Note that "linear time regex engines" is a characterization that holds the size of the regex constant, such that "linear time" is with respect to the input only. The actual worst case time complexity of such regex engines is O(m * n), where m ~ len(regex) and n ~ len(input). So even with a linear time regex engine, if you bloat the size of the regex, you'll make searching slower. But, it will still be done linearly with respect to the size of the input.
> In general, regexes that run in linear time are based on DFAs (as opposed to NFAs, which is what most use).
No. Please stop spreading the misuse of these terms. DFAs and NFAs have very precise meanings. It's one thing to call things regexes that aren't actually regular, but DFAs and NFAs should retain their precise theoretical meaning. A linear time regex engine such as Rust's `regex` crate uses both DFAs and NFAs in its implementation.
The Perl and PCRE folks (along with Jeffrey Friedl) have bastardized this terminology. PCRE for example claims to provide a "DFA" regex engine in its API, but it's actually an NFA. The main implementation inside PCRE (and Perl) is backtracking oriented, and carries with it much more power than an NFA. If it didn't, then the NFA could be executed in linear time, because NFAs and DFAs have equivalent power.
> An interesting corollary of this is any computer program can be expressed as a DFA.
No they can't. You're confusing terminology quite a bit here.
There is an importance to theoretical considerations that goes beyond their immediate practical applications. Ultimately, it comes down to how well one understands what is going on.
Having said that, I think there are practical applications. Suppose you were unaware of this result and thought a true RE could validate your input. A more knowledgeable attacker might be able to craft an input that exploits the inherent limitations of your validator.
Also in practice, attempting to write a true RE to parse a language that allows restricted recursion rapidly becomes complicated. It is the wrong tool for the job.
Just considering the language of balanced parentheses, the only memory needed is a single integer to track depth. If we see a "(", increment. If we see a ")", decrement. If we never go negative and end up with zero, that's a balanced parentheses expression. (To be clear, the parser I've described is not a regex parser.)
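For illustration, a quick Python sketch of that counter (mine, simplified; nothing here is from the thread):

    def parens_balanced(s: str) -> bool:
        depth = 0                      # the single integer of "memory"
        for c in s:
            if c == '(':
                depth += 1
            elif c == ')':
                depth -= 1
                if depth < 0:          # a closer with no matching opener
                    return False
        return depth == 0              # zero at the end means balanced

    print(parens_balanced("(()())"))   # True
    print(parens_balanced("())("))     # False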
Adding the rest of the regex machinery makes this a bit more complex, but in general, if the computer can store the string, it can check whether it's a valid regular expression.
As you say, this is not a regex parser, but some readers may not know the theorem that the regular languages are exactly those which can be parsed by a Turing machine with constant memory, and a parenthesis count can require log n bits.
In fact, any Turing machine using o(log log n) space recognizes a regular language (so there is an equivalent machine using O(1) space).
As typically defined academically, yes. In the real world things might be more complex, but I think it would hold true for many implementations.
Academically, a regex is a string of the following (a sketch of a checker for this grammar follows the list):
- character literals
- "epsilon" characters (meaning empty string)
- "+" characters (which means either what we see before the + or what we see after the +, in real-world regex these are usually represented as | instead of + as well as used implicitly in constructs like [a-zA-Z])
- "*" characters (same meaning as in real-world regex)
- And parentheses to allow controlling the order of operations
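To make that concrete, here is a rough Python sketch (my own, and deliberately permissive about what counts as a literal) of a recursive-descent validity checker for exactly that textbook syntax. The point to notice is that the recursion, i.e. the call stack, supplies the unbounded memory that a true regular expression lacks:

    def is_valid_textbook_regex(s: str) -> bool:
        # Grammar: expr := term ('+' term)* ; term := factor+ ;
        #          factor := (literal | '(' expr ')') '*'*
        pos = 0

        def peek():
            return s[pos] if pos < len(s) else None

        def expr():
            nonlocal pos
            term()
            while peek() == '+':       # alternation
                pos += 1
                term()

        def term():
            nonlocal pos
            factor()                   # concatenation: one or more factors
            while peek() not in (None, '+', ')'):
                factor()

        def factor():
            nonlocal pos
            c = peek()
            if c is None or c in '+*)':
                raise ValueError("expected a literal or '('")
            if c == '(':
                pos += 1
                expr()
                if peek() != ')':
                    raise ValueError("unbalanced '('")
            pos += 1                   # consume the literal (or the ')')
            while peek() == '*':       # Kleene star
                pos += 1

        try:
            expr()
            return pos == len(s)
        except ValueError:
            return False

    print(is_valid_textbook_regex("(a+b)*c"))   # True
    print(is_valid_textbook_regex("(a+b*c"))    # False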
In other words, "if we build a state machine which is different from a regex, it will perform differently than a regex would." That's trivially true, but is it interesting beyond restating "there is no Silver Bullet"?
Technically we aren't discussing the regex itself here but the syntax for defining a regex.
You could use a different syntax for defining the same regex, and that syntax could have different properties. For instance you could use parentheses like Racket, where "(...)" and "[...]" have the same meaning but "(...]" is not well-formed syntax (if I remember my Racket correctly). With such a syntax for defining a regex, a counter would no longer suffice to decide if a string is a regex.
You could have a stack of (paren-type, integer) pairs. If an opening paren has a different type than the top pair on the stack (or the stack is empty), push a new pair with the integer set to one; if it has the same type, increment that pair's integer. A closing paren of the same type as the top pair decrements it (anything else is an error), and when the integer goes to zero, pop the pair. At the end of the expression, you have to have an empty stack.
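A simplified Python sketch of that check (mine; it pushes one entry per opener instead of run-length (type, count) pairs, which comes to the same thing):

    PAIRS = {')': '(', ']': '['}

    def brackets_balanced(s: str) -> bool:
        stack = []
        for c in s:
            if c in '([':
                stack.append(c)
            elif c in PAIRS:
                if not stack or stack.pop() != PAIRS[c]:
                    return False       # e.g. the "(...]" case, or a stray closer
        return not stack               # nothing left unclosed

    print(brackets_balanced("([()])"))   # True
    print(brackets_balanced("(...]"))    # False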
When I check brackets by hand I do actually get my hands out, and when I say open I uncurl a finger, and when I say close I furl a finger. "Open, open, open, close, open, close, close, close". Provided I have no fingers on display at the end, then the brackets are balanced. They might not be in the right place, but at least they are balanced. If I find myself starting with a close, then I close the lid and go and do something else.
I'm sure some sort of rather cheap algorithm falls out of the above. It won't guarantee correctness in what the balanced brackets actually contain, but they will at least be balanced.
This is true if you want to handle arbitrarily deep nested parens.
But any given string you are asked to validate to see if it is a regex will be a finite length, and contain a finite number of opening paren characters. So its maximum possible nesting depth is known.
And you can construct, fairly trivially, a regex that can validate paren nesting up to a fixed depth.
So, in practice, you could use a regular expression to validate the paren nesting of any given string - if you were allowed to prepare the regex based on the string length, or on running another regex over the string first.
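As a rough illustration of the fixed-depth trick (my own construction, shown with Python's re), you can build the pattern mechanically once you know a depth bound:

    import re

    def bounded_parens(depth: int) -> str:
        # A true (non-recursive) regex accepting balanced parens nested
        # at most `depth` levels deep.
        pattern = r"[^()]*"                              # depth 0: no parens
        for _ in range(depth):
            pattern = rf"[^()]*(?:\({pattern}\)[^()]*)*"
        return pattern

    rx = re.compile(rf"^{bounded_parens(3)}$")
    print(bool(rx.match("(a(b)(c(d)))")))   # True: nesting depth 3
    print(bool(rx.match("((((x))))")))      # False: nesting depth 4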
What I suspect you might not be able to validate is that in a regex, while [a-z] is valid, [z-a] is not.
If you allow generating a regular expression based on string length, you can validate any computable language with a regular expression. It's trivial to just run a solver on all strings of that length and OR the accepted ones together.
> What I suspect you might not be able to validate is that in a regex, while [a-z] is valid, [z-a] is not.
It's fairly trivial to validate things like a-z being valid in a character class while z-a isn't. There are only finitely many legal ranges. You can just list them all, like \[(a-a|a-b|a-c|...|z-z)\].
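A sketch of that enumeration in Python (hypothetical, and restricted to single lowercase ranges just to show the shape of it):

    import re, string

    lc = string.ascii_lowercase
    ranges = "|".join(f"{a}-{b}" for i, a in enumerate(lc) for b in lc[i:])
    rx = re.compile(rf"\[(?:{ranges})\]")   # \[(?:a-a|a-b|...|z-z)\]

    print(bool(rx.fullmatch("[a-z]")))   # True
    print(bool(rx.fullmatch("[z-a]")))   # False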
For a fixed input length all languages are decidable by regular expressions. Trivially, the union of all valid input strings of that length, like "(abc|def|...)".
The problem is that "regular expression" is equivocal.
Sometimes you hear "regular expression" used to mean the language of things accepted by some Finite State machine. Lots of folks on HN took a Theory of Computation class and learned about these in that class. In that meaning, yes, you are right.
But sometimes in professional conversations you hear "regular expression" used to mean the strings that can be matched using, say, Perl's regular expressions. I know I have heard it used that way often. Since Perl regular expressions can accept anything a Turing machine can accept, the answer to the question in this case is yes.
Very few people know that Perl's regex engine (since 5.10) supports recursion. And recursive matching cannot perform recursive captures, which is why at best it can only be used to validate a string, not parse it, making it not particularly useful in practice. For these reasons I would never assume that someone using the term "regular expression" inclusively of Perl would have in mind such capabilities. IME, such a loose definition simply implies things like zero-width assertions, backcaptures, etc.
Here's a proof-of-concept Perl 5.10+ JSON validator I came up with for a presentation to Perl engineers introducing PEGs and Lua's LPeg module. Of the 20-30 people in the room, I doubt anybody in the audience knew this was even possible with Perl.
Note: I was trying to fit it all on a single slide, so the definition for String doesn't handle escaped characters. There may be other deficiencies. I copy+pasted this from the PDF slide deck as I can't find the original Beamer source. Any broken spacing and Unicode substitutions probably aren't original.
Very underrated comment. It's a very good point and good food for thought, although I think it is clear that the question was meant to be about the usual textual representation using the simplest RE language (without extensions).
> (famously) there is no regex to detect balanced parentheses
God, I remember running into this wall with an in-company domain-specific language that used regex as its tokenizer/parser.
Adding support for nested structures required us to untangle the whole thing and rewrite the regex into explicit algorithms. Fortunately the regex was only a few pages long..
In short, I agree with you that regex cannot parse regular expressions completely. Would be happy to be proven wrong, though!
Grouping is very useful, but I think you can get pretty far in practical patterns with a limit of k-nesting for k=0..3 or so. That pattern language would still be regular.
Right, but the parent is thinking of the question "is the language of valid regular expressions itself regular?", which is a very different question, and the answer is of course no by an easy application of the pumping lemma. However, as shown in the SO answer, they "cheat" by using some features that don't align completely with the mathematical definitions.
That answer feels entirely wrong to me. As pointed out, it may be an XY answer: It presumes the person posing the question wants to validate regular expressions.
But with something so blatantly self-referential, it actually feels unlikely to me that what they want to do is validate regular expressions. My guess is that they are generally curious about whether Regexen (PCRE or strict regular expressions) are powerful enough to validate Regexen (again whether PCRE or strict).
XY answers are good for avoiding a lot of unnecessary yak-shaving/accidental complexity of a bad solution. But the conversation around whether we are talking about recognizing strict Regexen or PCRE Regexen, and in turn whether we are using strict Regexen or PCRE Regexen to recognize them is not accidental complexity or yak-shaving, it is intrinsic to understanding the nature of the problem and solution spaces.
I too find the answer humorous for the "enterprisey" reference, but I think it would be a very bad answer if we are judging it strictly on the basis of its value.
"We could do it in a easy two liners that make sense, but, and hear me out here, what if we instead built an entire 10k LOC system that would barely work half of the time, using dozens of man hours for it ? And then never re-use it nor train anyone about it so that it becomes a mythical black box after your contract expires ?"
"Well, I gave you my recommendation but hey at the end of the day you're the customer and I'm paid by the hour, so whatever you say boss"
I now consider myself very lucky to be successful enough to not have to deal with projects like that, but back when I wasn't, I paid for quite a few things I wanted with those sorts of stupid decisions.
My favorite part of it is the moderators' note at the end:
> Moderator's Note
> This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look - there are no problems with its content. Please do not flag it for our attention.
The big problem with the content is suggesting to use an XML parser to parse HTML. That might often work, and several XML parsers have some sort of HTML "mode", but in general, no, not really.
HTML documents are frequently invalid XML.
HTML documents lack an XML declaration, they don't close all tags, etc. They might even lack a root element. Someone mentioned this in a comment on SO, but it was (as far as I could see) not really addressed or answered.
HTML 5 has its own parsing rules, specified at [1].
However, I don't know of any implementation of these outside of browsers. I normally use Beautiful Soup [2] (for Python) or HTML Agility Pack [3] (for .NET), and while I don't think these implement the exact standard, they're easier than struggling with an XML parser for HTML "in the wild".
The question title is ambiguous but the question tags include both HTML and XHTML. Self-closing `<element />` is an XML-ism which in HTML is treated the same as `<element>` but it was still fashionable to use them in HTML when XHTML was at its height.
From that answer: "You can't parse [X]HTML with regex".
I've always wondered how that came about. How come very early in the development of the web no one important enough for people to pay attention to them said, "Hey... wait a second. If this thing becomes popular, people are going to really want to process web documents with their usual text file processing tools and techniques. We really oughta make this thing reasonably easy to process with grep and sed and awk and Perl and such"?
The problem isn't with the surface syntax of HTML. Anything else, like S-expressions or JSON or whatever would have the same problem. The problem is arbitrary nesting of elements. No language with nesting can be parsed with regex (modulo various copouts mentioned elsethread). But such nesting is useful and necessary for a document language like HTML. This goal is important; "we must be able to use awk" isn't.
In SO's hall of fame of regular members being fed up with a common yet stupid question always coming back up, it's up there alongside the thread about doing addition with jQuery.
The problem with the answer is it is wrong. The question is about identifying start-tags in XHTML. This is a question of tokenization and can be solved with a regular expression. Indeed, most parsers use regular expressions for the tokenization stage. It is exactly the right tool for the job!
Furthermore, the asker specifically needs to distinguish between start tags and self-closing start tags. This is a token-level difference which is typically not exposed by XHTML parsers. So saying "use a parser" is less than helpful.
If you were looking for the reason why a regex cannot parse HTML, it is because HTML has matching nested tags and regex parsers are finite state machines (FSM).
What this means is that a regex parser is like a goldfish. It only knows about the state it is currently in (what it just read) and which possible states it may transition to (what is legally allowed to come next). The fish never remembers where it was before; there is no option to have the legality of a transition depend on what it read _before_ the current state. But this memory function is a requirement to recursively match opening tags to their closing tags - you need to keep a stack of opening tags somewhere in order to then cross off their closing tags in reverse order. So regex cannot parse HTML.
>What this means is that a regex parser is like a goldfish.
Like a drunk goldfish, or a sober goldfish?
'A method to study short-term memory (STM) in the goldfish.'
'Twenty-one common goldfish (13-15.5 cm long) were randomly divided into alcohol (A) and nonalcohol (NA) groups and were trained in an alcohol solution of 400 mg/100 ml or in water, respectively. All alcohol fish were placed in an alcohol solution of 400 mg/100 ml for 3 hr before training in the same alcohol concentration. Fish were trained on a position discrimination task for 2 consecutive days. The door used for training was that opposite to each fish's spontaneous preference. Savings in relearning on Day 2 was taken as a measure of long term memory strength. Only fish which reached criterion on both days were immediately given 10 forced reversal trails in the opposite direction (i.e., a fish trained on right door was forced to choose the left door.) A and NA subjects were then tested after a 5 min (STM) delay, respectively, in a free choice situation for 10 trails (i.e., neither door was blocked). The results suggest that alcohol facilitates the STM of the forced reversal information.'
No, you don't need more than a regular expression. If you want to extract elements, i.e. match start tags to the corresponding end tags, then you need a stack-based parser. But just to extract the start tags (which is the question) a regular expression is sufficient.
The original question is a question about tokenization, not parsing, which is why a regular expression is sufficient.
Perhaps something like "([^"]*)" could skip what is inside the string literal. Unless there is "<input" in the string literal, then where you start parsing becomes very important.
That pattern would indeed match a quoted string. I don't see how it would matter if the quoted string contains something like "<input". It can contain anything except a quote character.
Sure. But that would be true for any parsing technique. No parser known to man would be able to produce a valid parse if you start it in the middle of a quoted string!
theoretically I believe an end tag really requires a valid start tag.
anyway, you can probably answer any number of simple questions about a bit of HTML using regex, but as the code grows to handle more use cases there will come a time when the solution breaks down and the code that was written to handle all the previous uses will need to be rewritten using something other than regex.
You have to distinguish between the different levels of parsing.
Regexes are appropriate for tokenization, which is the task of recognizing lexical units like start tags, end tags, comments and so on. The SO question is about selecting such tokens, so this can be solved with a regex.
If you have more complex use cases like matching start tags to end tags, you might need a proper parser on top. But you still need tokenization as a stage in that parser! I don't see what you would gain by using something other than regexes for tokenization? I guess in some extreme cases a hand written lexer could be more performant, but in the typical case a regex engine would probably be a lot faster than the alternatives and certainly more maintainable.
I know it is possible to write a parser without a clear tokenization/parsing separation - but it is not clear to me this would be beneficial in any way.
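As a sketch of what such regex-based tokenization can look like (my own example in Python, assuming reasonably well-formed XHTML with quoted attribute values), a single pattern can pick out start tags and tell self-closing ones apart, which is what the original question asks for:

    import re

    START_TAG = re.compile(r"""
        <\s*([A-Za-z][\w:.-]*)               # 1: tag name
        ((?:\s+[\w:.-]+\s*=\s*"[^"]*")*)     # 2: attributes, quoted values
        \s*(/?)>                             # 3: "/" present if self-closing
    """, re.VERBOSE)

    sample = '<p class="x"><img src="a.png" alt="<b>" /></p>'
    for m in START_TAG.finditer(sample):
        name, attrs, self_closing = m.groups()
        print(name, "self-closing" if self_closing else "start tag")
    # p start tag
    # img self-closing  (the "<b>" inside the attribute value is consumed
    #                    as part of the match, so it is not reported as a tag)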
Had it not gotten extremely lucky, it would have a very negative score. Anything intended to be funny on StackOverflow doesn’t go over well very often.
The answer was made in '09. StackOverflow was more accepting of some humour at the time. And it is making a valid point: regex isn't the right tool for this job.
More specifically, a stack lets you keep track of nesting. See an opening tag, push something onto a stack. See a closing tag, pop the stack. If the stack is empty at the end, the tags match.
Parsing XHTML in real life is of course much more complicated than this, but this is the basic idea.
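A toy version of that push/pop idea in Python (mine, assuming the input has already been tokenized into tags; real XHTML parsing is far messier):

    def tags_balanced(tokens) -> bool:
        stack = []
        for kind, name in tokens:      # kind: "open", "close" or "selfclose"
            if kind == "open":
                stack.append(name)
            elif kind == "close":
                if not stack or stack.pop() != name:
                    return False       # mismatched or stray end tag
        return not stack               # empty stack at the end: all matched

    print(tags_balanced([("open", "div"), ("open", "p"), ("close", "p"),
                         ("selfclose", "br"), ("close", "div")]))   # True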
I think there are a lot of knee-jerk answers because people see "XHTML" and "regex" in the same sentence and immediately think "not possible".
But the actual question is clearly not about matching start tags to end tags or building a DOM or anything like that - which indeed would require a stack. The question is about recognizing start and end tags. You can do that perfectly fine with regular expressions - indeed many parsers use regular expressions to tokenize the input before parsing.
Furthermore, the question specifically needs to recognize the difference between start-tags and self-closing tags. A difference which is not exposed by most XHTML parsers as far as I am aware.
Sorry, I misread. Indeed, actually tokenizing text is accomplished with regular expressions (although some parsers don’t need a tokenization pass, but details :).
If I were writing a limited parser, in answer to the narrow question being asked, I wouldn't be using regex at all. It's not suited to this particular problem. (For example it would get caught on things like <a attr="<b>"> which may well be valid input.)
So how would you tokenize without the use of regular expressions? What more appropriate technique would you use instead?
The example you provide is not XHTML so it is not really relevant for the discussion. But in any case, a regular expression has no problem recognizing a quoted string.
> So how would you tokenize without the use of regular expressions?
Since this need doesn't appear to be an everyday one, with clearly defined targets, a simple hand-written lexer isn't hard to write, and will make fewer mistakes than a regex. Just use a scanning approach. As a bonus, you'll still be able to read it in 12 months' time.
Why would a hand-written lexer have fewer mistakes than a regular expression using an off-the-shelf regex engine? They would need to encode the same lexical grammar, so at that level there is the same amount of complexity.
Writing a lexer by hand is just trading ten lines of regex (a widely known declarative DSL) with hundreds of lines of custom code. I don't see how that would be more maintainable in the long run.
The author said at one point that it was written in a bit of a huff of frustration, and if I remember correctly, also after a pint or two. It's not the best answer by any means, but it is one of those small gems of the web, I think.
it's sad that today the stack overflow majority shares your mindset and an answer like that would drown in downvotes or be killed by moderation.
some people like to act super serious all the time, like they're playing a sitcom version of what they think adulthood is, in a quest to be the most boring person on earth, as if that's the goal of human interaction
The problem is it is funny and wrong. Apparently it has given a lot of people really confused ideas about what is possible and what is not possible with regular expressions.
If it had been funny and right I would not have a problem with it.
The job (the question asked) is about recognizing start and end tags in XHTML. These are lexical tokens and therefore regular expressions are a perfectly fine tool for this. Indeed many parsers use regular expressions to perform the lexical analysis (https://en.wikipedia.org/wiki/Lexical_analysis) aka tokenization.
Quoting from wikipedia:
> The specification of a programming language often includes a set of rules, the lexical grammar, which defines the lexical syntax. The lexical syntax is usually a regular language, with the grammar rules consisting of regular expressions;
If you disagree that regular expressions are an appropriate tool for lexical analysis, can you articulate why? And what technique do you propose instead?
That is just a comment and can be easily scanned by a regex. The CDATA syntax does not have any particular meaning inside an XML comment if that is what you are suggesting.
And in any case, neither comments nor CDATA sections nest, so there is no problem handling them at the lexical analysis stage.
As for "get a premade grammar" - what does that mean? Do you mean use a tool like Lex? AFAIK Lex uses regular expressions.
you're missing the point: you cannot nest them, but you can have that comment both around or within a CDATA section, so that regex is starting to look quite complex to maintain, unless you want to have multiple stages of regex applied in sequence, at which point congratulations, you're building a whole parser the hardest way
and no, I mean a tool like antlr or javacc which builds an in-memory syntax tree you can query or observe.
this is different than querying the document as XML, since an empty node is equivalent to an empty element, but they differ in the syntax, so querying the syntax tree allows you to tell them apart because it produces a TAG_SLASH_CLOSE
yeah, and you can have a grammar with a single production using ALL : .* - you're missing the forest for the trees.
regexes are ok for text that's context free, so they're a perfect match for the lexing°. once you need context you're back to wacky hacks, layers and loops of regexes, or a preparser that removes the hard part from the content - basically you're rolling your own grammar, except manually, on the wrong toolchain and on your own.
"can I build my raster engine to draw polygons on screen" "well yes, but it is going to cost a lot of time, be error prone and miss a lot of features from ready-to-use solutions"
°you can actually put token productions in the lexer itself between expressions, so they can already handle more than plain regex even in lexing.
So it seems we agree that for the question the OP is actually asking a regex is the appropriate tool.
Great if you can recommend a tool which can solve the task easier. But saying it is "not possible" is unhelpful and confusing. It is just giving lots of people really confused ideas about what is possible and not possible using regexes.
To use your own example: "Can I build my raster engine to draw polygons on screen?"
Not helpful: "No that is simply not logically possible due to the laws of geometry, and you are an idiot for even thinking such a thing is possible"
Helpful: "Yeas, but it is a lot of work if you want to support anti-aliasing, smooth scaling and so on. Using a library like XYZ would probably save you a lot of work in the long run".
This is a bit out of my wheelhouse, but this feels wrong, or at least naively capable. Like it feels like this sort of reasoning leads to the kind of bugs (depending on what you use the result of the rexex for) that allow for code injection, a la the Equifax hack.
Maybe another HN poster can back me up, or explain why in fact Zalgo is mistaken and CargoCode is correct.
Either way, this sort of complexity is one reason I avoid XML like the plague and keep HTML at arm's length.
This is what I really hate about the Zalgo answer. It instills in people some vague sense that regular expressions are somehow bad, wrong and dangerous. But without any real arguments or context which would allow you to evaluate whether that feeling is justified.
See, this is kinda what I mean. Maybe you can detect tags with regex, but maybe you shouldn't, given the widespread but subtle differences in regex engines.
Perhaps the entire approach of "why are you trying to parse X?" needs to be retraced and re-evaluated.
Without the colon, the parser appears to be interpreting (? as a quantifier applied to (, but ( is not a full expression by itself and therefore cannot be modified with a quantifier.
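For example, with Python's re as the engine (other engines report this differently, but the idea is the same): a bare (? is rejected, while (?: opens a non-capturing group:

    import re

    re.compile(r"abc(?:def)")      # fine: (?: is a non-capturing group
    try:
        re.compile(r"abc(?def)")   # (? starts an extension construct, so
    except re.error as e:          # without a valid character it is an error
        print("invalid:", e)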
We see good examples of "the problem with StackOverflow" here.
The second highest-rated answer is "Evaluate it in a try..catch or whatever your language provides." and it's justified because "Surely the real question is 'how do I validate a regular expression'."
This is a fascinating computer science question and I'm pretty sure the questioner wasn't asking "how do I validate a regular expression" because he would have asked that.
Personally, I think this is a strength of StackOverflow. I think most people read through all of the answers, and this particular one keeps people who know nothing about regex from trying to evaluate regex with regex.
And for those with a deeper understanding, they have an answer that provides the intellectual stimulation they are looking for.
> I think most people read through all of the answers
You're vastly overestimating what most SO users come to the site for. They (and I include myself) just want something that'll work and isn't horrible. Even "not horrible" is something I care about but I know that many devs don't.
If somebody doesn't know the answer to their question in advance, which by definition they don't, it's by no means guaranteed they will know a horrible answer when they see it.
It really doesn't matter what the questioner wanted. The main utility of sites like Stack Overflow is not to give this one person the answer to their specific problem, but to be a useful landing page to people searching for things related to the question later.
A lot of those people are novices when it comes to regexes, and it's useful to make it clear that there's some things that are better solved with other tools, even if you can abuse extended regexes syntaxes to solve them.
The phrasing, especially combined with what time of year it was asked, makes me think it's a homework question from a CS logic course where they have to provide an example showing the answer is "no", and explain why.
The fundamental point is that the programming/software engineering concept of "regex" or "regular expression" is fundamentally different from the mathematical concept of a "regular language".
If you see "regex" without any special context and start thinking about what it can/can't do, then the first association should be with something like PCRE and its capabilities, not something that's bound by the pumping lemma.
The answer is as horrifying as you'd expect, but impressive nonetheless. Recursive regex is really fun and powerful, when it works.
I once made a recursive regex for matching a full name, complete with checking for suffixes, prefixes, an undefined amount of middle names, hyphenated last names, and Scotch-Irish names (Mc-, O'-, etc). Still simpler than this monstrosity.
My first thought seeing this title: "Why, no. Next!"
My second thought: the voice of Linus Torvalds at DebConf 14 saying "Hum… No! Hum… that was quick." [1]
Though this is only speaking about the recursively defined regular expression language, which is infinite and strictly describes regular languages, as defined in computer science lessons at university.
And then, "Why not just try and see if it breaks the provided regex parser since you have one?", and it's actually one of the answers in the link… awesome. I wonder it is has security implications though (are forged regexes exploiting flawed regex parsers a thing?)
Basic classical regular expression syntax without grouping parentheses is trivially regular. With parentheses, the maximum nesting level must be bounded to remain regular because "regexpes can't count".
Maybe I'm taking this a step too far, but don't Gödel's Incompleteness Theorems (https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_...) state that a language cannot define itself? You need a meta language to be able to define and specify a language in its entirety.
In other words, regex can not parse regex in its entirety. It's impossible.
Can a C program parse C source code? A Java program parse Java source code? Yes, they can, so such a general limitation couldn't be the reason regular expressions can't parse themselves.
[M]atching parentheses requires our recognition device to remember how many unmatched open parentheses there are. Since the only way for a DFA to remember anything is to be in one of a set of states corresponding to it, and since the unmatched open parentheses could easily outnumber the available states, we can see that the fundamental limitation of a DFA is that it can store only a finite amount of information (remember the ‘F’ in DFA?). This limitation applies to any string matching task that involves recursive structures or algebraic relationships between substrings. It is why “HTML and regex go together like love, marriage, and ritual infanticide.”
Correct, though there are multiple interpretations of what a "regex" is. In the theoretical sense you're correct, though some (most programmers, I'd say) mean "PCRE" when they say regex. And PCREs are pretty powerful, see [1] for example
Regex has two different meanings depending on the context.
One is the CS definition: a regular expression matcher for regular languages. The other is an extension of that into non-regular languages, typically PCRE-compatible.
So, given the much discussed limitations of reg-exps and the desire to parse context-free grammars, my question is: why are we still using regular expressions? Or rather, why isn't there something as easy to use as regular expressions that can process context-free grammars?
Something like Perl6 grammars[1], or maybe Rosie Pattern Language[2]? Of course Perl6 regexes also go well beyond regular expressions, and I suspect they could be used to match context-free grammars if pressed hard enough. Both P6 grammars and RPL are based on parsing expression grammars, and there are also tools/libraries for many other languages based on PEGs. But now you are entering in the scary realm of parsers and parser generators and all that jazz, and can debate how easy to use they really are.
PEGs are amazing precisely because they're as easy to use as, if not easier than, common regular expression syntax. PEGs are literally the same as regular expressions except 1) alternations are ordered, 2) zero-width assertions are formalized, and 3) quantifiers match greedily. This is effectively the same behavior as the Perl-compatible regular expressions with which most people are familiar.
Many PEG engines, especially for dynamic languages, permit grammar composition using first-class variables. That might be a small barrier to people more familiar with the terseness and conceptual simplicity of regular expressions as string'ish values. But it's fairly trivial to implement the latter using PEGs. For example, LPeg provides a small auxiliary module for doing that: http://www.inf.puc-rio.br/~roberto/lpeg/re.html
Also, Rosie seems amazing. I've not yet had the opportunity to make use of it, but I attended a presentation of Rosie by the author at a Lua workshop which left me very impressed.
I don't know if it's possible to do the "string describing a set of strings" thing that regex does with more powerful parsing while maintaining the ease of use. CFGs make heavy use of named sub-languages (productions). A regex-style parser would at least have to have some kind of recursion-inducing metacharacter.
That looks great, but it's hard to position it as a regex competitor. I should be more explicit about what I mean by "regex-style": The cool thing about regex that makes them so approachable is that they kind of look like the thing they're describing/matching. Your thing here mostly does not.
"My thing here" is a grammar so it describes a class of strings. Perhaps this is why it looks to you unlike what it describes?
You can make a DCG rule as specific or as general as you like. As a for instance, this is a vim regex I retrieved from my recent history:
\[13\/13,15\/12,24-24]
You could write this like so in DCG notation:
s --> ['[',13,'/',13,',',15,'/',12,',',24,'-',24,']'].
And that would match the string "[13/13,15/12,24-24]", no less, no more.
DCGs are Turing-complete, so you can go all the way from programs to finite automata when you write a pattern to match.
They're not really a regex competitor. They were invented in the '70s as a formalism for context-free grammars to be used in representing natural language. They fit right into the logic programming language Prolog that was created soon after (and by some of the same people).
Well, we started this subthread looking for alternatives to regex, so if DCGs aren't in the running I'm not sure what your point is. Also, I happen to believe Turing-completeness is a misfeature for this job.
The OP asked for "something as easy to use as regular expressions that can processes context-free grammars". You requested ease-of-use. I proposed DCGs. They are not an alternative to regexes in the sense that regexes can't represent anything beyond regular grammars, including CFGs. They can be as simple or as complicated as you want them and they are easy to use.
I'm not sure what is a "misfeature". What do you mean?
Edit: Apologies if I sound too terse. I'm confused by the terminology of "competitor", "alternative" etc. Are we in some kind of competition to find a technological solution that will take some prize? If so, I'd like to know the rules before I commit to any solution. What exactly are we trying to achieve here?
My interpretation of "as easy to use as regular expressions" is that the desired... formalism be usable in the same contexts and with the same casual ease as regex. Think Unix command lines, hairy ad-hoc Python parsing scripts, etc.
One of the things that makes regex so well suited to that role is that a string is a regex that matches itself, and you can iteratively add sophistication from there. At the very least, you would want to maintain (or improve, Cthulhu knows there's plenty of room) that incremental quality in any proposed replacement, while increasing its power.
It looks like with a DCG, you have to know you're writing one up front, and have to think about what you're parsing at a much more abstract level. The average sysadmin could probably not casually toss one off for a log parsing task. If there's an alternate syntax that gets around that problem, then I'm interested. OTOH...
Re "misfeature": one of the cool things about existing parsing formalisms is all the known terminating, mostly (all?) polynomial algorithms for analyzing them. If DCGs are Turing complete, they don't have those algorithms. Turing completeness is not usually a property I want in my parser. That's what I meant by saying it's a misfeature.
I don't quite understand what you mean by "a regex that matches itself". Could you explain? What you describe sounds more like a quine.
Regarding Turing completeness and termination- the DCG formalism (thank you) itself is expressive enough to represent anything from finite automata to UTMs, but that doesn't mean that every grammar you write using DCG notation is Turing-complete. So, for instance, if you write a right-regular grammar as a DCG, that DCG will not be Turing-complete, it'd just be a right-regular grammar.
The DCG examples I gave above are a CFG grammar for a fragment of natural English and a single-rule grammar that matches a single string. Those are definitely not Turing-complete and parsable in polynomial time.
The difference with regexes is that you can't express CFGs or above using regexes, but you _can_ represent both regexes and CFGs using DCG notation.
>> If there's an alternate syntax that gets around that problem, then I'm interested. OTOH...
May I make a personal comment? I think your insistence on competitive language, like "i'm [not] interested", "competitor to regexes" etc, is a case of Déformation professionnelle. You sound just like a software developer hyper-focused on finding tools to maximise productivity in the office, and nothing else.
You should perhaps consider the possibility that what we are discussing here goes a bit beyond a product that you can package and sell as an alternative to a popular tool. I mean, personally, when I realised that DCGs are executable grammars that can be run as both recognisers and generators - well, let's say it shifted my understanding of what is possible to do with a computer and a programming language.
In any case I don't see the complexity you seem to see in DCGs. Like I say above, they're basically BNF. I struggle to think of a sysadmin worth her salt (so, one who knows perl, eh?) who would sweat it to write and read BNF. The kind of sysadmin I have in mind, the problem would be to drag them away from the keyboard, once they started writing a DCG to parse a log file- and realised they could write one to parse _all_ log files ever.
"foo" is a string. It's also a regex that matches exactly the string "foo".
Regarding Turing-completeness, the entire point is about the termination of algorithms that examine parsers, not the runtime of the parsers themselves.
I also don't know how many times I need to point out that the reason I'm focusing on regexes is that that was the context for this conversation. Under any other context I'm quite interested in new parsing techniques.
Lastly, I would point out that regexes have uses far beyond a certain arbitrarily decided set of sysadmins and developers "worth their salt". Even if it's true that a meaningful subset of sysadmins are capable of writing basically sound BNF (which is not something I would count on even for CS grads, but maybe you work with smarter people than I do), there are lots of other people who could use a little more parsing power if it was offered in the right way.
This is a perfect example of an insight that leads down a rabbit hole of intertwined concepts when you try to understand why a regexp for validating regexps is actually much simpler than a regexp for validating email addresses.
It depends what you mean by "validating email addresses".
A regex that tests if a string looks like a valid email address is simple. Forget the ancient RFCs, a real world email address is in the form of `mailbox@domainname`. Which is not so difficult to test for with a bit of care.
However, testing if the email address is a valid mailbox is harder and indeed impossible using regex alone. The domain name can be validated using standard domain tools but in practice the mailbox can be anything that the server will route to a valid mailbox. The only way to validate it is to send an email.
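A hedged example of that first, "looks like mailbox@domainname" level of checking (my own pattern, deliberately loose, and it additionally insists on a dot in the domain, which real-world addresses have):

    import re

    EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    print(bool(EMAIL.match("alice@example.com")))   # True
    print(bool(EMAIL.match("not an email")))        # False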
Don't use a regex to do it. Most languages have a way to catch runtime errors. Stick the regex in a variable, create a dummy input, and run that block with a catcher around it. If the catcher catches a runtime error, you have a bad regex. If it does not, you have a valid regex.
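In Python, for instance, compiling the pattern is already enough to surface syntax errors, so the "catcher" can just wrap re.compile (in engines that compile lazily you would also run the pattern against a dummy input, as suggested above):

    import re

    def is_valid_pattern(pattern: str) -> bool:
        try:
            re.compile(pattern)
            return True
        except re.error:               # raised for malformed patterns
            return False

    print(is_valid_pattern(r"(a|b)*c"))   # True
    print(is_valid_pattern(r"(a|b*c"))    # False: unbalanced parenthesis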
TL;DR: No. But there exist languages built on top of regular expressions (notably PCRE) that can. They can't validate themselves, though - turtles all the way down to Gödel.
Plenty of parser formalisms can define their own syntax in their own syntax. As mentioned elsewhere, C compilers can parse their own source code. Gödel has nothing to do with it.
The fact that someone might make a mistake in doing something does not show it cannot be done. More generally, you seem to be confusing validating a program's source with the question of whether it performs its intended purpose. These are different things, and attempting to conflate them will only lead to confusion.
Then your mistake appears to be in failing to see that your perceived edge case does not invalidate the first sentence of my reply. If the sole purpose of a parser is to syntactically validate its own source (which is not the case for a compiler's parser, by the way, not even if we expand 'its own source' to 'arbitrary input'), then if it does that correctly, that's all there is to it.
It is a general rule that those who avoid answering a question do not have an answer, and this is no exception. Here, you completely misunderstand Chomsky's hierarchy: by your inverted-hierarchy argument, the simplest regular language would be complex enough that incompleteness would be an issue in its validation.
A self-hosted compiler is definitely not a case of the liars paradox, so I don’t think this applies. In fact, it doesn’t matter what the parser is written in, at some point you do have to trust that it actually follows the spec, probably through testing.
In the case at hand, correctness of the validator expression V clearly means "V determines well-formedness of any regular expressions" which is clearly not implied by "V is well-formed" (a much weaker statement because ".*" is well-formed but matches everything). Therefore, when applying V to itself, we only learn if a weak requirement for V's correctness holds.
Similar, perhaps, to validating whether a given number is a Gödel number of a well-formed logical statement rather than assessing the verity of the logical statement it encodes.
Also, I am not saying whether it is or is not possible to build such a regular expression. Rather, the question just doesn't tick the boxes of either, the Liar's Paradox, nor Gödel's Incompleteness result -- contrary to what was suggested. So you could still be right but for different reasons.
It's not possible to do this for regular expressions, but it is possible to do it for context free grammars. You can write a context-free grammar in Backus-Naur form that recognizes all context-free grammars in Backus-Naur form:
If you could do this with a proper regex, it would be handy: regexes are easy to parse and don't have a lot of structure, so you can do neat things like produce generators from them. Then you could get your regex for regexes to produce new regexes! This can be useful if you want to do generative testing.
A regular expression is a domain-specific language that defines a language consisting of all the strings it matches.
In principle, this serves as executable documentation.
A regular expression that matches valid regular expressions could be viewed as a human-readable documentation of valid regular expression syntax that has the valuable side-effect of being executable.
If, instead, we detect whether a string is a regular expression by compiling it, we may find it very difficult to read the compiler source code.
It is much more work to make an optimized regular expression engine that is also self-documenting than it is to write such a regular expression.
I think we have different definitions of human readable documentation. I'd find normal code, with or without appropriate comments, far more readable than a mess of regex symbols.
Regular expressions are not my favourite either, but one thing I will say is that normal code becomes difficult to read over time as features are bolted on and optimizations are added to address performance concerns.
Whereas, some form of executable DSL can serve as self-testing documentation indefinitely. DSLs can provide a separation of concerns, where the syntax is designed for readability, and the engine is designed for performance.
Modern Regexen are a dumpster fire, of course, but you asked why someone might want to, not whether this is something Everyone might want to do.
Yes, checking to see if a regular expression is valid is useful but that wasn't what I was asking. Why specifically would you want to use a regular expression for this job?
No, there is not. For example, parentheses in a regex must be balanced, and (famously) there is no regex to detect balanced parentheses.