Hacker News
Is there a regular expression to detect a valid regular expression? (stackoverflow.com)
376 points by lelf on Sept 8, 2019 | 219 comments



> Is there a regular expression to detect a valid regular expression?

No, there is not. For example, parentheses in a regex must be balanced, and (famously) there is no regex to detect balanced parentheses.


More precisely, it is arbitrarily deeply nested clauses that cannot be parsed, since true REs can neither use recursion nor keep count of how deeply they are nested.

https://blogs.msdn.microsoft.com/jaredpar/2008/10/15/regular...


Can regex engines actually process arbitrarily deep nestings themselves though?

It seems to me it would start getting computationally expensive very quickly to search for those kinds of regexes, so is 'arbitrarily deep' a practical concern for today's hardware?


Yes, regex engines that are designed to do so run in linear time in the size of the regex (as well as linear time with respect to the length of the input).

One such engine is rust's https://github.com/rust-lang/regex


> One such engine is rust's https://github.com/rust-lang/regex

It can process arbitrarily deep nestings? I'm not familiar with it, but I had a look and I can't see anything in there that would allow that. Can you point to the syntax that allows it?

In general regexes that run in linear time are based on DFAs (as opposed to NFAs, which are what most use).

A DFA is part of the Chomsky hierarchy of languages, which has DFAs at the bottom (aka regexes; the least powerful, can't handle recursive structures), then a DFA coupled with an infinite stack for memory (aka LR parsers; can't handle context-sensitive grammars, and also runs in very close to linear time), and finally a DFA coupled with an infinite tape (aka a Turing machine, which can do any sort of computation).

In each of these cases the DFA (regex) is essentially a computer program, and all the higher levels do is allow it to store stuff on various kinds of memory (none, a stack, a tape with arbitrary movements). An interesting corollary of this is that any computer program can be expressed as a DFA.


> It can process arbitrarily deep nestings?

The GP is presumably citing it as an example of a linear time regex, and not as something that can parse arbitrarily deep nestings. Note that "linear time regex engines" is a characterization that holds the size of the regex constant, such that "linear time" is with respect to the input only. The actual worst case time complexity of such regex engines is O(m * n), where m ~ len(regex) and n ~ len(input). So even with a linear time regex engine, if you bloat the size of the regex, you'll make searching slower. But, it will still be done linearly with respect to the size of the input.

> In general regexes that run in linear time are based on DFAs (as opposed to NFAs, which are what most use).

No. Please stop spreading the misuse of these terms. DFAs and NFAs have very precise meanings. It's one thing to call things regexes that aren't actually regular, but DFAs and NFAs should retain their precise theoretical meaning. A linear time regex engine such as Rust's `regex` crate uses both DFAs and NFAs in its implementation.

The Perl and PCRE folks (along with Jeffrey Friedl) have bastardized this terminology. PCRE for example claims to provide a "DFA" regex engine in its API, but it's actually an NFA. The main implementation inside PCRE (and Perl) is backtracking oriented, and carries with it much more power than an NFA. If it didn't, then the NFA could be executed in linear time, because NFAs and DFAs have equivalent power.

> An interesting corollary of this is that any computer program can be expressed as a DFA.

No they can't. You're confusing terminology quite a bit here.


Does "linear wrt A and linear wrt B" mean O(A*B) or O(A+B)?


It means O(AB), but it is carefully worded so as not to say (or suggest) that O(A+B), which is a subset of O(AB), is impossible ;)

I'm not sure if it's possible to match regexes with that time complexity; it looks like the implementation I referenced is O(AB): https://docs.rs/regex/1.3.1/regex/#untrusted-input

(star symbol removed because hn decided to turn them into a block of italics)


There is an importance to theoretical considerations that goes beyond their immediate practical applications. Ultimately, it comes down to how well one understands what is going on.

Having said that, I think there are practical applications. Suppose you were unaware of this result, and thought a true RE could validate your input? A more knowledgeable attacker might be able to craft an input that exploits the inherent limitations of your validator.

Also in practice, attempting to write a true RE to parse a language that allows restricted recursion rapidly becomes complicated. It is the wrong tool for the job.


Just considering the language of balanced parentheses, the only memory needed is a single integer to track depth. If we see a "(", increment. If we see a ")", decrement. If we never go negative and end up with zero, that's a balanced parentheses expression. (To be clear, the parser I've described is not a regex parser.)

Adding the rest of the regex machinery makes this a bit more complex, but in general, if the computer can store the string, it can check whether it's regular.
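A minimal sketch of that counter in Python (the names are mine):

  def balanced(s):
      depth = 0
      for ch in s:
          if ch == '(':
              depth += 1
          elif ch == ')':
              depth -= 1
              if depth < 0:   # a ')' with no '(' to match: reject early
                  return False
      return depth == 0       # every '(' was eventually closed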


As you say, this is not a regex parser, but some readers may not know the theorem that the regular languages are exactly those which can be parsed by a Turing machine with constant memory, while a parenthesis count can require log n bits.

In fact, any Turing machine using o(log log n) space recognizes a regular language (so there is an equivalent machine using O(1) space).


This only works if you have a single type of parenthesis. Is it true of regexes in general that only ()s can nest?


As typically defined academically, yes. In the real world things might be more complex, but I think it would hold true for many implementations.

Academically a regex is a string of:

- character literals

- "epsilon" characters (meaning empty string)

- "+" characters (which means either what we see before the + or what we see after the +, in real-world regex these are usually represented as | instead of + as well as used implicitly in constructs like [a-zA-Z])

- "*" characters (same meaning as in real-world regex)

- And parentheses to allow controlling the order of operations
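A quick recursive-descent checker for exactly that syntax gives a feel for why more than a regex is needed to recognize regexes (a sketch; the grammar layering and the fixed two-letter alphabet are my own choices):

  def is_regex(s, literals="ab"):
      # regex  := concat ('+' concat)* ;  concat := factor factor* ;
      # factor := base '*'*           ;  base   := literal | 'ε' | '(' regex ')'
      pos = 0
      def base():
          nonlocal pos
          if pos < len(s) and (s[pos] in literals or s[pos] == 'ε'):
              pos += 1
              return True
          if pos < len(s) and s[pos] == '(':
              pos += 1
              if regex() and pos < len(s) and s[pos] == ')':
                  pos += 1
                  return True
          return False
      def factor():
          nonlocal pos
          if not base():
              return False
          while pos < len(s) and s[pos] == '*':
              pos += 1
          return True
      def concat():
          nonlocal pos
          if not factor():
              return False
          while pos < len(s) and (s[pos] in literals or s[pos] in '(ε'):
              if not factor():
                  return False
          return True
      def regex():
          nonlocal pos
          if not concat():
              return False
          while pos < len(s) and s[pos] == '+':
              pos += 1
              if not concat():
                  return False
          return True
      return regex() and pos == len(s)

Note that base() calls regex() recursively: that recursion (effectively a stack) is precisely what a true regular expression doesn't have.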


In other words, "if we build a state machine which is different from a regex, it will perform differently than a regex would." That's trivially true, but is it interesting beyond restating "there is no Silver Bullet"?


Technically we aren't discussing the regex itself here but the syntax for defining a regex.

You could use a different syntax for defining the same regex, and that syntax could have different properties. For instance you could use parentheses like Racket, where "(...)" and "[...]" have the same meaning, but "(...]" is not well-formed syntax (if I remember my Racket correctly). With such a syntax for defining regexes, a counter would no longer suffice to decide whether a string is a regex.


You could have a stack of (paren-type, integer) pairs. If the paren has a different type than the top pair on the stack (or the stack is empty), push a new pair with the integer set to one. If the integer goes to zero, pop the pair. At the end of the expression, you have to have an empty stack.
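A rough Python version of that, with the bracket pairs hard-coded for the sketch:

  PAIRS = {')': '(', ']': '['}

  def balanced(s):
      stack = []                        # entries are [paren_type, count]
      for ch in s:
          if ch in '([':
              if stack and stack[-1][0] == ch:
                  stack[-1][1] += 1     # same type as the top: just count it
              else:
                  stack.append([ch, 1]) # different type (or empty stack): push
          elif ch in ')]':
              if not stack or stack[-1][0] != PAIRS[ch]:
                  return False          # close with no matching open
              stack[-1][1] -= 1
              if stack[-1][1] == 0:
                  stack.pop()
      return not stack                  # empty stack at the end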


)))((()(()


When I check brackets by hand I do actually get my hands out, and when I say open I uncurl a finger, and when I say close, I furl a finger. "Open, open, open, close, open, close, close, close". Provided I have no fingers on display at the end, then the brackets are balanced. They might not be in the right place but at least they are balanced. If I find myself starting with close then I close the lid and go and do something else.

I'm sure some sort of rather cheap algorithm falls out of the above. It won't guarantee correctness in what the balanced brackets actually contain, but they will at least be balanced.


Good edge case. I’d say to immediately return an ‘unbalanced’ result once the count goes negative.


((((()


doesn't end up with zero at the end, rejected?


I was just trying to suggest a possible edge case, a failing test.


"never go negative"


This is true if you want to handle arbitrarily deeply nested parens.

But any given string you are asked to validate to see if it is a regex will be of finite length, and contain a finite number of opening paren characters. So its maximum possible nesting depth is known.

And you can construct, fairly trivially, a regex that can validate paren nesting up to a fixed depth.

So, in practice, you could use a regular expression to validate the paren nesting of any given string - if you were allowed to prepare the regex based on the string length, or on running another regex over the string first.
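For example, here is the standard construction for "balanced parens up to depth k", generated in Python (a sketch):

  import re

  def depth_k_paren_regex(k):
      inner = r'[^()]*'                 # depth 0: no parens at all
      for _ in range(k):
          # allow any number of (...) groups, each one level shallower
          inner = r'[^()]*(?:\(' + inner + r'\)[^()]*)*'
      return inner

  pat = re.compile(depth_k_paren_regex(2))
  print(bool(pat.fullmatch('a(b(c)d)e')))  # True: depth 2
  print(bool(pat.fullmatch('((()))')))     # False: depth 3

The regex grows with k, which is exactly the "regexes can't count unboundedly" limitation in another guise.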

What I suspect you might not be able to validate is that in a regex, while [a-z] is valid, [z-a] is not.


If you allow generating a regular expression based on string length, you can validate any computable language with a regular expression. It's trivial to just run a solver on all strings of that length and OR the accepted ones together.

> What I suspect you might not be able to validate is that in a regex, while [a-z] is valid, [z-a] is not.

It's fairly trivial to validate things like a-z being valid in a character class while z-a isn't. There are only finitely many legal ranges. You can just list them all, like \[(a-a|a-b|a-c|...|z-z)\].
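For instance, generating that alternation mechanically (a sketch covering lowercase ranges only):

  import re
  import string

  lower = string.ascii_lowercase
  ranges = '|'.join(a + '-' + b
                    for i, a in enumerate(lower)
                    for b in lower[i:])            # only non-decreasing pairs
  char_class = re.compile(r'\[(?:' + ranges + r')\]')

  print(bool(char_class.fullmatch('[a-z]')))  # True
  print(bool(char_class.fullmatch('[z-a]')))  # False

That's 351 alternatives: finite, so still regular.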


For a fixed input length all languages are decidable by regular expressions. Trivially, the union of all valid input strings of that length, like "(abc|def|...)".

It's not an interesting property.


The problem is that "regular expression" is equivocal.

Sometimes you hear "regular expression" used to mean the language of things accepted by some finite state machine. Lots of folks on HN took a Theory of Computation class and learned about these in that class. In that meaning, yes, you are right.

But sometimes in professional conversations you hear "regular expression" used to mean the strings that can be matched using, say, Perl's regular expressions. I know I have heard it used that way often. Since Perl regular expressions can accept anything a Turing machine can accept, the answer to the question in this case is yes.


Very few people know that Perl's regex engine (since 5.10) supports recursion. But recursive matching cannot perform recursive captures, which is why at best it can only be used to validate a string, not parse it, making it not particularly useful in practice. For these reasons I would never assume that someone using the term "regular expression" inclusively of Perl would have in mind such capabilities. IME, such a loose definition simply implies things like zero-width assertions, backreferences, etc.

Here's a proof-of-concept Perl 5.10+ JSON validator I came up with for a presentation to Perl engineers introducing PEGs and Lua's LPeg module. Of the 20-30 people in the room, I doubt anybody in the audience knew this was even possible with Perl.

  my $grammar = qr{
  ^(?&Value) $
  (?(DEFINE)
    (?<Value> \s* (?:
        (?&Array)
      | (?&Object)
      | (?&Boolean)
      | (?&Number)
      | (?&String)
      | (?&Null)
    ) \s* )
    (?<Array> \[ \s* (?:(?&Value) (?:\s*,\s* (?&Value))*)? \s* \])
    (?<Object> \{ \s* (?:(?&KeyV) (?:\s*,\s* (?&KeyV))*)? \s* \})
    (?<KeyV> \s* (?:(?&String) \s*:\s* (?&Value)) \s*)
    (?<Boolean> true | false)
    (?<Number> \d+)
    (?<String> "[^\"]*")
    (?<Null> null)
  )
  }xs;
Note: I was trying to fit it all on a single slide, so the definition for String doesn't handle escaped characters. There may be other deficiencies. I copy+pasted this from the PDF slide deck as I can't find the original Beamer source. Any broken spacing and Unicode substitutions probably aren't original.


Sort of, but if (and only if) you know what a regular expression formally means, you can disambiguate the equivocation introduced by ambiguous usage.


It depends on the representation. If the representation is in the FSA format (state, character, state), it becomes nearly trivial.


Very underrated comment, and good food for thought, although I think it is clear that the question was meant to be about the usual textual representation using the simplest RE language (without extensions).


An analogous problem for this representation would be detecting transitions to undefined states, or ambiguous (duplicate) transitions.


I didn't verify it, but my intuition says that can't be done with a CFG; it seems at least context-sensitive.


> (famously) there is no regex to detect balanced parentheses

God, I remember running into this wall with an in-company domain-specific language that used regex as its tokenizer/parser.

Adding support for nested structures required us to untangle the whole thing and rewrite the regex into explicit algorithms. Fortunately the regex was only a few pages long...

In short, I agree with you that regex cannot parse regular expressions completely. Would be happy to be proven wrong, though!


> In short, I agree with you that regex cannot parse regular expressions completely. Would be happy to be proven wrong, though!

the proof that you can't parse nested parentheses with a regular language is a pretty standard part of first-year computer science


It doesn't stop some of us from trying though! :)


Follow-up question: without the grouping parentheses, is the regex language regular?


Yes, but you're taking away too much. Without grouping, regex is weaker than regular languages.

Grouping & capturing parentheses can stay. It's only the backreferences that you need to remove.


Regex matching with backreferences is NP-hard.

https://perl.plover.com/NPC/


Grouping is very useful, but I think you can get pretty far in practical patterns with a limit of k-nesting for k=0..3 or so. That pattern language would still be regular.


Yes, but it's mostly useless, since it can't match (variable-count) repeated strings of length greater than 1. Grouping isn't only about extraction.


Sure there is. Read the linked answer, or just google it.

Example:

https://regular-expressions.mobi/recurse.html?wlr=1


Right, but the parent is thinking of the question "is the language of valid regular expressions itself regular?", which is a very different question, and the answer is of course no, by an easy application of the pumping lemma. However, as shown in the SO answer, they "cheat" by using some features that don't align completely with the mathematical definitions.

edit: typo.


It depends if we're talking regex in the mathematical sense or in the programming one.


"Evaluate it in a try..catch or whatever your language provides."

"That's not very enterprisey of you"

Oooh, I'm laughing so hard it hurts. It's been a particularly 'enterprisey' week at work.


That answer feels entirely wrong to me. As pointed out, it may be an XY answer: It presumes the person posing the question wants to validate regular expressions.

But with something so blatantly self-referential, it actually feels unlikely to me that what they want to do is validate regular expressions. My guess is that they are generally curious about whether Regexen (PCRE or strict regular expressions) are powerful enough to validate Regexen (again whether PCRE or strict).

XY answers are good for avoiding a lot of unnecessary yak-shaving/accidental complexity of a bad solution. But the conversation around whether we are talking about recognizing strict Regexen or PCRE Regexen, and in turn whether we are using strict Regexen or PCRE Regexen to recognize them is not accidental complexity or yak-shaving, it is intrinsic to understanding the nature of the problem and solution spaces.

I too find the answer humorous for the "enterprisey" reference, but I think it would be a very bad answer if we are judging it strictly on the basis of its value.


"We could do it in a easy two liners that make sense, but, and hear me out here, what if we instead built an entire 10k LOC system that would barely work half of the time, using dozens of man hours for it ? And then never re-use it nor train anyone about it so that it becomes a mythical black box after your contract expires ?"

"Well, I gave you my recommendation but hey at the end of the day you're the customer and I'm paid by the hour, so whatever you say boss"

I now consider myself very lucky to be successful enough to not have to deal with projects like that, but back when I wasn't, those sorts of stupid decisions paid for quite a few things I wanted.


Actually that is how enterprise works, AFAIK. But a few well-funded startups, on the other hand...


I believe Zalgo has the answer to this, via an equivalent question. https://stackoverflow.com/questions/1732348/regex-match-open...


My favorite part of it is the moderators' note at the end:

> Moderator's Note

> This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look - there are no problems with its content. Please do not flag it for our attention.


The big problem with the content is suggesting to use an XML parser to parse HTML. That might often work, and several XML parsers have some sort of HTML "mode", but in general, no, not really. HTML documents are frequently invalid XML. HTML documents lack an XML declaration, they don't close all tags, etc. They might even lack a root element. Someone mentioned this in a comment on SO, but it was (as far as I could see) not really addressed or answered.

HTML 5 has its own parsing rules, specified at [1]. However, I don't know of any implementation of these outside of browsers. I normally use Beautiful Soup [2] (for Python) or HTML Agility Pack [3] (for .NET), and while I don't think these implement the exact standard, they're easier than struggling with an XML parser for HTML "in the wild".

1: https://html.spec.whatwg.org/multipage/parsing.html

2: https://www.crummy.com/software/BeautifulSoup/

3: https://html-agility-pack.net/


The question title implies he's parsing XHTML text.

> XHTML is an XML-based HTML. It serves the same function as HTML, but with the same rules as XML documents.

https://stackoverflow.com/q/1429065


The question title is ambiguous but the question tags include both HTML and XHTML. Self-closing `<element />` is an XML-ism which in HTML is treated the same as `<element>` but it was still fashionable to use them in HTML when XHTML was at its height.


From that answer: "You can't parse [X]HTML with regex".

I've always wondered how that came about. How come very early in the development of the web no one important enough for people to pay attention to them said, "Hey... wait a second. If this thing becomes popular, people are going to really want to process web documents with their usual text file processing tools and techniques. We really oughta make this thing reasonably easy to process with grep and sed and awk and Perl and such"?


The problem isn't with the surface syntax of HTML. Anything else, like S-expressions or JSON or whatever would have the same problem. The problem is arbitrary nesting of elements. No language with nesting can be parsed with regex (modulo various copouts mentioned elsethread). But such nesting is useful and necessary for a document language like HTML. This goal is important; "we must be able to use awk" isn't.


Perhaps the most epic reply I have ever seen on SO.


In SO's hall of fame of regular members being fed up with a common yet stupid question that keeps coming back, it's up there alongside the thread about doing addition with jQuery.


It’s a famous post on StackOverflow, but I don’t find it particularly helpful.


I found it extremely helpful. The sheer emphasis of the reply made me very curious why the idea of using regex on xml is so bonkers.

- I will never forget that regex can't parse XHTML

- the reason being, regex is insufficiently powerful

- when I first saw this post, I knew little about regex under the hood, this sent me down a wiki hole of FSMs, pushdown automata and turing machines

- this misconception is apparently common enough to be madness-inducing to those that know better

- use a hecking xml parser instead

It almost reminds me of a Bill Nye sketch. Teaching through a bit of non-sequitur and absurdism.


The problem with the answer is it is wrong. The question is about identifying start-tags in XHTML. This is a question of tokenization and can be solved with a regular expression. Indeed, most parsers use regular expressions for the tokenization stage. It is exactly the right tool for the job!

Furthermore, the asker specifically needs to distinguish between start tags and self-closing start tags. This is a token-level difference which is typically not exposed by XHTML parsers. So saying "use a parser" is less than helpful.

I have elaborated a bit in a blog post: https://www.cargocultcode.com/solving-the-zalgo-regex/


If you were looking for the reason why a regex cannot parse HTML, it is because HTML has matching nested tags and regex parsers are finite state machines (FSM).

What this means is that a regex parser is like a goldfish. It only knows about the state it is currently in (what it just read) and which possible states it may transition to (what is legally allowed to come next). The fish never remembers where it was before; there is no option to have the legality of a transition depend on what it read _before_ the current state. But this memory function is a requirement to recursively match opening tags to their closing tags - you need to keep a stack of opening tags somewhere in order to then cross off their closing tags in reverse order. So regex cannot parse HTML.
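A toy Python sketch of that stack idea (a regex does the tokenizing, the stack does the remembering; it deliberately ignores comments, CDATA and attribute values containing '>'):

  import re

  TAG = re.compile(r'<(/?)(\w+)[^>]*?(/?)>')

  def well_nested(html):
      stack = []
      for closing, name, self_closing in TAG.findall(html):
          if self_closing:
              continue               # <br/> etc.: no effect on nesting
          if not closing:
              stack.append(name)     # opening tag: remember it
          elif not stack or stack.pop() != name:
              return False           # closing tag doesn't match the last open
      return not stack

  print(well_nested('<a><b></b></a>'))  # True
  print(well_nested('<a><b></a></b>'))  # False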


>What this means is that a regex parser is like a goldfish.

Like a drunk goldfish, or a sober goldfish?

'A method to study short-term memory (STM) in the goldfish.'

'Twenty-one common goldfish (13-15.5 cm long) were randomly divided into alcohol (A) and nonalcohol (NA) groups and were trained in an alcohol solution of 400 mg/100 ml or in water, respectively. All alcohol fish were placed in an alcohol solution of 400 mg/100 ml for 3 hr before training in the same alcohol concentration. Fish were trained on a position discrimination task for 2 consecutive days. The door used for training was that opposite to each fish's spontaneous preference. Savings in relearning on Day 2 was taken as a measure of long term memory strength. Only fish which reached criterion on both days were immediately given 10 forced reversal trails in the opposite direction (i.e., a fish trained on right door was forced to choose the left door.) A and NA subjects were then tested after a 5 min (STM) delay, respectively, in a free choice situation for 10 trails (i.e., neither door was blocked). The results suggest that alcohol facilitates the STM of the forced reversal information.'

https://www.ncbi.nlm.nih.gov/pubmed/935220


The question is about identifying end-tags in XHTML. This is indeed possible with a regex.


<a href="</a>">try this</a>


That is not XHTML.


What about <a>this <!-- </a> --> </a> <!-- </a> -->?


Yes you can tokenize this with a regular expression and extract the valid start and end tags.

If comments in XHTML could nest you would have a problem. But this is not the case.


> Yes you can tokenize this with a regular expression and extract the valid start and end tags.

So you need more than a regular expression, hence your premise is incorrect.


No, you don't need more than a regular expression. If you want to extract elements, i.e. match start tags to the corresponding end tags, then you need a stack-based parser. But just to extract the start tags (which is the question) a regular expression is sufficient.

The original question is a question about tokenization, not parsing, which is why a regular expression is sufficient.
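To illustrate, a sketch of such a tokenizer (simplified; real XML name and attribute-value rules are fussier than this):

  import re

  START_TAG = re.compile(r'''
      < (?P<name>[A-Za-z][\w.-]*)           # tag name
        (?: \s+ [\w.-]+ \s* = \s*           # attribute name =
            (?: "[^"]*" | '[^']*' ) )*      # quoted value ('>' inside is fine)
        \s* (?P<selfclosing>/?) >
  ''', re.VERBOSE)

  for m in START_TAG.finditer('<a href="x">text</a><br/>'):
      print(m.group('name'), bool(m.group('selfclosing')))
  # a False    (ordinary start tag)
  # br True    (self-closing)

Note the quoted-value alternation is what lets it sail past a '>' or '/>' inside an attribute, which trips up naive patterns.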


<input value="how about this? />"/>


That is a valid XHTML tag (if I remember correctly) and can be matched perfectly fine by a regex.


Perhaps something like "([^"]*)" could skip what is inside the string literal. Unless there is "<input" in the string literal, in which case where you start parsing becomes very important.


That pattern would indeed match a quoted string. I don't see how it would matter if the quoted string contains something like "<input". It can contain anything except a quote character.


It just makes the starting offset of the regex input important.


Sure. But that would be true for any parsing technique. No parser known to man would be able to produce a valid parse if you start it in the middle of a quoted string!


Theoretically I believe an end tag really requires a valid start tag.

Anyway, you can probably answer any number of simple questions about a bit of HTML using regex, but as the code grows to handle more use cases there will come a time when the solution breaks down, and the code that was written to handle all the previous uses will need to be rewritten using something other than regex.


You have to distinguish between the different levels of parsing.

Regexes are appropriate for tokenization, which is the task of recognizing lexical units like start tags, end tags, comments and so on. The SO question is about selecting such tokens, so this can be solved with a regex.

If you have more complex use cases like matching start tags to end tags, you might need a proper parser on top. But you still need tokenization as a stage in that parser! I don't see what you would gain by using something other than regexes for tokenization? I guess in some extreme cases a hand written lexer could be more performant, but in the typical case a regex engine would probably be a lot faster than the alternatives and certainly more maintainable.

I know it is possible to write a parser without a clear tokenization/parsing separation - but it is not clear to me this would be beneficial in any way.


An element requires a start and end tag, or a self-closing start tag.


Had it not gotten extremely lucky, it would have a very negative score. Anything intended to be funny on StackOverflow doesn’t go over well very often.


The answer was made in '09. StackOverflow was more accepting of some humour at the time. And it is making a valid point: regex isn't the right tool for this job.


The question is about how to identify start and end-tags in XHTML. What would be an appropriate tool for that job?


A parser. Specifically, an XHTML parser.


How do you think an XHTML parser is written? In particular, how does an XHTML parser identify tokens like start and end tags?


With a pushdown automaton [1] or something like a linear bounded automaton [2].

[1] https://en.m.wikipedia.org/wiki/Pushdown_automaton

[2] https://en.m.wikipedia.org/wiki/Linear_bounded_automaton

More specifically, a stack lets you keep track of nesting. See an opening tag, push something onto a stack. See a closing tag, pop the stack. If the stack is empty at the end, the tags match.

Parsing XHTML in real life is of course much more complicated than this, but this is the basic idea.


I think there are a lot of knee-jerk answers because people see "XHTML" and "regex" in the same sentence and immediately think "not possible".

But the actual question is clearly not about matching start tags to end tags or building a DOM or anything like that - which indeed would require a stack. The question is about recognizing start and end tags. You can do that perfectly fine with regular expressions - indeed many parsers use regular expressions to tokenize the input before parsing.

Furthermore, the question specifically needs to recognize the difference between start tags and self-closing tags. A difference which is not exposed by most XHTML parsers, as far as I am aware.


Sorry, I misread. Indeed, actually tokenizing text is accomplished with regular expressions (although some parsers don’t need a tokenization pass, but details :).


It keeps track of state that a regular expression cannot?


You don't need to keep track of state to match tokens like XHTML start or end tags.


If I were writing a limited parser, in answer to the narrow question being asked, I wouldn't be using regex at all. It's not suited to this particular problem. (For example it would get caught on things like <a attr="<b>"> which may well be valid input.)


So how would you tokenize without the use of regular expressions? What more appropriate technique would you use instead?

The example you provide is not XHTML, so not really relevant for the discussion. But in any case, a regular expression has no problem recognizing a quoted string.


> So how would you tokenize without the use of regular expressions?

Since this need doesn't appear to be an everyday one, with clearly defined targets, a simple hand-written lexer isn't hard to write, and will make fewer mistakes than a regex. Just use a scanning approach. As a bonus, you'll still be able to read it in 12 months' time.


Why would a hand-written lexer have fewer mistakes than a regular expression using an off-the-shelf regex engine? They would need to encode the same lexical grammar, so at that level there is the same amount of complexity.

Writing a lexer by hand is just trading ten lines of regex (a widely known declarative DSL) for hundreds of lines of custom code. I don't see how that would be more maintainable in the long run.


I don't think that answer was written with the intent of being particularly helpful, I think it was written with a different goal in mind.


It is helpful in that almost koan way though.


The author said at one point that it was written in a bit of a huff of frustration, and if I remember correctly, also after a pint or two. It's not the best answer by any means, but it is one of those small gems of the web, I think.


It is helpful. It may not provide the answer the OP wanted, but it provides the answer they needed.


It's sad that today the Stack Overflow majority shares your mindset, and an answer like that would drown in downvotes or be killed by moderation.

Some people like to act super serious all the time, like they're playing a sitcom version of what they think adulthood is, in a quest to be the most boring person on earth, as if that's the goal of human interaction.


You can be funny and helpful at the same time.


which is exactly what that SO answer was


The problem is it is funny and wrong. Apparently it has given a lot of people really confused ideas about what is possible and what is not possible with regular expressions.

If it had been funny and right I would not have a problem with it.


Stack Overflow is not a homework help group; the "regexes are the wrong tool for the job" answer is more correct than the answer with the working regex.


The job (the question asked) is about recognizing start and end tags in XHTML. These are lexical tokens and therefore regular expressions are a perfectly fine tool for this. Indeed many parsers use regular expressions to perform the lexical analysis (https://en.wikipedia.org/wiki/Lexical_analysis) aka tokenization.

Quoting from wikipedia:

The specification of a programming language often includes a set of rules, the lexical grammar, which defines the lexical syntax. The lexical syntax is usually a regular language, with the grammar rules consisting of regular expressions;

If you disagree that regular expressions are an appropriate tool for lexical analysis, can you articulate why? And what technique do you propose instead?


because this <!-- <![CDATA[ -->

as for the solution, one can get a premade grammar and query the ast


That is just a comment and can be easily scanned by a regex. The CDATA syntax does not have any particular meaning inside an XML comment, if that is what you are suggesting.

And in any case, neither comments nor CDATA sections nest, so there is no problem handling them at the lexical analysis stage.

As for "get a premade grammar" - what does that mean? Do you mean use a tool like Lex? AFAIK Lex uses regular expressions.


You're missing the point: you cannot nest them, but you can have that comment both around and within a CDATA, so that regex is starting to look quite complex to maintain, unless you want to have multiple stages of regexes applied in sequence, at which point congratulations, you're building a whole parser the hardest way.

And no, I mean a tool like ANTLR or JavaCC which builds an in-memory syntax tree you can query or observe.

https://github.com/antlr/grammars-v4/blob/72810b7c59bb481750...

this is different than querying the document as XML, since an empty node is equivalent to an empty element, but they differ in the syntax, so querying the syntax tree allows you to know, because it produces a TAG_SLASH_CLOSE


I'm not familiar with antlr or javacc, but don't they use regular expressions for lexing?


Yeah, and you can have a grammar with a single production using ALL : .* - you're missing the forest for the trees.

Regexes are OK for text that's context free, so they match perfectly for the lexing°. Once you need context you are back to wacky hacks, layers and loops of regexes, or a preparser that removes the hard part out of the content - basically you're rolling your own grammar, except manually, on the wrong toolchain and on your own.

"Can I build my raster engine to draw polygons on screen?" "Well yes, but it is going to cost a lot of time, be error prone and miss a lot of features from ready-to-use solutions."

°You can actually put token productions in the lexer itself between expressions, so they can already handle more than plain regex even in lexing.


So it seems we agree that for the question the OP is actually asking a regex is the appropriate tool.

Great if you can recommend a tool which can solve the task easier. But saying it is "not possible" is unhelpful and confusing. It is just giving lots of people really confused ideas about what is possible and not possible using regexes.

To use your own example: "Can I build my raster engine to draw polygons on screen?"

Not helpful: "No that is simply not logically possible due to the laws of geometry, and you are an idiot for even thinking such a thing is possible"

Helpful: "Yeas, but it is a lot of work if you want to support anti-aliasing, smooth scaling and so on. Using a library like XYZ would probably save you a lot of work in the long run".


Wow, it looks like it was designed by David Carson.


Here is the answer to the question: https://www.cargocultcode.com/solving-the-zalgo-regex/

tl;dr: It can indeed be solved relatively easily with a regex.


This is a bit out of my wheelhouse, but this feels wrong, or at least naive about what regex is capable of. It feels like this sort of reasoning leads to the kind of bugs (depending on what you use the result of the regex for) that allow for code injection, a la the Equifax hack.

Maybe another HN poster can back me up, or explain why in fact Zalgo is mistaken and CargoCode is correct.

Either way, this sort of complexity is one reason I avoid XML like the plague and keep HTML at arm's length.


CargoCode is correct. Zalgo simply misread the question because he was so sick of similar subtly different questions.


This is what I really hate about the Zalgo answer. It is instilling in people some vague sense that regular expressions are somehow bad, wrong and dangerous. But without any real arguments or context which would allow you to evaluate whether the feeling is justified.


It doesn't work for me with regex101. "The preceding token is not quantifiable" on this part:

  | < (? \w+ )


See, this is kinda what I mean. Maybe you can detect tags with regex, but maybe you shouldn't, given the widespread but subtle differences in regex engines.

Perhaps the entire approach of "why are you trying to parse X?" needs to be traced and re-evaluated.


> Maybe you can detect tags with regex, but maybe you shouldn't...

So what do you think would be a more appropriate choice for writing a tokenizer?


You want (?:, not (?

Without the colon, the parser appears to be interpreting (? as "zero or one instance of (", but ( is not a full expression by itself and therefore cannot be modified with a quantifier.


I actually meant (?<tag> in order to create a named capture.


It was supposed to be (?<tag> \w+ ) in order to create a named capture. The <tag> was apparently lost in editing. Thanks for the heads-up.


We see good examples of "the problem with StackOverflow" here.

The second highest-rated answer is "Evaluate it in a try..catch or whatever your language provides." and it's justified because "Surely the real question is 'how do I validate a regular expression'."

This is a fascinating computer science question and I'm pretty sure the questioner wasn't asking "how do I validate a regular expression" because he would have asked that.


I also think it's a fascinating question, but if the asker wanted a theoretical answer more than a practical one, it should have been asked on the CS or CS theory stack exchange: https://cs.stackexchange.com/ & https://cstheory.stackexchange.com/


...neither of which existed when the question was asked.


Personally, I think this is a strength of StackOverflow. I think most people read through all of the answers, and this particular one keeps people who know nothing about regex from trying to evaluate regex with regex.

And for those with a deeper understanding, they have an answer that provides the intellectual stimulation they are looking for.


> I think most people read through all of the answers

You're vastly overestimating what most SO users come to the site for. They (and I include myself) just want something that'll work and isn't horrible. Even "not horrible" is something I care about but I know that many devs don't.


If somebody doesn't know the answer to their question in advance, which by definition they don't, it's by no means guaranteed they will know a horrible answer when they see it.


It really doesn't matter what the questioner wanted. The main utility of sites like Stack Overflow is not to give this one person the answer to their specific problem, but to be a useful landing page to people searching for things related to the question later.

A lot of those people are novices when it comes to regexes, and it's useful to make it clear that there's some things that are better solved with other tools, even if you can abuse extended regexes syntaxes to solve them.


The answer in this case does not even satisfy that standard.


I think it's a more frustrating example:

> If so please give example code below.

The phrasing, especially combined with what time of year it was asked, makes me think it's a homework question from a CS logic course where they have to provide an example showing the answer is "no", and explain why.


> I'm pretty sure the questioner wasn't asking "how do I validate a regular expression" because he would have asked that.

In my experience, people sometimes ask for how to solve the more immediate detail they’re working on rather than the broader problem.


This came up recently. Going to start calling it the XYZ problem

https://news.ycombinator.com/item?id=20861806

Edit: page refreshed and sure enough the sibling comment calls it out by name

Edit2: Called it the XYZ problem: https://cohan.io/the-xyz-problem/


See also: the XY problem [0][1].

[0]: http://xyproblem.info/

[1]: https://en.wikipedia.org/wiki/XY_problem


Maybe the second-ranked answer is trying to second-guess the question, but at least the highest-ranked answer is precise...


An interesting question would be "which regex variant can be parsed by itself, and how?".

Parsing simple regex with PCRE is cheating. If you are using PCRE, you should be parsing PCRE.

The simplest case would be DOS style globs with just * and ?, and it can parse itself: just use "*".


I hate how the accepted answer is incorrect and does not teach the fundamental property of regular expressions.


The fundamental property of regular expressions is that the programming/software engineering concept of "regex" or "regular expression" is fundamentally different from the mathematical concept of "regular language".

If you see "regex" without any special context and start thinking about what it can/can't do, then the first association should be with something like PCRE and its capabilities, not something that's bound by the pumping lemma.


It only works in practice, not in theory.


It doesn't work in practice: PCRE does not validate all valid PCRE expressions.


The answer is as horrifying as you'd expect, but impressive nonetheless. Recursive regex is really fun and powerful, when it works.

I once made a recursive regex for matching a full name, complete with checking for suffixes, prefixes, an undefined number of middle names, hyphenated last names, and Scotch-Irish names (Mc-, O'-, etc). Still simpler than this monstrosity.


I've always been mildly amused by the fact that while regular expressions are a context-free language, context-free grammars are a regular language.


My first thought seeing this title: "Why, no. Next!"

My second thought: the voice of Linus Torvalds at DebConf 14 saying "Hum… No! Hum… that was quick." [1]

Though this is only speaking about a recursively defined regular expression language, which is infinite, and which strictly handles regular languages, as defined in computer science lessons at university.

And then, "Why not just try and see if it breaks the provided regex parser since you have one?", and it's actually one of the answers in the link… awesome. I wonder it is has security implications though (are forged regexes exploiting flawed regex parsers a thing?)

[1] https://youtu.be/5PmHRSeA2c8?t=110


> are forged regexes exploiting flawed regex parsers a thing?

Looks like yes, depending on the engine.

PCRE for instance has a long list of security vulnerabilities including some with arbitrary code execution: https://www.cvedetails.com/vulnerability-list.php?vendor_id=...


Basic classical regular expression syntax without grouping parentheses is trivially regular. With parentheses, the maximum nesting level must be bounded to remain regular because "regexpes can't count".


Maybe I'm taking this a step too far, but don't Gödel's incompleteness theorems (https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_...) state that a language cannot define itself? You need a meta language to be able to define and specify a language in its entirety.

In other words, regex cannot parse regex in its entirety. It's impossible.


Can a C program parse C source code? A Java program parse Java source code? Yes, they can, so such a general limitation couldn't be the reason regular expressions can't parse themselves.

Perhaps you had in mind the Halting Problem: https://en.wikipedia.org/wiki/Halting_problem#G%C3%B6del's_i...


No, not really. This is probably closer to Chomsky's area than Gödel's. Gödel's theorem is about semantics, not syntax.

As a nice counter-example to what you said, you can define the Backus-Naur notation using Backus-Naur notation: https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form#Furth...


It may be noted that regular languages (as in compiler design) do not have such recursive constructs.


No. I wrote about this awhile back (https://reindeereffect.github.io/2018/06/24/regex/#the-limit...). From that post:

[M]atching parentheses requires our recognition device to remember how many unmatched open parentheses there are. Since the only way for a DFA to remember anything is to be in one of a set of states corresponding to it, and since the unmatched open parentheses could easily outnumber the available states, we can see that the fundamental limitation of a DFA is that it can store only a finite amount of information (remember the ‘F’ in DFA?). This limitation applies to any string matching task that involves recursive structures or algebraic relationships between substrings. It is why “HTML and regex go together like love, marriage, and ritual infanticide.”


IIRC, a regex can be implemented by an NFA, which can't recognize the balanced parentheses that are part of regex syntax.


Correct, though there are multiple interpretations of what a "regex" is. In the theoretical sense you're correct, though some (most programmers, I'd say) mean "PCRE" when they say regex. And PCREs are pretty powerful, see [1] for example

[1] https://news.ycombinator.com/item?id=9748736


Regex has two different meanings depending on the context.

One is the CS definition of a regular expression matcher for regular languages. The other is an extension of that into non-regular languages, typically PCRE-compatible.



So, given the much-discussed limitations of regexps and the desire to parse context-free grammars, my question is: why are we still using regular expressions? Or rather, why isn't there something as easy to use as regular expressions that can process context-free grammars?


I have been a happy user of PEG grammars for a while:

https://en.wikipedia.org/wiki/Parsing_expression_grammar

In particular this lua implementation:

http://www.inf.puc-rio.br/~roberto/lpeg/re.html

It lets you write stuff like this:

      list <- (name s)*
      name <- [a-z][a-z]*
      s <- %s*
Features:

- Looks like EBNF

- Regular expressions are a subset

- Fast parser.


There are Definite Clause Grammars (which can also represent context-sensitive languages) but they are a Prolog thing:

https://en.wikipedia.org/wiki/Definite_clause_grammar#Exampl...


Something like Perl6 grammars[1], or maybe Rosie Pattern Language[2]? Of course Perl6 regexes also go well beyond regular expressions, and I suspect they could be used to match context-free grammars if pressed hard enough. Both P6 grammars and RPL are based on parsing expression grammars, and there are also tools/libraries for many other languages based on PEGs. But now you are entering in the scary realm of parsers and parser generators and all that jazz, and can debate how easy to use they really are.

[1] https://docs.perl6.org/language/grammars

[2] https://developer.ibm.com/open/projects/rosie-pattern-langua...


PEGs are amazing precisely because they're as easy to use as, if not easier than, common regular expression syntax. PEGs are literally the same as regular expressions except 1) alternations are ordered, 2) zero-width assertions are formalized, and 3) quantifiers match greedily. This is effectively the same behavior as the Perl-compatible regular expressions with which most people are familiar.

Many PEG engines, especially for dynamic languages, permit grammar composition using first-class variables. That might be a small barrier to people more familiar with the terseness and conceptual simplicity of regular expressions as string'ish values. But it's fairly trivial to implement the latter using PEGs. For example, LPeg provides a small auxiliary module for doing that: http://www.inf.puc-rio.br/~roberto/lpeg/re.html

Also, Rosie seems amazing. I've not yet had the opportunity to make use of it, but I attended a presentation of Rosie by the author at a Lua workshop which left me very impressed.


I don't know if it's possible to do the "string describing a set of strings" thing that regex does with more powerful parsing while maintaining the ease of use. CFGs make heavy use of named sub-languages (productions). A regex-style parser would at least have to have some kind of recursion-inducing metacharacter.


How about this for ease-of-use?

  sentence --> noun_phrase, verb_phrase.
  noun_phrase --> det, noun.
  verb_phrase --> verb, noun_phrase.
  det --> [the].
  det --> [a].
  noun --> [cat].
  noun --> [bat].
  verb --> [eats].
Reads -and behaves- as BNF. Strings in []'s are terminals, the rest are nonterminals.

It's directly executable as a logic program (it's Prolog syntactic sugar).


That looks great, but it's hard to position it as a regex competitor. I should be more explicit about what I mean by "regex-style": The cool thing about regex that makes them so approachable is that they kind of look like the thing they're describing/matching. Your thing here mostly does not.


"My thing here" is a grammar so it describes a class of strings. Perhaps this is why it looks to you unlike what it describes?

You can make a DCG rule as specific or as general as you like. As a for instance, this is a vim regex I retrieved from my recent history:

  \[13\/13,15\/12,24-24]
You could write this like so in DCG notation:

  s --> ['[',13,'/',13,',',15,'/',12,',',24,-,24,']'].
And that would match the string "[13/13,15/12,24-24]", no less, no more.

DCGs are Turing-complete, so you can go all the way from programs to finite automata when you write a pattern to match.

They're not really a regex competitor. They were invented in the '70s as a formalism for context-free grammars to be used in representing natural language. They fit right into the logic programming language Prolog that was created soon after (and by some of the same people).


Well, we started this subthread looking for alternatives to regex, so if DCGs aren't in the running I'm not sure what your point is. Also, I happen to believe Turing-completeness is a misfeature for this job.


The OP asked for "something as easy to use as regular expressions that can processes context-free grammars". You requested ease-of-use. I proposed DCGs. They are not an alternative to regexes in the sense that regexes can't represent anything beyond regular grammars, including CFGs. They can be as simple or as complicated as you want them and they are easy to use.

I'm not sure what is a "misfeature". What do you mean?

Edit: Apologies if I sound too terse. I'm confused by the terminology of "competitor", "alternative" etc. Are we in some kind of competition to find a technological solution that will take some prize? If so, I'd like to know the rules before I commit to any solution. What exactly are we trying to achieve here?


My interpretation of "as easy to use as regular expressions" is that the desired... formalism be usable in the same contexts and with the same casual ease as regex. Think Unix command lines, hairy ad-hoc Python parsing scripts, etc.

One of the things that makes regex so well suited to that role is that a string is a regex that matches itself, and you can iteratively add sophistication from there. At the very least, you would want to maintain (or improve, Cthulhu knows there's plenty of room) that incremental quality in any proposed replacement, while increasing its power.

It looks like with a DCG, you have to know you're writing one up front, and have to think about what you're parsing at a much more abstract level. The average sysadmin could probably not casually toss one off for a log parsing task. If there's an alternate syntax that gets around that problem, then I'm interested. OTOH...

Re "misfeature": one of the cool things about existing parsing formalisms is all the known terminating, mostly (all?) polynomial algorithms for analyzing them. If DCGs are Turing complete, they don't have those algorithms. Turing completeness is not usually a property I want in my parser. That's what I meant by saying it's a misfeature.


I don't quite understand what you mean by "a regex that matches itself". Could you explain? What you describe sounds more like a quine.

Regarding Turing completeness and termination- the DCG formalism (thank you) itself is expressive enough to represent anything from finite automata to UTMs, but that doesn't mean that every grammar you write using DCG notation is Turing-complete. So, for instance, if you write a right-regular grammar as a DCG, that DCG will not be Turing-complete, it'd just be a right-regular grammar.

The DCG examples I gave above are a CFG grammar for a fragment of natural English and a single-rule grammar that matches a single string. Those are definitely not Turing-complete and parsable in polynomial time.

The difference with regexes is that you can't express CFGs or above using regexes, but you _can_ represent both regexes and CFGs using DCG notation.

>> If there's an alternate syntax that gets around that problem, then I'm interested. OTOH...

May I make a personal comment? I think your insistence on competitive language, like "I'm [not] interested", "competitor to regexes" etc., is a case of déformation professionnelle. You sound just like a software developer hyper-focused on finding tools to maximise productivity in the office, and nothing else.

You should perhaps consider the possibility that what we are discussing here goes a bit beyond a product that you can package and sell as an alternative to a popular tool. I mean, personally, when I realised that DCGs are executable grammars that can be run as both recognisers and generators- well, let's say it shifted my understanding of what is possible to do with a computer and a programming language.

In any case I don't see the complexity you seem to see in DCGs. Like I say above, they're basically BNF. I struggle to think of a sysadmin worth her salt (so, one who knows perl, eh?) who would sweat it to write and read BNF. The kind of sysadmin I have in mind, the problem would be to drag them away from the keyboard, once they started writing a DCG to parse a log file- and realised they could write one to parse _all_ log files ever.


"foo" is a string. It's also a regex that matches exactly the string "foo".

Regarding Turing-completeness, the entire point is about the termination of algorithms that examine parsers, not the runtime of the parsers themselves.

I also don't know how many times I need to point out that the reason I'm focusing on regexes is that that was the context for this conversation. Under any other context I'm quite interested in new parsing techniques.

Lastly, I would point out that regexes have uses far beyond a certain arbitrarily decided set of sysadmins and developers "worth their salt". Even if it's true that a meaningful subset of sysadmins are capable of writing basically sound BNF (which is not something I would count on even for CS grads, but maybe you work with smarter people than I do), there are lots of other people who could use a little more parsing power if it was offered in the right way.


And this can be used to generate sentences as well as parse them. ;-)


Indeed. I tend to forget how counter-intuitive this is.

With DCGs you get your recognisers for nothing and your generators for free. As I like to say.


This is a perfect example of an insight which leads down a rabbit hole of the intertwined complexity of our concepts, when you try to understand why a regexp for validating regexps is actually much simpler than a regexp for validating email addresses.


It depends what you mean by "validating email addresses".

A regex that tests if a string looks like a valid email address is simple. Forget the ancient RFCs; a real-world email address is in the form `mailbox@domainname`, which is not so difficult to test for with a bit of care.

However, testing if the email address is a valid mailbox is harder, and indeed impossible using regex alone. The domain name can be validated using standard domain tools, but in practice the mailbox can be anything that the server will route to a valid mailbox. The only way to validate it is to send an email.
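Something like this is usually enough for the "looks like an address" half (a sketch, deliberately loose rather than RFC-grade):

  import re

  # one @, no whitespace, and something dot something after the @
  EMAIL_SHAPE = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')

  print(bool(EMAIL_SHAPE.match('user@example.tech')))  # True
  print(bool(EMAIL_SHAPE.match('not an email')))       # False

Whether user@example.tech actually receives mail is a different question entirely.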


Even then, there is no way to determine if there is a ‘catch-all’ address configured, or some email tarpit.

I have a domain w/ a .tech TLD, and quite a few frontend JS validators do not accept my email as a valid address (government sites, some banks).


"Regular Expression Matching Can Be Simple And Fast" [1] is worth a read (by Russ Cox of Bell Labs and Go fame).

[1]: https://swtch.com/~rsc/regexp/regexp1.html


A regular language for writing regular language recognizers is an interesting thought exercise.


Don't use a regex to do it. Most languages have a way to catch runtime errors. Stick the regex in a variable, create a dummy input, and run that block with a catcher around it. If the catcher catches a runtime error, you have a bad regex. If it does not, you have a valid regex.
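In Python, for instance, compiling alone is enough to surface the error (a sketch):

  import re

  def is_valid_regex(pattern):
      try:
          re.compile(pattern)
          return True
      except re.error:
          return False

  print(is_valid_regex('(a|b)*'))  # True
  print(is_valid_regex('(a|b'))    # False: missing closing paren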


Another way to phrase the question: Are regexes self hosting?


TL;DR: No. But there exist languages built on top of regular expressions (notably PCRE) that can. They can't validate themselves, though - turtles all the way down to Gödel.


Plenty of parser formalisms can define their own syntax in their own syntax. As mentioned elsewhere, C compilers can parse their own source code. Gödel has nothing to do with it.


it would become conscious


[flagged]


That does not really apply. A compiler can compile itself, it can check if its own code is valid.


The fact that someone might make a mistake in doing something does not show it cannot be done. More generally, you seem to be mistaking validating a program's source for the question of whether it performs its intended purpose. These are different things, and attempting to conflate them will only lead to confusion.


> you seem to be mistaking validating a program's source for the question of whether it performs its intended purpose.

In general, that is a useful distinction to make, but you forgot about the edge case where the distinction is meaningless.

A self-hosting compiler's intended purpose is to validate its own source code.


Then your mistake appears to be in failing to see that your perceived edge case does not invalidate the first sentence of my reply. If the sole purpose of a parser is to syntactically validate its own source (which is not the case for a compiler's parser, by the way, not even if we expand 'its own source' to 'arbitrary input'), then if it does that correctly, that's all there is to it.


Gödel's incompleteness theorem only applies to sufficiently powerful formal systems.


Indeed. Type 0 Grammars are the most powerful grammars we have.

https://en.wikipedia.org/wiki/Chomsky_hierarchy#Type-0_gramm...

For this type of grammar, Gödel's incompleteness theorem is equivalent to the halting problem.


And which programming language did you have in mind?


All of them.

"Type-0 grammars include all formal grammars. They generate exactly all languages that can be recognized by a Turing machine."


I see - you have the complexity relation the wrong way round.


Are you sure you aren't mistaken? :-)


Go ahead - show me wrong.


I have already done that; what I am unsure of is how to go about convincing you.

With compilers, I just fix the bug myself. With humans, I have to convince the human to self-correct.

That's a common problem in distributed systems: leader election.


It is a general rule that those who avoid answering a question do not have an answer, and this is no exception. Here, you completely misunderstand Chomsky’s hierarchy: by your inverted-hierarchy argument, the simplest regular language would be complex enough that incompleteness would be an issue in its validation.


Well, this looks like an abuse of Cunningham's law, but I'll bite.

It is a general rule that general rules have exceptions. And you have (incorrectly) asserted that this is not an exception.

Q.E.D.

Even the most powerful languages (Type 0 in the hierarchy) cannot solve the halting problem, which is equivalent to Gödel's incompleteness theorem.

https://www.scottaaronson.com/blog/?p=710

If a Type 3 grammar can recursively prove its own correctness, it's not a Type 3 grammar!

But if you desperately want to be right, I will happily lie to you (you are right, I am wrong), so I can move on with my life.


Your second paragraph is just a direct begging-of-the-question.

More significantly, verifying that sentences are well-formed in some language is not the same as proving the language’s correctness.

As a response to your argument, consider this regular language:

A


Curry-Howard isomorphism.

Proving that sentences are “well-formed” is the same as implementing an algorithm which takes a string and returns a Boolean.

You can call this function “is_well_formed?”

I await your proof in the regular language you have specified above.


Please also stop.


Please stop.


A self-hosted compiler is definitely not a case of the liar’s paradox, so I don’t think this applies. In fact, it doesn’t matter what the parser is written in; at some point you do have to trust that it actually follows the spec, probably through testing.


So what you have to trust is that the spec doesn't produce lies? ;)

FYI. I am just having some philosophical fun with GIGO.

Ultimately, computers are just instruments which amplify human intention; it's not the computer's job to be the ultimate moral arbiter of our words.

Gödel's incompleteness theorem already tells us that a system cannot prove its own correctness.


The "liar" only says "I can proofread Essays" and not "I am lying".


The "liar" is saying that the Essay is free of errors. (To the best of the "liar's" ability to detect them).

The liar's knowledge is incomplete.

https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_...


Free of English errors. Not semantic errors. We don't even know it's a "liar".


Precisely, but as far as a compiler/interpreter is concerned, syntax is semantics.

You can't codify moral judgments.


Not sure if you are trolling, but here we go...

In the case at hand, correctness of the validator expression V clearly means "V determines well-formedness of any regular expression", which is clearly not implied by "V is well-formed" (a much weaker statement, because ".*" is well-formed but matches everything). Therefore, when applying V to itself, we only learn whether a weak requirement for V's correctness holds.

Similar, perhaps, to validating whether a given number is a Gödel number of a well-formed logical statement rather than assessing the verity of the logical statement it encodes.

Also, I am not saying whether it is or is not possible to build such a regular expression. Rather, the question just doesn't tick the boxes of either the Liar's Paradox or Gödel's Incompleteness result -- contrary to what was suggested. So you could still be right, but for different reasons.


You are axiomatically assuming the proposition "V determines well-formedness of any regular expression" to be true.

I am asking you to prove that. Constructively.


It's not possible to do this for regular expressions, but it is possible to do it for context-free grammars. You can write a context-free grammar in Backus-Naur form that recognizes all context-free grammars in Backus-Naur form:

   <syntax>         ::= <rule> | <rule> <syntax>
   <rule>           ::= <opt-whitespace> "<" <rule-name> ">" <opt-whitespace> "::=" <opt-whitespace> <expression> <line-end>
   <opt-whitespace> ::= " " <opt-whitespace> | ""
   <expression>     ::= <list> | <list> <opt-whitespace> "|" <opt-whitespace> <expression>
   <line-end>       ::= <opt-whitespace> <EOL> | <line-end> <line-end>
   <list>           ::= <term> | <term> <opt-whitespace> <list>
   <term>           ::= <literal> | "<" <rule-name> ">"
   <literal>        ::= '"' <text1> '"' | "'" <text2> "'"
   <text1>          ::= "" | <character1> <text1>
   <text2>          ::= '' | <character2> <text2>
   <character>      ::= <letter> | <digit> | <symbol>
   <letter>         ::= "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
   <digit>          ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
   <symbol>         ::=  "|" | " " | "!" | "#" | "$" | "%" | "&" | "(" | ")" | "*" | "+" | "," | "-" | "." | "/" | ":" | ";" | ">" | "=" | "<" | "?" | "@" | "[" | "\" | "]" | "^" | "_" | "`" | "{" | "}" | "~"
   <character1>     ::= <character> | "'"
   <character2>     ::= <character> | '"'
   <rule-name>      ::= <letter> | <rule-name> <rule-char>
   <rule-char>      ::= <letter> | <digit> | "-"
I got this from Wikipedia: https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form#Furth...


Sure. Total functional programming (provably terminating code) comes at the cost of Turing-completeness.

The choice exists.

https://en.wikipedia.org/wiki/Total_functional_programming

Observe, though, that BNF is just a notation; parsers for BNF are not implemented in BNF. Bison and Yacc, for example.


:-)


So we agree then that when the system says "I am free of errors", some (most?) of the time it's a lie? ;)

All models are wrong - some are useful.


Why would you want a regular expression to detect a valid regular expression?


If you could do this with a proper regex: they’re easy to parse and don’t have a lot of structure, so you can do neat things like produce generators from them. Now, you can get your regex for regexes to produce new regexes! This can be useful if you want to do generative testing.
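
As a sketch of that idea in Python, hand-rolled over a tiny AST rather than real regex syntax (the node shapes are my own invention, purely for illustration):

    import random

    # Tiny regex AST: ('lit', c), ('seq', a, b), ('alt', a, b), ('star', a)
    def generate(node, max_reps=3):
        kind = node[0]
        if kind == 'lit':
            return node[1]
        if kind == 'seq':
            return generate(node[1], max_reps) + generate(node[2], max_reps)
        if kind == 'alt':
            return generate(random.choice(node[1:]), max_reps)
        if kind == 'star':
            reps = random.randint(0, max_reps)
            return ''.join(generate(node[1], max_reps) for _ in range(reps))
        raise ValueError(f'unknown node kind: {kind}')

    # (a|b)c* as an AST; a real tool would parse the pattern first.
    ast = ('seq', ('alt', ('lit', 'a'), ('lit', 'b')), ('star', ('lit', 'c')))
    print(generate(ast))  # e.g. 'acc' or 'b'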


A regular expression is a domain-specific language that defines a language consisting of all the strings it matches.

In principle, this serves as executable documentation.

A regular expression that matches valid regular expressions could be viewed as human-readable documentation of valid regular expression syntax that has the valuable side-effect of being executable.

If, instead, we detect whether a string is a regular expression by compiling it, we may find it very difficult to read the compiler source code.

It is much more work to make an optimized regular expression engine that is also self-documenting than it is to write a regular expression.


I think we have different definitions of human readable documentation. I'd find normal code, with or without appropriate comments, far more readable than a mess of regex symbols.


Regular expressions are not my favourite either, but one thing I will say is that normal code becomes difficult to read over time as features are bolted on and optimizations are added to address performance concerns.

Whereas, some form of executable DSL can serve as self-testing documentation indefinitely. DSLs can provide a separation of concerns, where the syntax is designed for readability, and the engine is designed for performance.

Modern regexen are a dumpster fire, of course, but you asked why someone might want to, not whether this is something everyone might want to do.


Probably to check if a regular expression is valid


Couldn't you just compile it and check for return value?


Yes, checking to see if a regular expression is valid is useful but that wasn't what I was asking. Why specifically would you want to use a regular expression for this job?


The simplest explanation is that you already have a lot of functionality implemented this way.

Still, the library you are using probably exposes a validation function...


Maybe because you can :/


There isn't even a regular expression to detect a valid email, bruh.


I did enjoy the comment with the Xhibit/Pimp my Ride reference.



