More precisely, it is arbitrarily deeply nested clauses that cannot be parsed, on account of true REs being unable to either use recursion or keep count of how deep they are.
Can regex engines actually process arbitrarily deep nestings themselves though?
It seems to me it would start getting computationally expensive very quickly to match with those kinds of regexes, so is 'arbitrarily deep' a practical concern on today's hardware?
Yes, regex engines that are designed to do so run in linear time on the size of the regex (as well as linear time with respect to the length of the input).
It can process arbitrarily deep nestings? I'm not familiar with it, but I had a look and I can't see anything in there that would allow it to do that. Can you point to syntax that allows it?
In general, regexes that run in linear time are based on DFAs (as opposed to NFAs, which is what most use).
The Chomsky hierarchy of languages has DFAs at the bottom (aka regexes - least powerful, can't handle recursive structures), then a DFA coupled with an infinite stack for memory (aka LR parsers - can't handle context-sensitive grammars, also runs in very close to linear time), and finally a DFA coupled with an infinite tape (aka a Turing Machine - can do any sort of computation).
In each of these cases the DFA (regex) is essentially a computer program and all they are doing is allowing it to store stuff on various kinds of memory (none, stack, tape with arbitrary movements). An interesting corollary of this is any computer program can be expressed as a DFA.
The GP is presumably citing it as an example of a linear time regex, and not as something that can parse arbitrarily deep nestings. Note that "linear time regex engines" is a characterization that holds the size of the regex constant, such that "linear time" is with respect to the input only. The actual worst case time complexity of such regex engines is O(m * n), where m ~ len(regex) and n ~ len(input). So even with a linear time regex engine, if you bloat the size of the regex, you'll make searching slower. But, it will still be done linearly with respect to the size of the input.
> In general, regexes that run in linear time are based on DFAs (as opposed to NFAs, which is what most use).
No. Please stop spreading the misuse of these terms. DFAs and NFAs have very precise meanings. It's one thing to call things regexes that aren't actually regular, but DFAs and NFAs should retain their precise theoretical meaning. A linear time regex engine such as Rust's `regex` crate uses both DFAs and NFAs in its implementation.
The Perl and PCRE folks (along with Jeffrey Friedl) have bastardized this terminology. PCRE for example claims to provide a "DFA" regex engine in its API, but it's actually an NFA. The main implementation inside PCRE (and Perl) is backtracking oriented, and carries with it much more power than an NFA. If it didn't, then the NFA could be executed in linear time, because NFAs and DFAs have equivalent power.
> An interesting corollary of this is any computer program can be expressed as a DFA.
No they can't. You're confusing terminology quite a bit here.
There is an importance to theoretical considerations that goes beyond their immediate practical applications. Ultimately, it comes down to how well one understands what is going on.
Having said that, I think there are practical applications. Suppose you were unaware of this result and thought a true RE could validate your input. A more knowledgeable attacker might be able to craft an input that exploits the inherent limitations of your validator.
Also in practice, attempting to write a true RE to parse a language that allows restricted recursion rapidly becomes complicated. It is the wrong tool for the job.
Just considering the language of balanced parentheses, the only memory needed is a single integer to track depth. If we see a "(", increment. If we see a ")", decrement. If we never go negative and end up with zero, that's a balanced parentheses expression. (To be clear, the parser I've described is not a regex parser.)
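For illustration, a quick Python sketch of that counter (mine, simplified; nothing here is from the thread):

    def parens_balanced(s: str) -> bool:
        depth = 0                      # the single integer of "memory"
        for c in s:
            if c == '(':
                depth += 1
            elif c == ')':
                depth -= 1
                if depth < 0:          # a closer with no matching opener
                    return False
        return depth == 0              # zero at the end means balanced

    print(parens_balanced("(()())"))   # True
    print(parens_balanced("())("))     # False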
Adding the rest of the regex machinery makes this a bit more complex, but in general, if the computer can store the string, it can check whether it's a valid regular expression.
As you say, this is not a regex parser, but some readers may not know the theorem that the regular languages are exactly those which can be parsed by a Turing machine with constant memory, and a parenthesis count can require log n bits.
In fact, any Turing machine using o(log log n) space recognizes a regular language (so there is an equivalent machine using O(1) space).
As typically defined academically, yes. In the real world things might be more complex, but I think it would hold true for many implementations.
Academically, a regex is a string of the following (a sketch of a checker for this grammar follows the list):
- character literals
- "epsilon" characters (meaning empty string)
- "+" characters (which means either what we see before the + or what we see after the +, in real-world regex these are usually represented as | instead of + as well as used implicitly in constructs like [a-zA-Z])
- "*" characters (same meaning as in real-world regex)
- And parentheses to allow controlling the order of operations
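To make that concrete, here is a rough Python sketch (my own, and deliberately permissive about what counts as a literal) of a recursive-descent validity checker for exactly that textbook syntax. The point to notice is that the recursion, i.e. the call stack, supplies the unbounded memory that a true regular expression lacks:

    def is_valid_textbook_regex(s: str) -> bool:
        # Grammar: expr := term ('+' term)* ; term := factor+ ;
        #          factor := (literal | '(' expr ')') '*'*
        pos = 0

        def peek():
            return s[pos] if pos < len(s) else None

        def expr():
            nonlocal pos
            term()
            while peek() == '+':       # alternation
                pos += 1
                term()

        def term():
            nonlocal pos
            factor()                   # concatenation: one or more factors
            while peek() not in (None, '+', ')'):
                factor()

        def factor():
            nonlocal pos
            c = peek()
            if c is None or c in '+*)':
                raise ValueError("expected a literal or '('")
            if c == '(':
                pos += 1
                expr()
                if peek() != ')':
                    raise ValueError("unbalanced '('")
            pos += 1                   # consume the literal (or the ')')
            while peek() == '*':       # Kleene star
                pos += 1

        try:
            expr()
            return pos == len(s)
        except ValueError:
            return False

    print(is_valid_textbook_regex("(a+b)*c"))   # True
    print(is_valid_textbook_regex("(a+b*c"))    # False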
In other words, "if we build a state machine which is different from a regex, it will perform differently than a regex would." That's trivially true, but is it interesting beyond restating "there is no Silver Bullet"?
Technically we aren't discussing the regex itself here but the syntax for defining a regex.
You could use a different syntax for defining the same regex, and that syntax could have different properties. For instance you could use parentheses like Racket, where "(...)" and "[...]" have the same meaning but "(...]" is not well-formed syntax (if I remember my Racket correctly). With such a syntax for defining a regex, a counter would no longer suffice to decide if a string is a regex.
You could have a stack of (paren-type, integer) pairs. If an opening paren has a different type than the top pair on the stack (or the stack is empty), push a new pair with the integer set to one; if it has the same type, increment that pair's integer. A closing paren of the same type as the top pair decrements it (anything else is an error), and when the integer goes to zero, pop the pair. At the end of the expression, you have to have an empty stack.
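A simplified Python sketch of that check (mine; it pushes one entry per opener instead of run-length (type, count) pairs, which comes to the same thing):

    PAIRS = {')': '(', ']': '['}

    def brackets_balanced(s: str) -> bool:
        stack = []
        for c in s:
            if c in '([':
                stack.append(c)
            elif c in PAIRS:
                if not stack or stack.pop() != PAIRS[c]:
                    return False       # e.g. the "(...]" case, or a stray closer
        return not stack               # nothing left unclosed

    print(brackets_balanced("([()])"))   # True
    print(brackets_balanced("(...]"))    # False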
When I check brackets by hand I do actually get my hands out, and when I say open I uncurl a finger, and when I say close I furl a finger. "Open, open, open, close, open, close, close, close". Provided I have no fingers on display at the end, then the brackets are balanced. They might not be in the right place, but at least they are balanced. If I find myself starting with a close, then I close the lid and go and do something else.
I'm sure some sort of rather cheap algorithm falls out of the above. It won't guarantee correctness in what the balanced brackets actually contain, but they will at least be balanced.
This is true if you want to handle arbitrarily deep nested parens.
But any given string you are asked to validate to see if it is a regex will be a finite length, and contain a finite number of opening paren characters. So its maximum possible nesting depth is known.
And you can construct, fairly trivially, a regex that can validate paren nesting up to a fixed depth.
So, in practice, you could use a regular expression to validate the paren nesting of any given string - if you were allowed to prepare the regex based on the string length, or on running another regex over the string first.
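As a rough illustration of the fixed-depth trick (my own construction, shown with Python's re), you can build the pattern mechanically once you know a depth bound:

    import re

    def bounded_parens(depth: int) -> str:
        # A true (non-recursive) regex accepting balanced parens nested
        # at most `depth` levels deep.
        pattern = r"[^()]*"                              # depth 0: no parens
        for _ in range(depth):
            pattern = rf"[^()]*(?:\({pattern}\)[^()]*)*"
        return pattern

    rx = re.compile(rf"^{bounded_parens(3)}$")
    print(bool(rx.match("(a(b)(c(d)))")))   # True: nesting depth 3
    print(bool(rx.match("((((x))))")))      # False: nesting depth 4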
What I suspect you might not be able to validate is that in a regex, while [a-z] is valid, [z-a] is not.
If you allow generating a regular expression based on string length, you can validate any computable language with a regular expression. It's trivial to just run a solver on all strings of that length and OR the accepted ones together.
> What I suspect you might not be able to validate is that in a regex, while [a-z] is valid, [z-a] is not.
It's fairly trivial to validate things like a-z being valid in a character class while z-a isn't. There are only finitely many legal ranges. You can just list them all, like \[(a-a|a-b|a-c|...|z-z)\].
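A sketch of that enumeration in Python (hypothetical, and restricted to single lowercase ranges just to show the shape of it):

    import re, string

    lc = string.ascii_lowercase
    ranges = "|".join(f"{a}-{b}" for i, a in enumerate(lc) for b in lc[i:])
    rx = re.compile(rf"\[(?:{ranges})\]")   # \[(?:a-a|a-b|...|z-z)\]

    print(bool(rx.fullmatch("[a-z]")))   # True
    print(bool(rx.fullmatch("[z-a]")))   # False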
For a fixed input length all languages are decidable by regular expressions. Trivially, the union of all valid input strings of that length, like "(abc|def|...)".
The problem is that "regular expression" is equivocal.
Sometimes you hear "regular expression" used to mean the language of things accepted by some Finite State machine. Lots of folks on HN took a Theory of Computation class and learned about these in that class. In that meaning, yes, you are right.
But sometimes in professional conversations you hear "regular expression" used to mean the strings that can be matched using, say, Perl's regular expressions. I know I have heard it used that way often. Since Perl regular expressions can accept anything a Turing machine can accept, the answer to the question in this case is yes.
Very few people know that Perl's regex engine (since 5.10) supports recursion. And recursive matching cannot perform recursive captures, which is why at best it can only be used to validate a string, not parse it, making it not particularly useful in practice. For these reasons I would never assume that someone using the term "regular expression" inclusively of Perl would have in mind such capabilities. IME, such a loose definition simply implies things like zero-width assertions, backcaptures, etc.
Here's a proof-of-concept Perl 5.10+ JSON validator I came up with for a presentation to Perl engineers introducing PEGs and Lua's LPeg module. Of the 20-30 people in the room, I doubt anybody in the audience knew this was even possible with Perl.
Note: I was trying to fit it all on a single slide, so the definition for String doesn't handle escaped characters. There may be other deficiencies. I copy+pasted this from the PDF slide deck as I can't find the original Beamer source. Any broken spacing and Unicode substitutions probably aren't original.
Very underrated comment. It's a very good point and good food for thought, although I think it is clear that the question was meant to be about the usual textual representation using the simplest RE language (without extensions).
> (famously) there is no regex to detect balanced parentheses
God, I remember running into this wall with an in-company domain-specific language that used regex as its tokenizer/parser.
Adding support for nested structures required us to untangle the whole thing and rewrite the regex into explicit algorithms. Fortunately the regex was only a few pages long..
In short, I agree with you that regex cannot parse regular expressions completely. Would be happy to be proven wrong, though!
Grouping is very useful, but I think you can get pretty far in practical patterns with a limit of k-nesting for k=0..3 or so. That pattern language would still be regular.
Right, but the parent is thinking of the question "is the language of valid regular expressions itself regular?", which is a very different question, and the answer is of course no by an easy application of the pumping lemma. However, as shown in the SO answer, they "cheat" by using some features that don't align completely with the mathematical definitions.
That answer feels entirely wrong to me. As pointed out, it may be an XY answer: It presumes the person posing the question wants to validate regular expressions.
But with something so blatantly self-referential, it actually feels unlikely to me that what they want to do is validate regular expressions. My guess is that they are generally curious about whether Regexen (PCRE or strict regular expressions) are powerful enough to validate Regexen (again whether PCRE or strict).
XY answers are good for avoiding a lot of unnecessary yak-shaving/accidental complexity of a bad solution. But the conversation around whether we are talking about recognizing strict Regexen or PCRE Regexen, and in turn whether we are using strict Regexen or PCRE Regexen to recognize them is not accidental complexity or yak-shaving, it is intrinsic to understanding the nature of the problem and solution spaces.
I too find the answer humorous for the "enterprisey" reference, but I think it would be a very bad answer if we are judging it strictly on the basis of its value.
"We could do it in a easy two liners that make sense, but, and hear me out here, what if we instead built an entire 10k LOC system that would barely work half of the time, using dozens of man hours for it ? And then never re-use it nor train anyone about it so that it becomes a mythical black box after your contract expires ?"
"Well, I gave you my recommendation but hey at the end of the day you're the customer and I'm paid by the hour, so whatever you say boss"
I now consider myself very lucky to be successful enough to not have to deal with projects like that, but back when I wasn't, I paid for quite a few things I wanted with those sorts of stupid decisions.
My favorite part of it is the moderators' note at the end:
> Moderator's Note
> This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look - there are no problems with its content. Please do not flag it for our attention.
The big problem with the content is suggesting to use an XML parser to parse HTML. That might often work, and several XML parsers have some sort of HTML "mode", but in general, no, not really.
HTML documents are frequently invalid XML.
HTML documents lack an XML declaration, they don't close all tags, etc. They might even lack a root element. Someone mentioned this in a comment on SO, but it was (as far as I could see) not really addressed or answered.
HTML 5 has its own parsing rules, specified at [1].
However, I don't know of any implementation of these outside of browsers. I normally use Beautiful Soup [2] (for Python) or HTML Agility Pack [3] (for .NET), and while I don't think these implement the exact standard, they're easier than struggling with an XML parser for HTML "in the wild".
The question title is ambiguous but the question tags include both HTML and XHTML. Self-closing `<element />` is an XML-ism which in HTML is treated the same as `<element>` but it was still fashionable to use them in HTML when XHTML was at its height.
From that answer: "You can't parse [X]HTML with regex".
I've always wondered how that came about. How come very early in the development of the web no one important enough for people to pay attention to them said, "Hey... wait a second. If this thing becomes popular, people are going to really want to process web documents with their usual text file processing tools and techniques. We really oughta make this thing reasonably easy to process with grep and sed and awk and Perl and such"?
The problem isn't with the surface syntax of HTML. Anything else, like S-expressions or JSON or whatever would have the same problem. The problem is arbitrary nesting of elements. No language with nesting can be parsed with regex (modulo various copouts mentioned elsethread). But such nesting is useful and necessary for a document language like HTML. This goal is important; "we must be able to use awk" isn't.
In SO's hall of fame of regular members being fed up with a common yet stupid question always coming back up, it's up there alongside the thread about doing addition with jQuery.
The problem with the answer is it is wrong. The question is about identifying start-tags in XHTML. This is a question of tokenization and can be solved with a regular expression. Indeed, most parsers use regular expressions for the tokenization stage. It is exactly the right tool for the job!
Furthermore, the asker specifically needs to distinguish between start tags and self-closing start tags. This is a token-level difference which is typically not exposed by XHTML parsers. So saying "use a parser" is less than helpful.
If you were looking for the reason why a regex cannot parse HTML, it is because HTML has matching nested tags and regex parsers are finite state machines (FSM).
What this means is that a regex parser is like a goldfish. It only knows about the state it is currently in (what it just read) and which possible states it may transition to (what is legally allowed to come next). The fish never remembers where it was before; there is no option to have the legality of a transition depend on what it read _before_ the current state. But this memory function is a requirement to recursively match opening tags to their closing tags - you need to keep a stack of opening tags somewhere in order to then cross off their closing tags in reverse order. So regex cannot parse HTML.
>What this means is that a regex parser is like a goldfish.
Like a drunk goldfish, or a sober goldfish?
'A method to study short-term memory (STM) in the goldfish.'
'Twenty-one common goldfish (13-15.5 cm long) were randomly divided into alcohol (A) and nonalcohol (NA) groups and were trained in an alcohol solution of 400 mg/100 ml or in water, respectively. All alcohol fish were placed in an alcohol solution of 400 mg/100 ml for 3 hr before training in the same alcohol concentration. Fish were trained on a position discrimination task for 2 consecutive days. The door used for training was that opposite to each fish's spontaneous preference. Savings in relearning on Day 2 was taken as a measure of long term memory strength. Only fish which reached criterion on both days were immediately given 10 forced reversal trails in the opposite direction (i.e., a fish trained on right door was forced to choose the left door.) A and NA subjects were then tested after a 5 min (STM) delay, respectively, in a free choice situation for 10 trails (i.e., neither door was blocked). The results suggest that alcohol facilitates the STM of the forced reversal information.'
No, you don't need more than a regular expression. If you want to extract elements, i.e. match start tags to the corresponding end tags, then you need a stack-based parser. But just to extract the start tags (which is the question) a regular expression is sufficient.
The original question is a question about tokenization, not parsing, which is why a regular expression is sufficient.
Perhaps something like "([^"]*)" could skip what is inside the string literal. Unless there is "<input" in the string literal, then where you start parsing becomes very important.
That pattern would indeed match a quoted string. I don't see how it would matter if the quoted string contains something like "<input". It can contain anything except a quote character.
Sure. But that would be true for any parsing technique. No parser known to man would be able to produce a valid parse if you start it in the middle of a quoted string!
theoretically I believe an end tag really requires a valid start tag.
anyway, you can probably answer any number of simple questions about a bit of HTML using regex, but as the code grows to handle more use cases there will come a time when the solution breaks down and the code that was written to handle all the previous uses will need to be rewritten using something other than regex.
You have to distinguish between the different levels of parsing.
Regexes are appropriate for tokenization, which is the task of recognizing lexical units like start tags, end tags, comments and so on. The SO question is about selecting such tokens, so this can be solved with a regex.
If you have more complex use cases like matching start tags to end tags, you might need a proper parser on top. But you still need tokenization as a stage in that parser! I don't see what you would gain by using something other than regexes for tokenization? I guess in some extreme cases a hand written lexer could be more performant, but in the typical case a regex engine would probably be a lot faster than the alternatives and certainly more maintainable.
I know it is possible to write a parser without a clear tokenization/parsing separation - but it is not clear to me this would be beneficial in any way.
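As a sketch of what such regex-based tokenization can look like (my own example in Python, assuming reasonably well-formed XHTML with quoted attribute values), a single pattern can pick out start tags and tell self-closing ones apart, which is what the original question asks for:

    import re

    START_TAG = re.compile(r"""
        <\s*([A-Za-z][\w:.-]*)               # 1: tag name
        ((?:\s+[\w:.-]+\s*=\s*"[^"]*")*)     # 2: attributes, quoted values
        \s*(/?)>                             # 3: "/" present if self-closing
    """, re.VERBOSE)

    sample = '<p class="x"><img src="a.png" alt="<b>" /></p>'
    for m in START_TAG.finditer(sample):
        name, attrs, self_closing = m.groups()
        print(name, "self-closing" if self_closing else "start tag")
    # p start tag
    # img self-closing  (the "<b>" inside the attribute value is consumed
    #                    as part of the match, so it is not reported as a tag)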
Had it not gotten extremely lucky, it would have a very negative score. Anything intended to be funny on StackOverflow doesn’t go over well very often.
The answer was made in '09. StackOverflow was more accepting of some humour at the time. And it is making a valid point: regex isn't the right tool for this job.
More specifically, a stack lets you keep track of nesting. See an opening tag, push something onto a stack. See a closing tag, pop the stack. If the stack is empty at the end, the tags match.
Parsing XHTML in real life is of course much more complicated than this, but this is the basic idea.
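A toy version of that push/pop idea in Python (mine, assuming the input has already been tokenized into tags; real XHTML parsing is far messier):

    def tags_balanced(tokens) -> bool:
        stack = []
        for kind, name in tokens:      # kind: "open", "close" or "selfclose"
            if kind == "open":
                stack.append(name)
            elif kind == "close":
                if not stack or stack.pop() != name:
                    return False       # mismatched or stray end tag
        return not stack               # empty stack at the end: all matched

    print(tags_balanced([("open", "div"), ("open", "p"), ("close", "p"),
                         ("selfclose", "br"), ("close", "div")]))   # True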
I think there are a lot of knee-jerk answers because people see "XHTML" and "regex" in the same sentence and immediately think "not possible".
But the actual question is clearly not about matching start tags to end tags or building a DOM or anything like that - which indeed would require a stack. The question is about recognizing start and end tags. You can do that perfectly fine with regular expressions - indeed many parsers use regular expressions to tokenize the input before parsing.
Furthermore, the question specifically needs to recognize the difference between start-tags and self-closing tags. A difference which is not exposed by most XHTML parsers as far as I am aware.
Sorry, I misread. Indeed, actually tokenizing text is accomplished with regular expressions (although some parsers don’t need a tokenization pass, but details :).
If I were writing a limited parser, in answer to the narrow question being asked, I wouldn't be using regex at all. It's not suited to this particular problem. (For example it would get caught on things like <a attr="<b>"> which may well be valid input.)
So how would you tokenize without the use of regular expressions? What more appropriate technique would you use instead?
The example you provide is not XHTML so it is not really relevant for the discussion. But in any case, a regular expression has no problem recognizing a quoted string.
> So how would you tokenize without the use of regular expressions?
Since this need doesn't appear to be an everyday one, with clearly defined targets, a simple hand-written lexer isn't hard to write, and will make fewer mistakes than a regex. Just use a scanning approach. As a bonus, you'll still be able to read it in 12 months' time.
Why would a hand-written lexer have fewer mistakes than a regular expression using an off-the-shelf regex engine? They would need to encode the same lexical grammar, so at that level there is the same amount of complexity.
Writing a lexer by hand is just trading ten lines of regex (a widely known declarative DSL) with hundreds of lines of custom code. I don't see how that would be more maintainable in the long run.
The author said at one point that it was written in a bit of a huff of frustration, and if I remember correctly, also after a pint or two. It's not the best answer by any means, but it is one of those small gems of the web, I think.
it's sad that today the stack overflow majority shares your mindset and an answer like that would drown in downvotes or be killed by moderation.
some people like to act super serious all the time, like they're playing a sitcom version of what they think adulthood is, in a quest to be the most boring person on earth, as if that's the goal of human interaction
The problem is it is funny and wrong. Apparently it has given a lot of people really confused ideas about what is possible and what is not possible with regular expressions.
If it had been funny and right I would not have a problem with it.
The job (the question asked) is about recognizing start and end tags in XHTML. These are lexical tokens and therefore regular expressions are a perfectly fine tool for this. Indeed many parsers use regular expressions to perform the lexical analysis (https://en.wikipedia.org/wiki/Lexical_analysis) aka tokenization.
Quoting from wikipedia:
> The specification of a programming language often includes a set of rules, the lexical grammar, which defines the lexical syntax. The lexical syntax is usually a regular language, with the grammar rules consisting of regular expressions;
If you disagree that regular expressions are an appropriate tool for lexical analysis, can you articulate why? And what technique do you propose instead?
That is just a comment and can be easily scanned by a regex. The CDATA syntax does not have any particular meaning inside an XML comment if that is what you are suggesting.
And in any case, neither comments nor CDATA sections nest, so there is no problem handling them at the lexical analysis stage.
As for "get a premade grammar" - what does that mean? Do you mean use a tool like Lex? AFAIK Lex uses regular expressions.
you're missing the point: you cannot nest them, but you can have that comment both around or within a CDATA section, so that regex is starting to look quite complex to maintain, unless you want to have multiple stages of regex applied in sequence, at which point congratulations, you're building a whole parser the hardest way
and no, I mean a tool like antlr or javacc which builds an in-memory syntax tree you can query or observe.
this is different than querying the document as XML, since an empty node is equivalent to an empty element, but they differ in the syntax, so querying the syntax tree allows you to tell them apart because it produces a TAG_SLASH_CLOSE
yeah, and you can have a grammar with a single production using ALL : .* - you're missing the forest for the trees.
regexes are ok for text that's context free, so they're a perfect match for the lexing°. once you need context you're back to wacky hacks, layers and loops of regexes, or a preparser that removes the hard part from the content - basically you're rolling your own grammar, except manually, on the wrong toolchain and on your own.
"can I build my raster engine to draw polygons on screen" "well yes, but it is going to cost a lot of time, be error prone and miss a lot of features from ready-to-use solutions"
°you can actually put token productions in the lexer itself between expressions, so they can already handle more than plain regex even in lexing.
So it seems we agree that for the question the OP is actually asking a regex is the appropriate tool.
Great if you can recommend a tool which can solve the task easier. But saying it is "not possible" is unhelpful and confusing. It is just giving lots of people really confused ideas about what is possible and not possible using regexes.
To use your own example: "Can I build my raster engine to draw polygons on screen?"
Not helpful: "No that is simply not logically possible due to the laws of geometry, and you are an idiot for even thinking such a thing is possible"
Helpful: "Yeas, but it is a lot of work if you want to support anti-aliasing, smooth scaling and so on. Using a library like XYZ would probably save you a lot of work in the long run".
This is a bit out of my wheelhouse, but this feels wrong, or at least naively capable. Like it feels like this sort of reasoning leads to the kind of bugs (depending on what you use the result of the rexex for) that allow for code injection, a la the Equifax hack.
Maybe another HN poster can back me up, or explain why in fact Zalgo is mistaken and CargoCode is correct.
Either way, this sort of complexity is one reason I avoid XML like the plague and keep HTML at arm's length.
This is what I really hate about the Zalgo answer. It instills in people some vague sense that regular expressions are somehow bad, wrong and dangerous. But without any real arguments or context which would allow you to evaluate whether that feeling is justified.
See, this is kinda what I mean. Maybe you can detect tags with regex, but maybe you shouldn't, given the widespread but subtle differences in regex engines.
Perhaps the entire approach of "why are you trying to parse X?" needs to be retraced and re-evaluated.
Without the colon, the parser appears to be interpreting (? as a quantifier applied to (, but ( is not a full expression by itself and therefore cannot be modified with a quantifier.
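For example, with Python's re as the engine (other engines report this differently, but the idea is the same): a bare (? is rejected, while (?: opens a non-capturing group:

    import re

    re.compile(r"abc(?:def)")      # fine: (?: is a non-capturing group
    try:
        re.compile(r"abc(?def)")   # (? starts an extension construct, so
    except re.error as e:          # without a valid character it is an error
        print("invalid:", e)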
We see good examples of "the problem with StackOverflow" here.
The second highest-rated answer is "Evaluate it in a try..catch or whatever your language provides." and it's justified because "Surely the real question is 'how do I validate a regular expression'."
This is a fascinating computer science question and I'm pretty sure the questioner wasn't asking "how do I validate a regular expression" because he would have asked that.
Personally, I think this is a strength of StackOverflow. I think most people read through all of the answers, and this particular one keeps people who know nothing about regex from trying to evaluate regex with regex.
And for those with a deeper understanding, they have an answer that provides the intellectual stimulation they are looking for.
> I think most people read through all of the answers
You're vastly overestimating what most SO users come to the site for. They (and I include myself) just want something that'll work and isn't horrible. Even "not horrible" is something I care about but I know that many devs don't.
If somebody doesn't know the answer to their question in advance, which by definition they don't, it's by no means guaranteed they will know a horrible answer when they see it.
It really doesn't matter what the questioner wanted. The main utility of sites like Stack Overflow is not to give this one person the answer to their specific problem, but to be a useful landing page to people searching for things related to the question later.
A lot of those people are novices when it comes to regexes, and it's useful to make it clear that there's some things that are better solved with other tools, even if you can abuse extended regexes syntaxes to solve them.
The phrasing, especially combined with what time of year it was asked, makes me think it's a homework question from a CS logic course where they have to provide an example showing the answer is "no", and explain why.
The fundamental point is that the programming/software engineering concept of "regex" or "regular expression" is fundamentally different from the mathematical concept of a "regular language".
If you see "regex" without any special context and start thinking about what it can/can't do, then the first association should be with something like PCRE and its capabilities, not something that's bound by the pumping lemma.
The answer is as horrifying as you'd expect, but impressive nonetheless. Recursive regex is really fun and powerful, when it works.
I once made a recursive regex for matching a full name, complete with checking for suffixes, prefixes, an undefined amount of middle names, hyphenated last names, and Scotch-Irish names (Mc-, O'-, etc). Still simpler than this monstrosity.
My first thought seeing this title: "Why, no. Next!"
My second thought: the voice of Linus Torvalds at DebConf 14 saying "Hum… No! Hum… that was quick." [1]
Though this is only speaking about the recursively defined regular expression language, which is infinite and strictly describes regular languages, as defined in computer science lessons at university.
And then, "Why not just try and see if it breaks the provided regex parser since you have one?", and it's actually one of the answers in the link… awesome. I wonder it is has security implications though (are forged regexes exploiting flawed regex parsers a thing?)
Basic classical regular expression syntax without grouping parentheses is trivially regular. With parentheses, the maximum nesting level must be bounded to remain regular because "regexpes can't count".
Maybe I'm taking this a step too far, but don't Gödel's Incompleteness Theorems (https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_...) state that a language cannot define itself? You need a meta language to be able to define and specify a language in its entirety.
In other words, regex can not parse regex in its entirety. It's impossible.
Can a C program parse C source code? A Java program parse Java source code? Yes, they can, so such a general limitation couldn't be the reason regular expressions can't parse themselves.
[M]atching parentheses requires our recognition device to remember how many unmatched open parentheses there are. Since the only way for a DFA to remember anything is to be in one of a set of states corresponding to it, and since the unmatched open parentheses could easily outnumber the available states, we can see that the fundamental limitation of a DFA is that it can store only a finite amount of information (remember the ‘F’ in DFA?). This limitation applies to any string matching task that involves recursive structures or algebraic relationships between substrings. It is why “HTML and regex go together like love, marriage, and ritual infanticide.”
Correct, though there are multiple interpretations of what a "regex" is. In the theoretical sense you're correct, though some (most programmers, I'd say) mean "PCRE" when they say regex. And PCREs are pretty powerful, see [1] for example
Regex has two different meanings depending on the context.
One is the CS definition: a regular expression matcher for regular languages. The other is an extension of that into non-regular languages, typically PCRE-compatible.
So, given the much discussed limitations of reg-exps and the desire to parse context-free grammars, my question is: why are we still using regular expressions? Or rather, why isn't there something as easy to use as regular expressions that can process context-free grammars?
Something like Perl6 grammars[1], or maybe Rosie Pattern Language[2]? Of course Perl6 regexes also go well beyond regular expressions, and I suspect they could be used to match context-free grammars if pressed hard enough. Both P6 grammars and RPL are based on parsing expression grammars, and there are also tools/libraries for many other languages based on PEGs. But now you are entering in the scary realm of parsers and parser generators and all that jazz, and can debate how easy to use they really are.
PEGs are amazing precisely because they're as easy to use as, if not easier than, common regular expression syntax. PEGs are literally the same as regular expressions except 1) alternations are ordered, 2) zero-width assertions are formalized, and 3) quantifiers match greedily. This is effectively the same behavior as the Perl-compatible regular expressions with which most people are familiar.
Many PEG engines, especially for dynamic languages, permit grammar composition using first-class variables. That might be a small barrier to people more familiar with the terseness and conceptual simplicity of regular expressions as string'ish values. But it's fairly trivial to implement the latter using PEGs. For example, LPeg provides a small auxiliary module for doing that: http://www.inf.puc-rio.br/~roberto/lpeg/re.html
Also, Rosie seems amazing. I've not yet had the opportunity to make use of it, but I attended a presentation of Rosie by the author at a Lua workshop which left me very impressed.
I don't know if it's possible to do the "string describing a set of strings" thing that regex does with more powerful parsing while maintaining the ease of use. CFGs make heavy use of named sub-languages (productions). A regex-style parser would at least have to have some kind of recursion-inducing metacharacter.
That looks great, but it's hard to position it as a regex competitor. I should be more explicit about what I mean by "regex-style": The cool thing about regex that makes them so approachable is that they kind of look like the thing they're describing/matching. Your thing here mostly does not.
"My thing here" is a grammar so it describes a class of strings. Perhaps this is why it looks to you unlike what it describes?
You can make a DCG rule as specific or as general as you like. As a for instance, this is a vim regex I retrieved from my recent history:
\[13\/13,15\/12,24-24]
You could write this like so in DCG notation:
s --> ['[',13,'/',13,',',15,'/',12,',',24,'-',24,']'].
And that would match the string "[13/13,15/12,24-24]", no less, no more.
DCGs are Turing-complete, so you can go all the way from programs to finite automata when you write a pattern to match.
They're not really a regex competitor. They were invented in the '70s as a formalism for context-free grammars to be used in representing natural language. They fit right into the logic programming language Prolog that was created soon after (and by some of the same people).
Well, we started this subthread looking for alternatives to regex, so if DCGs aren't in the running I'm not sure what your point is. Also, I happen to believe Turing-completeness is a misfeature for this job.
The OP asked for "something as easy to use as regular expressions that can processes context-free grammars". You requested ease-of-use. I proposed DCGs. They are not an alternative to regexes in the sense that regexes can't represent anything beyond regular grammars, including CFGs. They can be as simple or as complicated as you want them and they are easy to use.
I'm not sure what is a "misfeature". What do you mean?
Edit: Apologies if I sound too terse. I'm confused by the terminology of "competitor", "alternative" etc. Are we in some kind of competition to find a technological solution that will take some prize? If so, I'd like to know the rules before I commit to any solution. What exactly are we trying to achieve here?
My interpretation of "as easy to use as regular expressions" is that the desired... formalism be usable in the same contexts and with the same casual ease as regex. Think Unix command lines, hairy ad-hoc Python parsing scripts, etc.
One of the things that makes regex so well suited to that role is that a string is a regex that matches itself, and you can iteratively add sophistication from there. At the very least, you would want to maintain (or improve, Cthulhu knows there's plenty of room) that incremental quality in any proposed replacement, while increasing its power.
It looks like with a DCG, you have to know you're writing one up front, and have to think about what you're parsing at a much more abstract level. The average sysadmin could probably not casually toss one off for a log parsing task. If there's an alternate syntax that gets around that problem, then I'm interested. OTOH...
Re "misfeature": one of the cool things about existing parsing formalisms is all the known terminating, mostly (all?) polynomial algorithms for analyzing them. If DCGs are Turing complete, they don't have those algorithms. Turing completeness is not usually a property I want in my parser. That's what I meant by saying it's a misfeature.
I don't quite understand what you mean by "a regex that matches itself". Could you explain? What you describe sounds more like a quine.
Regarding Turing completeness and termination- the DCG formalism (thank you) itself is expressive enough to represent anything from finite automata to UTMs, but that doesn't mean that every grammar you write using DCG notation is Turing-complete. So, for instance, if you write a right-regular grammar as a DCG, that DCG will not be Turing-complete, it'd just be a right-regular grammar.
The DCG examples I gave above are a CFG grammar for a fragment of natural English and a single-rule grammar that matches a single string. Those are definitely not Turing-complete and parsable in polynomial time.
The difference with regexes is that you can't express CFGs or above using regexes, but you _can_ represent both regexes and CFGs using DCG notation.
>> If there's an alternate syntax that gets around that problem, then I'm interested. OTOH...
May I make a personal comment? I think your insistence on competitive language, like "i'm [not] interested", "competitor to regexes" etc, is a case of Déformation professionnelle. You sound just like a software developer hyper-focused on finding tools to maximise productivity in the office, and nothing else.
You should perhaps consider the possibility that what we are discussing here goes a bit beyond a product that you can package and sell as an alternative to a popular tool. I mean, personally, when I realised that DCGs are executable grammars that can be run as both recognisers and generators - well, let's say it shifted my understanding of what is possible to do with a computer and a programming language.
In any case I don't see the complexity you seem to see in DCGs. Like I say above, they're basically BNF. I struggle to think of a sysadmin worth her salt (so, one who knows perl, eh?) who would sweat it to write and read BNF. The kind of sysadmin I have in mind, the problem would be to drag them away from the keyboard, once they started writing a DCG to parse a log file- and realised they could write one to parse _all_ log files ever.
"foo" is a string. It's also a regex that matches exactly the string "foo".
Regarding Turing-completeness, the entire point is about the termination of algorithms that examine parsers, not the runtime of the parsers themselves.
I also don't know how many times I need to point out that the reason I'm focusing on regexes is that that was the context for this conversation. Under any other context I'm quite interested in new parsing techniques.
Lastly, I would point out that regexes have uses far beyond a certain arbitrarily decided set of sysadmins and developers "worth their salt". Even if it's true that a meaningful subset of sysadmins are capable of writing basically sound BNF (which is not something I would count on even for CS grads, but maybe you work with smarter people than I do), there are lots of other people who could use a little more parsing power if it was offered in the right way.
This is a perfect example of an insight that leads down a rabbit hole of intertwined concepts when you try to understand why a regexp for validating regexps is actually much simpler than a regexp for validating email addresses.
It depends what you mean by "validating email addresses".
A regex that tests if a string looks like a valid email address is simple. Forget the ancient RFCs, a real world email address is in the form of `mailbox@domainname`. Which is not so difficult to test for with a bit of care.
However, testing if the email address is a valid mailbox is harder and indeed impossible using regex alone. The domain name can be validated using standard domain tools but in practice the mailbox can be anything that the server will route to a valid mailbox. The only way to validate it is to send an email.
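A hedged example of that first, "looks like mailbox@domainname" level of checking (my own pattern, deliberately loose, and it additionally insists on a dot in the domain, which real-world addresses have):

    import re

    EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    print(bool(EMAIL.match("alice@example.com")))   # True
    print(bool(EMAIL.match("not an email")))        # False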
Don't use a regex to do it. Most languages have a way to catch runtime errors. Stick the regex in a variable, create a dummy input, and run that block with a catcher around it. If the catcher catches a runtime error, you have a bad regex. If it does not, you have a valid regex.
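In Python, for instance, compiling the pattern is already enough to surface syntax errors, so the "catcher" can just wrap re.compile (in engines that compile lazily you would also run the pattern against a dummy input, as suggested above):

    import re

    def is_valid_pattern(pattern: str) -> bool:
        try:
            re.compile(pattern)
            return True
        except re.error:               # raised for malformed patterns
            return False

    print(is_valid_pattern(r"(a|b)*c"))   # True
    print(is_valid_pattern(r"(a|b*c"))    # False: unbalanced parenthesis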
TL;DR: No. But there exist languages built on top of regular expressions (notably PCRE) that can. They can't validate themselves, though - turtles all the way down to Gödel.
Plenty of parser formalisms can define their own syntax in their own syntax. As mentioned elsewhere, C compilers can parse their own source code. Gödel has nothing to do with it.
The fact that someone might make a mistake in doing something does not show it cannot be done. More generally, you seem to be confusing validating a program's source with the question of whether it performs its intended purpose. These are different things, and attempting to conflate them will only lead to confusion.
Then your mistake appears to be in failing to see that your perceived edge case does not invalidate the first sentence of my reply. If the sole purpose of a parser is to syntactically validate its own source (which is not the case for a compiler's parser, by the way, not even if we expand 'its own source' to 'arbitrary input'), then if it does that correctly, that's all there is to it.
It is a general rule that those who avoid answering a question do not have an answer, and this is no exception. Here, you completely misunderstand Chomsky's hierarchy: by your inverted-hierarchy argument, the simplest regular language would be complex enough that incompleteness would be an issue in its validation.
A self-hosted compiler is definitely not a case of the liars paradox, so I don’t think this applies. In fact, it doesn’t matter what the parser is written in, at some point you do have to trust that it actually follows the spec, probably through testing.
In the case at hand, correctness of the validator expression V clearly means "V determines well-formedness of any regular expressions" which is clearly not implied by "V is well-formed" (a much weaker statement because ".*" is well-formed but matches everything). Therefore, when applying V to itself, we only learn if a weak requirement for V's correctness holds.
Similar, perhaps, to validating whether a given number is a Gödel number of a well-formed logical statement rather than assessing the verity of the logical statement it encodes.
Also, I am not saying whether it is or is not possible to build such a regular expression. Rather, the question just doesn't tick the boxes of either, the Liar's Paradox, nor Gödel's Incompleteness result -- contrary to what was suggested. So you could still be right but for different reasons.
It's not possible to do this for regular expressions, but it is possible to do it for context free grammars. You can write a context-free grammar in Backus-Naur form that recognizes all context-free grammars in Backus-Naur form:
If you could do this with a proper regex, it would be handy: regexes are easy to parse and don't have a lot of structure, so you can do neat things like produce generators from them. Then you could get your regex for regexes to produce new regexes! This can be useful if you want to do generative testing.
A regular expression is a domain-specific language that defines a language consisting of all the strings it matches.
In principle, this serves as executable documentation.
A regular expression that matches valid regular expressions could be viewed as a human-readable documentation of valid regular expression syntax that has the valuable side-effect of being executable.
If, instead, we detect whether a string is a regular expression by compiling it, we may find it very difficult to read the compiler source code.
It is much more work to make an optimized regular expression engine that is also self-documenting than it is to write such a regular expression.
I think we have different definitions of human readable documentation. I'd find normal code, with or without appropriate comments, far more readable than a mess of regex symbols.
Regular expressions are not my favourite either, but one thing I will say is that normal code becomes difficult to read over time as features are bolted on and optimizations are added to address performance concerns.
Whereas, some form of executable DSL can serve as self-testing documentation indefinitely. DSLs can provide a separation of concerns, where the syntax is designed for readability, and the engine is designed for performance.
Modern Regexen are a dumpster fire, of course, but you asked why someone might want to, not whether this is something Everyone might want to do.
Yes, checking to see if a regular expression is valid is useful but that wasn't what I was asking. Why specifically would you want to use a regular expression for this job?
No, there is not. For example, parentheses in a regex must be balanced, and (famously) there is no regex to detect balanced parentheses.