╔═╗╔═╗╔═╗╔═╗═╗ ╦ ╦═╗╔═╗╔═╗╦ ╦╦═╗╔═╗╦╔═╗╔╗╔ ╠╦╝║╣ ║ ╦║╣ ╔╩╦╝ ╠╦╝║╣ ║ ║ ║╠╦╝╚═╗║║ ║║║║ ╩╚═╚═╝╚═╝╚═╝╩ ╚═ ╩╚═╚═╝╚═╝╚═╝╩╚═╚═╝╩╚═╝╝╚╝ ============================================================================ Nested parentheses. The thing that breaks every regex you've ever written. You know the feeling. You need to match (something (with nested (parens))) and your regex engine just laughs at you. Regular expressions can't count. They're finite automata. They don't do recursion. Except Perl does. my $nested = qr~\( ( (?: [^()]+ | \( (?1) \) )* ) \)~x; That pattern matches arbitrarily deep nested parentheses. One line. No parsing libraries. No state machines. Just regex. Looks like someone sneezed on a keyboard, right? Let's crack it open. ============================================================================ PART 1: THE OUTER SHELL ----------------------- Strip away the recursion magic and you've got a sandwich: \( ...stuff... \) Literal open paren. Stuff in the middle. Literal close paren. That's the easy part. The stuff in the middle is where it gets weird. ============================================================================ PART 2: THE CAPTURING GROUP --------------------------- \( ( (?: [^()]+ | \( (?1) \) )* ) \) ^ ^ | | +------- GROUP 1 ------------+ Those outer parentheses (inside the literal \( \)) create Group 1. This is the group that recurses. Everything it matches gets captured into $1. For input like (a(b(c)d)e), Group 1 captures: a(b(c)d)e Notice it doesn't include the outermost parens. Those are matched by the literal \( and \) at the edges. ============================================================================ PART 3: THE CHOICE ------------------ Inside Group 1, there's a non-capturing group with two options: (?: [^()]+ | \( (?1) \) )* Translation: "Match either of these things, zero or more times." Option A: [^()]+ One or more non-paren characters Option B: \( (?1) \) An open paren, RECURSE, then close paren The regex engine tries Option A first. If it hits an open paren, it switches to Option B and dives deeper. ============================================================================ PART 4: THE RECURSION --------------------- Here's the magic: (?1) This means "run Group 1's pattern right here, right now." Not a backreference. Not "match what Group 1 already matched." It's "run the entire Group 1 pattern again from scratch." When the engine hits (?1), it pushes its current state onto a stack and starts Group 1 over at the current position. When that nested run finishes, it pops back and continues. It's function calls. In regex. .--. |o_o | |:_/ | // \ \ (| | ) /'\_ _/`\ \___)=(___/ Your brain should hurt a little. That's normal. ============================================================================ PART 5: WALKING THROUGH IT -------------------------- Input: (a(b(c)d)e) Position: 0 1 2 3 4 5 6 7 8 9 10 Character: ( a ( b ( c ) d ) e ) Here's what happens: STEP PATTERN POS CHAR ACTION ----- --------- --- ---- ------------------------- 1 \( 0 ( Match literal open paren 2 [^()]+ 1 a Match non-paren char 3 \( 2 ( Match nested open 4 (?1) - - RECURSE into Group 1 5 [^()]+ 3 b Match non-paren (in recursion) 6 \( 4 ( Match deeper open 7 (?1) - - RECURSE AGAIN 8 [^()]+ 5 c Match non-paren (deepest) 9 - - - No more parens, return up 10 \) 6 ) Match close (level 2) 11 [^()]+ 7 d Match non-paren (level 1) 12 - - - Return up 13 \) 8 ) Match close (level 1) 14 [^()]+ 9 e Match non-paren (level 0) 15 \) 10 ) Match final close Three levels deep. Handled automatically. ============================================================================ PART 6: THE STACK ----------------- Think of it like function calls: LEVEL 0 (outer): +------------------------------------------+ | \( matches position 0 | | | | GROUP 1 capturing: a(b(c)d)e | | +-- [^()]+ matches: a | | +-- \( matches: ( | | +-- (?1) RECURSE ----+ | | : : | | +-- \) matches: ) <--+ (returns here) | | +-- [^()]+ matches: e | | | | \) matches position 10 | +------------------------------------------+ LEVEL 1 (first recursion): +--------------------------------+ | GROUP 1 matching: b(c)d | | +-- [^()]+ matches: b | | +-- \( matches: ( | | +-- (?1) RECURSE --+ | | : : | | +-- \) matches: ) <+ return | | +-- [^()]+ matches: d | +--------------------------------+ LEVEL 2 (deepest): +--------------+ | GROUP 1: c | | +-- [^()]+:| | c | | (no parens, | | return up) | +--------------+ Each (?1) pushes state. Each completed match pops back. ============================================================================ PART 7: THE FULL PATTERN ANNOTATED ---------------------------------- my $nested = qr~ \( # literal opening paren ( # GROUP 1 - this is what recurses (?: # non-capturing group for the choice [^()]+ # Option A: non-paren characters | # OR \( (?1) \) # Option B: open, RECURSE, close )* # zero or more times ) \) # literal closing paren ~x; The ~x modifier lets us spread it out with comments. Much easier to read than the compressed version. ============================================================================ PART 8: USING IT ---------------- #!/usr/bin/env perl use v5.14; # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Pattern for nested parens # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ my $nested = qr~ \( # literal opening paren ( # GROUP 1 - the recursing part (?: # non-capturing choice group [^()]+ # one or more non-parens | # OR \( (?1) \) # paren, recurse, paren )* # zero or more of either ) \) # literal closing paren ~x; # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Test it # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ my $text = "func(a, calc(b, inner(c)), d)"; if ($text =~ m~$nested~) { say "Matched: $&"; # full match with outer parens say "Inside: $1"; # contents without outer parens } Output: Matched: (a, calc(b, inner(c)), d) Inside: a, calc(b, inner(c)), d ============================================================================ PART 9: VARIATIONS ------------------ Match curly braces instead: my $braces = qr~\{ ( (?: [^{}]+ | \{ (?1) \} )* ) \}~x; Match square brackets: my $brackets = qr~\[ ( (?: [^\[\]]+ | \[ (?1) \] )* ) \]~x; Match HTML-style tags (simplified): my $tags = qr~<(\w+)> ( (?: [^<>]+ | (?0) )* ) ~x; That last one uses (?0) to recurse the entire pattern, and \1 as a backreference to match the closing tag name. ============================================================================ THE CATCH --------- This is Perl-specific. PCRE has it. Python's re module doesn't (but the regex module does). JavaScript? Forget it. If you need portability, you're back to writing a parser. But if you're in Perl? One line. Unlimited depth. No libraries. That's the kind of thing that makes people fall in love with this language. ============================================================================ .---. / \ \.@-@./ /`\_/`\ // _ \\ | \ )|_ /`\_`> <_/ \ \__/'---'\__/ ============================================================================ japh.codes