The regular expression engine provided by the .NET Framework includes a new feature known as 'balancing groups'.
This feature allows you to increment and decrement the capture count of a named capturing group: (?<NAME>...) adds a capture every time it matches, and (?<-NAME>...) removes one. You can then check whether the two balanced out by testing whether the group still has a value (i.e. an effective count of zero means the group was balanced). That test, written (?(NAME)(?!)), can be included in your match pattern, so that only the balanced result is considered a match.
Microsoft don't really go into this in much detail and only show a small example of matching opening and closing parentheses.
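For reference, a parentheses check along those lines could look like the sketch below. The class name ParenBalance, the group name P and the method IsBalanced are my own choices for illustration, not taken from the Microsoft documentation.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParenBalance
{
    // (?<P>\()    adds a capture to group P for every "(".
    // (?<-P>\))   removes one capture for every ")"; it cannot match if P is empty.
    // (?(P)(?!))  forces a failure if any capture is still left in P at the end.
    private static readonly Regex Balanced = new Regex(
        @"^(?:[^()]|(?<P>\()|(?<-P>\)))*(?(P)(?!))$");

    // Returns true when every "(" in the input has a matching ")".
    public static bool IsBalanced(string input)
    {
        return Balanced.IsMatch(input);
    }
}
```

Calling ParenBalance.IsBalanced("(a(b)c)") should return true, while "(a(b" and "a)b(" should both be rejected.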
In my case, I wanted to match a specific chunk of HTML code in a file, and then find the closing tag that matches the opening tag.
For example:
<div>
<div class="targetContent">
Something in here
<div> Something else in here</div>
</div>
</div>
Using standard regular expressions, searching from <div class="targetContent"> to </div> can work in two ways. In non-greedy mode, it stops at the </div> of the inner div. In greedy mode, it matches all the way to the end of the outer div.
What I wanted to do is match up to the last </div> that makes the tags balance, which can be done using balancing groups!
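To make the two behaviours concrete, here is a small sketch that runs both variants against the sample above. The class and method names are made up for the example, and RegexOptions.Singleline is my assumption so that "." can cross line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // The sample markup from above, one line per "\n".
    public const string Html =
        "<div>\n" +
        "<div class=\"targetContent\">\n" +
        "Something in here\n" +
        "<div> Something else in here</div>\n" +
        "</div>\n" +
        "</div>";

    // Non-greedy: stops at the first </div>, which closes the inner div.
    public static string Lazy()
    {
        return Regex.Match(Html, "<div class=\"targetContent\">.*?</div>",
                           RegexOptions.Singleline).Value;
    }

    // Greedy: runs on to the last </div>, which closes the outer div.
    public static string Greedy()
    {
        return Regex.Match(Html, "<div class=\"targetContent\">.*</div>",
                           RegexOptions.Singleline).Value;
    }
}
```

Neither result is the one we are after: Lazy() ends too early and Greedy() ends too late.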
C# Code:
pattern = "<div class=\"targetContent\">(?:(?<TAG><div\\b)|(?<-TAG></div>)|(?!<div\\b|</div>).)*(?(TAG)(?!))</div>";
// Use with RegexOptions.Singleline so that "." also matches line breaks.
Effectively, what the expression does is:
- Start the match from the div with class="targetContent"
- Match any internal content
- Whenever it encounters another div tag, it increments the TAG count
- Match any nested content
- Whenever it encounters another closing div tag, it decrements the TAG count
- It only becomes a match when the TAG count is back to zero, i.e. every nested div has been closed
- Finally match on the closing tag of our outer div
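The steps above can be sketched as a small, self-contained helper. This is one way to write it rather than the only one: it uses a repeated alternation so that any number of nested divs is handled, and it assumes RegexOptions.Singleline so the content may span several lines. The names BalancedDiv and Find are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedDiv
{
    private static readonly Regex Target = new Regex(
        "<div class=\"targetContent\">" +   // start from the target div
        "(?:" +
            "(?<TAG><div\\b)" +             // nested opening div: increment TAG
            "|(?<-TAG></div>)" +            // closing tag of a nested div: decrement TAG
            "|(?!<div\\b|</div>)." +        // any other character of content
        ")*" +
        "(?(TAG)(?!))" +                    // fail if a nested div is still open
        "</div>",                           // closing tag of the target div
        RegexOptions.Singleline);

    // Returns the target div including its balanced closing tag,
    // or an empty string when no balanced match is found.
    public static string Find(string html)
    {
        Match m = Target.Match(html);
        return m.Success ? m.Value : string.Empty;
    }
}
```

Run against the sample markup above, Find should return everything from <div class="targetContent"> up to and including the second </div>, leaving the outermost closing tag out of the match.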
This can be applied to any XML-style markup, where you have the notion of opening and closing tags.
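As a rough illustration of that generalisation, the pattern can be built for any tag name. The helper below is hypothetical (BalancedTag and Build are names I made up), and it assumes the markup has no self-closing forms of the tag and no tags hidden inside comments or CDATA sections, since those would throw the count off.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedTag
{
    // Builds a balancing-group regex that starts at openingTag (matched literally)
    // and ends at the closing tag that balances it.
    public static Regex Build(string tagName, string openingTag)
    {
        string name = Regex.Escape(tagName);
        string open = "<" + name + "\\b";    // any opening tag with this name
        string close = "</" + name + ">";    // the closing tag with this name

        string pattern =
            Regex.Escape(openingTag) +
            "(?:(?<TAG>" + open + ")" +             // nested opening tag: increment TAG
            "|(?<-TAG>" + close + ")" +             // nested closing tag: decrement TAG
            "|(?!" + open + "|" + close + ").)*" +  // any other character
            "(?(TAG)(?!))" +                        // fail unless the count is back to zero
            close;

        return new Regex(pattern, RegexOptions.Singleline);
    }
}
```

For example, BalancedTag.Build("item", "<item id=\"a\">") can then be used to pull one item element, including any nested item elements, out of a larger XML string.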