Capture groups and back-references are some of the more fun features of regular expressions. You place a sub-expression in parentheses, you access the capture with \1 or $1… What could be easier?

For instance, the regex \b(\w+)\b\s+\1\b matches repeated words, such as regex regex, because the parentheses in (\w+) capture a word to Group 1, then the back-reference \1 tells the engine to match the characters that were captured by Group 1.

Yes, capture groups and back-references are easy and fun. But when it comes to numbering and naming, there are a few details you need to know, otherwise you will sooner or later run into situations where capture groups seem to behave oddly.

For easy navigation, here are some jumping points to various sections of the page:

✽ How do Capture Groups Beyond \9 get Referenced?
✽ Naming Groups, and referring back to them
✽ Generating New Capture Groups Automatically (You Can't!)
✽ Resetting Capture Groups like Variables (You Can't!)
✽ Relative Back-References and Forward-References

How do Capture Groups Beyond \9 get Referenced?

Normally, within a pattern, you create a back-reference to the content a capture group previously matched by using a backslash followed by the group number, for instance \1 for Group 1. In practice, you rarely need to create back-references to groups with numbers above 3 or 4, because when you need to juggle many groups you tend to create named capture groups. However, if you spend time in the smoky corridors of regex, at one time or another you're sure to wonder what the correct syntax is to create back-references to Groups 10 and higher.

So in a regular expression, what does \10 mean? It looks ambiguous: on the face of it, that could refer either to Group 10, or to Group 1 followed by a zero. In fact, the meaning does depend on the regex engine. If Group 10 has been set, all major engines treat \10 as a back-reference to Group 10. If there is no Group 10, however, Java translates \10 as a back-reference to Group 1, followed by a literal 0; Python understands it as a back-reference to Group 10 (which will fail); and C#, PCRE, JavaScript, Perl and Ruby understand it as an instruction to match "the backspace character" (whatever that is)… because 10 is the octal code for the backspace character in the ASCII table!

To avoid this kind of ambiguity, here is the proper syntax to create a back-reference to Group 10.

✽ Java, JavaScript, Python: no special syntax (use \10, knowing that if Group 10 is not set Java will treat this as Group 1 then a literal 0, while JavaScript will treat it as the elusive "backspace character")

As you probably know, there is no standard across engines to insert capture groups into replacements. Some engines use the \1 syntax, some use $1, some allow both. So how do you insert Group 10 in a replacement?

✽ Java, JavaScript, Perl, PHP: $10 (if Group 10 has not been set, Java and JavaScript insert Group 1 then the literal 0, while Perl and PHP treat this as a back-reference to an undefined group)
✽ Python, Perl, PHP: \10 (if Group 10 has not been set, Python and PHP treat this as a back-reference to an undefined group, while Perl inserts the backspace character, whatever that means)
✽ Ruby does not allow Group numbers above \9 in replacements (use a named group).

Naming Groups, and referring back to them

In this section, to summarize named group syntax across various engines, we'll use the simple regex [A-Z]+ which matches capital letters, and we'll name it CAPS.

✽ .NET (C#, VB.NET…), Java: (?<CAPS>[A-Z]+) defines the group, \k<CAPS> is a back-reference, ${CAPS} inserts the capture in the replacement string.
✽ PHP: (?<CAPS>[A-Z]+) defines the group, \k<CAPS> is a back-reference. To insert the capture in the replacement string, you must either use the group's number (for instance \1) or use preg_replace_callback() and access the named capture as $match['CAPS'].
✽ Ruby: (?<CAPS>[A-Z]+) defines the group, \k<CAPS> is a back-reference. To insert the capture in the replacement string, you can use the group's number, for instance \1, or the named form \k<CAPS>.
✽ JavaScript: engines that predate ES2018 do not have named groups (nor lookbehind, inline modifiers and other useful features), so in the replacement string you must use the group's number, for instance $1. From ES2018 on, (?<CAPS>[A-Z]+) defines the group, \k<CAPS> is a back-reference, and $<CAPS> inserts the capture in the replacement string.

This section starts with basics and moves on to more advanced topics. You can skip the topics that don't pertain to your regex engine or to regex features you aren't planning to use in the coming days, but the three paragraphs on group numbering (in which I've made sure to insert a streak of yellow) are required reading if you want to understand how group numbering works.
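To make the repeated-word regex from the introduction concrete, here is a minimal Python sketch. The function name find_repeated_words and the sample sentence are made up for illustration; only the pattern itself comes from the text.

```python
import re

# The repeated-word pattern from the text: (\w+) captures a word into
# Group 1, and the back-reference \1 requires the same characters again.
REPEATED_WORD = re.compile(r"\b(\w+)\b\s+\1\b")


def find_repeated_words(text):
    """Return each word that is immediately followed by a copy of itself."""
    return [m.group(1) for m in REPEATED_WORD.finditer(text)]


print(find_repeated_words("I love regex regex, and the the rest too"))
```

The word boundaries matter: without the leading \b, the pattern could start matching in the middle of a word.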
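The \10 ambiguity inside a pattern can be checked directly. The sketch below covers Python only, one of the engines discussed, and the ten single-letter groups are an arbitrary test pattern chosen for illustration.

```python
import re

# Ten single-letter capture groups followed by \10. Group 10 exists here,
# so Python reads \10 as a back-reference to it (the second "j").
ten_groups = r"(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10"
set_match = re.fullmatch(ten_groups, "abcdefghijj")

# Only one capture group followed by \10. Python still reads this as a
# reference to Group 10 and refuses to compile the pattern.
try:
    re.compile(r"(a)\10")
    unset_error = None
except re.error as exc:
    unset_error = str(exc)

print(set_match.group(10))
print(unset_error)
```

In Python the failure shows up at compile time, so an undefined Group 10 cannot silently turn into "Group 1 then a zero" or a backspace character.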
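On the replacement side, Python accepts \10 and also offers the explicit \g<10> form from its re module, which removes any doubt about where the group number ends. A small sketch, again using an arbitrary ten-group pattern as an assumption:

```python
import re

TEN = re.compile(r"(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)")

# \10 and \g<10> both insert Group 10 when that group exists.
with_backslash = TEN.sub(r"[\10]", "abcdefghij")
with_g = TEN.sub(r"[\g<10>]", "abcdefghij")

# \g<1>0 is the unambiguous way to say "Group 1, then a literal 0".
group1_then_zero = TEN.sub(r"[\g<1>0]", "abcdefghij")

# With a single group, \10 in the replacement refers to an undefined group.
try:
    re.sub(r"(a)", r"\10", "a")
    undefined_error = None
except re.error as exc:
    undefined_error = str(exc)

print(with_backslash, with_g, group1_then_zero)
```

Preferring \g<N> in replacement strings is a design choice that keeps the intent readable even when a digit follows the group reference.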
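The CAPS example can also be sketched in Python. Note that Python's spelling differs from the engines listed for named groups: (?P<CAPS>...) defines the group, (?P=CAPS) is the back-reference, and \g<CAPS> inserts the capture in a replacement. The function collapse_doubled_caps, the hyphen separator and the sample strings are assumptions made for this illustration.

```python
import re

# (?P<CAPS>[A-Z]+) names the capture, (?P=CAPS) must match the same text
# again after a hyphen.
DOUBLED_CAPS = re.compile(r"(?P<CAPS>[A-Z]+)-(?P=CAPS)")


def collapse_doubled_caps(text):
    """Replace 'ABC-ABC' style repeats with a single bracketed copy."""
    # \g<CAPS> inserts the named capture in the replacement string.
    return DOUBLED_CAPS.sub(r"[\g<CAPS>]", text)


print(collapse_doubled_caps("ID ABC-ABC ok"))
```

Because the group is referenced by name, adding more capture groups to the pattern later does not change what the replacement inserts.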