In this post, I describe the new source-generated Regex that has been introduced with .NET 7.
.NET 7 is now here, and one of the less talked about improvements is the sprinkling of source-code generation magic over the Regex class, along with other improvements to the functionality it provides.
Stephen Toub has a very in-depth blog post that describes all the changes that have been made to Regex to improve the overall performance of regular expression evaluation.
There is a lot of great information in there, but in this post, I want to focus on the source generator experience.
The move to a source-generated Regex takes the same path that was used for the JSON serialisers in .NET: a new GeneratedRegex attribute that is recognised by the code generator.
(Note that this attribute was renamed after initially being introduced in the early release candidates, so you may find that some early blog posts refer to the older 'RegexGeneratorAttribute' name.)
This attribute is applied to a static partial method that returns a Regex.
Another nice improvement in Visual Studio is that the pattern is recognised as a regular expression and colour coding is applied to improve readability.
This is thanks to the attribute's pattern parameter being decorated with StringSyntaxAttribute.Regex, which allows IDEs to apply syntax rules when displaying the string.
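
The same attribute can be applied to your own parameters. The following is a minimal sketch, assuming a hypothetical helper class called PatternHelper (not part of the original example), showing how a string parameter can be marked so that an IDE treats the argument as a regular expression.

```csharp
using System;
using System.Diagnostics.CodeAnalysis;
using System.Text.RegularExpressions;

// The literal passed as 'pattern' should be colourised as a regex by IDEs that support it.
Console.WriteLine(PatternHelper.Matches(@"^\d{3}$", "123")); // True

// Hypothetical helper, for illustration only.
public static class PatternHelper
{
    public static bool Matches(
        [StringSyntax(StringSyntaxAttribute.Regex)] string pattern, // hint for tooling only
        string input) => Regex.IsMatch(input, pattern);
}
```

The attribute is purely a hint to tooling; it does not change how the string is compiled or evaluated.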
The other parameters are optional, though I would recommend looking at matchTimeoutMilliseconds if your pattern may be susceptible to a 'Regex Denial-of-Service' (ReDoS) attack.
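
As a rough illustration of the timeout parameter, the sketch below passes a timeout of 250 milliseconds as the third constructor argument. The class name SafeMatcher, the pattern and the timeout value are all assumptions chosen for the example; treating a timeout as "no match" is one possible policy, not the only one.

```csharp
using System;
using System.Text.RegularExpressions;

Console.WriteLine(SafeMatcher.IsWordList("hello world")); // True

// Hypothetical class, for illustration only.
public static partial class SafeMatcher
{
    // Nested quantifiers like this can backtrack heavily on hostile input,
    // so the third argument (matchTimeoutMilliseconds) caps matching at 250 ms.
    [GeneratedRegex(@"^(\w+\s?)*$", RegexOptions.None, 250)]
    private static partial Regex WordsRegex();

    public static bool IsWordList(string input)
    {
        try
        {
            return WordsRegex().IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            // Assumption: input that cannot be evaluated in time is rejected.
            return false;
        }
    }
}
```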
In my example, I created a regular expression for checking whether a URL present in a string is valid. The pattern I have used is not intended as an example of the best way to do this, but is provided as an example of a reasonably complex pattern to put the source generator through its paces.
If I wanted to expose all the functionality of the generated Regex instance, I could make the method public. However, in most cases, the regular expression is part of some other specific functionality, so I like to make the generated Regex private and expose a method that calls the Regex's Match method, thus hiding other functionality such as Replace. Also in the example, the outer class MyUrlValidator is static, but this is not a requirement of using the source-generated Regex, and it can happily live in a non-static class.
What IS a requirement for using the generated Regex is that both the class and the method decorated with the GeneratedRegexAttribute are declared as partial. This is a key factor in using source generators, as the generator adds code to the class and the method during the Roslyn compilation process.
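
The shape described above can be sketched as follows. The pattern here is a deliberately simple stand-in and is not the one used in my example; the method names UrlRegex and IsValidUrl are likewise assumptions for illustration. The sketch uses IsMatch rather than Match, since only a yes/no answer is needed.

```csharp
using System;
using System.Text.RegularExpressions;

// Callers only see the narrow IsValidUrl method, not the Regex itself.
Console.WriteLine(MyUrlValidator.IsValidUrl("https://example.com/path")); // True
Console.WriteLine(MyUrlValidator.IsValidUrl("not a url"));                // False

// The class must be partial so the generator can add to it.
public static partial class MyUrlValidator
{
    // Simplified stand-in pattern (assumption). The method is partial and has
    // no body; the source generator supplies the implementation at compile time.
    [GeneratedRegex(@"^https?://[^\s/$.?#][^\s]*$", RegexOptions.IgnoreCase)]
    private static partial Regex UrlRegex();

    // Only the behaviour we want to expose; Replace, Split etc. stay hidden.
    public static bool IsValidUrl(string candidate) => UrlRegex().IsMatch(candidate);
}
```

Removing the partial keyword from either the class or the method causes a compile-time error, which is a quick way to see that the generator really is filling in the method body.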
Since the first release of .NET, the Regex class has used an interpreter engine to perform pattern matching. However, before the interpreter can be used, the pattern must be validated and converted into code that the engine uses to do the matching. Each time you create a new instance of Regex, it takes time to create this code, so the best practice is usually to create the instance once and cache it in a static field, which is safe because Regex instances can be shared between threads.
In previous versions of .NET, there has been the RegexOptions.Compiled option, which compiles the pattern to MSIL at run time instead of using the interpreter engine. However, this adds an initial overhead to startup time, so if the pattern matching is not in an application's hot path, there may be minimal benefit to its use. In .NET Framework, there was also the option to compile to an assembly (Regex.CompileToAssembly), but this has a number of caveats, not least that it is not supported on .NET Core and later versions.
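
For comparison, the two pre-.NET 7 approaches described above might look like the sketch below. The class name LegacyPatterns and the pattern are hypothetical and chosen only to keep the example short.

```csharp
using System;
using System.Text.RegularExpressions;

Console.WriteLine(LegacyPatterns.Interpreted.IsMatch("order-12345")); // True
Console.WriteLine(LegacyPatterns.Compiled.IsMatch("order-12345"));    // True

// Hypothetical class, for illustration only.
public static class LegacyPatterns
{
    private const string Pattern = @"^order-\d{5}$";

    // Interpreter engine: the pattern is parsed once and the instance is reused.
    public static readonly Regex Interpreted = new(Pattern);

    // Compiled to MSIL at run time: more work up front, typically faster matching afterwards.
    public static readonly Regex Compiled = new(Pattern, RegexOptions.Compiled);
}
```

In both cases the construction cost is paid at run time, the first time the class is used.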
With the source-generated Regex, the interpreter is no longer needed, as the generated code creates a custom class that inherits from Regex and is cached as a thread-safe singleton. So, with the new source-generated version, you gain all the performance benefits that you previously had to use the compile options for, without the associated drawbacks.
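
One way to observe the singleton behaviour is to call the generated method twice and compare the results. This is a small sketch with a hypothetical Digits class; on the .NET 7 SDK I would expect the first line to print True (the same instance is returned each time) and the second to print True as well (the runtime type derives from Regex), but this relies on the current shape of the generated code rather than on a documented guarantee.

```csharp
using System;
using System.Text.RegularExpressions;

Regex first = Digits.DigitsRegex();
Regex second = Digits.DigitsRegex();

// Same cached instance on every call?
Console.WriteLine(ReferenceEquals(first, second));

// Is the runtime type a generated class derived from Regex?
Console.WriteLine(first.GetType().IsSubclassOf(typeof(Regex)));

// Hypothetical class, for illustration only.
public static partial class Digits
{
    [GeneratedRegex(@"^\d+$")]
    public static partial Regex DigitsRegex();
}
```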
However, when running benchmarks, the generated Regex is not necessarily faster than, or even directly comparable with, the compiled version, so you will need to perform your own benchmarks. In my tests of the URL validator, the source-generated version is marginally slower, but we are talking about a few nanoseconds per match. In the screenshot below, benchmarking over one million iterations shows a difference of a few milliseconds.
I would add the caveat that my benchmarking approach may not be correct and may need some refinement, as the section about source generation in Stephen Toub's post indicates that performance should be better. The code can be found at https://github.com/stevetalkscode/sourcegenregex
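
If you want a quick sanity check of your own before setting up a proper benchmark, a crude Stopwatch loop along the following lines is one option. This is not the code from the repository above; the pattern, input, and the Timing and Patterns class names are assumptions, and a dedicated benchmarking library will give far more trustworthy numbers than this sketch.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

const int Iterations = 1_000_000;
const string Input = "https://example.com/some/path?query=1";

TimeSpan compiled = Timing.Measure(Patterns.Compiled, Input, Iterations);
TimeSpan generated = Timing.Measure(Patterns.Generated(), Input, Iterations);

Console.WriteLine($"Compiled:  {compiled.TotalMilliseconds:F1} ms");
Console.WriteLine($"Generated: {generated.TotalMilliseconds:F1} ms");

// Hypothetical helper, for illustration only.
public static class Timing
{
    public static TimeSpan Measure(Regex regex, string input, int iterations)
    {
        regex.IsMatch(input); // warm-up so JIT and first-use costs are not timed

        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            regex.IsMatch(input);
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}

// Hypothetical class holding both variants of the same simplified pattern.
public static partial class Patterns
{
    private const string UrlPattern = @"^https?://[^\s/$.?#][^\s]*$";

    public static readonly Regex Compiled =
        new(UrlPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);

    [GeneratedRegex(UrlPattern, RegexOptions.IgnoreCase)]
    public static partial Regex Generated();
}
```

Run it from a Release build; results will vary with the pattern, the input and the machine.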
However, whilst performance is important, it is not the only advantage of using the new source-generated Regex; there are several others.
The first is that you can see the source code that performs the pattern matching within Visual Studio. Remember, this is not MSIL (as is generated by the older compile option), but is C# that is much more readable.
The next benefit is the XML comment that is generated for the class that explains how the pattern matching will work as a step-by-step instruction in English and can be seen as a tooltip in an IDE.
A Roslyn analyser is now also provided to help you identify where existing Regex instances can be replaced with a source-generated alternative.
Not only does it flag the code that can be refactored, it also provides a code fix that does the refactoring for you.
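
To give a feel for the kind of change involved, here is a hand-written before-and-after sketch. The class names Before and After and the pattern are hypothetical, and the code the fix actually produces may differ in naming and layout.

```csharp
using System;
using System.Text.RegularExpressions;

Console.WriteLine(Before.IsHex("1A2B")); // True
Console.WriteLine(After.IsHex("1A2B"));  // True

public static class Before
{
    // A Regex built from a constant pattern: a candidate for the analyser's suggestion.
    private static readonly Regex HexRegex = new("^[0-9A-F]+$");

    public static bool IsHex(string value) => HexRegex.IsMatch(value);
}

public static partial class After
{
    // The same pattern moved onto a partial method for the source generator.
    [GeneratedRegex("^[0-9A-F]+$")]
    private static partial Regex HexRegex();

    public static bool IsHex(string value) => HexRegex().IsMatch(value);
}
```

Note that the refactoring also has to make the containing class partial, which is why After is declared that way.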
I have been a big fan of source generators since their introduction in .NET 5 and it is good to see Microsoft continuing to bring their benefits into the base library.
As mentioned at the start of this post, Stephen Toub has a very in-depth blog post that goes into a lot more detail and is worth a read if you want to find out more.