Saturday, September 27, 2008

Compiling Regular Expressions

In the article Regular Expressions in VB.NET, I show the basics of coding regular expressions in VB.NET. Regular expressions are a 'language in a language' with a history that starts before Visual Basic or even just B.A.S.I.C. and they're used in a lot of programming languages. For this article, I use the "telephone number" RegEx from the article above. Read that article for a more detailed explanation of why it works.
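As a quick refresher on what the telephone number RegEx accepts, here is a minimal console sketch. The module name, the IsPhoneNumber helper, and the sample strings are made up for illustration and are not from the earlier article; only the pattern itself is the one used throughout this article.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PhonePatternDemo
    ' The telephone-number pattern used throughout this article.
    Public Const PhonePattern As String = _
        "^1?\s*-?\s*(\d{3}|\(\s*\d{3}\s*\))" & _
        "\s*-?\s*\d{3}\s*-?\s*\d{4}$"

    ' Hypothetical helper: True when the whole input matches the pattern.
    Public Function IsPhoneNumber(ByVal input As String) As Boolean
        Return Regex.IsMatch(input, PhonePattern)
    End Function

    Sub Main()
        ' Sample inputs invented for this sketch.
        Dim samples() As String = { _
            "123-555-0042", _
            "1 (123) 555-0042", _
            "555-0042", _
            "123-555-00421"}
        For Each sample As String In samples
            Console.WriteLine("{0,-18} {1}", sample, IsPhoneNumber(sample))
        Next
    End Sub
End Module
```

The first two samples match (an optional leading 1 and a parenthesized area code are both allowed). The last two do not: one has no area code and the other has a fifth trailing digit.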

Although they're very handy by themselves, regular expressions can be even more useful as compiled DLL modules for all the same reasons that you compile anything: speed, security, and a better way to organize code in libraries.

Because it's a 'language in a language', there's no standalone compiler for RegEx in VB.NET. Instead, there's a Shared (static) method of the Regex class in the System.Text.RegularExpressions namespace. The method is called CompileToAssembly and you will usually call it with two parameters which, naturally enough, describe the 'source code' input and the 'compiled assembly' output. (There are also overloaded methods that let you include custom attributes that can be passed to the compiled DLL. The article Attributes in VB .NET explains what attributes are in VB.NET.)

To see just how much improvement is possible using compiled regular expressions, this article will show how to compile one. Then the Stopwatch class in the System.Diagnostics namespace will be used to compare how fast a regular "inline" execution of the RegEx is versus the same compiled RegEx.

First, we have to compile the RegEx. If you use very many compiled regular expressions, you will probably end up writing a small utility for this step. Here's the way I did it. (Arguments in the event Subs are not shown in this article to save space.)

Private Sub CompileRegEx_Click( ...
Dim myRegexString As String = _
"^1?\s*-?\s*(\d{3}|\(\s*\d{3}\s*\))" & _
"\s*-?\s*\d{3}\s*-?\s*\d{4}$"
Dim RegExNameSpace As String = "myRegExNS"
Dim RegExType As String = "myRegExType"
Dim RegExIsPublic As Boolean = True
Dim RegExAssembly As New _
System.Reflection.AssemblyName("myRegExAssembly")
Dim CompileRegExParms As New RegexCompilationInfo( _
myRegexString, _
RegexOptions.Compiled, _
RegExType, _
RegExNameSpace, _
RegExIsPublic)
Dim CompileRegArray() _
As RegexCompilationInfo = {CompileRegExParms}
Regex.CompileToAssembly(CompileRegArray, RegExAssembly)
End Sub

Notice that the actual RegEx is now just a string (instead of being declared as a Regex object as it was in the article referenced earlier). That's because it's passed to the RegexCompilationInfo constructor, which saves it in the Pattern property. The illustration below shows the property displayed in a MsgBox.

[Illustration: the RegEx string property displayed in a MsgBox]

In addition to the actual text of the RegEx, we need to declare ...

  • The namespace that the compiled RegEx will be in: myRegExNS
  • The name of the type of the compiled RegEx: myRegExType
  • Whether the compiled Regex will be public or not: RegExIsPublic
  • The name of the assembly (the compiled DLL) for the RegEx: myRegExAssembly

Most of this information is simply passed to the New constructor for the RegexCompilationInfo object. The assembly name is used when the RegEx is compiled.

One additional detail needs doing. The Regex.CompileToAssembly method actually expects an array of RegexCompilationInfo objects, not just one. The assumption is that you will compile a lot of different regular expressions into the same DLL assembly. That's why this statement is necessary:

Dim CompileRegArray() _
As RegexCompilationInfo = {CompileRegExParms}

Once all this is done, compiling is just a method call:

Regex.CompileToAssembly(CompileRegArray, RegExAssembly)
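
Since the array exists so that several regular expressions can share one assembly, here is a minimal sketch of compiling two at once. The second pattern (a US ZIP code check) and the type name myZipRegExType are hypothetical, added only for illustration. Note also that Regex.CompileToAssembly is a .NET Framework feature; on .NET Core and .NET 5 or later it throws PlatformNotSupportedException, so this assumes a .NET Framework project.

```vbnet
Imports System
Imports System.Reflection
Imports System.Text.RegularExpressions

Module CompileSeveralRegExes
    ' Builds the array that CompileToAssembly expects, one entry per RegEx.
    Public Function BuildCompilationInfos() As RegexCompilationInfo()
        ' The telephone-number RegEx from this article.
        Dim phoneInfo As New RegexCompilationInfo( _
            "^1?\s*-?\s*(\d{3}|\(\s*\d{3}\s*\))" & _
            "\s*-?\s*\d{3}\s*-?\s*\d{4}$", _
            RegexOptions.None, "myRegExType", "myRegExNS", True)
        ' Hypothetical second RegEx (US ZIP code), for illustration only.
        Dim zipInfo As New RegexCompilationInfo( _
            "^\d{5}(-\d{4})?$", _
            RegexOptions.None, "myZipRegExType", "myRegExNS", True)
        Return New RegexCompilationInfo() {phoneInfo, zipInfo}
    End Function

    Sub Main()
        Dim infos() As RegexCompilationInfo = BuildCompilationInfos()
        ' Show what will be compiled: namespace, type name, and pattern.
        For Each info As RegexCompilationInfo In infos
            Console.WriteLine("{0}.{1}: {2}", _
                info.[Namespace], info.Name, info.Pattern)
        Next
        ' Writes myRegExAssembly.dll to the current directory.
        ' .NET Framework only (see the note above).
        Regex.CompileToAssembly(infos, New AssemblyName("myRegExAssembly"))
    End Sub
End Module
```

Each entry becomes its own type in the resulting DLL, so a project that references the assembly could use myRegExNS.myRegExType and myRegExNS.myZipRegExType side by side.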




Using the compiled RegEx is just like using any other object. Add a reference to it using the Project Properties dialog:

  1. Right-click the project and select Add Reference...
  2. Browse to the location of the assembly and select it.


Once the reference is in place, create an instance of the compiled type and call its methods just as you would with any other Regex:

Dim myCompiledRegex As New myRegExNS.myRegExType
Dim RegExResult As Boolean
RegExResult = myCompiledRegex.IsMatch("String to Match")

But just how much will this speed up your application? Let's find out.

I borrowed the timing code from my StringBuilder article and simply changed the variables a bit and had this running in a few minutes ...

Imports System.Diagnostics
Imports System.Text.RegularExpressions

Public Class CompiledRegEx
Dim InlineTimeSpan As Integer
Dim CompiledTimeSpan As Integer
Dim LoopLimit As Integer
Dim PhoneNum(9999) As String
Dim I As Integer

Private Sub TimeEm_Click( ...
If UseInline.Checked Then UseInlineTest()
If UseCompiled.Checked Then UseCompiledTest()
CalcImprovement()
End Sub

Private Sub UseInlineTest()
Dim myRegexString As New Regex( _
"^1?\s*-?\s*(\d{3}|\(\s*\d{3}\s*\))" & _
"\s*-?\s*\d{3}\s*-?\s*\d{4}$")
Dim RegExResult As Boolean
Dim myStopwatch As Stopwatch = New Stopwatch
Dim myTimeSpan As New TimeSpan
myStopwatch.Reset()
myStopwatch.Start()
For I = 0 To LoopLimit
RegExResult = _
myRegexString.IsMatch(PhoneNum(I).ToString)
Next
myStopwatch.Stop()
myTimeSpan = myStopwatch.Elapsed
InlineTimeSpan = CInt(myTimeSpan.TotalMilliseconds)
UsingInline.Text = myTimeSpan.ToString
End Sub

Private Sub UseCompiledTest()
Dim myStopwatch As Stopwatch = New Stopwatch
Dim myCompiledRegex As New myRegExNS.myRegExType
Dim RegExResult As Boolean
Dim myTimeSpan As New TimeSpan
myStopwatch.Reset()
myStopwatch.Start()
For I = 0 To LoopLimit
RegExResult = _
myCompiledRegex.IsMatch(PhoneNum(I).ToString)
Next
myStopwatch.Stop()
myTimeSpan = myStopwatch.Elapsed
CompiledTimeSpan = CInt(myTimeSpan.TotalMilliseconds)
UsingCompiled.Text = myTimeSpan.ToString
End Sub

Private Sub CalcImprovement()
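' The / operator returns a Double, so a zero divisor yields
' Infinity or NaN rather than an exception; test for it directly.
If CompiledTimeSpan > 0 Then
Improvement.Text = _
CStr(InlineTimeSpan / CompiledTimeSpan)
Else
Improvement.Text = ""
End If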
End Sub

Private Sub CompiledRegEx_Load( ...
Dim PhoneNumSuf As Integer = 0
Dim PhoneNumPre As String = "123-555-"
LoopLimit = CInt(NumberOfIterations.Text) - 1
For I = 0 To LoopLimit
PhoneNum(I) = _
PhoneNumPre & I.ToString.PadLeft(4, "0"c)
Next
End Sub
End Class

The result showed a very significant improvement of about 30 to 50 percent, but frankly, I expected more.
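
One detail worth checking when reading a TimeSpan taken from a Stopwatch: the Milliseconds property is only the millisecond component (0 to 999), while TotalMilliseconds is the whole interval. The sketch below uses a fixed 1.5 second TimeSpan as a stand-in for a real measurement; the module name is made up for illustration.

```vbnet
Imports System
Imports System.Diagnostics

Module ElapsedTimeDemo
    Sub Main()
        ' A fixed interval standing in for Stopwatch.Elapsed.
        Dim elapsed As TimeSpan = TimeSpan.FromMilliseconds(1500)

        ' Component only: the 1 second part is dropped.
        Console.WriteLine(elapsed.Milliseconds)        ' prints 500
        ' Whole interval expressed in milliseconds.
        Console.WriteLine(elapsed.TotalMilliseconds)   ' prints 1500

        ' With a real Stopwatch, ElapsedMilliseconds gives the whole
        ' interval as a Long without going through TimeSpan at all.
        Dim myStopwatch As Stopwatch = Stopwatch.StartNew()
        myStopwatch.Stop()
        Console.WriteLine(myStopwatch.ElapsedMilliseconds)
    End Sub
End Module
```

For runs longer than one second, a ratio computed from the Milliseconds component would be misleading, so the total is the safer number to compare.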



After examining the code for a while, I wondered if VB.NET was optimizing the RegEx by reusing a version 'compiled' in real time. To test this guess, I coded a new version that forces the RegEx to be declared on every iteration of the test loop. The loop now sits in the calling Sub, outside the test methods, so each call to a test method creates a new RegEx object.

Imports System.Diagnostics
Imports System.Text.RegularExpressions

Public Class CompiledRegEx
Dim InlineTimeSpan As Integer
Dim CompiledTimeSpan As Integer
Dim LoopLimit As Integer
Dim PhoneNum(9999) As String
Dim I As Integer

Private Sub TimeEm_Click( ...
Dim myStopwatch As Stopwatch = New Stopwatch
Dim myTimeSpan As New TimeSpan
myStopwatch.Reset()
myStopwatch.Start()
If UseInline.Checked Then
For I = 0 To LoopLimit
UseInlineTest(I)
Next
End If
myStopwatch.Stop()
myTimeSpan = myStopwatch.Elapsed
InlineTimeSpan = CInt(myTimeSpan.TotalMilliseconds)
UsingInline.Text = myTimeSpan.ToString
myStopwatch.Reset()
myStopwatch.Start()
If UseCompiled.Checked Then
For I = 0 To LoopLimit
UseCompiledTest(I)
Next
End If
myStopwatch.Stop()
myTimeSpan = myStopwatch.Elapsed
CompiledTimeSpan = CInt(myTimeSpan.TotalMilliseconds)
UsingCompiled.Text = myTimeSpan.ToString
CalcImprovement()
End Sub

Private Sub UseInlineTest(ByVal I As Integer)
Dim myRegexString As New Regex( _
"^1?\s*-?\s*(\d{3}|\(\s*\d{3}\s*\))" & _
"\s*-?\s*\d{3}\s*-?\s*\d{4}$")
Dim RegExResult As Boolean
RegExResult = _
myRegexString.IsMatch(PhoneNum(I).ToString)
End Sub

Private Sub UseCompiledTest(ByVal I As Integer)
Dim myCompiledRegex As New myRegExNS.myRegExType
Dim RegExResult As Boolean
RegExResult = _
myCompiledRegex.IsMatch(PhoneNum(I).ToString)
End Sub

Private Sub CalcImprovement()
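' The / operator returns a Double, so a zero divisor yields
' Infinity or NaN rather than an exception; test for it directly.
If CompiledTimeSpan > 0 Then
Improvement.Text = _
CStr(InlineTimeSpan / CompiledTimeSpan)
Else
Improvement.Text = ""
End If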
End Sub

Private Sub CompiledRegEx_Load( ...
Dim PhoneNumSuf As Integer = 0
Dim PhoneNumPre As String = "123-555-"
LoopLimit = CInt(NumberOfIterations.Text) - 1
For I = 0 To LoopLimit
PhoneNum(I) = PhoneNumPre & I.ToString.PadLeft(4, "0"c)
Next
End Sub
End Class

This time, the results were much closer to what I expected: about 7 1/2 times faster!


The lesson here is that if you can avoid re-executing the statement that declares your RegEx, it will add more to your execution speed than even compiling the RegEx.
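
That lesson can be tried without a compiled assembly at all. The following is a minimal console sketch, not the form-based test above: it times a Regex created once and reused against a Regex created on every call. The module and function names are made up for illustration, and the timings will vary by machine and .NET version, so no expected numbers are shown.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.Text.RegularExpressions

Module ReuseRegexDemo
    Private Const PhonePattern As String = _
        "^1?\s*-?\s*(\d{3}|\(\s*\d{3}\s*\))" & _
        "\s*-?\s*\d{3}\s*-?\s*\d{4}$"

    ' Created once and shared by every call to MatchWithCachedRegex.
    Private ReadOnly CachedRegex As New Regex(PhonePattern)

    Public Function MatchWithCachedRegex(ByVal input As String) As Boolean
        Return CachedRegex.IsMatch(input)
    End Function

    ' Re-executes the declaration on every call, as in the second test.
    Public Function MatchWithNewRegex(ByVal input As String) As Boolean
        Dim freshRegex As New Regex(PhonePattern)
        Return freshRegex.IsMatch(input)
    End Function

    Sub Main()
        Const Iterations As Integer = 10000
        Dim matched As Boolean

        Dim cachedWatch As Stopwatch = Stopwatch.StartNew()
        For i As Integer = 0 To Iterations - 1
            matched = MatchWithCachedRegex( _
                "123-555-" & i.ToString().PadLeft(4, "0"c))
        Next
        cachedWatch.Stop()

        Dim freshWatch As Stopwatch = Stopwatch.StartNew()
        For i As Integer = 0 To Iterations - 1
            matched = MatchWithNewRegex( _
                "123-555-" & i.ToString().PadLeft(4, "0"c))
        Next
        freshWatch.Stop()

        Console.WriteLine("Reused Regex:   {0} ms", _
            cachedWatch.Elapsed.TotalMilliseconds)
        Console.WriteLine("New every call: {0} ms", _
            freshWatch.Elapsed.TotalMilliseconds)
    End Sub
End Module
```

Keeping the Regex in a module-level (or Shared) ReadOnly field is one simple way to make sure the declaration runs only once.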
