Mr. Clean swept his way into American homes in 1958 after Proctor and Gamble bought the rights from its creator, Linwood Burton. Interestingly, Regular Expressions were first described in the 50’s by Stephen Cole Kleene. Coincidence? I think not.
However, today’s post is not about household cleaning products, nor is it a history lesson in Computer Science. It is about scrubbing data in VBA using Regular Expressions (RegExp).
Substitute ()
I see many folks using multiple nested levels of the Substitute() function to try to clean their data. This can be time consuming and can lead to some inaccuracies. Instead, we can use Regular Expressions (RegExp) to increase efficiency and accuracy as well as handle complex strings and large data sets quickly and efficiently.
User Defined Functions
I also see folks create User Defined Functions (UDF’s) to manipulate strings. Here is a nice example the other day from Doug Jenkins over at Newton Excel Bach. (btw, If you are not following Doug’s blog, you should be) But is there a better way? Enter Regular Expressions (RegExp).
Regular Expressions (RegExp)
A Regular Expression is a sequence of characters that create a pattern. The sequence could be something complicated like <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)\1> to something simple like \d . Regular Expressions are very useful in VBA for working with many different scenarios of strings and introducing automation for transforming your data before loading to target databases for OLAP such as Essbase, Power Pivot or SSAS.
Regular Expression – Methods
Regular Expressions in VBA offers 3 Methods:
- Test
- Replace
- Execute
For today, I will focus on the Execute Method and look at the other Methods in future posts.
An example
Here’s a silly nonsensical string, 12kj$%23fg^&*34950…345. Let’s say I’m only interested in the numbers in the string so I want to return 122334950345. To return just the numbers from the string, I’ll use the pattern [0-9].
Option Explicit Sub RegExFoo() 'Author: Winston Snyder 'Date: 12/11/2013 'Extract a pattern of interest from an input string 'Declare variables Dim wb As Workbook Dim ws As Worksheet Dim RE As Object Dim Match As Object Dim Matches As Object Dim strNumber As String Dim i As Long 'Initialize variables Set wb = ThisWorkbook Set ws = wb.Worksheets("Sheet1") Set RE = CreateObject("VBScript.RegExp") 'Criteria for Regular Expression With RE .Pattern = "[0-9]" .Global = True Set Matches = .Execute(ws.Range("A1").Value) End With 'Loop Matches collection to build string of all numbers in the sample string strNumber = "" For i = 0 To Matches.Count - 1 strNumber = strNumber + Matches(i) Next i 'Output MsgBox strNumber 'Tidy up 'Destroy objects Set Matches = Nothing Set RE = Nothing Set ws = Nothing Set wb = Nothing End Sub
Output : 122334950345
Great! The code returned exactly what I was looking for.
Case Study: RegExp
The sample is a bit obscure and seems somewhat unlikely, but who knows – might be just the thing someone is looking for. Let’s take a look as something a bit more realistic.
Let’s say we receive a load file from Financial Analysis and Planning (FP&A) that they would like loaded to a cube as a forecast scenario. For our sample, we’ll use Customer Codes and Total Revenues. We receive the file, but we immediately see that there is a problem.
There is no delimiter between the customer code and the total revenue amount. Additionally, both sets of substrings are of varying length meaning we cannot use text to columns without some manual cleanup work. Let’s look at one way we might split these strings using Regular Expressions.
I’m going to:
- Read the strings into an array
- Loop the array
- Split the string into substrings – load the substrings into new arrays
- Output the contents of the new arrays to a worksheet
A Quick Note – Early / Late Binding
A discourse on Early / Late Binding is beyond the scope of this post. Suffice to say, I am using Late Binding as demonstrated:
Dim RegEx As Object
'Create a regular epression object Set RegEx = GetRegEx
Private Function GetRegEx() As Object On Error Resume Next Set GetRegEx = CreateObject("VBScript.RegExp") End Function
There are additional links at the bottom of the post for more information on Early / Late Binding.
The Complete Code
You can open a new workbook, launch the Visual Basic Editor (VBE) add a new module and paste the code below into the module. I broke the Subs() and Functions() into separate snippets to improve readability. Alternatively, you can download the workbook, the link is at the bottom of the post.
Option Explicit Sub SplitStringNoDelimiter() 'Author: Winston Snyder 'Date: 12/15/2013 'Purpose: Split string into text and value components 'Comments: No delimiter ' Loop array for output '--------------------------------------------------------------------------------------------------------------------------------------------- Dim RegEx As Object Dim wb As Workbook Dim wsInput As Worksheet Dim wsOutput As Worksheet Dim rngInput As Range Dim rngOutputDescriptions As Range Dim rngOutputValues As Range Dim arrInput() As Variant Dim arrOutputDescriptions() As Variant Dim arrOutputValues() As Variant Dim i As Long Dim lngRowsData As Long Const strPatternDescriptions As String = "\D+" Const strPatternValues As String = "\d+(\.\d{1,2})?" Const lngColumnDescriptions As Long = 1 Const lngColumnValues As Long = 2 'Excel enrionment - speed things up With Application .ScreenUpdating = False .DisplayAlerts = False .EnableEvents = False .Calculation = xlCalculationManual End With 'Initialize Set wb = ThisWorkbook With wb Set wsInput = .Worksheets("Input") Set wsOutput = .Worksheets("Output") End With 'Clear previous output wsOutput.UsedRange.ClearContents 'Input range without header With wsInput lngRowsData = GetRow(ws:=wsInput) Set rngInput = .Range(.Cells(2, 1), .Cells(lngRowsData, 1)) End With 'Transfer input range to input array arrInput = rngInput 'Dimension output arrays ReDim arrOutputDescriptions(LBound(arrInput) To UBound(arrInput)) ReDim arrOutputValues(LBound(arrInput) To UBound(arrInput)) 'Create a regular epression object Set RegEx = GetRegEx 'Loop through each string in the input array For i = LBound(arrInput) To UBound(arrInput) 'Pass the string to regular expression function to return the descriptive portion of the string arrOutputDescriptions(i) = GetSubString(objRegEx:=RegEx, _ strString:=CStr(arrInput(i, 1)), _ strPattern:=strPatternDescriptions) 'Pass the string to regualr expressions functions to return the value portion of the string arrOutputValues(i) = GetSubString(objRegEx:=RegEx, _ strString:=CStr(arrInput(i, 1)), _ strPattern:=strPatternValues) Next i 'Output all elements of each array to an output range 'Description in Column 1, Values in Column 2 'Descriptions Call OutputArray(ws:=wsOutput, _ vTmpArray:=arrOutputDescriptions, _ lngColumn:=lngColumnDescriptions) 'Values Call OutputArray(ws:=wsOutput, _ vTmpArray:=arrOutputValues, _ lngColumn:=lngColumnValues) 'Add a header to the data With wsOutput .Range("A1").EntireRow.Insert shift:=xlDown .Cells(1, 1) = "Descriptions" .Cells(1, 2) = "Values" End With 'Tidy up 'Erase arrays Erase arrInput Erase arrOutputDescriptions Erase arrOutputValues 'Destroy objects Set RegEx = Nothing Set rngInput = Nothing Set wsInput = Nothing Set wsOutput = Nothing Set wb = Nothing 'Restore Excel environment With Application .ScreenUpdating = True .DisplayAlerts = True .EnableEvents = True .Calculation = xlCalculationAutomatic End With End Sub
Private Sub OutputArray(ws As Worksheet, _ vTmpArray() As Variant, _ lngColumn As Long) Dim j As Long For j = LBound(vTmpArray) To UBound(vTmpArray) ws.Cells(j, lngColumn).Value = vTmpArray(j) Next j End Sub
Private Function GetRow(ws As Worksheet) As Long GetRow = ws.Cells(Rows.Count, 1).End(xlUp).Row End Function
Private Function GetRange(ws As Worksheet, _ lngRowsStart As Long, _ lngRowsEnd As Long, _ lngColumn As Long) As Range Dim rng As Range With ws Set rng = .Range(.Cells(lngRowsStart, lngColumn), .Cells(lngRowsEnd, lngColumn)) End With Set GetRange = rng End Function
Private Function GetRegEx() As Object On Error Resume Next Set GetRegEx = CreateObject("VBScript.RegExp") End Function
Private Function GetSubString(objRegEx As Object, _ strString As String, _ strPattern As String) As String Dim reMatches As Object Dim strResult As String strResult = "No Match" With objRegEx .Pattern = strPattern .Global = True Set reMatches = .Execute(strString) If reMatches.Count <> 0 Then strResult = reMatches.Item(0) End If End With GetSubString = strResult End Function
The Results
Great! Exactly what I was looking for.
Tidy up
Final Thoughts
Regular Expressions are found in nearly all programming languages and much like Duct Tape, they have a Million and one uses. You are truly only limited by your imagination and ability to concoct the correct patterns. I use Regular Expressions as I Extract data from various Data Silos to Transform and normalize data prior to Loading to target reporting databases. Let us know how you use Regular Expressions in the comments.
Downloads
Download the workbook from OneDrive.