ASP.NET: How to Debug Regular Expressions
This article gives several tips on how to debug regular
expressions in ASP.NET applications.
Debugging long and complex regular expressions can be very challenging
and time-consuming. Using these tips should increase your effectiveness
and shorten your debugging time.
Chunk Long Regular Expressions into Short Ones
Where is the bug in your regular expression? It can be difficult to locate bugs
in long complex regular expressions.
Bad Example
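public void Parse(string Input)
{
    Match MatchObj;
    string Pattern;
    // One long pattern: if the match fails, nothing indicates which part is at fault.
    Pattern = "(?:January|...|December) [0-9]{1,2}, [0-9]{4} at ([0-9]{1,2}):([0-9]{2}) (AM|PM)";
    MatchObj = Regex.Match(Input, Pattern);
}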
Solution
Chunk long regular expressions into short regular expressions.
Start debugging by including only the first pattern and commenting
out the remaining patterns. Once you have
the first pattern debugged and working, add the next pattern.
Better Example
public void Parse(string Input)
{
    Match MatchObj;
    string Pattern;
    string PatternAmPm;
    string PatternDay;
    string PatternHours;
    string PatternMinutes;
    string PatternMonth;
    string PatternSeconds;
    string PatternYear;
    // Each chunk ends with the delimiter that follows it in the input string.
    PatternMonth = "(?:January|...|December) ";
    PatternDay = "[0-9]{1,2}, ";
    PatternYear = "[0-9]{4} at ";
    PatternHours = "([0-9]{1,2}):";
    PatternMinutes = "([0-9]{2}):";
    PatternSeconds = "([0-9]{2}) ";
    PatternAmPm = "(AM|PM)";
    Pattern = PatternMonth + PatternDay + PatternYear;
    Pattern += PatternHours + PatternMinutes + PatternSeconds + PatternAmPm;
    MatchObj = Regex.Match(Input, Pattern);
}
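Instead of commenting chunks in and out by hand, the pattern can be grown one chunk at a time, reporting the first chunk at which matching stops. The following is a minimal sketch under some assumptions: ChunkDebug, FirstFailingChunk, and SampleChunks are hypothetical names (not part of .NET), and the month list is shortened to two names purely for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ChunkDebug
{
    // Illustrative chunks for input such as "January 15, 2008 at 12:43:04 PM".
    // The month list is deliberately shortened for this example.
    public static readonly string[] SampleChunks =
    {
        "(?:January|December) ",
        "[0-9]{1,2}, ",
        "[0-9]{4} at ",
        "([0-9]{1,2}):",
        "([0-9]{2}):",
        "([0-9]{2}) ",
        "(AM|PM)"
    };

    // Hypothetical helper: returns the index of the first chunk that makes the
    // cumulative pattern stop matching, or -1 if every prefix of the pattern matches.
    public static int FirstFailingChunk(string input, string[] chunks)
    {
        string pattern = "^"; // anchor so the chunks must match in sequence from the start
        for (int i = 0; i < chunks.Length; i++)
        {
            pattern += chunks[i];
            if (!Regex.Match(input, pattern).Success)
            {
                return i;
            }
        }
        return -1;
    }
}
```

With ChunkDebug.SampleChunks, the input "January 15, 2008 at 12:43 PM" (no seconds) yields 4, the index of the minutes chunk that expects a trailing colon, while "January 15, 2008 at 12:43:04 PM" yields -1.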
Dump Match Object
The Regex.Match method returns a Match object containing the
parsed elements in the Groups and Captures collections. You
need to view the contents of these collections to determine
whether the regular expression correctly parsed the input string.
However, the Visual Studio (2005) Locals debug window does not display
the contents of the Groups and Captures collections.
Solution
Create a Dump method that accepts a Regex Match
object as an input parameter, walks through the Groups and
Captures collections, and gets the desired data values.
public void Dump(Match MatchObj)
{
    Capture CaptureObj;
    Group GroupObj;
    string Value;
    for (int g = 0; g < MatchObj.Groups.Count; g++)
    {
        GroupObj = MatchObj.Groups[g];
        for (int c = 0; c < GroupObj.Captures.Count; c++)
        {
            CaptureObj = GroupObj.Captures[c];
            Value = CaptureObj.Value;
        } // <-- Set breakpoint here.
    }
}
Call the Dump method and pass the Match object immediately after calling Regex.Match:
public bool Parse(string Input)
{
    Match MatchObj;
    string Pattern = "[0-9]*";
    MatchObj = Regex.Match(Input, Pattern);
    Dump(MatchObj);
    return MatchObj.Success;
}
Set a breakpoint in the Dump method immediately after getting the Capture value (marked
by the <-- comment in the Dump method).
Start your application in debug mode. When the debugger stops at the
breakpoint, view Value in the Locals window. Continue executing the
loop until you have walked through the entire collection.
Additional Suggestions
You may want to update the Dump method to write the match information to your trace output stream.
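One possible shape for such a tracing version is sketched below. It assumes System.Diagnostics.Trace as the output stream; MatchDumper and Describe are hypothetical names introduced here. In an ASP.NET page, the same string could be passed to the page's own Trace.Write instead.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class MatchDumper
{
    // Builds one line per capture in the form:
    //   Group[g].Capture[c] @index = 'value'
    public static string Describe(Match match)
    {
        StringBuilder sb = new StringBuilder();
        sb.AppendLine("Success = " + match.Success);
        for (int g = 0; g < match.Groups.Count; g++)
        {
            Group group = match.Groups[g];
            for (int c = 0; c < group.Captures.Count; c++)
            {
                Capture capture = group.Captures[c];
                sb.AppendLine("Group[" + g + "].Capture[" + c + "] @"
                    + capture.Index + " = '" + capture.Value + "'");
            }
        }
        return sb.ToString();
    }

    // Writes the description to the configured trace listeners.
    public static void Dump(Match match)
    {
        Trace.Write(Describe(match));
    }
}
```

Keeping the formatting in Describe separate from the Trace call makes the output easy to inspect in a debugger or a test as well.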
Capture Non-Relevant Substrings
Often you want only a few pieces of data from a string; the remaining
items are not required by your application. The usual approach is to write
a regular expression that matches both the relevant and the non-relevant substrings but captures only the relevant ones
using the "()" capture expressions. If Regex.Match then fails on one of the non-relevant
substrings, you don't know where the failure occurred.
Bad Example
Suppose we want to parse the time from a string like:
January 15, 2008 at 12:43:04 PM
We could create a regular expression to capture only the time at the end of the string
and not capture the date components at the beginning of the string like:
Match MatchObj;
string Pattern;
string Pattern1;
string Pattern2;
Pattern1 = "(?:January|...|December) ";
Pattern2 = "[0-9]{1,2}, [0-9]{4} at ([0-9]{1,2}):([0-9]{2}):([0-9]{2}) (AM|PM)";
Pattern = Pattern1 + Pattern2;
MatchObj = Regex.Match(Input, Pattern);
Solution
The solution is to capture non-relevant substrings. By doing so you are able to
view in the debugger what substrings Regex.Match was able to match
and where it stopped in the matching process.
Better Example
We've added capture expressions around the month name, day of month, and year
subexpressions:
Match MatchObj;
string Pattern;
string Pattern1;
string Pattern2;
Pattern1 = "(January|...|December) ";
Pattern2 = "([0-9]{1,2}), ([0-9]{4}) at ([0-9]{1,2}):([0-9]{2}):([0-9]{2}) (AM|PM)";
Pattern = Pattern1 + Pattern2;
MatchObj = Regex.Match(Input, Pattern);
One could argue that adding capture expressions around the delimiters might be useful too.
Use Regex.Match not Regex.IsMatch
Regex.IsMatch gives you a pass/fail on whether the input string
matched the pattern. This general result is good once you have
created and tested the regular expression. However, when you are
debugging the regular expression you don't know what part of the regular
expression failed to match a good input string.
Bad Example
bool Status;
Status = Regex.IsMatch(Input, Pattern);
Solution
Use Regex.Match instead of Regex.IsMatch and use the Dump solution
given above to view the captures.
Better Example
Match MatchObj;
MatchObj = Regex.Match(Input, Pattern);
Dump(MatchObj);
Use Named Indexes For Groups and Captures
As you add or remove subcomponents to your regular
expression the index position of the captured
strings will change in the Groups and Captures collections.
If you use numerical indices you'll waste a lot of time
updating the indices.
A second problem: You can't tell from the numerical indices
what the associated value is. Is "2" the "Month", or the "Year"?
Bad Example
Suppose we are parsing a date string like "05/23/03":
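public DateTime DateParse(string Input)
{
    string Day;
    Match MatchObj;
    string Month;
    string Year;
    // Pattern is assumed to be a class-level field holding the date expression.
    MatchObj = Regex.Match(Input, Pattern);
    Month = MatchObj.Groups[1].Captures[0].Value;
    Day = MatchObj.Groups[2].Captures[0].Value;
    Year = MatchObj.Groups[3].Captures[0].Value;
    return new DateTime(Convert.ToInt32(Year), Convert.ToInt32(Month), Convert.ToInt32(Day));
}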
If we update the regular expression to parse additional substrings
at the beginning of the input string we'll need to update the
Group collection indices.
Solution
The solution is to use named indexes, such as variables,
to index into the Groups collection.
Better Example
private int m_Month = 1;
private int m_Day = 2;
private int m_Year = 3;
public DateTime DateParse(string Input)
{
    string Day;
    Match MatchObj;
    string Month;
    string Year;
    // Pattern is assumed to be a class-level field holding the date expression.
    MatchObj = Regex.Match(Input, Pattern);
    Month = MatchObj.Groups[m_Month].Captures[0].Value;
    Day = MatchObj.Groups[m_Day].Captures[0].Value;
    Year = MatchObj.Groups[m_Year].Captures[0].Value;
    return new DateTime(Convert.ToInt32(Year), Convert.ToInt32(Month), Convert.ToInt32(Day));
}
For a short regular expression and a small number of substring captures
as shown in our example this tip may not be valuable. This tip becomes
very valuable in cases where a long complex regular expression is used.
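.NET regular expressions also support named groups, written (?<Name>...), which can be read back with Groups["Name"]. This is a different technique from the index variables above, shown here as a minimal sketch; the pattern and the NamedGroupDemo and Split names are assumptions made for illustration, not taken from this article.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Assumed pattern for input such as "05/23/03" (month/day/year).
    private const string Pattern =
        @"^(?<Month>[0-9]{2})/(?<Day>[0-9]{2})/(?<Year>[0-9]{2})$";

    // Returns the month, day, and year substrings (in that order),
    // or null when the input does not match the pattern.
    public static string[] Split(string input)
    {
        Match match = Regex.Match(input, Pattern);
        if (!match.Success)
        {
            return null;
        }
        return new string[]
        {
            match.Groups["Month"].Value,
            match.Groups["Day"].Value,
            match.Groups["Year"].Value
        };
    }
}
```

Because each group is looked up by name, adding or removing other groups in the pattern does not require renumbering anything.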
Check Count or Success
What if the match fails? Do we assume the input string
will be correctly formatted and Regex.Match will always succeed?
Bad practice. We should always assume the input string may
contain invalid data and Regex.Match can fail.
Bad Example
Here we assume the match succeeded and we grab the expected values
from the Groups and Captures collections.
private int m_Month = 1;
private int m_Day = 2;
private int m_Year = 3;
public DateTime DateParse(string Input)
{
    string Day;
    Match MatchObj;
    string Month;
    string Year;
    // No check on the match result: a failed match throws when the captures are read.
    MatchObj = Regex.Match(Input, Pattern);
    Month = MatchObj.Groups[m_Month].Captures[0].Value;
    Day = MatchObj.Groups[m_Day].Captures[0].Value;
    Year = MatchObj.Groups[m_Year].Captures[0].Value;
    return new DateTime(Convert.ToInt32(Year), Convert.ToInt32(Month), Convert.ToInt32(Day));
}
Solution
Check Match.Success or the Groups count.
Match.Success is a boolean value that tells whether the match
succeeded or failed.
Match.Groups.Count is one (1) for the entire match plus the number
of capture groups in the pattern.
Better Example
private int m_Month = 1;
private int m_Day = 2;
private int m_Year = 3;
public bool DateParse(string Input, ref DateTime Out)
{
    string Day;
    int DayAsInt;
    Match MatchObj;
    string Month;
    int MonthAsInt;
    string Year;
    int YearAsInt;
    MatchObj = Regex.Match(Input, Pattern);
    if (!MatchObj.Success) return false;
    // Or, we can check Groups.Count
    if (MatchObj.Groups.Count != 4) return false;
    Month = MatchObj.Groups[m_Month].Captures[0].Value;
    Day = MatchObj.Groups[m_Day].Captures[0].Value;
    Year = MatchObj.Groups[m_Year].Captures[0].Value;
    MonthAsInt = Convert.ToInt32(Month);
    DayAsInt = Convert.ToInt32(Day);
    YearAsInt = Convert.ToInt32(Year);
    Out = new DateTime(YearAsInt, MonthAsInt, DayAsInt);
    return true;
}
Ignore Case
Most of the time (in my experience) we want to capture the substrings
from the input string without regard to the case of the characters.
If you call Regex.Match as shown in the Bad Example below, the
default behavior is to match the characters exactly. In this
case the input string contains lowercase characters and the
pattern specifies uppercase characters, so the match fails.
Note that RegexOptions is not specified in the call to Regex.Match,
so the default case-sensitive matching applies.
Bad Example
Where Input = "abc";
public bool Parse(string Input)
{
    Match MatchObj;
    MatchObj = Regex.Match(Input, "[A-Z]{3}");
    return MatchObj.Success;
}
Solution
Use RegexOptions.IgnoreCase or include both lowercase and uppercase
characters in your regular expression pattern.
Better Example
public bool Parse(string Input)
{
    Match MatchObj;
    MatchObj = Regex.Match(Input, "[A-Z]{3}", RegexOptions.IgnoreCase);
    // Or do this:
    MatchObj = Regex.Match(Input, "[a-zA-Z]{3}");
    return MatchObj.Success;
}
Use a Test Framework
The input string to your parsing method can have
potentially hundreds of variations. Can your regular
expression parse these variations successfully?
Solution
Use a test framework, such as MbUnit, to test
all potential variations.
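The idea can be sketched without any particular framework as a table of inputs and expected results. The code below does not use MbUnit's API; DatePatternChecks, CountFailures, and the MM/DD/YY pattern are assumptions made for illustration. In a real test framework, each input would typically become its own test case.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DatePatternChecks
{
    // Assumed pattern under test (MM/DD/YY); substitute your own.
    public const string Pattern = @"^[0-9]{2}/[0-9]{2}/[0-9]{2}$";

    // Runs every input against Pattern and returns the number of cases
    // whose actual result differs from the expected result.
    public static int CountFailures(string[] inputs, bool[] expected)
    {
        int failures = 0;
        for (int i = 0; i < inputs.Length; i++)
        {
            bool actual = Regex.Match(inputs[i], Pattern).Success;
            if (actual != expected[i])
            {
                Console.WriteLine("FAILED: '" + inputs[i] + "' expected "
                    + expected[i] + " but got " + actual);
                failures++;
            }
        }
        return failures;
    }
}
```

For example, the inputs "05/23/03", "5/23/03", "05-23-03", and "" could be paired with the expectations true, false, false, and false; any mismatch is printed and counted.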