26 January 2012
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Using this quote when talking about regular expressions is not exactly original, I know, but I do have a long-standing mistrust of, and borderline disdain for, regular expressions - which may well be related to the fact that they are not exactly my forte. Unfortunately, they also seem to be frequently used by people whose forte they aren't either! Often when I come across them they don't cover all the edge cases that the writer originally expected (or never anticipated at all) - and then they mutate over time into barely-readable strings of symbols that are more difficult to comprehend and maintain (and slower) than a set of functionally-equivalent string manipulation procedures. Don't even get me started on the fact that commenting them seems to be bypassed every time.. since the regex itself is so small, the comment would dwarf it, and that would be stupid, right? Wrong.
Everyone knows the classic email validation example, which is frequently brought out as a case against regular expressions, but I've got two other stories I suffered through first-hand:
I wrote a CSS minimiser for use in a Classic ASP Javascript app some years ago using a regular expression to strip the comments out before further processing was done, thus:
return strContent.replace(/\/\*(.|[\r\n])*?\*\//g, "");
I did my research on the old t'interwebs and this seemed to be well recommended and would do just what I wanted. It worked fine for a few weeks until - out of the blue - IIS was flatlining the CPU and requests were timing out. I don't even remember how we tracked this down but it eventually transpired that a stylesheet had an unclosed comment in it. Appending "/**/" to the content before performing the replacement made the problem disappear.
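To show the workaround in action, here's a minimal standalone sketch (the sample CSS and the Strip helper are mine, but the pattern is the one above translated to .NET) - with the "/**/" suffix appended, the lazy quantifier is always guaranteed a closing "*/" to find:

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentStripFix
{
    public static string Strip(string css)
    {
        // The original comment-stripping expression, translated to .NET
        var commentRemover = new Regex(@"/\*(.|[\r\n])*?\*/");

        // Appending "/**/" guarantees that even an unclosed comment gets
        // terminated, so the match can always complete
        return commentRemover.Replace(css + "/**/", "");
    }

    static void Main()
    {
        // A stylesheet with an unclosed comment - without the appended "/**/"
        // the engine could never find a closing "*/" for this one
        var stripped = CommentStripFix.Strip("body { color: red; }\r\n/* whoops, unclosed");
        Console.WriteLine(stripped); // the rule survives, the broken comment is gone
    }
}
```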
The second example was a component I was given to integrate with at work, part of whose job was to query a Hotel Availability Web Service. The response XML was always passed through a regular expression that would ensure no credit card details appeared in the content. The XML returned often described detailed information from many Suppliers and could be several megabytes of text, so when these calls were taking over 60 seconds and pegging the CPU I was told that it must be the weight of data and the deserialisation causing it. Watching the data move back and forth in Fiddler, though, it was clear that these requests would complete in around 6 seconds.. further investigation by the component writer eventually confirmed that the deserialisation took very little time or resources (well, "very little" in relation to a 60 second / 100% CPU event) but that the regular expression scanning for the card details was creating all the work. The best part being that these responses would never contain any credit card details; it's just that this expression had been applied to all responses for "consistency".
It could well be argued that none of these cases are really the fault of regular expressions themselves - the email example is misuse of a tool, the CSS comment removal could be down to the regex engine implementation (possibly?!) and the availability issue was entirely unnecessary work. But the fact that these issues are lurking out there (waiting to strike!) makes me wary - which is not a reason in isolation not to use something, but it definitely makes me think that my understanding not only of how they can be written but also of the implications of how they will be processed could do with serious improvement. And the same goes for anyone else writing these regular expressions - if you don't know how they're being processed, how do you know whether or not they'll scale to text more than a few lines long? Will they scale linearly, or exponentially, or in some completely different manner? Again, these are not exactly original thoughts and Joel Spolsky's Leaky Abstractions article basically says (much more eloquently) that you should understand at least one layer below the current abstraction you're using.
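Out of curiosity, a quick and entirely unscientific experiment of my own that hints at an answer for the comment-stripping expression above: in .NET, "." matches "\r" and so does the "[\r\n]" branch, so every "\r" in the content gives the engine two ways to match the same character - and when no closing "*/" exists, a failed match can explore an explosion of backtracking combinations:

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class BacktrackDemo
{
    public static long TimeFailedMatch(string pattern, string input)
    {
        // Both patterns return false here (there is no closing "*/") - what
        // differs is how long the engine takes to give up
        var stopwatch = Stopwatch.StartNew();
        Regex.IsMatch(input, pattern);
        return stopwatch.ElapsedMilliseconds;
    }

    static void Main()
    {
        // An unclosed comment followed by a number of CRLF-terminated lines
        var input = "/* unclosed";
        for (var i = 0; i < 20; i++)
            input += "some content\r\n";

        // The ambiguous "(.|[\r\n])" alternation can backtrack over every "\r"..
        Console.WriteLine(BacktrackDemo.TimeFailedMatch(@"/\*(.|[\r\n])*?\*/", input) + "ms");
        // ..while a single character class has only one way to match each
        // character and fails quickly
        Console.WriteLine(BacktrackDemo.TimeFailedMatch(@"/\*[\s\S]*?\*/", input) + "ms");
    }
}
```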
But so many people will tell you that regular expressions are a valuable tool to have on hand. And I've used ISAPI Rewriter before to deal with friendly urls and that was great. (Not that I can say I miss it now that I use ASP.Net MVC Routing instead though :) And there are definite occasions where regular expressions look like the ideal tool to use - the ones I "borrowed" to write the CSS minifier in my last post were so convenient and much nicer than the idea of parsing all that content manually. And so I'm off to try and expand my knowledge and experience by extending the minifier to deal with "@import" statements in the stylesheets..
This is what I've cobbled together for now. It probably looks to an experienced regular expression writer like it was written by a noob.. er, yeah, there's a good reason for that! :D And I'm not sure if the way I've tried to combine the various import formats using String.Join makes for more readable code or for code that looks like nonsense. Not to mention that the patterns all start and end exactly the same - is this duplication something I want to hide away (DRY) or would that harm the readability, which is also very important?
private static Regex ImportDeclarationsMatcher = new Regex(
    String.Join("|", new[]
    {
        // @import url("test.css") screen;
        "@import\\s+url\\(\"(?<filename>.*?)\"\\)\\s*(?<media>.*?)\\s*(?:;|\r|\n)",
        // @import url('test.css') screen;
        "@import\\s+url\\('(?<filename>.*?)'\\)\\s*(?<media>.*?)\\s*(?:;|\r|\n)",
        // @import url(test.css) screen;
        "@import\\s+url\\((?<filename>.*?)\\)\\s*(?<media>.*?)\\s*(?:;|\r|\n)",
        // @import "test.css" screen;
        "@import\\s+\"(?<filename>.*?)\"\\s*(?<media>.*?)\\s*(?:;|\r|\n)",
        // @import 'test.css' screen;
        "@import\\s+'(?<filename>.*?)'\\s*(?<media>.*?)\\s*(?:;|\r|\n)"
    }),
    RegexOptions.Compiled | RegexOptions.IgnoreCase
);
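On the DRY question above: since every branch shares the same "@import" prefix and media/terminator suffix, one alternative would be to join only the parts that vary. This is a sketch of my own (the builder class and its names are not part of the code above), relying on the fact that .NET permits the same group name in multiple alternation branches:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ImportDeclarationsMatcherBuilder
{
    // Only the part that varies between the five formats; the shared prefix
    // and suffix are applied once, below
    private static readonly string[] FilenameMatchers = new[]
    {
        "url\\(\"(?<filename>.*?)\"\\)", // @import url("test.css") screen;
        "url\\('(?<filename>.*?)'\\)",   // @import url('test.css') screen;
        "url\\((?<filename>.*?)\\)",     // @import url(test.css) screen;
        "\"(?<filename>.*?)\"",          // @import "test.css" screen;
        "'(?<filename>.*?)'"             // @import 'test.css' screen;
    };

    public static Regex Build()
    {
        return new Regex(
            String.Join("|", FilenameMatchers.Select(f =>
                "@import\\s+" + f + "\\s*(?<media>.*?)\\s*(?:;|\r|\n)"
            ).ToArray()),
            RegexOptions.Compiled | RegexOptions.IgnoreCase
        );
    }
}
```

Whether that's more readable than writing the five branches out in full (where each comment sits right next to the format it matches) is exactly the judgement call being mulled over here.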
/// <summary>
/// This will never return null nor any null instances. The content should be stripped of
/// comments before being passed in since there is no parsing done to ensure that the
/// imports matched exist in active (ie. non-commented-out) declarations.
/// </summary>
public static IEnumerable<StylesheetImportDeclaration> GetImports(string content)
{
    if (content == null)
        throw new ArgumentNullException("content");
    if (content.Trim() == "")
        return new NonNullImmutableList<StylesheetImportDeclaration>();

    // Note: The content needs a line return appended to the end just in case the last
    // line is an import that doesn't have a trailing semi-colon or line return of its
    // own (the Regex won't pick it up otherwise)
    var imports = new List<StylesheetImportDeclaration>();
    foreach (Match match in ImportDeclarationsMatcher.Matches(content + "\n"))
    {
        if (match.Success)
        {
            imports.Add(new StylesheetImportDeclaration(
                match.Value,
                match.Groups["filename"].Value,
                match.Groups["media"].Value
            ));
        }
    }
    return imports;
}
public class StylesheetImportDeclaration
{
    public StylesheetImportDeclaration(
        string declaration,
        string filename,
        string mediaOverride)
    {
        if (string.IsNullOrWhiteSpace(declaration))
            throw new ArgumentException("Null/blank declaration specified");
        if (string.IsNullOrWhiteSpace(filename))
            throw new ArgumentException("Null/blank filename specified");

        Declaration = declaration.Trim();
        Filename = filename.Trim();
        MediaOverride = string.IsNullOrWhiteSpace(mediaOverride)
            ? null
            : mediaOverride.Trim();
    }

    /// <summary>
    /// This will never be null or empty
    /// </summary>
    public string Declaration { get; private set; }

    /// <summary>
    /// This will never be null or empty
    /// </summary>
    public string Filename { get; private set; }

    /// <summary>
    /// This may be null but it will never be empty
    /// </summary>
    public string MediaOverride { get; private set; }
}
This will hopefully match imports of the various supported formats
@import url("test.css")
@import url('test.css')
@import url(test.css)
@import "test.css"
@import 'test.css'
all terminated with either semi-colons or line returns, all with optional media types / media queries, all with variable whitespace between the elements. That is all done in a lot less code than if I were going to try to parse that content myself. Which is nice!
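As a sanity check, here's a cut-down standalone run using just two of the five branches, combined the same way as above (the sample content is my own):

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImportMatchDemo
{
    // Two of the five branches from the combined pattern: the unquoted
    // url(..) form and the bare single-quoted form
    public static readonly Regex Matcher = new Regex(
        "@import\\s+url\\((?<filename>.*?)\\)\\s*(?<media>.*?)\\s*(?:;|\r|\n)" +
        "|" +
        "@import\\s+'(?<filename>.*?)'\\s*(?<media>.*?)\\s*(?:;|\r|\n)",
        RegexOptions.IgnoreCase);

    static void Main()
    {
        // Trailing "\n" appended for the same reason as in GetImports - the
        // second import has no semi-colon of its own
        var content = "@import url(reset.css) screen;\n@import 'print.css' print";
        foreach (Match match in Matcher.Matches(content + "\n"))
        {
            Console.WriteLine(
                "{0} [{1}]",
                match.Groups["filename"].Value,
                match.Groups["media"].Value);
        }
        // → reset.css [screen]
        // → print.css [print]
    }
}
```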
I think this little foray has been a success! But now that I've got the syntax down (for this case at least), I need to stop being a hypocrite and go off and try to find out how exactly these expressions are processed. As far as I know these might run fine on content up to a certain size and then go batshit crazy on anything bigger! Or they might run like finely honed algorithmic masterpieces on anything thrown at them* - I guess I won't know until I find out more!
* No, I don't believe that either! :)
Posted at 22:30
21 January 2012
I've been experimenting with minifying javascript and stylesheet content on-the-fly with an ASP.Net MVC project where different pages may have different combinations of javascript and stylesheets - not just to try to minimise the quantity of data transmitted but because some of the stylesheets may conflict.
If this requirement was absent and all of the stylesheets or javascript files from a given folder could be included, I'd probably wait until this becomes available (I'm sure I read somewhere that it would be made available for .Net 4.0 as well, though I'm struggling now to find a link to back that up!) -
New Bundling and Minification Support (ASP.NET 4.5 Series)
However, mostly due to this special requirement (and partly because I'll still be learning things even if this doesn't turn out to be as useful as I'd initially hoped :) I've pushed on with investigation.
I'm going to jump straight to the first code I've got in use. There's a controller..
public class CSSController : Controller
{
    public ActionResult Process()
    {
        var filename = Server.MapPath(Request.FilePath);

        DateTime lastModifiedDateOfData;
        try
        {
            var file = new FileInfo(filename);
            if (!file.Exists)
                throw new FileNotFoundException("Requested file does not exist", filename);
            lastModifiedDateOfData = file.LastWriteTime;
        }
        catch (Exception e)
        {
            Response.StatusCode = 500;
            Response.StatusDescription = "Error encountered";
            return Content(
                String.Format(
                    "/* Unable to determine LastModifiedDate for file: {0} [{1}] */",
                    filename,
                    e.Message
                ),
                "text/css"
            );
        }

        var lastModifiedDateFromRequest = TryToGetIfModifiedSinceDateFromRequest();
        if ((lastModifiedDateFromRequest != null) &&
            (Math.Abs(
                lastModifiedDateFromRequest.Value.Subtract(lastModifiedDateOfData).TotalSeconds)
            < 2))
        {
            // Add a small grace period to the comparison (if only because
            // lastModifiedDateOfLiveData is granular to milliseconds while
            // lastModifiedDate only considers seconds and so will nearly
            // always be between zero and one seconds older)
            Response.StatusCode = 304;
            Response.StatusDescription = "Not Modified";
            return Content("", "text/css");
        }

        // Try to retrieve from cache
        var cacheKey = "CSSController-" + filename;
        var cachedData = HttpContext.Cache[cacheKey] as TextFileContents;
        if (cachedData != null)
        {
            // If the cached data is up-to-date then use it..
            if (cachedData.LastModified >= lastModifiedDateOfData)
            {
                SetResponseCacheHeadersForSuccess(lastModifiedDateOfData);
                return Content(cachedData.Content, "text/css");
            }

            // .. otherwise remove it from cache so it can be replaced with current data below
            HttpContext.Cache.Remove(cacheKey);
        }

        try
        {
            var content = MinifyCSS(System.IO.File.ReadAllText(filename));
            SetResponseCacheHeadersForSuccess(lastModifiedDateOfData);

            // Use DateTime.MaxValue for AbsoluteExpiration (since we're considering the
            // file's LastModifiedDate we don't want this cache entry to expire
            // on a separate time based scheme)
            HttpContext.Cache.Add(
                cacheKey,
                new TextFileContents(filename, lastModifiedDateOfData, content),
                null,
                DateTime.MaxValue,
                System.Web.Caching.Cache.NoSlidingExpiration,
                System.Web.Caching.CacheItemPriority.Normal,
                null
            );

            return Content(content, "text/css");
        }
        catch (Exception e)
        {
            Response.StatusCode = 500;
            Response.StatusDescription = "Error encountered";
            return Content("/* Error: " + e.Message + " */", "text/css");
        }
    }
    /// <summary>
    /// Try to get the If-Modified-Since HttpHeader value - if not present or not valid
    /// (ie. not interpretable as a date) then null will be returned
    /// </summary>
    private DateTime? TryToGetIfModifiedSinceDateFromRequest()
    {
        var lastModifiedDateRaw = Request.Headers["If-Modified-Since"];
        if (lastModifiedDateRaw == null)
            return null;

        DateTime lastModifiedDate;
        if (DateTime.TryParse(lastModifiedDateRaw, out lastModifiedDate))
            return lastModifiedDate;

        return null;
    }

    /// <summary>
    /// Mark the response as being cacheable and implement content-encoding requests such
    /// that gzip is used if supported by requester
    /// </summary>
    private void SetResponseCacheHeadersForSuccess(DateTime lastModifiedDateOfLiveData)
    {
        // Mark the response as cacheable
        // - Specify a "Vary: Accept-Encoding" header to ensure that if cached by proxies
        //   different versions are stored for different encodings (eg. gzip'd vs
        //   non-gzip'd)
        Response.Cache.SetCacheability(System.Web.HttpCacheability.Public);
        Response.Cache.SetLastModified(lastModifiedDateOfLiveData);
        Response.AppendHeader("Vary", "Accept-Encoding");

        // Handle requested content-encoding method
        var encodingsAccepted = (Request.Headers["Accept-Encoding"] ?? "")
            .Split(',')
            .Select(e => e.Trim().ToLower())
            .ToArray();
        if (encodingsAccepted.Contains("gzip"))
        {
            Response.AppendHeader("Content-Encoding", "gzip");
            Response.Filter = new GZipStream(Response.Filter, CompressionMode.Compress);
        }
        else if (encodingsAccepted.Contains("deflate"))
        {
            Response.AppendHeader("Content-Encoding", "deflate");
            Response.Filter = new DeflateStream(Response.Filter, CompressionMode.Compress);
        }
    }
    /// <summary>
    /// Represent a last-modified-date-marked text file we can store in cache
    /// </summary>
    [Serializable]
    private class TextFileContents
    {
        public TextFileContents(string filename, DateTime lastModified, string content)
        {
            if (string.IsNullOrWhiteSpace(filename))
                throw new ArgumentException("Null/blank filename specified");
            if (content == null)
                throw new ArgumentNullException("content");

            Filename = filename.Trim();
            LastModified = lastModified;
            Content = content.Trim();
        }

        /// <summary>
        /// This will never be null or empty
        /// </summary>
        public string Filename { get; private set; }

        public DateTime LastModified { get; private set; }

        /// <summary>
        /// This will never be null but it may be empty if the source file had no content
        /// </summary>
        public string Content { get; private set; }
    }
    /// <summary>
    /// Simple method to minify CSS content using a few regular expressions
    /// </summary>
    private string MinifyCSS(string content)
    {
        if (content == null)
            throw new ArgumentNullException("content");

        content = content.Trim();
        if (content == "")
            return "";

        content = HashSurroundingWhitespaceRemover.Replace(content, "#");
        content = ExtraneousWhitespaceRemover.Replace(content, "");
        content = DuplicateWhitespaceRemover.Replace(content, " ");
        content = DelimiterWhitespaceRemover.Replace(content, "$1");
        content = content.Replace(";}", "}");
        content = UnitWhitespaceRemover.Replace(content, "$1");
        return CommentRemover.Replace(content, "");
    }

    // Courtesy of http://madskristensen.net/post/Efficient-stylesheet-minification-in-C.aspx
    private static readonly Regex HashSurroundingWhitespaceRemover
        = new Regex(@"[a-zA-Z]+#", RegexOptions.Compiled);
    private static readonly Regex ExtraneousWhitespaceRemover
        = new Regex(@"[\n\r]+\s*", RegexOptions.Compiled);
    private static readonly Regex DuplicateWhitespaceRemover
        = new Regex(@"\s+", RegexOptions.Compiled);
    private static readonly Regex DelimiterWhitespaceRemover
        = new Regex(@"\s?([:,;{}])\s?", RegexOptions.Compiled);
    private static readonly Regex UnitWhitespaceRemover
        = new Regex(@"([\s:]0)(px|pt|%|em)", RegexOptions.Compiled);
    private static readonly Regex CommentRemover
        = new Regex(@"/\*[\d\D]*?\*/", RegexOptions.Compiled);
}
.. and some route configuration:
// Have to set this to true so that stylesheets (for example) get processed rather than
// returned direct
routes.RouteExistingFiles = true;
routes.MapRoute(
    "StandardStylesheets",
    "{*allwithextension}",
    new { controller = "CSS", action = "Process" },
    new { allwithextension = @".*\.css(/.*)?" }
);
I've used a very straightforward minification approach that I borrowed from this fella -
Efficient stylesheet minification in C#
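To illustrate what those replacement passes actually do, here's the same pipeline inlined into a self-contained sketch of my own (the sample CSS is made up for the demonstration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class MinifyDemo
{
    // The same replacement pipeline as MinifyCSS above, inlined so it can be
    // run in isolation
    public static string Minify(string content)
    {
        content = content.Trim();
        content = Regex.Replace(content, @"[a-zA-Z]+#", "#");          // whitespace around hashes
        content = Regex.Replace(content, @"[\n\r]+\s*", "");           // line returns (and trailing indentation)
        content = Regex.Replace(content, @"\s+", " ");                 // runs of whitespace
        content = Regex.Replace(content, @"\s?([:,;{}])\s?", "$1");    // whitespace around delimiters
        content = content.Replace(";}", "}");                          // redundant final semi-colons
        content = Regex.Replace(content, @"([\s:]0)(px|pt|%|em)", "$1"); // units on zero values
        return Regex.Replace(content, @"/\*[\d\D]*?\*/", "");          // comments
    }

    static void Main()
    {
        Console.WriteLine(MinifyDemo.Minify("body {\n  color: red;\n}\n/* a comment */"));
        // → body{color:red}
    }
}
```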
The minified content is cached along with the last-modified-date of the file so that the http headers can be used to prevent unnecessary work (and bandwidth) by returning a 304 ("Not Modified") response (which doesn't require content). When a browser requests a "hard refresh" it will leave this header out of the request and so will get fresh content.
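As a small aside on why that grace period in the comparison is needed: HTTP dates are RFC1123-formatted (the "R" specifier in .NET) and are only granular to whole seconds, so a round-tripped If-Modified-Since value will almost always be slightly older than the file's actual last-write time. A quick illustration of mine:

```csharp
using System;
using System.Globalization;

public static class HttpDateDemo
{
    public static string ToHttpDate(DateTime lastModified)
    {
        // HTTP dates use the RFC1123 format, exposed in .NET as "R" - note
        // that any sub-second component is simply dropped
        return lastModified.ToString("R", CultureInfo.InvariantCulture);
    }

    static void Main()
    {
        // A file date with a 123ms component - the header loses it, which is
        // why the controller compares with a grace period rather than testing
        // strict equality
        var lastModified = new DateTime(2012, 1, 19, 23, 3, 37, 123, DateTimeKind.Utc);
        Console.WriteLine(HttpDateDemo.ToHttpDate(lastModified));
        // → Thu, 19 Jan 2012 23:03:37 GMT
    }
}
```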
So far there have been no real surprises, but I came across a problem for which I'm still not completely sure where to point the blame. When hosted in IIS (but not the "Visual Studio Development [Web] Server" or IIS Express), responses to "hard refresh" requests would return the minified content in a form that appeared corrupted. Fiddler would pop up a "The content could not be decompressed. The magic number in GZip header is not correct. Make sure you are passing in a GZIP stream" message. If the css file was entered into the url bar in Firefox, it would display "Content Encoding Error".
For successful requests (for example, where the cache is either empty or the file has been modified since the cache entry was recorded), the request and response headers would be of the form:
GET http://www.productiverage.com/Content/Default.css HTTP/1.1
Host: www.productiverage.com
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
HTTP/1.1 200 OK
Cache-Control: public
Content-Type: text/css; charset=utf-8
Last-Modified: Thu, 19 Jan 2012 23:03:37 GMT
Vary: Accept-Encoding
Server: Microsoft-IIS/7.0
X-AspNetMvc-Version: 3.0
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Thu, 19 Jan 2012 23:08:55 GMT
Content-Length: 4344
html{background:url("/Content/Images/Background-Repeat.jpg") repeat-x #800C0E}body,td{ ...
while the failing requests would be such:
GET http://www.productiverage.com/Content/Default.css HTTP/1.1
Host: www.productiverage.com
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache
HTTP/1.1 200 OK
Cache-Control: public
Content-Type: text/css; charset=utf-8
Content-Encoding: gzip
Last-Modified: Thu, 19 Jan 2012 23:03:37 GMT
Vary: Accept-Encoding
Server: Microsoft-IIS/7.0
X-AspNetMvc-Version: 3.0
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Thu, 19 Jan 2012 23:07:52 GMT
Content-Length: 4344
html{background:url("/Content/Images/Background-Repeat.jpg") repeat-x #800C0E}body,td{ ...
The only differences in the request are the cache-disabling "Pragma" and "Cache-Control" headers, but in the failing response a "Content-Encoding: gzip" header has been added while the content itself is in its raw form - ie. not gzip'd.
That explains the gzip error - the content is being reported as compressed when in actual fact it isn't!
I presume that the compression settings in IIS are somehow interfering here but unfortunately I've not been able to definitively find the cause or if I should do anything in configuration. My Google-fu is failing me today :(
However, the solution in the above code is to handle the response compression in the CSSController. In the SetResponseCacheHeadersForSuccess method the "Accept-Encoding" request header is tested for gzip and deflate and will return content accordingly by setting the Response.Filter to be either a GZipStream or DeflateStream. This has solved the problem! And so I'm going to leave my root-cause investigation for another day :)
Note: You can find the source code to this in one of my repositories at Bitbucket: The CSS Minifier.
Posted at 16:56
Dan is a big geek who likes making stuff with computers! He can be quite outspoken so clearly needs a blog :)
In the last few minutes he seems to have taken to referring to himself in the third person. He's quite enjoying it.