24 June 2012
In the last post (Optimising the Plurality-Handling Normaliser) I got all excited about improving the performance of what is essentially a string comparer for use in a Full Text Indexer I'm playing around with (which is now handling the search facilities on this blog, so it must be at least half way to working! :) but the use of compiled LINQ expressions brought problems of its own when I tried to write a fully-generated index away to a disk cache: the generated lambda expression is not serialisable!
There was something in me that thought that since it had been formed by tying together simple LINQ Expressions it would be easy to serialise and so not be a problem. But I suppose that once it has been compiled into a lambda function all bets are off, since it can hold references to all sorts of things and so mightn't be easily serialisable anymore.
As usual there's a Stack Overflow post showing I'm hardly the first person to have encountered this issue, and this particular one even has the authority of Eric Lippert getting involved! :) It's interesting to see him make the point that this was work that was required for all of the "LINQ-to-whatever" integrations.
Stack Overflow: How can I pass a lambda expression to a WCF Service?
I essentially just want to write to disk a custom string-keyed dictionary object with a particular key comparer, to act as another level of caching for when it drops out of memory, so I didn't have any issues as complicated as passing expressions to a query service. I went for a relatively simple approach: I record as class members all of the data that is required to generate the LINQ Expressions, so that I can implement ISerializable and write away just that data when an instance is serialised. Then when it's de-serialised I use this data to regenerate the lambdas.
// This de-serialising constructor takes the values that are stored in the GetObjectData
// method and passes them through to the standard public constructor
protected EnglishPluralityStringNormaliser(SerializationInfo info, StreamingContext context)
    : this(
        (IEnumerable<PluralEntry>)info.GetValue(
            "_plurals",
            typeof(IEnumerable<PluralEntry>)
        ),
        (IEnumerable<string>)info.GetValue(
            "_fallbackSuffixes",
            typeof(IEnumerable<string>)
        ),
        (IStringNormaliser)info.GetValue(
            "_optionalPreNormaliser",
            typeof(IStringNormaliser)
        ),
        (PreNormaliserWorkOptions)info.GetValue(
            "_preNormaliserWork",
            typeof(PreNormaliserWorkOptions)
        )
    ) { }

public void GetObjectData(SerializationInfo info, StreamingContext context)
{
    // Unfortunately we can't serialise the generated normaliser (we'll get a "Cannot
    // serialize delegates over unmanaged function pointers, dynamic methods or methods
    // outside the delegate creator's assembly" error) so if we have to serialise this
    // instance we'll store all of the data and then re-generate the normaliser on
    // de-serialisation. Not ideal from a performance point of view but at least
    // it will work.
    info.AddValue("_plurals", _plurals);
    info.AddValue("_fallbackSuffixes", _fallbackSuffixes);
    info.AddValue("_optionalPreNormaliser", _optionalPreNormaliser);
    info.AddValue("_preNormaliserWork", _preNormaliserWork);
}
Then I tried integrating the project as a search facility into my blog, which is running ASP.Net MVC 3 (.Net 4.0), and ran into another snag: "Inheritance security rules violated while overriding member: 'MyBusinessException.GetObjectData(System.Runtime.Serialization.SerializationInfo, System.Runtime.Serialization.StreamingContext)'. Security accessibility of the overriding method must match the security accessibility of the method being overridden."
Hmmm...
Stack Overflow to the rescue again! Inheritance security rules violated with overriding member. Reading the post led me to click "Go To Definition" on ISerializable and observe that the "SecurityCritical" attribute was marked on the GetObjectData method - and from what I read, I should be able to fix this by marking that same attribute on my GetObjectData method. Sortio!
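The attempted fix was nothing more involved than decorating the method (a minimal sketch of what I tried):

[SecurityCritical]
public void GetObjectData(SerializationInfo info, StreamingContext context)
{
    // .. same serialisation code as shown earlier
}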
Not sortio.. :( And now I must admit to being a bit lazy in my eagerness to get the search functionality integrated on the site. One of the Stack Overflow answers was to specify "Full Trust" for the web application but I got the impression that this was cheating a bit and bypassing some of the new .Net 4.0 security mechanisms. However, for now I've gone with it by adding this to the web.config (as per one of the posted answers):
<system.web>
    <trust level="Full" />
</system.web>
and now it is working! Still, something to look into further another day I think.
The project I've been talking about is publicly accessible at BitBucket but I've yet to sort out a decent Readme for it, and I'm hoping to write some posts about its development, its use so far and a range of examples - watch this space! Full Text Indexer BitBucket repo
Update (17th December 2012): This has been included as part of a later Full Text Indexer Round-up Post that brings together several Posts into one series, incorporating code and techniques from each of them.
Posted at 16:01
16 June 2012
A slight amendment to a previous post (An English-language Plurality-handling String Normaliser); the original intention of IStringNormaliser implementations was to expose IEqualityComparer<string> and return a value from GetNormalisedString that could be maintained in a dictionary (specifically a Ternary Search Tree). And at this the English Language Plurality-handling String Normaliser has operated splendidly! Does the string "cat" match the string "cats" when searching? Why, yes it does! (And things like this are important to know since the internet is made of cats).
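Since IStringNormaliser implements IEqualityComparer&lt;string&gt;, plugging it in is trivial - eg. (a quick illustration using a plain Dictionary rather than the Ternary Search Tree, but the principle is the same):

// "cat" and "cats" are both normalised to the same key, so the lookup succeeds
var data = new Dictionary<string, int>(new EnglishPluralityStringNormaliser());
data.Add("cat", 1);
var matched = data.ContainsKey("cats"); // True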
However.. since the normalised values are maintained as strings in a dictionary, they may be returned as normalised values and then compared to a string that would be normalised to that value; so "cat" may be transformed through GetNormalisedString and that result then compared again to "cat" - and this is where things go awry. "cat" and "cats" are found to match as they are both normalised to "cat|s|es|ses" (as covered in the earlier post I linked to), but the problem is that "cat|s|es|ses" does not then get matched to "cat" (or "cats" for that matter), as it incorrectly gets transformed again into "cat|s|es|se|s|es|ses" - the final "s" gets matched as a potential suffix and the value gets altered.
Thankfully, the fix is none too difficult; before trying to perform transformations based upon value endings, we need to check whether a suffix group has already been appended to the value. So before checking whether a value ends with "s", "es" or "ses" we need to check whether it ends with "|s|es|ses" - and, if so, return it as-is since it has already been transformed.
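To put that in code (the normalised values here are as quoted above, assuming the default plural set):

var normaliser = new EnglishPluralityStringNormaliser();
var key = normaliser.GetNormalisedString("cats"); // "cat|s|es|ses"

// Before the fix, normalising an already-normalised value mangled it:
//   normaliser.GetNormalisedString(key) == "cat|s|es|se|s|es|ses"
// With the fix the transformation is idempotent:
var keyAgain = normaliser.GetNormalisedString(key); // "cat|s|es|ses"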
The method that requires changing is that below:
private static Func<string, string> GenerateNormaliser(
    IEnumerable<PluralEntry> plurals,
    IEnumerable<string> fallbackSuffixes)
{
    if (plurals == null)
        throw new ArgumentNullException("plurals");
    if (fallbackSuffixes == null)
        throw new ArgumentNullException("fallbackSuffixes");

    // Build up if statements for each suffix - if a match is found, return the input value
    // with the matched suffix replaced with a combination of all the other suffixes in
    // the PluralEntry
    var result = Expression.Parameter(typeof(string), "result");
    var endLabel = Expression.Label(typeof(string));
    var valueTrimmed = Expression.Parameter(typeof(string), "valueTrimmed");
    var expressions = new List<Expression>();
    var pluralsTidied = new List<PluralEntry>();
    foreach (var plural in plurals)
    {
        if (plural == null)
            throw new ArgumentException("Null reference encountered in plurals set");
        pluralsTidied.Add(plural);
    }
    foreach (var plural in pluralsTidied)
    {
        // Before checking for suffix matches we need to check whether the input string
        // is a value that has already been through the normalisation process! eg. "category"
        // and "categories" will both be transformed into the value "categor|y|ies", but if
        // that value is passed in again it should leave as "categor|y|ies" and not have
        // any further attempts at normalisation applied to it.
        expressions.Add(
            Expression.IfThen(
                GeneratePredicate(
                    CreateSuffixExtension(plural.Values),
                    valueTrimmed,
                    plural.MatchType
                ),
                Expression.Block(
                    Expression.Assign(
                        result,
                        valueTrimmed
                    ),
                    Expression.Return(endLabel, result)
                )
            )
        );
    }
    foreach (var plural in pluralsTidied)
    {
        foreach (var suffix in plural.Values)
        {
            var assignNormalisedValueToResult = Expression.Assign(
                result,
                GenerateStringConcatExpression(
                    GenerateRemoveLastCharactersExpression(valueTrimmed, suffix.Length),
                    Expression.Constant(
                        CreateSuffixExtension(plural.Values),
                        typeof(string)
                    )
                )
            );
            expressions.Add(
                Expression.IfThen(
                    GeneratePredicate(suffix, valueTrimmed, plural.MatchType),
                    Expression.Block(
                        assignNormalisedValueToResult,
                        Expression.Return(endLabel, result)
                    )
                )
            );
        }
    }

    // If any fallback suffixes are specified, add a statement to append them if none of the
    // PluralEntry matches are made
    fallbackSuffixes = TidyStringList(fallbackSuffixes, v => v.Trim().ToLower());
    if (fallbackSuffixes.Any())
    {
        expressions.Add(
            Expression.Assign(
                result,
                GenerateStringConcatExpression(
                    valueTrimmed,
                    Expression.Constant(
                        CreateSuffixExtension(fallbackSuffixes),
                        typeof(string)
                    )
                )
            )
        );
    }
    else
        expressions.Add(Expression.Assign(result, valueTrimmed));

    // Add the return-point label, configured to return the string value in "result"
    expressions.Add(Expression.Label(endLabel, result));

    return Expression.Lambda<Func<string, string>>(
        Expression.Block(
            new[] { result },
            expressions
        ),
        valueTrimmed
    ).Compile();
}
In the places I'm using this the plurality-handling normaliser wraps another normaliser that trims the string, lower-cases it, removes any punctuation and replaces any non-latin characters with latin equivalents. This is no problem. But if a normaliser was wrapped that removed any non-alpha characters completely then the method above wouldn't be able to match the transformed "|s|es|ses" ending as the pipe characters would have been removed. So long as this situation is avoided then everything will work lovely, but it's worth bearing in mind!
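To make that concrete, a pre-normaliser along these lines (entirely hypothetical, not part of the project) would be a poor candidate for wrapping:

// A hypothetical pre-normaliser that strips all non-alpha characters - wrapping this
// would be a mistake since "cat|s|es|ses" would come back as "catsesses" and the
// already-transformed ending could never be recognised
[Serializable]
public sealed class AlphaOnlyNormaliser : IStringNormaliser
{
    public string GetNormalisedString(string value)
    {
        if (value == null)
            throw new ArgumentNullException("value");
        return new string(value.Where(char.IsLetter).ToArray()).ToLower();
    }
    public bool Equals(string x, string y)
    {
        return GetNormalisedString(x) == GetNormalisedString(y);
    }
    public int GetHashCode(string obj)
    {
        return GetNormalisedString(obj).GetHashCode();
    }
}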
Update (17th December 2012): This has been included as part of a later Full Text Indexer Round-up Post that brings together several Posts into one series, incorporating code and techniques from each of them.
Posted at 17:53
9 June 2012
In the last post I took a first stab at an English-language Plurality-handling String Normaliser to work with a Full Text Indexer research project I'm playing with. I've been using it in conjunction with a normaliser that strips out punctuation from words, replaces accented characters with latin versions (eg. é replaced with e) and lower-cases the content. This originally used a few regular expressions along with some character replacement but I wasn't delighted with the performance, so it was re-written to do a single pass through the characters and handle it all at once, and is now running much better. Whether that ties in well with my I don't like regular expressions ramble or not, I'm not sure :)
The point at which it does the most work is here:
public bool TryToTransform(string value, out string valueTransformed)
{
    if (value == null)
        throw new ArgumentNullException("value");

    if (_matchType == MatchTypeOptions.SuffixOnly)
    {
        var matchedSuffix = _values.FirstOrDefault(
            suffix => (value.Length > suffix.Length) && value.EndsWith(suffix)
        );
        if (matchedSuffix != null)
        {
            valueTransformed =
                value.Substring(0, value.Length - matchedSuffix.Length) + _combinedValues;
            return true;
        }
    }
    else
    {
        if (_values.Contains(value))
        {
            valueTransformed = _combinedValues;
            return true;
        }
    }

    valueTransformed = null;
    return false;
}
The first approach I took was to take out some of the LINQ - it makes the code easy to read but I don't actually know for sure what it's doing! I thought it was worth checking whether being more explicit about what I want to do would reap any performance benefit. So the lengths are explicitly checked first (if matching WholeWord then the value length must match the suffix length; if matching SuffixOnly then the value length must be greater than the suffix length) and then we count back through the last characters of the input value and ensure that each one matches the suffix. Not quite as readable, but no big deal.
public bool TryToTransform(string value, out string valueTransformed)
{
    if (value == null)
        throw new ArgumentNullException("value");

    foreach (var suffix in _values)
    {
        if (_matchType == MatchTypeOptions.WholeWord)
        {
            if (value.Length != suffix.Length)
                continue;
        }
        else if (!(value.Length > suffix.Length))
            continue;

        var matchedSuffixLength = 0;
        for (var index = 0; index < suffix.Length; index++)
        {
            if (value[value.Length - (index + 1)] != suffix[suffix.Length - (index + 1)])
            {
                matchedSuffixLength = 0;
                break;
            }
            matchedSuffixLength++;
        }
        if (matchedSuffixLength == 0)
            continue;

        valueTransformed =
            value.Substring(0, value.Length - matchedSuffixLength) + _combinedValues;
        return true;
    }

    valueTransformed = null;
    return false;
}
Running a few loops of the All English Words data I mentioned in the last post saw a performance improvement (when run in release mode) of over 3x - success!
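(For what it's worth, the timing was nothing fancier than something along these lines - the file name, pass count and the construction of the pluralEntries set are illustrative rather than exactly what I ran:)

// Rough timing harness; "pluralEntries" is a set of the PluralEntry instances whose
// TryToTransform method is being exercised
var words = File.ReadAllLines("EnglishWords.txt");
var stopwatch = Stopwatch.StartNew();
for (var pass = 0; pass < 10; pass++)
{
    foreach (var word in words)
    {
        var valueTrimmed = word.Trim().ToLower();
        foreach (var pluralEntry in pluralEntries)
        {
            string valueTransformed;
            if (pluralEntry.TryToTransform(valueTrimmed, out valueTransformed))
                break;
        }
    }
}
stopwatch.Stop();
Console.WriteLine("Time taken: " + stopwatch.ElapsedMilliseconds + "ms");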
But with what I've been learning about LINQ Expressions over the last year or so (culminating in The artist previously known as the AutoMapper-By-Constructor and The CompilableTypeConverter BitBucket repository) I couldn't help wondering if writing code that would generate expressions that unrolled the comparison loop and pre-generated the combined suffix extensions might not be faster. The only way to find out is to try!
The idea is that it would effectively generate code along the lines of:
if ((value.length > 1)
 && (value[value.length - 1] == 'y'))
    return value.substring(0, value.length - 1) + "[y][ies]";

if ((value.length > 3)
 && (value[value.length - 3] == 'i')
 && (value[value.length - 2] == 'e')
 && (value[value.length - 1] == 's'))
    return value.substring(0, value.length - 3) + "[y][ies]";
for all of the various plurality suffixes, while still maintaining the ability to easily define new suffix sets. And so, without further ado, I ended up with this:
/// <summary>
/// This will match common strings where one is the plural and the other the singular version
/// of the same word. It is not intended to be perfect and may match a few false positives, but
/// it should catch most of the most common cases.
/// </summary>
[Serializable]
public class EnglishPluralityStringNormaliser : IStringNormaliser
{
    private Func<string, string> _normaliser;
    private IStringNormaliser _optionalPreNormaliser;
    private PreNormaliserWorkOptions _preNormaliserWork;

    public EnglishPluralityStringNormaliser(
        IEnumerable<PluralEntry> plurals,
        IEnumerable<string> fallbackSuffixes,
        IStringNormaliser optionalPreNormaliser,
        PreNormaliserWorkOptions preNormaliserWork)
    {
        if (plurals == null)
            throw new ArgumentNullException("plurals");
        if (fallbackSuffixes == null)
            throw new ArgumentNullException("fallbackSuffixes");
        var allPreNormaliserOptions = (PreNormaliserWorkOptions)0;
        foreach (PreNormaliserWorkOptions option in
            Enum.GetValues(typeof(PreNormaliserWorkOptions)))
        {
            allPreNormaliserOptions = allPreNormaliserOptions | option;
        }
        if ((preNormaliserWork & allPreNormaliserOptions) != preNormaliserWork)
            throw new ArgumentOutOfRangeException("preNormaliserWork");

        _normaliser = GenerateNormaliser(plurals, fallbackSuffixes);
        _optionalPreNormaliser = optionalPreNormaliser;
        _preNormaliserWork = preNormaliserWork;
    }

    public EnglishPluralityStringNormaliser(
        IStringNormaliser optionalPreNormaliser,
        PreNormaliserWorkOptions preNormaliserWork
    ) : this(DefaultPlurals, DefaultFallback, optionalPreNormaliser, preNormaliserWork) { }

    public EnglishPluralityStringNormaliser()
        : this(null, PreNormaliserWorkOptions.PreNormaliserDoesNothing) { }

    public string GetNormalisedString(string value)
    {
        if (value == null)
            throw new ArgumentNullException("value");

        // If an additional normaliser was specified in the constructor then process the
        // string with that first (eg. a normaliser that removes punctuation from values
        // may be beneficial depending upon the content that may be passed in)
        if (_optionalPreNormaliser != null)
            value = _optionalPreNormaliser.GetNormalisedString(value);

        if ((_preNormaliserWork & PreNormaliserWorkOptions.PreNormaliserTrims)
                != PreNormaliserWorkOptions.PreNormaliserTrims)
            value = value.Trim();
        if (value == "")
            return "";

        // We have to lower case the trimmed value since the suffixes are all stored as
        // lower case values
        if ((_preNormaliserWork & PreNormaliserWorkOptions.PreNormaliserLowerCases)
                != PreNormaliserWorkOptions.PreNormaliserLowerCases)
            value = value.ToLower();
        return _normaliser(value);
    }

    public bool Equals(string x, string y)
    {
        if (x == null)
            throw new ArgumentNullException("x");
        if (y == null)
            throw new ArgumentNullException("y");
        return GetNormalisedString(x) == GetNormalisedString(y);
    }

    public int GetHashCode(string obj)
    {
        if (obj == null)
            throw new ArgumentNullException("obj");
        return GetNormalisedString(obj).GetHashCode();
    }

    private static Func<string, string> GenerateNormaliser(
        IEnumerable<PluralEntry> plurals,
        IEnumerable<string> fallbackSuffixes)
    {
        if (plurals == null)
            throw new ArgumentNullException("plurals");
        if (fallbackSuffixes == null)
            throw new ArgumentNullException("fallbackSuffixes");

        // Build up if statements for each suffix - if a match is found, return the input
        // value with the matched suffix replaced with a combination of all the other
        // suffixes in the PluralEntry
        var result = Expression.Parameter(typeof(string), "result");
        var endLabel = Expression.Label(typeof(string));
        var valueTrimmed = Expression.Parameter(typeof(string), "valueTrimmed");
        var expressions = new List<Expression>();
        foreach (var plural in plurals)
        {
            if (plural == null)
                throw new ArgumentException("Null reference encountered in plurals set");

            foreach (var suffix in plural.Values)
            {
                var assignNormalisedValueToResult = Expression.Assign(
                    result,
                    GenerateStringConcatExpression(
                        GenerateRemoveLastCharactersExpression(valueTrimmed, suffix.Length),
                        Expression.Constant(
                            CreateSuffixExtension(plural.Values),
                            typeof(string)
                        )
                    )
                );
                expressions.Add(
                    Expression.IfThen(
                        GeneratePredicate(suffix, valueTrimmed, plural.MatchType),
                        Expression.Block(
                            assignNormalisedValueToResult,
                            Expression.Return(endLabel, result)
                        )
                    )
                );
            }
        }

        // If any fallback suffixes are specified, add a statement to append them if none
        // of the PluralEntry matches are made
        fallbackSuffixes = TidyStringList(fallbackSuffixes, v => v.Trim().ToLower());
        if (fallbackSuffixes.Any())
        {
            expressions.Add(
                Expression.Assign(
                    result,
                    GenerateStringConcatExpression(
                        valueTrimmed,
                        Expression.Constant(
                            CreateSuffixExtension(fallbackSuffixes),
                            typeof(string)
                        )
                    )
                )
            );
        }
        else
            expressions.Add(Expression.Assign(result, valueTrimmed));

        // Add the return-point label, configured to return the string value in "result"
        expressions.Add(Expression.Label(endLabel, result));

        return Expression.Lambda<Func<string, string>>(
            Expression.Block(
                new[] { result },
                expressions
            ),
            valueTrimmed
        ).Compile();
    }

    /// <summary>
    /// Generate an expression that determines whether a string parameter matches a specified
    /// suffix / matchType combination
    /// </summary>
    private static Expression GeneratePredicate(
        string suffix,
        ParameterExpression valueTrimmed,
        MatchTypeOptions matchType)
    {
        if (string.IsNullOrWhiteSpace(suffix))
            throw new ArgumentException("Null/blank suffix specified");
        if (valueTrimmed == null)
            throw new ArgumentNullException("valueTrimmed");
        if (!Enum.IsDefined(typeof(MatchTypeOptions), matchType))
            throw new ArgumentOutOfRangeException("matchType");

        suffix = suffix.Trim();

        var conditionElements = new List<Expression>();
        var lengthProperty = typeof(string).GetProperty("Length");
        var indexedProperty = typeof(string).GetProperties().First(
            p => (p.GetIndexParameters() ?? new ParameterInfo[0]).Any()
        );
        if (matchType == MatchTypeOptions.SuffixOnly)
        {
            conditionElements.Add(
                Expression.GreaterThan(
                    Expression.Property(valueTrimmed, lengthProperty),
                    Expression.Constant(suffix.Length, typeof(int))
                )
            );
        }
        else
        {
            conditionElements.Add(
                Expression.Equal(
                    Expression.Property(valueTrimmed, lengthProperty),
                    Expression.Constant(suffix.Length, typeof(int))
                )
            );
        }
        for (var index = 0; index < suffix.Length; index++)
        {
            conditionElements.Add(
                Expression.Equal(
                    Expression.Constant(suffix[index], typeof(char)),
                    Expression.Property(
                        valueTrimmed,
                        indexedProperty,
                        Expression.Subtract(
                            Expression.Property(valueTrimmed, lengthProperty),
                            Expression.Constant(suffix.Length - index, typeof(int))
                        )
                    )
                )
            );
        }
        return CombineExpressionsWithAndAlso(conditionElements);
    }

    private static Expression CombineExpressionsWithAndAlso(
        IEnumerable<Expression> expressions)
    {
        if (expressions == null)
            throw new ArgumentNullException("expressions");

        var expressionsTidied = new List<Expression>();
        foreach (var expression in expressions)
        {
            if (expression == null)
                throw new ArgumentException("Null reference encountered in expressions set");
            expressionsTidied.Add(expression);
        }
        if (!expressionsTidied.Any())
            throw new Exception("No entries in expressions set");
        else if (expressionsTidied.Count == 1)
            return expressionsTidied[0];

        var reducedExpressions = new List<Expression>();
        for (var index = 0; index < expressionsTidied.Count; index += 2)
        {
            var expression = expressionsTidied[index];
            if (index < (expressionsTidied.Count - 1))
            {
                var expressionNext = expressionsTidied[index + 1];
                reducedExpressions.Add(Expression.AndAlso(expression, expressionNext));
            }
            else
                reducedExpressions.Add(expression);
        }
        return (reducedExpressions.Count == 1)
            ? reducedExpressions[0]
            : CombineExpressionsWithAndAlso(reducedExpressions);
    }

    /// <summary>
    /// The value Expression must represent a non-null string that is at least as long as
    /// the specified length or an exception will be thrown upon execution
    /// </summary>
    private static Expression GenerateRemoveLastCharactersExpression(
        Expression value,
        int length)
    {
        if (value == null)
            throw new ArgumentNullException("value");
        if (length < 0)
            throw new ArgumentOutOfRangeException("length");
        return Expression.Call(
            value,
            typeof(string).GetMethod("Substring", new[] { typeof(int), typeof(int) }),
            Expression.Constant(0),
            Expression.Subtract(
                Expression.Property(value, typeof(string).GetProperty("Length")),
                Expression.Constant(length, typeof(int))
            )
        );
    }

    /// <summary>
    /// The values Expressions must represent strings otherwise the expression will fail when
    /// executed
    /// </summary>
    private static Expression GenerateStringConcatExpression(params Expression[] values)
    {
        if (values == null)
            throw new ArgumentNullException("values");

        var valuesTidied = values.ToList();
        if (!valuesTidied.Any())
            throw new ArgumentException("No entries in values set");
        if (valuesTidied.Any(v => v == null))
            throw new ArgumentException("Null reference encountered in values set");

        return Expression.Call(
            typeof(string).GetMethod("Concat", new[] { typeof(string[]) }),
            Expression.NewArrayInit(
                typeof(string),
                valuesTidied
            )
        );
    }

    private static string CreateSuffixExtension(IEnumerable<string> suffixes)
    {
        if (suffixes == null)
            throw new ArgumentNullException("suffixes");

        var suffixesTidied = suffixes.ToList();
        if (!suffixesTidied.Any())
            throw new ArgumentException("No entries in suffixes set");
        if (suffixesTidied.Any(s => string.IsNullOrWhiteSpace(s)))
            throw new ArgumentException("Null/blank entry encountered in suffixes set");

        return "|" + string.Join("|", suffixesTidied.Select(s => s.Trim()));
    }

    /// <summary>
    /// Given a set of values, ensure that none are null and return them de-duplicated after
    /// having been pushed through a string manipulation. This will throw an exception for
    /// null arguments or if any null value is encountered in the values set.
    /// </summary>
    private static IEnumerable<string> TidyStringList(
        IEnumerable<string> values,
        Func<string, string> transformer)
    {
        if (values == null)
            throw new ArgumentNullException("values");
        if (transformer == null)
            throw new ArgumentNullException("transformer");

        var valuesTidied = new List<string>();
        foreach (var value in values)
        {
            if (value == null)
                throw new ArgumentException("Null entry encountered in values");

            var valueToStore = transformer(value);
            if (!valuesTidied.Contains(valueToStore))
                valuesTidied.Add(valueToStore);
        }
        return valuesTidied.Distinct();
    }

    public readonly static IEnumerable<string> DefaultFallback = new[] { "ses", "es", "s" };

    public readonly static PluralEntry[] DefaultPlurals = new[]
    {
        // eg. formula / formulae / formulas
        new PluralEntry(new[] { "ula", "ulae", "ulas" }, MatchTypeOptions.SuffixOnly),

        // eg. category / categories
        new PluralEntry(new[] { "y", "ies" }, MatchTypeOptions.SuffixOnly),

        // eg. cactus / cactii
        new PluralEntry(new[] { "us", "ii" }, MatchTypeOptions.SuffixOnly),

        // eg. child / children
        new PluralEntry(new[] { "ld", "ldren" }, MatchTypeOptions.SuffixOnly),

        // eg. medium / media
        new PluralEntry(new[] { "ium", "ia" }, MatchTypeOptions.SuffixOnly),

        // Common special cases that have to come before the "ses", "es", "s" form
        new PluralEntry(new[] { "index", "indexes", "indices" }, MatchTypeOptions.WholeWord),
        new PluralEntry(new[] { "matrix", "matrices" }, MatchTypeOptions.WholeWord),
        new PluralEntry(new[] { "vertex", "vertices" }, MatchTypeOptions.WholeWord),

        // eg. Abacuses, matching "s" here means we must use "ses", "es" AND "s" as fallbacks
        new PluralEntry(new[] { "ses", "es", "s" }, MatchTypeOptions.SuffixOnly),

        // Other common special cases
        new PluralEntry(new[] { "datum", "data" }, MatchTypeOptions.WholeWord),
        new PluralEntry(new[] { "man", "men" }, MatchTypeOptions.WholeWord),
        new PluralEntry(new[] { "woman", "women" }, MatchTypeOptions.WholeWord)
    };

    [Serializable]
    public class PluralEntry
    {
        public PluralEntry(IEnumerable<string> values, MatchTypeOptions matchType)
        {
            if (values == null)
                throw new ArgumentNullException("values");
            if (!Enum.IsDefined(typeof(MatchTypeOptions), matchType))
                throw new ArgumentOutOfRangeException("matchType");

            var valuesTidied = TidyStringList(values, v => v.Trim().ToLower());
            if (!valuesTidied.Any())
                throw new ArgumentException("No entries in values set");

            Values = valuesTidied.Distinct().ToList().AsReadOnly();
            MatchType = matchType;
        }

        /// <summary>
        /// This will never be null or an empty set, nor will it contain any null, empty or
        /// duplicate values (all values are lower-cased and trimmed)
        /// </summary>
        public ReadOnlyCollection<string> Values { get; private set; }

        public MatchTypeOptions MatchType { get; private set; }
    }

    [Serializable]
    public enum MatchTypeOptions
    {
        SuffixOnly,
        WholeWord
    }
}
It still takes me a fair while to craft the LINQ Expression-generating code but I do think that, once written, the resulting code is actually fairly easy to follow. For each suffix in a PluralEntry (where the PluralEntry might describe the group "y", "ies" as a SuffixOnly extension - as clearly seen in the last post) a predicate is generated with LINQ Expressions that compares the input string's length and each of the characters that could correlate with the suffix string - very similar to the inside of the loop in the first optimisation at the top of this post. An IfThen Expression will consider this predicate and - if matched - generate a result that removes the suffix from the input value and appends a combined string consisting of the suffix values in the group, before jumping to the end of the Expression block (effectively "returning" out of the block); again, just like the setting of the valueTransformed string in the earlier code. If none of the suffix groups are found to match then it will append a default set of suffixes, so that "cat" is transformed to "cat|s|es|ses" in order to match "cats" (which would also be transformed to "cat|s|es|ses"), for example.
There are a couple of oddities in the code. I struggled for a while to find a nice way to access characters by index in a string, since the ArrayAccess Expression can't be used - a string isn't technically an array of characters. You have to use reflection to get hold of the indexed property of the string type instead; there's only one, so that must be the property we want to access! When comparing the string length and the individual characters, the Expressions are combined with the AndAlso Expression as this ensures that short-circuiting of the conditions is utilised - as soon as one condition is not met, any further ones are ignored.
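Pulling that indexer trick out on its own (this just mirrors what GeneratePredicate above is doing), given the valueTrimmed ParameterExpression:

// A string isn't an array of chars as far as Expression.ArrayAccess is concerned, so
// the indexed property ("Chars") has to be located through reflection instead
var indexedProperty = typeof(string).GetProperties().First(
    p => p.GetIndexParameters().Any()
);

// eg. an Expression representing valueTrimmed[valueTrimmed.Length - 1]
var lastChar = Expression.Property(
    valueTrimmed,
    indexedProperty,
    Expression.Subtract(
        Expression.Property(valueTrimmed, typeof(string).GetProperty("Length")),
        Expression.Constant(1, typeof(int))
    )
);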
This brought on another performance improvement of over 3x - success again!
There are a couple more minor optimisations in the new code that were made with knowledge of how I intended to integrate it. I was intending to use the DefaultStringNormaliser mentioned earlier, which trims, lower-cases, removes punctuation and replaces non-latin characters. This can be passed in as the optionalPreNormaliser constructor parameter and will process the string before the plurality normalisation is handled. However, if this "pre-normaliser" is going to trim and lower-case the input then there's no need for the EnglishPluralityStringNormaliser to do it as well! So there's a PreNormaliserWorkOptions enum that allows the instantiator to pass in hints as to whether the optionalPreNormaliser (if any) is going to do any of this work.
Sending a few passes of the All English Words data through an EnglishPluralityStringNormaliser that wraps the DefaultStringNormaliser (code below) with PreNormaliserLowerCases and PreNormaliserTrims specified, compared to running it with PreNormaliserDoesNothing (which forces the Trim and ToLower calls to be made by the EnglishPluralityStringNormaliser even though the DefaultStringNormaliser has already done this work), resulted in a performance boost of over 1.4x. Not as dramatic, but definitely not to be sniffed at!
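In other words, the arrangement being timed looks something like this (a sketch - the flag-combining assumes that PreNormaliserWorkOptions is a [Flags] enum, which the combined-options validation in the constructor implies):

// Wrap the DefaultStringNormaliser and hint that it already trims and lower-cases
var normaliser = new EnglishPluralityStringNormaliser(
    new DefaultStringNormaliser(),
    PreNormaliserWorkOptions.PreNormaliserTrims | PreNormaliserWorkOptions.PreNormaliserLowerCases
);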
There's one final tweak to note; I've switched from appending the suffix groups as "[s][es][ses]" to "|s|es|ses" since I intend to store the resulting normalised strings in a Ternary Search Tree (as discussed in The .Net Dictionary is FAST!) and if the string is shorter then fewer comparisons have to be made when matching a string in that structure!
Since I've made reference a few times to the DefaultStringNormaliser which I've made use of, here's the current code:
/// <summary>
/// This will perform string comparisons where the values have any accented characters
/// replaced with non-accented versions, all whitespace converted to spaces and runs of
/// whitespace replaced with a single space, all punctuation removed and the content
/// then lower-cased.
/// </summary>
[Serializable]
public sealed class DefaultStringNormaliser : IStringNormaliser
{
    private readonly static HashSet<char> PunctuationCharacters = new HashSet<char>(
        Enumerable.Range(char.MinValue, char.MaxValue)
            .Select(c => (char)c)
            .Where(c => char.IsPunctuation(c))
    );

    public string GetNormalisedString(string value)
    {
        if (value == null)
            throw new ArgumentNullException("value");

        var normalisedValue = value.Normalize(NormalizationForm.FormKD);
        var content = new char[normalisedValue.Length];
        var contentIndex = 0;
        var contentIndexOfLastNonWhitespace = 0;
        var lastCharWasWhitespace = false;
        var gotContent = false;
        for (var index = 0; index < normalisedValue.Length; index++)
        {
            var currentChar = normalisedValue[index];
            if (PunctuationCharacters.Contains(currentChar))
                continue;
            if ((currentChar == '\r') || (currentChar == '\n') || (currentChar == '\t'))
                currentChar = ' ';
            else
            {
                var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(currentChar);
                if ((unicodeCategory == UnicodeCategory.EnclosingMark)
                        || (unicodeCategory == UnicodeCategory.NonSpacingMark)
                        || (unicodeCategory == UnicodeCategory.SpacingCombiningMark))
                    currentChar = ' ';
            }
            if (currentChar == ' ')
            {
                if (!lastCharWasWhitespace && gotContent)
                {
                    content[contentIndex] = currentChar;
                    contentIndex++;
                    lastCharWasWhitespace = true;
                }
                continue;
            }
            if (!char.IsLower(currentChar))
                currentChar = char.ToLower(currentChar);
            content[contentIndex] = currentChar;
            contentIndex++;
            contentIndexOfLastNonWhitespace = contentIndex;
            lastCharWasWhitespace = false;
            gotContent = true;
        }
        return new string(content, 0, contentIndexOfLastNonWhitespace);
    }

    public bool Equals(string x, string y)
    {
        if (x == null)
            throw new ArgumentNullException("x");
        if (y == null)
            throw new ArgumentNullException("y");
        return GetNormalisedString(x) == GetNormalisedString(y);
    }

    public int GetHashCode(string obj)
    {
        if (obj == null)
            throw new ArgumentNullException("obj");
        return GetNormalisedString(obj).GetHashCode();
    }
}
I'm still a firm believer in writing the code to work and be easily understood and maintained first. But when a section of code is measurably a bottleneck, and that bottleneck is worth removing, then little adventures like this can be fun and beneficial! And, to be honest, I don't think the resulting code is that difficult to understand. There are probably a few more tweaks that could be made to really eke out some more performance but I'm perfectly happy with it for now :)
Update (17th December 2012): This has been included as part of a later Full Text Indexer Round-up Post that brings together several Posts into one series, incorporating code and techniques from each of them.
Posted at 00:04