Productive Rage

Dan's techie ramblings

The Full Text Indexer: Search Term Highlighting with Source Locations

I made reference last time (in The Full Text Indexer: Source Locations) to using Source Location data to implement search term highlighting and since adding this functionality to the Full Text Indexer, I've rewritten the search term highlighting mechanism on my blog to illustrate it.

The idea is that when a search is performed, the results page tries to show a snippet of each matching post with instances of the search term(s) highlighted. The source locations will describe where all of the matches are but it might not be possible to show all of the matches if the length of the summary snippet is restricted to, say, 300 characters. So a mechanism is required that can choose the best snippet to show; this might be the segment of content that contains the greatest number of matches, or the greatest cumulative weight of matches. It also needs to bear in mind that it's possible that none of the source locations will be in the post content itself; it's feasible that a search term could be present in a post's title but not its content at all!

Getting straight to it

The previous system that I had on my blog to perform this task was a bit complicated - it tried to determine all of the places in the content where matches might have occured (this was before source locations information was included in the Full Text Indexer data) and then generated all permutations of combinations of some or all of these matches in order to try to decide what was the best segment of the content to extract and display as the summary. The logic wasn't precisely the same as the Full Text Indexer's searching as I didn't want to go crazy on processing the content - so it wouldn't handle matches of plurals, for example. And the LINQ statement I carefully crafted over a few iterations to generate the permutations of possible matches seemed cryptic when I came back to look at it a few months later.

The new approach is much simpler:

  1. Sort the source location data by index (where the match appears in the content) and then by length of matched token
  2. Loop through the source data and build up chains of adjacent / overlapping matches where all of the matches can fit inside a segment that is no longer than maxLengthForHighlightedContent
  3. Construct each chain by starting with the current source location
  4. Then look ahead to the next source location (if any) and determine whether both it and the current source location fit within the maxLengthForHighlightedContent constraint
  5. Continue for the subsequent source locations (as soon as including one would exceed the maxLengthForHighlightedContent, stop looking - this works since they're sorted by index and length)
  6. Decide which of these source location chains will result in the best summary (if no chains could be constructed then return an empty set instead)
  7. Return a set of segments (index / length pairs) to highlight - no overlapping segments will be returned, any overlapping segments will be combined (this can make the actual highlighting of search terms much simpler)

The "best summary" is determined by an IComparer that considers different sets of source locations. The implementation I use runs through three metrics

  1. The combined MatchWeightContribution of the source location sets (the greater the better)
  2. If there are chains that the first metric can't differentiate between then consider the number of source locations in the chain (the lower the better, this must mean that the average weight of each match is greater)
  3. If a decision still hasn't been reached for a given comparison then consider the minimum SourceIndex (the lower the better, meaning the matching starts earlier in the content)

I will only be showing a search summary extracted from the post Content field although the search functionality also considers the Title as well as any Tags for the post. The first Content Retriever extracts content from a plain text version of the post's content so all source locations that relate to the Content field will have a SourceFieldIndex value of zero (I touched briefly on this last time but I'll explain in more detail further down in this post too).

So let's see the code! This is one of those pleasant cases where the code flows nicely from the outlined approach. I didn't go into detail above about how the overlap-prevention was handled but the code (hopefully!) illustrates more than adequately -

using System;
using System.Collections.Generic;
using System.Linq;
using FullTextIndexer.Common.Lists;
using FullTextIndexer.Core.Indexes;

namespace BlogBackEnd.FullTextIndexing
{
  public static class SearchTermHighlighter
  {
    public static NonNullImmutableList<StringSegment> IdentifySearchTermsToHighlight(
      string content,
      int maxLengthForHighlightedContent,
      NonNullImmutableList<SourceFieldLocation> sourceLocations,
      IComparer<NonNullImmutableList<SourceFieldLocation>> bestMatchDeterminer)
    {
      if (content == null)
        throw new ArgumentNullException("content");
      if (maxLengthForHighlightedContent <= 0)
      {
        throw new ArgumentOutOfRangeException(
          "maxLengthForHighlightedContent",
          "must be greater than zero"
        );
      }
      if (sourceLocations == null)
        throw new ArgumentNullException("sourceLocations");
      if (sourceLocations.Select(s => s.SourceFieldIndex).Distinct().Count() > 1)
        throw new ArgumentException("All sourceLocations must have the same SourceFieldIndex");
      if (bestMatchDeterminer == null)
        throw new ArgumentNullException("bestMatchDeterminer");

      // If there are no source locations there there is nothing to highlight
      if (!sourceLocations.Any())
        return new NonNullImmutableList<StringSegment>();

      // Sort sourceLocations by index and then length
      sourceLocations = sourceLocations.Sort((x, y) =>
      {
        if (x.SourceIndex < y.SourceIndex)
          return -1;
        else if (y.SourceIndex < x.SourceIndex)
          return 1;

        if (x.SourceTokenLength < y.SourceTokenLength)
          return -1;
        else if (y.SourceTokenLength < x.SourceTokenLength)
          return 1;

        return 0;
      });

      // Identify all combinations of source locations that can be shown at once without exceeding the
      // maxLengthForHighlightedContent restraint
      var sourceLocationChains = new NonNullImmutableList<NonNullImmutableList<SourceFieldLocation>>();
      for (var indexOfFirstSourceLocationInChain = 0;
               indexOfFirstSourceLocationInChain < sourceLocations.Count;
               indexOfFirstSourceLocationInChain++)
      {
        var sourceLocationChain = new NonNullImmutableList<SourceFieldLocation>();
        for (var indexOfLastSourceLocationInChain = indexOfFirstSourceLocationInChain;
                 indexOfLastSourceLocationInChain < sourceLocations.Count;
                 indexOfLastSourceLocationInChain++)
        {
          var startPoint = sourceLocations[indexOfFirstSourceLocationInChain].SourceIndex;
          var endPoint =
            sourceLocations[indexOfLastSourceLocationInChain].SourceIndex +
            sourceLocations[indexOfLastSourceLocationInChain].SourceTokenLength;
          if ((endPoint - startPoint) > maxLengthForHighlightedContent)
            break;

          sourceLocationChain = sourceLocationChain.Add(sourceLocations[indexOfLastSourceLocationInChain]);
          sourceLocationChains = sourceLocationChains.Add(sourceLocationChain);
        }
      }

      // Get the best source location chain, if any (if not, return an empty set) and translate into a
      // StringSegment set
      if (!sourceLocationChains.Any())
        return new NonNullImmutableList<StringSegment>();

      return ToStringSegments(
        sourceLocationChains.Sort(bestMatchDeterminer).First()
      );
    }

    private static NonNullImmutableList<StringSegment> ToStringSegments(
      NonNullImmutableList<SourceFieldLocation> sourceLocations)
    {
      if (sourceLocations == null)
        throw new ArgumentNullException("sourceLocations");
      if (!sourceLocations.Any())
        throw new ArgumentException("must not be empty", "sourceLocations");

      var stringSegments = new NonNullImmutableList<StringSegment>();
      var sourceLocationsToCombine = new NonNullImmutableList<SourceFieldLocation>();
      foreach (var sourceLocation in sourceLocations.Sort((x, y) => x.SourceIndex.CompareTo(y.SourceIndex)))
      {
        // If the current sourceLocation overlaps with the previous one (or adjoins it) then they should
        // be combined together (if there isn't a previous sourceLocation then start a new queue)
        if (!sourceLocationsToCombine.Any()
        || (sourceLocation.SourceIndex
          <= sourceLocationsToCombine.Max(s => (s.SourceIndex + s.SourceTokenLength)))
        )
        {
          sourceLocationsToCombine = sourceLocationsToCombine.Add(sourceLocation);
          continue;
        }

        // If the current sourceLocation marks the start of a new to-highlight segment then add any
        // queued-up sourceLocationsToCombine content to the stringSegments set..
        if (sourceLocationsToCombine.Any())
          stringSegments = stringSegments.Add(new StringSegment(sourceLocationsToCombine));

        // .. and start a new sourceLocationsToCombine list
        sourceLocationsToCombine = new NonNullImmutableList<SourceFieldLocation>(new[] { sourceLocation });
      }
      if (sourceLocationsToCombine.Any())
        stringSegments = stringSegments.Add(new StringSegment(sourceLocationsToCombine));
      return stringSegments;
    }

    public class StringSegment
    {
      public StringSegment(NonNullImmutableList<SourceFieldLocation> sourceLocations)
      {
        if (sourceLocations == null)
          throw new ArgumentNullException("sourceLocations");
        if (!sourceLocations.Any())
          throw new ArgumentException("must not be empty", "sourceLocations");
        if (sourceLocations.Select(s => s.SourceFieldIndex).Distinct().Count() > 1)
          throw new ArgumentException("All sourceLocations must have the same SourceFieldIndex");

        Index = sourceLocations.Min(s => s.SourceIndex);
        Length = sourceLocations.Max(s => (s.SourceIndex + s.SourceTokenLength) - Index);
        SourceLocations = sourceLocations;
      }

      public int Index { get; private set; }
      public int Length { get; private set; }
      public NonNullImmutableList<SourceFieldLocation> SourceLocations { get; private set; }
    }
  }
}

The overlap-prevention is important for my application since I want to be able to take arbitrary segments of the content and wrap them in <strong> tags so that they can appear highlighted - if there are segments that overlap then this isn't going to result in valid html!

The other part of the puzzle is the "best match determiner". This also follows very closely the approach outlined:

public class BlogSearchTermBestMatchComparer : IComparer<NonNullImmutableList<SourceFieldLocation>>
{
  public int Compare(
    NonNullImmutableList<SourceFieldLocation> x,
    NonNullImmutableList<SourceFieldLocation> y)
  {
    if (x == null)
      throw new ArgumentNullException("x");
    if (y == null)
      throw new ArgumentNullException("y");

    var combinedWeightComparisonResult =
      y.Sum(s => s.MatchWeightContribution)
      .CompareTo(x.Sum(s => s.MatchWeightContribution));
    if (combinedWeightComparisonResult != 0)
      return combinedWeightComparisonResult;

    var numberOfTokensComparisonResult = x.Count.CompareTo(y.Count);
    if (numberOfTokensComparisonResult != 0)
      return numberOfTokensComparisonResult;

    return x.Min(s => s.SourceIndex).CompareTo(y.Min(s => s.SourceIndex));
  }
}

Ok, there's actually one more thing. Since I currently use the GetPartialMatches method to deal with multi-word searches on my blog, I have a NonNullImmutableList<SourceFieldLocationWithTerm> rather than a NonNullImmutableList<SourceFieldLocation> so I have this alternate method signature:

public static NonNullImmutableList<StringSegment> IdentifySearchTermsToHighlight(
  string content,
  int maxLengthForHighlightedContent,
  NonNullImmutableList<IndexData_Extensions_PartialMatches.SourceFieldLocationWithTerm> sourceLocations,
  IComparer<NonNullImmutableList<SourceFieldLocation>> bestMatchDeterminer)
{
  if (sourceLocations == null)
    throw new ArgumentNullException("sourceLocations");

  return IdentifySearchTermsToHighlight(
    content,
    maxLengthForHighlightedContent,
    sourceLocations
      .Select(s => new SourceFieldLocation(
        s.SourceFieldIndex,
        s.TokenIndex,
        s.SourceIndex,
        s.SourceTokenLength,
        s.MatchWeightContribution
      ))
      .ToNonNullImmutableList(),
    bestMatchDeterminer
  );
}

It would be nice if covariance was supported for classes in C#, rather than interfaces only, as then this method signature and the extra wrapping would not be required. I've contemplated changing my code such that the NonNullImmutableList implements INonNullImmutableList and supporting covariance on that interface, but I'm a bit uncomfortable that then implementations of INonNullImmutableList could be provided that actually aren't immutable. Having the interface specify NonNullImmutableList (which inherits from ImmutableList) means that the list provided absolutely is immutable. Unfortunately this leaves us without covariance support. (This reminds me of this post I read: Immutability and ReadOnlyCollection<T>).

SourceFieldIndex Values

When the index is generated from the source data, Content Retrievers are specified which are responsible for returning strings of content. This was covered in the first post I wrote for this project: The Full Text Indexer. Originally, each Content Retriever would return zero or one strings but since then the functionality has been expanded to return a NonNullOrEmptyStringList and so zero, one or multiple strings of content may be extracted by a single retriever.

To return to my blog as an example, each post has a Title, Content and Tags (there may be zero, one or multiple Tags). The first Content Retriever extracts a single string from the Content property (it has to do some manipulation since internally it is Markdown which is converted into html, then I extract plain text content from that). The second Content Retriever takes the Title property for a post and the third Content Retriever takes a string for each Tag. This means that any given post will generate at least two content strings, depending upon how many Tags there are.

Each source location associated with a matched search term has a SourceFieldIndex value. For results from searching my blog posts, I know that any source location with SourceFieldIndex zero comes from the post's Content, any source location with SourceFieldIndex one comes from the Title and any with a SourceFieldIndex greater than one must relate to a Tag. So when I want to consider source locations to highlight matches within a segment of the post Content, I consider only those with a SourceFieldIndex of zero.

If you wish to use the AutomatedIndexGeneratorFactoryBuilder (see The Full Text Indexer - Automating Index Generation) to configure an index generator (since that makes it really easy!) there is a method SetPropertyForFirstContentRetriever which enables a particular property to be specified as that which the first Content Retriever will extract content from. This allows this sort of functionality to be layered on top

For more information on this project, see the Full Text Indexer Round-up.

Posted at 20:40