(Approximately) correcting perspective with C# (fixing a blurry presentation video - part two)

TL;DR

I have a video of a presentation where the camera keeps losing focus such that the slides are unreadable. I have the original slide deck and I want to fix this.

Step one was identifying the area in each frame where it seemed likely that the slides were being projected; now step two is to correct the perspective of that projection back into a rectangle, to make it easier to perform comparisons against the original slide deck images and to try to determine which slide was being projected.

(An experimental TL;DR approach: See this small scale .NET Fiddle demonstration of what I'll be discussing)

A slide extracted from a frame of a video presentation and 'perspective-corrected' back into a rectangle

The basic approach

An overview of the processing to do this looks as follows:

  1. Load the original slide image into a Bitmap
  2. Using the projected-slide-area region calculated in step one..
    1. Take the line from the top left of the region to the top right
    2. Take the line from the bottom left of the region to the bottom right (note that this line may be a little longer or shorter than the first line)
    3. Create vertical slices of the image by stepping through the first line (the one across the top), connecting each pixel to a pixel on the bottom line
  3. These vertical slices will not all be the same height and so they'll need to be adjusted to a consistent size (the further from the camera that a vertical slice of the projection is, the smaller it will be)
  4. The height-adjusted vertical slices are then combined into a single rectangle, which will result in an approximation of a perspective-corrected version of the projection of the slide

Note: The reason that this process is only going to be an approximation is due to the way that the height of the output image will be determined -

  1. For my purposes, it will be fine to use the larger of the top-left-to-bottom-left length (ie. the left-hand edge of the projection) and the top-right-to-bottom-right length (the right-hand edge of the projection) but this will always result in an output whose aspect ratio is stretched vertically slightly because the larger of those two lengths will be "magnified" somewhat due to the perspective effect
  2. What might seem like an obvious improvement would be to take an average of the left-hand-edge-height and the right-hand-edge-height but I decided not to do this because I would be losing some fidelity from the vertical slices that would be shrunken down to match this average and because this would still be an approximation as..
  3. The correct way to determine the appropriate aspect ratio for the perspective-corrected image is to use some clever maths to try to determine the angle of the wall that the projection is on (look up perspective correction and vanishing points if you're really curious!) and to use that to decide what ratio of the left-hand-edge-height and the right-hand-edge-height to use
    • (The reason that the take-an-average approach is still an approximation is that perspective makes the larger edge grow more quickly than the smaller edge shrinks, so this calculation would still skew towards a vertically-stretched image)

Slice & dice!

So if we follow the plan above then we'll generate a list of vertical slices a bit like this:

An illustration of how the vertical slices are taken from a projected image

.. which, when combined, would look like this:

The vertical slices of the projected image before their heights are normalised

This is very similar to the original projection except that:

  • The top edge is now across the top of the rectangular area
  • The bottom left corner is aligned with the left-hand side of the image
  • The bottom right corner is aligned with the right-hand side of the image

We're not done yet but this has brought things much closer!

In fact, all that is needed is to stretch those vertical slices so that they are all the same length and - ta-da!

The projected image contorted back into a rectangle

Implementation for slicing and stretching

So, from previous analysis, I know that the bounding area for the projection of the slide in the frames of my video is:

topLeft: (1224, 197)
topRight: (1915, 72)

bottomLeft: (1229, 638)
bottomRight: (1915, 662)

Since I'm going to walk along the top edge and create vertical slices from that, I'm going to need the length of that edge - which is easy enough with some Pythagoras:

private static int LengthOfLine((PointF From, PointF To) line)
{
    var deltaX = line.To.X - line.From.X;
    var deltaY = line.To.Y - line.From.Y;
    return (int)Math.Round(Math.Sqrt((deltaX * deltaX) + (deltaY * deltaY)));
}

So although it's only 691px horizontally from the top left to the top right (1915 - 1224), the actual length of that edge is 702px (because it's a line that angles up slightly rather than being a flat horizontal one).
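
As a quick sanity check, plugging the corner coordinates above into that LengthOfLine method gives the following (the values in the comments are just the rounded results of the Pythagoras calculation):

var topEdgeLength = LengthOfLine((new PointF(1224, 197), new PointF(1915, 72)));   // 702
var leftEdgeLength = LengthOfLine((new PointF(1224, 197), new PointF(1229, 638))); // 441
var rightEdgeLength = LengthOfLine((new PointF(1915, 72), new PointF(1915, 662))); // 590
// .. meaning that the perspective-corrected output will end up 702px wide and Max(441, 590) = 590px tall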

This edge length determines how many vertical slices that we'll take and we'll get them by looping across this top edge, working out where the corresponding point on the bottom edge should be and joining them together into a line; one vertical slice. Each time that the loop increments, the current point on the top edge is going to move slightly to the right and even more slightly upwards while each corresponding point on the bottom edge will also move slightly to the right but it will move slightly down as the projection on the wall gets closer and closer to the camera.

One way to get all of these vertical slice lines is a method such as the following:

private sealed record ProjectionDetails(
    Size ProjectionSize,
    IEnumerable<((PointF From, PointF To) Line, int Index)> VerticalSlices
);

private static ProjectionDetails GetProjectionDetails(
    Point topLeft,
    Point topRight,
    Point bottomRight,
    Point bottomLeft)
{
    var topEdge = (From: topLeft, To: topRight);
    var bottomEdge = (From: bottomLeft, To: bottomRight);
    var lengthOfEdgeToStartFrom = LengthOfLine(topEdge);
    var dimensions = new Size(
        width: lengthOfEdgeToStartFrom,
        height: Math.Max(
            LengthOfLine((topLeft, bottomLeft)),
            LengthOfLine((topRight, bottomRight))
        )
    );
    return new ProjectionDetails(dimensions, GetVerticalSlices());

    IEnumerable<((PointF From, PointF To) Line, int Index)> GetVerticalSlices() =>
        Enumerable
            .Range(0, lengthOfEdgeToStartFrom)
            .Select(i =>
            {
                var fractionOfProgressAlongPrimaryEdge = (float)i / lengthOfEdgeToStartFrom;
                return (
                    Line: (
                        GetPointAlongLine(topEdge, fractionOfProgressAlongPrimaryEdge),
                        GetPointAlongLine(bottomEdge, fractionOfProgressAlongPrimaryEdge)
                    ),
                    Index: i
                );
            });
}

This returns the dimensions of the final perspective-corrected projection (which is as wide as the top edge is long and which is as high as the greater of the left-hand edge's length and the right-hand edge's length) as well as an IEnumerable of the start and end points for each slice that we'll be taking.

The dimensions are going to allow us to create a bitmap that we'll paste the slices into when we're ready - but, before that, we need to determine pixel values for every point on every vertical slice. The horizontal distance across the top edge is 691px and the vertical distance is 125px but its actual length is 702px, so each time we move one step along in that 702-step loop the starting point for the vertical slice will move (691 / 702) = 0.98px across and (125 / 702) = 0.18px up. So almost all of these vertical slices are going to have start and end points that are not whole pixel values - and the same applies to each point on that vertical slice. This means that we're going to have to take average colour values when we're dealing with fractional pixel locations.
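
As a small illustration of that stepping (GetPointAlongLine is shown in the full listing further down), the first step along the top edge of the example projection area lands here:

var topEdge = (From: new PointF(1224, 197), To: new PointF(1915, 72));
var firstPointAlongTopEdge = GetPointAlongLine(topEdge, 1f / 702);
// firstPointAlongTopEdge is approximately (1224.98, 196.82) - ie. 0.98px across and 0.18px up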

For example, if we're at the point (1309.5, 381.5) and the colours at (1309, 381), (1310, 381), (1309, 382), (1310, 382) are all white then the averaging is really easy - the "averaged" colour is white! If we're at the point (1446.5, 431.5) and the colours at (1446, 431), (1447, 431), (1446, 432), (1447, 432) are #BCA6A9, #B1989C, #BCA6A9, #B1989C then it's also not too complicated - because (1446.5, 431.5) is at the precise midpoint between all four points, we can take a really simple average by adding all four R values together, all four G values together, all four B values together and dividing each sum by 4 to get a combined result. It gets a little more complicated where it's not 0.5 of a pixel and the point is slightly more to the left or to the right and/or to the top or to the bottom - eg. (1446.1, 431.9) would get more of its averaged colour from the pixels on the left than on the right (as 1446.1 is only just past 1446) while it would get more of its averaged colour from the pixels on the bottom than the top (as 431.9 is practically at 432). On the other hand, on the rare occasion where it is a precise location (with no fractional pixel values), such as (1826, 258), it's the absolute simplest case because no averaging is required!

private static Color GetAverageColour(Bitmap image, PointF point)
{
    var (integralX, fractionalX) = GetIntegralAndFractional(point.X);
    var x0 = integralX;
    var x1 = Math.Min(integralX + 1, image.Width - 1); // clamp to the last valid pixel index

    var (integralY, fractionalY) = GetIntegralAndFractional(point.Y);
    var y0 = integralY;
    var y1 = Math.Min(integralY + 1, image.Height - 1);

    var (topColour0, topColour1) = GetColours(new Point(x0, y0), new Point(x1, y0));
    var (bottomColour0, bottomColour1) = GetColours(new Point(x0, y1), new Point(x1, y1));

    return CombineColours(
        CombineColours(topColour0, topColour1, fractionalX),
        CombineColours(bottomColour0, bottomColour1, fractionalX),
        fractionalY
    );

    (Color c0, Color c1) GetColours(Point p0, Point p1)
    {
        var c0 = image.GetPixel(p0.X, p0.Y);
        var c1 = (p0 == p1) ? c0 : image.GetPixel(p1.X, p1.Y);
        return (c0, c1);
    }

    static (int Integral, float Fractional) GetIntegralAndFractional(float value)
    {
        var integral = (int)Math.Truncate(value);
        var fractional = value - integral;
        return (integral, fractional);
    }

    static Color CombineColours(Color x, Color y, float proportionOfSecondColour)
    {
        if ((proportionOfSecondColour == 0) || (x == y))
            return x;

        if (proportionOfSecondColour == 1)
            return y;

        return Color.FromArgb(
            red: CombineComponent(x.R, y.R),
            green: CombineComponent(x.G, y.G),
            blue: CombineComponent(x.B, y.B),
            alpha: CombineComponent(x.A, y.A)
        );

        int CombineComponent(int x, int y) =>
            Math.Min(
                (int)Math.Round((x * (1 - proportionOfSecondColour)) + (y * proportionOfSecondColour)),
                255
            );
    }
}
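
As a hypothetical example of calling it (the filename here is just for illustration) - for the point (1446.1, 431.9) discussed above, fractionalX is 0.1 and fractionalY is 0.9 and so each row is blended 90% left / 10% right and then the two rows are blended 10% top / 90% bottom:

using var frame = new Bitmap("frame_338.jpg");
var blendedColour = GetAverageColour(frame, new PointF(1446.1f, 431.9f));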

This gives us the capability to split the wonky projection into vertical slices, to loop over each slice and to walk down each slice and get a list of pixel values for each point down that slice. The final piece of the puzzle is that we then need to resize each vertical slice so that they all match the projection height returned from the GetProjectionDetails method earlier. Handily, the .NET Bitmap drawing code has DrawImage functionality that can resize content, so we can:

  1. Create a Bitmap whose dimensions are those returned from GetProjectionDetails
  2. Loop over each vertical slice (which is an IEnumerable also returned from GetProjectionDetails)
  3. Create a bitmap just for that slice - that is 1px wide and only as tall as the current vertical slice is long
  4. Use DrawImage to paste that slice's bitmap onto the full-size projection Bitmap

In code:

private static void RenderSlice(
    Bitmap projectionBitmap,
    IEnumerable<Color> pixelsOnLine,
    int index)
{
    var pixelsOnLineArray = pixelsOnLine.ToArray();

    using var slice = new Bitmap(1, pixelsOnLineArray.Length);
    for (var j = 0; j < pixelsOnLineArray.Length; j++)
        slice.SetPixel(0, j, pixelsOnLineArray[j]);

    using var g = Graphics.FromImage(projectionBitmap);
    g.DrawImage(
        slice,
        srcRect: new Rectangle(0, 0, slice.Width, slice.Height),
        destRect: new Rectangle(index, 0, 1, projectionBitmap.Height),
        srcUnit: GraphicsUnit.Pixel
    );
}

Pulling it all together

If we combine all of this logic together then we end up with a fairly straightforward static class that does all the work - takes a Bitmap that is a frame from a video where there is a section that should be extracted and then "perspective-corrected", takes the four points that describe that region and then returns a new Bitmap that is the extracted content in a lovely rectangle!

/// <summary>
/// This uses a simple algorithm to try to undo the distortion of a rectangle in an image
/// due to perspective - it takes the content of the rectangle and stretches it into a
/// rectangle. This is only a simple approximation and does not guarantee accuracy (in
/// fact, it will result in an image that is slightly vertically stretched such that its
/// aspect ratio will not match the original content and a more thorough approach would
/// be necessary if this is too great an approximation)
/// </summary>
internal static class SimplePerspectiveCorrection
{
    public static Bitmap ExtractAndPerspectiveCorrect(
        Bitmap image,
        Point topLeft,
        Point topRight,
        Point bottomRight,
        Point bottomLeft)
    {
        var (projectionSize, verticalSlices) =
            GetProjectionDetails(topLeft, topRight, bottomRight, bottomLeft);

        var projection = new Bitmap(projectionSize.Width, projectionSize.Height);
        foreach (var (lineToTrace, index) in verticalSlices)
        {
            var lengthOfLineToTrace = LengthOfLine(lineToTrace);

            var pixelsOnLine = Enumerable
                .Range(0, lengthOfLineToTrace)
                .Select(j =>
                {
                    var fractionOfProgressAlongLineToTrace = (float)j / lengthOfLineToTrace;
                    var point = GetPointAlongLine(lineToTrace, fractionOfProgressAlongLineToTrace);
                    return GetAverageColour(image, point);
                });

            RenderSlice(projection, pixelsOnLine, index);
        }
        return projection;

        static Color GetAverageColour(Bitmap image, PointF point)
        {
            var (integralX, fractionalX) = GetIntegralAndFractional(point.X);
            var x0 = integralX;
            var x1 = Math.Min(integralX + 1, image.Width - 1); // clamp to the last valid pixel index

            var (integralY, fractionalY) = GetIntegralAndFractional(point.Y);
            var y0 = integralY;
            var y1 = Math.Min(integralY + 1, image.Height - 1);

            var (topColour0, topColour1) = GetColours(new Point(x0, y0), new Point(x1, y0));
            var (bottomColour0, bottomColour1) = GetColours(new Point(x0, y1), new Point(x1, y1));

            return CombineColours(
                CombineColours(topColour0, topColour1, fractionalX),
                CombineColours(bottomColour0, bottomColour1, fractionalX),
                fractionalY
            );

            (Color c0, Color c1) GetColours(Point p0, Point p1)
            {
                var c0 = image.GetPixel(p0.X, p0.Y);
                var c1 = (p0 == p1) ? c0 : image.GetPixel(p1.X, p1.Y);
                return (c0, c1);
            }

            static (int Integral, float Fractional) GetIntegralAndFractional(float value)
            {
                var integral = (int)Math.Truncate(value);
                var fractional = value - integral;
                return (integral, fractional);
            }

            static Color CombineColours(Color x, Color y, float proportionOfSecondColour)
            {
                if ((proportionOfSecondColour == 0) || (x == y))
                    return x;

                if (proportionOfSecondColour == 1)
                    return y;

                return Color.FromArgb(
                    red: CombineComponent(x.R, y.R),
                    green: CombineComponent(x.G, y.G),
                    blue: CombineComponent(x.B, y.B),
                    alpha: CombineComponent(x.A, y.A)
                );

                int CombineComponent(int x, int y) =>
                    Math.Min(
                        (int)Math.Round(
                            (x * (1 - proportionOfSecondColour)) +
                            (y * proportionOfSecondColour)
                        ),
                        255
                    );
            }
        }
    }

    private sealed record ProjectionDetails(
        Size ProjectionSize,
        IEnumerable<((PointF From, PointF To) Line, int Index)> VerticalSlices
    );

    private static ProjectionDetails GetProjectionDetails(
        Point topLeft,
        Point topRight,
        Point bottomRight,
        Point bottomLeft)
    {
        var topEdge = (From: topLeft, To: topRight);
        var bottomEdge = (From: bottomLeft, To: bottomRight);
        var lengthOfEdgeToStartFrom = LengthOfLine(topEdge);
        var dimensions = new Size(
            width: lengthOfEdgeToStartFrom,
            height: Math.Max(
                LengthOfLine((topLeft, bottomLeft)),
                LengthOfLine((topRight, bottomRight))
            )
        );
        return new ProjectionDetails(dimensions, GetVerticalSlices());

        IEnumerable<((PointF From, PointF To) Line, int Index)> GetVerticalSlices() =>
            Enumerable
                .Range(0, lengthOfEdgeToStartFrom)
                .Select(i =>
                {
                    var fractionOfProgressAlongPrimaryEdge = (float)i / lengthOfEdgeToStartFrom;
                    return (
                        Line: (
                            GetPointAlongLine(topEdge, fractionOfProgressAlongPrimaryEdge),
                            GetPointAlongLine(bottomEdge, fractionOfProgressAlongPrimaryEdge)
                        ),
                        Index: i
                    );
                });
    }

    private static PointF GetPointAlongLine((PointF From, PointF To) line, float fraction)
    {
        var deltaX = line.To.X - line.From.X;
        var deltaY = line.To.Y - line.From.Y;
        return new PointF(
            (deltaX * fraction) + line.From.X,
            (deltaY * fraction) + line.From.Y
        );
    }

    private static int LengthOfLine((PointF From, PointF To) line)
    {
        var deltaX = line.To.X - line.From.X;
        var deltaY = line.To.Y - line.From.Y;
        return (int)Math.Round(Math.Sqrt((deltaX * deltaX) + (deltaY * deltaY)));
    }

    private static void RenderSlice(
        Bitmap projectionBitmap,
        IEnumerable<Color> pixelsOnLine,
        int index)
    {
        var pixelsOnLineArray = pixelsOnLine.ToArray();

        using var slice = new Bitmap(1, pixelsOnLineArray.Length);
        for (var j = 0; j < pixelsOnLineArray.Length; j++)
            slice.SetPixel(0, j, pixelsOnLineArray[j]);

        using var g = Graphics.FromImage(projectionBitmap);
        g.DrawImage(
            slice,
            srcRect: new Rectangle(0, 0, slice.Width, slice.Height),
            destRect: new Rectangle(index, 0, 1, projectionBitmap.Height),
            srcUnit: GraphicsUnit.Pixel
        );
    }
}
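
To show how it might be called - a hypothetical example that uses the projection bounds from earlier and a made-up frame filename:

using var frame = new Bitmap("frame_338.jpg");
using var correctedSlideProjection = SimplePerspectiveCorrection.ExtractAndPerspectiveCorrect(
    frame,
    topLeft: new Point(1224, 197),
    topRight: new Point(1915, 72),
    bottomRight: new Point(1915, 662),
    bottomLeft: new Point(1229, 638)
);
correctedSlideProjection.Save("PerspectiveCorrected.png");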

Coming next

So step one was to take frames from a video and to work out what the bounds were of the area where slides were being projected (and to filter out any intro and outro frames), step two has been to take the bounded area from any frame and project it back into a rectangle to make it easier to match against the original slide images.. step three will be to use these projections to try to guess which slide is being displayed in which frame!

The frame that I've been using as an example throughout this post probably looks like a fairly easy case - big blocks of white or black and not actually too out of focus.. but some of the frames look like this and that's a whole other kettle of fish!

An out of focus frame from a presentation


Finding the brightest area in an image with C# (fixing a blurry presentation video - part one)

TL;DR

I have a video of a presentation where the camera keeps losing focus such that the slides are unreadable. I have the original slide deck and I want to fix this.

The first step is analysing the individual frames of the video to find a common "most illuminated area" so that I can work out where the slide content was being projected, and that is what is described in this post.

(An experimental TL;DR approach: See this small scale .NET Fiddle demonstration of what I'll be discussing)

An out of focus frame from a presentation

The basic approach

An overview of the processing to do this looks as follows:

  1. Load the image into a Bitmap
  2. Convert the image to greyscale
  3. Identify the lightest and darkest values in the greyscale range
  4. Calculate a 2/3 threshold from that range and create a mask of the image where anything below that value is zero and anything equal to or greater is one
    • eg. If the darkest value was 10 and the lightest was 220 then the difference is 220 - 10 = 210 and the cutoff point would be 2/3 of this range on top of the minimum, so the threshold value would equal ((2/3) * range) + minimum = ((2/3) * 210) + 10 = 140 + 10 = 150
  5. Find the largest bounded area within this mask (if there is one) and presume that that's the projection of the slide in the darkened room!

Before looking at code to do that, I'm going to toss in a few other complications that arise from having to process a lot of frames from throughout the video, rather than just one..

Firstly, the camera loses focus at different points in the video and to different extents and so some frames are blurrier than others. Following the steps above, the blurrier frames are likely to report a larger projection area for the slides. I would really like to identify a common projection area that is reasonable to use across all frames because this will make later processing (where I try to work out what slide is currently being shown in the frame) easier.

Secondly, this video has intro and outro animations and it would be nice if I was able to write code that worked out when they stopped and started.

The implementation for a single image

To do this work, I'm going to introduce a variation of my old friend the DataRectangle (from "How are barcodes read?" and "Face or no face") -

public static class DataRectangle
{
    public static DataRectangle<T> For<T>(T[,] values) => new DataRectangle<T>(values);
}

public sealed class DataRectangle<T>
{
    private readonly T[,] _protectedValues;
    public DataRectangle(T[,] values) : this(values, isolationCopyMayBeBypassed: false) { }
    private DataRectangle(T[,] values, bool isolationCopyMayBeBypassed)
    {
        if ((values.GetLowerBound(0) != 0) || (values.GetLowerBound(1) != 0))
            throw new ArgumentException("Both dimensions must have lower bound zero");
        var arrayWidth = values.GetUpperBound(0) + 1;
        var arrayHeight = values.GetUpperBound(1) + 1;
        if ((arrayWidth == 0) || (arrayHeight == 0))
            throw new ArgumentException("zero element arrays are not supported");

        Width = arrayWidth;
        Height = arrayHeight;

        if (isolationCopyMayBeBypassed)
            _protectedValues = values;
        else
        {
            _protectedValues = new T[Width, Height];
            Array.Copy(values, _protectedValues, Width * Height);
        }
    }

    public int Width { get; }

    public int Height { get; }

    public T this[int x, int y]
    {
        get
        {
            if ((x < 0) || (x >= Width))
                throw new ArgumentOutOfRangeException(nameof(x));
            if ((y < 0) || (y >= Height))
                throw new ArgumentOutOfRangeException(nameof(y));
            return _protectedValues[x, y];
        }
    }

    public IEnumerable<(Point Point, T Value)> Enumerate()
    {
        for (var x = 0; x < Width; x++)
        {
            for (var y = 0; y < Height; y++)
            {
                var value = _protectedValues[x, y];
                var point = new Point(x, y);
                yield return (point, value);
            }
        }
    }

    public DataRectangle<TResult> Transform<TResult>(Func<T, TResult> transformer)
    {
        var transformed = new TResult[Width, Height];
        for (var x = 0; x < Width; x++)
        {
            for (var y = 0; y < Height; y++)
                transformed[x, y] = transformer(_protectedValues[x, y]);
        }
        return new DataRectangle<TResult>(transformed, isolationCopyMayBeBypassed: true);
    }
}
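
As a tiny illustration of the shape of that API (the values here are arbitrary):

// Wrap a 2D array (dimension zero is x / width, dimension one is y / height), transform
// every value into a new DataRectangle and then enumerate the results
var intensities = DataRectangle.For(new double[,] { { 10, 200 }, { 30, 220 } });
var brightEnough = intensities.Transform(value => value >= 150);
foreach (var (point, isBright) in brightEnough.Enumerate())
    Console.WriteLine($"{point}: {isBright}");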

For working with DataRectangle instances that contain double values (as we will be here), I've got a couple of convenient extension methods:

public static class DataRectangleOfDoubleExtensions
{
    public static (double Min, double Max) GetMinAndMax(this DataRectangle<double> source) =>
        source
            .Enumerate()
            .Select(pointAndValue => pointAndValue.Value)
            .Aggregate(
                seed: (Min: double.MaxValue, Max: double.MinValue),
                func: (acc, value) => (Math.Min(value, acc.Min), Math.Max(value, acc.Max))
            );

    public static DataRectangle<bool> Mask(this DataRectangle<double> values, double threshold) =>
        values.Transform(value => value >= threshold);
}

And for working with Bitmap instances, I've got some extension methods for those as well:

public static class BitmapExtensions
{
    public static Bitmap CopyAndResize(this Bitmap image, int resizeLargestSideTo)
    {
        var (width, height) = (image.Width > image.Height)
            ? (resizeLargestSideTo, (int)((double)image.Height / image.Width * resizeLargestSideTo))
            : ((int)((double)image.Width / image.Height * resizeLargestSideTo), resizeLargestSideTo);

        return new Bitmap(image, width, height);
    }

    /// <summary>
    /// This will return values in the range 0-255 (inclusive)
    /// </summary>
    // Based on http://stackoverflow.com/a/4748383/3813189
    public static DataRectangle<double> GetGreyscale(this Bitmap image) =>
        image
            .GetAllPixels()
            .Transform(c => (0.2989 * c.R) + (0.5870 * c.G) + (0.1140 * c.B));

    public static DataRectangle<Color> GetAllPixels(this Bitmap image)
    {
        var values = new Color[image.Width, image.Height];
        var data = image.LockBits(
            new Rectangle(0, 0, image.Width, image.Height),
            ImageLockMode.ReadOnly,
            PixelFormat.Format24bppRgb
        );
        try
        {
            var pixelData = new byte[data.Stride];
            for (var lineIndex = 0; lineIndex < data.Height; lineIndex++)
            {
                Marshal.Copy(
                    source: data.Scan0 + (lineIndex * data.Stride),
                    destination: pixelData,
                    startIndex: 0,
                    length: data.Stride
                );
                for (var pixelOffset = 0; pixelOffset < data.Width; pixelOffset++)
                {
                    // Note: PixelFormat.Format24bppRgb means the data is stored in memory as BGR
                    const int PixelWidth = 3;
                    values[pixelOffset, lineIndex] = Color.FromArgb(
                        red: pixelData[pixelOffset * PixelWidth + 2],
                        green: pixelData[pixelOffset * PixelWidth + 1],
                        blue: pixelData[pixelOffset * PixelWidth]
                    );
                }
            }
        }
        finally
        {
            image.UnlockBits(data);
        }
        return DataRectangle.For(values);
    }
}

With this code, we can already perform those first steps that I've described in the find-projection-area-in-image process.

Note that I'm going to throw in an extra step of shrinking the input images if they're larger than 400px because we don't need pixel-perfect accuracy when the whole point of this process is that a lot of the frames are too blurry to read (as a plus, shrinking the images means that there's less data to process and the whole thing should finish more quickly).

using var image = new Bitmap("frame_338.jpg");
using var resizedImage = image.CopyAndResize(resizeLargestSideTo: 400);
var greyScaleImageData = resizedImage.GetGreyscale();
var (min, max) = greyScaleImageData.GetMinAndMax();
var range = max - min;
const double thresholdOfRange = 2 / 3d;
var thresholdForMasking = min + (range * thresholdOfRange);
var mask = greyScaleImageData.Mask(thresholdForMasking);

This gives us a DataRectangle of boolean values that represent the brighter points as true and the less bright points as false.

In the image below, you can see the original frame on the left. In the middle is the content that remains when all but the brightest pixels are hidden. On the right is the "binary mask" (where we discard the original colour of the pixels and make them all either black or white) -

A frame from the video with the brightest third of the pixels masked out

Now we need to identify the largest "object" within this mask - wherever bright pixels are adjacent to other bright pixels, they will be considered part of the same object and we would expect there to be several such objects within the mask that has been generated.

To do so, I'll be reusing some more code from "How are barcodes read?" -

private static IEnumerable<IEnumerable<Point>> GetDistinctObjects(DataRectangle<bool> mask)
{
    // Flood fill areas in the mask to create distinct areas
    var allPoints = mask
        .Enumerate()
        .Where(pointAndIsMasked => pointAndIsMasked.Value)
        .Select(pointAndIsMasked => pointAndIsMasked.Point).ToHashSet();
    while (allPoints.Any())
    {
        var currentPoint = allPoints.First();
        var pointsInObject = GetPointsInObject(currentPoint).ToArray();
        foreach (var point in pointsInObject)
            allPoints.Remove(point);
        yield return pointsInObject;
    }

    // Inspired by code at
    // https://simpledevcode.wordpress.com/2015/12/29/flood-fill-algorithm-using-c-net/
    IEnumerable<Point> GetPointsInObject(Point startAt)
    {
        var pixels = new Stack<Point>();
        pixels.Push(startAt);

        var valueAtOriginPoint = mask[startAt.X, startAt.Y];
        var filledPixels = new HashSet<Point>();
        while (pixels.Count > 0)
        {
            var currentPoint = pixels.Pop();
            if ((currentPoint.X < 0) || (currentPoint.X >= mask.Width)
            || (currentPoint.Y < 0) || (currentPoint.Y >= mask.Height))
                continue;

            if ((mask[currentPoint.X, currentPoint.Y] == valueAtOriginPoint)
            && !filledPixels.Contains(currentPoint))
            {
                filledPixels.Add(new Point(currentPoint.X, currentPoint.Y));
                pixels.Push(new Point(currentPoint.X - 1, currentPoint.Y));
                pixels.Push(new Point(currentPoint.X + 1, currentPoint.Y));
                pixels.Push(new Point(currentPoint.X, currentPoint.Y - 1));
                pixels.Push(new Point(currentPoint.X, currentPoint.Y + 1));
            }
        }
        return filledPixels;
    }
}

As the code mentions, this is based on an article "Flood Fill algorithm (using C#.NET)" and its output is a list of objects, where each object is a list of points within that object. So the way to determine which object is largest is to take the one that contains the most points!

A binary mask of a frame in the video with all but the single largest areas hidden

var pointsInLargestHighlightedArea = GetDistinctObjects(mask)
    .OrderByDescending(points => points.Count())
    .FirstOrDefault();

(Note: If pointsInLargestHighlightedArea is null then we need to escape out of the method that we're in because the source image didn't produce a mask with any highlighted objects - this could happen if the image has every single pixel the same colour, for example; an edge case, surely, but one that we should handle)

From this largest object, we want to find a bounding quadrilateral, which we do by looking at every point and finding the one closest to the top left of the image (because this will be the top left of the bounding area), the point closest to the top right of the image (for the top right of the bounding area) and the same for the points closest to the bottom left and bottom right.

This can be achieved by calculating, for each point in the object, the distances from each of the corners to the points and then determining which points have the shortest distances - eg.

var distancesOfPointsFromImageCorners = pointsInLargestHighlightedArea
    .Select(p =>
    {
        // To work out distance from the top left, you would use Pythagoras to take the
        // squared horizontal distance of the point from the left of the image and add
        // that to the squared vertical distance of the point from the top of the image,
        // then you would square root that sum. In this case, we only want to be able to
        // compare which distances are smaller or larger and we don't actually
        // care about the precise distances themselves and so we can save ourselves from
        // performing that final square root calculation.
        var distanceFromRight = greyScaleImageData.Width - p.X;
        var distanceFromBottom = greyScaleImageData.Height - p.Y;
        var fromLeftScore = p.X * p.X;
        var fromTopScore = p.Y * p.Y;
        var fromRightScore = distanceFromRight * distanceFromRight;
        var fromBottomScore = distanceFromBottom * distanceFromBottom;
        return new
        {
            Point = p,
            FromTopLeft = fromLeftScore + fromTopScore,
            FromTopRight = fromRightScore + fromTopScore,
            FromBottomLeft = fromLeftScore + fromBottomScore,
            FromBottomRight = fromRightScore + fromBottomScore
        };
    })
    .ToArray(); // Call ToArray to avoid repeating this enumeration four times below
    
var topLeft = distancesOfPointsFromImageCorners.OrderBy(p => p.FromTopLeft).First().Point;
var topRight = distancesOfPointsFromImageCorners.OrderBy(p => p.FromTopRight).First().Point;
var bottomLeft = distancesOfPointsFromImageCorners.OrderBy(p => p.FromBottomLeft).First().Point;
var bottomRight = distancesOfPointsFromImageCorners.OrderBy(p => p.FromBottomRight).First().Point;

Finally, because we want to find the bounding area of the largest object in the original image, we may need to multiply up the bounds that we just found because we shrank the image down if either dimension was larger than 400px and we were performing calculations on that smaller version.

We can tell how much we reduced the data by looking at the width of the original image and comparing it to the width of the greyScaleImageData DataRectangle that was generated from the shrunken version of the image:

var reducedImageSideBy = (double)image.Width / greyScaleImageData.Width;

Now we only need a function that will multiply the bounding area that we've got according to this value, while ensuring that none of the point values are multiplied such that they exceed the bounds of the original image:

private static (Point TopLeft, Point TopRight, Point BottomRight, Point BottomLeft) Resize(
    Point topLeft,
    Point topRight,
    Point bottomRight,
    Point bottomLeft,
    double resizeBy,
    int minX,
    int maxX,
    int minY,
    int maxY)
{
    if (resizeBy <= 0)
        throw new ArgumentOutOfRangeException(nameof(resizeBy), "must be a positive value");

    return (
        Constrain(Multiply(topLeft)),
        Constrain(Multiply(topRight)),
        Constrain(Multiply(bottomRight)),
        Constrain(Multiply(bottomLeft))
    );

    Point Multiply(Point p) =>
        new Point((int)Math.Round(p.X * resizeBy), (int)Math.Round(p.Y * resizeBy));

    Point Constrain(Point p) =>
        new Point(Math.Min(Math.Max(p.X, minX), maxX), Math.Min(Math.Max(p.Y, minY), maxY));
}

The final bounding area for the largest bright area of an image is now retrieved like this:

var bounds = Resize(
    topLeft,
    topRight,
    bottomRight,
    bottomLeft,
    reducedImageSideBy,
    minX: 0,
    maxX: image.Width - 1,
    minY: 0,
    maxY: image.Height - 1
);

For the example image that we're looking at, this area is outlined like this:

An outline around the largest object within the binary mask of a frame from the video

Applying the process to multiple images

Say that we put all of the above functionality into a method called GetMostHighlightedArea that took a Bitmap to process and returned a tuple of the four points that represented the bounds of the brightest area, we could then easily prepare a LINQ statement that ran that code and found the most common brightest-area-bounds across all of the source images that I have. (As I said before, the largest-bounded-area will vary from image to image in my example as the camera recording the session gained and lost focus)

var files = new DirectoryInfo("Frames").EnumerateFiles("*.jpg");
var (topLeft, topRight, bottomRight, bottomLeft) = files
    .Select(file =>
    {
        using var image = new Bitmap(file.FullName);
        return IlluminatedAreaLocator.GetMostHighlightedArea(image);
    })
    .GroupBy(area => area)
    .OrderByDescending(group => group.Count())
    .Select(group => group.Key)
    .FirstOrDefault();

Presuming that there is a folder called "Frames" in the output folder of the project*, this will read them all, look for the largest bright area in each of them individually, then return the area that appears most often across all of the images. (Note: If there are no images to read then the FirstOrDefault call at the bottom will return a default tuple-of-four-Points, which will be 4x (0,0) values)

* (Since you probably don't happen to have a bunch of images from a video of my presentation lying around, see the next section for some code that will download some in case you want to try this all out!)
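
(For reference, this is roughly what that GetMostHighlightedArea method could look like when the pieces above are glued together - it's a sketch rather than code lifted from the repo, and it assumes that the GetDistinctObjects and Resize helpers shown earlier live in the same IlluminatedAreaLocator class)

internal static class IlluminatedAreaLocator
{
    public static (Point TopLeft, Point TopRight, Point BottomRight, Point BottomLeft)
        GetMostHighlightedArea(Bitmap image)
    {
        // Shrink, greyscale and mask the frame (as described earlier)
        using var resizedImage = image.CopyAndResize(resizeLargestSideTo: 400);
        var greyScaleImageData = resizedImage.GetGreyscale();
        var (min, max) = greyScaleImageData.GetMinAndMax();
        var mask = greyScaleImageData.Mask(min + ((max - min) * (2 / 3d)));

        // Find the largest bright object (if there is one)
        var pointsInLargestHighlightedArea = GetDistinctObjects(mask)
            .OrderByDescending(points => points.Count())
            .FirstOrDefault();
        if (pointsInLargestHighlightedArea is null)
            return default;

        // Find the points in that object that are closest to each corner of the shrunken image
        var distancesOfPointsFromImageCorners = pointsInLargestHighlightedArea
            .Select(p =>
            {
                var distanceFromRight = greyScaleImageData.Width - p.X;
                var distanceFromBottom = greyScaleImageData.Height - p.Y;
                return new
                {
                    Point = p,
                    FromTopLeft = (p.X * p.X) + (p.Y * p.Y),
                    FromTopRight = (distanceFromRight * distanceFromRight) + (p.Y * p.Y),
                    FromBottomLeft = (p.X * p.X) + (distanceFromBottom * distanceFromBottom),
                    FromBottomRight =
                        (distanceFromRight * distanceFromRight) +
                        (distanceFromBottom * distanceFromBottom)
                };
            })
            .ToArray();

        // Scale the corner points back up into the original image's coordinate space
        return Resize(
            distancesOfPointsFromImageCorners.OrderBy(p => p.FromTopLeft).First().Point,
            distancesOfPointsFromImageCorners.OrderBy(p => p.FromTopRight).First().Point,
            distancesOfPointsFromImageCorners.OrderBy(p => p.FromBottomRight).First().Point,
            distancesOfPointsFromImageCorners.OrderBy(p => p.FromBottomLeft).First().Point,
            resizeBy: (double)image.Width / greyScaleImageData.Width,
            minX: 0,
            maxX: image.Width - 1,
            minY: 0,
            maxY: image.Height - 1
        );
    }
}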

This ties in nicely with my recent post "Parallelising (LINQ) work in C#" because the processing required for each image is..

  1. Completely independent from the processing of the other images (important for parallelising work)
  2. Expensive enough that the overhead from splitting the work into multiple threads and then combining their results back together would be overshadowed by the work performed (which is also important for parallelising work - if individual tasks are too small and the computer spends more time scheduling the work on threads and then pulling all the results back together than it does on actually performing that work then using multiple threads can be slower than using a single one!)

All that we would have to change in order to use multiple threads to process multiple images is the addition of a single line:

var files = new DirectoryInfo("Frames").EnumerateFiles("*.jpg");
var (topLeft, topRight, bottomRight, bottomLeft) = files
    .AsParallel() // <- WOO!! This is all that we needed to add!
    .Select(file =>
    {
        using var image = new Bitmap(file.FullName);
        return IlluminatedAreaLocator.GetMostHighlightedArea(image);
    })
    .GroupBy(area => area)
    .OrderByDescending(group => group.Count())
    .Select(group => group.Key)
    .FirstOrDefault();

(Parallelisation sidebar: When we split up the work like this, if the processing for each image was solely in memory then it would be a no-brainer that using more threads would make sense - however, the processing for each image involves LOADING the image from disk and THEN processing it in memory and if you had a spinning rust hard disk then you may fear that trying to ask it to read multiple files simultaneously would be slower than asking it to read them one at a time because its poor little read heads have to physically move around the plates.. it turns out that this is not necessarily the case and that you can find more information in this article that I found interesting; "Performance Impact of Parallel Disk Access")

Testing the code on your own computer

I haven't quite finished yet but I figured that there may be some wild people out there that would like to try running this code locally themselves - maybe just to see it work or maybe even to get it working and then chop and change it for some new and exciting purpose!

To this end, I have some sample frames available from this video that I'm trying to fix - with varying levels of fuzziness present. To download them, use the following method:

private static async Task EnsureSamplesAvailable(DirectoryInfo framesfolder)
{
    // Note: The GitHub API is rate limited quite severely for non-authenticated apps, so we just
    // only call use it if the framesFolder doesn't exist or is empty - if there are already files
    // in there then we presume that we downloaded them on a previous run (if the API is hit too
    // often then it will return a 403 "rate limited" response)
    if (framesfolder.Exists && framesfolder.EnumerateFiles().Any())
    {
        Console.WriteLine("Sample images have already been downloaded and are ready for use");
        return;
    }

    Console.WriteLine("Downloading sample images..");
    if (!framesfolder.Exists)
        framesfolder.Create();

    string namesAndUrlsJson;
    using (var client = new WebClient())
    {
        // The API refuses requests without a User Agent, so set one before calling (see
        // https://docs.github.com/en/rest/overview/resources-in-the-rest-api#user-agent-required)
        client.Headers.Add(HttpRequestHeader.UserAgent, "ProductiveRage Blog Post Example");
        namesAndUrlsJson = await client.DownloadStringTaskAsync(new Uri(
            "https://api.github.com/repos/" +
            "ProductiveRage/NaivePerspectiveCorrection/contents/Samples/Frames"
        ));
    }

    // Deserialise the response into an array of entries that have Name and Download_Url properties
    var namesAndUrls = JsonConvert.DeserializeAnonymousType(
        namesAndUrlsJson,
        new[] { new { Name = "", Download_Url = (Uri?)null } }
    );
    if (namesAndUrls is null)
    {
        Console.WriteLine("GitHub reported zero sample images to download");
        return;
    }

    await Task.WhenAll(namesAndUrls
        .Select(async entry =>
        {
            using var client = new WebClient();
            await client.DownloadFileTaskAsync(
                entry.Download_Url,
                Path.Combine(framesfolder.FullName, entry.Name)
            );
        })
    );

    Console.WriteLine($"Downloaded {namesAndUrls.Length} sample image(s)");
}

.. and call it with the following argument, presuming you're trying to read images from the "Frames" folder as the code earlier illustrated:

await EnsureSamplesAvailable(new DirectoryInfo("Frames"));

Filtering out intro/outro slides

So I said earlier that it would also be nice if I could programmatically identify which frames were part of the intro/outro animations of the video that I'm looking at.

It feels logical that any frame that is of the actual presentation will have a fairly similarly-sized-and-located bright area (where a slide is being projected onto a wall in a darkened room) while any frame that is part of an intro/outro animation won't. So we should be able to take the most-common-largest-brightest-area and then look at every frame and see if its largest bright area is approximately the same - if it's similar enough then it's probably a frame that is part of the projection but if it's too dissimilar then it's probably not.

Rather than waste time going too far down a rabbit hole that I've found won't immediately result in success, I'm going to use a slightly altered version of that plan (I'll explain why in a moment). I'm still going to take that common largest brightest area and compare the largest bright area on each frame to it but, instead of saying "largest-bright-area-is-close-enough-to-the-most-common = presentation frame / largest-bright-area-not-close-enough = intro or outro", I'm going to find the first frame whose largest bright area is close enough and the last frame that is and declare that that range is probably where the frames for the presentation are.

The reason that I'm going to do this is that I found that there are some slides with more variance that can skew the results if the first approach was taken - if a frame in the middle of the presentation is so blurry that the range in intensity from darkest pixel to brightest pixel is squashed down too far then it can result in it identifying a largest bright area that isn't an accurate representation of the image. It's quite possible that I could still have made the first approach work by tweaking some other parameters in the image processing - such as considering changing that arbitrary "create a mask where the intensity threshold is 2/3 of the range of the brightness of all pixels" (maybe 3/4 would have worked better?), for example - but I know that this second approach works for my data and so I didn't pursue the first one too hard.

To do this, though, we are going to need to know what order the frames are supposed to appear in - it's no longer sufficient for there to simply be a list of images that are frames out of the video, we now need to know where they appeared relative to each other. This is simple enough with my data because the files all have names like "frame_1052.jpg", where 1052 is the frame index from the original video.

So I'm going to change the frame-image-loading code to look like this:

// Get all filenames, parse the frame index from them and discard any that don't
// match the filename pattern that is expected (eg. "frame_1052.jpg")
var frameIndexMatcher = new Regex(@"frame_(\d+)\.jpg", RegexOptions.IgnoreCase);
var files = new DirectoryInfo("Frames")
    .EnumerateFiles()
    .Select(file =>
    {
        var frameIndexMatch = frameIndexMatcher.Match(file.Name);
        return frameIndexMatch.Success
            ? (file.FullName, FrameIndex: int.Parse(frameIndexMatch.Groups[1].Value))
            : default;
    })
    .Where(entry => entry != default);

// Get the largest bright area for each file
var allFrameHighlightedAreas = files
    .AsParallel()
    .Select(file =>
    {
        using var image = new Bitmap(file.FullName);
        return (
            file.FrameIndex,
            HighlightedArea: IlluminatedAreaLocator.GetMostHighlightedArea(image)
        );
    })
    .ToArray();

// Get the most common largest bright area across all of the images
var (topLeft, topRight, bottomRight, bottomLeft) = allFrameHighlightedAreas
    .GroupBy(entry => entry.HighlightedArea)
    .OrderByDescending(group => group.Count())
    .Select(group => group.Key)
    .FirstOrDefault();

(Note that I'm calling ToArray() when declaring allFrameHighlightedAreas - that's to store the results now because I know that I'm going to need every result in the list and because I'm going to enumerate it twice in the work outlined here. There's no point leaving allFrameHighlightedAreas as a lazily-evaluated IEnumerable that would be recalculated each time it was looped over - that would mean repeating all of the IlluminatedAreaLocator.GetMostHighlightedArea calculations for every image each time the list was enumerated, which would just be wasteful!)

Now to look at the allFrameHighlightedAreas list and try to decide if each HighlightedArea value is close enough to the most common area that we found. I'm going to use a very simple algorithm for this - I'm going to:

  1. Take all four points from the HighlightedArea on each entry in allFrameHighlightedAreas
  2. Take all four points from the most common area (which are the topLeft, topRight, bottomRight, bottomLeft values that we already have in the code above)
  3. Take the differences in X value between all four points in these two areas and add them up
  4. Compare this difference to the width of the most common highlighted area - if it's too big of a proportion (say if the sum of the X differences is greater than 20% of the width of the entire area) then we'll say it's not a match and drop out of this list
  5. If the X values aren't too bad then we'll take the differences in Y value between all four points in these two areas and add those up
  6. That total will be compared to the height of the most common highlighted area - if it's more than the 20% threshold then we'll say that it's not a match
  7. If we got to here then we'll say that the highlighted area in the current frame is close enough to the most common highlighted area and so the current frame probably is part of the presentation - yay!

In code:

var highlightedAreaWidth = Math.Max(topRight.X, bottomRight.X) - Math.Min(topLeft.X, bottomLeft.X);
var highlightedAreaHeight = Math.Max(bottomLeft.Y, bottomRight.Y) - Math.Min(topLeft.Y, topRight.Y);
const double thresholdForPointVarianceComparedToAreaSize = 0.2;
var frameIndexesThatHaveTheMostCommonHighlightedArea = allFrameHighlightedAreas
    .Where(entry =>
    {
        var (entryTL, entryTR, entryBR, entryBL) = entry.HighlightedArea;
        var xVariance =
            new[]
            {
                entryBL.X - bottomLeft.X,
                entryBR.X - bottomRight.X,
                entryTL.X - topLeft.X,
                entryTR.X - topRight.X
            }
            .Sum(Math.Abs);
        var yVariance =
            new[]
            {
                entryBL.Y - bottomLeft.Y,
                entryBR.Y - bottomRight.Y,
                entryTL.Y - topLeft.Y,
                entryTR.Y - topRight.Y
            }
            .Sum(Math.Abs);
        return
            (xVariance <= highlightedAreaWidth * thresholdForPointVarianceComparedToAreaSize) &&
            (yVariance <= highlightedAreaHeight * thresholdForPointVarianceComparedToAreaSize);
    })
    .Select(entry => entry.FrameIndex)
    .ToArray();

This gives us a frameIndexesThatHaveTheMostCommonHighlightedArea array of frame indexes that have a largest brightest area that is fairly close to the most common one. So to decide which frames are probably the start of the presentation and the end, we simply need to say:

var firstFrameIndex = frameIndexesThatHaveTheMostCommonHighlightedArea.Min();
var lastFrameIndex = frameIndexesThatHaveTheMostCommonHighlightedArea.Max();

Any frames whose index is less than firstFrameIndex or greater than lastFrameIndex are probably part of the intro or outro sequence - eg.

A frame from the intro of the video - the largest bright area is not near the slide projection

Any frames whose index is within the firstFrameIndex / lastFrameIndex range are probably part of the presentation - eg.

A frame from the presentation part of the video - the largest bright area IS the slide projection
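
(If a list of just-the-presentation frames is wanted then it's a simple range check against those two values - a hypothetical follow-on from the code above:)

var presentationFrameIndexes = allFrameHighlightedAreas
    .Select(entry => entry.FrameIndex)
    .Where(frameIndex => (frameIndex >= firstFrameIndex) && (frameIndex <= lastFrameIndex))
    .OrderBy(frameIndex => frameIndex)
    .ToArray();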

Coming soon

As the title of this post strongly suggests, this is only the first step in my desire to fix up my blurry presentation video. What I'm going to have to cover in the future is to:

  1. Extract the content from the most-common-brightest-area in each frame of the video that is part of the presentation and contort it back into a rectangle - undoing the distortion that is introduced by perspective due to the position of the camera and where the slides were projected in the room (I'll be tackling this in a slightly approximate-but-good-enough manner because to do it super accurately requires lots of complicated maths and I've managed to forget nearly all of the maths degree that I got twenty years ago!)
  2. Find a way to compare the perspective-corrected projections from each frame against a clean image of the original slide deck and work out which slide each frame is most similar to (this should be possible with some surprisingly rudimentary calculations inspired by some of the image preprocessing that I've mentioned in a couple of my posts that touch on machine learning but without requiring any machine learning itself)
  3. Some tweaks that were required to get the best results with my particular images (for example, when I described the GetMostHighlightedArea function earlier, I picked 400px as an arbitrary value to resize images to before greyscaling them, masking them and looking for their largest bright area; maybe it will turn out that smaller or larger values for that process result in improved or worsened results - we'll find out!)

Once this is all done, I will take the original frame images and, for each one, overlay a clean version of the slide that appeared blurrily in the frame (again, I'll have clean versions of each slide from the original slide deck that I produced, so that should be an easy part) - then I'll mash them all back together into a new video, combined with the original audio. To do this (the video work), I'll likely use the same tool that I used to extract the individual frame files from the video in the first place - the famous FFmpeg!

I doubt that I'll have a post on this last section as it would only be a small amount of C# code that combines two images for each frame, writes the results to disk, followed by me making a command line call to FFmpeg to produce the video - and I don't think that there's anything particularly exciting there! If I get this all completed, though, I will - of course - link to the fixed-up presentation video.. because why not shamelessly plug myself given any opportunity!


How are barcodes read?? (Library-less image processing in C#)

I've been using MyFitnessPal and it has the facility to load nutrition information by scanning the barcode on the product. I can guess how the retrieval works once the barcode number is obtained (a big database somewhere) but it struck me that I had no idea how the reading of the barcode itself worked and.. well, I'm curious and enjoy the opportunity to learn something new to me by writing the code to do it. I do enjoy being able to look up (almost) anything on the internet to find out how it works!

(For anyone who wants to either play along but not copy-paste the code themselves or for anyone who wants to jump to the end result, I've put the code - along with the example image I've used in this post - up on a GitHub repo)

The plan of attack

There are two steps required here:

  1. Read the image and try to identify areas that look like barcodes
  2. Try to extract numbers from the looks-like-a-barcode regions

As with anything, these steps may be broken down into smaller tasks. The first step can be done like this:

  1. Barcodes are black and white regions that have content that has steep "gradients" in image intensity horizontally (where there is a change from a black bar to a white space) and little change in intensity vertically (as each bar is a vertical line), so first we greyscale the image and then generate horizontal and vertical intensity gradient values for each point in the image and combine the values by subtracting the vertical gradient from the horizontal gradient
  2. These values are normalised so that they are all on the scale zero to one - this data could be portrayed as another greyscale image where the brightest parts are most likely to be within barcodes
  3. These values are then "spread out" or "blurred" and then a threshold value is applied where every value above it is changed into a 1 and every value below it into a 0
  4. This "mask" (where every value is a 0 or 1) should have identified many of the pixels within the barcodes and we want to group these pixels into distinct objects
  5. There is a chance, though, that there could be gaps between bars that mean that a single barcode is spread across multiple masked-out objects and we need to try to piece them back together into one area (since the bars are tall and narrow, this may be done by considering a square area over every object and then combining objects whose squared areas overlap into one)
  6. This process will result in a list of areas that may be barcodes - any that are taller than they are wide are ignored (because barcode regions are always wider than they are tall)

The second step can be broken down into:

  1. Take the maybe-barcode region of the image, greyscale it and then turn into a mask by setting any pixel with an intensity less than a particular threshold to zero and otherwise to one
  2. Take a horizontal slice across the image region - all of the pixels on the first row of the image - and change the zero-or-one raw data into a list of line lengths where a new line starts at any transition from zero-to-one or one-to-zero (so "01001000110" becomes "1,1,2,1,3,2,1" because there is 1x zero and then 1x one and then 2x zero and then 1x one, etc..)
  3. These line lengths should correspond to bar sizes (and space-between-bar sizes) if we've found a barcode - so run the values through the magic barcode bar-size-reading algorithm (see section 2.1 in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2859730/) and if we get a number (and the checksum is correct) then we're done, hurrah!
  4. If we couldn't get a number from this horizontal slice then move one pixel down and go back around
  5. If it was not possible to extract a number from any of the slices through the image region then it's either not a barcode or it's somehow so distorted in the image that we can't read it

This approach is fairly resilient to changes in lighting and orientation because the barcode regions are still likely to have the highest horizontal intensity gradient whether the image is dark or light (and even if part of the image is light and part of it is dark), and the barcode-reading algorithm works on ratios of bar / space-between-bar widths, which remain constant if the image is rotated.

(Some of the techniques are similar to things that I did in my Face or no face (finding faces in photos using C# and Accord.NET) and so I'll be using some of the same code shortly that I described then)

Identifying maybe-barcode images

Let's use this image as an example to work with (peanut butter.. I do love peanut butter) -

Delicious peanut butter

Before looking at any code, let's visualise the process.

We're going to consider horizontal and vertical gradient intensity maps - at every point in the image we either look to the pixels to the left and to the right (for the horizontal gradient) or we look at the pixels above and below (for the vertical gradient) and the larger the change, the brighter the pixel in the gradient intensity map

Horizontal gradient intensity

And when they're combined by subtracting the vertical gradient at each point from the horizontal gradient, it looks like this:

Combined gradient intensity

If this image is blurred then we get this:

Blurred combined gradient intensity

.. and if we create a binary mask by saying "normalise the intensity values so that their range goes from zero (for the darkest pixel) to one (for the brightest) and then set any pixels that are in the bottom third in terms of intensity to 0 and set the rest to 1" then we get this:

Mask of possibly-part-of-a-barcode areas

If each distinct area (where an "area" means "a group of pixels that are connected") is identified and squares overlaid and centered around the areas then we see this:

Mask of possibly-part-of-a-barcode areas, extended into squared areas

.. and if the areas whose bounding squares overlap are combined and then cropped around the white pixels then we end up with this:

Combined possibly-a-barcode areas

This has identified the area around the barcode and also two tiny other areas - when we come to trying to read barcode numbers out of these, the tiny regions will result in no value while the area around the genuine barcode content should result in a number successfully being read. But I'm getting ahead of myself.. let's look at the code required to perform the above transformations.

I'm going to start with a DataRectangle for performing transformations -

public static class DataRectangle
{
    public static DataRectangle<T> For<T>(T[,] values) => new DataRectangle<T>(values);
}

public sealed class DataRectangle<T>
{
    private readonly T[,] _protectedValues;
    public DataRectangle(T[,] values) : this(values, isolationCopyMayBeBypassed: false) { }
    private DataRectangle(T[,] values, bool isolationCopyMayBeBypassed)
    {
        if ((values.GetLowerBound(0) != 0) || (values.GetLowerBound(1) != 0))
            throw new ArgumentException("Both dimensions must have lower bound zero");
        var arrayWidth = values.GetUpperBound(0) + 1;
        var arrayHeight = values.GetUpperBound(1) + 1;
        if ((arrayWidth == 0) || (arrayHeight == 0))
            throw new ArgumentException("zero element arrays are not supported");

        Width = arrayWidth;
        Height = arrayHeight;

        if (isolationCopyMayBeBypassed)
            _protectedValues = values;
        else
        {
            _protectedValues = new T[Width, Height];
            Array.Copy(values, _protectedValues, Width * Height);
        }
    }

    /// <summary>
    /// This will always be greater than zero
    /// </summary>
    public int Width { get; }

    /// <summary>
    /// This will always be greater than zero
    /// </summary>
    public int Height { get; }

    public T this[int x, int y]
    {
        get
        {
            if ((x < 0) || (x >= Width))
                throw new ArgumentOutOfRangeException(nameof(x));
            if ((y < 0) || (y >= Height))
                throw new ArgumentOutOfRangeException(nameof(y));
            return _protectedValues[x, y];
        }
    }

    public IEnumerable<Tuple<Point, T>> Enumerate(Func<Point, T, bool>? optionalFilter = null)
    {
        for (var x = 0; x < Width; x++)
        {
            for (var y = 0; y < Height; y++)
            {
                var value = _protectedValues[x, y];
                var point = new Point(x, y);
                if (optionalFilter?.Invoke(point, value) ?? true)
                    yield return Tuple.Create(point, value);
            }
        }
    }

    public DataRectangle<TResult> Transform<TResult>(Func<T, TResult> transformer)
    {
        return Transform((value, coordinates) => transformer(value));
    }

    public DataRectangle<TResult> Transform<TResult>(Func<T, Point, TResult> transformer)
    {
        var transformed = new TResult[Width, Height];
        for (var x = 0; x < Width; x++)
        {
            for (var y = 0; y < Height; y++)
                transformed[x, y] = transformer(_protectedValues[x, y], new Point(x, y));
        }
        return new DataRectangle<TResult>(transformed, isolationCopyMayBeBypassed: true);
    }
}

And then I'm going to add a way to load image data into this structure -

public static class BitmapExtensions
{
    /// <summary>
    /// This will return values in the range 0-255 (inclusive)
    /// </summary>
    // Based on http://stackoverflow.com/a/4748383/3813189
    public static DataRectangle<double> GetGreyscale(this Bitmap image)
    {
        var values = new double[image.Width, image.Height];
        var data = image.LockBits(
            new Rectangle(0, 0, image.Width, image.Height),
            ImageLockMode.ReadOnly,
            PixelFormat.Format24bppRgb
        );
        try
        {
            var pixelData = new Byte[data.Stride];
            for (var lineIndex = 0; lineIndex < data.Height; lineIndex++)
            {
                Marshal.Copy(
                    source: data.Scan0 + (lineIndex * data.Stride),
                    destination: pixelData,
                    startIndex: 0,
                    length: data.Stride
                );
                for (var pixelOffset = 0; pixelOffset < data.Width; pixelOffset++)
                {
                    // Note: PixelFormat.Format24bppRgb means the data is stored in memory as BGR
                    const int PixelWidth = 3;
                    var r = pixelData[pixelOffset * PixelWidth + 2];
                    var g = pixelData[pixelOffset * PixelWidth + 1];
                    var b = pixelData[pixelOffset * PixelWidth];
                    values[pixelOffset, lineIndex] = (0.2989 * r) + (0.5870 * g) + (0.1140 * b);
                }
            }
        }
        finally
        {
            image.UnlockBits(data);
        }
        return DataRectangle.For(values);
    }
}

With these classes, we can load an image and calculate the combined horizontal-gradient-minus-vertical-gradient value like this:

private static IEnumerable<Rectangle> GetPossibleBarcodeAreasForBitmap(Bitmap image)
{
    var greyScaleImageData = image.GetGreyscale();
    var combinedGradients = greyScaleImageData.Transform((intensity, pos) =>
    {
        // Consider gradients to be zero at the edges of the image because there aren't pixels
        // both left/right or above/below and so it's not possible to calculate a real value
        var horizontalChange = (pos.X == 0) || (pos.X == greyScaleImageData.Width - 1)
            ? 0
            : greyScaleImageData[pos.X + 1, pos.Y] - greyScaleImageData[pos.X - 1, pos.Y];
        var verticalChange = (pos.Y == 0) || (pos.Y == greyScaleImageData.Height - 1)
            ? 0
            : greyScaleImageData[pos.X, pos.Y + 1] - greyScaleImageData[pos.X, pos.Y - 1];
        return Math.Max(0, Math.Abs(horizontalChange) - Math.Abs(verticalChange));
    });

    // .. more will go here soon
}

Before jumping straight into the image analysis, though, it's worth resizing the source image if it's large. Since this stage of the processing is looking for areas that look approximately like barcodes, we don't require a lot of granularity - I'm envisaging (as with the MyFitnessPal use case) source images where the barcode takes up a significant amount of space in the image and is roughly aligned with the viewport* and so resizing the image such that the largest side is 300px should work well. If you wanted to scan an image where there were many barcodes to process (or even where there was only one but it was very small) then you might want to allow larger inputs than this - the more data that there is, though, the more work that must be done and the slower that the processing will be!

* (The barcode has to be roughly aligned with the viewport because the approach of looking for areas with large horizontal variance in intensity but minor vertical variance would not work otherwise - as we'll see later, though, there is considerable margin for error in this approach and perfect alignment is not required)

A naive approach to this would be to force the image so that its largest side is 300px, regardless of what it was originally. However, this is unnecessary if the largest side is already less than 300px (scaling it up will actually give us more work to do) and if the largest side is not much more than 300px then it's probably not worth doing either - scaling it down may make any barcode areas fuzzy and risk reducing the effectiveness of the processing while not actually reducing the required work. So I'm going to say that if the largest side of the image is 450px or larger then resize it so that its largest side is 300px and do nothing otherwise. To achieve that, we need a method like this:

private static DataRectangle<double> GetGreyscaleData(
    Bitmap image,
    int resizeIfLargestSideGreaterThan,
    int resizeTo)
{
    var largestSide = Math.Max(image.Width, image.Height);
    if (largestSide <= resizeIfLargestSideGreaterThan)
        return image.GetGreyscale();

    int width, height;
    if (image.Width > image.Height)
    {
        width = resizeTo;
        height = (int)(((double)image.Height / image.Width) * width);
    }
    else
    {
        height = resizeTo;
        width = (int)(((double)image.Width / image.Height) * height);
    }
    using var resizedImage = new Bitmap(image, width, height);
    return resizedImage.GetGreyscale();
}

The next steps are to "normalise" the combined intensity variance values so that they fit the range zero-to-one, to "blur" this data and to then create a binary mask where the brighter pixels get set to one and the darker pixels get set to zero. In other words, to extend the code earlier (that calculated the intensity variance values) like this:

private static IEnumerable<Rectangle> GetPossibleBarcodeAreasForBitmap(Bitmap image)
{
    var greyScaleImageData = GetGreyscaleData(
        image,
        resizeIfLargestSideGreaterThan: 450,
        resizeTo: 300
    );
    var combinedGradients = greyScaleImageData.Transform((intensity, pos) =>
    {
        // Consider gradients to be zero at the edges of the image because there aren't pixels
        // both left/right or above/below and so it's not possible to calculate a real value
        var horizontalChange = (pos.X == 0) || (pos.X == greyScaleImageData.Width - 1)
            ? 0
            : greyScaleImageData[pos.X + 1, pos.Y] - greyScaleImageData[pos.X - 1, pos.Y];
        var verticalChange = (pos.Y == 0) || (pos.Y == greyScaleImageData.Height - 1)
            ? 0
            : greyScaleImageData[pos.X, pos.Y + 1] - greyScaleImageData[pos.X, pos.Y - 1];
        return Math.Max(0, Math.Abs(horizontalChange) - Math.Abs(verticalChange));
    });

    const int maxRadiusForGradientBlurring = 2;
    const double thresholdForMaskingGradients = 1d / 3;

    var mask = Blur(Normalise(combinedGradients), maxRadiusForGradientBlurring)
        .Transform(value => (value >= thresholdForMaskingGradients));

    // .. more will go here soon
}

To do that, we need a "Normalise" method - which is simple:

private static DataRectangle<double> Normalise(DataRectangle<double> values)
{
    var max = values.Enumerate().Max(pointAndValue => pointAndValue.Item2);
    return (max == 0)
        ? values
        : values.Transform(value => (value / max));
}

.. and a "Blur" method - which is a little less simple but hopefully still easy enough to follow (for every point, look at the points around it and take an average of all of them; it just looks for a square area, which is fine for small "maxRadius" values but which might be better implemented as a circular area if large "maxRadius" values might be needed, which they aren't in this code):

private static DataRectangle<double> Blur(DataRectangle<double> values, int maxRadius)
{
    return values.Transform((value, point) =>
    {
        var valuesInArea = new List<double>();
        for (var x = -maxRadius; x <= maxRadius; x++)
        {
            for (var y = -maxRadius; y <= maxRadius; y++)
            {
                var newPoint = new Point(point.X + x, point.Y + y);
                if ((newPoint.X < 0) || (newPoint.Y < 0)
                || (newPoint.X >= values.Width) || (newPoint.Y >= values.Height))
                    continue;
                valuesInArea.Add(values[newPoint.X, newPoint.Y]);
            }
        }
        return valuesInArea.Average();
    });
}

This gets us to this point:

Mask of possibly-part-of-a-barcode areas

.. which feels like good progress!

Now we need to try to identify distinct "islands" of pixels where each "island" or "object" is a set of points that are within a single connected area. A straightforward way to do that is to look at every point in the mask that is set to 1 and either:

  1. Perform a pixel-style "flood fill" starting at this point in order to find other points in an object
  2. If this pixel has already been included in such a fill operation, do nothing (because it's already been accounted for)

This was made easier for me by reading the article Flood Fill algorithm (using C#.Net)..

private static IEnumerable<IEnumerable<Point>> GetDistinctObjects(DataRectangle<bool> mask)
{
    // Flood fill areas in the looks-like-bar-code mask to create distinct areas
    var allPoints = new HashSet<Point>(
        mask.Enumerate(optionalFilter: (point, isMasked) => isMasked).Select(point => point.Item1)
    );
    while (allPoints.Any())
    {
        var currentPoint = allPoints.First();
        var pointsInObject = GetPointsInObject(currentPoint).ToArray();
        foreach (var point in pointsInObject)
            allPoints.Remove(point);
        yield return pointsInObject;
    }

    // Inspired by code at
    // https://simpledevcode.wordpress.com/2015/12/29/flood-fill-algorithm-using-c-net/
    IEnumerable<Point> GetPointsInObject(Point startAt)
    {
        var pixels = new Stack<Point>();
        pixels.Push(startAt);

        var valueAtOriginPoint = mask[startAt.X, startAt.Y];
        var filledPixels = new HashSet<Point>();
        while (pixels.Count > 0)
        {
            var currentPoint = pixels.Pop();
            if ((currentPoint.X < 0) || (currentPoint.X >= mask.Width)
            || (currentPoint.Y < 0) || (currentPoint.Y >= mask.Height))
                continue;

            if ((mask[currentPoint.X, currentPoint.Y] == valueAtOriginPoint)
            && !filledPixels.Contains(currentPoint))
            {
                filledPixels.Add(new Point(currentPoint.X, currentPoint.Y));
                pixels.Push(new Point(currentPoint.X - 1, currentPoint.Y));
                pixels.Push(new Point(currentPoint.X + 1, currentPoint.Y));
                pixels.Push(new Point(currentPoint.X, currentPoint.Y - 1));
                pixels.Push(new Point(currentPoint.X, currentPoint.Y + 1));
            }
        }
        return filledPixels;
    }
}

The problem is that, even with the blurring we performed, there will likely be some groups of distinct objects that are actually part of a single barcode. These areas need to be joined together. It's quite possible for there to be relatively large gaps in the middle of barcodes (there is in the example that we've been looking at) and so we might not easily be able to just take the distinct objects that we've got and join together areas that seem "close enough".

Areas that are possibly part of a barcode

On the basis that individual bars in a barcode are tall compared to the largest possible width that any of them can be (which I'll go into more detail about later on), it seems like a reasonable idea to take any areas that are taller than they are wide and expand their width until they become square. That would give us this:

Mask of possibly-part-of-a-barcode areas, extended into squared areas

We'd then work out which of these "squared off" rectangles overlap (if any) and replace overlapping rectangles with rectangles that cover their combined areas, which would look like this:

Overlapping squared-off areas that have been combined

The only problem with this is that the combined rectangles extend too far to the left and right of the areas, so we need to trim them down. This will be fairly straightforward because we have the information about what distinct objects there are and each object is just a list of points - so we work out which objects have points within each of the combined bounding areas and then, for each combined area, find the smallest and largest "x" and "y" values across all of those objects' points. That way, we can change the combined bounding areas to only cover actual barcode pixels. Which would leave us with this:

Combined possibly-a-barcode areas

That might sound like a lot of complicated work but if we take a bit of a brute force* approach to it then it can be expressed like this:

private static IEnumerable<Rectangle> GetOverlappingObjectBounds(
    IEnumerable<IEnumerable<Point>> objects)
{
    // Translate each "object" (a list of connected points) into a bounding box (squared off if
    // it was taller than it was wide)
    var squaredOffBoundedObjects = new HashSet<Rectangle>(
        objects.Select((points, index) =>
        {
            var bounds = Rectangle.FromLTRB(
                points.Min(p => p.X),
                points.Min(p => p.Y),
                points.Max(p => p.X) + 1,
                points.Max(p => p.Y) + 1
            );
            if (bounds.Height > bounds.Width)
                bounds.Inflate((bounds.Height - bounds.Width) / 2, 0);
            return bounds;
        })
    );

    // Loop over the boundedObjects and reduce the collection by merging any two rectangles
    // that overlap and then starting again until there are no more bounds merges to perform
    while (true)
    {
        var combinedOverlappingAreas = false;
        foreach (var bounds in squaredOffBoundedObjects)
        {
            foreach (var otherBounds in squaredOffBoundedObjects)
            {
                if (otherBounds == bounds)
                    continue;

                if (bounds.IntersectsWith(otherBounds))
                {
                    squaredOffBoundedObjects.Remove(bounds);
                    squaredOffBoundedObjects.Remove(otherBounds);
                    squaredOffBoundedObjects.Add(Rectangle.FromLTRB(
                        Math.Min(bounds.Left, otherBounds.Left),
                        Math.Min(bounds.Top, otherBounds.Top),
                        Math.Max(bounds.Right, otherBounds.Right),
                        Math.Max(bounds.Bottom, otherBounds.Bottom)
                    ));
                    combinedOverlappingAreas = true;
                    break;
                }
            }
            if (combinedOverlappingAreas)
                break;
        }
        if (!combinedOverlappingAreas)
            break;
    }

    return squaredOffBoundedObjects.Select(bounds =>
    {
        var allPointsWithinBounds = objects
            .Where(points => points.Any(point => bounds.Contains(point)))
            .SelectMany(points => points)
            .ToArray(); // Don't re-evaluate in the four accesses below
        return Rectangle.FromLTRB(
            left: allPointsWithinBounds.Min(p => p.X),
            right: allPointsWithinBounds.Max(p => p.X) + 1,
            top: allPointsWithinBounds.Min(p => p.Y),
            bottom: allPointsWithinBounds.Max(p => p.Y) + 1
        );
    });
}

* (There are definitely more efficient ways that this could be done but since we're only looking at 300px images then we're not likely to end up with huge amounts of data to deal with)

To complete the process, we need to do three more things:

  1. Since barcodes are wider than they are tall, we can discard any regions that don't fit this shape (of which there are two in the example image)
  2. The remaining regions are expanded a little horizontally so that they more clearly surround the barcode region, rather than being butted right up to it (this will make the barcode reading process a little easier)
  3. As the regions that have been identified may well be on a resized version of the source image, they may need to be scaled up so that they correctly apply to the source

To do that, we'll start from this code that we saw earlier:

var mask = Blur(Normalise(combinedGradients), maxRadiusForGradientBlurring)
    .Transform(value => (value >= thresholdForMaskingGradients));

.. and expand it like so (removing the "// .. more will go here soon" comment), using the methods above:

// Determine how much the image was scaled down (if it had to be scaled down at all)
// by comparing the width of the potentially-scaled-down data to the source image
var reducedImageSideBy = (double)image.Width / greyScaleImageData.Width;

var mask = Blur(Normalise(combinedGradients), maxRadiusForGradientBlurring)
    .Transform(value => (value >= thresholdForMaskingGradients));

return GetOverlappingObjectBounds(GetDistinctObjects(mask))
    .Where(boundedObject => boundedObject.Width > boundedObject.Height)
    .Select(boundedObject =>
    {
        var expandedBounds = boundedObject;
        expandedBounds.Inflate(width: expandedBounds.Width / 10, height: 0);
        expandedBounds.Intersect(
            Rectangle.FromLTRB(0, 0, greyScaleImageData.Width, greyScaleImageData.Height)
        );
        return new Rectangle(
            x: (int)(expandedBounds.X * reducedImageSideBy),
            y: (int)(expandedBounds.Y * reducedImageSideBy),
            width: (int)(expandedBounds.Width * reducedImageSideBy),
            height: (int)(expandedBounds.Height * reducedImageSideBy)
        );
    });

The final result is that the barcode has been successfully located on the image - hurrah!

Barcode located!

With this information, we should be able to extract regions or "sub images" from the source image and attempt to decipher the barcode value in it (presuming that there IS a bar code in it and we haven't got a false positive match).

As we'll see in a moment, the barcode doesn't have to be perfectly lined up - some rotation is acceptable (depending upon the image, up to around 20 or 30 degrees should be fine). The MyFitnessPal app has a couple of fallbacks that I've noticed, such as being able to read barcodes that are upside down or even back to front (which can happen if a barcode is scanned from the wrong side of a transparent wrapper). While I won't be writing code here for either of those approaches, I'm sure that you could envisage how it could be done - the source image data could be processed as described here and then, if no barcode is read, rotated 180 degrees and re-processed and reversed and re-processed, etc..
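(To make that a little more concrete, here's a rough sketch of how such a fallback could be bolted on - this is not code from the GitHub repo, it just re-attempts the read in each orientation using Bitmap.RotateFlip)

private static string? TryToReadBarcodeValueAllowingForFlippedImages(Bitmap subImage)
{
    var orientationsToTry = new[]
    {
        RotateFlipType.RotateNoneFlipNone, // as-is
        RotateFlipType.Rotate180FlipNone,  // upside down
        RotateFlipType.RotateNoneFlipX,    // mirrored (eg. scanned through a transparent wrapper)
        RotateFlipType.Rotate180FlipX      // mirrored AND upside down
    };
    foreach (var orientation in orientationsToTry)
    {
        // RotateFlip mutates the bitmap in place, so work on a copy each time
        using var attempt = new Bitmap(subImage);
        attempt.RotateFlip(orientation);
        var value = TryToReadBarcodeValue(attempt);
        if (value is object)
            return value;
    }
    return null;
}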

How to read a bar code

A barcode is made up of both black and white bars - so it's not just the black parts that are significant, it is the spaces between them as well.

The format of a barcode is as follows:

  1. Three single-width bars (a black one, a white one and another black one) that are used to gauge what is considered to be a "single width"
  2. Information for six numbers then appears, where each number is encoded by a sequence of four bars (white, black, white, black) - particular combinations of bar widths relate to particular digits (see below)
  3. Another guard section appears with five single width bars (white, black, white, black, white)
  4. Six more numbers appear (using the same bar-width-combinations encoding as before but the groups of four bars are now black, white, black, white)
  5. A final guard section of three single-width bars (black, white, black)

The numbers are encoded using the following system:

 Digit      Bar widths

   0        3, 2, 1, 1
   1        2, 2, 2, 1
   2        2, 1, 2, 2
   3        1, 4, 1, 1
   4        1, 1, 3, 2
   5        1, 2, 3, 1
   6        1, 1, 1, 4
   7        1, 3, 1, 2
   8        1, 2, 1, 3
   9        3, 1, 1, 2

(Note that every combination of values totals 7 when they are added up - this is very helpful later!)

To see what that looks like in the real world, here's a slice of that barcode from the jar of peanut butter with each section and each numeric value identified:

Barcode numbers interpreted

(I should point out that the article How does a barcode work? was extremely helpful in the research I did for this post and I'm very grateful to the author for having written it in such an approachable manner!)

Any combination of bar widths that is not found in the table is considered to be invalid. On the one hand, you might think that this is a potential loss; the format could support more combinations of bar widths to encode more values and then more data could be packed into the same space. There is an advantage, however, to having relatively few valid combinations of bar widths - it makes it easier to tell whether the information being read appears to be correct. If a combination is encountered that seems incorrect then the read attempt should be aborted and retried. The format has existed for decades and it would make sense, bearing that in mind, to prioritise making it easier for the hardware to read rather than prioritising trying to cram as much data in there as possible. There is also a checksum included in the numerical data to try to catch any "misreads" but when working with low resolutions or hardware with little computing power, the easier that it is to bail out of a scan and to retry the better.

The way to tackle the reading is to:

  1. Convert the sub image to greyscale
  2. Create a binary mask so that the darker pixels become 0 and the lighter ones become 1
  3. Take a single line across the area
  4. Change the individual 1s and 0s into lengths of continuous "runs" of values
    • eg. 0001100 would become 3, 2, 2 because there are three 0s then two 1s and then two 0s
  5. These runs of values will represent the different sized (black and white) bars that were encountered
    • For a larger image, each run length will be longer than for a small image but that won't matter because when we encounter runs of four bar length values that we think should be interpreted as a single digit, we'll do some dividing to try to guess the average size of a single width bar
  6. Take these runs of values, skip through the expected guard regions and try to interpret each set of four bars that is thought to represent a digit of the bar code as that digit
  7. If successful then perform a checksum calculation on the output and return the value as a success if it meets expectations
  8. If the line couldn't be interpreted as a barcode or the checksum calculation fails then take the next line down and go back to step 4
  9. If there are no more lines to attempt then a barcode could not be identified in the image

This processing is fairly light computationally and so there is no need to resize the "may be a barcode" image region before attempting the work. In fact, it's beneficial to not shrink it as shrinking it will likely make the barcode section fuzzier and that makes the above steps less likely to work - the ideal case for creating a binary mask is where there is no significant "seepage" of pixel intensity between the black bar areas and the white bar areas. That's not to say that the images have to be crystal clear or perfectly aligned with the camera because the redundancy built into the format works in our favour here - if one line across the image can't be read because it's fuzzy then there's a good chance that one of the other lines will be legible.

60 length values is the precise number that we expect to find - there is expected to be some blank space before the barcode starts (1), then a guard section of three single-width lines that we use to gauge bar width (3), then six numbers that are encoded in four bars each (6x4=24), then a guard section of five single-width lines (5), then six more numbers (6x4=24) and then a final guard region of three single-width bars (3), giving 1+3+24+5+24+3=60.

There will likely be another section of blank content after the barcode, which we ignore.

If we don't want to validate the final guard region then we can work with a barcode image where some of the end is cut off, so long as the data for the 12 digits is there - in that case, 57 lengths is the minimum number that we can accept.

Reading the numeric value with code

I'm going to try to present the code in approximately the same order as the steps presented above. So, firstly we need to convert the sub image to greyscale and create a binary mask from it. Then we'll go line by line down the image data and try to read a value. So we'll take this:

public static string? TryToReadBarcodeValue(Bitmap subImage)
{
    const double threshold = 0.5;

    // Black lines are considered 1 and so we set to true if it's a dark pixel (and 0 if light)
    var mask = subImage.GetGreyscale().Transform(intensity => intensity < (256 * threshold));
    for (var y = 0; y < mask.Height; y++)
    {
        var value = TryToReadBarcodeValueFromSingleLine(mask, y);
        if (value is object)
            return value;
    }
    return null;
}

.. and the read-each-slice-of-the-image code looks like this:

private static string? TryToReadBarcodeValueFromSingleLine(
    DataRectangle<bool> barcodeDetails,
    int sliceY)
{
    if ((sliceY < 0) || (sliceY >= barcodeDetails.Height))
        throw new ArgumentOutOfRangeException(nameof(sliceY));

    var lengths = GetBarLengthsFromBarcodeSlice(barcodeDetails, sliceY).ToArray();
    if (lengths.Length < 57)
    {
        // As explained, we'd like 60 bars (which would include the final guard region) but we
        // can still make an attempt with 57 (but no fewer)
        // - There will often be another section of blank content after the barcode that we ignore
        // - If we don't want to validate the final guard region then we can work with a barcode
        //   image where some of the end is cut off, so long as the data for the 12 digits is
        //   there (this will be the case where there are only 57 lengths)
        return null;
    }

    var offset = 0;
    var extractedNumericValues = new List<int>();
    for (var i = 0; i < 14; i++)
    {
        if (i == 0)
        {
            // This should be the first guard region and it should be a pattern of three single-
            // width bars
            offset += 3;
        }
        else if (i == 7)
        {
            // This should be the guard region in the middle of the barcode and it should be a
            // pattern of five single-width bars
            offset += 5;
        }
        else
        {
            var value = TryToGetValueForLengths(
                lengths[offset],
                lengths[offset + 1],
                lengths[offset + 2],
                lengths[offset + 3]
            );
            if (value is null)
                return null;
            extractedNumericValues.Add(value.Value);
            offset += 4;
        }
    }

    // Calculate what the checksum should be based upon the first 11 numbers and ensure that
    // the 12th matches it
    if (extractedNumericValues.Last() != CalculateChecksum(extractedNumericValues.Take(11)))
        return null;

    return string.Join("", extractedNumericValues);
}

With the code below, we find the runs of continuous 0 or 1 lengths that will represent bars and return that list (again, for larger images each run will be longer and for smaller images each run will be shorter but this will be taken care of later) -

private static IEnumerable<int> GetBarLengthsFromBarcodeSlice(
    DataRectangle<bool> barcodeDetails,
    int sliceY)
{
    if ((sliceY < 0) || (sliceY >= barcodeDetails.Height))
        throw new ArgumentOutOfRangeException(nameof(sliceY));

    // Take the horizontal slice of the data
    var values = new List<bool>();
    for (var x = 0; x < barcodeDetails.Width; x++)
        values.Add(barcodeDetails[x, sliceY]);

    // Split the slice into bars - we only care about how long each segment is when they
    // alternate, not whether they're dark bars or light bars
    var segments = new List<Tuple<bool, int>>();
    foreach (var value in values)
    {
        if ((segments.Count == 0) || (segments[^1].Item1 != value))
            segments.Add(Tuple.Create(value, 1));
        else
            segments[^1] = Tuple.Create(value, segments[^1].Item2 + 1);
    }
    if ((segments.Count > 0) && !segments[0].Item1)
    {
        // Remove the white space before the first bar
        segments.RemoveAt(0);
    }
    return segments.Select(segment => segment.Item2);
}

Now we need to implement the "TryToGetValueForLengths" method that "TryToReadBarcodeValueFromSingleLine" calls. This takes four bar lengths that are thought to represent a single digit in the bar code value (ie. they are not part of a guard region or anything like that) and guesses how many pixels across a single bar would be - which is made much simpler by the fact that all of the possible combinations of bar lengths in the lookup chart that we saw earlier add up to 7.

There's a little flexibility introduced here to try to account for a low quality image or for the threshold being a bit strong in the creation of the binary mask; we'll take that calculated expected width of a single bar and tweak it up or down a little if applying that division to the bar lengths makes some of the bars so small that they disappear or so large that the total width would be more than seven single estimated-width bars. There's only a little flexibility here because if we fail then we can always try another line of the image! (Or maybe it will turn out that this sub image was a false positive match and there isn't a bar code in it at all).

private static int? TryToGetValueForLengths(int l0, int l1, int l2, int l3)
{
    if (l0 <= 0)
        throw new ArgumentOutOfRangeException(nameof(l0));
    if (l1 <= 0)
        throw new ArgumentOutOfRangeException(nameof(l1));
    if (l2 <= 0)
        throw new ArgumentOutOfRangeException(nameof(l2));
    if (l3 <= 0)
        throw new ArgumentOutOfRangeException(nameof(l3));

    // Take a guess at what the width of a single bar is based upon these four values
    // (the four bars that encode a number should add up to a width of seven)
    var raw = new[] { l0, l1, l2, l3 };
    var singleWidth = raw.Sum() / 7d;
    var adjustment = singleWidth / 10;
    var attemptedSingleWidths = new HashSet<double>();
    while (true)
    {
        var normalised = raw.Select(x => Math.Max(1, (int)Math.Round(x / singleWidth))).ToArray();
        var sum = normalised.Sum();
        if (sum == 7)
            return TryToGetNumericValue(normalised[0], normalised[1], normalised[2], normalised[3]);

        attemptedSingleWidths.Add(singleWidth);
        if (sum > 7)
            singleWidth += adjustment;
        else
            singleWidth -= adjustment;
        if (attemptedSingleWidths.Contains(singleWidth))
        {
            // If we've already tried this width-of-a-single-bar value then give up -
            // it doesn't seem like we can make the input values make sense
            return null;
        }
    }

    static int? TryToGetNumericValue(int i0, int i1, int i2, int i3)
    {
        var lookFor = string.Join("", new[] { i0, i1, i2, i3 });
        var lookup = new[]
        {
            // These values correspond to the lookup chart shown earlier
            "3211", "2221", "2122", "1411", "1132", "1231", "1114", "1312", "1213", "3112"
        };
        for (var i = 0; i < lookup.Length; i++)
        {
            if (lookFor == lookup[i])
                return i;
        }
        return null;
    }
}

Finally we need the CalculateChecksum method (as noted in the code, there's a great explanation of how to do this on Wikipedia) -

private static int CalculateChecksum(IEnumerable<int> values)
{
    if (values == null)
        throw new ArgumentNullException(nameof(values));
    if (values.Count() != 11)
        throw new ArgumentException("Should be provided with precisely 11 values");

    // See https://en.wikipedia.org/wiki/Check_digit#UPC
    var checksumTotal = values
        .Select((value, index) => (index % 2 == 0) ? (value * 3) : value)
        .Sum();
    var checksumModulo = checksumTotal % 10;
    if (checksumModulo != 0)
        checksumModulo = 10 - checksumModulo;
    return checksumModulo;
}

With this code, we have executed all of the planned steps outlined before.

It should be noted that, even with the small amount of flexibility in the "TryToGetValueForLengths" method, in the peanut butter bar code example it requires 15 calls to "GetBarLengthsFromBarcodeSlice" until a bar code is successfully matched! Presumably, this is because there is a little more distortion further up the bar code due to the curve of the jar.

That's not to say, however, that this approach to barcode reading is particularly fussy. The redundancy and simplicity, not to mention the size of the average bar code, means that there is plenty of opportunity to try reading a sub image in multiple slices until one of them does match. In fact, I mentioned earlier that the barcode doesn't have to be perfectly at 90 degrees in order to be interpretable and that some rotation is acceptable. This hopefully makes some intuitive sense based upon the logic above and how it doesn't matter how long each individual barcode line is because the bar widths are averaged out - if a bar code was rotated a little and a read was attempted of it line by line then the ratios between the bar lengths within each line should remain consistent and the same data should be readable.

To illustrate, here's a zoomed-in section of the middle of the peanut butter bar code in the orientation shown so far:

A strip of the peanut butter jar's bar code

If we then rotate it like this:

The peanut butter jar rotated slightly

.. then the code above will still read the value correctly because a strip across the rotated bar code looks like this:

A strip of the peanut butter jar's bar code from the rotated image

Hopefully it's clear enough that, for each given line, the ratios are essentially the same as for the non-rotated strip:

A strip of the peanut butter jar's bar code

To get a reading from an image that is rotated more than this requires a very clear source image and will still be limited by the first stage of processing - that tried to find sections where the horizontal image intensity changed with steep gradients but the vertical intensity did not. If the image is rotated too much then there will be more vertical image intensity differences encountered and it is less likely to identify it as a "maybe a bar code" region.

(Note: I experimented with rotated images that were produced by an online barcode generator and had more success - meaning that I could rotate them more than I could with real photographs - but that's because those images are generated with stark black and white and the horizontal / vertical intensity gradients are maintained for longer when the image is rotated if they start with such a high level of clarity.. I'm more interested in reading values from real photographs and so I would suggest that only fairly moderate rotation will work - though it would still be plenty for a MyFitnessPal-type app that expects the User to hold the bar code in roughly the right orientation!)

Tying it all together

We've looked at the separate steps involved in the whole reading process; all that is left is to combine them. The "GetPossibleBarcodeAreasForBitmap" and "TryToReadBarcodeValue" methods can be put together into a fully functioning program like this:

static void Main()
{
    using var image = new Bitmap("Source.jpg");

    var barcodeValues = new List<string>();
    foreach (var area in GetPossibleBarcodeAreasForBitmap(image))
    {
        using var areaBitmap = new Bitmap(area.Width, area.Height);
        using (var g = Graphics.FromImage(areaBitmap))
        {
            g.DrawImage(
                image,
                destRect: new Rectangle(0, 0, areaBitmap.Width, areaBitmap.Height),
                srcRect: area,
                srcUnit: GraphicsUnit.Pixel
            );
        }
        var valueFromBarcode = TryToReadBarcodeValue(areaBitmap);
        if (valueFromBarcode is object)
            barcodeValues.Add(valueFromBarcode);
    }

    if (!barcodeValues.Any())
        Console.WriteLine("Couldn't read any bar codes from the source image :(");
    else
    {
        Console.WriteLine("Read the following bar code(s) from the image:");
        foreach (var barcodeValue in barcodeValues)
            Console.WriteLine("- " + barcodeValue);
    }

    Console.WriteLine();
    Console.WriteLine("Press [Enter] to terminate..");
    Console.ReadLine();
}

Finito!

And with that, we're finally done! I must admit that I started writing this post about three years ago and it's been in my TODO list for a loooooong time now. But I've taken a week off work and been able to catch up with a few things and have finally been able to cross it off the list. And I'm quite relieved that I didn't give up on it entirely because it was a fun little project and coming back to it now allowed me to tidy it up a bit with the newer C# 8 syntax and even enable the nullable reference types option on the project (I sure do hate unintentional nulls being allowed to sneak in!)

A quick reminder if you want to see it in action or play about with it yourself, the GitHub repo is here.

Thanks to anyone that read this far!

Posted at 23:24

Comments

Face or no face (finding faces in photos using C# and Accord.NET)

I've always been intrigued as to how facial detection and recognition is handled by computers. It's one of the few effects that is seen in Hollywood films that seems convincing - where someone is tracked through a city, live CCTV feeds finding their face as they move from one street to the next. For a more mundane example, facial detection has been built into digital cameras and smartphones for quite some time now (my phone puts green squares around faces in photos and, I presume, uses this information to decide where to focus).

I knew literally nothing about the subject to begin with and so I hit the internet and tried to find out everything I could about it. Turns out that there's a lot that's been written about this, with various different methods and approaches discussed and documented. It's an interesting subject because it's been so affected by the rate of change in technology - for example, the digital cameras that introduced it had access to much less processing power than modern day devices do. I've had a hard time finding definitive information about when digital cameras started offering this facility but I'm getting a rough impression that it's over the last ten years (citation needed!) - when you think about how phones, for example, have changed in ten years in terms of their power.. well, it's a lot!

Having said that, digital cameras (and web cams) can take some shortcuts and only concern themselves with subjects that are facing directly towards the camera, since these are more likely to be subjects that are of interest (it's not as important to worry about people looking at ninety degrees from the camera lens, for example). So, straight away, it's apparent that there are going to be different techniques and different requirements for different situations and some will be more expensive than others.

I suspect that this is going to be an ongoing investigation for me; so far I've only managed to implement facial detection functionality and not facial recognition (once a face has been detected, is it someone from a database of known faces?). I thought I'd present what I have at this point, though - if it's interesting to me then maybe it will be of interest to other people! And I couldn't find much material out there that goes into enough detail that it's reproducible without being a really dense research paper (and certainly not ones that make it easy to reproduce their findings using C#), so I thought that I'd try to fill that gap. There are articles out there about using libraries to do all the hard work for you - but where's the fun in that?! I want to know how the code I'm calling is working.

tl;dr

My current code (github.com/ProductiveRage/FacialRecognition) identifies faces in photos by using two passes. First, it looks for regions of skin that have holes (that we presume may be eyes and a nose) and that are very roughly the right proportions to be a face.

The image content from these regions is passed through a "linear support vector machine", which will have been trained to classify a potential image region as a face or as not-a-face.

Before embarking on this adventure, I had no idea what a "support vector machine" even was. If you don't either then maybe you'll come away from this feeling like you know a little something about machine learning. I know that I did! And "machine learning" is a phrase that I've been interested in finding out more about for quite some time now, it seems like it's cropping up everywhere! I have a strong feeling that trying to perform facial recognition is going to involve even more magical sounding terms - such as "deep convolutional neural network" - but that's going to have to wait for another day, it's one (baby) step at a time for me.

The "Naked People Skin Filter"

In a past life, I worked in IT Support and, one day, I was asked by the Head Clinician why I'd been printing out porn. Which confused me considerably since I had most certainly not been printing out explicit material. There had been complaints about the colour tone and contrast of a new colour laser printer and so I'd sent a full-page test image to it that included greyscale content, colour grids and images that contained a lot of skin. Apparently printing out pictures with lots of skin in it raises eyebrows and maybe it's for similar reasons that a research paper entitled "Naked People Skin Filter" amusingly sounds a touch pervy (or maybe that's just a British thing.. we do love to keep most of our naked-skin-unveiling behind closed doors - because it's so cold!).

This "Naked People Skin Filter" paper was published by Margaret M. Fleck and David A Forsyth and is available at http://mfleck.cs.illinois.edu/naked-skin.html. It essentially describes an algorithm to detect regions of skin in an image. In 1997, another paper "Face Detection in Color Images" described starting with that skin-tone-identifying approach and then using it to try to find solid regions of skin that have a few holes in, on the basis that this is probably a face (where the holes are eyes, nose, mouth).

Tiger Woods

Note: That "Face Detection in Color Images" link use the Wayback Machine to make available a GeoCities page. Maybe it's a sad reminder that there was actually worthwhile content on GeoCities as well as the many harshly-coloured-and-bizarrely-animated eye-watering monstrosities. By the way, if you (quite rightly, imho) think that the web has lost something by having less places for people to upload content for free then check out NeoCities - it's awesome because it makes that possible again (and all the good and bad that comes with it)!

"Face Detection in Color Images" uses an example image of Tiger Woods and so I will use that same image while I discuss what I've done, based upon these articles.

There are a set of steps that are followed when trying to identify faces in an image:

  1. Scale the image down (depending upon its size)
  2. Avoid desaturation by bringing colour values down so that the minimum is true zero
  3. Transform RGB colours into the RG/BY colour space, alongside an Intensity spectrum
  4. Calculate "texture amplitude" across the image (how quickly intensity changes - the theory is that areas of skin have a relatively low texture amplitude)
  5. Calculate hue and saturation across the image (in order to identify areas that seem to be skin-coloured)
  6. Combine the texture amplitude and hue data to identify areas of skin (the texture amplitude, hue and saturation must be within acceptable "this is probably skin" bounds)
  7. Expand these areas of skin using a more relaxed texture amplitude / hue / saturation "this is probably skin" filter
  8. Check these areas for holes - no holes (for non-skin facial features, such as eyes) means no face
  • Any skin areas that are very small are ignored
  • Any skin areas that have extreme aspect ratios are ignored (very, very wide areas or very, very tall areas are probably not faces)
  9. Expand the areas slightly because, even with the flexible-skin-filter expansions, the area that a human would identify as being a person's face is slightly larger than the areas identified by this algorithm

Calculating hue, saturation and texture amplitude (steps 1-7) to identify skin areas

The first few steps are fairly simple. If the image is either taller or wider than 400px then it's shrunk down so that the largest side is 400px. "Naked People Skin Filter" talks about applying "smoothing" (which I'll cover in a moment) relative to the size of the image but doesn't suggest any restrictions on the image size. "Face Detection in Color Images" mentions that only images up to 250px were used. With the test images I used, I found that capping at 400px provided a reasonable balance between identifying faces and not taking too long to process. I used the Tiger Woods image initially to try to ensure that my calculations were matching those described in "Face Detection in Color Images" but I used other test images (which will appear further down) to double-check that everything seemed to be working as expected. There are some differences between how Fleck & Forsyth ("Naked People Skin Filter") and Kapur ("Face Detection in Colour Images") perform the analysis and my test images led me to tweak a few parameters myself.

Fleck & Forsyth recommend pulling down the colour values so that the minimum value is zero because it "avoids potentially significant desaturation of opponent color values if the zero-response is far from zero". What this means is that the darkest parts of an image are probably not "true black" (ie. RGB 0, 0, 0) and the theory is that we should get better results if everything is adjusted down so that the darkest colour is black. In practice, I look for the lowest of any of the R, G or B values from any pixel in the image and subtract that value from every R, G and B value across all pixels in the image.

Next, I generate "RgByI" values for every pixel. This a trio of values; an RG (red-green) spectrum value, a BY (blue-yellow) spectrum value and an intensity. Since I'm going to spend a lot of time taking the pixels from an image and loading them into a 2D array and then performing a range of operations on them, I've written a DataRectangle<T> class that makes this easier. When I first read the data from an image, the pixels are loaded and used to populate a DataRectangle of RGB values. When I need to get RgByI values from them, I can do the following (using Kapur's formulae from "Face Detection in Colour Images) -

var rgByIValues = colourData.Transform(colour =>
{
    Func<byte, double> L = x => (105 * Math.Log10(x + 1));
    return new IRgBy(
        rg: L(colour.R) - L(colour.G),
        by: L(colour.B) - ((L(colour.G) + L(colour.R)) / 2),
        i: (L(colour.R) + L(colour.B) + L(colour.G)) / 3
    );
});

In order to reduce unhelpful detail in the image somewhat, the RgByI values have a "windowing median filter" applied to them. From "Face Detection in Colour Images" -

The Rg and By matrices are then filtered with a windowing median filter .. with sides of length 4*SCALE. The SCALE value is calculated as being the closest integer value to (height+width)/320

(The "Naked People Skin Filter" paper recommends using 2*SCALE for the RG and BY values, which I have gone with because I found the results to be just as accurate and it's less computational work than 4*SCALE)

To do this, you go through every pixel and take its value and the values from the surrounding pixels, sort them, take the middle value (the "median") and use that as the new value for the current pixel. The idea is that this reduces noise by discarding any outlier pixel colours in a given range. I take a square block around where the initial pixel is but it would probably be better to approximate a circular area if the median filter radius is large (in my filter, it's never more than three).
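(The real median filter implementation is in the GitHub repo but, as a rough sketch of the idea, an extension method over the DataRectangle type from earlier could look something like the following - note that treating the argument as a radius, rather than as the full side length of the window, is an assumption of mine)

public static class DataRectangleMedianFilterExtensions
{
    public static DataRectangle<double> MedianFilter<T>(
        this DataRectangle<T> values,
        Func<T, double> valueSelector,
        int radius)
    {
        return values.Transform((value, point) =>
        {
            // Collect the values from the square block of points around the current one,
            // ignoring any coordinates that fall outside of the data..
            var valuesInArea = new List<double>();
            for (var x = -radius; x <= radius; x++)
            {
                for (var y = -radius; y <= radius; y++)
                {
                    var otherPoint = new Point(point.X + x, point.Y + y);
                    if ((otherPoint.X < 0) || (otherPoint.Y < 0)
                    || (otherPoint.X >= values.Width) || (otherPoint.Y >= values.Height))
                        continue;
                    valuesInArea.Add(valueSelector(values[otherPoint.X, otherPoint.Y]));
                }
            }

            // .. then sort them and take the middle one as the new value for this point
            valuesInArea.Sort();
            return valuesInArea[valuesInArea.Count / 2];
        });
    }
}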

After smoothing the RG/BY data, their values are combined to generate hue and saturation -

var smoothedHues = smoothedRG.CombineWith(
    smoothedBY,
    (rg, by) => new
    {
        Hue = RadianToDegree(Math.Atan2(rg, by)),
        Saturation = Math.Sqrt((rg * rg) + (by * by))
    }
);

Hue
Saturation

The two images here are the hue and saturation values generated from the original image.

In order to generate greyscale images, I had to translate the hue and saturation values into the 0-255 range. The hue values will be in the range of -180 to +180 degrees so I just added 180 and then divided by 2. The saturation values are always positive and won't exceed 0-255 if multiplied by 2, so I just did that.
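(For completeness, here's a minimal sketch of how one of those illustrative greyscale bitmaps could be rendered - assuming a DataRectangle<double> of hue values, and accepting that SetPixel is slow but perfectly fine for producing one-off images like these)

private static Bitmap RenderHuesAsGreyscaleBitmap(DataRectangle<double> hues)
{
    var bitmap = new Bitmap(hues.Width, hues.Height);
    foreach (var pointAndHue in hues.Enumerate())
    {
        // Hue values are in the range -180 to +180, so shift and halve them to bring them
        // into the 0-255 range
        var intensity = (int)Math.Round((pointAndHue.Item2 + 180) / 2);
        bitmap.SetPixel(
            pointAndHue.Item1.X,
            pointAndHue.Item1.Y,
            Color.FromArgb(intensity, intensity, intensity)
        );
    }
    return bitmap;
}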

Generating texture amplitude is more complicated. We start with the intensity values from the RgByI data. We then run that through a median filter of 2 * SCALE and calculate the difference between every point in the median filter result and the original intensity. Finally, we run the result of the calculation through a median filter of 3 * SCALE. It may be clearer with some code -

var smoothedIntensity = rgByIValues.MedianFilter(
    value => value.I,
    2 * scale
);
var differenceBetweenOriginalIntensityAndSmoothedIntensity = rgByIValues.CombineWith(
    smoothedIntensity,
    (x, y) => Math.Abs(x.I - y)
);
var textureAmplitude = differenceBetweenOriginalIntensityAndSmoothedIntensity.MedianFilter(
    value => value,
    3 * scale
);

Texture Amplitude

(Note that the values 2 and 3 in the above code are much lower than Kapur suggests - 8*SCALE and 12*SCALE - but the median filtering algorithm that I threw together was very slow using the higher values and using lower values - which meant that the processing completed much more quickly - did not seem to affect the outcomes)

Hopefully it's apparent that around the lower face there is low texture amplitude. Texture amplitude is higher around the facial features but that's to be expected. There is low texture amplitude elsewhere in the image (particularly in the sky behind him, the texture is very smooth there and so the texture amplitude is very low) but what we're going to look for is areas that appear to be skin in hue/saturation and in texture amplitude.

Now that we have all of the information required to guess whether a given pixel is within the acceptable bounds of "probably skin", we can create a skin mask (a DataRectangle<bool>) -

var skinMask = smoothedHues
    .CombineWith(textureAmplitude, (hs, t) => new
    {
        Hue = hs.Hue,
        Saturation = hs.Saturation,
        TextureAmplitude = t
    })
    .Transform(hst =>
    {
        return (
            ((hst.Hue >= 105) && (hst.Hue <= 160) && (hst.Saturation >= 10) && (hst.Saturation <= 60)) ||
            ((hst.Hue >= 160) && (hst.Hue <= 180) && (hst.Saturation >= 30) && (hst.Saturation <= 30))
        )
        && (hst.TextureAmplitude <= 5);
    });

This is another point at which I have added my own tweaks to the processing. There are slightly different ranges of acceptable hue / saturation / texture amplitude suggested by Fleck & Forsyth than are suggested by Kapur and I found that I wanted to change them a little based upon the test images that I was using.

Skin mask

The final step in generating the skin mask is to try to identify any skin areas just outside the identified mask. I've used the approach suggested by Kapur, who recommends it because it "helps to enlarge the skin map regions to include skin/background border pixels, regions near hair or other features, or desaturated areas".

The idea is that we look at negative values in the DataRectangle<bool> skin mask and check whether any of the adjacent values are positive and whether the colour of the pixel that resulted in the false value passes a more relaxed skin filter. The relaxed skin filter test demands only that the hue is within the range 110-180 and that the saturation is within 0-180 (the texture amplitude is not considered). This expansion work is performed twice.
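A rough sketch of a single expansion pass (performed twice in practice) might look like this - plain 2D arrays and made-up method names, rather than the code from my project -

private static bool[,] ExpandSkinMask(bool[,] skinMask, double[,] hues, double[,] saturations)
{
    var width = skinMask.GetLength(0);
    var height = skinMask.GetLength(1);
    var expanded = (bool[,])skinMask.Clone();
    for (var x = 0; x < width; x++)
    {
        for (var y = 0; y < height; y++)
        {
            // Only interested in pixels that the strict filter rejected..
            if (skinMask[x, y])
                continue;

            // .. that are adjacent to at least one pixel that the strict filter accepted..
            var touchesSkin = false;
            for (var offsetX = -1; (offsetX <= 1) && !touchesSkin; offsetX++)
            {
                for (var offsetY = -1; (offsetY <= 1) && !touchesSkin; offsetY++)
                {
                    var neighbourX = x + offsetX;
                    var neighbourY = y + offsetY;
                    if ((neighbourX >= 0) && (neighbourX < width) && (neighbourY >= 0) && (neighbourY < height)
                        && skinMask[neighbourX, neighbourY])
                        touchesSkin = true;
                }
            }
            if (!touchesSkin)
                continue;

            // .. and that pass the relaxed hue/saturation test (texture amplitude is ignored)
            var hue = hues[x, y];
            var saturation = saturations[x, y];
            if ((hue >= 110) && (hue <= 180) && (saturation >= 0) && (saturation <= 180))
                expanded[x, y] = true;
        }
    }
    return expanded;
}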

Recognising skin objects (and checking for holes)

Detailed skin mask

In order to get back some of the details that are within the skin mask areas, the original image is changed to greyscale and then the skin mask is combined with it to produce a new skin mask that is slightly more restrictive; any negative content from the skin mask remains negative while positive content is only allowed if the greyscale intensity is within an acceptable range. Everything from this point (including this step) comes from the "Face Detection in Color Images" article, since "Naked People Skin Filter" ends when the skin regions are detected (it has no interest in faces, specifically).

skinMask = colourData.CombineWith(
    skinMask,
    (colour, mask) =>
    {
        if (!mask)
            return false;
        var intensity = colour.ToGreyScale();
        return (intensity >= 90) && (intensity <= 240);
    }
);

In order to turn this DataRectangle<bool> mask into groups of points (where each group represents a distinct area of skin), I used a variation of the "Stack based implementation" from this article: Flood Fill algorithm (using C#.Net). If you're looking at my code, it's the "TryToGetPointsInObject" method in the FaceDetector.cs class. I'm not stressing out about performance at this point, I just wanted to get things working and then consider measuring & improving in the future - so finding a nice simple flood fill algorithm was very helpful (I'm not saying that it's not a well-performing method, I'm just saying that at this point in time it's not critically important to me one way or the other).

Any skin objects that are very small (have fewer than 64 * SCALE points) are ignored. What is slightly more complicated is identifying any completely enclosed holes in a skin object. But it's not much more complicated - the basic approach is, for each skin object, take a negative point within the bounds of the skin object and use the flood fill algorithm again; if the fill reaches the edges of the bounds then the negative point was not part of an enclosed hole. Then move on to the next negative point within the bounds that hasn't already been included in a flood fill operation and continue doing this until a fully enclosed hole is found that's larger than a given size (1 * SCALE).
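A simplified sketch of that hole check (the real logic lives in FaceDetector.cs and reuses the flood fill helper mentioned above - the method and parameter names here are illustrative) -

// Does this skin object contain at least one fully enclosed hole of the minimum size?
// A hole only counts as enclosed if flood filling it never reaches the edge of the
// object's bounding box.
private static bool HasEnclosedHole(bool[,] mask, Rectangle bounds, int minimumHoleSize)
{
    var alreadyVisited = new HashSet<Point>();
    for (var x = bounds.Left; x < bounds.Right; x++)
    {
        for (var y = bounds.Top; y < bounds.Bottom; y++)
        {
            var start = new Point(x, y);
            if (mask[x, y] || alreadyVisited.Contains(start))
                continue;

            // Flood fill the negative region that this point belongs to
            var filled = new HashSet<Point>();
            var toVisit = new Stack<Point>();
            toVisit.Push(start);
            var reachedEdgeOfBounds = false;
            while (toVisit.Count > 0)
            {
                var p = toVisit.Pop();
                if (!filled.Add(p))
                    continue;
                if ((p.X == bounds.Left) || (p.X == bounds.Right - 1) || (p.Y == bounds.Top) || (p.Y == bounds.Bottom - 1))
                    reachedEdgeOfBounds = true;
                foreach (var next in new[] { new Point(p.X - 1, p.Y), new Point(p.X + 1, p.Y), new Point(p.X, p.Y - 1), new Point(p.X, p.Y + 1) })
                {
                    if ((next.X >= bounds.Left) && (next.X < bounds.Right) && (next.Y >= bounds.Top) && (next.Y < bounds.Bottom) && !mask[next.X, next.Y])
                        toVisit.Push(next);
                }
            }
            alreadyVisited.UnionWith(filled);
            if (!reachedEdgeOfBounds && (filled.Count > minimumHoleSize))
                return true;
        }
    }
    return false;
}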

Tiger Woods' face detected

With the Tiger Woods image, we could actually stop here. It successfully identifies only his face as a possible face region. However, with other test images I used, some more work was required. I found that I could quite easily eliminate a few false positives by ignoring any regions that were obviously the wrong aspect ratio (either very long and shallow sections or very tall and narrow sections). I also found that, depending upon lighting or what background a face was against, sometimes the detection process up to this point would result in a region that is too tight over the facial features - expanding the matched area a little meant that the filtering in the next stage would get better results.

The real problem, though, is false positives - the algorithm will often identify areas that are not faces.

Applying this to other images

I wanted to try applying this facial detection logic to some other images. I'd used the image from one of the articles so that I could try to produce intermediate images that looked similar to those in "Face Detection in Color Images" so that I could reassure myself that I was (at least approximately) doing things correctly as described. But now I wanted to try it on some other test images.

In my Pictures folder, some photos from an Easter weekend* night out I went to a couple of years ago jumped out at me. Initially, I only used them because they made me laugh but, on closer inspection, they're actually really useful for illustrating what the skin tone detection process (and the subsequent support vector machine classification) are good at and where their limitations come into play.

* (which goes some way to explaining the eggs and the costumes.. the photos were put up on Facebook by a professional photographer, I hope he doesn't mind me reproducing them here - I tried getting in touch to ask but he didn't get back to me)

Egg Man

Firstly, we have this fine gentleman. The skin tone pass has identified his face perfectly but it's also misidentified two regions that contain colours and textures that the skin filter allows through.

Egg Man Skin Mask

Looking at the generated skin mask (where each distinct skin object is filled with a different colour), it should be clear why this has happened.

Group Photo One

If we take another photo then another problem is apparent - with this group of people, there are lots of skin objects with enclosed holes in them that are being identified as faces but that are actually hands holding things.

Also, because we ignore regions that are very small, there is a limit to what faces the skin tone filter will identify. If you look closely at the group photo there is an undetected bearded face along the left edge (near the top) but I am happy for it to exclude him since I think that it is a reasonable compromise to prefer faces that are in the foreground. On the whole, it does a good job of detecting faces but it may also produce a lot of false positives.

So the next step is to try to find a way to filter these regions down further, to try to get rid of those pesky false positives. To do this, I went back to the internet. One of the pages that I found incredibly useful was the "HOG Person Detector Tutorial" by Chris McCormick. He introduces a popular and modern technique, touches on its history and then goes into an illustrated explanation of how to implement it. It's almost redundant me going over it again! .. but it's my blog and I'm talking about how I implemented it, so I'm going to start from the beginning anyway :)

At the very highest level, what we're going to do is this -

  1. Take a load of images that are known to be faces and a load of images that are known to not be faces - these sets of "positive" and "negative" images are our "training data"
  2. Extract a load of numbers from each image - each image must be processed in a manner that results in them each producing the same amount of numbers (this is called "feature extraction")
  3. The resulting data (which is a big list of entries, one from each training image, where each entry is a list of features from that image and a boolean value for whether the image was a face or not) is used to train a "Support Vector Machine" (which I'll explain in a second)
  4. Now that we have a trained SVM, we can use the same feature extraction process from step 2 on each of the sub-images generated by the skin tone detection process and the SVM should tell us whether each one is a face or not a face

So what is a Support Vector Machine?

This is a concept that I had never heard of before, so I know first-hand what it's like trying to search around on the internet for a good description. The problem is that a lot of articles seem to be aimed at people who know at least something about machine learning and they go immediately to showing lots of formulae and talking about techniques that I'd never heard of. One example is "Kernel Support Vector Machines for Classification and Regression in C#" by César Souza (who wrote much, if not all, of the Accord.NET library - which I will be using later); this is a really well-written article that I appreciate now but it was on my "come back to this when you know more" list for quite a while after I first found it!

So I'm going to take a real beginner approach and try to describe it in a manner that would have helped me a few weeks ago.

Firstly, an SVM tends to be a "binary classifier". This means that it will be trained with data that is all categorised as either yes or no. After it has been trained, you may only ask it to classify further data as being a yes-or-no result. (Actually, there is such a thing as a "Multi-Class Support Vector Machine" that can return a greater range than binary but it's not relevant here so I'm going to forget about it for now).

(Time for some graphs. Shout out to Create your own XKCD-style Graphs, which helped me come up with what's below!)

Manager Decision History

To conjure up an example that almost sounds feasible, imagine that every development task at a software shop requires that its strategic benefit to the company be quantified, along with the percentage of the work that a customer is just dying to pay for in order for it to be done (presuming that there is a customer who wants it and that it's not just a task for internal purposes). Since every task needs to have these pieces of information before your manager will decide whether or not the work will be scheduled, there exist historical records of estimated tasks which have two "features" (strategic value and percentage-that-will-be-immediately-paid-for-by-customers) and an outcome of either "yes, it was done" or "no, we didn't do it".

If something is of high strategic value to us and it happens to be a feature that a customer is chomping at the bit for (and so will contribute significantly towards) then it seems like a dead cert that we should do it. Unfortunately, this is not a very common case.. More often, a customer will want something that is important to them specifically. This may not be something that is necessarily very easy to resell to other customers (which would give it more strategic value) or something that will help us deal with internal issues, such as scalability or technical debt (developments that do help with those will also have higher strategic value).

It seems like there's a roughly linear correlation between a development's strategic value, the fraction that customer(s) are willing to immediately pay for it and whether the work got the go-ahead. It's like we could draw a line and everything above it tends to get the green light by our manager and everything below it gets rejected.

Manager Decision History

An SVM is (according to wikipedia), a:

"model with associated learning algorithms.. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible"

What this boils down to is that we're essentially trying to come up with a formula to split those two sets of points. Then, when we get a new feature-pair (strategic value vs amount-customer-will-pay) we can plug those two numbers into the formula and work out which side of the line we're on (where one side means I-predict-yes and the other side means I-predict-no).

The hard work in the machine learning algorithm is working out where that line should go. Ideally, you want all historical "yes" results on one side and all historical "no" results on the other side. In some cases, though, this is not possible and there will be some outliers (remember that task that you had to do that seemed to have no strategic value, that no customer was willing to pay for and yet someone at one of those customer companies had convinced one of your directors to do it as a favour...?). Whether there are any outliers or not, there are probably still many slight variations on what line could be drawn. It's common for an algorithm to try to arrange the line so that it is equally distant from the positive results as it is from the negative results but there are some variations on this theme. This is really what the magic is in this process - giving the computer the results and letting it find the best match.
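To make "which side of the line" concrete: once trained, a linear model boils down to a weight for each feature plus a bias, and a prediction is just the sign of a weighted sum. A toy sketch, with entirely made-up weight values -

// Hypothetical weights learnt from the decision history (two features: strategic
// value and amount-customer-will-pay-now) plus a bias term
var weights = new[] { 1.0, 1.0 };
var bias = -1.0;

// A new task to classify
var features = new[] { 0.7, 0.5 };

// Which side of the line is it on? Positive means "predict yes", negative means "predict no"
var score = (weights[0] * features[0]) + (weights[1] * features[1]) + bias;
var prediction = score > 0;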

Higher dimension planes and non-linear methods

In the above example, each data point only had two features which made it very easy to envisage on a 2D graph how they related to each other (which isn't surprising since it's an example I made up to illustrate that very point!). Sometimes, though, more data is required. It could be that you plotted strategic-value against amount-customer-will-pay and could see no correlation, though one was there - but hidden due to the interaction of another feature. Imagine if you find historical data that shows that a customer wanted to pay for a new feature and the feature was of reasonable strategic value but it wasn't authorised. Or if there were a bunch of features with comparatively low value that did get the go-ahead. And if you realised that these were not outliers and that lots of results looked to be "all over the place". Well, it could be that the reason that the high value work wasn't done was because the development team were already stacked out with important work. Whereas those low value jobs that did get done happened to come in during quiet periods where they could be easily slotted into the schedule. This would mean that there was actually a third factor at play; available developer resources.

What would be required to model this would be to include some quantified "available developer resources" value with each data point. With this additional information, we should be able to construct a useful prediction model. However, the points could no longer be plotted on a simple 2D graph - now they would have to be plotted in 3D space. And so the formula that would be produced from the training data would no longer draw a simple line, it would have to describe a plane that cuts through the 3D space (with the "yes" values on one side and the "no" values on the other).

If you had more than three features to each data point then you would no longer be able to draw a graph of it, your points would now live in the mysterious-sounding "n-dimensional space". It's still possible to concoct a formula that divides the data points within that space, it's just not very easy to imagine it in a visual manner. On the 2D graph, it was a one-dimensional line that split the data points and allowed us to make predictions against future data points. On the 3D graph, it would be a two-dimensional plane that divides the points. In n-dimensional space, it will be a "hyperplane" (which just means that it is a structure with one less dimension than the space that it exists in). Once you start throwing terms around like "Linear SVM" and "hyperplane", it's hard not to feel like you're making real progress towards getting a handle on all this machine learning lark! I mention these higher dimensional spaces, though, not just because they sound exciting but because they will be relevant soon..

Something else that's interesting when considering SVMs is that, even with all of the required data, it still might not be possible to delineate the positive and negative results with a linear formula. If we go back to imagining a 2D graph, there could be curious curves in the data that are predictable but can not be cleanly cut with a straight line. In this case, we would have two choices - have the algorithm try to find a line that cuts the results as well as possible (but accept that this will have a lower accuracy because we know that a straight line will get a lot of predictions wrong, based upon the data we've got) or we can allow it to try to come up with a non-linear separation.

The Stricter Manager's Decision History

If, say, you had a manager who leant more towards authorising tasks that had greater strategic value, even if there weren't customers immediately willing to finance the work (maybe they favour tasks that improve efficiency for work that customers are excited about or maybe it's to lay the groundwork required before customer-led work can continue) then it might not be possible to perfectly fit a straight line to the data. One option, when trying to model this, would be to specify that the learning algorithm could use a non-linear kernel, which would hopefully generate better predictions. Or you could stick with the linear approach and accept that it will be less accurate. Looking at the graph here, there is a curved line that matches the data precisely and there is a straight line that tries its best but can't help splitting the data so that some negative values are above the line (where only positive values should be) and some positive values are below the line (where only negative values should be).

This is another illustration that machine learning can't perform miracles. Before, I described an example where a third feature would need to be added to the data ("available developer resources") - without which it would be impossible to train a useful model. Here, we can see that sometimes we need to have some insight into what prediction models may or may not work for us (ie. we need to know that a linear kernel won't yield the best results).

I'm not going to dwell on this any longer here because, happily, a linear SVM is a good fit for the data that we'll be working with for classifying face / not-a-face. I don't think that I yet understand enough about the subject to explain why a linear model works so well for what we want to do but I do have links to articles (which I'll include at the end of this post) that seem happy to explain why this is the case.

Feature extraction for maybe-a-face images

So, if an SVM can be trained on pre-existing data and then used to decide whether other data (that it's never seen before) should be classified as a yes or a no, how can this be applied to our "potential face regions"? The first thing that we need to do is decide how to extract features from an image. In the "Manager Decision History" examples above, there were only two or three features but there's nothing stopping us from extracting many more features from our inputs. Whatever we do, though, we need to generate the same number of features for every input into the SVM - so we'll almost certainly have to resize each maybe-a-face sub-image before processing it. We could consider taking all of the pixel data from each resized sub-image (every red value, green value and blue value from every pixel across the entire sub-image).. but we're not going to. Not only would there be a lot of data to work with, there would also be too much variance introduced by the environment of the source photograph. We want to try to give the SVM meaningful data for it to analyse and remove irrelevant variables where possible. By "irrelevant variables", I mean that it would be great if we could ignore the lighting in an image and focus on the shapes.

Sometimes the lighting varies not just from one image to another but within the same image - if there is a person outside and sunlight falls on one side of them, for example. And it would be great if we could ignore the contrast and the shades of colour - if you can imagine fiddling with an image in Photoshop (or the excellent Paint.NET), recolouring an image or playing with the contrast and brightness doesn't affect your ability to recognise what's in the image (unless you really stretch some of those dials to the max) and we don't want those sorts of differences to have any impact on the SVM's predictions. What we really want is to discount as much of this variance as possible when preparing the inputs (both the inputs used to train it and the subsequent inputs that we want predictions for).

A method that works well to achieve this is to generate normalised "Histograms of Oriented Gradients" from the sub-image. Again, I've come to this conclusion from doing a lot of reading around and then attempting to implement described methods to see how they work - I'll include references to articles with more details at the end of this post.

The premise is fairly simple, you need to:

  1. Ensure that every image to process is a consistent size before starting (if an image is the wrong aspect ratio for whatever the "consistent size" that you decide on is then add some black bars above-below or left-right, depending upon which sides are too long/short)
  2. Greyscale the image
  3. For every pixel, work out the direction that the intensity of light on the image is changing and how quickly it's changing
  4. Split the data into blocks
  5. Reduce each block into a histogram of gradients
  6. Apply some sort of normalisation to reduce impact of contrast or brightness differences from image to image
  7. Use the resulting data as the image's feature set in the next step!

I think that steps 3 and 5 are probably the ones that require further explanation. Since we have a greyscale image by step 3, instead of colour data we effectively have "intensity" values for each pixel. To work out the angle in which the intensity is changing, take the intensity of the pixel below the current one and subtract it from the intensity of the pixel above the current one. This is the "y difference" - or "delta y", so I'll name it "dy" (so that I can sound all maths-y). "dx", as you can probably imagine, is the difference between the intensity of the pixel to the right minus the intensity of the pixel to the left. We can calculate the angle at which intensity is changing (and how big the change is) with the following:

var angleInDegrees = RadiansToDegrees(Math.Atan2(dx, dy));
var magnitude = Math.Sqrt((dx * dx) + (dy * dy));

Note that "Math.Atan2" returns a value in radians. Radians are another unit of measurement of angles but instead of describing a full circle by going from 0-360 degrees, the scale goes from 0-2π radians. So, to convert from radians into easier-to-deal-with-in-this-case degrees we can do this:

private static double RadiansToDegrees(double angle)
{
    return angle * (180d / Math.PI);
}

Also note that Atan2 actually returns values in the range of -π to +π and so calling RadiansToDegrees on the result will give us a value from -180 to +180.

(When calculating gradient angles and magnitudes, any pixels on image edges won't have surrounding pixels in every direction and so I just record them as having both an angle and magnitude of zero)
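Putting those pieces together, the per-pixel calculation looks something like this (a sketch over a plain 2D array of greyscale intensities rather than my DataRectangle type - remember that y increases downwards in image coordinates) -

var width = intensities.GetLength(0);
var height = intensities.GetLength(1);
var angles = new double[width, height];
var magnitudes = new double[width, height];
for (var x = 0; x < width; x++)
{
    for (var y = 0; y < height; y++)
    {
        // Edge pixels don't have neighbours in every direction, so leave them at zero
        if ((x == 0) || (y == 0) || (x == width - 1) || (y == height - 1))
            continue;

        var dx = intensities[x + 1, y] - intensities[x - 1, y]; // right minus left
        var dy = intensities[x, y - 1] - intensities[x, y + 1]; // above minus below
        angles[x, y] = RadiansToDegrees(Math.Atan2(dx, dy));
        magnitudes[x, y] = Math.Sqrt((dx * dx) + (dy * dy));
    }
}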

A Histogram

Now for step 5, we want to produce "histograms of oriented gradients" (one for each block that we've split the image into in step 4). I always envisage a histogram like a bar chart - it shows how values are distributed across a range.

(The example on the left is from wikipedia and reproduced here under the creative commons license)

To generate our histograms, we want to assign the gradients within the block across nine buckets, spaced twenty degrees apart. The first will be at 10 degrees, the second at 30 and they'll go all the way up to 170.

We're going to use "unsigned gradients", which means that if we have a gradient where intensity is changing only vertically, we don't really care if it's getting brighter as it goes down or brighter as it goes up, we only care that the line is precisely vertical. Similarly, imagine the intensity increasing up-and-right, at 45 degrees - we're happy to treat this the same as intensity going in precisely the opposite direction; down-and-left and 225 degrees (or -135 degrees since our current angleInDegrees values are in the -180 to +180 range). What this essentially means is that we want to tweak our angles so that they are all within 0-180 (instead of -180 to +180). To do, so we just add 180 degrees to any values less than zero.

Every gradient needs to be assigned to one or more of these buckets. If a gradient's angle is precisely 10 degrees then the entirety of the gradient's magnitude is added to the 10 bucket. However, if it's between 10 and 30 then its magnitude is split proportionally between 10 and 30 (so a gradient of 20 degrees is split evenly between 10 and 30 while a gradient of 25 degrees will contribute 3/4 of its magnitude to the 30 bucket - which it is 5 degrees away from - and 1/4 of its magnitude to the 10 bucket - which it is 15 degrees away from).
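A sketch of that bucketing for a single gradient (the bin centres are at 10, 30, .., 170 degrees; note that I've made the extremes wrap around here - an angle near 0/180 is shared between the 170 and 10 buckets - which is an assumption of this illustration rather than something spelled out above) -

private static void AddToHistogram(double[] bins, double angleInDegrees, double magnitude)
{
    // "bins" is the nine-value histogram for the current block. Unsigned gradients:
    // fold the -180..0 range up into 0..180 first.
    if (angleInDegrees < 0)
        angleInDegrees += 180;

    // The nine bin centres sit at 10, 30, 50, .., 170 degrees (20 degrees apart)
    const int binSpacing = 20;
    var lowerBinIndex = (int)Math.Floor((angleInDegrees - 10) / binSpacing);
    var lowerBinCentre = 10 + (lowerBinIndex * binSpacing);
    var proportionForUpperBin = (angleInDegrees - lowerBinCentre) / binSpacing;

    // eg. 25 degrees => the 10 bucket gets 1/4 of the magnitude and the 30 bucket gets 3/4.
    // The modulo arithmetic wraps angles near 0/180 around between the 170 and 10 buckets.
    bins[((lowerBinIndex % 9) + 9) % 9] += magnitude * (1 - proportionForUpperBin);
    bins[(((lowerBinIndex + 1) % 9) + 9) % 9] += magnitude * proportionForUpperBin;
}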

Performing this transformation on the image is an effective way of reducing the amount of data that we need to deal with. If we decide that the standardised size of the images that we want an SVM to classify is 128x128 then we have 128 x 128 x 3 = 49,152 values (since every pixel has three colour values; red, green and blue). If we use a block size of 8 when generating the HOG data then the 128 x 128 image will be 16 x 16 blocks and each block has 9 values (since each histogram has values in nine bins), which gives a total of 2,304 values.

Another nice thing about this representation of the data is that, if you sort of squint, you can kind of make out the shapes that were in the source image -

Histogram of gradients render for Egg Man's face

If you get curious and want to try generating HOGs for your own images, there is code in my GitHub project to do this..

using (var image = new Bitmap(imagePath))
{
    using (var resizedImage = new Bitmap(image, new Size(128, 128)))
    {
        // (Can ignore GetFor's return value when only interested in generating a preview image)
        FaceClassifier.FeatureExtractor.GetFor(
            resizedImage,
            blockSize: 8,
            optionalHogPreviewImagePath: "HOG.png",
            normaliser: GlobalNormaliser.Normalise
        );
    }
}

.. which brings me neatly on to HOG normalisation (since there's a mysterious reference to a normaliser in the code above). When the gradient magnitudes are calculated, some images may have many pixels that vary wildly in intensity from their neighbours while other images may have smoother, gentler gradients. In both cases, it is the relative flow of intensity that is important, since that helps identify the shapes. If you took an image and produced HOG data for it and then increased the contrast of the image and re-processed it, you would get greater gradient magnitude values in the increased-contrast version of the image, even though they both - so far as we are interested (in the context of trying to detect faces) - contain exactly the same information.

What we want to do is to align all of the magnitudes to a common base line. A simple (and fairly effective) way to do this is to find the largest value of any of the buckets across all of the histograms generated for an image and to then divide every magnitude by this value. In the example above, the 128x128 input image is transformed into 16x16 blocks, each of which is a histogram that contains 9 values. So we get the largest value from each of those 16x16x9 values and then divide all of them by it. This means that the largest value is now precisely 1.0 and every other value is somewhere between zero and one. This is what the "GlobalNormaliser.Normalise" delegate in the code above does. All it essentially has to do is this:

public static DataRectangle<HistogramOfGradient> Normalise(DataRectangle<HistogramOfGradient> hogs)
{
    if (hogs == null)
        throw new ArgumentNullException(nameof(hogs));

    var maxMagnitude = hogs.Enumerate()
        .Select(pointAndHistogram => pointAndHistogram.Item2)
        .Max(histogram => histogram.GreatestMagnitude);
    return hogs.Transform(hog => hog.Multiply(1 / maxMagnitude));
}

However, there is a variation on this that has been found to produce more accurate results; "block normalisation". The original description of this process comes from (as I understand it) the original research into using HOGs for this form of detection ("Histograms of Oriented Gradients for Human Detection [PDF]") -

For better invariance to illumination, shadowing, etc., it is also useful to contrast-normalize the local responses before using them. This can be done by accumulating a measure of local histogram “energy” over somewhat larger spatial regions (“blocks”) and using the results to normalize all of the cells in the block

What this means in practical terms is explained well by one of the articles that I linked earlier ("HOG Person Detector Tutorial"). In essence, it means that we can get better results from normalising over smaller areas of the image. Instead of taking the max magnitude across the entire set of data, the idea is to group the histograms into blocks of four and normalise over those.

Let's go back to Tiger Woods' face to illustrate what I mean.

First 2x2 block of histograms to normalise
Second 2x2 block of histograms to normalise

We first take the 2x2 histograms from the top left of the image - we get the max magnitude from within those four histograms and use it to normalise the values within them. These four normalised histograms will provide the first sets of values that we extract from the image.

Then we move across one to get another set of 2x2 histograms and repeat the process; get the max magnitude from within those four histograms, use it to normalise them and then take those four normalised histograms as the next set of values that we have extracted from the image.

What you might notice here is that, as we look at the 2x2 blocks of histograms, some of them will appear multiple times. The histograms from the edges of the image won't but the others will. In the Tiger images here, you can see that the two histograms at the right hand side of the block in the first image are included again in the second image (now they are the two histograms on the left hand side of the block).

This means that this "block normalisation" process will result in more data being produced. When we "globally normalised" the HOGs then we had 16x16x9 = 2,304 values. However, if we block-normalise (using blocks of 2x2) then we generate 30 blocks across (there are two edge blocks that are only counted once but the other 14 blocks are all counted twice, so the total is 2 + (14*2) = 30). For the same reason, we will generate 30 blocks worth of data as we go down the image. This means that we end up with a total of 30x30x9 = 8,100 values.

To extract those features, we would tweak the code from before -

const int inputWidth = 128;
const int inputHeight = 128;
const int blockSizeForHogGeneration = 8;
const int blockSizeForLocalNormalisation = 2;

IEnumerable<double> features;
using (var image = new Bitmap(imagePath))
{
    using (var resizedImage = new Bitmap(image, new Size(inputWidth, inputHeight)))
    {
        var blockNormaliser = new OverlappingBlockwiseNormaliser(blockSizeForLocalNormalisation);
        features = FaceClassifier.FeatureExtractor.GetFor(
            resizedImage,
            blockSize: blockSizeForHogGeneration,
            optionalHogPreviewImagePath: null,
            normaliser: blockNormaliser.Normalise
        );
    }
}

(Note that I'm setting "optionalHogPreviewImagePath" to null so that "FeatureExtractor.GetFor" doesn't generate a "HOG preview" image - this is because it's much harder to recognise the shapes that the gradients were extracted from when this form of normalisation is used since most of the HOGs appear multiple times, so the preview images are of less interest)

When I tried comparing the results of global normalisation vs block normalisation, I found that I got better results (ie. better accuracy) when using block normalisation, just as the authors of "Histograms of Oriented Gradients for Human Detection" did. The number of images that I've been testing with is, I'm sure, much smaller than the number used by Dalal and Triggs in their research but it was gratifying that I could see improvements with my data set from using block normalisation - if only because it reassured me that I was moving in the right direction!

Training an SVM

It's really coming together now. We've got a way to extract data from an image that we will be able to pass to an SVM in order for it to classify the image as "face" or "not face". There's just one thing that we're missing.. a trained SVM. Or, another way to look at it, two things we're missing - a mechanism to train an SVM and the data to train it with.

Let's start with a general way to train an SVM. For this, I'm going to use a package called Accord.NET. It's a library entirely written in C#, which was a plus to me because I like to see how things work and when I was doing my early reading around on the subject of face detection/recognition, a lot of people were recommending OpenCV. This is a C++ library (which can be called by C# using a .NET wrapper called Emgu), while I would be happier with C# source code that I could more confidently dig around in. (Also, I'm toying with the idea of trying to port some of this work to a Bridge.NET project so that I can try making it work in the browser - this wouldn't be possible if I took a dependency on a library like OpenCV).

Accord.NET really does make it easy.. once you know where to get started. There are a lot of examples on the accord-framework.net site and on the GitHub wiki, though some of the code samples are out of date and will generate compile warnings if you try to use them directly. (Having said that, the warnings can be ignored and it's not too hard to find other examples that compile without warnings - and, from reading some of the GitHub issues, I know that César is aware that some of the docs need updating and is intending to do so when he can make time).

To demonstrate, let's revisit the "Manager Decision History" example from earlier. We'll formulate some example data where we pretend that the Manager is super-consistent and will always authorise work if the percentage that the customer will pay for immediately (as a value between 0 and 1) plus the strategic value (also somehow quantified as a value from 0 to 1) add up to more than 1. (So strategic value 0.9 plus customer-will-pay-immediately 0.2 will be authorised as 0.9 + 0.2 > 1 but strategic value 0.8 with customer-will-pay-immediately of 0.15 will not be authorised as 0.8 + 0.15 < 1). We can then use that data to train an SVM and then try other values against the model -

// Make up some data (in the real world we'd use some proper pre-classified training
// data but this is just an example to demonstrate how to train an SVM using Accord.NET)
var decisionHistory = Enumerable.Range(0, 10).Select(x => x / 10d)
    .SelectMany(x => Enumerable.Range(0, 10).Select(y => y / 10d).Select(y => new
    {
        StrategicValue = x,
        ImmediateCustomerContribution = y,
        Authorised = (x + y) > 1
    }));

// From the data, extract the input features (strategic-value and amount-customer-will-
// pay-now)..
var inputs = decisionHistory
    .Select(decision => new[]
    {
        decision.StrategicValue,
        decision.ImmediateCustomerContribution
    })
    .ToArray();

// .. and the true/false outputs for each of those sets of features
var outputs = decisionHistory.Select(decision => decision.Authorised).ToArray();

// Then use the inputs and outputs to train an SVM
var smo = new SequentialMinimalOptimization<Linear>();
var svm = smo.Learn(inputs, outputs);

The SequentialMinimalOptimization defines the process by which it will be decided when the best match has been found for the data that it's been provided. We've specified that a linear kernel be used, which means that we're presuming that it will be possible to neatly classify our data with a straight line.

Now that it's been trained, we can ask the SVM to predict an output by calling its "Decide" method and giving it a pair of values -

var easyWin = svm.Decide(new[] { 0.81, 0.79 });

This returns true - which is what we would expect since the numbers indicate a feature that has high strategic value (0.81) and there are customers who want it so much right now that they are already getting their chequebooks out (the customer-will-immediately-pay value is 79%).

var lowPriority = svm.Decide(new[] { 0.26, 0.14 });

This returns false - which we'd also expect, since the numbers indicate a feature of low strategic value and one that no customer is excited about contributing much towards the development cost of.

Time for another gotcha. We saw earlier that a linear kernel is not always going to be capable of perfectly classifying the results in the training data. Sometimes you might need to use a non-linear kernel (or stick with a linear kernel but accept a lower accuracy). I didn't talk about what other kernel options there are (because it's not relevant to what I want to do here) but it was an important point that machine learning will sometimes need some external insight in order to be as effective as it can be. Another example of this is that sometimes you need to tweak the training parameters, depending upon the data that you're using. In the below example, I'm going to try to train an SVM in a very similar manner to what we just looked at, but with much less data -

var decisionHistory  = new[]
{
    // Strategic wins
    new { StrategicValue = 0.95, ImmediateCustomerContribution = 0.1, Authorised = true },
    new { StrategicValue = 0.85, ImmediateCustomerContribution = 0.2, Authorised = true },

    // Customer wins
    new { StrategicValue = 0.15, ImmediateCustomerContribution = 0.9, Authorised = true },
    new { StrategicValue = 0.2, ImmediateCustomerContribution = 0.9, Authorised = true },

    // Everybody is happy
    new { StrategicValue = 0.8, ImmediateCustomerContribution = 0.8, Authorised = true },

    // Low priority
    new { StrategicValue = 0.2, ImmediateCustomerContribution = 0.1, Authorised = false },
    new { StrategicValue = 0.4, ImmediateCustomerContribution = 0.2, Authorised = false }
};

var inputs = decisionHistory
    .Select(decision => new[]
    {
        decision.StrategicValue,
        decision.ImmediateCustomerContribution
    })
    .ToArray();

var outputs = decisionHistory.Select(decision => decision.Authorised).ToArray();

var smo = new SequentialMinimalOptimization<Linear>();
var svm = smo.Learn(inputs, outputs);

The training data is very similar to before in that all authorised decisions still have a feature sum of more than 1 and all rejected decisions have a sum of 1 or less. However, something seems to have gone wrong because when I ask the trained SVM what it thinks of the "lowPriority" example -

var lowPriority = svm.Decide(new[] { 0.26, 0.14 });

.. it returns true! This is not what I want.

The only way that this could happen is if the prediction model that has been generated is completely wonky somehow. To put this to the test, I'm going to use a mechanism that Accord has where you can double-check your trained SVM by running the training data back through it to see how well the prediction model managed to fit it. This can be useful in cases where you're not sure if the SVM kernel that you're using is appropriate, since it can highlight a badly-fitting model. To calculate the error rate when the training data is passed back through the SVM, do the following:

var predicted = svm.Decide(inputs);
var error = new ZeroOneLoss(outputs).Loss(predicted);

This just uses the model to calculate a prediction for each of the inputs and then compares the results to the expected values (then it works out what proportion are incorrect). In this case, the error rate is 0.2857142857142857, which is 2/7. Looking at the predicted values (an array of bool), every value is true! This isn't right - the last two inputs (the "low priority" data points) should result in a false prediction. I guess that that explains why the "lowPriority" example returns true from this model - it seems to return true for everything!

We know that a linear model will fit this data because we know that it's a straight line on a graph that separates everything above 1 (which are "decision authorised" results) from everything else (which are "decision rejected" results). So it's not the kernel that's the problem. The only other thing to do is to look for some options to fiddle with. Poring through the documentation, one that sounds promising is "Complexity" (also referred to as "cost") -

The cost parameter C controls the trade off between allowing training errors and forcing rigid margins. It creates a soft margin that permits some misclassifications. Increasing the value of C increases the cost of misclassifying points and forces the creation of a more accurate model that may not generalize well.

That description makes it sound like a bad thing to try increasing the complexity - I think I want a model that will generalise well. However, leaving the complexity at its default is clearly not working well for us in this case. So, I tried changing the SequentialMinimalOptimization initialisation to:

var smo = new SequentialMinimalOptimization<Linear> { Complexity = 10 };

.. and re-running. This time, the predicted array precisely matched the output array, which means that the error rate is now zero. When I ask the new model to predict a value for the "lowPriority" features, it returns false - which is much better!

I've only experienced this problem when working with small amounts of data. To train a face classification SVM, we're going to throw a lot of training data at it and so this shouldn't be a problem (there will be no need to fiddle with Complexity or any other settings). I only mention it now in case you decide to do a few experiments of your own and fall into the same trap that I did!

Training data for faces

We have all the tools that we need now, the final piece of the puzzle is that we need training data to teach an SVM what looks like a face and what doesn't.

I could take the pictures that I found on my computer, run them through the skin tone face detector, manually categorise each maybe-a-face region and then use that information to train an SVM. I would use the HOG feature extractor to generate the training inputs and the list of output values would be the manual classifications that I would have to prepare (eg. sub-image-1 is a face, sub-image-2 is not a face, etc..). This should result in an SVM that could then classify each of the sub-images automatically. However, that would be cheating! What I want to do is train a classifier with one lot of data and then use the SVM on my Easter Weekend photos to prove that it's worked. (Testing an SVM using the same data used to train it is a bit pointless, it gives you no indication whether you have produced something that is useful for general purposes or if you've only succeeded in training an SVM that is specialised and only works with that particular set of inputs).

It's crucial to train it using both positive images (ie. face images) and negative images (non-face images), otherwise the SVM will have no idea how to classify. If, for example, you tried to train an SVM using only positive images then all you teach it is that everything is a positive image! (By always returning true, it would produce a zero error rate for the training data but it's not very useful to always return true when trying to classify real world data). So we need both kinds of input and I think that we ideally want to have an equal number of positive and negative images.

(If I had to really think about it, maybe it should be the case that we want at least as many negative images as positive as there are only so many variations of a face that exist but there are loads of things that aren't faces.. however, an equal number of positive/negative has worked well for me and so I haven't felt the need to experiment with different ratios)

I've found various places that have databases of images of faces but I found it difficult to decide how to get a set of negative images. I'm sure that in some things I read, people spoke about using images of scenery.. but I can't see why there should be any particular kind of negative image (such as scenery) that should be used - it could be anything (so long as it's not a face)!

What I did in the end was download the "Caltech 10,000 Web Faces" data set, which includes lots of photos downloaded from various Google image searches along with a text file that, crucially, has coordinates of the faces in the photos. From this data set, I extracted the faces from images but also extracted other parts of the images at random that didn't contain faces. For each face entry in the "Ground Truth" text file, I extracted three images - one where the edges of the sub-image were close around the face and then two where there was a little more background around the face. This should help produce an SVM that can recognise faces when the region around the face is very tight and when it's less so, which is important for classifying the skin tone face detection results - Tiger Woods' detected-face-region is quite tight while the other photos show that much looser face regions may be identified for other photos.
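As an illustration of how the not-a-face regions could be picked (this is a sketch that assumes we already have the face rectangles for an image from the Ground Truth file; it's not the exact code from my project) -

// Given the face regions that the Ground Truth data describes for an image, pick a
// random region of a similar size that doesn't overlap any of the faces, to use as a
// negative training example
private static Rectangle? TryToGetNegativeRegion(Size imageSize, IEnumerable<Rectangle> faceRegions, Size regionSize, Random random)
{
    const int maxAttempts = 100;
    for (var attempt = 0; attempt < maxAttempts; attempt++)
    {
        var candidate = new Rectangle(
            random.Next(0, imageSize.Width - regionSize.Width),
            random.Next(0, imageSize.Height - regionSize.Height),
            regionSize.Width,
            regionSize.Height
        );
        if (!faceRegions.Any(face => face.IntersectsWith(candidate)))
            return candidate;
    }
    return null; // Couldn't find a face-free region (a very busy image)
}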

There's no exciting implementation details here. The "CalTechWebFacesSvmTrainer" class is given a bunch of configuration options: path containing the CalTech web face images, path to the Ground Truth text file (which lists all of the face regions in the images), standard size of image to use to generate SVM inputs (128x128), block size for HOG generation (8), normaliser (the OverlappingBlockwiseNormaliser we saw earlier, with block size 2) and number of training images to process. These options will train an SVM on positive and negative images from the Caltech data set and the "TrainFromCaltechData" method will return an IClassifyPotentialFaces implementation, which wraps the SVM and exposes a single method -

bool IsFace(Bitmap image);

Now, we can take the possible-face sub-images identified by the skin tone pass and pass them to "IsFace". This method will resize the specified Bitmap to the standard image size that the SVM requires, generate normalised HOG data from it and then query the SVM - which returns the final result; is this or is this not a face?
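The shape of that "IsFace" flow is roughly as follows (a sketch; the private fields here are assumptions about how the configuration options would be held onto rather than names from the real class) -

public bool IsFace(Bitmap image)
{
    // Resize to the standard dimensions that the SVM was trained against..
    using (var resizedImage = new Bitmap(image, new Size(_inputWidth, _inputHeight)))
    {
        // .. extract block-normalised HOG features in exactly the same way as was done
        // for the training data..
        var features = FaceClassifier.FeatureExtractor.GetFor(
            resizedImage,
            blockSize: _blockSize,
            optionalHogPreviewImagePath: null,
            normaliser: _normaliser.Normalise
        );

        // .. and let the trained SVM decide
        return _svm.Decide(features.ToArray());
    }
}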

The only configuration option that I haven't mentioned until now is "number of training images to process". We saw before that trying to train an SVM with very little data can be ineffective (unless you know what kernel settings to fiddle with) but it's very difficult to come up with a hard and fast rule of how much data is enough to train from. I think that the best thing to do is just to experiment. My gut told me that it would surely have to be at least 100s or 1000s of images since the SVM has to be trained to classify many variations of faces (and not-faces) and each of the inputs has a lot of values (8,100 - as we calculated earlier) and so it seems like it's going to need to have a lot of information at its disposal so that it can work out what is and isn't important.

Dancing Classification Fail

So, initially, I tried specifying 500 training images, which resulted in an SVM that actually did a pretty good job. The group photo shown before had all of the maybe-face regions correctly classified (the faces were all identified as faces and the hands in the photos were all correctly classified as not-faces).

However, there was a problem with the image shown here (the problem, btw, clearly isn't the moves being thrown on the dancefloor, since they look like the very essence of poetry in motion). The problem is that one of the maybe-face regions identified by the skin tone pass has been mis-classified by the SVM as a face, when it's really an arm.

(Note that there are some faces in the background that were not identified by the skin tone pass, but I'm not worried about that - it is to be expected that they would be ignored because they are small, relative to the size of the image.. though it should be possible to tweak the skin tone filter parameters if you wanted to try to capture background faces).

Increasing the number of training images to 1,000 didn't address the problem with this particular image and, in fact, made another image worse (where a non-face region was correctly classified as not-a-face before, with 1,000 training images the SVM thought that it was a face). Increasing to 1,500 training images corrected the classification of the dancing image regions but there were still false positives in other images. Such as this one of that same individual who appears to now be wearing a flag (which reminds me of a scene from Fear and Loathing in Las Vegas).

Another Classification Failure

2,000 training images seemed to be the sweet spot. The classifier produced the correct results for every case that I expected it to.

The quantity of training data is going to vary from application to application and will depend upon what sort of feature extraction logic you are using. I think that it makes sense to start with lower quantities, if only because it's faster to train an SVM with less data and you can experiment more quickly with the other variables. To illustrate, on my computer (with the code as it currently stands, which is in a completely unoptimised state, but running a release build) it takes 17s to load the 2,000 images and extract the features from them, it then takes 27s to train the SVM on them. With 500 images it takes 5s to load the image data and 2s to train the SVM. With 5,000 images it takes 42s and 168s.

Egg Man's Glorious Face

There's one final tweak that I'd like to mention before bringing everything to a close. The SVM requires that the inputs be fixed dimensions, so the "IsFace" implementation resizes the input image if it's too big or too small. If it's the wrong dimensions, though, then it leaves black bars above-and-below or left-and-right of the input (which is part of the process that I described in "Feature extraction for maybe-a-face images"). I found that the classification accuracy improved slightly if I expanded the maybe-a-face region so that it matched the aspect ratio of the SVM's input dimensions first. For example, if the skin tone pass identified a face region that was 91x241px and I knew that the SVM was configured to work with input images of 128x128px then I would first expand the 91x241px region to 241x241px (similarly, if I had a region that was 123x56px then I would expand it to 123x123px).
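A sketch of that expansion, given that the SVM input here is square (and ignoring the clamping that's needed when the expanded region would poke outside the image bounds) -

// Expand a maybe-a-face region so that it is square (since the SVM input is 128x128)
// before it gets resized, keeping the original region centred within the new one
private static Rectangle ExpandToSquare(Rectangle region)
{
    var sideLength = Math.Max(region.Width, region.Height);
    return new Rectangle(
        region.X - ((sideLength - region.Width) / 2),
        region.Y - ((sideLength - region.Height) / 2),
        sideLength,
        sideLength
    );
}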

Performance

In the context of this kind of work, "performance" relates to two distinct areas - how accurately does it perform its work and how quickly does it do it? To an extent, the two are related since we could make the skin tone filter work more quickly (if we reduced the limit from no-input-dimension-may-be-more-than-400px to 300px or 200px) but at the risk of it no longer identifying regions that we want it to. As we've just seen with the SVM training, the time it takes to train it depends upon the quantity of data (both the size of the input image that features are extracted from and the number of input images) - so the inputs could be reduced (so that the training time would be shorter) but this negatively affects the accuracy of the classifier.

(A handy tip that I used when I was happy with the SVM training but experimenting with the skin tone pass was that the Accord.NET SVM class may be serialised using the BinaryFormatter - so I was persisting it to disk between runs, rather than having to re-train it on every run)
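For example (a sketch assuming the svm reference from the earlier training code, an arbitrary file name and that the linear teacher produces a SupportVectorMachine<Linear>) -

// Persist the trained SVM to disk so that it doesn't have to be re-trained every run..
using (var stream = File.OpenWrite("trained-svm.bin"))
{
    new BinaryFormatter().Serialize(stream, svm);
}

// .. and read it back on a subsequent run
using (var stream = File.OpenRead("trained-svm.bin"))
{
    svm = (SupportVectorMachine<Linear>)new BinaryFormatter().Deserialize(stream);
}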

Once the SVM is trained, it is very quick at classifying. Which makes sense, since all it needs to do is take a feature set, apply its linear formula and see which side of the hyperplane the data point sits on. There is a little work required to extract the feature data from a maybe-face sub-image in order to give the SVM something to work on but it takes less than 15ms per sub-image on my computer. The skin tone pass is slower, taking between 150 and 300ms for the images that I've been testing with (which are all about 1000px on the largest side and so are resized down to 400px before being processed). I'd like to cut this time down because I like the idea of being able to apply this processing to video and I feel that it needs to be able to process at least five frames a second to convincingly track faces. I haven't made any effort towards this yet but reducing the total time down to 200ms feels reasonable since the code has been written to be easy to follow and tweak, rather than to be fast, so there is surely plenty of performance tuning potential.

Another reason that I haven't spent any time optimising the current code is that it might be wiser to spend time researching alternative algorithms. For example, Accord.NET has some support for face detection, such as the FaceHaarCascade class (which I believe is the same class of detector as the Viola–Jones object detection framework) -

Rectangle[] faceRegions;
using (var image = new Bitmap(path))
{
    var detector = new HaarObjectDetector(new FaceHaarCascade());
    faceRegions = detector.ProcessFrame(image);
}

However, it gave poor results for my images. It seemed to do well with very small faces but struggled with larger images (or the images where the faces took up a large proportion of the image). It's very possible that I could improve the accuracy by tweaking the parameters. If my images are too large, maybe they should just be shrunk down further before processing? That might help the false negative rate but I'm also concerned about the false positive rate and I'm not sure that shrinking the image first would help with that.

TigerWoods - Sliding Window SVM Results

One approach that sounds very promising: if the SVM classifier is so fast and effective then maybe we should get rid of the skin tone processing altogether and just generate segments from all over the image, then filter them with the classifier. This is discussed and recommended by many people and is referred to as the "sliding window" approach (see Sliding Windows for Object Detection with Python and OpenCV, for example). The idea is that you pick a window size (eg. 50px if limiting the source image to no more than 400px on a side) and then start at the top left of the image and take that top-left 50x50px as the first sub-image to pass to the classifier. Then you move the window over a bit (say, 10px) and then use that as the second sub-image. Keep going until you can't go any further and then go back to the left but go down 10px - moving across then back-and-down until you have covered the entire image. Then do the same with some other window sizes (maybe 75px and then 100px), since you don't know whether the faces on an image are going to be small (relative to the size of the image) or large. This will give you many, many regions to classify - but that's no problem because linear SVM classification is very fast. Unfortunately, I did try this but I got an enormous false positive rate (as you can see here with Tiger - if you can make him out from behind all the green "this is a face!" boxes). Surely this approach can be made to work since it's so commonly presented as a solution to this problem.. it may just be that I need to tweak how the SVM is trained, maybe it needs more than 2,000 training images (though I did also try it with 10,000 training images and the results were not much better).
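To sketch what that window generation looks like in code (the image and faceClassifier references are assumptions of this illustration and the window sizes and step are just the example values from above) -

// Slide windows of various sizes across the image and pass each region to the classifier
var faceRegions = new List<Rectangle>();
var windowSizes = new[] { 50, 75, 100 };
const int step = 10;
foreach (var windowSize in windowSizes)
{
    for (var x = 0; x + windowSize <= image.Width; x += step)
    {
        for (var y = 0; y + windowSize <= image.Height; y += step)
        {
            var region = new Rectangle(x, y, windowSize, windowSize);
            using (var subImage = image.Clone(region, image.PixelFormat))
            {
                if (faceClassifier.IsFace(subImage))
                    faceRegions.Add(region);
            }
        }
    }
}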

Black and White Face Detection Fail

To return to the performance of the skin-tone-pass-followed-by-(2k-image-trained-)SVM-classification model that I've talked about for most of this post; in terms of accuracy I'm very happy with it. In the test images that I've been using, there are zero false positives. There are a couple of false negatives but I think that they are understandable. The first obvious failing that it has is with a black-and-white photo - since the skin tone face detector looks for particular hues and saturations, it's not going to work if there is no colour in the image. That's just a built-in limitation of the approach.

Semi-occluded

Another failure is with a face that is half hidden behind someone else (in the top right of this group photo). The skin tone face detector identifies the face but the SVM classifies it as not-a-face. Since the face is half-hidden, I think that this is acceptable.

Vomit shot with skin mask
Vomit shot false negative

Last but not least is this unfortunate guy who foolishly entered some sort of drinking contest. I don't know if he won but he doesn't look too good about it one way or the other. The red liquid erupting from his lips falls within the first phase's looks-like-skin-tones bounds, as you can see by looking at the skin mask image. Not only is his head tilted in the shot but the vomit-detected-as-skin means that his face isn't centered in the maybe-face region. So I'm not surprised that the SVM doesn't correctly classify it.

In summary, it's not perfect but I'm still very happy with the results.

Article references

I want to close this post off by linking to the sites and papers that helped me get to this point. It's been a fun journey and is hopefully only my first step on the machine learning track!

The "Naked People Skin Filter (Fleck & Forsyth)" and "Face Detection in Color Images" papers were amongst the first that I found that had easily understandable material (for a face detection novice) that didn't use a third party library in some way.

Once I had implemented the skin tone logic and was looking to improve the results, one of the articles that I found most inspiring (partly because it was so approachable) was "Machine Learning is Fun! Part 4: Modern Face Recognition with Deep Learning" (by Adam Geitgey). This talks about taking things to the next step and performing facial recognition - not just locating faces but also trying to identify who they are, against a known database. I haven't got to that advanced point yet but it's definitely something that I'd like to explore. The only thing that I would say about this article is that it doesn't go into a lot of depth about generating HOGs, normalising them or what sort of input image size and other parameters to use. He does link to some Python code that uses third party libraries in lieu of more details, though. Since I wanted to write C# and (ideally) only take C# dependencies, this wasn't very helpful to me. The article itself, though, got me really excited about the possibilities and I enjoyed the guy's writing style (and intend to read the rest of his "Machine Learning is Fun" series).

An article that did go into the details was the "HOG Person Detector Tutorial" (by Chris McCormick). This was also really well written and very approachable. Although it talks about detecting a person in a photo, rather than a face, the principles are the same. The description of the HOG generation and normalisation really helped me clarify in my mind how it should all work. This article links to the original HOG person detector paper by Dalal and Triggs [PDF], which is full of research information and graphs and equations - so if you want to dig deep then this is the place to start!

The Dalal and Triggs paper states in the introduction that "For simplicity and speed, we use linear SVM as a baseline classifier throughout the study". Later on, they mention that they tried a non-linear kernel but that "Replacing the linear SVM with a Gaussian kernel one improves performance by about 3%.. at the cost of much higher run times". If this "why is a linear SVM classifier appropriate" explanation feels too light for you then it is discussed in greater depth in "Why do linear SVMs trained on HOG features perform so well?" [PDF].

Finally, I want to mention again the article "Kernel Support Vector Machines for Classification and Regression in C#" by Accord.NET's César Souza. He introduces the SVM in a much more formal way than I've done here, includes a video about "The Kernel Trick" (a simple yet mind-blowing way in which a linear classifier can be applied to data that doesn't immediately look like it could be separated linearly), talks about other kernels (aside from linear, he also describes polynomial and gaussian) and describes how the Sequential Minimal Optimization learning algorithm works. There's a lot of code and there are illustrated examples at the end. When I found this article fairly early on in my research, I recognised how much care had gone into preparing it but also knew that I needed something a little more beginner-level first! So I bookmarked it for future reference and now I think that I understand (almost all of) it. Hopefully, if you started off knowing nothing but have followed all the way through this post, then you will too!
