29 March 2022
I have a video of a presentation where the camera keeps losing focus such that the slides are unreadable. I have the original slide deck and I want to fix this.
Step one was identifying the area in each frame where it seemed likely that the slides were being projected; now step two is to correct the perspective of that projection back into a rectangle, to make it easier to perform comparisons against the original slide deck images and try to determine which slide was being projected.
(An experimental TL;DR approach: See this small scale .NET Fiddle demonstration of what I'll be discussing)
An overview of the processing to do this looks as follows:
- Take the frame of the video as a Bitmap, along with the four corner points of the area that the slides are projected onto
- Walk along the top edge of that area and, at each step, take a vertical slice of pixel values running down to the corresponding point on the bottom edge
- Stretch every slice to the same height and paste them side by side into a new Bitmap, which will be the perspective-corrected projection
Note: The reason that this process is only going to be an approximation is due to the way that the height of the output image will be determined - it will simply be the length of the longer of the two vertical edges of the bounded area, which means that the content will end up slightly vertically stretched and its aspect ratio won't quite match that of the original slides.
So if we follow the plan above then we'll generate a list of vertical slices a bit like this:
.. which, when combined would look like this:
This is very similar to the original projection except that the top edge is now flat and the vertical slices are all still different lengths (they get longer towards the right-hand side of the projection because that edge was closer to the camera).
We're not done yet but this has brought things much closer!
In fact, all that is needed is to stretch those vertical slices so that they are all the same length and.. ta-da!
So, from previous analysis, I know that the bounding area for the projection of the slide in the frames of my video is:
topLeft: (1224, 197)
topRight: (1915, 72)
bottomLeft: (1229, 638)
bottomRight: (1915, 662)
Since I'm going to walk along the top edge and create vertical slices from that, I'm going to need the length of that edge - which is easy enough with some Pythagoras:
private static int LengthOfLine((PointF From, PointF To) line)
{
var deltaX = line.To.X - line.From.X;
var deltaY = line.To.Y - line.From.Y;
return (int)Math.Round(Math.Sqrt((deltaX * deltaX) + (deltaY * deltaY)));
}
So although it's only 691px horizontally from the top left to the top right (1915 - 1224), the actual length of that edge is 702px (because it's a line that angles up slightly, rather than being a flat horizontal one).
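As a quick sanity check of that (purely illustrative - this line isn't part of the final class), plugging the two top corner points into the method gives that 702px figure:
// deltaX = 1915 - 1224 = 691 and deltaY = 72 - 197 = -125,
// so the length is sqrt((691 * 691) + (125 * 125)) = sqrt(493,106) ≈ 702.2, which rounds to 702
var topEdgeLength = LengthOfLine((new PointF(1224, 197), new PointF(1915, 72)));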
This edge length determines how many vertical slices that we'll take and we'll get them by looping across this top edge, working out where the corresponding point on the bottom edge should be and joining them together into a line; one vertical slice. Each time that the loop increments, the current point on the top edge is going to move slightly to the right and even more slightly upwards while each corresponding point on the bottom edge will also move slightly to the right but it will move slightly down as the projection on the wall gets closer and closer to the camera.
One way to get all of these vertical slice lines is a method such as the following:
private sealed record ProjectionDetails(
Size ProjectionSize,
IEnumerable<((PointF From, PointF To) Line, int Index)> VerticalSlices
);
private static ProjectionDetails GetProjectionDetails(
Point topLeft,
Point topRight,
Point bottomRight,
Point bottomLeft)
{
var topEdge = (From: topLeft, To: topRight);
var bottomEdge = (From: bottomLeft, To: bottomRight);
var lengthOfEdgeToStartFrom = LengthOfLine(topEdge);
var dimensions = new Size(
width: lengthOfEdgeToStartFrom,
height: Math.Max(
LengthOfLine((topLeft, bottomLeft)),
LengthOfLine((topRight, bottomRight))
)
);
return new ProjectionDetails(dimensions, GetVerticalSlices());
IEnumerable<((PointF From, PointF To) Line, int Index)> GetVerticalSlices() =>
Enumerable
.Range(0, lengthOfEdgeToStartFrom)
.Select(i =>
{
var fractionOfProgressAlongPrimaryEdge = (float)i / lengthOfEdgeToStartFrom;
return (
Line: (
GetPointAlongLine(topEdge, fractionOfProgressAlongPrimaryEdge),
GetPointAlongLine(bottomEdge, fractionOfProgressAlongPrimaryEdge)
),
Index: i
);
});
}
This returns the dimensions of the final perspective-corrected projection (which is as wide as the top edge is long and as high as the greater of the left-hand edge's length and the right-hand edge's length), as well as an IEnumerable of the start and end points for each slice that we'll be taking.
The dimensions are going to allow us to create a bitmap that we'll paste the slices into when we're ready - but, before that, we need to determine pixel values for every point on every vertical slice. As the horizontal distance across the top edge is 691px and the vertical distance is 125px but its actual length is 702px, each time we move one step along in that 702px loop the starting point for the vertical slice will move (691 / 702) = 0.98px across and (125 / 702) = 0.18px up. So almost all of these vertical slices are going to have start and end points that are not whole pixel values - and the same applies to each point on that vertical slice. This means that we're going to have to take average colour values when we're dealing with fractional pixel locations.
For example, if we're at the point (1309.5, 381.5) and the colours at (1309, 381), (1310, 381), (1309, 382), (1310, 382) are all white then the averaging is really easy - the "averaged" colour is white! If we're at the point (1446.5, 431.5) and the colours at (1446, 431), (1447, 431), (1446, 432), (1447, 432) are #BCA6A9, #B1989C, #BCA6A9, #B1989C then it's also not too complicated - because (1446.5, 431.5) is at the precise midpoint between all four points then we can take a really simple average by adding all four R values together, all four G values together, all four B values together and dividing them by 4 to get a combined result. It gets a little more complicated where it's not 0.5 of a pixel and it's slightly more to the left or to the right and/or to the top or to the bottom - eg. (1446.1, 431.9) would get more of its averaged colour from the pixels on the left than on the right (as 1446.1 is only just past 1446) while it would get more of its averaged colour from the pixels on the bottom than the top (as 431.9 is practically at 432). On the other hand, on the rare occasion where it is a precise location (with no fractional pixel values), such as (1826, 258), then it's the absolute simplest case because no averaging is required!
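To put numbers against that last (1446.1, 431.9) example, these are the weights that the averaging described here effectively applies - written out as a throwaway calculation rather than anything from the real code:
// fractionalX = 0.1 and fractionalY = 0.9, so the four surrounding pixels contribute:
var weightOfTopLeft = (1 - 0.1) * (1 - 0.9);     // (1446, 431) => 0.09
var weightOfTopRight = 0.1 * (1 - 0.9);          // (1447, 431) => 0.01
var weightOfBottomLeft = (1 - 0.1) * 0.9;        // (1446, 432) => 0.81
var weightOfBottomRight = 0.1 * 0.9;             // (1447, 432) => 0.09
// .. ie. 90% of the final colour comes from the left-hand column of pixels and 90% from the bottom row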
private static Color GetAverageColour(Bitmap image, PointF point)
{
var (integralX, fractionalX) = GetIntegralAndFractional(point.X);
var x0 = integralX;
var x1 = Math.Min(integralX + 1, image.Width - 1); // clamp so that GetPixel is never asked for a pixel outside the image
var (integralY, fractionalY) = GetIntegralAndFractional(point.Y);
var y0 = integralY;
var y1 = Math.Min(integralY + 1, image.Height - 1);
var (topColour0, topColour1) = GetColours(new Point(x0, y0), new Point(x1, y0));
var (bottomColour0, bottomColour1) = GetColours(new Point(x0, y1), new Point(x1, y1));
return CombineColours(
CombineColours(topColour0, topColour1, fractionalX),
CombineColours(bottomColour0, bottomColour1, fractionalX),
fractionalY
);
(Color c0, Color c1) GetColours(Point p0, Point p1)
{
var c0 = image.GetPixel(p0.X, p0.Y);
var c1 = (p0 == p1) ? c0 : image.GetPixel(p1.X, p1.Y);
return (c0, c1);
}
static (int Integral, float Fractional) GetIntegralAndFractional(float value)
{
var integral = (int)Math.Truncate(value);
var fractional = value - integral;
return (integral, fractional);
}
static Color CombineColours(Color x, Color y, float proportionOfSecondColour)
{
if ((proportionOfSecondColour == 0) || (x == y))
return x;
if (proportionOfSecondColour == 1)
return y;
return Color.FromArgb(
red: CombineComponent(x.R, y.R),
green: CombineComponent(x.G, y.G),
blue: CombineComponent(x.B, y.B),
alpha: CombineComponent(x.A, y.A)
);
int CombineComponent(int x, int y) =>
Math.Min(
(int)Math.Round((x * (1 - proportionOfSecondColour)) + (y * proportionOfSecondColour)),
255
);
}
}
This gives us the capability to split the wonky projection into vertical slices, to loop over each slice and to walk down each slice and get a list of pixel values for each point down that slice. The final piece of the puzzle is that we then need to resize each vertical slice so that they all match the projection height returned from the GetProjectionDetails method earlier. Handily, the .NET Bitmap drawing code has DrawImage functionality that can resize content, so we can:
- create a Bitmap whose dimensions are those returned from GetProjectionDetails
- loop over each vertical slice (from the IEnumerable also returned from GetProjectionDetails)
- write each slice's pixel values into a 1px-wide bitmap and use DrawImage to paste that slice's bitmap onto the full-size projection Bitmap
In code:
private static void RenderSlice(
Bitmap projectionBitmap,
IEnumerable<Color> pixelsOnLine,
int index)
{
var pixelsOnLineArray = pixelsOnLine.ToArray();
using var slice = new Bitmap(1, pixelsOnLineArray.Length);
for (var j = 0; j < pixelsOnLineArray.Length; j++)
slice.SetPixel(0, j, pixelsOnLineArray[j]);
using var g = Graphics.FromImage(projectionBitmap);
g.DrawImage(
slice,
srcRect: new Rectangle(0, 0, slice.Width, slice.Height),
destRect: new Rectangle(index, 0, 1, projectionBitmap.Height),
srcUnit: GraphicsUnit.Pixel
);
}
If we combine all of this logic together then we end up with a fairly straightforward static class that does all the work - it takes a Bitmap that is a frame from a video where there is a section that should be extracted and then "perspective-corrected", takes the four points that describe that region and then returns a new Bitmap that is the extracted content in a lovely rectangle!
/// <summary>
/// This uses a simple algorithm to try to undo the distortion of a rectangle in an image
/// due to perspective - it takes the content of the rectangle and stretches it into a
/// rectangle. This is only a simple approximation and does not guarantee accuracy (in
/// fact, it will result in an image that is slightly vertically stretched such that its
/// aspect ratio will not match the original content and a more thorough approach would
/// be necessary if this is too great an approximation)
/// </summary>
internal static class SimplePerspectiveCorrection
{
public static Bitmap ExtractAndPerspectiveCorrect(
Bitmap image,
Point topLeft,
Point topRight,
Point bottomRight,
Point bottomLeft)
{
var (projectionSize, verticalSlices) =
GetProjectionDetails(topLeft, topRight, bottomRight, bottomLeft);
var projection = new Bitmap(projectionSize.Width, projectionSize.Height);
foreach (var (lineToTrace, index) in verticalSlices)
{
var lengthOfLineToTrace = LengthOfLine(lineToTrace);
var pixelsOnLine = Enumerable
.Range(0, lengthOfLineToTrace)
.Select(j =>
{
var fractionOfProgressAlongLineToTrace = (float)j / lengthOfLineToTrace;
var point = GetPointAlongLine(lineToTrace, fractionOfProgressAlongLineToTrace);
return GetAverageColour(image, point);
});
RenderSlice(projection, pixelsOnLine, index);
}
return projection;
static Color GetAverageColour(Bitmap image, PointF point)
{
var (integralX, fractionalX) = GetIntegralAndFractional(point.X);
var x0 = integralX;
var x1 = Math.Min(integralX + 1, image.Width - 1); // clamp so that GetPixel is never asked for a pixel outside the image
var (integralY, fractionalY) = GetIntegralAndFractional(point.Y);
var y0 = integralY;
var y1 = Math.Min(integralY + 1, image.Height - 1);
var (topColour0, topColour1) = GetColours(new Point(x0, y0), new Point(x1, y0));
var (bottomColour0, bottomColour1) = GetColours(new Point(x0, y1), new Point(x1, y1));
return CombineColours(
CombineColours(topColour0, topColour1, fractionalX),
CombineColours(bottomColour0, bottomColour1, fractionalX),
fractionalY
);
(Color c0, Color c1) GetColours(Point p0, Point p1)
{
var c0 = image.GetPixel(p0.X, p0.Y);
var c1 = (p0 == p1) ? c0 : image.GetPixel(p1.X, p1.Y);
return (c0, c1);
}
static (int Integral, float Fractional) GetIntegralAndFractional(float value)
{
var integral = (int)Math.Truncate(value);
var fractional = value - integral;
return (integral, fractional);
}
static Color CombineColours(Color x, Color y, float proportionOfSecondColour)
{
if ((proportionOfSecondColour == 0) || (x == y))
return x;
if (proportionOfSecondColour == 1)
return y;
return Color.FromArgb(
red: CombineComponent(x.R, y.R),
green: CombineComponent(x.G, y.G),
blue: CombineComponent(x.B, y.B),
alpha: CombineComponent(x.A, y.A)
);
int CombineComponent(int x, int y) =>
Math.Min(
(int)Math.Round(
(x * (1 - proportionOfSecondColour)) +
(y * proportionOfSecondColour)
),
255
);
}
}
}
private sealed record ProjectionDetails(
Size ProjectionSize,
IEnumerable<((PointF From, PointF To) Line, int Index)> VerticalSlices
);
private static ProjectionDetails GetProjectionDetails(
Point topLeft,
Point topRight,
Point bottomRight,
Point bottomLeft)
{
var topEdge = (From: topLeft, To: topRight);
var bottomEdge = (From: bottomLeft, To: bottomRight);
var lengthOfEdgeToStartFrom = LengthOfLine(topEdge);
var dimensions = new Size(
width: lengthOfEdgeToStartFrom,
height: Math.Max(
LengthOfLine((topLeft, bottomLeft)),
LengthOfLine((topRight, bottomRight))
)
);
return new ProjectionDetails(dimensions, GetVerticalSlices());
IEnumerable<((PointF From, PointF To) Line, int Index)> GetVerticalSlices() =>
Enumerable
.Range(0, lengthOfEdgeToStartFrom)
.Select(i =>
{
var fractionOfProgressAlongPrimaryEdge = (float)i / lengthOfEdgeToStartFrom;
return (
Line: (
GetPointAlongLine(topEdge, fractionOfProgressAlongPrimaryEdge),
GetPointAlongLine(bottomEdge, fractionOfProgressAlongPrimaryEdge)
),
Index: i
);
});
}
private static PointF GetPointAlongLine((PointF From, PointF To) line, float fraction)
{
var deltaX = line.To.X - line.From.X;
var deltaY = line.To.Y - line.From.Y;
return new PointF(
(deltaX * fraction) + line.From.X,
(deltaY * fraction) + line.From.Y
);
}
private static int LengthOfLine((PointF From, PointF To) line)
{
var deltaX = line.To.X - line.From.X;
var deltaY = line.To.Y - line.From.Y;
return (int)Math.Round(Math.Sqrt((deltaX * deltaX) + (deltaY * deltaY)));
}
private static void RenderSlice(
Bitmap projectionBitmap,
IEnumerable<Color> pixelsOnLine,
int index)
{
var pixelsOnLineArray = pixelsOnLine.ToArray();
using var slice = new Bitmap(1, pixelsOnLineArray.Length);
for (var j = 0; j < pixelsOnLineArray.Length; j++)
slice.SetPixel(0, j, pixelsOnLineArray[j]);
using var g = Graphics.FromImage(projectionBitmap);
g.DrawImage(
slice,
srcRect: new Rectangle(0, 0, slice.Width, slice.Height),
destRect: new Rectangle(index, 0, 1, projectionBitmap.Height),
srcUnit: GraphicsUnit.Pixel
);
}
}
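To tie it together, a usage sketch for the example frame in this post (the file name is illustrative; the four points are the ones identified earlier):
using var frame = new Bitmap("frame_338.jpg");
using var projection = SimplePerspectiveCorrection.ExtractAndPerspectiveCorrect(
frame,
topLeft: new Point(1224, 197),
topRight: new Point(1915, 72),
bottomRight: new Point(1915, 662),
bottomLeft: new Point(1229, 638)
);
projection.Save("projection_338.png");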
So step one was to take frames from a video and to work out what the bounds were of the area where slides were being projected (and to filter out any intro and outro frames), step two has been to take the bounded area from any frame and project it back into a rectangle to make it easier to match against the original slide images.. step three will be to use these projections to try to guess which slide is being displayed in which frame!
The frame that I've been using as an example throughout this post probably looks like a fairly easy case - big blocks of white or black and not actually too out of focus.. but some of the frames look like this and that's a whole other kettle of fish!
Posted at 19:02
15 March 2022
I have a video of a presentation where the camera keeps losing focus such that the slides are unreadable. I have the original slide deck and I want to fix this.
The first step is analysing the individual frames of the video to find a common "most illuminated area" so that I can work out where the slide content was being projected, and that is what is described in this post.
(An experimental TL;DR approach: See this small scale .NET Fiddle demonstration of what I'll be discussing)
An overview of the processing to do this looks as follows:
- Load the frame into a Bitmap and shrink it down if it's large (we don't need full resolution for this)
- Convert it to greyscale and create a mask of the brightest pixels
- Use flood filling to identify the largest connected bright object in that mask
- Find the points of that object that are closest to each corner of the image, which gives a bounding quadrilateral for the projected slide (scaled back up to the original image's dimensions if the frame was shrunk at the start)
Before looking at code to do that, I'm going to toss in a few other complications that arise from having to process a lot of frames from throughout the video, rather than just one..
Firstly, the camera loses focus at different points in the video and to different extents and so some frames are blurrier than others. Following the steps above, the blurrier frames are likely to report a larger projection area for the slides. I would really like to identify a common projection area that is reasonable to use across all frames because this will make later processing (where I try to work out what slide is currently being shown in the frame) easier.
Secondly, this video has intro and outro animations and it would be nice if I was able to write code that worked out when they stopped and started.
To do this work, I'm going to introduce a variation of my old friend the DataRectangle (from "How are barcodes read?" and "Face or no face") -
public static class DataRectangle
{
public static DataRectangle<T> For<T>(T[,] values) => new DataRectangle<T>(values);
}
public sealed class DataRectangle<T>
{
private readonly T[,] _protectedValues;
public DataRectangle(T[,] values) : this(values, isolationCopyMayBeBypassed: false) { }
private DataRectangle(T[,] values, bool isolationCopyMayBeBypassed)
{
if ((values.GetLowerBound(0) != 0) || (values.GetLowerBound(1) != 0))
throw new ArgumentException("Both dimensions must have lower bound zero");
var arrayWidth = values.GetUpperBound(0) + 1;
var arrayHeight = values.GetUpperBound(1) + 1;
if ((arrayWidth == 0) || (arrayHeight == 0))
throw new ArgumentException("zero element arrays are not supported");
Width = arrayWidth;
Height = arrayHeight;
if (isolationCopyMayBeBypassed)
_protectedValues = values;
else
{
_protectedValues = new T[Width, Height];
Array.Copy(values, _protectedValues, Width * Height);
}
}
public int Width { get; }
public int Height { get; }
public T this[int x, int y]
{
get
{
if ((x < 0) || (x >= Width))
throw new ArgumentOutOfRangeException(nameof(x));
if ((y < 0) || (y >= Height))
throw new ArgumentOutOfRangeException(nameof(y));
return _protectedValues[x, y];
}
}
public IEnumerable<(Point Point, T Value)> Enumerate()
{
for (var x = 0; x < Width; x++)
{
for (var y = 0; y < Height; y++)
{
var value = _protectedValues[x, y];
var point = new Point(x, y);
yield return (point, value);
}
}
}
public DataRectangle<TResult> Transform<TResult>(Func<T, TResult> transformer)
{
var transformed = new TResult[Width, Height];
for (var x = 0; x < Width; x++)
{
for (var y = 0; y < Height; y++)
transformed[x, y] = transformer(_protectedValues[x, y]);
}
return new DataRectangle<TResult>(transformed, isolationCopyMayBeBypassed: true);
}
}
For working with DataRectangle instances that contain double values (as we will be here), I've got a couple of convenient extension methods:
public static class DataRectangleOfDoubleExtensions
{
public static (double Min, double Max) GetMinAndMax(this DataRectangle<double> source) =>
source
.Enumerate()
.Select(pointAndValue => pointAndValue.Value)
.Aggregate(
seed: (Min: double.MaxValue, Max: double.MinValue),
func: (acc, value) => (Math.Min(value, acc.Min), Math.Max(value, acc.Max))
);
public static DataRectangle<bool> Mask(this DataRectangle<double> values, double threshold) =>
values.Transform(value => value >= threshold);
}
And for working with Bitmap instances, I've got some extension methods for those as well:
public static class BitmapExtensions
{
public static Bitmap CopyAndResize(this Bitmap image, int resizeLargestSideTo)
{
var (width, height) = (image.Width > image.Height)
? (resizeLargestSideTo, (int)((double)image.Height / image.Width * resizeLargestSideTo))
: ((int)((double)image.Width / image.Height * resizeLargestSideTo), resizeLargestSideTo);
return new Bitmap(image, width, height);
}
/// <summary>
/// This will return values in the range 0-255 (inclusive)
/// </summary>
// Based on http://stackoverflow.com/a/4748383/3813189
public static DataRectangle<double> GetGreyscale(this Bitmap image) =>
image
.GetAllPixels()
.Transform(c => (0.2989 * c.R) + (0.5870 * c.G) + (0.1140 * c.B));
public static DataRectangle<Color> GetAllPixels(this Bitmap image)
{
var values = new Color[image.Width, image.Height];
var data = image.LockBits(
new Rectangle(0, 0, image.Width, image.Height),
ImageLockMode.ReadOnly,
PixelFormat.Format24bppRgb
);
try
{
var pixelData = new byte[data.Stride];
for (var lineIndex = 0; lineIndex < data.Height; lineIndex++)
{
Marshal.Copy(
source: data.Scan0 + (lineIndex * data.Stride),
destination: pixelData,
startIndex: 0,
length: data.Stride
);
for (var pixelOffset = 0; pixelOffset < data.Width; pixelOffset++)
{
// Note: PixelFormat.Format24bppRgb means the data is stored in memory as BGR
const int PixelWidth = 3;
values[pixelOffset, lineIndex] = Color.FromArgb(
red: pixelData[pixelOffset * PixelWidth + 2],
green: pixelData[pixelOffset * PixelWidth + 1],
blue: pixelData[pixelOffset * PixelWidth]
);
}
}
}
finally
{
image.UnlockBits(data);
}
return DataRectangle.For(values);
}
}
With this code, we can already perform those first steps that I've described in the find-projection-area-in-image process.
Note that I'm going to throw in an extra step of shrinking the input images if they're larger than 400px because we don't need pixel-perfect accuracy when the whole point of this process is that a lot of the frames are too blurry to read (as a plus, shrinking the images means that there's less data to process and the whole thing should finish more quickly).
using var image = new Bitmap("frame_338.jpg");
using var resizedImage = image.CopyAndResize(resizeLargestSideTo: 400);
var greyScaleImageData = resizedImage.GetGreyscale();
var (min, max) = greyScaleImageData.GetMinAndMax();
var range = max - min;
const double thresholdOfRange = 2 / 3d;
var thresholdForMasking = min + (range * thresholdOfRange);
var mask = greyScaleImageData.Mask(thresholdForMasking);
This gives us a DataRectangle of boolean values that represent the brighter points as true and the less bright points as false.
In the image below, you can see the original frame on the left. In the middle is the content that would be masked out by hiding all but the brightest pixels. On the right is the "binary mask" (where we discard the original colour of the pixel and make them all either black or white) -
Now we need to identify the largest "object" within this mask - wherever bright pixels are adjacent to other bright pixels, they will be considered part of the same object and we would expect there to be several such objects within the mask that has been generated.
To do so, I'll be reusing some more code from "How are barcodes read?" -
private static IEnumerable<IEnumerable<Point>> GetDistinctObjects(DataRectangle<bool> mask)
{
// Flood fill areas in the mask to create distinct areas
var allPoints = mask
.Enumerate()
.Where(pointAndIsMasked => pointAndIsMasked.Value)
.Select(pointAndIsMasked => pointAndIsMasked.Point).ToHashSet();
while (allPoints.Any())
{
var currentPoint = allPoints.First();
var pointsInObject = GetPointsInObject(currentPoint).ToArray();
foreach (var point in pointsInObject)
allPoints.Remove(point);
yield return pointsInObject;
}
// Inspired by code at
// https://simpledevcode.wordpress.com/2015/12/29/flood-fill-algorithm-using-c-net/
IEnumerable<Point> GetPointsInObject(Point startAt)
{
var pixels = new Stack<Point>();
pixels.Push(startAt);
var valueAtOriginPoint = mask[startAt.X, startAt.Y];
var filledPixels = new HashSet<Point>();
while (pixels.Count > 0)
{
var currentPoint = pixels.Pop();
if ((currentPoint.X < 0) || (currentPoint.X >= mask.Width)
|| (currentPoint.Y < 0) || (currentPoint.Y >= mask.Height))
continue;
if ((mask[currentPoint.X, currentPoint.Y] == valueAtOriginPoint)
&& !filledPixels.Contains(currentPoint))
{
filledPixels.Add(new Point(currentPoint.X, currentPoint.Y));
pixels.Push(new Point(currentPoint.X - 1, currentPoint.Y));
pixels.Push(new Point(currentPoint.X + 1, currentPoint.Y));
pixels.Push(new Point(currentPoint.X, currentPoint.Y - 1));
pixels.Push(new Point(currentPoint.X, currentPoint.Y + 1));
}
}
return filledPixels;
}
}
As the code mentions, this is based on an article "Flood Fill algorithm (using C#.NET)" and its output is a list of objects, where each object is a list of points within that object. So the way to determine which object is largest is to take the one that contains the most points!
var pointsInLargestHighlightedArea = GetDistinctObjects(mask)
.OrderByDescending(points => points.Count())
.FirstOrDefault();
(Note: If pointsInLargestHighlightedArea is null then we need to escape out of the method that we're in because the source image didn't produce a mask with any highlighted objects - this could happen if every single pixel in the image has the same colour, for example; an edge case, surely, but one that we should handle)
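For example, if this all lives inside the GetMostHighlightedArea method described further down then the early exit could simply be (returning a default all-(0,0) area, which is just one reasonable option):
if (pointsInLargestHighlightedArea is null)
    return default; // the mask had no highlighted objects at all, so there is no area to report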
From this largest object, we want to find a bounding quadrilateral, which we do by looking at every point and finding the one closest to the top left of the image (because this will be the top left of the bounding area), the point closest to the top right of the image (for the top right of the bounding area) and the same for the points closest to the bottom left and bottom right.
This can be achieved by calculating, for each point in the object, the distance from each of the image's corners to that point and then determining which points have the shortest distances - eg.
var distancesOfPointsFromImageCorners = pointsInLargestHighlightedArea
.Select(p =>
{
// To work out distance from the top left, you would use Pythagoras to take the
// squared horizontal distance of the point from the left of the image and add
// that to the squared vertical distance of the point from the top of the image,
// then you would square root that sum. In this case, we only want to be able to
// compare determine which distances are smaller or larger and we don't actually
// care about the precise distances themselves and so we can save ourselves from
// performing that final square root calculation.
var distanceFromRight = greyScaleImageData.Width - p.X;
var distanceFromBottom = greyScaleImageData.Height - p.Y;
var fromLeftScore = p.X * p.X;
var fromTopScore = p.Y * p.Y;
var fromRightScore = distanceFromRight * distanceFromRight;
var fromBottomScore = distanceFromBottom * distanceFromBottom;
return new
{
Point = p,
FromTopLeft = fromLeftScore + fromTopScore,
FromTopRight = fromRightScore + fromTopScore,
FromBottomLeft = fromLeftScore + fromBottomScore,
FromBottomRight = fromRightScore + fromBottomScore
};
})
.ToArray(); // Call ToArray to avoid repeating this enumeration four times below
var topLeft = distancesOfPointsFromImageCorners.OrderBy(p => p.FromTopLeft).First().Point;
var topRight = distancesOfPointsFromImageCorners.OrderBy(p => p.FromTopRight).First().Point;
var bottomLeft = distancesOfPointsFromImageCorners.OrderBy(p => p.FromBottomLeft).First().Point;
var bottomRight = distancesOfPointsFromImageCorners.OrderBy(p => p.FromBottomRight).First().Point;
Finally, because we want to find the bounding area of the largest object in the original image, we may need to multiply up the bounds that we just found because we shrank the image down if either dimension was larger than 400px and we were performing calculations on that smaller version.
We can tell how much we reduced the data by looking at the width of the original image and comparing it to the width of the greyScaleImageData DataRectangle that was generated from the shrunken version of the image:
var reducedImageSideBy = (double)image.Width / greyScaleImageData.Width;
Now we only need a function that will multiply the bounding area that we've got according to this value, while ensuring that none of the point values are multiplied such that they exceed the bounds of the original image:
private static (Point TopLeft, Point TopRight, Point BottomRight, Point BottomLeft) Resize(
Point topLeft,
Point topRight,
Point bottomRight,
Point bottomLeft,
double resizeBy,
int minX,
int maxX,
int minY,
int maxY)
{
if (resizeBy <= 0)
throw new ArgumentOutOfRangeException("must be a positive value", nameof(resizeBy));
return (
Constrain(Multiply(topLeft)),
Constrain(Multiply(topRight)),
Constrain(Multiply(bottomRight)),
Constrain(Multiply(bottomLeft))
);
Point Multiply(Point p) =>
new Point((int)Math.Round(p.X * resizeBy), (int)Math.Round(p.Y * resizeBy));
Point Constrain(Point p) =>
new Point(Math.Min(Math.Max(p.X, minX), maxX), Math.Min(Math.Max(p.Y, minY), maxY));
}
The final bounding area for the largest bright area of an image is now retrieved like this:
var bounds = Resize(
topLeft,
topRight,
bottomRight,
bottomLeft,
reducedImageSideBy,
minX: 0,
maxX: image.Width - 1,
minY: 0,
maxY: image.Height - 1
);
For the example image that we're looking at, this area is outlined like this:
Say that we put all of the above functionality into a method called GetMostHighlightedArea that took a Bitmap to process and returned a tuple of the four points that represented the bounds of the brightest area - we could then easily prepare a LINQ statement that ran that code and found the most common brightest-area-bounds across all of the source images that I have. (As I said before, the largest-bounded-area will vary from image to image in my example as the camera recording the session gained and lost focus)
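A sketch of how such a method might be composed from the pieces above (the IlluminatedAreaLocator class name comes from the code below; it assumes that the GetDistinctObjects and Resize methods shown earlier are members of the same class and that the extension methods are available):
internal static class IlluminatedAreaLocator
{
public static (Point TopLeft, Point TopRight, Point BottomRight, Point BottomLeft) GetMostHighlightedArea(Bitmap image)
{
// Shrink the frame, greyscale it and mask off the brightest pixels (as shown earlier)
using var resizedImage = image.CopyAndResize(resizeLargestSideTo: 400);
var greyScaleImageData = resizedImage.GetGreyscale();
var (min, max) = greyScaleImageData.GetMinAndMax();
var mask = greyScaleImageData.Mask(min + ((max - min) * 2 / 3d));
// Take the largest connected bright object (bail out if there isn't one)
var pointsInLargestHighlightedArea = GetDistinctObjects(mask)
.OrderByDescending(points => points.Count())
.FirstOrDefault();
if (pointsInLargestHighlightedArea is null)
return default;
// Find the points of that object that are closest to each corner of the (shrunken) image
var distancesOfPointsFromImageCorners = pointsInLargestHighlightedArea
.Select(p =>
{
var distanceFromRight = greyScaleImageData.Width - p.X;
var distanceFromBottom = greyScaleImageData.Height - p.Y;
var fromLeftScore = p.X * p.X;
var fromTopScore = p.Y * p.Y;
var fromRightScore = distanceFromRight * distanceFromRight;
var fromBottomScore = distanceFromBottom * distanceFromBottom;
return new
{
Point = p,
FromTopLeft = fromLeftScore + fromTopScore,
FromTopRight = fromRightScore + fromTopScore,
FromBottomLeft = fromLeftScore + fromBottomScore,
FromBottomRight = fromRightScore + fromBottomScore
};
})
.ToArray();
var topLeft = distancesOfPointsFromImageCorners.OrderBy(p => p.FromTopLeft).First().Point;
var topRight = distancesOfPointsFromImageCorners.OrderBy(p => p.FromTopRight).First().Point;
var bottomLeft = distancesOfPointsFromImageCorners.OrderBy(p => p.FromBottomLeft).First().Point;
var bottomRight = distancesOfPointsFromImageCorners.OrderBy(p => p.FromBottomRight).First().Point;
// Scale the bounds back up to the dimensions of the original image
return Resize(
topLeft, topRight, bottomRight, bottomLeft,
resizeBy: (double)image.Width / greyScaleImageData.Width,
minX: 0, maxX: image.Width - 1,
minY: 0, maxY: image.Height - 1
);
}
}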
var files = new DirectoryInfo("Frames").EnumerateFiles("*.jpg");
var (topLeft, topRight, bottomRight, bottomLeft) = files
.Select(file =>
{
using var image = new Bitmap(file.FullName);
return IlluminatedAreaLocator.GetMostHighlightedArea(image);
})
.GroupBy(area => area)
.OrderByDescending(group => group.Count())
.Select(group => group.Key)
.FirstOrDefault();
Presuming that there is a folder called "Frames" in the output folder of the project*, this will read them all, look for the largest bright area on each of them individually, then return the area that appears most often across all of the images. (Note: If there are no images to read then the FirstOrDefault call at the bottom will return a default tuple-of-four-Points, which will be 4x (0,0) values)
* (Since you probably don't happen to have a bunch of images from a video of my presentation lying around, see the next section for some code that will download some in case you want to try this all out!)
This ties in nicely with my recent post "Parallelising (LINQ) work in C#" because the processing required for each image is completely independent of the processing for every other image (and expensive enough that spreading it across multiple cores is worthwhile).
All that we would have to change in order to use multiple threads to process multiple images is the addition of a single line:
var files = new DirectoryInfo("Frames").EnumerateFiles("*.jpg");
var (topLeft, topRight, bottomRight, bottomLeft) = files
.AsParallel() // <- WOO!! This is all that we needed to add!
.Select(file =>
{
using var image = new Bitmap(file.FullName);
return IlluminatedAreaLocator.GetMostHighlightedArea(image);
})
.GroupBy(area => area)
.OrderByDescending(group => group.Count())
.Select(group => group.Key)
.FirstOrDefault();
(Parallelisation sidebar: When we split up the work like this, if the processing for each image was solely in memory then it would be a no-brainer that using more threads would make sense - however, the processing for each image involves LOADING the image from disk and THEN processing it in memory and if you had a spinning rust hard disk then you may fear that trying to ask it to read multiple files simultaneously would be slower than asking it to read them one at a time because its poor little read heads have to physically move around the plates.. it turns out that this is not necessarily the case and that you can find more information in this article that I found interesting; "Performance Impact of Parallel Disk Access")
I haven't quite finished yet but I figured that there may be some wild people out there that would like to try running this code locally themselves - maybe just to see it work or maybe even to get it working and then chop and change it for some new and exciting purpose!
To this end, I have some sample frames available from this video that I'm trying to fix - with varying levels of fuzziness present. To download them, use the following method:
private static async Task EnsureSamplesAvailable(DirectoryInfo framesfolder)
{
// Note: The GitHub API is rate limited quite severely for non-authenticated apps, so we just
// only call use it if the framesFolder doesn't exist or is empty - if there are already files
// in there then we presume that we downloaded them on a previous run (if the API is hit too
// often then it will return a 403 "rate limited" response)
if (framesfolder.Exists && framesfolder.EnumerateFiles().Any())
{
Console.WriteLine("Sample images have already been downloaded and are ready for use");
return;
}
Console.WriteLine("Downloading sample images..");
if (!framesfolder.Exists)
framesfolder.Create();
string namesAndUrlsJson;
using (var client = new WebClient())
{
// The API refuses requests without a User Agent, so set one before calling (see
// https://docs.github.com/en/rest/overview/resources-in-the-rest-api#user-agent-required)
client.Headers.Add(HttpRequestHeader.UserAgent, "ProductiveRage Blog Post Example");
namesAndUrlsJson = await client.DownloadStringTaskAsync(new Uri(
"https://api.github.com/repos/" +
"ProductiveRage/NaivePerspectiveCorrection/contents/Samples/Frames"
));
}
// Deserialise the response into an array of entries that have Name and Download_Url properties
var namesAndUrls = JsonConvert.DeserializeAnonymousType(
namesAndUrlsJson,
new[] { new { Name = "", Download_Url = (Uri?)null } }
);
if (namesAndUrls is null)
{
Console.WriteLine("GitHub reported zero sample images to download");
return;
}
await Task.WhenAll(namesAndUrls
.Select(async entry =>
{
using var client = new WebClient();
await client.DownloadFileTaskAsync(
entry.Download_Url,
Path.Combine(framesfolder.FullName, entry.Name)
);
})
);
Console.WriteLine($"Downloaded {namesAndUrls.Length} sample image(s)");
}
.. and call it with the following argument, presuming you're trying to read images from the "Frames" folder as the code earlier illustrated:
await EnsureSamplesAvailable(new DirectoryInfo("Frames"));
So I said earlier that it would also be nice if I could programmatically identify which frames were part of the intro/outro animations of the video that I'm looking at.
It feels logical that any frame that is of the actual presentation will have a fairly similarly-sized-and-located bright area (where a slide is being projected onto a wall in a darkened room) while any frame that is part of an intro/outro animation won't. So we should be able to take the most-common-largest-brightest-area and then look at every frame and see if its largest bright area is approximately the same - if it's similar enough then it's probably a frame that is part of the projection but if it's too dissimilar then it's probably not.
Rather than waste time going too far down a rabbit hole that I've found won't immediately result in success, I'm going to use a slightly altered version of that plan (I'll explain why in a moment). I'm still going to take that common largest brightest area and compare the largest bright area on each frame to it but, instead of saying "largest-bright-area-is-close-enough-to-the-most-common = presentation frame / largest-bright-area-not-close-enough = intro or outro", I'm going to find the first frame whose largest bright area is close enough and the last frame that is and declare that that range is probably where the frames for the presentation are.
The reason that I'm going to do this is that I found that there are some slides with more variance that can skew the results if the first approach was taken - if a frame in the middle of the presentation is so blurry that the range in intensity from darkest pixel to brightest pixel is squashed down too far then it can result in it identifying a largest bright area that isn't an accurate representation of the image. It's quite possible that I could still have made the first approach work by tweaking some other parameters in the image processing - such as considering changing that arbitrary "create a mask where the intensity threshold is 2/3 of the range of the brightness of all pixels" (maybe 3/4 would have worked better?), for example - but I know that this second approach works for my data and so I didn't pursue the first one too hard.
To do this, though, we are going to need to know what order the frames are supposed to appear in - it's no longer sufficient for there to simply be a list of images that are frames out of the video, we now need to know where they appeared relative to each other. This is simple enough with my data because they all have names like "frame_1052.jpg" where 1052 is the frame index from the original video.
So I'm going to change the frame-image-loading code to look like this:
// Get all filenames, parse the frame index from them and discard any that don't
// match the filename pattern that is expected (eg. "frame_1052.jpg")
var frameIndexMatcher = new Regex(@"frame_(\d+)\.jpg", RegexOptions.IgnoreCase);
var files = new DirectoryInfo("Frames")
.EnumerateFiles()
.Select(file =>
{
var frameIndexMatch = frameIndexMatcher.Match(file.Name);
return frameIndexMatch.Success
? (file.FullName, FrameIndex: int.Parse(frameIndexMatch.Groups[1].Value))
: default;
})
.Where(entry => entry != default);
// Get the largest bright area for each file
var allFrameHighlightedAreas = files
.AsParallel()
.Select(file =>
{
using var image = new Bitmap(file.FullName);
return (
file.FrameIndex,
HighlightedArea: IlluminatedAreaLocator.GetMostHighlightedArea(image)
);
})
.ToArray();
// Get the most common largest bright area across all of the images
var (topLeft, topRight, bottomRight, bottomLeft) = allFrameHighlightedAreas
.GroupBy(entry => entry.HighlightedArea)
.OrderByDescending(group => group.Count())
.Select(group => group.Key)
.FirstOrDefault();
(Note that I'm calling ToArray() when declaring allFrameHighlightedAreas - that's to store the results now because I know that I'm going to need every result in the list that is generated and because I'm going to enumerate it twice in the work outlined here, so there's no point leaving allFrameHighlightedAreas as a lazily-evaluated IEnumerable that would be recalculated each time it was looped over; that would mean repeating all of the IlluminatedAreaLocator.GetMostHighlightedArea calculations for each image every time the list was enumerated, which would just be wasteful!)
Now to look at the allFrameHighlightedAreas list and try to decide if each HighlightedArea value is close enough to the most common area that we found. I'm going to use a very simple algorithm for this - I'm going to:
- look at the HighlightedArea on each entry in allFrameHighlightedAreas
- compare it to the most common area (the topLeft, topRight, bottomRight, bottomLeft values that we already have in the code above)
- take the differences in X value between all four points in these two areas and add them up
- if that total is too large (ie. if the sum of the X differences is greater than 20% of the width of the entire area) then we'll say it's not a match and drop out of this list
- if the X values aren't too bad then we'll take the differences in Y value between all four points in these two areas, add those up and apply the same sort of check against the height of the area
In code:
var highlightedAreaWidth = Math.Max(topRight.X, bottomRight.X) - Math.Min(topLeft.X, bottomLeft.X);
var highlightedAreaHeight = Math.Max(bottomLeft.Y, bottomRight.Y) - Math.Min(topLeft.Y, topRight.Y);
const double thresholdForPointVarianceComparedToAreaSize = 0.2;
var frameIndexesThatHaveTheMostCommonHighlightedArea = allFrameHighlightedAreas
.Where(entry =>
{
var (entryTL, entryTR, entryBR, entryBL) = entry.HighlightedArea;
var xVariance =
new[]
{
entryBL.X - bottomLeft.X,
entryBR.X - bottomRight.X,
entryTL.X - topLeft.X,
entryTR.X - topRight.X
}
.Sum(Math.Abs);
var yVariance =
new[]
{
entryBL.Y - bottomLeft.Y,
entryBR.Y - bottomRight.Y,
entryTL.Y - topLeft.Y,
entryTR.Y - topRight.Y
}
.Sum(Math.Abs);
return
(xVariance <= highlightedAreaWidth * thresholdForPointVarianceComparedToAreaSize) &&
(yVariance <= highlightedAreaHeight * thresholdForPointVarianceComparedToAreaSize);
})
.Select(entry => entry.FrameIndex)
.ToArray();
This gives us a frameIndexesThatHaveTheMostCommonHighlightedArea array of frame indexes that have a largest brightest area that is fairly close to the most common one. So to decide which frames are probably the start of the presentation and the end, we simply need to say:
var firstFrameIndex = frameIndexesThatHaveTheMostCommonHighlightedArea.Min();
var lastFrameIndex = frameIndexesThatHaveTheMostCommonHighlightedArea.Max();
Any frames whose index is less than firstFrameIndex or greater than lastFrameIndex are probably part of the intro or outro sequence - eg.
Any frames whose index is within the firstFrameIndex / lastFrameIndex range are probably part of the presentation - eg.
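For example (a small sketch that reuses the allFrameHighlightedAreas array and the two index values from above), the frame indexes belonging to the presentation itself could be picked out like this:
var presentationFrameIndexes = allFrameHighlightedAreas
.Select(entry => entry.FrameIndex)
.Where(frameIndex => (frameIndex >= firstFrameIndex) && (frameIndex <= lastFrameIndex))
.OrderBy(frameIndex => frameIndex)
.ToArray();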
As the title of this post strongly suggests, this is only the first step in my desire to fix up my blurry presentation video. What I'm going to have to cover in the future is to:
- take the projection area identified in each frame and correct its perspective back into a rectangle
- compare each of those projections to the original slide deck images to try to work out which slide is being shown in which frame
- (possibly) revisit some of the arbitrary values used along the way - eg. in the GetMostHighlightedArea function earlier, I picked 400px as an arbitrary value to resize images to before greyscaling them, masking them and looking for their largest bright area; maybe it will turn out that smaller or larger values for that process result in improved or worsened results - we'll find out!
Once this is all done, I will take the original frame images and, for each one, overlay a clean version of the slide that appeared blurrily in the frame (again, I'll have clean versions of each slide from the original slide deck that I produced, so that should be an easy part) - then I'll mash them all back together into a new video, combined with the original audio. To do this (the video work), I'll likely use the same tool that I used to extract the individual frame files from the video in the first place - the famous FFmpeg!
I doubt that I'll have a post on this last section as it would only be a small amount of C# code that combines two images for each frame and writes the results to disk, followed by me making a command line call to FFmpeg to produce the video - and I don't think that there's anything particularly exciting there! If I get this all completed, though, I will - of course - link to the fixed-up presentation video.. because why not shamelessly plug myself given any opportunity!
Posted at 21:06
7 August 2020
I've been using MyFitnessPal and it has the facility to load nutrition information by scanning the barcode on the product. I can guess how the retrieval works once the barcode number is obtained (a big database somewhere) but it struck me that I had no idea how the reading of the barcode itself worked and.. well, I'm curious and enjoy the opportunity to learn something new to me by writing the code to do it. I do enjoy being able to look up (almost) anything on the internet to find out how it works!
(For anyone who wants to either play along but not copy-paste the code themselves or for anyone who wants to jump to the end result, I've put the code - along with the example image I've used in this post - up on a GitHub repo)
There are two steps required here:
- identify any areas of the image that look like they might contain a barcode
- try to read a barcode value out of each of those areas
As with anything, these steps may be broken down into smaller tasks. The first step is essentially the gradient / masking / flood fill process that is visualised below, while the second step works on the widths of the bars (and of the spaces between them) within each candidate area.
This approach is fairly resilient to changes in lighting and orientation because the barcode regions are still likely to have the highest horizontal intensity gradient whether the image is dark or light (and even if part of the image is light and part of it is dark) and the barcode-reading algorithm works on ratios of bar/space-between-bar widths and these remain constant if the image is rotated.
(Some of the techniques are similar to things that I did in my Face or no face (finding faces in photos using C# and Accord.NET) post and so I'll shortly be using some of the same code that I described then)
Let's use this image as an example to work with (peanut butter.. I do love peanut butter) -
Before looking at any code, let's visualise the process.
We're going to consider horizontal and vertical gradient intensity maps - at every point in the image we either look at the pixels to the left and to the right (for the horizontal gradient) or we look at the pixels above and below (for the vertical gradient) and the larger the change, the brighter the pixel in the gradient intensity map.
And when they're combined by subtracting the vertical gradient at each point from the horizontal gradient, it looks like this:
If this image is blurred then we get this:
.. and if we create a binary mask by saying "normalise the intensity values so that their range goes from zero (for the darkest pixel) to one (for the brightest) and then set any pixels that are in the bottom third in terms of intensity to 0 and set the rest to 1" then we get this:
If each distinct area (where an "area" means "a group of pixels that are connected") is identified and squares overlaid and centered around the areas then we see this:
.. and if the areas whose bounding squares overlap are combined and then cropped around the white pixels then we end up with this:
This has identified the area around the barcode and also two tiny other areas - when we come to trying to read barcode numbers out of these, the tiny regions will result in no value while the area around the genuine barcode content should result in a number successfully being read. But I'm getting ahead of myself.. let's look at the code required to perform the above transformations.
I'm going to start with a DataRectangle for performing transformations -
public static class DataRectangle
{
public static DataRectangle<T> For<T>(T[,] values) => new DataRectangle<T>(values);
}
public sealed class DataRectangle<T>
{
private readonly T[,] _protectedValues;
public DataRectangle(T[,] values) : this(values, isolationCopyMayBeBypassed: false) { }
private DataRectangle(T[,] values, bool isolationCopyMayBeBypassed)
{
if ((values.GetLowerBound(0) != 0) || (values.GetLowerBound(1) != 0))
throw new ArgumentException("Both dimensions must have lower bound zero");
var arrayWidth = values.GetUpperBound(0) + 1;
var arrayHeight = values.GetUpperBound(1) + 1;
if ((arrayWidth == 0) || (arrayHeight == 0))
throw new ArgumentException("zero element arrays are not supported");
Width = arrayWidth;
Height = arrayHeight;
if (isolationCopyMayBeBypassed)
_protectedValues = values;
else
{
_protectedValues = new T[Width, Height];
Array.Copy(values, _protectedValues, Width * Height);
}
}
/// <summary>
/// This will always be greater than zero
/// </summary>
public int Width { get; }
/// <summary>
/// This will always be greater than zero
/// </summary>
public int Height { get; }
public T this[int x, int y]
{
get
{
if ((x < 0) || (x >= Width))
throw new ArgumentOutOfRangeException(nameof(x));
if ((y < 0) || (y >= Height))
throw new ArgumentOutOfRangeException(nameof(y));
return _protectedValues[x, y];
}
}
public IEnumerable<Tuple<Point, T>> Enumerate(Func<Point, T, bool>? optionalFilter = null)
{
for (var x = 0; x < Width; x++)
{
for (var y = 0; y < Height; y++)
{
var value = _protectedValues[x, y];
var point = new Point(x, y);
if (optionalFilter?.Invoke(point, value) ?? true)
yield return Tuple.Create(point, value);
}
}
}
public DataRectangle<TResult> Transform<TResult>(Func<T, TResult> transformer)
{
return Transform((value, coordinates) => transformer(value));
}
public DataRectangle<TResult> Transform<TResult>(Func<T, Point, TResult> transformer)
{
var transformed = new TResult[Width, Height];
for (var x = 0; x < Width; x++)
{
for (var y = 0; y < Height; y++)
transformed[x, y] = transformer(_protectedValues[x, y], new Point(x, y));
}
return new DataRectangle<TResult>(transformed, isolationCopyMayBeBypassed: true);
}
}
And then I'm going to add a way to load image data into this structure -
public static class BitmapExtensions
{
/// <summary>
/// This will return values in the range 0-255 (inclusive)
/// </summary>
// Based on http://stackoverflow.com/a/4748383/3813189
public static DataRectangle<double> GetGreyscale(this Bitmap image)
{
var values = new double[image.Width, image.Height];
var data = image.LockBits(
new Rectangle(0, 0, image.Width, image.Height),
ImageLockMode.ReadOnly,
PixelFormat.Format24bppRgb
);
try
{
var pixelData = new Byte[data.Stride];
for (var lineIndex = 0; lineIndex < data.Height; lineIndex++)
{
Marshal.Copy(
source: data.Scan0 + (lineIndex * data.Stride),
destination: pixelData,
startIndex: 0,
length: data.Stride
);
for (var pixelOffset = 0; pixelOffset < data.Width; pixelOffset++)
{
// Note: PixelFormat.Format24bppRgb means the data is stored in memory as BGR
const int PixelWidth = 3;
var r = pixelData[pixelOffset * PixelWidth + 2];
var g = pixelData[pixelOffset * PixelWidth + 1];
var b = pixelData[pixelOffset * PixelWidth];
values[pixelOffset, lineIndex] = (0.2989 * r) + (0.5870 * g) + (0.1140 * b);
}
}
}
finally
{
image.UnlockBits(data);
}
return DataRectangle.For(values);
}
}
With these classes, we can load an image and calculate the combined horizontal-gradient-minus-vertical-gradient value like this:
private static IEnumerable<Rectangle> GetPossibleBarcodeAreasForBitmap(Bitmap image)
{
var greyScaleImageData = image.GetGreyscale();
var combinedGradients = greyScaleImageData.Transform((intensity, pos) =>
{
// Consider gradients to be zero at the edges of the image because there aren't pixels
// both left/right or above/below and so it's not possible to calculate a real value
var horizontalChange = (pos.X == 0) || (pos.X == greyScaleImageData.Width - 1)
? 0
: greyScaleImageData[pos.X + 1, pos.Y] - greyScaleImageData[pos.X - 1, pos.Y];
var verticalChange = (pos.Y == 0) || (pos.Y == greyScaleImageData.Height - 1)
? 0
: greyScaleImageData[pos.X, pos.Y + 1] - greyScaleImageData[pos.X, pos.Y - 1];
return Math.Max(0, Math.Abs(horizontalChange) - Math.Abs(verticalChange));
});
// .. more will go here soon
}
Before jumping straight into the image analysis, though, it's worth resizing the source image if it's large. Since this stage of the processing is looking for areas that look approximately like barcodes, we don't require a lot of granularity - I'm envisaging (as with the MyFitnessPal use case) source images where the barcode takes up a significant space in the image and is roughly aligned with the view port* and so resizing the image such that the largest side is 300px should work well. If you wanted to scan an image where there were many barcodes to process (or even where there was only one but it was very small) then you might want to allow larger inputs than this - the more data that there is, though, the more work that must be done and the slower that the processing will be!
* (The barcode has to be roughly aligned with the viewport because, otherwise, the approach of looking for areas with large horizontal variance in intensity but only minor vertical variance would not work - as we'll see later, though, there is considerable margin for error in this approach and perfect alignment is not required)
A naive approach to this would be to force the image so that its largest side is 300px, regardless of what it was originally. However, this is unnecessary if the largest side is already less than 300px (scaling it up will actually give us more work to do) and if the largest side is not much more than 300px then it's probably not worth doing either - scaling it down may make any barcode areas fuzzy and risk reducing the effectiveness of the processing while not actually reducing the required work. So I'm going to say that if the largest side of the image is 450px or larger then resize it so that its largest side is 300px and do nothing otherwise. To achieve that, we need a method like this:
private static DataRectangle<double> GetGreyscaleData(
Bitmap image,
int resizeIfLargestSideGreaterThan,
int resizeTo)
{
var largestSide = Math.Max(image.Width, image.Height);
if (largestSide <= resizeIfLargestSideGreaterThan)
return image.GetGreyscale();
int width, height;
if (image.Width > image.Height)
{
width = resizeTo;
height = (int)(((double)image.Height / image.Width) * width);
}
else
{
height = resizeTo;
width = (int)(((double)image.Width / image.Height) * height);
}
using var resizedImage = new Bitmap(image, width, height);
return resizedImage.GetGreyscale();
}
The next steps are to "normalise" the combined intensity variance values so that they fit the range zero-to-one, to "blur" this data and to then create a binary mask where the brighter pixels get set to one and the darker pixels get set to zero. In other words, to extend the code earlier (that calculated the intensity variance values) like this:
private static IEnumerable<Rectangle> GetPossibleBarcodeAreasForBitmap(Bitmap image)
{
var greyScaleImageData = GetGreyscaleData(
image,
resizeIfLargestSideGreaterThan: 450,
resizeTo: 300
);
var combinedGradients = greyScaleImageData.Transform((intensity, pos) =>
{
// Consider gradients to be zero at the edges of the image because there aren't pixels
// both left/right or above/below and so it's not possible to calculate a real value
var horizontalChange = (pos.X == 0) || (pos.X == greyScaleImageData.Width - 1)
? 0
: greyScaleImageData[pos.X + 1, pos.Y] - greyScaleImageData[pos.X - 1, pos.Y];
var verticalChange = (pos.Y == 0) || (pos.Y == greyScaleImageData.Height - 1)
? 0
: greyScaleImageData[pos.X, pos.Y + 1] - greyScaleImageData[pos.X, pos.Y - 1];
return Math.Max(0, Math.Abs(horizontalChange) - Math.Abs(verticalChange));
});
const int maxRadiusForGradientBlurring = 2;
const double thresholdForMaskingGradients = 1d / 3;
var mask = Blur(Normalise(combinedGradients), maxRadiusForGradientBlurring)
.Transform(value => (value >= thresholdForMaskingGradients));
// .. more will go here soon
}
To do that, we need a "Normalise" method - which is simple:
private static DataRectangle<double> Normalise(DataRectangle<double> values)
{
var max = values.Enumerate().Max(pointAndValue => pointAndValue.Item2);
return (max == 0)
? values
: values.Transform(value => (value / max));
}
.. and a "Blur" method - which is a little less simple but hopefully still easy enough to follow (for every point, look at the points around it and take an average of all of them; it just looks for a square area, which is fine for small "maxRadius" values but which might be better implemented as a circular area if large "maxRadius" values might be needed, which they aren't in this code):
private static DataRectangle<double> Blur(DataRectangle<double> values, int maxRadius)
{
return values.Transform((value, point) =>
{
var valuesInArea = new List<double>();
for (var x = -maxRadius; x <= maxRadius; x++)
{
for (var y = -maxRadius; y <= maxRadius; y++)
{
var newPoint = new Point(point.X + x, point.Y + y);
if ((newPoint.X < 0) || (newPoint.Y < 0)
|| (newPoint.X >= values.Width) || (newPoint.Y >= values.Height))
continue;
valuesInArea.Add(values[newPoint.X, newPoint.Y]);
}
}
return valuesInArea.Average();
});
}
This gets us to this point:
.. which feels like good progress!
Now we need to try to identify distinct "islands" of pixels where each "island" or "object" is a set of points that are within a single connected area. A straightforward way to do that is to look at every point in the mask that is set to 1 and either assign it to an existing object (if it is adjacent to a point that has already been assigned to one) or use it as the starting point from which to flood fill a brand new object. This was made easier for me by reading the article Flood Fill algorithm (using C#.Net)..
private static IEnumerable<IEnumerable<Point>> GetDistinctObjects(DataRectangle<bool> mask)
{
// Flood fill areas in the looks-like-bar-code mask to create distinct areas
var allPoints = new HashSet<Point>(
mask.Enumerate(optionalFilter: (point, isMasked) => isMasked).Select(point => point.Item1)
);
while (allPoints.Any())
{
var currentPoint = allPoints.First();
var pointsInObject = GetPointsInObject(currentPoint).ToArray();
foreach (var point in pointsInObject)
allPoints.Remove(point);
yield return pointsInObject;
}
// Inspired by code at
// https://simpledevcode.wordpress.com/2015/12/29/flood-fill-algorithm-using-c-net/
IEnumerable<Point> GetPointsInObject(Point startAt)
{
var pixels = new Stack<Point>();
pixels.Push(startAt);
var valueAtOriginPoint = mask[startAt.X, startAt.Y];
var filledPixels = new HashSet<Point>();
while (pixels.Count > 0)
{
var currentPoint = pixels.Pop();
if ((currentPoint.X < 0) || (currentPoint.X >= mask.Width)
|| (currentPoint.Y < 0) || (currentPoint.Y >= mask.Height))
continue;
if ((mask[currentPoint.X, currentPoint.Y] == valueAtOriginPoint)
&& !filledPixels.Contains(currentPoint))
{
filledPixels.Add(new Point(currentPoint.X, currentPoint.Y));
pixels.Push(new Point(currentPoint.X - 1, currentPoint.Y));
pixels.Push(new Point(currentPoint.X + 1, currentPoint.Y));
pixels.Push(new Point(currentPoint.X, currentPoint.Y - 1));
pixels.Push(new Point(currentPoint.X, currentPoint.Y + 1));
}
}
return filledPixels;
}
}
The problem is that, even with the blurring we performed, there will likely be some groups of distinct objects that are actually part of a single barcode. These areas need to be joined together. It's quite possible for there to be relatively large gaps in the middle of barcodes (there are in the example that we've been looking at) and so we might not easily be able to just take the distinct objects that we've got and join together areas that seem "close enough".
On the basis that individual bars in a barcode are tall compared to the largest possible width that any of them can be (which I'll go into more detail about later on), it seems like a reasonable idea to take any areas that are taller than they are wide and expand their width until they become square. That would give us this:
We'd then work out which of these "squared off" rectangles overlap (if any) and replace overlapping rectangles with rectangles that cover their combined areas, which would look like this:
The only problem with this is that the combined rectangles extend too far to the left and right of the areas, so we need to trim them down. This will be fairly straightforward because we have the information about what distinct objects there are and each object is just a list of points - so we work out which objects have points within each combined bounding area and then, across all of the objects for that area, find the smallest and largest "x" and "y" values. That way, we can shrink each combined bounding area so that it only covers actual barcode pixels. Which would leave us with this:
That might sound like a lot of complicated work but if we take a bit of a brute force* approach to it then it can be expressed like this:
private static IEnumerable<Rectangle> GetOverlappingObjectBounds(
IEnumerable<IEnumerable<Point>> objects)
{
// Translate each "object" (a list of connected points) into a bounding box (squared off if
// it was taller than it was wide)
var squaredOffBoundedObjects = new HashSet<Rectangle>(
objects.Select((points, index) =>
{
var bounds = Rectangle.FromLTRB(
points.Min(p => p.X),
points.Min(p => p.Y),
points.Max(p => p.X) + 1,
points.Max(p => p.Y) + 1
);
if (bounds.Height > bounds.Width)
bounds.Inflate((bounds.Height - bounds.Width) / 2, 0);
return bounds;
})
);
// Loop over the boundedObjects and reduce the collection by merging any two rectangles
// that overlap and then starting again until there are no more bounds merges to perform
while (true)
{
var combinedOverlappingAreas = false;
foreach (var bounds in squaredOffBoundedObjects)
{
foreach (var otherBounds in squaredOffBoundedObjects)
{
if (otherBounds == bounds)
continue;
if (bounds.IntersectsWith(otherBounds))
{
squaredOffBoundedObjects.Remove(bounds);
squaredOffBoundedObjects.Remove(otherBounds);
squaredOffBoundedObjects.Add(Rectangle.FromLTRB(
Math.Min(bounds.Left, otherBounds.Left),
Math.Min(bounds.Top, otherBounds.Top),
Math.Max(bounds.Right, otherBounds.Right),
Math.Max(bounds.Bottom, otherBounds.Bottom)
));
combinedOverlappingAreas = true;
break;
}
}
if (combinedOverlappingAreas)
break;
}
if (!combinedOverlappingAreas)
break;
}
return squaredOffBoundedObjects.Select(bounds =>
{
var allPointsWithinBounds = objects
.Where(points => points.Any(point => bounds.Contains(point)))
.SelectMany(points => points)
.ToArray(); // Don't re-evaluate in the four accesses below
return Rectangle.FromLTRB(
left: allPointsWithinBounds.Min(p => p.X),
right: allPointsWithinBounds.Max(p => p.X) + 1,
top: allPointsWithinBounds.Min(p => p.Y),
bottom: allPointsWithinBounds.Max(p => p.Y) + 1
);
});
}
* (There are definitely more efficient ways that this could be done but since we're only looking at 300px images then we're not likely to end up with huge amounts of data to deal with)
To complete the process, we need to do three more things: discard any bounded areas that are taller than they are wide (a barcode region should always be wider than it is tall), expand each remaining area out a little so that we don't clip the edges of the barcode (clamping it to the image bounds), and scale the areas back up to match the dimensions of the original image (since this analysis may have been performed on a shrunk-down version of it).
To do that, we'll start from this code that we saw earlier:
var mask = Blur(Normalise(combinedGradients), maxRadiusForGradientBlurring)
.Transform(value => (value >= thresholdForMaskingGradients));
.. and expand it like so (removing the "// .. more will go here soon" comment), using the methods above:
// Determine how much the image was scaled down (if it had to be scaled down at all)
// by comparing the width of the potentially-scaled-down data to the source image
var reducedImageSideBy = (double)image.Width / greyScaleImageData.Width;
var mask = Blur(Normalise(combinedGradients), maxRadiusForGradientBlurring)
.Transform(value => (value >= thresholdForMaskingGradients));
return GetOverlappingObjectBounds(GetDistinctObjects(mask))
.Where(boundedObject => boundedObject.Width > boundedObject.Height)
.Select(boundedObject =>
{
var expandedBounds = boundedObject;
expandedBounds.Inflate(width: expandedBounds.Width / 10, height: 0);
expandedBounds.Intersect(
Rectangle.FromLTRB(0, 0, greyScaleImageData.Width, greyScaleImageData.Height)
);
return new Rectangle(
x: (int)(expandedBounds.X * reducedImageSideBy),
y: (int)(expandedBounds.Y * reducedImageSideBy),
width: (int)(expandedBounds.Width * reducedImageSideBy),
height: (int)(expandedBounds.Height * reducedImageSideBy)
);
});
The final result is that the barcode has been successfully located on the image - hurrah!
With this information, we should be able to extract regions or "sub images" from the source image and attempt to decipher the barcode value in each one (presuming that there IS a bar code in it and we haven't got a false positive match).
As we'll see in a moment, the barcode doesn't have to be perfectly lined up - some rotation is acceptable (depending upon the image, up to around 20 or 30 degrees should be fine). The MyFitnessPal app has a couple of fallbacks that I've noticed, such as being able to read barcodes that are upside down or even back to front (which can happen if a barcode is scanned from the wrong side of a transparent wrapper). While I won't be writing code here for either of those approaches, I'm sure that you could envisage how it could be done - the source image data could be processed as described here and then, if no barcode is read, rotated 180 degrees and re-processed and reversed and re-processed, etc..
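To make that fallback idea concrete, it might look something like the sketch below. Note that this isn't something that exists in the GitHub project - it's just a hedged illustration of the approach described above, leaning on the "TryToReadBarcodeValue" method that will be shown a little further down:
private static string? TryToReadBarcodeValueWithFallbacks(Bitmap subImage)
{
    // Try the image as-is, then upside down, then mirrored (as if scanned through the back of
    // a transparent wrapper) - working on copies so that the caller's bitmap isn't mutated
    var transformations = new[]
    {
        RotateFlipType.RotateNoneFlipNone,
        RotateFlipType.Rotate180FlipNone,
        RotateFlipType.RotateNoneFlipX
    };
    foreach (var transformation in transformations)
    {
        using var copy = new Bitmap(subImage);
        copy.RotateFlip(transformation);
        var value = TryToReadBarcodeValue(copy);
        if (value is object)
            return value;
    }
    return null;
}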
A barcode is comprised of both black and white bars - so it's not just the black parts that are significant, it is the spaces between them as well.
The format of a barcode is as follows:
The numbers are encoded using the following system:
Digit Bar widths
0 3, 2, 1, 1
1 2, 2, 2, 1
2 2, 1, 2, 2
3 1, 4, 1, 1
4 1, 1, 3, 2
5 1, 2, 3, 1
6 1, 1, 1, 4
7 1, 3, 1, 2
8 1, 2, 1, 3
9 3, 1, 1, 2
(Note that every combination of values totals 7 when they are added up - this is very helpful later!)
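As a quick sanity check of that claim (and because these bar widths will reappear as a lookup table later on), here's a tiny snippet - the array contents mirror the chart above and the variable name is mine:
var digitBarWidths = new[]
{
    new[] { 3, 2, 1, 1 }, // 0
    new[] { 2, 2, 2, 1 }, // 1
    new[] { 2, 1, 2, 2 }, // 2
    new[] { 1, 4, 1, 1 }, // 3
    new[] { 1, 1, 3, 2 }, // 4
    new[] { 1, 2, 3, 1 }, // 5
    new[] { 1, 1, 1, 4 }, // 6
    new[] { 1, 3, 1, 2 }, // 7
    new[] { 1, 2, 1, 3 }, // 8
    new[] { 3, 1, 1, 2 }  // 9
};
Console.WriteLine(digitBarWidths.All(widths => widths.Sum() == 7)); // True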
To see what that looks like in the real world, here's a slice of that barcode from the jar of peanut butter with each section and each numeric value identified:
(I should point out that the article How does a barcode work? was extremely helpful in the research I did for this post and I'm very grateful to the author for having written it in such an approachable manner!)
Any combination of bar widths that is not found in the table is considered to be invalid. On the one hand, you might think that this is a potential loss; the format could support more combinations of bar widths to encode more values and then more data could be packed into the same space. There is an advantage, however, to having relatively few valid combinations of bar widths - it makes it easier to tell whether the information being read appears to be correct. If a combination is encountered that seems incorrect then the read attempt should be aborted and retried. The format has existed for decades and it would make sense, bearing that in mind, to prioritise making it easier for the hardware to read rather than prioritising trying to cram as much data in there as possible. There is also a checksum included in the numerical data to try to catch any "misreads" but when working with low resolutions or hardware with little computing power, the easier that it is to bail out of a scan and to retry, the better.
The way to tackle the reading is to: convert the sub image to greyscale and create a binary mask from it (where dark pixels become 1 and light pixels become 0); go down that mask one horizontal line at a time and, for each line, record the lengths of the runs of alternating dark and light content (the bars); skip over the guard regions and translate each group of four bar lengths into a digit using the lookup table above (estimating the width of a single bar from the fact that each group of four should total seven); and, finally, check the twelfth digit against the checksum calculated from the first eleven. If any of this fails for one line then we simply move on to the next line down.
This processing is fairly light computationally and so there is no need to resize the "may be a barcode" image region before attempting the work. In fact, it's beneficial to not shrink it as shrinking it will likely make the barcode section fuzzier and that makes the above steps less likely to work - the ideal case for creating a binary mask is where there is no significant "seepage" of pixel intensity between the black bar areas and the white bar areas. That's not to say that the images have to be crystal clear or perfectly aligned with the camera because the redundancy built into the format works in our favour here - if one line across the image can't be read because it's fuzzy then there's a good chance that one of the other lines will be legible.
60 length values is the precise number that we expect to find - there is expected to be some blank space before the barcode starts (1) and then a guard section of three single-width lines that we use to gauge bar width (3) and then six numbers that are encoded in four bars each (6x4=24) and then a guard section of five single-width lines (5) and then six numbers (6x4=24) and then a final guard region of three single-width bars, giving 1+3+24+5+24+3=60.
There will likely be another section of blank content after the barcode, which we ignore
If we don't want to validate the final guard region then we can work with a barcode image where some of the end is cut off, so long as the data for the 12 digits is there; in this case, 57 lengths is the minimum number that we can accept
I'm going to try to present the code in approximately the same order as the steps presented above. So, firstly, we need to convert the sub image to greyscale and create a binary mask from it. Then we'll go line by line down the image data and try to read a value. So we'll take this:
public static string? TryToReadBarcodeValue(Bitmap subImage)
{
const double threshold = 0.5;
// Black lines are considered 1 and so we set to true if it's a dark pixel (and 0 if light)
var mask = subImage.GetGreyscale().Transform(intensity => intensity < (256 * threshold));
for (var y = 0; y < mask.Height; y++)
{
var value = TryToReadBarcodeValueFromSingleLine(mask, y);
if (value is object)
return value;
}
return null;
}
.. and the read-each-slice-of-the-image code looks like this:
private static string? TryToReadBarcodeValueFromSingleLine(
DataRectangle<bool> barcodeDetails,
int sliceY)
{
if ((sliceY < 0) || (sliceY >= barcodeDetails.Height))
throw new ArgumentOutOfRangeException(nameof(sliceY));
var lengths = GetBarLengthsFromBarcodeSlice(barcodeDetails, sliceY).ToArray();
if (lengths.Length < 57)
{
// As explained, we'd like 60 bars (which would include the final guard region) but we
// can still make an attempt with 57 (but no fewer)
// - There will often be another section of blank content after the barcode that we ignore
// - If we don't want to validate the final guard region then we can work with a barcode
// image where some of the end is cut off, so long as the data for the 12 digits is
// there (this will be the case where there are only 57 lengths)
return null;
}
var offset = 0;
var extractedNumericValues = new List<int>();
for (var i = 0; i < 14; i++)
{
if (i == 0)
{
// This should be the first guard region and it should be a pattern of three single-
// width bars
offset += 3;
}
else if (i == 7)
{
// This should be the guard region in the middle of the barcode and it should be a
// pattern of five single-width bars
offset += 5;
}
else
{
var value = TryToGetValueForLengths(
lengths[offset],
lengths[offset + 1],
lengths[offset + 2],
lengths[offset + 3]
);
if (value is null)
return null;
extractedNumericValues.Add(value.Value);
offset += 4;
}
}
// Calculate what the checksum should be based upon the first 11 numbers and ensure that
// the 12th matches it
if (extractedNumericValues.Last() != CalculateChecksum(extractedNumericValues.Take(11)))
return null;
return string.Join("", extractedNumericValues);
}
With the code below, we find the runs of continuous 0 or 1 values that will represent bars and return that list (again, for larger images each run will be longer and for smaller images each run will be shorter but this will be taken care of later) -
private static IEnumerable<int> GetBarLengthsFromBarcodeSlice(
DataRectangle<bool> barcodeDetails,
int sliceY)
{
if ((sliceY < 0) || (sliceY >= barcodeDetails.Height))
throw new ArgumentOutOfRangeException(nameof(sliceY));
// Take the horizontal slice of the data
var values = new List<bool>();
for (var x = 0; x < barcodeDetails.Width; x++)
values.Add(barcodeDetails[x, sliceY]);
// Split the slice into bars - we only care about how long each segment is when they
// alternate, not whether they're dark bars or light bars
var segments = new List<Tuple<bool, int>>();
foreach (var value in values)
{
if ((segments.Count == 0) || (segments[^1].Item1 != value))
segments.Add(Tuple.Create(value, 1));
else
segments[^1] = Tuple.Create(value, segments[^1].Item2 + 1);
}
if ((segments.Count > 0) && !segments[0].Item1)
{
// Remove the white space before the first bar
segments.RemoveAt(0);
}
return segments.Select(segment => segment.Item2);
}
Now we need to implement the "TryToGetValueForLengths" method that "TryToReadBarcodeValueFromSingleLine" calls. This takes four bar lengths that are thought to represent a single digit in the bar code value (ie. they are not part of a guard region or anything like that) and guesses how many pixels across a single bar would be - which is made much simpler by the fact that all of the possible combinations of bar lengths in the lookup chart that we saw earlier add up to 7.
There's a little flexibility introduced here to try to account for a low quality image or for the threshold being a bit strong in the creation of the binary mask; we take that calculated expected width of a single bar and tweak it up or down a little if dividing the bar lengths by it means that some of the bars became so small that they disappeared, or so large that the total width came out as more than seven single-width bars. There's only a little flexibility here because, if we fail, we can always try another line of the image! (Or maybe it will turn out that this sub image was a false positive match and there isn't a bar code in it at all).
private static int? TryToGetValueForLengths(int l0, int l1, int l2, int l3)
{
if (l0 <= 0)
throw new ArgumentOutOfRangeException(nameof(l0));
if (l1 <= 0)
throw new ArgumentOutOfRangeException(nameof(l1));
if (l2 <= 0)
throw new ArgumentOutOfRangeException(nameof(l2));
if (l3 <= 0)
throw new ArgumentOutOfRangeException(nameof(l3));
// Take a guess at what the width of a single bar is based upon these four values
// (the four bars that encode a number should add up to a width of seven)
var raw = new[] { l0, l1, l2, l3 };
var singleWidth = raw.Sum() / 7d;
var adjustment = singleWidth / 10;
var attemptedSingleWidths = new HashSet<double>();
while (true)
{
var normalised = raw.Select(x => Math.Max(1, (int)Math.Round(x / singleWidth))).ToArray();
var sum = normalised.Sum();
if (sum == 7)
return TryToGetNumericValue(normalised[0], normalised[1], normalised[2], normalised[3]);
attemptedSingleWidths.Add(singleWidth);
if (sum > 7)
singleWidth += adjustment;
else
singleWidth -= adjustment;
if (attemptedSingleWidths.Contains(singleWidth))
{
// If we've already tried this width-of-a-single-bar value then give up -
// it doesn't seem like we can make the input values make sense
return null;
}
}
static int? TryToGetNumericValue(int i0, int i1, int i2, int i3)
{
var lookFor = string.Join("", new[] { i0, i1, i2, i3 });
var lookup = new[]
{
// These values correspond to the lookup chart shown earlier
"3211", "2221", "2122", "1411", "1132", "1231", "1114", "1312", "1213", "3112"
};
for (var i = 0; i < lookup.Length; i++)
{
if (lookFor == lookup[i])
return i;
}
return null;
}
}
Finally we need the CalculateChecksum method (as noted in the code, there's a great explanation of how to do this on Wikipedia) -
private static int CalculateChecksum(IEnumerable<int> values)
{
if (values == null)
throw new ArgumentNullException(nameof(values));
if (values.Count() != 11)
throw new ArgumentException("Should be provided with precisely 11 values");
// See https://en.wikipedia.org/wiki/Check_digit#UPC
var checksumTotal = values
.Select((value, index) => (index % 2 == 0) ? (value * 3) : value)
.Sum();
var checksumModulo = checksumTotal % 10;
if (checksumModulo != 0)
checksumModulo = 10 - checksumModulo;
return checksumModulo;
}
With this code, we have executed all of the planned steps outlined before.
It should be noted that, even with the small amount of flexibility in the "TryToGetValueForLengths" method, in the peanut butter bar code example it requires 15 calls to "GetBarLengthsFromBarcodeSlice" until a bar code is successfully matched! Presumably, this is because there is a little more distortion further up the bar code due to the curve of the jar.
That's not to say, however, that this approach to barcode reading is particularly fussy. The redundancy and simplicity, not to mention the size of the average bar code, means that there is plenty of opportunity to try reading a sub image in multiple slices until one of them does match. In fact, I mentioned earlier that the barcode doesn't have to be perfectly at 90 degrees in order to be interpretable and that some rotation is acceptable. This hopefully makes some intuitive sense based upon the logic above and how it doesn't matter how long each individual bar code line is because they are averaged out - if a bar code was rotated a little and then a read was attempted of it line by line then the ratios of the bar widths within each line should remain consistent and the same data should be readable.
To illustrate, here's a zoomed-in section of the middle of the peanut butter bar code in the orientation shown so far:
If we then rotate it like this:
.. then the code above will still read the value correctly because a strip across the rotated bar code looks like this:
Hopefully it's clear enough that, for each given line, the ratios are essentially the same as for the non-rotated strip:
To get a reading from an image that is rotated more than this requires a very clear source image and will still be limited by the first stage of processing - that tried to find sections where the horizontal image intensity changed with steep gradients but the vertical intensity did not. If the image is rotated too much then there will be more vertical image intensity differences encountered and it is less likely to identify it as a "maybe a bar code" region.
(Note: I experimented with rotated images that were produced by an online barcode generator and had more success - meaning that I could rotate them more than I could with real photographs - but that's because those images are generated with stark black and white and the horizontal / vertical intensity gradients are maintained for longer when the image is rotated if they start with such a high level of clarity.. I'm more interested in reading values from real photographs and so I would suggest that only fairly moderate rotation will work - though it would still be plenty for a MyFitnessPal-type app that expects the User to hold the bar code in roughly the right orientation!)
We've looked at the separate steps involved in the whole reading process; all that is left is to combine them. The "GetPossibleBarcodeAreasForBitmap" and "TryToReadBarcodeValue" methods can be put together into a fully functioning program like this:
static void Main()
{
using var image = new Bitmap("Source.jpg");
var barcodeValues = new List<string>();
foreach (var area in GetPossibleBarcodeAreasForBitmap(image))
{
using var areaBitmap = new Bitmap(area.Width, area.Height);
using (var g = Graphics.FromImage(areaBitmap))
{
g.DrawImage(
image,
destRect: new Rectangle(0, 0, areaBitmap.Width, areaBitmap.Height),
srcRect: area,
srcUnit: GraphicsUnit.Pixel
);
}
var valueFromBarcode = TryToReadBarcodeValue(areaBitmap);
if (valueFromBarcode is object)
barcodeValues.Add(valueFromBarcode);
}
if (!barcodeValues.Any())
Console.WriteLine("Couldn't read any bar codes from the source image :(");
else
{
Console.WriteLine("Read the following bar code(s) from the image:");
foreach (var barcodeValue in barcodeValues)
Console.WriteLine("- " + barcodeValue);
}
Console.WriteLine();
Console.WriteLine("Press [Enter] to terminate..");
Console.ReadLine();
}
And with that, we're finally done! I must admit that I started writing this post about three years ago and it's been in my TODO list for a loooooong time now. But I've taken a week off work and been able to catch up with a few things and have finally been able to cross it off the list. And I'm quite relieved that I didn't give up on it entirely because it was a fun little project and coming back to it now allowed me to tidy it up a bit with the newer C# 8 syntax and even enable the nullable reference types option on the project (I sure do hate unintentional nulls being allowed to sneak in!)
A quick reminder if you want to see it in action or play about with it yourself - the GitHub repo is here.
Thanks to anyone that read this far!
Posted at 23:24
12 February 2017
I've always been intrigued as to how facial detection and recognition is handled by computers. It's one of the few effects that is seen in Hollywood films that seems convincing - where someone is tracked through a city, live CCTV feeds finding their face as they move from one street to the next. For a more mundane example, facial detection has been built into digital cameras and smartphones for quite some time now (my phone puts green squares around faces in photos and, I presume, uses this information to decide where to focus).
I knew literally nothing about the subject to begin with and so I hit the internet and tried to find out everything I could about it. Turns out that there's a lot that's been written about this, with various different methods and approaches discussed and documented. It's an interesting subject because it's been so affected by the rate of change in technology - for example, the digital cameras that introduced it had access to much less processing power than modern day devices do. I've had a hard time finding definitive information about when digital cameras started offering this facility but I'm getting a rough impression that it's over the last ten years (citation needed!) - when you think about how phones, for example, have changed in ten years in terms of their power.. well, it's a lot!
Having said that, digital cameras (and web cams) can take some shortcuts and only concern themselves with subjects that are facing directly towards the camera, since these are more likely to be subjects that are of interest (it's not as important to worry about people looking at ninety degrees from the camera lens, for example). So, straight away, it's apparent that there are going to be different techniques and different requirements for different situations and some will be more expensive than others.
I suspect that this is going to be an ongoing investigation for me; so far I've only managed to implement facial detection functionality and not facial recognition (once a face has been detected, is it someone from a database of known faces?). I thought I'd present what I have at this point, though - if it's interesting to me then maybe it will be of interest to other people! And I couldn't find much material out there that goes into enough detail that it's reproducible without being a really dense research paper (and certainly not ones that make it easy to reproduce their findings using C#), so I thought that I'd try to fill that gap. There are articles out there about using libraries to do all the hard work for you - but where's the fun in that?! I want to know how the code I'm calling is working.
My current code (github.com/ProductiveRage/FacialRecognition) identifies faces in photos by using two passes. First, it looks for regions of skin that have holes (that we presume may be eyes and a nose) and that are very roughly the right proportions to be a face.
The image content from these regions is passed through a "linear support vector machine", which will have been trained to classify a potential image region as a face or as not-a-face.
Before embarking on this adventure, I had no idea what a "support vector machine" even was. If you don't either then maybe you'll come away from this feeling like you know a little something about machine learning. I know that I did! And "machine learning" is a phrase that I've been interested in finding out more about for quite some time now, it seems like it's cropping up everywhere! I have a strong feeling that trying to perform facial recognition is going to involve even more magical sounding terms - such as "deep convolutional neural network" - but that's going to have to wait for another day, it's one (baby) step at a time for me.
In a past life, I worked in IT Support and, one day, I was asked by the Head Clinician why I'd been printing out porn. Which confused me considerably since I had most certainly not been printing out explicit material. There had been complaints about the colour tone and contrast of a new colour laser printer and so I'd sent a full-page test image to it that included greyscale content, colour grids and images that contained a lot of skin. Apparently printing out pictures with lots of skin in it raises eyebrows and maybe it's for similar reasons that a research paper entitled "Naked People Skin Filter" amusingly sounds a touch pervy (or maybe that's just a British thing.. we do love to keep most of our naked-skin-unveiling behind closed doors - because it's so cold!).
This "Naked People Skin Filter" paper was published by Margaret M. Fleck and David A Forsyth and is available at http://mfleck.cs.illinois.edu/naked-skin.html. It essentially describes an algorithm to detect regions of skin in an image. In 1997, another paper "Face Detection in Color Images" described starting with that skin-tone-identifying approach and then using it to try to find solid regions of skin that have a few holes in, on the basis that this is probably a face (where the holes are eyes, nose, mouth).
Note: That "Face Detection in Color Images" link use the Wayback Machine to make available a GeoCities page. Maybe it's a sad reminder that there was actually worthwhile content on GeoCities as well as the many harshly-coloured-and-bizarrely-animated eye-watering monstrosities. By the way, if you (quite rightly, imho) think that the web has lost something by having less places for people to upload content for free then check out NeoCities - it's awesome because it makes that possible again (and all the good and bad that comes with it)!
"Face Detection in Color Images" uses an example image of Tiger Woods and so I will use that same image while I discuss what I've done, based upon these articles.
There are a set of steps that are followed when trying to identify faces in an image:
The first few steps are fairly simple. If the image is either taller or wider than 400px then it's shrunk down so that the largest side is 400px. "Naked People Skin Filter" talks about applying "smoothing" (which I'll cover in a moment) relative to the size of the image but doesn't suggest any restrictions on the image size. "Face Detection in Color Images" mentions that only images up to 250px were used. With the test images I used, I found that capping at 400px provided a reasonable balance between identifying faces and not taking too long to process. I used the Tiger Woods image initially to try to ensure that my calculations were matching those described in "Face Detection in Color Images" but I used other test images (which will appear further down) to double-check that everything seemed to be working as expected. There are some differences between how Fleck & Forsyth ("Naked People Skin Filter") and Kapur ("Face Detection in Colour Images") perform the analysis and my test images led me to tweak a few parameters myself.
Fleck & Forsyth recommend pulling down the colour values so that the minimum value is zero because it "avoids potentially significant desaturation of opponent color values if the zero-response is far from zero". What this means is that the darkest parts of an image are probably not "true black" (ie. RGB 0, 0, 0) and the theory is that we should get better results if everything is adjusted down so that the darkest colour is black. In practice, I look for the lowest of any of the R, G or B values from any pixel in the image and subtract that value from every R, G and B value across all pixels in the image.
Next, I generate "RgByI" values for every pixel. This a trio of values; an RG (red-green) spectrum value, a BY (blue-yellow) spectrum value and an intensity. Since I'm going to spend a lot of time taking the pixels from an image and loading them into a 2D array and then performing a range of operations on them, I've written a DataRectangle<T> class that makes this easier. When I first read the data from an image, the pixels are loaded and used to populate a DataRectangle of RGB values. When I need to get RgByI values from them, I can do the following (using Kapur's formulae from "Face Detection in Colour Images) -
var rgByIValues = colourData.Transform(colour =>
{
Func<byte, double> L = x => (105 * Math.Log10(x + 1));
return new IRgBy(
rg: L(colour.R) - L(colour.G),
by: L(colour.B) - ((L(colour.G) + L(colour.R)) / 2),
i: (L(colour.R) + L(colour.B) + L(colour.G)) / 3
);
});
In order to reduce unhelpful detail in the image somewhat, the RgByI values have a "windowing median filter" applied to them. From "Face Detection in Colour Images" -
The Rg and By matrices are then filtered with a windowing median filter .. with sides of length 4*SCALE. The SCALE value is calculated as being the closest integer value to (height+width)/320
(The "Naked People Skin Filter" paper recommends using 2*SCALE for the RG and BY values, which I have gone with because I found the results to be just as accurate and it's less computational work than 4*SCALE)
To do this, you go through every pixel and take its value and the values from the surrounding pixels, sort them, take the middle value (the "median") and use that as the new value for the current pixel. The idea is that this reduces noise by discarding any outlier pixel colours in a given range. I take a square block around where the initial pixel is but it would probably be better to approximate a circular area if the median filter radius is large (in my filter, it's never more than three).
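For illustration, a simple square-window median filter in the same style as the "Blur" method from earlier in this post might look like this (just a sketch - the real project has a "MedianFilter" method that takes a value selector, as you'll see in the code further down):
private static DataRectangle<double> MedianFilter(DataRectangle<double> values, int maxRadius)
{
    return values.Transform((value, point) =>
    {
        var valuesInArea = new List<double>();
        for (var x = -maxRadius; x <= maxRadius; x++)
        {
            for (var y = -maxRadius; y <= maxRadius; y++)
            {
                var newPoint = new Point(point.X + x, point.Y + y);
                if ((newPoint.X < 0) || (newPoint.Y < 0)
                || (newPoint.X >= values.Width) || (newPoint.Y >= values.Height))
                    continue;
                valuesInArea.Add(values[newPoint.X, newPoint.Y]);
            }
        }
        // Sort the surrounding values and take the middle one (rather than averaging, as the
        // Blur method did) so that outliers are discarded instead of being diluted
        valuesInArea.Sort();
        return valuesInArea[valuesInArea.Count / 2];
    });
}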
After smoothing the RG/BY data, their values are combined to generate hue and saturation -
var smoothedHues = smoothedRG.CombineWith(
smoothedBY,
(rg, by) => new
{
Hue = RadianToDegree(Math.Atan2(rg, by)),
Saturation = Math.Sqrt((rg * rg) + (by * by))
}
);
The two images here are the hue and saturation values generated from the original image.
In order to generate greyscale images, I had to translate the hue and saturation values into the 0-255 range. The hue values will be in the range of -180 to +180 degrees so I just added 180 and then divided by 2. The saturation values are always positive and won't exceed 0-255 if multiplied by 2, so I just did that.
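(In code, that scaling for the preview images amounts to nothing more than the following - a quick sketch assuming the anonymous Hue / Saturation type produced by the snippet above)
var huePreview = smoothedHues.Transform(hs => (byte)((hs.Hue + 180) / 2));
var saturationPreview = smoothedHues.Transform(hs => (byte)Math.Min(255d, hs.Saturation * 2));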
Generating texture amplitude is more complicated. We start with the intensity values from the RgByI data. We then run that through a median filter of 2 * SCALE and calculate the difference between every point in the median filter result and the original intensity. Finally, we run the result of the calculation through a median filter of 3 * SCALE. It may be clearer with some code -
var smoothedIntensity = rgByIValues.MedianFilter(
value => value.I,
2 * scale
);
var differenceBetweenOriginalIntensityAndSmoothedIntensity = rgByIValues.CombineWith(
smoothedIntensity,
(x, y) => Math.Abs(x.I - y)
);
var textureAmplitude = differenceBetweenOriginalIntensityAndSmoothedIntensity.MedianFilter(
value => value,
3 * scale
);
(Note that the values 2 and 3 in the above code are much lower than Kapur suggests - 8*SCALE and 12*SCALE - but the median filtering algorithm that I threw together was very slow using the higher values and using lower values - which meant that the processing completed much more quickly - did not seem to affect the outcomes)
Hopefully it's apparent that around the lower face there is low texture amplitude. Texture amplitude is higher around the facial features but that's to be expected. There is low texture amplitude elsewhere in the image (particularly in the sky behind him, where the texture is very smooth) but what we're going to look for is areas that appear to be skin in hue/saturation and in texture amplitude.
Now that we have all of the information required to guess whether a given pixel is within the acceptable bounds of "probably skin", we can create a skin mask (a DataRectangle<bool>) -
var skinMask = smoothedHues
.CombineWith(textureAmplitude, (hs, t) => new
{
Hue = hs.Hue,
Saturation = hs.Saturation,
TextureAmplitude = t
})
.Transform(hst =>
{
return (
((hst.Hue >= 105) && (hst.Hue <= 160) && (hst.Saturation >= 10) && (hst.Saturation <= 60)) ||
((hst.Hue >= 160) && (hst.Hue <= 180) && (hst.Saturation >= 30) && (hst.Saturation <= 30))
)
&& (hst.TextureAmplitude <= 5);
});
This is another point at which I have added my own tweaks to the processing. There are slightly different ranges of acceptable hue / saturation / texture amplitude suggested by Fleck & Forsyth than are suggested by Kaypur and I found that I wanted to change them a little based upon the test images that I was using.
The final step in generating the skin mask is to try to identify any skin areas just outside the identified mask. I've used the approach suggested by Kapur, who recommends it because it "helps to enlarge the skin map regions to include skin/background border pixels, regions near hair or other features, or desaturated areas".
The idea is that we look at negative values in the DataRectangle<bool> skin mask and check whether any of the adjacent values is positive and whether the colour of the pixel that resulted in the false value passes a more relaxed skin filter. The relaxed skin filter test demands only that the hue is within the range 110-180 and that the saturation is within 0-180 (the texture amplitude is not considered). This expansion work is performed twice.
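A sketch of what a single expansion pass might look like, reusing the DataRectangle methods seen elsewhere in this post (the method and the tuple type here are mine - the real FaceDetector code is arranged differently but the principle is the same, and this would be called twice as described above):
private static DataRectangle<bool> ExpandSkinMask(
    DataRectangle<bool> skinMask,
    DataRectangle<(double Hue, double Saturation)> hueAndSaturation)
{
    return skinMask.Transform((isSkin, point) =>
    {
        if (isSkin)
            return true;
        // Only pixels that pass the relaxed filter (hue 110-180, saturation 0-180) may be added..
        var (hue, saturation) = hueAndSaturation[point.X, point.Y];
        if ((hue < 110) || (hue > 180) || (saturation < 0) || (saturation > 180))
            return false;
        // .. and only if at least one adjacent pixel is already part of the skin mask
        for (var x = -1; x <= 1; x++)
        {
            for (var y = -1; y <= 1; y++)
            {
                var neighbour = new Point(point.X + x, point.Y + y);
                if ((neighbour.X < 0) || (neighbour.Y < 0)
                || (neighbour.X >= skinMask.Width) || (neighbour.Y >= skinMask.Height))
                    continue;
                if (skinMask[neighbour.X, neighbour.Y])
                    return true;
            }
        }
        return false;
    });
}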
In order to get back some of the details that are within the skin mask areas, the original image is changed to greyscale and then the skin mask is combined with it to produce a new skin mask that is slightly more restrictive; any negative content from the skin mask remains negative while positive content is only allowed if the greyscale intensity is within an acceptable range. Everything from this point (including this step) comes from the "Face Detection in Color Images" article, since "Naked People Skin Filter" ends when the skin regions are detected (it has no interest in faces, specifically).
skinMask = colourData.CombineWith(
skinMask,
(colour, mask) =>
{
if (!mask)
return false;
var intensity = colour.ToGreyScale();
return (intensity >= 90) && (intensity <= 240);
}
);
In order to turn this DataRectangle<bool> mask into groups of points (where each group represents a distinct area of skin), I used a variation of the "Stack based implementation" from this article: Flood Fill algorithm (using C#.Net). If you're looking at my code, it's the "TryToGetPointsInObject" method in the FaceDetector.cs class. I'm not stressing out about performance at this point, I just wanted to get things working and then consider measuring & improving in the future - so finding a nice simple flood fill algorithm was very helpful (I'm not saying that it's not a well-performing method, I'm just saying that at this point in time it's not critically important to me one way or the other).
Any skin objects that are very small (have less than 64 * SCALE points) are ignored. What is slightly more complicated is to identify any completely enclosed holes in the skin object. But it's not much more complicated - the basic approach is, for each skin object, take a negative point within the bounds of the skin object and use the flood fill algorithm again; if the fill reaches the edges of the bounds then the negative point was not part of an enclosed hole. Then move on to the next negative point within the bounds that hasn't already been included in a flood fill operation and continue doing this until a fully enclosed hole is found that's larger than a given size (1 * SCALE).
With the Tiger Woods image, we could actually stop here. It successfully identifies only his face as a possible face region. However, with other test images I used, some more work was required. I found that I could quite easily eliminate a few false positives by ignoring any regions that were obviously the wrong aspect ratio (either very long and shallow sections or very tall and narrow sections). I also found that, depending upon lighting or what background a face was against, sometimes the detection process up to this point would result in a region that is too tight over the facial features - expanding the matched area a little meant that the filtering in the next stage would get better results.
The real problem, though, is false positives - the algorithm will often identify areas that are not faces.
I wanted to try applying this facial detection logic to some other images. I'd used the image from one of the articles so that I could try to produce intermediate images that looked similar to those in "Face Detection in Color Images" so that I could reassure myself that I was (at least approximately) doing things correctly as described. But now I wanted to try it on some other test images.
In my Pictures folder, some photos from an Easter weekend* night out I went to a couple of years ago jumped out at me. Initially, I only used them because they made me laugh but, on closer inspection, they're actually really useful for illustrating what the skin tone detection process (and the subsequent support vector machine classification) is good at and where their limitations come into play.
* (which goes some way to explaining the eggs and the costumes.. the photos were put up on Facebook by a professional photographer, I hope he doesn't mind me reproducing them here - I tried getting in touch to ask but he didn't get back to me)
Firstly, we have this fine gentleman. The skin tone pass has identified his face perfectly but it's also misidentified two regions that contain colours and textures that the skin filter allows through.
Looking at the generated skin mask (where each distinct skin object is filled with a different colour), it should be clear why this has happened.
If we take another photo then another problem is apparent - with this group of people, there are lots of enclosed skin objects that have holes in that are being identified as faces but that are actually hands holding things.
Also, because we ignore regions that are very small, there is a limit to what faces the skin tone filter will identify. If you look closely at the group photo there is an undetected bearded face along the left edge (near the top) but I am happy for it to exclude him since I think that it is a reasonable compromise to prefer faces that are in the foreground. On the whole, it does a good job of detecting faces but it may also produce a lot of false positives.
So the next step is to try to find a way to filter these regions down further, to try to get rid of those pesky false positives. To do this, I went back to the internet. One of the pages that I found incredibly useful was the "HOG Person Detector Tutorial" by Chris McCormick. He introduces a popular and modern technique, touches on its history and then goes into an illustrated explanation of how to implement it. It's almost redundant me going over it again! .. but it's my blog and I'm talking about how I implemented it, so I'm going to start from the beginning anyway :)
At the very highest level, what we're going to do is this - take each of the possible-face regions that the skin tone pass identified, extract a consistent set of features from the image content of that region (more on precisely what those features are shortly) and then ask a "linear support vector machine", which has been trained on examples of face and not-a-face regions, to classify it for us.
This is a concept that I had never heard of before, so I know first-hand what it's like trying to search around on the internet for a good description. The problem is that a lot of articles seem to be aimed at people who know at least something about machine learning and they go immediately to showing lots of formulae and talking about techniques that I'd never heard of. One example is "Kernel Support Vector Machines for Classification and Regression in C#" by César Souza (who wrote much, if not all, of the Accord.NET library - which I will be using later); this is a really well-written article that I appreciate now but it was on my "come back to this when you know more" list for quite a while after I first found it!
So I'm going to take a real beginner approach and try to describe it in a manner that would have helped me a few weeks ago.
Firstly, an SVM tends to be a "binary classifier". This means that it will be trained with data that is all categorised as either yes or no. After it has been trained, you may only ask it to classify further data as being a yes-or-no result. (Actually, there is such a thing as a "Multi-Class Support Vector Machine" that can return a greater range than binary but it's not relevant here so I'm going to forget about it for now).
(Time for some graphs. Shout out to Create your own XKCD-style Graphs, which helped me come up with what's below!)
To conjure up an example that almost sounds feasible, imagine that every development task at a software shop requires that its strategic benefit to the company be quantified, along with the percentage of the work that a customer is just dying to pay for in order for it to be done (presuming that there is a customer who wants it and that it's not just a task for internal purposes). Since every task needs to have these pieces of information before your manager will decide whether or not the work will be scheduled, there exist historical records of estimated tasks which have two "features" (strategic value and percentage-that-will-be-immediately-paid-for-by-customers) and an outcome of either "yes, it was done" or "no, we didn't do it".
If something is of high strategic value to us and it happens to be a feature that a customer is chomping at the bit for (and so will contribute significantly towards) then it seems like a dead cert that we should do it. Unfortunately, this is not a very common case.. More often, a customer will want something that is important to them specifically. This may not be something that is necessarily very easy to resell to other customers (which would give it more strategic value) or something that will help us deal with internal issues, such as scaleability or technical debt (developments that do help with those will also have higher strategic value).
It seems like there's a roughly linear correlation between a development's strategic value, the fraction that customer(s) are willing to immediately pay for it and whether the work got the go-ahead. It's like we could draw a line and everything above it tends to get the green light by our manager and everything below it gets rejected.
An SVM is (according to wikipedia), a:
"model with associated learning algorithms.. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible"
What this boils down to is that we're essentially trying to come up with a formula to split those two sets of points. Then, when we get a new feature-pair (strategic value vs amount-customer-will-pay) we can plug those two numbers into the formula and work out which side of the line we're on (where one side means I-predict-yes and the other side means I-predict-no).
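To put that into code-shaped terms, once the training process has come up with a weight for each feature plus a bias, classifying a new task is just a case of seeing which side of the line the result lands on (the numbers below are entirely made up for illustration, they're not real learned values):
private static bool PredictThatWorkWillBeAuthorised(
    double strategicValue,
    double fractionThatCustomerWillPay)
{
    // Hypothetical values that a training process might have produced
    const double weightForStrategicValue = 0.9;
    const double weightForCustomerFraction = 1.4;
    const double bias = -1.1;
    var score = (weightForStrategicValue * strategicValue)
        + (weightForCustomerFraction * fractionThatCustomerWillPay)
        + bias;
    // Positive scores fall on the "yes, it got done" side of the line, negative on the "no" side
    return score >= 0;
}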
The hard work in the machine learning algorithm is working out where that line should go. Ideally, you want all historical "yes" results on one side and all historical "no" results on the other side. In some cases, though, this is not possible and there will be some outliers (remember that task that you had to do that seemed to have no strategic value, that no customer was willing to pay for and yet someone at one of those customer companies had convinced one of your directors to do it as a favour...?). Whether there are any outliers or not, there are probably still many slight variations on what line could be drawn. It's common for an algorithm to try to arrange the line so that it is equally distant from the positive results as it is from the negative results but there are some variations on this theme. This is really what the magic is in this process - giving the computer the results and letting it find the best match.
In the above example, each data point only had two features which made it very easy to envisage on a 2D graph how they related to each other (which isn't surprising since it's an example I made up to illustrate that very point!). Sometimes, though, more data is required. It could be that you plotted strategic-value against amount-customer-will-pay and could see no correlation, though one was there - but hidden due to the interaction of another feature. Imagine if you find historical data that shows that a customer wanted to pay for a new feature and the feature was of reasonable strategic value but it wasn't authorised. Or if there were a bunch of features with comparatively low value that did get the go-ahead. And if you realised that these were not outliers and that lots of results looked to be "all over the place". Well, it could be that the reason that the high value work wasn't done was because the development team were already stacked out with important work. Whereas those low value jobs that did get done happened to come in during quiet periods where they could be easily slotted into the schedule. This would mean that there was actually a third factor at play; available developer resources.
What would be required to model this would be to include some quantified "available developer resources" value with each data point. With this additional information, we should be able to construct a useful prediction model. However, the points could no longer be plotted on a simple 2D graph - now they would have to be plotted in 3D space. And so the formula that would be produced from the training data would no longer draw a simple line, it would have to describe a plane that cuts through the 3D space (with the "yes" values on one side and the "no" values on the other).
If you had more than three features to each data point then you would no longer be able to draw a graph of it, your points would now live in the mysterious-sounding "n-dimensional space". It's still possible to concoct a formula that divides the data points within that space, it's just not very easy to imagine it in a visual manner. On the 2D graph, it was a one-dimensional line that split the data points and allowed us to make predictions against future data points. On the 3D graph, it would be a two-dimensional plane that divides the points. In n-dimensional space, it will be a "hyperplane" (which just means that it is a structure with one less dimension than the space that it exists in). Once you start throwing terms around like "Linear SVM" and "hyperplane", it's hard not to feel like you're making real progress towards getting a handle on all this machine learning lark! I mention these higher dimensional spaces, though, not just because they sound exciting but because they will be relevant soon..
Something else that's interesting when considering SVMs is that, even with all of the required data, it still might not be possible to delineate the positive and negative results with a linear formula. If we go back to imagining a 2D graph, there could be curious curves in the data that are predictable but can not be cleanly cut with a straight line. In this case, we would have two choices - have the algorithm try to find a line that cuts the results as well as possible (but accept that this will have a lower accuracy because we know that a straight line will get a lot of predictions wrong, based upon the data we've got) or we can allow it to try to come up with a non-linear separation.
If, say, you had a manager who leant more towards authorising tasks that had greater strategic value, even if there weren't customers immediately lined up to finance the work (maybe they favour tasks that improve efficiency for work that customers are excited about or maybe it's to lay the groundwork required before customer-led work can continue) then it might not be possible to perfectly fit a straight line to the data. One option, when trying to model this, would be to specify that the learning algorithm could use a non-linear kernel, which would hopefully generate better predictions. Or you could stick with the linear approach and accept that it will be less accurate. Looking at the graph here, there is a curved line that matches the data precisely and there is a straight line that tries its best but can't help splitting the data so that some negative values are above the line (where only positive values should be) and some positive values are below the line (where only negative values should be).
This is another illustration that machine learning can't perform miracles. Before, I described an example where a third feature would need to be added to the data ("available developer resources") - without which it would be impossible to train a useful model. Here, we can see that sometimes we need to have some insight into what prediction models may or may not work for us (ie. we need to know that a linear kernel won't yield the best results).
I'm not going to dwell on this any longer here because, happily, a linear SVM is a good fit for the data that we'll be working with for classifying face / not-a-face. I don't think that I yet understand enough about the subject to explain why a linear model works so well for what we want to do but I do have links to articles (which I'll include at the end of this post) that seem happy to explain why this is the case.
So, if an SVM can be trained on pre-existing data and then used to decide whether other data (that it's never seen before) should be classified as a yes or a no, how can this be applied to our "potential face regions"? The first thing that we need to do is decide how to extract features from an image. In the "Manager Decision History" examples above, there were only two or three features but there's nothing stopping us from extracting many more features from our inputs. Whatever we do, though, we need to generate the same number of features for every input into the SVM - so we'll almost certainly have to resize each maybe-a-face sub-image before processing it. We could consider taking all of the pixel data from each resized sub-image (every red value, green value and blue value from every pixel across the entire sub-image).. but we're not going to. Not only would there be a lot of data to work with, there would also be too much variance introduced by the environment of the source photograph. We want to try to give the SVM meaningful data for it to analyse and remove irrelevant variables where possible. By "irrelevant variables", I mean that it would be great if we could ignore the lighting in an image and focus on the shapes.
Sometimes the lighting varies not just from one image to another but within the same image - if there is a person outside and sunlight falls on one side of them, for example. And it would be great if we could ignore the contrast and the shades of colour - if you can imagine fiddling with an image in Photoshop (or the excellent Paint.NET), recolouring an image or playing with the contrast and brightness doesn't affect your ability to recognise what's in the image (unless you really stretch some of those dials to the max) and we don't want those sorts of differences to have any impact on the SVM's predictions. What we really want is to discount as much of this variance as possible when preparing the inputs (both the inputs used to train it and the subsequent inputs that we want predictions for).
A method that works well to achieve this is to generate normalised "Histograms of Oriented Gradients" from the sub-image. Again, I've come to this conclusion from doing a lot of reading around and then attempting to implement described methods to see how they work - I'll include references to articles with more details at the end of this post.
The premise is fairly simple, you need to:
1. Resize the sub-image to a consistent size (I'm using 128x128)
2. Convert it to greyscale
3. Calculate the angle and magnitude of the change in intensity (the "gradient") at every pixel
4. Split the image into blocks (I'm using blocks of 8x8 pixels)
5. Generate a histogram of the gradients within each block
6. Normalise the resulting histogram data (which I'll come back to further down)
I think that steps 3 and 5 are probably the ones that require further explanation. Since we have a greyscale image by step 3, instead of colour data we effectively have "intensity" values for each pixel. To work out the angle in which the intensity is changing, take the intensity of the pixel below the current one and subtract it from the intensity of the pixel above the current one. This is the "y difference" - or "delta y", so I'll name it "dy" (so that I can sound all maths-y). "dx", as you can probably imagine, is the intensity of the pixel to the right minus the intensity of the pixel to the left. We can calculate the angle at which intensity is changing (and how big the change is) with the following:
var angleInDegrees = RadiansToDegrees(Math.Atan2(dx, dy));
var magnitude = Math.Sqrt((dx * dx) + (dy * dy));
Note that "Math.Atan2" returns a value in radians. Radians are another unit of measurement of angles but instead of describing a full circle by going from 0-360 degrees, the scale goes from 0-2π radians. So, to convert from radians into easier-to-deal-with-in-this-case degrees we can do this:
private static double RadiansToDegrees(double angle)
{
return angle * (180d / Math.PI);
}
Also note that Atan2 actually returns values in the range of -π to +π and so calling RadiansToDegrees on the result will give us a value from -180 to +180.
(When calculating gradient angles and magnitudes, any pixels on image edges won't have pixels surrounding in every direction and so I just record them as having both an angle and magnitude of zero)
Now for step 5, we want to produce "histograms of oriented gradients" (one for each block that we've split the image into in step 4). I always envisage a histogram like a bar chart - it shows how values are distributed across a range.
(The example on the left is from wikipedia and reproduced here under the creative commons license)
To generate our histograms, we want to assign the gradients within the block across nine buckets, spaced twenty degrees apart. The first will be at 10 degrees, the second at 30 and they'll go all the way up to 170.
We're going to use "unsigned gradients", which means that if we have a gradient where intensity is changing only vertically, we don't really care if it's getting brighter as it goes down or brighter as it goes up, we only care that the line is precisely vertical. Similarly, imagine the intensity increasing up-and-right, at 45 degrees - we're happy to treat this the same as intensity going in precisely the opposite direction; down-and-left and 225 degrees (or -135 degrees since our current angleInDegrees values are in the -180 to +180 range). What this essentially means is that we want to tweak our angles so that they are all within 0-180 (instead of -180 to +180). To do, so we just add 180 degrees to any values less than zero.
Every gradient needs to be assigned to one or more of these buckets. If a gradient's angle is precisely 10 degrees then the entirety of the gradient's magnitude is added to the 10 bucket. However, if it's between 10 and 30 then its magnitude is split proportionally between 10 and 30 (so a gradient of 20 degrees is split evenly between 10 and 30 while a gradient of 25 degrees will contribute 3/4 of its magnitude to the 30 bucket - which it is 5 degrees away from - and 1/4 of its magnitude to the 10 bucket - which it is 15 degrees away from).
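As a sketch of how that proportional splitting might be implemented (the method name is mine and I haven't said above how angles below 10 or above 170 degrees should be treated - here I wrap them around between the 170 and 10 buckets, which is a common choice when working with unsigned gradients):
private static void AddToHistogram(double[] nineBins, double unsignedAngleInDegrees, double magnitude)
{
    // Bucket centres are at 10, 30, .., 170 degrees - work out where this angle sits between them
    var position = (unsignedAngleInDegrees - 10) / 20;
    var lowerBin = (int)Math.Floor(position);
    var proportionForUpperBin = position - lowerBin;
    var upperBin = (lowerBin + 1) % 9;
    if (lowerBin < 0)
        lowerBin += 9; // Angles below 10 degrees share their magnitude with the 170 bucket
    nineBins[lowerBin] += magnitude * (1 - proportionForUpperBin);
    nineBins[upperBin] += magnitude * proportionForUpperBin;
}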
Performing this transformation on the image is an effective way of reducing the amount of data that we need to deal with. If we decide that the standardised size of the images that we want an SVM to classify is 128x128 then we have 128 x 128 x 3 = 49,152 values (since every pixel has three colour values; red, green and blue). If we use a block size of 8 when generating the HOG data then the 128 x 128 image will be 16 x 16 blocks and each block has 9 values (since each histogram has values in nine bins), which gives a total of 2,304 values.
Another nice thing about this representation of the data is that, if you sort of squint, you can kind of make out the shapes that were in the source image -
If you get curious and want to try generating HOGs for your own images, there is code in my GitHub project to do this..
using (var image = new Bitmap(imagePath))
{
using (var resizedImage = new Bitmap(image, new Size(128, 128)))
{
// (Can ignore GetFor's return value when only interested in generating a preview image)
FaceClassifier.FeatureExtractor.GetFor(
resizedImage,
blockSize: 8,
optionalHogPreviewImagePath: "HOG.png",
normaliser: GlobalNormaliser.Normalise
);
}
}
.. which brings me neatly on to HOG normalisation (since there's a mysterious reference to a normaliser in the code above). When the gradient magnitudes are calculated, some images may have many pixels that vary wildly in intensity from their neighbours while other images may have smoother, gentler gradients. In both cases, it is the relative flow of intensity that is important, since that helps identify the shapes. If you took an image and produced HOG data for it and then increased the contrast of the image and re-processed it, you would get greater gradient magnitude values in the increased-contrast version of the image, even though they both - so far as we are interested (in the context of trying to detect faces) - contain exactly the same information.
What we want to do is to align all of the magnitudes to a common base line. A simple (and fairly effective) way to do this is to find the largest value of any of the buckets across all of the histograms generated for an image and to then divide every magnitude by this value. In the example above, the 128x128 input image is transformed into 16x16 blocks, each of which is a histogram that contains 9 values. So we get the largest value from each of those 16x16x9 values and then divide all of them by it. This means that the largest value is now precisely 1.0 and every other value is somewhere between zero and one. This is what the "GlobalNormaliser.Normalise" delegate in the code above does. All it essentially has to do is this:
public static DataRectangle<HistogramOfGradient> Normalise(DataRectangle<HistogramOfGradient> hogs)
{
if (hogs == null)
throw new ArgumentNullException(nameof(hogs));
var maxMagnitude = hogs.Enumerate()
.Select(pointAndHistogram => pointAndHistogram.Item2)
.Max(histogram => histogram.GreatestMagnitude);
return hogs.Transform(hog => hog.Multiply(1 / maxMagnitude));
}
However, there is a variation on this that has been found to produce more accurate results; "block normalisation". The original description of this process comes from (as I understand it) the original research into using HOGs for this form of detection ("Histograms of Oriented Gradients for Human Detection [PDF]") -
For better invariance to illumination, shadowing, etc., it is also useful to contrast-normalize the local responses before using them. This can be done by accumulating a measure of local histogram “energy” over somewhat larger spatial regions (“blocks”) and using the results to normalize all of the cells in the block
What this means in practical terms is explained well by one of the articles that I linked earlier ("HOG Person Detector Tutorial"). In essence, it means that we can get better results by normalising over smaller areas of the image. Instead of taking the max magnitude across the entire set of data, the idea is to group the histograms into blocks of four and normalise over those.
Let's go back to Tiger Woods' face to illustrate what I mean.
We first take the 2x2 histograms from the top left of the image - we get the max magnitude from within those four histograms and use it to normalise the values within them. These four normalised histograms will provide the first sets of values that we extract from the image.
Then we move across one to get another set of 2x2 histograms and repeat the process; get the max magnitude from within those four histograms, use it to normalise them and then take those four normalised histograms as the next set of values that we have extracted from the image.
What you might notice here is that, as we look at the 2x2 blocks of histograms, some of them will appear multiple times. The histograms from the edges of the image won't but the others will. In the Tiger images here, you can see that the two histograms at the right hand side of the block in the first image are included again in the second image (now they are the two histograms on the left hand side of the block).
This means that this "block normalisation" process will result in more data being produced. When we "globally normalised" the HOGs then we had 16x16x9 = 2,304 values. However, if we block-normalise (using blocks of 2x2) then we effectively produce 30 histograms' worth of data across each row (the two edge histograms are only counted once but the other 14 are each counted twice, so the total is 2 + (14*2) = 30). For the same reason, we produce 30 histograms' worth of data as we go down the image. This means that we end up with a total of 30x30x9 = 8,100 values.
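To make the overlapping 2x2 blocks a bit more concrete, here's a simplified sketch - it's not the code from my project, it just treats each histogram as an array of nine bin values -
private static IEnumerable<double> NormaliseOverlappingBlocks(double[][][] histograms)
{
    // histograms[y][x] is the nine-bin histogram for the block at (x, y) - eg. a 16x16
    // grid for a 128x128 image that was processed with a block size of 8
    var height = histograms.Length;
    var width = histograms[0].Length;
    for (var y = 0; y < height - 1; y++)
    {
        for (var x = 0; x < width - 1; x++)
        {
            // Gather the 2x2 block of histograms whose top left corner is at (x, y)
            var block = new[]
            {
                histograms[y][x], histograms[y][x + 1],
                histograms[y + 1][x], histograms[y + 1][x + 1]
            };
            // Normalise by the largest bin value found anywhere within this block..
            var maxMagnitude = block.Max(histogram => histogram.Max());
            // .. and emit the four normalised histograms as this block's features
            foreach (var histogram in block)
            {
                foreach (var binValue in histogram)
                    yield return (maxMagnitude == 0) ? 0 : (binValue / maxMagnitude);
            }
        }
    }
}
For a 16x16 grid of histograms, this produces 15x15 overlapping blocks of four histograms each - which is the same 8,100 values as counted above (15 x 15 x 4 x 9 = 8,100).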
To extract those features, we would tweak the code from before -
const int inputWidth = 128;
const int inputHeight = 128;
const int blockSizeForHogGeneration = 8;
const int blockSizeForLocalNormalisation = 2;
IEnumerable<double> features;
using (var image = new Bitmap(imagePath))
{
using (var resizedImage = new Bitmap(image, new Size(inputWidth, inputHeight)))
{
var blockNormaliser = new OverlappingBlockwiseNormaliser(blockSizeForLocalNormalisation);
features = FaceClassifier.FeatureExtractor.GetFor(
resizedImage,
blockSize: blockSizeForHogGeneration,
optionalHogPreviewImagePath: null,
normaliser: blockNormaliser.Normalise
);
}
}
(Note that I'm setting "optionalHogPreviewImagePath" to null so that "FeatureExtractor.GetFor" doesn't generate a "HOG preview" image - this is because it's much harder to recognise the shapes that the gradients were extracted from when this form of normalisation is used since most of the HOGs appear multiple times, so the preview images are of less interest)
When I tried comparing the results of global normalisation vs block normalisation, I found that I got better results (ie. better accuracy) when using block normalisation, just as the authors of "Histograms of Oriented Gradients for Human Detection" did. The number of images that I've been testing with is, I'm sure, much smaller than the number used by Dalal and Triggs in their research but it was gratifying that I could see improvements with my data set from using block normalisation - if only because it reassured me that I was moving in the right direction!
It's really coming together now. We've got a way to extract data from an image that we will be able to pass to an SVM in order for it to classify the image as "face" or "not face". There's just one thing that we're missing.. a trained SVM. Or, another way to look at it, two things we're missing - a mechanism to train an SVM and the data to train it with.
Let's start with a general way to train an SVM. For this, I'm going to use a package called Accord.NET. It's a library entirely written in C#, which was a plus to me because I like to see how things work and when I was doing my early reading around on the subject of face detection/recognition, a lot of people were recommending OpenCV. This is a C++ library (which can be called by C# using a .NET wrapper called Emgu), while I would be happier with C# source code that I could more confidently dig around in. (Also, I'm toying with the idea of trying to port some of this work to a Bridge.NET project so that I can try making it work in the browser - this wouldn't be possible if I took a dependency on a library like OpenCV).
Accord.NET really does make it easy.. once you know where to get started. There are a lot of examples on the accord-framework.net site and on the GitHub wiki, though some of the code samples are out of date and will generate compile warnings if you try to use them directly. (Having said that, the warnings can be ignored and it's not too hard to find other examples that compile without warnings - and, from reading some of the GitHub issues, I know that César is aware that some of the docs need updating and is intending to do so when he can make time).
To demonstrate, let's revisit the "Manager Decision History" example from earlier. We'll formulate some example data where we pretend that the Manager is super-consistent and will always authorise work if the percentage that the customer will pay for immediately (as a value between 0 and 1) plus the strategic value (also somehow quantified as a value from 0 to 1) add up to more than 1. (So strategic value 0.9 plus customer-will-pay-immediately 0.2 will be authorised as 0.9 + 0.2 > 1 but strategic value 0.8 with customer-will-pay-immediately of 0.15 will not be authorised as 0.8 + 0.15 < 1). We can then use that data to train an SVM and then try other values against the model -
// Make up some data (in the real world we'd use some proper pre-classified training
// data but this is just an example to demonstrate how to train an SVM using Accord.NET)
var decisionHistory = Enumerable.Range(0, 10).Select(x => x / 10d)
.SelectMany(x => Enumerable.Range(0, 10).Select(y => y / 10d).Select(y => new
{
StrategicValue = x,
ImmediateCustomerContribution = y,
Authorised = (x + y) > 1
}));
// From the data, extract the input features (strategic-value and amount-customer-will-
// pay-now)..
var inputs = decisionHistory
.Select(decision => new[]
{
decision.StrategicValue,
decision.ImmediateCustomerContribution
})
.ToArray();
// .. and the true/false outputs for each of those sets of features
var outputs = decisionHistory.Select(decision => decision.Authorised).ToArray();
// Then use the inputs and outputs to train an SVM
var smo = new SequentialMinimalOptimization<Linear>();
var svm = smo.Learn(inputs, outputs);
The SequentialMinimalOptimization class defines the process by which the best fit for the provided data will be found. We've specified that a linear kernel be used, which means that we're presuming that it will be possible to neatly classify our data with a straight line.
Now that it's been trained, we can ask the SVM to predict an output by calling its "Decide" method and giving it a pair of values -
var easyWin = svm.Decide(new[] { 0.81, 0.79 });
This returns true - which is what we would expect since the numbers indicate a feature that has high strategic value (0.81) and there are customers who want it so much right now that they are already getting their chequebooks out (the customer-will-immediately-pay value is 79%).
var lowPriority = svm.Decide(new[] { 0.26, 0.14 });
This returns false - which we'd also expect, since the numbers indicate a feature of low strategic value and one that no customer is excited about contributing much towards the development cost of.
Time for another gotcha. We saw earlier that a linear kernel is not always going to be capable of perfectly classifying the results in the training data. Sometimes you might need to use a non-linear kernel (or stick with a linear kernel but accept a lower accuracy). I didn't talk about what other kernel options there are (because it's not relevant to what I want to do here) but it was an important point that machine learning will sometimes need some external insight in order to be as effective as it can be. Another example of this is that sometimes you need to tweak the training parameters, depending upon the data that you're using. In the below example, I'm going to try to train an SVM in a very similar manner to what we just looked at, but with much less data -
var decisionHistory = new[]
{
// Strategic wins
new { StrategicValue = 0.95, ImmediateCustomerContribution = 0.1, Authorised = true },
new { StrategicValue = 0.85, ImmediateCustomerContribution = 0.2, Authorised = true },
// Customer wins
new { StrategicValue = 0.15, ImmediateCustomerContribution = 0.9, Authorised = true },
new { StrategicValue = 0.2, ImmediateCustomerContribution = 0.9, Authorised = true },
// Everybody is happy
new { StrategicValue = 0.8, ImmediateCustomerContribution = 0.8, Authorised = true },
// Low priority
new { StrategicValue = 0.2, ImmediateCustomerContribution = 0.1, Authorised = false },
new { StrategicValue = 0.4, ImmediateCustomerContribution = 0.2, Authorised = false }
};
var inputs = decisionHistory
.Select(decision => new[]
{
decision.StrategicValue,
decision.ImmediateCustomerContribution
})
.ToArray();
var outputs = decisionHistory.Select(decision => decision.Authorised).ToArray();
var smo = new SequentialMinimalOptimization<Linear>();
var svm = smo.Learn(inputs, outputs);
The training data is very similar to before in that all authorised decisions still have a feature sum of more than 1 and all rejected decisions have a sum of 1 or less. However, something seems to have gone wrong because when I ask the trained SVM what it thinks of the "lowPriority" example -
var lowPriority = svm.Decide(new[] { 0.26, 0.14 });
.. it returns true! This is not what I want.
The only way that this could happen is if the prediction model that has been generated is completely wonky somehow. To put this to the test, I'm going to use a mechanism that Accord has where you can double-check your trained SVM by running the training data back through it to see how well the prediction model managed to fit it. This can be useful in cases where you're not sure if the SVM kernel that you're using is appropriate, since it can highlight a badly-fitting model. To calculate the error rate when the training data is passed back through the SVM, do the following:
var predicted = svm.Decide(inputs);
var error = new ZeroOneLoss(outputs).Loss(predicted);
This just uses the model to calculate a prediction for each of the inputs and then compares the results to the expected values (then it works out what proportion are incorrect). In this case, the error rate is 0.2857142857142857, which is 2/7. Looking at the predicted values (an array of bool), every value is true! This isn't right - the last two inputs (the "low priority" data points) should result in a false prediction. I guess that explains why the "lowPriority" example returns true from this model - it seems to return true for everything!
We know that a linear model will fit this data because we know that it's a straight line on a graph that separates everything above 1 (which are "decision authorised" results) from everything else (which are "decision rejected" results). So it's not the kernel that's the problem. The only other thing to do is to look for some options to fiddle with. Poring through the documentation, one that sounds promising is "Complexity" (also referred to as "cost") -
The cost parameter C controls the trade off between allowing training errors and forcing rigid margins. It creates a soft margin that permits some misclassifications. Increasing the value of C increases the cost of misclassifying points and forces the creation of a more accurate model that may not generalize well.
That description makes it sound like a bad thing to try increasing the complexity - I think I want a model that will generalise well. However, leaving the complexity at its default is clearly not working well for us in this case. So, I tried changing the SequentialMinimalOptimization initialisation to:
var smo = new SequentialMinimalOptimization<Linear> { Complexity = 10 };
.. and re-running. This time, the predicted array precisely matched the output array, which means that the error rate is now zero. When I ask the new model to predict a value for the "lowPriority" features, it returns false - which is much better!
I've only experienced this problem when working with small amounts of data. To train a face classification SVM, we're going to throw a lot of training data at it and so this shouldn't be a problem (there will be no need to fiddle with Complexity or any other settings). I only mention it now in case you decide to do a few experiments of your own and fall into the same trap that I did!
We have all the tools that we need now, the final piece of the puzzle is that we need training data to teach an SVM what looks like a face and what doesn't.
I could take the pictures that I found on my computer, run them through the skin tone face detector, manually categorise each maybe-a-face region and then use that information to train an SVM. I would use the HOG feature extractor to generate the training inputs and the list of output values would be the manual classifications that I would have to prepare (eg. sub-image-1 is a face, sub-image-2 is not a face, etc..). This should result in an SVM that could then tell apart each of the sub-images automatically. However, that would be cheating! What I want to do is train a classifier with one lot of data and then use the SVM on my Easter Weekend photos to prove that it's worked. (Testing an SVM using the same data used to train it is a bit pointless, it gives you no indication whether you have produced something that is useful in the general case or if you've only succeeded in training an SVM that is specialised and only works with that particular set of inputs).
It's crucial to train it using both positive images (ie. face images) and negative images (non-face images), otherwise the SVM will have no idea how to classify. If, for example, you tried to train an SVM using only positive images then all you teach it is that everything is a positive image! (By always returning true, it would produce a zero error rate for the training data but it's not very useful to always return true when trying to classify real world data). So we need both kinds of input and I think that we ideally want to have an equal number of positive and negative images.
(If I really had to think about it, maybe we should want at least as many negative images as positive, since there are only so many variations of a face that exist but there are loads of things that aren't faces.. however, an equal number of positive/negative has worked well for me and so I haven't felt the need to experiment with different ratios)
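Once we've got both kinds of image from somewhere, pulling the earlier pieces together to train a face classifier could look something like this (a sketch only - the faceImages and nonFaceImages variables are hypothetical lists of Bitmaps and, in reality, this work is wrapped up in a class that I'll describe in a moment) -
// Combine the positive and negative training images into a single labelled list (the
// faceImages / nonFaceImages lists are assumed to have been populated from somewhere)
var normaliser = new OverlappingBlockwiseNormaliser(2);
var labelledImages = faceImages.Select(image => new { Image = image, IsFace = true })
    .Concat(nonFaceImages.Select(image => new { Image = image, IsFace = false }))
    .ToArray();
var inputs = labelledImages
    .Select(entry =>
    {
        // Resize each training image to the standard sample size and extract the
        // block-normalised HOG features from it
        using (var resizedImage = new Bitmap(entry.Image, new Size(128, 128)))
        {
            return FaceClassifier.FeatureExtractor.GetFor(
                resizedImage,
                blockSize: 8,
                optionalHogPreviewImagePath: null,
                normaliser: normaliser.Normalise
            ).ToArray();
        }
    })
    .ToArray();
// The outputs are simply true for the face images and false for the non-face images
var outputs = labelledImages.Select(entry => entry.IsFace).ToArray();
var svm = new SequentialMinimalOptimization<Linear>().Learn(inputs, outputs);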
I've found various places that have databases of images of faces but I found it more difficult to decide how to get a set of negative images. I'm sure that in some things I read, people spoke about using images of scenery.. but I can't see why any particular kind of negative image (such as scenery) should be required - it could be anything (so long as it's not a face)!
What I did in the end was download the "Caltech 10,000 Web Faces" data set, which includes lots of photos downloaded from various Google image searches along with a text file that, crucially, has coordinates of the faces in the photos. From this data set, I extracted the faces from images but also extracted other parts of the images at random that didn't contain faces. For each face entry in the "Ground Truth" text file, I extracted three images - one where the edges of the sub-image were close around the face and then two where there was a little more background around the face. This should help produce an SVM that can recognise faces when the region around the face is very tight and when it's less so, which is important for classifying the skin tone face detection results - Tiger Woods' detected-face-region is quite tight while the other photos show that much looser face regions may be identified for other photos.
There are no exciting implementation details here. The "CalTechWebFacesSvmTrainer" class is given a bunch of configuration options: path containing the CalTech web face images, path to the Ground Truth text file (which lists all of the face regions in the images), standard size of image to use to generate SVM inputs (128x128), block size for HOG generation (8), normaliser (the OverlappingBlockwiseNormaliser we saw earlier, with block size 2) and number of training images to process. Its "TrainFromCaltechData" method uses these options to train an SVM on positive and negative images from the Caltech data set and returns an IClassifyPotentialFaces implementation, which wraps the SVM and exposes a single method -
bool IsFace(Bitmap image);
Now, we can take the possible-face sub-images identified by the skin tone pass and pass them to "IsFace". This method will resize the specified Bitmap to the standard image size that the SVM requires, generate normalised HOG data from it and then query the SVM - which returns the final result; is this or is this not a face?
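As a rough sketch of how those steps hang together (the _svm, _normaliser and _sampleSize fields are just illustrative names, not the real implementation - and this version simply squashes the image to the target size, whereas the real code deals with mismatched aspect ratios by adding black bars, as I'll come to shortly) -
public bool IsFace(Bitmap image)
{
    // Resize the maybe-a-face sub-image down (or up) to the sample size that the SVM
    // was trained against (eg. 128x128)
    using (var resizedImage = new Bitmap(image, _sampleSize))
    {
        // Extract block-normalised HOG features in the same way as was done for the
        // training images..
        var features = FaceClassifier.FeatureExtractor.GetFor(
            resizedImage,
            blockSize: 8,
            optionalHogPreviewImagePath: null,
            normaliser: _normaliser.Normalise
        );
        // .. and let the SVM decide which side of the hyperplane they fall on
        return _svm.Decide(features.ToArray());
    }
}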
The only configuration option that I haven't mentioned until now is "number of training images to process". We saw before that trying to train an SVM with very little data can be ineffective (unless you know what kernel settings to fiddle with) but it's very difficult to come up with a hard and fast rule of how much data is enough to train from. I think that the best thing to do is just to experiment. My gut told me that it would surely have to be at least 100s or 1000s of images since the SVM has to be trained to classify many variations of faces (and not-faces) and each of the inputs has a lot of values (8,100 - as we calculated earlier) and so it seems like it's going to need to have a lot of information at its disposal so that it can work out what is and isn't important.
So, initially, I tried specifying 500 training images, which resulted in an SVM that actually did a pretty good job. The group photo shown before had all of the maybe-face regions correctly classified (the faces were all identified as faces and the hands in the photos were all correctly classified as not-faces).
However, there was a problem with the image shown here (the problem, btw, clearly isn't the moves being thrown on the dancefloor, since they look like the very essence of poetry in motion). The problem is that one of the maybe-face regions identified by the skin tone pass has been mis-classified by the SVM as a face, when it's really an arm.
(Note that there are some faces in the background that were not identified by the skin tone pass, but I'm not worried about that - it is to be expected that they would be ignored because they are small, relative to the size of the image.. though it should be possible to tweak the skin tone filter parameters if you wanted to try to capture background faces).
Increasing the number of training images to 1,000 didn't address the problem with this particular image and, in fact, made another image worse (where a non-face region was correctly classified as not-a-face before, with 1,000 training images the SVM thought that it was a face). Increasing to 1,500 training images corrected the classification of the dancing image regions but there were still false positives in other images. Such as this one of that same individual who appears to now be wearing a flag (which reminds me of a scene from Fear and Loathing in Las Vegas).
2,000 training images seemed to be the sweet spot. The classifier produced the correct results for every case that I expected it to.
The quantity of training data is going to vary from application to application and will depend upon what sort of feature extraction logic you are using. I think that it makes sense to start with lower quantities, if only because it's faster to train an SVM with less data and you can experiment more quickly with the other variables. To illustrate, on my computer (with the code as it currently stands, which is in a completely unoptimised state, but running a release build) it takes 17s to load the 2,000 images and extract the features from them and then a further 27s to train the SVM on them. With 500 images, it takes 5s to load the image data and 2s to train the SVM. With 5,000 images, it takes 42s and 168s.
There's one final tweak that I'd like to mention before bringing everything to a close. The SVM requires that the inputs have fixed dimensions, so the "IsFace" implementation resizes the input image if it's too big or too small. If it's the wrong aspect ratio, though, then it leaves black bars above-and-below or left-and-right of the input (which is part of the process that I described in "Feature extraction for maybe-a-face images"). I found that the classification accuracy improved slightly if I expanded the maybe-a-face region so that it matched the aspect ratio of the SVM's input dimensions first. For example, if the skin tone pass identified a face region that was 91x241px and I knew that the SVM was configured to work with input images of 128x128px then I would first expand the 91x241px region to 241x241px (similarly, if I had a region that was 123x56px then I would expand it to 123x123px).
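To illustrate, a quick sketch of that expansion calculation (not the exact code from my project - and it ignores the clamping of the expanded region to the bounds of the source image) -
private static Rectangle ExpandToAspectRatio(Rectangle region, Size svmInputSize)
{
    var targetAspectRatio = (double)svmInputSize.Width / svmInputSize.Height;
    var currentAspectRatio = (double)region.Width / region.Height;
    if (currentAspectRatio < targetAspectRatio)
    {
        // Too narrow - expand the width, keeping the region centred (eg. 91x241 becomes
        // 241x241 for a square target)
        var newWidth = (int)Math.Round(region.Height * targetAspectRatio);
        return new Rectangle(region.X - ((newWidth - region.Width) / 2), region.Y, newWidth, region.Height);
    }
    else
    {
        // Too wide (or already the correct ratio) - expand the height, keeping the region
        // centred (eg. 123x56 becomes 123x123)
        var newHeight = (int)Math.Round(region.Width / targetAspectRatio);
        return new Rectangle(region.X, region.Y - ((newHeight - region.Height) / 2), region.Width, newHeight);
    }
}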
In the context of this kind of work, "performance" relates to two distinct areas - how accurately does it perform its work and how quickly does it do it? To an extent, the two are related since we could make the skin tone filter work more quickly (if we reduced the limit from no-input-dimension-may-be-more-than-400px to 300px or 200px) but at the risk of it no longer identifying regions that we want it to. As we've just seen with the SVM training, the time it takes to train it depends upon the quantity of data (both the size of the input image that features are extracted from and the number of input images) - so the inputs could be reduced (so that the training time would be shorter) but this negatively affects the accuracy of the classifier.
(A handy tip that I used when I was happy with the SVM training but experimenting with the skin tone pass was that the Accord.NET SVM class may be serialised using the BinaryFormatter - so I was persisting it to disk between runs, rather than having to re-train it on every run)
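Something along these lines (the file name is just an example and the cast assumes that the trained SVM is a SupportVectorMachine<Linear>, which is what I believe the linear-kernel training from earlier produces) -
// Save the trained SVM so that it doesn't need to be re-trained on every run
using (var stream = File.Open("TrainedSVM.bin", FileMode.Create))
{
    new BinaryFormatter().Serialize(stream, svm);
}

// .. and then, on subsequent runs, load it back from disk instead of re-training
SupportVectorMachine<Linear> loadedSvm;
using (var stream = File.OpenRead("TrainedSVM.bin"))
{
    loadedSvm = (SupportVectorMachine<Linear>)new BinaryFormatter().Deserialize(stream);
}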
Once the SVM is trained, it is very quick at classifying. Which makes sense, since all it needs to do is take a feature set, apply its linear formula and see which side of the hyperplane the data point sits on. There is a little work required to extract the feature data from a maybe-face sub-image in order to give the SVM something to work on but it takes less than 15ms per sub-image on my computer. The skin tone pass is slower, taking between 150 and 300ms for the images that I've been testing with (which are all about 1000px on the largest side and so are resized down to 400px before being processed). I'd like to cut this time down because I like the idea of being able to apply this processing to video and I feel that it needs to be able to process at least five frames a second to convincingly track faces. I haven't made any effort towards this yet but reducing the total time down to 200ms feels reasonable since the code has been written to be easy to follow and tweak, rather than to be fast, so there is surely plenty of performance tuning potential.
Another reason that I haven't spent any time optimising the current code is that it might be wiser to spend time researching alternative algorithms. For example, Accord.NET has some support for face detection, such as the FaceHaarCascade class (which I believe is the same class of detector as the Viola–Jones object detection framework) -
Rectangle[] faceRegions;
using (var image = new Bitmap(path))
{
var detector = new HaarObjectDetector(new FaceHaarCascade());
faceRegions = detector.ProcessFrame(image);
}
However, it gave poor results for my images. It seemed to do well with very small faces but struggled with larger images (or the images where the faces took up a large proportion of the image). It's very possible that I could improve the accuracy by tweaking the parameters. If my images are too large, maybe they should just be shrunk down further before processing? That might help the false negative rate but I'm also concerned about the false positive rate and I'm not sure that shrinking the image first would help with that.
One approach that sounds very promising: if the SVM classifier is so fast and effective then maybe we should get rid of the skin tone processing altogether and just generate segments from all over the image, then filter them with the classifier. This is discussed and recommended by many people and is referred to as the "sliding window" approach (see Sliding Windows for Object Detection with Python and OpenCV, for example). The idea is that you pick a window size (eg. 50px if limiting the source image to no more than 400px on a side) and then start at the top left of the image and take that top-left 50x50px as the first sub-image to pass to the classifier. Then you move the window over a bit (say, 10px) and then use that as the second sub-image. Keep going until you can't go any further and then go back to the left but go down 10px - moving across then back-and-down until you have covered the entire image. Then do the same with some other window sizes (maybe 75px and then 100px), since you don't know whether the faces on an image are going to be small (relative to the size of the image) or large. This will give you many, many regions to classify - but that's no problem because linear SVM classification is very fast. Unfortunately, I did try this but I got an enormous false positive rate (as you can see here with Tiger - if you can make him out from behind all the green "this is a face!" boxes). Surely this approach can be made to work since it's so commonly presented as a solution to this problem.. it may just be that I need to tweak how the SVM is trained, maybe it needs more than 2,000 training images (though I did also try it with 10,000 training images and the results were not much better).
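For illustration, generating the candidate regions for a sliding window pass could look something like this (the window sizes and step are just example values - and, as I said, this approach didn't work out well for me) -
private static IEnumerable<Rectangle> GetSlidingWindowRegions(Size imageSize)
{
    // For each window size, step a square window across and then down the (resized)
    // source image and return every position so that each region may be passed through
    // the same resize / HOG / classify steps as before
    var windowSizes = new[] { 50, 75, 100 };
    const int step = 10;
    foreach (var windowSize in windowSizes)
    {
        for (var y = 0; y + windowSize <= imageSize.Height; y += step)
        {
            for (var x = 0; x + windowSize <= imageSize.Width; x += step)
                yield return new Rectangle(x, y, windowSize, windowSize);
        }
    }
}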
To return to the performance of the skin-tone-pass-followed-by-(2k-image-trained-)SVM-classification model that I've talked about for most of this post; in terms of accuracy I'm very happy with it. In the test images that I've been using, there are zero false positives. There are a couple of false negatives but I think that they are understandable. The first obvious failing that it has is with a black-and-white photo - since the skin tone face detector looks for particular hues and saturations, it's not going to work if there is no colour in the image. That's just a built-in limitation of the approach.
Another failure is with a face that is half hidden behind someone else (in the top right of this group photo). The skin tone face detector identifies the face but the SVM classifies it as not-a-face. Since the face is half-hidden, I think that this is acceptable.
Last but not least is this unfortunate guy who foolishly entered some sort of drinking contest. I don't know if he won but he doesn't look too good about it one way or the other. The red liquid erupting from his lips falls within the first phase's looks-like-skin-tones bounds, as you can see by looking at the skin mask image. Not only is his head tilted in the shot but the vomit-detected-as-skin means that his face isn't centered in the maybe-face region. So I'm not surprised that the SVM doesn't correctly classify it.
In summary, it's not perfect but I'm still very happy with the results.
I want to close this post off by linking to the sites and papers that helped me get to this point. It's been a fun journey and is hopefully only my first step on the machine learning track!
The "Naked People Skin Filter (Fleck & Forsyth)" and "Face Detection in Color Images" papers were amongst the first that I found that had easily understandable material (for a face detection novice) that didn't use a third party library in some way.
Once I had implemented the skin tone logic and was looking to improve the results, one of the articles that I found most inspiring (partly because it was so approachable) was "Machine Learning is Fun! Part 4: Modern Face Recognition with Deep Learning" (by Adam Geitgey). This talks about taking things to the next step and performing facial recognition - not just locating faces but also trying to identify who they are, against a known database. I haven't got to that advanced point yet but it's definitely something that I'd like to explore. The only thing that I would say about this article is that it doesn't go into a lot of depth about generating HOGs or normalising them or what sort of input image size and other parameters to use. He does link to some Python code that uses third party libraries in lieu of more details, though. Since I wanted to write C# and (ideally) only take C# dependencies this wasn't very helpful to me. The article itself, though, got me really excited about the possibilities and I enjoyed the guy's writing style (and intend to read the rest of his "Machine Learning is Fun" series).
An article that did go into the details was the "HOG Person Detector Tutorial" (by Chris McCormick). This was also really well written and very approachable. Although it talks about detecting a person in a photo, rather than a face, the principles are the same. The description of the HOG generation and normalisation really helped me clarify in my mind how it should all work. This article links to the original HOG person detector paper by Dalal and Triggs [PDF], which is full of research information and graphs and equations - so if you want to dig deep then this is the place to start!
The Dalal and Triggs paper states in the introduction that "For simplicity and speed, we use linear SVM as a baseline classifier throughout the study". Later on, they mention that they tried a non-linear kernel but that "Replacing the linear SVM with a Gaussian kernel one improves performance by about 3%.. at the cost of much higher run times". If this "why is a linear SVM classifier appropriate" explanation feels too light for you then it is discussed in greater depth in "Why do linear SVMs trained on HOG features perform so well? [PDF]".
Finally, I want to mention again the article "Kernel Support Vector Machines for Classification and Regression in C#" by Accord.NET's César Souza. Here, he introduces the SVM in a much more formal way than I've done here, he includes a video about "The Kernel Trick" (a simple yet mind-blowing way in which a linear kernel can be used to classify data that doesn't immediately look like it could be separated by a straight line), he talks about other kernels (aside from linear, he also describes polynomial and gaussian) and he describes how the Sequential Minimal Optimization learning algorithm works. There's a lot of code and there are illustrated examples at the end. When I found this article fairly early on in my research, I recognised how much care had gone into preparing it but also knew that I needed something a little more beginner-level first! So I bookmarked it for future reference and now I think that I understand (almost all of) it. Hopefully, if you started off knowing nothing but have followed all the way through this post, then you will too!