28 April 2021
By training another type of model from the open source .NET library that I've been using and combining its results with the similarity model from last time (see Automating "suggested / related posts" links for my blog posts), I'm going to improve the automatically-generated "you may be interested in" links that I'm adding to my blog.
Sufficient improvement, in fact, that I'll start displaying the machine-suggested links at the bottom of each post.
In my last post, I had trained a fastText model (as part of the Catalyst .NET library) by having it read all of my blog posts so that it could predict which posts were most likely to be similar to which other posts.
This came back with some excellent suggestions, like this:
Learning F# via some Machine Learning: The Single Layer Perceptron
How are barcodes read?? (Library-less image processing in C#)
Writing F# to implement 'The Single Layer Perceptron'
Face or no face (finding faces in photos using C# and AccordNET)
.. but it also produced some less good selections, like this:
Simple TypeScript type definitions for AMD modules
STA ApartmentState with ASP.Net MVC
WCF with JSON (and nullable types)
The joys of AutoMapper
I'm still not discounting the idea that I might be able to improve the results by tweaking hyperparameters on the training model (such as epoch, negative sampling rate and dimensions) or maybe even by changing how it processes the blog posts - eg. it currently tackles the content as English language documents but many of the posts contain large code segments and maybe that's confusing it; maybe removing the code samples before processing would give better results?
However, fiddling with those options and rebuilding over and over is a time-consuming process and there is no easy way to evaluate the "goodness" of the results - I need to flick through them all myself and try to get a rough feel for whether the last run was an improvement or not.
The premise that I will be experimenting with is to determine which words in my post titles are "interesting" and to then order the suggested-similar posts first by a score based upon how many interesting words they share and then by the similarity score that I already have.
The model that I'll be training for this is called "TF-IDF" or "Term Frequency - Inverse Document Frequency" and it looks at every word in every blog post and considers how many times that word appears in the document (the more often, the more likely that the document relates to the word) and how many times it appears across multiple documents (the more often, the more common and less "specific" it's likely to be).
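(As a rough illustration of the classic calculation - this sketch is for intuition only and the method below is hypothetical, rather than necessarily being what Catalyst computes internally:)

private static double TfIdf(string term, string[] document, string[][] allDocuments)
{
    // Term frequency: how often the term appears in this particular document
    double termFrequency = document.Count(
        word => word.Equals(term, StringComparison.OrdinalIgnoreCase));

    // Inverse document frequency: log-scaled rarity of the term across ALL documents
    // (the "1 +" avoids a division by zero for terms that appear in no documents)
    double documentsContainingTerm = allDocuments.Count(
        doc => doc.Contains(term, StringComparer.OrdinalIgnoreCase));
    var inverseDocumentFrequency = Math.Log(allDocuments.Length / (1 + documentsContainingTerm));

    return termFrequency * inverseDocumentFrequency;
}

A word that appears in every single document gets a log of (slightly less than) one - ie. a score of roughly zero or below, which is why a word like "for" ends up contributing nothing.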
For each blog post that I'm looking for similar posts to, I'll take the words from its title and look up a TF-IDF score for each one, score the title of every other post by summing the values of any "interesting" words that the two titles share and then order the suggested-similar posts first by that score and then by the fastText similarity score that I already have.
Taking the example from above that didn't have particularly good similar-post recommendations, the words in its title will have the following scores:
| Word | Score |
| --- | --- |
| Simple | 0.6618375 |
| TypeScript | 4.39835453 |
| type | 0.7873714 |
| definitions | 2.60178781 |
| for | 0 |
| AMD | 3.81998682 |
| modules | 3.96386051 |
.. so it should be clear that any other titles that contain the word "TypeScript" will be given a boost.
This is by no means a perfect system as there will often be posts whose main topics are similar but whose titles are not. The example from earlier that fastText generated really good similar-post suggestions for is a great illustration of this:
Learning F# via some Machine Learning: The Single Layer Perceptron
How are barcodes read?? (Library-less image processing in C#)
Writing F# to implement 'The Single Layer Perceptron'
Face or no face (finding faces in photos using C# and AccordNET)
All of them are investigations into some form of machine learning or computer vision but the titles have very little in common. It's likely that the prediction quality of this one will actually suffer a little with the change I'm introducing but I'm looking for an overall improvement across the entire blog. I'm also not looking for a perfect general solution; I'm trying to find something that works well for my data (again, bearing in mind that there is a relatively small quantity of it - there are only around 120 posts, which doesn't give the computer a huge amount of data to work from).
(It's also worth noting that the way I implement this in my blog is that I maintain two lists - the manually-curated list that I had before, which has links for about a dozen posts, and a machine-generated list; if there are manual links present then they will be displayed and the auto-generated ones will be hidden - so if I find a particularly awkward post where the machine can't find nice matches then I can always tidy it up myself by manually creating the related-post links for that post)
Last time, I had code that was reading and parsing my blog posts into a "postsWithDocuments" list.
After training the fastText model, I'll train a TF-IDF model on all of the documents. I'll then go over each document again, have this new model "Process" them and retrieve Frequency values for each word. These values allow a score to be generated - since the scores depend upon how often a word appears in a given document, they will vary from one blog post to another and so I'm taking an average score for each distinct word.
(Confession: I'm not 100% sure that this averaging is the ideal approach here but it seems to be doing a good enough job and I'm only fiddling around with things, so good enough should be all that I need)
Console.WriteLine("Training TF-IDF model..");
var tfidf = new TFIDF(pipeline.Language, version: 0, tag: "");
await tfidf.Train(postsWithDocuments.Select(postWithDocument => postWithDocument.Document));
Console.WriteLine("Getting average TF-IDF weights per word..");
var tokenValueTFIDF = new Dictionary<string, List<float>>(StringComparer.OrdinalIgnoreCase);
foreach (var doc in postsWithDocuments.Select(postWithDocument => postWithDocument.Document))
{
// Calling "Process" on the document updates data on the tokens within the document
// (specifically, the token.Frequency value)
tfidf.Process(doc);
foreach (var sentence in doc)
{
foreach (var token in sentence)
{
if (!tokenValueTFIDF.TryGetValue(token.Value, out var freqs))
{
freqs = new();
tokenValueTFIDF.Add(token.Value, freqs);
}
freqs.Add(token.Frequency);
}
}
}
var averagedTokenValueTFIDF = tokenValueTFIDF.ToDictionary(
entry => entry.Key,
entry => entry.Value.Average(), StringComparer.OrdinalIgnoreCase
);
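(Regarding that confession about averaging: if it turns out not to be ideal then the aggregation is a one-line change to experiment with - eg. a hypothetical variant that takes the largest value seen for each word, which would favour a word's strongest showing in any single post:)

var maxTokenValueTFIDF = tokenValueTFIDF.ToDictionary(
    entry => entry.Key,
    entry => entry.Value.Max(), // largest per-document value instead of the average
    StringComparer.OrdinalIgnoreCase
);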
Now, with a couple of helper methods:
private static float GetProximityByTitleTFIDF(
    string similarPostTitle,
    HashSet<string> tokenValuesInInitialPostTitle,
    Dictionary<string, float> averagedTokenValueTFIDF,
    Pipeline pipeline)
{
    return GetAllTokensForText(similarPostTitle, pipeline)
        .Where(token => tokenValuesInInitialPostTitle.Contains(token.Value))
        .Sum(token =>
        {
            var tfidfValue = averagedTokenValueTFIDF.TryGetValue(token.Value, out var score)
                ? score
                : 0;
            if (tfidfValue <= 0)
            {
                // Ignore any tokens that report a negative impact (eg. punctuation or
                // really common words like "in")
                return 0;
            }
            return tfidfValue;
        });
}

private static IEnumerable<IToken> GetAllTokensForText(string text, Pipeline pipeline)
{
    var doc = new Document(text, pipeline.Language);
    pipeline.ProcessSingle(doc);
    return doc.SelectMany(sentence => sentence);
}
.. it's possible, for any given post, to sort the titles of the other posts according to how many "interesting" words they have in common (and how "interesting" those words are), like this:
// Post 82 on my blog is "Simple TypeScript type definitions for AMD modules"
// ("First" rather than "FirstOrDefault" so that a missing post fails immediately and
// clearly, rather than as a null reference on the line after)
var post82 = postsWithDocuments.Select(p => p.Post).First(p => p.ID == 82);
var title = post82.Title;
var tokenValuesInTitle =
    GetAllTokensForText(NormaliseSomeCommonTerms(title), pipeline)
        .Select(token => token.Value)
        .ToHashSet(StringComparer.OrdinalIgnoreCase);

var others = postsWithDocuments
    .Select(p => p.Post)
    .Where(p => p.ID != post82.ID)
    .Select(p => new
    {
        Post = p,
        ProximityByTitleTFIDF = GetProximityByTitleTFIDF(
            NormaliseSomeCommonTerms(p.Title),
            tokenValuesInTitle,
            averagedTokenValueTFIDF,
            pipeline
        )
    })
    .OrderByDescending(similarResult => similarResult.ProximityByTitleTFIDF);

foreach (var result in others)
    Console.WriteLine($"{result.ProximityByTitleTFIDF:0.000} {result.Post.Title}");
The top 11 scores (after which, everything has a TF-IDF proximity score of zero) are these:
7.183 Parsing TypeScript definitions (functional-ly.. ish)
4.544 TypeScript State Machines
4.544 Writing React components in TypeScript
4.544 TypeScript classes for (React) Flux actions
4.544 TypeScript / ES6 classes for React components - without the hacks!
4.544 Writing a Brackets extension in TypeScript, in Brackets
0.796 A static type system is a wonderful message to the present and future
0.796 A static type system is a wonderful message to the present and future - Supplementary
0.796 Type aliases in Bridge.NET (C#)
0.796 Hassle-free immutable type updates in C#
0.000 I love Immutable Data
So the idea is to then use the fastText similarity score when deciding which of these matches is best.
There are all sorts of ways that these two scoring mechanisms could be combined - eg. I could take the 20 titles with the greatest TF-IDF proximity scores and then order them by similarity (ie. which results the fastText model thinks are best) or I could reverse it and take the 20 titles that fastText thought were best and then take the three with the greatest TF-IDF proximity scores from within those. For now, I'm using the simplest approach and ordering by the TF-IDF scores first and then by the fastText similarity model. So, from the above list, the 7.183-scoring post will be taken first and then 2 out of the 5 posts that have a TF-IDF score of 4.544 will be taken, according to which ones the fastText model thought were more similar.
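To make that ordering concrete, it only takes a few lines of LINQ - note that "similarityDistanceForPost" below is a hypothetical lookup from post ID to the fastText distance from the current post (smaller distances meaning more similar):

// Sort by title-TF-IDF proximity first (descending; bigger is better), breaking ties
// with the fastText distance (ascending; smaller is better) and keep the top three
// ("similarityDistanceForPost" is a hypothetical post-ID-to-distance dictionary)
var topSuggestions = others
    .OrderByDescending(result => result.ProximityByTitleTFIDF)
    .ThenBy(result => similarityDistanceForPost[result.Post.ID])
    .Take(3)
    .ToArray();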
Again, there are lots of things that could be tweaked and fiddled with - and I imagine that I will experiment with them at some point. The main problem is that I have enough data across my posts that it's tedious looking through the output to try to decide whether I've improved things each time I make a change, but not enough data that the algorithms have a huge pile of information to work from. Couple that with the fact that training takes a few minutes to run and I have a recipe for frustration if I obsess too much about it. Right now, I'm happy enough with the suggestions and any that I want to manually override, I can do so easily.
If you want to try out the code, you can find a complete sample in the "SimilarityWithTitleTFIDF" project in the solution of this repo: BlogPostSimilarity.
Let's return to those examples that I started with.
Good suggestions from last time:
Learning F# via some Machine Learning: The Single Layer Perceptron
How are barcodes read?? (Library-less image processing in C#)
Writing F# to implement 'The Single Layer Perceptron'
Face or no face (finding faces in photos using C# and AccordNET)
Less good suggestions:
Simple TypeScript type definitions for AMD modules
STA ApartmentState with ASP.Net MVC
WCF with JSON (and nullable types)
The joys of AutoMapper
Now, the not-very-good one has improved and offers these:
Simple TypeScript type definitions for AMD modules
Parsing TypeScript definitions (functional-ly.. ish)
TypeScript State Machines
Writing a Brackets extension in TypeScript, in Brackets
.. but, as I said before, the good suggestions are now not as good as they were:
How are barcodes read?? (Library-less image processing in C#)
Face or no face (finding faces in photos using C# and Accord.NET)
Implementing F#-inspired "with" updates for immutable classes in C#
A follow-up to "Implementing F#-inspired 'with' updates in C#"
There are lots of suggestions that are still very good - eg.
Creating a C# ("Roslyn") Analyser - For beginners by a beginner
Using Roslyn to identify unused and undeclared variables in VBScript WSC components
Locating TODO comments with Roslyn
Using Roslyn code fixes to make the "Friction-less immutable objects in Bridge" even easier

Migrating my Full Text Indexer to .NET Core (supporting multi-target NuGet packages)
Revisiting .NET Core tooling (Visual Studio 2017)
The Full Text Indexer Post Round-up
The NeoCities Challenge! aka The Full Text Indexer goes client-side!

Dependency Injection with a WCF Service
Ramping up WCF Web Service Request Handling.. on IIS 6 with .Net 4.0
Consuming a WCF Web Service from PHP
WCF with JSON (and nullable types)

Translating VBScript into C#
VBScript is DIM
Using Roslyn to identify unused and undeclared variables in VBScript WSC components
If you can keep your head when all about you are losing theirs and blaming it on VBScript
.. but still some less-good suggestions, like:
Auto-releasing Event Listeners
Writing React apps using Bridge.NET - The Dan Way (Part Three)
Persistent Immutable Lists - Extended
Extendable LINQ-compilable Mappers

Problems in Immutability-land
Language detection and words-in-sentence classification in C#
Using Roslyn to identify unused and undeclared variables in VBScript WSC components
Writing a Brackets extension in TypeScript, in Brackets
However, having just looked through the matches to try to find any really awful suggestions, there aren't many that jump out at me. And, informal as that may be as a measure of success, I'm fairly happy with that!
Posted at 21:56
7 April 2021
Using the same open source .NET library as I did in my last post (Language detection and words-in-sentence classification in C#), I use some of its other machine learning capabilities to automatically generate "you may also be interested in" links to similar posts for any given post on this blog.
This site has always had a way for me to link related posts together - for example, if you scroll to the bottom of "Learning F# via some Machine Learning: The Single Layer Perceptron" then it suggests a link to "Face or no face (finding faces in photos using C# and Accord.NET)" on the basis that you might be super-excited by my fiddlings with computers being trained to make decisions on their own. But there aren't many of these links because they're something that I have to maintain manually. Firstly, that means that I have to remember / consider every previous post and decide whether it might be worth linking to the new post that I've just finished writing and, secondly, I often just forget.
There are models in the Catalyst library* that make this possible and so I thought that I would see whether I could train one with my blog post data and then incorporate the suggestions into the final content.
* (Again, see my last post for more details on this library and a little blurb about my previous employers who are doing exciting things in the Enterprise Search space)
Specifically, I'll be using the fastText model that was published by Facebook's AI Research lab in 2016 and then rewritten in C# as part of the Catalyst library.
When I first launched my blog (just over a decade ago), I initially hosted it somewhere as an ASP.NET MVC application. Largely because I wanted to try my hand at writing an MVC app from scratch and fiddling with various settings, I think.. and partly because it felt like the "natural" thing to do, seeing as I was employed as a .NET Developer at the time!
To keep things simple, I had a single text file for each blog post and the filenames were of a particular format containing a unique post ID, date and time of publishing, whether it should appear in the "Highlights" column and any tags that should be associated with it. Like this:
1,2011,3,14,20,14,2,0,Immutability.txt
That's the very first post (it has ID 1); it was published on 2011-03-14 at 20:14:02 and it is not shown in the Highlights column (hence the final zero). It has a single tag of "Immutability". Although it has a ".txt" extension, it's actually markdown content, so ".md" would have been more logical (the reason why I chose ".txt" over ".md" will likely remain forever lost in the mists of time!)
A couple of years later, I came across the project neocities.org and thought that it was a cool idea and did some (perhaps slightly hacky) work to make things work as a static site (including pushing the search logic entirely to the client) as described in The NeoCities Challenge!.
Some more years later, GitHub Pages started supporting custom domains over HTTPS (in May 2018 according to this) and so, having already moved web hosts once due to wildly inconsistent performance from the first provider, I decided to use this to-static-site logic and start publishing via GitHub Pages.
This is a long-winded way of saying that, although I publish my content these days as a static site, I write new content by running the original blog app locally and then turning it into static content later. Meaning that the original individual post files are available in the ASP.NET MVC Blog GitHub repo here:
github.com/ProductiveRage/Blog/tree/master/Blog/App_Data/Posts
Therefore, if you're sufficiently curious and want to play along at home, you can access the original markdown files for my blog posts and see if you can reproduce my results.
Following shortly is some code to do just that. GitHub has an API that allows you to query folder contents and so we can get a list of blog post files without having to do anything arduous like cloning the entire repo, scraping the information from the site or even creating an authenticated API access application, because GitHub allows us rate-limited non-authenticated access for free! Once we have the list of files, each will have a "download_url" that we can retrieve the raw content from.
To get the list of blog post files, you would call:
api.github.com/repos/ProductiveRage/Blog/contents/Blog/App_Data/Posts?ref=master
.. and get results that look like this:
[
    {
        "name": "1,2011,3,14,20,14,2,0,Immutability.txt",
        "path": "Blog/App_Data/Posts/1,2011,3,14,20,14,2,0,Immutability.txt",
        "sha": "b243ea15c891f73550485af27fa06dd1ccb8bf45",
        "size": 18965,
        "url": "https://api.github.com/repos/ProductiveRage/Blog/contents/Blog/App_Data/Posts/1,2011,3,14,20,14,2,0,Immutability.txt?ref=master",
        "html_url": "https://github.com/ProductiveRage/Blog/blob/master/Blog/App_Data/Posts/1,2011,3,14,20,14,2,0,Immutability.txt",
        "git_url": "https://api.github.com/repos/ProductiveRage/Blog/git/blobs/b243ea15c891f73550485af27fa06dd1ccb8bf45",
        "download_url": "https://raw.githubusercontent.com/ProductiveRage/Blog/master/Blog/App_Data/Posts/1%2C2011%2C3%2C14%2C20%2C14%2C2%2C0%2CImmutability.txt",
        "type": "file",
        "_links": {
            "self": "https://api.github.com/repos/ProductiveRage/Blog/contents/Blog/App_Data/Posts/1,2011,3,14,20,14,2,0,Immutability.txt?ref=master",
            "git": "https://api.github.com/repos/ProductiveRage/Blog/git/blobs/b243ea15c891f73550485af27fa06dd1ccb8bf45",
            "html": "https://github.com/ProductiveRage/Blog/blob/master/Blog/App_Data/Posts/1,2011,3,14,20,14,2,0,Immutability.txt"
        }
    },
    {
        "name": "10,2011,8,30,19,06,0,0,Mercurial.txt",
        "path": "Blog/App_Data/Posts/10,2011,8,30,19,06,0,0,Mercurial.txt",
        "sha": "ab6cf2fc360948212e29c64d9c886b3dbfe0d6fc",
        "size": 3600,
        "url": "https://api.github.com/repos/ProductiveRage/Blog/contents/Blog/App_Data/Posts/10,2011,8,30,19,06,0,0,Mercurial.txt?ref=master",
        "html_url": "https://github.com/ProductiveRage/Blog/blob/master/Blog/App_Data/Posts/10,2011,8,30,19,06,0,0,Mercurial.txt",
        "git_url": "https://api.github.com/repos/ProductiveRage/Blog/git/blobs/ab6cf2fc360948212e29c64d9c886b3dbfe0d6fc",
        "download_url": "https://raw.githubusercontent.com/ProductiveRage/Blog/master/Blog/App_Data/Posts/10%2C2011%2C8%2C30%2C19%2C06%2C0%2C0%2CMercurial.txt",
        "type": "file",
        "_links": {
            "self": "https://api.github.com/repos/ProductiveRage/Blog/contents/Blog/App_Data/Posts/10,2011,8,30,19,06,0,0,Mercurial.txt?ref=master",
            "git": "https://api.github.com/repos/ProductiveRage/Blog/git/blobs/ab6cf2fc360948212e29c64d9c886b3dbfe0d6fc",
            "html": "https://github.com/ProductiveRage/Blog/blob/master/Blog/App_Data/Posts/10,2011,8,30,19,06,0,0,Mercurial.txt"
        }
    },
    ..
While the API is rate-limited, retrieving content via the "download_url" locations is not - so we can make a single API call for the list and then download all of the individual files that we want.
Note that there are a couple of files in that folder that are NOT blog posts (such as the "RelatedPosts.txt" file, which is how I manually associate "You may also be interested in" posts) and so each filename will have to be checked to ensure that it matches the format shown above.
The title of the blog post is not in the file name; it is always the first line of the content in the file (to obtain it, we'll need to process the file as markdown content, convert it to plain text and then look at that first line).
private static async Task<IEnumerable<BlogPost>> GetBlogPosts()
{
    // Note: The GitHub API is rate limited quite severely for non-authenticated apps, so we just
    // call it once for the list of files and then retrieve them all further down via the Download
    // URLs (which don't count as API calls). Still, if you run this code repeatedly and start
    // getting 403 "rate limited" responses then you might have to hold off for a while.
    string namesAndUrlsJson;
    using (var client = new WebClient())
    {
        // The API refuses requests without a User Agent, so set one before calling (see
        // https://docs.github.com/en/rest/overview/resources-in-the-rest-api#user-agent-required)
        client.Headers.Add(HttpRequestHeader.UserAgent, "ProductiveRage Blog Post Example");
        namesAndUrlsJson = await client.DownloadStringTaskAsync(new Uri(
            "https://api.github.com/repos/ProductiveRage/Blog/contents/Blog/App_Data/Posts?ref=master"
        ));
    }

    // Deserialise the response into an array of entries that have Name and Download_Url properties
    var namesAndUrls = JsonConvert.DeserializeAnonymousType(
        namesAndUrlsJson,
        new[] { new { Name = "", Download_Url = (Uri)null } }
    );

    return await Task.WhenAll(namesAndUrls
        .Select(entry =>
        {
            var fileNameSegments = Path.GetFileNameWithoutExtension(entry.Name).Split(",");
            if (fileNameSegments.Length < 8)
                return default;
            if (!int.TryParse(fileNameSegments[0], out var id))
                return default;
            var dateContent = string.Join(",", fileNameSegments.Skip(1).Take(6));
            if (!DateTime.TryParseExact(dateContent, "yyyy,M,d,H,m,s", default, default, out var date))
                return default;
            return (PostID: id, PublishedAt: date, entry.Download_Url);
        })
        .Where(entry => entry != default)
        .Select(async entry =>
        {
            // Read the file content as markdown and parse into plain text (the first line of which
            // will be the title of the post)
            string markdown;
            using (var client = new WebClient())
            {
                markdown = await client.DownloadStringTaskAsync(entry.Download_Url);
            }
            var plainText = Markdown.ToPlainText(markdown);
            var title = plainText.Replace("\r\n", "\n").Replace('\r', '\n').Split('\n').First();
            return new BlogPost(entry.PostID, title, plainText, entry.PublishedAt);
        })
    );
}
private sealed class BlogPost
{
    public BlogPost(int id, string title, string plainTextContent, DateTime publishedAt)
    {
        ID = id;
        Title = !string.IsNullOrWhiteSpace(title)
            ? title
            : throw new ArgumentException("may not be null, blank or whitespace-only");
        PlainTextContent = !string.IsNullOrWhiteSpace(plainTextContent)
            ? plainTextContent
            : throw new ArgumentException("may not be null, blank or whitespace-only");
        PublishedAt = publishedAt;
    }

    public int ID { get; }
    public string Title { get; }
    public string PlainTextContent { get; }
    public DateTime PublishedAt { get; }
}
(Note: I use the Markdig library to process markdown)
This raw blog post content needs to be transformed into Catalyst "documents", then tokenised (split into individual sentences and words), then fed into a FastText model trainer.
Before getting to the code, I want to discuss a couple of oddities coming up. Firstly, Catalyst documents are required to train the FastText model and each document instance must be uniquely identified by a UID128 value, which is fine because we can generate them from the Title text of each blog post using the "Hash128()" extension method in Catalyst. However, (as we'll see a bit further down), when you ask for vectors* from the FastText model for the processed documents, each vector comes with a "Token" string that is the ID of the source document - so that has to be parsed back into a UID128. I'm not quite sure why the "Token" value isn't also a UID128 but it's no massive deal.
* (Vectors are just 1D arrays of floating point values - the FastText algorithm does magic to produce vectors that represent the text of the documents such that the distance between them can be compared; the length of these arrays is determined by the "Dimensions" option shown below and shorter distances between vectors suggest more similar content)
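(To make "distance between vectors" a little less abstract, below is a naive sketch of the sort of cosine distance calculation that can compare two vectors - this is for illustration only; the real code further down uses the HNSW library's own implementation:)

// Naive cosine distance between two equal-length vectors: values near zero mean that
// the vectors point in nearly the same direction (ie. similar content) while larger
// values mean less similar content
private static float NaiveCosineDistance(float[] x, float[] y)
{
    float dotProduct = 0, xMagnitudeSquared = 0, yMagnitudeSquared = 0;
    for (var i = 0; i < x.Length; i++)
    {
        dotProduct += x[i] * y[i];
        xMagnitudeSquared += x[i] * x[i];
        yMagnitudeSquared += y[i] * y[i];
    }
    return 1 - (dotProduct / (float)Math.Sqrt(xMagnitudeSquared * yMagnitudeSquared));
}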
Next, there are the FastText settings that I've used. The Catalyst README has some code near the bottom for training a FastText embedding model but I didn't have much luck with the default options. Firstly, when I used the "FastText.ModelType.CBow" option then I didn't get any vectors generated and so I tried changing it to "FastText.ModelType.PVDM" and things started looking promising. Then I fiddled with some of the other settings - some of which I have a rough idea what they mean and some, erm.. not so much.
The settings that I ended up using are these:
var fastText = new FastText(language, version: 0, tag: "");
fastText.Data.Type = FastText.ModelType.PVDM;
fastText.Data.Loss = FastText.LossType.NegativeSampling;
fastText.Data.IgnoreCase = true;
fastText.Data.Epoch = 50;
fastText.Data.Dimensions = 512;
fastText.Data.MinimumCount = 1;
fastText.Data.ContextWindow = 10;
fastText.Data.NegativeSamplingCount = 20;
I already mentioned changing the Data.Type / ModelType and the LossType ("NegativeSampling") is the value shown in the README. Then I felt like an obvious one to change was IgnoreCase, since that defaults to false and I think that I want it to be true - I don't care about the casing in any words when it's parsing my posts' content.
Now the others.. well, this library is built to work with systems with 10s or 100s of 1,000s of documents and that is a LOT more data than I have (currently around 120 blog posts) and so I made a few tweaks based on that. The "Epoch" count is the number of iterations that the training process will go through when constructing its model - by default, this is only 5 but I have limited data (meaning there's less for it to learn from but also that it's faster to complete each iteration) and so I bumped that up to 50. Then "Dimensions" is the size of the vectors generated - again, I figured that with limited data I would want a higher value and so I picked 512 (a nice round number if you're geeky enough) over the default 200. The "MinimumCount", I believe, is how often a word must appear in the training data for it to be considered - it defaults to 5, so I pulled it down to 1. The "ContextWindow" is (again, I think) how far to either side of any word that the process will look in order to determine context - the larger the value, the more expensive the calculation; I bumped this from the default 5 up to 10. Then there's the "NegativeSamplingCount" value.. I have to just put my hands up and say that I have no idea what that actually does, only that I seemed to be getting better results with a value of 20 than I was with the default of 10.
With machine learning, there is almost always going to be some value to tweaking options (the "hyperparameters", if we're being all fancy) like this when building a model. Depending upon the model and the library, the defaults can be good for the general case but my tiny data set is not really what this library was intended for. Of course, machine learning experts have more idea what they're tweaking and (sometimes, at least) hopefully what results they'll get.. but I'm happy enough with where I've ended up with these.
This talk about what those machine learning experts do brings me on to the final thing that I wanted to cover before showing the code: a little pre-processing / data-massaging. The better the data that goes in, the better (generally) the results that come out will be. So another less glamorous part of the life of a Data Scientist is cleaning up data for training models.
In my case, that only extended to noticing that a few terms didn't seem to be getting recognised as essentially being the same thing and so I wanted to give it a little hand - for example, a fair number of my posts are about my "Full Text Indexer" project and so it probably makes sense to replace any instances of that string with a single concatenated word "FullTextIndexer". And I have a range of posts about React but I didn't want it to get confused with the verb "react" and so I replaced any "React" occurrence with "ReactJS" (now, this probably means that some "React" verb occurrences were incorrectly changed but I made the replacements of this word in a case-sensitive manner and felt like I would have likely used it as the noun more often than a verb with a capital letter due to the nature of my posts).
So I have a method to tidy up the plain text content a little:
private static string NormaliseSomeCommonTerms(string text) => text
    .Replace(".NET", "NET", StringComparison.OrdinalIgnoreCase)
    .Replace("Full Text Indexer", "FullTextIndexer", StringComparison.OrdinalIgnoreCase)
    .Replace("Bridge.net", "BridgeNET", StringComparison.OrdinalIgnoreCase)
    .Replace("React", "ReactJS");
Now let's get training!
Console.WriteLine("Reading posts from GitHub repo..");
var posts = await GetBlogPosts();
Console.WriteLine("Parsing documents..");
Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));
var language = Language.English;
var pipeline = Pipeline.For(language);
var postsWithDocuments = posts
.Select(post =>
{
var document = new Document(NormaliseSomeCommonTerms(post.PlainTextContent), language)
{
UID = post.Title.Hash128()
};
pipeline.ProcessSingle(document);
return (Post: post, Document: document);
})
.ToArray(); // Call ToArray to force evaluation of the document processing now
Console.WriteLine("Training FastText model..");
var fastText = new FastText(language, version: 0, tag: "");
fastText.Data.Type = FastText.ModelType.PVDM;
fastText.Data.Loss = FastText.LossType.NegativeSampling;
fastText.Data.IgnoreCase = true;
fastText.Data.Epoch = 50;
fastText.Data.Dimensions = 512;
fastText.Data.MinimumCount = 1;
fastText.Data.ContextWindow = 10;
fastText.Data.NegativeSamplingCount = 20;
fastText.Train(
postsWithDocuments.Select(postWithDocument => postWithDocument.Document),
trainingStatus: update => Console.WriteLine($" Progress: {update.Progress}, Epoch: {update.Epoch}")
);
Now that a model has been built that can represent all of my blog posts as vectors, we need to go through those post / vector combinations and, for each one, identify the others that are most similar to it.
This will be achieved by using the HNSW.NET NuGet package that enables K-Nearest Neighbour (k-NN) searches over "high-dimensional space"*.
* (This just means that the vectors are relatively large; 512 in this case - two dimensions would be a point on a flat plane, three dimensions would be a physical point in space and anything with more dimensions than that is in "higher-dimensional space". That's not to say that any more than three dimensions is definitely a bad fit for a regular k-NN search, but 512 dimensions IS going to be a bad fit and the HNSW approach will be much more efficient)
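(To illustrate what the graph buys us: a hypothetical brute-force equivalent would look like the sketch below - where "results" is the list of post / vector entries built in a moment and "query" is one of them - comparing the query vector against every other vector on every search; fine for ~120 posts but exactly the linear cost that HNSW avoids at the scales the library targets:)

// Hypothetical brute-force k-NN: an O(n) scan of every vector for each query
var nearestToQuery = results
    .Where(other => other.UID != query.UID)
    .OrderBy(other => CosineDistance.NonOptimized(query.Vector, other.Vector))
    .Take(3)
    .ToArray();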
There are useful examples in the README about "How to build a graph?" and "How to run k-NN search?" and tweaking those for the data that I have so far leads to this:
Console.WriteLine("Building recommendations..");
// Combine the blog post data with the FastText-generated vectors
var results = fastText
.GetDocumentVectors()
.Select(result =>
{
// Each document vector instance will include a "token" string that may be mapped back to the
// UID of the document for each blog post. If there were a large number of posts to deal with
// then a dictionary to match UIDs to blog posts would be sensible for performance but I only
// have a 100+ and so a LINQ "First" scan over the list will suffice.
var uid = UID128.Parse(result.Token);
var postForResult = postsWithDocuments.First(
postWithDocument => postWithDocument.Document.UID == uid
);
return (UID: uid, result.Vector, postForResult.Post);
})
.ToArray(); // ToArray since we enumerate multiple times below
// Construct a graph to search over, as described at
// https://github.com/curiosity-ai/hnsw-sharp#how-to-build-a-graph
var graph = new SmallWorld<(UID128 UID, float[] Vector, BlogPost Post), float>(
distance: (to, from) => CosineDistance.NonOptimized(from.Vector, to.Vector),
DefaultRandomGenerator.Instance,
new() { M = 15, LevelLambda = 1 / Math.Log(15) }
);
graph.AddItems(results);
// For every post, use the "KNNSearch" method on the graph to find the three most similar posts
const int maximumNumberOfResultsToReturn = 3;
var postsWithSimilarResults = results
.Select(result =>
{
// Request one result too many from the KNNSearch call because it's expected that the original
// post will come back as the best match and we'll want to exclude that
var similarResults = graph
.KNNSearch(result, maximumNumberOfResultsToReturn + 1)
.Where(similarResult => similarResult.Item.UID != result.UID)
.Take(maximumNumberOfResultsToReturn); // Just in case the original post wasn't included
return new
{
result.Post,
Similar = similarResults
.Select(similarResult => new
{
similarResult.Id,
similarResult.Item.Post,
similarResult.Distance
})
.ToArray()
};
})
.OrderBy(result => result.Post.Title, StringComparer.OrdinalIgnoreCase)
.ToArray();
And with that, there is a list of every post from my blog, each accompanied by the three blog posts most similar to it!
Well, "most similar" according to the model that we trained and the hyperparameters that we used to do so. As with many machine learning algorithms, it will have started from a random state and tweaked and tweaked until it's time for it to stop (based upon the "Epoch" value in this FastText case) and so the results each time may be a little different.
However, if we inspect the results like this:
foreach (var postWithSimilarResults in postsWithSimilarResults)
{
    Console.WriteLine();
    Console.WriteLine(postWithSimilarResults.Post.Title);
    foreach (var similarResult in postWithSimilarResults.Similar.OrderBy(other => other.Distance))
        Console.WriteLine($"{similarResult.Distance:0.000} {similarResult.Post.Title}");
}
.. then there are some good results to be found! Like these:
Learning F# via some Machine Learning: The Single Layer Perceptron
0.229 How are barcodes read?? (Library-less image processing in C#)
0.236 Writing F# to implement 'The Single Layer Perceptron'
0.299 Face or no face (finding faces in photos using C# and AccordNET)

Translating VBScript into C#
0.257 VBScript is DIM
0.371 If you can keep your head when all about you are losing theirs and blaming it on VBScript
0.384 Using Roslyn to identify unused and undeclared variables in VBScript WSC components

Writing React components in TypeScript
0.376 TypeScript classes for (React) Flux actions
0.378 React and Flux with DuoCode
0.410 React (and Flux) with Bridge.net
However, there are also some less good ones - like these:
A static type system is a wonderful message to the present and future
0.271 STA ApartmentState with ASP.Net MVC
0.291 CSS Minification Regular Expressions
0.303 Publishing RSS

Simple TypeScript type definitions for AMD modules
0.162 STA ApartmentState with ASP.Net MVC
0.189 WCF with JSON (and nullable types)
0.191 The joys of AutoMapper

Supporting IDispatch through the COMInteraction wrapper
0.394 A static type system is a wonderful message to the present and future
0.411 TypeScript State Machines
0.414 Simple TypeScript type definitions for AMD modules
I'd like to get this good enough that I can include auto-generated recommendations on my blog and I don't feel like the consistency in quality is there yet. If they were all like the good examples then I'd be ploughing ahead right now with enabling it! But there are mediocre examples as well as those poorer ones above.
It's quite possible that I could get closer by experimenting with the hyperparameters more but that does tend to get tedious when you have to analyse the output of each run manually - looking through all the 120-ish post titles and deciding whether the supposed best matches are good or not. It would be lovely if I could concoct some sort of metric of "goodness" and then have the computer try lots of variations of parameters but one of the downsides of having relatively little data is that that is difficult*.
* (On the flip side, if I had 1,000s of blog posts as source data then the difficult part would be manually labelling enough of them as "quite similar" in numbers sufficient for the computer to know if it's done better or done worse with each experiment)
Fortunately, I have another trick up my sleeve - but I'm going to leave that for next time! This post is already more than long enough, I think. The plan is to combine results from another model in the Catalyst library with the FastText results and see if I can encourage things to look a bit neater.
If you want to try fiddling with this code but don't want to copy-paste the sections above into a new project, you can find the complete sample in the "Similarity" project in the solution of this repo: BlogPostSimilarity.
Posted at 22:21