Productive Rage

Dan's techie ramblings

Creating a C# ("Roslyn") Analyser - For beginners by a beginner

I've been meaning to try writing a post about creating analysers for a little while now - they're a technology that I think has huge promise for improving code quality and they're something that I've successfully played around with recently.. but I'm still very much in the early phases of being proficient and they're not something that I can just sit down and bang out easily (ie. not without a lot of googling).

So this won't be the post of an expert - but I'm hoping to use that to my advantage since I hopefully remember the pain points all too well and can go through the sort of things that I try when I'm hashing these out.

Most of the analysers I've been writing have been for libraries that work with Bridge.NET, which introduces some of its own complications. I'm hoping to talk about those problems and how to overcome them in a later post - this one will be a more general introduction.

Creating a fresh Analyser project

The easiest way to get started is to use a Microsoft template. To do this, first you need to install the Visual Studio 2016 SDK and to do this you go to File / New / Project and then choose C# in the left navigation pane, click on Extensibility and then select "Install the Visual Studio Extensibility Tools" (you may already have it installed, it's an optional component of VS2015 - if you see no link to "Install the Visual Studio Extensibility Tools" then hopefully that's why). Next, from the same Extensibility section, you need to select "Download the .NET Compiler Platform SDK". This will ensure that you have the project template installed that we're going to use and it installs some other helpful tools, such as the Syntax Visualizer (which we'll see in a moment).

Now that you have the template and since you're already in File / New / Project / C# / Extensibility, select "Analyzer with Code Fix (NuGet + VSIX)" to create an example analyser solution. This will be a fully operational analyser, split into three projects - the analyser itself, a unit test library and a "Vsix" project. This last one would be used if you wanted to create an analyser that would be installed and applied to all projects that you would ever open and not apply to any specific library. What I'll be talking about here will be creating an analyser to work with a particular library (that would be distributed with the library) - so that everyone consuming the library can benefit from it. As such, to keep things simple, delete the "Vsix" project now,

The example analyser that this template installs does something very simple - it looks for class names that are not upper case and it warns about them. In terms of functionality, this is not particularly useful.. but in terms of education and illustrating how to get started it's a good jumping off point. In fact, the project includes not just an analyser but also a "code fix" - once a non-all-upper-case class name is identified and warned about, a quick fix will be offered in the IDE to change the name to match the upper case regime that it's pushing. Code fixes can be really helpful but I'll talk about them another day, I think that there already will be plenty to deal with in this post.

The analyser class looks basically like this (I've removed comments and replaced localisable strings with hard-coded strings, just to make it a little less to absorb all at once) -

using System.Collections.Immutable;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.Diagnostics;

namespace ExampleAnalyser
{
    [DiagnosticAnalyzer(LanguageNames.CSharp)]
    public class ExampleAnalyserAnalyzer : DiagnosticAnalyzer
    {
        public const string DiagnosticId = "ExampleAnalyser";
        private const string Category = "Naming";
        private static readonly LocalizableString Title
            = "Type name contains lowercase letters";
        private static readonly LocalizableString MessageFormat
            = "Type name '{0}' contains lowercase letters";
        private static readonly LocalizableString Description
            = "Type names should be all uppercase.";

        private static DiagnosticDescriptor Rule = new DiagnosticDescriptor(
            DiagnosticId,
            Title,
            MessageFormat,
            Category,
            DiagnosticSeverity.Warning,
            isEnabledByDefault: true,
            description: Description
        );

        public override ImmutableArray<DiagnosticDescriptor> SupportedDiagnostics
        {
            get { return ImmutableArray.Create(Rule); }
        }

        public override void Initialize(AnalysisContext context)
        {
            context.RegisterSymbolAction(AnalyzeSymbol, SymbolKind.NamedType);
        }

        private static void AnalyzeSymbol(SymbolAnalysisContext context)
        {
            var namedTypeSymbol = (INamedTypeSymbol)context.Symbol;
            if (namedTypeSymbol.Name.ToCharArray().Any(char.IsLower))
            {
                context.ReportDiagnostic(Diagnostic.Create(
                    Rule,
                    namedTypeSymbol.Locations[0],
                    namedTypeSymbol.Name
                ));
            }
        }
    }
}

To summarise what's in the above code:

  1. Every analyser needs at least one rule that it will declare, where a rule has various properties such as a Diagnostic Id, Category, Title, MessageFormat, Description and Severity. The two that are most immediately interesting are Severity (make it a Warning to point out a potential mistake or make it an Error to indicate a critical problem that will prevent a build from being completed) and MessageFormat, since MessageFormat is responsible for the text that will be displayed to the user in their Error List. MessageFormat supports string replacement; in the above example, you can see that there is a "{0}" placeholder in the MessageFormat - when "Diagnostic.Create" is called, the argument "namedTypeSymbol.Name" is injected into that "{0}" placeholder.
  2. Every analyser needs to declare a "SupportedDiagnostics" value that lists all of the types of rule that it is possible for the analyser to raise. This is vital in order for the analyser to work correctly at runtime. (If you create an analyser that has three different types of rule that it can report but you forget to declare one of the types in the "SupportedDiagnostics" property, there is actually an analyser that is installed with the template that points out the mistake to you - which is a great example of how analysers can protect you at compile time from potential runtime problems!)
  3. Every analyser needs an "Initialize" method that registers what type of symbol (more on what this actually means in a moment) it's interested in and provides a reference to a method that will perform the actual analysis

The simple task of the class above is to look at any "named type" (ie. classes and structs) and inspect their name to ensure that they consist entirely of capital letters (remember, this example included in the "Analyzer with Code Fix (NuGet + VSIX)" template is simply for educational purposes and not because it's believed that all class names should be SHOUTING_FORMAT! :) Any class that doesn't have an all-caps name will result in a warning in the Error List.

To illustrate how this should work, the test project includes the following test method -

[TestMethod]
public void TestMethod2()
{
    var test = @"
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Diagnostics;

namespace ConsoleApplication1
{
    class TypeName
    {   
    }
}";
    var expected = new DiagnosticResult
    {
        Id = "ExampleAnalyser",
        Message = String.Format("Type name '{0}' contains lowercase letters", "TypeName"),
        Severity = DiagnosticSeverity.Warning,
        Locations = new[] {
            new DiagnosticResultLocation("Test0.cs", 11, 15)
        }
    };

    VerifyCSharpDiagnostic(test, expected);
}

This makes it clear to see precisely what sort of thing the analyser is looking for but it also gives us another immediate benefit - we can actually execute the analyser and step through it in the debugger if we want to have a poke around with exactly what is in the SymbolAnalysisContext reference or if we want to look at the properties of a particular INamedTypeSymbol instance. This is as easy as putting a breakpoint into the "AnalyzeSymbol" method in the example analyser and then going back into the test class, right-clicking within "TestMethod2" and selecting "Debug Tests".

I want to introduce one other useful technique before moving on - the use of the "Syntax Visualizer". An analyser works on an in-memory tree of nodes that represent the source code of the file that you're looking at*. In the unit test above, the named symbol "TypeName" is a child node of the "TypeName" class declaration, which is a child node of the "ConsoleApplication1" namespace, which is a child of a top-level construct called the "CompilationUnit". Understanding the various types of node will be key to writing analysers and the Syntax Visualizer makes this a little bit easier.

* (Although an analyser starts by examining source code in a particular file, it's also possible to look up types and values that are referenced in that code that live elsewhere - to find out what namespace a class that is referenced exists in, for example, or to determine what arguments a method that is called that exists in a different library. These lookups are more expensive than looking solely at the content in the current file, however, and so should only be done if strictly necessary. We will see how to do this shortly. When looking only at content parsed from the current file, we are looking at the "syntax tree". When looking up references elsewhere in the solution we accessing the "semantic model").

Having installed the ".NET Compiler Platform SDK" earlier, you will now have access to this tool - go to View / Other Windows / Syntax Visualizer. This shows the syntax tree for any code within your project. So, if you click on the name "TestMethod2" then you will see that it is an IdentifierToken (which is the name "TestMethod2") that is a child node of a MethodDeclaration which is a child node of a ClassDeclaration which is a child node of a NamespaceDeclaration, which is a child node of a CompilationUnit. You can click on any of these nodes in the Syntax Visualiser to inspect some of the properties of the node and you can open further branches to inspect more - for example, there is a "Block" node that will appear shortly after the IdentifierToken that you may click to reveal the nodes that represent the statements within the method.

The Syntax Visualizer

Writing a real analyser

I'm going to walk through an analyser that I created recently - starting from scratch and, hopefully, encountering the same problems that I did last time so that I can illustrate how to find out how to solve them.

The analyser is part of my Bridge.React library but you won't need to know anything about React or Bridge to follow along.

The root of the problem relates to the rendering of html "select" elements. There are three related properties to consider when rendering a "select" element; "Multiple", "Value" and "Values". Multiple is a boolean that indicates whether the elements supports only single selections (false) or zero, one or more selections (true). If rendering an element with pre-selected items then the "Value" or "Values" properties must be used. "Value" is a string while "Values" is a string array. If "Multiple" is false and "Values" is set then React will display a warning at runtime and ignore the value, similarly if "Multiple" is true and "Value" is set.

I wanted an analyser that handled these simple cases -

// This is fine
return DOM.Select(new SelectAttributes { Multiple = false, Value = "x" };

// This is fine
return DOM.Select(new SelectAttributes { Multiple = true, Values = new [] { "x", "y" } };

// Wrong (shouldn't use "Value" when "Multiple" is true)
return DOM.Select(new SelectAttributes { Multiple = true, Value = "x" };

// Wrong (shouldn't use "Values" when "Multiple" is false)
return DOM.Select(new SelectAttributes { Multiple = false, Values = new [] { "x", "y" } };

// Wrong (shouldn't use "Values" when "Multiple" defaults to false)
return DOM.Select(new SelectAttributes { Values = new [] { "x", "y" } };

It's worth mentioning that I'm only considering these simple cases so this analyser won't be "perfect". If "Multiple" is set according to a variable then I'm not going to try to follow all possible code paths to ensure that it is never true/false if Values/Value is set. I'm also not going to cater for the technically-valid case where someone instantiates a SelectAttributes and sets "Values" on it initially (but leaves "Multiple" as false) and then sets "Multiple" to true on a later line of code. While this would be valid (there would be no runtime warning), I think that it would be clearer to set "Multiple" and "Values" together. In this case, I'm imposing what I believe to be a best practice on the consumer of my library - some analysers do this, some don't.

To keep things as simple as possible for now, instead of trying to pull in the real Bridge.React library, we'll just create another class library project in the solution to work against - call it "Bridge.React" and rename the "Class1.cs" file that is automatically created as part of a class library project to "SelectAttributes.cs". Change its contents to the following:

namespace Bridge.React
{
    public sealed class SelectAttributes
    {
        public bool Multiple { private get; set; }
        public string Value { private get; set; }
        public string[] Values { private get; set; }
    }
}

This will be enough to start writing the analyser.

What I want to do is to take the example analyser from the "Analyzer with Code Fix (NuGet + VSIX)" and change it to ensure that SelectAttributes properties are always configured according to the rule outlined above. Before getting started on that, though, it seems like a good time to formalise the rules by decribing them with unit tests. We get many bonuses here - writing individual tests may help guide us through fixing them up one at a time and so help us focus on individual problems that the analyser has to solve. It will also provide us with a way to exercise the analyser and step through it with the debugger (which I find invaluable when I'm not very familiar with a library or object model - when I do have a good grasp on code then stepping through a debugger can feel very time-consuming but it can be helpful in cases like this, as I'll demonstrate shortly). Finally, the tests will help avoid regressions creeping in if I decide to refactor the analyser or extend its functionality in the future.

So, replace the contents of "UnitTest.cs" with the following:

using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.Diagnostics;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using TestHelper;

namespace ExampleAnalyser.Test
{
    [TestClass]
    public class UnitTest : DiagnosticVerifier
    {
        [TestMethod]
        public void DoNotUseValueWhenMultipleIsTrue()
        {
            var testContent = @"
                using Bridge.React;

                namespace TestCase
                {
                    public class Example
                    {
                        public void Go()
                        {
                            new SelectAttributes { Multiple = true, Value = ""1"" };
                        }
                    }
                }";

            var expected = new DiagnosticResult
            {
                Id = ExampleAnalyserAnalyzer.DiagnosticId,
                Message = "If 'Multiple' is true then the 'Values' property should be used instead of 'Value'",
                Severity = DiagnosticSeverity.Warning,
                Locations = new[]
                {
                    new DiagnosticResultLocation("Test0.cs", 10, 29)
                }
            };

            VerifyCSharpDiagnostic(testContent, expected);
        }

        [TestMethod]
        public void DoNotUseValuesWhenMultipleIsFalse()
        {
            var testContent = @"
                using Bridge.React;

                namespace TestCase
                {
                    public class Example
                    {
                        public void Go()
                        {
                            new SelectAttributes { Multiple = false, Values = new[] { ""1"" } };
                        }
                    }
                }";

            var expected = new DiagnosticResult
            {
                Id = ExampleAnalyserAnalyzer.DiagnosticId,
                Message = "If 'Multiple' is false then the 'Value' property should be used instead of 'Values'",
                Severity = DiagnosticSeverity.Warning,
                Locations = new[]
                {
                    new DiagnosticResultLocation("Test0.cs", 10, 29)
                }
            };

            VerifyCSharpDiagnostic(testContent, expected);
        }

        [TestMethod]
        public void DoNotUseValueWhenMultipleDefaultsToFalse()
        {
            var testContent = @"
                using Bridge.React;

                namespace TestCase
                {
                    public class Example
                    {
                        public void Go()
                        {
                            var x = new SelectAttributes { Values = new[] { ""1"" } };
                            x.Multiple = True;
                        }
                    }
                }";

            var expected = new DiagnosticResult
            {
                Id = ExampleAnalyserAnalyzer.DiagnosticId,
                Message = "If 'Multiple' is false then the 'Value' property should be used instead of 'Values'",
                Severity = DiagnosticSeverity.Warning,
                Locations = new[]
                {
                    new DiagnosticResultLocation("Test0.cs", 10, 37)
                }
            };

            VerifyCSharpDiagnostic(testContent, expected);
        }

        protected override DiagnosticAnalyzer GetCSharpDiagnosticAnalyzer()
        {
            return new ExampleAnalyserAnalyzer();
        }
    }
}

Now there's one more important thing to do before actually writing the analyser. When those unit tests run, the ".NET Compiler Platform" (referred to as "Roslyn") will parse and compile those code snippets in memory. This means that the code snippets need to actually be able to compile! Currently they won't because Roslyn won't know how to resolve the "Bridge.React" namespace that is referenced.

This is quite easily fixed - the DiagnosticVerifier class (which is part of the template that we started with) configures some environment options. That's why each test checks a file called "Test0.cs" - because Roslyn wants a filename to work with and that's what the DiagnosticVerifier tells it to use. It also specifies what assemblies to include when building the project. So, if the code snippets referenced "System" or "Sytem.Collections.Generic" then those references will work fine. However, it doesn't initially know about the "Bridge.React" project, so we need to tell it to support it.

  1. Add a reference to the "Bridge.React" project to the "ExampleAnalayser.Test" project
  2. Edit the file "Helpers/DiagnosticVerifier.Helper.cs" in the "ExampleAnalayser.Test" project and add the following near the top, where other MetadataReference instances are created:

    private static readonly MetadataReference CSharpBridgeReactReference
        = MetadataReference.CreateFromFile(typeof(Bridge.React.SelectAttributes).Assembly.Location);
    
  3. Open all of the code regions in that file and add pass "CSharpBridgeReactReference" into the solution by adding an additional "AddMetadataReference" call. The "CreateProject" method should now look like this:

    private static Project CreateProject(string[] sources, string language = LanguageNames.CSharp)
    {
        string fileNamePrefix = DefaultFilePathPrefix;
        string fileExt = language == LanguageNames.CSharp
            ? CSharpDefaultFileExt
            : VisualBasicDefaultExt;
        var projectId = ProjectId.CreateNewId(debugName: TestProjectName);
        var solution = new AdhocWorkspace()
        .CurrentSolution
            .AddProject(projectId, TestProjectName, TestProjectName, language)
            .AddMetadataReference(projectId, CorlibReference)
            .AddMetadataReference(projectId, SystemCoreReference)
            .AddMetadataReference(projectId, CSharpSymbolsReference)
            .AddMetadataReference(projectId, CodeAnalysisReference)
            .AddMetadataReference(projectId, CSharpBridgeReactReference);
        int count = 0;
        foreach (var source in sources)
        {
            var newFileName = fileNamePrefix + count + "." + fileExt;
            var documentId = DocumentId.CreateNewId(projectId, debugName: newFileName);
            solution = solution.AddDocument(documentId, newFileName, SourceText.From(source));
            count++;
        }
        return solution.GetProject(projectId);
    }
    

Really writing the analyser

Now that the groundwork is done and we've decided what precisely needs doing (and documented it with tests), we need to write the actual code.

Although I can use the debugger to inspect the syntax tree for the code snippets in the unit tests, at this point I think even that would be information overload. To begin with, just add the following line to one of the unit test methods - it doesn't matter which one because it will be deleted very shortly, it's just to have a bit of a poke around with the Syntax Visualizer:

var x = new Bridge.React.SelectAttributes { Multiple = true, Value = "x" };

Ensuring that the Syntax Visualizer is visible (View / Other Windows / Syntax Visualizer), clicking on "Multiple" shows the following:

ObjectInitializerExpression

The IdentifierToken is the "Multiple" property, which is part of a SimpleAssignment (ie. "Multiple = 1") which is a child of an ObjectInitializerExpression (which is the curly brackets around the two properties being set) which is a child of an ObjectCreationExpression (which is the entire statement that includes "new Bridge.React.SelectAttributes" and the setting of the two properties) and that itself is part of a VariableDeclaration (which sets "x" to be the result of the object creation). With the Syntax Visualizer, we could go all the way up to the top of the method and then to the class and then to the namespace and then to the top-level CompilationUnit. However, what we're most interested in is the ObjectInitializerExpression, since that contains the properties that we want to verify.

So, how do we alter the analyser class that we currently have in order to identify object initialisers such as this?

Currently the example analyser class has an "Initialize" method that looks like this:

public override void Initialize(AnalysisContext context)
{
    context.RegisterSymbolAction(AnalyzeSymbol, SymbolKind.NamedType);
}

The first thing to try would be to see what other options are in the "SymbolKind" enum. However, this contains things like "Alias", "Event", "Method", "NamedType", "Property" which don't bear much resemblance to ObjectInitializerExpression. Without any better plan, I recommend turning to Google. If "SymbolKind" doesn't seem to have what we want, maybe there's something else that we can extract from the AnalysisContext instance that the "Initialize" method has.

Googling for "AnalysisContext ObjectInitializerExpression" doesn't actually return that many results. However, the second one RoslynClrHeapAllocationAnalyzer/ExplicitAllocationAnalyzer.cs has some code that looks promising:

public override void Initialize(AnalysisContext context)
{
    var kinds = new[]
    {
        SyntaxKind.ObjectCreationExpression,
        SyntaxKind.AnonymousObjectCreationExpression,
        SyntaxKind.ArrayInitializerExpression,
        SyntaxKind.CollectionInitializerExpression,
        SyntaxKind.ComplexElementInitializerExpression,
        SyntaxKind.ObjectInitializerExpression,
        SyntaxKind.ArrayCreationExpression,
        SyntaxKind.ImplicitArrayCreationExpression,
        SyntaxKind.LetClause
    };
    context.RegisterSyntaxNodeAction(AnalyzeNode, kinds);
}

Instead of calling "RegisterSymbolAction" and passing a "SymbolKind" value, we can call "RegisterSyntaxNodeAction" and pass it an array of "SyntaxKind" values - where "SyntaxKind" is an enum that has an "ObjectInitializerExpression" value.

Actually, by starting to change the "Initialize" method to

public override void Initialize(AnalysisContext context)
{
    context.RegisterSyntaxNodeAction(AnalyzeSymbol,

.. it becomes clear that the method actually takes a params array and so it will be perfectly happy for us to specify only a single "SyntaxKind" value. "Initialize" now becomes:

public override void Initialize(AnalysisContext context)
{
    context.RegisterSyntaxNodeAction(
        AnalyzeSymbol,
        SyntaxKind.ObjectInitializerExpression
    );
}

But the analyser project doesn't compile now - it complains about the type of one of the arguments of the call to "SymbolAnalysisContext". It definitely takes a "SyntaxKind" enum as its second argument so it must be the first that is wrong. Intellisense indicates that it wants the first argument to be of type Action<SymbolAnalysisContext> but the "AnalyzeSymbol" method currently takes a SyntaxNodeAnalysisContext (and so is an Action<SymbolAnalysisContext>, rather than an Action<SyntaxNodeAnalysisContext>).

This is easily fixed by changing the argument of the "AnalyzeSymbol" method. Doing so, however, will mean that it causes a compile error because the example code was expecting a SymbolAnalysisContext and we want to give it a SyntaxNodeAnalysisContext. No matter, that code doesn't do what we want anyway! So change the method argument, delete its body and - while we're making changes - rename it to something better than "AnalyzeSymbol", such as "LookForInvalidSelectAttributeProperties" -

public override void Initialize(AnalysisContext context)
{
    context.RegisterSyntaxNodeAction(
        LookForInvalidSelectAttributeProperties,
        SyntaxKind.ObjectInitializerExpression
    );
}

private static void LookForInvalidSelectAttributeProperties(SyntaxNodeAnalysisContext context)
{
}

Now that the basic structure is there, we can start work on the new "LookForInvalidSelectAttributeProperties" implementation. The "context" reference that is passed in has a "Node" property and this will match the SyntaxKind value that we passed to "RegisterSyntaxNodeAction". So "context.Node" will be a reference to a node that represents an "object initialisation".

Sanity check: The SyntaxNode class (which is the base node class) has a "Kind()" method that will return the "SyntaxKind" enum value that applies to the current node - so calling "Kind()" on "context.Node" here will return the "ObjectInitializerExpression" option from the "SyntaxKind" enum.

Now that we have a reference to an object initialisation node, we can go one of two ways. We want to ensure that the type being initialised is the SelectAttributes class from the "Bridge.React" assembly and we want to check whether any invalid property combinations are being specified. The first task will involve getting the type name and then doing a lookup in the rest of the solution to work out where that type name comes from (to ensure that it is actually the "Bridge.React" SelectAttributes class and not another class that exists somewhere with the same name). The second task only requires us to look at what properties are set by code in the syntax tree that we already have. This means that the first task is more expensive to perform than the second task and so we should try to deal with "step two" first since we will be able to avoid "step one" altogether if no invalid property combinations appear.

So, to look for invalid property combinations first.. The Syntax Visualizer (as seen in the last image) shows that each individual property-setting is represented by a "SimpleAssignmentExpression" and that each of these is a direct child of the object initialisation. The SyntaxNode class has a ChildNodes() method that will return all of the children, which seems like a good place to start. So, we might be able to do something like this:

// This doesn't work,SimpleAssignmentExpressionSyntax isn't a real class :(
var propertyInitialisers = context.Node.ChildNodes()
    .OfType<SimpleAssignmentExpressionSyntax>();

.. however, "SimpleAssignmentExpressionSyntax" is not a real type. I tried starting to type out "Simple" to see if intellisense would pick up what the correct name was - but that didn't get me anywhere.

Next, I resorted to deleting those last few lines (since they don't compile) and to just putting a breakpoint at the top of "LookForInvalidSelectAttributeProperties". I then used Debug Tests on "DoNotUseValueWhenMultipleIsTrue". The breakpoint is hit.. but I can't see the child nodes with QuickWatch because "ChildNodes()" is a method, not a property, and QuickWatch only shows you property values (it doesn't offter to execute methods and show you what is returned). So I go to the Immediate Window (Debug / Windows / Immediate), type the following and hit [Enter] -

context.Node.ChildNodes().First().GetType().Name

This displays "AssignmentExpressionSyntax".

This clue is enough to stop the debugger and go back to trying to populate the "LookForInvalidSelectAttributeProperties". It may now start with:

var propertyInitialisers = context.Node.ChildNodes()
    .OfType<AssignmentExpressionSyntax>();

Using Go To Definition on AssignmentExpressionSyntax shows that it has a "Left" and a "Right" property. These are the expressions that come either side of the operator, which is always an Equals sign when considering object property initialisations.

The Syntax Visualizer shows that each "SimpleAssignmentExpression" has an "IdentifierName" on the left, so we should be able to get the property name from that.

To try to work out what type "IdentifierName" relates to, I start typing "Identifier" and intellisense suggests IdentifierNameSyntax (if it hadn't suggested anything helpful then I would have resorted to using Debug Tests again and inspecting types in the debugger). Having a poke around the IdentifierNameSyntax class, I see that it has a property "Identifier" and that has a string property "ValueText". This looks like the name of the property being set. Things are coming together. The start of the "LookForInvalidSelectAttributeProperties" can now look like this:

var propertyInitialisers = context.Node.ChildNodes()
    .OfType<AssignmentExpressionSyntax>()
    .Select(propertyInitialiser => new
    {
        PropertyName = ((IdentifierNameSyntax)propertyInitialiser.Left).Identifier.ValueText,
        ValueExpression = propertyInitialiser.Right
    });

It's worth noting that we don't have to worry about the "Left" property ever being anything other than a simple identifier because assignments in object initialisers are only ever allow to be simple assignments. For example, the following would not compile:

var x = new MyClass { Name.Value = "Ted };

.. because attempting to set nested properties in object initialisers does not compile in C#. Because it's not valid C#, we don't have to worry about it being passed through the analyser.

Maybe it's worth adding another unit test around this - to ensure that invalid C# can't result in a load of edge cases that we need to be concerned about:

[TestMethod]
public void IgnoreInvalidPropertySetting()
{
    var testContent = @"
        using Bridge.React;

        namespace TestCase
        {
            public class Example
            {
                public void Go()
                {
                    new SelectAttributes { Nested.Multiple = true };
                }
            }
        }";

    VerifyCSharpDiagnostic(testContent);
}

Note: Calling the "VerifyCSharpDiagnostic" with no "expected" value means that the test expects that the analyser will not report any violated rules.

Now we can really move things along. We're interested in property initialisers where "Multiple" is clearly true or false (meaning it is set specifically to true or false or it's not specified at all, leaving it with its default value of false). So, again using the Syntax Visualizer to work out how to tell whether an expression means a "true" constant or a "false" constant, I've come up with this:

var propertyInitialisers = context.Node.ChildNodes()
    .OfType<AssignmentExpressionSyntax>()
    .Select(propertyInitialiser => new
    {
        PropertyName = ((IdentifierNameSyntax)propertyInitialiser.Left).Identifier.ValueText,
        ValueExpression = propertyInitialiser.Right
    });

var multiplePropertyInitialiser = propertyInitialisers.FirstOrDefault(
    propertyInitialiser => propertyInitialiser.PropertyName == "Multiple"
);
bool multiplePropertyValue;
if (multiplePropertyInitialiser == null)
    multiplePropertyValue = false; // Defaults to false if not explicitlt set
else
{
    var multiplePropertyValueKind = multiplePropertyInitialiser.ValueExpression.Kind();
    if (multiplePropertyValueKind == SyntaxKind.TrueLiteralExpression)
        multiplePropertyValue = true;
    else if (multiplePropertyValueKind == SyntaxKind.FalseLiteralExpression)
        multiplePropertyValue = false;  
    else
    {
        // Only looking for very simple cases - where explicitly set to true or to false or not set at
        // all (defaulting to false). If it's set according to a method return value or a variable then
        // give up (this is just intended to catch obvious mistakes, not to perform deep and complex
        // analysis)
        return;
    }
}

The next thing to do is to look for a "Value" or "Values" property being specified that is not appropriate for the "Multiple" value that we've found.

From the above code, it should be fairly clear that the way to do this is the following:

var valuePropertyIsSpecified = propertyInitialisers.Any(
    propertyInitialiser => propertyInitialiser.PropertyName == "Value"
);
var valuesPropertyIsSpecified = propertyInitialisers.Any(
    propertyInitialiser => propertyInitialiser.PropertyName == "Values"
);
if (!valuePropertyIsSpecified && !valuesPropertyIsSpecified)
    return;

The final step is to ensure that the object initialisation that we're looking at is indeed for a SelectAttributes instance. This is the bit that requires a lookup into the "SemanticModel" and which is more expensive than just looking at the current syntax tree because it needs the project to compile and to then work out what references to external code there may be.

Knowing that I'm going to be dealing with the full semantic model, I'll start by looking through the methods available on "context.SemanticModel" to see what might help me. Using the intellisense / documentation, it doesn't take long to find a "GetTypeInfo" method that takes an ObjectCreationExpression instance - this is ideal because we have an ObjectInitializerExpressionSyntax and we know that an ObjectInitializerExpressionSyntax is a child of an ObjectCreationExpressionSyntax, so it's easy for us to get an ObjectCreationExpression (it's just the parent of ObjectInitializerExpressionSyntax that we have).

"GetTypeInfo" returns a TypeInfo instance which has two properties; "Type" and "ConvertedType". "ConvertedType" is (taken from the xml summary documentation):

The type of the expression after it has undergone an implicit conversion

which shouldn't apply here, so we'll just look at "Type". Note, though, that the documentation for "Type" says that:

For expressions that do not have a type, null is returned. If the type could not be determined due to an error, than an IErrorTypeSymbol is returned.

Since this is an object creation expression, there should always be a type returned (the type of the object being instantiated) but we do need to be careful about the error response. Here, it's fine to stop processing if there's an error - it might mean that there is a "new SelectAttributes" statements in the code being analysed but no "Using Bridge.React;" at the top of the file. We'll ignore these error cases and plan to only analyse valid code.

This is the code that needs adding to ensure that the properties that we're looking at are for a Bridge.React.SelectAttributes -

var objectCreation = (ObjectCreationExpressionSyntax)context.Node.Parent;
var objectCreationTypeInfo = context.SemanticModel.GetTypeInfo(objectCreation);
if ((objectCreationTypeInfo.Type is IErrorTypeSymbol)
|| (objectCreationTypeInfo.Type.ContainingAssembly.Identity.Name != "Bridge.React")
|| (objectCreationTypeInfo.Type.Name != "SelectAttributes"))
    return;

Having written this code, it strikes me as a good idea to add another test - one that ensures that we don't raise false positives about "Multiple" and "Value" / "Values" in cases where it's a different SelectAttributes class, that is declared somewhere other than in "Bridge.React".

/// <summary>
/// Don't analyse a SelectAttributes initialisation that is for a different SelectAttributes class
/// (only target the SelectAttributes class that is part of the Bridge.React library)
/// </summary>
[TestMethod]
public void OnlyTargetBridgeReactSelectAttributes()
{
    var testContent = @"
        namespace TestCase
        {
            public class Example
            {
                public void Go()
                {
                    new SelectAttributes { Multiple = true, Value = ""x"" };
                }
            }

            public class SelectAttributes
            {
                public bool Multiple { get; set; }
                public string Value { get; set; }
            }
        }";

    VerifyCSharpDiagnostic(testContent);
}

Now we have all of the required information to display a warning for invalid "Multiple" / "Value" / "Values" combinations. What we don't have is appropriate message content to display - we've only got the warning content from the example analyser in the project template.

So delete all of the code at the top of the analyser - the const and static strings, the "Rule" reference and the "SupportedDiagnostics" property and replace them with this:

public const string DiagnosticId = "Bridge.React";
private static readonly LocalizableString Title
    = "Be careful to use the appropriate 'Value' or 'Values' property for the 'Multiple' setting";
private static readonly LocalizableString MultipleWithValueMessage
    = "If 'Multiple' is true then the 'Values' property should be used instead of 'Value'";
private static readonly LocalizableString NoMultipleWithValuesMessage
    = "If 'Multiple' is false then the 'Value' property should be used instead of 'Values'";
private const string Category = "Configuration";

private static DiagnosticDescriptor MultipleWithValueRule = new DiagnosticDescriptor(
    DiagnosticId,
    Title,
    MultipleWithValueMessage,
    Category,
    DiagnosticSeverity.Warning,
    isEnabledByDefault: true
);
private static DiagnosticDescriptor NoMultipleWithValuesRule = new DiagnosticDescriptor(
    DiagnosticId,
    Title,
    NoMultipleWithValuesMessage,
    Category,
    DiagnosticSeverity.Warning,
    isEnabledByDefault: true
);

public override ImmutableArray<DiagnosticDescriptor> SupportedDiagnostics
{
    get { return ImmutableArray.Create(MultipleWithValueRule, NoMultipleWithValuesRule); }
}

The final step, then, is to report rules when they are broken. The following needs adding to the end of the "LookForInvalidSelectAttributeProperties" method in order to complete it:

if ((multiplePropertyValue == true) && valuePropertyIsSpecified)
{
    context.ReportDiagnostic(Diagnostic.Create(
        MultipleWithValueRule,
        context.Node.GetLocation()
    ));
}
else if ((multiplePropertyValue == false) && valuesPropertyIsSpecified)
{
    context.ReportDiagnostic(Diagnostic.Create(
        NoMultipleWithValuesRule,
        context.Node.GetLocation()
    ));
}

Localisation support

There's just one final thing to do now, which is more of a good practice than an essential - that is to replace the hard-coded strings in the analyser class with resources (that may potentially be translated into different languages one day). The project template includes a "Resources.resx" file, which is where we should move these strings to. Edit that file in Visual Studio and delete the existing entries and then add the following Name and Value pairs:

Name: SelectAttributesAnalyserTitle

Value: Be careful to use the appropriate 'Value' or 'Values' property for the 'Multiple' setting

Name: SelectAttributesAnalyserMultipleWithValueMessage

Value: If 'Multiple' is true then the 'Values' property should be used instead of 'Value'

Name: SelectAttributesAnalyserNoMultipleWithValuesTitle

Value: If 'Multiple' is false then the 'Value' property should be used instead of 'Values'

To make accessing these resources a little easier, add the following method to the bottom of the analyser class:

private static LocalizableString GetLocalizableString(string nameOfLocalizableResource)
{
    return new LocalizableResourceString(
        nameOfLocalizableResource,
        Resources.ResourceManager,
        typeof(Resources)
    );
}

Finally, replace the three hard-coded string property initialisers with the following:

    private static readonly LocalizableString Title = GetLocalizableString(
        nameof(Resources.SelectAttributesAnalyserTitle)
    );
    private static readonly LocalizableString MultipleWithValueTitle = GetLocalizableString(
        nameof(Resources.SelectAttributesAnalyserMultipleWithValueMessage)
    );
    private static readonly LocalizableString NoMultipleWithValuesTitle = GetLocalizableString(
        nameof(Resources.SelectAttributesAnalyserNoMultipleWithValuesTitle)
    );

Summary

That completes the analyser. I've included the complete source code for the final implementation below - now that it's written it doesn't look like much, which hopefully illustrates how powerful and complete the Roslyn library is. And, hopefully, it's shown that this powerful library doesn't need to be daunting because there's many resources out there for helping you understand how to use it; people have written a lot about it and so Googling for terms relating to what you want to do often yields helpful results, people have answered a lot of questions about it on Stack Overflow and so you will often find example and sample code there.

If you're not sure what terms to use to try to search for help then using the Syntax Visualizer to explore your code can set you on the right path, as can writing a test or two and then examining the "context.Node" reference in the debugger (if you do this then ensure that you are building your project in Debug mode since Release builds may prevent your breakpoints from being hit and may optimise some of the variable references away, which will mean that you won't be able to use QuickWatch on them). Finally, don't forget that there is a lot of helpful information in the xml summary documentation that's available in Visual Studio when you examine the Roslyn classes and their methods - often the names of methods are descriptive enough to help you choose the appropriate one or, at least, give you a clue as to what direction to go in.

This has really only scraped the surface of what analysers are capable of, it's a technology with huge capability and potential. I might talk about other uses for analysers (or talk about how particular analysers may be implemented) another day but two topics that I definitely will talk about soon are "code fixes" and how to get analysers to work with Bridge.NET libraries.

Code fixes are interesting because they allow you to go beyond just saying "this is wrong" to saying "this is how it may be fixed (automatically, by the IDE)". For example, if someone changed a SelectAttributes instantiation to enable multiple selections - eg. started with:

DOM.Select(
    new SelectAttributes { Value = selectedId },
    options
)

.. and changed it to:

DOM.Select(
    new SelectAttributes { Multiple = true, Value = selectedId },
    options
)

.. then the analyser could point out that the "Value" property should not be used now that "Multiple" is true but it could also offer to fix it up to the following automatically:

DOM.Select(
    new SelectAttributes { Multiple = true, Values = new[] { selectedId } },
    options
)

There will be times that the warning from an analyser will require manual intervention to correct but there will also be times where the computer could easily correct it, so it's great having the ability to explain to the computer how to do so and thus make life that bit easier for the person consuming your library.

The reason that I also want to spend a little bit of time talking about making analysers work with Bridge.NET libraries soon is that it's something of a special case since Bridge projects don't have references to the standard .net System, System.Collections, etc.. assemblies because they are replaced by special versions of those libraries that have JavaScript translations. This means that you can't reference a Bridge library from a project that relies on the standard .net assemblies, which is a bit of a problem when you want to write a Roslyn analyser for types in a Bridge library (since the analyser project will rely on standard .net assemblies and the analyser will want to reference the Bridge library whose rules are to be applied by the analyser). But there are ways to get around it and I'll go through that another time.

The complete analyser

using System.Collections.Immutable;
using System.Linq;
using System.Reflection;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;
using Microsoft.CodeAnalysis.Diagnostics;

namespace ExampleAnalyser
{
    [DiagnosticAnalyzer(LanguageNames.CSharp)]
    public sealed class ExampleAnalyserAnalyzer : DiagnosticAnalyzer
    {
        public const string DiagnosticId = "Bridge.React";
        private static readonly LocalizableString Title = GetLocalizableString(
            nameof(Resources.SelectAttributesAnalyserTitle)
        );
        private static readonly LocalizableString MultipleWithValueTitle = GetLocalizableString(
            nameof(Resources.SelectAttributesAnalyserMultipleWithValueMessage)
        );
        private static readonly LocalizableString NoMultipleWithValuesTitle = GetLocalizableString(
            nameof(Resources.SelectAttributesAnalyserNoMultipleWithValuesTitle)
        );
        private const string Category = "Configuration";

        private static DiagnosticDescriptor MultipleWithValueRule = new DiagnosticDescriptor(
            DiagnosticId,
            Title,
            MultipleWithValueTitle,
            Category,
            DiagnosticSeverity.Warning,
            isEnabledByDefault: true
        );
        private static DiagnosticDescriptor NoMultipleWithValuesRule = new DiagnosticDescriptor(
            DiagnosticId,
            Title,
            NoMultipleWithValuesTitle,
            Category,
            DiagnosticSeverity.Warning,
            isEnabledByDefault: true
        );

        public override ImmutableArray<DiagnosticDescriptor> SupportedDiagnostics
        {
            get { return ImmutableArray.Create(MultipleWithValueRule, NoMultipleWithValuesRule); }
        }

        public override void Initialize(AnalysisContext context)
        {
            context.RegisterSyntaxNodeAction(
                LookForInvalidSelectAttributeProperties,
                SyntaxKind.ObjectInitializerExpression
            );
        }

        private static void LookForInvalidSelectAttributeProperties(SyntaxNodeAnalysisContext context)
        {
            var propertyInitialisers = context.Node.ChildNodes()
                .OfType<AssignmentExpressionSyntax>()
                .Select(propertyInitialiser => new
                {
                    PropertyName = ((IdentifierNameSyntax)propertyInitialiser.Left).Identifier.ValueText,
                    ValueExpression = propertyInitialiser.Right
                });

            var multiplePropertyInitialiser = propertyInitialisers.FirstOrDefault(
                propertyInitialiser => propertyInitialiser.PropertyName == "Multiple"
            );
            bool multiplePropertyValue;
            if (multiplePropertyInitialiser == null)
                multiplePropertyValue = false; // Defaults to false if not explicitlt set
            else
            {
                var multiplePropertyValueKind = multiplePropertyInitialiser.ValueExpression.Kind();
                if (multiplePropertyValueKind == SyntaxKind.TrueLiteralExpression)
                    multiplePropertyValue = true;
                else if (multiplePropertyValueKind == SyntaxKind.FalseLiteralExpression)
                    multiplePropertyValue = false;
                else
                {
                    // Only looking for very simple cases - where explicitly set to true or to false or
                    // not set at all (defaulting to false). If it's set according to a method return
                    // value or a variable then give up (this is just intended to catch obvious
                    // mistakes, not to perform deep and complex analysis)
                    return;
                }
            }

            var valuePropertyIsSpecified = propertyInitialisers.Any(
                propertyInitialiser => propertyInitialiser.PropertyName == "Value"
            );
            var valuesPropertyIsSpecified = propertyInitialisers.Any(
                propertyInitialiser => propertyInitialiser.PropertyName == "Values"
            );
            if (!valuePropertyIsSpecified && !valuesPropertyIsSpecified)
                return;

            var objectCreation = (ObjectCreationExpressionSyntax)context.Node.Parent;
            var objectCreationTypeInfo = context.SemanticModel.GetTypeInfo(objectCreation);
            if ((objectCreationTypeInfo.Type is IErrorTypeSymbol)
            || (objectCreationTypeInfo.Type.ContainingAssembly.Identity.Name != "Bridge.React")
            || (objectCreationTypeInfo.Type.Name != "SelectAttributes"))
                return;

            if ((multiplePropertyValue == true) && valuePropertyIsSpecified)
            {
                context.ReportDiagnostic(Diagnostic.Create(
                    MultipleWithValueRule,
                    context.Node.GetLocation()
                ));
            }
            else if ((multiplePropertyValue == false) && valuesPropertyIsSpecified)
            {
                context.ReportDiagnostic(Diagnostic.Create(
                    NoMultipleWithValuesRule,
                    context.Node.GetLocation()
                ));
            }
        }

        private static LocalizableString GetLocalizableString(string nameOfLocalizableResource)
        {
            return new LocalizableResourceString(
                nameOfLocalizableResource,
                Resources.ResourceManager,
                typeof(Resources)
            );
        }
    }
}

Posted at 07:49

Comments

Using Roslyn to identify unused and undeclared variables in VBScript WSC components

(Note: I had intended to keep this aside for April Fools since it's intended to be a bit tongue-in-cheek and really just an excuse to play with technology for technology's sake.. but I haven't got many other posts that I'm working on at the moment so I'm going to just unleash this now, rather than waiting!)

Imagine that you maintain a project which was migrated over time from an old and fragile platform to a new and improved (C#) code base. But there are various complicated external components that have been left untouched since they were (mostly) working and the new code could continue to use them - allowing the valuable rewriting time to be spent elsewhere, on less compartmentalised areas of code.

For some projects, these could be C++ COM components - I'm no expert on C++, but since people are still writing a lot of code in it and there are powerful IDEs to support this (such as Visual Studio), I presume that maintaining these sorts of components is possibly a little annoying (because COM) but not the worst thing in the world. For other projects, though, these could be "Windows Scripting Components" - these are basically COM components that are written in scripting languages, such as VBScript. They look something like the following:

<?xml version="1.0" ?>
<?component error="false" debug="false" ?>
<package>
  <component id="ExampleComponent">
    <registration progid="ExampleComponent" description="Example Component" version="1" />
    <public>
      <method name="DoSomething" />
    </public>
    <script language="VBScript">
    <![CDATA[

      Function DoSomething(ByVal objOutput)
        Dim intIndex: For intIndex = 1 To 5
          objOutput.WriteLine "Entry " & iIndex
        Next
      End Function

    ]]>
    </script>
  </component>
</package>

Creating "Classic ASP" web projects using these components had the advantage that interfaces between components could be limited and documented, enabling a semblance of organisation to be brought to bear on large solutions.. but "Classic ASP" and VBScript are technologies that, by this point, should have long since been put to bed. They do not have good IDE support or debugging tools (nor do they perform well, nor is it easy to hire good people to work on your solutions that contain code in this language).

If you have components that work and that will never be needed to change, then maybe that's no big deal. Or maybe there is something in the migration plan that says that legacy components that work (and do not require adapting or extending) will be left as-is and any components that need work will be rewritten.

If this is the case, then it's easy enough to use these working components from C# -

var filename = "ExampleComponent.wsc";
dynamic component = Microsoft.VisualBasic.Interaction.GetObject(
  "script:" + new FileInfo(filename).FullName,
  null
);
component.DoSomething(new ConsoleWriter());

Note: In order for the above code to run with the WSC presented further up, the C# code needs to provide a ComVisible "objOutput" reference which has a "WriteLine" method that takes a single (string) argument. The snippet above uses a ConsoleWriter class, which could be implemented as follows:

[ComVisible(true)]
public class ConsoleWriter
{
  public void WriteLine(string value)
  {
    Console.WriteLine(value);
  }
}

But what if there isn't an agreement to rewrite any WSCs that need work and what if there are some that need bug-fixing or new functionality? Well, good luck! Error messages from these components tend to be vague and - just to really add a little extra joy to your life - they don't include line numbers. Oh, "Object expected"? Great.. will you tell me where? No. Oh, good.

If you were so intrigued by what I've written here so far that you've actually been playing along and have saved the WSC content from the top of this post into a file and executed it using the C# above, you might have noticed another problem when you ran it. Below is what is output to the console:

Entry

Entry

Entry

Entry

Entry

But, since the VBScript is performing a simple loop and writing a message that includes that loop variable in it, shouldn't it be this instead??

Entry 1

Entry 2

Entry 3

Entry 4

Entry 5

Identifying unused and undeclared variables with the VBScriptTranslator and Roslyn

Well, I do have a glimmer of hope for the problem above and, potentially, for other VBScript-writing pitfalls.

What we could do is process WSC files to -

  1. Extract VBScript section(s) from them
  2. Run the VBScript content through the VBScriptTranslator to generate C#
  3. Parse and build the resulting C# using Roslyn
  4. Use information gleaned from steps 2 and 3 to identify errors that might otherwise not be apparent before runtime

The packages we want are available through NuGet -

Before I go through these steps, let me just explain briefly what the problem was in the VBScript sample code shown further up - just in case you're not familiar with VBScript or didn't spot it.

The loop variable in the code

Dim intIndex: For intIndex = 1 To 5
  objOutput.WriteLine "Entry " & iIndex
Next

is named "intIndex" but the line that writes out the text refers to "iIndex", which is an undeclared variable.

In C#, if we tried to do something similar then the compiler would bring it immediately to our attention - eg.

for (var i = 1; i <= 5; i++)
  Console.WriteLine("Entry " + j);

Presuming that "j" was not defined elsewhere within the scope of the above code, we would be informed that

The name 'j' does not exist in the current context

But VBScript doesn't care about this, declaring variables (such as with the use of "Dim intIndex") is generally optional. The "iIndex" value in the code above is never defined, which means it gets the special VBScript "Empty" value, which is treated as an empty string when introduced into a string concatenation operation.

VBScript does support a mode that requires that variables be declared before they are referenced; "Option Explicit". If we changed the code to the following:

Option Explicit

Dim intIndex: For intIndex = 1 To 5
  objOutput.WriteLine "Entry " & iIndex
Next

then we would get an error at runtime:

Variable is undefined: 'iIndex'

Which seems much better, but there's one big gotcha to "Option Explicit" - it is not enforced when the VBScript code is parsed, it is only enforced as the code is executed. This means that enabling Option Explicit and having a script run successfully does not mean that it contains no undeclared variables, it only means that the code path that just ran contained no undeclared variables.

To illustrate, the following script will run successfully except on Saturdays -

Option Explicit

Dim intIndex: For intIndex = 1 To 5
  If IsSaturday() Then
    objOutput.WriteLine "Entry " & iIndex
  Else
    objOutput.WriteLine "Entry " & intIndex
  End If
Next

Function IsSaturday()
  IsSaturday = WeekDay(Now()) = 7
End Function

This is a pity. I think that it would have been much better for Option Explicit to have been enforced when the script was loaded. But that ship has loooooong since sailed.

So, instead of crying about spilt milk, let's look at something positive. We've got a four step plan to crack on with!

1. Extracting VBScript content from a WSC

This is the most boring step and so I'll try not to get bogged down too much here. A WSC file is xml content and we want to identify CDATA content sections within "script" tags that have a "language" attribute with the value "VBScript".

The below is some rough-and-ready code, taken from a project that I wrote years ago, dusted off to reuse here -

private static IEnumerable<Tuple<string, int>> GetVBScriptSections(string wscContent)
{
  var document = new XPathDocument(new StringReader(wscContent));
  var nav = document.CreateNavigator();
  if (nav.HasChildren && nav.MoveToFirstChild())
  {
    while (true)
    {
      foreach (var scriptSection in TryToGetVBScriptContentFromNode(nav))
        yield return scriptSection;
      if (!nav.MoveToNext())
        break;
    }
  }
}

private static IEnumerable<Tuple<string, int>> TryToGetVBScriptContentFromNode(XPathNavigator nav)
{
  if (nav.NodeType == XPathNodeType.Text)
  {
    var navParent = nav.Clone();
    navParent.MoveToParent();
    if (navParent.Name.Equals("script", StringComparison.OrdinalIgnoreCase)
    && DoesNodeHaveVBScriptLanguageAttribute(navParent))
      yield return Tuple.Create(nav.Value, ((IXmlLineInfo)nav).LineNumber - 1);
  }
  if (nav.HasChildren)
  {
    var navChildren = nav.Clone();
    if (navChildren.MoveToFirstChild())
    {
      while (true)
      {
        foreach (var scriptSection in TryToGetVBScriptContentFromNode(navChildren))
          yield return scriptSection;
        if (!navChildren.MoveToNext())
          break;
      }
    }
  }
}

private static bool DoesNodeHaveVBScriptLanguageAttribute(XPathNavigator node)
{
  node = node.Clone();
  if (!node.HasAttributes || !node.MoveToFirstAttribute())
    return false;

  while (true)
  {
    if (node.Name.Equals("language", StringComparison.OrdinalIgnoreCase)
    && node.Value.Equals("VBScript", StringComparison.OrdinalIgnoreCase))
      return true;
    if (!node.MoveToNextAttribute())
      return false;
  }
}

The "GetVBScriptSections" function will return a set of Tuples - pairs of values where the first value is the VBScript content and the second value is the line index that the content starts at in the WSC. It returns a set, rather than a single Tuple, since it is valid for WSC files to contain multiple script tags.

The source line index will be important for identifying where in the WSC that any warnings we generate later originate.

2. Translate the VBScript sections

Now that we've got VBScript content, let's translate it into C#!

After the VBScriptTranslator NuGet package is installed, the following code may be written -

foreach (var vbscriptCodeSection in GetVBScriptSections(wscContent))
{
  // When translating the VBScript, add in new lines before the content so
  // that the lines indexes in the only-VBScript content match the line
  // indexes in the WSC
  var lineIndexInSourceFile = vbscriptCodeSection.Item2;
  var blankLinesToInject = string.Join(
    "",
    Enumerable.Repeat(Environment.NewLine, lineIndexInSourceFile)
  );

  var vbscriptContent = vbscriptCodeSection.Item1;
  var translatedStatements = DefaultTranslator.Translate(
    blankLinesToInject + vbscriptContent,
    externalDependencies: new string[0],
    warningLogger: message =>
    {
      if (message.StartsWith("Undeclared variable:"))
        Console.WriteLine(message);
    }
  );

This actually goes a long way to identifying my original problem - in order for the VBScriptTranslator to do its thing, it needs to identify any undeclared variables (because it will have to create explicitly declared variables in the resulting C# code). When it encounters an undeclared variable, it will log a warning message - the code above writes to the console any warnings about undeclared variables.

Running the above against the content at the top of this post results in the following being written out:

Undeclared variable: "iIndex" (line 14)

Success! Line 14 is, indeed, the line where an undeclared variable "iIndex" was accessed.

Now that we have a C# interpretation of the source code, though, it seems like we should be able to do more by bringing the impressive array of C# analysis tools that are now available to bear (ie. Roslyn aka "Microsoft.CodeAnalysis").

Imagine if the original VBScript content was something more like this -

Function DoSomething(ByVal objOutput)
  Dim intIndex, strName

  ' .. loads of code

  For intIndex = 1 To 5
    objOutput.Write "Entry " & iIndex
  Next

  ' .. loads more code

End Function

Those legacy VBScript writers sure did love their huge functions with 100s of lines of code! So the "loads of code" sections above really could be loads of code.

One day, someone has to change this long, long function a little bit and thinks that they've removed the only use of the "strName" variable from the function. But it's hard to be sure since the function is so long and it's got conditions nested so deeply that it's headache-inducing. The Boy Scout Rule makes it seem attractive to remove the "strName" declaration if it's no longer used.. the problem is that this someone is not utterly, 100% confident that it's safe to remove. And it's not like they could just remove the variable declaration then re-run and rely on Option Explicit to inform them if the variable is still used somewhere (for the reason outlined earlier).

One way to obtain confidence as to whether a variable is used or not is to continue to the next step..

3. Build the generated C# using Roslyn

Adding the Microsoft.CodeAnalysis.CSharp NuGet package allows us to write:

private static IEnumerable<Tuple<string, int>> GetUnusedVariables(string translatedContent)
{
  // Inspired by code from www.tugberkugurlu.com (see http://goo.gl/HYT8eo)
  var syntaxTree = CSharpSyntaxTree.ParseText(translatedContent);
  var compilation = CSharpCompilation.Create(
    assemblyName: "VBScriptTranslatedContent",
    syntaxTrees: new[] { syntaxTree },
    references:
      new[]
      {
        // VBScriptTranslator content requires System, System.Collections, System.Runtime
        // and one of its own libraries to run. To identify these assemblies, one type
        // from each is identified, then its Assembly location is used to create the
        // MetadataReferences that we need here
        typeof(object),
        typeof(List<string>),
        typeof(ComVisibleAttribute),
        typeof(DefaultRuntimeSupportClassFactory),
      }
      .Select(type => MetadataReference.CreateFromFile(type.Assembly.Location)),
    options: new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary)
  );
  EmitResult result;
  using (var ms = new MemoryStream())
  {
    result = compilation.Emit(ms);
  }
  if (!result.Success)
  {
    var errorMessages = result.Diagnostics
      .Where(diagnostic =>
        diagnostic.IsWarningAsError || (diagnostic.Severity == DiagnosticSeverity.Error)
      )
      .Select(diagnostic => $"{diagnostic.Id}: {diagnostic.GetMessage()}");
    throw new Exception(
      "Compilation of generated C# code failed: " + string.Join(", ", errorMessages)
    );
  }
  return result.Diagnostics
    .Where(diagnostic => diagnostic.Id == "CS0219")
    .Select(diagnostic => Tuple.Create(
      diagnostic.GetMessage(),
      diagnostic.Location.GetLineSpan().StartLinePosition.Line
    ));
}

This will take the VBScriptTranslator-generated C# code and return information about any unused variables; a set of Tuples where each pair of values is a message about an unused variable and the line index of this variable's declaration.

We'll use this information in the final step..

4. Use information gleaned from steps 2 and 3 to identify errors that might otherwise not be apparent before runtime

In the VBScriptTranslator-calling code from step 2, we got a list of translated statements. Each of these represents a single line of C# code and has the properties "Content", "IndentationDepth" and "LineIndexOfStatementStartInSource". If we so desired, we could use the "Content" and "IndentationDepth" properties to print to the console the generated C# in a nicely-indented format.

But that's not important right now, what we really want are two things; a single string for the entirety of the generated C# content (to compile with Roslyn) and we want mappings for line index values in the C# back to line index values in the source VBScript. The C# code may have more or less lines than the VBScript (the translation process is not a simple line-to-line process), which is why these line index mappings will be important.

// Each "translatedStatements" item has a Content string and a
// LineIndexOfStatementStartInSource value (these are used to
// create a single string of C# code and to map each line in
// the C# back to a line in the VBScript)
var translatedContent = string.Join(
  Environment.NewLine,
  translatedStatements.Select(c => c.Content)
);
var lineIndexMappings = translatedStatements
  .Select((line, index) => new { Line = line, Index = index })
  .ToDictionary(
    entry => entry.Index,
    entry => entry.Line.LineIndexOfStatementStartInSource
  );

Now it's a simple case of bringing things together -

foreach (var unusedVariableWarning in GetUnusedVariables(translatedContent))
{
  var unusedVariableWarningMessage = unusedVariableWarning.Item1;
  var lineIndexInTranslatedContent = unusedVariableWarning.Item2;
  var lineIndexInSourceContent = lineIndexMappings[lineIndexInTranslatedContent];

  // Line index values are zero-based but warnings messages that refer to
  // a line generally refer to a line NUMBER, which is one-based (hence
  // the +1 operation)
  Console.WriteLine(
    $"{unusedVariableWarningMessage} (line {lineIndexInSourceContent + 1})"
  );
}

If this was run against our second WSC sample, then we would get a new warning reported:

The variable 'strname' is assigned but its value is never used (line 13)

Which is precisely what we wanted to find out - the "strName" variable is declared but never used, so it's safe for our Boy Scout Developer to remove it!

Ooooooo, I'm excited! What else could I do??

I must admit, I haven't thought too much about what other possibilities are available when some static analysis is available for VBScript code, I was just intending to mess about with Roslyn a bit. But, thinking about it, a few ideas come to mind.

As an example of the frankly terrible errors that you get when working with VBScript WSCs, if you took the WSC example from earlier and decided to refactor the FUNCTION into a SUB (in VBScript, a SUB is basically a FUNCTION that can not return a value) and you made the silly mistake of changing the function "header" but not its "terminator" - eg.

Sub DoSomething(ByVal objOutput)
  Dim intIndex: For intIndex = 1 To 5
    objOutput.Write "Entry " & iIndex
  Next
End Function

Then you would get a particularly unhelpful error when trying to load the WSC into the .net runtime -

Cannot create ActiveX component.

The problem is that the "END FUNCTION" should have been changed "END SUB", since the first VBScript line has had the keyword "FUNCTION" changed to "SUB". It would seem that the VBScript interpreter would have plenty of information available to it that would allow it to raise a more descriptive error. However, it chooses not to.

If this WSC content was run through the VBScriptTranslator, though, then an exception with the following error message would be raised:

Encountered must-handle keyword in statement content, this should have been handled by a previous AbstractBlockHandler: "End", line 16 (this often indicates a mismatched block terminator, such as an END SUB when an END FUNCTION was expected)

Ok.. I'll admit that this is not the friendliest error message ever formed. What exactly is a "must-handle keyword"? What is an "AbstractBlockHandler"?? But the good thing is that a line number is included along with a reference to an "END" token - and this hopefully is enough to point you at where the problem is.

Another idea that springs to mind is to try to identify functions that have inconsistent return types, in terms of whether they are value types or object references. In VBScript, you must be aware of this distinction at all times - if calling a function that you expect to return an object, then you need to write the function call using the "SET" keyword - eg.

Set objPrice = GetPriceDetails(order)

But if you expect it to return a value type, then you would write it as

sngPrice = GetPriceDetails(order)

VBScript has a special kind of null that represents an object with no value; "Nothing". This allows you to write functions that will always return an object reference, but that may return a reference that means "no result" - eg.

Function GetPriceDetails(ByVal x)
  If IsObject(x) Then
    Set GetPriceDetails = x.PriceDetails
    Exit Function
  End If
  Set GetPriceDetails = Nothing
End Function

However, I've seen code that forgets this and returns a value type "Null" instead - eg.

Function GetPriceDetails(ByVal x)
  If IsObject(x) Then
    Set GetPriceDetails = x.PriceDetails
    Exit Function
  End If
  GetPriceDetails = Null
End Function

Now, when calling GetPriceDetails, you will get an object reference sometimes and a value type other times. How do you know whether to use "SET" when calling it if you don't know whether you are expecting an object reference or a value type back? Answer: You don't. Most likely whoever wrote the code used "SET" because they tested the "happy case" (which returns an object reference) and forgot to test the less-happy case, which returned a "Null" value type (and that would fail at runtime if called with use of "SET").

Well, this is something else that the VBScriptTranslator can help with. Instead of using the DefaultTranslator's "Translate" method, we can use its "Parse" method. This will return a syntax tree describing the source code. By examining this data, we can identify cases, like the one above, which are almost certainly mistakes.

Below is a complete example. I won't go too deeply into the details, since that would send me even further off track than I am now!

static void Main(string[] args)
{
  var scriptContent = @"
    Function GetPriceDetails(ByVal x)
      If IsObject(x) Then
        Set GetPriceDetails = x.Price
        Exit Function
      End If
      GetPriceDetails = Null
    End Function";

  // Note: An "AbstractFunctionBlock" is a Function, a Sub, or a Property - they are
  // all variations on a theme
  var codeBlocks = DefaultTranslator.Parse(scriptContent);
  foreach (var function in GetAllCodeBlocks(codeBlocks).OfType<AbstractFunctionBlock>())
  {
    var returnValueSetters = GetAllCodeBlocks(function.Statements)
      .OfType<ValueSettingStatement>()
      .Where(ValueSetterTargetIs(function.Name));
    var valueTypeReturnValueSetterLineNumbers = returnValueSetters
      .Where(v => v.ValueSetType == ValueSettingStatement.ValueSetTypeOptions.Let)
      .Select(v => v.ValueToSet.Tokens.First().LineIndex + 1)
      .Distinct();
    var objectReturnValueSetterLineNumbers = returnValueSetters
      .Where(v => v.ValueSetType == ValueSettingStatement.ValueSetTypeOptions.Set)
      .Select(v => v.ValueToSet.Tokens.First().LineIndex + 1)
      .Distinct();
    if (valueTypeReturnValueSetterLineNumbers.Any()
    && objectReturnValueSetterLineNumbers.Any())
    {
      Console.WriteLine(
        "{0} \"{1}\" has both LET (lines {2}) and SET (lines {3}) return values",
        function.GetType().Name,
        function.Name.Content,
        string.Join(", ", valueTypeReturnValueSetterLineNumbers),
        string.Join(", ", objectReturnValueSetterLineNumbers)
      );
    }
  }
  Console.ReadLine();
}

private static IEnumerable<ICodeBlock> GetAllCodeBlocks(IEnumerable<ICodeBlock> blocks)
{
  foreach (var block in blocks)
  {
    yield return block;

    var parentBlock = codeBlock as IHaveNestedContent;
    if (parentBlock != null)
    {
      foreach (var nestedBlock in GetAllCodeBlocks(parentBlock.AllExecutableBlocks))
        yield return nestedBlock;
    }
  }
}

private static Func<ValueSettingStatement, bool> ValueSetterTargetIs(NameToken target)
{
  return valueSetter =>
  {
    if (valueSetter.ValueToSet.Tokens.Count() > 1)
      return false;
    var valueSetterTarget = valueSetter.ValueToSet.Tokens.Single();
    return
      (valueSetterTarget is NameToken) &&
      valueSetterTarget.Content.Equals(target.Content, StringComparison.OrdinalIgnoreCase);
  };
}

This will write out the warning

FunctionBlock "GetPriceDetails" has both LET (lines 7) and SET (lines 4) return value setters

Hurrah! Very helpful! No more waiting for run time execution to find out that some code paths return object references and some return value types!

Static analysis is very valuable. It's one of the reasons why I like C# so much because there is a lot of power in static analysis - and I'm always looking out for ways to leverage it further, such as more strongly-typed classes (should a phone number really be a string or should it be a "PhoneNumber" class?) and technologies such as code contracts (which I've been meaning to look back into for about a year now.. must stop making excuses).

But there's one other thing that could be done with VBScript WSCs and the VBScriptTranslator - instead of just translating the code to analyse it, it could be translated into C# and then executed as C#! This way the (very expensive) COM boundary would be removed between the .net hosting environment and the old legacy component. And the translated code will execute more quickly than VBScript. Double-win!

The output from a "DefaultTranslator.Translate" call is content that may be saved into a file that will then define a class called "TranslatedProgram" (this string content is what we were earlier pushing through Roslyn for further analysis). This may be executed using a runtime library included in the VBScriptTranslator NuGet package (or that is available on its own, in the VBScriptTranslator.RuntimeSupport NuGet package) with the following code -

// The "compatLayer" provides implementations of VBScript functions (like "CInt")
// to the translated code, along with functions such as "CALL", which enable late-
// bound method calls to be executed (which are then compiled into LINQ expressions
// and cached so that subsequent calls are close in performance to hand-written C#)
using (var compatLayer = DefaultRuntimeSupportClassFactory.Get())
{
  // The Runner's "Go" function returns a new instance of the translated
  // component. The "DoSomething" method from the component may then be
  // called. Translated names are all lower-cased, it makes the mismatch
  // between VBScript's case insensitivity and C#'s case SENSITIVITY
  // less important.
  var component = new TranslatedProgram.Runner(compatLayer).Go();
  component.dosomething(new ConsoleWriter());
}

So.. not actually that much Roslyn then?

Sticklers for accuracy may note, at this point, that there hasn't actually been that much use of Roslyn in a post that features that word in its title. Well.. yes, that is fair enough.

But, then, this entire post was only intended to be a slightly silly foray into "just because I can.." that included a detour through Roslyn. Let's not take things too seriously, though - I mean, really, who is still even using VBScript in any serious production applications these days??

Posted at 23:41

Comments

Locating TODO comments with Roslyn

I picked up an old project recently that I knew I'd made good progress on and that the bits that were finished were looking good.. but also I knew that it had TODO comments littered throughout it to remind me what I hadn't finished.

To get an idea just how many of these there were, I did a solution-wide search for "TODO" in Visual Studio. There were just over two hundred of them. The search results gave me a fair idea of where they were but I got it into my head that I wanted to export this into a list and then map them onto projects and - ideally - classes and methods. The first part is easy, the search results output contains the path to the file, which indicates the project name. The classes, also, could often be extracted from the filename - so long as there was only one class (or interface or enum or whatever) per file, though no nested types would be awkward.

And this, really, would have been enough information to start tracking my progress and have a checklist that I could take satisfaction in crossing items off from. But of course I wanted more! Isn't this new* Roslyn thing supposed to be about parsing code, shouldn't I be able to use it to find out what properties or methods the TODO comments I've found are associated with? And don't I sometimes need a break from genuinely productive work to play with something new and shiny under the pretense of doing something useful with it?? :)

* (Not that new, actually, seeing as it was announced for preview back in 2011).

The two sides of Roslyn

Roslyn is often talked about as enabling a "compiler as a service" - where code can be compiled and executed on-the-fly. So some sort of scripting engine could be created to dynamically change behaviour on already-executing code. Essentially, Roslyn can take source code (C# or VB) and generate IL, which can then be executed and interacted with by the application that fed that source code through it.

However, the other side of it is that it provides "rich code analysis APIs" (according to its page on MSDN) - meaning that it will help you examine the source code, even if you have no intention of executing that code. Which sounds exactly like what I want to try to locate my TODO comments within a containing method / property / type / namespace.

If I had more ambitious aims in mind then it could also be used for all manner of IDE extensions for code investigation, refactoring or "best practices analysis". A bit like many of the features that ReSharper provides (though ReSharper predates it, and woe betide anyone who asks if they are thinking of integrating with Roslyn so that they don't have to maintain as much parsing code of their own - Ask me again if ReSharper will use Roslyn.. I dare you).

To getting started with Roslyn, you install it through NuGet - though, currently, it's marked as pre-release so mightn't show up when you search for it. The best thing to do is follow the instruction on the NuGet package page and run

Install-Package Microsoft.CodeAnalysis -Pre

at the Package Manager Console.

With this done, parsing code is as easy as

var parsedContent = CSharpSyntaxTree.ParseText(content);

where "content" is a string. This string may be an entire file as you would expect to encounter it in a project - with a namespace containing class / interface / enum and fields / properties / methods / values - or it may be a "fragment", such as a single method or method call (as often illustrated when people talk about using Roslyn for scripting).

The "ParseText" method returns a SyntaxTree instance. This is an immutable structure that describes the parsed content. I'm a huge fan of immutable structures since I think it makes code much easier to reason about (my love of immutability has been a theme through many of the posts I've written). In Roslyn's design it has been stated that

The short answer to why syntax trees are immutable in Roslyn is that it makes parallel work much easier. You can take a syntax tree and pass it to any thread and not worry that someone else will mutate it while you are in the middle of doing analysis. This is useful in the command line compiler so that multiple trees can have their methods bound in parallel (which may need to occasionally access information from a different tree), but it's EXTREMELY important for VS scenarios where we want to have an extensibility model that allows many extensions to analyze and transform the same tree in parallel, and it doesn't really work to have a model that forces all those separate extensions to co-ordinate locking a single tree. Similarly, providing each extension its own copy of the tree would be prohibitive from a memory overhead point of view.

(I took this from a Google Groups thread Why are Roslyn Syntax Trees Immutable? and the answer is attributed to "the Roslyn PM").

Eric Lippert has also written about the design, saying that they wanted the data structures to be immutable and persistent and that

By persistence I mean the ability to reuse most of the existing nodes in the tree when an edit is made to the text buffer. Since the nodes are immutable, there's no barrier to reusing them, as I've discussed many times on this blog. We need this for performance; we cannot be re-parsing huge wodges of text every time you hit a key. We need to re-lex and re-parse only the portions of the tree that were affected by the edit, because we are potentially re-doing this analysis between every keystroke.

This is in the context of using Roslyn to analyse code being written within Visual Studio - the full post is titled Persistence, Facades and Roslyn's Red-Green Trees.

Get to the point already!

So. Enough history. Back to my TODO-search.

The SyntaxTree returned from "ParseText" looks quite complex at first glance when you starting poking around it with Visual Studio's "QuickWatch" facility, at least (which is the first thing I did).

However, Roslyn helpfully provides a SyntaxWalker class, which may be used to easily examine every node within the tree. It uses the vistor pattern to do this. Design patterns are said to be a benefit when their form is appropriate to your problem such that they extend your vocabulary to describe the solution. There seem like there are times, unfortunately, that people layer on design patterns and abstractions only because they think they should - which is why it's nice in cases like this where it makes perfect sense and succeeds in makings things simple if you know the pattern being used. Last year, I was writing a plugin for dotLess which used the visitor pattern to traverse the nodes in a stylesheet (see Cross Browser (Pseudo) Source Mapping with LESS) and it was nice to see the exact same concept in use here.

The simplest implementation is

public class TriviaVisitor : SyntaxWalker
{
  public TriviaVisitor() : base(SyntaxWalkerDepth.StructuredTrivia) { }
  protected override void VisitTrivia(SyntaxTrivia trivia)
  {
    // Examine Trivia here..
  }
}

When the "Visit" method is called (which is defined by the SyntaxWalker class) and given a parsed tree, the "VisitTrivia" method is called for every SyntaxTrivia instance that is encountered within that tree - eg.

(new TriviaVisitor()).Visit(
  CSharpSyntaxTree.ParseText(content).GetRoot()
);

Comments and whitespace are SyntaxTrivia. Everything else will be represented by the SyntaxNode and SyntaxToken types. A SyntaxNode is made up on SyntaxTokens. For example, a "UsingDirectiveSyntax" represents a "using" statement such as

using System;

and will contain SyntaxTokens for the "using", "System" and ";" components of the statement.

These SyntaxNodes and SyntaxTokens are part of the tree that describes that parsed content. Trivia, however, are not directly part of the hierarchical data - rather, they are related to tokens and accessible through the token's "LeadingTrivia" and "TrailingTrivia" properties. Conversely, SyntaxTrivia instances have a "Token" property which allows you to map from the trivia back to the associated token.

So, within a "VisitTrivia" method, we can identify trivia we're interested in (comments, in this case, rather than whitespace) and determine what token they're associated with. The token will have a "Parent" property, which is the SyntaxNode that it's part of. The node is part of a hierarchy, which can be traversed up through via the "Parent" property values - each node may be something we're interested in identifying; such as the method containing the comment, the type containing that method or the namespace containing that type (must remember, though, that not all comments will be within methods - some may be TODO comments annotating a class, or even just sitting out on their own in an otherwise-empty file).

public class CommentLocatingVisitor : SyntaxWalker
{
  private readonly Action<ToDoComment> _commentLocated;
  public CommentLocatingVisitor(Action<ToDoComment> commentLocated)
    : base(SyntaxWalkerDepth.StructuredTrivia)
  {
    if (commentLocated == null)
      throw new ArgumentNullException("commentLocated");

    _commentLocated = commentLocated;
  }

  protected override void VisitTrivia(SyntaxTrivia trivia)
  {
    if (_commentTypes.Contains(trivia.CSharpKind()))
    {
      string triviaContent;
      using (var writer = new StringWriter())
      {
        trivia.WriteTo(writer);
        triviaContent = writer.ToString();
      }

      // Note: When looking for the containingMethodOrPropertyIfAny, we want MemberDeclarationSyntax
      // types such as ConstructorDeclarationSyntax, MethodDeclarationSyntax, IndexerDeclarationSyntax,
      // PropertyDeclarationSyntax but NamespaceDeclarationSyntax and TypeDeclarationSyntax also
      // inherit from MemberDeclarationSyntax and we don't want those
      var containingNode = trivia.Token.Parent;
      var containingMethodOrPropertyIfAny = TryToGetContainingNode<MemberDeclarationSyntax>(
        containingNode,
        n => !(n is NamespaceDeclarationSyntax) && !(n is TypeDeclarationSyntax)
      );
      var containingTypeIfAny = TryToGetContainingNode<TypeDeclarationSyntax>(containingNode);
      var containingNameSpaceIfAny = TryToGetContainingNode<NamespaceDeclarationSyntax>(containingNode);
      _commentLocated(new ToDoComment(
        triviaContent,
        trivia.SyntaxTree.GetLineSpan(trivia.Span).StartLinePosition.Line,
        containingMethodOrPropertyIfAny,
        containingTypeIfAny,
        containingNameSpaceIfAny
      ));
    }
    base.VisitTrivia(trivia);
  }

  private static HashSet<SyntaxKind> _commentTypes = new HashSet<SyntaxKind>(new[] {
    SyntaxKind.SingleLineCommentTrivia,
    SyntaxKind.MultiLineCommentTrivia,
    SyntaxKind.DocumentationCommentExteriorTrivia,
    SyntaxKind.SingleLineDocumentationCommentTrivia,
    SyntaxKind.MultiLineDocumentationCommentTrivia
  });

  private T TryToGetContainingNode<T>(SyntaxNode node, Predicate<T> optionalFilter = null)
    where T : SyntaxNode
  {
    if (node == null)
      throw new ArgumentNullException("node");

    var currentNode = node;
    while (true)
    {
      var nodeOfType = currentNode as T;
      if (nodeOfType != null)
      {
        if ((optionalFilter == null) || optionalFilter(nodeOfType))
          return nodeOfType;
      }
      if (currentNode.Parent == null)
        break;
      currentNode = currentNode.Parent;
    }
    return null;
  }
}

This CommentLocatingVisitor class is instantiated with a callback that is executed for every comment that is encountered when its "ParseText" method is called and the provided root traversed.

To keep things organised, this callback passes a Comment instance, as follows:

public class Comment
{
  public Comment(
    string content,
    int lineNumber,
    MemberDeclarationSyntax methodOrPropertyIfAny,
    TypeDeclarationSyntax typeIfAny,
    NamespaceDeclarationSyntax namespaceIfAny)
  {
    if (string.IsNullOrEmpty(content))
      throw new ArgumentException("Null/blank content specified");
    if (lineNumber < 1)
      throw new ArgumentOutOfRangeException("lineNumber");

    Content = content;
    LineNumber = lineNumber;
    MethodOrPropertyIfAny = methodOrPropertyIfAny;
    TypeIfAny = typeIfAny;
    NamespaceIfAny = namespaceIfAny;
  }

  /// <summary>
  /// This will never be null or blank
  /// </summary>
  public string Content { get; private set; }

  /// <summary>
  /// This will always be a positive integer
  /// </summary>
  public int LineNumber { get; private set; }

  /// <summary>
  /// This may be null since the comment may not exist within a method or property
  /// </summary>
  public MemberDeclarationSyntax MethodOrPropertyIfAny { get; private set; }

  /// <summary>
  /// This may be null since the comment may not exist within an class, interface or struct
  /// </summary>
  public TypeDeclarationSyntax TypeIfAny { get; private set; }

  /// <summary>
  /// This may be null since the comment may not exist within a method or property
  /// </summary>
  public NamespaceDeclarationSyntax NamespaceIfAny { get; private set; }
}

So now, given the contents of any C# file, the comments can be identified and traced to the constructs that they're associated with. Now they just need to be filtered to those containing the text "TODO", since those are the particular comments of interest.

For the first stab I took at this, I did a search-all-solution for "TODO" and copy-pasted the results into a file. I then read in this file, extracted the filenames and ran the above against the contents of each file.

But surely there's a better way..

Parsing the solution

What would be ideal would be the ability to point some code at a solution file, for it to determine what projects are in the solution, what C# code files are in the projects and then to extract all of the locations of TODO comments within those. None of this search-all / copy-paste / parse-the-results-and-read-the-files-from-there nonsense.

There are two parts to this - reading the solution file to get the projects and reading the individual project files. I'll start with the latter since it turned out to be easier.

If you add a reference to "Microsoft.Build" then you can can use the ProjectCollection type in a method such as

private static IEnumerable<FileInfo> GetCSharpCompileItemFilesForProject(FileInfo projectFile)
{
  if (projectFile == null)
    throw new ArgumentNullException("projectFile");

  return (new ProjectCollection()).LoadProject(projectFile.FullName).AllEvaluatedItems
    .Where(item => item.ItemType == "Compile")
    .Select(item => item.EvaluatedInclude)
    .Where(include => include.EndsWith(".cs", StringComparison.OrdinalIgnoreCase))
    .Select(include => new FileInfo(Path.Combine(projectFile.Directory.FullName, include)));
}

Nice when the framework provides you just what you need! This is basically just looking for ".cs" items in a given project file and returning FileInfo instances such that the full path is made available (the filenames in the project will be paths relative to the location of the project file and so need to be combined with the project file location to get the full path of the file).

The solution file parsing is not quite so elegant.

There is a Stack Overflow question "How do I compile a C# solution with Roslyn?" which talks about parsing a solution file. But it's very out of date and the code doesn't compile. But it leads to another question "Roslyn / Find References - Can't properly load Workspace" which looks like it's going to help but I encountered the same problem as this question: "MSBuildWorkspace.Create() throws exception". The gist is that to use this you need to Microsoft.Build version 14, whereas the version available (for VS 2013, at least) is version 4. It seems like the solution is to download the VS 2014 CTP or get the ISO file and root around for the version 14 assembly.

At this point, I got bored with it and fell back to parsing the solution field with a regular expression, looking for ".csproj" files in what look like project declarations.

private static IEnumerable<FileInfo> GetProjectFilesForSolution(FileInfo solutionFile)
{
  if (solutionFile == null)
    throw new ArgumentNullException("solutionFile");

  var projectFileMatcher = new Regex(
    @"Project\(""\{\w{8}-\w{4}-\w{4}-\w{4}-\w{12}\}""\) = ""(.*?)"", ""(?<projectFile>(.*?\.csproj))"", ""\{\w{8}-\w{4}-\w{4}-\w{4}-\w{12}\}"""
  );
  foreach (Match match in projectFileMatcher.Matches(solutionFile.OpenText().ReadToEnd()))
  {
    yield return new FileInfo(
      Path.Combine(solutionFile.Directory.FullName, match.Groups["projectFile"].Value)
    );
  }
}

It feels a bit dirty but it seems to do the job! And this is hardly production code so I can live with it.

Cryptic warnings

There is another small niggle with all this code. It works but there's a compile warning

Found conflicts between different versions of the same dependent assembly that could not be resolved. These reference conflicts are listed in the build log when log verbosity is set to detailed.

I don't like compile warnings, if something's wrong then I want to make it right. Plenty of people have eloquently made the case for always resolving compile warnings so I won't go over old ground here - just suffice to say that I agree!

The log verbosity can be altered by going to Tools / Option / Projects and Solutions / Build and Run, from there "MSBuild project build output verbosity" can be changed. So I set it to "Detailed" as instructed in the warning message and found.. nothing useful.

It turns out that this warning is telling a bit of a fib and you actually need to bump the verbosity up another step to "Diagnostic". Then the log includes the following

There was a conflict between "Microsoft.Build, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a" and "Microsoft.Build, Version=14.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a".

It also includes lots of other useful information like what references have what dependencies, so I can see that Microsoft Build v4 is required by project item "Microsoft.Build" (meaning that is the version that I explicitly added as a reference to parse the project files). And I can see that Microsoft Build v14 is required by the project items "Microsoft.CodeAnalysis.Workspaces", "Microsoft.CodeAnalysis.VisualBasic.Workspaces" and "Microsoft.CodeAnalysis.CSharp.Workspaces", which are references pulled in by the Roslyn NuGet package.

Unfortunately, I've already explained that I gave up trying to install Microsoft.Build v14! If this was "real" code then I would do it properly and investigate installing that package properly to get rid of this warning.. but for this sort of one-off task (pulling the TODO comments out of a solution, once) I decided I can live with the warning. At least I have an idea how to sort it out if I ever do want to use this code in a more demanding environment.

Parting words

This first foray into Roslyn's capabilities has been interesting. I've clearly scratched only the very outer surface of it but it seems like a really well considered product, I think it could be useful in many scenarios and fully intend to have a poke around with its compiling capabilities at some point (since I do love a bit of dynamic compilation, as I was writing about last time!).

If anything that I've written about today could be useful to you, I've put a complete solution up on Bitbucket - find it at The TODOCommentRetriever.

Posted at 19:38

Comments