Productive Rage

Dan's techie ramblings

Using Roslyn code fixes to make the "Friction-less immutable objects in Bridge" even easier

This is going to be a short post about a Roslyn (or "The .NET Compiler Platform", if you're from Microsoft) analyser and code fix that I've added to a library. I'm not going to try to take you through the steps required to create an analyser nor how the Roslyn object model describes the code that you've written in the IDE* but I want to talk about the analyser itself because it's going to be very useful if you're one of the few people using my ProductiveRage.Immutable library. Also, I feel like the inclusion of analysers with libraries is something that's going to become increasingly common (and I want to be able to have something to refer back to if I get the chance to say "told you!" in the future).

* (This is largely because I'm still struggling with it a bit myself; my current process is to start with Use Roslyn to Write a Live Code Analyzer for Your API and the "Analyzer with Code Fix (NuGet + VSIX)" Visual Studio template. I then tinker around a bit and try running what I've got so far, so that I can use the "Syntax Visualizer" in the Visual Studio instance that is being debugged. Then I tend to do a lot of Google searches when I feel like I'm getting close to something useful.. how do I tell if a FieldDeclarationSyntax is for a readonly field or not? Oh, good, someone else has already written some code doing something like what I want to do - I look at the "Modifiers" property on the FieldDeclarationSyntax instance).

As new .net libraries get written, some of them will have guidelines and rules that can't easily be described through the type system. In the past, the only option for such rules was to try to describe them in documentation (whether that be the project README and / or more in-depth online docs and / or the xml summary comment documentation for the types, methods, properties and fields that intellisense can bring to your attention in the IDE). The support that Visual Studio 2015 introduced for custom analysers* allows these rules to be communicated in a different manner.

* (I'm being English and stubborn, hence my use of "analysers" rather than "analyzers")

In short, they allow these library-specific guidelines and rules to be highlighted in the Visual Studio Error List, just like any error or warning raised by Visual Studio itself (even refusing to allow the project to be built, if an error-level message is recorded).

An excellent example that I've seen recently was encountered when I was writing some of my own analyser code. To do this, you can start with the "Analyzer with Code Fix (NuGet + VSIX)" template, which pulls in a range of NuGet packages and includes some template code of its own. You then need to write a class that is derived from DiagnosticAnalyzer. Your class will declare one or more DiagnosticDescriptor instances - each will be a particular rule that is checked. You then override an "Initialize" method, which allows your code to register for syntax changes and to raise any rules that have been broken. You must also override a "SupportedDiagnostics" property and return the set of DiagnosticDescriptor instances (ie. rules) that your analyser will cover. If the code that the "Initialize" method hooks up tries to raise a rule that "SupportedDiagnostics" did not declare, the rule will be ignored by the analysis engine. This would be a kind of (silent) runtime failure and it's something that is documented - but it's still a very easy mistake to make; you might create a new DiagnosticDescriptor instance and raise it from your "Initialize" method but forget to add it to the "SupportedDiagnostics" set.. whoops. In the past, you might not have realised until runtime that you'd made a mistake and, as a silent failure, you might end up getting very frustrated and be stuck wondering what had gone wrong. But, mercifully (and I say this as I made this very mistake), there is an analyser in the "Microsoft.CodeAnalysis.CSharp" NuGet package that brings this error immediately to your attention with the message:

RS1005 ReportDiagnostic invoked with an unsupported DiagnosticDescriptor

The entry in the Error List links straight to the code that called "context.ReportDiagnostic" with the unexpected rule. This is fantastic - instead of suffering a runtime failure, you are informed at compile time precisely what the problem is. Compile time is always better than run time (for many reasons - it's more immediate, so you don't have to wait until runtime, and it's more thorough; a runtime failure may only happen if a particular code path is followed, but static analysis such as this is like having every possible code path tested).
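To make that structure concrete, a minimal analyser follows the shape below (a sketch with made-up rule details - the id, messages and analysis logic are placeholders, not a real rule):

using System.Collections.Immutable;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.Diagnostics;

[DiagnosticAnalyzer(LanguageNames.CSharp)]
public sealed class ExampleAnalyzer : DiagnosticAnalyzer
{
  private static readonly DiagnosticDescriptor Rule = new DiagnosticDescriptor(
    id: "EX0001",
    title: "Example rule",
    messageFormat: "Example rule was broken",
    category: "Usage",
    defaultSeverity: DiagnosticSeverity.Warning,
    isEnabledByDefault: true
  );

  // Every rule that may be raised MUST be returned here, otherwise raising it
  // will be silently ignored (which is what RS1005 warns about)
  public override ImmutableArray<DiagnosticDescriptor> SupportedDiagnostics
  {
    get { return ImmutableArray.Create(Rule); }
  }

  public override void Initialize(AnalysisContext context)
  {
    context.RegisterSyntaxNodeAction(AnalyzeClass, SyntaxKind.ClassDeclaration);
  }

  private static void AnalyzeClass(SyntaxNodeAnalysisContext context)
  {
    // (Real analysis logic would inspect context.Node before reporting)
    context.ReportDiagnostic(Diagnostic.Create(Rule, context.Node.GetLocation()));
  }
}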

The analysers already in ProductiveRage.Immutable

The ProductiveRage uber-fans (who surely exist.. yes? ..no? :D) may be thinking "doesn't the ProductiveRage.Immutable library already have some analysers built into it?"

And they would be correct - for some time now it has included a few analysers that try to prevent some simple mistakes. As a quick reminder, the premise of the library is that it will make creating immutable types in Bridge.NET easier.

Instead of writing something like this:

public sealed class EmployeeDetails
{
  public EmployeeDetails(PersonId id, NameDetails name)
  {
    if (id == null)
      throw new ArgumentNullException("id");
    if (name == null)
      throw new ArgumentNullException("name");

    Id = id;
    Name = name;
  }

  /// <summary>
  /// This will never be null
  /// </summary>
  public PersonId Id { get; }

  /// <summary>
  /// This will never be null
  /// </summary>
  public NameDetails Name { get; }

  public EmployeeDetails WithId(PersonId id)
  {
    return Id.Equals(id) ? this : new EmployeeDetails(id, Name);
  }
  public EmployeeDetails WithName(NameDetails name)
  {
    return Name.Equals(name) ? this : new EmployeeDetails(Id, name);
  }
}

.. you can express it just as:

public sealed class EmployeeDetails : IAmImmutable
{
  public EmployeeDetails(PersonId id, NameDetails name)
  {
    this.CtorSet(_ => _.Id, id);
    this.CtorSet(_ => _.Name, name);
  }
  public PersonId Id { get; }
  public NameDetails Name { get; }
}

The if-null-then-throw validation is encapsulated in the CtorSet call (since the library takes the view that no value should ever be null - it introduces an Optional struct so that you can identify properties that may be without a value). And it saves you from having to write "With" methods for the updates as IAmImmutable implementations may use the "With" extension method whenever you want to create a new instance with an altered property - eg.

var updatedEmployee = employee.With(_ => _.Name, newName);

The library can only work if certain conditions are met. For example, every property must have a getter and a setter - otherwise, the "CtorSet" extension method won't know how to actually set the value "under the hood" when populating the initial instance (nor would the "With" method know how to set the value on the new instance that it would create).

If you forgot this and wrote the following (note the "DisplayNameLength" property that is now effectively a computed value and there would be no way for us to directly set it via a "With" call) -

public sealed class EmployeeDetails : IAmImmutable
{
  public EmployeeDetails(PersonId id, NameDetails name)
  {
    this.CtorSet(_ => _.Id, id);
    this.CtorSet(_ => _.Name, name);
  }
  public PersonId Id { get; }
  public NameDetails Name { get; }
  public int DisplayNameLength { get { return Name.DisplayName.Length; } }
}

.. then you would see the following errors reported by Visual Studio (presuming you are using 2015 or later) -

Example analyser errors raised by the ProductiveRage.Immutable library

.. which is one of the "common IAmImmutable mistakes" analysers identifying the problem for you.

Getting Visual Studio to write code for you, using code fixes

I've been writing more code with this library and I'm still, largely, happy with it. Making the move to assuming never-allow-null (which is baked into the "CtorSet" and "With" calls) means that the classes that I'm writing are a lot shorter and that type signatures are more descriptive. (I wrote about all this in my post at the end of last year "Friction-less immutable objects in Bridge (C# / JavaScript) applications" if you're curious for more details).

However.. I still don't really like typing out as much code for each class as I have to. Each class has to repeat the property names four times - once in the constructor, twice in the "CtorSet" call and a fourth time in the public property. Similarly, the type name has to be repeated twice - once in the constructor and once in the property.

This is better than the obvious alternative, which is to not bother with immutable types. I will gladly take the extra lines of code (and the effort required to write them) to get the additional confidence that a "stronger" type system offers - I wrote about this recently in my "Writing React with Bridge.NET - The Dan Way" posts; I think that it's really worthwhile to bake assumptions into the type system where possible. For example, the Props types of React components are assumed, by the React library, to be immutable - so having them defined as immutable types represents this requirement in the type system. If the Props types are mutable then it would be possible to write code that tries to change that data and then bad things could happen (you're doing something that the library expects not to happen). If the Props types are immutable then it's not even possible to write this particular kind of bad-things-might-happen code, which is a positive thing.

But still I get a niggling feeling that things could be better. And now they are! With Roslyn, you can not only identify particular patterns but you can also offer automatic fixes for them. So, if you were to start writing the EmployeeDetails class from scratch and got this far:

public sealed class EmployeeDetails : IAmImmutable
{
  public EmployeeDetails(PersonId id, NameDetails name)
  {
  }
}

.. then an analyser could identify that you were writing an IAmImmutable implementation and that you have an empty constructor - it could then offer to fix that for you by filling in the rest of the class.

The latest version of the ProductiveRage.Immutable library (1.7.0) does just that. The empty constructor will not only be identified with a warning but a light bulb will also appear alongside the code. Clicking this (or pressing [Ctrl]-[.] while within the empty constructor, for fellow keyboard junkies) will present an option to "Populate class from constructor" -

Screenshot showing the analyser identifying an empty constructor on an IAmImmutable implementation

Selecting the "Populate class from constructor" option -

Screenshot showing the code fix that may auto-populate the incomplete IAmImmutable implementation

.. will take the constructor arguments and generate the "CtorSet" calls and the public properties automatically. Now you can have all of the safety of the immutable type with no more typing effort than the mutable version!

// This is what you have to type for the immutable version,
// then the code fix will expand it for you
public sealed class EmployeeDetails : IAmImmutable
{
  public EmployeeDetails(PersonId id, NameDetails name)
  {
  }
}

// This is what you would have typed if you were feeling
// lazy and creating mutable types because you couldn't
// be bothered with the typing overhead of immutability
public sealed class EmployeeDetails
{
  public PersonId Id;
  public NameDetails Name;
}

To summarise

If you're already using the library, then all you need to do to start taking advantage of this code fix is update your NuGet reference* (presuming that you're using VS 2015 - analysers weren't supported in previous versions of Visual Studio).

* (Sometimes you have to restart Visual Studio after updating - you will know that this is the case if you get a warning in the Error List about Visual Studio not being able to load the ProductiveRage.Immutable analyser)

If you're writing your own library that has any guidelines or common gotchas that you have to describe in documentation somewhere (that the users of your library may well not read unless they have a problem - at which point they may even abandon the library, if they're only having an investigative play around with it) then I highly recommend that you consider using analysers to surface some of these assumptions and best practices. While I'm aware that I've not offered much concrete advice on how to write these analysers, the reason is that I'm still very much a beginner at it - but that puts me in a good position to be able to say that it really is fairly easy if you read a few articles about it (such as Use Roslyn to Write a Live Code Analyzer for Your API) and then just get stuck in. With some judicious Google'ing, you'll be making progress in no time!

I guess that only time will tell whether library-specific analysers become as prevalent as I imagine. It's very possible that I'm biased because I'm such a believer in static analysis. Let's wait and see*!

* Unless YOU are a library writer that this might apply to - in which case, make it happen rather than just sitting back to see what MIGHT happen! :)

Posted at 22:33


Easy "PureComponent" React performance boosts for Bridge.Net

React's great strength is that it makes creating UIs simple(r) because you can treat the view as a pure function - often, you essentially give a props reference into a top level component and it works out what to draw. Then, when something changes, you do the same again; trigger a full re-draw and rely upon React's Virtual DOM to work out what changed in an efficient manner and apply those changes to the browser DOM. The browser DOM is slow, which is why interactions with it should be minimised. The Virtual DOM is fast.

The common pre-React way to deal with UIs was to have some code to render the UI in an initial state and then further code that would change the UI based upon user interactions. React reduces these two types of state-handling (initial-display and update-for-specific-interaction) into one (full re-render).

And a lot of the time, the fast Virtual DOM performs quickly enough that you don't have to worry about what it's doing. But sometimes, you may have a UI that is so complicated that it's a lot of work for the Virtual DOM to calculate the diffs to apply to the browser DOM. Or you might have particularly demanding performance requirements, such as achieving 60 fps animations on mobile.

Handily, React has a way for you to give it hints - namely the ShouldComponentUpdate method that components may implement. This method can look at the component's current props and state values and the next props and state values and let React know if any changes are required. The method returns a boolean - false meaning "no, I don't need to redraw, this data looks the same" and true meaning "yes, I need to redraw for this new data". The method is optional, if a component doesn't implement it then it's equivalent to it always returning true. Remember, if a component returns true for "do I need to be redrawn?", the Virtual DOM is still what is responsible for dealing with the update - and it usually deals with it in a very fast and efficient manner. Returning true is not something to necessarily be worried about. However, if you can identify cases where ShouldComponentUpdate can return false then you can save the Virtual DOM from working out whether that component or any of its child components need to be redrawn. If this can be done high up in a deeply-nested component tree then it could save the Virtual DOM a lot of work.

The problem is, though, that coming up with a mechanism to reliably and efficiently compare object references (ie. props and / or state) to determine whether they describe the same data is difficult to do in the general case.

Let me paint a picture by describing a very simple example React application..

The Message Editor Example

Imagine an app that can read a list of messages from an API and allow the user of the app to edit these messages. Each message has "Content" and "Author" properties that are strings. Either of these values may be edited in the app. These messages are part of a message group that has a title - this also may be edited in the app.

(I didn't say that it was a useful or realistic app, it's just one to illustrate a point :)

The way that I like to create React apps is to categorise components as one of two things; a "Container Component" or a "Presentation Component". Presentation Components should be state-less, they should just be handed a props reference and then go off and draw themselves. Any interactions that the user makes with this component or any of its child components are effectively passed up (via change handlers on the props reference) until it reaches a Container Component. The Container Component will translate these interactions into actions to send to the Dispatcher. Actions will be handled by a store (that will be listening out for Dispatcher actions that it's interested in). When a store handles an action, it emits a change event. The Container Component will be listening out for change events on stores that it is interested in - when this happens, the Container Component will trigger a re-render of itself by updating its state based upon data now available in the store(s) it cares about. This is a fairly standard Flux architecture and, I believe, the terms "Container Component" / "Presentation Component" are in reasonably common use (I didn't make them up, I just like the principle - one of the articles that I've read that uses these descriptions is Component Brick and Mortar: The React documentation I wish I had a year ago).

So, for my example app, I might have a component hierarchy that looks like this:

AppContainer
  Title
    TextInput
      Input
  MessageList
    MessageRow
      TextInput
        Input
      TextInput
        Input
    MessageRow
      TextInput
        Input
      TextInput
        Input

There will be as many "MessageRow" components as there are messages to edit. Input is a standard React-rendered element and all of the others (AppContainer, Title, MessageList, MessageRow and TextInput) are custom components.

(Note: This is not a sufficiently deeply-nested hierarchy that React would have any problems with rendering performance, it's intended to be just complicated enough to demonstrate the point that I'm working up to).

The AppContainer is the only "Container Component" and so is the only component that has a React state reference as well as props. A state reference is, essentially, what prevents a component from being what you might consider a "pure function" - where the props that are passed in are all that affects what is rendered out. React "state" is required to trigger a re-draw of the UI, but it should be present in as few places as possible - ie. there should only be one, or a small number of, top level component(s) that have state. Components that render only according to their props data are much easier to reason about (and hence easier to write, extend and maintain).

My Bridge.NET React bindings NuGet package makes it simple to differentiate between stateful (ie. Container) components and stateless (ie. Presentation) components as it has both a Component<TProps, TState> base class and a StatelessComponent<TProps> base class - you derive from the appropriate one when you create custom components (for more details, see React (and Flux) with Bridge.net - Redux).

To start with the simplest example, below is the TextInput component. This just renders a text Input with a specified value and communicates up any requests to change that string value via an "OnChange" callback -

public class TextInput : StatelessComponent<TextInput.Props>
{
  public TextInput(Props props) : base(props) { }

  public override ReactElement Render()
  {
    return DOM.Input(new InputAttributes
    {
      Type = InputType.Text,
      Value = props.Content,
      OnChange = OnTextChange
    });
  }

  private void OnTextChange(FormEvent<InputElement> e)
  {
    props.OnChange(e.CurrentTarget.Value);
  }

  public class Props
  {
    public string Content { get; set; }
    public Action<string> OnChange { get; set; }
  }
}

It is fairly easy to envisage how you might try to implement "ShouldComponentUpdate" here - given a "this is the new props value" reference (which gets passed into ShouldComponentUpdate as an argument called "nextProps") and the current props reference, you need only look at the "Content" and "OnChange" references on the current and next props and, if both Content/Content and OnChange/OnChange references are the same, then we can return false (meaning "no, we do not need to re-draw this TextInput").

(Two things to note here: Firstly, it is not usually possible to directly compare the current props reference with the "nextProps" reference because it is common for the parent component to create a new props instance for each proposed re-render of a child component, rather than re-use a previous props instance - so the individual property values within the props references may all be consistent between the current props and nextProps, but the actual props references will usually be distinct. Secondly, the Bridge.NET React bindings only support React component life cycle method implementations on custom components derived from Component<TProps, TState> classes and not those derived from StatelessComponent<TProps>, so you couldn't actually write your own "ShouldComponentUpdate" for a StatelessComponent - but that's not important here, we're just working through a thought experiment).
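Continuing the thought experiment, the check described above might look something like the following - a hypothetical sketch only, since (as just noted) the bindings wouldn't actually let you define it on a StatelessComponent:

// Hypothetical sketch only - returning true means "yes, re-draw"
private static bool ShouldTextInputUpdate(TextInput.Props current, TextInput.Props next)
{
  // Re-draw only if the text value or the change callback has changed
  return (current.Content != next.Content)
    || !ReferenceEquals(current.OnChange, next.OnChange);
}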

Now let's move on to the MessageList and MessageRow components, since things get more complicated there -

public class MessageList : StatelessComponent<MessageList.Props>
{
  public MessageList(Props props) : base(props) { }

  public override ReactElement Render()
  {
    var messageRows = props.IdsAndMessages
      .Select(idAndMessage => new MessageRow(new MessageRow.Props
      {
        Key = idAndMessage.Item1,
        Message = idAndMessage.Item2,
        OnChange = newMessage => props.OnChange(idAndMessage.Item1, newMessage)
      }));
    return DOM.Div(
      new Attributes { ClassName = "message-list" },
      messageRows
    );
  }

  public class Props
  {
    public Tuple<int, MessageEditState>[] IdsAndMessages;
    public Action<int, MessageEditState> OnChange;
  }
}

public class MessageRow : StatelessComponent<MessageRow.Props>
{
  public MessageRow(Props props) : base(props) { }

  public override ReactElement Render()
  {
    // Note that the "Key" value from the props reference does not explicitly need
    // to be mentioned here, the React bindings will deal with it (it is important
    // to give dynamic children components unique key values, but it is handled by
    // the bindings and the React library so long as a "Key" property is present
    // on the props)
    // - See https://facebook.github.io/react/docs/multiple-components.html for
    //   more details
    return DOM.Div(new Attributes { ClassName = "message-row" },
      new TextInput(new TextInput.Props
      {
        Content = props.Message.Content,
        OnChange = OnContentChange
      }),
      new TextInput(new TextInput.Props
      {
        Content = props.Message.Author,
        OnChange = OnAuthorChange
      })
    );
  }

  private void OnContentChange(string newContent)
  {
    props.OnChange(new MessageEditState
    {
      Content = newContent,
      Author = props.Message.Author
    });
  }
  private void OnAuthorChange(string newAuthor)
  {
    props.OnChange(new MessageEditState
    {
      Content = props.Message.Content,
      Author = newAuthor
    });
  }

  public class Props
  {
    public int Key;
    public MessageEditState Message;
    public Action<MessageEditState> OnChange;
  }
}

public class MessageEditState
{
  public string Content;
  public string Author;
}

If the MessageList component wanted to implement "ShouldComponentUpdate" then its job is more difficult as it has an array of message data to check. It could do one of several things - the first, and most obviously accurate, would be to perform a "deep compare" of the arrays from the current props and the "nextProps"; ensuring firstly that there are the same number of items in both and then comparing each "Content" and "Author" value in each item of the arrays. If everything matches up then the two arrays contain the same data and (so long as the "OnChange" callback hasn't changed) the component doesn't need to re-render. Avoiding re-rendering this component (and, subsequently, any of its child components) would be a big win because it accounts for a large portion of the total UI. Not re-rendering it would give the Virtual DOM much less work to do. But would a deep comparison of this type actually be any cheaper than letting the Virtual DOM do what it's designed to do?

The second option is to presume that whoever created the props references would have re-used any MessageEditState instances that haven't changed. So the array comparison could be reduced to ensuring that the current and next props references both have the same number of elements and then performing reference equality checks on each item.

The third option is to presume that whoever created the props reference would have re-used the array itself if the data hadn't changed, meaning that a simple reference equality check could be performed on the current and next props' arrays.

The second and third options are both much cheaper than a full "deep compare" but they both rely upon the caller following some conventions. This is why I say that this is a difficult problem to solve for the general case.
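To make the trade-off concrete, here's a sketch of what that first "deep compare" option might look like for this data (illustrative only - note how much work it does compared to a single reference equality check):

// Illustrative sketch only - a "deep compare" of the current and next
// "IdsAndMessages" arrays (option one from above)
private static bool ContainSameMessageData(
  Tuple<int, MessageEditState>[] x,
  Tuple<int, MessageEditState>[] y)
{
  if (x == y)
    return true;
  if ((x == null) || (y == null) || (x.Length != y.Length))
    return false;
  for (var i = 0; i < x.Length; i++)
  {
    if ((x[i].Item1 != y[i].Item1)
      || (x[i].Item2.Content != y[i].Item2.Content)
      || (x[i].Item2.Author != y[i].Item2.Author))
      return false;
  }
  return true;
}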

Immutability to the rescue

There is actually another option to consider, the object models for the props data could be rewritten to use immutable types. These have the advantage that if you find that two references are equal then they are guaranteed to contain the same data. They also have the advantage that it's much more common to re-use instances to describe the same data - partly because there is some overhead to initialising immutable types and partly because there is no fear that "if I give this reference to this function, I want to be sure that it can't change the data in my reference while doing its work" because it is impossible to change an immutable reference's data. (I've seen defensively-written code that clones mutable references that it passes into other functions, to be sure that no other code can change the data in the original reference - this is never required with immutable types).

Conveniently, I've recently written a library to use with Bridge.NET which I think makes creating and working with immutable types easier than C# makes it on its own. I wrote about it in "Friction-less immutable objects in Bridge (C# / JavaScript) applications" but the gist is that you re-write MessageEditState as:

// You need to pull in the "ProductiveRage.Immutable" NuGet package to use IAmImmutable
public class MessageEditState : IAmImmutable
{
  public MessageEditState(string content, string author)
  {
    this.CtorSet(_ => _.Content, content);
    this.CtorSet(_ => _.Author, author);
  }
  public string Content { get; private set; }
  public string Author { get; private set; }
}

It's still a little more verbose than the mutable version, admittedly, but I'm hoping to convince you that it's worth it (if you need convincing!) for the benefits that we'll get.

When you have an instance of this new MessageEditState class, if you need to change one of the properties, you don't have to call the constructor each time to get a new instance, you can use the "With" extension methods that may be called on any IAmImmutable instance - eg.

var updatedMessage = message.With(_ => _.Content, "New information");

This would mean that the change handlers from MessageRow could be altered from:

private void OnContentChange(string newContent)
{
  props.OnChange(new MessageEditState
  {
    Content = newContent,
    Author = props.Message.Author
  });
}
private void OnAuthorChange(string newAuthor)
{
  props.OnChange(new MessageEditState
  {
    Content = props.Message.Content,
    Author = newAuthor
  });
}

and replaced with:

private void OnContentChange(string newContent)
{
  props.OnChange(props.Message.With(_ => _.Content, newContent));
}
private void OnAuthorChange(string newAuthor)
{
  props.OnChange(props.Message.With(_ => _.Author, newAuthor));
}

Immediately, the verbosity added to MessageEditState is being offset with tidier code! (And it's nice not having to set both "Content" and "Author" when only changing one of them).

The "With" method also has a small trick up its sleeve in that it won't return a new instance if the new property value is the same as the old property value. This is an eventuality that could happen in the code above as an "Input" element rendered by React will raise an "OnChange" event for any action that might have altered the text input's content. For example, if you had a text box with the value "Hello" in it and you selected all of that text and then pasted in text from the clipboard over the top of it, if the clipboard text was also "Hello" then the "OnChange" event will be raised, even though the actual value has not changed (it was "Hello" before and it's still "Hello" now). The "With" method will deal with this, though, and just pass the same instance straight back out. This is an illustration of the "reuse of instances for unchanged data" theme that I alluded to above.

The next step would be to change the array type in the MessageList.Props type from

public Tuple<int, MessageEditState>[] IdsAndMessages;

to

public NonNullList<Tuple<int, MessageEditState>> IdsAndMessages;

The NonNullList class is also in the ProductiveRage.Immutable NuGet package. It's basically an immutable IEnumerable that may be used in Bridge.NET projects. A simple example of it in use is:

// Create a new set of values (the static "Of" method uses type inference to determine
// the type of "T" in the returned "NonNullList<T>" - since 1, 2 and 3 are all ints, the
// "numbers" reference will be of type "NonNullList<int>")
var numbers = NonNullList.Of(1, 2, 3);

// SetValue takes an index and a new value, so calling SetValue(2, 4) on a set
// containing 1, 2, 3 will return a new set containing the values 1, 2, 4
numbers = numbers.SetValue(2, 4);

// Calling SetValue(2, 4) on a set containing values 1, 2, 4 does not require any
// changes, so the input reference is passed straight back out
numbers = numbers.SetValue(2, 4);

As with IAmImmutable instances we get two big benefits - we can rely on reference equality comparisons more often, since the data within any given reference can never change, and references will be reused in many cases if operations are requested that would not actually change the data. (It's worth noting that the guarantees fall apart if any property on an IAmImmutable reference is of a mutable type - similarly if a NonNullList has elements that are of a mutable type, or that have nested properties that are of a mutable type.. but so long as immutability is used "all the way down" then all will be well).

If this philosophy was followed, then suddenly the "ShouldComponentUpdate" implementation for the MessageList component would be very easy to write - just perform reference equality comparisons on the "IdsAndMessages" and "OnChange" values on the current props and on the nextProps. While solving the problem for the general case is very difficult, solving it when you introduce some constraints (such as the use of immutable and persistent data types) can be very easy!
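For illustration, the entire check collapses to two reference comparisons (a hypothetical helper, not the bindings' actual API):

// With immutable / persistent types, reference equality is all that's needed
private static bool ShouldMessageListUpdate(MessageList.Props current, MessageList.Props next)
{
  return !ReferenceEquals(current.IdsAndMessages, next.IdsAndMessages)
    || !ReferenceEquals(current.OnChange, next.OnChange);
}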

If we did implement this MessageList "ShouldComponentUpdate" method, then we could be confident that when a user makes changes to the "Title" text input that the Virtual DOM would not have to work out whether the MessageList or any of its child components had changed - because we'd have told the Virtual DOM that they hadn't (because the "IdsAndMessages" and "OnChange" property references wouldn't have changed).

We could take this a step further, though, and consider the idea of implementing "ShouldComponentUpdate" on other components - such as MessageRow. If the user edits a text value within one row, then the MessageList will have to perform some re-rendering work, since one of its child components needs to be re-rendered. But there's no need for any of the other rows to re-render, it could be just the single row in which the change was requested by the user.

So the MessageRow could look at its props values and, if they haven't changed between the current props and the nextProps, then inform React (via "ShouldComponentUpdate") that no re-render is required.

And why not go even further and just do this on all Presentation Components? The TextInput could avoid the re-render of its child Input if the props' "Content" and "OnChange" references are not being updated.

Introducing the Bridge.React "PureComponent"

To make this easy, I've added a new base class to the React bindings (available in 1.4 of Bridge.React); the PureComponent<TProps>.

This, like the StatelessComponent<TProps>, is very simple and does not support state and only allows the "Render" method to be implemented - no other React lifecycle functions (such as "ComponentWillMount", "ShouldComponentUpdate", etc..) may be defined on components deriving from this class.

The key difference is that it has its own "ShouldComponentUpdate" implementation that presumes that the props data is immutable and basically does what I've been describing above automatically - when React checks "ShouldComponentUpdate", it will look at the "props" and "nextProps" instances and compare their property values. (It also deals with the cases where one or both of them are null, in case you want components whose props reference is optional).

This is not an original idea, by a long shot. I first became aware of people doing this in 2013 when I read The Future of JavaScript MVC Frameworks, which was talking about using ClojureScript and its React interface "Om". More recently, I was reading Performance Engineering with React (Part 1), which talks about roughly the same subject but with vanilla JavaScript. And, of course, Facebook has long had its PureRenderMixin - though mixins can't be used with ES6 components (which seems to be the approach to writing components that Facebook is pushing at the moment).

So, this is largely just making it easy to do this when writing React applications with Bridge. However, using Bridge to do this does give us some extra advantages (on top of the joy of being able to write React apps in C#!). In the code earlier (from the MessageRow Render method) -

new TextInput(new TextInput.Props
{
  Content = props.Message.Content,
  OnChange = OnContentChange
})

Bridge will bind the "OnContentChange" method to the current MessageRow instance so that when it is called by the TextInput's "OnChange" event, "this" is the MessageRow and not the TextInput (which is important because OnContentChange needs to access the "props" reference scoped to the MessageRow).

This introduces a potential wrinkle in our plan, though, as this binding process creates a new JavaScript method each time and means that each time the TextInput is rendered, the "OnChange" reference is new. So if we try to perform simple reference equality checks on props values, then we won't find the current "OnChange" and the new "OnChange" to be the same.

This problem is mentioned in the "Performance Engineering" article I linked above:

Unfortunately, each call to Function.bind produces a new function.. No amount of prop checking will help, and your component will always re-render.

..

The simplest solution we've found is to pass the unbound function.

When using Bridge, we don't have the option of using an unbound function since the function-binding is automatically introduced by the C#-to-JavaScript translation process. And it's very convenient, so it's not something that I'd ideally like to have to work around.

Having a dig through Bridge's source code, though, revealed some useful information. When Bridge.fn.bind is called, it returns a new function (as just discussed).. but with some metadata attached to it. When it returns a new function, it sets two properties on it: "$scope" and "$method". The $scope reference is what "this" will be set to when the bound function is called and the $method reference is the original function that is being bound. This means that, when the props value comparisons are performed, if a value is a function and the reference equality comparison fails, a fallback approach may be attempted - if both functions have $scope and $method references defined then compare them and, if they are both consistent between the function value on the current props and the function value on the nextProps, then consider the value to be unchanged.

The PureComponent's "ShouldComponentUpdate" implementation deals with this automatically, so you don't have to worry about it.

It's possibly worth noting that the "Performance Engineering" post did briefly consider something similar -

Another possibility we've explored is using a custom bind function that stores metadata on the function itself, which in combination with a more advanced check function, could detect bound functions that haven't actually changed.

Considering that Bridge automatically includes this additional metadata, it seemed to me to be sensible to use it.

There's one other equality comparison that is supported; as well as simple referential equality and the function equality gymnastics described above, if both of the values are non-null and the first has an "Equals" function then this function will be considered. This means that any custom "Equals" implementations that you define on classes will be automatically taken into consideration by the PureComponent's logic.
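Pulling those rules together, the per-value comparison is conceptually something like the following - a simplified sketch, not the bindings' actual code (the real version reads the metadata from the JavaScript function objects directly):

// A simplified sketch of the comparison cascade for a single props value
private static bool AreValuesConsistent(object x, object y)
{
  // 1. The same reference always means the same data
  if (ReferenceEquals(x, y))
    return true;
  if ((x == null) || (y == null))
    return false;
  // 2. (Not shown here) If both values are Bridge-bound functions then their
  //    "$scope" and "$method" metadata would be compared by reference
  // 3. Otherwise, fall back to any custom "Equals" implementation on the type
  return x.Equals(y);
}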

Another Bridge.NET bonus: Lambda support

When I started writing this post, there was going to be a section here with a warning about using lambdas as functions in props instances, rather than using named functions (which the examples thus far have done).

As with bound functions, anywhere that an anonymous function is present in JavaScript, it will result in a new function value being created. If, for example, we change the MessageRow class from:

public class MessageRow : PureComponent<MessageRow.Props>
{
  public MessageRow(Props props) : base(props) { }

  public override ReactElement Render()
  {
    return DOM.Div(new Attributes { ClassName = "message-row" },
      new TextInput(new TextInput.Props
      {
        Content = props.Message.Content,
        OnChange = OnContentChange
      }),
      new TextInput(new TextInput.Props
      {
        Content = props.Message.Author,
        OnChange = OnAuthorChange
      })
    );
  }

  private void OnContentChange(string newContent)
  {
    props.OnChange(props.Message.With(_ => _.Content, newContent));
  }
  private void OnAuthorChange(string newAuthor)
  {
    props.OnChange(props.Message.With(_ => _.Author, newAuthor));
  }

  public class Props
  {
    public int Key;
    public MessageEditState Message;
    public Action<MessageEditState> OnChange;
  }
}

to:

public class MessageRow : PureComponent<MessageRow.Props>
{
  public MessageRow(Props props) : base(props) { }

  public override ReactElement Render()
  {
    return DOM.Div(new Attributes { ClassName = "message-row" },
      new TextInput(new TextInput.Props
      {
        Content = props.Message.Content,
        OnChange = newContent =>
          props.OnChange(props.Message.With(_ => _.Content, newContent))
      }),
      new TextInput(new TextInput.Props
      {
        Content = props.Message.Author,
        OnChange = newAuthor =>
          props.OnChange(props.Message.With(_ => _.Author, newAuthor))
      })
    );
  }

  public class Props
  {
    public int Key;
    public MessageEditState Message;
    public Action<MessageEditState> OnChange;
  }
}

then there would be problems with the "OnChange" props values specified because each new lambda - eg..

OnChange = newContent =>
  props.OnChange(props.Message.With(_ => _.Content, newContent))

would result in a new JavaScript function being passed to Bridge.fn.bind every time that it was called:

onChange: Bridge.fn.bind(this, function (newContent) {
  this.getprops().onChange(
    ProductiveRage.Immutable.ImmutabilityHelpers.$with(
      this.getprops().message,
      function (_) { return _.getContent(); },
      newContent
    )
  );
})

And this would prevent the PureComponent's "ShouldComponentUpdate" logic from being effective, since the $method values from the current props "OnChange" and the nextProps "OnChange" bound functions would always be different.

I was quite disappointed when I realised this and was considering trying to come up with some sort of workaround - maybe calling "toString" on both $method values and comparing their implementations.. but I couldn't find definitive information about the performance implications of this and I wasn't looking forward to constructing my own suite of tests to investigate any potential performance impact of this across different browsers and different browser versions.

My disappointment was two-fold: firstly, using the lambdas allows for more succinct code and less syntactic noise - since the types of the lambda's argument(s) and return value (if any) are inferred, rather than having to be explicitly typed out.

newContent => props.OnChange(props.Message.With(_ => _.Content, newContent))

is clearly shorter than

private void OnContentChange(string newContent)
{
  props.OnChange(props.Message.With(_ => _.Content, newContent));
}

The other reason that I was deflated upon realising this was that it meant that the "ShouldComponentUpdate" implementation would, essentially, silently fail for components that used lambdas - "ShouldComponentUpdate" would return true in cases where I would like it to return false. There would be no compiler error and the UI code would still function, but it wouldn't be as efficient as it could be (the Virtual DOM would have to do more work than necessary).

Instead, I had a bit of a crazy thought.. lambdas like this, that only need to access their own arguments and the "this" reference, could be "lifted" into named functions quite easily. Essentially, I'm doing this manually by writing methods such as "OnContentChange". But could the Bridge translator do something like this automatically - take those C# lambdas and convert them into named functions in JavaScript? That way, I would get the benefit of the succinct lambda format in C# and the PureComponent optimisations would work.

Well, once again the Bridge.NET Team came through for me! I raised a Feature Request about this, explained what I'd like in an ideal world (and why) and five days later there was a branch on GitHub where I could preview changes that did precisely what I wanted!

This is not just an example of fantastic support from the Bridge Team, it is also, I believe, an incredible feature for Bridge and a triumph for writing front-end code in C#! Having this "translation step" from C# to JavaScript provides the opportunity for handy features to be included for free - earlier we saw how the insertion of Bridge.fn.bind calls by the translator meant that we had access to $method and $scope metadata (which side-steps one of the problems that the author of Performance Engineering with React ran into) but, here, the translation step can remove the performance overhead that anonymous functions were going to cause for our "ShouldComponentUpdate" implementation, without there being any burden on the developer writing the C# code.

It's also worth considering the fact that every allocation made in JavaScript is a reference that needs to be tidied up by the browser's garbage collector at some point. A big reason why judicious use of "ShouldComponentUpdate" can make UIs faster is that there is less work for the Virtual DOM to do, but it also eases the load on the garbage collector because none of the memory allocations need to be made for child components of components that do not need to be re-rendered. Since anonymous JavaScript functions are created over and over again (every time that the section of code that declares the anonymous function is executed), lifting them into named functions means that there will be fewer allocations in your SPA and hence even less work for the garbage collector to do.

Note: As of the 11th of February 2016, this Bridge.NET improvement has not yet been made live - but their release cycles tend to be fairly short and so I don't imagine that it will be very long until it is included in an official release. If you were desperate to write any code with PureComponent before then, you could either avoid lambdas in your C# code or you could use lambdas now, knowing that the PureComponent won't be giving you the full benefit immediately - but that you WILL get the full benefit when the Bridge Team release the update.

So it's an unequivocal success then??

Well, until it transpired that the Bridge translator would be altered to convert these sorts of lambdas into named functions, I was going to say "this is good, but..". However, with that change in sight, I'm just going to say outright "yes, and I'm going to change all classes that derive from StatelessComponent in my projects to derive from PureComponent". This will work fine, so long as your props references are all immutable (meaning that they are immutable all the way down - you shouldn't have, say, a props property that is an immutable NonNullList of references, but where those references have mutable properties).

And, if you're not using immutable props types - sort yourself out! While a component is being rendered (according to the Facebook React Tutorial):

props are immutable: they are passed from the parent and are "owned" by the parent

So, rather than having props only be immutable during component renders (by a convention that the React library enforces), why not go whole-hog and use fully immutable classes to describe your props types - that way props are fully immutable and you can use the Bridge.React's PureComponent to get performance boosts for free!

(Now seems like a good time to remind you of my post "Friction-less immutable objects in Bridge (C# / JavaScript) applications", which illustrates how to use the ProductiveRage.Immutable NuGet package to make defining immutable classes just that bit easier).

Posted at 20:11


Friction-less immutable objects in Bridge (C# / JavaScript) applications

One of the posts that I've written that got the most "audience participation*" was one from last year "Implementing F#-inspired 'with' updates for immutable classes in C#", where I spoke about trying to ease the burden in C# of representing data with immutable classes by reducing the repetitive typing involved.

* (I'd mis-remembered receiving more criticism about this than I actually did - now that I look back at the comments left on the post and on reddit, the conversations and observations are pretty interesting and largely constructive!)

The gist was that if I wanted to have a class that represented, for the sake of a convoluted example, an employee that had a name, a start-of-employment date and some notes (which are optional and so may not be populated) then we might have something like the following:

public class EmployeeDetails
{
  public EmployeeDetails(string name, DateTime startDate, string notesIfAny)
  {
    if (string.IsNullOrWhiteSpace(name))
      throw new ArgumentException("name");

    Name = name.Trim();
    StartDate = startDate;
    NotesIfAny = (notesIfAny == null) ? null : notesIfAny.Trim();
  }

  /// <summary>
  /// This will never be null or blank, it will not have any leading or trailing whitespace
  /// </summary>
  public string Name { get; private set; }

  public DateTime StartDate { get; private set; }

  /// <summary>
  /// This will be null if it has no value, otherwise it will be a non-blank string with no
  /// leading or trailing whitespace
  /// </summary>
  public string NotesIfAny { get; private set; }
}

If we wanted to update a record with some notes where previously it had none then we'd need to create a new instance, something like:

var updatedEmployee = new EmployeeDetails(
  employee.Name,
  employee.StartDate,
  "Awesome attitude!"
);

This sort of thing (calling the constructor explicitly) gets old quickly, particularly if the class gets extended in the future since then anywhere that did something like this would have to add more arguments to the constructor call.

So an alternative is to include "With" functions in the class -

public class EmployeeDetails
{
  public EmployeeDetails(string name, DateTime startDate, string notesIfAny)
  {
    if (string.IsNullOrWhiteSpace(name))
      throw new ArgumentException("name");

    Name = name.Trim();
    StartDate = startDate;
    NotesIfAny = (notesIfAny == null) ? null : notesIfAny.Trim();
  }

  /// <summary>
  /// This will never be null or blank, it will not have any leading or trailing whitespace
  /// </summary>
  public string Name { get; private set; }

  public DateTime StartDate { get; private set; }

  /// <summary>
  /// This will be null if it has no value, otherwise it will be a non-blank string with no
  /// leading or trailing whitespace
  /// </summary>
  public string NotesIfAny { get; private set; }

  public EmployeeDetails WithName(string value)
  {
      return (value == Name) ? this : new EmployeeDetails(value, StartDate, NotesIfAny);
  }
  public EmployeeDetails WithStartDate(DateTime value)
  {
      return (value == StartDate) ? this : new EmployeeDetails(Name, value, NotesIfAny);
  }
  public EmployeeDetails WithNotesIfAny(string value)
  {
      return (value == NotesIfAny) ? this : new EmployeeDetails(Name, StartDate, value);
  }
}

Now the update code is more succinct -

var updatedEmployee = employee.WithNotesIfAny("Awesome attitude!");

Another benefit of this approach is that the With functions can include a quick check to ensure that the new value is not the same as the current value - if it is then there's no need to generate a new instance, the current instance can be returned straight back out. This saves generating a new object reference and it makes it easier to rely upon simple reference equality tests when determining whether data has changed - eg.

var updatedEmployee = employee.WithNotesIfAny("Awesome attitude!");
var didEmployeeAlreadyHaveAwesomeAttitude = (updatedEmployee == employee);

Last year's post went off on a few wild tangents but basically was about allowing something like the following to be written:

public class EmployeeDetails
{
  public EmployeeDetails(string name, DateTime startDate, string notesIfAny)
  {
    if (string.IsNullOrWhiteSpace(name))
      throw new ArgumentException("name");

    Name = name.Trim();
    StartDate = startDate;
    NotesIfAny = (notesIfAny == null) ? null : notesIfAny.Trim();
  }

  /// <summary>
  /// This will never be null or blank, it will not have any leading or trailing whitespace
  /// </summary>
  public string Name { get; private set; }

  public DateTime StartDate { get; private set; }

  /// <summary>
  /// This will be null if it has no value, otherwise it will be a non-blank string with no
  /// leading or trailing whitespace
  /// </summary>
  public string NotesIfAny { get; private set; }

  public EmployeeDetails With(
    Optional<string> name = new Optional<string>(),
    Optional<DateTime> startDate = new Optional<DateTime>(),
    Optional<Optional<string>> notesIfAny = new Optional<Optional<string>>())
  {
    return DefaultUpdateWithHelper.GetGenerator<EmployeeDetails>()(
      this, name, startDate, notesIfAny
    );
  }
}

It allowed you to include a single "With" function that could change one or more of the properties in a single call like this:

var updatedEmployee = employee.With(name: "Jimbo", notesIfAny: "So lazy!");

And it would do it with magic, so you wouldn't have to write all of the "return-same-instance-if-no-values-changed" logic and it would.. erm.. well, to be honest, I've forgotten some of the finer details! But I remember that it was fun messing around with, getting my hands dirty with reflection, compiled LINQ expressions, stack-trace-sniffing (and some required JIT-method-inlining-disabling).

A couple of weeks later I wrote a follow-up, taking on board some of the feedback and criticisms in the comments and doing some performance testing. One of the ways I came up with to create the "magic With method" was only twice as slow as writing it by hand.. which, now, doesn't sound all that awesome - twice as slow is often a bad thing, but I was quite proud of it at the time!

Immutable Classes in 2015

Recently, I've been making Bridge.NET applications and I've been favouring writing immutable types for the messages passed around the system. And, again, I've gotten a bit bored of writing the same lines of code over and over -

if (value == null)
  throw new ArgumentNullException("value");

and

/// <summary>
/// This will never be null
/// </summary>

and

/// <summary>
/// This will never be null, nor blank, nor have any leading or trailing whitespace
/// </summary>

Never mind contemplating writing all of those "WithName", "WithStartDate" methods (checking in them that the values have actually changed and returning the same instance back out if not). I love the benefits of having these immutable types (reducing places where it's possible for state to change makes reasoning about code soooooooo much easier) but I'm getting tired of banging out the same constructs and sentences!

So I've started on a new tack. I want to find those places where repetition is getting me down and I want to reduce it as much as possible. But I don't want to sacrifice my validation checks or the guarantees of immutability. And, again, to put this into context - I'm going to be concentrating on the classes that I write in C# but that Bridge.NET then translates into JavaScript, so there are different considerations to take into account. First of which being that Bridge doesn't support reflection and so none of the crazy stuff I was doing in "pure" C# will be possible! Not in the same way that I wrote it last time, at least..

Before I get into any silly stuff, though, I want to talk about a couple of techniques that hopefully aren't too controversial and that I think have improved my code as well as requiring me to type less.

First off is a variation of the "Optional" struct that I used in my C# library last time. Previously, as illustrated in the code at the top of this post, I was relying on argument names and comments to indicate when values may and may not be null. The "Name" property has a comment saying that it will not be null while the "NotesIfAny" property has a comment saying that it might be null - and it follows a convention of having an "IfAny" suffix, which suggests that it might not always have a value.

Instead, I want to move to assuming that all references are non-null and that values that may be null have their type wrapped in an Optional struct.

This would change the EmployeeDetails example to look like this:

public class EmployeeDetails
{
  public EmployeeDetails(string name, DateTime startDate, Optional<string> notes)
  {
    if (string.IsNullOrWhiteSpace(name))
      throw new ArgumentException("name");
    Name = name.Trim();
    StartDate = startDate;
    Notes = !notes.IsDefined || (notes.Value.Trim() == "") ? null : notes.Value.Trim();
  }
  public string Name { get; private set; }
  public DateTime StartDate { get; private set; }
  public Optional<string> Notes { get; private set; }
}

The "IfAny" suffix is gone, along with all of the comments about null / non-null. Now the type system indicates whether a value may be null (in which case it will be wrapped in an Optional) or not.

(Note: I'll talk more about Optional later in this post - there's nothing too revolutionary or surprising in there, but it will distract from what I'm trying to build up to here).
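(For orientation, though, its essential shape is just a value plus an is-defined flag - a minimal sketch, not the library's full implementation:)

public struct Optional<T>
{
  private readonly T _value;
  private readonly bool _isDefined;
  public Optional(T value)
  {
    _value = value;
    _isDefined = (value != null);
  }

  public bool IsDefined { get { return _isDefined; } }

  public T Value
  {
    get
    {
      if (!_isDefined)
        throw new InvalidOperationException("No value is defined");
      return _value;
    }
  }

  // Allow a T to be used anywhere that an Optional<T> is required (a null
  // reference becomes a "missing" value)
  public static implicit operator Optional<T>(T value)
  {
    return new Optional<T>(value);
  }
}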

We have lost something, though, because in the earlier code the Name and Notes fields had comments that stated that the values (if non-null, in the case of Notes) would not be blank and would not have any leading or trailing whitespace. This information is no longer included in comments, because I want to lose the comments. But, if I've solved the null / non-null problem by leveraging the type system, why not do the same with the non-blank-trimmed strings?

Introducing..

public class NonBlankTrimmedString
{
  public NonBlankTrimmedString(string value)
  {
    if (string.IsNullOrWhiteSpace(value))
      throw new ArgumentException("Must be non-null and have some non-whitespace content");
    Value = value.Trim();
  }

  /// <summary>
  /// This will never have any leading or trailing whitespace, it will never be blank
  /// </summary>
  public string Value { get; private set; }

  public static implicit operator NonBlankTrimmedString(string value)
  {
    return new NonBlankTrimmedString(value);
  }
  public static implicit operator string(NonBlankTrimmedString value)
  {
    return value.Value;
  }
}

Ok, so it looks like the comments are back.. but the idea is that the "will never have any leading or trailing whitespace, it will never be blank" need only appear once (in this class) and not for every property that should be non-null and non-blank and not-have-any-leading-or-trailing-whitespace.
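As a side benefit, the implicit operators mean that no explicit casting is needed in either direction - eg.

NonBlankTrimmedString name = "John Smith"; // Throws if the string is null or blank
string rawValue = name;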

Now the EmployeeDetails class can become:

public class EmployeeDetails
{
  public EmployeeDetails(
    NonBlankTrimmedString name,
    DateTime startDate,
    Optional<NonBlankTrimmedString> notes)
  {
    if (name == null)
      throw new ArgumentNullException("name");
    Name = name;
    StartDate = startDate;
    Notes = notes;
  }
  public NonBlankTrimmedString Name { get; private set; }
  public DateTime StartDate { get; private set; }
  public Optional<NonBlankTrimmedString> Notes { get; private set; }
}

This looks a lot better. Not only is there less to read (and less repetitive code and comments to write) but the same information is still available to anyone reading / using the code. In fact, I think that it's better on that front now because the constructor signature and the property types themselves communicate this information - which makes it harder to ignore than a comment. And the type system is the primary reason that I want to write my front-end applications in C# rather than JavaScript!

However, there are still a couple of things that I'm not happy with. Firstly, in an ideal world, the constructors would magically have if-null-then-throw conditions injected for every argument - there are no arguments that should be null now; Optional is a struct and so can never be null, while any references that could be null should be wrapped in an Optional. One way to achieve this in regular C# is with IL rewriting but I'm not a huge fan of that because I have suspicions about PostSharp (that I should probably revisit one day because I'm no longer completely sure what grounds they're based on). But, aside from that, IL rewriting would be no use when writing C# for Bridge anyway, since IL doesn't come into the process - the C# source code is translated directly into JavaScript!

Secondly, I need to tackle the "With" function(s) and I'd like to make that as painless as possible, really. Writing them all by hand is tedious.

Get to the point, already!

So.. I've been playing around and I've written a Bridge.NET library that allows me to write something like this:

public class EmployeeDetails : IAmImmutable
{
  public EmployeeDetails(
    NonBlankTrimmedString name,
    DateTime startDate,
    Optional<NonBlankTrimmedString> notes)
  {
    this.CtorSet(_ => _.Name, name);
    this.CtorSet(_ => _.StartDate, startDate);
    this.CtorSet(_ => _.Notes, notes);
  }
  public NonBlankTrimmedString Name { get; private set; }
  public DateTime StartDate { get; private set; }
  public Optional<NonBlankTrimmedString> Notes { get; private set; }
}

Which is not too bad! Unfortunately, yes, there is some duplication still - there are three places that each of the properties is mentioned; in the constructor argument list, in the constructor body and as public properties. However, I think that this is the bare minimum number of times that they could be repeated without sacrificing any type guarantees. The constructor has to accept a typed argument list and it has to somehow map them onto properties. The properties have to repeat the types so that anyone accessing those property values knows what they're getting.

But let's talk about the positive things, rather than the negative (such as the fact that while the format shown above is fairly minimal, it's still marginally more complicated in appearance than a simple mutable type). Actually.. maybe we should first talk about the weird things - like what is this "CtorSet" method?

"CtorSet" is an extension method that sets a specified property on the target instance to a particular value. It has the following signature:

public static void CtorSet<T, TPropertyValue>(
  this T source,
  Func<T, TPropertyValue> propertyIdentifier,
  TPropertyValue value)
    where T : IAmImmutable

It doesn't just set it, though, it ensures that the value is not null first and throws an ArgumentNullException if it is. This allows me to avoid the repetitive and boring if-null-then-throw statements. I don't need to worry about cases where I do want to allow nulls, though, because I would use an Optional type in that case, which is a struct and so never can be null!

The method signature ensures that the type of the value is consistent with the type of the target property. If not, then the code won't compile. I always favour static type checking where possible - it means that there's no chance that a mistake you make will only reveal itself when a particular set of conditions is met (ie. when a particular code path is executed) at runtime - instead the error is right in your face in the IDE, not even letting you try to run it!
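To illustrate with a hypothetical mistake - given the EmployeeDetails class above, the following would be rejected by the compiler because the value's type doesn't match the property's type:

// Compile error - "StartDate" is a DateTime, so a string value is not acceptable
this.CtorSet(_ => _.StartDate, "3rd September 2014");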

Which makes the next part somewhat unfortunate. The "propertyIdentifier" must be:

  1. A simple lambda expression..
  2. .. that identifies a property getter which has a corresponding setter (though it's fine for that setter to be private)..
  3. .. where neither the getter nor setter have a Bridge [Name] / [Template] / [Ignore] / etc.. attribute on it..

If any of these conditions are not met then the "CtorSet" method will throw an exception. But you might not find out until runtime because C#'s type system is not strong enough to describe all of these requirements.

The good news, though, is that while the C# type system itself isn't powerful enough, with Visual Studio 2015 it's possible to write a Roslyn Analyser that can pick up any invalid "propertyIdentifier" before run time, so that errors will be thrown right in your face without you ever executing the code. The even better news is that such an analyser is included in the NuGet package! But let's not get ahead of ourselves, let me finish describing what this new method actually does first..

If it's not apparent from looking at the example code above, "CtorSet" is doing some magic. It's doing some basic sort of reflection in JavaScript to work out how to identify and set the target property. Bridge won't support reflection until v2 but my code does an approximation where it sniffs about in the JavaScript representation of the "propertyIdentifier" and gets a hold of the setter. Once it has done the work to identify the setter for a given "T" and "propertyIdentifier" combination, it saves it away in an internal cache - while we can't control and performance-tune JavaScript in quite the same way that we can with the CLR, it doesn't mean that we should do the same potentially-complicated work over and over again if we don't need to!

Another thing to note, if you haven't already spotted it: "CtorSet" will call private setters. This has the potential to be disastrous, if it could be called without restrictions, since it could change the data on types that should be able to give the appearance of immutability (ie. classes that set their state in their constructor and have private-only setters.. the pedantic may wish to argue that classes with private setters should not be considered strictly immutable because private functions could change those property values, but it's entirely possible to have classes that have the attribute of being Observational Immutability in this manner, and that's all I'm really interested in).

So there are two fail-safes built in. Firstly, the type constraint on "CtorSet" means that the target must implement the IAmImmutable interface. This is completely empty and so there is no burden on the class that implements it - it merely exists as a marker to indicate that the type should be allowed to work with "CtorSet".

The second protection is that once "CtorSet" has been called for a particular target instance and a particular property, that property's value is "locked" - meaning that a subsequent call to "CtorSet" for the same instance and property will result in an exception being thrown. This prevents the situation from occurring where an EmployeeDetails is initialised using "CtorSet" in its constructor but then gets manipulated externally via further calls to "CtorSet" - since all of its properties are set in the constructor, no-one can change them later with another call to "CtorSet". (This is actually something else that is picked up by the analyser - "CtorSet" may only be called from within a constructor - so if you're using this library within Visual Studio 2015 then you needn't worry about "CtorSet" being called from elsewhere, but if you're not using VS 2015 then this extra runtime protection may be reassuring).
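To make that concrete, this is the sort of thing that the runtime protection guards against ("someEmployee" here being an already-constructed EmployeeDetails instance - and note that the precise exception message may vary):

// Throws at runtime since "Name" was already set by "CtorSet" in the constructor
// (and, in VS 2015, the analyser would flag this line before the code was ever run)
someEmployee.CtorSet(_ => _.Name, new NonBlankTrimmedString("Jane Smith"));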

Now that "CtorSet" is explained, I can get to the next good bit. I have another extension method:

public static T With<T, TPropertyValue>(
  this T source,
  Func<T, TPropertyValue> propertyIdentifier,
  TPropertyValue value)
    where T : IAmImmutable

This works in a similar manner to "CtorSet" but, instead of setting a property value on the current instance, it will clone the target instance, update the property on the clone and then return it. Unless the new property value is the same as the current one, in which case this work will be bypassed and the current instance will be returned unaltered. As with "CtorSet", null values are not allowed and will result in an ArgumentNullException being thrown.

With this method, having specific "With" methods on classes is not required. Continuing with the EmployeeDetails class from the example above, if we have:

var employee = new EmployeeDetails(
  "John Smith",
  new DateTime(2014, 9, 3),
  null
);

.. and we then discover that his start date was recorded incorrectly, then this instance of the record could be replaced by calling:

employee = employee.With(_ => _.StartDate, new DateTime(2014, 9, 2));

And, just to illustrate that if-value-is-the-same-return-instance-immediately logic, if we then did the following:

var employeeUpdatedAgain = employee.With(_ => _.StartDate, new DateTime(2014, 9, 2));

.. then we could use referential equality to determine whether any change was made -

// This will be false because the "With" call specified a StartDate value that was
// the same as the StartDate value that the employee reference already had
var wasAnyChangeMade = (employeeUpdatedAgain != employee);

Bonus features

So, in this library, there are the "CtorSet" and "With" extension methods and there is an Optional type -

public struct Optional<T> : IEquatable<Optional<T>>
{
  public static Optional<T> Missing { get; }

  public bool IsDefined { get; }
  public T Value { get; }
  public T GetValueOrDefault(T defaultValue);

  public static implicit operator Optional<T>(T value);
}

This has a convenience static function -

public static class Optional
{
  public static Optional<T> For<T>(T value)
  {
    return value;
  }
}

.. which makes it easier any time that you explicitly need to create an Optional<> wrapper for a value - it lets you take advantage of C#'s type inference so that you don't have to write out the type name yourself. For example, instead of writing something like

DoSomething(new Optional<string>("Hello!"));

.. you could just write

DoSomething(Optional.For("Hello!"));

.. and type inference will know that the Optional's type is a string.

However, this is often unnecessary due to Optional's implicit operator from "T" to Optional<T>. If you have a function

public void DoSomething(Optional<string> value)
{
  // Do.. SOMETHING
}
.. then you can call it with any of the following:

// The really-explicit way
DoSomething(new Optional<string>("Hello!"));

// The rely-on-type-inference way
DoSomething(Optional.For("Hello!"));

// The rely-on-Optional's-implicit-operator way
DoSomething("Hello!");

There is also an immutable collection type: the NonNullList<T>. This has a very basic interface -

public sealed class NonNullList<T> : IEnumerable<T>
{
  public static NonNullList<T> Empty { get; }

  public int Count { get; }
  public T this[int index] { get; }
  public NonNullList<T> SetValue(int index, T value);
  public NonNullList<T> Insert(T item);
}

.. and it comes with a similar convenience static function -

public static class NonNullList
{
  public static NonNullList<T> Of<T>(params T[] values);
}

The reason for this type is that it's so common to need collections of values but there is nothing immediately available in Bridge that allows me to do this while maintaining guarantees about non-null values and immutability.

I thought about using the Facebook Immutable-Js library but..

  1. It's a further dependency
  2. I really wanted to continue the do-not-allow-null philosophy that I use with "CtorSet" and "With"

I actually considered calling the "NonNullList" type the "NonNullImmutableList" but "NonNull" felt redundant when I was trying to encourage non-null-by-default and "Immutable" felt redundant since immutability is what this library is for. So that left me with List<T> and that's already used! So I went with simply NonNullList<T>.

Immutable lists like this are commonly written using linked lists since, if the nodes are immutable, then sections of the list can often be shared between multiple lists - so, if you have a list with three items in it and you call "Insert" to create a new list with four items (the new item becoming the first item in the linked list), then the following three items will be the same three node instances that existed in the original list. This reuse of data is a way to make immutable types more efficient than the naive copy-the-entire-list-and-then-manipulate-the-new-version approach would be. I'm 99% sure that this is what the Facebook library uses for the simple list type and it's something I wrote about doing in C# a few years ago if you want to read more (see "Persistent Immutable Lists").
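As a minimal sketch of that node-sharing idea (purely an illustration, not the library's actual internals):

public sealed class Node<T>
{
  public Node(T value, Node<T> nextIfAny)
  {
    Value = value;
    NextIfAny = nextIfAny;
  }
  public T Value { get; private set; }
  public Node<T> NextIfAny { get; private set; }
}

// The four-item list reuses all three nodes from the three-item list
var threeItems = new Node<int>(1, new Node<int>(2, new Node<int>(3, null)));
var fourItems = new Node<int>(0, threeItems);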

The reason that I mention this is to try to explain why the NonNullList interface is so minimal - there are no Add, InsertAt, etc.. functions. The cheapest operations to perform on this structure are to add a new item at the start of the list and to iterate through the items from the start, so I started off with only those facilities. Then I added a getter (which is an O(n) operation, rather than the O(1) that you get with a standard array) and a setter (which is similarly O(n) in cost, compared to O(1) for an array) because they are useful in many situations. In the future I might expand this class to include more List-like functions, but I haven't for now.

Just to make this point clear one more time: NonNullList<T> functions will throw exceptions if null values are ever specified - all values should be non-null and the type of "T" should be an Optional if null values are required (in which case none of the actual elements of the set will be null since they will all be Optional instances and Optional is a struct).
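So, if a collection genuinely needs "missing" entries, then the element type should be an Optional - eg.

// None of these elements is a null reference but the middle one is a "missing" value
var itemsWithGaps = NonNullList.Of(
  Optional.For("First"),
  Optional<string>.Missing,
  Optional.For("Third")
);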

To make it easier to work with properties that are collections of items, there is another "With" method signature:

public static T With<T, TPropertyElement>(
  this T source,
  Func<T, NonNullList<TPropertyElement>> propertyIdentifier,
  int index,
  TPropertyElement value)
    where T : IAmImmutable

So, if you had a class like this -

public class Something : IAmImmutable
{
  public Something(int id, NonNullList<string> items)
  {
    this.CtorSet(_ => _.Id, id);
    this.CtorSet(_ => _.Items, items);
  }
  public int Id { get; private set; }
  public NonNullList<string> Items { get; private set; }
}

.. and an instance of one created with:

var s = new Something(1, NonNullList.Of("ZERO", "One"));

.. and you then wanted to change the casing of that second item, then you could do so with:

s = s.With(_ => _.Items, 1, "ONE");

If you specified an invalid index then it would fail at runtime, as it would if you tried to pass a null value. If you tried to specify a value that was of an incompatible type then you would get a compile error as the method signature ensures that the specified value matches the NonNullList<T>'s item type.

Getting hold of the library

If this has piqued your interest then you can get the library from NuGet - it's called "ProductiveRage.Immutable". It should work fine with Visual Studio 2013 but I would recommend that you use 2015, since then the analysers that are part of the NuGet package will be installed and enabled as well. The analysers confirm that every "property retriever" argument is always a simple lambda, such as

_ => _.Name

.. and ensure that "Name" is a property that both "CtorSet" and "With" are able to use in their manipulations*. If this is not the case, then you will get a descriptive error message explaining why.

* (For example, properties may not be used whose getter or setter has a Bridge [Name], [Template] or [Ignore] attribute attached to it).

One thing to be aware of when using Visual Studio 2015 with Bridge.NET, though, is that Bridge does not yet support C# 6 syntax. So don't get carried away with the wonderful new capabilities (like my beloved nameof). Support for this new syntax is, I believe, coming in Bridge v2..

If you want to look at the actual code then feel free to check it out at github.com/ProductiveRage/Bridge.Immutable. That's got the library code itself as well as the analysers and the unit tests for the analysers. It's the first time that I've tried to produce a full, polished analyser and I had fun! As well as a few speed bumps.. (possibly a topic for another day).

While the library, as delivered through the NuGet package, should work fine for both VS 2013 and VS 2015, building the solution yourself requires VS 2015 Update 1.

Is this proven and battle-hardened?

No.

At this point in time, this is mostly still a concept that I wanted to try out. I think that what I've got is reliable and quite nicely rounded - I've tried to break it and haven't been able to yet. And I intend to use it in some projects that I'm working on. However, for the time being, you might want to consider it somewhat experimental. Or you could just be brave and start using it all over the place to see if it fits in with your world view regarding how you should write C# :)

Is "IAmImmutable" really necessary?

If you've really been paying attention to all this, you might have noticed that I said earlier that the IAmImmutable interface is used to identify types that have been designed to work with "CtorSet", to ensure that you can't call "CtorSet" on references that weren't expecting it and whose should-be-private internals you could then meddle with. Well, it would be a reasonable question to ask:

Since there is an analyser to ensure that "CtorSet" is only called from within a constructor, surely IAmImmutable is unnecessary because it would not be possible to call "CtorSet" from places where it shouldn't be?

I have given this some thought and have decided (for now, at least) to stick with the IAmImmutable marker interface for two reasons:

  1. If you're writing code where the analyser is not being used (such as in Visual Studio versions before 2015) then it makes it harder to write code that could change private state where it should not be possible
  2. It avoids polluting the auto-complete matches, by preventing "CtorSet" and "With" from being offered against every type, even where they're not applicable (on the string class, for example)

The first point refers to the fallback defence mechanism that will not allow properties to have their value set more than once using "CtorSet" - attempting to do so will result in a runtime error. If a class has all of its properties set using "CtorSet" within its constructor then any external, subsequent "CtorSet" call will fail. Having to implement the IAmImmutable interface when writing immutable types hopefully acts as a reminder to do this. Without this extra protection (and without the analyser), your code could contain "CtorSet" calls that manipulate private state in classes that have no idea what's hit them!

Meanwhile, the second just feels like a good practice so that "CtorSet" and "With" don't crop up over and over again on types that you would not want to use them with.

If anyone really wanted the IAmImmutable-requirement to be relaxed (which would allow the immutable types to be written in an even more succinct manner, since they wouldn't need to implement that interface) then I would definitely be up for a debate.

Posted at 22:22


Locating TODO comments with Roslyn

I picked up an old project recently that I knew I'd made good progress on and that the bits that were finished were looking good.. but also I knew that it had TODO comments littered throughout it to remind me what I hadn't finished.

To get an idea just how many of these there were, I did a solution-wide search for "TODO" in Visual Studio. There were just over two hundred of them. The search results gave me a fair idea of where they were but I got it into my head that I wanted to export this into a list and then map them onto projects and - ideally - classes and methods. The first part is easy - the search results output contains the path to the file, which indicates the project name. The classes, too, could often be extracted from the filename - so long as there was only one class (or interface or enum or whatever) per file, though nested types would be awkward.

And this, really, would have been enough information to start tracking my progress and have a checklist that I could take satisfaction in crossing items off from. But of course I wanted more! Isn't this new* Roslyn thing supposed to be about parsing code, so shouldn't I be able to use it to find out what properties or methods the TODO comments I've found are associated with? And don't I sometimes need a break from genuinely productive work to play with something new and shiny under the pretence of doing something useful with it?? :)

* (Not that new, actually, seeing as it was announced for preview back in 2011).

The two sides of Roslyn

Roslyn is often talked about as enabling a "compiler as a service" - where code can be compiled and executed on-the-fly. So some sort of scripting engine could be created to dynamically change behaviour on already-executing code. Essentially, Roslyn can take source code (C# or VB) and generate IL, which can then be executed and interacted with by the application that fed that source code through it.

However, the other side of it is that it provides "rich code analysis APIs" (according to its page on MSDN) - meaning that it will help you examine the source code, even if you have no intention of executing that code. Which sounds exactly like what I need in order to locate my TODO comments within their containing method / property / type / namespace.

If I had more ambitious aims in mind then it could also be used for all manner of IDE extensions for code investigation, refactoring or "best practices analysis". A bit like many of the features that ReSharper provides (though ReSharper predates it, and woe betide anyone who asks if they are thinking of integrating with Roslyn so that they don't have to maintain as much parsing code of their own - Ask me again if ReSharper will use Roslyn.. I dare you).

To get started with Roslyn, you install it through NuGet - though, currently, it's marked as pre-release so it mightn't show up when you search for it. The best thing to do is follow the instructions on the NuGet package page and run

Install-Package Microsoft.CodeAnalysis -Pre

at the Package Manager Console.

With this done, parsing code is as easy as

var parsedContent = CSharpSyntaxTree.ParseText(content);

where "content" is a string. This string may be an entire file as you would expect to encounter it in a project - with a namespace containing class / interface / enum and fields / properties / methods / values - or it may be a "fragment", such as a single method or method call (as often illustrated when people talk about using Roslyn for scripting).

The "ParseText" method returns a SyntaxTree instance. This is an immutable structure that describes the parsed content. I'm a huge fan of immutable structures since I think it makes code much easier to reason about (my love of immutability has been a theme through many of the posts I've written). In Roslyn's design it has been stated that

The short answer to why syntax trees are immutable in Roslyn is that it makes parallel work much easier. You can take a syntax tree and pass it to any thread and not worry that someone else will mutate it while you are in the middle of doing analysis. This is useful in the command line compiler so that multiple trees can have their methods bound in parallel (which may need to occasionally access information from a different tree), but it's EXTREMELY important for VS scenarios where we want to have an extensibility model that allows many extensions to analyze and transform the same tree in parallel, and it doesn't really work to have a model that forces all those separate extensions to co-ordinate locking a single tree. Similarly, providing each extension its own copy of the tree would be prohibitive from a memory overhead point of view.

(I took this from a Google Groups thread Why are Roslyn Syntax Trees Immutable? and the answer is attributed to "the Roslyn PM").

Eric Lippert has also written about the design, saying that they wanted the data structures to be immutable and persistent and that

By persistence I mean the ability to reuse most of the existing nodes in the tree when an edit is made to the text buffer. Since the nodes are immutable, there's no barrier to reusing them, as I've discussed many times on this blog. We need this for performance; we cannot be re-parsing huge wodges of text every time you hit a key. We need to re-lex and re-parse only the portions of the tree that were affected by the edit, because we are potentially re-doing this analysis between every keystroke.

This is in the context of using Roslyn to analyse code being written within Visual Studio - the full post is titled Persistence, Facades and Roslyn's Red-Green Trees.

Get to the point already!

So. Enough history. Back to my TODO-search.

The SyntaxTree returned from "ParseText" looks quite complex at first glance when you start poking around it with Visual Studio's "QuickWatch" facility, at least (which is the first thing I did).

However, Roslyn helpfully provides a SyntaxWalker class, which may be used to easily examine every node within the tree. It uses the visitor pattern to do this. Design patterns are said to be a benefit when their form is appropriate to your problem such that they extend your vocabulary to describe the solution. There seem to be times, unfortunately, when people layer on design patterns and abstractions only because they think they should - which is why it's nice in cases like this where the pattern makes perfect sense and succeeds in making things simple if you know the pattern being used. Last year, I was writing a plugin for dotLess which used the visitor pattern to traverse the nodes in a stylesheet (see Cross Browser (Pseudo) Source Mapping with LESS) and it was nice to see the exact same concept in use here.

The simplest implementation is

public class TriviaVisitor : SyntaxWalker
{
  public TriviaVisitor() : base(SyntaxWalkerDepth.StructuredTrivia) { }
  protected override void VisitTrivia(SyntaxTrivia trivia)
  {
    // Examine Trivia here..
  }
}

When the "Visit" method is called (which is defined by the SyntaxWalker class) and given a parsed tree, the "VisitTrivia" method is called for every SyntaxTrivia instance that is encountered within that tree - eg.

(new TriviaVisitor()).Visit(
  CSharpSyntaxTree.ParseText(content).GetRoot()
);

Comments and whitespace are SyntaxTrivia. Everything else will be represented by the SyntaxNode and SyntaxToken types. A SyntaxNode is made up of SyntaxTokens. For example, a "UsingDirectiveSyntax" represents a "using" statement such as

using System;

and will contain SyntaxTokens for the "using", "System" and ";" components of the statement.

These SyntaxNodes and SyntaxTokens are part of the tree that describes that parsed content. Trivia, however, are not directly part of the hierarchical data - rather, they are related to tokens and accessible through the token's "LeadingTrivia" and "TrailingTrivia" properties. Conversely, SyntaxTrivia instances have a "Token" property which allows you to map from the trivia back to the associated token.

So, within a "VisitTrivia" method, we can identify trivia we're interested in (comments, in this case, rather than whitespace) and determine what token they're associated with. The token will have a "Parent" property, which is the SyntaxNode that it's part of. The node is part of a hierarchy, which can be traversed up via the "Parent" property values - each node may be something we're interested in identifying; such as the method containing the comment, the type containing that method or the namespace containing that type (we must remember, though, that not all comments will be within methods - some may be TODO comments annotating a class, or even just sitting out on their own in an otherwise-empty file).

public class CommentLocatingVisitor : SyntaxWalker
{
  private readonly Action<Comment> _commentLocated;
  public CommentLocatingVisitor(Action<Comment> commentLocated)
    : base(SyntaxWalkerDepth.StructuredTrivia)
  {
    if (commentLocated == null)
      throw new ArgumentNullException("commentLocated");

    _commentLocated = commentLocated;
  }

  protected override void VisitTrivia(SyntaxTrivia trivia)
  {
    if (_commentTypes.Contains(trivia.CSharpKind()))
    {
      string triviaContent;
      using (var writer = new StringWriter())
      {
        trivia.WriteTo(writer);
        triviaContent = writer.ToString();
      }

      // Note: When looking for the containingMethodOrPropertyIfAny, we want MemberDeclarationSyntax
      // types such as ConstructorDeclarationSyntax, MethodDeclarationSyntax, IndexerDeclarationSyntax,
      // PropertyDeclarationSyntax but NamespaceDeclarationSyntax and TypeDeclarationSyntax also
      // inherit from MemberDeclarationSyntax and we don't want those
      var containingNode = trivia.Token.Parent;
      var containingMethodOrPropertyIfAny = TryToGetContainingNode<MemberDeclarationSyntax>(
        containingNode,
        n => !(n is NamespaceDeclarationSyntax) && !(n is TypeDeclarationSyntax)
      );
      var containingTypeIfAny = TryToGetContainingNode<TypeDeclarationSyntax>(containingNode);
      var containingNameSpaceIfAny = TryToGetContainingNode<NamespaceDeclarationSyntax>(containingNode);
      _commentLocated(new Comment(
        triviaContent,
        trivia.SyntaxTree.GetLineSpan(trivia.Span).StartLinePosition.Line,
        containingMethodOrPropertyIfAny,
        containingTypeIfAny,
        containingNameSpaceIfAny
      ));
    }
    base.VisitTrivia(trivia);
  }

  private static HashSet<SyntaxKind> _commentTypes = new HashSet<SyntaxKind>(new[] {
    SyntaxKind.SingleLineCommentTrivia,
    SyntaxKind.MultiLineCommentTrivia,
    SyntaxKind.DocumentationCommentExteriorTrivia,
    SyntaxKind.SingleLineDocumentationCommentTrivia,
    SyntaxKind.MultiLineDocumentationCommentTrivia
  });

  private T TryToGetContainingNode<T>(SyntaxNode node, Predicate<T> optionalFilter = null)
    where T : SyntaxNode
  {
    if (node == null)
      throw new ArgumentNullException("node");

    var currentNode = node;
    while (true)
    {
      var nodeOfType = currentNode as T;
      if (nodeOfType != null)
      {
        if ((optionalFilter == null) || optionalFilter(nodeOfType))
          return nodeOfType;
      }
      if (currentNode.Parent == null)
        break;
      currentNode = currentNode.Parent;
    }
    return null;
  }
}

This CommentLocatingVisitor class is instantiated with a callback that is executed for every comment that is encountered when its "Visit" method is called and the provided tree traversed.

To keep things organised, the callback is given a Comment instance, as follows:

public class Comment
{
  public Comment(
    string content,
    int lineNumber,
    MemberDeclarationSyntax methodOrPropertyIfAny,
    TypeDeclarationSyntax typeIfAny,
    NamespaceDeclarationSyntax namespaceIfAny)
  {
    if (string.IsNullOrEmpty(content))
      throw new ArgumentException("Null/blank content specified");
    if (lineNumber < 1)
      throw new ArgumentOutOfRangeException("lineNumber");

    Content = content;
    LineNumber = lineNumber;
    MethodOrPropertyIfAny = methodOrPropertyIfAny;
    TypeIfAny = typeIfAny;
    NamespaceIfAny = namespaceIfAny;
  }

  /// <summary>
  /// This will never be null or blank
  /// </summary>
  public string Content { get; private set; }

  /// <summary>
  /// This will always be a positive integer
  /// </summary>
  public int LineNumber { get; private set; }

  /// <summary>
  /// This may be null since the comment may not exist within a method or property
  /// </summary>
  public MemberDeclarationSyntax MethodOrPropertyIfAny { get; private set; }

  /// <summary>
  /// This may be null since the comment may not exist within a class, interface or struct
  /// </summary>
  public TypeDeclarationSyntax TypeIfAny { get; private set; }

  /// <summary>
  /// This may be null since the comment may not exist within a namespace
  /// </summary>
  public NamespaceDeclarationSyntax NamespaceIfAny { get; private set; }
}

So now, given the contents of any C# file, the comments can be identified and traced to the constructs that they're associated with. Now they just need to be filtered to those containing the text "TODO", since those are the particular comments of interest.
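Given a list of Comment instances collected by the visitor's callback, that final filter is a one-liner - eg.

var toDoComments = comments.Where(
  comment => comment.Content.IndexOf("TODO", StringComparison.OrdinalIgnoreCase) >= 0
);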

For the first stab I took at this, I did a search-all-solution for "TODO" and copy-pasted the results into a file. I then read in this file, extracted the filenames and ran the above against the contents of each file.

But surely there's a better way..

Parsing the solution

What would be ideal would be the ability to point some code at a solution file, for it to determine what projects are in the solution, what C# code files are in the projects and then to extract all of the locations of TODO comments within those. None of this search-all / copy-paste / parse-the-results-and-read-the-files-from-there nonsense.

There are two parts to this - reading the solution file to get the projects and reading the individual project files. I'll start with the latter since it turned out to be easier.

If you add a reference to "Microsoft.Build" then you can use the ProjectCollection type in a method such as

private static IEnumerable<FileInfo> GetCSharpCompileItemFilesForProject(FileInfo projectFile)
{
  if (projectFile == null)
    throw new ArgumentNullException("projectFile");

  return (new ProjectCollection()).LoadProject(projectFile.FullName).AllEvaluatedItems
    .Where(item => item.ItemType == "Compile")
    .Select(item => item.EvaluatedInclude)
    .Where(include => include.EndsWith(".cs", StringComparison.OrdinalIgnoreCase))
    .Select(include => new FileInfo(Path.Combine(projectFile.Directory.FullName, include)));
}

Nice when the framework provides you just what you need! This is basically just looking for ".cs" items in a given project file and returning FileInfo instances such that the full path is made available (the filenames in the project will be paths relative to the location of the project file and so need to be combined with the project file location to get the full path of the file).

The solution file parsing is not quite so elegant.

There is a Stack Overflow question "How do I compile a C# solution with Roslyn?" which talks about parsing a solution file. But it's very out of date and the code doesn't compile. It leads to another question "Roslyn / Find References - Can't properly load Workspace" which looks like it's going to help but I encountered the same problem as this question: "MSBuildWorkspace.Create() throws exception". The gist is that to use this you need Microsoft.Build version 14, whereas the version available (for VS 2013, at least) is version 4. It seems like the solution is to download the VS 2014 CTP or get the ISO file and root around for the version 14 assembly.

At this point, I got bored with it and fell back to parsing the solution file with a regular expression, looking for ".csproj" files in what look like project declarations.

private static IEnumerable<FileInfo> GetProjectFilesForSolution(FileInfo solutionFile)
{
  if (solutionFile == null)
    throw new ArgumentNullException("solutionFile");

  var projectFileMatcher = new Regex(
    @"Project\(""\{\w{8}-\w{4}-\w{4}-\w{4}-\w{12}\}""\) = ""(.*?)"", ""(?<projectFile>(.*?\.csproj))"", ""\{\w{8}-\w{4}-\w{4}-\w{4}-\w{12}\}"""
  );
  foreach (Match match in projectFileMatcher.Matches(solutionFile.OpenText().ReadToEnd()))
  {
    yield return new FileInfo(
      Path.Combine(solutionFile.Directory.FullName, match.Groups["projectFile"].Value)
    );
  }
}

It feels a bit dirty but it seems to do the job! And this is hardly production code so I can live with it.
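For completeness, this is how the pieces can be strung together (a sketch with error handling skipped, where "solutionFilePath" is assumed to point at a real .sln file):

var allComments = new List<Comment>();
var visitor = new CommentLocatingVisitor(comment => allComments.Add(comment));
var codeFiles = GetProjectFilesForSolution(new FileInfo(solutionFilePath))
  .SelectMany(projectFile => GetCSharpCompileItemFilesForProject(projectFile));
foreach (var codeFile in codeFiles)
{
  visitor.Visit(
    CSharpSyntaxTree.ParseText(File.ReadAllText(codeFile.FullName)).GetRoot()
  );
}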

Cryptic warnings

There is another small niggle with all this code. It works but there's a compile warning

Found conflicts between different versions of the same dependent assembly that could not be resolved. These reference conflicts are listed in the build log when log verbosity is set to detailed.

I don't like compile warnings, if something's wrong then I want to make it right. Plenty of people have eloquently made the case for always resolving compile warnings so I won't go over old ground here - just suffice to say that I agree!

The log verbosity can be altered by going to Tools / Options / Projects and Solutions / Build and Run, from there "MSBuild project build output verbosity" can be changed. So I set it to "Detailed" as instructed in the warning message and found.. nothing useful.

It turns out that this warning is telling a bit of a fib and you actually need to bump the verbosity up another step to "Diagnostic". Then the log includes the following

There was a conflict between Microsoft.Build, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a and Microsoft.Build, Version=14.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a.

It also includes lots of other useful information like what references have what dependencies, so I can see that Microsoft Build v4 is required by project item "Microsoft.Build" (meaning that is the version that I explicitly added as a reference to parse the project files). And I can see that Microsoft Build v14 is required by the project items "Microsoft.CodeAnalysis.Workspaces", "Microsoft.CodeAnalysis.VisualBasic.Workspaces" and "Microsoft.CodeAnalysis.CSharp.Workspaces", which are references pulled in by the Roslyn NuGet package.

Unfortunately, I've already explained that I gave up trying to install Microsoft.Build v14! If this was "real" code then I would investigate installing that package properly to get rid of this warning.. but for this sort of one-off task (pulling the TODO comments out of a solution, once) I decided I can live with it. At least I have an idea how to sort it out if I ever do want to use this code in a more demanding environment.

Parting words

This first foray into Roslyn's capabilities has been interesting. I've clearly scratched only the very outer surface of it but it seems like a really well considered product, I think it could be useful in many scenarios and fully intend to have a poke around with its compiling capabilities at some point (since I do love a bit of dynamic compilation, as I was writing about last time!).

If anything that I've written about today could be useful to you, I've put a complete solution up on Bitbucket - find it at The TODOCommentRetriever.

Posted at 19:38


A follow-up to "Implementing F#-inspired 'with' updates in C#"

A couple of weeks ago, I was talking about a way to structure an "UpdateWith" method that immutable classes in C# could have so that callers can change one or more properties in a single call, resulting in a new instance of the class. Presuming, of course, that the new property values varied from the old values - otherwise the original instance should be returned (there's no point creating a new instance to represent the exact same data when the containing type is an immutable "value"). Feel free to go read Implementing F#-inspired "with" updates for immutable classes in C# if you didn't already!

The really simple way to do something like this is to not have an "UpdateWith" method at all and for the calling code to call the constructor directly, but that means that there will potentially be a lot of places that need fixing if the constructor arguments are changed or re-ordered at any time. Another simple approach is for there to be multiple "Update" methods, one for each property (so you might have an "UpdateName" method, an "UpdateStartDate" method; a distinct "Update{whatever}" for each individual property).

I was feeling oh so proud of myself for thinking to combine a multiple-parameter "Update" method with an "Optional" struct so that the best of every world could be had - a single call could update one or more properties without having to specify values for properties that are not to be updated. Unlike with the "Update{whatever}" methods, if two properties need to be updated, only a single new instance will be required - there will not be new instances for each separate property update - so there would be no added GC pressure from unnecessary "intermediate" instances.

To illustrate -

public class RoleDetails
{
  public RoleDetails(string title, DateTime startDate, DateTime? endDateIfAny)
  {
    Title = title;
    StartDate = startDate;
    EndDateIfAny = endDateIfAny;
  }

  public string Title { get; private set; }
  public DateTime StartDate { get; private set; }
  public DateTime? EndDateIfAny { get; private set; }

  public RoleDetails UpdateWith(
    Optional<string> title = new Optional<string>(),
    Optional<DateTime> startDate = new Optional<DateTime>(),
    Optional<DateTime?> endDateIfAny = new Optional<DateTime?>())
  {
    if (!title.IndicatesChangeFromValue(Title)
    && !startDate.IndicatesChangeFromValue(StartDate)
    && !endDateIfAny.IndicatesChangeFromValue(EndDateIfAny))
      return this;

    return new RoleDetails(
      title.GetValue(Title),
      startDate.GetValue(StartDate),
      endDateIfAny.GetValue(EndDateIfAny)
    );
  }
}

The Optional struct looked like this:

public struct Optional<T>
{
  private T _valueIfSet;
  private bool _valueHasBeenSet;

  public T GetValue(T valueIfNoneSet)
  {
    return _valueHasBeenSet ? _valueIfSet : valueIfNoneSet;
  }

  public bool IndicatesChangeFromValue(T value)
  {
    if (!_valueHasBeenSet)
      return false;

    if ((value == null) && (_valueIfSet == null))
      return false;
    else if ((value == null) || (_valueIfSet == null))
      return true;

    return !value.Equals(_valueIfSet);
  }

  public static implicit operator Optional<T>(T value)
  {
    return new Optional<T>
    {
      _valueIfSet = value,
      _valueHasBeenSet = true
    };
  }
}

I then went on a bit of a wild tangent and thought "if pretty much all of these UpdateWith methods are going to look the same and be boring to write, could I have some magic code generate it for me on the fly?" - this led me to write a small library that allows the following:

public RoleDetails UpdateWith(
  Optional<string> title = new Optional<string>(),
  Optional<DateTime> startDate = new Optional<DateTime>(),
  Optional<DateTime?> endDateIfAny = new Optional<DateTime?>())
{
  return DefaultUpdateWithHelper.GetGenerator<RoleDetails>()(this, title, startDate, endDateIfAny);
}

I got a variety of feedback on the post. One of the really interesting things to find was that the main idea itself was already in real-world use, in Microsoft's Roslyn .net compiler, for example. The file ProjectInfo.cs has a "With" method that follows a very similar structure with a corresponding Optional.cs struct that is also very similar to what I'd written. I found this very encouraging.. even if it did steal the thunder from "my" idea!

More of the feedback related to performance concerns regarding the "DefaultUpdateWithHelper.GetGenerator" method. It returns a delegate to create a new instance, based upon the provided arguments. This delegate is a compiled LINQ Expression, cached against the target type and the provided argument structure. The problem was that some reflection was required in order to determine whether there was a compiled expression in the cache that could be re-used, so each call to "GetGenerator" carried that reflection overhead. The question was just how much..

But before I go into that, one of the constructive comments was that I wasn't generating a hash code on my cache key type correctly. The cache key contained the information about the target type, along with the number of arguments and their types. The function to produce a combined hash for this information was

public int GetHashCode(CacheKeyData obj)
{
  if (obj == null)
    throw new ArgumentNullException("obj");
  var hash = obj.DeclaringType.GetHashCode() ^ obj.TargetType.GetHashCode();
  for (var index = 0; index < obj.NumberOfUpdateParameters; index++)
    hash = hash ^ obj.GetUpdateParameter(index).GetHashCode();
  return hash;
}

This goes through each aspect of the cache key data and performs XOR operations to get a combined result. It was pointed out by Strilanc on Reddit that it's better practice to multiply by a prime number after every XOR. This way, if there are two references that report the same hash code then they won't cancel each other out.
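The cancelling-out problem is easy to demonstrate since XOR-ing any value with itself always gives zero -

var x = "MyType".GetHashCode();
var y = "MyType".GetHashCode();
var combined = x ^ y; // Always zero, whatever hash code the string reports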

The reason that I'd used XOR without thinking about it too much was that I knew that XOR on two ints could never cause an overflow and so seemed like a safe easy option. But, in C#, this isn't something we normally have to worry about - for example

// Trying to set "var i0 = int.MaxValue + 1;" will result in a compile error
//   "The operation overflows at compile time in checked mode"
// but performing in two steps will not
var i0 = int.MaxValue;
var i1 = i0 + 1;

does not result in an overflow exception. Instead, it wraps around (so i1 will be equal to int.MinValue). In order to "opt in" to overflow exceptions being raised for these sorts of operations, the "checked" keyword needs to be used (or there's a "checked" compiler option that does the same).
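For example, wrapping the operation in a "checked" expression changes the silent wrap-around into an exception -

var i0 = int.MaxValue;

// This will throw an OverflowException at runtime rather than wrapping around
var i1 = checked(i0 + 1);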

So we can safely change the implementation to

public int GetHashCode(CacheKeyData obj)
{
  if (obj == null)
    throw new ArgumentNullException("obj");
  var hash = obj.DeclaringType.GetHashCode() ^ obj.TargetType.GetHashCode();
  for (var index = 0; index < obj.NumberOfUpdateParameters; index++)
    hash = (3 * hash) ^ obj.GetUpdateParameter(index).GetHashCode();
  return hash;
}

There was also a comment left on my blog

.. your usage of the object.Equals() method also creates garbage..

which I had to think about to understand what was meant. When I realised, I kicked myself that I'd missed it! In the Optional struct there's the method

public bool IndicatesChangeFromValue(T value)
{
  if (!_valueHasBeenSet)
    return false;

  if ((value == null) && (_valueIfSet == null))
    return false;
  else if ((value == null) || (_valueIfSet == null))
    return true;

  return !value.Equals(_valueIfSet);
}

That final call has to resort to

public virtual bool Equals(object obj);

on the base Object type since the compiler has no other choice that could apply to any "T". But if "T" is not a reference type then it has to be boxed in order to access it as an Object (which is necessary to access this lowest-common-denominator "Equals" method).

A better solution is to check whether "value" implements IEquatable<T>. Microsoft recommends that structs implement this interface (see the article Struct Design on MSDN) and the primitive types such as System.Int32 (aka int) all follow this suggestion.

So the boxing can be avoided in most cases by changing the method to

public bool IndicatesChangeFromValue(T value)
{
  if (!_valueHasBeenSet)
    return false;

  if ((value != null) && (value is IEquatable<T>))
    return !((IEquatable<T>)value).Equals(_valueIfSet);

  if ((value == null) && (_valueIfSet == null))
    return false;
  else if ((value == null) || (_valueIfSet == null))
    return true;

  return !value.Equals(_valueIfSet);
}

I'm chalking up these two recommendations as even more evidence that code reviewing can be helpful.. :)

So how does it perform?

Having addressed the above improvements, the question about how the code actually performs still remains.

There are three candidates to consider when weighing up the automagical DefaultUpdateWithHelper. The first two appear above. One is the hand-written version shown in the RoleDetails class right at the top of the post. The other is the one-liner "GetGenerator" call. There is a third option, however, that allows multiple calls to avoid the cache-check and so avoid reflection entirely on all but the first request; that is to call "GetGenerator" once and record it in a static reference -

private static UpdateWithSignature<RoleDetails> updater
  = DefaultUpdateWithHelper.GetGenerator<RoleDetails>(typeof(RoleDetails).GetMethod("UpdateWith"));
public RoleDetails UpdateWith(
  Optional<string> title = new Optional<string>(),
  Optional<DateTime> startDate = new Optional<DateTime>(),
  Optional<DateTime?> endDateIfAny = new Optional<DateTime?>())
{
  return updater(this, title, startDate, endDateIfAny);
}

To get an idea of the raw performance of these methods, I wrote a console app that would repeatedly call a variation of an "UpdateWith" method. I've named the three varieties that I'm interested in: "ManualWith" (the hand-written version), "SimpleWith" (the one-liner) and "StaticWith" (shown above; the one-liner where the result is stored in a static reference to avoid multiple calls to "GetGenerator").

Having a console app meant that the process would be started fresh and then torn down for each run, hopefully ensuring an even playing field. This is particularly relevant in relation to GC, which can introduce variance into longer-running processes. In this case, I'm interested in the direct execution performance of the various methods and I'm not trying to compare GC overhead (which is something that can be investigated, but which can be very complicated to do correctly).

The source code for this app can be found at gist.github.com/anonymous/31b752d24212ad43836e. It's as simple as possible and must be run in Release configuration in order to provide the most realistic results. I ran it multiple times for each of the variations, running a complete set of each before repeating (just to try to give everything the best odds of averaging out as possible).

For "ManualWith", the loop count had to be ten million to get any sensible measurements. The average time per execution was 1.0 ticks (an average of 3538ms for 10,000,000 calls).

For "SimpleWith", the loop count had to be 100,000. The average per execution was 81.7 ticks (averaging 2997ms for 100,00 calls).

"StaticWith" needed the loop count bumping back up to ten million again - averaging 2.1 ticks per execution (7874ms average for 10,000,000 calls).

Now, actually, I don't think that's too bad (the "StaticWith" result, I mean). If something's a real convenience and the only overhead it introduces is that object instantiation is twice as slow, I think that in most cases it could be considered a win - the reality is that instantiating objects is not likely to be a bottleneck where performance becomes a concern*. The reason for the performance difference between "ManualWith" and "StaticWith" is going to be the boxing of the Optional values when they are passed to the delegate, combined with the fact that the arguments are passed to the "updater" as a params array; ie. an object[] - which must be instantiated. My original post talked about further tweaks that the library allows, to specify the number of arguments and so avoid the params object array - but it would still have to box the Optional values.

* (Insert comment here about profiling before assigning blame for performance and another about how exchanging convenience for performance only works if any performance cost is offset by having said convenience).

So.. all things considered, do I genuinely expect to use one of the "magic" approaches in my code going forward? Well, no. I will be using the format of the "UpdateWith" method and utilising the Optional struct in the method signature, but I probably won't bother with the DefaultUpdateWithHelper and the library I wrote. It was fun to write and I learnt a lot doing it and through the feedback on it, but I still have a niggling worry that changes to the constructor (in a refactor, or whatever) will not cause compile-time errors in the "UpdateWith" method if I forget to update that as well. I won't find out until runtime that there's a problem (or until the unit tests, that I suggested last time as one of the trade-offs for the convenience, are executed). And I'm a big fan of helping the compiler to help me.

Plus there's the fact that the difference in code size between the "StaticWith" code and the "ManualWith" isn't really that large. Even as more properties are added, it's still very scannable and doesn't bloat up too much even though you have to write the code for each property's "IndicatesChangeFromValue" check and manually pass the "GetValue" result for each constructor argument. Looking at that Roslyn code doesn't make me think that the methods (written in the "ManualWith" manner) are too big, and some of them have a lot of constructor arguments.

If only there was some way to get the best of both worlds; brevity in type definitions but all the benefits of static analysis..

The "ImmutableObjectGraph" T4 Templates

This was another thing that came from the comments on my blog (thanks Ian Yates! :), a library of templates that take a simple definition such as

class Fruit
{
  string color;
  int skinThickness;
}

and transforms it into a fully-fledged immutable class with a "With" method (which is exactly like the "UpdateWith" method I've been talking about). It has its own Optional struct, the same as in Roslyn's source. The generated types even have a nested Builder type which has mutable properties and a "ToImmutable" method which returns an immutable type with the described data - for times when it's just easier to prepare a reference in a few steps before "freezing" it (or for "efficient multi-step mutation", according to the README). It's little indications of attention to detail such as this that I liked when I looked into the project: github.com/AArnott/ImmutableObjectGraph.

The idea of constructing T4 templates like this is one that I've kicked around before but never gotten round to actually implementing, so finding this was a nice surprise!

Now, there are a few flies in the ointment. The library relies on a pre-release version of Microsoft's Immutable Collections, and references to the binary's location are hard-coded into the template files. Also, the template files currently need to be copied into every project that you want to use them with. There's no NuGet package to make it easy to pull into a project - and if you try to pull down the code from GitHub using "Download Zip" then it refuses to compile (though cloning it in GitHub for Windows works fine). It assumes that all generated types should support a "DefaultInstance" (which I disagree with since it's basically too close to another version of null - an instance that contains no real information.. for a list type, this may make sense - the empty list - but not for types such as the RoleDetails I've been using as an example so far).

But hopefully this is where the wonders of open source will come to the fore! I've submitted a pull request to try to encourage the templates into a NuGet package (putting the onus on the consumer to include a version of the Immutable Collections, if required). You can find it at Generate a NuGet package (and prevent the templates being copied into each consuming project). However, there is another pull request that has been open for some time (since April) which I think has merit and which I have tested myself, that has been acknowledged by the author but not merged: Fixing compiler warnings with collections and inheritance. I don't know why it hasn't been merged. Considering that one of the decisions in my request may be contentious (pulling "CollectionHelper" methods into the generated types that require them, in order to prevent the imported binary requiring an Immutable Collections reference), I'm not sure how confident I am at the moment that it will be accepted.

Further changes to address my other concerns could be made as well - such as an attribute that could be added to indicate that a default instance should not be defined. Depending upon how the pull request is received, I might submit more or I might go rogue and maintain my own fork. As I understand the "MS-PL" license, I'm fairly sure this is allowed (though I'd be much happier to end up with everything merged into one beautiful definitive version).

The really big question that I want to answer, though, is whether the use of the templates will mesh well with code contracts. The generated types do specify "partial class" and so can be extended - they could implement an interface, for example, which has contracts specified on it. And the classes call an optional "Validate" method, which could be used to verify the constructor arguments. I'm not sure yet if this will all be capable of what I have in mind - I've only had a very preliminary look into it.. but I think it has promise!

Just imagine: the brevity of the type declarations above, the guarantees of contracts (though this will necessarily affect the succinctness of the code - even if a separate "contract interface" is implemented, the contract for that interface must still be written somewhere), the static analysis benefits for the generated types.. all this goodness in one solution! So maybe I don't actually have all the pieces together just yet.. but I'm certainly going to be trying to get them over the next few weeks and carrying it all onward to programming nirvana!

Posted at 21:49

Comments

Implementing F#-inspired "with" updates for immutable classes in C#

I've been prototyping a data service for a product at work that communicates with immutable types, and one of the feedback comments was a question as to whether the classes supported a flexible F#-esque "with" method that would allow multiple properties to be changed without the garbage collection churn of creating intermediate references for each individual property (since, of course, the property values aren't actually changed on an instance; a new instance is generated that reflects the requested changes).

To pull an example straight from the excellent F# for fun and profit site:

let p1 = {first="Alice"; last="Jones"}
let p2 = {p1 with last="Smith"}

This creates a new record p2 that takes p1 and changes one of the fields. Multiple fields may be altered in one use of the "with" statement

let p2 = {p1 with first="John";last="Smith"}

To start with a very simple example in C#, take the following class:

public class RoleDetails
{
  public RoleDetails(string title, DateTime startDate, DateTime? endDateIfAny)
  {
    Title = title;
    StartDate = startDate;
    EndDateIfAny = endDateIfAny;
  }

  public string Title { get; private set; }
  public DateTime StartDate { get; private set; }
  public DateTime? EndDateIfAny { get; private set; }
}

This is a very close parallel to the F# record type since it just assigns read-only properties (they're not strictly read-only since they don't use the "readonly" keyword but they're not externally alterable and are only set once within the class so it's close enough).

If I was writing something like this for real use, I would probably try to make more guarantees.. or at least, document behaviour. Something like:

public class RoleDetails
{
  public RoleDetails(string title, DateTime startDate, DateTime? endDateIfAny)
  {
    if (string.IsNullOrWhiteSpace(title))
      throw new ArgumentException("title");
    if ((endDateIfAny != null) && (endDateIfAny <= startDate))
      throw new ArgumentException("endDateIfAny");

    Title = title.Trim();
    StartDate = startDate;
    EndDateIfAny = endDateIfAny;
  }

  /// <summary>
  /// This will never be null or blank, it will not have any leading or trailing whitespace
  /// </summary>
  public string Title { get; private set; }

  public DateTime StartDate { get; private set; }

  /// <summary>
  /// If non-null, this will be greater than the StartDate
  /// </summary>
  public DateTime? EndDateIfAny { get; private set; }
}

As I've said before, this validation and commenting is really a poor substitute for code contracts which would allow for compile time detection of invalid data rather than relying on runtime exceptions (speaking of which, I need to give the .net code contracts solution another go - last time I got stuck in I hit some problems which hopefully they've ironed out by now).

Another variation on the "aggressive validation" illustrated above would be a type that represents a non-blank string, to prevent duplicating the calls to IsNullOrWhiteSpace and Trim. This concept could be taken even further to "strongly type" string values, so that a "Title" can not be passed into a function that expects a "Notes" string value, for example. This is far from an original idea but it was something I was experimenting with again recently.

Incidentally, there is talk of a future version of C# getting a record type which would reduce boilerplate code when defining simple immutable types. For example (from the InfoQ article Easier Immutable Objects in C# 6 and VB 12) -

public record class Cartesian(double x: X, double y: Y);

This will define an immutable class with two read-only properties that are set through a constructor call. This future C# specification is also apparently going to allow read-only auto properties - so in my RoleDetails class above instead of "get; private set;" properties, which are externally unalterable but could actually be changed within the instance, the properties could be truly readonly. This is possible currently but it requires a private readonly field and a property with a getter that returns that field's value, which is even more boring boilerplate.
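That boilerplate looks something like this for a single property:

public class RoleDetails
{
  private readonly string _title;

  public RoleDetails(string title)
  {
    _title = title;
  }

  /// <summary>
  /// The backing field is readonly, so not even code within the class can
  /// change this value after construction
  /// </summary>
  public string Title { get { return _title; } }
}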

The obvious, verbose and potentially more GC-churny way

To prevent callers from having to call the constructor every time a property needs to be altered, update methods for each "mutable" property may be added (they don't really mutate anything since a new instance is returned rather than the value being changed on the current instance). This prevents the caller from having to repeat all of the constructor arguments that are not to be changed whenever one property needs altering. Forcing callers to call constructors in this way is particularly annoying if a constructor argument is added, removed or re-ordered at a later date; this can result in a lot of calling code that needs correcting.

public class RoleDetails
{
  public RoleDetails(string title, DateTime startDate, DateTime? endDateIfAny)
  {
    Title = title;
    StartDate = startDate;
    EndDateIfAny = endDateIfAny;
  }

  public string Title { get; private set; }
  public DateTime StartDate { get; private set; }
  public DateTime? EndDateIfAny { get; private set; }

  public RoleDetails UpdateTitle(string title)
  {
    return (title == Title)
      ? this
      : new RoleDetails(title, StartDate, EndDateIfAny);
  }
  public RoleDetails UpdateStartDate(DateTime startDate)
  {
    return (startDate == StartDate)
      ? this
      : new RoleDetails(Title, startDate, EndDateIfAny);
  }
  public RoleDetails UpdateEndDateIfAny(DateTime? endDateIfAny)
  {
    return (endDateIfAny == EndDateIfAny)
      ? this
      : new RoleDetails(Title, StartDate, endDateIfAny);
  }
}

To update two properties on a given instance, you would need to call

var updatedRoleDetails = existingRoleDetails
  .UpdateStartDate(new DateTime(2014, 9, 21))
  .UpdateEndDateIfAny(new DateTime(2014, 11, 21));

If either of the new values is the same as the property value that it should be replacing, then no new instance is required for that property update - since the Update{Whatever} method will return back the same instance. But if both properties are changed then two new instances are required, even though the first, intermediate value is immediately discarded and so is "wasted".

There could be an Update method that takes multiple parameters for the different properties, but then you're basically just mirroring the constructor. Or there could be various Update methods that take combinations of properties, to try to cover either the most common cases or all combinations of cases - but neither of these options is particularly elegant and they would all result in quite a lot of code duplication.

A better way

It struck me that it should be possible to do something with named and optional method arguments (support for which was added to C# when .net 4 came out, if I remember correctly). Something like

public RoleDetails UpdateWith(
  string title = Title,
  DateTime startDate = StartDate,
  DateTime? endDateIfAny = EndDateIfAny)
{
  if ((title == Title) && (startDate == StartDate) && (endDateIfAny == EndDateIfAny))
    return this;
  return new RoleDetails(title, startDate, endDateIfAny);
}

would allow for only a subset of the arguments to be specified and for those that are left unspecified to default to the current property value of the instance. So the earlier update code becomes

var updatedRoleDetails = existingRoleDetails
  .UpdateWith(startDate: new DateTime(2014, 9, 21), endDateIfAny: new DateTime(2014, 11, 21));

However, this won't fly. The compiler gives the errors

Default parameter value for 'title' must be a compile-time constant

Default parameter value for 'startDate' must be a compile-time constant

Default parameter value for 'endDateIfAny' must be a compile-time constant

That's a bummer.

Another thought that briefly crossed my mind was for the default argument values to all be null. This would work if the arguments were all reference types and would result in the method body looking something like

if ((title == null) && (startDate == null) && (endDateIfAny == null))
  return this;
return new RoleDetails(title ?? Title, startDate ?? StartDate, endDateIfAny ?? EndDateIfAny);

But that is too restrictive a constraint since in this case we have a non-reference type argument (startDate) and we also have a nullable type for which null is a valid, meaningful value (endDateIfAny).

So what we really need is a wrapper type around the arguments that indicates when no value has been specified. Since we're being conscious of avoiding GC churn, this should be a struct: structs essentially avoid adding GC pressure because they are copied when passed around, meaning that no struct is referenced by multiple scopes and so they don't have to be traced in the same way as reference types - when the scope that has access to the struct terminates, the struct can safely be forgotten as well. This is not a particularly precise description of what happens and more details can be found in the MSDN article Choosing Between Class and Struct. Particularly see the paragraph

The first difference between reference types and value types we will consider is that reference types are allocated on the heap and garbage-collected, whereas value types are allocated either on the stack or inline in containing types and deallocated when the stack unwinds or when their containing type gets deallocated. Therefore, allocations and deallocations of value types are in general cheaper than allocations and deallocations of reference types.

The other guidelines in that article around cases where structs may be appropriate (if the type "logically represents a single value", "has an instance size under 16 bytes", "is immutable" and "will not have to be boxed frequently") are followed by this type:

public struct Optional<T>
{
  private T _valueIfSet;
  private bool _valueHasBeenSet;

  public T GetValue(T valueIfNoneSet)
  {
    return _valueHasBeenSet ? _valueIfSet : valueIfNoneSet;
  }

  public bool IndicatesChangeFromValue(T value)
  {
    if (!_valueHasBeenSet)
      return false;

    if ((value == null) && (_valueIfSet == null))
      return false;
    else if ((value == null) || (_valueIfSet == null))
      return true;

    return !value.Equals(_valueIfSet);
  }

  public static implicit operator Optional<T>(T value)
  {
    return new Optional<T>
    {
      _valueIfSet = value,
      _valueHasBeenSet = true
    };
  }
}

This type allows us to write an UpdateWith method

public RoleDetails UpdateWith(
  Optional<string> title = new Optional<string>(),
  Optional<DateTime> startDate = new Optional<DateTime>(),
  Optional<DateTime?> endDateIfAny = new Optional<DateTime?>())
{
  if (!title.IndicatesChangeFromValue(Title)
  && !startDate.IndicatesChangeFromValue(StartDate)
  && !endDateIfAny.IndicatesChangeFromValue(EndDateIfAny))
    return this;

  return new RoleDetails(
    title.GetValue(Title),
    startDate.GetValue(StartDate),
    endDateIfAny.GetValue(EndDateIfAny)
  );
}

The Optional type could have exposed properties for has-a-value-been-set and get-value-if-any but since each property comparison (to determine whether a new instance is actually required) would have to follow the pattern if-value-has-been-set-and-if-value-that-has-been-set-does-not-equal-current-value, it made sense to me to hide the properties and to instead expose only the access methods "IndicatesChangeFromValue" and "GetValue". The "IndicatesChangeFromValue" method returns true if the Optional describes a value that is different to that passed in and "GetValue" returns the wrapped value if there is one, and returns the input argument if not. This enables the relatively succinct "UpdateWith" method format shown above.

The other method on the struct is an implicit operator for the wrapped type which makes the "UpdateWith" calling code simpler. Instead of having to do something like

var updatedRoleDetails = existingRoleDetails
  .UpdateWith(startDate: new Optional<DateTime>(new DateTime(2014, 9, 21)));

the implicit conversion allows you to write

var updatedRoleDetails = existingRoleDetails
  .UpdateWith(startDate: new DateTime(2014, 9, 21));

because the DateTime will be implicitly converted into an Optional<DateTime>. In fact, I went one step further and made it such that this is the only way to create an Optional that wraps a value. There is no constructor that may be used to initialise an Optional with a value; you must rely upon the implicit conversion. This means that it's very clear that there's only one way to use this type. It also happens to be very similar to the most common way that the Nullable type is used in C# - although that does have a public constructor that accepts the value to wrap, in practice I've only ever seen values cast to Nullable (as opposed to the Nullable constructor being passed the value).
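To illustrate, using the struct defined above:

// The implicit operator is the only route to a "set" Optional - both of these
// result in an Optional<DateTime> that wraps the specified date
Optional<DateTime> endDate = new DateTime(2014, 11, 21);
var alsoEndDate = (Optional<DateTime>)new DateTime(2014, 11, 21);

// ..while the default, parameter-less initialisation is the "no value" state
var noValueSpecified = new Optional<DateTime>();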

Turning it up to eleven

Now this is all well and good and I think it would be a solid leap forward simply to leave things as they are shown above. Unnecessary GC pressure is avoided since there are no "intermediary" instances when changing properties, while the use of structs means that we're not generating a load of property-update-value references that need to be collected either.

But I just couldn't resist trying to push it a bit further since there's still quite a lot of boring code that needs to be written for every immutable type - the UpdateWith method needs to check all of the properties to ensure that they haven't changed and then it needs to pass values into a constructor. If a class has quite a lot of properties (which is not especially unusual if the types are representing complex data) then this UpdateWith method could grow quite large. Wouldn't it be nice if we could just write something like:

public RoleDetails UpdateWith(
  Optional<string> title = new Optional<string>(),
  Optional<DateTime> startDate = new Optional<DateTime>(),
  Optional<DateTime?> endDateIfAny = new Optional<DateTime?>())
{
  return magicUpdater(title, startDate, endDateIfAny);
}

Wouldn't it?? Yes it would.

And we can.. if we dip into some of the .net framework's crazier parts - reflection and stack tracing. With some LINQ expressions thrown in to make it work efficiently when called more than once or twice.

What this "magicUpdater" needs to do is take the names and values of the arguments passed to it and then analyse the target type (RoleDetails in this example) to find the constructor to call that will allow all of these named values to be passed into a new instance, using existing property values on the source instance for any constructor arguments that are not provided by the update arguments. It also needs to do the same work to determine whether the update arguments actually require a new instance to be generated - if only the StartDate is being provided to change but the new value is the same as the current value then no new instance is required, the source instance can be returned directly by the "magicUpdater".

This is handled in two steps. The first is based around this line:

var callingMethod = new StackFrame(1).GetMethod();

This returns a MethodBase with metadata about the method that called the "magicUpdater" (the "1" in the call above is how many steps back to go in the call stack). From this, the names of the arguments can be extracted and a delegate returned which will take the argument values themselves. So the call would actually look more like the following (since "magicUpdater" returns a delegate which must then itself be called):

return magicUpdater()(title, startDate, endDateIfAny);
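The argument-name extraction itself boils down to something like this (a sketch of the principle rather than the library's exact code):

var callingMethod = new StackFrame(1).GetMethod();

// GetParameters exposes metadata for each of the caller's arguments - for the
// "UpdateWith" method here that means "title", "startDate" and "endDateIfAny"
var argumentNames = callingMethod.GetParameters()
  .Select(p => p.Name)
  .ToArray();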

Before we move on to the second step, there are some important considerations in relation to the use of StackFrame. Firstly, there is some expense to performing analysis like this, as with using reflection - but we'll not worry about that here; some optimisations will be covered later which hopefully mean we can ignore it. What's more important is that analysing the call stack can seem somewhat.. unreliable, in a sense. In the real world, the code that gets executed is not always the code as it appears in the C# source. A release build will apply optimisations that won't be applied to debug builds and, when code is manipulated by the JIT compiler, more optimisations again may occur - one of the more well-known of which is "method inlining". Method inlining is when the compiler sees a chain of Method1 -> Method2 -> Method3 -> Method4 and observes that Method2 is so small that, instead of being left as a distinct method call (which has a cost, as every method call does - the arguments have to be passed into a new scope, as a very basic example of one of these costs), the code inside Method2 can be copied inside Method1. This would mean that if Method3 tried to access Method2's metadata through the StackFrame class, it would be unable to - it would be told it was called by Method1!

There's a short but informative article about this by Eric Gunnerson: More on inlining. In a nutshell it says that -

  • Methods that are greater than 32 bytes of IL will not be inlined.
  • Virtual functions are not inlined.
  • Methods that have complex flow control will not be inlined. Complex flow control is any flow control other than if/then/else; in this case, switch or while.
  • Methods that contain exception-handling blocks are not inlined, though methods that throw exceptions are still candidates for inlining.
  • If any of the method's formal arguments are structs, the method will not be inlined.

This means that we shouldn't have to worry about the UpdateWith method being inlined (since its arguments are all Optional which are structs), but the "magicUpdater" method may be a concern. The way that my library gets around that is that the method "GetGenerator" on the UpdateWithHelper class (it's not really called "magicUpdater" :) has the attribute

[MethodImpl(MethodImplOptions.NoInlining)]
public UpdateWithSignature<T> GetGenerator<T>(int numberOfFramesFromCallSite = 1)

which tells the JIT compiler not to inline it and so, since the caller isn't inlined (because of the structs), we don't have to worry about stack "compressing".

This "GetGenerator" method, then, has access to the argument names and argument types of the method that called it. The generic type param T is the immutable type that is being targeted by the "UpdateWith" method. UpdateWithSignature<T> is a delegate with the signature

public delegate T UpdateWithSignature<T>(T source, params object[] updateValues);

This delegate is what takes the property update values and creates a new instance (or returns the source instance if no changes are required). It does this by considering every public constructor that T has and determining which constructor arguments it can satisfy with update arguments - matching the update argument names to the constructor argument names and ensuring that the types are compatible. If a constructor is encountered with arguments that don't match any update arguments but T has a property whose name and type match the constructor argument, then that will be used. If a constructor argument is encountered that can't be matched to an update argument or a property on T but the constructor argument has a default value, then that default value may be used.

If a constructor does not have at least one argument that can be matched to each update argument name, then that constructor is ignored (otherwise an update argument would be ignored, which would make the UpdateWith somewhat impotent!). If there are multiple constructors that meet all of these conditions, they are sorted by the number of arguments they have that are fulfilled by update arguments and then by the number of arguments that are satisfied by other properties on T - the best match from this sorted set is used.
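As a rough sketch of that selection logic (illustrative only - the library's real code also verifies type compatibility and bakes the result into a compiled LINQ expression rather than reflecting on every call):

// "updateArgumentNames" is assumed here to be the argument names already read
// from the "UpdateWith" method's metadata (eg. via GetParameters, as above)
var bestConstructor = typeof(RoleDetails).GetConstructors()
  // Ignore any constructor that wouldn't make use of every update argument
  .Where(constructor => updateArgumentNames.All(updateArgName =>
    constructor.GetParameters().Any(p =>
      p.Name.Equals(updateArgName, StringComparison.OrdinalIgnoreCase))))
  // Every constructor argument must be satisfiable - by an update argument, by
  // a property on the type or by the argument's own default value
  .Where(constructor => constructor.GetParameters().All(p =>
    updateArgumentNames.Any(updateArgName =>
      updateArgName.Equals(p.Name, StringComparison.OrdinalIgnoreCase))
    || (typeof(RoleDetails).GetProperty(
      p.Name,
      BindingFlags.Public | BindingFlags.Instance | BindingFlags.IgnoreCase) != null)
    || p.IsOptional))
  // Prefer the constructor that takes the most of its values from update arguments
  .OrderByDescending(constructor => constructor.GetParameters().Count(p =>
    updateArgumentNames.Any(updateArgName =>
      updateArgName.Equals(p.Name, StringComparison.OrdinalIgnoreCase))))
  .FirstOrDefault();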

The returned UpdateWithSignature<T> delegate is itself a compiled LINQ expression so, once the cost of generating it has been paid the first time that it's required, calls to the delegate are very fast. The "GetGenerator" method caches these compiled expressions, so the method

public RoleDetails UpdateWith(
  Optional<string> title = new Optional<string>(),
  Optional<DateTime> startDate = new Optional<DateTime>(),
  Optional<DateTime?> endDateIfAny = new Optional<DateTime?>())
{
  return DefaultUpdateWithHelper.GetGenerator<RoleDetails>()(this, title, startDate, endDateIfAny);
}

can be called repeatedly and cheaply.

Note that in the above example, the DefaultUpdateWithHelper is used. This is a static wrapper around the UpdateWithHelper which specifies a default configuration. The UpdateWithHelper takes arguments that describe how to match update argument names to constructor argument names, for example (amongst other configuration options). The implementation in the DefaultUpdateWithHelper matches by name in a case-insensitive manner, which should cover the most common cases. But the relevant UpdateWithHelper constructor argument is of type

public delegate bool UpdateArgumentToConstructorArgumentComparison(
  ParameterInfo updateArgument,
  ConstructorInfo constructor,
  ParameterInfo constructorArgument);

so a custom implementation could implement any complex scheme based upon target type or constructor or update argument type.

The UpdateWithHelper also requires a cache implementation for maintaining the compiled expressions, as well as matchers for other comparisons (such as property name to constructor argument name, for constructor arguments that can't be matched by an update argument). If a custom UpdateWithHelper is desired that only needs to override some behaviour, the DefaultUpdateWithHelper class has a static nested class DefaultValues with properties that are the references that it uses for the UpdateWithHelper constructor arguments - some of these may be reused by the custom configuration, if appropriate.
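To illustrate the sort of thing that's possible, a custom matcher using that delegate signature might look like this (an entirely hypothetical convention, not something the library prescribes):

// Match case-insensitively on name but also allow an update argument to satisfy
// a constructor argument that has an underscore prefix (eg. "title" -> "_title")
UpdateArgumentToConstructorArgumentComparison relaxedComparison =
  (updateArgument, constructor, constructorArgument) =>
    constructorArgument.Name.Equals(
      updateArgument.Name, StringComparison.OrdinalIgnoreCase)
    || constructorArgument.Name.Equals(
      "_" + updateArgument.Name, StringComparison.OrdinalIgnoreCase);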

I considered going into some detail about how the LINQ expressions are generated since I think it's hard to find a good "how-to" walkthrough on these. The information available either seems so simple and fine-grained that it's hard to put together into something useful, or it's the other extreme: dense code that's hard to get to grips with if you don't already know much about expressions. But I feel that it would balloon this post too much - so maybe another day!

Incidentally, the DefaultUpdateWithHelper's static "GetGenerator" method inserts another layer into the call stack, which is why the UpdateWithHelper's method requires an (optional) "numberOfFramesFromCallSite" argument - so that it can be set to 2 in this case, rather than the default 1 (since it will need to step back through the DefaultUpdateWithHelper method before getting to the real "UpdateWith" method). This also means that DefaultUpdateWithHelper has the "MethodImplOptions.NoInlining" attribute on its "GetGenerator" method.

It's also worthy of note that the "GetGenerator" methods support extension methods for "UpdateWith" implementations, as opposed to requiring that they be instance methods. So the following is also acceptable

public static RoleDetails UpdateWith(
  this RoleDetails source,
  Optional<string> title = new Optional<string>(),
  Optional<DateTime> startDate = new Optional<DateTime>(),
  Optional<DateTime?> endDateIfAny = new Optional<DateTime?>())
{
  return DefaultUpdateWithHelper.GetGenerator<RoleDetails>()(source, title, startDate, endDateIfAny);
}

The analysis detects that the first argument is not an Optional<T>, asserts that its type is assignable to the type param T and then ignores it when generating the translation expression. The extension method will pass through the "source" reference where "this" was used in the instance method implementation shown earlier.

Further performance optimisations

Although the compiled "generator" expressions are cached, the cache key is based upon the "UpdateWith" method's metadata. This means that the cost of accessing the StackFrame is paid for every "UpdateWith" call, along with the reflection access to get the UpdateWith argument's metadata. If you feel that this might be an unbearable toll, a simple alternative is something like

private static UpdateWithSignature<RoleDetails> updater
  = DefaultUpdateWithHelper.GetGenerator<RoleDetails>(typeof(RoleDetails).GetMethod("UpdateWith"));
public RoleDetails UpdateWith(
  Optional<string> title = new Optional<string>(),
  Optional<DateTime> startDate = new Optional<DateTime>(),
  Optional<DateTime?> endDateIfAny = new Optional<DateTime?>())
{
  return updater(this, title, startDate, endDateIfAny);
}

The "GetGenerator" methods have alternate signatures that accept a MethodBase reference relating to the "UpdateWith" method, rather than relying upon StackFrame to retrieve it. And using a static "updater" reference means that "GetGenerator" is only ever called once, so subsequent calls that would require reflection in order to check for a cached expression are avoided entirely. The trade-off is that the method must be named in a string, which would break if the method was renamed. Not quite as convenient as relying upon stack-tracing magic.

If you really want to get crazy, you can go one step further. If part of the reason for this experiment was to reduce GC pressure, then surely the params array required by the UpdateWithSignature<T> is a step backwards from the less-automated method, where the number of update arguments is known at compile time? (Since that didn't require a params array for a variable number of arguments, there were no method calls where the precise number of update arguments was unknown). Well, that params array can be avoided if we make some more trade-offs. Firstly, we may only use an approach like the one above, which doesn't rely on expression caching (ie. use a static reference that requests a generator only once). Secondly, there may only be up to nine update arguments. The first restriction exists because the cache that the UpdateWithHelper uses records UpdateWithSignature<T> references, which are no good since they use the params array that we're trying to avoid. The second exists because a distinct delegate is required for each number of arguments, as is a distinct method to construct the generator - so there had to be a limit somewhere and I chose nine. The methods are

public UpdateWithSignature1<T> GetUncachedGenerator1<T>(MethodBase updateMethod)
public UpdateWithSignature2<T> GetUncachedGenerator2<T>(MethodBase updateMethod)
public UpdateWithSignature3<T> GetUncachedGenerator3<T>(MethodBase updateMethod)
// .. etc, up to 9

and the delegates are of the form

public delegate T UpdateWithSignature1<T>(T source, object arg0);
public delegate T UpdateWithSignature2<T>(T source, object arg0, object arg1);
public delegate T UpdateWithSignature3<T>(T source, object arg0, object arg1, object arg2);
// .. etc, up to 9

They may be used in a similar manner to that already shown, but you must be careful to match the number of arguments required by the "UpdateWith" method. In a way, there is actually a compile-time advantage here - if you choose the wrong one, then the compiler will warn you that you have specified three update arguments when the delegate requires four (for example). With the generic form (the non-numbered "GetGenerator" method), the params array means that you can specify any number of update arguments and you won't find out until runtime that you specified the wrong amount.

So, to illustrate -

private static UpdateWithSignature3<RoleDetails> updater
  = DefaultUpdateWithHelper.GetUncachedGenerator3<RoleDetails>(
    typeof(RoleDetails).GetMethod("UpdateWith"));

public RoleDetails UpdateWith(
  Optional<string> title = new Optional<string>(),
  Optional<DateTime> startDate = new Optional<DateTime>(),
  Optional<DateTime?> endDateIfAny = new Optional<DateTime?>())
{
  return updater(this, title, startDate, endDateIfAny);
}

If I'm being honest, however, if you really think that this optimisation is beneficial (by which I mean you've done performance analysis and found it to be a bottleneck worth addressing), you're probably better off replacing this automated approach with the hand-written code that I showed earlier. It's not all that long, it removes all of this "magic" and it gives the compiler more opportunity to pick up on mistakes. But perhaps most important (in terms of performance) is the fact that all update arguments are passed as "object" in these delegates. This means that any value types (ints, structs, etc..) will be boxed when they are passed around and then unboxed when used as constructor arguments. This is explained very clearly in the article 5 Basic Ways to Improve Performance in C# and more information about the use of the heap and stack can be found at Six important .NET concepts: Stack, heap, value types, reference types, boxing, and unboxing - I'd not seen this article before today but I thought it explained things really clearly.
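To make the boxing point concrete:

// The DateTime (a value type) is boxed onto the heap when passed around as an
// object reference..
object boxedStartDate = new DateTime(2014, 9, 21);

// ..and has to be unboxed again when it's used as a typed constructor argument
var startDate = (DateTime)boxedStartDate;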

Chances are that you won't have to worry about such low level details as whether values are being boxed-unboxed 99% of the time and I think there's a lot of mileage to be had from how convenient this automated approach is. But it's worth bearing in mind the "what ifs" of performance for the times when they do make a difference.

Any other downsides to the automagical way?

I can't claim to have this code in production anywhere yet. But I'm comfortable enough with it at this stage that I intend to start introducing it into prototype projects that it will be applicable to - and then look to using it in real-world, scary, production projects before too long! My only concern, really, is about making silly mistakes with typos in update argument names. If I mistyped "tittle" in the RoleDetails "UpdateWith" example I've been using, I wouldn't find out until runtime that I'd made the mistake - at which point, the "GetGenerator" call would throw an exception as it wouldn't be able to match "tittle" to any argument on any accessible constructor. I think the trade-off here would be that every "UpdateWith" method that used this library would need a unit test, so that discovering the problem at "runtime" doesn't mean "when I hit code in manual testing that triggers the exception" but rather "whenever the test suite is run - whether locally or when pushed to the build server". I doubt that Update methods of this type would normally get a unit test since they're so basic (maybe you disagree!) but in this case the convenience of using the automated "GetGenerator" method still wins, even with the (simple) unit test recommended for each one.
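Such a test needn't be anything elaborate - a minimal sketch (NUnit-style, though any test framework would do):

[Test]
public void UpdateWithArgumentNamesCanAllBeMatchedToAConstructor()
{
  var source = new RoleDetails("Tester", new DateTime(2014, 1, 1), null);

  // If an update argument name had a typo ("tittle", say) then this call would
  // throw when "GetGenerator" fails to match it to a constructor argument - so
  // the mistake surfaces whenever the test suite runs
  var updated = source.UpdateWith(title: "Senior Tester");

  Assert.AreEqual("Senior Tester", updated.Title);
}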

Now that I think about it, this is not a dissimilar situation to using a Dependency Injection framework or using AutoMapper in your code - there is a lot of convenience to be had, but at the risk that configuration errors are not exposed until the code is executed.

In summary, until I find a good reason not to use this library going forward, I intend to do so! To revisit my (F#) inspiration, how can it not be enticing to be able to write

// F#
let p2 = {p1 with first="Jim";last="Smith"}

// C#
var p2 = p1.UpdateWith(first:"Jim",last:"Smith");

with so little code having to be written to enable it?!

Go get the code at bitbucket.org/DanRoberts/updatewith!

Or alternatively, pull the NuGet package straight down from nuget.org/packages/CSharpImmutableUpdateWith.

Update (19th September 2014): There's been quite a lot of interest in this post and some good comments made here and at the discussion on Reddit/implementing-f-sharp-inspired-with-updates-for-immutable-classes-in-c-sharp. I intend to write a follow-up post that talks about some of the observations and includes some performance stats. In summary, though, I may have to admit to considering a slight about-turn in the crazy magical approach and chalk that up as a convenience for rattling out code quickly but probably something that won't make it into production code that I write. The idea of using an "UpdateWith" method with named, optional arguments (using the Optional struct) will make it into my "real world" code, though! It's also strikingly similar to some of the code in Roslyn, it was pointed out (I'll touch on this in the follow-up too). I still had a lot of fun with the "Turning it up to eleven" investigation and I think there's useful information in here and in the library code I wrote - even more so when I get round to documenting how I approach writing the LINQ expression-generating code. But maybe it didn't result in something that should always be everyone's immediate go-to method for writing this sort of code. Such is life! :)

Update (2nd October 2014): See A follow-up to "Implementing F#-inspired 'with' updates in C#".

Posted at 23:12

Comments

Parsing CSS

A few months ago I wrote about some extensions to the CSS Minifier to support pseudo-Source-Mapping for compiled and minified content (among other things) and I've been meaning to write about the code I used to analyse the style sheet content.

History Lesson

A long time ago I wrote some code to parse javascript to remove comments and minify the content. This was before there was a proliferation of excellent plugins and suchlike to do it all for you - I think the YUI Compressor might have been around but, since it required java to be installed where it would be used, we couldn't use it to compile scripts on-the-fly.

The first pass through the content would break it down into strings representing javascript code, javascript strings and comments. Strings can be quoted with either single or double quotes; single-quote-wrapped strings could contain double quotes without escaping them (and vice versa) and either string format could contain its own quote character so long as it was escaped. Comments could be multi-line if they were wrapped in /* and */ or single line if they started with // (terminating with a line return or end-of-file). So similar to CSS in a lot of ways! (Particularly if you consider parsing LESS, which supports single line comments, unlike regular CSS).

I wrote it in a fairly naive manner, trying to handle each case at a time, building up a primary loop which went through each character, deciding what to do with it based upon what character it was and whether the current content was a string (and what would indicate the end of the string), a comment (and what would terminate that) or javascript code. There were various variables to keep track of these items of state. It did the job but I was keen not to repeat the same approach when writing this "CSS Parser" I wanted.

Employing Immutability (what a surprise!)

Keeping track of the changing state in this way meant that at any point in time there was a lot to hold in my head while I was trying to understand what was going on if something appeared to be misbehaving, and it made each change to add new functionality increasingly difficult. But reducing the number of places where state changes is a large part of the immutability obsession I've got going on, so I figured there must be a better way.

The idea was to start with two interfaces

public interface IProcessCharacters
{
  CharacterProcessorResult Process(IWalkThroughStrings stringNavigator);
}

public interface IWalkThroughStrings
{
  char? CurrentCharacter { get; }
  IWalkThroughStrings Next { get; }
}

with corresponding class and enum

public class CharacterProcessorResult
{
  public CharacterProcessorResult(
    CharacterCategorisationOptions characterCategorisation,
    IProcessCharacters nextProcessor)
  {
    if (!Enum.IsDefined(typeof(CharacterCategorisationOptions), characterCategorisation))
      throw new ArgumentOutOfRangeException("characterCategorisation");
    if (nextProcessor == null)
      throw new ArgumentNullException("nextProcessor");

    CharacterCategorisation = characterCategorisation;
    NextProcessor = nextProcessor;
  }

  public CharacterCategorisationOptions CharacterCategorisation { get; private set; }
  public IProcessCharacters NextProcessor { get; private set; }
}

public enum CharacterCategorisationOptions
{
  Comment,
  CloseBrace,
  OpenBrace,
  SemiColon,
  SelectorOrStyleProperty,
  StylePropertyColon,
  Value,
  Whitespace
}

such that a given string can be traversed character-by-character with a processor returning the type of that character and providing a processor appropriate to the next character.
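To make that concrete, a simple string-backed implementation of the two IWalkThroughStrings members shown above might look like this (a sketch, not necessarily the library's actual class):

public class StringNavigator : IWalkThroughStrings
{
  private readonly string _value;
  private readonly int _index;

  public StringNavigator(string value) : this(value, 0) { }

  private StringNavigator(string value, int index)
  {
    if (value == null)
      throw new ArgumentNullException("value");
    _value = value;
    _index = index;
  }

  /// <summary>
  /// This will be null when the end of the content has been reached
  /// </summary>
  public char? CurrentCharacter
  {
    get { return (_index < _value.Length) ? _value[_index] : (char?)null; }
  }

  /// <summary>
  /// Each instance is immutable; moving forward means getting a new navigator
  /// </summary>
  public IWalkThroughStrings Next
  {
    get { return new StringNavigator(_value, _index + 1); }
  }
}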

The clever part being each processor will have very tightly-scoped behaviour and responsibility. For example, if a string is encountered that starts with double quotes then a processor whose entire job is string-handling would be used. This processor would know what quote character would terminate the string and that processing should go back to the previous processor when the string has terminated. All characters encountered within the string would be identified as the same type (generally this will be of type Value since strings are most commonly used in style properties - eg. a url string as part of a background property - so if a semi-colon is encountered it would be identified as type Value despite a semi-colon having more significant meaning when not part of a string value). Handling escape characters becomes very simple if a skip-characters processor is used; when a backslash is encountered, the quoted-section processor hands off to a processor that returns a fixed type for the next character and then returns control back to the quoted-section processor. This means that the quoted-section processor doesn't need to maintain any state such as even-if-the-next-character-is-the-terminating-quote-character-do-not-terminate-the-string-yet-as-it-is-being-escaped.
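That skip-characters processor is about as small as a processor can be - something along these lines (a sketch, assuming the interfaces above):

public class SkipCharactersSegment : IProcessCharacters
{
  private readonly CharacterCategorisationOptions _characterType;
  private readonly IProcessCharacters _returnTo;

  public SkipCharactersSegment(
    CharacterCategorisationOptions characterType,
    IProcessCharacters returnTo)
  {
    if (returnTo == null)
      throw new ArgumentNullException("returnTo");
    _characterType = characterType;
    _returnTo = returnTo;
  }

  public CharacterProcessorResult Process(IWalkThroughStrings stringNavigator)
  {
    // Whatever the current character is (an escaped quote character, for
    // example), categorise it as the fixed type and hand control straight back
    return new CharacterProcessorResult(_characterType, _returnTo);
  }
}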

Comment sections can be handled in a very similar manner, with different processors for multiline comments than single line since the termination manners are different (and this helps keep things really easy).

There is a primary processor which is a bit meatier than I'd like (but still only 320-odd commented lines) that looks out for the start of strings or comments and hands off processing appropriately, but also identifies single significant characters such as opening or closing braces, colons (usually indicating a separator between a style property name and its value but sometimes a pseudo-class indicator - eg. in "a:hover") and semi-colons.

Parsing is made more challenging as I wanted to support LESS which allows for nesting of rules whereas the only nesting that regular CSS supports is selectors within media queries. CSS 2.1 only allows for a single media query to wrap a selector while CSS 3 may support nesting media rules - see this answer on Stack Overflow: Nesting @media rules in CSS.

As a bit of a cop-out, I don't differentiate between a selector and a property name in the CharacterCategorisationOptions enum, they are both rolled into the value SelectorOrStyleProperty (similarly, media query content is classified as a SelectorOrStyleProperty). While this feels lazy on the one hand, on the other I wanted to make this pass through the content as cheap and clean as possible and accurately determining whether a given character is a selector or a property name could involve significant reading back and forth through the content to find out for sure.

This way, not only is the implementation easier to follow but it enables the main loop to parse only as much content as required to enumerate as far through the content as the caller requires.

To explain what I mean, I need to introduce the class that wraps IProcessCharacters and IWalkThroughStrings -

public interface ICollectStringsOfProcessedCharacters
{
  IEnumerable<CategorisedCharacterString> GetStrings(
    IWalkThroughStrings contentWalker,
    IProcessCharacters contentProcessor
  );
}

and its return type..

public class CategorisedCharacterString
{
  public CategorisedCharacterString(
    string value,
    int indexInSource,
    CharacterCategorisationOptions characterCategorisation)
  {
    if (string.IsNullOrEmpty(value))
      throw new ArgumentException("Null/blank value specified");
    if (indexInSource < 0)
      throw new ArgumentOutOfRangeException("indexInSource", "must be zero or greater");
    if (!Enum.IsDefined(typeof(CharacterCategorisationOptions), characterCategorisation))
      throw new ArgumentOutOfRangeException("characterCategorisation");

    Value = value;
    IndexInSource = indexInSource;
    CharacterCategorisation = characterCategorisation;
  }

  public string Value { get; private set; }

  public int IndexInSource { get; private set; }

  public CharacterCategorisationOptions CharacterCategorisation { get; private set; }
}

The default ICollectStringsOfProcessedCharacters implementation will traverse through the IWalkThroughStrings content and group together characters of the same CharacterCategorisationOptions into a single CategorisedCharacterString, using yield return to return the values.
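The core of that default implementation might look something like this (again, a sketch rather than the library's exact code):

public IEnumerable<CategorisedCharacterString> GetStrings(
  IWalkThroughStrings contentWalker,
  IProcessCharacters contentProcessor)
{
  var buffer = new StringBuilder();
  var bufferStartIndex = 0;
  CharacterCategorisationOptions? bufferType = null;
  var index = 0;
  while (contentWalker.CurrentCharacter != null)
  {
    var result = contentProcessor.Process(contentWalker);
    if ((bufferType != null) && (bufferType.Value != result.CharacterCategorisation))
    {
      // The categorisation has changed, so the characters gathered so far form
      // a complete string that can be handed back to the caller
      yield return new CategorisedCharacterString(
        buffer.ToString(), bufferStartIndex, bufferType.Value);
      buffer.Clear();
    }
    if ((bufferType == null) || (bufferType.Value != result.CharacterCategorisation))
    {
      bufferType = result.CharacterCategorisation;
      bufferStartIndex = index;
    }
    buffer.Append(contentWalker.CurrentCharacter.Value);
    contentWalker = contentWalker.Next;
    contentProcessor = result.NextProcessor;
    index++;
  }
  if (bufferType != null)
    yield return new CategorisedCharacterString(
      buffer.ToString(), bufferStartIndex, bufferType.Value);
}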

This means that

/* Test */ .Content { color: black; }

would return content identified as

"/* Test */"     Comment
" "              Whitespace
".Content"       SelectorOrStyleProperty
" "              Whitespace
"{"              OpenBrace
" "              Whitespace
"color"          SelectorOrStyleProperty
":"              StylePropertyColon
" "              Whitespace
"black"          Value
";"              SemiColon
" "              Whitespace
"}"              CloseBrace

But if the enumeration of the data returned from the GetStrings method stopped after the ".Content" string was returned then no more parsing of the CSS would be carried out. If accurate differentiation of selectors, media queries and style property names was required at this point then a lot more parsing may be required to ensure that that string (".Content") was indeed a selector.

Another benefit arises if a large amount of content is to be parsed; an IWalkThroughStrings implementation that wraps a TextReader may be used so the content could be loaded from disk in chunks and as much or as little parsed as desired, using relatively few resources.

No Read-ahead at all??

Having just jabbered on about how amazing it is that this SelectorOrStyleProperty categorisation requires absolutely zero reading ahead in order to categorise any given character (so long as all of the preceding characters have been parsed), there are a couple of exceptions to this rule:

  1. When a @media rule is encountered, all of the following content needs to be considered to be either Whitespace or SelectorOrStyleProperty until the opening brace for the rule is encountered, since the rule may contain otherwise-significant characters such as colon (eg. the ":" in "@media (min-width:500px)" is part of the media query and does not signify the separator symbol between a property name and a property value), so when a "@" is encountered, the next characters are read to determine whether it's a media rule or not
  2. A colon in a pseudo class should also not be identified as a StylePropertyColon; it should be considered part of the SelectorOrStyleProperty. So if a colon is encountered while processing what is thought to be a selector then some reading ahead is done to try to determine whether the content indicates that it is indeed a pseudo-class selector and not a separator between a property name and its value

To make this easier, the IWalkThroughStrings interface has an additional method

/// <summary>
/// This will try to extract a string of length requiredNumberOfCharacters from the current
/// position in the string navigator. If there are insufficient characters available, then
/// a string containing all of the remaining characters will be returned. This will be an
/// empty string if there is no more content to deliver. This will never return null.
/// </summary>
string TryToGetCharacterString(int requiredNumberOfCharacters);

I contemplated making this an extension method since the data can always be retrieved using the CurrentCharacter and Next properties, but depending upon the implementation there may be more efficient ways to retrieve the data and so it became an interface method.
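The extension-method version would have looked something like this (shown as a sketch; an interface method allows an implementation to do something smarter, such as a string-backed navigator using Substring):

public static string TryToGetCharacterString(
  this IWalkThroughStrings stringNavigator,
  int requiredNumberOfCharacters)
{
  // Gather characters until the requested length is reached or the content
  // runs out - matching the contract described in the summary comment above
  var content = new StringBuilder();
  while ((content.Length < requiredNumberOfCharacters)
  && (stringNavigator.CurrentCharacter != null))
  {
    content.Append(stringNavigator.CurrentCharacter.Value);
    stringNavigator = stringNavigator.Next;
  }
  return content.ToString();
}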

An original idea?

I'm really happy with the way this approach to the problem has influenced the final design. There were a few issues that I hadn't foreseen when I started (the complications with pseudo classes giving different meaning to the colon character, for example, as outlined above, had somehow slipped my mind entirely when I got going) but extending it to cover these cases wasn't particularly difficult as keeping all of the complicated bits as segregated as possible made it easy to reason about where changes needed to be made and whether they could have any unpleasant side effects.

I don't think I can take credit for the originality of the idea, though. The overarching plan is to have a processor instance which is poised to start processing content; at this point it has produced no results and is in an uninitialised state. This is the first IProcessCharacters instance. When its Process method is called, the first character from the IWalkThroughStrings is taken and a CharacterProcessorResult returned which identifies the type of that first character and specifies an IProcessCharacters instance to process the next character. That character triggered the change in state. The next call to Process might return a result with a different type of IProcessCharacters and/or a different CharacterCategorisationOptions.

The point is that for any current state, there are a finite number of states that can be moved to next (since there are a limited number of CharacterCategorisationOptions values and IProcessCharacters implementations) and a finite number of triggers for each change in state (since there are only so many possible characters, even if we do consider the huge extended alphabets available). This puts me in mind of a Finite State Machine which is a well-documented concept.. the article on Wikipedia is thorough and there's another article on learn you some Erlang for great good! which I haven't read all of, but I've heard good things about that site so intend to read that article properly before hopefully reading and following more of the tutorials on there.

Overview of processors

Just to emphasise how this approach made things easier and spread much of the logic across self-contained components, I'll spin through the processors which loop through the content, passing control back and forth as appropriate.

The first is always the SelectorOrStylePropertySegment, which is actually the one that has to deal with the most different circumstances. By default it will identify each character as being of type SelectorOrStyleProperty unless it encounters any one-offs like an OpenBrace or a SemiColon or anything that constitutes Whitespace. If it encounters the ":" character then it has to do a little reading ahead to try to determine whether that indicates a delimiter between a Style Property Name and the Property Value or whether it's part of a pseudo class (eg. ":hover"). If it's a delimiter then it hands off to the StyleValueSegment class which walks through content, marking it as either type Value or Whitespace until it hits a ";" and returns control back to the SelectorOrStylePropertySegment.

If the StyleValueSegment encounters a quote character then it hands off control to a QuotedSegment instance which walks through the content marking it as type Value until it encounters the closing quote and returns control back to where it came from. The QuotedSegment has a constructor argument for the termination character (the closing quote) so doesn't have to do anything complicated other than wait for that character to show up!

The SelectorOrStylePropertySegment does something similar to handing off to the StyleValueSegment when it encounters an opening square bracket as that indicates the start of an attribute selector (eg. "a[href]") - control is given to a BracketedSelectorSegment which identifies all content as being type SelectorOrStyleProperty until the closing "]" character is encountered.

All three of SelectorOrStylePropertySegment, StyleValueSegment and BracketedSelectorSegment have to make exceptions for comments. When a "/" is encountered, they will look ahead to see if the next character is either "/" or "*" and hand off to a SingleLineCommentSegment or MultiLineCommentSegment, respectively. The first simply has to mark everything as Comment content until passing back control when a line break is encountered. The second marks content as Comment until it encounters a "*" that is followed by a "/". When this "*" is encountered it hands off to a SkipCharactersSegment which marks the next character as Comment as well and then hands back to whatever handed control to the MultiLineCommentSegment. Only a single character can be identified at once, hence the use of the SkipCharactersSegment, but this is only a small hoop to jump through. These three classes are very minor specialisations of a shared base class, so this logic is shared.

The QuotedSegment doesn't inherit from the same since all content should be identified as being of a particular type, comment-like content within a quoted string does not constitute an actual comment. The QuotedSegment class takes a constructor argument to indicate the type of content that it will be representing since a quoted section while processing Value content should be identified as type Value while a quoted section in SelectorOrStyleProperty content (eg. in "input[type='text']") should also be identified as type SelectorOrStyleProperty.
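Pulling those descriptions together, a QuotedSegment could be as simple as this (a sketch that reuses the SkipCharactersSegment sketch from earlier):

public class QuotedSegment : IProcessCharacters
{
  private readonly char _quoteCharacter;
  private readonly CharacterCategorisationOptions _characterType;
  private readonly IProcessCharacters _returnTo;

  public QuotedSegment(
    char quoteCharacter,
    CharacterCategorisationOptions characterType,
    IProcessCharacters returnTo)
  {
    if (returnTo == null)
      throw new ArgumentNullException("returnTo");
    _quoteCharacter = quoteCharacter;
    _characterType = characterType;
    _returnTo = returnTo;
  }

  public CharacterProcessorResult Process(IWalkThroughStrings stringNavigator)
  {
    // An escape character means the next character must not be allowed to
    // terminate the string, so hand off to a skip-one-character processor
    if (stringNavigator.CurrentCharacter == '\\')
      return new CharacterProcessorResult(
        _characterType, new SkipCharactersSegment(_characterType, this));

    // The closing quote is still part of the quoted content but the character
    // after it should be dealt with by whatever handed control to this instance
    if (stringNavigator.CurrentCharacter == _quoteCharacter)
      return new CharacterProcessorResult(_characterType, _returnTo);

    return new CharacterProcessorResult(_characterType, this);
  }
}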

So essentially it all boils down to is-the-current-processor-ok-for-this-character? If yes, then continue to use it. If a condition is encountered where the processor should change (either handing control to a new processor or handing control back to a previous processor) then do that and let it continue.

When I started writing it, I somehow forgot all about attribute selectors (there's a fair argument that more planning might have been beneficial but I wanted to do it as an exercise in jumping in with this approach and then hoping that the entire design would lend itself well to "changing requirements" - aka. me overlooking things!). If this had been processed in some contorted single loop full of complicated interacting conditions - like that javascript parser of my past - then adding that extra set of conditions would have filled me with dread. With this approach, it was no big deal.

The Processor Factory

There was only one thing that struck me with the idea of all of these processor instances being created left, right and centre; that there could be a lot of churn. If a lot of content was being processed then there could be thousands of MultiLineCommentSegment instances created, for instance, when they're nearly all there to perform the same task - record comment content and pass back to the primary SelectorOrStylePropertySegment processor. If these instances could be shared then the churn could be reduced. And since each processor is immutable there is no state to worry about, so they are inherently shareable.

To achieve this, an IGenerateCharacterProcessors is passed as a constructor argument to classes that need to instantiate other processors. The simplest implementation of this is to spin up a new instance of the requested processor type, passing the provided constructor arguments. This is what the CharacterProcessorsFactory class does. But the CachingCharacterProcessorsFactory class will wrap this and keep a record of everything it's instantiated and return a previous reference if it has the same type and constructor arguments as the request specifies. This enables the reuse that I had in mind.

I will admit that there is a slight air of premature optimisation around this - worrying about churn with no evidence that it's a problem - but I intend for these processors to be used on substantially-sized chunks of CSS / LESS. And since the IWalkThroughStrings interface allows for a class to be written backed onto a TextReader (as described earlier), so that only the minimum content need be held in memory at any one time, this extra work to reuse processor instances seems to make sense.

Deeper Analysis

Ok, that explanation of how simple everything was ended up longer and quite possibly more detailed than I'd originally expected but there's one more thing I want to address!

All of the code described above really only allows for quite a simplistic representation of the data. But it paves the way for more complicated processing.

What I really needed was a way to analyse the structure of LESS content - this is all looping back to the idea of "linting" stylesheets to see if they adhere to the rules in the Non-cascading CSS Post. A simple example is being able to determine whether all content in a stylesheet (that has been identified as not being one of the Resets or Themes sheets) should have the content wrapped in a html tag which limits the scope of any declared mixins or values.

A naive approach would be to trim the raw string content and see if it starts with "html {" or some variation with whitespace, hoping that there is no comment content that needs to be ignored. A better way is to use the CSS Processor as-is and skip over any leading comment and whitespace content, looking for a html tag at the start of the content. However, more work would have to be done to ensure that that html tag isn't closed and then followed with more content which may or may not be wrapped in a scope-restricting html tag.

To deal with cases like this which require "deep analysis", the "ExtendedLESSParser" project has a class, the LessCssHierarchicalParser, which takes the output from the CSSParser (a CategorisedCharacterString set) and transforms it into hierarchical data describing selectors, media queries, import statements, style property names and style property values. Selectors and media queries are containers that have child "fragments" (these could be style properties or they could be nested selectors). All whitespace and comments are removed and just a representation of the raw style data remains.

// Example
html
{
  h1
  {
    color: black;
    background: white url("background.jpg") no-repeat top left;
  }
  p.Intro { padding: 8px; }
}

becomes something like

html
  h1
    color
      black
    background
      white
      url("background.jpg")
      no-repeat
      top
      left
  p.Intro
    padding
      8px

(Above: "html" represent a Selector instance with a ChildFragments property containing Selector instances for the "h1" and "p", each with ChildFragments data made up of StylePropertyValue and StylePropertyValue instances. These classes implement ICSSFragment as do the Import and MediaQuery, which aren't present in the example here).

To ensure that content is wrapped in scope-restricting html tags, what must be done is that the output from the LessCssHierarchicalParser (a set of ICSSFragment implementations) must be considered and it be asserted that they are either Import instances or Selector instances whose Selectors property indicates that the selector in the source content was only "html". An implementation can be found in my NonCascadingCSSRulesEnforcer project on Bitbucket, specifically the file HtmlTagScopingMustBeAppliedToNonResetsOrThemesSheets.cs.

Unfortunately, since this level of analysis requires that the entire content be considered before the structure can be described, this is not as lightweight a process as the CSSProcessor's parsing. However, it is much more powerful in enabling you to drill down into the structure of a stylesheet. The NonCascadingCSSRulesEnforcer has code to enforce nearly all of the rules in my original Non-cascading CSS Post, along with an ITextFileLoader implementation which allows the rules validation to be integrated with my CSSMinifier project which I've been using to rebuild a real site (not just my blog) with these rules. It's been going really well and I intend to put up a concluding post to this "Non-cascading CSS" mini-series with any final insights and any evidence I can present for and against trying to apply them to all builds I'm involved with in the future.

Posted at 21:50

Comments

The Full Text Indexer Post Round-up

This is a compilation of links to articles outlining some of the details of the Full Text Indexer project I put together, just so that I could point a link at everything, all in one place (like from the BitBucket ReadMe!)

I wrote about the basic building blocks of the Index Generator, went off on a few tangents about how using different key types could allow for searches over data with multi-lingual content (or support Product data that has different descriptions for different web sites, for example) and then came back round to illustrate how I've used the code for this blog's search functionality.

Along the journey, I got to learn a few new things, take advantage of others' research and have fun trying to improve the performance of some of the bottlenecks in the index generation process.

I also had a chance to revisit the basic immutable list structure that I used from the get-go in this project and improve its performance characteristics as well (again, taking a lot of inspiration from cleverer people who've tackled the same problems before me! :)

The code can be found in the Full Text Indexer BitBucket Repository. I've still got a few ideas I'm contemplating toying with - but I've also got other projects I want to investigate! So we'll just have to see what happens with this next..

Update (5th March 2013): I just can't seem to let this lie! :) I've added another post The Full Text Indexer - Automating Index Generation which demonstrates some new code that will examine your source data type and generate an index for you, all on its own! Easy! (Added to the list above).

Update (14th March 2013): And another! This time about support for structured querying, a way to combine terms with AND, OR, NOT operators. See The Full Text Indexer - Structured Queries. (Added to the list above).

Update (28th March 2013): Documenting an extension to the index data that allows for more performant consecutive term matching: The Full Text Indexer: Source Locations. Followed by a way to utilise this information for Search Term Highlighting with Source Locations. (Added to the list above).

Update (25th July 2013): Inspired by the "The 10 Megabyte Manifesto" and NeoCities, I've developed a way to consume search index data with JavaScript to enable a copy of this blog to be hosted where the searching is done entirely client-side. Read about it at The Full Text Indexer goes client-side! and see it in action live at productiverage.neocities.org! (Added to the list above).

Update (30th July 2013): A follow-up to the "The Full Text Indexer goes client-side" describing how the search index data can be compressed to take up less space on the host: JavaScript Compression (Putting my JSON Search Indexes on a diet). (Added to the list above).

Posted at 18:06

Comments

Persistent Immutable Lists - Extended

In my last post (Persistent Immutable Lists) I offered some code as an alternative (and more performant) way to write an immutable list to the one I suggested right back in my first post (I love Immutable Data). Well, now I'd like to present a minor follow-up to the follow-up! I've incorporated the new list implementation into a few projects and have filled out a few more methods, such as a "Remove" method (to remove a particular value, rather than removing from a given index with "RemoveAt" or "RemoveRange"), alternate "Sort" signatures and a "To" method that allows for derived types to be written that return the derived type from the manipulation methods (see examples below):

[Serializable]
public class ImmutableList<T> : IEnumerable<T>
{
    private readonly Node _tail;
    private readonly IValueValidator<T> _optionalValueValidator;
    private T[] _allValues;

    public ImmutableList() : this((IValueValidator<T>)null) { }
    public ImmutableList(IEnumerable<T> values) : this(values, null) { }
    public ImmutableList(IValueValidator<T> optionalValueValidator)
        : this((Node)null, optionalValueValidator) { }
    public ImmutableList(IEnumerable<T> values, IValueValidator<T> optionalValueValidator)
    {
        if (values == null)
            throw new ArgumentNullException("values");

        Node node = null;
        foreach (var value in values)
        {
            if (optionalValueValidator != null)
                optionalValueValidator.EnsureValid(value);
            if (node == null)
                node = new Node(value, null);
            else
                node = new Node(value, node);
        }
        _tail = node;
        _optionalValueValidator = optionalValueValidator;
        _allValues = null;
    }
    protected ImmutableList(Node tail, IValueValidator<T> optionalValueValidator)
    {
        _tail = tail;
        _optionalValueValidator = optionalValueValidator;
        _allValues = null;
    }

    public T this[int index]
    {
        get
        {
            if ((index < 0) || (index >= Count))
                throw new ArgumentOutOfRangeException("index");

            EnsureAllValuesDataIsPopulated();
            return _allValues[index];
        }
    }

    public int Count
    {
        get { return (_tail == null) ? 0 : _tail.Count; }
    }

    public bool Contains(T value)
    {
        return Contains(value, null);
    }

    public bool Contains(T value, IEqualityComparer<T> optionalComparer)
    {
        if (_tail == null)
            return false;

        EnsureAllValuesDataIsPopulated();
        for (var index = 0; index < _allValues.Length; index++)
        {
            if (DoValuesMatch(_allValues[index], value, optionalComparer))
                return true;
        }
        return false;
    }

    public ImmutableList<T> Add(T value)
    {
        // Add is easy; since we keep a reference to the tail node, we only need to
        // wrap it in a new node to create a new tail!
        if (_optionalValueValidator != null)
            _optionalValueValidator.EnsureValid(value);
        return new ImmutableList<T>(
            new Node(value, _tail),
            _optionalValueValidator
        );
    }

    public ImmutableList<T> AddRange(IEnumerable<T> values)
    {
        if (values == null)
            throw new ArgumentNullException("values");
        if (!values.Any())
            return this;

        // AddRange is easy for the same reason as Add
        var node = _tail;
        foreach (var value in values)
        {
            if (_optionalValueValidator != null)
                _optionalValueValidator.EnsureValid(value);
            node = new Node(value, node);
        }
        return new ImmutableList<T>(node, _optionalValueValidator);
    }

    public ImmutableList<T> Insert(IEnumerable<T> values, int insertAtIndex)
    {
        if (values == null)
            throw new ArgumentNullException("values");

        return Insert(values, default(T), insertAtIndex);
    }

    public ImmutableList<T> Insert(T value, int insertAtIndex)
    {
        return Insert(null, value, insertAtIndex);
    }

    private ImmutableList<T> Insert(
        IEnumerable<T> multipleValuesToAdd,
        T singleValueToAdd,
        int insertAtIndex)
    {
        if ((insertAtIndex < 0) || (insertAtIndex > Count))
            throw new ArgumentOutOfRangeException("insertAtIndex");
        if ((multipleValuesToAdd != null) && !multipleValuesToAdd.Any())
            return this;

        // If the insertion is at the end of the list then we can use Add or AddRange which
        // may allow some optimisation
        if (insertAtIndex == Count)
        {
            if (multipleValuesToAdd == null)
                return Add(singleValueToAdd);
            return AddRange(multipleValuesToAdd);
        }

        // Starting with the tail, walk back to the insertion point, record the values we
        // pass over
        var node = _tail;
        var valuesBeforeInsertionPoint = new T[Count - insertAtIndex];
        for (var index = 0; index < valuesBeforeInsertionPoint.Length; index++)
        {
            valuesBeforeInsertionPoint[index] = node.Value;
            node = node.Previous;
        }

        // Any existing node chain before the insertion point can be persisted and the new
        // value(s) appended
        if (multipleValuesToAdd == null)
        {
            if (_optionalValueValidator != null)
                _optionalValueValidator.EnsureValid(singleValueToAdd);
            node = new Node(singleValueToAdd, node);
        }
        else
        {
            foreach (var valueToAdd in multipleValuesToAdd)
            {
                if (_optionalValueValidator != null)
                    _optionalValueValidator.EnsureValid(valueToAdd);
                node = new Node(valueToAdd, node);
            }
        }

        // Finally, add back the values we walked through before to complete the chain
        for (var index = valuesBeforeInsertionPoint.Length - 1; index >= 0; index--)
            node = new Node(valuesBeforeInsertionPoint[index], node);
        return new ImmutableList<T>(node, _optionalValueValidator);
    }

    /// <summary>
    /// Removes the first occurrence of a specific object from the list; if the item is
    /// not present then this instance will be returned
    /// </summary>
    public ImmutableList<T> Remove(T value)
    {
        return Remove(value, null);
    }

    /// <summary>
    /// Removes the first occurrence of a specific object from the list; if the item is
    /// not present then this instance will be returned
    /// </summary>
    public ImmutableList<T> Remove(T value, IEqualityComparer<T> optionalComparer)
    {
        // If there are no items in the list then the specified value can't be present,
        // so do nothing
        if (_tail == null)
            return this;

        // Try to find the last node that matches the value when walking backwards from
        // the tail; this will be the first in the list when considered from start to end
        var node = _tail;
        Node lastNodeThatMatched = null;
        int? lastNodeIndexThatMatched = null;
        var valuesBeforeRemoval = new T[Count];
        for (var index = 0; index < Count; index++)
        {
            if (DoValuesMatch(value, node.Value, optionalComparer))
            {
                lastNodeThatMatched = node;
                lastNodeIndexThatMatched = index;
            }
            valuesBeforeRemoval[index] = node.Value;
            node = node.Previous;
        }
        if (lastNodeThatMatched == null)
            return this;

        // Now build a new chain by taking the content before the value-to-remove and
        // adding back the values that were stepped through
        node = lastNodeThatMatched.Previous;
        for (var index = lastNodeIndexThatMatched.Value - 1; index >= 0; index--)
            node = new Node(valuesBeforeRemoval[index], node);
        return new ImmutableList<T>(node, _optionalValueValidator);
    }

    private bool DoValuesMatch(T x, T y, IEqualityComparer<T> optionalComparer)
    {
        if (optionalComparer != null)
            return optionalComparer.Equals(x, y);

        if ((x == null) && (y == null))
            return true;
        else if ((x == null) || (y == null))
            return false;
        else
            return x.Equals(y);
    }

    public ImmutableList<T> RemoveAt(int removeAtIndex)
    {
        return RemoveRange(removeAtIndex, 1);
    }

    public ImmutableList<T> RemoveRange(int removeAtIndex, int count)
    {
        if (removeAtIndex < 0)
            throw new ArgumentOutOfRangeException(
                "removeAtIndex",
                "must be greater than or equal zero"
            );
        if (count <= 0)
            throw new ArgumentOutOfRangeException("count", "must be greater than zero");
        if ((removeAtIndex + count) > Count)
            throw new ArgumentException("removeAtIndex + count must not exceed Count");

        // Starting with the tail, walk back to the end of the removal range, recording
        // the values we pass over
        var node = _tail;
        var valuesBeforeRemovalRange = new T[Count - (removeAtIndex + count)];
        for (var index = 0; index < valuesBeforeRemovalRange.Length; index++)
        {
            valuesBeforeRemovalRange[index] = node.Value;
            node = node.Previous;
        }

        // Move past the values in the removal range
        for (var index = 0; index < count; index++)
            node = node.Previous;

        // Now add back the values we walked through above to the part of the chain that
        // can be persisted
        for (var index = valuesBeforeRemovalRange.Length - 1; index >= 0; index--)
            node = new Node(valuesBeforeRemovalRange[index], node);
        return new ImmutableList<T>(node, _optionalValueValidator);
    }

    public ImmutableList<T> Sort()
    {
        return Sort((IComparer<T>)null);
    }

    public ImmutableList<T> Sort(Comparison<T> optionalComparison)
    {
        if (optionalComparison == null)
            return Sort((IComparer<T>)null);
        return Sort(new SortComparisonWrapper(optionalComparison));
    }

    public ImmutableList<T> Sort(IComparer<T> optionalComparer)
    {
        EnsureAllValuesDataIsPopulated();
        return new ImmutableList<T>(
            (optionalComparer == null)
                ? _allValues.OrderBy(v => v)
                : _allValues.OrderBy(v => v, optionalComparer),
            _optionalValueValidator
        );
    }

    public IEnumerator<T> GetEnumerator()
    {
        // As documented at http://msdn.microsoft.com/en-us/library/system.array.aspx,
        // from .Net 2.0 onward, the Array class implements IEnumerable<T> but this is
        // only provided at runtime so we have to explicitly cast the array in order
        // to access its generic GetEnumerator method
        EnsureAllValuesDataIsPopulated();
        return ((IEnumerable<T>)_allValues).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }

    /// <summary>
    /// So that a derived class may override the public methods with implementations that
    /// return the derived type, this method exposes a manner in which to access the _tail
    /// reference of a returned ImmutableList instance without having to make both it and
    /// the Node class public - eg. a derived class NonNullOrEmptyStringList may incorporate
    /// its own hard-coded validation and wish to have a NonNullOrEmptyStringList instance
    /// returned from its Add method. If it calls the ImmutableList's Add method it will
    /// receive a new ImmutableList instance which can be transformed into an instance of
    /// NonNullOrEmptyStringList, if it has a constructor which takes a Node argument, by
    /// passing a lambda wrapping a call to that constructor into this method, along with
    /// the new ImmutableList reference that is to be wrapped. This does introduce the
    /// overhead of an additional initialisation (of the NonNullOrEmptyStringList) but it
    /// allows for more strictly-typed return values from the NonNullOrEmptyStringList's
    /// methods.
    /// </summary>
    protected static U To<U>(ImmutableList<T> list, Func<Node, U> generator)
    {
        if (list == null)
            throw new ArgumentNullException("list");
        if (generator == null)
            throw new ArgumentNullException("generator");

        return generator(list._tail);
    }

    /// <summary>
    /// For enumerating the values we need to walk through all of the nodes and then reverse
    /// the set (since we start with the tail and work backwards). This can be relatively
    /// expensive so the list is cached in the "_allValues" member array so that subsequent
    /// requests are fast (wouldn't be a big deal for a single enumeration of the contents
    /// but it could be for multiple calls to the indexed property).
    /// </summary>
    private void EnsureAllValuesDataIsPopulated()
    {
        if (_allValues != null)
            return;

        // Since we start at the tail and work backwards, we need to reverse
        // the order of the items in values array that is populated here
        var numberOfValues = Count;
        var values = new T[numberOfValues];
        var node = _tail;
        for (var index = 0; index < numberOfValues; index++)
        {
            values[(numberOfValues - 1) - index] = node.Value;
            node = node.Previous;
        }
        _allValues = values;
    }

    /// <summary>
    /// This is used by the Sort method if a Comparison<T> is specified
    /// </summary>
    private class SortComparisonWrapper : IComparer<T>
    {
        private Comparison<T> _comparison;
        public SortComparisonWrapper(Comparison<T> comparison)
        {
            if (comparison == null)
                throw new ArgumentNullException("comparison");

            _comparison = comparison;
        }

        public int Compare(T x, T y)
        {
            return _comparison(x, y);
        }
    }

    protected class Node
    {
        public Node(T value, Node previous)
        {
            Value = value;
            Previous = previous;
            Count = (previous == null) ? 1 : (previous.Count + 1);
        }

        public T Value { get; private set; }

        /// <summary>
        /// This will be null if there is no previous node (ie. this is the start of the
        /// chain, the head)
        /// </summary>
        public Node Previous { get; private set; }

        public int Count { get; private set; }
    }
}

public interface IValueValidator<T>
{
    /// <summary>
    /// This will throw an exception for a value that does not pass validation requirements
    /// </summary>
    void EnsureValid(T value);
}

To implement a NonNullImmutableList we want to inherit from the ImmutableList and introduce a compulsory IValueValidator that ensures that no item in the list is null. Each of the methods is then "overridden" using the "new" keyword so that if they are called on an instance of the NonNullImmutableList then a NonNullImmutableList will be returned, but if the instance is used anywhere as an ImmutableList then the ImmutableList's methods will be called directly and an ImmutableList (rather than a NonNullImmutableList) reference will be returned. This approach does have a minor overhead, as described in the comment on the "To" method seen above, but it offers a straight-forward way to write derived classes that maintain their type (and so their implicit validation rules and assurances) when manipulations are performed.

[Serializable]
public class NonNullImmutableList<T> : ImmutableList<T> where T : class
{
    private readonly static Validator _defaultValidator = new Validator(null);
    private IValueValidator<T> _optionalValueValidator;

    public NonNullImmutableList() : this((IValueValidator<T>)null) { }
    public NonNullImmutableList(IEnumerable<T> values) : this(values, null) { }
    public NonNullImmutableList(IValueValidator<T> optionalValueValidator)
        : base((Node)null, GetValidator(optionalValueValidator))
    {
        _optionalValueValidator = optionalValueValidator;
    }
    public NonNullImmutableList(
        IEnumerable<T> values,
        IValueValidator<T> optionalValueValidator
    ) : base(values, GetValidator(optionalValueValidator))
    {
        _optionalValueValidator = optionalValueValidator;
    }
    private NonNullImmutableList(Node tail, IValueValidator<T> optionalValueValidator)
        : base(tail, GetValidator(optionalValueValidator))
    {
        _optionalValueValidator = optionalValueValidator;
    }

    private static IValueValidator<T> GetValidator(IValueValidator<T> optionalValueValidator)
    {
        if (optionalValueValidator == null)
            return _defaultValidator;
        return new Validator(optionalValueValidator);
    }

    public new NonNullImmutableList<T> Add(T value)
    {
        return ToNonNullOrEmptyStringList(base.Add(value));
    }
    public new NonNullImmutableList<T> AddRange(IEnumerable<T> values)
    {
        return ToNonNullOrEmptyStringList(base.AddRange(values));
    }
    public new NonNullImmutableList<T> Insert(T value, int insertAtIndex)
    {
        return ToNonNullOrEmptyStringList(base.Insert(value, insertAtIndex));
    }
    public new NonNullImmutableList<T> Insert(IEnumerable<T> values, int insertAtIndex)
    {
        return ToNonNullOrEmptyStringList(base.Insert(values, insertAtIndex));
    }
    public new NonNullImmutableList<T> Remove(T value)
    {
        return ToNonNullOrEmptyStringList(base.Remove(value));
    }
    public new NonNullImmutableList<T> Remove(T value, IEqualityComparer<T> optionalComparer)
    {
        return ToNonNullOrEmptyStringList(base.Remove(value, optionalComparer));
    }
    public new NonNullImmutableList<T> RemoveAt(int removeAtIndex)
    {
        return ToNonNullOrEmptyStringList(base.RemoveAt(removeAtIndex));
    }
    public new NonNullImmutableList<T> RemoveRange(int removeAtIndex, int count)
    {
        return ToNonNullOrEmptyStringList(base.RemoveRange(removeAtIndex, count));
    }
    public new NonNullImmutableList<T> Sort()
    {
        return ToNonNullOrEmptyStringList(base.Sort());
    }
    public new NonNullImmutableList<T> Sort(Comparison<T> optionalComparison)
    {
        return ToNonNullOrEmptyStringList(base.Sort(optionalComparison));
    }
    public new NonNullImmutableList<T> Sort(IComparer<T> optionalComparer)
    {
        return ToNonNullOrEmptyStringList(base.Sort(optionalComparer));
    }
    private NonNullImmutableList<T> ToNonNullOrEmptyStringList(ImmutableList<T> list)
    {
        if (list == null)
            throw new ArgumentNullException("list");

        return To<NonNullImmutableList<T>>(
            list,
            tail => new NonNullImmutableList<T>(tail, _optionalValueValidator)
        );
    }

    private class Validator : IValueValidator<T>
    {
        private IValueValidator<T> _optionalInnerValidator;
        public Validator(IValueValidator<T> optionalInnerValidator)
        {
            _optionalInnerValidator = optionalInnerValidator;
        }

        /// <summary>
        /// This will throw an exception for a value that does not pass validation requirements
        /// </summary>
        public void EnsureValid(T value)
        {
            if (value == null)
                throw new ArgumentNullException("value");
            if (_optionalInnerValidator != null)
                _optionalInnerValidator.EnsureValid(value);
        }
    }
}
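
To make the effect of this method-hiding concrete, here's a quick (purely illustrative) usage example:

var names = new NonNullImmutableList<string>().Add("Alice");

// Called through the derived type, Add returns a NonNullImmutableList<string>..
NonNullImmutableList<string> withBob = names.Add("Bob");

// ..but called through a base class reference, the ImmutableList<T> Add is
// invoked and only an ImmutableList<string> comes back (the null-rejecting
// validation still applies, since the base class holds the wrapped validator)
ImmutableList<string> asBaseType = names;
ImmutableList<string> withCharlie = asBaseType.Add("Charlie");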

A very similar approach could be taken to implement a "NonNullOrEmptyStringList" class (referred to in previous posts as a "DefinedStringList") by dropping the type param, inheriting from ImmutableList<string> and swapping out the validator to check for null or blank strings - something along the lines of the sketch below.
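
This is a minimal sketch only, assuming the ImmutableList shown above (the remaining "new" method overrides would follow the same pattern as the NonNullImmutableList's):

[Serializable]
public class NonNullOrEmptyStringList : ImmutableList<string>
{
    private readonly static Validator _validator = new Validator();

    public NonNullOrEmptyStringList() : base((Node)null, _validator) { }
    public NonNullOrEmptyStringList(IEnumerable<string> values)
        : base(values, _validator) { }
    private NonNullOrEmptyStringList(Node tail) : base(tail, _validator) { }

    public new NonNullOrEmptyStringList Add(string value)
    {
        return To<NonNullOrEmptyStringList>(
            base.Add(value),
            tail => new NonNullOrEmptyStringList(tail)
        );
    }

    // .. the other manipulation methods are "hidden" in the same manner ..

    private class Validator : IValueValidator<string>
    {
        public void EnsureValid(string value)
        {
            if ((value ?? "").Trim() == "")
                throw new ArgumentException("Null/blank value specified");
        }
    }
}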

The final piece of the puzzle I've used in my code is to throw in some extension methods:

public static class IEnumerable_Extensions
{
    public static ImmutableList<T> ToImmutableList<T>(this IEnumerable<T> data)
    {
        return new ImmutableList<T>(data);
    }

    /// <summary>
    /// valueValidator is optional (may be null)
    /// </summary>
    public static ImmutableList<T> ToImmutableList<T>(
        this IEnumerable<T> data,
        IValueValidator<T> valueValidator)
    {
        return new ImmutableList<T>(data, valueValidator);
    }

    /// <summary>
    /// This will throw an exception if any of the values are null
    /// </summary>
    public static NonNullImmutableList<T> ToNonNullImmutableList<T>(
        this IEnumerable<T> data) where T : class
    {
        return new NonNullImmutableList<T>(data);
    }

    /// <summary>
    /// This will throw an exception if any of the values are null, valueValidator is
    /// optional (may be null)
    /// </summary>
    public static NonNullImmutableList<T> ToNonNullImmutableList<T>(
        this IEnumerable<T> data,
        IValueValidator<T> valueValidator) where T : class
    {
        return new NonNullImmutableList<T>(data, valueValidator);
    }
}
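
These make translating from any IEnumerable a one-liner - eg. (where "employees" here is just an illustrative sequence of some reference type with a Name property):

var employeeNames = employees
    .Select(e => e.Name)
    .ToNonNullImmutableList();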

And that's it! I'm happy with these updated lists for now and, as I already mentioned, have been using them in a few projects and consider them ready for use!

Posted at 20:02

Comments

Persistent Immutable Lists

I've written before about immutable data structures (I'm all for them; see I love Immutable Data and Problems in Immutability-land) but watching a talk recently by Clojure-creator Rich Hickey made me think about one particular area again. In that first post I put up some example code for an Immutable List that wrapped the .Net List<T> class - this was very simple to implement and understand, and in many cases I was using the immutable class as a return type or a method argument, which meant that the instance would be built once and further manipulations would be limited. This meant that I wasn't too concerned about internally creating a new list instance each time a new immutable instance was required and copying the references over.

However, in this talk it was reiterated that all of the core data structures in Clojure were intended to be immutable and that considerable work was done to ensure that the performance of these structures was sufficient that it could compete with Java and C#. A persistent linked list structure was used so that operations could be performed without having to recreate the entire dataset.

This is something that I didn't know a huge amount about but sounded like it could be an interesting avenue to explore!

A basic introduction into the singly-linked list

The singly-linked list is a fairly basic structure built around nodes; each node has a value and a link to the next node, if there is one. We know we're at the end of the list when the current node has a null "next" node reference.

An empty list would have a null "head" node.

An int list with a single item would have a head node of the form

{ 1, null }

where the value of the item is 1 and there is no next node.

An int list with two items could be illustrated as

{ 1, { 2, null } }

And one with four values as

{ 1, { 2, { 3, { 4, null } } } }

Well, you get the idea!

The interesting thing comes when we look at how the structure changes as items are added. Starting off with an empty list and adding items one at a time to the front of the list, the structure grows in this manner:

{ 1, null }

{ 2, { 1, null } }

{ 3, { 2, { 1, null } } }

{ 4, { 3, { 2, { 1, null } } } }

Each time we take a list L0 and create a new instance L1 by adding a single item, the head node of L1 can be taken to be a new node that contains the new value and whose "next" reference points to the head node of L0. This is where the "persistent" part comes into play. (This is only possible because the nodes themselves are immutable; otherwise one instance of the list could affect the data in another instance with which it shares node chain references).

This means that creating a new list with a new item is a very simple and fast action! This operation is considerably faster than doing the same with the original immutable list approach I was using, especially as the size of the list grows.

Enumerating the list is also straight-forward; we start at the head node (if non-null) and then walk down the "next" references until we hit a null, indicating the end of the list.

A basic implementation of this could be:

public class SimplePersistentImmutableList<T> : IEnumerable<T>
{
    private readonly Node _head;
    public SimplePersistentImmutableList() : this(null) { }
    private SimplePersistentImmutableList(Node head)
    {
        _head = head;
    }

    public SimplePersistentImmutableList<T> InsertAtStart(T value)
    {
        return new SimplePersistentImmutableList<T>(
            new Node(value, _head)
        );
    }

    public IEnumerator<T> GetEnumerator()
    {
        return new SetEnumerator(_head);
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }

    private class Node
    {
        public Node(T value, Node next)
        {
            Value = value;
            Next = next;
        }

        public T Value { get; private set; }

        /// <summary>
        /// This will be null if there is no next node
        /// </summary>
        public Node Next { get; private set; }
    }

    private class SetEnumerator : IEnumerator<T>
    {
        private readonly Node _topNode;
        private Node _currentNode;
        public SetEnumerator(Node topNode)
        {
            // The way that the enumeration operates is that it will call MoveNext before
            // trying to retrieve the first value when processing a foreach loop to ensure
            // that data is present. In order to deal with this, we need to wrap the Top
            // Node in another node so that the first MoveNext call moves us to the start
            // of the data.
            _topNode = new Node(default(T), topNode);
            _currentNode = _topNode;
        }

        public void Dispose() { }

        public T Current
        {
            get
            {
                if (_currentNode == null)
                    throw new InvalidOperationException("No Current value");
                return _currentNode.Value;
            }
        }

        object IEnumerator.Current
        {
            get { return Current; }
        }

        public bool MoveNext()
        {
            if ((_currentNode == null) || (_currentNode.Next == null))
                return false;
            _currentNode = _currentNode.Next;
            return true;
        }

        public void Reset()
        {
            _currentNode = _topNode;
        }
    }
}

And most of that code is the implementation of IEnumerable!

Limitations

This example only exposes an InsertAtStart method as a manner in which to alter the list. An obvious counterpart would be a RemoveFromStart method, since all that it need do is create a new list whose head node is the "next" node of the current list's head node (if the head node of the initial list was null then there were no items, and so RemoveFromStart would be invalid).

public SimplePersistentImmutableList<T> RemoveFromStart()
{
    if (_head == null)
        throw new InvalidOperationException("The list is empty");

    return new SimplePersistentImmutableList<T>(
        _head.Next
    );
}

At this point, we could very easily take this code and create an immutable stack by renaming "InsertAtStart" to "Push", "RemoveFromStart" to "Pop" and adding in a way to retrieve the current value, if any:

public T Peek()
{
    if (_head == null)
        throw new InvalidOperationException("The list is empty");

    return _head.Value;
}

public bool IsEmpty
{
    get
    {
        return (_head == null);
    }
}
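
Pulling those renames together, the complete immutable stack would be something along these lines (the Node class is unchanged from the list above):

public class SimpleImmutableStack<T>
{
    private readonly Node _head;
    public SimpleImmutableStack() : this(null) { }
    private SimpleImmutableStack(Node head)
    {
        _head = head;
    }

    public bool IsEmpty
    {
        get { return (_head == null); }
    }

    // Pushing wraps the current head in a new node (this was InsertAtStart)
    public SimpleImmutableStack<T> Push(T value)
    {
        return new SimpleImmutableStack<T>(new Node(value, _head));
    }

    public T Peek()
    {
        if (_head == null)
            throw new InvalidOperationException("The stack is empty");
        return _head.Value;
    }

    // Popping returns a stack whose head is the current head's "next" node
    // (this was RemoveFromStart)
    public SimpleImmutableStack<T> Pop()
    {
        if (_head == null)
            throw new InvalidOperationException("The stack is empty");
        return new SimpleImmutableStack<T>(_head.Next);
    }

    private class Node
    {
        public Node(T value, Node next)
        {
            Value = value;
            Next = next;
        }
        public T Value { get; private set; }
        public Node Next { get; private set; }
    }
}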

However, to support the other actions that are expected from a list such as inserting-into and removing-from arbitrary locations we need to consider how to find the appropriate place in the node chain from which to snip out values or insert new ones. Unless these operations are to remove the first item(s) from a list or to add some to the start of the list, only some of the existing chain may be shared between the current and new instances.

For example, to add the value 99 into index 2 of the list that is described by the following node chain

{ 3, { 2, { 1, { 0, null } } } }

then we'd need to end up with the chain

{ 3, { 2, { 99, { 1, { 0, null } } } } }

managing to re-use only the last two nodes of the existing chain.

This brings me onto the issue that I have with the above implementation; it's my gut feeling that the majority of operations that I might perform on a list are generating an immutable list from a mutable set, adding items to the end of an existing list and enumerating through the values. Keeping a reference to the head node means that every time a new value is added to the end of the list, none of the chain may be persisted. So to optimise for this operation we can store a reference to the tail of the chain. Now the same logic from the InsertAtStart method becomes the Add method:

public SimplePersistentImmutableList<T> Add(T value)
{
    return new SimplePersistentImmutableList<T>(
        new Node(value, _tail)
    );
}

so long as the Node class is also altered to reflect this reversed nature:

private class Node
{
    public Node(T value, Node previous)
    {
        Value = value;
        Previous = previous;
    }

    public T Value { get; private set; }

    /// <summary>
    /// This will be null if there is no previous node
    /// </summary>
    public Node Previous { get; private set; }
}

This does raise one thorny issue, though; we have to re-think enumeration of the list, since we can only step backwards through it now that the "master" node reference we store is the tail. A simple approach would be as follows:

public IEnumerator<T> GetEnumerator()
{
    var values = new List<T>();
    var node = _tail;
    while (node != null)
    {
        values.Insert(0, node.Value);
        node = node.Previous;
    }
    return values.GetEnumerator();
}

This makes enumeration potentially an expensive operation, especially if there are a large number of items in the set since a new List is built and populated for each enumeration. And if there are a lot of items to deal with then the list may have to resize its internal array multiple times (with a copy operation from one array to the next) since we don't know up front how large the list needs to be.

To address this, I'm going to make two changes. Firstly, the Node class will be given a Count property which is always the Count of the previous Node plus one, unless the previous Node is null in which case the Count is one.

private class Node
{
    public Node(T value, Node previous)
    {
        Value = value;
        Previous = previous;
        Count = (previous == null) ? 1 : (previous.Count + 1);
    }

    public T Value { get; private set; }

    /// <summary>
    /// This will be null if there is no previous node (ie. this is the start of the
    /// chain, the head)
    /// </summary>
    public Node Previous { get; private set; }

    public int Count { get; private set; }
}

Secondly, I'm going to introduce a class member array "_allValues" which is only populated the first time that an enumeration is requested and that effectively caches the value set in an easily-enumerable format. This is only populated "on demand" to avoid any overhead where it is generated for a list that will never be enumerated over (if an instance L0 has a value added to it, resulting in L1, which then has a further value added to it, resulting in L2, we don't want to waste time generating the "_allValues" array for L1 if the reference to L1 is dropped when L2 is created).

/// <summary>
/// For enumerating the values we need to walk through all of the nodes and then reverse the
/// set (since we start with the tail and work backwards). This can be relatively expensive
/// so the list is cached in the "_allValues" member array so that subsequent requests are
/// fast (mightn't be a big deal for a single enumeration of the contents but it could
/// be for multiple calls to the indexed property, for example).
/// </summary>
private void EnsureAllValuesDataIsPopulated()
{
    if (_allValues != null)
        return;

    // Since we start at the tail and work backwards, we need to reverse the order of the
    // items in values array that is populated here
    var numberOfValues = Count;
    var values = new T[numberOfValues];
    var node = _tail;
    for (var index = 0; index < numberOfValues; index++)
    {
        values[(numberOfValues - 1) - index] = node.Value;
        node = node.Previous;
    }
    _allValues = values;
}

The Count property of the tail node means that we know the required size up front and so can initialise an array to hold the values. The "_allValues" array is set to null in the constructor and this EnsureAllValuesDataIsPopulated method must be called before anything references it (eg. the GetEnumerator method).

There's something potentially a bit hairy in this, though; the internals of the class are no longer immutable and so could we be open to crazy things happening in multi-threaded scenarios? Joseph Albahari's Advanced Threading article shows a scary first example and Jon Skeet's Implementing the Singleton Pattern in C# has an example with code that looks very similar to what we're doing here, and that's clearly marked as not thread-safe. The first example illustrates how issues may arise as the "compiler, CLR or CPU may reorder your program's instructions to improve efficiency" but "C# and the runtime are very careful to ensure that such optimizations don't break ordinary single-threaded code" - so in this case we needn't worry, as there is only one "_allValues" reference being compared to null and then set, and no significant rearrangement could be made that wouldn't also affect single-threaded processing. In the Singleton example, the issue is that the work could potentially be performed multiple times if multiple threads checked for null before any thread had completed the work and set the "_allValues" reference. For the lock-free reading that becomes possible once "_allValues" has been set, I'm happy with that trade-off in this case. (If multiple threads have to do the work of generating the array while they're all clamouring for the "_allValues" data at the same time then that's fine; once they finish, subsequent requests will be able to access the pre-generated array with no locking or other complications). If I wasn't happy with it then I'd probably use the .Net 4.0 Lazy<T> construct I've talked about before (see Check, check it out) but this could potentially add some overhead for each new instance of the immutable list, which I wanted to avoid for instances that will never be enumerated over.
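
(For reference, the Lazy<T> variation might have looked something like the following sketch, where "GenerateAllValues" would contain the same walk-backwards-and-reverse logic as the EnsureAllValuesDataIsPopulated method - the overhead being the Lazy wrapper allocated for every instance, whether it is ever enumerated or not:)

private readonly Lazy<T[]> _allValues;

private PersistentImmutableList(Node tail)
{
    _tail = tail;

    // ExecutionAndPublication guarantees that the array is only ever built once,
    // at the cost of a Lazy instance (and locking on first access) per list
    _allValues = new Lazy<T[]>(
        GenerateAllValues,
        System.Threading.LazyThreadSafetyMode.ExecutionAndPublication
    );
}

Anyway, sticking with the simple compare-against-null approach, here's the updated class in full: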

public class PersistentImmutableList<T> : IEnumerable<T>
{
    private readonly Node _tail;
    private T[] _allValues;

    public PersistentImmutableList() : this((Node)null) { }
    public PersistentImmutableList(IEnumerable<T> values)
    {
        if (values == null)
            throw new ArgumentNullException("values");

        Node node = null;
        foreach (var value in values)
        {
            if (node == null)
                node = new Node(value, null);
            else
                node = new Node(value, node);
        }
        _tail = node;
    }
    private PersistentImmutableList(Node tail)
    {
        _tail = tail;
    }

    public int Count
    {
        get { return (_tail == null) ? 0 : _tail.Count; }
    }

    public PersistentImmutableList<T> Add(T value)
    {
        return AddRange(new[] { value });
    }

    public PersistentImmutableList<T> AddRange(IEnumerable<T> values)
    {
        if (values == null)
            throw new ArgumentNullException("values");

        var node = _tail;
        foreach (var value in values)
            node = new Node(value, node);
        return new PersistentImmutableList<T>(node);
    }

    public IEnumerator<T> GetEnumerator()
    {
        // As documented at http://msdn.microsoft.com/en-us/library/system.array.aspx, from
        // .Net 2.0 onward, the Array class implements IEnumerable<T> but this is only
        // provided at runtime so we have to explicitly cast the array in order to access
        // its generic GetEnumerator method
        EnsureAllValuesDataIsPopulated();
        return ((IEnumerable<T>)_allValues).GetEnumerator();
    }
    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }

    /// <summary>
    /// For enumerating the values we need to walk through all of the nodes and then reverse
    /// the set (since we start with the tail and work backwards). This can be relatively
    /// expensive so the list is cached in the "_allValues" member array so that subsequent
    /// requests are fast (mightn't be a big deal for a single enumeration of the contents
    /// but it could be for multiple calls to the indexed property).
    /// </summary>
    private void EnsureAllValuesDataIsPopulated()
    {
        if (_allValues != null)
            return;

        // Since we start at the tail and work backwards, we need to reverse the order of
        // the items in values array that is populated here
        var numberOfValues = Count;
        var values = new T[numberOfValues];
        var node = _tail;
        for (var index = 0; index < numberOfValues; index++)
        {
            values[(numberOfValues - 1) - index] = node.Value;
            node = node.Previous;
        }
        _allValues = values;
    }

    private class Node
    {
        public Node(T value, Node previous)
        {
            Value = value;
            Previous = previous;
            Count = (previous == null) ? 1 : (previous.Count + 1);
        }

        public T Value { get; private set; }

        /// <summary>
        /// This will be null if there is no previous node (ie. this is the start of the
        /// chain, the head)
        /// </summary>
        public Node Previous { get; private set; }

        public int Count { get; private set; }
    }
}

Having a Count property on the Node enables the immutable list to expose a Count property without having to recursively loop through the nodes.

Rounding it out

Since we have a "_tail" Node reference and each Node has a Previous property, the Count on a Node represents the number of items in the list up to and including that Node. So the tail Node's Count is the number of items in the entire list, the node before the tail (if any) would have a Count value of one less, and so on back up the chain. I mention this because I hope it makes the following methods easier to follow!

public PersistentImmutableList<T> InsertAt(T value, int insertAtIndex)
{
    return InsertAt(new[] { value }, insertAtIndex);
}

public PersistentImmutableList<T> InsertAt(IEnumerable<T> values, int insertAtIndex)
{
    if (values == null)
        throw new ArgumentNullException("values");
    if (!values.Any())
        return this;
    if ((insertAtIndex < 0) || (insertAtIndex > Count))
        throw new ArgumentOutOfRangeException("insertAtIndex");

    // If the insertion is at the end of the list then we can use AddRange and avoid any
    // messing about
    if (insertAtIndex == Count)
        return AddRange(values);

    // Starting with the tail, walk back to the insertion point, recording the values we
    // pass over
    var node = _tail;
    var valuesBeforeInsertionPoint = new T[Count - insertAtIndex];
    for (var index = 0; index < valuesBeforeInsertionPoint.Length; index++)
    {
        valuesBeforeInsertionPoint[index] = node.Value;
        node = node.Previous;
    }

    // Any existing node chain before the insertion point can be persisted and the new
    // value(s) appended
    foreach (var value in values)
        node = new Node(value, node);

    // Finally, add back the values we walked through before to complete the chain
    for (var index = valuesBeforeInsertionPoint.Length - 1; index >= 0; index--)
        node = new Node(valuesBeforeInsertionPoint[index], node);
    return new PersistentImmutableList<T>(node);
}

To insert into an arbitrary location in the list, we need to walk backwards from the tail to the insertion point and then insert the new value(s) by persisting the rest of the node chain (from the insertion point up to the head), appending the new values and then adding back the values that we walked through to get to the insertion point. The nodes from the tail to the insertion point cannot be re-used as their "Previous" chains would not include the new values!

A very similar approach may be taken to removals:

public PersistentImmutableList<T> RemoveAt(int removeAtIndex)
{
    return RemoveRange(removeAtIndex, 1);
}

public PersistentImmutableList<T> RemoveRange(int removeAtIndex, int count)
{
    if (removeAtIndex < 0)
        throw new ArgumentOutOfRangeException(
            "removeAtIndex",
            "must be greater than or equal zero"
        );
    if (count <= 0)
        throw new ArgumentOutOfRangeException("count", "must be greater than zero");
    if ((removeAtIndex + count) > Count)
        throw new ArgumentException("removeAtIndex + count must not exceed Count");

    // Starting with the tail, walk back to the end of the removal range, recording the
    // values we passed over
    var node = _tail;
    var valuesBeforeRemovalRange = new T[Count - (removeAtIndex + count)];
    for (var index = 0; index < valuesBeforeRemovalRange.Length; index++)
    {
        valuesBeforeRemovalRange[index] = node.Value;
        node = node.Previous;
    }

    // Move past the values in the removal range
    for (var index = 0; index < count; index++)
        node = node.Previous;

    // Now add back the values we walked through above to the part of the chain that can be
    // persisted
    for (var index = valuesBeforeRemovalRange.Length - 1; index >= 0; index--)
        node = new Node(valuesBeforeRemovalRange[index], node);
    return new PersistentImmutableList<T>(node);
}

And really, that's most of the complication out of the way! We can still flesh out a few more members, like an indexer:

public T this[int index]
{
    get
    {
        if ((index < 0) || (index >= Count))
            throw new ArgumentOutOfRangeException("index");

        EnsureAllValuesDataIsPopulated();
        return _allValues[index];
    }
}

and a sort method:

public PersistentImmutableList<T> Sort(IComparer<T> comparer)
{
    if (comparer == null)
        throw new ArgumentNullException("comparer");

    EnsureAllValuesDataIsPopulated();
    return new PersistentImmutableList<T>(
        _allValues.OrderBy(x => x, comparer)
    );
}

but we're getting down to icing-on-the-cake now.

Conclusion

I've enjoyed this little foray and intend to replace that old simple (effective but slow) immutable list I was using before with a version of this code! In existing code that used the previous implementation there was a measurable performance hit in some loops where lists were being built up in a method before being returned; because of that, I had rewritten those methods to use a mutable list internally and return an immutable representation when the work was complete. But now I think I could probably get away with using this new immutable list throughout method internals as well! I need to do some profiling of previously-seen trouble areas to be sure, but I get the sneaky feeling that in some of the larger data sets where performance was seen to be taking a hit, this new immutable list variation may work even better than the built-in mutable list. And that, I'm very happy with! :)

Posted at 20:33

Comments

Problems in Immutability-land

Having sung the praises of immutability last time, there are a couple of flies in the ointment. The first is a bit of a non-issue I think, but bears mentioning; if I have a class with half a dozen properties it feels like a lot of monotonous typing hammering out those private properties, those arguments-not-whitespace-or-null checks, those property assignments, those public properties, those comments about the class contract - it's boring! Now I know that developers should all be superfast, touch-typist maniacs (http://www.codinghorror.com/blog/2008/11/we-are-typists-first-programmers-second.html) - and I am, that's not the problem - but it still makes me grimace when I know I have to throw down a big chunk of boilerplate-looking code like the Name class I used in an example last time.

public class Name
{
  public Name(string title, string firstName, string lastName)
  {
    if ((title ?? "").Trim() == "")
      throw new ArgumentException("Null/empty title specified");
    if ((firstName ?? "").Trim() == "")
      throw new ArgumentException("Null/empty firstName specified");
    if ((lastName ?? "").Trim() == "")
      throw new ArgumentException("Null/empty lastName specified");

    Title = title;
    FirstName = firstName;
    LastName = lastName;
  }

  /// <summary>
  /// This will never be null or empty
  /// </summary>
  public string Title { get; private set; }

  /// <summary>
  /// This will never be null or empty
  /// </summary>
  public string FirstName { get; private set; }

  /// <summary>
  /// This will never be null or empty
  /// </summary>
  public string LastName { get; private set; }
}

Now, on the other hand, this means that callers can take on certain guarantees about the class and so contain less "gotcha" code (checking for null values and all that). So this is probably code that would have to be written in one way or another elsewhere. Possibly many times over. So I think it's definitely a win overall, which is why I said it's kind of a non-issue - but still it makes my fingers hurt a bit for that quick blaze of crazy typing. I'm concerned the key motions for ArgumentNullException are so ingrained in my muscle memory that one day my hands will refuse to type anything else!

The DeferredElementStore

Another issue I've come across a few times was highlighted quite nicely by something I was writing the other day; we had some forms that we wanted to generate from xml config files so some of the elements could be added, removed, made optional, compulsory, etc, etc.. It was fairly straight-forward and each element was parsed from the file and described by a corresponding immutable class but there was a problem - some of the elements were related, or rather one might depend on another. A cascading dropdown scenario, basically. So each element needed a way to access the other elements in the form data to read their values and whatnot. But when initialising each element there didn't exist any single object that had awareness of all of the elements since we were still in the process of initialising them! Catch-22, bang!

To work around this I used an object that appeared immutable to the elements but which did not guarantee to be able to respond to requests for references to other elements until the Init phase of the page lifecycle (this was in ASP.Net and the process was to parse the config file, build a list of controls that the elements required and then add those controls to a Page all at once - so the Init event for each of those controls would be raised after all of the elements had been initialised and the controls created). This object would contain no data initially and be used just as a reference to pass to the elements during initialisation. When the initialisation of all of the elements was complete, a reference to the list of these elements was passed to this mystery object; our "deferred element store". And then the elements' controls were added to the Page. So when the element classes requested access to other elements during or after the Init phase, the data was available!

Now, this clearly isn't immutable data - it's more like some sort of single-setting, delayed-instantiation object.. or something. I'm going to link to Eric Lippert again here since he's pretty much the guru on this sort of thing and since he describes this precise scenario in the following article:

http://blogs.msdn.com/b/ericlippert/archive/2007/11/13/immutability-in-c-part-one-kinds-of-immutability.aspx

.. I'm not so sure about the phrase "popsicle immutability" but that's basically what I'm talking about! There's a slight variation that I've used here (which is actually talked about in the comments for that article) where the real "element store" class is not passed to the elements during initialisation, only a wrapper around it. This ensures that the element classes can't mess with the state; only the form parser can:

public interface IDeferredElementStore
{
  AbstractFormElement TryToGetElement(string id);
}

public class DeferredElementStore : IDeferredElementStore
{
  private NonNullImmutableList<AbstractFormElement> _elements;
  public DeferredElementStore()
  {
    _elements = new NonNullImmutableList<AbstractFormElement>();
  }
  public void StoreElementData(NonNullImmutableList<AbstractFormElement> elements)
  {
    if (elements == null)
      throw new ArgumentNullException("elements");
    _elements = elements;
  }
  public AbstractFormElement TryToGetElement(string id)
  {
    var element = _elements.FirstOrDefault(e => e.Id == id);
    if (element == null)
      throw new ArgumentException("Invalid Id");
    return element;
  }
}

public class ReadOnlyDeferredElementStore : IDeferredElementStore
{
  private IDeferredElementStore _elementStore;
  public ReadOnlyDeferredElementStore(IDeferredElementStore elementStore)
  {
    if (elementStore == null)
      throw new ArgumentNullException("elementStore");
    _elementStore = elementStore;
  }
  public AbstractFormElement TryToGetElement(string id)
  {
    return _elementStore.TryToGetElement(id);
  }
}

.. and the element generation code could look something like:

var elements = new List<AbstractFormElement>();
var elementStore = new DeferredElementStore();
var elementStoreReadOnly = new ReadOnlyDeferredElementStore(elementStore);
elements.Add(new FreeTextElement(.., elementStoreReadOnly, ..));
elements.Add(new DropDownElement(.., elementStoreReadOnly, ..));
elementStore.StoreElementData(new NonNullImmutableList<AbstractFormElement>(elements));
foreach (var element in elements)
{
  foreach (var control in element.Controls)
    this.Controls.Add(control);
}

This could just as well be used if there are circular references between classes. I suppose then you'd have to have a container to handle both objects being instantiated and pass a read-only wrapper of this container to both classes, then push references to those instances into the container.
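
As a very rough sketch of that idea (every name here is invented purely for illustration; ServiceA and ServiceB would be classes that each take an IPairStore constructor argument):

public interface IPairStore
{
    ServiceA A { get; }
    ServiceB B { get; }
}

public class PairStore : IPairStore
{
    public ServiceA A { get; private set; }
    public ServiceB B { get; private set; }
    public void Store(ServiceA a, ServiceB b)
    {
        if (a == null)
            throw new ArgumentNullException("a");
        if (b == null)
            throw new ArgumentNullException("b");
        A = a;
        B = b;
    }
}

public class ReadOnlyPairStore : IPairStore
{
    private IPairStore _store;
    public ReadOnlyPairStore(IPairStore store)
    {
        if (store == null)
            throw new ArgumentNullException("store");
        _store = store;
    }
    public ServiceA A { get { return _store.A; } }
    public ServiceB B { get { return _store.B; } }
}

// Both instances receive the read-only wrapper before either fully exists and
// the store is only populated once both have been constructed
var pairStore = new PairStore();
var readOnlyPairStore = new ReadOnlyPairStore(pairStore);
var a = new ServiceA(readOnlyPairStore);
var b = new ServiceB(readOnlyPairStore);
pairStore.Store(a, b);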

This isn't quite the same as the "observational immutability" described in that article, but I feel I've got an article about dynamic factory classes coming on which will touch on that!

Still loving it

All in all, I'm still definitely a big fan of this immutability lark and am still convinced it makes the code easier to deal with overall. I was reading something earlier that I now can't find, so I'll have to paraphrase - they were saying that when you're trying to get to grips with existing code, the less you have to keep in your head about what's going on at any time, the easier it is. This is hardly news but it was used in the context of the advantages of immutable data; if you have references that just are and aren't going to undergo all sorts of state changes, there are far fewer potential interactions that you have to deal with mentally. And that means it should be easier to deal with!

Posted at 19:00

Comments

I love Immutable Data

I love immutable data. There, I said it. I think over the last couple of years a few major factors have had the most influence in leading me to this point -

  • I've been driven mad by dealing with code full of complicated object models with no indication which properties are required, which are optional, which go together, which are mutually exclusive, etc..
  • I was working on a store of data that would be read from and written to by multiple threads and the initial implementation had a naive lock-on-every-interaction approach when it seemed like we should be able to make the reads work without locking (especially since reads were massively more common than writes)
  • I've been working largely on Tourism websites (and all the related backend services) for a few years now and most of the data feels like it's read-only, though having thought about it I'm not sure if I'd change my mind if I was doing CRUD day-in, day-out instead

The first point could really be addressed in all sorts of ways - the code's all a bit wishy-washy and poorly defined and nobody seems to know which fields are for what in the example I'm thinking of. But when I think of immutable types I instinctively think of classes whose values are set once through a constructor (though there are other variations that can be used) and then that instance is "locked" such that we know its state will never change - and that constructor will have ensured that this state is valid. If the classes in point were all written in this way then never again (hopefully!) would there be concerns regarding the validity of the states of the objects; they must have been valid in order to be instantiated and immutability means they can't have changed since!

While we're doing some sort of validation on the constructor arguments I think it also encourages you to think about the various states that can exist - eg.

public class Employee
{
  public string Title { get; set; }
  public string FirstName { get; set; }
  public string LastName { get; set; }
  public string[] Roles { get; set; }
}

This is the sort of thing that's found all over the place - especially across webservice interfaces. Assume that we have the requirements that Title, FirstName and LastName must all have values and that all Employees have zero or more Roles. I think that describing the requirements through constructor validation, plus some liberal commenting, ends up producing nicer code:

public class Employee
{
  public Employee(Name name, DefinedStringList roles)
  {
    if (name == null)
      throw new ArgumentNullException("name");
    if (roles == null)
      throw new ArgumentNullException("roles");

    Name = name;
    Roles = roles;
  }

  /// <summary>
  /// This will never be null
  /// </summary>
  public Name Name { get; private set; }

  /// <summary>
  /// This will never be null
  /// </summary>
  public DefinedStringList Roles { get; private set; }
}

public class Name
{
  public Name(string title, string firstName, string lastName)
  {
    if ((title ?? "").Trim() == "")
      throw new ArgumentException("Null/empty title specified");
    if ((firstName ?? "").Trim() == "")
      throw new ArgumentException("Null/empty firstName specified");
    if ((lastName ?? "").Trim() == "")
      throw new ArgumentException("Null/empty lastName specified");

    Title = title;
    FirstName = firstName;
    LastName = lastName;
  }

  /// <summary>
  /// This will never be null or empty
  /// </summary>
  public string Title { get; private set; }

  /// <summary>
  /// This will never be null or empty
  /// </summary>
  public string FirstName { get; private set; }

  /// <summary>
  /// This will never be null or empty
  /// </summary>
  public string LastName { get; private set; }
}

Except - wow! - the amount of code seems to have ballooned and I've not even included the "DefinedStringList" class! (Well, not here at least - it's down the bottom of the post).

But what we do have now are instances of Employee that are always in a known-good state, and we can safely retrieve employee.Name.FirstName without first ensuring that Name is not null. We also know that an Employee with no assigned roles will have a Roles instance with a Count of zero, rather than having to wonder whether it might be a null reference instead. So the upshot should be that there will actually be less code in the places where Employee instances are accessed.
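To illustrate the pay-off at the point of use (just a throwaway snippet):

var employee = new Employee(
  new Name("Mr", "Andrew", "Smith"),
  new DefinedStringList("Developer", "Tea Maker")
);

// No defensive null checks required - if the Employee instance exists, its state is valid
Console.WriteLine(employee.Name.Title + " " + employee.Name.LastName);
Console.WriteLine("Number of roles: " + employee.Roles.Count);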

Multithreaded access

Now, to recreate a really trivial version of the multithreaded datastore I mentioned earlier, imagine we have a local store of Employees that is being written to and read from - eg.

public class EmployeeStore
{
  private List<Employee> _data = new List<Employee>();

  public IEnumerable<Employee> GetAll()
  {
    lock (_data)
    {
      return _data.AsReadOnly();
    }
  }

  public void Add(Employee employeeToAdd)
  {
    if (employeeToAdd == null)
      throw new ArgumentNullException("employeeToAdd");

    lock (_data)
    {
      _data.Add(employeeToAdd);
    }
  }
}

We'll ignore any concept of deleting or updating for now. Since we don't know how many threads are at work in this scenario, or who's doing what, we lock the internal data at each read or write. We're also returning the data as an IEnumerable and using List's .AsReadOnly method in an optimistic attempt to keep the internal data from being manipulated externally after we return it. In fact, in the example I had, the data was actually (deep-)cloned before being returned, to ensure that no caller could manipulate any data inside the data store.

If we're working with immutable data types and have access to an immutable list then we can change this, without much effort, to require no locks for reading - and we can forget any AsReadOnly or cloning malarkey too. An immutable list works by returning new instances from any method that would otherwise affect its contents - so if a list has 3 items and we call Add then the existing list is unchanged and the Add method returns a new list with all 4 items. Example code is at the end of this post, along with a DefinedStringList implementation, as mentioned earlier.

public class EmployeeStoreWithoutReadLocking
{
  private object _writeLock = new object();
  private ImmutableList<Employee> _data = new ImmutableList<Employee>();

  public ImmutableList<Employee> GetAll()
  {
    return _data;
  }

  public void Add(Employee employeeToAdd)
  {
    if (employeeToAdd == null)
      throw new ArgumentNullException("employeeToAdd");

    lock (_writeLock)
    {
      _data = _data.Add(employeeToAdd);
    }
  }
}

Easy! Of course this relies upon the Employee class being immutable (which must cover all of its properties' types as well). The reason that no read lock is needed is that swapping in the new list is a single reference assignment, which is atomic in .NET - a reader will always see either the old list or the new one, never a half-updated structure. So now we're not just reaping the benefits in state validity but we've got more performant threaded code too (again, my example was heavy on reads and light on writes). In a lot of cases, immutability such as this can make areas of multi-threaded code much easier to write and maintain.
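And just to spell out the "existing list is unchanged" behaviour, using the ImmutableList implementation from the bottom of this post:

var threeValues = new ImmutableList<string>("one", "two", "three");
var fourValues = threeValues.Add("four");

// threeValues.Count is still 3 - the original reference is completely untouched
// fourValues.Count is 4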

I think in this case I extended the ImmutableList into a NonNullImmutableList which had validation to ensure that it could never contain any null references - similar to how the DefinedStringList ensures that it has no null or empty values. Another layer of comforting behaviour guarantee, so that callers don't have to worry about nulls. It makes me feel warm and fuzzy.
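That class isn't reproduced below but it follows exactly the same pattern as the DefinedStringList does - a rough sketch (not the precise code I used) would be something like:

public class NonNullImmutableList<T> : ImmutableList<T> where T : class
{
  public NonNullImmutableList(IEnumerable<T> values)
    : base(values, new NonNullValidator()) { }
  public NonNullImmutableList(params T[] values) : this((IEnumerable<T>)values) { }
  public NonNullImmutableList() : this(new T[0]) { }

  public new NonNullImmutableList<T> Add(T value)
  {
    return toDerivedClass<NonNullImmutableList<T>>(base.Add(value));
  }
  public new NonNullImmutableList<T> Remove(T value)
  {
    return toDerivedClass<NonNullImmutableList<T>>(base.Remove(value));
  }

  private class NonNullValidator : IValueValidator<T>
  {
    public void EnsureValid(T value)
    {
      if (value == null)
        throw new ArgumentException("Null reference encountered");
    }
  }
}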

Undo!

In most of the scenarios I've been working with recently, classes such as Employee would be instantiated just the once and then not changed unless another query was executed that returned a new set of Employee data. But feasibly we may want to alter the Employee class such that it is "editable" in the same way that the DefinedStringList we've been talking about is - you can call methods that return a new instance of the class with the alteration made, leaving the original reference unaltered.

public class Employee
{
  public Employee(Name name, DefinedStringList roles)
  {
    if (name == null)
      throw new ArgumentNullException("name");
    if (roles == null)
      throw new ArgumentNullException("roles");

    Name = name;
    Roles = roles;
  }

  /// <summary>
  /// This will never be null
  /// </summary>
  public Name Name { get; private set; }

  /// <summary>
  /// This will never be null
  /// </summary>
  public DefinedStringList Roles { get; private set; }

  public Employee UpdateName(Name name)
  {
    // The constructor will throw an exception for a null name reference
    return new Employee(name, Roles);
  }

  public Employee AddRole(string role)
  {
    // The list's Add method will throw an exception for a null or empty role value
    return new Employee(Name, Roles.Add(role));
  }

  public Employee RemoveRole(string role)
  {
    return new Employee(Name, Roles.Remove(role));
  }
}

Here the name can be overwritten and roles can be added or removed. What's interesting about this approach is that returning new instances each time means you could persist a chain of changes - an undo history of sorts! I must admit that I've never taken advantage of this in any way, but it's often struck me that it could be useful in some situations..
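As a quick sketch of the sort of thing I mean (nothing I've actually shipped; the EmployeeEditor name is made up), an editor could just hold on to a stack of the instances:

public class EmployeeEditor
{
  private readonly Stack<Employee> _history = new Stack<Employee>();

  public EmployeeEditor(Employee employee)
  {
    if (employee == null)
      throw new ArgumentNullException("employee");
    _history.Push(employee);
  }

  /// <summary>
  /// The state after the most recent change (or the initial state, if nothing has changed yet)
  /// </summary>
  public Employee Current { get { return _history.Peek(); } }

  public void AddRole(string role)
  {
    // Each "change" pushes a new immutable instance - the earlier states are all untouched
    _history.Push(Current.AddRole(role));
  }

  public void Undo()
  {
    if (_history.Count == 1)
      throw new InvalidOperationException("Nothing to undo");
    _history.Pop();
  }
}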

Some more views

While writing this post, I did a bit of research to try and make sure I wasn't saying anything either too done-to-death or too stupid - and the following links are articles I like, largely because they agree with me! :)

Immutable data structures are the way of the future in C#

http://blogs.msdn.com/b/ericlippert/archive/2007/10/04/path-finding-using-a-in-c-3-0-part-two.aspx

One of the reasons why immutable types can be faster is that they are able to take advantage of memory management optimisations developed over the years

http://en.csharp-online.net/CSharp_Coding_Solutions-Immutable_Types_Are_Scalable_Types

However there's also this one:

The "verbose constructor" is itself a good candidate for an anti-pattern for the following reasons:

http://blog.dezfowler.com/2009/05/always-valid-entity-anti-pattern.html

I've worked with Derek before, so although I read that article two or three times and couldn't agree with it, I didn't give up 'cos I know he's a bright guy. What I think he meant finally clicked for me when I read the comments on that piece - there are only four, and it's the last one that made it stick (partly because someone I work with now has a similar view, I think). The way I see things working together is that the validation in these "verbose constructors" is a last line of defence that ensures the object's state is valid; it is not business logic whose intention is to have a load of possibly-valid values thrown at it to see what sticks. There should be a nice validation layer between the UI and these constructors that only allows through allowable state and handles the aggregation of errors where required. The exceptions in the constructor should still be just that; exceptions, not the norm for invalid UI input.
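To make that division of labour concrete, a sketch of such a layer might look like this (the names are all made up and the DisplayToUser method is hypothetical):

public static class NameValidation
{
  /// <summary>
  /// UI-facing validation - this aggregates all of the problems so that they can be
  /// reported together, rather than letting the constructor throw on the first one
  /// </summary>
  public static IEnumerable<string> GetErrors(string title, string firstName, string lastName)
  {
    var errors = new List<string>();
    if ((title ?? "").Trim() == "")
      errors.Add("A title must be entered");
    if ((firstName ?? "").Trim() == "")
      errors.Add("A first name must be entered");
    if ((lastName ?? "").Trim() == "")
      errors.Add("A last name must be entered");
    return errors;
  }
}

// At the UI boundary..
var errors = NameValidation.GetErrors(title, firstName, lastName);
if (errors.Any())
  DisplayToUser(errors);
else
{
  // By this point the constructor's exceptions really would be exceptional
  var name = new Name(title, firstName, lastName);
}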

But in summary, I'm still all for these "verbose constructors" - as the final line of defence that means we needn't worry about instances of these immutable classes; if they exist, then they're valid. And I like that.

An immutable list (and the DefinedStringList class)

Since this code is a bit long to jam in the middle of the article, here it is in all its glory:

public class ImmutableList<T> : IEnumerable<T>
{
  private List<T> values;
  private IValueValidator<T> validator;
  public ImmutableList(IEnumerable<T> values, IValueValidator<T> validator)
  {
    if (values == null)
      throw new ArgumentNullException("values");

    var valuesList = new List<T>();
    foreach (var value in values)
    {
      if (validator != null)
      {
        try { validator.EnsureValid(value); }
        catch (Exception e)
        {
          throw new ArgumentException("Invalid reference encountered in values", e);
        }
      }
      valuesList.Add(value);
    }
    this.values = valuesList;
    this.validator = validator;
  }
  public ImmutableList(IEnumerable<T> values) : this(values, null) { }
  public ImmutableList(IValueValidator<T> validator, params T[] values)
    : this((IEnumerable<T>)values, validator) { }
  public ImmutableList(params T[] values) : this(null, values) { }

  public T this[int index]
  {
    get
    {
      if ((index < 0) || (index >= this.values.Count))
        throw new ArgumentOutOfRangeException("index");
      return this.values[index];
    }
  }

  public int Count
  {
    get { return this.values.Count; }
  }

  public bool Contains(T value)
  {
    return this.values.Contains(value);
  }

  public ImmutableList<T> Add(T value)
  {
    if (this.validator != null)
    {
      try { this.validator.EnsureValid(value); }
      catch (Exception e)
      {
        throw new ArgumentException("Invalid value", e);
      }
    }
    var valuesNew = new List<T>();
    valuesNew.AddRange(this.values);
    valuesNew.Add(value);
    return new ImmutableList<T>()
    {
      values = valuesNew,
      validator = this.validator
    };
  }

  /// <summary>
  /// Removes the first occurrence of a specific object
  /// </summary>
  public ImmutableList<T> Remove(T value)
  {
    var valuesNew = new List<T>();
    valuesNew.AddRange(this.values);
    valuesNew.Remove(value);
    return new ImmutableList<T>()
    {
      values = valuesNew,
      validator = this.validator
    };
  }

  /// <summary>
  /// This is just a convenience method so that derived types can call Add, Remove, etc.. and return
  /// instances of themselves without having to pass that data back through a constructor which will
  /// check each value against the validator even though we already know they're valid! Note: This
  /// can only be used by derived classes that don't have any new requirements of any type - we're
  /// setting only the values and validator references here!
  /// </summary>
  protected static U toDerivedClass<U>(ImmutableList<T> list) where U : ImmutableList<T>, new()
  {
    if (list == null)
      throw new ArgumentNullException("list");

    // Use the same trick as the methods above - we're changing the state of the object
    // after instantiation but, once this method has returned, the new instance can be
    // considered immutable
    return new U()
    {
      values = list.values,
      validator = list.validator
    };
  }

  public IEnumerator<T> GetEnumerator()
  {
    return this.values.GetEnumerator();
  }

  IEnumerator IEnumerable.GetEnumerator()
  {
    return GetEnumerator();
  }
}

public interface IValueValidator<T>
{
  /// <summary>
  /// This will throw an exception for a value that does not pass validation requirements
  /// </summary>
  void EnsureValid(T value);
}

That's all the setup required to enable a DefinedStringList class, which we can now create:

public class DefinedStringList : ImmutableList<string>
{
  public DefinedStringList(IEnumerable<string> values)
    : base(values, new NonNullOrEmptyWrappingValueValidator()) { }
  public DefinedStringList(params string[] values) : this((IEnumerable<string>)values) { }
  public DefinedStringList() : this(new string[0]) { }

  public new DefinedStringList Add(string value)
  {
    return toDerivedClass<DefinedStringList>(base.Add(value));
  }
  public new DefinedStringList Remove(string value)
  {
    return toDerivedClass<DefinedStringList>(base.Remove(value));
  }

  private class NonNullOrEmptyWrappingValueValidator : IValueValidator<string>
  {
    public void EnsureValid(string value)
    {
      if ((value ?? "").Trim() == "")
        throw new ArgumentException("Null/empty value specified");
    }
  }
}

These are actually cut-down versions of classes I've got in one of my projects that also includes AddRange, Insert, RemoveAt, Contains(T value, IEqualityComparer comparer), etc.. but this is more than enough to get the gist. At some point I may look into that GitHub thing..

Immutability purity

A final side note(*) - you might notice that internally the ImmutableList does actually participate in some mutability! When the Add method is called, we validate the new value (if required), then create a new instance of the class with no data and assign its internal "values" and "validator" references directly - sidestepping the looping over all of the data in the constructor, which is unnecessary since we already know the values are valid (that's part of the point of the class!). It feels like a bit of a trick, updating these private references after creating the new instance, and it's only possible because we've just created the instance ourselves and the new object is an instance of the class that is performing the work. I don't know if there's a phrase to describe this method and I was a bit surprised to discover that it could be done, since it has a feeling of breaking the "private" member contract!

* I don't want to go into too much detail since I want to talk about this further another time!
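Regarding that "private member contract" feeling - the rule that makes this legal is that "private" in C# restricts access to the type, not to the instance; code inside a class may freely access the private members of other instances of the same class. A tiny standalone example of the same trick:

public class Wrapper
{
  private string _value;
  public Wrapper(string value) { _value = value; }

  public Wrapper CopyWithSuffix(string suffix)
  {
    // Setting another instance's private field is fine here, because this code
    // lives inside the Wrapper class itself
    return new Wrapper(null) { _value = _value + suffix };
  }
}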

Update (26th November 2012): A re-visit of this principle can be seen in the post Persistent Immutable Lists which has an alternate implementation of the immutable list with improved performance but all of the immutability-based safety!

Posted at 20:14

Comments