Meta-programming in C# with JIT dead code removal and inlining


Abstract

Thanks to Mike for reviewing this.

This is a way to enable compile-time customization of classes/functions in the style of C++ template meta-programming as in Modern C++ Design. In particular, we are going to implement the policy pattern, which is a compile-time version of the strategy pattern.

What do we gain by constraining ourselves to compile-time customization, instead of run-time customization? High performance. Blazingly high performance. You gain an abstraction without paying for it.

Too good to be true? Read on!

BTW1: none of this is new. It has been floating around in various forms, but I have never seen it explained fully and associated with the policy pattern. BTW2: it is also related to the uber-cool Shape proposal for C#, where it becomes an implementation detail.

Implementation

First, the usual plethora of namespaces … BenchmarkDotNet is really heavy into separating abstractions in different namespaces …

using System.Runtime.CompilerServices;
using System.Threading;

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Diagnostics.Windows.Configs;
using BenchmarkDotNet.Mathematics;
using BenchmarkDotNet.Order;
using BenchmarkDotNet.Running;

Let’s take a look at the strategy pattern, as more traditionally implemented. You have an interface IIncrementer and two implementations, either thread-safe or not. Let’s also add a couple of struct implementations so that we can measure the performance difference between implementing by class and by struct.

public interface IIncrementer { void Increment(ref int location); }

public sealed class CStandardIncrementer        : IIncrementer {[MethodImpl(MethodImplOptions.AggressiveInlining)] public void Increment(ref int location) => location += 1; }
public sealed class CInterlockedIncrementer     : IIncrementer {[MethodImpl(MethodImplOptions.AggressiveInlining)] public void Increment(ref int location) => Interlocked.Increment(ref location); }

public readonly struct SStandardIncrementer     : IIncrementer {[MethodImpl(MethodImplOptions.AggressiveInlining)] public void Increment(ref int location) => location += 1; }
public readonly struct SInterlockedIncrementer  : IIncrementer {[MethodImpl(MethodImplOptions.AggressiveInlining)] public void Increment(ref int location) => Interlocked.Increment(ref location); }

We then need a class that can be customized by an Incrementer. Think of it as: the policy of incrementing something is independent of the class in question.

Let’s take a Counter and call it Dynamic, as we want to be able to customize it at run time. We need to keep things simple so that looking at the ASM is doable. Also, we inline everything to try to squeeze the most performance out of our program.

public class DynamicCounter<T> where T: IIncrementer
{
    int _count;
    T _incrementer;

    public DynamicCounter(T incrementer) => _incrementer = incrementer;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Increment() => _incrementer.Increment(ref _count);
}
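As a quick illustration (the `Count` accessor and `Demo` class are mine, added so we can observe the result), the incrementing policy is picked at run time through the constructor argument:

```csharp
using System;
using System.Threading;

public interface IIncrementer { void Increment(ref int location); }
public sealed class CStandardIncrementer    : IIncrementer { public void Increment(ref int location) => location += 1; }
public sealed class CInterlockedIncrementer : IIncrementer { public void Increment(ref int location) => Interlocked.Increment(ref location); }

public class DynamicCounter<T> where T : IIncrementer
{
    int _count;
    T _incrementer;

    public DynamicCounter(T incrementer) => _incrementer = incrementer;

    public void Increment() => _incrementer.Increment(ref _count);

    public int Count => _count; // accessor added for the demo
}

public static class Demo
{
    public static void Main()
    {
        // The policy is chosen at run time by passing a different instance.
        var plain  = new DynamicCounter<IIncrementer>(new CStandardIncrementer());
        var atomic = new DynamicCounter<IIncrementer>(new CInterlockedIncrementer());
        plain.Increment();
        atomic.Increment();
        Console.WriteLine(plain.Count + " " + atomic.Count); // prints "1 1"
    }
}
```

This flexibility is exactly what costs us at run time: the call through the interface-typed field is a virtual dispatch.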

Then we look at how to implement the strategy pattern at compile time (transforming it magically into the policy pattern).

(No one ever talks about the negative aspects of giving names to things; one of these days I’ll blog about it …)

There are many ways to go about it. They all rely on the fact that the JIT compiler instantiates a different type for each struct used to customize the generic type (aka each T).

For each one of these types, the JIT knows at compile time what T is, which enables certain optimizations.

The optimization exploited by StaticCounterInterface is that the call to Increment becomes a non-virtual call that can be inlined.

public class StaticCounterInterface<T> where T : struct, IIncrementer
{
    int _count;
    T _incrementer = new T();

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Increment() => _incrementer.Increment(ref _count);
}
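Usage-wise (again, the `Count` accessor and `Demo` class are my additions), the policy is now selected purely by the type argument; no instance needs to be passed in, and each instantiation gets its own JITted code:

```csharp
using System;
using System.Threading;

public interface IIncrementer { void Increment(ref int location); }
public readonly struct SStandardIncrementer    : IIncrementer { public void Increment(ref int location) => location += 1; }
public readonly struct SInterlockedIncrementer : IIncrementer { public void Increment(ref int location) => Interlocked.Increment(ref location); }

public class StaticCounterInterface<T> where T : struct, IIncrementer
{
    int _count;
    T _incrementer = new T(); // the struct constraint guarantees a parameterless constructor

    public void Increment() => _incrementer.Increment(ref _count);

    public int Count => _count; // accessor added for the demo
}

public static class Demo
{
    public static void Main()
    {
        // The policy is part of the type: no constructor argument, no virtual call.
        var plain  = new StaticCounterInterface<SStandardIncrementer>();
        var atomic = new StaticCounterInterface<SInterlockedIncrementer>();
        plain.Increment();
        atomic.Increment();
        Console.WriteLine(plain.Count + " " + atomic.Count); // prints "1 1"
    }
}
```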

For StaticCounterMatch the situation is even more obvious.

The Jitter doesn’t generate code for any of the if statements. It just puts the right code for the type T directly in the body of the Increment method.

It is as if the if statements were executed at compile time, as with C++ templates. Also notice the IncrementRaw method, used for perf comparison.

public class StaticCounterMatch<T>
{
    int _count;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Increment() {
        if (typeof(T) == typeof(SStandardIncrementer))      _count += 1;
        if (typeof(T) == typeof(SInterlockedIncrementer))   Interlocked.Increment(ref _count);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void IncrementRaw() => _count += 1;
}
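A caveat worth making explicit (demo code and `Count` accessor are mine): because StaticCounterMatch<T> places no constraint on T, instantiating it with a type that matches neither branch makes Increment compile to a silent no-op:

```csharp
using System;
using System.Threading;

public readonly struct SStandardIncrementer    { public void Increment(ref int location) => location += 1; }
public readonly struct SInterlockedIncrementer { public void Increment(ref int location) => Interlocked.Increment(ref location); }

public class StaticCounterMatch<T>
{
    int _count;

    public void Increment()
    {
        // At most one comparison is true for a given T;
        // the JIT removes the others as dead code.
        if (typeof(T) == typeof(SStandardIncrementer))    _count += 1;
        if (typeof(T) == typeof(SInterlockedIncrementer)) Interlocked.Increment(ref _count);
    }

    public int Count => _count; // accessor added for the demo
}

public static class Demo
{
    public static void Main()
    {
        var matched   = new StaticCounterMatch<SStandardIncrementer>();
        var unmatched = new StaticCounterMatch<string>(); // no branch matches string
        matched.Increment();
        unmatched.Increment();
        Console.WriteLine(matched.Count);   // prints "1"
        Console.WriteLine(unmatched.Count); // prints "0": Increment did nothing
    }
}
```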

We are now ready for the performance testing. We create one instance for each combination and benchmark its Increment method.

A few comments on the attributes:

  1. DryCoreJob doesn't run the benchmark; it just runs the diagnosers, which in this case produce the assembly code.
  2. InProcessAttribute makes everything go faster, but cannot be used to generate assembly.
  3. DisassemblyDiagnoser creates files with the assembly code.
  4. RankColumn generates a nicer table.

//[DryCoreJob]
//[InProcessAttribute]
[DisassemblyDiagnoser(printAsm: true, printPrologAndEpilog: true, printIL: false, printSource: false, recursiveDepth: 3)]
[Orderer(SummaryOrderPolicy.FastestToSlowest)]
[RankColumn(NumeralSystem.Stars)]
public class MainClass
{
    DynamicCounter<IIncrementer>         dynInterface           = new DynamicCounter<IIncrementer>(new CStandardIncrementer());
    DynamicCounter<CStandardIncrementer> dynConcrete            = new DynamicCounter<CStandardIncrementer>(new CStandardIncrementer());
    DynamicCounter<SStandardIncrementer> dynStruct              = new DynamicCounter<SStandardIncrementer>(new SStandardIncrementer());
    StaticCounterMatch<SStandardIncrementer> staticCounterM     = new StaticCounterMatch<SStandardIncrementer>();
    StaticCounterInterface<SStandardIncrementer> staticCounterI = new StaticCounterInterface<SStandardIncrementer>();

    [Benchmark] public void DynamicInterface()                      => dynInterface.Increment();
    [Benchmark] public void DynamicConcrete()                       => dynConcrete.Increment();
    [Benchmark] public void DynamicStruct()                         => dynStruct.Increment();
    [Benchmark] public void StaticCounterInterface()                => staticCounterI.Increment();
    [Benchmark] public void StaticCounterMatch()                    => staticCounterM.Increment();
    [Benchmark(Baseline = true)] public void IncrementRaw()         => staticCounterM.IncrementRaw();

    public static void Main() => BenchmarkRunner.Run<MainClass>();
}

Results

Please note that the results are valid just for the tested configuration.

I have no reason to think they would be different on other modern runtimes/OSs, as the optimizations involved are quite well known.


BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17763.253 (1809/October2018Update/Redstone5)
Intel Core i7-6600U CPU 2.60GHz (Skylake), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=3.0.100-preview-009812
  [Host]     : .NET Core 2.0.9 (CoreCLR 4.6.26614.01, CoreFX 4.6.26614.01), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.9 (CoreCLR 4.6.26614.01, CoreFX 4.6.26614.01), 64bit RyuJIT


|                 Method |      Mean |     Error |    StdDev |    Median |  Ratio | RatioSD |  Rank |
|----------------------- |----------:|----------:|----------:|----------:|-------:|--------:|------:|
| StaticCounterInterface | 0.0000 ns | 0.0000 ns | 0.0000 ns | 0.0000 ns |  0.000 |    0.00 |     * |
| IncrementRaw           | 0.1036 ns | 0.0515 ns | 0.0573 ns | 0.1071 ns |  1.000 |    0.00 |    ** |
| StaticCounterMatch     | 0.1122 ns | 0.0422 ns | 0.0943 ns | 0.1020 ns |  2.092 |    1.95 |    ** |
| DynamicStruct          | 0.2707 ns | 0.0910 ns | 0.2683 ns | 0.1407 ns |  4.135 |    6.28 |   *** |
| DynamicConcrete        | 1.9216 ns | 0.1506 ns | 0.4440 ns | 1.7883 ns | 23.417 |   21.44 |  **** |
| DynamicInterface       | 2.2441 ns | 0.1170 ns | 0.3449 ns | 2.1470 ns | 32.783 |   30.52 | ***** |

As expected, you gain an order of magnitude in performance by foregoing run-time customization, except when using a struct, as the optimizer manages to inline that one (as we’ll see).

Notice that these numbers are really low. In fact, the order of the first four lines might change when you run it, but they are always much faster than the rest.

But why? Let’s look at the generated code.

IL and ASM

First, let’s look at the IL for a few of the methods.

; MainClass.IncrementRaw()
       IL_0000: ldarg.0
       IL_0001: ldfld StaticCounterMatch`1<SStandardIncrementer> MainClass::staticCounterM
       IL_0006: callvirt System.Void StaticCounterMatch`1<SStandardIncrementer>::IncrementRaw()
       IL_000b: ret

; MainClass.StaticCounterInterface()
       IL_0000: ldarg.0
       IL_0001: ldfld StaticCounterInterface`1<SStandardIncrementer> MainClass::staticCounterI
       IL_0006: callvirt System.Void StaticCounterInterface`1<SStandardIncrementer>::Increment()
       IL_000b: ret

; MainClass.StaticCounterMatch()
       IL_0000: ldarg.0
       IL_0001: ldfld StaticCounterMatch`1<SStandardIncrementer> MainClass::staticCounterM
       IL_0006: callvirt System.Void StaticCounterMatch`1<SStandardIncrementer>::Increment()
       IL_000b: ret

OK, nothing interesting there apart from the use of callvirt when you would think a standard call would do (i.e. for IncrementRaw).

I vaguely remember from my C# compiler days that the compiler does that as a way to short-circuit the test for null, as callvirt performs it automatically.
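A quick way to observe this (the snippet is mine): invoking even a non-virtual instance method through a null reference throws NullReferenceException before the method body runs, because callvirt checks the receiver first:

```csharp
using System;

public class Greeter
{
    // Non-virtual method: a plain `call` would be enough semantically,
    // but the C# compiler emits `callvirt` to get the null check for free.
    public string Hello() => "hello";
}

public static class Demo
{
    public static void Main()
    {
        Greeter g = null;
        try
        {
            g.Hello(); // callvirt dereferences `g` first, so this throws
        }
        catch (NullReferenceException)
        {
            Console.WriteLine("NullReferenceException, as expected");
        }
    }
}
```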

The assembly code is more interesting. Let’s start with the three fast, resolved-at-compile-time methods.

BTW: remember that looking at optimized ASM code is like peering into a muddy lake with foggy glasses. Let’s do it.

; MainClass.IncrementRaw()
       mov     rax,qword ptr [rcx+20h]
       inc     dword ptr [rax+8]
       ret

; MainClass.StaticCounterInterface()
       mov     rax,qword ptr [rcx+28h]
       (*)mov     edx,dword ptr [rax]
       add     rax,8
       inc     dword ptr [rax]
       ret

; MainClass.StaticCounterMatch()
       mov     rax,qword ptr [rcx+20h]
       (*)mov     edx,dword ptr [rax]
       inc     dword ptr [rax+8]
       ret

; MainClass.DynamicStruct()
       mov     rax,qword ptr [rcx+18h]
       (*)mov     edx,dword ptr [rax]
       add     rax,8
       inc     dword ptr [rax]
       ret

Yep, they are all the same (apart from the mysterious (*) instruction)! Get the memory location of the field, increment it.

The Jitter has completely inlined the code. It is as if you had written the incrementing code directly into the function, despite composing the type out of two independent abstractions.

I think that is pretty cool! You can abstract your code properly and not pay the price for it (well, apart from one assembly instruction).

For the sake of completeness, let’s look at the assembly code for the dynamic dispatching cases.

; MainClass.DynamicConcrete()
       mov     rcx,qword ptr [rcx+10h]
       cmp     dword ptr [rcx],ecx
       lea     rdx,[rcx+10h]
       mov     rcx,qword ptr [rcx+8]
       mov     r11,7FF8C2B30470h
       mov     rax,qword ptr [r11]
       cmp     dword ptr [rcx],ecx
       jmp     rax

; MainClass.DynamicInterface()
       mov     rcx,qword ptr [rcx+8]
       cmp     dword ptr [rcx],ecx
       lea     rdx,[rcx+10h]
       mov     rcx,qword ptr [rcx+8]
       mov     r11,7FF8C2B10470h
       mov     rax,qword ptr [r11]
       cmp     dword ptr [rcx],ecx
       jmp     rax

The first thing to notice is that the code is identical, despite the seemingly different declarations. The Jitter doesn’t care.

Notice the machinations the Jitter performs, very likely related to dynamic dispatching, to calculate the address to finally jump to. That’s where our Increment method is located.

No wonder it is slower.

Summary

If you can afford to use the policy pattern instead of the more generic strategy pattern (i.e. compile-time vs run-time dispatch) and/or you need bare-metal performance, consider the code above.

As for me, I plan to use it in the near future for a few low-level abstractions (e.g. memory allocators).
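For instance, a hypothetical sketch (all names here are mine, not from any real library beyond ArrayPool): a buffer type parameterized by a struct allocation policy, so the JIT specializes and inlines each policy just as with the incrementers above:

```csharp
using System;
using System.Buffers;

// Hypothetical policy interface and policies; names are illustrative only.
public interface IAllocationPolicy { int[] Allocate(int size); void Free(int[] buffer); }

public readonly struct NewAllocator : IAllocationPolicy
{
    public int[] Allocate(int size) => new int[size]; // plain GC allocation
    public void Free(int[] buffer) { }                // nothing to do, the GC owns it
}

public readonly struct PoolAllocator : IAllocationPolicy
{
    public int[] Allocate(int size) => ArrayPool<int>.Shared.Rent(size);
    public void Free(int[] buffer) => ArrayPool<int>.Shared.Return(buffer);
}

// The buffer is customized at compile time by its allocation policy.
public class PolicyBuffer<TAlloc> where TAlloc : struct, IAllocationPolicy
{
    readonly TAlloc _alloc = new TAlloc();

    public int[] Data { get; private set; }

    public void Acquire(int size) => Data = _alloc.Allocate(size);

    public void Release()
    {
        _alloc.Free(Data);
        Data = null;
    }
}

public static class Demo
{
    public static void Main()
    {
        var pooled = new PolicyBuffer<PoolAllocator>();
        pooled.Acquire(16);
        // Note: a pool may hand back a larger array than requested.
        Console.WriteLine(pooled.Data.Length >= 16); // prints "True"
        pooled.Release();
    }
}
```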
