Meta-programming in C# with JIT dead code removal and inlining


Abstract

Thanks to Mike for reviewing this.

This is a way to enable compile-time customization of classes/functions in the style of C++ template meta-programming as in Modern C++ Design. In particular, we are going to implement the policy pattern, which is a compile-time version of the strategy pattern.

What do we gain by constraining ourselves to compile-time customization, instead of run-time customization? High performance. Blazingly high performance. You gain an abstraction without paying for it.

Too good to be true? Read on!

BTW1: none of this is new. It has been floating around in various forms, but I have never seen it explained fully and associated with the policy pattern. BTW2: it is also related to the uber-cool Shape proposal for C#, where it becomes an implementation detail.

Implementation

First, the usual plethora of namespaces … BenchmarkDotNet is really heavy into separating abstractions in different namespaces …

using System.Runtime.CompilerServices;
using System.Threading;

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Diagnostics.Windows.Configs;
using BenchmarkDotNet.Mathematics;
using BenchmarkDotNet.Order;
using BenchmarkDotNet.Running;

Let’s take a look at the strategy pattern, as more traditionally implemented. You have an interface IIncrementer and two implementations, either thread-safe or not. Let’s also add a couple of struct implementations so that we can measure the performance difference between implementing by class and by struct.

public interface IIncrementer { void Increment(ref int location); }

public sealed class CStandardIncrementer        : IIncrementer {[MethodImpl(MethodImplOptions.AggressiveInlining)] public void Increment(ref int location) => location += 1; }
public sealed class CInterlockedIncrementer     : IIncrementer {[MethodImpl(MethodImplOptions.AggressiveInlining)] public void Increment(ref int location) => Interlocked.Increment(ref location); }

public readonly struct SStandardIncrementer     : IIncrementer {[MethodImpl(MethodImplOptions.AggressiveInlining)] public void Increment(ref int location) => location += 1; }
public readonly struct SInterlockedIncrementer  : IIncrementer {[MethodImpl(MethodImplOptions.AggressiveInlining)] public void Increment(ref int location) => Interlocked.Increment(ref location); }

We then need a class that can be customized by an Incrementer. Think of it as: the policy of incrementing something is independent of the class in question.

Let’s take a Counter and call it Dynamic, as we want to be able to customize it at run time. We need to keep things simple so that looking at the ASM is doable. Also, we inline everything to try to squeeze the most performance out of our program.

public class DynamicCounter<T> where T: IIncrementer
{
    int _count;
    T _incrementer;

    public DynamicCounter(T incrementer) => _incrementer = incrementer;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Increment() => _incrementer.Increment(ref _count);
}
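As a quick illustration (the `Count` accessor and `Demo` class are mine, added so we can observe the result), the incrementing policy is picked at run time through the constructor argument:

```csharp
using System;
using System.Threading;

public interface IIncrementer { void Increment(ref int location); }
public sealed class CStandardIncrementer    : IIncrementer { public void Increment(ref int location) => location += 1; }
public sealed class CInterlockedIncrementer : IIncrementer { public void Increment(ref int location) => Interlocked.Increment(ref location); }

public class DynamicCounter<T> where T : IIncrementer
{
    int _count;
    T _incrementer;

    public DynamicCounter(T incrementer) => _incrementer = incrementer;

    public void Increment() => _incrementer.Increment(ref _count);

    public int Count => _count; // accessor added for the demo
}

public static class Demo
{
    public static void Main()
    {
        // The policy is chosen at run time by passing a different instance.
        var plain  = new DynamicCounter<IIncrementer>(new CStandardIncrementer());
        var atomic = new DynamicCounter<IIncrementer>(new CInterlockedIncrementer());
        plain.Increment();
        atomic.Increment();
        Console.WriteLine(plain.Count + " " + atomic.Count); // prints "1 1"
    }
}
```

This flexibility is exactly what costs us at run time: the call through the interface-typed field is a virtual dispatch.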

Then we look at how to implement the strategy pattern at compile time (transforming it magically into the policy pattern).

(No one ever talks about the negative aspects of giving names to things; one of these days I’ll blog about it …)

There are many ways to go about it. They all rely on the fact that the JIT compiler instantiates a different type for each struct used to customize the generic type (aka each T).

For each one of these types, the JIT knows at compile time what T is, which enables certain optimizations.

The optimization exploited by StaticCounterInterface is that the call to Increment becomes a non-virtual call that can be inlined.

public class StaticCounterInterface<T> where T : struct, IIncrementer
{
    int _count;
    T _incrementer = new T();

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Increment() => _incrementer.Increment(ref _count);
}
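Usage-wise (again, the `Count` accessor and `Demo` class are my additions), the policy is now selected purely by the type argument; no instance needs to be passed in, and each instantiation gets its own JITted code:

```csharp
using System;
using System.Threading;

public interface IIncrementer { void Increment(ref int location); }
public readonly struct SStandardIncrementer    : IIncrementer { public void Increment(ref int location) => location += 1; }
public readonly struct SInterlockedIncrementer : IIncrementer { public void Increment(ref int location) => Interlocked.Increment(ref location); }

public class StaticCounterInterface<T> where T : struct, IIncrementer
{
    int _count;
    T _incrementer = new T(); // the struct constraint guarantees a parameterless constructor

    public void Increment() => _incrementer.Increment(ref _count);

    public int Count => _count; // accessor added for the demo
}

public static class Demo
{
    public static void Main()
    {
        // The policy is part of the type: no constructor argument, no virtual call.
        var plain  = new StaticCounterInterface<SStandardIncrementer>();
        var atomic = new StaticCounterInterface<SInterlockedIncrementer>();
        plain.Increment();
        atomic.Increment();
        Console.WriteLine(plain.Count + " " + atomic.Count); // prints "1 1"
    }
}
```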

For StaticCounterMatch the situation is even more obvious.

The Jitter doesn’t generate code for any of the if statements. It just puts the right code for the type T directly in the body of the Increment method.

It is as if the if statements were executed at compile time, as with C++ templates. Also notice the IncrementRaw method, used for perf comparison.

public class StaticCounterMatch<T>
{
    int _count;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Increment() {
        if (typeof(T) == typeof(SStandardIncrementer))      _count += 1;
        if (typeof(T) == typeof(SInterlockedIncrementer))   Interlocked.Increment(ref _count);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void IncrementRaw() => _count += 1;
}
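A caveat worth making explicit (demo code and `Count` accessor are mine): because StaticCounterMatch<T> places no constraint on T, instantiating it with a type that matches neither branch makes Increment compile to a silent no-op:

```csharp
using System;
using System.Threading;

public readonly struct SStandardIncrementer    { public void Increment(ref int location) => location += 1; }
public readonly struct SInterlockedIncrementer { public void Increment(ref int location) => Interlocked.Increment(ref location); }

public class StaticCounterMatch<T>
{
    int _count;

    public void Increment()
    {
        // At most one comparison is true for a given T;
        // the JIT removes the others as dead code.
        if (typeof(T) == typeof(SStandardIncrementer))    _count += 1;
        if (typeof(T) == typeof(SInterlockedIncrementer)) Interlocked.Increment(ref _count);
    }

    public int Count => _count; // accessor added for the demo
}

public static class Demo
{
    public static void Main()
    {
        var matched   = new StaticCounterMatch<SStandardIncrementer>();
        var unmatched = new StaticCounterMatch<string>(); // no branch matches string
        matched.Increment();
        unmatched.Increment();
        Console.WriteLine(matched.Count);   // prints "1"
        Console.WriteLine(unmatched.Count); // prints "0": Increment did nothing
    }
}
```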

We are now ready for the performance testing. We create one instance for each combination and benchmark its Increment method.

A few comments on the attributes:

  1. DryCoreJob doesn't run the benchmark; it just runs the diagnosers, which in this case produce the assembly code.
  2. InProcessAttribute makes everything go faster, but cannot be used to generate assembly.
  3. DisassemblyDiagnoser creates files with the assembly code.
  4. RankColumn generates a nicer table.

//[DryCoreJob]
//[InProcessAttribute]
[DisassemblyDiagnoser(printAsm: true, printPrologAndEpilog: true, printIL: false, printSource: false, recursiveDepth: 3)]
[Orderer(SummaryOrderPolicy.FastestToSlowest)]
[RankColumn(NumeralSystem.Stars)]
public class MainClass
{
    DynamicCounter<IIncrementer>         dynInterface           = new DynamicCounter<IIncrementer>(new CStandardIncrementer());
    DynamicCounter<CStandardIncrementer> dynConcrete            = new DynamicCounter<CStandardIncrementer>(new CStandardIncrementer());
    DynamicCounter<SStandardIncrementer> dynStruct              = new DynamicCounter<SStandardIncrementer>(new SStandardIncrementer());
    StaticCounterMatch<SStandardIncrementer> staticCounterM     = new StaticCounterMatch<SStandardIncrementer>();
    StaticCounterInterface<SStandardIncrementer> staticCounterI = new StaticCounterInterface<SStandardIncrementer>();

    [Benchmark] public void DynamicInterface()                      => dynInterface.Increment();
    [Benchmark] public void DynamicConcrete()                       => dynConcrete.Increment();
    [Benchmark] public void DynamicStruct()                         => dynStruct.Increment();
    [Benchmark] public void StaticCounterInterface()                => staticCounterI.Increment();
    [Benchmark] public void StaticCounterMatch()                    => staticCounterM.Increment();
    [Benchmark(Baseline = true)] public void IncrementRaw()         => staticCounterM.IncrementRaw();

    public static void Main() => BenchmarkRunner.Run<MainClass>();
}

Results

Please note that the results are valid just for the tested configuration.

I have no reason to think they would be different on other modern runtimes/OSs, as the optimizations involved are quite well known.


BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17763.253 (1809/October2018Update/Redstone5)
Intel Core i7-6600U CPU 2.60GHz (Skylake), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=3.0.100-preview-009812
  [Host]     : .NET Core 2.0.9 (CoreCLR 4.6.26614.01, CoreFX 4.6.26614.01), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.9 (CoreCLR 4.6.26614.01, CoreFX 4.6.26614.01), 64bit RyuJIT


|                 Method |      Mean |     Error |    StdDev |    Median |  Ratio | RatioSD |  Rank |
|----------------------- |----------:|----------:|----------:|----------:|-------:|--------:|------:|
| StaticCounterInterface | 0.0000 ns | 0.0000 ns | 0.0000 ns | 0.0000 ns |  0.000 |    0.00 |     * |
| IncrementRaw           | 0.1036 ns | 0.0515 ns | 0.0573 ns | 0.1071 ns |  1.000 |    0.00 |    ** |
| StaticCounterMatch     | 0.1122 ns | 0.0422 ns | 0.0943 ns | 0.1020 ns |  2.092 |    1.95 |    ** |
| DynamicStruct          | 0.2707 ns | 0.0910 ns | 0.2683 ns | 0.1407 ns |  4.135 |    6.28 |   *** |
| DynamicConcrete        | 1.9216 ns | 0.1506 ns | 0.4440 ns | 1.7883 ns | 23.417 |   21.44 |  **** |
| DynamicInterface       | 2.2441 ns | 0.1170 ns | 0.3449 ns | 2.1470 ns | 32.783 |   30.52 | ***** |

As expected, you gain an order of magnitude in performance by foregoing run-time customization, except when using a struct, as the optimizer manages to inline that one (as we’ll see).

Notice that these numbers are really low. In fact, the order of the first four lines might change when you run it, but they are always much faster than the rest.

But why? Let’s look at the generated code.

IL and ASM

First, let’s look at the IL for a few of the methods.

; MainClass.IncrementRaw()
       IL_0000: ldarg.0
       IL_0001: ldfld StaticCounterMatch`1<SStandardIncrementer> MainClass::staticCounterM
       IL_0006: callvirt System.Void StaticCounterMatch`1<SStandardIncrementer>::IncrementRaw()
       IL_000b: ret

; MainClass.StaticCounterInterface()
       IL_0000: ldarg.0
       IL_0001: ldfld StaticCounterInterface`1<SStandardIncrementer> MainClass::staticCounterI
       IL_0006: callvirt System.Void StaticCounterInterface`1<SStandardIncrementer>::Increment()
       IL_000b: ret

; MainClass.StaticCounterMatch()
       IL_0000: ldarg.0
       IL_0001: ldfld StaticCounterMatch`1<SStandardIncrementer> MainClass::staticCounterM
       IL_0006: callvirt System.Void StaticCounterMatch`1<SStandardIncrementer>::Increment()
       IL_000b: ret

OK, nothing interesting there apart from the use of callvirt when you would think a standard call would do (i.e. for IncrementRaw).

I vaguely remember from my C# compiler days that the compiler does that as a way to short-circuit the test for null, as callvirt performs it automatically.
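A quick way to observe this (the snippet is mine): invoking even a non-virtual instance method through a null reference throws NullReferenceException before the method body runs, because callvirt checks the receiver first:

```csharp
using System;

public class Greeter
{
    // Non-virtual method: a plain `call` would be enough semantically,
    // but the C# compiler emits `callvirt` to get the null check for free.
    public string Hello() => "hello";
}

public static class Demo
{
    public static void Main()
    {
        Greeter g = null;
        try
        {
            g.Hello(); // callvirt dereferences `g` first, so this throws
        }
        catch (NullReferenceException)
        {
            Console.WriteLine("NullReferenceException, as expected");
        }
    }
}
```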

The assembly code is more interesting. Let’s start with the three fast, resolved-at-compile-time methods.

BTW: remember that looking at optimized ASM code is like peering into a muddy lake with foggy glasses. Let’s do it.

; MainClass.IncrementRaw()
       mov     rax,qword ptr [rcx+20h]
       inc     dword ptr [rax+8]
       ret

; MainClass.StaticCounterInterface()
       mov     rax,qword ptr [rcx+28h]
       (*)mov     edx,dword ptr [rax]
       add     rax,8
       inc     dword ptr [rax]
       ret

; MainClass.StaticCounterMatch()
       mov     rax,qword ptr [rcx+20h]
       (*)mov     edx,dword ptr [rax]
       inc     dword ptr [rax+8]
       ret

; MainClass.DynamicStruct()
       mov     rax,qword ptr [rcx+18h]
       (*)mov     edx,dword ptr [rax]
       add     rax,8
       inc     dword ptr [rax]
       ret

Yep, they are all the same (apart from the mysterious (*) instruction)! Get the memory location of the field, increment it.

The Jitter has completely inlined the code. It is as if you had written the incrementing code directly into the function, despite composing the type out of two independent abstractions.

I think that is pretty cool! You can abstract your code properly and not pay the price for it (well, apart from one assembly instruction).

For the sake of completeness, let’s look at the assembly code for the dynamic dispatching cases.

; MainClass.DynamicConcrete()
       mov     rcx,qword ptr [rcx+10h]
       cmp     dword ptr [rcx],ecx
       lea     rdx,[rcx+10h]
       mov     rcx,qword ptr [rcx+8]
       mov     r11,7FF8C2B30470h
       mov     rax,qword ptr [r11]
       cmp     dword ptr [rcx],ecx
       jmp     rax

; MainClass.DynamicInterface()
       mov     rcx,qword ptr [rcx+8]
       cmp     dword ptr [rcx],ecx
       lea     rdx,[rcx+10h]
       mov     rcx,qword ptr [rcx+8]
       mov     r11,7FF8C2B10470h
       mov     rax,qword ptr [r11]
       cmp     dword ptr [rcx],ecx
       jmp     rax

The first thing to notice is that the code is identical, despite the seemingly different declarations. The Jitter doesn’t care.

Notice the machinations the Jitter performs, very likely related to dynamic dispatching, to calculate the address to finally jump to. That’s where our Increment method is located.

No wonder it is slower.

Summary

If you can afford to use the policy pattern instead of the more generic strategy pattern (i.e. compile-time vs run-time dispatch) and/or you need bare-metal performance, consider the code above.

As for me, I plan to use it in the near future for a few low-level abstractions (e.g. memory allocators).
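For instance, a hypothetical sketch (all names here are mine, not from any real library beyond ArrayPool): a buffer type parameterized by a struct allocation policy, so the JIT specializes and inlines each policy just as with the incrementers above:

```csharp
using System;
using System.Buffers;

// Hypothetical policy interface and policies; names are illustrative only.
public interface IAllocationPolicy { int[] Allocate(int size); void Free(int[] buffer); }

public readonly struct NewAllocator : IAllocationPolicy
{
    public int[] Allocate(int size) => new int[size]; // plain GC allocation
    public void Free(int[] buffer) { }                // nothing to do, the GC owns it
}

public readonly struct PoolAllocator : IAllocationPolicy
{
    public int[] Allocate(int size) => ArrayPool<int>.Shared.Rent(size);
    public void Free(int[] buffer) => ArrayPool<int>.Shared.Return(buffer);
}

// The buffer is customized at compile time by its allocation policy.
public class PolicyBuffer<TAlloc> where TAlloc : struct, IAllocationPolicy
{
    readonly TAlloc _alloc = new TAlloc();

    public int[] Data { get; private set; }

    public void Acquire(int size) => Data = _alloc.Allocate(size);

    public void Release()
    {
        _alloc.Free(Data);
        Data = null;
    }
}

public static class Demo
{
    public static void Main()
    {
        var pooled = new PolicyBuffer<PoolAllocator>();
        pooled.Acquire(16);
        // Note: a pool may hand back a larger array than requested.
        Console.WriteLine(pooled.Data.Length >= 16); // prints "True"
        pooled.Release();
    }
}
```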
