Stack on the stack
`ref struct` was introduced in C# 7.2 to give us some optimization opportunities. So let's try to implement a `ref struct` Stack and see how it goes.
The simplest implementation I can think of is the following.
```csharp
using System;
using System.Diagnostics;

ref struct Stack<T>
{
    private readonly Span<T> storage;
    private int count;

    public Stack(Span<T> storage)
    {
        Debug.Assert(!storage.IsEmpty, "The storage cannot be empty");
        this.storage = storage;
        this.count = 0;
    }

    public readonly int Count => count;

    public void Push(T value)
    {
        // No explicit capacity check: pushing past the capacity relies on the
        // Span indexer throwing an IndexOutOfRangeException.
        storage[count++] = value;
    }

    public readonly ref T Peek()
    {
        return ref storage[count - 1];
    }

    public ref T Pop()
    {
        return ref storage[--count];
    }
}
```
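To give an idea of the intended usage: with a value type, both the stack and its storage can live on the call stack. Here is a minimal usage sketch (the buffer size is arbitrary):

```csharp
// Minimal usage sketch: stack-allocate the backing buffer, then wrap it.
Span<int> buffer = stackalloc int[16];
var stack = new Stack<int>(buffer);

stack.Push(1);
stack.Push(2);

int top = stack.Peek();   // 2
int popped = stack.Pop(); // 2, Count drops back to 1
```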
One way to fix the exception type issue (pushing past the capacity throws an IndexOutOfRangeException where a caller would expect an InvalidOperationException) is to catch and rethrow, as in the `CompensatingStack` implementation:
```csharp
public void Push(T value)
{
    try
    {
        storage[count++] = value;
    }
    catch (IndexOutOfRangeException)
    {
        // Undo the increment that already happened before the indexer threw.
        count = storage.Length;
        throw new InvalidOperationException("Capacity overflow");
    }
}
```
or by using an `if` statement and a throw, as in the `DefensiveStack` implementation:
```csharp
public void Push(T value)
{
    if (count == storage.Length)
        throw new InvalidOperationException("Capacity overflow");
    storage[count++] = value;
}
```
Of course, since we have an `if` statement anyway, we could grow the storage instead, as in the `GrowingStack` implementation:
```csharp
public void Push(T value)
{
    if ((uint)count < (uint)storage.Length)
    {
        storage[count++] = value;
    }
    else
    {
        // Give the common Push path a chance to be inlined by moving the
        // uncommon case into another method.
        PushWithResize(value);
    }
}
```
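`PushWithResize` is not shown here; a minimal sketch could look like the one below, assuming `GrowingStack` declares its `storage` field without `readonly` so the buffer can be replaced, and simply doubles into a new heap array (`MethodImplOptions` comes from `System.Runtime.CompilerServices`):

```csharp
// Sketch only: grow the storage by doubling into a new heap array.
// A ref struct cannot stackalloc a replacement buffer that outlives this
// method, so the grown buffer necessarily lives on the heap.
[MethodImpl(MethodImplOptions.NoInlining)]
private void PushWithResize(T value)
{
    var newStorage = new T[Math.Max(storage.Length * 2, 4)];
    storage.CopyTo(newStorage);
    storage = newStorage; // requires the storage field to not be readonly
    storage[count++] = value;
}
```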
None of these implementations fixes the cleanup issue, which is, by the way, to let the garbage collector know that the popped item is no longer needed. To fix this efficiently, we should change the signature of `Pop` to return `void`, as the C++ STL does.
```csharp
public void Pop()
{
    storage[--count] = default!;
}
```
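If clearing value-type slots feels wasteful, a possible refinement, similar to what the BCL `Stack<T>` does, is to clear only when `T` is (or contains) a reference. This is only a sketch and not one of the benchmarked variants (`RuntimeHelpers` comes from `System.Runtime.CompilerServices`):

```csharp
public void Pop()
{
    --count;
    // Only reference-containing types need to be cleared so the GC can
    // collect the popped object; plain value types can keep their bits.
    if (RuntimeHelpers.IsReferenceOrContainsReferences<T>())
    {
        storage[count] = default!;
    }
}
```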
So now, let's benchmark all these implementations by counting the number of items in a tree using a stack. There are different scenarios to isolate the impact of memory allocation.
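The traversal being measured could look roughly like the sketch below; the `Node` type, the buffer size, and the Peek-then-Pop pattern (which works whether `Pop` returns a `ref` or `void`) are assumptions for illustration, not the exact benchmark code:

```csharp
public sealed class Node
{
    public Node? Left;
    public Node? Right;
}

public static int CountNodes(Node? root)
{
    if (root is null)
        return 0;

    // Node is a reference type, so the buffer has to be a heap array; with
    // the unguarded stack it must be large enough for the pending nodes.
    var stack = new Stack<Node>(new Node[64]);
    stack.Push(root);

    var count = 0;
    while (stack.Count > 0)
    {
        var node = stack.Peek();
        stack.Pop();
        count++;
        if (node.Left is not null) stack.Push(node.Left);
        if (node.Right is not null) stack.Push(node.Right);
    }
    return count;
}
```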
```
BenchmarkDotNet v0.13.7, Windows 11 (10.0.22000.2295/21H2/SunValley)
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK 7.0.306
  [Host]     : .NET 7.0.9 (7.0.923.32018), X64 RyuJIT AVX2
  DefaultJob : .NET 7.0.9 (7.0.923.32018), X64 RyuJIT AVX2
```
Method | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|
StandardStack | 573.1 ns | 3.05 ns | 2.70 ns | 1.00 | 0.00 |
StandardStackWithEnoughCapacity | 540.6 ns | 10.64 ns | 13.84 ns | 0.94 | 0.02 |
StandardStackWithSharedStack | 564.8 ns | 1.44 ns | 1.28 ns | 0.99 | 0.00 |
UsingCompensatingStack | 770.7 ns | 5.87 ns | 5.49 ns | 1.34 | 0.01 |
UsingDefensiveStack | 416.0 ns | 8.26 ns | 10.15 ns | 0.72 | 0.02 |
UsingUnguardedStack | 386.7 ns | 2.14 ns | 2.00 ns | 0.67 | 0.00 |
UsingUnguardedStackWithSharedBuffer | 399.3 ns | 1.08 ns | 0.90 ns | 0.70 | 0.00 |
UsingGrowingStack | 460.0 ns | 9.07 ns | 8.48 ns | 0.80 | 0.01 |
So, what do we learn?
- Compensation has a negative impact on performance.
- The other strategies are in the same ballpark, roughly 20 to 30% faster than the baseline, the unguarded version being the fastest.
Because we were using reference types, we were not able to use stackalloc. Let's check whether that has an impact on performance. So, let's basically push and pop a fixed number of items N times, in order to get measurements in milliseconds instead of nanoseconds.
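A benchmark of that shape could look like the sketch below; the `[Params]` value, the buffer size, and the exact body are assumptions, not the article's code (the attributes come from `BenchmarkDotNet.Attributes`):

```csharp
[Params(10_000)]
public int N;

private const int BufferSize = 64;

[Benchmark]
public int UsingUnguardedStack()
{
    // stackalloc once per invocation: the memory is only released when the
    // method returns, so the allocation must not sit inside the loop.
    Span<int> buffer = stackalloc int[BufferSize];

    var total = 0;
    for (var i = 0; i < N; i++)
    {
        var stack = new Stack<int>(buffer);

        for (var j = 0; j < BufferSize; j++)
            stack.Push(j);

        while (stack.Count > 0)
            total += stack.Pop();
    }
    return total;
}
```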
Method | N | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|
StandardStack | 10000 | 1.992 ms | 0.0359 ms | 0.0336 ms | 1.00 | 0.00 |
StandardStackWithEnoughCapacity | 10000 | 2.043 ms | 0.0400 ms | 0.0374 ms | 1.03 | 0.03 |
StandardStackWithSharedStack | 10000 | 1.940 ms | 0.0058 ms | 0.0051 ms | 0.98 | 0.02 |
UsingCompensatingStack | 10000 | 2.668 ms | 0.0102 ms | 0.0095 ms | 1.34 | 0.02 |
UsingDefensiveStack | 10000 | 2.093 ms | 0.0418 ms | 0.0391 ms | 1.05 | 0.03 |
UsingUnguardedStack | 10000 | 1.564 ms | 0.0052 ms | 0.0046 ms | 0.79 | 0.01 |
UsingGrowingStack | 10000 | 1.885 ms | 0.0097 ms | 0.0086 ms | 0.95 | 0.02 |
We end up with equivalent performance for almost all scenarios, but we can definitely say that:
- The unguarded scenario is the fastest.
- The compensating scenario is the slowest.
After looking at the implementation of the standard Stack, it appears that we may be able to improve the performance of the growing stack implementation, but I would probably not invest time in this. The stack on the stack has such a specific use case, namely optimizing some tree or graph algorithms, that we may as well use the unguarded version and keep it as an implementation detail.