Stack on the stack
`ref struct` was introduced in C# 7.2 to give us some optimization opportunities. So let's try to implement a `ref struct` Stack and see how it goes.
The simplest implementation I can think of is the following.
```csharp
using System;
using System.Diagnostics;

ref struct Stack<T>
{
    private readonly Span<T> storage;
    private int count;

    public Stack(Span<T> storage)
    {
        Debug.Assert(!storage.IsEmpty, "The storage cannot be empty");
        this.storage = storage;
        this.count = 0;
    }

    public readonly int Count => count;

    public void Push(T value)
    {
        // No explicit capacity check: pushing past the capacity relies on the
        // Span indexer throwing an IndexOutOfRangeException.
        storage[count++] = value;
    }

    public readonly ref T Peek()
    {
        return ref storage[count - 1];
    }

    public ref T Pop()
    {
        return ref storage[--count];
    }
}
```
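To give an idea of the intended usage: with a value type, both the stack and its storage can live on the call stack. Here is a minimal usage sketch (the buffer size is arbitrary):

```csharp
// Minimal usage sketch: stack-allocate the backing buffer, then wrap it.
Span<int> buffer = stackalloc int[16];
var stack = new Stack<int>(buffer);

stack.Push(1);
stack.Push(2);

int top = stack.Peek();   // 2
int popped = stack.Pop(); // 2, Count drops back to 1
```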
One way to fix the exception type issue (pushing past the capacity throws an IndexOutOfRangeException where a caller would expect an InvalidOperationException) is to catch and rethrow, as in the `CompensatingStack` implementation:
```csharp
public void Push(T value)
{
    try
    {
        storage[count++] = value;
    }
    catch (IndexOutOfRangeException)
    {
        // Undo the increment that already happened before the indexer threw.
        count = storage.Length;
        throw new InvalidOperationException("Capacity overflow");
    }
}
```
or by using an `if` statement and a throw, as in the `DefensiveStack` implementation:
```csharp
public void Push(T value)
{
    if (count == storage.Length)
        throw new InvalidOperationException("Capacity overflow");
    storage[count++] = value;
}
```
Of course, since we have an `if` statement anyway, we could grow the storage instead, as in the `GrowingStack` implementation:
```csharp
public void Push(T value)
{
    if ((uint)count < (uint)storage.Length)
    {
        storage[count++] = value;
    }
    else
    {
        // Give the common Push path a chance to be inlined by moving the
        // uncommon case into another method.
        PushWithResize(value);
    }
}
```
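`PushWithResize` is not shown here; a minimal sketch could look like the one below, assuming `GrowingStack` declares its `storage` field without `readonly` so the buffer can be replaced, and simply doubles into a new heap array (`MethodImplOptions` comes from `System.Runtime.CompilerServices`):

```csharp
// Sketch only: grow the storage by doubling into a new heap array.
// A ref struct cannot stackalloc a replacement buffer that outlives this
// method, so the grown buffer necessarily lives on the heap.
[MethodImpl(MethodImplOptions.NoInlining)]
private void PushWithResize(T value)
{
    var newStorage = new T[Math.Max(storage.Length * 2, 4)];
    storage.CopyTo(newStorage);
    storage = newStorage; // requires the storage field to not be readonly
    storage[count++] = value;
}
```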
None of these implementations fixes the cleanup issue, which is, by the way, to let the garbage collector know that the popped item is no longer needed. To fix this efficiently, we should change the signature of `Pop` to return `void`, as the C++ STL does.
```csharp
public void Pop()
{
    storage[--count] = default!;
}
```
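If clearing value-type slots feels wasteful, a possible refinement, similar to what the BCL `Stack<T>` does, is to clear only when `T` is (or contains) a reference. This is only a sketch and not one of the benchmarked variants (`RuntimeHelpers` comes from `System.Runtime.CompilerServices`):

```csharp
public void Pop()
{
    --count;
    // Only reference-containing types need to be cleared so the GC can
    // collect the popped object; plain value types can keep their bits.
    if (RuntimeHelpers.IsReferenceOrContainsReferences<T>())
    {
        storage[count] = default!;
    }
}
```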
So now, let's benchmark all these implementations by counting the number of items in a tree using a stack. There are different scenarios to isolate the impact of memory allocation.
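The traversal being measured could look roughly like the sketch below; the `Node` type, the buffer size, and the Peek-then-Pop pattern (which works whether `Pop` returns a `ref` or `void`) are assumptions for illustration, not the exact benchmark code:

```csharp
public sealed class Node
{
    public Node? Left;
    public Node? Right;
}

public static int CountNodes(Node? root)
{
    if (root is null)
        return 0;

    // Node is a reference type, so the buffer has to be a heap array; with
    // the unguarded stack it must be large enough for the pending nodes.
    var stack = new Stack<Node>(new Node[64]);
    stack.Push(root);

    var count = 0;
    while (stack.Count > 0)
    {
        var node = stack.Peek();
        stack.Pop();
        count++;
        if (node.Left is not null) stack.Push(node.Left);
        if (node.Right is not null) stack.Push(node.Right);
    }
    return count;
}
```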
```
BenchmarkDotNet v0.13.7, Windows 11 (10.0.22000.2295/21H2/SunValley)
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK 7.0.306
  [Host]     : .NET 7.0.9 (7.0.923.32018), X64 RyuJIT AVX2
  DefaultJob : .NET 7.0.9 (7.0.923.32018), X64 RyuJIT AVX2
```
Method | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|
StandardStack | 573.1 ns | 3.05 ns | 2.70 ns | 1.00 | 0.00 |
StandardStackWithEnoughCapacity | 540.6 ns | 10.64 ns | 13.84 ns | 0.94 | 0.02 |
StandardStackWithSharedStack | 564.8 ns | 1.44 ns | 1.28 ns | 0.99 | 0.00 |
UsingCompensatingStack | 770.7 ns | 5.87 ns | 5.49 ns | 1.34 | 0.01 |
UsingDefensiveStack | 416.0 ns | 8.26 ns | 10.15 ns | 0.72 | 0.02 |
UsingUnguardedStack | 386.7 ns | 2.14 ns | 2.00 ns | 0.67 | 0.00 |
UsingUnguardedStackWithSharedBuffer | 399.3 ns | 1.08 ns | 0.90 ns | 0.70 | 0.00 |
UsingGrowingStack | 460.0 ns | 9.07 ns | 8.48 ns | 0.80 | 0.01 |
So, what do we learn?
- Compensation has a negative impact on performance.
- The other strategies are in the same ballpark, roughly 20 to 30% faster than the baseline, the unguarded version being the fastest.
Because we were using reference types, we were not able to use stackalloc. Let's check whether that has an impact on performance. So, let's basically push and pop a fixed number of items N times, in order to get measurements in milliseconds instead of nanoseconds.
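A benchmark of that shape could look like the sketch below; the `[Params]` value, the buffer size, and the exact body are assumptions, not the article's code (the attributes come from `BenchmarkDotNet.Attributes`):

```csharp
[Params(10_000)]
public int N;

private const int BufferSize = 64;

[Benchmark]
public int UsingUnguardedStack()
{
    // stackalloc once per invocation: the memory is only released when the
    // method returns, so the allocation must not sit inside the loop.
    Span<int> buffer = stackalloc int[BufferSize];

    var total = 0;
    for (var i = 0; i < N; i++)
    {
        var stack = new Stack<int>(buffer);

        for (var j = 0; j < BufferSize; j++)
            stack.Push(j);

        while (stack.Count > 0)
            total += stack.Pop();
    }
    return total;
}
```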
Method | N | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|
StandardStack | 10000 | 1.992 ms | 0.0359 ms | 0.0336 ms | 1.00 | 0.00 |
StandardStackWithEnoughCapacity | 10000 | 2.043 ms | 0.0400 ms | 0.0374 ms | 1.03 | 0.03 |
StandardStackWithSharedStack | 10000 | 1.940 ms | 0.0058 ms | 0.0051 ms | 0.98 | 0.02 |
UsingCompensatingStack | 10000 | 2.668 ms | 0.0102 ms | 0.0095 ms | 1.34 | 0.02 |
UsingDefensiveStack | 10000 | 2.093 ms | 0.0418 ms | 0.0391 ms | 1.05 | 0.03 |
UsingUnguardedStack | 10000 | 1.564 ms | 0.0052 ms | 0.0046 ms | 0.79 | 0.01 |
UsingGrowingStack | 10000 | 1.885 ms | 0.0097 ms | 0.0086 ms | 0.95 | 0.02 |
We end up with equivalent performance for almost all scenarios, but we can definitely say that:
- The unguarded scenario is the fastest.
- The compensating scenario is the slowest.
After looking at the implementation of the standard Stack, it appears that we may be able to improve the performance of the growing stack implementation, but I would probably not invest time in this. The stack on the stack has such a specific use case, namely optimizing some tree or graph algorithms, that we may as well use the unguarded version and keep it as an implementation detail.