Monday, February 08, 2021

Readers-writer lock - Part 3: Some numbers

In order to convince you that a readers-writer lock is not a stupid idea, I should finally show some numbers. In this article I'll present a minimalistic (but still real-life) example which allows us to compare different locking solutions.

All code from this article is available in the rwLock project at https://github.com/gabr42/examples/tree/master/Reader-writer%20lock

To demonstrate the difference between various locking mechanisms, the demonstration program implements a very basic publish-subscribe pattern. The TPubSub class in the PublishSubscribe unit functions as a message hub with methods for handling subscribers (Subscribe, Unsubscribe) and publishers (Notify).

In principle, a publish-subscribe pattern is trivial to implement - unless you want to support multithreaded programs. In its simplest form you only need a list of subscribers (FSubscribers) and a few simple functions.
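For reference, the hub's interface could be declared like this (a sketch only; the actual declaration in the PublishSubscribe unit may differ in details):

TPubSub = class
public type
  TCallback = reference to procedure (value: integer);
strict private
  FSubscribers: TList<TCallback>;
public
  constructor Create;
  destructor Destroy; override;
  procedure Subscribe(callback: TCallback);
  procedure Unsubscribe(callback: TCallback);
  procedure Notify(value: integer);
end;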

To register a subscriber, the code just needs to add a callback to the subscriber list:

procedure TPubSub.Subscribe(callback: TCallback);
begin
  FSubscribers.Add(callback);
end;

De-registration is equally simple:

procedure TPubSub.Unsubscribe(callback: TCallback);
begin
  FSubscribers.Remove(callback);
end;

To send notifications to all interested parties the code just walks over the subscriber list and calls all callbacks:

procedure TPubSub.Notify(value: integer);
begin
  for var subscriber in FSubscribers do
    subscriber(value);
end;

This, as expected, gets complicated when one wants to use the message hub TPubSub in a multithreaded program. For example, while one thread is walking the list in TPubSub.Notify another may call TPubSub.Unsubscribe which would remove one element from the list and break the iteration. 

[Don't believe me? The demonstration program supports locking mechanism "none" which makes it run without any locking. Run it and you'll observe a nice access violation crash.]

To fix the code for a multithreaded environment, we must introduce some form of locking. The simplest way is to lock the entire TPubSub object, for example with Delphi's TMonitor:

procedure TPubSub.Subscribe(callback: TCallback);
begin
  MonitorEnter(Self);
  FSubscribers.Add(callback);
  MonitorExit(Self);
end;

procedure TPubSub.Unsubscribe(callback: TCallback);
begin
  MonitorEnter(Self);
  FSubscribers.Remove(callback);
  MonitorExit(Self);
end;

procedure TPubSub.Notify(value: integer);
begin
  MonitorEnter(Self);
  try
    for var subscriber in FSubscribers do
      subscriber(value);
  finally
    MonitorExit(Self); // release the lock even if a callback raises
  end;
end;
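The benchmark also tests a TCriticalSection-based variant, which works exactly the same way. A minimal sketch, assuming a FLock: TCriticalSection field (from System.SyncObjs) that is created in the constructor and destroyed in the destructor; Unsubscribe is analogous to Subscribe:

procedure TPubSub.Subscribe(callback: TCallback);
begin
  FLock.Acquire;
  FSubscribers.Add(callback);
  FLock.Release;
end;

procedure TPubSub.Notify(value: integer);
begin
  FLock.Acquire;
  try
    for var subscriber in FSubscribers do
      subscriber(value);
  finally
    FLock.Release; // release the lock even if a callback raises
  end;
end;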

In some situations this may be enough. A TMonitor implementation is quite fast, after all. There is, however, one scenario where this implementation can be drastically sped up.

When two threads in parallel call the Notify method above, one of them will have to wait until the other finishes although there is actually no reason for that. Both Notify calls could run in parallel without any problem. We just have to protect them against Subscribe and Unsubscribe.

Enter readers-writer lock

So why can we run two Notify methods in parallel but not a Notify and a Subscribe? The question we have to ask ourselves is - how does the code access the shared entity FSubscribers?

The Notify method reads from the shared data. It does not change the data and can therefore be executed multiple times in parallel.

Both Subscribe and Unsubscribe, however, write to the shared data. They change the data and two such operations cannot run in parallel.

And, as we have seen in the introduction, one cannot run one reader and one writer in parallel as the writer may break the reader.

This is a typical situation where a readers-writer lock can be used. When the data is modified, such a lock is acquired with exclusive (write) access. When we are merely reading the data, the lock is acquired with shared (read) access.

With the new TLightweightMREW the modifications are trivial. As the lock is implemented as a record, we don't need to manage its lifecycle; just declare it inside the TPubSub class:

TPubSub = class
  FLockLight: TLightweightMREW;
  ...
end;

procedure TPubSub.Subscribe(callback: TCallback);
begin
  FLockLight.BeginWrite;
  FSubscribers.Add(callback);
  FLockLight.EndWrite;
end;

procedure TPubSub.Unsubscribe(callback: TCallback);
begin
  FLockLight.BeginWrite;
  FSubscribers.Remove(callback);
  FLockLight.EndWrite;
end;

procedure TPubSub.Notify(value: integer);
begin
  FLockLight.BeginRead;
  try
    for var subscriber in FSubscribers do
      subscriber(value);
  finally
    FLockLight.EndRead; // release the read lock even if a callback raises
  end;
end;
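The benchmark also measures the classic TMREWSync (TMultiReadExclusiveWriteSynchronizer from System.SysUtils). A sketch of that variant, assuming a FLockMREW field created in the constructor and freed in the destructor, looks almost identical; Unsubscribe mirrors Subscribe:

procedure TPubSub.Subscribe(callback: TCallback);
begin
  FLockMREW.BeginWrite;
  FSubscribers.Add(callback);
  FLockMREW.EndWrite;
end;

procedure TPubSub.Notify(value: integer);
begin
  FLockMREW.BeginRead;
  try
    for var subscriber in FSubscribers do
      subscriber(value);
  finally
    FLockMREW.EndRead; // release the read lock even if a callback raises
  end;
end;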

Test app

The rwLock test app allows you to set some parameters (number of subscriber threads, number of publisher threads, test duration, locking mechanism) and run the benchmark.



The publisher threads all run the same code:

procedure TfrmRWLockMain.Publisher;
var
  id: integer;
begin
  id := 0;
  while not FStopBenchmark do begin
    Sleep(0);
    Inc(id);
    FPubSub.Notify(id);
  end;
end;

A publisher gives up its time slice (Sleep(0) merely yields the processor), then notifies the subscribers that the id has changed, and repeats the process until the benchmark is stopped.

All subscribers also run the same code:

procedure TfrmRWLockMain.Subscriber;
var
  callback: TPubSub.TCallback;
begin
  callback :=
    procedure (value: integer)
    begin
      Sleep(50);
    end;
  while not FStopBenchmark do begin
    FPubSub.Subscribe(callback);
    Sleep(10);
    FPubSub.Unsubscribe(callback);
  end;
end;

A subscriber subscribes to notifications, sleeps for 10 ms (in reality it would do some work at this point), then unsubscribes itself. The callback code sleeps for 50 ms to simulate a workload, then exits.

[The actual code is a bit more complicated as it also measures various parameters. If you want the full truth, check the source code.]
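For illustration, the "Max concurrency" numbers in the tables below could be obtained by wrapping each Notify call with an interlocked counter. This is a sketch only - FConcurrent and FMaxConcurrent are hypothetical integer fields, TInterlocked comes from System.SyncObjs, and the actual benchmark code may measure this differently:

var current := TInterlocked.Increment(FConcurrent);
// remember the highest number of simultaneous Notify calls seen so far
var max := FMaxConcurrent;
while (current > max) and
      (TInterlocked.CompareExchange(FMaxConcurrent, current, max) <> max) do
  max := FMaxConcurrent; // CAS failed, reload and retry
FPubSub.Notify(id);
TInterlocked.Decrement(FConcurrent);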

The testing code also measures a "raw speed", that is, the speed of acquiring and releasing a synchronization primitive in a tight loop in a single thread. For example, this is how that speed is measured for a TCriticalSection:

var cs := TCriticalSection.Create;
try
  sw := TStopwatch.StartNew;
  while sw.ElapsedMilliseconds < 1000 do begin
    cs.Acquire;
    cs.Release;
    Inc(cntRead);
  end;
finally
  FreeAndNil(cs);
end;

Separate read and write speeds are measured for both readers-writer locks.
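For the readers-writer locks, the read speed is measured the same way. A sketch for TLightweightMREW's read path (the write path just uses BeginWrite/EndWrite; sw and cntRead are assumed to be declared as in the previous snippet):

var lockLight: TLightweightMREW; // a record - no Create/Free needed
sw := TStopwatch.StartNew;
while sw.ElapsedMilliseconds < 1000 do begin
  lockLight.BeginRead;
  lockLight.EndRead;
  Inc(cntRead);
end;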

Finally, some numbers!

I started testing on a Windows machine with the Win32 compiler. The test application ran for 60 seconds with 6 subscriber and 6 publisher threads. The CPU was an Intel i7-8750 with 6 physical cores (hyperthreading enabled, so 12 logical cores) running at 2.2 GHz.

Locking mechanism  Raw speed                             # Subscribe calls  # Notify calls  Max concurrency
TCriticalSection   15,419,547/sec                        10                 438             1
                   15,451,119/sec                        8                  669             1
                   15,444,868/sec                        9                  521             1
TMonitor           14,268,881/sec                        164                361             1
                   14,341,582/sec                        166                399             1
                   14,390,437/sec                        165                335             1
TMREWSync          R: 606,553/sec, W: 322,056/sec        203                1380            6
                   R: 603,326/sec, W: 320,928/sec        205                1376            6
                   R: 612,249/sec, W: 321,134/sec        189                1242            6
TLightweightMREW   R: 15,814,537/sec, W: 16,049,193/sec  9                  893             6
                   R: 15,630,150/sec, W: 16,037,024/sec  7                  785             6
                   R: 15,719,617/sec, W: 16,166,125/sec  13                 856             6

So, what can we glean from that? Firstly, most synchronization mechanisms are blindingly fast in the raw-speed test. You can acquire a critical section, a monitor, or a lightweight MREW about 15 million times per second. The old Delphi TMREWSync, however, is about 25 times slower.

Secondly, TMonitor and TMREWSync somehow allow the code to accomplish more than the OS-provided TCriticalSection and TLightweightMREW. I suspect this is because they internally use spin-locking for synchronization. Instead of allowing the OS to put the thread to sleep when a synchronization primitive cannot be acquired, they spin in a tight loop and keep retrying. When the locked area is short (as in the Subscribe call) this helps acquire the lock faster (at the price of higher CPU use).

Thirdly, TMREWSync also manages to execute the most Notify calls. This is a consequence of the spin-locking combined with the introduced parallelism. While TCriticalSection and TMonitor execute at most one Notify at any time, both TMREWSync and TLightweightMREW execute up to six Notify calls in parallel.

Fourthly, TLightweightMREW-based locking executes more Notify calls than TCriticalSection-based locking, but not by much. Evidently, in this test case, the introduced parallelism doesn't help as much as a spinlock-based synchronization mechanism.

To recap: in this situation, with this test code, TMREWSync behaves the best. That was a complete surprise to me, as I knew beforehand how slow this locking mechanism is. This, however, doesn't mean that it will be the best approach in a different program or even in this same application on another platform. That's why I retested the code on a MacBook Pro with an Intel Core i9 with 8 physical cores at 2.30 GHz, this time running 8 subscriber and 8 publisher threads:

Locking mechanism  Raw speed                             # Subscribe calls  # Notify calls  Max concurrency
TCriticalSection   9,081,416/sec                         153                320             1
                   9,055,023/sec                         153                370             1
                   8,886,505/sec                         152                353             1
TMonitor           8,696,415/sec                         150                323             1
                   9,161,138/sec                         151                323             1
                   9,136,758/sec                         152                311             1
TMREWSync          R: 8,717,434/sec, W: 8,867,712/sec    149                337             1
                   R: 8,827,878/sec, W: 8,546,304/sec    146                339             1
                   R: 8,987,532/sec, W: 8,720,651/sec    152                305             1
TLightweightMREW   R: 10,597,121/sec, W: 10,302,886/sec  577                1152            6
                   R: 10,490,843/sec, W: 10,376,276/sec  584                1303            7
                   R: 10,323,020/sec, W: 10,310,630/sec  577                1312            7

In this scenario TLightweightMREW is the clear winner. It has the highest raw speed, it allowed the most Subscribe calls, and it generated the most Notify calls. It is also the only synchronization primitive that allowed multiple publishers to run in parallel.

The other three locking mechanisms performed more or less the same. This is not surprising if you know that on non-Windows platforms both TMonitor and TMREWSync are actually equivalent to a critical section.

At the end I wanted to check how fairly the threads are scheduled when different locking mechanisms are used. In theory, a high number of Notify calls doesn't mean that all publisher threads managed to run in parallel. It could be that only one was running at full speed while the others merely kicked in from time to time. So I reran the tests on the Windows machine with 20 subscriber and 20 publisher threads and measured how many Notify calls were executed in each publisher thread:

As we can see in this graph, this is exactly what happens with TCriticalSection and TMonitor-based locking. A few of the publishing threads managed to grab most of the CPU while the others lagged behind. The scheduling was fairer with TMREWSync and TLightweightMREW-based locking.

Summary

As we can see, it is not easy to select the "best" locking mechanism. The result will depend on many factors and usually we cannot say in advance which one is the best. My approach is to start with the one I feel will work best (TCriticalSection/TMonitor for simple locking, TLightweightMREW for situations with readers and writers) and only start thinking about changing it if testing shows that the locking mechanism doesn't perform as well as it should.
