Event Sourcing Blind Spot: Race Condition in Rebuild Projections

TLDR: Oskar Dudycz found a race condition in his previously published approach to rebuilding event-driven read models. When using advisory locks and inline projections in PostgreSQL, there's a window where events can be committed but their projections skipped, leading to data inconsistency. He's challenging readers to find a solution.

This is a fascinating design challenge from the Architecture Weekly newsletter. Oskar recently explained how to rebuild event-driven read models safely and resiliently, but then discovered a problematic edge case in his own design. Instead of just publishing the fix, he's turning it into a learning exercise for readers.

Here's the scenario: You're using PostgreSQL for both event storage and read models. You have inline projections that update the read model in the same transaction as the event append. When you need to rebuild these projections (say, because you've changed the projection logic), you truncate the read model and replay all events through the new projection.

The tricky part is handling concurrent writes during the rebuild. You don't want to stop incoming traffic, but you also don't want both the rebuild process and normal operations updating the read model simultaneously. Oskar's original solution used advisory locks and status checks to coordinate this.

But here's the race condition he discovered:

1. The rebuild process reads the last events in the log
2. It acquires an advisory lock for the inline projection
3. A new append operation starts but can't acquire the lock, so it skips the projection (as designed)
4. The rebuild process finishes processing; it sees no new events because step 3 hasn't committed yet
5. The rebuild marks the projection as active and releases the lock
6. The append from step 3 finally commits, but its projection was already skipped

Result: the event is stored but never projected into the read model
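
The interleaving above can be replayed deterministically in a few lines. This is only a minimal in-memory sketch, not Oskar's implementation: the `Store` class, its fields, and `run_interleaving` are hypothetical names, and plain Python lists stand in for the PostgreSQL event log and read model.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    # Hypothetical stand-ins for the PostgreSQL tables and lock state.
    committed_events: list = field(default_factory=list)  # events visible to other transactions
    read_model: list = field(default_factory=list)        # projected events
    rebuild_lock_held: bool = False
    projection_active: bool = True


def run_interleaving() -> Store:
    store = Store(committed_events=["e1", "e2"], read_model=["e1", "e2"])

    # Steps 1-2: the rebuild looks at the log and takes the lock.
    store.projection_active = False
    store.read_model.clear()
    store.rebuild_lock_held = True

    # Step 3: an append starts, cannot get the lock, and decides to skip
    # the inline projection. Its event is not committed yet.
    pending_event = "e3"
    skip_projection = store.rebuild_lock_held

    # Step 4: the rebuild replays only what is committed, so it misses e3.
    store.read_model.extend(store.committed_events)

    # Step 5: the rebuild marks the projection active and releases the lock.
    store.projection_active = True
    store.rebuild_lock_held = False

    # Step 6: the append commits, still acting on its stale decision.
    store.committed_events.append(pending_event)
    if not skip_projection:
        store.read_model.append(pending_event)

    return store


store = run_interleaving()
missing = [e for e in store.committed_events if e not in store.read_model]
print(missing)  # ['e3']
```

The event `e3` ends up in the log but not in the read model, which is the inconsistency described above.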

This is a classic case where two concurrent operations interleave in an unexpected way. The append checked the lock status (locked), made a decision (skip projection), but by the time it committed, the world had changed (rebuild complete, projection active again).

For architects and teams dealing with event sourcing systems, this is a critical reminder: concurrent system design is full of subtle edge cases. Even experienced practitioners like Oskar can miss these scenarios on the first pass. The lesson isn't that advisory locks are bad, but that you need to carefully analyze all possible interleavings of concurrent operations.

Some approaches to consider: What if the rebuild process did a final check after releasing the lock? What if the append operation rechecked state before committing? What if there was a grace period or overlap in processing? Each approach has its own tradeoffs around consistency, performance, and complexity.
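
As an illustration of the second idea (the append rechecking state before it commits), here is a small sketch under the same in-memory assumptions. The function `append_with_recheck`, the `state` dictionary, and the `concurrent_work` hook are all hypothetical and are not taken from the newsletter.

```python
def append_with_recheck(log, read_model, state, event, concurrent_work=lambda: None):
    # Initial decision, as in the original design: skip while a rebuild holds the lock.
    skip_projection = state["rebuild_lock_held"]

    # Stand-in for whatever happens while the append transaction is still open.
    concurrent_work()

    # Recheck right before commit: if the rebuild has finished, project after all.
    if skip_projection and state["projection_active"] and not state["rebuild_lock_held"]:
        skip_projection = False

    log.append(event)  # "commit" the event
    if not skip_projection:
        read_model.append(event)


log, read_model = ["e1", "e2"], []
state = {"rebuild_lock_held": True, "projection_active": False}


def rebuild_finishes():
    # The rebuild replays committed events only, then reactivates the projection.
    read_model.extend(log)
    state["projection_active"] = True
    state["rebuild_lock_held"] = False


append_with_recheck(log, read_model, state, "e3", concurrent_work=rebuild_finishes)
print(read_model)  # ['e1', 'e2', 'e3']
```

In this simulation the recheck catches the finished rebuild and the event is projected. In a real database the recheck and the commit are separate steps, so this approach likely narrows the window rather than closing it unless the two are serialized by something the rebuild also has to wait on. That is part of the tradeoff the questions above point at.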

Link: I found a blind spot in the rebuilding projections design

This article summarizes content from the Architecture Weekly newsletter by Oskar Dudycz. The information presented here is based on newsletter content and may not reflect the complete original articles.
