What

Currently, apart from WAL safekeeper persistently stores only two logical clock counter (aka term) values, sourced from the same sequence. The first is bumped whenever safekeeper gives vote to proposer (or acknowledges already elected one) and e.g. prevents electing two proposers with the same term -- it is actually called term in the code. The second, called epoch, reflects progress of log receival and this might lag behind term; safekeeper switches to epoch n when it has received all committed log records from all < n terms. This roughly corresponds to proposed in

https://github.com/zenithdb/rfcs/pull/3/files

This makes our biggest our difference from Raft. In Raft, every log record is stamped with term in which it was generated; while we essentially store in epoch only the term of the highest record on this safekeeper -- when we know it -- because during recovery generally we don't, and epoch is bumped directly to the term of the proposer who performs the recovery when it is finished. It is not immediately obvious that this simplification is safe. I thought and I still think it is; model checking confirmed that. However, some details now make me believe it is better to keep full term switching history (which is equivalent to knowing term of each record).

Why

Without knowing full history (list of <term, LSN> pairs) of terms it is hard to determine the exact divergence point, and if we don't perform truncation at that point safety becomes questionable. Consider the following history, with safekeepers A, B, C, D, E. n_m means record created by proposer in term n with LSN m; (t=x, e=y) means safekeeper currently has term x and epoch y.

P1 in term 1 writes 1.1 everywhere, which is committed, and some more only on A.

A(t=1, e=1) 1.1 1.2 1.3 1.4
B(t=1, e=1) 1.1
C(t=1, e=1) 1.1
D(t=1, e=1) 1.1
E(t=1, e=1) 1.1

P2 is elected by CDE in term 2, epochStartLsn is 2, and writes 2.2, 2.3 on CD:

A(t=1, e=1) 1.1 1.2 1.3 1.4
B(t=1, e=1) 1.1
C(t=2, e=2) 1.1 2.2 2.3
D(t=2, e=2) 1.1 2.2 2.3
E(t=2, e=1) 1.1

P3 is elected by CDE in term 3, epochStartLsn is 4, and writes 3.4 on D:

A(t=1, e=1) 1.1 1.2 1.3 1.4
B(t=1, e=1) 1.1
C(t=3, e=2) 1.1 2.2 2.3
D(t=3, e=3) 1.1 2.2 2.3 3.4
E(t=3, e=1) 1.1

Now, A gets back and P3 starts recovering it. How it should proceed? There are two options.

Don't try to find divergence point at all

...start sending WAL conservatively since the horizon (1.1), and truncate obsolete part of WAL only when recovery is finished, i.e. epochStartLsn (4) is reached, i.e. 2.3 transferred -- that's what neondatabase#505 proposes.

Then the following is possible:

P3 moves one record 2.2 to A.

A(t=1, e=1) 1.1 2.2 1.3 1.4
B(t=1, e=1) 1.1 1.2
C(t=3, e=2) 1.1 2.2 2.3
D(t=3, e=3) 1.1 2.2 2.3 3.4
E(t=3, e=1) 1.1

Now log of A is basically corrupted. Moreover, since ABE are all in epoch 1 and A's log is the longest one, they can elect P4 who will commit such log.

Note that this particular history couldn't happen if we forbid to create new records in term n until majority of safekeepers switch to it. It would force CDE to switch to 2 before 2.2 is created, and A could never become donor while his log is corrupted. Generally with this additional barrier I believe the algorithm becomes safe, but

I don't like this kind of artificial barrier;
I also feel somewhat discomfortable about even temporary having intentionally corrupted WAL;
I'd still model check the idea.

Find divergence point and truncate at it

Then step 4 would delete 1.3 1.4 on A, and we are ok. The question is, how do we do that? Without term switching history we have to resort to sending again since the horizon and memcmp'ing records, which is inefficient and ugly. Or we can maintain full history and determine truncation point by comparing 'wrong' and 'right' histories -- much like pg_rewind does -- and perform truncation + start streaming right there.

Proposal

Add term history as array of <term, LSN> pairs to safekeeper controlfile.
Return it to proposer with VoteResponse so 1) proposer can tell it to other nodes and 2) determine personal streaming starting point. However, since we don't append WAL and update controlfile atomically, let's first always update controlfile but send only the history of what we really have (up to highest term in history where begin_lsn >= end of wal; this highest term replaces current epoch). We also send end of wal as we do now to determine the donor.
Create ProposerAnnouncement message which proposer sends before starting streaming. It announces proposer as elected and
1. Truncates wrong part of WAL on safekeeper (divergence point is already calculated at proposer, but can be cross-verified here).
2. Communicates the 'right' history of its term (taken from donor). Seems better to immediately put the history in the controlfile, though safekeeper might not have full WAL for previous terms in it -- this way is simpler, and we can't update WAL and controlfile atomically anyway.
  
  This also constitutes analogue of current epoch bump for those safekeepers which don't need recovery, which is important for sync-safekeepers (bump epoch without waiting records from new term).
After ProposerAnnouncement proposer streams WAL since calculated starting point -- only what is missing.

pros/cons:

(more) clear safety of WAL truncation -- we get very close to Raft
no unnecessary data sending (faster recovery for not-oldest-safekeepers, matters only for 5+ nodes)
adds some observability at safekeepers

complexity, but not that much

Misc

During model checking I did truncation on first locally non existent or different record -- analogue of 'memcmp' variant described above.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

013-term-history.md

013-term-history.md

What

Why

Don't try to find divergence point at all

Find divergence point and truncate at it

Proposal

Misc

Files

013-term-history.md

Latest commit

History

013-term-history.md

File metadata and controls

What

Why

Don't try to find divergence point at all

Find divergence point and truncate at it

Proposal

Misc