
Incomplete persistence #84

Open
JustAnotherArchivist opened this issue Jul 25, 2018 · 8 comments · May be fixed by #98

Comments

@JustAnotherArchivist
Contributor

I discovered two issues with persistence in PySyncObj: the election vote is not written to persistent storage, and the Raft log is not synced to disk and therefore not guaranteed to be preserved.

SyncObj.__votedFor, the variable storing which node the current node has voted for in a running election, is not written to persistent storage before granting the vote. This means that if a node crashes after granting its vote and is restarted within the same election, it forgets its vote and may vote a second time for a different candidate, making it possible for a leader to be elected by a minority of the cluster.

The second issue is that the Raft log is not written to disk before replying to AppendEntries. The FileJournal.add method used in SyncObj calls ResizableFile.write, which writes the data to an mmap, but the data is never synced to disk (which would require mmap.flush, the Python equivalent of msync(2)). This can cause a loss of committed log entries in certain cases, for example in a slow cluster or when many nodes fail around the same time.
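
To make the gap concrete, here is a minimal sketch of writing through an mmap and then syncing it, assuming a hypothetical journal.bin file (pre-existing and at least five bytes long) rather than PySyncObj's actual journal layout:

```python
import mmap

# Writing through the mmap modifies the page cache immediately, but the
# bytes are only guaranteed to reach disk after an explicit flush.
with open("journal.bin", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    mm[0:5] = b"entry"  # in the page cache only; a crash here can lose it
    mm.flush()          # msync(2): now the bytes are durable on disk
    mm.close()
```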

References regarding persistence in Raft: Figure 2 in the Raft paper and section 3.8 in Diego Ongaro's thesis.

@JustAnotherArchivist
Contributor Author

I'm not sure what the best solution for the votedFor issue is. This would also be affected by the network rewrite I'm currently working on (#4), so it might make sense to delay this fix until my network code is done.

The Raft log issue should be trivial to solve by adding FileJournal.flush calls to the relevant sections in SyncObj.
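
For illustration, a hedged sketch of the ordering such a fix would enforce (the handler shape and all names here are illustrative, not PySyncObj's actual internals):

```python
# The journal is flushed after appending and before acknowledging, so an
# entry the leader counts as replicated can no longer be lost by a crash.
def on_append_entries(journal, entries, reply):
    for term, command in entries:
        journal.add(command, term)  # write into the mmap-backed log
    journal.flush()                 # msync to disk first...
    reply(success=True)             # ...and only then acknowledge the leader
```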

@bakwc
Owner

bakwc commented Jul 25, 2018

Thanks for the report!

  1. Maybe we can just wait for at least raftMaxTimeout plus some delay after the process starts before granting any votes? That would give us time to make sure the previous election has finished. (A sketch of this idea follows below.)
  2. We should add a config option to enable or disable the flush, because in some cases the performance hit caused by flushing could be critical. For a big cluster, the chances that half of the cluster dies at the same time without actually writing data to disk are very low, so the flush may just be unnecessary overhead.
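
A sketch of suggestion 1, assuming illustrative names (only raftMaxTimeout comes from the discussion; VoteGate and the extra delay value are hypothetical):

```python
import time

class VoteGate:
    """Refuse to grant votes until any election that was in progress
    before our crash must have timed out."""

    def __init__(self, raft_max_timeout: float, extra_delay: float = 0.5):
        self._earliest = time.monotonic() + raft_max_timeout + extra_delay

    def may_grant_vote(self) -> bool:
        return time.monotonic() >= self._earliest
```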

@JustAnotherArchivist
Contributor Author

Thanks.

  1. That might work, though I'm not entirely sure whether there are corner cases. I think it's better to follow the Raft spec and persist the votedFor value instead. Maybe it could simply be added as a record to the journal? I haven't looked at how the journal works in detail yet.
  2. Yeah, the flush would definitely be a performance hit. I agree; an option to disable it would be a good idea. I think that the default should be the safe option, i.e. flushing enabled.

@bakwc
Owner

bakwc commented Jul 25, 2018

  1. We can do both: add it to the journal, plus wait for some time for the case where journal flushing is disabled.
  2. Agree.

@JustAnotherArchivist
Contributor Author

  1. Yes, that sounds good.
    I looked a bit at how the journal works and have an idea of how to implement this; a sketch follows after this comment. The votedFor variable would move to Journal and FileJournal (as a property, for example, which would be cleanest). Setting it stores the value internally and writes it as a special record to the journal file (which would probably require defining a new command type). When the (File)Journal is (re-)initialised, e.g. after a restart, it would filter these records out and simply set the internal votedFor variable accordingly instead of exposing the record in FileJournal.__journal and thereby in the indexing of SyncObj.__raftLog. Clearing the journal or deleting entries would always keep the last assigned value of votedFor (and create a new entry if necessary; this would need to happen atomically). This should require very few changes in SyncObj: only the lines where votedFor is set or used, but no rewriting of e.g. SyncObj.__getCurrentLogIndex or all the other uses of SyncObj.__raftLog. Some changes in the log compaction might also be necessary. In SyncObj, self.__votedFor would no longer exist and would be replaced by something like self.__journal.votedFor.
    What do you think? I can implement it, though I would – as mentioned previously – probably wait until the network rewrite is done because the value of votedFor will change from a network address to a generic "node ID".
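
For concreteness, a rough sketch of that design under stated assumptions (the record encoding, the VOTED_FOR_RECORD command type, and all names here are hypothetical; PySyncObj's real journal format differs):

```python
VOTED_FOR_RECORD = -1  # hypothetical command type reserved for votedFor

class JournalWithVote:
    def __init__(self):
        self._log = []          # what SyncObj.__raftLog would index into
        self._voted_for = None

    @property
    def votedFor(self):
        return self._voted_for

    @votedFor.setter
    def votedFor(self, node_id):
        self._voted_for = node_id
        self._persist(VOTED_FOR_RECORD, node_id)  # persist before granting

    def load(self, records):
        # On (re-)initialisation, hidden votedFor records only restore the
        # internal state; they are never exposed, so the log indexing in
        # SyncObj stays untouched.
        for command_type, payload in records:
            if command_type == VOTED_FOR_RECORD:
                self._voted_for = payload
            else:
                self._log.append((command_type, payload))

    def _persist(self, command_type, payload):
        pass  # placeholder: a real journal would serialize and flush here
```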

@bakwc
Owner

bakwc commented Jul 28, 2018

Sounds good. I'll wait for your fix :)

@JustAnotherArchivist
Contributor Author

Since #92 is merged, I've been looking into this, and I realised that I missed something above: the current term also needs to be persisted, not only the voted-for node ID.

Regarding the implementation, now that I understand the journal better, I think it's worth changing the journal structure instead of using hidden log entries as mentioned above. The hidden log entries are quite ugly in my opinion and would also remain in the journal file until the entire file gets rewritten (if that ever happens at all).

Specifically, that means adding two header fields for the current term and the voted-for candidate (and incrementing the journal version). The term can simply be stored as a 64-bit integer. The node ID is a bit trickier since it can now quite literally be anything (as long as it's immutable and hashable). The main issue here is that the header field length has to be fixed since we'd have to rewrite the entire file otherwise when its length changes. One option would be to store a hash of the pickled node ID. We don't actually need to know the value of the node ID a particular node voted for: we only need to be able to check whether the current node voted for a particular candidate, and this is possible with a hash. Since the hash has no cryptographic purposes, I think MD5 is fine for this.
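
A hedged sketch of what those two fields could look like (the struct layout and function names are assumptions, not the final on-disk format):

```python
import hashlib
import pickle
import struct

VOTE_HEADER_FMT = ">Q16s"   # currentTerm (uint64) + md5(pickle(votedFor))
NO_VOTE = b"\x00" * 16      # all-zero digest meaning "not voted this term"

def pack_vote_header(current_term: int, voted_for) -> bytes:
    digest = NO_VOTE if voted_for is None else hashlib.md5(pickle.dumps(voted_for)).digest()
    return struct.pack(VOTE_HEADER_FMT, current_term, digest)

def voted_for_candidate(header: bytes, candidate) -> bool:
    # We never need the node ID back; an equality check against a concrete
    # candidate is enough, and the digest supports exactly that.
    _term, digest = struct.unpack(VOTE_HEADER_FMT, header)
    return digest == hashlib.md5(pickle.dumps(candidate)).digest()
```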

Of course, this would have to be backward-compatible, i.e. still able to read version 1 journal files. Upgrading these files to version 2 would require rewriting the entire file, and that would have to happen atomically, but that can probably be done like it happens in the log compaction already.
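
A minimal sketch of such an atomic rewrite, assuming illustrative paths (the real upgrade would presumably reuse whatever mechanism the log compaction already has):

```python
import os

def upgrade_journal_file(path: str, converted: bytes) -> None:
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(converted)       # write the full version-2 journal
        f.flush()
        os.fsync(f.fileno())     # force it to disk before the swap
    os.replace(tmp_path, path)   # atomic: readers see old or new, never a mix
```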

I'll implement this and send a PR. Please let me know if you disagree with any of this.

@JustAnotherArchivist
Contributor Author

By the way, I found a discussion on this topic on the raft-dev mailing list, which also mentions the idea of waiting for the maximum election timeout on boot to avoid double-voting when persistent storage isn't used. Archie Cobbs commented that it "sounds reasonable intuitive, but nailing down the details and proving it all correct in the face of arbitrary message drops, duplications, and delays is another thing altogether (worthy of a PhD dissertation in fact)".
