Incomplete persistence #84
Comments
I'm not sure what the best solution for the The Raft log issue should be trivial to solve by adding |
Thanks for report!
|
Thanks.
Sounds good. I'll wait for your fix :) |
Since #92 is merged, I've been looking into this, and I realised that I missed something above: the current term also needs to be persisted, not only the voted-for node ID.

Regarding the implementation, now that I know better how the journal works, I think it's worth changing the journal structure instead of using hidden log entries as mentioned above. These hidden log entries are quite ugly in my opinion and would also remain in the journal file until the entire file gets rewritten (if that ever happens at all). Specifically, that means adding two header fields for the current term and the voted-for candidate (and incrementing the journal version).

The term can simply be stored as a 64-bit integer. The node ID is a bit trickier since it can now quite literally be anything (as long as it's immutable and hashable). The main issue here is that the header field length has to be fixed, since otherwise we'd have to rewrite the entire file whenever its length changes. One option would be to store a hash of the pickled node ID. We don't actually need to know the value of the node ID a particular node voted for: we only need to be able to check whether the current node voted for a particular candidate, and this is possible with a hash. Since the hash has no cryptographic purpose, I think MD5 is fine for this.

Of course, this would have to be backward-compatible, i.e. still able to read version 1 journal files. Upgrading these files to version 2 would require rewriting the entire file, and that would have to happen atomically, but that can probably be done the same way log compaction already does it.

I'll implement this and send a PR. Please let me know if you disagree with any of this. |
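To make the fixed-length header idea above concrete, here is a minimal sketch. The field layout, the names (`HEADER_FORMAT`, `pack_header`, `voted_for_matches`) and the all-zero "no vote" marker are assumptions for illustration only, not the actual journal format. It also assumes that pickling the same node ID always yields the same bytes, which holds for simple values such as strings with a pinned protocol but is not guaranteed for arbitrary objects.

```python
import hashlib
import pickle
import struct

# Hypothetical layout: 8-byte unsigned term, then a 16-byte MD5 digest.
# "<" disables alignment padding, so the size is always 8 + 16 = 24 bytes.
HEADER_FORMAT = "<Q16s"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)
NO_VOTE = b"\x00" * 16  # assumed marker for "no vote cast in this term"


def vote_digest(node_id):
    """Fixed-length fingerprint of a node ID (pickle protocol pinned for stability)."""
    return hashlib.md5(pickle.dumps(node_id, protocol=2)).digest()


def pack_header(current_term, voted_for=None):
    """Serialise the term and the voted-for fingerprint into a fixed-size field."""
    digest = NO_VOTE if voted_for is None else vote_digest(voted_for)
    return struct.pack(HEADER_FORMAT, current_term, digest)


def unpack_header(data):
    """Return (current_term, digest) from the first HEADER_SIZE bytes."""
    return struct.unpack(HEADER_FORMAT, data[:HEADER_SIZE])


def voted_for_matches(header_bytes, candidate_id):
    """Check whether the stored vote was for candidate_id, without storing the ID."""
    _, digest = unpack_header(header_bytes)
    return digest == vote_digest(candidate_id)
```

Because the packed fields never change size, they could be overwritten in place when the term or vote changes, without rewriting the rest of the file.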
By the way, I found a discussion on this topic on the raft-dev mailing list, which also mentions the idea of waiting for the maximum election timeout on boot to avoid double-voting when persistent storage isn't used. Archie Cobbs commented that it "sounds reasonable intuitive, but nailing down the details and proving it all correct in the face of arbitrary message drops, duplications, and delays is another thing altogether (worthy of a PhD dissertation in fact)". |
I discovered two issues with persistence in PySyncObj: the election vote is not written to persistent storage, and the Raft log is not synced to disk and therefore not guaranteed to be preserved.

SyncObj.__votedFor, the variable storing which node the current node has voted for in a running election, is not written to persistent storage before granting the vote. This means that if a node crashes after granting its vote and is restarted within the same election, it is possible for a leader to be elected by a minority of the cluster.

The second issue is that the Raft log is not written to disk before replying to AppendEntries. The FileJournal.add method used in SyncObj calls ResizableFile.write, which writes the data to an mmap, but this data is not synced to disk (using mmap.flush, the Python equivalent of msync(2)). This may cause a loss of committed log entries in certain cases (slow cluster or many nodes failing around the same time, for example).

References regarding persistence in Raft: figure 2 in the paper and chapter 3.8 in Diego's thesis
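As a rough illustration of the second issue, the sketch below writes into a memory-mapped file and calls mmap.flush before returning, so a caller would only reply after the flush has completed. The helper names (`append_and_sync`, `demo`) are made up for this example, and it uses a plain temporary file rather than PySyncObj's ResizableFile or FileJournal.

```python
import mmap
import os
import tempfile


def append_and_sync(mm, offset, data):
    """Copy data into the mapping at offset, then flush it to the backing file."""
    mm[offset:offset + len(data)] = data
    # Flushing the whole mapping avoids the page-alignment rules that apply
    # when an offset and size are passed to flush().
    mm.flush()
    return offset + len(data)


def demo():
    """Write one entry through an mmap and read it back through the file descriptor."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, mmap.PAGESIZE)  # a mapping needs a non-empty file
        mm = mmap.mmap(fd, mmap.PAGESIZE)
        try:
            end = append_and_sync(mm, 0, b"entry-1")
        finally:
            mm.close()
        os.lseek(fd, 0, os.SEEK_SET)
        return end, os.read(fd, end)
    finally:
        os.close(fd)
        os.remove(path)
```

Flushing on every append costs throughput, so a real fix might batch several entries per flush as long as the flush still happens before the AppendEntries reply is sent.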