
Incomplete persistence #84

Open
JustAnotherArchivist opened this issue Jul 25, 2018 · 8 comments · May be fixed by #98

Comments

@JustAnotherArchivist
Contributor

I discovered two issues with persistence in PySyncObj: the election vote is not written to persistent storage, and the Raft log is not synced to disk and therefore not guaranteed to be preserved.

SyncObj.__votedFor, the variable storing which node the current node has voted for in a running election, is not written to persistent storage before granting the vote. This means that if a node crashes after granting its vote and is restarted within the same election, it forgets its vote and may vote a second time for a different candidate, making it possible for a leader to be elected by a minority of the cluster.

The second issue is that the Raft log is not written to disk before replying to AppendEntries. The FileJournal.add method used in SyncObj calls ResizableFile.write, which writes the data to an mmap, but the data is never synced to disk (which would require mmap.flush, the Python equivalent of msync(2)). This can cause a loss of committed log entries in certain cases, for example in a slow cluster or when many nodes fail around the same time.
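
To make the gap concrete, here is a minimal sketch of writing through an mmap and then syncing it, assuming a hypothetical journal.bin file (pre-existing and at least five bytes long) rather than PySyncObj's actual journal layout:

```python
import mmap

# Writing through the mmap modifies the page cache immediately, but the
# bytes are only guaranteed to reach disk after an explicit flush.
with open("journal.bin", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    mm[0:5] = b"entry"  # in the page cache only; a crash here can lose it
    mm.flush()          # msync(2): now the bytes are durable on disk
    mm.close()
```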

References regarding persistence in Raft: Figure 2 in the Raft paper and section 3.8 in Diego Ongaro's thesis.

@JustAnotherArchivist
Contributor Author

I'm not sure what the best solution for the votedFor issue is. This would also be affected by the network rewrite I'm currently working on (#4), so it might make sense to delay this fix until my network code is done.

The Raft log issue should be trivial to solve by adding FileJournal.flush calls to the relevant sections in SyncObj.
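
For illustration, a hedged sketch of the ordering such a fix would enforce (the handler shape and all names here are illustrative, not PySyncObj's actual internals):

```python
# The journal is flushed after appending and before acknowledging, so an
# entry the leader counts as replicated can no longer be lost by a crash.
def on_append_entries(journal, entries, reply):
    for term, command in entries:
        journal.add(command, term)  # write into the mmap-backed log
    journal.flush()                 # msync to disk first...
    reply(success=True)             # ...and only then acknowledge the leader
```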

@bakwc
Owner

bakwc commented Jul 25, 2018

Thanks for the report!

  1. Maybe we can just wait for at least raftMaxTimeout plus some delay after the process starts before granting any votes? That would give us time to make sure the previous election has finished. (A sketch of this idea follows below.)
  2. We should add a config option to enable or disable the flush, because in some cases the performance hit caused by flushing could be critical. For a big cluster, the chances that half of the cluster dies at the same time without actually writing data to disk are very low, so the flush may just be unnecessary overhead.
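
A sketch of suggestion 1, assuming illustrative names (only raftMaxTimeout comes from the discussion; VoteGate and the extra delay value are hypothetical):

```python
import time

class VoteGate:
    """Refuse to grant votes until any election that was in progress
    before our crash must have timed out."""

    def __init__(self, raft_max_timeout: float, extra_delay: float = 0.5):
        self._earliest = time.monotonic() + raft_max_timeout + extra_delay

    def may_grant_vote(self) -> bool:
        return time.monotonic() >= self._earliest
```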

@JustAnotherArchivist
Contributor Author

Thanks.

  1. That might work, though I'm not entirely sure whether there are corner cases. I think it's better to follow the Raft spec and persist the votedFor value instead. Maybe it could simply be added as a record to the journal? I haven't looked at how the journal works in detail yet.
  2. Yeah, the flush would definitely be a performance hit. I agree; an option to disable it would be a good idea. I think that the default should be the safe option, i.e. flushing enabled.

@bakwc
Owner

bakwc commented Jul 25, 2018

  1. We can do both: add it to the journal, plus wait for some time for the case where journal flushing is disabled.
  2. Agree.

@JustAnotherArchivist
Contributor Author

  1. Yes, that sounds good.
    I looked a bit at how the journal works and have an idea of how to implement this; a sketch follows after this comment. The votedFor variable would move to Journal and FileJournal (as a property, for example, which would be cleanest). Setting it stores the value internally and writes it as a special record to the journal file (which would probably require defining a new command type). When the (File)Journal is (re-)initialised, e.g. after a restart, it would filter these records out and simply set the internal votedFor variable accordingly instead of exposing the record in FileJournal.__journal and thereby in the indexing of SyncObj.__raftLog. Clearing the journal or deleting entries would always keep the last assigned value of votedFor (and create a new entry if necessary; this would need to happen atomically). This should require very few changes in SyncObj: only the lines where votedFor is set or used, but no rewriting of e.g. SyncObj.__getCurrentLogIndex or all the other uses of SyncObj.__raftLog. Some changes in the log compaction might also be necessary. In SyncObj, self.__votedFor would no longer exist and would be replaced by something like self.__journal.votedFor.
    What do you think? I can implement it, though I would – as mentioned previously – probably wait until the network rewrite is done because the value of votedFor will change from a network address to a generic "node ID".
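
For concreteness, a rough sketch of that design under stated assumptions (the record encoding, the VOTED_FOR_RECORD command type, and all names here are hypothetical; PySyncObj's real journal format differs):

```python
VOTED_FOR_RECORD = -1  # hypothetical command type reserved for votedFor

class JournalWithVote:
    def __init__(self):
        self._log = []          # what SyncObj.__raftLog would index into
        self._voted_for = None

    @property
    def votedFor(self):
        return self._voted_for

    @votedFor.setter
    def votedFor(self, node_id):
        self._voted_for = node_id
        self._persist(VOTED_FOR_RECORD, node_id)  # persist before granting

    def load(self, records):
        # On (re-)initialisation, hidden votedFor records only restore the
        # internal state; they are never exposed, so the log indexing in
        # SyncObj stays untouched.
        for command_type, payload in records:
            if command_type == VOTED_FOR_RECORD:
                self._voted_for = payload
            else:
                self._log.append((command_type, payload))

    def _persist(self, command_type, payload):
        pass  # placeholder: a real journal would serialize and flush here
```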

@bakwc
Owner

bakwc commented Jul 28, 2018

Sounds good. I'll wait for your fix :)

@JustAnotherArchivist
Contributor Author

Since #92 is merged, I've been looking into this, and I realised that I missed something above: the current term also needs to be persisted, not only the voted-for node ID.

Regarding the implementation, now that I understand the journal better, I think it's worth changing the journal structure instead of using hidden log entries as mentioned above. The hidden log entries are quite ugly in my opinion and would also remain in the journal file until the entire file gets rewritten (if that ever happens at all).

Specifically, that means adding two header fields for the current term and the voted-for candidate (and incrementing the journal version). The term can simply be stored as a 64-bit integer. The node ID is a bit trickier since it can now quite literally be anything (as long as it's immutable and hashable). The main issue here is that the header field length has to be fixed since we'd have to rewrite the entire file otherwise when its length changes. One option would be to store a hash of the pickled node ID. We don't actually need to know the value of the node ID a particular node voted for: we only need to be able to check whether the current node voted for a particular candidate, and this is possible with a hash. Since the hash has no cryptographic purposes, I think MD5 is fine for this.
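
A hedged sketch of what those two fields could look like (the struct layout and function names are assumptions, not the final on-disk format):

```python
import hashlib
import pickle
import struct

VOTE_HEADER_FMT = ">Q16s"   # currentTerm (uint64) + md5(pickle(votedFor))
NO_VOTE = b"\x00" * 16      # all-zero digest meaning "not voted this term"

def pack_vote_header(current_term: int, voted_for) -> bytes:
    digest = NO_VOTE if voted_for is None else hashlib.md5(pickle.dumps(voted_for)).digest()
    return struct.pack(VOTE_HEADER_FMT, current_term, digest)

def voted_for_candidate(header: bytes, candidate) -> bool:
    # We never need the node ID back; an equality check against a concrete
    # candidate is enough, and the digest supports exactly that.
    _term, digest = struct.unpack(VOTE_HEADER_FMT, header)
    return digest == hashlib.md5(pickle.dumps(candidate)).digest()
```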

Of course, this would have to be backward-compatible, i.e. still able to read version 1 journal files. Upgrading these files to version 2 would require rewriting the entire file, and that would have to happen atomically, but that can probably be done like it happens in the log compaction already.
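
A minimal sketch of such an atomic rewrite, assuming illustrative paths (the real upgrade would presumably reuse whatever mechanism the log compaction already has):

```python
import os

def upgrade_journal_file(path: str, converted: bytes) -> None:
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(converted)       # write the full version-2 journal
        f.flush()
        os.fsync(f.fileno())     # force it to disk before the swap
    os.replace(tmp_path, path)   # atomic: readers see old or new, never a mix
```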

I'll implement this and send a PR. Please let me know if you disagree with any of this.

@JustAnotherArchivist
Contributor Author

By the way, I found a discussion on this topic on the raft-dev mailing list, which also mentions the idea of waiting for the maximum election timeout on boot to avoid double-voting when persistent storage isn't used. Archie Cobbs commented that it "sounds reasonable intuitive, but nailing down the details and proving it all correct in the face of arbitrary message drops, duplications, and delays is another thing altogether (worthy of a PhD dissertation in fact)".
