<h1>6.824 2015 Lecture 3: Primary/Backup Replication</h1>
<p><strong>Note:</strong> These lecture notes were slightly modified from the ones posted on the 6.824
<a href="http://nil.csail.mit.edu/6.824/2015/schedule.html">course website</a> from Spring 2015.</p>
<h2>Today</h2>
<ul>
<li>Replication</li>
<li><a href="papers/remus.pdf">Remus</a> case study</li>
<li>Lab 2 introduction</li>
</ul>
<h2>Fault tolerance</h2>
<ul>
<li>We'd like a service that continues despite failures!</li>
<li><strong>Definitions:</strong>
<ul>
<li><em>Available</em> -- still usable despite [some class of] failures</li>
<li><em>Correct</em> -- act just like a single server to clients
<ul>
<li>A lot of issues that come up have to do with <em>correctness</em></li>
</ul></li>
</ul></li>
<li>Very hard!</li>
<li>Very useful!</li>
</ul>
<h3>Need a failure model: what will we try to cope with?</h3>
<ul>
<li><em>Most common:</em> Independent fail-stop computer failures
<ul>
<li><em>fail-stop failures:</em> computed correctly for a while and then stopped
<ul>
<li>as opposed to computing incorrectly (different situation)</li>
</ul></li>
<li>have to assume independence of failures
<ul>
<li>(otherwise one event could take out primary <code>=></code> backup <code>=></code> everything, leaving no replicas)</li>
</ul></li>
<li>Remus further assumes only one failure at a time</li>
</ul></li>
<li><em>Another model:</em> Site-wide power failure (and eventual reboot)</li>
<li>(Network partition)</li>
<li>No bugs, no malice</li>
</ul>
<h3>Core idea: replication</h3>
<ul>
<li><em>Two</em> servers (or more)</li>
<li>Each replica keeps state needed for the service</li>
<li>If one replica fails, others can continue</li>
</ul>
<h3>Example: fault-tolerant MapReduce master</h3>
<ul>
<li>Lab 1 workers are already fault-tolerant, but not master</li>
<li><code>[Diagram: M1, M2, workers]</code></li>
<li>State:
<ul>
<li>worker list</li>
<li>which jobs done</li>
<li>which workers idle</li>
<li>TCP connection state</li>
<li>program counter</li>
</ul></li>
</ul>
<h3>Big questions</h3>
<ul>
<li>What <em>state</em> to replicate?
<ul>
<li><em>Example:</em> Remus replicates all of RAM and the CPU state</li>
</ul></li>
<li>How does replica get state?</li>
<li>When to <em>cut over</em> to backup?
<ul>
<li>Is primary really down or is just the network down?</li>
</ul></li>
<li>Are anomalies visible at cut-over?
<ul>
<li>What will clients see?</li>
</ul></li>
<li>How to repair / re-integrate?
<ul>
<li>How to get a new backup?</li>
</ul></li>
</ul>
<h3>Two main approaches:</h3>
<ol>
<li><strong>State transfer</strong>
<ul>
<li>"Primary" replica executes the service</li>
<li>Primary sends [new] state to backups</li>
<li><em>Example:</em> Remus</li>
</ul></li>
<li><strong>Replicated state machine</strong>
<ul>
<li>All replicas (primary and backup) execute all operations</li>
<li>If same start state &amp; same operations &amp; same order &amp; deterministic, <em>then</em> <code>=></code> same end state</li>
<li><em>Ops</em> are transferred and not the state</li>
</ul></li>
</ol>
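<p>A minimal sketch of the replicated-state-machine idea (the types and names are invented for illustration): if every replica applies the same deterministic ops in the same order, all replicas reach the same state.</p>
<pre><code>package main

import "fmt"

// Op is one deterministic operation on the state.
type Op struct {
    Key, Value string
}

// apply replays an op log, in order, against an empty state.
// Any replica that replays the same log reaches the same state.
func apply(oplog []Op) map[string]string {
    state := make(map[string]string)
    for _, op := range oplog {
        state[op.Key] = op.Value
    }
    return state
}

func main() {
    oplog := []Op{{"x", "1"}, {"y", "2"}, {"x", "3"}}
    fmt.Println(apply(oplog)) // every replica prints map[x:3 y:2]
}
</code></pre>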
<h3><em>State transfer</em> is simpler</h3>
<ul>
<li>But state may be large, slow to transfer</li>
<li><em>Remus</em> uses state transfer</li>
</ul>
<h3><em>Replicated state machine</em> can be more efficient</h3>
<ul>
<li>If operations are small compared to data</li>
<li>But complex, e.g. order on multi-core, determinism
<ul>
<li>Hard to make sure everyone got to the same state</li>
<li>Determinism can be problematic (time, threads, etc.)</li>
</ul></li>
<li>Labs use replicated state machines</li>
</ul>
<h2>Remus: High Availability via Asynchronous Virtual Machine Replication, NSDI 2008</h2>
<h3>Very ambitious system</h3>
<ul>
<li>Whole-system replication</li>
<li>Completely <em>transparent</em> to applications and clients</li>
<li>High availability for any existing software</li>
<li>Would be magic if it worked well!</li>
<li><em>Failure model:</em>
<ol>
<li>Independent hardware faults</li>
<li>Site-wide power failure</li>
</ol></li>
</ul>
<h3>Plan 1 (slow, broken):</h3>
<ul>
<li><code>[Diagram: app, O/S, Remus underneath]</code></li>
<li>two machines, <em>primary</em> and <em>backup</em>; plus net and other machines</li>
<li>primary runs o/s and application s/w, talks to clients, etc.</li>
<li>backup does <em>not</em> initially execute o/s, applications, etc.
<ul>
<li>it only executes some Remus code</li>
</ul></li>
<li>a few times per second:
<ul>
<li>pause primary</li>
<li>copy <strong>entire RAM</strong>, registers, disk to backup
<ul>
<li>10Gbps = 1GB/s network bandwidth</li>
<li>100MB/s disk bandwidth</li>
<li>network bandwidth limits RAM transfer rate</li>
<li>disk bandwidth limits disk transfer rate</li>
</ul></li>
<li>resume primary</li>
</ul></li>
<li>if primary fails:
<ul>
<li>start backup executing!</li>
</ul></li>
</ul>
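<p>Back-of-the-envelope (assuming, say, 16 GB of RAM; the number is an assumption, not from the paper): at ~1 GB/s of network bandwidth, copying all of RAM takes on the order of 16 seconds, so pausing "a few times per second" to copy <em>everything</em> is hopeless. Any workable version must send only what changed since the last copy.</p>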
<p><strong>Q:</strong> Is Plan 1 correct (as described above)?</p>
<ul>
<li>i.e. does it look just like a single reliable server?</li>
<li>No:
<ul>
<li>client sends a write request to the primary; the primary replies before the backup has copied the new state</li>
<li>primary fails, backup takes over, but its state does not reflect that last write</li>
<li>the client's acknowledged write has been silently lost</li>
</ul></li>
</ul>
<p><strong>Q:</strong> What will outside world see if primary fails and replica takes over?</p>
<ul>
<li>Will backup have same state as last visible on primary?</li>
<li>Might a client request be lost? Executed twice?</li>
<li>Yes: see above question</li>
</ul>
<p><strong>Q:</strong> How to decide if primary has failed?</p>
<p><strong>Q:</strong> How will clients know to talk to backup rather than primary?</p>
<p><strong>Q:</strong> What if site-wide power failure?</p>
<ul>
<li>Primary is running some o/s, which already has a plan for rebooting from disk, as long as the disk is "crash-consistent"</li>
</ul>
<p><strong>Q:</strong> What if primary fails while sending state to backup?</p>
<ul>
<li>i.e. backup is mid-way through absorbing new state?</li>
</ul>
<p><strong>Q:</strong> What if primary gets request, sends checkpoint to backup, and just before replying primary fails?</p>
<ul>
<li>The TCP layer takes care of this. Remus held the reply in the buffer, so the
client never received it. The backup's checkpointed state includes the TCP
connection state, so after failover the backup believes the reply was already
sent and waits for an ACK. No ACK arrives (the client got nothing), so the
backup's TCP stack retransmits the reply packets the primary never released,
and the client eventually gets the reply. Relying on the client to retransmit
the request instead would be problematic (re-executing it could cause side
effects).</li>
</ul>
<p><strong>Q:</strong> Is Plan 1 efficient?</p>
<ul>
<li>Can we eliminate the fact that backup <em>state</em> trails the primary?
<ul>
<li>Seems very hard!</li>
<li>Primary would have to tell backup (and wait) on every instruction.</li>
</ul></li>
<li>Can we <em>conceal</em> the fact that backup's state lags primary?
<ul>
<li>Prevent outside world from <em>seeing</em> that backup is behind last primary state
<ul>
<li>e.g. prevent the case where the primary sent an RPC reply but the backup's state doesn't reflect that RPC</li>
<li>e.g. MapReduce <code>Register()</code> RPC, which would be bad for the backup to forget</li>
</ul></li>
<li><em>Idea:</em> primary "holds" output until backup state catches up to output point
<ul>
<li>e.g. primary receives RPC request, processes it, creates reply packet,
but Remus holds reply packet until backup has received corresponding state update</li>
</ul></li>
</ul></li>
</ul>
<h2>Remus epochs, checkpoints</h2>
<ol>
<li>Primary runs for a while in Epoch 1 (E1), holding E1's output</li>
<li>Primary pauses</li>
<li>Primary copies RAM+disk changes from E1 to local buffer</li>
<li>Primary resumes execution in E2, holding E2's output</li>
<li>Primary sends checkpoint of RAM+disk to backup</li>
<li>Backup copies all to separate RAM, then ACKs (it applies later, in step 8)</li>
<li>Primary releases E1's output</li>
<li>Backup applies E1's changes to RAM and disk</li>
</ol>
<p>If primary fails, backup finishes applying last epoch's disk+RAM,
then starts executing</p>
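<p>The same sequence as Go-flavored pseudocode (every helper here is a hypothetical name, invented for illustration; real Remus does this inside the Xen hypervisor):</p>
<pre><code>// One Remus epoch, roughly. All helper functions are invented names.
func epoch() {
    pauseVM()
    buf := copyDirtyState() // RAM pages + disk writes from this epoch
    resumeVM()              // primary keeps executing; its output is held
    sendToBackup(buf)       // backup stores it in separate RAM
    waitForBackupAck()      // backup now has the whole checkpoint
    releaseHeldOutput()     // safe: backup can reproduce this state
    // backup applies the buffered checkpoint asynchronously
}
</code></pre>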
<p><strong>Q:</strong> Any externally visible anomalies?</p>
<p><strong>Q:</strong> What if primary receives + executes a request, crashes before checkpoint?
backup won't have seen request!</p>
<ul>
<li>That's fine as long as primary did not reply to that request: client will just send request again</li>
</ul>
<p><strong>Q:</strong> If primary sends a packet, then crashes, is backup guaranteed to have
state changes implied by that packet?</p>
<ul>
<li>Yes. That's the whole point of keeping the sent network packets buffered until the backup is up to date.</li>
</ul>
<p><strong>Q:</strong> What if primary crashes partway through release of output?
must backup re-send? How does it know what to re-send?</p>
<p><strong>Q:</strong> How does Remus decide it should switch to backup?</p>
<ul>
<li>Naive mechanism: If the primary stops talking to the backup, then something went wrong.</li>
</ul>
<p><strong>Q:</strong> Are there situations in which Remus will incorrectly activate the backup? i.e. primary is actually alive</p>
<ul>
<li>Network partition...</li>
</ul>
<p><strong>Q:</strong> When primary recovers, how does Remus restore replication? Needed, since eventually active ex-backup will itself fail</p>
<p><strong>Q:</strong> What if <em>both</em> fail, e.g. site-wide power failure?</p>
<ul>
<li>RAM content will be lost, but disks will probably survive</li>
<li>After power is restored, reboot guest from one of the disks
<ul>
<li>O/S and application recovery code will execute</li>
</ul></li>
<li>disk must be "crash-consistent"
<ul>
<li>So probably not the backup's disk if it was in the middle of installing a checkpoint</li>
</ul></li>
<li>disk shouldn't reflect any held outputs (... why not?)
<ul>
<li>So probably not the primary's disk if the primary was executing</li>
</ul></li>
<li>I do not understand this part of the paper (Section 2.5)
<ul>
<li>Seems to be a window during which neither disk could be used if power failed
<ul>
<li>primary writes its disk during epoch</li>
<li>meanwhile backup applies last epoch's writes to its disk</li>
</ul></li>
</ul></li>
</ul>
<p><strong>Q:</strong> In what situations will Remus likely have good performance?</p>
<p><strong>Q:</strong> In what situations will Remus likely have low performance?</p>
<p><strong>Q:</strong> Should epochs be short or long?</p>
<h2>Remus evaluation</h2>
<ul>
<li><em>Summary:</em> 1/2 to 1/4 native speed</li>
<li>Checkpoints are big and take time to send</li>
<li>Output hold limits speed at which clients can interact</li>
</ul>
<h3>Why so slow?</h3>
<ul>
<li>Checkpoints are big and take time to generate and send
<ul>
<li>100ms for SPECweb2005 -- because many pages written</li>
</ul></li>
<li>So inter-checkpoint intervals must be long</li>
<li>So output must be held for quite a while</li>
<li>So client interactions are slow
<ul>
<li>Only 10 RPCs per second per client</li>
</ul></li>
</ul>
<h3>How could one get better performance for replication?</h3>
<ul>
<li>Big savings possible with application-specific schemes:
<ul>
<li>just send state really needed by application, not all state</li>
<li>send state in optimized format, not whole pages</li>
<li>send operations if they are smaller than state</li>
</ul></li>
<li>likely <em>not</em> transparent to applications
<ul>
<li>and probably not to clients either</li>
</ul></li>
</ul>
<h2>Primary-backup replication in Lab 2</h2>
<h3>Outline</h3>
<ul>
<li>simple key/value database</li>
<li>primary and backup</li>
<li><em>replicated state machine:</em> replicate by primary sending each operation to backups</li>
<li>tolerate network problems, including partition
<ul>
<li>either keep going, correctly</li>
<li>or suspend operations until network is repaired</li>
</ul></li>
<li>allow replacement of failed servers</li>
<li>you implement essentially all of this (unlike lab 1)</li>
</ul>
<h3><em>"View server"</em> decides who primary <code>p</code> and backup <code>b</code> are</h3>
<ul>
<li><em>Main goal:</em> avoid "split brain" -- disagreement about who primary is</li>
<li>Clients and servers ask view server</li>
<li>They don't make independent decisions</li>
</ul>
<h3>Repair</h3>
<ul>
<li>view server can co-opt "idle" server as <code>b</code> after old <code>b</code> becomes <code>p</code></li>
<li>primary initializes new backup's state</li>
</ul>
<h3>Key points:</h3>
<ol>
<li>Only one primary at a time!</li>
<li>The primary must have the latest state!</li>
</ol>
<p>We will work out some rules to ensure these</p>
<h3>View server</h3>
<ul>
<li>Maintains a sequence of "views"</li>
</ul>
<p>Example:</p>
<pre><code> view #, primary, backup
0: -- --
1: S1 --
2: S1 S2
 3: S2 --
 4: S2 S3
</code></pre>
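<p>In the labs, a view is represented by roughly the following struct (a sketch; see the lab's <code>viewservice</code> package for the authoritative definition):</p>
<pre><code>type View struct {
    Viewnum uint   // increments by one for each new view
    Primary string // network address, "" if none
    Backup  string // network address, "" if none
}
</code></pre>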
<ul>
<li>Monitors server liveness
<ul>
<li>each server periodically sends a ping RPC (more like a heartbeat)</li>
<li><em>"dead"</em> if missed <code>N</code> pings in a row</li>
<li><em>"live"</em> after single ping</li>
</ul></li>
<li>Can be more than two servers pinging view server
<ul>
<li>if more than two, <em>"idle"</em> servers</li>
</ul></li>
<li>If primary is dead:
<ul>
<li>new view with previous backup as primary</li>
</ul></li>
<li>If backup is dead, or no backup
<ul>
<li>new view with previously idle server as backup</li>
</ul></li>
<li>OK to have a view with just a primary, and no backup
<ul>
<li>But -- if an idle server is available, make it the backup</li>
</ul></li>
</ul>
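<p>A sketch of how the view server's periodic liveness check might implement the rules above (the <code>ViewServer</code> fields and the <code>DeadPings</code> constant are assumptions for illustration, not the lab's exact code):</p>
<pre><code>// Sketch only; uses the View struct from above.
const DeadPings = 5 // assumed: missed ticks before a server is "dead"

type ViewServer struct {
    view     View
    lastPing map[string]int // server address -> tick of its last Ping
    idlers   []string       // servers pinging but not in the view
}

func (vs *ViewServer) tick(now int) {
    dead := func(s string) bool {
        return s != "" && now-vs.lastPing[s] >= DeadPings
    }
    idle := "" // an idle server for the new view, if one exists
    if len(vs.idlers) > 0 {
        idle = vs.idlers[0]
    }
    if dead(vs.view.Primary) {
        // previous backup becomes primary; an idle server backs it up
        vs.view = View{vs.view.Viewnum + 1, vs.view.Backup, idle}
    } else if dead(vs.view.Backup) {
        // drop the dead backup; recruit an idle server if available
        vs.view = View{vs.view.Viewnum + 1, vs.view.Primary, idle}
    } else if vs.view.Backup == "" && idle != "" {
        // no backup yet, but an idle server is available: use it
        vs.view = View{vs.view.Viewnum + 1, vs.view.Primary, idle}
    }
}
</code></pre>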
<h3>How to ensure new primary has up-to-date replica of state?</h3>
<ul>
<li>Only promote previous backup
<ul>
<li>i.e. don't make an idle server the primary</li>
</ul></li>
<li>Backup must remember if it has been initialized by primary
<ul>
<li>If not, don't function as primary even if promoted!</li>
</ul></li>
</ul>
<p><strong>Q:</strong> Can more than one server think it is primary?</p>
<pre><code> 1: S1, S2
    net broken, so viewserver thinks S1 dead, but it's actually alive
 2: S2, --
    now S1 is alive and unaware of view #2, so S1 still thinks it is primary
    AND S2 is alive and thinks it is primary
    => split brain, no good
</code></pre>
<h3>How to ensure only one server acts as primary?</h3>
<p>...even though more than one may <em>think</em> it is primary.</p>
<p><em>"Acts as"</em> <code>==</code> executes and responds to client requests</p>
<p><em>The basic idea:</em></p>
<pre><code> 1: S1 S2
2: S2 --
 S1 still thinks it is primary
   S1 must forward ops to S2
 S2 thinks S2 is primary
   so S2 must reject S1's forwarded ops
</code></pre>
<p>The rules:</p>
<ol>
<li>Primary in view <code>i</code> must have been primary or backup in view <code>i-1</code></li>
<li>Primary must wait for backup to accept each request
<ul>
<li><strong>Q:</strong> What if there's no backup or the backup doesn't know it's a backup?</li>
<li><strong>A:</strong> Primary can't make progress without a backup if it's part of the view, so it just waits</li>
<li><strong>A:</strong> If the view is updated so that the backup is taken out of the view, then the primary can operate without a backup ("dangerous mode")</li>
</ul></li>
<li>Non-backup must reject forwarded requests</li>
<li>Non-primary must reject direct client requests</li>
<li>Every operation must be before or after state transfer</li>
</ol>
<p>Example:</p>
<pre><code> 1: S1, S2
    viewserver stops hearing Pings from S1
 2: S2, --
    it may be a while before S2 hears about view #2
 before S2 hears about view #2:
   S1 can process ops from clients, S2 will accept forwarded requests
   S2 will reject ops from clients who have heard about view #2
 after S2 hears about view #2:
   if S1 receives a client request, it will forward it, and S2 will reject
     so S1 can no longer act as primary
   S1 will send an error to the client; the client will ask the viewserver
     for the new view and re-send to S2
 the true moment of switch-over occurs when S2 hears about view #2
</code></pre>
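<p>The backup-side rejection rule from this example, as a sketch (the <code>KVServer</code> type and the RPC's shape are illustrative, not the lab's exact code; uses the <code>View</code> struct from above):</p>
<pre><code>import "sync"

type KVServer struct {
    mu   sync.Mutex
    me   string            // my own network address
    view View              // latest view learned from the view server
    data map[string]string // the replicated key/value state
}

// Forward is the RPC a primary uses to send each op to the backup.
// Rule 3: a server that doesn't currently believe it is the backup
// must reject -- this is what unmasks a deposed primary like S1.
func (srv *KVServer) Forward(key, value string, err *string) error {
    srv.mu.Lock()
    defer srv.mu.Unlock()
    if srv.me != srv.view.Backup {
        *err = "ErrWrongServer" // sender must stop acting as primary
        return nil
    }
    srv.data[key] = value
    *err = "OK"
    return nil
}
</code></pre>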
<h3>How can new backup get state?</h3>
<ul>
<li>e.g. all the keys and values</li>
<li>if S2 is backup in view <code>i</code>, but was not in view <code>i-1</code>,
<ul>
<li>S2 should ask primary to transfer the complete state</li>
</ul></li>
</ul>
<h3>Rule for state transfer:</h3>
<ul>
<li>every operation (<code>Put/Get/Append</code>) must be either before or after state xfer
<ul>
<li><code>==</code> state xfer must be atomic w.r.t. operations</li>
</ul></li>
<li>either
<ul>
<li>op is before, and xferred state reflects op</li>
<li>op is after, xferred state doesn't reflect op, primary forwards op after state</li>
</ul></li>
</ul>
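<p>One way to get that atomicity, sketched on the primary (assuming, as in the <code>KVServer</code> sketch above, that every operation acquires the same lock):</p>
<pre><code>// Snapshot under the same mutex that Put/Get/Append acquire, so
// each op lands cleanly before the transfer (included in the
// snapshot) or after it (forwarded to the new backup later).
func (srv *KVServer) transferState() map[string]string {
    srv.mu.Lock()
    defer srv.mu.Unlock()
    snapshot := make(map[string]string, len(srv.data))
    for k, v := range srv.data {
        snapshot[k] = v
    }
    return snapshot // sent to the new backup via RPC (not shown)
}
</code></pre>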
<p><strong>Q:</strong> Does primary need to forward <code>Get()</code>'s to backup?</p>
<ul>
<li>After all, <code>Get()</code> doesn't change anything, so why does backup need to know?</li>
<li>and the extra RPC costs time</li>
<li>has to do with ensuring there's just one primary:
<ul>
<li>suppose there's two primaries by accident (P and P' both think they are primaries)
<ul>
<li>how can this occur? network partition?</li>
</ul></li>
<li>suppose client sends a Get request to the wrong primary P'</li>
<li>then P' will try to forward the request to P (which P' thinks is its backup)</li>
<li>then P will tell P': <em>"Hey, bug off, I'm the primary"</em></li>
</ul></li>
</ul>
<p><strong>Q:</strong> How could we make primary-only <code>Get()</code>'s work?</p>
<p><strong>Q:</strong> Are there cases when the Lab 2 protocol cannot make forward progress?</p>
<ul>
<li>View service fails</li>
<li>Primary fails before backup gets state</li>
<li>We will start fixing those in Lab 3</li>
</ul>