-
Notifications
You must be signed in to change notification settings - Fork 40
/
Copy pathl20-argus.html
564 lines (472 loc) · 19.7 KB
/
l20-argus.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
<h1>6.824 2015 Lecture 20: Argus</h1>
<p><strong>Note:</strong> These lecture notes were slightly modified from the ones posted on the
6.824 <a href="http://nil.csail.mit.edu/6.824/2015/schedule.html">course website</a> from
Spring 2015.</p>
<h2>Atomic commit: two-phase commit</h2>
<ul>
<li>how to use two-phase commit for distributed transactions</li>
<li>Argus</li>
</ul>
<p>You have a bunch of computers that do different things (not replicas). Like
two computers, one stores events for people in A-L, another for people in M-Z.
If you want to create an event for Alice and Mike you need to interact with
both servers and make sure that the event is either created on both or on
neither.</p>
<p>The challenges are <em>crashes</em> and <em>network failures</em> which inject ambiguities (
not responding cause of crash or network failure?)</p>
<p>In Ivy and TreadMarks if one of the machines crashed it had no way to recover.
We also saw MapReduce and Spark which had a story for crash recovery.</p>
<p>Code:</p>
<pre><code>schedule(u1 user, u2 user, t time):
ok1 = reserve(u1, t) # reserve for the 1st user
ok2 = reserve(u2, t) # reserve for the 2nd user
# Tricky: if the 1st reserve succeeded and the 2nd didn't => trouble
# We'd like to deal with this in the following way:
if ok1 and ok2
commit
else
abort
# One bad way to make this work is to let the servers chit-chat and
# make sure they both committed.
# At no stage in a transaction like this can the servers finish
# - S1: I'll do it if you do it
# - S2: I'll do it if you do it
# - S1: I'll do it if you do it
# - S2: I'll do it if you do it
# (sounds like the two generals problem?)
</code></pre>
<h3>Idea 1: tentative changes</h3>
<pre><code>reserve(u user, t time):
if u[t] = free # if user's calendar is free at time t
tent[t] = taken # ...then tenatively schedule
commit:
copy tent[t] to u[t]
abort:
discard tent[t]
</code></pre>
<h3>Idea 2: single machine/entity (transaction coordinator) decides</h3>
<pre><code>client TC A B
\-------------------->
\---------------------------->
| |
<--------------------/ |
<----------------------------/
------------
----- GO ----> | |
| | |
<------------/ | |
------------
</code></pre>
<p>Properties:</p>
<ul>
<li>state: unknown, committed, aborted</li>
<li>if any thinks "committed", then none think "aborted"</li>
<li>if any think "aborted", then none think "committed"</li>
</ul>
<h2>Two-phase commit (2PC)</h2>
<p>Used frequently in real distributed databases.</p>
<pre><code>client TC A B
.
.
.
---- GO ----> --\
prepare |
------------> |
------------------------> | Phase 1
yes/no | | |
<-----------/ | |
<-----------------------/ --/
commit/abort
------------> --\
------------------------> | Phase 2
commit/abort |
<------------- --/
</code></pre>
<p><code>Prepare</code> asks "are you still alive and willing to commit this transaction?"</p>
<ul>
<li>servers may say no</li>
<li>servers may be unreachable</li>
</ul>
<h3>Termination protocol</h3>
<ul>
<li>maybe the TC has a timeout while it's waiting for the yes/no response to
one or more prepare messages
<ul>
<li>at this point, it can abort the transaction, because no one has started
a commit (since the TC did not send it, since it was waiting on yes/no)</li>
</ul></li>
<li>B times out while waiting for prepare message
<ul>
<li><code>=></code> B hasn't replied to prepare <code>=></code> TC hasn't sent commit to the
participants <code>=></code> TC can send abort</li>
</ul></li>
<li>B times out waiting for commit/abort after saying <em>no</em> to prepare
<ul>
<li><code>=></code> B can abort because it knows the TC will abort everyone </li>
</ul></li>
<li>B times out waiting for commit/abort after saying <em>yes</em> to prepare
<ul>
<li><code>=></code> B said yes to TC and TC could have received <code>yes</code> from everyone else (or not)
<code>=></code> outcome can be either commit or abort <code>=></code> B has to wait</li>
<li>there are some lucky cases in which <code>B</code> could decide to abort/commit if
<code>A</code> tells it via another channel</li>
</ul></li>
</ul>
<p>Does this waiting make 2PC impractical? People are split up?</p>
<p>What about reboots? If one of the participants said yes to a prepare, it has to
remember that across reboots or crashes, so that it can be able to finish the
transaction (commit or abort).</p>
<ul>
<li>in the calendar example, it would also need to remember the tentative schedule
in <code>tent[]</code></li>
<li>extra note: since in the diagram the TC did not wait for ACKs on commit/abort
the participants need to persist their locks around the transaction so that
they don't do a subsequent transaction before this one is finished</li>
</ul>
<p>What happens if TC crashes in the middle of sending commits?</p>
<ul>
<li>it has to remember all committed/non-committed transactions</li>
</ul>
<p>Resemblance to Paxos?</p>
<ul>
<li>Paxos is a way to build highly available systems by replication (all servers
have all data and are doing the same thing)
<ul>
<li>Paxos system can proceed even if some of the servers are down</li>
</ul></li>
<li>2PC you cannot make progress even if just one server is down
<ul>
<li>each server is doing a <em>different thing</em> (want every server to do its own
part in a transaction)</li>
</ul></li>
<li>While 2PC helps a set of servers reach agreement, it it not fault tolerant
or available (it cannot proceed when servers are down)</li>
<li>You might think you can do the calendar scheduling with Paxos by having both
servers agree on the schedule op. However, while agreeing on the op will work,
committing the op will not. For instance, what if one server's user is busy during the
scheduled time? Then it cannot commit the op while the other one might be able
to. Paxos doesn't help solve that conflict.</li>
</ul>
<p>Atomic distributed transactions: write your transaction code without thinking
about what other transactions could be going on</p>
<p>Bank example:</p>
<pre><code>T1:
addToBal(x, 1)
addToBal(y, -1)
# Need this to be a transaction to implement a transfer correctly
T2:
tmp1 = getBal(x)
tmp2 = getBal(y)
print(tmp1, tmp2)
# We cannot have the execution of T1 interleave with the execution of
# T2. T2 had better see both addToBal calls or no addToBal calls from T1
</code></pre>
<p>This is called <em>serializability</em>: The effect of running a bunch of transactions
is the same as if they were run in some sequential order (no interleaving allowed:
exec first half of T1, exec first half of T2, finish second half of T1, finish T2).</p>
<p>One way to implement transactions is to use locks for each data record that are
acquired before a transaction begins operating on those records and holds them
until it commits or aborts. This is called <strong>two-phase locking</strong>.</p>
<p>Deadlock can occur if T1 acquires x and then y while T2 acquires y and then x.
Database systems for instance have ways to deal with this::</p>
<ul>
<li>timeout on acquiring locks and retry</li>
<li>only allow transactions to acquire locks in a certain order</li>
<li>perform deadlock detection if single-machine setup</li>
</ul>
<p>Nobody ever likes to use 2PC.</p>
<ul>
<li>because of the waiting/blocking issue when a server times out waiting for
a commit/abort after having said "no" to a prepare</li>
</ul>
<p>When participants acquire locks they are holding them across multiple RTTs
in the network because you have to wait for the commit message.</p>
<h2>Argus</h2>
<ul>
<li>the cool thing is that it attempts to absorb as much of the nitty-gritty junk
of distributed systems programming inside the language</li>
<li>the desire was to have a clean story for handling RPC failures </li>
<li>Argus sets up a framework where RPC failures can be handled cleanly
<ul>
<li>does all the bookkeeping required to rollback the transactions</li>
</ul></li>
<li>Argus has to know about the data in order to be able to rollback
<ul>
<li>it needs to create tentative updates and so on </li>
</ul></li>
</ul>
<h1>6.824 notes</h1>
<pre><code>6.824 2015 Lecture 20: Two-Phase Commmit
Topics:
distributed commit, two-phase commit
distributed transactions
Argus -- language for distributed programming
Distributed commit:
A bunch of computers are cooperating on some task, e.g. bank transfer
Each computer has a different role, e.g. src and dst bank account
Want to ensure atomicity: all execute, or none execute
"distributed transaction"
Challenges: crashes and network failures
Example:
calendar system, each user has a calendar
want to schedule meetings with multiple participants
one server holds calendars of users A-M, another server holds N-Z
[diagram: client, two servers]
sched(u1, u2, t):
begin_transaction
ok1 = reserve(u1, t)
ok2 = reserve(u2, t)
if ok1 and ok2:
commit
else
abort
end_transaction
the reserve() calls are RPCs to the two calendar servers
We want both to reserve, or both not to reserve.
What if 1st reserve() returns true, and then:
2nd reserve() returns false (time not available)
2nd reserve() doesn't return (lost RPC msg, u2's server crashes)
2nd reserve() returns but then crashes
client fails before 2nd reserve()
We need a "distributed commit protocol"
Idea: tentative changes, later commit or undo (abort)
reserve_handler(u, t):
if u[t] is free:
temp_u[t] = taken -- A TEMPORARY VERSION
return true
else:
return false
commit_handler():
copy temp_u[t] to real u[t]
abort_handler():
discard temp_u[t]
Idea: single entity decides whether to commit
to prevent any chance of disagreement
let's call it the Transaction Coordinator (TC)
[time diagram: client, TC, A, B]
client sends RPCs to A, B
on end_transaction, client sends "go" to TC
TC/A/B execute distributed commit protocol...
TC reports "commit" or "abort" to client
We want two properties for distributed commit protocol:
TC, A, and B start in state "unknown"
each can move to state "abort" or "commit"
but then each never changes mind
Correctness:
if any commit, none abort
if any abort, none commit
Performance:
(since doing nothing is correct...)
if no failures, and A and B can commit, then commit.
if failures, come to some conclusion ASAP.
We're going to develop a protocol called "two-phase commit"
Used by distributed databases for multi-server transactions
And by Spanner and Argus
Two-phase commit without failures:
[time diagram: client, TC, A, B]
client sends reserve() RPCs to A, B
client sends "go" to TC
TC sends "prepare" messages to A and B.
A and B respond, saying whether they're willing to commit.
Respond "yes" if haven't crashed, timed out, &c.
If both say "yes", TC sends "commit" messages.
If either says "no", TC sends "abort" messages.
A/B "decide to commit" if they get a commit message.
I.e. they actually modify the user's calendar.
Why is this correct so far?
Neither can commit unless they both agreed.
Crucial that neither changes mind after responding to prepare
Not even if failure
What about failures?
Network broken/lossy
Server crashes
Both visible as timeout when expecting a message.
Where do hosts wait for messages?
1) TC waits for yes/no.
2) A and B wait for prepare and commit/abort.
Termination protocol summary:
TC t/o for yes/no -> abort
B t/o for prepare, -> abort
B t/o for commit/abort, B voted no -> abort
B t/o for commit/abort, B voted yes -> block
TC timeout while waiting for yes/no from A/B.
TC has not sent any "commit" messages.
So TC can safely abort, and send "abort" messages.
A/B timeout while waiting for prepare from TC
have not yet responded to prepare
so can abort
respond "no" to future prepare
A/B timeout while waiting for commit/abort from TC.
Let's talk about just B (A is symmetric).
If B voted "no", it can unilaterally abort.
So what if B voted "yes"?
Can B unilaterally decide to abort?
No! TC might have gotten "yes" from both,
and sent out "commit" to A, but crashed before sending to B.
So then A would commit and B would abort: incorrect.
B can't unilaterally commit, either:
A might have voted "no".
If B voted "yes", it must "block": wait for TC decision.
What if B crashes and restarts?
If B sent "yes" before crash, B must remember!
--- this is today's question
Can't change to "no" (and thus abort) after restart
Since TC may have seen previous yes and told A to commit
Thus:
B must remember on disk before saying "yes", including modified data.
B reboots, disk says "yes" but no "commit", must ask TC.
If TC says "commit", copy modified data to real data.
What if TC crashes and restarts?
If TC might have sent "commit" or "abort" before crash, TC must remember!
And repeat that if anyone asks (i.e. if A/B/client didn't get msg).
Thus TC must write "commit" to disk before sending commit msgs.
Can't change mind since A/B/client have already acted.
This protocol is called "two-phase commit".
What properties does it have?
* All hosts that decide reach the same decision.
* No commit unless everyone says "yes".
* TC failure can make servers block until repair.
What about concurrent transactions?
We realy want atomic distributed transactions,
not just single atomic commit.
x and y are bank balances
x and y start out as $10
T1 is doing a transfer of $1 from x to y
T1:
add(x, 1) -- server A
add(y, -1) -- server B
T2:
tmp1 = get(x)
tmp2 = get(y)
print tmp1, tmp2
Problem:
what if T2 runs between the two add() RPCs?
then T2 will print 11, 10
money will have been created!
T2 should print 10,10 or 9,11
The traditional approach is to provide "serializability"
results should be as if transactions ran one at a time in some order
either T1, then T2; or T2, then T1
Why serializability?
it allows transaction code to ignore the possibility of concurrency
just write the transaction to take system from one legal state to another
internally, the transaction can temporarily violate invariants
but serializability guarantess no-one will notice
One way to implement serializabilty is with "two-phase locking"
this is what Argus does
each database record has a lock
the lock is stored at the server that stores the record
no need for a central lock server
each use of a record automatically acquires the record's lock
thus add() handler implicitly acquires lock when it uses record x or y
locks are held until *after* commit or abort
Why hold locks until after commit/abort?
why not release as soon as done with the record?
e.g. why not have T2 release x's lock after first get()?
T1 could then execute between T2's get()s
T2 would print 10,9
but that is not a serializable execution: neither T1;T2 nor T2;T1
2PC perspective
Used in sharded DBs when a transaction uses data on multiple shards
But it has a bad reputation:
slow because of multiple phases / message exchanges
locks are held over the prepare/commit exchanges
TC crash can cause indefinite blocking, with locks held
Thus usually used only in a single small domain
E.g. not between banks, not between airlines, not over wide area
Paxos and two-phase commit solve different problems!
Use Paxos to high availability by replicating
i.e. to be able to operate when some servers are crashed
the servers must have identical state
Use 2PC when each participant does something different
And *all* of them must do their part
2PC does not help availability
since all servers must be up to get anything done
Paxos does not ensure that all servers do something
since only a majority have to be alive
What if you want high availability *and* distributed commit?
[diagram]
Each "server" should be a Paxos-replicated service
And the TC should be Paxos-replicated
Run two-phase commit where each participant is a replicated service
Then you can tolerate failures and still make progress
This is what Spanner does (for update transactions)
Case study: Argus
Argus's big ideas:
Language support for distributed programs
Very cool: language abstracts away ugly parts of distrib systems
Aimed at different servers doing different jobs, cooperating
Easy fault tolerance:
Transactional updates
So crash results in entire transaction un-done, not partial update
Easy persistence ("stable"):
Ordinary variables automatically persisted to disk
Automatic crash recovery
Easy concurrency:
Implicit locking of language objects
Easy RPC model:
Method calls transparently turned into RPCs
RPC failure largely hidden via transactions, two-phase commit
We've seen the fundamental problem before
What to do if *part* of a distributed computation crashes?
IVY/Treadmarks had no answer
MR/Spark could re-execute *part* of computation, for big data
Picture
"guardian" is like an RPC server
has state (variables) and handlers
"handler" is an RPC handler
reads and writes local variables
"action" is a distributed atomic transaction
action on A
A RPC to B
B RPC to C
A RPC to D
A finishes action
prepare msgs to B, C, D
commit msgs to B, C, D
The style is to send RPC to where the data is
Not to fetch the data
Argus is not a storage system
Look at bank example
page 309 (and 306): bank transfer
Points to notice
stable keyword (programmer never writes to disk &c)
atomic keyword (programmer almost never locks/unlocks)
enter topaction (in transfer)
coenter (in transfer)
RPCs are hidden (e.g. f.withdraw())
RPC error handling hidden (just aborts)
what if deposit account doesn't exist?
but f.withdraw(from) has already been called?
how to un-do?
what's the guardian state when withdraw() handler returns?
lock, temporary version, just in memory
what if an audit runs during a transfer?
how does the audit not see the tentative new balances?
if a guardian crashes and reboots, what happens to its locks?
can it just forget about pre-crash locks?
subactions
each RPC is actually a sub-action
the RPC can fail or abort w/o aborting surrounding action
this lets actions e.g. try one server, then another
if RPC reply lost, subaction will abort, undo
much cleaner than e.g. Go RPC
is Argus's implicit locking the right thing?
very convenient!
don't have to worry about forgetting to lock!
(though deadlocks are easy)
databases work (and worked) this way; it's a sucessful idea
is transactions + RPC + 2PC a good design point?
programmability pro:
very easy to get nice fault tolerance semantics
performance con:
lots of msgs and disk writes
2PC and 2PL hold locks for a while, block if failure
is Argus's language integration the right thing?
i.e. persisting and locking language objects
it looks very convenient (and it is)
but it turns out to be even more valuable have relational tables
people like queries/joins/&c over tables, rows, columns
that is, people like a storage abstraction!
maybe there is a better language-based scheme waiting to be found
</code></pre>