-
Notifications
You must be signed in to change notification settings - Fork 40
/
Copy pathl10-treadmarks.html
554 lines (446 loc) · 16.1 KB
/
l10-treadmarks.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
<h1>6.824 2015 Lecture 10: Consistency</h1>
<p><strong>Note:</strong> These lecture notes were slightly modified from the ones posted on the 6.824
<a href="http://nil.csail.mit.edu/6.824/2015/schedule.html">course website</a> from Spring 2015.</p>
<p><strong>Today:</strong> consistency</p>
<ul>
<li>Lazy release consistency</li>
<li>Using lazy consistency to get performance</li>
<li>Consistency = meaning of concurrent reads and writes</li>
<li>Less obvious than it may seem!</li>
<li>Choice trades off between performance and programmer-friendliness
<ul>
<li>Huge factor in many designs</li>
</ul></li>
<li><a href="papers/keleher-treadmarks.pdf">Today's paper</a>: a case study</li>
</ul>
<p>Many systems have storage/memory w/ concurrent readers and writers</p>
<ul>
<li>Multiprocessors, databases, AFS, labs</li>
<li>You often want to improve in ways that risk changing behavior:
<ul>
<li>add caching</li>
<li>split over multiple servers</li>
<li>replicate for fault tolerance</li>
</ul></li>
<li>How do we know if an optimization is correct?</li>
<li>We need a way to think about correct execution of distributed programs</li>
<li>Most of these ideas from multiprocessors and databases 20/30 years ago</li>
</ul>
<p>How can we write correct distributed programs w/ shared storage?</p>
<ul>
<li>Memory system promises to behave according to certain rules</li>
<li>We write programs assuming those rules</li>
<li>Rules are a "consistency model"</li>
<li>Contract between memory system and programmer</li>
</ul>
<p>What makes a good consistency model?</p>
<ul>
<li>There are no "right" or "wrong" models</li>
<li>A model may make it harder or easier to program
<ul>
<li>i.e. lead to more or less intuitive results</li>
</ul></li>
<li>A model may be harder or easier to implement efficiently</li>
<li>Also application dependent
<ul>
<li>e.g. Web pages vs memory</li>
</ul></li>
</ul>
<p>Some consistency models:</p>
<ul>
<li>Spanner: external consistency (behaves like a single machine)</li>
<li>Database world: strict serializability, serializability, snap-shot isolation, read-committed</li>
<li>Distributed file systems: open-to-close consistency</li>
<li>Computer architects: TSO (total store ordering), release consistency, etc.</li>
<li>Concurrency theory: sequential consistency, linearizability</li>
<li>Similar ideas, but sometimes slightly different meaning</li>
</ul>
<p>DSM is a good place to start to study consistency</p>
<ul>
<li>Simple interface: read and write of memory locations</li>
<li>Consistency well developed in architecture community</li>
</ul>
<p>Example:</p>
<pre><code> x and y start out = 0
M0:
x = 1
if y == 0:
print yes
M1:
y = 1
if x == 0:
print yes
Can they both print "yes"?
</code></pre>
<p>Performance of DSM is limited by memory consistency</p>
<ul>
<li>With sequential consistency, M0's write must be visible to M1 before M0 can execute read
<ul>
<li>Otherwise both M0 and M1 can read 0 and print "yes"</li>
<li>(Second "forbidden" example)</li>
</ul></li>
<li>Thus operations will take a while in a distributed system
<ul>
<li>And they have to be done one by one</li>
</ul></li>
</ul>
<p>Treadmarks high level goals?</p>
<ul>
<li>Better DSM performance</li>
<li>Run existing parallel code (multithreaded)
<ul>
<li>this code already has locks</li>
<li>TreadMarks will run each thread/process on a separate machine and
let it access the DSM.</li>
<li>TreadMarks takes advantage that the code already uses locking</li>
</ul></li>
</ul>
<p>What specific problems with previous DSM are they trying to fix?</p>
<ul>
<li><strong>false sharing:</strong> two machines r/w different vars on same page
<ul>
<li>m1 writes x, m2 writes y</li>
<li>m1 writes x, m2 just reads y</li>
<li><strong>Q:</strong> what does IVY do in this situation?</li>
<li><strong>A:</strong> Ivy will bounce the page between x and y back and forth</li>
</ul></li>
<li><strong>Write amplification:</strong> a 1-byte write turns into a whole-page transfer</li>
</ul>
<p><strong>First Goal:</strong> eliminate write amplification</p>
<ul>
<li>don't send whole page, just written bytes</li>
</ul>
<h3>Big idea: write diffs (to fix write amplification)</h3>
<p>Example: </p>
<pre><code>m1 and m2 both have x's page as readable
m1 writes x
m2 just reads x
</code></pre>
<ul>
<li>on M1 write fault:
<ul>
<li>tell other hosts to invalidate but <em>keep hidden copy</em></li>
<li>M1 makes hidden copy as well</li>
<li>M1 marks the page as R/W</li>
</ul></li>
<li>on M2 read fault:
<ul>
<li>M2 asks M1 for recent modifications</li>
<li>M1 "diffs" current page against hidden copy</li>
<li>M1 send diffs to M2</li>
<li>M2 applies diffs to its hidden copy</li>
<li>M2 marks the page as read-only</li>
<li>M1 marks the page as read-only</li>
</ul></li>
</ul>
<p><strong>Q:</strong> Do write diffs provide sequential consistency?</p>
<ul>
<li>At most one writeable copy, so writes are ordered</li>
<li>No writing while any copy is readable, so no stale reads</li>
<li>Readable copies are up to date, so no stale reads</li>
<li>Still sequentially consistent</li>
</ul>
<p><strong>Q:</strong> Do write diffs help with false sharing? <br />
<strong>A:</strong> No, they help with write amplification</p>
<p>Next goal: allow multiple readers+writers to cope with false sharing</p>
<ul>
<li>our solution needs to allow two machines to write the same page</li>
<li><code>=></code> don't invalidate others when a machine writes</li>
<li><code>=></code> don't demote writers to r/o when another machine reads</li>
<li><code>=></code> multiple <em>different</em> copies of a page!
<ul>
<li>which should a reader look at?</li>
</ul></li>
<li>diffs help: can merge writes to same page</li>
<li>but when to send the diffs?
<ul>
<li>no invalidations -> no page faults -> what triggers sending diffs</li>
</ul></li>
</ul>
<p>...so we come to <em>release consistency</em></p>
<h3>Big idea: (eager) release consistency (RC)</h3>
<ul>
<li><em>Again:</em> what should trigger sending diffs?</li>
<li>Seems like we should be sending the diffs when someone reads the data
that was changed. How can we tell someone's reading the data if we
won't get a read fault because we did not invalidate other people's
pages when it was written by one person?</li>
<li>no-one should read data w/o holding a lock!
<ul>
<li>so let's assume a lock server</li>
</ul></li>
<li>send out write diffs on release (unlock)
<ul>
<li>to <em>all</em> machines with a copy of the written page(s)</li>
</ul></li>
</ul>
<p>Example:</p>
<pre><code>lock()
x = 1
unlock() --> diffs all pages, to detect all the writes since
the last unlock
--> sends diffs to *all* machines
</code></pre>
<p><strong>Q:</strong> Why detect all writes since the last <code>unlock()</code> and not the last <code>lock()</code>?</p>
<p><strong>A:</strong> See causal consistency discussion below.</p>
<p>Example 1 (RC and false sharing)</p>
<pre><code>x and y are on the same page
ax -- acquire lock x
rx -- release lock x
M0: a1 for(...) x++ r1
M1: a2 for(...) y++ r2 a1 print x, y r1
</code></pre>
<p>What does RC do?</p>
<ul>
<li>M0 and M1 both get cached writeable copy of the page</li>
<li>when they release, each computes diffs against original page</li>
<li><code>M1</code>'s <code>a1</code> causes it to wait until <code>M0</code>'s <code>r1</code> release
<ul>
<li>so M1 will see M0's writes</li>
</ul></li>
</ul>
<p><strong>Q:</strong> What is the performance benefit of RC?</p>
<ul>
<li>What does IVY do with Example 1?
<ul>
<li>if <code>x</code> and <code>y</code> are on the same page, page is bounced back between <code>M0</code> and <code>M1</code></li>
</ul></li>
<li>multiple machines can have copies of a page, even when 1 or more writes
<ul>
<li><code>=></code> no bouncing of pages due to false sharing</li>
<li><code>=></code> read copies can co-exist with writers</li>
</ul></li>
</ul>
<p><strong>Q:</strong> Is RC sequentially consistent? No!</p>
<ul>
<li>in SC, a read sees the latest write</li>
<li>M1 won't see M0's writes until M0 releases a lock
<ul>
<li>i.e. M1 can see a stale copy of x, which wasn't allowed under seq const</li>
</ul></li>
<li>so machines can temporarily disagree on memory contents</li>
<li>if you always lock:
<ul>
<li>locks force order <code>-></code> no stale reads <code>-></code> like sequential consistency</li>
</ul></li>
</ul>
<p><strong>Q:</strong> What if you don't lock?</p>
<ul>
<li>Reads can return stale data</li>
<li>Concurrent writes to same var -> trouble</li>
</ul>
<p><strong>Q:</strong> Does RC make sense without write diffs?</p>
<ul>
<li>Probably not: diffs needed to reconcile concurrent writes to same page</li>
</ul>
<h3>Big idea: lazy release consistency (LRC)</h3>
<ul>
<li>one problem is that when we <code>unlock()</code> we update everybody,
but not everyone might need the changed data</li>
<li>only send write diffs to next acquirer of released lock
<ul>
<li>(i.e. when someone calls <code>lock()</code> and they need updates to the
data)</li>
</ul></li>
<li>lazier than RC in two ways:
<ul>
<li>release does nothing, so maybe defer work to future release</li>
<li>sends write diffs just to acquirer, not everyone</li>
</ul></li>
</ul>
<p>Example 2 (lazyness)</p>
<pre><code>x and y on same page (otherwise IVY avoids copy too)
M0: a1 x=1 r1
M1: a2 y=1 r2
M2: a1 print x,y r1
</code></pre>
<p>What does LRC do?</p>
<ul>
<li>M2 asks the lock manager who the previous holder of lock 1 was
<ul>
<li>it was M1</li>
</ul></li>
<li>M2 only asks previous holder of lock 1 for write diffs</li>
<li>M2 does not see M1's modification to <code>y</code>, even though on same page
<ul>
<li>because it did not acquire lock 2 using <code>a2</code></li>
</ul></li>
</ul>
<p>What does RC do?</p>
<ul>
<li>RC would have broadcast all changes on <code>x</code> and <code>y</code> to everyone</li>
</ul>
<p>What does IVY do?</p>
<ul>
<li>IVY would invalidate pages and ensure only the writer has a write-only
copy</li>
<li>on reads, the written page is turned to read only and the data is
fetched by the readers</li>
</ul>
<p><strong>Q:</strong> What's the performance win from LRC?</p>
<ul>
<li>if you don't acquire lock on object, you don't see updates to it</li>
<li><code>=></code> if you use just some vars on a page, you don't see writes to others</li>
<li><code>=></code> less network traffic</li>
</ul>
<p><strong>Q:</strong> Does LRC provide the same consistency model as RC?</p>
<ul>
<li><strong>No!</strong> LRC hides some writes that RC reveals</li>
<li>Note: if you use locks correctly, then you should not notice the differences
between (E)RC and LRC</li>
<li>in above example, RC reveals <code>y=1</code> to M2, LRC does not reveal
<ul>
<li>because RC broadcasts changes on a lock release</li>
</ul></li>
<li>so <code>"M2: print x, y"</code> might print fresh data for RC, stale for LRC
<ul>
<li>depends on whether print is before/after M1's release</li>
</ul></li>
</ul>
<p><strong>Q:</strong> Is LRC a win over IVY if each variable on a separate page?</p>
<ul>
<li>IVY doesn't move data until the program tries to read it
<ul>
<li>So Ivy is pretty lazy already</li>
</ul></li>
<li>Robert: TreadMarks is only worth it pages are big</li>
<li>Or a win over IVY plus write diffs?</li>
</ul>
<p>Do we think all threaded/locking code will work with LRC?</p>
<ul>
<li>Do all programs lock every shared variable they read?</li>
<li>Paper doesn't quite say, but strongly implies "no"!</li>
</ul>
<p>Example 3 (causal anomaly)</p>
<pre><code>M0: a1 x=1 r1
M1: a1 a2 y=x r2 r1
M2: a2 print x, y r2
</code></pre>
<p>What's the potential problem here?</p>
<ul>
<li>Counter-intuitive that M2 might see y=1 but x=0
<ul>
<li>because M2 didn't acquire lock 1, it could not get
the changes to <code>x</code></li>
</ul></li>
</ul>
<p>A violation of "causal consistency":</p>
<ul>
<li>If write W1 contributed to write W2, everyone sees W1 before W2</li>
</ul>
<p>Example 4 (there's an argument that this is <em>natural cod</em>):</p>
<pre><code>M0: x=7 a1 y=&x r1
M1: a1 a2 z=y r2 r1
M2: a2 print *z r2
</code></pre>
<p>In example 4, it's not clear if M2 will learn from M1 the writes that M0 also did
and contributed to <code>y=&x</code>.</p>
<ul>
<li>for instance, if <code>x</code> was 1 before it was changed by M0, will M2 see this when it prints <code>*z</code></li>
</ul>
<p>TreadMarks provides <strong>causal consistency</strong>:</p>
<ul>
<li>when you acquire a lock,</li>
<li>you see all writes by previous holder</li>
<li>and all writes previous holder saw </li>
</ul>
<p>How to track what writes contributed to a write?</p>
<ul>
<li>Number each machine's releases -- "interval" numbers</li>
<li>Each machine tracks highest write it has seen from each other machine
<ul>
<li>highest write = the interval # of the last write that machine is aware of</li>
<li>a "Vector Timestamp"</li>
</ul></li>
<li>Tag each release with current VT</li>
<li>On acquire, tell previous holder your VT
<ul>
<li>difference indicates which writes need to be sent</li>
</ul></li>
<li>(annotate previous example)</li>
<li>when can you throw diffs away?
<ul>
<li>seems like you need to globally know what everyone knows about</li>
<li>see "Garbage Collection" section from paper</li>
</ul></li>
</ul>
<p>VTs order writes to same variable by different machines:</p>
<pre><code>M0: a1 x=1 r1 a2 y=9 r2
M1: a1 x=2 r1
M2: a1 a2 z = x + y r2 r1
M2 is going to hear "x=1" from M0, and "x=2" from M1.
TODO: what about y?
</code></pre>
<p>How does M2 know what to do?</p>
<p>Could the VTs for two values of the same variable not be ordered?</p>
<pre><code>M0: a1 x=1 r1
M1: a2 x=2 r2
M2: a1 a2 print x r2 r1
</code></pre>
<h3>Summary of programmer rules / system guarantees</h3>
<ol>
<li>Each shared variable protected by some lock</li>
<li>Lock before writing a shared variable to order writes to same var.,
otherwise "latest value" not well defined</li>
<li>Lock before reading a shared variable to get the latest version</li>
<li>If no lock for read, guaranteed to see values that
contributed to the variables you did lock</li>
</ol>
<p>Example of when LRC might work too hard.</p>
<pre><code>M0: a2 z=99 r2 a1 x=1 r1
M1: a1 y=x r1
</code></pre>
<p>TreadMarks will send <code>z</code> to M1, because it comes before <code>x=1</code> in VT order.</p>
<ul>
<li>Assuming x and z are on the same page.</li>
<li>Even if on different pages, M1 must invalidate z's page.</li>
<li>But M1 doesn't use z.</li>
<li>How could a system understand that z isn't needed?
<ul>
<li>Require locking of all data you read</li>
<li><code>=></code> Relax the causal part of the LRC model</li>
</ul></li>
</ul>
<p><strong>Q:</strong> Could TreadMarks work without using VM page protection?</p>
<ul>
<li>it uses VM to
<ul>
<li>detect writes to avoid making hidden copies (for diffs) if not needed</li>
<li>detect reads to pages => know whether to fetch a diff</li>
</ul></li>
<li>neither is really crucial</li>
<li>so TM doesn't depend on VM as much as IVY does
<ul>
<li>IVY used VM faults to decide what data has to be moved, and when</li>
<li>TM uses acquire()/release() and diffs for that purpose</li>
</ul></li>
</ul>
<h3>Performance?</h3>
<p>Figure 3 shows mostly good scaling</p>
<ul>
<li>is that the same as "good"?</li>
<li>though apparently Water does lots of locking / sharing</li>
</ul>
<p>How close are they to best possible performance?</p>
<ul>
<li>maybe Figure 5 implies there is only about 20% fat to be cut</li>
</ul>
<p>Does LRC beat previous DSM schemes?</p>
<ul>
<li>they only compare against their own straw-man eager realease consistency (ERC)
<ul>
<li>not against best known prior work</li>
</ul></li>
<li>Figure 9 suggests not much win, even for Water</li>
</ul>
<h3>Has DSM been successful?</h3>
<ul>
<li>clusters of cooperating machines are hugely successful</li>
<li>DSM not so much
<ul>
<li>main justification is transparency for existing threaded code</li>
<li>that's not interesting for new apps</li>
<li>and transparency makes it hard to get high performance</li>
</ul></li>
<li>MapReduce or message-passing or shared storage more common than DSM</li>
</ul>