Inspired by the original Utreexo proposal by Tadge Dryja.
This is a specification of a dynamic accumulator for unspent transaction outputs
that supports inserts and deletions, costs O(log(n))
in storage and proof bandwidth,
with efficient update of the accumulator and the membership proofs.
The accumulator consists of a number of perfect merkle binary trees, up to log(n)
,
that can be fully or partially pruned, leaving only their merkle roots.
Accumulator supports the following operations:
- Insert: add a new item to the accumulator and create a proof for it.
- Verify: check that an item exists using its merkle proof.
- Delete: check that an item exists using its merkle proof, and also mark it as deleted.
- Normalize: prune deleted items, and normalize the shape of the trees, reducing them to their merkle roots.
- Update proof: update the merkle proof created against the previous state of the accumulator (before most recent normalization).
Normalization is required to reduce the space occupied by the accumulator. Users must update their proofs upon every normalization.
All proofs for items inserted before normalization are O(1)
in length.
Increasing the time interval between normalizations allows trading off storage for lower bandwidth.
Normalization does the following:
- Extracts all nodes that were not marked as modified (that is, subtrees that do not contain items marked as deleted).
- Reorganizes the remaining nodes to form new perfect binary trees.
- A catch up structure is created mapping all preserved nodes to their new positions.
- The collection of trees is pruned, leaving only the merkle roots in the accumulator.
After normalization, every proof against the previous state of the accumulator becomes invalid and needs to be updated via the catch up structure.
Catch up structure is stored till the next normalization, in order to auto-update proofs that were created against the previous state of the forest. It is also used to update the stored proofs for the outputs watched by the node.
Entity that can be added and removed from the forest.
Forest is a structure with the following fields:
generation
: a generation of the forest.roots
: an ordered list of nodes each representing a k-tree, from highestk
to the lowest.insertions
: an ordered list of item hashes.
The list of roots unambiguously encodes the total number of items as a sum of 2^k
for each k
-tree present in the forest.
The forest with {3-tree, 2-tree, 0-tree}
roots and no insertions contains 13 items.
The merkle root of the entire forest. It can be computed using the n
ordered roots of k-trees as follows:
H(
roots[0],
H(
roots[1],
H(...
H(
roots[n-2],
roots[n-1]
)
)
)
)
where H
is the merkle hash function over the left and right nodes.
If the forest contains a single k-tree, its root is also the Utreexo root.
The Utreexo root is computed only when the forest is normalized.
Sum of all insertions and deletions to the forest, ignoring
deletions of items from the forest.insertions
list.
This count indicates how much storage the accumulator requires and helps determine when to normalize it. Transient items that were inserted and then deleted before normalization do not affect the count.
The sequence number of the forest incremented each time it is normalized. Represented as a 64-bit unsigned integer.
The power of two describing the size of the binary tree. K-tree has level k
.
A binary tree of level k
, containing 2^k
items. 0-tree contains a single item.
A Merkle root of the k-tree.
Leafs and nodes in the tree use the same instance of a Merlin transcript initialized as:
T = Transcript("ZkVM.utreexo")
The hash of an empty list is a 32-byte challenge string with the label merkle.empty
:
MerkleHash(T, {}) = T.challenge_bytes("merkle.empty")
The hash of a list with one entry (also known as a leaf hash) is
computed by committing the entry to the transcript (defined by the item type),
and then generating 32-byte challenge string the label merkle.leaf
:
MerkleHash(T, {item}) = {
T.append(<field1 name>, item.field1)
T.append(<field2 name>, item.field2)
...
T.challenge_bytes("merkle.leaf")
}
For n > 1
, let k
be the largest power of two smaller than n
(i.e., k < n ≤ 2k
). The merkle hash of an n
-element list is then defined recursively as:
MerkleHash(T, list) = {
T.append("L", MerkleHash(list[0..k]))
T.append("R", MerkleHash(list[k..n]))
T.challenge_bytes("merkle.node")
}
Item proof is a structure with three fields:
generation
: a 64-bit unsigned integer indicating the generation of the forest.position
: a 64-bit unsigned integer indicating absolute index in the set of all items in the forest,neighbors
: a list of neighboring hashes at each level up to (but not including) the k-tree root of the k-tree that contains this item.
The position of the neighbor root is determined by a correposnding bit in a binary little-endian representation of position
:
i
-th bit set to zero means the i
-th neighbor is to the right of the item.
Illustration:
c
| \
a b
|\ |\
0 1 2 3
The proof for the item 2 contains neighbors (3, a)
, with the positions indicated by the little-endian binary encoding of the index 2: 01
(right, left).
A structure with the following fields:
root
: a merkle hash.level
: an order of the tree.modified
: a boolean indicating whether the node is modified (the modified leaf node stands for deleted item).children
: either a pair of nodes, or nil, if the children are pruned or this node has level 0.
A structure with the following fields:
forest
: the non-pruned forest of the next generation to which the proofs are updated.map
linking a hash of a node to its absolute position in the new forest, across all binary trees.
Returns item proof.
- Compute item’s leaf hash.
- Append it to the
forest.insertions
list. - Return a proof with:
generation
: the current generation of the forest,position
set to the index in the insertions list, offset by the total number of items in the forest,neighbors
list: empty.
Inputs: forest, item, item proof.
Returns true
or false
.
- Verify that the item proof’s generation is equal to the generation of the forest.
- If the position is past the number of items under the forest roots, check the presence of the item in the insertions list.
If the item’s leaf hash is present in the insertions, return
true
; otherwise returnfalse
. - Find the root node
r
in the forest that contains the item’s position. - Verify that the number of neighbors in the proof is equal to the
r.level
. - Find the lowest-level available node
b
in ther
’s subtree that contains the item’s position (b
is equal tor
ifr
has pruned children or has level 0). Verify that the corresponding neighbors in the proof are equal to the actual higher-level neighbors (fromr.level-1
tob.level
) along the way toward theb
. - If
b.level
is 0 and it's marked as modified, returnfalse
. - Compute the intermediate hashes using the item and its lower-level neighbors from
0
tob.level-1
. - Verify that the resulting hash is equal to the hash of node
b
. - Return
true
if all checks succeeded.
Inputs: forest, item, item proof.
Returns true
or false
.
- Verify that the item proof’s generation is equal to the generation of the forest.
- If the position is past the number of items under the forest roots, check the presence of the item in the insertions list.
If the item’s leaf hash is present in the insertions, return
true
; otherwise returnfalse
. - Find the root node
r
in the forest that contains the item’s position. - Verify that the number of neighbors in the proof is equal to the
r.level
. - Find the lowest-level available node
b
in ther
’s subtree that contains the item’s position (b
is equal tor
ifr
has pruned children or has level 0). Verify that the corresponding neighbors in the proof are equal to the actual higher-level neighbors (fromr.level-1
tob.level
) along the way toward theb
. - If
b.level
is 0 and it's marked as modified, returnfalse
. - Compute the intermediate hashes using the item and its lower-level neighbors from
0
tob.level-1
. - Verify that the resulting hash is equal to the hash of node
b
. - If all checks succeeded:
- Add newly created nodes on the way from the item to
b
at step 6, including the neighbors and the node for the item itself to the treer
. - Mark all the nodes containing the item as
modified
, from the item’s node to the rootr
. - Return
true
.
- Add newly created nodes on the way from the item to
- If any check failed, return
false
.
Input: forest.
Returns utreexo root, new forest, catchup structure.
- Traverse forest trees, from left to right, collecting the unmodified nodes in a list.
- Note: the unmodified nodes must not have children.
- Append to the list all the insertions, in order, as level-0 nodes.
- Process the collected nodes, from left to right, keeping track of the latest node at each level
k
:- As long as there is an already remembered node
l
at the same levelk
as current noder
:- Forget the
l
, create a new parent node with levelk+1
with childrenl
andr
. - Set
r
to the new parent node and repeat, creating higher-level parents until there’s no longer a remembered node to join with.
- Forget the
- Remember the resulting node
r
at its level.
- As long as there is an already remembered node
- Create a new forest with new roots formed by all remembered nodes in the previous step, incremented generation, and an empty insertions list.
- Create a catchup structure with the new forest and the map created as follows:
- Traverse the new forest, picking only the nodes without children.
- For each such node, add a pair
(hash, offset)
where offset is the offset of all the items’ positions in this node’s subtree.
- Create a copy of the new forest with all the roots having their subtrees completely pruned: this will be the new forest to which updates.
- Compute the complete merkle root of the entire new forest by hashing pairs of roots from end to the beginning. This is equivalent to computing the merkle root over the entire set of items, if they were all stored.
- Return the root of the new forest, the pruned forest and the catchup structure.
Illustrations:
delete 0:
d e
| \ | \
a b c -> b c -> b c
|\ |\ |\ |\ |\ |\ |\
0 1 2 3 4 5 x 1 2 3 4 5 2 3 4 5 1
delete 2:
d e
| \ | \
a b c -> a c -> a c
|\ |\ |\ |\ |\ |\ |\
0 1 2 3 4 5 0 1 x 3 4 5 0 1 4 5 3
delete 0, insert 6, 7:
d f
| \ | \
a b c -> b c -> b c h
|\ |\ |\ |\ |\ |\ |\ |\
0 1 2 3 4 5 x 1 2 3 4 5 6 7 2 3 4 5 1 6 7
Input: catchup structure, item and item proof.
Returns new item proof or failure.
- If the generation of the proof is equal to the generation of the
catchup.forest
, return the proof unchanged. - If the generation of the proof is not equal to the preceding generation of the
catchup.forest
, fail. - Compute intermediate hashes using the proof’s neighbors until meeting a hash present in the catchup map.
- If reached the end of the proof, but no matching hash is found in the map, fail.
- Let
k
be the level of the remembered node (k
can be zero, if the item’s leaf node was remembered). - Erase all but first
k
bits of the item’s position in the proof. - Add to the updated position an offset stored in the catchup map with the matching hash.
- Erase all the neighbors in the proof past the
k
hashes. - Walk the new forest from the catch up node to the root, appending all the neighbors to the proof.
- Return the updated proof.
The Utreexo forest can be normalized at arbitrary intervals, limited only by amount of memory to store intermediate nodes in the trees before they are pruned by normalization.
Larger normalization interval permits saving bandwidth for recently added nodes, as their merkle proofs are effectively empty, but requires more storage.
The implementation represents the sparse binary tree in memory, allocating nodes on the fly. Internally, the allocations happen inside a small memory arena maintained in-between tree normalizations, so the individual insertions/deletions can be performed without any system calls.
(In development)
Caching the top 16 levels of nodes requires only ≈4Mb of data to store, but removes 512 bytes of data from each proof.
Bandwidth savings at different utxo set sizes and cache sizes.
Levels cached | Required RAM | 1M utxos | 10M | 100M | 1B |
---|---|---|---|---|---|
16 | 4 Mb | 80% | 66% | 59% | 53% |
19 | 32 Mb | 95% | 79% | 70% | 63% |
22 | 256 Mb | 100% | 91% | 81% | 73% |
25 | 2048 Mb | 100% | 100% | 92% | 83% |
Proof sizes per input in bytes:
Levels cached | Required RAM | 1M utxos | 10M | 100M | 1B |
---|---|---|---|---|---|
16 | 4 Mb | 128 | 256 | 352 | 448 |
19 | 32 Mb | 32 | 160 | 256 | 352 |
22 | 256 Mb | 0 | 64 | 160 | 256 |
25 | 2048 Mb | 0 | 0 | 64 | 160 |
Notice that as network grows in size, user can have a trade-off between optimizing bandwidth or optimizing storage. In any case, bandwidth costs grow only linearly with the exponential growth of the network.
Cache policy can be decided per node and advertised to its peers upon connection.
For example, let's have nodes A, B, C where A sends a proof to B, who relays it to C.
A and B first establish a connection and B tells A what is B's cache configuration. Same happens between B and C.
A then sends an appropriately trimmed proof to B. B reconstructs a full proof and verifies it. If the proof is valid, it is trimmed according to C's cache configuration, and sent to C.