Deletion Vectors
jaceklaskowski committed Jun 11, 2024
1 parent 71e2b38 commit ad32f5f
Showing 9 changed files with 256 additions and 9 deletions.
5 changes: 4 additions & 1 deletion docs/commands/reorg/index.md
@@ -4,7 +4,10 @@

```sql
REORG TABLE (delta.`/path/to/table` | delta_table_name)
(
  [WHERE partition_predicate] APPLY (PURGE) |
  APPLY (UPGRADE UNIFORM (ICEBERG_COMPAT_VERSION = version))
)
```

The `REORG TABLE` command requires the file path or the name of a delta table.
9 changes: 4 additions & 5 deletions docs/deletion-vectors/DeletionVectorBitmapGenerator.md
@@ -18,12 +18,11 @@ buildRowIndexSetsForFilesMatchingCondition(

Column Name | Column
------------|-------
[filePath](#FILE_NAME_COL) | The given `fileNameColumnOpt` if specified or `_metadata.file_path`
[rowIndexCol](#ROW_INDEX_COL) | The given `rowIndexColumnOpt` if specified or one of the following based on [spark.databricks.delta.deletionVectors.useMetadataRowIndex](../configuration-properties/index.md#deletionVectors.useMetadataRowIndex):<ul><li>`_metadata.row_index` (enabled)</li><li>[__delta_internal_row_index](../DeltaParquetFileFormat.md#ROW_INDEX_COLUMN_NAME)</li></ul>
[deletionVectorId](#FILE_DV_ID_COL) | <ul><li>When the table uses deletion vectors (based on the given `tableHasDVs` flag), `buildRowIndexSetsForFilesMatchingCondition`...FIXME...the [DeletionVectorDescriptor](../AddFile.md#deletionVector)s of the given `candidateFiles`</li><li>Otherwise, `null` (undefined)</li></ul>

When the table already uses deletion vectors (based on the given `tableHasDVs` flag), `buildRowIndexSetsForFilesMatchingCondition`...FIXME...to add a [deletionVectorId](#FILE_DV_ID_COL) column with the value of the [deletionVector](../AddFile.md#deletionVector) of the given `candidateFiles`. Otherwise, [deletionVectorId](#FILE_DV_ID_COL) is `null`.

In the end, `buildRowIndexSetsForFilesMatchingCondition` [builds the deletion vectors](#buildDeletionVectors) (for the modified `targetDf` DataFrame).
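The grouping step can be pictured with a toy model (not Delta's API — file names and indices below are made up for illustration, and plain Scala collections stand in for the DataFrame):

```scala
// Toy model of grouping matched rows into per-file row-index sets,
// the shape of result buildRowIndexSetsForFilesMatchingCondition
// produces with a DataFrame (illustrative data only)
val matchedRows: Seq[(String, Long)] = Seq( // (filePath, rowIndex)
  ("part-00000.parquet", 0L),
  ("part-00000.parquet", 2L),
  ("part-00001.parquet", 5L))

val rowIndexSets: Map[String, Set[Long]] =
  matchedRows
    .groupBy { case (filePath, _) => filePath }
    .map { case (filePath, rows) => filePath -> rows.map(_._2).toSet }

println(rowIndexSets("part-00000.parquet")) // Set(0, 2)
```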

---

32 changes: 32 additions & 0 deletions docs/deletion-vectors/DeletionVectorDescriptor.md
Expand Up @@ -125,3 +125,35 @@ assembleDeletionVectorPath(

* `DeletionVectorDescriptor` is requested to [absolutePath](DeletionVectorDescriptor.md#absolutePath) (for the [uuid marker](#UUID_DV_MARKER))
* `DeletionVectorStoreUtils` is requested to [assembleDeletionVectorPathWithFileSystem](DeletionVectorStoreUtils.md#assembleDeletionVectorPathWithFileSystem)

## isOnDisk { #isOnDisk }

```scala
isOnDisk: Boolean
```

`isOnDisk` is the negation (_opposite_) of the [isInline](#isInline) flag.

---

`isOnDisk` is used when:

* `VacuumCommandImpl` is requested for the [path of an on-disk deletion vector](../commands/vacuum/VacuumCommandImpl.md#getDeletionVectorRelativePath)
* `DeletionVectorStoredBitmap` is requested to [isOnDisk](DeletionVectorStoredBitmap.md#isOnDisk)
* `StoredBitmap` utility is requested to [create a StoredBitmap](StoredBitmap.md#create)

## isInline { #isInline }

```scala
isInline: Boolean
```

`isInline` holds true when the [storageType](#storageType) is the [i marker](#INLINE_DV_MARKER).

---

`isInline` is used when:

* `DeletionVectorDescriptor` is requested to [inlineData](#inlineData), [isOnDisk](#isOnDisk)
* `DeletionVectorStoredBitmap` is requested to [isInline](DeletionVectorStoredBitmap.md#isInline)
* `StoredBitmap` is requested to [inline](StoredBitmap.md#inline)
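The two flags are mutual negations, which can be sketched minimally (the `u` marker value and the `DescriptorSketch` type are assumptions for illustration; only the `i` inline marker is stated above):

```scala
// Minimal sketch of the isInline/isOnDisk relationship
// (marker values beyond "i" are assumed, not Delta's definitions)
object StorageMarkers {
  val Inline = "i" // INLINE_DV_MARKER
  val Uuid = "u"   // UUID_DV_MARKER (assumed value)
}

final case class DescriptorSketch(storageType: String) {
  def isInline: Boolean = storageType == StorageMarkers.Inline
  def isOnDisk: Boolean = !isInline // negation of isInline
}

println(DescriptorSketch("i").isInline) // true
println(DescriptorSketch("u").isOnDisk) // true
```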
46 changes: 45 additions & 1 deletion docs/deletion-vectors/DeletionVectorStore.md
@@ -1,3 +1,47 @@
# DeletionVectorStore

`DeletionVectorStore` is an [abstraction](#contract) of [stores](#implementations) of [deletion vectors](index.md) to be [loaded as RoaringBitmapArrays](#read).

`DeletionVectorStore` is created using the [createInstance](#createInstance) utility.

## Contract (Subset)

### Reading Deletion Vector { #read }

```scala
read(
path: Path,
offset: Int,
size: Int): RoaringBitmapArray
```

Reads a deletion vector as a `RoaringBitmapArray`.

See:

* [HadoopFileSystemDVStore](HadoopFileSystemDVStore.md#read)

Used when:

* `DeletionVectorStoredBitmap` is requested to [load deletion vectors](DeletionVectorStoredBitmap.md#load)
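The shape of the contract can be sketched with a toy in-memory store (everything here is illustrative: `RoaringBitmapArray` is stubbed as `Array[Long]`, files as a map, and one byte stands for one serialized row index):

```scala
// Toy stand-in for the DeletionVectorStore read contract
type RoaringBitmapArraySketch = Array[Long]

trait DVStoreSketch {
  def read(path: String, offset: Int, size: Int): RoaringBitmapArraySketch
}

// a real store would deserialize `size` bytes at `offset` from a file
// system into a bitmap; this one slices an in-memory byte array
final class InMemoryDVStore(files: Map[String, Array[Byte]]) extends DVStoreSketch {
  def read(path: String, offset: Int, size: Int): RoaringBitmapArraySketch =
    files(path).slice(offset, offset + size).map(_.toLong)
}

val store = new InMemoryDVStore(Map("dv.bin" -> Array[Byte](1, 3, 7, 9)))
println(store.read("dv.bin", 1, 2).mkString(",")) // 3,7
```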

## Implementations

* [HadoopFileSystemDVStore](HadoopFileSystemDVStore.md)

## Creating DeletionVectorStore { #createInstance }

```scala
createInstance(
hadoopConf: Configuration): DeletionVectorStore
```

`createInstance` creates a [HadoopFileSystemDVStore](HadoopFileSystemDVStore.md).

---

`createInstance` is used when:

* `DeletionVectorWriter` is requested to [create a deletion vector partition mapper function](DeletionVectorWriter.md#createDeletionVectorMapper)
* `CDCReaderImpl` is requested to [processDeletionVectorActions](../change-data-feed/CDCReaderImpl.md#processDeletionVectorActions)
* `RowIndexMarkingFiltersBuilder` is requested to [create a RowIndexFilter](RowIndexMarkingFiltersBuilder.md#createInstance) (for non-empty deletion vectors)
24 changes: 24 additions & 0 deletions docs/deletion-vectors/DeletionVectorStoredBitmap.md
@@ -0,0 +1,24 @@
# DeletionVectorStoredBitmap

`DeletionVectorStoredBitmap` is a [StoredBitmap](StoredBitmap.md).

## Creating Instance

`DeletionVectorStoredBitmap` takes the following to be created:

* <span id="dvDescriptor"> [DeletionVectorDescriptor](DeletionVectorDescriptor.md)
* [Table Data Path](#tableDataPath)

`DeletionVectorStoredBitmap` is created when:

* `StoredBitmap` is requested to [create a StoredBitmap](StoredBitmap.md#create), [EMPTY](StoredBitmap.md#EMPTY), [inline](StoredBitmap.md#inline)

### Table Data Path { #tableDataPath }

```scala
tableDataPath: Option[Path]
```

`DeletionVectorStoredBitmap` can be given the path to the data directory of a delta table. The path is undefined (`None`) by default.

The path is specified only when `StoredBitmap` utility is requested to [create a StoredBitmap](StoredBitmap.md#create) for [on-disk deletion vectors](DeletionVectorDescriptor.md#isOnDisk).
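That conditional wiring can be sketched as follows (the `StoredBitmapSketch` type and `sketchFor` helper are hypothetical names for illustration, with `String` standing in for Hadoop's `Path`):

```scala
// Sketch: the table data path is None by default and only supplied
// for on-disk deletion vectors (illustrative names and types)
final case class StoredBitmapSketch(
    dvId: String,
    tableDataPath: Option[String] = None) // undefined (None) by default

def sketchFor(dvId: String, isOnDisk: Boolean, tablePath: String): StoredBitmapSketch =
  if (isOnDisk) StoredBitmapSketch(dvId, Some(tablePath))
  else StoredBitmapSketch(dvId) // inline: no path needed

println(sketchFor("dv1", isOnDisk = true, "/data/tbl").tableDataPath)  // Some(/data/tbl)
println(sketchFor("dv2", isOnDisk = false, "/data/tbl").tableDataPath) // None
```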
2 changes: 1 addition & 1 deletion docs/deletion-vectors/DeletionVectorWriter.md
@@ -18,7 +18,7 @@ createMapperToStoreDeletionVectors(

* `DeletionVectorSet` is requested to [build deletion vectors](DeletionVectorSet.md#computeResult) (and [bitmapStorageMapper](DeletionVectorSet.md#bitmapStorageMapper))

### Creating Deletion Vector Partition Mapper Function { #createDeletionVectorMapper }

```scala
createDeletionVectorMapper[InputT <: Sizing, OutputT](
28 changes: 28 additions & 0 deletions docs/deletion-vectors/HadoopFileSystemDVStore.md
@@ -0,0 +1,28 @@
# HadoopFileSystemDVStore

`HadoopFileSystemDVStore` is a [DeletionVectorStore](DeletionVectorStore.md).

## Creating Instance

`HadoopFileSystemDVStore` takes the following to be created:

* <span id="hadoopConf"> `Configuration` ([Apache Hadoop]({{ hadoop.api }}/org/apache/hadoop/conf/Configuration.html))

`HadoopFileSystemDVStore` is created when:

* `DeletionVectorStore` is requested to [create a DeletionVectorStore](DeletionVectorStore.md#createInstance)

## Reading Deletion Vector { #read }

??? note "DeletionVectorStore"

    ```scala
    read(
      path: Path,
      offset: Int,
      size: Int): RoaringBitmapArray
    ```

    `read` is part of the [DeletionVectorStore](DeletionVectorStore.md#read) abstraction.

`read`...FIXME
18 changes: 17 additions & 1 deletion docs/deletion-vectors/StoredBitmap.md
@@ -1,3 +1,19 @@
# StoredBitmap

## Creating StoredBitmap { #create }

```scala
create(
dvDescriptor: DeletionVectorDescriptor,
tablePath: Path): StoredBitmap
```

`create` creates a new [DeletionVectorStoredBitmap](DeletionVectorStoredBitmap.md) (possibly with the given `tablePath` for an [on-disk deletion vector](DeletionVectorDescriptor.md#isOnDisk)).

---

`create` is used when:

* `DeletionVectorWriter` is requested to [storeBitmapAndGenerateResult](DeletionVectorWriter.md#storeBitmapAndGenerateResult)
* `RowIndexMarkingFiltersBuilder` is requested to [create a RowIndexFilter](RowIndexMarkingFiltersBuilder.md#createInstance)
* `DeletionVectorStore` is requested to [read a deletion vector](DeletionVectorStore.md#read)
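A hedged sketch of how the three factories mentioned here (`create`, `EMPTY`, `inline`) might dispatch — all names and types below are illustrative, and the real factories are separate methods rather than branches of one function:

```scala
// Illustrative dispatch over the three kinds of stored bitmaps:
// the table path is only threaded through for on-disk descriptors
sealed trait BitmapSourceSketch
case object EmptyBitmap extends BitmapSourceSketch
final case class InlineBitmap(bytes: Vector[Byte]) extends BitmapSourceSketch
final case class OnDiskBitmap(tablePath: String) extends BitmapSourceSketch

def createSketch(isOnDisk: Boolean, inlineData: Vector[Byte], tablePath: String): BitmapSourceSketch =
  if (isOnDisk) OnDiskBitmap(tablePath) // tablePath only used here
  else if (inlineData.nonEmpty) InlineBitmap(inlineData)
  else EmptyBitmap

println(createSketch(isOnDisk = true, Vector.empty, "/data/tbl")) // OnDiskBitmap(/data/tbl)
```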
101 changes: 101 additions & 0 deletions docs/deletion-vectors/index.md
@@ -23,6 +23,11 @@ Deletion Vectors is used on a delta table when all of the following hold:
1. [delta.enableDeletionVectors](../table-properties/DeltaConfigs.md#enableDeletionVectors) table property is enabled
1. [DeletionVectorsTableFeature](DeletionVectorsTableFeature.md) is [supported](../table-features/TableFeatureSupport.md#isFeatureSupported) by the [Protocol](../Protocol.md)

There are two types of deletion vectors:

* inline
* on-disk (persistent)

(Persistent) Deletion Vectors are only supported on [parquet-based delta tables](../Protocol.md#assertTablePropertyConstraintsSatisfied).
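Conceptually, a deletion vector is a set of row indices of a data file marked as deleted, so readers skip those rows without rewriting the parquet file. A toy model (not Delta's API; types and data are made up for illustration):

```scala
// Conceptual model of a deletion vector attached to a data file
final case class DataFileSketch(rows: Vector[String], deletionVector: Set[Long])

// readers keep only the rows whose index is not in the deletion vector
def visibleRows(file: DataFileSketch): Vector[String] =
  file.rows.zipWithIndex.collect {
    case (row, idx) if !file.deletionVector.contains(idx.toLong) => row
  }

val file = DataFileSketch(Vector("r0", "r1", "r2"), deletionVector = Set(1L))
println(visibleRows(file)) // Vector(r0, r2)
```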

## REORG TABLE Command
@@ -62,6 +67,18 @@ Create a delta table with [delta.enableDeletionVectors](../table-properties/Delt
)
```

=== "Scala"

    ```scala
    sql("""
      CREATE OR REPLACE TABLE tbl(id int)
      USING delta
      TBLPROPERTIES (
        'delta.enableDeletionVectors' = 'true'
      )
    """)
    ```

Describe the detail of the delta table using [DESCRIBE DETAIL](../commands/describe-detail/index.md) command.

=== "Scala"
@@ -80,6 +97,90 @@ Describe the detail of the delta table using [DESCRIBE DETAIL](../commands/describe-detail/index.md) command.
+-------------------------+-------------------------------------+----------------+----------------+-----------------+
```

```scala
sql("INSERT INTO tbl VALUES 1, 2, 3")
```

Deletion Vectors is supported by [DELETE](../commands/delete/index.md) command.

=== "Scala"

    ```text
    scala> sql("DELETE FROM tbl WHERE id=1").show()
    +-----------------+
    |num_affected_rows|
    +-----------------+
    |                1|
    +-----------------+
    ```

```scala
val location = sql("DESC DETAIL tbl").select("location").as[String].head()
```

```console hl_lines="4"
$ ls -l spark-warehouse/tbl
total 32
drwxr-xr-x@ 9 jacek staff 288 Jun 11 20:41 _delta_log
-rw-r--r--@ 1 jacek staff 43 Jun 11 20:41 deletion_vector_5366f7d2-59db-4b86-b160-af5b8f5944d6.bin
-rw-r--r--@ 1 jacek staff 449 Jun 11 20:39 part-00000-be36d6d7-fd71-4b4a-a6b3-fbb41d568abc-c000.snappy.parquet
-rw-r--r--@ 1 jacek staff 449 Jun 11 20:39 part-00001-728e8290-6af7-465d-9372-df7d0f981b62-c000.snappy.parquet
-rw-r--r--@ 1 jacek staff 449 Jun 11 20:39 part-00002-85c45a7f-8903-4a7c-bdd1-2f4998fcc8b4-c000.snappy.parquet
```

Physically delete dropped rows using [VACUUM](../commands/vacuum/index.md) command.

```text
scala> sql("DESC HISTORY tbl").select("version", "operation", "operationParameters").show(truncate=false)
+-------+------------+----------------------------------------------------------------------------------------------------------------------------------+
|version|operation |operationParameters |
+-------+------------+----------------------------------------------------------------------------------------------------------------------------------+
|2 |DELETE |{predicate -> ["(a#2003 = 1)"]} |
|1 |WRITE |{mode -> Append, partitionBy -> []} |
|0 |CREATE TABLE|{partitionBy -> [], clusterBy -> [], description -> NULL, isManaged -> true, properties -> {"delta.enableDeletionVectors":"true"}}|
+-------+------------+----------------------------------------------------------------------------------------------------------------------------------+
```

```text
scala> sql("SET spark.databricks.delta.retentionDurationCheck.enabled = false").show(truncate=false)
+-----------------------------------------------------+-----+
|key |value|
+-----------------------------------------------------+-----+
|spark.databricks.delta.retentionDurationCheck.enabled|false|
+-----------------------------------------------------+-----+
```

```text
scala> sql("VACUUM tbl RETAIN 0 HOURS").show(truncate=false)
Deleted 2 files and directories in a total of 1 directories.
+---------------------------------------------------+
|path |
+---------------------------------------------------+
|file:/Users/jacek/dev/oss/spark/spark-warehouse/tbl|
+---------------------------------------------------+
```

```text
scala> sql("DESC HISTORY tbl").select("version", "operation", "operationParameters").show(truncate=false)
+-------+------------+----------------------------------------------------------------------------------------------------------------------------------+
|version|operation |operationParameters |
+-------+------------+----------------------------------------------------------------------------------------------------------------------------------+
|4 |VACUUM END |{status -> COMPLETED} |
|3 |VACUUM START|{retentionCheckEnabled -> false, defaultRetentionMillis -> 604800000, specifiedRetentionMillis -> 0} |
|2 |DELETE |{predicate -> ["(a#2003 = 1)"]} |
|1 |WRITE |{mode -> Append, partitionBy -> []} |
|0 |CREATE TABLE|{partitionBy -> [], clusterBy -> [], description -> NULL, isManaged -> true, properties -> {"delta.enableDeletionVectors":"true"}}|
+-------+------------+----------------------------------------------------------------------------------------------------------------------------------+
```

```console
$ ls -l spark-warehouse/tbl
total 16
drwxr-xr-x@ 13 jacek staff 416 Jun 11 21:08 _delta_log
-rw-r--r--@ 1 jacek staff 449 Jun 11 20:39 part-00001-728e8290-6af7-465d-9372-df7d0f981b62-c000.snappy.parquet
-rw-r--r--@ 1 jacek staff 449 Jun 11 20:39 part-00002-85c45a7f-8903-4a7c-bdd1-2f4998fcc8b4-c000.snappy.parquet
```

## Learn More

1. [Delta Lake Deletion Vectors]({{ delta.blog }}/2023-07-05-deletion-vectors/)
