diff --git a/docs/commands/reorg/index.md b/docs/commands/reorg/index.md index 721fd056f..1713db043 100644 --- a/docs/commands/reorg/index.md +++ b/docs/commands/reorg/index.md @@ -4,7 +4,10 @@ ```sql REORG TABLE (delta.`/path/to/table` | delta_table_name) -[WHERE partition_predicate] APPLY (PURGE) +( + [WHERE partition_predicate] APPLY (PURGE) | + APPLY (UPGRADE UNIFORM (ICEBERG_COMPAT_VERSION = version)) +) ``` `REORG TABLE` command requires the file path or the name of a delta table. diff --git a/docs/deletion-vectors/DeletionVectorBitmapGenerator.md b/docs/deletion-vectors/DeletionVectorBitmapGenerator.md index 2aa095840..6ca04ebfb 100644 --- a/docs/deletion-vectors/DeletionVectorBitmapGenerator.md +++ b/docs/deletion-vectors/DeletionVectorBitmapGenerator.md @@ -18,12 +18,11 @@ buildRowIndexSetsForFilesMatchingCondition( Column Name | Column ------------|------- - `filePath` | The given `fileNameColumnOpt` if specified or `_metadata.file_path` - `rowIndexCol` | The given `rowIndexColumnOpt` if specified or one of the following based on [spark.databricks.delta.deletionVectors.useMetadataRowIndex](../configuration-properties/index.md#deletionVectors.useMetadataRowIndex): + [filePath](#FILE_NAME_COL) | The given `fileNameColumnOpt` if specified or `_metadata.file_path` + [rowIndexCol](#ROW_INDEX_COL) | The given `rowIndexColumnOpt` if specified or one of the following based on [spark.databricks.delta.deletionVectors.useMetadataRowIndex](../configuration-properties/index.md#deletionVectors.useMetadataRowIndex): + [deletionVectorId](#FILE_DV_ID_COL) | -With the table uses deletion vectors already (based on the given `tableHasDVs`), `buildRowIndexSetsForFilesMatchingCondition`...FIXME...to add [deletionVectorId](#FILE_DV_ID_COL) column with the value of [deletionVector](../AddFile.md#deletionVector) of the given `candidateFiles`. Otherwise, the [deletionVectorId](#FILE_DV_ID_COL) is `null`. - -In the end, `buildRowIndexSetsForFilesMatchingCondition` [buildDeletionVectors](#buildDeletionVectors) (with the DataFrame). +In the end, `buildRowIndexSetsForFilesMatchingCondition` [builds the deletion vectors](#buildDeletionVectors) (for the modified `targetDf` DataFrame). --- diff --git a/docs/deletion-vectors/DeletionVectorDescriptor.md b/docs/deletion-vectors/DeletionVectorDescriptor.md index f1d2e25e7..f2869a3d1 100644 --- a/docs/deletion-vectors/DeletionVectorDescriptor.md +++ b/docs/deletion-vectors/DeletionVectorDescriptor.md @@ -125,3 +125,35 @@ assembleDeletionVectorPath( * `DeletionVectorDescriptor` is requested to [absolutePath](DeletionVectorDescriptor.md#absolutePath) (for the [uuid marker](#UUID_DV_MARKER)) * `DeletionVectorStoreUtils` is requested to [assembleDeletionVectorPathWithFileSystem](DeletionVectorStoreUtils.md#assembleDeletionVectorPathWithFileSystem) + +## isOnDisk { #isOnDisk } + +```scala +isOnDisk: Boolean +``` + +`isOnDisk` is the negation (_opposite_) of the [isInline](#isInline) flag. + +--- + +`isOnDisk` is used when: + +* `VacuumCommandImpl` is requested for the [path of an on-disk deletion vector](../commands/vacuum/VacuumCommandImpl.md#getDeletionVectorRelativePath) +* `DeletionVectorStoredBitmap` is requested to [isOnDisk](DeletionVectorStoredBitmap.md#isOnDisk) +* `StoredBitmap` utility is requested to [create a StoredBitmap](StoredBitmap.md#create) + +## isInline { #isInline } + +```scala +isInline: Boolean +``` + +`isInline` holds true for the [storageType](#storageType) being [i](#INLINE_DV_MARKER). 
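+
+The following is a minimal, hypothetical sketch (not the actual Delta Lake sources) of how the two flags relate to the [storageType](#storageType) marker (the `u` on-disk marker value is an assumption here):
+
+```scala
+// Hypothetical sketch only ("i" is the inline marker;
+// "u" is assumed to be the UUID/on-disk marker)
+def isInline(storageType: String): Boolean = storageType == "i"
+def isOnDisk(storageType: String): Boolean = !isInline(storageType)
+
+assert(isInline("i"))  // an inline deletion vector
+assert(isOnDisk("u"))  // an on-disk deletion vector
+```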
+ +--- + +`isInline` is used when: + +* `DeletionVectorDescriptor` is requested to [inlineData](#inlineData), [isOnDisk](#isOnDisk) +* `DeletionVectorStoredBitmap` is requested to [isInline](DeletionVectorStoredBitmap.md#isInline) +* `StoredBitmap` is requested to [inline](StoredBitmap.md#inline) diff --git a/docs/deletion-vectors/DeletionVectorStore.md b/docs/deletion-vectors/DeletionVectorStore.md index 59dd5314c..d9712f96f 100644 --- a/docs/deletion-vectors/DeletionVectorStore.md +++ b/docs/deletion-vectors/DeletionVectorStore.md @@ -1,3 +1,47 @@ # DeletionVectorStore -`DeletionVectorStore`...FIXME +`DeletionVectorStore` is an [abstraction](#contract) of [stores](#implementations) of [deletion vectors](index.md) to be [loaded as RoaringBitmapArrays](#read). + +`DeletionVectorStore` is created using [createInstance](#createInstance) utility. + +## Contract (Subset) + +### Reading Deletion Vector { #read } + +```scala +read( + path: Path, + offset: Int, + size: Int): RoaringBitmapArray +``` + +Reads the Deletion Vector as `RoaringBitmapArray` + +See: + +* [HadoopFileSystemDVStore](HadoopFileSystemDVStore.md#read) + +Used when: + +* `DeletionVectorStoredBitmap` is requested to [load deletion vectors](DeletionVectorStoredBitmap.md#load) + +## Implementations + +* [HadoopFileSystemDVStore](HadoopFileSystemDVStore.md) + +## Creating DeletionVectorStore { #createInstance } + +```scala +createInstance( + hadoopConf: Configuration): DeletionVectorStore +``` + +`createInstance` creates a [HadoopFileSystemDVStore](HadoopFileSystemDVStore.md). + +--- + +`createInstance` is used when: + +* `DeletionVectorWriter` is requested to [create a deletion vector partition mapper function](DeletionVectorWriter.md#createDeletionVectorMapper) +* `CDCReaderImpl` is requested to [processDeletionVectorActions](../change-data-feed/CDCReaderImpl.md#processDeletionVectorActions) +* `RowIndexMarkingFiltersBuilder` is requested to [create a RowIndexFilter](RowIndexMarkingFiltersBuilder.md#createInstance) (for non-empty deletion vectors) diff --git a/docs/deletion-vectors/DeletionVectorStoredBitmap.md b/docs/deletion-vectors/DeletionVectorStoredBitmap.md new file mode 100644 index 000000000..354cd3c84 --- /dev/null +++ b/docs/deletion-vectors/DeletionVectorStoredBitmap.md @@ -0,0 +1,24 @@ +# DeletionVectorStoredBitmap + +`DeletionVectorStoredBitmap` is a [StoredBitmap](StoredBitmap.md). + +## Creating Instance + +`DeletionVectorStoredBitmap` takes the following to be created: + +* [DeletionVectorDescriptor](DeletionVectorDescriptor.md) +* [Table Data Path](#tableDataPath) + +`DeletionVectorStoredBitmap` is created when: + +* `StoredBitmap` is requested to [create a StoredBitmap](StoredBitmap.md#create), [EMPTY](StoredBitmap.md#EMPTY), [inline](StoredBitmap.md#inline) + +### Table Data Path { #tableDataPath } + +```scala +tableDataPath: Option[Path] +``` + +`DeletionVectorStoredBitmap` can be given the path to the data directory of a delta table. The path is undefined (`None`) by default. + +The path is specified only when `StoredBitmap` utility is requested to [create a StoredBitmap](StoredBitmap.md#create) for [on-disk deletion vectors](DeletionVectorDescriptor.md#isOnDisk). 
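+
+The following is a minimal, hypothetical sketch (not the actual `DeletionVectorStoredBitmap` sources) of why the path can be optional; the exact `absolutePath(tablePath)` signature is an assumption:
+
+```scala
+import org.apache.hadoop.fs.Path
+import org.apache.spark.sql.delta.actions.DeletionVectorDescriptor
+
+// Hypothetical helper: resolve the file to load an on-disk deletion vector from.
+// Inline deletion vectors carry their bitmap in the descriptor and need no file.
+def bitmapFileToLoad(
+    dv: DeletionVectorDescriptor,
+    tableDataPath: Option[Path]): Option[Path] =
+  if (dv.isOnDisk) {
+    val tablePath = tableDataPath.getOrElse(
+      throw new IllegalArgumentException(
+        "Table path is required for on-disk deletion vectors"))
+    Some(dv.absolutePath(tablePath))  // assumed signature
+  } else {
+    None
+  }
+```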
diff --git a/docs/deletion-vectors/DeletionVectorWriter.md b/docs/deletion-vectors/DeletionVectorWriter.md index 376a0b485..06c2b89b8 100644 --- a/docs/deletion-vectors/DeletionVectorWriter.md +++ b/docs/deletion-vectors/DeletionVectorWriter.md @@ -18,7 +18,7 @@ createMapperToStoreDeletionVectors( * `DeletionVectorSet` is requested to [build deletion vectors](DeletionVectorSet.md#computeResult) (and [bitmapStorageMapper](DeletionVectorSet.md#bitmapStorageMapper)) -### createDeletionVectorMapper { #createDeletionVectorMapper } +### Creating Deletion Vector Partition Mapper Function { #createDeletionVectorMapper } ```scala createDeletionVectorMapper[InputT <: Sizing, OutputT]( diff --git a/docs/deletion-vectors/HadoopFileSystemDVStore.md b/docs/deletion-vectors/HadoopFileSystemDVStore.md new file mode 100644 index 000000000..d2bef72e6 --- /dev/null +++ b/docs/deletion-vectors/HadoopFileSystemDVStore.md @@ -0,0 +1,28 @@ +# HadoopFileSystemDVStore + +`HadoopFileSystemDVStore` is a [DeletionVectorStore](DeletionVectorStore.md). + +## Creating Instance + +`HadoopFileSystemDVStore` takes the following to be created: + +* `Configuration` ([Apache Hadoop]({{ hadoop.api }}/org/apache/hadoop/conf/Configuration.html)) + +`HadoopFileSystemDVStore` is created when: + +* `DeletionVectorStore` is requested to [create a DeletionVectorStore](DeletionVectorStore.md#createInstance) + +## Reading Deletion Vector { #read } + +??? note "DeletionVectorStore" + + ```scala + read( + path: Path, + offset: Int, + size: Int): RoaringBitmapArray + ``` + + `read` is part of the [DeletionVectorStore](DeletionVectorStore.md#read) abstraction. + +`read`...FIXME diff --git a/docs/deletion-vectors/StoredBitmap.md b/docs/deletion-vectors/StoredBitmap.md index 8f0df4c3b..a8409d914 100644 --- a/docs/deletion-vectors/StoredBitmap.md +++ b/docs/deletion-vectors/StoredBitmap.md @@ -1,3 +1,19 @@ # StoredBitmap -`StoredBitmap` is...FIXME +## Create StoredBitmap { #create } + +```scala +create( + dvDescriptor: DeletionVectorDescriptor, + tablePath: Path): StoredBitmap +``` + +`create` creates a new [DeletionVectorStoredBitmap](DeletionVectorStoredBitmap.md) (possibly with the given `tablePath` for an [on-disk deletion vector](DeletionVectorDescriptor.md#isOnDisk)). + +--- + +`create` is used when: + +* `DeletionVectorWriter` is requested to [storeBitmapAndGenerateResult](DeletionVectorWriter.md#storeBitmapAndGenerateResult) +* `RowIndexMarkingFiltersBuilder` is requested to [create a RowIndexFilter](RowIndexMarkingFiltersBuilder.md#createInstance) +* `DeletionVectorStore` is requested to [read a deletion vector](DeletionVectorStore.md#read) diff --git a/docs/deletion-vectors/index.md b/docs/deletion-vectors/index.md index 25951d8b7..34d43fc45 100644 --- a/docs/deletion-vectors/index.md +++ b/docs/deletion-vectors/index.md @@ -23,6 +23,11 @@ Deletion Vectors is used on a delta table when all of the following hold: 1. [delta.enableDeletionVectors](../table-properties/DeltaConfigs.md#enableDeletionVectors) table property is enabled 1. [DeletionVectorsTableFeature](DeletionVectorsTableFeature.md) is [supported](../table-features/TableFeatureSupport.md#isFeatureSupported) by the [Protocol](../Protocol.md) +There are two types of deletion vectors: + +* inline +* on-disk (persistent) + (Persistent) Deletion Vectors are only supported on [parquet-based delta tables](../Protocol.md#assertTablePropertyConstraintsSatisfied). 
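+
+As a quick, illustrative check of the first condition above, the `properties` of [DESCRIBE DETAIL](../commands/describe-detail/index.md) show whether [delta.enableDeletionVectors](../table-properties/DeltaConfigs.md#enableDeletionVectors) is set (`tbl` is the demo table created below):
+
+```scala
+// Illustrative only: inspect the table properties of the demo 'tbl' table
+sql("DESC DETAIL tbl")
+  .select("properties")
+  .show(truncate = false)
+```
+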
## REORG TABLE Command @@ -62,6 +67,18 @@ Create a delta table with [delta.enableDeletionVectors](../table-properties/Delt ) ``` +=== "Scala" + + ```scala + sql(""" + CREATE OR REPLACE TABLE tbl(id int) + USING delta + TBLPROPERTIES ( + 'delta.enableDeletionVectors' = 'true' + ) + """) + ``` + Describe the detail of the delta table using [DESCRIBE DETAIL](../commands/describe-detail/index.md) command. === "Scala" @@ -80,6 +97,90 @@ Describe the detail of the delta table using [DESCRIBE DETAIL](../commands/descr +-------------------------+-------------------------------------+----------------+----------------+-----------------+ ``` +```scala +sql("INSERT INTO tbl VALUES 1, 2, 3") +``` + +Deletion Vectors is supported by [DELETE](../commands/delete/index.md) command. + +=== "Scala" + + ```text + scala> sql("DELETE FROM tbl WHERE id=1").show() + +-----------------+ + |num_affected_rows| + +-----------------+ + | 1| + +-----------------+ + ``` + +```scala +val location = sql("DESC DETAIL tbl").select("location").as[String].head() +``` + +```console hl_lines="4" +$ ls -l spark-warehouse/tbl +total 32 +drwxr-xr-x@ 9 jacek staff 288 Jun 11 20:41 _delta_log +-rw-r--r--@ 1 jacek staff 43 Jun 11 20:41 deletion_vector_5366f7d2-59db-4b86-b160-af5b8f5944d6.bin +-rw-r--r--@ 1 jacek staff 449 Jun 11 20:39 part-00000-be36d6d7-fd71-4b4a-a6b3-fbb41d568abc-c000.snappy.parquet +-rw-r--r--@ 1 jacek staff 449 Jun 11 20:39 part-00001-728e8290-6af7-465d-9372-df7d0f981b62-c000.snappy.parquet +-rw-r--r--@ 1 jacek staff 449 Jun 11 20:39 part-00002-85c45a7f-8903-4a7c-bdd1-2f4998fcc8b4-c000.snappy.parquet +``` + +Physically delete dropped rows using [VACUUM](../commands/vacuum/index.md) command. + +```text +scala> sql("DESC HISTORY tbl").select("version", "operation", "operationParameters").show(truncate=false) ++-------+------------+----------------------------------------------------------------------------------------------------------------------------------+ +|version|operation |operationParameters | ++-------+------------+----------------------------------------------------------------------------------------------------------------------------------+ +|2 |DELETE |{predicate -> ["(a#2003 = 1)"]} | +|1 |WRITE |{mode -> Append, partitionBy -> []} | +|0 |CREATE TABLE|{partitionBy -> [], clusterBy -> [], description -> NULL, isManaged -> true, properties -> {"delta.enableDeletionVectors":"true"}}| ++-------+------------+----------------------------------------------------------------------------------------------------------------------------------+ +``` + +```text +scala> sql("SET spark.databricks.delta.retentionDurationCheck.enabled = false").show(truncate=false) ++-----------------------------------------------------+-----+ +|key |value| ++-----------------------------------------------------+-----+ +|spark.databricks.delta.retentionDurationCheck.enabled|false| ++-----------------------------------------------------+-----+ +``` + +```text +scala> sql("VACUUM tbl RETAIN 0 HOURS").show(truncate=false) +Deleted 2 files and directories in a total of 1 directories. 
++---------------------------------------------------+ +|path | ++---------------------------------------------------+ +|file:/Users/jacek/dev/oss/spark/spark-warehouse/tbl| ++---------------------------------------------------+ +``` + +```text +scala> sql("DESC HISTORY tbl").select("version", "operation", "operationParameters").show(truncate=false) ++-------+------------+----------------------------------------------------------------------------------------------------------------------------------+ +|version|operation |operationParameters | ++-------+------------+----------------------------------------------------------------------------------------------------------------------------------+ +|4 |VACUUM END |{status -> COMPLETED} | +|3 |VACUUM START|{retentionCheckEnabled -> false, defaultRetentionMillis -> 604800000, specifiedRetentionMillis -> 0} | +|2 |DELETE |{predicate -> ["(a#2003 = 1)"]} | +|1 |WRITE |{mode -> Append, partitionBy -> []} | +|0 |CREATE TABLE|{partitionBy -> [], clusterBy -> [], description -> NULL, isManaged -> true, properties -> {"delta.enableDeletionVectors":"true"}}| ++-------+------------+----------------------------------------------------------------------------------------------------------------------------------+ +``` + +```console +$ ls -l spark-warehouse/tbl +total 16 +drwxr-xr-x@ 13 jacek staff 416 Jun 11 21:08 _delta_log +-rw-r--r--@ 1 jacek staff 449 Jun 11 20:39 part-00001-728e8290-6af7-465d-9372-df7d0f981b62-c000.snappy.parquet +-rw-r--r--@ 1 jacek staff 449 Jun 11 20:39 part-00002-85c45a7f-8903-4a7c-bdd1-2f4998fcc8b4-c000.snappy.parquet +``` + ## Learn More 1. [Delta Lake Deletion Vectors]({{ delta.blog }}/2023-07-05-deletion-vectors/)