Extents (data shards)

1 - Extent tags

Learn how to create and use extent tags.

An extent tag is a string that describes properties common to all data in an extent. For example, during data ingestion, you can append an extent tag to signify the source of the ingested data. Then, you can use this tag for analysis.

Extents can hold multiple tags as part of their metadata. When extents merge, their tags also merge, ensuring consistent metadata representation.

To see the tags associated with an extent, use the .show extents command. For a granular view of the tags associated with records within an extent, use the extent_tags() function.

drop-by extent tags

Tags that start with the drop-by: prefix control which extents can be merged with each other. Extents that have the same set of drop-by: tags can be merged together, but extents that have different sets of drop-by: tags are never merged with each other.

Examples

Determine which extents can be merged together

If:

  • Extent 1 has the following tags: drop-by:blue, drop-by:red, green.
  • Extent 2 has the following tags: drop-by:red, yellow.
  • Extent 3 has the following tags: purple, drop-by:red, drop-by:blue.

Then:

  • Extents 1 and 2 won’t be merged together, as they have a different set of drop-by tags.
  • Extents 2 and 3 won’t be merged together, as they have a different set of drop-by tags.
  • Extents 1 and 3 can be merged together, as they have the same set of drop-by tags.
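
The rule above can be checked with a small model. The following Python sketch is not part of the service; it only illustrates the comparison described here, assuming that merge eligibility depends solely on the set of drop-by: tags. The names drop_by_set and merge_groups are hypothetical helpers introduced for this illustration.

```python
from collections import defaultdict

DROP_BY_PREFIX = "drop-by:"


def drop_by_set(tags):
    """Return the set of drop-by tags carried by an extent (other tags are ignored)."""
    return frozenset(t for t in tags if t.startswith(DROP_BY_PREFIX))


def merge_groups(extents):
    """Group extent names by drop-by tag set; only extents in the same group may merge."""
    groups = defaultdict(list)
    for name, tags in extents.items():
        groups[drop_by_set(tags)].append(name)
    return [sorted(names) for names in groups.values()]


# The three extents from the example above.
extents = {
    "Extent 1": ["drop-by:blue", "drop-by:red", "green"],
    "Extent 2": ["drop-by:red", "yellow"],
    "Extent 3": ["purple", "drop-by:red", "drop-by:blue"],
}

print(merge_groups(extents))  # [['Extent 1', 'Extent 3'], ['Extent 2']]
```

Tag order and tags without the drop-by: prefix (green, yellow, purple) play no part in the comparison, which is why Extents 1 and 3 land in the same group.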

Use drop-by tags as part of extent-level operations

The first command below ingests data and tags the resulting extents with drop-by:2016-02-17. The second command drops every extent in MyTable that carries that tag.

.ingest ... with (tags = '["drop-by:2016-02-17"]')

.drop extents <| .show table MyTable extents where tags has "drop-by:2016-02-17"

ingest-by extent tags

Tags with the prefix ingest-by: can be used together with the ingestIfNotExists property to ensure that data is ingested only once.

The ingestIfNotExists property prevents duplicate ingestion by checking if an extent with the specified ingest-by: tag already exists. Typically, an ingest command contains an ingest-by: tag and the ingestIfNotExists property with the same value.

Examples

Add a tag on ingestion

The following command ingests the data and adds the tag ingest-by:2016-02-17.

.ingest ... with (tags = '["ingest-by:2016-02-17"]')

Prevent duplicate ingestion

The following command ingests the data so long as no extent in the table has the ingest-by:2016-02-17 tag.

.ingest ... with (ingestIfNotExists = '["2016-02-17"]')

Prevent duplicate ingestion and add a tag to any new data

The following command ingests the data so long as no extent in the table has the ingest-by:2016-02-17 tag. Any newly ingested data gets the ingest-by:2016-02-17 tag.

.ingest ... with (ingestIfNotExists = '["2016-02-17"]', tags = '["ingest-by:2016-02-17"]')
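
To make the interplay between the two properties concrete, here is a minimal Python model of the check. It is a sketch of the behavior described in this section, not the service's implementation: a table is modeled as a list of extents, each extent as a list of tags, and should_ingest and ingest are hypothetical names.

```python
INGEST_BY_PREFIX = "ingest-by:"


def should_ingest(table, ingest_if_not_exists):
    """Return False if any extent already carries ingest-by:<value> for a listed value."""
    existing = {tag for extent_tags in table for tag in extent_tags}
    return not any(INGEST_BY_PREFIX + value in existing for value in ingest_if_not_exists)


def ingest(table, tags=(), ingest_if_not_exists=()):
    """Append a new extent (modeled as its tag list) unless the check blocks it."""
    if not should_ingest(table, ingest_if_not_exists):
        return False
    table.append(list(tags))
    return True


table = []
# Same command issued twice: ingestIfNotExists and the ingest-by tag share one value.
first = ingest(table, tags=["ingest-by:2016-02-17"], ingest_if_not_exists=["2016-02-17"])
second = ingest(table, tags=["ingest-by:2016-02-17"], ingest_if_not_exists=["2016-02-17"])
print(first, second, len(table))  # True False 1
```

In this model, a command that sets ingestIfNotExists but omits the matching ingest-by: tag leaves nothing behind for the next check to find, so a repeat of the same command would not be blocked. That is why the two are typically specified together with the same value.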

Limitations

  • Extent tags can only be applied to records within an extent. Consequently, tags can’t be set on streaming ingestion data before it is stored in extents.
  • Extent tags can’t be stored on data in external tables or materialized views.

2 - Extents (data shards)

This article describes Extents (data shards).

Tables are partitioned into extents, or data shards. Each extent is a horizontal segment of the table that contains data and metadata such as its creation time and optional tags. The union of all these extents contains the entire dataset of the table. Extents are evenly distributed across nodes in the cluster, and they’re cached in both local SSD and memory for optimized performance.

Extents are immutable, meaning they can be queried, reassigned to a different node, or dropped out of the table but never modified. Data modification happens by creating new extents and transactionally swapping old extents with the new ones. The immutability of extents provides benefits such as increased robustness and easy reversion to previous snapshots.

Extents hold a collection of records that are physically arranged in columns, enabling efficient encoding and compression of the data. To maintain query efficiency, smaller extents are merged into larger extents according to the configured merge policy and sharding policy. Merging extents reduces management overhead and leads to index optimization and improved compression.

The common extent lifecycle is as follows:

  1. The extent is created by an ingestion operation.
  2. The extent is merged with other extents.
  3. The merged extent (possibly one that tracks its lineage to other extents) is eventually dropped because of a retention policy.

Extent creation time

Two datetime values are tracked per extent: MinCreatedOn and MaxCreatedOn. These values are initially the same but may change when the extent is merged with other extents. After a merge, the new extent's MinCreatedOn is the earliest MinCreatedOn of the merged extents, and its MaxCreatedOn is the latest MaxCreatedOn of the merged extents.
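
As an illustration of how the two values behave across a merge, the following Python sketch models an extent with only its two creation-time fields. The Extent class and merge function are hypothetical and exist only for this example; the frozen dataclass mirrors the fact that extents are immutable, so a merge produces a new extent rather than changing an existing one.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Extent:
    min_created_on: datetime
    max_created_on: datetime


def merge(extents):
    """Build a new extent spanning the earliest minimum and latest maximum of the inputs."""
    return Extent(
        min_created_on=min(e.min_created_on for e in extents),
        max_created_on=max(e.max_created_on for e in extents),
    )


# Freshly ingested extents start with both values equal.
a = Extent(datetime(2024, 1, 1), datetime(2024, 1, 1))
b = Extent(datetime(2024, 1, 3), datetime(2024, 1, 3))

merged = merge([a, b])
print(merged.min_created_on.date(), merged.max_created_on.date())  # 2024-01-01 2024-01-03
```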

The creation time of an extent is used for the following purposes:

  • Retention: Extents created earlier are dropped earlier.
  • Caching: Extents created recently are kept in hot cache.
  • Sampling: Recent extents are preferred when using query operations such as take.

To overwrite the creation time of an extent, provide an alternate creationTime in the data ingestion properties. This can be useful for retention purposes, such as if you want to reingest data but don’t want it to appear as if it arrived late.

3 - Extents commands