Ingestion batching

1 - IngestionBatching policy

Learn how to use the IngestionBatching policy to optimize batching for ingestion.

Overview

During the queued ingestion process, the service optimizes for throughput by batching small ingress data chunks together before ingestion. Batching reduces the resources consumed by the queued ingestion process and doesn’t require post-ingestion resources to optimize the small data shards produced by non-batched ingestion.

The downside of batching before ingestion is the forced delay: the end-to-end time from requesting ingestion until the data is ready for query is longer.

When you define the IngestionBatching policy, you’ll need to find a balance between optimizing for throughput and minimizing time delay. This policy applies to queued ingestion. It defines the maximum forced delay allowed when batching small blobs together.

Sealing a batch

There’s an optimal size of about 1 GB of uncompressed data for bulk ingestion. Ingestion of blobs with much less data is suboptimal, so in queued ingestion the service will batch small blobs together.

The following list shows the basic batching policy triggers to seal a batch. A batch is sealed and ingested when the first condition is met:

  • Size: Batch size limit reached or exceeded
  • Count: Batch file number limit reached
  • Time: Batching time has expired

The IngestionBatching policy can be set on databases or tables. Default values are as follows: 5 minutes maximum delay time, 500 items, total size of 1 GB.
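The three basic triggers can be sketched as a small check over an open batch. This is an illustrative model only, not the service's implementation: the names `BatchingPolicy`, `Batch`, and `seal_reason` are hypothetical, and the order in which simultaneous triggers are reported is an arbitrary choice made here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BatchingPolicy:
    # Defaults quoted in the text: 5 minutes, 500 items, 1 GB (1024 MB).
    max_seconds: float = 300.0
    max_items: int = 500
    max_raw_mb: float = 1024.0


@dataclass
class Batch:
    opened_at: float  # time the first blob arrived, in seconds
    sizes_mb: List[float] = field(default_factory=list)  # uncompressed blob sizes

    @property
    def total_mb(self) -> float:
        return sum(self.sizes_mb)


def seal_reason(batch: Batch, policy: BatchingPolicy, now: float) -> Optional[str]:
    """Return the trigger that seals the batch, or None if it stays open."""
    if batch.total_mb >= policy.max_raw_mb:
        return "Size"   # size limit reached or exceeded
    if len(batch.sizes_mb) >= policy.max_items:
        return "Count"  # file number limit reached
    if now - batch.opened_at >= policy.max_seconds:
        return "Time"   # batching time has expired
    return None


policy = BatchingPolicy()
batch = Batch(opened_at=0.0, sizes_mb=[10.0] * 20)  # 20 blobs, 200 MB in total
print(seal_reason(batch, policy, now=60.0))   # None: no limit reached yet
print(seal_reason(batch, policy, now=300.0))  # Time: the 5-minute limit expired
```

With little data arriving, the batch in this model is sealed only by the time trigger, which is why the time limit dominates end-to-end latency for low-volume tables.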

The following list shows the conditions that seal a batch for single blob ingestion. A batch is sealed and ingested when any of these conditions is met:

  • SingleBlob_FlushImmediately: Ingest a single blob because ‘FlushImmediately’ was set
  • SingleBlob_IngestIfNotExists: Ingest a single blob because ‘IngestIfNotExists’ was set
  • SingleBlob_IngestByTag: Ingest a single blob because ‘ingest-by’ was set
  • SingleBlob_SizeUnknown: Ingest a single blob because blob size is unknown

If the SystemFlush condition is set, a batch is sealed when a system flush is triggered. The system flushes the data in cases such as database scaling or an internal reset of system components.

Defaults and limits

| Type | Property | Default | Low latency setting | Minimum value | Maximum value |
|---|---|---|---|---|---|
| Number of items | MaximumNumberOfItems | 500 | 500 | 1 | 25,000 |
| Data size (MB) | MaximumRawDataSizeMB | 1024 | 1024 | 100 | 4096 |
| Time (TimeSpan) | MaximumBatchingTimeSpan | 00:05:00 | 00:00:20 - 00:00:30 | 00:00:10 | 00:30:00 |

The most effective way of controlling the end-to-end latency using the ingestion batching policy is to alter its time boundary at the table or database level, according to the upper bound of your latency requirements. A database-level policy affects all tables in that database that don’t have a table-level policy defined, and any newly created table.
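As a rough illustration, the sketch below builds a policy object from the three properties in the table, rejects values outside the listed minimum and maximum, and formats a table-level command. The helper names and the table name `MyTable` are hypothetical, and the command text assumes the `.alter table ... policy ingestionbatching` management command; verify the exact syntax against the command reference before use.

```python
import json
from datetime import timedelta

# Minimum and maximum values copied from the "Defaults and limits" table.
LIMITS = {
    "MaximumBatchingTimeSpan": (timedelta(seconds=10), timedelta(minutes=30)),
    "MaximumNumberOfItems": (1, 25_000),
    "MaximumRawDataSizeMB": (100, 4096),
}


def format_timespan(value: timedelta) -> str:
    """Format a timedelta as hh:mm:ss."""
    total = int(value.total_seconds())
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def build_policy(max_time=timedelta(minutes=5), max_items=500, max_raw_mb=1024) -> dict:
    """Return a policy dict, raising ValueError for out-of-range values."""
    values = {
        "MaximumBatchingTimeSpan": max_time,
        "MaximumNumberOfItems": max_items,
        "MaximumRawDataSizeMB": max_raw_mb,
    }
    for name, value in values.items():
        low, high = LIMITS[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    values["MaximumBatchingTimeSpan"] = format_timespan(max_time)
    return values


def alter_table_command(table: str, policy: dict) -> str:
    """Format the policy as command text (assumed syntax, for illustration)."""
    return f".alter table {table} policy ingestionbatching @'{json.dumps(policy)}'"


# Low latency setting from the table: 30 seconds, other values left at defaults.
low_latency = build_policy(max_time=timedelta(seconds=30))
print(alter_table_command("MyTable", low_latency))
```

Only the time boundary changes in this example, matching the advice above that the time limit is the most effective lever for latency.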

Batch data size

The batching policy data size is set for uncompressed data. For Parquet, AVRO, and ORC files, an estimation is calculated based on file size. For compressed data, the uncompressed data size is evaluated as follows in descending order of accuracy:

  1. If the uncompressed size is provided in the ingestion source options, that value is used.
  2. When ingesting local files using SDKs, zip archives and gzip streams are inspected to assess their raw size.
  3. If previous options don’t provide a data size, a factor is applied to the compressed data size to estimate the uncompressed data size.
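The fallback order above can be sketched as follows. This is a hedged illustration, not the service's or any SDK's actual code: the function names are hypothetical, and `ASSUMED_EXPANSION_FACTOR` is a placeholder because the text does not state the factor that is applied. The gzip helper relies on the gzip format's trailer, whose last four bytes hold the uncompressed size modulo 2^32.

```python
import gzip
import struct
from typing import Optional, Tuple

# Hypothetical value: the real factor is not given in the text.
ASSUMED_EXPANSION_FACTOR = 10.0


def gzip_raw_size(data: bytes) -> int:
    """Read the ISIZE field (uncompressed size mod 2**32) from a gzip trailer."""
    return struct.unpack("<I", data[-4:])[0]


def estimate_raw_size(
    compressed_bytes: int,
    declared_raw_bytes: Optional[int] = None,
    inspected_raw_bytes: Optional[int] = None,
    factor: float = ASSUMED_EXPANSION_FACTOR,
) -> Tuple[int, str]:
    """Return (estimated uncompressed size, source of the estimate)."""
    if declared_raw_bytes is not None:
        return declared_raw_bytes, "declared"    # 1. provided in source options
    if inspected_raw_bytes is not None:
        return inspected_raw_bytes, "inspected"  # 2. read from the archive or stream
    return int(compressed_bytes * factor), "factor"  # 3. estimate from compressed size


payload = gzip.compress(b"a" * 1000)
print(estimate_raw_size(len(payload), inspected_raw_bytes=gzip_raw_size(payload)))
print(estimate_raw_size(len(payload)))  # falls back to the assumed factor
```

Highly compressible data shows why the factor is the least accurate option: the estimate in the last call can differ greatly from the true 1000 bytes.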

Batching latencies

Latencies can result from many causes that can be addressed using batching policy settings.

| Cause | Solution |
|---|---|
| Data latency matches the time setting, with too little data to reach the size or count limit | Reduce the time limit. |
| Inefficient batching due to a large number of very small files | Increase the size of the source files. If using Kafka Sink, configure it to send data in chunks of ~100 KB or larger. If you have many small files, increase the count (up to 2000) in the database or table ingestion policy. |
| Batching a large amount of uncompressed data | This is common when ingesting Parquet files. Incrementally decrease the size in the table or database batching policy towards 250 MB and check for improvement. |
| Backlog because the database is underscaled | Accept any Azure Advisor suggestions to scale out or scale up your database. Alternatively, manually scale your database to see if the backlog clears. If these options don’t work, contact support for assistance. |