Management
- 1: Advanced data management
- 1.1: .clear cluster cache external-artifacts command
- 1.2: Data purge
- 1.3: Delete data
- 1.4: Follower commands
- 1.5: Data soft delete
- 1.5.1: Data soft delete
- 1.5.2: Data soft delete command
- 1.6: Extents (data shards)
- 1.6.1: Extent tags
- 1.6.2: Extents (data shards)
- 1.6.3: Extents commands
- 2: Cross-cluster schema
- 3: Continuous data export
- 3.1: .export to SQL
- 3.2: .export to storage
- 3.3: .export to table
- 3.4: Data export
- 3.5: Continuous data export
- 3.5.1: .create or alter continuous-export
- 3.5.2: .drop continuous-export
- 3.5.3: .show continuous data-export failures
- 3.5.4: .show continuous-export
- 3.5.5: .show continuous-export exported-artifacts
- 3.5.6: Continuous data export
- 3.5.7: Enable or disable continuous data export
- 3.5.8: Use a managed identity to run a continuous export job
- 4: Data ingestion
- 4.1: .ingest inline command (push)
- 4.2: .show data operations
- 4.3: Data formats supported for ingestion
- 4.4: Data ingestion properties
- 4.5: Ingest from query
- 4.6: Kusto.ingest into command (pull data from storage)
- 4.7: Streaming ingestion
- 5: Database cursors
- 5.1: Database cursors
- 6: Plugin commands
- 7: Policies
- 7.1: Policies overview
- 7.2: Auto delete
- 7.2.1: Auto delete policy
- 7.3: Caching
- 7.4: Callout
- 7.4.1: Callout policy
- 7.5: Capacity
- 7.5.1: Capacity policy
- 7.6: Encoding policy
- 7.6.1: Encoding policy
- 7.7: Extent tags policy
- 7.7.1: Extent tags retention policy
- 7.8: Ingestion batching
- 7.8.1: IngestionBatching policy
- 7.9: Ingestion time
- 7.9.1: IngestionTime policy
- 7.10: Managed identity
- 7.10.1: Kusto ManagedIdentity policy
- 7.11: Merge policy
- 7.11.1: Extents merge policy
- 7.12: Mirroring policy
- 7.12.1: Mirroring policy
- 7.13: Partitioning policy
- 7.13.1: Partitioning policy
- 7.14: Query acceleration policy
- 7.15: Query weak consistency policy
- 7.15.1: Query weak consistency policy
- 7.16: Restricted view access
- 7.16.1: Restricted view access policy
- 7.17: Retention policy
- 7.17.1: Retention policy
- 7.18: Row level security policy
- 7.18.1: Row level security policy
- 7.19: Row order policy
- 7.19.1: Row order policy
- 7.20: Sandbox policy
- 7.20.1: Sandbox policy
- 7.20.2: Sandboxes
- 7.21: Sharding policy
- 7.21.1: Data sharding policy
- 7.22: Streaming ingestion policy
- 7.22.1: Streaming ingestion policy
- 7.23: Update policy
- 8: Query results cache
- 9: Schema
- 9.1: Avrotize k2a tool
- 9.2: Best practices for schema management
- 9.3: Columns
- 9.3.1: Change column type without data loss
- 9.3.2: Columns management
- 9.4: Databases
- 9.5: External tables
- 9.5.1: Azure SQL external tables
- 9.5.1.1: Create and alter Azure SQL external tables
- 9.5.1.2: Query SQL external tables
- 9.5.1.3: Use row-level security with Azure SQL external tables
- 9.5.2: Azure Storage external tables
- 9.6: Functions
- 9.7: Ingestion mappings
- 9.7.1: AVRO Mapping
- 9.7.2: CSV Mapping
- 9.7.3: Ingestion mappings
- 9.7.4: JSON Mapping
- 9.7.5: ORC Mapping
- 9.7.6: Parquet Mapping
- 9.7.7: W3CLOGFILE Mapping
- 9.8: Manage external table mappings
- 9.9: Materialized views
- 9.9.1: Materialized views
- 9.9.2: Materialized views data purge
- 9.9.3: Materialized views limitations
- 9.9.4: Materialized views policies
- 9.9.5: Materialized views use cases
- 9.9.6: Monitor materialized views
- 9.10: Stored query results
- 9.10.1: Stored query results
- 9.11: Tables
- 9.11.1: Tables management
- 10: Security roles
- 10.1: Manage database security roles
- 10.2: Manage external table roles
- 10.3: Manage function roles
- 10.4: Manage materialized view roles
- 10.5: Referencing security principals
- 10.6: Security roles
- 10.7: Access control
- 10.7.1: Access Control Overview
- 10.7.2: Microsoft Entra application registration
- 10.7.3: Role-based access control
- 10.8: Manage table roles
- 10.8.1: Manage table security roles
- 10.8.2: Manage view access to tables
- 11: Operations
- 11.1: Estimate table size
- 11.2: Journal management
- 11.3: System information
- 11.4: Operations
- 11.5: Queries and commands
- 11.6: Statistics
- 12: Workload groups
- 12.1: Query consistency policy
- 12.2: Request limits policy
- 12.3: Request queuing policy
- 12.4: Request rate limit policy
- 12.5: Request rate limits enforcement policy
- 12.6: Workload groups
- 12.7: Request classification policy
- 12.7.1: Request classification policy
- 12.8: Workload group commands
- 13: Management commands overview
1 - Advanced data management
1.1 - .clear cluster cache external-artifacts command
.clear cluster cache external-artifacts
Clears the cached external-artifacts of language plugins.
This command is useful when you update external-artifact files stored in external storage, as the cache may retain the previous versions. In such scenarios, executing this command will clear the cache entries and ensure that subsequent queries run with the latest version of the artifacts.
Permissions
You must have at least Database Admin permissions to run this command.
Syntax
.clear cluster cache external-artifacts ( ArtifactURI [, …] )
Parameters
Name | Type | Required | Description |
---|---|---|---|
ArtifactURI | string | ✔️ | The URI for the external-artifact to clear from the cache. |
Returns
This command returns a table with the following columns:
Column | Type | Description |
---|---|---|
ExternalArtifactUri | string | The external artifact URI. |
State | string | The result of the clear operation on the external artifact. |
Example
.clear cluster cache external-artifacts ("https://kustoscriptsamples.blob.core.windows.net/samples/R/sample_script.r", "https://kustoscriptsamples.blob.core.windows.net/samples/python/sample_script.py")
ExternalArtifactUri | State |
---|---|
https://kustoscriptsamples.blob.core.windows.net/samples/R/sample_script.r | Cleared successfully on all nodes |
https://kustoscriptsamples.blob.core.windows.net/samples/python/sample_script.py | Cleared successfully on all nodes |
Related content
1.2 - Data purge
The data platform supports the ability to delete individual records by using Kusto .purge and related commands. You can also purge an entire table or purge records in a materialized view.
Purge guidelines
Carefully design your data schema and investigate relevant policies before storing personal data.
- In a best-case scenario, the retention period on this data is sufficiently short and data is automatically deleted.
- If retention period usage isn’t possible, isolate all data that is subject to privacy rules in a few tables. Optimally, use just one table and link to it from all other tables. This isolation allows you to run the data purge process on a few tables holding sensitive data, and avoid all other tables.
- The caller should make every attempt to batch the execution of .purge commands to 1-2 commands per table per day. Don't issue multiple commands with unique user identity predicates. Instead, send a single command whose predicate includes all user identities that require purging (see the example after this list).
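For illustration only, a single batched purge of this kind might look like the following sketch, using the command syntax described later in this article. The table, database, and column names are hypothetical:
// hypothetical table, database, and column names; batch all identities into one predicate
.purge table MyUserDataTable records in database MyDatabase with (noregrets='true') <| where UserId in ('Id1', 'Id2', 'Id3')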
Purge process
The process of selectively purging data happens in the following steps:
Phase 1: Give an input with a table name and a per-record predicate, indicating which records to delete. Kusto scans the table looking to identify data extents that would participate in the data purge. The extents identified are those having one or more records for which the predicate returns true.
Phase 2: (Soft Delete) Replace each data extent in the table (identified in step (1)) with a reingested version. The reingested version shouldn’t have the records for which the predicate returns true. If new data isn’t being ingested into the table, then by the end of this phase, queries will no longer return data for which the predicate returns true. The duration of the purge soft delete phase depends on the following parameters:
- The number of records that must be purged
- Record distribution across the data extents in the cluster
- The number of nodes in the cluster
- The spare capacity it has for purge operations
- Several other factors
The duration of phase 2 can vary between a few seconds to many hours.
Phase 3: (Hard Delete) Work back all storage artifacts that may have the “poison” data, and delete them from storage. This phase is done at least five days after the completion of the previous phase, but no longer than 30 days after the initial command. These timelines are set to follow data privacy requirements.
Issuing a .purge command triggers this process, which takes a few days to complete. If the density of records for which the predicate applies is sufficiently large, the process will effectively reingest all the data in the table. This reingestion has a significant impact on performance and COGS (cost of goods sold).
Purge limitations and considerations
The purge process is final and irreversible. It isn’t possible to undo this process or recover data that has been purged. Commands such as undo table drop can’t recover purged data. Rollback of the data to a previous version can’t go to before the latest purge command.
Before running the purge, verify the predicate by running a query and checking that the results match the expected outcome. You can also use the two-step process that returns the expected number of records that will be purged.
The .purge command is executed against the Data Management endpoint: https://ingest-[YourClusterName].[region].kusto.windows.net. The command requires database admin permissions on the relevant databases.
Due to the purge process performance impact, and to guarantee that purge guidelines have been followed, the caller is expected to modify the data schema so that minimal tables include relevant data, and batch commands per table to reduce the significant COGS impact of the purge process.
The predicate parameter of the .purge command is used to specify which records to purge. Predicate size is limited to 1 MB. When constructing the predicate:
- Use the 'in' operator, for example, where [ColumnName] in ('Id1', 'Id2', .. , 'Id1000').
- Note the limits of the 'in' operator (the list can contain up to 1,000,000 values).
- If the query size is large, use the externaldata operator, for example where UserId in (externaldata(UserId:string) ["https://...blob.core.windows.net/path/to/file?..."]). The file stores the list of IDs to purge (see the sketch after this list).
- The total query size, after expanding all externaldata blobs (total size of all blobs), can't exceed 64 MB.
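As a sketch of the externaldata approach, a first-step purge whose predicate reads the list of IDs from a blob might look like this. The table, database, column, and blob URL are placeholders, and the blob is assumed to hold one ID per line:
// placeholder names and URL; the blob (plus any required SAS) holds the IDs to purge, one per line
.purge table MyUserDataTable records in database MyDatabase <| where UserId in (externaldata(UserId:string) ["https://mystorageaccount.blob.core.windows.net/purge/userids.txt"])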
Purge performance
Only one purge request can be executed on the cluster, at any given time. All other requests are queued in Scheduled state.
Monitor the purge request queue size, and keep within adequate limits to match the requirements applicable for your data.
To reduce purge execution time:
Follow the purge guidelines to decrease the amount of purged data.
Adjust the caching policy since purge takes longer on cold data.
Scale out the cluster.
Increase cluster purge capacity, after careful consideration, as detailed in Extents purge rebuild capacity (see the sketch after this list).
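As a sketch under these guidelines, warming the purged table's cache and raising the purge rebuild capacity might look like the following. The table name and capacity value are assumptions, and the exact capacity policy properties are described in the capacity policy documentation:
// hypothetical table name; keep the relevant data span in hot cache
.alter table MyUserDataTable policy caching hot = 30d
// assumed capacity property and value; adjust only after careful consideration
.alter-merge cluster policy capacity @'{"ExtentsPurgeRebuildCapacity": {"MaximumConcurrentOperationsPerNode": 2}}'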
Trigger the purge process
Purge table TableName records command
The purge command may be invoked in two ways for differing usage scenarios:
Programmatic invocation: A single step that is intended to be invoked by applications. Calling this command directly triggers purge execution sequence.
Syntax
// Connect to the Data Management service
#connect "https://ingest-[YourClusterName].[region].kusto.windows.net"
// To purge table records
.purge table [TableName] records in database [DatabaseName] with (noregrets='true') <| [Predicate]
// To purge materialized view records
.purge materialized-view [MaterializedViewName] records in database [DatabaseName] with (noregrets='true') <| [Predicate]
Human invocation: A two-step process that requires an explicit confirmation as a separate step. First invocation of the command returns a verification token, which should be provided to run the actual purge. This sequence reduces the risk of inadvertently deleting incorrect data.
[!NOTE] The first step in the two-step invocation requires running a query on the entire dataset, to identify records to be purged. This query may time-out or fail on large tables, especially with a significant amount of cold cache data. In case of failures, validate the predicate yourself and after verifying correctness use the single-step purge with the noregrets option.
Syntax
// Connect to the Data Management service - this command only works in Kusto.Explorer
#connect "https://ingest-[YourClusterName].[region].kusto.windows.net"
// Step #1 - retrieve a verification token (no records will be purged until step #2 is executed)
.purge table [TableName] records in database [DatabaseName] <| [Predicate]
// Step #2 - input the verification token to execute purge
.purge table [TableName] records in database [DatabaseName] with (verificationtoken=h'<verification token from step #1>') <| [Predicate]
To purge a materialized view, replace the table keyword with materialized-view, and replace TableName with the MaterializedViewName.
Parameters | Description |
---|---|
DatabaseName | Name of the database |
TableName / MaterializedViewName | Name of the table / materialized view to purge. |
Predicate | Identifies the records to purge. See purge predicate limitations. |
noregrets | If set, triggers a single-step activation. |
verificationtoken | In the two-step activation scenario (noregrets isn’t set), this token can be used to execute the second step and commit the action. If verificationtoken isn’t specified, it will trigger the command’s first step. Information about the purge will be returned with a token that should be passed back to the command to do step #2. |
Purge predicate limitations
- The predicate must be a simple selection (for example, where [ColumnName] == ‘X’ / where [ColumnName] in (‘X’, ‘Y’, ‘Z’) and [OtherColumn] == ‘A’).
- Multiple filters must be combined with an ‘and’, rather than separate
where
clauses (for example,where [ColumnName] == 'X' and OtherColumn] == 'Y'
and notwhere [ColumnName] == 'X' | where [OtherColumn] == 'Y'
). - The predicate can’t reference tables other than the table being purged (TableName). The predicate can only include the selection statement (
where
). It can’t project specific columns from the table (output schema when running ‘table
| Predicate’ must match table schema). - System functions (such as,
ingestion_time()
,extent_id()
) aren’t supported.
Example: Two-step purge
To start purge in a two-step activation scenario, run step #1 of the command:
// Connect to the Data Management service
#connect "https://ingest-[YourClusterName].[region].kusto.windows.net"
.purge table MyTable records in database MyDatabase <| where CustomerId in ('X', 'Y')
.purge materialized-view MyView records in database MyDatabase <| where CustomerId in ('X', 'Y')
Output
NumRecordsToPurge | EstimatedPurgeExecutionTime | VerificationToken |
---|---|---|
1,596 | 00:00:02 | e43c7184ed22f4f23c7a9d7b124d196be2e570096987e5baadf65057fa65736b |
Then, validate the NumRecordsToPurge before running step #2.
To complete a purge in a two-step activation scenario, use the verification token returned from step #1 to run step #2:
.purge table MyTable records in database MyDatabase
with(verificationtoken=h'e43c7....')
<| where CustomerId in ('X', 'Y')
.purge materialized-view MyView records in database MyDatabase
with(verificationtoken=h'e43c7....')
<| where CustomerId in ('X', 'Y')
Output
OperationId | DatabaseName | TableName | ScheduledTime | Duration | LastUpdatedOn | EngineOperationId | State | StateDetails | EngineStartTime | EngineDuration | Retries | ClientRequestId | Principal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
c9651d74-3b80-4183-90bb-bbe9e42eadc4 | MyDatabase | MyTable | 2019-01-20 11:41:05.4391686 | 00:00:00.1406211 | 2019-01-20 11:41:05.4391686 | | Scheduled | | | | 0 | KE.RunCommand;1d0ad28b-f791-4f5a-a60f-0e32318367b7 | AAD app id=… |
Example: Single-step purge
To trigger a purge in a single-step activation scenario, run the following command:
// Connect to the Data Management service
#connect "https://ingest-[YourClusterName].[region].kusto.windows.net"
.purge table MyTable records in database MyDatabase with (noregrets='true') <| where CustomerId in ('X', 'Y')
.purge materialized-view MyView records in database MyDatabase with (noregrets='true') <| where CustomerId in ('X', 'Y')
Output
OperationId | DatabaseName | TableName | ScheduledTime | Duration | LastUpdatedOn | EngineOperationId | State | StateDetails | EngineStartTime | EngineDuration | Retries | ClientRequestId | Principal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
c9651d74-3b80-4183-90bb-bbe9e42eadc4 | MyDatabase | MyTable | 2019-01-20 11:41:05.4391686 | 00:00:00.1406211 | 2019-01-20 11:41:05.4391686 | | Scheduled | | | | 0 | KE.RunCommand;1d0ad28b-f791-4f5a-a60f-0e32318367b7 | AAD app id=… |
Cancel purge operation command
If needed, you can cancel pending purge requests.
Syntax
// Cancel of a single purge operation
.cancel purge <OperationId>
// Cancel of all pending purge requests in a database
.cancel all purges in database <DatabaseName>
// Cancel of all pending purge requests, for all databases
.cancel all purges
Example: Cancel a single purge operation
.cancel purge aa894210-1c60-4657-9d21-adb2887993e1
Output
The output of this command is the same as the '.show purges OperationId' command output, showing the updated status of the purge operation being canceled.
If the attempt is successful, the operation state is updated to Canceled. Otherwise, the operation state isn't changed.
OperationId | DatabaseName | TableName | ScheduledTime | Duration | LastUpdatedOn | EngineOperationId | State | StateDetails | EngineStartTime | EngineDuration | Retries | ClientRequestId | Principal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
c9651d74-3b80-4183-90bb-bbe9e42eadc4 | MyDatabase | MyTable | 2019-01-20 11:41:05.4391686 | 00:00:00.1406211 | 2019-01-20 11:41:05.4391686 | | Canceled | | | | 0 | KE.RunCommand;1d0ad28b-f791-4f5a-a60f-0e32318367b7 | AAD app id=… |
Example: Cancel all pending purge operations in a database
.cancel all purges in database MyDatabase
Output
The output of this command is the same as the show purges command output, showing all operations in the database with their updated status.
Operations that were canceled successfully will have their status updated to Canceled. Otherwise, the operation state isn't changed.
OperationId | DatabaseName | TableName | ScheduledTime | Duration | LastUpdatedOn | EngineOperationId | State | StateDetails | EngineStartTime | EngineDuration | Retries | ClientRequestId | Principal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5a34169e-8730-49f5-9694-7fde3a7a0139 | MyDatabase | MyTable | 2021-03-03 05:07:29.7050198 | 00:00:00.2971331 | 2021-03-03 05:07:30.0021529 | | Canceled | | | | 0 | KE.RunCommand;1d0ad28b-f791-4f5a-a60f-0e32318367b7 | AAD app id=… |
2fa7c04c-6364-4ce1-a5e5-1ab921f518f5 | MyDatabase | MyTable | 2021-03-03 05:05:03.5035478 | 00:00:00.1406211 | 2021-03-03 05:05:03.6441689 | | InProgress | | | | 0 | KE.RunCommand;1d0ad28b-f791-4f5a-a60f-0e32318367b7 | AAD app id=… |
Track purge operation status
Status = ‘Completed’ indicates successful completion of the first phase of the purge operation, that is records are soft-deleted and are no longer available for querying. Customers aren’t expected to track and verify the second phase (hard-delete) completion. This phase is monitored internally.
Show purges command
The .show purges command shows the status of purge operations, either for a specific operation ID or within the requested time period.
.show purges <OperationId>
.show purges [in database <DatabaseName>]
.show purges from '<StartDate>' [in database <DatabaseName>]
.show purges from '<StartDate>' to '<EndDate>' [in database <DatabaseName>]
Properties | Description | Mandatory/Optional |
---|---|---|
OperationId | The Data Management operation ID returned after executing the single-phase or second-phase command. | Mandatory |
StartDate | Lower time limit for filtering operations. If omitted, defaults to 24 hours before current time. | Optional |
EndDate | Upper time limit for filtering operations. If omitted, defaults to current time. | Optional |
DatabaseName | Database name to filter results. | Optional |
Examples
.show purges
.show purges c9651d74-3b80-4183-90bb-bbe9e42eadc4
.show purges from '2018-01-30 12:00'
.show purges from '2018-01-30 12:00' to '2018-02-25 12:00'
.show purges from '2018-01-30 12:00' to '2018-02-25 12:00' in database MyDatabase
Output
OperationId | DatabaseName | TableName | ScheduledTime | Duration | LastUpdatedOn | EngineOperationId | State | StateDetails | EngineStartTime | EngineDuration | Retries | ClientRequestId | Principal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
c9651d74-3b80-4183-90bb-bbe9e42eadc4 | MyDatabase | MyTable | 2019-01-20 11:41:05.4391686 | 00:00:33.6782130 | 2019-01-20 11:42:34.6169153 | a0825d4d-6b0f-47f3-a499-54ac5681ab78 | Completed | Purge completed successfully (storage artifacts pending deletion) | 2019-01-20 11:41:34.6486506 | 00:00:04.4687310 | 0 | KE.RunCommand;1d0ad28b-f791-4f5a-a60f-0e32318367b7 | AAD app id=… |
- OperationId - the DM operation ID returned when executing purge.
- DatabaseName - database name (case sensitive).
- TableName - table name (case sensitive).
- ScheduledTime - time of executing the purge command to the DM service.
- Duration - total duration of the purge operation, including the execution DM queue wait time.
- EngineOperationId - the operation ID of the actual purge executing in the engine.
- State - purge state, can be one of the following values:
  - Scheduled - purge operation is scheduled for execution. If the job remains Scheduled, there's probably a backlog of purge operations. See purge performance to clear this backlog. If a purge operation fails on a transient error, it will be retried by the DM and set to Scheduled again (so you may see an operation transition from Scheduled to InProgress and back to Scheduled).
  - InProgress - the purge operation is in progress in the engine.
  - Completed - purge completed successfully.
  - BadInput - purge failed on bad input and won't be retried. This failure may be due to various issues such as a syntax error in the predicate, an illegal predicate for purge commands, a query that exceeds limits (for example, over 1M entities in an externaldata operator or over 64 MB of total expanded query size), and 404 or 403 errors for externaldata blobs.
  - Failed - purge failed and won't be retried. This failure may happen if the operation was waiting in the queue for too long (over 14 days), due to a backlog of other purge operations or a number of failures that exceed the retry limit. The latter will raise an internal monitoring alert and will be investigated by the team.
- StateDetails - a description of the State.
- EngineStartTime - the time the command was issued to the engine. If there's a large difference between this time and ScheduledTime, there's usually a significant backlog of purge operations and the cluster isn't keeping up with the pace.
- EngineDuration - time of actual purge execution in the engine. If purge was retried several times, it's the sum of all the execution durations.
- Retries - number of times the operation was retried by the DM service due to a transient error.
- ClientRequestId - client activity ID of the DM purge request.
- Principal - identity of the purge command issuer.
Purging an entire table
Purging a table includes dropping the table, and marking it as purged so that the hard delete process described in Purge process runs on it.
Dropping a table without purging it doesn’t delete all its storage artifacts. These artifacts are deleted according to the hard retention policy initially set on the table.
The purge table allrecords command is quick and efficient and is preferable to the purge records process, if applicable for your scenario.
Purge table TableName allrecords command
Similar to the '.purge table records' command, this command can be invoked in a programmatic (single-step) or in a manual (two-step) mode.
Programmatic invocation (single-step):
Syntax
// Connect to the Data Management service
#connect "https://ingest-[YourClusterName].[Region].kusto.windows.net"
.purge table [TableName] in database [DatabaseName] allrecords with (noregrets='true')
Human invocation (two-steps):
Syntax
// Connect to the Data Management service
#connect "https://ingest-[YourClusterName].[Region].kusto.windows.net"
// Step #1 - retrieve a verification token (the table will not be purged until step #2 is executed)
.purge table [TableName] in database [DatabaseName] allrecords
// Step #2 - input the verification token to execute purge
.purge table [TableName] in database [DatabaseName] allrecords with (verificationtoken=h'<verification token from step #1>')
Parameters | Description |
---|---|
DatabaseName | Name of the database. |
TableName | Name of the table. |
noregrets | If set, triggers a single-step activation. |
verificationtoken | In the two-step activation scenario (noregrets isn't set), this token can be used to execute the second step and commit the action. If verificationtoken isn't specified, it will trigger the command's first step. In this step, a token is returned to pass back to the command and do step #2. |
Example: Two-step purge
To start purge in a two-step activation scenario, run step #1 of the command:
// Connect to the Data Management service
#connect "https://ingest-[YourClusterName].[Region].kusto.windows.net"
.purge table MyTable in database MyDatabase allrecords
Output
VerificationToken |
---|
e43c7184ed22f4f23c7a9d7b124d196be2e570096987e5baadf65057fa65736b |
To complete a purge in a two-step activation scenario, use the verification token returned from step #1 to run step #2:
.purge table MyTable in database MyDatabase allrecords with (verificationtoken=h'eyJT.....')
The output is the same as the ‘.show tables’ command output (returned without the purged table).
Output
TableName | DatabaseName | Folder | DocString |
---|---|---|---|
OtherTable | MyDatabase | — | — |
Example: Single-step purge
To trigger a purge in a single-step activation scenario, run the following command:
// Connect to the Data Management service
#connect "https://ingest-[YourClusterName].[Region].kusto.windows.net"
.purge table MyTable in database MyDatabase allrecords with (noregrets='true')
The output is the same as the ‘.show tables’ command output (returned without the purged table).
Output
TableName | DatabaseName | Folder | DocString |
---|---|---|---|
OtherTable | MyDatabase | — | — |
Related content
1.3 - Delete data
Deleting data from a table is supported in several ways. Use the following information to help you choose which deletion method is best for your use case.
Use case | Considerations | Method |
---|---|---|
Delete all data from a table. | | Use the .clear table data command |
Routinely delete old data. | Use if you need an automated deletion solution. | Use a retention policy |
Bulk delete specific data by extents. | Only use if you’re an expert user. | Use the .drop extents command |
Delete records based on their content. | - Storage artifacts that contain the deleted records aren’t necessarily deleted. - Deleted records can’t be recovered (regardless of any retention or recoverability settings). - Use if you need a quick way to delete records. | Use soft delete |
Delete records based on their content. | - Storage artifacts that contain the deleted records are deleted. - Deleted records can’t be recovered (regardless of any retention or recoverability settings). - Requires significant system resources and time to complete. | Use purge |
The following sections describe the different deletion methods.
Delete all data in a table
To delete all data in a table, use the .clear table data command. This command is the most efficient way to remove all data from a table.
Syntax:
.clear table <TableName> data
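For example, to remove all data from a hypothetical table named MyTable:
// MyTable is a placeholder table name
.clear table MyTable data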
Delete data using a retention policy
Automatically delete data based on a retention policy. You can set the retention policy at the database or table level. There's no guarantee as to when the deletion occurs, but the data won't be deleted before the retention period has passed. This is an efficient and convenient way to remove old data.
Consider a database or table that is set for 90 days of retention. If only 60 days of data are needed, delete the older data as follows:
.alter-merge database <DatabaseName> policy retention softdelete = 60d
.alter-merge table <TableName> policy retention softdelete = 60d
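To confirm the effective setting after the change, the retention policy can be inspected with the standard show commands; a minimal sketch, using placeholder names:
// MyDatabase and MyTable are placeholder names
.show database MyDatabase policy retention
.show table MyTable policy retention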
Delete data by dropping extents
An extent (data shard) is the internal structure where data is stored. Each extent can hold up to millions of records. Extents can be deleted individually or as a group using drop extent(s) commands.
Examples
You can delete all rows in a table or just a specific extent.
Delete all rows in a table:
.drop extents from TestTable
Delete a specific extent:
.drop extent e9fac0d2-b6d5-4ce3-bdb4-dea052d13b42
Delete individual rows
Both purge and soft delete can be used for deleting individual rows. Soft delete doesn't necessarily delete the storage artifacts that contain records to delete, whereas purge does delete all such storage artifacts.
Both methods prevent deleted records from being recovered, regardless of any retention or recoverability settings. The deletion process is final and irreversible.
Soft delete
With soft delete, data isn’t necessarily deleted from storage artifacts. This method marks all matching records as deleted, so that they’ll be filtered out in queries, and doesn’t require significant system resources.
Purge
With purge, extents that have one or more records to be deleted, are replaced with new extents in which those records don’t exist. This deletion process isn’t immediate, requires significant system resources, and can take a whole day to complete.
1.4 - Follower commands
Management commands for managing your follower configuration. These commands run synchronously but are applied on the next periodic schema refresh, which may result in a short delay until the new configuration is applied.
The follower commands include database level commands and table level commands.
Permissions
You must have at least Database Admin permissions to run this command.
Database policy overrides
The following database-level policies of a leader database can be overridden in the follower cluster: Caching policy and Authorized principals.
Caching policy
The default caching policy for the follower cluster uses the leader cluster database and table-level caching policies.
Option | Description |
---|---|
None | The caching policies used are those policies defined in the source database in the leader cluster. |
replace | The source database in the leader cluster database and table-level caching policies are removed (set to null). These policies are replaced by the database and table-level override policies, if defined. |
union (default) | The source database in the leader cluster database and table-level caching policies are combined with the policies defined in the database and table-level override policies. |
Authorized principals
Option | Description |
---|---|
None | The authorized principals are defined in the source database of the leader cluster. |
replace | The override authorized principals replace the authorized principals from the source database in the leader cluster. |
union (default) | The override authorized principals are combined with the authorized principals from the source database in the leader cluster. |
Table and materialized views policy overrides
By default, tables and materialized views in a database that is being followed by a follower cluster keep the source entity’s caching policy.
However, table and materialized view caching policies can be overridden in the follower cluster.
Use the replace option to override the source entity's caching policy.
Database level commands
.show follower database
Shows a database (or databases) followed from another leader cluster, which has one or more database-level overrides configured.
Syntax
.show follower database DatabaseName
.show follower databases ( DatabaseName1, …, DatabaseNameN )
Output
Output parameter | Type | Description |
---|---|---|
DatabaseName | string | The name of the database being followed. |
LeaderClusterMetadataPath | string | The path to the leader cluster’s metadata container. |
CachingPolicyOverride | string | An override caching policy for the database, serialized as JSON, or null. |
AuthorizedPrincipalsOverride | string | An override collection of authorized principals for the database, serialized as JSON, or null. |
AuthorizedPrincipalsModificationKind | string | The modification kind to apply using AuthorizedPrincipalsOverride (none , union , or replace ). |
CachingPoliciesModificationKind | string | The modification kind to apply using database or table-level caching policy overrides (none , union , or replace ). |
IsAutoPrefetchEnabled | bool | Whether new data is pre-fetched upon each schema refresh. |
TableMetadataOverrides | string | If defined, a JSON serialization of table-level property overrides. |
.alter follower database policy caching
Alters a follower database caching policy, to override the one set on the source database in the leader cluster.
Notes
- The default modification kind for caching policies is union. To change the modification kind, use the .alter follower database caching-policies-modification-kind command.
- Viewing the policy or effective policies after the change can be done using the .show commands.
- Viewing the override settings on the follower database after the change is made can be done using .show follower database.
Syntax
.alter follower database DatabaseName policy caching hot = HotDataSpan
Example
.alter follower database MyDb policy caching hot = 7d
.delete follower database policy caching
Deletes a follower database override caching policy. This deletion makes the policy set on the source database in the leader cluster the effective one.
Notes
- Viewing the policy or effective policies after the change can be done using the .show commands.
- Viewing the override settings on the follower database after the change can be done using .show follower database.
Syntax
.delete follower database DatabaseName policy caching
Example
.delete follower database MyDB policy caching
.add follower database principals
Adds authorized principal(s) to the follower database collection of override authorized principals.
Notes
- The default modification kind for such authorized principals is none. To change the modification kind, use alter follower database principals-modification-kind.
- Viewing the effective collection of principals after the change can be done using the .show commands.
- Viewing the override settings on the follower database after the change can be done using .show follower database.
Syntax
.add follower database DatabaseName (admins | users | viewers | monitors) ( principal1, …, principalN ) ['notes']
Example
.add follower database MyDB viewers ('aadgroup=mygroup@microsoft.com') 'My Group'
.drop follower database principals
Drops authorized principal(s) from the follower database collection of override authorized principals.
Syntax
.drop follower database DatabaseName (admins | users | viewers | monitors) ( principal1, …, principalN )
Example
.drop follower database MyDB viewers ('aadgroup=mygroup@microsoft.com')
.alter follower database principals-modification-kind
Alters the follower database authorized principals modification kind.
Syntax
.alter follower database DatabaseName principals-modification-kind = (none | union | replace)
Example
.alter follower database MyDB principals-modification-kind = union
.alter follower database caching-policies-modification-kind
Alters the caching policies modification kind for the follower database, table, and materialized views.
Syntax
.alter follower database DatabaseName caching-policies-modification-kind = (none | union | replace)
Example
.alter follower database MyDB caching-policies-modification-kind = union
.alter follower database prefetch-extents
The follower cluster can wait for new data to be fetched from the underlying storage to the nodes’ SSD (cache) before making this data queryable.
The following command alters the follower database configuration of pre-fetching new extents upon each schema refresh.
Syntax
.alter follower database DatabaseName prefetch-extents = (true | false)
Example
.alter follower database MyDB prefetch-extents = false
Tables and materialized views commands
Alter follower table or materialized view caching policy
Alters a table’s or a materialized view’s caching policy on the follower database, to override the policy set on the source database in the leader cluster.
Syntax
.alter follower database DatabaseName table TableName policy caching hot = HotDataSpan
.alter follower database DatabaseName tables ( TableName1, …, TableNameN ) policy caching hot = HotDataSpan
.alter follower database DatabaseName materialized-view ViewName policy caching hot = HotDataSpan
.alter follower database DatabaseName materialized-views ( ViewName1, …, ViewNameN ) policy caching hot = HotDataSpan
Examples
.alter follower database MyDb tables (Table1, Table2) policy caching hot = 7d
.alter follower database MyDb materialized-views (View1, View2) policy caching hot = 7d
Delete follower table or materialized view caching policy
Deletes an override for a table’s or a materialized-view’s caching policy on the follower database. The policy set on the source database in the leader cluster will now be the effective policy.
Syntax
.delete follower database DatabaseName table TableName policy caching
.delete follower database DatabaseName tables ( TableName1, …, TableNameN ) policy caching
.delete follower database DatabaseName materialized-view ViewName policy caching
.delete follower database DatabaseName materialized-views ( ViewName1, …, ViewNameN ) policy caching
Example
.delete follower database MyDB tables (Table1, Table2) policy caching
.delete follower database MyDB materialized-views (View1, View2) policy caching
Sample configuration
The following are sample steps to configure a follower database.
In this example:
- Our follower cluster, MyFollowerCluster, will be following database MyDatabase from the leader cluster, MyLeaderCluster.
- MyDatabase has N tables: MyTable1, MyTable2, MyTable3, … MyTableN (N > 3).
- On MyLeaderCluster:
  - MyTable1 caching policy: hot data span = 7d
  - MyTable2 caching policy: hot data span = 30d
  - MyTable3 … MyTableN caching policy: hot data span = 365d
  - MyDatabase authorized principals: Viewers = aadgroup=scubadivers@contoso.com; Admins = aaduser=jack@contoso.com
- On MyFollowerCluster we want:
  - MyTable1 caching policy: hot data span = 1d
  - MyTable2 caching policy: hot data span = 3d
  - MyTable3 … MyTableN caching policy: hot data span = 0d (nothing is cached)
  - MyDatabase authorized principals: Admins = aaduser=jack@contoso.com, Viewers = aaduser=jill@contoso.com
Steps to execute
Prerequisite: Set up cluster MyFollowerCluster to follow database MyDatabase from cluster MyLeaderCluster.
Show the current configuration
See the current configuration according to which MyDatabase is being followed on MyFollowerCluster:
.show follower database MyDatabase
| evaluate narrow() // just for presentation purposes
Column | Value |
---|---|
DatabaseName | MyDatabase |
LeaderClusterMetadataPath | https://storageaccountname.blob.core.windows.net/cluster |
CachingPolicyOverride | null |
AuthorizedPrincipalsOverride | [] |
AuthorizedPrincipalsModificationKind | None |
IsAutoPrefetchEnabled | False |
TableMetadataOverrides | |
CachingPoliciesModificationKind | Union |
Override authorized principals
Replace the collection of authorized principals for MyDatabase on MyFollowerCluster with a collection that includes only one Microsoft Entra user as the database admin, and one Microsoft Entra user as a database viewer:
.add follower database MyDatabase admins ('aaduser=jack@contoso.com')
.add follower database MyDatabase viewers ('aaduser=jill@contoso.com')
.alter follower database MyDatabase principals-modification-kind = replace
Only those two specific principals are authorized to access MyDatabase on MyFollowerCluster:
.show database MyDatabase principals
Role | PrincipalType | PrincipalDisplayName | PrincipalObjectId | PrincipalFQN | Notes |
---|---|---|---|---|---|
Database MyDatabase Admin | Microsoft Entra user | Jack Kusto (upn: jack@contoso.com) | 12345678-abcd-efef-1234-350bf486087b | aaduser=87654321-abcd-efef-1234-350bf486087b;55555555-4444-3333-2222-2d7cd011db47 | |
Database MyDatabase Viewer | Microsoft Entra user | Jill Kusto (upn: jill@contoso.com) | abcdefab-abcd-efef-1234-350bf486087b | aaduser=54321789-abcd-efef-1234-350bf486087b;55555555-4444-3333-2222-2d7cd011db47 |
.show follower database MyDatabase
| mv-expand parse_json(AuthorizedPrincipalsOverride)
| project AuthorizedPrincipalsOverride.Principal.FullyQualifiedName
AuthorizedPrincipalsOverride_Principal_FullyQualifiedName |
---|
aaduser=87654321-abcd-efef-1234-350bf486087b;55555555-4444-3333-2222-2d7cd011db47 |
aaduser=54321789-abcd-efef-1234-350bf486087b;55555555-4444-3333-2222-2d7cd011db47 |
Override Caching policies
Replace the collection of database and table-level caching policies for MyDatabase on MyFollowerCluster by setting all tables to not have their data cached, excluding two specific tables - MyTable1, MyTable2 - that will have their data cached for periods of 1d and 3d, respectively:
.alter follower database MyDatabase policy caching hot = 0d
.alter follower database MyDatabase table MyTable1 policy caching hot = 1d
.alter follower database MyDatabase table MyTable2 policy caching hot = 3d
.alter follower database MyDatabase caching-policies-modification-kind = replace
Only those two specific tables have data cached, and the rest of the tables have a hot data period of 0d:
.show tables details
| summarize TableNames = make_list(TableName) by CachingPolicy
CachingPolicy | TableNames |
---|---|
{"DataHotSpan":{"Value":"1.00:00:00"},"IndexHotSpan":{"Value":"1.00:00:00"}} | ["MyTable1"] |
{"DataHotSpan":{"Value":"3.00:00:00"},"IndexHotSpan":{"Value":"3.00:00:00"}} | ["MyTable2"] |
{"DataHotSpan":{"Value":"0.00:00:00"},"IndexHotSpan":{"Value":"0.00:00:00"}} | ["MyTable3",…,"MyTableN"] |
.show follower database MyDatabase
| mv-expand parse_json(TableMetadataOverrides)
| project TableMetadataOverrides
TableMetadataOverrides |
---|
{"MyTable1":{"CachingPolicyOverride":{"DataHotSpan":{"Value":"1.00:00:00"},"IndexHotSpan":{"Value":"1.00:00:00"}}}} |
{"MyTable2":{"CachingPolicyOverride":{"DataHotSpan":{"Value":"3.00:00:00"},"IndexHotSpan":{"Value":"3.00:00:00"}}}} |
Summary
See the current configuration where MyDatabase is being followed on MyFollowerCluster:
.show follower database MyDatabase
| evaluate narrow() // just for presentation purposes
Column | Value |
---|---|
DatabaseName | MyDatabase |
LeaderClusterMetadataPath | https://storageaccountname.blob.core.windows.net/cluster |
CachingPolicyOverride | {"DataHotSpan":{"Value":"00:00:00"},"IndexHotSpan":{"Value":"00:00:00"}} |
AuthorizedPrincipalsOverride | [{"Principal":{"FullyQualifiedName":"aaduser=87654321-abcd-efef-1234-350bf486087b",…},{"Principal":{"FullyQualifiedName":"aaduser=54321789-abcd-efef-1234-350bf486087b",…}] |
AuthorizedPrincipalsModificationKind | Replace |
IsAutoPrefetchEnabled | False |
TableMetadataOverrides | {"MyTable2":{"CachingPolicyOverride":{"DataHotSpan":{"Value":"3.00:00:00"}…},"MyTable1":{"CachingPolicyOverride":{"DataHotSpan":{"Value":"1.00:00:00"},…}}} |
CachingPoliciesModificationKind | Replace |
1.5 - Data soft delete
1.5.1 - Data soft delete
The ability to delete individual records is supported. Record deletion is commonly achieved using one of the following methods:
- To delete records with a system guarantee that the storage artifacts containing these records are deleted as well, use .purge.
- To delete records without such a guarantee, use .delete as described in this article - this command marks records as deleted but doesn't necessarily delete the data from storage artifacts. This deletion method is faster than purge.
For information on how to use the command, see Syntax.
Use cases
This deletion method should only be used for the unplanned deletion of individual records. For example, if you discover that an IoT device is reporting corrupt telemetry for some time, you should consider using this method to delete the corrupt data.
If you need to frequently delete records for deduplication or updates, we recommend using materialized views. See choose between materialized views and soft delete for data deduplication.
Deletion process
The soft delete process is performed using the following steps:
- Run predicate query: The table is scanned to identify data extents that contain records to be deleted. The extents identified are those with one or more records returned by the predicate query.
- Extents replacement: The identified extents are replaced with new extents that point to the original data blobs, and also have a new hidden column of type bool that indicates per record whether it was deleted or not. Once completed, if no new data is ingested, the predicate query won't return any records if run again.
Limitations and considerations
The deletion process is final and irreversible. It isn’t possible to undo this process or recover data that has been deleted, even though the storage artifacts aren’t necessarily deleted following the operation.
Soft delete is supported for native tables and materialized views. It isn’t supported for external tables.
Before running soft delete, verify the predicate by running a query and checking that the results match the expected outcome. You can also run the command in whatif mode, which returns the number of records that are expected to be deleted.
Don't run multiple parallel soft delete operations on the same table, as this may result in failures of some or all the commands. However, it's possible to run multiple parallel soft delete operations on different tables.
Don’t run soft delete and purge commands on the same table in parallel. First wait for one command to complete and only then run the other command.
Soft delete is executed against your cluster URI: https://[YourClusterName].[region].kusto.windows.net. The command requires database admin permissions on the relevant database.
Deleting records from a table that is a source table of a materialized view can have an impact on the materialized view. If records being deleted were not yet processed by the materialization cycle, these records will be missing in the view, since they will never be processed. Similarly, the deletion will not have an impact on the materialized view if the records have already been processed.
Limitations on the predicate:
- It must contain at least one where operator.
- It can only reference the table from which records are to be deleted.
- Only the following operators are allowed: extend, order, project, take, and where. Within toscalar(), the summarize operator is also allowed (see the example after this list).
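For illustration, here's a minimal sketch of a predicate that stays within these limits, using only allowed operators and referencing only the target table. The table and column names are hypothetical:
// hypothetical table and columns; extend and where are among the allowed operators
.delete table MyTable records <| MyTable | extend IsStale = Timestamp < ago(30d) | where IsStale and Level == 'Error'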
Deletion performance
The main considerations that can impact the deletion process performance are:
- Run predicate query: The performance of this step is very similar to the performance of the predicate itself. It might be slightly faster or slower depending on the predicate, but the difference is expected to be insignificant.
- Extents replacement: The performance of this step depends on the following:
- Record distribution across the data extents in the cluster
- The number of nodes in the cluster
Unlike .purge, the .delete command doesn't reingest the data. It just marks records that are returned by the predicate query as deleted and is therefore much faster.
Query performance after deletion
Query performance isn’t expected to noticeably change following the deletion of records.
Performance degradation isn’t expected because the filter that is automatically added on all queries that filter out records that were deleted is efficient.
However, query performance is also not guaranteed to improve. While performance improvement may happen for some types of queries, it may not happen for some others. In order to improve query performance, extents in which most of the records are deleted are periodically compacted by replacing them with new extents that only contain the records that haven’t been deleted.
Impact on COGS (cost of goods sold)
In most cases, the deletion of records won’t result in a change of COGS.
- There will be no decrease, because no records are actually deleted. Records are only marked as deleted using a hidden column of type bool, the size of which is negligible.
- In most cases, there will be no increase because the .delete operation doesn't require the provisioning of extra resources.
- In some cases, extents in which the majority of the records are deleted are periodically compacted by replacing them with new extents that only contain the records that haven't been deleted. This causes the deletion of the old storage artifacts that contain a large number of deleted records. The new extents are smaller and therefore consume less space in both the Storage account and in the hot cache. However, in most cases, the effect of this on COGS is negligible.
1.5.2 - Data soft delete command
To soft delete individual records without a system guarantee that the storage artifacts containing these records are deleted as well, use the following command. This command marks records as deleted but doesn’t necessarily delete the data from storage artifacts. For more information, see Soft delete.
To delete individual records with a system guarantee that the storage artifacts containing these records are deleted as well, see Data purge.
Syntax
.delete [async] table TableName records [with ( propertyName = propertyValue [, …] )] <| Predicate
Parameters
Name | Type | Required | Description |
---|---|---|---|
async | string | If specified, indicates that the command runs in asynchronous mode. | |
TableName | string | ✔️ | The name of the table from which to delete records. |
propertyName, propertyValue | string | A comma-separated list of key-value property pairs. See supported properties. | |
Predicate | string | ✔️ | The predicate that returns records to delete, which is specified as a query. See note. |
Supported properties
Name | Type | Description |
---|---|---|
whatif | bool | If true , returns the number of records that will be deleted in every shard, without actually deleting any records. The default is false . |
Returns
The output of the command contains information about which extents were replaced.
Example: delete records of a given user
To delete all the records that contain data of a given user:
.delete table MyTable records <| MyTable | where UserId == 'X'
Example: check how many records would be deleted from a table
To determine the number of records that would be deleted by the operation without actually deleting them, check the value in the RecordsMatchPredicate column when running the command in whatif mode:
.delete table MyTable records with (whatif=true) <| MyTable | where UserId == 'X'
.delete materialized-view records - soft delete command
When soft delete is executed on materialized views, the same concepts and limitations apply.
Syntax - materialized views
.delete [async] materialized-view MaterializedViewName records [with ( propertyName = propertyValue [, …] )] <| Predicate
Parameters - materialized views
Name | Type | Required | Description |
---|---|---|---|
async | string | If specified, indicates that the command runs in asynchronous mode. | |
MaterializedViewName | string | ✔️ | The name of the materialized view from which to delete records. |
propertyName, propertyValue | string | A comma-separated list of key-value property pairs. See supported properties. | |
Predicate | string | ✔️ | The predicate that returns records to delete. Specified as a query. |
Supported properties - materialized views
Name | Type | Description |
---|---|---|
whatif | bool | If true , returns the number of records that will be deleted in every shard, without actually deleting any records. The default is false . |
Example - materialized views
To delete all the materialized view records that contain data of a given user:
.delete materialized-view MyMaterializedView records <| MyMaterializedView | where UserId == 'X'
Example: check how many records would be deleted from a materialized view
To determine the number of records that would be deleted by the operation without actually deleting them, check the value in the RecordsMatchPredicate column while running the command in whatif mode:
.delete materialized-view MyMaterializedView records with (whatif=true) <| MyMaterializedView | where UserId == 'X'
Related content
1.6 - Extents (data shards)
1.6.1 - Extent tags
An extent tag is a string that describes properties common to all data in an extent. For example, during data ingestion, you can append an extent tag to signify the source of the ingested data. Then, you can use this tag for analysis.
Extents can hold multiple tags as part of their metadata. When extents merge, their tags also merge, ensuring consistent metadata representation.
To see the tags associated with an extent, use the .show extents command. For a granular view of tags associated with records within an extent, use the extent-tags() function.
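For example, to list the extents of a table along with their tags, and to count records per tag set at the record level, the following sketch could be used. The table name is a placeholder, and the query uses the extent_tags() function:
// MyTable is a placeholder table name; run the command and the query as separate requests
.show table MyTable extents
// record-level view of tags
MyTable | extend Tags = extent_tags() | summarize RecordCount = count() by tostring(Tags)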
drop-by extent tags
Tags that start with a drop-by: prefix can be used to control which other extents to merge with. Extents that have the same set of drop-by: tags can be merged together, but they won't be merged with other extents if they have a different set of drop-by: tags.
Examples
Determine which extents can be merged together
If:
- Extent 1 has the following tags: drop-by:blue, drop-by:red, green.
- Extent 2 has the following tags: drop-by:red, yellow.
- Extent 3 has the following tags: purple, drop-by:red, drop-by:blue.
Then:
- Extents 1 and 2 won't be merged together, as they have a different set of drop-by tags.
- Extents 2 and 3 won't be merged together, as they have a different set of drop-by tags.
- Extents 1 and 3 can be merged together, as they have the same set of drop-by tags.
Use drop-by tags as part of extent-level operations
The following query issues a command to drop extents according to their drop-by: tag.
.ingest ... with @'{"tags":"[\"drop-by:2016-02-17\"]"}'
.drop extents <| .show table MyTable extents where tags has "drop-by:2016-02-17"
ingest-by extent tags
Tags with the prefix ingest-by: can be used together with the ingestIfNotExists property to ensure that data is ingested only once.
The ingestIfNotExists property prevents duplicate ingestion by checking if an extent with the specified ingest-by: tag already exists. Typically, an ingest command contains an ingest-by: tag and the ingestIfNotExists property with the same value.
Examples
Add a tag on ingestion
The following command ingests the data and adds the tag ingest-by:2016-02-17.
.ingest ... with (tags = '["ingest-by:2016-02-17"]')
Prevent duplicate ingestion
The following command ingests the data so long as no extent in the table has the ingest-by:2016-02-17 tag.
.ingest ... with (ingestIfNotExists = '["2016-02-17"]')
Prevent duplicate ingestion and add a tag to any new data
The following command ingests the data so long as no extent in the table has the ingest-by:2016-02-17 tag. Any newly ingested data gets the ingest-by:2016-02-17 tag.
.ingest ... with (ingestIfNotExists = '["2016-02-17"]', tags = '["ingest-by:2016-02-17"]')
Limitations
- Extent tags can only be applied to records within an extent. Consequently, tags can’t be set on streaming ingestion data before it is stored in extents.
- Extent tags can’t be stored on data in external tables or materialized views.
Related content
1.6.2 - Extents (data shards)
Tables are partitioned into extents, or data shards. Each extent is a horizontal segment of the table that contains data and metadata such as its creation time and optional tags. The union of all these extents contains the entire dataset of the table. Extents are evenly distributed across nodes in the cluster, and they’re cached in both local SSD and memory for optimized performance.
Extents are immutable, meaning they can be queried, reassigned to a different node, or dropped out of the table but never modified. Data modification happens by creating new extents and transactionally swapping old extents with the new ones. The immutability of extents provides benefits such as increased robustness and easy reversion to previous snapshots.
Extents hold a collection of records that are physically arranged in columns, enabling efficient encoding and compression of the data. To maintain query efficiency, smaller extents are merged into larger extents according to the configured merge policy and sharding policy. Merging extents reduces management overhead and leads to index optimization and improved compression.
The common extent lifecycle is as follows:
- The extent is created by an ingestion operation.
- The extent is merged with other extents.
- The merged extent (possibly one that tracks its lineage to other extents) is eventually dropped because of a retention policy.
Extent creation time
Two datetime values are tracked per extent: MinCreatedOn and MaxCreatedOn. These values are initially the same but can change when the extent is merged with other extents. When extents merge, the new values are the minimum and maximum of the original values across the merged extents.
The creation time of an extent is used for the following purposes:
- Retention: Extents created earlier are dropped earlier.
- Caching: Extents created recently are kept in hot cache.
- Sampling: Recent extents are preferred when using query operations such as take.
To overwrite the creation time of an extent, provide an alternate creationTime in the data ingestion properties. This can be useful for retention purposes, such as if you want to reingest data but don't want it to appear as if it arrived late.
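For example, the following command (a sketch; the ellipsis stands for the rest of an ingest command, as in the tag examples above) sets an explicit creation time so that retention and caching treat the reingested data according to its original date:
.ingest ... with (creationTime = '2020-01-01T00:00:00Z')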
Related content
1.6.3 - Extents commands
2 - Cross-cluster schema
3 - Continuous data export
3.1 - .export to SQL
Export data to SQL allows you to run a query and have its results sent to a table in an SQL database, such as an SQL database hosted by the Azure SQL Database service.
Permissions
You must have at least Table Admin permissions to run this command.
Syntax
.export [async] to sql sqlTableName sqlConnectionString [with ( propertyName = propertyValue [, ...] )] <| query
Parameters
Name | Type | Required | Description |
---|---|---|---|
async | string | | If specified, the command runs asynchronously. |
SqlTableName | string | ✔️ | The name of the SQL database table into which to insert the data. To protect against injection attacks, this name is restricted. |
SqlConnectionString | string | ✔️ | The connection string for the SQL endpoint and database. The string must follow the ADO.NET connection string format. For security reasons, the connection string is restricted. |
PropertyName, PropertyValue | string | | A list of optional properties. |
Supported properties
Name | Values | Description |
---|---|---|
firetriggers | true or false | If true , instructs the target system to fire INSERT triggers defined on the SQL table. The default is false . For more information, see BULK INSERT and System.Data.SqlClient.SqlBulkCopy. |
createifnotexists | true or false | If true , the target SQL table is created if it doesn’t already exist; the primarykey property must be provided in this case to indicate the result column that is the primary key. The default is false . |
primarykey | | If createifnotexists is true, this property indicates the name of the column in the result that is used as the SQL table's primary key if it's created by this command. |
persistDetails | bool | Indicates that the command should persist its results (see async flag). Defaults to true in async runs, but can be turned off if the caller doesn’t require the results. Defaults to false in synchronous executions, but can be turned on. |
token | string | The Microsoft Entra access token that Kusto forwards to the SQL endpoint for authentication. When set, the SQL connection string shouldn’t include authentication information like Authentication , User ID , or Password . |
Authentication and authorization
The authentication method is based on the connection string provided, and the permissions required to access the SQL database vary depending on the authentication method.
The supported authentication methods for exporting data to SQL are Microsoft Entra integrated (impersonation) authentication and username/password authentication. For impersonation authentication, be sure that the principal has the following permissions on the database:
- Existing table: table UPDATE and INSERT
- New table: CREATE, UPDATE, and INSERT
Limitations and restrictions
There are some limitations and restrictions when exporting data to an SQL database:
Kusto is a cloud service, so the connection string must point to a database that is accessible from the cloud. (In particular, one can’t export to an on-premises database since it’s not accessible from the public cloud.)
Kusto supports Active Directory Integrated authentication when the calling principal is a Microsoft Entra principal (aaduser= or aadapp=). Alternatively, Kusto also supports providing the credentials for the SQL database as part of the connection string. Other methods of authentication aren't supported. The identity presented to the SQL database always emanates from the command caller, not from the Kusto service identity.
If the target table in the SQL database exists, it must match the query result schema. In some cases, such as Azure SQL Database, this means that the table has one column marked as an identity column.
Exporting large volumes of data might take a long time. It’s recommended that the target SQL table is set for minimal logging during bulk import. See SQL Server Database Engine > … > Database Features > Bulk Import and Export of Data.
Data export is performed using SQL bulk copy and provides no transactional guarantees on the target SQL database. See Transaction and Bulk Copy Operations.
The SQL table name is restricted to a name consisting of letters, digits, spaces, underscores (_), dots (.), and hyphens (-).
The SQL connection string is restricted as follows: Persist Security Info is explicitly set to false, Encrypt is set to true, and Trust Server Certificate is set to false.
The primary key property on the column can be specified when creating a new SQL table. If the column is of type string, then SQL might refuse to create the table due to other limitations on the primary key column. The workaround is to manually create the table in SQL before exporting the data. This limitation exists because primary key columns in SQL can't be of unlimited size, but Kusto table columns don't have declared size limitations.
For more information, see the Azure database Microsoft Entra integrated authentication documentation.
Examples
Asynchronous export to SQL table
In the following example, Kusto runs the query and then exports the first record set produced by the query to the MySqlTable table in the MyDatabase database in server myserver.
.export async to sql MySqlTable
h@"Server=tcp:myserver.database.windows.net,1433;Authentication=Active Directory Integrated;Initial Catalog=MyDatabase;Connection Timeout=30;"
<| print Id="d3b68d12-cbd3-428b-807f-2c740f561989", Name="YSO4", DateOfBirth=datetime(2017-10-15)
Export to SQL table if it doesn’t exist
In the following example, Kusto runs the query and then exports the first record set produced by the query to the MySqlTable table in the MyDatabase database in server myserver.
The target table is created if it doesn’t exist in the target database.
.export async to sql ['dbo.MySqlTable']
h@"Server=tcp:myserver.database.windows.net,1433;Authentication=Active Directory Integrated;Initial Catalog=MyDatabase;Connection Timeout=30;"
with (createifnotexists="true", primarykey="Id")
<| print Message = "Hello World!", Timestamp = now(), Id=12345678
Related content
3.2 - .export to storage
Executes a query and writes the first result set to an external cloud storage, specified by a storage connection string.
Permissions
You must have at least Database Viewer permissions to run this command.
Syntax
.export [async] [compressed] to OutputDataFormat ( StorageConnectionString [, ...] ) [with ( PropertyName = PropertyValue [, ...] )] <| Query
Parameters
Name | Type | Required | Description |
---|---|---|---|
async | string | | If specified, the command runs in asynchronous mode. See asynchronous mode. |
compressed | bool | | If specified, the output storage artifacts are compressed in the format specified by the compressionType supported property. |
OutputDataFormat | string | ✔️ | The data format of the storage artifacts written by the command. Supported values are: csv , tsv , json , and parquet . |
StorageConnectionString | string | | One or more storage connection strings that specify which storage to write the data to. More than one storage connection string might be specified for scalable writes. Each such connection string must specify the credentials to use when writing to storage. For example, when writing to Azure Blob Storage, the credentials can be the storage account key, or a shared access key (SAS) with the permissions to read, write, and list blobs. When you export data to CSV files using a DFS endpoint, the data goes through a DFS managed private endpoint. When you export data to parquet files, the data goes through a blob managed private endpoint. |
PropertyName, PropertyValue | string | | A comma-separated list of key-value property pairs. See supported properties. |
Supported properties
Property | Type | Description |
---|---|---|
includeHeaders | string | For csv /tsv output, controls the generation of column headers. Can be one of none (default; no header lines emitted), all (emit a header line into every storage artifact), or firstFile (emit a header line into the first storage artifact only). |
fileExtension | string | The “extension” part of the storage artifact (for example, .csv or .tsv ). If compression is used, .gz is appended as well. |
namePrefix | string | The prefix to add to each generated storage artifact name. A random prefix is used if left unspecified. |
encoding | string | The encoding for text. Possible values include: UTF8NoBOM (default) or UTF8BOM . |
compressionType | string | The type of compression to use. For non-Parquet files, only gzip is allowed. For Parquet files, possible values include gzip , snappy , lz4_raw , brotli , and zstd . Default is gzip . |
distribution | string | Distribution hint (single , per_node , per_shard ). If value equals single , a single thread writes to storage. Otherwise, export writes from all nodes executing the query in parallel. See evaluate plugin operator. Defaults to per_shard . |
persistDetails | bool | If true , the command persists its results (see async flag). Defaults to true in async runs, but can be turned off if the caller doesn’t require the results. Defaults to false in synchronous executions, but can be turned on. |
sizeLimit | long | The size limit in bytes of a single storage artifact written before compression. Valid range: 100 MB (default) to 4 GB. |
parquetRowGroupSize | int | Relevant only when data format is Parquet. Controls the row group size in the exported files. Default row group size is 100,000 records. |
distributed | bool | Disable or enable distributed export. Setting to false is equivalent to single distribution hint. Default is true. |
parquetDatetimePrecision | string | The precision to use when exporting datetime values to Parquet. Possible values are millisecond and microsecond. Default is millisecond. |
Authentication and authorization
The authentication method is based on the connection string provided, and the permissions required vary depending on the authentication method.
The following table lists the supported authentication methods and the permissions needed for exporting data to external storage by storage type.
Authentication method | Azure Blob Storage / Data Lake Storage Gen2 | Data Lake Storage Gen1 |
---|---|---|
Impersonation | Storage Blob Data Contributor | Contributor |
Shared Access (SAS) token | Write | Write |
Microsoft Entra access token | No extra permissions required | No extra permissions required |
Storage account access key | No extra permissions required | No extra permissions required |
Returns
The commands return a table that describes the generated storage artifacts. Each record describes a single artifact and includes the storage path to the artifact and how many records it holds.
Asynchronous mode
If the async flag is specified, the command executes in asynchronous mode. In this mode, the command returns immediately with an operation ID, and data export continues in the background until completion. The operation ID returned by the command can be used to track its progress and ultimately its results via the following commands:
- .show operations: Track progress.
- .show operation details: Get completion results.
For example, after a successful completion, you can retrieve the results using:
.show operation f008dc1e-2710-47d8-8d34-0d562f5f8615 details
Examples
In this example, Kusto runs the query and then exports the first recordset produced by the query to one or more compressed CSV blobs, up to 1 GB before compression. Column name labels are added as the first row for each blob.
.export
async compressed
to csv (
h@"https://storage1.blob.core.windows.net/containerName;secretKey",
h@"https://storage1.blob.core.windows.net/containerName2;secretKey"
) with (
sizeLimit=1000000000,
namePrefix="export",
includeHeaders="all",
encoding="UTF8NoBOM"
)
<|
Logs | where id == "1234"
Failures during export commands
Export commands can transiently fail during execution. Continuous export automatically retries the command. Regular export commands (export to storage, export to external table) don’t perform any retries.
- When the export command fails, artifacts already written to storage aren’t deleted. These artifacts remain in storage. If the command fails, assume the export is incomplete, even if some artifacts were written.
- The best way to track both completion of the command and the artifacts exported upon successful completion is by using the .show operations and .show operation details commands.
Storage failures
By default, export commands are distributed such that there might be many concurrent writes to storage. The level of distribution depends on the type of export command:
- The default distribution for the regular .export command is per_shard, which means all extents that contain data to export write to storage concurrently.
- The default distribution for export to external table commands is per_node, which means the concurrency is the number of nodes.
When the number of extents/nodes is large, this might lead to high load on storage that results in storage throttling, or transient storage errors. The following suggestions might overcome these errors (by order of priority):
Increase the number of storage accounts provided to the export command or to the external table definition. The load is evenly distributed between the accounts.
Reduce the concurrency by setting the distribution hint to per_node (see command properties).
Reduce the concurrency of the nodes exporting by setting the client request property query_fanout_nodes_percent to the desired concurrency (percent of nodes). The property can be set as part of the export query. For example, the following command limits the number of nodes writing to storage concurrently to 50% of the nodes:
.export async to csv ( h@"https://storage1.blob.core.windows.net/containerName;secretKey" ) with ( distribution="per_node" ) <| set query_fanout_nodes_percent = 50; ExportQuery
Reduce the concurrency of the threads exporting in each node when using per-shard export, by setting the client request property query_fanout_threads_percent to the desired concurrency (percent of threads). The property can be set as part of the export query. For example, the following command limits the number of threads writing to storage concurrently to 50% on each of the nodes:
.export async to csv ( h@"https://storage1.blob.core.windows.net/containerName;secretKey" ) with ( distribution="per_shard" ) <| set query_fanout_threads_percent = 50; ExportQuery
If exporting to a partitioned external table, setting the spread/concurrency properties can reduce concurrency (see details in the command properties).
If none of the previous recommendations work, you can completely disable distribution by setting the distributed property to false. However, we don't recommend doing so, as it might significantly affect the command performance.
Authorization failures
Authentication or authorization failures during export commands can occur when the credentials provided in the storage connection string aren't permitted to write to storage. If you're using impersonate or a user-delegated SAS token for the export command, the Storage Blob Data Contributor role is required to write to the storage account. For more information, see Storage connection strings.
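For example, a connection string that relies on impersonation (a sketch; the storage account, container, and MyTable are placeholders) ends with ;impersonate instead of embedding a key or SAS token:
.export to csv ( h@'https://mystorageaccount.blob.core.windows.net/containerName;impersonate' ) <| MyTable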
Data types mapping
Parquet data types mapping
On export, Kusto data types are mapped to Parquet data types using the following rules:
Kusto Data Type | Parquet Data Type | Parquet Annotation | Comments |
---|---|---|---|
bool | BOOLEAN | ||
datetime | INT64 | TIMESTAMP_MICROS | |
dynamic | BYTE_ARRAY | UTF-8 | Serialized as JSON string |
guid | BYTE_ARRAY | UTF-8 | |
int | INT32 | ||
long | INT64 | ||
real | DOUBLE | ||
string | BYTE_ARRAY | UTF-8 | |
timespan | INT64 | Stored as ticks (100-nanosecond units) count | |
decimal | FIXED_LENGTH_BYTE_ARRAY | DECIMAL |
Related content
3.3 - .export to table
You can export data by defining an external table and exporting data to it. The table properties are specified when creating the external table. The export command references the external table by name.
Permissions
You must have at least Table Admin permissions to run this command.
Syntax
.export [async] to table externalTableName [with ( propertyName = propertyValue [, ...] )] <| query
Parameters
Name | Type | Required | Description |
---|---|---|---|
externalTableName | string | ✔️ | The name of the external table to which to export. |
propertyName, propertyValue | string | | A comma-separated list of optional properties. |
query | string | ✔️ | The export query. |
Supported properties
The following properties are supported as part of the export to external table command.
Property | Type | Description | Default |
---|---|---|---|
sizeLimit | long | The size limit in bytes of a single storage artifact written before compression. A full row group of size parquetRowGroupSize is written before checking whether this row group reaches the size limit and should start a new artifact. Valid range: 100 MB (default) to 1 GB. | |
distributed | bool | Disable or enable distributed export. Setting to false is equivalent to single distribution hint. | true |
distribution | string | Distribution hint (single , per_node , per_shard ). See more details in Distribution settings | per_node |
distributionKind | string | Optionally switches to uniform distribution when the external table is partitioned by string partition. Valid values are uniform or default . See more details in Distribution settings | |
concurrency | Number | Hints the system how many partitions to run in parallel. See more details in Distribution settings | 16 |
spread | Number | Hints the system how to distribute the partitions among nodes. See more details in Distribution settings | Min(64, number-of-nodes) |
parquetRowGroupSize | int | Relevant only when data format is Parquet. Controls the row group size in the exported files. This value takes precedence over sizeLimit , meaning a full row group will be exported before checking whether this row group reaches the size limit and should start a new artifact. | 100,000 |
Distribution settings
The distribution of an export to external table operation indicates the number of nodes and threads that are writing to storage concurrently. The default distribution depends on the external table partitioning:
External table partitioning | Default distribution |
---|---|
External table isn’t partitioned, or partitioned by datetime column only | Export is distributed per_node - all nodes are exporting concurrently. Each node writes the data assigned to that node. The number of files exported by a node is greater than one, only if the size of the data from that node exceeds sizeLimit . |
External table is partitioned by a string column | The data to export is moved between the nodes, such that each node writes a subset of the partition values. A single partition is always written by a single node. The number of files written per partition should be greater than one only if the data exceeds sizeLimit . If the external table includes several string partitions, then data is partitioned between the node based on the first partition. Therefore, the recommendation is to define the partition with most uniform distribution as the first one. |
Change the default distribution settings
Changing the default distribution settings can be useful in the following cases:
Use case | Description | Recommendation |
---|---|---|
Reduce the number of exported files | Export is creating too many small files, and you would like it to create a smaller number of larger files. | Set distribution =single or distributed =false (both are equivalent) in the command properties. Only a single thread performs the export. The downside of this is that the export operation can be slower, as concurrency is much reduced. |
Reduce the export duration | Increasing the concurrency of the export operation, to reduce its duration. | Set distribution =per_shard in the command properties. Doing so means concurrency of the write operations is per data shard, instead of per node. This is only relevant when exporting to an external table that isn’t partitioned by string partition. This might create too much load on storage, potentially resulting in throttling. See Storage failures. |
Reduce the export duration for external tables that are partitioned by a string partition | If the partitions aren’t uniformly distributed between the nodes, export might take a longer time to run. If one partition is much larger than the others, the node assigned to that partition does most of the export work, while the other nodes remain mostly idle. For more information, see Distribution settings. | There are several settings you can change: * If there’s more than one string partition, define the one with best distribution first. * Set distributionKind =uniform in the command properties. This setting disables the default distribution settings for string-partitioned external tables. Export runs with per-node distribution and each node exports the data assigned to the node. A single partition might be written by several nodes, and the number of files increases accordingly. To increase concurrency even further, set distributionKind =uniform along with distribution =per_shard for highest concurrency (at the cost of potentially many more files written)* If the cause for slow export isn’t outliers in the data, reduce duration by increasing concurrency, without changing partitioning settings. Use the hint.spread and hint.concurrency properties, which determine the concurrency of the partitioning. See partition operator. By default, the number of nodes exporting concurrently (the spread ) is the minimum value between 64 and the number of nodes. Setting spread to a higher number than number of nodes increases the concurrency on each node (max value for spread is 64). |
Authentication and authorization
In order to export to an external table, you must set up write permissions. For more information, see the Write permissions for Azure Storage external table or SQL Server external table.
Output
Output parameter | Type | Description |
---|---|---|
ExternalTableName | string | The name of the external table. |
Path | string | Output path. |
NumRecords | string | Number of records exported to path. |
Notes
The export query output schema must match the schema of the external table, including all columns defined by the partitions. For example, if the table is partitioned by DateTime, the query output schema must have a Timestamp column matching the TimestampColumnName. This column name is defined in the external table partitioning definition.
It isn’t possible to override the external table properties using the export command. For example, you can’t export data in Parquet format to an external table whose data format is CSV.
If the external table is partitioned, exported artifacts are written to their respective directories according to the partition definitions. For an example, see partitioned external table example.
- If a partition value is null/empty or is an invalid directory value, per the definitions of the target storage, the partition value is replaced with a default value of __DEFAULT_PARTITION__.
For suggestions to overcome storage errors during export commands, see failures during export commands.
External table columns are mapped to suitable target format data types, according to data types mapping rules.
Parquet native export is a more performant, resource-light export mechanism. An exported datetime column is currently unsupported by Synapse SQL COPY.
Number of files
The number of files written per partition depends on the distribution settings of the export operation:
- If the external table includes datetime partitions only, or no partitions at all, the number of files written for each partition that exists should be similar to the number of nodes (or more, if sizeLimit is reached). When the export operation is distributed, all nodes export concurrently. To disable distribution, so that only a single node does the writes, set distributed to false. This process creates fewer files, but reduces the export performance.
- If the external table includes a partition by a string column, the number of exported files should be a single file per partition (or more, if sizeLimit is reached). All nodes still participate in the export (the operation is distributed), but each partition is assigned to a specific node. Setting distributed to false causes only a single node to do the export, but the behavior remains the same (a single file written per partition).
Examples
Non-partitioned external table example
The following example exports data from table T to the ExternalBlob table. ExternalBlob is a non-partitioned external table.
.export to table ExternalBlob <| T
Output
ExternalTableName | Path | NumRecords |
---|---|---|
ExternalBlob | http://storage1.blob.core.windows.net/externaltable1cont1/1_58017c550b384c0db0fea61a8661333e.csv | 10 |
Partitioned external table example
The following example first creates a partitioned external table, PartitionedExternalBlob, with a specified blob storage location. The data is stored in CSV format with a path format which organizes the data by customer name and date.
.create external table PartitionedExternalBlob (Timestamp:datetime, CustomerName:string)
kind=blob
partition by (CustomerName:string=CustomerName, Date:datetime=startofday(Timestamp))
pathformat = ("CustomerName=" CustomerName "/" datetime_pattern("yyyy/MM/dd", Date))
dataformat=csv
(
h@'http://storageaccount.blob.core.windows.net/container1;secretKey'
)
It then exports data from table T to the PartitionedExternalBlob external table.
.export to table PartitionedExternalBlob <| T
Output
If the command is executed asynchronously by using the async keyword, the output is available using the .show operation details command.
Related content
3.4 - Data export
Data export involves executing a Kusto query and saving its results. This process can be carried out either on the client side or the service side.
For examples on data export, see Related content.
Client-side export
Client-side export gives you control over saving query results either to the local file system or pushing them to a preferred storage location. This flexibility is facilitated by using Kusto client libraries. You can create an app to run queries, read the desired data, and implement an export process tailored to your requirements.
Alternatively, you can use a client tool like the Azure Data Explorer web UI to export data from your Kusto cluster. For more information, see Share queries.
Service-side export (pull)
Use the ingest from query commands to pull query results into a table in the same or different database. See the performance tips before using these commands.
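For example, the following command (a sketch; MyTargetTable and MySourceTable are placeholder names) appends the results of a query to a target table, creating the table if it doesn't already exist:
.set-or-append MyTargetTable <| MySourceTable | where Timestamp > ago(1d)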
Service-side export (push)
For scalable data export, the service offers various .export management commands to push query results to cloud storage, an external table, or an SQL table. This approach enhances scalability by avoiding the bottleneck of streaming through a single network connection.
Continuous data export is supported for export to external tables.
Related content
3.5 - Continuous data export
3.5.1 - .create or alter continuous-export
Creates or alters a continuous export job.
Permissions
You must have at least Database Admin permissions to run this command.
Syntax
.create-or-alter continuous-export continuousExportName [over ( T1, T2 )] to table externalTableName [with ( propertyName = propertyValue [, ...] )] <| query
Parameters
Name | Type | Required | Description |
---|---|---|---|
continuousExportName | string | ✔️ | The name of the continuous export. Must be unique within the database. |
externalTableName | string | ✔️ | The name of the external table export target. |
query | string | ✔️ | The query to export. |
T1, T2 | string | | A comma-separated list of fact tables in the query. If not specified, all tables referenced in the query are assumed to be fact tables. If specified, tables not in this list are treated as dimension tables and aren't scoped, so all records participate in all exports. See continuous data export overview for details. |
propertyName, propertyValue | string | | A comma-separated list of optional properties. |
Supported properties
Property | Type | Description |
---|---|---|
intervalBetweenRuns | Timespan | The time span between continuous export executions. Must be greater than 1 minute. |
forcedLatency | Timespan | An optional period of time to limit the query to records ingested before a specified period relative to the current time. This property is useful if, for example, the query performs some aggregations or joins, and you want to make sure all relevant records have been ingested before running the export. |
sizeLimit | long | The size limit in bytes of a single storage artifact written before compression. Valid range: 100 MB (default) to 1 GB. |
distributed | bool | Disable or enable distributed export. Setting to false is equivalent to single distribution hint. Default is true. |
parquetRowGroupSize | int | Relevant only when data format is Parquet. Controls the row group size in the exported files. Default row group size is 100,000 records. |
managedIdentity | string | The managed identity for which the continuous export job runs. The managed identity can be an object ID, or the system reserved word. For more information, see Use a managed identity to run a continuous export job. |
isDisabled | bool | Disable or enable the continuous export. Default is false. |
Example
The following example creates or alters a continuous export MyExport that exports data from the T table to ExternalBlob. The data exports occur every hour, and have a defined forced latency and size limit per storage artifact.
.create-or-alter continuous-export MyExport
over (T)
to table ExternalBlob
with
(intervalBetweenRuns=1h,
forcedLatency=10m,
sizeLimit=104857600)
<| T
Name | ExternalTableName | Query | ForcedLatency | IntervalBetweenRuns | CursorScopedTables | ExportProperties |
---|---|---|---|---|---|---|
MyExport | ExternalBlob | S | 00:10:00 | 01:00:00 | [ “[‘DB’].[‘S’]" ] | { “SizeLimit”: 104857600 } |
Related content
3.5.2 - .drop continuous-export
Drops a continuous-export job.
Permissions
You must have at least Database Admin permissions to run this command.
Syntax
.drop continuous-export ContinuousExportName
Parameters
Name | Type | Required | Description |
---|---|---|---|
ContinuousExportName | string | ✔️ | The name of the continuous export. |
Returns
The remaining continuous exports in the database (post deletion). Output schema as in the show continuous export command.
Related content
3.5.3 - .show continuous data-export failures
Returns all failures logged as part of the continuous export within the past 14 days. To view only a specific time range, filter the results by the Timestamp column.
If executed on a follower database or a database shortcut, the command doesn't return any results; it must be executed against the leader database.
Permissions
You must have at least Database Monitor or Database Admin permissions to run this command. For more information, see role-based access control.
Syntax
.show continuous-export ContinuousExportName failures
Parameters
Name | Type | Required | Description |
---|---|---|---|
ContinuousExportName | string | ✔️ | The name of the continuous export. |
Returns
Output parameter | Type | Description |
---|---|---|
Timestamp | datetime | Timestamp of the failure. |
OperationId | string | Operation ID of the failure. |
Name | string | Continuous export name. |
LastSuccessRun | Timestamp | The last successful run of the continuous export. |
FailureKind | string | Failure/PartialFailure. PartialFailure indicates some artifacts were exported successfully before the failure occurred. |
Details | string | Failure error details. |
Example
The following example shows failures from the continuous export MyExport.
.show continuous-export MyExport failures
Output
Timestamp | OperationId | Name | LastSuccessRun | FailureKind | Details |
---|---|---|---|---|---|
2019-01-01 11:07:41.1887304 | ec641435-2505-4532-ba19-d6ab88c96a9d | MyExport | 2019-01-01 11:06:35.6308140 | Failure | Details… |
Related content
3.5.4 - .show continuous-export
Returns the properties of a specified continuous export or all continuous exports in the database.
Permissions
You must have at least Database User, Database Viewer, or Database Monitor permissions to run this command. For more information, see role-based access control.
Syntax
.show continuous-export ContinuousExportName
.show continuous-exports
Parameters
Name | Type | Required | Description |
---|---|---|---|
ContinuousExportName | string | ✔️ | The name of the continuous export. |
Returns
Output parameter | Type | Description |
---|---|---|
CursorScopedTables | string | The list of explicitly scoped (fact) tables (JSON serialized). |
ExportProperties | string | The export properties (JSON serialized). |
ExportedTo | datetime | The last datetime (ingestion time) that was exported successfully. |
ExternalTableName | string | The external table name. |
ForcedLatency | timeSpan | The forced latency timespan, if defined. Returns Null if no timespan is defined. |
IntervalBetweenRuns | timeSpan | The interval between runs. |
IsDisabled | bool | A boolean value indicating whether the continuous export is disabled. |
IsRunning | bool | A boolean value indicating whether the continuous export is currently running. |
LastRunResult | string | The results of the last continuous-export run (Completed or Failed ). |
LastRunTime | datetime | The last time the continuous export was executed (start time) |
Name | string | The name of the continuous export. |
Query | string | The export query. |
StartCursor | string | The starting point of the first execution of this continuous export. |
Related content
3.5.5 - .show continuous-export exported-artifacts
Returns all artifacts exported by the continuous-export in all runs. Filter the results by the Timestamp column in the command to view only records of interest. The history of exported artifacts is retained for 14 days.
If executed on a follower database or a database shortcut, the command doesn't return any results; it must be executed against the leader database.
Permissions
You must have at least Database Monitor or Database Admin permissions to run this command. For more information, see role-based access control.
Syntax
.show continuous-export ContinuousExportName exported-artifacts
Parameters
Name | Type | Required | Description |
---|---|---|---|
ContinuousExportName | string | ✔️ | The name of the continuous export. |
Returns
Output parameter | Type | Description |
---|---|---|
Timestamp | datetime | The timestamp of the continuous export run |
ExternalTableName | string | Name of the external table |
Path | string | Output path |
NumRecords | long | Number of records exported to path |
Example
The following example shows retrieved artifacts from the continuous export MyExport that were exported within the last hour.
.show continuous-export MyExport exported-artifacts | where Timestamp > ago(1h)
Output
Timestamp | ExternalTableName | Path | NumRecords | SizeInBytes |
---|---|---|---|---|
2018-12-20 07:31:30.2634216 | ExternalBlob | http://storageaccount.blob.core.windows.net/container1/1_6ca073fd4c8740ec9a2f574eaa98f579.csv | 10 | 1024 |
Related content
3.5.6 - Continuous data export
This article describes continuous export of data from Kusto to an external table with a periodically run query. The results are stored in the external table, which defines the destination, such as Azure Blob Storage, and the schema of the exported data. This process guarantees that all records are exported “exactly once”, with some exceptions.
By default, continuous export runs in a distributed mode, where all nodes export concurrently, so the number of artifacts depends on the number of nodes. Continuous export isn’t designed for low-latency streaming data.
To enable continuous data export, create an external table and then create a continuous export definition pointing to the external table.
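For example, a minimal setup (a sketch only; the storage URI, MyTable, and the column names are placeholders, and impersonation authentication is assumed) consists of an external table followed by a continuous export definition that targets it:
.create external table MyExternalTable (Timestamp:datetime, Message:string) kind=storage dataformat=csv ( h@'https://mystorageaccount.blob.core.windows.net/container;impersonate' )
.create-or-alter continuous-export MyContinuousExport over (MyTable) to table MyExternalTable with (intervalBetweenRuns=10m) <| MyTable | project Timestamp, Message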
In some cases, you must use a managed identity to successfully configure a continuous export job. For more information, see Use a managed identity to run a continuous export job.
Permissions
All continuous export commands require at least Database Admin permissions.
Continuous export guidelines
Output schema:
- The output schema of the export query must match the schema of the external table to which you export.
Frequency:
Continuous export runs according to the time period configured for it in the intervalBetweenRuns property. The recommended value for this interval is at least several minutes, depending on the latencies you're willing to accept. The time interval can be as low as one minute, if the ingestion rate is high.
[!NOTE] The intervalBetweenRuns serves as a recommendation only, and isn't guaranteed to be precise. Continuous export isn't suitable for exporting periodic aggregations. For example, a configuration of intervalBetweenRuns=1h with an hourly aggregation (T | summarize by bin(Timestamp, 1h)) won't work as expected, since the continuous export won't run exactly on the hour. Therefore, each hourly bin will receive multiple entries in the exported data.
Number of files:
- The number of files exported in each continuous export iteration depends on how the external table is partitioned. For more information, see the export to external table command. Each continuous export iteration always writes to new files, and never appends to existing ones. As a result, the number of exported files also depends on the frequency in which the continuous export runs. The frequency parameter is intervalBetweenRuns.
External table storage accounts:
- For best performance, the database and the storage accounts should be colocated in the same Azure region.
- Continuous export works in a distributed manner, such that all nodes are exporting concurrently. On large databases, and if the exported data volume is large, this might lead to storage throttling. The recommendation is to configure multiple storage accounts for the external table. For more information, see storage failures during export commands.
Exactly once export
To guarantee “exactly once” export, continuous export uses database cursors. The continuous export query shouldn’t include a timestamp filter - the database cursors mechanism ensures that records aren’t processed more than once. Adding a timestamp filter in the query can lead to missing data in exported data.
IngestionTime policy must be enabled on all tables referenced in the query that should be processed “exactly once” in the export. The policy is enabled by default on all newly created tables.
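For example, the following commands (a sketch; MyTable is a placeholder) check whether the IngestionTime policy is enabled on a table and turn it on if needed:
.show table MyTable policy ingestiontime
.alter table MyTable policy ingestiontime true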
The guarantee for “exactly once” export is only for files reported in the show exported artifacts command. Continuous export doesn't guarantee that each record is written only once to the external table. If a failure occurs after export begins and some of the artifacts were already written to the external table, the external table might contain duplicates. If a write operation was aborted before completion, the external table might contain corrupted files. In such cases, artifacts aren't deleted from the external table, but they aren't reported in the show exported artifacts command. Consuming the exported files using the show exported artifacts command guarantees no duplications and no corruptions.
Export from fact and dimension tables
By default, all tables referenced in the export query are assumed to be fact tables. As such, they're scoped to the database cursor. The syntax explicitly declares which tables are scoped (fact) and which aren't scoped (dimension). See the over parameter in the create command for details.
The export query includes only the records that joined since the previous export execution. The export query might contain dimension tables in which all records of the dimension table are included in all export queries. When using joins between fact and dimension tables in continuous export, keep in mind that records in the fact table are only processed once. If the export runs while records in the dimension tables are missing for some keys, records for the respective keys are either missed or include null values for the dimension columns in the exported files. Returning missed or null records depends on whether the query uses inner or outer join. The forcedLatency property in the continuous-export definition can be useful in such cases, where the fact and dimension tables are ingested at around the same time for matching records.
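As a hypothetical sketch of the over parameter (MySales, Customers, and SalesExternalTable are placeholder names), the following definition scopes only MySales as a fact table, while Customers is treated as a dimension table whose records all participate in every export:
.create-or-alter continuous-export SalesExport over (MySales) to table SalesExternalTable with (intervalBetweenRuns=10m) <| MySales | join kind=leftouter (Customers) on CustomerId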
Monitor continuous export
Monitor the health of your continuous export jobs using the following export metrics:
- Continuous export max lateness: Max lateness (in minutes) of continuous exports in the database. This is the time between now and the minimum ExportedTo time of all continuous export jobs in the database. For more information, see the .show continuous-export command.
- Continuous export result: Success/failure result of each continuous export execution. This metric can be split by the continuous export name.
Use the .show continuous-export failures command to see the specific failures of a continuous export job.
Resource consumption
- The impact of the continuous export on the database depends on the query the continuous export is running. Most resources, such as CPU and memory, are consumed by the query execution.
- The number of export operations that can run concurrently is limited by the database’s data export capacity. For more information, see Management commands throttling. If the database doesn’t have sufficient capacity to handle all continuous exports, some start lagging behind.
- The .show commands-and-queries command can be used to estimate resource consumption. Filter on | where ClientActivityId startswith "RunContinuousExports" to view the commands and queries associated with continuous export, as shown in the example after this list.
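For example, the following commands (a sketch) check the remaining data export capacity and list the commands and queries issued by continuous export jobs:
.show capacity | where Resource == "DataExport"
.show commands-and-queries | where ClientActivityId startswith "RunContinuousExports"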
Export historical data
Continuous export starts exporting data only from the point of its creation. Records ingested before that time should be exported separately using the non-continuous export command. Historical data might be too large to be exported in a single export command. If needed, partition the query into several smaller batches.
To avoid duplicates with data exported by continuous export, use the StartCursor returned by the show continuous export command and export only records where cursor_before_or_at the cursor value. For example:
.show continuous-export MyExport | project StartCursor
StartCursor |
---|
636751928823156645 |
Followed by:
.export async to table ExternalBlob
<| T | where cursor_before_or_at("636751928823156645")
Continuous export from a table with Row Level Security
To create a continuous export job with a query that references a table with Row Level Security policy, you must:
- Provide a managed identity as part of the continuous export configuration. For more information, see Use a managed identity to run a continuous export job.
- Use impersonation authentication for the external table to which the data is exported.
Continuous export to delta table - Preview
Continuous export to a delta table is currently in preview.
To define continuous export to a delta table, do the following steps:
Create an external delta table, as described in Create and alter delta external tables on Azure Storage.
[!NOTE] If the schema isn't provided, Kusto will try to infer it automatically if there's already a delta table defined in the target storage container. Delta table partitioning isn't supported.
Define continuous export to this table using the commands described in Create or alter continuous export.
[!IMPORTANT] The schema of the delta table must be in sync with the continuous export query. If the underlying delta table changes, the export might start failing with unexpected behavior.
Limitations
General:
- The following formats are allowed on target tables: CSV, TSV, JSON, and Parquet.
- Continuous export isn't designed to work over materialized views, since a materialized view might be updated, while data exported to storage is always appended and never updated.
- Continuous export can’t be created on follower databases since follower databases are read-only and continuous export requires write operations.
- Records in source table must be ingested to the table directly, using an update policy, or ingest from query commands. If records are moved into the table using .move extents or using .rename table, continuous export might not process these records. See the limitations described in the Database Cursors page.
- If the artifacts used by continuous export are intended to trigger Event Grid notifications, see the known issues section in the Event Grid documentation.
Cross-database and cross-cluster:
- Continuous export doesn’t support cross-cluster calls.
- Continuous export supports cross-database calls only for dimension tables. All fact tables must reside in the local database. See more details in Export from fact and dimension tables.
- If the continuous export includes cross-database calls, it must be configured with a managed identity.
Cross-database and cross-Eventhouse:
- Continuous export doesn’t support cross-Eventhouse calls.
- Continuous export supports cross-database calls only for dimension tables. All fact tables must reside in the local database. See more details in Export from fact and dimension tables.
Policies:
- Continuous export can’t be enabled on a table with Row Level Security policy unless specific conditions are met. For more information, see Continuous export from a table with Row Level Security.
- Continuous export can’t be configured on a table with restricted view access policy.
Related content
3.5.7 - Enable or disable continuous data export
Disables or enables the continuous-export job. A disabled continuous export isn’t executed, but its current state is persisted and can be resumed when the continuous export is enabled.
When enabling a continuous export that was disabled for a long time, exporting continues from where it last stopped when the exporting was disabled. This continuation might result in a long running export, blocking other exports from running, if there isn’t sufficient database capacity to serve all processes. Continuous exports are executed by last run time in ascending order so the oldest export runs first, until catch up is complete.
Permissions
You must have at least Database Admin permissions to run these commands.
Syntax
.enable continuous-export ContinuousExportName
.disable continuous-export ContinuousExportName
Parameters
Name | Type | Required | Description |
---|---|---|---|
ContinuousExportName | string | ✔️ | The name of the continuous export. |
Returns
The result of the show continuous export command of the altered continuous export.
Related content
3.5.8 - Use a managed identity to run a continuous export job
A continuous export job exports data to an external table with a periodically run query.
The continuous export job should be configured with a managed identity in the following scenarios:
- When the external table uses impersonation authentication
- When the query references tables in other databases
- When the query references tables with an enabled row level security policy
A continuous export job configured with a managed identity is performed on behalf of the managed identity.
In this article, you learn how to configure a system-assigned or user-assigned managed identity and create a continuous export job using that identity.
Prerequisites
- A cluster and database. Create a cluster and database.
- All Databases Admin permissions on the database.
Configure a managed identity
There are two types of managed identities:
System-assigned: A system-assigned identity is connected to your cluster and is removed when the cluster is removed. Only one system-assigned identity is allowed per cluster.
User-assigned: A user-assigned managed identity is a standalone Azure resource. Multiple user-assigned identities can be assigned to your cluster.
Select one of the following tabs to set up your preferred managed identity type.
User-assigned
Follow the steps to Add a user-assigned identity.
In the Azure portal, in the left menu of your managed identity resource, select Properties. Copy and save the Tenant Id and Principal Id for use in the following steps.
Run the following .alter-merge policy managed_identity command, replacing <objectId> with the managed identity object ID from the previous step. This command sets a managed identity policy on the cluster that allows the managed identity to be used with continuous export.
.alter-merge cluster policy managed_identity ```[ { "ObjectId": "<objectId>", "AllowedUsages": "AutomatedFlows" } ]```
[!NOTE] To set the policy on a specific database, use database <DatabaseName> instead of cluster.
Run the following command to grant the managed identity Database Viewer permissions over all databases used for the continuous export, such as the database that contains the external table.
.add database <DatabaseName> viewers ('aadapp=<objectId>;<tenantId>')
Replace <DatabaseName> with the relevant database, <objectId> with the managed identity Principal Id from step 2, and <tenantId> with the Microsoft Entra ID Tenant Id from step 2.
System-assigned
Follow the steps to Add a system-assigned identity.
Copy and save the Object (principal) ID for use in a later step.
Run the following .alter-merge policy managed_identity command. This command sets a managed identity policy on the cluster that allows the managed identity to be used with continuous export.
.alter-merge cluster policy managed_identity ```[ { "ObjectId": "system", "AllowedUsages": "AutomatedFlows" } ]```
[!NOTE] To set the policy on a specific database, use database <DatabaseName> instead of cluster.
Run the following command to grant the managed identity Database Viewer permissions over all databases used for the continuous export, such as the database that contains the external table.
.add database <DatabaseName> viewers ('aadapp=<objectId>')
Replace <DatabaseName> with the relevant database and <objectId> with the managed identity Object (principal) ID from step 2.
Set up an external table
External tables refer to data located in Azure Storage, such as Azure Blob Storage, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, or SQL Server.
Select one of the following tabs to set up an Azure Storage or SQL Server external table.
Azure Storage
Create a connection string based on the storage connection string templates. This string indicates the resource to access and its authentication information. For continuous export flows, we recommend impersonation authentication.
Run the .create or .alter external table command to create the table. Use the connection string from the previous step as the storageConnectionString argument.
For example, the following command creates MyExternalTable that refers to CSV-formatted data in mycontainer of mystorageaccount in Azure Blob Storage. The table has two columns, one for an integer x and one for a string s. The connection string ends with ;impersonate, which indicates to use impersonation authentication to access the data store.
.create external table MyExternalTable (x:int, s:string) kind=storage dataformat=csv ( h@'https://mystorageaccount.blob.core.windows.net/mycontainer;impersonate' )
Grant the managed identity write permissions over the relevant external data store. The managed identity needs write permissions because the continuous export job exports data to the data store on behalf of the managed identity.
External data store | Required permissions | Grant the permissions |
---|---|---|
Azure Blob Storage | Storage Blob Data Contributor | Assign an Azure role |
Data Lake Storage Gen2 | Storage Blob Data Contributor | Assign an Azure role |
Data Lake Storage Gen1 | Contributor | Assign an Azure role |
SQL Server
Create a SQL Server connection string. This string indicates the resource to access and its authentication information. For continuous export flows, we recommend Microsoft Entra integrated authentication, which is impersonation authentication.
Run the .create or .alter external table command to create the table. Use the connection string from the previous step as the sqlServerConnectionString argument.
For example, the following command creates MySqlExternalTable that refers to the MySqlTable table in MyDatabase of SQL Server. The table has two columns, one for an integer x and one for a string s. The connection string contains ;Authentication=Active Directory Integrated, which indicates to use impersonation authentication to access the table.
.create external table MySqlExternalTable (x:int, s:string) kind=sql table=MySqlTable ( h@'Server=tcp:myserver.database.windows.net,1433;Authentication=Active Directory Integrated;Initial Catalog=MyDatabase;' )
Grant the managed identity CREATE, UPDATE, and INSERT permissions over the SQL Server database. The managed identity needs write permissions because the continuous export job exports data to the database on behalf of the managed identity. To learn more, see Permissions.
Create a continuous export job
Select one of the following tabs to create a continuous export job that runs on behalf of a user-assigned or system-assigned managed identity.
User-assigned
Run the .create-or-alter continuous-export command with the managedIdentity property set to the managed identity object ID.
For example, the following command creates a continuous export job named MyExport to export the data in MyTable to MyExternalTable on behalf of a user-assigned managed identity. <objectId> should be a managed identity object ID.
.create-or-alter continuous-export MyExport over (MyTable) to table MyExternalTable with (managedIdentity=<objectId>, intervalBetweenRuns=5m) <| MyTable
System-assigned
Run the .create-or-alter continuous-export command with the managedIdentity property set to system.
For example, the following command creates a continuous export job named MyExport to export the data in MyTable to MyExternalTable on behalf of your system-assigned managed identity.
.create-or-alter continuous-export MyExport over (MyTable) to table MyExternalTable with (managedIdentity="system", intervalBetweenRuns=5m) <| MyTable
Related content
4 - Data ingestion
4.1 - .ingest inline command (push)
The .ingest inline command (push) inserts data into a table by pushing the data included within the command to the table.
Permissions
You must have at least Table Ingestor permissions to run this command.
Syntax
.ingest inline into table TableName [with ( IngestionPropertyName = IngestionPropertyValue [, …] )] <| Data

.ingest inline into table TableName [with ( IngestionPropertyName = IngestionPropertyValue [, …] )] [ Data ]
Parameters
Name | Type | Required | Description |
---|---|---|---|
TableName | string | ✔️ | The name of the table into which to ingest data. The table name is always relative to the database in context. Its schema is the default schema assumed for the data if no schema mapping object is provided. |
Data | string | ✔️ | The data content to ingest. Unless otherwise modified by the ingestion properties, this content is parsed as CSV. |
IngestionPropertyName, IngestionPropertyValue | string | | Any number of ingestion properties that affect the ingestion process. |
Returns
The result is a table with as many records as the number of generated data shards (“extents”). If no data shards are generated, a single record is returned with an empty (zero-valued) extent ID.
Name | Type | Description |
---|---|---|
ExtentId | guid | The unique identifier for the data shard that’s generated by the command. |
Examples
Ingest with <| syntax
The following command ingests data into a table Purchases with two columns: SKU (of type string) and Quantity (of type long).
.ingest inline into table Purchases <|
Shoes,1000
Wide Shoes,50
"Coats black",20
"Coats with ""quotes""",5
Ingest with bracket syntax
The following command ingests data into a table Logs with two columns: Date (of type datetime) and EventDetails (of type dynamic).
.ingest inline into table Logs
[2015-01-01,"{""EventType"":""Read"", ""Count"":""12""}"]
[2015-01-01,"{""EventType"":""Write"", ""EventValue"":""84""}"]
Related content
4.2 - .show data operations
Use the .show data operations command to return data operations that reached a final state. The command returns a table of these operations, which are available for 30 days from when they ran.
Any operation that results in new extents (data shards) added to the system is considered a data operation.
Permissions
You must have Database Admin or Database Monitor permissions to see any data operations invoked on your database.
Any user can see their own data operations.
For more information, see Kusto role-based access control.
Syntax
.show data operations
Returns
This command returns a table with the following columns:
Output parameter | Type | Description |
---|---|---|
Timestamp | datetime | The time when the operation reached its final state. |
Database | string | The database name. |
Table | string | The table name. |
ClientActivityId | string | The operation client activity ID. |
OperationKind | string | One of BatchIngest , SetOrAppend , RowStoreSeal , MaterializedView , QueryAcceleration , or UpdatePolicy . |
OriginalSize | long | The original size of the ingested data. |
ExtentSize | long | The extent size. |
RowCount | long | The number of rows in the extent. |
ExtentCount | int | The number of extents. |
TotalCpu | timespan | The total CPU time used by the data operation. |
Duration | timespan | The duration of the operation. |
Principal | string | The identity that initiated the data operation. |
Properties | dynamic | Additional information about the data operation. |
Example
The following example returns information about UpdatePolicy
, BatchIngest
, and SetOrAppend
operations.
.show data operations
Output
Timestamp | Database | Table | ClientActivityId | OperationKind | OriginalSize | ExtentSize | RowCount | ExtentCount | TotalCpu | Duration | Principal | Properties |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2024-07-18 15:21:10.5432134 | TestLogs | UTResults | DM.IngestionExecutor;abcd1234-1234-1234-abcd-1234abcdce;1 | UpdatePolicy | 100,829 | 75,578 | 279 | 1 | 00:00:00.2656250 | 00:00:28.9101535 | aadapp=xxx | {"SourceTable": "UTLogs"} |
2024-07-18 15:21:12.9481819 | TestLogs | UTLogs | DM.IngestionExecutor;abcd1234-1234-1234-abcd-1234abcdce;1 | BatchIngest | 1,045,027,298 | 123,067,947 | 1,688,705 | 2 | 00:00:22.9843750 | 00:00:29.9745733 | aadapp=xxx | {"Format": "Csv", "NumberOfInputStreams": 2} |
2024-07-18 15:21:16.1095441 | KustoAuto | IncidentKustoGPTSummary | cdef12345-6789-ghij-0123-klmn45678 | SetOrAppend | 1,420 | 3,190 | 1 | 1 | 00:00:00.0156250 | 00:00:00.0638211 | aaduser=xxx | |
4.3 - Data formats supported for ingestion
Data ingestion is the process by which data is added to a table and is made available for query. For all ingestion methods, other than ingest-from-query, the data must be in one of the supported formats. The following table lists and describes the formats that are supported for data ingestion.
For more information about why ingestion might fail, see Ingestion failures and Ingestion error codes in Azure Data Explorer.
Format | Extension | Description |
---|---|---|
ApacheAvro | .avro | An AVRO format with support for logical types. The following compression codecs are supported: null , deflate , and snappy . Reader implementation of the apacheavro format is based on the official Apache Avro library. For information about ingesting Event Hub Capture Avro files, see Ingesting Event Hub Capture Avro files. |
Avro | .avro | A legacy implementation for AVRO format based on .NET library. The following compression codecs are supported: null , deflate (for snappy - use ApacheAvro data format). |
CSV | .csv | A text file with comma-separated values (, ). See RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files. |
JSON | .json | A text file with JSON objects delimited by \n or \r\n . See JSON Lines (JSONL). |
MultiJSON | .multijson | A text file with a JSON array of property bags (each representing a record), or any number of property bags delimited by whitespace, \n or \r\n . Each property bag can be spread on multiple lines. |
ORC | .orc | An ORC file. |
Parquet | .parquet | A Parquet file. |
PSV | .psv | A text file with pipe-separated values (| ). |
RAW | .raw | A text file whose entire contents is a single string value. |
SCsv | .scsv | A text file with semicolon-separated values (; ). |
SOHsv | .sohsv | A text file with SOH-separated values. (SOH is ASCII codepoint 1; this format is used by Hive on HDInsight.) |
TSV | .tsv | A text file with tab-separated values (\t ). |
TSVE | .tsv | A text file with tab-separated values (\t ). A backslash character (\ ) is used for escaping. |
TXT | .txt | A text file with lines delimited by \n . Empty lines are skipped. |
W3CLOGFILE | .log | Web log file format standardized by the W3C. |
For more information on ingesting data using the json or multijson formats, see ingest json formats.
Supported data compression formats
Blobs and files can be compressed through any of the following compression algorithms:
Compression | Extension |
---|---|
gzip | .gz |
zip | .zip |
Indicate compression by appending the extension to the name of the blob or file.
For example:
- MyData.csv.zip indicates a blob or a file formatted as CSV, compressed with zip (archive or a single file).
- MyData.json.gz indicates a blob or a file formatted as JSON, compressed with gzip.
Blob or file names that don’t include the format extension but just the compression extension (for example, MyData.zip) are also supported. In this case, the file format must be specified as an ingestion property because it cannot be inferred.
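For example, the following sketch (hypothetical table name and storage URI) ingests a zip-compressed CSV file whose name carries only the compression extension, so the format is passed explicitly as an ingestion property:
.ingest into table MyTable ('https://mystorageaccount.blob.core.windows.net/mycontainer/MyData.zip;impersonate')
  with (format='csv')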
Related content
- Learn more about supported data formats
- Learn more about Data ingestion properties
- Learn more about data ingestion
4.4 - Data ingestion properties
Data ingestion is the process by which data is added to a table and is made available for query. You add ingestion properties to the ingestion command after the with keyword.
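For example, a minimal sketch (hypothetical table, storage URI, and mapping name) that sets the format and ingestionMappingReference properties after the with keyword:
.ingest into table MyTable ('https://mystorageaccount.blob.core.windows.net/mycontainer/data.json;impersonate')
  with (format='json', ingestionMappingReference='MyJsonMapping')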
Related content
- Learn more about supported data formats
- Learn more about data ingestion
4.5 - Ingest from query
4.5.1 - .cancel operation command
Use the .cancel operation command to cancel a long-running ingest from query operation. This command is useful when the operation is taking too long and you would like to abort it while running.
The cancel operation command isn’t guaranteed to succeed. The output of the .cancel operation
command indicates whether or not cancellation was successful.
Syntax
.cancel operation OperationId [with ( reason = ReasonPhrase )]
Parameters
Name | Type | Required | Description |
---|---|---|---|
OperationId | guid | ✔️ | A guid of the operation ID returned from the running command. |
ReasonPhrase | string | | The reason for canceling the running command. |
Returns
Output parameter | Type | Description |
---|---|---|
OperationId | guid | The operation ID of the operation that was canceled. |
Operation | string | The operation kind that was canceled. |
StartedOn | datetime | The start time of the operation that was canceled. |
CancellationState | string | Returns one of the following options: Canceled successfully : the operation was canceled. Cancel failed : the operation can’t be canceled at this point. The operation may still be running or may have completed. |
ReasonPhrase | string | Reason why cancellation wasn’t successful. |
Example
.cancel operation 078b2641-f10d-4694-96f8-1ee2b75dda48 with(Reason="Command canceled by me")
OperationId | Operation | StartedOn | CancellationState | ReasonPhrase |
---|---|---|---|---|
078b2641-f10d-4694-96f8-1ee2b75dda48 | TableSetOrAppend | 2022-07-18 09:03:55.1387320 | Canceled successfully | Command canceled by me |
4.5.2 - Kusto query ingestion (set, append, replace)
These commands execute a query or a management command and ingest the results of the query into a table. The difference between these commands is how they treat existing or nonexistent tables and data.
Command | If table exists | If table doesn’t exist |
---|---|---|
.set | The command fails. | The table is created and data is ingested. |
.append | Data is appended to the table. | The command fails. |
.set-or-append | Data is appended to the table. | The table is created and data is ingested. |
.set-or-replace | Data replaces the data in the table. | The table is created and data is ingested. |
To cancel an ingest from query command, see cancel operation.
Permissions
To perform different actions on a table, you need specific permissions:
- To add rows to an existing table using the .append command, you need a minimum of Table Ingestor permissions.
- To create a new table using the various .set commands, you need a minimum of Database User permissions.
- To replace rows in an existing table using the .set-or-replace command, you need a minimum of Table Admin permissions.
For more information on permissions, see Kusto role-based access control.
Syntax
(.set | .append | .set-or-append | .set-or-replace) [async] tableName [with ( propertyName = propertyValue [, …] )] <| queryOrCommand
Parameters
Name | Type | Required | Description |
---|---|---|---|
async | string | | If specified, the command returns immediately and continues ingestion in the background. Use the returned OperationId with the .show operations command to retrieve the ingestion completion status and results. |
tableName | string | ✔️ | The name of the table to ingest data into. The tableName is always relative to the database in context. |
propertyName, propertyValue | string | | One or more supported ingestion properties used to control the ingestion process. |
queryOrCommand | string | ✔️ | The text of a query or a management command whose results are used as data to ingest. Only .show management commands are supported. |
Performance tips
- Set the distributed property to true if the amount of data produced by the query is large, exceeds one gigabyte (GB), and doesn’t require serialization. Then, multiple nodes can produce output in parallel. Don’t use this flag when query results are small, since it might needlessly generate many small data shards.
- Data ingestion is a resource-intensive operation that might affect concurrent activities on the database, including running queries. Avoid running too many ingestion commands at the same time.
- Limit the data for ingestion to less than one GB per ingestion operation. If necessary, use multiple ingestion commands.
Supported ingestion properties
Property | Type | Description |
---|---|---|
distributed | bool | If true , the command ingests from all nodes executing the query in parallel. Default is false . See performance tips. |
creationTime | string | The datetime value, formatted as an ISO8601 string , to use at the creation time of the ingested data extents. If unspecified, now() is used. When specified, make sure the Lookback property in the target table’s effective Extents merge policy is aligned with the specified value. |
extend_schema | bool | If true , the command might extend the schema of the table. Default is false . This option applies only to the .append , .set-or-append , and .set-or-replace commands. This option requires at least Table Admin permissions. |
recreate_schema | bool | If true , the command might recreate the schema of the table. Default is false . This option applies only to the .set-or-replace command. This option takes precedence over the extend_schema property if both are set. This option requires at least Table Admin permissions. |
folder | string | The folder to assign to the table. If the table already exists, this property overwrites the table’s folder. |
ingestIfNotExists | string | If specified, ingestion fails if the table already has data tagged with an ingest-by: tag with the same value. For more information, see ingest-by: tags. |
policy_ingestiontime | bool | If true , the Ingestion Time Policy is enabled on the table. The default is true . |
tags | string | A JSON string that represents a list of tags to associate with the created extent. |
docstring | string | A description used to document the table. |
persistDetails | bool | If true , the command persists the detailed results for retrieval by the .show operation details command. Default is false . Example: with (persistDetails=true) |
Schema considerations
- .set-or-replace preserves the schema unless one of the extend_schema or recreate_schema ingestion properties is set to true .
- .set-or-append and .append commands preserve the schema unless the extend_schema ingestion property is set to true (see the sketch after this list).
- Matching the result set schema to that of the target table is based on the column types. There’s no matching of column names. Make sure that the query result schema columns are in the same order as the table, otherwise data is ingested into the wrong columns.
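For example, a minimal sketch (hypothetical table and source names) of extending the target table’s schema during ingestion: the query result includes a column that doesn’t yet exist in MyTable, and extend_schema=true lets the command add it.
.set-or-append MyTable with (extend_schema=true) <|
SourceTable
| extend NewColumn = 1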
Character limitation
The command fails if the query generates an entity name with the $ character. The entity names must comply with the naming rules, so the $ character must be removed for the ingest command to succeed.
For example, in the following query, the search operator generates a column $table . To store the query results, use project-rename to rename the column.
.set Texas <| search State has 'Texas' | project-rename tableName=$table
Returns
Returns information on the extents created because of the .set or .append command.
Examples
Create and update table from query source
The following query creates the RecentErrors table with the same schema as LogsTable. It updates RecentErrors with all error logs from LogsTable over the last hour.
.set RecentErrors <|
LogsTable
| where Level == "Error" and Timestamp > now() - time(1h)
Create and update table from query source using the distributed flag
The following example creates a new table called OldExtents in the database, asynchronously. The dataset is expected to be bigger than one GB (more than ~one million rows), so the distributed flag is used. It updates OldExtents with ExtentId entries from the MyExtents table that were created more than 30 days ago.
.set async OldExtents with(distributed=true) <|
MyExtents
| where CreatedOn < now() - time(30d)
| project ExtentId
Append data to table
The following example filters ExtentId entries in the MyExtents table that were created more than 30 days ago and appends the entries to the OldExtents table with associated tags.
.append OldExtents with(tags='["TagA","TagB"]') <|
MyExtents
| where CreatedOn < now() - time(30d)
| project ExtentId
Create or append a table with possibly existing tagged data
The following example either appends to or creates the OldExtents table asynchronously. It filters ExtentId entries in the MyExtents table that were created more than 30 days ago and specifies the tags to append to the new extents with ingest-by:myTag . The ingestIfNotExists parameter ensures that the ingestion only occurs if the data doesn’t already exist in the table with the specified tag.
.set-or-append async OldExtents with(tags='["ingest-by:myTag"]', ingestIfNotExists='["myTag"]') <|
MyExtents
| where CreatedOn < now() - time(30d)
| project ExtentId
Create table or replace data with associated data
The following query replaces the data in the OldExtents table, or creates the table if it doesn’t already exist, with ExtentId entries in the MyExtents table that were created more than 30 days ago. Tag the new extent with ingest-by:myTag if the data doesn’t already exist in the table with the specified tag.
.set-or-replace async OldExtents with(tags='["ingest-by:myTag"]', ingestIfNotExists='["myTag"]') <|
MyExtents
| where CreatedOn < now() - time(30d)
| project ExtentId
Append data with associated data
The following example appends data to the OldExtents table asynchronously, using ExtentId entries from the MyExtents table that were created more than 30 days ago. It sets a specific creation time for the new extents.
.append async OldExtents with(creationTime='2017-02-13T11:09:36.7992775Z') <|
MyExtents
| where CreatedOn < now() - time(30d)
| project ExtentId
Sample output
The following is a sample of the type of output you may see from your queries.
ExtentId | OriginalSize | ExtentSize | CompressedSize | IndexSize | RowCount |
---|---|---|---|---|---|
23a05ed6-376d-4119-b1fc-6493bcb05563 | 1291 | 5882 | 1568 | 4314 | 10 |
Related content
4.6 - Kusto.ingest into command (pull data from storage)
The .ingest into
command ingests data into a table by “pulling” the data
from one or more cloud storage files.
For example, the command
can retrieve 1,000 CSV-formatted blobs from Azure Blob Storage, parse
them, and ingest them together into a single target table.
Data is appended to the table
without affecting existing records, and without modifying the table’s schema.
Permissions
You must have at least Table Ingestor permissions to run this command.
Syntax
.ingest [async] into table TableName SourceDataLocator [with ( IngestionPropertyName = IngestionPropertyValue [, …] )]
Parameters
Name | Type | Required | Description |
---|---|---|---|
async | string | | If specified, the command returns immediately and continues ingestion in the background. The results of the command include an OperationId value that can then be used with the .show operation command to retrieve the ingestion completion status and results. |
TableName | string | ✔️ | The name of the table into which to ingest data. The table name is always relative to the database in context. If no schema mapping object is provided, the schema of the database in context is used. |
SourceDataLocator | string | ✔️ | A single or comma-separated list of storage connection strings. A single connection string must refer to a single file hosted by a storage account. Ingestion of multiple files can be done by specifying multiple connection strings, or by ingesting from a query of an external table. |
Authentication and authorization
Each storage connection string indicates the authorization method to use for access to the storage. Depending on the authorization method, the principal might need to be granted permissions on the external storage to perform the ingestion.
The following table lists the supported authentication methods and the permissions needed for ingesting data from external storage.
Authentication method | Azure Blob Storage / Data Lake Storage Gen2 | Data Lake Storage Gen1 |
---|---|---|
Impersonation | Storage Blob Data Reader | Reader |
Shared Access (SAS) token | List + Read | This authentication method isn’t supported in Gen1. |
Microsoft Entra access token | ||
Storage account access key | | This authentication method isn’t supported in Gen1. |
Managed identity | Storage Blob Data Reader | Reader |
Returns
The result of the command is a table with as many records as there are data shards (“extents”) generated by the command. If no data shards were generated, a single record is returned with an empty (zero-valued) extent ID.
Name | Type | Description |
---|---|---|
ExtentId | guid | The unique identifier for the data shard that was generated by the command. |
ItemLoaded | string | One or more storage files that are related to this record. |
Duration | timespan | How long it took to perform ingestion. |
HasErrors | bool | Whether or not this record represents an ingestion failure. |
OperationId | guid | A unique ID representing the operation. Can be used with the .show operation command. |
Examples
Azure Blob Storage with shared access signature
The following example instructs your database to read two blobs from Azure Blob Storage as CSV files, and ingest their contents into table T . The ... represents an Azure Storage shared access signature (SAS) which gives read access to each blob. Obfuscated strings (the h in front of the string values) are used to ensure that the SAS is never recorded.
.ingest into table T (
h'https://contoso.blob.core.windows.net/container/file1.csv?...',
h'https://contoso.blob.core.windows.net/container/file2.csv?...'
)
Azure Blob Storage with managed identity
The following example shows how to read a CSV file from Azure Blob Storage and ingest its contents into table T using managed identity authentication. Authentication uses the managed identity ID (object ID) assigned to the Azure Blob Storage in Azure. For more information, see Create a managed identity for storage containers.
.ingest into table T ('https://StorageAccount.blob.core.windows.net/Container/file.csv;managed_identity=802bada6-4d21-44b2-9d15-e66b29e4d63e')
Azure Data Lake Storage Gen 2
The following example is for ingesting data from Azure Data Lake Storage Gen 2
(ADLSv2). The credentials used here (...
) are the storage account credentials
(shared key), and we use string obfuscation only for the secret part of the
connection string.
.ingest into table T (
'abfss://myfilesystem@contoso.dfs.core.windows.net/path/to/file1.csv;...'
)
Azure Data Lake Storage
The following example ingests a single file from Azure Data Lake Storage (ADLS). It uses the user’s credentials to access ADLS (so there’s no need to treat the storage URI as containing a secret). It also shows how to specify ingestion properties.
.ingest into table T ('adl://contoso.azuredatalakestore.net/Path/To/File/file1.ext;impersonate')
with (format='csv')
Amazon S3 with an access key
The following example ingests a single file from Amazon S3 using an access key ID and a secret access key.
.ingest into table T ('https://bucketname.s3.us-east-1.amazonaws.com/path/to/file.csv;AwsCredentials=AKIAIOSFODNN7EXAMPLE,wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY')
with (format='csv')
Amazon S3 with a presigned URL
The following example ingests a single file from Amazon S3 using a presigned URL (the URL shown is a placeholder; a presigned URL carries its own authorization).
.ingest into table T ('https://bucketname.s3.us-east-1.amazonaws.com/path/to/file.csv?<presigned-token>')
with (format='csv')
Related content
4.7 - Streaming ingestion
4.7.1 - Clearing cached schema for streaming ingestion
Nodes cache the schema of the databases that receive data via streaming ingestion. This process optimizes performance and utilization of resources, but can cause propagation delays when the schema changes.
Clear the cache to guarantee that subsequent streaming ingestion requests incorporate database or table schema changes. For more information, see Streaming ingestion and schema changes.
Permissions
You must have at least Database Ingestor permissions to run this command.
Syntax
.clear table TableName cache streamingingestion schema

.clear database cache streamingingestion schema
Parameters
Name | Type | Required | Description |
---|---|---|---|
TableName | string | ✔️ | The name of the table for which to clear the cache. |
Returns
This command returns a table with the following columns:
Column | Type | Description |
---|---|---|
NodeId | string | Identifier of the node |
Status | string | Succeeded/Failed |
Example
.clear database cache streamingingestion schema
.clear table T1 cache streamingingestion schema
NodeId | Status |
---|---|
Node1 | Succeeded |
Node2 | Failed |
4.7.2 - Streaming ingestion and schema changes
Cluster and Eventhouse nodes cache the schema of databases that get data through streaming ingestion, boosting performance and resource use. However, schema changes can lead to delays in updates.
If schema changes and streaming ingestion aren’t synchronized, you can encounter failures like schema-related errors or incomplete and distorted data in the table.
This article outlines typical schema changes and provides guidance on avoiding problems with streaming ingestion during these changes.
Schema changes
The following list covers key examples of schema changes:
- Creation of tables
- Deletion of tables
- Adding a column to a table
- Removing a column from a table
- Retyping the columns of a table
- Renaming the columns of a table
- Adding precreated ingestion mappings
- Removing precreated ingestion mappings
- Adding, removing, or altering policies
Coordinate schema changes with streaming ingestion
The schema cache is kept while the database is online. If there are schema changes, the system automatically refreshes the cache, but this refresh can take several minutes. If you rely on the automatic refresh, you can experience uncoordinated ingestion failures.
You can reduce the effects of propagation delay by explicitly clearing the schema cache on the nodes. If the streaming ingestion flow and schema changes are coordinated, you can completely eliminate failures and their associated data distortion.
To coordinate the streaming ingestion flow with schema changes:
- Suspend streaming ingestion.
- Wait until all outstanding streaming ingestion requests are complete.
- Do schema changes.
- Issue one or more .clear cache streaming ingestion schema commands.
- Repeat until successful and all rows in the command output indicate success.
- Resume streaming ingestion.
5 - Database cursors
5.1 - Database cursors
A database cursor is a database-level object that lets you query a database multiple times. You get consistent results even if there are data-append or data-retention operations happening in parallel with the queries.
Database cursors are designed to address two important scenarios:
The ability to repeat the same query multiple times and get the same results, as long as the query indicates “same data set”.
The ability to make an “exactly once” query. This query only “sees” the data that a previous query didn’t see, because the data wasn’t available then. The query lets you iterate, for example, through all the newly arrived data in a table without fear of processing the same record twice or skipping records by mistake.
The database cursor is represented in the query language as a scalar value of type
string
. The actual value should be considered opaque and there’s no support
for any operation other than to save its value or use the cursor functions
noted below.
Cursor functions
Kusto provides three functions to help implement the two above scenarios:
- cursor_current(): Use this function to retrieve the current value of the database cursor. You can use this value as an argument to the two other functions.
- cursor_after(rhs:string): This special function can be used on table records that have the IngestionTime policy enabled. It returns a scalar value of type bool indicating whether the record’s ingestion_time() database cursor value comes after the rhs database cursor value.
- cursor_before_or_at(rhs:string): This special function can be used on the table records that have the IngestionTime policy enabled. It returns a scalar value of type bool indicating whether the record’s ingestion_time() database cursor value comes before or at the rhs database cursor value.
The two special functions (cursor_after and cursor_before_or_at ) also have a side-effect: when they’re used, Kusto will emit the current value of the database cursor to the @ExtendedProperties result set of the query. The property name for the cursor is Cursor , and its value is a single string .
For example:
{"Cursor" : "636040929866477946"}
Restrictions
Database cursors can only be used with tables for which the IngestionTime policy has been enabled. Each record in such a table is associated with the value of the database cursor that was in effect when the record was ingested. As such, the ingestion_time() function can be used.
The database cursor object holds no meaningful value unless the database has at least one table that has an IngestionTime policy defined. This value is guaranteed to update, as-needed by the ingestion history, into such tables and the queries run, that reference such tables. It might, or might not, be updated in other cases.
The ingestion process first commits the data, so that it’s available for querying, and only then assigns an actual cursor value to each record. If you attempt to query for data immediately following the ingestion completion using a database cursor, the results might not yet incorporate the last records added, because they haven’t yet been assigned the cursor value. Also, retrieving the current database cursor value repeatedly might return the same value, even if ingestion was done in between, because only a cursor commit can update its value.
Querying a table based on database cursors is only guaranteed to “work” (providing exactly-once guarantees) if the records are ingested directly into that table. If you’re using extents commands, such as move extents/.replace extents to move data into the table, or if you’re using .rename table, then querying this table using database cursors isn’t guaranteed to not miss any data. This is because the ingestion time of the records is assigned when initially ingested, and doesn’t change during the move extents operation. Therefore, when the extents are moved into the target table, it’s possible that the cursor value assigned to the records in these extents was already processed (and next query by database cursor will miss the new records).
Example: Processing records exactly once
For a table Employees with schema [Name, Salary] , to continuously process new records as they’re ingested into the table, use the following process:
// [Once] Enable the IngestionTime policy on table Employees
.set table Employees policy ingestiontime true
// [Once] Get all the data that the Employees table currently holds
Employees | where cursor_after('')
// The query above will return the database cursor value in
// the @ExtendedProperties result set. Lets assume that it returns
// the value '636040929866477946'
// [Many] Get all the data that was added to the Employees table
// since the previous query was run using the previously-returned
// database cursor
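// For example (a sketch using the cursor value shown above):
Employees | where cursor_after('636040929866477946')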
6 - Plugin commands
7 - Policies
7.1 - Policies overview
The following table provides an overview of the policies for managing your environment:
Policy | Description |
---|---|
Auto delete policy | Sets an expiry date for the table. The table is automatically deleted at this expiry time. |
Cache policy | Defines how to prioritize resources. Allows customers to differentiate between hot data cache and cold data cache. |
Callout policy | Manages the authorized domains for external calls. |
Capacity policy | Controls the compute resources of data management operations. |
Encoding policy | Defines how data is encoded, compressed, and indexed. |
Extent tags retention policy | Controls the mechanism that automatically removes extent tags from tables. |
Ingestion batching policy | Groups multiple data ingestion requests into batches for more efficient processing. |
Ingestion time policy | Adds a hidden datetime column to the table that records the time of ingestion. |
ManagedIdentity policy | Controls which managed identities can be used for what purposes. |
Merge policy | Defines rules for merging data from different extents into a single extent. |
Mirroring policy | Allows you to manage your mirroring policy and mirroring policy operations. |
Partitioning policy | Defines rules for partitioning extents for a specific table or a materialized view. |
Retention policy | Controls the mechanism that automatically removes data from tables or materialized views. |
Restricted view access policy | Adds an extra layer of permission requirements for principals to access and view the table. |
Row level security policy | Defines rules for access to rows in a table based on group membership or execution context. |
Row order policy | Maintains a specific order for rows within an extent. |
Sandbox policy | Controls the usage and behavior of sandboxes, which are isolated environments for query execution. |
Sharding policy | Defines rules for how extents are created. |
Streaming ingestion policy | Configurations for streaming data ingestion. |
Update policy | Allows for data to be appended to a target table upon adding data to a source table. |
Query weak consistency policy | Controls the level of consistency for query results. |
7.2 - Auto delete
7.2.1 - Auto delete policy
An auto delete policy on a table sets an expiry date for the table. The table is automatically deleted at this expiry time. Unlike the retention policy, which determines when data (extents) are removed from a table, the auto delete policy drops the entire table.
The auto delete policy can be useful for temporary staging tables. Temporary staging tables are used for data preparation, until the data is moved to its permanent location. We recommend explicitly dropping temporary tables when they’re no longer needed. Only use the auto delete policy as a fallback mechanism in case the explicit deletion doesn’t occur.
Policy object
An auto delete policy includes the following properties:
ExpiryDate:
- Date and time value indicating when the table should be deleted.
- The deletion time is imprecise and could occur a few hours later than the time specified in the ExpiryDate property.
- The value specified can’t be null and must be greater than the current time.
DeleteIfNotEmpty:
- Boolean value indicating whether the table should be dropped even if there are still extents in it.
- Defaults to false .
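For example, a hedged sketch (hypothetical table name and expiry date) that sets the policy on a staging table using the auto delete policy command:
.alter table MyStagingTable policy auto_delete @'{"ExpiryDate": "2030-01-01", "DeleteIfNotEmpty": true}'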
Related content
For more information, see auto delete policy commands.
7.3 - Caching
7.3.1 - Caching policy (hot and cold cache)
To ensure fast query performance, a multi-tiered data cache system is used. Data is stored in reliable storage but parts of it are cached on processing nodes, SSD, or even in RAM for faster access.
The caching policy allows you to choose which data should be cached. You can differentiate between hot data cache and cold data cache by setting a caching policy on hot data. Hot data is kept in local SSD storage for faster query performance, while cold data is stored in reliable storage, which is cheaper but slower to access.
The cache uses 95% of the local SSD disk for hot data. If there isn’t enough space, the most recent data is preferentially kept in the cache. The remaining 5% is used for data that isn’t categorized as hot. This design ensures that queries loading lots of cold data won’t evict hot data from the cache.
The best query performance is achieved when all ingested data is cached. However, certain data might not warrant the expense of being kept in the hot cache. For instance, infrequently accessed old log records might be considered less crucial. In such cases, teams often opt for lower querying performance over paying to keep the data warm.
Use management commands to alter the caching policy at the cluster, database, table, or materialized view level.
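For example, a minimal sketch (hypothetical table name) that keeps the most recent 14 days of data in the hot cache:
.alter table MyTable policy caching hot = 14d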
How caching policy is applied
When data is ingested, the system keeps track of the date and time of the ingestion, and of the extent that was created. The extent’s ingestion date and time value (or maximum value, if an extent was built from multiple preexisting extents), is used to evaluate the caching policy.
By default, the effective policy is null , which means that all the data is considered hot. A null policy at the table level means that the policy is inherited from the database. A non-null table-level policy overrides a database-level policy.
Scoping queries to hot cache
When running queries, you can limit the scope to only query data in hot cache.
There are several query possibilities:
- Add a client request property called query_datascope to the query. Possible values: default , all , and hotcache .
- Use a set statement in the query text: set query_datascope='...' . Possible values are the same as for the client request property.
- Add a datascope=... text immediately after a table reference in the query body. Possible values are all and hotcache .
The default value indicates use of the default settings, which determine that the query should cover all data.
If there’s a discrepancy between the different methods, then set takes precedence over the client request property. Specifying a value for a table reference takes precedence over both.
For example, in the following query, all table references use hot cache data only, except for the second reference to “T” that is scoped to all the data:
set query_datascope="hotcache";
T | union U | join (T datascope=all | where Timestamp < ago(365d)) on X
Caching policy vs retention policy
Caching policy is independent of retention policy:
- Caching policy defines how to prioritize resources. Queries for important data are faster.
- Retention policy defines the extent of the queryable data in a table/database (specifically, SoftDeletePeriod ).
Configure this policy to achieve the optimal balance between cost and performance, based on the expected query pattern.
Example:
- SoftDeletePeriod = 56d
- hot cache policy = 28d
In the example, the last 28 days of data is stored on the SSD and the additional 28 days of data is stored in Azure blob storage. You can run queries on the full 56 days of data.
Related content
7.4 - Callout
7.4.1 - Callout policy
Your cluster can communicate with external services in many different scenarios. Cluster administrators can manage the authorized domains for external calls by updating the cluster’s callout policy.
Supported properties of a callout
A callout policy is composed of the following properties:
Name | Type | Description |
---|---|---|
CalloutType | string | Defines the type of callout, and can be one of types listed in callout types. |
CalloutUriRegex | string | Specifies the regular expression whose matches represent the domain of resources of the callout domain. |
CanCall | bool | Whether the callout domain is permitted for external calls. |
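For example, a hedged sketch (the regular expression is illustrative) that merges a rule permitting SQL callouts to a specific domain into the cluster’s callout policy:
.alter-merge cluster policy callout @'[{"CalloutType": "sql", "CalloutUriRegex": "sqlserver\\.contoso\\.com", "CanCall": true}]'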
Types of callout
Callout policies are managed at cluster-level and are classified into the following types:
Callout policy type | Description |
---|---|
kusto | Controls cross-cluster queries. |
sql | Controls the SQL plugin. |
mysql | Controls the MySQL plugin. |
postgresql | Controls the PostgreSql plugin. |
azure_digital_twins | Controls the Azure Digital Twins plugin. |
cosmosdb | Controls the Cosmos DB plugin. |
sandbox_artifacts | Controls sandboxed plugins (python and R). |
external_data | Controls access to external data through external tables or externaldata operator. |
webapi | Controls access to http endpoints. |
azure_openai | Controls calls to Azure OpenAI plugins, such as the ai_embed_text embedding plugin. |
Predefined callout policies
The following table shows a set of predefined callout policies that are preconfigured on your cluster to enable callouts to selected services:
Service | Designation | Permitted domains |
---|---|---|
Kusto | Cross cluster queries | [a-z0-9]{3,22}\\.(\\w+\\.)?kusto(mfa)?\\.windows\\.net/?$ |
Kusto | Cross cluster queries | `^https://[a-z0-9]{3,22}\.[a-z0-9-]{1,50}\.(kusto\.azuresynapse |
Kusto | Cross cluster queries | `^https://([A-Za-z0-9]+\.)?(ade |
Azure DB | SQL requests | [a-z0-9][a-z0-9\\-]{0,61}[a-z0-9]?\\.database\\.windows\\.net/?$ |
Synapse Analytics | SQL requests | [a-z0-9-]{0,61}?(-ondemand)?\\.sql\\.azuresynapse(-dogfood)?\\.net/?$ |
External Data | External data | .* |
Azure Digital Twins | Azure Digital Twins | [A-Za-z0-9\\-]{3,63}\\.api\\.[A-Za-z0-9]+\\.digitaltwins\\.azure\\.net/?$ |
You can view more predefined policies on your cluster with the following query:
.show cluster policy callout
| where EntityType == 'Cluster immutable policy'
| project Policy
Remarks
If an external resource of a given type matches more than one policy defined for that type, and at least one of the matched policies has its CanCall property set to false, access to the resource is denied.
Related content
7.5 - Capacity
7.5.1 - Capacity policy
A capacity policy is used for controlling the compute resources of data management operations on the cluster.
The capacity policy object
The capacity policy is made of the following components:
- IngestionCapacity
- ExtentsMergeCapacity
- ExtentsPurgeRebuildCapacity
- ExportCapacity
- ExtentsPartitionCapacity
- MaterializedViewsCapacity
- StoredQueryResultsCapacity
- StreamingIngestionPostProcessingCapacity
- PurgeStorageArtifactsCleanupCapacity
- PeriodicStorageArtifactsCleanupCapacity
- QueryAccelerationCapacity
To view the capacity of your cluster, use the .show capacity command.
Ingestion capacity
Property | Type | Description |
---|---|---|
ClusterMaximumConcurrentOperations | long | The maximum number of concurrent ingestion operations allowed in a cluster. This value caps the total ingestion capacity, as shown in the following formula. |
CoreUtilizationCoefficient | real | Determines the percentage of cores to use in the ingestion capacity calculation. |
Formula
The .show capacity command returns the cluster’s ingestion capacity based on the following formula:
Minimum(ClusterMaximumConcurrentOperations, Number of nodes in cluster * Maximum(1, Core count per node * CoreUtilizationCoefficient))
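For example, with the default values shown later in this article (ClusterMaximumConcurrentOperations = 512, CoreUtilizationCoefficient = 0.75), a hypothetical 10-node cluster with 16 cores per node has an ingestion capacity of:
Minimum(512, 10 * Maximum(1, 16 * 0.75)) = Minimum(512, 120) = 120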
Extents merge capacity
Property | Type | Description |
---|---|---|
MinimumConcurrentOperationsPerNode | long | The minimal number of concurrent extents merge/rebuild operations on a single node. Default is 1 . |
MaximumConcurrentOperationsPerNode | long | The maximum number of concurrent extents merge/rebuild operations on a single node. Default is 5 . |
ClusterMaximumConcurrentOperations | long | The maximum number of concurrent extents merge/rebuild operations allowed in a cluster. This value caps the total merge capacity. |
Formula
The .show capacity command returns the cluster’s extents merge capacity based on the following formula:
Minimum(Number of nodes in cluster * Concurrent operations per node, ClusterMaximumConcurrentOperations)
The effective value for Concurrent operations per node is automatically adjusted by the system in the range [MinimumConcurrentOperationsPerNode , MaximumConcurrentOperationsPerNode ], as long as the success rate of the merge operations is 90% or higher.
Extents purge rebuild capacity
Property | Type | Description |
---|---|---|
MaximumConcurrentOperationsPerNode | long | The maximum number of concurrent rebuild extents for purge operations on a single node. |
Formula
The .show capacity command returns the cluster’s extents purge rebuild capacity based on the following formula:
Number of nodes in cluster x MaximumConcurrentOperationsPerNode
Export capacity
Property | Type | Description |
---|---|---|
ClusterMaximumConcurrentOperations | long | The maximum number of concurrent export operations in a cluster. This value caps the total export capacity, as shown in the following formula. |
CoreUtilizationCoefficient | long | Determines the percentage of cores to use in the export capacity calculation. |
Formula
The .show capacity command returns the cluster’s export capacity based on the following formula:
Minimum(ClusterMaximumConcurrentOperations, Number of nodes in cluster * Maximum(1, Core count per node * CoreUtilizationCoefficient))
Extents partition capacity
Property | Type | Description |
---|---|---|
ClusterMinimumConcurrentOperations | long | The minimal number of concurrent extents partition operations in a cluster. Default is 1 . |
ClusterMaximumConcurrentOperations | long | The maximum number of concurrent extents partition operations in a cluster. Default is 32 . |
The effective value for Concurrent operations is automatically adjusted by the system in the range [ClusterMinimumConcurrentOperations , ClusterMaximumConcurrentOperations ], as long as the success rate of the partitioning operations is 90% or higher.
Materialized views capacity policy
The policy can be used to change concurrency settings for materialized views. Changing the materialized views capacity policy can be useful when there’s more than a single materialized view defined on a cluster.
Property | Type | Description |
---|---|---|
ClusterMinimumConcurrentOperations | long | The minimal number of concurrent materialization operations in a cluster. Default is 1 . |
ClusterMaximumConcurrentOperations | long | The maximum number of concurrent materialization operations in a cluster. Default is 10 . |
By default, only a single materialization runs concurrently (see how materialized views work). The system adjusts the current concurrency in the range [ClusterMinimumConcurrentOperations , ClusterMaximumConcurrentOperations ], based on the number of materialized views in the cluster and the cluster’s CPU. You can increase or decrease concurrency by altering this policy. For example, if the cluster has 10 materialized views, setting ClusterMinimumConcurrentOperations to five ensures that at least five of them can materialize concurrently.
You can view the effective value for the current concurrency using the .show capacity command.
Stored query results capacity
Property | Type | Description |
---|---|---|
MaximumConcurrentOperationsPerDbAdmin | long | The maximum number of concurrent ingestion operations in a cluster admin node. |
CoreUtilizationCoefficient | real | Determines the percentage of cores to use in the stored query results creation calculation. |
Formula
The .show capacity command returns the cluster’s stored query results creation capacity based on the following formula:
Minimum(MaximumConcurrentOperationsPerDbAdmin, Number of nodes in cluster * Maximum(1, Core count per node * CoreUtilizationCoefficient))
Streaming ingestion post processing capacity
Property | Type | Description |
---|---|---|
MaximumConcurrentOperationsPerNode | long | The maximum number of concurrent streaming ingestion post processing operations on each cluster node. |
Formula
The .show capacity command returns the cluster’s streaming ingestion post processing capacity based on the following formula:
Number of nodes in cluster x MaximumConcurrentOperationsPerNode
Purge storage artifacts cleanup capacity
Property | Type | Description |
---|---|---|
MaximumConcurrentOperationsPerCluster | long | The maximum number of concurrent purge storage artifacts cleanup operations on the cluster. |
Formula
The .show capacity command returns the cluster’s purge storage artifacts cleanup capacity based on the following formula:
MaximumConcurrentOperationsPerCluster
Periodic storage artifacts cleanup capacity
Property | Type | Description |
---|---|---|
MaximumConcurrentOperationsPerCluster | long | The maximum number of concurrent periodic storage artifacts cleanup operations on the cluster. |
Formula
The .show capacity command returns the cluster’s periodic storage artifacts cleanup capacity based on the following formula:
MaximumConcurrentOperationsPerCluster
Query Acceleration capacity
Property | Type | Description |
---|---|---|
ClusterMaximumConcurrentOperations | long | The maximum number of concurrent query acceleration caching operations in a cluster. This value caps the total query acceleration caching capacity, as shown in the following formula. |
CoreUtilizationCoefficient | long | Determines the percentage of cores to use in the query acceleration caching capacity calculation. |
Formula
The .show capacity command returns the cluster’s query acceleration caching capacity based on the following formula:
Minimum(ClusterMaximumConcurrentOperations, Number of nodes in cluster * Maximum(1, Core count per node * CoreUtilizationCoefficient))
Defaults
The default capacity policy has the following JSON representation:
{
"IngestionCapacity": {
"ClusterMaximumConcurrentOperations": 512,
"CoreUtilizationCoefficient": 0.75
},
"ExtentsMergeCapacity": {
"MinimumConcurrentOperationsPerNode": 1,
"MaximumConcurrentOperationsPerNode": 3
},
"ExtentsPurgeRebuildCapacity": {
"MaximumConcurrentOperationsPerNode": 1
},
"ExportCapacity": {
"ClusterMaximumConcurrentOperations": 100,
"CoreUtilizationCoefficient": 0.25
},
"ExtentsPartitionCapacity": {
"ClusterMinimumConcurrentOperations": 1,
"ClusterMaximumConcurrentOperations": 32
},
"MaterializedViewsCapacity": {
"ClusterMaximumConcurrentOperations": 1,
"ExtentsRebuildCapacity": {
"ClusterMaximumConcurrentOperations": 50,
"MaximumConcurrentOperationsPerNode": 5
}
},
"StoredQueryResultsCapacity": {
"MaximumConcurrentOperationsPerDbAdmin": 250,
"CoreUtilizationCoefficient": 0.75
},
"StreamingIngestionPostProcessingCapacity": {
"MaximumConcurrentOperationsPerNode": 4
},
"PurgeStorageArtifactsCleanupCapacity": {
"MaximumConcurrentOperationsPerCluster": 2
},
"PeriodicStorageArtifactsCleanupCapacity": {
"MaximumConcurrentOperationsPerCluster": 2
},
"QueryAccelerationCapacity": {
"ClusterMaximumConcurrentOperations": 100,
"CoreUtilizationCoefficient": 0.5
}
}
Management commands
- Use .show cluster policy capacity to show the current capacity policy of the cluster.
- Use .alter-merge cluster policy capacity to alter the capacity policy of the cluster, as shown in the sketch after this list.
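For example, a hedged sketch (the values shown are illustrative) that merges a change into only the extents merge settings of the capacity policy:
.alter-merge cluster policy capacity @'{"ExtentsMergeCapacity": {"MinimumConcurrentOperationsPerNode": 1, "MaximumConcurrentOperationsPerNode": 5}}'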
Management commands throttling
Kusto limits the number of concurrent requests for the following user-initiated commands:
- Ingestions
  - This category includes commands that ingest from storage, ingest from a query, and ingest inline.
  - The ingestion capacity defines the limit.
- Purges
  - The global limit is currently fixed at one per cluster.
  - The purge rebuild capacity is used internally to determine the number of concurrent rebuild operations during purge commands. Purge commands aren’t blocked or throttled because of this process, but complete faster or slower depending on the purge rebuild capacity.
- Exports
  - The limit is as defined in the export capacity.
- Query acceleration
  - The limit is as defined in the query acceleration capacity.
When the cluster detects that an operation exceeded the limit on concurrent requests:
- The command’s state, as presented by System information commands, is Throttled .
- The error message includes the command type, the origin of the throttling, and the capacity that was exceeded. For example: The management command was aborted due to throttling. Retrying after some backoff might succeed. CommandType: 'TableSetOrAppend', Capacity: 18, Origin: 'CapacityPolicy/Ingestion' .
- The HTTP response code is 429 . The subcode is TooManyRequests .
- The exception type is ControlCommandThrottledException .
Related content
7.6 - Encoding policy
7.6.1 - Encoding policy
The encoding policy defines how data is encoded, compressed, and indexed. This policy applies to all columns of stored data. A default encoding policy is applied based on the column’s data type, and a background process adjusts the encoding policy automatically if necessary.
Scenarios
We recommend that the default policy be maintained except in specific scenarios, where it can be useful to modify a column’s encoding policy to fine-tune control over the performance/COGS trade-off. For example:
- The default indexing applied to string columns is built for term searches. If you only query for specific values in the column, COGS might be reduced if the index is simplified using the encoding profile Identifier . For more information, see the string data type.
- Fields that are never queried on, or don’t need fast searches, can disable indexing. You can use the profile BigObject to turn off the indexes and increase the maximum value size in dynamic or string columns. For example, use this profile to store HLL values returned by the hll() function.
How it works
Encoding policy changes do not affect data that has already been ingested. Only new ingestion operations will be performed according to the new policy. The encoding policy applies to individual columns in a table, but can be set at the column level, table level (affecting all columns of the table), or database level.
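As a hedged sketch (hypothetical table and column names; see the .alter encoding policy reference for the exact syntax), a column holding large, rarely searched values might be switched to the BigObject profile like this:
.alter column MyTable.MyColumn policy encoding type='BigObject'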
Related content
- To view the encoding policy, see .show encoding policy.
- To alter the encoding policy, see .alter encoding policy.
7.7 - Extent tags policy
7.7.1 - Extent tags retention policy
The extent tags retention policy controls the mechanism that automatically removes extent tags from tables, based on the age of the extents.
It’s recommended to remove any tags that are no longer helpful or that were used temporarily as part of an ingestion pipeline, because such tags may limit the system from reaching optimal performance. For example: old drop-by: tags, which prevent merging extents together.
The policy can be set at the table-level, or at the database-level. A database-level policy applies to all tables in the database that don’t override the policy.
The policy object
The extent tags retention policy is an array of policy objects. Each object includes the following properties:
Property name | Type | Description | Example |
---|---|---|---|
TagPrefix | string | The prefix of the tags to be automatically deleted, once RetentionPeriod is exceeded. The prefix must include a colon (: ) as its final character, and may only include one colon. | drop-by: , ingest-by: , custom_prefix: |
RetentionPeriod | timespan | The duration for which it’s guaranteed that the tags aren’t dropped. This period is measured starting from the extent’s creation time. | 1.00:00:00 |
Example
The following policy will have any drop-by: tags older than three days and any ingest-by: tags older than two hours automatically dropped:
[
{
"TagPrefix": "drop-by:",
"RetentionPeriod": "3.00:00:00"
},
{
"TagPrefix": "ingest-by:",
"RetentionPeriod": "02:00:00"
}
]
Defaults
By default, when the policy isn’t defined, extent tags of any kind are retained as long as the extent isn’t dropped.
Management commands
Management commands are available to show, alter, and delete the extent tags retention policy at the table or database level.
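For example, a hedged sketch (hypothetical table name) that applies the policy shown above at the table level:
.alter table MyTable policy extent_tags_retention @'[{"TagPrefix": "drop-by:", "RetentionPeriod": "3.00:00:00"}, {"TagPrefix": "ingest-by:", "RetentionPeriod": "02:00:00"}]'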
7.8 - Ingestion batching
7.8.1 - IngestionBatching policy
Overview
During the queued ingestion process, the service optimizes for throughput by batching small ingress data chunks together before ingestion. Batching reduces the resources consumed by the queued ingestion process and doesn’t require post-ingestion resources to optimize the small data shards produced by non-batched ingestion.
The downside to doing batching before ingestion is the forced delay. Therefore, the end-to-end time from requesting the data ingestion until the data is ready for query is larger.
When you define the IngestionBatching policy, you need to find a balance between optimizing for throughput and time delay. This policy applies to queued ingestion. It defines the maximum forced delay allowed when batching small blobs together. To learn more about optimizing for throughput, see the ingestion batching policy management commands.
Sealing a batch
There’s an optimal size of about 1 GB of uncompressed data for bulk ingestion. Ingestion of blobs with much less data is suboptimal, so in queued ingestion the service will batch small blobs together.
The following list shows the basic batching policy triggers to seal a batch. A batch is sealed and ingested when the first condition is met:
- Size: Batch size limit reached or exceeded
- Count: Batch file number limit reached
- Time: Batching time has expired
The IngestionBatching policy can be set on databases or tables. Default values are as follows: 5 minutes maximum delay time, 500 items, total size of 1 GB.
The following list shows conditions to seal batches related to single blob ingestion. A batch is sealed and ingested when the conditions are met:
- SingleBlob_FlushImmediately: Ingest a single blob because 'FlushImmediately' was set
- SingleBlob_IngestIfNotExists: Ingest a single blob because 'IngestIfNotExists' was set
- SingleBlob_IngestByTag: Ingest a single blob because 'ingest-by' was set
- SingleBlob_SizeUnknown: Ingest a single blob because blob size is unknown
If the SystemFlush condition is set, a batch will be sealed when a system flush is triggered. With the SystemFlush parameter set, the system flushes the data, for example due to database scaling or internal reset of system components.
Defaults and limits
Type | Property | Default | Low latency setting | Minimum value | Maximum value |
---|---|---|---|---|---|
Number of items | MaximumNumberOfItems | 500 | 500 | 1 | 25,000 |
Data size (MB) | MaximumRawDataSizeMB | 1024 | 1024 | 100 | 4096 |
Time (TimeSpan) | MaximumBatchingTimeSpan | 00:05:00 | 00:00:20 - 00:00:30 | 00:00:10 | 00:30:00 |
The most effective way of controlling the end-to-end latency using ingestion batching policy is to alter its time boundary at table or database level, according to the higher bound of latency requirements. A database level policy affects all tables in that database that don’t have the table-level policy defined, and any newly created table.
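For example, a minimal sketch of lowering the time boundary for a placeholder table MyTable (the values must stay within the limits in the table above; see the IngestionBatching policy command reference for the authoritative syntax):
.alter table MyTable policy ingestionbatching '{"MaximumBatchingTimeSpan": "00:00:30", "MaximumNumberOfItems": 500, "MaximumRawDataSizeMB": 1024}'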
Batch data size
The batching policy data size is set for uncompressed data. For Parquet, AVRO, and ORC files, an estimation is calculated based on file size. For compressed data, the uncompressed data size is evaluated as follows in descending order of accuracy:
- If the uncompressed size is provided in the ingestion source options, that value is used.
- When ingesting local files using SDKs, zip archives and gzip streams are inspected to assess their raw size.
- If previous options don’t provide a data size, a factor is applied to the compressed data size to estimate the uncompressed data size.
Batching latencies
Latencies can result from many causes that can be addressed using batching policy settings.
Cause | Solution |
---|---|
Data latency matches the time setting, with too little data to reach the size or count limit | Reduce the time limit |
Inefficient batching due to a large number of very small files | Increase the size of the source files. If using Kafka Sink, configure it to send data in ~100 KB chunks or higher. If you have many small files, increase the count (up to 2000) in the database or table ingestion policy. |
Batching a large amount of uncompressed data | This is common when ingesting Parquet files. Incrementally decrease size for the table or database batching policy towards 250 MB and check for improvement. |
Backlog because the database is under-scaled | Accept any Azure Advisor suggestions to scale out or scale up your database. Alternatively, manually scale your database to see if the backlog is closed. If these options don't work, contact support for assistance. |
7.9 - Ingestion time
7.9.1 - IngestionTime policy
The IngestionTime policy is an optional policy that can be set (enabled) on tables.
When enabled, Kusto adds a hidden datetime column to the table, called $IngestionTime.
Now, whenever new data is ingested, the time of ingestion is recorded in the hidden column.
That time is measured just before the data is committed.
Since the ingestion time column is hidden, you can't directly query for its value. Instead, a special function called ingestion_time() retrieves that value. If there's no datetime column in the table, or the IngestionTime policy wasn't enabled when a record was ingested, a null value is returned.
The IngestionTime policy is designed for two main scenarios:
To allow users to estimate the latency in ingesting data. Many tables with log data have a timestamp column. The timestamp value gets filled by the source and indicates the time when the record was produced. By comparing that column’s value with the ingestion time column, you can estimate the latency for getting the data in.
[!NOTE] The calculated value is only an estimate, because the source and Kusto don’t necessarily have their clocks synchronized.
To support Database Cursors, which let users issue consecutive queries, each limited to the data that was ingested since the previous query.
For more information, see the management commands for managing the IngestionTime policy.
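The following is a minimal sketch; MyTable and its Timestamp column are placeholder names, and the query assumes the policy was already enabled when the records were ingested. It enables the policy with a management command and then compares ingestion_time() with the source timestamp to estimate latency:
.alter table MyTable policy ingestiontime true
MyTable
| extend IngestionLatency = ingestion_time() - Timestamp  // Timestamp is a placeholder source-time column
| summarize percentile(IngestionLatency, 95)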
7.10 - Managed identity
7.10.1 - Kusto ManagedIdentity policy
ManagedIdentity is a policy that controls which managed identities can be used for what purposes. For example, you can configure a policy that allows a specific managed identity to be used for accessing a storage account for ingestion purposes.
This policy can be enabled at the cluster and database levels. The policy is additive, meaning that for every operation that involves a managed identity, the operation will be permitted if the usage is allowed at either the cluster or database level.
Permissions
Creating or altering a managed identity policy requires AllDatabasesAdmin permissions.
The ManagedIdentity policy object
A cluster or database may have zero or more ManagedIdentity policy objects associated with it. Each ManagedIdentity policy object has the following user-definable properties: DisplayName and AllowedUsages. Other properties are automatically populated from the managed identity associated with the specified ObjectId and displayed for convenience.
The following table describes the properties of the ManagedIdentity policy object:
Property | Type | Required | Description |
---|---|---|---|
ObjectId | string | ✔️ | Either the actual object ID of the managed identity or the reserved keyword system to reference the System Managed Identity of the cluster on which the command is run. |
ClientId | string | Not applicable | The client ID of the managed identity. |
TenantId | string | Not applicable | The tenant ID of the managed identity. |
DisplayName | string | Not applicable | The display name of the managed identity. |
IsSystem | bool | Not applicable | A Boolean value indicating true if the identity is a System Managed Identity; false if otherwise. |
AllowedUsages | string | ✔️ | A list of comma-separated allowed usage values for the managed identity. See managed identity usages. |
The following is an example of a ManagedIdentity policy object:
{
"ObjectId": "<objectID>",
"ClientId": "<clientID>",
"TenantId": "<tenantID",
"DisplayName": "myManagedIdentity",
"IsSystem": false,
"AllowedUsages": "NativeIngestion, ExternalTable"
}
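For illustration only, a policy like the one above might be applied at the database level with a command along these lines; MyDatabase is a placeholder, only ObjectId and AllowedUsages are user-definable, and the quoted-array argument form is an assumption to verify against the ManagedIdentity policy commands:
.alter database MyDatabase policy managed_identity '[{"ObjectId": "<objectID>", "AllowedUsages": "NativeIngestion, ExternalTable"}]'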
Managed identity usages
The following values specify authentication to a usage using the configured managed identity:
Value | Description |
---|---|
All | All current and future usages are allowed. |
AutomatedFlows | Run a Continuous Export or Update Policy automated flow on behalf of a managed identity. |
AzureAI | Authenticate to an Azure OpenAI service using the ai_embed_text plugin with a managed identity. |
DataConnection | Authenticate to data connections to an Event Hub or an Event Grid. |
ExternalTable | Authenticate to external tables using connection strings configured with a managed identity. |
NativeIngestion | Authenticate to an SDK for native ingestion from an external source. |
SandboxArtifacts | Authenticate to external artifacts referenced in sandboxed plugins (e.g., Python) with a managed identity. This usage needs to be defined on the cluster level managed identity policy. |
SqlRequest | Authenticate to an external database using the sql_request or cosmosdb_request plugin with a managed identity. |
7.11 - Merge policy
7.11.1 - Extents merge policy
The merge policy defines if and how Extents (data shards) should get merged.
There are two types of merge operations: Merge, which rebuilds indexes, and Rebuild, which completely reingests the data.
Both operation types result in a single extent that replaces the source extents.
By default, Rebuild operations are preferred. If there are extents that don't fit the criteria for being rebuilt, then an attempt will be made to merge them.
Merge policy properties
The merge policy contains the following properties:
- RowCountUpperBoundForMerge:
- Defaults to 16,000,000.
- Maximum allowed row count of the merged extent.
- Applies to Merge operations, not Rebuild.
- OriginalSizeMBUpperBoundForMerge:
- Defaults to 30,000.
- Maximum allowed original size (in MBs) of the merged extent.
- Applies to Merge operations, not Rebuild.
- MaxExtentsToMerge:
- Defaults to 100.
- Maximum allowed number of extents to be merged in a single operation.
- Applies to Merge operations.
- This value shouldn’t be changed.
- AllowRebuild:
- Defaults to true.
- Defines whether Rebuild operations are enabled (in which case, they're preferred over Merge operations).
- AllowMerge:
- Defaults to true.
- Defines whether Merge operations are enabled, in which case, they're less preferred than Rebuild operations.
- MaxRangeInHours:
- Defaults to 24.
- The maximum allowed difference, in hours, between any two different extents’ creation times, so that they can still be merged.
- Timestamps are of extent creation, and don’t relate to the actual data contained in the extents.
- Applies to both Merge and Rebuild operations.
- In materialized views: defaults to 336 (14 days), unless recoverability is disabled in the materialized view’s effective retention policy.
- This value should be set according to the effective retention policy SoftDeletePeriod, or cache policy DataHotSpan values. Take the lower value of SoftDeletePeriod and DataHotSpan, and set MaxRangeInHours to between 2% and 3% of it. See the examples below.
- Lookback:
- Defines the timespan during which extents are considered for rebuild/merge.
- Supported values:
- Default - The system-managed default. This is the recommended and default value, whose period is currently set to 14 days.
- All - All extents, hot and cold, are included.
- HotCache - Only hot extents are included.
- Custom - Only extents whose age is under the provided CustomPeriod are included. CustomPeriod is a timespan value in the format dd.hh:mm.
Default policy example
The following example shows the default policy:
{
"RowCountUpperBoundForMerge": 16000000,
"OriginalSizeMBUpperBoundForMerge": 30000,
"MaxExtentsToMerge": 100,,
"MaxRangeInHours": 24,
"AllowRebuild": true,
"AllowMerge": true,
"Lookback": {
"Kind": "Default",
"CustomPeriod": null
}
}
MaxRangeInHours examples
min(SoftDeletePeriod (Retention Policy), DataHotSpan (Cache Policy)) | Max Range in hours (Merge Policy) |
---|---|
7 days (168 hours) | 4 |
14 days (336 hours) | 8 |
30 days (720 hours) | 18 |
60 days (1,440 hours) | 36 |
90 days (2,160 hours) | 60 |
180 days (4,320 hours) | 120 |
365 days (8,760 hours) | 250 |
When a database is created, it’s set with the default merge policy values mentioned above. The policy is by default inherited by all tables created in the database, unless their policies are explicitly overridden at table-level.
For more information, see management commands that allow you to manage merge policies for databases or tables.
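As a hedged sketch, a table-level override of selected properties might look like the following; MyTable is a placeholder, the values are illustrative rather than recommendations, and .alter-merge is assumed to merge the supplied properties into the existing policy:
.alter-merge table MyTable policy merge '{"MaxRangeInHours": 8, "Lookback": {"Kind": "HotCache"}}'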
7.12 - Mirroring policy
7.12.1 - Mirroring policy
The mirroring policy commands allow you to view, change, partition, and delete your table mirroring policy. They also provide a way to check the mirroring latency by reviewing the operations mirroring status.
Management commands
- Use .show table policy mirroring command to show the current mirroring policy of the table.
- Use .alter-merge table policy mirroring command to change the current mirroring policy.
- Use .delete table policy mirroring command to soft-delete the current mirroring policy.
- Use .show table mirroring operations command to check operations mirroring status.
- Use .show table mirroring operations exported artifacts command to check operations exported artifacts status.
- Use .show table mirroring operations failures to check operations mirroring failure status.
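For example, checking the policy and the mirroring operation status for a placeholder table MyTable might look like the following; the exact argument forms should be verified in the linked command pages:
.show table MyTable policy mirroring
.show table MyTable mirroring operations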
The policy object
The mirroring policy includes the following properties:
Property | Description | Values | Default |
---|---|---|---|
Format | The format of your mirrored files. | Valid value is parquet . | parquet |
ConnectionStrings | An array of connection strings that help configure and establish connections. This value is autopopulated. | ||
IsEnabled | Determines whether the mirroring policy is enabled. When the mirroring policy is disabled and set to false , the underlying mirroring data is retained in the database. | true , false , null . | null |
Partitions | A comma-separated list of columns used to divide the data into smaller partitions. | See Partitions formatting. | |
Data types mapping
To ensure compatibility and optimize queries, ensure that your data types are properly mapped to the parquet data types.
Event house to Delta parquet data types mapping
Event house data types are mapped to Delta Parquet data types using the following rules:
Event house data type | Delta data type |
---|---|
bool | boolean |
datetime | timestamp OR date (for date-bound partition definitions) |
dynamic | string |
guid | string |
int | integer |
long | long |
real | double |
string | string |
timespan | long |
decimal | decimal(38,18) |
For more information on Event house data types, see Scalar data types.
Example policy
{
"Format": "parquet",
"IsEnabled": true,
"Partitions": null,
}
7.13 - Partitioning policy
7.13.1 - Partitioning policy
The partitioning policy defines if and how extents (data shards) should be partitioned for a specific table or a materialized view.
The policy triggers an additional background process that takes place after the creation of extents, following data ingestion. This process includes reingesting data from the source extents and producing homogeneous extents, in which all values of the column designated as the partition key reside within a single partition.
The primary objective of the partitioning policy is to enhance query performance in specific supported scenarios.
Supported scenarios
The following are the only scenarios in which setting a data partitioning policy is recommended. In all other scenarios, setting the policy isn’t advised.
- Frequent filters on a medium or high cardinality string or guid column:
  - For example: multitenant solutions, or a metrics table where most or all queries filter on a column of type string or guid, such as the TenantId or the MetricId.
  - Medium cardinality is at least 10,000 distinct values.
  - Set the hash partition key to be the string or guid column, and set the PartitionAssignmentMode property to uniform.
- Frequent aggregations or joins on a high cardinality string or guid column:
  - For example, IoT information from many different sensors, or academic records of many different students.
  - High cardinality is at least 1,000,000 distinct values, where the distribution of values in the column is approximately even.
  - In this case, set the hash partition key to be the column frequently grouped-by or joined-on, and set the PartitionAssignmentMode property to ByPartition.
- Out-of-order data ingestion:
  - Data ingested into a table might not be ordered and partitioned into extents (shards) according to a specific datetime column that represents the data creation time and is commonly used to filter data. This could be due to a backfill from heterogeneous source files that include datetime values over a large time span.
  - In this case, set the uniform range datetime partition key to be the datetime column.
  - If you need retention and caching policies to align with the datetime values in the column, instead of aligning with the time of ingestion, set the OverrideCreationTime property to true.
Partition keys
The following kinds of partition keys are supported.
Kind | Column Type | Partition properties | Partition value |
---|---|---|---|
Hash | string or guid | Function , MaxPartitionCount , Seed , PartitionAssignmentMode | Function (ColumnName , MaxPartitionCount , Seed ) |
Uniform range | datetime | RangeSize , Reference , OverrideCreationTime | bin_at (ColumnName , RangeSize , Reference ) |
Hash partition key
If the policy includes a hash partition key, all homogeneous extents that belong to the same partition will be assigned to the same data node.
- A hash-modulo function is used to partition the data.
- Data in homogeneous (partitioned) extents is ordered by the hash partition key.
- You don’t need to include the hash partition key in the row order policy, if one is defined on the table.
- Queries that use the shuffle strategy, and in which the shuffle key used in join, summarize, or make-series is the table's hash partition key, are expected to perform better because the amount of data required to move across nodes is reduced.
Partition properties
Property | Description | Supported value(s) | Recommended value |
---|---|---|---|
Function | The name of a hash-modulo function to use. | XxHash64 | |
MaxPartitionCount | The maximum number of partitions to create (the modulo argument to the hash-modulo function) per time period. | In the range (1,2048]. | The recommended value is 128. Higher values significantly increase the overhead of partitioning the data post-ingestion, result in a higher number of extents for each time period and a larger metadata size, and are therefore not recommended. |
Seed | Use for randomizing the hash value. | A positive integer. | 1 , which is also the default value. |
PartitionAssignmentMode | The mode used for assigning partitions to nodes. | ByPartition : All homogeneous (partitioned) extents that belong to the same partition are assigned to the same node.Uniform : An extents’ partition values are disregarded. Extents are assigned uniformly to the nodes. | If queries don’t join or aggregate on the hash partition key, use Uniform . Otherwise, use ByPartition . |
Hash partition key example
A hash partition key over a string-typed column named tenant_id. It uses the XxHash64 hash function, with MaxPartitionCount set to the recommended value 128, and the default Seed of 1.
{
"ColumnName": "tenant_id",
"Kind": "Hash",
"Properties": {
"Function": "XxHash64",
"MaxPartitionCount": 128,
"Seed": 1,
"PartitionAssignmentMode": "Uniform"
}
}
Uniform range datetime partition key
In these cases, you can reshuffle the data between extents so that each extent includes records from a limited time range. This process results in filters on the datetime column being more effective at query time.
The partition function used is bin_at() and isn’t customizable.
Partition properties
Property | Description | Recommended value |
---|---|---|
RangeSize | A timespan scalar constant that indicates the size of each datetime partition. | Start with the value 1.00:00:00 (one day). Don’t set a shorter value, because it may result in the table having a large number of small extents that can’t be merged. |
Reference | A datetime scalar constant that indicates a fixed point in time, according to which datetime partitions are aligned. | Start with 1970-01-01 00:00:00 . If there are records in which the datetime partition key has null values, their partition value is set to the value of Reference . |
OverrideCreationTime | A bool indicating whether or not the result extent’s minimum and maximum creation times should be overridden by the range of the values in the partition key. | Defaults to false . Set to true if data isn’t ingested in-order of time of arrival. For example, a single source file may include datetime values that are distant, and/or you may want to enforce retention or caching based on the datetime values rather than the time of ingestion.When OverrideCreationTime is set to true , extents may be missed in the merge process. Extents are missed if their creation time is older than the Lookback period of the table’s Extents merge policy. To make sure that the extents are discoverable, set the Lookback property to HotCache . |
Uniform range datetime partition example
The snippet shows a uniform datetime range partition key over a datetime
typed column named timestamp
.
It uses datetime(2021-01-01)
as its reference point, with a size of 7d
for each partition, and doesn’t
override the extents’ creation times.
{
"ColumnName": "timestamp",
"Kind": "UniformRange",
"Properties": {
"Reference": "2021-01-01T00:00:00",
"RangeSize": "7.00:00:00",
"OverrideCreationTime": false
}
}
The policy object
By default, a table's data partitioning policy is null, in which case data in the table won't be repartitioned after it's ingested.
The data partitioning policy has the following main properties:
PartitionKeys:
- A collection of partition keys that define how to partition the data in the table.
- A table can have up to 2 partition keys: one hash partition key, one uniform range datetime partition key, or one of each kind.
- Each partition key has the following properties:
  - ColumnName: string - The name of the column according to which the data will be partitioned.
  - Kind: string - The data partitioning kind to apply (Hash or UniformRange).
  - Properties: property bag - Defines parameters according to which partitioning is done.
EffectiveDateTime:
- The UTC datetime from which the policy is effective.
- This property is optional. If it isn’t specified, the policy will take effect for data ingested after the policy was applied.
Data partitioning example
Data partitioning policy object with two partition keys.
- A hash partition key over a string-typed column named tenant_id.
  - It uses the XxHash64 hash function, with MaxPartitionCount set to the recommended value 128, and the default Seed of 1.
- A uniform datetime range partition key over a datetime-typed column named timestamp.
  - It uses datetime(2021-01-01) as its reference point, with a size of 7d for each partition.
{
"PartitionKeys": [
{
"ColumnName": "tenant_id",
"Kind": "Hash",
"Properties": {
"Function": "XxHash64",
"MaxPartitionCount": 128,
"Seed": 1,
"PartitionAssignmentMode": "Uniform"
}
},
{
"ColumnName": "timestamp",
"Kind": "UniformRange",
"Properties": {
"Reference": "2021-01-01T00:00:00",
"RangeSize": "7.00:00:00",
"OverrideCreationTime": false
}
}
]
}
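A minimal sketch of applying a policy like the one above to a placeholder table MyTable follows; the single-quoted JSON argument form is an assumption, so verify it against the partitioning policy management commands:
.alter table MyTable policy partitioning '{"PartitionKeys": [{"ColumnName": "tenant_id", "Kind": "Hash", "Properties": {"Function": "XxHash64", "MaxPartitionCount": 128, "Seed": 1, "PartitionAssignmentMode": "Uniform"}}]}'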
Additional properties
The following properties can be defined as part of the policy. These properties are optional and we recommend not changing them.
Property | Description | Recommended value | Default value |
---|---|---|---|
MinRowCountPerOperation | Minimum target for the sum of row count of the source extents of a single data partitioning operation. | | 0 |
MaxRowCountPerOperation | Maximum target for the sum of the row count of the source extents of a single data partitioning operation. | Set a value lower than 5M if you see that the partitioning operations consume a large amount of memory or CPU per operation. | 0 , with a default target of 5,000,000 records. |
MaxOriginalSizePerOperation | Maximum target for the sum of the original size (in bytes) of the source extents of a single data partitioning operation. | If the partitioning operations consume a large amount of memory or CPU per operation, set a value lower than 5 GB. | 0 , with a default target of 5,368,709,120 bytes (5 GB). |
The data partitioning process
- Data partitioning runs as a post-ingestion background process.
- A table that is continuously ingested into is expected to always have a “tail” of data that is yet to be partitioned (nonhomogeneous extents).
- Data partitioning runs only on hot extents, regardless of the value of the EffectiveDateTime property in the policy.
  - If partitioning cold extents is required, you need to temporarily adjust the caching policy.
You can monitor the partitioning status of tables with defined policies in a database by using the .show database extents partitioning statistics command and partitioning metrics.
Partitioning capacity
The data partitioning process results in the creation of more extents. The extents merge capacity may gradually increase, so that the process of merging extents can keep up.
If there’s a high ingestion throughput, or a large enough number of tables that have a partitioning policy defined, then the Extents partition capacity may gradually increase, so that the process of partitioning extents can keep up.
To avoid consuming too many resources, these dynamic increases are capped. You may be required to gradually and linearly increase them beyond the cap, if they’re used up entirely.
Limitations
- Attempts to partition data in a database that already has more than 5,000,000 extents will be throttled.
  - In such cases, the EffectiveDateTime property of partitioning policies of tables in the database will be automatically delayed by several hours, so that you can reevaluate your configuration and policies.
Outliers in partitioned columns
- The following situations can contribute to imbalanced distribution of data across nodes, and degrade query performance:
  - If a hash partition key includes values that are much more prevalent than others, for example, an empty string, or a generic value (such as null or N/A).
  - The values represent an entity (such as tenant_id) that is more prevalent in the dataset.
- If a uniform range datetime partition key has a large enough percentage of values that are "far" from the majority of the values in the column, the overhead of the data partitioning process is increased and may lead to many small extents to keep track of. An example of such a situation is datetime values from the distant past or future.
In both of these cases, either “fix” the data, or filter out any irrelevant records in the data before or at ingestion time, to reduce the overhead of the data partitioning. For example, use an update policy.
Related content
7.14 - Query acceleration policy
7.14.1 - Query acceleration policy (preview)
An external table is a schema entity that references data stored external to a Kusto database. Queries run over external tables can be less performant than on data that is ingested due to various factors such as network calls to fetch data from storage, the absence of indexes, and more. Query acceleration allows specifying a policy on top of external delta tables. This policy defines a number of days to accelerate data for high-performance queries.
Query acceleration is supported in Azure Data Explorer over Azure Data Lake Store Gen2 or Azure blob storage external tables.
Query acceleration is supported in Eventhouse over OneLake, Azure Data Lake Store Gen2, or Azure blob storage external tables.
To enable query acceleration in the Fabric UI, see Query acceleration over OneLake shortcuts.
Limitations
- The number of columns in the external table can’t exceed 900.
- Delta tables with checkpoint V2 are not supported.
- Query performance over accelerated external delta tables which have partitions may not be optimal during preview.
- The feature assumes delta tables with static advanced features, for example column mapping doesn’t change, partitions don’t change, and so on. To change advanced features, first disable the policy, and once the change is made, re-enable the policy.
- Schema changes on the delta table must also be followed with the respective .alter external delta table schema, which might result in acceleration starting from scratch if there was a breaking schema change.
- Index-based pruning isn't supported for partitions.
- Parquet files larger than 1 GB won’t be cached.
- Query acceleration isn’t supported for external tables with impersonation authentication.
Known issues
- Data in the external delta table that is optimized with the OPTIMIZE function will need to be reaccelerated.
- If you run frequent MERGE/UPDATE/DELETE operations in delta, the underlying parquet files may be rewritten with changes and Kusto will skip accelerating such files, causing retrieval during query time.
- The system assumes that all artifacts under the delta table directory have the same access level to the selected users. Different files having different access permissions under the delta table directory might result with unexpected behavior.
Commands for query acceleration
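The following is only an assumed sketch: the command name and the property names (IsEnabled, Hot) are assumptions to verify against the query acceleration command reference, and MyExternalTable is a placeholder external delta table:
.alter external table MyExternalTable policy query_acceleration '{"IsEnabled": true, "Hot": "1.00:00:00"}'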
7.15 - Query weak consistency policy
7.15.1 - Query weak consistency policy
The query weak consistency policy is a cluster-level policy object that configures the weak consistency service.
Management commands
- Use .show cluster policy query_weak_consistency to show the current query weak consistency policy of the cluster.
- Use .alter cluster policy query_weak_consistency to change the current query weak consistency policy of the cluster.
The policy object
The query weak consistency policy includes the following properties:
Property | Description | Values | Default |
---|---|---|---|
PercentageOfNodes | The percentage of nodes in the cluster that execute the query weak consistency service (the selected nodes will execute the weakly consistent queries). | An integer between 1 to 100 , or -1 for default value (which is currently 20% ). | -1 |
MinimumNumberOfNodes | Minimum number of nodes that execute the query weak consistency service (will determine the number of nodes in case PercentageOfNodes *#NodesInCluster is smaller). | A positive integer, or -1 for default value (which is currently 2 ). Smaller or equal to MaximumNumberOfNodes . | -1 |
MaximumNumberOfNodes | Maximum number of nodes that execute the query weak consistency service (will determine the number of nodes in case PercentageOfNodes *#NodesInCluster is greater). | A positive integer, or -1 for default value (which is currently 30 ). Greater or equal to MinimumNumberOfNodes . | -1 |
SuperSlackerNumberOfNodesThreshold | If the total number of nodes in the cluster exceeds this number, nodes that execute the weak consistency service will become ‘super slacker’, meaning they won’t have data on them (in order to reduce load). See Warning below. | A positive integer that is greater than or equal to 4 , or -1 for default value (currently no threshold - weak consistency nodes won’t become ‘super slacker’). | -1 |
EnableMetadataPrefetch | When set to true , database metadata will be pre-loaded when the cluster comes up, and reloaded every few minutes, on all weak consistency nodes. When set to false , database metadata load will be triggered by queries (on demand), so some queries might be delayed (until the database metadata is pulled from storage). Database metadata must be reloaded from storage to query the database, when its age is greater than MaximumLagAllowedInMinutes . See Warning and Important below. | true or false | false |
MaximumLagAllowedInMinutes | The maximum duration (in minutes) that weakly consistent metadata is allowed to lag behind. If metadata is older than this value, the most up-to-date metadata will be pulled from storage (when the database is queried, or periodically if EnableMetadataPrefetch is enabled). See Warning below. | An integer between 1 to 60, or -1 for default value (currently 5 minutes). | -1 |
RefreshPeriodInSeconds | The refresh period (in seconds) to update a database metadata on each weak consistency node. See Warning below. | An integer between 30 to 1800 , or -1 for default value (currently 120 seconds). | -1 |
Default policy
The default policy is:
{
"PercentageOfNodes": -1,
"MinimumNumberOfNodes": -1,
"MaximumNumberOfNodes": -1,
"SuperSlackerNumberOfNodesThreshold": -1,
"EnableMetadataPrefetch": false,
"MaximumLagAllowedInMinutes": -1,
"RefreshPeriodInSeconds": -1
}
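As an assumed sketch, a single property can likely be changed with an .alter-merge command similar to the following; the partial-object argument form is an assumption to verify against the .alter cluster policy query_weak_consistency reference:
.alter-merge cluster policy query_weak_consistency '{"MaximumNumberOfNodes": 10}'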
7.16 - Restricted view access
7.16.1 - Restricted view access policy
The restricted view access policy is an optional security feature that governs view permissions on a table. By default, the policy is disabled. When enabled, the policy adds an extra layer of permission requirements for principals to access and view the table.
For a table with an enabled restricted view access policy, only principals assigned the UnrestrictedViewer role have the necessary permissions to view the table. Even principals with roles like Table Admin or Database Admin are restricted unless granted the UnrestrictedViewer role.
While the restricted view access policy is specific to individual tables, the UnrestrictedViewer role operates at the database level. Therefore, a principal with the UnrestrictedViewer role has view permissions for all tables within the database. For more detailed information on managing table view access, see Manage view access to tables.
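As a minimal sketch, assuming a placeholder table MyTable, database MyDatabase, and user principal, the policy is enabled and view access granted with commands along these lines (see Manage view access to tables for the authoritative syntax):
.alter table MyTable policy restricted_view_access true
.add database MyDatabase unrestrictedviewers ('aaduser=user@domain.com')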
Limitations
- The restricted view access policy can’t be configured on a table on which a Row Level Security policy is enabled.
- A table with the restricted view access policy enabled can’t be used as the source of a materialized view. For more information, see materialized views limitations and known issues.
Related content
7.17 - Retention policy
7.17.1 - Retention policy
The retention policy controls the mechanism that automatically removes data from tables or materialized views. It’s useful to remove data that continuously flows into a table, and whose relevance is age-based. For example, the policy can be used for a table that holds diagnostics events that may become uninteresting after two weeks.
The retention policy can be configured for a specific table or materialized view, or for an entire database. The policy then applies to all tables in the database that don’t override it. When the policy is configured both at the database and table level, the retention policy in the table takes precedence over the database policy.
Setting up a retention policy is important when continuously ingesting data, because it limits costs.
Data that is “outside” the retention policy is eligible for removal. There’s no specific guarantee when removal occurs. Data may “linger” even if the retention policy is triggered.
The retention policy is most commonly set to limit the age of the data since ingestion. For more information, see SoftDeletePeriod. It's guaranteed that data isn't deleted before the limit is exceeded, but deletion isn't immediate following that point.
The policy object
A retention policy includes the following properties:
- SoftDeletePeriod:
  - Time span for which it's guaranteed that the data is kept available to query. The period is measured starting from the time the data was ingested.
  - Defaults to 1,000 years.
  - When altering the soft-delete period of a table or database, the new value applies to both existing and new data.
- Recoverability:
  - Data recoverability (Enabled/Disabled) after the data was deleted.
  - Defaults to Enabled.
  - If set to Enabled, the data will be recoverable for 14 days after it's been soft-deleted.
  - It is not possible to configure the recoverability period.
Management commands
- Use .show policy retention to show the current retention policy for a database, table, or materialized view.
- Use .alter policy retention to change the current retention policy of a database, table, or materialized view.
Defaults
By default, when a database or a table is created, it doesn’t have a retention policy defined. Normally, the database is created and then immediately has its retention policy set by its creator according to known requirements.
When you run a .show command for the retention policy of a database or table that hasn't had its policy set, Policy appears as null.
The default retention policy, with the default values mentioned above, can be applied using the following command.
.alter database DatabaseName policy retention "{}"
.alter table TableName policy retention "{}"
.alter materialized-view ViewName policy retention "{}"
The command results in the following policy object applied to the database or table.
{
"SoftDeletePeriod": "365000.00:00:00", "Recoverability":"Enabled"
}
Clearing the retention policy of a database or table can be done using the following command.
.delete database DatabaseName policy retention
.delete table TableName policy retention
Examples
The following examples are for an environment that has a database named MyDatabase, with tables MyTable1, MyTable2, and MySpecialTable.
Soft-delete period of seven days and recoverability disabled
Set all tables in the database to have a soft-delete period of seven days and disabled recoverability.
Option 1 (Recommended): Set a database-level retention policy, and verify there are no table-level policies set.
.delete table MyTable1 policy retention // optional, only if the table previously had its policy set
.delete table MyTable2 policy retention // optional, only if the table previously had its policy set
.delete table MySpecialTable policy retention // optional, only if the table previously had its policy set
.alter-merge database MyDatabase policy retention softdelete = 7d recoverability = disabled
.alter-merge materialized-view ViewName policy retention softdelete = 7d
Option 2: For each table, set a table-level retention policy, with a soft-delete period of seven days and recoverability disabled.
.alter-merge table MyTable1 policy retention softdelete = 7d recoverability = disabled
.alter-merge table MyTable2 policy retention softdelete = 7d recoverability = disabled
.alter-merge table MySpecialTable policy retention softdelete = 7d recoverability = disabled
Soft-delete period of seven days and recoverability enabled
Set tables MyTable1 and MyTable2 to have a soft-delete period of seven days and recoverability disabled.
Set MySpecialTable to have a soft-delete period of 14 days and recoverability enabled.
Option 1 (Recommended): Set a database-level retention policy, and set a table-level retention policy.
.delete table MyTable1 policy retention // optional, only if the table previously had its policy set
.delete table MyTable2 policy retention // optional, only if the table previously had its policy set
.alter-merge database MyDatabase policy retention softdelete = 7d recoverability = disabled
.alter-merge table MySpecialTable policy retention softdelete = 14d recoverability = enabled
Option 2: For each table, set a table-level retention policy, with the relevant soft-delete period and recoverability.
.alter-merge table MyTable1 policy retention softdelete = 7d recoverability = disabled
.alter-merge table MyTable2 policy retention softdelete = 7d recoverability = disabled
.alter-merge table MySpecialTable policy retention softdelete = 14d recoverability = enabled
Soft-delete period of seven days, and MySpecialTable keeps its data indefinitely
Set tables MyTable1 and MyTable2 to have a soft-delete period of seven days, and have MySpecialTable keep its data indefinitely.
Option 1: Set a database-level retention policy, and set a table-level retention policy, with a soft-delete period of 1,000 years, the default retention policy, for MySpecialTable.
.delete table MyTable1 policy retention // optional, only if the table previously had its policy set
.delete table MyTable2 policy retention // optional, only if the table previously had its policy set
.alter-merge database MyDatabase policy retention softdelete = 7d
.alter table MySpecialTable policy retention "{}" // this sets the default retention policy
Option 2: For tables MyTable1 and MyTable2, set a table-level retention policy, and verify that the database-level and table-level policy for MySpecialTable aren't set.
.delete database MyDatabase policy retention // optional, only if the database previously had its policy set
.delete table MySpecialTable policy retention // optional, only if the table previously had its policy set
.alter-merge table MyTable1 policy retention softdelete = 7d
.alter-merge table MyTable2 policy retention softdelete = 7d
Option 3: For tables MyTable1 and MyTable2, set a table-level retention policy. For table MySpecialTable, set a table-level retention policy with a soft-delete period of 1,000 years, the default retention policy.
.alter-merge table MyTable1 policy retention softdelete = 7d
.alter-merge table MyTable2 policy retention softdelete = 7d
.alter table MySpecialTable policy retention "{}"
7.18 - Row level security policy
7.18.1 - Row level security policy
Use group membership or execution context to control access to rows in a database table.
Row Level Security (RLS) simplifies the design and coding of security. It lets you apply restrictions on data row access in your application. For example, limit user access to rows relevant to their department, or restrict customer access to only the data relevant to their company.
The access restriction logic is located in the database tier, rather than away from the data in another application tier. The database system applies the access restrictions every time data access is attempted from any tier. This logic makes your security system more reliable and robust by reducing the surface area of your security system.
RLS lets you provide access to other applications and users, only to a certain portion of a table. For example, you might want to:
- Grant access only to rows that meet some criteria
- Anonymize data in some of the columns
- All of the above
For more information, see management commands for managing the Row Level Security policy.
Limitations
- There’s no limit on the number of tables on which Row Level Security policy can be configured.
- Row Level Security policy cannot be configured on External Tables.
- The RLS policy can’t be enabled on a table under the following circumstances:
- When it’s referenced by an update policy query, while the update policy is not configured with a managed identity.
- When it’s referenced by a continuous export that uses an authentication method other than impersonation.
- When a restricted view access policy is configured for the table.
- The RLS query can’t reference other tables that have Row Level Security policy enabled.
- The RLS query can’t reference tables located in other databases.
Examples
Limit access to Sales table
In a table named Sales, each row contains details about a sale. One of the columns contains the name of the salesperson. Instead of giving your salespeople access to all records in Sales, enable a Row Level Security policy on this table to only return records where the salesperson is the current user:
Sales | where SalesPersonAadUser == current_principal()
You can also mask the email address:
Sales | where SalesPersonAadUser == current_principal() | extend EmailAddress = "****"
If you want every sales person to see all the sales of a specific country/region, you can define a query similar to:
let UserToCountryMapping = datatable(User:string, Country:string)
[
"john@domain.com", "USA",
"anna@domain.com", "France"
];
Sales
| where Country in ((UserToCountryMapping | where User == current_principal_details()["UserPrincipalName"] | project Country))
If you have a group that contains the managers, you might want to give them access to all rows. Here’s the query for the Row Level Security policy.
let IsManager = current_principal_is_member_of('aadgroup=sales_managers@domain.com');
let AllData = Sales | where IsManager;
let PartialData = Sales | where not(IsManager) and (SalesPersonAadUser == current_principal()) | extend EmailAddress = "****";
union AllData, PartialData
Expose different data to members of different Microsoft Entra groups
If you have multiple Microsoft Entra groups, and you want the members of each group to see a different subset of data, use this structure for an RLS query.
Customers
| where (current_principal_is_member_of('aadgroup=group1@domain.com') and <filtering specific for group1>) or
(current_principal_is_member_of('aadgroup=group2@domain.com') and <filtering specific for group2>) or
(current_principal_is_member_of('aadgroup=group3@domain.com') and <filtering specific for group3>)
Apply the same RLS function on multiple tables
First, define a function that receives the table name as a string parameter, and references the table using the table() operator.
For example:
.create-or-alter function RLSForCustomersTables(TableName: string) {
table(TableName)
| ...
}
Then configure RLS on multiple tables this way:
.alter table Customers1 policy row_level_security enable "RLSForCustomersTables('Customers1')"
.alter table Customers2 policy row_level_security enable "RLSForCustomersTables('Customers2')"
.alter table Customers3 policy row_level_security enable "RLSForCustomersTables('Customers3')"
Produce an error upon unauthorized access
If you want unauthorized table users to receive an error instead of an empty table, use the assert() function. The following example shows you how to produce this error in an RLS function:
.create-or-alter function RLSForCustomersTables() {
MyTable
| where assert(current_principal_is_member_of('aadgroup=mygroup@mycompany.com') == true, "You don't have access")
}
You can combine this approach with other examples. For example, you can display different results to users in different Microsoft Entra groups, and produce an error for everyone else.
Control permissions on follower databases
The RLS policy that you configure on the production database will also take effect in the follower databases. You can't configure different RLS policies on the production and follower databases. However, you can use the current_cluster_endpoint() function in your RLS query to achieve the same effect as having different RLS queries in follower tables.
For example:
.create-or-alter function RLSForCustomersTables() {
let IsProductionCluster = current_cluster_endpoint() == "mycluster.eastus.kusto.windows.net";
let DataForProductionCluster = TempTable | where IsProductionCluster;
let DataForFollowerClusters = TempTable | where not(IsProductionCluster) | extend EmailAddress = "****";
union DataForProductionCluster, DataForFollowerClusters
}
Control permissions on shortcut databases
The RLS policy that you configure on the production database will also take effect in the shortcut databases. You can't configure different RLS policies on the production and shortcut databases. However, you can use the current_cluster_endpoint() function in your RLS query to achieve the same effect as having different RLS queries in shortcut tables.
For example:
.create-or-alter function RLSForCustomersTables() {
let IsProductionCluster = current_cluster_endpoint() == "mycluster.eastus.kusto.windows.net";
let DataForProductionCluster = TempTable | where IsProductionCluster;
let DataForFollowerClusters = TempTable | where not(IsProductionCluster) | extend EmailAddress = "****";
union DataForProductionCluster, DataForFollowerClusters
}
More use cases
- A call center support person may identify callers by several digits of their social security number. This number shouldn’t be fully exposed to the support person. An RLS policy can be applied on the table to mask all but the last four digits of the social security number in the result set of any query.
- Set an RLS policy that masks personally identifiable information (PII), and enables developers to query production environments for troubleshooting purposes without violating compliance regulations.
- A hospital can set an RLS policy that allows nurses to view data rows for their patients only.
- A bank can set an RLS policy to restrict access to financial data rows based on an employee’s business division or role.
- A multi-tenant application can store data from many tenants in a single tableset (which is efficient). They would use an RLS policy to enforce a logical separation of each tenant’s data rows from every other tenant’s rows, so each tenant can see only its data rows.
Performance impact on queries
When an RLS policy is enabled on a table, there will be some performance impact on queries that access that table. Access to the table will be replaced by the RLS query that’s defined on that table. The performance impact of an RLS query will normally consist of two parts:
- Membership checks in Microsoft Entra ID: Checks are efficient. You can check membership in tens, or even hundreds of groups without major impact on the query performance.
- Filters, joins, and other operations that are applied on the data: Impact depends on the complexity of the query
For example:
let IsRestrictedUser = current_principal_is_member_of('aadgroup=some_group@domain.com');
let AllData = MyTable | where not(IsRestrictedUser);
let PartialData = MyTable | where IsRestrictedUser and (...);
union AllData, PartialData
If the user isn't part of some_group@domain.com, then IsRestrictedUser is evaluated to false. The query that is evaluated is similar to this one:
let AllData = MyTable; // the condition evaluates to `true`, so the filter is dropped
let PartialData = <empty table>; // the condition evaluates to `false`, so the whole expression is replaced with an empty table
union AllData, PartialData // this will just return AllData, as PartialData is empty
Similarly, if IsRestrictedUser evaluates to true, then only the query for PartialData will be evaluated.
Improve query performance when RLS is used
- If a filter is applied on a high-cardinality column, for example, DeviceID, consider using Partitioning policy or Row Order policy
- If a filter is applied on a low-medium-cardinality column, consider using Row Order policy
Performance impact on ingestion
There’s no performance impact on ingestion.
7.19 - Row order policy
7.19.1 - Row order policy
The row order policy sets the preferred arrangement of rows within an extent. The policy is optional and set at the table level.
The main purpose of the policy is to improve the performance of queries that are narrowed to a small subset of values in ordered columns. Additionally, it may contribute to improvements in compression.
Use management commands to alter, alter-merge, delete, or show the row order policy for a table.
When to set the policy
It’s appropriate to set the policy under the following conditions:
- Most queries filter on specific values of a certain large-dimension column, such as an “application ID” or a “tenant ID”
- The data ingested into the table is unlikely to be preordered according to this column
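For example, a minimal sketch that orders rows by a placeholder large-dimension column first and a datetime column second (verify the exact syntax against the row order policy commands):
.alter table MyTable policy roworder (TenantId asc, Timestamp desc)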
Performance considerations
There are no hardcoded limits set on the amount of columns, or sort keys, that can be defined as part of the policy. However, every additional column adds some overhead to the ingestion process, and as more columns are added, the effective return diminishes.
7.20 - Sandbox policy
7.20.1 - Sandbox policy
Certain plugins run within sandboxes whose available resources are limited and controlled for security and for resource governance.
Sandboxes run on the nodes of your cluster. Some of their limitations are defined in sandbox policies, where each sandbox kind can have its own policy.
Sandbox policies are managed at cluster-level and affect all the nodes in the cluster.
Permissions
You must have AllDatabasesAdmin permissions to run this command.
The policy object
A sandbox policy has the following properties.
- SandboxKind: Defines the type of the sandbox (such as PythonExecution or RExecution).
- IsEnabled: Defines if sandboxes of this type may run on the cluster's nodes.
- The default value is false.
- InitializeOnStartup: Defines whether sandboxes of this type are initialized on startup, or lazily, upon first use.
- The default value is false. To ensure consistent performance and avoid any delays for running queries following service restart, set this property to true.
- TargetCountPerNode: Defines how many sandboxes of this type are allowed to run on the cluster’s nodes.
- Values can be between one and twice the number of processors per node.
- The default value is 16.
- MaxCpuRatePerSandbox: Defines the maximum CPU rate as a percentage of all available cores that a single sandbox can use.
- Values can be between 1 and 100.
- The default value is 50.
- MaxMemoryMbPerSandbox: Defines the maximum amount of memory (in megabytes) that a single sandbox can use.
- For Hyper-V technology sandboxes, values can be between 200 and 32768 (32 GB). The default value is 1024 (1 GB). The maximum memory of all sandboxes on a node (TargetCountPerNode * MaxMemoryMbPerSandbox) is 32768 (32 GB).
- For legacy sandboxes, values can be between 200 and 65536 (64 GB). The default value is 20480 (20 GB).
If a policy isn't explicitly defined for a sandbox kind, an implicit policy with the default values and IsEnabled set to true applies.
Example
The following policy sets different limits for PythonExecution and RExecution sandboxes:
[
{
"SandboxKind": "PythonExecution",
"IsEnabled": true,
"InitializeOnStartup": false,
"TargetCountPerNode": 4,
"MaxCpuRatePerSandbox": 55,
"MaxMemoryMbPerSandbox": 8192
},
{
"SandboxKind": "RExecution",
"IsEnabled": true,
"InitializeOnStartup": false,
"TargetCountPerNode": 2,
"MaxCpuRatePerSandbox": 50,
"MaxMemoryMbPerSandbox": 10240
}
]
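An assumed sketch of applying such a policy at the cluster level follows; the quoted-array argument form is an assumption to verify against the sandbox policy management commands:
.alter cluster policy sandbox '[{"SandboxKind": "PythonExecution", "IsEnabled": true, "InitializeOnStartup": false, "TargetCountPerNode": 4, "MaxCpuRatePerSandbox": 55, "MaxMemoryMbPerSandbox": 8192}]'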
Related content
7.20.2 - Sandboxes
Kusto can run sandboxes for specific flows that must be run in a secure and isolated environment. Examples of these flows are user-defined scripts that run using the Python plugin or the R plugin.
Sandboxes are run locally (meaning, processing is done close to the data), with no extra latency for remote calls.
Prerequisites and limitations
- Sandboxes must run on VM sizes supporting nested virtualization; the sandboxes are implemented using Hyper-V technology and have no limitations.
- The image for running the sandboxes is deployed to every cluster node and requires dedicated SSD space to run.
- The estimated size is between 10-20 GB.
- This affects the cluster’s data capacity, and may affect the cost of the cluster.
Runtime
- A sandboxed query operator may use one or more sandboxes for its execution.
- A sandbox is only used for a single query and is disposed of once that query completes.
- When a node is restarted, for example, as part of a service upgrade, all running sandboxes on it are disposed of.
- Each node maintains a predefined number of sandboxes that are ready for running incoming requests.
- Once a sandbox is used, a new one is automatically made available to replace it.
- If there are no pre-allocated sandboxes available to serve a query operator, it will be throttled until new sandboxes are available. For more information, see Errors. New sandbox allocation could take up to 10-15 seconds per sandbox, depending on the SKU and available resources on the data node.
Sandbox parameters
Some of the parameters can be controlled using a cluster-level sandbox policy, for each kind of sandbox.
- Number of sandboxes per node: The number of sandboxes per node is limited.
- Requests that are made when there’s no available sandbox will be throttled.
- Initialize on startup: if set to false (default), sandboxes are lazily initialized on a node, the first time a query requires a sandbox for its execution. Otherwise, if set to true, sandboxes are initialized as part of service startup.
  - This means that the first execution of a plugin that uses sandboxes on a node will include a short warm-up period.
- CPU: The maximum rate of CPU a sandbox can consume of its host's processors is limited (default is 50%).
  - When the limit is reached, the sandbox's CPU use is throttled, but execution continues.
- Memory: The maximum amount of RAM a sandbox can consume of its host’s RAM is limited.
- Default memory for Hyper-V technology is 1 GB, and for legacy sandboxes 20 GB.
- Reaching the limit results in termination of the sandbox, and a query execution error.
Sandbox limitations
- Network: A sandbox can’t interact with any resource on the virtual machine (VM) or outside of it.
- A sandbox can’t interact with another sandbox.
Errors
ErrorCode | Status | Message | Potential reason |
---|---|---|---|
E_SB_QUERY_THROTTLED_ERROR | TooManyRequests (429) | The sandboxed query was aborted because of throttling. Retrying after some backoff might succeed | There are no available sandboxes on the target node. New sandboxes should become available in a few seconds |
E_SB_QUERY_THROTTLED_ERROR | TooManyRequests (429) | Sandboxes of kind ‘{kind}’ haven’t yet been initialized | The sandbox policy has recently changed. New sandboxes obeying the new policy will become available in a few seconds |
| | InternalServiceError (520) | The sandboxed query was aborted due to a failure in initializing sandboxes | An unexpected infrastructure failure. |
VM Sizes supporting nested virtualization
The following table lists all modern VM sizes that support Hyper-V sandbox technology.
Name | Category |
---|---|
Standard_L8s_v3 | storage-optimized |
Standard_L16s_v3 | storage-optimized |
Standard_L8as_v3 | storage-optimized |
Standard_L16as_v3 | storage-optimized |
Standard_E8as_v5 | storage-optimized |
Standard_E16as_v5 | storage-optimized |
Standard_E8s_v4 | storage-optimized |
Standard_E16s_v4 | storage-optimized |
Standard_E8s_v5 | storage-optimized |
Standard_E16s_v5 | storage-optimized |
Standard_E2ads_v5 | compute-optimized |
Standard_E4ads_v5 | compute-optimized |
Standard_E8ads_v5 | compute-optimized |
Standard_E16ads_v5 | compute-optimized |
Standard_E2d_v4 | compute-optimized |
Standard_E4d_v4 | compute-optimized |
Standard_E8d_v4 | compute-optimized |
Standard_E16d_v4 | compute-optimized |
Standard_E2d_v5 | compute-optimized |
Standard_E4d_v5 | compute-optimized |
Standard_E8d_v5 | compute-optimized |
Standard_E16d_v5 | compute-optimized |
Standard_D32d_v4 | compute-optimized |
7.21 - Sharding policy
7.21.1 - Data sharding policy
The sharding policy defines if and how extents (data shards) in your cluster are created. You can only query data in an extent once it’s created.
The data sharding policy contains the following properties:
ShardEngineMaxRowCount:
- Maximum row count for an extent created by an ingestion or rebuild operation.
- Defaults to 1,048,576.
- Not in effect for merge operations.
- If you must limit the number of rows in extents created by merge operations, adjust the `RowCountUpperBoundForMerge` property in the entity's extents merge policy.
ShardEngineMaxExtentSizeInMb:
- Maximum allowed compressed data size (in megabytes) for an extent created by a merge or rebuild operation.
- Defaults to 8,192 (8 GB).
ShardEngineMaxOriginalSizeInMb:
- Maximum allowed original data size (in megabytes) for an extent created by a rebuild operation.
- In effect only for rebuild operations.
- Defaults to 3,072 (3 GB).
When a database is created, it contains the default data sharding policy. This policy is inherited by all tables created in the database (unless the policy is explicitly overridden at the table level).
Use the sharding policy management commands to manage data sharding policies for databases and tables.
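For example, the following command is a minimal sketch that overrides the default data sharding policy at the database level; the database name and property values are illustrative:
.alter database MyDatabase policy sharding ```{ "ShardEngineMaxRowCount": 750000, "ShardEngineMaxExtentSizeInMb": 1024, "ShardEngineMaxOriginalSizeInMb": 2048 }```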
Related content
7.22 - Streaming ingestion policy
7.22.1 - Streaming ingestion policy
Streaming ingestion target scenarios
Streaming ingestion should be used for the following scenarios:
- Latency of less than a few seconds is required.
- To optimize operational processing of many tables where the stream of data into each table is relatively small (a few records per second), but the overall data ingestion volume is high (thousands of records per second).
If the stream of data into each table is high (over 4 GB per hour), consider using queued ingestion.
- To learn how to implement this feature and about its limitations, see streaming ingestion.
- For information about streaming ingestion management commands, see Management commands used for managing the streaming ingestion policy.
Streaming ingestion policy definition
The streaming ingestion policy contains the following properties:
- IsEnabled:
  - Defines the status of streaming ingestion functionality for the table or database.
  - Mandatory; there's no default value, and it must be explicitly set to true or false.
- HintAllocatedRate:
  - If set, provides a hint on the hourly volume of data in gigabytes expected for the table. This hint helps the system adjust the amount of resources that are allocated for the table in support of streaming ingestion.
  - Default value: null (unset).
To enable streaming ingestion on a table, define the streaming ingestion policy with IsEnabled set to true. This definition can be set on a table itself or on the database. Defining this policy at the database level applies the same settings to all existing and future tables in the database. If the streaming ingestion policy is set at both the table and database levels, the table level setting takes precedence. This setting means that streaming ingestion can be generally enabled for the database but specifically disabled for certain tables, or the other way around.
Set the data rate hint
The streaming ingestion policy can provide a hint about the hourly volume of data expected for the table. This hint helps the system adjust the amount of resources allocated for the table in support of streaming ingestion. Set the hint if the rate of data streamed into the table exceeds 1 GB per hour. When you set HintAllocatedRate in the streaming ingestion policy for a database, set it according to the table with the highest expected data rate. Don't set the effective hint for a table to a value much higher than the expected peak hourly data rate, because doing so might adversely affect query performance.
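For example, the following command is a minimal sketch that enables streaming ingestion on a hypothetical table and hints at an expected ingress rate of about 2.1 GB per hour:
.alter table MyTable policy streamingingestion '{"IsEnabled": true, "HintAllocatedRate": 2.1}'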
Related content
- .show database policy streamingingestion command
- .show table policy streamingingestion command
- .alter database policy streamingingestion command
- .alter-merge database policy streamingingestion command
- .alter table policy streamingingestion command
- .alter-merge table policy streamingingestion command
- .delete database policy streamingingestion command
- .delete table policy streamingingestion command
- Streaming ingestion and schema changes
7.23 - Update policy
7.23.1 - Common scenarios for using table update policies
This section describes some well-known scenarios that use update policies. Consider adopting these scenarios when your circumstances are similar.
In this article, you learn about the following common scenarios:
- Medallion architecture data enrichment
- Data routing
- Optimize data models
Medallion architecture data enrichment
Update policies on tables provide an efficient way to apply rapid transformations and are compatible with the medallion lakehouse architecture in Fabric.
In the medallion architecture, when raw data lands in a landing table (bronze layer), an update policy can be used to apply initial transformations and save the enriched output to a silver layer table. This process can cascade, where the data from the silver layer table can trigger another update policy to further refine the data and hydrate a gold layer table.
The following diagram illustrates an example of a data enrichment update policy named Get_Values. The enriched data is output to a silver layer table, which includes a calculated timestamp value and lookup values based on the raw data.
Data routing
A special case of data enrichment occurs when a raw data element contains data that must be routed to a different table based on one or more attributes of the data itself.
Consider an example that uses the same base data as the previous scenario, but this time there are three messages. The first message is a device telemetry message, the second message is a device alarm message, and the third message is an error.
To handle this scenario, three update policies are used. The Get_Telemetry update policy filters the device telemetry message, enriches the data, and saves it to the Device_Telemetry table. Similarly, the Get_Alarms update policy saves the data to the Device_Alarms table. Lastly, the Log_Error update policy sends unknown messages to the Error_Log table, allowing operators to detect malformed messages or unexpected schema evolution.
The following diagram depicts the example with the three update policies.
Optimize data models
Update policies on tables are built for speed. Tables typically conform to star schema design, which supports the development of data models that are optimized for performance and usability.
Querying tables in a star schema often requires joining tables. However, table joins can lead to performance issues, especially when querying high volumes of data. To improve query performance, you can flatten the model by storing denormalized data at ingestion time.
Joining tables at ingestion time has the added benefit of operating on a small batch of data, resulting in a reduced computational cost of the join. This approach can massively improve the performance of downstream queries.
For example, you can enrich raw telemetry data from a device by looking up values from a dimension table. An update policy can perform the lookup at ingestion time and save the output to a denormalized table. Furthermore, you can extend the output with data sourced from a reference data table.
The following diagram depicts the example, which comprises an update policy named Enrich_Device_Data. It extends the output data with data sourced from the Site reference data table.
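The following KQL is a minimal sketch of such an enrichment function; the table names (Raw_Device_Data, Site), column names, and function name are hypothetical and only illustrate the shape of the transformation:
.create function with (docstring = 'Enriches raw device data with site reference data', folder = 'UpdatePolicyFunctions') Enrich_Device_Data() {
    Raw_Device_Data
    // Derive a calculated timestamp from the raw epoch value.
    | extend Timestamp = unixtime_milliseconds_todatetime(RawTimestampMs)
    // Add lookup values from the Site reference data table.
    | lookup kind=leftouter Site on SiteId
}
An update policy defined on the silver layer table would then invoke this function as its Query.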
Related content
7.23.2 - Run an update policy with a managed identity
The update policy must be configured with a managed identity in the following scenarios:
- When the update policy query references tables in other databases
- When the update policy query references tables with an enabled row level security policy
An update policy configured with a managed identity is performed on behalf of the managed identity.
In this article, you learn how to configure a system-assigned or user-assigned managed identity and create an update policy using that identity.
Prerequisites
- A cluster and database. If you don't have one, see Create a cluster and database.
- AllDatabasesAdmin permissions on the database.
Configure a managed identity
There are two types of managed identities:
System-assigned: A system-assigned identity is connected to your cluster and is removed when the cluster is removed. Only one system-assigned identity is allowed per cluster.
User-assigned: A user-assigned managed identity is a standalone Azure resource. Multiple user-assigned identities can be assigned to your cluster.
Select one of the following tabs to set up your preferred managed identity type.
User-assigned
Follow the steps to Add a user-assigned identity.
In the Azure portal, in the left menu of your managed identity resource, select Properties. Copy and save the Tenant Id and Principal ID for use in the following steps.
Run the following .alter-merge policy managed_identity command, replacing `<objectId>` with the managed identity Principal ID from the previous step. This command sets a managed identity policy on the cluster that allows the managed identity to be used with the update policy.
.alter-merge cluster policy managed_identity ```[ { "ObjectId": "<objectId>", "AllowedUsages": "AutomatedFlows" } ]```
[!NOTE] To set the policy on a specific database, use `database <DatabaseName>` instead of `cluster`.
Run the following command to grant the managed identity Database Viewer permissions over all databases referenced by the update policy query.
.add database <DatabaseName> viewers ('aadapp=<objectId>;<tenantId>')
Replace `<DatabaseName>` with the relevant database, `<objectId>` with the managed identity Principal ID from step 2, and `<tenantId>` with the Microsoft Entra ID Tenant Id from step 2.
System-assigned
Follow the steps to Add a system-assigned identity.
Copy and save the Object ID for use in a later step.
Run the following .alter-merge policy managed_identity command. This command sets a managed identity policy on the cluster that allows the managed identity to be used with the update policy.
.alter-merge cluster policy managed_identity ```[ { "ObjectId": "system", "AllowedUsages": "AutomatedFlows" } ]```
[!NOTE] To set the policy on a specific database, use `database <DatabaseName>` instead of `cluster`.
Run the following command to grant the managed identity Database Viewer permissions over all databases referenced by the update policy query.
.add database <DatabaseName> viewers ('aadapp=<objectId>')
Replace `<DatabaseName>` with the relevant database and `<objectId>` with the managed identity Object ID you saved earlier.
Create an update policy
Select one of the following tabs to create an update policy that runs on behalf of a user-assigned or system-assigned managed identity.
User-assigned
Run the .alter table policy update command with the `ManagedIdentity` property set to the managed identity object ID.
For example, the following command alters the update policy of the table `MyTable` in the database `MyDatabase`. It's important to note that both the `Source` and `Query` parameters should only reference objects within the same database where the update policy is defined. However, the code contained within the function specified in the `Query` parameter can interact with tables located in other databases. For example, the function `MyUpdatePolicyFunction()` can access `OtherTable` in `OtherDatabase` on behalf of a user-assigned managed identity. Replace `<objectId>` with a managed identity object ID.
.alter table MyDatabase.MyTable policy update
```
[
{
"IsEnabled": true,
"Source": "MyTable",
"Query": "MyUpdatePolicyFunction()",
"IsTransactional": false,
"PropagateIngestionProperties": false,
"ManagedIdentity": "<objectId>"
}
]
```
System-assigned
Run the .alter table policy update command with the `ManagedIdentity` property set to the managed identity object ID.
For example, the following command alters the update policy of the table `MyTable` in the database `MyDatabase`. It's important to note that both the `Source` and `Query` parameters should only reference objects within the same database where the update policy is defined. However, the code contained within the function specified in the `Query` parameter can interact with tables located in other databases. For example, the function `MyUpdatePolicyFunction()` can access `OtherTable` in `OtherDatabase` on behalf of your system-assigned managed identity.
.alter table MyDatabase.MyTable policy update
```
[
{
"IsEnabled": true,
"Source": "MyTable",
"Query": "MyUpdatePolicyFunction()",
"IsTransactional": false,
"PropagateIngestionProperties": false,
"ManagedIdentity": "system"
}
]
```
7.23.3 - Update policy overview
Update policies are automation mechanisms triggered when new data is written to a table. They eliminate the need for special orchestration by running a query to transform the ingested data and save the result to a destination table. Multiple update policies can be defined on a single table, allowing for different transformations and saving data to multiple tables simultaneously. The target tables can have a different schema, retention policy, and other policies from the source table.
For example, a high-rate trace source table can contain data formatted as a free-text column. The target table can include specific trace lines, with a well-structured schema generated from a transformation of the source table's free-text data using the parse operator. For more information, see common scenarios.
The following diagram depicts a high-level view of an update policy. It shows two update policies that are triggered when data is added to the second source table. Once they’re triggered, transformed data is added to the two target tables.
An update policy is subject to the same restrictions and best practices as regular ingestion. The policy scales out according to the cluster or Eventhouse size, and is more efficient when handling bulk ingestion.
Ingesting formatted data improves performance, and CSV is preferred because it's a well-defined format. Sometimes, however, you have no control over the format of the data, or you want to enrich ingested data, for example, by joining records with a static dimension table in your database.
Update policy query
The update policy is defined on the target table, and multiple update policy queries can run on data ingested into a single source table. If there are multiple update policies, the order of execution isn't necessarily known.
Query limitations
- The policy-related query can invoke stored functions, but:
  - It can't perform cross-cluster or cross-eventhouse queries.
  - It can't access external data or external tables.
  - It can't make callouts (by using a plugin).
- The query doesn't have read access to tables that have the RestrictedViewAccess policy enabled.
- For update policy limitations in streaming ingestion, see streaming ingestion limitations.
- In an Eventhouse, the Streaming ingestion policy is enabled by default for all tables. To use functions with the `join` operator in an update policy, the streaming ingestion policy must be disabled. Use the `.alter table TableName policy streamingingestion PolicyObject` command to disable it, as shown in the example after this list.
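For example, the following command is a minimal sketch that disables the streaming ingestion policy on a hypothetical table so that its update policy can use functions with the join operator:
.alter table MyTable policy streamingingestion '{"IsEnabled": false}'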
When referencing the `Source` table in the `Query` part of the policy, or in functions referenced by the `Query` part:
- Don't use the qualified name of the table. Instead, use `TableName`.
- Don't use `database("<DatabaseName>").TableName` or `cluster("<ClusterName>").database("<DatabaseName>").TableName` (in an Eventhouse, `cluster("<EventhouseName>").database("<DatabaseName>").TableName`).
The update policy object
A table can have zero or more update policy objects associated with it. Each such object is represented as a JSON property bag, with the following properties defined.
Property | Type | Description |
---|---|---|
IsEnabled | bool | States if update policy is true - enabled, or false - disabled |
Source | string | Name of the table that triggers invocation of the update policy |
Query | string | A query used to produce data for the update |
IsTransactional | bool | States if the update policy is transactional or not, default is false. If the policy is transactional and the update policy fails, the source table isn’t updated. |
PropagateIngestionProperties | bool | States if properties specified during ingestion to the source table, such as extent tags and creation time, apply to the target table. |
ManagedIdentity | string | The managed identity on behalf of which the update policy runs. The managed identity can be an object ID, or the system reserved word. The update policy must be configured with a managed identity when the query references tables in other databases or tables with an enabled row level security policy. For more information, see Use a managed identity to run an update policy. |
Management commands
Update policy management commands include:
- `.show table TableName policy update` shows the current update policy of a table.
- `.alter table TableName policy update` defines the current update policy of a table.
- `.alter-merge table TableName policy update` appends definitions to the current update policy of a table.
- `.delete table TableName policy update` deletes the current update policy of a table.
Update policy is initiated following ingestion
Update policies take effect when data is ingested or moved to a source table, or extents are created in a source table. These actions can be done using any of the following commands:
- .ingest (pull)
- .ingest (inline)
- .set | .append | .set-or-append | .set-or-replace
- .move extents
- .replace extents
- The `PropagateIngestionProperties` option only takes effect in ingestion operations. When the update policy is triggered as part of a `.move extents` or `.replace extents` command, this option has no effect.
Remove data from source table
After ingesting data to the target table, you can optionally remove it from the source table. Set a soft-delete period of `0sec` (or `00:00:00`) in the source table's retention policy, and set the update policy as transactional. The following conditions apply:
- The source data isn’t queryable from the source table
- The source data doesn’t persist in durable storage as part of the ingestion operation
- Operational performance improves. Post-ingestion resources are reduced for background grooming operations on extents in the source table.
Performance impact
Update policies can affect performance, and ingestion of data extents is multiplied by the number of target tables. It's important to optimize the policy-related query. You can test an update policy's performance impact by invoking the policy on already-existing extents, before creating or altering the policy, or on the function used with the query.
Evaluate resource usage
Use `.show queries` to evaluate resource usage (CPU, memory, and so on) with the following parameters:
- Set the `Source` property, the source table name, as `MySourceTable`.
- Set the `Query` property to call a function named `MyFunction()`.
// '_extentId' is the ID of a recently created extent, that likely hasn't been merged yet.
let _extentId = toscalar(
MySourceTable
| project ExtentId = extent_id(), IngestionTime = ingestion_time()
| where IngestionTime > ago(10m)
| top 1 by IngestionTime desc
| project ExtentId
);
// This scopes the source table to the single recent extent.
let MySourceTable =
MySourceTable
| where ingestion_time() > ago(10m) and extent_id() == _extentId;
// This invokes the function in the update policy (that internally references `MySourceTable`).
MyFunction
Transactional settings
The update policy `IsTransactional` setting defines whether the update policy is transactional and can affect the behavior of the policy update, as follows:
- `IsTransactional: false`: If the value is set to the default value, false, the update policy doesn't guarantee consistency between data in the source and target table. If an update policy fails, data is ingested only to the source table and not to the target table. In this scenario, the ingestion operation is successful.
- `IsTransactional: true`: If the value is set to true, the setting guarantees consistency between data in the source and target tables. If an update policy fails, data isn't ingested to the source or target table. In this scenario, the ingestion operation is unsuccessful.
Handling failures
When policy updates fail, they're handled differently based on whether the `IsTransactional` setting is `true` or `false`. Common reasons for update policy failures are:
- A mismatch between the query output schema and the target table.
- Any query error.
You can view policy update failures using the `.show ingestion failures` command:
.show ingestion failures
| where FailedOn > ago(1hr) and OriginatesFromUpdatePolicy == true
If needed, you can manually retry ingestion.
Example of extract, transform, load
You can use update policy settings to perform extract, transform, load (ETL).
This example uses an update policy with a simple function to perform ETL. First, create two tables:
- The source table - Contains a single string-typed column into which data is ingested.
- The target table - Contains the desired schema. The update policy is defined on this table.
Let’s create the source table:
.create table MySourceTable (OriginalRecord:string)
Next, create the target table:
.create table MyTargetTable (Timestamp:datetime, ThreadId:int, ProcessId:int, TimeSinceStartup:timespan, Message:string)
Then create a function to extract data:
.create function with (docstring = 'Parses raw records into strongly-typed columns', folder = 'UpdatePolicyFunctions') ExtractMyLogs() { MySourceTable | parse OriginalRecord with "[" Timestamp:datetime "] [ThreadId:" ThreadId:int "] [ProcessId:" ProcessId:int "] TimeSinceStartup: " TimeSinceStartup:timespan " Message: " Message:string | project-away OriginalRecord }
Now, set the update policy to invoke the function that we created:
.alter table MyTargetTable policy update @'[{ "IsEnabled": true, "Source": "MySourceTable", "Query": "ExtractMyLogs()", "IsTransactional": true, "PropagateIngestionProperties": false}]'
To empty the source table after data is ingested into the target table, define the retention policy on the source table to have 0s as its `SoftDeletePeriod`.
.alter-merge table MySourceTable policy retention softdelete = 0s
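To verify the policy end to end, you can ingest a sample record into the source table and then query the target table. The record below is illustrative; any line that matches the parse pattern in ExtractMyLogs() works:
.ingest inline into table MySourceTable <|
[2024-05-01 10:00:00.0] [ThreadId:458] [ProcessId:1364] TimeSinceStartup: 00:00:05.1226411 Message: Service started
Because the update policy is transactional and the source table's SoftDeletePeriod is 0s, the parsed columns land in MyTargetTable and the raw record isn't retained as queryable data in MySourceTable.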
Related content
8 - Query results cache
8.1 - Query results cache commands
The query results cache is a cache dedicated for storing query results. For more information, see Query results cache.
Query results cache commands
Kusto provides two commands for cache management and observability:
- Show cache: Use this command to show statistics exposed by the results cache.
- Clear cache: Use this command to clear cached results.
9 - Schema
9.1 - Avrotize k2a tool
Avrotize is a versatile tool for converting data and database schema formats, and generating code in various programming languages. The tool supports the conversion of Kusto table schemas to Apache Avro format and vice versa with the Convert Kusto table definition to Avrotize Schema command. The tool handles dynamic columns in Kusto tables by:
- Inferring the schema through sampling
- Resolving arrays and records at any level of nesting
- Detecting conflicting schemas
- Creating type unions for each different schema branch
Convert table definition to AVRO format
You can use the avrotize k2a
command to connect to a Kusto database and create an Avro schema with a record type for each of the tables in the database.
The following are examples of how to use the command:
Create an Avro schema with a top-level union with a record for each table:
avrotize k2a --kusto-uri <Uri> --kusto-database <DatabaseName> --avsc <AvroFilename.avsc>
Create an xRegistry catalog file with CloudEvent wrappers and per-event schemas:
In the following example, you create xRegistry catalog files with schemas for each table. If the input table contains CloudEvents identified by columns like id, source, and type, the tool creates separate schemas for each event type.
avrotize k2a --kusto-uri <URI> --kusto-database <DatabaseName> --avsc <AvroFilename.xreg.json> --emit-cloudevents-xregistry --avro-namespace <AvroNamespace>
Convert AVRO schema to Kusto table declaration
You can use the avrotize a2k
command to create KQL table declarations from Avro schema and JSON mappings. It can also include docstrings in the table declarations extracted from the “doc” annotations in the Avro record types.
If the Avro schema is a single record type, the output script includes a .create table
command for the record. The record fields are converted into columns in the table. If the Avro schema is a type union (a top-level array), the output script emits a separate .create table
command for each record type in the union.
avrotize a2k .\<AvroFilename.avsc> --out <KustoFilename.kql>
The Avrotize tool is capable of converting JSON Schema, XML Schema, ASN.1 Schema, and Protobuf 2 and Protobuf 3 schemas into Avro schema. You can first convert the source schema into an Avro schema to normalize it and then convert it into Kusto schema.
For example, to convert “address.json” into Avro schema, the following command first converts an input JSON Schema document “address.json” to normalize it:
avrotize j2a address.json --out address.avsc
Then convert the Avro schema file into Kusto schema:
avrotize a2k address.avsc --out address.kql
You can also chain the commands together to convert from JSON Schema via Avro into Kusto schema:
avrotize j2a address.json | avrotize a2k --out address.kql
Related content
9.2 - Best practices for schema management
Here are several best practices to follow. They’ll help make your management commands work better, and have a lighter impact on the service resources.
Action | Use | Don’t use | Notes |
---|---|---|---|
Create multiple tables | Use a single .create tables command | Don’t issue many .create table commands | |
Rename multiple tables | Make a single call to .rename tables | Don’t issue a separate call for each pair of tables | |
Show commands | Use the lowest-scoped .show command | Don’t apply filters after a pipe (| ) | Limit use as much as possible. When possible, cache the information they return. |
Show extents | Use .show table T extents | Don’t use `.show cluster extents | where TableName == ‘T’` |
Show database schema. | Use .show database DB schema | Don’t use `.show schema | where DatabaseName == ‘DB’` |
Show large schema | Use .show databases schema | Don’t use .show schema | For example, use on an environment with more than 100 databases. |
Check a table’s existence or get the table’s schema | Use .show table T schema as json | Don’t use .show table T | Only use this command to get actual statistics on a single table. |
Define the schema for a table that will include datetime values | Set the relevant columns to the datetime type | Don’t convert string or numeric columns to datetime at query time for filtering, if that can be done before or during ingestion time | |
Add extent tag to metadata | Use sparingly | Avoid drop-by: tags, which limit the system’s ability to do performance-oriented grooming processes in the background. | See performance notes. |
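For example, rather than issuing several .create table commands, a single .create tables command can create multiple tables in one call; the table and column names here are hypothetical:
.create tables MyLogs (Timestamp:datetime, Message:string), MyMetrics (Timestamp:datetime, Name:string, Value:real)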
9.3 - Columns
9.3.1 - Change column type without data loss
The .alter column command changes the column type, making the original data unrecoverable. To preserve preexisting data while changing the column type, create a new, properly typed table.
For each table `OriginalTable` you'd like to change a column type in, execute the following steps:
1. Create a table `NewTable` with the correct schema (the right column types and the same column order).
2. Ingest the data into `NewTable` from `OriginalTable`, applying the required data transformations. In the following example, Col1 is being converted to the string data type.
.set-or-append NewTable <| OriginalTable | extend Col1=tostring(Col1)
3. Use the .rename tables command to swap table names.
.rename tables NewTable=OriginalTable, OriginalTable=NewTable
When the command completes, the new data from existing ingestion pipelines flows to `OriginalTable`, which is now typed correctly.
4. Drop the table `NewTable`. `NewTable` includes only a copy of the historical data from before the schema change, and it can be safely dropped after confirming the schema and data in `OriginalTable` were correctly updated.
.drop table NewTable
Example
The following example updates the schema of OriginalTable
while preserving its data.
Create the table, `OriginalTable`, with a column, “Col1,” of type guid.
.create table OriginalTable (Col1:guid, Id:int)
Then ingest data into `OriginalTable`.
.ingest inline into table OriginalTable <|
b642dec0-1040-4eac-84df-a75cfeba7aa4,1
c224488c-ad42-4e6c-bc55-ae10858af58d,2
99784a64-91ad-4897-ae0e-9d44bed8eda0,3
d8857a93-2728-4bcb-be1d-1a2cd35386a7,4
b1ddcfcc-388c-46a2-91d4-5e70aead098c,5
Create the table, `NewTable`, with the column “Col1” of type string.
.create table NewTable (Col1:string, Id:int)
Append data from `OriginalTable` to `NewTable`, and use the `tostring()` function to convert the “Col1” column from type guid to type string.
.set-or-append NewTable <| OriginalTable | extend Col1=tostring(Col1)
Swap the table names.
.rename tables NewTable = OriginalTable, OriginalTable = NewTable
Drop the table `NewTable`, which has the old schema and data.
.drop table NewTable
Related content
9.3.2 - Columns management
This section describes the following management commands used for managing table columns:
Command | Description |
---|---|
.alter column | Alters the data type of an existing table column |
.alter-merge column docstrings and .alter column docstrings | Sets the docstring property of one or more columns of the specified table |
.alter table , .alter-merge table | Modify the schema of a table (add/remove columns) |
drop column and drop table columns | Removes one or multiple columns from a table |
rename column or columns | Changes the name of an existing or multiple table columns |
9.4 - Databases
9.5 - External tables
9.5.1 - Azure SQL external tables
9.5.1.1 - Create and alter Azure SQL external tables
Creates or alters an Azure SQL external table in the database in which the command is executed.
Supported Azure SQL external table types
- SQL Server
- MySQL
- PostgreSQL
- Cosmos DB
Permissions
To `.create` requires at least Database User permissions and to `.alter` requires at least Table Admin permissions.
To `.create`, `.alter`, or `.create-or-alter` an external table using managed identity authentication requires Database Admin permissions. This method is supported for SQL Server and Cosmos DB external tables.
Syntax
(.create | .alter | .create-or-alter) external table TableName (Schema) kind = sql [ table = SqlTableName ] (SqlConnectionString) [ with ( [ sqlDialect = SqlDialect ], [ Property, … ] ) ]
Parameters
Name | Type | Required | Description |
---|---|---|---|
TableName | string | ✔️ | The name of the external table. The name must follow the rules for entity names, and an external table can’t have the same name as a regular table in the same database. |
Schema | string | ✔️ | The external data schema is a comma-separated list of one or more column names and data types, where each item follows the format: ColumnName : ColumnType. |
SqlTableName | string | | The name of the SQL table not including the database name. For example, “MySqlTable” and not “db1.MySqlTable”. If the name of the table contains a period (“.”), use [‘Name.of.the.table’] notation. This specification is required for all types of tables except for Cosmos DB, as for Cosmos DB the collection name is part of the connection string. |
SqlConnectionString | string | ✔️ | The connection string to the SQL server. |
SqlDialect | string | | Indicates the type of Azure SQL external table. SQL Server is the default. For MySQL, specify MySQL. For PostgreSQL, specify PostgreSQL. For Cosmos DB, specify CosmosDbSql. |
Property | string | | A key-value property pair in the format PropertyName = PropertyValue. See optional properties. |
Optional properties
Property | Type | Description |
---|---|---|
folder | string | The table’s folder. |
docString | string | A string documenting the table. |
firetriggers | true /false | If true , instructs the target system to fire INSERT triggers defined on the SQL table. The default is false . (For more information, see BULK INSERT and System.Data.SqlClient.SqlBulkCopy) |
createifnotexists | true / false | If true , the target SQL table is created if it doesn’t already exist; the primarykey property must be provided in this case to indicate the result column that is the primary key. The default is false . |
primarykey | string | If createifnotexists is true , the resulting column name is used as the SQL table’s primary key if it’s created by this command. |
Authentication and authorization
To interact with an external Azure SQL table, you must specify authentication means as part of the SqlConnectionString. The SqlConnectionString defines the resource to access and its authentication information.
For more information, see Azure SQL external table authentication methods.
Examples
The following examples show how to create each type of Azure SQL external table.
SQL Server
.create external table MySqlExternalTable (x:long, s:string)
kind=sql
table=MySqlTable
(
h@'Server=tcp:myserver.database.windows.net,1433;Authentication=Active Directory Integrated;Initial Catalog=mydatabase;'
)
with
(
docstring = "Docs",
folder = "ExternalTables",
createifnotexists = true,
primarykey = x,
firetriggers=true
)
Output
TableName | TableType | Folder | DocString | Properties |
---|---|---|---|---|
MySqlExternalTable | Sql | ExternalTables | Docs | { “TargetEntityKind”: “sqltable”, “TargetEntityName”: “MySqlTable”, “TargetEntityConnectionString”: “Server=tcp:myserver.database.windows.net,1433;Authentication=Active Directory Integrated;Initial Catalog=mydatabase;”, “FireTriggers”: true, “CreateIfNotExists”: true, “PrimaryKey”: “x” } |
MySQL
.create external table MySqlExternalTable (x:long, s:string)
kind=sql
table=MySqlTable
(
h@'Server=myserver.mysql.database.windows.net;Port = 3306;UID = USERNAME;Pwd = PASSWORD;Database = mydatabase;'
)
with
(
sqlDialect = "MySql",
docstring = "Docs",
folder = "ExternalTables",
)
PostgreSQL
.create external table PostgreSqlExternalTable (x:long, s:string)
kind=sql
table=PostgreSqlTable
(
h@'Host = hostname.postgres.database.azure.com; Port = 5432; Database= db; User Id=user; Password=pass; Timeout = 30;'
)
with
(
sqlDialect = "PostgreSQL",
docstring = "Docs",
folder = "ExternalTables",
)
Cosmos DB
.create external table CosmosDBSQLExternalTable (x:long, s:string)
kind=sql
(
h@'AccountEndpoint=https://cosmosdbacc.documents.azure.com/;Database=MyDatabase;Collection=MyCollection;AccountKey=' h'R8PM...;'
)
with
(
sqlDialect = "CosmosDbSQL",
docstring = "Docs",
folder = "ExternalTables",
)
Related content
9.5.1.2 - Query SQL external tables
You can query a SQL external table just as you would query a table in Azure Data Explorer or in a KQL database.
How it works
Azure SQL external table queries are translated from Kusto Query Language (KQL) to SQL. The operators after the external_table function call, such as where, project, count, and so on, are pushed down and translated into a single SQL query to be executed against the target SQL table.
Example
For example, consider an external table named `MySqlExternalTable` with two columns `x` and `s`. In this case, the following KQL query is translated into the following SQL query.
KQL query
external_table(MySqlExternalTable)
| where x > 5
| count
SQL query
SELECT COUNT(*) FROM (SELECT x, s FROM MySqlTable WHERE x > 5) AS Subquery1
9.5.1.3 - Use row-level security with Azure SQL external tables
Apply row-level security on Azure SQL external tables
This document describes how to apply a row-level security (RLS) solution with SQL external tables. Row-level security implements data isolation at the user level, restricting access to data based on the current user's credentials. However, Kusto external tables don't support RLS policy definitions, so data isolation on external SQL tables requires a different approach. The following solution employs row-level security in SQL Server and Microsoft Entra ID impersonation in the SQL Server connection string. This combination provides the same behavior as applying user access control with RLS on standard Kusto tables, such that the users querying the SQL external table can only see the records addressed to them, based on the row-level security policy defined in the source database.
Prerequisites
ALTER ANY SECURITY POLICY
permission on the SQL Server- Table admin level permissions on the Kusto-side SQL external table
Sample table
The example source is a SQL Server table called `SourceTable`, with the following schema. The `systemuser` column contains the user email to whom the data record belongs. This is the same user who should have access to this data.
CREATE TABLE SourceTable (
id INT,
region VARCHAR(5),
central VARCHAR(5),
systemuser VARCHAR(200)
)
Configure row-level security in the source SQL Server - SQL Server side
For general information on SQL Server row-level security, see row-level security in SQL Server.
Create a SQL function with the logic for the data access policy. In this example, the row-level security is based on the current user's email matching the `systemuser` column. This logic could be modified to meet any other business requirement.
CREATE SCHEMA Security;
GO
CREATE FUNCTION Security.mySecurityPredicate(@CheckColumn AS nvarchar(100))
    RETURNS TABLE
    WITH SCHEMABINDING
AS
    RETURN SELECT 1 AS mySecurityPredicate_result
    WHERE @CheckColumn = ORIGINAL_LOGIN() OR USER_NAME() = 'Manager';
GO
Create the security policy on the table `SourceTable`, passing the column name as the parameter:
CREATE SECURITY POLICY SourceTableFilter
ADD FILTER PREDICATE Security.mySecurityPredicate(systemuser)
ON dbo.SourceTable
WITH (STATE = ON)
GO
[!NOTE] At this point, the data is already restricted by the `mySecurityPredicate` function logic.
Allow user access to SQL Server - SQL Server side
The following steps depend on the SQL Server version that you’re using.
Create a login and a user for each Microsoft Entra ID credential that is going to access the data stored in SQL Server:
CREATE LOGIN [user@domain.com] FROM EXTERNAL PROVIDER --MASTER
CREATE USER [user@domain.com] FROM EXTERNAL PROVIDER --DATABASE
Grant SELECT on the Security function to the Microsoft Entra ID user:
GRANT SELECT ON Security.mySecurityPredicate to [user@domain.com]
Grant SELECT on the `SourceTable` to the Microsoft Entra ID user:
GRANT SELECT ON dbo.SourceTable to [user@domain.com]
Define SQL external table connection String - Kusto side
For more information on the connection string, see SQL External Table Connection Strings.
Create a SQL external table using a connection string with the `Active Directory Integrated` authentication type. For more information, see Microsoft Entra integrated (impersonation).
.create external table SQLSourceTable (id:long, region:string, central:string, systemuser:string) kind=sql table=SourceTable ( h@'Server=tcp:[sql server endpoint],1433;Authentication=Active Directory Integrated;Initial Catalog=[database name];' ) with ( docstring = "Docs", folder = "ExternalTables", createifnotexists = false, primarykey = 'id' )
Connection String:
Server=tcp:[sql server endpoint],1433;Authentication=Active Directory Integrated;Initial Catalog=[database name];
Validate the data isolation based on the Microsoft Entra ID, just as it works with row-level security on standard Kusto tables. In this case, the data is filtered based on the SourceTable's `systemuser` column, matching the Microsoft Entra ID user (email address) from the Kusto impersonation:
external_table('SQLSourceTable')
[!NOTE] The policy can be disabled and enabled again, on the SQL Server side, for testing purposes.
To disable and enable the policy, use the following SQL commands:
ALTER SECURITY POLICY SourceTableFilter
WITH (STATE = OFF);
ALTER SECURITY POLICY SourceTableFilter
WITH (STATE = ON);
With the Security Policy enabled on the SQL Server side, Kusto users only see the records matching their Microsoft Entra IDs, as the result of the query against the SQL External table. With the Security Policy disabled, all users are able to access the full table content as the result of the query against the SQL External table.
Related content
9.5.2 - Azure Storage external tables
9.5.2.1 - Create and alter Azure Storage delta external tables
The commands in this article can be used to create or alter a delta external table in the database from which the command is executed. A delta external table references Delta Lake table data located in Azure Blob Storage, Azure Data Lake Store Gen1, or Azure Data Lake Store Gen2.
To accelerate queries over external delta tables, see Query acceleration policy.
Permissions
To `.create` requires at least Database User permissions, and to `.alter` requires at least Table Admin permissions.
To `.create-or-alter` an external table using managed identity authentication requires AllDatabasesAdmin permissions.
Syntax
(.create | .alter | .create-or-alter) external table TableName [(Schema)] kind = delta (StorageConnectionString) [ with ( Property [, …] ) ]
Parameters
Name | Type | Required | Description |
---|---|---|---|
TableName | string | ✔️ | An external table name that adheres to the entity names rules. An external table can’t have the same name as a regular table in the same database. |
Schema | string | | The optional external data schema is a comma-separated list of one or more column names and data types, where each item follows the format: ColumnName : ColumnType. If not specified, it will be automatically inferred from the delta log based on the latest delta table version. |
StorageConnectionString | string | ✔️ | delta table root folder path, including credentials. Can point to Azure Blob Storage blob container, Azure Data Lake Gen 2 file system or Azure Data Lake Gen 1 container. The external table storage type is determined by the provided connection string. See storage connection strings. |
Property | string | | A key-value property pair in the format PropertyName = PropertyValue. See optional properties. |
Authentication and authorization
The authentication method to access an external table is based on the connection string provided during its creation, and the permissions required to access the table vary depending on the authentication method.
The supported authentication methods are the same as those supported by Azure Storage external tables.
Optional properties
Property | Type | Description |
---|---|---|
folder | string | Table’s folder |
docString | string | String documenting the table |
compressed | bool | Only relevant for the export scenario. If set to true, the data is exported in the format specified by the compressionType property. For the read path, compression is automatically detected. |
compressionType | string | Only relevant for the export scenario. The compression type of exported files. For non-Parquet files, only gzip is allowed. For Parquet files, possible values include gzip , snappy , lz4_raw , brotli , and zstd . Default is gzip . For the read path, compression type is automatically detected. |
namePrefix | string | If set, specifies the prefix of the files. On write operations, all files will be written with this prefix. On read operations, only files with this prefix are read. |
fileExtension | string | If set, specifies extension of the files. On write, files names will end with this suffix. On read, only files with this file extension will be read. |
encoding | string | Specifies how the text is encoded: UTF8NoBOM (default) or UTF8BOM . |
dryRun | bool | If set, the external table definition isn’t persisted. This option is useful for validating the external table definition, especially in conjunction with the filesPreview or sampleUris parameter. |
Examples
Create or alter a delta external table with an inferred schema
In the following external table, the schema is automatically inferred from the latest delta table version.
.create-or-alter external table ExternalTable
kind=delta
(
h@'https://storageaccount.blob.core.windows.net/container1;secretKey'
)
Create a delta external table with a custom schema
In the following external table, a custom schema is specified and overrides the schema of the delta table. If, at some later time, you need to replace the custom schema with the schema based on the latest delta table version, run the `.alter` or `.create-or-alter` command without specifying a schema, like in the previous example.
.create external table ExternalTable (Timestamp:datetime, x:long, s:string)
kind=delta
(
h@'abfss://filesystem@storageaccount.dfs.core.windows.net/path;secretKey'
)
Limitations
- Time travel is not supported. Only the latest delta table version is used.
Related content
9.5.2.2 - Create and alter Azure Storage external tables
The commands in this article can be used to create or alter an Azure Storage external table in the database from which the command is executed. An Azure Storage external table references data located in Azure Blob Storage, Azure Data Lake Store Gen1, or Azure Data Lake Store Gen2.
Permissions
To `.create` requires at least Database User permissions, and to `.alter` requires at least Table Admin permissions.
To `.create-or-alter` an external table using managed identity authentication requires AllDatabasesAdmin permissions.
Syntax
(.create | .alter | .create-or-alter) external table TableName (Schema) kind = storage [ partition by (Partitions) [ pathformat = (PathFormat) ] ] dataformat = DataFormat (StorageConnectionString [, …]) [ with ( Property [, …] ) ]
Parameters
Name | Type | Required | Description |
---|---|---|---|
TableName | string | ✔️ | An external table name that adheres to the entity names rules. An external table can’t have the same name as a regular table in the same database. |
Schema | string | ✔️ | The external data schema is a comma-separated list of one or more column names and data types, where each item follows the format: ColumnName : ColumnType. If the schema is unknown, use infer_storage_schema to infer the schema based on external file contents. |
Partitions | string | | A comma-separated list of columns by which the external table is partitioned. Partition column can exist in the data file itself, or as part of the file path. See partitions formatting to learn how this value should look. |
PathFormat | string | | An external data folder URI path format to use with partitions. See path format. |
DataFormat | string | ✔️ | The data format, which can be any of the ingestion formats. We recommend using the Parquet format for external tables to improve query and export performance, unless you use JSON paths mapping. When using an external table for export scenario, you're limited to the following formats: CSV, TSV, JSON and Parquet. |
StorageConnectionString | string | ✔️ | One or more comma-separated paths to Azure Blob Storage blob containers, Azure Data Lake Gen 2 file systems or Azure Data Lake Gen 1 containers, including credentials. The external table storage type is determined by the provided connection strings. See storage connection strings. |
Property | string | | A key-value property pair in the format PropertyName = PropertyValue. See optional properties. |
Authentication and authorization
The authentication method to access an external table is based on the connection string provided during its creation, and the permissions required to access the table vary depending on the authentication method.
The following table lists the supported authentication methods for Azure Storage external tables and the permissions needed to read or write to the table.
Authentication method | Azure Blob Storage / Data Lake Storage Gen2 | Data Lake Storage Gen1 |
---|---|---|
Impersonation | Read permissions: Storage Blob Data Reader Write permissions: Storage Blob Data Contributor | Read permissions: Reader Write permissions: Contributor |
Managed identity | Read permissions: Storage Blob Data Reader Write permissions: Storage Blob Data Contributor | Read permissions: Reader Write permissions: Contributor |
Shared Access (SAS) token | Read permissions: List + Read Write permissions: Write | This authentication method isn’t supported in Gen1. |
Microsoft Entra access token | No additional permissions required. | No additional permissions required. |
Storage account access key | No additional permissions required. | This authentication method isn’t supported in Gen1. |
Path format
The PathFormat parameter allows you to specify the format for the external data folder URI path in addition to partitions. It consists of a sequence of partition elements and text separators. A partition element refers to a partition that is declared in the `partition by` clause, and the text separator is any text enclosed in quotes. Consecutive partition elements must be set apart using the text separator.
[ StringSeparator ] Partition [ StringSeparator ] [Partition [ StringSeparator ] …]
To construct the original file path prefix, partition elements are rendered as strings and separated with corresponding text separators. You can use the `datetime_pattern` macro (`datetime_pattern(DateTimeFormat, PartitionName)`) to specify the format used for rendering a datetime partition value. The macro adheres to the .NET format specification, and allows format specifiers to be enclosed in curly brackets. For example, the following two formats are equivalent:
- ‘year=‘yyyy’/month=‘MM
- year={yyyy}/month={MM}
By default, datetime values are rendered using the following formats:
Partition function | Default format |
---|---|
startofyear | yyyy |
startofmonth | yyyy/MM |
startofweek | yyyy/MM/dd |
startofday | yyyy/MM/dd |
bin( Column, 1d) | yyyy/MM/dd |
bin( Column, 1h) | yyyy/MM/dd/HH |
bin( Column, 1m) | yyyy/MM/dd/HH/mm |
Virtual columns
When data is exported from Spark, partition columns (that are provided to the dataframe writer's `partitionBy` method) aren't written to data files.
This process avoids data duplication because the data is already present in the folder names (for example, `column1=<value>/column2=<value>/`), and Spark can recognize it upon read.
External tables support reading this data in the form of virtual columns. Virtual columns can be of either type `string` or `datetime`, and are specified using the following syntax:
.create external table ExternalTable (EventName:string, Revenue:double)
kind=storage
partition by (CustomerName:string, Date:datetime)
pathformat=("customer=" CustomerName "/date=" datetime_pattern("yyyyMMdd", Date))
dataformat=parquet
(
h@'https://storageaccount.blob.core.windows.net/container1;secretKey'
)
To filter by virtual columns in a query, specify partition names in the query predicate:
external_table("ExternalTable")
| where Date between (datetime(2020-01-01) .. datetime(2020-02-01))
| where CustomerName in ("John.Doe", "Ivan.Ivanov")
Optional properties
Property | Type | Description |
---|---|---|
folder | string | Table’s folder |
docString | string | String documenting the table |
compressed | bool | Only relevant for the export scenario. If set to true, the data is exported in the format specified by the compressionType property. For the read path, compression is automatically detected. |
compressionType | string | Only relevant for the export scenario. The compression type of exported files. For non-Parquet files, only gzip is allowed. For Parquet files, possible values include gzip , snappy , lz4_raw , brotli , and zstd . Default is gzip . For the read path, compression type is automatically detected. |
includeHeaders | string | For delimited text formats (CSV, TSV, …), specifies whether files contain a header. Possible values are: All (all files contain a header), FirstFile (first file in a folder contains a header), None (no files contain a header). |
namePrefix | string | If set, specifies the prefix of the files. On write operations, all files will be written with this prefix. On read operations, only files with this prefix are read. |
fileExtension | string | If set, specifies the extension of the files. On write, files names will end with this suffix. On read, only files with this file extension will be read. |
encoding | string | Specifies how the text is encoded: UTF8NoBOM (default) or UTF8BOM . |
sampleUris | bool | If set, the command result provides several examples of simulated external data files URI as they’re expected by the external table definition. This option helps validate whether the Partitions and PathFormat parameters are defined properly. |
filesPreview | bool | If set, one of the command result tables contains a preview of .show external table artifacts command. Like sampleUri , the option helps validate the Partitions and PathFormat parameters of external table definition. |
validateNotEmpty | bool | If set, the connection strings are validated for having content in them. The command will fail if the specified URI location doesn’t exist, or if there are insufficient permissions to access it. |
dryRun | bool | If set, the external table definition isn’t persisted. This option is useful for validating the external table definition, especially in conjunction with the filesPreview or sampleUris parameter. |
File filtering logic
When querying an external table, performance is improved by filtering out irrelevant external storage files. The process of iterating files and deciding whether a file should be processed is as follows:
Build a URI pattern that represents a place where files are found. Initially, the URI pattern equals a connection string provided as part of the external table definition. If there are any partitions defined, they’re rendered using PathFormat, then appended to the URI pattern.
For all files found under the URI pattern(s) created, check that:
- Partition values match predicates used in a query.
- Blob name starts with
NamePrefix
, if such a property is defined. - Blob name ends with
FileExtension
, if such a property is defined.
Once all the conditions are met, the file is fetched and processed.
Examples
Non-partitioned external table
In the following non-partitioned external table, the files are expected to be placed directly under the container(s) defined:
.create external table ExternalTable (x:long, s:string)
kind=storage
dataformat=csv
(
h@'https://storageaccount.blob.core.windows.net/container1;secretKey'
)
Partitioned by date
In the following external table partitioned by date, the files are expected to be placed under directories of the default datetime format `yyyy/MM/dd`:
.create external table ExternalTable (Timestamp:datetime, x:long, s:string)
kind=storage
partition by (Date:datetime = bin(Timestamp, 1d))
dataformat=csv
(
h@'abfss://filesystem@storageaccount.dfs.core.windows.net/path;secretKey'
)
Partitioned by month
In the following external table partitioned by month, the directory format is `year=yyyy/month=MM`:
.create external table ExternalTable (Timestamp:datetime, x:long, s:string)
kind=storage
partition by (Month:datetime = startofmonth(Timestamp))
pathformat=(datetime_pattern("'year='yyyy'/month='MM", Month))
dataformat=csv
(
h@'https://storageaccount.blob.core.windows.net/container1;secretKey'
)
Partitioned by name and date
In the following external table, the data is partitioned first by customer name and then by date, meaning that the expected directory structure is, for example, customer_name=Softworks/2019/02/01
:
.create external table ExternalTable (Timestamp:datetime, CustomerName:string)
kind=storage
partition by (CustomerNamePart:string = CustomerName, Date:datetime = startofday(Timestamp))
pathformat=("customer_name=" CustomerNamePart "/" Date)
dataformat=csv
(
h@'https://storageaccount.blob.core.windows.net/container1;secretKey'
)
Partitioned by hash and date
The following external table is partitioned first by customer name hash (modulo ten), then by date. The expected directory structure is, for example, customer_id=5/dt=20190201
, and data file names end with the .txt
extension:
.create external table ExternalTable (Timestamp:datetime, CustomerName:string)
kind=storage
partition by (CustomerId:long = hash(CustomerName, 10), Date:datetime = startofday(Timestamp))
pathformat=("customer_id=" CustomerId "/dt=" datetime_pattern("yyyyMMdd", Date))
dataformat=csv
(
h@'https://storageaccount.blob.core.windows.net/container1;secretKey'
)
with (fileExtension = ".txt")
Filter by partition columns in a query
To filter by partition columns in a query, specify original column name in query predicate:
external_table("ExternalTable")
| where Timestamp between (datetime(2020-01-01) .. datetime(2020-02-01))
| where CustomerName in ("John.Doe", "Ivan.Ivanov")
Sample Output
TableName | TableType | Folder | DocString | Properties | ConnectionStrings | Partitions | PathFormat |
---|---|---|---|---|---|---|---|
ExternalTable | Blob | ExternalTables | Docs | {“Format”:“Csv”,“Compressed”:false,“CompressionType”:null,“FileExtension”:null,“IncludeHeaders”:“None”,“Encoding”:null,“NamePrefix”:null} | [“https://storageaccount.blob.core.windows.net/container1;*******”] | [{“Mod”:10,“Name”:“CustomerId”,“ColumnName”:“CustomerName”,“Ordinal”:0},{“Function”:“StartOfDay”,“Name”:“Date”,“ColumnName”:“Timestamp”,“Ordinal”:1}] | “customer_id=” CustomerId “/dt=” datetime_pattern(“yyyyMMdd”,Date) |
Related content
9.6 - Functions
9.6.1 - Stored functions management overview
This section describes management commands used for creating and altering user-defined functions:
Function | Description |
---|---|
.alter function | Alters an existing function and stores it inside the database metadata |
.alter function docstring | Alters the DocString value of an existing function |
.alter function folder | Alters the Folder value of an existing function |
.create function | Creates a stored function |
.create-or-alter function | Creates a stored function or alters an existing function and stores it inside the database metadata |
.drop function and .drop functions | Drops a function (or functions) from the database |
.show functions and .show function | Lists all the stored functions, or a specific function, in the currently-selected database |
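For example, a stored function can be created or updated with a single .create-or-alter function command. The following is a minimal sketch; the function name, parameter, and source table T are illustrative:
.create-or-alter function with (docstring = "Events from the last N days", folder = "Demo")
RecentEvents(numDays: long)
{
    T
    | where Timestamp > ago(numDays * 1d)
}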
9.7 - Ingestion mappings
9.7.1 - AVRO Mapping
Use AVRO mapping to map incoming data to columns inside tables when your ingestion source file is in AVRO format.
Each AVRO mapping element may contain any of the following optional properties:
Property | Type | Description |
---|---|---|
Field | string | Name of the field in the AVRO record. |
Path | string | If the value starts with $ , it’s treated as the path to the field in the AVRO document. This path specifies the part of the AVRO document that becomes the content of the column in the table. The path that denotes the entire AVRO record is $ . If the value doesn’t start with $ , it’s treated as a constant value. Paths that include special characters should be escaped as ['Property Name']. For more information, see JSONPath syntax. |
ConstValue | string | The constant value to be used for a column instead of some value inside the AVRO file. |
Transform | string | Transformation that should be applied on the content with mapping transformations. |
Examples
JSON serialization
The following example mapping is serialized as a JSON string when provided as part of the .ingest
management command.
[
{"Column": "event_timestamp", "Properties": {"Field": "Timestamp"}},
{"Column": "event_name", "Properties": {"Field": "Name"}},
{"Column": "event_type", "Properties": {"Field": "Type"}},
{"Column": "event_time", "Properties": {"Field": "Timestamp", "Transform": "DateTimeFromUnixMilliseconds"}},
{"Column": "ingestion_time", "Properties": {"ConstValue": "2021-01-01T10:32:00"}},
{"Column": "full_record", "Properties": {"Path": "$"}}
]
Here the serialized JSON mapping is included in the context of the .ingest
management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format = "AVRO",
ingestionMapping =
```
[
{"Column": "column_a", "Properties": {"Field": "Field1"}},
{"Column": "column_b", "Properties": {"Field": "$.[\'Field name with space\']"}}
]
```
)
Precreated mapping
When the mapping is precreated, reference the mapping by name in the .ingest
management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format="AVRO",
ingestionMappingReference = "Mapping_Name"
)
Identity mapping
Use AVRO mapping during ingestion without defining a mapping schema (see identity mapping).
.ingest into Table123 (@"source1", @"source2")
with
(
format="AVRO"
)
Related content
- Use the avrotize k2a tool to create an Avro schema.
9.7.2 - CSV Mapping
Use CSV mapping to map incoming data to columns inside tables when your ingestion source file is any of the following delimiter-separated tabular formats: CSV, TSV, PSV, SCSV, SOHsv, TXT and RAW. For more information, see supported data formats.
Each CSV mapping element may contain any of the following optional properties:
Property | Type | Description |
---|---|---|
Ordinal | int | The column order number in CSV. |
ConstValue | string | The constant value to be used for a column instead of some value inside the CSV file. |
Transform | string | Transformation that should be applied on the content with mapping transformations. The only transformation supported by CSV mapping is SourceLocation . |
Examples
[
{"Column": "event_time", "Properties": {"Ordinal": "0"}},
{"Column": "event_name", "Properties": {"Ordinal": "1"}},
{"Column": "event_type", "Properties": {"Ordinal": "2"}},
{"Column": "ingestion_time", "Properties": {"ConstValue": "2023-01-01T10:32:00"}}
{"Column": "source_location", "Properties": {"Transform": "SourceLocation"}}
]
The mapping above is serialized as a JSON string when it’s provided as part of the .ingest
management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format="csv",
ingestionMapping =
```
[
{"Column": "event_time", "Properties": {"Ordinal": "0"}},
{"Column": "event_name", "Properties": {"Ordinal": "1"}},
{"Column": "event_type", "Properties": {"Ordinal": "2"}},
{"Column": "ingestion_time", "Properties": {"ConstValue": "2023-01-01T10:32:00"}},
{"Column": "source_location", "Properties": {"Transform": "SourceLocation"}}
]
```
)
Pre-created mapping
When the mapping is pre-created, reference the mapping by name in the .ingest
management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format="csv",
ingestionMappingReference = "MappingName"
)
Identity mapping
Use CSV mapping during ingestion without defining a mapping schema (see identity mapping).
.ingest into Table123 (@"source1", @"source2")
with
(
format="csv"
)
9.7.3 - Ingestion mappings
Ingestion mappings are used during ingestion to map incoming data to columns inside tables.
Data Explorer supports different types of mappings, both row-oriented (CSV, JSON, AVRO and W3CLOGFILE), and column-oriented (Parquet and ORC).
Ingestion mappings can be defined in the ingest command, or can be precreated and referenced from the ingest command using ingestionMappingReference
parameters. Ingestion is possible without specifying a mapping. For more information, see identity mapping.
Each element in the mapping list is constructed from three fields:
Property | Required | Description |
---|---|---|
Column | ✔️ | Target column name in the table. |
Datatype | | Datatype with which to create the mapped column if it doesn't already exist in the table. |
Properties | | Property-bag containing properties specific for each mapping as described in each specific mapping type page. |
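For example, a precreated JSON mapping element that uses all three fields might look like the following sketch (the table, column, and path names are illustrative; field-name casing follows the table above):
.create table Events ingestion json mapping 'EventsMapping'
```
[
    {"Column": "Timestamp", "Datatype": "datetime", "Properties": {"Path": "$.ts"}},
    {"Column": "EventName", "Datatype": "string", "Properties": {"Path": "$.name"}}
]
```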
Supported mapping types
The following table defines mapping types to be used when ingesting or querying external data of a specific format.
Data Format | Mapping Type |
---|---|
CSV | CSV Mapping |
TSV | CSV Mapping |
TSVe | CSV Mapping |
PSV | CSV Mapping |
SCSV | CSV Mapping |
SOHsv | CSV Mapping |
TXT | CSV Mapping |
RAW | CSV Mapping |
JSON | JSON Mapping |
AVRO | AVRO Mapping |
APACHEAVRO | AVRO Mapping |
Parquet | Parquet Mapping |
ORC | ORC Mapping |
W3CLOGFILE | W3CLOGFILE Mapping |
Ingestion mapping examples
The following examples use the RawEvents
table with the following schema:
.create table RawEvents (timestamp: datetime, deviceId: guid, messageId: guid, temperature: decimal, humidity: decimal)
Simple mapping
The following example shows ingestion where the mapping is defined in the ingest command. The command ingests a JSON file from a URL into the RawEvents
table. The mapping specifies the path to each field in the JSON file.
.ingest into table RawEvents ('https://kustosamplefiles.blob.core.windows.net/jsonsamplefiles/simple.json')
with (
format = "json",
ingestionMapping =
```
[
{"column":"timestamp","Properties":{"path":"$.timestamp"}},
{"column":"deviceId","Properties":{"path":"$.deviceId"}},
{"column":"messageId","Properties":{"path":"$.messageId"}},
{"column":"temperature","Properties":{"path":"$.temperature"}},
{"column":"humidity","Properties":{"path":"$.humidity"}}
]
```
)
Mapping with ingestionMappingReference
To map the same JSON file using a precreated mapping, create the RawEventMapping
ingestion mapping reference with the following command:
.create table RawEvents ingestion json mapping 'RawEventMapping'
```
[
{"column":"timestamp","Properties":{"path":"$.timestamp"}},
{"column":"deviceId","Properties":{"path":"$.deviceId"}},
{"column":"messageId","Properties":{"path":"$.messageId"}},
{"column":"temperature","Properties":{"path":"$.temperature"}},
{"column":"humidity","Properties":{"path":"$.humidity"}}
]
```
Ingest the JSON file using the RawEventMapping
ingestion mapping reference with the following command:
.ingest into table RawEvents ('https://kustosamplefiles.blob.core.windows.net/jsonsamplefiles/simple.json')
with (
format="json",
ingestionMappingReference="RawEventMapping"
)
Identity mapping
Ingestion is possible without specifying ingestionMapping
or ingestionMappingReference
properties. The data is mapped using an identity data mapping derived from the table’s schema. The table schema remains the same. format
property should be specified. See ingestion formats.
Format type | Format | Mapping logic |
---|---|---|
Tabular data formats with defined order of columns, such as delimiter-separated or single-line formats. | CSV, TSV, TSVe, PSV, SCSV, Txt, SOHsv, Raw | All table columns are mapped in their respective order to data columns in order they appear in the data source. Column data type is taken from the table schema. |
Formats with named columns or records with named fields. | JSON, Parquet, Avro, ApacheAvro, Orc, W3CLOGFILE | All table columns are mapped to data columns or record fields having the same name (case-sensitive). Column data type is taken from the table schema. |
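For example, the RawEvents table from the examples above can ingest the same sample JSON file with an identity mapping, because the JSON property names match the table's column names:
.ingest into table RawEvents ('https://kustosamplefiles.blob.core.windows.net/jsonsamplefiles/simple.json')
with (
    format = "json"
)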
Mapping transformations
Some of the data format mappings (Parquet, JSON, and AVRO) support simple and useful ingest-time transformations. Where the scenario requires more complex processing at ingest time, use Update policy, which allows defining lightweight processing using KQL expression.
Path-dependent transformation | Description | Conditions |
---|---|---|
PropertyBagArrayToDictionary | Transforms JSON array of properties, such as {events:[{"n1":"v1"},{"n2":"v2"}]} , to dictionary and serializes it to valid JSON document, such as {"n1":"v1","n2":"v2"} . | Available for JSON , Parquet , AVRO , and ORC mapping types. |
SourceLocation | Name of the storage artifact that provided the data, type string (for example, the blob’s “BaseUri” field). | Available for CSV , JSON , Parquet , AVRO , ORC , and W3CLOGFILE mapping types. |
SourceLineNumber | Offset relative to that storage artifact, type long (starting with ‘1’ and incrementing per new record). | Available for CSV , JSON , Parquet , AVRO , ORC , and W3CLOGFILE mapping types. |
DateTimeFromUnixSeconds | Converts number representing unix-time (seconds since 1970-01-01) to UTC datetime string. | Available for CSV , JSON , Parquet , AVRO , and ORC mapping types. |
DateTimeFromUnixMilliseconds | Converts number representing unix-time (milliseconds since 1970-01-01) to UTC datetime string. | Available for CSV , JSON , Parquet , AVRO , and ORC mapping types. |
DateTimeFromUnixMicroseconds | Converts number representing unix-time (microseconds since 1970-01-01) to UTC datetime string. | Available for CSV , JSON , Parquet , AVRO , and ORC mapping types. |
DateTimeFromUnixNanoseconds | Converts number representing unix-time (nanoseconds since 1970-01-01) to UTC datetime string. | Available for CSV , JSON , Parquet , AVRO , and ORC mapping types. |
DropMappedFields | Maps an object in the JSON document to a column and removes any nested fields already referenced by other column mappings. | Available for JSON , Parquet , AVRO , and ORC mapping types. |
BytesAsBase64 | Treats the data as byte array and converts it to a base64-encoded string. | Available for AVRO mapping type. For ApacheAvro format, the schema type of the mapped data field should be bytes or fixed Avro type. For Avro format, the field should be an array containing byte values from [0-255] range. null is ingested if the data doesn’t represent a valid byte array. |
Mapping transformation examples
DropMappedFields
transformation:
Given the following JSON contents:
{
"Time": "2012-01-15T10:45",
"Props": {
"EventName": "CustomEvent",
"Revenue": 0.456
}
}
The following data mapping maps entire Props
object into dynamic column Props
while excluding
already mapped columns (Props.EventName
is already mapped into column EventName
, so it’s
excluded).
[
{ "Column": "Time", "Properties": { "Path": "$.Time" } },
{ "Column": "EventName", "Properties": { "Path": "$.Props.EventName" } },
{ "Column": "Props", "Properties": { "Path": "$.Props", "Transform":"DropMappedFields" } },
]
The ingested data looks as follows:
Time | EventName | Props |
---|---|---|
2012-01-15T10:45 | CustomEvent | {"Revenue": 0.456} |
BytesAsBase64
transformation
Given the following AVRO file contents:
{
"Time": "2012-01-15T10:45",
"Props": {
"id": [227,131,34,92,28,91,65,72,134,138,9,133,51,45,104,52]
}
}
The following data mapping maps the ID column twice, with and without the transformation.
[
{ "Column": "ID", "Properties": { "Path": "$.props.id" } },
{ "Column": "Base64EncodedId", "Properties": { "Path": "$.props.id", "Transform":"BytesAsBase64" } },
]
The ingested data looks as follows:
ID | Base64EncodedId |
---|---|
[227,131,34,92,28,91,65,72,134,138,9,133,51,45,104,52] | 44MiXBxbQUiGigmFMy1oNA== |
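PropertyBagArrayToDictionary transformation (a supplementary sketch; the column and field names are hypothetical):
Given the following JSON contents:
{
    "Time": "2012-01-15T10:45",
    "Events": [{"lang": "en"}, {"theme": "dark"}]
}
The following data mapping maps the Events array of properties into the dynamic column EventProperties as a single dictionary:
[
    { "Column": "Time", "Properties": { "Path": "$.Time" } },
    { "Column": "EventProperties", "Properties": { "Path": "$.Events", "Transform":"PropertyBagArrayToDictionary" } }
]
The ingested data looks as follows:
Time | EventProperties |
---|---|
2012-01-15T10:45 | {"lang":"en","theme":"dark"} |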
9.7.4 - JSON Mapping
Use JSON mapping to map incoming data to columns inside tables when your ingestion source file is in JSON format.
Each JSON mapping element may contain any of the following optional properties:
Property | Type | Description |
---|---|---|
Path | string | If the value starts with $ it’s interpreted as the JSON path to the field in the JSON document that will become the content of the column in the table. The JSON path that denotes the entire document is $ . If the value doesn’t start with $ it’s interpreted as a constant value. JSON paths that include special characters should be escaped as ['Property Name']. For more information, see JSONPath syntax. |
ConstValue | string | The constant value to be used for a column instead of some value inside the JSON file. |
Transform | string | Transformation that should be applied on the content with mapping transformations. |
Examples
[
{"Column": "event_timestamp", "Properties": {"Path": "$.Timestamp"}},
{"Column": "event_name", "Properties": {"Path": "$.Event.Name"}},
{"Column": "event_type", "Properties": {"Path": "$.Event.Type"}},
{"Column": "source_uri", "Properties": {"Transform": "SourceLocation"}},
{"Column": "source_line", "Properties": {"Transform": "SourceLineNumber"}},
{"Column": "event_time", "Properties": {"Path": "$.Timestamp", "Transform": "DateTimeFromUnixMilliseconds"}},
{"Column": "ingestion_time", "Properties": {"ConstValue": "2021-01-01T10:32:00"}},
{"Column": "full_record", "Properties": {"Path": "$"}}
]
The mapping above is serialized as a JSON string when it’s provided as part of the .ingest
management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format = "json",
ingestionMapping =
```
[
{"Column": "column_a", "Properties": {"Path": "$.Obj.Property"}},
{"Column": "column_b", "Properties": {"Path": "$.Property"}},
{"Column": "custom_column", "Properties": {"Path": "$.[\'Property name with space\']"}}
]
```
)
Pre-created mapping
When the mapping is pre-created, reference the mapping by name in the .ingest
management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format="json",
ingestionMappingReference = "Mapping_Name"
)
Identity mapping
Use JSON mapping during ingestion without defining a mapping schema (see identity mapping).
.ingest into Table123 (@"source1", @"source2")
with
(
format="json"
)
Copying JSON mapping
You can copy JSON mapping of an existing table and create a new table with the same mapping using the following process:
Run the following command on the table whose mapping you want to copy:
.show table TABLENAME ingestion json mappings | extend formatted_mapping = strcat("'",replace_string(Mapping, "'", "\\'"),"'") | project formatted_mapping
Use the output of the above command to create a new table with the same mapping:
.create table TABLENAME ingestion json mapping "TABLENAME_Mapping" RESULT_OF_ABOVE_CMD
9.7.5 - ORC Mapping
Use ORC mapping to map incoming data to columns inside tables when your ingestion source file is in ORC format.
Each ORC mapping element may contain any of the following optional properties:
Property | Type | Description |
---|---|---|
Field | string | Name of the field in the ORC record. |
Path | string | If the value starts with $ it’s interpreted as the path to the field in the ORC document that will become the content of the column in the table. The path that denotes the entire ORC record is $ . If the value doesn’t start with $ it’s interpreted as a constant value. Paths that include special characters should be escaped as ['Property Name']. For more information, see JSONPath syntax. |
ConstValue | string | The constant value to be used for a column instead of some value inside the ORC file. |
Transform | string | Transformation that should be applied on the content with mapping transformations. |
Examples
[
{"Column": "event_timestamp", "Properties": {"Path": "$.Timestamp"}},
{"Column": "event_name", "Properties": {"Path": "$.Event.Name"}},
{"Column": "event_type", "Properties": {"Path": "$.Event.Type"}},
{"Column": "event_time", "Properties": {"Path": "$.Timestamp", "Transform": "DateTimeFromUnixMilliseconds"}},
{"Column": "ingestion_time", "Properties": {"ConstValue": "2021-01-01T10:32:00"}},
{"Column": "full_record", "Properties": {"Path": "$"}}
]
The mapping above is serialized as a JSON string when it’s provided as part of the .ingest
management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format = "orc",
ingestionMapping =
```
[
{"Column": "column_a", "Properties": {"Path": "$.Field1"}},
{"Column": "column_b", "Properties": {"Path": "$.[\'Field name with space\']"}}
]
```
)
Pre-created mapping
When the mapping is pre-created, reference the mapping by name in the .ingest
management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format="orc",
ingestionMappingReference = "ORC_Mapping"
)
Identity mapping
Use ORC mapping during ingestion without defining a mapping schema (see identity mapping).
.ingest into Table123 (@"source1", @"source2")
with
(
format="orc"
)
9.7.6 - Parquet Mapping
Use Parquet mapping to map incoming data to columns inside tables when your ingestion source file is in Parquet format.
Each Parquet mapping element may contain any of the following optional properties:
Property | Type | Description |
---|---|---|
Field | string | Name of the field in the Parquet record. |
Path | string | If the value starts with $ it’s interpreted as the path to the field in the Parquet document that will become the content of the column in the table. The path that denotes the entire Parquet record is $ . If the value doesn’t start with $ it’s interpreted as a constant value. Paths that include special characters should be escaped as ['Property Name']. For more information, see JSONPath syntax. |
ConstValue | string | The constant value to be used for a column instead of some value inside the Parquet file. |
Transform | string | Transformation that should be applied on the content with mapping transformations. |
Parquet type conversions
Comprehensive support is provided for converting data types when you’re ingesting or querying data from a Parquet source.
The following table provides a mapping of Parquet field types, and the table column types they can be converted to. The first column lists the Parquet type, and the others show the table column types they can be converted to.
Parquet type | bool | int | long | real | decimal | datetime | timespan | string | guid | dynamic |
---|---|---|---|---|---|---|---|---|---|---|
INT8 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
INT16 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
INT32 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
INT64 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
UINT8 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
UINT16 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
UINT32 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
UINT64 | ✔️ | ✔️ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
FLOAT32 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
FLOAT64 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
BOOLEAN | ✔️ | ❌ | ❌ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
DECIMAL (I32) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
DECIMAL (I64) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
DECIMAL (FLBA) | ❌ | ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
DECIMAL (BA) | ✔️ | ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ✔️ |
TIMESTAMP | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ |
DATE | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ |
STRING | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ✔️ |
UUID | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ | ✔️ | ❌ |
JSON | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ | ❌ | ✔️ |
LIST | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ |
MAP | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ |
STRUCT | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ |
Examples
[
{"Column": "event_timestamp", "Properties": {"Path": "$.Timestamp"}},
{"Column": "event_name", "Properties": {"Path": "$.Event.Name"}},
{"Column": "event_type", "Properties": {"Path": "$.Event.Type"}},
{"Column": "event_time", "Properties": {"Path": "$.Timestamp", "Transform": "DateTimeFromUnixMilliseconds"}},
{"Column": "ingestion_time", "Properties": {"ConstValue": "2021-01-01T10:32:00"}},
{"Column": "full_record", "Properties": {"Path": "$"}}
]
The mapping above is serialized as a JSON string when it’s provided as part of the .ingest
management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format = "parquet",
ingestionMapping =
```
[
{"Column": "column_a", "Properties": {"Path": "$.Field1.Subfield"}},
{"Column": "column_b", "Properties": {"Path": "$.[\'Field name with space\']"}},
]
```
)
Pre-created mapping
When the mapping is pre-created, reference the mapping by name in the .ingest
management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format="parquet",
ingestionMappingReference = "Mapping_Name"
)
Identity mapping
Use Parquet mapping during ingestion without defining a mapping schema (see identity mapping).
.ingest into Table123 (@"source1", @"source2")
with
(
format="parquet"
)
9.7.7 - W3CLOGFILE Mapping
Use W3CLOGFILE mapping to map incoming data to columns inside tables when your ingestion source file is in W3CLOGFILE format.
Each W3CLOGFILE mapping element may contain any of the following optional properties:
Property | Type | Description |
---|---|---|
Field | string | Name of the field in the W3CLOGFILE log record. |
ConstValue | string | The constant value to be used for a column instead of some value inside the W3CLOGFILE file. |
Transform | string | Transformation that should be applied on the content with mapping transformations. |
Examples
[
{"Column": "Date", "Properties": {"Field": "date"}},
{"Column": "Time", "Properties": {"Field": "time"}},
{"Column": "IP", "Properties": {"Field": "s-ip"}},
{"Column": "ClientMethod", "Properties": {"Field": "cs-method"}},
{"Column": "ClientQuery", "Properties": {"Field": "cs-uri-query"}},
{"Column": "ServerPort", "Properties": {"Field": "s-port"}},
{"Column": "ClientIP", "Properties": {"Field": "c-ip"}},
{"Column": "UserAgent", "Properties": {"Field": "cs(User-Agent)"}},
{"Column": "Referer", "Properties": {"Field": "cs(Referer)"}},
{"Column": "Status", "Properties": {"Field": "sc-status"}},
{"Column": "ResponseBytes", "Properties": {"Field": "sc-bytes"}},
{"Column": "RequestBytes", "Properties": {"Field": "cs-bytes"}},
{"Column": "TimeTaken", "Properties": {"Field": "time-taken"}}
]
The mapping above is serialized as a JSON string when it’s provided as part of the .ingest
management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format = "w3clogfile",
ingestionMapping =
```
[
{"Column": "column_a", "Properties": {"Field": "field1"}},
{"Column": "column_b", "Properties": {"Field": "field2"}}
]
```
)
Pre-created mapping
When the mapping is pre-created, reference the mapping by name in the .ingest
management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format="w3clogfile",
ingestionMappingReference = "Mapping_Name"
)
Identity mapping
Use W3CLOGFILE mapping during ingestion without defining a mapping schema (see identity mapping).
.ingest into Table123 (@"source1", @"source2")
with
(
format="w3clogfile"
)
9.8 - Manage external table mappings
9.9 - Materialized views
9.9.1 - Materialized views
Materialized views expose an aggregation query over a source table, or over another materialized view.
Materialized views always return an up-to-date result of the aggregation query (always fresh). Querying a materialized view is more performant than running the aggregation directly over the source table.
Why use materialized views?
By investing resources (data storage, background CPU cycles) for materialized views of commonly used aggregations, you get the following benefits:
Performance improvement: Querying a materialized view commonly performs better than querying the source table for the same aggregation function(s).
Freshness: A materialized view query always returns the most up-to-date results, independent of when materialization last took place. The query combines the materialized part of the view with the records in the source table that haven't yet been materialized (the delta part), always providing the most up-to-date results.

Cost reduction: Querying a materialized view consumes fewer resources than doing the aggregation over the source table. The retention policy of the source table can be reduced if only the aggregation is required. This setup reduces hot cache costs for the source table.
For example use cases, see Materialized view use cases.
How materialized views work
A materialized view is made of two components:
- A materialized part - a table holding aggregated records from the source table, which have already been processed. This table always holds a single record per the aggregation’s group-by combination.
- A delta - the newly ingested records in the source table that haven’t yet been processed.
Querying the materialized view combines the materialized part with the delta part, providing an up-to-date result of the aggregation query. The offline materialization process ingests new records from the delta to the materialized table, and updates existing records. If the intersection between the delta and the materialized part is large, and many records require updates, this might have a negative impact on the materialization process. See monitor materialized views on how to troubleshoot such situations.
Materialized views queries
There are two ways to query a materialized view:
Query the entire view: when you query the materialized view by its name, similarly to querying a table, the materialized view query combines the materialized part of the view with the records in the source table that haven’t been materialized yet (the
delta
).- Querying the materialized view always returns the most up-to-date results, based on all records ingested to the source table. For more information about the materialized vs. non-materialized parts in materialized view, see how materialized views work.
- This option might not perform best as it needs to materialize the
delta
part during query time. Performance in this case depends on the view’s age and the filters applied in the query. The materialized view query optimizer section includes possible ways to improve query performance when querying the entire view.
Query the materialized part only: another way of querying the view is by using the
materialized_view()
function. This option supports querying only the materialized part of the view, while specifying the max latency the user is willing to tolerate.- This option isn’t guaranteed to return the most up-to-date records, but it should always be more performant than querying the entire view.
- This function is useful for scenarios in which you’re willing to sacrifice some freshness for performance, for example for telemetry dashboards.
Materialized views participate in cross-cluster or cross-database queries, but aren’t included in wildcard unions or searches.
- The following examples all include materialized views by the name
ViewName
:
cluster('cluster1').database('db').ViewName
cluster('cluster1').database('*').ViewName
database('*').ViewName
database('DB*').ViewName
database('*').materialized_view('ViewName')
database('DB*').materialized_view('ViewName')
- The following examples do not include records from materialized views:
cluster('cluster1').database('db').*
database('*').View*
search in (*)
search *
Materialized views participate in cross-Eventhouse or cross-database queries, but aren’t included in wildcard unions or searches.
- The following examples all include materialized views by the name
ViewName
:
cluster("<serviceURL>").database('db').ViewName cluster("<serviceURL>").database('*').ViewName database('*').ViewName database('DB*').ViewName database('*').materialized_view('ViewName') database('DB*').materialized_view('ViewName')
- The following examples do not include records from materialized views:
cluster("<serviceURL>").database('db').* database('*').View* search in (*) search *
Materialized view query optimizer
When querying the entire view, the materialized part is combined with the delta
during query time. This includes aggregating the delta
and joining it with the materialized part.
- Querying the entire view performs better if the query includes filters on the group by keys of the materialized view query. See more tips about how to create your materialized view, based on your query pattern, in the
.create materialized-view
performance tips section.
- The query optimizer chooses summarize/join strategies that are expected to improve query performance. For example, the decision on whether to shuffle the query is based on the number of records in the delta part. The following client request properties provide some control over the optimizations applied. You can test these properties with your materialized view queries and evaluate their impact on query performance.
Client request property name | Type | Description |
---|---|---|
materialized_view_query_optimization_costbased_enabled | bool | If set to false , disables summarize/join optimizations in materialized view queries. Uses default strategies. Default is true . |
materialized_view_shuffle | dynamic | Force shuffling of the materialized view query, and (optionally) provide specific keys to shuffle by. See examples below. |
ingestion_time() function in the context of materialized views
The ingestion_time() function returns null values when used in the context of a materialized view, if you query the entire view. When querying the materialized part of the view, the return value depends on the type of materialized view:
- In materialized views which include a single
arg_max()
/arg_min()
/take_any()
aggregation, theingestion_time()
is equal to theingestion_time()
of the corresponding record in the source table. - In all other materialized views, the value of
ingestion_time()
is approximately the time of materialization (see how materialized views work).
Examples
Query the entire view. The most recent records in source table are included:
ViewName
Query the materialized part of the view only, regardless of when it was last materialized.
materialized_view("ViewName")
Query the entire view, and provide a “hint” to use
shuffle
strategy. The most recent records in source table are included:- Example #1: shuffle based on the
Id
column (similarly to usinghint.shufflekey=Id
):
set materialized_view_shuffle = dynamic([{"Name" : "ViewName", "Keys" : [ "Id" ] }]);
ViewName
- Example #2: shuffle based on all keys (similarly to using
hint.strategy=shuffle
):
set materialized_view_shuffle = dynamic([{"Name" : "ViewName" }]);
ViewName
Performance considerations
The main contributors that can impact a materialized view health are:
Cluster resources: Like any other process running on the cluster, materialized views consume resources (CPU, memory) from the cluster. If the cluster is overloaded, adding materialized views to it may degrade the cluster's performance. Monitor your cluster's health using cluster health metrics. Optimized autoscale currently doesn't take materialized view health into consideration as part of autoscale rules.
- The materialization process is limited by the amount of memory and CPU it can consume. These limits are defined, and can be changed, in the materialized views workload group.
Overlap with materialized data: During materialization, all new records ingested to the source table since the last materialization (the delta) are processed and materialized into the view. The higher the intersection between new records and already materialized records is, the worse the performance of the materialized view will be. A materialized view works best if the number of records being updated (for example, in
arg_max
view) is a small subset of the source table. If all or most of the materialized view records need to be updated in every materialization cycle, then the materialized view might not perform well.Ingestion rate: There are no hard-coded limits on the data volume or ingestion rate in the source table of the materialized view. However, the recommended ingestion rate for materialized views is no more than 1-2GB/sec. Higher ingestion rates may still perform well. Performance depends on database size, available resources, and amount of intersection with existing data.
Number of materialized views in cluster: The above considerations apply to each individual materialized view defined in the cluster. Each view consumes its own resources, and many views compete with each other on available resources. While there are no hard-coded limits to the number of materialized views in a cluster, the cluster may not be able to handle all materialized views, when there are many defined. The capacity policy can be adjusted if there is more than a single materialized view in the cluster. Increase the value of
ClusterMinimumConcurrentOperations
in the policy to run more materialized views concurrently.Materialized view definition: The materialized view definition must be defined according to query best practices for best query performance. For more information, see create command performance tips.
Materialized view over materialized view
A materialized view can be created over another materialized view if the source materialized view is a deduplication view. Specifically, the aggregation of the source materialized view must be take_any(*)
in order to deduplicate source records. The second materialized view can use any supported aggregation functions. For specific information on how to create a materialized view over a materialized view, see .create materialized-view
command.
Related content
9.9.2 - Materialized views data purge
Data purge commands can be used to purge records from materialized views. The same guidelines for purging records from a table apply to materialized views purge.
The purge command only deletes records from the materialized part of the view (what is the materialized part?). Therefore, if the source table of the materialized view includes records to purge, these records may be returned from the materialized view query, even after the purge completes successfully.
The recommended process for purging records from a materialized view is:
- Purge the source table of the materialized view.
- After the source table purge is completed successfully, purge the materialized view.
Limitations
The purge predicate of a materialized view purge can only reference the group by keys of the aggregation, or any column in an arg_max()/arg_min()/take_any() view. It can't reference the result columns of other aggregation functions.
For example, for a materialized view MV
, which is defined with the following aggregation function:
T | summarize count(), avg(Duration) by UserId
The following purge predicate isn’t valid, since it references the result of the avg() aggregation:
MV | where avg_Duration > 1h
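In contrast, the following sketch is a valid purge predicate, because it references only a group-by key of the aggregation (the value is illustrative):
MV | where UserId == 'user1'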
Related content
9.9.3 - Materialized views limitations
The materialized view source
- The source table of a materialized view:
- Must be a table into which data is directly ingested, using an update policy, or ingest from query commands.
- Using move extents or replace extents from other tables to the source table of the materialized view is only supported if using
setNewIngestionTime
property as part of the move extents command (refer to the .move extents and .replace extents commands for more details).
- Moving extents to the source table of a materialized view, while not using setNewIngestionTime, can cause the move to fail with one of the following errors:
  - Cannot drop/move extents from/to table 'TableName' since Materialized View 'ViewName' is currently processing some of these extents.
  - Cannot move extents to 'TableName' since materialized view 'ViewName' will not process these extents (can lead to data loss in the materialized view).
- The source table of a materialized view must have IngestionTime policy enabled. This policy is enabled by default.
- If the materialized view uses a default
lookback
, theingestion_time()
must be preserved in the materialized view’s query. Operators such as mv-expand or pivot plugin don’t preserve theingestion_time()
, so they can’t be used in a materialized view with alookback
. For more information, see Lookback period. - The source table of a materialized view can’t be a table with a restricted view access policy.
- A materialized view can’t be created on top of another materialized view, unless the first materialized view is of type
take_any(*)
aggregation. See materialized view over materialized view. - Materialized views can’t be defined over external tables.
Impact of records ingested to or dropped from the source table
- A materialized view only processes new records ingested into the source table. Records that are removed from the source table, either by running data purge/soft delete/drop extents, or due to retention policy or any other reason, have no impact on the materialized view.
- The materialized view has its own retention policy, which is independent of the retention policy of the source table. The materialized view might include records that aren’t present in the source table.
Follower databases
- Materialized views can’t be created in follower databases. Follower databases are read-only and materialized views require write operations.
- Materialized views can’t be created in database shortcuts. Database shortcuts are read-only and materialized views require write operations.
- Materialized views that are defined on leader databases can be queried from their followers, like any other table in the leader.
- Use the leader cluster to monitor follower database materialized views. For more information, see Materialized views in follower databases.
- Use the source Eventhouse to monitor shortcut database materialized views. For more information, see Monitor materialized views.
Other
- Cursor functions can’t be used on top of materialized views.
- Continuous export from a materialized view isn’t supported.
Related content
9.9.4 - Materialized views policies
This article includes information about policies that can be set on materialized views.
Retention and caching policy
A materialized view has a retention policy and caching policy. The materialized view derives the database retention and caching policies by default. These policies can be changed using retention policy management commands or caching policy management commands.
Both policies are applied on the materialized part of the materialized view only. For an explanation of the differences between the materialized part and delta part, see how materialized views work. For example, if the caching policy of a materialized view is set to 7d, but the caching policy of its source table is set to 0d, there may still be disk misses when querying the materialized view. This behavior occurs because the source table (delta part) also participates in the query.
The retention policy of the materialized view is unrelated to the retention policy of the source table. The retention policy of the source table can be shorter than that of the materialized view if source records are required for a shorter period. We recommend a minimum retention policy of at least a few days, and recoverability set to true, on the source table. This setting allows for fast recovery from errors and for diagnostic purposes.
The retention and caching policies both depend on Extent Creation time. The last update for a record determines the extent creation time for a materialized view.
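As an illustrative sketch, assuming the standard retention and caching policy commands apply to materialized views (ViewName is a placeholder, and the values are examples only), the policies might be set as follows:
.alter materialized-view ViewName policy retention ```
{
  "SoftDeletePeriod": "365.00:00:00",
  "Recoverability": "Enabled"
} ```

.alter materialized-view ViewName policy caching hot = 7d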
Partitioning policy
A partitioning policy can be applied on a materialized view. We recommend configuring a partitioning policy on a materialized view only when most or all of the view queries filter by one of the materialized view’s group-by keys. This situation is common in multi-tenant solutions, where one of the materialized view’s group-by keys is the tenant’s identifier (for example, tenantId
, customerId
). For more information, see the first use case described in the partitioning policy supported scenarios page.
For the commands to alter a materialized view’s partitioning policy, see partitioning policy commands.
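As a sketch, assuming a view whose group-by keys include a customerId column, a hash partitioning policy might be applied as follows (the policy values are illustrative):
.alter materialized-view ViewName policy partitioning ```
{
  "PartitionKeys": [
    {
      "ColumnName": "customerId",
      "Kind": "Hash",
      "Properties": {
        "Function": "XxHash64",
        "MaxPartitionCount": 128,
        "Seed": 1,
        "PartitionAssignmentMode": "Uniform"
      }
    }
  ]
} ```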
Adding a partitioning policy on a materialized view increases the number of extents in the materialized view, and creates more “work” for the materialization process. For more information on the reason for this behavior, see the extents rebuild process mentioned in how materialized views work.
Row level security policy
A row level security policy can be applied to a materialized view, with several limitations:
- The policy can be applied only to materialized views with arg_max()/arg_min()/take_any() aggregation functions, or when the row level security query references the group by keys of the materialized view aggregation.
- The policy is applied to the materialized part of the view only.
- If the same row level security policy isn’t defined on the source table of the materialized view, then querying the materialized view may return records that should be hidden by the policy. This happens because querying the materialized view queries the source table as well.
- We recommend defining the same row level security policy both on the source table and the materialized view if the view is an arg_max() or arg_min()/take_any().
- When defining a row level security policy on the source table of an arg_max() or arg_min()/take_any() materialized view, the command fails if there’s no row level security policy defined on the materialized view itself. The purpose of the failure is to alert the user of a potential data leak, since the materialized view may expose information. To mitigate this error, do one of the following actions:
- Define the row level security policy over the materialized view.
- Choose to ignore the error by adding
allowMaterializedViewsWithoutRowLevelSecurity
property to the alter policy command. For example:
.alter table SourceTable policy row_level_security enable with (allowMaterializedViewsWithoutRowLevelSecurity=true) "RLS_function"
For commands for configuring a row level security policy on a materialized view, see row_level_security policy commands.
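For example, the following sketch enables the same RLS function on both the materialized view and its source table (the view, table, and function names are placeholders, reusing the command shown above):
.alter materialized-view ViewName policy row_level_security enable "RLS_function"

.alter table SourceTable policy row_level_security enable with (allowMaterializedViewsWithoutRowLevelSecurity=true) "RLS_function"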
9.9.5 - Materialized views use cases
Materialized views expose an aggregation query over a source table or another materialized view. This article covers common and advanced use cases for materialized views.
Common use cases
The following are common scenarios that can be addressed by using a materialized view:
Update data: Update data by returning the last record per entity using
arg_max()
(aggregation function). For example, create a view that only materializes records ingested from now on:.create materialized-view ArgMax on table T { T | summarize arg_max(Timestamp, *) by User }
Reduce the resolution of data Reduce the resolution of data by calculating periodic statistics over the raw data. Use various aggregation functions by period of time. For example, maintain an up-to-date snapshot of distinct users per day:
.create materialized-view UsersByDay on table T { T | summarize dcount(User) by bin(Timestamp, 1d) }
Deduplicate records: Deduplicate records in a table using
take_any()
(aggregation function). For example, create a materialized view that deduplicates the source table based on theEventId
column, using a lookback of 6 hours. Records are deduplicated against only records ingested 6 hours before current records..create materialized-view with(lookback=6h) DeduplicatedTable on table T { T | summarize take_any(*) by EventId }
[!NOTE] You can conceal the source table by creating a function with the same name as the table that references the materialized view instead. This pattern ensures that callers querying the table access the deduplicated materialized view because functions override tables with the same name. To avoid cyclic references in the view definition, use the table() function to reference the source table:
.create materialized-view DeduplicatedTable on table T { table('T') | summarize take_any(*) by EventId }
For more examples, see the .create materialized-view command.
Advanced scenario
You can use a materialized view for create/update/delete event processing. For records with incomplete or outdated information in each column, a materialized view can provide the latest updates for each column, excluding entities that were deleted.
Consider the following input table named Events
:
Input
Timestamp | cud | ID | col1 | col2 | col3 |
---|---|---|---|---|---|
2023-10-24 00:00:00.0000000 | C | 1 | 1 | 2 | |
2023-10-24 01:00:00.0000000 | U | 1 | | 22 | 33 |
2023-10-24 02:00:00.0000000 | U | 1 | | 23 | |
2023-10-24 00:00:00.0000000 | C | 2 | 1 | 2 | |
2023-10-24 00:10:00.0000000 | U | 2 | | 4 | |
2023-10-24 02:00:00.0000000 | D | 2 | | | |
Create a materialized view to get the latest update per column, using the arg_max() aggregation function:
.create materialized-view ItemHistory on table Events
{
Events
| extend Timestamp_col1 = iff(isnull(col1), datetime(1970-01-01), Timestamp),
Timestamp_col2 = iff(isnull(col2), datetime(1970-01-01), Timestamp),
Timestamp_col3 = iff(isnull(col3), datetime(1970-01-01), Timestamp)
| summarize arg_max(Timestamp_col1, col1), arg_max(Timestamp_col2, col2), arg_max(Timestamp_col3, col3), arg_max(Timestamp, cud) by id
}
Output
ID | Timestamp_col1 | col1 | Timestamp_col2 | col2 | Timestamp_col3 | col3 | Timestamp | cud |
---|---|---|---|---|---|---|---|---|
2 | 2023-10-24 00:00:00.0000000 | 1 | 2023-10-24 00:10:00.0000000 | 4 | 1970-01-01 00:00:00.0000000 | | 2023-10-24 02:00:00.0000000 | D |
1 | 2023-10-24 00:00:00.0000000 | 1 | 2023-10-24 02:00:00.0000000 | 23 | 2023-10-24 01:00:00.0000000 | 33 | 2023-10-24 02:00:00.0000000 | U |
You can create a stored function to further clean the results:
ItemHistory
| project Timestamp, cud, id, col1, col2, col3
| where cud != "D"
| project-away cud
Final Output
The latest update for each column for ID 1
, since ID 2
was deleted.
Timestamp | ID | col1 | col2 | col3 |
---|---|---|---|---|
2023-10-24 02:00:00.0000000 | 1 | 1 | 23 | 33 |
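To make this cleanup reusable, the projection above can be wrapped in a stored function (a sketch; the function name is illustrative):
.create-or-alter function ItemsLatestState()
{
    ItemHistory
    | project Timestamp, cud, id, col1, col2, col3
    | where cud != "D"
    | project-away cud
}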
Materialized views vs. update policies
Materialized views and update policies work differently and serve different use cases. Use the following guidelines to identify which one you should use:
Materialized views are suitable for aggregations, while update policies aren’t. Update policies run separately for each ingestion batch, and therefore can only perform aggregations within the same ingestion batch. If you require an aggregation query, always use materialized views.
Update policies are useful for data transformations, enrichments with dimension tables (usually using lookup operator) and other data manipulations that can run in the scope of a single ingestion.
Update policies run during ingestion time. Data isn’t available for queries in the source table or the target table until all update policies run. Materialized views, on the other hand, aren’t part of the ingestion pipeline. The materialization process runs periodically in the background, post ingestion. Records in source table are available for queries before they’re materialized.
Both update policies and materialized views can incorporate joins, but their effectiveness is limited to specific scenarios. Specifically, joins are suitable only when the data required for the join from both sides is accessible at the time of the update policy or materialization process. If matching entities are ingested when the update policy or materialization runs, there’s a risk of overlooking data. See more about
dimension tables
in materialized view query parameter and in fact and dimension tables.
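For illustration, an enrichment update policy of the kind described above might look like the following sketch, assuming hypothetical RawTelemetry and EnrichedTelemetry tables and an EnrichTelemetry() function that performs the lookup against a dimension table:
.alter table EnrichedTelemetry policy update
'[{"IsEnabled": true, "Source": "RawTelemetry", "Query": "EnrichTelemetry()", "IsTransactional": false}]'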
Related content
9.9.6 - Monitor materialized views
Monitor the materialized view’s health in the following ways:
Monitor materialized views metrics in the Azure portal with Azure Monitor. Use the materialized view age metric,
MaterializedViewAgeSeconds
, as the primary metric to monitor the freshness of the view.Monitor materialized view metrics in your Microsoft Fabric workspace. Use the materialized view age metric,
MaterializedViewAgeSeconds
as the primary metric to monitor the freshness of the view. For more information, see Enable monitoring in your workspace.Monitor the
IsHealthy
property using.show materialized-view
.Check for failures using
.show materialized-view failures
.
Troubleshooting unhealthy materialized views
If the MaterializedViewAge
metric constantly increases, and the MaterializedViewHealth
metric shows that the view is unhealthy, follow these recommendations to identify the root cause:
Check the number of materialized views on the cluster, and the current capacity for materialized views:
.show capacity | where Resource == "MaterializedView" | project Resource, Total, Consumed
Output
Resource | Total | Consumed |
---|---|---|
MaterializedView | 1 | 0 |
Total
column, while theConsumed
column shows the number of materialized views currently running. You can use the Materialized views capacity policy to specify the minimum and maximum number of concurrent operations, overriding the system’s default concurrency level. The system determines the current concurrency, shown inTotal
, based on the cluster’s available resources. The following example overrides the system’s decision and changes the minimum concurrent operations from one to three:
.alter-merge cluster policy capacity '{ "MaterializedViewsCapacity": { "ClusterMinimumConcurrentOperations": 3 } }'
- If you explicitly change this policy, monitor the cluster’s health and ensure that other workloads aren’t affected by this change.
Check if there are failures during the materialization process using .show materialized-view failures.
- If the error is permanent, the system automatically disables the materialized view. To check if it’s disabled, use the .show materialized-view command and see if the value in the
IsEnabled
column isfalse
. Then check the Journal for the disabled event with the .show journal command. An example of a permanent failure is a source table schema change that makes it incompatible with the materialized view. For more information, see .create materialized-view command. - If the failure is transient, the system automatically retries the operation. However, the failure can delay the materialization and increase the age of the materialized view. This type of failure occurs, for example, when hitting memory limits or with a query time-out. See the following recommendations for more ways to troubleshoot transient failures.
Analyze the materialization process using the .show commands-and-queries command. Replace Databasename and ViewName to filter for a specific view:
.show commands-and-queries | where Database == "DatabaseName" and ClientActivityId startswith "DN.MaterializedViews;ViewName;"
- Check the memory consumption in the
MemoryPeak
column to identify any operations that failed due to hitting memory limits, such as runaway queries. By default, the materialization process is limited to a 15-GB memory peak per node. If the queries or commands executed during the materialization process exceed this value, the materialization fails due to memory limits. To increase the memory peak per node, alter the $materialized-views workload group. The following example alters the materialized views workload group to use a maximum of 64-GB memory peak per node during materialization:
.alter-merge workload_group ['$materialized-views'] ```
{
  "RequestLimitsPolicy": {
    "MaxMemoryPerQueryPerNode": {
      "Value": 68719241216
    }
  }
} ```
[!NOTE]
MaxMemoryPerQueryPerNode
can’t exceed 50% of the total memory available on each node.- Check if the materialization process is hitting cold cache. The following example shows cache statistics over the past day for the materialized view,
ViewName
:
.show commands-and-queries
| where ClientActivityId startswith "DN.MaterializedViews;ViewName"
| where StartedOn > ago(1d)
| extend HotCacheHits = tolong(CacheStatistics.Shards.Hot.HitBytes),
         HotCacheMisses = tolong(CacheStatistics.Shards.Hot.MissBytes),
         HotCacheRetrieved = tolong(CacheStatistics.Shards.Hot.RetrieveBytes),
         ColdCacheHits = tolong(CacheStatistics.Shards.Cold.HitBytes),
         ColdCacheMisses = tolong(CacheStatistics.Shards.Cold.MissBytes),
         ColdCacheRetrieved = tolong(CacheStatistics.Shards.Cold.RetrieveBytes)
| summarize HotCacheHits = format_bytes(sum(HotCacheHits)),
            HotCacheMisses = format_bytes(sum(HotCacheMisses)),
            HotCacheRetrieved = format_bytes(sum(HotCacheRetrieved)),
            ColdCacheHits = format_bytes(sum(ColdCacheHits)),
            ColdCacheMisses = format_bytes(sum(ColdCacheMisses)),
            ColdCacheRetrieved = format_bytes(sum(ColdCacheRetrieved))
Output
HotCacheHits | HotCacheMisses | HotCacheRetrieved | ColdCacheHits | ColdCacheMisses | ColdCacheRetrieved |
---|---|---|---|---|---|
26 GB | 0 Bytes | 0 Bytes | 1 GB | 0 Bytes | 866 MB |
- If the view isn’t fully in the hot cache, materialization can experience disk misses, significantly slowing down the process.
- Increasing the caching policy for the materialized view helps avoid cache misses. For more information, see hot and cold cache and caching policy and the .alter materialized-view policy caching command.
- Check if the materialization is scanning old records by checking the ScannedExtentsStatistics with the .show queries command. If the number of scanned extents is high and the MinDataScannedTime is old, the materialization cycle needs to scan all, or most, of the materialized part of the view. The scan is needed to find intersections with the delta. For more information about the delta and the materialized part, see How materialized views work. The following recommendations provide ways to reduce the amount of data scanned in materialization cycles by minimizing the intersection with the delta.
If the materialization cycle scans a large amount of data, potentially including cold cache, consider making the following changes to the materialized view definition:
- Include a datetime group-by key in the view definition. This can significantly reduce the amount of data scanned, as long as there is no late arriving data in this column. For more information, see Performance tips. You need to create a new materialized view since updates to group-by keys aren’t supported.
- Use a lookback as part of the view definition. For more information, see .create materialized-view supported properties. A sketch applying both recommendations follows this list.
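For illustration, the following is a possible view definition that applies both recommendations, assuming a hypothetical SourceTable with Id and Timestamp columns; the 1-hour bin and 6-hour lookback are arbitrary values to adapt.
// Deduplication view with a datetime group-by key and a lookback window,
// so each materialization cycle only intersects recent data.
.create materialized-view with (lookback=6h) DedupView on table SourceTable
{
    SourceTable
    | summarize take_any(*) by Id, bin(Timestamp, 1h)
}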
Check whether there’s enough ingestion capacity by verifying if either the MaterializedViewResult metric or the IngestionUtilization metric shows InsufficientCapacity values. You can increase ingestion capacity by scaling the available resources (preferred) or by altering the ingestion capacity policy.
Check whether there’s enough ingestion capacity by verifying if the MaterializedViewResult metric shows InsufficientCapacity values. You can increase ingestion capacity by scaling the available resources.
If the materialized view is still unhealthy, then the service doesn’t have sufficient capacity or resources to materialize all the data on time. Consider the following options:
- Scale out the cluster by increasing the minimum instance count. Optimized autoscale doesn’t take materialized views into consideration and doesn’t scale out the cluster automatically if materialized views are unhealthy. You need to set the minimum instance count to provide the cluster with more resources to accommodate materialized views.
- Scale out the Eventhouse to provide it with more resources to accommodate materialized views. For more information, see Enable minimum consumption.
- Divide the materialized view into several smaller views, each covering a subset of the data. For instance, you can split them based on a high cardinality key from the materialized view’s group-by keys. All views are based on the same source table, and each view filters by SourceTable | where hash(key, number_of_views) == i, where i is part of the set {0,1,…,number_of_views-1}. Then, you can define a stored function that unions all the smaller materialized views. Use this function in queries to access the combined data. A sketch of this pattern follows below.
While splitting the view might increase CPU usage, it reduces the memory peak in materialization cycles. Reducing the memory peak can help if the single view is failing due to memory limits.
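For illustration, the following is a possible sketch of the split-view pattern with two partitions, assuming a hypothetical SourceTable with a high-cardinality UserId group-by key; the view and function names are illustrative, and each command must be run separately.
// Two partial views, each materializing half of the key space.
.create materialized-view MV_Part0 on table SourceTable
{
    SourceTable
    | where hash(UserId, 2) == 0
    | summarize arg_max(Timestamp, *) by UserId
}
.create materialized-view MV_Part1 on table SourceTable
{
    SourceTable
    | where hash(UserId, 2) == 1
    | summarize arg_max(Timestamp, *) by UserId
}
// A stored function that unions the partial views; query this function to access the combined data.
.create-or-alter function CombinedView()
{
    union MV_Part0, MV_Part1
}
Because hash(UserId, 2) assigns each record to exactly one partial view, the union returns the same rows as a single combined view would.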
MaterializedViewResult metric
The MaterializedViewResult metric provides information about the result of a materialization cycle and can be used to identify issues in the materialized view health status. The metric includes the Database, MaterializedViewName, and Result dimensions.
The Result dimension can have one of the following values:
- Success: The materialization completed successfully.
- SourceTableNotFound: The source table of the materialized view was dropped, so the materialized view is disabled automatically.
- SourceTableSchemaChange: The schema of the source table changed in a way that isn’t compatible with the materialized view definition. Since the materialized view query no longer matches the materialized view schema, the materialized view is disabled automatically.
- InsufficientCapacity: The instance doesn’t have sufficient capacity to materialize the materialized view, due to a lack of ingestion capacity. While insufficient capacity failures can be transient, if they reoccur often, try scaling out the instance or increasing the relevant capacity in the policy.
- InsufficientCapacity: The instance doesn’t have sufficient capacity to materialize the materialized view, due to a lack of ingestion capacity. While insufficient capacity failures can be transient, if they reoccur often, try scaling out the instance or increasing capacity. For more information, see Plan your capacity size.
- InsufficientResources: The database doesn’t have sufficient resources (CPU/memory) to materialize the materialized view. While insufficient resource errors might be transient, if they reoccur often, try scaling up or scaling out. For more ideas, see Troubleshooting unhealthy materialized views.
Materialized views in follower databases
Materialized views can be defined in follower databases. However, the monitoring of these materialized views should be based on the leader database, where the materialized view is defined. Specifically:
- Metrics related to materialized view execution (MaterializedViewResult, MaterializedViewExtentsRebuild) are only present in the leader database. Metrics related to monitoring (MaterializedViewAgeSeconds, MaterializedViewHealth, MaterializedViewRecordsInDelta) also appear in the follower databases.
- The .show materialized-view failures command only works in the leader database.
Track resource consumption
Materialized views resource consumption: the resources consumed by the materialized views materialization process can be tracked using the .show commands-and-queries command. Filter the records for a specific view using the following (replace DatabaseName and ViewName):
.show commands-and-queries
| where Database == "DatabaseName" and ClientActivityId startswith "DN.MaterializedViews;ViewName;"
Related content
9.10 - Stored query results
9.10.1 - Stored query results
Stored query results store the result of a query on the service for up to 24 hours. The same principal identity that created the stored query can reference the results in later queries.
Stored query results can be useful in the following scenarios:
- Paging through query results. The initial command runs the query and returns the first “page” of records. Later queries reference other “pages” without the need to rerun the query.
- Drill-down scenarios, in which the results of an initial query are then explored using other queries.
Updates to security policies, such as database access and row level security, aren’t propagated to stored query results. If a user’s permissions are revoked, use .drop stored_query_results to delete their stored query results.
Stored query results behave like tables, in that the order of records isn’t preserved. To paginate through the results, we recommend that the query include unique ID columns. If a query returns multiple result sets, only the first result set is stored.
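For illustration, a possible paging flow, assuming a hypothetical Events table; the row_number() column provides the unique ID used for paging.
// Store the query result once; Num gives each record a stable, page-friendly ID.
.set stored_query_result EventsSnapshot <|
    Events
    | order by Timestamp desc
    | extend Num = row_number()
// Later queries read pages from the stored result without rerunning the query.
stored_query_result("EventsSnapshot")
| where Num between (1 .. 100)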
The following table lists the management commands and functions used for managing stored query results:
Command | Description |
---|---|
.set stored_query_result command | Creates a stored query result to store the results of a query on the service for up to 24 hours. |
.show stored_query_result command | Shows information on active query results. |
.drop stored_query_result command | Deletes active query results. |
stored_query_result() | Retrieves a stored query result. |
Related content
9.11 - Tables
9.11.1 - Tables management
This topic discusses the life cycle of tables and the associated management commands that are helpful for exploring, creating, and altering tables.
Select the links in the following table for more information about each command.
For information on optimizing table schema, see Schema optimization best practices.
Commands | Operation |
---|---|
.alter table docstring , .alter table folder | Manage table display properties |
.create ingestion mapping , .show ingestion mappings , .alter ingestion mapping , .drop ingestion mapping | Manage ingestion mapping |
.create tables , .create table , .create-merge tables , .create-merge table , .alter table , .alter-merge table , .drop tables , .drop table , .undo drop table , .rename table | Create/modify/drop tables |
.show tables , .show table details , .show table schema | Enumerate tables in a database |
.ingest , .set , .append , .set-or-append (see Data ingestion overview). | Data ingestion into a table |
.clear table data | Clears all the data of a table |
CRUD naming conventions for tables
(See full details in the sections linked to in the table above.)
Command syntax | Semantics |
---|---|
.create entityType entityName ... | If an entity of that type and name exists, returns the entity. Otherwise, create the entity. |
.create-merge entityType entityName... | If an entity of that type and name exists, merge the existing entity with the specified entity. Otherwise, create the entity. |
.alter entityType entityName ... | If an entity of that type and name does not exist, error. Otherwise, replace it with the specified entity. |
.alter-merge entityType entityName ... | If an entity of that type and name does not exist, error. Otherwise, merge it with the specified entity. |
.drop entityType entityName ... | If an entity of that type and name does not exist, error. Otherwise, drop it. |
.drop entityType entityName ifexists ... | If an entity of that type and name does not exist, return. Otherwise, drop it. |
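For illustration, the following contrasts .create, .create-merge, and .alter on a hypothetical MyLogs table; run each command separately.
// Creates MyLogs if it doesn't exist; if it already exists, returns the existing table unchanged.
.create table MyLogs (Timestamp: datetime, Message: string)
// Merges with the existing table: adds the Level column and keeps the existing columns.
.create-merge table MyLogs (Timestamp: datetime, Level: string)
// Errors if MyLogs doesn't exist; otherwise replaces its schema with the specified one.
.alter table MyLogs (Timestamp: datetime, Message: string, Level: string)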
10 - Security roles
10.1 - Manage database security roles
Principals are granted access to resources through a role-based access control model, where their assigned security roles determine their resource access.
In this article, you’ll learn how to use management commands to view existing security roles and add and drop principal association to security roles on the database level.
Permissions
You must have at least Database Admin permissions to run these commands.
Database level security roles
The following table shows the possible security roles on the database level and describes the permissions granted for each role.
Role | Permissions |
---|---|
admins | View and modify the database and database entities. |
users | View the database and create new database entities. |
viewers | View tables in the database where RestrictedViewAccess isn’t turned on. |
unrestrictedviewers | View the tables in the database even where RestrictedViewAccess is turned on. The principal must also have admins , viewers , or users permissions. |
ingestors | Ingest data to the database without access to query. |
monitors | View database metadata such as schemas, operations, and permissions. |
Show existing security roles
Before you add or remove principals, you can use the .show
command to see a table with all of the principals and roles that are already set on the database.
Syntax
To show all roles:
.show database DatabaseName principals
To show your roles:
.show database DatabaseName principal roles
Parameters
Name | Type | Required | Description |
---|---|---|---|
DatabaseName | string | ✔️ | The name of the database for which to list principals. |
Example
The following command lists all security principals that have access to the Samples
database.
.show database Samples principals
Example output
Role | PrincipalType | PrincipalDisplayName | PrincipalObjectId | PrincipalFQN |
---|---|---|---|---|
Database Samples Admin | Microsoft Entra user | Abbi Atkins | cd709aed-a26c-e3953dec735e | aaduser=abbiatkins@fabrikam.com |
Add and drop principal association to security roles
This section provides syntax, parameters, and examples for adding and removing principals to and from security roles.
Syntax
Action database DatabaseName Role ( Principal [, Principal…] ) [skip-results] [ Description ]
Parameters
Name | Type | Required | Description |
---|---|---|---|
Action | string | ✔️ | The command .add , .drop , or .set ..add adds the specified principals, .drop removes the specified principals, and .set adds the specified principals and removes all previous ones. |
DatabaseName | string | ✔️ | The name of the database for which to add principals. |
Role | string | ✔️ | The role to assign to the principal. For databases, roles can be admins , users , viewers , unrestrictedviewers , ingestors , or monitors . |
Principal | string | ✔️ | One or more principals or managed identities. To reference managed identities, use the “App” format using the managed identity object ID or managed identity client (application) ID. For guidance on how to specify these principals, see Referencing Microsoft Entra principals and groups. |
skip-results | string | If provided, the command won’t return the updated list of database principals. | |
Description | string | Text to describe the change that displays when using the .show command. |
Name | Type | Required | Description |
---|---|---|---|
Action | string | ✔️ | The command .add , .drop , or .set ..add adds the specified principals, .drop removes the specified principals, and .set adds the specified principals and removes all previous ones. |
DatabaseName | string | ✔️ | The name of the database for which to add principals. |
Role | string | ✔️ | The role to assign to the principal. For databases, this can be admins , users , viewers , unrestrictedviewers , ingestors , or monitors . |
Principal | string | ✔️ | One or more principals. For guidance on how to specify these principals, see Referencing Microsoft Entra principals and groups. |
skip-results | string | If provided, the command won’t return the updated list of database principals. | |
Description | string | Text to describe the change that displays when using the .show command. |
Examples
In the following examples, you’ll see how to add security roles, remove security roles, and add and remove security roles in the same command.
Add security roles with .add
The following example adds a principal to the users
role on the Samples
database.
.add database Samples users ('aaduser=imikeoein@fabrikam.com')
The following example adds an application to the viewers
role on the Samples
database.
.add database Samples viewers ('aadapp=4c7e82bd-6adb-46c3-b413-fdd44834c69b;fabrikam.com')
Remove security roles with .drop
The following example removes all principals in the group from the admins
role on the Samples
database.
.drop database Samples admins ('aadGroup=SomeGroupEmail@fabrikam.com')
Add new security roles and remove the old with .set
The following example removes existing viewers
and adds the provided principals as viewers
on the Samples
database.
.set database Samples viewers ('aaduser=imikeoein@fabrikam.com', 'aaduser=abbiatkins@fabrikam.com')
Remove all security roles with .set
The following command removes all existing viewers
on the Samples
database.
.set database Samples viewers none
Related content
10.2 - Manage external table roles
Principals are granted access to resources through a role-based access control model, where their assigned security roles determine their resource access.
On external tables, the only security role is admins
. External table admins
have the ability to view, modify, and remove the external table.
In this article, you’ll learn how to use management commands to view existing admins as well as add and remove admins on external tables.
Permissions
You must have Database Admin permissions or be an External Table Admin on the specific external table to run these commands. For more information, see role-based access control.
Show existing admins
Before you add or remove principals, you can use the .show
command to see a table with all of the principals that already have admin access on the external table.
Syntax
To show all roles:
.show external table ExternalTableName principals
To show your roles:
.show external table ExternalTableName principal roles
Parameters
Name | Type | Required | Description |
---|---|---|---|
ExternalTableName | string | ✔️ | The name of the external table for which to list principals. |
Example
The following command lists all security principals that have access to the Samples
external table.
.show external table Samples principals
Example output
Role | PrincipalType | PrincipalDisplayName | PrincipalObjectId | PrincipalFQN |
---|---|---|---|---|
External Table Samples Admin | Microsoft Entra user | Abbi Atkins | cd709aed-a26c-e3953dec735e | aaduser=abbiatkins@fabrikam.com |
Add and drop admins
This section provides syntax, parameters, and examples for adding and removing principals.
Syntax
Action external table ExternalTableName admins ( Principal [, Principal…] ) [skip-results] [ Description ]
Parameters
Name | Type | Required | Description |
---|---|---|---|
Action | string | ✔️ | The command .add , .drop , or .set ..add adds the specified principals, .drop removes the specified principals, and .set adds the specified principals and removes all previous ones. |
ExternalTableName | string | ✔️ | The name of the external table for which to add principals. |
Principal | string | ✔️ | One or more principals. For guidance on how to specify these principals, see Referencing security principals. |
skip-results | string | If provided, the command won’t return the updated list of external table principals. | |
Description | string | Text to describe the change that will be displayed when using the .show command. |
Examples
In the following examples, you’ll see how to add admins, remove admins, and add and remove admins in the same command.
Add admins with .add
The following example adds a principal to the admins
role on the Samples
external table.
.add external table Samples admins ('aaduser=imikeoein@fabrikam.com')
Remove admins with .drop
The following example removes all principals in the group from the admins
role on the Samples
external table.
.drop external table Samples admins ('aadGroup=SomeGroupEmail@fabrikam.com')
Add new admins and remove the old with .set
The following example removes existing admins
and adds the provided principals as admins
on the Samples
external table.
.set external table Samples admins ('aaduser=imikeoein@fabrikam.com', 'aaduser=abbiatkins@fabrikam.com')
Remove all admins with .set
The following command removes all existing admins
on the Samples
external table.
.set external table Samples admins none
Related content
10.3 - Manage function roles
Principals are granted access to resources through a role-based access control model, where their assigned security roles determine their resource access.
On functions, the only security role is admins
. Function admins
have the ability to view, modify, and remove the function.
In this article, you’ll learn how to use management commands to view existing admins as well as add and remove admins on functions.
Permissions
You must have Database Admin permissions or be a Function Admin on the specific function to run these commands. For more information, see role-based access control.
Show existing admins
Before you add or remove principals, you can use the .show
command to see a table with all of the principals that already have admin access on the function.
Syntax
To show all roles:
.show function FunctionName principals
To show your roles:
.show function FunctionName principal roles
Parameters
Name | Type | Required | Description |
---|---|---|---|
FunctionName | string | ✔️ | The name of the function for which to list principals. |
Example
The following command lists all security principals that have access to the SampleFunction
function.
.show function SampleFunction principals
Example output
Role | PrincipalType | PrincipalDisplayName | PrincipalObjectId | PrincipalFQN |
---|---|---|---|---|
Function SampleFunction Admin | Microsoft Entra user | Abbi Atkins | cd709aed-a26c-e3953dec735e | aaduser=abbiatkins@fabrikam.com |
Add and drop admins
This section provides syntax, parameters, and examples for adding and removing principals.
Syntax
Action function FunctionName admins ( Principal [, Principal…] ) [skip-results] [ Description ]
Parameters
Name | Type | Required | Description |
---|---|---|---|
Action | string | ✔️ | The command .add , .drop , or .set ..add adds the specified principals, .drop removes the specified principals, and .set adds the specified principals and removes all previous ones. |
FunctionName | string | ✔️ | The name of the function for which to add principals. |
Principal | string | ✔️ | One or more principals. For guidance on how to specify these principals, see Referencing security principals. |
skip-results | string | If provided, the command won’t return the updated list of function principals. | |
Description | string | Text to describe the change that will be displayed when using the .show command. |
Examples
In the following examples, you’ll see how to add admins, remove admins, and add and remove admins in the same command.
Add admins with .add
The following example adds a principal to the admins
role on the SampleFunction
function.
.add function SampleFunction admins ('aaduser=imikeoein@fabrikam.com')
Remove admins with .drop
The following example removes all principals in the group from the admins
role on the SampleFunction
function.
.drop function SampleFunction admins ('aadGroup=SomeGroupEmail@fabrikam.com')
Add new admins and remove the old with .set
The following example removes existing admins
and adds the provided principals as admins
on the SampleFunction
function.
.set function SampleFunction admins ('aaduser=imikeoein@fabrikam.com', 'aaduser=abbiatkins@fabrikam.com')
Remove all admins with .set
The following command removes all existing admins
on the SampleFunction
function.
.set function SampleFunction admins none
Related content
10.4 - Manage materialized view roles
Principals are granted access to resources through a role-based access control model, where their assigned security roles determine their resource access.
On materialized views, the only security role is admins
. Materialized view admins
have the ability to view, modify, and remove the materialized view.
In this article, you’ll learn how to use management commands to view existing admins as well as add and remove admins on materialized views.
Permissions
You must have Database Admin permissions or be a Materialized View Admin on the specific materialized view to run these commands. For more information, see role-based access control.
Show existing admins
Before you add or remove principals, you can use the .show
command to see a table with all of the principals that already have admin access on the materialized view.
Syntax
To show all roles:
.show materialized-view MaterializedViewName principals
To show your roles:
.show materialized-view MaterializedViewName principal roles
Parameters
Name | Type | Required | Description |
---|---|---|---|
MaterializedViewName | string | ✔️ | The name of the materialized view for which to list principals. |
Example
The following command lists all security principals that have access to the SampleView
materialized view.
.show materialized-view SampleView principals
Example output
Role | PrincipalType | PrincipalDisplayName | PrincipalObjectId | PrincipalFQN |
---|---|---|---|---|
Materialized View SampleView Admin | Microsoft Entra user | Abbi Atkins | cd709aed-a26c-e3953dec735e | aaduser=abbiatkins@fabrikam.com |
Add and drop admins
This section provides syntax, parameters, and examples for adding and removing principals.
Syntax
Action materialized-view MaterializedViewName admins ( Principal [, Principal…] ) [skip-results] [ Description ]
Parameters
Name | Type | Required | Description |
---|---|---|---|
Action | string | ✔️ | The command .add , .drop , or .set ..add adds the specified principals, .drop removes the specified principals, and .set adds the specified principals and removes all previous ones. |
MaterializedViewName | string | ✔️ | The name of the materialized view for which to add principals. |
Principal | string | ✔️ | One or more principals. For guidance on how to specify these principals, see Referencing security principals. |
skip-results | string | If provided, the command won’t return the updated list of materialized view principals. | |
Description | string | Text to describe the change that will be displayed when using the .show command. |
Examples
In the following examples, you’ll see how to add admins, remove admins, and add and remove admins in the same command.
Add admins with .add
The following example adds a principal to the admins
role on the SampleView
materialized view.
.add materialized-view SampleView admins ('aaduser=imikeoein@fabrikam.com')
Remove admins with .drop
The following example removes all principals in the group from the admins
role on the SampleView
materialized view.
.drop materialized-view SampleView admins ('aadGroup=SomeGroupEmail@fabrikam.com')
Add new admins and remove the old with .set
The following example removes existing admins
and adds the provided principals as admins
on the SampleView
materialized view.
.set materialized-view SampleView admins ('aaduser=imikeoein@fabrikam.com', 'aaduser=abbiatkins@fabrikam.com')
Remove all admins with .set
The following command removes all existing admins
on the SampleView
materialized view.
.set materialized-view SampleView admins none
Related content
10.5 - Referencing security principals
The authorization model allows for the use of Microsoft Entra user and application identities and Microsoft Accounts (MSAs) as security principals. This article provides an overview of the supported principal types for both Microsoft Entra ID and MSAs, and demonstrates how to properly reference these principals when assigning security roles using management commands.
Microsoft Entra ID
The recommended way to access your environment is by authenticating to the Microsoft Entra service. Microsoft Entra ID is an identity provider capable of authenticating security principals and coordinating with other identity providers, such as Microsoft’s Active Directory.
Microsoft Entra ID supports the following authentication scenarios:
- User authentication (interactive sign-in): Used to authenticate human principals.
- Application authentication (non-interactive sign-in): Used to authenticate services and applications that have to run or authenticate without user interaction.
Referencing Microsoft Entra principals and groups
The syntax for referencing Microsoft Entra user and application principals and groups is outlined in the following table.
If you use a User Principal Name (UPN) to reference a user principal, an attempt is made to infer the tenant from the domain name and find the principal. If the principal isn’t found, explicitly specify the tenant ID or name in addition to the user’s UPN or object ID.
Similarly, you can reference a security group with the group email address in UPN format, and an attempt is made to infer the tenant from the domain name. If the group isn’t found, explicitly specify the tenant ID or name in addition to the group display name or object ID.
Type of Entity | Microsoft Entra tenant | Syntax |
---|---|---|
User | Implicit | aaduser =UPN |
User | Explicit (ID) | aaduser =UPN;TenantIdor aaduser =ObjectID;TenantId |
User | Explicit (Name) | aaduser =UPN;TenantNameor aaduser =ObjectID;TenantName |
Group | Implicit | aadgroup =GroupEmailAddress |
Group | Explicit (ID) | aadgroup =GroupDisplayName;TenantIdor aadgroup =GroupObjectId;TenantId |
Group | Explicit (Name) | aadgroup =GroupDisplayName;TenantNameor aadgroup =GroupObjectId;TenantName |
App | Explicit (ID) | aadapp =ApplicationDisplayName;TenantIdor aadapp =ApplicationId;TenantId |
App | Explicit (Name) | aadapp =ApplicationDisplayName;TenantNameor aadapp =ApplicationId;TenantName |
Examples
The following example uses the user UPN to assign the users role to a principal on the Test database. The tenant information isn’t specified, so your cluster will attempt to resolve the Microsoft Entra tenant using the UPN.
.add database Test users ('aaduser=imikeoein@fabrikam.com') 'Test user (AAD)'
The following example uses a group name and tenant name to assign the group to the user role on the Test
database.
.add database Test users ('aadgroup=SGDisplayName;fabrikam.com') 'Test group @fabrikam.com (AAD)'
The following example uses an app ID and tenant name to assign the app the user role on the Test
database.
.add database Test users ('aadapp=4c7e82bd-6adb-46c3-b413-fdd44834c69b;fabrikam.com') 'Test app @fabrikam.com (AAD)'
Microsoft Accounts (MSAs)
User authentication for Microsoft Accounts (MSAs) is supported. MSAs are all of the Microsoft-managed non-organizational user accounts. For example, hotmail.com
, live.com
, outlook.com
.
Referencing MSA principals
IdP | Type | Syntax |
---|---|---|
Live.com | User | msauser= UPN |
Example
The following example assigns an MSA user to the user role on the Test
database.
.add database Test users ('msauser=abbiatkins@live.com') 'Test user (live.com)'
Related content
- Read the authentication overview
- Learn how to use the Azure portal to manage database principals and roles
- Learn how to use management commands to assign security roles
10.6 - Security roles
Principals are granted access to resources through a role-based access control model, where their assigned security roles determine their resource access.
When a principal attempts an operation, the system performs an authorization check to make sure the principal is associated with at least one security role that grants permissions to perform the operation. Failing an authorization check aborts the operation.
The management commands listed in this article can be used to manage principals and their security roles on databases, tables, external tables, materialized views, and functions.
To learn how to configure them in the Azure portal, see Manage cluster permissions.
Management commands
The following table describes the commands used for managing security roles.
Command | Description |
---|---|
.show | Lists principals with the given role. |
.add | Adds one or more principals to the role. |
.drop | Removes one or more principals from the role. |
.set | Sets the role to the specific list of principals, removing all previous ones. |
Security roles
The following table describes the level of access granted for each role and shows a check if the role can be assigned within the given object type.
Role | Permissions | Databases | Tables | External tables | Materialized views | Functions |
---|---|---|---|---|---|---|
admins | View, modify, and remove the object and subobjects. | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
users | View the object and create new subobjects. | ✔️ | ||||
viewers | View the object where RestrictedViewAccess isn’t turned on. | ✔️ | ||||
unrestrictedviewers | View the object even where RestrictedViewAccess is turned on. The principal must also have admins , viewers or users permissions. | ✔️ | ||||
ingestors | Ingest data to the object without access to query. | ✔️ | ✔️ | |||
monitors | View metadata such as schemas, operations, and permissions. | ✔️ |
For a full description of the security roles at each scope, see Kusto role-based access control.
Common scenarios
Show your principal roles
To see your own roles on the cluster, run the following command:
To see your own roles on the eventhouse, run the following command:
.show cluster principal roles
Show your roles on a resource
To check the roles assigned to you on a specific resource, run the following command within the relevant database or the database that contains the resource:
// For a database:
.show database DatabaseName principal roles
// For a table:
.show table TableName principal roles
// For an external table:
.show external table ExternalTableName principal roles
// For a function:
.show function FunctionName principal roles
// For a materialized view:
.show materialized-view MaterializedViewName principal roles
Show the roles of all principals on a resource
To see the roles assigned to all principals for a particular resource, run the following command within the relevant database or the database that contains the resource:
// For a database:
.show database DatabaseName principals
// For a table:
.show table TableName principals
// For an external table:
.show external table ExternalTableName principals
// For a function:
.show function FunctionName principals
// For a materialized view:
.show materialized-view MaterializedViewName principals
Modify the role assignments
For details on how to modify your role assignments at the database and table levels, see Manage database security roles and Manage table security roles.
Related content
10.7 - Access control
10.7.1 - Access Control Overview
Access control is based on authentication and authorization. Each query and command on an Azure Data Explorer resource, such as a cluster or database, must pass both authentication and authorization checks.
Access control is based on authentication and authorization. Each query and command on a Fabric resource, such as a database, must pass both authentication and authorization checks.
- Authentication: Validates the identity of the security principal making a request
- Authorization: Validates the security principal making a request is permitted to make that request on the target resource
Authentication
To programmatically authenticate, a client must communicate with Microsoft Entra ID and request an access token specific to the Kusto service. Then, the client can use the acquired access token as proof of identity when issuing requests to your database.
The main authentication scenarios are as follows:
- User authentication: Used to verify the identity of human users.
- Application authentication: Used to verify the identity of an application that needs to access resources without human intervention by using configured credentials.
- On-behalf-of (OBO) authentication: Allows an application to exchange a token for said application with a token to access a Kusto service. This flow must be implemented with MSAL.
- Single page application (SPA) authentication: Allows client-side SPA web applications to sign in users and get tokens to access your database. This flow must be implemented with MSAL.
User authentication
User authentication happens when a user presents credentials to Microsoft Entra ID or an identity provider that federates with Microsoft Entra ID, such as Active Directory Federation Services. The user gets back a security token that can be presented to the Azure Data Explorer service. Azure Data Explorer determines whether the token is valid, whether the token is issued by a trusted issuer, and what security claims the token contains.
Azure Data Explorer supports the following methods of user authentication, including through the Kusto client libraries:
- Interactive user authentication with sign-in through the user interface.
- User authentication with a Microsoft Entra token issued for Azure Data Explorer.
- User authentication with a Microsoft Entra token issued for another resource that can be exchanged for an Azure Data Explorer token using On-behalf-of (OBO) authentication.
Application authentication
Application authentication is needed when requests aren’t associated with a specific user or when no user is available to provide credentials. In this case, the application authenticates to Microsoft Entra ID or the federated IdP by presenting secret information.
Azure Data Explorer supports the following methods of application authentication, including through the Kusto client libraries:
- Application authentication with an Azure managed identity.
- Application authentication with an X.509v2 certificate installed locally.
- Application authentication with an X.509v2 certificate given to the client library as a byte stream.
- Application authentication with a Microsoft Entra application ID and a Microsoft Entra application key. The application ID and application key are like a username and password.
- Application authentication with a previously obtained valid Microsoft Entra token, issued to Azure Data Explorer.
- Application authentication with a Microsoft Entra token issued for another resource that can be exchanged for an Azure Data Explorer token using On-behalf-of (OBO) authentication.
Authorization
Before carrying out an action on a resource, all authenticated users must pass an authorization check. The Kusto role-based access control model is used, where principals are ascribed to one or more security roles. Authorization is granted as long as one of the roles assigned to the user allows them to perform the specified action. For example, the Database User role grants security principals the right to read the data of a particular database, create tables in the database, and more.
The association of security principals to security roles can be defined individually or by using security groups that are defined in Microsoft Entra ID. For more information on how to assign security roles, see Security roles overview.
Group authorization
Authorization can be granted to Microsoft Entra ID groups by assigning one or more roles to the group.
When checking authorization for a user or application principal, the system first looks for an explicit role assignment that permits the specific action. If the role assignment doesn’t exist, the system checks the principal’s membership in all groups that could authorize the action.
If the principal is a member of a group with appropriate permissions, the requested action is authorized. Otherwise, the action doesn’t pass the authorization check and is disallowed.
Force group membership refresh
Principals can force a refresh of group membership information. This capability is useful in scenarios where just-in-time (JIT) privileged access services, such as Microsoft Entra Privileged Identity Management (PIM), are used to obtain higher privileges on a resource.
Refresh for a specific group
Principals can force a refresh of group membership for a specific group. However, the following restrictions apply:
- A refresh can be requested up to 10 times per hour per principal.
- The requesting principal must be a member of the group at the time of the request.
The request results in an error if either of these conditions aren’t met.
To reevaluate the current principal’s membership of a group, run the following command:
.clear cluster cache groupmembership with (group='<GroupFQN>')
Use the group’s fully qualified name (FQN). For more information, see Referencing Microsoft Entra principals and groups.
Refresh for other principals
A privileged principal can request a refresh for other principals. The requesting principal must have AllDatabasesMonitor access for the target service. Privileged principals can also run the previous command without restrictions.
To refresh another principal’s group membership, run the following command:
.clear cluster cache groupmembership with (principal='<PrincipalFQN>', group='<GroupFQN>')
Related content
- Understand Kusto role-based access control.
- For user or application authentication, use the Kusto client libraries.
- For OBO or SPA authentication, see How to authenticate with Microsoft Authentication Library (MSAL).
- For referencing principals and groups, see Referencing Microsoft Entra principals and groups.
10.7.2 - Microsoft Entra application registration
Microsoft Entra application authentication requires creating and registering an application with Microsoft Entra ID. A service principal is automatically created when the application registration is created in a Microsoft Entra tenant.
The app registration can be created either in the Azure portal or programmatically with the Azure CLI. Choose the tab that fits your scenario.
Portal
Register the app
Sign in to Azure portal and open the Microsoft Entra ID blade.
Browse to App registrations and select New registration.
Name the application, for example “example-app”.
Select a supported account type, which determines who can use the application.
Under Redirect URI, select Web for the type of application you want to create. The URI is optional and is left blank in this case.
Select Register.
Set up authentication
There are two types of authentication available for service principals: password-based authentication (application secret) and certificate-based authentication. The following section describes using a password-based authentication for the application’s credentials. You can alternatively use an X509 certificate to authenticate your application. For more information, see How to configure Microsoft Entra certificate-based authentication.
In this section, you copy the following values: the Application ID and the key value. Paste these values somewhere, like a text editor, for use in a later step when you grant the service principal access to the database.
Browse to the Overview blade.
Copy the Application (client) ID and the Directory (tenant) ID.
[!NOTE] You’ll need the application ID and the tenant ID to authorize the service principal to access the database.
In the Certificates & secrets blade, select New client secret.
Enter a description and expiration.
Select Add.
Copy the key value.
[!NOTE] When you leave this page, the key value won’t be accessible.
You’ve created your Microsoft Entra application and service principal.
Azure CLI
Sign in to your Azure subscription via Azure CLI. Then authenticate in the browser.
az login
Choose the subscription to host the principal. This step is needed when you have multiple subscriptions.
az account set --subscription YOUR_SUBSCRIPTION_GUID
Create the service principal. In this example, the service principal is called my-service-principal.
az ad sp create-for-rbac -n "my-service-principal" --role Contributor --scopes /subscriptions/{SubID}
From the returned JSON data, copy the appId, password, and tenant for future use.
{
  "appId": "00001111-aaaa-2222-bbbb-3333cccc4444",
  "displayName": "my-service-principal",
  "name": "my-service-principal",
  "password": "00001111-aaaa-2222-bbbb-3333cccc4444",
  "tenant": "00001111-aaaa-2222-bbbb-3333cccc4444"
}
You’ve created your Microsoft Entra application and service principal.
Configure delegated permissions for the application - optional
If your application needs to access your database using the credentials of the calling user, configure delegated permissions for your application. For example, if you’re building a web API and you want to authenticate using the credentials of the user who is calling your API.
If you only need access to an authorized data resource, you can skip this section and continue to Grant a service principal access to the database.
Browse to the API permissions blade of your App registration.
Select Add a permission.
Select APIs my organization uses.
Search for and select Azure Data Explorer.
In Delegated permissions, select the user_impersonation box.
Select Add permissions.
Grant a service principal access to the database
Once your application registration is created, you need to grant the corresponding service principal access to your database. The following example gives viewer access. For other roles, see Kusto role-based access control.
Use the values of Application ID and Tenant ID as copied in a previous step.
Execute the following command in your query editor, replacing the placeholder values ApplicationID and TenantID with your actual values:
.add database <DatabaseName> viewers ('aadapp=<ApplicationID>;<TenantID>') '<Notes>'
For example:
.add database Logs viewers ('aadapp=00001111-aaaa-2222-bbbb-3333cccc4444;9876abcd-e5f6-g7h8-i9j0-1234kl5678mn') 'App Registration'
The last parameter is a string that shows up as notes when you query the roles associated with a database.
[!NOTE] After creating the application registration, there might be a several minute delay until it can be referenced. If you receive an error that the application is not found, wait and try again.
For more information on roles, see Role-based access control.
Use application credentials to access a database
Use the application credentials to programmatically access your database by using the client library.
. . .
string applicationClientId = "<myClientID>";
string applicationKey = "<myApplicationKey>";
string authority = "<myApplicationTenantID>";
. . .
var kcsb = new KustoConnectionStringBuilder($"https://{clusterName}.kusto.windows.net/{databaseName}")
.WithAadApplicationKeyAuthentication(
applicationClientId,
applicationKey,
authority);
var client = KustoClientFactory.CreateCslQueryProvider(kcsb);
var queryResult = client.ExecuteQuery($"{query}");
[!NOTE] Specify the application id and key of the application registration (service principal) created earlier.
For more information, see How to authenticate with Microsoft Authentication Library (MSAL) in apps and use Azure Key Vault with .NET Core web app.
Troubleshooting
Invalid resource error
If your application is used to authenticate users or applications for access, you must set up delegated permissions for the service application; that is, declare that your application can authenticate users or applications for access. Not doing so results in an error similar to the following when an authentication attempt is made:
AADSTS650057: Invalid resource. The client has requested access to a resource which is not listed in the requested permissions in the client's application registration...
Enable user consent error
Your Microsoft Entra tenant administrator might enact a policy that prevents tenant users from giving consent to applications. This situation will result in an error similar to the following, when a user tries to sign in to your application:
AADSTS65001: The user or administrator has not consented to use the application with ID '<App ID>' named 'App Name'
You’ll need to contact your Microsoft Entra administrator to grant consent for all users in the tenant, or enable user consent for your specific application.
Related content
10.7.3 - Role-based access control
Azure Data Explorer uses a role-based access control (RBAC) model in which principals get access to resources based on their assigned roles. Roles are defined for a specific cluster, database, table, external table, materialized view, or function. When defined for a cluster, the role applies to all databases in the cluster. When defined for a database, the role applies to all entities in the database.
Azure Resource Manager (ARM) roles, such as subscription owner or cluster owner, grant access permissions for resource administration. For data administration, you need the roles described in this document.
Real-Time Intelligence in Fabric uses a hybrid role-based access control (RBAC) model in which principals get access to resources based on their assigned roles granted from one or both of two sources: Fabric, and Kusto management commands. The user will have the union of the roles granted from both sources.
Within Fabric, roles can be assigned or inherited by assigning a role in a workspace, or by sharing a specific item based on the item permission model.
Fabric roles
Role | Permissions granted on items |
---|---|
Workspace Admin | Admin RBAC role on all items in the workspace. |
Workspace Member | Admin RBAC role on all items in the workspace. |
Workspace Contributor | Admin RBAC role on all items in the workspace. |
Workspace Viewer | Viewer RBAC role on all items in the workspace. |
Item Editor | Admin RBAC role on the item. |
Item Viewer | Viewer RBAC role on the item. |
Roles can further be defined on the data plane for a specific database, table, external table, materialized view, or function, by using management commands. In both cases, roles applied at a higher level (Workspace, Eventhouse) are inherited by lower levels (Database, Table).
Roles and permissions
The following table outlines the roles and permissions available at each scope.
The Permissions column displays the access granted to each role.
The Dependencies column lists the minimum roles required to obtain the role in that row. For example, to become a Table Admin, you must first have a role like Database User or a role that includes the permissions of Database User, such as Database Admin or AllDatabasesAdmin. When multiple roles are listed in the Dependencies column, only one of them is needed to obtain the role.
The How the role is obtained column offers ways that the role can be granted or inherited.
The Manage column offers ways to add or remove role principals.
Scope | Role | Permissions | Dependencies | Manage |
---|---|---|---|---|
Cluster | AllDatabasesAdmin | Full permission to all databases in the cluster. May show and alter certain cluster-level policies. Includes all permissions. | Azure portal | |
Cluster | AllDatabasesViewer | Read all data and metadata of any database in the cluster. | Azure portal | |
Cluster | AllDatabasesMonitor | Execute .show commands in the context of any database in the cluster. | Azure portal | |
Database | Admin | Full permission in the scope of a particular database. Includes all lower level permissions. | Azure portal or management commands | |
Database | User | Read all data and metadata of the database. Create tables and functions, and become the admin for those tables and functions. | Azure portal or management commands | |
Database | Viewer | Read all data and metadata, except for tables with the RestrictedViewAccess policy turned on. | Azure portal or management commands | |
Database | Unrestrictedviewer | Read all data and metadata, including in tables with the RestrictedViewAccess policy turned on. | Database User or Database Viewer | Azure portal or management commands |
Database | Ingestor | Ingest data to all tables in the database without access to query the data. | Azure portal or management commands | |
Database | Monitor | Execute .show commands in the context of the database and its child entities. | Azure portal or management commands | |
Table | Admin | Full permission in the scope of a particular table. | Database User | management commands |
Table | Ingestor | Ingest data to the table without access to query the data. | Database User or Database Ingestor | management commands |
External Table | Admin | Full permission in the scope of a particular external table. | Database User or Database Viewer | management commands |
Materialized view | Admin | Full permission to alter the view, delete the view, and grant admin permissions to another principal. | Database User or Table Admin | management commands |
Function | Admin | Full permission to alter the function, delete the function, and grant admin permissions to another principal. | Database User or Table Admin | management commands |
Scope | Role | Permissions | How the role is obtained |
---|---|---|---|
Eventhouse | AllDatabasesAdmin | Full permission to all databases in the Eventhouse. May show and alter certain Eventhouse-level policies. Includes all permissions. | - Inherited as workspace admin, workspace member, or workspace contributor. Can’t be assigned with management commands. |
Database | Admin | Full permission in the scope of a particular database. Includes all lower level permissions. | - Inherited as workspace admin, workspace member, or workspace contributor - Item shared with editing permissions. - Assigned with management commands |
Database | User | Read all data and metadata of the database. Create tables and functions, and become the admin for those tables and functions. | - Assigned with management commands |
Database | Viewer | Read all data and metadata, except for tables with the RestrictedViewAccess policy turned on. | - Item shared with viewing permissions. - Assigned with management commands |
Database | Unrestrictedviewer | Read all data and metadata, including in tables with the RestrictedViewAccess policy turned on. | - Assigned with management commands. Dependent on having Database User or Database Viewer. |
Database | Ingestor | Ingest data to all tables in the database without access to query the data. | - Assigned with management commands |
Database | Monitor | Execute .show commands in the context of the database and its child entities. | - Assigned with management commands |
Table | Admin | Full permission in the scope of a particular table. | - Inherited as workspace admin, workspace member, or workspace contributor - Parent item (KQL Database) shared with editing permissions. - Assigned with management commands. Dependent on having Database User on the parent database. |
Table | Ingestor | Ingest data to the table without access to query the data. | - Assigned with management commands. Dependent on having Database User or Database Ingestor on the parent database. |
External Table | Admin | Full permission in the scope of a particular external table. | - Assigned with management commands. Dependent on having Database User or Database Viewer on the parent database. |
Related content
10.8 - Manage table roles
10.8.1 - Manage table security roles
Principals are granted access to resources through a role-based access control model, where their assigned security roles determine their resource access.
In this article, you’ll learn how to use management commands to view existing security roles as well as add and remove security roles on the table level.
Permissions
You must have at least Table Admin permissions to run these commands.
Table level security roles
The following table shows the possible security roles on the table level and describes the permissions granted for each role.
Role | Permissions |
---|---|
admins | View, modify, and remove the table and table entities. |
ingestors | Ingest data to the table without access to query. |
Show existing security roles
Before you add or remove principals, you can use the .show
command to see a table with all of the principals and roles that are already set on the table.
Syntax
To show all roles:
.show table TableName principals
To show your roles:
.show table TableName principal roles
Parameters
Name | Type | Required | Description |
---|---|---|---|
TableName | string | ✔️ | The name of the table for which to list principals. |
Example
The following command lists all security principals that have access to the StormEvents
table.
.show table StormEvents principals
Example output
Role | PrincipalType | PrincipalDisplayName | PrincipalObjectId | PrincipalFQN |
---|---|---|---|---|
Table StormEvents Admin | Microsoft Entra user | Abbi Atkins | cd709aed-a26c-e3953dec735e | aaduser=abbiatkins@fabrikam.com |
Add and drop security roles
This section provides syntax, parameters, and examples for adding and removing principals.
Syntax
Action table TableName Role ( Principal [, Principal…] ) [skip-results] [ Description ]
Parameters
Name | Type | Required | Description |
---|---|---|---|
Action | string | ✔️ | The command .add , .drop , or .set ..add adds the specified principals, .drop removes the specified principals, and .set adds the specified principals and removes all previous ones. |
TableName | string | ✔️ | The name of the table for which to add principals. |
Role | string | ✔️ | The role to assign to the principal. For tables, this can be admins or ingestors . |
Principal | string | ✔️ | One or more principals. For guidance on how to specify these principals, see Referencing security principals. |
skip-results | string | If provided, the command won’t return the updated list of table principals. | |
Description | string | Text to describe the change that will be displayed when using the .show command. |
Examples
In the following examples, you’ll see how to add security roles, remove security roles, and add and remove security roles in the same command.
Add security roles with .add
The following example adds a principal to the admins
role on the StormEvents
table.
.add table StormEvents admins ('aaduser=imikeoein@fabrikam.com')
The following example adds an application to the ingestors
role on the StormEvents
table.
.add table StormEvents ingestors ('aadapp=4c7e82bd-6adb-46c3-b413-fdd44834c69b;fabrikam.com')
Remove security roles with .drop
The following example removes all principals in the group from the admins
role on the StormEvents
table.
.drop table StormEvents admins ('aadGroup=SomeGroupEmail@fabrikam.com')
Add new security roles and remove the old with .set
The following example removes existing ingestors and adds the provided principals as ingestors on the StormEvents table.
.set table StormEvents ingestors ('aaduser=imikeoein@fabrikam.com', 'aaduser=abbiatkins@fabrikam.com')
Remove all security roles with .set
The following command removes all existing ingestors on the StormEvents table.
.set table StormEvents ingestors none
Related content
10.8.2 - Manage view access to tables
Principals gain access to resources, such as databases and tables, based on their assigned security roles. The viewer security role is only available at the database level, and assigning a principal this role gives them view access to all tables in the database.
In this article, you learn methods for controlling a principal’s table view access.
Structure data for controlled access
To control access more effectively, we recommend that you separate tables into different databases based on access privileges. For instance, create a distinct database for sensitive data and restrict access to specific principals by assigning them the relevant security roles.
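For example, a minimal sketch of this approach, assuming a hypothetical database named SensitiveData and a hypothetical user principal:
// Grant view access on the dedicated database only to the approved principal.
.add database SensitiveData viewers ('aaduser=analyst@fabrikam.com') 'Approved analyst for sensitive data'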
Restricted View Access policy
To restrict access to specific tables, you can turn on the Restricted View Access policy for those tables. This policy ensures that only principals with the unrestrictedViewer role can access the table. Meanwhile, principals with the regular viewer role can’t view the table.
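As a minimal sketch, assuming a hypothetical table named SensitiveEvents, the policy can be turned on with the following management command:
// After this, only principals with the unrestrictedViewer role can query the table.
.alter table SensitiveEvents policy restricted_view_access true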
Row Level Security policy
The Row Level Security (RLS) policy allows you to restrict access to rows of data based on specific criteria and allows masking data in columns. When you create an RLS policy on a table, the restriction applies to all users, including database administrators and the RLS creator.
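As a sketch, assuming a hypothetical table named Sales and a hypothetical query function SalesForCurrentUser that filters rows for the calling principal:
// Enable row level security; every query on Sales is then evaluated through the function.
.alter table Sales policy row_level_security enable "SalesForCurrentUser"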
Create a follower database
Create a follower database and follow only the relevant tables that you’d like to share with the specific principal or set of principals.
Create a database shortcut in Fabric and follow only the relevant tables that you’d like to share with the specific principal or set of principals.
Related content
- Learn more about role-based access control
- Use management commands to assign security roles
11 - Operations
11.1 - Estimate table size
Understanding the size of a table can be helpful for efficient resource management and optimized query performance. In this article, you’ll learn different methods to estimate table sizes and how to use them effectively.
Original size of ingested data
Use the .show table details command to estimate the original data size of a table. For an example, see Use .show table details.
This command provides an estimation of the uncompressed size of data ingested into your table based on the assumption that the data was transferred in CSV format. The estimation is based on approximate lengths of numeric values, such as integers, longs, datetimes, and guids, by considering their string representations.
Example use case: Track the size of incoming data over time to make informed decisions about capacity planning.
Table size in terms of access bytes
Use the estimate_data_size() function along with the sum() aggregation function to estimate table size based on data types and their respective byte sizes. For an example, see Use estimate_data_size().
This method provides a more precise estimation by considering the byte sizes of numeric values without formatting them as strings. For example, integer values require 4 bytes whereas long and datetime values require 8 bytes. By using this approach, you can accurately estimate the data size that would fit in memory.
Example use case: Determine the cost of a query in terms of bytes to be scanned.
Combined size of multiple tables
You can use the union operator along with the estimate_data_size() and sum() functions to estimate the combined size of multiple tables in terms of access bytes. For an example, see Use union with estimate_data_size().
Example use case: Assess the memory requirements for consolidating data from multiple tables into a single dataset.
Examples
Use .show table details
The following query estimates the original data size of the StormEvents table.
.show table StormEvents details
| project TotalOriginalSize
Output
TotalOriginalSize |
---|
60192011 |
Use estimate_data_size()
The following query estimates the data size of the StormEvents table in bytes.
StormEvents
| extend sizeEstimateOfColumn = estimate_data_size(*)
| summarize totalSize=sum(sizeEstimateOfColumn)
Output
totalSize |
---|
58608932 |
Use union with estimate_data_size()
The following query estimates the data size for all tables in the Samples database.
union withsource=_TableName *
| extend sizeEstimateOfColumn = estimate_data_size(*)
| summarize totalSize=sum(sizeEstimateOfColumn)
| extend sizeGB = format_bytes(totalSize,2,"GB")
totalSize | sizeGB |
---|---|
1761782453926 | 1640.79 GB |
11.2 - Journal management
Journal contains information about metadata operations done on your database.
The metadata operations can result from a management command that a user executed, or internal management commands that the system executed, such as drop extents by retention.
Taking a dependency on these internal operations isn’t recommended.
Event | EventTimestamp | Database | EntityName | UpdatedEntityName | EntityVersion | EntityContainerName |
---|---|---|---|---|---|---|
CREATE-TABLE | 2017-01-05 14:25:07 | InternalDb | MyTable1 | MyTable1 | v7.0 | InternalDb |
RENAME-TABLE | 2017-01-13 10:30:01 | InternalDb | MyTable1 | MyTable2 | v8.0 | InternalDb |
OriginalEntityState | UpdatedEntityState | ChangeCommand | Principal |
---|---|---|---|
. | Name: MyTable1, Attributes: Name='[MyTable1].[col1]', Type='I32' | .create table MyTable1 (col1:int) | imike@fabrikam.com |
. | The database properties (too long to be displayed here) | .create database TestDB persist (@"https://imfbkm.blob.core.windows.net/md", @"https://imfbkm.blob.core.windows.net/data") | Microsoft Entra app id=76263cdb-abcd-545644e9c404 |
Name: MyTable1, Attributes: Name='[MyTable1].[col1]', Type='I32' | Name: MyTable2, Attributes: Name='[MyTable1].[col1]', Type='I32' | .rename table MyTable1 to MyTable2 | rdmik@fabrikam.com |
Item | Description |
---|---|
Event | The metadata event name |
EventTimestamp | The event timestamp |
Database | Metadata of this database was changed following the event |
EntityName | The entity name that the operation was executed on, before the change |
UpdatedEntityName | The new entity name after the change |
EntityVersion | The new metadata version following the change |
EntityContainerName | The entity container name (entity=column, container=table) |
OriginalEntityState | The state of the entity (entity properties) before the change |
UpdatedEntityState | The new state after the change |
ChangeCommand | The executed management command that triggered the metadata change |
Principal | The principal (user/app) that executed the management command |
.show journal
The .show journal command returns a list of metadata changes on databases or the cluster that the user has admin access to.
Permissions
Everyone with permission can execute the command.
Results returned will include:
- All journal entries of the user executing the command.
- All journal entries of databases that the user executing the command has admin access to.
- All cluster journal entries if the user executing the command is a Cluster AllDatabases Admin.
.show database DatabaseName journal
The .show database DatabaseName journal command returns the journal of metadata changes for the specified database.
Permissions
Everyone with permission can execute the command.
Results returned include:
- All journal entries of database DatabaseName if the user executing the command is a database admin in DatabaseName.
- Otherwise, all the journal entries of database DatabaseName and of the user executing the command.
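Because journal output can be piped into query operators, it can be filtered directly. The following sketch filters the journal of the Samples database for table creation events (the database name is only illustrative):
.show database Samples journal
| where Event == "CREATE-TABLE"
| project EventTimestamp, EntityName, ChangeCommand, Principal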
11.3 - System information
This section summarizes commands that are available to Database Admins and Database Monitors to explore usage, track operations, and investigate ingestion failures. For more information on security roles, see Kusto role-based access control.
- .show journal - displays the history of metadata operations.
- .show operations - displays administrative operations, both running and completed, since the admin node was last elected.
- .show queries - displays information on completed and running queries.
- .show commands - displays information on completed commands and their resource utilization.
- .show commands-and-queries - displays information on completed commands and queries, and their resource utilization.
- .show ingestion failures - displays information on failures encountered during data ingestion.
- .show table details - displays information on table size and other table statistics.
- .show table data statistics - displays table data statistics per column.
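For example, a sketch that aggregates recently completed queries by principal and state, assuming the caller holds one of the roles above:
.show queries
| where StartedOn > ago(1d)
| summarize QueryCount = count(), TotalDuration = sum(Duration) by User, State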
11.4 - Operations
11.5 - Queries and commands
11.6 - Statistics
12 - Workload groups
12.1 - Query consistency policy
A workload group’s query consistency policy allows specifying options that control the consistency mode of queries.
The policy object
Each option consists of:
- A typed Value - the value of the limit.
- IsRelaxable - a boolean value that defines if the option can be relaxed by the caller, as part of the request’s request properties. Default is true.
The following limits are configurable:
Name | Type | Description | Supported values | Default value | Matching client request property |
---|---|---|---|---|---|
QueryConsistency | QueryConsistency | The consistency mode to use. | Strong, Weak, WeakAffinitizedByQuery, or WeakAffinitizedByDatabase | Strong | queryconsistency |
CachedResultsMaxAge | timespan | The maximum age of cached query results that can be returned. | A non-negative timespan | null | query_results_cache_max_age |
Example
"QueryConsistencyPolicy": {
"QueryConsistency": {
"IsRelaxable": true,
"Value": "Weak"
},
"CachedResultsMaxAge": {
"IsRelaxable": true,
"Value": "05:00:00"
}
}
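As a sketch, assuming a custom workload group named MyGroup already exists, the policy above could be applied with the .alter-merge workload_group command:
.alter-merge workload_group MyGroup '{"QueryConsistencyPolicy": {"QueryConsistency": {"IsRelaxable": true, "Value": "Weak"}, "CachedResultsMaxAge": {"IsRelaxable": true, "Value": "05:00:00"}}}'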
Monitoring
You can monitor the latency of the metadata snapshot age on nodes serving as weak consistency service heads by using the Weak consistency latency
metric. For more information, see Query metrics.
Related content
12.2 - Request limits policy
A workload group’s request limits policy allows limiting the resources used by the request during its execution.
The policy object
Each limit consists of:
- A typed Value - the value of the limit.
- IsRelaxable - a boolean value that defines if the limit can be relaxed by the caller, as part of the request’s request properties.
The following limits are configurable:
Property | Type | Description | Supported values | Matching client request property |
---|---|---|---|---|
DataScope | string | The query’s data scope. This value determines whether the query applies to all data or just the hot cache. | All , HotCache , or null | query_datascope |
MaxMemoryPerQueryPerNode | long | The maximum amount of memory (in bytes) a query can allocate. | [1 , 50% of a single node’s total RAM] | max_memory_consumption_per_query_per_node |
MaxMemoryPerIterator | long | The maximum amount of memory (in bytes) a query operator can allocate. | [1 , Min(32212254720 , 50% of a single node’s total RAM)] | maxmemoryconsumptionperiterator |
MaxFanoutThreadsPercentage | int | The percentage of threads on each node to fan out query execution to. When set to 100%, the cluster assigns all CPUs on each node. For example, 16 CPUs on a cluster deployed on Azure D14_v2 nodes. | [1 , 100 ] | query_fanout_threads_percent |
MaxFanoutNodesPercentage | int | The percentage of nodes on the cluster to fan out query execution to. Functions in a similar manner to MaxFanoutThreadsPercentage . | [1 , 100 ] | query_fanout_nodes_percent |
MaxResultRecords | long | The maximum number of records a request is allowed to return to the caller, beyond which the results are truncated. The truncation limit affects the final result of the query, as delivered back to the client. However, the truncation limit doesn’t apply to intermediate results of subqueries, such as those that result from having cross-cluster references. | [1 , 9223372036854775807 ] | truncationmaxrecords |
MaxResultBytes | long | The maximum data size (in bytes) a request is allowed to return to the caller, beyond which the results are truncated. The truncation limit affects the final result of the query, as delivered back to the client. However, the truncation limit doesn’t apply to intermediate results of subqueries, such as those that result from having cross-cluster references. | [1 , 9223372036854775807 ] | truncationmaxsize |
MaxExecutionTime | timespan | The maximum duration of a request. Notes: 1) This can be used to place more limits on top of the default limits on execution time, but not extend them. 2) Timeout processing isn’t at the resolution of seconds, rather it’s designed to prevent a query from running for minutes. 3) The time it takes to read the payload back at the client isn’t treated as part of the timeout. It depends on how quickly the caller pulls the data from the stream. 4) Total execution time can exceed the configured value if aborting execution takes longer to complete. | [00:00:00 , 01:00:00 ] | servertimeout |
CPU resource usage
Queries can use all the CPU resources within the cluster. By default, when multiple queries are running concurrently, the system employs a fair round-robin approach to distribute resources. This strategy is optimal for achieving high performance with ad-hoc queries.
However, there are scenarios where you might want to restrict the CPU resources allocated to a specific query. For instance, if you’re running a background job that can accommodate higher latencies. The request limits policy provides the flexibility to specify a lower percentage of threads or nodes to be used when executing distributed subquery operations. The default setting is 100%.
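For example, a caller can lower the fanout for a single background query through a client request property. The following sketch uses the StormEvents sample table:
// Use only a quarter of the threads on each node for this query.
set query_fanout_threads_percent = 25;
StormEvents
| summarize count() by State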
The default workload group
The default workload group has the following policy defined by default. This policy can be altered.
{
"DataScope": {
"IsRelaxable": true,
"Value": "All"
},
"MaxMemoryPerQueryPerNode": {
"IsRelaxable": true,
"Value": < 50% of a single node's total RAM >
},
"MaxMemoryPerIterator": {
"IsRelaxable": true,
"Value": 5368709120
},
"MaxFanoutThreadsPercentage": {
"IsRelaxable": true,
"Value": 100
},
"MaxFanoutNodesPercentage": {
"IsRelaxable": true,
"Value": 100
},
"MaxResultRecords": {
"IsRelaxable": true,
"Value": 500000
},
"MaxResultBytes": {
"IsRelaxable": true,
"Value": 67108864
},
"MaxExecutiontime": {
"IsRelaxable": true,
"Value": "00:04:00"
}
}
Example
The following JSON represents a custom requests limits policy object:
{
"DataScope": {
"IsRelaxable": true,
"Value": "HotCache"
},
"MaxMemoryPerQueryPerNode": {
"IsRelaxable": true,
"Value": 2684354560
},
"MaxMemoryPerIterator": {
"IsRelaxable": true,
"Value": 2684354560
},
"MaxFanoutThreadsPercentage": {
"IsRelaxable": true,
"Value": 50
},
"MaxFanoutNodesPercentage": {
"IsRelaxable": true,
"Value": 50
},
"MaxResultRecords": {
"IsRelaxable": true,
"Value": 1000
},
"MaxResultBytes": {
"IsRelaxable": true,
"Value": 33554432
},
"MaxExecutiontime": {
"IsRelaxable": true,
"Value": "00:01:00"
}
}
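As a sketch, such a policy could be applied to a hypothetical workload group (created beforehand) with the .alter-merge workload_group command:
.alter-merge workload_group ['Background Jobs'] '{"RequestLimitsPolicy": {"MaxFanoutThreadsPercentage": {"IsRelaxable": true, "Value": 50}, "MaxExecutionTime": {"IsRelaxable": true, "Value": "00:01:00"}}}'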
Related content
12.3 - Request queuing policy
A workload group’s request queuing policy controls queuing of requests for delayed execution, once a certain threshold of concurrent requests is exceeded.
Queuing of requests can reduce the number of throttling errors during times of peak activity. It does so by queuing incoming requests for up to a predefined short time period, while polling for available capacity during that period.
The policy can only be defined for workload groups with a request rate limit policy that limits the maximum number of concurrent requests at the scope of the workload group.
Use the .alter-merge workload group management command, to enable request queuing.
The policy object
The policy includes a single property:
- IsEnabled: A boolean indicating if the policy is enabled. The default value is false.
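As a sketch, queuing could be enabled for a hypothetical workload group that already defines a concurrent request rate limit:
.alter-merge workload_group ['Automated Requests'] '{"RequestQueuingPolicy": {"IsEnabled": true}}'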
Related content
12.4 - Request rate limit policy
The workload group’s request rate limit policy lets you limit the number of concurrent requests classified into the workload group, per workload group or per principal.
Rate limits are enforced at the level defined by the workload group’s Request rate limits enforcement policy.
The policy object
A request rate limit policy has the following properties:
Name | Supported values | Description |
---|---|---|
IsEnabled | true , false | Indicates if the policy is enabled or not. |
Scope | WorkloadGroup , Principal | The scope to which the limit applies. |
LimitKind | ConcurrentRequests , ResourceUtilization | The kind of the request rate limit. |
Properties | Property bag | Properties of the request rate limit. |
Concurrent requests rate limit
A request rate limit of kind ConcurrentRequests includes the following property:
Name | Type | Description | Supported Values |
---|---|---|---|
MaxConcurrentRequests | int | The maximum number of concurrent requests. | [0 , 10000 ] |
When a request exceeds the limit on maximum number of concurrent requests:
- The request’s state, as presented by System information commands, will be Throttled.
- The error message will include the origin of the throttling and the capacity that’s been exceeded.
The following table shows a few examples of concurrent requests that exceed the maximum limit and the error message that these requests return:
Scenario | Error message |
---|---|
A throttled .create table command that was classified to the default workload group, which has a limit of 80 concurrent requests at the scope of the workload group. | The management command was aborted due to throttling. Retrying after some backoff might succeed. CommandType: ‘TableCreate’, Capacity: 80, Origin: ‘RequestRateLimitPolicy/WorkloadGroup/default’. |
A throttled query that was classified to a workload group named MyWorkloadGroup , which has a limit of 50 concurrent requests at the scope of the workload group. | The query was aborted due to throttling. Retrying after some backoff might succeed. Capacity: 50, Origin: ‘RequestRateLimitPolicy/WorkloadGroup/MyWorkloadGroup’. |
A throttled query that was classified to a workload group named MyWorkloadGroup , which has a limit of 10 concurrent requests at the scope of a principal. | The query was aborted due to throttling. Retrying after some backoff might succeed. Capacity: 10, Origin: ‘RequestRateLimitPolicy/WorkloadGroup/MyWorkloadGroup/Principal/aaduser=9e04c4f5-1abd-48d4-a3d2-9f58615b4724;6ccf3fe8-6343-4be5-96c3-29a128dd9570’. |
- The HTTP response code will be 429. The subcode will be TooManyRequests.
- The exception type will be QueryThrottledException for queries, and ControlCommandThrottledException for management commands.
Resource utilization rate limit
A request rate limit of kind ResourceUtilization includes the following properties:
Name | Type | Description | Supported Values |
---|---|---|---|
ResourceKind | ResourceKind | The resource to limit.When ResourceKind is TotalCpuSeconds , the limit is enforced based on post-execution reports of CPU utilization of completed requests. Requests that report utilization of 0.005 seconds of CPU or lower aren’t counted. The limit (MaxUtilization ) represents the total CPU seconds that can be consumed by requests within a specified time window (TimeWindow ). For example, a user running ad-hoc queries may have a limit of 1000 CPU seconds per hour. If this limit is exceeded, subsequent queries will be throttled, even if started concurrently, as the cumulative CPU seconds have surpassed the defined limit within the sliding window period. | RequestCount , TotalCpuSeconds |
MaxUtilization | long | The maximum of the resource that can be utilized. | RequestCount: [1 , 16777215 ]; TotalCpuSeconds: [1 , 828000 ] |
TimeWindow | timespan | The sliding time window during which the limit is applied. | [00:00:01 , 01:00:00 ] |
When a request exceeds the limit on resource utilization:
- The request’s state, as presented by System information commands, will be Throttled.
- The error message will include the origin of the throttling and the quota that’s been exceeded.
The following table shows a few examples of requests that exceed the resource utilization rate limit and the error message that these requests return:
Scenario | Error message |
---|---|
A throttled request that was classified to a workload group named Automated Requests , which has a limit of 1000 requests per hour at the scope of a principal. | The request was denied due to exceeding quota limitations. Resource: ‘RequestCount’, Quota: ‘1000’, TimeWindow: ‘01:00:00’, Origin: ‘RequestRateLimitPolicy/WorkloadGroup/Automated Requests/Principal/aadapp=9e04c4f5-1abd-48d4-a3d2-9f58615b4724;6ccf3fe8-6343-4be5-96c3-29a128dd9570’. |
A throttled request, that was classified to a workload group named Automated Requests , which has a limit of 2000 total CPU seconds per hour at the scope of the workload group. | The request was denied due to exceeding quota limitations. Resource: ‘TotalCpuSeconds’, Quota: ‘2000’, TimeWindow: ‘01:00:00’, Origin: ‘RequestRateLimitPolicy/WorkloadGroup/Automated Requests’. |
- The HTTP response code will be 429. The subcode will be TooManyRequests.
- The exception type will be QuotaExceededException.
How consistency affects rate limits
With strong consistency, the default limit on maximum concurrent requests depends on the SKU of the cluster, and is calculated as: Cores-Per-Node x 10. For example, a cluster that’s set up with Azure D14_v2 nodes, where each node has 16 vCores, will have a default limit of 16 x 10 = 160.
With weak consistency, the effective default limit on maximum concurrent requests depends on the SKU of the cluster and number of query heads, and is calculated as: Cores-Per-Node x 10 x Number-Of-Query-Heads. For example, a cluster that’s set up with Azure D14_v2 nodes and 5 query heads, where each node has 16 vCores, will have an effective default limit of 16 x 10 x 5 = 800.
For more information, see Query consistency.
The default workload group
The default workload group has the following policy defined by default. This policy can be altered.
[
{
"IsEnabled": true,
"Scope": "WorkloadGroup",
"LimitKind": "ConcurrentRequests",
"Properties": {
"MaxConcurrentRequests": < Cores-Per-Node x 10 >
}
}
]
Examples
The following policies allow up to:
- 500 concurrent requests for the workload group.
- 25 concurrent requests per principal.
- 50 requests per principal per hour.
[
{
"IsEnabled": true,
"Scope": "WorkloadGroup",
"LimitKind": "ConcurrentRequests",
"Properties": {
"MaxConcurrentRequests": 500
}
},
{
"IsEnabled": true,
"Scope": "Principal",
"LimitKind": "ConcurrentRequests",
"Properties": {
"MaxConcurrentRequests": 25
}
},
{
"IsEnabled": true,
"Scope": "Principal",
"LimitKind": "ResourceUtilization",
"Properties": {
"ResourceKind": "RequestCount",
"MaxUtilization": 50,
"TimeWindow": "01:00:00"
}
}
]
The following policies will block all requests classified to the workload group:
[
{
"IsEnabled": true,
"Scope": "WorkloadGroup",
"LimitKind": "ConcurrentRequests",
"Properties": {
"MaxConcurrentRequests": 0
}
}
]
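As a sketch, a set of request rate limit policies could be applied to a hypothetical workload group with the .alter-merge workload_group command:
.alter-merge workload_group ['Automated Requests'] '{"RequestRateLimitPolicies": [{"IsEnabled": true, "Scope": "Principal", "LimitKind": "ConcurrentRequests", "Properties": {"MaxConcurrentRequests": 25}}]}'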
Related content
12.5 - Request rate limits enforcement policy
A workload group’s request rate limits enforcement policy controls how request rate limits are enforced.
The policy object
A request rate limit policy has the following properties:
Name | Supported values | Default value | Description |
---|---|---|---|
QueriesEnforcementLevel | Cluster , QueryHead | QueryHead | Indicates the enforcement level for queries. |
CommandsEnforcementLevel | Cluster , Database | Database | Indicates the enforcement level for commands. |
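As a sketch, the enforcement levels could be set on the default workload group as follows:
.alter-merge workload_group default '{"RequestRateLimitsEnforcementPolicy": {"QueriesEnforcementLevel": "QueryHead", "CommandsEnforcementLevel": "Database"}}'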
Request rate limits enforcement level
Request rate limits can be enforced at one of the following levels:
Cluster:
- Rate limits are enforced by the single cluster admin node.
Database:
- Rate limits are enforced by the database admin node that manages the database the request was sent to.
- If there are multiple database admin nodes, the configured rate limit is effectively multiplied by the number of database admin nodes.
QueryHead:
- Rate limits for queries are enforced by the query head node that the query was routed to.
- This option affects queries that are sent with either strong or weak query consistency.
- Strongly consistent queries run on the database admin node, and the configured rate limit is effectively multiplied by the number of database admin nodes.
- For weakly consistent queries, the configured rate limit is effectively multiplied by the number of query head nodes.
- This option doesn’t apply to management commands.
Examples
Setup
The cluster has 10 nodes as follows:
- one cluster admin node.
- two database admin nodes (each manages 50% of the cluster’s databases).
- 50% of the tail nodes (5 out of 10) can serve as query heads for weakly consistent queries.
The default workload group is defined with the following policies:
"RequestRateLimitPolicies": [
{
"IsEnabled": true,
"Scope": "WorkloadGroup",
"LimitKind": "ConcurrentRequests",
"Properties": {
"MaxConcurrentRequests": 200
}
}
],
"RequestRateLimitsEnforcementPolicy": {
"QueriesEnforcementLevel": "QueryHead",
"CommandsEnforcementLevel": "Database"
}
Effective rate limits
The effective rate limits for the default workload group are:
- The maximum number of concurrent cluster-scoped management commands is 200.
- The maximum number of concurrent database-scoped management commands is 2 (database admin nodes) x 200 (max per admin node) = 400.
- The maximum number of concurrent strongly consistent queries is 2 (database admin nodes) x 200 (max per admin node) = 400.
- The maximum number of concurrent weakly consistent queries is 5 (query heads) x 200 (max per query head) = 1000.
Related content
12.6 - Workload groups
Workload groups allow you to group together sets of management commands and queries based on shared characteristics, and apply policies to control per-request limits and request rate limits for each of these groups.
Together with workload group policies, workload groups serve as a resource governance system for incoming requests to the cluster. When a request is initiated, it gets classified into a workload group. The classification is based on a user-defined function defined as part of a request classification policy. The request follows the policies assigned to the designated workload group throughout its execution.
Workload groups are defined at the cluster level, and up to 10 custom groups can be defined in addition to the three built-in workload groups.
Use cases for custom workload groups
The following list covers some common use cases for creating custom workload groups:
Protect against runaway queries: Create a workload group with a requests limits policy to set restrictions on resource usage and parallelism during query execution. For example, this policy can regulate result set size, memory per iterator, memory per node, execution time, and CPU resource usage.
Control the rate of requests: Create a workload group with a request rate limit policy to manage the behavior of concurrent requests from a specific principal or application. This policy can restrict the number of concurrent requests, request count within a time period, and total CPU seconds per time period. While your cluster comes with default limits, such as query limits, you have the flexibility to adjust these limits based on your requirements.
Create shared environments: Imagine a scenario where you have 3 different customer teams running queries and commands on a shared cluster, possibly even accessing shared databases. If you’re billing these teams based on their resource usage, you can create three distinct workload groups, each with unique limits. These workload groups would allow you to effectively manage and monitor the resource usage of each customer team.
Monitor resources utilization: Workload groups can help you create periodic reports on the resource consumption of a given principal or application. For instance, if these principals represent different clients, such reports can facilitate accurate billing. For more information, see Monitor requests by workload group.
Create and manage workload groups
Use the following commands to manage workload groups and their policies:
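The individual commands are covered in the workload group commands section. As a minimal sketch, common operations might look like the following (the group name and policy values are hypothetical):
// List all workload groups and their policies.
.show workload_groups

// Create or update a custom workload group.
.create-or-alter workload_group ['Ad-hoc queries'] '{"RequestLimitsPolicy": {"MaxResultRecords": {"IsRelaxable": true, "Value": 100000}}}'

// Remove a custom workload group.
.drop workload_group ['Ad-hoc queries']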
Workload group policies
The following policies can be defined per workload group:
- Request limits policy
- Request rate limit policy
- Request rate limits enforcement policy
- Request queuing policy
- Query consistency policy
Built-in workload groups
The pre-defined workload groups are:
Default workload group
Requests are classified into the default group under these conditions:
- There are no criteria to classify a request.
- An attempt was made to classify the request into a non-existent group.
- A general classification failure has occurred.
You can:
- Change the criteria used for routing these requests.
- Change the policies that apply to the default workload group.
- Classify requests into the default workload group.
To monitor what gets classified to the default workload group, see Monitor requests by workload group.
Internal workload group
The internal workload group is populated with requests that are for internal use only.
You can’t:
- Change the criteria used for routing these requests.
- Change the policies that apply to the internal workload group.
- Classify requests into the internal workload group.
To monitor what gets classified to the internal workload group, see Monitor requests by workload group.
Materialized views workload group
The $materialized-views workload group applies to the materialized views materialization process. For more information on how materialized views work, see Materialized views overview.
You can change the following values in the workload group’s request limits policy:
- MaxMemoryPerQueryPerNode
- MaxMemoryPerIterator
- MaxFanoutThreadsPercentage
- MaxFanoutNodesPercentage
Monitor requests by workload group
System commands indicate the workload group into which a request was classified. You can use these commands to aggregate resources utilization by workload group for completed requests.
The same information can also be viewed and analyzed in Azure Monitor insights.
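For example, a sketch that aggregates completed requests from the last day by workload group, assuming the output includes the WorkloadGroup and TotalCpu columns:
.show commands-and-queries
| where StartedOn > ago(1d)
| summarize RequestCount = count(), TotalCpu = sum(TotalCpu) by WorkloadGroup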
Related content
12.7 - Request classification policy
12.7.1 - Request classification policy
The classification process assigns incoming requests to a workload group, based on the characteristics of the requests. Tailor the classification logic by writing a user-defined function, as part of a cluster-level request classification policy.
In the absence of an enabled request classification policy, all requests are classified into the default workload group.
Policy object
The policy has the following properties:
- IsEnabled: bool - Indicates if the policy is enabled or not.
- ClassificationFunction: string - The body of the function to use for classifying requests.
Classification function
The classification of incoming requests is based on a user-defined function. The results of the function are used to classify requests into existing workload groups.
The user-defined function has the following characteristics and behaviors:
- If IsEnabled is set to true in the policy, the user-defined function is evaluated for every new request.
- The user-defined function gives workload group context for the request for the full lifetime of the request.
- The request is given the default workload group context in the following situations:
  - The user-defined function returns an empty string, default, or the name of a nonexistent workload group.
  - The function fails for any reason.
- Only one user-defined function can be designated at any given time.
Requirements and limitations
A classification function:
- Must return a single scalar value of type string. That is the name of the workload group to assign the request to.
- Must not reference any other entity (database, table, or function).
  - Specifically, it might not use the following functions and operators: cluster(), database(), table(), external_table(), externaldata.
- Has access to a special dynamic symbol, a property bag named request_properties, with the following properties:
Name | Type | Description | Examples |
---|---|---|---|
current_database | string | The name of the request database. | "MyDatabase" |
current_application | string | The name of the application that sent the request. | "Kusto.Explorer" , "KusWeb" |
current_principal | string | The fully qualified name of the principal identity that sent the request. | "aaduser=1793eb1f-4a18-418c-be4c-728e310c86d3;83af1c0e-8c6d-4f09-b249-c67a2e8fda65" |
query_consistency | string | For queries: the consistency of the query - strongconsistency or weakconsistency . This property is set by the caller as part of the request’s request properties: The client request property to set is: queryconsistency . | "strongconsistency" , "weakconsistency" |
request_description | string | Custom text that the author of the request can include. The text is set by the caller as part of the request’s Client request properties: The client request property to set is: request_description . | "Some custom description" ; automatically populated for dashboards: "dashboard:{dashboard_id};version:{version};sourceId:{source_id};sourceType:{tile/parameter}" |
request_text | string | The obfuscated text of the request. Obfuscated string literals included in the query text are replaced by multiple of star (* ) characters. Note: only the leading 65,536 characters of the request text are evaluated. | ".show version" |
request_type | string | The type of the request - Command or Query . | "Command" , "Query" |
Examples
A single workload group
iff(request_properties.current_application == "Kusto.Explorer" and request_properties.request_type == "Query",
"Ad-hoc queries",
"default")
Multiple workload groups
case(current_principal_is_member_of('aadgroup=somesecuritygroup@contoso.com'), "First workload group",
request_properties.current_database == "MyDatabase" and request_properties.current_principal has 'aadapp=', "Second workload group",
request_properties.current_application == "Kusto.Explorer" and request_properties.request_type == "Query", "Third workload group",
request_properties.current_application == "Kusto.Explorer", "Third workload group",
request_properties.current_application == "KustoQueryRunner", "Fourth workload group",
request_properties.request_description == "this is a test", "Fifth workload group",
hourofday(now()) between (17 .. 23), "Sixth workload group",
"default")
Management commands
Use the following management commands to manage a cluster’s request classification policy.
Command | Description |
---|---|
.alter cluster request classification policy | Alters cluster’s request classification policy |
.alter-merge cluster request classification policy | Enables or disables a cluster’s request classification policy |
.delete cluster request classification policy | Deletes the cluster’s request classification policy |
.show cluster request classification policy | Shows the cluster’s request classification policy |
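As a sketch, a classification function can be enabled together with the policy object in a single command; the workload group name follows the earlier example:
.alter cluster policy request_classification '{"IsEnabled": true}' <|
    iff(request_properties.current_application == "Kusto.Explorer" and request_properties.request_type == "Query",
        "Ad-hoc queries",
        "default")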
Related content
12.8 - Workload group commands
13 - Management commands overview
This article describes the management commands, also known as control commands, used to manage Kusto. Management commands are requests to the service to retrieve information that isn’t necessarily data in the database tables, or to modify the service state.
Differentiating management commands from queries
Kusto uses three mechanisms to differentiate queries and management commands: at the language level, at the protocol level, and at the API level. This is done for security purposes.
At the language level, the first character of the text of a request determines if the request is a management command or a query. Management commands must start with the dot (.) character, and no query may start with that character.
At the protocol level, different HTTP/HTTPS endpoints are used for control commands as opposed to queries.
At the API level, different functions are used to send management commands as opposed to queries.
Combining queries and management commands
Management commands can reference queries (but not vice-versa) or other management commands. There are several supported scenarios:
- AdminThenQuery: A management command is executed, and its result (represented as a temporary data table) serves as the input to a query.
- AdminFromQuery: Either a query or a .show admin command is executed, and its result (represented as a temporary data table) serves as the input to a management command.
Note that in all cases, the entire combination is technically a management command, not a query, so the text of the request must start with a dot (.) character, and the request must be sent to the management endpoint of the service.
Also note that query statements appear within the query part of the text (they can’t precede the command itself).
AdminThenQuery is indicated in one of two ways:
- By using a pipe (|) character; the query then treats the results of the management command as if it were any other data-producing query operator.
- By using a semicolon (;) character, which introduces the results of the management command into a special symbol called $command_results, that one may then use in the query any number of times.
For example:
// 1. Using pipe: Count how many tables are in the database-in-scope:
.show tables
| count
// 2. Using semicolon: Count how many tables are in the database-in-scope:
.show tables;
$command_results
| count
// 3. Using semicolon, and including a let statement:
.show tables;
let useless=(n:string){strcat(n,'-','useless')};
$command_results | extend LastColumn=useless(TableName)
AdminFromQuery is indicated by the <| character combination. For example, the following first executes a query that produces a table with a single column (named str of type string) and a single row, and writes it as a table named MyTable in the database in context:
.set MyTable <|
let text="Hello, World!";
print str=text