Schema
- 1: Entity groups
- 2: Graph models in Azure Data Explorer - Overview and usage
- 3: Graph snapshots overview
- 4: Persistent graphs overview
- 5: Avrotize k2a tool
- 6: Best practices for schema management
- 7: Databases management
- 8: Columns
- 9: External tables
- 9.1: Azure SQL external tables
- 9.1.1: Create and alter Azure SQL external tables
- 9.1.2: Query SQL external tables
- 9.1.3: Use row-level security with Azure SQL external tables
- 9.2: Azure Storage external tables
- 10: Functions
- 11: Ingestion mappings
- 11.1: AVRO Mapping
- 11.2: CSV Mapping
- 11.3: Ingestion mappings
- 11.4: JSON Mapping
- 11.5: ORC Mapping
- 11.6: Parquet Mapping
- 11.7: W3CLOGFILE Mapping
- 12: Manage external table mappings
- 13: Materialized views
- 13.1: Materialized views
- 13.2: Materialized views data purge
- 13.3: Materialized views limitations
- 13.4: Materialized views policies
- 13.5: Materialized views use cases
- 13.6: Monitor materialized views
- 14: Stored query results
- 14.1: Stored query results
- 15: Tables
- 15.1: Avrotize k2a tool
- 15.2: Tables management
1 - Entity groups
Entity groups are named entities stored in a database that the macro-expand query operator can reference.
Storing an entity group in the database, instead of providing its value in the query text itself, makes it easier to manage these objects.
Management commands
| Function | Description |
|---|---|
| .alter entity_group | Alters an existing entity group and stores it inside the database metadata. |
| .alter-merge entity_group | Alters and merges the value of an existing entity group. |
| .create entity_group | Creates a stored entity group. |
| .drop entity_group | Drops an entity group from the database. |
| .show entity_group(s) | Lists all the stored entity groups, or a specific entity group, in the current database. |
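As a hedged, minimal sketch of how these commands fit together (the cluster URIs, database names, and MyTable are placeholders, and referencing the stored entity group by name in macro-expand is an assumption; see the macro-expand operator reference for the exact form), you might create a stored entity group and then query across it:

.create entity_group MyEntityGroup (cluster('c1.kusto.windows.net').database('DB1'), cluster('c2.kusto.windows.net').database('DB2'))

macro-expand MyEntityGroup as X
(
    X.MyTable | count
)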
Related content
2 - Graph models in Azure Data Explorer - Overview and usage
Graph models in Azure Data Explorer enable you to define, manage, and efficiently query persistent graph structures within your database. Unlike transient graphs created using the make-graph operator, graph models are stored representations that can be queried repeatedly without rebuilding the graph for each query, significantly improving performance for complex relationship-based analysis.
Overview
A graph model is a database object that represents a labeled property graph (LPG) within Azure Data Explorer. It consists of nodes, also called vertices, and edges, also called relationships. Both nodes and edges can have properties that describe them. The model defines the schema of the graph, including node and edge types with their properties. It also defines the process for constructing the graph from tabular data stored in KQL Database tables and from tabular expressions.
Key characteristics
Graph models offer:
- Metadata persistence: Store graph specifications in database metadata for durability and reusability
- Materialized snapshots: Eliminate the need to rebuild graphs for each query, dramatically improving query performance
- Schema definition: Support optional but recommended defined schemas for nodes and edges, ensuring data consistency
- Deep KQL integration: Seamlessly integrate with graph semantics
- Optimized traversals: Include specialized indexing for efficient graph traversal operations, making complex pattern matching and path-finding queries significantly faster
When to use graph models
Graph models provide significant advantages for relationship-based analysis but require additional setup compared to ad-hoc graph queries. Consider using graph models when:
- Performance is critical: You repeatedly run graph queries on the same underlying data and need optimized performance
- Complex relationship data: You have data with many interconnected relationships that benefit from a graph representation
- Stable structure: Your graph structure is relatively stable, with periodic but not constant updates
- Advanced graph operations: You need to perform complex traversals, path finding, pattern matching, or community detection on your data
- Consistent schema: Your graph analysis requires a well-defined structure with consistent node and edge types
For simpler, one-time graph analysis on smaller datasets, the make-graph operator might be more appropriate.
Graph model components
A graph model consists of two main components:
Schema (optional)
The schema defines the structure and properties of nodes and edges in the graph model. While optional, the schema serves several important purposes:
- Type safety: Schema properties define the expected data types for node and edge properties, ensuring type consistency during graph queries
- Property validation: All properties defined in the schema become valid properties for nodes/edges with the corresponding labels, regardless of whether these properties appear in the step query columns
- Query compatibility: Schema properties can be safely referenced in graph-match queries without type collisions with step query columns
Schema structure
- Nodes: Defines node label types and their typed properties (e.g., "Person": {"Name": "string", "Age": "long"})
- Edges: Defines edge label types and their typed properties (e.g., "WORKS_AT": {"StartDate": "datetime", "Position": "string"})
Definition
The Definition specifies how to build the graph from tabular data through a series of sequential operations. This section is the core of the graph model, as it transforms your relational data into a graph structure.
Key characteristics of the Definition:
Sequential execution: Steps are executed in the exact order they appear in the Definition array. This order is critical because:
- Nodes must typically be created before edges that reference them
- Later steps can build upon or modify the results of earlier steps
- The sequence affects performance and memory usage during graph construction
Incremental construction: Each step adds to the graph being built, allowing you to:
- Combine data from multiple tables or sources
- Apply different logic for different types of nodes or edges
- Build complex graph structures incrementally
Step types:
AddNodes: Steps that define how to create nodes from tabular data
- Can be used multiple times to add different types of nodes
- Each step can pull from different data sources or apply different filters
- Node properties are derived from the columns in the query result
AddEdges: Steps that define how to create edges from tabular data
- Can reference nodes that don’t yet exist (the system will create placeholder nodes and update them when AddNodes steps are processed later)
- Can create relationships between nodes from the same or different AddNodes steps
- Edge properties are derived from the columns in the query result
- While it’s possible to add edges before nodes, it’s recommended to add nodes first for better readability and understanding
Execution flow example:
Step 1 (AddNodes): Create Person nodes from Employees table
Step 2 (AddNodes): Create Company nodes from Organizations table
Step 3 (AddEdges): Create WORKS_AT edges between Person and Company nodes
Step 4 (AddEdges): Create KNOWS edges between Person nodes
This sequential approach ensures that when Step 3 creates WORKS_AT edges, both the Person nodes (from Step 1) and Company nodes (from Step 2) already exist in the graph.
Labels in Graph models
Labels are critical identifiers that categorize nodes and edges in the graph, enabling efficient filtering and pattern matching. Azure Data Explorer graph models support two complementary types of labels:
| Label type | Description | Usage |
|---|---|---|
| Static labels | Fixed labels assigned to specific node or edge types during graph model definition. All instances of a type share the same static labels. Defined explicitly in the Schema section and referenced in the “Labels” array in AddNodes and AddEdges steps. | Use for categorizing nodes by type, such as “User”, “Device”, or “Service”. Ideal for well-known, stable entity and relationship types. |
| Dynamic labels | Labels derived from data properties during graph construction using the LabelsColumn parameter. Values can vary per instance based on actual data. Not predefined in the Schema section. Can be a single label (string column) or multiple labels (dynamic array column). | Use for data-driven categorization, such as environment labels (“Production”, “Staging”) or status indicators. Useful for systems where node/edge types evolve over time. |
Definition steps in detail
The Definition section of a graph model contains steps that define how to construct the graph from tabular data. Each step has specific parameters depending on its kind.
AddNodes steps
AddNodes steps define how to create nodes in the graph from tabular data:
| Parameter | Required | Description |
|---|---|---|
| Kind | Yes | Must be set to “AddNodes” |
| Query | Yes | A KQL query that retrieves the data for nodes. The query result must include all columns required for node properties and identifiers |
| NodeIdColumn | Yes | The column from the query result used as the unique identifier for each node |
| Labels | No | An array of static label names defined in the Schema section to apply to these nodes |
| LabelsColumn | No | A column from the query result that provides dynamic labels for each node. Can be a string column (single label) or dynamic array column (multiple labels) |
AddEdges steps
AddEdges steps define how to create relationships between nodes in the graph:
| Parameter | Required | Description |
|---|---|---|
| Kind | Yes | Must be set to “AddEdges” |
| Query | Yes | A KQL query that retrieves the data for edges. The query result must include source and target node identifiers and any edge properties |
| SourceColumn | Yes | The column from the query result that contains the source node identifiers |
| TargetColumn | Yes | The column from the query result that contains the target node identifiers |
| Labels | No | An array of static label names defined in the Schema section to apply to these edges |
| LabelsColumn | No | A column from the query result that provides dynamic labels for each edge. Can be a string column (single label) or dynamic array column (multiple labels) |
Graph model examples
Basic example with both static and dynamic labels
The following example creates a professional network graph model that combines static schema definitions with dynamic labeling:
.create-or-alter graph_model ProfessionalNetwork ```
{
"Schema": {
"Nodes": {
"Person": {"Name": "string", "Age": "long", "Title": "string"},
"Company": {"Name": "string", "Industry": "string", "FoundedYear": "int"}
},
"Edges": {
"WORKS_AT": {"StartDate": "datetime", "Position": "string", "Department": "string"},
"KNOWS": {"ConnectionDate": "datetime", "ConnectionStrength": "int"}
}
},
"Definition": {
"Steps": [
{
"Kind": "AddNodes",
"Query": "Employees | project Id, Name, Age, Title, NodeType",
"NodeIdColumn": "Id",
"Labels": ["Person"],
"LabelsColumn": "NodeType"
},
{
"Kind": "AddNodes",
"Query": "Organizations | project Id, Name, Industry, FoundedYear",
"NodeIdColumn": "Id",
"Labels": ["Company"]
},
{
"Kind": "AddEdges",
"Query": "EmploymentRecords | project EmployeeId, CompanyId, StartDate, Position, Department",
"SourceColumn": "EmployeeId",
"TargetColumn": "CompanyId",
"Labels": ["WORKS_AT"]
},
{
"Kind": "AddEdges",
"Query": "Connections | project PersonA, PersonB, ConnectionDate, ConnectionType, ConnectionStrength",
"SourceColumn": "PersonA",
"TargetColumn": "PersonB",
"Labels": ["KNOWS"],
"LabelsColumn": "ConnectionType"
}
]
}
}
```
This model would enable queries such as finding colleagues connected through multiple degrees of separation, identifying people working in the same industry, or analyzing organizational relationships.
Creating and managing Graph models
Azure Data Explorer provides a comprehensive set of management commands for working with graph models throughout their lifecycle.
Command summary
| Command | Purpose | Key parameters |
|---|---|---|
| .create-or-alter graph_model | Create a new graph model or modify an existing one | Database, Name, Schema, Definition |
| .drop graph_model | Remove a graph model | Database, Name |
| .show graph_models | List available graph models | Database [optional] |
Graph model lifecycle
A typical workflow for managing graph models involves:
- Development - Create an initial graph model with a schema and definition that maps to your data
- Validation - Query the model to verify correct structure and expected results
- Maintenance - Periodically update the model as your data structure evolves
- Snapshot management - Create and retire snapshots to balance performance and freshness
Graph snapshots
Graph snapshots are database entities that represent instances of graph models at specific points in time. While a graph model defines the structure and data sources for a graph, a snapshot is the actual materialized graph that can be queried.
Key aspects of graph snapshots:
- Each snapshot is linked to a specific graph model
- A single graph model can have multiple snapshots associated with it
- Snapshots are created with the .make graph_snapshot command
- Snapshots include metadata such as creation time and the source graph model
- Snapshots enable querying the graph as it existed at a specific point in time
For more detailed information about working with graph snapshots, see Graph snapshots overview.
Querying Graph models
Graph models are queried using the graph() function, which provides access to the graph entity. This function supports retrieving either the most recent snapshot of the graph or creating the graph at query time if snapshots aren’t available.
Basic query structure
graph("GraphModelName")
| graph-match <pattern>
where <filters>
project <output fields>
Query examples
1. Basic node-edge-node pattern
// Find people who commented on posts by employees in the last week
graph("SocialNetwork")
| graph-match (person)-[comments]->(post)<-[authored]-(employee)
where person.age > 30
and comments.createTime > ago(7d)
project person.name, post.title, employee.userName
2. Multiple relationship patterns
// Find people who both work with and are friends with each other
graph("ProfessionalNetwork")
| graph-match (p1)-[worksWith]->(p2)-[friendsWith]->(p1)
where labels(worksWith) has "WORKS_WITH" and labels(friendsWith) has "FRIENDS_WITH" and
labels(p1) has "Person" and labels(p2) has "Person"
project p1.name, p2.name, p1.department
3. Variable-length paths
// Find potential influence paths up to 3 hops away
graph("InfluenceNetwork")
| graph-match (influencer)-[influences*1..3]->(target)
where influencer.id == "user123" and all(influences, labels() has "INFLUENCES")
project influencePath = influences,
pathLength = array_length(influences),
target.name
The graph() function provides a consistent way to access graph data without needing to explicitly construct the graph for each query.
Frequently Asked Questions
Who is responsible for refreshing the graph?
Users or processes must refresh the graph themselves. Initially, no automatic refresh policies exist for new graph entities. However, the graph remains queryable even while a snapshot is being created or before one has been created.
How can a graph be refreshed?
To refresh a graph:
- Create a new snapshot using an asynchronous operation (.make graph_snapshot)
- Once created, incoming graph queries automatically use the new snapshot
- Optional: Drop the old snapshot to free up resources (.drop graph_snapshot), as illustrated in the sketch after this list
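Here is a hedged sketch of one refresh cycle. The model and snapshot names (MyGraphModel, DailySnapshot, PreviousSnapshot) are placeholders, and the argument form of the drop command is an assumption; check the .make graph_snapshot and .drop graph_snapshot command references for the precise syntax.

// Build a fresh snapshot asynchronously from the model definition
.make async graph_snapshot DailySnapshot from MyGraphModel

// Once the new snapshot is active, drop the previous one to free resources
// (assumed Model.Snapshot naming; verify against the .drop graph_snapshot reference)
.drop graph_snapshot MyGraphModel.PreviousSnapshot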
What if different steps create duplicate edges or nodes?
The Definition steps execute sequentially, and duplicate handling differs between nodes and edges:
Edges: Duplicates remain as duplicates by default since edges don’t have unique identifiers. If multiple steps create identical source-target relationships, each one becomes a separate edge in the graph. This behavior is intentional as multiple relationships between the same nodes can represent different interactions or events over time.
Nodes: “Duplicates” are automatically merged based on the NodeIdColumn value - the system assumes they represent the same entity. When multiple steps create nodes with the same identifier:
- All properties from different steps are combined into a single node
- If there are conflicting property values for the same property name, the value from the step that executed last takes precedence
- Properties that exist in one step but not another are preserved
This merge behavior allows you to build nodes incrementally across steps, such as adding basic information in one step and enriching with additional properties in subsequent steps.
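For illustration, a hedged sketch of two AddNodes steps that share the same NodeIdColumn; the table names, columns, and queries are hypothetical. The second step enriches the Person nodes created by the first, and because it runs last, its values win any property-name conflicts:

{
  "Kind": "AddNodes",
  "Query": "Employees | project Id, Name, Age",
  "NodeIdColumn": "Id",
  "Labels": ["Person"]
},
{
  "Kind": "AddNodes",
  "Query": "HrRecords | project Id, Department, Manager",
  "NodeIdColumn": "Id",
  "Labels": ["Person"]
}

A Person node with a given Id ends up with the Name, Age, Department, and Manager properties combined from both steps.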
How do graph models handle schema changes?
When the schema of your underlying data changes:
- Alter your graph model using the .create-or-alter graph_model command to update its schema or definition
- To materialize these changes, create a new snapshot
- Older snapshots remain accessible until explicitly dropped
Can I query across multiple graph models?
Yes, you can query multiple graph models within a single query using composition:
- Use the output of one graph() operator as input to another graph() operator
- Process and transform results from one graph before feeding into another graph query
- Chain multiple graph operations for cross-domain analysis without creating a unified model
Example:
// Query the first graph model
graph("EmployeeNetwork")
| graph-match (person)-[manages]->(team)
where labels(manages) has "MANAGES" and labels(person) has "Employee"
project manager=person.name, teamId=team.id
// Use these results to query another graph model
| join (
graph("ProjectNetwork")
| graph-match (project)-[assignedTo]->(team)
where labels(assignedTo) has "ASSIGNED_TO"
project projectName=project.name, teamId=team.id
) on teamId
What’s the difference between labels and properties?
- Labels: Categorize nodes and edges for structural pattern matching
- Properties: Store data values associated with nodes and edges (used in filtering and output)
Related content
3 - Graph snapshots overview
A graph snapshot is a database entity that represents a materialized instance of a graph model at a specific point in time. While a graph model defines the structure and data sources, a snapshot is the queryable graph implementation.
Overview
Graph snapshots provide:
- Model linkage: Connected to a specific graph model
- Point-in-time materialization: Represents the graph state at creation time
- Persistence: Stored in the database until explicitly dropped
- Direct querying: Enables queries without rebuilding the graph
- Metadata storage: Contains creation time and model information
Multiple snapshots from the same graph model enable historical analysis and temporal comparison of graph data.
Graph snapshot structure
Each graph snapshot contains two primary components:
Metadata
- Name: Unique snapshot identifier
- SnapshotTime: Creation timestamp
- Model information:
- ModelName: Source graph model name
- ModelVersion: Model version at snapshot creation
- ModelCreationTime: Source model creation timestamp
Graph data
- Nodes: Materialized nodes from the model’s AddNodes operations
- Edges: Materialized relationships from the model’s AddEdges operations
- Properties: Node and edge properties as defined in the model
Example snapshot configuration
{
"Metadata": {
"Name": "UserInteractionsSnapshot",
"SnapshotTime": "2025-04-28T10:15:30Z"
},
"ModelInformation": {
"ModelName": "SocialNetworkGraph",
"ModelVersion": "v1.2",
"ModelCreationTime": "2025-04-15T08:20:10Z"
}
}
Management commands
Use these commands to manage graph snapshots:
| Command | Purpose |
|---|---|
| .make graph_snapshot | Create a snapshot from an existing graph model |
| .drop graph_snapshot | Remove a snapshot from the database |
| .show graph_snapshots | List available snapshots in the database |
Querying snapshots
Query graph snapshots using the graph() function:
Query the latest snapshot
graph("SocialNetworkGraph")
| graph-match (person)-[knows]->(friend)
where person.age > 30
project person.name, friend.name
Query a specific snapshot
graph("SocialNetworkGraph", "UserInteractionsSnapshot")
| graph-match (person)-[knows]->(friend)
where person.age > 30
project person.name, friend.name
For advanced pattern matching and traversals, see Graph operators.
Monitoring and diagnostics commands
Use these commands to analyze the snapshot build process:
| Command | Purpose |
|---|---|
| .show graph_snapshot statistics | Display detailed statistics for a specific snapshot, including performance metrics and resource utilization |
| .show graph_snapshots statistics | Display statistics for all snapshots of a graph model |
| .show graph_snapshots failures | Display information about failed snapshot creation attempts with failure reasons and troubleshooting details |
These commands provide insights into:
- Performance analysis: CPU time, memory usage, and duration metrics for snapshot creation
- Resource monitoring: Peak memory consumption and processing efficiency
- Build process details: Step-by-step breakdown of the snapshot creation process
- Failure investigation: Detailed error information and retry behavior analysis
- Troubleshooting: Operation IDs for deep-dive analysis and failure classification
Key benefits
Graph snapshots provide:
- Enhanced performance: Eliminates graph rebuilding for each query
- Data consistency: Ensures all queries operate on identical graph state
- Temporal analysis: Enables historical comparison across time periods
- Resource optimization: Reduces CPU and memory consumption for repeated operations
Related content
4 - Persistent graphs overview
Persistent graphs enable you to store, manage, and query graph data structures at scale. Unlike transient graphs created with the make-graph operator, persistent graphs are durable database objects that persist beyond individual query executions, providing enterprise-grade graph analytics capabilities.
Overview
Persistent graphs consist of two primary components:
- Graph models: Define the structure and schema of your graph
- Graph snapshots: Persistent instances of graph models that you can query
This architecture provides both flexibility in defining graph schemas and efficiency in querying graph data at scale.
Key benefits
Persistent graphs offer significant advantages for enterprise-scale graph analytics:
- Durable storage: Graph models and snapshots persist in database metadata for long-term availability
- Scalability: Handle large graphs that exceed single-node memory limitations
- Reusability: Multiple users and applications can query the same graph structure without reconstruction
- Performance optimization: Eliminate graph construction overhead for repeated queries
- Schema support: Structured definitions for different node and edge types with their properties
- Version control: Multiple snapshots enable representation of graphs at different points in time
Graph models
A graph model defines the specifications of a graph stored in your database metadata. It includes:
- Schema definition: Node and edge types with their properties
- Data source mappings: Instructions for building the graph from tabular data
- Labels: Both static (predefined) and dynamic (generated at runtime) labels for nodes and edges
Graph models contain the blueprint for creating graph snapshots, not the actual graph data.
Managing graph models
The following commands are available for managing graph models:
| Command | Description |
|---|---|
| .create-or-alter graph_model | Creates a new graph model or alters an existing one |
| .show graph_model | Displays details of a specific graph model |
| .show graph_models | Lists all graph models in the database |
| .drop graph_model | Removes a graph model |
Graph snapshots
A graph snapshot is the actual graph instance materialized from a graph model. It represents:
- A specific point-in-time view of the data as defined by the model
- The nodes, edges, and their properties in a queryable format
- A self-contained entity that persists until explicitly removed
Snapshots are the entities you query when working with persistent graphs.
Managing graph snapshots
The following commands are available for managing graph snapshots:
| Command | Description |
|---|---|
| .make graph_snapshot | Creates a new graph snapshot from a graph model |
| .show graph_snapshot | Displays details of a specific graph snapshot |
| .show graph_snapshots | Lists all graph snapshots in the database |
| .drop graph_snapshot | Removes a single graph snapshot |
| .drop graph_snapshots | Removes multiple graph snapshots based on criteria |
Workflow
The typical workflow for creating and using persistent graphs follows these steps:
- Create a graph model - Define the structure and data sources for your graph
- Create a graph snapshot - Materialize the graph model into a queryable snapshot
- Query the graph snapshot - Use KQL graph operators to analyze the graph data
- Manage lifecycle - Create new snapshots as needed and drop old ones
Querying persistent graphs
Once a graph snapshot is created, it can be queried using the graph function followed by other KQL graph operators:
graph("MyGraphModel")
| graph-match (n)-[e]->(m)
project n, e, m
To query a specific snapshot, provide the snapshot name:
graph("MyGraphModel", "MyGraphSnapshot")
| graph-match (n)-[e]->(m)
project n, e, m
The graph-match operator enables pattern matching and traversal operations, while graph-shortest-paths helps find optimal connections between entities. The graph-to-table operator converts graph results back to tabular format.
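A hedged sketch of these operators against a persistent graph; the model name, the name property, and the hop limit are placeholders rather than part of this article's examples, and the graph-to-table form shown is an assumption to verify against the operator reference:

graph("MyGraphModel")
| graph-shortest-paths (source)-[e*1..5]->(target)
  where source.name == "Alice" and target.name == "Bob"
  project source.name, target.name, hops = array_length(e)

// Convert the graph back to tabular form, for example to export or join its nodes
graph("MyGraphModel")
| graph-to-table nodes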
Key considerations
This section describes key considerations and current limitations of graph models and snapshots.
Snapshot limitations
Persistent graphs have the following limitations:
- Regular database limit: Maximum of 5,000 graph snapshots per database
- Free virtual cluster limit: Maximum of 500 graph snapshots per database
- Snapshot creation time: Limited to 1 hour
Next steps
5 - Avrotize k2a tool
Avrotize is a versatile tool for converting data and database schema formats, and generating code in various programming languages. The tool supports the conversion of Kusto table schemas to Apache Avro format and vice versa with the Convert Kusto table definition to Avrotize Schema command. The tool handles dynamic columns in Kusto tables by:
- Inferring the schema through sampling
- Resolving arrays and records at any level of nesting
- Detecting conflicting schemas
- Creating type unions for each different schema branch
Convert table definition to AVRO format
You can use the avrotize k2a command to connect to a Kusto database and create an Avro schema with a record type for each of the tables in the database.
The following are examples of how to use the command:
Create an Avro schema with a top-level union with a record for each table:
avrotize k2a --kusto-uri <Uri> --kusto-database <DatabaseName> --avsc <AvroFilename.avsc>
Create an xRegistry catalog file with CloudEvent wrappers and per-event schemas:
In the following example, you create xRegistry catalog files with schemas for each table. If the input table contains CloudEvents identified by columns like id, source, and type, the tool creates separate schemas for each event type.
avrotize k2a --kusto-uri <URI> --kusto-database <DatabaseName> --avsc <AvroFilename.xreg.json> --emit-cloudevents-xregistry --avro-namespace <AvroNamespace>
Convert AVRO schema to Kusto table declaration
You can use the avrotize a2k command to create KQL table declarations from Avro schema and JSON mappings. It can also include docstrings in the table declarations extracted from the “doc” annotations in the Avro record types.
If the Avro schema is a single record type, the output script includes a .create table command for the record. The record fields are converted into columns in the table. If the Avro schema is a type union (a top-level array), the output script emits a separate .create table command for each record type in the union.
avrotize a2k .\<AvroFilename.avsc> --out <KustoFilename.kql>
The Avrotize tool is capable of converting JSON Schema, XML Schema, ASN.1 Schema, and Protobuf 2 and Protobuf 3 schemas into Avro schema. You can first convert the source schema into an Avro schema to normalize it and then convert it into Kusto schema.
For example, to convert “address.json” into Avro schema, the following command first converts an input JSON Schema document “address.json” to normalize it:
avrotize j2a address.json --out address.avsc
Then convert the Avro schema file into Kusto schema:
avrotize a2k address.avsc --out address.kql
You can also chain the commands together to convert from JSON Schema via Avro into Kusto schema:
avrotize j2a address.json | avrotize a2k --out address.kql
Related content
6 - Best practices for schema management
Follow these best practices to make your management commands run more efficiently and to reduce their impact on service resources.
| Action | Use | Don’t use | Notes |
|---|---|---|---|
| Create multiple tables | Use a single .create tables command | Don’t issue many .create table commands | |
| Rename multiple tables | Make a single call to .rename tables | Don’t issue a separate call for each pair of tables | |
| Show commands | Use the lowest-scoped .show command | Don’t apply filters after a pipe (\|) | Limit use as much as possible. When possible, cache the information they return. |
| Show extents | Use .show table T extents | Don’t use .show cluster extents \| where TableName == ‘T’ | |
| Show database schema | Use .show database DB schema | Don’t use .show schema \| where DatabaseName == ‘DB’ | |
| Show large schema | Use .show databases schema | Don’t use .show schema | For example, use on an environment with more than 100 databases. |
| Check a table’s existence or get the table’s schema | Use .show table T schema as json | Don’t use .show table T | Only use this command to get actual statistics on a single table. |
| Define the schema for a table that will include datetime values | Set the relevant columns to the datetime type | Don’t convert string or numeric columns to datetime at query time for filtering, if that can be done before or during ingestion time | |
| Add extent tag to metadata | Use sparingly | Avoid drop-by: tags, which limit the system’s ability to do performance-oriented grooming processes in the background. | See performance notes. |
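For instance, the first two rows translate into single batched commands like the following hedged sketch (the table names and schemas are made up):

// One call creates several tables
.create tables
  RawEvents (Timestamp:datetime, Payload:dynamic),
  EnrichedEvents (Timestamp:datetime, UserId:string, Payload:dynamic)

// One call renames several tables
.rename tables RawEvents_v2=RawEvents, EnrichedEvents_v2=EnrichedEvents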
7 - Databases management
This topic describes the following database management commands:
| Command | Description |
|---|---|
| .show databases | Returns a table in which every record corresponds to a database in the cluster that the user has access to. |
| .show database | Returns a table showing the properties of the context database. |
| .alter database prettyname | Alters a database’s pretty (friendly) name. |
| .drop database prettyname | Drops a database’s pretty (friendly) name. |
| .show database schema | Returns a flat list of the structure of the selected databases with all their tables and columns in a single table or JSON object. |
| .show database schema violations | Returns a list of the schema validation issues within a database. |
| .show databases entities | Returns a list of databases’ entities (for example: tables, materialized views, and so on). |
| .execute database script | Executes a batch of management commands in the scope of a single database. |
In an eventhouse, the following commands are available:
| Command | Description |
|---|---|
| .show databases | Returns a table in which every record corresponds to a database in the eventhouse that the user has access to. |
| .show database | Returns a table showing the properties of the context database. |
| .show database schema | Returns a flat list of the structure of the selected databases with all their tables and columns in a single table or JSON object. |
| .show database schema violations | Returns a list of the schema validation issues within a database. |
| .show databases entities | Returns a list of databases’ entities (for example: tables, materialized views, and so on). |
| .execute database script | Executes a batch of management commands in the scope of a single database. |
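As an illustration of the last row, a hedged sketch of .execute database script batching a few commands in a single call; the table and function names are placeholders:

.execute database script <|
.create-merge table Telemetry (Timestamp:datetime, DeviceId:string, Reading:real)
.create-or-alter function GetLatestReadings() { Telemetry | summarize arg_max(Timestamp, *) by DeviceId }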
8 - Columns
8.1 - Change column type without data loss
The .alter column command changes the column type, making the original data unrecoverable. To preserve preexisting data while changing the column type, create a new, properly typed table.
For each table OriginalTable you’d like to change a column type in, execute the following steps:
1. Create a table NewTable with the correct schema (the right column types and the same column order).

2. Ingest the data into NewTable from OriginalTable, applying the required data transformations. In the following example, Col1 is being converted to the string data type.

   .set-or-append NewTable <| OriginalTable | extend Col1=tostring(Col1)

3. Use the .rename tables command to swap table names.

   .rename tables NewTable=OriginalTable, OriginalTable=NewTable

   When the command completes, the new data from existing ingestion pipelines flows to OriginalTable, which is now typed correctly.

4. Drop the table NewTable. NewTable includes only a copy of the historical data from before the schema change. It can be safely dropped after confirming the schema and data in OriginalTable were correctly updated.

   .drop table NewTable
Example
The following example updates the schema of OriginalTable while preserving its data.
Create the table, OriginalTable, with a column, “Col1,” of type guid.
.create table OriginalTable (Col1:guid, Id:int)
Then ingest data into OriginalTable.
.ingest inline into table OriginalTable <|
b642dec0-1040-4eac-84df-a75cfeba7aa4,1
c224488c-ad42-4e6c-bc55-ae10858af58d,2
99784a64-91ad-4897-ae0e-9d44bed8eda0,3
d8857a93-2728-4bcb-be1d-1a2cd35386a7,4
b1ddcfcc-388c-46a2-91d4-5e70aead098c,5
Create the table, NewTable, with a column, “Col1,” of type string.
.create table NewTable (Col1:string, Id:int)
Append data from OriginalTable to NewTable and use the tostring() function to convert the “Col1” column from type guid to type string.
.set-or-append NewTable <| OriginalTable | extend Col1=tostring(Col1)
Swap the table names.
.rename tables NewTable = OriginalTable, OriginalTable = NewTable
Drop table, NewTable with the old schema and data.
.drop table NewTable
Related content
8.2 - Columns management
This section describes the following management commands used for managing table columns:
| Command | Description |
|---|---|
| .alter column | Alters the data type of an existing table column |
| .alter-merge column docstrings and .alter column docstrings | Sets the docstring property of one or more columns of the specified table |
| .alter table and .alter-merge table | Modify the schema of a table (add/remove columns) |
| .drop column and .drop table columns | Removes one or multiple columns from a table |
| .rename column or .rename columns | Changes the name of an existing column or of multiple table columns |
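A hedged sketch of two common operations from this table; MyTable and the column names are placeholders:

// Change a column's type (see "Change column type without data loss" above before doing this on data you need to keep)
.alter column ['MyTable'].['Col1'] type=string

// Rename a column
.rename column MyTable.OldName to NewName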
9 - External tables
9.1 - Azure SQL external tables
9.1.1 - Create and alter Azure SQL external tables
Creates or alters an Azure SQL external table in the database in which the command is executed.
Supported Azure SQL external table types
- SQL Server
- MySQL
- PostgreSQL
- Cosmos DB
Permissions
To .create requires at least Database User permissions, and to .alter requires at least Table Admin permissions.
To .create, .alter, or .create-or-alter an external table using managed identity authentication requires Database Admin permissions. This method is supported for SQL Server and Cosmos DB external tables.
Syntax
(.create | .alter | .create-or-alter) external table TableName (Schema) kind = sql [ table = SqlTableName ] (SqlConnectionString) [with ( [ sqlDialect = SqlDialect ] , [ Property , … ])]
Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| TableName | string | ✔️ | The name of the external table. The name must follow the rules for entity names, and an external table can’t have the same name as a regular table in the same database. |
| Schema | string | ✔️ | The external data schema is a comma-separated list of one or more column names and data types, where each item follows the format: ColumnName : ColumnType. |
| SqlConnectionString | string | ✔️ | The connection string to the SQL server. |
| SqlTableName | string | | The name of the SQL table not including the database name. If the name of the table contains a period ("."), use [‘Name.of.the.table’] notation. If the table isn’t in the default SQL database schema, use [‘SchemaName.TableName’] notation. For example, for a table “T1” in schema “S1”: [‘S1.T1’]. This specification is required for all types of tables except for Cosmos DB, as for Cosmos DB the collection name is part of the connection string. |
| SqlDialect | string | | Indicates the type of Azure SQL external table. SQL Server is the default. For MySQL, specify MySQL. For PostgreSQL, specify PostgreSQL. For Cosmos DB, specify CosmosDbSql. |
| Property | string | | A key-value property pair in the format PropertyName = PropertyValue. See optional properties. |
Optional properties
| Property | Type | Description |
|---|---|---|
| folder | string | The table’s folder. |
| docString | string | A string documenting the table. |
| firetriggers | true/false | If true, instructs the target system to fire INSERT triggers defined on the SQL table. The default is false. (For more information, see BULK INSERT and System.Data.SqlClient.SqlBulkCopy) |
| createifnotexists | true/false | If true, the target SQL table is created if it doesn’t already exist; the primarykey property must be provided in this case to indicate the result column that is the primary key. The default is false. |
| primarykey | string | If createifnotexists is true, the resulting column name is used as the SQL table’s primary key if it’s created by this command. |
Authentication and authorization
To interact with an external Azure SQL table, you must specify authentication means as part of the SqlConnectionString. The SqlConnectionString defines the resource to access and its authentication information.
For more information, see Azure SQL external table authentication methods.
Examples
The following examples show how to create each type of Azure SQL external table.
SQL Server
.create external table MySqlExternalTable (x:long, s:string)
kind=sql
table=MySqlTable
(
h@'Server=tcp:myserver.database.windows.net,1433;Authentication=Active Directory Integrated;Initial Catalog=mydatabase;'
)
with
(
docstring = "Docs",
folder = "ExternalTables",
createifnotexists = true,
primarykey = x,
firetriggers=true
)
Output
| TableName | TableType | Folder | DocString | Properties |
|---|---|---|---|---|
| MySqlExternalTable | Sql | ExternalTables | Docs | { “TargetEntityKind”: “sqltable”, “TargetEntityName”: “MySqlTable”, “TargetEntityConnectionString”: “Server=tcp:myserver.database.windows.net,1433;Authentication=Active Directory Integrated;Initial Catalog=mydatabase;”, “FireTriggers”: true, “CreateIfNotExists”: true, “PrimaryKey”: “x” } |
MySQL
.create external table MySqlExternalTable (x:long, s:string)
kind=sql
table=MySqlTable
(
h@'Server=myserver.mysql.database.windows.net;Port = 3306;UID = USERNAME;Pwd = PASSWORD;Database = mydatabase;'
)
with
(
sqlDialect = "MySql",
docstring = "Docs",
folder = "ExternalTables",
)
PostgreSQL
.create external table PostgreSqlExternalTable (x:long, s:string)
kind=sql
table=PostgreSqlTable
(
h@'Host = hostname.postgres.database.azure.com; Port = 5432; Database= db; User Id=user; Password=pass; Timeout = 30;'
)
with
(
sqlDialect = "PostgreSQL",
docstring = "Docs",
folder = "ExternalTables",
)
Cosmos DB
.create external table CosmosDBSQLExternalTable (x:long, s:string)
kind=sql
(
h@'AccountEndpoint=https://cosmosdbacc.documents.azure.com/;Database=MyDatabase;Collection=MyCollection;AccountKey=' h'R8PM...;'
)
with
(
sqlDialect = "CosmosDbSQL",
docstring = "Docs",
folder = "ExternalTables",
)
Related content
9.1.2 - Query SQL external tables
You can query a SQL external table just as you would query an Azure Data Explorer or a table in a KQL Database.
How it works
Azure SQL external table queries are translated from Kusto Query Language (KQL) to SQL. The operators after the external_table function call, such as where, project, count, and so on, are pushed down and translated into a single SQL query to be executed against the target SQL table.
Example
For example, consider an external table named MySqlExternalTable with two columns x and s. In this case, the following KQL query is translated into the following SQL query.
KQL query
external_table(MySqlExternalTable)
| where x > 5
| count
SQL query
SELECT COUNT(*) FROM (SELECT x, s FROM MySqlTable WHERE x > 5) AS Subquery1
9.1.3 - Use row-level security with Azure SQL external tables
Apply row-level security on Azure SQL external tables
This document describes how to apply a row-level security (RLS) solution with SQL external tables. Row-level security implements data isolation at the user level, restricting access to data based on the current user credential. However, Kusto external tables don’t support RLS policy definitions, so data isolation on external SQL tables requires a different approach. The following solution combines row-level security in SQL Server with Microsoft Entra ID impersonation in the SQL Server connection string. This combination provides the same behavior as applying user access control with RLS on standard Kusto tables, so users querying the SQL external table can only see the records addressed to them, based on the row-level security policy defined in the source database.
Prerequisites
ALTER ANY SECURITY POLICYpermission on the SQL Server- Table admin level permissions on the Kusto-side SQL external table
Sample table
The example source is a SQL Server table called SourceTable, with the following schema. The systemuser column contains the user email to whom the data record belongs. This is the same user who should have access to this data.
CREATE TABLE SourceTable (
id INT,
region VARCHAR(5),
central VARCHAR(5),
systemuser VARCHAR(200)
)
Configure row-level security in the source SQL Server - SQL Server side
For general information on SQL Server row-level security, see row-level security in SQL Server.
1. Create a SQL function with the logic for the data access policy. In this example, the row-level security is based on the current user’s email matching the systemuser column. This logic could be modified to meet any other business requirement.

   CREATE SCHEMA Security;
   GO
   CREATE FUNCTION Security.mySecurityPredicate(@CheckColumn AS nvarchar(100))
       RETURNS TABLE
       WITH SCHEMABINDING
   AS
       RETURN SELECT 1 AS mySecurityPredicate_result
       WHERE @CheckColumn = ORIGINAL_LOGIN() OR USER_NAME() = 'Manager';
   GO

2. Create the security policy on the table SourceTable, passing the column name as the parameter:

   CREATE SECURITY POLICY SourceTableFilter
   ADD FILTER PREDICATE Security.mySecurityPredicate(systemuser)
   ON dbo.SourceTable
   WITH (STATE = ON)
   GO

   [!NOTE] At this point, the data is already restricted by the mySecurityPredicate function logic.
Allow user access to SQL Server - SQL Server side
The following steps depend on the SQL Server version that you’re using.
1. Create a login and user for each Microsoft Entra ID credential that is going to access the data stored in SQL Server:

   CREATE LOGIN [user@domain.com] FROM EXTERNAL PROVIDER --MASTER
   CREATE USER [user@domain.com] FROM EXTERNAL PROVIDER --DATABASE

2. Grant SELECT on the security function to the Microsoft Entra ID user:

   GRANT SELECT ON Security.mySecurityPredicate to [user@domain.com]

3. Grant SELECT on the SourceTable to the Microsoft Entra ID user:

   GRANT SELECT ON dbo.SourceTable to [user@domain.com]
Define SQL external table connection String - Kusto side
For more information on the connection string, see SQL External Table Connection Strings.
1. Create a SQL external table using a connection string with the Active Directory Integrated authentication type. For more information, see Microsoft Entra integrated (impersonation).

   .create external table SQLSourceTable (id:long, region:string, central:string, systemuser:string)
   kind=sql
   table=SourceTable
   (
      h@'Server=tcp:[sql server endpoint],1433;Authentication=Active Directory Integrated;Initial Catalog=[database name];'
   )
   with
   (
      docstring = "Docs",
      folder = "ExternalTables",
      createifnotexists = false,
      primarykey = 'id'
   )

   Connection string: Server=tcp:[sql server endpoint],1433;Authentication=Active Directory Integrated;Initial Catalog=[database name];

2. Validate the data isolation based on the Microsoft Entra ID, as it would work with row-level security in Kusto. In this case, the data is filtered based on the SourceTable’s systemuser column, matching the Microsoft Entra ID user (email address) from the Kusto impersonation:

   external_table('SQLSourceTable')

   [!NOTE] The policy can be disabled and enabled again, on the SQL Server side, for testing purposes.
To disable and enable the policy, use the following SQL commands:
ALTER SECURITY POLICY SourceTableFilter
WITH (STATE = OFF);
ALTER SECURITY POLICY SourceTableFilter
WITH (STATE = ON);
With the Security Policy enabled on the SQL Server side, Kusto users only see the records matching their Microsoft Entra IDs, as the result of the query against the SQL External table. With the Security Policy disabled, all users are able to access the full table content as the result of the query against the SQL External table.
Related content
9.2 - Azure Storage external tables
9.2.1 - Create and alter Azure Storage delta external tables
The commands in this article can be used to create or alter a delta external table in the database from which the command is executed. A delta external table references Delta Lake table data located in Azure Blob Storage, Azure Data Lake Store Gen1, or Azure Data Lake Store Gen2.
To accelerate queries over external delta tables, see Query acceleration policy.
Permissions
To .create requires at least Database User permissions, and to .alter requires at least Table Admin permissions.
To .create-or-alter an external table using managed identity authentication requires AllDatabasesAdmin permissions.
Syntax
(.create | .alter | .create-or-alter) external table TableName [(Schema)] kind = delta (StorageConnectionString ) [with (Property [, …])]
Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| TableName | string | ✔️ | An external table name that adheres to the entity names rules. An external table can’t have the same name as a regular table in the same database. |
| Schema | string | | The optional external data schema is a comma-separated list of one or more column names and data types, where each item follows the format: ColumnName : ColumnType. If not specified, it will be automatically inferred from the delta log based on the latest delta table version. |
| StorageConnectionString | string | ✔️ | delta table root folder path, including credentials. Can point to Azure Blob Storage blob container, Azure Data Lake Gen 2 file system or Azure Data Lake Gen 1 container. The external table storage type is determined by the provided connection string. See storage connection strings. |
| Property | string | | A key-value property pair in the format PropertyName = PropertyValue. See optional properties. |
Authentication and authorization
The authentication method to access an external table is based on the connection string provided during its creation, and the permissions required to access the table vary depending on the authentication method.
The supported authentication methods are the same as those supported by Azure Storage external tables.
Optional properties
| Property | Type | Description |
|---|---|---|
| folder | string | Table’s folder |
| docString | string | String documenting the table |
| compressed | bool | Only relevant for the export scenario. If set to true, the data is exported in the format specified by the compressionType property. For the read path, compression is automatically detected. |
| compressionType | string | Only relevant for the export scenario. The compression type of exported files. For non-Parquet files, only gzip is allowed. For Parquet files, possible values include gzip, snappy, lz4_raw, brotli, and zstd. Default is gzip. For the read path, compression type is automatically detected. |
| namePrefix | string | If set, specifies the prefix of the files. On write operations, all files will be written with this prefix. On read operations, only files with this prefix are read. |
| fileExtension | string | If set, specifies the extension of the files. On write, file names will end with this suffix. On read, only files with this file extension will be read. |
| encoding | string | Specifies how the text is encoded: UTF8NoBOM (default) or UTF8BOM. |
| dryRun | bool | If set, the external table definition isn’t persisted. This option is useful for validating the external table definition, especially in conjunction with the filesPreview or sampleUris parameter. |
Examples
Create or alter a delta external table with an inferred schema
In the following external table, the schema is automatically inferred from the latest delta table version.
.create-or-alter external table ExternalTable
kind=delta
(
h@'https://storageaccount.blob.core.windows.net/container1;secretKey'
)
Create a delta external table with a custom schema
In the following external table, a custom schema is specified and overrides the schema of the delta table. If, at some later time, you need to replace the custom schema with the schema based on the latest delta table version, run the .alter | .create-or-alter command without specifying a schema, like in the previous example.
.create external table ExternalTable (Timestamp:datetime, x:long, s:string)
kind=delta
(
h@'abfss://filesystem@storageaccount.dfs.core.windows.net/path;secretKey'
)
Limitations
- Time travel is not supported. Only the latest delta table version is used.
Related content
9.2.2 - Create and alter Azure Storage external tables
The commands in this article can be used to create or alter an Azure Storage external table in the database from which the command is executed. An Azure Storage external table references data located in Azure Blob Storage, Azure Data Lake Store Gen1, or Azure Data Lake Store Gen2.
Permissions
To .create requires at least Database User permissions, and to .alter requires at least Table Admin permissions.
To .create-or-alter an external table using managed identity authentication requires AllDatabasesAdmin permissions.
Syntax
(.create | .alter | .create-or-alter) external table TableName (Schema) kind = storage [partition by (Partitions) [pathformat = (PathFormat)]] dataformat = DataFormat (StorageConnectionString [, …] ) [with (Property [, …])]
Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| TableName | string | ✔️ | An external table name that adheres to the entity names rules. An external table can’t have the same name as a regular table in the same database. |
| Schema | string | ✔️ | The external data schema is a comma-separated list of one or more column names and data types, where each item follows the format: ColumnName : ColumnType. If the schema is unknown, use infer_storage_schema to infer the schema based on external file contents. |
| Partitions | string | | A comma-separated list of columns by which the external table is partitioned. Partition column can exist in the data file itself, or as part of the file path. See partitions formatting to learn how this value should look. |
| PathFormat | string | | An external data folder URI path format to use with partitions. See path format. |
| DataFormat | string | ✔️ | The data format, which can be any of the ingestion formats. We recommend using the Parquet format for external tables to improve query and export performance, unless you use JSON paths mapping. When using an external table for export scenario, you’re limited to the following formats: CSV, TSV, JSON, and Parquet. |
| StorageConnectionString | string | ✔️ | One or more comma-separated paths to Azure Blob Storage blob containers, Azure Data Lake Gen 2 file systems or Azure Data Lake Gen 1 containers, including credentials. The provided connection string determines the external table storage type. See storage connection strings. |
| Property | string | | A key-value property pair in the format PropertyName = PropertyValue. See optional properties. |
Authentication and authorization
The authentication method to access an external table is based on the connection string provided during its creation, and the permissions required to access the table vary depending on the authentication method.
The following table lists the supported authentication methods for Azure Storage external tables and the permissions needed to read or write to the table.
| Authentication method | Azure Blob Storage / Data Lake Storage Gen2 | Data Lake Storage Gen1 |
|---|---|---|
| Impersonation | Read permissions: Storage Blob Data Reader Write permissions: Storage Blob Data Contributor | Read permissions: Reader Write permissions: Contributor |
| Managed identity | Read permissions: Storage Blob Data Reader Write permissions: Storage Blob Data Contributor | Read permissions: Reader Write permissions: Contributor |
| Shared Access (SAS) token | Read permissions: List + Read Write permissions: Write | This authentication method isn’t supported in Gen1. |
| Microsoft Entra access token | No additional permissions required. | No additional permissions required. |
| Storage account access key | No additional permissions required. | This authentication method isn’t supported in Gen1. |
| Authentication method | Azure Blob Storage / Data Lake Storage Gen2 | Data Lake Storage Gen1 |
|---|---|---|
| Impersonation | Read permissions: Storage Blob Data Reader Write permissions: Storage Blob Data Contributor | Read permissions: Reader Write permissions: Contributor |
| Shared Access (SAS) token | Read permissions: List + Read Write permissions: Write | This authentication method isn’t supported in Gen1. |
| Microsoft Entra access token | No additional permissions required. | No additional permissions required. |
| Storage account access key | No additional permissions required. | This authentication method isn’t supported in Gen1. |
Path format
The PathFormat parameter allows you to specify the format for the external data folder URI path in addition to partitions. It consists of a sequence of partition elements and text separators. A partition element refers to a partition that is declared in the partition by clause, and the text separator is any text enclosed in quotes. Consecutive partition elements must be set apart using the text separator.
[ StringSeparator ] Partition [ StringSeparator ] [Partition [ StringSeparator ] …]
To construct the original file path prefix, partition elements are rendered as strings and separated with corresponding text separators. You can use the datetime_pattern macro (datetime_pattern(DateTimeFormat, PartitionName)) to specify the format used for rendering a datetime partition value. The macro adheres to the .NET format specification, and allows format specifiers to be enclosed in curly brackets. For example, the following two formats are equivalent:
- 'year='yyyy'/month='MM
- year={yyyy}/month={MM}
By default, datetime values are rendered using the following formats:
| Partition function | Default format |
|---|---|
| startofyear | yyyy |
| startofmonth | yyyy/MM |
| startofweek | yyyy/MM/dd |
| startofday | yyyy/MM/dd |
| bin(Column, 1d) | yyyy/MM/dd |
| bin(Column, 1h) | yyyy/MM/dd/HH |
| bin(Column, 1m) | yyyy/MM/dd/HH/mm |
Virtual columns
When data is exported from Spark, partition columns (that are provided to the dataframe writer’s partitionBy method) aren’t written to data files.
This process avoids data duplication because the data is already present in the folder names (for example, column1=<value>/column2=<value>/), and Spark can recognize it upon read.
External tables support reading this data in the form of virtual columns. Virtual columns can be of either type string or datetime, and are specified using the following syntax:
.create external table ExternalTable (EventName:string, Revenue:double)
kind=storage
partition by (CustomerName:string, Date:datetime)
pathformat=("customer=" CustomerName "/date=" datetime_pattern("yyyyMMdd", Date))
dataformat=parquet
(
h@'https://storageaccount.blob.core.windows.net/container1;secretKey'
)
To filter by virtual columns in a query, specify partition names in query predicate:
external_table("ExternalTable")
| where Date between (datetime(2020-01-01) .. datetime(2020-02-01))
| where CustomerName in ("John.Doe", "Ivan.Ivanov")
Optional properties
| Property | Type | Description |
|---|---|---|
| folder | string | Table's folder |
| docString | string | String documenting the table |
| compressed | bool | Only relevant for the export scenario. If set to true, the data is exported in the format specified by the compressionType property. For the read path, compression is automatically detected. |
| compressionType | string | Only relevant for the export scenario. The compression type of exported files. For non-Parquet files, only gzip is allowed. For Parquet files, possible values include gzip, snappy, lz4_raw, brotli, and zstd. Default is gzip. For the read path, compression type is automatically detected. |
| includeHeaders | string | For delimited text formats (CSV, TSV, ...), specifies whether files contain a header. Possible values are: All (all files contain a header), FirstFile (first file in a folder contains a header), None (no files contain a header). |
| namePrefix | string | If set, specifies the prefix of the files. On write operations, all files are written with this prefix. On read operations, only files with this prefix are read. |
| fileExtension | string | If set, specifies the extension of the files. On write, file names end with this suffix. On read, only files with this file extension are read. |
| encoding | string | Specifies how the text is encoded: UTF8NoBOM (default) or UTF8BOM. |
| sampleUris | bool | If set, the command result provides several examples of simulated external data files URI as they're expected by the external table definition. This option helps validate whether the Partitions and PathFormat parameters are defined properly. |
| filesPreview | bool | If set, one of the command result tables contains a preview of .show external table artifacts command. Like sampleUris, the option helps validate the Partitions and PathFormat parameters of external table definition. |
| validateNotEmpty | bool | If set, the connection strings are validated for having content in them. The command fails if the specified URI location doesn't exist, or if there are insufficient permissions to access it. |
| dryRun | bool | If set, the external table definition isn't persisted. This option is useful for validating the external table definition, especially in conjunction with the filesPreview or sampleUris parameter. |
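For example, a minimal sketch (table name, partition, and storage URI are placeholders) that combines dryRun with sampleUris to validate the partitions and path format without persisting the definition:

.create external table ValidationCheck (Timestamp:datetime, CustomerName:string)
kind=storage
partition by (Date:datetime = startofday(Timestamp))
pathformat=("dt=" datetime_pattern("yyyyMMdd", Date))
dataformat=csv
(
   h@'https://storageaccount.blob.core.windows.net/container1;secretKey'
)
with (dryRun = true, sampleUris = true)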
File filtering logic
When you query an external table, performance is improved by filtering out irrelevant external storage files. The process of iterating files and deciding whether a file should be processed is as follows:
Build a URI pattern that represents a place where files are found. Initially, the URI pattern equals a connection string provided as part of the external table definition. If there are any partitions defined, they’re rendered using PathFormat, then appended to the URI pattern.
For all files found under the URI pattern(s) created, check that:
- Partition values match predicates used in a query.
- Blob name starts with NamePrefix, if such a property is defined.
- Blob name ends with FileExtension, if such a property is defined.
Once all the conditions are met, the file is fetched and processed.
Examples
Nonpartitioned external table
In the following nonpartitioned external table, the files are expected to be placed directly under the container(s) defined:
.create external table ExternalTable (x:long, s:string)
kind=storage
dataformat=csv
(
h@'https://storageaccount.blob.core.windows.net/container1;secretKey'
)
Partitioned by date
In the following external table partitioned by date, the files are expected to be placed under directories of the default datetime format yyyy/MM/dd:
.create external table ExternalTable (Timestamp:datetime, x:long, s:string)
kind=storage
partition by (Date:datetime = bin(Timestamp, 1d))
dataformat=csv
(
h@'abfss://filesystem@storageaccount.dfs.core.windows.net/path;secretKey'
)
Partitioned by month
In the following external table partitioned by month, the directory format is year=yyyy/month=MM:
.create external table ExternalTable (Timestamp:datetime, x:long, s:string)
kind=storage
partition by (Month:datetime = startofmonth(Timestamp))
pathformat=(datetime_pattern("'year='yyyy'/month='MM", Month))
dataformat=csv
(
h@'https://storageaccount.blob.core.windows.net/container1;secretKey'
)
Partitioned by name and date
In the following external table, the data is partitioned first by customer name and then by date, meaning that the expected directory structure is, for example, customer_name=Softworks/2019/02/01:
.create external table ExternalTable (Timestamp:datetime, CustomerName:string)
kind=storage
partition by (CustomerNamePart:string = CustomerName, Date:datetime = startofday(Timestamp))
pathformat=("customer_name=" CustomerNamePart "/" Date)
dataformat=csv
(
h@'https://storageaccount.blob.core.windows.net/container1;secretKey'
)
Partitioned by hash and date
The following external table is partitioned first by customer name hash (modulo ten), then by date. The expected directory structure is, for example, customer_id=5/dt=20190201, and data file names end with the .txt extension:
.create external table ExternalTable (Timestamp:datetime, CustomerName:string)
kind=storage
partition by (CustomerId:long = hash(CustomerName, 10), Date:datetime = startofday(Timestamp))
pathformat=("customer_id=" CustomerId "/dt=" datetime_pattern("yyyyMMdd", Date))
dataformat=csv
(
h@'https://storageaccount.blob.core.windows.net/container1;secretKey'
)
with (fileExtension = ".txt")
Filter by partition columns in a query
To filter by partition columns in a query, specify the original column name in the query predicate:
external_table("ExternalTable")
| where Timestamp between (datetime(2020-01-01) .. datetime(2020-02-01))
| where CustomerName in ("John.Doe", "Ivan.Ivanov")
Sample Output
| TableName | TableType | Folder | DocString | Properties | ConnectionStrings | Partitions | PathFormat |
|---|---|---|---|---|---|---|---|
| ExternalTable | Blob | ExternalTables | Docs | {"Format":"Csv","Compressed":false,"CompressionType":null,"FileExtension":null,"IncludeHeaders":"None","Encoding":null,"NamePrefix":null} | ["https://storageaccount.blob.core.windows.net/container1;*******"] | [{"Mod":10,"Name":"CustomerId","ColumnName":"CustomerName","Ordinal":0},{"Function":"StartOfDay","Name":"Date","ColumnName":"Timestamp","Ordinal":1}] | "customer_id=" CustomerId "/dt=" datetime_pattern("yyyyMMdd",Date) |
Related content
10 - Functions
10.1 - Stored functions management overview
This section describes management commands used for creating and altering user-defined functions:
| Function | Description |
|---|---|
| .alter function | Alters an existing function and stores it inside the database metadata |
| .alter function docstring | Alters the DocString value of an existing function |
| .alter function folder | Alters the Folder value of an existing function |
| .create function | Creates a stored function |
| .create-or-alter function | Creates a stored function or alters an existing function and stores it inside the database metadata |
| .drop function and .drop functions | Drops a function (or functions) from the database |
| .show functions and .show function | Lists all the stored functions, or a specific function, in the currently-selected database |
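For instance, a minimal sketch of creating a stored function with a docstring and folder (the function name, table T, and Timestamp column are hypothetical placeholders):

.create-or-alter function with (docstring = "Events from the last N days", folder = "Demo")
EventsFromLastDays(numDays: int)
{
    T
    | where Timestamp > ago(1d * numDays)
}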
11 - Ingestion mappings
11.1 - AVRO Mapping
Use AVRO mapping to map incoming data to columns inside tables when your ingestion source file is in AVRO format.
Each AVRO mapping element must contain at least one of the following optional properties:
| Property | Type | Description |
|---|---|---|
| Field | string | Name of the field in the AVRO record. |
| Path | string | If the value starts with $, it’s treated as the path to the field in the AVRO document. This path specifies the part of the AVRO document that becomes the content of the column in the table. The path that denotes the entire AVRO record is $. If the value doesn’t start with $, it’s treated as a constant value. Paths that include special characters should be escaped as ['Property Name']. For more information, see JSONPath syntax. |
| ConstValue | string | The constant value to be used for a column instead of some value inside the AVRO file. |
| Transform | string | Transformation that should be applied on the content with mapping transformations. |
Examples
JSON serialization
The following example mapping is serialized as a JSON string when provided as part of the .ingest management command.
[
{"Column": "event_timestamp", "Properties": {"Field": "Timestamp"}},
{"Column": "event_name", "Properties": {"Field": "Name"}},
{"Column": "event_type", "Properties": {"Field": "Type"}},
{"Column": "event_time", "Properties": {"Field": "Timestamp", "Transform": "DateTimeFromUnixMilliseconds"}},
{"Column": "ingestion_time", "Properties": {"ConstValue": "2021-01-01T10:32:00"}},
{"Column": "full_record", "Properties": {"Path": "$"}}
]
Here the serialized JSON mapping is included in the context of the .ingest management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format = "AVRO",
ingestionMapping =
```
[
{"Column": "column_a", "Properties": {"Field": "Field1"}},
{"Column": "column_b", "Properties": {"Field": "$.[\'Field name with space\']"}}
]
```
)
Precreated mapping
When the mapping is precreated, reference the mapping by name in the .ingest management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format="AVRO",
ingestionMappingReference = "Mapping_Name"
)
Identity mapping
Use AVRO mapping during ingestion without defining a mapping schema (see identity mapping).
.ingest into Table123 (@"source1", @"source2")
with
(
format="AVRO"
)
Related content
- Use the avrotize k2a tool to create an Avro schema.
11.2 - CSV Mapping
Use CSV mapping to map incoming data to columns inside tables when your ingestion source file is any of the following delimiter-separated tabular formats: CSV, TSV, PSV, SCSV, SOHsv, TXT and RAW. For more information, see supported data formats.
Each CSV mapping element must contain at least one of the following optional properties:
| Property | Type | Description |
|---|---|---|
| Ordinal | int | The column order number in CSV. |
| ConstValue | string | The constant value to be used for a column instead of some value inside the CSV file. |
| Transform | string | Transformation that should be applied on the content with mapping transformations. The only supported transformation is SourceLocation. |
Examples
[
{"Column": "event_time", "Properties": {"Ordinal": "0"}},
{"Column": "event_name", "Properties": {"Ordinal": "1"}},
{"Column": "event_type", "Properties": {"Ordinal": "2"}},
{"Column": "ingestion_time", "Properties": {"ConstValue": "2023-01-01T10:32:00"}}
{"Column": "source_location", "Properties": {"Transform": "SourceLocation"}}
]
The mapping above is serialized as a JSON string when it’s provided as part of the .ingest management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format="csv",
ingestionMapping =
```
[
{"Column": "event_time", "Properties": {"Ordinal": "0"}},
{"Column": "event_name", "Properties": {"Ordinal": "1"}},
{"Column": "event_type", "Properties": {"Ordinal": "2"}},
{"Column": "ingestion_time", "Properties": {"ConstValue": "2023-01-01T10:32:00"}},
{"Column": "source_location", "Properties": {"Transform": "SourceLocation"}}
]
```
)
Pre-created mapping
When the mapping is pre-created, reference the mapping by name in the .ingest management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format="csv",
ingestionMappingReference = "MappingName"
)
Identity mapping
Use CSV mapping during ingestion without defining a mapping schema (see identity mapping).
.ingest into Table123 (@"source1", @"source2")
with
(
format="csv"
)
11.3 - Ingestion mappings
Ingestion mappings are used during ingestion to map incoming data to columns inside tables.
Data Explorer supports different types of mappings, both row-oriented (CSV, JSON, AVRO and W3CLOGFILE), and column-oriented (Parquet and ORC).
Ingestion mappings can be defined in the ingest command, or can be precreated and referenced from the ingest command using ingestionMappingReference parameters. Ingestion is possible without specifying a mapping. For more information, see identity mapping.
Each element in the mapping list is constructed from three fields:
| Property | Required | Description |
|---|---|---|
| Column | ✔️ | Target column name in the table. |
| Datatype | | Datatype with which to create the mapped column if it doesn't already exist in the table. |
| Properties | | Property-bag containing properties specific for each mapping as described in each specific mapping type page. |
Supported mapping types
The following table defines mapping types to be used when ingesting or querying external data of a specific format.
| Data Format | Mapping Type |
|---|---|
| CSV | CSV Mapping |
| TSV | CSV Mapping |
| TSVe | CSV Mapping |
| PSV | CSV Mapping |
| SCSV | CSV Mapping |
| SOHsv | CSV Mapping |
| TXT | CSV Mapping |
| RAW | CSV Mapping |
| JSON | JSON Mapping |
| AVRO | AVRO Mapping |
| APACHEAVRO | AVRO Mapping |
| Parquet | Parquet Mapping |
| ORC | ORC Mapping |
| W3CLOGFILE | W3CLOGFILE Mapping |
Ingestion mapping examples
The following examples use the RawEvents table with the following schema:
.create table RawEvents (timestamp: datetime, deviceId: guid, messageId: guid, temperature: decimal, humidity: decimal)
Simple mapping
The following example shows ingestion where the mapping is defined in the ingest command. The command ingests a JSON file from a URL into the RawEvents table. The mapping specifies the path to each field in the JSON file.
.ingest into table RawEvents ('https://kustosamplefiles.blob.core.windows.net/jsonsamplefiles/simple.json')
with (
format = "json",
ingestionMapping =
```
[
{"column":"timestamp","Properties":{"path":"$.timestamp"}},
{"column":"deviceId","Properties":{"path":"$.deviceId"}},
{"column":"messageId","Properties":{"path":"$.messageId"}},
{"column":"temperature","Properties":{"path":"$.temperature"}},
{"column":"humidity","Properties":{"path":"$.humidity"}}
]
```
)
Mapping with ingestionMappingReference
To map the same JSON file using a precreated mapping, create the RawEventMapping ingestion mapping reference with the following command:
.create table RawEvents ingestion json mapping 'RawEventMapping'
```
[
{"column":"timestamp","Properties":{"path":"$.timestamp"}},
{"column":"deviceId","Properties":{"path":"$.deviceId"}},
{"column":"messageId","Properties":{"path":"$.messageId"}},
{"column":"temperature","Properties":{"path":"$.temperature"}},
{"column":"humidity","Properties":{"path":"$.humidity"}}
]
```
Ingest the JSON file using the RawEventMapping ingestion mapping reference with the following command:
.ingest into table RawEvents ('https://kustosamplefiles.blob.core.windows.net/jsonsamplefiles/simple.json')
with (
format="json",
ingestionMappingReference="RawEventMapping"
)
Identity mapping
Ingestion is possible without specifying ingestionMapping or ingestionMappingReference properties. The data is mapped using an identity data mapping derived from the table's schema. The table schema remains the same. The format property should be specified. See ingestion formats.
| Format type | Format | Mapping logic |
|---|---|---|
| Tabular data formats with defined order of columns, such as delimiter-separated or single-line formats. | CSV, TSV, TSVe, PSV, SCSV, Txt, SOHsv, Raw | All table columns are mapped in their respective order to data columns in the order they appear in the data source. Column data type is taken from the table schema. |
| Formats with named columns or records with named fields. | JSON, Parquet, Avro, ApacheAvro, Orc, W3CLOGFILE | All table columns are mapped to data columns or record fields having the same name (case-sensitive). Column data type is taken from the table schema. |
Mapping transformations
Some of the data format mappings (Parquet, JSON, and AVRO) support simple and useful ingest-time transformations. Where the scenario requires more complex processing at ingest time, use Update policy, which allows defining lightweight processing using KQL expression.
| Path-dependent transformation | Description | Conditions |
|---|---|---|
| PropertyBagArrayToDictionary | Transforms JSON array of properties, such as {events:[{"n1":"v1"},{"n2":"v2"}]}, to dictionary and serializes it to valid JSON document, such as {"n1":"v1","n2":"v2"}. | Available for JSON, Parquet, AVRO, and ORC mapping types. |
| SourceLocation | Name of the storage artifact that provided the data, type string (for example, the blob’s “BaseUri” field). | Available for CSV, JSON, Parquet, AVRO, ORC, and W3CLOGFILE mapping types. |
| SourceLineNumber | Offset relative to that storage artifact, type long (starting with ‘1’ and incrementing per new record). | Available for CSV, JSON, Parquet, AVRO, ORC, and W3CLOGFILE mapping types. |
| DateTimeFromUnixSeconds | Converts number representing unix-time (seconds since 1970-01-01) to UTC datetime string. | Available for CSV, JSON, Parquet, AVRO, and ORC mapping types. |
| DateTimeFromUnixMilliseconds | Converts number representing unix-time (milliseconds since 1970-01-01) to UTC datetime string. | Available for CSV, JSON, Parquet, AVRO, and ORC mapping types. |
| DateTimeFromUnixMicroseconds | Converts number representing unix-time (microseconds since 1970-01-01) to UTC datetime string. | Available for CSV, JSON, Parquet, AVRO, and ORC mapping types. |
| DateTimeFromUnixNanoseconds | Converts number representing unix-time (nanoseconds since 1970-01-01) to UTC datetime string. | Available for CSV, JSON, Parquet, AVRO, and ORC mapping types. |
| DropMappedFields | Maps an object in the JSON document to a column and removes any nested fields already referenced by other column mappings. | Available for JSON, Parquet, AVRO, and ORC mapping types. |
| BytesAsBase64 | Treats the data as byte array and converts it to a base64-encoded string. | Available for AVRO mapping type. For ApacheAvro format, the schema type of the mapped data field should be bytes or fixed Avro type. For Avro format, the field should be an array containing byte values from [0-255] range. null is ingested if the data doesn’t represent a valid byte array. |
Mapping transformation examples
DropMappedFields transformation:
Given the following JSON contents:
{
"Time": "2012-01-15T10:45",
"Props": {
"EventName": "CustomEvent",
"Revenue": 0.456
}
}
The following data mapping maps the entire Props object into the dynamic column Props while excluding already mapped columns (Props.EventName is already mapped into the column EventName, so it's excluded).
[
{ "Column": "Time", "Properties": { "Path": "$.Time" } },
{ "Column": "EventName", "Properties": { "Path": "$.Props.EventName" } },
{ "Column": "Props", "Properties": { "Path": "$.Props", "Transform":"DropMappedFields" } },
]
The ingested data looks as follows:
| Time | EventName | Props |
|---|---|---|
| 2012-01-15T10:45 | CustomEvent | {"Revenue": 0.456} |
BytesAsBase64 transformation
Given the following AVRO file contents:
{
"Time": "2012-01-15T10:45",
"Props": {
"id": [227,131,34,92,28,91,65,72,134,138,9,133,51,45,104,52]
}
}
The following data mapping maps the ID column twice, with and without the transformation.
[
{ "Column": "ID", "Properties": { "Path": "$.props.id" } },
{ "Column": "Base64EncodedId", "Properties": { "Path": "$.props.id", "Transform":"BytesAsBase64" } },
]
The ingested data looks as follows:
| ID | Base64EncodedId |
|---|---|
| [227,131,34,92,28,91,65,72,134,138,9,133,51,45,104,52] | 44MiXBxbQUiGigmFMy1oNA== |
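Similarly, here is a sketch of the PropertyBagArrayToDictionary transformation from the table above (the column name and path are hypothetical). Given input such as {"events":[{"n1":"v1"},{"n2":"v2"}]}, the following mapping ingests the array as a single dictionary value like {"n1":"v1","n2":"v2"}:

[
  { "Column": "event_props", "Properties": { "Path": "$.events", "Transform": "PropertyBagArrayToDictionary" } }
]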
11.4 - JSON Mapping
Use JSON mapping to map incoming data to columns inside tables when your ingestion source file is in JSON format.
Each JSON mapping element must contain at least one of the following optional properties:
| Property | Type | Description |
|---|---|---|
| Path | string | If the value starts with $ it’s interpreted as the JSON path to the field in the JSON document that will become the content of the column in the table. The JSON path that denotes the entire document is $. If the value doesn’t start with $ it’s interpreted as a constant value. JSON paths that include special characters should be escaped as ['Property Name']. For more information, see JSONPath syntax. |
| ConstValue | string | The constant value to be used for a column instead of some value inside the JSON file. |
| Transform | string | Transformation that should be applied on the content with mapping transformations. |
Examples
[
{"Column": "event_timestamp", "Properties": {"Path": "$.Timestamp"}},
{"Column": "event_name", "Properties": {"Path": "$.Event.Name"}},
{"Column": "event_type", "Properties": {"Path": "$.Event.Type"}},
{"Column": "source_uri", "Properties": {"Transform": "SourceLocation"}},
{"Column": "source_line", "Properties": {"Transform": "SourceLineNumber"}},
{"Column": "event_time", "Properties": {"Path": "$.Timestamp", "Transform": "DateTimeFromUnixMilliseconds"}},
{"Column": "ingestion_time", "Properties": {"ConstValue": "2021-01-01T10:32:00"}},
{"Column": "full_record", "Properties": {"Path": "$"}}
]
The mapping above is serialized as a JSON string when it’s provided as part of the .ingest management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format = "json",
ingestionMapping =
```
[
{"Column": "column_a", "Properties": {"Path": "$.Obj.Property"}},
{"Column": "column_b", "Properties": {"Path": "$.Property"}},
{"Column": "custom_column", "Properties": {"Path": "$.[\'Property name with space\']"}}
]
```
)
Pre-created mapping
When the mapping is pre-created, reference the mapping by name in the .ingest management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format="json",
ingestionMappingReference = "Mapping_Name"
)
Identity mapping
Use JSON mapping during ingestion without defining a mapping schema (see identity mapping).
.ingest into Table123 (@"source1", @"source2")
with
(
format="json"
)
Copying JSON mapping
You can copy JSON mapping of an existing table and create a new table with the same mapping using the following process:
Run the following command on the table whose mapping you want to copy:
.show table TABLENAME ingestion json mappings
| extend formatted_mapping = strcat("'",replace_string(Mapping, "'", "\\'"),"'")
| project formatted_mapping

Use the output of the above command to create a new table with the same mapping:
.create table TABLENAME ingestion json mapping "TABLENAME_Mapping" RESULT_OF_ABOVE_CMD
11.5 - ORC Mapping
Use ORC mapping to map incoming data to columns inside tables when your ingestion source file is in ORC format.
Each ORC mapping element must contain at least one of the following optional properties:
| Property | Type | Description |
|---|---|---|
| Field | string | Name of the field in the ORC record. |
| Path | string | If the value starts with $ it’s interpreted as the path to the field in the ORC document that will become the content of the column in the table. The path that denotes the entire ORC record is $. If the value doesn’t start with $ it’s interpreted as a constant value. Paths that include special characters should be escaped as ['Property Name']. For more information, see JSONPath syntax. |
| ConstValue | string | The constant value to be used for a column instead of some value inside the ORC file. |
| Transform | string | Transformation that should be applied on the content with mapping transformations. |
Examples
[
{"Column": "event_timestamp", "Properties": {"Path": "$.Timestamp"}},
{"Column": "event_name", "Properties": {"Path": "$.Event.Name"}},
{"Column": "event_type", "Properties": {"Path": "$.Event.Type"}},
{"Column": "event_time", "Properties": {"Path": "$.Timestamp", "Transform": "DateTimeFromUnixMilliseconds"}},
{"Column": "ingestion_time", "Properties": {"ConstValue": "2021-01-01T10:32:00"}},
{"Column": "full_record", "Properties": {"Path": "$"}}
]
The mapping above is serialized as a JSON string when it’s provided as part of the .ingest management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format = "orc",
ingestionMapping =
```
[
{"Column": "column_a", "Properties": {"Path": "$.Field1"}},
{"Column": "column_b", "Properties": {"Path": "$.[\'Field name with space\']"}}
]
```
)
Pre-created mapping
When the mapping is pre-created, reference the mapping by name in the .ingest management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format="orc",
ingestionMappingReference = "ORC_Mapping"
)
Identity mapping
Use ORC mapping during ingestion without defining a mapping schema (see identity mapping).
.ingest into Table123 (@"source1", @"source2")
with
(
format="orc"
)
11.6 - Parquet Mapping
Use Parquet mapping to map incoming data to columns inside tables when your ingestion source file is in Parquet format.
Each Parquet mapping element must contain at least one of the following optional properties:
| Property | Type | Description |
|---|---|---|
| Field | string | Name of the field in the Parquet record. |
| Path | string | If the value starts with $ it’s interpreted as the path to the field in the Parquet document that will become the content of the column in the table. The path that denotes the entire Parquet record is $. If the value doesn’t start with $ it’s interpreted as a constant value. Paths that include special characters should be escaped as ['Property Name']. For more information, see JSONPath syntax. |
| ConstValue | string | The constant value to be used for a column instead of some value inside the Parquet file. |
| Transform | string | Transformation that should be applied on the content with mapping transformations. |
Parquet type conversions
Comprehensive support is provided for converting data types when you’re ingesting or querying data from a Parquet source.
The following table provides a mapping of Parquet field types, and the table column types they can be converted to. The first column lists the Parquet type, and the others show the table column types they can be converted to.
| Parquet type | bool | int | long | real | decimal | datetime | timespan | string | guid | dynamic |
|---|---|---|---|---|---|---|---|---|---|---|
| INT8 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
| INT16 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
| INT32 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
| INT64 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
| UINT8 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
| UINT16 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
| UINT32 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
| UINT64 | ✔️ | ✔️ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
| FLOAT32 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
| FLOAT64 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
| BOOLEAN | ✔️ | ❌ | ❌ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
| DECIMAL (I32) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
| DECIMAL (I64) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
| DECIMAL (FLBA) | ❌ | ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
| DECIMAL (BA) | ✔️ | ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ✔️ |
| TIMESTAMP | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ |
| DATE | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ |
| STRING | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ✔️ |
| UUID | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ | ✔️ | ❌ |
| JSON | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ | ❌ | ✔️ |
| LIST | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ |
| MAP | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ |
| STRUCT | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ |
Examples
[
{"Column": "event_timestamp", "Properties": {"Path": "$.Timestamp"}},
{"Column": "event_name", "Properties": {"Path": "$.Event.Name"}},
{"Column": "event_type", "Properties": {"Path": "$.Event.Type"}},
{"Column": "event_time", "Properties": {"Path": "$.Timestamp", "Transform": "DateTimeFromUnixMilliseconds"}},
{"Column": "ingestion_time", "Properties": {"ConstValue": "2021-01-01T10:32:00"}},
{"Column": "full_record", "Properties": {"Path": "$"}}
]
The mapping above is serialized as a JSON string when it’s provided as part of the .ingest management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format = "parquet",
ingestionMapping =
```
[
{"Column": "column_a", "Properties": {"Path": "$.Field1.Subfield"}},
{"Column": "column_b", "Properties": {"Path": "$.[\'Field name with space\']"}},
]
```
)
Pre-created mapping
When the mapping is pre-created, reference the mapping by name in the .ingest management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format="parquet",
ingestionMappingReference = "Mapping_Name"
)
Identity mapping
Use Parquet mapping during ingestion without defining a mapping schema (see identity mapping).
.ingest into Table123 (@"source1", @"source2")
with
(
format="parquet"
)
11.7 - W3CLOGFILE Mapping
Use W3CLOGFILE mapping to map incoming data to columns inside tables when your ingestion source file is in W3CLOGFILE format.
Each W3CLOGFILE mapping element must contain at least one of the following optional properties:
| Property | Type | Description |
|---|---|---|
| Field | string | Name of the field in the W3CLOGFILE log record. |
| ConstValue | string | The constant value to be used for a column instead of some value inside the W3CLOGFILE file. |
| Transform | string | Transformation that should be applied on the content with mapping transformations. |
Examples
[
{"Column": "Date", "Properties": {"Field": "date"}},
{"Column": "Time", "Properties": {"Field": "time"}},
{"Column": "IP", "Properties": {"Field": "s-ip"}},
{"Column": "ClientMethod", "Properties": {"Field": "cs-method"}},
{"Column": "ClientQuery", "Properties": {"Field": "cs-uri-query"}},
{"Column": "ServerPort", "Properties": {"Field": "s-port"}},
{"Column": "ClientIP", "Properties": {"Field": "c-ip"}},
{"Column": "UserAgent", "Properties": {"Field": "cs(User-Agent)"}},
{"Column": "Referer", "Properties": {"Field": "cs(Referer)"}},
{"Column": "Status", "Properties": {"Field": "sc-status"}},
{"Column": "ResponseBytes", "Properties": {"Field": "sc-bytes"}},
{"Column": "RequestBytes", "Properties": {"Field": "cs-bytes"}},
{"Column": "TimeTaken", "Properties": {"Field": "time-taken"}}
]
The mapping above is serialized as a JSON string when it’s provided as part of the .ingest management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format = "w3clogfile",
ingestionMapping =
```
[
{"Column": "column_a", "Properties": {"Field": "field1"}},
{"Column": "column_b", "Properties": {"Field": "field2"}}
]
```
)
Pre-created mapping
When the mapping is pre-created, reference the mapping by name in the .ingest management command.
.ingest into Table123 (@"source1", @"source2")
with
(
format="w3clogfile",
ingestionMappingReference = "Mapping_Name"
)
Identity mapping
Use W3CLOGFILE mapping during ingestion without defining a mapping schema (see identity mapping).
.ingest into Table123 (@"source1", @"source2")
with
(
format="w3clogfile"
)
12 - Manage external table mappings
13 - Materialized views
13.1 - Materialized views
Materialized views expose an aggregation query over a source table, or over another materialized view.
Materialized views always return an up-to-date result of the aggregation query (always fresh). Querying a materialized view is more performant than running the aggregation directly over the source table.
Why use materialized views?
By investing resources (data storage, background CPU cycles) for materialized views of commonly used aggregations, you get the following benefits:
Performance improvement: Querying a materialized view commonly performs better than querying the source table for the same aggregation function(s).
Freshness: A materialized view query always returns the most up-to-date results, independent of when materialization last took place. The query combines the materialized part of the view with the records in the source table, which haven't yet been materialized (the delta part), always providing the most up-to-date results.

Cost reduction: Querying a materialized view consumes fewer resources than doing the aggregation over the source table. The retention policy of the source table can be reduced if only the aggregation is required. This setup reduces hot cache costs for the source table.
For example use cases, see Materialized view use cases.
How materialized views work
A materialized view is made of two components:
- A materialized part - a table holding aggregated records from the source table, which have already been processed. This table always holds a single record per the aggregation’s group-by combination.
- A delta - the newly ingested records in the source table that haven’t yet been processed.
Querying the materialized view combines the materialized part with the delta part, providing an up-to-date result of the aggregation query. The offline materialization process ingests new records from the delta to the materialized table, and updates existing records. If the intersection between the delta and the materialized part is large, and many records require updates, this might have a negative impact on the materialization process. See monitor materialized views on how to troubleshoot such situations.
Materialized views queries
There are 2 ways to query a materialized view:
- Query the entire view: when you query the materialized view by its name, similarly to querying a table, the materialized view query combines the materialized part of the view with the records in the source table that haven't been materialized yet (the delta).
  - Querying the materialized view always returns the most up-to-date results, based on all records ingested to the source table. For more information about the materialized vs. non-materialized parts in materialized view, see how materialized views work.
  - This option might not perform best as it needs to materialize the delta part during query time. Performance in this case depends on the view's age and the filters applied in the query. The materialized view query optimizer section includes possible ways to improve query performance when querying the entire view.
- Query the materialized part only: another way of querying the view is by using the materialized_view() function. This option supports querying only the materialized part of the view, while specifying the max latency the user is willing to tolerate (see the example after this list).
  - This option isn't guaranteed to return the most up-to-date records, but it should always be more performant than querying the entire view.
  - This function is useful for scenarios in which you're willing to sacrifice some freshness for performance, for example for telemetry dashboards.
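For example, a minimal sketch of querying only the materialized part while tolerating up to 10 minutes of staleness (ViewName is a placeholder; the second argument is the optional max_age parameter of the materialized_view() function):

materialized_view("ViewName", 10m)

If the view was last materialized more than 10 minutes ago, the query is expected to fall back to combining the delta as well; check the materialized_view() function documentation for the exact behavior.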
Materialized views participate in cross-cluster or cross-database queries, but aren't included in wildcard unions or searches.

- The following examples all include materialized views by the name ViewName:

  cluster('cluster1').database('db').ViewName
  cluster('cluster1').database('*').ViewName
  database('*').ViewName
  database('DB*').ViewName
  database('*').materialized_view('ViewName')
  database('DB*').materialized_view('ViewName')

- The following examples do not include records from materialized views:

  cluster('cluster1').database('db').*
  database('*').View*
  search in (*)
  search *

Materialized views participate in cross-Eventhouse or cross-database queries, but aren't included in wildcard unions or searches.

- The following examples all include materialized views by the name ViewName:

  cluster("<serviceURL>").database('db').ViewName
  cluster("<serviceURL>").database('*').ViewName
  database('*').ViewName
  database('DB*').ViewName
  database('*').materialized_view('ViewName')
  database('DB*').materialized_view('ViewName')

- The following examples do not include records from materialized views:

  cluster("<serviceURL>").database('db').*
  database('*').View*
  search in (*)
  search *
Materialized view query optimizer
When querying the entire view, the materialized part is combined with the delta during query time. This includes aggregating the delta and joining it with the materialized part.
- Querying the entire view performs better if the query includes filters on the group by keys of the materialized view query. See more tips about how to create your materialized view, based on your query pattern, in the .create materialized-view performance tips section.
- The query optimizer chooses summarize/join strategies that are expected to improve query performance. For example, the decision on whether to shuffle the query is based on the number of records in the delta part. The following client request properties provide some control over the optimizations applied. You can test these properties with your materialized view queries and evaluate their impact on query performance.
| Client request property name | Type | Description |
|---|---|---|
| materialized_view_query_optimization_costbased_enabled | bool | If set to false, disables summarize/join optimizations in materialized view queries. Uses default strategies. Default is true. |
| materialized_view_shuffle | dynamic | Force shuffling of the materialized view query, and (optionally) provide specific keys to shuffle by. See examples below. |
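For instance, a minimal sketch of disabling the cost-based optimizations for a single query, using the same set-statement pattern as the shuffle examples below (ViewName is a placeholder):

set materialized_view_query_optimization_costbased_enabled = false;
ViewName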
ingestion_time() function in the context of materialized views
ingestion_time() function returns null values, when used in the context of a materialized view, if querying the entire view. When querying the materialized part of the view, the return value depends on the type of materialized view:
- In materialized views which include a single arg_max()/arg_min()/take_any() aggregation, the ingestion_time() is equal to the ingestion_time() of the corresponding record in the source table.
- In all other materialized views, the value of ingestion_time() is approximately the time of materialization (see how materialized views work).
Examples
Query the entire view. The most recent records in source table are included:

ViewName

Query the materialized part of the view only, regardless of when it was last materialized:

materialized_view("ViewName")

Query the entire view, and provide a "hint" to use the shuffle strategy. The most recent records in source table are included:

- Example #1: shuffle based on the Id column (similarly to using hint.shufflekey=Id):

  set materialized_view_shuffle = dynamic([{"Name" : "ViewName", "Keys" : [ "Id" ] }]);
  ViewName

- Example #2: shuffle based on all keys (similarly to using hint.strategy=shuffle):

  set materialized_view_shuffle = dynamic([{"Name" : "ViewName" }]);
  ViewName
Performance considerations
The main contributors that can impact the health of a materialized view are:
Cluster resources: Like any other process running on the cluster, materialized views consume resources (CPU, memory) from the cluster. If the cluster is overloaded, adding materialized views to it may cause a degradation in the cluster’s performance. Monitor your cluster’s health using cluster health metrics. Optimized autoscale currently doesn’t take materialized views health under consideration as part of autoscale rules.
- The materialization process is limited by the amount of memory and CPU it can consume. These limits are defined, and can be changed, in the materialized views workload group.
Overlap with materialized data: During materialization, all new records ingested to the source table since the last materialization (the delta) are processed and materialized into the view. The higher the intersection between new records and already materialized records is, the worse the performance of the materialized view will be. A materialized view works best if the number of records being updated (for example, in an arg_max view) is a small subset of the source table. If all or most of the materialized view records need to be updated in every materialization cycle, then the materialized view might not perform well.

Ingestion rate: There are no hard-coded limits on the data volume or ingestion rate in the source table of the materialized view. However, the recommended ingestion rate for materialized views is no more than 1-2 GB/sec. Higher ingestion rates may still perform well. Performance depends on database size, available resources, and amount of intersection with existing data.
Number of materialized views in cluster: The above considerations apply to each individual materialized view defined in the cluster. Each view consumes its own resources, and many views compete with each other on available resources. While there are no hard-coded limits to the number of materialized views in a cluster, the cluster may not be able to handle all materialized views, when there are many defined. The capacity policy can be adjusted if there is more than a single materialized view in the cluster. Increase the value of ClusterMinimumConcurrentOperations in the policy to run more materialized views concurrently.

Materialized view definition: The materialized view definition must be defined according to query best practices for best query performance. For more information, see create command performance tips.
Materialized view over materialized view
A materialized view can be created over another materialized view if the source materialized view is a deduplication view. Specifically, the aggregation of the source materialized view must be take_any(*) in order to deduplicate source records. The second materialized view can use any supported aggregation functions. For specific information on how to create a materialized view over a materialized view, see .create materialized-view command.
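A minimal sketch of this pattern, assuming a hypothetical source table T with EventId and Timestamp columns (see the .create materialized-view command for the exact syntax):

.create materialized-view DeduplicatedT on table T
{
    T | summarize take_any(*) by EventId
}

.create materialized-view DailyEventCounts on materialized-view DeduplicatedT
{
    DeduplicatedT | summarize count() by bin(Timestamp, 1d)
}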
Related content
13.2 - Materialized views data purge
Data purge commands can be used to purge records from materialized views. The same guidelines for purging records from a table apply to materialized views purge.
The purge command only deletes records from the materialized part of the view (what is the materialized part?). Therefore, if the source table of the materialized view includes records to purge, these records may be returned from the materialized view query, even after purge completed successfully.
The recommended process for purging records from a materialized view is:
- Purge the source table of the materialized view.
- After the source table purge is completed successfully, purge the materialized view.
Limitations
The purge predicate of a materialized view purge can only reference the group by keys of the aggregation, or any column in an arg_max()/arg_min()/take_any() view. It can't reference the result columns of other aggregation functions.
For example, for a materialized view MV, which is defined with the following aggregation function:
T | summarize count(), avg(Duration) by UserId
The following purge predicate isn’t valid, since it references the result of the avg() aggregation:
MV | where avg_Duration > 1h
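By contrast, a predicate that references only the group by key of the aggregation (UserId in this example) would be a valid purge predicate; the value shown is just an illustrative placeholder:

MV | where UserId == 'user1'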
Related content
13.3 - Materialized views limitations
The materialized view source
- The source table of a materialized view:
  - Must be a table into which data is directly ingested, using an update policy, or ingest from query commands.
    - Using move extents or replace extents from other tables to the source table of the materialized view is only supported if using the setNewIngestionTime property as part of the move extents command (refer to .move extents and .replace extents commands for more details).
    - Moving extents to the source table of a materialized view, while not using setNewIngestionTime, can cause the move to fail with one of the following errors:
      - Cannot drop/move extents from/to table 'TableName' since Materialized View 'ViewName' is currently processing some of these extents.
      - Cannot move extents to 'TableName' since materialized view 'ViewName' will not process these extents (can lead to data loss in the materialized view).
- The source table of a materialized view must have the IngestionTime policy enabled. This policy is enabled by default.
- If the materialized view uses a default lookback, the ingestion_time() must be preserved in the materialized view's query. Operators such as mv-expand or pivot plugin don't preserve the ingestion_time(), so they can't be used in a materialized view with a lookback. For more information, see Lookback period.
- The source table of a materialized view can't be a table with a restricted view access policy.
- A materialized view can't be created on top of another materialized view, unless the first materialized view is of type take_any(*) aggregation. See materialized view over materialized view.
- Materialized views can't be defined over external tables.
Impact of records ingested to or dropped from the source table
- A materialized view only processes new records ingested into the source table. Records that are removed from the source table, either by running data purge/soft delete/drop extents, or due to retention policy or any other reason, have no impact on the materialized view.
- The materialized view has its own retention policy, which is independent of the retention policy of the source table. The materialized view might include records that aren’t present in the source table.
Follower databases
- Materialized views can’t be created in follower databases. Follower databases are read-only and materialized views require write operations.
- Materialized views can’t be created in database shortcuts. Database shortcuts are read-only and materialized views require write operations.
- Materialized views that are defined on leader databases can be queried from their followers, like any other table in the leader.
- Use the leader cluster to monitor follower database materialized views. For more information, see Materialized views in follower databases.
- Use the source Eventhouse to monitor shortcut database materialized views. For more information, see Monitor materialized views.
Other
- Cursor functions can’t be used on top of materialized views.
- Continuous export from a materialized view isn’t supported.
Related content
13.4 - Materialized views policies
This article includes information about policies that can be set on materialized views.
Retention and caching policy
A materialized view has a retention policy and caching policy. The materialized view derives the database retention and caching policies by default. These policies can be changed using retention policy management commands or caching policy management commands.
Both policies are applied on the materialized part of the materialized view only. For an explanation of the differences between the materialized part and delta part, see how materialized views work. For example, if the caching policy of a materialized view is set to 7d, but the caching policy of its source table is set to 0d, there may still be disk misses when querying the materialized view. This behavior occurs because the source table (delta part) also participates in the query.
The retention policy of the materialized view is unrelated to the retention policy of the source table. The retention policy of the source table can be shorter than the retention policy of the materialized view, if source records are required for a shorter period. We recommend a minimum retention policy of at least a few days, and recoverability set to true on the source table. This setting allows for fast recovery from errors and for diagnostic purposes.
The retention and caching policies both depend on Extent Creation time. The last update for a record determines the extent creation time for a materialized view.
Partitioning policy
A partitioning policy can be applied on a materialized view. We recommend configuring a partitioning policy on a materialized view only when most or all of the view queries filter by one of the materialized view’s group-by keys. This situation is common in multi-tenant solutions, where one of the materialized view’s group-by keys is the tenant’s identifier (for example, tenantId, customerId). For more information, see the first use case described in the partitioning policy supported scenarios page.
For the commands to alter a materialized view’s partitioning policy, see partitioning policy commands.
Adding a partitioning policy on a materialized view increases the number of extents in the materialized view, and creates more “work” for the materialization process. For more information on the reason for this behavior, see the extents rebuild process mentioned in how materialized views work.
Row level security policy
A row level security policy can be applied on a materialized view, with several limitations:
- The policy can be applied only to materialized views with arg_max()/arg_min()/take_any() aggregation functions, or when the row level security query references the group by keys of the materialized view aggregation.
- The policy is applied to the materialized part of the view only.
- If the same row level security policy isn’t defined on the source table of the materialized view, then querying the materialized view may return records that should be hidden by the policy. This happens because querying the materialized view queries the source table as well.
- We recommend defining the same row level security policy both on the source table and the materialized view if the view is an arg_max() or arg_min()/take_any().
- When defining a row level security policy on the source table of an arg_max() or arg_min()/take_any() materialized view, the command fails if there’s no row level security policy defined on the materialized view itself. The purpose of the failure is to alert the user of a potential data leak, since the materialized view may expose information. To mitigate this error, do one of the following actions:
- Define the row level security policy over the materialized view.
- Choose to ignore the error by adding the allowMaterializedViewsWithoutRowLevelSecurity property to the alter policy command. For example:
.alter table SourceTable policy row_level_security enable with (allowMaterializedViewsWithoutRowLevelSecurity=true) "RLS_function"
For commands for configuring a row level security policy on a materialized view, see row_level_security policy commands.
13.5 - Materialized views use cases
Materialized views expose an aggregation query over a source table or another materialized view. This article covers common and advanced use cases for materialized views.
Common use cases
The following are common scenarios that can be addressed by using a materialized view:
Update data: Update data by returning the last record per entity using arg_max() (aggregation function). For example, create a view that only materializes records ingested from now on:

.create materialized-view ArgMax on table T
{
    T | summarize arg_max(Timestamp, *) by User
}

Reduce the resolution of data: Reduce the resolution of data by calculating periodic statistics over the raw data. Use various aggregation functions by period of time. For example, maintain an up-to-date snapshot of distinct users per day:

.create materialized-view UsersByDay on table T
{
    T | summarize dcount(User) by bin(Timestamp, 1d)
}

Deduplicate records: Deduplicate records in a table using take_any() (aggregation function). For example, create a materialized view that deduplicates the source table based on the EventId column, using a lookback of 6 hours. Records are deduplicated against only records ingested 6 hours before current records.

.create materialized-view with(lookback=6h) DeduplicatedTable on table T
{
    T | summarize take_any(*) by EventId
}

[!NOTE] You can conceal the source table by creating a function with the same name as the table that references the materialized view instead (a sketch of such a function follows below). This pattern ensures that callers querying the table access the deduplicated materialized view because functions override tables with the same name. To avoid cyclic references in the view definition, use the table() function to reference the source table:

.create materialized-view DeduplicatedTable on table T
{
    table('T') | summarize take_any(*) by EventId
}
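One possible shape of the concealing function mentioned in the note, shown only as a sketch (it assumes the deduplicating view above is named DeduplicatedTable and the source table is named T):

.create-or-alter function T()
{
    materialized_view('DeduplicatedTable')
}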
For more examples, see the .create materialized-view command.
Advanced scenario
You can use a materialized view for create/update/delete event processing. For records with incomplete or outdated information in each column, a materialized view can provide the latest updates for each column, excluding entities that were deleted.
Consider the following input table named Events:
Input
| Timestamp | cud | ID | col1 | col2 | col3 |
|---|---|---|---|---|---|
| 2023-10-24 00:00:00.0000000 | C | 1 | 1 | 2 | |
| 2023-10-24 01:00:00.0000000 | U | 1 | | 22 | 33 |
| 2023-10-24 02:00:00.0000000 | U | 1 | | 23 | |
| 2023-10-24 00:00:00.0000000 | C | 2 | 1 | 2 | |
| 2023-10-24 00:10:00.0000000 | U | 2 | | 4 | |
| 2023-10-24 02:00:00.0000000 | D | 2 | | | |
Create a materialized view to get the latest update per column, using the arg_max() aggregation function:
.create materialized-view ItemHistory on table Events
{
Events
| extend Timestamp_col1 = iff(isnull(col1), datetime(1970-01-01), Timestamp),
Timestamp_col2 = iff(isnull(col2), datetime(1970-01-01), Timestamp),
Timestamp_col3 = iff(isnull(col3), datetime(1970-01-01), Timestamp)
| summarize arg_max(Timestamp_col1, col1), arg_max(Timestamp_col2, col2), arg_max(Timestamp_col3, col3), arg_max(Timestamp, cud) by id
}
Output
| ID | Timestamp_col1 | col1 | Timestamp_col2 | col2 | Timestamp_col3 | col3 | Timestamp | cud |
|---|---|---|---|---|---|---|---|---|
| 2 | 2023-10-24 00:00:00.0000000 | 1 | 2023-10-24 00:10:00.0000000 | 4 | 1970-01-01 00:00:00.0000000 | | 2023-10-24 02:00:00.0000000 | D |
| 1 | 2023-10-24 00:00:00.0000000 | 1 | 2023-10-24 02:00:00.0000000 | 23 | 2023-10-24 01:00:00.0000000 | 33 | 2023-10-24 02:00:00.0000000 | U |
You can create a stored function to further clean the results:
ItemHistory
| project Timestamp, cud, id, col1, col2, col3
| where cud != "D"
| project-away cud
Final Output
The latest update for each column for ID 1, since ID 2 was deleted.
| Timestamp | ID | col1 | col2 | col3 |
|---|---|---|---|---|
| 2023-10-24 02:00:00.0000000 | 1 | 1 | 23 | 33 |
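For example, one possible sketch of such a stored function (the name ItemHistoryLatest is hypothetical) that wraps the cleanup query above:

.create-or-alter function ItemHistoryLatest()
{
    ItemHistory
    | project Timestamp, cud, id, col1, col2, col3
    | where cud != "D"
    | project-away cud
}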
Materialized views vs. update policies
Materialized views and update policies work differently and serve different use cases. Use the following guidelines to identify which one you should use:
Materialized views are suitable for aggregations, while update policies aren’t. Update policies run separately for each ingestion batch, and therefore can only perform aggregations within the same ingestion batch. If you require an aggregation query, always use materialized views.
Update policies are useful for data transformations, enrichments with dimension tables (usually using lookup operator) and other data manipulations that can run in the scope of a single ingestion.
Update policies run during ingestion time. Data isn’t available for queries in the source table or the target table until all update policies run. Materialized views, on the other hand, aren’t part of the ingestion pipeline. The materialization process runs periodically in the background, post ingestion. Records in source table are available for queries before they’re materialized.
Both update policies and materialized views can incorporate joins, but their effectiveness is limited to specific scenarios. Specifically, joins are suitable only when the data required for the join from both sides is accessible at the time of the update policy or materialization process. If matching entities are ingested when the update policy or materialization runs, there's a risk of overlooking data. See more about dimension tables in materialized view query parameter and in fact and dimension tables.
Related content
13.6 - Monitor materialized views
Monitor the materialized view’s health in the following ways:
- Monitor materialized views metrics in the Azure portal with Azure Monitor. Use the materialized view age metric, MaterializedViewAgeSeconds, as the primary metric to monitor the freshness of the view.
- Monitor materialized view metrics in your Microsoft Fabric workspace. Use the materialized view age metric, MaterializedViewAgeSeconds, as the primary metric to monitor the freshness of the view. For more information, see Enable monitoring in your workspace.
- Monitor the IsHealthy property using .show materialized-view.
- Check for failures using .show materialized-view failures.
Troubleshooting unhealthy materialized views
If the MaterializedViewAge metric constantly increases, and the MaterializedViewHealth metric shows that the view is unhealthy, follow these recommendations to identify the root cause:
Check the number of materialized views on the cluster, and the current capacity for materialized views:
.show capacity
| where Resource == "MaterializedView"
| project Resource, Total, Consumed
Resource Total Consumed MaterializedView 1 0 - The number of materialized views that can run concurrently depends on the capacity shown in the
Totalcolumn, while theConsumedcolumn shows the number of materialized views currently running. You can use the Materialized views capacity policy to specify the minimum and maximum number of concurrent operations, overriding the system’s default concurrency level. The system determines the current concurrency, shown inTotal, based on the cluster’s available resources. The following example overrides the system’s decision and changes the minimum concurrent operations from one to three:
.alter-merge cluster policy capacity '{ "MaterializedViewsCapacity": { "ClusterMinimumConcurrentOperations": 3 } }'- If you explicitly change this policy, monitor the cluster’s health and ensure that other workloads aren’t affected by this change.
- The number of materialized views that can run concurrently depends on the capacity shown in the
Check if there are failures during the materialization process using `.show materialized-view failures`.

- If the error is permanent, the system automatically disables the materialized view. To check if it's disabled, use the `.show materialized-view` command and see if the value in the `IsEnabled` column is `false`. Then check the journal for the disabled event with the `.show journal` command. An example of a permanent failure is a source table schema change that makes it incompatible with the materialized view. For more information, see the `.create materialized-view` command.
- If the failure is transient, the system automatically retries the operation. However, the failure can delay the materialization and increase the age of the materialized view. This type of failure occurs, for example, when hitting memory limits or with a query time-out. See the following recommendations for more ways to troubleshoot transient failures.
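A minimal sketch of these checks (`ViewName` is a placeholder; the journal filter on the `Event` column is an assumption and might need adjusting):

```kusto
// List recent materialization failures for a specific view.
.show materialized-view ViewName failures

// Confirm whether the view was automatically disabled.
.show materialized-view ViewName
| project Name, IsEnabled

// Search the journal for materialized-view related events.
.show journal
| where Event has "MATERIALIZED-VIEW"
| project EventTimestamp, Event, EntityName
```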
Analyze the materialization process using the `.show commands-and-queries` command. Replace `DatabaseName` and `ViewName` to filter for a specific view:

```kusto
.show commands-and-queries
| where Database == "DatabaseName" and ClientActivityId startswith "DN.MaterializedViews;ViewName;"
```

- Check the memory consumption in the `MemoryPeak` column to identify any operations that failed due to hitting memory limits, such as runaway queries. By default, the materialization process is limited to a 15-GB memory peak per node. If the queries or commands executed during the materialization process exceed this value, the materialization fails due to memory limits. To increase the memory peak per node, alter the `$materialized-views` workload group. The following example alters the materialized views workload group to use a maximum of 64-GB memory peak per node during materialization:

  ```kusto
  .alter-merge workload_group ['$materialized-views'] '{ "RequestLimitsPolicy": { "MaxMemoryPerQueryPerNode": { "Value": 68718888960 } } }'
  ```

  > [!NOTE]
  > `MaxMemoryPerQueryPerNode` can't exceed 50% of the total memory available on each node.

- Check if the materialization process is hitting cold cache. The following example shows cache statistics over the past day for the materialized view, `ViewName`:

  ```kusto
  .show commands-and-queries
  | where ClientActivityId startswith "DN.MaterializedViews;ViewName"
  | where StartedOn > ago(1d)
  | extend HotCacheHits = tolong(CacheStatistics.Shards.Hot.HitBytes),
           HotCacheMisses = tolong(CacheStatistics.Shards.Hot.MissBytes),
           HotCacheRetrieved = tolong(CacheStatistics.Shards.Hot.RetrieveBytes),
           ColdCacheHits = tolong(CacheStatistics.Shards.Cold.HitBytes),
           ColdCacheMisses = tolong(CacheStatistics.Shards.Cold.MissBytes),
           ColdCacheRetrieved = tolong(CacheStatistics.Shards.Cold.RetrieveBytes)
  | summarize HotCacheHits = format_bytes(sum(HotCacheHits)),
              HotCacheMisses = format_bytes(sum(HotCacheMisses)),
              HotCacheRetrieved = format_bytes(sum(HotCacheRetrieved)),
              ColdCacheHits = format_bytes(sum(ColdCacheHits)),
              ColdCacheMisses = format_bytes(sum(ColdCacheMisses)),
              ColdCacheRetrieved = format_bytes(sum(ColdCacheRetrieved))
  ```

  Output

  | HotCacheHits | HotCacheMisses | HotCacheRetrieved | ColdCacheHits | ColdCacheMisses | ColdCacheRetrieved |
  |---|---|---|---|---|---|
  | 26 GB | 0 Bytes | 0 Bytes | 1 GB | 0 Bytes | 866 MB |

  - If the view isn't fully in the hot cache, materialization can experience disk misses, significantly slowing down the process.
  - Increasing the caching policy for the materialized view helps avoid cache misses. For more information, see hot and cold cache and caching policy, and the `.alter materialized-view policy caching` command.

- Check if the materialization is scanning old records by checking the `ScannedExtentsStatistics` with the `.show queries` command. If the number of scanned extents is high and the `MinDataScannedTime` is old, the materialization cycle needs to scan all, or most, of the materialized part of the view. The scan is needed to find intersections with the delta. For more information about the delta and the materialized part, see How materialized views work. The following recommendations provide ways to reduce the amount of data scanned in materialization cycles by minimizing the intersection with the delta.
If the materialization cycle scans a large amount of data, potentially including cold cache, consider making the following changes to the materialized view definition (see the sketch after this list):

- Include a `datetime` group-by key in the view definition. This can significantly reduce the amount of data scanned, as long as there is no late-arriving data in this column. For more information, see Performance tips. You need to create a new materialized view, since updates to group-by keys aren't supported.
- Use a `lookback` as part of the view definition. For more information, see the supported properties of `.create materialized-view`.
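A minimal sketch combining both suggestions, assuming a source table `SourceTable` with columns `Id` and `Timestamp` (all names and the 6-hour lookback are illustrative):

```kusto
// Illustrative: a datetime group-by key (binned Timestamp) plus a lookback limits
// how much of the materialized part each cycle must scan to intersect with the delta.
.create materialized-view with (lookback=6h) LatestPerIdDaily on table SourceTable
{
    SourceTable
    | summarize arg_max(Timestamp, *) by Id, bin(Timestamp, 1d)
}
```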
Check whether there's enough ingestion capacity by verifying whether either the `MaterializedViewResult` metric or the `IngestionUtilization` metric shows `InsufficientCapacity` values. You can increase ingestion capacity by scaling the available resources (preferred) or by altering the ingestion capacity policy.

If the materialized view is still unhealthy, the service doesn't have sufficient capacity or resources to materialize all the data on time. Consider the following options (a sketch of splitting a view follows this list):

- Scale out the cluster by increasing the minimum instance count. Optimized autoscale doesn't take materialized views into consideration and doesn't scale out the cluster automatically if materialized views are unhealthy. You need to set the minimum instance count to provide the cluster with more resources to accommodate materialized views.
- Scale out the Eventhouse to provide it with more resources to accommodate materialized views. For more information, see Enable minimum consumption.
- Divide the materialized view into several smaller views, each covering a subset of the data. For instance, you can split them based on a high-cardinality key from the materialized view's group-by keys. All views are based on the same source table, and each view filters by `SourceTable | where hash(key, number_of_views) == i`, where `i` is part of the set `{0,1,…,number_of_views-1}`. Then, you can define a stored function that unions all the smaller materialized views. Use this function in queries to access the combined data.
While splitting the view might increase CPU usage, it reduces the memory peak in materialization cycles. Reducing the memory peak can help if the single view is failing due to memory limits.
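A minimal sketch of splitting one view into two by hashing a group-by key, assuming a source table `SourceTable` with columns `Id` and `Timestamp` (all names are illustrative):

```kusto
// Each part covers half of the key space.
.create materialized-view LatestPerId_0 on table SourceTable
{
    SourceTable
    | where hash(Id, 2) == 0
    | summarize arg_max(Timestamp, *) by Id
}

.create materialized-view LatestPerId_1 on table SourceTable
{
    SourceTable
    | where hash(Id, 2) == 1
    | summarize arg_max(Timestamp, *) by Id
}

// A stored function that unions the parts; queries use LatestPerId() instead of the individual views.
.create-or-alter function LatestPerId() {
    union materialized_view("LatestPerId_0"), materialized_view("LatestPerId_1")
}
```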
MaterializedViewResult metric
The `MaterializedViewResult` metric provides information about the result of a materialization cycle and can be used to identify issues in the materialized view health status. The metric includes the `Database`, `MaterializedViewName`, and `Result` dimensions.
The `Result` dimension can have one of the following values:
- `Success`: The materialization completed successfully.
- `SourceTableNotFound`: The source table of the materialized view was dropped, so the materialized view is disabled automatically.
- `SourceTableSchemaChange`: The schema of the source table changed in a way that isn't compatible with the materialized view definition. Because the materialized view query no longer matches the materialized view schema, the materialized view is disabled automatically.
- `InsufficientCapacity`: The instance doesn't have sufficient capacity to materialize the materialized view, due to a lack of ingestion capacity. While insufficient capacity failures can be transient, if they reoccur often, try scaling out the instance or increasing the relevant capacity in the policy. For more information, see Plan your capacity size.
- `InsufficientResources`: The database doesn't have sufficient resources (CPU/memory) to materialize the materialized view. While insufficient resource errors might be transient, if they reoccur often, try scaling up or scaling out. For more ideas, see Troubleshooting unhealthy materialized views.
Materialized views in follower databases
Materialized views can be defined in follower databases. However, the monitoring of these materialized views should be based on the leader database, where the materialized view is defined. Specifically:
- Metrics related to materialized view execution (`MaterializedViewResult`, `MaterializedViewExtentsRebuild`) are only present in the leader database. Metrics related to monitoring (`MaterializedViewAgeSeconds`, `MaterializedViewHealth`, `MaterializedViewRecordsInDelta`) also appear in the follower databases.
- The `.show materialized-view failures` command only works in the leader database.
Track resource consumption
The resources consumed by the materialized view materialization process can be tracked using the `.show commands-and-queries` command. Filter the records for a specific view using the following query (replace `DatabaseName` and `ViewName`):

```kusto
.show commands-and-queries
| where Database == "DatabaseName" and ClientActivityId startswith "DN.MaterializedViews;ViewName;"
```
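For example, a sketch that summarizes daily resource usage for a single view (`DatabaseName` and `ViewName` are placeholders; `Duration`, `MemoryPeak`, and `StartedOn` are columns of the command's output):

```kusto
// Daily count, total duration, and peak memory of materialization operations for one view.
.show commands-and-queries
| where Database == "DatabaseName" and ClientActivityId startswith "DN.MaterializedViews;ViewName;"
| summarize Cycles = count(), TotalDuration = sum(Duration), MaxMemoryPeak = max(MemoryPeak) by bin(StartedOn, 1d)
```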
Related content
14 - Stored query results
14.1 - Stored query results
Stored query results store the result of a query on the service for up to 24 hours. The same principal identity that created the stored query can reference the results in later queries.
Stored query results can be useful in the following scenarios:
- Paging through query results. The initial command runs the query and returns the first “page” of records. Later queries reference other “pages” without the need to rerun the query.
- Drill-down scenarios, in which the results of an initial query are then explored using other queries.
Updates to security policies, such as database access and row-level security, aren't propagated to stored query results. If a user's permissions are revoked, use `.drop stored_query_results` to delete their stored query results.
Stored query results behave like tables, in that the order of records isn't preserved. To paginate through the results, we recommend that the query include unique ID columns. If a query returns multiple result sets, only the first result set is stored.
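A minimal paging sketch, assuming a source table named `MyEvents` (the stored query result name `MyEventsSqr`, the `RowNum` column, and the page size are illustrative):

```kusto
// Run the query once and store the result with a stable row number for paging.
.set stored_query_result MyEventsSqr <|
    MyEvents
    | order by Timestamp desc
    | extend RowNum = row_number()
```

```kusto
// Later, fetch the second page (rows 101-200) without rerunning the original query.
stored_query_result("MyEventsSqr")
| where RowNum between (101 .. 200)
```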
The following table lists the management commands and functions used for managing stored query results:
| Command | Description |
|---|---|
| .set stored_query_result command | Creates a stored query result to store the results of a query on the service for up to 24 hours. |
| .show stored_query_result command | Shows information on active query results. |
| .drop stored_query_result command | Deletes active query results. |
| stored_query_result() | Retrieves a stored query result. |
Related content
15 - Tables
15.1 - Avrotize k2a tool
Avrotize is a versatile tool for converting data and database schema formats, and generating code in various programming languages. The tool supports the conversion of Kusto table schemas to Apache Avro format and vice versa with the Convert Kusto table definition to Avrotize Schema command. The tool handles dynamic columns in Kusto tables by:
- Inferring the schema through sampling
- Resolving arrays and records at any level of nesting
- Detecting conflicting schemas
- Creating type unions for each different schema branch
Convert table definition to AVRO format
You can use the avrotize k2a command to connect to a Kusto database and create an Avro schema with a record type for each of the tables in the database.
The following are examples of how to use the command:
- Create an Avro schema with a top-level union with a record for each table:

  ```
  avrotize k2a --kusto-uri <Uri> --kusto-database <DatabaseName> --avsc <AvroFilename.avsc>
  ```

- Create an xRegistry catalog file with CloudEvent wrappers and per-event schemas:

  In the following example, you create xRegistry catalog files with schemas for each table. If the input table contains CloudEvents identified by columns like `id`, `source`, and `type`, the tool creates separate schemas for each event type.

  ```
  avrotize k2a --kusto-uri <URI> --kusto-database <DatabaseName> --avsc <AvroFilename.xreg.json> --emit-cloudevents-xregistry --avro-namespace <AvroNamespace>
  ```
Convert AVRO schema to Kusto table declaration
You can use the avrotize a2k command to create KQL table declarations, along with JSON ingestion mappings, from an Avro schema. It can also include docstrings in the table declarations, extracted from the "doc" annotations in the Avro record types.
If the Avro schema is a single record type, the output script includes a .create table command for the record. The record fields are converted into columns in the table. If the Avro schema is a type union (a top-level array), the output script emits a separate .create table command for each record type in the union.
```
avrotize a2k .\<AvroFilename.avsc> --out <KustoFilename.kql>
```
The Avrotize tool is capable of converting JSON Schema, XML Schema, ASN.1 Schema, and Protobuf 2 and Protobuf 3 schemas into Avro schema. You can first convert the source schema into an Avro schema to normalize it and then convert it into Kusto schema.
For example, the following command converts an input JSON Schema document, "address.json", into a normalized Avro schema:

```
avrotize j2a address.json --out address.avsc
```

Then convert the Avro schema file into Kusto schema:

```
avrotize a2k address.avsc --out address.kql
```

You can also chain the commands together to convert from JSON Schema via Avro into Kusto schema:

```
avrotize j2a address.json | avrotize a2k --out address.kql
```
Related content
15.2 - Tables management
This topic discusses the life cycle of tables and the associated management commands that are helpful for exploring, creating, and altering tables. Select the links in the following table for more information.
For information on optimizing table schema, see Schema optimization best practices.
| Commands | Operation |
|---|---|
.alter table docstring, .alter table folder | Manage table display properties |
.create ingestion mapping, .show ingestion mappings, .alter ingestion mapping, .drop ingestion mapping | Manage ingestion mapping |
.create tables, .create table, .create-merge tables, .create-merge table, .alter table, .alter-merge table, .drop tables, .drop table, .undo drop table, .rename table | Create/modify/drop tables |
.show tables, .show table details, .show table schema | Enumerate tables in a database |
.ingest, .set, .append, .set-or-append (see Data ingestion overview). | Data ingestion into a table |
.clear table data | Clears all the data of a table |
CRUD naming conventions for tables
(For full details, see the sections linked to in the preceding table.)
| Command syntax | Semantics |
|---|---|
.create entityType entityName ... | If an entity of that type and name exists, returns the entity. Otherwise, create the entity. |
.create-merge entityType entityName... | If an entity of that type and name exists, merge the existing entity with the specified entity. Otherwise, create the entity. |
.alter entityType entityName ... | If an entity of that type and name does not exist, error. Otherwise, replace it with the specified entity. |
.alter-merge entityType entityName ... | If an entity of that type and name does not exist, error. Otherwise, merge it with the specified entity. |
.drop entityType entityName ... | If an entity of that type and name does not exist, error. Otherwise, drop it. |
.drop entityType entityName ifexists ... | If an entity of that type and name does not exist, return. Otherwise, drop it. |
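For example, a sketch of how these verbs behave against a hypothetical table `T` (the comments describe the expected outcome of each command):

```kusto
.create table T (Id: long, Name: string)        // Creates T; if T already exists, returns it without changes.
.create-merge table T (Id: long, City: string)  // Merges schemas; T now has columns Id, Name, and City.
.alter table T (Id: long, City: string)         // Replaces the schema; fails if T doesn't exist.
.drop table T ifexists                          // Drops T, or returns without error if T doesn't exist.
```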