infer_storage_schema plugin

Learn how to use the infer_storage_schema plugin to infer the schema of external data.

This plugin infers the schema of external data and returns it as a CSL schema string. The string can be used when creating external tables. The plugin is invoked with the evaluate operator.

Authentication and authorization

In the properties of the request, you specify the storage connection strings to access. Each storage connection string specifies the authorization method used to access the storage. Depending on the authorization method, the principal may need to be granted permissions on the external storage to perform the schema inference.

The following table lists the supported authentication methods and any required permissions by storage type.

| Authentication method | Azure Blob Storage / Data Lake Storage Gen2 | Data Lake Storage Gen1 |
|---|---|---|
| Impersonation | Storage Blob Data Reader | Reader |
| Shared Access (SAS) token | List + Read | This authentication method isn't supported in Gen1. |
| Microsoft Entra access token | | |
| Storage account access key | | This authentication method isn't supported in Gen1. |
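
As a minimal sketch of how the authentication method is carried by the connection string, the following request uses impersonation. It assumes the standard Kusto storage connection string form, in which the `;impersonate` suffix selects impersonation. The storage account and container names are placeholders.

```kusto
// Assumption: ';impersonate' selects impersonation, so the calling principal
// needs the Storage Blob Data Reader role on the storage account (see the table above).
// 'storageaccount' and 'MobileEvents' are placeholder names.
let options = dynamic({
  'StorageContainers': [
    h@'https://storageaccount.blob.core.windows.net/MobileEvents;impersonate'
  ],
  'DataFormat': 'parquet'
});
evaluate infer_storage_schema(options)
```

The h prefix on the string literal obfuscates the connection string, which matters most when the string embeds a secret such as an account key or SAS token.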

Syntax

evaluate infer_storage_schema( Options )

Parameters

| Name | Type | Required | Description |
|---|---|---|---|
| Options | dynamic | ✔️ | A property bag specifying the properties of the request. |

Supported properties of the request

| Name | Type | Required | Description |
|---|---|---|---|
| StorageContainers | dynamic | ✔️ | An array of storage connection strings that represent the prefix URIs for stored data artifacts. |
| DataFormat | string | ✔️ | One of the supported data formats. |
| FileExtension | string | | If specified, the function only scans files ending with this file extension. Specifying the extension may speed up the process or eliminate data reading issues. |
| FileNamePrefix | string | | If specified, the function only scans files starting with this prefix. Specifying the prefix may speed up the process. |
| Mode | string | | The schema inference strategy. Valid values are any, last, and all: the function infers the data schema from the first found file, from the last written file, or from all files, respectively. The default value is last. |
| InferenceOptions | dynamic | | More inference options. Valid options: UseFirstRowAsHeader for delimited file formats. For example, 'InferenceOptions': {'UseFirstRowAsHeader': true}. |
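
The optional properties can be combined in a single request. The following sketch infers a schema from delimited files, scanning every matching file and taking column names from the header row. The storage account name, the container name `logs`, and `secretKey` are placeholders, not values from a real deployment.

```kusto
// Placeholder storage account, container, and key; substitute your own.
let options = dynamic({
  'StorageContainers': [
    h@'https://storageaccount.blob.core.windows.net/logs;secretKey'
  ],
  'DataFormat': 'csv',
  'FileExtension': '.csv',                            // only scan files ending in .csv
  'Mode': 'all',                                      // infer from all files instead of the default 'last'
  'InferenceOptions': {'UseFirstRowAsHeader': true}   // take column names from the first row
});
evaluate infer_storage_schema(options)
```

Scanning all files is likely to take longer than the default of reading only the last written file, so narrowing the scan with FileExtension or FileNamePrefix may help.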

Returns

The infer_storage_schema plugin returns a single result table with a single row and a single column that contains the CSL schema string.

Example

let options = dynamic({
  'StorageContainers': [
    h@'https://storageaccount.blob.core.windows.net/MobileEvents;secretKey'
  ],
  'FileExtension': '.parquet',
  'FileNamePrefix': 'part-',
  'DataFormat': 'parquet'
});
evaluate infer_storage_schema(options)

Output

CslSchema
app_id:string, user_id:long, event_time:datetime, country:string, city:string, device_type:string, device_vendor:string, ad_network:string, campaign:string, site_id:string, event_type:string, event_name:string, organic:string, days_from_install:int, revenue:real
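
Because the schema comes back as one string, it can be easier to review when split into one row per column. The following is a rough sketch that uses a shortened literal copy of the schema string above, so it runs without storage access. The column names ColumnDef, ColumnName, and ColumnType are illustrative.

```kusto
// Stand-in for the plugin output: a single CslSchema cell (shortened sample).
print CslSchema = 'app_id:string, user_id:long, event_time:datetime'
// One row per "name:type" pair.
| mv-expand ColumnDef = split(CslSchema, ',') to typeof(string)
// Remove the space left after each comma.
| extend ColumnDef = trim(@'\s+', ColumnDef)
// Separate the column name from its type.
| parse ColumnDef with ColumnName ':' ColumnType
| project ColumnName, ColumnType
```

To apply this to real data, the print line would be replaced by the evaluate infer_storage_schema(options) call, on the assumption that the plugin output can be piped into further operators like any other tabular result.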

Use the returned schema in an external table definition:

.create external table MobileEvents(
    app_id:string, user_id:long, event_time:datetime, country:string, city:string, device_type:string, device_vendor:string, ad_network:string, campaign:string, site_id:string, event_type:string, event_name:string, organic:string, days_from_install:int, revenue:real
)
kind=blob
partition by (dt:datetime = bin(event_time, 1d), app:string = app_id)
pathformat = ('app=' app '/dt=' datetime_pattern('yyyyMMdd', dt))
dataformat = parquet
(
    h@'https://storageaccount.blob.core.windows.net/MobileEvents;secretKey'
)
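
Once the external table exists, a quick read is one way to check that the inferred schema matches the stored data. This sketch assumes the table was created with the name MobileEvents, as in the command above.

```kusto
// Read a few rows through the external table to spot-check column names and types.
external_table('MobileEvents')
| take 10
```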