Machine learning plugins

1 - autocluster plugin

Learn how to use the autocluster plugin to find common patterns in data.

autocluster finds common patterns of discrete attributes (dimensions) in the data. It then reduces the results of the original query, whether it’s 100 or 100,000 rows, to a few patterns. The plugin was developed to help analyze failures (such as exceptions or crashes) but can potentially work on any filtered dataset. The plugin is invoked with the evaluate operator.

Syntax

T | evaluate autocluster ([SizeWeight [, WeightColumn [, NumSeeds [, CustomWildcard [, … ]]]]])

Parameters

The parameters must be ordered as specified in the syntax. To indicate that the default value should be used, input the tilde: ~. For more information, see Examples.

|Name|Type|Required|Description|
|--|--|--|--|
|T|string|✔️|The input tabular expression.|
|SizeWeight|double||A double between 0 and 1 that controls the balance between generic (high-coverage) and informative (many shared values) patterns. Increasing this value typically reduces the number of patterns while expanding coverage. Decreasing it generates more specific patterns with more shared values and a smaller percentage coverage. The default is 0.5. The formula is a weighted geometric mean with weights SizeWeight and 1 - SizeWeight.|
|WeightColumn|string||Considers each row in the input according to the specified weight. Each row has a default weight of 1. The argument must be the name of a numeric column. A common use of a weight column is to take into account sampling, bucketing, or aggregation of the data that is already embedded into each row.|
|NumSeeds|int||Determines the number of initial local search points. Adjusting the number of seeds affects result quantity and quality depending on the data structure. Increasing the number of seeds can improve results, at the cost of a slower query. Decreasing below five yields negligible improvements, while increasing above 50 rarely generates more patterns. The default is 25.|
|CustomWildcard|string||A type literal that sets the wildcard value for a specific type in the results table, indicating no restriction on this column. The default is null; for string columns, the default is an empty string. If the default is a viable value in the data, use a different wildcard value, such as *. You can include multiple custom wildcards by adding them consecutively.|
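The effect of SizeWeight can be illustrated with a small sketch of the weighted geometric mean it refers to. This is only an illustration of the scoring idea, not the plugin's actual implementation; the function name and the informativeness measure are assumptions:

```python
import math

def pattern_score(coverage: float, informativeness: float, size_weight: float = 0.5) -> float:
    """Weighted geometric mean of a pattern's coverage (fraction of input rows
    it captures) and its informativeness (a 0..1 proxy for how many column
    values it fixes), weighted by SizeWeight and 1 - SizeWeight."""
    return math.exp(size_weight * math.log(coverage)
                    + (1 - size_weight) * math.log(informativeness))

# A broad but uninformative pattern vs. a narrow but specific one:
broad, specific = (0.8, 0.2), (0.2, 0.8)
# With a high SizeWeight, the high-coverage pattern scores higher...
assert pattern_score(*broad, size_weight=0.9) > pattern_score(*specific, size_weight=0.9)
# ...and with a low SizeWeight, the informative pattern scores higher.
assert pattern_score(*broad, size_weight=0.1) < pattern_score(*specific, size_weight=0.1)
```

This is why raising SizeWeight yields fewer, broader patterns and lowering it yields more specific ones.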

Returns

The autocluster plugin usually returns a small set of patterns. The patterns capture portions of the data with shared common values across multiple discrete attributes. Each pattern in the results is represented by a row.

The first column is the segment ID. The next two columns are the count and percentage of rows out of the original query that are captured by the pattern. The remaining columns are from the original query. Their value is either a specific value from the column or a wildcard value (null by default), meaning variable values.

The patterns aren’t distinct, may be overlapping, and usually don’t cover all the original rows. Some rows may not fall under any pattern.

Examples

Using evaluate

T | evaluate autocluster()

Using autocluster

StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0 , "YES" , "NO")
| project State , EventType , Damage
| evaluate autocluster(0.6)

Output

|SegmentId|Count|Percent|State|EventType|Damage|
|--|--|--|--|--|--|
|0|2278|38.7||Hail|NO|
|1|512|8.7||Thunderstorm Wind|YES|
|2|898|15.3|TEXAS|||

Using custom wildcards

StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0 , "YES" , "NO")
| project State , EventType , Damage
| evaluate autocluster(0.2, '~', '~', '*')

Output

|SegmentId|Count|Percent|State|EventType|Damage|
|--|--|--|--|--|--|
|0|2278|38.7|*|Hail|NO|
|1|512|8.7|*|Thunderstorm Wind|YES|
|2|898|15.3|TEXAS|*|*|

2 - basket plugin

Learn how to use the basket plugin to find frequent patterns in data that exceed a frequency threshold.

The basket plugin finds frequent patterns of attributes in the data and returns the patterns that pass a frequency threshold in that data. A pattern represents a subset of the rows that have the same value across one or more columns. The basket plugin is based on the Apriori algorithm originally developed for basket analysis data mining.
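To make the Apriori idea concrete, here's a minimal sketch of the level-wise search over (column, value) items. This illustrates the algorithm the plugin is based on, not the plugin's actual implementation:

```python
from collections import Counter

def frequent_patterns(rows, threshold):
    """Level-wise (Apriori) search: count single (column, value) items, keep
    those whose support ratio meets the threshold, then grow surviving
    itemsets one item at a time, pruning below-threshold candidates."""
    n = len(rows)
    counts = Counter((col, val) for row in rows for col, val in row.items())
    frequent = {frozenset([item]) for item, c in counts.items() if c / n >= threshold}
    result = set(frequent)
    k = 2
    while frequent:
        # Candidate k-itemsets are unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        frequent = set()
        for cand in candidates:
            support = sum(all(row.get(col) == val for col, val in cand) for row in rows)
            if support / n >= threshold:
                frequent.add(cand)
        result |= frequent
        k += 1
    return result

rows = [
    {"State": "TEXAS", "Damage": "NO"},
    {"State": "TEXAS", "Damage": "NO"},
    {"State": "IOWA",  "Damage": "YES"},
]
patterns = frequent_patterns(rows, threshold=0.5)
assert frozenset([("State", "TEXAS"), ("Damage", "NO")]) in patterns  # 2/3 of rows
assert frozenset([("State", "IOWA")]) not in patterns                 # only 1/3 of rows
```

Each surviving itemset corresponds to one output row: the fixed (column, value) pairs, with wildcards in every other column.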

Syntax

T | evaluate basket ([ Threshold, WeightColumn, MaxDimensions, CustomWildcard, CustomWildcard, … ])

Parameters

|Name|Type|Required|Description|
|--|--|--|--|
|Threshold|double||A double in the range 0.015 to 1 that sets the minimal ratio of rows to be considered frequent. Patterns with a smaller ratio aren't returned. The default is 0.05. To use the default value, input the tilde: ~.|
|WeightColumn|string||The column name to use to weight each input row. Must be the name of a numeric column, such as int, long, or real. By default, each row has a weight of 1. To use the default value, input the tilde: ~. A common use of a weight column is to take into account sampling or bucketing/aggregation of the data that is already embedded into each row.|
|MaxDimensions|int||Sets the maximal number of uncorrelated dimensions per basket, limited by default to minimize the query runtime. The default is 5. To use the default value, input the tilde: ~.|
|CustomWildcard|string||Sets the wildcard value for a specific type in the result table, indicating that the current pattern has no restriction on this column. The default is null, except for string columns, whose default is an empty string. If the default is a viable value in the data, use a different wildcard value, such as *. To use the default value, input the tilde: ~.|

Returns

The basket plugin returns frequent patterns that pass a ratio threshold. The default threshold is 0.05.

Each pattern is represented by a row in the results. The first column is the segment ID. The next two columns are the count and percentage of rows from the original query that match the pattern. The remaining columns relate to the original query, with either a specific value from the column or a wildcard value (null by default), meaning a variable value.

Example

StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0 , "YES" , "NO")
| project State, EventType, Damage, DamageCrops
| evaluate basket(0.2)

Output

|SegmentId|Count|Percent|State|EventType|Damage|DamageCrops|
|--|--|--|--|--|--|--|
|0|4574|77.7|||NO|0|
|1|2278|38.7||Hail|NO|0|
|2|5675|96.4||||0|
|3|2371|40.3||Hail||0|
|4|1279|21.7||Thunderstorm Wind||0|
|5|2468|41.9||Hail|||
|6|1310|22.3|||YES||
|7|1291|21.9||Thunderstorm Wind|||

Example with custom wildcards

StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0 , "YES" , "NO")
| project State, EventType, Damage, DamageCrops
| evaluate basket(0.2, '~', '~', '*', int(-1))

Output

|SegmentId|Count|Percent|State|EventType|Damage|DamageCrops|
|--|--|--|--|--|--|--|
|0|4574|77.7|*|*|NO|0|
|1|2278|38.7|*|Hail|NO|0|
|2|5675|96.4|*|*|*|0|
|3|2371|40.3|*|Hail|*|0|
|4|1279|21.7|*|Thunderstorm Wind|*|0|
|5|2468|41.9|*|Hail|*|-1|
|6|1310|22.3|*|*|YES|-1|
|7|1291|21.9|*|Thunderstorm Wind|*|-1|

3 - diffpatterns plugin

Learn how to use the diffpatterns plugin to compare two datasets of the same structure to find the differences between the two datasets.

Compares two datasets of the same structure and finds patterns of discrete attributes (dimensions) that characterize differences between the two datasets. The plugin is invoked with the evaluate operator.

diffpatterns was developed to help analyze failures (for example, by comparing failures to non-failures in a given time frame), but can potentially find differences between any two datasets of the same structure.

Syntax

T | evaluate diffpatterns(SplitColumn, SplitValueA, SplitValueB [, WeightColumn, Threshold, MaxDimensions, CustomWildcard, …])

Parameters

|Name|Type|Required|Description|
|--|--|--|--|
|SplitColumn|string|✔️|The column name that tells the algorithm how to split the query into datasets. Based on the specified SplitValueA and SplitValueB arguments, the algorithm splits the query into two datasets, "A" and "B", and analyzes the differences between them. As such, the split column must have at least two distinct values.|
|SplitValueA|string|✔️|A string representation of one of the values in the specified SplitColumn. All rows that have this value in their SplitColumn are considered dataset "A".|
|SplitValueB|string|✔️|A string representation of one of the values in the specified SplitColumn. All rows that have this value in their SplitColumn are considered dataset "B".|
|WeightColumn|string||The column used to weight each input row. Must be the name of a numeric column, such as int, long, or real. By default, each row has a weight of 1. To use the default value, input the tilde: ~. A common use of a weight column is to take into account sampling or bucketing/aggregation of the data that is already embedded into each row.|
|Threshold|real||A real in the range 0.015 to 1 that sets the minimal pattern ratio difference between the two sets. The default is 0.05. To use the default value, input the tilde: ~.|
|MaxDimensions|int||Sets the maximum number of uncorrelated dimensions per result pattern. By specifying a limit, you decrease the query runtime. The default is unlimited. To use the default value, input the tilde: ~.|
|CustomWildcard|string||Sets the wildcard value for a specific type in the result table, indicating that the current pattern has no restriction on this column. The default is null, except for string columns, whose default is an empty string. If the default is a viable value in the data, use a different wildcard value, such as *. To use the default value, input the tilde: ~.|

Returns

diffpatterns returns a small set of patterns that capture different portions of the data in the two sets (that is, a pattern capturing a large percentage of the rows in the first dataset and low percentage of the rows in the second set). Each pattern is represented by a row in the results.

The result of diffpatterns returns the following columns:

  • SegmentId: the identity assigned to the pattern in the current query (note: IDs aren’t guaranteed to be the same in repeating queries).

  • CountA: the number of rows captured by the pattern in Set A (Set A is the equivalent of where tostring(splitColumn) == SplitValueA).

  • CountB: the number of rows captured by the pattern in Set B (Set B is the equivalent of where tostring(splitColumn) == SplitValueB).

  • PercentA: the percentage of rows in Set A captured by the pattern (100.0 * CountA / count(SetA)).

  • PercentB: the percentage of rows in Set B captured by the pattern (100.0 * CountB / count(SetB)).

  • PercentDiffAB: the absolute percentage point difference between A and B (|PercentA - PercentB|) is the main measure of significance of patterns in describing the difference between the two sets.

  • Rest of the columns: the original schema of the input. Each row (pattern) represents the intersection of the non-wildcard values of the columns (equivalent of where col1 == val1 and col2 == val2 and ... colN == valN for each non-wildcard value in the row).

For each pattern, columns that aren't set in the pattern (that is, without restriction on a specific value) contain a wildcard value, which is null by default. See the Parameters section above for how wildcards can be manually changed.

  • Note: the patterns are often not distinct. They may overlap, and usually don't cover all the original rows; some rows may not fall under any pattern.
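As an illustration of how these measures are computed for a single candidate pattern (the dict-based row and pattern representation here is an assumption for the sketch, not the plugin's internal form):

```python
def pattern_diff(rows, split_column, split_value_a, split_value_b, pattern):
    """Compute CountA, CountB, PercentA, PercentB, and PercentDiffAB for one
    candidate pattern (a dict of non-wildcard column -> value) over rows
    (dicts), splitting on the string form of split_column."""
    set_a = [r for r in rows if str(r[split_column]) == split_value_a]
    set_b = [r for r in rows if str(r[split_column]) == split_value_b]
    match = lambda r: all(r.get(col) == val for col, val in pattern.items())
    count_a = sum(match(r) for r in set_a)
    count_b = sum(match(r) for r in set_b)
    percent_a = 100.0 * count_a / len(set_a)
    percent_b = 100.0 * count_b / len(set_b)
    return count_a, count_b, percent_a, percent_b, abs(percent_a - percent_b)
```

A pattern is interesting when PercentDiffAB is large, that is, when it captures a very different share of the two sets.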

Example

StormEvents 
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0 , 1 , 0)
| project State , EventType , Source , Damage, DamageCrops
| evaluate diffpatterns(Damage, "0", "1" )

Output

|SegmentId|CountA|CountB|PercentA|PercentB|PercentDiffAB|State|EventType|Source|DamageCrops|
|--|--|--|--|--|--|--|--|--|--|
|0|2278|93|49.8|7.1|42.7||Hail||0|
|1|779|512|17.03|39.08|22.05||Thunderstorm Wind|||
|2|1098|118|24.01|9.01|15|||Trained Spotter|0|
|3|136|158|2.97|12.06|9.09|||Newspaper||
|4|359|214|7.85|16.34|8.49||Flash Flood|||
|5|50|122|1.09|9.31|8.22|IOWA||||
|6|655|279|14.32|21.3|6.98|||Law Enforcement||
|7|150|117|3.28|8.93|5.65||Flood|||
|8|362|176|7.91|13.44|5.52|||Emergency Manager||

4 - diffpatterns_text plugin

Learn how to use the diffpatterns_text plugin to compare two string value datasets to find the differences between the two datasets.

Compares two datasets of string values and finds text patterns that characterize differences between the two datasets. The plugin is invoked with the evaluate operator.

The diffpatterns_text plugin returns a set of text patterns that capture different portions of the data in the two sets. For example, a pattern capturing a large percentage of the rows when the condition is true and a low percentage of the rows when the condition is false. The patterns are built from consecutive tokens separated by white space, with a token from the text column or a * representing a wildcard. Each pattern is represented by a row in the results.
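The token-and-wildcard matching described above can be sketched as follows. This is a simplified illustration of the pattern semantics; the plugin's actual matching and pattern mining are internal:

```python
def matches(pattern: str, text: str) -> bool:
    """Check whether a diffpatterns_text-style pattern (whitespace-separated
    tokens, with '*' standing for any single token) occurs as a consecutive
    run of tokens somewhere in the text."""
    pat = pattern.split()
    tokens = text.split()
    return any(
        all(p == "*" or p == t for p, t in zip(pat, tokens[start:start + len(pat)]))
        for start in range(len(tokens) - len(pat) + 1)
    )

assert matches("Below normal *", "Temperatures were Below normal again today")
assert not matches("Below normal *", "Below normal")  # '*' still needs a token
```

The plugin then reports, for each mined pattern, how many rows match it in the true and false sets.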

Syntax

T | evaluate diffpatterns_text(TextColumn, BooleanCondition [, MinTokens, Threshold, MaxTokens])

Parameters

|Name|Type|Required|Description|
|--|--|--|--|
|TextColumn|string|✔️|The text column to analyze.|
|BooleanCondition|string|✔️|An expression that evaluates to a boolean value. The algorithm splits the query into the two datasets to compare based on this expression.|
|MinTokens|int||An integer value between 0 and 200 that represents the minimal number of non-wildcard tokens per result pattern. The default is 1.|
|Threshold|decimal||A decimal value between 0.015 and 1 that sets the minimal pattern ratio difference between the two sets. The default is 0.05. See diffpatterns.|
|MaxTokens|int||An integer value between 0 and 20 that sets the maximal number of tokens per result pattern. Specifying a lower limit decreases the query runtime.|

Returns

The result of diffpatterns_text returns the following columns:

  • Count_of_True: The number of rows matching the pattern when the condition is true.
  • Count_of_False: The number of rows matching the pattern when the condition is false.
  • Percent_of_True: The percentage of rows matching the pattern from the rows when the condition is true.
  • Percent_of_False: The percentage of rows matching the pattern from the rows when the condition is false.
  • Pattern: The text pattern containing tokens from the text string and ‘*’ for wildcards.

Example

The following example uses data from the StormEvents table in the help cluster, which is publicly available in the Weather analytics sample data. To access this data, sign in to https://dataexplorer.azure.com/clusters/help/databases/Samples. In the left menu, browse to help > Samples > Tables > Storm_Events.

StormEvents     
| where EventNarrative != "" and monthofyear(StartTime) > 1 and monthofyear(StartTime) < 9
| where EventType == "Drought" or EventType == "Extreme Cold/Wind Chill"
| evaluate diffpatterns_text(EpisodeNarrative, EventType == "Extreme Cold/Wind Chill", 2)

Output

|Count_of_True|Count_of_False|Percent_of_True|Percent_of_False|Pattern|
|--|--|--|--|--|
|11|0|6.29|0|Winds shifting northwest in * wake * a surface trough brought heavy lake effect snowfall downwind * Lake Superior from|
|9|0|5.14|0|Canadian high pressure settled * * region * produced the coldest temperatures since February * 2006. Durations * freezing temperatures|
|0|34|0|6.24|* * * * * * * * * * * * * * * * * * West Tennessee,|
|0|42|0|7.71|* * * * * * caused * * * * * * * * across western Colorado. *|
|0|45|0|8.26|* * below normal *|
|0|110|0|20.18|Below normal *|