Machine learning plugins
1 - autocluster plugin

The autocluster plugin finds common patterns of discrete attributes (dimensions) in the data. It then reduces the results of the original query, whether it's 100 or 100,000 rows, to a few patterns. The plugin was developed to help analyze failures (such as exceptions or crashes) but can potentially work on any filtered dataset. The plugin is invoked with the evaluate operator.
Syntax
T | evaluate autocluster([SizeWeight [, WeightColumn [, NumSeeds [, CustomWildcard [, …]]]]])
Parameters
The parameters must be ordered as specified in the syntax. To use the default value of a parameter, put the tilde character ~ in its place. For more information, see Examples.
Name | Type | Required | Description
---|---|---|---
T | string | ✔️ | The input tabular expression.
SizeWeight | double | | A double between 0 and 1 that controls the balance between generic (high coverage) and informative (many shared values) patterns. Increasing this value typically reduces the number of patterns while expanding coverage. Decreasing this value generates more specific patterns with more shared values and smaller percentage coverage. The default is 0.5. The formula is a weighted geometric mean with weights SizeWeight and 1-SizeWeight.
WeightColumn | string | | Considers each row in the input according to the specified weight. Each row has a default weight of 1. The argument must be the name of a numeric integer column. A common use of a weight column is to account for sampling, bucketing, or aggregation of the data that is already embedded into each row.
NumSeeds | int | | Determines the number of initial local search points. Adjusting the number of seeds affects result quantity and quality depending on the data structure. Increasing the number of seeds can improve results at the cost of a slower query. Decreasing below five yields negligible improvements, while increasing above 50 rarely generates more patterns. The default is 25.
CustomWildcard | string | | A type literal that sets the wildcard value for a specific type in the results table, indicating no restriction on this column. The default is null; for string columns, the default is an empty string. If the default is a viable value in the data, a different wildcard value should be used, such as *. You can include multiple custom wildcards by adding them consecutively.
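As an illustration of the parameter order and tilde defaults (a sketch, not one of the original examples below), the following keeps the default SizeWeight and WeightColumn while raising NumSeeds:

```kusto
// '~' keeps the default SizeWeight and WeightColumn; NumSeeds is raised to 50
StormEvents
| project State, EventType
| evaluate autocluster('~', '~', 50)
```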
Returns
The autocluster plugin usually returns a small set of patterns. The patterns capture portions of the data with shared common values across multiple discrete attributes. Each pattern in the results is represented by a row.
The first column is the segment ID. The next two columns are the count and percentage of rows out of the original query that are captured by the pattern. The remaining columns are from the original query. Their value is either a specific value from the column or a wildcard value (null by default), meaning variable values.
The patterns aren’t distinct, may be overlapping, and usually don’t cover all the original rows. Some rows may not fall under any pattern.
Examples
Using evaluate
T | evaluate autocluster()
Using autocluster
StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0, "YES", "NO")
| project State, EventType, Damage
| evaluate autocluster(0.6)
Output
SegmentId | Count | Percent | State | EventType | Damage
---|---|---|---|---|---
0 | 2278 | 38.7 | | Hail | NO
1 | 512 | 8.7 | | Thunderstorm Wind | YES
2 | 898 | 15.3 | TEXAS | |
Using custom wildcards
StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0, "YES", "NO")
| project State, EventType, Damage
| evaluate autocluster(0.2, '~', '~', '*')
Output
SegmentId | Count | Percent | State | EventType | Damage
---|---|---|---|---|---
0 | 2278 | 38.7 | * | Hail | NO
1 | 512 | 8.7 | * | Thunderstorm Wind | YES
2 | 898 | 15.3 | TEXAS | * | *
2 - basket plugin

The basket plugin finds frequent patterns of attributes in the data and returns the patterns that pass a frequency threshold in that data. A pattern represents a subset of the rows that have the same value across one or more columns. The basket plugin is based on the Apriori algorithm originally developed for basket analysis data mining.
Syntax
T | evaluate basket([Threshold, WeightColumn, MaxDimensions, CustomWildcard, CustomWildcard, …])
Parameters
Name | Type | Required | Description
---|---|---|---
Threshold | double | | A double in the range of 0.015 to 1 that sets the minimal ratio of the rows to be considered frequent. Patterns with a smaller ratio won't be returned. The default is 0.05. To use the default value, input the tilde: ~.
WeightColumn | string | | The column name to use to consider each row in the input according to the specified weight. Must be the name of a numeric column, such as int, long, real. By default, each row has a weight of 1. To use the default value, input the tilde: ~. A common use of a weight column is to account for sampling, bucketing, or aggregation of the data that is already embedded into each row.
MaxDimensions | int | | Sets the maximal number of uncorrelated dimensions per basket, limited by default to minimize the query runtime. The default is 5. To use the default value, input the tilde: ~.
CustomWildcard | string | | Sets the wildcard value for a specific type in the result table that indicates the current pattern doesn't have a restriction on this column. The default is null, except for string columns, whose default is an empty string. If the default is a viable value in the data, a different wildcard value should be used, such as *. To use the default value, input the tilde: ~.
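As a sketch of WeightColumn usage (the Cnt column name is illustrative, not from the original examples), pre-aggregated rows can carry their original row count as the weight:

```kusto
// Each summarized row is weighted by Cnt, its original row count;
// '~' keeps the default Threshold
StormEvents
| summarize Cnt = count() by State, EventType
| evaluate basket('~', 'Cnt')
```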
Returns
The basket plugin returns frequent patterns that pass a ratio threshold. The default threshold is 0.05.
Each pattern is represented by a row in the results. The first column is the segment ID. The next two columns are the count and percentage of rows from the original query that match the pattern. The remaining columns relate to the original query, with either a specific value from the column or a wildcard value (null by default), meaning a variable value.
Example
StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0, "YES", "NO")
| project State, EventType, Damage, DamageCrops
| evaluate basket(0.2)
Output
SegmentId | Count | Percent | State | EventType | Damage | DamageCrops
---|---|---|---|---|---|---
0 | 4574 | 77.7 | | | NO | 0
1 | 2278 | 38.7 | | Hail | NO | 0
2 | 5675 | 96.4 | | | | 0
3 | 2371 | 40.3 | | Hail | | 0
4 | 1279 | 21.7 | | Thunderstorm Wind | | 0
5 | 2468 | 41.9 | | Hail | |
6 | 1310 | 22.3 | | | YES |
7 | 1291 | 21.9 | | Thunderstorm Wind | |
Example with custom wildcards
StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0, "YES", "NO")
| project State, EventType, Damage, DamageCrops
| evaluate basket(0.2, '~', '~', '*', int(-1))
Output
SegmentId | Count | Percent | State | EventType | Damage | DamageCrops
---|---|---|---|---|---|---
0 | 4574 | 77.7 | * | * | NO | 0
1 | 2278 | 38.7 | * | Hail | NO | 0
2 | 5675 | 96.4 | * | * | * | 0
3 | 2371 | 40.3 | * | Hail | * | 0
4 | 1279 | 21.7 | * | Thunderstorm Wind | * | 0
5 | 2468 | 41.9 | * | Hail | * | -1
6 | 1310 | 22.3 | * | * | YES | -1
7 | 1291 | 21.9 | * | Thunderstorm Wind | * | -1
3 - diffpatterns plugin

Compares two datasets of the same structure and finds patterns of discrete attributes (dimensions) that characterize differences between the two datasets. The plugin is invoked with the evaluate operator.
diffpatterns was developed to help analyze failures (for example, by comparing failures to non-failures in a given time frame), but can potentially find differences between any two datasets of the same structure.
Syntax
T | evaluate diffpatterns(SplitColumn, SplitValueA, SplitValueB [, WeightColumn, Threshold, MaxDimensions, CustomWildcard, …])
Parameters
Name | Type | Required | Description
---|---|---|---
SplitColumn | string | ✔️ | The column name that tells the algorithm how to split the query into datasets. According to the specified values for the SplitValueA and SplitValueB arguments, the algorithm splits the query into two datasets, "A" and "B", and analyzes the differences between them. As such, the split column must have at least two distinct values.
SplitValueA | string | ✔️ | A string representation of one of the values in the specified SplitColumn. All the rows that have this value in their SplitColumn are considered dataset "A".
SplitValueB | string | ✔️ | A string representation of one of the values in the specified SplitColumn. All the rows that have this value in their SplitColumn are considered dataset "B".
WeightColumn | string | | The column used to consider each row in the input according to the specified weight. Must be the name of a numeric column, such as int, long, real. By default, each row has a weight of 1. To use the default value, input the tilde: ~. A common use of a weight column is to account for sampling, bucketing, or aggregation of the data that is already embedded into each row.
Threshold | real | | A real in the range of 0.015 to 1 that sets the minimal pattern ratio difference between the two sets. The default is 0.05. To use the default value, input the tilde: ~.
MaxDimensions | int | | Sets the maximum number of uncorrelated dimensions per result pattern. By specifying a limit, you decrease the query runtime. The default is unlimited. To use the default value, input the tilde: ~.
CustomWildcard | string | | Sets the wildcard value for a specific type in the result table that indicates the current pattern doesn't have a restriction on this column. The default is null, except for string columns, for which the default is an empty string. If the default is a viable value in the data, a different wildcard value should be used, such as *. To use the default value, input the tilde: ~.
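For instance, to keep the default WeightColumn but tighten the threshold, the optional parameters can be passed in order with the tilde as a placeholder (an illustrative sketch based on the syntax above; the threshold value is chosen arbitrarily):

```kusto
// '~' keeps the default WeightColumn; Threshold is raised to 0.1
StormEvents
| extend Damage = iff(DamageCrops + DamageProperty > 0, 1, 0)
| project State, EventType, Damage
| evaluate diffpatterns(Damage, "0", "1", '~', 0.1)
```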
Returns
diffpatterns returns a small set of patterns that capture different portions of the data in the two sets (that is, a pattern capturing a large percentage of the rows in the first dataset and a low percentage of the rows in the second set). Each pattern is represented by a row in the results.
The result of diffpatterns contains the following columns:
- SegmentId: The identity assigned to the pattern in the current query (note: IDs aren't guaranteed to be the same in repeated queries).
- CountA: The number of rows captured by the pattern in Set A (Set A is the equivalent of `where tostring(SplitColumn) == SplitValueA`).
- CountB: The number of rows captured by the pattern in Set B (Set B is the equivalent of `where tostring(SplitColumn) == SplitValueB`).
- PercentA: The percentage of rows in Set A captured by the pattern (100.0 * CountA / count(SetA)).
- PercentB: The percentage of rows in Set B captured by the pattern (100.0 * CountB / count(SetB)).
- PercentDiffAB: The absolute percentage point difference between A and B (|PercentA - PercentB|). This is the main measure of significance of patterns in describing the difference between the two sets.
- Rest of the columns: The original schema of the input, describing the pattern. Each row (pattern) represents the intersection of the non-wildcard values of the columns (the equivalent of `where col1 == val1 and col2 == val2 and ... colN == valN` for each non-wildcard value in the row).
For each pattern, columns that aren't set in the pattern (that is, without restriction on a specific value) contain a wildcard value, which is null by default. See the Parameters section above for how wildcards can be manually changed.
- Note: The patterns are often not distinct. They may be overlapping, and usually don't cover all the original rows. Some rows may not fall under any pattern.
Example
StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0, 1, 0)
| project State, EventType, Source, Damage, DamageCrops
| evaluate diffpatterns(Damage, "0", "1")
Output
SegmentId | CountA | CountB | PercentA | PercentB | PercentDiffAB | State | EventType | Source | DamageCrops
---|---|---|---|---|---|---|---|---|---
0 | 2278 | 93 | 49.8 | 7.1 | 42.7 | | Hail | | 0
1 | 779 | 512 | 17.03 | 39.08 | 22.05 | | Thunderstorm Wind | |
2 | 1098 | 118 | 24.01 | 9.01 | 15 | | | Trained Spotter | 0
3 | 136 | 158 | 2.97 | 12.06 | 9.09 | | | Newspaper |
4 | 359 | 214 | 7.85 | 16.34 | 8.49 | | Flash Flood | |
5 | 50 | 122 | 1.09 | 9.31 | 8.22 | IOWA | | |
6 | 655 | 279 | 14.32 | 21.3 | 6.98 | | | Law Enforcement |
7 | 150 | 117 | 3.28 | 8.93 | 5.65 | | Flood | |
8 | 362 | 176 | 7.91 | 13.44 | 5.52 | | | Emergency Manager |
4 - diffpatterns_text plugin

Compares two datasets of string values and finds text patterns that characterize differences between the two datasets. The plugin is invoked with the evaluate operator.
The diffpatterns_text plugin returns a set of text patterns that capture different portions of the data in the two sets. For example, a pattern capturing a large percentage of the rows when the condition is true and a low percentage of the rows when the condition is false. The patterns are built from consecutive tokens separated by white space, with a token from the text column or a * representing a wildcard. Each pattern is represented by a row in the results.
Syntax
T | evaluate diffpatterns_text(TextColumn, BooleanCondition [, MinTokens, Threshold, MaxTokens])
Parameters
Name | Type | Required | Description
---|---|---|---
TextColumn | string | ✔️ | The text column to analyze.
BooleanCondition | string | ✔️ | An expression that evaluates to a boolean value. The algorithm splits the query into the two datasets to compare based on this expression.
MinTokens | int | | An integer value between 0 and 200 that represents the minimal number of non-wildcard tokens per result pattern. The default is 1.
Threshold | decimal | | A decimal value between 0.015 and 1 that sets the minimal pattern ratio difference between the two sets. The default is 0.05. See diffpatterns.
MaxTokens | int | | An integer value between 0 and 20 that sets the maximal number of tokens per result pattern; specifying a lower limit decreases the query runtime.
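As a sketch of the optional parameters (the values are chosen for illustration, not taken from the example below), the following requires at least two non-wildcard tokens, states the 0.05 default threshold explicitly, and caps patterns at ten tokens:

```kusto
// MinTokens = 2, Threshold = 0.05 (the stated default), MaxTokens = 10
StormEvents
| where EventNarrative != ""
| evaluate diffpatterns_text(EpisodeNarrative, EventType == "Drought", 2, 0.05, 10)
```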
Returns
The result of diffpatterns_text returns the following columns:
- Count_of_True: The number of rows matching the pattern when the condition is
true
. - Count_of_False: The number of rows matching the pattern when the condition is
false
. - Percent_of_True: The percentage of rows matching the pattern from the rows when the condition is
true
. - Percent_of_False: The percentage of rows matching the pattern from the rows when the condition is
false
. - Pattern: The text pattern containing tokens from the text string and ‘
*
’ for wildcards.
Example
The following example uses data from the StormEvents table in the help cluster, which is publicly available in the Weather analytics sample data. To access this data, sign in to https://dataexplorer.azure.com/clusters/help/databases/Samples. In the left menu, browse to help > Samples > Tables > Storm_Events.
StormEvents
| where EventNarrative != "" and monthofyear(StartTime) > 1 and monthofyear(StartTime) < 9
| where EventType == "Drought" or EventType == "Extreme Cold/Wind Chill"
| evaluate diffpatterns_text(EpisodeNarrative, EventType == "Extreme Cold/Wind Chill", 2)
Output
Count_of_True | Count_of_False | Percent_of_True | Percent_of_False | Pattern
---|---|---|---|---
11 | 0 | 6.29 | 0 | Winds shifting northwest in * wake * a surface trough brought heavy lake effect snowfall downwind * Lake Superior from
9 | 0 | 5.14 | 0 | Canadian high pressure settled * * region * produced the coldest temperatures since February * 2006. Durations * freezing temperatures
0 | 34 | 0 | 6.24 | * * * * * * * * * * * * * * * * * * West Tennessee,
0 | 42 | 0 | 7.71 | * * * * * * caused * * * * * * * * across western Colorado. *
0 | 45 | 0 | 8.26 | * * below normal *
0 | 110 | 0 | 20.18 | Below normal *