Machine learning plugins

1 - autocluster plugin

Learn how to use the autocluster plugin to find common patterns in data.

autocluster finds common patterns of discrete attributes (dimensions) in the data. It then reduces the results of the original query, whether it’s 100 or 100,000 rows, to a few patterns. The plugin was developed to help analyze failures (such as exceptions or crashes) but can potentially work on any filtered dataset. The plugin is invoked with the evaluate operator.

Syntax

T | evaluate autocluster ([SizeWeight [, WeightColumn [, NumSeeds [, CustomWildcard [, … ]]]]])

Parameters

The parameters must be ordered as specified in the syntax. To indicate that the default value should be used, input the tilde: ~. For more information, see Examples.

|Name|Type|Required|Description|
|--|--|--|--|
|T|string|✔️|The input tabular expression.|
|SizeWeight|double||A double between 0 and 1 that controls the balance between generic (high-coverage) and informative (many shared values) patterns. Increasing this value typically reduces the number of patterns while expanding coverage. Decreasing it generates more specific patterns with more shared values and a smaller percentage coverage. The default is 0.5. The formula is a weighted geometric mean with weights SizeWeight and 1 - SizeWeight.|
|WeightColumn|string||Considers each row in the input according to the specified weight. Each row has a default weight of 1. The argument must be the name of a numeric column. A common use of a weight column is to take into account sampling, bucketing, or aggregation of the data that is already embedded into each row.|
|NumSeeds|int||Determines the number of initial local search points. Adjusting the number of seeds affects result quantity and quality depending on the data structure. Increasing the number of seeds can improve results, at the cost of a slower query. Decreasing below five yields negligible improvements, while increasing above 50 rarely generates more patterns. The default is 25.|
|CustomWildcard|string||A type literal that sets the wildcard value for a specific type in the results table, indicating no restriction on this column. The default is null; for string columns, the default is an empty string. If the default is a viable value in the data, use a different wildcard value, such as *. You can include multiple custom wildcards by adding them consecutively.|
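The effect of SizeWeight can be illustrated with a small sketch of the weighted geometric mean it refers to. This is only an illustration of the scoring idea, not the plugin's actual implementation; the function name and the informativeness measure are assumptions:

```python
import math

def pattern_score(coverage: float, informativeness: float, size_weight: float = 0.5) -> float:
    """Weighted geometric mean of a pattern's coverage (fraction of input rows
    it captures) and its informativeness (a 0..1 proxy for how many column
    values it fixes), weighted by SizeWeight and 1 - SizeWeight."""
    return math.exp(size_weight * math.log(coverage)
                    + (1 - size_weight) * math.log(informativeness))

# A broad but uninformative pattern vs. a narrow but specific one:
broad, specific = (0.8, 0.2), (0.2, 0.8)
# With a high SizeWeight, the high-coverage pattern scores higher...
assert pattern_score(*broad, size_weight=0.9) > pattern_score(*specific, size_weight=0.9)
# ...and with a low SizeWeight, the informative pattern scores higher.
assert pattern_score(*broad, size_weight=0.1) < pattern_score(*specific, size_weight=0.1)
```

This is why raising SizeWeight yields fewer, broader patterns and lowering it yields more specific ones.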

Returns

The autocluster plugin usually returns a small set of patterns. The patterns capture portions of the data with shared common values across multiple discrete attributes. Each pattern in the results is represented by a row.

The first column is the segment ID. The next two columns are the count and percentage of rows out of the original query that are captured by the pattern. The remaining columns are from the original query. Their value is either a specific value from the column or a wildcard value (null by default), meaning variable values.

The patterns aren’t distinct, may be overlapping, and usually don’t cover all the original rows. Some rows may not fall under any pattern.

Examples

Using evaluate

T | evaluate autocluster()

Using autocluster

StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0 , "YES" , "NO")
| project State , EventType , Damage
| evaluate autocluster(0.6)

Output

|SegmentId|Count|Percent|State|EventType|Damage|
|--|--|--|--|--|--|
|0|2278|38.7||Hail|NO|
|1|512|8.7||Thunderstorm Wind|YES|
|2|898|15.3|TEXAS|||

Using custom wildcards

StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0 , "YES" , "NO")
| project State , EventType , Damage
| evaluate autocluster(0.2, '~', '~', '*')

Output

|SegmentId|Count|Percent|State|EventType|Damage|
|--|--|--|--|--|--|
|0|2278|38.7|*|Hail|NO|
|1|512|8.7|*|Thunderstorm Wind|YES|
|2|898|15.3|TEXAS|*|*|

2 - basket plugin

Learn how to use the basket plugin to find frequent patterns in data that exceed a frequency threshold.

The basket plugin finds frequent patterns of attributes in the data and returns the patterns that pass a frequency threshold in that data. A pattern represents a subset of the rows that have the same value across one or more columns. The basket plugin is based on the Apriori algorithm originally developed for basket analysis data mining.
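To make the Apriori idea concrete, here's a minimal sketch of the level-wise search over (column, value) items. This illustrates the algorithm the plugin is based on, not the plugin's actual implementation:

```python
from collections import Counter

def frequent_patterns(rows, threshold):
    """Level-wise (Apriori) search: count single (column, value) items, keep
    those whose support ratio meets the threshold, then grow surviving
    itemsets one item at a time, pruning below-threshold candidates."""
    n = len(rows)
    counts = Counter((col, val) for row in rows for col, val in row.items())
    frequent = {frozenset([item]) for item, c in counts.items() if c / n >= threshold}
    result = set(frequent)
    k = 2
    while frequent:
        # Candidate k-itemsets are unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        frequent = set()
        for cand in candidates:
            support = sum(all(row.get(col) == val for col, val in cand) for row in rows)
            if support / n >= threshold:
                frequent.add(cand)
        result |= frequent
        k += 1
    return result

rows = [
    {"State": "TEXAS", "Damage": "NO"},
    {"State": "TEXAS", "Damage": "NO"},
    {"State": "IOWA",  "Damage": "YES"},
]
patterns = frequent_patterns(rows, threshold=0.5)
assert frozenset([("State", "TEXAS"), ("Damage", "NO")]) in patterns  # 2/3 of rows
assert frozenset([("State", "IOWA")]) not in patterns                 # only 1/3 of rows
```

Each surviving itemset corresponds to one output row: the fixed (column, value) pairs, with wildcards in every other column.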

Syntax

T | evaluate basket ([ Threshold, WeightColumn, MaxDimensions, CustomWildcard, CustomWildcard, … ])

Parameters

|Name|Type|Required|Description|
|--|--|--|--|
|Threshold|double||A double in the range 0.015 to 1 that sets the minimal ratio of rows to be considered frequent. Patterns with a smaller ratio aren't returned. The default is 0.05. To use the default value, input the tilde: ~.|
|WeightColumn|string||The column name to use to weight each input row. Must be the name of a numeric column, such as int, long, or real. By default, each row has a weight of 1. To use the default value, input the tilde: ~. A common use of a weight column is to take into account sampling or bucketing/aggregation of the data that is already embedded into each row.|
|MaxDimensions|int||Sets the maximal number of uncorrelated dimensions per basket, limited by default to minimize the query runtime. The default is 5. To use the default value, input the tilde: ~.|
|CustomWildcard|string||Sets the wildcard value for a specific type in the result table, indicating that the current pattern has no restriction on this column. The default is null, except for string columns, whose default is an empty string. If the default is a viable value in the data, use a different wildcard value, such as *. To use the default value, input the tilde: ~.|

Returns

The basket plugin returns frequent patterns that pass a ratio threshold. The default threshold is 0.05.

Each pattern is represented by a row in the results. The first column is the segment ID. The next two columns are the count and percentage of rows from the original query that match the pattern. The remaining columns relate to the original query, with either a specific value from the column or a wildcard value (null by default), meaning a variable value.

Example

StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0 , "YES" , "NO")
| project State, EventType, Damage, DamageCrops
| evaluate basket(0.2)

Output

|SegmentId|Count|Percent|State|EventType|Damage|DamageCrops|
|--|--|--|--|--|--|--|
|0|4574|77.7|||NO|0|
|1|2278|38.7||Hail|NO|0|
|2|5675|96.4||||0|
|3|2371|40.3||Hail||0|
|4|1279|21.7||Thunderstorm Wind||0|
|5|2468|41.9||Hail|||
|6|1310|22.3|||YES||
|7|1291|21.9||Thunderstorm Wind|||

Example with custom wildcards

StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0 , "YES" , "NO")
| project State, EventType, Damage, DamageCrops
| evaluate basket(0.2, '~', '~', '*', int(-1))

Output

|SegmentId|Count|Percent|State|EventType|Damage|DamageCrops|
|--|--|--|--|--|--|--|
|0|4574|77.7|*|*|NO|0|
|1|2278|38.7|*|Hail|NO|0|
|2|5675|96.4|*|*|*|0|
|3|2371|40.3|*|Hail|*|0|
|4|1279|21.7|*|Thunderstorm Wind|*|0|
|5|2468|41.9|*|Hail|*|-1|
|6|1310|22.3|*|*|YES|-1|
|7|1291|21.9|*|Thunderstorm Wind|*|-1|

3 - diffpatterns plugin

Learn how to use the diffpatterns plugin to compare two datasets of the same structure to find the differences between the two datasets.

Compares two datasets of the same structure and finds patterns of discrete attributes (dimensions) that characterize differences between the two datasets. The plugin is invoked with the evaluate operator.

diffpatterns was developed to help analyze failures (for example, by comparing failures to non-failures in a given time frame), but can potentially find differences between any two datasets of the same structure.

Syntax

T | evaluate diffpatterns(SplitColumn, SplitValueA, SplitValueB [, WeightColumn, Threshold, MaxDimensions, CustomWildcard, …])

Parameters

|Name|Type|Required|Description|
|--|--|--|--|
|SplitColumn|string|✔️|The column name that tells the algorithm how to split the query into datasets. Based on the specified SplitValueA and SplitValueB arguments, the algorithm splits the query into two datasets, "A" and "B", and analyzes the differences between them. As such, the split column must have at least two distinct values.|
|SplitValueA|string|✔️|A string representation of one of the values in the specified SplitColumn. All rows that have this value in their SplitColumn are considered dataset "A".|
|SplitValueB|string|✔️|A string representation of one of the values in the specified SplitColumn. All rows that have this value in their SplitColumn are considered dataset "B".|
|WeightColumn|string||The column used to weight each input row. Must be the name of a numeric column, such as int, long, or real. By default, each row has a weight of 1. To use the default value, input the tilde: ~. A common use of a weight column is to take into account sampling or bucketing/aggregation of the data that is already embedded into each row.|
|Threshold|real||A real in the range 0.015 to 1 that sets the minimal pattern ratio difference between the two sets. The default is 0.05. To use the default value, input the tilde: ~.|
|MaxDimensions|int||Sets the maximum number of uncorrelated dimensions per result pattern. By specifying a limit, you decrease the query runtime. The default is unlimited. To use the default value, input the tilde: ~.|
|CustomWildcard|string||Sets the wildcard value for a specific type in the result table, indicating that the current pattern has no restriction on this column. The default is null, except for string columns, whose default is an empty string. If the default is a viable value in the data, use a different wildcard value, such as *. To use the default value, input the tilde: ~.|

Returns

diffpatterns returns a small set of patterns that capture different portions of the data in the two sets (that is, a pattern capturing a large percentage of the rows in the first dataset and low percentage of the rows in the second set). Each pattern is represented by a row in the results.

The result of diffpatterns returns the following columns:

  • SegmentId: the identity assigned to the pattern in the current query (note: IDs aren’t guaranteed to be the same in repeating queries).

  • CountA: the number of rows captured by the pattern in Set A (Set A is the equivalent of where tostring(splitColumn) == SplitValueA).

  • CountB: the number of rows captured by the pattern in Set B (Set B is the equivalent of where tostring(splitColumn) == SplitValueB).

  • PercentA: the percentage of rows in Set A captured by the pattern (100.0 * CountA / count(SetA)).

  • PercentB: the percentage of rows in Set B captured by the pattern (100.0 * CountB / count(SetB)).

  • PercentDiffAB: the absolute percentage point difference between A and B (|PercentA - PercentB|) is the main measure of significance of patterns in describing the difference between the two sets.

  • Rest of the columns: the original schema of the input. Each row (pattern) represents the intersection of the non-wildcard values of the columns (equivalent of where col1 == val1 and col2 == val2 and ... colN == valN for each non-wildcard value in the row).

For each pattern, columns that aren't set in the pattern (that is, without restriction on a specific value) contain a wildcard value, which is null by default. See the Parameters section above for how wildcards can be manually changed.

  • Note: the patterns are often not distinct. They may overlap, and usually don't cover all the original rows; some rows may not fall under any pattern.
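As an illustration of how these measures are computed for a single candidate pattern (the dict-based row and pattern representation here is an assumption for the sketch, not the plugin's internal form):

```python
def pattern_diff(rows, split_column, split_value_a, split_value_b, pattern):
    """Compute CountA, CountB, PercentA, PercentB, and PercentDiffAB for one
    candidate pattern (a dict of non-wildcard column -> value) over rows
    (dicts), splitting on the string form of split_column."""
    set_a = [r for r in rows if str(r[split_column]) == split_value_a]
    set_b = [r for r in rows if str(r[split_column]) == split_value_b]
    match = lambda r: all(r.get(col) == val for col, val in pattern.items())
    count_a = sum(match(r) for r in set_a)
    count_b = sum(match(r) for r in set_b)
    percent_a = 100.0 * count_a / len(set_a)
    percent_b = 100.0 * count_b / len(set_b)
    return count_a, count_b, percent_a, percent_b, abs(percent_a - percent_b)
```

A pattern is interesting when PercentDiffAB is large, that is, when it captures a very different share of the two sets.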

Example

StormEvents 
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0 , 1 , 0)
| project State , EventType , Source , Damage, DamageCrops
| evaluate diffpatterns(Damage, "0", "1" )

Output

|SegmentId|CountA|CountB|PercentA|PercentB|PercentDiffAB|State|EventType|Source|DamageCrops|
|--|--|--|--|--|--|--|--|--|--|
|0|2278|93|49.8|7.1|42.7||Hail||0|
|1|779|512|17.03|39.08|22.05||Thunderstorm Wind|||
|2|1098|118|24.01|9.01|15|||Trained Spotter|0|
|3|136|158|2.97|12.06|9.09|||Newspaper||
|4|359|214|7.85|16.34|8.49||Flash Flood|||
|5|50|122|1.09|9.31|8.22|IOWA||||
|6|655|279|14.32|21.3|6.98|||Law Enforcement||
|7|150|117|3.28|8.93|5.65||Flood|||
|8|362|176|7.91|13.44|5.52|||Emergency Manager||

4 - diffpatterns_text plugin

Learn how to use the diffpatterns_text plugin to compare two string value datasets to find the differences between the two datasets.

Compares two datasets of string values and finds text patterns that characterize differences between the two datasets. The plugin is invoked with the evaluate operator.

The diffpatterns_text plugin returns a set of text patterns that capture different portions of the data in the two sets. For example, a pattern capturing a large percentage of the rows when the condition is true and a low percentage of the rows when the condition is false. The patterns are built from consecutive tokens separated by white space, with a token from the text column or a * representing a wildcard. Each pattern is represented by a row in the results.
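The token-and-wildcard matching described above can be sketched as follows. This is a simplified illustration of the pattern semantics; the plugin's actual matching and pattern mining are internal:

```python
def matches(pattern: str, text: str) -> bool:
    """Check whether a diffpatterns_text-style pattern (whitespace-separated
    tokens, with '*' standing for any single token) occurs as a consecutive
    run of tokens somewhere in the text."""
    pat = pattern.split()
    tokens = text.split()
    return any(
        all(p == "*" or p == t for p, t in zip(pat, tokens[start:start + len(pat)]))
        for start in range(len(tokens) - len(pat) + 1)
    )

assert matches("Below normal *", "Temperatures were Below normal again today")
assert not matches("Below normal *", "Below normal")  # '*' still needs a token
```

The plugin then reports, for each mined pattern, how many rows match it in the true and false sets.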

Syntax

T | evaluate diffpatterns_text(TextColumn, BooleanCondition [, MinTokens, Threshold, MaxTokens])

Parameters

|Name|Type|Required|Description|
|--|--|--|--|
|TextColumn|string|✔️|The text column to analyze.|
|BooleanCondition|string|✔️|An expression that evaluates to a boolean value. The algorithm splits the query into the two datasets to compare based on this expression.|
|MinTokens|int||An integer value between 0 and 200 that represents the minimal number of non-wildcard tokens per result pattern. The default is 1.|
|Threshold|decimal||A decimal value between 0.015 and 1 that sets the minimal pattern ratio difference between the two sets. The default is 0.05. See diffpatterns.|
|MaxTokens|int||An integer value between 0 and 20 that sets the maximal number of tokens per result pattern. Specifying a lower limit decreases the query runtime.|

Returns

The result of diffpatterns_text returns the following columns:

  • Count_of_True: The number of rows matching the pattern when the condition is true.
  • Count_of_False: The number of rows matching the pattern when the condition is false.
  • Percent_of_True: The percentage of rows matching the pattern from the rows when the condition is true.
  • Percent_of_False: The percentage of rows matching the pattern from the rows when the condition is false.
  • Pattern: The text pattern containing tokens from the text string and ‘*’ for wildcards.

Example

The following example uses data from the StormEvents table in the help cluster, which is publicly available in the Weather analytics sample data. To access this data, sign in to https://dataexplorer.azure.com/clusters/help/databases/Samples. In the left menu, browse to help > Samples > Tables > Storm_Events.

StormEvents     
| where EventNarrative != "" and monthofyear(StartTime) > 1 and monthofyear(StartTime) < 9
| where EventType == "Drought" or EventType == "Extreme Cold/Wind Chill"
| evaluate diffpatterns_text(EpisodeNarrative, EventType == "Extreme Cold/Wind Chill", 2)

Output

|Count_of_True|Count_of_False|Percent_of_True|Percent_of_False|Pattern|
|--|--|--|--|--|
|11|0|6.29|0|Winds shifting northwest in * wake * a surface trough brought heavy lake effect snowfall downwind * Lake Superior from|
|9|0|5.14|0|Canadian high pressure settled * * region * produced the coldest temperatures since February * 2006. Durations * freezing temperatures|
|0|34|0|6.24|* * * * * * * * * * * * * * * * * * West Tennessee,|
|0|42|0|7.71|* * * * * * caused * * * * * * * * across western Colorado. *|
|0|45|0|8.26|* * below normal *|
|0|110|0|20.18|Below normal *|