This is the multi-page printable view of this section. Click here to print.
Window functions
1 - next()
Returns the value of a column in a row that is at some offset following the current row in a serialized row set.
Syntax
next(
column,
[ offset,
default_value ])
Parameters
Name | Type | Required | Description |
---|---|---|---|
column | string | ✔️ | The column from which to get the values. |
offset | int | The amount of rows to move from the current row. Default is 1. | |
default_value | scalar | The default value when there’s no value in the next row. When no default value is specified, null is used. |
Examples
Filter data based on comparison between adjacent rows
The following query returns rows that show breaks longer than a quarter of a second between calls to sensor-9
.
TransformedSensorsData
| where SensorName == 'sensor-9'
| sort by Timestamp asc
| extend timeDiffInMilliseconds = datetime_diff('millisecond', next(Timestamp, 1), Timestamp)
| where timeDiffInMilliseconds > 250
Output
Timestamp | SensorName | Value | PublisherId | MachineId | timeDiff |
---|---|---|---|---|---|
2022-04-13T00:58:53.048506Z | sensor-9 | 0.39217481975439894 | fdbd39ab-82ac-4ca0-99ed-2f83daf3f9bb | M100 | 251 |
2022-04-13T01:07:09.63713Z | sensor-9 | 0.46645392778288297 | e3ed081e-501b-4d59-8e60-8524633d9131 | M100 | 313 |
2022-04-13T01:07:10.858267Z | sensor-9 | 0.693091598493419 | 278ca033-2b5e-4f2c-b493-00319b275aea | M100 | 254 |
2022-04-13T01:07:11.203834Z | sensor-9 | 0.52415808840249778 | 4ea27181-392d-4947-b811-ad5af02a54bb | M100 | 331 |
2022-04-13T01:07:14.431908Z | sensor-9 | 0.35430645405452 | 0af415c2-59dc-4a50-89c3-9a18ae5d621f | M100 | 268 |
… | … | … | … | … | … |
Perform aggregation based on comparison between adjacent rows
The following query calculates the average time difference in milliseconds between calls to sensor-9
.
TransformedSensorsData
| where SensorName == 'sensor-9'
| sort by Timestamp asc
| extend timeDiffInMilliseconds = datetime_diff('millisecond', next(Timestamp, 1), Timestamp)
| summarize avg(timeDiffInMilliseconds)
Output
avg_timeDiffInMilliseconds |
---|
30.726900061254298 |
Extend row with data from the next row
In the following query, as part of the serialization done with the serialize operator, a new column next_session_type
is added with data from the next row.
ConferenceSessions
| where conference == 'Build 2019'
| serialize next_session_type = next(session_type)
| project time_and_duration, session_title, session_type, next_session_type
Output
time_and_duration | session_title | session_type | next_session_type |
---|---|---|---|
Mon, May 6, 8:30-10:00 am | Vision Keynote - Satya Nadella | Keynote | Expo Session |
Mon, May 6, 1:20-1:40 pm | Azure Data Explorer: Advanced Time Series analysis | Expo Session | Breakout |
Mon, May 6, 2:00-3:00 pm | Azure’s Data Platform - Powering Modern Applications and Cloud Scale Analytics at Petabyte Scale | Breakout | Expo Session |
Mon, May 6, 4:00-4:20 pm | How BASF is using Azure Data Services | Expo Session | Expo Session |
Mon, May 6, 6:50 - 7:10 pm | Azure Data Explorer: Operationalize your ML models | Expo Session | Expo Session |
… | … | … | … |
2 - prev()
Returns the value of a specific column in a specified row. The specified row is at a specified offset from the current row in a serialized row set.
Syntax
prev(
column,
[ offset ],
[ default_value ] )
Parameters
Name | Type | Required | Description |
---|---|---|---|
column | string | ✔️ | The column from which to get the values. |
offset | int | The offset to go back in rows. The default is 1. | |
default_value | scalar | The default value to be used when there are no previous rows from which to take the value. The default is null . |
Examples
Filter data based on comparison between adjacent rows
The following query returns rows that show breaks longer than a quarter of a second between calls to sensor-9
.
TransformedSensorsData
| where SensorName == 'sensor-9'
| sort by Timestamp asc
| extend timeDiffInMilliseconds = datetime_diff('millisecond', Timestamp, prev(Timestamp, 1))
| where timeDiffInMilliseconds > 250
Output
Timestamp | SensorName | Value | PublisherId | MachineId | timeDiff |
---|---|---|---|---|---|
2022-04-13T00:58:53.048506Z | sensor-9 | 0.39217481975439894 | fdbd39ab-82ac-4ca0-99ed-2f83daf3f9bb | M100 | 251 |
2022-04-13T01:07:09.63713Z | sensor-9 | 0.46645392778288297 | e3ed081e-501b-4d59-8e60-8524633d9131 | M100 | 313 |
2022-04-13T01:07:10.858267Z | sensor-9 | 0.693091598493419 | 278ca033-2b5e-4f2c-b493-00319b275aea | M100 | 254 |
2022-04-13T01:07:11.203834Z | sensor-9 | 0.52415808840249778 | 4ea27181-392d-4947-b811-ad5af02a54bb | M100 | 331 |
2022-04-13T01:07:14.431908Z | sensor-9 | 0.35430645405452 | 0af415c2-59dc-4a50-89c3-9a18ae5d621f | M100 | 268 |
… | … | … | … | … | … |
Perform aggregation based on comparison between adjacent rows
The following query calculates the average time difference in milliseconds between calls to sensor-9
.
TransformedSensorsData
| where SensorName == 'sensor-9'
| sort by Timestamp asc
| extend timeDiffInMilliseconds = datetime_diff('millisecond', Timestamp, prev(Timestamp, 1))
| summarize avg(timeDiffInMilliseconds)
Output
avg_timeDiffInMilliseconds |
---|
30.726900061254298 |
Extend row with data from the previous row
In the following query, as part of the serialization done with the serialize operator, a new column previous_session_type
is added with data from the previous row. Since there was no session prior to the first session, the column is empty in the first row.
ConferenceSessions
| where conference == 'Build 2019'
| serialize previous_session_type = prev(session_type)
| project time_and_duration, session_title, session_type, previous_session_type
Output
time_and_duration | session_title | session_type | previous_session_type |
---|---|---|---|
Mon, May 6, 8:30-10:00 am | Vision Keynote - Satya Nadella | Keynote | |
Mon, May 6, 1:20-1:40 pm | Azure Data Explorer: Advanced Time Series analysis | Expo Session | Keynote |
Mon, May 6, 2:00-3:00 pm | Azure’s Data Platform - Powering Modern Applications and Cloud Scale Analytics at Petabyte Scale | Breakout | Expo Session |
Mon, May 6, 4:00-4:20 pm | How BASF is using Azure Data Services | Expo Session | Breakout |
Mon, May 6, 6:50 - 7:10 pm | Azure Data Explorer: Operationalize your ML models | Expo Session | Expo Session |
… | … | … | … |
3 - row_cumsum()
Calculates the cumulative sum of a column in a serialized row set.
Syntax
row_cumsum(
term [,
restart] )
Parameters
Name | Type | Required | Description |
---|---|---|---|
term | int, long, or real | ✔️ | The expression indicating the value to be summed. |
restart | bool | Indicates when the accumulation operation should be restarted, or set back to 0. It can be used to indicate partitions in the data. |
Returns
The function returns the cumulative sum of its argument.
Examples
The following example shows how to calculate the cumulative sum of the first few even integers.
datatable (a:long) [
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
]
| where a%2==0
| serialize cs=row_cumsum(a)
a | cs |
---|---|
2 | 2 |
4 | 6 |
6 | 12 |
8 | 20 |
10 | 30 |
This example shows how to calculate the cumulative sum (here, of salary
)
when the data is partitioned (here, by name
):
datatable (name:string, month:int, salary:long)
[
"Alice", 1, 1000,
"Bob", 1, 1000,
"Alice", 2, 2000,
"Bob", 2, 1950,
"Alice", 3, 1400,
"Bob", 3, 1450,
]
| order by name asc, month asc
| extend total=row_cumsum(salary, name != prev(name))
name | month | salary | total |
---|---|---|---|
Alice | 1 | 1000 | 1000 |
Alice | 2 | 2000 | 3000 |
Alice | 3 | 1400 | 4400 |
Bob | 1 | 1000 | 1000 |
Bob | 2 | 1950 | 2950 |
Bob | 3 | 1450 | 4400 |
4 - row_number()
Returns the current row’s index in a serialized row set.
The row index starts by default at 1
for the first row, and is incremented by 1
for each additional row.
Optionally, the row index can start at a different value than 1
.
Additionally, the row index may be reset according to some provided predicate.
Syntax
row_number(
[StartingIndex [,
Restart]] )
Parameters
Name | Type | Required | Description |
---|---|---|---|
StartingIndex | long | The value of the row index to start at or restart to. The default value is 1. | |
restart | bool | Indicates when the numbering is to be restarted to the StartingIndex value. The default is false . |
Returns
The function returns the row index of the current row as a value of type long
.
Examples
The following example returns a table with two columns, the first column (a
)
with numbers from 10
down to 1
, and the second column (rn
) with numbers
from 1
up to 10
:
range a from 1 to 10 step 1
| sort by a desc
| extend rn=row_number()
The following example is similar to the above, only the second column (rn
)
starts at 7
:
range a from 1 to 10 step 1
| sort by a desc
| extend rn=row_number(7)
The last example shows how one can partition the data and number the rows
per each partition. Here, we partition the data by Airport
:
datatable (Airport:string, Airline:string, Departures:long)
[
"TLV", "LH", 1,
"TLV", "LY", 100,
"SEA", "LH", 1,
"SEA", "BA", 2,
"SEA", "LY", 0
]
| sort by Airport asc, Departures desc
| extend Rank=row_number(1, prev(Airport) != Airport)
Running this query produces the following result:
Airport | Airline | Departures | Rank |
---|---|---|---|
SEA | BA | 2 | 1 |
SEA | LH | 1 | 2 |
SEA | LY | 0 | 3 |
TLV | LY | 100 | 1 |
TLV | LH | 1 | 2 |
5 - row_rank_dense()
Returns the current row’s dense rank in a serialized row set.
The row rank starts by default at 1
for the first row, and is incremented by 1
whenever the provided Term is different than the previous row’s Term.
Syntax
row_rank_dense
(
Term )
Parameters
Name | Type | Required | Description |
---|---|---|---|
Term | string | ✔️ | An expression indicating the value to consider for the rank. The rank is increased whenever the Term changes. |
restart | bool | Indicates when the numbering is to be restarted to the StartingIndex value. The default is false . |
Returns
Returns the row rank of the current row as a value of type long
.
Example
The following query shows how to rank the Airline
by the number of departures from the SEA Airport
using dense rank.
datatable (Airport:string, Airline:string, Departures:long)
[
"SEA", "LH", 3,
"SEA", "LY", 100,
"SEA", "UA", 3,
"SEA", "BA", 2,
"SEA", "EL", 3
]
| sort by Departures asc
| extend Rank=row_rank_dense(Departures)
Output
Airport | Airline | Departures | Rank |
---|---|---|---|
SEA | BA | 2 | 1 |
SEA | LH | 3 | 2 |
SEA | UA | 3 | 2 |
SEA | EL | 3 | 2 |
SEA | LY | 100 | 3 |
The following example shows how to rank the Airline
by the number of departures per each partition. Here, we partition the data by Airport
:
datatable (Airport:string, Airline:string, Departures:long)
[
"SEA", "LH", 3,
"SEA", "LY", 100,
"SEA", "UA", 3,
"SEA", "BA", 2,
"SEA", "EL", 3,
"AMS", "EL", 1,
"AMS", "BA", 1
]
| sort by Airport desc, Departures asc
| extend Rank=row_rank_dense(Departures, prev(Airport) != Airport)
Output
Airport | Airline | Departures | Rank |
---|---|---|---|
SEA | BA | 2 | 1 |
SEA | LH | 3 | 2 |
SEA | UA | 3 | 2 |
SEA | EL | 3 | 2 |
SEA | LY | 100 | 3 |
AMS | EL | 1 | 1 |
AMS | BA | 1 | 1 |
6 - row_rank_min()
Returns the current row’s minimal rank in a serialized row set.
The rank is the minimal row number that the current row’s Term appears in.
Syntax
row_rank_min
(
Term )
Parameters
Name | Type | Required | Description |
---|---|---|---|
Term | string | ✔️ | An expression indicating the value to consider for the rank. The rank is the minimal row number for Term. |
restart | bool | Indicates when the numbering is to be restarted to the StartingIndex value. The default is false . |
Returns
Returns the row rank of the current row as a value of type long
.
Example
The following query shows how to rank the Airline
by the number of departures from the SEA Airport
.
datatable (Airport:string, Airline:string, Departures:long)
[
"SEA", "LH", 3,
"SEA", "LY", 100,
"SEA", "UA", 3,
"SEA", "BA", 2,
"SEA", "EL", 3
]
| sort by Departures asc
| extend Rank=row_rank_min(Departures)
Output
Airport | Airline | Departures | Rank |
---|---|---|---|
SEA | BA | 2 | 1 |
SEA | LH | 3 | 2 |
SEA | UA | 3 | 2 |
SEA | EL | 3 | 2 |
SEA | LY | 100 | 5 |
7 - row_window_session()
Calculates session start values of a column in a serialized row set.
Syntax
row_window_session
(
Expr ,
MaxDistanceFromFirst ,
MaxDistanceBetweenNeighbors [,
Restart] )
Parameters
Name | Type | Required | Description |
---|---|---|---|
Expr | datetime | ✔️ | An expression whose values are grouped together in sessions. When Expr results in a null value, the next value starts a new session. |
MaxDistanceFromFirst | timespan | ✔️ | Determines when a new session starts using the maximum distance between the current Expr value and its value at the beginning of the session. |
MaxDistanceBetweenNeighbors | timespan | ✔️ | Another criterion for starting a new session using the maximum distance from one value of Expr to the next. |
Restart | boolean | If specified, every value that evaluates to true immediately restarts the session. |
Returns
The function returns the values at the beginning of each session. It uses the following conceptual calculation model:
Iterates over the input sequence of Expr values in order.
For each value, it decides whether to create a new session.
If a new session is created, the function returns the current value of Expr. Otherwise, it returns the previous value of Expr.
MaxDistanceFromFirst. plus MaxDistanceBetweenNeighbors.
Examples
The following example calculates session start values for a table, datatable
, with a sequence ID column and a Timestamp column to record the time of each record. The data is sorted by the sequence IDs and timestamps and then the example returns values for ID, Timestamp, and a new SessionStarted column. A session can’t exceed one hour. It continues for as long as records are less than five minutes apart and the ID stays the same. The example includes records that are less than five minutes apart.
datatable (ID:string, Timestamp:datetime) [
"1", datetime(2024-04-11 10:00:00),
"2", datetime(2024-04-11 10:18:00),
"1", datetime(2024-04-11 11:00:00),
"3", datetime(2024-04-11 11:30:00),
"2", datetime(2024-04-11 13:30:00),
"2", datetime(2024-04-11 10:16:00)
]
| sort by ID asc, Timestamp asc
| extend SessionStarted = row_window_session(Timestamp, 1h, 5m, ID != prev(ID))
Output
ID | Timestamp | SessionStarted |
---|---|---|
1 | 2024-04-11T10:00:00Z | 2024-04-11T10:00:00Z |
1 | 2024-04-11T11:00:00Z | 2024-04-11T11:00:00Z |
2 | 2024-04-11T10:16:00Z | 2024-04-11T10:16:00Z |
2 | 2024-04-11T10:18:00Z | 2024-04-11T10:16:00Z |
2 | 2024-04-11T13:30:00Z | 2024-04-11T13:30:00Z |
3 | 2024-04-11T11:30:00Z | 2024-04-11T11:30:00Z |
Related content
8 - Window functions
Window functions operate on multiple rows (records) in a row set at a time. Unlike aggregation functions, window functions require that the rows in the row set be serialized (have a specific order to them). Window functions may depend on the order to determine the result.
Window functions can only be used on serialized sets. The easiest way to serialize a row set is to use the serialize operator. This operator “freezes” the order of rows in an arbitrary manner. If the order of serialized rows is semantically important, use the sort operator to force a particular order.
The serialization process has a non-trivial cost associated with it. For example, it might prevent query parallelism in many scenarios. Therefore, don’t apply serialization unnecessarily. If necessary, rearrange the query to perform serialization on the smallest row set possible.
Serialized row set
An arbitrary row set (such as a table, or the output of a tabular operator) can be serialized in one of the following ways:
- By sorting the row set. See below for a list of operators that emit sorted row sets.
- By using the serialize operator.
Many tabular operators serialize output whenever the input is already serialized, even if the operator doesn’t itself guarantee that the result is serialized. For example, this property is guaranteed for the extend operator, the project operator, and the where operator.