kmeans_dynamic_fl()

This article describes the kmeans_dynamic_fl() user-defined function.

The function kmeans_dynamic_fl() is a UDF (user-defined function) that clusters a dataset using the k-means algorithm. It is similar to kmeans_fl(), except that the features are supplied in a single numerical array column rather than in multiple scalar columns.
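
To make the "single array column" input concrete, the following is a minimal, standard-library-only sketch of k-means (Lloyd's algorithm) applied to rows whose features live in one array-valued field. It is not the function's implementation, which delegates to scikit-learn's KMeans. The helper name kmeans_labels, the farthest-point seeding, and the sample rows are assumptions made for illustration.

```python
import math


def kmeans_labels(features, k, max_iter=100):
    """Return one cluster ID per feature array, using plain Lloyd's k-means."""
    # Assumption: deterministic farthest-point seeding stands in for
    # scikit-learn's default k-means++ initialisation.
    centroids = [list(features[0])]
    while len(centroids) < k:
        farthest = max(
            features,
            key=lambda p: min(math.dist(p, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in features
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(features, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical rows: every record carries its features in a single array,
# mirroring the Features column that kmeans_dynamic_fl() expects.
rows = [
    {"Features": [3.1, 2.2]}, {"Features": [3.5, 2.7]}, {"Features": [3.9, 2.4]},
    {"Features": [1.2, 4.1]}, {"Features": [1.6, 4.8]}, {"Features": [1.9, 4.3]},
    {"Features": [2.3, 6.5]}, {"Features": [2.7, 6.1]}, {"Features": [2.9, 6.9]},
]
labels = kmeans_labels([row["Features"] for row in rows], k=3)
for row, label in zip(rows, labels):
    row["cluster_id"] = label  # same role as the cluster_col output column
```

Because each record holds an array, the number of features can change without changing the shape of the table, which is the practical difference from passing multiple scalar columns.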

Syntax

T | invoke kmeans_dynamic_fl(k, features_col, cluster_col)

Parameters

Name | Type | Required | Description
k | int | ✔️ | The number of clusters.
features_col | string | ✔️ | The name of the column containing the numeric array of features to be used for clustering.
cluster_col | string | ✔️ | The name of the column to store the output cluster ID for each record.
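
The embedded script in the function definition stacks the per-row arrays with np.vstack, which requires every array to have the same length. The following is a hedged sketch of a pre-check for that requirement on the features_col values. The helper name common_feature_length and the sample rows are hypothetical and not part of the function. In KQL itself, array_length() could serve a similar purpose.

```python
def common_feature_length(rows, features_col):
    """Return the shared array length, or raise ValueError on the first bad row."""
    expected = None
    for index, row in enumerate(rows):
        values = row[features_col]
        if not isinstance(values, (list, tuple)) or not values:
            raise ValueError(f"row {index}: {features_col!r} is not a non-empty array")
        for value in values:
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError(f"row {index}: non-numeric feature {value!r}")
        if expected is None:
            expected = len(values)
        elif len(values) != expected:
            raise ValueError(
                f"row {index}: expected {expected} features, found {len(values)}"
            )
    return expected


# Hypothetical sample: two records with two features each.
sample = [{"Features": [1.0, 2.0]}, {"Features": [3, 4]}]
n_features = common_feature_length(sample, "Features")
```

Rows that fail such a check would need to be filtered out or padded before they are passed to the function.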

Function definition

You can define the function by either embedding its code as a query-defined function, or creating it as a stored function in your database, as follows:

Query-defined

Define the function using the following let statement. No permissions are required.

let kmeans_dynamic_fl=(tbl:(*),k:int, features_col:string, cluster_col:string)
{
    let kwargs = bag_pack('k', k, 'features_col', features_col, 'cluster_col', cluster_col);
    let code = ```if 1:

        from sklearn.cluster import KMeans

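        # Parameters passed from KQL through the kwargs property bag
        k = kargs["k"]
        features_col = kargs["features_col"]
        cluster_col = kargs["cluster_col"]

        # Stack the per-row feature arrays into a 2-D matrix (rows x features)
        df1 = df[features_col].apply(np.array)
        matrix = np.vstack(df1.values)
        # A fixed random_state keeps the clustering reproducible between runs
        kmeans = KMeans(n_clusters=k, random_state=0)
        kmeans.fit(matrix)
        # Return the input table with the cluster ID of each record filled in
        result = df
        result[cluster_col] = kmeans.labels_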
    ```;
    tbl
    | evaluate python(typeof(*),code, kwargs)
};
// Write your query to use the function here.

Stored

Define the stored function once using the following .create-or-alter function command. Database User permissions are required.

.create-or-alter function with (folder = "Packages\\ML", docstring = "K-Means clustering of features passed as a single column containing numerical array")
kmeans_dynamic_fl(tbl:(*),k:int, features_col:string, cluster_col:string)
{
    let kwargs = bag_pack('k', k, 'features_col', features_col, 'cluster_col', cluster_col);
    let code = ```if 1:

        from sklearn.cluster import KMeans

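        # Parameters passed from KQL through the kwargs property bag
        k = kargs["k"]
        features_col = kargs["features_col"]
        cluster_col = kargs["cluster_col"]

        # Stack the per-row feature arrays into a 2-D matrix (rows x features)
        df1 = df[features_col].apply(np.array)
        matrix = np.vstack(df1.values)
        # A fixed random_state keeps the clustering reproducible between runs
        kmeans = KMeans(n_clusters=k, random_state=0)
        kmeans.fit(matrix)
        # Return the input table with the cluster ID of each record filled in
        result = df
        result[cluster_col] = kmeans.labels_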
    ```;
    tbl
    | evaluate python(typeof(*),code, kwargs)
}

Example

The following example uses the invoke operator to run the function.

Clustering of an artificial dataset with three clusters

Query-defined

To use a query-defined function, invoke it after the embedded function definition.

let kmeans_dynamic_fl=(tbl:(*),k:int, features_col:string, cluster_col:string)
{
    let kwargs = bag_pack('k', k, 'features_col', features_col, 'cluster_col', cluster_col);
    let code = ```if 1:

        from sklearn.cluster import KMeans

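        # Parameters passed from KQL through the kwargs property bag
        k = kargs["k"]
        features_col = kargs["features_col"]
        cluster_col = kargs["cluster_col"]

        # Stack the per-row feature arrays into a 2-D matrix (rows x features)
        df1 = df[features_col].apply(np.array)
        matrix = np.vstack(df1.values)
        # A fixed random_state keeps the clustering reproducible between runs
        kmeans = KMeans(n_clusters=k, random_state=0)
        kmeans.fit(matrix)
        # Return the input table with the cluster ID of each record filled in
        result = df
        result[cluster_col] = kmeans.labels_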
    ```;
    tbl
    | evaluate python(typeof(*),code, kwargs)
};
union 
(range x from 1 to 100 step 1 | extend x=rand()+3, y=rand()+2),
(range x from 101 to 200 step 1 | extend x=rand()+1, y=rand()+4),
(range x from 201 to 300 step 1 | extend x=rand()+2, y=rand()+6)
| project Features=pack_array(x, y), cluster_id=int(null)
| invoke kmeans_dynamic_fl(3, "Features", "cluster_id")
| extend x=toreal(Features[0]), y=toreal(Features[1])
| render scatterchart with(series=cluster_id)

Stored

To use the stored function, first run the code in the Function definition section to store it, then invoke it in a query.

union 
(range x from 1 to 100 step 1 | extend x=rand()+3, y=rand()+2),
(range x from 101 to 200 step 1 | extend x=rand()+1, y=rand()+4),
(range x from 201 to 300 step 1 | extend x=rand()+2, y=rand()+6)
| project Features=pack_array(x, y), cluster_id=int(null)
| invoke kmeans_dynamic_fl(3, "Features", "cluster_id")
| extend x=toreal(Features[0]), y=toreal(Features[1])
| render scatterchart with(series=cluster_id)

[Figure: scatter chart of the k-means clustering of the artificial dataset with three clusters.]