This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Aggregate Functions

1: ANY
2: APPROX_COUNT_DISTINCT
3: ARG_MAX
4: ARG_MIN
5: ARRAY_AGG
6: AVG
7: AVG_IF
8: COUNT
9: COUNT_DISTINCT
10: COUNT_IF
11: COVAR_POP
12: COVAR_SAMP
13: GROUP_ARRAY_MOVING_AVG
14: GROUP_ARRAY_MOVING_SUM
15: HISTOGRAM
16: JSON_ARRAY_AGG
17: JSON_OBJECT_AGG
18: KURTOSIS
19: MAX
20: MAX_IF
21: MEDIAN
22: MEDIAN_TDIGEST
23: MIN
24: MIN_IF
25: QUANTILE_CONT
26: QUANTILE_DISC
27: QUANTILE_TDIGEST
28: QUANTILE_TDIGEST_WEIGHTED
29: RETENTION
30: SKEWNESS
31: STDDEV_POP
32: STDDEV_SAMP
33: STRING_AGG
34: SUM
35: SUM_IF
36: WINDOW_FUNNEL

Aggregate functions are essential tools in SQL that allow you to perform calculations on a set of values and return a single result.

These functions help you extract and summarize data from databases to gain valuable insights.

Function Name	What It Does
ANY	Checks if any row meets the specified condition
APPROX_COUNT_DISTINCT	Estimates the number of distinct values with HyperLogLog
ARG_MAX	Finds the arg value for the maximum val value
ARG_MIN	Finds the arg value for the minimum val value
AVG_IF	Calculates the average for rows meeting a condition
ARRAY_AGG	Converts all the values of a column to an Array
AVG	Calculates the average value of a specific column
COUNT_DISTINCT	Counts the number of distinct values in a column
COUNT_IF	Counts rows meeting a specified condition
COUNT	Counts the number of rows that meet certain criteria
COVAR_POP	Returns the population covariance of a set of number pairs
COVAR_SAMP	Returns the sample covariance of a set of number pairs
GROUP_ARRAY_MOVING_AVG	Returns an array with elements calculates the moving average of input values
GROUP_ARRAY_MOVING_SUM	Returns an array with elements calculates the moving sum of input values
KURTOSIS	Calculates the excess kurtosis of a set of values
MAX_IF	Finds the maximum value for rows meeting a condition
MAX	Finds the largest value in a specific column
MEDIAN	Calculates the median value of a specific column
MEDIAN_TDIGEST	Calculates the median value of a specific column using t-digest algorithm
MIN_IF	Finds the minimum value for rows meeting a condition
MIN	Finds the smallest value in a specific column
QUANTILE_CONT	Calculates the interpolated quantile for a specific column
QUANTILE_DISC	Calculates the quantile for a specific column
QUANTILE_TDIGEST	Calculates the quantile using t-digest algorithm
QUANTILE_TDIGEST_WEIGHTED	Calculates the quantile with weighted using t-digest algorithm
RETENTION	Calculates retention for a set of events
SKEWNESS	Calculates the skewness of a set of values
STDDEV_POP	Calculates the population standard deviation of a column
STDDEV_SAMP	Calculates the sample standard deviation of a column
STRING_AGG	Converts all the non-NULL values to String, separated by the delimiter
SUM_IF	Adds up the values meeting a condition of a specific column
SUM	Adds up the values of a specific column
WINDOW_FUNNEL	Analyzes user behavior in a time-ordered sequence of events

1 - ANY

Aggregate function.

The ANY() function selects the first encountered (non-NULL) value, unless all rows have NULL values in that column. The query can be executed in any order and even in a different order each time, so the result of this function is indeterminate. To get a determinate result, you can use the ‘min’ or ‘max’ function instead of ‘any’.

Analyze Syntax

func.any(<expr>)

Analyze Examples

func.any(table.product_name).alias('any_product_name')

| any_product_name |
|------------------|
| Laptop           |

SQL Syntax

ANY(<expr>)

Arguments

Arguments	Description
`<expr>`	Any expression

Return Type

The first encountered (non-NULL) value, in the type of the value. If all values are NULL, the return value is NULL.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE product_data (
  id INT,
  product_name VARCHAR NULL,
  price FLOAT NULL
);

INSERT INTO product_data (id, product_name, price)
VALUES (1, 'Laptop', 1000),
       (2, NULL, 800),
       (3, 'Keyboard', NULL),
       (4, 'Mouse', 25),
       (5, 'Monitor', 150);

Query Demo: Retrieve the First Encountered Non-NULL Product Name

SELECT ANY(product_name) AS any_product_name
FROM product_data;

Result

| any_product_name |
|------------------|
| Laptop           |

2 - APPROX_COUNT_DISTINCT

Estimates the number of distinct values in a data set with the HyperLogLog algorithm.

The HyperLogLog algorithm provides an approximation of the number of unique elements using little memory and time. Consider using this function when dealing with large data sets where an estimated result can be accepted. In exchange for some accuracy, this is a fast and efficient method of returning distinct counts.

To get an accurate result, use COUNT_DISTINCT. See Examples for more explanations.

Analyze Syntax

func.approx_count_distinct(<expr>)

Analyze Examples

func.approx_count_distinct(table.user_id).alias('approx_distinct_user_count')

| approx_distinct_user_count |
|----------------------------|
|             4              |

SQL Syntax

APPROX_COUNT_DISTINCT(<expr>)

Return Type

Integer.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE user_events (
  id INT,
  user_id INT,
  event_name VARCHAR
);

INSERT INTO user_events (id, user_id, event_name)
VALUES (1, 1, 'Login'),
       (2, 2, 'Login'),
       (3, 3, 'Login'),
       (4, 1, 'Logout'),
       (5, 2, 'Logout'),
       (6, 4, 'Login'),
       (7, 1, 'Login');

Query Demo: Estimate the Number of Distinct User IDs

SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_distinct_user_count
FROM user_events;

Result

| approx_distinct_user_count |
|----------------------------|
|             4              |

3 - ARG_MAX

Calculates the arg value for a maximum val value. If there are several values of arg for maximum values of val, returns the first of these values encountered.

Analyze Syntax

func.arg_max(<expr>)

Analyze Examples

func.arg_max(table.product, table.price).alias('max_price_product')

| max_price_product |
| ----------------- |
| Product C         |

SQL Syntax

ARG_MAX(<arg>, <val>)

Arguments

Arguments	Description
`<arg>`	Argument of any data type that PlaidCloud Lakehouse supports
`<val>`	Value of any data type that PlaidCloud Lakehouse supports

Return Type

arg value that corresponds to maximum val value.

matches arg type.

SQL Examples

Creating a Table and Inserting Sample Data

Let's create a table named "sales" and insert some sample data:

CREATE TABLE sales (
  id INTEGER,
  product VARCHAR(50),
  price FLOAT
);

INSERT INTO sales (id, product, price)
VALUES (1, 'Product A', 10.5),
       (2, 'Product B', 20.75),
       (3, 'Product C', 30.0),
       (4, 'Product D', 15.25),
       (5, 'Product E', 25.5);

Query: Using ARG_MAX() Function

Now, let's use the ARG_MAX() function to find the product that has the maximum price:

SELECT ARG_MAX(product, price) AS max_price_product
FROM sales;

The result should look like this:

| max_price_product |
| ----------------- |
| Product C         |

4 - ARG_MIN

Calculates the arg value for a minimum val value. If there are several different values of arg for minimum values of val, returns the first of these values encountered.

Analyze Syntax

func.arg_min(<expr>)

Analyze Examples

func.arg_min(table.name, table.score).alias('student_name')

| student_name |
|--------------|
| Charlie      |

SQL Syntax

ARG_MIN(<arg>, <val>)

Arguments

Arguments	Description
`<arg>`	Argument of any data type that PlaidCloud Lakehouse supports
`<val>`	Value of any data type that PlaidCloud Lakehouse supports

Return Type

arg value that corresponds to minimum val value.

matches arg type.

SQL Examples

Let's create a table students with columns id, name, and score, and insert some data:

CREATE TABLE students (
  id INT,
  name VARCHAR,
  score INT
);

INSERT INTO students (id, name, score) VALUES
  (1, 'Alice', 80),
  (2, 'Bob', 75),
  (3, 'Charlie', 90),
  (4, 'Dave', 80);

Now, we can use ARG_MIN to find the name of the student with the lowest score:

SELECT ARG_MIN(name, score) AS student_name
FROM students;

Result:

| student_name |
|--------------|
| Charlie      |

5 - ARRAY_AGG

The ARRAY_AGG function (also known by its alias LIST) transforms all the values, including NULL, of a specific column in a query result into an array.

Analyze Syntax

func.array_agg(<expr>)

Analyze Examples

table.movie_title, func.array_agg(table.rating).alias('ratings')

| movie_title |  ratings   |
|-------------|------------|
| Inception   | [5, 4, 5]  |

SQL Syntax

ARRAY_AGG(<expr>)

LIST(<expr>)

Arguments

Arguments	Description
`<expr>`	Any expression

Return Type

Returns an Array with elements that are of the same type as the original data.

SQL Examples

This example demonstrates how the ARRAY_AGG function can be used to aggregate and present data in a convenient array format:

-- Create a table and insert sample data
CREATE TABLE movie_ratings (
  id INT,
  movie_title VARCHAR,
  user_id INT,
  rating INT
);

INSERT INTO movie_ratings (id, movie_title, user_id, rating)
VALUES (1, 'Inception', 1, 5),
       (2, 'Inception', 2, 4),
       (3, 'Inception', 3, 5),
       (4, 'Interstellar', 1, 4),
       (5, 'Interstellar', 2, 3);

-- List all ratings for Inception in an array
SELECT movie_title, ARRAY_AGG(rating) AS ratings
FROM movie_ratings
WHERE movie_title = 'Inception'
GROUP BY movie_title;

| movie_title |  ratings   |
|-------------|------------|
| Inception   | [5, 4, 5]  |

6 - AVG

Aggregate function.

The AVG() function returns the average value of an expression.

Note: NULL values are not counted.

Analyze Syntax

func.avg(<column>)

Analyze Examples

func.avg(table.price).alias('avg_price')

| avg_price |
| --------- |
| 20.4      |

SQL Syntax

AVG(<expr>)

Arguments

Arguments	Description
`<expr>`	Any numerical expression

Return Type

double

SQL Examples

Creating a Table and Inserting Sample Data

Let's create a table named "sales" and insert some sample data:

CREATE TABLE sales (
  id INTEGER,
  product VARCHAR(50),
  price FLOAT
);

INSERT INTO sales (id, product, price)
VALUES (1, 'Product A', 10.5),
       (2, 'Product B', 20.75),
       (3, 'Product C', 30.0),
       (4, 'Product D', 15.25),
       (5, 'Product E', 25.5);

Query: Using AVG() Function

Now, let's use the AVG() function to find the average price of all products in the "sales" table:

SELECT AVG(price) AS avg_price
FROM sales;

The result should look like this:

| avg_price |
| --------- |
| 20.4      |

7 - AVG_IF

The suffix -If can be appended to the name of any aggregate function. In this case, the aggregate function accepts an extra argument – a condition.

Analyze Syntax

func.avg_if(<column>, <cond>)

Analyze Examples

func.avg_if(table.salary, table.department=='IT').alias('avg_salary_it')

| avg_salary_it   |
|-----------------|
|     65000.0     |

SQL Syntax

AVG_IF(<column>, <cond>)

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE employees (
  id INT,
  salary INT,
  department VARCHAR
);

INSERT INTO employees (id, salary, department)
VALUES (1, 50000, 'HR'),
       (2, 60000, 'IT'),
       (3, 55000, 'HR'),
       (4, 70000, 'IT'),
       (5, 65000, 'IT');

Query Demo: Calculate Average Salary for IT Department

SELECT AVG_IF(salary, department = 'IT') AS avg_salary_it
FROM employees;

Result

| avg_salary_it   |
|-----------------|
|     65000.0     |

8 - COUNT

The COUNT() function returns the number of records returned by a SELECT query.

Caution: NULL values are not counted.

Analyze Syntax

func.count(<column>)

Analyze Examples

func.count(table.grade).alias('count_valid_grades')

| count_valid_grades |
|--------------------|
|          4         |

SQL Syntax

COUNT(<expr>)

Arguments

Arguments	Description
`<expr>`	Any expression. This may be a column name, the result of another function, or a math operation. `*` is also allowed, to indicate pure row counting.

Return Type

An integer.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE students (
  id INT,
  name VARCHAR,
  age INT,
  grade FLOAT NULL
);

INSERT INTO students (id, name, age, grade)
VALUES (1, 'John', 21, 85),
       (2, 'Emma', 22, NULL),
       (3, 'Alice', 23, 90),
       (4, 'Michael', 21, 88),
       (5, 'Sophie', 22, 92);

Query Demo: Count Students with Valid Grades

SELECT COUNT(grade) AS count_valid_grades
FROM students;

Result

| count_valid_grades |
|--------------------|
|          4         |

9 - COUNT_DISTINCT

Aggregate function.

The count(distinct ...) function calculates the unique value of a set of values.

To obtain an estimated result from large data sets with little memory and time, consider using APPROX_COUNT_DISTINCT.

Caution: NULL values are not counted.

Analyze Syntax

func.count_distinct(<column>)

Analyze Examples

func.count_distinct(table.category).alias('unique_categories')

| unique_categories |
|-------------------|
|         2         |

SQL Syntax

COUNT(distinct <expr> ...)
UNIQ(<expr>)

Arguments

Arguments	Description
`<expr>`	Any expression, size of the arguments is [1, 32]

Return Type

UInt64

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE products (
  id INT,
  name VARCHAR,
  category VARCHAR,
  price FLOAT
);

INSERT INTO products (id, name, category, price)
VALUES (1, 'Laptop', 'Electronics', 1000),
       (2, 'Smartphone', 'Electronics', 800),
       (3, 'Tablet', 'Electronics', 600),
       (4, 'Chair', 'Furniture', 150),
       (5, 'Table', 'Furniture', 300);

Query Demo: Count Distinct Categories

SELECT COUNT(DISTINCT category) AS unique_categories
FROM products;

Result

| unique_categories |
|-------------------|
|         2         |

10 - COUNT_IF

The suffix _IF can be appended to the name of any aggregate function. In this case, the aggregate function accepts an extra argument – a condition.

Analyze Syntax

func.count_if(<column>, <cond>)

Analyze Examples

func.count_if(table.status, table.status=='Completed').alias('completed_orders')

| completed_orders |
|------------------|
|        3         |

SQL Example

COUNT_IF(<column>, <cond>)

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE orders (
  id INT,
  customer_id INT,
  status VARCHAR,
  total FLOAT
);

INSERT INTO orders (id, customer_id, status, total)
VALUES (1, 1, 'completed', 100),
       (2, 2, 'completed', 200),
       (3, 1, 'pending', 150),
       (4, 3, 'completed', 250),
       (5, 2, 'pending', 300);

Query Demo: Count Completed Orders

SELECT COUNT_IF(status, status = 'completed') AS completed_orders
FROM orders;

Result

| completed_orders |
|------------------|
|        3         |

11 - COVAR_POP

COVAR_POP returns the population covariance of a set of number pairs.

Analyze Syntax

func.covar_pop(<expr1>, <expr2>)

Analyze Examples

func.covar_pop(table.units_sold, table.revenue).alias('covar_pop_units_revenue')

| covar_pop_units_revenue |
|-------------------------|
|        20000.0          |

SQL Syntax

COVAR_POP(<expr1>, <expr2>)

Arguments

Arguments	Description
`<expr1>`	Any numerical expression
`<expr2>`	Any numerical expression

Return Type

float64

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE product_sales (
  id INT,
  product_id INT,
  units_sold INT,
  revenue FLOAT
);

INSERT INTO product_sales (id, product_id, units_sold, revenue)
VALUES (1, 1, 10, 1000),
       (2, 2, 20, 2000),
       (3, 3, 30, 3000),
       (4, 4, 40, 4000),
       (5, 5, 50, 5000);

Query Demo: Calculate Population Covariance between Units Sold and Revenue

SELECT COVAR_POP(units_sold, revenue) AS covar_pop_units_revenue
FROM product_sales;

Result

| covar_pop_units_revenue |
|-------------------------|
|        20000.0          |

12 - COVAR_SAMP

Aggregate function.

The covar_samp() function returns the sample covariance (Σ((x - x̅)(y - y̅)) / (n - 1)) of two data columns.

Caution: NULL values are not counted.

Analyze Syntax

func.covar_samp(<expr1>, <expr2>)

Analyze Examples

func.covar_samp(table.items_sold, table.profit).alias('covar_samp_items_profit')

| covar_samp_items_profit |
|-------------------------|
|        250000.0         |

SQL Syntax

COVAR_SAMP(<expr1>, <expr2>)

Arguments

Arguments	Description
`<expr1>`	Any numerical expression
`<expr2>`	Any numerical expression

Return Type

float64, when n <= 1, returns +∞.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE store_sales (
  id INT,
  store_id INT,
  items_sold INT,
  profit FLOAT
);

INSERT INTO store_sales (id, store_id, items_sold, profit)
VALUES (1, 1, 100, 1000),
       (2, 2, 200, 2000),
       (3, 3, 300, 3000),
       (4, 4, 400, 4000),
       (5, 5, 500, 5000);

Query Demo: Calculate Sample Covariance between Items Sold and Profit

SELECT COVAR_SAMP(items_sold, profit) AS covar_samp_items_profit
FROM store_sales;

Result

| covar_samp_items_profit |
|-------------------------|
|        250000.0         |

13 - GROUP_ARRAY_MOVING_AVG

The GROUP_ARRAY_MOVING_AVG function calculates the moving average of input values. The function can take the window size as a parameter. If left unspecified, the function takes the window size equal to the number of input values.

Analyze Syntax

func.group_array_moving_avg(<expr1>)

Analyze Examples

table.user_id, func.group_array_moving_avg(table.request_num).alias('avg_request_num')

| user_id | avg_request_num  |
|---------|------------------|
|       1 | [5.0,11.5,21.5]  |
|       3 | [10.0,22.5,35.0] |
|       2 | [7.5,18.0,31.0]  |

SQL Syntax

GROUP_ARRAY_MOVING_AVG(<expr>)

GROUP_ARRAY_MOVING_AVG(<window_size>)(<expr>)

Arguments

Arguments	Description
`<window_size>`	Any numerical expression
`<expr>`	Any numerical expression

Return Type

Returns an Array with elements of double or decimal depending on the source data type.

SQL Examples

-- Create a table and insert sample data
CREATE TABLE hits (
  user_id INT,
  request_num INT
);

INSERT INTO hits (user_id, request_num)
VALUES (1, 10),
       (2, 15),
       (3, 20),
       (1, 13),
       (2, 21),
       (3, 25),
       (1, 30),
       (2, 41),
       (3, 45);

SELECT user_id, GROUP_ARRAY_MOVING_AVG(2)(request_num) AS avg_request_num
FROM hits
GROUP BY user_id;

| user_id | avg_request_num  |
|---------|------------------|
|       1 | [5.0,11.5,21.5]  |
|       3 | [10.0,22.5,35.0] |
|       2 | [7.5,18.0,31.0]  |

14 - GROUP_ARRAY_MOVING_SUM

The GROUP_ARRAY_MOVING_SUM function calculates the moving sum of input values. The function can take the window size as a parameter. If left unspecified, the function takes the window size equal to the number of input values.

Analyze Syntax

func.group_array_moving_sum(<expr>)

Analyze Examples

table.user_id, func.group_array_moving_sum(table.request_num)

| user_id | request_num |
|---------|-------------|
|       1 | [10,23,43]  |
|       2 | [20,45,70]  |
|       3 | [15,36,62]  |

SQL Syntax

GROUP_ARRAY_MOVING_SUM(<expr>)

GROUP_ARRAY_MOVING_SUM(<window_size>)(<expr>)

Arguments

Arguments	Description
`<window_size>`	Any numerical expression
`<expr>`	Any numerical expression

Return Type

Returns an Array with elements that are of the same type as the original data.

SQL Examples

-- Create a table and insert sample data
CREATE TABLE hits (
  user_id INT,
  request_num INT
);

INSERT INTO hits (user_id, request_num)
VALUES (1, 10),
       (2, 15),
       (3, 20),
       (1, 13),
       (2, 21),
       (3, 25),
       (1, 30),
       (2, 41),
       (3, 45);

SELECT user_id, GROUP_ARRAY_MOVING_SUM(2)(request_num) AS request_num
FROM hits
GROUP BY user_id;

| user_id | request_num |
|---------|-------------|
|       1 | [10,23,43]  |
|       2 | [20,45,70]  |
|       3 | [15,36,62]  |

15 - HISTOGRAM

Generates a data distribution histogram using an "equal height" bucketing strategy.

Analyze Syntax

func.histogram(<expr>)

Analyze Examples

See SQL Example for details

SQL Syntax

HISTOGRAM(<expr>)

-- The following two forms are equivalent:
HISTOGRAM(<max_num_buckets>)(<expr>)
HISTOGRAM(<expr> [, <max_num_buckets>])

Parameter	Description
`expr`	The data type of `expr` should be sortable.
`max_num_buckets`	Optional positive integer specifying the maximum number of buckets. Default is 128.

Return Type

Returns either an empty string or a JSON object with the following structure:

buckets: List of buckets with detailed information:
- lower: Lower bound of the bucket.
- upper: Upper bound of the bucket.
- count: Number of elements in the bucket.
- pre_sum: Cumulative count of elements up to the current bucket.
- ndv: Number of distinct values in the bucket.

SQL Examples

This example shows how the HISTOGRAM function analyzes the distribution of c_int values in the histagg table, returning bucket boundaries, distinct value counts, element counts, and cumulative counts:

CREATE TABLE histagg (
  c_id INT,
  c_tinyint TINYINT,
  c_smallint SMALLINT,
  c_int INT
);

INSERT INTO histagg VALUES
  (1, 10, 20, 30),
  (1, 11, 21, 33),
  (1, 11, 12, 13),
  (2, 21, 22, 23),
  (2, 31, 32, 33),
  (2, 10, 20, 30);

SELECT HISTOGRAM(c_int) FROM histagg;

┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                                                              histogram(c_int)                                                                                                             │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ [{"lower":"13","upper":"13","ndv":1,"count":1,"pre_sum":0},{"lower":"23","upper":"23","ndv":1,"count":1,"pre_sum":1},{"lower":"30","upper":"30","ndv":1,"count":2,"pre_sum":2},{"lower":"33","upper":"33","ndv":1,"count":2,"pre_sum":4}] │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

The result is returned as a JSON array:

[
  {
    "lower": "13",
    "upper": "13",
    "ndv": 1,
    "count": 1,
    "pre_sum": 0
  },
  {
    "lower": "23",
    "upper": "23",
    "ndv": 1,
    "count": 1,
    "pre_sum": 1
  },
  {
    "lower": "30",
    "upper": "30",
    "ndv": 1,
    "count": 2,
    "pre_sum": 2
  },
  {
    "lower": "33",
    "upper": "33",
    "ndv": 1,
    "count": 2,
    "pre_sum": 4
  }
]

This example shows how HISTOGRAM(2) groups c_int values into two buckets:

SELECT HISTOGRAM(2)(c_int) FROM histagg;
-- Or
SELECT HISTOGRAM(c_int, 2) FROM histagg;

┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                  histogram(2)(c_int)                                                  │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ [{"lower":"13","upper":"30","ndv":3,"count":4,"pre_sum":0},{"lower":"33","upper":"33","ndv":1,"count":2,"pre_sum":4}] │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

The result is returned as a JSON array:

[
  {
    "lower": "13",
    "upper": "30",
    "ndv": 3,
    "count": 4,
    "pre_sum": 0
  },
  {
    "lower": "33",
    "upper": "33",
    "ndv": 1,
    "count": 2,
    "pre_sum": 4
  }
]

16 - JSON_ARRAY_AGG

Converts values into a JSON array while skipping NULLs.

Analyze Syntax

func.json_array_agg(<expr>)

Analyze Examples

See SQL Example for details

SQL Syntax

JSON_ARRAY_AGG(<expr>)

Return Type

JSON array.

Examples

This example demonstrates how JSON_ARRAY_AGG aggregates values from each column into JSON arrays:

CREATE TABLE d (
    a DECIMAL(10, 2), 
    b STRING, 
    c INT, 
    d VARIANT, 
    e ARRAY(STRING)
);

INSERT INTO d VALUES
    (20, 'abc', NULL, '{"k":"v"}', ['a','b']),
    (10, 'de', 100, 'null', []),
    (4.23, NULL, 200, '"uvw"', ['x','y']),
    (5.99, 'xyz', 300, '[1,2,3]', ['z']);

SELECT 
    json_array_agg(a) AS aggregated_a, 
    json_array_agg(b) AS aggregated_b, 
    json_array_agg(c) AS aggregated_c, 
    json_array_agg(d) AS aggregated_d, 
    json_array_agg(e) AS aggregated_e
FROM d;

-[ RECORD 1 ]-----------------------------------
aggregated_a: [20.0,10.0,4.23,5.99]
aggregated_b: ["abc","de","xyz"]
aggregated_c: [100,200,300]
aggregated_d: [{"k":"v"},null,"uvw",[1,2,3]]
aggregated_e: [["a","b"],[],["x","y"],["z"]]

17 - JSON_OBJECT_AGG

Converts key-value pairs into a JSON object. For each row in the input, it generates a key-value pair where the key is derived from the <key_expression> and the value is derived from the <value_expression>. These key-value pairs are then combined into a single JSON object.

Analyze Syntax

func.json_object_agg(<key_expression>, <value_expression>)

Analyze Examples

See SQL Example for details

SQL Syntax

JSON_OBJECT_AGG(<key_expression>, <value_expression>)

Parameter	Description
key_expression	Specifies the key in the JSON object. Only supports string expressions. If the `key_expression` evaluates to NULL, the key-value pair is skipped.
value_expression	Specifies the value in the JSON object. It can be any supported data type. If the `value_expression` evaluates to NULL, the key-value pair is skipped.

Return Type

JSON object.

SQL Examples

This example demonstrates how JSON_OBJECT_AGG can be used to aggregate different types of data—such as decimals, integers, JSON variants, and arrays—into JSON objects, with the column b as the key for each JSON object:

CREATE TABLE d (
    a DECIMAL(10, 2), 
    b STRING, 
    c INT, 
    d VARIANT, 
    e ARRAY(STRING)
);

INSERT INTO d VALUES
    (20, 'abc', NULL, '{"k":"v"}', ['a','b']),
    (10, 'de', 100, 'null', []),
    (4.23, NULL, 200, '"uvw"', ['x','y']),
    (5.99, 'xyz', 300, '[1,2,3]', ['z']);

SELECT
    json_object_agg(b, a) AS json_a,
    json_object_agg(b, c) AS json_c,
    json_object_agg(b, d) AS json_d,
    json_object_agg(b, e) AS json_e
FROM
    d;

-[ RECORD 1 ]-----------------------------------
json_a: {"abc":20.0,"de":10.0,"xyz":5.99}
json_c: {"de":100,"xyz":300}
json_d: {"abc":{"k":"v"},"de":null,"xyz":[1,2,3]}
json_e: {"abc":["a","b"],"de":[],"xyz":["z"]}

18 - KURTOSIS

Aggregate function.

The KURTOSIS() function returns the excess kurtosis of all input values.

Analyze Syntax

func.kurtosis(<expr>)

Analyze Examples

func.kurtosis(table.price).alias('excess_kurtosis')

|     excess_kurtosis     |
|-------------------------|
| 0.06818181325581445     |

SQL Syntax

KURTOSIS(<expr>)

Arguments

Arguments	Description
`<expr>`	Any numerical expression

Return Type

Nullable Float64.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE stock_prices (
  id INT,
  stock_symbol VARCHAR,
  price FLOAT
);

INSERT INTO stock_prices (id, stock_symbol, price)
VALUES (1, 'AAPL', 150),
       (2, 'AAPL', 152),
       (3, 'AAPL', 148),
       (4, 'AAPL', 160),
       (5, 'AAPL', 155);

Query Demo: Calculate Excess Kurtosis for Apple Stock Prices

SELECT KURTOSIS(price) AS excess_kurtosis
FROM stock_prices
WHERE stock_symbol = 'AAPL';

Result

|     excess_kurtosis     |
|-------------------------|
| 0.06818181325581445     |

19 - MAX

Aggregate function.

The MAX() function returns the maximum value in a set of values.

Analyze Syntax

func.max(<column>)

Analyze Examples

table.city, func.max(table.temperature).alias('max_temperature')

|    city    | max_temperature |
|------------|-----------------|
| New York   |       32        |

SQL Syntax

MAX(<expr>)

Arguments

Arguments	Description
`<expr>`	Any expression

Return Type

The maximum value, in the type of the value.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE temperatures (
  id INT,
  city VARCHAR,
  temperature FLOAT
);

INSERT INTO temperatures (id, city, temperature)
VALUES (1, 'New York', 30),
       (2, 'New York', 28),
       (3, 'New York', 32),
       (4, 'Los Angeles', 25),
       (5, 'Los Angeles', 27);

Query Demo: Find Maximum Temperature for New York City

SELECT city, MAX(temperature) AS max_temperature
FROM temperatures
WHERE city = 'New York'
GROUP BY city;

Result

|    city    | max_temperature |
|------------|-----------------|
| New York   |       32        |

20 - MAX_IF

The suffix _IF can be appended to the name of any aggregate function. In this case, the aggregate function accepts an extra argument – a condition.

Analyze Syntax

func.max_if(<column>, <cond>)

Analyze Examples

func.max_if(table.revenue, table.salesperson_id==1).alias('max_revenue_salesperson_1')

| max_revenue_salesperson_1 |
|---------------------------|
|           3000            |

SQL Example

MAX_IF(<column>, <cond>)

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE sales (
  id INT,
  salesperson_id INT,
  product_id INT,
  revenue FLOAT
);

INSERT INTO sales (id, salesperson_id, product_id, revenue)
VALUES (1, 1, 1, 1000),
       (2, 1, 2, 2000),
       (3, 1, 3, 3000),
       (4, 2, 1, 1500),
       (5, 2, 2, 2500);

Query Demo: Find Maximum Revenue for Salesperson with ID 1

SELECT MAX_IF(revenue, salesperson_id = 1) AS max_revenue_salesperson_1
FROM sales;

Result

| max_revenue_salesperson_1 |
|---------------------------|
|           3000            |

21 - MEDIAN

Aggregate function.

The MEDIAN() function computes the median of a numeric data sequence.

Caution: NULL values are not counted.

Analyze Syntax

func.median(<expr>)

Analyze Examples

func.median(table.score).alias('median_score')

|  median_score  |
|----------------|
|      85.0      |

SQL Syntax

MEDIAN(<expr>)

Arguments

Arguments	Description
`<expr>`	Any numerical expression

Return Type

the type of the value.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE exam_scores (
  id INT,
  student_id INT,
  score INT
);

INSERT INTO exam_scores (id, student_id, score)
VALUES (1, 1, 80),
       (2, 2, 90),
       (3, 3, 75),
       (4, 4, 95),
       (5, 5, 85);

Query Demo: Calculate Median Exam Score

SELECT MEDIAN(score) AS median_score
FROM exam_scores;

Result

|  median_score  |
|----------------|
|      85.0      |

22 - MEDIAN_TDIGEST

Computes the median of a numeric data sequence using the t-digest algorithm.

Caution: NULL values are not included in the calculation.

Analyze Syntax

func.median_tdigest(<expr>)

Analyze Examples

func.median_tdigest(table.score).alias('median_score')

|  median_score  |
|----------------|
|      85.0      |

SQL Syntax

MEDIAN_TDIGEST(<expr>)

Arguments

Arguments	Description
`<expr>`	Any numerical expression

Return Type

Returns a value of the same data type as the input values.

SQL Examples

-- Create a table and insert sample data
CREATE TABLE exam_scores (
  id INT,
  student_id INT,
  score INT
);

INSERT INTO exam_scores (id, student_id, score)
VALUES (1, 1, 80),
       (2, 2, 90),
       (3, 3, 75),
       (4, 4, 95),
       (5, 5, 85);

-- Calculate median exam score
SELECT MEDIAN_TDIGEST(score) AS median_score
FROM exam_scores;

|  median_score  |
|----------------|
|      85.0      |

23 - MIN

Aggregate function.

The MIN() function returns the minimum value in a set of values.

Analyze Syntax

func.min(<column>)

Analyze Examples

table.station_id, func.min(table.price).alias('min_price')

| station_id | min_price |
|------------|-----------|
|     1      |   3.45    |

SQL Syntax

MIN(<expr>)

Arguments

Arguments	Description
`<expr>`	Any expression

Return Type

The minimum value, in the type of the value.

SQL Examples

title: MIN

Aggregate function.

The MIN() function returns the minimum value in a set of values.

SQL Syntax

MIN(expression)

Arguments

Arguments	Description
expression	Any expression

Return Type

The minimum value, in the type of the value.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE gas_prices (
  id INT,
  station_id INT,
  price FLOAT
);

INSERT INTO gas_prices (id, station_id, price)
VALUES (1, 1, 3.50),
       (2, 1, 3.45),
       (3, 1, 3.55),
       (4, 2, 3.40),
       (5, 2, 3.35);

Query Demo: Find Minimum Gas Price for Station 1

SELECT station_id, MIN(price) AS min_price
FROM gas_prices
WHERE station_id = 1
GROUP BY station_id;

Result

| station_id | min_price |
|------------|-----------|
|     1      |   3.45    |

24 - MIN_IF

The suffix _IF can be appended to the name of any aggregate function. In this case, the aggregate function accepts an extra argument – a condition.

Analyze Syntax

func.min_if(<column>, <cond>)

Analyze Examples

func.min_if(table.budget, table.departing=='IT').alias('min_it_budget')

| min_it_budget |
|---------------|
|     2000      |

SQL Syntax

MIN_IF(<column>, <cond>)

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE project_budgets (
  id INT,
  project_id INT,
  department VARCHAR,
  budget FLOAT
);

INSERT INTO project_budgets (id, project_id, department, budget)
VALUES (1, 1, 'HR', 1000),
       (2, 1, 'IT', 2000),
       (3, 1, 'Marketing', 3000),
       (4, 2, 'HR', 1500),
       (5, 2, 'IT', 2500);

Query Demo: Find Minimum Budget for IT Department

SELECT MIN_IF(budget, department = 'IT') AS min_it_budget
FROM project_budgets;

Result

| min_it_budget |
|---------------|
|     2000      |

25 - QUANTILE_CONT

Aggregate function.

The QUANTILE_CONT() function computes the interpolated quantile number of a numeric data sequence.

Caution: NULL values are not counted.

Analyze Syntax

func.quantile_cont(<levels>, <expr>)

Analyze Examples

func.quantile_cont(0.5, table.sales_amount).alias('median_sales_amount')

|  median_sales_amount  |
|-----------------------|
|        6000.0         |

SQL Syntax

QUANTILE_CONT(<levels>)(<expr>)
    
QUANTILE_CONT(level1, level2, ...)(<expr>)

Arguments

Arguments	Description
`<level(s)`	level(s) of quantile. Each level is constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]
`<expr>`	Any numerical expression

Return Type

Float64 or float64 array based on level number.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE sales_data (
  id INT,
  sales_person_id INT,
  sales_amount FLOAT
);

INSERT INTO sales_data (id, sales_person_id, sales_amount)
VALUES (1, 1, 5000),
       (2, 2, 5500),
       (3, 3, 6000),
       (4, 4, 6500),
       (5, 5, 7000);

Query Demo: Calculate 50th Percentile (Median) of Sales Amount using Interpolation

SELECT QUANTILE_CONT(0.5)(sales_amount) AS median_sales_amount
FROM sales_data;

Result

|  median_sales_amount  |
|-----------------------|
|        6000.0         |

26 - QUANTILE_DISC

Aggregate function.

The QUANTILE_DISC() function computes the exact quantile number of a numeric data sequence. The QUANTILE alias to QUANTILE_DISC

Caution: NULL values are not counted.

Analyze Syntax

func.quantile_disc(<levels>, <expr>)

Analyze Examples

func.quantile_disc([0.25, 0.75], table.salary).alias('salary_quantiles')

|  salary_quantiles   |
|---------------------|
| [55000.0, 65000.0]  |

SQL Syntax

QUANTILE_DISC(<levels>)(<expr>)
    
QUANTILE_DISC(level1, level2, ...)(<expr>)

Arguments

Arguments	Description
`level(s)`	level(s) of quantile. Each level is constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]
`<expr>`	Any numerical expression

Return Type

InputType or array of InputType based on level number.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE salary_data (
  id INT,
  employee_id INT,
  salary FLOAT
);

INSERT INTO salary_data (id, employee_id, salary)
VALUES (1, 1, 50000),
       (2, 2, 55000),
       (3, 3, 60000),
       (4, 4, 65000),
       (5, 5, 70000);

Query Demo: Calculate 25th and 75th Percentile of Salaries

SELECT QUANTILE_DISC(0.25, 0.75)(salary) AS salary_quantiles
FROM salary_data;

Result

|  salary_quantiles   |
|---------------------|
| [55000.0, 65000.0]  |

27 - QUANTILE_TDIGEST

Computes an approximate quantile of a numeric data sequence using the t-digest algorithm.

Caution: NULL values are not included in the calculation.

Analyze Syntax

func.quantile_tdigest(<levels>, <expr>)

Analyze Examples

func.quantile_tdigest([0.5, 0.8], table.sales_amount).alias('sales_amounts')

| sales_amounts         |
|-----------------------+
| [6000.0,7000.0]       |

SQL Syntax

QUANTILE_TDIGEST(<level1>[, <level2>, ...])(<expr>)

Arguments

Arguments	Description
`<level n>`	A level of quantile represents a constant floating-point number ranging from 0 to 1. It is recommended to use a level value in the range of [0.01, 0.99].
`<expr>`	Any numerical expression

Return Type

Returns either a Float64 value or an array of Float64 values, depending on the number of quantile levels specified.

SQL Examples

-- Create a table and insert sample data
CREATE TABLE sales_data (
  id INT,
  sales_person_id INT,
  sales_amount FLOAT
);

INSERT INTO sales_data (id, sales_person_id, sales_amount)
VALUES (1, 1, 5000),
       (2, 2, 5500),
       (3, 3, 6000),
       (4, 4, 6500),
       (5, 5, 7000);

SELECT QUANTILE_TDIGEST(0.5)(sales_amount) AS median_sales_amount
FROM sales_data;

median_sales_amount|
-------------------+
             6000.0|

SELECT QUANTILE_TDIGEST(0.5, 0.8)(sales_amount)
FROM sales_data;

quantile_tdigest(0.5, 0.8)(sales_amount)|
----------------------------------------+
[6000.0,7000.0]                         |

28 - QUANTILE_TDIGEST_WEIGHTED

Computes an approximate quantile of a numeric data sequence using the t-digest algorithm. This function takes into account the weight of each sequence member. Memory consumption is log(n), where n is a number of values.

Caution: NULL values are not included in the calculation.

Analyze Syntax

func.quantile_tdigest_weighted(<levels>, <expr>, <weight_expr>)

Analyze Examples

func.quantile_tdigest_weighted([0.5, 0.8], table.sales_amount, 1).alias('sales_amounts')

| sales_amounts         |
|-----------------------+
| [6000.0,7000.0]       |

SQL Syntax

QUANTILE_TDIGEST_WEIGHTED(<level1>[, <level2>, ...])(<expr>, <weight_expr>)

Arguments

Arguments	Description
`<level n>`	A level of quantile represents a constant floating-point number ranging from 0 to 1. It is recommended to use a level value in the range of [0.01, 0.99].
`<expr>`	Any numerical expression
`<weight_expr>`	Any unsigned integer expression. Weight is a number of value occurrences.

Return Type

Returns either a Float64 value or an array of Float64 values, depending on the number of quantile levels specified.

SQL Examples

-- Create a table and insert sample data
CREATE TABLE sales_data (
  id INT,
  sales_person_id INT,
  sales_amount FLOAT
);

INSERT INTO sales_data (id, sales_person_id, sales_amount)
VALUES (1, 1, 5000),
       (2, 2, 5500),
       (3, 3, 6000),
       (4, 4, 6500),
       (5, 5, 7000);

SELECT QUANTILE_TDIGEST_WEIGHTED(0.5)(sales_amount, 1) AS median_sales_amount
FROM sales_data;

median_sales_amount|
-------------------+
             6000.0|

SELECT QUANTILE_TDIGEST_WEIGHTED(0.5, 0.8)(sales_amount, 1)
FROM sales_data;

quantile_tdigest_weighted(0.5, 0.8)(sales_amount)|
-------------------------------------------------+
[6000.0,7000.0]                                  |

29 - RETENTION

Aggregate function

The RETENTION() function takes as arguments a set of conditions from 1 to 32 arguments of type UInt8 that indicate whether a certain condition was met for the event.

Any condition can be specified as an argument (as in WHERE).

The conditions, except the first, apply in pairs: the result of the second will be true if the first and second are true, of the third if the first and third are true, etc.

Analyze Syntax

func.retention(<cond1> , <cond2> , ..., <cond32>)

Analyze Examples

table.user_id, func.retention(table.event_type=='signup', table.event_type='login', table.event_type='purchase').alias('sales_amounts')

| user_id | retention |
|---------|-----------|
|   1     | [1, 1, 0] |
|   2     | [1, 0, 1] |
|   3     | [1, 1, 0] |

SQL Syntax

RETENTION( <cond1> , <cond2> , ..., <cond32> );

Arguments

Arguments	Description
`<cond>`	An expression that returns a Boolean result

Return Type

The array of 1 or 0.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE user_events (
  id INT,
  user_id INT,
  event_date DATE,
  event_type VARCHAR
);

INSERT INTO user_events (id, user_id, event_date, event_type)
VALUES (1, 1, '2022-01-01', 'signup'),
       (2, 1, '2022-01-02', 'login'),
       (3, 2, '2022-01-01', 'signup'),
       (4, 2, '2022-01-03', 'purchase'),
       (5, 3, '2022-01-01', 'signup'),
       (6, 3, '2022-01-02', 'login');

Query Demo: Calculate User Retention Based on Signup, Login, and Purchase Events

SELECT
  user_id,
  RETENTION(event_type = 'signup', event_type = 'login', event_type = 'purchase') AS retention
FROM user_events
GROUP BY user_id;

Result

| user_id | retention |
|---------|-----------|
|   1     | [1, 1, 0] |
|   2     | [1, 0, 1] |
|   3     | [1, 1, 0] |

30 - SKEWNESS

Aggregate function.

The SKEWNESS() function returns the skewness of all input values.

Analyze Syntax

func.skewness(<expr>)

Analyze Examples

func.skewness(table.temperature).alias('temperature_skewness')

| temperature_skewness |
|----------------------|
|      0.68            |

SQL Syntax

SKEWNESS(<expr>)

Arguments

Arguments	Description
`<expr>`	Any numerical expression

Return Type

Nullable Float64.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE temperature_data (
                                  id INT,
                                  city_id INT,
                                  temperature FLOAT
);

INSERT INTO temperature_data (id, city_id, temperature)
VALUES (1, 1, 60),
       (2, 1, 65),
       (3, 1, 62),
       (4, 2, 70),
       (5, 2, 75);

Query Demo: Calculate Skewness of Temperature Data

SELECT SKEWNESS(temperature) AS temperature_skewness
FROM temperature_data;

Result

| temperature_skewness |
|----------------------|
|      0.68            |

31 - STDDEV_POP

Aggregate function.

The STDDEV_POP() function returns the population standard deviation(the square root of VAR_POP()) of an expression.

Note: STD() or STDDEV() can also be used, which are equivalent but not standard SQL.

Caution: NULL values are not counted.

Analyze Syntax

func.stddev_pop(<expr>)

Analyze Examples

func.stddev_pop(table.score).alias('test_score_stddev_pop')

| test_score_stddev_pop |
|-----------------------|
|        7.07107        |

SQL Syntax

STDDEV_POP(<expr>)
STDDEV(<expr>)
STD(<expr>)

Arguments

Arguments	Description
`<expr>`	Any numerical expression

Return Type

double

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE test_scores (
  id INT,
  student_id INT,
  score FLOAT
);

INSERT INTO test_scores (id, student_id, score)
VALUES (1, 1, 80),
       (2, 2, 85),
       (3, 3, 90),
       (4, 4, 95),
       (5, 5, 100);

Query Demo: Calculate Population Standard Deviation of Test Scores

SELECT STDDEV_POP(score) AS test_score_stddev_pop
FROM test_scores;

Result

| test_score_stddev_pop |
|-----------------------|
|        7.07107        |

32 - STDDEV_SAMP

Aggregate function.

The STDDEV_SAMP() function returns the sample standard deviation(the square root of VAR_SAMP()) of an expression.

Caution: NULL values are not counted.

Analyze Syntax

func.stddev_samp(<expr>)

Analyze Examples

func.stddev_samp(table.height).alias('height_stddev_samp')

| height_stddev_samp |
|--------------------|
|      0.240         |

SQL Syntax

STDDEV_SAMP(<expr>)

Arguments

Arguments	Description
`<expr>`	Any numerical expression

Return Type

double

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE height_data (
  id INT,
  person_id INT,
  height FLOAT
);

INSERT INTO height_data (id, person_id, height)
VALUES (1, 1, 5.8),
       (2, 2, 6.1),
       (3, 3, 5.9),
       (4, 4, 5.7),
       (5, 5, 6.3);

Query Demo: Calculate Sample Standard Deviation of Heights

SELECT STDDEV_SAMP(height) AS height_stddev_samp
FROM height_data;

Result

| height_stddev_samp |
|--------------------|
|      0.240         |

33 - STRING_AGG

Aggregate function.

The STRING_AGG() function converts all the non-NULL values of a column to String, separated by the delimiter.

Analyze Syntax

func.string_agg(<expr> [, delimiter])

Analyze Examples

func.string_agg(table.language_name).alias('concatenated_languages')

|          concatenated_languages         |
|-----------------------------------------|
| Python, JavaScript, Java, C#, Ruby      |

SQL Syntax

STRING_AGG(<expr>)
STRING_AGG(<expr> [, delimiter])

Note:

If <expr> is not a String expression, should use ::VARCHAR to convert.

For example:

SELECT string_agg(number::VARCHAR, '|') AS s FROM numbers(5);
+-----------+
| s         |
+-----------+
| 0|1|2|3|4 |
+-----------+

Arguments

Arguments	Description
`<expr>`	Any string expression (if not a string, use `::VARCHAR` to convert)
`delimiter`	Optional constant String, if not specified, use empty String

Return Type

the String type

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE programming_languages (
  id INT,
  language_name VARCHAR
);

INSERT INTO programming_languages (id, language_name)
VALUES (1, 'Python'),
       (2, 'JavaScript'),
       (3, 'Java'),
       (4, 'C#'),
       (5, 'Ruby');

Query Demo: Concatenate Programming Language Names with a Delimiter

SELECT STRING_AGG(language_name, ', ') AS concatenated_languages
FROM programming_languages;

Result

|          concatenated_languages         |
|------------------------------------------|
| Python, JavaScript, Java, C#, Ruby      |

34 - SUM

Aggregate function.

The SUM() function calculates the sum of a set of values.

Caution: NULL values are not counted.

Analyze Syntax

func.sum(<column>)

Analyze Examples

func.sum(table.quantity).alias('total_quantity_sold')

| total_quantity_sold |
|---------------------|
|         41          |

SQL Syntax

SUM(<expr>)

Arguments

Arguments	Description
`<expr>`	Any numerical expression

Return Type

A double if the input type is double, otherwise integer.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE sales_data (
  id INT,
  product_id INT,
  quantity INT
);

INSERT INTO sales_data (id, product_id, quantity)
VALUES (1, 1, 10),
       (2, 2, 5),
       (3, 3, 8),
       (4, 4, 3),
       (5, 5, 15);

Query Demo: Calculate the Total Quantity of Products Sold

SELECT SUM(quantity) AS total_quantity_sold
FROM sales_data;

Result

| total_quantity_sold |
|---------------------|
|         41          |

35 - SUM_IF

The suffix -If can be appended to the name of any aggregate function. In this case, the aggregate function accepts an extra argument – a condition.

Analyze Syntax

func.sum_if(<column>, <cond>)

Analyze Examples

func.sum_if(table.amount, table.status=='Completed').alias('total_amount_completed')

| total_amount_completed |
|------------------------|
|         270.0          |

SQL Syntax

SUM_IF(<column>, <cond>)

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE order_data (
  id INT,
  customer_id INT,
  amount FLOAT,
  status VARCHAR
);

INSERT INTO order_data (id, customer_id, amount, status)
VALUES (1, 1, 100, 'Completed'),
       (2, 2, 50, 'Completed'),
       (3, 3, 80, 'Cancelled'),
       (4, 4, 120, 'Completed'),
       (5, 5, 75, 'Cancelled');

Query Demo: Calculate the Total Amount of Completed Orders

SELECT SUM_IF(amount, status = 'Completed') AS total_amount_completed
FROM order_data;

Result

| total_amount_completed |
|------------------------|
|         270.0          |

36 - WINDOW_FUNNEL

Funnel Analysis

Similar to windowFunnel in ClickHouse (they were created by the same author), it searches for event chains in a sliding time window and calculates the maximum number of events from the chain.

The function works according to the algorithm:

The function searches for data that triggers the first condition in the chain and sets the event counter to 1. This is the moment when the sliding window starts.
If events from the chain occur sequentially within the window, the counter is incremented. If the sequence of events is disrupted, the counter isn’t incremented.
If the data has multiple event chains at varying completion points, the function will only output the size of the longest chain.

SQL Syntax

WINDOW_FUNNEL( <window> )( <timestamp>, <cond1>, <cond2>, ..., <condN> )

Arguments

<timestamp> — Name of the column containing the timestamp. Data types supported: integer types and datetime types.
<cond> — Conditions or data describing the chain of events. Must be Boolean datatype.

Parameters

<window> — Length of the sliding window, it is the time interval between the first and the last condition. The unit of window depends on the timestamp itself and varies. Determined using the expression timestamp of cond1 <= timestamp of cond2 <= ... <= timestamp of condN <= timestamp of cond1 + window.

Returned value

The maximum number of consecutive triggered conditions from the chain within the sliding time window. All the chains in the selection are analyzed.

Type: UInt8.

Example

Determine if a set period of time is enough for the user to SELECT a phone and purchase it twice in the online store.

Set the following chain of events:

The user logged into their account on the store (event_name = 'login').
The user land the page (event_name = 'visit').
The user adds to the shopping cart(event_name = 'cart').
The user complete the purchase (event_name = 'purchase').

CREATE TABLE events(user_id BIGINT, event_name VARCHAR, event_timestamp TIMESTAMP);

INSERT INTO events VALUES(100123, 'login', '2022-05-14 10:01:00');
INSERT INTO events VALUES(100123, 'visit', '2022-05-14 10:02:00');
INSERT INTO events VALUES(100123, 'cart', '2022-05-14 10:04:00');
INSERT INTO events VALUES(100123, 'purchase', '2022-05-14 10:10:00');

INSERT INTO events VALUES(100125, 'login', '2022-05-15 11:00:00');
INSERT INTO events VALUES(100125, 'visit', '2022-05-15 11:01:00');
INSERT INTO events VALUES(100125, 'cart', '2022-05-15 11:02:00');

INSERT INTO events VALUES(100126, 'login', '2022-05-15 12:00:00');
INSERT INTO events VALUES(100126, 'visit', '2022-05-15 12:01:00');

Input table:

+---------+------------+----------------------------+
| user_id | event_name | event_timestamp            |
+---------+------------+----------------------------+
|  100123 | login      | 2022-05-14 10:01:00.000000 |
|  100123 | visit      | 2022-05-14 10:02:00.000000 |
|  100123 | cart       | 2022-05-14 10:04:00.000000 |
|  100123 | purchase   | 2022-05-14 10:10:00.000000 |
|  100125 | login      | 2022-05-15 11:00:00.000000 |
|  100125 | visit      | 2022-05-15 11:01:00.000000 |
|  100125 | cart       | 2022-05-15 11:02:00.000000 |
|  100126 | login      | 2022-05-15 12:00:00.000000 |
|  100126 | visit      | 2022-05-15 12:01:00.000000 |
+---------+------------+----------------------------+

Find out how far the user user_id could get through the chain in an hour window slides.

Query:

SELECT
    level,
    count() AS count
FROM
(
    SELECT
        user_id,
        window_funnel(3600000000)(event_timestamp, event_name = 'login', event_name = 'visit', event_name = 'cart', event_name = 'purchase') AS level
    FROM events
    GROUP BY user_id
)
GROUP BY level ORDER BY level ASC;

Note: The event_timestamp type is timestamp, 3600000000 is a hour time window.

Result:

+-------+-------+
| level | count |
+-------+-------+
|     2 |     1 |
|     3 |     1 |
|     4 |     1 |
+-------+-------+

User 100126 level is 2 (login -> visit) .
user 100125 level is 3 (login -> visit -> cart).
User 100123 level is 4 (login -> visit -> cart -> purchase).