This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Aggregate Functions

Aggregate functions are essential tools in SQL that allow you to perform calculations on a set of values and return a single result.

These functions help you extract and summarize data from databases to gain valuable insights.

Function NameWhat It Does
ANYChecks if any row meets the specified condition
APPROX_COUNT_DISTINCTEstimates the number of distinct values with HyperLogLog
ARG_MAXFinds the arg value for the maximum val value
ARG_MINFinds the arg value for the minimum val value
AVG_IFCalculates the average for rows meeting a condition
ARRAY_AGGConverts all the values of a column to an Array
AVGCalculates the average value of a specific column
COUNT_DISTINCTCounts the number of distinct values in a column
COUNT_IFCounts rows meeting a specified condition
COUNTCounts the number of rows that meet certain criteria
COVAR_POPReturns the population covariance of a set of number pairs
COVAR_SAMPReturns the sample covariance of a set of number pairs
GROUP_ARRAY_MOVING_AVGReturns an array with elements calculates the moving average of input values
GROUP_ARRAY_MOVING_SUMReturns an array with elements calculates the moving sum of input values
KURTOSISCalculates the excess kurtosis of a set of values
MAX_IFFinds the maximum value for rows meeting a condition
MAXFinds the largest value in a specific column
MEDIANCalculates the median value of a specific column
MEDIAN_TDIGESTCalculates the median value of a specific column using t-digest algorithm
MIN_IFFinds the minimum value for rows meeting a condition
MINFinds the smallest value in a specific column
QUANTILE_CONTCalculates the interpolated quantile for a specific column
QUANTILE_DISCCalculates the quantile for a specific column
QUANTILE_TDIGESTCalculates the quantile using t-digest algorithm
QUANTILE_TDIGEST_WEIGHTEDCalculates the quantile with weighted using t-digest algorithm
RETENTIONCalculates retention for a set of events
SKEWNESSCalculates the skewness of a set of values
STDDEV_POPCalculates the population standard deviation of a column
STDDEV_SAMPCalculates the sample standard deviation of a column
STRING_AGGConverts all the non-NULL values to String, separated by the delimiter
SUM_IFAdds up the values meeting a condition of a specific column
SUMAdds up the values of a specific column
WINDOW_FUNNELAnalyzes user behavior in a time-ordered sequence of events

1 - ANY

Aggregate function.

The ANY() function selects the first encountered (non-NULL) value, unless all rows have NULL values in that column. The query can be executed in any order and even in a different order each time, so the result of this function is indeterminate. To get a determinate result, you can use the ‘min’ or ‘max’ function instead of ‘any’.

Analyze Syntax

func.any(<expr>)

Analyze Examples

func.any(table.product_name).alias('any_product_name')

| any_product_name |
|------------------|
| Laptop           |

SQL Syntax

ANY(<expr>)

Arguments

ArgumentsDescription
<expr>Any expression

Return Type

The first encountered (non-NULL) value, in the type of the value. If all values are NULL, the return value is NULL.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE product_data (
  id INT,
  product_name VARCHAR NULL,
  price FLOAT NULL
);

INSERT INTO product_data (id, product_name, price)
VALUES (1, 'Laptop', 1000),
       (2, NULL, 800),
       (3, 'Keyboard', NULL),
       (4, 'Mouse', 25),
       (5, 'Monitor', 150);

Query Demo: Retrieve the First Encountered Non-NULL Product Name

SELECT ANY(product_name) AS any_product_name
FROM product_data;

Result

| any_product_name |
|------------------|
| Laptop           |

2 - APPROX_COUNT_DISTINCT

Estimates the number of distinct values in a data set with the HyperLogLog algorithm.

The HyperLogLog algorithm provides an approximation of the number of unique elements using little memory and time. Consider using this function when dealing with large data sets where an estimated result can be accepted. In exchange for some accuracy, this is a fast and efficient method of returning distinct counts.

To get an accurate result, use COUNT_DISTINCT. See Examples for more explanations.

Analyze Syntax

func.approx_count_distinct(<expr>)

Analyze Examples

func.approx_count_distinct(table.user_id).alias('approx_distinct_user_count')

| approx_distinct_user_count |
|----------------------------|
|             4              |

SQL Syntax

APPROX_COUNT_DISTINCT(<expr>)

Return Type

Integer.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE user_events (
  id INT,
  user_id INT,
  event_name VARCHAR
);

INSERT INTO user_events (id, user_id, event_name)
VALUES (1, 1, 'Login'),
       (2, 2, 'Login'),
       (3, 3, 'Login'),
       (4, 1, 'Logout'),
       (5, 2, 'Logout'),
       (6, 4, 'Login'),
       (7, 1, 'Login');

Query Demo: Estimate the Number of Distinct User IDs

SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_distinct_user_count
FROM user_events;

Result

| approx_distinct_user_count |
|----------------------------|
|             4              |

3 - ARG_MAX

Calculates the arg value for a maximum val value. If there are several values of arg for maximum values of val, returns the first of these values encountered.

Analyze Syntax

func.arg_max(<expr>)

Analyze Examples

func.arg_max(table.product, table.price).alias('max_price_product')

| max_price_product |
| ----------------- |
| Product C         |

SQL Syntax

ARG_MAX(<arg>, <val>)

Arguments

ArgumentsDescription
<arg>Argument of any data type that PlaidCloud Lakehouse supports
<val>Value of any data type that PlaidCloud Lakehouse supports

Return Type

arg value that corresponds to maximum val value.

matches arg type.

SQL Examples

Creating a Table and Inserting Sample Data

Let's create a table named "sales" and insert some sample data:

CREATE TABLE sales (
  id INTEGER,
  product VARCHAR(50),
  price FLOAT
);

INSERT INTO sales (id, product, price)
VALUES (1, 'Product A', 10.5),
       (2, 'Product B', 20.75),
       (3, 'Product C', 30.0),
       (4, 'Product D', 15.25),
       (5, 'Product E', 25.5);

Query: Using ARG_MAX() Function

Now, let's use the ARG_MAX() function to find the product that has the maximum price:

SELECT ARG_MAX(product, price) AS max_price_product
FROM sales;

The result should look like this:

| max_price_product |
| ----------------- |
| Product C         |

4 - ARG_MIN

Calculates the arg value for a minimum val value. If there are several different values of arg for minimum values of val, returns the first of these values encountered.

Analyze Syntax

func.arg_min(<expr>)

Analyze Examples

func.arg_min(table.name, table.score).alias('student_name')

| student_name |
|--------------|
| Charlie      |

SQL Syntax

ARG_MIN(<arg>, <val>)

Arguments

ArgumentsDescription
<arg>Argument of any data type that PlaidCloud Lakehouse supports
<val>Value of any data type that PlaidCloud Lakehouse supports

Return Type

arg value that corresponds to minimum val value.

matches arg type.

SQL Examples

Let's create a table students with columns id, name, and score, and insert some data:

CREATE TABLE students (
  id INT,
  name VARCHAR,
  score INT
);

INSERT INTO students (id, name, score) VALUES
  (1, 'Alice', 80),
  (2, 'Bob', 75),
  (3, 'Charlie', 90),
  (4, 'Dave', 80);

Now, we can use ARG_MIN to find the name of the student with the lowest score:

SELECT ARG_MIN(name, score) AS student_name
FROM students;

Result:

| student_name |
|--------------|
| Charlie      |

5 - ARRAY_AGG

The ARRAY_AGG function (also known by its alias LIST) transforms all the values, including NULL, of a specific column in a query result into an array.

Analyze Syntax

func.array_agg(<expr>)

Analyze Examples

table.movie_title, func.array_agg(table.rating).alias('ratings')

| movie_title |  ratings   |
|-------------|------------|
| Inception   | [5, 4, 5]  |

SQL Syntax

ARRAY_AGG(<expr>)

LIST(<expr>)

Arguments

ArgumentsDescription
<expr>Any expression

Return Type

Returns an Array with elements that are of the same type as the original data.

SQL Examples

This example demonstrates how the ARRAY_AGG function can be used to aggregate and present data in a convenient array format:

-- Create a table and insert sample data
CREATE TABLE movie_ratings (
  id INT,
  movie_title VARCHAR,
  user_id INT,
  rating INT
);

INSERT INTO movie_ratings (id, movie_title, user_id, rating)
VALUES (1, 'Inception', 1, 5),
       (2, 'Inception', 2, 4),
       (3, 'Inception', 3, 5),
       (4, 'Interstellar', 1, 4),
       (5, 'Interstellar', 2, 3);

-- List all ratings for Inception in an array
SELECT movie_title, ARRAY_AGG(rating) AS ratings
FROM movie_ratings
WHERE movie_title = 'Inception'
GROUP BY movie_title;

| movie_title |  ratings   |
|-------------|------------|
| Inception   | [5, 4, 5]  |

6 - AVG

Aggregate function.

The AVG() function returns the average value of an expression.

Note: NULL values are not counted.

Analyze Syntax

func.avg(<column>)

Analyze Examples

func.avg(table.price).alias('avg_price')

| avg_price |
| --------- |
| 20.4      |

SQL Syntax

AVG(<expr>)

Arguments

ArgumentsDescription
<expr>Any numerical expression

Return Type

double

SQL Examples

Creating a Table and Inserting Sample Data

Let's create a table named "sales" and insert some sample data:

CREATE TABLE sales (
  id INTEGER,
  product VARCHAR(50),
  price FLOAT
);

INSERT INTO sales (id, product, price)
VALUES (1, 'Product A', 10.5),
       (2, 'Product B', 20.75),
       (3, 'Product C', 30.0),
       (4, 'Product D', 15.25),
       (5, 'Product E', 25.5);

Query: Using AVG() Function

Now, let's use the AVG() function to find the average price of all products in the "sales" table:

SELECT AVG(price) AS avg_price
FROM sales;

The result should look like this:

| avg_price |
| --------- |
| 20.4      |

7 - AVG_IF

The suffix -If can be appended to the name of any aggregate function. In this case, the aggregate function accepts an extra argument – a condition.

Analyze Syntax

func.avg_if(<column>, <cond>)

Analyze Examples

func.avg_if(table.salary, table.department=='IT').alias('avg_salary_it')

| avg_salary_it   |
|-----------------|
|     65000.0     |

SQL Syntax

AVG_IF(<column>, <cond>)

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE employees (
  id INT,
  salary INT,
  department VARCHAR
);

INSERT INTO employees (id, salary, department)
VALUES (1, 50000, 'HR'),
       (2, 60000, 'IT'),
       (3, 55000, 'HR'),
       (4, 70000, 'IT'),
       (5, 65000, 'IT');

Query Demo: Calculate Average Salary for IT Department

SELECT AVG_IF(salary, department = 'IT') AS avg_salary_it
FROM employees;

Result

| avg_salary_it   |
|-----------------|
|     65000.0     |

8 - COUNT

The COUNT() function returns the number of records returned by a SELECT query.

Analyze Syntax

func.count(<column>)

Analyze Examples

func.count(table.grade).alias('count_valid_grades')

| count_valid_grades |
|--------------------|
|          4         |

SQL Syntax

COUNT(<expr>)

Arguments

ArgumentsDescription
<expr>Any expression.
This may be a column name, the result of another function, or a math operation.
* is also allowed, to indicate pure row counting.

Return Type

An integer.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE students (
  id INT,
  name VARCHAR,
  age INT,
  grade FLOAT NULL
);

INSERT INTO students (id, name, age, grade)
VALUES (1, 'John', 21, 85),
       (2, 'Emma', 22, NULL),
       (3, 'Alice', 23, 90),
       (4, 'Michael', 21, 88),
       (5, 'Sophie', 22, 92);

Query Demo: Count Students with Valid Grades

SELECT COUNT(grade) AS count_valid_grades
FROM students;

Result

| count_valid_grades |
|--------------------|
|          4         |

9 - COUNT_DISTINCT

Aggregate function.

The count(distinct ...) function calculates the unique value of a set of values.

To obtain an estimated result from large data sets with little memory and time, consider using APPROX_COUNT_DISTINCT.

Analyze Syntax

func.count_distinct(<column>)

Analyze Examples

func.count_distinct(table.category).alias('unique_categories')

| unique_categories |
|-------------------|
|         2         |

SQL Syntax

COUNT(distinct <expr> ...)
UNIQ(<expr>)

Arguments

ArgumentsDescription
<expr>Any expression, size of the arguments is [1, 32]

Return Type

UInt64

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE products (
  id INT,
  name VARCHAR,
  category VARCHAR,
  price FLOAT
);

INSERT INTO products (id, name, category, price)
VALUES (1, 'Laptop', 'Electronics', 1000),
       (2, 'Smartphone', 'Electronics', 800),
       (3, 'Tablet', 'Electronics', 600),
       (4, 'Chair', 'Furniture', 150),
       (5, 'Table', 'Furniture', 300);

Query Demo: Count Distinct Categories

SELECT COUNT(DISTINCT category) AS unique_categories
FROM products;

Result

| unique_categories |
|-------------------|
|         2         |

10 - COUNT_IF

The suffix _IF can be appended to the name of any aggregate function. In this case, the aggregate function accepts an extra argument – a condition.

Analyze Syntax

func.count_if(<column>, <cond>)

Analyze Examples

func.count_if(table.status, table.status=='Completed').alias('completed_orders')

| completed_orders |
|------------------|
|        3         |

SQL Example

COUNT_IF(<column>, <cond>)

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE orders (
  id INT,
  customer_id INT,
  status VARCHAR,
  total FLOAT
);

INSERT INTO orders (id, customer_id, status, total)
VALUES (1, 1, 'completed', 100),
       (2, 2, 'completed', 200),
       (3, 1, 'pending', 150),
       (4, 3, 'completed', 250),
       (5, 2, 'pending', 300);

Query Demo: Count Completed Orders

SELECT COUNT_IF(status, status = 'completed') AS completed_orders
FROM orders;

Result

| completed_orders |
|------------------|
|        3         |

11 - COVAR_POP

COVAR_POP returns the population covariance of a set of number pairs.

Analyze Syntax

func.covar_pop(<expr1>, <expr2>)

Analyze Examples

func.covar_pop(table.units_sold, table.revenue).alias('covar_pop_units_revenue')

| covar_pop_units_revenue |
|-------------------------|
|        20000.0          |

SQL Syntax

COVAR_POP(<expr1>, <expr2>)

Arguments

ArgumentsDescription
<expr1>Any numerical expression
<expr2>Any numerical expression

Return Type

float64

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE product_sales (
  id INT,
  product_id INT,
  units_sold INT,
  revenue FLOAT
);

INSERT INTO product_sales (id, product_id, units_sold, revenue)
VALUES (1, 1, 10, 1000),
       (2, 2, 20, 2000),
       (3, 3, 30, 3000),
       (4, 4, 40, 4000),
       (5, 5, 50, 5000);

Query Demo: Calculate Population Covariance between Units Sold and Revenue

SELECT COVAR_POP(units_sold, revenue) AS covar_pop_units_revenue
FROM product_sales;

Result

| covar_pop_units_revenue |
|-------------------------|
|        20000.0          |

12 - COVAR_SAMP

Aggregate function.

The covar_samp() function returns the sample covariance (Σ((x - x̅)(y - y̅)) / (n - 1)) of two data columns.

Analyze Syntax

func.covar_samp(<expr1>, <expr2>)

Analyze Examples

func.covar_samp(table.items_sold, table.profit).alias('covar_samp_items_profit')

| covar_samp_items_profit |
|-------------------------|
|        250000.0         |

SQL Syntax

COVAR_SAMP(<expr1>, <expr2>)

Arguments

ArgumentsDescription
<expr1>Any numerical expression
<expr2>Any numerical expression

Return Type

float64, when n <= 1, returns +∞.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE store_sales (
  id INT,
  store_id INT,
  items_sold INT,
  profit FLOAT
);

INSERT INTO store_sales (id, store_id, items_sold, profit)
VALUES (1, 1, 100, 1000),
       (2, 2, 200, 2000),
       (3, 3, 300, 3000),
       (4, 4, 400, 4000),
       (5, 5, 500, 5000);

Query Demo: Calculate Sample Covariance between Items Sold and Profit

SELECT COVAR_SAMP(items_sold, profit) AS covar_samp_items_profit
FROM store_sales;

Result

| covar_samp_items_profit |
|-------------------------|
|        250000.0         |

13 - GROUP_ARRAY_MOVING_AVG

The GROUP_ARRAY_MOVING_AVG function calculates the moving average of input values. The function can take the window size as a parameter. If left unspecified, the function takes the window size equal to the number of input values.

Analyze Syntax

func.group_array_moving_avg(<expr1>)

Analyze Examples

table.user_id, func.group_array_moving_avg(table.request_num).alias('avg_request_num')

| user_id | avg_request_num  |
|---------|------------------|
|       1 | [5.0,11.5,21.5]  |
|       3 | [10.0,22.5,35.0] |
|       2 | [7.5,18.0,31.0]  |

SQL Syntax

GROUP_ARRAY_MOVING_AVG(<expr>)

GROUP_ARRAY_MOVING_AVG(<window_size>)(<expr>)

Arguments

ArgumentsDescription
<window_size>Any numerical expression
<expr>Any numerical expression

Return Type

Returns an Array with elements of double or decimal depending on the source data type.

SQL Examples

-- Create a table and insert sample data
CREATE TABLE hits (
  user_id INT,
  request_num INT
);

INSERT INTO hits (user_id, request_num)
VALUES (1, 10),
       (2, 15),
       (3, 20),
       (1, 13),
       (2, 21),
       (3, 25),
       (1, 30),
       (2, 41),
       (3, 45);

SELECT user_id, GROUP_ARRAY_MOVING_AVG(2)(request_num) AS avg_request_num
FROM hits
GROUP BY user_id;

| user_id | avg_request_num  |
|---------|------------------|
|       1 | [5.0,11.5,21.5]  |
|       3 | [10.0,22.5,35.0] |
|       2 | [7.5,18.0,31.0]  |

14 - GROUP_ARRAY_MOVING_SUM

The GROUP_ARRAY_MOVING_SUM function calculates the moving sum of input values. The function can take the window size as a parameter. If left unspecified, the function takes the window size equal to the number of input values.

Analyze Syntax

func.group_array_moving_sum(<expr>)

Analyze Examples

table.user_id, func.group_array_moving_sum(table.request_num)

| user_id | request_num |
|---------|-------------|
|       1 | [10,23,43]  |
|       2 | [20,45,70]  |
|       3 | [15,36,62]  |

SQL Syntax

GROUP_ARRAY_MOVING_SUM(<expr>)

GROUP_ARRAY_MOVING_SUM(<window_size>)(<expr>)

Arguments

ArgumentsDescription
<window_size>Any numerical expression
<expr>Any numerical expression

Return Type

Returns an Array with elements that are of the same type as the original data.

SQL Examples

-- Create a table and insert sample data
CREATE TABLE hits (
  user_id INT,
  request_num INT
);

INSERT INTO hits (user_id, request_num)
VALUES (1, 10),
       (2, 15),
       (3, 20),
       (1, 13),
       (2, 21),
       (3, 25),
       (1, 30),
       (2, 41),
       (3, 45);

SELECT user_id, GROUP_ARRAY_MOVING_SUM(2)(request_num) AS request_num
FROM hits
GROUP BY user_id;

| user_id | request_num |
|---------|-------------|
|       1 | [10,23,43]  |
|       2 | [20,45,70]  |
|       3 | [15,36,62]  |

15 - KURTOSIS

Aggregate function.

The KURTOSIS() function returns the excess kurtosis of all input values.

Analyze Syntax

func.kurtosis(<expr>)

Analyze Examples

func.kurtosis(table.price).alias('excess_kurtosis')

|     excess_kurtosis     |
|-------------------------|
| 0.06818181325581445     |

SQL Syntax

KURTOSIS(<expr>)

Arguments

ArgumentsDescription
<expr>Any numerical expression

Return Type

Nullable Float64.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE stock_prices (
  id INT,
  stock_symbol VARCHAR,
  price FLOAT
);

INSERT INTO stock_prices (id, stock_symbol, price)
VALUES (1, 'AAPL', 150),
       (2, 'AAPL', 152),
       (3, 'AAPL', 148),
       (4, 'AAPL', 160),
       (5, 'AAPL', 155);

Query Demo: Calculate Excess Kurtosis for Apple Stock Prices

SELECT KURTOSIS(price) AS excess_kurtosis
FROM stock_prices
WHERE stock_symbol = 'AAPL';

Result

|     excess_kurtosis     |
|-------------------------|
| 0.06818181325581445     |

16 - MAX

Aggregate function.

The MAX() function returns the maximum value in a set of values.

Analyze Syntax

func.max(<column>)

Analyze Examples

table.city, func.max(table.temperature).alias('max_temperature')

|    city    | max_temperature |
|------------|-----------------|
| New York   |       32        |

SQL Syntax

MAX(<expr>)

Arguments

ArgumentsDescription
<expr>Any expression

Return Type

The maximum value, in the type of the value.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE temperatures (
  id INT,
  city VARCHAR,
  temperature FLOAT
);

INSERT INTO temperatures (id, city, temperature)
VALUES (1, 'New York', 30),
       (2, 'New York', 28),
       (3, 'New York', 32),
       (4, 'Los Angeles', 25),
       (5, 'Los Angeles', 27);

Query Demo: Find Maximum Temperature for New York City

SELECT city, MAX(temperature) AS max_temperature
FROM temperatures
WHERE city = 'New York'
GROUP BY city;

Result

|    city    | max_temperature |
|------------|-----------------|
| New York   |       32        |

17 - MAX_IF

The suffix _IF can be appended to the name of any aggregate function. In this case, the aggregate function accepts an extra argument – a condition.

Analyze Syntax

func.max_if(<column>, <cond>)

Analyze Examples

func.max_if(table.revenue, table.salesperson_id==1).alias('max_revenue_salesperson_1')

| max_revenue_salesperson_1 |
|---------------------------|
|           3000            |

SQL Example

MAX_IF(<column>, <cond>)

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE sales (
  id INT,
  salesperson_id INT,
  product_id INT,
  revenue FLOAT
);

INSERT INTO sales (id, salesperson_id, product_id, revenue)
VALUES (1, 1, 1, 1000),
       (2, 1, 2, 2000),
       (3, 1, 3, 3000),
       (4, 2, 1, 1500),
       (5, 2, 2, 2500);

Query Demo: Find Maximum Revenue for Salesperson with ID 1

SELECT MAX_IF(revenue, salesperson_id = 1) AS max_revenue_salesperson_1
FROM sales;

Result

| max_revenue_salesperson_1 |
|---------------------------|
|           3000            |

18 - MEDIAN

Aggregate function.

The MEDIAN() function computes the median of a numeric data sequence.

Analyze Syntax

func.median(<expr>)

Analyze Examples

func.median(table.score).alias('median_score')

|  median_score  |
|----------------|
|      85.0      |

SQL Syntax

MEDIAN(<expr>)

Arguments

ArgumentsDescription
<expr>Any numerical expression

Return Type

the type of the value.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE exam_scores (
  id INT,
  student_id INT,
  score INT
);

INSERT INTO exam_scores (id, student_id, score)
VALUES (1, 1, 80),
       (2, 2, 90),
       (3, 3, 75),
       (4, 4, 95),
       (5, 5, 85);

Query Demo: Calculate Median Exam Score

SELECT MEDIAN(score) AS median_score
FROM exam_scores;

Result

|  median_score  |
|----------------|
|      85.0      |

19 - MEDIAN_TDIGEST

Computes the median of a numeric data sequence using the t-digest algorithm.

Analyze Syntax

func.median_tdigest(<expr>)

Analyze Examples

func.median_tdigest(table.score).alias('median_score')

|  median_score  |
|----------------|
|      85.0      |

SQL Syntax

MEDIAN_TDIGEST(<expr>)

Arguments

ArgumentsDescription
<expr>Any numerical expression

Return Type

Returns a value of the same data type as the input values.

SQL Examples

-- Create a table and insert sample data
CREATE TABLE exam_scores (
  id INT,
  student_id INT,
  score INT
);

INSERT INTO exam_scores (id, student_id, score)
VALUES (1, 1, 80),
       (2, 2, 90),
       (3, 3, 75),
       (4, 4, 95),
       (5, 5, 85);

-- Calculate median exam score
SELECT MEDIAN_TDIGEST(score) AS median_score
FROM exam_scores;

|  median_score  |
|----------------|
|      85.0      |

20 - MIN

Aggregate function.

The MIN() function returns the minimum value in a set of values.

Analyze Syntax

func.min(<column>)

Analyze Examples

table.station_id, func.min(table.price).alias('min_price')

| station_id | min_price |
|------------|-----------|
|     1      |   3.45    |

SQL Syntax

MIN(<expr>)

Arguments

ArgumentsDescription
<expr>Any expression

Return Type

The minimum value, in the type of the value.

SQL Examples


title: MIN

Aggregate function.

The MIN() function returns the minimum value in a set of values.

SQL Syntax

MIN(expression)

Arguments

ArgumentsDescription
expressionAny expression

Return Type

The minimum value, in the type of the value.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE gas_prices (
  id INT,
  station_id INT,
  price FLOAT
);

INSERT INTO gas_prices (id, station_id, price)
VALUES (1, 1, 3.50),
       (2, 1, 3.45),
       (3, 1, 3.55),
       (4, 2, 3.40),
       (5, 2, 3.35);

Query Demo: Find Minimum Gas Price for Station 1

SELECT station_id, MIN(price) AS min_price
FROM gas_prices
WHERE station_id = 1
GROUP BY station_id;

Result

| station_id | min_price |
|------------|-----------|
|     1      |   3.45    |

21 - MIN_IF

The suffix _IF can be appended to the name of any aggregate function. In this case, the aggregate function accepts an extra argument – a condition.

Analyze Syntax

func.min_if(<column>, <cond>)

Analyze Examples

func.min_if(table.budget, table.departing=='IT').alias('min_it_budget')

| min_it_budget |
|---------------|
|     2000      |

SQL Syntax

MIN_IF(<column>, <cond>)

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE project_budgets (
  id INT,
  project_id INT,
  department VARCHAR,
  budget FLOAT
);

INSERT INTO project_budgets (id, project_id, department, budget)
VALUES (1, 1, 'HR', 1000),
       (2, 1, 'IT', 2000),
       (3, 1, 'Marketing', 3000),
       (4, 2, 'HR', 1500),
       (5, 2, 'IT', 2500);

Query Demo: Find Minimum Budget for IT Department

SELECT MIN_IF(budget, department = 'IT') AS min_it_budget
FROM project_budgets;

Result

| min_it_budget |
|---------------|
|     2000      |

22 - QUANTILE_CONT

Aggregate function.

The QUANTILE_CONT() function computes the interpolated quantile number of a numeric data sequence.

Analyze Syntax

func.quantile_cont(<levels>, <expr>)

Analyze Examples

func.quantile_cont(0.5, table.sales_amount).alias('median_sales_amount')

|  median_sales_amount  |
|-----------------------|
|        6000.0         |

SQL Syntax

QUANTILE_CONT(<levels>)(<expr>)
    
QUANTILE_CONT(level1, level2, ...)(<expr>)

Arguments

ArgumentsDescription
<level(s)level(s) of quantile. Each level is constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]
<expr>Any numerical expression

Return Type

Float64 or float64 array based on level number.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE sales_data (
  id INT,
  sales_person_id INT,
  sales_amount FLOAT
);

INSERT INTO sales_data (id, sales_person_id, sales_amount)
VALUES (1, 1, 5000),
       (2, 2, 5500),
       (3, 3, 6000),
       (4, 4, 6500),
       (5, 5, 7000);

Query Demo: Calculate 50th Percentile (Median) of Sales Amount using Interpolation

SELECT QUANTILE_CONT(0.5)(sales_amount) AS median_sales_amount
FROM sales_data;

Result

|  median_sales_amount  |
|-----------------------|
|        6000.0         |

23 - QUANTILE_DISC

Aggregate function.

The QUANTILE_DISC() function computes the exact quantile number of a numeric data sequence. The QUANTILE alias to QUANTILE_DISC

Analyze Syntax

func.quantile_disc(<levels>, <expr>)

Analyze Examples

func.quantile_disc([0.25, 0.75], table.salary).alias('salary_quantiles')

|  salary_quantiles   |
|---------------------|
| [55000.0, 65000.0]  |

SQL Syntax

QUANTILE_DISC(<levels>)(<expr>)
    
QUANTILE_DISC(level1, level2, ...)(<expr>)

Arguments

ArgumentsDescription
level(s)level(s) of quantile. Each level is constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]
<expr>Any numerical expression

Return Type

InputType or array of InputType based on level number.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE salary_data (
  id INT,
  employee_id INT,
  salary FLOAT
);

INSERT INTO salary_data (id, employee_id, salary)
VALUES (1, 1, 50000),
       (2, 2, 55000),
       (3, 3, 60000),
       (4, 4, 65000),
       (5, 5, 70000);

Query Demo: Calculate 25th and 75th Percentile of Salaries

SELECT QUANTILE_DISC(0.25, 0.75)(salary) AS salary_quantiles
FROM salary_data;

Result

|  salary_quantiles   |
|---------------------|
| [55000.0, 65000.0]  |

24 - QUANTILE_TDIGEST

import FunctionDescription from '@site/src/components/FunctionDescription';

Computes an approximate quantile of a numeric data sequence using the t-digest algorithm.

Analyze Syntax

func.quantile_tdigest(<levels>, <expr>)

Analyze Examples

func.quantile_tdigest([0.5, 0.8], table.sales_amount).alias('sales_amounts')

| sales_amounts         |
|-----------------------+
| [6000.0,7000.0]       |

SQL Syntax

QUANTILE_TDIGEST(<level1>[, <level2>, ...])(<expr>)

Arguments

ArgumentsDescription
<level n>A level of quantile represents a constant floating-point number ranging from 0 to 1. It is recommended to use a level value in the range of [0.01, 0.99].
<expr>Any numerical expression

Return Type

Returns either a Float64 value or an array of Float64 values, depending on the number of quantile levels specified.

SQL Examples

-- Create a table and insert sample data
CREATE TABLE sales_data (
  id INT,
  sales_person_id INT,
  sales_amount FLOAT
);

INSERT INTO sales_data (id, sales_person_id, sales_amount)
VALUES (1, 1, 5000),
       (2, 2, 5500),
       (3, 3, 6000),
       (4, 4, 6500),
       (5, 5, 7000);

SELECT QUANTILE_TDIGEST(0.5)(sales_amount) AS median_sales_amount
FROM sales_data;

median_sales_amount|
-------------------+
             6000.0|

SELECT QUANTILE_TDIGEST(0.5, 0.8)(sales_amount)
FROM sales_data;

quantile_tdigest(0.5, 0.8)(sales_amount)|
----------------------------------------+
[6000.0,7000.0]                         |

25 - QUANTILE_TDIGEST_WEIGHTED

import FunctionDescription from '@site/src/components/FunctionDescription';

Computes an approximate quantile of a numeric data sequence using the t-digest algorithm. This function takes into account the weight of each sequence member. Memory consumption is log(n), where n is a number of values.

Analyze Syntax

func.quantile_tdigest_weighted(<levels>, <expr>, <weight_expr>)

Analyze Examples

func.quantile_tdigest_weighted([0.5, 0.8], table.sales_amount, 1).alias('sales_amounts')

| sales_amounts         |
|-----------------------+
| [6000.0,7000.0]       |

SQL Syntax

QUANTILE_TDIGEST_WEIGHTED(<level1>[, <level2>, ...])(<expr>, <weight_expr>)

Arguments

ArgumentsDescription
<level n>A level of quantile represents a constant floating-point number ranging from 0 to 1. It is recommended to use a level value in the range of [0.01, 0.99].
<expr>Any numerical expression
<weight_expr>Any unsigned integer expression. Weight is a number of value occurrences.

Return Type

Returns either a Float64 value or an array of Float64 values, depending on the number of quantile levels specified.

SQL Examples

-- Create a table and insert sample data
CREATE TABLE sales_data (
  id INT,
  sales_person_id INT,
  sales_amount FLOAT
);

INSERT INTO sales_data (id, sales_person_id, sales_amount)
VALUES (1, 1, 5000),
       (2, 2, 5500),
       (3, 3, 6000),
       (4, 4, 6500),
       (5, 5, 7000);

SELECT QUANTILE_TDIGEST_WEIGHTED(0.5)(sales_amount, 1) AS median_sales_amount
FROM sales_data;

median_sales_amount|
-------------------+
             6000.0|

SELECT QUANTILE_TDIGEST_WEIGHTED(0.5, 0.8)(sales_amount, 1)
FROM sales_data;

quantile_tdigest_weighted(0.5, 0.8)(sales_amount)|
-------------------------------------------------+
[6000.0,7000.0]                                  |

26 - RETENTION

Aggregate function

The RETENTION() function takes as arguments a set of conditions from 1 to 32 arguments of type UInt8 that indicate whether a certain condition was met for the event.

Any condition can be specified as an argument (as in WHERE).

The conditions, except the first, apply in pairs: the result of the second will be true if the first and second are true, of the third if the first and third are true, etc.

Analyze Syntax

func.retention(<cond1> , <cond2> , ..., <cond32>)

Analyze Examples

table.user_id, func.retention(table.event_type=='signup', table.event_type='login', table.event_type='purchase').alias('sales_amounts')

| user_id | retention |
|---------|-----------|
|   1     | [1, 1, 0] |
|   2     | [1, 0, 1] |
|   3     | [1, 1, 0] |

SQL Syntax

RETENTION( <cond1> , <cond2> , ..., <cond32> );

Arguments

ArgumentsDescription
<cond>An expression that returns a Boolean result

Return Type

The array of 1 or 0.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE user_events (
  id INT,
  user_id INT,
  event_date DATE,
  event_type VARCHAR
);

INSERT INTO user_events (id, user_id, event_date, event_type)
VALUES (1, 1, '2022-01-01', 'signup'),
       (2, 1, '2022-01-02', 'login'),
       (3, 2, '2022-01-01', 'signup'),
       (4, 2, '2022-01-03', 'purchase'),
       (5, 3, '2022-01-01', 'signup'),
       (6, 3, '2022-01-02', 'login');

Query Demo: Calculate User Retention Based on Signup, Login, and Purchase Events

SELECT
  user_id,
  RETENTION(event_type = 'signup', event_type = 'login', event_type = 'purchase') AS retention
FROM user_events
GROUP BY user_id;

Result

| user_id | retention |
|---------|-----------|
|   1     | [1, 1, 0] |
|   2     | [1, 0, 1] |
|   3     | [1, 1, 0] |

27 - SKEWNESS

Aggregate function.

The SKEWNESS() function returns the skewness of all input values.

Analyze Syntax

func.skewness(<expr>)

Analyze Examples

func.skewness(table.temperature).alias('temperature_skewness')

| temperature_skewness |
|----------------------|
|      0.68            |

SQL Syntax

SKEWNESS(<expr>)

Arguments

ArgumentsDescription
<expr>Any numerical expression

Return Type

Nullable Float64.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE temperature_data (
                                  id INT,
                                  city_id INT,
                                  temperature FLOAT
);

INSERT INTO temperature_data (id, city_id, temperature)
VALUES (1, 1, 60),
       (2, 1, 65),
       (3, 1, 62),
       (4, 2, 70),
       (5, 2, 75);

Query Demo: Calculate Skewness of Temperature Data

SELECT SKEWNESS(temperature) AS temperature_skewness
FROM temperature_data;

Result

| temperature_skewness |
|----------------------|
|      0.68            |

28 - STDDEV_POP

Aggregate function.

The STDDEV_POP() function returns the population standard deviation(the square root of VAR_POP()) of an expression.

Analyze Syntax

func.stddev_pop(<expr>)

Analyze Examples

func.stddev_pop(table.score).alias('test_score_stddev_pop')

| test_score_stddev_pop |
|-----------------------|
|        7.07107        |

SQL Syntax

STDDEV_POP(<expr>)
STDDEV(<expr>)
STD(<expr>)

Arguments

ArgumentsDescription
<expr>Any numerical expression

Return Type

double

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE test_scores (
  id INT,
  student_id INT,
  score FLOAT
);

INSERT INTO test_scores (id, student_id, score)
VALUES (1, 1, 80),
       (2, 2, 85),
       (3, 3, 90),
       (4, 4, 95),
       (5, 5, 100);

Query Demo: Calculate Population Standard Deviation of Test Scores

SELECT STDDEV_POP(score) AS test_score_stddev_pop
FROM test_scores;

Result

| test_score_stddev_pop |
|-----------------------|
|        7.07107        |

29 - STDDEV_SAMP

Aggregate function.

The STDDEV_SAMP() function returns the sample standard deviation(the square root of VAR_SAMP()) of an expression.

Analyze Syntax

func.stddev_samp(<expr>)

Analyze Examples

func.stddev_samp(table.height).alias('height_stddev_samp')

| height_stddev_samp |
|--------------------|
|      0.240         |

SQL Syntax

STDDEV_SAMP(<expr>)

Arguments

ArgumentsDescription
<expr>Any numerical expression

Return Type

double

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE height_data (
  id INT,
  person_id INT,
  height FLOAT
);

INSERT INTO height_data (id, person_id, height)
VALUES (1, 1, 5.8),
       (2, 2, 6.1),
       (3, 3, 5.9),
       (4, 4, 5.7),
       (5, 5, 6.3);

Query Demo: Calculate Sample Standard Deviation of Heights

SELECT STDDEV_SAMP(height) AS height_stddev_samp
FROM height_data;

Result

| height_stddev_samp |
|--------------------|
|      0.240         |

30 - STRING_AGG

Aggregate function.

The STRING_AGG() function converts all the non-NULL values of a column to String, separated by the delimiter.

Analyze Syntax

func.string_agg(<expr> [, delimiter])

Analyze Examples

func.string_agg(table.language_name).alias('concatenated_languages')

|          concatenated_languages         |
|-----------------------------------------|
| Python, JavaScript, Java, C#, Ruby      |

SQL Syntax

STRING_AGG(<expr>)
STRING_AGG(<expr> [, delimiter])

Arguments

ArgumentsDescription
<expr>Any string expression (if not a string, use ::VARCHAR to convert)
delimiterOptional constant String, if not specified, use empty String

Return Type

the String type

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE programming_languages (
  id INT,
  language_name VARCHAR
);

INSERT INTO programming_languages (id, language_name)
VALUES (1, 'Python'),
       (2, 'JavaScript'),
       (3, 'Java'),
       (4, 'C#'),
       (5, 'Ruby');

Query Demo: Concatenate Programming Language Names with a Delimiter

SELECT STRING_AGG(language_name, ', ') AS concatenated_languages
FROM programming_languages;

Result

|          concatenated_languages         |
|------------------------------------------|
| Python, JavaScript, Java, C#, Ruby      |

31 - SUM

Aggregate function.

The SUM() function calculates the sum of a set of values.

Analyze Syntax

func.sum(<column>)

Analyze Examples

func.sum(table.quantity).alias('total_quantity_sold')

| total_quantity_sold |
|---------------------|
|         41          |

SQL Syntax

SUM(<expr>)

Arguments

ArgumentsDescription
<expr>Any numerical expression

Return Type

A double if the input type is double, otherwise integer.

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE sales_data (
  id INT,
  product_id INT,
  quantity INT
);

INSERT INTO sales_data (id, product_id, quantity)
VALUES (1, 1, 10),
       (2, 2, 5),
       (3, 3, 8),
       (4, 4, 3),
       (5, 5, 15);

Query Demo: Calculate the Total Quantity of Products Sold

SELECT SUM(quantity) AS total_quantity_sold
FROM sales_data;

Result

| total_quantity_sold |
|---------------------|
|         41          |

32 - SUM_IF

The suffix -If can be appended to the name of any aggregate function. In this case, the aggregate function accepts an extra argument – a condition.

Analyze Syntax

func.sum_if(<column>, <cond>)

Analyze Examples

func.sum_if(table.amount, table.status=='Completed').alias('total_amount_completed')

| total_amount_completed |
|------------------------|
|         270.0          |

SQL Syntax

SUM_IF(<column>, <cond>)

SQL Examples

Create a Table and Insert Sample Data

CREATE TABLE order_data (
  id INT,
  customer_id INT,
  amount FLOAT,
  status VARCHAR
);

INSERT INTO order_data (id, customer_id, amount, status)
VALUES (1, 1, 100, 'Completed'),
       (2, 2, 50, 'Completed'),
       (3, 3, 80, 'Cancelled'),
       (4, 4, 120, 'Completed'),
       (5, 5, 75, 'Cancelled');

Query Demo: Calculate the Total Amount of Completed Orders

SELECT SUM_IF(amount, status = 'Completed') AS total_amount_completed
FROM order_data;

Result

| total_amount_completed |
|------------------------|
|         270.0          |

33 - WINDOW_FUNNEL

Funnel Analysis

Similar to windowFunnel in ClickHouse (they were created by the same author), it searches for event chains in a sliding time window and calculates the maximum number of events from the chain.

The function works according to the algorithm:

  • The function searches for data that triggers the first condition in the chain and sets the event counter to 1. This is the moment when the sliding window starts.

  • If events from the chain occur sequentially within the window, the counter is incremented. If the sequence of events is disrupted, the counter isn’t incremented.

  • If the data has multiple event chains at varying completion points, the function will only output the size of the longest chain.

SQL Syntax

WINDOW_FUNNEL( <window> )( <timestamp>, <cond1>, <cond2>, ..., <condN> )

Arguments

  • <timestamp> — Name of the column containing the timestamp. Data types supported: integer types and datetime types.
  • <cond> — Conditions or data describing the chain of events. Must be Boolean datatype.

Parameters

  • <window> — Length of the sliding window, it is the time interval between the first and the last condition. The unit of window depends on the timestamp itself and varies. Determined using the expression timestamp of cond1 <= timestamp of cond2 <= ... <= timestamp of condN <= timestamp of cond1 + window.

Returned value

The maximum number of consecutive triggered conditions from the chain within the sliding time window. All the chains in the selection are analyzed.

Type: UInt8.

Example

Determine if a set period of time is enough for the user to SELECT a phone and purchase it twice in the online store.

Set the following chain of events:

  1. The user logged into their account on the store (event_name = 'login').
  2. The user land the page (event_name = 'visit').
  3. The user adds to the shopping cart(event_name = 'cart').
  4. The user complete the purchase (event_name = 'purchase').
CREATE TABLE events(user_id BIGINT, event_name VARCHAR, event_timestamp TIMESTAMP);

INSERT INTO events VALUES(100123, 'login', '2022-05-14 10:01:00');
INSERT INTO events VALUES(100123, 'visit', '2022-05-14 10:02:00');
INSERT INTO events VALUES(100123, 'cart', '2022-05-14 10:04:00');
INSERT INTO events VALUES(100123, 'purchase', '2022-05-14 10:10:00');

INSERT INTO events VALUES(100125, 'login', '2022-05-15 11:00:00');
INSERT INTO events VALUES(100125, 'visit', '2022-05-15 11:01:00');
INSERT INTO events VALUES(100125, 'cart', '2022-05-15 11:02:00');

INSERT INTO events VALUES(100126, 'login', '2022-05-15 12:00:00');
INSERT INTO events VALUES(100126, 'visit', '2022-05-15 12:01:00');

Input table:

+---------+------------+----------------------------+
| user_id | event_name | event_timestamp            |
+---------+------------+----------------------------+
|  100123 | login      | 2022-05-14 10:01:00.000000 |
|  100123 | visit      | 2022-05-14 10:02:00.000000 |
|  100123 | cart       | 2022-05-14 10:04:00.000000 |
|  100123 | purchase   | 2022-05-14 10:10:00.000000 |
|  100125 | login      | 2022-05-15 11:00:00.000000 |
|  100125 | visit      | 2022-05-15 11:01:00.000000 |
|  100125 | cart       | 2022-05-15 11:02:00.000000 |
|  100126 | login      | 2022-05-15 12:00:00.000000 |
|  100126 | visit      | 2022-05-15 12:01:00.000000 |
+---------+------------+----------------------------+

Find out how far the user user_id could get through the chain in an hour window slides.

Query:

SELECT
    level,
    count() AS count
FROM
(
    SELECT
        user_id,
        window_funnel(3600000000)(event_timestamp, event_name = 'login', event_name = 'visit', event_name = 'cart', event_name = 'purchase') AS level
    FROM events
    GROUP BY user_id
)
GROUP BY level ORDER BY level ASC;

Result:

+-------+-------+
| level | count |
+-------+-------+
|     2 |     1 |
|     3 |     1 |
|     4 |     1 |
+-------+-------+
  • User 100126 level is 2 (login -> visit) .
  • user 100125 level is 3 (login -> visit -> cart).
  • User 100123 level is 4 (login -> visit -> cart -> purchase).