HISTOGRAM (Lakehouse v1)
Generates a data distribution histogram using an “equal height” bucketing strategy.
Analyze Syntax
Section titled “Analyze Syntax”func.histogram(<expr>)Analyze Examples
Section titled “Analyze Examples”See SQL Example for details
SQL Syntax
Section titled “SQL Syntax”HISTOGRAM(<expr>)
-- The following two forms are equivalent:HISTOGRAM(<max_num_buckets>)(<expr>)HISTOGRAM(<expr> [, <max_num_buckets>])| Parameter | Description |
|---|---|
expr | The data type of expr should be sortable. |
max_num_buckets | Optional positive integer specifying the maximum number of buckets. Default is 128. |
Return Type
Section titled “Return Type”Returns either an empty string or a JSON object with the following structure:
- buckets: List of buckets with detailed information:
- lower: Lower bound of the bucket.
- upper: Upper bound of the bucket.
- count: Number of elements in the bucket.
- pre_sum: Cumulative count of elements up to the current bucket.
- ndv: Number of distinct values in the bucket.
SQL Examples
Section titled “SQL Examples”This example shows how the HISTOGRAM function analyzes the distribution of c_int values in the histagg table, returning bucket boundaries, distinct value counts, element counts, and cumulative counts:
CREATE TABLE histagg ( c_id INT, c_tinyint TINYINT, c_smallint SMALLINT, c_int INT);
INSERT INTO histagg VALUES (1, 10, 20, 30), (1, 11, 21, 33), (1, 11, 12, 13), (2, 21, 22, 23), (2, 31, 32, 33), (2, 10, 20, 30);
SELECT HISTOGRAM(c_int) FROM histagg;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐│ histogram(c_int) │├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤│ [{"lower":"13","upper":"13","ndv":1,"count":1,"pre_sum":0},{"lower":"23","upper":"23","ndv":1,"count":1,"pre_sum":1},{"lower":"30","upper":"30","ndv":1,"count":2,"pre_sum":2},{"lower":"33","upper":"33","ndv":1,"count":2,"pre_sum":4}] │└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘The result is returned as a JSON array:
[ { "lower": "13", "upper": "13", "ndv": 1, "count": 1, "pre_sum": 0 }, { "lower": "23", "upper": "23", "ndv": 1, "count": 1, "pre_sum": 1 }, { "lower": "30", "upper": "30", "ndv": 1, "count": 2, "pre_sum": 2 }, { "lower": "33", "upper": "33", "ndv": 1, "count": 2, "pre_sum": 4 }]This example shows how HISTOGRAM(2) groups c_int values into two buckets:
SELECT HISTOGRAM(2)(c_int) FROM histagg;-- OrSELECT HISTOGRAM(c_int, 2) FROM histagg;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐│ histogram(2)(c_int) │├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤│ [{"lower":"13","upper":"30","ndv":3,"count":4,"pre_sum":0},{"lower":"33","upper":"33","ndv":1,"count":2,"pre_sum":4}] │└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘The result is returned as a JSON array:
[ { "lower": "13", "upper": "30", "ndv": 3, "count": 4, "pre_sum": 0 }, { "lower": "33", "upper": "33", "ndv": 1, "count": 2, "pre_sum": 4 }]