This document demonstrates how to leverage PlaidCloud Lakehouse's built-in AI functions for creating document embeddings, searching for similar documents, and generating text completions based on context.
This is the multi-page printable view of this section. Click here to print.
AI Functions
1 - AI_EMBEDDING_VECTOR
This document provides an overview of the ai_embedding_vector function in PlaidCloud Lakehouse and demonstrates how to create document embeddings using this function.
The main code implementation can be found here.
By default, PlaidCloud Lakehouse leverages the text-embedding-ada model for generating embeddings.
Starting from PlaidCloud Lakehouse v1.1.47, PlaidCloud Lakehouse supports the Azure OpenAI service.
This integration offers improved data privacy.
To use Azure OpenAI, add the following configurations to the [query]
section:
# Azure OpenAI
openai_api_chat_base_url = "https://<name>.openai.azure.com/openai/deployments/<name>/"
openai_api_embedding_base_url = "https://<name>.openai.azure.com/openai/deployments/<name>/"
openai_api_version = "2023-03-15-preview"
PlaidCloud Lakehouse relies on (Azure) OpenAI for AI_EMBEDDING_VECTOR
and sends the embedding column data to (Azure) OpenAI.
They will only work when the PlaidCloud Lakehouse configuration includes the openai_api_key
, otherwise they will be inactive.
This function is available by default on PlaidCloud Lakehouse using an Azure OpenAI key. If you use them, you acknowledge that your data will be sent to Azure OpenAI by us.
Overview of ai_embedding_vector
The ai_embedding_vector
function in PlaidCloud Lakehouse is a built-in function that generates vector embeddings for text data. It is useful for natural language processing tasks, such as document similarity, clustering, and recommendation systems.
The function takes a text input and returns a high-dimensional vector that represents the input text's semantic meaning and context. The embeddings are created using pre-trained models on large text corpora, capturing the relationships between words and phrases in a continuous space.
Creating embeddings using ai_embedding_vector
To create embeddings for a text document using the ai_embedding_vector
function, follow the example below.
- Create a table to store the documents:
CREATE TABLE documents (
id INT,
title VARCHAR,
content VARCHAR,
embedding ARRAY(FLOAT32)
);
- Insert example documents into the table:
INSERT INTO documents(id, title, content)
VALUES
(1, 'A Brief History of AI', 'Artificial intelligence (AI) has been a fascinating concept of science fiction for decades...'),
(2, 'Machine Learning vs. Deep Learning', 'Machine learning and deep learning are two subsets of artificial intelligence...'),
(3, 'Neural Networks Explained', 'A neural network is a series of algorithms that endeavors to recognize underlying relationships...'),
- Generate the embeddings:
UPDATE documents SET embedding = ai_embedding_vector(content) WHERE length(embedding) = 0;
After running the query, the embedding column in the table will contain the generated embeddings.
The embeddings are stored as an array of FLOAT32
values in the embedding column, which has the ARRAY(FLOAT32)
column type.
You can now use these embeddings for various natural language processing tasks, such as finding similar documents or clustering documents based on their content.
- Inspect the embeddings:
SELECT length(embedding) FROM documents;
+-------------------+
| length(embedding) |
+-------------------+
| 1536 |
| 1536 |
| 1536 |
+-------------------+
The query above shows that the generated embeddings have a length of 1536(dimensions) for each document.
2 - AI_TEXT_COMPLETION
This document provides an overview of the ai_text_completion
function in PlaidCloud Lakehouse and demonstrates how to generate text completions using this function.
The main code implementation can be found here.
Starting from PlaidCloud Lakehouse v1.1.47, PlaidCloud Lakehouse supports the Azure OpenAI service.
This integration offers improved data privacy.
To use Azure OpenAI, add the following configurations to the [query]
section:
# Azure OpenAI
openai_api_chat_base_url = "https://<name>.openai.azure.com/openai/deployments/<name>/"
openai_api_embedding_base_url = "https://<name>.openai.azure.com/openai/deployments/<name>/"
openai_api_version = "2023-03-15-preview"
PlaidCloud Lakehouse relies on (Azure) OpenAI for AI_TEXT_COMPLETION
and sends the completion prompt data to (Azure) OpenAI.
They will only work when the PlaidCloud Lakehouse configuration includes the openai_api_key
, otherwise they will be inactive.
This function is available by default on PlaidCloud Lakehouse using an Azure OpenAI key. If you use them, you acknowledge that your data will be sent to Azure OpenAI by us.
Overview of ai_text_completion
The ai_text_completion
function in PlaidCloud Lakehouse is a built-in function that generates text completions based on a given prompt. It is useful for natural language processing tasks, such as question answering, text generation, and autocompletion systems.
The function takes a text prompt as input and returns a generated completion for the prompt. The completions are created using pre-trained models on large text corpora, capturing the relationships between words and phrases in a continuous space.
Generating text completions using ai_text_completion
Here is a simple example using the ai_text_completion
function in PlaidCloud Lakehouse to generate a text completion:
SELECT ai_text_completion('What is artificial intelligence?') AS completion;
Result:
+--------------------------------------------------------------------------------------------------------------------+
| completion |
+--------------------------------------------------------------------------------------------------------------------+
| Artificial intelligence (AI) is the field of study focused on creating machines and software capable of thinking, learning, and solving problems in a way that mimics human intelligence. This includes areas such as machine learning, natural language processing, computer vision, and robotics. |
+--------------------------------------------------------------------------------------------------------------------+
In this example, we provide the prompt "What is artificial intelligence?" to the ai_text_completion
function, and it returns a generated completion that briefly describes artificial intelligence.
3 - AI_TO_SQL
Converts natural language instructions into SQL queries with the latest model text-davinci-003
.
PlaidCloud Lakehouse offers an efficient solution for constructing SQL queries by incorporating OLAP and AI. Through this function, instructions written in a natural language can be converted into SQL query statements that align with the table schema. For example, the function can be provided with a sentence like "Get all items that cost 10 dollars or less" as an input and generate the corresponding SQL query "SELECT * FROM items WHERE price <= 10" as output.
The main code implementation can be found here.
Starting from PlaidCloud Lakehouse v1.1.47, PlaidCloud Lakehouse supports the Azure OpenAI service.
This integration offers improved data privacy.
To use Azure OpenAI, add the following configurations to the [query]
section:
# Azure OpenAI
openai_api_chat_base_url = "https://<name>.openai.azure.com/openai/deployments/<name>/"
openai_api_embedding_base_url = "https://<name>.openai.azure.com/openai/deployments/<name>/"
openai_api_version = "2023-03-15-preview"
PlaidCloud Lakehouse relies on (Azure) OpenAI for AI_TO_SQL
but only sends the table schema to (Azure) OpenAI, not the data.
They will only work when the PlaidCloud Lakehouse configuration includes the openai_api_key
, otherwise they will be inactive.
This function is available by default on PlaidCloud Lakehouse using an Azure OpenAI key. If you use them, you acknowledge that your table schema will be sent to Azure OpenAI by us.
Analyze Syntax
func.ai_to_sql('<natural-language-instruction>')
Analyze Examples
In this example, an SQL query statement is generated from an instruction with the AI_TO_SQL function, and the resulting statement is executed to obtain the query results.
func.ai_to_sql('List the total amount spent by users from the USA who are older than 30 years, grouped by their names, along with the number of orders they made in 2022')
A SQL statement is generated by the function as the output:
*************************** 1. row ***************************
database: openai
generated_sql: SELECT name, SUM(price) AS total_spent, COUNT(order_id) AS total_orders
FROM users
JOIN orders ON users.id = orders.user_id
WHERE country = 'USA' AND age > 30 AND order_date BETWEEN '2022-01-01' AND '2022-12-31'
GROUP BY name;
SQL Syntax
USE <your-database>;
SELECT * FROM ai_to_sql('<natural-language-instruction>');
Obtain and Config OpenAI API Key
- To obtain your openAI API key, please visit https://platform.openai.com/account/api-keys and generate a new key.
- Configure the databend-query.toml file with the openai_api_key setting.
[query]
... ...
openai_api_key = "<your-key>"
SQL Examples
In this example, an SQL query statement is generated from an instruction with the AI_TO_SQL function, and the resulting statement is executed to obtain the query results.
- Prepare data.
CREATE DATABASE IF NOT EXISTS openai;
USE openai;
CREATE TABLE users(
id INT,
name VARCHAR,
age INT,
country VARCHAR
);
CREATE TABLE orders(
order_id INT,
user_id INT,
product_name VARCHAR,
price DECIMAL(10,2),
order_date DATE
);
-- Insert sample data into the users table
INSERT INTO users VALUES (1, 'Alice', 31, 'USA'),
(2, 'Bob', 32, 'USA'),
(3, 'Charlie', 45, 'USA'),
(4, 'Diana', 29, 'USA'),
(5, 'Eva', 35, 'Canada');
-- Insert sample data into the orders table
INSERT INTO orders VALUES (1, 1, 'iPhone', 1000.00, '2022-03-05'),
(2, 1, 'OpenAI Plus', 20.00, '2022-03-06'),
(3, 2, 'OpenAI Plus', 20.00, '2022-03-07'),
(4, 2, 'MacBook Pro', 2000.00, '2022-03-10'),
(5, 3, 'iPad', 500.00, '2022-03-12'),
(6, 3, 'AirPods', 200.00, '2022-03-14');
- Run the AI_TO_SQL function with an instruction written in English as the input.
SELECT * FROM ai_to_sql(
'List the total amount spent by users from the USA who are older than 30 years, grouped by their names, along with the number of orders they made in 2022');
A SQL statement is generated by the function as the output:
*************************** 1. row ***************************
database: openai
generated_sql: SELECT name, SUM(price) AS total_spent, COUNT(order_id) AS total_orders
FROM users
JOIN orders ON users.id = orders.user_id
WHERE country = 'USA' AND age > 30 AND order_date BETWEEN '2022-01-01' AND '2022-12-31'
GROUP BY name;
- Run the generated SQL statement to get the query results.
+---------+-------------+-------------+
| name | order_count | total_spent |
+---------+-------------+-------------+
| Bob | 2 | 2020.00 |
| Alice | 2 | 1020.00 |
| Charlie | 2 | 700.00 |
+---------+-------------+-------------+
4 - COSINE_DISTANCE
This document provides an overview of the cosine_distance function in PlaidCloud Lakehouse and demonstrates how to measure document similarity using this function.
The cosine_distance function in PlaidCloud Lakehouse is a built-in function that calculates the cosine distance between two vectors. It is commonly used in natural language processing tasks, such as document similarity and recommendation systems.
Cosine distance is a measure of similarity between two vectors, based on the cosine of the angle between them. The function takes two input vectors and returns a value between 0 and 1, with 0 indicating identical vectors and 1 indicating orthogonal (completely dissimilar) vectors.
Analyze Syntax
func.cosine_distance(<vector1>, <vector2>)
SQL Examples
Creating a Table and Inserting Sample Data
Let's create a table to store some sample text documents and their corresponding embeddings:
CREATE TABLE articles (
id INT,
title VARCHAR,
content VARCHAR,
embedding ARRAY(FLOAT32)
);
Now, let's insert some sample documents into the table:
INSERT INTO articles (id, title, content, embedding)
VALUES
(1, 'Python for Data Science', 'Python is a versatile programming language widely used in data science...', ai_embedding_vector('Python is a versatile programming language widely used in data science...')),
(2, 'Introduction to R', 'R is a popular programming language for statistical computing and graphics...', ai_embedding_vector('R is a popular programming language for statistical computing and graphics...')),
(3, 'Getting Started with SQL', 'Structured Query Language (SQL) is a domain-specific language used for managing relational databases...', ai_embedding_vector('Structured Query Language (SQL) is a domain-specific language used for managing relational databases...'));
Querying for Similar Documents
Now, let's find the documents that are most similar to a given query using the cosine_distance function:
SELECT
id,
title,
content,
cosine_distance(embedding, ai_embedding_vector('How to use Python in data analysis?')) AS similarity
FROM
articles
ORDER BY
similarity ASC
LIMIT 3;
Result:
+------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
| id | title | content | similarity |
+------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
| 1 | Python for Data Science | Python is a versatile programming language widely used in data science... | 0.1142081 |
| 2 | Introduction to R | R is a popular programming language for statistical computing and graphics... | 0.18741018 |
| 3 | Getting Started with SQL | Structured Query Language (SQL) is a domain-specific language used for managing relational databases... | 0.25137568 |
+------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+