How to Chat with Your BigQuery? Introducing semantic search to databases.
Large Language Models have shown to be very powerful in knowing and creating creating SQL queries against a variety of databases - including Google BigQuery.
This advancement in AI technology makes it possible to interact with your BigQuery database using natural language.
Data in BigQuery datasets tend to be quite complex and may contain one of the worlds most relevant business-related information.
Imagine a world, where you ask "How many users signed up last month?" and you get an answer in seconds, without needing to first search the database for the right data model and then writing an SQL query. Just using everyday, human language.
In this post we will show you how to build a conversational interface for Google BigQuery, using the powerful GPT-4o model. We'll outline the steps needed to set up your development environment, how to connect to BigQuery, how to use the OpenAI API to generate generate the SQL queries and how to execute them against your BigQuery database.
As a bonus, we are going to show how to use RAG - Retrieval Augmented Generation to improve the quality of the generated SQL queries.
Concept for a vector-search based chat interface for BigQuery?
First, let's have a look at how such a "chat with your BigQuery" solution might look like.
In general, we need the following elements:
- A way for the user to enter his query in natural language.
- A layer that translates the natural language query into SQL.
- A runtime to execute the SQL query against the BigQuery database.
While 1 and 3 are quite straightforward, the main logic and complexity lies in task number 2 - translating natural language into SQL.
To allow high quality SQL generation, we need the following sub-steps:
- Get the structure of the database - the database schema
- Get a description of the individual columns, so that the AI can understand the context of the data.
- Use retrieval augmented generation, to find the most relevant BigQuery tables and columns for the user query.
- Use a large language model to generate the SQL query.
- Automatically correct errors in the generated SQL query - as LLMs art not yet perfect and tend to produce erroneous queries from time to time.
Meaning, to have a high quality and robust conversational chat interface for BigQuery, we need the following high-level architecture:
High-level architecture of a conversational chat interface for BigQuery
If we look at the architecture in more detail, we can see that we not only need to "connect" our BigQuery database to an AI model, but we need an AI Agent
In short, AI agents are AI models that take a user input and than independently use their knowledge, external tools (like database access) and memory to solve the user query.
In our case, we need to build a small agent, specified as follows:
Task: Take the user query and translate it to a working SQL query, answering the user query. Tools: BigQuery database access, Python runtime Knowledge: SQL generation (as part of the AI model)
AI Agent Overview
Note: As the set of tasks we need the AI to independently solve is limited, our AI agent will be quite small. The concept of AI agents is phenomenally powerful for a variety of tasks, but in our case we only need a very limited implementation. Basically, all our agent needs to do is to generate SQL queries, execute them, decide whether to execute them again on error and return the result to the user.
Step by step: How to build a Text2SQL agent for BigQuery
Now that we established the basic architecture, lets go through the steps required to implement it. We are implementing the whole AI agnet loop from scratch, without using any of the existing LLM libraries.
Why not using LangChain or LlamaIndex?
Both LangChain and LlamaIndex provide SQL AI Agents as part of their feature set.
While they provide both a great way to get started with text2sql agents, their abstractions make it difficult to understand the underlying processes. Furthermore, and more important, their implementation is rather generic for many SQL databases. We found that after many hours and days of creating SQL AI agents, that each and any use case needs tweaking in how the SQL is generated and how errors are handled. This measure of flexibility is just not possible with the high-level interfaces provided by LangChain and LlamaIndex.
Preparation
First, make sure, that you have an active Google Cloud project as well as activated billing for said project.
Next, enable the following Google Cloud APIs:
Next, set up the Google Cloud BigQuery authentication as follows:
-
Create a service account for your project. Add the following BigQuery roles:
- BigQuery User
- BigQuery Data Viewer
- BigQuery Job User
-
Download the service account key of the newly created service account.
Finally, let's import some demo data into BigQuery. Alternatively, you can use your own data.
-
Download the demo dataset. It contains information from a well-known Kaggle challenge about churn of a Telecom company.
-
Import the file into BigQuery in a new dataset
aiagent_test
and tablechurn_table
.
Implementation of the AI Agent
-
Create a new venv or conda environment and activate it.
-
Install the required Python packages:
-
Import the modules and define the BigQuery table and dataset and set up the openai api key.
-
Create an sqlalchemy engine for the BigQuery database.
-
Let's test our database connection:
The results should be:
-
As we are able to connect to our database, let's retrieve the database schema.
This provides a list of all the tables and columns in the provided BigQuery dataset.
Preparation done, we can now start with the actual implementation of the AI agent. In short, we will use the OpenAI API to generate the SQL queries and execute them against the BigQuery database.
Implementing the AI Agent loop
At it's core, AI agents are nothing more than a while loop, that takes a task/query and invokes an AI model to solve it. The while loop (so, the agent) runs for as long as no stop condition is given.
In our case, we only have a simple agent. In pseudo-code it looks:
while:
- Use AI to generate SQL query based on user input and using the database schema
- Execute the generated SQL query on the BigQuery database
- Validate, if there were any errors, if yes, use AI to correct the query
- Optional: If no errors, use AI to validate, if the results answer the user query. If not, repeat
- Stop if the user query is answered
Let's implement this in python. First, we create a function to generate our query. We'll use the remarkable OpenAI function calling feature for that, as function calling provides robust JSON output which is easy to parse.
NOTE: Many experiments show, that AI models tend to be less intelligent when using function calling. This was not a problem for our SQL generation attempts so far - but keep that in mind. If you find your AI model is constantly producing bad queries, try to use the "normal" chat endpoint and parse the output yourself.
This function will be used to run the very first AI query generation attempt. We'll execute the resulting query against the BigQuery database and check for errors. If there are any, we'll use the AI to correct the query.
Let's create this error handling function first. It's basically just another LLM query generation, but we'll also send the error message to the model, so that it can correct the query.
As you can see, we are just updating the system prompt to include the error and ask the model to correct this error. Otherwise, it's the same request and function as before.
And finally, bring it all together, by creating the agent-loop:
All we have to do now is to run our agent and provide a question:
And - amazingly - the agent is able to Automatically generate the SQL query, execute the query, self-heal in case of errors and provide the correct result!
Let's try something more complicated: We want to know the percentage of males and females in the dataset.
Output:
Quite impressive, isn't it?
Improving our AI SQL Agent for BigQuery data analysis
While our little agent is already quite nice to play with, it creates quite a good amount of garbage queries. While it's surprisingly good in resolving these errors on it's own, there are some methods to reduce these erroneous attempts.
-
Adding a description to the database columns: In this example, we only provide the table name as well as the column names as context to the AI model. The model is than asked to try to "guess" which columns might be relevant for answering the users question. This is quite error-prone and can be improved by adding a description to the columns.
There are 2 ways to do this: a. Manually add a description to the columns. While this is in fact a little tedious, it is the most accurate and reliable way to provide context to the model. b. Send a head(10) of each column to GPT-4o and ask for a description. The model will most probably be able to provide a reasonable enough description.
As both methods are quite simple, we'll leave the implementation to the reader. For the sake of this post I used GPT-4o to refactor our schema as follows:
-
Using RAG to first find potentially relevant tables and columns to reduce the search space for the AI model.
This method is quite powerful and makes use of the fact, that most user queries are somehow related to the available columns. By using a RAG system to search for relevant columns and tables before asking the AI to create a query, we can potentially drastically improve the quality of the generated results.
Using RAG for text2sql BigQuery SQL agents
To demonstrate how such a RAG-assisted text2sql system might look like, let's extend the agent we just created.
The concept for RAG is always the same: we use a set of texts (in our case column and table names), create vector indexes for them and then use these indexes to find semantically similar texts (columns and tables) to the user query.
More specifically, we are creating an index on our column descriptions and use the other keys (table_name, column_name and column_type) as metadata.
We use ChromaDB as our vector store for this example, but any other vector store will work as well.
Note: ChromaDB provides a really handy feature
where you just add the "document" (the text you want to index) to the
collection.add()
method - without providing any embeddings. ChromaDB
will handle the embedding creation for you.
We chose to manually create embeddings in this example so that it's easier
to port to other vector stores. If you use ChromaDB anyhow, feel free to
make use of this feature though.
So, embeddings created, lets create a function to retrieve the most relevant columns.
This function finds the 10 most relevant columns with regards to the user query. Adjust this number as needed.
As ChromaDB does not provide a "threshold" parameter, we than manually filter the results to only include columns which are somewhat relevant for our purposes.
NOTE: If your vector store provides a threshold feature, you can skip the manual filtering step.
Now let's use this function in our agent:
The only change needed is to call the query_chromadb
function before
calling the query_database
function. Instead of sending over the whole
table schema, we only send the subset of columns which might be relevant
for the job.
Let's try it out once again:
And - thank god - our agent still works like a charm!
Conclusion
To conclude this blog post on creating a conversational interface for Google BigQuery using GPT-4, let's summarize the key points and discuss the implications of this technology:
We've demonstrated how to build a useful, simple AI agent that can translate natural language queries into SQL, execute them against a BigQuery database, and handle errors automatically.
The implementation combines several cutting-edge technologies:
-
OpenAI's GPT-4 for natural language understanding and SQL generation
-
Google BigQuery for data storage and
-
querying Retrieval Augmented Generation (RAG) for improving query relevance
This approach allows non-technical users to interact with complex databases using everyday language, significantly lowering the barrier to data analysis.
We've shown how to improve the accuracy of SQL generation by:
-
Adding descriptions to database columns
-
Using RAG to find relevant tables and columns before query generation
The flexibility of this custom implementation allows for fine-tuning and optimization specific to BigQuery and particular use cases.
We find, that these natural language interfaces for databases have the potential to change the way how organizations interact with their data. It democratizes access to complex databases, allowing more people within an organization to derive insights without needing to learn SQL or rely heavily on heavily on data teams. Or - even better - allow data professionals to focus focus more on the actual analysis and less on the data retrieval.
However, it's important to note that while this system is powerful, it's not infallible. Users should always verify critical results, especially for complex queries or when making important business decisions based on the output.
NOTE: If you need help with implementing such a system for your organization, feel free to reach out to us - we are happy to help!
Further reading
- Using AI directly from your PostgreSQL database
- How to test your RAG pipeline?
- How to host your own LLM - including HTTPS?
------------------
Interested in how to train your very own Large Language Model?
We prepared a well-researched guide for how to use the latest advancements in Open Source technology to fine-tune your own LLM. This has many advantages like:
- Cost control
- Data privacy
- Excellent performance - adjusted specifically for your intended use