This article was originally posted at RTDIP
Generative AI is changing how we think about data, particularly the knowledge it can unlock from unstructured data that simply wasn’t possible before. However, it has also piqued our curiosity about structured data – can Generative AI also query structured data? Could it query time series data to answer questions such as:
"What was the average actual power generated by Turbine 1 at ACME Wind Farm on 6 May?"
Considering this, RTDIP Developers have been conducting experiments to see how Generative AI could help query structured data and found some very exciting and novel approaches that are available for you to try out in the RTDIP SDK v0.5.0!
Let’s explore how ChatGPT and langchain can execute SQL queries on tables in Databricks using questions like the one above. Keep in mind that you can try this out on any data in Databricks; it is not limited to data ingested by RTDIP pipelines.
Note
This is experimental and you will likely experience variable responses to your questions depending on the complexity of the data you use in this setup. Start small, with only 2–3 tables, before scaling up.
Overview
Langchain is making it much simpler to leverage Large Language Models (LLMs) in various contexts, such as AI-powered SQL agents, through toolkits. RTDIP SDK v0.5.0 takes advantage of the Langchain SQL Database Agent, which can build SQL queries from text questions and execute them on Databricks. It is worth spending time reading the langchain documentation, the available toolkits and the SQL Database Agent used by the RTDIP SDK to understand in more detail how it technically works.
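For readers curious about what happens under the hood, the sketch below shows the plain langchain pattern that the RTDIP SDK wraps. It is a minimal sketch assuming a 2023-era langchain release (SQLDatabase, SQLDatabaseToolkit and create_sql_agent) and the SQLAlchemy dialect provided by databricks-sql-connector; the connection values are placeholders and exact imports may differ between langchain versions.
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.chat_models import ChatOpenAI
from langchain.sql_database import SQLDatabase

# Connect to Databricks over the SQLAlchemy dialect from databricks-sql-connector
db = SQLDatabase.from_uri(
    "databricks://token:<databricks PAT token>@<databricks host name>"
    "?http_path=<databricks http path>&catalog=<catalog>&schema=<schema>"
)

# The LLM that plans the steps and writes the SQL
llm = ChatOpenAI(model_name="gpt-4", openai_api_key="<Open AI API key>", temperature=0)

# The toolkit exposes list-tables, inspect-schema and run-query tools to the agent
toolkit = SQLDatabaseToolkit(db=db, llm=llm)
agent_executor = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)

agent_executor.run("What was the average actual power generated by Turbine 1 at ACME Wind Farm on 6 May?")
The ChatOpenAIDatabricksConnection component described below bundles this wiring into a single class.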
Let’s get straight to it.
Prerequisites
- Obtain an OpenAI API key. Register on the OpenAI platform, follow these instructions to obtain your API key and familiarize yourself with the OpenAI documentation.
- Access to Databricks SQL or a Databricks Cluster and data stored as tables in Unity Catalog or Hive Metastore.
Warning
Consider the implications of exposing your data to OpenAI models and seek approval prior to registering with any of these services.
Set up the SQL AI Agent
With all the prerequisites in place, it’s time to set up the SQL AI Agent.
First, import the required RTDIP SDK component:
from rtdip_sdk.connectors import ChatOpenAIDatabricksConnection
Next, configure the component with all the relevant connection information:
agent = ChatOpenAIDatabricksConnection(
    catalog="<databricks catalog>",
    schema="<databricks schema>",
    server_hostname="<databricks host name>",
    http_path="<databricks http path>",
    access_token="<Azure AD token or databricks PAT token>",
    openai_api_key="<Open AI API key>",
    openai_model="gpt-4",
    sample_rows_in_table_info=5,
    verbose_logging=True
)
Some notes on the above:
- server_hostname and http_path can be obtained from your Databricks SQL Warehouse or Databricks Cluster
- access_token can be either a Databricks PAT Token or an Azure AD Token. To obtain an Azure AD token, please refer to this documentation
- openai_model defaults to gpt-4, which is not easily available at the time of writing. Alternatively, gpt-3.5-turbo-16k-0613 has worked well in our tests
- sample_rows_in_table_info limits the number of rows queried in a table when the SQL Database Agent is looking for context in the data. Be careful not to increase this too much, as it is then possible to exceed token limits on the GPT models
Execute Queries
With the ChatOpenAIDatabricksConnection class configured, it’s now possible to execute queries. Try out some questions with the following code:
response = agent.run("What was the average actual power generated by Turbine 1 at ACME Wind Farm on 6 May?")
print(response)
And the answer:
"The average power generated for turbine 1 at Acme on 6 May is 847.4923418083226."
How did it work out the answer?
It’s possible to trace how the SQL AI Agent determined the answer using the logs. To switch on the logs, set verbose_logging = True in ChatOpenAIDatabricksConnection.
Below are the logs generated by the SQL AI Agent to solve the above question:
Action: list_tables_sql_db
Action Input: ""
Observation: acme_events, descartes_events
Thought: I can query the table acme_events. I should check the schema of this table to see if it contains the information I need.
Action: schema_sql_db
Action Input: "acme_events"
Observation:
CREATE TABLE acme_events (
`TagName` STRING NOT NULL,
`Source` STRING NOT NULL,
`EventDate` DATE NOT NULL,
`EventTime` TIMESTAMP NOT NULL,
`Status` STRING,
`Value` FLOAT
) USING DELTA
/*
5 rows from acme_events table:
TagName Source EventDate EventTime Status Value
ZZ:ACM_WF_WTG02_analog_GriActPwrMaxMet 2023-04-03 2023-04-03 11:26:16.525000+00:00 Good 4100.0
ZZ:ACM_WF_WTG02_analog_GriActPwrMaxMan 2023-04-03 2023-04-03 11:26:16.525000+00:00 Good 4100.0
ZZ:ACM_WF_WTG02_analog_GriActPwrMaxWtgProt 2023-04-03 2023-04-03 11:26:16.525000+00:00 Good 3660.0
ZZ:ACM_WF_WTG02_analog_GriActPwrMaxGcmErr 2023-04-03 2023-04-03 11:26:16.525000+00:00 Good 4100.0
ZZ:ACM_WF_WTG02_analog_GriActPwrMaxGcmVer 2023-04-03 2023-04-03 11:26:16.525000+00:00 Good 4100.0
*/
Thought: I can query the table "acme_events" to find the average power generated for turbine 1 at Acme on 6 May. I will use the "query_sql_db" action to execute the query.
Action: query_sql_db
Action Input: "SELECT AVG(Value) FROM acme_events WHERE TagName = 'ZZ:ACM_WF_WTG01_analog_PwrAct' AND EventDate = '2023-05-06'"
Observation: [(847.4923418083226,)]
Thought: The average power generated for turbine 1 at Acme on 6 May is 847.4923418083226.
Final Answer: 847.4923418083226
What is most interesting from the above is how the SQL AI Agent is able to work out the TagName from the available data and then construct the right SQL statement and filters!
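If you want to sanity-check the agent’s answer, the SQL statement from the trace can be run directly against Databricks. The sketch below assumes the RTDIP SDK’s DatabricksSQLConnection connector and its cursor methods, reusing the same placeholder connection details as the agent setup above; with a different connector the query itself is unchanged.
from rtdip_sdk.connectors import DatabricksSQLConnection

# Reuse the same connection details as the agent setup above
connection = DatabricksSQLConnection(
    server_hostname="<databricks host name>",
    http_path="<databricks http path>",
    access_token="<Azure AD token or databricks PAT token>",
)
cursor = connection.cursor()

# The exact statement the agent generated in the trace above
cursor.execute(
    "SELECT AVG(Value) FROM acme_events "
    "WHERE TagName = 'ZZ:ACM_WF_WTG01_analog_PwrAct' AND EventDate = '2023-05-06'"
)
print(cursor.fetchall())

cursor.close()
connection.close()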
Now try some additional queries, such as:
What data is available?
How many rows are available for ACME?
How many tags are there for Descartes?
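A simple way to work through these is to loop over them with the agent configured earlier; the only method assumed here is the same agent.run used above.
# Ask each follow-up question in turn and print the agent's answer
questions = [
    "What data is available?",
    "How many rows are available for ACME?",
    "How many tags are there for Descartes?",
]

for question in questions:
    print(question)
    print(agent.run(question))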
Limitations
The biggest limitation experienced to date is the token limit on the GPT models. These limits can be quickly exceeded as the SQL AI Agent queries the data and uses the responses as tokens to find context and answers to questions. Some of this can be reduced by using the models that allow for higher token limits, like the gpt-3.5-turbo-16k-0613 model. However, if the Databricks schema has many tables and a number of queries are needed to determine the answer, 16k tokens is still easily exceeded.
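One practical mitigation, following the notes above, is to combine the higher-limit model with a small sample_rows_in_table_info value. A minimal sketch, reusing the placeholder connection details from the setup earlier:
from rtdip_sdk.connectors import ChatOpenAIDatabricksConnection

agent = ChatOpenAIDatabricksConnection(
    catalog="<databricks catalog>",
    schema="<databricks schema>",
    server_hostname="<databricks host name>",
    http_path="<databricks http path>",
    access_token="<Azure AD token or databricks PAT token>",
    openai_api_key="<Open AI API key>",
    openai_model="gpt-3.5-turbo-16k-0613",  # model with a 16k context window
    sample_rows_in_table_info=3,  # fewer sample rows means fewer tokens spent per table
    verbose_logging=True
)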
We did not experience much cross-querying of tables to solve answers. For example, the agent did not try to query a table of users to get their User IDs from their provided names and then solve a question for that user ID in another table. This seems like an obvious next step and is likely to be improved soon.
The answers are not always the same; asking the same question multiple times did not always provide the same answer (however, because it is an AI-based solution, there could be good reasons for that).
Next Steps
First and foremost, we want to be able to run the langchain SQL agent on self-hosted, open source LLMs. This could significantly reduce challenges around exposing data to public models, allow for the development of LLMs built specifically for these SQL-based tasks and, in turn, increase limits around tokens. We are exploring options in this domain, such as MosaicML’s MPT LLM running on self-hosted GPUs. Please reach out if you know of LLMs that work well in the context of SQL queries. We will build additional connectors if suitable open source LLMs become available.
We would also like to see an increase in accuracy. The model and approach need fine-tuning to get more accurate answers:
- Langchain is actively developing the SQL Database Agent, which will no doubt improve its complexity and capability over time
- Microsoft will continue to improve their GPT models to construct better and more complex SQL queries
Conclusion
Watching the langchain SQL Database Agent solve problems with no input or context from us was a very exciting moment. Seeing it explore tables, construct SQL queries and decipher patterns in text columns to answer non-trivial questions about specific tags or dimensions was fascinating and far more advanced than we imagined. It is fair to conclude that Generative AI is going to play a significant role in data, both structured and unstructured, and it will be interesting to see how data platforms adopt this capability into their base services and across the different layers of the data landscape. RTDIP will continue to research and provide capability in this area, so stay in touch to get the latest updates about RTDIP and its use of Generative AI.