SQL-to-Text AI
Not to be confused with the widely popular Text-to-SQL AI
We’re truly in the peak AI hype era, and I believe, or at least hope, it only goes down from here, because the extent to which people want to use AI is bonkers. I’ve always felt use cases should drive data work, not the other way around, and this applies to AI as well. I’m sure a big part of it is also a sense of panic, with people feeling left out if they’re not jumping on the AI hype train.
In hindsight, I wonder if LLMs were really the best thing the world could’ve started its AI journey with (at scale). Textual data makes up a very small share of what is actually leveraged, and few companies have useful text data with which they can truly drive business value. Or maybe I’m undervaluing it, given I belong to the herd that mostly works with tabular data. Nevertheless, I do find LLMs super useful. From being heavily dependent on StackOverflow for most of my career to replacing it completely with LLMs for any troubleshooting I need to do, it’s been such a blessing. The way they have accelerated the speed of development is truly commendable. I use AI as a personal tutor when I’m implementing anything new. It’s like pair programming with a fellow engineer, though the responsibility is completely mine if something goes wrong.
My concern is primarily the unrealistic expectations around LLMs and the irresistible urge to use them, which takes focus away from other viable solutions, or from time better spent on more important use cases. I recently came across a use case where someone wanted to use LLMs to find the description of an ID from an FAQ document, and my first reaction was: just use Ctrl-F, bro. I’m sure even they understand at some level that using LLMs for something like this is overkill, but again, I think it’s just the panic of feeling left out kicking in.
I recently came across this brilliant post by Bethany Lyons on LinkedIn, where she talks about the much-hyped Text-to-SQL use case, where everyone wants to generate SQL queries from a natural language prompt. I’ve always felt this to be counterintuitive. Forget super complicated data models; writing a natural language prompt for a single complicated window function with partitions feels like a task in itself. Why would we make things difficult for ourselves by struggling to express the logic in natural language when we can write it so simply in SQL? All this could also be coming from the hope of solving self-service analytics with AI, but it feels like the CEO would just end up hiring a data prompt engineer to write prompts that output the numbers through SQL. Who knows, contrary to the dystopian AI-taking-our-jobs view, this approach could end up creating more jobs by intentionally making things difficult lol.
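To make that concrete, here’s the kind of window-function logic that’s three lines of SQL but a mouthful in prose: a running total per customer. The table and data are made up, and the query runs against an in-memory SQLite database just so the example is self-contained:

```python
# Running total per customer: trivial to state in SQL, awkward to prompt for.
# Illustrative only; table and column names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('alice', '2024-01-01', 10.0),
        ('alice', '2024-01-05', 20.0),
        ('bob',   '2024-01-02', 5.0);
""")

rows = conn.execute("""
    SELECT customer,
           order_date,
           SUM(amount) OVER (
               PARTITION BY customer   -- restart the total for each customer
               ORDER BY order_date     -- accumulate in date order
           ) AS running_total
    FROM orders
    ORDER BY customer, order_date
""").fetchall()

for row in rows:
    print(row)
# ('alice', '2024-01-01', 10.0)
# ('alice', '2024-01-05', 30.0)
# ('bob', '2024-01-02', 5.0)
```

Now try describing that PARTITION BY / ORDER BY behavior unambiguously in a natural language prompt.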
Bethany’s post countered the Text-to-SQL use case that gets all the attention with the SQL-to-Text use case: using LLMs to understand what a SQL query is trying to do. Her point was that this, when done at scale, could uncover a treasure trove of business insight into what problems different teams are trying to solve. Now, this is the kind of AI automation we need. A big part of working in data is getting people together to understand the context behind all the fancy tools and insane volumes of data stored in our data warehouses. If AI can be leveraged to even partially automate this, or bring it to a level where it can extract a decent amount of context, it could be a game-changer for collaboration between data and business teams.
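As a minimal sketch of how the SQL-to-Text task might be framed, here’s a function that packages a query and its metadata into a single prompt. The prompt wording is my own guess, and the actual model call is left out; plug in whichever LLM client you use:

```python
# Sketch: turn (query, metadata) into an LLM prompt for SQL-to-Text.
# The prompt template is an assumption, not a tested recipe.

def build_sql_to_text_prompt(query: str, table_metadata: str) -> str:
    """Combine a SQL query with table/column descriptions into one prompt."""
    return (
        "You are a data analyst. Using the metadata below, explain in plain "
        "English what business question this SQL query answers.\n\n"
        f"Metadata:\n{table_metadata}\n\n"
        f"Query:\n{query}\n"
    )

query = "SELECT region, SUM(amount) FROM orders GROUP BY region"
metadata = "orders: one row per order; amount = order value in USD; region = sales region"

prompt = build_sql_to_text_prompt(query, metadata)
print(prompt)
```

Note that without the metadata half of the prompt, the model can only guess what `amount` or `region` mean, which is exactly the dependency discussed next.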
I got to thinking about the feasibility of this approach, and the one commonality that struck me between the Text-to-SQL and SQL-to-Text use cases is the need for metadata that serves as context for the LLMs. Be it generating SQL from a prompt or deciphering a SQL query, the availability of metadata becomes crucial. Metadata here means the descriptions of fields and tables, the information they contain, the level of granularity, the relationships with other fields and tables, and so on. At most companies, this is still very much knowledge that resides in the minds of the data SMEs. Documenting this level of metadata, even at a small to mid-sized company, would be a herculean task involving a lot of manual effort.
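One possible shape for that metadata, captured as plain data so it can be dropped straight into a prompt: field descriptions, grain, and relationships per table. The structure and names here are purely illustrative, not a standard:

```python
# A toy metadata catalog of the kind the post describes. Everything here
# (table names, fields, structure) is invented for illustration.
import json

catalog = {
    "orders": {
        "description": "One row per customer order.",
        "grain": "order_id",
        "columns": {
            "order_id": "Unique order identifier.",
            "customer_id": "FK to customers.customer_id.",
            "amount": "Order value in USD.",
        },
        "relationships": ["customers (many orders per customer)"],
    }
}

# Serialize to JSON so it can be pasted into an LLM prompt as context.
context = json.dumps(catalog, indent=2)
print(context)
```

Even this tiny example hints at the documentation burden: every table needs its grain, every column a description, and every relationship spelled out by someone who actually knows the data.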
I’m not sure if tools exist today that document metadata in such a structured way across all these levels and dimensions. Also, not every piece of data is equally important. I still feel it’s worth pursuing this problem: create some form of metadata documentation, albeit manually, for a crucial part of the business data, and explore how the SQL-to-Text approach works out.


That's a very interesting topic. My biggest concern is how to automate the generation of metadata. From my experience working on text-to-SQL, the most helpful metadata is usually generated by humans.
One potential solution might be to use existing query pairs that a company already has (e.g., revenue → query X, retention → query Y). You could feed these to the AI and ask it to generate column metadata from them. However, I’m not sure how accurate that would be if the same column is used with different meanings. Who’s going to decide what the right definition is?
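A rough sketch of this idea: mine existing (metric name, query) pairs for which columns each metric touches, as a crude seed for column metadata. The metric names and queries are made up, and the regex "parser" is deliberately naive; a real version would need a proper SQL parser:

```python
# Sketch: map columns to the metrics whose queries reference them.
# Invented examples; the regex tokenizer is a stand-in for real SQL parsing.
import re
from collections import defaultdict

query_pairs = {
    "revenue":   "SELECT SUM(amount) FROM orders WHERE status = 'paid'",
    "retention": "SELECT COUNT(DISTINCT customer_id) FROM orders WHERE repeat_flag = 1",
}

# Tokens that are SQL keywords or known table names, not columns.
keywords = {"select", "sum", "count", "distinct", "from", "where", "and", "or", "orders"}

column_to_metrics = defaultdict(set)
for metric, sql in query_pairs.items():
    sql_no_strings = re.sub(r"'[^']*'", "", sql)  # drop string literals like 'paid'
    for token in re.findall(r"[a-z_]+", sql_no_strings.lower()):
        if token not in keywords:
            column_to_metrics[token].add(metric)

for column, metrics in sorted(column_to_metrics.items()):
    print(column, "->", sorted(metrics))
```

A column appearing under several metrics is exactly the ambiguity raised above: the mapping can tell you a column matters to both revenue and retention, but not which definition is canonical.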
Curious to hear your thoughts!