SQL help from ChatGPT
March 23, 2025 at 9:14 AM by Dr. Drang
I had a couple of SQL database queries that I wanted to improve. Instead of going my usual route of searching for the answer in SQLite’s documentation, I decided to give ChatGPT a crack at it. I hoped that I would not only get better queries, but that I’d also learn something about SQL by studying the answers. It took a couple of tries, but I did end up with queries that work, and I think I know a little more SQL than I did before.
I should say here that I’m sure I could have used Claude or any other LLM for this. ChatGPT was just the one that came to mind first, which is probably the way OpenAI likes it.
Background
The database at issue is one I’ve slowly built over the past couple of years to organize all of my technical books. It uses SQLite because I didn’t want to run a database server. The database has three tables: one with a record for each book, one with a record for each author, and one that ties the books and authors together. This structure is what’s most commonly recommended for a library. It uses the third table to handle the many-to-many relationship between books and authors—books can have more than one author and authors can write more than one book.
Here are how the tables are defined. The book
table is
CREATE TABLE IF NOT EXISTS "book" (
"id" INTEGER UNIQUE PRIMARY KEY AUTOINCREMENT NOT NULL,
"title" TEXT NOT NULL,
"subtitle" TEXT,
"volume" TEXT,
"edition" TEXT,
"publisher" TEXT,
"published" TEXT,
"lccn" TEXT,
"loc" TEXT,
"added" TEXT NOT NULL
);
The author
table is
CREATE TABLE IF NOT EXISTS "author" (
"id" INTEGER UNIQUE NOT NULL PRIMARY KEY AUTOINCREMENT,
"name" TEXT
);
And the book_author
table is
CREATE TABLE IF NOT EXISTS "book_author" (
"id" INTEGER PRIMARY KEY AUTOINCREMENT,
"book_id" INTEGER,
"author_id" INTEGER
);
My goal is to have shell scripts that allow me to search the database by title or author so I can quickly find books on my shelves, which are organized by the Library of Congress Classification number. That’s the loc
field in the book
table.
Because the main portion of each shell script will be the corresponding SQL query, that’s what I’ll be focusing on in the following sections.
Search by title
I do know a little SQL, so I started out with elementary queries. My original query for searching by title looked like this:
SELECT b.title, a.name, b.loc
FROM book_author ba
JOIN author a ON a.id = ba.author_id
JOIN book b ON b.id = ba.book_id
WHERE b.title LIKE '%string%'
ORDER BY b.loc;
(The upper case is unnecessary, but that’s the way most people display SQL code.)
The '%string%'
in the last line is the placeholder for some portion of the title I’m looking for. The percent sign is SQL’s wildcard character. Using this query to search for titles with the word “planning” returns this:
Design and planning of engineering systems|Meredith, Dale Dean|TA168 .D48
Design and planning of engineering systems|Wong, Kam W.|TA168 .D48
Design and planning of engineering systems|Woodhead, Ronald W.|TA168 .D48
Design and planning of engineering systems|Wortman, Robert H.|TA168 .D48
Probability concepts in engineering planning and design|Ang, Alfredo Hua-Sing|TA340 .A5 (v. 1)
Probability concepts in engineering planning and design|Tang, Wilson H.|TA340 .A5 (v. 1)
Probability concepts in engineering planning and design|Ang, Alfredo Hua-Sing|TA340 .A5 (v. 2)
Probability concepts in engineering planning and design|Tang, Wilson H.|TA340 .A5 (v. 2)
This works, inasmuch as it returns all the books that have “planning” in the title. But it returns eight answers when there are only three books. The Design and planning… book is returned four times because it has four co-authors. And both volumes of Probability concepts…
On my own, I learned I could cut down on the repeats by adding a GROUP BY
clause, like this:
SELECT b.title, a.name, b.loc
FROM book_author ba
inner JOIN author a ON a.id = ba.author_id
inner JOIN book b ON b.id = ba.book_id
WHERE b.title LIKE '%string%'
GROUP BY b.id ORDER BY b.loc;
With this, a search for “planning” yields
Design and planning of engineering systems|Meredith, Dale Dean|TA168 .D48
Probability concepts in engineering planning and design|Ang, Alfredo Hua-Sing|TA340 .A5 (v. 1)
Probability concepts in engineering planning and design|Ang, Alfredo Hua-Sing|TA340 .A5 (v. 2)
which is an improvement, in that it returns only one entry per book. The downside is that grouping eliminates all but one author from the results.
I lived with this for several months, and it wasn’t really a problem. The authors who were cut out of the results were the secondary authors,
But it was unsatisfactory. I knew there had to be a way to combine all the authors while still maintaining just one result for each book. This is where I decided to hand the problem over to ChatGPT.
I gave ChatGPT the definitions of the three tables, just as I gave them above, and made this request:
I want a query that will search for books whose titles contain a given word or phrase and returns the book’s title, loc, publisher, published, as well as the names of all its authors.
ChatGPT’s response was
SELECT
b.title,
b.loc,
b.publisher,
b.published,
GROUP_CONCAT(a.name, ', ') AS authors
FROM book b
JOIN book_author ba ON b.id = ba.book_id
JOIN author a ON ba.author_id = a.id
WHERE b.title LIKE '%' || :search_term || '%'
GROUP BY b.id;
which I turned into
SELECT b.title, GROUP_CONCAT(a.name, '; '), b.loc
FROM book b
JOIN book_author ba ON b.id = ba.book_id
JOIN author a ON ba.author_id = a.id
WHERE b.title LIKE '%string%'
GROUP BY b.id ORDER by b.loc;
The edits were for the following reasons:
- I decided I didn’t really need the publication info.
- I wanted the output in a different order.
- The author names would be better separated by semicolons than by commas.
- I preferred my original order of the
JOIN
s. - The
AS authors
part didn’t do anything. - I had forgotten to tell ChatGPT that I wanted the output in LOC order.
That may seem like a lot of changes, but it took almost no time, and what I ended up with was basically what I had before but with the key addition of the GROUP_CONCAT
aggregation function, something I’d never seen before. But I’d used aggregation functions in Pandas,
With this new query, my search for books with “planning” in the title returned
Design and planning of engineering systems|Meredith, Dale Dean; Wong, Kam W.; Woodhead, Ronald W.; Wortman, Robert H.|TA168 .D48
Probability concepts in engineering planning and design|Ang, Alfredo Hua-Sing; Tang, Wilson H.|TA340 .A5 (v. 1)
Probability concepts in engineering planning and design|Ang, Alfredo Hua-Sing; Tang, Wilson H.|TA340 .A5 (v. 2)
which has some pretty long lines, but contains all the information I want.
Here’s the shell script, named bytitle
, that uses this query:
1 #!/usr/bin/env bash
2
3 n='%'$1'%'
4
5 c=".mode line
6 SELECT b.title, GROUP_CONCAT(a.name, '; '), b.loc
7 FROM book_author ba
8 JOIN author a ON a.id = ba.author_id
9 JOIN book b ON b.id = ba.book_id
10 WHERE b.title LIKE '$n' GROUP BY b.id ORDER BY b.loc;"
11
12 sqlite3 library.db <<<$c | sed 's/.* = //'
Line 3 puts the argument between percent signs, so that’s what gets used in the LIKE
clause on Line 10. Line 5 starts the commands we’re going to send to SQLite; it’s a dot command that changes the output format to put each field on it’s own line. Lines 6–10 is the query we’ve been talking about. Line 12 is the pipeline that runs the SQLite commands via a here-string and then uses sed
to edit out the prefix stuff that .mode line
creates.
Running
bytitle planning
returns
Design and planning of engineering systems
Meredith, Dale Dean; Wong, Kam W.; Woodhead, Ronald W.; Wortman, Robert H.
TA168 .D48
Probability concepts in engineering planning and design
Ang, Alfredo Hua-Sing; Tang, Wilson H.
TA340 .A5 (v. 1)
Probability concepts in engineering planning and design
Ang, Alfredo Hua-Sing; Tang, Wilson H.
TA340 .A5 (v. 2)
which I find pretty easy to read.
Search by author
I won’t go through as much detail for this one. This was my original query:
SELECT b.title, a.name, b.loc
FROM book_author ba
JOIN author a ON a.id = ba.author_id
JOIN book b on b.id = ba.book_id
WHERE a.name LIKE '%string%'
GROUP BY b.id ORDER BY b.loc;
Searching “gere” gave me
Theory of elastic stability|Gere, James M.|QA931 .T54 1961
Mechanics of materials|Gere, James M.|TA405 .G44 1984
Matrix analysis of framed structures|Gere, James M.|TA645 .W36 1980
which gives me all the books written by James Gere but, as with my original title searches, doesn’t give me any of the other authors of those books.
My request to ChatGPT was
Now I need a query for the same database that will search for books that have an author whose name contains a given string. I want the query to return the books’ loc, title, as well as the names of all its authors.
and the response was
SELECT
b.loc,
b.title,
GROUP_CONCAT(a.name, ', ') AS authors
FROM book b
JOIN book_author ba ON b.id = ba.book_id
JOIN author a ON ba.author_id = a.id
WHERE a.name LIKE '%' || :search_term || '%'
GROUP BY b.id;
which looked suspiciously like the query for title searching. I didn’t think it would work, and it didn’t. It returned exactly the same results as my original query above because the WHERE
clause is finding only entries like “gere.” I was kind of happy that it didn’t work—that meant I was learning.
I told ChatGPT that I was getting only one author for books that had multiple authors, and it came up with this alternative:
SELECT
b.loc,
b.title,
GROUP_CONCAT(a.name, ', ') AS authors
FROM book b
JOIN book_author ba ON b.id = ba.book_id
JOIN author a ON ba.author_id = a.id
WHERE b.id IN (
SELECT DISTINCT ba.book_id
FROM book_author ba
JOIN author a ON ba.author_id = a.id
WHERE a.name LIKE '%' || :search_term || '%'
)
GROUP BY b.id
ORDER BY b.title;
I edited this to
SELECT b.title, GROUP_CONCAT(a.name, '; '), b.loc
FROM book_author ba
JOIN author a ON a.id = ba.author_id
JOIN book b ON b.id = ba.book_id
WHERE b.id IN (
SELECT ba.book_id
FROM book_author ba
JOIN author a ON a.id = ba.author_id
WHERE a.name LIKE '%string%'
)
GROUP BY b.id ORDER BY b.loc;
and now a search for “gere” returns
Theory of elastic stability|Timoshenko, Stephen; Gere, James M.|QA931 .T54 1961
Mechanics of materials|Gere, James M.; Timoshenko, Stephen|TA405 .G44 1984
Matrix analysis of framed structures|Weaver, William; Gere, James M.|TA645 .W36 1980
Same three books, of course, but now we see the other authors, and in two cases the other authors are primary, so it’s helpful to have them in the output.
This version works because it first finds all the books that have “gere” in the author’s name and then concatenates the authors of each of those books. The initial query—or subquery—is the part in parentheses.
The shell script that came from this, named byauthor
, is this:
#!/usr/bin/env bash
n='%'$1'%'
c=".mode line
SELECT b.title, GROUP_CONCAT(a.name, '; '), b.loc
FROM book_author ba
JOIN author a ON a.id = ba.author_id
JOIN book b ON b.id = ba.book_id
WHERE b.id IN (
SELECT ba.book_id
FROM book_author ba
JOIN author a ON a.id = ba.author_id
WHERE a.name LIKE '$n'
)
GROUP BY b.id ORDER BY b.loc;"
sqlite3 library.db <<<$c | sed 's/.* = //'
The only difference between this and bytitle
is the query.
Searching by title and author
With my new SQL knowledge, I decided to tackle a new problem: a script that searches on both title and author, where the title is the first argument and the author is the second. No need for ChatGPT this time. The code for bytitleauthor
is pretty obvious:
1 #!/usr/bin/env bash
2
3 t='%'$1'%'
4 n='%'$2'%'
5
6 c=".mode line
7 SELECT b.title, GROUP_CONCAT(a.name, '; '), b.loc
8 FROM book_author ba
9 JOIN author a ON a.id = ba.author_id
10 JOIN book b ON b.id = ba.book_id
11 WHERE b.title LIKE '$t'
12 AND b.id IN (
13 SELECT ba.book_id
14 FROM book_author ba
15 JOIN author a ON a.id = ba.author_id
16 WHERE a.name LIKE '$n'
17 )
18 GROUP BY b.id ORDER BY b.loc;"
19
20 sqlite3 library.db <<<$c | sed 's/.* = //'
Finally
It’s not surprising that ChatGPT did better on these jobs than it did in my previous test of it. There is, presumably, an awful lot of SQL code in its training corpus, so it had a much better chance of getting a correct answer. Still, it had to try twice to get the author search right.
Another thing that makes LLMs relatively successful in generating code has to do with our relationship to the answers. We can check the LLM’s code just as we would check our own: does it process the data correctly? If not, we can often see immediately what part of the LLM-generated code needs to be tweaked. Otherwise, we just have it roll the dice again. Either way, there’s a decent likelihood that we’ll end up satisfied, even if the LLM stumbled around.
One thing I can’t judge about the SQL code in this post is how good it is. It works, but that doesn’t mean it’s the best or preferred way to make the query. If you’re an SQL expert and see something off in any of these queries, shoot me an email and set me right.
Update 23 Mar 2025 2:25 PM
I’ve been informed by someone on Mastodon whose posts cannot be linked that there’s value to the AS authors
extension to the GROUP_CONCAT
function, just not in my scripts as currently written.
Sometimes you want the output to include field names. In those cases, renaming the concatenated names field to authors
will make the output nicer. For example, if I issue a .mode json
command and then run
SELECT b.title, GROUP_CONCAT(a.name, '; '), b.loc
FROM book b
JOIN book_author ba ON b.id = ba.book_id
JOIN author a ON ba.author_id = a.id
WHERE b.title LIKE '%planning%'
GROUP BY b.id ORDER by b.loc;
I get
[
{
"title": "Design and planning of engineering systems",
"GROUP_CONCAT(a.name, '; ')": "Meredith, Dale Dean; Wong, Kam W.; Woodhead, Ronald W.; Wortman, Robert H.",
"loc": "TA168 .D48"
},
{
"title": "Probability concepts in engineering planning and design",
"GROUP_CONCAT(a.name, '; ')": "Ang, Alfredo Hua-Sing; Tang, Wilson H.",
"loc": "TA340 .A5 (v. 1)"
},
{
"title": "Probability concepts in engineering planning and design",
"GROUP_CONCAT(a.name, '; ')": "Ang, Alfredo Hua-Sing; Tang, Wilson H.",
"loc": "TA340 .A5 (v. 2)"
}
]
The GROUP_CONCAT(a.name, '; ')
field name is really clumsy. But if I run
SELECT b.title, GROUP_CONCAT(a.name, '; ') AS authors, b.loc
FROM book b
JOIN book_author ba ON b.id = ba.book_id
JOIN author a ON ba.author_id = a.id
WHERE b.title LIKE '%planning%'
GROUP BY b.id ORDER by b.loc;
I get
[
{
"title": "Design and planning of engineering systems",
"authors": "Meredith, Dale Dean; Wong, Kam W.; Woodhead, Ronald W.; Wortman, Robert H.",
"loc": "TA168 .D48"
},
{
"title": "Probability concepts in engineering planning and design",
"authors": "Ang, Alfredo Hua-Sing; Tang, Wilson H.",
"loc": "TA340 .A5 (v. 1)"
},
{
"title": "Probability concepts in engineering planning and design",
"authors": "Ang, Alfredo Hua-Sing; Tang, Wilson H.",
"loc": "TA340 .A5 (v. 2)"
}
]
which is much nicer. (For what it’s worth, I ran the JSON output through the jq
command to make it easier to read.)
I expect to be doing exactly this in the not-too-distant future. I plan to move this system online and put a web interface in front of it. I like the idea of being able to get at the database from my phone when I’m in a used bookstore.
Thanks, locked-down Mastodoner!
-
These are two separate books. They have different subtitles, and the second was published nearly a decade after the first. I have a strong relationship with these books because Al Ang was my thesis advisor and Wilson Tang was on my defense committee. I used the books as both a student and a teacher. The second volume was still in galley form when I took the class that used it; we got photocopies of several chapters with handwritten edits on the pages. ↩
-
In filling out the
book_author
table, I always entered records for a book in the order that the authors were given on the title page.GROUP BY
kept the first one. ↩ -
Yes, I know Pandas got aggregation functions from SQL, not the other way around. But I learned them in Pandas first. ↩