Learning (sort of) from ChatGPT

Simon Willison, the primary developer of the Datasette exploratory data analysis tool, has a strong interest in ChatGPT and similar AI toys.[1] He recently linked on Mastodon to a dialog in which he had ChatGPT write some AppleScript. In that dialog, we see the good and bad of using ChatGPT to help you write programs. Although I generally don’t think much of programming that way, I did learn something from Willison’s exploration.

Willison asked ChatGPT to write him an AppleScript that output the contents of all of his Apple Notes. He then asked ChatGPT to turn that into a shell script. The result was

 1:  #!/bin/zsh
 3:  osascript -e 'tell application "Notes"
 4:    repeat with eachNote in every note
 5:      set noteTitle to the name of eachNote
 6:      set noteBody to the body of eachNote
 7:      set output to noteTitle & "\n" & noteBody & "\n"
 8:      display dialog output
 9:      log output
10:    end repeat
11:  end tell'

That Willison was willing to run this script is proof of his assertion that he’s steadfastly refused to learn AppleScript. Putting display dialog (Line 8) into a loop that could run dozens or hundreds of times, depending on how many notes you have, is insane. So although this was perfectly legal AppleScript (wrapped in a shell command) that solved the main problem of looping through all the notes, it has a defect (“it spammed my screen with dialog boxes”) that no AppleScript coder would include.

Two other aspects that I think most AppleScript programmers would avoid:

  1. Using "\n" in Line 7. That’s a perfectly normal way to add a linefeed in most languages, but AppleScript has linefeed for that, and idiomatic AppleScript would use it.
  2. Using log in Line 9. This is a more subtle mistake. log is meant to be used for debugging. When used inside an osascript call, it writes to standard error. But the contents of the output variable are the whole point of this script—it should be going to standard output, not standard error.
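The difference between standard output and standard error can be seen without osascript at all. The sketch below is Python rather than AppleScript, and it uses a stand-in child process (an assumption for illustration, not Willison’s script) that behaves the way log does under osascript: it writes only to standard error. Anything that captures standard output, such as a pipe or a > redirect, receives nothing.

```python
import subprocess
import sys

# Stand-in for the osascript call: a child process that, like AppleScript's
# `log` under osascript, sends its text to standard error only.
child = [sys.executable, "-c", "import sys; sys.stderr.write('note text\\n')"]

result = subprocess.run(child, capture_output=True, text=True)

print(repr(result.stdout))  # what a pipe or `>` redirect would receive
print(repr(result.stderr))  # where the text actually went
```

The first line printed is an empty string, which is what a downstream command would see if it tried to read the notes from the script’s output.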

Willison and ChatGPT eventually got to a script that uses AppleScript’s write command to write his output to a file. He then wrote a short Python script that reads this file and saves the notes into an SQLite database:

 1:  import sqlite_utils
 2:  split = b"------------------------\n"
 3:  s = open("/tmp/notes.txt", "rb").read()
 4:  notes = [n.decode("mac_roman") for n in s.split(split) if n]
 6:  cleaned_notes = [{
 7:      "id": n.split("\n")[0],
 8:      "title": n.split("\n")[1],
 9:      "body": "\n".join(n.split("\n")[2:]).strip()
10:  } for n in notes]
12:  db = sqlite_utils.Database("/tmp/notes.db")
13:  db["notes"].insert_all(cleaned_notes)

What struck me about this script was Line 4. Why is he decoding the contents of the file using Mac OS Roman? I have plenty of code that mixes AppleScript and Python, and I’ve never needed to use decode("mac_roman"). Surely AppleScript doesn’t write text files in Mac OS Roman anymore?

Surely it does. Here’s Apple’s documentation on the write command and the formats it can generate through the as clause:

as class

Write the data as this class. The most common ones control the use of three different text encodings:

text or string
The primary text encoding, as determined by the user’s language preferences set in the International preference panel. (For example, Mac OS Roman for English, MacJapanese for Japanese, and so on.)

Unicode text
UTF-16.

«class utf8»
UTF-8.

Any other class is possible, for example date or list, but is typically only useful if the data will be read using a read statement specifying the same value for the as parameter.

Default Value:
The class of the supplied data. See Special Considerations.

Despite it being 2023, and despite Apple talking big about AppleScript being “entirely Unicode-based” over 15 years ago, this says AppleScript still writes text files, by default, in the user’s primary text encoding, which for an English-language system is Mac OS Roman.

Maybe the documentation is wrong. I created an Apple Note specifically to see how non-ASCII characters are encoded under different circumstances.

Apple Note with non-ASCII characters

The contents of the note are

¿Thîs filé hås “Unicode” characters—doesn’t it?

I then wrote this AppleScript as a test:

 1:  set fRef to open for access "/Users/drang/Desktop/notes-test.txt" with write permission
 2:  set eof of fRef to 0
 4:  tell application "Notes"
 5:    repeat with thisNote in every note
 6:      tell thisNote
 7:        if name contains "UTF-8 Test" then
 8:          write plaintext & linefeed to fRef
 9:          close access fRef
10:          set the clipboard to plaintext & linefeed
11:          exit repeat
12:        end if
13:      end tell
14:    end repeat
15:  end tell

It gets the contents of the note shown above (plaintext is the content without any of the HTML) and does two things with it:

  1. Writes it to a file on my Desktop.
  2. Puts it on the clipboard.

Let’s see the difference between the two. Running xxd, the hex dump utility, on the saved file gives

00000000: 5554 462d 3820 5465 7374 0a0a c054 6894  UTF-8 Test...Th.
00000010: 7320 6669 6c8e 2068 8c73 20d2 556e 6963  s fil. h.s .Unic
00000020: 6f64 65d3 2063 6861 7261 6374 6572 73d1  ode. characters.
00000030: 646f 6573 6ed5 7420 6974 3f0a            doesn.t it?.

Running it on the clipboard (via pbpaste | xxd) gives

00000000: 5554 462d 3820 5465 7374 0a0a c2bf 5468  UTF-8 Test....Th
00000010: c3ae 7320 6669 6cc3 a920 68c3 a573 20e2  ..s fil.. h..s .
00000020: 809c 556e 6963 6f64 65e2 809d 2063 6861  ..Unicode... cha
00000030: 7261 6374 6572 73e2 8094 646f 6573 6ee2  racters...doesn.
00000040: 8099 7420 6974 3f0a                      ..t it?.

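As you can see, there are more bytes in the clipboard (72) than there are in the file (60), which shows the encodings are different. It’s easy enough to figure out that the file is encoded in Mac OS Roman and the clipboard is encoded in UTF-8: the opening ¿, for example, is the single byte c0 in the file but the two-byte sequence c2 bf on the clipboard.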
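As a cross-check, the two dumps can be decoded in Python. This is a small sketch that pastes in the hex from the xxd output above; the variable names are mine. If the file really is Mac OS Roman and the clipboard really is UTF-8, both should decode to the same text.

```python
# Hex copied from the two xxd dumps (bytes.fromhex skips the spaces).
file_bytes = bytes.fromhex(
    "5554 462d 3820 5465 7374 0a0a c054 6894 "
    "7320 6669 6c8e 2068 8c73 20d2 556e 6963 "
    "6f64 65d3 2063 6861 7261 6374 6572 73d1 "
    "646f 6573 6ed5 7420 6974 3f0a"
)
clipboard_bytes = bytes.fromhex(
    "5554 462d 3820 5465 7374 0a0a c2bf 5468 "
    "c3ae 7320 6669 6cc3 a920 68c3 a573 20e2 "
    "809c 556e 6963 6f64 65e2 809d 2063 6861 "
    "7261 6374 6572 73e2 8094 646f 6573 6ee2 "
    "8099 7420 6974 3f0a"
)

# Decode each dump with the encoding it is believed to use.
from_file = file_bytes.decode("mac_roman")
from_clipboard = clipboard_bytes.decode("utf-8")

print(len(file_bytes), len(clipboard_bytes))  # byte counts of the two dumps
print(from_file == from_clipboard)            # same text once decoded?
print(from_file)
```

The byte counts differ by 12, which accounts for the eight non-ASCII characters: four of them take one extra byte in UTF-8 and four take two extra.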

So the encoding depends on how you handle the output, which is one of those—how should I put this?—unintuitive aspects of AppleScript, one that I don’t recall running into before now. I suppose it’s because I seldom (never?) write text files directly from AppleScript. Not files with non-ASCII characters, anyway.

As the documentation says, the way to get UTF-8 output written to the file is to change Line 8 to

write plaintext & linefeed to fRef as «class utf8»

Why the weird «» syntax? I suppose it’s because Apple had already used as Unicode text for UTF-16 and just couldn’t be bothered to extend the syntax to handle UTF-8 gracefully.
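On the reading side, a file written with as «class utf8» would be decoded as UTF-8 instead of Mac OS Roman. Here is a hedged sketch of that, loosely following the id/title/body layout of Willison’s Python script. The parse_notes function, its encoding parameter, and the sample note are my own inventions for illustration, and unlike his script it decodes first and splits second.

```python
SEPARATOR = "------------------------\n"  # same divider as Willison's script


def parse_notes(raw, encoding):
    """Decode the exported bytes, then split into id/title/body records."""
    records = []
    for chunk in raw.decode(encoding).split(SEPARATOR):
        if not chunk:
            continue
        lines = chunk.split("\n")
        records.append({
            "id": lines[0],
            "title": lines[1],
            "body": "\n".join(lines[2:]).strip(),
        })
    return records


# Hypothetical export: one note, in the id / title / body layout.
sample = "note-1\nCafé list\nDon’t forget the crème.\n" + SEPARATOR

as_utf8 = sample.encode("utf-8")       # what «class utf8» would produce
as_roman = sample.encode("mac_roman")  # the default on an English-language system

# Matching encodings give the same records either way.
same = parse_notes(as_utf8, "utf-8") == parse_notes(as_roman, "mac_roman")
print(same)

# A mismatch fails loudly in one direction...
try:
    as_roman.decode("utf-8")
    roman_as_utf8_ok = True
except UnicodeDecodeError:
    roman_as_utf8_ok = False
print(roman_as_utf8_ok)

# ...and quietly garbles the text in the other.
garbled = as_utf8.decode("mac_roman")
print(garbled.split("\n")[1])
```

The quiet failure is the more dangerous one: reading UTF-8 bytes as Mac OS Roman raises no error and simply turns each accented character into two wrong ones.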

You know how I talked a couple of days ago about updating a process so I wouldn’t be embarrassed by my description of it? Apple may have music in its DNA, but it apparently lacks the gene for embarrassment. That would also explain a lot about Shortcuts.

I’m going to give ChatGPT some credit for introducing me to as «class utf8», even though it didn’t put it in its code. Most of the credit, though, goes to Simon Willison for documenting his ChatGPT dialog, writing clear Python code, and leading me to look up some AppleScript syntax I didn’t know about.

  1. Is “toys” too dismissive? Time will tell.