September 15, 2021 at 10:23 PM by Dr. Drang
Like many of you, I was surprised to see that the Series 3 Watch has survived yet another product cycle and is still holding down the low end of Apple’s watch lineup. Unlike many of you, I wear a Series 3 every day, and although I’ve complained bitterly both here and on Twitter about how much trouble it is to upgrade its OS, I still kind of like it.
The Series 3 is today’s version of the iPad 2, the 16 GB iPhone, or the 5 GB iCloud free storage tier:[1] The Thing That Wouldn’t Die. But like the iPad 2[2], it’s a perfectly good device if your needs stay the same as when you bought it. I bought my wife an iPad 2 when it was released, and when I got her an iPad Air some years later, she questioned the need for it (until she realized how dependent she’d become on connecting to power throughout the day). And that’s because her needs hadn’t changed much.
It’s the same with me and my watch. I use it more or less the same way I did a few years ago: displaying notifications, starting timers, setting reminders, controlling audio playback from my phone, tracking my walks, paying at stores. It’s still good at all of those things.
And it still has plenty of battery life. I put it on in the morning and take it off before going to bed without ever worrying about the battery. The only time the battery has ever run out was on my first business trip after buying it. I forgot to pack the charging puck—wasn’t in the habit yet—and the battery ran out sometime on the third day after its last charge.
So I’m on the fence about getting a new watch. The Series 3 is annoying only on days when I have to update the OS, and those days are few and far between. The update I did today to 7.6.2 was unusual in that I felt compelled to do it by the zero-click security problem it protects against. Normally, I let point releases go by until several have piled up. Still, it would be nice not to have to plan for 3– to 4-hour OS updates. And to have an always-on display. Whatever decision I make, it won’t have to be made until “later this fall.”
Sorry, my mistake. The 5 GB iCloud storage tier is today’s version of the 5 GB iCloud storage tier. ↩
And maybe the 16 GB iPhone? Because I’ve generally bought midrange iPhones, I’ve never had to deal with upgrading a phone with tight storage. But I think there were ways of updating 16 GB phones even as iOS grew in size. If so, I can imagine their owners being satisfied with their phones except on iOS update days. ↩
August 29, 2021 at 11:53 AM by Dr. Drang
Comparing lists is something I have to do fairly often in my work. There are lots of ways to do it, and the techniques I’ve used have evolved considerably over the years. This post is about where I’ve been and where I am now.
First, let’s talk about the kinds of lists I deal with. They are typically sets of alphanumeric strings—serial numbers of products would be a good example. The reason I have two lists that need to be compared varies, but a common example would be one list of devices that were supposed to be collected from inventory and tested and another list of devices that actually were tested. This latter list is usually part of a larger data set that includes the test results. Before I get into analyzing the test results, I have to see whether all the devices from the first list were tested and whether some of the devices that were tested weren’t on the first list.
In the old days, the lists came on paper, and the comparison consisted of me sitting at my desk with a highlighter pen, marking off the items in each list as I found them. This is still a pretty good way of doing things when the lists are small. By having your eyes go over every item in each list, you get a good sense of the data you’ll be dealing with later.
But when the lists come to me in electronic form, the “by hand” method is less appealing, especially when the lists are dozens or hundreds of items long. This is where software tools come into play.
The first step is data cleaning, a topic I don’t want to get into in this post other than to say that the goal is to get two files, which we’ll call listA.txt and listB.txt. Each file has one item per line. How you get the lists into this state depends on the form they took when you received them, but it typically involves copying, pasting, searching, and replacing. This is where you develop your strongest love/hate relationships with spreadsheets and regular expressions.
Let’s say these are the contents of your two files: three-digit serial numbers. I’m showing them in parallel to save space in the post, but they are two separate files:
listA.txt    listB.txt
   115          114
   119          115
   105          106
   101          119
   113          105
   116          125
   102          114
   106          120
   114          101
   108          117
   120          113
   103          111
   107          112
   109          123
   118          116
   112          107
   110          118
   104          114
                102
                105
                110
                121
                122
                109
                104
You’ll note that neither of these lists is sorted. Or at least they’re not sorted in any obvious way. They may be sorted by the date on which the device was tested, and that might be important in your later analysis, so it’s a good idea, as you go through the data cleanup, to preserve the ordering of the data as you received it.
At this stage, I typically look at each list separately and see if there’s anything weird about it. One of the most common weirdnesses is duplication, which can be found through this simple pipeline:
sort listB.txt | uniq -c | sort -r | head
The result for our data is
   3 114
   2 105
   1 125
   1 123
   1 122
   1 121
   1 120
   1 119
   1 118
   1 117
What this tells us is that item 114 is in list B three times, item 105 is in list B twice, and all the others are there just once. At this point, some investigation is needed. Were there actually three tests run on item 114? Did someone mistype the serial number? Was the data I was given concatenated from several redundant sources and not cleaned up before sending it to me?
The workings of the pipeline are pretty simple. The first sort alphabetizes the list, which puts all the duplicate lines adjacent to one another. The uniq command prints the list without the repeated lines, and its -c (“count”) option adds a prefix of the number of times each line appears. The second sort sorts this augmented list in reverse (-r) order, so the lines that are repeated appear at the top. Finally, head prints just the top ten lines of this last sorted list. Although head can be given an option that changes the number of lines that are printed, I seldom need to use that option, as my lists tend to have only a handful of duplicates at most.
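Here is a minimal sketch of the same idea on made-up data. The file sample.txt and its contents are hypothetical, and two things go beyond the pipeline above: the second sort uses -rn, so the counts are compared as numbers (this only matters once a count reaches 10), and a small awk filter keeps just the items that appear more than once.

```shell
# Hypothetical sample data: 114 appears three times, 105 twice.
tmpdir=$(mktemp -d)
printf '%s\n' 114 115 105 114 120 105 114 > "$tmpdir/sample.txt"

# Count each distinct line, then sort by count, largest first.
# -n makes the comparison numeric (an assumption beyond the original
# pipeline; it only matters when a count has two or more digits).
counts=$(sort "$tmpdir/sample.txt" | uniq -c | sort -rn)
echo "$counts"

# Keep only the items whose count is greater than 1.
dups=$(echo "$counts" | awk '$1 > 1 {print $2}')
echo "$dups"

rm -r "$tmpdir"
```

If all you want is the names of the repeated items, with no counts, uniq -d on the sorted list is another option.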
Let’s say we’ve figured out why list B had those duplicates. And we also learned that list A has no duplicates. Now it’s time to compare the two lists. I used to do this by making sorted versions of each list,
sort listA.txt > listA-sorted.txt
sort -u listB.txt > listB-sorted.txt
and then compared the -sorted files.[1] This worked, but it led to cluttered folders with files that were only used once. More recently, I’ve avoided the clutter by using process substitution, a feature of the bash shell (and zsh) that’s kind of like a pipeline but can be used when you need files instead of a stream of data. We’ll see how this works later.
There are two main ways to compare files in Unix: diff and comm.
diff is the more programmery way of comparison. It tells you not only what the differences are between the files, but also how to edit one file to make it the same as the other. That’s useful, and we certainly could use diff to do our comparison, but it’s both more verbose and more cryptic than we need.
The comm command takes two arguments, the files we want to compare. By default, it prints three columns of output: the first column consists of lines in the first file that don’t appear in the second; the second column consists of lines that appear in the second file but not in the first; and the third column is the lines that appear in both files. Like this:
comm listA-sorted.txt listB-sorted.txt
gives output of
                101
                102
103
                104
                105
                106
                107
108
                109
                110
        111
                112
                113
                114
                115
                116
        117
                118
                119
                120
        121
        122
        123
        125
(The columns are separated by tab characters, but I’ve converted them to spaces here to make the output look more like what you’d see in the Terminal.)
This is typically more information than I want. In particular, the third column is redundant. If we know which items are only in list A and which items are only in list B, we know that all the others (usually the most numerous by far) are in both lists. We can suppress the printing of column 3 with the -3 option. So
comm -3 listA-sorted.txt listB-sorted.txt
gives the more compact output
103
108
        111
        117
        121
        122
        123
        125
You can suppress the printing of the other columns with -1 and -2. So to get the items that are unique to list A, we do
comm -23 listA-sorted.txt listB-sorted.txt
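The other column combinations work the same way. Here is a small sketch, assuming two hypothetical files, a.txt and b.txt, that are already sorted and have no duplicates; the file names and contents are invented for illustration.

```shell
# Hypothetical sample lists, already sorted and free of duplicates.
tmpdir=$(mktemp -d)
printf '%s\n' 101 102 103 > "$tmpdir/a.txt"
printf '%s\n' 102 103 104 > "$tmpdir/b.txt"

only_a=$(comm -23 "$tmpdir/a.txt" "$tmpdir/b.txt")   # hide columns 2 and 3
only_b=$(comm -13 "$tmpdir/a.txt" "$tmpdir/b.txt")   # hide columns 1 and 3
both=$(comm -12 "$tmpdir/a.txt" "$tmpdir/b.txt")     # hide columns 1 and 2

echo "only in a:"; echo "$only_a"
echo "only in b:"; echo "$only_b"
echo "in both:";   echo "$both"

rm -r "$tmpdir"
```

When only one column is left, comm prints it without any leading tabs, so the output can be pasted or piped onward as a plain list.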
There are a couple of things I hate about comm:
- What’s up with that second m? This is a comparison program, not a communication program.
- The options are given in what my dad would call “yes, we have no bananas” form.[2] You don’t tell comm the columns you want, you tell it the columns you don’t want. Very intuitive.
I said earlier that I don’t like making -sorted files. Now that we see how comm works with files, we’ll switch to process substitution. Here’s how to get the items unique to each list without creating sorted files first:
comm -3 <(sort listA.txt) <(sort -u listB.txt)
What goes inside the <() construct is the command we’d use to create the sorted file. I like this syntax, as the less-than symbol is reminiscent of the symbol used for redirecting input.
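To make the mechanics concrete, here is a sketch that pulls out each side of the comparison separately. It assumes bash (or zsh), since process substitution is not part of POSIX sh, and the two small lists are hypothetical. I’ve used sort -u on both lists here so a stray duplicate in either one can’t upset the comparison.

```shell
#!/usr/bin/env bash
# Process substitution needs bash or zsh; it is not part of POSIX sh.
tmpdir=$(mktemp -d)
printf '%s\n' 115 103 101 > "$tmpdir/a.txt"        # hypothetical list A
printf '%s\n' 101 115 125 101 > "$tmpdir/b.txt"    # hypothetical list B, with 101 repeated

# Each <(...) runs its command and hands comm a file name to read from,
# so no -sorted files are left behind.
missing=$(comm -23 <(sort -u "$tmpdir/a.txt") <(sort -u "$tmpdir/b.txt"))
extra=$(comm -13 <(sort -u "$tmpdir/a.txt") <(sort -u "$tmpdir/b.txt"))

echo "In A but not in B: $missing"
echo "In B but not in A: $extra"

rm -r "$tmpdir"
```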
You may prefer using your text editor’s diffing system because it’s more visual. Here’s BBEdit’s difference window:
Here’s Kaleidoscope, which gets good reviews as a file comparison tool. I don’t own it but downloaded the free trial so I could get this screenshot.
These are both nice, but they don’t give you output you can paste directly into a report or email.
The following items were supposed to be tested, but there are no results for them:
Update 08/31/2021 6:05 PM
Reader Geoff Tench emailed to tell me that comm is short for common, and he pointed to the first line of the man page:
comm — select or reject lines common to two files
This is what you see when you type man comm in the Terminal. The online man page I linked to, at ss64.com, is slightly different:
Compare two sorted files line by line. Output the lines that are common, plus the lines that are unique.
I focused on the compare and missed the common completely.
comm is a good abbreviation, but it’s a good abbreviation of a poor description. By default, comm outputs three columns of text, and only one of those columns has the common lines. So comm is really more about unique lines than it is about common lines. Of course, uniq is taken, as are
comp would probably be a bad name, given that
diff is such a good command name, I often use it by mistake when I really mean to use comm. Now that I’ve written this exegesis of comm, I’ll probably find it easier to remember. Thanks, Geoff!
The -u option to sort makes it act like sort <file> | uniq. It strips out the duplicates, which we don’t want in the file when doing our comparisons. ↩
My father wasn’t a programmer, but it’s not unusual to be asked to provide negative information. Tax forms, for example: All income not included in Lines 5, 9, and 10e that weren’t in the years in which you weren’t a resident of the state. ↩
August 22, 2021 at 3:17 PM by Dr. Drang
A couple of applications have announced changes recently, and I have thoughts.
First, when the addition of Markdown to Things was announced, I was indifferent. I seldom add notes to tasks, and even when I do, it’s never more than a few words—not enough to take advantage of Markdown formatting. And after using the newer version of Things for about a week, I still don’t see any reason for me to add more extensive notes to tasks.
But the first time I created a new project after updating Things, I realized where Markdown could help me: to summarize the project in its notes field.
This is where I can put the project budget, the name of the client, and any other relevant information. It’s not as though I don’t have this project summary information elsewhere, but that’s the problem: it’s elsewhere—usually several elsewheres. Putting it all together in the place I use to track the project makes it much more convenient. I could’ve been doing this all along, of course, but because the formatting of notes was limited, it never occurred to me to do so.
There is one important Markdown feature missing. Cultured Code addresses it in the FAQ:
Can I hide links behind text? – Not at this time. Links will be displayed at full length.
That’s too bad, because I’d like to have links to certain documents in the project summary, but I’d rather not clutter my notes with long, messy links like
file:///Users/drdrang/Library/Mobile%20Documents/com~apple~CloudDocs/projects/timoshenko.roof/proposal/proposal.pdf
in which the entire URL is visible.
The second app is 1Password. An awful lot of digital ink has been spilled recently regarding its upcoming shift to Electron for handling its user interface on the Mac. I suppose I shouldn’t have any opinions about this, as I left 1Password for iCloud Keychain over two years ago, but I couldn’t help wondering about the fuss.
Look, I’ve been a Mac user since 1985, so I understand the desire to go to war to defend native user interfaces over lowest-common-denominator, weblike interfaces. But I wouldn’t choose 1Password as the hill to die on.
Maybe I was just a weird user, but I almost never had the full 1Password app open. My interaction with it was mostly through very simple dialog boxes, as it asked me if I wanted to save new login credentials or insert old ones. What I remember most about 1Password—and what was one of its best features—is that I didn’t interact with it much at all.
Now, if the new 1Password chews up memory and processor cycles as it runs in the background, that’s worth complaining about. The backend portions written in Rust shouldn’t create bottlenecks, but we’ll see how that goes as it moves from beta into production.
Update 08/22/2021 3:12 PM
This is an unusual update, as I’m writing it before posting. But I just listened to the second half of this week’s Connected and Myke made the same point about 1Password being an app that doesn’t get a lot of user interaction. So I could either dump a post that I’ve been writing off and on for nearly a week (I am a slow writer, but most of the delays in getting this finished had to do with a short vacation and moving a kid back to college), or go ahead and publish it while acknowledging Myke’s primacy.
August 15, 2021 at 12:07 PM by Dr. Drang
Isn’t it a safer habit to always quote/escape args containing wildcards?
To put this question in context, recall that the command I ran to get all the reports I’d written in the past 60 days was
find . -name *report*.pdf -mtime -60
The command was issued from within my ~/projects directory. Within it are subdirectories for every project and subsubdirectories within each of them for the various aspects of those projects. The idea behind the find command is to search down through the current directory (.) for files with names that match the glob *report*.pdf that were last modified less than 60 days ago.
Leon’s question, which was really a suggestion politely formed as a question, was about my leaving the argument to the -name expression unquoted. He thinks I should have used
find . -name '*report*.pdf' -mtime -60
He’s right. Quoting the argument to -name is a good habit to get into. But it’s a habit I find hard to form.
The reason to quote the argument is to prevent the shell from expanding the glob before find has a chance to get at it. If there were a file at the top level of the ~/projects directory with a name that matched the glob (and if it were less than 60 days old), that would be the only file that find would have returned.
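The effect is easy to reproduce in a scratch directory. This is a sketch with hypothetical file names; I’ve left out the -mtime test because the files are brand new anyway.

```shell
# Hypothetical tree: one matching file at the top level, one in a subdirectory.
tmpdir=$(mktemp -d)
mkdir "$tmpdir/sub"
touch "$tmpdir/top-report.pdf" "$tmpdir/sub/final-report.pdf"

# Unquoted: the shell expands the glob to top-report.pdf before find runs,
# so find looks for that one literal name and misses the file in sub/.
unquoted=$(cd "$tmpdir" && find . -name *report*.pdf)

# Quoted: find receives the pattern itself and applies it at every level.
quoted=$(cd "$tmpdir" && find . -name '*report*.pdf' | sort)

echo "unquoted:"; echo "$unquoted"
echo "quoted:";   echo "$quoted"

rm -r "$tmpdir"
```

With two or more matching files at the top level, the unquoted form doesn’t just return too little; find gets extra arguments it can’t parse and quits with an error.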
I got away with leaving out the quotes because there were no such files in ~/projects. Except for a couple of SQLite database files, ~/projects has nothing but subdirectories at its top level. I knew that, which is why the command worked without quoting. And although I know that quoting is a safer habit, I wrote the post using the command just as I used it; I didn’t add the quotes to model better behavior than I usually engage in.
It’s not that I never use quotes when working in the shell. But I do tend to forget them more often than I should. One thing I will say in my defense: I build up my shell commands and pipelines incrementally, making sure every step works the way I expect before adding another. I would never write
find . -name something | xargs rm
without first checking the output of
find . -name something
Also, on those rare occasions when I write a shell script, I am much more diligent about quoting than I am when working interactively at the command line. Commands given interactively are written for a particular set of circumstances in which shortcuts can be perfectly fine. Shell scripts get used more widely, where unforeseen conditions are more likely.
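As an illustration of the check-first habit, here is a sketch in a scratch directory with hypothetical files. One swap to note: for the destructive step it uses find’s own -exec instead of a pipe to xargs rm, because -exec copes with spaces in file names without any extra options. That is my substitution, not the command shown above.

```shell
tmpdir=$(mktemp -d)
touch "$tmpdir/keep.txt" "$tmpdir/old report.tmp" "$tmpdir/scratch.tmp"   # hypothetical files

# Step 1: look at what find matches before doing anything destructive.
preview=$(find "$tmpdir" -name '*.tmp' | sort)
echo "$preview"

# Step 2: once the list looks right, let find run rm itself.
# -exec ... {} + passes the matched names directly to rm, spaces and all.
find "$tmpdir" -name '*.tmp' -exec rm -- {} +

remaining=$(ls "$tmpdir")
echo "$remaining"

rm -r "$tmpdir"
```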
There is, by the way, a way to get at the same information without find. If your shell is zsh (or if, like me, you use a recent version of bash installed via Homebrew), you can use the **/ wildcard pattern:
ls -lt **/*report*.pdf | head
In this command, the shell looks down the whole subdirectory tree for files that match the globbing pattern, ls -lt prints them in long form in reverse chronological order, and then head extracts just the first ten, which will be the ten most recent. Because the long form of ls includes the modification date, I could’ve looked through the output of this command and easily determined which reports were written in the past two months.
And because this is a situation where I want the shell to interpret the globbing pattern, quoting the pattern would be wrong. Which is a better fit to my sloppy habits.
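One practical detail if you try this in bash: recursive matching with ** is off by default and has to be turned on with shopt -s globstar, which needs bash 4 or later (zsh needs no setup). Here is a sketch in a scratch directory; the project and file names are hypothetical.

```shell
#!/usr/bin/env bash
# Assumes bash 4 or later; without globstar, ** behaves like a single *.
shopt -s globstar

tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/projA/reports" "$tmpdir/projB"
touch "$tmpdir/projA/reports/final-report.pdf" \
      "$tmpdir/projB/draft-report.pdf" \
      "$tmpdir/projB/notes.txt"

# No quotes around the pattern: here the shell itself must expand it.
matches=$(cd "$tmpdir" && ls -lt **/*report*.pdf | head)
echo "$matches"

rm -r "$tmpdir"
```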