Don’t fear the regex
January 9, 2019 at 8:17 AM by Dr. Drang
If you saw Jason Snell’s recent article on reviving an old podcast feed and skimmed past it because you don’t expect to ever need to revive an old podcast feed, you missed some excellent generally applicable advice within the specifics.
The real purpose of Jason’s article is to show you how to use simple software tools we all know, a text editor and a spreadsheet, to accomplish what would normally be thought of as a “programming” task. Listeners of The Talk Show may remember an episode in which Jason and John Gruber discussed how both of them have done this many, many times over the years.12
Although I do often write short programs for text munging, I typically resort to that only if the problem requires more than just large-scale text editing or if I expect to be repeating the process several times. And even then, I usually start out by playing around in BBEdit3 to see what searches, replacements, and rearrangements need to be done. It’s a convenient environment for getting immediate feedback on each transformation step.
(And if you expect to do a series of text transformations often and really don’t want to get into writing scripts in Perl or Python or Ruby or whatever, BBEdit’s Text Factories allow you to string together any number of individual munging steps.)
I will say, though, that one bit of “programming” you’ll find really helpful—and which Jason uses in his podcast feed example—is the building of regular expressions for searching and replacing. Simply put, regular expressions allow you to find patterns of text instead of specific text. For example, if you need to find all the US zip codes in a long chunk of text, you can’t go searching for a specific zip code, like “60606.” You have to look for a pattern: five digits, optionally followed by a hyphen and four more digits.
Regular expressions allow you to build these patterns by using placeholders for generic character types (like digits), repetitions, and options. Unfortunately, for historical reasons, these placeholders consist of normal ASCII character strings like \d, *, and ?. This has the unfortunate effect of terrifying newcomers when they see something like
^\d{5}(-\d{4})?$
described as a simple regex, something I have probably done dozens of times.
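To take some of the terror out of it, here is a rough sketch of that zip code pattern at work in Python. The function names and the sample address are invented for illustration. The second pattern is also an assumption, not something from the expression above: it swaps the ^ and $ anchors for \b word boundaries so that it can pick zip codes out of a longer chunk of text instead of testing a whole string.

```python
import re

# The pattern from the text: five digits, optionally followed by
# a hyphen and four more digits. ^ and $ anchor it to the whole string.
ZIP_WHOLE = re.compile(r"^\d{5}(-\d{4})?$")

# Assumed variant for searching inside longer text: word boundaries (\b)
# replace the anchors, and (?:...) groups without capturing so that
# findall returns the full match.
ZIP_IN_TEXT = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def is_zip(s):
    """Return True if the entire string looks like a US zip code."""
    return ZIP_WHOLE.match(s) is not None


def find_zips(text):
    """Return every zip-code-shaped run of characters found in text."""
    return ZIP_IN_TEXT.findall(text)


if __name__ == "__main__":
    print(is_zip("60606"))       # five digits
    print(is_zip("60606-1234"))  # five digits, hyphen, four digits
    print(is_zip("6060"))        # too short
    print(find_zips("Ship to Chicago, IL 60606 or Naperville, IL 60540-1234."))
```

Read piece by piece, the expression is less scary than it looks: \d{5} is "five digits," the parenthesized part is "a hyphen and four digits," and the trailing ? makes that part optional.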
Even worse, people who are thinking they should start using regular expressions often hear about this great book on the topic and have a natural reaction when they see it: A 500+ page book to learn how to search for text? No thanks.
This is too bad, because while Friedl’s book is great, it’s called Mastering Regular Expressions for a reason, and that reason is not because it’s a tutorial. My recommendation for a tutorial is the one I learned from over 20 years ago: the “Searching with Grep” chapter4 in the BBEdit User Manual. I believe it was largely written by a young guy named John Gruber.
The nice thing about regular expression syntax is that it can be learned a little bit at a time. You can be productive right away knowing only a few regex constructs. Even people who have been using regexes for ages tend to use just a dozen or so patterns, bringing out the more obscure ones only when they have a particularly tricky problem to solve. A good way to pick up new regex knowledge in small portions is to follow John D. Cook’s @RegexTip Twitter feed.
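As a small illustration of how far a few constructs can go, here is a minimal search-and-replace sketch in Python. It uses only \w (a word character), + (one or more), and parentheses to capture pieces for reuse in the replacement. The function name and the sample text are hypothetical, and the \1 and \2 replacement syntax shown is the one Python's re module accepts; other tools may spell it differently.

```python
import re


def swap_names(text):
    """Turn 'Last, First' into 'First Last' wherever the pattern occurs."""
    # (\w+) captures a run of word characters. In the replacement,
    # \2 and \1 refer back to the second and first captured groups.
    return re.sub(r"(\w+), (\w+)", r"\2 \1", text)


if __name__ == "__main__":
    print(swap_names("Gruber, John\nSnell, Jason"))
```

The same find-and-rearrange step is the kind of thing you would try interactively in a text editor's search dialog before deciding whether it deserves a script.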
You may have heard that there are regular expression “flavors,” different regex syntaxes used by different programs, and been put off by the potential for confusion. Don’t be. There are different flavors, but that won’t have any practical effect on your learning until you’re an expert. Nearly every program you run into nowadays uses the “Perl-Compatible Regular Expression,” or PCRE, syntax or something very close to it. Only very old programs use substantially different flavors, and by the time you find yourself using them, if ever, you will be well-equipped to handle the variation.
And before anyone on Twitter can do it, I’ll bring out the obligatory Jamie Zawinski bon mot:
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
Don’t be led astray by this, either. Only a person who’s used regexes a lot, and successfully, would make this kind of joke.
1. I would link you to that episode, but the episode descriptions in the TTS archives aren’t detailed enough for me to find it. I know it isn’t the most recent episode, even though they talk about BBEdit in it, because I haven’t listened to that one yet.
2. But in looking at the links for the most recent episode, I see that they do talk about the Studio Neat and Fintie carrying cases for the Magic Keyboard, so I need to move it up in my overflowing podcast queue.
3. Whatever text editor you’re comfortable with will do. Ten years ago, I would have been using TextMate.
4. “Grep” is the name of a Unix command that uses regular expressions for searching through files. I have always believed, but never confirmed, that Rich Siegel used the term “grep” instead of “regular expression” because the shorter word fit better in BBEdit’s Find dialog box.