Number regexes are hard

This question on the Keyboard Maestro forum—to which I gave an excessively complicated Perl-based answer because I don’t know nearly enough about Keyboard Maestro’s native features—got me thinking about the best regular expression for matching numbers in a block of text. It’s surprisingly tricky, but I think I have a decent answer.

First, let’s define what we’re looking for. The goal is to find both integers and numbers with a decimal point. They can be positive or negative, but we’re excluding numbers written in exponential notation. So all these should match

123  -123  +123  1.23  -1.23  +1.23
.123 -.123 +.123 123. -123.  +123.

but these shouldn’t

1.23e4  -1.23e4  1.23e-4  1.23e+4

(We might want to extend our regex to include exponential notation, but first things first.)

This very simple regex,

[-+.\d]+

will match numbers, but it’s too permissive. It will match things like 3-5, which clearly isn’t right. A more complicated regex

[-+]?\d*\.?\d*

looks better because it won’t match a hyphen or plus sign between digits, but because everything is optional, it will match empty strings. It would also match periods by themselves and other non-numbers.
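
Both failure modes are easy to confirm. Here’s a minimal sketch in Python (the variable names and test strings are just for illustration) that checks each pattern against a string it shouldn’t accept:

```python
import re

# The two candidate patterns discussed above.
permissive = re.compile(r'[-+.\d]+')
all_optional = re.compile(r'[-+]?\d*\.?\d*')

# The permissive pattern swallows a range like 3-5 as a single "number".
range_match = permissive.fullmatch('3-5')

# The all-optional pattern accepts the empty string and a lone period.
empty_match = all_optional.fullmatch('')
period_match = all_optional.fullmatch('.')

print(range_match is not None)   # True
print(empty_match is not None)   # True
print(period_match is not None)  # True
```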

This page suggests the following:

[-+]?\d*\.?\d+

where the + at the end forces there to be at least one digit in the match. This is pretty good, but if you give it 123., it only matches the 123 part. This may or may not be a mistake, depending on the context.

For example, in the Keyboard Maestro forum question, the numbers that were being hunted were attributes of XML tags and were inside double quotation marks. In that situation, a regex of

"[-+]?\d*\.?\d+"

would not find the quoted number in

<tag attribute="123.">

when it clearly should.
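
To see the difference the quotation marks make, here’s a small Python sketch. It assumes the sign in the suggested pattern is meant to be optional, and the variable names are mine:

```python
import re

# The suggested pattern, with the sign taken to be optional (an assumption).
suggested = r'[-+]?\d*\.?\d+'
tag = '<tag attribute="123.">'

# Unquoted, the pattern stops before the trailing decimal point.
bare = re.search(suggested, tag).group(0)

# Wrapped in quotation marks, it can never reach the closing quote.
quoted = re.search('"' + suggested + '"', tag)

print(bare)    # 123
print(quoted)  # None
```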

After some experimenting and benchmarking, I’ve settled on

[-+]?(\d+\.?\d*|\.\d+)

The optional sign part is obvious. The rest consists of two alternatives. The first,

\d+\.?\d*

will match integers, decimals that have one or more leading digits, and the 123. case we just talked about. The second alternative,

\.\d+

handles decimal numbers that don’t have a leading digit.
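
A quick way to check the pattern against the two lists at the top of the post is to test each item with fullmatch. That’s my choice for this sketch, because a plain search would still find the 1.23 and the 4 inside 1.23e4 as separate numbers:

```python
import re

number = re.compile(r'[-+]?(\d+\.?\d*|\.\d+)')

should_match = '''123  -123  +123  1.23  -1.23  +1.23
.123 -.123 +.123 123. -123.  +123.'''.split()
should_not_match = '1.23e4  -1.23e4  1.23e-4  1.23e+4'.split()

# fullmatch succeeds only if the entire string is a number.
accepted = [s for s in should_match if number.fullmatch(s)]
rejected = [s for s in should_not_match if not number.fullmatch(s)]

print(len(accepted), len(rejected))  # 12 4
```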

So

"([-+]?(\d+\.?\d*|\.\d+))"

would find the number in

<tag attribute="123.">

and put it in the first capture string, which is accessed via $1 or \1 in most languages and by Match.group(1) in Python.
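
In Python, that looks something like the following sketch (the names quoted_number and m are just placeholders):

```python
import re

quoted_number = re.compile(r'"([-+]?(\d+\.?\d*|\.\d+))"')
m = quoted_number.search('<tag attribute="123.">')

print(m.group(0))  # "123."  (the whole match, quotation marks included)
print(m.group(1))  # 123.    (the first capture group, just the number)
```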

This regex isn’t a panacea. Remember that 3-5 example from before? Suppose we had text that included the sentence

The material is found in Chapters 3-5.

Using

[-+]?(\d+\.?\d*|\.\d+)

would return two matches:

3  -5.

The first match is correct, but the second is terrible. You might argue that using a hyphen for a range is just wrong (it should be an n-dash), but we don’t always have the luxury of working with text from people who are as meticulous as we are. And even if the hyphen were replaced by an n-dash, we’d still have the period being treated as a decimal point.
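
Here’s a short Python sketch that runs the regex over the sentence with both kinds of dash. The find_numbers helper is a hypothetical name of my own:

```python
import re

number = re.compile(r'[-+]?(\d+\.?\d*|\.\d+)')

def find_numbers(text):
    """Return every substring of text that the pattern matches."""
    return [m.group(0) for m in number.finditer(text)]

# \u2013 is the n-dash.
hyphen = find_numbers('The material is found in Chapters 3-5.')
n_dash = find_numbers('The material is found in Chapters 3\u20135.')

print(hyphen)  # ['3', '-5.']
print(n_dash)  # ['3', '5.']
```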

While a universal regex may be impossible, that doesn’t mean we can’t come up with one that works in a particular situation. In most cases, we can use either delimiters (like quotation marks) or other clues to put together a regex that finds everything we want without finding the things we don’t want. What’s important is to have a toolbox of regex pieces that we can put together to solve the problem at hand.
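
As one illustration of mixing and matching pieces, here’s a sketch of a variant tuned for running prose. It’s my own assumption about what such a situation might call for, not a general answer: a sign counts only if it doesn’t directly follow a digit or period, and a period counts only if a digit follows it.

```python
import re

# Hypothetical prose-oriented variant:
#   (?<![\d.])   don't start right after a digit or period
#   (?:\.\d+)?   accept a decimal point only when digits follow it
prose_number = re.compile(r'(?<![\d.])[-+]?(?:\d+(?:\.\d+)?|\.\d+)')

def find_numbers(text):
    """Return every substring of text that the pattern matches."""
    return [m.group(0) for m in prose_number.finditer(text)]

chapters = find_numbers('The material is found in Chapters 3-5.')
mixed = find_numbers('It fell from 1.5 to -0.25, then to .5 overall.')

print(chapters)  # ['3', '5']
print(mixed)     # ['1.5', '-0.25', '.5']
```

The price is that this variant no longer captures the decimal point in 123., so it would be the wrong tool for the quoted XML attributes.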


Person, woman, man, camera, TV

While the obvious explanation is that we now know the words Putin used to activate Trump and send him off to kill Tony Stark’s parents, maybe there’s a more benign interpretation.

Nah, it’s definitely the Hydra thing.


Customer service

Just got off the phone with Kohler. Was installing a new faucet in the laundry room last night and learned (at nearly the last step, of course) that the sprayer hose assembly was missing a fitting. Called customer service this morning and learned that shipping a replacement part would take 7–10 days.

Me: Unacceptable.
Rep: Can send it FedEx for $25.
Me: No.
Rep: Let me talk to my manager.

Was on hold for quite a while. Felt like I was at a car dealership and half-expected to be told they couldn’t remove the rustproofing charge. But I got a “one-time waiver” on the shipping fee and should have the part tomorrow.


More discontinuous ranges in Python

I think I talked way too much in the latest episode of the Automators podcast, but David and Rosemary are indulgent hosts. One thing I’m glad I said, despite it being something I’ve said many times before, is that saving time isn’t necessarily the main reason to build an automation. Consistency of results is at least as important as saving time. And then there’s the matter of keeping your skills sharp. I often create an automation just as a way to learn a new technique or to practice an old one that I haven’t used in a while. That’s what I did this past week.

I was writing up a plan to inspect a tall residential building. A certain architectural feature was of interest, and the inspections are to take place on those floors only. Here’s how the building is laid out:

The code I initially came up with to get a list of all the floors with the feature was this list comprehension:

python:
floors = [ i + 2 for i in range(52) if i % 8 < 5 ]

What I liked about this, apart from its compactness, was how all the pertinent figures appeared: 52 floors with the pattern, starting on Floor 2, with the lower 5 floors in each group of 8 having the feature.

The result (after breaking it into lines where you can see the groups and gaps) was

[2, 3, 4, 5, 6,
 10, 11, 12, 13, 14,
 18, 19, 20, 21, 22,
 26, 27, 28, 29, 30,
 34, 35, 36, 37, 38,
 42, 43, 44, 45, 46,
 50, 51, 52, 53]
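
The line breaks don’t have to be added by hand. Here’s one possible sketch (not necessarily how the listing above was made) that starts a new row whenever there’s a gap between consecutive floors:

```python
floors = [i + 2 for i in range(52) if i % 8 < 5]

# Start a new row whenever the next floor isn't adjacent to the previous one.
rows = [[floors[0]]]
for f in floors[1:]:
    if f == rows[-1][-1] + 1:
        rows[-1].append(f)
    else:
        rows.append([f])

for row in rows:
    print(', '.join(str(f) for f in row))
```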

I was pretty pleased with myself until I remembered the superstition: there’s no “Floor 13.” Like many residential buildings, it skips from the 12th to the 14th, so if I wanted to give inspection directions that were easy to follow, I had to skip 13, too. I discussed this in a post last September.

One way to solve this—and the way I first solved it—is to loop through the floors list just created and bump up the floor numbers above 12:

python:
floors = [ i + 2 for i in range(52) if i % 8 < 5 ]
for i, f in enumerate(floors):
  floors[i] = f if f < 13 else f + 1

This worked, giving me

[2, 3, 4, 5, 6,
 10, 11, 12, 14, 15,
 19, 20, 21, 22, 23,
 27, 28, 29, 30, 31,
 35, 36, 37, 38, 39,
 43, 44, 45, 46, 47,
 51, 52, 53, 54]

but it wasn’t a particularly satisfying solution. If you’re using a comprehension to build the list in the first place, it feels like a defeat to then make adjustments using a plain old loop. So I thought some more and came up with this:

python:
floors = [ i + 2 if i < (13 - 2) else i + 3
           for i in range(52) if i % 8 < 5 ]

OK, I confess to being a little uneasy about making a comprehension that’s so long it needs to be broken into two lines to be understandable. But it is understandable, I think, and like the original, all the numbers in it have a meaning directly related to the layout of the building. Leaving the condition in the ternary operator as i < (13 - 2) instead of i < 11 helps a bit with the readability.[1]
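
As a sanity check, the two approaches can be compared directly. In this sketch the names looped and ternary are mine:

```python
# First approach: build the list, then bump every floor from 13 up.
looped = [i + 2 for i in range(52) if i % 8 < 5]
for i, f in enumerate(looped):
    looped[i] = f if f < 13 else f + 1

# Second approach: the ternary inside the comprehension.
ternary = [i + 2 if i < (13 - 2) else i + 3
           for i in range(52) if i % 8 < 5]

print(looped == ternary)  # True
print(13 in ternary)      # False
```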

So, did I save any time with this? No, I could have typed the whole list into an assignment statement in less time than I spent thinking about and writing either of the comprehensions. But being forced to think about the periodicity of the architectural features—which I hadn’t even considered when I was first looking through the drawings—gave me a better understanding of the building. And it’s always good to stretch out your language skills. Although I’ve used the ternary operator before, this was the first time I’d used it in a list comprehension.


  1. And if you’re wondering if leaving it that way makes the code run slower, I can tell you that running it through timeit shows that using 11 instead of (13 - 2) doesn’t speed up the code at all. Python’s compiler is too smart for that.
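
Another way to look at it, instead of timing, is to inspect the compiled bytecode. This sketch assumes CPython, where the compiler folds constant expressions. The '<footnote>' filename is just a label:

```python
import dis

code = compile('i < (13 - 2)', '<footnote>', 'eval')
opnames = [ins.opname for ins in dis.get_instructions(code)]

# If the subtraction survived compilation, a BINARY_* instruction would appear.
has_subtraction = any(name.startswith('BINARY') for name in opnames)

print(has_subtraction)  # False
```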