Number regexes are hard
July 30, 2020 at 2:16 PM by Dr. Drang
This question on the Keyboard Maestro forum—to which I gave an excessively complicated Perl-based answer because I don’t know nearly enough about Keyboard Maestro’s native features—got me thinking about the best regular expression for matching numbers in a block of text. It’s surprisingly tricky, but I think I have a decent answer.
First, let’s define what we’re looking for. The goal is to find both integers and numbers with a decimal point. They can be positive or negative, but we’re excluding numbers written in exponential notation. So all these should match
123 -123 +123 1.23 -1.23 +1.23
.123 -.123 +.123 123. -123. +123.
but these shouldn’t
1.23e4 -1.23e4 1.23e-4 1.23e+4
(We might want to extend our regex to include exponential notation, but first things first.)
This very simple regex,
[-+.\d]+
will match numbers, but it’s too permissive. It will match things like 3-5, which clearly isn’t right. A more complicated regex
[-+]?\d*\.?\d*
looks better because it won’t match a hyphen or plus sign between digits, but because everything is optional, it will match empty strings. It would also match periods by themselves and other non-numbers.
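A quick way to see both failure modes is to run the two patterns in Python. This is only a sketch; the variable names `permissive` and `optional` are made up for the illustration.

```python
import re

# The two early candidates discussed above (names are illustrative).
permissive = re.compile(r"[-+.\d]+")
optional = re.compile(r"[-+]?\d*\.?\d*")

# The character-class regex swallows the whole range as one "number".
print(permissive.findall("Chapters 3-5"))      # ['3-5']

# With every piece optional, the second pattern succeeds on an empty
# string and on a lone period.
print(optional.fullmatch("") is not None)      # True
print(optional.fullmatch(".") is not None)     # True

# It does, however, refuse a hyphen between digits.
print(optional.fullmatch("3-5") is not None)   # False
```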
This page suggests the following:
[-+]?\d*\.?\d+
where the + at the end forces there to be at least one digit in the match. This is pretty good, but if you give it 123., it only matches the 123 part. This may or may not be a mistake, depending on the context.
For example, in the Keyboard Maestro forum question, the numbers that were being hunted were attributes of XML tags and were inside double quotation marks. In that situation, a regex of
"[-+]?\d*\.?\d+"
would not find the quoted number in
<tag attribute="123.">
when it clearly should.
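Here is a small Python check of that behavior, taking the sign in the suggested pattern as optional. The names `suggested` and `tag` are just placeholders for this sketch.

```python
import re

# The suggested pattern, with the sign taken as optional.
suggested = r"[-+]?\d*\.?\d+"

# Unquoted, the trailing period is simply left out of the match.
print(re.search(suggested, "123.").group())    # 123

# Wrapped in quotes, the pattern needs a digit right before the closing
# quote, so the attribute value is not found at all.
tag = '<tag attribute="123.">'
print(re.search('"' + suggested + '"', tag))   # None
```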
After some experimenting and benchmarking, I’ve settled on
[-+]?(\d+\.?\d*|\.\d+)
The optional sign part is obvious. The rest consists of two alternatives. The first,
\d+\.?\d*
matches integers, decimals that have one or more leading digits, and the 123. case we just talked about. The second alternative,
\.\d+
handles decimal numbers that don’t have a leading digit.
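As a sanity check, the regex can be run against the two lists from the top of the post. The sketch below uses Python's `fullmatch`, which requires the whole string to match; that is an assumption about how you would test it, not the only way. In a free-text search the pattern would still pick up the 1.23 in 1.23e4, so the exponential cases are only rejected when the entire token has to match.

```python
import re

number = re.compile(r"[-+]?(\d+\.?\d*|\.\d+)")

# The examples listed earlier, split into individual tokens.
should_match = ("123 -123 +123 1.23 -1.23 +1.23 "
                ".123 -.123 +.123 123. -123. +123.").split()
should_not_match = "1.23e4 -1.23e4 1.23e-4 1.23e+4".split()

# fullmatch succeeds only if the entire token is consumed.
matched = [s for s in should_match if number.fullmatch(s)]
rejected = [s for s in should_not_match if not number.fullmatch(s)]

print(matched == should_match)        # True
print(rejected == should_not_match)   # True
```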
So
"([-+]?(\d+\.?\d*|\.\d+))"
would find the number in
<tag attribute="123.">
and put it in the first capture string, which is accessed via $1 or \1 in most languages and by Match.group(1) in Python.
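In Python, that looks something like the following sketch. The names `quoted_number`, `m`, and `value` are illustrative, and the conversion to `float` at the end is an extra step added here to show what you might do with the capture.

```python
import re

# The quoted form of the regex; group 1 is the number without the quotes.
quoted_number = re.compile(r'"([-+]?(\d+\.?\d*|\.\d+))"')

m = quoted_number.search('<tag attribute="123.">')
print(m.group(0))   # "123."   (the whole match, quotes included)
print(m.group(1))   # 123.     (the first capture group)

# Python's float() accepts a trailing decimal point.
value = float(m.group(1))
print(value)        # 123.0
```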
This regex isn’t a panacea. Remember that 3-5 example from before? Suppose we had text that included the sentence
The material is found in Chapters 3-5.
Using
[-+]?(\d+\.?\d*|\.\d+)
would return two matches:
3
-5.
The first match is correct, but the second is terrible. You might argue that using a hyphen for a range is just wrong (it should be an en dash), but we don’t always have the luxury of working with text from people who are as meticulous as we are. And even if the hyphen were replaced by an en dash, we’d still have the period being treated as a decimal point.
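Both cases are easy to reproduce in Python. One wrinkle in this sketch: because the regex contains a capture group, `re.findall` returns only the group (dropping the sign), so `finditer` with `group()` is used to see the full matches. The variable names are made up for the example.

```python
import re

number = re.compile(r"[-+]?(\d+\.?\d*|\.\d+)")
text = "The material is found in Chapters 3-5."

# Full matches, sign and trailing period included.
hits = [m.group() for m in number.finditer(text)]
print(hits)                    # ['3', '-5.']

# findall returns the contents of group 1, so the sign disappears.
print(number.findall(text))    # ['3', '5.']

# With an en dash (U+2013) the sign goes away, but the period is still
# treated as a decimal point.
en_text = "The material is found in Chapters 3\u20135."
en_hits = [m.group() for m in number.finditer(en_text)]
print(en_hits)                 # ['3', '5.']
```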
While a universal regex may be impossible, that doesn’t mean we can’t come up with one that works in a particular situation. In most cases, we can use either delimiters (like quotation marks) or other clues to put together a regex that finds everything we want without finding the things we don’t want. What’s important is to have a toolbox of regex pieces that we can put together to solve the problem at hand.