# PCalc binomial functions

Over the weekend, I read Federico Viticci’s excellent post on using iOS 12 Shortcuts to access PCalc. I haven’t installed iOS 12 yet (I usually wait a few days to let Apple’s servers cool down), but I expect to use Federico’s advice to make a few PCalc shortcuts when I do. In the meantime, his post inspired me to clean up and finish off a PCalc function that I’d half-written some time ago.

The function calculates a probability from the binomial distribution. Imagine there are $N$ independent trials of some event. There are only two possible outcomes of each event, which we’ll call success and failure, although you could give them any names you like. For each trial, the probability of success is $p$. The probability of $n$ successes in those $N$ trials is

$$P(n) = \binom{N}{n} p^n (1 - p)^{N - n}$$

The first term in the formula is the binomial coefficient, which is defined as

$$\binom{N}{n} = \frac{N!}{n!\,(N - n)!}$$

The binomial probability, then, is a function of three inputs, $n$, $N$, and $p$, that yields one output. The PCalc function I wrote to implement the function starts with three numbers on the stack,1 say $n = 5$, $N = 10$, and $p = .5$,

Any entries on the stack “above” the three inputs will still be there after the function runs.
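As a cross-check on whatever the PCalc function returns, here is a minimal Python sketch of the same binomial probability. It is not a translation of the PCalc steps; the name `binomial_pmf` and its argument order (matching the stack order $n$, $N$, $p$) are my own choices for illustration.

```python
from math import comb

def binomial_pmf(n: int, N: int, p: float) -> float:
    """Probability of exactly n successes in N independent trials,
    each with success probability p."""
    # comb(N, n) is the binomial coefficient N! / (n! (N - n)!)
    return comb(N, n) * p**n * (1 - p)**(N - n)

# The example inputs from the text: n = 5, N = 10, p = 0.5
print(binomial_pmf(5, 10, 0.5))  # 0.24609375
```

Summing `binomial_pmf(n, 10, 0.5)` over every `n` from 0 to 10 should give 1, which is a quick sanity check on any implementation.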

The function is called “Binomial PMF.” The PMF stands for probability mass function, which is standard terminology for the function that returns the probability of a random variable equaling a discrete value. The function is defined this way,

which is a lot of steps to enter. You’d probably prefer to download it.

If you look through the function definition, you’ll notice that I’m using up a lot of registers—more than is truly necessary to run out the calculation. I do this for a couple of reasons:

1. The registers are there to be used, and by keeping all the intermediate calculations in separate registers, I’m able to make the logic of the function easier to follow.
2. I have further plans for this function. I want to use it as the starting point for a more complicated function, the cumulative distribution function, which will need to keep track of some of those intermediate calculations.

We’ll look at the Binomial CDF function in the next post.

1. Yes, it’s an RPN-only function. RPN is the natural way to handle this many inputs, and because I don’t use PCalc’s algebraic mode, I didn’t see any reason to write the function to work in that mode.

# Ebooks, Hazel, and xargs

Although I have a Kindle, I prefer reading ebooks on my iPad, and the ebook reader I like the most is Marvin.1 This means I have to convert my Kindle books to ePubs (the format Marvin understands) and put them in a Dropbox folder that Marvin can access. The tools I use to do this efficiently are Calibre, Hazel, and, occasionally, a one-liner shell script.

If you have ever felt the need to convert the format of an ebook, you’ve probably used Calibre, so I won’t spend more than a couple of sentences describing how I use it.2 When I buy a new book from Amazon, it gets sent to my Kindle. I then copy the book from the Kindle into the Calibre library and convert the format to ePub. This doesn’t do the conversion in place and overwrite the Kindle file; it makes a new file in ePub format.

Calibre organizes its library in subfolders according to author and book. At the top level is the calibre folder, which I keep in Dropbox. At the next level are folders for each author. Within these are folders for each book from a particular author. Finally, within each book folder are the ebook files themselves and some metadata.

This is great, but I find it easiest to configure Marvin to download ePubs from a single Dropbox folder, one that I’ve cleverly named epubs. So I want an automated way to copy the ePub files from the calibre folder structure into the epubs folder.

This is a job for Hazel. Following Noodlesoft’s guidance, I set up two rules for the calibre folder.

The first one, “Copy to epubs,” just looks for files with the epub extension and copies them to the epubs folder.

By itself, this rule does nothing, because the ePub files aren’t in the calibre folder itself; they’re two levels down, and Hazel doesn’t look in subfolders without a bit of prodding. That prodding comes from the “Run on subfolders” rule:

This action was copied directly from Noodlesoft’s documentation, which says

To solve this problem [running rules in subfolders], Hazel offers a special action: “Run rules on folder contents.” If a subfolder inside your monitored folder matches a rule containing this action, then the other rules in the list will also apply to that subfolder’s contents.

With these rules in place, I don’t have to remember to copy the newly made ePub file into the epubs folder—Hazel does it for me as soon as it recognizes that a new ePub is in the calibre folder structure.

At least it should do it for me. Sometimes—for reasons I haven’t been able to suss out—Hazel doesn’t make the copies it’s supposed to. I’ll open up Marvin expecting to see new books ready to be downloaded to my iPad, and they won’t be there. When that happens, I bring out the heavy artillery: a shell script that combines the find and xargs commands to copy any and all ePubs under the calibre directory into the epubs directory. The one-liner script, called epub-update, looks like this:

#!/bin/bash

find ~/Dropbox/calibre/ -iname "*.epub" -print0 | xargs -0 -I file cp -n file ~/Dropbox/epubs/


The find part of the pipeline collects all files with an epub extension (case-insensitive) under the calibre directory and prints them out separated by a null byte. This null separator is pretty much useless when printing out file names for people, but it’s great when those files are going to be piped to xargs, especially when the file names are likely to have spaces within them.

The xargs part of the pipeline takes the null-separated list of files from the find command and runs cp on them. In the cp command, file is used as a placeholder for the file name and the files are copied to the epubs directory. The -n option tells cp not to overwrite files that already exist.
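For anyone who would rather not rely on `find` and `xargs`, the same idea can be sketched in Python. This is an alternative, not the script described above: the function name `copy_epubs` is hypothetical, and the existence check only approximates what `cp -n` does.

```python
from pathlib import Path
import shutil

def copy_epubs(library: Path, dest: Path) -> list[Path]:
    """Copy every .epub file found anywhere under library into dest.

    Files whose names already exist in dest are left alone, roughly
    like cp -n. Returns the list of newly created copies.
    """
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(library.rglob("*")):
        # Case-insensitive extension match, like find's -iname
        if src.is_file() and src.suffix.lower() == ".epub":
            target = dest / src.name
            if not target.exists():
                shutil.copy(src, target)
                copied.append(target)
    return copied
```

With the folder layout from this post, the call would be `copy_epubs(Path.home() / "Dropbox/calibre", Path.home() / "Dropbox/epubs")`. Because paths are passed around as objects rather than as text through a pipe, spaces in file names need no special handling.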

The advantage of using epub-update when Hazel hiccups is that I don’t have to go hunting through subfolders to find the file that didn’t get copied.

I suppose if I were smart, I’d set up my Amazon account to send new purchases to my Mac instead of my Kindle. Then I could automate the importing of new ebooks into the Calibre library and eliminate more of the manual work at the front end of the process. One of the advantages of doing posts like this is that the process of writing up my workflows forces me to confront my own inefficiencies.

1. iBooks doesn’t give me the control over line spacing and margins that Marvin does. I may change my mind after seeing the all-new, all-improved Books app in iOS 12, but even if I switch, the automations described here will still be useful.

2. Also, because it has the worst GUI of any app on my Mac, I can’t bear to post screenshots of it.

# Hypergeometric Obama

On Friday, Barack Obama will speak at the University of Illinois as part of a ceremony at which he will receive the Paul Douglas Award for Ethics in Government. The speech will be at the Auditorium at the south end of the Quad, which doesn’t seat nearly as many people as wanted to see him.

According to this report, 22,611 students signed up for a ticket lottery for just 1,300 openings. My two sons were among the horde, and they learned yesterday, along with 21,309 of their friends, that they didn’t win.

How likely was it that neither would get a seat? The chance of any individual student being picked was

$$p = \frac{1300}{22611} = 0.0575$$

which means that the chance of any individual student not being picked was $1 - 0.0575 = 0.9425$. So the probability that neither would be picked was

$$0.9425^2 = 0.8883$$

or just under 90%. So the fact that neither of them got in wasn’t a surprise. We can also calculate the probability that both were lucky,

$$0.0575^2 = 0.0033$$

which is extremely low, and the probability that one got in and one didn’t,

$$2 \times 0.0575 \times 0.9425 = 0.1084$$

which isn’t too bad, but unfortunately it didn’t work out.
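The arithmetic so far is easy to reproduce. The Python sketch below treats the two sons as independent, which is the approximation being made at this point, and uses variable names of my own choosing. The factor of 2 in the one-in, one-out case is there because either son could be the lucky one.

```python
tickets, applicants = 1300, 22611

# Chance that any one student is picked
p = tickets / applicants

# Treating the two sons as independent draws (an approximation)
neither = (1 - p) ** 2
both = p ** 2
one = 2 * p * (1 - p)   # son A wins and B loses, or the reverse

for label, value in [("p", p), ("neither", neither), ("both", both), ("one", one)]:
    print(f"{label:8s}{value:.4f}")
```

The three outcomes are exhaustive, so `neither + one + both` should come to 1.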

These calculations are simple and give us answers that are accurate enough for our purposes, given the large numbers (1,300 and 22,611) involved. But we’ve made some approximations that won’t be reasonable if the numbers are small.

For example, let’s assume the same general problem, but this time we’ll assume there are 10 applicants—two of whom are my sons—for 2 tickets. What are the probabilities of zero, one, and two of my sons getting tickets?

If we follow the procedure above, we get

$$0.20^2 = 0.040$$

for both sons getting a ticket,

$$2 \times 0.20 \times 0.80 = 0.320$$

for one son getting a ticket and one not, and

$$0.80^2 = 0.640$$

for neither getting a ticket. These calculations are internally consistent, in that they add up to unity, but they’re wrong.

The problem is that these calculations are based on an assumption that one son winning a ticket is independent of the other son winning, but they are not independent.

The equation for calculating the probability of an intersection of two events, call them $A$ and $B$, is

$$P(A \cap B) = P(A)\,P(B\,|\,A) = P(B)\,P(A\,|\,B)$$

where $P(B\,|\,A)$ is the probability of event $B$ given that event $A$ occurs, and similarly, $P(A\,|\,B)$ is the probability of event $A$ given that event $B$ occurs. If the events are independent, then the conditions don’t matter,

$$P(B\,|\,A) = P(B), \qquad P(A\,|\,B) = P(A)$$

and

$$P(A \cap B) = P(A)\,P(B)$$

But that’s not the situation here. If Son A gets a ticket, then Son B’s chance of getting a ticket isn’t $2/10 = 0.20$, it’s $1/9 = 0.1111$ because with Son A having a ticket, there’s only one ticket left for the nine remaining applicants. Which means the probability that both sons get a ticket is

$$\frac{2}{10} \cdot \frac{1}{9} = \frac{1}{45} = 0.0222$$

which is just over half of our earlier, mistaken, calculation of $0.040$.

Similarly, the probability that one son gets a ticket and the other doesn’t is

$$\frac{2}{10} \cdot \frac{8}{9} + \frac{8}{10} \cdot \frac{2}{9} = \frac{16}{45} = 0.3556$$

and the probability that neither gets a ticket is

$$\frac{8}{10} \cdot \frac{7}{9} = \frac{28}{45} = 0.6222$$

(And now we see why it was OK to play a little loose with the numbers back in our original calculations with 22,611 applicants for 1,300 tickets. There just isn’t enough difference between

$$\frac{1300}{22611} = 0.05749$$

and

$$\frac{1299}{22610} = 0.05745$$

to make it worth even the minimal effort.)
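The conditional-probability reasoning can be checked with exact fractions. This is a minimal sketch; the function name `two_son_probabilities` is hypothetical, and it covers only the case of two particular applicants.

```python
from fractions import Fraction

def two_son_probabilities(tickets: int, applicants: int):
    """Exact (both, one, neither) probabilities for two particular
    applicants, with tickets drawn without replacement."""
    t, a = tickets, applicants
    # First son wins, then second son wins from what is left
    both = Fraction(t, a) * Fraction(t - 1, a - 1)
    # One wins and the other loses; doubled to cover both orders
    one = 2 * Fraction(t, a) * Fraction(a - t, a - 1)
    # First son loses, then second son loses from what is left
    neither = Fraction(a - t, a) * Fraction(a - t - 1, a - 1)
    return both, one, neither

print(two_son_probabilities(2, 10))
# (Fraction(1, 45), Fraction(16, 45), Fraction(28, 45))
```

Calling `two_son_probabilities(1300, 22611)` and converting the results with `float()` gives the exact values for the real lottery, which can be set beside the independent approximation.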

All of the denominators in the previous results were 45. It will probably not surprise you that 45 is the number of combinations of ten things taken two at a time. It’s also the binomial coefficient, and is usually written like a fraction but without the horizontal dividing line and surrounded by parentheses. It’s calculated through factorials:

$$\binom{n}{k} = \frac{n!}{k!\,(n - k)!}$$

In our particular case of $n=10$ and $k=2$,

$$\binom{10}{2} = \frac{10!}{2!\,8!} = \frac{10 \cdot 9}{2} = 45$$

Imagine ten lottery balls with the numbers 0 through 9 printed on them. Mix them up and draw two. There are 45 different results you can get if you don’t care about the order of the two balls. Here they are:

0-1  0-2  0-3  0-4  0-5  0-6  0-7  0-8  0-9
1-2  1-3  1-4  1-5  1-6  1-7  1-8  1-9
2-3  2-4  2-5  2-6  2-7  2-8  2-9
3-4  3-5  3-6  3-7  3-8  3-9
4-5  4-6  4-7  4-8  4-9
5-6  5-7  5-8  5-9
6-7  6-8  6-9
7-8  7-9
8-9


If two of the balls—say 3 and 7, just as an example—represent winning a ticket, scanning the list shows that there’s one combination with both 3 and 7; 16 combinations with a 3 or a 7 but not both; and 28 combinations with neither a 3 nor a 7. These are the numerators in the answers above.
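Rather than scanning the list by eye, the counting can be done by brute force. The sketch below uses Python's `itertools.combinations` to generate the 45 draws and tallies how many winning balls each one contains; the choice of 3 and 7 as winners follows the example in the text.

```python
from collections import Counter
from itertools import combinations

winners = {3, 7}

# Every unordered draw of two balls from those numbered 0 through 9
draws = list(combinations(range(10), 2))

# Tally the draws by how many winning balls they contain
tally = Counter(len(winners & set(draw)) for draw in draws)

print(len(draws))                      # 45
print(tally[2], tally[1], tally[0])    # 1 16 28
```

Changing `winners` to any other pair of numbers leaves the tallies unchanged, because the count depends only on how many winners there are, not on which balls they are.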

There’s another way to visualize this that might make more sense to you. Imagine a line of ten people. Two of them get handed tickets (X), the other eight don’t (O). Here are the 45 ways the two tickets can be distributed:

XXOOOOOOOO  XOXOOOOOOO  XOOXOOOOOO  XOOOXOOOOO  XOOOOXOOOO  XOOOOOXOOO  XOOOOOOXOO  XOOOOOOOXO  XOOOOOOOOX
OXXOOOOOOO  OXOXOOOOOO  OXOOXOOOOO  OXOOOXOOOO  OXOOOOXOOO  OXOOOOOXOO  OXOOOOOOXO  OXOOOOOOOX
OOXXOOOOOO  OOXOXOOOOO  OOXOOXOOOO  OOXOOOXOOO  OOXOOOOXOO  OOXOOOOOXO  OOXOOOOOOX
OOOXXOOOOO  OOOXOXOOOO  OOOXOOXOOO  OOOXOOOXOO  OOOXOOOOXO  OOOXOOOOOX
OOOOXXOOOO  OOOOXOXOOO  OOOOXOOXOO  OOOOXOOOXO  OOOOXOOOOX
OOOOOXXOOO  OOOOOXOXOO  OOOOOXOOXO  OOOOOXOOOX
OOOOOOXXOO  OOOOOOXOXO  OOOOOOXOOX
OOOOOOOXXO  OOOOOOOXOX
OOOOOOOOXX


Imagine now that my two sons are at the left end of the line (they could be anywhere, like positions 3 and 7; the results won’t change). There’s one arrangement where they get both tickets (the upper left corner), 16 arrangements where one of them gets a ticket and the other doesn’t (the top two rows except for the upper left corner), and 28 arrangements where neither of them gets a ticket (everywhere else).

There is, as you might imagine, a generalization of this problem to account for any number of applications, tickets, and sons. It’s called the hypergeometric distribution. Using the nomenclature in the linked Wikipedia article, we’ll assume a population of $N$ things (applications), of which $K$ are successes (tickets). If we draw $n$ samples (sons) from that population (without replacement), then the probability that $k$ of the samples will be successes is

$$P(k) = \frac{\binom{K}{k} \binom{N - K}{n - k}}{\binom{N}{n}}$$

The denominator should look familiar: it’s the number of combinations of $N$ things taken $n$ at a time.

The first term in the numerator is the number of ways the successes in the sample ($k$) can be taken from the successes in the population ($K$). The second term in the numerator is the number of ways the failures in the sample ($n - k$) can be taken from the failures in the population ($N - K$). This product is a formalization of the counting we did above.

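Let’s use this formula to repeat the example with ten applications for tickets ($N = 10$), two tickets ($K = 2$), and two sons ($n = 2$). The probability that both sons will get a ticket is

$$\frac{\binom{2}{2} \binom{8}{0}}{\binom{10}{2}} = \frac{1 \cdot 1}{45} = 0.0222$$

The probability that exactly one son will get a ticket is

$$\frac{\binom{2}{1} \binom{8}{1}}{\binom{10}{2}} = \frac{2 \cdot 8}{45} = 0.3556$$

And the probability that neither one will get a ticket is

$$\frac{\binom{2}{0} \binom{8}{2}}{\binom{10}{2}} = \frac{1 \cdot 28}{45} = 0.6222$$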

Just like before, but with less counting.
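The hypergeometric probability is short enough to write out with nothing but Python's standard library. In this sketch the function name `hypergeom_pmf` and its argument order (sample numbers first, then population numbers) are my own assumptions, not any library's API.

```python
from math import comb

def hypergeom_pmf(k: int, n: int, K: int, N: int) -> float:
    """Probability of k successes in a sample of n, drawn without
    replacement from a population of N that contains K successes."""
    # ways to pick the successes * ways to pick the failures,
    # divided by the number of ways to pick the whole sample
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Ten applications, two tickets, two sons
for k in (2, 1, 0):
    print(k, round(hypergeom_pmf(k, n=2, K=2, N=10), 4))
```

Python integers have arbitrary precision, so `comb` stays exact even for the full lottery with $N = 22611$ and $K = 1300$; only the final division is done in floating point.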

The SciPy library for Python has a sublibrary called stats with a set of functions for handling the hypergeometric distribution. Here’s how to do this last calculation in Python:

python:
>>> from scipy.stats import hypergeom
>>> hypergeom.pmf(0, 10, 2, 2)
0.62222222222222201


The pmf function stands for “probability mass function,” and it represents the formula defined above. The first argument is the number of successes in the sample, the second is the size of the population, the third is the number of successes in the population, and the fourth is the sample size.

I think this ordering of the arguments was an incredibly stupid choice by the designers of the library, as it puts the population numbers in between the sample numbers and the order of the population numbers is the reverse of the order of the sample numbers. It’s hard to imagine a less intuitive way to define the argument order. And don’t get me started on the symbols they use in the documentation. But at least the function works.

We can now go back to our original problem of 1,300 tickets, 22,611 applications, and 2 sons and do it right.

python:
>>> hypergeom.pmf(2, 22611, 1300, 2)
0.0033031794730568834
>>> hypergeom.pmf(1, 22611, 1300, 2)
0.10838192109526341
>>> hypergeom.pmf(0, 22611, 1300, 2)
0.88831489943503728


As expected, no practical difference between these answers and the approximations we started out with. But it’s the journey, not the destination, right?

# An image and PDF grab bag

In my job, I often refer to provisions of building codes or material and equipment standards in my reports. Usually, simply quoting the relevant provisions is sufficient, but sometimes I need to attach one or more pages from these documents as an addendum. In the old days, that meant photocopies; now it typically means pulling pages out of PDFs. Preview is a pretty good tool for this, as it allows you to use the thumbnail sidebar to extract and rearrange pages. But I recently ran into a situation where Preview couldn’t do the job alone, and I had to use a series of command-line tools to get the job done.

The problem is with the American Institute of Steel Construction, which has decided to publish its essential Steel Construction Manual as a website instead of a PDF. Each “page” of the website looks like the corresponding page of the print edition of the manual.

Having this website instead of a PDF is moderately annoying when I’m trying to use the manual; it’s really annoying when I need to pull out excerpts, because I have to make screenshots of each page, edit them, convert them to PDFs, resize them to fit on letter-sized pages, and assemble them into a single coherent PDF document. Here’s how I do it.

First, I take the screenshots on my 9.7″ iPad Pro in portrait mode (see above) because that gives me good resolution of a single page. I could get slightly higher resolution by taking the screenshots on my 2017 27″ 5k iMac, but that machine isn’t available when I’m working at home, where the Mac on my desk is a non-Retina 2012 27″ iMac. After screenshotting, I have a bunch of JPEG files on my iPad, which I copy over to my Mac via the Files app and Dropbox.

Next, I move to the Mac (whichever one is handy) and crop each image down to just the page image, eliminating all the browser chrome and navigation controls. For this I use the mogrify command from the ImageMagick suite of tools to crop the images in place.1 After a bit of trial and error, I learned that

mogrify -crop 1214x1820+162+224 image.jpg


gives the crop size and offset that leaves just the page image.
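The geometry string packs four numbers into the form width x height + x offset + y offset, with the offsets measured from the upper left corner of the image. The Python sketch below unpacks it into the pixel box it describes. The helper `parse_geometry` is hypothetical and handles only this one form, not the full ImageMagick geometry syntax.

```python
import re

def parse_geometry(geometry: str) -> tuple[int, int, int, int]:
    """Turn 'WxH+X+Y' into a (left, top, right, bottom) pixel box."""
    match = re.fullmatch(r"(\d+)x(\d+)\+(\d+)\+(\d+)", geometry)
    if match is None:
        raise ValueError(f"unsupported geometry: {geometry!r}")
    width, height, x, y = map(int, match.groups())
    return (x, y, x + width, y + height)

print(parse_geometry("1214x1820+162+224"))  # (162, 224, 1376, 2044)
```

Assuming a portrait screenshot of 1536 by 2048 pixels, the right and bottom edges of that box (1376 and 2044) fall inside the image, so the crop doesn't run off the screenshot.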

Of course, I don’t want to enter this command for every screenshot, so I wrote a shell script, called aisc-crop, which loops through all of its arguments, running the mogrify command on each:

bash:
#!/bin/bash

# Crop each image named on the command line, in place.
for f in "$@"; do
  mogrify -crop 1214x1820+162+224 "$f"
done


With this, I can crop all the images in a directory with a single command:

aisc-crop *.JPG


Now that I have the page images I want, it’s time to turn them into PDFs. For this, I use the built-in sips command, but sips wants the extension to be .pdf before it does the conversion. So I use Larry Wall’s old rename Perl script:

rename 's/\.JPG$/.pdf/' *.JPG


Now the files are ready for conversion:

sips -s format pdf *.pdf


Time to put all the pages together into a single PDF document. For this, I like using PDFtk (which can also be installed via Homebrew):

pdftk *.pdf cat output aisc-pages.pdf


At this point, I have a PDF document with all the pages I want, but the pages aren’t letter-sized. If I open the document in Preview and Get Info on it, I see this:

The page size is so big because sips treated every pixel in the JPEG as a point in the converted PDF. Since there are 72 points per inch, the PDF pages are $\frac{1820}{72} = 25.28\; \mathrm{in}$ high. To get the PDF down to letter size, I use pdfjam, which got installed along with my TeX installation:

pdfjam --paper letterpaper --suffix letter aisc-pages.pdf


Now I have a document named aisc-pages-letter.pdf that’s the right physical size, with a correspondingly higher dpi for the embedded images. I could have gotten the same result by “printing” aisc-pages.pdf to a new PDF with the Scale to Fit option selected in the Print sheet, but where’s the fun in that?
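The page-size arithmetic can be sketched in a few lines of Python. It assumes one pixel per point before scaling and a uniform scale that just fits the page on letter paper with no margins. pdfjam's actual scaling may leave some margin, so treat the dpi figure as a lower bound. The function names are my own.

```python
POINTS_PER_INCH = 72
LETTER_IN = (8.5, 11.0)   # letter paper, width and height in inches

def page_size_inches(width_px: int, height_px: int) -> tuple[float, float]:
    """Page size when every pixel becomes one PDF point."""
    return width_px / POINTS_PER_INCH, height_px / POINTS_PER_INCH

def effective_dpi(width_px: int, height_px: int, paper_in=LETTER_IN) -> float:
    """Image resolution after scaling the page to just fit the paper."""
    w_in, h_in = page_size_inches(width_px, height_px)
    # Uniform scale factor, limited by whichever dimension is tighter
    scale = min(paper_in[0] / w_in, paper_in[1] / h_in)
    return POINTS_PER_INCH / scale

w, h = page_size_inches(1214, 1820)
print(f"{w:.2f} x {h:.2f} in")                   # 16.86 x 25.28 in
print(f"{effective_dpi(1214, 1820):.1f} dpi")    # 165.5 dpi
```

For the 1214 by 1820 crop, the height is the limiting dimension, so the result is simply 1820 pixels spread over 11 inches.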

Now I can open the document in Preview and rearrange the pages if I didn’t take the screenshots in the right order. Otherwise, I’m done. As is often the case, it takes longer to explain than to do.

1. ImageMagick used to be kind of hard to install on a Mac, but not anymore. As Jason Snell showed us a couple of months ago, just use Homebrew and brew install imagemagick