Loops and arrays

I mentioned in yesterday’s post that Bruce Ediger solved the conditional probability problem using Monte Carlo simulation. That is, he used a random number generator to numerically simulate many repetitions of the problem, tracking the conditions of interest.

Ediger wrote his program in Go and posted it on GitHub. It’s pretty straightforward code, and you don’t need to know Go (I don’t) to understand how it works. I wrote a similar script in Python:

python:
 1:  #!/usr/bin/env python3
 2:  
 3:  import numpy as np
 4:  import sys
 5:  
 6:  # Initialize
 7:  N = int(sys.argv[1])
 8:  rng = np.random.default_rng()
 9:  pA = .75
10:  pB = .25
11:  
12:  # Generate N pairs of uniform random variates and count
13:  onlyOne = 0
14:  onlyA = 0
15:  for i in range(N):
16:    A = rng.random()
17:    B = rng.random()
18:    if A < pA:
19:      if B >= pB:
20:        onlyA += 1
21:        onlyOne += 1
22:    else:
23:      if B < pB:
24:        onlyOne += 1
25:  
26:  # Estimate probability that if only one hits, it's A
27:  p = onlyA/onlyOne
28:  
29:  print(f'Probability estimate: {p:.2%} ({onlyA:,d}/{onlyOne:,d})')

The variable N, which is passed as the first argument to the script (see Line 7), is the number of simulations; it’s the number of times we go through the loop that starts on Line 15. The variables onlyOne and onlyA keep track of, respectively, the number of simulations in which only one shooter hits the target and the number in which only Shooter A hits the target. Within each pass through the loop, the variates A and B are set via the generator’s random method, which produces a uniform random floating point number in the half-open interval [0, 1).1 The conditions of interest are tracked by Lines 18–24, where Shooter A hits the target if A is less than pA (0.75) and Shooter B hits the target if B is less than pB (0.25).

Finally, the probability we want is estimated by the division on Line 27 and printed as a percentage on Line 29. Overall, a pretty simple script if you’re used to doing this sort of thing.

This script is called monte-carlo-loop.py, and I ran it inside the time command to get both the probability estimate and the running time.

$ time python monte-carlo-loop.py 1_000_000
Probability estimate: 90.01% (562,294/624,714)

real    0m0.499s
user    0m0.402s
sys     0m0.033s

That’s a million simulations. Python disregards underscore characters in numbers, so I used 1_000_000 as the argument instead of 1000000 to make it easier to read. The script took about half a second to run on my M4 MacBook Pro and got, for all practical purposes, the theoretically correct answer of 90%.
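That theoretical answer can be worked out directly from the definition of conditional probability, assuming (as the simulation does) that the two shooters are independent. Here’s a quick sketch of the exact calculation:

```python
# Exact answer: probability that Shooter A is the one who hit,
# given that exactly one of the two shooters hit the target.
pA = 0.75
pB = 0.25

p_onlyA = pA * (1 - pB)   # A hits, B misses: 0.75 * 0.75 = 0.5625
p_onlyB = pB * (1 - pA)   # B hits, A misses: 0.25 * 0.25 = 0.0625
p = p_onlyA / (p_onlyA + p_onlyB)

print(f'{p:.2%}')         # prints 90.00%
```

The denominator is the probability that exactly one shooter hits; dividing the “only A” case by it gives 0.5625/0.625, which is exactly 90%.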

Note that I’m using a loop to go through each simulation. You may recall from my post on the Electoral College that it’s often advantageous to work at a higher level of abstraction and deal with arrays of values as a single unit. This is a big part of what NumPy is for.

With that in mind, we can rewrite the Monte Carlo simulation this way:

python:
 1:  #!/usr/bin/env python3
 2:  
 3:  import numpy as np
 4:  import sys
 5:  
 6:  # Initialize
 7:  N = int(sys.argv[1])
 8:  rng = np.random.default_rng()
 9:  pA = .75
10:  pB = .25
11:  
12:  # Generate N pairs of uniform random variates and count
13:  A = rng.random(N)
14:  B = rng.random(N)
15:  onlyA = np.logical_and(A < pA, B >= pB).sum()
16:  onlyB = np.logical_and(B < pB, A >= pA).sum()
17:  
18:  # Estimate probability that if only one hits, it's A
19:  p = onlyA/(onlyA + onlyB)
20:  
21:  print(f'Probability estimate: {p:.2%} ({onlyA:,d}/{onlyA + onlyB:,d})')

In this code, A and B are arrays of length N, with each item being a random floating point number in the interval [0, 1). We then filter these arrays according to whether their values are less than the target values of pA and pB. For example, let’s say A is an array of 10 random numbers:

[0.67931298, 0.86084747, 0.66868163, 0.07347088, 0.37776693,
 0.86578819, 0.53811295, 0.84018319, 0.99293766, 0.39668558]

Then the result of A < .75 would be a 10-element array of Booleans:

 [ True, False,  True,  True,  True,
  False,  True, False, False,  True]

The logical_and function takes two such arrays and outputs another Boolean array with elements that are True if both corresponding elements of the arguments are True and False otherwise.

Because Boolean arrays can be thought of as arrays of ones and zeros, the sum method acts to count the number of True values. This is how we make the (scalar) onlyA and onlyB variables in Lines 15 and 16.
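To make those two steps concrete, here’s a toy example with made-up numbers (not drawn from the actual simulation):

```python
import numpy as np

A = np.array([0.2, 0.9, 0.5, 0.1])
B = np.array([0.1, 0.8, 0.9, 0.3])

hitA = A < 0.75               # [ True, False,  True,  True]
missB = B >= 0.25             # [False,  True,  True,  True]

# True only where A hit AND B missed
onlyA = np.logical_and(hitA, missB)   # [False, False,  True,  True]
print(onlyA.sum())                    # prints 2
```

The final sum counts the two positions where both conditions held.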

The equivalent in this script of the onlyOne variable in the looping script is the sum of onlyA and onlyB, so that’s what shows up in the denominator of Line 19.

I called this script monte-carlo-array.py and ran it like this:

$ time python monte-carlo-array.py 1_000_000
Probability estimate: 89.99% (562,827/625,462)

real    0m0.091s
user    0m0.060s
sys     0m0.026s

Basically the same result but considerably faster. On the other hand, the array code uses a lot more memory because it has to make the various million-element arrays. The tradeoff between space and speed is common; what’s great about modern computers is that they have both.
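If memory ever became the binding constraint (say, for billions of simulations), one middle ground would be to process the trials in fixed-size chunks: you keep most of the speed of array operations while capping memory at the chunk size. This is just a sketch of that idea, not one of the scripts above, and the function and parameter names are my own:

```python
import numpy as np

def estimate(N, chunk=1_000_000, rng=None):
    """Chunked Monte Carlo estimate of P(only A hit | exactly one hit)."""
    if rng is None:
        rng = np.random.default_rng()
    pA, pB = .75, .25
    onlyA = onlyB = 0
    remaining = N
    while remaining > 0:
        n = min(chunk, remaining)   # last chunk may be smaller
        A = rng.random(n)
        B = rng.random(n)
        onlyA += np.logical_and(A < pA, B >= pB).sum()
        onlyB += np.logical_and(B < pB, A >= pA).sum()
        remaining -= n
    return onlyA / (onlyA + onlyB)

print(f'Probability estimate: {estimate(1_000_000):.2%}')
```

The chunk size is arbitrary; anything large enough to amortize the per-chunk overhead works.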


  1. The opening square bracket means that 0 is included in the interval. The closing parenthesis means that 1 is not.