Pandas and HTML tables

The Talk Python To Me podcast is one that underscores the value of Castro’s two-stage management system. Unlike, say, 99% Invisible and Upgrade, Talk Python is a show I don’t listen to every week because many of its topics are just too far afield from my interests. So its episodes go into my Inbox for vetting instead of directly into my Queue. The latest episode, about PyArrow and its movement into the internals of Pandas, is one that I immediately promoted to the Queue.

But this post isn’t about Castro or PyArrow (inverted pyramid? what’s that?). It’s a feature of Pandas that was mentioned offhandedly by Michael Kennedy—Pandas can take the URL of a web page and turn the tables on that page into data frames. I didn’t know about this and wanted to try it out immediately.

After writing these two posts on bolts, I wondered about how the lead angle of bolt threads changes with bolt size. I didn’t do anything about it at the time, but now I figured it would make a good test case, as there are plenty of web pages out there with tables of bolt dimensions. I chose this one from bolt supplier Fastenere. There are two tables on the page, one for US dimensions and the other for metric. Here’s a screenshot of the US table:

US bolt table from Fastenere

I wanted to pull this table into a data frame, create two new data frames, one for coarse threads and the other for fine threads, and then add a column for lead angle to each data frame.

The lead angle is a bit of geometry that comes from unrolling the screw thread at the pitch diameter, d2:

Lead angle definition

The pitch diameter isn’t given in the table, but it’s easily calculated from

d2 = d-0.64951905 P

where d is the major diameter in the table and P is the pitch or lead,1 which is the inverse of the “threads per inch” value. This formula comes from ANSI Standard ASME B1.1, Unified Inch Screw Threads.

Here are the results for coarse threads,

Size Diameter TPI Angle
#2 0.0860 56 4.37
#4 0.1120 40 4.75
#5 0.1250 40 4.18
#6 0.1380 32 4.83
#8 0.1640 32 3.96
#10 0.1900 24 4.65
1/4″ 0.2500 20 4.18
5/16″ 0.3125 18 3.66
3/8″ 0.3750 16 3.40
7/16″ 0.4375 14 3.33
1/2″ 0.5000 13 3.11
9/16″ 0.5625 12 2.99
5/8″ 0.6250 11 2.93
3/4″ 0.7500 10 2.66
7/8″ 0.8750 9 2.52
1″ 1.0000 8 2.48
1-1/8″ 1.1250 7 2.52
1-1/4″ 1.2500 7 2.25
1-3/8″ 1.3750 6 2.40
1-1/2″ 1.5000 6 2.18
1-3/4″ 1.7500 5 2.25
2″ 2.0000 4.5 2.18

The diameter is given in inches and the angle is in degrees. Similarly, here are the results for fine threads:

Size Diameter TPI Angle
#0 0.0600 80 4.39
#2 0.0860 64 3.75
#4 0.1120 48 3.85
#5 0.1250 44 3.75
#6 0.1380 40 3.74
#8 0.1640 36 3.47
#10 0.1900 32 3.35
1/4″ 0.2500 28 2.87
5/16″ 0.3125 24 2.66
3/8″ 0.3750 24 2.18
7/16″ 0.4375 20 2.25
1/2″ 0.5000 20 1.95
9/16″ 0.5625 18 1.92
5/8″ 0.6250 18 1.72
3/4″ 0.7500 16 1.61
7/8″ 0.8750 14 1.57
1″ 1.0000 12 1.61
1-1/8″ 1.1250 12 1.42
1-1/4″ 1.2500 12 1.27
1-3/8″ 1.3750 12 1.15
1-1/2″ 1.5000 12 1.05

The lead angles are all pretty small, even for tiny screws that don’t have much room for threads. And we haven’t even considered extra fine threads.

Here’s the code that produced the tables:

python:
 1:  #!/usr/bin/env python3
 2:  
 3:  import pandas as pd
 4:  import numpy as np
 5:  import sys
 6:  
 7:  # Get the first table from https://www.fastenere.com/blog/bolt-size-chart
 8:  # Don't include the two header rows
 9:  dfBoth = pd.read_html('https://www.fastenere.com/blog/bolt-size-chart', skiprows=2, na_values='---')[0]
10:  
11:  # We don't need any other columns
12:  colnames = 'Size Diameter TPI'.split()
13:  
14:  # Make a table for the coarse threads
15:  dfCoarse = dfBoth.drop(columns=[3, 4, 5, 6, 7]).drop(index=0)
16:  dfCoarse.columns = colnames
17:  dfCoarse['Angle'] = np.round(np.rad2deg(np.atan2(1/dfCoarse.TPI, np.pi*(dfCoarse.Diameter- 0.64951905/dfCoarse.TPI))), decimals=2)
18:  print("US Coarse Threads")
19:  print(dfCoarse)
20:  print()
21:  
22:  # Make a table for the fine threads
23:  dfFine = dfBoth.drop(columns=[2, 3, 4, 6, 7]).drop(index=[21, 22])
24:  dfFine.columns = colnames
25:  dfFine['Angle'] = np.round(np.rad2deg(np.atan2(1/dfFine.TPI, np.pi*(dfFine.Diameter- 0.64951905/dfFine.TPI))), decimals=2)
26:  print("US Fine Threads")
27:  print(dfFine)
28:  print()
29:  

Line 8 is what was new to me. The read_html function reads all the tables on the referenced page (you can also provide a local HTML file or an HTML string wrapped in io.StringIO) and returns a list of data frames. Since the US bolt table is the first one on the page, the list is indexed by [0]. The first two rows in the HTML table are headers, so I used skiprows=2 to keep them out of the data frame; I add my own column names later via Lines 12, 16, and 24. The na_values='---' parameter handles the missing values, which are indicated in the HTML table by three hyphens.

The rest of the code is pretty straightforward. I make the coarse data frame by dropping the columns associated with fine threads, and vice versa. I also drop the columns for area because they don’t matter for what I’m doing. Rows with missing values are dropped, too. The calculation of the lead angle (Lines 17 and 25) is kind of long, but that’s mainly because I wanted the results in degrees instead of radians and rounded to the nearest hundredth of a degree.

The output tables are in Pandas’ native format, which looks like this:

      Size  Diameter   TPI  Angle
1       #2    0.0860  56.0   4.37
2       #4    0.1120  40.0   4.75
3       #5    0.1250  40.0   4.18
4       #6    0.1380  32.0   4.83
5       #8    0.1640  32.0   3.96
6      #10    0.1900  24.0   4.65
7     1/4"    0.2500  20.0   4.18
[etc]

I did some simple rectangular editing in BBEdit to turn this into a Markdown table for posting here:

|   Size | Diameter |  TPI | Angle |
|-------:|---------:|-----:|------:|
|     #2 |   0.0860 |   56 |  4.37 |
|     #4 |   0.1120 |   40 |  4.75 |
|     #5 |   0.1250 |   40 |  4.18 |
|     #6 |   0.1380 |   32 |  4.83 |
|     #8 |   0.1640 |   32 |  3.96 |
|    #10 |   0.1900 |   24 |  4.65 |
|   1/4″ |   0.2500 |   20 |  4.18 |
[etc]

Pandas has a to_markdown function, which is sometimes the best way to go, but in this case that didn’t give the same number of decimal places for all the items in the Diameter column, which ruined the alignment. It was faster to add the pipe characters to the output than change the code to make to_markdown print the way I wanted it to.

Between read_html and Tabula, which extracts tabular data from PDFs,2 I can pull in almost any data I can find on the internet that isn’t already in a convenient format like CSV.


  1. For double-threaded (and other multiply-threaded) bolts, the pitch and lead are not the same, but we’re considering only single-threaded bolts here. 

  2. There’s also tabula-py, a Python wrapper around Tabula. This is a more direct way of getting a PDF table into a data frame, but I’ve always used Tabula itself to make a CSV file, which I then read into a data frame. It’s slightly longer, but it’s always felt safer because it lets me see what I’m doing as I do it.