Pandas and HTML tables
May 1, 2025 at 10:43 AM by Dr. Drang
The Talk Python To Me podcast is one that underscores the value of Castro’s two-stage management system. Unlike, say, 99% Invisible and Upgrade, Talk Python is a show I don’t listen to every week because many of its topics are just too far afield from my interests. So its episodes go into my Inbox for vetting instead of directly into my Queue. The latest episode, about PyArrow and its movement into the internals of Pandas, is one that I immediately promoted to the Queue.
But this post isn’t about Castro or PyArrow (inverted pyramid? what’s that?). It’s a feature of Pandas that was mentioned offhandedly by Michael Kennedy—Pandas can take the URL of a web page and turn the tables on that page into data frames. I didn’t know about this and wanted to try it out immediately.
After writing these two posts on bolts, I wondered about how the lead angle of bolt threads changes with bolt size. I didn’t do anything about it at the time, but now I figured it would make a good test case, as there are plenty of web pages out there with tables of bolt dimensions. I chose this one from bolt supplier Fastenere. There are two tables on the page, one for US dimensions and the other for metric. Here’s a screenshot of the US table:
I wanted to pull this table into a data frame, create two new data frames, one for coarse threads and the other for fine threads, and then add a column for lead angle to each data frame.
The lead angle is a bit of geometry that comes from unrolling the screw thread at the pitch diameter, :
The pitch diameter isn’t given in the table, but it’s easily calculated from
where is the major diameter in the table and is the pitch or lead,1 which is the inverse of the “threads per inch” value. This formula comes from ANSI Standard ASME B1.1, Unified Inch Screw Threads.
Here are the results for coarse threads,
Size | Diameter | TPI | Angle |
---|---|---|---|
#2 | 0.0860 | 56 | 4.37 |
#4 | 0.1120 | 40 | 4.75 |
#5 | 0.1250 | 40 | 4.18 |
#6 | 0.1380 | 32 | 4.83 |
#8 | 0.1640 | 32 | 3.96 |
#10 | 0.1900 | 24 | 4.65 |
1/4″ | 0.2500 | 20 | 4.18 |
5/16″ | 0.3125 | 18 | 3.66 |
3/8″ | 0.3750 | 16 | 3.40 |
7/16″ | 0.4375 | 14 | 3.33 |
1/2″ | 0.5000 | 13 | 3.11 |
9/16″ | 0.5625 | 12 | 2.99 |
5/8″ | 0.6250 | 11 | 2.93 |
3/4″ | 0.7500 | 10 | 2.66 |
7/8″ | 0.8750 | 9 | 2.52 |
1″ | 1.0000 | 8 | 2.48 |
1-1/8″ | 1.1250 | 7 | 2.52 |
1-1/4″ | 1.2500 | 7 | 2.25 |
1-3/8″ | 1.3750 | 6 | 2.40 |
1-1/2″ | 1.5000 | 6 | 2.18 |
1-3/4″ | 1.7500 | 5 | 2.25 |
2″ | 2.0000 | 4.5 | 2.18 |
The diameter is given in inches and the angle is in degrees. Similarly, here are the results for fine threads:
Size | Diameter | TPI | Angle |
---|---|---|---|
#0 | 0.0600 | 80 | 4.39 |
#2 | 0.0860 | 64 | 3.75 |
#4 | 0.1120 | 48 | 3.85 |
#5 | 0.1250 | 44 | 3.75 |
#6 | 0.1380 | 40 | 3.74 |
#8 | 0.1640 | 36 | 3.47 |
#10 | 0.1900 | 32 | 3.35 |
1/4″ | 0.2500 | 28 | 2.87 |
5/16″ | 0.3125 | 24 | 2.66 |
3/8″ | 0.3750 | 24 | 2.18 |
7/16″ | 0.4375 | 20 | 2.25 |
1/2″ | 0.5000 | 20 | 1.95 |
9/16″ | 0.5625 | 18 | 1.92 |
5/8″ | 0.6250 | 18 | 1.72 |
3/4″ | 0.7500 | 16 | 1.61 |
7/8″ | 0.8750 | 14 | 1.57 |
1″ | 1.0000 | 12 | 1.61 |
1-1/8″ | 1.1250 | 12 | 1.42 |
1-1/4″ | 1.2500 | 12 | 1.27 |
1-3/8″ | 1.3750 | 12 | 1.15 |
1-1/2″ | 1.5000 | 12 | 1.05 |
The lead angles are all pretty small, even for tiny screws that don’t have much room for threads. And we haven’t even considered extra fine threads.
Here’s the code that produced the tables:
python:
1: #!/usr/bin/env python3
2:
3: import pandas as pd
4: import numpy as np
5: import sys
6:
7: # Get the first table from https://www.fastenere.com/blog/bolt-size-chart
8: # Don't include the two header rows
9: dfBoth = pd.read_html('https://www.fastenere.com/blog/bolt-size-chart', skiprows=2, na_values='---')[0]
10:
11: # We don't need any other columns
12: colnames = 'Size Diameter TPI'.split()
13:
14: # Make a table for the coarse threads
15: dfCoarse = dfBoth.drop(columns=[3, 4, 5, 6, 7]).drop(index=0)
16: dfCoarse.columns = colnames
17: dfCoarse['Angle'] = np.round(np.rad2deg(np.atan2(1/dfCoarse.TPI, np.pi*(dfCoarse.Diameter- 0.64951905/dfCoarse.TPI))), decimals=2)
18: print("US Coarse Threads")
19: print(dfCoarse)
20: print()
21:
22: # Make a table for the fine threads
23: dfFine = dfBoth.drop(columns=[2, 3, 4, 6, 7]).drop(index=[21, 22])
24: dfFine.columns = colnames
25: dfFine['Angle'] = np.round(np.rad2deg(np.atan2(1/dfFine.TPI, np.pi*(dfFine.Diameter- 0.64951905/dfFine.TPI))), decimals=2)
26: print("US Fine Threads")
27: print(dfFine)
28: print()
29:
Line 8 is what was new to me. The read_html
function reads all the tables on the referenced page (you can also provide a local HTML file or an HTML string wrapped in io.StringIO
) and returns a list of data frames. Since the US bolt table is the first one on the page, the list is indexed by [0]
. The first two rows in the HTML table are headers, so I used skiprows=2
to keep them out of the data frame; I add my own column names later via Lines 12, 16, and 24. The na_values='---'
parameter handles the missing values, which are indicated in the HTML table by three hyphens.
The rest of the code is pretty straightforward. I make the coarse data frame by dropping the columns associated with fine threads, and vice versa. I also drop the columns for area because they don’t matter for what I’m doing. Rows with missing values are dropped, too. The calculation of the lead angle (Lines 17 and 25) is kind of long, but that’s mainly because I wanted the results in degrees instead of radians and rounded to the nearest hundredth of a degree.
The output tables are in Pandas’ native format, which looks like this:
Size Diameter TPI Angle
1 #2 0.0860 56.0 4.37
2 #4 0.1120 40.0 4.75
3 #5 0.1250 40.0 4.18
4 #6 0.1380 32.0 4.83
5 #8 0.1640 32.0 3.96
6 #10 0.1900 24.0 4.65
7 1/4" 0.2500 20.0 4.18
[etc]
I did some simple rectangular editing in BBEdit to turn this into a Markdown table for posting here:
| Size | Diameter | TPI | Angle |
|-------:|---------:|-----:|------:|
| #2 | 0.0860 | 56 | 4.37 |
| #4 | 0.1120 | 40 | 4.75 |
| #5 | 0.1250 | 40 | 4.18 |
| #6 | 0.1380 | 32 | 4.83 |
| #8 | 0.1640 | 32 | 3.96 |
| #10 | 0.1900 | 24 | 4.65 |
| 1/4″ | 0.2500 | 20 | 4.18 |
[etc]
Pandas has a to_markdown
function, which is sometimes the best way to go, but in this case that didn’t give the same number of decimal places for all the items in the Diameter column, which ruined the alignment. It was faster to add the pipe characters to the output than change the code to make to_markdown
print the way I wanted it to.
Between read_html
and Tabula, which extracts tabular data from PDFs,2 I can pull in almost any data I can find on the internet that isn’t already in a convenient format like CSV.
-
For double-threaded (and other multiply-threaded) bolts, the pitch and lead are not the same, but we’re considering only single-threaded bolts here. ↩
-
There’s also
tabula-py
, a Python wrapper around Tabula. This is a more direct way of getting a PDF table into a data frame, but I’ve always used Tabula itself to make a CSV file, which I then read into a data frame. It’s slightly longer, but it’s always felt safer because it lets me see what I’m doing as I do it. ↩