Section properties and SymPy

I felt a little guilty about this footnote in yesterday’s post:

Yes, the product of inertia integral is definitely more complicated if you’re going to do the derivation by hand. So don’t do it by hand. Learn SymPy and you’ll be able to zip through it.

This is entirely too much like those “it can be easily shown” tricks that math textbook writers use to avoid complicated and unintuitive manipulations. If I’m going to claim you can zip through the product of inertia, I should be able to prove it. So let’s do it.

SymPy comes with the Anaconda Python distribution, and that’s how I installed it. I believe you can get it working with Apple’s system-supplied Python, but Anaconda is so helpful in getting and maintaining a numerical/scientific Python installation, I don’t see why you’d try anything else.

If you’ve ever used a symbolic math program, like Mathematica or Maple, SymPy will seem reasonably familiar to you. My main hangup is the need in SymPy to declare certain variables as symbols before doing any other work. I understand the reason for it—SymPy needs to protect symbols from being evaluated the way regular Python variables are—but I tend to forget to declare all the symbols I need and don’t realize it until an error message appears.

That one personal quirk aside, I find SymPy easy to use for the elementary math I tend to do. The functions I use most often, like diff, integrate, expand, and factor, are easy to remember, so I don’t have to continually look things up in the documentation. And the docs are well-organized when I do have to use them.

The problem we’re going to look at is the solution of this integral for a polygonal area:

[\iint\limits_A xy\; dx dy]

We’ll use Green’s theorem to turn this area integral into a path integral around the polygon’s perimeter:

[\iint\limits_A xy\; dx dy = \oint\limits_C \frac{1}{2} x^2 y\; dy]

For each side of the polygon, from point [(x_i, y_i)] to point [(x_{i+1}, y_{i+1})], the line segment defining the perimeter can be expressed in parametric form,

[x = x_i + (x_{i+1} - x_i)\;t] [y = y_i + (y_{i+1} - y_i)\;t]

which means

[dy = (y_{i+1} - y_i)\; dt]

Now we’re ready to use SymPy to evaluate and simplify the integral for a single line segment. To make the typing go faster as I used SymPy, which I ran interactively in Jupyter console session, I decided to use 0 for subscript [i] and 1 for subscript [i+1]. Here’s a transcript of the session, where I’ve broken up long lines to make it easier to read:

In [1]: from sympy import *

In [2]: x, y, x_0, x_1, y_0, y_1, t = symbols('x y x_0 x_1 y_0 y_1 t')

In [3]: x = x_0 + (x_1 - x_0)*t

In [4]: y = y_0 + (y_1 - y_0)*t

In [5]: full = integrate(x**2*y/2*diff(y, t), (t, 0, 1))

In [6]: full
Out[6]: -x_0**2*y_0**2/8 + x_0**2*y_0*y_1/12 + x_0**2*y_1**2/24
 - x_0*x_1*y_0**2/12 + x_0*x_1*y_1**2/12 - x_1**2*y_0**2/24
 - x_1**2*y_0*y_1/12 + x_1**2*y_1**2/8

In [7]: part = x_0**2*y_0*y_1/12 + x_0**2*y_1**2/24 - x_0*x_1*y_0**2/12
 + x_0*x_1*y_1**2/12 - x_1**2*y_0**2/24 - x_1**2*y_0*y_1/12

In [8]: factor(part)
Out[8]: (x_0*y_1 - x_1*y_0)*(2*x_0*y_0 + x_0*y_1 + x_1*y_0 + 2*x_1*y_1)/24

In [9]: print(latex(_))
\frac{1}{24} \left(x_{0} y_{1} - x_{1} y_{0}\right) \left(2 x_{0} y_{0}
 + x_{0} y_{1} + x_{1} y_{0} + 2 x_{1} y_{1}\right)

We start by importing everything from SymPy and defining all the symbols needed. Then we define the parametric equations of the line segment in In[3] and In[4].

In[5] does a lot of work. We define the integrand inside the integrate function and tell it to integrate that expression over [t] from 0 to 1 (i.e., from [(x_0, y_0)] to [(x_1, y_1)]). Note that we didn’t need to explicitly enter the expressions for [x], [y], or [dy]; SymPy did all the substitution for us, including the differentiation.

I called the result of the integration full because it contains every term of the integration. But we learned in the last post that the leading and trailing terms get cancelled out when we sum over all the segments of the polygon. So I copied just the inner terms from full and pasted them into In[7] to define a new expression, called part.

In[8] then factors part to get a more compact expression, and In[9] converts it to a LaTeX expression, so I can render it nicely here:

[\frac{1}{24} \left(x_{0} y_{1} - x_{1} y_{0}\right) \left(2 x_{0} y_{0} + x_{0} y_{1} + x_{1} y_{0} + 2 x_{1} y_{1}\right)]

With a quick search-and-replace to convert the subscripts to their more general forms, we get the expression presented in the last post (albeit with the terms in a different order):

[\iint\limits_A xy\; dx dy = \frac{1}{24} \sum_{i=0}^{n-1} \left(x_{i} y_{i+1} - x_{i+1} y_{i}\right) \left(2 x_{i} y_{i} + x_{i} y_{i+1} + x_{i+1} y_{i} + 2 x_{i+1} y_{i+1}\right)]

SymPy didn’t do everything for us. We had to figure out the Green’s function transformation and recognize the cancellation of the leading and trailing terms of full. But it did all the boring stuff, which is its real value.

Green’s theorem and section properties

In the last post, I presented a simple Python module with functions for calculating section properties of polygons. Now we’ll go through the derivations of the formulas used in those functions.

The basis for all the formulas is Green’s theorem, which is usually presented something like this:

[\oint\limits_C P\; dx + Q\; dy = \iint\limits_A \left( \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} \right)\; dx dy]

where [P] and [Q] are functions of [x] and [y], [A] is the region over which the right integral is being evaluated, and [C] is the boundary of that region. The integral on the right is evaluated in accordance with the right-hand rule, i.e., counterclockwise for the usual orientation of the [x] and [y] axes.

Region and nomenclature

The section properties of interest are all area integrals. We’ll use Green’s theorem to turn them into boundary integrals and then evaluate those integrals using the coordinates of the polygon’s vertices.


This is the easiest one, but instead of going through the full derivation here, I’ll refer you to this excellent StackExchange page by apnorton and just hit the highlights.

  1. The area is defined

    [A = \iint\limits_Adx dy]

    and we’ll choose [P = 0] and [Q = 1] as our Greens’ theorem functions. This gives us

    [A = \iint\limits_Adx dy = \oint\limits_C x\; dy]
  2. We break the polygonal boundary into a series of straight-line segments, each of which can be parameterized this way:

    [x = x_i + (x_{i+1} - x_i)\;t] [y = y_i + (y_{i+1} - y_i)\;t] [dy = (y_{i+1} - y_1)\; dt]

    where the [(x_i, y_i)] are the coordinates of the vertices.

  3. Plugging these equations into the integral, we get

    [A = \frac{1}{2} \sum_{i=0}^{n-1}\; (x_{i+1} + x_i)(y_{i+1} - y_i)]

A note on the indexing: The polygon has [n] vertices, which we’ll number from 0 to [n-1]. The last segment of the boundary goes from [(x_{n-1}, y_{n-1})] to [(x_0, y_0)]. To make this work with the equation, we’ll define [(x_n, y_n) = (x_0, y_0)].

Let’s compare this with the area function in the module:

def area(pts):
  'Area of cross-section.'

  if pts[0] != pts[-1]:
    pts = pts + pts[:1]
  x = [ c[0] for c in pts ]
  y = [ c[1] for c in pts ]
  s = 0
  for i in range(len(pts) - 1):
    s += x[i]*y[i+1] - x[i+1]*y[i]
  return s/2

We start by checking the pts list to see if the starting and ending items match. If they don’t, we copy the starting item to the end to fit the indexing convention discussed above. We then initialize some variables and execute a loop, summing terms along the way. Rewriting the loop in mathematical terms, we get

[A = \frac{1}{2} \sum_{i=0}^{n-1}\; x_i y_{i+1} - x_{i+1} y_i]

This doesn’t look like the equation derived from Green’s theorem, does it? But it’s not too hard to see that they are equivalent. Expanding out the binomial product in the earlier equation gives

[x_{i+1} y_{i+1} - x_{i+1} y_i + x_i y_{i+1} - x_i y_i]

As we loop through all the values of [i] from 0 to [n-1], the leading term of one trip through the loop will cancel the trailing term of the next trip through the loop. Here’s an example for a triangle:

[\quad (x_1 y_1 - x_1 y_0 + x_0 y_1 - x_0 y_0 )] [+\; (x_2 y_2 - x_2 y_1 + x_1 y_2 - x_1 y_1 )] [+\; (x_0 y_0 - x_0 y_2 + x_2 y_0 - x_2 y_2 )]

After the cancellations, all that’s left are the inner terms, and that’s the formula used in the area function.

The cancellation doesn’t do much for us here, changing from two additions and one multiplication per loop to two multiplications and one addition per loop. But we’ll see this same sort of cancellation in the other section properties, and it will provide greater simplification in those.


The centroid is essentially the average position of the area. If a sheet of material of uniform thickness and density were cut into a shape, the centroid would be the center of gravity, the balance point, of that shape. The coordinates of the centroid are defined this way:

[x_c = \frac{1}{A} \iint\limits_A x\; dx dy] [y_c = \frac{1}{A} \iint\limits_A y\; dx dy]

Let’s derive the formula for [x_c] for a polygon; the derivation of the formula for [y_c] will be similar.

In applying Green’s theorem, we’ll take [P = 0] and [Q = \frac{1}{2} x]. Therefore,

[x_c = \frac{1}{A} \iint\limits_A x\; dx dy = \oint\limits_C \frac{1}{2} x^2\; dy]

Breaking the polygonal boundary into straight-line segments and using the same parametric equations as before, we get an integral that looks like this

[\int_0^1 \frac{1}{2} \left[ (x_i^2 + 2 x_i (x_{i+1} - x_i)\; t + (x_{i+1} - x_i)^2\; t^2\right] (y_{i+1} - y_i)\; dt]

for each segment. This integral evaluates to

[\frac{1}{6} (x_{i+1}^2 + x_{i+1} x_i + x_i^2)\; ( y_{i+1} - y_i)]

so our formula for the centroid is

[x_c = \frac{1}{6A} \sum_{i=0}^{n-1} (x_{i+1}^2 + x_{i+1} x_i + x_i^2)\; ( y_{i+1} - y_i)]

As we found in the formula for area, the leading and trailing terms in the expansion of this product cancel out as we loop through the sum, leaving us with

[x_c = \frac{1}{6A} \sum_{i=0}^{n-1} -y_i x_{i+1}^2 + y_{i+1}x_{i+1} x_i - y_i x_{i+1} x_i + y_{i+1} x_i^2]

This looks like a mess, but it can be factored into a more compact form:

[x_c = \frac{1}{6A} \sum_{i=0}^{n-1} (x_{i+1} + x_i)\;(x_i y_{i+1} - x_{i+1} y_i)]

The expression for the other centroidal coordinate is as you’d expect:

[y_c = \frac{1}{6A} \sum_{i=0}^{n-1} (y_{i+1} + y_i)\;(x_i y_{i+1} - x_{i+1} y_i)]

These are the formulas used in the centroid function.

def centroid(pts):
  'Location of centroid.'

  if pts[0] != pts[-1]:
    pts = pts + pts[:1]
  x = [ c[0] for c in pts ]
  y = [ c[1] for c in pts ]
  sx = sy = 0
  a = area(pts)
  for i in range(len(pts) - 1):
    sx += (x[i] + x[i+1])*(x[i]*y[i+1] - x[i+1]*y[i])
    sy += (y[i] + y[i+1])*(x[i]*y[i+1] - x[i+1]*y[i])
  return sx/(6*a), sy/(6*a)

Moments and product of inertia

You may be familiar with moments and products of inertia from dynamics, where the terms are related to the distribution of mass in a body. The moments and product of inertia we’ll be talking about here—more properly called the second moments of area—are mathematically similar and refer to the distribution of area across a planar shape.

The moments and product of inertia that matter in beam bending are taken about the centroidal axis (i.e., a set of [x] and [y] axes with the origin at the centroid of the shape). Since we don’t know where the centroid is when we set up our coordinate system, our list of vertex points don’t work off that basis. But we can still calculate the centroidal moments and product of inertia by using these formulas:

[I_{xx} = \iint\limits_A (y - y_c)^2\; dx dy] [I_{yy} = \iint\limits _A (x - x_c)^2 \; dx dy] [I_{xy} = \iint\limits_A (y - y_c)\; (x - x_c)\; dx dy]

We’ll concentrate on [I_{yy}]; the other two will be similarly derived.

First, let’s expand the square inside the integral and see what we get:

[I_{yy} = \iint\limits_A x^2\; dx dy - 2x_c \iint\limits_A x\; dx dy + x_c^2 \iint\limits_A dx dy]

The integral in the second term is [A x_c] and the integral in the third term is just [A]. Putting this together, we get1

[I_{yy} = \iint\limits_A x^2 dx dy - A\;x_c^2]

Since we already have formulas for [x] and [x_c], we can concentrate on the integral in the first term on the right.

Returning to Green’s theorem, we’ll use [P = 0] and [Q = \frac{1}{3}x^3], giving us

[\iint\limits_A x^2 dx dy = \oint\limits_C \frac{1}{3}x^3 \; dy]

Once again, we break the polygonal boundary into straight-line segments and use parametric equations to define the segments. For each segment, we’ll get the following integral:

[\int_0^1 \frac{1}{3} \left[ x_i^3 + 3 x_i^2 (x_{i+1} - x_1)\; t + 3 x_i (x_{i+1} - x_1)^2\; t^2 + (x_{i+1} - x_1)^3\; t^3 \right] (y_{i+1} - y_i) \; dt]

This integral evaluates to

[\frac{1}{12} \left[ x_{i+1}^3 + x_{i+1}^2 x_i + x_{i+1} x_i^2 + x_i^3 \right] (y_{i+1} - y_i)]

giving us

[\iint\limits_A x^2 dx dy = \frac{1}{12} \sum_{i=0}^{n-1} \left[ x_{i+1}^3 + x_{i+1}^2 x_i + x_{i+1} x_i^2 + x_i^3 \right] (y_{i+1} - y_i)]

Once again, if we expand out the product inside the sum, we’ll find that the leading and trailing terms cancel as we work through the loop. That gives us

[\frac{1}{12} \sum_{i=0}^{n-1} -y_i x_{i+1}^3 + y_{i+1} x_{i+1}^2 x_i - y_i x_{i+1}^2 x_i + y_{i+1} x_{i+1} x_i^2 - y_i x_{i+1} x_i^2 + y_{i+1} x_i^3]

And that long expression can be factored, leaving

[\iint\limits_A x^2 dx dy = \frac{1}{12} \sum_{i=0}^{n-1} (x_{i+1}^2 + x_{i+1} x_i + x_i^2)\; (x_i y_{i+1} - x_{i+1} y_i)]

Similar2 derivations give us

[\iint\limits_A y^2 dx dy = \frac{1}{12} \sum_{i=0}^{n-1} (y_{i+1}^2 + y_{i+1} y_i + y_i^2)\; (x_i y_{i+1} - x_{i+1} y_i)] [\iint\limits_A xy\; dx dy = \frac{1}{24} \sum_{i=0}^{n-1} (2 x_{i+1} y_{i+1} + x_{i+1} y_i + x_i y_{i+1} + 2 x_i y_i)\; (x_i y_{i+1} - x_{i+1} y_i)]

These formulas, and the terms accounting for the location of the centroid, are in the function inertia.

def inertia(pts):
  'Moments and product of inertia about centroid.'

  if pts[0] != pts[-1]:
    pts = pts + pts[:1]
  x = [ c[0] for c in pts ]
  y = [ c[1] for c in pts ]
  sxx = syy = sxy = 0
  a = area(pts)
  cx, cy = centroid(pts)
  for i in range(len(pts) - 1):
    sxx += (y[i]**2 + y[i]*y[i+1] + y[i+1]**2)*(x[i]*y[i+1] - x[i+1]*y[i])
    syy += (x[i]**2 + x[i]*x[i+1] + x[i+1]**2)*(x[i]*y[i+1] - x[i+1]*y[i])
    sxy += (x[i]*y[i+1] + 2*x[i]*y[i] + 2*x[i+1]*y[i+1] + x[i+1]*y[i])*(x[i]*y[i+1] - x[i+1]*y[i])
  return sxx/12 - a*cy**2, syy/12 - a*cx**2, sxy/24 - a*cx*cy

This older post explains the use of the moment of inertia in beam bending, but I avoided the trickier bits associated with the product of inertia and principal axes. We’ll cover them in the next post.

  1. Yes, this is the parallel axis theorem

  2. Yes, the product of inertia integral is definitely more complicated if you’re going to do the derivation by hand. So don’t do it by hand. Learn SymPy and you’ll be able to zip through it. 

Python module for section properties

A lot of what I do at work involves analyzing the bending of beams, and that means using properties of the beams’ cross sections. The properties of greatest importance are the area, the location of the centroid, and the moments of inertia. Most of the time, I can just look these properties up in a handbook, as I did in this post, or combine the properties of a few well-known shapes. Recently, though, I needed the section properties of an oddball shape, and my handbooks failed me.

In the past, I would open a commercial program that had a section properties module, draw in the shape, and copy out the results. But my partners and I stopped paying the license for that program several years ago, so that wasn’t an option anymore. I decided to write a Python module to do the calculations and draw the cross-section.

If the cross section is a polygon, there are formulas for calculating the section properties from the coordinates of the vertices. Most of the formulas are on the aforelinked Wikipedia pages and on this very nice page from Paul Bourke of the University of Western Australia. I’ll explain how and why the formulas work in a later post; for now, we’ll just accept them. For cross sections that aren’t polygons, we can create a close approximation by fitting a series of short straight lines to any boundary curve.

Here’s the source code of the module, which I call

  1:  import matplotlib.pyplot as plt
  2:  from math import atan2, sin, cos, sqrt, pi, degrees
  4:  def area(pts):
  5:    'Area of cross-section.'
  7:    if pts[0] != pts[-1]:
  8:      pts = pts + pts[:1]
  9:    x = [ c[0] for c in pts ]
 10:    y = [ c[1] for c in pts ]
 11:    s = 0
 12:    for i in range(len(pts) - 1):
 13:      s += x[i]*y[i+1] - x[i+1]*y[i]
 14:    return s/2
 17:  def centroid(pts):
 18:    'Location of centroid.'
 20:    if pts[0] != pts[-1]:
 21:      pts = pts + pts[:1]
 22:    x = [ c[0] for c in pts ]
 23:    y = [ c[1] for c in pts ]
 24:    sx = sy = 0
 25:    a = area(pts)
 26:    for i in range(len(pts) - 1):
 27:      sx += (x[i] + x[i+1])*(x[i]*y[i+1] - x[i+1]*y[i])
 28:      sy += (y[i] + y[i+1])*(x[i]*y[i+1] - x[i+1]*y[i])
 29:    return sx/(6*a), sy/(6*a)
 32:  def inertia(pts):
 33:    'Moments and product of inertia about centroid.'
 35:    if pts[0] != pts[-1]:
 36:      pts = pts + pts[:1]
 37:    x = [ c[0] for c in pts ]
 38:    y = [ c[1] for c in pts ]
 39:    sxx = syy = sxy = 0
 40:    a = area(pts)
 41:    cx, cy = centroid(pts)
 42:    for i in range(len(pts) - 1):
 43:      sxx += (y[i]**2 + y[i]*y[i+1] + y[i+1]**2)*(x[i]*y[i+1] - x[i+1]*y[i])
 44:      syy += (x[i]**2 + x[i]*x[i+1] + x[i+1]**2)*(x[i]*y[i+1] - x[i+1]*y[i])
 45:      sxy += (x[i]*y[i+1] + 2*x[i]*y[i] + 2*x[i+1]*y[i+1] + x[i+1]*y[i])*(x[i]*y[i+1] - x[i+1]*y[i])
 46:    return sxx/12 - a*cy**2, syy/12 - a*cx**2, sxy/24 - a*cx*cy
 48:  def principal(Ixx, Iyy, Ixy):
 49:    'Principal moments of inertia and orientation.'
 51:    avg = (Ixx + Iyy)/2
 52:    diff = (Ixx - Iyy)/2      # signed
 53:    I1 = avg + sqrt(diff**2 + Ixy**2)
 54:    I2 = avg - sqrt(diff**2 + Ixy**2)
 55:    theta = atan2(Ixy, diff)/2
 56:    if Ixy <= 0:
 57:      theta = abs(theta)
 58:    else:
 59:      theta = -abs(theta)
 60:    return I1, I2, theta
 63:  def summary(pts):
 64:    'Text summary of cross-sectional properties.'
 66:    a = area(pts)
 67:    cx, cy = centroid(pts)
 68:    Ixx, Iyy, Ixy = inertia(pts)
 69:    I1, I2, theta = principal(Ixx, Iyy, Ixy)
 70:    summ = """Area
 71:    A = {}
 72:  Centroid
 73:    cx = {}
 74:    cy = {}
 75:  Moments and product of inertia
 76:    Ixx = {}
 77:    Iyy = {}
 78:    Ixy = {}
 79:  Principal moments of inertia and direction
 80:    I1 = {}
 81:    I2 = {}
 82:    θ︎ = {}°""".format(a, cx, cy, Ixx, Iyy, Ixy, I1, I2, degrees(theta))
 83:    return summ
 86:  def outline(pts, basename='section', format='pdf', size=(8, 8), dpi=100):
 87:    'Draw an outline of the cross-section with centroid and principal axes.'
 89:    if pts[0] != pts[-1]:
 90:      pts = pts + pts[:1]
 91:    x = [ c[0] for c in pts ]
 92:    y = [ c[1] for c in pts ]
 94:    # Get the bounds of the cross-section
 95:    minx = min(x)
 96:    maxx = max(x)
 97:    miny = min(y)
 98:    maxy = max(y)
100:    # Whitespace border is 5% of the larger dimension
101:    b = .05*max(maxx - minx, maxy - miny)
103:    # Get the properties needed for the centroid and principal axes
104:    cx, cy = centroid(pts)
105:    i = inertia(pts)
106:    p = principal(*i)
108:    # Principal axes extend 10% of the minimum dimension from the centroid
109:    length = min(maxx-minx, maxy-miny)/10
110:    a1x = [cx - length*cos(p[2]), cx + length*cos(p[2])]
111:    a1y = [cy - length*sin(p[2]), cy + length*sin(p[2])]
112:    a2x = [cx - length*cos(p[2] + pi/2), cx + length*cos(p[2] + pi/2)]
113:    a2y = [cy - length*sin(p[2] + pi/2), cy + length*sin(p[2] + pi/2)]
115:    # Plot and save
116:    # Axis colors chosen from
117:    fig, ax = plt.subplots(figsize=size)
118:    ax.plot(x, y, 'k*-', lw=2)
119:    ax.plot(a1x, a1y, '-', color='#0072B2', lw=2)     # blue
120:    ax.plot(a2x, a2y, '-', color='#D55E00')           # vermillion
121:    ax.plot(cx, cy, 'ko', mec='k')
122:    ax.set_aspect('equal')
123:    plt.xlim(xmin=minx-b, xmax=maxx+b)
124:    plt.ylim(ymin=miny-b, ymax=maxy+b)
125:    filename = basename + '.' + format
126:    plt.savefig(filename, format=format, dpi=dpi)
127:    plt.close()

The key data structure is a list of tuples,1 which represent all of the vertices of the polygon. Each tuple is a pair of (x, y) coordinates for a vertex, and the list must be arranged so the vertices are in consecutive clockwise order. This ordering is the result of Green’s theorem, which is the source of the formulas.2

Here’s a brief example of using the module:

1:  #!/usr/bin/env python
3:  from section import summary, outline
5:  shape = [(0, 0), (5, 0), (5, 1), (3.125, 1), (2.125, 3), (0.875, 3), (1.875, 1), (0, 1)]
6:  print(summary(shape))
7:  outline(shape, 'skewed', format='png', size=(8, 6))

Line 5 defines the vertices of the shape. The printed output from Line 6 is

  A = 7.5
  cx = 2.3333333333333335
  cy = 1.0
Moments and product of inertia
  Ixx = 5.0
  Iyy = 11.367187499999993
  Ixy = -1.666666666666666
Principal moments of inertia and direction
  I1 = 11.77706657483349
  I2 = 4.590120925166502
  θ︎ = 76.18358042418826°

and the PNG file created from Line 7, named skewed.png, looks like this


As you might expect, the x-axis is horizontal and the y-axis is vertical. In addition to the shape itself, the outline function also plots the centroid as a black dot and the orientation of the principal axes. The major axis is the thicker bluish line and the minor axis is the thinner reddish line.

The outline function is the most interesting, in that it isn’t just the transliteration of a formula into Python. Lines 95–98 extract the extreme x and y values, and Line 101 calculates the size of a whitespace border (5% of the larger dimension) to keep the frame of the plot a reasonable distance away from the shape. This also makes it easy to crop the drawing to omit the frame. The ends of the principal axes are calculated in Lines 108–113 to make their lengths 20% of the smaller dimension; the idea is to make them long enough to see but not so long as to be distracting.

As noted in the comments, I chose the axis colors from a colorblind-safe palette given by Martin Krzywinski on this page. He got the palette from a paper by Bang Wong that I didn’t feel like paying $59 for (my scholarship has its limits). To better emphasize which is the major principal axis, I made it thicker.

Mechanical and civil engineers learn how to calculate section properties early on in their undergraduate curriculum, so it’s not a particularly difficult topic, but there is a surprising depth to it. Enough depth that I plan to milk it for three more posts, which I’ll link to here when they’re done.

  1. Strictly speaking, any data structure that indexes like a list of tuples—a list of lists, for example—would work just as well, but because coordinates are usually given as parenthesized pairs, a list of tuples seemed the most natural. 

  2. As promised, we’ll get to the derivation of the formulas in a later post, but if you want a taste, here’s a good derivation of the area formula by apnorton. 

A small hint of big data

Shortly before Christmas, I got a few gigabytes of test data from a client and had to make sense of it. The first step was being able to read it.

The data came from a series of sensors installed in the some equipment manufactured by the client but owned by one of its customers. It was the customer who had collected the data, and precise information about it was limited at best. Basically, all I knew going in was that I had a handful of very large files, most of them about half a gigabyte, and that they were almost certainly text files of some sort.

One of the files was much smaller than the other, only about 50 MB. I decided to start there and opened it in BBEdit, which took a little time to suck it all in but handled it flawlessly. Scrolling through it, I learned that the first several dozen lines described the data that was being collected and the units of that data. At the end of the header section was a line with just the string


and after that came line after line of numbers. Each line was about 250 characters long and used DOS-style CRLF line endings. All the fields were numeric and were separated by single spaces. The timestamp field for each data record looked like a floating point number, but after some review, I came to understand that it was an encoding of the clock time in hhmmss.ssss format. This also explained why the files were so big: the records were 0.002 seconds apart, meaning the data had been collected at 500 Hz, much faster than was necessary for the type of information being gathered.

Anyway, despite its excessive volume, the data seemed pretty straightforward, a simple format that I could do a little editing of to get it into shape for importing into Pandas. So I confidently right-clicked one of the larger files to open it in BBEdit, figuring I’d see the same thing. But BBEdit wouldn’t open it.

BBEdit error message

As the computer I was using has 32 GB of RAM, physical memory didn’t seem like the cause of this error. I had never before run into a text file that BBEdit couldn’t handle, but then I’d never tried to open a 500+ MB file before. I don’t blame BBEdit for the failure—data files like this aren’t what it was designed to edit—but it was surprising. I had to come up with Plan B.

Plan B started with running head -100 on the files to make sure they were all formatted the same way. I learned that although the lengths of the header sections were different, they were collecting same type of data and using the same space-separated format for the data itself. Also, in each file the header and data were separated by a [data] line.

The next step was stripping out the header lines and transforming the data into CSV format. Pandas can certainly read space-separated data, but I figured that as long as I had to do some editing of the files, I might as well put them into a form that lots of software can read. I considered using a pipeline of standard Unix utilities and maybe Perl to do the transformation, but settled on a writing a Python script. Even though such a script was likely to be longer than the equivalent pipeline, my familiarity with Python would make it easier to write.

Here’s the script:

 1: #!/usr/bin/env python
 3: import sys
 5: f = open(sys.argv[1], 'r')
 6: for line in f:
 7:   if line.rstrip() == '[data]':
 8:     break
10: print 'sats,time,lat,long,velocity,heading,height,vert_vel,eng_outP,valv_inP,valv_outP,supplyP,event1,T1,flow1,flow2,eng_outT,valv_inT,supplyT,valv_outT,eng_rpm'
12: for line in f:
13:   print line.rstrip().replace(' ', ',')

(You can see from the print commands that this was done back before I switched to Python 3.)

The script, data2csv, was run from the command line like this for each data file in turn:

data2csv file01.dat > file01.csv

The script takes advantage of the way Python iterates through an open file line-by-line, keeping track of where it left off. The first loop, Lines 6–8, runs through the header lines, doing nothing and then breaking out of the loop when the [data] line is encountered.

Line 10 prints a CSV header line of my own devising. This information was in the original file, but its field names weren’t useful, so it made more sense for me to create my own.

Finally, the loop in Lines 12–13 picks up the file iteration where the previous loop left off and runs through to the end of the file, stripping off the DOS-style line endings and replacing the spaces with commas before printing each line in turn.

Even on my old 2012 iMac, this script took less than five seconds to process the large files, generating CSV files with over two million lines.

I realize my paltry half-gigabyte files don’t really qualify as big data, but they were big to me. I’m usually not foolish enough to run high frequency data collection processes on low frequency equipment for long periods of time. Since the usual definition of big data is something like “too voluminous for traditional software to handle,” and my traditional software is BBEdit, this data set fit the definition for me.