A delightfully simple chart

As I was putting together last night’s post, I realized I had forgotten to write about this MacStories post from last week. It’s Graham Spencer’s writeup of what he found while digging into Apple’s Top Grossing charts in the App Store. The post contains several examples of a simple and elegant type of graph I seldom see.

And I do mean simple. Here’s one example:

MacStories Game Position

Image from MacStories.

It’s nothing more than a strip of alternating colors, but at a glance it gives you a sense of both the proportion and the distribution of games among the top 200 apps.

This isn’t the kind of chart that’ll draw attention from the web’s dataviz whiz kids—it doesn’t even use this week’s hot new JavaScript library! All it does is communicate directly and effectively. Thank you, Graham.


Obligatory Apple sales post

Apple’s 2016-Q3 numbers came out a couple of days ago, and there was the usual orgy of chart making and hot takes. MacStories and Six Colors are in a tight competition for the hottest takes and the most orgiastic charts. Because it’s two days after the fact, my takes are lukewarm at best, but nobody’s charts are as compact or as handsomely branded as mine.

Apple sales

The dots are the raw quarterly data and the lines are the now de rigueur trailing four-quarter moving averages. The abscissa is the regular calendar, not Apple’s off-by-three-months fiscal calendar: Q2 ends (and is plotted) on the last Saturday in March, Q3 ends on the last Saturday in June, Q4 is on the last Saturday in September, and Q1 ends on the last Saturday in December of the previous year. I plot it this way because I think the real calendar is easier to understand—and I like to be annoying.

Let’s start with the Mac, where the lackluster sales are a perfect reflection of the effort Apple’s been putting into it recently. Among notebooks, only the weird and underpowered Retina Macbook has gotten any love recently. In the desktop world, the iMac had a leap in quality when it went Retina, but that was almost two years ago, and the less said about the Pro and the Mini the better.

As for the iPad, its unit sales are still decreasing, but its revenue is increasing thanks to the more expensive iPad Pro. Jason Snell has a nice chart of average selling price that shows a distinct jump up for the iPad line this past quarter. The ASP is 18% up year-over-year, which is how a 9% decline in unit sales becomes a 7% increase in revenue.

(I’m compelled to say that I was disappointed to read this tweet from Federico Viticci and this post from John Gruber. Both touted the revenue increase without mentioning the continued unit sales decrease. I know they both linked to the full story, but the election year has made me weary of obvious half-truths.)

Unlike with the Mac, Apple has truly improved the iPad hardware over the past few iterations. It must be terribly frustrating for that hard work to have such small returns. But if you squint, you can see that sales are maybe/possibly/conceivably starting to level off from their two-year decline. When people talk about sales “leveling off” it’s usually a bad sign, but not in this case.

Finally, the iPhone. Down substantially from Q3 of 2015 (8.5% in units, 7.7% in revenue), but as I said back in January, 2015 was a tremendous year for iPhone sales because of the pent-up demand for a larger phone. A better comparison (and Jason made this same point) is to look at the trend from the years before and skip over the iPhone 6.

Here’s the same chart, but with the raw figures for Q1, Q2, and Q3 highlighted with diamonds, squares, and circles, respectively. I don’t go back further than the iPhone 4S because that’s when the current fall release schedule began.

Apple sales highlighted

Looking only at the highlighted sales, we see that 2016 looks like a reasonable continuation of the years before the iPhone 6. Q1 (diamond) is higher than we might expect, Q2 and Q3 are perhaps a bit lower, but nothing is significantly out of whack. You’d get the same impression if you extended the moving average line from the trend before the kink at the end of 2014.

Is this a Panglossian view? Maybe. But simply looking at year-over-year data is too pessimistic. We’ll have plenty of opportunity to declare Apple dead if the iPhone 7 tanks.

Update 07/29/2016 5:35 AM
The colors in the plot were originally the standard green, blue, and red from Matplotlib’s standard palette. I had intended to change the script that makes the chart to use a palette that’s more friendly to color blind readers, but I forgot. Now it’s fixed, and any further iterations of the chart will use the new colors.

The section of the script that plots the raw data and moving averages now looks like this:

python:
ax.plot(macDates, macMA, '-', color='#1b9e77', linewidth=3, label='Mac')
ax.plot(macDates, macRaw, '.', color='#1b9e77')
ax.plot(phoneDates, phoneMA, '-', color='#7570b3', linewidth=3, label='iPhone')
ax.plot(phoneDates, phoneRaw, '.', color='#7570b3')
ax.plot(padDates, padMA, '-', color='#d95f02', linewidth=3, label='iPad')
ax.plot(padDates, padRaw, '.', color='#d95f02')

The new colors came from Cynthia Brewer’s online color picker.


Table munging followup

The best thing about writing posts like last night’s is what I learn from the people who read it. Let’s do a little followup.

First, I can always count on Aristotle Pagaltzis to improve my Perl. His suggestion for the one-liner that turns the data copied from the PDF table into a TSV,

perl -pe 's/\n/\t/ if $. % 21'

is much cleaner than my clunky solution.

Similarly, Nathan Grigg improved my Python by pointing out that enumerate can take an optional start parameter, so the index term doesn’t have to start at 0. In my original Python script, I could’ve written

python:
for i, b in enumerate(a, start=1):
  if i % 21 > 0:

instead of

python:
for i, b in enumerate(a):
  if (i + 1) % 21 > 0:

making the script just a little easier to understand. The start parameter was added in Python 2.6, which is probably why I didn’t know about it. I’d been using Python so long by the time 2.6 came along, I thought I knew everything there was to know about enumerate. A cautionary tale.

The best tip I got was this one:

.@drdrang Perhaps Tabula would help for extracting tables from PDFs. I’ve used it with great success before: tabula.technology
Wired-up Wrong (@BrokenWiring) Jul 24 2016 11:16 PM

Tabula really is a wonderful tool for extracting data from tables in PDFs. It’s a locally hosted web app1 that allows you to

  1. Select one or more PDFs with the data you want.
  2. Identify the area of the page from which to extract the data.
  3. Save the data in CSV, TSV, or JSON format.

I gave Tabula a try on the same PDF tables I wrote about last night, and it worked perfectly. You may recall that I didn’t like the column headings in the original table. Well, Tabula let me drag a rectangle to select just the data portion of the table, leaving the stuff I didn’t want out of the extracted CSV file.

Tabula table selection

I can imagine using this same technique to grab just a few rows or a few columns of data from a large table.

The Tabula people don’t oversell its capabilities. They recognize that some table layouts give its parser fits, and they offer suggestions to work around its limitations. I found it to work very quickly and accurately and look forward to using it more. Thanks to both Wired-up Wrong and Bill Eccles for suggesting it.

David Emmons suggested using PDFpen Pro’s Export to Excel feature, which is similar to Tabula in that it scans the PDF and tries to parse the table layout. Although I own PDFpen Pro, I hadn’t tried that, mainly because my few experiences with its Export to Word had been disappointing. But after David’s tweet, I decided to give it a try.

At first, things didn’t go well. The exported files raised error messages when opened in Excel and were littered with extra columns. But this was when I let PDFpen Pro try to parse the entire table. I then used the rectangle selection tool to crop the table down to just the data portion. The files created after that were perfect.

Still, I don’t actually want an Excel file—it’s only a means to an end. And I especially don’t want to see stupid warning messages like this when I quit an application after copying no more than a few hundred cells of a spreadsheet.

Excel clipboard warning

What I really want is a CSV or TSV, and because Tabula can give me those directly, it’s the tool I’ll be using.


  1. What this means is that Tabula starts up a web server on your computer and you interact with it through a browser. While you might think this leads to an awkward user interface, it actually works quite smoothly and allows Tabula to have a consistent user interface on all platforms. ↩︎


A little data munging

For the past couple of weeks, I’ve been doing analyses of other people’s data. Usually, the data comes in the form of a spreadsheet, which works out fine. I don’t like using spreadsheets myself, but it’s easy to extract the data and put it in a CSV or TSV file, which I can then manipulate in Python.

Sometimes, though, what I get is a PDF of someone else’s report, with the data in one or more tables. This is more challenging. If the PDF was generated directly from the writer’s word processor, the text in the tables is selectable, and I can copy and paste it into a plain text document. Unfortunately, the columns often aren’t delimited the way I’d like.

Here’s part of a multipage table I had to analyze. It has 21 columns and, after combining all the pages, about 100 rows. I opened the PDF in Preview, selected the text, and copied it.

Selection from PDF table

I crossed my fingers and pasted it into a new BBEdit document. The columns were lost entirely; every table cell appeared on its own line. But luckily there was order to it, as the data came in row by row. In other words, something like this

Col 1 Col 2 Col 3 Col 4 Col 5
a b c d e
f g h i j
k l m n o

was turned into this (I didn’t select the column headings because I didn’t like the names given in the table).

a
b
c
d
e
f
g
h
i
j
k
l
m
n
o

At least it would have turned into that if it weren’t for the missing data. If you look back at the image of the table, you’ll see that lots of cells are empty. What that meant was the table looked more like this

Col 1 Col 2 Col 3 Col 4 Col 5
a b e
g h
k n o

and turned into this after copying and pasting into BBEdit.

a
b
e
g
h
k
n
o

I couldn’t think of an automated way to add the missing lines, so I did it by hand, adding empty lines to the BBEdit document where needed.

a
b


e

g
h


k


n
o

While I’m pretty sure pandas has a way of reading a file like this and putting it into a data frame, I’m not fluent enough in pandas to know how to do it without poring over the documentation. And because I’d just spent a good chunk of time adding blank lines to the file, I wanted to get the analysis going right away. So I wrote this little script to read in the text file and spit out a 21-column TSV.

python:
 1:  #!/usr/bin/env python
 2:  
 3:  from sys import stdout
 4:  
 5:  with open('table.txt') as f:
 6:    a = f.readlines()
 7:  
 8:  for i, b in enumerate(a):
 9:    if (i + 1) % 21 > 0:
10:      stdout.write(b.replace('\n', '\t'))
11:    else:
12:      stdout.write(b)

The rows are 21 items long, so we want to turn the line feeds into tab characters except every 21st one. The only tricky part is recognizing that the array of lines, a, is 0-indexed, and the modulo operator used to determine where a row stops must be applied to line numbers that start with 1, not 0. Hence the (i + 1) % 21 in Line 9. After running the script, I added a row of column names to the top of the file and had a nice TSV file for importing.

This script is, I realize, both longer than necessary and uses more memory than necessary. Maybe I should’ve used a Perl one-liner:

perl -ne 'chomp; if ($.%21) {print "$_\t"} else {print "$_\n"}' table.txt

This doesn’t slurp in the entire file, and because the line number variable, $., starts counting at 1, there’s no off-by-one trickiness to worry about. An awk one-liner would be similar.

But I wasn’t in a Perl or awk frame of mind when I did this. I’d been working in Python, so that’s what I used. And there are better ways of writing this in Python, too. As I sit here drinking tea and not trying to finish a paying job, I imagine the fileinput module would be useful.

python:
 1:  #!/usr/bin/env python
 2:  
 3:  from fileinput import input, lineno
 4:  from sys import stdout, argv
 5:  
 6:  columns = int(argv[1])
 7:  del argv[1]
 8:  for line in input():
 9:    if lineno() % columns:
10:      stdout.write(line.replace('\n', '\t'))
11:    else:
12:      stdout.write(line)

That gets rid of both the slurping and the off-by-one correction. And it doesn’t have the name of the file or the number of columns in the table hard-wired into the code. It can be called this way:

shape-table 21 table.txt

Now I have something I can use again the next time I need to extract a table of data from a PDF. Which pretty much guarantees I’ll never run into this kind of problem again.