DXF data extraction

This is going to be one of those posts that are more valuable to me than they are to you. Maybe you’ll find something in it you can translate to your own needs.

This week I needed to pull the coordinates of a bunch of points out of a 3D laser scan of a structure. The guy I was working with knew what I wanted, but the software that came with the scanner was fighting him and wouldn’t just cough up a nice comma-separated or tab-separated values file. I would’ve been satisfied with an Excel or HDF file, too. Anything Pandas knows how to read. But it wasn’t to be.

So we worked out another plan. Although the laser scan returned a huge cloud of points, all I needed were the coordinates of points along a few hundred lines. My colleague knew how to collect those points and output them in a DXF file. I suspected I could write a script that would parse the DXF and get the coordinates. He went to work getting me the DXF file while I gave myself a crash course on the DXF format.

The DXF format was created by Autodesk for exchanging AutoCAD files. What I knew going in, and what made me confident I could write the script I needed, was that it’s a plain text format, something I deal with all the time. Autodesk has published many reference manuals on the format and has a nice webified manual for one of the older versions.

In a nutshell, the DXF format is a series of sections, each of which contains several key/value pairs on successive lines. The key is always a numeric code, and the key that starts a section is 0. For the DXF file made for me, the sections that had the data I wanted were VERTEX sections, which looked like this:


The type of section is the value associated with the 0 key, and the x, y, and z coordinates are the values associated with the 10, 20, and 30 keys, respectively.
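To make the pairing concrete, here’s a small sketch that reads one record and pulls out its type and coordinates. The record is made up for illustration: the layer name and the numbers are not from my file. It’s written for Python 3, unlike the script further down.

```python
# A made-up VERTEX record: group codes and values alternate, one per line.
# The layer name (code 8) and the coordinates are invented for illustration.
record = """0
VERTEX
8
SCAN_LINES
10
12.345
20
67.890
30
1.234"""

lines = record.splitlines()

# Walk the lines two at a time: a group code, then its value.
pairs = [(lines[i], lines[i + 1]) for i in range(0, len(lines), 2)]

section_type = pairs[0][1]  # the value that follows the 0 code
coords = {code: float(value) for code, value in pairs
          if code in ("10", "20", "30")}

print(section_type)
print(coords["10"], coords["20"], coords["30"])
```

The codes stay as strings here because that’s how they come out of the file; only the coordinate values are converted to numbers.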

In words, what my script needed to do to create a CSV file of the coordinates was:

  1. Collect the lines from a 0 up to the next 0.
  2. Set x, y, and z to the lines after the 10, 20, and 30 lines.
  3. Print the x, y, and z values to a file, all on the same line and separated by commas.

This seemed simple, and luckily for me, it was.

The approach you take to solve this problem depends, as you might expect, on the language you use. I knew that Perl has a built-in way of reading files in chunks separated by a delimiter of your choosing. That’s ideal for this problem, but my Perl skills have become so atrophied I felt I’d spend too much time relearning its syntax. I decided to stick with the language I’m most comfortable with now, Python, and implement the chunk-reading part myself.

Here’s the script, called getpoints:

 1:  #!/usr/bin/env python
 3:  from fileinput import input
 5:  def printpoint(b):
 6:    obj = dict(zip(b[0::2], b[1::2]))
 7:    if obj['0'] == 'VERTEX':
 8:      print('{},{},{}'.format(obj['10'], obj['20'], obj['30']))
10:  print('x,y,z')            # header line
11:  buffer = ['0', 'fake']    # give first pass through loop something to process
12:  for line in input():
13:    line = line.rstrip()
14:    if line == '0':         # we've started a new section, so
15:      printpoint(buffer)      # handle the captured section
16:      buffer = []             # and start a new one
17:    buffer.append(line)
19:  printpoint(buffer)        # buffer left over from last pass through loop

Although Python doesn’t have a built-in delimiter-based chunk-reading method, it does have several ways to read a file a line at a time, which gets us halfway there. I chose to use the input function of the fileinput module.
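If you haven’t used fileinput before, here’s a minimal Python 3 sketch of how it behaves. The temporary file and its four lines are just scaffolding so the example runs on its own. In a command-line script you’d normally call input() with no arguments, in which case it reads the files named on the command line, or standard input if there aren’t any.

```python
import fileinput
import os
import tempfile

# Write a tiny throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".dxf", delete=False) as f:
    f.write("0\nSECTION\n2\nENTITIES\n")
    path = f.name

# Passing files= makes the source explicit instead of relying on sys.argv.
stripped = []
with fileinput.input(files=[path]) as stream:
    for line in stream:
        stripped.append(line.rstrip())  # drop the trailing newline

os.remove(path)
print(stripped)
```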

The main loop of the program, Lines 12–17, reads each line of the input and puts it into a list called buffer. When a 0 line gets read, that means we’ve read a section, so it’s time to process the buffer, empty it, and then start filling it again with the 0 we just read.

The processing of the buffer is done by the printpoint function in Lines 5–8. Because the DXF data for a section comes in key/value pairs, the natural Python data structure is a dictionary, and Line 6 makes that dictionary out of the items in the buffer using a technique I stole from Stack Overflow. It uses the zip function to create a list of tuple pairs, where the first item of each pair comes from the even-indexed items, and the second item comes from the odd-indexed items. The lists of evens and odds are pulled from the buffer by using list slices that start at 0 and 1, respectively, and use step sizes of 2 to get every other item. Then dict turns that list of tuple pairs into a dictionary.
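Here’s the slicing trick broken into its pieces, using a made-up buffer. One difference if you run it under Python 3: zip returns an iterator rather than a list, but dict accepts either.

```python
# A made-up buffer holding one section's lines, in file order.
buffer = ["0", "VERTEX", "10", "1.5", "20", "2.5", "30", "3.5"]

keys = buffer[0::2]    # items 0, 2, 4, ... are the group codes
values = buffer[1::2]  # items 1, 3, 5, ... are the values

obj = dict(zip(keys, values))
print(keys)
print(values)
print(obj)
```

One thing to keep in mind is that a dictionary holds one value per key. If a code appears more than once in a section, only the last value survives.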

With the dictionary made, the function checks that the section is a VERTEX and, if it is, prints the values associated with keys 10, 20, and 30, separated by commas. Sections that aren’t VERTEXes are simply discarded silently.

When you first write a loop, it often works for every pass through except the first one. Or maybe it works for every pass except the last one. In this case, the loop didn’t work for either the first or the last pass. Line 11 seeds the buffer with fake data so the loop works correctly when it hits the 0 at the top of the file. And the final call to printpoint on Line 19 cleans up the unprocessed buffer left over after the last pass.
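Another way to deal with the first and last passes, which is not what getpoints does, is to move the chunking into a generator that only yields a buffer when there’s something in it. This is a Python 3 sketch. The names sections and vertex_rows and the sample data are made up for illustration. Like the script, it assumes that a bare 0 line is always a group code and never a value.

```python
def sections(lines):
    """Yield each section as a list of lines, starting with its '0' line."""
    buffer = []
    for line in lines:
        line = line.rstrip()
        if line == "0" and buffer:  # a new section starts; emit the old one
            yield buffer
            buffer = []
        buffer.append(line)
    if buffer:                      # the last section has no '0' after it
        yield buffer


def vertex_rows(lines):
    """Yield 'x,y,z' strings for the VERTEX sections only."""
    for sec in sections(lines):
        obj = dict(zip(sec[0::2], sec[1::2]))
        if obj.get("0") == "VERTEX":
            yield "{},{},{}".format(obj["10"], obj["20"], obj["30"])


# Made-up input: a header section, two vertices, and an end marker.
sample = ["0", "SECTION", "2", "ENTITIES",
          "0", "VERTEX", "10", "1.0", "20", "2.0", "30", "3.0",
          "0", "VERTEX", "10", "4.0", "20", "5.0", "30", "6.0",
          "0", "EOF"]

rows = list(vertex_rows(sample))
print("x,y,z")
for row in rows:
    print(row)
```

The empty-buffer check replaces the fake seed, and the final yield replaces the cleanup call after the loop.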

The DXF file I was given was over 250 MB and made up of about 30 million lines. After running getpoints, I had a 120 MB file with over 2.5 million lines that looked like this:


It took about 2 minutes to run on my old 2011 MacBook Air.

That wasn’t the end, of course. I filtered the 2.5 million points down to about 40,000 and then did calculations on those points to get the answers I needed. Those were more interesting parts of the work than the DXF extraction, but I can’t talk about them.

Although you probably won’t need to pull points out of a DXF file, I’ve found that other older text file data formats use a similar system of keys and values on successive lines. If you run into one of them, you may get some hints from this script.