Dissociated Darwin

Have you been following Brett Terpstra’s series of “lipsum” posts? He’s developed a set of TextExpander snippets for generating random placeholder text, the kind of nonsense text people use when figuring out a web or publication layout. The “lipsum” name is derived from “Lorem ipsum,” the first two words of this nonsense Latin passage, used for ages by typesetters:

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

I’ve had a ;lorem TextExpander snippet in my library for quite a while, but I’ve never really liked it. It works fine, but because it always produces the same output (the quote above), it creates an unnatural regularity in the placeholder text.

Brett’s lipsum snippets avoid that regularity by generating random text. And in addition to plain text, the set also includes HTML-specific snippets that generate series of paragraphs and ordered and unordered lists, all wrapped in the appropriate tags. It’s an impressive and useful set, but after playing with it for a while, I decided I wanted something different.

There are two reasons I felt I couldn’t use Brett’s work. First, the corpus from which he generates the random text contains an awful lot of non-English words,[1] which litter my text editor with red squiggly lines because I like to have spellcheck turned on (this is also a problem with lorem ipsum). Second, the text is accessed over the internet from Kwisatz Haderach, Lorem Ipsum, and Lorem Ipscream, which means you have to be connected to use the snippets. I wanted nonsense English generated locally.

My first thought was to use the Dissociated Press feature of Emacs, which takes the text in a buffer and scrambles it in a way that resembles a Markov process. I don’t edit in Emacs, but there are ways to use Emacs Lisp as a scripting language. After a little exploration, I decided against this because I’d have to spend too much time learning how to access and edit Emacs buffers via Lisp. But I figured someone in the Python, Perl, or Ruby communities would have reimplemented Dissociated Press as a filter in one of those languages. And I was right; in fact, there are Dissociated Press libraries for each of those languages.

I chose Avi Finkel’s Perl implementation, called Games::Dissociate, because it had the best documentation and would be the easiest to figure out. After downloading and expanding the gzipped tar file, I ran

perl Makefile.PL

I was told that two Perl libraries were missing (Test::Pod::Coverage and Pod::Coverage) and asked if I wanted to rectify that. Since the libraries in question were just for testing and documentation, I declined. I’ve seen Perl go off for a half-hour or more, updating libraries that need updated libraries that need updated libraries… I knew from the documentation that Games::Dissociate had no special dependencies and would work fine despite the dire warnings.

I then ran

make
make install  

to build and install the library. I skipped the usual make test phase because I figured it would complain about the missing modules.

The Games::Dissociate module needs a corpus of text to work with. I chose Darwin’s Origin of Species. I downloaded a copy from Project Gutenberg; stripped out the Gutenberg boilerplate, the table of contents, the index, and the glossary; and saved the remainder in a file called species.txt in a directory called text in my home directory.

The TextExpander snippet I built has this Shell Script content:

perl:
 1:  #!/usr/bin/perl
 2:  
 3:  use Games::Dissociate;
 4:  
 5:  # Slurp in the given corpus as a single string.
 6:  open(my $fh, '<', "$ENV{HOME}/text/species.txt") or die "Can't open corpus: $!";
 7:  {local $/; $corpus = <$fh>;}
 8:  
 9:  # Dissociate the corpus, using word pairs, and return 15-50 pairs.
10:  $length = int(15 + rand(36));
11:  $dis = dissociate($corpus, -2, $length);
12:  
13:  # Capitalize the first character and replace any trailing punctuation with a period.
14:  $dis =~ s/^(.)/\u$1/;
15:  $dis =~ s/[.);:?'", -]+$/./;
16:  
17:  print $dis;

The dissociate function in Line 11 does the hard work. From the Games::Dissociate documentation, this is how the function works:

Basically, the way Dissociated Press algorithms (at least mine — I can’t speak for the exact details of all others) work is:

  1. Start at a random point in the text, and read a group of tokens (characters or words from there — where group size is a parameter you change) from there. Call this the last-matched group.

  2. Output the last-matched group.

  3. Look for the other times the last-matched group occurs in the text, and randomly select one of them. (Or: select the next time that group occurs — a shortcut I’ve made in the code, which seems to still produce random-looking results). Look at the group of tokens that occurs right after that. Make that the last-matched group. Loop back to Step 2 until we think we’ve outputted enough.

  4. But if the last-matched group from 2 occurred just that once in the text, go back to step 1.
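To make those four steps concrete, here is a minimal word-based sketch in plain Perl. It is not the Games::Dissociate code; the function name dissociate_words and the tiny corpus are made up for illustration, and it always picks a random occurrence in Step 3 rather than taking the "next occurrence" shortcut the documentation mentions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, NOT the Games::Dissociate implementation.
# $n is the group size in words; $groups is the number of groups to emit.
sub dissociate_words {
    my ($text, $n, $groups) = @_;
    my @words = split ' ', $text;
    my $last  = @words - $n;            # last index where a group can start
    return '' if $last < 0;

    # Record every position at which each n-word group starts.
    my %starts;
    for my $i (0 .. $last) {
        my $key = join ' ', @words[$i .. $i + $n - 1];
        push @{ $starts{$key} }, $i;
    }

    my @out;
    my $pos = int rand($last + 1);      # Step 1: start at a random point.
    for (1 .. $groups) {
        my @group = @words[$pos .. $pos + $n - 1];
        push @out, @group;              # Step 2: output the group.

        my $seen = $starts{ join ' ', @group };
        if (@$seen > 1) {
            # Step 3: jump to a random occurrence of this group and
            # continue with the group that follows it.
            $pos = $seen->[rand @$seen] + $n;
        }
        else {
            $pos = int rand($last + 1); # Step 4: unique group, start over.
        }
        # Also start over if the jump ran off the end of the text.
        $pos = int rand($last + 1) if $pos > $last;
    }
    return join ' ', @out;
}

my $corpus = 'the cat sat on the mat and the cat ran to the mat';
print dissociate_words($corpus, 2, 8), "\n";
```

With a group size of 2 and 8 groups, this prints 16 words, every adjacent pair of which appears somewhere in the corpus. The hash of starting positions is my own design choice to make the lookup in Step 3 cheap; the real module may well do it differently.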

The three arguments are as follows:

  1. The base text it’s going to generate from. This is the Origin text grabbed in Line 7.
  2. The number of tokens in the group. A positive value uses characters as tokens; a negative value uses words. I use 2 words because that gives a nice, almost plausible feel to the nonsense. It’s the default value used in Emacs, too.
  3. The (approximate) number of groups in the output. Here I use a random number between 15 and 50—generated in Line 10—to make the output roughly 30 to 100 words long.

Lines 14 and 15 capitalize the first word and put a period at the end to give the output the look of a real paragraph. More could be done here to clean up the text—deleting mismatched quotation marks and parentheses comes to mind—but the occasionally exotic punctuation doesn’t bother me.
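If the stray punctuation did bother me, the cleanup might look something like the sketch below. The drop_unmatched function is hypothetical, something I made up for illustration rather than part of the snippet, and it assumes only parentheses and double quotation marks (curly or straight) need handling.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: delete parentheses and double quotation marks
# that have no partner, and leave properly paired ones alone.
sub drop_unmatched {
    my ($text) = @_;

    # Straight quotes can't be told apart, so if there's an odd number
    # of them, assume the last one is the stray.
    my $straight = () = $text =~ /"/g;
    $text =~ s/"([^"]*)\z/$1/ if $straight % 2;

    my %closer_for = ('(' => ')', '“' => '”');
    my %is_closer  = map { $_ => 1 } values %closer_for;
    my @chars = split //, $text;
    my (@open, %drop);   # stack of unclosed openers; indexes to delete

    for my $i (0 .. $#chars) {
        my $c = $chars[$i];
        if (exists $closer_for{$c}) {
            push @open, $i;
        }
        elsif ($is_closer{$c}) {
            if (@open && $closer_for{ $chars[ $open[-1] ] } eq $c) {
                pop @open;          # closer matches the latest opener
            }
            else {
                $drop{$i} = 1;      # closer with nothing to close
            }
        }
    }
    $drop{$_} = 1 for @open;        # openers that never got closed

    return join '', @chars[ grep { !$drop{$_} } 0 .. $#chars ];
}

print drop_unmatched('He said, “rightly boasts (the most beautiful women.'), "\n";
```

In the snippet, a call like this would go just before the final print. Single quotation marks are deliberately left alone because they can't be distinguished from apostrophes without a lot more work.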

I have the snippet bound to the abbreviation ;darwin and I find its output both authoritative and soothing.

Was formed, nor would the deposit formed during subsidence. Since publishing my views on the origin of man in accumulating contradictions in order to complete its hammock, seemed forced to trace community of descent that these should produce from the several islands This difference as in that of the width of an inch in length, and highly developed, and perfect fertility surprising, when it has once disappeared, it does not differ much and the arrangement did not prove discommunity of descent, for in one sense be.

Flower in another closely allied orchid, namely, the Polyzoa, are provided with avicularia and vibracula of the Polyzoa are provided with avicularia alone and that by this fundamental subject of instinct, or structure which are correlated with this special sensibility. In searching for the chance of having and of rearing offspring.” But these two distinct classes range over 7.7 provinces; whereas.

Because this is by no means strict. A multitude of cases we will now give a few illustrations of the eleven marine birds only two are new, having appeared there was less and less.

To an unparalleled degree. Why this has so often shown. We continually meet with the most important of all; and had stuck to the most different climates, but whether or not displayed in any one species, Consequently, whatever part of India the Kattywar breed of horses, or other neuter insect had been slowly acquired through the shell. Of the chickens which were.

Bathed by warm blood to be put out of order and injured. Looking to the inhabitants of the American cave-animals, as all our large birds have been acted on by natural selection. This theory is the most accordant with the known amount of denudation, may.

If I were in the market for a personal mission statement, “Bathed by warm blood to be put out of order and injured” would be a strong contender.

I chose Darwin, by the way, to give a scientific feel to the nonsense without using the technical terms common to my usual writing. My need for placeholder text is limited to work I do for myself and my consulting firm; all of the placeholders eventually get replaced by engineering descriptions.

Because it uses the entire Origin of Species, this snippet may have unacceptable delays when run on an older system. If you want to speed it up, you could cut down on the size of the input text by using just a few chapters. You could also change the flavor of the output by using a different base text entirely. Here’s a sample generated from Edgar Rice Burroughs’ A Princess of Mars:
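An alternative to hand-editing the file is to trim the corpus inside the script. This is only a sketch: sample_paragraphs is a hypothetical helper, it assumes paragraphs are separated by blank lines (as they are in the Gutenberg plain-text files I've seen), and I haven't measured how much time it saves, since the whole file still has to be read.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep $count consecutive paragraphs, starting at a
# randomly chosen paragraph. Paragraphs are assumed to be separated by
# blank lines.
sub sample_paragraphs {
    my ($corpus, $count) = @_;
    my @paras = grep { /\S/ } split /\n\s*\n/, $corpus;

    # A short corpus is returned whole.
    return join "\n\n", @paras if @paras <= $count;

    my $start = int rand(@paras - $count + 1);
    return join "\n\n", @paras[$start .. $start + $count - 1];
}

# Stand-in corpus for demonstration; the snippet would use species.txt.
my $corpus = join "\n\n", map { "Paragraph $_ of the corpus." } 1 .. 10;
print sample_paragraphs($corpus, 3), "\n";
```

In the snippet, the trimmed text would be passed to dissociate in place of $corpus. Because the window moves on every run, the output still draws on the whole book over time, just not all at once.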

Which were the four stood, with my back toward the door of my chamber behind him and just before she died. “As we neared the city The sky Without effort at concealment I hastened to the navy of Helium. He said, “rightly boasts the most beautiful women of his family, and laid his strong arms about his neck I pressed her dear precious little body; so.

And one from Jane Austen’s Pride and Prejudice:

And in silence withdrew; determined, if possible, to get Pemberley by purchase than by netting a purse or covering a screen. But I am sure,” said Mrs. Bennet, “shaking her head, “then she is she handsome?” “She is unfortunately of a sickly constitution, which has prevented her from any return of Mr. Collins’s character, connection, and situation in life.

If I were going to use fiction as my corpus, I’d definitely fix the unmatched quotation marks problem.

Although I ended up not using any of his code, Brett Terpstra still deserves a lot of the credit (and none of the blame) for this. If you’re not already reading his blog, you really should.

Update 2/6/11
Young Mr. Terpstra has pointed out an excellent Dissociated Press-style Ruby library, called Raingrams. Raingrams, written by Hal Brodigan, has many of the stylistic niceties missing from Games::Dissociate and has methods to produce sentences, paragraphs, and longer stretches of text. Brett coded up a quick example that he seeded with Orwell’s 1984:

An example ALL MANS ARE REDHAIRED is meant almost foolproof instrument and incautiously held it seemed to certain sexual misdeeds whatever. Courage and this the poor quarters of objectivity and inflected in every year. For a vague wordless form and in Ingsoc could turn to. So far away.

The appeal of this kind of nonsense—quite apart from any practical use it has as placeholder text—is how the flavor of the original comes through. Eric Beavers (@ELBeavers) took my TextExpander snippet and fed it Alice in Wonderland instead of Origin of Species. He emailed me the result:

The table, she opened it, and found that, it led into a tidy little room with a table in the middle of one! There ought to eat or drink under the circumstances. There was silence for some minutes. it puffed away without.

Yes.

Update 2/7/11
One last thing. Brett’s now written a more elaborate script using Raingrams, one that parallels his earlier script in that it can generate plain text (in Markdown format) or HTML and can produce sentences, paragraphs, ordered lists, and unordered lists.

As I said in the earlier update, I’m very impressed with what Raingrams can do, and now that Brett’s put together such a complete script, I’ll probably abandon my Games::Dissociate script and use his as a template for a few TextExpander snippets. My only concern is whether Raingrams is fast enough to handle a large input corpus in a reasonable amount of time. I noticed that Brett used only an excerpt from 1984, which leads me to wonder about its performance. A few quick tests should answer that question.


  1. In addition to fake Latin, Brett’s snippets can also generate nonsense text taken from fantasy and science fiction series like Dune, Foundation, Ringworld, Harry Potter, and Dr. Who.