Text Processing and the Writer's Workbench

David Silverman

The Writer’s Workbench exemplifies the collaborative nature of the Unix group. The people of the Unix project had always done text processing: they had been writing and editing code to create Unix and its tools, and the project had received funding from the patent department in exchange for a document preparation package. Lorinda Cherry, an experienced programmer who had earned a computer science degree from Stevens Institute in 1969, and Brian Kernighan built an open-ended system of programs to deal with text. Their work on formatting, text analysis, and style helped to create troff, nroff, and Writer’s Workbench, programs still used today.

Three factors contributed to the interest in text processing: in-house use, parts-of-speech programs, and statistical analysis of text. As various groups investigated or required new ways to process text, the number of tools grew. The team used text processing to work on programs and prepare reports, and some of the Unix team’s tinkering led to improvements in the new tools. Cherry’s self-described goal was to "see what kind of neat new things I can make the computer do." Although Unix had used the text editor ed since its inception, Kernighan and Cherry not only improved the way ed performed its old functions but also created new ones.

The first improvements were troff and nroff. These commands facilitated "a wide variety of formatting tasks by providing flexible fundamental tools rather than specific features," according to the Bell System Technical Journal (Kernighan, Lesk, and Ossanna 2119). Combined with the little languages described above, notably eqn, these tools allowed text to be processed and formatted both on screen and in printed documents. This was particularly important for a company such as Bell Labs, where so many reports dealt with technical matters.

The second project to assist with text processing was Brent Aker’s work on the Votrax machine, a peripheral that spoke for the computer. The Votrax did not handle intonation or emphasis properly. Cherry worked on a parts-of-speech program that would allow the computer to pronounce words properly; the computer needed "parts of speech …for syllabic stress."

The third project was Bob Morris and Lee McMahon’s work on the authorship of the Federalist papers. Working with the ideas of statistician Frederick Mosteller, Morris and McMahon were trying to determine who wrote which paper using statistical analysis. "Taking turns typing," they entered the papers into the machine to run them through various filters and counters. They "developed lots of tools for processing text in the process." Typo, for example, was "one of the early spell-checkers." It was based on trigram statistics, a technique drawn from Mosteller that analyzed the frequency of three-letter sequences. Cherry’s familiarity with trigram statistics came from a compression project she had worked on in 1976. She describes the process:

You take the whole string, if your ten-letter word had maybe a trigram that was six letters long that had a high enough count to be worthwhile, you pick that entire six-letter string off and store it in a dictionary and replace it with a byte and then with an index into the dictionary.

This counting procedure was applied to other forms of analysis as well, including the Federalist papers authorship research.
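Cherry’s description amounts to a small greedy dictionary coder: count repeated substrings, move the worthwhile ones into a dictionary, and replace each occurrence with a short index. The Python sketch below illustrates the idea; the substring lengths, the count threshold, and the use of private-use code points as stand-ins for index bytes are assumptions for illustration, not details of the 1976 program.

    from collections import Counter

    def compress(text, min_len=4, max_len=8, min_count=3):
        # Count every substring in the candidate length range.
        counts = Counter()
        for n in range(min_len, max_len + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
        # Keep substrings repeated often enough to be worth a dictionary
        # slot, preferring longer and more frequent ones.
        worthwhile = [s for s, c in counts.items() if c >= min_count]
        worthwhile.sort(key=lambda s: (len(s), counts[s]), reverse=True)
        dictionary = worthwhile[:256]  # each index must fit in one byte
        # Replace each entry with a one-character stand-in; private-use
        # code points avoid colliding with ordinary text characters.
        for index, entry in enumerate(dictionary):
            text = text.replace(entry, chr(0xE000 + index))
        return text, dictionary

    compressed, table = compress("the cat sat on the mat; the cat sat still")

A production coder would recount after each substitution and emit real bytes with an escape convention; the sketch keeps only the count-pick-replace loop Cherry describes.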

Unix’s special capabilities made much of the text processing work possible. Because ed was general purpose, "programs originally written for some other purpose" could be used in document preparation. Rudimentary spell checkers, for example, were built around the sort command. "Case recognition," which "changed with Unix," also enhanced the programmer’s ability to analyze text, as did new methods of accounting "for punctuation and blank space and upper-lower case."
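The sort-based approach is easy to reconstruct in outline: reduce the text to a sorted list of unique lower-case words, then report any word missing from a reference word list. The sketch below mirrors that logic in Python; the word-list path is an assumption for illustration, not a detail from the source.

    import re

    def possible_misspellings(text, wordlist="/usr/share/dict/words"):
        # Lower-case the text, split it into alphabetic words, and
        # de-duplicate them, as sort and uniq would in a pipeline.
        words = sorted(set(re.findall(r"[a-z]+", text.lower())))
        # Any word absent from the reference list is a candidate typo.
        with open(wordlist) as f:
            known = {line.strip().lower() for line in f}
        return [w for w in words if w not in known]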

With the background of formatting, parts-of-speech analysis, and statistical filtering, Cherry embarked on the Writer’s Workbench project. As the "grandmother" of this new aid, Cherry created a suite of programs with the capacity to analyze style, determine readability, and facilitate good writing.

Cherry heard through a colleague that Bill Bestry, an instructor in Princeton University’s English department, "had his students … count parts of speech." The students could then use the objective statistics to improve their writing. Drawing on her previous parts-of-speech work, Cherry made Writer’s Workbench do the count automatically. As she put it:

There are various things you would count and look at using parts of speech to decide whether you’ve got a compound or complex sentence, sentence types, so the part of speech program turned into the style program.
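A toy version suggests how such counting becomes a style judgment: classify each sentence by the function words it contains, then tally the types across a paper. The cue lists below are invented for illustration; the real style program leaned on Cherry’s part-of-speech analysis rather than fixed word lists.

    # Crude function-word cues standing in for real part-of-speech tags.
    COORDINATORS = {"and", "but", "or", "nor", "so", "yet"}
    SUBORDINATORS = {"because", "although", "if", "when", "while", "since"}

    def sentence_type(sentence):
        words = sentence.lower().strip(".!?").split()
        compound = any(w in COORDINATORS for w in words[1:])
        complex_ = any(w in SUBORDINATORS for w in words)
        if compound and complex_:
            return "compound-complex"
        if compound:
            return "compound"
        if complex_:
            return "complex"
        return "simple"

    sentence_type("It rained, so we stayed inside.")  # -> "compound"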

This "layer on top of style and diction" features filled the program with a wider range of capabilities for both students and colleagues at Bell Labs. "There was a human factors group in Piscataway," for example, that wanted to "look at [computer] documentation and decide whether it was reasonable from a human factors standpoint." The readability indices of Workbench helped to edit the manuals of Unix itself.

During beta testing with Colorado State, the Workbench saw active faculty and student use. The program succeeded for three main reasons: its reliability, its structure, and the programmers’ understanding of the writing process. The competing IBM product, Epistle, was based on a parser, making it slow and incapable of coping with incorrect student grammar. Workbench "never really did check grammar" but did illuminate the style of the sentences employed. Its "press-on-regardless attitude" led to consistent analysis across the entire paper. The most important factor, however, was the programmers’ understanding of the writing process itself. They knew to present readability scales as estimates, not to squeeze papers into pure numbers. They knew that the ultimate lesson was to teach students that writing is a series of choices, not a matter of pretty formatting on a laser printer. Cherry expressed her vision of the Workbench’s use:

My feeling about a lot of those tools is their value in education is as much pointing out to people who are learning to write that they have choices and make choices when they do it. They don’t think of a writing task as making choices per se. Once they get it on paper they think it’s cast in stone. So it makes them edit.

This step beyond formatting is what makes Unix truly able to process text and improve the writing skills of its users.