
How Forensic Linguistics Outed J.K. Rowling (Not to Mention James Madison, Barack Obama, and the Rest of Us)

Earlier this week, the UK’s Sunday Times rocked the publishing world by revealing that Robert Galbraith, the first-time author of a new crime novel called The Cuckoo’s Calling, is none other than J.K. Rowling, the superstar author of the Harry Potter series. Then the New York Times told the story of how the Sunday Times’s arts editor, Richard Brooks, had figured it out.

One of Brooks’s colleagues got an anonymous tip on Twitter claiming that Galbraith was Rowling. The tipster’s Twitter account was then swiftly deleted. Before confronting the publisher with the question, Brooks’s team did some web sleuthing. They found that the two authors shared the same publisher and agent. And, after consulting with two computer scientists, they discovered that The Cuckoo’s Calling and Rowling’s other books show striking linguistic similarities. Satisfied that the Twitter tipster was right, Brooks reached out to Rowling. Finally, on Saturday morning, as the New York Times reports, “he received a response from a Rowling spokeswoman, who said that she had ‘decided to fess up’.”

While the literary world was buzzing about whether that anonymous tipster was actually Rowling’s publisher, Little, Brown and Company (it wasn’t), I wanted to know how those computer scientists did their mysterious linguistic analyses. I called both of them yesterday and learned not only how the Rowling investigation worked, but about the fascinating world of forensic linguistics.

With computers and sophisticated statistical analyses, researchers are mining all sorts of famous texts for clues about their authors. Perhaps more surprising: They’re also mining not-so-famous texts, like blogs, tweets, Facebook updates and even Amazon reviews, for clues about people’s lifestyles and buying habits. The whole idea is so amusingly ironic, isn’t it? Writers choose words deliberately, to convey specific messages. But those same words, it turns out, carry personal information that we don’t realize we’re giving out.

“There’s a kind of fascination with the thought that a computer sleuth can discover things that are hidden there in the text. Things about the style of the writing that the reader can’t detect and the author can’t do anything about, a kind of signature or DNA or fingerprint of the way they write,” says Peter Millican of Oxford University, one of the experts consulted by the Sunday Times.

Cal Flyn, a reporter with the Sunday Times, sent email requests to Millican and to Patrick Juola, a computer scientist at Duquesne University in Pittsburgh. Flyn told them the hypothesis — that Galbraith was Rowling — and gave them the text of five books to test that hypothesis. Those books included Cuckoo, obviously, as well as a novel by Rowling called The Casual Vacancy. The other three were all, like Cuckoo, British crime novels: The St. Zita Society by Ruth Rendell, The Private Patient by P.D. James, and The Wire in the Blood by Val McDermid.

Juola ran each book (or, more precisely, the sequence of tens of thousands of words that make up a book) through a computer program that he and his students have been working on for more than 10 years, dubbed JGAAP. He compared Cuckoo to the other books using four different analyses, each focused on a different aspect of writing.

One of those tests, for example, compared all of the word pairings, or sets of adjacent words, in each book. “That’s better than individual words in a lot of ways because it captures not just what you’re talking about but also how you’re talking about it,” Juola says. This test could show, for example, the types of things an author describes as expensive: an expensive car, expensive clothes, expensive food, and so on. “It might be that this is a word that everyone uses, like expensive, but depending on what you’re focusing on, it [conveys] a different idea.”
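To make the word-pairing idea concrete, here is a minimal Python sketch. It is not JGAAP's code; the function names `word_bigrams` and `followers`, the simple tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def word_bigrams(text):
    """Count adjacent word pairs (hypothetical helper, not JGAAP's API)."""
    # Simplified tokenizer: lowercase letters and straight apostrophes only.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))

def followers(bigrams, word):
    """List the words that follow `word`, most frequent first."""
    after = Counter({b: n for (a, b), n in bigrams.items() if a == word})
    return after.most_common()

# Invented sample text, not taken from any of the novels.
sample = "An expensive car, expensive clothes and another expensive car."
pairs = word_bigrams(sample)
print(followers(pairs, "expensive"))
```

Run on a whole novel, the same counts would show which things an author tends to call expensive, which is the kind of habit Juola describes.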

Juola also ran a test that searched for “character n-grams”, or sequences of adjacent characters. He focused on 4-grams, or four-letter sequences. For example, a search for the sequence “jump” would bring up not only jump, but jumps, jumped, and jumping. “That lets us look at concepts and related words without worrying about tense and conjugation,” he says.
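A character 4-gram count can be sketched in a few lines of Python. This is an illustration only: the helper name `char_ngrams` is made up, and whether the real program counts spaces and punctuation inside a 4-gram is an assumption here (this version does).

```python
from collections import Counter

def char_ngrams(text, n=4):
    """Count every run of n adjacent characters (hypothetical helper)."""
    text = text.lower()
    # Slide a window of width n across the text, one character at a time.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Invented sample sentence.
grams = char_ngrams("He jumps, she jumped, they are jumping; one jump.")
print(grams["jump"])  # 4: jumps, jumped, jumping and jump all contain it
```

Because the window ignores word boundaries, every inflection of the verb feeds the same count, which is the point Juola makes about tense and conjugation.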

Those two tests turn up relatively rare words. But even a book’s most common words (words like a, and, of, the) leave a hidden signature. So Juola’s program also tallied the 100 most common words in each book and compared the small differences in frequency. One book might use the word “the” six percent of the time, while another uses it only four percent of the time.
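Here is a rough Python sketch of a common-word comparison. The article doesn't say how the frequencies were compared, so the Manhattan distance below is an assumption, as are the function names and the three invented sample texts.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase words (simplified tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def top_words(texts, k=100):
    """Return the k most common words pooled across all texts."""
    pooled = Counter()
    for t in texts:
        pooled.update(tokenize(t))
    return [w for w, _ in pooled.most_common(k)]

def relative_freqs(text, vocabulary):
    """Share of the text taken up by each vocabulary word."""
    words = tokenize(text)
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in vocabulary]

def manhattan(p, q):
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Invented toy texts standing in for whole books.
unknown = "the dog and the cat sat on the mat"
candidate_a = "the bird and the fish sat on the rock"
candidate_b = "a bird saw a fish near a rock"

vocab = top_words([unknown, candidate_a, candidate_b], k=5)
u = relative_freqs(unknown, vocab)
for name, text in [("A", candidate_a), ("B", candidate_b)]:
    print(name, round(manhattan(u, relative_freqs(text, vocab)), 3))
```

The smaller the distance, the more alike the two profiles. With real novels you would use k=100, as in the article, and far longer texts.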

Juola’s final test completely separates a word from its meaning, by sorting words simply by their length. What fraction of a book is made of three-letter words, or eight-letter words? These distributions are fairly similar from book to book, but statistical analyses can dig into the subtle differences. And this particular test “was very characteristically Rowling,” Juola says. “Word lengths was one of the strongest pieces of evidence that [Cuckoo] was Rowling.”
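The word-length test is the easiest of the four to sketch. The Python below is an illustration under my own assumptions (the name `word_length_distribution`, lumping very long words into one bucket, and an invented sample sentence), not a reconstruction of Juola's program.

```python
import re
from collections import Counter

def word_length_distribution(text, max_len=12):
    """Fraction of words at each length; longer words share the last bucket."""
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(min(len(w), max_len) for w in words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return {n: counts[n] / total for n in range(1, max_len + 1)}

# Invented sample sentence.
dist = word_length_distribution("The cat sat on a mat")
print(dist[3])  # share of three-letter words
```

Two books can then be compared by lining up their distributions length by length and measuring how far apart they are.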

It took Juola about an hour and a half to do all of these word-crunchings, and all four tests suggested that Cuckoo was more similar to Rowling’s Casual Vacancy than the other books. And that’s what he relayed back to Flyn. Still, though, he wasn’t totally confident in the result. After all, he had no way of knowing whether the real author was somebody who wasn’t in the comparison set of books who happened to write like Rowling does. “It could have been somebody who looked like her. That’s the risk with any police line-up, too,” he says.

Meanwhile, across the pond, Peter Millican was running a parallel Rowling investigation. After getting Flyn’s email, Millican told her he needed more comparison data, so he ended up with an additional book from each of the four known authors (using Harry Potter and the Deathly Hallows as the second known Rowling book). He fed those eight books, plus Cuckoo, into his own linguistics software program, called Signature.

Signature includes a fancy statistical method called principal component analysis to compare all of the books on six features: word length, sentence length, paragraph length, letter frequency, punctuation frequency, and word usage.
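For readers curious what principal component analysis does, here is a bare-bones Python sketch that finds only the first component by power iteration. It is not how Signature is implemented, as far as I know; the function names are made up, and the feature numbers are invented placeholders rather than measurements from any real book.

```python
import math

def center(rows):
    """Subtract each column's mean so every feature averages zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(d)] for i in range(d)]

def first_component(cov, iterations=200):
    """Approximate the direction of greatest variance by power iteration."""
    d = len(cov)
    # Caveat: this start vector fails if it is exactly orthogonal
    # to the dominant direction.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v

def project(rows, direction):
    """Score each row along the given direction."""
    return [sum(x * d for x, d in zip(row, direction)) for row in center(rows)]

# Invented features per book: [mean word length, mean sentence length].
books = {"Book A": [4.1, 14.0], "Book B": [4.2, 15.0],
         "Book C": [4.8, 22.0], "Book D": [4.9, 23.0]}
rows = list(books.values())
direction = first_component(covariance(center(rows)))
for name, score in zip(books, project(rows, direction)):
    print(name, round(score, 2))
```

Books with similar scores sit close together on the resulting axis. In practice the features would usually be rescaled first so that one with large raw numbers, like sentence length, doesn't dominate.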

Word frequency tests can be done in different ways. Juola, as I described, looked at word pairings and at the most common words. Another approach that can be quite definitive, Millican says, is a comparison of rare words. The classical example concerns the Federalist Papers, a series of essays written by Alexander Hamilton, James Madison, and John Jay during the creation of the U.S. Constitution. In 1963, researchers used word counts to determine the authorship of 12 of these essays that were written by either Madison or Hamilton. They found that Madison’s essays tended to use “whilst” and never “while”, and “on” rather than “upon”. Hamilton, in contrast, tended to use “while”, not “whilst”, and used “on” and “upon” at the same frequency. The 12 anonymous papers never used “while” and rarely used “upon”, pointing strongly to Madison as the author.
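The marker-word idea boils down to counting a handful of telltale words per thousand words of text. The Python below is a simplified sketch, not the 1963 researchers' actual statistical method; the helper name `rate_per_thousand` is hypothetical, and the sample sentence is invented rather than quoted from the Federalist Papers.

```python
import re

def rate_per_thousand(text, markers):
    """Occurrences of each marker word per 1,000 words of text."""
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words) or 1  # avoid dividing by zero on empty text
    return {m: 1000 * words.count(m) / total for m in markers}

# Invented sample sentence.
sample = "Whilst we wait upon events, we rely on reason."
rates = rate_per_thousand(sample, ["whilst", "while", "on", "upon"])
print(rates)
```

Computing the same rates for essays of known authorship and for the disputed ones gives the kind of side-by-side comparison described above.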

Millican found a few potentially distinctive words in his Rowling investigation. The other authors tended to use the words “course” (as in, of course), “someone” and “realized” a bit more than Rowling did. But the difference wasn’t statistically significant enough for Millican to run with it. So, like Juola, he turned to the most common words. Millican pulled out the 500 most common words in each book, and then went through and manually removed the words that were subject-specific, such as “Harry”, “wand”, and “police”.

Of all of the tests he can run with his program, Millican finds these word usage comparisons most compelling. “You end up with a graph, and on the graph it’s absolutely clear that Cuckoo’s Calling is lining up with Harry Potter. And it’s also clear that the Ruth Rendell books are close together, the Val McDermid books are close together, and so on,” he says. “It is identifying something objective that’s there. You can’t easily describe in English what it’s detecting, but it’s clearly detecting a similarity.”

On all of Millican’s tests, Cuckoo turned out to be most similar to a known Rowling book, and on four of them, both Rowling books were closer than any of the others. Millican got the files around 8pm on Friday night. Five hours later, he emailed the Sunday Times. “I said, ‘I’m pretty certain that if it’s one of these four authors, it’s Rowling.’”

This isn’t the first time that Millican has found himself in the middle of a high-profile authorship dispute. In the fall of 2008, just a couple of weeks before the U.S. presidential election, he got an email from the brother-in-law of a Republican congressman from Utah. The man told Millican that they had used his Signature software (which is downloadable from Millican’s website) to show that Barack Obama’s book, Dreams from My Father, could have been written by Bill Ayers, a domestic terrorist. “They were planning to have a press conference in Washington to expose Obama one week before the election and got in touch with me,” Millican recalls, chuckling. “It was quite a strange situation to be in.”

Millican re-ran the analysis and definitively showed that Dreams was not, in fact, written by Ayers (you can read more about what he did here).

Juola told me some crazy stories, too. He once worked on a legal case in which a man had written a set of anonymous newspaper articles that were critical of a foreign government. He was facing deportation proceedings in the United States, and knew that if he was deported then the secret police in said foreign government would be waiting for him at the airport. Juola’s analyses confirmed that the anonymous articles were, in fact, written by the man. And because of that, he was permitted to stay in the U.S. “We were able to establish his identity to the satisfaction of the judge,” Juola says.

That story, he adds, shows how powerful this kind of science can be. “There are a lot of real controversies with real consequences for the people involved that are a lot more important than just, did this obscure novel get written by this particular famous author?”

The words of many of us, in fact, are probably being mined at this very moment. Some researchers, Juola told me, are working on analyzing product reviews left on websites like Amazon.com. These investigations could root out phony glowing reviews left by company representatives, for example, or reveal valuable demographic patterns.

“They might say, hmmm, that’s funny, it looks like all of the women from the American West are rating our product a star and a half lower than men from the northeast, so obviously we need to do some adjustment of our advertisements,” he says. “Not many companies are going to admit to doing this kind of thing, but anytime you’ve got some sort of investigation going on, whether police or security clarance or a job application, one of the things you’re going to look at is somebody’s public profile on the web. Anything is fair game.”

In fact, it was a good thing the original tipster of the Rowling news deleted his or her Twitter account, Juola says. “If we still had the account, we could have looked at the phrasings to see if it corresponded to anyone who works at the publishing house.”

24 thoughts on “How Forensic Linguistics Outed J.K. Rowling (Not to Mention James Madison, Barack Obama, and the Rest of Us)”

  1. “all four tests suggested that Cuckoo was more similar to Rowling’s Casual Vacancy than the other books” suggested? Wow. Not very confident, even when they know the answer!

    Forensic linguistics my left foot. It was the same trick used for thousands of years – the inability to keep a secret. Science doesn’t advance by claiming credit after the event. It really doesn’t work that way.

  2. It is good to see that there are multiple computer programs for this type of detective work to provide rigorous testing and less room for error.
    Though, I wonder if someone trying to deceive these programs could be successful at hiding their individual style. And don’t book editors mess around with a writer’s style sometimes?
    I know my writing is pretty idiosyncratic and would probably be fairly easy to spot, especially since I make lots of mistakes. 😉

  3. Hi Kathy,

    The linguists I spoke to mentioned this possibility. They said certain tests — especially the word length one — are fairly easy to manipulate consciously. In fact, an author might deliberately use shorter words when writing, say, a children’s book than a dense white paper. But they say it’s very difficult to fake the frequency of common words. Also, you may use words in a distinctive way without ever knowing that it’s distinctive! These are patterns that aren’t apparent to human readers, only computer ones. 🙂

  4. Thanks Virginia for your response.
    I’m definitely sure that a computer could pick up patterns that I would miss! 🙂

  5. Has anyone suggested using this analysis on Paul’s gospel letters to determine which are authentic and which were likely to have been written by others? If so, I imagine that the analysis would require an ability to read ancient Greek.

  6. Hi Kathy,

    I am a researcher who works with Dr. Juola. There is another entire field dedicated to learning how to “fool” the types of analyses we do. It is much harder than it sounds to change the kinds of things we look for in your writing (word lengths, which was noted as fairly easy to fake, is not a very common method for our work).

    There are actually computer programs out there that will use our techniques but backward to help with that. But luckily, as they get better so do we, and they generally aren’t able to fool us for long. Also, since we use a multi-faceted approach to the analysis, if you try to adjust ALL those parameters the writing tends to come out so bad you can tell what happened just by looking at it. Imagine trying to change your word length, use of active/passive voice, preferred verb tense, content words (things like saying “sofa” vs “couch”), function words (things like “ON the left” vs “TO the left”), etc.

  7. did I miss the mention of stopwords/function-words and the work by Mosteller and Wallace (could’ve, cos this is a long article for the mobile screen). if someone Google searches for “stopwords and stylometry”, who knows even my contribution that built on M&W’s work in Federalist Papers may pop out.

  8. rather than science I think it would be more instructive to understand this as a classic public relations (i.e. ADVERTISING) ploy. Linking up a new unknown ‘brand’ to an established successful one. This is not news. It is advertising.

  9. The best way to keep a secret is to not let anyone know you’ve got one. He told his wife, and she told her friend?! What a surprise. Not that anyone involved on the profit end would keep that particular secret for long.

    I think I remember Security Clarance; didn’t he work at the Quiky-Mart?

  10. Joseph O, something like this was done with the epistles decades ago (see http://en.wikipedia.org/wiki/Authorship_of_the_Pauline_epistles#Language_and_style); a rerun using current technology is obviously overdue. Another fertile area might be unravelling Old Testament texts, to answer such knotty questions as, are J and E really all that different?

    I don’t think you’d even need to know the language; just go by frequency of symbol combinations. The biggest problem might be choosing good out-group texts for comparison.

  11. The headline is misleading and would certainly not be used in a reputable newspaper. Forensic linguistics did not “out” JK Rowling, that was done on Twitter. The role of forensic linguistics was to test if the outing had any likelihood of being true.

  12. Why not respect the lady’s desire to keep her secret ?… she must have had her reasons. Why can’t we just leave well enough alone …

  13. Speaking of language mining… Who do I talk to about my fear of people stealing my personal writing/speaking style through whatever it is that makes predictive texting applications work cuz that freaks me out…

  14. Rowling’s identity was coming out one way or another; this was purely a business decision by the publisher, who needed to move this book. Rowling was allowed to play cat and mouse for a spell, but it’s called the book business, and her acknowledgement of authorship was coming, science or no science.

  15. principal components analysis

    this as i recall was an option in factor analysis (a way to take a pairwise correlation matrix and construct a new variable that aligned with the existing ones, kind of a core or central variable, a factor, or principal component)

    principal components analysis, in my work, was uninformative, b/c in social science data everything is correlated with everything else, so you get one big glob of a factor, or component

    we used an option called varimax (ie maximize variation or variance), which was probably a proprietary name, whose purpose was to reduce some of the first big glob of variance accounted for and create a few more that were equally predictive of the component variables but not correlated to each other – for example, the big glob might be social desirability (undefined here) but the varimax rotation could filter out helpfulness, leadership, diligence, dutifulness – all depending on what your input variables were; we had looked at personality descriptors or more accurately behavioral descriptors, imputing to personality attributes;

    varimax rotation identified some imputed personality core variables, also called traits, in our data, which turned out to be pre-empirical traits in trait theory data, although this is very subjective and slippery

    this choice of which statistic to use, might be domain specific

    the empirical (numerical) test, maybe for when to use which, is how highly the extracted principal components correlate with each other; if highly, try varimax; for the data domain of linguistic style, who knows? there is probably an ‘erudite’ style and a ‘low-brow’ style – there was even a term eigenvalue, or ‘1’ or apportioning ‘1’ how closely; too closely was bad

    this is my first read of style-detection software; i would have found or expected: sentence length, included; number of clauses; sequence of dominant and subordinate clauses; commas per sentence, words per paragraph; for novels, percentage of writing that is language – people talking, vs narrative, ie description, the omniscient author, vs say interior monologue – these obviously are harder for a computer to evaluate than the basics used

    btw i don’t read books with adverbs, or where conversation is more than say 10% or 15%, but then i am quant and therefore anti social

    the extraction of imputed traits had a utility: in remedial preschool, we were able to identify areas of behavior that were or were not responsive to intervention, and showed these areas of responsiveness to align with interventional intention

    nothing common sense wouldn’t tell us, but (1) common sense is not so common; (2) counting things particularly those not so obvious or with , has virtue

  16. big woop. what if she is the author? jk rowling is an AMAZING writer and if she wants to go under another identity so she doesn’t get all the fame and people in her face then let her be. so many people are nosy when it comes to famous people. they’re just regular people who have talents that don’t go unnoticed. jk has been through a lot and if she just wants to hang back for a while and not get noticed through another book then let her be.

  17. Hi Virginia ~~~ This is such a fascinating newer field and I am interested in finding out more about it as a career option. I have a master’s in Linguistics although my work experience has been in Applied Linguistics and information development. Do you have any suggestions for finding out more about making a career transition to forensic linguistics?

  18. This is as fascinating and reliable as lie detectors and handwriting analysis. It may be a useful investigative tool – but it should have no value as a legal proof in the US legal system. As to other countries I am certain it might be enough to hang somebody.
    Hint for the feeble minded – i.e. most of us including those sitting on juries – this is dangerous.
    Just because observations are made using computers does not make it a scientific (Daubert) proof. Computers are told what to look for based on a subjective choice.
    Another dangerous argument is in doing a gazillion tests and claiming that something sticks. Something always sticks. The more mud you sling the more will stick – and it only proves that some mud sticks better than the other. Big deal.
    The next question then becomes: After someone determines that you are the most likely author under their selection of test criteria – how do you prove that you are not the “pink elephant”? Or should you ever be put in that position? The classical “When did you stop beating your wife” comes to mind – it sticks even if you do not have a wife.
    Fortunately trained judges most of the time remember that.
    But “outing” the great author like that makes for a wonderful read anyway.

  19. I wonder if these analyses could be turned to Shakespearean texts to deduce whether they were written by the same author. I’m not sure whether it would sate Shakespeare conspiracy theorists (http://en.wikipedia.org/wiki/Shakespeare_authorship_question) but it would at least be an interesting exercise to see if a single author penned them.

    Thanks for the recap of this year’s articles; I’ve only just discovered the blogs recently.

  20. The statistical analysis in this investigation is amazing. The technology is so modern and advanced it is hard to believe that we can identify word patterns and call out phonies. I recently read Freakonomics, which goes into analyzing test score patterns to sort out the cheaters; that process is very similar to finding the most common words or phrases.
