Does this prove that Rowling wrote Cuckoo? All it really "proves" (suggests, rather) is that out of the four authors studied, the most likely candidate author is Rowling. But it could easily also be by someone who, by accident or design, wrote like Rowling. (Certainly one could do worse than imitate the style of one of the most successful writers of this generation.) It is fair to say that there was a lot of evidence pointing to Rowling as the author and nothing specifically suggesting that she wasn't. It was certainly enough for the Sunday Times to use as part of their package when they approached Rowling's agent and asked, directly, "Did J.K. Rowling write The Cuckoo's Calling?" Less than a day later, Rowling confirmed through a spokesman that she had indeed written the novel, and the story launched.

Discussion

It's always nice when a mystery story closes with a confession by the person responsible, and this one is no different.

Character 4-grams can be words, parts of words (like the four letters "nsid" inside the word "inside"), or even parts of two words (like the four letters "n th" as part of the phrase "in the"). This particular unit of analysis has been shown to be very accurate at determining authorship, and there's a very good article by Efstathios Stamatatos that just came out in the Journal of Law and Policy describing why. I also ran an analysis on word bigrams, pairs of adjacent words, again a feature with a good track record. The character 4-grams showed a preference for McDermid, with 8 sections closest to her. Three were Rowling-like, and no one else was mentioned. The word pairs, on the other hand, were clearly Rowling-like (9 sections, against 2 for McDermid, no one else mentioned).

Results

So, the final score? The results look mixed, but point strongly to Rowling. There were certainly a couple of likely losers: nothing at all pointed to Rendell as a possible author, and only one test, and an unreliable one at that, suggested James. McDermid could be a reasonable candidate author, but the word length distribution seemed almost entirely uncharacteristic of her. The only person consistently suggested by every analysis was Rowling, who showed up as the winner or the runner-up in each instance.
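As a rough illustration of the two vocabulary features described above, here is a minimal Python sketch that counts character 4-grams and word bigrams. The function names and the normalization (lowercasing and collapsing whitespace, leaving punctuation in place) are assumptions made for this example; this is not the code that was actually run for the analysis.

```python
from collections import Counter


def char_ngrams(text, n=4):
    """Count overlapping character n-grams (illustrative helper).

    Assumption: text is lowercased and runs of whitespace are collapsed
    to a single space, so an n-gram can span two words.
    """
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def word_bigrams(text):
    """Count pairs of adjacent words (illustrative helper).

    Assumption: words are split on whitespace only; punctuation is
    not stripped.
    """
    words = text.lower().split()
    return Counter(zip(words, words[1:]))
```

For instance, `char_ngrams("inside")` contains the key `"nsid"`, and `char_ngrams("in the")` contains `"n th"`, matching the examples in the text. The resulting counts can be turned into relative frequencies and compared between a chunk and each candidate author in the same way as any other feature vector.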

No one else got a mention. Another feature I used was the 100 most common words. What percentage of the document was "the", what was "of", and so on? Again, this is a rich data set that is easy to extract by computer. Using an otherwise similar analysis (including cosine distance again), four of the sections were Rowling-like, four were McDermid-like, and the other three split between James and Rendell. I ran two tests based on authorial vocabulary. The first was on the distribution of character 4-grams, groups of four adjacent characters.
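The most-common-words feature mentioned above (what percentage of a document is "the", what is "of", and so on) could be sketched along these lines. The helper names `top_words` and `common_word_profile`, and the crude regular-expression tokenizer, are assumptions for illustration only and are not taken from the software used in the analysis.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of lowercase letters or apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def top_words(texts, k=100):
    """Return the k most frequent words across a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return [word for word, _ in counts.most_common(k)]


def common_word_profile(text, vocabulary):
    """Relative frequency of each vocabulary word in the given text."""
    words = WORD_RE.findall(text.lower())
    total = len(words)
    counts = Counter(words)
    return [counts[w] / total if total else 0.0 for w in vocabulary]
```

Each chunk and each candidate novel gets a profile over the same vocabulary, so the profiles can be compared with a distance measure such as cosine distance.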

I broke Cuckoo into chunks of 1000 lines (the last chunk was incomplete) and compared each chunk individually against the baseline model built from each of the four candidate novels. The heart of this analysis, of course, is in the details of the word "compared." Compared what, specifically, and how, specifically? I actually ran four separate types of analyses focusing on four different linguistic variables. While anything can in theory be an informative variable, my work focuses on variables that are easy to compute and that generate a lot of data from a given passage of language. One variable that I used, for example, is the distribution of word lengths. Each novel has a lot of words, each word has a length, and so one can get a robust vector of the form "x% of the words in this document have exactly y letters." Using a distance formula (for the mathematically minded, I used the normalized cosine distance formula instead of the more traditional Euclidean distance you remember from high school), I was able to get a measurement of similarity, with 0.0 being identity and progressively higher numbers being less similar. Of the 11 sections of Cuckoo, six were closest (in distribution of word lengths) to Rowling, five to James.
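To make the procedure above concrete, here is a minimal Python sketch of chunking a text by lines, building a word-length distribution, and picking the closest candidate by cosine distance. All names here (`chunk_lines`, `word_length_distribution`, `cosine_distance`, `closest_author`), the letters-only tokenizer, and the cap of 20 letters are assumptions for illustration; the actual analysis was run with the author's own software, and details such as its exact normalization may differ.

```python
import math
import re
from collections import Counter


def chunk_lines(text, size=1000):
    """Split text into chunks of `size` lines; the last chunk may be shorter."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def word_length_distribution(text, max_len=20):
    """Fraction of words having exactly 1, 2, ..., max_len letters.

    Assumption: a word is a run of ASCII letters; longer words are
    counted in the max_len bucket.
    """
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(min(len(w), max_len) for w in words)
    total = sum(counts.values())
    return [counts[n] / total if total else 0.0 for n in range(1, max_len + 1)]


def cosine_distance(u, v):
    """1 - cosine similarity: 0.0 for identical directions, larger when less similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # Assumption: treat an empty vector as maximally distant.
    return 1.0 - dot / (norm_u * norm_v)


def closest_author(chunk, baselines):
    """Name of the baseline whose vector is nearest to the chunk's vector."""
    vec = word_length_distribution(chunk)
    return min(baselines, key=lambda name: cosine_distance(vec, baselines[name]))
```

With `baselines` mapping each candidate's name to the word-length distribution of that candidate's novel, calling `closest_author` on every chunk from `chunk_lines` and tallying the winners gives a per-author count of the kind reported above.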

So it works, but not necessarily well. A better approach is not to use average word length, but to look at the overall distribution of word lengths. Still better is to use other measures, such as the frequency of specific words or word stems (e.g., how often did Madison use "by"?), and better yet is to use a combination of features and analyses, essentially analyzing the same data with different methods and seeing what the most consistent findings are. That's the approach I took.

Materials, methods, maths

I was given e-text copies of Cuckoo to compare against Rowling's own The Casual Vacancy, Ruth Rendell's The …, P.D. James' The Private Patient and Val McDermid's The Wire in the Blood. Fortunately, these were relatively clean copies and required little attention: deleting front and back matter, plus a little bit of cleanup regarding some non-standard punctuation, mostly quotation marks. The JGAAP program handles issues like normalizing whitespace and stripping punctuation in a straightforward manner.

For example, Galbraith apparently was surprisingly good at describing women's clothing, possibly suggesting a female author.) Would I be willing to look into this? I said yes, of course, but with a couple of conditions. First, I needed clean (machine readable) copies of Cuckoo, and clean samples of something comparable undisputedly by Rowling herself. Secondly, I needed other comparable samples from other writers (distractor authors, to use the common term) to assess the degree of variation. For the past ten years or so, I've been working on a software project to assess stylistic similarity automatically, and at the same time, test different stylistic features to see how well they distinguish authors. De Morgan's idea of average word lengths, for example, works, sort of.

If you actually get a group of documents together and compare how different they are in average word length, you quickly learn two things. First, most people are average in word length, just as most people are average in height. Very few people actually write using loads of very long words, and few write with very small words, either. Second, you learn that average word length isn't necessarily stable for a given author. A letter to your cousin will have a different vocabulary than a professional article written for publication.

Some choices come from dialect (the reason an Englishman drives a lorry but an American a truck), some from social pressure (if I need to impress someone with my vocabulary, I can utilize a polysyllabic lexicon instead of just using big words), and some just seem arbitrary. An example of the latter category is in the use of many function words. If you ask yourself where the salad fork is relative to the plate, you quickly realize that it's usually to the left of the plate. But it could just as easily be described as "on" the left of the plate, "at" the left of the plate, or perhaps "to" the left side of the plate. Same fork, same position, and at least four different choices for how to describe it, none of which correspond to any sociolinguistic or cognitive variable with which I'm familiar.

But what we do know is that much of this apparently free variation is actually rather static, at least at an individual level. So by studying examples of documents a person has written, we can build a model of the kind of choices that person makes. The idea that we can use quantifiable models of this kind of linguistic choice is hardly new. It dates back at least to the logician Augustus De Morgan (yes, De Morgan's rule), who proposed in the mid-19th century that average word length could be used to settle questions of disputed authorship. Mosteller and Wallace studied the writing styles of The Federalist Papers in the mid-60s and showed, for example, that Alexander Hamilton never used the word "whilst" but that James Madison never used the word "while." More interestingly, they both used the word "by," but Madison consistently used it twice as often.

Problem statement

I was approached by a reporter, Cal Flyn, from the Sunday Times, to assess this kind of variation in the writings of "Robert Galbraith," a first-time novelist and author of The Cuckoo's Calling. (I learned later from the papers that the paper had received an anonymous tip via Twitter that Galbraith was a pen name for Rowling. And in retrospect there were a lot of other clues as well.

Guest post by Patrick Juola

With the recent announcement by London's Sunday Times that J.K. Rowling had written the recently published novel The Cuckoo's Calling, several people have asked about the process that led up to this. I'm grateful to Ben Zimmer for giving me a chance to write a bit about it.

Background

I don't know how much background most linguists have in "forensic stylometry." The basic theory is pretty simple: language is a set of choices, and speakers and writers tend to fall into habitual, or at least common, choices.

But this time, I won't encourage you to buy a single copy. Although it looks pretty cool (for a book that will sap your children's brains and everything).

The Sunday Times (UK) recently revealed that J.K. Rowling wrote the detective novel The Cuckoo's Calling under the pen name Robert Galbraith. The newspaper explained that, as part of their investigation, they sought the assistance of two scholars who have developed software to help with authorship attribution: Peter Millican of Oxford University and Patrick Juola of Duquesne University. Given the public interest in the Rowling revelation, I asked Patrick to write a guest post describing the authorial analysis that he conducted. (For more on the story, see my post on the Wall Street Journal's Speakeasy blog.)

Regarding my personal copy: by all rights, one of those handmade copies should have come to me. If it hadn't been for me, with all the blogging I did about the books, encouraging all eleven of you to buy them (and talk them up to your own few dozen readers), it's doubtful J.K. Rowling would have amassed the fortune she has today. Not that I'm bitter! I'm just saying, it's important to remember where you came from. Anyway, as I predicted, the almighty pound wins again, and Beedle the Bard is coming to bookstores, just in time for the holidays.

Prof Dawkins will write a book aimed at youngsters where he will discuss whether stories like the successful J.K. Rowling series have a "pernicious" effect on children. The 67-year-old, who recently resigned from his position at Oxford University, says he intends to look at the effects of "bringing children up to believe in spells and wizards". "Looking back to my own childhood, the fact that so many of the stories I read allowed the possibility of frogs turning into princes, whether that has a sort of insidious effect on rationality, I'm not sure. Perhaps it's something for research." Oh my collisions of subatomic particles occurring in random fashion, the man may be onto something! Because history shows there's no way that children can differentiate between scientific reasoning and fictionalized accounts of magic.

And pundits gotta pontificate. London Daily Mail: Outspoken atheist Professor Richard Dawkins is to warn children of the dangers in believing "anti-scientific" fairytales such as Harry Potter.

