Book Scraper
A note on our methodology
Welcome to Book Scraper, a tool the Times has created to let you analyse some of the world's most famous books.
This guide attempts to explain how Book Scraper was developed, sets out some of the challenges we faced along the way, and also outlines the decisions we took about our methodology.
We welcome any feedback you have, so please get in touch and let us know what you think!
How did you choose your books?
There was no great scientific - or literary - rigour in deciding which books we analysed.
The main criteria were that they would be well-known - some might think of them as 'classic texts' - and that they were out of copyright. (We had hoped to add a comprehensive subsection of books that were on the 2009 GCSE syllabus, but so many were in copyright that it wasn't worth creating such a category.)
We spoke at length to the Times Books editors, who gave us very helpful feedback. We then downloaded the 126 texts from Gutenberg, and went to work.
How come there's no older books?
Initially, we had several classical-era texts - such as the Iliad and the Odyssey - as well as some well-known medieval ones.
The idea was that when you were using Book Scraper to analyse how a word had been used across time, you would get a nice sense of its appearance in those early texts and gradual adoption over the centuries.
A problem emerged, though: if you wanted to display such trends graphically - for instance, using a timeline - the vast amount of white space between, say, Plato's Republic, which was written in 360BC, and the next book, Dante's Divine Comedy, was very inelegant. (The great concentration of books in the 19th century on such a timeline was then too bunched up.)
We didn't find a way around this, so we're launching with a more modest spread of texts, starting with The Prince (which Niccolò Machiavelli began in 1532), and ending with Right Ho, Jeeves by P.G. Wodehouse (written in 1922).
How come there's so many nonsense words?
One of the most common ways to explore Book Scraper is to search for words by length, and then click on the longest - which has 69 letters - to see what it is. (You get: nationalgymnasiummuseumsanatoriumandsuspensoriumsordinaryprivatdocent, in case you haven't yet.)
It is, of course, not a word in the traditional sense - it's one of the many that Joyce coined and which give him by far the largest 'vocabulary' of any of the authors in our collection. (He has a vocabulary of 33,213 words.)
There's a couple of things to say here.
First, many of the long 'words' may seem to you to be collections of shorter words, which raises the question: why didn't you spot the transcription error and break them up?
The reality is that most - at least as far as we can tell, anyway - were intended to be written that way. Joyce, for instance, wrote handsomemarriedwomanrubbedagainstwide, just like that. Dickens, likewise, wrote you'retheguidingstarofmyexistence. To split them up into separate words would have meant tampering with the manuscript, which we didn't want to do.
Second, from a database point of view, it's very difficult to single out 'nonsense words'. As far as the computer is concerned, any collection of letters without a space between them is a word. It doesn't know what a collection of letters means.
One solution would have been to trawl through every one of the 900,000 words in our database manually and excise those that weren't in the dictionary, but that wasn't satisfactory either: it's fun to explore the way Joyce and others contrived long words - especially when, with a single click, you can go to the word in context. (One curiosity thrown up by this is that the word honorificabilitudinitatibus - which appears just twice in the database - appears to be a private joke Joyce had with Shakespeare. Have a look and see for yourself.)
Having said that, there are several typos in the database, which typically are errors introduced during the scanning process.
(Gutenberg, which we used as the source material for our database, uses a scanning method known as OCR - or Optical Character Recognition. OCR is the most common way to digitise texts. You scan a page, and then a computer makes a guess as to which bits of the image correspond to particular words. It then reproduces those words in digital form - making the text searchable. If a word is obscured, the computer will often guess incorrectly. OCR errors are a common feature of digitised texts.)
In any case, if you're after the longest 'real words', then you tend to find that they kick in at about 18 characters. (See the 'explore words by number of characters' page.) The longest words in our database, that is. There are, of course, others. Antidisestablishmentarianism is one of the best-known 'long' words. But none of the 53 authors in our database used it.
How do you calculate the 'most important words'?
An author's - and a publication's - 'most important' words are calculated using a corpus linguistics measure known as TF-IDF, or term frequency-inverse document frequency.
TF-IDF attempts to measure the significance of a word to a particular text by measuring its frequency in that text against its frequency in other texts in a 'corpus'.
It produces a more accurate measure of which words are important to a text than a simple count of the words that are most frequent in that text, because it makes reference to other texts.
Take Romeo and Juliet, for instance. The word thou appears more frequently in the play than either of the words Romeo or Juliet. By a simpler measure - based on pure frequency - that might make it qualify as the most important word in the text. But because 'thou' also appears frequently in other texts in the corpus - whereas Romeo and Juliet do not - its TF-IDF rating in Romeo and Juliet is reduced.
For those interested in the maths, TF-IDF is calculated as follows: term frequency (TF) is the number of times the word appears in a document divided by the total number of words in that document.
Inverse document frequency (IDF) is the log of the ratio of the total number of documents in the corpus to the number of documents in which the word appears.
The TF-IDF is the product of these numbers.
Take, for instance, a 1,000-word document in which the word rabbit appears 11 times. The term frequency is 11/1,000 = 0.011. Now, let's assume your corpus contains 100 texts, and that the word rabbit appears in 30 of them. The inverse document frequency, using the natural log, is log (100/30) = 1.2
The TF-IDF of rabbit in the 1,000-word document is 1.2 x 0.011 = 0.013
(It's worth noting that if a word appears in every document in a corpus, then its TF-IDF will be zero, because the IDF will be the log of 1, which is 0.)
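For readers who would rather see the sums as code, here is a minimal sketch of the rabbit example in Python. It is an illustration only, not the program behind Book Scraper: the function names and the made-up stand-in document are our own assumptions, and we have assumed the natural log, which is what gives an IDF of about 1.2 for 100/30.

```python
import math


def term_frequency(word, document_words):
    """Occurrences of `word` divided by the total number of words in the document."""
    return document_words.count(word) / len(document_words)


def inverse_document_frequency(total_documents, documents_containing_word):
    """Natural log of (documents in the corpus / documents containing the word)."""
    return math.log(total_documents / documents_containing_word)


# Hypothetical stand-in for a 1,000-word document that mentions 'rabbit' 11 times.
document = ["rabbit"] * 11 + ["filler"] * 989

tf = term_frequency("rabbit", document)      # 11 / 1000
idf = inverse_document_frequency(100, 30)    # ln(100 / 30)
score = tf * idf                             # TF-IDF is the product of the two

print(round(tf, 3), round(idf, 1), round(score, 3))  # prints: 0.011 1.2 0.013
```

Calling inverse_document_frequency(100, 100) returns 0.0, which is the 'appears in every document' case described above.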
Sentence length
One measure we toyed with initially was calculating average sentence length. We figured it would be interesting.
Many authors, such as the German Thomas Mann, are associated with writing long sentences, at least in the popular imagination. But to what extent did the database bear this out? It didn't appear too tricky a calculation, either - at least initially. Take the total word count, divide by the number of full stops and other relevant punctuation marks, and you should have it.
Pretty soon, though, we ran into difficulties.
What about when an author uses three full stops in a row - ... - a common way of indicating a pause in speech? Or authors who, for other stylistic reasons, used long strings of full stops? Including these types of punctuation would skew the calculation significantly.
We considered stripping out instances where more than one full stop appeared in a row before doing the calculation. But what about where three full stops, for instance, legitimately ended a sentence?
In the end, the potential for inaccuracy was too great, so we decided against analysing sentence length.
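The dilemma is easy to reproduce. The sketch below is purely illustrative (the function names and sample sentences are invented for the purpose, and it is not code we ran against the database). It compares the naive count with the 'strip out runs of full stops' idea, and neither gets both samples right.

```python
import re


def naive_average_sentence_length(text):
    """Total words divided by the number of '.', '!' and '?' characters."""
    return len(text.split()) / len(re.findall(r"[.!?]", text))


def stripped_average_sentence_length(text):
    """Same calculation, after deleting any run of two or more full stops."""
    cleaned = re.sub(r"\.{2,}", "", text)
    return len(cleaned.split()) / len(re.findall(r"[.!?]", cleaned))


pause = "Well... I suppose so. Do come in."   # two sentences, 7 words: true average 3.5
trailing = "He trailed off... Then he left."  # two sentences, 6 words: true average 3.0

print(naive_average_sentence_length(pause))        # 7 words / 5 full stops: too short
print(stripped_average_sentence_length(pause))     # 7 words / 2 full stops: correct here
print(stripped_average_sentence_length(trailing))  # 6 words / 1 full stop: too long
```

In the second sample the three full stops really do end a sentence, so stripping them merges two sentences into one.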
What about chapter headings?
A tough one, this.
It was very common for the publications in our collection to use roman numerals to mark out chapters - I, II, III etc. - even though these don't technically form part of the word count of the book. (The word counter will, by default, consider them words.)
So what to do about them?
You could tell the program to disregard any of the following combinations - I, II, III, IV and so on - but that would mean excising 'I' from where it appeared in other parts of the text.
In the end, we decided to excise the words Scene and Act where they were followed by a Roman numeral or a number. If not, we left them in. That means that any book which denotes its chapters I, II, III etc. without a preceding Chapter has a slightly inflated word count. It's something we had to live with.
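A rule of this kind can be sketched with a regular expression. What follows is a guess at the general shape rather than the script we actually used: the pattern, the function names and the decision to remove the numeral along with the heading word are all assumptions made for illustration.

```python
import re

# A heading word followed by a Roman numeral or an Arabic number, plus an
# optional trailing full stop. The heading words listed are an assumption.
HEADING = re.compile(r"\b(?:Act|ACT|Scene|SCENE)\s+(?:[IVXLCDM]+|\d+)\b\.?")


def strip_headings(text):
    """Replace each matched heading with a space so neighbouring words stay apart."""
    return HEADING.sub(" ", text)


def word_count(text):
    return len(text.split())


play = "ACT I. SCENE II. Enter Romeo."
novel = "III I saw him act."

print(strip_headings(play).split())       # ['Enter', 'Romeo.']
print(word_count(strip_headings(novel)))  # 5: the bare 'III' is still counted
```

The second example shows the compromise: a bare numeral has no heading word in front of it, so it stays in and nudges the word count up by one.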
What about stage directions?
Another challenge.
Shakespeare - who contributed most of the plays in our database - typically indicates which character is speaking by three or so 'capped' letters at the start of a line followed by a full stop. For instance, Benvolio is BENV. Hamlet is HAM.
We decided that, as these were very frequently appearing words that weren't actually spoken in the play, they should be removed for the purposes of wordcount, vocab, and most important word calculations.
How to do this, though, from a programming point of view?
Well, Gutenberg makes it a bit easier by always writing the stage directions with an indent. In that way, we were able to write a script which excised any word which took the form of two spaces/followed by/two or more letters in caps/followed by/a full stop.
The danger, of course, is that you could strip out a real word using the same method.
(The fact that stage directions are indented means that even if a 'legitimate' word ended a sentence in caps, it should survive, assuming there was only one space between it and the preceding word. Oscar Wilde's plays were trickier because in them, character names - when they appeared as stage directions - are not capitalised.)
We decided, however, that the risk of removing the odd word that shouldn't have come out was outweighed by the value of removing stage directions.
Ditto directions in square brackets. They came out too - at the risk of removing any other expressions that an author decided to put in square brackets. (A lengthy, though unscientific, analysis decided this was incredibly rare.)
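Both rules translate fairly directly into regular expressions. The sketch below is a reconstruction from the description above, not our original script, and the pattern and function names are hypothetical.

```python
import re

# Two spaces, then two or more capital letters, then a full stop.
SPEAKER_LABEL = re.compile(r" {2}[A-Z]{2,}\.")
# Anything enclosed in square brackets.
BRACKETED = re.compile(r"\[[^\]]*\]")


def strip_directions(text):
    """Remove indented speaker labels and bracketed directions."""
    text = SPEAKER_LABEL.sub(" ", text)
    return BRACKETED.sub(" ", text)


line = "  HAM. To be, or not to be. [Aside] I say NO. He left."
print(strip_directions(line).split())
# 'HAM.' and '[Aside]' are removed; 'NO.' survives because only one space precedes it.

print(strip_directions("He said  NO. Then").split())
# ['He', 'said', 'Then']: a real word after a double space is lost.
```

The last line is the risk described above in miniature: the pattern cannot tell a capitalised word after a double space from a speaker label.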
What about dashes/hyphens/apostrophes?
Dashes we treated as 'word separators'. In other words, the program removes the hyphen and separates the two words it joined.
That means that 'no-one' is not recognised as a word - a downside. On the other hand, scores of joined-up words that would otherwise have been treated as one word are instead treated as two.
Initially we approached apostrophes in the same way. We got into trouble, though, when we started running our tests and found that 'words' such as 'd' and 't' were cropping up as the most important words in earlier books. This was because, several centuries ago, the past participle of many verbs - which we now associate with the ending '-ed', as in 'missed', 'interpreted', 'helped' etc. - was often written with a 't or a 'd ('miss't' etc.).
In the end, we left apostrophes in words, so that Darcy and Darcy's appear as two separate words in Pride and Prejudice.
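As a rough illustration of the difference between the two approaches, here is a small sketch. The function names and the sample phrase are invented, and the real program may well have split words differently.

```python
import re


def tokenise(text):
    """Treat hyphens and dashes as separators; keep apostrophes inside words."""
    text = re.sub(r"-+", " ", text)
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)


def tokenise_splitting_apostrophes(text):
    """The approach we tried first: apostrophes also act as separators."""
    return re.findall(r"[A-Za-z]+", text)


phrase = "No-one miss'd Darcy's well-known letter"

print(tokenise(phrase))
# ['No', 'one', "miss'd", "Darcy's", 'well', 'known', 'letter']

print(tokenise_splitting_apostrophes(phrase))
# ['No', 'one', 'miss', 'd', 'Darcy', 's', 'well', 'known', 'letter']
```

The second list shows where the stray 'd' came from. The first keeps Darcy's as a word in its own right, separate from Darcy.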
How do you tell which publications are similar to other ones?
This was one of the more ambitious aspects of Book Scraper and, at the time of launch, is not fully finished. However, we had some very productive sessions with Flash programming, and we think it's worth presenting the early results.
The first thing to say about this calculation is that it is, in many ways, absurd.
The idea that one could say two publications are 'very similar' or otherwise based on a mathematical calculation will, to many people - not least those who love books - be offensive.
On the other hand, the discipline of corpus linguistics has for many years applied rigorous mathematical analysis to texts, and so long as one takes the field's inherent limitations on board, the insights it can offer in relation to word usage etc. can be extremely valuable.
Our 'most similar' book calculation starts by measuring the vocabulary - or total unique words - of each book. For any two books, it is then possible to measure the number of unique words they have in common.
That number of unique words can then be expressed as a fraction of each book's total vocabulary. For instance, if A has a vocabulary of 50 words, and B has a vocabulary of 100 words, and they have 25 words in common, then you can say: A contains 25 per cent of the words in B, but B contains 50 per cent of the words in A.
These two ratios are then averaged to give a measure of the similarity of books A and B, in vocab terms. If this average is higher for books A and C than for A and B, then we can say that A and C are more similar, in a vocabulary sense, than A and B.
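In code, the calculation might look something like the sketch below. The function name and the made-up vocabularies are ours, chosen to match the 50-word and 100-word example, and this is not the code behind the City University demos.

```python
def vocabulary_similarity(vocab_a, vocab_b):
    """Average of the shared words as a fraction of each book's vocabulary."""
    shared = len(vocab_a & vocab_b)
    share_of_a = shared / len(vocab_a)  # fraction of A's words that B also uses
    share_of_b = shared / len(vocab_b)  # fraction of B's words that A also uses
    return (share_of_a + share_of_b) / 2


# Hypothetical vocabularies: A has 50 unique words, B has 100, 25 in common.
book_a = {f"w{i}" for i in range(50)}
book_b = {f"w{i}" for i in range(25, 125)}

print(vocabulary_similarity(book_a, book_b))  # (0.5 + 0.25) / 2 = 0.375
```

Averaging the two ratios makes the score the same whichever book you start from, which a single ratio would not be.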
A team in the Department of Information Science at City University helped us produce some demos which show how these calculations could be turned into a web application. (Thanks to Jo, Jason, and Aiden in the Department of Information Science at City University for their time and expertise.)
The apps the City team created, which show the publications in our collection in clusters - with lines running between them - and in a 'tree diagram', are an extension of this calculation.
(At the time of launch the applets are still not complete, nor integrated with the rest of the site - you can try them out here, if you have Java installed - but they give an insight into the way our database could be manipulated with sufficient resources! One interesting thing to emerge from this calculation is how groups of books which are very similar to one another in vocab terms - such as the Shakespeare or Austen texts - often break away in little clusters.)