Book Word Analysis – a literary text mining project
- A fun project combining literature, statistics and programming
- Provide interesting insights into literature
- Create a reference for interesting literary texts by displaying hundreds of opening lines
- The project is divided into two parts:
- Batch processes running under Windows, displaying output in web pages or on the command line
- An Android app that uses the data created by the batch processing to enable interactive querying and display of research results on Android devices
- Files containing novels and other literary works in digitized form
- For the opening lines feature, various physical or digital books
- Various control files, such as a list of words that should be skipped
- Batch processing
- For opening lines, the input is parsed, converted to UTF-8, and written to an Android project file in the format expected by the app
- For the word search feature, each author's books are read and the text is cleaned up. The usage of each word is counted, as well as the total number of words and the number of distinct words
- The data is also stored in Android project files, in a manner that will speed up searching
- For the author similarity feature, similarity between all pairs of authors is calculated using an original algorithm
- In addition, the 3 most similar authors for each author are noted and inserted in the proper place in Android project files
- Windows file system files are created by the word search feature, to be used by the author similarity feature
- Web pages are created for the similarity table and for opening lines, to be used for quality control and for creating PDF files
- Android files are created. These files are later read by the app for searching and displaying the data on Android devices
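
The counting step described for the word search feature can be sketched as follows. This is only an illustration in Python, not the project's actual batch code: the function name `count_words`, the `skip_words` parameter (standing in for the control file of skipped words) and the simplified letters-only tokenizer are all assumptions.

```python
import re
from collections import Counter

def count_words(text, skip_words=frozenset()):
    """Return (per-word counts, total words, distinct words) for one author's text.

    Hypothetical helper: the real cleanup and file formats are not shown in this document.
    """
    # Lowercase the text and keep runs of ASCII letters only (a simplification).
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in skip_words]
    counts = Counter(words)
    # Total words counts every occurrence; distinct words counts each word once.
    return counts, len(words), len(counts)

counts, total, distinct = count_words(
    "The cat saw the dog. The dog ran.", skip_words={"the"}
)
print(counts["dog"], total, distinct)  # 2 5 4
```

Keeping the total alongside the per-word counts makes it possible to express usage relative to the size of each author's text.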
- Android app
- Uses files created in the batch process to perform fast dynamic word usage calculation and sorting
- Displays static information, some of it created automatically by the batch process, such as the full author similarity table
- Does not require internet access or any permissions
- Available on Google Play Store
- Who are the authors whose word usage is most similar to that of a specific author ?
- Conversely, which authors are the most different from a specific author ?
- Find groups of authors showing high similarity to each other
- Style: Which author uses a certain word more often ('presently') ?
- How rare or common is the usage of a specific word ?
- Which authors use French / Spanish / Latin etc. ?
- Propensity of authors for using a specific color
- Historically, when was a word first used ? Does it appear in Shakespeare / Old Testament (KJV) ?
- Conversely, which authors mention new technology or new words ?
- Which authors mention a specific geographical location or people ?
- Which authors use certain animal names more frequently ?
- Usage of words denoting a specific mood
- Style: Usage of common words (the, an)
- Usage of numbers, when spelled as words
- Usage of specific first names
- Words used by only one author
- Conversely, words used by all authors
- Usage of food related terms
- Usage of marine terms
- Usage of specific work activities
- Usage of legal terms
- Which authors mention a specific flower or other botanical terms ?
- Usage of words starting with a specific letter sequence ('qu')
- Which authors mention a specific historical figure ?
- Usage of religious terms
- Is it possible to detect if the writer is male or female, based on word usage ?
Batch processing currently not available in the Android app ('offline technology')
- Given an anonymous text, determine which author or book has the most similar word usage
- Compare word usage between any two texts
- Increase the number of works for each author
- Identify in which percentile of the text a word is used the most
- Make 'offline technology' available in the app
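
As an illustration of what comparing the word usage of two texts could look like, the sketch below uses cosine similarity over word counts. This is a standard stand-in technique chosen for the example; it is not the project's original similarity algorithm, which is not described here. The names `cosine_similarity` and `most_similar` are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two word-count mappings (0.0 to 1.0)."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)

def most_similar(anonymous_counts, author_counts):
    """Return the author name whose counts are closest to the anonymous text."""
    return max(
        author_counts,
        key=lambda name: cosine_similarity(anonymous_counts, author_counts[name]),
    )

# Toy data, invented for the example only.
anonymous = Counter("sea ship sail sea".split())
authors = {
    "Author B": Counter("sea ship storm".split()),
    "Author C": Counter("court judge law".split()),
}
print(most_similar(anonymous, authors))  # Author B
```

Cosine similarity ignores the overall length of each text, which is relevant when authors have different total word counts.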
- Only a limited number of books is used for each author, so the data does not represent the complete corpus of words used by that author
- Cleanup of the input text may be imperfect, so some extraneous text may have been included. No such problems are currently known
- The total number of words is not the same for all authors. This may skew both search and similarity results
- Most works are novels, but some nonfiction works and poems were included
- Works by Tolstoy and Kafka were translated from the original Russian and German, respectively. All other works were written in English
- Numerals and special symbols are removed. For example, terms consisting of multiple words separated by a hyphen are split into single words
- Characters with diacritics are converted to their non-diacritic equivalents
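
The last two cleanup rules can be sketched in Python as shown below. This is a minimal illustration under stated assumptions, not the project's actual cleanup code: the function name `clean_text` is hypothetical, and lowercasing the result is an assumption that the list above does not mention.

```python
import re
import unicodedata

def clean_text(text):
    """Strip diacritics, drop numerals and symbols, and split hyphenated terms."""
    # Decompose accented characters, then drop the combining marks (é becomes e).
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace every run of non-letters (digits, hyphens, punctuation) with a space,
    # so "well-known" becomes two separate words.
    letters_only = re.sub(r"[^A-Za-z]+", " ", no_marks)
    # Lowercasing is an assumption made for this sketch.
    return letters_only.lower().split()

print(clean_text("A well-known café, est. 1899!"))
# ['a', 'well', 'known', 'cafe', 'est']
```

Letters that have no Unicode decomposition, such as 'ø', are not handled by this approach and would need an explicit mapping.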