Book Word Analysis – a literary text mining project

Purpose

Technical Overview

  1. The project is divided into two parts:
    1. Batch processes running under Windows, displaying output in web pages or on the command line
    2. An Android app that uses the data created by the batch processes to support interactive querying and display of research results on Android devices
  2. Input
    1. Files containing novels and other literary works in digitized form
    2. For the opening lines feature, various physical or digital books
    3. Various control files, such as a list of words to be skipped
  3. Batch processing
    1. For opening lines, the input is parsed, its encoding is converted to UTF-8, and the result is written to an Android project file in the format expected by the app
    2. For the word search feature, each author's books are read and the text is cleaned up. The usage of each word is counted, along with the total number of words and the number of different words
    3. The data is also stored in Android project files, in a manner that will speed up searching
    4. For the author similarity feature, similarity between all pairs of authors is calculated using an original algorithm
    5. In addition, the three most similar authors for each author are noted and inserted in the proper place in the Android project files
  4. Output
    1. Windows file system files are created by the word search feature, to be used by the author similarity feature
    2. Web pages are created for the similarity table and for opening lines, to be used for quality control and for creating PDF files
    3. Android files are created. These files are later read by the app for searching and displaying the data on Android devices
  5. Android app
    1. Uses files created in the batch process to perform fast dynamic word usage calculation and sorting
    2. Displays static information, some of it created automatically by the batch process, such as the full author similarity table
    3. Does not require internet access or any permissions
    4. Available on Google Play Store
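
The counting step and the selection of the most similar authors described under Batch processing can be sketched as follows. This is only an illustration in Python, not the project's code: the function names, the shape of the similarity table (a score per pair of authors, higher meaning more similar) and the choice to leave skipped words out of the totals are all assumptions. The similarity scores themselves come from the project's own algorithm, which is not reproduced here.

```python
from collections import Counter


def count_words(words, skip_words=frozenset()):
    """Count usage of each word, ignoring words named in a skip list."""
    # Assumption: skipped words are excluded from the totals as well.
    counts = Counter(w for w in words if w not in skip_words)
    total_words = sum(counts.values())   # total number of words
    different_words = len(counts)        # number of different words
    return counts, total_words, different_words


def most_similar(similarity, author, n=3):
    """Return the n authors most similar to `author`.

    `similarity` is a hypothetical table mapping a frozenset of two
    author names to a score, where a higher score means more similar.
    """
    scores = []
    for pair, score in similarity.items():
        if author in pair:
            (other,) = pair - {author}
            scores.append((score, other))
    # Highest score first; ties are broken by name to keep output stable.
    scores.sort(key=lambda item: (-item[0], item[1]))
    return [other for _, other in scores[:n]]


words = "it was the best of times it was the worst of times".split()
counts, total, different = count_words(words, skip_words={"the", "of"})
print(counts["times"], total, different)  # prints: 2 8 5
```

Keying the table by an unordered pair means each pair of authors is stored once, which matches the idea of calculating similarity between all pairs.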

Potential uses

  1. Who are the authors whose word usage is most similar to that of a specific author?
  2. Conversely, which authors are the most different from a specific author?
  3. Find groups of authors showing high similarity to each other
  4. Style: Which author uses a certain word more often ('presently')?
  5. How rare or common is the usage of a specific word?
  6. Which authors use French / Spanish / Latin etc.?
  7. Propensity of authors for using a specific color
  8. Historically, when was a word first used? Does it appear in Shakespeare / Old Testament (KJV)?
  9. Conversely, which authors mention new technology or new words?
  10. Which authors mention a specific geographical location or people?
  11. Which authors use certain animal names more frequently?
  12. Usage of words denoting a specific mood
  13. Style: Usage of common words (the, an)
  14. Usage of numbers, when spelled as words
  15. Usage of specific first names
  16. Words used by only one author
  17. Conversely, words used by all authors
  18. Usage of food related terms
  19. Usage of marine terms
  20. Usage of specific work activities
  21. Usage of legal terms
  22. Which authors mention a specific flower or other botanical terms?
  23. Usage of words starting with a specific letter sequence ('qu')
  24. Which authors mention a specific historical figure?
  25. Usage of religious terms
  26. Is it possible to detect if the writer is male or female, based on word usage?

Batch processing currently not available in the Android app ('offline technology')

  1. Given an anonymous text, determine which author or book has the most similar word usage
  2. Compare word usage between any two texts
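
As an illustration of the two tasks above, the sketch below compares texts by the cosine similarity of their relative word frequencies and picks the author closest to an anonymous text. Cosine similarity is a common stand-in chosen for this example; it is not the project's original similarity algorithm, and all function names here are hypothetical.

```python
import math
from collections import Counter


def relative_frequencies(words):
    """Map each word to its share of all words in the text."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity of two word-frequency maps (0.0 if either is empty)."""
    dot = sum(v * freq_b.get(w, 0.0) for w, v in freq_a.items())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_author(anonymous_words, author_words):
    """Return the author whose word usage is closest to the anonymous text.

    `author_words` maps an author name to the list of words in that
    author's cleaned-up texts.
    """
    target = relative_frequencies(anonymous_words)
    return max(
        author_words,
        key=lambda a: cosine_similarity(target, relative_frequencies(author_words[a])),
    )
```

Using relative frequencies rather than raw counts reduces the effect of authors having different total word counts, though it does not remove it.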

Future objectives

  1. Increase the number of works for each author
  2. Identify in which percentile of the text a word is used the most
  3. Make 'offline technology' available in the app
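
One possible way to approach the percentile objective is sketched below: the text is divided into equal parts by word position, and the part containing the most occurrences of the word is reported. The function name, the 1-based numbering and the handling of ties are assumptions made for the example.

```python
def peak_percentile(words, target, buckets=100):
    """Return the 1-based part of the text where `target` occurs most often.

    Returns None if the text is empty or the word never occurs.
    """
    if not words:
        return None
    counts = [0] * buckets
    for position, word in enumerate(words):
        if word == target:
            # Map the word position to a bucket index in 0..buckets-1.
            counts[position * buckets // len(words)] += 1
    highest = max(counts)
    if highest == 0:
        return None
    # When several parts tie, the earliest one is reported.
    return counts.index(highest) + 1
```

For short texts, a smaller value of buckets (for example 10) gives more meaningful results than 100, since each part then holds more words.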

Limitations

  1. Only a limited number of books is used for each author, so the data does not represent the complete corpus of words used by the author
  2. Cleanup of the input text may be imperfect, so some extraneous text may have been included. No such problems are currently known
  3. The total number of words is not the same for all authors. This may skew both search and similarity results
  4. Most works are novels, but some nonfiction and poems are included
  5. Works by Tolstoy and Kafka were translated from the original Russian/German. All other works were written in English
  6. Numerals and special symbols are removed. For example, terms containing multiple words separated by hyphens are split into single words
  7. Words containing diacritics are converted to non-diacritic characters
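
The effect of the last two limitations can be seen in a small sketch of the cleanup. This is not the project's cleanup code: the function names are hypothetical, and lowercasing and treating every non-letter (including apostrophes) as a separator are assumptions made for the example.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Replace accented letters with their unaccented base letters."""
    # NFKD splits a letter such as 'é' into 'e' plus a combining accent;
    # the combining marks are then dropped. Letters with no such
    # decomposition (for example 'ø') are not handled by this sketch.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_words(text):
    """Split text into words, dropping numerals and special symbols."""
    text = strip_diacritics(text)
    # Anything that is not an ASCII letter becomes a separator, so
    # hyphenated terms are split into single words.
    text = re.sub(r"[^A-Za-z]+", " ", text)
    return text.lower().split()


print(clean_words("Café-au-lait, 1984!"))  # prints: ['cafe', 'au', 'lait']
```

As the example shows, a search for 'café' and a search for 'cafe' cannot be told apart after cleanup, and the numeral is lost entirely.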

Copyright 2019 starry-side.com   starry.side@gmail.com