Library Guides: Text mining & text analysis: Preparing text for analysis

Cleaning and parsing text

Most text is created and stored so that humans can understand it, and it is not always easy for a computer to process that text. Before you begin a text analysis project, you often need to clean and parse the text to ensure it is in a format that a computer can use (machine readable).

Computers work well when a data source has structure or, at least, some regular patterns that they can identify. Most cleaning and parsing for text analysis involves increasing the regularity (for example, fixing typos) or adding structure (tagging certain words as important, or even splitting documents into sections that have special meaning, such as title, authors and chapters).
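The two ideas above, increasing regularity and adding structure, can be sketched in a few lines of Python. This is only an illustration: the sample text, the "Title:" / "Author:" / "Chapter N" layout, the typo being fixed and the function names are all assumptions made for the example, not features of any particular tool.

```python
import re

# Hypothetical raw text with a typo and irregular spacing.
RAW = """Title:  A Studdy of Whales
Author: J. Smith

Chapter 1
Whales are  large.

Chapter 2
Whales   swim.
"""

def normalise(text):
    """Increase regularity: fix a known typo and collapse repeated spaces."""
    text = text.replace("Studdy", "Study")  # hypothetical typo fix
    return re.sub(r"[ \t]+", " ", text)

def parse(text):
    """Add structure: split the text into title, author and chapters."""
    doc = {"title": None, "author": None, "chapters": {}}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Title:"):
            doc["title"] = line[len("Title:"):].strip()
        elif line.startswith("Author:"):
            doc["author"] = line[len("Author:"):].strip()
        elif re.fullmatch(r"Chapter \d+", line):
            current = line
            doc["chapters"][current] = []
        elif line and current is not None:
            doc["chapters"][current].append(line)
    return doc

document = parse(normalise(RAW))
print(document)
```

Real documents rarely follow such a tidy layout, so the patterns used to find sections usually need to be adapted to each source.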

You will need to know a bit about your analysis methods and the tools you'll be using before you know what type of cleaning you need to do. For example, some techniques and tools will be very precise when counting the individual words, and they may count a lower-case and an upper-case version of the same word separately.
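As a small illustration of the case-sensitivity point, the sketch below counts words with and without converting them to lower case. The sample sentence and the very simple tokenisation (removing full stops and splitting on whitespace) are assumptions made for the example.

```python
from collections import Counter

text = "The cat saw the other cat. THE END"
tokens = text.replace(".", "").split()  # very simple tokenisation

# Case-sensitive counting treats "The", "the" and "THE" as different words.
case_sensitive = Counter(tokens)

# Converting to lower case first merges them into a single entry.
case_folded = Counter(token.lower() for token in tokens)

print(case_sensitive["the"], case_sensitive["The"], case_sensitive["THE"])
print(case_folded["the"])
```

Whether to lower-case the text depends on the analysis: it helps with word frequencies, but it can hide distinctions that matter, such as proper nouns.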

Common text cleaning and parsing techniques

Removing stop words
Stop words are words which are filtered out before or after processing of natural language text, e.g. common words like "the", "and", "at" etc.
Stemming
The process of reducing inflected (or sometimes derived) words to their word stem, base or root form, e.g. "argue", "argued", "argues", "arguing", and "argus" reduce to the stem "argu".
Lemmatisation
The process of grouping together the different forms of a word so they can be analysed as a single item, e.g. "walk" is the lemma for "walked", "walks", "walking".
Part-of-speech tagging
Identifying words as nouns, verbs, adjectives, adverbs, etc. based on both their definition and their context.
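The first three techniques can be sketched with the Python standard library alone. Everything here is a simplified assumption for illustration: the stop word list is tiny, the stemmer strips a few suffixes and is not the Porter algorithm, and the lemmatiser is a hand-written lookup table rather than a real dictionary. Part-of-speech tagging is left out because it generally relies on a trained model from a dedicated language-processing library.

```python
import re

STOP_WORDS = {"the", "and", "at", "a", "of", "in", "to"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "es", "s")  # naive rules, NOT the Porter algorithm
LEMMAS = {"walked": "walk", "walks": "walk", "walking": "walk"}  # illustrative lookup

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def lemmatise(token):
    """Return the lemma from the lookup table, or the token unchanged."""
    return LEMMAS.get(token, token)

tokens = tokenize("She walked and walks at the park, walking daily")
content = remove_stop_words(tokens)
stems = [naive_stem(t) for t in content]
lemmas = [lemmatise(t) for t in content]

print(content)
print(stems)
print(lemmas)
```

For real projects, an established stemmer or lemmatiser (such as those offered by the tools listed later in this guide) will handle far more cases than these hand-written rules.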

Optical Character Recognition (OCR)

When acquiring textual data for text mining, you may find that your digitised copy of the text is not in a machine-readable format suited to text mining work (for example, a scanned image of a page). You may be able to use Optical Character Recognition (OCR) software to convert paper documents and non-machine-readable digital formats into machine-readable digital files.

Adobe Acrobat provides a built-in OCR tool, and free OCR software is also available on the internet. Before installing any free software, check carefully that it is free from viruses and from bundled 'freeware' that may cause issues.

Adobe Acrobat - OCR tool
Scan any paper document to PDF or open a scanned image. Acrobat will automatically perform optical character recognition, and the result can then be converted to a Word file (and from there to .txt or other common text file types).
Adobe Acrobat 8 Pro is installed on all computers in the Library, can be installed on any UQ computer and is available for staff to install on their home devices.
For more information see the UQ ITS Adobe Software page.
UQ Adobe Software page

Digitisation

Digitisation services at the Library

Tools for cleaning text data

OpenRefine
Explore, clean, transform, reconcile and match data.
Vard2
For cleaning historical texts
TextFixer
For changing case, removing whitespace and line breaks, sorting and converting text.
Porter stemmer online
For stemming text
Lexos
For cleaning, lemmatising and removing stop words

File conversion tools

Batch conversion of PDFs into spreadsheets
Tabula (converting data tables locked inside PDF files)
Transformer
Rescue texts from old file formats
Pandoc
An open-source, many-to-many format converter.
Trafilatura
For Python users. Use it in Python or from the command line. This solution allows the user to extract text in a minimal number of commands. For a more advanced process that is fully customisable, use a scraping or parsing tool such as Beautiful Soup.
htm2txt
For R users. This solution allows the user to extract text in a minimal number of commands. For a more advanced process that is fully customisable use a scraping or parsing tool such as rvest.
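To give a sense of what these HTML-to-text tools do, here is a minimal sketch using only Python's built-in html.parser module. It is not how Trafilatura, Beautiful Soup, htm2txt or rvest work internally; the class name, function name and sample HTML are assumptions made for the example, and real web pages need much more careful handling.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Return the visible text of an HTML string as one space-separated line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>First <b>bold</b> paragraph.</p>"
    "<script>var x = 1;</script></body></html>"
)
text = html_to_text(sample)
print(text)
```

Dedicated tools add a great deal on top of this, such as removing navigation menus and adverts, which is why they are usually the better choice for real projects.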