Text mining & text analysis

This guide contains resources for researchers about text mining and text analysis (sometimes known as distant reading).

Cleaning and parsing text

Most text is created and stored so that humans can understand it, and it is not always easy for a computer to process that text. Before you begin a text analysis project, you often need to clean and parse the text to ensure it is in a format that a computer can use (machine readable).

Computers work well when a data source has structure or, at least, some regular patterns that they can identify. Most cleaning and parsing for text analysis involves increasing the regularity (for example, fixing typos) or adding structure (tagging certain words as important, or splitting documents into sections that have special meaning, such as title, authors, and chapters).
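As a rough illustration of both ideas, the sketch below first increases regularity (collapsing irregular spacing) and then adds structure (splitting a document into title, author, and chapters). The sample text, the "Title:" and "Author:" labels, and the function names are all assumptions made up for this example; real documents will need rules that match their own layout.

```python
import re

# Hypothetical raw document; the "Title:" / "Author:" labels and
# "Chapter N" headings are assumed purely for illustration.
raw = """Title:  A Study   of Whales
Author: J. Smith

Chapter 1
The whale   is a large  mammal.

Chapter 2
Whales live in the ocean.
"""

def normalize_whitespace(text):
    """Increase regularity: collapse runs of spaces and tabs into one space."""
    return re.sub(r"[ \t]+", " ", text).strip()

def parse_document(text):
    """Add structure: split the text into a title, an author, and chapters."""
    doc = {"title": None, "author": None, "chapters": {}}
    current = None  # name of the chapter currently being read
    for line in text.splitlines():
        line = normalize_whitespace(line)
        if not line:
            continue  # skip blank lines
        if line.startswith("Title:"):
            doc["title"] = line[len("Title:"):].strip()
        elif line.startswith("Author:"):
            doc["author"] = line[len("Author:"):].strip()
        elif re.fullmatch(r"Chapter \d+", line):
            current = line
            doc["chapters"][current] = []
        elif current is not None:
            doc["chapters"][current].append(line)
    return doc

parsed = parse_document(raw)
print(parsed["title"])
print(parsed["chapters"]["Chapter 1"])
```

Once the text is held in a structure like this, an analysis tool can work with chapters or titles separately instead of treating the whole file as one undifferentiated string.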

You will need to know a bit about your analysis methods and the tools you'll be using before you can decide what type of cleaning to do. For example, some techniques and tools count individual words exactly as they appear, so they may count the lower-case and upper-case versions of the same word separately.
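The effect of letter case on word counts can be seen in a small sketch. The sample sentence and the simple letters-only tokenizer are assumptions for illustration; the text analysis tool you choose may split words differently.

```python
import re
from collections import Counter

# Hypothetical sample text for illustration.
text = "The cat sat. the dog barked at The cat."

def tokenize(s):
    """Split text into words made of letters and apostrophes (a simplifying assumption)."""
    return re.findall(r"[A-Za-z']+", s)

# Case-sensitive count: "The" and "the" are treated as different words.
case_sensitive = Counter(tokenize(text))

# Case-insensitive count: lower-casing first merges the two forms.
case_insensitive = Counter(tokenize(text.lower()))

print(case_sensitive["The"], case_sensitive["the"])
print(case_insensitive["the"])
```

Whether to lower-case depends on the analysis: merging case usually gives better word frequencies, but it also removes the distinction between, say, a proper noun and a common word that are spelled the same.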

Common text cleaning and parsing techniques

Optical Character Recognition (OCR)

When you acquire textual data for text mining, your copy of the text may not be in a machine-readable format suited to text mining work. You may be able to use Optical Character Recognition (OCR) software to convert paper documents and non-machine-readable digital formats (such as scanned images) into machine-readable digital files.

Adobe Acrobat provides a built-in OCR tool, and free OCR software is also available on the internet. Check carefully that any free software you download is free from viruses and from bundled 'freeware' that may cause issues.

Tools for cleaning text data