Text mining & text analysis

This guide contains resources for researchers about text mining and text analysis (sometimes known as distant reading).

Research using text mining and analysis

See these examples of researchers using text mining and analysis. The first example was research undertaken at the University of Queensland.

Text analysis in sport history research

Image: Cassius Clay (Muhammad Ali), as a young heavyweight contender from Louisville, Ky., May 17, 1962. Stanley Weston, Getty Images.

A Bird's-eye view of the past: Digital history, distant reading and sport history - This research investigates the utility of distant reading as a research tool via three newspaper case studies concerning Muhammad Ali, women’s surfing in Australia, and homophobic language and Australian sport. Distant reading is defined as an umbrella term that embraces many practices, including data mining, aggregation, text analysis, and the visual representations of these practices.

Text analysis in historical research

Image: Sculptor Edmonia Lewis (1844-1907) was the first woman of African- and Native-American descent to achieve notoriety in the fine arts world. She spent most of her career in Rome. Credit: Henry Rocher, National Portrait Gallery, Smithsonian Institution, Public Domain.

Rescued history: Massive text data analysis helps uncover black women's experiences - Researchers used high performance computers to analyze 20,000 documents from the HathiTrust and JSTOR databases that were known to contain information about black women. They used this analysis to build a computational model of that corpus, which they then applied to all 800,000 documents in both databases. To make sense of the huge datasets, the investigators used the computational techniques of topic modeling and data visualization.

Identifying social networks

Six degrees of Francis Bacon - Text mining the Oxford Dictionary of National Biography for relationships between early modern persons, documents, and institutions to create a digital reconstruction of the early modern social network of England. Researchers used Named-Entity Recognition (NER) to process the unstructured text into structured data (specifically a matrix of documents and named entities) that was amenable to statistical analysis. Researchers also applied statistical graph-learning methods and topic modeling to the structured data.
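
As a rough illustration of what a "matrix of documents and named entities" looks like, here is a minimal Python sketch. It is not the project's method: a real NER model is replaced by a simple lookup against a hypothetical list of names (`KNOWN_ENTITIES`), and the sample sentences are invented toy data.

```python
from itertools import combinations

# Hypothetical name list standing in for the output of a real NER model.
KNOWN_ENTITIES = ["Anne Smith", "John Carter", "Weavers Guild"]


def document_entity_matrix(docs, entities=KNOWN_ENTITIES):
    """Build a matrix with one row per document and one column per entity.

    A cell is 1 if the entity name appears in the document, otherwise 0.
    """
    return [[1 if name in doc else 0 for name in entities] for doc in docs]


def co_mentions(matrix, entities=KNOWN_ENTITIES):
    """Count how many documents mention each pair of entities together."""
    counts = {}
    for row in matrix:
        present = [entities[i] for i, flag in enumerate(row) if flag]
        for pair in combinations(present, 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


# Invented sample documents, for illustration only.
docs = [
    "Anne Smith wrote to John Carter about the harvest.",
    "John Carter joined the Weavers Guild.",
    "Anne Smith and John Carter both petitioned the Weavers Guild.",
]

matrix = document_entity_matrix(docs)
pairs = co_mentions(matrix)

for doc_row in matrix:
    print(doc_row)
print(pairs)
```

Pairs that are mentioned together in many documents are candidates for a link in a social network, which is the kind of structured input that statistical methods can then work on.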

Text analysis in the BioSciences

Image: Three-tier analysis of text-mined vs. curated gene-disease-variant triplets.

Text mining genotype-phenotype relationships from biomedical literature for database curation and precision medicine - Researchers developed a highly accurate machine-learning-based text mining approach for mining complete genotype-phenotype relationships from biomedical literature. Disease-gene-variant triplets were extracted from all abstracts in PubMed related to a set of ten important diseases. Mutations associated with the queried disease were identified using a machine-learning (ML) classification algorithm trained to detect disease-related mutations.

Text analysis in Business

Image: Big data use cases within business.

Three real-world applications of text mining to solve specific business problems - Text mining is being applied to answer business questions, to optimize day-to-day operational efficiencies, and to improve long-term strategic decisions. This article describes practical real-world instances where text mining has been successfully applied in three industries. One example is text mining (keyword and thematic analysis) of technicians' warranty repair comments to identify component defects, leading to informed interventions to prevent them in future.
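
To give a sense of what keyword analysis of repair comments can involve, here is a minimal Python sketch. It is an assumption-laden illustration rather than the article's approach: the comments are invented, the stopword list is a small hypothetical one, and `keyword_document_frequency` is a made-up helper that counts how many comments mention each word.

```python
import re
from collections import Counter

# Small hypothetical stopword list; real projects use much longer ones.
STOPWORDS = {"the", "a", "and", "of", "to", "was", "is", "on", "in", "after", "with"}


def keyword_document_frequency(comments, stopwords=STOPWORDS):
    """Count, for each keyword, the number of comments that contain it."""
    counts = Counter()
    for comment in comments:
        # Lowercase, split into words, and drop stopwords.
        # Using a set means each word counts once per comment.
        words = set(re.findall(r"[a-z]+", comment.lower())) - stopwords
        counts.update(words)
    return counts


# Invented technician comments, for illustration only.
comments = [
    "Replaced the fuel pump after leak",
    "Fuel pump noisy, replaced pump",
    "Door seal worn, replaced seal",
]

freq = keyword_document_frequency(comments)
for word, n in freq.most_common(3):
    print(word, n)
```

Words such as component names that recur across many comments can point an analyst toward parts worth investigating, although a frequency count alone does not show that a part is defective.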

Using the Google N-Gram corpus to measure cultural complexity

Using the Google N-Gram corpus to measure cultural complexity - Using the Google Books American 2Gram corpus, this study shows that, as predicted from the cumulative nature of culture, US culture has been steadily increasing in complexity, even when (for economic reasons) the amount of actual discourse as measured by publication volume decreases.
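
For readers new to the term, a 2-gram (bigram) is a pair of adjacent words. The following minimal Python sketch counts bigrams in an invented sentence. It only illustrates what an n-gram count is; it does not reproduce the study's complexity measure, and the `bigrams` helper and its simple tokenization are assumptions made for this example.

```python
import re
from collections import Counter


def bigrams(text):
    """Return the list of adjacent word pairs (2-grams) in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return list(zip(tokens, tokens[1:]))


# Invented sample text, for illustration only.
sample = "the cat sat on the mat and the cat slept"

counts = Counter(bigrams(sample))

print(counts[("the", "cat")])   # how often "the cat" occurs
print(len(counts))              # number of distinct bigrams
```

A corpus such as Google Books provides counts like these at very large scale, broken down by year of publication.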


TED Talk: What we learned from 5 million books (YouTube, 14m:08s). This video looks at the surprising things learnt from Google Labs' NGram Viewer.