
Text mining & text analysis

This guide contains resources for researchers about text mining and text analysis (sometimes known as distant reading).

Considerations - Ethics, Copyright, Licensing, Etiquette

Ethics

When accessing research data made available by other organisations, it is important that mining activities do not inadvertently disclose confidential information or breach the privacy of research subjects. Although primary responsibility for the ethical collection and storage of research data, and for access to it, sits with the data owner, it may be possible to filter data in ways that reveal confidential or identifying details. This is why some data owners require researchers to apply to use their data, or license its use via a formal agreement or a Creative Commons licence. Researchers must abide by the terms of use of any data they access.

Copyright

Depending on how mining is conducted (for example, whether material is copied, reformatted or digitised without permission), it could be considered a copyright infringement. Text and data mining relies heavily on 'copy-reliant' technologies, which must make copies of the data in order to analyse it. Currently the Copyright Act 1968 makes no specific exception for text or data mining.

Limited text mining might be covered by the fair dealing exceptions; however, copying an entire dataset would clearly exceed a 'reasonable portion' of the work.

While copyright does not apply to raw data or factual information, it does cover the arrangement of data within a database or the 'expression' of data, e.g. its presentation in a table.

Licence conditions

Each data provider has its own standards and procedures that you must follow in order to use its data legally. From the outset of your project, ensure that both the mining activities you intend to perform and the subsequent publication of your research results comply with the provider’s licensing terms and conditions.

For example, many data providers license their data to be mined for research purposes only, and either prohibit data mining with potential commercial applications or require special negotiation for it. If you have any questions about licensing conditions, or about negotiating permission with data providers for potential commercial applications of data mining, please contact your Librarian.

Online mining etiquette

Even if the licence permits it, some approaches to text and data mining are considered poor etiquette because of the inconvenience they can cause to data providers. For example, bulk scraping or programmatic querying via APIs without rate limiting can place a significant burden on data providers’ servers, causing slow response times or even downtime for other users. Best practice is to check the requirements of the data provider and comply with their preferences regarding data mining activities.
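
As an illustration of rate limiting, the following minimal Python sketch spaces out requests so that no two are sent less than a set interval apart. The function name fetch_politely, its parameters and the one-second default are assumptions made for this example only; they are not taken from any provider's documentation, and the appropriate interval should always come from the data provider's own terms.

```python
import time
from typing import Callable, Iterable, List


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_interval: float = 1.0,  # assumed default; use the provider's stated limit
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call fetch(url) for each URL, leaving at least min_interval
    seconds between the start of one request and the start of the next."""
    results = []
    last_call = None
    for url in urls:
        if last_call is not None:
            # Wait out whatever remains of the interval since the last request.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.append(fetch(url))
    return results
```

Here fetch is a placeholder for whatever function actually retrieves a document, for example a small wrapper around the provider's API client. Passing the clock and sleep functions as parameters is a design choice that lets the delay logic be checked without making real requests.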