Library Guides: Text mining & text analysis: Web scraping

Web scraping as a source of text data

Web crawling

The process of building a collection of webpages by starting with an initial list of URLs (or links) and systematically processing each page to extract its content and any additional links to follow. Writing a web crawler requires basic programming knowledge.
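The crawling loop described above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production crawler: the names `LinkParser`, `fetch`, `crawl`, `fetch_page` and `max_pages` are chosen here for the example, and the page limit of 20 is an arbitrary assumption.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Default fetcher: download a page and return its HTML as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_urls, fetch_page=fetch, max_pages=20):
    """Breadth-first crawl; returns {url: html} for each page visited."""
    queue = deque(seed_urls)   # URLs waiting to be processed
    seen = set(seed_urls)      # URLs already queued, to avoid repeats
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

Calling `crawl(["https://example.org/"])` would start from that single seed URL. Before crawling a real site, check its robots.txt file and terms of use, and add a delay between requests.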

Web scraping

Used to extract text from webpages. Web scraping software is designed to recognise different types of content within a website and to acquire and store only the types of content specified by the user, e.g. article titles or authors from a news website, or prices and product descriptions from a commercial website. Scraping can be done with commercial software or with a programming language such as Python or R.
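As a rough illustration of storing only the content types a user asks for, the Python sketch below pulls article titles and authors out of a small piece of HTML and ignores everything else. The sample markup and the class names `title`, `author` and `summary` are invented for this example; a real website will use its own markup, which must be inspected first. The sketch also assumes the wanted elements do not contain nested elements of the same tag.

```python
from html.parser import HTMLParser


class ArticleScraper(HTMLParser):
    """Keeps only the text of elements whose class is in `wanted`."""

    def __init__(self, wanted=("title", "author")):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None   # (tag, class name) of the element being captured
        self.buffer = []      # text pieces collected for that element
        self.records = []     # list of (class name, text) pairs

    def handle_starttag(self, tag, attrs):
        if self.current is not None:
            return  # already capturing; treat nested tags as part of the text
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.wanted:
                self.current = (tag, name)
                self.buffer = []
                break

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if self.current is not None and tag == self.current[0]:
            # Join the pieces and collapse runs of whitespace
            text = " ".join("".join(self.buffer).split())
            self.records.append((self.current[1], text))
            self.current = None


def scrape(html, wanted=("title", "author")):
    """Return (content type, text) pairs for the requested content types."""
    scraper = ArticleScraper(wanted)
    scraper.feed(html)
    return scraper.records


# Hypothetical news-page fragment used only for demonstration
SAMPLE_HTML = """
<div class="story">
  <h2 class="title">Library opens new <em>data lab</em></h2>
  <span class="author">A. Writer</span>
  <p class="summary">Not requested, so not stored.</p>
</div>
"""

if __name__ == "__main__":
    for kind, text in scrape(SAMPLE_HTML):
        print(kind, ":", text)
```

Third-party libraries covered in the resources below offer more robust ways to select content, but the principle is the same: name the content types wanted and discard the rest.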

Web scraping and web crawling resources

The Ultimate Guide to Web Scraping for Non-Programmers
NVivo (with NCapture add-in)
Use NCapture, a web browser extension, to quickly and easily capture content like web pages, online PDFs and social media for analysis in NVivo.
10 Best Open Source Web Scrapers in 2023
9 Free Web Scrapers that you cannot miss
Downloading Web Pages with Python
Web Scraping with Python for Beginners

Automated Data Collection with R by Simon Munzert; Christian Rubba; Peter Meißner; Dominic Nyhuis
Publication Date: 2014
A hands-on guide to web scraping and text mining for both beginners and experienced users of R. Introduces fundamental concepts of the main architecture of the web and databases and covers HTTP, HTML, XML, JSON and SQL. Provides basic techniques to query web documents and data sets (XPath and regular expressions). Explores both supervised and unsupervised techniques as well as advanced techniques such as data scraping and text management.

What You Can Scrape and What Is Right to Scrape: A Proposal for a Tool to Collect Public Facebook Data
In reaction to the Cambridge Analytica scandal, Facebook has restricted access to its Application Programming Interface (API). This new policy has damaged the possibility for independent researchers to study relevant topics in political and social behavior. Yet, much of the public information that researchers may be interested in is still available on Facebook, and can still be systematically collected through web scraping techniques. The goal of this article is twofold. First, we discuss some ethical and legal issues that researchers should consider as they plan their collection and possible publication of Facebook data. In particular, we discuss what kind of information can be ethically gathered about the users (public information), what published data should look like to comply with privacy regulations (like the GDPR), and what consequences violating Facebook’s terms of service may entail for the researcher. Second, we present a scraping routine for public Facebook posts, and discuss some technical adjustments that can be performed for the data to be ethically and legally acceptable.