COMP20008 · Elements Of Data Processing
Web Crawling & Scraping
When the data you need lives on the web, you either crawl (systematically discover and fetch pages by following links to build coverage) or scrape (extract specific structured fields from a page's content). This chapter contrasts the two, walks through the pipeline of seed URL → fetch → parse (DOM) → BeautifulSoup find_all → get_text → clean matrix, and dissects the classic flawed crawler that never tracks visited URLs. It is examined as short-answer and critique questions: the 2024 exam's Q3 asks for differences between crawling and scraping, and the 2025 exam's Q2 asks why a crawler with no visited set fails.
What this chapter covers
- 1. Crawling: link-following frontier, breadth-first traversal, goal = coverage/indexing
- 2. Scraping: targeted extraction of structured fields, goal = data
- 3. robots.txt: the site owner declares which paths crawlers may/may not visit
- 4. The scraping pipeline: seed URL → fetch raw HTML → parse (html.parser → DOM)
- 5. Extraction: BeautifulSoup find_all() to select target nodes → get_text() to strip tags
- 6. Cleaning: tokenise & standardise the extracted text → clean matrix for analysis
- 7. The flawed crawler: no visited set / deduplication / termination → infinite loops and re-fetching
Why a crawler with no visited-set fails (mirrors 2025 Q2)
- 1 mark: Because there is no visited set, the same URL is added to the queue every time it is linked from another page (and most pages link back to home, navigation and category pages), so the crawler re-fetches the same pages over and over.
- 1 mark: Re-fetching wastes the budget on duplicates: the count of pages fetched climbs, but the count of unique pages stalls, so it may hit 1000 fetches while having seen far fewer than 1000 distinct pages. With cyclic links it can also loop indefinitely and never terminate cleanly.
- 1 mark: Fix: maintain a visited set of already-fetched URLs and, before enqueuing or fetching a link, check it is not already in that set (deduplication). This guarantees each page is fetched once and gives a clean termination condition (stop at 1000 unique URLs).
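The three points above can be seen in a small simulation. This is a minimal sketch, not the exam's crawler: the `LINKS` dictionary is a hypothetical in-memory link graph standing in for real pages, the `crawl` function and its `use_visited` flag are illustrative names, and the fetch budget is 20 rather than 1000 to keep the output readable.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for a website:
# each key is a URL, each value is the list of links found on that page.
LINKS = {
    "/home": ["/about", "/products"],
    "/about": ["/home"],
    "/products": ["/home", "/item1", "/item2"],
    "/item1": ["/home", "/products"],
    "/item2": ["/home", "/products"],
}

def crawl(seed, max_fetches, use_visited):
    """Breadth-first crawl; returns (total fetches, set of unique pages fetched)."""
    frontier = deque([seed])
    visited = {seed}          # URLs already enqueued or fetched
    fetched = []              # every fetch, duplicates included
    while frontier and len(fetched) < max_fetches:
        url = frontier.popleft()
        fetched.append(url)   # stands in for downloading the page
        for link in LINKS.get(url, []):
            if use_visited:
                if link in visited:
                    continue  # deduplication: skip URLs we already know
                visited.add(link)
            frontier.append(link)
    return len(fetched), set(fetched)

flawed_total, flawed_unique = crawl("/home", max_fetches=20, use_visited=False)
fixed_total, fixed_unique = crawl("/home", max_fetches=20, use_visited=True)
print(flawed_total, len(flawed_unique))  # 20 fetches, 5 unique pages
print(fixed_total, len(fixed_unique))    # 5 fetches, 5 unique pages
```

Without the visited set the crawler burns its whole budget and only stops because of the fetch cap; with it, the frontier empties after each of the five pages has been fetched once.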
Key terms
- Web crawling: Systematically discovering and fetching web pages by following links from a seed URL outward, building a frontier of URLs to visit. The goal is coverage/indexing, and it behaves like a breadth-first traversal of the web's link graph.
- Web scraping: Extracting specific structured fields (prices, titles, dates) from the content of a page. The goal is data, not coverage; it is targeted parsing of one page's DOM rather than link-following across many.
- robots.txt: A file placed at a website's root by the site owner that tells crawlers which paths are allowed or disallowed. Well-behaved crawlers read and respect it; it is the standard signal of what may be crawled.
- DOM (Document Object Model): The tree structure a parser builds from raw HTML, with elements as nested nodes. Scraping navigates the DOM to locate the nodes that hold the target fields.
- BeautifulSoup find_all() / get_text(): The standard Python scraping moves: find_all() selects all DOM nodes matching a tag or attribute (the target fields), and get_text() strips the HTML tags from a node to return its plain text for cleaning.
- Visited set: A record of URLs a crawler has already fetched. Checking each candidate link against it (deduplication) prevents re-fetching the same page, avoids infinite loops on cyclic links, and provides a clean termination condition.
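To make the robots.txt term concrete, here is one way a crawler might check a URL before fetching it, using Python's standard-library `urllib.robotparser`. It is a sketch under stated assumptions: the rules in `ROBOTS_TXT`, the `example.com` URLs, the agent name `example-crawler` and the helper `allowed` are all hypothetical, and the rules are parsed from a string so that nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download the file
# from the site's root (set_url() then read()) instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, agent="example-crawler"):
    """Return True if the parsed rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

print(allowed("https://example.com/products/1"))     # True
print(allowed("https://example.com/private/admin"))  # False
```

A well-behaved crawler would run a check like this alongside its visited-set check, before a link is enqueued or fetched.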
Web Crawling & Scraping FAQ
What is the core difference between crawling and scraping?
Crawling is about breadth — discovering and fetching many pages by following links, with coverage as the goal — while scraping is about depth on one page — extracting specific fields from its content, with data as the goal. Crawling is a breadth-first traversal of the link graph; scraping is targeted parsing of a single page's DOM. In practice you often crawl to find pages, then scrape each one.
Why does a crawler need a visited set?
Because pages link to each other (and back to home/navigation), the same URL gets discovered repeatedly. Without a visited set the crawler re-fetches duplicates, wastes its fetch budget, can loop forever on cycles, and never reliably reaches its target number of unique pages. Checking each link against the visited set before fetching guarantees one fetch per page and gives a termination condition.
What does the scraping-to-analysis pipeline look like?
Seed URL → fetch the raw HTML → parse it (html.parser) into a DOM tree → select the target nodes with BeautifulSoup find_all() → strip the tags with get_text() to get plain text → tokenise and standardise the text → assemble a clean matrix ready for analysis. Each stage is a step the exam can ask you to name or place in order.
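The chain above can be sketched end to end in a few lines. Several things here are assumptions for illustration: `RAW_HTML` is a hypothetical page standing in for the fetched seed URL (so no network request is made), the `review` class name and the `tokenise` helper are invented, and "clean matrix" is interpreted as a simple document-term count matrix. BeautifulSoup is a third-party package (`beautifulsoup4`).

```python
import re
from collections import Counter
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical raw HTML standing in for the result of fetching a seed URL.
RAW_HTML = """
<html><body>
  <p class="review">Great phone, GREAT battery!</p>
  <p class="review">Battery died. Bad phone.</p>
  <p class="ad">Buy now</p>
</body></html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")   # parse raw HTML into a DOM tree
nodes = soup.find_all("p", class_="review")     # select the target nodes
texts = [node.get_text() for node in nodes]     # strip tags, keep plain text

def tokenise(text):
    """Standardise to lower case and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

token_lists = [tokenise(t) for t in texts]
vocab = sorted({tok for toks in token_lists for tok in toks})
# One row per document, one column per vocabulary term (raw counts).
matrix = [[Counter(toks)[term] for term in vocab] for toks in token_lists]

print(vocab)   # ['bad', 'battery', 'died', 'great', 'phone']
print(matrix)  # [[0, 1, 0, 2, 1], [1, 1, 1, 0, 1]]
```

Note that the advert paragraph never reaches the matrix: find_all() filters it out at the selection stage, which is the "targeted extraction" that distinguishes scraping from crawling.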
How is crawling/scraping examined in COMP20008?
As short-answer comparison and critique, mirroring 2024 Q3 ("give differences between crawling and scraping") and 2025 Q2 ("why does this crawler fail / how would you fix it"). The marks reward precise vocabulary — frontier, coverage vs targeted extraction, robots.txt, DOM, find_all/get_text, visited set — and the ability to diagnose a flawed design rather than write code.
Exam move
Lock in the one-sentence contrast first: crawling = breadth/coverage by following links; scraping = depth/extraction of fields from one page. Then memorise the scraping pipeline as an ordered chain (seed → fetch → parse/DOM → find_all → get_text → clean matrix) so you can reproduce or reorder it on demand. The highest-frequency critique is the flawed crawler — rehearse explaining that without a visited set the crawler re-fetches, loops and never reaches unique-page targets, and that deduplication is the fix. Keep robots.txt and find_all/get_text in your active vocabulary, because naming the exact term is where the short-answer marks are.