University of Melbourne · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

COMP20008 · Elements Of Data Processing

- one subject, every graph, every model, every mark
50% final exam · hurdle · 14 Chapters · 4-page Bible
Our own words - no uploaded lecturer files
Built to mirror S1 2026 · updated this semester
Chapter 3 of 12 · COMP20008

Web Crawling & Scraping

When the data you need lives on the web, you either crawl (systematically discover and fetch pages by following links to build coverage) or scrape (extract specific structured fields from a page's content). This chapter contrasts the two, walks the seed-URL → fetch → parse (DOM) → BeautifulSoup find_all → get_text → clean-matrix pipeline, and dissects the classic flawed crawler that never tracks visited URLs. It is examined as short-answer + critique: the 2024 exam's Q3 asks for differences between crawling and scraping, and the 2025 exam's Q2 asks why a crawler with no visited-set fails.

In this chapter

What this chapter covers

  • 1. Crawling: link-following frontier, breadth-first traversal, goal = coverage/indexing
  • 2. Scraping: targeted extraction of structured fields, goal = data
  • 3. robots.txt: the site owner declares which paths crawlers may/may not visit
  • 4. The scraping pipeline: seed URL → fetch raw HTML → parse (html.parser → DOM)
  • 5. Extraction: BeautifulSoup find_all() to select target nodes → get_text() to strip tags
  • 6. Cleaning: tokenise & standardise the extracted text → clean matrix for analysis
  • 7. The flawed crawler: no visited set / deduplication / termination → infinite loops and re-fetching
Worked example

Why a crawler with no visited-set fails (mirrors 2025 Q2)

Q [3 marks]. A student writes a crawler to collect 1000 unique pages from a website. It starts at a seed URL, fetches the page, extracts every link, and adds all of them to a queue to fetch next — but it never records which URLs it has already fetched. Explain why this design fails to reliably reach 1000 unique pages, and state the fix.
  • [1 mark] Because there is no visited set, the same URL is added to the queue every time it is linked from another page. Most pages link back to home, navigation and category pages, so the crawler re-fetches the same pages over and over.
  • [1 mark] Re-fetching wastes the budget on duplicates: the count of pages fetched climbs, but the count of unique pages stalls, so it may hit 1000 fetches while having seen far fewer than 1000 distinct pages. With cyclic links it can also loop indefinitely and never terminate cleanly.
  • [1 mark] Fix: maintain a visited set of already-fetched URLs and, before enqueuing or fetching a link, check it is not already in that set (deduplication). This guarantees each page is fetched once and gives a clean termination condition (stop at 1000 unique URLs).
Without a visited set the crawler re-fetches and loops on already-seen URLs, so unique coverage stalls and it may never terminate; the fix is to deduplicate against a visited set before fetching.
Sia tip: The term the marker is listening for is "visited set" (or deduplication). Name it explicitly and explain that it both prevents re-fetching and provides the termination condition.
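The failure and the fix in this worked example can be seen side by side in a minimal sketch. It assumes a tiny in-memory "site" (the `SITE` dictionary, the function names and the fetch cap are all hypothetical, introduced only for illustration) so that no network access is needed; it is not the exam's code, just one way to make the behaviour concrete.

```python
from collections import deque

# Hypothetical in-memory site: each URL maps to the links found on that page.
# Every page links back to "/", the way real navigation bars do.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/", "/b", "/c"],
    "/b": ["/", "/a"],
    "/c": ["/"],
}

def crawl_flawed(seed, max_fetches):
    """No visited set: every discovered link is enqueued, duplicates included."""
    queue = deque([seed])
    fetched = []
    while queue and len(fetched) < max_fetches:
        url = queue.popleft()
        fetched.append(url)        # stands in for fetching the page
        queue.extend(SITE[url])    # enqueue every link, seen or not
    return fetched

def crawl_fixed(seed, target_unique):
    """Visited set: each URL is enqueued (and so fetched) at most once."""
    queue = deque([seed])
    visited = {seed}
    fetched = []
    while queue and len(fetched) < target_unique:
        url = queue.popleft()
        fetched.append(url)
        for link in SITE[url]:
            if link not in visited:    # deduplicate before enqueuing
                visited.add(link)
                queue.append(link)
    return fetched

flawed = crawl_flawed("/", max_fetches=20)
fixed = crawl_fixed("/", target_unique=1000)
print(len(flawed), len(set(flawed)))   # 20 4
print(len(fixed), len(set(fixed)))     # 4 4
```

The flawed version spends all 20 fetches on a site that only has 4 pages, and only the artificial cap stops it. The fixed version stops on its own once the frontier is empty, because nothing already seen is ever enqueued again.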
Glossary

Key terms

Web crawling
Systematically discovering and fetching web pages by following links from a seed URL outward, building a frontier of URLs to visit. The goal is coverage/indexing, and it behaves like a breadth-first traversal of the web's link graph.
Web scraping
Extracting specific structured fields (prices, titles, dates) from the content of a page. The goal is data, not coverage; it is targeted parsing of one page's DOM rather than link-following across many.
robots.txt
A file placed at a website's root by the site owner that tells crawlers which paths are allowed or disallowed. Well-behaved crawlers read and respect it; it is the standard signal of what may be crawled.
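As a rough illustration of "read and respect it", Python's standard library ships `urllib.robotparser`. The sketch below assumes a made-up robots.txt body, a made-up user-agent name and example.com URLs; it parses the rules from a string so it runs without network access, whereas a real crawler would download the file from the site root first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())   # load the rules from the text above

def allowed(url, agent="student-crawler"):
    """Return True if the parsed rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

print(allowed("https://example.com/products/1"))      # True
print(allowed("https://example.com/private/report"))  # False
```

A polite crawler would run a check like `allowed(url)` before adding a URL to its frontier.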
DOM (Document Object Model)
The tree structure a parser builds from raw HTML, with elements as nested nodes. Scraping navigates the DOM to locate the nodes that hold the target fields.
BeautifulSoup find_all() / get_text()
The standard Python scraping moves: find_all() selects all DOM nodes matching a tag or attribute (the target fields), and get_text() strips the HTML tags from a node to return its plain text for cleaning.
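A small sketch of these two calls, assuming the third-party `beautifulsoup4` package is installed. The HTML fragment, class names and field names are invented for illustration; in a real scrape the string would be the raw HTML returned by the fetch step.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment standing in for fetched HTML.
HTML = """
<html><body>
  <div class="product"><h2>Blue Mug</h2><span class="price"> $12.50 </span></div>
  <div class="product"><h2>Red Mug</h2><span class="price"> $9.00 </span></div>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")   # parse raw HTML into a tree

rows = []
for product in soup.find_all("div", class_="product"):   # select target nodes
    title = product.find("h2").get_text(strip=True)       # strip tags and whitespace
    price = product.find("span", class_="price").get_text(strip=True)
    rows.append((title, price))

print(rows)   # [('Blue Mug', '$12.50'), ('Red Mug', '$9.00')]
```

Each tuple in `rows` is one extracted record, still as raw strings and ready for the cleaning step.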
Visited set
A record of URLs a crawler has already fetched. Checking each candidate link against it (deduplication) prevents re-fetching the same page, avoids infinite loops on cyclic links, and provides a clean termination condition.
FAQ

Web Crawling & Scraping FAQ

What is the core difference between crawling and scraping?

Crawling is about breadth: discovering and fetching many pages by following links, with coverage as the goal. Scraping is about depth on one page: extracting specific fields from its content, with data as the goal. Crawling is a breadth-first traversal of the link graph; scraping is targeted parsing of a single page's DOM. In practice you often crawl to find pages, then scrape each one.

Why does a crawler need a visited set?

Because pages link to each other (and back to home/navigation), the same URL gets discovered repeatedly. Without a visited set the crawler re-fetches duplicates, wastes its fetch budget, can loop forever on cycles, and never reliably reaches its target number of unique pages. Checking each link against the visited set before fetching guarantees one fetch per page and gives a termination condition.

What does the scraping-to-analysis pipeline look like?

Seed URL → fetch the raw HTML → parse it (html.parser) into a DOM tree → select the target nodes with BeautifulSoup find_all() → strip the tags with get_text() to get plain text → tokenise and standardise the text → assemble a clean matrix ready for analysis. Each stage is a step the exam can ask you to name or place in order.
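The last two stages (tokenise and standardise, then assemble a matrix) can be sketched with the standard library alone. This assumes that "clean matrix" means a document-term count matrix, which is one common reading rather than something the chapter specifies; the sample strings and the `tokenise` helper are hypothetical and stand in for the output of get_text().

```python
import re
from collections import Counter

# Hypothetical strings, standing in for text returned by get_text().
texts = [
    "  Blue Mug, ceramic. BLUE glaze!  ",
    "Red mug - ceramic",
]

def tokenise(text):
    """Standardise to lower case, then split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

token_lists = [tokenise(t) for t in texts]
vocabulary = sorted({tok for toks in token_lists for tok in toks})

# One row per document, one column per vocabulary term, cells are counts.
matrix = []
for toks in token_lists:
    counts = Counter(toks)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)   # ['blue', 'ceramic', 'glaze', 'mug', 'red']
for row in matrix:
    print(row)      # [2, 1, 1, 1, 0] then [0, 1, 0, 1, 1]
```

Lower-casing before tokenising is what makes "Blue" and "BLUE" land in the same column; without that standardisation step the matrix would treat them as different terms.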

How is crawling/scraping examined in COMP20008?

As short-answer comparison and critique, mirroring 2024 Q3 ("give differences between crawling and scraping") and 2025 Q2 ("why does this crawler fail / how would you fix it"). The marks reward precise vocabulary — frontier, coverage vs targeted extraction, robots.txt, DOM, find_all/get_text, visited set — and the ability to diagnose a flawed design rather than write code.

Study strategy

Exam move

Lock in the one-sentence contrast first: crawling = breadth/coverage by following links; scraping = depth/extraction of fields from one page. Then memorise the scraping pipeline as an ordered chain (seed → fetch → parse/DOM → find_all → get_text → clean matrix) so you can reproduce or reorder it on demand. The highest-frequency critique is the flawed crawler — rehearse explaining that without a visited set the crawler re-fetches, loops and never reaches unique-page targets, and that deduplication is the fix. Keep robots.txt and find_all/get_text in your active vocabulary, because naming the exact term is where the short-answer marks are.
