University of Melbourne · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

COMP20008 · Elements Of Data Processing

- one subject, every graph, every model, every mark
50% final exam · hurdle · 14 Chapters · 10-page Bible
Our own words - no uploaded lecturer files
Built to mirror S1 2026 · updated this semester
Chapter 12 of 12 · COMP20008

LLMs, Prompting, RAG & Verifiable Data

The subject closes on modern generative AI and on how to trust data. This chapter covers what makes an LLM generative (web-scale text → embeddings → neural recombination), hallucination, prompts vs fine-tuning, the AUTOMAT prompt schema, RAG (retrieving external documents to ground answers, and the content-partitioning problem), and the week-12 guest material on blockchain tamper-evidence and digital-signature credential verification. It is examined as short-answer: the 2024 exam's Q9 asks what makes an LLM generative and why it hallucinates, Q8 examines blockchain/signature verification, and the 2025 exam's Q12 covers AUTOMAT and RAG.

In this chapter

What this chapter covers

  • 1. What makes an LLM generative: web-scale text → embeddings (vectors) → neural network recombines to produce novel output
  • 2. Hallucination: predicting plausible-sounding tokens with no grounding in fact → confident but wrong
  • 3. Prompts vs fine-tuning: steer a frozen model in natural language (cheap) vs continued training (costly)
  • 4. The AUTOMAT prompt schema: Act-as, User, Targeted-action, Output, Mode, Atypical cases, Topic
  • 5. RAG: retrieve relevant external chunks → add to the prompt → generate a grounded answer
  • 6. The content-partitioning problem: chunking and permissioning so the right (and only permitted) content is retrieved
  • 7. Blockchain tamper-evidence: each block hashes the previous header → a change breaks the chain
  • 8. Verifiable credentials: public key + recompute hash + verify signature → authenticity and integrity
Worked example

Verifying a digitally signed credential (mirrors 2024 Q8b)

Q [4 marks]. The statement "Aisha graduated BSc 2024, University of Melbourne" is published with a digital signature. Explain how an employer can verify that UoM genuinely issued it and that it has not been altered, and how blockchain adds tamper-evidence.
  • [1 mark] Obtain UoM's public key (the counterpart to the private key UoM signed with).
  • [1 mark] Recompute the hash of the statement text exactly as published.
  • [1 mark] Verify the signature with the public key: if the signature decrypts to the same hash you recomputed, then UoM signed it (authenticity) and the statement is unaltered (integrity); if the hashes differ, it was tampered with or not signed by UoM.
  • [1 mark] Blockchain tamper-evidence: if the credential is recorded on a chain, each block's header contains a hash of the previous block's header, so changing any earlier data changes its hash and breaks the link in every subsequent block, which makes tampering detectable.
Get UoM's public key, recompute the statement's hash, and check the signature decrypts to that hash — a match proves authenticity and integrity; on a blockchain, each block hashing the previous header means any change breaks the chain and is detectable.
Sia tip — The verification skeleton is always the same three moves — public key → recompute hash → compare — and "each block hashes the previous header" is the one line that earns the blockchain mark.
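The three moves in the worked example can be sketched in a few lines of Python. This is a toy under loud assumptions: it uses textbook RSA with no padding and a tiny key built from two small known primes, so it is not secure and is not how UoM or any real issuer signs credentials. The function names `digest`, `sign` and `verify` are made up for illustration.

```python
import hashlib

# Toy RSA key pair built from two known Mersenne primes. These numbers are
# illustrative assumptions only: real keys are far larger and real schemes pad.
P, Q = 2**61 - 1, 2**31 - 1
N = P * Q                            # public modulus
E = 65537                            # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent, kept secret by the issuer


def digest(statement: str) -> int:
    """Hash the statement text and reduce it to a number below N."""
    raw = hashlib.sha256(statement.encode("utf-8")).digest()
    return int.from_bytes(raw, "big") % N


def sign(statement: str) -> int:
    """Issuer side: transform the hash with the private key."""
    return pow(digest(statement), D, N)


def verify(statement: str, signature: int) -> bool:
    """Employer side: recompute the hash, then compare it with the value
    recovered from the signature using the public key (E, N)."""
    return pow(signature, E, N) == digest(statement)


statement = "Aisha graduated BSc 2024, University of Melbourne"
signature = sign(statement)

print(verify(statement, signature))                         # statement as published
print(verify(statement.replace("BSc", "PhD"), signature))   # altered statement
```

The first check passes because the recovered hash matches the recomputed one. The second fails because the altered text hashes to a different value, which is the "hashes differ" branch of the answer.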
Glossary

Key terms

Generative LLM
A model trained on web-scale text, which is decomposed into fragments and turned into embeddings (vectors of numbers); a neural network then recombines them to generate novel text. "Generative" means it produces new output rather than retrieving a stored answer.
Embeddings
Numeric vector representations of words/fragments that place similar meanings near each other in vector space. They are the form in which an LLM internally represents text so a neural network can operate on it.
Hallucination
When an LLM produces confident, plausible-sounding output that is factually wrong, because it predicts likely tokens from statistical patterns with no grounding in truth. It is worsened by gaps or biases in the training data.
AUTOMAT prompt schema
A mnemonic for the components of a good prompt: Act-as (role), User (persona/audience), Targeted-action, Output (definition/format), Mode (tone), Atypical cases (edge cases), Topic (extra context). It structures a prompt for clearer, more reliable results.
Retrieval-Augmented Generation (RAG)
Supplies the LLM with external, user-specific documents at query time: retrieve relevant chunks from a RAG store, add them to the prompt, then generate a grounded answer. It reduces hallucination by anchoring the model in real source material.
Content-partitioning problem
In RAG, the challenge of splitting documents into chunks and isolating/permissioning them so the right — and only the permitted — content is retrieved. Done poorly it hurts relevance or leaks one user's/tenant's data to another.
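To make the AUTOMAT entry above concrete, here is a minimal sketch that treats the seven components as fields and joins them into one prompt. The class name `AutomatPrompt`, the `render` method and every example value are hypothetical; the schema itself only names the components.

```python
from dataclasses import dataclass


@dataclass
class AutomatPrompt:
    act_as: str           # A: role the model should play
    user: str             # U: persona / audience
    targeted_action: str  # T: what the model should do
    output: str           # O: output definition / format
    mode: str             # M: tone
    atypical_cases: str   # A: edge cases
    topic: str            # T: extra context

    def render(self) -> str:
        """Join the seven labelled components into one prompt string."""
        labels = ["Act as", "User", "Targeted action", "Output",
                  "Mode", "Atypical cases", "Topic"]
        values = [self.act_as, self.user, self.targeted_action, self.output,
                  self.mode, self.atypical_cases, self.topic]
        return "\n".join(f"{label}: {value}" for label, value in zip(labels, values))


prompt = AutomatPrompt(
    act_as="a patient data-science tutor",
    user="a second-year student revising for an exam",
    targeted_action="explain what a hash function does",
    output="three short bullet points",
    mode="friendly and plain",
    atypical_cases="if asked about anything else, say it is out of scope",
    topic="tamper-evidence in blockchains",
).render()

print(prompt)
```

Reading down the first letter of each printed line spells AUTOMAT, which is a quick way to check that no component was skipped.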
FAQ

LLMs, Prompting, RAG & Verifiable Data FAQ

What makes a large language model "generative"?

It is trained on web-scale text that is broken into fragments and converted into embeddings — vectors of numbers — and a neural network learns to recombine them to produce new text. "Generative" means it creates novel output token by token rather than looking up and returning a stored answer. The same idea extends to images and other modalities. The 2024 Q9a wants this chain: web text → embeddings → neural recombination → novel output.
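A small sketch can show what "embeddings, vectors of numbers" buys you: similar meanings end up close together. The three vectors below are hand-made for illustration, not taken from any real model; real embeddings are learned during training and have hundreds or thousands of dimensions.

```python
import math

# Hand-made 3-dimensional vectors, purely illustrative.
embeddings = {
    "cat": [0.90, 0.80, 0.10],
    "kitten": [0.85, 0.90, 0.15],
    "invoice": [0.10, 0.20, 0.95],
}


def cosine(a, b):
    """Cosine similarity: close to 1 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(round(cosine(embeddings["cat"], embeddings["kitten"]), 2))
print(round(cosine(embeddings["cat"], embeddings["invoice"]), 2))
```

With these made-up numbers, "cat" and "kitten" score much higher than "cat" and "invoice", which is the sense in which related fragments sit near each other in vector space.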

Why do LLMs hallucinate?

Because they generate text by predicting the most statistically plausible next tokens, not by checking facts. With no grounding in a source of truth, the model can produce fluent, confident output that is simply wrong — inventing citations, dates or details. Gaps and biases in the training data make it worse. This is exactly why RAG exists: feeding the model relevant retrieved documents grounds its answers and reduces hallucination.
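The mechanism can be imitated with a deliberately tiny next-word predictor. This is only an analogy under stated assumptions: the three-sentence corpus is invented, and a real LLM is a neural network rather than a lookup of word counts. What carries over is that the predictor continues a pattern and never checks a fact.

```python
from collections import Counter, defaultdict

# Invented training text: the model sees word sequences, never a table of facts.
corpus = [
    "the capital of france is paris",
    "the capital of italy is rome",
    "the largest city in australia is sydney",
]

# Count which word follows each pair of words (a toy trigram model).
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for a, b, nxt in zip(words, words[1:], words[2:]):
        following[(a, b)][nxt] += 1


def predict_next(prompt: str) -> str:
    """Return the most frequent continuation of the prompt's last two words."""
    context = tuple(prompt.split()[-2:])
    candidates = following[context]
    if not candidates:
        return "<unknown>"
    return candidates.most_common(1)[0][0]


print(predict_next("the capital of australia is"))  # prints "sydney"
```

The only thing the toy model has seen after "australia is" is "sydney", so it answers fluently and wrongly (the capital is Canberra). Nothing in the prediction step compares the output against a source of truth.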

What is RAG and what is the content-partitioning problem?

Retrieval-Augmented Generation gives the model external documents at query time: it retrieves the most relevant chunks from a RAG store, adds them to the prompt, and generates an answer grounded in that source material instead of relying on memorised patterns. The content-partitioning problem is how to split documents into chunks and isolate or permission them so the right content — and only the content the user is allowed to see — is retrieved. Bad partitioning hurts relevance or, worse, leaks one user's or tenant's data into another's answers.
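The pipeline and the partitioning concern can be sketched together. Everything here is a simplifying assumption: the store is a hard-coded list, the tenant names are invented, retrieval ranks by plain word overlap instead of embedding similarity, and the final generate step (sending the prompt to an LLM) is left out.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    tenant: str  # who is permitted to see this chunk


# Hypothetical RAG store: documents already split into chunks and tagged by tenant.
STORE = [
    Chunk("Acme refund policy: refunds are issued within 14 days.", "acme"),
    Chunk("Acme office hours are 9am to 5pm on weekdays.", "acme"),
    Chunk("Globex refund policy: refunds are issued within 30 days.", "globex"),
]


def tokens(text: str) -> set:
    """Lower-case the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, tenant: str, k: int = 1) -> list:
    """Filter by permission first, then rank the permitted chunks by word overlap."""
    permitted = [c for c in STORE if c.tenant == tenant]
    q = tokens(question)
    ranked = sorted(permitted, key=lambda c: len(q & tokens(c.text)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, tenant: str) -> str:
    """Augment: place the retrieved chunks in front of the question."""
    context = "\n".join(c.text for c in retrieve(question, tenant))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


prompt = build_prompt("What is the refund policy?", "acme")
print(prompt)
```

Filtering by tenant before ranking is the design choice that matters here: the Globex chunk matches the question just as well as the Acme one, so ranking first and filtering later (or not at all) is exactly how one tenant's data could leak into another's answer.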

How does blockchain make data tamper-evident?

Each block's header contains a cryptographic hash of the previous block's header, forming a chain. If anyone alters the data in an earlier block, that block's hash changes, which no longer matches the hash stored in the next block, breaking the link — and every subsequent header would have to be recomputed to hide it. So any tampering is detectable. Combined with digital signatures (public key → recompute hash → verify), this is how verifiable credentials prove both authenticity and integrity, which the 2024 Q8 examines directly.
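The chain-of-hashes idea can be checked directly with a short sketch. It assumes a heavily simplified block, where the "header" is just the previous hash plus the block's data; real blockchains put more in the header and add consensus rules that are not modelled here. The function names and the records are made up for illustration.

```python
import hashlib

GENESIS = "0" * 64  # placeholder "previous hash" for the first block


def header_hash(prev_hash: str, data: str) -> str:
    """Hash a simplified header: the previous header's hash plus this block's data."""
    return hashlib.sha256(f"{prev_hash}|{data}".encode("utf-8")).hexdigest()


def build_chain(records):
    """Link the records so each block stores the hash of the one before it."""
    chain, prev = [], GENESIS
    for data in records:
        current = header_hash(prev, data)
        chain.append({"prev_hash": prev, "data": data, "hash": current})
        prev = current
    return chain


def first_broken_block(chain):
    """Return the index of the first block whose link fails, or None if intact."""
    prev = GENESIS
    for i, block in enumerate(chain):
        if block["prev_hash"] != prev or header_hash(prev, block["data"]) != block["hash"]:
            return i
        prev = block["hash"]
    return None


chain = build_chain(["credential A", "credential B", "credential C"])
print(first_broken_block(chain))   # intact chain

chain[1]["data"] = "credential B (forged)"
print(first_broken_block(chain))   # the altered block no longer matches its hash

# Recomputing only the altered block's hash just moves the break one block along.
chain[1]["hash"] = header_hash(chain[1]["prev_hash"], chain[1]["data"])
print(first_broken_block(chain))
```

The last step shows why hiding a change is expensive: fixing the forged block's own hash makes the next block's stored previous-hash wrong, so every later header would need recomputing too.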

Study strategy

Exam move

Treat this chapter as definition-and-explanation territory rather than calculation. Lock in the generative chain (web text → embeddings → neural recombination → novel output) and the hallucination explanation (predicts plausible tokens, no grounding in truth) as paired answers, because 2024 Q9 asks them together. Memorise AUTOMAT as a labelled list and be ready to describe three components with examples (2025 Q12a). For RAG, learn both halves: the retrieve-augment-generate pipeline and the content-partitioning problem (chunking and permissioning so retrieval stays relevant and does not leak data). For the guest-lecture thread, drill the two verification skeletons until they are automatic: blockchain (each block hashes the previous header, so a change breaks the chain) and digital signatures (public key → recompute hash → compare). These are quick, high-yield short-answer marks that close the exam.
