COMP20008 · Elements of Data Processing
LLMs, Prompting, RAG & Verifiable Data
The subject closes on modern generative AI and on how to trust data. This chapter covers what makes an LLM generative (web-scale text → embeddings → neural recombination), hallucination, prompts vs fine-tuning, the AUTOMAT prompt schema, RAG (retrieving external documents to ground answers, plus the content-partitioning problem), and the week-12 guest material on blockchain tamper-evidence and digital-signature credential verification. The material is examined through short-answer questions: in the 2024 exam, Q9 asks what makes an LLM generative and why it hallucinates and Q8 examines blockchain/signature verification; in the 2025 exam, Q12 covers AUTOMAT and RAG.
What this chapter covers
- 1. What makes an LLM generative: web-scale text → embeddings (vectors) → neural network recombines to produce novel output
- 2. Hallucination: predicting plausible-sounding tokens with no grounding in fact → confident but wrong
- 3. Prompts vs fine-tuning: steer a frozen model in natural language (cheap) vs continued training (costly)
- 4. The AUTOMAT prompt schema: Act-as, User, Targeted-action, Output, Mode, Atypical cases, Topic
- 5. RAG: retrieve relevant external chunks → add to the prompt → generate a grounded answer
- 6. The content-partitioning problem: chunking and permissioning so the right (and only permitted) content is retrieved
- 7. Blockchain tamper-evidence: each block hashes the previous header → a change breaks the chain
- 8. Verifiable credentials: public key + recompute hash + verify signature → authenticity and integrity
Verifying a digitally signed credential (mirrors 2024 Q8b)
- (1 mark) Obtain UoM's public key (the counterpart to the private key UoM signed with).
- (1 mark) Recompute the hash of the statement text exactly as published.
- (1 mark) Verify the signature with the public key: if the signature decrypts to the same hash you recomputed, then UoM signed it (authenticity) and the statement is unaltered (integrity); if the hashes differ, it was tampered with or not signed by UoM.
- (1 mark) Blockchain tamper-evidence: if the credential is recorded on a chain, each block's header contains a hash of the previous block's header, so changing any earlier data changes its hash and breaks the link in every subsequent block, which makes tampering detectable.
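The first three steps above can be sketched in Python. This is only an illustration under stated assumptions: it uses textbook RSA with a deliberately tiny key built from two small primes, and the names `P`, `Q`, `hash_statement`, `sign`, `verify` and the statement text are all hypothetical. A real credential system would rely on a vetted cryptography library, padding and far larger keys. The modular-inverse form of `pow` needs Python 3.8 or later.

```python
import hashlib

# Hypothetical toy key pair built from two small Mersenne primes.
# Far too small to be secure; for illustration only.
P, Q = 2**61 - 1, 2**31 - 1
N = P * Q                            # public modulus
E = 65537                            # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent, known only to the signer


def hash_statement(text: str) -> int:
    """Hash the statement exactly as published, reduced mod N for the toy key."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % N


def sign(text: str) -> int:
    """Signer side: transform the hash with the private key."""
    return pow(hash_statement(text), D, N)


def verify(text: str, signature: int) -> bool:
    """Verifier side: recover the signed hash with the public key and compare."""
    return pow(signature, E, N) == hash_statement(text)


statement = "Alice completed COMP20008."   # hypothetical credential text
signature = sign(statement)

print(verify(statement, signature))                     # True: authentic and unaltered
print(verify("Alice completed COMP90049.", signature))  # altered text: hashes should differ
```

The verifier never needs the private exponent `D`: it only uses the public pair `(N, E)` and its own recomputed hash, which is why a match demonstrates both who signed and that the text is unchanged.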
Key terms
- Generative LLM
- A model trained on web-scale text, which is decomposed into fragments and turned into embeddings (vectors of numbers); a neural network then recombines them to generate novel text. "Generative" means it produces new output rather than retrieving a stored answer.
- Embeddings
- Numeric vector representations of words/fragments that place similar meanings near each other in vector space. They are the form in which an LLM internally represents text so a neural network can operate on it.
- Hallucination
- When an LLM produces confident, plausible-sounding output that is factually wrong, because it predicts likely tokens from statistical patterns with no grounding in truth. It is worsened by gaps or biases in the training data.
- AUTOMAT prompt schema
- A mnemonic for the components of a good prompt: Act-as (role), User (persona/audience), Targeted-action, Output (definition/format), Mode (tone), Atypical cases (edge cases), Topic (extra context). It structures a prompt for clearer, more reliable results.
- Retrieval-Augmented Generation (RAG)
- Supplies the LLM with external, user-specific documents at query time: retrieve relevant chunks from a RAG store, add them to the prompt, then generate a grounded answer. It reduces hallucination by anchoring the model in real source material.
- Content-partitioning problem
- In RAG, the challenge of splitting documents into chunks and isolating/permissioning them so the right — and only the permitted — content is retrieved. Done poorly it hurts relevance or leaks one user's/tenant's data to another.
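The AUTOMAT components defined above can be treated as a fill-in template. The sketch below is one possible way to assemble them in Python; the class name `AutomatPrompt`, its field names and the example values are assumptions made for illustration, not part of the schema itself.

```python
from dataclasses import dataclass


@dataclass
class AutomatPrompt:
    """One field per AUTOMAT component (hypothetical container)."""
    act_as: str
    user: str
    targeted_action: str
    output: str
    mode: str
    atypical_cases: str
    topic: str

    def render(self) -> str:
        """Join the seven components into one labelled prompt string."""
        fields = [
            ("Act as", self.act_as),
            ("User", self.user),
            ("Targeted action", self.targeted_action),
            ("Output", self.output),
            ("Mode", self.mode),
            ("Atypical cases", self.atypical_cases),
            ("Topic", self.topic),
        ]
        return "\n".join(f"{label}: {value}" for label, value in fields)


# Example values are invented for illustration.
prompt = AutomatPrompt(
    act_as="a data-processing tutor",
    user="a second-year student revising for an exam",
    targeted_action="explain why LLMs hallucinate",
    output="three bullet points, each under 25 words",
    mode="plain and encouraging",
    atypical_cases="if the question is off-topic, say so instead of answering",
    topic="the student has already covered embeddings",
)
print(prompt.render())
```

Writing the components as named fields makes it hard to forget one, which is the point of the mnemonic: the first letters of the rendered labels spell AUTOMAT.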
LLMs, Prompting, RAG & Verifiable Data FAQ
What makes a large language model "generative"?
It is trained on web-scale text that is broken into fragments and converted into embeddings (vectors of numbers), and a neural network learns to recombine them to produce new text. "Generative" means it creates novel output token by token rather than looking up and returning a stored answer. The same idea extends to images and other modalities. Q9a of the 2024 exam expects this chain: web text → embeddings → neural recombination → novel output.
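To make "embeddings are vectors of numbers" concrete, here is a small Python sketch that compares hand-picked three-dimensional vectors with cosine similarity. The vectors and the names `EMBEDDINGS`, `cosine` and `nearest` are invented for illustration; real embeddings are learned during training and typically have hundreds or thousands of dimensions.

```python
import math

# Hand-picked toy vectors (assumption): related words point in similar directions.
EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "invoice": [0.1, 0.2, 0.95],
}


def cosine(a, b):
    """Cosine similarity: values near 1.0 mean the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word):
    """Return the other word whose vector is most similar to `word`."""
    others = (w for w in EMBEDDINGS if w != word)
    return max(others, key=lambda w: cosine(EMBEDDINGS[word], EMBEDDINGS[w]))


print(nearest("cat"))  # kitten
print(cosine(EMBEDDINGS["cat"], EMBEDDINGS["invoice"]) < 0.5)  # True
```

Because meaning is encoded as position in vector space, "similar meaning" becomes a numeric comparison that a neural network can operate on.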
Why do LLMs hallucinate?
Because they generate text by predicting the most statistically plausible next tokens, not by checking facts. With no grounding in a source of truth, the model can produce fluent, confident output that is simply wrong — inventing citations, dates or details. Gaps and biases in the training data make it worse. This is exactly why RAG exists: feeding the model relevant retrieved documents grounds its answers and reduces hallucination.
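The mechanism can be illustrated with a model far simpler than an LLM. The sketch below, assuming a toy bigram model and an invented four-sentence corpus, always picks the most frequent next word given only the previous word. The names `CORPUS`, `train_bigrams` and `complete` are hypothetical. It is not how an LLM is built, but it shows the same failure: the output is fluent and statistically plausible, yet nothing checks it against fact.

```python
from collections import Counter, defaultdict

# Invented toy corpus: "rome" follows "is" more often than any other word.
CORPUS = [
    "the capital of france is paris",
    "the capital of italy is rome",
    "the capital of spain is madrid",
    "the capital of italy is rome",
]


def train_bigrams(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def complete(prompt, counts, max_new_tokens=1):
    """Greedily append the most frequent follower of the last token."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        followers = counts.get(tokens[-1])
        if not followers:
            break
        tokens.append(followers.most_common(1)[0][0])
    return " ".join(tokens)


model = train_bigrams(CORPUS)
print(complete("the capital of france is", model))
# the capital of france is rome
```

The completion is grammatical and confident, and it is wrong: the model chose "rome" because it is the likeliest word after "is" in its training data, not because it consulted any fact about France.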
What is RAG and what is the content-partitioning problem?
Retrieval-Augmented Generation gives the model external documents at query time: it retrieves the most relevant chunks from a RAG store, adds them to the prompt, and generates an answer grounded in that source material instead of relying on memorised patterns. The content-partitioning problem is how to split documents into chunks and isolate or permission them so the right content — and only the content the user is allowed to see — is retrieved. Bad partitioning hurts relevance or, worse, leaks one user's or tenant's data into another's answers.
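Both halves can be sketched together in Python. The example below is a simplification under stated assumptions: the store is an in-memory list, relevance is scored by plain word overlap rather than embedding similarity, and the tenants, chunk texts and the names `Chunk`, `STORE`, `retrieve` and `build_prompt` are all invented. The generation step (sending the prompt to a model) is left out.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    tenant: str   # who is allowed to see this chunk
    text: str


# Hypothetical RAG store holding chunks for two tenants.
STORE = [
    Chunk("acme", "Acme refund policy: refunds are issued within 14 days."),
    Chunk("acme", "Acme office hours are 9am to 5pm on weekdays."),
    Chunk("globex", "Globex refund policy: refunds are issued within 30 days."),
]


def words(text):
    """Lower-cased set of alphanumeric words, used as a crude relevance signal."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, tenant, store, k=1):
    """Filter to permitted chunks first, then rank by word overlap with the query."""
    permitted = [c for c in store if c.tenant == tenant]
    query_words = words(query)
    ranked = sorted(permitted, key=lambda c: len(query_words & words(c.text)), reverse=True)
    return ranked[:k]


def build_prompt(query, chunks):
    """Augment the question with the retrieved context."""
    context = "\n".join(f"- {c.text}" for c in chunks)
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}"


question = "How many days do refunds take?"
print(build_prompt(question, retrieve(question, "acme", STORE)))
```

Filtering by tenant before ranking is the design choice that matters here: the Globex chunk is just as relevant to the question, so ranking first and filtering later would risk placing another tenant's data in the prompt.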
How does blockchain make data tamper-evident?
Each block's header contains a cryptographic hash of the previous block's header, forming a chain. If anyone alters the data in an earlier block, that block's hash changes and no longer matches the hash stored in the next block, so the link breaks; to hide the change, every subsequent header would have to be recomputed. Any tampering is therefore detectable. Combined with digital signatures (public key → recompute hash → verify), this is how verifiable credentials prove both authenticity and integrity, which the 2024 exam's Q8 examines directly.
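A hash chain of this kind fits in a few lines of Python. The sketch below is a simplified model, not a real blockchain: each "header" is just the previous hash plus the block's data, there is no consensus or proof-of-work, and the record texts and the names `header_hash`, `build_chain` and `first_broken_block` are invented for illustration.

```python
import hashlib

GENESIS = "0" * 64  # placeholder previous-hash for the first block


def header_hash(prev_hash: str, data: str) -> str:
    """Hash of a simplified header: previous hash plus this block's data."""
    return hashlib.sha256(f"{prev_hash}|{data}".encode("utf-8")).hexdigest()


def build_chain(records):
    """Link each record to the hash of the block before it."""
    chain, prev = [], GENESIS
    for data in records:
        block_hash = header_hash(prev, data)
        chain.append({"prev_hash": prev, "data": data, "hash": block_hash})
        prev = block_hash
    return chain


def first_broken_block(chain):
    """Return the index of the first block that fails verification, or None."""
    prev = GENESIS
    for index, block in enumerate(chain):
        link_ok = block["prev_hash"] == prev
        hash_ok = header_hash(block["prev_hash"], block["data"]) == block["hash"]
        if not (link_ok and hash_ok):
            return index
        prev = block["hash"]
    return None


chain = build_chain(["credential A", "credential B", "credential C"])
print(first_broken_block(chain))      # None: chain is intact

chain[1]["data"] = "credential B (forged)"
print(first_broken_block(chain))      # 1: stored hash no longer matches the data

# Recomputing the forged block's own hash only moves the break one block along.
chain[1]["hash"] = header_hash(chain[1]["prev_hash"], chain[1]["data"])
print(first_broken_block(chain))      # 2: next block still points at the old hash
```

The last step shows why an attacker would have to recompute every later header: fixing one block's hash simply breaks the link stored in the block after it.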
Exam move
Treat this chapter as definition-and-explanation territory rather than calculation. Lock in the generative chain (web text → embeddings → neural recombination → novel output) and the hallucination explanation (predicts plausible tokens, no grounding in truth) as paired answers, because 2024 Q9 asks for them together. Memorise AUTOMAT as a labelled list and be ready to describe three components with examples (2025 Q12a). For RAG, learn both halves: the retrieve-augment-generate pipeline and the content-partitioning problem (chunking/permissioning for relevance and to avoid leaking data). For the guest-lecture thread, drill the two verification skeletons until they are automatic: blockchain (each block hashes the previous header, so a change breaks the chain) and digital signatures (public key → recompute hash → compare). These are quick, high-yield short-answer marks that close the exam.