University of Sydney · FACULTY OF COMPUTER SCIENCE

COMP5318 · Machine Learning and Data Mining

- one subject, every graph, every model, every mark

Computer Science · 14 Chapters · 8-page Bible

Our own words - no uploaded lecturer files

Updated for this semester

Chapter 8 of 11 · COMP5318

Deep Learning: CNN, RNN & Transformer

Weeks 8-9 of COMP5318 Machine Learning and Data Mining at the University of Sydney extend the neural-network foundations to deep architectures, covering the three that the exam keeps returning to. A Convolutional Neural Network (CNN) slides small shared filters across an image and pools to downsample; a Recurrent Neural Network (RNN), and its gated cousin the LSTM, carry a hidden state through a sequence; and a Transformer replaces recurrence with self-attention plus positional encoding. In the closed-book final, most marks come from precise two-sentence explanations and one small convolution or attention calculation.

In this chapter

What this chapter covers

01 Explain convolution as a small shared filter slid over an image, and compute a feature map by hand
02 Use weight sharing and translation invariance to say why a CNN beats a dense net on images like MNIST
03 Contrast max pooling and average pooling, and state the drawback of each
04 Describe how an RNN carries a hidden state and why plain RNNs suffer vanishing gradients on long sequences
05 Name the LSTM's input, forget and output gates and what each controls, and place a GRU as the lighter alternative
06 Compare RNN and LSTM on computation and on their ability to capture long-range dependencies
07 Compute scaled dot-product self-attention from Query, Key and Value vectors (scores, scale by √d_k, softmax, weighted sum)
08 Explain positional encoding and multi-head attention, and why the input is embedded once, not per head
09 Say how the final Linear + softmax picks the next token during Transformer inference

Worked example · free

Compute scaled dot-product self-attention by hand

Q [4 marks]. A query token has q₁ = (2, 0, 2, 0) and attends over two tokens with keys k₁ = (2, 1, 2, 1), k₂ = (1, 1, 1, 1) and value vectors v₁ = (1, 0), v₂ = (0, 1). With key dimension d_k = 4, compute the self-attention output z₁.

+1 Scores are the query dotted with each key: q₁·k₁ = 2·2 + 0·1 + 2·2 + 0·1 = 8, and q₁·k₂ = 2·1 + 0·1 + 2·1 + 0·1 = 4.
+1 Scale by √d_k = √4 = 2 to keep the softmax stable: scaled scores are 8/2 = 4 and 4/2 = 2.
+1 Softmax the scaled scores: e⁴ = 54.60, e² = 7.39, sum = 61.99, so the weights are 54.60/61.99 = 0.88 and 7.39/61.99 = 0.12 (they sum to 1).
+1 Take the weighted sum of the value vectors: z₁ = 0.88·(1, 0) + 0.12·(0, 1) = (0.88, 0.12).

The attention output is z₁ = (0.88, 0.12): the query draws about 88% of its representation from token 1 and 12% from token 2, because token 1's key matched the query more strongly (score 8 vs 4). The weights always sum to 1. Dividing the scores by √d_k before the softmax stops large dot products from saturating the softmax, which is what keeps the gradients stable when d_k is large.
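The four marked steps can be checked numerically. The following is a minimal sketch in plain Python using the numbers from the question; the function names `softmax` and `attention` are illustrative choices, not part of the unit's materials.

```python
import math

def softmax(xs):
    # Subtract the maximum before exponentiating; this does not change the result.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d_k = len(query)
    # Step 1: score = query dotted with each key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Step 2: divide every score by sqrt(d_k).
    scaled = [s / math.sqrt(d_k) for s in scores]
    # Step 3: softmax turns the scaled scores into weights that sum to 1.
    weights = softmax(scaled)
    # Step 4: weighted sum of the value vectors, one output entry per value dimension.
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return scores, scaled, weights, output

q1 = [2, 0, 2, 0]
keys = [[2, 1, 2, 1], [1, 1, 1, 1]]
values = [[1, 0], [0, 1]]

scores, scaled, weights, z1 = attention(q1, keys, values)
print(scores)                          # [8, 4]
print(scaled)                          # [4.0, 2.0]
print([round(w, 2) for w in weights])  # [0.88, 0.12]
print([round(z, 2) for z in z1])       # [0.88, 0.12]
```

The output has two entries because the value vectors have two entries, even though the query and keys have four.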

Sia tip — Lay the two tokens out in columns and go down the same four rows every time: score, divide by √d_k, softmax, then weighted sum of values. Writing the scaling step explicitly earns its own mark and prevents the classic slip of softmax-ing the raw dot products. The output has the dimension of the value vectors, not of the keys.

Glossary

Key terms

Convolution: A layer that slides a small learned filter over the input, multiplying and summing at each position to build a feature map; the same filter is reused everywhere (weight sharing).
Pooling (max / average): A downsampling layer that summarises each window into one number. Max pooling keeps the largest activation (discarding the rest); average pooling takes the mean (which can wash out a strong edge).
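Translation invariance: The property that a CNN detects a feature wherever it appears in the image rather than only at a trained location. It comes from applying one shared filter everywhere, so the feature map shifts along with the input, with pooling then discarding the exact position.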
RNN (Recurrent Neural Network): A network for sequences that feeds its hidden state forward from step to step, so each step sees a summary of what came before; plain RNNs suffer vanishing/exploding gradients over long sequences.
LSTM gates: An LSTM adds a cell state guarded by three gates — forget (erase stale memory), input (write new information) and output (expose the hidden state) — letting it capture long-range dependencies. A GRU does similar work with two gates and fewer parameters.
Self-attention (Q, K, V): Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V: each token's Query is compared with every Key to score relevance, scaled and softmaxed into weights, then used to average the Value vectors. No recurrence is needed.
Multi-head attention: Several attention computations run in parallel, each with its own learned Q/K/V projections of the same single embedding; their outputs are concatenated and projected, so different heads can attend to different relationships at once.
Positional encoding: A position signal added to each token embedding, because self-attention is otherwise order-agnostic; it lets the Transformer know the sequence order of the tokens.
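The claim that self-attention is order-agnostic can be seen in a small experiment. This is a sketch under loud assumptions: the token embeddings are made-up 2-d vectors used directly as keys and values (no learned projections), and `add_positions` is a toy position signal, not the sinusoidal or learned encodings used in real Transformers.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query, on plain lists."""
    d_k = len(query)
    scaled = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

def add_positions(seq):
    # Toy position signal (an assumption for illustration): add 0.1 * position
    # to every entry of the token at that position.
    return [[x + 0.1 * pos for x in tok] for pos, tok in enumerate(seq)]

def close(a, b):
    return all(math.isclose(x, y) for x, y in zip(a, b))

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical embeddings
reversed_tokens = tokens[::-1]                  # same tokens, opposite order
query = [1.0, 0.5]

# Without positions, the output is the same weighted average in either order.
same_without_positions = close(
    attend(query, tokens, tokens),
    attend(query, reversed_tokens, reversed_tokens),
)

# With positions added, each token's vector depends on where it sits,
# so reordering the sequence changes the output.
pos_a = add_positions(tokens)
pos_b = add_positions(reversed_tokens)
same_with_positions = close(attend(query, pos_a, pos_a), attend(query, pos_b, pos_b))

print(same_without_positions, same_with_positions)  # True False
```

Without a position signal the attention output cannot distinguish a sentence from its reversal, which is the gap positional encoding fills.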

FAQ

Deep Learning: CNN, RNN & Transformer FAQ

Why is a CNN better than an ordinary (dense) neural network for images?

A fully-connected net gives every pixel its own weight to every neuron — an enormous parameter count with no sense of locality and no reuse. A CNN builds in image structure: local receptive fields (a pixel matters most to its neighbours), weight sharing (one small filter reused everywhere, so far fewer parameters and less overfitting), and translation invariance (a feature is detected wherever it appears). That is why the standard answer to 'classify MNIST handwritten digits' is a CNN rather than an MLP or k-NN.
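The parameter saving from weight sharing can be made concrete with a quick count. This is a sketch with assumed layer sizes (a 28×28 grayscale input, a dense layer of 128 units, and a convolutional layer of 32 filters of size 3×3); those sizes are illustrative and are not taken from the unit.

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair, plus one bias per unit.
    return n_inputs * n_units + n_units

def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    # Each filter has kernel_h * kernel_w * in_channels weights plus one bias,
    # and the same filter is reused at every image position.
    return (kernel_h * kernel_w * in_channels + 1) * n_filters

dense = dense_params(28 * 28, 128)  # every pixel wired to every unit
conv = conv_params(3, 3, 1, 32)     # 32 small shared filters

print(dense)          # 100480
print(conv)           # 320
print(dense // conv)  # 314
```

Under these assumptions the dense layer needs over 300 times as many parameters as the convolutional layer, and the convolutional count does not grow with the image size.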

How do an RNN and an LSTM differ, and what problem does the LSTM solve?

A plain RNN carries a single hidden state across time steps; because training multiplies many derivatives back through the sequence, gradients tend to vanish (or explode), so long-range context is lost. An LSTM adds a separate cell state guarded by input, forget and output gates that control what is written, erased and exposed, letting it hold long-range dependencies. The trade-off is more parameters and more computation per step; a GRU is the middle ground with two gates and fewer parameters.
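Both halves of this answer can be illustrated with simple arithmetic. The sketch below uses the textbook form of each cell (one weight block per gate or candidate update, each with a single bias), so library implementations may report slightly different counts; the input size, hidden size and per-step derivative factors are assumed values for illustration.

```python
def cell_params(input_size, hidden_size, n_blocks):
    # Each block has weights on the input and on the previous hidden state, plus a bias.
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return n_blocks * per_block

input_size, hidden_size = 10, 20  # assumed sizes, for illustration only

rnn_params = cell_params(input_size, hidden_size, 1)   # one tanh update
gru_params = cell_params(input_size, hidden_size, 3)   # 2 gates + candidate state
lstm_params = cell_params(input_size, hidden_size, 4)  # 3 gates + candidate cell state
print(rnn_params, gru_params, lstm_params)  # 620 1860 2480

# Backpropagation through time multiplies one derivative factor per step.
steps = 100
shrinking = 0.9 ** steps  # factor below 1: the gradient vanishes
growing = 1.1 ** steps    # factor above 1: the gradient explodes
print(f"{shrinking:.1e} {growing:.1e}")  # 2.7e-05 1.4e+04
```

The LSTM costs four times the parameters of the plain RNN at the same sizes, with the GRU in between, which matches the trade-off described above.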

Can AI help me with deep learning topics in COMP5318?

Yes, for understanding. Sia can explain each idea step by step — how a convolution builds a feature map, why gates fix the vanishing-gradient problem, or how scaled dot-product attention weights the value vectors — and walk you through a practice problem with your own numbers so you learn the method. Use it to check your reasoning and rehearse the derivations, not to obtain answers to submitted assignments or the closed-book exam; the unit requires you to acknowledge any AI tools used in assessable work, so keep AI to learning and revision. Sia does not promise a grade or a pass.


Study strategy

Exam move

Treat this chapter as three architectures plus two small calculations. First, lock the one-line 'what and why' for each model: CNN = shared local filters + pooling for images (weight sharing, translation invariance, few parameters); RNN/LSTM = a hidden state through time, with LSTM gates fixing the vanishing gradient; Transformer = self-attention over the whole input, with positional encoding for order and multi-head for multiple relationships. Second, rehearse the two hand calculations until they are automatic — convolve a tiny input then max/average pool, and run scaled dot-product attention (score, divide by √d_k, softmax, weighted sum) — because these are the only numeric parts and they are quick marks. Third, drill the precise contrasts the exam repeats: max vs average pooling drawbacks, RNN vs LSTM on computation and long-range memory, and why the input is embedded once, not per head. The final is 2 hours, closed book, with a non-programmable calculator only, so budget about one minute per mark (a 6-mark Transformer question is roughly 6 minutes) and answer every 'explain' with a mechanism and its consequence, not just a name. Keep AI tutoring to rehearsing the method, and confirm the exact exam date on the Canvas exam timetable.
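The first hand calculation, convolve then pool, can be rehearsed against a short script. This is a minimal sketch with a made-up 5×5 input and 2×2 filter (no padding, stride 1) followed by non-overlapping 2×2 pooling; the helper names are illustrative. As in CNN layers generally, the filter is slid without being flipped.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1), multiplying and summing."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def pool2d(fmap, size, reduce):
    """Summarise each non-overlapping size x size window with the given function."""
    return [
        [
            reduce([fmap[i + a][j + b] for a in range(size) for b in range(size)])
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]

def mean(window):
    return sum(window) / len(window)

# Made-up input and filter, for practice only.
image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 0],
    [2, 1, 0, 1, 1],
    [0, 3, 1, 0, 2],
]
kernel = [
    [1, 0],
    [0, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[2, 4, 3, 2], [0, 2, 4, 3], [2, 0, 2, 3], [5, 2, 0, 3]]

max_pooled = pool2d(feature_map, 2, max)
avg_pooled = pool2d(feature_map, 2, mean)
print(max_pooled)   # [[4, 4], [5, 3]]
print(avg_pooled)   # [[2.0, 3.0], [2.25, 2.0]]
```

In the bottom-left window the single strong activation 5 survives max pooling but is diluted to 2.25 by average pooling, which is the drawback contrast in miniature.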
