COMP5318 · Machine Learning and Data Mining
Deep Learning: CNN, RNN & Transformer
The Weeks 8-9 topic of COMP5318 Machine Learning and Data Mining at the University of Sydney extends the neural-network foundations to deep learning, across the three architectures the exam keeps returning to. A Convolutional Neural Network (CNN) shares small filters across an image and pools to downsample; a Recurrent Neural Network (RNN), and its gated cousin the LSTM, carry a hidden state through a sequence; and a Transformer replaces recurrence with self-attention plus positional encoding. In the closed-book final, most marks come from precise two-sentence explanations and one small convolution or attention calculation.
What this chapter covers
- 01 Explain convolution as a small shared filter slid over an image, and compute a feature map by hand
- 02 Use weight sharing and translation invariance to say why a CNN beats a dense net on images like MNIST
- 03 Contrast max pooling and average pooling, and state the drawback of each
- 04 Describe how an RNN carries a hidden state and why plain RNNs suffer vanishing gradients on long sequences
- 05 Name the LSTM's input, forget and output gates and what each controls, and place a GRU as the lighter alternative
- 06 Compare RNN and LSTM on computation and on their ability to capture long-range dependencies
- 07 Compute scaled dot-product self-attention from Query, Key and Value vectors (scores, scale by √d_k, softmax, weighted sum)
- 08 Explain positional encoding and multi-head attention, and why the input is embedded once, not per head
- 09 Say how the final Linear + softmax picks the next token during Transformer inference
Compute scaled dot-product self-attention by hand
- +1 Scores are the query dotted with each key. Reading the vectors off the products, q₁ = (2, 0, 2, 0), k₁ = (2, 1, 2, 1) and k₂ = (1, 1, 1, 1), so q₁·k₁ = 2·2 + 0·1 + 2·2 + 0·1 = 8 and q₁·k₂ = 2·1 + 0·1 + 2·1 + 0·1 = 4.
- +1 Scale by √d_k = √4 = 2 so that large dot products do not saturate the softmax: the scaled scores are 8/2 = 4 and 4/2 = 2.
- +1 Softmax the scaled scores: e⁴ = 54.60, e² = 7.39, sum = 61.99, so the weights are 54.60/61.99 = 0.88 and 7.39/61.99 = 0.12 (they sum to 1).
- +1 Take the weighted sum of the value vectors v₁ = (1, 0) and v₂ = (0, 1): z₁ = 0.88·(1, 0) + 0.12·(0, 1) = (0.88, 0.12).
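The four steps above can be checked with a short script. This is a minimal sketch in plain Python; the vectors are the ones read off the dot-product expansions in the worked example, and the function names `softmax` and `attention_for_query` are illustrative choices, not part of the course material.

```python
import math

def softmax(xs):
    """Turn a list of scores into weights that sum to 1."""
    m = max(xs)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_for_query(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d_k = len(q)
    # Step 1: score = query dotted with each key
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    # Step 2: divide by sqrt(d_k)
    scaled = [s / math.sqrt(d_k) for s in scores]
    # Step 3: softmax into attention weights
    weights = softmax(scaled)
    # Step 4: weighted sum of the value vectors
    dim_v = len(values[0])
    z = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return scores, scaled, weights, z

# Vectors inferred from the worked example above
q1 = [2, 0, 2, 0]
k1, k2 = [2, 1, 2, 1], [1, 1, 1, 1]
v1, v2 = [1, 0], [0, 1]

scores, scaled, weights, z1 = attention_for_query(q1, [k1, k2], [v1, v2])
print(scores)                           # [8, 4]
print(scaled)                           # [4.0, 2.0]
print([round(w, 2) for w in weights])   # [0.88, 0.12]
print([round(c, 2) for c in z1])        # [0.88, 0.12]
```

Subtracting the maximum inside `softmax` does not change the weights; it only avoids overflow for large scores, which is not an issue at exam scale.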
Key terms
- Convolution
- A layer that slides a small learned filter over the input, multiplying and summing at each position to build a feature map; the same filter is reused everywhere (weight sharing).
- Pooling (max / average)
- A downsampling layer that summarises each window into one number. Max pooling keeps the largest activation (discarding the rest); average pooling takes the mean (which can wash out a strong edge).
- Translation invariance
- The property, from applying one shared filter everywhere, that a CNN detects a feature wherever it appears in the image rather than only at a trained location.
- RNN (Recurrent Neural Network)
- A network for sequences that feeds its hidden state forward from step to step, so each step sees a summary of what came before; plain RNNs suffer vanishing/exploding gradients over long sequences.
- LSTM gates
- An LSTM adds a cell state guarded by three gates — forget (erase stale memory), input (write new information) and output (expose the hidden state) — letting it capture long-range dependencies. A GRU does similar work with two gates and fewer parameters.
- Self-attention (Q, K, V)
- Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V: each token's Query is compared with every Key to score relevance, the scores are scaled and softmaxed into weights, and the weights are used to take a weighted average of the Value vectors. No recurrence is needed.
- Multi-head attention
- Several attention computations run in parallel, each with its own learned Q/K/V projections of the same single embedding; their outputs are concatenated and projected, so different heads can attend to different relationships at once.
- Positional encoding
- A position signal added to each token embedding, because self-attention is otherwise order-agnostic; it lets the Transformer know the sequence order of the tokens.
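To make the positional-encoding entry concrete, here is a minimal sketch in plain Python. The entry above only says that a position signal is added; the sinusoidal scheme used here is one common choice and is an assumption, as are the function name `sinusoidal_position` and the toy embedding values.

```python
import math

def sinusoidal_position(pos, d_model):
    """Sinusoidal encoding for one position (assumed scheme; d_model must be even).

    Even dimensions use sin, odd dimensions use cos, with wavelengths
    that grow geometrically across the dimensions.
    """
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

# Hypothetical 4-dimensional embedding of one token
embedding = [0.5, -1.0, 0.25, 0.0]

# The same token placed at position 0 and at position 3
x_pos0 = [e + p for e, p in zip(embedding, sinusoidal_position(0, 4))]
x_pos3 = [e + p for e, p in zip(embedding, sinusoidal_position(3, 4))]

print(x_pos0)            # [0.5, 0.0, 0.25, 1.0]
print(x_pos0 != x_pos3)  # True
```

The same token now produces a different input vector at each position, which is what lets an otherwise order-agnostic attention layer tell positions apart.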
Deep Learning: CNN, RNN & Transformer FAQ
Why is a CNN better than an ordinary (dense) neural network for images?
A fully-connected net gives every pixel its own weight to every neuron — an enormous parameter count with no sense of locality and no reuse. A CNN builds in image structure: local receptive fields (a pixel matters most to its neighbours), weight sharing (one small filter reused everywhere, so far fewer parameters and less overfitting), and translation invariance (a feature is detected wherever it appears). That is why the standard answer to 'classify MNIST handwritten digits' is a CNN rather than an MLP or k-NN.
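The parameter saving from weight sharing is easy to put a number on. The sketch below compares one dense layer with one convolutional layer on a 28×28 grayscale MNIST image; the layer sizes (128 hidden units, 32 filters of size 3×3) are illustrative assumptions, not values from the unit.

```python
def dense_params(n_inputs, n_units):
    """Every input connects to every unit, plus one bias per unit."""
    return n_inputs * n_units + n_units

def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Each filter has one small shared kernel per input channel, plus one bias."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters

# Assumed layer sizes, for illustration only
dense = dense_params(28 * 28, 128)  # flattened image into 128 hidden units
conv = conv_params(3, 3, 1, 32)     # 32 filters of size 3x3 on 1 channel

print(dense)          # 100480
print(conv)           # 320
print(dense // conv)  # 314
```

The convolutional count does not depend on the image size at all, because the same filter weights are reused at every position.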
How do an RNN and an LSTM differ, and what problem does the LSTM solve?
A plain RNN carries a single hidden state across time steps; because training multiplies many derivatives back through the sequence, gradients tend to vanish (or explode), so long-range context is lost. An LSTM adds a separate cell state guarded by input, forget and output gates that control what is written, erased and exposed, letting it hold long-range dependencies. The trade-off is more parameters and more computation per step; a GRU is the middle ground with two gates and fewer parameters.
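Both halves of this answer can be illustrated numerically. The sketch below is a toy, not a trained model: it uses a scalar RNN h_t = tanh(w·h_{t-1} + x_t) with an assumed weight w = 0.5 to show the gradient shrinking over time, then counts parameters under the usual convention of one weight block per gate or candidate (1 for a plain RNN, 3 for a GRU, 4 for an LSTM). Function names and sizes are illustrative, and some libraries store extra bias vectors, so their counts differ slightly.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # chain rule factor d h_t / d h_{t-1}
    return abs(grad)

def recurrent_params(n_in, n_hidden, n_blocks):
    """n_blocks weight blocks, each mapping [input, hidden] to hidden, plus a bias."""
    return n_blocks * (n_hidden * (n_in + n_hidden) + n_hidden)

# Vanishing gradient: each step multiplies by a factor below 0.5
inputs = [0.1] * 50
for steps in (5, 20, 50):
    print(steps, gradient_through_time(0.5, inputs[:steps]))

# Cost of the gates, for assumed sizes n_in = 10, n_hidden = 20
rnn = recurrent_params(10, 20, 1)
gru = recurrent_params(10, 20, 3)
lstm = recurrent_params(10, 20, 4)
print(rnn, gru, lstm)  # 620 1860 2480
```

The printed gradients fall by many orders of magnitude between 5 and 50 steps, which is why early inputs barely influence the weight updates in a plain RNN.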
Can AI help me with deep learning topics in COMP5318?
Yes, for understanding. Sia can explain each idea step by step — how a convolution builds a feature map, why gates fix the vanishing-gradient problem, or how scaled dot-product attention weights the value vectors — and walk you through a practice problem with your own numbers so you learn the method. Use it to check your reasoning and rehearse the derivations, not to obtain answers to submitted assignments or the closed-book exam; the unit requires you to acknowledge any AI tools used in assessable work, so keep AI to learning and revision. Sia does not promise a grade or a pass.
Exam move
Treat this chapter as three architectures plus two small calculations. First, lock the one-line 'what and why' for each model: CNN = shared local filters + pooling for images (weight sharing, translation invariance, few parameters); RNN/LSTM = a hidden state through time, with LSTM gates fixing the vanishing gradient; Transformer = self-attention over the whole input, with positional encoding for order and multi-head for multiple relationships. Second, rehearse the two hand calculations until they are automatic — convolve a tiny input then max/average pool, and run scaled dot-product attention (score, divide by √d_k, softmax, weighted sum) — because these are the only numeric parts and they are quick marks. Third, drill the precise contrasts the exam repeats: max vs average pooling drawbacks, RNN vs LSTM on computation and long-range memory, and why the input is embedded once, not per head. The final is 2 hours, closed book, with a non-programmable calculator only, so budget about one minute per mark (a 6-mark Transformer question is roughly 6 minutes) and answer every 'explain' with a mechanism and its consequence, not just a name. Keep AI tutoring to rehearsing the method, and confirm the exact exam date on the Canvas exam timetable.
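For rehearsing the first hand calculation, here is a minimal sketch in plain Python that convolves a tiny input and then pools the feature map. The 5×5 input and 2×2 filter are made-up values, and the sketch assumes stride 1 with no padding for the convolution and non-overlapping 2×2 windows for pooling. As is usual in deep learning, the filter is applied without flipping.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding), multiply and sum."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def pool2d(fmap, size, reduce):
    """Summarise each non-overlapping size x size window with `reduce`."""
    return [
        [
            reduce([fmap[r + i][c + j]
                    for i in range(size) for j in range(size)])
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]

def mean(xs):
    return sum(xs) / len(xs)

# Made-up 5x5 input and 2x2 filter
image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 2],
    [2, 1, 0, 1, 0],
    [0, 3, 1, 0, 1],
]
kernel = [
    [1, 0],
    [0, -1],
]

fmap = conv2d_valid(image, kernel)
print(fmap)                    # [[0, 0, -3, 0], [0, 0, 0, 1], [0, 0, 0, 2], [-1, 0, 0, 0]]
print(pool2d(fmap, 2, max))    # [[0, 1], [0, 2]]
print(pool2d(fmap, 2, mean))   # [[0.0, -0.5], [-0.25, 0.5]]
```

The two pooled outputs show both drawbacks on the same data: max pooling discards the −3 response entirely, while average pooling dilutes the strong 2 down to 0.5.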