University of Sydney · FACULTY OF COMPUTER SCIENCE

COMP5318 · Machine Learning and Data Mining

Computer Science · 14 Chapters · 9-page Bible
Chapter 11 of 11 · COMP5318

Reinforcement Learning

This chapter covers reinforcement learning (RL) from the Week 12 block of COMP5318 Machine Learning and Data Mining at the University of Sydney. You will learn how an agent learns to act from a stream of rewards rather than labelled examples, how the discount factor trades immediate against future reward, and how Q-learning and Deep Q-learning with experience replay estimate the value of actions. It is examined as short-answer plus a small policy / value calculation in the final exam.

In this chapter

What this chapter covers

  • 01 Why RL differs from supervised learning: a scalar reward signal, not labelled targets, and actions that change the data seen next
  • 02 The Markov Decision Process tuple (states, actions, reward, transition probability, discount) and the agent-environment loop
  • 03 Rewards, return and the discount factor gamma: myopic near gamma=0, far-sighted near gamma=1
  • 04 The state-value V and action-value Q functions, and the optimal policy that maximises expected return
  • 05 The Bellman optimality equation, and solving a small deterministic MDP backward from the terminal state
  • 06 The Q-learning update rule: TD target, TD error, learning rate, and the max over next actions
  • 07 Exploration vs exploitation with an epsilon-greedy action rule
  • 08 Deep Q-learning: approximating Q with a neural network, and why experience replay breaks sample correlation for stable learning
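
Item 07 names the epsilon-greedy rule without spelling it out. A minimal sketch, assuming the usual formulation (with probability epsilon pick a uniformly random action, otherwise pick the action with the highest current Q-value): the function name, the list-of-Q-values representation and the example numbers are illustrative assumptions, not taken from the unit's materials.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index given the Q-values of one state.

    Assumption: actions are indexed 0..n-1 and q_values[a] holds Q(s, a).
    """
    if rng.random() < epsilon:
        # Explore: ignore the estimates and pick any action uniformly.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the largest current estimate.
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)            # seeded only to make the demo repeatable
q_s = [4.0, 10.0, 6.0]            # hypothetical Q(s, a1), Q(s, a2), Q(s, a3)

greedy_action = epsilon_greedy(q_s, epsilon=0.0, rng=rng)  # never explores

counts = [0, 0, 0]
for _ in range(1000):
    counts[epsilon_greedy(q_s, epsilon=0.1, rng=rng)] += 1
print(greedy_action, counts)
```

With epsilon = 0 the rule is purely greedy; with a small epsilon the best-looking action still dominates, but every action keeps being tried occasionally, which is the exploration the Q-estimates need in order to improve.
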
Worked example · free

Apply one Q-learning update

Q [5 marks]. An agent uses learning rate alpha = 0.5 and discount gamma = 0.9. The current estimate is Q(s,a) = 2. The agent takes action a, receives reward r = 5, and moves to state s', where the current action-values are Q(s',a1) = 4, Q(s',a2) = 10 and Q(s',a3) = 6. Compute the updated Q(s,a).
  • +1 Write the rule: Q(s,a) <- Q(s,a) + alpha[ r + gamma * max_a' Q(s',a') - Q(s,a) ].
  • +1 Best next value: max_a' Q(s',a') = max(4, 10, 6) = 10.
  • +1 TD target: r + gamma * (best next Q) = 5 + 0.9 * 10 = 5 + 9 = 14.
  • +1 TD error: target - current Q = 14 - 2 = 12.
  • +1 New estimate: Q(s,a) = 2 + 0.5 * 12 = 2 + 6 = 8.
Q(s,a) = 8. The estimate moves up from 2 toward the target 14, but only halfway, because alpha = 0.5 scales the +12 TD error to a +6 step. A larger alpha would jump closer to 14; a smaller one would inch up more slowly. Note: if s' were a terminal state the max term would be 0, so the target would be just r = 5.
Sia tip: Multiply gamma by the max of the next-state Q-values (here 0.9 * 10), not by their sum or average, and remember the step is alpha times the whole bracket (the TD error), not alpha times the target. That is where the calculation marks are won or lost.
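
The worked example above can be checked in a few lines of code. This is a minimal sketch of one tabular Q-learning update using the numbers from the question; the function name and the `terminal` flag are illustrative assumptions, the latter added to cover the terminal-state note in the answer.

```python
def q_learning_update(q_sa, reward, next_q_values, alpha, gamma, terminal=False):
    """Apply one Q-learning update and return (new Q, TD target, TD error)."""
    # Max over next actions; taken as 0 when s' is terminal.
    best_next = 0.0 if terminal else max(next_q_values)
    td_target = reward + gamma * best_next
    td_error = td_target - q_sa
    # The step is alpha times the whole TD error, not alpha times the target.
    new_q = q_sa + alpha * td_error
    return new_q, td_target, td_error

# Numbers from the worked example: Q(s,a)=2, r=5, Q(s',.) = 4, 10, 6.
new_q, td_target, td_error = q_learning_update(
    2.0, 5.0, [4.0, 10.0, 6.0], alpha=0.5, gamma=0.9
)
print(new_q, td_target, td_error)

# Same transition, but with s' treated as terminal: the target is just r.
terminal_q, terminal_target, _ = q_learning_update(
    2.0, 5.0, [4.0, 10.0, 6.0], alpha=0.5, gamma=0.9, terminal=True
)
print(terminal_q, terminal_target)
```

Changing `alpha` while keeping everything else fixed shows the point made in the answer: the TD error stays at 12, and only the fraction of it that is applied changes.
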
Glossary

Key terms

Reinforcement learning (RL)
Learning to act by interacting with an environment and receiving scalar rewards, with the goal of maximising expected cumulative discounted reward. There are no labelled examples, which is the key contrast with supervised learning.
Markov Decision Process (MDP)
The formal model of an RL problem, written as the tuple (S, A, R, P, gamma): states, actions, reward, transition probability, and discount factor. 'Markov' means the next state depends only on the current state and action, not on the earlier history.
Discount factor (gamma)
A number in [0,1] that weights future rewards: a reward t steps away is multiplied by gamma^t. Near 0 the agent is myopic (values immediate reward); near 1 it is far-sighted (future reward counts almost as much as present).
Policy (pi) and optimal policy (pi*)
A policy maps each state to an action; the optimal policy pi* maximises the expected cumulative reward from every state. It corresponds to taking the best action, arg max over a of Q*(s,a), in each state.
Value function V and Q-value function
V(s) is the expected return from following the policy starting in state s; Q(s,a) is the expected return from taking action a in state s and then following the policy. Q-learning estimates Q.
Bellman optimality equation
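The condition the optimal action-values satisfy: Q*(s,a) equals the expected value of the immediate reward plus gamma times the value of acting optimally in the next state, Q*(s,a) = E[ r + gamma * max over a' of Q*(s',a') ]. The expectation is over the next state s'; when transitions are deterministic it disappears.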
Q-learning update
Q(s,a) <- Q(s,a) + alpha[ r + gamma * max_a' Q(s',a') - Q(s,a) ]. The bracket is the temporal-difference (TD) error, alpha is the learning rate, and the update nudges Q toward the Bellman target.
Deep Q-learning with experience replay
Approximating Q(s,a) with a neural network Q_w(s,a) trained on the same TD target. Experience replay stores past transitions (s,a,r,s') and samples a random minibatch, breaking the correlation between consecutive samples for more stable, data-efficient learning.
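
The discount factor entry in the glossary above says a reward t steps away is multiplied by gamma^t. A short sketch of that definition follows, with a made-up reward sequence (an assumption for illustration only) evaluated at a myopic, a moderate and a far-sighted gamma.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, where rewards[0] is the immediate reward."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical rewards: small now, large two steps later.
rewards = [1.0, 1.0, 10.0]

myopic = discounted_return(rewards, 0.0)       # only the immediate reward counts
moderate = discounted_return(rewards, 0.9)     # 1 + 0.9*1 + 0.81*10
far_sighted = discounted_return(rewards, 1.0)  # plain undiscounted sum

print(myopic, moderate, far_sighted)
```

The same reward sequence is worth very different amounts depending on gamma, which is why the choice of gamma changes which policy counts as optimal.
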
FAQ

Reinforcement Learning FAQ

What is the main difference between reinforcement learning and supervised learning?

In supervised learning the model is trained on labelled examples and tries to match a given target for each input. In reinforcement learning there are no labels: the agent receives only a scalar reward signal from the environment, and its objective is to maximise expected cumulative (discounted) future reward. There are two further contrasts the exam rewards naming: the agent's own actions change the data (states) it sees next, and rewards can be delayed rather than immediate. This is a common 2-mark short-answer, so list two or three of these differences explicitly.

Why does Deep Q-learning use experience replay?

Training a Q-network on consecutive transitions is unstable because successive transitions are highly correlated, which violates the roughly independent samples that stochastic gradient descent assumes and can cause divergence. Experience replay stores the agent's past transitions (s,a,r,s') in a memory buffer and trains on a random minibatch drawn from it. Random sampling decorrelates the training batch and lets each experience be reused, giving more stable and more data-efficient learning. State both the problem (correlation) and the fix (random sampling from a buffer) to earn full marks.
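
The buffer described in this answer can be sketched in a few lines. This is an illustrative minimal version, not a reference implementation: the class name, the fixed capacity that evicts the oldest transitions, and the extra `done` flag marking terminal transitions are all assumptions added for the example.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions and returns uniformly random minibatches."""

    def __init__(self, capacity, seed=None):
        # Assumption: bounded memory; deque drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement mixes old and new experience,
        # so the minibatch is not a run of consecutive, correlated steps.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=50, seed=0)
for step in range(100):
    # Toy transition: integer states, one action, reward equal to the step.
    buffer.push(step, 0, float(step), step + 1, False)

batch = buffer.sample(8)
print(len(buffer), [s for (s, a, r, s_next, done) in batch])
```

In a Deep Q-learning loop, each sampled transition would supply the TD target r + gamma * max_a' Q_w(s',a') for one term of the minibatch loss; the network and optimiser are deliberately left out of this sketch.
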

Can AI help me with reinforcement learning in COMP5318?

Yes, as a study aid. Sia, the AskSia tutor, can explain RL step by step, for example how to solve a small MDP for its optimal policy and V* values by working backward from the terminal state, or how to apply one Q-learning update and identify the TD target and TD error. It can also generate fresh practice problems so you build the method yourself. It will not complete your graded assignments or exam for you, and it cannot promise a mark or a pass; the University also requires you to acknowledge any AI tools used in assessable work. Use it to understand the method, then practise unaided under exam conditions.


Study strategy

Exam move

Make two calculations mechanical, because they are the reliable marks in this topic. First, solving a small deterministic MDP: set terminal states to value 0, then work backward with V*(s) = max over a of [ R(s,a,s') + gamma * V*(s') ], and read off the optimal policy as the action achieving each max. Second, the Q-learning update: write the rule, take the max of the next-state Q-values, form the TD target r + gamma * max Q, subtract the old estimate to get the TD error, then scale by the learning rate alpha. Rehearse both on small numbers until they are automatic.

Layer the short-answer facts on top: how RL differs from supervised learning (reward signal, maximise future return, actions change the data), what the discount factor does, and why experience replay helps (random sampling breaks sample correlation).

Budget roughly one minute per mark, so a 6-mark policy and value question takes about six minutes, comfortably inside the two-hour paper. The final exam is worth the majority of the unit and carries a minimum-mark hurdle, so bank this quick-scoring topic cleanly; confirm the exact exam date on the Canvas exam timetable.
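
The backward-solving recipe for a small deterministic MDP can be rehearsed against code. The sketch below assumes an acyclic MDP whose non-terminal states can be listed from closest-to-terminal to furthest; the three-state example, its action names and its rewards are invented for illustration.

```python
def solve_backward(transitions, order, gamma, terminals):
    """Compute V* and the optimal policy for a deterministic, acyclic MDP.

    transitions[s][a] = (reward, next_state)
    order: non-terminal states, closest to the terminal state first.
    """
    V = {t: 0.0 for t in terminals}   # terminal states have value 0
    policy = {}
    for s in order:
        best_action, best_value = None, float("-inf")
        for a, (r, s_next) in transitions[s].items():
            value = r + gamma * V[s_next]   # R(s,a,s') + gamma * V*(s')
            if value > best_value:
                best_action, best_value = a, value
        V[s] = best_value
        policy[s] = best_action       # the action achieving the max
    return V, policy

# Hypothetical MDP: s0 -> {s1, s2}, s1 -> {T, s2}, s2 -> T.
transitions = {
    "s2": {"left": (1.0, "T"), "right": (4.0, "T")},
    "s1": {"short": (5.0, "T"), "long": (2.0, "s2")},
    "s0": {"a": (0.0, "s1"), "b": (3.0, "s2")},
}
V, policy = solve_backward(transitions, ["s2", "s1", "s0"], gamma=0.9, terminals=["T"])
print(V, policy)
```

Working by hand, V*(s2) = 4, then V*(s1) = max(5, 2 + 0.9 * 4) = 5.6, then V*(s0) = max(0 + 0.9 * 5.6, 3 + 0.9 * 4) = 6.6, which is the order the loop follows.
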
