DATA3404

May 29, 2026


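I got hold of an exam bible for a course I'm taking! Can you break the course down for me and tell me what to focus on in revision? My final is coming up, so the more detail the better: formulas, definitions, core exam points, I need all of it. Please save me.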

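OK. Working from your two files (asksia-bible-data3404-bilingual.pdf + asksia-cheatsheet-data3404.pdf), I'll break the DATA3404 final revision into a map you can memorise from, compute from, and score from: what is examined → how to earn the marks → the must-know formula belt → one-line definitions to memorise → templates for the high-frequency question types → the most common traps → a 48-hour pre-exam sprint plan.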
- First, the single most important conclusion: this exam does not test code or memorised SQL. It tests whether you can cost the engine's internals by hand: given page counts, the number of buffer frames ($B$ or $M$), and index information, you pick the cheapest algorithm and compute its I/O. [1]

[1] Source: asksia-bible-data3404-bilingual.pdf

C. EXAM (mode 3). It's STUVAC. The TL;DR strips, the cost-formula belt and the recap boxes are your map. The blueprint overleaf shows the double hurdle, what "semi-open book" means, and the calculation types that recur. (AskSia Library · DATA3404 · Bilingual)

The single most important thing to understand about DATA3404: the exam does not test your code. It tests whether you understand the engine well enough to predict its cost: given relation sizes in pages, a buffer of B frames, and an index, which join is cheapest and what is its I/O? How many passes does external sort take? How tall is the B+-tree? What does the optimiser estimate? Master the cost models and the mechanisms behind them and you clear the 40% exam hurdle comfortably. Memorising SQL will not save you.

How this book was built, and the two-layer rule: standard database-systems theory, algorithms and cost models are stated plainly (they are universal computer science). The unit's own assignment datasets and the lecturer's specific problem framings are paraphrased and re-numbered, never copied. There is no single prescribed textbook; the canonical references are Ramakrishnan & Gehrke and Garcia-Molina/Ullman/Widom. Every cost calculation here has been checked, but verify weights, dates and the exact "semi-open book" rules against your own Canvas (canvas.sydney.edu.au).

The exam blueprint: final 60%, double-hurdled. Your mark splits 40% continuous (6 tasks) / 60% final, but two hurdles gate the pass: you need ≥40% on the exam AND ≥50% overall. Miss either and the result is capped at a fail, regardless of the rest.

[2] Source: asksia-bible-data3404-bilingual.pdf

The walk-in page: read this on the bus. Logistics, the "if you see X, do Y" triggers, every cost formula, the traps, and a clock plan.

This is the only page you re-read the morning of the exam. DATA3404 is a 2-hour, paper-based, semi-open-book final worth 60%, mostly short answer (possibly some MCQ). It does not assess your code. It asks you to compute engine internals by hand: join I/O, external-sort passes, B+-tree height, index choice, selectivity, distributed-join strategy. Marks live in the method: write the formula, sub in the page counts, show every ceiling. Below: what to confirm, the per-topic triggers, the formula belt, the traps that bleed marks, and a two-hour clock.

The double hurdle, the number that ends people: to pass you need ≥40% on the final exam AND ≥50% overall. Fail either and your final mark is capped at 45 regardless of your average; the exam is flagged a [hurdle task]. Translation: a strong semester (assignments, quizzes, presentation = 40% of the unit) cannot rescue a weak exam, and a great exam can't rescue a thin semester. Treat 40% on the exam as the floor you must clear before you can think about a grade.

10.1 Logistics: confirm before you walk in
- Format: paper-based, written, 2 hours, in the formal exam period. Mostly short answer, possibly a few MCQ.
- Scope: lectures, labs, homework quizzes and the assignments, but it does NOT assess coding. It is hand-worked theory and cost models.
- Bring: pen, a spare pen, your student card, and a calculator if permitted (confirm). You do arithmetic on page counts, and they are not nice round numbers.
- Hurdle again: ≥40% exam and ≥50% overall, else capped at 45.

"Semi-open book": CONFIRM what you may bring.

[4] Source: asksia-bible-data3404-bilingual.pdf

AskSia Library Exam Bible. School of Computer Science, Semester 1, 2026. The Complete Exam Bible: Scalable Data Management. "The exam isn't SQL syntax, it's the engine internals you cost by hand: indexes, joins, sorting, query plans." DATA3404, University of Sydney. Bilingual edition: English-led with Chinese alongside; exam points and terminology keep the original English terms.

The final is 60% and double-hurdled: you need ≥40% on the exam AND ≥50% overall to pass. It is a 2-hour, semi-open-book, paper-based exam, and the marks live in the engine internals you compute by hand: B+-tree and hash indexing, external-sort I/O, the join cost models, and query-optimisation cost estimation. This book drills those calculations, not your SQL.

Independent study companion. Not affiliated with or endorsed by the University of Sydney. Corrections: takedowns@asksia.ai

Preface, how to use this book: cost it by hand. The 60% exam rewards engine internals computed on paper, not SQL recall. This is not a transcript of the lecture decks. It is a self-contained course in how a scalable database engine actually works: storage and buffering, indexing, sorting, the join algorithms, query optimisation, and the parallel/distributed stack (MapReduce, Spark). Each idea is stated plainly, drawn as an original schematic, then turned into the cost calculation the exam demands. The practical assignments train your SQL and Spark; this book trains the 60% exam.

A. LEARN (mode 1). You haven't seen the lecture yet. Read a chapter top to bottom. Every topic opens with a one-line TL;DR, then concept → diagram → cost formula → worked example → trap. The schematics are original drawings of standard DB internals; learn the mechanism cold.

B. DRILL (mode 2). You've done the lecture and the lab. Cover the worked steps and re-compute each I/O cost yourself: given page counts and buffers, derive the join cost, the sort passes, the B+-tree height. That is the exam's single most repeated task.

[16] Source: asksia-cheatsheet-data3404.pdf

DATA3404 Scalable Data Management, University of Sydney, School of Computer Science. Exam revision, Sem 1 2026, side 1 of 2. Whole-unit revision, all topics.

0. Exam Blueprint (read first). DATA3404 is about database-engine internals plus scaling out: "learn the principles, not the software." The graded weight: Final 60% (paper-based, 2h, semi-open book [confirm permitted materials]) + 2×12% group practicals (SimpleDB engine; SQL + Spark) + 6% quizzes + 5% participation + 5% presentation. Double hurdle: you must score ≥40% on the exam AND ≥50% overall; fail either and the final mark is capped at 45. The exam costs engine internals, NOT SQL: it is mostly short-answer (some MCQ) and explicitly does not assess your code. Marks live in hand-worked cost models: buffer/CLOCK, B+-tree and hash math, external-sort passes, the join cost formulas, and query-optimisation costing.

SIA tip: every cost is in page transfers (I/Os), with output ignored and CPU and sequential-vs-random differences ignored. Write the formula, sub the numbers, box the answer; method marks survive an arithmetic slip.

1. Storage & Files (LEC 02). A DB file is a sequence of fixed-size pages; each page holds many records, aligned to block-oriented disk. The access gap: disk latency >> memory latency, so minimise page I/O. File organisations: Heap, where a record is placed anywhere with free space, best when access is a full scan; Index, a tree/hash access path with faster updates than a sorted file. Access cost (B = #pages), columns OP / HEAP / SORTED [...]. LRU: least-recently-used (locality); MRU [...]
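
The hand-costing the sources keep pointing at (external-sort passes and join I/O from page counts and buffer frames) can be sketched in a few lines of Python. This is only a sketch under stated assumptions: it uses the standard textbook cost models from Ramakrishnan & Gehrke, not the unit's own formula belt, which the excerpts above do not reproduce. The function names are hypothetical, the sort cost counts a read and a write on every pass including the last, and the join cost ignores output. Check both conventions against your own cheat sheet before relying on them.

```python
import math


def external_sort_passes(n_pages: int, buffers: int) -> int:
    """Passes of external merge sort (textbook model, an assumption here).

    Pass 0 sorts B pages at a time into ceil(N/B) runs;
    each later pass merges B-1 runs at a time.
    """
    if buffers < 3:
        raise ValueError("need at least 3 buffer frames to merge")
    runs = math.ceil(n_pages / buffers)  # runs after pass 0
    passes = 1
    while runs > 1:
        runs = math.ceil(runs / (buffers - 1))  # (B-1)-way merge
        passes += 1
    return passes


def external_sort_io(n_pages: int, buffers: int) -> int:
    """Total I/O assuming every pass reads and writes all N pages."""
    return 2 * n_pages * external_sort_passes(n_pages, buffers)


def block_nested_loops_io(outer_pages: int, inner_pages: int, buffers: int) -> int:
    """Block nested loops join: M + ceil(M / (B-2)) * N, output ignored."""
    if buffers < 3:
        raise ValueError("need at least 3 buffer frames")
    outer_blocks = math.ceil(outer_pages / (buffers - 2))
    return outer_pages + outer_blocks * inner_pages


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the unit's materials.
    print(external_sort_passes(108, 5))           # 4
    print(external_sort_io(108, 5))               # 864
    print(block_nested_loops_io(500, 1000, 102))  # 5500
    print(block_nested_loops_io(1000, 500, 102))  # 6000
```

The last two lines show why the smaller relation usually goes on the outer side: with the same 102 frames, swapping the roles changes the cost from 5500 to 6000 I/Os. The pass count uses an integer loop rather than a floating-point logarithm so that the ceilings are exact, which mirrors the "show every ceiling" advice in the source.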

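7) Answer just 3 questions now and I can compress your revision further into a personalised sprint checklist
Send me these three things (the shorter the better):
- What exact date is your exam, and how many days are left? (So I can schedule which question types you drill each day.)
- Which two areas are you shakiest on:
  - A Join cost
  - B External sort
  - C B+-tree / Hash / clustered vs unclustered
  - D Optimiser + selectivity estimation
  - E Buffer/CLOCK trace
  - F Spark/NoSQL concept questions
- Which do you need more right now: "formula templates + worked examples that I walk you through", or "I set the questions, you compute, I mark them"?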