DATA3404
May 29, 2026
All files
- 好的,我按你这两份资料(asksia-bible-data3404-bilingual.pdf + asksia-cheatsheet-data3404.pdf)把 DATA3404 这门课的期末复习“拆成一张能照着背、照着算、照着拿分”的地图:考什么 → 怎么拿分 → 必背公式带 → 必背定义一句话 → 高频题型模板 → 最常见陷阱清单 → 考前 48 小时冲刺法。
- 先说一句最重要的结论:这门课考试不考代码、不考 SQL 记忆;考的是你能不能“手算引擎内部成本”:给你页数(pages)、缓冲帧数($B$ 或 $M$)、索引信息 → 你能选出最便宜的算法并算出 I/O。[1]Source: asksia-bible-data3404-bilingual.pdfC 3 . EXAM 3 · 考试(EXAM) It's STUVAC. The TL;DR strips, the cost-formula belt and the recap boxes are your map. The blueprint overleaf shows the double hurdle, what 'semi-open book' means, and the calculation types that recur. 现在是 STUVAC (考前复习 周)。TL;DR 条、代价公式带和 回顾框就是你的地图。背面的蓝 图展示了双及格门槛、“半开卷” 意味着什么,以及反复出现的计 算类型。 AskSia Library · DATA3404 · 双语 Bilingual ! The single most important thing to understand about DATA3404 关于 DATA3404 最需要理解的一件事 The exam does not test your code. It tests whether you understand the engine well enough to predict its cost: given relation sizes in pages, a buffer of B frames, and an index, which join is cheapest and what is its I/O? How many passes does external sort take? How tall is the B+-tree? What does the optimiser estimate? Master the cost models and the mechanisms behind them and you clear the 40% exam hurdle comfortably. Memorising SQL will not save you. 考试不考你的代码。它考的是你是否足够理解引擎,从而能预测其代价:给定以页为单位的关系大小、B帧的缓冲区和 一个索引,哪种连接最便宜、其I/O是多少?外部排序需要多少趟?B+树有多高?优化器估计出什么?掌握代价模型及 其背后的机制,你就能轻松越过 40% 的考试及格门槛(hurdle)。死记 SQL 救不了你。 i How this book was built - and the two-layer rule 本书是如何编写的 -- 以及两层规则 Standard database-systems theory, algorithms and cost models are stated plainly (they are universal computer science). The unit's own assignment datasets and the lecturer's specific problem framings are paraphrased and re- numbered, never copied. There is no single prescribed textbook; the canonical references are Ramakrishnan & Gehrke and Garcia-Molina/Ullman/Widom. Every cost calculation here has been checked - but verify weights, dates and the exact 'semi-open book' rules against your own Canvas (canvas. sydney. edu. au). 标准的数据库系统理论、算法和代价模型都直白陈述(它们是通用的计算机科学)。本单元自有的作业数据集和讲师特定 的题目设定都经过改写和重新编号,绝不照搬。本课没有单一指定教材;权威参考是 Ramakrishnan & Gehrke 和 Garcia-Molina/Ullman/Widom。这里每一处代价计算都已核对过 -- 但请对照你自己的 Canvas (canvas. sydney. edu. au) 核实权重、日期和确切的“半开卷”规则。 AskSia Library · DATA3404 · 双语 Bilingual THE BLUEPRINT - THE EXAM BLUEPRINT FINAL 60% . DOUBLE HURDLE 60% final, double-hurdled 60% 期末考试,双及格门槛 Continuous 40% . Final 60% . need > 40% exam AND ≥50% overall 平时成绩 40% · 期末 60% · 需期末 ≥40% 且总成绩 ≥50% Your mark splits 40% continuous / 60% final - but two hurdles gate the pass. Miss either and the result is capped at a fail, regardless of the rest. 你的成绩按 40% 平时 / 60% 期末划分 -- 但有两道及格门槛(hurdle)把守通过线。任意一道没过,无论其余部分如何,结果 都被封顶为不及格。 60% FINAL EXAM 期末考试 40% MIN ON THE EXAM 考试及格分 50% MIN OVERALL 总评及格分 40% CONTINUOUS (6 TASKS) 平时连续考核(6项任务)[2]Source: asksia-bible-data3404-bilingual.pdfFINAL 60% . DOUBLE HURDLE The walk-in page - read this on the bus 考前速览页 -- 上考场前在车上读这一页 Logistics, the "if you see X, do Y" triggers, every cost formula, the traps, and a clock plan 考务安排、“看到X就做Y”的触发器、所有代价公式、各类陷阱,以及一份时间分配方案 This is the only page you re-read the morning of the exam. DATA3404 is a 2-hour, paper-based, semi- open-book final worth 60%, mostly short answer (possibly some MCQ). It does not assess your code - it asks you to compute engine internals by hand: join I/O, external-sort passes, B+-tree height, index choice, selectivity, distributed-join strategy. Marks live in the method: write the formula, sub in the page counts, show every ceiling. Below: what to confirm, the per-topic triggers, the formula belt, the traps that bleed marks, and a two-hour clock. 这是考试当天早晨你唯一要重读的一页。DATA3404 的期末考是2 小时、纸笔形式、半开卷,占60%,主要为简答题(可能 有少量选择题)。它不考你的代码 -- 而是要求你手算引擎内部机制:连接 I/O、外部排序的趟数、B+树高度、索引选择、选 择率、分布式连接策略。分数藏在方法里:写出公式、代入页数、展示每一次取顶。下文给出:需确认的事项、各主题的触发 条件、公式带、会丢分的陷阱,以及一个两小时的时钟规划。 60% FINAL EXAM WEIGHT 期末考试权重 2h PAPER-BASED, ON-PAPER 纸笔、卷面作答 ≥40% EXAM HURDLE 考试及格门槛 ≥50% OVERALL HURDLE 总评及格门槛 ★ The double hurdle - the number that ends people 双重及格门槛 -- 让人栽跟头的那个数字 To pass you need ≥40% on the final exam AND ≥50% overall. Fail either and your final mark is capped at 45 regardless of your average - the exam is flagged a [hurdle task]. Translation: a strong semester (assignments, quizzes, presentation = 40% of the unit) cannot rescue a weak exam, and a great exam can't rescue a thin semester. Treat 40% on the exam as the floor you must clear before you can think about a grade. 要通过你需要期末考 ≥40% 且总评≥50%。任一不达标,你的最终分数无论平均如何都被封顶在 45 -- 考试被标记为 [及格门槛任务]。换言之:一个强劲的学期(作业、测验、展示=单元的40%)救不了一场弱的考试,而一场出色的考 试也救不了一个单薄的学期。把考试的40% 当作你必须越过、才能去想分数的地板。 AskSia Library · DATA3404 · 双语 Bilingual 10. 1 Logistics - confirm before you walk in 10. 1 考务安排 -- 进场前先确认 - Format: paper-based, written, 2 hours, in the formal exam period. Mostly short answer, possibly a few MCQ. 形式:纸笔、书面、2小时,在正式考试期。多为简答, 可能有少量选择题(MCQ)。 - Scope: lectures, labs, homework quizzes and the assignments - but it does NOT assess coding. It is hand-worked theory and cost models. 范围:讲课、实验、作业测验与各次作业 -- 但它不考核 编码。它是手算的理论与代价模型。 - - Bring: pen, a spare pen, your student card, and a calculator if permitted (confirm). You do arithmetic on page counts - not nice round numbers. - 携带:笔、一支备用笔、学生证,以及若允许的计算器 (请确认)。你要对页数做算术 -- 不是好看的整数。 - Hurdle again: ≥40% exam & ≥50% overall, else capped at 45. 再说及格门槛:考试≥40% 且 总评≥50%,否则被封顶 在 45。 ! "Semi-open book" - CONFIRM what you may bring “半开卷” -- 确认你可以带什么[4]Source: asksia-bible-data3404-bilingual.pdfAskSia Library EXAM BIBLE . ASKSIA SCHOOL OF COMPUTER SCIENCE SEMESTER 1 . 2026 node A node B - - THE COMPLETE EXAM BIBLE Scalable Data Management 可扩展数据管理 THE EXAM ISN'T SQL SYNTAX-IT'S THE ENGINE INTERNALS YOU COST BY HAND: INDEXES, JOINS, SORTING, QUERY PLANS. 考的不是 SQL 语法 · 而是你手算的引擎内部成本 DATA3404 . UNIVERSITY OF SYDNEY 中英双语版 · BILINGUAL EDITION 英文主讲,中文随行 一 考试要点与术语保留英文原词 The final is 60% and double-hurdled: you need ≥40% on the exam AND ≥50% overall to pass. It is a 2-hour, semi-open-book, paper-based exam - and the marks live in the engine internals you compute by hand: B+-tree & hash indexing, external-sort I/O, the join cost models, and query-optimisation cost estimation. This book drills those calculations, not your SQL. Independent study companion. Not affiliated with or endorsed by the University of Sydney. Corrections: takedowns@asksia. ai PREFACE - HOW TO USE THIS BOOK Cost it by hand 手算代价 The 60% exam rewards engine internals computed on paper - not SQL recall 这门 60% 的考试奖励的是在纸上算出的引擎内部机制 -- 而不是 SQL 记忆 This is not a transcript of the lecture decks. It is a self-contained course in how a scalable database engine actually works - storage and buffering, indexing, sorting, the join algorithms, query optimisation, and the parallel/distributed stack (MapReduce, Spark). Each idea is stated plainly, drawn as an original schematic, then turned into the cost calculation the exam demands. The practical assignments train your SQL and Spark; this book trains the 60% exam. 本书不是讲义幻灯片的逐字稿。它是一门自成体系的课程,讲解一个可扩展数据库引擎到底是如何工作的 -- 存储与缓冲、索 引、排序、连接算法、查询优化(query optimisation),以及并行/分布式技术栈(MapReduce、Spark)。每个概念都被平实 地讲清,画成原创示意图,再转化为考试所要求的代价计算。实践作业训练你的 SQL 和 Spark;本书训练占60% 的考试 (exam). A 1 . LEARN 1 · 学习(LEARN) You haven't seen the lecture yet. Read a chapter top to bottom. Every topic opens with a one-line TL;DR, then concept - diagram - cost formula - worked example - trap. The schematics are original drawings of standard DB internals - learn the mechanism cold. 你还没上过这节课。从头到尾读 一章。每个主题以一行 TL;DR 开头,然后是概念→ 图示→代 价公式→ 例题 (worked example) →陷阱。这些示意 图是对标准数据库内部机制的原 创绘图 -- 把机制学到滚瓜烂 熟。 B 2 . DRILL 2 · 训练(DRILL) You've done the lecture and the lab. Cover the worked steps and re-compute each I/O cost yourself - given page counts and buffers, derive the join cost, the sort passes, the B+-tree height. That is the exam's single most repeated task. 你已上过课、做过实验。盖住例 题步骤,自己重新算每个 I/O代 价 -- 给定页数和缓冲,推导连 接代价、排序趟数、B+树高度。 这是考试中最常重复出现的单一 任务。[16]Source: asksia-cheatsheet-data3404.pdfDATA3404 Scalable Data Management UNIVERSITY OF SYDNEY . SCHOOL OF COMPUTER SCIENCE EXAM REVISION Sem 1 2026 . SIDE 1 OF 2 Whole-unit revision . all topics SIDE 1/2 JUIN cost models 0 . Exam Blueprint READ FIRST * DATA3404 is about database-engine internals + scaling out - "learn the principles, not the software. " The graded weight: Final 60% (paper-based, 2h, semi- open book [confirm permitted materials]) + 2×12% group practicals (SimpleDB engine; SQL + Spark) + 6% quizzes + 5% participation + 5% presentation. Double hurdle: you must score ≥40% on the exam AND ≥50% overall - fail either and the final mark is capped at 45. The exam costs engine internals, NOT SQL: it is mostly short-answer (some MCQ) and explicitly does not assess your code. Marks live in hand-worked cost models: buffer/CLOCK, B+-tree & hash math, external- sort passes, the join cost formulas, and query- optimisation costing. SIA > Every cost is in page transfers (I/Os), output ignored, CPU & seq-vs-random ignored. Write the formula, sub the numbers, box the answer - method marks survive an arithmetic slip. 1 . Storage & Files LEC 02 DB file = a sequence of fixed-size pages; each page holds many records, aligned to block-oriented disk. The access gap: disk latency >> memory = minimise page I/O. FILE ORGANISATIONS Heap - record placed anywhere with free space; best when access is a full scan. Index - tree/hash access path; faster updates than a sorted file. ACCESS COST (B = #PAGES) OP HEAP SORTED LRU least-recently-used (locality) MRU
-
0)先把“游戏规则”搞清楚:不然复习会跑偏
- 期末考试画像(你要按这个训练)
- 期末 60%,2 小时,纸笔作答,半开卷(但资料明确提醒:允许带什么要你自己去 Canvas/官方通知确认,不能想当然)。[2]Source: asksia-bible-data3404-bilingual.pdfFINAL 60% . DOUBLE HURDLE The walk-in page - read this on the bus 考前速览页 -- 上考场前在车上读这一页 Logistics, the "if you see X, do Y" triggers, every cost formula, the traps, and a clock plan 考务安排、“看到X就做Y”的触发器、所有代价公式、各类陷阱,以及一份时间分配方案 This is the only page you re-read the morning of the exam. DATA3404 is a 2-hour, paper-based, semi- open-book final worth 60%, mostly short answer (possibly some MCQ). It does not assess your code - it asks you to compute engine internals by hand: join I/O, external-sort passes, B+-tree height, index choice, selectivity, distributed-join strategy. Marks live in the method: write the formula, sub in the page counts, show every ceiling. Below: what to confirm, the per-topic triggers, the formula belt, the traps that bleed marks, and a two-hour clock. 这是考试当天早晨你唯一要重读的一页。DATA3404 的期末考是2 小时、纸笔形式、半开卷,占60%,主要为简答题(可能 有少量选择题)。它不考你的代码 -- 而是要求你手算引擎内部机制:连接 I/O、外部排序的趟数、B+树高度、索引选择、选 择率、分布式连接策略。分数藏在方法里:写出公式、代入页数、展示每一次取顶。下文给出:需确认的事项、各主题的触发 条件、公式带、会丢分的陷阱,以及一个两小时的时钟规划。 60% FINAL EXAM WEIGHT 期末考试权重 2h PAPER-BASED, ON-PAPER 纸笔、卷面作答 ≥40% EXAM HURDLE 考试及格门槛 ≥50% OVERALL HURDLE 总评及格门槛 ★ The double hurdle - the number that ends people 双重及格门槛 -- 让人栽跟头的那个数字 To pass you need ≥40% on the final exam AND ≥50% overall. Fail either and your final mark is capped at 45 regardless of your average - the exam is flagged a [hurdle task]. Translation: a strong semester (assignments, quizzes, presentation = 40% of the unit) cannot rescue a weak exam, and a great exam can't rescue a thin semester. Treat 40% on the exam as the floor you must clear before you can think about a grade. 要通过你需要期末考 ≥40% 且总评≥50%。任一不达标,你的最终分数无论平均如何都被封顶在 45 -- 考试被标记为 [及格门槛任务]。换言之:一个强劲的学期(作业、测验、展示=单元的40%)救不了一场弱的考试,而一场出色的考 试也救不了一个单薄的学期。把考试的40% 当作你必须越过、才能去想分数的地板。 AskSia Library · DATA3404 · 双语 Bilingual 10. 1 Logistics - confirm before you walk in 10. 1 考务安排 -- 进场前先确认 - Format: paper-based, written, 2 hours, in the formal exam period. Mostly short answer, possibly a few MCQ. 形式:纸笔、书面、2小时,在正式考试期。多为简答, 可能有少量选择题(MCQ)。 - Scope: lectures, labs, homework quizzes and the assignments - but it does NOT assess coding. It is hand-worked theory and cost models. 范围:讲课、实验、作业测验与各次作业 -- 但它不考核 编码。它是手算的理论与代价模型。 - - Bring: pen, a spare pen, your student card, and a calculator if permitted (confirm). You do arithmetic on page counts - not nice round numbers. - 携带:笔、一支备用笔、学生证,以及若允许的计算器 (请确认)。你要对页数做算术 -- 不是好看的整数。 - Hurdle again: ≥40% exam & ≥50% overall, else capped at 45. 再说及格门槛:考试≥40% 且 总评≥50%,否则被封顶 在 45。 ! "Semi-open book" - CONFIRM what you may bring “半开卷” -- 确认你可以带什么[13]Source: asksia-bible-data3404-bilingual.pdfThe official Assessment Overview literally says "paper-based exam, 2-hours, semi-open book" - but the mined Canvas pages and slides do not specify what materials are permitted. There is no allowed-materials list (no "one A4 sheet", no "lecture notes only") anywhere in the source. Do not assume. Email cs. data3404@sydney. edu. au / A/Prof Roehm and check the official exam notice before the day so you know exactly what you can carry in. Prepare to work from memory either way - treat any permitted notes as a bonus, not a crutch. 官方评估概览原话说“纸笔考试,2小时,半开卷” -- 但所挖掘的 Canvas 页面和讲义并未指明允许哪 些材料。源材料中任何地方都没有允许材料清单(没 有“一张A4纸”,没有“仅讲义”)。不要假设。发邮件 {A cs. data3404@sydney. edu. au / A/Prof Roehm, 并在考试当天之前查阅官方考试通知,以确切知道你 能带什么进去。无论如何都准备好凭记忆作答 -- 把 任何获准的笔记当作额外加分,而非拐杖。 10. 2 If you see X, do Y - the trigger table 10. 2 看到 X 就做 Y -- 触发器表 Pattern-match the question stem to the method. The instant you read the trigger on the left, write the response on the right - before you even finish reading the numbers. 把题干模式匹配到方法上。一读到左侧的触发条件,就写出右侧的应答 -- 甚至在你读完数字之前。 If the question says . . . Storage, indexing & sorting "Sort N pages, B buffers" Passes = 1+ [logB_1[N/B]]; I/O = 2N . passes. Pass O makes [ N/B] runs. B+-tree height / lookup cost Height = [log[ N]; equality lookup = [ log[ N] + 1 (= height + 1, one match). "Which index for this query?" Equality only - hash. Range / prefix / sort / group-by - tree (B+). Hash cannot do ranges. Composite / covering query Tree matches a prefix of the key; put equality attrs before range attrs; answer from the index alone = index-only (no clustering needed). Range-scan cost, index given Clustered: (height+1) + few packed pages. Unclustered: (height+1) + one I/O per matching row. State which. Joins (the gold - §10. 3 belt) "Join cost, M buffers given" Write all five formulas, sub in bR/bs/IRI, state which relation is outer, then pick the min. Inequality or outer join Hash & sort-merge are OUT (equi/natural only). Use nested-loop - BNLJ usually best. AskSia Library . DATA3404 . XXia Bilingual If the question says . . . . . . do this (the method) Plenty of memory, equi-join Grace hash = 3(bR+bs), usually cheapest; symmetric, outer/inner doesn't matter. Output must be sorted (ORDER BY) Prefer sort-merge; its sorted output is a free "interesting order" for the next operator. Optimisation & distributed "Estimate result size"[14]Source: asksia-bible-data3404-bilingual.pdfAskSia Library . DATA3404 . XXia Bilingual 2h, semi-open book ! 'Semi-open book' - confirm what you may bring “半开卷” -- 确认你可以带什么进场 The official unit overview states the exam is semi- open book but the mined materials do not specify exactly which materials are permitted. Do not assume you can bring this book or unlimited notes - check the current unit outline / exam instructions on Canvas before the exam. Study as if closed-book; treat any permitted sheet as a bonus. 官方单元概览说明考试为半开卷,但所挖掘的材料并 未明确指出允许携带哪些材料。不要假设你可以带这 本书或无限制的笔记 -- 考前请到 Canvas 查看当前 的单元大纲/考试须知。按闭卷来备考;把任何获准的 资料当作额外加分。 ★ The strategy this dictates 由此决定的应试策略 Every high-value question is a cost computation or a mechanism trace. The recurring chains are - relation sizes + B buffers - pick the join ++ compute I/O; N pages + B -+ sort passes ++ 2N. passes; data + fan-out -+ B+-tree height -+ search cost; query - RA tree - push selections - estimate cost. Show the formula and every page count - method marks are real. Drill the chains and the numbers can't surprise you. 每一道高分值题目都是一次代价计算或机制追踪。反 复出现的链条是 -- 关系大小+B 缓冲→ 选择连接 →计算 I/O; N页+B→ 排序趟数→ 2N· 趟数;数 据+ 扇出→B+树高度→搜索代价;查询→ 关系代 数树→下推选择→估计代价。写出公式和每个页数 -- 方法分是实打实的。把这些链条练熟,数字就吓 不到你了。 AskSia Library · DATA3404 · 双语 Bilingual CONTENTS - CONTENTS From one node to a cluster 从单节点到集群 The single-node engine first (where the cost models live), then scaling out 先讲单节点引擎(代价模型所在之处),再讲横向扩展 Ch Topic Core content Part 1 . The single-node engine 1 Storage & buffer management disk . pages . records . heap/sorted . row/column . CLOCK → 2 Indexing B+-trees . ISAM . extendible hashing · clustered/dense → 3 Sorting & query processing external merge-sort . operators . access paths → 4 Join algorithms & cost NLJ . BNLJ . INLJ . sort-merge . hash - the I/O formulas → 5 Query optimisation RA equivalences . selectivity . System-R . plan trees → Part 2 . Scaling out 6 Parallel & distributed data[16]Source: asksia-cheatsheet-data3404.pdfDATA3404 Scalable Data Management UNIVERSITY OF SYDNEY . SCHOOL OF COMPUTER SCIENCE EXAM REVISION Sem 1 2026 . SIDE 1 OF 2 Whole-unit revision . all topics SIDE 1/2 JUIN cost models 0 . Exam Blueprint READ FIRST * DATA3404 is about database-engine internals + scaling out - "learn the principles, not the software. " The graded weight: Final 60% (paper-based, 2h, semi- open book [confirm permitted materials]) + 2×12% group practicals (SimpleDB engine; SQL + Spark) + 6% quizzes + 5% participation + 5% presentation. Double hurdle: you must score ≥40% on the exam AND ≥50% overall - fail either and the final mark is capped at 45. The exam costs engine internals, NOT SQL: it is mostly short-answer (some MCQ) and explicitly does not assess your code. Marks live in hand-worked cost models: buffer/CLOCK, B+-tree & hash math, external- sort passes, the join cost formulas, and query- optimisation costing. SIA > Every cost is in page transfers (I/Os), output ignored, CPU & seq-vs-random ignored. Write the formula, sub the numbers, box the answer - method marks survive an arithmetic slip. 1 . Storage & Files LEC 02 DB file = a sequence of fixed-size pages; each page holds many records, aligned to block-oriented disk. The access gap: disk latency >> memory = minimise page I/O. FILE ORGANISATIONS Heap - record placed anywhere with free space; best when access is a full scan. Index - tree/hash access path; faster updates than a sorted file. ACCESS COST (B = #PAGES) OP HEAP SORTED LRU least-recently-used (locality) MRU
- 双及格门槛(double hurdle):
- 你必须 期末 ≥ 40% 且 总评 ≥ 50%;少一个都算挂,最终分会被封顶(材料里提到 capped)。[1]Source: asksia-bible-data3404-bilingual.pdfC 3 . EXAM 3 · 考试(EXAM) It's STUVAC. The TL;DR strips, the cost-formula belt and the recap boxes are your map. The blueprint overleaf shows the double hurdle, what 'semi-open book' means, and the calculation types that recur. 现在是 STUVAC (考前复习 周)。TL;DR 条、代价公式带和 回顾框就是你的地图。背面的蓝 图展示了双及格门槛、“半开卷” 意味着什么,以及反复出现的计 算类型。 AskSia Library · DATA3404 · 双语 Bilingual ! The single most important thing to understand about DATA3404 关于 DATA3404 最需要理解的一件事 The exam does not test your code. It tests whether you understand the engine well enough to predict its cost: given relation sizes in pages, a buffer of B frames, and an index, which join is cheapest and what is its I/O? How many passes does external sort take? How tall is the B+-tree? What does the optimiser estimate? Master the cost models and the mechanisms behind them and you clear the 40% exam hurdle comfortably. Memorising SQL will not save you. 考试不考你的代码。它考的是你是否足够理解引擎,从而能预测其代价:给定以页为单位的关系大小、B帧的缓冲区和 一个索引,哪种连接最便宜、其I/O是多少?外部排序需要多少趟?B+树有多高?优化器估计出什么?掌握代价模型及 其背后的机制,你就能轻松越过 40% 的考试及格门槛(hurdle)。死记 SQL 救不了你。 i How this book was built - and the two-layer rule 本书是如何编写的 -- 以及两层规则 Standard database-systems theory, algorithms and cost models are stated plainly (they are universal computer science). The unit's own assignment datasets and the lecturer's specific problem framings are paraphrased and re- numbered, never copied. There is no single prescribed textbook; the canonical references are Ramakrishnan & Gehrke and Garcia-Molina/Ullman/Widom. Every cost calculation here has been checked - but verify weights, dates and the exact 'semi-open book' rules against your own Canvas (canvas. sydney. edu. au). 标准的数据库系统理论、算法和代价模型都直白陈述(它们是通用的计算机科学)。本单元自有的作业数据集和讲师特定 的题目设定都经过改写和重新编号,绝不照搬。本课没有单一指定教材;权威参考是 Ramakrishnan & Gehrke 和 Garcia-Molina/Ullman/Widom。这里每一处代价计算都已核对过 -- 但请对照你自己的 Canvas (canvas. sydney. edu. au) 核实权重、日期和确切的“半开卷”规则。 AskSia Library · DATA3404 · 双语 Bilingual THE BLUEPRINT - THE EXAM BLUEPRINT FINAL 60% . DOUBLE HURDLE 60% final, double-hurdled 60% 期末考试,双及格门槛 Continuous 40% . Final 60% . need > 40% exam AND ≥50% overall 平时成绩 40% · 期末 60% · 需期末 ≥40% 且总成绩 ≥50% Your mark splits 40% continuous / 60% final - but two hurdles gate the pass. Miss either and the result is capped at a fail, regardless of the rest. 你的成绩按 40% 平时 / 60% 期末划分 -- 但有两道及格门槛(hurdle)把守通过线。任意一道没过,无论其余部分如何,结果 都被封顶为不及格。 60% FINAL EXAM 期末考试 40% MIN ON THE EXAM 考试及格分 50% MIN OVERALL 总评及格分 40% CONTINUOUS (6 TASKS) 平时连续考核(6项任务)[2]Source: asksia-bible-data3404-bilingual.pdfFINAL 60% . DOUBLE HURDLE The walk-in page - read this on the bus 考前速览页 -- 上考场前在车上读这一页 Logistics, the "if you see X, do Y" triggers, every cost formula, the traps, and a clock plan 考务安排、“看到X就做Y”的触发器、所有代价公式、各类陷阱,以及一份时间分配方案 This is the only page you re-read the morning of the exam. DATA3404 is a 2-hour, paper-based, semi- open-book final worth 60%, mostly short answer (possibly some MCQ). It does not assess your code - it asks you to compute engine internals by hand: join I/O, external-sort passes, B+-tree height, index choice, selectivity, distributed-join strategy. Marks live in the method: write the formula, sub in the page counts, show every ceiling. Below: what to confirm, the per-topic triggers, the formula belt, the traps that bleed marks, and a two-hour clock. 这是考试当天早晨你唯一要重读的一页。DATA3404 的期末考是2 小时、纸笔形式、半开卷,占60%,主要为简答题(可能 有少量选择题)。它不考你的代码 -- 而是要求你手算引擎内部机制:连接 I/O、外部排序的趟数、B+树高度、索引选择、选 择率、分布式连接策略。分数藏在方法里:写出公式、代入页数、展示每一次取顶。下文给出:需确认的事项、各主题的触发 条件、公式带、会丢分的陷阱,以及一个两小时的时钟规划。 60% FINAL EXAM WEIGHT 期末考试权重 2h PAPER-BASED, ON-PAPER 纸笔、卷面作答 ≥40% EXAM HURDLE 考试及格门槛 ≥50% OVERALL HURDLE 总评及格门槛 ★ The double hurdle - the number that ends people 双重及格门槛 -- 让人栽跟头的那个数字 To pass you need ≥40% on the final exam AND ≥50% overall. Fail either and your final mark is capped at 45 regardless of your average - the exam is flagged a [hurdle task]. Translation: a strong semester (assignments, quizzes, presentation = 40% of the unit) cannot rescue a weak exam, and a great exam can't rescue a thin semester. Treat 40% on the exam as the floor you must clear before you can think about a grade. 要通过你需要期末考 ≥40% 且总评≥50%。任一不达标,你的最终分数无论平均如何都被封顶在 45 -- 考试被标记为 [及格门槛任务]。换言之:一个强劲的学期(作业、测验、展示=单元的40%)救不了一场弱的考试,而一场出色的考 试也救不了一个单薄的学期。把考试的40% 当作你必须越过、才能去想分数的地板。 AskSia Library · DATA3404 · 双语 Bilingual 10. 1 Logistics - confirm before you walk in 10. 1 考务安排 -- 进场前先确认 - Format: paper-based, written, 2 hours, in the formal exam period. Mostly short answer, possibly a few MCQ. 形式:纸笔、书面、2小时,在正式考试期。多为简答, 可能有少量选择题(MCQ)。 - Scope: lectures, labs, homework quizzes and the assignments - but it does NOT assess coding. It is hand-worked theory and cost models. 范围:讲课、实验、作业测验与各次作业 -- 但它不考核 编码。它是手算的理论与代价模型。 - - Bring: pen, a spare pen, your student card, and a calculator if permitted (confirm). You do arithmetic on page counts - not nice round numbers. - 携带:笔、一支备用笔、学生证,以及若允许的计算器 (请确认)。你要对页数做算术 -- 不是好看的整数。 - Hurdle again: ≥40% exam & ≥50% overall, else capped at 45. 再说及格门槛:考试≥40% 且 总评≥50%,否则被封顶 在 45。 ! "Semi-open book" - CONFIRM what you may bring “半开卷” -- 确认你可以带什么[16]Source: asksia-cheatsheet-data3404.pdfDATA3404 Scalable Data Management UNIVERSITY OF SYDNEY . SCHOOL OF COMPUTER SCIENCE EXAM REVISION Sem 1 2026 . SIDE 1 OF 2 Whole-unit revision . all topics SIDE 1/2 JUIN cost models 0 . Exam Blueprint READ FIRST * DATA3404 is about database-engine internals + scaling out - "learn the principles, not the software. " The graded weight: Final 60% (paper-based, 2h, semi- open book [confirm permitted materials]) + 2×12% group practicals (SimpleDB engine; SQL + Spark) + 6% quizzes + 5% participation + 5% presentation. Double hurdle: you must score ≥40% on the exam AND ≥50% overall - fail either and the final mark is capped at 45. The exam costs engine internals, NOT SQL: it is mostly short-answer (some MCQ) and explicitly does not assess your code. Marks live in hand-worked cost models: buffer/CLOCK, B+-tree & hash math, external- sort passes, the join cost formulas, and query- optimisation costing. SIA > Every cost is in page transfers (I/Os), output ignored, CPU & seq-vs-random ignored. Write the formula, sub the numbers, box the answer - method marks survive an arithmetic slip. 1 . Storage & Files LEC 02 DB file = a sequence of fixed-size pages; each page holds many records, aligned to block-oriented disk. The access gap: disk latency >> memory = minimise page I/O. FILE ORGANISATIONS Heap - record placed anywhere with free space; best when access is a full scan. Index - tree/hash access path; faster updates than a sorted file. ACCESS COST (B = #PAGES) OP HEAP SORTED LRU least-recently-used (locality) MRU
- 评分规律(最值钱的习惯)
- 分数藏在“方法”里:每道代价题都要写:公式 → 代入页数 → 每次取整/上取整(ceiling) → 算术 → 框答案。就算乘法算错,方法对也能拿分;只有裸数字错了基本没分。[2]Source: asksia-bible-data3404-bilingual.pdfFINAL 60% . DOUBLE HURDLE The walk-in page - read this on the bus 考前速览页 -- 上考场前在车上读这一页 Logistics, the "if you see X, do Y" triggers, every cost formula, the traps, and a clock plan 考务安排、“看到X就做Y”的触发器、所有代价公式、各类陷阱,以及一份时间分配方案 This is the only page you re-read the morning of the exam. DATA3404 is a 2-hour, paper-based, semi- open-book final worth 60%, mostly short answer (possibly some MCQ). It does not assess your code - it asks you to compute engine internals by hand: join I/O, external-sort passes, B+-tree height, index choice, selectivity, distributed-join strategy. Marks live in the method: write the formula, sub in the page counts, show every ceiling. Below: what to confirm, the per-topic triggers, the formula belt, the traps that bleed marks, and a two-hour clock. 这是考试当天早晨你唯一要重读的一页。DATA3404 的期末考是2 小时、纸笔形式、半开卷,占60%,主要为简答题(可能 有少量选择题)。它不考你的代码 -- 而是要求你手算引擎内部机制:连接 I/O、外部排序的趟数、B+树高度、索引选择、选 择率、分布式连接策略。分数藏在方法里:写出公式、代入页数、展示每一次取顶。下文给出:需确认的事项、各主题的触发 条件、公式带、会丢分的陷阱,以及一个两小时的时钟规划。 60% FINAL EXAM WEIGHT 期末考试权重 2h PAPER-BASED, ON-PAPER 纸笔、卷面作答 ≥40% EXAM HURDLE 考试及格门槛 ≥50% OVERALL HURDLE 总评及格门槛 ★ The double hurdle - the number that ends people 双重及格门槛 -- 让人栽跟头的那个数字 To pass you need ≥40% on the final exam AND ≥50% overall. Fail either and your final mark is capped at 45 regardless of your average - the exam is flagged a [hurdle task]. Translation: a strong semester (assignments, quizzes, presentation = 40% of the unit) cannot rescue a weak exam, and a great exam can't rescue a thin semester. Treat 40% on the exam as the floor you must clear before you can think about a grade. 要通过你需要期末考 ≥40% 且总评≥50%。任一不达标,你的最终分数无论平均如何都被封顶在 45 -- 考试被标记为 [及格门槛任务]。换言之:一个强劲的学期(作业、测验、展示=单元的40%)救不了一场弱的考试,而一场出色的考 试也救不了一个单薄的学期。把考试的40% 当作你必须越过、才能去想分数的地板。 AskSia Library · DATA3404 · 双语 Bilingual 10. 1 Logistics - confirm before you walk in 10. 1 考务安排 -- 进场前先确认 - Format: paper-based, written, 2 hours, in the formal exam period. Mostly short answer, possibly a few MCQ. 形式:纸笔、书面、2小时,在正式考试期。多为简答, 可能有少量选择题(MCQ)。 - Scope: lectures, labs, homework quizzes and the assignments - but it does NOT assess coding. It is hand-worked theory and cost models. 范围:讲课、实验、作业测验与各次作业 -- 但它不考核 编码。它是手算的理论与代价模型。 - - Bring: pen, a spare pen, your student card, and a calculator if permitted (confirm). You do arithmetic on page counts - not nice round numbers. - 携带:笔、一支备用笔、学生证,以及若允许的计算器 (请确认)。你要对页数做算术 -- 不是好看的整数。 - Hurdle again: ≥40% exam & ≥50% overall, else capped at 45. 再说及格门槛:考试≥40% 且 总评≥50%,否则被封顶 在 45。 ! "Semi-open book" - CONFIRM what you may bring “半开卷” -- 确认你可以带什么[9]Source: asksia-bible-data3404-bilingual.pdf- PRACTICE BANK (CONT. ) Spark & conceptual - DAGs, engines, NoSQL Spark 与概念题–––DAG、引擎、NoSQL Narrow vs wide dependencies, Spark-vs-MapReduce, and a NoSQL choice 窄依赖 vs 宽依赖、Spark vs MapReduce,以及一次 NoSQL 选型 Q18 SPARK / CONCEPTS 10 marks . DAG / engines / NoSQL . Areas 8-10 Consider this Spark pipeline: 考虑这段 Spark 流水线: rdd. map(f). filter(g). groupByKey(). mapValues(h) . collect () rdd. map(f). filter(g). groupByKey(). mapValues(h). collect() (a) Label each transformation narrow or wide, and count the stages. [4] (b) Which call triggers execution, and why is the rest "lazy"? [2] (c) Name one reason Spark beats MapReduce for an iterative job. [2] (d) Choose a NoSQL family for a workload of simple Put /Get by key with no joins, and justify. [2] (a) 将每个转换(transformation) 标注为窄依赖(narrow)或宽依赖(wide),并统计阶段(stage)数。[4](b)哪个 调用触发执行,其余为何是“惰性的(lazy)”?[2](c)说出 Spark 在迭代作业上胜过 MapReduce 的一个原因。[2](d) 为一个仅按键做简单 Put/Get、无连接的工作负载选择一个 NoSQL 家族,并说明理由。[2] WORKED SOLUTION 1 (a) Dependencies. map = narrow (each output partition from one input partition); filter = narrow; groupByKey = wide (a shuffle - same key must meet across partitions); mapValues = narrow. A new stage starts at each wide dependency, so there are 2 stages: {map, filter} before the shuffle, {groupByKey's reduce side, mapValues} after it. - (a) 依赖关系。map =窄依赖(每个输出分区来自一个输入分区);filter =窄依赖;groupByKey = 宽依赖(一次洗牌 -- 相同的键必须跨分区相遇);mapValues = 窄依赖。每个宽依赖处开始一个新阶段,所以有 2个阶段:洗牌之前的 {map, filter},之后的 {groupByKey 的归约侧,mapValues}。 - 2 (b) collect () is the action that triggers execution. Transformations are lazy - recorded as a DAG (lineage), not run - so Spark can fuse narrow ops into one stage, plan shuffles, and rebuild lost partitions from lineage. Nothing computes until an action pulls a result to the driver. (b) collect(〕 是触发执行的动作。转换是惰性的 -- 被记录为一个 DAG(血统),并不运行 -- 这样 Spark 就能把窄算 子融合进一个阶段、规划洗牌,并从血统重建丢失的分区。在动作把结果拉到驱动器之前,什么都不会计算。 3 (c) Iterative speed-up: in-memory caching. Spark keeps reused RDDs/DataFrames in memory (cache ()), so each iteration avoids MapReduce's materialise-to-HDFS-then-reload between every map-reduce step. (c)迭代加速:内存缓存。Spark 把被复用的 RDD/DataFrame 留在内存中(cache()),所以每次迭代都避免了 MapReduce 在每个 map-reduce 步骤之间物化到 HDFS 再重新加载。 4 (d) Key-Value store (e. g. Redis, RocksDB, Dynamo). The access pattern is exactly Put /Get /Delete by key with an opaque payload and no joins or range scans - the simplest, fastest-scaling family for that shape. A document store fits if the value is queryable JSON; a wide-column store fits sparse, column-family data; a graph store fits relationship traversal - none of which this workload needs. AskSia Library . DATA3404 . XXia Bilingual (d) 键值存储(如 Redis、RocksDB、Dynamo)。访问模式恰好是按键的 Put/Get/Delete,负载不透明,且没有连接或范 围扫描 -- 对这种形态而言,这是最简单、扩展最快的一类。如果值是可查询的 JSON,文档存储更合适;如果是稀疏的列族 数据,宽列存储更合适;如果是关系遍历,图存储更合适 -- 而这个工作负载都不需要。 ✓ Stage-counting shortcut 数阶段的捷径 #stages = #wide dependencies + 1. Each shuffle (groupByKey, reduceByKey, join on non-co-partitioned data, distinct, repartition) ends a stage. Chains of map/filter/mapValues are fused into the surrounding stage for free. 阶段数 = 宽依赖数 +1。每次洗牌(groupByKey、reduceByKey、在非共分区数据上的 join、distinct、repartition) 结束一个阶段。map/filter/mapValues 的链条被融合进周围的阶段,不增加开销。 On every cost question: write the formula, substitute the page counts, then compute - in that order. The marker can award method marks on a correct 3 . (b_R+b_S) even if the multiply slips, but a bare number with no working earns nothing if it is wrong. Choose the smaller relation as the outer, keep ceilings until the end, and remember output cost is free. 对每道代价题:写出公式,代入页数,再计算 -- 按这个顺序。即使乘法算错,阅卷者也能在一个正确的 3·(b_R+b_S) 上给方法分,但一个没有过程的裸数字如果错了就什么都得不到。选较小的关系作外表,保留取整直到 最后,并记住输出代价是免费的。 MARKER'S NOTE . DATA3404 FINAL AskSia Library · DATA3404 · 双语 Bilingual CH 10 . EXAM MORNING - CHAPTER 10 . EXAM MORNING[11]Source: asksia-bible-data3404-bilingual.pdfMultiply by reduction factors (selectivity); equality on V distinct values - 1/V => |RI/V rows. "Reorder this query" Push selections & projections down, most-restrictive first; consider left-deep trees (System-R DP, pipelined). Distributed join, one small table Broadcast the small/filtered side to all nodes - local joins. Else shuffle / hash-partition both on the join key (equi only). Spark stage boundary / dependency Narrow dep (map, filter) = pipelined; wide dep (groupBy, join, reduceByKey) = shuffle = stage boundary. "Any join condition" distributed Fragment-and-replicate works for any condition (incl. non-equi); needs ≥ m. n processors. ✓ Always answer in the same shape 始终以相同的形式作答 Formula - substitution - arithmetic - one-line verdict. e. g. "BNLJ = bR + [bR/(M-2)1. bs = 100 + [100/81. 400 = 100 + 13. 400 = 5,300 I/O; Student outer is cheapest. " The marker can follow every step, so even an arithmetic slip keeps the method marks. 公式→代入→算术→一句结论。例如:“BNLJ=bR +[bR/(M-2)] ·bs = 100+[100/8] · 400 = 100+13·400= 5,300 I/O; Student 为外表最便宜。”阅卷者能跟上每一步,故即使算术出错也保住方法分。 AskSia Library . DATA3404 . XXia Bilingual CH 10 . EXAM MORNING 10. 3 The cost-formula belt - the whole exam on one box 10. 3 代价公式带 -- 整场考试浓缩成一个方框 If you memorise nothing else, memorise this. Notation: bR, bs, N = #pages; IRI = #rows; M / B = buffer frames; F = fanout; R = outer, S = inner; cost = page I/O, output ignored. 如果你别的都不背,就背这个。记号:bR,bs, N = 页数;|R| = 行数;M / B= 缓冲帧数;F = 扇出(fan-out); R =外表, S= 内表;代价=页 I/O,输出忽略不计。 JOIN I/O - THE FIVE Tuple NLJ = bR + |R | . bs Page NLJ = bp + bR . bs Block NLJ = bR + [bR/(M-2)1 . bs Index NLJ = bR + |R | · c, c = 1 + h + #matches Sort-Merge = sort (R)+sort(S)+bR+bs (= bR+bs if pre-sorted) Grace Hash = 3 . (bR + bs) SORT / INDEX / SIZING Ext. sort passes = 1 + [logB-1[N/B]] Ext. sort I/0 = 2N . #passes (Pass 0 - [N/B] runs of B) B+-tree height = [log- N] B+ equality cost = [log= N] + 1 Hash equality cost = #pages in bucket (1, no overflow) Selectivity (= on V values) = 1/V est . matches = |R | / V Reduction factor = out / in (multiply through) 10. 4 Top traps - where the marks bleed out 10. 4 头号陷阱 -- 失分最严重的地方 ! M-2 vs M in BNLJ 块嵌套循环连接(BNLJ)中的 M-2 vs M[16]Source: asksia-cheatsheet-data3404.pdfDATA3404 Scalable Data Management UNIVERSITY OF SYDNEY . SCHOOL OF COMPUTER SCIENCE EXAM REVISION Sem 1 2026 . SIDE 1 OF 2 Whole-unit revision . all topics SIDE 1/2 JUIN cost models 0 . Exam Blueprint READ FIRST * DATA3404 is about database-engine internals + scaling out - "learn the principles, not the software. " The graded weight: Final 60% (paper-based, 2h, semi- open book [confirm permitted materials]) + 2×12% group practicals (SimpleDB engine; SQL + Spark) + 6% quizzes + 5% participation + 5% presentation. Double hurdle: you must score ≥40% on the exam AND ≥50% overall - fail either and the final mark is capped at 45. The exam costs engine internals, NOT SQL: it is mostly short-answer (some MCQ) and explicitly does not assess your code. Marks live in hand-worked cost models: buffer/CLOCK, B+-tree & hash math, external- sort passes, the join cost formulas, and query- optimisation costing. SIA > Every cost is in page transfers (I/Os), output ignored, CPU & seq-vs-random ignored. Write the formula, sub the numbers, box the answer - method marks survive an arithmetic slip. 1 . Storage & Files LEC 02 DB file = a sequence of fixed-size pages; each page holds many records, aligned to block-oriented disk. The access gap: disk latency >> memory = minimise page I/O. FILE ORGANISATIONS Heap - record placed anywhere with free space; best when access is a full scan. Index - tree/hash access path; faster updates than a sorted file. ACCESS COST (B = #PAGES) OP HEAP SORTED LRU least-recently-used (locality) MRU
- 统一计量单位:页 I/O(page I/Os);并且材料强调:输出代价忽略,CPU 代价忽略,顺序 vs 随机 I/O 也忽略(你别在这些上浪费时间)。[5]Source: asksia-bible-data3404-bilingual.pdfpartitioning . parallel joins . HDFS . MapReduce → 7 Spark & the big-data stack RDDs . DAG . lazy eval . NoSQL . streaming → Part 3 . Walk in ready 8 Glossary & cost-formula map every term EN++ · every formula → 9 Practice bank & worked solutions exam-style cost problems, re-numbered → 10 Exam morning the by-hand checklist . timing . traps → i Why this order 为何采用这一顺序 DATA3404 builds the single-node engine first - storage, indexing, sorting, joins, optimisation - because that is where the exam's cost calculations live and where most marks sit. It then scales out to parallel/distributed processing, MapReduce and Spark. Part 3 turns it all into marks: a bilingual glossary, a cost-formula map, the re- numbered practice template, and a one-page exam-morning drill. DATA3404 先构建单节点引擎 -- 存储、索引、排序、连接、优化 -- 因为考试的代价计算就在这里,大部分分数也坐 落于此。然后它横向扩展到并行/分布式处理、MapReduce 和 Spark。第 3 部分把这一切变成分数:一份双语术语表、 一张代价公式图、重新编号的练习模板,以及一份单页的考前早晨速练。 AskSia Library . DATA3404 . XXia Bilingual CH 1 . STORAGE EXAM CORE - CHAPTER 1 . STORAGE & BUFFER MANAGEMENT Pages, files and the cost of an I/O 页、文件与一次 I/O 的代价 The memory-disk gap . records & slotted pages . heap vs sorted . row vs column . the buffer pool & replacement policies 内存一磁盘差距 · 记录与槽式页 · 堆 vs 有序 · 行 vs 列 · 缓冲池与置换策略 Takeaway: everything DATA3404 ever charges you for is measured in page I/Os - the number of fixed-size pages transferred between disk and memory. This chapter builds that cost vocabulary: how data is laid out on pages, which file organisation makes a scan or a search cheap, and how the buffer manager decides which page to throw out when memory is full. Get the counting right here and the join/sort/optimisation chapters are just the same arithmetic at scale. 要点:DATA3404 向你收取的所有代价,都是以页 I/O来度量的 -- 即在磁盘与内存之间传输的固定大小页(page)的数量。 本章建立这套代价词汇:数据如何在页上布局、哪种文件组织能让扫描或查找变便宜,以及当内存写满时缓冲管理器如何决定 逐出哪个页。把这里的计数弄对,连接/排序/优化各章就只是同一套算术在更大规模上的重复。 ★ What the exam asks here 考试在这里会问什么 Four moves, all done by hand. (1) Trace a buffer-replacement policy (LRU / CLOCK / GCLOCK) over a page- reference string and count the page misses for each. (2) Reason about heap vs sorted access cost - scan, equality search, insert, delete - in O(B) / O(log_B) terms. (3) Explain the slotted-page layout and the record formats (fixed vs variable length, the NULL bitmap, the TID). (4) Compare row-store vs column-store for a given workload and say which wins. Output cost is always ignored; sequential-vs-random and CPU cost are ignored too. 四个动作,全部手算完成。(1)在一段页引用串上追踪一种缓冲置换策略(LRU / CLOCK / GCLOCK),并为每种统计 缺页次数。(2)用 O(B)/ O(log2B)的形式推理堆 vs 有序的访问代价 -- 扫描、等值搜索、插入、删除。(3)解释槽式 页(slotted page)布局和记录格式(定长vs变长、NULL 位图、TID)。(4)针对给定工作负载比较行存储(row store) vs 列存储(column store)并说明哪种胜出。输出代价始终忽略不计;顺序 vs 随机以及 CPU 代价也忽略。[11]Source: asksia-bible-data3404-bilingual.pdfMultiply by reduction factors (selectivity); equality on V distinct values - 1/V => |RI/V rows. "Reorder this query" Push selections & projections down, most-restrictive first; consider left-deep trees (System-R DP, pipelined). Distributed join, one small table Broadcast the small/filtered side to all nodes - local joins. Else shuffle / hash-partition both on the join key (equi only). Spark stage boundary / dependency Narrow dep (map, filter) = pipelined; wide dep (groupBy, join, reduceByKey) = shuffle = stage boundary. "Any join condition" distributed Fragment-and-replicate works for any condition (incl. non-equi); needs ≥ m. n processors. ✓ Always answer in the same shape 始终以相同的形式作答 Formula - substitution - arithmetic - one-line verdict. e. g. "BNLJ = bR + [bR/(M-2)1. bs = 100 + [100/81. 400 = 100 + 13. 400 = 5,300 I/O; Student outer is cheapest. " The marker can follow every step, so even an arithmetic slip keeps the method marks. 公式→代入→算术→一句结论。例如:“BNLJ=bR +[bR/(M-2)] ·bs = 100+[100/8] · 400 = 100+13·400= 5,300 I/O; Student 为外表最便宜。”阅卷者能跟上每一步,故即使算术出错也保住方法分。 AskSia Library . DATA3404 . XXia Bilingual CH 10 . EXAM MORNING 10. 3 The cost-formula belt - the whole exam on one box 10. 3 代价公式带 -- 整场考试浓缩成一个方框 If you memorise nothing else, memorise this. Notation: bR, bs, N = #pages; IRI = #rows; M / B = buffer frames; F = fanout; R = outer, S = inner; cost = page I/O, output ignored. 如果你别的都不背,就背这个。记号:bR,bs, N = 页数;|R| = 行数;M / B= 缓冲帧数;F = 扇出(fan-out); R =外表, S= 内表;代价=页 I/O,输出忽略不计。 JOIN I/O - THE FIVE Tuple NLJ = bR + |R | . bs Page NLJ = bp + bR . bs Block NLJ = bR + [bR/(M-2)1 . bs Index NLJ = bR + |R | · c, c = 1 + h + #matches Sort-Merge = sort (R)+sort(S)+bR+bs (= bR+bs if pre-sorted) Grace Hash = 3 . (bR + bs) SORT / INDEX / SIZING Ext. sort passes = 1 + [logB-1[N/B]] Ext. sort I/0 = 2N . #passes (Pass 0 - [N/B] runs of B) B+-tree height = [log- N] B+ equality cost = [log= N] + 1 Hash equality cost = #pages in bucket (1, no overflow) Selectivity (= on V values) = 1/V est . matches = |R | / V Reduction factor = out / in (multiply through) 10. 4 Top traps - where the marks bleed out 10. 4 头号陷阱 -- 失分最严重的地方 ! M-2 vs M in BNLJ 块嵌套循环连接(BNLJ)中的 M-2 vs M[16]Source: asksia-cheatsheet-data3404.pdfDATA3404 Scalable Data Management UNIVERSITY OF SYDNEY . SCHOOL OF COMPUTER SCIENCE EXAM REVISION Sem 1 2026 . SIDE 1 OF 2 Whole-unit revision . all topics SIDE 1/2 JUIN cost models 0 . Exam Blueprint READ FIRST * DATA3404 is about database-engine internals + scaling out - "learn the principles, not the software. " The graded weight: Final 60% (paper-based, 2h, semi- open book [confirm permitted materials]) + 2×12% group practicals (SimpleDB engine; SQL + Spark) + 6% quizzes + 5% participation + 5% presentation. Double hurdle: you must score ≥40% on the exam AND ≥50% overall - fail either and the final mark is capped at 45. The exam costs engine internals, NOT SQL: it is mostly short-answer (some MCQ) and explicitly does not assess your code. Marks live in hand-worked cost models: buffer/CLOCK, B+-tree & hash math, external- sort passes, the join cost formulas, and query- optimisation costing. SIA > Every cost is in page transfers (I/Os), output ignored, CPU & seq-vs-random ignored. Write the formula, sub the numbers, box the answer - method marks survive an arithmetic slip. 1 . Storage & Files LEC 02 DB file = a sequence of fixed-size pages; each page holds many records, aligned to block-oriented disk. The access gap: disk latency >> memory = minimise page I/O. FILE ORGANISATIONS Heap - record placed anywhere with free space; best when access is a full scan. Index - tree/hash access path; faster updates than a sorted file. ACCESS COST (B = #PAGES) OP HEAP SORTED LRU least-recently-used (locality) MRU
-
1)考试“最高频黄金内容”是什么?(你优先把这些练到闭眼能写)
- 资料反复强调:期末最重复的任务就是这些“链条题”:[1]Source: asksia-bible-data3404-bilingual.pdfC 3 . EXAM 3 · 考试(EXAM)
It's STUVAC. The TL;DR strips, the cost-formula belt and the recap boxes are your map. The blueprint overleaf shows the double hurdle, what 'semi-open book' means, and the calculation types that recur.
现在是 STUVAC (考前复习
周)。TL;DR 条、代价公式带和 回顾框就是你的地图。背面的蓝 图展示了双及格门槛、“半开卷” 意味着什么,以及反复出现的计 算类型。
AskSia Library · DATA3404 · 双语 Bilingual
! The single most important thing to understand about DATA3404
关于 DATA3404 最需要理解的一件事
The exam does not test your code. It tests whether you understand the engine well enough to predict its cost: given relation sizes in pages, a buffer of B frames, and an index, which join is cheapest and what is its I/O? How many passes does external sort take? How tall is the B+-tree? What does the optimiser estimate? Master the cost models and the mechanisms behind them and you clear the 40% exam hurdle comfortably. Memorising SQL will not save you.
考试不考你的代码。它考的是你是否足够理解引擎,从而能预测其代价:给定以页为单位的关系大小、B帧的缓冲区和 一个索引,哪种连接最便宜、其I/O是多少?外部排序需要多少趟?B+树有多高?优化器估计出什么?掌握代价模型及 其背后的机制,你就能轻松越过 40% 的考试及格门槛(hurdle)。死记 SQL 救不了你。
i How this book was built - and the two-layer rule
本书是如何编写的 -- 以及两层规则
Standard database-systems theory, algorithms and cost models are stated plainly (they are universal computer science). The unit's own assignment datasets and the lecturer's specific problem framings are paraphrased and re-
numbered, never copied. There is no single prescribed textbook; the canonical references are Ramakrishnan & Gehrke and Garcia-Molina/Ullman/Widom. Every cost calculation here has been checked - but verify weights, dates and the exact 'semi-open book' rules against your own Canvas (canvas. sydney. edu. au).
标准的数据库系统理论、算法和代价模型都直白陈述(它们是通用的计算机科学)。本单元自有的作业数据集和讲师特定 的题目设定都经过改写和重新编号,绝不照搬。本课没有单一指定教材;权威参考是 Ramakrishnan & Gehrke 和 Garcia-Molina/Ullman/Widom。这里每一处代价计算都已核对过 -- 但请对照你自己的 Canvas (canvas. sydney. edu. au) 核实权重、日期和确切的“半开卷”规则。
AskSia Library · DATA3404 · 双语 Bilingual
THE BLUEPRINT
- THE EXAM BLUEPRINT
FINAL 60% . DOUBLE HURDLE
60% final, double-hurdled 60% 期末考试,双及格门槛
Continuous 40% . Final 60% . need > 40% exam AND ≥50% overall
平时成绩 40% · 期末 60% · 需期末 ≥40% 且总成绩 ≥50%
Your mark splits 40% continuous / 60% final - but two hurdles gate the pass. Miss either and the result is capped at a fail, regardless of the rest.
你的成绩按 40% 平时 / 60% 期末划分 -- 但有两道及格门槛(hurdle)把守通过线。任意一道没过,无论其余部分如何,结果 都被封顶为不及格。
60%
FINAL EXAM 期末考试
40%
MIN ON THE EXAM 考试及格分
50%
MIN OVERALL 总评及格分
40% CONTINUOUS (6 TASKS) 平时连续考核(6项任务)[4]Source: asksia-bible-data3404-bilingual.pdfAskSia Library
EXAM BIBLE . ASKSIA SCHOOL OF COMPUTER SCIENCE SEMESTER 1 . 2026
node A
node B
-
- THE COMPLETE EXAM BIBLE
Scalable Data Management 可扩展数据管理
THE EXAM ISN'T SQL SYNTAX-IT'S THE ENGINE INTERNALS YOU COST BY HAND: INDEXES, JOINS, SORTING, QUERY PLANS.
考的不是 SQL 语法 · 而是你手算的引擎内部成本
DATA3404 . UNIVERSITY OF SYDNEY
中英双语版 · BILINGUAL EDITION
英文主讲,中文随行 一 考试要点与术语保留英文原词
The final is 60% and double-hurdled: you need ≥40% on the exam AND ≥50% overall to pass. It is a 2-hour, semi-open-book, paper-based exam - and the marks live in the engine internals you compute by hand: B+-tree & hash indexing, external-sort I/O, the join cost models, and query-optimisation cost estimation. This book drills those calculations, not your SQL.
Independent study companion. Not affiliated with or endorsed by the
University of Sydney.
Corrections: takedowns@asksia. ai
PREFACE
- HOW TO USE THIS BOOK
Cost it by hand 手算代价
The 60% exam rewards engine internals computed on paper - not SQL recall 这门 60% 的考试奖励的是在纸上算出的引擎内部机制 -- 而不是 SQL 记忆
This is not a transcript of the lecture decks. It is a self-contained course in how a scalable database engine actually works - storage and buffering, indexing, sorting, the join algorithms, query optimisation, and the parallel/distributed stack (MapReduce, Spark). Each idea is stated plainly, drawn as an original schematic, then turned into the cost calculation the exam demands. The practical assignments train your SQL and Spark; this book trains the 60% exam.
本书不是讲义幻灯片的逐字稿。它是一门自成体系的课程,讲解一个可扩展数据库引擎到底是如何工作的 -- 存储与缓冲、索 引、排序、连接算法、查询优化(query optimisation),以及并行/分布式技术栈(MapReduce、Spark)。每个概念都被平实 地讲清,画成原创示意图,再转化为考试所要求的代价计算。实践作业训练你的 SQL 和 Spark;本书训练占60% 的考试 (exam).
A 1 . LEARN
1 · 学习(LEARN)
You haven't seen the lecture yet. Read a chapter top to bottom. Every topic opens with a one-line TL;DR, then concept - diagram - cost formula - worked example - trap. The schematics are original drawings of standard DB internals - learn the mechanism cold.
你还没上过这节课。从头到尾读 一章。每个主题以一行 TL;DR 开头,然后是概念→ 图示→代 价公式→ 例题 (worked example) →陷阱。这些示意 图是对标准数据库内部机制的原 创绘图 -- 把机制学到滚瓜烂 熟。
B 2 . DRILL
2 · 训练(DRILL)
You've done the lecture and the lab. Cover the worked steps and re-compute each I/O cost yourself - given page counts and buffers, derive the join cost, the sort passes, the B+-tree height. That is the exam's single most repeated task.
你已上过课、做过实验。盖住例 题步骤,自己重新算每个 I/O代 价 -- 给定页数和缓冲,推导连 接代价、排序趟数、B+树高度。 这是考试中最常重复出现的单一 任务。[14]Source: asksia-bible-data3404-bilingual.pdfAskSia Library . DATA3404 . XXia Bilingual
2h, semi-open book
! 'Semi-open book' - confirm what you may bring “半开卷” -- 确认你可以带什么进场
The official unit overview states the exam is semi- open book but the mined materials do not specify exactly which materials are permitted. Do not assume you can bring this book or unlimited notes - check the current unit outline / exam instructions on Canvas before the exam. Study as if closed-book; treat any permitted sheet as a bonus.
官方单元概览说明考试为半开卷,但所挖掘的材料并 未明确指出允许携带哪些材料。不要假设你可以带这 本书或无限制的笔记 -- 考前请到 Canvas 查看当前 的单元大纲/考试须知。按闭卷来备考;把任何获准的 资料当作额外加分。
★ The strategy this dictates 由此决定的应试策略
Every high-value question is a cost computation or a mechanism trace. The recurring chains are - relation sizes + B buffers - pick the join ++ compute I/O; N pages + B -+ sort passes ++ 2N. passes; data + fan-out -+ B+-tree height -+ search cost; query - RA tree - push selections - estimate cost. Show the formula and every page count - method marks are real. Drill the chains and the numbers can't surprise you.
每一道高分值题目都是一次代价计算或机制追踪。反 复出现的链条是 -- 关系大小+B 缓冲→ 选择连接 →计算 I/O; N页+B→ 排序趟数→ 2N· 趟数;数 据+ 扇出→B+树高度→搜索代价;查询→ 关系代 数树→下推选择→估计代价。写出公式和每个页数 -- 方法分是实打实的。把这些链条练熟,数字就吓 不到你了。
AskSia Library · DATA3404 · 双语 Bilingual
CONTENTS
- CONTENTS
From one node to a cluster
从单节点到集群
The single-node engine first (where the cost models live), then scaling out
先讲单节点引擎(代价模型所在之处),再讲横向扩展
Ch Topic
Core content
Part 1 . The single-node engine
1 Storage & buffer management
disk . pages . records . heap/sorted . row/column . CLOCK →
2 Indexing
B+-trees . ISAM . extendible hashing · clustered/dense →
3 Sorting & query processing
external merge-sort . operators . access paths →
4 Join algorithms & cost
NLJ . BNLJ . INLJ . sort-merge . hash - the I/O formulas →
5 Query optimisation
RA equivalences . selectivity . System-R . plan trees →
Part 2 . Scaling out
6 Parallel & distributed data[16]Source: asksia-cheatsheet-data3404.pdfDATA3404
Scalable Data Management UNIVERSITY OF SYDNEY . SCHOOL OF COMPUTER SCIENCE
EXAM REVISION Sem 1 2026 . SIDE 1 OF 2 Whole-unit revision . all topics
SIDE 1/2 JUIN cost models
0 . Exam Blueprint READ FIRST * DATA3404 is about database-engine internals + scaling out - "learn the principles, not the software. " The graded weight: Final 60% (paper-based, 2h, semi- open book [confirm permitted materials]) + 2×12% group practicals (SimpleDB engine; SQL + Spark) + 6% quizzes + 5% participation + 5% presentation.
Double hurdle: you must score ≥40% on the exam AND ≥50% overall - fail either and the final mark is capped at 45.
The exam costs engine internals, NOT SQL: it is mostly short-answer (some MCQ) and explicitly does not assess your code. Marks live in hand-worked cost models: buffer/CLOCK, B+-tree & hash math, external- sort passes, the join cost formulas, and query- optimisation costing.
SIA > Every cost is in page transfers (I/Os), output ignored, CPU & seq-vs-random ignored. Write the formula, sub the numbers, box the answer - method marks survive an arithmetic slip.
1 . Storage & Files LEC 02
DB file = a sequence of fixed-size pages; each page holds many records, aligned to block-oriented disk. The access gap: disk latency >> memory = minimise page I/O.
FILE ORGANISATIONS
Heap - record placed anywhere with free space; best when access is a full scan.
Index - tree/hash access path; faster updates than a sorted file.
ACCESS COST (B = #PAGES)
OP
HEAP
SORTED
LRU
least-recently-used (locality)
MRU
- Join I/O 代价(五大公式:NLJ/BNLJ/INLJ/SMJ/Hash)
- External sort:趟数 + 总 I/O
- B+-tree:高度 + 等值查找 I/O;Hash:等值查找
- Query optimisation:选择率(selectivity / reduction factor)估计 + 推导计划代价
- Buffer/CLOCK(或 GCLOCK):追踪引用串、数缺页
- (权重中到偏低但会考概念):Spark:narrow vs wide、stage 数、lazy DAG、为什么比 MapReduce 适合迭代;NoSQL 选型[6]Source: asksia-bible-data3404-bilingual.pdf对比:翻滚(tumbling)的1分钟窗口会把每个事件恰好放入一个窗口,给出互不重叠的逐分钟计数。 i Recap - Chapter 7 in seven exam moves 回顾 -- 七个考试动作讲完第7章 (1) MapReduce writes every intermediate to replicated HDFS - slow for iterative work; Spark keeps it in memory. (2) RDD = immutable, partitioned, resilient-via-lineage. (3) Transformations are lazy (return an RDD); actions trigger the run. (4) Lazy = Spark holds the whole DAG and optimises before running. (5) Narrow = no shuffle, pipelined; wide (reduceByKey / join / groupBy) = shuffle = stage boundary; #stages = #wide + 1. (6) Recover a lost partition by replaying its lineage - no replication. (7) DataFrames - Catalyst + AQE; pick a NoSQL family by access pattern; BASE/CAP trades C for A; streams = micro-batches over windows. (1) MapReduce 把每个中间结果写到复制的 HDFS→ 迭代工作慢;Spark 把它保留在内存。(2) RDD= 不可变、分 区、经由血统恢复弹性。(3)转换是惰性的(返回一个RDD);动作触发运行。(4)惰性⇒ Spark 持有整个 DAG 并在运 行前优化。(5)窄=无洗牌、流水线;宽(reduceByKey / join /groupBy)= 洗牌= 阶段边界;阶段数 = 宽依赖数 + 1. (6)通过重放血统恢复丢失的分区 -- 无需复制。(7) DataFrame → Catalyst + AQE;按访问模式挑一个 NoSQL 家 族;BASE/CAP 用 C换A;流=窗口上的微批。 AskSia Library · DATA3404 · 双语 Bilingual CH 8 . GLOSSARY CHAPTER 8 . GLOSSARY & COST - FORMULA MAP EN + 中文 Bilingual glossary & cost-formula map 双语术语表与代价公式速查 Every examinable term . English . X . one-line meaning - grouped by area 每一个可考术语 · 英文 · 中文 · 一行释义 -- 按领域分组 A fast bilingual reference for the vocabulary DATA3404 actually examines 〔DATA3404 实际考查的术语双语速 A) , grouped by the taught areas, followed by a single page that collects every exam I/O-cost formula. Lock down the precise wording examiners reward and the pairs students confuse (clustered vs dense, narrow vs wide dependency, speed-up vs scale-up). 一份针对 DATA3404 实际考查的术语的快速双语参考〔DATA3404 实际考查的术语双语速查〕,按所教领域分组,后面跟着 单独一页,汇集了每一个考试I/O代价 公式。锁定阅卷人奖励的精确措辞,以及学生混淆的成对概念(聚簇 vs 稠密、窄依赖 vs 宽依赖、加速比 vs 扩展比)。 Term (EN) 中文 One-line meaning Storage & file organisation 存储与文件组织 Storage hierarchy 存储层次 Registers - cache - RAM - disk - tertiary; latency grows by orders of magnitude down the levels. Access gap 访问鸿沟 The huge latency jump between main memory and secondary (disk) storage that I/O costing targets. Page / block 页/块 Fixed-size unit of transfer between disk and memory; a DB file is a sequence of pages. Heap file 堆文件 Records placed in any free space (unordered); good when the typical access is a full scan. Sorted file 有序文件[9]Source: asksia-bible-data3404-bilingual.pdf- PRACTICE BANK (CONT. ) Spark & conceptual - DAGs, engines, NoSQL Spark 与概念题–––DAG、引擎、NoSQL Narrow vs wide dependencies, Spark-vs-MapReduce, and a NoSQL choice 窄依赖 vs 宽依赖、Spark vs MapReduce,以及一次 NoSQL 选型 Q18 SPARK / CONCEPTS 10 marks . DAG / engines / NoSQL . Areas 8-10 Consider this Spark pipeline: 考虑这段 Spark 流水线: rdd. map(f). filter(g). groupByKey(). mapValues(h) . collect () rdd. map(f). filter(g). groupByKey(). mapValues(h). collect() (a) Label each transformation narrow or wide, and count the stages. [4] (b) Which call triggers execution, and why is the rest "lazy"? [2] (c) Name one reason Spark beats MapReduce for an iterative job. [2] (d) Choose a NoSQL family for a workload of simple Put /Get by key with no joins, and justify. [2] (a) 将每个转换(transformation) 标注为窄依赖(narrow)或宽依赖(wide),并统计阶段(stage)数。[4](b)哪个 调用触发执行,其余为何是“惰性的(lazy)”?[2](c)说出 Spark 在迭代作业上胜过 MapReduce 的一个原因。[2](d) 为一个仅按键做简单 Put/Get、无连接的工作负载选择一个 NoSQL 家族,并说明理由。[2] WORKED SOLUTION 1 (a) Dependencies. map = narrow (each output partition from one input partition); filter = narrow; groupByKey = wide (a shuffle - same key must meet across partitions); mapValues = narrow. A new stage starts at each wide dependency, so there are 2 stages: {map, filter} before the shuffle, {groupByKey's reduce side, mapValues} after it. - (a) 依赖关系。map =窄依赖(每个输出分区来自一个输入分区);filter =窄依赖;groupByKey = 宽依赖(一次洗牌 -- 相同的键必须跨分区相遇);mapValues = 窄依赖。每个宽依赖处开始一个新阶段,所以有 2个阶段:洗牌之前的 {map, filter},之后的 {groupByKey 的归约侧,mapValues}。 - 2 (b) collect () is the action that triggers execution. Transformations are lazy - recorded as a DAG (lineage), not run - so Spark can fuse narrow ops into one stage, plan shuffles, and rebuild lost partitions from lineage. Nothing computes until an action pulls a result to the driver. (b) collect(〕 是触发执行的动作。转换是惰性的 -- 被记录为一个 DAG(血统),并不运行 -- 这样 Spark 就能把窄算 子融合进一个阶段、规划洗牌,并从血统重建丢失的分区。在动作把结果拉到驱动器之前,什么都不会计算。 3 (c) Iterative speed-up: in-memory caching. Spark keeps reused RDDs/DataFrames in memory (cache ()), so each iteration avoids MapReduce's materialise-to-HDFS-then-reload between every map-reduce step. (c)迭代加速:内存缓存。Spark 把被复用的 RDD/DataFrame 留在内存中(cache()),所以每次迭代都避免了 MapReduce 在每个 map-reduce 步骤之间物化到 HDFS 再重新加载。 4 (d) Key-Value store (e. g. Redis, RocksDB, Dynamo). The access pattern is exactly Put /Get /Delete by key with an opaque payload and no joins or range scans - the simplest, fastest-scaling family for that shape. A document store fits if the value is queryable JSON; a wide-column store fits sparse, column-family data; a graph store fits relationship traversal - none of which this workload needs. AskSia Library . DATA3404 . XXia Bilingual (d) 键值存储(如 Redis、RocksDB、Dynamo)。访问模式恰好是按键的 Put/Get/Delete,负载不透明,且没有连接或范 围扫描 -- 对这种形态而言,这是最简单、扩展最快的一类。如果值是可查询的 JSON,文档存储更合适;如果是稀疏的列族 数据,宽列存储更合适;如果是关系遍历,图存储更合适 -- 而这个工作负载都不需要。 ✓ Stage-counting shortcut 数阶段的捷径 #stages = #wide dependencies + 1. Each shuffle (groupByKey, reduceByKey, join on non-co-partitioned data, distinct, repartition) ends a stage. Chains of map/filter/mapValues are fused into the surrounding stage for free. 阶段数 = 宽依赖数 +1。每次洗牌(groupByKey、reduceByKey、在非共分区数据上的 join、distinct、repartition) 结束一个阶段。map/filter/mapValues 的链条被融合进周围的阶段,不增加开销。 On every cost question: write the formula, substitute the page counts, then compute - in that order. The marker can award method marks on a correct 3 . (b_R+b_S) even if the multiply slips, but a bare number with no working earns nothing if it is wrong. Choose the smaller relation as the outer, keep ceilings until the end, and remember output cost is free. 对每道代价题:写出公式,代入页数,再计算 -- 按这个顺序。即使乘法算错,阅卷者也能在一个正确的 3·(b_R+b_S) 上给方法分,但一个没有过程的裸数字如果错了就什么都得不到。选较小的关系作外表,保留取整直到 最后,并记住输出代价是免费的。 MARKER'S NOTE . DATA3404 FINAL AskSia Library · DATA3404 · 双语 Bilingual CH 10 . EXAM MORNING - CHAPTER 10 . EXAM MORNING[15]Source: asksia-bible-data3404-bilingual.pdfMapReduce always scans the whole input (no indexing); joins are not built in (you code them); it is low-level for non-programmers; and chaining multi-step pipelines is awkward (each step round-trips through HDFS). These pains drove declarative layers: Hive (HiveQL - MR jobs, SQL-like over HDFS), Pig Latin, and the in-memory successors Spark and Flink (next chapter). MapReduce 总是扫描整个输入(无索引);连接不是内建的(你得自己写);它对非程序员而言层次太低;且串联多步 流水线很笨拙(每一步都经由 HDFS 往返)。这些痛点催生了声明式层:Hive (HiveQL → MR 作业,HDFS 之上类 SQL)、Pig Latin,以及内存式后继者 Spark 和 Flink(下一章)。 6. 12 Exam-move recap 6. 12 考点动作回顾 AskSia Library · DATA3404 · 双语 Bilingual i Chapter 6 in eight moves 用八步讲完第6 章 (1) Scale-out (commodity nodes) beats scale-up for big data; shared-nothing is the scalable architecture (network is the only shared resource). (2) Speed-up = fixed data + more resources (ideal linear); scale-up = grow both (ideal flat). Skew, startup and interference limit both. (3) Partition: round-robin balances but can't prune; range prunes ranges; hash prunes equality and is the default. Co-partition joined tables to keep joins local. (4) CAP: pick two of C / A / P. (5) At scale failures are guaranteed (MTBF = MTBF,/k => 1,000 nodes = 750 h, not the slide's 7,500) = replicate. (6) HDFS: 64 MB blocks, 3x replication, NameNode (metadata) + DataNodes (data); NameNode is off the data path. (7) Distributed join - read the condition first. selective filter - broadcast (5,005/node); equi - shuffle + local hash (5,050/node); non-equi - fragment-and-replicate (196,500 total). (8) MapReduce = map - shuffle/sort - reduce ~ parallel GROUP BY; reduce-side join = shuffle, map-side join = broadcast. (1)横向扩展(商用节点)在大数据上胜过纵向扩展;无共享是可扩展架构(网络是唯一的共享资源)。(2)加速比=固 定数据+更多资源(理想线性);扩展比=同时增长两者(理想平坦)。倾斜、启动和干扰限制两者。(3)分区:轮询均 衡但无法剪枝;范围为范围剪枝;散列为等值剪枝且为默认。共分区被连接的表以保持连接本地化。(4) CAP:在 C /A / P 中选两个。(5)大规模下失败是必然的(MTBF = MTBF1/k ⇒ 1,000 节点= 750小时,而非讲义的 7,500) ⇒复 制。(6) HDFS: 64 MB 块、3x 复制、NameNode (元数据)+ DataNode (数据); NameNode 在数据路径之外。(7) 分布式连接 -- 先读条件:选择性过滤→广播(5,005/节点);等值→洗牌+本地散列(5,050/节点);非等值→分 片复制(总计 196,500)。(8) MapReduce = map → 洗牌/排序 → reduce ~ 并行 GROUP BY; reduce 侧连接 = 洗 牌,map 侧连接=广播。 Shared- nothing THE SCALABLE ARCHITECTURE 可扩展的架构 Hash THE DEFAULT PARTITIONING 默认分区方式 3x / 64 MB HDFS REPLICATION / BLOCK HDFS 每块复制数 750 h 1,000-NODE MTBF (CORRECTED) 1,000 节点的 MTBF(已修正) AskSia Library · DATA3404 · 双语 Bilingual CH 7 . SPARK & BIG DATA - CHAPTER 7 . SPARK & THE BIG-DATA STACK SCALING OUT RDDs, the lazy DAG and the distributed- dataflow engine RDD、惰性 DAG 与分布式数据流引擎 From MapReduce's disk writes to in-memory, lineage-recovered, Catalyst-optimised dataflow 从 MapReduce 的磁盘写入,到内存中、靠血统恢复、由 Catalyst 优化的数据流 Spark is the modern engine of the big-data stack. It keeps everything the earlier chapters taught - partitioning, cost-based optimisation, the join algorithms - but wraps them in a fault-tolerant, in- memory, lazily-optimised dataflow. The mental model the exam rewards is small: a job is a DAG of transformations on immutable RDDs; nothing runs until an action fires; wide (shuffle) dependencies cut the DAG into stages; and a lost partition is rebuilt from its lineage, not from a replica. Spark 是大数据技术栈的现代引擎。它保留了前几章教的一切 -- 分区、基于代价的优化、连接算法 -- 但把它们包裹进一个 容错的、内存内的、惰性优化的数据流中。考试奖励的心智模型很小:一个作业是一个对不可变 RDD的转换 (transformation) DAG;在动作(action)触发之前什么都不运行;宽(洗牌)依赖把 DAG 切成阶段(stage);丢失的分区从它 的血统(lineage)重建,而不是从副本。 ★ What the exam asks here 考试在这里会问什么 Explain RDD lazy evaluation and how lineage gives fault tolerance without replication; given a small Spark job, identify narrow vs wide dependencies, draw the stage boundaries and count the shuffles; compare Spark to MapReduce (why Spark wins on iterative / interactive work); name Spark's four join strategies and what Catalyst / AQE do; choose a NoSQL family for a described use-case and state the CAP / BASE-vs-ACID trade-off; and reason about a streaming window. This area is MEDIUM-to-LOWER weight - conceptual, not cost-formula heavy. 解释 RDD 惰性求值(lazy evaluation)以及血统(lineage)如何在不复制的情况下提供容错;给定一个小型 Spark 作 业,识别窄依赖 vs 宽依赖、画出阶段边界并统计洗牌次数;比较 Spark 与 MapReduce (Spark 为何在迭代/交互式工 作上胜出);说出 Spark 的四种连接策略以及 Catalyst / AQE 做什么;为所描述的用例选择一个 NoSQL 家族并说明 CAP / BASE-vs-ACID 的取舍;并推理一个流式窗口。这一领域权重为中到偏低 -- 偏概念,不重代价公式。
-
2)“公式带”:你必须背到能默写(整张卷子都靠它)
-
2.1 Join I/O(五大公式,考试最常出现)
-
记号(材料给的标准写法):
- $b_R, b_S$ = 关系 $R,S$ 的页数(pages)
- $|R|$ = $R$ 的行数(rows)
- $M$ 或 $B$ = 缓冲帧数(buffer frames)
- R = outer(外表),S = inner(内表)(一定要写明,否则答案“翻转”)[8]Source: asksia-bible-data3404-bilingual.pdfBlock NLJ uses an (M-2)-page outer block in this unit's worksheet (1 inner-input frame + 1 output frame reserved). Writing M, or the pipelined M-1 variant, gives the wrong block count and the wrong cost. And ceiling it: [ bR/(M-2)1. 块 NLJ 在本单元的工作纸中使用(M-2)页的外块 (保留1个内表输入帧+1个输出帧)。写M,或流水 线式的 M-1 变体,会给出错误的块数和错误的代价。 并要取顶:[bR/(M-2)]。 ! Forgetting that selectivity assumes independence 忘记选择率(selectivity)假定了独立性 Result-size estimation multiplies per-predicate reduction factors - valid only under uniform distribution + predicate independence, which is often false. State the assumption; if the question hints at correlation (e. g. city & postcode), flag that the estimate may be wrong. Markers reward you for naming the assumption. 结果大小估计把各谓词的缩减因子相乘 -- 仅在均匀 分布+谓词独立下有效,而这常常为假。说明该假 设;如果题目暗示相关性(如 city 和 postcode),就 指出估计可能出错。阅卷者会因你点明假设而给分。 AskSia Library . DATA3404 . XXia Bilingual ! Counting the sort pass in external-sort passes 在外部排序的趟数中数上排序那一趟 #passes = 1+ [logB-1[N/B]] - the leading 1 is the sort (Pass 0), the log term is the merge passes. Worked: N=10,000, B=30 -+ 1+ [log29334] = 1+ 2 = 3 passes; I/O = 2. 10,000. 3 = 60,000. Don't drop the +1, and merge fan-in is B-1, not B. 趟数=1+『logB-1「N/B]]––前导的1是排序(第 0 趟),对数项是归并趟数。算例:N=10,000, B=30 →1+「log29334]=1+2=3趟;I/O= 2·10,000·3= 60,000。别漏掉 +1,且归并扇入是 B-1,不是B。 ! Clustered vs unclustered range cost 聚簇 vs 非聚簇的范围查询代价 Clustered: matches sit on consecutive pages => (height+1) + a few pages. Unclustered must be dense = potentially one random I/O per matching row. On 100 matching rows that is +100 I/O, not +5. Always say which clustering you assume. 聚簇:匹配落在连续页上 ⇒(height+1)+几页。非聚 簇必须稠密⇒ 潜在每个匹配行一次随机 I/O。对 100 个匹配行那是 +100 I/O,而非 +5。永远说明你假设 的是哪种聚簇。 ! Confusing wide-column store with column store 把宽列存储与列存储混淆 Column store (MonetDB, Vertica, Parquet) = stores by attribute, one file per column, for OLAP scans. Wide- column store (BigTable, HBase) = a NoSQL multidimensional map with column families stored row-wise within a family, for sparse key retrieval - not column-oriented at all. Same two words, opposite physical layout. 列存储 (MonetDB、Vertica、Parquet)= 按属性存 储,每列一个文件,用于OLAP 扫描。宽列存储 (BigTable、HBase)= 一个 NoSQL 多维映射,族 内列族按行存储,用于稀疏的按键检索 -- 根本不面 向列。同样的两个词,相反的物理布局。 ! Outer/inner not stated & |R | vs b R 没说明外/内关系,以及 |R| vs bR Every nested-loop and index-join cost is meaningless without saying which relation is outer - it flips the answer. And the tuple-NLJ multiplier is |R| rows, not bR pages (that's the page-at-a-time form). Read the stem. 每个嵌套循环连接和索引连接的代价,不说明哪个关 系是外表就毫无意义 -- 它会颠倒答案。而逐元组 NLJ 的乘数是 |RI行,不是bR页(那是逐页形式)。 读懂题干。 10. 5 The two-hour clock 10. 5 两小时时间分配 1 0:00-0:05 - Triage. Read the whole paper. Mark each question E / M / H and note its marks. Marks per minute is your budget - ~ 1 mark = 1 minute over 2 hours. Do the easy, high-mark cost questions first. 0:00-0:05 -- 分诊。通读整张试卷。给每道题标注 易/中/难 并记下分值。每分钟得分是你的预算 -- 在 2 小时里约1分 ~1分钟。先做容易、高分值的题。 2 0:05-1:30 - Bank the method marks. For every cost question: write the formula, sub in page counts, show every ceiling, box the number. Half-finished correct method beats a lone wrong number. Don't over- invest in one hard part. 0:05-1:30 -- 拿下方法分。对每道代价题:写出公式,代入页数,展示每一次取整,把答案框起来。做对一半的正确方法胜 过一个孤零零的错误数字。不要在某个难点上过度投入。 3 1:30-1:50 - Mop up. Return to flagged questions. Attempt every part - an empty answer scores O, a formula scores something. Short-answer concept questions: one crisp definition + the trade-off. 1:30-1:50 -- 收尾补漏。回到标记过的题。每一小问都要尝试 -- 空白答案得0分,一个公式总能得点分。简答概念题: 一个利落的定义+权衡取舍。 4 1:50-2:00 - Sweep. Check units (pages vs rows), ceilings applied, outer/inner stated, hash/sort-merge only used on equi-joins, and that you cleared the 40% hurdle by attempting enough across the paper. AskSia Library · DATA3404 · 双语 Bilingual[11]Source: asksia-bible-data3404-bilingual.pdfMultiply by reduction factors (selectivity); equality on V distinct values - 1/V => |RI/V rows. "Reorder this query" Push selections & projections down, most-restrictive first; consider left-deep trees (System-R DP, pipelined). Distributed join, one small table Broadcast the small/filtered side to all nodes - local joins. Else shuffle / hash-partition both on the join key (equi only). Spark stage boundary / dependency Narrow dep (map, filter) = pipelined; wide dep (groupBy, join, reduceByKey) = shuffle = stage boundary. "Any join condition" distributed Fragment-and-replicate works for any condition (incl. non-equi); needs ≥ m. n processors. ✓ Always answer in the same shape 始终以相同的形式作答 Formula - substitution - arithmetic - one-line verdict. e. g. "BNLJ = bR + [bR/(M-2)1. bs = 100 + [100/81. 400 = 100 + 13. 400 = 5,300 I/O; Student outer is cheapest. " The marker can follow every step, so even an arithmetic slip keeps the method marks. 公式→代入→算术→一句结论。例如:“BNLJ=bR +[bR/(M-2)] ·bs = 100+[100/8] · 400 = 100+13·400= 5,300 I/O; Student 为外表最便宜。”阅卷者能跟上每一步,故即使算术出错也保住方法分。 AskSia Library . DATA3404 . XXia Bilingual CH 10 . EXAM MORNING 10. 3 The cost-formula belt - the whole exam on one box 10. 3 代价公式带 -- 整场考试浓缩成一个方框 If you memorise nothing else, memorise this. Notation: bR, bs, N = #pages; IRI = #rows; M / B = buffer frames; F = fanout; R = outer, S = inner; cost = page I/O, output ignored. 如果你别的都不背,就背这个。记号:bR,bs, N = 页数;|R| = 行数;M / B= 缓冲帧数;F = 扇出(fan-out); R =外表, S= 内表;代价=页 I/O,输出忽略不计。 JOIN I/O - THE FIVE Tuple NLJ = bR + |R | . bs Page NLJ = bp + bR . bs Block NLJ = bR + [bR/(M-2)1 . bs Index NLJ = bR + |R | · c, c = 1 + h + #matches Sort-Merge = sort (R)+sort(S)+bR+bs (= bR+bs if pre-sorted) Grace Hash = 3 . (bR + bs) SORT / INDEX / SIZING Ext. sort passes = 1 + [logB-1[N/B]] Ext. sort I/0 = 2N . #passes (Pass 0 - [N/B] runs of B) B+-tree height = [log- N] B+ equality cost = [log= N] + 1 Hash equality cost = #pages in bucket (1, no overflow) Selectivity (= on V values) = 1/V est . matches = |R | / V Reduction factor = out / in (multiply through) 10. 4 Top traps - where the marks bleed out 10. 4 头号陷阱 -- 失分最严重的地方 ! M-2 vs M in BNLJ 块嵌套循环连接(BNLJ)中的 M-2 vs M
-
- Tuple NLJ:$$\text{Cost} = b_R + |R|\cdot b_S$$
- Block NLJ(BNLJ):$$\text{Cost} = b_R + \left\lceil \frac{b_R}{(M-2)} \right\rceil \cdot b_S$$
- 关键点:这个版本用的是 $M-2$(留 1 个 inner 输入帧 + 1 个输出帧)。[8]Source: asksia-bible-data3404-bilingual.pdfBlock NLJ uses an (M-2)-page outer block in this unit's worksheet (1 inner-input frame + 1 output frame reserved). Writing M, or the pipelined M-1 variant, gives the wrong block count and the wrong cost. And ceiling it: [ bR/(M-2)1. 块 NLJ 在本单元的工作纸中使用(M-2)页的外块 (保留1个内表输入帧+1个输出帧)。写M,或流水 线式的 M-1 变体,会给出错误的块数和错误的代价。 并要取顶:[bR/(M-2)]。 ! Forgetting that selectivity assumes independence 忘记选择率(selectivity)假定了独立性 Result-size estimation multiplies per-predicate reduction factors - valid only under uniform distribution + predicate independence, which is often false. State the assumption; if the question hints at correlation (e. g. city & postcode), flag that the estimate may be wrong. Markers reward you for naming the assumption. 结果大小估计把各谓词的缩减因子相乘 -- 仅在均匀 分布+谓词独立下有效,而这常常为假。说明该假 设;如果题目暗示相关性(如 city 和 postcode),就 指出估计可能出错。阅卷者会因你点明假设而给分。 AskSia Library . DATA3404 . XXia Bilingual ! Counting the sort pass in external-sort passes 在外部排序的趟数中数上排序那一趟 #passes = 1+ [logB-1[N/B]] - the leading 1 is the sort (Pass 0), the log term is the merge passes. Worked: N=10,000, B=30 -+ 1+ [log29334] = 1+ 2 = 3 passes; I/O = 2. 10,000. 3 = 60,000. Don't drop the +1, and merge fan-in is B-1, not B. 趟数=1+『logB-1「N/B]]––前导的1是排序(第 0 趟),对数项是归并趟数。算例:N=10,000, B=30 →1+「log29334]=1+2=3趟;I/O= 2·10,000·3= 60,000。别漏掉 +1,且归并扇入是 B-1,不是B。 ! Clustered vs unclustered range cost 聚簇 vs 非聚簇的范围查询代价 Clustered: matches sit on consecutive pages => (height+1) + a few pages. Unclustered must be dense = potentially one random I/O per matching row. On 100 matching rows that is +100 I/O, not +5. Always say which clustering you assume. 聚簇:匹配落在连续页上 ⇒(height+1)+几页。非聚 簇必须稠密⇒ 潜在每个匹配行一次随机 I/O。对 100 个匹配行那是 +100 I/O,而非 +5。永远说明你假设 的是哪种聚簇。 ! Confusing wide-column store with column store 把宽列存储与列存储混淆 Column store (MonetDB, Vertica, Parquet) = stores by attribute, one file per column, for OLAP scans. Wide- column store (BigTable, HBase) = a NoSQL multidimensional map with column families stored row-wise within a family, for sparse key retrieval - not column-oriented at all. Same two words, opposite physical layout. 列存储 (MonetDB、Vertica、Parquet)= 按属性存 储,每列一个文件,用于OLAP 扫描。宽列存储 (BigTable、HBase)= 一个 NoSQL 多维映射,族 内列族按行存储,用于稀疏的按键检索 -- 根本不面 向列。同样的两个词,相反的物理布局。 ! Outer/inner not stated & |R | vs b R 没说明外/内关系,以及 |R| vs bR Every nested-loop and index-join cost is meaningless without saying which relation is outer - it flips the answer. And the tuple-NLJ multiplier is |R| rows, not bR pages (that's the page-at-a-time form). Read the stem. 每个嵌套循环连接和索引连接的代价,不说明哪个关 系是外表就毫无意义 -- 它会颠倒答案。而逐元组 NLJ 的乘数是 |RI行,不是bR页(那是逐页形式)。 读懂题干。 10. 5 The two-hour clock 10. 5 两小时时间分配 1 0:00-0:05 - Triage. Read the whole paper. Mark each question E / M / H and note its marks. Marks per minute is your budget - ~ 1 mark = 1 minute over 2 hours. Do the easy, high-mark cost questions first. 0:00-0:05 -- 分诊。通读整张试卷。给每道题标注 易/中/难 并记下分值。每分钟得分是你的预算 -- 在 2 小时里约1分 ~1分钟。先做容易、高分值的题。 2 0:05-1:30 - Bank the method marks. For every cost question: write the formula, sub in page counts, show every ceiling, box the number. Half-finished correct method beats a lone wrong number. Don't over- invest in one hard part. 0:05-1:30 -- 拿下方法分。对每道代价题:写出公式,代入页数,展示每一次取整,把答案框起来。做对一半的正确方法胜 过一个孤零零的错误数字。不要在某个难点上过度投入。 3 1:30-1:50 - Mop up. Return to flagged questions. Attempt every part - an empty answer scores O, a formula scores something. Short-answer concept questions: one crisp definition + the trade-off. 1:30-1:50 -- 收尾补漏。回到标记过的题。每一小问都要尝试 -- 空白答案得0分,一个公式总能得点分。简答概念题: 一个利落的定义+权衡取舍。 4 1:50-2:00 - Sweep. Check units (pages vs rows), ceilings applied, outer/inner stated, hash/sort-merge only used on equi-joins, and that you cleared the 40% hurdle by attempting enough across the paper. AskSia Library · DATA3404 · 双语 Bilingual[11]Source: asksia-bible-data3404-bilingual.pdfMultiply by reduction factors (selectivity); equality on V distinct values - 1/V => |RI/V rows. "Reorder this query" Push selections & projections down, most-restrictive first; consider left-deep trees (System-R DP, pipelined). Distributed join, one small table Broadcast the small/filtered side to all nodes - local joins. Else shuffle / hash-partition both on the join key (equi only). Spark stage boundary / dependency Narrow dep (map, filter) = pipelined; wide dep (groupBy, join, reduceByKey) = shuffle = stage boundary. "Any join condition" distributed Fragment-and-replicate works for any condition (incl. non-equi); needs ≥ m. n processors. ✓ Always answer in the same shape 始终以相同的形式作答 Formula - substitution - arithmetic - one-line verdict. e. g. "BNLJ = bR + [bR/(M-2)1. bs = 100 + [100/81. 400 = 100 + 13. 400 = 5,300 I/O; Student outer is cheapest. " The marker can follow every step, so even an arithmetic slip keeps the method marks. 公式→代入→算术→一句结论。例如:“BNLJ=bR +[bR/(M-2)] ·bs = 100+[100/8] · 400 = 100+13·400= 5,300 I/O; Student 为外表最便宜。”阅卷者能跟上每一步,故即使算术出错也保住方法分。 AskSia Library . DATA3404 . XXia Bilingual CH 10 . EXAM MORNING 10. 3 The cost-formula belt - the whole exam on one box 10. 3 代价公式带 -- 整场考试浓缩成一个方框 If you memorise nothing else, memorise this. Notation: bR, bs, N = #pages; IRI = #rows; M / B = buffer frames; F = fanout; R = outer, S = inner; cost = page I/O, output ignored. 如果你别的都不背,就背这个。记号:bR,bs, N = 页数;|R| = 行数;M / B= 缓冲帧数;F = 扇出(fan-out); R =外表, S= 内表;代价=页 I/O,输出忽略不计。 JOIN I/O - THE FIVE Tuple NLJ = bR + |R | . bs Page NLJ = bp + bR . bs Block NLJ = bR + [bR/(M-2)1 . bs Index NLJ = bR + |R | · c, c = 1 + h + #matches Sort-Merge = sort (R)+sort(S)+bR+bs (= bR+bs if pre-sorted) Grace Hash = 3 . (bR + bs) SORT / INDEX / SIZING Ext. sort passes = 1 + [logB-1[N/B]] Ext. sort I/0 = 2N . #passes (Pass 0 - [N/B] runs of B) B+-tree height = [log- N] B+ equality cost = [log= N] + 1 Hash equality cost = #pages in bucket (1, no overflow) Selectivity (= on V values) = 1/V est . matches = |R | / V Reduction factor = out / in (multiply through) 10. 4 Top traps - where the marks bleed out 10. 4 头号陷阱 -- 失分最严重的地方 ! M-2 vs M in BNLJ 块嵌套循环连接(BNLJ)中的 M-2 vs M
- Index NLJ(INLJ):$$\text{Cost} = b_R + |R|\cdot c,$$ 其中 $$c = 1 + h + #matches$$[21]Source: asksia-cheatsheet-data3404.pdfDocument MongoDB, CouchDB Graph Neo4j "NoSQL"= Not-Only-SQL: scalability, schema flexibility, no standard model/API. Decoupled (lakehouse): separate compute/storage for elasticity. Apache Flink = pipelined dataflow, strong for streaming; cost-based optimiser but no join re-ordering ; manages raw byte[] in 32 KB pages to dodge GC. YARN = the cluster resource manager letting MR/Spark/Flink coexist (per- app Application Master). Data streams = unbounded, process with windows; Spark Streaming / Flink DataStream. RDBMS column on the NoSQL comparison: tuples, schema-first, SQL, ACID, in-place updates. 17 . If You See X, Compute Y METHOD TRIGGERS Page counts + buffer M, "join" => plug all 5 join formulas; smaller = outer. "sort N pages, B buffer" => 1+[ log_(B-1) [ N/B]] passes, 2N·passes. "B+-tree search/height" => [ log_FN]+1. · · CLOCK/GCLOCK trace => set ref bit on hits too. · "reduction / selectivity" => RF=1/V; multiply RFs (independence); matches = |R|-IT RF. "INLJ cost c" => c = 1 + h + #matches per outer row. SIA > Exam discipline: state the formula, sub numbers, box the I/O count, name which relation is outer. The exam tests by-hand cost, not your code. asksia. ai/cheatsheet/ data3404 · side 2/2 AskSia CHEATSHEET SERIES Independent revision aid . semi-open book - confirm permitted materials on canvas. sydney. edu. au . not affiliated with[24]Source: asksia-cheatsheet-data3404.pdfBlock NLJ b_R+ [b_R/(M-2)1 b_S Index NLJ b_R+|R|-C Sort-merge sort(R)+sort(S)+b_R+b_S Grace hash 3(b_R + b_S) INLJ: c = cost of one selection on S = 1 + h + #matches (root + index height + matching rows). SMJ sorted- only = b_R + b_S; result is sorted. Hash & SMJ work only for equi/natural joins (not outer/inequality). SIA > BNLJ uses M-2 (1 input + 1 output page reserved) in the worksheet form - the M-1 variant is for pipelined output. Read which the question wants; the block size drives the whole answer. 9b . Worked . NLJ vs BNLJ RE- NUMBERED R= 100 pg / 1000 rows; S = 400 pg / 10,000 rows; M = 10 (block M-2 = 8). SIMPLE NLJ R outer: 100 + 1000. 400 = 400,100 S outer: 400 + 10000-100 = 1,000,400 BLOCK NLJ R outer: 100 + [100/81. 400 = 100+13. 400 = 5,300 / S outer: 400 + [400/81. 100 = 5,400 BNLJ is ~75x cheaper than simple NLJ here - and smaller-as-outer still wins. 9d . Join Types & Stats SETUP THE COST 0-join (any cond), equi-join (=only), natural (all same- named), outer (NULL on no match). Hash & sort-merge work only for equi/natural. For R(1000 rows)MS(10,000) on a FK: cross-product = 10,000,000 rows; join result = 10,000; avg matches/R- row = 10. The o_0(RxS) view would discard 9,990,000 - why physical join algorithms beat materialising the cross-product. Joins matter: ~ 25% of CPU on a TPC-H workload (Impala) - hash join + sequential scan dominate analytics. That is why the exam drills the cost models. 9e . Which Join When DECISION Spark · No index, any condition => block NLJ (smaller as outer block).
- Sort-Merge Join(SMJ):$$\text{Cost}=\text{sort}(R)+\text{sort}(S)+b_R+b_S$$
- Grace Hash Join:$$\text{Cost} = 3\cdot (b_R+b_S)$$[10]Source: asksia-bible-data3404-bilingual.pdf4. 7 并列对比:同一个连接,五种算法 Here is the entire Week-6 worksheet on one table - Student X Enrolled (b = 100 / 400, |. | = 1000 / 10,000, M = 10) costed every way, best to worst. Re-derive each row from its formula and you have rehearsed the exam. 这里是整张第 6 周习题在一张表里 -- Student √ Enrolled (b = 100/ 400, || = 1000 / 10,000, M = 10)从好到坏地用 各种方式算了一遍代价。从公式重新推导每一行,你就预演了考试。 Algorithm Formula & substitution I/O cost Grace hash join 3 · (100 + 400) 1,500 Sort-merge (Student pre-sorted) 2,400 + 500 2,900 Sort-merge (both unsorted) 600 + 2,400 + 500 3,500 BNLJ (Student outer, M=10) 100 + [100/81. 400 5,300 BNLJ (Enrolled outer, M=10) 400 + [400/81. 100 5,400 INLJ (idx Enrolled. sid h=2, Student outer) 100 + 1000-13 13,100 INLJ (idx Student. sid h=1, Enrolled outer) 400 + 10,000. 3 30,400 Simple NLJ (Student outer) 100 + 1000. 400 400,100[11]Source: asksia-bible-data3404-bilingual.pdfMultiply by reduction factors (selectivity); equality on V distinct values - 1/V => |RI/V rows. "Reorder this query" Push selections & projections down, most-restrictive first; consider left-deep trees (System-R DP, pipelined). Distributed join, one small table Broadcast the small/filtered side to all nodes - local joins. Else shuffle / hash-partition both on the join key (equi only). Spark stage boundary / dependency Narrow dep (map, filter) = pipelined; wide dep (groupBy, join, reduceByKey) = shuffle = stage boundary. "Any join condition" distributed Fragment-and-replicate works for any condition (incl. non-equi); needs ≥ m. n processors. ✓ Always answer in the same shape 始终以相同的形式作答 Formula - substitution - arithmetic - one-line verdict. e. g. "BNLJ = bR + [bR/(M-2)1. bs = 100 + [100/81. 400 = 100 + 13. 400 = 5,300 I/O; Student outer is cheapest. " The marker can follow every step, so even an arithmetic slip keeps the method marks. 公式→代入→算术→一句结论。例如:“BNLJ=bR +[bR/(M-2)] ·bs = 100+[100/8] · 400 = 100+13·400= 5,300 I/O; Student 为外表最便宜。”阅卷者能跟上每一步,故即使算术出错也保住方法分。 AskSia Library . DATA3404 . XXia Bilingual CH 10 . EXAM MORNING 10. 3 The cost-formula belt - the whole exam on one box 10. 3 代价公式带 -- 整场考试浓缩成一个方框 If you memorise nothing else, memorise this. Notation: bR, bs, N = #pages; IRI = #rows; M / B = buffer frames; F = fanout; R = outer, S = inner; cost = page I/O, output ignored. 如果你别的都不背,就背这个。记号:bR,bs, N = 页数;|R| = 行数;M / B= 缓冲帧数;F = 扇出(fan-out); R =外表, S= 内表;代价=页 I/O,输出忽略不计。 JOIN I/O - THE FIVE Tuple NLJ = bR + |R | . bs Page NLJ = bp + bR . bs Block NLJ = bR + [bR/(M-2)1 . bs Index NLJ = bR + |R | · c, c = 1 + h + #matches Sort-Merge = sort (R)+sort(S)+bR+bs (= bR+bs if pre-sorted) Grace Hash = 3 . (bR + bs) SORT / INDEX / SIZING Ext. sort passes = 1 + [logB-1[N/B]] Ext. sort I/0 = 2N . #passes (Pass 0 - [N/B] runs of B) B+-tree height = [log- N] B+ equality cost = [log= N] + 1 Hash equality cost = #pages in bucket (1, no overflow) Selectivity (= on V values) = 1/V est . matches = |R | / V Reduction factor = out / in (multiply through) 10. 4 Top traps - where the marks bleed out 10. 4 头号陷阱 -- 失分最严重的地方 ! M-2 vs M in BNLJ 块嵌套循环连接(BNLJ)中的 M-2 vs M
-
适用条件(考官爱卡你)
- Hash join 和 sort-merge join 只适用于 equi/natural join(等值/自然连接);遇到 不等式条件 或 outer join,它们就“不能用”,通常回到 nested loop(常用 BNLJ)。[13]Source: asksia-bible-data3404-bilingual.pdfThe official Assessment Overview literally says "paper-based exam, 2-hours, semi-open book" - but the mined Canvas pages and slides do not specify what materials are permitted. There is no allowed-materials list (no "one A4 sheet", no "lecture notes only") anywhere in the source. Do not assume. Email cs. data3404@sydney. edu. au / A/Prof Roehm and check the official exam notice before the day so you know exactly what you can carry in. Prepare to work from memory either way - treat any permitted notes as a bonus, not a crutch. 官方评估概览原话说“纸笔考试,2小时,半开卷” -- 但所挖掘的 Canvas 页面和讲义并未指明允许哪 些材料。源材料中任何地方都没有允许材料清单(没 有“一张A4纸”,没有“仅讲义”)。不要假设。发邮件 {A cs. data3404@sydney. edu. au / A/Prof Roehm, 并在考试当天之前查阅官方考试通知,以确切知道你 能带什么进去。无论如何都准备好凭记忆作答 -- 把 任何获准的笔记当作额外加分,而非拐杖。 10. 2 If you see X, do Y - the trigger table 10. 2 看到 X 就做 Y -- 触发器表 Pattern-match the question stem to the method. The instant you read the trigger on the left, write the response on the right - before you even finish reading the numbers. 把题干模式匹配到方法上。一读到左侧的触发条件,就写出右侧的应答 -- 甚至在你读完数字之前。 If the question says . . . Storage, indexing & sorting "Sort N pages, B buffers" Passes = 1+ [logB_1[N/B]]; I/O = 2N . passes. Pass O makes [ N/B] runs. B+-tree height / lookup cost Height = [log[ N]; equality lookup = [ log[ N] + 1 (= height + 1, one match). "Which index for this query?" Equality only - hash. Range / prefix / sort / group-by - tree (B+). Hash cannot do ranges. Composite / covering query Tree matches a prefix of the key; put equality attrs before range attrs; answer from the index alone = index-only (no clustering needed). Range-scan cost, index given Clustered: (height+1) + few packed pages. Unclustered: (height+1) + one I/O per matching row. State which. Joins (the gold - §10. 3 belt) "Join cost, M buffers given" Write all five formulas, sub in bR/bs/IRI, state which relation is outer, then pick the min. Inequality or outer join Hash & sort-merge are OUT (equi/natural only). Use nested-loop - BNLJ usually best. AskSia Library . DATA3404 . XXia Bilingual If the question says . . . . . . do this (the method) Plenty of memory, equi-join Grace hash = 3(bR+bs), usually cheapest; symmetric, outer/inner doesn't matter. Output must be sorted (ORDER BY) Prefer sort-merge; its sorted output is a free "interesting order" for the next operator. Optimisation & distributed "Estimate result size"[26]Source: asksia-cheatsheet-data3404.pdfINLJ: c = cost of one selection on S = 1 + h + #matches (root + index height + matching rows). SMJ sorted- only = b_R + b_S; result is sorted. Hash & SMJ work only for equi/natural joins (not outer/inequality). SIA > BNLJ uses M-2 (1 input + 1 output page reserved) in the worksheet form - the M-1 variant is for pipelined output. Read which the question wants; the block size drives the whole answer. 9b . Worked . NLJ vs BNLJ RE- NUMBERED R= 100 pg / 1000 rows; S = 400 pg / 10,000 rows; M = 10 (block M-2 = 8). SIMPLE NLJ R outer: 100 + 1000. 400 = 400,100 S outer: 400 + 10000-100 = 1,000,400 BLOCK NLJ R outer: 100 + [100/81. 400 = 100+13. 400 = 5,300 / S outer: 400 + [400/81. 100 = 5,400 BNLJ is ~75x cheaper than simple NLJ here - and smaller-as-outer still wins. 9d . Join Types & Stats SETUP THE COST 0-join (any cond), equi-join (=only), natural (all same- named), outer (NULL on no match). Hash & sort-merge work only for equi/natural. For R(1000 rows)MS(10,000) on a FK: cross-product = 10,000,000 rows; join result = 10,000; avg matches/R- row = 10. The o_0(RxS) view would discard 9,990,000 - why physical join algorithms beat materialising the cross-product. Joins matter: ~ 25% of CPU on a TPC-H workload (Impala) - hash join + sequential scan dominate analytics. That is why the exam drills the cost models. 9e . Which Join When DECISION Spark · No index, any condition => block NLJ (smaller as outer block). Index on inner join attr, equi => index NLJ if few matches. Output is ignored in cost, but a sort-merge join leaves the result sorted - free for a later ORDER BY or another merge join (an "interesting order"). · Inequality (R. a<S. b) or outer => hash/SMJ not applicable = block NL (or INLJ on a clustered B+- tree). 9c · Worked · INLJ, SMJ, Hash SAME R, S R = 100 pg/1000 rows, S = 400 pg/10,000 rows, M = 10. INDEX NLJ - C = 1 + H + MATCHES (a) idx on R, h=1, unique = c=3; S outer, R inner: 400 + 10000. 3 = 30,400
-
2.2 External Merge Sort(外部归并排序)
- 趟数(passes):
$$#passes = 1 + \left\lceil \log_{B-1}\left(\left\lceil \frac{N}{B}\right\rceil\right)\right\rceil$$- 材料用口径:$$#passes = 1+\left\lceil \log_{B-1}\left(\frac{N}{B}\right)\right\rceil$$(其中 “前导的 1”是 Pass 0)。[8]Source: asksia-bible-data3404-bilingual.pdfBlock NLJ uses an (M-2)-page outer block in this unit's worksheet (1 inner-input frame + 1 output frame reserved). Writing M, or the pipelined M-1 variant, gives the wrong block count and the wrong cost. And ceiling it: [ bR/(M-2)1. 块 NLJ 在本单元的工作纸中使用(M-2)页的外块 (保留1个内表输入帧+1个输出帧)。写M,或流水 线式的 M-1 变体,会给出错误的块数和错误的代价。 并要取顶:[bR/(M-2)]。 ! Forgetting that selectivity assumes independence 忘记选择率(selectivity)假定了独立性 Result-size estimation multiplies per-predicate reduction factors - valid only under uniform distribution + predicate independence, which is often false. State the assumption; if the question hints at correlation (e. g. city & postcode), flag that the estimate may be wrong. Markers reward you for naming the assumption. 结果大小估计把各谓词的缩减因子相乘 -- 仅在均匀 分布+谓词独立下有效,而这常常为假。说明该假 设;如果题目暗示相关性(如 city 和 postcode),就 指出估计可能出错。阅卷者会因你点明假设而给分。 AskSia Library . DATA3404 . XXia Bilingual ! Counting the sort pass in external-sort passes 在外部排序的趟数中数上排序那一趟 #passes = 1+ [logB-1[N/B]] - the leading 1 is the sort (Pass 0), the log term is the merge passes. Worked: N=10,000, B=30 -+ 1+ [log29334] = 1+ 2 = 3 passes; I/O = 2. 10,000. 3 = 60,000. Don't drop the +1, and merge fan-in is B-1, not B. 趟数=1+『logB-1「N/B]]––前导的1是排序(第 0 趟),对数项是归并趟数。算例:N=10,000, B=30 →1+「log29334]=1+2=3趟;I/O= 2·10,000·3= 60,000。别漏掉 +1,且归并扇入是 B-1,不是B。 ! Clustered vs unclustered range cost 聚簇 vs 非聚簇的范围查询代价 Clustered: matches sit on consecutive pages => (height+1) + a few pages. Unclustered must be dense = potentially one random I/O per matching row. On 100 matching rows that is +100 I/O, not +5. Always say which clustering you assume. 聚簇:匹配落在连续页上 ⇒(height+1)+几页。非聚 簇必须稠密⇒ 潜在每个匹配行一次随机 I/O。对 100 个匹配行那是 +100 I/O,而非 +5。永远说明你假设 的是哪种聚簇。 ! Confusing wide-column store with column store 把宽列存储与列存储混淆 Column store (MonetDB, Vertica, Parquet) = stores by attribute, one file per column, for OLAP scans. Wide- column store (BigTable, HBase) = a NoSQL multidimensional map with column families stored row-wise within a family, for sparse key retrieval - not column-oriented at all. Same two words, opposite physical layout. 列存储 (MonetDB、Vertica、Parquet)= 按属性存 储,每列一个文件,用于OLAP 扫描。宽列存储 (BigTable、HBase)= 一个 NoSQL 多维映射,族 内列族按行存储,用于稀疏的按键检索 -- 根本不面 向列。同样的两个词,相反的物理布局。 ! Outer/inner not stated & |R | vs b R 没说明外/内关系,以及 |R| vs bR Every nested-loop and index-join cost is meaningless without saying which relation is outer - it flips the answer. And the tuple-NLJ multiplier is |R| rows, not bR pages (that's the page-at-a-time form). Read the stem. 每个嵌套循环连接和索引连接的代价,不说明哪个关 系是外表就毫无意义 -- 它会颠倒答案。而逐元组 NLJ 的乘数是 |RI行,不是bR页(那是逐页形式)。 读懂题干。 10. 5 The two-hour clock 10. 5 两小时时间分配 1 0:00-0:05 - Triage. Read the whole paper. Mark each question E / M / H and note its marks. Marks per minute is your budget - ~ 1 mark = 1 minute over 2 hours. Do the easy, high-mark cost questions first. 0:00-0:05 -- 分诊。通读整张试卷。给每道题标注 易/中/难 并记下分值。每分钟得分是你的预算 -- 在 2 小时里约1分 ~1分钟。先做容易、高分值的题。 2 0:05-1:30 - Bank the method marks. For every cost question: write the formula, sub in page counts, show every ceiling, box the number. Half-finished correct method beats a lone wrong number. Don't over- invest in one hard part. 0:05-1:30 -- 拿下方法分。对每道代价题:写出公式,代入页数,展示每一次取整,把答案框起来。做对一半的正确方法胜 过一个孤零零的错误数字。不要在某个难点上过度投入。 3 1:30-1:50 - Mop up. Return to flagged questions. Attempt every part - an empty answer scores O, a formula scores something. Short-answer concept questions: one crisp definition + the trade-off. 1:30-1:50 -- 收尾补漏。回到标记过的题。每一小问都要尝试 -- 空白答案得0分,一个公式总能得点分。简答概念题: 一个利落的定义+权衡取舍。 4 1:50-2:00 - Sweep. Check units (pages vs rows), ceilings applied, outer/inner stated, hash/sort-merge only used on equi-joins, and that you cleared the 40% hurdle by attempting enough across the paper. AskSia Library · DATA3404 · 双语 Bilingual[11]Source: asksia-bible-data3404-bilingual.pdfMultiply by reduction factors (selectivity); equality on V distinct values - 1/V => |RI/V rows. "Reorder this query" Push selections & projections down, most-restrictive first; consider left-deep trees (System-R DP, pipelined). Distributed join, one small table Broadcast the small/filtered side to all nodes - local joins. Else shuffle / hash-partition both on the join key (equi only). Spark stage boundary / dependency Narrow dep (map, filter) = pipelined; wide dep (groupBy, join, reduceByKey) = shuffle = stage boundary. "Any join condition" distributed Fragment-and-replicate works for any condition (incl. non-equi); needs ≥ m. n processors. ✓ Always answer in the same shape 始终以相同的形式作答 Formula - substitution - arithmetic - one-line verdict. e. g. "BNLJ = bR + [bR/(M-2)1. bs = 100 + [100/81. 400 = 100 + 13. 400 = 5,300 I/O; Student outer is cheapest. " The marker can follow every step, so even an arithmetic slip keeps the method marks. 公式→代入→算术→一句结论。例如:“BNLJ=bR +[bR/(M-2)] ·bs = 100+[100/8] · 400 = 100+13·400= 5,300 I/O; Student 为外表最便宜。”阅卷者能跟上每一步,故即使算术出错也保住方法分。 AskSia Library . DATA3404 . XXia Bilingual CH 10 . EXAM MORNING 10. 3 The cost-formula belt - the whole exam on one box 10. 3 代价公式带 -- 整场考试浓缩成一个方框 If you memorise nothing else, memorise this. Notation: bR, bs, N = #pages; IRI = #rows; M / B = buffer frames; F = fanout; R = outer, S = inner; cost = page I/O, output ignored. 如果你别的都不背,就背这个。记号:bR,bs, N = 页数;|R| = 行数;M / B= 缓冲帧数;F = 扇出(fan-out); R =外表, S= 内表;代价=页 I/O,输出忽略不计。 JOIN I/O - THE FIVE Tuple NLJ = bR + |R | . bs Page NLJ = bp + bR . bs Block NLJ = bR + [bR/(M-2)1 . bs Index NLJ = bR + |R | · c, c = 1 + h + #matches Sort-Merge = sort (R)+sort(S)+bR+bs (= bR+bs if pre-sorted) Grace Hash = 3 . (bR + bs) SORT / INDEX / SIZING Ext. sort passes = 1 + [logB-1[N/B]] Ext. sort I/0 = 2N . #passes (Pass 0 - [N/B] runs of B) B+-tree height = [log- N] B+ equality cost = [log= N] + 1 Hash equality cost = #pages in bucket (1, no overflow) Selectivity (= on V values) = 1/V est . matches = |R | / V Reduction factor = out / in (multiply through) 10. 4 Top traps - where the marks bleed out 10. 4 头号陷阱 -- 失分最严重的地方 ! M-2 vs M in BNLJ 块嵌套循环连接(BNLJ)中的 M-2 vs M
- 总 I/O:
$$\text{I/O} = 2N \cdot #passes$$[8]Source: asksia-bible-data3404-bilingual.pdfBlock NLJ uses an (M-2)-page outer block in this unit's worksheet (1 inner-input frame + 1 output frame reserved). Writing M, or the pipelined M-1 variant, gives the wrong block count and the wrong cost. And ceiling it: [ bR/(M-2)1. 块 NLJ 在本单元的工作纸中使用(M-2)页的外块 (保留1个内表输入帧+1个输出帧)。写M,或流水 线式的 M-1 变体,会给出错误的块数和错误的代价。 并要取顶:[bR/(M-2)]。 ! Forgetting that selectivity assumes independence 忘记选择率(selectivity)假定了独立性 Result-size estimation multiplies per-predicate reduction factors - valid only under uniform distribution + predicate independence, which is often false. State the assumption; if the question hints at correlation (e. g. city & postcode), flag that the estimate may be wrong. Markers reward you for naming the assumption. 结果大小估计把各谓词的缩减因子相乘 -- 仅在均匀 分布+谓词独立下有效,而这常常为假。说明该假 设;如果题目暗示相关性(如 city 和 postcode),就 指出估计可能出错。阅卷者会因你点明假设而给分。 AskSia Library . DATA3404 . XXia Bilingual ! Counting the sort pass in external-sort passes 在外部排序的趟数中数上排序那一趟 #passes = 1+ [logB-1[N/B]] - the leading 1 is the sort (Pass 0), the log term is the merge passes. Worked: N=10,000, B=30 -+ 1+ [log29334] = 1+ 2 = 3 passes; I/O = 2. 10,000. 3 = 60,000. Don't drop the +1, and merge fan-in is B-1, not B. 趟数=1+『logB-1「N/B]]––前导的1是排序(第 0 趟),对数项是归并趟数。算例:N=10,000, B=30 →1+「log29334]=1+2=3趟;I/O= 2·10,000·3= 60,000。别漏掉 +1,且归并扇入是 B-1,不是B。 ! Clustered vs unclustered range cost 聚簇 vs 非聚簇的范围查询代价 Clustered: matches sit on consecutive pages => (height+1) + a few pages. Unclustered must be dense = potentially one random I/O per matching row. On 100 matching rows that is +100 I/O, not +5. Always say which clustering you assume. 聚簇:匹配落在连续页上 ⇒(height+1)+几页。非聚 簇必须稠密⇒ 潜在每个匹配行一次随机 I/O。对 100 个匹配行那是 +100 I/O,而非 +5。永远说明你假设 的是哪种聚簇。 ! Confusing wide-column store with column store 把宽列存储与列存储混淆 Column store (MonetDB, Vertica, Parquet) = stores by attribute, one file per column, for OLAP scans. Wide- column store (BigTable, HBase) = a NoSQL multidimensional map with column families stored row-wise within a family, for sparse key retrieval - not column-oriented at all. Same two words, opposite physical layout. 列存储 (MonetDB、Vertica、Parquet)= 按属性存 储,每列一个文件,用于OLAP 扫描。宽列存储 (BigTable、HBase)= 一个 NoSQL 多维映射,族 内列族按行存储,用于稀疏的按键检索 -- 根本不面 向列。同样的两个词,相反的物理布局。 ! Outer/inner not stated & |R | vs b R 没说明外/内关系,以及 |R| vs bR Every nested-loop and index-join cost is meaningless without saying which relation is outer - it flips the answer. And the tuple-NLJ multiplier is |R| rows, not bR pages (that's the page-at-a-time form). Read the stem. 每个嵌套循环连接和索引连接的代价,不说明哪个关 系是外表就毫无意义 -- 它会颠倒答案。而逐元组 NLJ 的乘数是 |RI行,不是bR页(那是逐页形式)。 读懂题干。 10. 5 The two-hour clock 10. 5 两小时时间分配 1 0:00-0:05 - Triage. Read the whole paper. Mark each question E / M / H and note its marks. Marks per minute is your budget - ~ 1 mark = 1 minute over 2 hours. Do the easy, high-mark cost questions first. 0:00-0:05 -- 分诊。通读整张试卷。给每道题标注 易/中/难 并记下分值。每分钟得分是你的预算 -- 在 2 小时里约1分 ~1分钟。先做容易、高分值的题。 2 0:05-1:30 - Bank the method marks. For every cost question: write the formula, sub in page counts, show every ceiling, box the number. Half-finished correct method beats a lone wrong number. Don't over- invest in one hard part. 0:05-1:30 -- 拿下方法分。对每道代价题:写出公式,代入页数,展示每一次取整,把答案框起来。做对一半的正确方法胜 过一个孤零零的错误数字。不要在某个难点上过度投入。 3 1:30-1:50 - Mop up. Return to flagged questions. Attempt every part - an empty answer scores O, a formula scores something. Short-answer concept questions: one crisp definition + the trade-off. 1:30-1:50 -- 收尾补漏。回到标记过的题。每一小问都要尝试 -- 空白答案得0分,一个公式总能得点分。简答概念题: 一个利落的定义+权衡取舍。 4 1:50-2:00 - Sweep. Check units (pages vs rows), ceilings applied, outer/inner stated, hash/sort-merge only used on equi-joins, and that you cleared the 40% hurdle by attempting enough across the paper. AskSia Library · DATA3404 · 双语 Bilingual[11]Source: asksia-bible-data3404-bilingual.pdfMultiply by reduction factors (selectivity); equality on V distinct values - 1/V => |RI/V rows. "Reorder this query" Push selections & projections down, most-restrictive first; consider left-deep trees (System-R DP, pipelined). Distributed join, one small table Broadcast the small/filtered side to all nodes - local joins. Else shuffle / hash-partition both on the join key (equi only). Spark stage boundary / dependency Narrow dep (map, filter) = pipelined; wide dep (groupBy, join, reduceByKey) = shuffle = stage boundary. "Any join condition" distributed Fragment-and-replicate works for any condition (incl. non-equi); needs ≥ m. n processors. ✓ Always answer in the same shape 始终以相同的形式作答 Formula - substitution - arithmetic - one-line verdict. e. g. "BNLJ = bR + [bR/(M-2)1. bs = 100 + [100/81. 400 = 100 + 13. 400 = 5,300 I/O; Student outer is cheapest. " The marker can follow every step, so even an arithmetic slip keeps the method marks. 公式→代入→算术→一句结论。例如:“BNLJ=bR +[bR/(M-2)] ·bs = 100+[100/8] · 400 = 100+13·400= 5,300 I/O; Student 为外表最便宜。”阅卷者能跟上每一步,故即使算术出错也保住方法分。 AskSia Library . DATA3404 . XXia Bilingual CH 10 . EXAM MORNING 10. 3 The cost-formula belt - the whole exam on one box 10. 3 代价公式带 -- 整场考试浓缩成一个方框 If you memorise nothing else, memorise this. Notation: bR, bs, N = #pages; IRI = #rows; M / B = buffer frames; F = fanout; R = outer, S = inner; cost = page I/O, output ignored. 如果你别的都不背,就背这个。记号:bR,bs, N = 页数;|R| = 行数;M / B= 缓冲帧数;F = 扇出(fan-out); R =外表, S= 内表;代价=页 I/O,输出忽略不计。 JOIN I/O - THE FIVE Tuple NLJ = bR + |R | . bs Page NLJ = bp + bR . bs Block NLJ = bR + [bR/(M-2)1 . bs Index NLJ = bR + |R | · c, c = 1 + h + #matches Sort-Merge = sort (R)+sort(S)+bR+bs (= bR+bs if pre-sorted) Grace Hash = 3 . (bR + bs) SORT / INDEX / SIZING Ext. sort passes = 1 + [logB-1[N/B]] Ext. sort I/0 = 2N . #passes (Pass 0 - [N/B] runs of B) B+-tree height = [log- N] B+ equality cost = [log= N] + 1 Hash equality cost = #pages in bucket (1, no overflow) Selectivity (= on V values) = 1/V est . matches = |R | / V Reduction factor = out / in (multiply through) 10. 4 Top traps - where the marks bleed out 10. 4 头号陷阱 -- 失分最严重的地方 ! M-2 vs M in BNLJ 块嵌套循环连接(BNLJ)中的 M-2 vs M - 最常见致命坑(off-by-one)
- 别忘了那个 $+1$(Pass 0 也算一趟);还有 归并扇入是 $B-1$,不是 $B$。[8]Source: asksia-bible-data3404-bilingual.pdfBlock NLJ uses an (M-2)-page outer block in this unit's worksheet (1 inner-input frame + 1 output frame reserved). Writing M, or the pipelined M-1 variant, gives the wrong block count and the wrong cost. And ceiling it: [ bR/(M-2)1. 块 NLJ 在本单元的工作纸中使用(M-2)页的外块 (保留1个内表输入帧+1个输出帧)。写M,或流水 线式的 M-1 变体,会给出错误的块数和错误的代价。 并要取顶:[bR/(M-2)]。 ! Forgetting that selectivity assumes independence 忘记选择率(selectivity)假定了独立性 Result-size estimation multiplies per-predicate reduction factors - valid only under uniform distribution + predicate independence, which is often false. State the assumption; if the question hints at correlation (e. g. city & postcode), flag that the estimate may be wrong. Markers reward you for naming the assumption. 结果大小估计把各谓词的缩减因子相乘 -- 仅在均匀 分布+谓词独立下有效,而这常常为假。说明该假 设;如果题目暗示相关性(如 city 和 postcode),就 指出估计可能出错。阅卷者会因你点明假设而给分。 AskSia Library . DATA3404 . XXia Bilingual ! Counting the sort pass in external-sort passes 在外部排序的趟数中数上排序那一趟 #passes = 1+ [logB-1[N/B]] - the leading 1 is the sort (Pass 0), the log term is the merge passes. Worked: N=10,000, B=30 -+ 1+ [log29334] = 1+ 2 = 3 passes; I/O = 2. 10,000. 3 = 60,000. Don't drop the +1, and merge fan-in is B-1, not B. 趟数=1+『logB-1「N/B]]––前导的1是排序(第 0 趟),对数项是归并趟数。算例:N=10,000, B=30 →1+「log29334]=1+2=3趟;I/O= 2·10,000·3= 60,000。别漏掉 +1,且归并扇入是 B-1,不是B。 ! Clustered vs unclustered range cost 聚簇 vs 非聚簇的范围查询代价 Clustered: matches sit on consecutive pages => (height+1) + a few pages. Unclustered must be dense = potentially one random I/O per matching row. On 100 matching rows that is +100 I/O, not +5. Always say which clustering you assume. 聚簇:匹配落在连续页上 ⇒(height+1)+几页。非聚 簇必须稠密⇒ 潜在每个匹配行一次随机 I/O。对 100 个匹配行那是 +100 I/O,而非 +5。永远说明你假设 的是哪种聚簇。 ! Confusing wide-column store with column store 把宽列存储与列存储混淆 Column store (MonetDB, Vertica, Parquet) = stores by attribute, one file per column, for OLAP scans. Wide- column store (BigTable, HBase) = a NoSQL multidimensional map with column families stored row-wise within a family, for sparse key retrieval - not column-oriented at all. Same two words, opposite physical layout. 列存储 (MonetDB、Vertica、Parquet)= 按属性存 储,每列一个文件,用于OLAP 扫描。宽列存储 (BigTable、HBase)= 一个 NoSQL 多维映射,族 内列族按行存储,用于稀疏的按键检索 -- 根本不面 向列。同样的两个词,相反的物理布局。 ! Outer/inner not stated & |R | vs b R 没说明外/内关系,以及 |R| vs bR Every nested-loop and index-join cost is meaningless without saying which relation is outer - it flips the answer. And the tuple-NLJ multiplier is |R| rows, not bR pages (that's the page-at-a-time form). Read the stem. 每个嵌套循环连接和索引连接的代价,不说明哪个关 系是外表就毫无意义 -- 它会颠倒答案。而逐元组 NLJ 的乘数是 |RI行,不是bR页(那是逐页形式)。 读懂题干。 10. 5 The two-hour clock 10. 5 两小时时间分配 1 0:00-0:05 - Triage. Read the whole paper. Mark each question E / M / H and note its marks. Marks per minute is your budget - ~ 1 mark = 1 minute over 2 hours. Do the easy, high-mark cost questions first. 0:00-0:05 -- 分诊。通读整张试卷。给每道题标注 易/中/难 并记下分值。每分钟得分是你的预算 -- 在 2 小时里约1分 ~1分钟。先做容易、高分值的题。 2 0:05-1:30 - Bank the method marks. For every cost question: write the formula, sub in page counts, show every ceiling, box the number. Half-finished correct method beats a lone wrong number. Don't over- invest in one hard part. 0:05-1:30 -- 拿下方法分。对每道代价题:写出公式,代入页数,展示每一次取整,把答案框起来。做对一半的正确方法胜 过一个孤零零的错误数字。不要在某个难点上过度投入。 3 1:30-1:50 - Mop up. Return to flagged questions. Attempt every part - an empty answer scores O, a formula scores something. Short-answer concept questions: one crisp definition + the trade-off. 1:30-1:50 -- 收尾补漏。回到标记过的题。每一小问都要尝试 -- 空白答案得0分,一个公式总能得点分。简答概念题: 一个利落的定义+权衡取舍。 4 1:50-2:00 - Sweep. Check units (pages vs rows), ceilings applied, outer/inner stated, hash/sort-merge only used on equi-joins, and that you cleared the 40% hurdle by attempting enough across the paper. AskSia Library · DATA3404 · 双语 Bilingual[18]Source: asksia-cheatsheet-data3404.pdfSIA > Don't forget the +1 sort pass: #passes = 1 + (merge passes). Forgetting it is the classic off- by-one that drops the I/O total. 8 . Query Pipeline LEC 05 SQL (declarative) > Parser > Optimizer > Executor. SQL -> relational-algebra logical tree -> physical plan (operators + algorithm). Many plans per SQL; optimiser picks one. Materialization (set-at-a-time): write temp relation, next op reads it - always works, costly I/O. Pipelining (tuple- at-a-time): pass each row on - cheap, but the inner join input, sorts, hash joins & aggregations can't pipeline their input. Reduction factor = #out / #in (= selectivity). Cost = #block transfers, output ignored. 8b · Relational Algebra THE OPERATORS 6 basic: u union, - difference, o select (rows), Tt project (cols), x cross-product, p rename. Derived: n, M join, + division. Set ops need union-compatible schemas (same arity, names, domains). Join RM_0 S = o_0(RxS); equi-join = only equalities; natural join = equi-join on all same-named attrs, keep one copy. Complex conditions -> CNF to match access paths. Access paths: table scan, scan+filter, index scan, index- only (covering) scan. o (selection) = subset of rows; Tt (projection) = subset of columns (strict RA removes duplicates = SQL DISTINCT). o and It are usually fused into the access- path loop. An expression tree shows logical RA operators; a query execution plan shows the physical operators + algorithms chosen. Formula Belt SIDE 1 heap eq = B/2 . sorted eq = log2B B+tree search = [log_F N]+1 = h+1 ext-sort #passes = 1+[log_(B-1) [ N/B]] total I/0 = 2N . #passes ext-hash: double iff local=global depth hit ratio = (req-I/0)/req ≥ 80% asksia. ai/cheatsheet/ data3404 · side 1/2 AskSia CHEATSHEET SERIES Independent revision aid . DATA3404 exam is semi-open book - confirm permitted materials on your unit outline flip - for side 2 . joins, optimisation & scaling out Scan all O(B) O(B)[22]Source: asksia-cheatsheet-data3404.pdfHEAP VS SORTED Person: 1,000,000 rows, 10/page = 100,000 pages; age uniform 0-99. · age=30 on heap (1% match) => still scan all 100,000 (unsorted). Sorted on name => no help for an age predicate => 100,000. Lesson: an index/order only helps if it matches the query's predicate attribute - a name-sorted file is useless for an age query. 6e . Covering Index INDEX-ONLY An index-only (covering) plan answers a query from the index alone, never fetching the heap tuple. e. g. (city, name) answers SELECT name WHERE city =: v. Clustering not needed for index-only plans - usually a tree index. 7 . External Merge Sort LEC 05 For ORDER BY, DISTINCT, sort-merge join, bulk B+-tree load. Sort N pages with B buffer frames: Pass 0 (sort): read all N (B at a time), sort each chunk, write => [ N/B] runs of B pages. Cost 2N. Merge passes: merge B-1 runs at a time (1 output buffer); each pass reads+writes all N = 2N. #PASSES & TOTAL I/O #passes = 1 + [log_(B-1) [N/B]] Total I/0 = 2N . (#passes) Run length after merge pass i = (B-1) . B. In practice rarely > 2-3 passes. Clustered B+-tree on the sort col => read leaves in order (good); unclustered => random I/O/record (bad). 7b . Worked . Sort COUNT THE PASSES N = 10,000 pages, B = 30: Pass 0 runs = [10000/30] = 334 merge B-1 = 29 at a time P1: [334/29] = 12 runs P2: [12/29] = 1 = 2 merge passes total passes = 1 + 2 = 3 Total I/0 = 2. 10000. 3 = 60,000 SIA > Don't forget the +1 sort pass: #passes = 1 + (merge passes). Forgetting it is the classic off- by-one that drops the I/O total. 8 . Query Pipeline LEC 05 SQL (declarative) > Parser > Optimizer > Executor. SQL -> relational-algebra logical tree -> physical plan (operators + algorithm). Many plans per SQL; optimiser picks one. Materialization (set-at-a-time): write temp relation, next op reads it - always works, costly I/O. Pipelining (tuple- at-a-time): pass each row on - cheap, but the inner join input, sorts, hash joins & aggregations can't pipeline their input. Reduction factor = #out / #in (= selectivity). Cost = #block transfers, output ignored. 8b · Relational Algebra THE OPERATORS 6 basic: u union, - difference, o select (rows), Tt project (cols), x cross-product, p rename. Derived: n, M join, + division. Set ops need union-compatible schemas (same arity, names, domains). Join RM_0 S = o_0(RxS); equi-join = only equalities; natural join = equi-join on all same-named attrs, keep one copy. Complex conditions -> CNF to match access paths.
-
2.3 B+-tree / Hash(索引相关)
- B+-tree 高度 / 等值查找代价(材料给的形式):
- $$\text{Height} \approx \lceil \log_F N\rceil$$($F$ = fan-out,$N$ = 页数/叶子页规模的量)
- 等值查找(equality lookup):$$\lceil \log_F N\rceil +1 = h+1$$[11]Source: asksia-bible-data3404-bilingual.pdfMultiply by reduction factors (selectivity); equality on V distinct values - 1/V => |RI/V rows. "Reorder this query" Push selections & projections down, most-restrictive first; consider left-deep trees (System-R DP, pipelined). Distributed join, one small table Broadcast the small/filtered side to all nodes - local joins. Else shuffle / hash-partition both on the join key (equi only). Spark stage boundary / dependency Narrow dep (map, filter) = pipelined; wide dep (groupBy, join, reduceByKey) = shuffle = stage boundary. "Any join condition" distributed Fragment-and-replicate works for any condition (incl. non-equi); needs ≥ m. n processors. ✓ Always answer in the same shape 始终以相同的形式作答 Formula - substitution - arithmetic - one-line verdict. e. g. "BNLJ = bR + [bR/(M-2)1. bs = 100 + [100/81. 400 = 100 + 13. 400 = 5,300 I/O; Student outer is cheapest. " The marker can follow every step, so even an arithmetic slip keeps the method marks. 公式→代入→算术→一句结论。例如:“BNLJ=bR +[bR/(M-2)] ·bs = 100+[100/8] · 400 = 100+13·400= 5,300 I/O; Student 为外表最便宜。”阅卷者能跟上每一步,故即使算术出错也保住方法分。 AskSia Library . DATA3404 . XXia Bilingual CH 10 . EXAM MORNING 10. 3 The cost-formula belt - the whole exam on one box 10. 3 代价公式带 -- 整场考试浓缩成一个方框 If you memorise nothing else, memorise this. Notation: bR, bs, N = #pages; IRI = #rows; M / B = buffer frames; F = fanout; R = outer, S = inner; cost = page I/O, output ignored. 如果你别的都不背,就背这个。记号:bR,bs, N = 页数;|R| = 行数;M / B= 缓冲帧数;F = 扇出(fan-out); R =外表, S= 内表;代价=页 I/O,输出忽略不计。 JOIN I/O - THE FIVE Tuple NLJ = bR + |R | . bs Page NLJ = bp + bR . bs Block NLJ = bR + [bR/(M-2)1 . bs Index NLJ = bR + |R | · c, c = 1 + h + #matches Sort-Merge = sort (R)+sort(S)+bR+bs (= bR+bs if pre-sorted) Grace Hash = 3 . (bR + bs) SORT / INDEX / SIZING Ext. sort passes = 1 + [logB-1[N/B]] Ext. sort I/0 = 2N . #passes (Pass 0 - [N/B] runs of B) B+-tree height = [log- N] B+ equality cost = [log= N] + 1 Hash equality cost = #pages in bucket (1, no overflow) Selectivity (= on V values) = 1/V est . matches = |R | / V Reduction factor = out / in (multiply through) 10. 4 Top traps - where the marks bleed out 10. 4 头号陷阱 -- 失分最严重的地方 ! M-2 vs M in BNLJ 块嵌套循环连接(BNLJ)中的 M-2 vs M[21]Source: asksia-cheatsheet-data3404.pdfDocument MongoDB, CouchDB Graph Neo4j "NoSQL"= Not-Only-SQL: scalability, schema flexibility, no standard model/API. Decoupled (lakehouse): separate compute/storage for elasticity. Apache Flink = pipelined dataflow, strong for streaming; cost-based optimiser but no join re-ordering ; manages raw byte[] in 32 KB pages to dodge GC. YARN = the cluster resource manager letting MR/Spark/Flink coexist (per- app Application Master). Data streams = unbounded, process with windows; Spark Streaming / Flink DataStream. RDBMS column on the NoSQL comparison: tuples, schema-first, SQL, ACID, in-place updates. 17 . If You See X, Compute Y METHOD TRIGGERS Page counts + buffer M, "join" => plug all 5 join formulas; smaller = outer. "sort N pages, B buffer" => 1+[ log_(B-1) [ N/B]] passes, 2N·passes. "B+-tree search/height" => [ log_FN]+1. · · CLOCK/GCLOCK trace => set ref bit on hits too. · "reduction / selectivity" => RF=1/V; multiply RFs (independence); matches = |R|-IT RF. "INLJ cost c" => c = 1 + h + #matches per outer row. SIA > Exam discipline: state the formula, sub numbers, box the I/O count, name which relation is outer. The exam tests by-hand cost, not your code. asksia. ai/cheatsheet/ data3404 · side 2/2 AskSia CHEATSHEET SERIES Independent revision aid . semi-open book - confirm permitted materials on canvas. sydney. edu. au . not affiliated with
- Hash index
-
2.4 Selectivity / Reduction factor(选择率与估计)
- 等值谓词:属性有 $V$ 个不同值(distinct values)
- $$\text{selectivity} = \frac{1}{V}, \quad \text{est. matches} = \frac{|R|}{V}$$[11]Source: asksia-bible-data3404-bilingual.pdfMultiply by reduction factors (selectivity); equality on V distinct values - 1/V => |RI/V rows. "Reorder this query" Push selections & projections down, most-restrictive first; consider left-deep trees (System-R DP, pipelined). Distributed join, one small table Broadcast the small/filtered side to all nodes - local joins. Else shuffle / hash-partition both on the join key (equi only). Spark stage boundary / dependency Narrow dep (map, filter) = pipelined; wide dep (groupBy, join, reduceByKey) = shuffle = stage boundary. "Any join condition" distributed Fragment-and-replicate works for any condition (incl. non-equi); needs ≥ m. n processors. ✓ Always answer in the same shape 始终以相同的形式作答 Formula - substitution - arithmetic - one-line verdict. e. g. "BNLJ = bR + [bR/(M-2)1. bs = 100 + [100/81. 400 = 100 + 13. 400 = 5,300 I/O; Student outer is cheapest. " The marker can follow every step, so even an arithmetic slip keeps the method marks. 公式→代入→算术→一句结论。例如:“BNLJ=bR +[bR/(M-2)] ·bs = 100+[100/8] · 400 = 100+13·400= 5,300 I/O; Student 为外表最便宜。”阅卷者能跟上每一步,故即使算术出错也保住方法分。 AskSia Library . DATA3404 . XXia Bilingual CH 10 . EXAM MORNING 10. 3 The cost-formula belt - the whole exam on one box 10. 3 代价公式带 -- 整场考试浓缩成一个方框 If you memorise nothing else, memorise this. Notation: bR, bs, N = #pages; IRI = #rows; M / B = buffer frames; F = fanout; R = outer, S = inner; cost = page I/O, output ignored. 如果你别的都不背,就背这个。记号:bR,bs, N = 页数;|R| = 行数;M / B= 缓冲帧数;F = 扇出(fan-out); R =外表, S= 内表;代价=页 I/O,输出忽略不计。 JOIN I/O - THE FIVE Tuple NLJ = bR + |R | . bs Page NLJ = bp + bR . bs Block NLJ = bR + [bR/(M-2)1 . bs Index NLJ = bR + |R | · c, c = 1 + h + #matches Sort-Merge = sort (R)+sort(S)+bR+bs (= bR+bs if pre-sorted) Grace Hash = 3 . (bR + bs) SORT / INDEX / SIZING Ext. sort passes = 1 + [logB-1[N/B]] Ext. sort I/0 = 2N . #passes (Pass 0 - [N/B] runs of B) B+-tree height = [log- N] B+ equality cost = [log= N] + 1 Hash equality cost = #pages in bucket (1, no overflow) Selectivity (= on V values) = 1/V est . matches = |R | / V Reduction factor = out / in (multiply through) 10. 4 Top traps - where the marks bleed out 10. 4 头号陷阱 -- 失分最严重的地方 ! M-2 vs M in BNLJ 块嵌套循环连接(BNLJ)中的 M-2 vs M[25]Source: asksia-cheatsheet-data3404.pdfFault tol. re-run task lineage rebuild Both target shared-nothing; both materialise/shuffle by key for grouping. Spark adds lazy DAGs + in-memory reuse for interactive/iterative work. Spark join strategies (4): broadcast (small side under a threshold), shuffle sort-merge (default), shuffle hash, shuffle-replicate-NL (generic). Set with join hints; AQE re-optimises at runtime. These map 1:1 to the distributed-join approaches in §14. 11 · Query Optimisation LEC 07 Cost-based: (1) generate equivalent expressions; (2) annotate with physical operators; (3) pick cheapest estimated plan. I/O dominates (I/O » CPU >> network). Estimate both op cost & result size; assume uniform distribution + predicate independence (often wrong). PostgreSQL ANALYZE samples stats into pg_stats. RA EQUIVALENCES - APPLY THESE Selections cascade & commute: o_{c1,c2} = o_cl(o_c2(. )); push down before joins. · Join = o over x: RM_0 S = o_0(RxS). · Joins commute & associate: reorder freely. Heuristics: selection early, projection early, most- restrictive selections/joins first. 11b · Selectivity ESTIMATE SIZES EQUALITY ON ATTR, V DISTINCT VALUES selectivity = 1/V est. matches = |R| / V reduction factors multiply (independence) e. g. pod has 2500 distinct values = o(pod=7) keeps R|/2500 rows. Pushing this selection before a join is the single biggest cost lever. Worked: R=40,000 rows, city=200 distinct, paid=2; o(city='x') keeps 40,000/200 = 200 ; A paid=T = RFs multiply = 40,000/(200-2) = 100 . Range = RF = (hi-lo)/(max-min). 11c . Left-Deep . System- DP R SEARCH Left-deep tree: every join's right input is a base relation = fully pipelined (no temp files). System-R considers only left-deep trees. Dynamic programming (N passes): Pass1 = best access path/relation; Pass2 = best plan per pair; . . . PassN joins the (N-1)-plan with the last relation. Keep the cheapest plan plus the cheapest per interesting order (a sort order useful to a later SMJ/ORDER BY). Search space is huge: 6 relations = 40,340 bushy trees; left-deep prunes it. 11e . Worked . Push- Down RE- NUMBERED Query: name, site FROM Emp e, Dept d, Site b WHERE e. dept=d. id A d. site=b. id ^ e. sal>10K. · E1: cross-products + one big o (p-rename the two ids). · E2: rewrite x into joins.
- 多个谓词的“缩减因子相乘”
- 材料提醒:相乘默认依赖“均匀分布 + 谓词独立性(independence)”,实际常不成立;但你写出这个假设,阅卷人会给分。[8]Source: asksia-bible-data3404-bilingual.pdfBlock NLJ uses an (M-2)-page outer block in this unit's worksheet (1 inner-input frame + 1 output frame reserved). Writing M, or the pipelined M-1 variant, gives the wrong block count and the wrong cost. And ceiling it: [ bR/(M-2)1. 块 NLJ 在本单元的工作纸中使用(M-2)页的外块 (保留1个内表输入帧+1个输出帧)。写M,或流水 线式的 M-1 变体,会给出错误的块数和错误的代价。 并要取顶:[bR/(M-2)]。 ! Forgetting that selectivity assumes independence 忘记选择率(selectivity)假定了独立性 Result-size estimation multiplies per-predicate reduction factors - valid only under uniform distribution + predicate independence, which is often false. State the assumption; if the question hints at correlation (e. g. city & postcode), flag that the estimate may be wrong. Markers reward you for naming the assumption. 结果大小估计把各谓词的缩减因子相乘 -- 仅在均匀 分布+谓词独立下有效,而这常常为假。说明该假 设;如果题目暗示相关性(如 city 和 postcode),就 指出估计可能出错。阅卷者会因你点明假设而给分。 AskSia Library . DATA3404 . XXia Bilingual ! Counting the sort pass in external-sort passes 在外部排序的趟数中数上排序那一趟 #passes = 1+ [logB-1[N/B]] - the leading 1 is the sort (Pass 0), the log term is the merge passes. Worked: N=10,000, B=30 -+ 1+ [log29334] = 1+ 2 = 3 passes; I/O = 2. 10,000. 3 = 60,000. Don't drop the +1, and merge fan-in is B-1, not B. 趟数=1+『logB-1「N/B]]––前导的1是排序(第 0 趟),对数项是归并趟数。算例:N=10,000, B=30 →1+「log29334]=1+2=3趟;I/O= 2·10,000·3= 60,000。别漏掉 +1,且归并扇入是 B-1,不是B。 ! Clustered vs unclustered range cost 聚簇 vs 非聚簇的范围查询代价 Clustered: matches sit on consecutive pages => (height+1) + a few pages. Unclustered must be dense = potentially one random I/O per matching row. On 100 matching rows that is +100 I/O, not +5. Always say which clustering you assume. 聚簇:匹配落在连续页上 ⇒(height+1)+几页。非聚 簇必须稠密⇒ 潜在每个匹配行一次随机 I/O。对 100 个匹配行那是 +100 I/O,而非 +5。永远说明你假设 的是哪种聚簇。 ! Confusing wide-column store with column store 把宽列存储与列存储混淆 Column store (MonetDB, Vertica, Parquet) = stores by attribute, one file per column, for OLAP scans. Wide- column store (BigTable, HBase) = a NoSQL multidimensional map with column families stored row-wise within a family, for sparse key retrieval - not column-oriented at all. Same two words, opposite physical layout. 列存储 (MonetDB、Vertica、Parquet)= 按属性存 储,每列一个文件,用于OLAP 扫描。宽列存储 (BigTable、HBase)= 一个 NoSQL 多维映射,族 内列族按行存储,用于稀疏的按键检索 -- 根本不面 向列。同样的两个词,相反的物理布局。 ! Outer/inner not stated & |R | vs b R 没说明外/内关系,以及 |R| vs bR Every nested-loop and index-join cost is meaningless without saying which relation is outer - it flips the answer. And the tuple-NLJ multiplier is |R| rows, not bR pages (that's the page-at-a-time form). Read the stem. 每个嵌套循环连接和索引连接的代价,不说明哪个关 系是外表就毫无意义 -- 它会颠倒答案。而逐元组 NLJ 的乘数是 |RI行,不是bR页(那是逐页形式)。 读懂题干。 10. 5 The two-hour clock 10. 5 两小时时间分配 1 0:00-0:05 - Triage. Read the whole paper. Mark each question E / M / H and note its marks. Marks per minute is your budget - ~ 1 mark = 1 minute over 2 hours. Do the easy, high-mark cost questions first. 0:00-0:05 -- 分诊。通读整张试卷。给每道题标注 易/中/难 并记下分值。每分钟得分是你的预算 -- 在 2 小时里约1分 ~1分钟。先做容易、高分值的题。 2 0:05-1:30 - Bank the method marks. For every cost question: write the formula, sub in page counts, show every ceiling, box the number. Half-finished correct method beats a lone wrong number. Don't over- invest in one hard part. 0:05-1:30 -- 拿下方法分。对每道代价题:写出公式,代入页数,展示每一次取整,把答案框起来。做对一半的正确方法胜 过一个孤零零的错误数字。不要在某个难点上过度投入。 3 1:30-1:50 - Mop up. Return to flagged questions. Attempt every part - an empty answer scores O, a formula scores something. Short-answer concept questions: one crisp definition + the trade-off. 1:30-1:50 -- 收尾补漏。回到标记过的题。每一小问都要尝试 -- 空白答案得0分,一个公式总能得点分。简答概念题: 一个利落的定义+权衡取舍。 4 1:50-2:00 - Sweep. Check units (pages vs rows), ceilings applied, outer/inner stated, hash/sort-merge only used on equi-joins, and that you cleared the 40% hurdle by attempting enough across the paper. AskSia Library · DATA3404 · 双语 Bilingual[25]Source: asksia-cheatsheet-data3404.pdfFault tol. re-run task lineage rebuild Both target shared-nothing; both materialise/shuffle by key for grouping. Spark adds lazy DAGs + in-memory reuse for interactive/iterative work. Spark join strategies (4): broadcast (small side under a threshold), shuffle sort-merge (default), shuffle hash, shuffle-replicate-NL (generic). Set with join hints; AQE re-optimises at runtime. These map 1:1 to the distributed-join approaches in §14. 11 · Query Optimisation LEC 07 Cost-based: (1) generate equivalent expressions; (2) annotate with physical operators; (3) pick cheapest estimated plan. I/O dominates (I/O » CPU >> network). Estimate both op cost & result size; assume uniform distribution + predicate independence (often wrong). PostgreSQL ANALYZE samples stats into pg_stats. RA EQUIVALENCES - APPLY THESE Selections cascade & commute: o_{c1,c2} = o_cl(o_c2(. )); push down before joins. · Join = o over x: RM_0 S = o_0(RxS). · Joins commute & associate: reorder freely. Heuristics: selection early, projection early, most- restrictive selections/joins first. 11b · Selectivity ESTIMATE SIZES EQUALITY ON ATTR, V DISTINCT VALUES selectivity = 1/V est. matches = |R| / V reduction factors multiply (independence) e. g. pod has 2500 distinct values = o(pod=7) keeps R|/2500 rows. Pushing this selection before a join is the single biggest cost lever. Worked: R=40,000 rows, city=200 distinct, paid=2; o(city='x') keeps 40,000/200 = 200 ; A paid=T = RFs multiply = 40,000/(200-2) = 100 . Range = RF = (hi-lo)/(max-min). 11c . Left-Deep . System- DP R SEARCH Left-deep tree: every join's right input is a base relation = fully pipelined (no temp files). System-R considers only left-deep trees. Dynamic programming (N passes): Pass1 = best access path/relation; Pass2 = best plan per pair; . . . PassN joins the (N-1)-plan with the last relation. Keep the cheapest plan plus the cheapest per interesting order (a sort order useful to a later SMJ/ORDER BY). Search space is huge: 6 relations = 40,340 bushy trees; left-deep prunes it. 11e . Worked . Push- Down RE- NUMBERED Query: name, site FROM Emp e, Dept d, Site b WHERE e. dept=d. id A d. site=b. id ^ e. sal>10K. · E1: cross-products + one big o (p-rename the two ids). · E2: rewrite x into joins.
-
3)每章/模块复习重点(按“你会考什么”来背)
-
A. Storage & Buffer(存储 + 缓冲)
-
- 追踪 LRU / CLOCK / GCLOCK(给引用串 → 数 page misses)
- 对比 heap vs sorted 的访问复杂度(扫/查/插/删)
- 解释 slotted page、记录格式(变长、NULL bitmap、TID)
- 对比 row-store vs column-store(给 workload 判断哪种更好)
-
一句话核心(考试写得分)
- “DATA3404 计代价只看 page I/O;buffer hit = 0 I/O,miss = 1 I/O;淘汰页必须 unpinned,脏页要写回。”[20]Source: asksia-cheatsheet-data3404.pdfCompress mixed types better (uniform) Update cheap expensive PAX (Partition Attributes Across) = hybrid: rows in a page, page split into n minipages, one per attr = better cache locality. Parquet = open column format (nested data, à la Dremel). Wide-column (BigTable, HBase) # column DBMS: column families stored row-wise, for sparse data. Column DBMS: MonetDB, Vertica, SAP HANA, Druid. Column-store cons: updates & full-row reads are expensive; bad for small tables. 3 . Buffer Management LEC 02 Buffer manager loads pages into frames. On a request: in buffer? return it; else pick an unpinned frame, if dirty write it back, then load. Pin count > 0 => not evictable; requestor must unpin & flag modification. Dirty pages written deferred by a background writer. READ EFFICIENCY (HIT RATIO) (req - physical I/0) / req . aim ≥ 80% Why DBMS not OS: needs to pin, force-to-disk for recovery, tune replacement, prefetch, avoid double- buffering. 3b . Replacement Policies MEMORISE CLOCK POLICY FIFO EVICTS oldest arrival (by age) SINGLE-NODE ENGINE . Storage & buffer . Slotted page . Heap vs sorted cost . Row vs column . B+-tree . Hashing . External sort . The REVISION SHEET . ALL TOPICS Compiled by AskSia . mapped to the DATA3404 syllabus . asksia. ai/cheatsheet/data3404 Prefix of composite[12]Source: asksia-bible-data3404-bilingual.pdfA single consumer disk has MTBF = 750,000 hours. For k identical disks the combined mean-time-between- failures is MTBF /k. Tabulate it. 单块消费级磁盘的 MTBF = 750,000 小时。对 k 块相同的磁盘,合并后的平均故障间隔时间为 MTBF1/k。把它列成 表。 Disks k MTBF = 750,000 / k 750,000 h 375,000 h 250,000 h 7,500 h 750 h - correct (slide mis-states 7,500 h) 40,000 18. 75 h ! The flagged slide error: 1000 nodes is 750 h, not 7,500 h 被标出的幻灯片错误:1000个节点是750 小时,不是7,500 小时 The lecture slide reuses the 100-node figure (7,500 h) on the 1000-node line. By the formula MTBF /k = 750,000 / 1000 = 750 h - ten times shorter. If asked, apply the formula and state 750 h. The takeaway is unchanged: at cluster scale, failures are guaranteed - roughly one every month at 1000 nodes - so replication and fault tolerance are mandatory, not optional. 讲义把 100 节点的数字 (7,500小时)用在了1000 节点那一行。按公式 MTBF1/k= 750,000 / 1000 = 750 小 时 -- 短了十倍。若被问到,请套用公式并写出750 小时。结论不变:在集群规模下,失败是必然的 -- 在 1000 节点上大约每月一次 -- 所以复制(replication)和容错是强制的,而非可选。 ★ Exam-move recap - the Chapter 1 cost vocabulary 考点动作回顾 -- 第1章的代价词汇 (1) Everything is page I/Os - the RAM++disk gap (~105x) makes the I/O count the only cost; output, sequential-vs- random and CPU are ignored. (2) Layout: variable-length records use an offset array; NULLs via a 1-bit-per- column bitmap; the slotted page (directory + records + free space) addresses records by a stable TID = (page, slot) that survives a record moving inside its page. (3) File organisation: heap = O(B) scan, O(2) insert, O(B/2) search; sorted = O(log,B) search but a costly insert - a sort only helps queries on the sort column. (4) Row vs column: row store for OLTP/whole-record/writes; column store for OLAP/few-column scans (less I/O + better compression); PAX/Parquet are the hybrids. (5) Buffer: hit = 0 I/O, miss = 1 I/O; a victim must be unpinned and is written back if dirty; count the 4 cold-start misses and never move the hand on a hit - for this string LRU=10, MRU=9, LFU=8, CLOCK=7, GCLOCK=5. (6) RAID: stripe for speed, mirror/parity for reliability; cluster MTBF = MTBF,/k. (1) 一切皆页 I/O -- RAM→磁盘的差距(约 105×)使 I/O 次数成为唯一代价;输出、顺序 vs随机以及 CPU 都被忽 略。(2)布局:变长记录使用偏移数组;NULL 通过每列1位的位图;槽式页(目录+记录+空闲空间)通过稳定的 TID =(页,槽)寻址记录,该 TID 在记录于页内移动后依然有效。(3)文件组织:堆=O(B)扫描、O(2)插入、O(B/2) 搜索;有序=O(log2B)搜索但插入代价高 -- 排序只对排序列上的查询有帮助。(4)行 vs 列:行存储用于 OLTP/整记 录/写;列存储用于 OLAP/少列扫描(更少 I/O+更好压缩);PAX/Parquet 是混合体。(5)缓冲:命中=0 I/O,缺失 =11/O;牺牲页必须已解固定,若脏则写回;算上4次冷启动缺失,且命中时绝不移动指针 -- 对此串 LRU=10, MRU=9, LFU=8, CLOCK=7, GCLOCK=5。(6) RAID: 条带化求速度,镜像/校验求可靠性;集群 MTBF = MTBF1/ko AskSia Library · DATA3404 · 双语 Bilingual CH 2 . INDEXING EXAM CORE - CHAPTER 2 . INDEXING Turning a full scan into a lookup 把全表扫描变成一次查找 Search key & data entry . dense / sparse . clustered / unclustered . B+-trees . hashing . bitmaps 搜索键与数据项 · 稠密/稀疏 · 聚簇/非聚簇 · B+树 · 散列 · 位图 An index is an access path that finds the rows matching a search key without reading the whole table. It is the single biggest lever on query cost - a well-chosen index turns an O(B) scan of every page into an O(log; B) descent of a few pages. The exam lives here: it hands you a query plus some indexes and asks "what is the access cost in page I/Os?" Everything below is built to answer that by hand. - 索引是一条访问路径(access path),它在不读取整张表的情况下找到与搜索键匹配的行。它是查询代价上最大的单一杠杆 -- 一个选得好的索引把对每个页的 O(B)扫描变成对少数几个页的 O(logF B)下降。考试就活在这里:它给你一个查询外加 一些索引,问“以页I/O计的访问代价是多少?”下面的一切都是为了能手工回答这个问题而建立的。 ★ What the exam asks here 考试在这里会问什么
-
B. Indexing(B+树、hash、clustered/unclustered、dense/sparse)
-
必背成对概念(很容易混):[6]Source: asksia-bible-data3404-bilingual.pdf对比:翻滚(tumbling)的1分钟窗口会把每个事件恰好放入一个窗口,给出互不重叠的逐分钟计数。 i Recap - Chapter 7 in seven exam moves 回顾 -- 七个考试动作讲完第7章 (1) MapReduce writes every intermediate to replicated HDFS - slow for iterative work; Spark keeps it in memory. (2) RDD = immutable, partitioned, resilient-via-lineage. (3) Transformations are lazy (return an RDD); actions trigger the run. (4) Lazy = Spark holds the whole DAG and optimises before running. (5) Narrow = no shuffle, pipelined; wide (reduceByKey / join / groupBy) = shuffle = stage boundary; #stages = #wide + 1. (6) Recover a lost partition by replaying its lineage - no replication. (7) DataFrames - Catalyst + AQE; pick a NoSQL family by access pattern; BASE/CAP trades C for A; streams = micro-batches over windows. (1) MapReduce 把每个中间结果写到复制的 HDFS→ 迭代工作慢;Spark 把它保留在内存。(2) RDD= 不可变、分 区、经由血统恢复弹性。(3)转换是惰性的(返回一个RDD);动作触发运行。(4)惰性⇒ Spark 持有整个 DAG 并在运 行前优化。(5)窄=无洗牌、流水线;宽(reduceByKey / join /groupBy)= 洗牌= 阶段边界;阶段数 = 宽依赖数 + 1. (6)通过重放血统恢复丢失的分区 -- 无需复制。(7) DataFrame → Catalyst + AQE;按访问模式挑一个 NoSQL 家 族;BASE/CAP 用 C换A;流=窗口上的微批。 AskSia Library · DATA3404 · 双语 Bilingual CH 8 . GLOSSARY CHAPTER 8 . GLOSSARY & COST - FORMULA MAP EN + 中文 Bilingual glossary & cost-formula map 双语术语表与代价公式速查 Every examinable term . English . X . one-line meaning - grouped by area 每一个可考术语 · 英文 · 中文 · 一行释义 -- 按领域分组 A fast bilingual reference for the vocabulary DATA3404 actually examines 〔DATA3404 实际考查的术语双语速 A) , grouped by the taught areas, followed by a single page that collects every exam I/O-cost formula. Lock down the precise wording examiners reward and the pairs students confuse (clustered vs dense, narrow vs wide dependency, speed-up vs scale-up). 一份针对 DATA3404 实际考查的术语的快速双语参考〔DATA3404 实际考查的术语双语速查〕,按所教领域分组,后面跟着 单独一页,汇集了每一个考试I/O代价 公式。锁定阅卷人奖励的精确措辞,以及学生混淆的成对概念(聚簇 vs 稠密、窄依赖 vs 宽依赖、加速比 vs 扩展比)。 Term (EN) 中文 One-line meaning Storage & file organisation 存储与文件组织 Storage hierarchy 存储层次 Registers - cache - RAM - disk - tertiary; latency grows by orders of magnitude down the levels. Access gap 访问鸿沟 The huge latency jump between main memory and secondary (disk) storage that I/O costing targets. Page / block 页/块 Fixed-size unit of transfer between disk and memory; a DB file is a sequence of pages. Heap file 堆文件 Records placed in any free space (unordered); good when the typical access is a full scan. Sorted file 有序文件[8]Source: asksia-bible-data3404-bilingual.pdfBlock NLJ uses an (M-2)-page outer block in this unit's worksheet (1 inner-input frame + 1 output frame reserved). Writing M, or the pipelined M-1 variant, gives the wrong block count and the wrong cost. And ceiling it: [ bR/(M-2)1. 块 NLJ 在本单元的工作纸中使用(M-2)页的外块 (保留1个内表输入帧+1个输出帧)。写M,或流水 线式的 M-1 变体,会给出错误的块数和错误的代价。 并要取顶:[bR/(M-2)]。 ! Forgetting that selectivity assumes independence 忘记选择率(selectivity)假定了独立性 Result-size estimation multiplies per-predicate reduction factors - valid only under uniform distribution + predicate independence, which is often false. State the assumption; if the question hints at correlation (e. g. city & postcode), flag that the estimate may be wrong. Markers reward you for naming the assumption. 结果大小估计把各谓词的缩减因子相乘 -- 仅在均匀 分布+谓词独立下有效,而这常常为假。说明该假 设;如果题目暗示相关性(如 city 和 postcode),就 指出估计可能出错。阅卷者会因你点明假设而给分。 AskSia Library . DATA3404 . XXia Bilingual ! Counting the sort pass in external-sort passes 在外部排序的趟数中数上排序那一趟 #passes = 1+ [logB-1[N/B]] - the leading 1 is the sort (Pass 0), the log term is the merge passes. Worked: N=10,000, B=30 -+ 1+ [log29334] = 1+ 2 = 3 passes; I/O = 2. 10,000. 3 = 60,000. Don't drop the +1, and merge fan-in is B-1, not B. 趟数=1+『logB-1「N/B]]––前导的1是排序(第 0 趟),对数项是归并趟数。算例:N=10,000, B=30 →1+「log29334]=1+2=3趟;I/O= 2·10,000·3= 60,000。别漏掉 +1,且归并扇入是 B-1,不是B。 ! Clustered vs unclustered range cost 聚簇 vs 非聚簇的范围查询代价 Clustered: matches sit on consecutive pages => (height+1) + a few pages. Unclustered must be dense = potentially one random I/O per matching row. On 100 matching rows that is +100 I/O, not +5. Always say which clustering you assume. 聚簇:匹配落在连续页上 ⇒(height+1)+几页。非聚 簇必须稠密⇒ 潜在每个匹配行一次随机 I/O。对 100 个匹配行那是 +100 I/O,而非 +5。永远说明你假设 的是哪种聚簇。 ! Confusing wide-column store with column store 把宽列存储与列存储混淆 Column store (MonetDB, Vertica, Parquet) = stores by attribute, one file per column, for OLAP scans. Wide- column store (BigTable, HBase) = a NoSQL multidimensional map with column families stored row-wise within a family, for sparse key retrieval - not column-oriented at all. Same two words, opposite physical layout. 列存储 (MonetDB、Vertica、Parquet)= 按属性存 储,每列一个文件,用于OLAP 扫描。宽列存储 (BigTable、HBase)= 一个 NoSQL 多维映射,族 内列族按行存储,用于稀疏的按键检索 -- 根本不面 向列。同样的两个词,相反的物理布局。 ! Outer/inner not stated & |R | vs b R 没说明外/内关系,以及 |R| vs bR Every nested-loop and index-join cost is meaningless without saying which relation is outer - it flips the answer. And the tuple-NLJ multiplier is |R| rows, not bR pages (that's the page-at-a-time form). Read the stem. 每个嵌套循环连接和索引连接的代价,不说明哪个关 系是外表就毫无意义 -- 它会颠倒答案。而逐元组 NLJ 的乘数是 |RI行,不是bR页(那是逐页形式)。 读懂题干。 10. 5 The two-hour clock 10. 5 两小时时间分配 1 0:00-0:05 - Triage. Read the whole paper. Mark each question E / M / H and note its marks. Marks per minute is your budget - ~ 1 mark = 1 minute over 2 hours. Do the easy, high-mark cost questions first. 0:00-0:05 -- 分诊。通读整张试卷。给每道题标注 易/中/难 并记下分值。每分钟得分是你的预算 -- 在 2 小时里约1分 ~1分钟。先做容易、高分值的题。 2 0:05-1:30 - Bank the method marks. For every cost question: write the formula, sub in page counts, show every ceiling, box the number. Half-finished correct method beats a lone wrong number. Don't over- invest in one hard part. 0:05-1:30 -- 拿下方法分。对每道代价题:写出公式,代入页数,展示每一次取整,把答案框起来。做对一半的正确方法胜 过一个孤零零的错误数字。不要在某个难点上过度投入。 3 1:30-1:50 - Mop up. Return to flagged questions. Attempt every part - an empty answer scores O, a formula scores something. Short-answer concept questions: one crisp definition + the trade-off. 1:30-1:50 -- 收尾补漏。回到标记过的题。每一小问都要尝试 -- 空白答案得0分,一个公式总能得点分。简答概念题: 一个利落的定义+权衡取舍。 4 1:50-2:00 - Sweep. Check units (pages vs rows), ceilings applied, outer/inner stated, hash/sort-merge only used on equi-joins, and that you cleared the 40% hurdle by attempting enough across the paper. AskSia Library · DATA3404 · 双语 Bilingual[27]Source: asksia-cheatsheet-data3404.pdf6 . Index Classification 4 DIMENSIONS DIMENSION MEANING Primary/secondary key sets file order (often integrated) vs not Unique/non-unique over a candidate key or not Clustered/unclustered data & index ordered same; ≤1 clustered/table Dense/sparse entry per record vs per page Rules: a main/integrated index is always clustered; unclustered must be dense ; sparse needs a candidate- key search key. Search key # key (search key need not be unique). EGSR choice rule = Equality, GroupBy, Sort, Range. Equality attrs first (any type); range = tree index; group-by => index on those attrs; sort => clustered B- tree. Composite (A1,A2) = lexical; supports prefix-key & index-only (covering) plans. Order: equality keys before range keys . Worthwhile if lookup + retrieval < full-scan cost. 6b · Range-Scan Cost FAVOURITE 10,000 data pages, 100 rows in range, 20 rows/page: ACCESS PATH PAGE I/O Heap
- clustered vs unclustered(每表最多一个 clustered)
- dense vs sparse(unclustered 通常必须 dense;sparse 需要候选键一类前提)
- “search key ≠ key”(搜索键不一定唯一)
-
范围查询代价要会口头解释
- clustered:匹配记录在连续页上 → “高度+1 + 少量页”;
- unclustered:可能“每条匹配记录一次随机 I/O” → 100 行匹配可能就是 +100 I/O(别写成 +几页)。并且要写清你假设哪一种。[8]Source: asksia-bible-data3404-bilingual.pdfBlock NLJ uses an (M-2)-page outer block in this unit's worksheet (1 inner-input frame + 1 output frame reserved). Writing M, or the pipelined M-1 variant, gives the wrong block count and the wrong cost. And ceiling it: [ bR/(M-2)1. 块 NLJ 在本单元的工作纸中使用(M-2)页的外块 (保留1个内表输入帧+1个输出帧)。写M,或流水 线式的 M-1 变体,会给出错误的块数和错误的代价。 并要取顶:[bR/(M-2)]。 ! Forgetting that selectivity assumes independence 忘记选择率(selectivity)假定了独立性 Result-size estimation multiplies per-predicate reduction factors - valid only under uniform distribution + predicate independence, which is often false. State the assumption; if the question hints at correlation (e. g. city & postcode), flag that the estimate may be wrong. Markers reward you for naming the assumption. 结果大小估计把各谓词的缩减因子相乘 -- 仅在均匀 分布+谓词独立下有效,而这常常为假。说明该假 设;如果题目暗示相关性(如 city 和 postcode),就 指出估计可能出错。阅卷者会因你点明假设而给分。 AskSia Library . DATA3404 . XXia Bilingual ! Counting the sort pass in external-sort passes 在外部排序的趟数中数上排序那一趟 #passes = 1+ [logB-1[N/B]] - the leading 1 is the sort (Pass 0), the log term is the merge passes. Worked: N=10,000, B=30 -+ 1+ [log29334] = 1+ 2 = 3 passes; I/O = 2. 10,000. 3 = 60,000. Don't drop the +1, and merge fan-in is B-1, not B. 趟数=1+『logB-1「N/B]]––前导的1是排序(第 0 趟),对数项是归并趟数。算例:N=10,000, B=30 →1+「log29334]=1+2=3趟;I/O= 2·10,000·3= 60,000。别漏掉 +1,且归并扇入是 B-1,不是B。 ! Clustered vs unclustered range cost 聚簇 vs 非聚簇的范围查询代价 Clustered: matches sit on consecutive pages => (height+1) + a few pages. Unclustered must be dense = potentially one random I/O per matching row. On 100 matching rows that is +100 I/O, not +5. Always say which clustering you assume. 聚簇:匹配落在连续页上 ⇒(height+1)+几页。非聚 簇必须稠密⇒ 潜在每个匹配行一次随机 I/O。对 100 个匹配行那是 +100 I/O,而非 +5。永远说明你假设 的是哪种聚簇。 ! Confusing wide-column store with column store 把宽列存储与列存储混淆 Column store (MonetDB, Vertica, Parquet) = stores by attribute, one file per column, for OLAP scans. Wide- column store (BigTable, HBase) = a NoSQL multidimensional map with column families stored row-wise within a family, for sparse key retrieval - not column-oriented at all. Same two words, opposite physical layout. 列存储 (MonetDB、Vertica、Parquet)= 按属性存 储,每列一个文件,用于OLAP 扫描。宽列存储 (BigTable、HBase)= 一个 NoSQL 多维映射,族 内列族按行存储,用于稀疏的按键检索 -- 根本不面 向列。同样的两个词,相反的物理布局。 ! Outer/inner not stated & |R | vs b R 没说明外/内关系,以及 |R| vs bR Every nested-loop and index-join cost is meaningless without saying which relation is outer - it flips the answer. And the tuple-NLJ multiplier is |R| rows, not bR pages (that's the page-at-a-time form). Read the stem. 每个嵌套循环连接和索引连接的代价,不说明哪个关 系是外表就毫无意义 -- 它会颠倒答案。而逐元组 NLJ 的乘数是 |RI行,不是bR页(那是逐页形式)。 读懂题干。 10. 5 The two-hour clock 10. 5 两小时时间分配 1 0:00-0:05 - Triage. Read the whole paper. Mark each question E / M / H and note its marks. Marks per minute is your budget - ~ 1 mark = 1 minute over 2 hours. Do the easy, high-mark cost questions first. 0:00-0:05 -- 分诊。通读整张试卷。给每道题标注 易/中/难 并记下分值。每分钟得分是你的预算 -- 在 2 小时里约1分 ~1分钟。先做容易、高分值的题。 2 0:05-1:30 - Bank the method marks. For every cost question: write the formula, sub in page counts, show every ceiling, box the number. Half-finished correct method beats a lone wrong number. Don't over- invest in one hard part. 0:05-1:30 -- 拿下方法分。对每道代价题:写出公式,代入页数,展示每一次取整,把答案框起来。做对一半的正确方法胜 过一个孤零零的错误数字。不要在某个难点上过度投入。 3 1:30-1:50 - Mop up. Return to flagged questions. Attempt every part - an empty answer scores O, a formula scores something. Short-answer concept questions: one crisp definition + the trade-off. 1:30-1:50 -- 收尾补漏。回到标记过的题。每一小问都要尝试 -- 空白答案得0分,一个公式总能得点分。简答概念题: 一个利落的定义+权衡取舍。 4 1:50-2:00 - Sweep. Check units (pages vs rows), ceilings applied, outer/inner stated, hash/sort-merge only used on equi-joins, and that you cleared the 40% hurdle by attempting enough across the paper. AskSia Library · DATA3404 · 双语 Bilingual[13]Source: asksia-bible-data3404-bilingual.pdfThe official Assessment Overview literally says "paper-based exam, 2-hours, semi-open book" - but the mined Canvas pages and slides do not specify what materials are permitted. There is no allowed-materials list (no "one A4 sheet", no "lecture notes only") anywhere in the source. Do not assume. Email cs. data3404@sydney. edu. au / A/Prof Roehm and check the official exam notice before the day so you know exactly what you can carry in. Prepare to work from memory either way - treat any permitted notes as a bonus, not a crutch. 官方评估概览原话说“纸笔考试,2小时,半开卷” -- 但所挖掘的 Canvas 页面和讲义并未指明允许哪 些材料。源材料中任何地方都没有允许材料清单(没 有“一张A4纸”,没有“仅讲义”)。不要假设。发邮件 {A cs. data3404@sydney. edu. au / A/Prof Roehm, 并在考试当天之前查阅官方考试通知,以确切知道你 能带什么进去。无论如何都准备好凭记忆作答 -- 把 任何获准的笔记当作额外加分,而非拐杖。 10. 2 If you see X, do Y - the trigger table 10. 2 看到 X 就做 Y -- 触发器表 Pattern-match the question stem to the method. The instant you read the trigger on the left, write the response on the right - before you even finish reading the numbers. 把题干模式匹配到方法上。一读到左侧的触发条件,就写出右侧的应答 -- 甚至在你读完数字之前。 If the question says . . . Storage, indexing & sorting "Sort N pages, B buffers" Passes = 1+ [logB_1[N/B]]; I/O = 2N . passes. Pass O makes [ N/B] runs. B+-tree height / lookup cost Height = [log[ N]; equality lookup = [ log[ N] + 1 (= height + 1, one match). "Which index for this query?" Equality only - hash. Range / prefix / sort / group-by - tree (B+). Hash cannot do ranges. Composite / covering query Tree matches a prefix of the key; put equality attrs before range attrs; answer from the index alone = index-only (no clustering needed). Range-scan cost, index given Clustered: (height+1) + few packed pages. Unclustered: (height+1) + one I/O per matching row. State which. Joins (the gold - §10. 3 belt) "Join cost, M buffers given" Write all five formulas, sub in bR/bs/IRI, state which relation is outer, then pick the min. Inequality or outer join Hash & sort-merge are OUT (equi/natural only). Use nested-loop - BNLJ usually best. AskSia Library . DATA3404 . XXia Bilingual If the question says . . . . . . do this (the method) Plenty of memory, equi-join Grace hash = 3(bR+bs), usually cheapest; symmetric, outer/inner doesn't matter. Output must be sorted (ORDER BY) Prefer sort-merge; its sorted output is a free "interesting order" for the next operator. Optimisation & distributed "Estimate result size"
-
C. Sorting & Query processing(外排、算子、流水线)
-
你要会:外排趟数与总 I/O;并理解 materialization vs pipelining:有些算子(sort、hash join、聚合等)不能把输入完全流水起来。[18]Source: asksia-cheatsheet-data3404.pdfSIA > Don't forget the +1 sort pass: #passes = 1 + (merge passes). Forgetting it is the classic off- by-one that drops the I/O total. 8 . Query Pipeline LEC 05 SQL (declarative) > Parser > Optimizer > Executor. SQL -> relational-algebra logical tree -> physical plan (operators + algorithm). Many plans per SQL; optimiser picks one. Materialization (set-at-a-time): write temp relation, next op reads it - always works, costly I/O. Pipelining (tuple- at-a-time): pass each row on - cheap, but the inner join input, sorts, hash joins & aggregations can't pipeline their input. Reduction factor = #out / #in (= selectivity). Cost = #block transfers, output ignored. 8b · Relational Algebra THE OPERATORS 6 basic: u union, - difference, o select (rows), Tt project (cols), x cross-product, p rename. Derived: n, M join, + division. Set ops need union-compatible schemas (same arity, names, domains). Join RM_0 S = o_0(RxS); equi-join = only equalities; natural join = equi-join on all same-named attrs, keep one copy. Complex conditions -> CNF to match access paths. Access paths: table scan, scan+filter, index scan, index- only (covering) scan. o (selection) = subset of rows; Tt (projection) = subset of columns (strict RA removes duplicates = SQL DISTINCT). o and It are usually fused into the access- path loop. An expression tree shows logical RA operators; a query execution plan shows the physical operators + algorithms chosen. Formula Belt SIDE 1 heap eq = B/2 . sorted eq = log2B B+tree search = [log_F N]+1 = h+1 ext-sort #passes = 1+[log_(B-1) [ N/B]] total I/0 = 2N . #passes ext-hash: double iff local=global depth hit ratio = (req-I/0)/req ≥ 80% asksia. ai/cheatsheet/ data3404 · side 1/2 AskSia CHEATSHEET SERIES Independent revision aid . DATA3404 exam is semi-open book - confirm permitted materials on your unit outline flip - for side 2 . joins, optimisation & scaling out Scan all O(B) O(B)[22]Source: asksia-cheatsheet-data3404.pdfHEAP VS SORTED Person: 1,000,000 rows, 10/page = 100,000 pages; age uniform 0-99. · age=30 on heap (1% match) => still scan all 100,000 (unsorted). Sorted on name => no help for an age predicate => 100,000. Lesson: an index/order only helps if it matches the query's predicate attribute - a name-sorted file is useless for an age query. 6e . Covering Index INDEX-ONLY An index-only (covering) plan answers a query from the index alone, never fetching the heap tuple. e. g. (city, name) answers SELECT name WHERE city =: v. Clustering not needed for index-only plans - usually a tree index. 7 . External Merge Sort LEC 05 For ORDER BY, DISTINCT, sort-merge join, bulk B+-tree load. Sort N pages with B buffer frames: Pass 0 (sort): read all N (B at a time), sort each chunk, write => [ N/B] runs of B pages. Cost 2N. Merge passes: merge B-1 runs at a time (1 output buffer); each pass reads+writes all N = 2N. #PASSES & TOTAL I/O #passes = 1 + [log_(B-1) [N/B]] Total I/0 = 2N . (#passes) Run length after merge pass i = (B-1) . B. In practice rarely > 2-3 passes. Clustered B+-tree on the sort col => read leaves in order (good); unclustered => random I/O/record (bad). 7b . Worked . Sort COUNT THE PASSES N = 10,000 pages, B = 30: Pass 0 runs = [10000/30] = 334 merge B-1 = 29 at a time P1: [334/29] = 12 runs P2: [12/29] = 1 = 2 merge passes total passes = 1 + 2 = 3 Total I/0 = 2. 10000. 3 = 60,000 SIA > Don't forget the +1 sort pass: #passes = 1 + (merge passes). Forgetting it is the classic off- by-one that drops the I/O total. 8 . Query Pipeline LEC 05 SQL (declarative) > Parser > Optimizer > Executor. SQL -> relational-algebra logical tree -> physical plan (operators + algorithm). Many plans per SQL; optimiser picks one. Materialization (set-at-a-time): write temp relation, next op reads it - always works, costly I/O. Pipelining (tuple- at-a-time): pass each row on - cheap, but the inner join input, sorts, hash joins & aggregations can't pipeline their input. Reduction factor = #out / #in (= selectivity). Cost = #block transfers, output ignored. 8b · Relational Algebra THE OPERATORS 6 basic: u union, - difference, o select (rows), Tt project (cols), x cross-product, p rename. Derived: n, M join, + division. Set ops need union-compatible schemas (same arity, names, domains). Join RM_0 S = o_0(RxS); equi-join = only equalities; natural join = equi-join on all same-named attrs, keep one copy. Complex conditions -> CNF to match access paths.
-
D. Join algorithms(核心中的核心)
-
你要做到:看到题干就能把 5 个公式先写出来,然后代入比较最小。[13]Source: asksia-bible-data3404-bilingual.pdfThe official Assessment Overview literally says "paper-based exam, 2-hours, semi-open book" - but the mined Canvas pages and slides do not specify what materials are permitted. There is no allowed-materials list (no "one A4 sheet", no "lecture notes only") anywhere in the source. Do not assume. Email cs. data3404@sydney. edu. au / A/Prof Roehm and check the official exam notice before the day so you know exactly what you can carry in. Prepare to work from memory either way - treat any permitted notes as a bonus, not a crutch. 官方评估概览原话说“纸笔考试,2小时,半开卷” -- 但所挖掘的 Canvas 页面和讲义并未指明允许哪 些材料。源材料中任何地方都没有允许材料清单(没 有“一张A4纸”,没有“仅讲义”)。不要假设。发邮件 {A cs. data3404@sydney. edu. au / A/Prof Roehm, 并在考试当天之前查阅官方考试通知,以确切知道你 能带什么进去。无论如何都准备好凭记忆作答 -- 把 任何获准的笔记当作额外加分,而非拐杖。 10. 2 If you see X, do Y - the trigger table 10. 2 看到 X 就做 Y -- 触发器表 Pattern-match the question stem to the method. The instant you read the trigger on the left, write the response on the right - before you even finish reading the numbers. 把题干模式匹配到方法上。一读到左侧的触发条件,就写出右侧的应答 -- 甚至在你读完数字之前。 If the question says . . . Storage, indexing & sorting "Sort N pages, B buffers" Passes = 1+ [logB_1[N/B]]; I/O = 2N . passes. Pass O makes [ N/B] runs. B+-tree height / lookup cost Height = [log[ N]; equality lookup = [ log[ N] + 1 (= height + 1, one match). "Which index for this query?" Equality only - hash. Range / prefix / sort / group-by - tree (B+). Hash cannot do ranges. Composite / covering query Tree matches a prefix of the key; put equality attrs before range attrs; answer from the index alone = index-only (no clustering needed). Range-scan cost, index given Clustered: (height+1) + few packed pages. Unclustered: (height+1) + one I/O per matching row. State which. Joins (the gold - §10. 3 belt) "Join cost, M buffers given" Write all five formulas, sub in bR/bs/IRI, state which relation is outer, then pick the min. Inequality or outer join Hash & sort-merge are OUT (equi/natural only). Use nested-loop - BNLJ usually best. AskSia Library . DATA3404 . XXia Bilingual If the question says . . . . . . do this (the method) Plenty of memory, equi-join Grace hash = 3(bR+bs), usually cheapest; symmetric, outer/inner doesn't matter. Output must be sorted (ORDER BY) Prefer sort-merge; its sorted output is a free "interesting order" for the next operator. Optimisation & distributed "Estimate result size"[21]Source: asksia-cheatsheet-data3404.pdfDocument MongoDB, CouchDB Graph Neo4j "NoSQL"= Not-Only-SQL: scalability, schema flexibility, no standard model/API. Decoupled (lakehouse): separate compute/storage for elasticity. Apache Flink = pipelined dataflow, strong for streaming; cost-based optimiser but no join re-ordering ; manages raw byte[] in 32 KB pages to dodge GC. YARN = the cluster resource manager letting MR/Spark/Flink coexist (per- app Application Master). Data streams = unbounded, process with windows; Spark Streaming / Flink DataStream. RDBMS column on the NoSQL comparison: tuples, schema-first, SQL, ACID, in-place updates. 17 . If You See X, Compute Y METHOD TRIGGERS Page counts + buffer M, "join" => plug all 5 join formulas; smaller = outer. "sort N pages, B buffer" => 1+[ log_(B-1) [ N/B]] passes, 2N·passes. "B+-tree search/height" => [ log_FN]+1. · · CLOCK/GCLOCK trace => set ref bit on hits too. · "reduction / selectivity" => RF=1/V; multiply RFs (independence); matches = |R|-IT RF. "INLJ cost c" => c = 1 + h + #matches per outer row. SIA > Exam discipline: state the formula, sub numbers, box the I/O count, name which relation is outer. The exam tests by-hand cost, not your code. asksia. ai/cheatsheet/ data3404 · side 2/2 AskSia CHEATSHEET SERIES Independent revision aid . semi-open book - confirm permitted materials on canvas. sydney. edu. au . not affiliated with
-
E. Query Optimisation(优化器)
-
- 把 SQL/查询想象成 关系代数(RA)树
- 用等价规则 下推 selection / projection(越早越好)
- 用选择率估计结果大小与代价(并说明独立性假设)
-
F. Parallel / Distributed / Spark / NoSQL(概念题)
-
- RDD:immutable、partitioned、靠 lineage 恢复
- transformation 是 lazy,action 触发执行
- DAG:wide dependency(shuffle)切 stage;阶段数 = wide 数 + 1(材料还给了捷径)[9]Source: asksia-bible-data3404-bilingual.pdf- PRACTICE BANK (CONT. ) Spark & conceptual - DAGs, engines, NoSQL Spark 与概念题–––DAG、引擎、NoSQL Narrow vs wide dependencies, Spark-vs-MapReduce, and a NoSQL choice 窄依赖 vs 宽依赖、Spark vs MapReduce,以及一次 NoSQL 选型 Q18 SPARK / CONCEPTS 10 marks . DAG / engines / NoSQL . Areas 8-10 Consider this Spark pipeline: 考虑这段 Spark 流水线: rdd. map(f). filter(g). groupByKey(). mapValues(h) . collect () rdd. map(f). filter(g). groupByKey(). mapValues(h). collect() (a) Label each transformation narrow or wide, and count the stages. [4] (b) Which call triggers execution, and why is the rest "lazy"? [2] (c) Name one reason Spark beats MapReduce for an iterative job. [2] (d) Choose a NoSQL family for a workload of simple Put /Get by key with no joins, and justify. [2] (a) 将每个转换(transformation) 标注为窄依赖(narrow)或宽依赖(wide),并统计阶段(stage)数。[4](b)哪个 调用触发执行,其余为何是“惰性的(lazy)”?[2](c)说出 Spark 在迭代作业上胜过 MapReduce 的一个原因。[2](d) 为一个仅按键做简单 Put/Get、无连接的工作负载选择一个 NoSQL 家族,并说明理由。[2] WORKED SOLUTION 1 (a) Dependencies. map = narrow (each output partition from one input partition); filter = narrow; groupByKey = wide (a shuffle - same key must meet across partitions); mapValues = narrow. A new stage starts at each wide dependency, so there are 2 stages: {map, filter} before the shuffle, {groupByKey's reduce side, mapValues} after it. - (a) 依赖关系。map =窄依赖(每个输出分区来自一个输入分区);filter =窄依赖;groupByKey = 宽依赖(一次洗牌 -- 相同的键必须跨分区相遇);mapValues = 窄依赖。每个宽依赖处开始一个新阶段,所以有 2个阶段:洗牌之前的 {map, filter},之后的 {groupByKey 的归约侧,mapValues}。 - 2 (b) collect () is the action that triggers execution. Transformations are lazy - recorded as a DAG (lineage), not run - so Spark can fuse narrow ops into one stage, plan shuffles, and rebuild lost partitions from lineage. Nothing computes until an action pulls a result to the driver. (b) collect(〕 是触发执行的动作。转换是惰性的 -- 被记录为一个 DAG(血统),并不运行 -- 这样 Spark 就能把窄算 子融合进一个阶段、规划洗牌,并从血统重建丢失的分区。在动作把结果拉到驱动器之前,什么都不会计算。 3 (c) Iterative speed-up: in-memory caching. Spark keeps reused RDDs/DataFrames in memory (cache ()), so each iteration avoids MapReduce's materialise-to-HDFS-then-reload between every map-reduce step. (c)迭代加速:内存缓存。Spark 把被复用的 RDD/DataFrame 留在内存中(cache()),所以每次迭代都避免了 MapReduce 在每个 map-reduce 步骤之间物化到 HDFS 再重新加载。 4 (d) Key-Value store (e. g. Redis, RocksDB, Dynamo). The access pattern is exactly Put /Get /Delete by key with an opaque payload and no joins or range scans - the simplest, fastest-scaling family for that shape. A document store fits if the value is queryable JSON; a wide-column store fits sparse, column-family data; a graph store fits relationship traversal - none of which this workload needs. AskSia Library . DATA3404 . XXia Bilingual (d) 键值存储(如 Redis、RocksDB、Dynamo)。访问模式恰好是按键的 Put/Get/Delete,负载不透明,且没有连接或范 围扫描 -- 对这种形态而言,这是最简单、扩展最快的一类。如果值是可查询的 JSON,文档存储更合适;如果是稀疏的列族 数据,宽列存储更合适;如果是关系遍历,图存储更合适 -- 而这个工作负载都不需要。 ✓ Stage-counting shortcut 数阶段的捷径 #stages = #wide dependencies + 1. Each shuffle (groupByKey, reduceByKey, join on non-co-partitioned data, distinct, repartition) ends a stage. Chains of map/filter/mapValues are fused into the surrounding stage for free. 阶段数 = 宽依赖数 +1。每次洗牌(groupByKey、reduceByKey、在非共分区数据上的 join、distinct、repartition) 结束一个阶段。map/filter/mapValues 的链条被融合进周围的阶段,不增加开销。 On every cost question: write the formula, substitute the page counts, then compute - in that order. The marker can award method marks on a correct 3 . (b_R+b_S) even if the multiply slips, but a bare number with no working earns nothing if it is wrong. Choose the smaller relation as the outer, keep ceilings until the end, and remember output cost is free. 对每道代价题:写出公式,代入页数,再计算 -- 按这个顺序。即使乘法算错,阅卷者也能在一个正确的 3·(b_R+b_S) 上给方法分,但一个没有过程的裸数字如果错了就什么都得不到。选较小的关系作外表,保留取整直到 最后,并记住输出代价是免费的。 MARKER'S NOTE . DATA3404 FINAL AskSia Library · DATA3404 · 双语 Bilingual CH 10 . EXAM MORNING - CHAPTER 10 . EXAM MORNING
- Spark 比 MapReduce 更适合迭代:内存缓存,不用每一步都落盘 HDFS[6]Source: asksia-bible-data3404-bilingual.pdf对比:翻滚(tumbling)的1分钟窗口会把每个事件恰好放入一个窗口,给出互不重叠的逐分钟计数。 i Recap - Chapter 7 in seven exam moves 回顾 -- 七个考试动作讲完第7章 (1) MapReduce writes every intermediate to replicated HDFS - slow for iterative work; Spark keeps it in memory. (2) RDD = immutable, partitioned, resilient-via-lineage. (3) Transformations are lazy (return an RDD); actions trigger the run. (4) Lazy = Spark holds the whole DAG and optimises before running. (5) Narrow = no shuffle, pipelined; wide (reduceByKey / join / groupBy) = shuffle = stage boundary; #stages = #wide + 1. (6) Recover a lost partition by replaying its lineage - no replication. (7) DataFrames - Catalyst + AQE; pick a NoSQL family by access pattern; BASE/CAP trades C for A; streams = micro-batches over windows. (1) MapReduce 把每个中间结果写到复制的 HDFS→ 迭代工作慢;Spark 把它保留在内存。(2) RDD= 不可变、分 区、经由血统恢复弹性。(3)转换是惰性的(返回一个RDD);动作触发运行。(4)惰性⇒ Spark 持有整个 DAG 并在运 行前优化。(5)窄=无洗牌、流水线;宽(reduceByKey / join /groupBy)= 洗牌= 阶段边界;阶段数 = 宽依赖数 + 1. (6)通过重放血统恢复丢失的分区 -- 无需复制。(7) DataFrame → Catalyst + AQE;按访问模式挑一个 NoSQL 家 族;BASE/CAP 用 C换A;流=窗口上的微批。 AskSia Library · DATA3404 · 双语 Bilingual CH 8 . GLOSSARY CHAPTER 8 . GLOSSARY & COST - FORMULA MAP EN + 中文 Bilingual glossary & cost-formula map 双语术语表与代价公式速查 Every examinable term . English . X . one-line meaning - grouped by area 每一个可考术语 · 英文 · 中文 · 一行释义 -- 按领域分组 A fast bilingual reference for the vocabulary DATA3404 actually examines 〔DATA3404 实际考查的术语双语速 A) , grouped by the taught areas, followed by a single page that collects every exam I/O-cost formula. Lock down the precise wording examiners reward and the pairs students confuse (clustered vs dense, narrow vs wide dependency, speed-up vs scale-up). 一份针对 DATA3404 实际考查的术语的快速双语参考〔DATA3404 实际考查的术语双语速查〕,按所教领域分组,后面跟着 单独一页,汇集了每一个考试I/O代价 公式。锁定阅卷人奖励的精确措辞,以及学生混淆的成对概念(聚簇 vs 稠密、窄依赖 vs 宽依赖、加速比 vs 扩展比)。 Term (EN) 中文 One-line meaning Storage & file organisation 存储与文件组织 Storage hierarchy 存储层次 Registers - cache - RAM - disk - tertiary; latency grows by orders of magnitude down the levels. Access gap 访问鸿沟 The huge latency jump between main memory and secondary (disk) storage that I/O costing targets. Page / block 页/块 Fixed-size unit of transfer between disk and memory; a DB file is a sequence of pages. Heap file 堆文件 Records placed in any free space (unordered); good when the typical access is a full scan. Sorted file 有序文件[9]Source: asksia-bible-data3404-bilingual.pdf- PRACTICE BANK (CONT. ) Spark & conceptual - DAGs, engines, NoSQL Spark 与概念题–––DAG、引擎、NoSQL Narrow vs wide dependencies, Spark-vs-MapReduce, and a NoSQL choice 窄依赖 vs 宽依赖、Spark vs MapReduce,以及一次 NoSQL 选型 Q18 SPARK / CONCEPTS 10 marks . DAG / engines / NoSQL . Areas 8-10 Consider this Spark pipeline: 考虑这段 Spark 流水线: rdd. map(f). filter(g). groupByKey(). mapValues(h) . collect () rdd. map(f). filter(g). groupByKey(). mapValues(h). collect() (a) Label each transformation narrow or wide, and count the stages. [4] (b) Which call triggers execution, and why is the rest "lazy"? [2] (c) Name one reason Spark beats MapReduce for an iterative job. [2] (d) Choose a NoSQL family for a workload of simple Put /Get by key with no joins, and justify. [2] (a) 将每个转换(transformation) 标注为窄依赖(narrow)或宽依赖(wide),并统计阶段(stage)数。[4](b)哪个 调用触发执行,其余为何是“惰性的(lazy)”?[2](c)说出 Spark 在迭代作业上胜过 MapReduce 的一个原因。[2](d) 为一个仅按键做简单 Put/Get、无连接的工作负载选择一个 NoSQL 家族,并说明理由。[2] WORKED SOLUTION 1 (a) Dependencies. map = narrow (each output partition from one input partition); filter = narrow; groupByKey = wide (a shuffle - same key must meet across partitions); mapValues = narrow. A new stage starts at each wide dependency, so there are 2 stages: {map, filter} before the shuffle, {groupByKey's reduce side, mapValues} after it. - (a) 依赖关系。map =窄依赖(每个输出分区来自一个输入分区);filter =窄依赖;groupByKey = 宽依赖(一次洗牌 -- 相同的键必须跨分区相遇);mapValues = 窄依赖。每个宽依赖处开始一个新阶段,所以有 2个阶段:洗牌之前的 {map, filter},之后的 {groupByKey 的归约侧,mapValues}。 - 2 (b) collect () is the action that triggers execution. Transformations are lazy - recorded as a DAG (lineage), not run - so Spark can fuse narrow ops into one stage, plan shuffles, and rebuild lost partitions from lineage. Nothing computes until an action pulls a result to the driver. (b) collect(〕 是触发执行的动作。转换是惰性的 -- 被记录为一个 DAG(血统),并不运行 -- 这样 Spark 就能把窄算 子融合进一个阶段、规划洗牌,并从血统重建丢失的分区。在动作把结果拉到驱动器之前,什么都不会计算。 3 (c) Iterative speed-up: in-memory caching. Spark keeps reused RDDs/DataFrames in memory (cache ()), so each iteration avoids MapReduce's materialise-to-HDFS-then-reload between every map-reduce step. (c)迭代加速:内存缓存。Spark 把被复用的 RDD/DataFrame 留在内存中(cache()),所以每次迭代都避免了 MapReduce 在每个 map-reduce 步骤之间物化到 HDFS 再重新加载。 4 (d) Key-Value store (e. g. Redis, RocksDB, Dynamo). The access pattern is exactly Put /Get /Delete by key with an opaque payload and no joins or range scans - the simplest, fastest-scaling family for that shape. A document store fits if the value is queryable JSON; a wide-column store fits sparse, column-family data; a graph store fits relationship traversal - none of which this workload needs. AskSia Library . DATA3404 . XXia Bilingual (d) 键值存储(如 Redis、RocksDB、Dynamo)。访问模式恰好是按键的 Put/Get/Delete,负载不透明,且没有连接或范 围扫描 -- 对这种形态而言,这是最简单、扩展最快的一类。如果值是可查询的 JSON,文档存储更合适;如果是稀疏的列族 数据,宽列存储更合适;如果是关系遍历,图存储更合适 -- 而这个工作负载都不需要。 ✓ Stage-counting shortcut 数阶段的捷径 #stages = #wide dependencies + 1. Each shuffle (groupByKey, reduceByKey, join on non-co-partitioned data, distinct, repartition) ends a stage. Chains of map/filter/mapValues are fused into the surrounding stage for free. 阶段数 = 宽依赖数 +1。每次洗牌(groupByKey、reduceByKey、在非共分区数据上的 join、distinct、repartition) 结束一个阶段。map/filter/mapValues 的链条被融合进周围的阶段,不增加开销。 On every cost question: write the formula, substitute the page counts, then compute - in that order. The marker can award method marks on a correct 3 . (b_R+b_S) even if the multiply slips, but a bare number with no working earns nothing if it is wrong. Choose the smaller relation as the outer, keep ceilings until the end, and remember output cost is free. 对每道代价题:写出公式,代入页数,再计算 -- 按这个顺序。即使乘法算错,阅卷者也能在一个正确的 3·(b_R+b_S) 上给方法分,但一个没有过程的裸数字如果错了就什么都得不到。选较小的关系作外表,保留取整直到 最后,并记住输出代价是免费的。 MARKER'S NOTE . DATA3404 FINAL AskSia Library · DATA3404 · 双语 Bilingual CH 10 . EXAM MORNING - CHAPTER 10 . EXAM MORNING
- NoSQL 选型按访问模式:只按 key 的 Put/Get → key-value 更合适(材料有例题解法)[9]Source: asksia-bible-data3404-bilingual.pdf- PRACTICE BANK (CONT. ) Spark & conceptual - DAGs, engines, NoSQL Spark 与概念题–––DAG、引擎、NoSQL Narrow vs wide dependencies, Spark-vs-MapReduce, and a NoSQL choice 窄依赖 vs 宽依赖、Spark vs MapReduce,以及一次 NoSQL 选型 Q18 SPARK / CONCEPTS 10 marks . DAG / engines / NoSQL . Areas 8-10 Consider this Spark pipeline: 考虑这段 Spark 流水线: rdd. map(f). filter(g). groupByKey(). mapValues(h) . collect () rdd. map(f). filter(g). groupByKey(). mapValues(h). collect() (a) Label each transformation narrow or wide, and count the stages. [4] (b) Which call triggers execution, and why is the rest "lazy"? [2] (c) Name one reason Spark beats MapReduce for an iterative job. [2] (d) Choose a NoSQL family for a workload of simple Put /Get by key with no joins, and justify. [2] (a) 将每个转换(transformation) 标注为窄依赖(narrow)或宽依赖(wide),并统计阶段(stage)数。[4](b)哪个 调用触发执行,其余为何是“惰性的(lazy)”?[2](c)说出 Spark 在迭代作业上胜过 MapReduce 的一个原因。[2](d) 为一个仅按键做简单 Put/Get、无连接的工作负载选择一个 NoSQL 家族,并说明理由。[2] WORKED SOLUTION 1 (a) Dependencies. map = narrow (each output partition from one input partition); filter = narrow; groupByKey = wide (a shuffle - same key must meet across partitions); mapValues = narrow. A new stage starts at each wide dependency, so there are 2 stages: {map, filter} before the shuffle, {groupByKey's reduce side, mapValues} after it. - (a) 依赖关系。map =窄依赖(每个输出分区来自一个输入分区);filter =窄依赖;groupByKey = 宽依赖(一次洗牌 -- 相同的键必须跨分区相遇);mapValues = 窄依赖。每个宽依赖处开始一个新阶段,所以有 2个阶段:洗牌之前的 {map, filter},之后的 {groupByKey 的归约侧,mapValues}。 - 2 (b) collect () is the action that triggers execution. Transformations are lazy - recorded as a DAG (lineage), not run - so Spark can fuse narrow ops into one stage, plan shuffles, and rebuild lost partitions from lineage. Nothing computes until an action pulls a result to the driver. (b) collect(〕 是触发执行的动作。转换是惰性的 -- 被记录为一个 DAG(血统),并不运行 -- 这样 Spark 就能把窄算 子融合进一个阶段、规划洗牌,并从血统重建丢失的分区。在动作把结果拉到驱动器之前,什么都不会计算。 3 (c) Iterative speed-up: in-memory caching. Spark keeps reused RDDs/DataFrames in memory (cache ()), so each iteration avoids MapReduce's materialise-to-HDFS-then-reload between every map-reduce step. (c)迭代加速:内存缓存。Spark 把被复用的 RDD/DataFrame 留在内存中(cache()),所以每次迭代都避免了 MapReduce 在每个 map-reduce 步骤之间物化到 HDFS 再重新加载。 4 (d) Key-Value store (e. g. Redis, RocksDB, Dynamo). The access pattern is exactly Put /Get /Delete by key with an opaque payload and no joins or range scans - the simplest, fastest-scaling family for that shape. A document store fits if the value is queryable JSON; a wide-column store fits sparse, column-family data; a graph store fits relationship traversal - none of which this workload needs. AskSia Library . DATA3404 . XXia Bilingual (d) 键值存储(如 Redis、RocksDB、Dynamo)。访问模式恰好是按键的 Put/Get/Delete,负载不透明,且没有连接或范 围扫描 -- 对这种形态而言,这是最简单、扩展最快的一类。如果值是可查询的 JSON,文档存储更合适;如果是稀疏的列族 数据,宽列存储更合适;如果是关系遍历,图存储更合适 -- 而这个工作负载都不需要。 ✓ Stage-counting shortcut 数阶段的捷径 #stages = #wide dependencies + 1. Each shuffle (groupByKey, reduceByKey, join on non-co-partitioned data, distinct, repartition) ends a stage. Chains of map/filter/mapValues are fused into the surrounding stage for free. 阶段数 = 宽依赖数 +1。每次洗牌(groupByKey、reduceByKey、在非共分区数据上的 join、distinct、repartition) 结束一个阶段。map/filter/mapValues 的链条被融合进周围的阶段,不增加开销。 On every cost question: write the formula, substitute the page counts, then compute - in that order. The marker can award method marks on a correct 3 . (b_R+b_S) even if the multiply slips, but a bare number with no working earns nothing if it is wrong. Choose the smaller relation as the outer, keep ceilings until the end, and remember output cost is free. 对每道代价题:写出公式,代入页数,再计算 -- 按这个顺序。即使乘法算错,阅卷者也能在一个正确的 3·(b_R+b_S) 上给方法分,但一个没有过程的裸数字如果错了就什么都得不到。选较小的关系作外表,保留取整直到 最后,并记住输出代价是免费的。 MARKER'S NOTE . DATA3404 FINAL AskSia Library · DATA3404 · 双语 Bilingual CH 10 . EXAM MORNING - CHAPTER 10 . EXAM MORNING
-
MapReduce vs Spark 一句话对比
-
4)“高分题”答题模板(你照这个写,方法分很稳)
- 所有代价题统一模板(材料明确建议“永远用同样形状回答”):[11]Source: asksia-bible-data3404-bilingual.pdfMultiply by reduction factors (selectivity); equality on V distinct values - 1/V => |RI/V rows.
"Reorder this query"
Push selections & projections down, most-restrictive first; consider left-deep trees (System-R DP, pipelined).
Distributed join, one small table
Broadcast the small/filtered side to all nodes - local joins. Else shuffle / hash-partition both on the join key (equi only).
Spark stage boundary / dependency
Narrow dep (map, filter) = pipelined; wide dep (groupBy, join, reduceByKey) = shuffle = stage boundary.
"Any join condition" distributed
Fragment-and-replicate works for any condition (incl. non-equi); needs ≥ m. n processors.
✓ Always answer in the same shape 始终以相同的形式作答
Formula - substitution - arithmetic - one-line verdict. e. g. "BNLJ = bR + [bR/(M-2)1. bs = 100 + [100/81. 400 = 100 + 13. 400 = 5,300 I/O; Student outer is cheapest. " The marker can follow every step, so even an arithmetic slip keeps the method marks.
公式→代入→算术→一句结论。例如:“BNLJ=bR +[bR/(M-2)] ·bs = 100+[100/8] · 400 = 100+13·400= 5,300 I/O; Student 为外表最便宜。”阅卷者能跟上每一步,故即使算术出错也保住方法分。
AskSia Library . DATA3404 . XXia Bilingual
CH 10 . EXAM MORNING
10. 3 The cost-formula belt - the whole exam on one box
10. 3 代价公式带 -- 整场考试浓缩成一个方框
If you memorise nothing else, memorise this. Notation: bR, bs, N = #pages; IRI = #rows; M / B = buffer frames; F = fanout; R = outer, S = inner; cost = page I/O, output ignored.
如果你别的都不背,就背这个。记号:bR,bs, N = 页数;|R| = 行数;M / B= 缓冲帧数;F = 扇出(fan-out); R =外表, S= 内表;代价=页 I/O,输出忽略不计。
JOIN I/O - THE FIVE
Tuple NLJ = bR + |R | . bs Page NLJ = bp + bR . bs Block NLJ = bR + [bR/(M-2)1 . bs Index NLJ = bR + |R | · c,
c = 1 + h + #matches
Sort-Merge = sort (R)+sort(S)+bR+bs (= bR+bs if pre-sorted)
Grace Hash = 3 . (bR + bs)
SORT / INDEX / SIZING
Ext. sort passes = 1 + [logB-1[N/B]] Ext. sort I/0 = 2N . #passes
(Pass 0 - [N/B] runs of B) B+-tree height = [log- N] B+ equality cost = [log= N] + 1 Hash equality cost = #pages in bucket (1, no overflow) Selectivity (= on V values) = 1/V est . matches = |R | / V Reduction factor = out / in (multiply through)
10. 4 Top traps - where the marks bleed out
10. 4 头号陷阱 -- 失分最严重的地方
! M-2 vs M in BNLJ
块嵌套循环连接(BNLJ)中的 M-2 vs M
- ① 写公式(Formula)
- ② 代入数字(Substitution:$b_R,b_S,|R|,M/B,h,#matches$)
- ③ 做算术(Arithmetic:保留每次 $\lceil \ \rceil$)
- ④ 一句结论(Verdict:谁最便宜 + 你把谁当 outer/inner)
- Join 题的“触发器反射”
- 题干出现 “join cost,给了 buffer” → 立刻写 5 个 join 公式;先选小表当 outer;算完取最小。[13]Source: asksia-bible-data3404-bilingual.pdfThe official Assessment Overview literally says "paper-based exam, 2-hours, semi-open book" - but the mined Canvas pages and slides do not specify what materials are permitted. There is no allowed-materials list (no "one A4 sheet", no "lecture notes only") anywhere in the source. Do not assume. Email cs. data3404@sydney. edu. au / A/Prof Roehm and check the official exam notice before the day so you know exactly what you can carry in. Prepare to work from memory either way - treat any permitted notes as a bonus, not a crutch. 官方评估概览原话说“纸笔考试,2小时,半开卷” -- 但所挖掘的 Canvas 页面和讲义并未指明允许哪 些材料。源材料中任何地方都没有允许材料清单(没 有“一张A4纸”,没有“仅讲义”)。不要假设。发邮件 {A cs. data3404@sydney. edu. au / A/Prof Roehm, 并在考试当天之前查阅官方考试通知,以确切知道你 能带什么进去。无论如何都准备好凭记忆作答 -- 把 任何获准的笔记当作额外加分,而非拐杖。 10. 2 If you see X, do Y - the trigger table 10. 2 看到 X 就做 Y -- 触发器表 Pattern-match the question stem to the method. The instant you read the trigger on the left, write the response on the right - before you even finish reading the numbers. 把题干模式匹配到方法上。一读到左侧的触发条件,就写出右侧的应答 -- 甚至在你读完数字之前。 If the question says . . . Storage, indexing & sorting "Sort N pages, B buffers" Passes = 1+ [logB_1[N/B]]; I/O = 2N . passes. Pass O makes [ N/B] runs. B+-tree height / lookup cost Height = [log[ N]; equality lookup = [ log[ N] + 1 (= height + 1, one match). "Which index for this query?" Equality only - hash. Range / prefix / sort / group-by - tree (B+). Hash cannot do ranges. Composite / covering query Tree matches a prefix of the key; put equality attrs before range attrs; answer from the index alone = index-only (no clustering needed). Range-scan cost, index given Clustered: (height+1) + few packed pages. Unclustered: (height+1) + one I/O per matching row. State which. Joins (the gold - §10. 3 belt) "Join cost, M buffers given" Write all five formulas, sub in bR/bs/IRI, state which relation is outer, then pick the min. Inequality or outer join Hash & sort-merge are OUT (equi/natural only). Use nested-loop - BNLJ usually best. AskSia Library . DATA3404 . XXia Bilingual If the question says . . . . . . do this (the method) Plenty of memory, equi-join Grace hash = 3(bR+bs), usually cheapest; symmetric, outer/inner doesn't matter. Output must be sorted (ORDER BY) Prefer sort-merge; its sorted output is a free "interesting order" for the next operator. Optimisation & distributed "Estimate result size"[21]Source: asksia-cheatsheet-data3404.pdfDocument MongoDB, CouchDB Graph Neo4j "NoSQL"= Not-Only-SQL: scalability, schema flexibility, no standard model/API. Decoupled (lakehouse): separate compute/storage for elasticity. Apache Flink = pipelined dataflow, strong for streaming; cost-based optimiser but no join re-ordering ; manages raw byte[] in 32 KB pages to dodge GC. YARN = the cluster resource manager letting MR/Spark/Flink coexist (per- app Application Master). Data streams = unbounded, process with windows; Spark Streaming / Flink DataStream. RDBMS column on the NoSQL comparison: tuples, schema-first, SQL, ACID, in-place updates. 17 . If You See X, Compute Y METHOD TRIGGERS Page counts + buffer M, "join" => plug all 5 join formulas; smaller = outer. "sort N pages, B buffer" => 1+[ log_(B-1) [ N/B]] passes, 2N·passes. "B+-tree search/height" => [ log_FN]+1. · · CLOCK/GCLOCK trace => set ref bit on hits too. · "reduction / selectivity" => RF=1/V; multiply RFs (independence); matches = |R|-IT RF. "INLJ cost c" => c = 1 + h + #matches per outer row. SIA > Exam discipline: state the formula, sub numbers, box the I/O count, name which relation is outer. The exam tests by-hand cost, not your code. asksia. ai/cheatsheet/ data3404 · side 2/2 AskSia CHEATSHEET SERIES Independent revision aid . semi-open book - confirm permitted materials on canvas. sydney. edu. au . not affiliated with
-
5)最常见“会让人白丢分”的陷阱清单(你要背下来反复自检)
- BNLJ 用 $M-2$ 还是 $M$?
- 外排趟数忘记 $+1$(Pass 0)
- 这是“经典 off-by-one”,会导致总 I/O 直接错一大截。[8]Source: asksia-bible-data3404-bilingual.pdfBlock NLJ uses an (M-2)-page outer block in this unit's worksheet (1 inner-input frame + 1 output frame reserved). Writing M, or the pipelined M-1 variant, gives the wrong block count and the wrong cost. And ceiling it: [ bR/(M-2)1. 块 NLJ 在本单元的工作纸中使用(M-2)页的外块 (保留1个内表输入帧+1个输出帧)。写M,或流水 线式的 M-1 变体,会给出错误的块数和错误的代价。 并要取顶:[bR/(M-2)]。 ! Forgetting that selectivity assumes independence 忘记选择率(selectivity)假定了独立性 Result-size estimation multiplies per-predicate reduction factors - valid only under uniform distribution + predicate independence, which is often false. State the assumption; if the question hints at correlation (e. g. city & postcode), flag that the estimate may be wrong. Markers reward you for naming the assumption. 结果大小估计把各谓词的缩减因子相乘 -- 仅在均匀 分布+谓词独立下有效,而这常常为假。说明该假 设;如果题目暗示相关性(如 city 和 postcode),就 指出估计可能出错。阅卷者会因你点明假设而给分。 AskSia Library . DATA3404 . XXia Bilingual ! Counting the sort pass in external-sort passes 在外部排序的趟数中数上排序那一趟 #passes = 1+ [logB-1[N/B]] - the leading 1 is the sort (Pass 0), the log term is the merge passes. Worked: N=10,000, B=30 -+ 1+ [log29334] = 1+ 2 = 3 passes; I/O = 2. 10,000. 3 = 60,000. Don't drop the +1, and merge fan-in is B-1, not B. 趟数=1+『logB-1「N/B]]––前导的1是排序(第 0 趟),对数项是归并趟数。算例:N=10,000, B=30 →1+「log29334]=1+2=3趟;I/O= 2·10,000·3= 60,000。别漏掉 +1,且归并扇入是 B-1,不是B。 ! Clustered vs unclustered range cost 聚簇 vs 非聚簇的范围查询代价 Clustered: matches sit on consecutive pages => (height+1) + a few pages. Unclustered must be dense = potentially one random I/O per matching row. On 100 matching rows that is +100 I/O, not +5. Always say which clustering you assume. 聚簇:匹配落在连续页上 ⇒(height+1)+几页。非聚 簇必须稠密⇒ 潜在每个匹配行一次随机 I/O。对 100 个匹配行那是 +100 I/O,而非 +5。永远说明你假设 的是哪种聚簇。 ! Confusing wide-column store with column store 把宽列存储与列存储混淆 Column store (MonetDB, Vertica, Parquet) = stores by attribute, one file per column, for OLAP scans. Wide- column store (BigTable, HBase) = a NoSQL multidimensional map with column families stored row-wise within a family, for sparse key retrieval - not column-oriented at all. Same two words, opposite physical layout. 列存储 (MonetDB、Vertica、Parquet)= 按属性存 储,每列一个文件,用于OLAP 扫描。宽列存储 (BigTable、HBase)= 一个 NoSQL 多维映射,族 内列族按行存储,用于稀疏的按键检索 -- 根本不面 向列。同样的两个词,相反的物理布局。 ! Outer/inner not stated & |R | vs b R 没说明外/内关系,以及 |R| vs bR Every nested-loop and index-join cost is meaningless without saying which relation is outer - it flips the answer. And the tuple-NLJ multiplier is |R| rows, not bR pages (that's the page-at-a-time form). Read the stem. 每个嵌套循环连接和索引连接的代价,不说明哪个关 系是外表就毫无意义 -- 它会颠倒答案。而逐元组 NLJ 的乘数是 |RI行,不是bR页(那是逐页形式)。 读懂题干。 10. 5 The two-hour clock 10. 5 两小时时间分配 1 0:00-0:05 - Triage. Read the whole paper. Mark each question E / M / H and note its marks. Marks per minute is your budget - ~ 1 mark = 1 minute over 2 hours. Do the easy, high-mark cost questions first. 0:00-0:05 -- 分诊。通读整张试卷。给每道题标注 易/中/难 并记下分值。每分钟得分是你的预算 -- 在 2 小时里约1分 ~1分钟。先做容易、高分值的题。 2 0:05-1:30 - Bank the method marks. For every cost question: write the formula, sub in page counts, show every ceiling, box the number. Half-finished correct method beats a lone wrong number. Don't over- invest in one hard part. 0:05-1:30 -- 拿下方法分。对每道代价题:写出公式,代入页数,展示每一次取整,把答案框起来。做对一半的正确方法胜 过一个孤零零的错误数字。不要在某个难点上过度投入。 3 1:30-1:50 - Mop up. Return to flagged questions. Attempt every part - an empty answer scores O, a formula scores something. Short-answer concept questions: one crisp definition + the trade-off. 1:30-1:50 -- 收尾补漏。回到标记过的题。每一小问都要尝试 -- 空白答案得0分,一个公式总能得点分。简答概念题: 一个利落的定义+权衡取舍。 4 1:50-2:00 - Sweep. Check units (pages vs rows), ceilings applied, outer/inner stated, hash/sort-merge only used on equi-joins, and that you cleared the 40% hurdle by attempting enough across the paper. AskSia Library · DATA3404 · 双语 Bilingual[18]Source: asksia-cheatsheet-data3404.pdfSIA > Don't forget the +1 sort pass: #passes = 1 + (merge passes). Forgetting it is the classic off- by-one that drops the I/O total. 8 . Query Pipeline LEC 05 SQL (declarative) > Parser > Optimizer > Executor. SQL -> relational-algebra logical tree -> physical plan (operators + algorithm). Many plans per SQL; optimiser picks one. Materialization (set-at-a-time): write temp relation, next op reads it - always works, costly I/O. Pipelining (tuple- at-a-time): pass each row on - cheap, but the inner join input, sorts, hash joins & aggregations can't pipeline their input. Reduction factor = #out / #in (= selectivity). Cost = #block transfers, output ignored. 8b · Relational Algebra THE OPERATORS 6 basic: u union, - difference, o select (rows), Tt project (cols), x cross-product, p rename. Derived: n, M join, + division. Set ops need union-compatible schemas (same arity, names, domains). Join RM_0 S = o_0(RxS); equi-join = only equalities; natural join = equi-join on all same-named attrs, keep one copy. Complex conditions -> CNF to match access paths. Access paths: table scan, scan+filter, index scan, index- only (covering) scan. o (selection) = subset of rows; Tt (projection) = subset of columns (strict RA removes duplicates = SQL DISTINCT). o and It are usually fused into the access- path loop. An expression tree shows logical RA operators; a query execution plan shows the physical operators + algorithms chosen. Formula Belt SIDE 1 heap eq = B/2 . sorted eq = log2B B+tree search = [log_F N]+1 = h+1 ext-sort #passes = 1+[log_(B-1) [ N/B]] total I/0 = 2N . #passes ext-hash: double iff local=global depth hit ratio = (req-I/0)/req ≥ 80% asksia. ai/cheatsheet/ data3404 · side 1/2 AskSia CHEATSHEET SERIES Independent revision aid . DATA3404 exam is semi-open book - confirm permitted materials on your unit outline flip - for side 2 . joins, optimisation & scaling out Scan all O(B) O(B)[22]Source: asksia-cheatsheet-data3404.pdfHEAP VS SORTED Person: 1,000,000 rows, 10/page = 100,000 pages; age uniform 0-99. · age=30 on heap (1% match) => still scan all 100,000 (unsorted). Sorted on name => no help for an age predicate => 100,000. Lesson: an index/order only helps if it matches the query's predicate attribute - a name-sorted file is useless for an age query. 6e . Covering Index INDEX-ONLY An index-only (covering) plan answers a query from the index alone, never fetching the heap tuple. e. g. (city, name) answers SELECT name WHERE city =: v. Clustering not needed for index-only plans - usually a tree index. 7 . External Merge Sort LEC 05 For ORDER BY, DISTINCT, sort-merge join, bulk B+-tree load. Sort N pages with B buffer frames: Pass 0 (sort): read all N (B at a time), sort each chunk, write => [ N/B] runs of B pages. Cost 2N. Merge passes: merge B-1 runs at a time (1 output buffer); each pass reads+writes all N = 2N. #PASSES & TOTAL I/O #passes = 1 + [log_(B-1) [N/B]] Total I/0 = 2N . (#passes) Run length after merge pass i = (B-1) . B. In practice rarely > 2-3 passes. Clustered B+-tree on the sort col => read leaves in order (good); unclustered => random I/O/record (bad). 7b . Worked . Sort COUNT THE PASSES N = 10,000 pages, B = 30: Pass 0 runs = [10000/30] = 334 merge B-1 = 29 at a time P1: [334/29] = 12 runs P2: [12/29] = 1 = 2 merge passes total passes = 1 + 2 = 3 Total I/0 = 2. 10000. 3 = 60,000 SIA > Don't forget the +1 sort pass: #passes = 1 + (merge passes). Forgetting it is the classic off- by-one that drops the I/O total. 8 . Query Pipeline LEC 05 SQL (declarative) > Parser > Optimizer > Executor. SQL -> relational-algebra logical tree -> physical plan (operators + algorithm). Many plans per SQL; optimiser picks one. Materialization (set-at-a-time): write temp relation, next op reads it - always works, costly I/O. Pipelining (tuple- at-a-time): pass each row on - cheap, but the inner join input, sorts, hash joins & aggregations can't pipeline their input. Reduction factor = #out / #in (= selectivity). Cost = #block transfers, output ignored. 8b · Relational Algebra THE OPERATORS 6 basic: u union, - difference, o select (rows), Tt project (cols), x cross-product, p rename. Derived: n, M join, + division. Set ops need union-compatible schemas (same arity, names, domains). Join RM_0 S = o_0(RxS); equi-join = only equalities; natural join = equi-join on all same-named attrs, keep one copy. Complex conditions -> CNF to match access paths.
- 把 equi-join 限制忘了
- Hash / SMJ 只适用于 equi/natural;看到 inequality/outer 仍然用它=直接错方法。[13]Source: asksia-bible-data3404-bilingual.pdfThe official Assessment Overview literally says "paper-based exam, 2-hours, semi-open book" - but the mined Canvas pages and slides do not specify what materials are permitted. There is no allowed-materials list (no "one A4 sheet", no "lecture notes only") anywhere in the source. Do not assume. Email cs. data3404@sydney. edu. au / A/Prof Roehm and check the official exam notice before the day so you know exactly what you can carry in. Prepare to work from memory either way - treat any permitted notes as a bonus, not a crutch. 官方评估概览原话说“纸笔考试,2小时,半开卷” -- 但所挖掘的 Canvas 页面和讲义并未指明允许哪 些材料。源材料中任何地方都没有允许材料清单(没 有“一张A4纸”,没有“仅讲义”)。不要假设。发邮件 {A cs. data3404@sydney. edu. au / A/Prof Roehm, 并在考试当天之前查阅官方考试通知,以确切知道你 能带什么进去。无论如何都准备好凭记忆作答 -- 把 任何获准的笔记当作额外加分,而非拐杖。 10. 2 If you see X, do Y - the trigger table 10. 2 看到 X 就做 Y -- 触发器表 Pattern-match the question stem to the method. The instant you read the trigger on the left, write the response on the right - before you even finish reading the numbers. 把题干模式匹配到方法上。一读到左侧的触发条件,就写出右侧的应答 -- 甚至在你读完数字之前。 If the question says . . . Storage, indexing & sorting "Sort N pages, B buffers" Passes = 1+ [logB_1[N/B]]; I/O = 2N . passes. Pass O makes [ N/B] runs. B+-tree height / lookup cost Height = [log[ N]; equality lookup = [ log[ N] + 1 (= height + 1, one match). "Which index for this query?" Equality only - hash. Range / prefix / sort / group-by - tree (B+). Hash cannot do ranges. Composite / covering query Tree matches a prefix of the key; put equality attrs before range attrs; answer from the index alone = index-only (no clustering needed). Range-scan cost, index given Clustered: (height+1) + few packed pages. Unclustered: (height+1) + one I/O per matching row. State which. Joins (the gold - §10. 3 belt) "Join cost, M buffers given" Write all five formulas, sub in bR/bs/IRI, state which relation is outer, then pick the min. Inequality or outer join Hash & sort-merge are OUT (equi/natural only). Use nested-loop - BNLJ usually best. AskSia Library . DATA3404 . XXia Bilingual If the question says . . . . . . do this (the method) Plenty of memory, equi-join Grace hash = 3(bR+bs), usually cheapest; symmetric, outer/inner doesn't matter. Output must be sorted (ORDER BY) Prefer sort-merge; its sorted output is a free "interesting order" for the next operator. Optimisation & distributed "Estimate result size"[26]Source: asksia-cheatsheet-data3404.pdfINLJ: c = cost of one selection on S = 1 + h + #matches (root + index height + matching rows). SMJ sorted- only = b_R + b_S; result is sorted. Hash & SMJ work only for equi/natural joins (not outer/inequality). SIA > BNLJ uses M-2 (1 input + 1 output page reserved) in the worksheet form - the M-1 variant is for pipelined output. Read which the question wants; the block size drives the whole answer. 9b . Worked . NLJ vs BNLJ RE- NUMBERED R= 100 pg / 1000 rows; S = 400 pg / 10,000 rows; M = 10 (block M-2 = 8). SIMPLE NLJ R outer: 100 + 1000. 400 = 400,100 S outer: 400 + 10000-100 = 1,000,400 BLOCK NLJ R outer: 100 + [100/81. 400 = 100+13. 400 = 5,300 / S outer: 400 + [400/81. 100 = 5,400 BNLJ is ~75x cheaper than simple NLJ here - and smaller-as-outer still wins. 9d . Join Types & Stats SETUP THE COST 0-join (any cond), equi-join (=only), natural (all same- named), outer (NULL on no match). Hash & sort-merge work only for equi/natural. For R(1000 rows)MS(10,000) on a FK: cross-product = 10,000,000 rows; join result = 10,000; avg matches/R- row = 10. The o_0(RxS) view would discard 9,990,000 - why physical join algorithms beat materialising the cross-product. Joins matter: ~ 25% of CPU on a TPC-H workload (Impala) - hash join + sequential scan dominate analytics. That is why the exam drills the cost models. 9e . Which Join When DECISION Spark · No index, any condition => block NLJ (smaller as outer block). Index on inner join attr, equi => index NLJ if few matches. Output is ignored in cost, but a sort-merge join leaves the result sorted - free for a later ORDER BY or another merge join (an "interesting order"). · Inequality (R. a<S. b) or outer => hash/SMJ not applicable = block NL (or INLJ on a clustered B+- tree). 9c · Worked · INLJ, SMJ, Hash SAME R, S R = 100 pg/1000 rows, S = 400 pg/10,000 rows, M = 10. INDEX NLJ - C = 1 + H + MATCHES (a) idx on R, h=1, unique = c=3; S outer, R inner: 400 + 10000. 3 = 30,400
- 外/内关系没写清楚
- selectivity 直接乘,不写假设
-
6)考前 48 小时“最省命冲刺法”(按材料的两小时 clock 思路来练)
- 第一轮(以“拿方法分”为目标)
- 每个主题各做 1 题:Join cost、External sort、B+-tree height/lookup、Selectivity estimation、Spark stage counting。做的时候严格用“公式→代入→取整→结论”格式。[9]Source: asksia-bible-data3404-bilingual.pdf- PRACTICE BANK (CONT. ) Spark & conceptual - DAGs, engines, NoSQL Spark 与概念题–––DAG、引擎、NoSQL Narrow vs wide dependencies, Spark-vs-MapReduce, and a NoSQL choice 窄依赖 vs 宽依赖、Spark vs MapReduce,以及一次 NoSQL 选型 Q18 SPARK / CONCEPTS 10 marks . DAG / engines / NoSQL . Areas 8-10 Consider this Spark pipeline: 考虑这段 Spark 流水线: rdd. map(f). filter(g). groupByKey(). mapValues(h) . collect () rdd. map(f). filter(g). groupByKey(). mapValues(h). collect() (a) Label each transformation narrow or wide, and count the stages. [4] (b) Which call triggers execution, and why is the rest "lazy"? [2] (c) Name one reason Spark beats MapReduce for an iterative job. [2] (d) Choose a NoSQL family for a workload of simple Put /Get by key with no joins, and justify. [2] (a) 将每个转换(transformation) 标注为窄依赖(narrow)或宽依赖(wide),并统计阶段(stage)数。[4](b)哪个 调用触发执行,其余为何是“惰性的(lazy)”?[2](c)说出 Spark 在迭代作业上胜过 MapReduce 的一个原因。[2](d) 为一个仅按键做简单 Put/Get、无连接的工作负载选择一个 NoSQL 家族,并说明理由。[2] WORKED SOLUTION 1 (a) Dependencies. map = narrow (each output partition from one input partition); filter = narrow; groupByKey = wide (a shuffle - same key must meet across partitions); mapValues = narrow. A new stage starts at each wide dependency, so there are 2 stages: {map, filter} before the shuffle, {groupByKey's reduce side, mapValues} after it. - (a) 依赖关系。map =窄依赖(每个输出分区来自一个输入分区);filter =窄依赖;groupByKey = 宽依赖(一次洗牌 -- 相同的键必须跨分区相遇);mapValues = 窄依赖。每个宽依赖处开始一个新阶段,所以有 2个阶段:洗牌之前的 {map, filter},之后的 {groupByKey 的归约侧,mapValues}。 - 2 (b) collect () is the action that triggers execution. Transformations are lazy - recorded as a DAG (lineage), not run - so Spark can fuse narrow ops into one stage, plan shuffles, and rebuild lost partitions from lineage. Nothing computes until an action pulls a result to the driver. (b) collect(〕 是触发执行的动作。转换是惰性的 -- 被记录为一个 DAG(血统),并不运行 -- 这样 Spark 就能把窄算 子融合进一个阶段、规划洗牌,并从血统重建丢失的分区。在动作把结果拉到驱动器之前,什么都不会计算。 3 (c) Iterative speed-up: in-memory caching. Spark keeps reused RDDs/DataFrames in memory (cache ()), so each iteration avoids MapReduce's materialise-to-HDFS-then-reload between every map-reduce step. (c)迭代加速:内存缓存。Spark 把被复用的 RDD/DataFrame 留在内存中(cache()),所以每次迭代都避免了 MapReduce 在每个 map-reduce 步骤之间物化到 HDFS 再重新加载。 4 (d) Key-Value store (e. g. Redis, RocksDB, Dynamo). The access pattern is exactly Put /Get /Delete by key with an opaque payload and no joins or range scans - the simplest, fastest-scaling family for that shape. A document store fits if the value is queryable JSON; a wide-column store fits sparse, column-family data; a graph store fits relationship traversal - none of which this workload needs. AskSia Library . DATA3404 . XXia Bilingual (d) 键值存储(如 Redis、RocksDB、Dynamo)。访问模式恰好是按键的 Put/Get/Delete,负载不透明,且没有连接或范 围扫描 -- 对这种形态而言,这是最简单、扩展最快的一类。如果值是可查询的 JSON,文档存储更合适;如果是稀疏的列族 数据,宽列存储更合适;如果是关系遍历,图存储更合适 -- 而这个工作负载都不需要。 ✓ Stage-counting shortcut 数阶段的捷径 #stages = #wide dependencies + 1. Each shuffle (groupByKey, reduceByKey, join on non-co-partitioned data, distinct, repartition) ends a stage. Chains of map/filter/mapValues are fused into the surrounding stage for free. 阶段数 = 宽依赖数 +1。每次洗牌(groupByKey、reduceByKey、在非共分区数据上的 join、distinct、repartition) 结束一个阶段。map/filter/mapValues 的链条被融合进周围的阶段,不增加开销。 On every cost question: write the formula, substitute the page counts, then compute - in that order. The marker can award method marks on a correct 3 . (b_R+b_S) even if the multiply slips, but a bare number with no working earns nothing if it is wrong. Choose the smaller relation as the outer, keep ceilings until the end, and remember output cost is free. 对每道代价题:写出公式,代入页数,再计算 -- 按这个顺序。即使乘法算错,阅卷者也能在一个正确的 3·(b_R+b_S) 上给方法分,但一个没有过程的裸数字如果错了就什么都得不到。选较小的关系作外表,保留取整直到 最后,并记住输出代价是免费的。 MARKER'S NOTE . DATA3404 FINAL AskSia Library · DATA3404 · 双语 Bilingual CH 10 . EXAM MORNING - CHAPTER 10 . EXAM MORNING[11]Source: asksia-bible-data3404-bilingual.pdfMultiply by reduction factors (selectivity); equality on V distinct values - 1/V => |RI/V rows. "Reorder this query" Push selections & projections down, most-restrictive first; consider left-deep trees (System-R DP, pipelined). Distributed join, one small table Broadcast the small/filtered side to all nodes - local joins. Else shuffle / hash-partition both on the join key (equi only). Spark stage boundary / dependency Narrow dep (map, filter) = pipelined; wide dep (groupBy, join, reduceByKey) = shuffle = stage boundary. "Any join condition" distributed Fragment-and-replicate works for any condition (incl. non-equi); needs ≥ m. n processors. ✓ Always answer in the same shape 始终以相同的形式作答 Formula - substitution - arithmetic - one-line verdict. e. g. "BNLJ = bR + [bR/(M-2)1. bs = 100 + [100/81. 400 = 100 + 13. 400 = 5,300 I/O; Student outer is cheapest. " The marker can follow every step, so even an arithmetic slip keeps the method marks. 公式→代入→算术→一句结论。例如:“BNLJ=bR +[bR/(M-2)] ·bs = 100+[100/8] · 400 = 100+13·400= 5,300 I/O; Student 为外表最便宜。”阅卷者能跟上每一步,故即使算术出错也保住方法分。 AskSia Library . DATA3404 . XXia Bilingual CH 10 . EXAM MORNING 10. 3 The cost-formula belt - the whole exam on one box 10. 3 代价公式带 -- 整场考试浓缩成一个方框 If you memorise nothing else, memorise this. Notation: bR, bs, N = #pages; IRI = #rows; M / B = buffer frames; F = fanout; R = outer, S = inner; cost = page I/O, output ignored. 如果你别的都不背,就背这个。记号:bR,bs, N = 页数;|R| = 行数;M / B= 缓冲帧数;F = 扇出(fan-out); R =外表, S= 内表;代价=页 I/O,输出忽略不计。 JOIN I/O - THE FIVE Tuple NLJ = bR + |R | . bs Page NLJ = bp + bR . bs Block NLJ = bR + [bR/(M-2)1 . bs Index NLJ = bR + |R | · c, c = 1 + h + #matches Sort-Merge = sort (R)+sort(S)+bR+bs (= bR+bs if pre-sorted) Grace Hash = 3 . (bR + bs) SORT / INDEX / SIZING Ext. sort passes = 1 + [logB-1[N/B]] Ext. sort I/0 = 2N . #passes (Pass 0 - [N/B] runs of B) B+-tree height = [log- N] B+ equality cost = [log= N] + 1 Hash equality cost = #pages in bucket (1, no overflow) Selectivity (= on V values) = 1/V est . matches = |R | / V Reduction factor = out / in (multiply through) 10. 4 Top traps - where the marks bleed out 10. 4 头号陷阱 -- 失分最严重的地方 ! M-2 vs M in BNLJ 块嵌套循环连接(BNLJ)中的 M-2 vs M
- 第二轮(专打陷阱)
- 专门刷:$M-2$、外排 $+1$、$B-1$ 扇入、equi 限制、outer/inner 必写、$\lceil\ \rceil$ 不能丢。[8]Source: asksia-bible-data3404-bilingual.pdfBlock NLJ uses an (M-2)-page outer block in this unit's worksheet (1 inner-input frame + 1 output frame reserved). Writing M, or the pipelined M-1 variant, gives the wrong block count and the wrong cost. And ceiling it: [ bR/(M-2)1. 块 NLJ 在本单元的工作纸中使用(M-2)页的外块 (保留1个内表输入帧+1个输出帧)。写M,或流水 线式的 M-1 变体,会给出错误的块数和错误的代价。 并要取顶:[bR/(M-2)]。 ! Forgetting that selectivity assumes independence 忘记选择率(selectivity)假定了独立性 Result-size estimation multiplies per-predicate reduction factors - valid only under uniform distribution + predicate independence, which is often false. State the assumption; if the question hints at correlation (e. g. city & postcode), flag that the estimate may be wrong. Markers reward you for naming the assumption. 结果大小估计把各谓词的缩减因子相乘 -- 仅在均匀 分布+谓词独立下有效,而这常常为假。说明该假 设;如果题目暗示相关性(如 city 和 postcode),就 指出估计可能出错。阅卷者会因你点明假设而给分。 AskSia Library . DATA3404 . XXia Bilingual ! Counting the sort pass in external-sort passes 在外部排序的趟数中数上排序那一趟 #passes = 1+ [logB-1[N/B]] - the leading 1 is the sort (Pass 0), the log term is the merge passes. Worked: N=10,000, B=30 -+ 1+ [log29334] = 1+ 2 = 3 passes; I/O = 2. 10,000. 3 = 60,000. Don't drop the +1, and merge fan-in is B-1, not B. 趟数=1+『logB-1「N/B]]––前导的1是排序(第 0 趟),对数项是归并趟数。算例:N=10,000, B=30 →1+「log29334]=1+2=3趟;I/O= 2·10,000·3= 60,000。别漏掉 +1,且归并扇入是 B-1,不是B。 ! Clustered vs unclustered range cost 聚簇 vs 非聚簇的范围查询代价 Clustered: matches sit on consecutive pages => (height+1) + a few pages. Unclustered must be dense = potentially one random I/O per matching row. On 100 matching rows that is +100 I/O, not +5. Always say which clustering you assume. 聚簇:匹配落在连续页上 ⇒(height+1)+几页。非聚 簇必须稠密⇒ 潜在每个匹配行一次随机 I/O。对 100 个匹配行那是 +100 I/O,而非 +5。永远说明你假设 的是哪种聚簇。 ! Confusing wide-column store with column store 把宽列存储与列存储混淆 Column store (MonetDB, Vertica, Parquet) = stores by attribute, one file per column, for OLAP scans. Wide- column store (BigTable, HBase) = a NoSQL multidimensional map with column families stored row-wise within a family, for sparse key retrieval - not column-oriented at all. Same two words, opposite physical layout. 列存储 (MonetDB、Vertica、Parquet)= 按属性存 储,每列一个文件,用于OLAP 扫描。宽列存储 (BigTable、HBase)= 一个 NoSQL 多维映射,族 内列族按行存储,用于稀疏的按键检索 -- 根本不面 向列。同样的两个词,相反的物理布局。 ! Outer/inner not stated & |R | vs b R 没说明外/内关系,以及 |R| vs bR Every nested-loop and index-join cost is meaningless without saying which relation is outer - it flips the answer. And the tuple-NLJ multiplier is |R| rows, not bR pages (that's the page-at-a-time form). Read the stem. 每个嵌套循环连接和索引连接的代价,不说明哪个关 系是外表就毫无意义 -- 它会颠倒答案。而逐元组 NLJ 的乘数是 |RI行,不是bR页(那是逐页形式)。 读懂题干。 10. 5 The two-hour clock 10. 5 两小时时间分配 1 0:00-0:05 - Triage. Read the whole paper. Mark each question E / M / H and note its marks. Marks per minute is your budget - ~ 1 mark = 1 minute over 2 hours. Do the easy, high-mark cost questions first. 0:00-0:05 -- 分诊。通读整张试卷。给每道题标注 易/中/难 并记下分值。每分钟得分是你的预算 -- 在 2 小时里约1分 ~1分钟。先做容易、高分值的题。 2 0:05-1:30 - Bank the method marks. For every cost question: write the formula, sub in page counts, show every ceiling, box the number. Half-finished correct method beats a lone wrong number. Don't over- invest in one hard part. 0:05-1:30 -- 拿下方法分。对每道代价题:写出公式,代入页数,展示每一次取整,把答案框起来。做对一半的正确方法胜 过一个孤零零的错误数字。不要在某个难点上过度投入。 3 1:30-1:50 - Mop up. Return to flagged questions. Attempt every part - an empty answer scores O, a formula scores something. Short-answer concept questions: one crisp definition + the trade-off. 1:30-1:50 -- 收尾补漏。回到标记过的题。每一小问都要尝试 -- 空白答案得0分,一个公式总能得点分。简答概念题: 一个利落的定义+权衡取舍。 4 1:50-2:00 - Sweep. Check units (pages vs rows), ceilings applied, outer/inner stated, hash/sort-merge only used on equi-joins, and that you cleared the 40% hurdle by attempting enough across the paper. AskSia Library · DATA3404 · 双语 Bilingual[11]Source: asksia-bible-data3404-bilingual.pdfMultiply by reduction factors (selectivity); equality on V distinct values - 1/V => |RI/V rows. "Reorder this query" Push selections & projections down, most-restrictive first; consider left-deep trees (System-R DP, pipelined). Distributed join, one small table Broadcast the small/filtered side to all nodes - local joins. Else shuffle / hash-partition both on the join key (equi only). Spark stage boundary / dependency Narrow dep (map, filter) = pipelined; wide dep (groupBy, join, reduceByKey) = shuffle = stage boundary. "Any join condition" distributed Fragment-and-replicate works for any condition (incl. non-equi); needs ≥ m. n processors. ✓ Always answer in the same shape 始终以相同的形式作答 Formula - substitution - arithmetic - one-line verdict. e. g. "BNLJ = bR + [bR/(M-2)1. bs = 100 + [100/81. 400 = 100 + 13. 400 = 5,300 I/O; Student outer is cheapest. " The marker can follow every step, so even an arithmetic slip keeps the method marks. 公式→代入→算术→一句结论。例如:“BNLJ=bR +[bR/(M-2)] ·bs = 100+[100/8] · 400 = 100+13·400= 5,300 I/O; Student 为外表最便宜。”阅卷者能跟上每一步,故即使算术出错也保住方法分。 AskSia Library . DATA3404 . XXia Bilingual CH 10 . EXAM MORNING 10. 3 The cost-formula belt - the whole exam on one box 10. 3 代价公式带 -- 整场考试浓缩成一个方框 If you memorise nothing else, memorise this. Notation: bR, bs, N = #pages; IRI = #rows; M / B = buffer frames; F = fanout; R = outer, S = inner; cost = page I/O, output ignored. 如果你别的都不背,就背这个。记号:bR,bs, N = 页数;|R| = 行数;M / B= 缓冲帧数;F = 扇出(fan-out); R =外表, S= 内表;代价=页 I/O,输出忽略不计。 JOIN I/O - THE FIVE Tuple NLJ = bR + |R | . bs Page NLJ = bp + bR . bs Block NLJ = bR + [bR/(M-2)1 . bs Index NLJ = bR + |R | · c, c = 1 + h + #matches Sort-Merge = sort (R)+sort(S)+bR+bs (= bR+bs if pre-sorted) Grace Hash = 3 . (bR + bs) SORT / INDEX / SIZING Ext. sort passes = 1 + [logB-1[N/B]] Ext. sort I/0 = 2N . #passes (Pass 0 - [N/B] runs of B) B+-tree height = [log- N] B+ equality cost = [log= N] + 1 Hash equality cost = #pages in bucket (1, no overflow) Selectivity (= on V values) = 1/V est . matches = |R | / V Reduction factor = out / in (multiply through) 10. 4 Top traps - where the marks bleed out 10. 4 头号陷阱 -- 失分最严重的地方 ! M-2 vs M in BNLJ 块嵌套循环连接(BNLJ)中的 M-2 vs M
- 第三轮(按考试节奏模拟)
-
7)你现在只要回答我 3 个问题,我就能把你的复习再“个性化压缩”成一份冲刺清单
- 你回我这三件事(越短越好):
- 你考试具体是哪一天(日期)?还有几天?(我好给你排每天刷哪类题)
- 你最虚的两块是哪两块:
- A Join 代价
- B External sort
- C B+-tree / Hash / clustered vs unclustered
- D Optimiser + selectivity 估计
- E Buffer/CLOCK trace
- F Spark/NoSQL 概念题
- 你现在更需要:“公式模板+例题带你算一遍”,还是 “给你出题你来算我来改”?
- (可选)如果你愿意,我也可以帮你把你“最弱的两块 + 最常踩的陷阱”记住,下次你来就直接从这两块先救分。
Ask Sia for quick explanations, examples, and study support.