DATA1001
Jun 1, 2026
- First, the one-line survival rule: the DATA1001 final is a concepts-and-interpretation exam, not an R coding exam. You will be given studies, plots, summaries and small datasets, and tested on whether you can read the design → choose the right method → compute the key quantity → state the conclusion clearly in one sentence. [1][4]
[1] Source: asksia-bible-data1001-bilingual.pdf
Independent study companion. Not affiliated with or endorsed by the University of Sydney. Corrections: takedowns@asksia.ai
PREFACE: HOW TO USE THIS BOOK
Method, not memory; context, not code. The exam is conceptual and interpretive: read a study, pick a tool, say what it means.
This is not a transcript of the lecture slides or a re-run of the R labs. It is a self-contained course in the statistical thinking DATA1001 examines: each idea stated plainly, each method shown on a worked example with real numbers, each classic misread flagged. You learn R in the Coding Milestones and Projects; the exam tests whether you can read a study, choose the right method, run the logic and interpret the answer in context. That is what these pages drill.
A. 1 · LEARN. You haven't done the topic yet. Read a chapter top to bottom. Every idea opens with a one-line TL;DR, then define → picture → method → worked example → trap. The diagrams are original schematics of standard statistics; learn the picture cold.
B. 2 · DRILL. You've seen lectures and a workshop. Cover the worked steps and re-do each one by hand, then write the one-sentence interpretation in context. The exam pays for the sentence, not the arithmetic.
C. 3 · EXAM. It's the revision lecture / study week. The TL;DRs, the trap boxes and the recurring (OV-EV)/SE pattern are your map. The blueprint overleaf shows the weights, the backstop machinery and the question template.
The single engine that runs the back half of the course: master one calculation and the whole inference half collapses into a pattern. Every test (proportion test, z-test, t-test, slope test) is the same standardised distance; only the EV, the SE and the reference curve change. Wrapped around it is HATPC, the course's literal exam scaffold that graders reward line by line. Internalise the engine and the scaffold and fresh exam numbers cannot surprise you.
DATA1001 · Foundations of Data Science · AskSia Library
THE SPINE (wrapped in HATPC): $$\text{test statistic}=\frac{OV-EV}{SE}$$
[4] Source: asksia-bible-data1001-bilingual.pdf
| Component | Weight | When / detail |
| --- | --- | --- |
| Project 2 (individual; EDA + client report) | 20% | Parts due ~Wk 9 & 11 |
| Project 1 (group reproducible report) | 10% | ~Wk 6, present Wk 7 |
| Evaluate Quizzes (weekly online) | 5% | Best 8 of 10 + Early task |
| Workshop participation | 5% | All weeks; attend + take part |
The better-mark / progress-mark machinery:
| Rule | What it does |
| --- | --- |
| Project progress-mark | A better Project 2 mark replaces Project 1 |
| Quiz better-mark | If exam % > quiz %, exam % replaces quiz score |
| Spec-con adjustment | Missed work can be pushed onto the exam weight |
| Net effect | The exam is the universal backstop, and ungated by anything |
The exam format: conceptual and interpretive, NOT coding. One 2-hour written paper. You will not be asked to write R from a blank screen. You will be given studies, plots, summaries and small datasets and asked to pick the right method, run the logic and interpret in context. The same skeleton, (OV-EV)/SE read against a Normal or t curve, powers nearly every inference question. Walk the module pipeline on a fresh dataset and you have walked the exam.
- The core idea of your exam bible/cheatsheet: the back half of the course is driven by a single engine:
- Test statistic: $$\text{stat}=\frac{OV-EV}{SE}$$
- Each test (proportion, z, t, slope) changes only the EV, the SE and the reference curve; the skeleton stays the same. [1][10][16]
[10] Source: asksia-bible-data1001-bilingual.pdf
HATPC, the exam scaffold: write all five letters, every time.
| Step | What goes here | Marks reward |
| --- | --- | --- |
| H: Hypotheses | $H_0$ (=, "due to chance") vs $H_1$ (>, < or ≠). Decide one- vs two-sided. | Correct symbols and direction |
| A: Assumptions | Independence; Normality / large n; equal variance. State and justify each. | Checking, not just listing |
| T: Test statistic | Plug into (OV-EV)/SE with the right EV and SE under $H_0$. | Right EV, right SE, arithmetic |
| P: P-value | $P(\text{stat as or more extreme} \mid H_0)$ from the reference curve. Double for two-sided. | Correct tail, correct doubling |
| C: Conclusion | Statistical (vs $\alpha$) and scientific (in context). | Both layers, in plain English |
Why one engine covers four tests: OV = the statistic you computed (a proportion $\hat{p}$, a mean $\bar{x}$, a difference, a slope $\hat{\beta}_1$). EV = what $H_0$ says that statistic should be. SE = the standard error, the likely size of the chance error in that statistic, from the box model / sampling distribution. The only things that change across the test zoo are the formula for SE and which reference curve you read the p-value from. Learn the skeleton once and the rest is bookkeeping.
The two-layer conclusion graders look for: a bare "reject $H_0$" rarely gets full marks. Write both: (1) statistical: "p = 0.013 < 0.05, so we reject $H_0$ at the 5% level"; then (2) scientific: "there is evidence the new tutoring program raises mean exam scores." And never say "accept $H_0$"; you retain it (absence of evidence ≠ evidence of absence).
[16] Source: asksia-cheatsheet-data1001.pdf
17 · HATPC (Topic 9, the scaffold). The exam scaffold for every test: · P-value: see col 5. · Conclusion: two layers, statistical (vs $\alpha$) + scientific (in context).
$H_0$ always carries the = ("the difference is due to chance"); $H_1$ carries >, < or ≠. Decide one- vs two-sided before seeing the data, from the research question, not to chase significance.
Assumptions checklist: independence (from the design), Normality or large n (histogram / QQ-plot / Shapiro-Wilk), equal variance for a pooled two-sample test (compare spreads). If they fail: transform the data or use a different test. State each; graders award the "A".
Write the conclusion in both registers: "p = 0.03 < 0.05, so we reject $H_0$ (statistical); there is evidence the new method raises the mean score (scientific, in context)." Dropping the context half loses marks. Never write "accept $H_0$", only "retain", since absence of evidence is not evidence of absence.
Compiled by AskSia · mapped to the DATA1001 syllabus · asksia.ai/cheatsheet/usyd-data1001
18 · THE ENGINE: ONE STAT, FOUR TESTS. The master test statistic:
$$\text{stat}=\frac{OV-EV}{SE}=\frac{\text{observed}-\text{what }H_0\text{ predicts}}{SE}$$
How many SEs the data sit from $H_0$. Large $|\text{stat}|$ ⇒ evidence against $H_0$; near 0 ⇒ consistent with chance. Proportion, z, t and slope tests are all this one calculation; only EV, SE and the reference curve change. Once you can read EV and SE off the box model, every test is the same three keystrokes.
| Test | SE / EV | Curve |
| --- | --- | --- |
| Proportion | $\sqrt{p_0(1-p_0)/n}$ | $N(0,1)$ |
| z ($\sigma$ known) | $\sigma/\sqrt{n}$ | |
- The most important revision strategy: for every question you work, force yourself to write the one-sentence interpretation in context; the exam pays for that sentence, not the arithmetic. [1][3]
[3] Source: asksia-bible-data1001-bilingual.pdf
The strategy this dictates: because the exam backstops quizzes, Project 1 and missed pieces, but nothing backstops the exam, the dominant move is to over-invest in exam-style reasoning. Treat the projects as exam practice with a longer deadline: the EDA, the method choice, HATPC and interpret-for-a-client are exactly what the exam rewards. Drill the engine; write the in-context sentence every time.
What the exam is really testing: four recurring chains carry most marks: read the study design → say what conclusion is legal; summarise and plot → describe shape/centre/spread; state HATPC → compute (OV-EV)/SE → p-value; interpret the p-value / CI without the classic misreads. Every chapter in this book is built to make those chains automatic.
CONTENTS: four modules, one pipeline. Exploring → Modelling → Sampling → Deciding, and one engine under it all.
| Module | Ch | Topic | Core ideas |
| --- | --- | --- | --- |
| 1 · Exploring data (Weeks 1-3) | 1 | Design & data types | categorical vs quantitative · observational vs experiment · confounding · bias · sampling |
| | 2 | Exploratory data analysis | mean/median · SD/IQR · resistance · histogram & skew · boxplot & 1.5·IQR |
| 2 · Modelling data (Weeks 4-5) | 3 | The Normal model | z-scores · 68-95-99.7 · measurement error · pnorm/qnorm |
| | 4 | The linear model | correlation · regression line · SD line · regression to the mean · $r^2$ |
| 3 · Sampling data (Weeks 6-9) | 5 | Chance & the box model | probability rules · binomial · EV & SE · the CLT |
| | 6 | Surveys & confidence intervals | parameter vs statistic · bias · 0-1 box · CI · bootstrap |
| 4 · Decisions with data (Weeks 10-12) | 7 | Testing: HATPC & the engine | (OV-EV)/SE · proportion / z / t / slope · p-value & CI literacy |
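The engine and the two-layer conclusion described above can be sketched in a few lines. This is only an illustration in Python (the unit itself uses R, and the exam does not ask you to write code), assuming a one-sample proportion test with the SE computed under $H_0$ and read against the $N(0,1)$ curve. The function names `proportion_test` and `statistical_conclusion`, the `alternative` argument and the counts (60 successes in 100 draws, $p_0 = 0.5$) are hypothetical, chosen for this sketch.

```python
from math import sqrt
from statistics import NormalDist


def proportion_test(successes, n, p0, alternative="two-sided"):
    """Return (stat, p_value) for a one-sample proportion test.

    stat = (OV - EV) / SE, where OV is the sample proportion,
    EV is p0 (what H0 predicts) and SE = sqrt(p0 * (1 - p0) / n).
    The p-value is read from the standard Normal curve.
    """
    ov = successes / n                  # observed value
    ev = p0                             # expected value under H0
    se = sqrt(p0 * (1 - p0) / n)        # standard error under H0
    stat = (ov - ev) / se

    z = NormalDist()                    # N(0, 1) reference curve
    if alternative == "greater":
        p_value = 1 - z.cdf(stat)       # upper tail only
    elif alternative == "less":
        p_value = z.cdf(stat)           # lower tail only
    else:
        p_value = 2 * (1 - z.cdf(abs(stat)))  # double the tail for two-sided
    return stat, p_value


def statistical_conclusion(p_value, alpha=0.05):
    """First layer of the conclusion: reject or retain H0 (never 'accept')."""
    verdict = "reject" if p_value < alpha else "retain"
    sign = "<" if p_value < alpha else ">="
    return f"p = {p_value:.3f} {sign} {alpha}, so we {verdict} H0"


# Hypothetical numbers: 60 successes in 100 draws, H0: p = 0.5.
stat, p_value = proportion_test(60, 100, 0.5)
print(round(stat, 2), round(p_value, 4))   # stat is about 2.0, p about 0.0455
print(statistical_conclusion(p_value))
```

The second layer, the scientific sentence in context, still has to be written by hand; no function can supply the context for you.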
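0) Exam blueprint and revision priorities (where to spend your time)
- The final carries a heavy weight (60%), and it is also the backstop for many other components (your materials stress this explicitly):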
- A strong final can rescue weak quizzes and projects, but nothing can rescue the final itself. So over-invest your revision in exam-style reasoning. [2][3][4]
[2] Source: asksia-bible-data1001-bilingual.pdf
The most important strategic fact about DATA1001: so much of your mark rolls up into the exam that a strong exam can rescue weak quizzes and projects, but a weak exam has nothing to rescue it. A better Project 2 mark replaces Project 1; if your exam % beats your quiz %, the exam % replaces the quiz score; missed workshops / un-submitted work can be pushed onto the exam weight. The 60% exam is both the biggest single mark and the universal backstop. Prioritise it.
How this book was built, and the two-layer rule: standard statistical definitions and formulas are stated plainly (they are universal: a normal curve is a fact, a boxplot is a fact). The unit's own framing and the lecturer's example numbers are paraphrased and re-numbered, never copied. DATA1001 is taught in the box-model tradition (tickets in a box, observed = expected + chance error); we follow that lineage, not any single text. Book status: the Canvas site lists an optional textbook but names none, so we assert no required book. Verify dates and weights on your own Canvas (canvas.sydney.edu.au).
THE EXAM BLUEPRINT: FINAL 60%, THE BACKSTOP. 60% in one 2-hour sitting.
Project 1 10% · Project 2 20% · Quizzes 5% · Workshop 5% · Final 60%
Your mark is built from five pieces, but one dominates and nets the rest. The final exam is 60% in a single 2-hour written sitting during the formal exam period: more than half your grade at once, and the safety net under almost everything else.
[Figure: weight breakdown. 60% final exam (2 hr); 30% two projects (10 + 20); 10% quizzes + workshop. Engine: (OV-EV)/SE]
The five assessment pieces:
| Component | Weight | When / detail |
| --- | --- | --- |
| Final examination (2 hr, written, the backstop) | 60% | Formal exam period |
- Paper format: a 2-hour written exam, conceptual and interpretive; you will not be asked to write R on a blank screen. [4]
- Exam questions most often walk the same pipeline:
- Explore (describe) → Model → Sample (sampling / chance) → Decide (test / decision)
- Exam questions often walk the whole pipeline from scratch on a fresh dataset. [3][8][19]
[8] Source: asksia-bible-data1001-bilingual.pdf
Walk in ready (contents, continued):
| Ch | Topic | Core ideas |
| --- | --- | --- |
| 8 | Glossary & method map | every term · which test, when |
| 9 | Practice bank & solutions | the recurring exam template, re-numbered |
Why this order: DATA1001 runs as four modules on a learning slogan (Exploring → Modelling → Sampling → Deciding) and the topics build strictly: describe the data, fit a model, quantify chance and sampling, then make a decision (a test). We keep that order because exam questions almost always walk the same pipeline on a fresh dataset. The first two chapters, design and EDA, are where most students lose easy marks by rushing; slow down here.
TOPIC 1 · CH 1 · DESIGN OF EXPERIMENTS. Design drives the conclusion. Module 1, Exploring data · Topic 1 (LO1, LO2, LO9).
Before a single number is computed, one question fixes what you are allowed to say: how were the data produced? The same association (coffee drinkers live longer, helmet wearers crash less) supports a causal claim from a randomised experiment but only an associational claim from an observational study. This is the most heavily examined critique in DATA1001 (LO9), and it costs nothing to get right except discipline.
TL;DR, the four moves of this chapter: (1) Name each variable's type; it dictates the legal summary and plot. (2) Name the study design; it dictates whether you may say "causes". (3) Spot the confounder, the lurking third variable that fakes (or reverses) a link. (4) Name the bias (selection, non-response, measurement) that a big sample will not fix.
1.1 Data types: what kind of variable is it? Every variable is one of two families, and the family decides which summary and which plot are legal. You cannot take a mean of a category; you cannot draw a histogram of a colour.
Table columns: Family · Sub-type · Examples · Legal summary / plot
[19] Source: asksia-cheatsheet-data1001.pdf
The pipeline: 1 Explore (what does the data look like?) → 2 Model (Normal & linear models) → 3 Sample (how much does chance vary it?) → 4 Decide (is the effect real? a test). Exam questions walk this pipeline on a fresh dataset: describe, model, then test. DATA1901 shares these lectures and this exam; it differs only in harder workshops and Project 1.
R Reference · Side 1 · DESCRIBE + MODEL
Inspect & summarise: `str(d)` · `head(d)` · `summary(d)` · `mean()` `sd()` `median()` `IQR()` `quantile()`
Plot: `hist()` · `boxplot(y ~ g)` · `ggplot(d, aes(x)) + geom_histogram()`
Normal: `pnorm(q, μ, σ)` (area below) · `qnorm(p, μ, σ)` (percentile → value)
Relationship: `cor(x, y)` · `lm(y ~ x)`; `summary()`; `plot(fit)` · `predict(fit, newdata)`
Work in Quarto (.qmd → knit to HTML); set `embed-resources: true` in the YAML or the report is penalised. `str()`/`head()` first: initial data analysis (IDA) before any plot.
Formula Belt · Side 1:
$$\bar{x}=\frac{1}{n}\sum x_i \qquad s=\sqrt{\frac{\sum (x_i-\bar{x})^2}{n-1}}$$
IQR = Q3 − Q1 · fences ±1.5·IQR
$$z=\frac{x-\mu}{\sigma}$$
68-95-99.7 · $z^*_{95}=1.96$
$$r=\operatorname{avg}(z_x\cdot z_y) \qquad \hat{y}=b_0+b_1x \qquad b_1=r\,\frac{SD_y}{SD_x} \qquad b_0=\bar{y}-b_1\bar{x}$$
Regression to the mean: $+k$ SD in $x$ ⇒ $+r\cdot k$ SD in $y$ · $r^2$ = variance explained
Revision aid · check the official unit outline for assessment · © 2026 · flip for side 2: probability, the box model & testing
AskSia Cheatsheet Series · Revision sheet, all topics, side 1/2 · EXPLORE & MODEL: Data types · Study design · EDA · Histograms & boxplots · The Normal model · z-scores · Correlation & regression
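The backstop arithmetic behind this section can be made concrete with a small sketch. It is written in Python purely for illustration and rests on loud assumptions: that the listed weights (10/20/5/5/60) apply, that the project progress-mark and the quiz better-mark each simply substitute the higher percentage, and that the spec-con adjustment is ignored. The official rules may work differently, so check your own Canvas. The function name `final_mark` and the example percentages are hypothetical.

```python
# Assessment weights as listed in the blueprint (percent of the unit mark).
WEIGHTS = {"project1": 10, "project2": 20, "quizzes": 5, "workshop": 5, "exam": 60}


def final_mark(marks, apply_backstops=True):
    """Weighted unit mark from component percentages (0-100).

    Assumed backstop rules (a simplification, not the official policy):
    - a better Project 2 percentage replaces the Project 1 percentage;
    - if the exam percentage beats the quiz percentage, it replaces it.
    The spec-con adjustment is deliberately left out of this sketch.
    """
    adjusted = dict(marks)
    if apply_backstops:
        adjusted["project1"] = max(marks["project1"], marks["project2"])
        adjusted["quizzes"] = max(marks["quizzes"], marks["exam"])
    return sum(WEIGHTS[name] * adjusted[name] for name in WEIGHTS) / 100


# Hypothetical student: weak early work, strong exam.
marks = {"project1": 50, "project2": 70, "quizzes": 40, "workshop": 100, "exam": 80}
print(final_mark(marks, apply_backstops=False))  # 74.0
print(final_mark(marks))                         # 78.0
```

Under these assumptions the strong exam lifts the mark by four points, while no rule in the sketch can lift a weak exam, which is the asymmetry the bullets above describe.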
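1) The four scoring chains to drill until they are reflexes (closest to real exam questions)
Chain A: read the study design → decide what you are allowed to say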
- Key question: how were the data produced? This one question decides whether you may claim causation. [8]
- Observational study: the investigator only observes and confounding is uncontrolled → you cannot infer causation. [9][24]
[9] Source: asksia-bible-data1001-bilingual.pdf
Correlation ≠ causation; a significant slope is still observational; and "no linear trend" does not mean "no relationship", since there may be a curve.
THE THREE SENTENCES GRADERS MOST WANT TO SEE
R idea: `summary(lm(y ~ x))` → read the slope's `Pr(>|t|)`; check assumptions with `plot(lm.obj)` (residuals-vs-fitted + Normal QQ).
GLOSSARY. Bilingual glossary: every examinable term. English term · Chinese · one-line meaning, grouped by the four modules. A fast reference for the vocabulary DATA1001 actually examines, ordered by the four-module pipeline: Exploring → Modelling → Sampling → Decisions. The Chinese column is filled in the bilingual build; for now cover the right-hand meaning and recite from the term, then flip and recall the term from the meaning.
Module 1, Exploring data: study design
| Term (EN) | Chinese | One-line meaning |
| --- | --- | --- |
| Observational study | | Investigator only watches; cannot establish causation (confounding uncontrolled). |
| Controlled experiment | | Investigator assigns the treatment; randomisation licenses causal claims. |
| Randomised controlled trial | | Random assignment balances confounders; supports "X causes Y." |
[24] Source: asksia-cheatsheet-data1001.pdf
2 · Study Design (Topic 1, LO9). Design drives the conclusion you may draw. Observational: you only observe; confounding uncontrolled ⇒ cannot establish causation. Controlled experiment: you assign the treatment. Confounder = a variable linked to both the exposure and the outcome, distorting the apparent relationship. Ask: "Does X cause Y, or is there a lurking Z?" Trap (most-tested critique): "coffee drinkers live longer ⇒ coffee causes longevity." Observational ⇒ a confounder (e.g. wealth) may drive both.
2b · Simpson's Paradox & Bias (LO9). Simpson's paradox: a trend that holds in every subgroup can reverse when the groups are pooled; a lurking variable drives it. Always check whether a third variable splits the data. Bias types: selection (who gets in), measurement (how you record), non-response. Size cures variance, not bias: a big biased sample is still biased. Prosecutor's fallacy: confusing P(evidence | innocent) with P(innocent | evidence); they are not equal.
2c · Why Statistics? (Topic 1, LO1). The unit's framing: evidence-based decisions need statistical thinking + computational skill (R), to be a data storyteller. Ethics, privacy and big-data responsibility sit alongside the maths (LO1). LO9 critiques are plain English: read a study's design, name what it licenses, spot the flaw.
3 · Centre & Spread (Topic 3).
Centre: sample mean $$\bar{x}=\frac{1}{n}\sum x_i$$ Median: middle value; resistant. Trap: quoting the mean for skewed/outlier-heavy data (income!); use the median.
Spread: sample SD (n−1 divisor) $$s=\sqrt{\frac{1}{n-1}\sum (x_i-\bar{x})^2}$$ SD = RMS distance from the mean, in the data's units. IQR = Q3 − Q1 (resistant); range = max − min (very non-resistant); variance = $s^2$. Traps: mixing the n vs n−1 divisor (R's `sd()`/`var()` use n−1); reporting SD where IQR is better (skew).
R idea: `mean()` · `median()` · `sd()` · `IQR()` · `quantile()` · `summary()`
Percentiles: the p-th percentile is the value below which p% of the data fall; median = 50th, Q1 = 25th, Q3 = 75th. Resistant (robust) = unmoved by a few extreme values: median & IQR are resistant; mean, SD & range are not. Note R's `sd()` uses the n−1 divisor (a population SD would divide by n).
4 · Histograms & Skew (Topic 2). Area (not height) = proportion of data in an interval. Shows shape: modality, skew, outliers.
Table columns: Shape · Tail · Mean vs median
- Randomised controlled trial: random assignment balances confounders, so this is the design that can license the claim "X causes Y". [9]
- Confounder, defined: a lurking third variable associated with both the explanatory variable X and the outcome Y, which distorts the apparent relationship. Ask yourself: does X cause Y, or is a Z at work behind both? [8][24]
- Classic exam trap: a bigger sample does not fix bias. Size cures variance, not bias: a large sample only shrinks random fluctuation; it cannot correct systematic bias. [8][18][24]
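The "size cures variance, not bias" point can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population (the integers 1 to 100), the selection rule (only values above 50 can be sampled) and the helper `sample_mean` are all hypothetical and chosen for illustration, and Python is used here purely for convenience.

```python
import random
import statistics

# Hypothetical population: the integers 1..100 (true mean 50.5).
population = list(range(1, 101))
true_mean = statistics.mean(population)

# Assumed selection bias: only units with a value above 50 can ever be sampled.
biased_frame = [x for x in population if x > 50]

rng = random.Random(1001)  # fixed seed so the run is repeatable


def sample_mean(frame, n):
    """Mean of n draws, with replacement, from the given sampling frame."""
    return statistics.mean(rng.choices(frame, k=n))


results = {}
for n in (100, 10_000):
    fair = sample_mean(population, n)
    biased = sample_mean(biased_frame, n)
    results[n] = (fair, biased)
    print(f"n={n:>6}: fair mean={fair:.2f}, biased mean={biased:.2f}, true mean={true_mean}")
```

As n grows, the fair sample mean settles near the true mean, while the biased sample mean settles just as firmly near the mean of the biased frame (75.5): more data makes the wrong answer more precise, not more correct.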
- Chain B: EDA (summarise + plot) → describe shape/centre/spread using the standard vocabulary
- The exam often asks you to read a histogram or boxplot and compare: shape, centre, spread, skew and outliers. [3][7][21]
- Centre: mean vs median (when to use which)
- Sample mean: $$\bar{x}=\frac{1}{n}\sum x_i$$ [24]
- The median is resistant to extreme values; when the data are skewed or outlier-heavy, report the median rather than forcing the mean. [24]
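To see why the median is called resistant, here is a minimal sketch. The income values are made up for illustration, and Python's standard `statistics` module stands in for whatever software you use.

```python
import statistics

# Hypothetical weekly incomes in dollars; the added value is an extreme outlier.
incomes = [520, 560, 600, 640, 680]
incomes_with_outlier = incomes + [9000]

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(incomes_with_outlier)
median_after = statistics.median(incomes_with_outlier)

print(f"without outlier: mean={mean_before}, median={median_before}")
print(f"with outlier:    mean={mean_after}, median={median_after}")
```

One extreme value drags the mean from 600 to 2000, while the median only moves from 600 to 620. That is the argument for quoting the median for skewed data such as income.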
- Spread: SD vs IQR (when to use which)
- Sample standard deviation (note the $n-1$ divisor):
$$s=\sqrt{\frac{1}{n-1}\sum (x_i-\bar{x})^2}$$ [24]
- Interquartile range: $$IQR=Q_3-Q_1$$ (resistant to extreme values) [24]
- Trap: mixing up the $n$ and $n-1$ divisors; R's `sd()` uses $n-1$. [24]
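- Sample standard deviation (note the $n-1$ divisor): $$s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}$$
- Boxplot 1.5·IQR outlier rule (worth only a few marks, but tested very often)
- Lower fence: $$Q_1-1.5\times\mathrm{IQR}$$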
- Upper fence: $$Q_3+1.5\times\mathrm{IQR}$$ [7] Source: asksia-bible-data1001-bilingual.pdf

In one line: about 60% of the DATA1001 final is conceptual/interpretive: read a study, pick the right method, run the logic, interpret in context. This practice bank gives you one fresh question for every recurring exam move, in pipeline order, each fully solved. Cover the answer, work it on paper, then check.

★ Fresh stems: the exam style, not the exam. These are AskSia-authored problems written in the DATA1001 style. They are not real exam questions. The moves tracked are the recurring ones: confounding judgement, boxplot/EDA read, z / Normal calculation, regression plus regression to the mean, box-model EV/SE, CLT and CI, a full HATPC test, $\chi^2$ independence, and "interpret this R output." The standard statistics is canonical; every number is checked by hand.

Q1-Q3. Module 1: design, confounding & EDA

Q1 (Design / causation, 3 marks). A news report: "People who drink herbal tea daily have lower blood pressure, so herbal tea lowers blood pressure." The data come from a voluntary online survey. Critique the causal claim, name a plausible confounder, and say what design would license "causes."

Q2 (Boxplot read, 3 marks). Two classes sit the same test. Class A boxplot: min 41, Q1 58, median 67, Q3 74, max 88. Class B: min 30, Q1 52, median 70, Q3 90, max 99. Compare centre and spread, and identify which class is more right-skewed.

Q3 (Outlier rule, 2 marks). For Class A above (Q1 = 58, Q3 = 74), find the 1.5·IQR fences and state whether a score of 40 is an outlier.

Q1-Q3 worked solutions: design & EDA

Q1. This is an observational study (a voluntary survey), so causation cannot be inferred. A plausible confounder is overall health-consciousness: health-conscious people both drink herbal tea and exercise/eat well (which lowers BP), creating a spurious tea-BP link. The voluntary sample adds selection bias. Only a randomised controlled experiment (randomly assign people to tea vs none) balances confounders and could license "causes."

Q2. Centre: Class B's median (70) is slightly higher than A's (67). Spread: B's IQR = 90 − 52 = 38 is much wider than A's 74 − 58 = 16, and B's range (30 to 99) is far wider, so B is much more variable. Skew: in B the upper half of the box (70→90, width 20) is slightly longer than the lower half (52→70, width 18), whereas in A the upper half (67→74, width 7) is shorter than the lower half (58→67, width 9), so on the box B is the more right-skewed class. The evidence is mild, though: B's upper whisker (90→99) is shorter than its lower whisker (30→52). A is roughly symmetric.

Q3. IQR = 74 − 58 = 16, so 1.5·IQR = 24. Lower fence = 58 − 24 = 34; upper fence = 74 + 24 = 98. Since 40 > 34, the score 40 is inside the lower fence: not an outlier.

Practice Q4-Q6 (practice bank, cont.). Module 2: the Normal & linear models. Q4-Q6: z-scores and the 68-95-99.7 rule, regression prediction, regression to the mean. [24] Source: asksia-cheatsheet-data1001.pdf
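The fence arithmetic in Q3 can be checked with a few lines of code. This is a minimal sketch in Python rather than the course's R; the helper names `iqr_fences` and `is_outlier` are made up for illustration, and the numbers are the Class A quartiles from the worked solution above.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences for the k*IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(x, q1, q3, k=1.5):
    """True if x falls strictly outside the fences."""
    lower, upper = iqr_fences(q1, q3, k)
    return x < lower or x > upper


# Class A from the practice question: Q1 = 58, Q3 = 74
lower, upper = iqr_fences(58, 74)
print(lower, upper)             # 34.0 98.0
print(is_outlier(40, 58, 74))   # False: 40 is inside the lower fence
```

On the exam you do this by hand; the point of the sketch is only that the rule is two subtractions and one comparison.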
- The "free marks" sentence pattern for skew:
- Right-skewed: the right tail is longer, and typically mean > median; left-skewed is the reverse (at a minimum, be able to describe the shape clearly as "right-skewed" or "left-skewed"). [7] Source: asksia-bible-data1001-bilingual.pdf [24] Source: asksia-cheatsheet-data1001.pdf
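To see the "mean > median under right skew" pattern and the $n$ versus $n-1$ divisor trap on actual numbers, here is a small sketch using Python's standard `statistics` module. The income values are invented purely for illustration. In this module `stdev` divides by $n-1$ (the same convention the notes attribute to R's sd()), while `pstdev` divides by $n$.

```python
import statistics

# Made-up incomes in $1000s: one large value creates a long right tail
incomes = [32, 35, 38, 40, 41, 45, 48, 250]

mean_income = statistics.mean(incomes)      # pulled up by the 250
median_income = statistics.median(incomes)  # resistant to the 250

sample_sd = statistics.stdev(incomes)       # n-1 divisor
population_sd = statistics.pstdev(incomes)  # n divisor

# The two SDs differ by the factor sqrt(n / (n - 1))
ratio = sample_sd / population_sd

print(mean_income, median_income)  # 66.125 40.5
print(sample_sd > population_sd)   # True
```

The mean sits well above the median here, which is the one-sentence evidence for right skew and the reason to report the median and IQR for data like these.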
- Chain C: write HATPC → compute $(OV-EV)/SE$ → read the p-value
- HATPC is the "scaffold" that graders mark line by line: write out all five steps every time. [1] Source: asksia-bible-data1001-bilingual.pdf [10] Source: asksia-bible-data1001-bilingual.pdf [16] Source: asksia-cheatsheet-data1001.pdf
- The five HATPC steps (you can memorise these as a template): [10] Source: asksia-bible-data1001-bilingual.pdf

HATPC, the exam scaffold: write all five letters, every time.

| Step | What goes here | Marks reward |
| --- | --- | --- |
| H: Hypotheses | $H_0$ (=, "due to chance") vs $H_1$ ($>$, $<$ or $\neq$). Decide one- vs two-sided. | Correct symbols & direction |
| A: Assumptions | Independence; Normality / large $n$; equal variance. State and justify each. | Checking, not just listing |
| T: Test statistic | Plug into $(OV-EV)/SE$ with the right EV and SE under $H_0$. | Right EV, right SE, arithmetic |
| P: P-value | $P(\text{stat as or more extreme} \mid H_0)$ from the reference curve. Double for two-sided. | Correct tail, correct doubling |
| C: Conclusion | Statistical (vs $\alpha$) and scientific (in context). | Both layers, in plain English |
Why one engine covers four tests

OV = the statistic you computed (a proportion $\hat{p}$, a mean $\bar{x}$, a difference, a slope $\hat{\beta}_1$). EV = what $H_0$ says that statistic should be. SE = the standard error: the likely size of the chance error in that statistic, from the box model / sampling distribution. The only things that change across the test zoo are the formula for SE and which reference curve you read the p-value from. Learn the skeleton once and the rest is bookkeeping.
★ The two-layer conclusion graders look for

A bare "reject $H_0$" rarely gets full marks. Write both: (1) statistical: "p = 0.013 < 0.05, so we reject $H_0$ at the 5% level"; then (2) scientific: "there is evidence the new tutoring program raises mean exam scores." And never say "accept $H_0$": you retain it (absence of evidence ≠ evidence of absence). [16] Source: asksia-cheatsheet-data1001.pdf

17. HATPC (Topic 9): the scaffold. The exam scaffold for every test:
· P-value: see col 5.
· Conclusion: two layers, statistical (vs $\alpha$) + scientific (in context).

$H_0$ always carries the = ("the difference is due to chance"); $H_1$ carries $>$, $<$ or $\neq$. Decide one- vs two-sided before seeing the data, from the research question, not to chase significance.

Assumptions checklist: independence (from the design), Normality or large $n$ (histogram / QQ-plot / Shapiro-Wilk), equal variance for a pooled two-sample test (compare spreads). If they fail: transform the data or use a different test. State each one; graders award the "A".

Write the conclusion in both registers: "p = 0.03 < 0.05, so we reject $H_0$ (statistical); there is evidence the new method raises the mean score (scientific, in context)." Dropping the context half loses marks. Never write "accept $H_0$", only "retain", since absence of evidence is not evidence of absence.

Compiled by AskSia · mapped to the DATA1001 syllabus · asksia.ai/cheatsheet/usyd-data1001
18. The Engine: one stat, four tests

The master test statistic: $$\text{stat}=\frac{OV-EV}{SE}=\frac{\text{observed}-\text{what }H_0\text{ predicts}}{SE}$$

How many SEs the data sit from $H_0$. Large $|\text{stat}|$ ⇒ evidence against $H_0$; near 0 ⇒ consistent with chance. Proportion, z, t and slope tests are all this one calculation; only EV, SE and the reference curve change. Once you can read EV and SE off the box model, every test is the same three keystrokes.

| Test | SE / EV | Curve |
| --- | --- | --- |
| Proportion | $\sqrt{p_0(1-p_0)/n}$ | $N(0,1)$ |
| z ($\sigma$ known) | $\sigma/\sqrt{n}$ | |
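The "one engine" idea above can be sketched in code: the same function computes the statistic for any test, and only the SE helper changes. This is an illustrative Python sketch, not the course's R workflow; the function names (`engine_stat`, `proportion_se`, `mean_se_known_sigma`, `normal_p_value`) are hypothetical, and the coin-toss numbers are made up. It covers only the two rows that read the p-value from the $N(0,1)$ curve.

```python
from math import sqrt
from statistics import NormalDist


def engine_stat(ov, ev, se):
    """How many SEs the observed value sits from what H0 predicts."""
    return (ov - ev) / se


def proportion_se(p0, n):
    """SE of a sample proportion under H0: p = p0."""
    return sqrt(p0 * (1 - p0) / n)


def mean_se_known_sigma(sigma, n):
    """SE of a sample mean when the population SD sigma is known."""
    return sigma / sqrt(n)


def normal_p_value(stat, alternative="two-sided"):
    """P-value read from the standard Normal curve."""
    z = NormalDist()
    if alternative == "greater":
        return 1 - z.cdf(stat)
    if alternative == "less":
        return z.cdf(stat)
    return 2 * (1 - z.cdf(abs(stat)))  # double the tail for two-sided


# Made-up example: 60 heads in 100 tosses, H0: p = 0.5
se = proportion_se(0.5, 100)       # 0.05
stat = engine_stat(0.60, 0.5, se)  # about 2 SEs above H0
p = normal_p_value(stat)           # roughly 0.046
print(round(stat, 3), round(p, 3))
```

Swapping `proportion_se` for `mean_se_known_sigma` turns the same call into a z-test for a mean, which is the sense in which only EV, SE and the reference curve change.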
- H (Hypotheses): $H_0$ carries the "=" (the difference is attributed to chance); $H_1$ uses $>$, $<$ or $\neq$; decide one- vs two-sided first, from the research question, not afterwards to chase significance. [16] Source: asksia-cheatsheet-data1001.pdf
- A (Assumptions): independence; Normality or a large sample; equal variance (for a pooled two-sample t-test); and so on. "State and justify" each one rather than just listing them. [10] Source: asksia-bible-data1001-bilingual.pdf [16] Source: asksia-cheatsheet-data1001.pdf
- T (Test statistic): plug the correct $EV$ and $SE$ into the engine: $$\text{stat}=\frac{OV-EV}{SE}$$ [10] Source: asksia-bible-data1001-bilingual.pdf [16] Source: asksia-cheatsheet-data1001.pdf
- P (P-value): the probability, if $H_0$ is true, of a result at least this extreme; remember to double it for a two-sided test. [10] Source: asksia-bible-data1001-bilingual.pdf [11] Source: asksia-bible-data1001-bilingual.pdf

Decisions with data: HATPC & the test zoo. The framework, the errors, and every named test with its statistic and curve.

Module 4: the testing framework

| Term | One-line meaning |
| --- | --- |
| HATPC | The exam scaffold: Hypotheses, Assumptions, Test statistic, P-value, Conclusion. |
| Null hypothesis $H_0$ | "The gap is due to chance"; stated with =. |
| Alternative hypothesis $H_1$ | "Not chance"; stated with $>$, $<$ or $\neq$; one- or two-sided. |
| $(OV-EV)/SE$ | The universal test statistic: SEs the data sit from what $H_0$ predicts. |
| P-value | $P(\text{stat as or more extreme} \mid H_0)$; double it for a two-sided $H_1$. |
| Significance level $\alpha$ | The reject threshold (usually 0.05); $p<\alpha$ ⇒ reject $H_0$. |
| Reject vs retain $H_0$ | $p<\alpha$ ⇒ reject; $p\ge\alpha$ ⇒ retain (never "accept"). |
| Type I error | Reject a true $H_0$ (false positive); probability = $\alpha$. |
| Type II error | Retain a false $H_0$ (false negative); probability = $\beta$. |
| Power | $1-\beta$; the chance of detecting a real effect; rises with $n$. |
- C (Conclusion): always write a two-layer conclusion: the statistical layer (compare p with $\alpha$) plus the scientific layer (return to the context and explain it in plain English). For example: "p = 0.013 < 0.05, so we reject $H_0$ at the 5% level; there is evidence the new tutoring program raises mean exam scores." [10] (Source: asksia-bible-data1001-bilingual.pdf) [16] (Source: asksia-cheatsheet-data1001.pdf)
- Forbidden wording in the conclusion: never write "accept $H_0$"; write "retain $H_0$", because absence of evidence is not evidence of absence. [10] (Source: asksia-bible-data1001-bilingual.pdf) [16] (Source: asksia-cheatsheet-data1001.pdf)
- Most common significance level: $\alpha=0.05$ (the materials say explicitly "usually 0.05"). Decision rule: $p<\alpha$ means reject $H_0$; $p\ge\alpha$ means retain $H_0$. [11] (Source: asksia-bible-data1001-bilingual.pdf)
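The HATPC steps above can be traced numerically. What follows is a minimal Python sketch, not something taken from the course materials (the course itself works in R): the coin-toss numbers and every variable name are hypothetical, and `statistics.NormalDist` from the Python standard library stands in for the Normal reference curve.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical data (an assumption for illustration): 60 heads in 100 tosses.
n = 100
observed = 60 / n          # OV: the sample proportion
expected = 0.5             # EV: what H0 (a fair coin) predicts

# T: standardised distance, with the SE computed under H0.
se = sqrt(expected * (1 - expected) / n)
z = (observed - expected) / se

# P: two-sided p-value = double the one-tail area beyond |z|.
one_tail = 1 - NormalDist().cdf(abs(z))
p_value = 2 * one_tail

# C: statistical layer here; the scientific layer is the in-context sentence.
alpha = 0.05
decision = "reject H0" if p_value < alpha else "retain H0"
print(f"z = {z:.2f}, p = {p_value:.4f}, decision: {decision}")
```

With these made-up numbers the data sit about 2 SEs from what $H_0$ predicts and the two-sided p-value is roughly 0.046, so the statistical layer is "reject $H_0$ at the 5% level" and the scientific layer would be "there is evidence the coin is not fair".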
- Chain D: interpreting the p-value / CI (without falling for the classic misreads)
- Your materials say explicitly that the exam repeatedly tests interpreting the p-value / confidence interval correctly while avoiding the classic misreads. [3] (Source: asksia-bible-data1001-bilingual.pdf)
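- Correct wording template for a 95% CI (the emphasis is on interpreting the procedure, not on "this interval has a 95% probability"): "We are 95% confident that the interval captures the true parameter, in the sense that about 95% of intervals built this way from repeated samples would contain it."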
- CI formula (the "Formula Belt" version from your materials):
- $$\text{CI}=\text{estimate}\pm z^*\cdot SE$$
- where $z^*$ is usually $1.96$ for 95% and $2.58$ for 99%. [18][20] (Source: asksia-cheatsheet-data1001.pdf)
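The CI formula above can be checked with a few lines of arithmetic. This is a minimal Python sketch under stated assumptions: the survey counts are hypothetical, the helper names `z_star` and `confidence_interval` are invented for illustration, and the SE uses the sample-proportion formula $\sqrt{p(1-p)/n}$.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical survey (an assumption for illustration): 60 of 100 say "yes".
n = 100
p_hat = 60 / n

def z_star(level):
    # Normal quantile leaving (1 - level) / 2 in each tail.
    return NormalDist().inv_cdf(1 - (1 - level) / 2)

# SE of a sample proportion, estimated from the sample itself.
se = sqrt(p_hat * (1 - p_hat) / n)

def confidence_interval(estimate, se, level):
    # estimate +/- z* x SE
    margin = z_star(level) * se
    return estimate - margin, estimate + margin

ci95 = confidence_interval(p_hat, se, 0.95)
ci99 = confidence_interval(p_hat, se, 0.99)
print(f"z*95 = {z_star(0.95):.2f}, z*99 = {z_star(0.99):.2f}")
print(f"95% CI: ({ci95[0]:.3f}, {ci95[1]:.3f})")
print(f"99% CI: ({ci99[0]:.3f}, {ci99[1]:.3f})")
```

The 99% interval comes out wider than the 95% interval: asking for more confidence from the same data costs precision.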
- 2) Modular revision map (following the course's "four modules, one pipeline")
- Module 1: Explore (design + data types + EDA)
- Name the variable type first: which family a variable belongs to (categorical or quantitative) decides which summaries and plots are "legal" for it. [8] (Source: asksia-bible-data1001-bilingual.pdf) [21] (Source: asksia-cheatsheet-data1001.pdf)
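The "legal summary" idea can be made concrete with a tiny example. This is an illustrative Python sketch only: the two small datasets and their names are hypothetical, chosen to show a quantitative variable next to a categorical variable that happens to be coded with numbers.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical records (assumptions for illustration only).
heights_cm = [158, 163, 170, 171, 176, 182]   # quantitative, continuous
group_code = [1, 2, 2, 3, 1, 2]               # categorical, coded as numbers

# Legal for a quantitative variable: numerical summaries.
print("mean height:", mean(heights_cm), "median height:", median(heights_cm))

# Legal for a categorical variable: counts per category (bar-chart input).
counts = Counter(group_code)
print("group counts:", dict(counts))

# The trap: the codes can be averaged mechanically, but the result
# describes no real group, so it is not a meaningful summary.
meaningless = mean(group_code)
print("mean of group codes (not meaningful):", meaningless)
```

The software does not stop you from averaging category codes, which is why the type has to be named before any summary is chosen.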
- The design trio: design / confounder / bias
- Observational vs experiment: the design decides whether causal language is allowed. An observational study supports only an association; only a randomised controlled experiment licenses "causes". [24] (Source: asksia-cheatsheet-data1001.pdf)
- Confounding: look for the lurking variable Z, one linked to both the exposure and the outcome. [24] (Source: asksia-cheatsheet-data1001.pdf)
- Bias types: selection bias, measurement bias and non-response bias; a bigger sample does not fix any of them (size cures variance, not bias). [24] (Source: asksia-cheatsheet-data1001.pdf)
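The claim that a large sample does not cure bias can be seen in a short simulation. This is a hedged Python sketch, not an example from the materials: the population (the values 1 to 100), the selection rule and the function names are all hypothetical.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical population (an assumption): values 1..100, true mean 50.5.
population = list(range(1, 101))
true_mean = mean(population)

def unbiased_sample(n):
    # Every unit has the same chance of selection.
    return [rng.choice(population) for _ in range(n)]

def biased_sample(n):
    # Selection bias: only units above 50 can ever get into the sample.
    reachable = [v for v in population if v > 50]
    return [rng.choice(reachable) for _ in range(n)]

for n in (100, 10_000):
    fair_error = mean(unbiased_sample(n)) - true_mean
    biased_error = mean(biased_sample(n)) - true_mean
    print(f"n={n}: unbiased error {fair_error:+.2f}, biased error {biased_error:+.2f}")
```

Raising n shrinks the chance error of the unbiased sample, while the biased sample stays about 25 units too high however large it gets.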
- Module 2: Model (the Normal model + the linear model / regression)
- Core move for the Normal model: standardise with a z-score, $z=(x-\mu)/\sigma$, then read the area under the Normal curve (z is the tool). [19][21] (Source: asksia-cheatsheet-data1001.pdf)
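Standardising and reading an area can be sketched in a few lines. This Python example is illustrative only: the height model (mean 170 cm, SD 10 cm) and the value 185 cm are hypothetical, and `NormalDist.cdf` / `NormalDist.inv_cdf` play the roles that `pnorm` / `qnorm` play in R.

```python
from statistics import NormalDist

# Hypothetical Normal model (an assumption): heights with mean 170 cm, SD 10 cm.
mu, sigma = 170, 10
x = 185

# Standardise: how many SDs is x above the mean?
z = (x - mu) / sigma

# Area below x under the Normal curve.
area_below = NormalDist(mu, sigma).cdf(x)

# Going the other way: the value sitting at the 97.5th percentile.
cutoff = NormalDist(mu, sigma).inv_cdf(0.975)

print(f"z = {z:.2f}, area below = {area_below:.4f}, 97.5th percentile = {cutoff:.1f}")
```

Here 185 cm is 1.5 SDs above the mean, so roughly 93% of the modelled heights fall below it.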
- The linear model and correlation / regression (the point is interpretation, not just reporting an r)
- Classic reminders: correlation ≠ causation; "no linear trend" does not mean "no relationship" (the relationship may be curved). [9] (Source: asksia-bible-data1001-bilingual.pdf) [29] (Source: asksia-cheatsheet-data1001.pdf)
- Where regression meets testing: in `summary(lm(y~x))`, read the slope's p-value (`Pr(>|t|)`), and check the assumptions with a residuals-vs-fitted plot and a Normal QQ plot. [9] (Source: asksia-bible-data1001-bilingual.pdf)
- Slope test skeleton: $H_0:\beta_1=0$ (no linear trend) vs $H_1:\beta_1\neq 0$; $t=b_1/SE(b_1)$, read against the $t_{n-2}$ curve.
- Regression assumptions (LINE): Linearity, Independence, Normality of residuals, Equal variance.[5][29]
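A minimal R sketch of the slope-test read-off, under stated assumptions: the data are simulated, and the names `x`, `y`, `fit`, the seed and the true slope of 0.5 are all made up for illustration rather than taken from the course materials. It shows that the slope row's t value is just (OV - EV)/SE with EV = 0, read against a t curve with n - 2 degrees of freedom.

```r
# Simulated data: hypothetical values chosen only to illustrate the read-off.
set.seed(1)                              # reproducible random draws
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30, sd = 1)     # true slope 0.5 plus Normal noise

fit <- lm(y ~ x)                         # fit the least-squares line
slope_row <- coef(summary(fit))["x", ]   # Estimate, Std. Error, t value, Pr(>|t|)

# Slope test: (OV - EV)/SE with EV = 0 under H0: slope = 0.
t_stat <- slope_row["Estimate"] / slope_row["Std. Error"]
p_val  <- 2 * pt(-abs(t_stat), df = fit$df.residual)   # two-sided, n - 2 df

print(slope_row)

# Assumption checks (LINE): uncomment in an interactive session to draw the
# residuals-vs-fitted and Normal QQ plots.
# plot(fit, which = 1:2)
```

Even when the printed p-value is tiny, the conclusion stays "evidence of a linear trend", not "x causes y".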
- Module 3: Sample (probability rules + binomial + box model + CLT)
- The two basic probability rules (the most basic, yet the easiest to mix up): add for "or", multiply for "and".[13][25]
$$P(A\cup B)=P(A)+P(B)-P(A\cap B),\qquad P(A\cap B)=P(A)\,P(B\mid A)$$
The addition rule reduces to $P(A)+P(B)$ only when the events are mutually exclusive; the multiplication rule reduces to $P(A)\,P(B)$ only when they are independent.
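To make the "or" and "and" rules concrete, here is a small R sketch on one roll of a fair die. The events A (even) and B (at least 4) are hypothetical choices for illustration, not an example from the materials.

```r
# One roll of a fair six-sided die; every face is equally likely.
outcomes <- 1:6
A <- outcomes %% 2 == 0          # A: the roll is even        -> {2, 4, 6}
B <- outcomes >= 4               # B: the roll is at least 4  -> {4, 5, 6}

p_A <- mean(A)                   # 3/6
p_B <- mean(B)                   # 3/6
p_A_and_B <- mean(A & B)         # {4, 6} -> 2/6
p_B_given_A <- p_A_and_B / p_A   # conditional: re-base onto the world where A happened

# Addition rule (OR): subtract the overlap so it is not double-counted.
p_A_or_B <- p_A + p_B - p_A_and_B
# Multiplication rule (AND): P(A) * P(B | A).
p_and_check <- p_A * p_B_given_A

c(or = p_A_or_B, direct_or = mean(A | B), and = p_and_check)
```

Here $P(A)\,P(B)=0.25$ differs from $P(A\cap B)=1/3$, so A and B are not independent and the shortcut product would give the wrong answer.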
- "At least one" questions: always go through the complement:
$$P(\ge 1)=1-P(0)=1-(1-p)^n\quad(n\ \text{independent trials})$$[25][30]
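The complement trick can be checked in a few lines of R. This sketch uses the die example from the cited cheat sheet (at least one six in 4 rolls); the simulation seed and the number of repetitions are arbitrary choices of mine.

```r
p <- 1 / 6   # chance of a six on one roll
n <- 4       # number of independent rolls

p_at_least_one <- 1 - (1 - p)^n   # complement trick: 1 - P(no six in 4 rolls)
round(p_at_least_one, 3)          # 0.518, matching the cheat-sheet value

# Optional simulation check (seed and repetition count are arbitrary).
set.seed(2026)
sims <- replicate(20000, any(sample(1:6, n, replace = TRUE) == 6))
mean(sims)                        # should land close to p_at_least_one
```

Adding $1/6$ four times would give about 0.67, which double-counts the rolls that show more than one six.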
- Binomial distribution: it applies when $n$ is fixed, the success probability $p$ is the same on every trial, and the trials are independent; the random variable is the number of successes $X$.[12][25]
- Probability mass function:
$$P(X=x)=\binom{n}{x}p^x(1-p)^{n-x}$$[12][25]
- Mean and standard deviation:
$$E[X]=np,\quad SD=\sqrt{np(1-p)}$$[12][25]
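A short R sketch of the binomial formulas, using the free-throw numbers from the cited cheat sheet ($n=10$, $p=0.7$). The variable names are my own; `dbinom`, `pbinom` and `choose` are base R functions.

```r
n <- 10; p <- 0.7                               # free-throw example from the cheat sheet

mean_made <- n * p                              # np = 7
sd_made   <- sqrt(n * p * (1 - p))              # sqrt(2.1), about 1.45

p_exactly_8 <- dbinom(8, n, p)                  # P(X = 8), about 0.233
p_at_most_8 <- pbinom(8, n, p)                  # P(X <= 8): cumulative, NOT "exactly 8"
p_by_hand   <- choose(n, 8) * p^8 * (1 - p)^2   # the PMF written out; equals dbinom(8, n, p)

p_at_least_one_miss <- 1 - dbinom(n, n, p)      # 1 - 0.7^10, about 0.972
```

The gap between `p_exactly_8` and `p_at_most_8` is exactly the "exactly x" versus "at most x" trap.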
- Box model:
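- First write the process as "drawing tickets from a box". State clearly:
- what numbers are on the tickets
- how many of each (multiplicities)
- how many draws $n$
- with or without replacement
- Core identity (the exam's one-line summary): $$OV = EV + \text{chance error}$$
- Typical $EV$ and $SE$ (from the table in the materials):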
- Sum: $$EV=n\mu_{\text{box}},\quad SE=\sqrt{n}\,SD_{\text{box}}$$[15][30]
- Mean: $$EV=\mu_{\text{box}},\quad SE=\frac{SD_{\text{box}}}{\sqrt{n}}$$[15][30]
- Most common errors: confusing SD (the spread of the data or of the box itself) with SE (the chance variability of a statistic), and swapping the sum's $SE\propto \sqrt{n}$ with the mean's $SE\propto 1/\sqrt{n}$.[30]
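The EV and SE recipe can be sketched in R for the fair-die box used in the cited cheat sheet (100 rolls). Variable names are my own; the one point to watch is that R's `sd()` divides by $n-1$, so the box SD is computed by hand here.

```r
box <- 1:6                               # tickets: a fair die, one of each face
n   <- 100                               # number of draws, with replacement

mu_box <- mean(box)                      # 3.5
sd_box <- sqrt(mean((box - mu_box)^2))   # population SD (divisor n), about 1.71
# Note: R's sd() divides by n - 1, so it is not the box SD used here.

ev_sum  <- n * mu_box                    # 350
se_sum  <- sqrt(n) * sd_box              # about 17.1
ev_mean <- mu_box                        # 3.5
se_mean <- sd_box / sqrt(n)              # about 0.171
```

The sum's SE is $n$ times the mean's SE (17.1 versus 0.171 here), which is a quick check that the two formulas have not been swapped.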
- Central Limit Theorem (CLT):
- Statement (as given in your materials): when $n$ is large, the sampling distribution of the sample sum or mean is approximately Normal:
$$\text{sample sum/mean}\approx N(EV,SE^2)$$[28]
- Trap: the CLT makes the distribution of the statistic Normal; it does not make the raw data Normal.[27][30]
- 表述(按你材料意思):当 $n$ 大时,样本“和/均值”的抽样分布近似正态:
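To see the box-model numbers and the CLT trap concretely, here is a minimal simulation sketch in R. It assumes the fair-die box with $n = 100$ draws; the seed, the 10,000 repetitions and the variable names are illustrative choices, not part of the course materials.

```r
# Sketch: simulate the sum and mean of n = 100 draws from a fair-die box.
set.seed(1)                      # arbitrary seed, only for reproducibility
box  <- 1:6                      # tickets in the box (fair die)
n    <- 100                      # draws per sample, with replacement
reps <- 10000                    # number of simulated samples (assumed)

mu_box <- mean(box)                          # box average, 3.5
sd_box <- sqrt(mean((box - mu_box)^2))       # population SD, about 1.71

# Theory: EV and SE for the sum and the mean of n draws
ev_sum  <- n * mu_box
se_sum  <- sqrt(n) * sd_box
ev_mean <- mu_box
se_mean <- sd_box / sqrt(n)

# Simulation: each replicate is one sample of n draws
sums  <- replicate(reps, sum(sample(box, n, replace = TRUE)))
means <- sums / n

cat("sum : EV", ev_sum, "SE", round(se_sum, 2),
    "| simulated centre", round(mean(sums), 1),
    "spread", round(sd(sums), 2), "\n")
cat("mean: EV", ev_mean, "SE", round(se_mean, 3),
    "| simulated centre", round(mean(means), 3),
    "spread", round(sd(means), 3), "\n")

# One sample keeps the flat shape of the box; only the sums look Normal.
one_sample <- sample(box, n, replace = TRUE)
# hist(one_sample); hist(sums)   # uncomment to compare the two shapes
```

Uncommenting the two `hist()` calls shows the trap directly: the histogram of `one_sample` stays roughly flat like the box, while the histogram of `sums` is bell-shaped.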
Module 4: Decide (inference: CI + the various tests + errors and power)
The unified skeleton for inference questions: HATPC + the engine $(OV-EV)/SE$ + a reference curve (Normal / $t$ / $\chi^2$). [10][14]

[10] Source: asksia-bible-data1001-bilingual.pdf

HATPC, the exam scaffold: write all five letters, every time.

| Step | What goes here | Marks reward |
|---|---|---|
| H: Hypotheses | $H_0$ (=, "due to chance") vs $H_1$ ($>$, $<$ or $\ne$). Decide one- vs two-sided. | Correct symbols & direction |
| A: Assumptions | Independence; Normality / large $n$; equal variance. State and justify each. | Checking, not just listing |
| T: Test statistic | Plug into $(OV-EV)/SE$ with the right EV and SE under $H_0$. | Right EV, right SE, arithmetic |
| P: P-value | P(stat as or more extreme, given $H_0$) from the reference curve. Double for two-sided. | Correct tail, correct doubling |
| C: Conclusion | Statistical (vs $\alpha$) and scientific (in context). | Both layers, in plain English |

Why one engine covers four tests: OV = the statistic you computed (a proportion $\hat{p}$, a mean $\bar{x}$, a difference, a slope $\hat{\beta}_1$). EV = what $H_0$ says that statistic should be. SE = the standard error: the likely size of the chance error in that statistic, from the box model / sampling distribution. The only things that change across the test zoo are the formula for SE and which reference curve you read the p-value from. Learn the skeleton once and the rest is bookkeeping.

The two-layer conclusion graders look for: a bare "reject $H_0$" rarely gets full marks. Write both: (1) statistical: "p = 0.013 < 0.05, so we reject $H_0$ at the 5% level"; then (2) scientific: "there is evidence the new tutoring program raises mean exam scores." And never say "accept $H_0$"; you retain it (absence of evidence $\ne$ evidence of absence).

[14] Source: asksia-bible-data1001-bilingual.pdf

The 95% CI is about (0.552, 0.648), roughly 55% to 65%.
Step 4, state it correctly: "We are 95% confident the true proportion of students who commute by public transport is between 55% and 65%", meaning the procedure captures the true $p$ about 95% of the time, not that this particular interval has a 95% chance of containing it. To halve the ±0.048 margin you would need $n = 1{,}600$ (four times the data): the $\sqrt{n}$ rule again.
R idea: formula CI: `prop.test(240, 400)`. Bootstrap: `boot <- replicate(10000, mean(sample(x, length(x), replace=TRUE))); quantile(boot, c(.025, .975))`, the percentile interval, no Normal assumption needed.

Chapter: Hypothesis testing (Module 4, Topics 9-11). One engine, the whole test zoo: HATPC; the universal $(OV-EV)/SE$ statistic; proportion, t, chi-square & slope tests.
This is the capstone of DATA1001 and the densest block of exam marks. The good news: almost every inference question is the same calculation. You measure how far the data sit from what chance predicts, in units of the likely chance error, $(OV-EV)/SE$, then read a probability off a reference curve. The four named tests differ only in what EV is, what SE is, and which curve (Normal vs $t_{df}$ vs $\chi^2$) you compare to. Wrap every test in the course's scaffold, HATPC, and graders reward you line by line.
Chapter panel: HATPC steps; engine $(OV-EV)/SE$; usual $\alpha = 0.05$; what $p$ measures: the chance of a statistic at least as extreme as $|\text{stat}|$.
$$\text{test statistic}=\frac{OV-EV}{SE}=\frac{\text{observed}-\text{what }H_0\text{ predicts}}{SE}$$
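The CI arithmetic and the engine can be checked with a few lines of R. This sketch uses the 240-out-of-400 counts that appear in the quoted `prop.test(240, 400)` call; the null value `p0 = 0.55` is a made-up number used only to illustrate the test statistic.

```r
# Sketch: Wald 95% CI and an (OV - EV)/SE test for one proportion.
x <- 240
n <- 400                         # counts from the quoted prop.test(240, 400)
p_hat <- x / n                   # observed proportion (OV)

# 95% CI: estimate +/- z* x SE, with SE estimated from p_hat
se_ci  <- sqrt(p_hat * (1 - p_hat) / n)
margin <- qnorm(0.975) * se_ci   # z* is about 1.96
ci     <- c(p_hat - margin, p_hat + margin)
print(round(ci, 3))

# Engine: under H0 the SE is computed from p0, not from p_hat
p0    <- 0.55                    # hypothetical null value (assumption)
se_h0 <- sqrt(p0 * (1 - p0) / n)
z     <- (p_hat - p0) / se_h0    # (OV - EV) / SE
p_two <- 2 * pnorm(abs(z), lower.tail = FALSE)   # doubled tail: two-sided H1
cat("z =", round(z, 2), " two-sided p =", round(p_two, 3), "\n")
```

Note that `prop.test(240, 400)` reports a score (Wilson) interval with a continuity correction by default, so its limits differ slightly from the hand formula used here.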
- t-tests for means:
- One-sample: $$t=\frac{\bar{x}-\mu_0}{s/\sqrt{n}},\quad df=n-1$$
- Paired: reduce the two measurements on each unit to a difference $d$:
$$t=\frac{\bar{d}-0}{s_d/\sqrt{n}},\quad df=n-1$$
- Two-sample (the pooled version assumes equal variances): $t=(\bar{x}_1-\bar{x}_2)/SE$, and the df is usually $n_1+n_2-2$. [29]

[29] Source: asksia-cheatsheet-data1001.pdf

19. t-tests for Means (Topic 10)

| Test | Statistic | df |
|---|---|---|
| One-sample | $(\bar{x}-\mu_0)/(s/\sqrt{n})$ | $n-1$ |
| Paired | $(\bar{d}-0)/(s_d/\sqrt{n})$ | $n-1$ |
| Two-sample | $(\bar{x}_1-\bar{x}_2)/SE$ | $n_1+n_2-2$ |

Paired = reduce two measurements per unit to the differences $d$ (e.g. caffeine vs none per cyclist). Trap: an unpaired test on paired data throws away the pairing and loses power. Assumptions: independence (from design), Normality/large $n$ (histogram, QQ-plot, `shapiro.test`), and equal variance for the two-sample pooled test.

20. Tests for Relationship (Topic 11)
Chi-square (both kinds): $\chi^2 = \sum (\text{Observed} - \text{Expected})^2 / \text{Expected}$. Large $\chi^2$ means far from expected, which is evidence against $H_0$. Always right-tailed. Traps: use raw counts, not %; needs expected cells ≥ 5; flags only whether a relationship exists, not its direction.
Slope test: $H_0: \beta_1 = 0$ (no linear trend) vs $H_1: \beta_1 \ne 0$. $t = \hat{\beta}_1/SE(\hat{\beta}_1)$ vs $t_{n-2}$; read it straight off the slope row's `Pr(>|t|)` in the `summary()` of the `lm` fit. Assumptions LINE: Linear, Independent, Normal, Equal-variance residuals. A significant slope still is not causation, and "no linear trend" does not rule out a nonlinear one.
- High-frequency trap (flagged in your materials): pairing is power. Don't treat paired data as unpaired and run a two-sample test on it. [6][29]

[6] Source: asksia-bible-data1001-bilingual.pdf

1. H: $H_0: \mu_d = 0$ vs $H_1: \mu_d > 0$ (one-sided: we expect a benefit).
2. A: the 8 riders are independent; the differences are roughly Normal (small $n$, so check a QQ-plot / Shapiro).
3. T: $\bar{d} = 28/8 = 3.5$; $s_d \approx 2.51$ ($n-1$ divisor); $SE = 2.51/\sqrt{8} = 0.887$. $t = (3.5 - 0)/0.887 \approx 3.95$ on $df = 7$.
4. P: one-sided $p = P(t_7 \ge 3.95) \approx 0.0028$ (R: `pt(3.95, 7, lower.tail=FALSE)`).
5. C: $0.0028 < 0.05$, so reject $H_0$: there is evidence that caffeine increases mean time-to-exhaustion for these riders.

Pairing is power, don't throw it away: because each rider is their own control, the paired test removes between-rider variation. Running an unpaired two-sample test on paired data wastes that and loses power, a frequently tested mistake. Rule: same units measured twice → paired; two separate groups → two-sample.
R idea: `t.test(d, mu = 0, alternative = "greater")` on the differences, or directly `t.test(caf, plac, paired = TRUE, alternative = "greater")`. Two independent groups: `t.test(g1, g2, var.equal = TRUE)` for the pooled test.

Cover of the source: AskSia Library Study Bible, Faculty of Science / Mathematics & Statistics, Semester 1, 2026. The Complete Study Bible: Foundations of Data Science, DATA1001, University of Sydney, bilingual edition (English leads, Chinese follows; exam key points and terms keep the original English). "Read a study, pick the method, run the logic, interpret it in context: the whole course on one engine."
The final exam is 60% of your mark in a single 2-hour sitting, and the universal backstop for almost every other component. It is conceptual and interpretive, not a coding exam: it tests whether you can read a study, choose the right method and read the answer in plain English. Nearly every inference question runs the same engine, $(OV-EV)/SE$, scaffolded by HATPC. This book drills exactly that.
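The paired test's arithmetic can be reproduced from summary numbers alone. The sketch below plugs in the values from the worked caffeine example ($\bar{d} = 3.5$, $s_d \approx 2.51$, $n = 8$); the second half uses invented data (`baseline`, `before`, `after` are hypothetical names and values) purely to illustrate why pairing gains power.

```r
# Sketch: paired t statistic from summary numbers (worked-example values).
d_bar <- 3.5
s_d   <- 2.51
n     <- 8
se     <- s_d / sqrt(n)                      # SE of the mean difference
t_stat <- (d_bar - 0) / se                   # (OV - EV) / SE, EV = 0 under H0
p_one  <- pt(t_stat, df = n - 1, lower.tail = FALSE)  # one-sided, H1: mu_d > 0
cat("t =", round(t_stat, 2), " df =", n - 1, " p =", round(p_one, 4), "\n")

# Invented data: the same 8 units measured twice, with large
# unit-to-unit variation and a small within-unit noise.
set.seed(42)                                 # arbitrary seed
baseline <- rnorm(8, mean = 50, sd = 10)     # hypothetical per-unit level
before   <- baseline + rnorm(8, sd = 1)
after    <- baseline + 3 + rnorm(8, sd = 1)  # hypothetical benefit of 3 units

p_paired   <- t.test(after, before, paired = TRUE,
                     alternative = "greater")$p.value
p_unpaired <- t.test(after, before, var.equal = TRUE,
                     alternative = "greater")$p.value
cat("paired p =", signif(p_paired, 3),
    " unpaired p =", signif(p_unpaired, 3), "\n")
```

In the invented data the unpaired test has to fight the unit-to-unit spread of `baseline`, while the paired test cancels it by working on the differences, which is the "pairing is power" point in code form.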
- Chi-square test:
- Statistic: $$\chi^2=\sum \frac{(O-E)^2}{E}$$
- Features: always right-tailed; use raw counts, not percentages; the expected counts need to be large enough (your materials say expected cells ≥ 5). [29]
- It only answers "is there a relationship?"; it does not directly tell you the direction. [29]
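The chi-square statistic is simple enough to compute by hand and compare with R. The 2×2 table below is invented for illustration (the `group`/`outcome` labels and the counts are assumptions); `correct = FALSE` switches off the continuity correction that `chisq.test` applies to 2×2 tables by default, so the two answers agree.

```r
# Sketch: chi-square test of independence by hand, on raw counts (made-up table).
tab <- matrix(c(30, 20,
                20, 30), nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"), outcome = c("yes", "no")))

expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)  # E = row x col / total
chi2     <- sum((tab - expected)^2 / expected)            # sum of (O - E)^2 / E
df       <- (nrow(tab) - 1) * (ncol(tab) - 1)             # (r - 1)(c - 1)
p_val    <- pchisq(chi2, df, lower.tail = FALSE)          # always the right tail

stopifnot(all(expected >= 5))    # the "expected cells >= 5" check
cat("chi2 =", chi2, " df =", df, " p =", round(p_val, 4), "\n")

# Cross-check against the built-in test (no continuity correction)
builtin <- chisq.test(tab, correct = FALSE)
print(unname(builtin$statistic))
```

A small p-value here only says the two variables are related; to describe the direction you would still compare the row proportions yourself.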
- Errors and power:
- Type I error: wrongly rejecting a true $H_0$ (false positive); its probability is $\alpha$.
- Type II error: wrongly retaining a false $H_0$ (false negative); its probability is $\beta$.
- Power: $1-\beta$; increasing the sample size $n$ usually raises power. [11]

[11] Source: asksia-bible-data1001-bilingual.pdf

Decisions with data: HATPC & the test zoo. The framework, the errors, and every named test with its statistic & curve. Module 4, the testing framework:

| Term (EN) | One-line meaning |
|---|---|
| HATPC | The exam scaffold: Hypotheses, Assumptions, Test statistic, P-value, Conclusion. |
| Null hypothesis $H_0$ | "The gap is due to chance"; stated with =. |
| Alternative hypothesis $H_1$ | "Not chance"; stated with $>$, $<$ or $\ne$; one- or two-sided. |
| $(OV-EV)/SE$ | The universal test statistic: how many SEs the data sit from what $H_0$ predicts. |
| P-value | P(stat as or more extreme, given $H_0$); double it for a two-sided $H_1$. |
| Significance level $\alpha$ | The reject threshold (usually 0.05); $p < \alpha$ means reject $H_0$. |
| Reject vs retain $H_0$ | $p < \alpha$: reject; $p \ge \alpha$: retain (never "accept"). |
| Type I error | Reject a true $H_0$ (false positive); probability = $\alpha$. |
| Type II error | Retain a false $H_0$ (false negative); probability = $\beta$. |
| Power | $1 - \beta$; the chance of detecting a real effect; rises with $n$. |
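How $\alpha$, $\beta$ and power move with sample size can be explored with `power.t.test` from base R's `stats` package. The effect size (`delta = 0.5`), the SD and the sample sizes below are hypothetical values chosen only to show the pattern.

```r
# Sketch: power of a two-sample t-test rises with n (all inputs are made up).
alpha <- 0.05                         # Type I error rate we are willing to accept
sizes <- c(10, 20, 40, 80)            # per-group sample sizes to compare

power <- sapply(sizes, function(n)
  power.t.test(n = n, delta = 0.5, sd = 1, sig.level = alpha)$power)

beta <- 1 - power                     # Type II error rate at each n
print(data.frame(n = sizes, power = round(power, 3), beta = round(beta, 3)))
```

Holding $\alpha$ fixed, the printed `beta` column shrinks as `n` grows: more data means a real effect of this assumed size is missed less often.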
3) "Must-memorise definitions + one-sentence scoring templates" checklist (ordered by what your materials stress most)
- The $(OV-EV)/SE$ engine: how many standard errors the data sit from what $H_0$ predicts; a large $|\text{stat}|$ means stronger evidence against $H_0$. [16]

[16] Source: asksia-cheatsheet-data1001.pdf

17. HATPC (Topic 9): the scaffold
The exam scaffold for every test:
· P-value: see col 5.
· Conclusion: two layers, statistical (vs $\alpha$) + scientific (in context).

$H_0$ always carries the = ("the difference is due to chance"); $H_1$ carries $>$, $<$ or $\ne$. Decide one- vs two-sided before seeing the data, from the research question, not to chase significance.
Assumptions checklist: independence (from the design), Normality or large $n$ (histogram / QQ-plot / Shapiro-Wilk), equal variance for a pooled two-sample test (compare spreads). If they fail: transform the data or use a different test. State each; graders award the "A".
Write the conclusion in both registers: "p = 0.03 < 0.05, so we reject $H_0$ (statistical); there is evidence the new method raises the mean score (scientific, in context)." Dropping the context half loses marks. Never write "accept $H_0$", only "retain", since absence of evidence is not evidence of absence.
Compiled by AskSia, mapped to the DATA1001 syllabus: asksia.ai/cheatsheet/usyd-data1001

18. The Engine: one stat, four tests
Master test statistic: $\text{stat} = (OV - EV)/SE = (\text{observed} - \text{what } H_0 \text{ predicts})/SE$. It counts how many SEs the data sit from $H_0$. Large $|\text{stat}|$ means evidence against $H_0$; near 0 means consistent with chance. Proportion, z, t and slope tests are all this one calculation; only EV, SE and the reference curve change. Once you can read EV and SE off the box model, every test is the same three keystrokes.

| Test | SE | Curve |
|---|---|---|
| Proportion | $\sqrt{p_0(1-p_0)/n}$ | $N(0,1)$ |
| z ($\sigma$ known) | $\sigma/\sqrt{n}$ | |
- SD vs SE (the most commonly tested confusion): SD describes the spread of the data/box itself; SE describes how much a statistic fluctuates from sample to sample. [30]
- p-value: the probability, assuming $H_0$ is true, of observing a statistic "at least this extreme"; double the tail probability for a two-sided test. [10][11]
- Correct interpretation of a 95% CI: it describes the procedure's coverage (about 95% of intervals built this way capture the true value), not "this particular interval has a 95% chance of containing it". [14]
- Correlation ≠ causation: even a significant slope may still be purely observational; "no linear relationship" does not rule out a nonlinear one. [9][29]

[9] Source: asksia-bible-data1001-bilingual.pdf

Correlation ≠ causation; a significant slope is still observational; and "no linear trend" does not mean "no relationship": there may be a curve.
R idea: `summary(lm(y ~ x))`, then read the slope's `Pr(>|t|)`; check assumptions with `plot(lm.obj)` (residuals-vs-fitted + Normal QQ).

Glossary chapter: bilingual glossary, every examinable term (English term, Chinese, one-line meaning), grouped by the four modules.
A fast reference for the vocabulary DATA1001 actually examines, ordered by the four-module pipeline: Exploring → Modelling → Sampling → Decisions. The Chinese column is filled in the bilingual build; for now cover the right-hand meaning and recite from the term, then flip and recall the term from the meaning.

Module 1, Exploring data: study design

| Term (EN) | One-line meaning |
|---|---|
| Observational study | Investigator only watches; cannot establish causation (confounding uncontrolled). |
| Controlled experiment | Investigator assigns the treatment; randomisation licenses causal claims. |
| Randomised controlled trial | Random assignment balances confounders; supports "X causes Y." |
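Reading the slope test off R output follows the same $(OV-EV)/SE$ engine. The data in this sketch are simulated under an assumed linear model (`y = 2 + 0.5x + noise`), so the variable names and numbers are illustrative, not course data.

```r
# Sketch: slope test from lm() output, checked against (OV - EV)/SE by hand.
set.seed(7)                                  # arbitrary seed
n <- 30
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)          # assumed true slope of 0.5

fit   <- lm(y ~ x)
coefs <- summary(fit)$coefficients           # Estimate, Std. Error, t value, Pr(>|t|)

b1    <- coefs["x", "Estimate"]              # OV: the fitted slope
se_b1 <- coefs["x", "Std. Error"]            # SE of the slope
t_by_hand <- (b1 - 0) / se_b1                # EV = 0 under H0: beta1 = 0
p_by_hand <- 2 * pt(abs(t_by_hand), df = n - 2, lower.tail = FALSE)

print(coefs["x", ])
cat("by hand: t =", round(t_by_hand, 3), " p =", signif(p_by_hand, 3), "\n")

# plot(fit, which = 1:2)   # residuals-vs-fitted and Normal QQ, to check LINE
```

The hand calculation matches the `t value` and `Pr(>|t|)` columns of the slope row. Even with a tiny p-value, the conclusion is only "evidence of a linear trend"; whether it is causal depends on the study design, not on the regression.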
4) Final-sprint revision plan (follow it as written; it is the route that costs you the least)
- Round 1 (drill the skeleton until it is automatic)
- Do 2-3 questions a day and force yourself to write: one sentence on the study design + the EDA trio (shape/centre/spread) + the five HATPC steps + one in-context conclusion sentence. [3][10]

[3] Source: asksia-bible-data1001-bilingual.pdf

The exam strategy this implies: because the exam backstops quizzes, Project 1 and missed pieces, but nothing backstops the exam, the dominant move is to over-invest in exam-style reasoning. Treat the projects as exam practice with a longer deadline: the EDA, the method choice, HATPC and interpret-for-a-client are exactly what the exam rewards. Drill the engine; write the in-context sentence every time.

What the exam is really testing: four recurring chains carry most marks:
- read the study design → say what conclusion is legal;
- summarise & plot → describe shape/centre/spread;
- state HATPC → compute $(OV-EV)/SE$ → p-value;
- interpret the p-value / CI without the classic misreads.

Every chapter in this book is built to make those chains automatic.

Contents: four modules, one pipeline (Exploring → Modelling → Sampling → Deciding), and one engine under it all.

| Module | Ch | Topic | Core ideas |
|---|---|---|---|
| 1. Exploring data (Weeks 1-3) | 1 | Design & data types | categorical vs quantitative; observational vs experiment; confounding; bias; sampling |
| 1. Exploring data (Weeks 1-3) | 2 | Exploratory data analysis | mean/median; SD/IQR; resistance; histogram & skew; boxplot & 1.5×IQR |
| 2. Modelling data (Weeks 4-5) | 3 | The Normal model | z-scores; 68-95-99.7; measurement error; `pnorm`/`qnorm` |
| 2. Modelling data (Weeks 4-5) | 4 | The linear model | correlation; regression line; SD line; regression to the mean; $r^2$ |
| 3. Sampling data (Weeks 6-9) | 5 | Chance & the box model | probability rules; binomial; EV & SE; the CLT |
| 3. Sampling data (Weeks 6-9) | 6 | Surveys & confidence intervals | parameter vs statistic; bias; 0-1 box; CI; bootstrap |
| 4. Decisions with data (Weeks 10-12) | 7 | Testing: HATPC & the engine | $(OV-EV)/SE$; proportion / z / t / slope; p-value & CI literacy |
- Round 2 (work through the trap checklist)
- Claiming causation from an observational study; believing a large sample fixes bias; thinking the CLT makes the data Normal; treating mutually exclusive events as independent; running an unpaired test on paired data; writing only "reject $H_0$" with no in-context interpretation. [16]

[16] Source: asksia-cheatsheet-data1001.pdf

17 · HATPC (Topic 9: the scaffold)

The exam scaffold for every test: P-value: see col 5. Conclusion: two layers, statistical (vs $\alpha$) + scientific (in context).

$H_0$ always carries the = ("the difference is due to chance"); $H_1$ carries >, < or ≠. Decide one- vs two-sided before seeing the data, from the research question, not to chase significance.

Assumptions checklist: independence (from the design), Normality or large n (histogram / QQ-plot / Shapiro-Wilk), equal variance for a pooled two-sample test (compare spreads). If they fail: transform the data or use a different test. State each one; graders award the "A".

Write the conclusion in both registers: "p = 0.03 < 0.05, so we reject $H_0$ (statistical); there is evidence the new method raises the mean score (scientific, in context)." Dropping the context half loses marks. Never write "accept $H_0$", only "retain", since absence of evidence is not evidence of absence.

Compiled by AskSia · mapped to the DATA1001 syllabus · asksia.ai/cheatsheet/usyd-data1001

18 · The engine: one stat, four tests

The master test statistic: $\text{stat} = (OV - EV)/SE$ = (observed − what $H_0$ predicts) / SE. It measures how many SEs the data sit from $H_0$. Large $|\text{stat}|$ ⇒ evidence against $H_0$; near 0 ⇒ consistent with chance. Proportion, z, t and slope tests are all this one calculation; only EV, SE and the reference curve change. Once you can read EV and SE off the box model, every test is the same three keystrokes.

Test · SE / EV · curve: Proportion: $SE = \sqrt{p_0(1-p_0)/n}$, curve $N(0,1)$; z ($\sigma$ known): $SE = \sigma/\sqrt{n}$

[18] Source: asksia-cheatsheet-data1001.pdf

R Reference · Side 2 (Sample + Decide)

Chance: `sample(x, n, replace)` · `set.seed()` · `dbinom(x, n, p)` · `pbinom(x, n, p)` · `replicate()` (sampling distribution)

Tests: `t.test(x, mu = mu0)` · `t.test(x, y, paired = TRUE)` · `chisq.test(table(x, y))` · `prop.test()` · `summary(lm(y ~ x))`

Exam traps (do not lose these):
- Causation needs randomisation; sample size cures variance, not bias. The CLT makes the statistic Normal, not the data.
- $r$ is linear-only; regression to the mean is no cause.
- Always run HATPC; split the conclusion into statistical + in-context.

Formula belt (Side 2):
- $OV = EV + \text{chance error}$ · $\text{stat} = (OV-EV)/SE$
- sum: $EV = n \times \text{avg}$, $SE = \sqrt{n} \times SD$ · mean: $SE = SD/\sqrt{n}$
- binomial: $EV = np$, $SE = \sqrt{np(1-p)}$ · $SE(\hat{p}) = \sqrt{p(1-p)/n}$
- $CI = \text{est} \pm z^* \times SE$ · $z^* = 1.96$ (95%) / $2.58$ (99%)
- $\chi^2 = \sum (O-E)^2/E$, $df = (r-1)(c-1)$ · slope: $t = \hat{\beta}_1/SE$ on $t_{n-2}$
- $P(\ge 1) = 1-(1-p)^n$ · power $= 1-\beta$

Revision aid · check the official unit outline for assessment · © 2026. Side 2/2 covers: Probability · Binomial · The box model · EV + chance error · The CLT · Confidence intervals · HATPC.

[27] Source: asksia-cheatsheet-data1001.pdf

Two faces of the law of large numbers: the percentage of heads gets reliably closer to 50% as n grows (relative error ↓), yet the likely gap between heads and tails widens (absolute error ↑). Both are true at once; that is the whole subtlety the topic tests. EV is the centre, SE the typical wobble around it. "Observed = EV give or take an SE" is the one-line summary of every chance-process question on the exam, and combined with the CLT it gives the chance of any range of outcomes.

12c · Worked: box model, sum & mean

Box = {1, 2, 3, 4, 5, 6} (a fair die). $\mu_{box} = 3.5$, $SD_{box} = 1.71$. Roll n = 100 times.
- Sum of 100 rolls: $EV = 100 \times 3.5 = 350$, $SE = \sqrt{100} \times 1.71 = 10 \times 1.71 = 17.1$
- Mean of 100 rolls: $EV = 3.5$, $SE = 1.71/\sqrt{100} = 0.171$

By the CLT the sum is approximately $N(350, 17.1^2)$: a total at or above 384 (≈ +2 SE) would be surprising (about 2.5% chance). Note the sum's SE (17.1) is 100× the mean's SE (0.171).

0-1 box for counts: to count an event, fill the box with 1s (event) and 0s (not). The sum then counts occurrences and the mean is the proportion. This is the bridge from the box model to proportions (col 3), where $\mu_{box} = p$ and $SD_{box} = \sqrt{p(1-p)}$.

Setting up the box is where the marks are: name the tickets, their multiplicities, n, and with/without replacement; then EV and SE follow mechanically from the table above. Most box-model errors are set-up errors, not arithmetic, so write the box out explicitly.

13 · The CLT (Topic 7, LO6)

Drawing at random, for large n the probability histogram of the sample sum or mean follows the Normal curve, regardless of the box's shape. Central Limit Theorem: sample sum / mean $\approx N(EV, SE^2)$.

This is why z/t-tests and CIs work: the sampling distribution is Normal even when the data are not. The bigger n, the tighter that Normal curve (SE ↓) and the better the approximation.

Trap: thinking the CLT makes the data Normal. It is about the distribution of the statistic, not the raw data. A skewed box needs a larger n before the Normal approximation is good.

R idea: `replicate(N, mean(sample(box, n, TRUE)))` builds a sampling distribution; pass the result to `hist()` to see it.

Three distributions, don't confuse them: the population/box (any shape), one sample (looks like the box), and the sampling distribution of the statistic (Normal by the CLT, centred at EV, spread SE). Inference lives in the third. The histogram of one sample does not become Normal as n grows; only the statistic's distribution does.

14 · Proportions, the 0-1 box (Topic 8)

Code "success" = 1, "failure" = 0. Then $\mu_{box} = p$ and $SD_{box} = \sqrt{p(1-p)}$, so the sample proportion $\hat{p}$ has:

SE of a proportion: $SE(\hat{p}) = \sqrt{p(1-p)/n}$

Finite-population correction (sampling without replacement): $SE_{\text{w/o}} = \sqrt{(N-n)/(N-1)} \times SE_{\text{with}}$. The correction factor ≈ 1 when $n \ll N$ (usually ignorable).
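As a quick check on the 0-1 box formulas, here is a hedged worked example that runs the SE of a proportion, a 95% confidence interval and the finite-population correction on one set of numbers. The survey figures (n = 400, 55% "yes", population N = 10,000) are invented for illustration and are not from the source.

```latex
% Hypothetical survey (assumption, not from the source):
% n = 400 people sampled, 220 say "yes", population size N = 10{,}000.
\begin{align*}
\hat{p} &= \frac{220}{400} = 0.55 \\
SE(\hat{p}) &\approx \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
             = \sqrt{\frac{0.55 \times 0.45}{400}} \approx 0.0249 \\
95\%\ \text{CI} &= \hat{p} \pm 1.96 \times SE
             = 0.55 \pm 0.049 \approx (0.501,\ 0.599) \\
\text{correction factor} &= \sqrt{\frac{N-n}{N-1}}
             = \sqrt{\frac{9600}{9999}} \approx 0.980 \\
SE_{\text{w/o}} &\approx 0.980 \times 0.0249 \approx 0.0244
\end{align*}
```

With n only 4% of N the correction barely moves the SE (0.0249 to 0.0244), which is why it is usually ignored. For a confidence interval the true p is unknown, so $\hat{p}$ stands in for p inside the SE; in a hypothesis test you would use $p_0$ from $H_0$ instead.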
- Round 3 (drill the "engine questions" until you can do them with your eyes closed)
- The moment you see a question, write down: which statistic is $OV$? What is $EV$ under $H_0$? Which formula gives $SE$? Is the reference curve Normal, $t$, or $\chi^2$? Then apply HATPC. [10][16]
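Here is one way that four-question routine could look on a one-sample t-test. The data (25 students, sample mean 72, sample SD 10, null mean 68) are made up for illustration and are not from the source; the point is the order of the steps and the "retain" wording, not the numbers.

```latex
% Hypothetical data (assumption, not from the source):
% n = 25 students, \bar{x} = 72, s = 10; H_0: \mu = 68 vs H_1: \mu \neq 68.
% OV = sample mean; EV = 68 under H_0; SE = s/\sqrt{n}; curve = t with n-1 df.
\begin{align*}
\text{OV} &= \bar{x} = 72, \qquad \text{EV} = \mu_0 = 68 \\
\text{SE} &= \frac{s}{\sqrt{n}} = \frac{10}{\sqrt{25}} = 2 \\
t &= \frac{\text{OV}-\text{EV}}{\text{SE}} = \frac{72-68}{2} = 2.0, \qquad df = n-1 = 24 \\
|t| &= 2.0 < 2.064 = t^{*}_{24}\ (\text{two-sided, } 5\%)
      \;\Rightarrow\; p \approx 0.057 > 0.05
\end{align*}
```

Conclusion, both layers: (statistical) p ≈ 0.057 > 0.05, so we retain $H_0$ at the 5% level; (scientific) the data do not give convincing evidence that the mean score differs from 68. The same t = 2.0 read against the Normal curve would give p ≈ 0.046, so choosing the right reference curve can change the decision.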
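5) Give me 3 pieces of information and I can turn them into "your personal revision checklist + a daily practice plan"
- What date is your Final (the exact date)?
- Which three areas worry you most (pick 3):
- A Design / confounding / bias (causal judgement)
- B EDA (boxplots, IQR, outliers, skew, mean vs median)
- C Normal + z-scores
- D Regression / correlation + slope test (LINE assumptions)
- E Probability / binomial / "at least one"
- F Box model + CLT + SE (SD vs SE)
- G HATPC + t-test / chi-square / CI interpretation
- How would you like me to guide you:
- "I give you a template, you write the answer yourself, and I mark it", or "I give you a full model answer first and then you write your own version"?
- (Optional) If you like, I can also remember your "3 weakest areas + common traps", so that each time you come back to revise I take you through those first.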