BUSS6002: Data Science in Business: BUSS6002 W6: Linear Algebra and
Mar 26, 2026
BUSS6002 W6: Linear Algebra and Linear Regression
This document provides a summary of key concepts in Linear Algebra and Linear Regression, intended for students of Ella Luo.
I. Linear Algebra
A. Matrices
- Definition: A matrix is a rectangular array of numbers.
- a_ij represents the element in the i-th row and j-th column.
- The dimension of a matrix A is denoted n x m (n rows by m columns).
- Matrices are typically represented by bold upper-case letters (e.g., A).
- Matrix Equality: Two matrices A and B are equal (A = B) if and only if they have the same dimensions and all corresponding elements are equal (a_ij = b_ij for all i and j).
B. Matrix Operations
- Matrix Addition:
  - Matrices must have the same dimensions to be added.
  - Commutative Property: A + B = B + A
  - Associative Property: (A + B) + C = A + (B + C)
- Matrix Scalar Multiplication: Multiplying each element of a matrix by a scalar.
  - Example: 0.5 * [[1, 2], [3, 4]] = [[0.5, 1], [1.5, 2]]
- Matrix Multiplication:
  - Given an n x m matrix A and an m x r matrix B, their product AB is an n x r matrix C. The number of columns of A must equal the number of rows of B.
  - The element c_ij is the dot product of the i-th row of A and the j-th column of B.
  - Example: [[4, 9, 6], [1, 5, 8]] * [[1, 2], [8, 3], [5, 7]] = [[(4*1 + 9*8 + 6*5), (4*2 + 9*3 + 6*7)], [(1*1 + 5*8 + 8*5), (1*2 + 5*3 + 8*7)]] = [[106, 77], [81, 73]]
  - Properties:
    - Not Commutative: AB ≠ BA (in general)
    - Associative Property: (AB)C = A(BC)
    - Distributive Property: A(B + C) = AB + AC
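The matrix product of the two example matrices can be checked numerically. The following is a minimal sketch assuming NumPy is available; the extra matrices `E` and `F` are hypothetical and only serve to illustrate the associative and distributive properties.

```python
import numpy as np

# Matrices from the worked example: A is 2 x 3, B is 3 x 2.
A = np.array([[4, 9, 6],
              [1, 5, 8]])
B = np.array([[1, 2],
              [8, 3],
              [5, 7]])

C = A @ B   # 2 x 2 product: c_ij = (row i of A) dot (column j of B)
D = B @ A   # 3 x 3 product, so AB and BA do not even share a shape

# One entry by hand: row 0 of A dotted with column 0 of B.
c_00 = 4 * 1 + 9 * 8 + 6 * 5

# Hypothetical 2 x 2 matrices, used only to illustrate the properties.
E = np.array([[2, 0], [1, 3]])
F = np.array([[1, 1], [0, 2]])
associative = np.array_equal((A @ B) @ E, A @ (B @ E))
distributive = np.array_equal(C @ (E + F), C @ E + C @ F)

print(C)
print(D.shape, associative, distributive)
```

Because AB is 2 x 2 while BA is 3 x 3, this pair also shows why matrix multiplication is not commutative in general.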
C. Special Matrices
- Square Matrix: A matrix where the number of rows equals the number of columns (n = m).
- Diagonal Matrix: A square matrix where all off-diagonal elements are zero.
- Identity Matrix (I): A diagonal matrix with ones on the main diagonal.
  - Useful Property: AI = IA = A
D. Inverse of a Matrix
- An n x n square matrix A is invertible if there exists an n x n matrix B such that AB = BA = I.
- The matrix B is called the inverse of A and is denoted A⁻¹.
- A square matrix is singular (not invertible) if and only if its determinant is zero.
- For a 2x2 matrix [[a, b], [c, d]] with ad - bc ≠ 0, the inverse is (1 / (ad - bc)) * [[d, -b], [-c, a]].
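The 2x2 inverse formula can be sketched in a few lines. This is an illustration only, assuming NumPy; the helper name `inverse_2x2` and the example matrix are hypothetical.

```python
import numpy as np

def inverse_2x2(M):
    """Invert a 2x2 matrix with the closed-form formula; raise if it is singular."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular (determinant is zero)")
    return np.array([[d, -b], [-c, a]]) / det

# Hypothetical example with determinant 4*6 - 7*2 = 10.
A = np.array([[4.0, 7.0],
              [2.0, 6.0]])
A_inv = inverse_2x2(A)

print(A_inv)
print(A @ A_inv)   # should be (numerically) the identity matrix
```

For larger matrices, `np.linalg.inv` or, better, `np.linalg.solve` is the usual choice; the closed form above only applies to the 2x2 case.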
II. Linear Regression
A. Introduction to Linear Regression
- Purpose: To describe the relationship between a continuous response variable (Y) and a set of predictor variables (X): Y = f(X) + ε
  - Y: Response, dependent variable, target.
  - X: Predictors, features, independent variables, covariates.
  - ε: Error term.
- Supervised Learning: Learns from observed feature-target pairs, i.e. every observation comes with a labelled target.
- Why Use Linear Regression:
- Useful for predictions.
- Easy to interpret (not a "black-box").
- A good starting point for more complex models.
- Visualization: Pairwise scatter plots can help visualize the relationship between predictors and the response.
B. Simple Linear Regression (SLR)
- Model: Involves only one predictor (x) and assumes a linear relationship: y = β₀ + β₁x + ε.
  - β₀: Intercept (estimated as β̂₀).
  - β₁: Slope (estimated as β̂₁).
  - ε: Error term with zero mean and constant variance.
- Prediction: Predicted value ŷ = β̂₀ + β̂₁x.
- Estimating Coefficients (Least Squares):
  - Minimize the Residual Sum of Squares (RSS): RSS = Σ(yᵢ - ŷᵢ)² = Σ(yᵢ - (β̂₀ + β̂₁xᵢ))².
  - The least squares estimates are the values of β̂₀ and β̂₁ that minimize the RSS.
  - Optimal Solutions:
    - β̂₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
    - β̂₀ = ȳ - β̂₁x̄
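The closed-form least squares solutions translate directly into code. The sketch below assumes NumPy and uses a small hypothetical data set (five points lying roughly on y = 1 + 2x); it is an illustration, not part of the lecture material.

```python
import numpy as np

# Hypothetical toy data, roughly following y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations divided by sum of squared x-deviations.
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: forces the fitted line through the point (x_bar, y_bar).
beta0_hat = y_bar - beta1_hat * x_bar

y_hat = beta0_hat + beta1_hat * x   # fitted values
rss = np.sum((y - y_hat) ** 2)      # residual sum of squares at the optimum

print(beta0_hat, beta1_hat, rss)
```

As a cross-check, `np.polyfit(x, y, 1)` solves the same least squares problem and should return the same slope and intercept.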
C. Interpreting a Linear Regression Model
- β₁ (Slope): Represents the average change in y for a one-unit increase in x.
- β₀ (Intercept): Represents the expected value of y when x is 0.
- ε (Error Term): Represents the variation in y not explained by the model.
D. Accuracy of Coefficients
- True Model vs. Estimated Model: The true model describes the population, while estimates are derived from a sample.
- Estimates as Random Variables: Parameter estimates (β̂₀, β̂₁) are random variables with their own mean and standard deviation, because they change from sample to sample.
- Unbiasedness: An estimator is unbiased if its mean equals the true parameter value. Least squares estimates are unbiased: E(β̂₀) = β₀, E(β̂₁) = β₁.
- Standard Error (SE): The standard deviation of an estimator.
  - SE(β̂₀)² = σ² [1/n + x̄² / Σ(xᵢ - x̄)²] and SE(β̂₁)² = σ² / Σ(xᵢ - x̄)², where σ² is the variance of the error term ε. In practice σ² is unknown and is estimated from the residuals as RSS / (n - 2).
- Confidence Interval (CI): A 95% CI is a range of values that contains the true parameter value with 95% probability.
  - Approximate 95% CI for β₁: [β̂₁ - 2 * SE(β̂₁), β̂₁ + 2 * SE(β̂₁)]. More precisely, the 2 is replaced by the 97.5% quantile of the t-distribution with n-2 degrees of freedom.
- Hypothesis Testing:
  - Null Hypothesis (H₀): No relationship between x and y (β₁ = 0).
  - Alternative Hypothesis (H₁): There is a relationship (β₁ ≠ 0).
  - Rejection: H₀ is rejected if the CI does not contain 0, or equivalently if the t-statistic t = β̂₁ / SE(β̂₁) falls in the rejection region, i.e. |t| exceeds the critical value of the t-distribution with n-2 degrees of freedom.
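The standard errors, the approximate confidence interval, and the t-statistic can be computed from the same quantities as the coefficient estimates. This is a minimal sketch assuming NumPy and a hypothetical five-point data set; it uses the rough multiplier 2 rather than the exact t quantile, which with so few observations would be noticeably larger than 2.

```python
import numpy as np

# Hypothetical toy data, roughly following y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])
n = len(x)

# Least squares fit.
x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x_bar

# Estimate the error variance from the residuals: RSS / (n - 2).
residuals = y - (beta0_hat + beta1_hat * x)
sigma2_hat = np.sum(residuals ** 2) / (n - 2)

# Standard errors of the two coefficient estimates.
se_beta1 = np.sqrt(sigma2_hat / sxx)
se_beta0 = np.sqrt(sigma2_hat * (1 / n + x_bar ** 2 / sxx))

# Approximate 95% CI for the slope and the t-statistic for H0: beta1 = 0.
ci_low, ci_high = beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1
t_stat = beta1_hat / se_beta1
reject_h0 = not (ci_low <= 0 <= ci_high)

print(se_beta0, se_beta1, (ci_low, ci_high), t_stat, reject_h0)
```

For an exact test, the multiplier 2 would be replaced by the 97.5% quantile of the t-distribution with n - 2 degrees of freedom, for example from a statistics library.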
E. Goodness of Fit
- R-squared (R²):
  - Measures the proportion of variability in y explained by the model.
  - R² = 1 - (RSS / TSS)
  - TSS (Total Sum of Squares): Σ(yᵢ - ȳ)²
  - RSS (Residual Sum of Squares): Σ(yᵢ - ŷᵢ)²
  - For a least squares fit with an intercept, R² on the training data ranges from 0 to 1. Higher values indicate a better fit to the training data.
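R² follows directly from the two sums of squares. The sketch below assumes NumPy and reuses a hypothetical five-point data set; in simple linear regression the result should equal the squared sample correlation between x and y, which gives a convenient check.

```python
import numpy as np

# Hypothetical toy data, roughly following y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Least squares fit (np.polyfit returns the slope first, then the intercept).
beta1_hat, beta0_hat = np.polyfit(x, y, 1)
y_hat = beta0_hat + beta1_hat * x

tss = np.sum((y - y.mean()) ** 2)   # total variation in y
rss = np.sum((y - y_hat) ** 2)      # variation left unexplained by the line
r_squared = 1 - rss / tss

print(tss, rss, r_squared)
```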
F. Multiple Linear Regression
- Model: y = β₀ + β₁x₁ + ... + βₚxₚ + ε, or in matrix form y = Xβ + ε.
  - β: Vector of coefficients [β₀, β₁, ..., βₚ]ᵀ.
  - X: Matrix of predictors (including a column of ones for the intercept).
- Interpretation of βⱼ: The average change in y for a one-unit increase in xⱼ, holding all other predictors constant.
- Least Squares: Minimize RSS = ||y - Xβ||² = (y - Xβ)ᵀ(y - Xβ).
  - Solution: β̂ = (XᵀX)⁻¹Xᵀy (requires XᵀX to be invertible).
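The matrix solution can be sketched as follows, assuming NumPy. The simulated data, the true coefficients (1, 2, -3), and the random seed are hypothetical choices made only for illustration.

```python
import numpy as np

# Simulate hypothetical data from y = 1 + 2*x1 - 3*x2 + noise.
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix: a column of ones for the intercept, then the predictors.
X = np.column_stack([np.ones(n), x1, x2])

# Normal equations: solve (X^T X) beta = X^T y instead of forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(beta_lstsq)
```

Solving the linear system gives the same result as computing (XᵀX)⁻¹Xᵀy explicitly, but avoids forming the inverse, which is generally the more numerically stable habit.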
G. Residual Diagnostics
- Purpose: To check if the assumptions of the linear regression model are met, which is crucial for valid statistical inference.
- Assumptions:
  - Linearity: The relationship between predictors and the response is linear.
  - Independence: Errors (ε) are independent of each other.
  - Normality: Errors are normally distributed.
  - Equal Variance (Homoscedasticity): Errors have a constant variance across all levels of predictors.
- Diagnostic Tools:
  - Residual Plots: Plotting residuals against fitted values (ŷ). Patterns suggest non-linearity or unequal variance (heteroscedasticity).
  - Squared Residuals vs. Fitted Values: Helps detect heteroscedasticity.
  - Q-Q Plot (Quantile-Quantile Plot): Compares the distribution of residuals to a theoretical normal distribution. Points should roughly follow a straight line for normality.
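The quantities behind these diagnostic plots can be computed without any plotting library. The sketch below assumes NumPy plus the standard-library `statistics.NormalDist`; the simulated data and seed are hypothetical, and the resulting arrays would normally be passed to a plotting tool.

```python
import numpy as np
from statistics import NormalDist

# Simulate hypothetical data that satisfies the model assumptions.
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

# Fit by least squares and compute fitted values and residuals.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
residuals = y - fitted

# Residual plot: (fitted, residuals). Variance check: (fitted, squared_residuals).
squared_residuals = residuals ** 2

# Q-Q plot coordinates: sorted standardized residuals against normal quantiles.
sample_quantiles = np.sort((residuals - residuals.mean()) / residuals.std(ddof=1))
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical_quantiles = np.array([NormalDist().inv_cdf(p) for p in probs])

# A correlation near 1 means the Q-Q points lie close to a straight line.
qq_corr = np.corrcoef(theoretical_quantiles, sample_quantiles)[0, 1]
print(qq_corr)
```

Plotting `residuals` against `fitted` should show no visible pattern for data like these; a curve would point to non-linearity and a funnel shape to heteroscedasticity.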
Note: The provided text also includes practice questions and examples related to these topics.
The following summarizes and explains the key points of "BUSS6002 W6-Lec.pdf":
I. Linear Algebra Basics
1. Matrix Basics
- A matrix is a table of numbers arranged in rows and columns, written $\mathbf{A}$, where $a_{ij}$ is the element in row $i$ and column $j$.
- The dimension of a matrix is $n \times m$ ($n$ rows, $m$ columns).
- Two matrices are equal only if they have the same dimensions and all corresponding elements are equal [7].
2. Matrix Operations
- Addition/subtraction: only defined for matrices of the same dimensions; addition is commutative and associative [10].
- Scalar multiplication: the scalar multiplies every element of the matrix [10].
- Multiplication: an $n\times m$ matrix $A$ times an $m\times r$ matrix $B$ gives an $n\times r$ matrix $C$, where $c_{ij}$ is the sum of the products of the entries in row $i$ of $A$ with the corresponding entries in column $j$ of $B$. Matrix multiplication is not commutative, but it is associative and distributive [16].
3. Special Matrices
4. Inverse of a Matrix
II. Linear Regression Basics
1. The Basic Linear Regression Model
2. Why Use Linear Regression
3. Simple Linear Regression (SLR)
- Only one predictor:
$$
y = \beta_0 + \beta_1 x + \varepsilon
$$
- $\beta_1$: the slope, i.e. the average change in $y$ when $x$ changes by one unit.
- Predicted value: $$ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x $$
- Least squares finds the optimal $\hat{\beta}_0, \hat{\beta}_1$, i.e. the values that minimize the residual sum of squares (RSS) [8].
4. Estimating and Interpreting the Regression Parameters
- $$ RSS = \sum_{i=1}^n \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2 $$
- The formulas for $\hat{\beta}_1$ and $\hat{\beta}_0$ give the slope and intercept of the regression line: $$ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} $$ [8].
- Confidence interval (CI): for example, an approximate 95% CI for $\beta_1$ is $[\hat{\beta}_1 - 2\times SE(\hat{\beta}_1), \hat{\beta}_1 + 2\times SE(\hat{\beta}_1)]$ [11].
5. Hypothesis Testing and Significance
- Null hypothesis $H_0$: $\beta_1 = 0$ (no relationship). Reject the null hypothesis if 0 lies outside the confidence interval, or if the absolute value of the t-statistic $t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$ exceeds the critical value [1].
6. Goodness of Fit and Diagnostics
- $$ R^2 = 1 - \frac{RSS}{TSS} $$ $TSS$ is the total variation in $y$ and $RSS$ is the variation left unexplained by the regression. A higher $R^2$ means the model explains more of the variation; its value lies between 0 and 1 [1].
- Basic assumptions of the regression model: linearity, independence of errors, normality, and equal variance (homoscedasticity). Diagnostic tools include residual plots and Q-Q plots [15][5].
7. Multiple Linear Regression
- With several predictors, the matrix form is $$ y = X\beta + \varepsilon $$ The least squares solution is $$ \hat{\beta} = (X^TX)^{-1} X^T y $$ Interpretation of $\beta_j$: the average change in $y$ when $x_j$ increases by one unit, holding the other variables fixed [15].
BUSS 6002 Final Review Summary
This document provides a comprehensive review for BUSS 6002, covering Big Data, Data Handling, Machine Learning algorithms, Model Selection, and Marketing Applications.
Part 1. Introduction to Big Data
1.1 Analytical Capabilities
Big Data analytics can be categorized into four types:
- Descriptive Analytics: What happened?
- Diagnostic Analytics: Why did it happen?
- Predictive Analytics: What will happen?
- Prescriptive Analytics: What should we do? How can we make it happen?
Prescriptive analytics often involves Optimization and Foresight.
1.2 CRISP-DM and Snail Shell Model
These are process models for knowledge discovery via data analytics (KDDA).
- CRISP-DM (Cross-Industry Standard Process for Data Mining):
  - Business Understanding: Identify business problems, collect initial data, determine objectives, output a project plan.
  - Data Understanding: Explore data, check quality, examine metadata.
  - Data Preparation: Select, clean, construct, integrate, and format data.
  - Modeling: Select appropriate techniques, generate test design (train, test, validation), build models.
  - Evaluation: Evaluate model performance against business objectives, determine next steps.
  - Deployment: Deploy the model, plan monitoring, and create a final report.
- Snail Shell KDDA Process Model: This model also outlines a process for data analytics, with similar phases to CRISP-DM.
Part 2. Data Handling
2.1 Data Quality Issue
- Missing Data:
  - Missing Completely at Random (MCAR): Missingness is independent of all variables (observed and unobserved). Causes no bias but is rare.
  - Missing at Random (MAR): Missingness depends only on observed variables. Can cause bias if those variables are not accounted for.
  - Not Missing at Random (NMAR): Missingness depends on unobserved values, including the missing values themselves. Common but hard to identify and address.
  - Handling: Delete or impute missing data.
- Outliers: Data points that differ markedly from the rest of the data.
2.2 Exploratory Data Analysis (EDA)
- Univariate:
  - Non-Graphical:
    - Categorical: Counts, proportions.
    - Numerical: Mean, median, spread.
  - Graphical:
    - Categorical: Bar chart.
    - Numerical: Histogram, box plot, QQ plot.
- Multivariate:
  - Non-Graphical:
    - Numerical vs. Numerical: Covariance matrix.
  - Graphical:
    - Numerical vs. Numerical: Correlation heatmap, scatterplot.
    - Category vs. Numerical: Box plot.
2.3 Feature Engineering
- Structured Data Feature Engineering:
  - Standardization: Scales data to have a mean of 0 and a variance of 1.
  - Normalization: Scales data into a range, typically [0, 1].
  - Exponential or Log Transformation: A log transformation reduces right skew; an exponential transformation can reduce left skew.
  - Linear Regression Feature Engineering:
    - Interpreting log-transformed variables.
    - Creating dummy variables.
    - Polynomial Regression.
- Text Data Feature Engineering:
  - Bag-of-Words: Extracts features based on word occurrence within a document. Creates a vocabulary and measures word presence. Does not capture word order.
  - TF-IDF (Term Frequency-Inverse Document Frequency): Translates word counts into a measure of importance.
    - Formula: TFIDF = (N of token t in document d) / (N of tokens in document d) * ln(N of documents / N of documents containing token t)
  - Stemming or Lemmatization: Reducing words to their root form.
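A small worked example of TF-IDF, using only the Python standard library. The three-document corpus and the helper name `tf_idf` are hypothetical, the tokenizer is a plain whitespace split, and the inverse document frequency is taken as ln(number of documents / number of documents containing the token); libraries often use smoothed variants of this formula.

```python
import math
from collections import Counter

# Hypothetical toy corpus; tokens are obtained by splitting on whitespace.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and the dogs",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(token for doc in tokenized for token in set(doc))

def tf_idf(token, doc):
    """TF-IDF of a token in one tokenized document (token must occur in the corpus)."""
    tf = doc.count(token) / len(doc)             # share of the document's tokens
    idf = math.log(n_docs / doc_freq[token])     # rarer tokens get a larger weight
    return tf * idf

print(tf_idf("the", tokenized[0]))   # appears in every document, so the weight is 0
print(tf_idf("cat", tokenized[0]))   # appears in one document only
print(tf_idf("sat", tokenized[0]))   # appears in two of the three documents
```

Note that "cat" and "cats" are treated as different tokens here; stemming or lemmatization would merge them before counting.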
Part 4. Machine Learning Algorithms
4.1 Unsupervised Learning - Clustering
- Goal: Partition data into clusters where points within a cluster are similar, and points in different clusters are dissimilar.
- K-Means Algorithm:
  1. Initialize cluster centroids (randomly).
  2. Assign each data point to the nearest centroid.
  3. Recalculate the centroid of each cluster.
  4. Repeat steps 2-3 until centroids no longer change.
- K-Means Appropriateness:
  - Highly dependent on initial centroid selection.
  - Elbow Method: Used to select the optimal number of clusters (k).
  - Advantage: Fast computational speed.
  - Limitation: Assumes clusters are convex and isotropic (circular).
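The K-Means steps above can be sketched in NumPy as follows. The function name `k_means`, its parameters, and the two simulated blobs are hypothetical; initial centroids are drawn from the data points, which is one common choice and not necessarily the one used in the course.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Plain K-Means: returns (centroids, labels) for data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid; keep the old one if a cluster is empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical data: two well-separated blobs around (0, 0) and (10, 10).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])
centroids, labels = k_means(X, k=2)
print(centroids)
```

Because the result depends on the initial centroids, a common practice is to run the algorithm with several seeds and keep the run with the smallest within-cluster sum of squares.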
4.2 Supervised Learning
4.2.1 Regression - Linear Regression
- Model: $Y = X\beta + \epsilon$
- Solution (OLS): $\hat{\beta} = (X^T X)^{-1} X^T y$
- Residual Diagnostics:
- Linearity: Plot residuals against fitted values.
- Equal Variance (Homoscedasticity): Plot squared residuals against fitted values.
4.2.2 Classification - Logistic Regression
- Model: A generalized linear model (GLM).
- Forecasting Probability: $P(Y=1|x) = \frac{1}{1 + e^{-X\beta}}$
- Decision Boundary: Linear.
- Interpreting Coefficients:
  - $\beta_0$: The log-odds of class 1 when all $x$ are zero, so the odds are $\exp(\beta_0)$.
  - $\beta_i$: For a 1-unit increase in $x_i$, holding the other predictors constant, the odds are multiplied by $\exp(\beta_i)$, i.e. they change by $(\exp(\beta_i) - 1) \times 100\%$.
- Likelihood Function: Measures how likely the observed data is given the model parameters.
- Maximum Likelihood Estimation (MLE): Finds parameters that maximize the likelihood function. $\hat{\beta} = \arg\max_{\beta} L(\beta)$.
- Classification Model Evaluation:
  - Accuracy: $(TP + TN) / (TP + TN + FP + FN)$
  - True Positive Rate (Recall, Sensitivity): $TP / (TP + FN)$
  - True Negative Rate (Specificity): $TN / (TN + FP)$
  - False Positive Rate: $FP / (FP + TN) = 1 - Specificity$
  - False Negative Rate: $FN / (TP + FN)$
  - Precision: $TP / (TP + FP)$
  - False Discovery Rate: $FP / (TP + FP) = 1 - Precision$
  - F1-Score: $2 * (Precision * Recall) / (Precision + Recall)$
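The evaluation metrics above all derive from the four confusion-matrix counts. The sketch below assumes NumPy; the true labels and predictions are hypothetical values chosen so the counts are easy to verify by hand.

```python
import numpy as np

# Hypothetical labels (1 = positive class) and model predictions.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Confusion-matrix counts.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))   # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))   # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))   # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))   # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)          # true positive rate, sensitivity
specificity = tn / (tn + fp)     # true negative rate
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)
print(accuracy, recall, specificity, precision, f1)
```

Here the counts are TP = 3, TN = 5, FP = 1, FN = 1, so accuracy is 0.8 while recall, precision, and F1 are all 0.75.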
Part 5. Analytical Methods & Optimization
5.3.1 Analytical Methods
- Approach: Solve for parameters by setting partial derivatives of the loss function to zero.
- Problem: Matrix inversion can be computationally expensive and infeasible for large datasets. Some loss functions may not have a unique solution.
5.3.2 Gradient Descent
- Basic Idea: Iteratively move towards the minimum of a function by taking steps in the direction of the negative gradient.
- Steps:
  1. Initialize parameters ($\beta^0$).
  2. Iterate: $\beta^{t+1} = \beta^t - \alpha \nabla L(\beta^t)$, where $\alpha$ is the step size (learning rate).
  3. Stop when the update is below a threshold.
- Step Size (Learning Rate): Controls the speed of convergence. Too large can overshoot, too small can lead to slow convergence.
- Convexity:
  - Convex Function: With a suitable step size, gradient descent converges to the global minimum.
  - Non-Convex Function: May converge to a local minimum; trying several initializations helps.
5.3.3 Comparison between Analytic and Gradient Descent
- Analytic Solution: Mathematically simpler, easier to implement for some problems.
- Gradient Descent: Lower maximum computation requirements, more scalable for large datasets.
- Both methods aim for the same solution (within tolerance). Gradient descent is often preferred for large-scale problems.
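The comparison can be illustrated on linear regression, where both approaches are available. This is a sketch assuming NumPy; the simulated data, the step size `alpha=0.1`, and the tolerance are hypothetical choices, and the loss is the mean squared error (RSS divided by n), which has the same minimizer as the RSS.

```python
import numpy as np

# Hypothetical data from y = 1 + 2x + noise.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

# Analytic solution: solve the normal equations (X^T X) beta = X^T y.
beta_analytic = np.linalg.solve(X.T @ X, X.T @ y)

def gradient_descent(X, y, alpha=0.1, tol=1e-10, max_iter=10_000):
    """Minimize the mean squared error by gradient descent, starting from zero."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = (2 / len(y)) * X.T @ (X @ beta - y)   # gradient of the MSE at beta
        step = alpha * grad
        beta = beta - step
        if np.linalg.norm(step) < tol:               # stop when the update is tiny
            break
    return beta

beta_gd = gradient_descent(X, y)
print(beta_analytic)
print(beta_gd)
```

The least squares loss is convex, so with this step size the iterates should approach the analytic solution; a much larger `alpha` would make the updates diverge.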
Part 6. Model Selection
- Overfitting vs. Underfitting:
  - Overfitting: Complex model, low bias, high variance. Performs well on training data but poorly on unseen data.
  - Underfitting: Simple model, high bias, low variance. Performs poorly on both training and unseen data.
  - MSE Trend: Training MSE decreases with increasing complexity. Test MSE first decreases, then increases. The optimal model minimizes test error.
- Data Splitting:
  - Training Set: Used for EDA, model building, and parameter estimation.
  - Validation Set: Used for model selection and hyperparameter tuning.
  - Test Set: Used for final, unbiased evaluation of the selected model.
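A minimal sketch of the train/validation/test workflow, assuming NumPy. Model complexity is represented here by the degree of a polynomial regression; the simulated sine-shaped data, the 60/20/20 split, and the candidate degrees are all hypothetical choices.

```python
import numpy as np

# Hypothetical non-linear data: y = sin(3x) + noise.
rng = np.random.default_rng(0)
n = 300
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=n)

# Random 60/20/20 split into training, validation, and test indices.
idx = rng.permutation(n)
train, val, test = idx[:180], idx[180:240], idx[240:]

def mse(coefs, x, y):
    """Mean squared error of a fitted polynomial on the given data."""
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

# Fit one polynomial per degree on the training set only.
degrees = range(1, 9)
fits = {d: np.polyfit(x[train], y[train], d) for d in degrees}
train_mse = {d: mse(fits[d], x[train], y[train]) for d in degrees}
val_mse = {d: mse(fits[d], x[val], y[val]) for d in degrees}

# Model selection uses the validation set; the test set is touched once at the end.
best_degree = min(val_mse, key=val_mse.get)
test_mse = mse(fits[best_degree], x[test], y[test])

print(train_mse)
print(val_mse)
print(best_degree, test_mse)
```

The training MSE can only go down as the degree increases, whereas the validation MSE need not, which is why the degree is chosen on the validation set rather than on the training set.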
Part 7. Applications in Marketing
2. Customer Analytics
- Product-Centric Marketing: Often uses Collaborative Filtering.
- Customer-Centric Marketing: Focuses on individual customer value. Uses models like:
  - Natural Propensity Models: Predict the likelihood of a customer taking a specific action (e.g., purchasing a product) regardless of marketing intervention.
  - Campaign Response Models: Predict the likelihood of a customer responding to a specific marketing campaign.
  - Uplift Models: Estimate the causal impact of a marketing intervention by comparing the outcome for treated vs. untreated customers. Categories include:
    - "Lost Causes" (treated vs. untreated: No difference; would not buy either way)
    - "Sleeping Dogs" (treated vs. untreated: Negative impact)
    - "Persuadable" (treated vs. untreated: Positive impact)
    - "Sure Thing" (treated vs. untreated: No difference; would have bought anyway)
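At its simplest, uplift is the difference in response rate between treated and untreated customers. The sketch below assumes NumPy and a hypothetical campaign in which customers were randomly assigned to the two groups; real uplift models estimate this difference per customer or per segment rather than overall.

```python
import numpy as np

# Hypothetical outcomes: 1 = purchased, 0 = did not purchase.
treated = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])   # received the campaign
control = np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 1])   # did not receive it

response_treated = treated.mean()   # purchase rate among treated customers
response_control = control.mean()   # purchase rate among untreated customers

# Average uplift: extra purchases per customer attributable to the campaign.
uplift = response_treated - response_control

print(response_treated, response_control, uplift)
```

A positive value is what a "Persuadable" segment would show, a value near zero corresponds to "Sure Thing" or "Lost Causes" segments, and a negative value to "Sleeping Dogs".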
3. Customer Analytics - Measuring Success
- Good Success Metrics: Should be business-aligned, measure uplift, have appropriate timing, and be at the right level of aggregation.
- Common Pitfalls: Focusing solely on engagement metrics (likes, shares) without linking them to business outcomes.
Part 8. Big Data Solutions
- Algorithm Running Time: Expressed in big-O notation, which keeps only the fastest-growing term and drops constant coefficients (e.g., $O(n^3)$).
- Tall Data Solutions:
  - Scalable Algorithms: e.g., Stochastic Gradient Descent (SGD).
  - Parallelization: Divide and Conquer approaches.
Key Concepts & Questions Addressed in Practice Problems
- Data Quality: Identifying missing data types (MCAR, MAR, NMAR), handling duplicates.
- EDA: Univariate and multivariate analysis techniques.
- Feature Engineering: Standardization, normalization, log transformations, dummy variables, TF-IDF, stemming/lemmatization.
- Clustering (K-Means): Algorithm steps, sensitivity to initialization and outliers, selecting 'k' (elbow method), limitations (cluster shape).
- Linear Regression: OLS solution, residual diagnostics, interpreting coefficients, feature transformations, bias-variance trade-off.
- Logistic Regression: Probability forecasting, interpreting coefficients, MLE, classification metrics (Accuracy, Precision, Recall, F1-Score).
- Optimization: Analytic solutions vs. Gradient Descent, learning rate, convexity.
- Model Selection: Overfitting/underfitting, train/validation/test splits.
- Marketing Analytics: Customer-centric vs. product-centric, campaign response models, uplift models, defining success metrics.
- Big Data: Scalability, parallelization.
- Text Analysis: Bag-of-Words, TF-IDF.
- CRISP-DM: Phases and their purpose.
- Ethics in Data Science: Principles of responsible AI.
BUSS6002 Week 6, "Linear Algebra and Linear Regression": summary of key concepts, combined with the full set of review material:
I. Linear Algebra Fundamentals (review slides + final review document)
1. Matrix Basics
- Definition and notation: a matrix $\mathbf{A}$ is $n\times m$ ($n$ rows, $m$ columns) with elements $a_{ij}$, and is usually written as a bold upper-case letter.
- Matrix equality: two matrices are equal only if they have the same dimensions and the same element in every position [4].
- Basic operations such as transpose, addition, scalar multiplication, and multiplication; understand the dimension requirements and the rules of each operation:
  - Addition is commutative and associative [10].
  - Multiplication is generally not commutative, but it is associative and distributive [27].
2. Special Matrices and the Matrix Inverse
- Square matrix, diagonal matrix, and identity matrix (I), with the special property $AI = IA = A$.
- Inverse matrix: if a square matrix $A$ is invertible, then $A^{-1}$ exists and satisfies $AA^{-1}=A^{-1}A=I$. A matrix whose determinant is 0 is not invertible [28].
II. Overall Structure of Linear Regression
1. The Essence of the Linear Regression Model
- Goal: use numerical predictors (features $X$) to explain variation in the response variable $Y$, expressed mathematically as $$ Y = f(X) + \varepsilon $$ where $X$ denotes the features, $Y$ the response variable, and $\varepsilon$ the error term (noise) [28].
- When linear regression is appropriate: it is easy to interpret, makes inference straightforward, and is a foundation for more complex models [16].
2. Simple Linear Regression (SLR): Derivation and Interpretation
- Basic form (one predictor $x$, assumed to be linearly related to $y$): $$ y = \beta_0 + \beta_1 x + \varepsilon $$
- Ordinary least squares (OLS): choose $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the residual sum of squares (RSS): $$ RSS = \sum_{i=1}^{n}(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))^2 $$ $$ \hat{\beta}_1 = \frac{\sum_i(x_i-\bar{x})(y_i-\bar{y})}{\sum_i(x_i-\bar{x})^2} $$ $$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} $$ [16] Source: BUSS6002 W6-Lec.pdf
- Slope $\beta_1$: when $x$ increases by one unit, $y$ changes by $\beta_1$ units on average.
- Hypothesis testing and confidence intervals for the regression coefficients:
  - Null hypothesis (no relationship between $x$ and $y$): $H_0:\ \beta_1=0$; alternative hypothesis: $H_1:\ \beta_1\neq 0$
  - $t$-statistic: $t=\frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$, compared with a $t$-distribution with $n-2$ degrees of freedom
  - Accuracy of the coefficients: the least squares estimates are unbiased, $E(\hat{\beta}_0)=\beta_0$ and $E(\hat{\beta}_1)=\beta_1$. The standard deviation of an estimator is called its standard error, with $SE(\hat{\beta}_0)^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\right]$ and $SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$, where $\sigma^2$ is the variance of the error term $\varepsilon$
  - Approximate 95% confidence interval: $[\hat{\beta}_1-2SE(\hat{\beta}_1), \hat{\beta}_1+2SE(\hat{\beta}_1)]$. For a more accurate interval, replace 2 with the 97.5% quantile of the $t$-distribution with $n-2$ degrees of freedom. If 0 lies outside the interval, $H_0$ is rejected. [11][32] Source: BUSS6002 W6-Lec.pdf
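The estimation and inference formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's own code: the function name `simple_linear_regression` and the sample data are hypothetical, and the noise variance $\sigma^2$ is estimated by $RSS/(n-2)$, which is a common choice but an assumption here since the notes do not state how $\sigma^2$ is obtained.

```python
import numpy as np

def simple_linear_regression(x, y):
    """Closed-form least-squares fit of y = b0 + b1 * x with basic inference."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_bar, y_bar = x.mean(), y.mean()
    sxx = np.sum((x - x_bar) ** 2)

    # Least-squares estimates of the slope and intercept
    b1 = np.sum((x - x_bar) * (y - y_bar)) / sxx
    b0 = y_bar - b1 * x_bar

    # Residual sum of squares
    rss = np.sum((y - (b0 + b1 * x)) ** 2)

    # Assumption: sigma^2 is estimated by RSS / (n - 2)
    sigma2_hat = rss / (n - 2)
    se_b1 = np.sqrt(sigma2_hat / sxx)

    # t-statistic for H0: beta_1 = 0 and the approximate 95% interval
    t_stat = b1 / se_b1
    ci_95 = (b1 - 2 * se_b1, b1 + 2 * se_b1)
    return {"b0": b0, "b1": b1, "rss": rss,
            "se_b1": se_b1, "t_stat": t_stat, "ci_95": ci_95}

# Hypothetical data, for illustration only
result = simple_linear_regression([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(result)
```

If 0 falls outside `ci_95`, the null hypothesis $H_0: \beta_1 = 0$ would be rejected under the approximate rule described above.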
3. Multiple Linear Regression
- Vector/matrix form, suited to multiple predictors: $$ y = X\beta + \varepsilon $$ where $X$ is an $n \times (p+1)$ matrix and $\beta$ is a $(p+1)\times 1$ coefficient vector. In scalar form, $y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon$.
- Closed-form OLS solution: $$ \hat{\beta} = (X^TX)^{-1}X^Ty $$ [11] Source: BUSS6002 W6-Lec.pdf
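As a rough sketch of the closed-form solution above, the following NumPy snippet builds the $n \times (p+1)$ design matrix by prepending a column of ones and then solves the normal equations $(X^TX)\hat{\beta} = X^Ty$. The function name `fit_ols` and the data are hypothetical. Solving the linear system with `np.linalg.solve` rather than forming the inverse explicitly is a design choice made here for numerical stability; it gives the same $\hat{\beta}$ when $X^TX$ is invertible.

```python
import numpy as np

def fit_ols(features, y):
    """Return beta_hat = (X^T X)^{-1} X^T y for a design matrix with intercept."""
    features = np.asarray(features, dtype=float)
    y = np.asarray(y, dtype=float)

    # Prepend a column of ones so that beta[0] is the intercept
    X = np.column_stack([np.ones(features.shape[0]), features])

    # Solve the normal equations (X^T X) beta = X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2
features = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]], dtype=float)
y = 1 + 2 * features[:, 0] + 3 * features[:, 1]
print(fit_ols(features, y))
```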
4. Goodness of Fit
- Relevant statistics:
  - $R^2 = \frac{TSS-RSS}{TSS} = 1-\frac{RSS}{TSS}$: the proportion of variability in $y$ explained by the regression. It lies between 0 and 1, and a higher $R^2$ indicates a better overall fit to the training data.
  - $TSS=\sum_{i=1}^{n}(y_i-\bar{y})^2$ is the total sum of squares (the total variability in $y$); $RSS=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$ is the residual sum of squares (the variability left unexplained after regression). [11] Source: BUSS6002 W6-Lec.pdf
- Residual analysis and diagnostic plots: used to check the linearity, homoskedasticity, and normality assumptions. A Q-Q plot of the residuals checks normality, and a plot of squared residuals against fitted values reveals heteroskedasticity. [17] Source: BUSS6002 W6-Lec.pdf
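The goodness-of-fit quantities above can be computed directly from the observed and fitted values. The sketch below is illustrative only: the function names `r_squared` and `diagnostic_quantities` are hypothetical, and it returns the arrays one would plot (residuals against fitted values, squared residuals against fitted values, sorted residuals for a Q-Q plot) rather than drawing the plots itself.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    rss = np.sum((y - y_hat) ** 2)      # variability left unexplained
    tss = np.sum((y - y.mean()) ** 2)   # total variability in y
    return 1.0 - rss / tss

def diagnostic_quantities(y, y_hat):
    """Arrays that feed the usual residual diagnostic plots."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    residuals = y - y_hat
    return {
        "fitted": y_hat,
        "residuals": residuals,                  # vs fitted: linearity check
        "squared_residuals": residuals ** 2,     # vs fitted: heteroskedasticity check
        "sorted_residuals": np.sort(residuals),  # sample quantiles for a Q-Q plot
    }

# Hypothetical observed and fitted values
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat = [2.04, 4.03, 6.02, 8.01, 10.00]
print(r_squared(y, y_hat))
print(diagnostic_quantities(y, y_hat)["residuals"])
```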
III. Supplement: Data Engineering and Feature Processing
IV. Summary of Key Concepts and Worked Practice Questions (Partial)
- K-means clustering: a representative unsupervised learning method. Know the steps of the algorithm, when it is applicable, that it converges to a local (not necessarily global) optimum, and that it is sensitive to the initial centroids. [12] Source: 2022s2 BUSS6002 Final Review + Final Practice_AntiCopy.pdf
- Big data solutions: scalable algorithms such as (stochastic) gradient descent, and parallelization via divide and conquer. [6] Source: 2022s2 BUSS6002 Final Review + Final Practice_AntiCopy.pdf
- Regression model selection: guard against overfitting and underfitting by splitting the available data into training, validation, and test sets, then choose the best-performing model. The training set is used for EDA, model building, and estimation; the validation set for model selection; the test set for evaluating the selected model. [9] Source: 2022s2 BUSS6002 Final Review + Final Practice_AntiCopy.pdf
- Marketing case analysis: online marketing, customer-centric models (natural propensity, campaign response, and uplift models), and how to measure success. [9] Source: 2022s2 BUSS6002 Final Review + Final Practice_AntiCopy.pdf
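To connect the big-data bullet above to linear regression, here is a minimal sketch of mini-batch stochastic gradient descent applied to the least-squares objective. It is an assumption-laden illustration rather than the course's implementation: the function name `sgd_linear_regression`, the learning rate, batch size, and epoch count are all hypothetical choices, and the design matrix `X` is assumed to already contain a column of ones for the intercept.

```python
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.01, epochs=200,
                          batch_size=32, seed=0):
    """Minimize the mean squared error ||y - X beta||^2 / n with mini-batch SGD."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = X.shape
    beta = np.zeros(d)

    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the rows every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the mean squared error on this mini-batch only
            grad = 2.0 * Xb.T @ (Xb @ beta - yb) / idx.size
            beta -= learning_rate * grad
    return beta

# Hypothetical noise-free data generated from y = 1 + 2*x1 - 3*x2
rng = np.random.default_rng(42)
features = rng.normal(size=(500, 2))
X = np.column_stack([np.ones(500), features])
y = X @ np.array([1.0, 2.0, -3.0])
print(sgd_linear_regression(X, y))
```

Each update touches only `batch_size` rows, so the cost per step does not grow with $n$; this is why the approach scales to tall data where forming and solving $X^TX$ in one pass may be impractical.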
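Key takeaways: Week 6 focuses on linear algebra (matrix concepts, operations, and inverses) and the full linear regression workflow (model specification, estimation, inference, diagnostic plots, and hypothesis testing), together with an integrated understanding of how data engineering and model selection connect in practice.
If you would like a detailed derivation of any formula or worked practice questions for any part, feel free to ask!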