AskSia - BUSS6002: Data Science in Business: BUSS6002 W6: Linear Algebra and

It looks like you've uploaded a file, but didn't include a message or question.
If you need help with something in your document (BUSS6002 W6-Lec.pdf), please let me know what you'd like to do. Here are some common options:

Summarize the contents of a section or the whole file
Explain a specific term, equation, or concept
Answer practice questions or work through an example
Check assumptions or explain outputs from a regression
Anything else you need for study or revision

Just type your request or a keyword—I'm here to assist!

summarize_document

BUSS6002 W6: Linear Algebra and Linear Regression

This document provides a summary of key concepts in Linear Algebra and Linear Regression, intended for students of Ella Luo.

I. Linear Algebra

A. Matrices

Definition: A matrix is a rectangular array of numbers.
- a_ij represents the element in the i-th row and j-th column.
- The dimension of a matrix A is denoted as n x m (n rows by m columns).
- Matrices are typically represented by bold upper-case letters (e.g., A).
Matrix Equality: Two matrices A and B are equal (A = B) if they have the same dimensions and all corresponding elements are equal (a_ij = b_ij for all i and j).

B. Matrix Operations

Matrix Addition:
- Matrices must have the same dimensions to be added.
- Commutative Property: A + B = B + A
- Associative Property: (A + B) + C = A + (B + C)
Matrix Scalar Multiplication: Multiplying each element of a matrix by a scalar.
- Example: 0.5 * [[1, 2], [3, 4]] = [[0.5, 1], [1.5, 2]]
Matrix Multiplication:
- Given an n x m matrix A and an m x r matrix B, their product AB results in an n x r matrix C.
- The element c_ij is calculated by the dot product of the i-th row of A and the j-th column of B.
- Example: [[4, 9, 6], [1, 5, 8]] * [[1, 2], [8, 3], [5, 7]] = [[(4*1 + 9*8 + 6*5), (4*2 + 9*3 + 6*7)], [(1*1 + 5*8 + 8*5), (1*2 + 5*3 + 8*7)]] = [[106, 77], [81, 73]]
- Properties:
  - Not Commutative: AB ≠ BA (generally)
  - Associative Property: (AB)C = A(BC)
  - Distributive Property: A(B + C) = AB + AC
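The row-by-column rule above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `matmul` is hypothetical, and matrices are assumed to be stored as lists of rows. It reuses the 2 x 3 and 3 x 2 matrices from the example and also forms BA, which has a different shape from AB.

```python
def matmul(A, B):
    """Multiply an n x m matrix A by an m x r matrix B (both lists of rows)."""
    n, m, r = len(A), len(B), len(B[0])
    if any(len(row) != m for row in A):
        raise ValueError("number of columns of A must equal number of rows of B")
    # c_ij is the dot product of row i of A and column j of B
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(r)]
            for i in range(n)]


A = [[4, 9, 6], [1, 5, 8]]      # 2 x 3
B = [[1, 2], [8, 3], [5, 7]]    # 3 x 2

AB = matmul(A, B)   # 2 x 2 result
BA = matmul(B, A)   # 3 x 3 result, so AB and BA cannot be equal here
print(AB)
print(BA)
```

Because AB is 2 x 2 and BA is 3 x 3, this pair already shows that matrix multiplication is not commutative in general.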

C. Special Matrices

Square Matrix: A matrix where the number of rows equals the number of columns (n = m).
Diagonal Matrix: A square matrix where all off-diagonal elements are zero.
Identity Matrix (I): A diagonal matrix with ones on the main diagonal.
- Useful Property: AI = IA = A

D. Inverse of a Matrix

An n x n square matrix A is invertible if there exists an n x n matrix B such that AB = BA = I.
The matrix B is called the inverse of A and is denoted as A⁻¹.
A square matrix is singular (not invertible) if and only if its determinant is zero.
For a 2x2 matrix [[a, b], [c, d]], the inverse is (1 / (ad - bc)) * [[d, -b], [-c, a]].
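As a quick check of the 2x2 formula, the sketch below implements it directly and verifies that A times its inverse gives the identity. The function name `inverse_2x2` and the sample matrix are illustrative assumptions, not part of the lecture notes.

```python
def inverse_2x2(M):
    """Invert [[a, b], [c, d]] using (1 / (ad - bc)) * [[d, -b], [-c, a]]."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular (determinant is zero)")
    return [[d / det, -b / det], [-c / det, a / det]]


A = [[4.0, 7.0], [2.0, 6.0]]    # determinant = 4*6 - 7*2 = 10
A_inv = inverse_2x2(A)

# Multiply A by its inverse; the result should be the 2x2 identity matrix
I_check = [[sum(A[i][k] * A_inv[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
print(A_inv)
print(I_check)
```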

II. Linear Regression

A. Introduction to Linear Regression

Purpose: To describe the relationship between a continuous response variable (Y) and a set of predictor variables (X).
- Y = f(X) + ε
- Y: Response, dependent variable, target.
- X: Predictors, features, independent variables, covariates.
- ε: Error term.
Supervised Learning: Involves learning from observed feature-target pairs where targets have labels.
Why Use Linear Regression:
- Useful for predictions.
- Easy to interpret (not a "black-box").
- A good starting point for more complex models.
Visualization: Pairwise scatter plots can help visualize the relationship between predictors and the response.

B. Simple Linear Regression (SLR)

Model: Involves only one predictor (x) and assumes a linear relationship: y = β₀ + β₁x + ε.
- β₀: Intercept (estimated as β̂₀).
- β₁: Slope (estimated as β̂₁).
- ε: Error term with zero mean and constant variance.
Prediction: Predicted value ŷ = β̂₀ + β̂₁x.
Estimating Coefficients (Least Squares):
- Minimize the Residual Sum of Squares (RSS): RSS = Σ(yᵢ - ŷᵢ)² = Σ(yᵢ - (β̂₀ + β̂₁xᵢ))².
- The optimal β̂₀ and β̂₁ are found using the Least Squares method.
- Optimal Solutions:
  - β̂₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
  - β̂₀ = ȳ - β̂₁x̄
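The closed-form least squares solution above can be tried on a small data set. The following is a minimal sketch with made-up numbers; `fit_slr` is a hypothetical helper name, and the data are chosen only so the arithmetic is easy to follow by hand.

```python
def fit_slr(x, y):
    """Return (beta0_hat, beta1_hat) from the closed-form least squares formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx                 # slope
    beta0 = y_bar - beta1 * x_bar     # intercept
    return beta0, beta1


# Illustrative data (not from the lecture): x_bar = 3, y_bar = 4
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

beta0_hat, beta1_hat = fit_slr(x, y)
print(beta0_hat, beta1_hat)
```

For these numbers the sums are Σ(xᵢ - x̄)(yᵢ - ȳ) = 6 and Σ(xᵢ - x̄)² = 10, so the slope is 0.6 and the intercept is 4 - 0.6 * 3 = 2.2.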

C. Interpreting a Linear Regression Model

β₁ (Slope): Represents the average change in y for a one-unit increase in x.
β₀ (Intercept): Represents the expected value of y when x is 0.
ε (Error Term): Represents the variation in y not explained by the model.

D. Accuracy of Coefficients

True Model vs. Estimated Model: The true model describes the population, while estimates are derived from a sample.
Estimates as Random Variables: Parameter estimates (β̂₀, β̂₁) are random variables with their own mean and standard deviation.
Unbiasedness: An estimator is unbiased if its mean equals the true parameter value. Least squares estimates are unbiased: E(β̂₀) = β₀, E(β̂₁) = β₁.
Standard Error (SE): The standard deviation of an estimator.
- SE(β̂₀)² = σ² [1/n + x̄² / Σ(xᵢ - x̄)²] and SE(β̂₁)² = σ² / Σ(xᵢ - x̄)², where σ² is the variance of the error term ε.
Confidence Interval (CI): A range of values likely to contain the true parameter.
- Approximate 95% CI for β₁: [β̂₁ - 2 * SE(β̂₁), β̂₁ + 2 * SE(β̂₁)]. (More accurately, the 2 is replaced by the 97.5% quantile of the t-distribution with n-2 degrees of freedom.)
Hypothesis Testing:
- Null Hypothesis (H₀): No relationship between x and y (β₁ = 0).
- Alternative Hypothesis (H₁): There is a relationship (β₁ ≠ 0).
- Rejection: H₀ is rejected if the CI does not contain 0, or equivalently if the t-statistic t = β̂₁ / SE(β̂₁) falls in the rejection region of the t-distribution with n-2 degrees of freedom (that is, |t| exceeds the critical value).
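The standard error, t-statistic, and approximate interval for the slope can be computed by hand on a toy data set. This sketch makes one assumption that the summary does not state: the unknown error variance σ² is estimated by RSS / (n - 2), the usual choice in simple linear regression. The data are made up for illustration.

```python
import math

# Illustrative data (not from the lecture)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

# Least squares estimates
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
beta0 = y_bar - beta1 * x_bar

# Assumption: estimate the error variance sigma^2 by RSS / (n - 2)
rss = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
sigma2_hat = rss / (n - 2)

se_beta1 = math.sqrt(sigma2_hat / sxx)    # standard error of the slope
t_stat = beta1 / se_beta1                 # test statistic for H0: beta1 = 0
ci_low = beta1 - 2 * se_beta1             # approximate 95% CI (rule of 2)
ci_high = beta1 + 2 * se_beta1

print(se_beta1, t_stat, (ci_low, ci_high))
```

With only five observations the rule-of-2 interval is rough: the 97.5% quantile of the t-distribution with n - 2 = 3 degrees of freedom is larger than 2, so the exact interval is wider than the one printed here.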

E. Goodness of Fit

R-squared (R²):
- Measures the proportion of variability in y explained by the model.
- R² = 1 - (RSS / TSS)
- TSS (Total Sum of Squares): Σ(yᵢ - ȳ)²
- RSS (Residual Sum of Squares): Σ(yᵢ - ŷᵢ)²
- Ranges from 0 to 1. Higher values indicate a better fit to the training data.
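To make the R² definition concrete, here is a small sketch that computes TSS, RSS, and R² from observed and fitted values. The function name `r_squared` and the numbers are illustrative; the fitted values come from the line ŷ = 2.2 + 0.6x applied to x = 1, ..., 5.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS for observed values y and fitted values y_hat."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)                 # total sum of squares
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))    # residual sum of squares
    return 1.0 - rss / tss


# Illustrative observed values and fitted values from y_hat = 2.2 + 0.6 * x
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(y, y_hat)
print(r2)
```

Here TSS = 6 and RSS = 2.4, so R² = 1 - 2.4 / 6 = 0.6: the fitted line accounts for 60% of the variability in y on this toy data.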

F. Multiple Linear Regression

Model: y = β₀ + β₁x₁ + ... + βₚxₚ + ε
- β: Vector of coefficients [β₀, β₁, ..., βₚ]ᵀ.
- X: Matrix of predictors (including a column of ones for the intercept).
Interpretation of βⱼ: The average change in y for a one-unit increase in xⱼ, holding all other predictors constant.
Least Squares: Minimize RSS = ||y - Xβ||² = (y - Xβ)ᵀ(y - Xβ).
- Solution: β̂ = (XᵀX)⁻¹Xᵀy (requires XᵀX to be invertible).
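The step from the RSS to the closed-form solution is short enough to write out. The derivation below is the standard one; it assumes XᵀX is invertible, as the notes require.

```latex
\begin{aligned}
RSS(\beta) &= (y - X\beta)^T (y - X\beta)
            = y^T y - 2\beta^T X^T y + \beta^T X^T X \beta, \\
\frac{\partial RSS}{\partial \beta} &= -2 X^T y + 2 X^T X \beta = 0
  \;\Longrightarrow\; X^T X \hat{\beta} = X^T y
  \;\Longrightarrow\; \hat{\beta} = (X^T X)^{-1} X^T y .
\end{aligned}
```

The middle equation, XᵀXβ̂ = Xᵀy, is often called the normal equations; the inverse exists only when the columns of X are linearly independent.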

G. Residual Diagnostics

Purpose: To check if the assumptions of the linear regression model are met, which is crucial for valid statistical inference.
Assumptions:
- Linearity: The relationship between predictors and the response is linear.
- Independence: Errors (ε) are independent of each other.
- Normality: Errors are normally distributed.
- Equal Variance (Homoscedasticity): Errors have a constant variance across all levels of predictors.
Diagnostic Tools:
- Residual Plots: Plotting residuals against fitted values (ŷ).
  - Patterns suggest non-linearity or unequal variance (heteroscedasticity).
- Squared Residuals vs. Fitted Values: Helps detect heteroscedasticity.
- Q-Q Plot (Quantile-Quantile Plot): Compares the distribution of residuals to a theoretical normal distribution. Points should roughly follow a straight line for normality.

Note: The provided text also includes practice questions and examples related to these topics.

Chinese Explanation

The following is a summary and explanation of the key points of the file "BUSS6002 W6-Lec.pdf":

I. Linear Algebra Basics

II. Linear Regression Basics

1. The Basic Linear Regression Model

The goal is to describe the relationship between the independent variables (features $X$) and the dependent variable (response $Y$), written as: $$ Y = f(X) + \varepsilon $$ where $\varepsilon$ is the error term [6] (Source: BUSS6002 W6-Lec.pdf).

2. Why Use Linear Regression

It is easy to interpret, convenient for prediction and inference, and the foundation of many more complex models [8] (Source: BUSS6002 W6-Lec.pdf).

3. Simple Linear Regression (SLR)

There is only one predictor: $$ y = \beta_0 + \beta_1 x + \varepsilon $$
- $\beta_1$: the slope, i.e. the average change in $y$ when $x$ changes by one unit.
- Predicted value: $$ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x $$
The least squares method finds the optimal $\hat{\beta}_0, \hat{\beta}_1$ by minimizing the residual sum of squares (RSS) [8] (Source: BUSS6002 W6-Lec.pdf).

4. Estimating and Interpreting the Regression Parameters

$$ RSS = \sum_{i=1}^n \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2 $$
The formulas for $\hat{\beta}_1$ and $\hat{\beta}_0$ give the slope and intercept of the regression line, respectively [8] (Source: BUSS6002 W6-Lec.pdf).
Confidence interval (CI): for example, the approximate 95% CI for $\beta_1$ is $[\hat{\beta}_1 - 2\times SE(\hat{\beta}_1), \hat{\beta}_1 + 2\times SE(\hat{\beta}_1)]$; more precisely, the 2 is replaced by the 97.5% quantile of the t-distribution with $n-2$ degrees of freedom [11] (Source: BUSS6002 W6-Lec.pdf).

5. Hypothesis Testing and Significance

Null hypothesis $H_0$: $\beta_1 = 0$ (no relationship). Reject the null hypothesis if 0 lies outside the confidence interval, or if the absolute value of the t-statistic $t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$ exceeds the critical value of the t-distribution with $n-2$ degrees of freedom [1] (Source: BUSS6002 W6-Lec.pdf).

6. Goodness of Fit and Diagnostics

7. Multiple Linear Regression

With several predictors, the matrix form is $$ y = X\beta + \varepsilon $$ The least squares solution is $$ \hat{\beta} = (X^TX)^{-1} X^T y $$ Interpretation of $\beta_j$: the average change in $y$ when $x_j$ increases by 1, holding the other variables fixed [15] (Source: BUSS6002 W6-Lec.pdf).

If you would like a detailed derivation, worked examples, or an explanation of the formulas for any part, feel free to ask!

summarize_document

BUSS 6002 Final Review Summary

This document provides a comprehensive review for BUSS 6002, covering Big Data, Data Handling, Machine Learning algorithms, Model Selection, and Marketing Applications.

Part 1. Introduction to Big Data

1.1 Analytical Capabilities

Big Data analytics can be categorized into four types:

Descriptive Analytics: What happened?
Diagnostic Analytics: Why did it happen?
Predictive Analytics: What will happen?
Prescriptive Analytics: What should we do? How can we make it happen?

Prescriptive analytics often involves Optimization and Foresight.

1.2 CRISP-DM and Snail Shell Model

These are process models for Knowledge Discovery via Data Analytics (KDDA).

CRISP-DM (Cross-Industry Standard Process for Data Mining):
- Business Understanding: Identify business problems, collect initial data, determine objectives, output a project plan.
- Data Understanding: Explore data, check quality, examine metadata.
- Data Preparation: Select, clean, construct, integrate, and format data.
- Modeling: Select appropriate techniques, generate test design (train, test, validation), build models.
- Evaluation: Evaluate model performance against business objectives, determine next steps.
- Deployment: Deploy the model, plan monitoring, and create a final report.
Snail Shell KDDA Process Model: This model also outlines a process for data analytics, with similar phases to CRISP-DM.

Part 2. Data Handling

2.1 Data Quality Issue

Missing Data:
- Missing Completely at Random (MCAR): Missingness is independent of all variables (observed and unobserved). Causes no bias but is rare.
- Missing at Random (MAR): Missingness depends on observed variables. Can cause bias.
- Not Missing at Random (NMAR): Missingness depends on unobserved variables. Common but hard to identify and address.
- Handling: Delete or impute missing data.
Outliers: Data points significantly different from others.

2.2 Exploratory Data Analysis (EDA)

Univariate:
- Non-Graphical:
  - Categorical: Counts, proportions.
  - Numerical: Mean, median, spread.
- Graphical:
  - Categorical: Bar chart.
  - Numerical: Histogram, box plot, QQ plot.
Multivariate:
- Non-Graphical:
  - Numerical vs. Numerical: Covariance matrix.
- Graphical:
  - Numerical vs. Numerical: Correlation heatmap, scatterplot.
  - Category vs. Numerical: Box plot.

2.3 Feature Engineering

Structured Data Feature Engineering:
- Standardization: Scales data to have a mean of 0 and a variance of 1.
- Normalization: Scales data into a range, typically [0, 1].
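- Exponential or Log Transformation: Reshapes skewed data; a log transformation is useful for right-skewed data, while an exponential transformation can help with left-skewed data.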
- Linear Regression Feature Engineering:
  - Interpreting log-transformed variables.
  - Creating dummy variables.
  - Polynomial Regression.
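Standardization and normalization, as described in the list above, differ only in which summary statistics they use. The sketch below shows both on a made-up list of numbers; the function names are hypothetical, and the choice of the population standard deviation (`statistics.pstdev`) rather than the sample version is an assumption, since the summary does not say which one the course uses.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and variance 1 (z-scores)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)   # assumption: population standard deviation
    return [(v - mean) / sd for v in values]


def normalize(values):
    """Rescale values linearly into the range [0, 1] (min-max scaling)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative data with mean 5 and population standard deviation 2
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

z_scores = standardize(data)
scaled = normalize(data)
print(z_scores)
print(scaled)
```

Both helpers would divide by zero on constant data, so a real pipeline would need to handle that case.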
Text Data Feature Engineering:
- Bag-of-Words: Extracts features based on word occurrence within a document. Creates a vocabulary and measures word presence. Does not capture word order.
- TF-IDF (Term Frequency-Inverse Document Frequency): Translates word counts into a measure of importance.
  - Formula: TFIDF = (N of token t in document d) / (N of tokens in document d) * ln((N of documents) / (N of documents containing token t))
- Stemming or Lemmatization: Reducing words to their root form.
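A small worked example may help with TF-IDF. The sketch below uses the common unsmoothed definition, term frequency times ln(number of documents / number of documents containing the token); library implementations often add smoothing terms, so their numbers can differ. The function name `tf_idf` and the three-document corpus are made up for illustration.

```python
import math


def tf_idf(docs):
    """Return one {token: tf-idf score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token
    df = {}
    for doc in docs:
        for token in set(doc):
            df[token] = df.get(token, 0) + 1

    scores = []
    for doc in docs:
        doc_scores = {}
        for token in set(doc):
            tf = doc.count(token) / len(doc)        # share of the document's tokens
            idf = math.log(n_docs / df[token])      # rarer tokens get a larger weight
            doc_scores[token] = tf * idf
        scores.append(doc_scores)
    return scores


# Illustrative corpus, already lower-cased and split into tokens
docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the dog barked".split(),
]

scores = tf_idf(docs)
print(scores[0])
```

The token "the" appears in every document, so its inverse document frequency is ln(3/3) = 0 and it carries no weight, while "cat" appears in only one document and gets the largest score in the first document.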

Part 4. Machine Learning Algorithms

4.1 Unsupervised Learning - Clustering

Goal: Partition data into clusters where points within a cluster are similar, and points in different clusters are dissimilar.
K-Means Algorithm:
1. Initialize cluster centroids (randomly).
2. Assign each data point to the nearest centroid.
3. Recalculate the centroid of each cluster.
4. Repeat steps 2-3 until centroids no longer change.
K-Means Appropriateness:
- Highly dependent on initial centroid selection.
- Elbow Method: Used to select the optimal number of clusters (k).
- Advantage: Fast computational speed.
- Limitation: Assumes clusters are convex and isotropic (circular).
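The four K-Means steps listed above can be sketched in plain Python. This is a teaching sketch under stated assumptions rather than a production implementation: it uses squared Euclidean distance, picks the initial centroids by sampling from the data with a fixed seed, and leaves an empty cluster's centroid unchanged. The function name `k_means` and the six points are hypothetical.

```python
import random


def k_means(points, k, seed=0, max_iter=100):
    """Cluster points (tuples of floats) into k groups; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                # step 1: random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Step 3: recompute each centroid as the mean of its cluster
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)            # keep the centroid of an empty cluster
        # Step 4: stop once the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Two well-separated illustrative groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

On data this cleanly separated the result does not depend on the seed, but on harder data different seeds can give different clusterings, which is the sensitivity to initialization noted above.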

4.2 Supervised Learning

4.2.1 Regression - Linear Regression

Model: $Y = X\beta + \epsilon$
Solution (OLS): $\hat{\beta} = (X^T X)^{-1} X^T y$
Residual Diagnostics:
- Linearity: Plot residuals against fitted values.
- Equal Variance (Homoscedasticity): Plot squared residuals against fitted values.

4.2.2 Classification - Logistic Regression

Model: A generalized linear model (GLM).
Forecasting Probability: $P(Y=1|x) = \frac{1}{1 + e^{-X\beta}}$
Decision Boundary: Linear.
Interpreting Coefficients:
- $\beta_0$: Log-odds of class 1 when all $x$ are zero; the odds are $\exp(\beta_0)$.
- $\beta_i$: For a 1-unit increase in $x_i$, holding the other predictors constant, the odds are multiplied by $\exp(\beta_i)$, a change of $(\exp(\beta_i) - 1) \times 100\%$.
Likelihood Function: Measures how likely the observed data is given the model parameters.
Maximum Likelihood Estimation (MLE): Finds parameters that maximize the likelihood function. $\hat{\beta} = \arg \max_{\beta} L(\beta \mid x, y)$.
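The odds interpretation of logistic regression coefficients can be checked numerically. The sketch below uses a single predictor and hypothetical coefficient values (they are not estimates from any data set) to show that the odds at x = 0 equal exp(β₀) and that a one-unit increase in x multiplies the odds by exp(β₁).

```python
import math


def predict_proba(beta0, beta1, x):
    """P(Y = 1 | x) for a logistic regression with one predictor."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)


beta0, beta1 = -1.0, 0.5          # hypothetical coefficients, for illustration only

p_at_0 = predict_proba(beta0, beta1, 0.0)
p_at_1 = predict_proba(beta0, beta1, 1.0)

odds_ratio = odds(p_at_1) / odds(p_at_0)       # effect of a 1-unit increase in x
pct_change = (math.exp(beta1) - 1.0) * 100.0   # percentage change in the odds

print(odds(p_at_0), math.exp(beta0))
print(odds_ratio, math.exp(beta1))
print(pct_change)
```

Each printed pair agrees up to floating-point error, which is the content of the coefficient interpretation: coefficients act additively on the log-odds and multiplicatively on the odds.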
Classification Model Evaluation:
- Accuracy: $(TP + TN) / (TP + TN + FP + FN)$
- True Positive Rate (Recall, Sensitivity): $TP / (TP + FN)$
- True Negative Rate (Specificity): $TN / (TN + FP)$
- False Positive Rate: $FP / (FP + TN) = 1 - Specificity$
- False Negative Rate: $FN / (TP + FN)$
- Precision: $TP / (TP + FP)$
- False Discovery Rate: $FP / (TP + FP) = 1 - Precision$
- F1-Score: $2 * (Precision * Recall) / (Precision + Recall)$
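The evaluation formulas above all derive from the four confusion-matrix counts, so they are easy to compute together. The sketch below does so for made-up counts; the function name `classification_metrics` is hypothetical, and the code assumes none of the denominators is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute common classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # true positive rate, sensitivity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": recall,
        "specificity": tn / (tn + fp),           # true negative rate
        "false_positive_rate": fp / (fp + tn),   # 1 - specificity
        "false_negative_rate": fn / (tp + fn),   # 1 - recall
        "precision": precision,
        "false_discovery_rate": fp / (tp + fp),  # 1 - precision
        "f1": 2 * precision * recall / (precision + recall),
    }


# Illustrative counts: 50 actual positives and 50 actual negatives
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```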

Part 5. Analytical Methods & Optimization

5.3.1 Analytical Methods

Approach: Solve for parameters by setting partial derivatives of the loss function to zero.
Problem: Matrix inversion can be computationally expensive and infeasible for large datasets. Some loss functions may not have a unique solution.

5.3.2 Gradient Descent

Basic Idea: Iteratively move towards the minimum of a function by taking steps in the direction of the negative gradient.
Steps:
1. Initialize parameters ($\beta^0$).
2. Iterate: $\beta^{t+1} = \beta^t - \alpha \nabla L(\beta^t)$, where $\alpha$ is the step size (learning rate).
3. Stop when the update is below a threshold.
Step Size (Learning Rate): Controls the speed of convergence. Too large can overshoot, too small can lead to slow convergence.
Convexity:
- Convex Function: Gradient descent with a suitable step size converges to the global minimum.
- Non-Convex Function: May find a local minimum; requires trying different initializations.
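The three gradient descent steps above can be illustrated on simple linear regression, where the analytic least squares answer is available for comparison. This is a minimal sketch under assumptions not taken from the notes: the loss is the mean squared error, the parameters start at zero, and the step size 0.05, tolerance, and data are chosen for illustration only.

```python
def gradient_descent_slr(x, y, alpha=0.05, tol=1e-10, max_iter=10_000):
    """Fit y = b0 + b1 * x by gradient descent on the mean squared error."""
    n = len(x)
    b0, b1 = 0.0, 0.0                                  # step 1: initialize parameters
    for _ in range(max_iter):
        residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        # Gradient of (1/n) * sum(residual^2) with respect to b0 and b1
        g0 = -2.0 / n * sum(residuals)
        g1 = -2.0 / n * sum(r * xi for r, xi in zip(residuals, x))
        step0, step1 = alpha * g0, alpha * g1
        b0, b1 = b0 - step0, b1 - step1                # step 2: move against the gradient
        if max(abs(step0), abs(step1)) < tol:          # step 3: stop when updates are tiny
            break
    return b0, b1


# Illustrative data; the closed-form least squares fit is b0 = 2.2, b1 = 0.6
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0_gd, b1_gd = gradient_descent_slr(x, y)
print(b0_gd, b1_gd)
```

The loss here is convex, so the iterates approach the same values as the analytic solution. A noticeably larger step size would make the updates diverge on this data, and a much smaller one would need far more iterations.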

5.3.3 Comparison between Analytic and Gradient Descent

Analytic Solution: Mathematically simpler, easier to implement for some problems.
Gradient Descent: Lower maximum computation requirements, more scalable for large datasets.
Both methods aim for the same solution (within tolerance). Gradient descent is often preferred for large-scale problems.

Part 6. Model Selection

Overfitting vs. Underfitting:
- Overfitting: Complex model, low bias, high variance. Performs well on training data but poorly on unseen data.
- Underfitting: Simple model, high bias, low variance. Performs poorly on both training and unseen data.
MSE Trend: Training MSE decreases with increasing complexity. Test MSE first decreases, then increases. The optimal model minimizes test error.
Data Splitting:
- Training Set: Used for EDA, model building, and parameter estimation.
- Validation Set: Used for model selection and hyperparameter tuning.
- Test Set: Used for final, unbiased evaluation of the selected model.

Part 7. Applications in Marketing

2. Customer Analytics

Product-Centric Marketing: Often uses Collaborative Filtering.
Customer-Centric Marketing: Focuses on individual customer value. Uses models like:
- Natural Propensity Models: Predicts the likelihood of a customer taking a specific action (e.g., purchasing a product) regardless of marketing intervention.
- Campaign Response Models: Predicts the likelihood of a customer responding to a specific marketing campaign.
- Uplift Models: Estimate the causal impact of a marketing intervention by comparing the outcome for treated vs. untreated customers. Categories include:
  - "Lost Causes" (treated vs. untreated: No difference; would not buy either way)
  - "Sleeping Dogs" (treated vs. untreated: Negative impact)
  - "Persuadable" (treated vs. untreated: Positive impact)
  - "Sure Thing" (treated vs. untreated: No difference, but would have bought anyway)
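One simple way to put a number on the treated-versus-untreated comparison is the difference in response rates between the two groups. The summary does not specify an estimator, so the sketch below is an assumption: it presumes customers were assigned to the campaign at random, and the outcome lists (1 = purchased, 0 = did not) are made up.

```python
def response_rate(outcomes):
    """Share of customers with a positive outcome (1 = purchased, 0 = did not)."""
    return sum(outcomes) / len(outcomes)


def average_uplift(treated_outcomes, control_outcomes):
    """Difference in response rates between treated and untreated customers."""
    return response_rate(treated_outcomes) - response_rate(control_outcomes)


# Hypothetical outcomes from a randomized campaign test
treated = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # 7 of 10 purchased
control = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]   # 4 of 10 purchased

uplift = average_uplift(treated, control)
print(uplift)
```

This gives only the average effect across all customers. Sorting individuals into the four categories above requires a model that predicts the effect for each customer, which this sketch does not attempt.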

3. Customer Analytics - Measuring Success

Good Success Metrics: Should be business-aligned, measure uplift, have appropriate timing, and be at the right level of aggregation.
Common Pitfalls: Focusing solely on engagement metrics (likes, shares) without linking them to business outcomes.

Part 8. Big Data Solutions

Algorithm Running Time: Determined by the fastest-growing term after dropping coefficients (e.g., $O(n^3)$).
Tall Data Solutions:
- Scalable Algorithms: e.g., Stochastic Gradient Descent (SGD).
- Parallelization: Divide and Conquer approaches.

Key Concepts & Questions Addressed in Practice Problems

Data Quality: Identifying missing data types (MCAR, MAR, NMAR), handling duplicates.
EDA: Univariate and multivariate analysis techniques.
Feature Engineering: Standardization, normalization, log transformations, dummy variables, TF-IDF, stemming/lemmatization.
Clustering (K-Means): Algorithm steps, sensitivity to initialization and outliers, selecting 'k' (elbow method), limitations (cluster shape).
Linear Regression: OLS solution, residual diagnostics, interpreting coefficients, feature transformations, bias-variance trade-off.
Logistic Regression: Probability forecasting, interpreting coefficients, MLE, classification metrics (Accuracy, Precision, Recall, F1-Score).
Optimization: Analytic solutions vs. Gradient Descent, learning rate, convexity.
Model Selection: Overfitting/underfitting, train/validation/test splits.
Marketing Analytics: Customer-centric vs. product-centric, campaign response models, uplift models, defining success metrics.
Big Data: Scalability, parallelization.
Text Analysis: Bag-of-Words, TF-IDF.
CRISP-DM: Phases and their purpose.
Ethics in Data Science: Principles of responsible AI.

Combined Summary with Week 6

BUSS6002 Week 6 "Linear Algebra and Linear Regression": a summary of the key concepts, combined with the full set of review material:

I. Linear Algebra Fundamentals (review slides + final review document)

1. Matrix Basics

Definition and notation: A matrix $\mathbf{A}$ is $n\times m$ ($n$ rows, $m$ columns), with elements denoted $a_{ij}$; matrices are usually written as bold upper-case letters.
Matrix equality: Two matrices are equal only if they have the same dimensions and the same element in every position [4] (Source: BUSS6002 W6-Lec.pdf).
Basic operations such as transpose, addition, scalar multiplication, and multiplication; understand the dimension-matching requirements and the rules for each operation:
- Addition is commutative and associative [10] (Source: BUSS6002 W6-Lec.pdf).
- Multiplication is generally not commutative, but it is associative and distributive; the source also lists $(AB)^T = B^T A^T$ [27] (Source: BUSS6002 W6-Lec.pdf).

2. Special Matrices and the Matrix Inverse

Square matrix, diagonal matrix, and identity matrix (I), with the special property $AI = IA = A$.
Inverse matrix: If a square matrix $A$ is invertible, its inverse $A^{-1}$ satisfies $AA^{-1}=A^{-1}A=I$. A matrix whose determinant is 0 is not invertible [28] (Source: BUSS6002 W6-Lec.pdf).

II. Overall Structure of Linear Regression

1. The Essence of the Linear Regression Model

2. Simple Linear Regression (SLR): Derivation and Interpretation

Basic form: $$ y = \beta_0 + \beta_1 x + \varepsilon $$
Ordinary least squares (OLS): Minimize the residual sum of squares (RSS) to find the best $\hat{\beta}_0$ and $\hat{\beta}_1$: $$ RSS = \sum_{i=1}^{n}(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))^2 $$ $$ \hat{\beta}_1 = \frac{\sum_i(x_i-\bar{x})(y_i-\bar{y})}{\sum_i(x_i-\bar{x})^2} $$ $$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} $$ [16] (Source: BUSS6002 W6-Lec.pdf)
Slope $\beta_1$: For each one-unit change in $x$, $y$ changes by $\beta_1$ units on average.
Hypothesis tests and confidence intervals for the regression model:

3. Multiple Linear Regression

Vector/matrix form, suited to the case of several predictors: $$ y = X\beta + \varepsilon $$ where $X$ is an $n \times (p+1)$ matrix and $\beta$ is a $(p+1)\times 1$ coefficient vector.
OLS closed-form solution: $$ \hat{\beta} = (X^TX)^{-1}X^Ty $$ [11] (Source: BUSS6002 W6-Lec.pdf)

4. Goodness of Fit

Relevant statistics:
- $R^2 = 1-\frac{RSS}{TSS}$; the higher $R^2$ is, the more of the variability in the training data the model explains.
- $TSS=\sum(y_i-\bar{y})^2$, $RSS=\sum(y_i-\hat{y}_i)^2$ [11] (Source: BUSS6002 W6-Lec.pdf)
Residual analysis and diagnostic plots: Used to assess the linearity, equal-variance, and normality assumptions [17] (Source: BUSS6002 W6-Lec.pdf).

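III. Supplement: Data Engineering and Feature Processing

IV. Summary of Typical Key Points and Worked Practice Questions (Partial)

Key points to master: Week 6 focuses on linear algebra (matrix concepts, operations, and the matrix inverse) and the complete linear regression workflow (modeling, estimation, inference, diagnostic plots, and hypothesis testing), together with an integrated understanding of how these connect to practical data engineering and model selection.

If you would like a detailed derivation of the formulas or worked practice questions for any part, feel free to ask!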

generate_study_guide
