Regression¶
used to predict for the dependent variable on the basis of past information available on dependent and independent variables.
The estimated regression line is given by
| Term | Meaning | 
|---|---|
| \(y\) | dependent variable | 
| \(x\) | independent variable | 
| \(b_0\) | y-intercept | 
| \(b_1\) | slope | 
| \(\hat y\) | estimated value | 
| \(\bar x\) | mean of \(x\) | 
| \(\bar y\) | mean of \(y\) | 
Correlation¶
gives the degree of linear relationship between 2 vars
Properties
- Dimensionless
- Symmetric: \(r(x, y)=r(y, x)\)
- \(r \in [-1, +1]\)
Pearson’s Correlation¶
Also called product moment correlation $$ \begin{aligned} r &= \dfrac{1}{n-1} \sum_{i=1}^n z_{xi} z_{yi} \ &= \dfrac{ \sum (x_i - \bar x)(y_i - \bar y) }{ \sqrt{\sum (x_i - \bar x)^2 \sum (y_i - \bar y)^2} } \ &= \dfrac{ n \sum(xy) - \sum x \sum y }{ n \sqrt{\sum (x^2) - \big(\sum x \big)^2 } \sqrt{ \sum (y^2) - \big(\sum y \big)^2 } } \end{aligned} $$
Measures whether 2 vars are above/below mean at the same time
Robust correlation¶
- Replace mean/sum with median
- Replace square with abs
where \(\tilde x = \text{med} (x)\)
Spearman Correlation¶
Modified Correlation¶
Setting the center as origin \(\implies \bar x=\bar y=0\)
- Contributes +vely if both vars are positive
- Contributes +vely if both vars are negative
- Contributes -vely if both vars are opposing sign
Useful for comparing time-series, returns, etc
| Type | Correlation | |
|---|---|---|
| Strength | Weak | \(\vert r \vert \le 0.5\) | 
| Moderate | \(0.5 < \vert r \vert < 0.8\) | |
| Strong | \(\vert r \vert \ge 0.8\) | |
| Direction | Directly | \(r > 0\) | 
| Inversely | \(r < 0\) | 
Similarity to Dissimilarity¶
\(\sqrt{2(1-r)}\)
Coefficient of Determination¶
\(R^2\) value is used for non-linear regression. It shows how well data fits within the regression.
It has a range of \([0, 1]\). Higher the better.
where
| Symbol | Meaning | 
|---|---|
| \(\text{SS}_\text{res}\) | Residual sum of squares | 
| \(\text{SS}_\text{tot}\) | Total sum of squares Proportional to variance of the data |