Exploratory Data Analysis¶
Preliminary investigation of data, to understand its characteristics
Helps identify appropriate pre-processing techniques and data mining algorithms
Involves
- Summary Statistics
- Visualization
Summary Statistics¶
Note: Statistics about the data \(\ne\) data itself
Uncertainty of estimate¶
- Standard error: variability in sample estimate
- Confidence intervals: estimated range for the true parameter that is being estimated
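A minimal sketch of both ideas (assuming NumPy and an illustrative normal sample; the 1.96 multiplier is the usual normal-approximation choice for 95% coverage):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=10, scale=2, size=100)  # illustrative data

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))  # standard error of the mean
ci = (mean - 1.96 * se, mean + 1.96 * se)       # approximate 95% confidence interval
```

Across repeated samples, roughly 95% of intervals constructed this way would contain the true mean.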
Robustness¶
Ability of a statistical procedure to handle a variety of distributions (including non-normal ones) and contamination (outliers, etc.)
There is a trade-off between efficiency and robustness
Breakdown Point¶
Fraction of contaminated data in a dataset that can be tolerated by the statistical procedure
Max logical BP is 0.5, because beyond that you can't tell which data is correct and which is contaminated
Contamination¶
Fraction of data that comes from a different distribution
There are 2 models for contamination
- Mean shift
- Variance shift
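A small sketch of mean-shift contamination (illustrative numbers, assuming NumPy), showing why the mean breaks down while the median barely moves:

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(0, 1, 95)
shifted = rng.normal(50, 1, 5)                # mean-shift contamination (5% of data)
contaminated = np.concatenate([clean, shifted])

print(np.mean(contaminated))    # pulled well away from 0 by the contaminated points
print(np.median(contaminated))  # stays close to 0
```

With 5% contamination we are well below the median's 0.5 breakdown point, but any contamination at all shifts the mean (breakdown point 1/n).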
Univariate Summary Statistics¶
Minimal set of value(s) that captures the characteristics of large amounts of data, and show the properties of a distribution
Measure | Statistic | Meaning | Formula | Formula with Measurement Error | Moment | Breakdown Point (Higher is better) | SE Standard Error \(\sigma(\text{Estimate})\) (Lower is better) | SE with measurement error | SNR Signal Noise Ratio \(\dfrac{E [\text{Estimate}]}{\sigma(\text{Estimate})}\) (Higher is better) | Comment |
---|---|---|---|---|---|---|---|---|---|---|
Location | Mean/ Arithmetic Mean \(\mu\) | Central tendency of distribution | \(\dfrac{\sum x_i}{n}\) | Weighted with \(w_i = \dfrac{1}{\sigma^2_m}\) | 1st | \(\dfrac{1}{n}\) | \(1 \times \dfrac{s}{\sqrt{n}}\) (assumes Normal dist) | |||
Trimmed Mean | \(k \%\) obs from top of dist are removed \(k \%\) obs from bottom of dist are removed \(\implies 2k \%\) obs are removed in total | \(\dfrac{k}{n}\) | \(\left( 1+\dfrac{2k}{n} \right)\dfrac{s}{\sqrt{n}}\) | For \(k>12.5\), better to use median | ||||||
Winsorized Mean | \(k \%\) obs from top of dist are replaced with \((100-k)\)th percentile \(k \%\) obs from bottom of dist are replaced with \(k\)th percentile \(\implies 2k \%\) obs are replaced in total | \(\dfrac{k}{n}\) | \(\left( 1+\dfrac{2k}{n} \right)\dfrac{s}{\sqrt{n}}\) | For \(k>12.5\), better to use median ||||||
Weighted Mean | \(\dfrac{\sum w_i x_i}{\sum w_i}\) | \(\dfrac{1}{n}\) | ||||||||
Geometric Mean | \(\sqrt[{\Large n}]{\Pi x}\) | \(\dfrac{1}{n}\) | ||||||||
Root Mean Squared | \(\sqrt{\dfrac{\sum_{i=1}^n (x_i)^2}{n}}\) | Gives more weightage to larger values | ||||||||
Root Mean N | \(\sqrt[p]{\dfrac{\sum_{i=1}^n (x_i)^p}{n}}\) | Gives more weightage based on power | ||||||||
Harmonic Mean | \(\dfrac{n}{\sum \frac{1}{x}}\) | \(\dfrac{1}{n}\) | Gives more weightage to smaller values | |||||||
Median | Middle most observation 50th quantile | \(\begin{cases} x_{(n+1)/2}, & n \text{ odd} \\ \dfrac{x_{n/2} + x_{n/2+1}}{2}, & n \text{ even}\end{cases}\) | \(\dfrac{1}{2}\) | \(1.253 \dfrac{s}{\sqrt{n}}\) | Robust to outliers |||||
SoftMedian | 1. Sort the data in ascending order 2. Calculate the median 3. Assign weights \(w_i\) to each data point based on how close it is to the median, usually Gaussian Weighting \(w_i = \exp \left \{ \dfrac{-(x_i - \text{med})^2}{2 \sigma^2} \right\}\) 4. Compute weighted average \(\dfrac{\sum w_i x_i}{\sum w_i}\) | |||||||||
Mode | Most frequent observation | Unstable for small samples | ||||||||
Scale | Variance \(\sigma^2\) \(\mu_2\) | Squared average deviation of observations from mean | \(\dfrac{1}{n} \sum (x_i - \mu)^2\) \(\dfrac{1}{n} \sum (x_i - \bar x)^2 \times \dfrac{n}{n-1}\) | 2nd Centralised | \(\dfrac{1}{n}\) | \(2 s \times \dfrac{s}{\sqrt{2 (n-1)}} \times \sqrt{1+ \dfrac{n-1}{2n} \gamma_4'}\) (Assumes Normal dist) | \(\dfrac{n-1}{2}\) | |||
Adjusted variance | \(\dfrac{1}{n-1} \left( 1 - \hat \gamma_3 \hat x + \dfrac{\hat \gamma_4 - 1}{4} \hat x^2 \right)\) | |||||||||
Standard Deviation | Average deviation of observations from mean | \(\sqrt{\text{Variance}}\) | \(\dfrac{1}{n}\) | \(1 \times \dfrac{s}{\sqrt{2 (n-1)}} \times \sqrt{1+ \dfrac{n-1}{2n} \gamma_4'}\) (Assumes Normal dist) | \(\sqrt{\text{SNR}(\sigma^2)}\) | |||||
MAD Mean Absolute Deviation | Mean deviation of observations from mean | \(\dfrac{\sum \vert x_i - \mu \vert}{n}\) \(\dfrac{\sum \vert x_i - \bar x \vert}{n} \times \dfrac{n}{n-1}\) | ||||||||
MAD' | corrects it to be comparable to standard deviation | \(1.253 \times \text{MAD}\) | ||||||||
MedAD Median Absolute Deviation | Median deviation of observations from median | \(\text{med} (\vert x_i - \text{med}_x \vert)\) \(\text{med} (\vert x_i - \hat {\text{med}_x} \vert ) \times \dfrac{n}{n-1}\) | \(\dfrac{1}{2}\) | \(1.67 \times \dfrac{s}{\sqrt{2 (n-1)}}\) | ||||||
MedAD' | Corrects MedAD to be comparable to standard deviation | \(1.4826 \times \text{MedAD}\) ||||||||
Skewness \(\gamma_3\) | Direction of tail | \(\dfrac{\sum (x_i - \mu)^3}{n \sigma^3}\) \(\dfrac{\mu - \text{Mo}}{\sigma}\) \(\dfrac{3(\mu - \text{Md})}{\text{MedAD'}}\) \(\dfrac{(Q_3 - Q_2) - (Q_2 - Q_1)}{\text{MedAD'}}\) \(\dfrac{\sum (x_i - \bar x)^3}{n s^3} \times \dfrac{\sqrt{n(n-1)}}{(n-2)}\) | 3rd Standardized | \(\sqrt{\dfrac{6n(n-1)}{(n-2)(n+1)(n+3)}}\) \(\approx \sqrt{\dfrac{6}{n}} \left[ 1 - \dfrac{3}{2n} + O \left(\dfrac{1}{n^2} \right) \right]\) | 0: Symmetric \([-0.5, 0.5]\): Approximately-Symmetric \([-1, 1]\): Moderately-skewed else: Highly-skewed |||||
Kurtosis \(\gamma_4\) | Peakedness of distribution | \(\dfrac{\sum (x_i - \mu)^4}{n \sigma^4}\) \(\dfrac{(\hat q_{.875} - \hat q_{.625}) + (\hat q_{.375} - \hat q_{.125})}{\text{MedAD'}}\) \(\dfrac{\hat q_{.975}+\hat q_{.075}}{\text{MedAD'}}\) \(\dfrac{\sum (x_i - \bar x)^4}{n s^4} \times \dfrac{(n+1)(n-1)}{(n-2)(n-3)}\) | 4th standardized | \(2 \times \text{SE}(\gamma_3) \times \sqrt{\dfrac{n^2-1}{(n-3)(n+5)}}\) \(\approx \sqrt{\dfrac{24}{n}} \left[ 1- \dfrac{2}{n} + O \left(\dfrac{1}{n^2} \right) \right]\) | ||||||
Excess Kurtosis \(\gamma_4'\) | Kurtosis compared to Normal distribution | \(\gamma_4-3\) | ||||||||
Max | ||||||||||
Min | ||||||||||
Percentile/ Quantile | Divides distributions into 100 parts | \(\dfrac{s}{\sqrt{n}} \dfrac{\sqrt{p (1-p)}}{f(q_p)}\), where \(f=\) PDF \(q_p=\) obtained quantile \(x\) value for given \(p\) | Unstable for small datasets | |||||||
Quartile | Divides distributions into 4 parts | |||||||||
Decile | Divides distributions into 10 parts | |||||||||
Range | Range of values | Max-Min | Susceptible to outliers | |||||||
IQR Interquartile Range | Q3 - Q1 | \(\dfrac{1}{4}\) | \(2.23 \times \dfrac{s}{\sqrt{2(n-1)}}\) | Robust to outliers | ||||||
IQR' | Corrects IQR to be comparable to standard deviation | \(0.7413 \times \text{IQR}\) ||||||||
CV Coefficient of Variation | \(\dfrac{\sigma}{\mu}\) |
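A few of the robust estimators above can be hand-rolled as a sketch (assuming NumPy; `trim_frac` and the 1.4826 constant follow the table's conventions):

```python
import numpy as np

def trimmed_mean(x, trim_frac=0.1):
    """Mean after removing the lowest and highest trim_frac of observations."""
    x = np.sort(np.asarray(x))
    cut = int(len(x) * trim_frac)
    return x[cut:len(x) - cut].mean()

def medad(x):
    """Median Absolute Deviation from the median."""
    x = np.asarray(x)
    return np.median(np.abs(x - np.median(x)))

def medad_prime(x):
    """MedAD scaled by 1.4826, so it estimates sigma under normality."""
    return 1.4826 * medad(x)

# one extreme outlier inflates the standard deviation but not the robust scales
data = np.concatenate([np.random.default_rng(0).normal(0, 1, 99), [1000.0]])
print(data.std(), medad_prime(data), trimmed_mean(data))
```

The single contaminated point dominates the standard deviation, while MedAD' and the trimmed mean stay near the values of the clean \(N(0,1)\) data.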
Standard Error of Statistic¶
- Standard deviation of a statistic across its sampling distribution
- Measure of uncertainty in the sample statistic with respect to the true population parameter
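When no closed-form SE formula applies, the bootstrap gives a generic estimate by resampling; a sketch assuming NumPy, with `n_boot` chosen arbitrarily:

```python
import numpy as np

def bootstrap_se(x, statistic, n_boot=2000, seed=0):
    """Estimate the standard error of any statistic via resampling with replacement."""
    rng = np.random.default_rng(seed)
    estimates = [statistic(rng.choice(x, size=len(x), replace=True))
                 for _ in range(n_boot)]
    return np.std(estimates, ddof=1)

x = np.random.default_rng(1).normal(0, 1, 200)
se_median = bootstrap_se(x, np.median)  # compare with 1.253 * s / sqrt(n) from the table
```

For this sample the bootstrap value should land near the table's analytic approximation \(1.253\, s/\sqrt{n} \approx 0.089\).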
Relationship between Mean, Median, Mode¶
Skewness¶
Skewness | Ordering | Property |
---|---|---|
\(> 0\) | Mode < Median < Mean | Positively Skewed |
\(0\) | Mode = Median = Mean | Symmetric |
\(<0\) | Mean < Median < Mode | Negatively Skewed |
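A quick check of the ordering (assuming NumPy; the exponential distribution is right-skewed):

```python
import numpy as np

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=10_000)  # positively skewed

# the long right tail pulls the mean above the median
print(np.median(skewed), np.mean(skewed))
```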
Moment¶
Multivariate Summary Statistics¶
Meaning | Statistic | Range |
---|---|---|
How 2 variables vary together | Covariance | \(-\infty < C < +\infty\) |
Normalized covariance | Correlation | \(-1 \le r \le +1\) |
Covariance Matrix¶
It is always \(n \times n\), where \(n =\) no of attributes
\(A_1\) | \(A_2\) | \(A_3\) | |
---|---|---|---|
\(A_1\) | \(\sigma^2_{A_1}\) | \(\text{Cov}(A_1, A_2)\) | \(\text{Cov}(A_1, A_3)\) |
\(A_2\) | \(\text{Cov}(A_2, A_1)\) | \(\sigma^2_{A_2}\) | \(\text{Cov}(A_2, A_3)\) |
\(A_3\) | \(\text{Cov}(A_3, A_1)\) | \(\text{Cov}(A_3, A_2)\) | \(\sigma^2_{A_3}\) |
The diagonal elements will be variance of the corresponding attribute
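A sketch with NumPy (illustrative data; `np.cov` treats each row as one attribute):

```python
import numpy as np

rng = np.random.default_rng(0)
A1 = rng.normal(0, 1, 500)
A2 = 0.8 * A1 + rng.normal(0, 0.5, 500)  # correlated with A1
A3 = rng.normal(0, 2, 500)

C = np.cov(np.vstack([A1, A2, A3]))  # 3 attributes -> 3x3 matrix
# C[i, i] is the variance of attribute i; C is symmetric
print(C)
```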
Correlation Matrix¶
\(A_1\) | \(A_2\) | \(A_3\) | |
---|---|---|---|
\(A_1\) | \(1\) | \(r(A_1, A_2)\) | \(r(A_1, A_3)\) |
\(A_2\) | \(r(A_2, A_1)\) | \(1\) | \(r(A_2, A_3)\) |
\(A_3\) | \(r(A_3, A_1)\) | \(r(A_3, A_2)\) | \(1\) |
The diagonal elements will be 1
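Similarly for the correlation matrix (assuming NumPy; `np.corrcoef` also treats rows as attributes):

```python
import numpy as np

rng = np.random.default_rng(1)
A1 = rng.normal(size=500)
A2 = 0.9 * A1 + rng.normal(scale=0.3, size=500)  # strongly correlated with A1
A3 = rng.normal(size=500)

R = np.corrcoef(np.vstack([A1, A2, A3]))
# diagonal is exactly 1; off-diagonal entries lie in [-1, 1]
print(R)
```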
Why \((n-k)\) for sample statistics?¶
where \(k=\) No of parameters estimated from the sample
- The sample statistic (e.g. variance about the sample mean) tends to underestimate the population value, so we correct for that
- One degree of freedom is lost for each parameter estimated from the sample
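A small simulation (assuming NumPy; sample size and repetition count are arbitrary) showing why dividing by \(n\) underestimates the variance while \(n-1\) corrects it:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0  # population variance (sigma = 2)

biased, unbiased = [], []
for _ in range(5000):
    s = rng.normal(0, 2, size=10)    # small sample
    biased.append(s.var(ddof=0))     # divide by n
    unbiased.append(s.var(ddof=1))   # divide by n - 1 (one parameter estimated)

print(np.mean(biased))    # systematically below 4
print(np.mean(unbiased))  # close to 4
```

The \(n\) divisor is biased low because deviations are measured from the sample mean, which is itself fitted to the same data.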