📘 Correlation — Introduction & Intuition
In simple words, correlation tells us whether two variables move together and how strongly. If two variables increase or decrease together, they are said to be positively correlated. If one goes up while the other goes down, they are negatively correlated. If they move independently of each other, the correlation is near zero.
🔹 Why was Correlation Introduced?
The idea of correlation was developed in the late 19th century by Sir Francis Galton while studying the relationship between the height of parents and their children. He wanted to understand if tall parents always produced tall children — and how strongly one could predict one from the other.
His student, Karl Pearson, refined this into a mathematical formula called the Pearson Correlation Coefficient, which is still the most widely used measure of correlation today.
🔹 Real-Life Examples
- ✅ Height and Weight — Positive Correlation
- ✅ Hours Studied and Marks Scored — Positive Correlation
- ❌ Alcohol Consumption and Health Score — Negative Correlation
- ⚙ Number of Cars in City & Air Pollution — Strong Positive
- ⚙ Shoe Size and Intelligence — No Correlation
🔹 Types of Correlation (Based on Direction & Strength)
- +1 → Perfect Positive (both move in same direction)
- 0 → No Correlation (movement is independent)
- -1 → Perfect Negative (opposite movement)
- Weak (0 to ±0.3), Moderate (±0.3 to ±0.7), Strong (±0.7 to ±1.0)
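The direction-and-strength buckets above can be sketched in a few lines of Python. The data here is purely illustrative (made-up hours-studied vs marks-scored pairs), and the cutoffs follow the weak/moderate/strong bands listed above:

```python
# Sketch: compute Pearson's r and classify its strength.
# The data values are illustrative, not from any real study.
import math

hours = [1, 2, 3, 4, 5]          # hours studied
marks = [52, 58, 65, 70, 78]     # marks scored

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(marks) / n

# Sum of cross-products of deviations, and sums of squared deviations
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, marks))
var_x = sum((x - mean_x) ** 2 for x in hours)
var_y = sum((y - mean_y) ** 2 for y in marks)

r = cov / math.sqrt(var_x * var_y)

# Classify using the bands listed above
if abs(r) >= 0.7:
    strength = "strong"
elif abs(r) >= 0.3:
    strength = "moderate"
else:
    strength = "weak"

print(round(r, 3), strength)  # 0.998 strong
```

As expected for data that rises almost perfectly in step, r comes out close to +1.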
🔹 Correlation ≠ Causation
Just because two variables are correlated, it does not mean one causes the other. For example:
- 📈 Ice cream sales and drowning cases — both increase in summer, but one does not cause the other.
- 📈 Number of movies Nicolas Cage acted in vs Swimming pool accidents — absurd, but correlated.
🔹 Limits & Misinterpretations
- Pearson's correlation only measures linear relationships.
- It can report near-zero correlation even when a strong curved (non-linear) relationship exists.
- It ignores the influence of third (hidden) variables.
- It is sensitive to outliers (one extreme value can distort result).
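The outlier sensitivity mentioned in the last point is easy to demonstrate. In this sketch (with made-up numbers), five perfectly linear points give r = 1.0, and adding a single extreme point flips r all the way to a negative value:

```python
# Sketch: one extreme point can badly distort Pearson's r.

def pearson_r(xs, ys):
    """Pearson's r from the definitional formula."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]             # perfectly linear: r = 1.0
print(pearson_r(x, y))

x_out = x + [6]
y_out = y + [-40]                # one extreme outlier
print(pearson_r(x_out, y_out))   # r drops sharply, even turning negative
```

This is why it is worth plotting the data (and checking for outliers) before trusting a correlation coefficient.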
📊 Mathematical Understanding
🔹 What is Correlation Numerically?
Correlation is a measure of how two variables move in relation to each other. Numerically, it is the covariance of the two variables scaled by the product of their standard deviations.
General Relationship:
Correlation = Covariance(X, Y) / (Variability of X × Variability of Y)
🔹 Karl Pearson’s Correlation Coefficient (r)
This is the most widely used formula, developed by Karl Pearson. It measures the strength of a linear relationship between two variables X and Y.
Σ (xᵢ − x̄)(yᵢ − ȳ)
r = -----------------------------------------
√[ Σ (xᵢ − x̄)² × Σ (yᵢ − ȳ)² ]

Equivalently, using standardized (z) scores with sample standard deviations:

1
r = --------- Σ {((xᵢ − x̄)/SD(x)) × ((yᵢ − ȳ)/SD(y))}
(N − 1)
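The two forms above are algebraically the same, which a quick numerical check confirms (illustrative data; the z-score form uses sample standard deviations with divisor N − 1):

```python
# Sketch: the ratio form and the averaged-product-of-z-scores form
# of Pearson's r give identical results.
import math

x = [4, 8, 12, 16, 20]
y = [5, 9, 10, 15, 22]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Ratio form: Σ(x−x̄)(y−ȳ) / √[Σ(x−x̄)² × Σ(y−ȳ)²]
num = sum((a - mx) * (b - my) for a, b in zip(x, y))
den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                sum((b - my) ** 2 for b in y))
r1 = num / den

# z-score form: (1/(N−1)) Σ z_x × z_y, with sample SDs (divisor N−1)
sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
r2 = sum(((a - mx) / sx) * ((b - my) / sy)
         for a, b in zip(x, y)) / (n - 1)

print(abs(r1 - r2) < 1e-12)  # True
```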
Alternate Shortcut Formula (when mean not needed directly):
nΣxy − (Σx)(Σy)
r = -------------------------------------------------
√[nΣx² − (Σx)²] × √[nΣy² − (Σy)²]
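The shortcut form is convenient in code because it needs only running sums, never the means themselves. A minimal sketch with made-up numbers:

```python
# Sketch: Pearson's r from the shortcut (sums-only) formula.
import math

x = [2, 4, 6, 8]
y = [3, 7, 5, 10]
n = len(x)

# Running sums — no means required
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)

r = (n * sxy - sx * sy) / (
    math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
)
print(round(r, 4))  # 0.8214
```

This is handy for streaming data, where the five sums can be updated one pair at a time.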
Where:
- r = correlation coefficient (ranges from -1 to +1)
- n = number of observations
- x̄, ȳ = mean of X and Y values
- Σxy = sum of product of paired scores
🔹 Spearman’s Rank Correlation (ρ or rₛ)
Used when data is ordinal or not normally distributed. It tests how well the relationship can be described using a monotonic function.
6 Σ dᵢ²
ρ = 1 - -----------------
n(n^2 - 1)
Where:
- ρ or rₛ = Spearman correlation coefficient
- dᵢ = difference between the ranks of each pair (xᵢ rank - yᵢ rank)
- n = number of pairs
If there are no tied ranks, use the formula above. If ties exist, both X and Y are converted to ranks (tied values receive the average of their ranks) and Pearson's formula is applied to the ranks instead of the raw values.
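The no-ties case above can be sketched directly from the rank-difference formula. The data is illustrative, and the simple `ranks` helper below assumes no tied values (with ties, you would average the ranks and use Pearson's formula instead, as noted above):

```python
# Sketch: Spearman's rho via the rank-difference formula (no ties).

def ranks(values):
    """Rank 1 = smallest value; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

x = [35, 23, 47, 17, 10, 43, 9, 6, 28]
y = [30, 33, 45, 23, 8, 49, 12, 4, 31]

rx, ry = ranks(x), ranks(y)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # Σ dᵢ²
n = len(x)
rho = 1 - (6 * d2) / (n * (n ** 2 - 1))
print(round(rho, 3))  # 0.9
```

Because rho works on ranks, it is unaffected by any monotonic rescaling of the raw values, which is exactly why it suits ordinal or non-normal data.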
✅ Summary of All Formulas
| Type | Formula | When Used? |
|---|---|---|
| Pearson (r) | r = Σ(x-x̄)(y-ȳ) / √[Σ(x-x̄)² × Σ(y-ȳ)²] | Raw data, linear relationship |
| Spearman (ρ) | ρ = 1 - (6 Σd²) / [n(n²-1)] | Ranked or non-normal data |
| Covariance-Based | r = Cov(X,Y) / (σₓ × σᵧ) | Conceptual definition |