Correlation is a statistical measure that describes the strength and direction of the relationship between two variables. In data engineering and analytics, correlation is often used to identify patterns, trends, and dependencies between columns in a dataset.
In Amazon Redshift, analysts frequently need to calculate correlation to:
- Understand relationships between metrics
- Perform exploratory data analysis (EDA)
- Validate assumptions before modeling
- Build analytical dashboards
However, unlike many analytical databases, Amazon Redshift does not support the built-in CORR() function.
According to AWS documentation, the PostgreSQL CORR function is listed as unsupported in Redshift:
https://docs.aws.amazon.com/redshift/latest/dg/c_unsupported-postgresql-functions.html
As a result, correlation must be calculated manually using SQL.
This article explains how to calculate the Pearson correlation coefficient between two columns in Amazon Redshift using plain SQL, step by step.
What Is the Pearson Correlation Coefficient?
The Pearson correlation coefficient measures the linear relationship between two variables and returns a value between:
- 1 → perfect positive correlation
- 0 → no correlation
- -1 → perfect negative correlation
The mathematical formula is:
corr(X, Y) =
Σ((Xi − X̄)(Yi − Ȳ)) / √(Σ(Xi − X̄)² × Σ(Yi − Ȳ)²)
Where:
X̄is the average of column XȲis the average of column Y
We can implement this formula directly in SQL.
Calculating Correlation in Amazon Redshift Using SQL
Because Redshift lacks a native CORR() function, we compute the correlation coefficient manually using window functions and aggregations.
Step 1: Calculate Column Averages
First, calculate the average values for both columns using window functions so that the averages are available on every row.
WITH cte AS (
SELECT
column_1,
AVG(column_1::NUMERIC) OVER () AS avg_col1,
column_2,
AVG(column_2::NUMERIC) OVER () AS avg_col2
FROM schema.table
)
This Common Table Expression (CTE) calculates:
avg_col1: mean ofcolumn_1avg_col2: mean ofcolumn_2
Step 2: Apply the Pearson Correlation Formula
Next, use aggregation functions to compute the numerator and denominator of the Pearson formula.
SELECT
SUM((column_1 - avg_col1) * (column_2 - avg_col2)) AS numerator,
SQRT(SUM(POWER(column_1 - avg_col1, 2))) *
SQRT(SUM(POWER(column_2 - avg_col2, 2))) AS denominator,
CASE
WHEN
SUM((column_1 - avg_col1) * (column_2 - avg_col2)) = 0
OR
SQRT(SUM(POWER(column_1 - avg_col1, 2))) *
SQRT(SUM(POWER(column_2 - avg_col2, 2))) = 0
THEN 0
ELSE
SUM((column_1 - avg_col1) * (column_2 - avg_col2)) /
(
SQRT(SUM(POWER(column_1 - avg_col1, 2))) *
SQRT(SUM(POWER(column_2 - avg_col2, 2)))
)
END AS correlation
FROM cte;
Explanation of the Query
- Numerator
Computes the covariance between the two columns. - Denominator
Normalizes the covariance using the standard deviation of each column. - CASE statement
Prevents division by zero when:- One or both columns have no variance
- All values in a column are identical
The final result is the Pearson correlation coefficient between the two columns.
Handling NULL Values
Before calculating correlation, ensure that NULL values are removed:
FROM schema.table
WHERE column_1 IS NOT NULL
AND column_2 IS NOT NULL
Failing to filter NULLs can lead to incorrect results or reduced row counts.
Performance Considerations in Redshift
- This approach scales well for large tables because it relies on set-based operations
- For extremely large datasets, consider:
- Sampling
- Pre-aggregating data
- Running correlation on summary tables
Summary
Although Amazon Redshift does not support the CORR() function, it is still possible to calculate correlation between two columns using standard SQL. By translating the Pearson correlation formula into SQL expressions, you can perform correlation analysis directly inside Redshift without exporting data to external tools.
This method is useful for:
- Data analysis
- Reporting
- Dashboard metrics
- Exploratory analytics in Redshift
