Select Page

Correlation is a statistical measure that describes the strength and direction of the relationship between two variables. In data engineering and analytics, correlation is often used to identify patterns, trends, and dependencies between columns in a dataset.

In Amazon Redshift, analysts frequently need to calculate correlation to:

  • Understand relationships between metrics
  • Perform exploratory data analysis (EDA)
  • Validate assumptions before modeling
  • Build analytical dashboards

However, unlike many analytical databases, Amazon Redshift does not support the built-in CORR() function.

According to AWS documentation, the PostgreSQL CORR function is listed as unsupported in Redshift:
https://docs.aws.amazon.com/redshift/latest/dg/c_unsupported-postgresql-functions.html

As a result, correlation must be calculated manually using SQL.

This article explains how to calculate the Pearson correlation coefficient between two columns in Amazon Redshift using plain SQL, step by step.


What Is the Pearson Correlation Coefficient?

The Pearson correlation coefficient measures the linear relationship between two variables and returns a value between:

  • 1 → perfect positive correlation
  • 0 → no correlation
  • -1 → perfect negative correlation

The mathematical formula is:

corr(X, Y) =
Σ((Xi − X̄)(Yi − Ȳ)) / √(Σ(Xi − X̄)² × Σ(Yi − Ȳ)²)

Where:

  • is the average of column X
  • Ȳ is the average of column Y

We can implement this formula directly in SQL.


Calculating Correlation in Amazon Redshift Using SQL

Because Redshift lacks a native CORR() function, we compute the correlation coefficient manually using window functions and aggregations.


Step 1: Calculate Column Averages

First, calculate the average values for both columns using window functions so that the averages are available on every row.

WITH cte AS (
    SELECT
        column_1,
        AVG(column_1::NUMERIC) OVER () AS avg_col1,
        column_2,
        AVG(column_2::NUMERIC) OVER () AS avg_col2
    FROM schema.table
)

This Common Table Expression (CTE) calculates:

  • avg_col1: mean of column_1
  • avg_col2: mean of column_2

Step 2: Apply the Pearson Correlation Formula

Next, use aggregation functions to compute the numerator and denominator of the Pearson formula.

SELECT
    SUM((column_1 - avg_col1) * (column_2 - avg_col2)) AS numerator,
    SQRT(SUM(POWER(column_1 - avg_col1, 2))) *
    SQRT(SUM(POWER(column_2 - avg_col2, 2))) AS denominator,
    CASE
        WHEN
            SUM((column_1 - avg_col1) * (column_2 - avg_col2)) = 0
            OR
            SQRT(SUM(POWER(column_1 - avg_col1, 2))) *
            SQRT(SUM(POWER(column_2 - avg_col2, 2))) = 0
        THEN 0
        ELSE
            SUM((column_1 - avg_col1) * (column_2 - avg_col2)) /
            (
                SQRT(SUM(POWER(column_1 - avg_col1, 2))) *
                SQRT(SUM(POWER(column_2 - avg_col2, 2)))
            )
    END AS correlation
FROM cte;

Explanation of the Query

  • Numerator
    Computes the covariance between the two columns.
  • Denominator
    Normalizes the covariance using the standard deviation of each column.
  • CASE statement
    Prevents division by zero when:
    • One or both columns have no variance
    • All values in a column are identical

The final result is the Pearson correlation coefficient between the two columns.


Handling NULL Values

Before calculating correlation, ensure that NULL values are removed:

FROM schema.table
WHERE column_1 IS NOT NULL
  AND column_2 IS NOT NULL

Failing to filter NULLs can lead to incorrect results or reduced row counts.


Performance Considerations in Redshift

  • This approach scales well for large tables because it relies on set-based operations
  • For extremely large datasets, consider:
    • Sampling
    • Pre-aggregating data
    • Running correlation on summary tables

Summary

Although Amazon Redshift does not support the CORR() function, it is still possible to calculate correlation between two columns using standard SQL. By translating the Pearson correlation formula into SQL expressions, you can perform correlation analysis directly inside Redshift without exporting data to external tools.

This method is useful for:

  • Data analysis
  • Reporting
  • Dashboard metrics
  • Exploratory analytics in Redshift