Select Page

Overview

Ensuring data quality is a critical responsibility for data engineers, analytics engineers, and platform teams. One of the core data quality dimensions is consistency, which often manifests as format validation—verifying that values follow expected patterns.

In Google BigQuery, large datasets frequently contain values with inconsistent formats due to manual data entry, system integrations, or legacy migrations. Phone numbers, in particular, are prone to formatting issues.

This guide demonstrates how to detect non-conforming data formats in BigQuery using SQL, while safely masking sensitive values. The approach is scalable, privacy-aware, and suitable for production data quality checks.


Problem Statement: Validating Phone Number Formats

Assume your organization defines the following valid U.S. phone number formats:

  • nnn-nnn-nnnn
  • (nnn)nnn-nnnn

In practice, datasets often contain additional formats such as:

  • Numbers with extensions
  • Text embedded in phone numbers
  • International prefixes
  • Extra punctuation or whitespace

Your objective is to:

  • Identify records that do not conform to the accepted formats
  • Avoid exposing sensitive phone numbers
  • Provide enough context for remediation

Solution Approach

We will use Google BigQuery SQL to:

  1. Query a public dataset
  2. Apply pattern-based data masking
  3. Filter out records that match valid formats
  4. Surface only non-standard phone number entries

Dataset Used

This tutorial uses a BigQuery public dataset:

bigquery-public-data.sdoh_hud_housing.2017_lihtc_database_hud

Relevant Columns

ColumnDescription
hud_idHUD project identifier
projectProject name
co_telContact phone number (unstandardized)

Why Data Masking Matters in Data Quality Checks

Phone numbers are personally identifiable information (PII). Even when working with public datasets, exposing raw values is a bad practice.

Common data masking techniques include:

  • Redaction
  • Substitution
  • Shuffling
  • Encryption or tokenization
  • Masking with patterns

In this tutorial, we use pattern-based masking, which preserves the structure of the data while hiding actual values. This allows analysts to evaluate formats without seeing sensitive information.


BigQuery SQL Query

SELECT 
  hud_id,
  project,
  REGEXP_REPLACE(
    REGEXP_REPLACE(SAFE_CAST(co_tel AS STRING), '[A-Za-z]', 'x'),
    '\\d',
    'n'
  ) AS masked_phone_number,
  LENGTH(CAST(co_tel AS STRING)) AS length_phone_number
FROM `bigquery-public-data.sdoh_hud_housing.2017_lihtc_database_hud`
WHERE REGEXP_REPLACE(REGEXP_REPLACE(SAFE_CAST(co_tel AS STRING), '[A-Za-z]', 'x'), '\\d', 'n')
NOT IN ('nnn-nnn-nnnn', '(nnn)nnn-nnnn')


Here is the actual results of the above query:

Query Explanation

Step 1: Safe Casting to STRING

SAFE_CAST(co_tel AS STRING)
  • Ensures consistent string processing
  • Prevents query failures caused by malformed or NULL values

Step 2: Mask Alphabetic Characters

REGEXP_REPLACE(..., '[A-Za-z]', 'x')

Purpose:
Replaces all letters with x.

Example:

1-800-FLOWERS → 1-800-xxxxxxx

Step 3: Mask Numeric Digits

REGEXP_REPLACE(..., '\\d', 'n')

Purpose:
Replaces all digits (0–9) with n.

Example:

415-555-1234 → nnn-nnn-nnnn

Resulting Masked Output

ElementMasked As
Digitsn
Lettersx
SymbolsPreserved

This preserves the format structure while masking actual values.


Calculating Phone Number Length

LENGTH(CAST(`co_tel` AS STRING)) AS length_phone_number

This helps:

  • Identify extensions
  • Detect trailing text
  • Spot malformed or truncated values
  • Provide metadata for data remediation

Filtering Valid Formats

WHERE REGEXP_REPLACE(REGEXP_REPLACE(SAFE_CAST(co_tel AS STRING), '[A-Za-z]', 'x'), '\\d', 'n') NOT IN ('nnn-nnn-nnnn', '(nnn)nnn-nnnn')

This removes records that match the approved formats, leaving only unexpected or invalid entries.


Example Results

Original ValueMasked OutputIncluded
415-555-1234nnn-nnn-nnnn❌ No
(415)555-1234(nnn)nnn-nnnn❌ No
1-800-FLOWERSn-nnn-xxxxxxx✅ Yes
+1 (415) 555-1234 ext 9+n (nnn) nnn-nnnn xxx n✅ Yes

Type of Data Masking Used

This approach uses masking with patterns, not redaction.

Why this matters:

  • Sensitive values are hidden
  • Format consistency can still be evaluated
  • Results are safe to share with analysts and stakeholders

Practical Applications

This technique is commonly used for:

  • Data quality consistency checks
  • Phone number and email validation
  • Auditing public or shared datasets
  • PII-safe data profiling
  • Automated data quality monitoring in BigQuery

Key Takeaways

  • Data format validation is a fundamental data quality check
  • BigQuery regex functions enable scalable format analysis
  • Pattern-based masking balances privacy and usability
  • This method generalizes to ZIP codes, emails, IDs, and dates