AWS Glue PySpark transformation layer

Bronze change events become trusted Silver tables.

This repository owns the transformation layer between raw DMS files and the analytics-ready Silver schema. It reads full-load plus CDC Parquet files, resolves the latest current-state row, validates data quality, isolates bad records, and publishes operational evidence.

Glue 4.0, PySpark, CDC reconciliation, Quarantine, CloudWatch metrics, Silver star schema

The Glue layer is where raw history becomes queryable structure.

Why Glue matters: Bronze contains every source change. Glue turns that noisy event history into one current, validated row per business entity.

What it protects: Bad types, null keys, invalid statuses, and partition mistakes stop here instead of leaking into dbt and the analytics agent.

What comes next: Silver becomes the stable input for dbt Gold models. This page covers only the Glue responsibility and leaves downstream serving to the dbt layer.

CDC reconciliation

Multiple Bronze events collapse into one current Silver record.

Full-load plus CDC: The job reads every file, not only the newest batch, because the correct answer depends on the complete change history for each key.

Why bookmarks stay off: Glue job bookmarks would skip files that were already processed, and skipping older files would break reconciliation. The jobs overwrite Silver from a full canonical pass each run.

Delete handling: A row whose latest event is `D` is absent from Silver, matching the current state of the source table.
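
The collapse described above can be sketched in plain Python. This is a minimal illustration, not the repository's job code: the field names `order_id`, `op`, `commit_ts`, and `status` are assumptions, and a Spark job would more likely express the same idea by ranking events per key by commit time and keeping the top row.

```python
def reconcile(events, key="order_id", ts="commit_ts", op="op"):
    """Collapse a change history into one current row per key.

    Field names are hypothetical. `op` follows the DMS convention of
    "I", "U", "D", with None standing in for full-load rows.
    """
    latest = {}
    for event in events:
        current = latest.get(event[key])
        # Keep the newest event per key; on a tie the later input row wins.
        if current is None or event[ts] >= current[ts]:
            latest[event[key]] = event
    # A key whose newest event is a delete does not appear in the output.
    return [row for row in latest.values() if row[op] != "D"]


# Sample history: order 1 is updated, order 2 is inserted and then deleted.
events = [
    {"order_id": 1, "op": None, "commit_ts": "2024-01-01T00:00:00", "status": "new"},
    {"order_id": 2, "op": "I", "commit_ts": "2024-01-01T00:05:00", "status": "new"},
    {"order_id": 1, "op": "U", "commit_ts": "2024-01-02T00:00:00", "status": "paid"},
    {"order_id": 2, "op": "D", "commit_ts": "2024-01-03T00:00:00", "status": "new"},
]
silver = reconcile(events)
```

Only order 1 survives, in its `paid` state. Dropping the first two events from the input would still give the right answer here, but dropping only the delete for order 2 would not, which is why the job reads the full history.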

Six parallel jobs

Each entity has one focused PySpark job and one Silver output.

Dimension jobs: Customer and product tables stay small and are read in full, so no date partitioning is needed.

Fact jobs: Orders, order items, payments, and shipments are partitioned by integer year and month columns.
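
As a rough sketch of what integer year and month partition columns look like, the snippet below derives them from a date field and builds a Hive-style partition path. The `order_date` field and the `s3://example-silver/orders` prefix are hypothetical, chosen only for illustration.

```python
from datetime import date


def add_partition_columns(row, date_field="order_date"):
    """Return a copy of the row with integer year and month columns."""
    d = row[date_field]
    return {**row, "year": d.year, "month": d.month}


def partition_path(base, row):
    """Build a Hive-style path; integer columns carry no zero padding."""
    return f"{base}/year={row['year']}/month={row['month']}"


order = add_partition_columns({"order_id": 7, "order_date": date(2024, 3, 15)})
path = partition_path("s3://example-silver/orders", order)
```

Keeping the columns as integers rather than strings means a filter such as `month = 3` compares numbers and there is no ambiguity between `3` and `03`.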

Parallel design: The orchestration layer can run all six jobs together because there are no write conflicts between outputs.

Validation and observability

Clean rows go to Silver. Problem rows remain inspectable.

No silent loss: Invalid data is written separately with reason labels, so a bad record can be diagnosed instead of disappearing.
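
One way to picture the split is the plain-Python sketch below. The rule names, the `VALID_STATUSES` set, and the `order_id`, `status`, and `amount` fields are assumptions made for the example; the actual jobs may check different columns and use different labels.

```python
# Hypothetical set of allowed statuses, for illustration only.
VALID_STATUSES = {"pending", "paid", "shipped", "cancelled"}


def validate(row):
    """Return a list of reason labels; an empty list means the row is clean."""
    reasons = []
    if row.get("order_id") is None:
        reasons.append("null_key")
    if row.get("status") not in VALID_STATUSES:
        reasons.append("invalid_status")
    try:
        float(row.get("amount"))
    except (TypeError, ValueError):
        reasons.append("bad_type_amount")
    return reasons


def split_rows(rows):
    """Route each row to Silver or Quarantine; no row is dropped."""
    clean, quarantine = [], []
    for row in rows:
        reasons = validate(row)
        if reasons:
            quarantine.append({**row, "quarantine_reasons": reasons})
        else:
            clean.append(row)
    return clean, quarantine


rows = [
    {"order_id": 1, "status": "paid", "amount": "19.99"},
    {"order_id": None, "status": "paid", "amount": "5.00"},
    {"order_id": 3, "status": "unknown", "amount": "abc"},
]
clean, quarantine = split_rows(rows)
```

Every input row lands in exactly one of the two outputs, and a quarantined row keeps its original fields alongside all the reasons it failed, so one pass is enough to see everything wrong with it.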

Downstream trust: dbt and Athena read only Silver, while Quarantine remains a controlled exception path for investigation.

Observable runs: Freshness and row-count metrics let orchestration detect stale or empty outputs without scanning S3.
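
The sketch below shows how freshness and row-count values could be shaped and checked. The metric names `RowCount` and `FreshnessSeconds`, the `Table` dimension, and the 24-hour threshold are assumptions. The dictionaries follow the general shape of CloudWatch metric data entries, but nothing is sent to AWS here.

```python
from datetime import datetime, timedelta, timezone


def build_metrics(table, row_count, newest_event, now):
    """Describe one run as two metric entries (names are hypothetical)."""
    age_seconds = (now - newest_event).total_seconds()
    dims = [{"Name": "Table", "Value": table}]
    return [
        {"MetricName": "RowCount", "Dimensions": dims,
         "Value": float(row_count), "Unit": "Count"},
        {"MetricName": "FreshnessSeconds", "Dimensions": dims,
         "Value": age_seconds, "Unit": "Seconds"},
    ]


def is_healthy(metrics, max_age=timedelta(hours=24)):
    """Flag empty or stale outputs from the metrics alone."""
    values = {m["MetricName"]: m["Value"] for m in metrics}
    return (values["RowCount"] > 0
            and values["FreshnessSeconds"] <= max_age.total_seconds())


now = datetime(2024, 3, 2, 6, 0, tzinfo=timezone.utc)
newest = datetime(2024, 3, 2, 1, 0, tzinfo=timezone.utc)
metrics = build_metrics("orders", 1250, newest, now)
empty_run = build_metrics("orders", 0, newest, now)
stale_run = build_metrics("orders", 1250, now - timedelta(days=3), now)
```

Because the check reads only the published numbers, an orchestrator can decide whether to proceed without listing or reading any Silver files.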

Silver data model

The jobs create a practical star-schema foundation for dbt.

Analytics shape: Silver is structured so that facts join predictably to dimensions when answering revenue, product, customer, payment, and delivery questions.

Partition strategy: Date-heavy fact tables use year/month partitions so Athena scans less data for monthly and yearly business questions.
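
To make the scan reduction concrete, the hypothetical helper below lists the partitions that a query over a month range would need to touch. It illustrates the idea of partition pruning only; it is not how Athena is implemented.

```python
def partitions_for_range(start_year, start_month, end_year, end_month):
    """List the year/month partitions covering an inclusive month range."""
    parts = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        parts.append(f"year={year}/month={month}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return parts


winter = partitions_for_range(2023, 11, 2024, 2)
```

A question about those four months touches four partitions, however many years of history the table holds.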

dbt handoff: The next repo transforms these Silver facts and dimensions into Gold tables for stakeholder-facing analytics.

Delivery path

The CI pipeline avoids expensive Spark work until fast gates pass.

Fast gates first: Cheap checks catch style, type, unit, and security problems before pulling the large Glue Docker image.

Real runtime test: The integration test executes all six PySpark jobs inside the same Glue 4.0 runtime used by AWS.

Manual promotion: Deployment uses OIDC and environment approvals for staging and prod, with no long-lived AWS keys.

Repository mental model

Bronze history → Current state → Silver tables

The repo is a controlled conversion point: noisy DMS history in, clean analytics structures out.

Why overwrite Silver

Read all Bronze → Rebuild canonical table

Because CDC reconciliation depends on history, the safest result is a fresh canonical snapshot each run.

Where to read exact commands

HTML version, Markdown README