EDP platform-glue-jobs
AWS Glue PySpark transformation layer

Bronze change events become trusted Silver tables.

This repository owns the transformation layer between raw DMS files and the analytics-ready Silver schema. It reads full-load plus CDC Parquet files, resolves the latest current-state row, validates data quality, isolates bad records, and publishes operational evidence.

Glue 4.0 PySpark CDC reconciliation Quarantine CloudWatch metrics Silver star schema
Full platform context

The Glue layer is where raw history becomes queryable structure.

Bronze-to-Silver transformation path Glue is the transformation boundary: it reads raw CDC history from Bronze, reconciles current state, validates rows, and writes clean Silver tables. PostgreSQL OLTP source DMS CDC Full load + changes Bronze S3 Raw Parquet events Glue PySpark CDC reconcile Validate + partition Silver S3 Clean star schema dbt + Athena Gold models Gold Marts input output CloudWatch row counts + freshness Quarantine S3 invalid rows + reasons Glue repo stops at Silver dbt consumes Silver later to build Gold marts; analytics frontends are downstream of Gold, not part of this transformation layer.
Why Glue mattersBronze contains every source change. Glue turns that noisy event history into one current, validated row per business entity.
What it protectsBad types, null keys, invalid statuses, and partition mistakes stop here instead of leaking into dbt and the analytics agent.
What comes nextSilver becomes the stable input for dbt Gold models. This page keeps the Glue responsibility separate from downstream serving.
CDC reconciliation

Multiple Bronze events collapse into one current Silver record.

The core algorithm in lib/cdc.py DMS writes row history. Spark keeps the latest non-delete event per key and sends one current-state candidate to validation. Bronze history append-only events LOAD Op=I CDC update Op=U CDC delete Op=D 1. Read events LOAD + CDC 2. Window partition by key 3. Rank newest first Keep row latest Op is not D Drop row latest Op = D One current candidate per business key The validation gate decides whether the candidate can be written to Silver.
Full-load plus CDCThe job reads every file, not only the newest batch, because the correct answer depends on the complete change history for each key.
Why bookmarks stay offSkipping older files would break reconciliation. The jobs overwrite Silver from a full canonical pass each run.
Delete handlingA row whose latest event is `D` is absent from Silver, matching the current state of the source table.
Six parallel jobs

Each entity has one focused PySpark job and one Silver output.

The six transformation lanes The jobs can run in parallel because each owns an independent Bronze prefix and Silver target. Bronze source tables Glue job scripts Silver output tables customers dim_customer.py dim_customer products dim_product.py dim_product orders fact_orders.py fact_orders order_items + orders fact_order_items.py fact_order_items payments fact_payments.py fact_payments shipments fact_shipments.py fact_shipments Dimensions are unpartitioned. Facts partition by year and month for Athena pruning.
Dimension jobsCustomer and product tables stay small and are read in full, so no date partitioning is needed.
Fact jobsOrders, order items, payments, and shipments are partitioned by integer year and month columns.
Parallel designThe orchestration layer can run all six jobs together because there are no write conflicts between outputs.
Validation and observability

Clean rows go to Silver. Problem rows remain inspectable.

The quality gate before Silver Rows are classified before writing outputs: clean records become Silver; rejected records remain inspectable in quarantine. Current-state rows after CDC reconcile Schema check explicit StructType Business rules keys, statuses, amounts Quality decision valid vs invalid Silver S3 clean Parquet output Quarantine S3 invalid rows + reasons DataQuality SilverRowCount DataFreshness SilverDataAgeHours valid invalid
No silent lossInvalid data is written separately with reason labels, so a bad record can be diagnosed instead of disappearing.
Downstream trustdbt and Athena only read Silver, while Quarantine remains a controlled exception path for investigation.
Observable runsFreshness and row-count metrics let orchestration detect stale or empty outputs without scanning S3.
Silver data model

The jobs create a practical star-schema foundation for dbt.

Silver star schema produced by this repo Silver tables are clean enough for dbt to aggregate, but still close to operational business entities. dim_customer customer_id email, country, signup_date dim_product product_id brand, category, unit_price fact_orders order_id, customer_id order_year / order_month fact_order_items order_id, product_id line_total, quantity fact_payments order_id, amount payment_year / payment_month fact_shipments order_id, carrier shipped_year / shipped_month customer_id product_id through items order_id order_id order_id Partitioned facts let Athena skip irrelevant months before dbt builds Gold marts.
Analytics shapeSilver is structured so revenue, product, customer, payment, and delivery questions can be joined predictably.
Partition strategyDate-heavy fact tables use year/month partitions so Athena scans less data for monthly and yearly business questions.
dbt handoffThe next repo transforms these Silver facts and dimensions into Gold tables for stakeholder-facing analytics.
Delivery path

The CI pipeline avoids expensive Spark work until fast gates pass.

Code-to-Glue deployment path The large Glue 4.0 Docker integration test only runs after quick checks prove the change is worth testing end-to-end. Push / PR source change Lint + type check ruff + mypy Unit tests CDC + paths Security scan bandit Gate decision all fast checks pass Glue integration AWS Glue 4.0 Docker Runs all 6 jobs locally Deploy workflow Upload scripts + lib.zip Update Glue definitions AWS Glue
Fast gates firstCheap checks catch style, type, unit, and security problems before pulling the large Glue Docker image.
Real runtime testThe integration test executes all six PySpark jobs inside the same Glue 4.0 runtime used by AWS.
Manual promotionDeployment uses OIDC and environment approvals for staging and prod, with no long-lived AWS keys.

Repository mental model

Bronze history Current state Silver tables

The repo is a controlled conversion point: noisy DMS history in, clean analytics structures out.

Why overwrite Silver

Read all Bronze Rebuild canonical table

Because CDC reconciliation depends on history, the safest result is a fresh canonical snapshot each run.