Design a scalable data architecture for GCP that can handle large volumes of real-time data ingestion
Nov 11, 2024
Overview of Requirements
The architecture supports two types of payment priorities with distinct Service Level Objectives (SLOs):
- High-Priority Payments: SLO of 10 minutes
  - Requires near real-time ingestion, processing, and storage in BigQuery.
  - Uses Pub/Sub as the ingestion layer, with a Dataflow streaming pipeline for immediate processing.
- Standard-Priority Payments: SLO of 9 AM the next day
  - Less time-sensitive and suitable for batch processing.
  - Data is stored in Cloud SQL and processed periodically by a Dataflow batch pipeline.
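To make the two SLOs concrete, here is a minimal sketch that turns a payment's priority and arrival time into a deadline. The priority labels (`"high"`, `"standard"`), the function name, and the choice to evaluate 9 AM in the same timezone as the arrival timestamp are assumptions for illustration, not part of the original design.

```python
from datetime import datetime, time, timedelta, timezone

HIGH_PRIORITY_SLO = timedelta(minutes=10)  # from the requirements above
STANDARD_CUTOFF = time(9, 0)               # 9 AM the next day

def slo_deadline(priority: str, received_at: datetime) -> datetime:
    """Return the latest time by which the payment should be in BigQuery.

    Assumption: `priority` is either "high" or "standard" (hypothetical labels).
    """
    if priority == "high":
        return received_at + HIGH_PRIORITY_SLO
    if priority == "standard":
        next_day = received_at.date() + timedelta(days=1)
        # Assumption: the cutoff uses the same timezone as `received_at`.
        return datetime.combine(next_day, STANDARD_CUTOFF, tzinfo=received_at.tzinfo)
    raise ValueError(f"unknown priority: {priority!r}")

received = datetime(2024, 11, 11, 23, 30, tzinfo=timezone.utc)
print(slo_deadline("high", received))      # 10 minutes after arrival
print(slo_deadline("standard", received))  # 9 AM on the following day
```

A helper like this could also feed monitoring, for example by comparing the deadline with the time a row actually landed in BigQuery.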
Dataflow Pipeline Plan
1. Streaming Pipeline (for High-Priority Payments)
- Source: Connect the Dataflow streaming pipeline to the Pub/Sub subscription for high-priority payments.
- Transformations:
  - Parse incoming messages from Pub/Sub.
  - Perform any necessary transformations, such as currency conversion, data validation, and enrichment (e.g., adding metadata like transaction region).
  - Apply windowing and aggregation (if required) to calculate metrics, such as total transaction volume within time windows.
- Sink: Write the processed data directly to the BigQuery table (Finalized Payments Table) using BigQuery’s streaming ingestion (the legacy streaming insert API or the newer Storage Write API).
- Error Handling: Configure dead-letter topics or logging mechanisms to handle messages that fail processing, ensuring they do not disrupt the real-time flow.
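The per-message logic of the streaming steps above can be sketched in plain Python, independent of the runner. The message schema (`payment_id`, `amount`, `currency`, `event_ts`), the static rate table, and both function names are assumptions made for illustration. In an Apache Beam pipeline, the first function would typically sit inside a `DoFn` that emits failures to a tagged output bound for the dead-letter topic, and the second would be replaced by `beam.WindowInto` with `FixedWindows` followed by a sum.

```python
import json
from collections import defaultdict
from decimal import Decimal, InvalidOperation

# Hypothetical static rates; a real pipeline would look these up elsewhere.
RATES_TO_USD = {"USD": Decimal("1"), "EUR": Decimal("1.10")}
REQUIRED_FIELDS = ("payment_id", "amount", "currency", "event_ts")

def process_message(raw: bytes):
    """Parse, validate, convert, and enrich one Pub/Sub payload.

    Returns ("ok", row) on success, or ("dead_letter", {...}) on any failure,
    so a bad message is set aside instead of stopping the stream.
    """
    try:
        msg = json.loads(raw.decode("utf-8"))
        missing = [f for f in REQUIRED_FIELDS if f not in msg]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        amount = Decimal(str(msg["amount"]))
        if amount <= 0:
            raise ValueError("amount must be positive")
        rate = RATES_TO_USD[msg["currency"]]  # KeyError on unknown currency
        row = {
            "payment_id": msg["payment_id"],
            "amount_usd": amount * rate,
            "event_ts": int(msg["event_ts"]),        # epoch seconds (assumed)
            "region": msg.get("region", "unknown"),  # enrichment placeholder
        }
        return "ok", row
    except (ValueError, KeyError, TypeError, InvalidOperation) as exc:
        return "dead_letter", {"raw": raw, "error": repr(exc)}

def fixed_window_totals(rows, window_seconds=60):
    """Sum amount_usd per fixed window, keyed by window start (epoch seconds)."""
    totals = defaultdict(Decimal)
    for row in rows:
        window_start = row["event_ts"] - row["event_ts"] % window_seconds
        totals[window_start] += row["amount_usd"]
    return dict(totals)
```

Returning a tag alongside the payload keeps the happy path and the failure path separate, which is the same idea the dead-letter topic implements at the pipeline level.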
2. Batch Pipeline (for Standard-Priority Payments)
- Source: Pull data from Cloud SQL where standard-priority payments are stored.
- Transformations:
  - Extract data from Cloud SQL and perform any data cleansing or enrichment operations.
  - Filter or aggregate data as required by business logic before loading it into BigQuery.
- Sink: Use BigQuery Load Jobs to insert data into the Finalized Payments Table. This is suitable for batch operations as it avoids the cost of continuous streaming inserts.
- Scheduling: Schedule this pipeline to run nightly, starting early enough that the load completes before 9 AM, to meet the SLO for standard-priority payments.
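As a rough sketch of the batch steps, the example below extracts one time slice of standard-priority rows, applies simple cleansing, and serializes the result as newline-delimited JSON, a format BigQuery load jobs accept. It uses an in-memory `sqlite3` database purely as a stand-in for Cloud SQL. The table and column names (`payments`, `amount_cents`, `created_ts`) and both helper functions are hypothetical. The second helper estimates the latest safe start time given an assumed runtime and safety margin.

```python
import json
import sqlite3
from datetime import datetime, timedelta

def extract_to_ndjson(conn, since_ts: int, until_ts: int) -> str:
    """Return standard-priority payments in [since_ts, until_ts) as NDJSON."""
    cur = conn.execute(
        "SELECT payment_id, amount_cents, currency, created_ts FROM payments "
        "WHERE priority = 'standard' AND created_ts >= ? AND created_ts < ? "
        "ORDER BY created_ts",
        (since_ts, until_ts),
    )
    lines = []
    for payment_id, amount_cents, currency, created_ts in cur:
        if amount_cents is None or amount_cents <= 0:
            continue  # cleansing: skip rows with a missing or invalid amount
        lines.append(json.dumps({
            "payment_id": payment_id,
            "amount_cents": amount_cents,
            "currency": (currency or "").strip().upper(),  # normalize code
            "created_ts": created_ts,
        }))
    return "\n".join(lines)

def latest_start(deadline: datetime, expected_runtime: timedelta,
                 margin: timedelta) -> datetime:
    """Latest start time that still leaves `margin` before the deadline."""
    return deadline - expected_runtime - margin

# Stand-in database; in the real pipeline the source would be Cloud SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (payment_id TEXT, amount_cents INTEGER, "
             "currency TEXT, priority TEXT, created_ts INTEGER)")
conn.executemany("INSERT INTO payments VALUES (?, ?, ?, ?, ?)", [
    ("p1", 1000, " eur ", "standard", 100),
    ("p2", -5, "USD", "standard", 150),   # dropped by cleansing
    ("p3", 2000, "USD", "high", 160),     # handled by the streaming path
])
print(extract_to_ndjson(conn, 0, 200))
```

Bounding each run by a half-open time range keeps reruns from loading the same rows twice, provided each range is loaded only once.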