Design a scalable data architecture for GCP that can handle large volumes of real-time data ingestion

Biswanath Giri
Nov 11, 2024

Overview of Requirements

The architecture supports two types of payment priorities with distinct Service Level Objectives (SLOs):

  1. High-Priority Payments: SLO of 10 minutes
     • Requires near real-time ingestion, processing, and storage in BigQuery.
     • Uses Pub/Sub as the ingestion layer, with a Dataflow streaming pipeline for immediate processing.
  2. Standard-Priority Payments: SLO of 9 AM the next day
     • Less time-sensitive and suitable for batch processing.
     • Data is stored in Cloud SQL and processed periodically via a Dataflow batch pipeline.
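
To make the two SLOs concrete, here is a minimal sketch of how a deadline could be derived for each payment. The priority labels ("high", "standard"), the function name, and the reading of "next day" as the calendar day after receipt are assumptions for illustration, not part of the original design.

```python
from datetime import datetime, time, timedelta

HIGH_PRIORITY_SLO = timedelta(minutes=10)  # from the requirements above
STANDARD_CUTOFF = time(hour=9)             # 9 AM on the following day


def slo_deadline(priority: str, received_at: datetime) -> datetime:
    """Return the latest time by which a payment should be in BigQuery.

    The "high" / "standard" labels are hypothetical names for the two tiers.
    """
    if priority == "high":
        return received_at + HIGH_PRIORITY_SLO
    if priority == "standard":
        # Assumption: "next day" means the calendar day after receipt.
        next_day = received_at.date() + timedelta(days=1)
        return datetime.combine(next_day, STANDARD_CUTOFF,
                                tzinfo=received_at.tzinfo)
    raise ValueError(f"unknown priority: {priority!r}")


def route(priority: str) -> str:
    """Pick the ingestion path: Pub/Sub streaming or Cloud SQL batch."""
    return "pubsub-streaming" if priority == "high" else "cloudsql-batch"
```

A deadline computed this way can be stored alongside each record, so that late arrivals are easy to spot when monitoring either pipeline.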

Dataflow Pipeline Plan

1. Streaming Pipeline (for High-Priority Payments)

  • Source: Connect the Dataflow streaming pipeline to the Pub/Sub subscription for high-priority payments.
  • Transformations:
     • Parse incoming messages from Pub/Sub.
     • Perform any necessary transformations, such as currency conversion, data validation, and enrichment (e.g., adding metadata like transaction region).
     • Apply windowing and aggregation (if required) to calculate metrics, such as total transaction volume within time windows.
  • Sink: Write the processed data directly to the BigQuery table (Finalized Payments Table) using BigQuery’s streaming ingestion (the legacy streaming inserts API or the newer Storage Write API).
  • Error Handling: Configure dead-letter topics or logging mechanisms to handle messages that fail processing, ensuring they do not disrupt the real-time flow.
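
The streaming steps above can be sketched in plain Python. This is not Dataflow or Apache Beam code; it only illustrates the per-message logic (parse, validate, convert, enrich), the dead-letter routing, and a fixed-window total that a real pipeline would express as Beam transforms. The message fields (payment_id, amount, currency, region, event_ts), the static exchange rates, and the 60-second window are all assumptions made for the example.

```python
import json
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed fixed-window size
# Hypothetical static rates; a real pipeline would look these up.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1}


def process_message(raw: bytes):
    """Parse, validate, and enrich one Pub/Sub payload.

    Returns ("ok", record) on success, or ("dead_letter", info) on failure,
    so that a bad message never interrupts the main flow.
    """
    try:
        msg = json.loads(raw.decode("utf-8"))
        amount = float(msg["amount"])
        if amount <= 0:
            raise ValueError("amount must be positive")
        rate = RATES_TO_USD[msg["currency"]]  # KeyError if unsupported
        record = {
            "payment_id": msg["payment_id"],
            "amount_usd": round(amount * rate, 2),
            "region": msg.get("region", "unknown"),  # enrichment
            "event_ts": int(msg["event_ts"]),        # epoch seconds
        }
        return "ok", record
    except (ValueError, KeyError, TypeError) as exc:
        return "dead_letter", {"raw": raw, "error": repr(exc)}


def split_messages(messages):
    """Separate successfully processed records from dead-lettered ones."""
    good, dead = [], []
    for raw in messages:
        tag, payload = process_message(raw)
        (good if tag == "ok" else dead).append(payload)
    return good, dead


def window_totals(records):
    """Sum amount_usd per fixed window, keyed by window start time."""
    totals = defaultdict(float)
    for r in records:
        window_start = r["event_ts"] - r["event_ts"] % WINDOW_SECONDS
        totals[window_start] += r["amount_usd"]
    return dict(totals)
```

In an actual Dataflow job, the "dead_letter" branch would typically be published to a separate Pub/Sub topic or logged, as described in the Error Handling step.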

2. Batch Pipeline (for Standard-Priority Payments)

  • Source: Pull data from Cloud SQL where standard-priority payments are stored.
  • Transformations:
     • Extract data from Cloud SQL and perform any data cleansing or enrichment operations.
     • Filter or aggregate data as required by business logic before loading it into BigQuery.
  • Sink: Use BigQuery Load Jobs to insert data into the Finalized Payments Table. This is suitable for batch operations as it avoids the cost of continuous streaming inserts.
  • Scheduling: Schedule this pipeline to run nightly, starting early enough that the load completes before 9 AM, to meet the SLO for standard-priority payments.
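
As a rough illustration of the batch steps, the sketch below cleanses rows as they might come out of Cloud SQL, serializes them to newline-delimited JSON (one of the formats BigQuery load jobs accept), and works out the latest safe start time for the nightly run. The row fields, the deduplication rule, and the 30-minute safety margin are assumptions for the example; the code does not talk to Cloud SQL or BigQuery.

```python
import json
from datetime import datetime, timedelta


def cleanse(rows):
    """Drop incomplete rows, normalize fields, and dedupe by payment_id."""
    seen, out = set(), []
    for row in rows:
        pid = row.get("payment_id")
        if not pid or row.get("amount") is None or pid in seen:
            continue  # skip rows with no id, no amount, or a repeated id
        seen.add(pid)
        out.append({
            "payment_id": pid,
            "amount": float(row["amount"]),
            "currency": str(row.get("currency", "")).strip().upper(),
        })
    return out


def to_ndjson(rows):
    """Serialize rows as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in rows)


def latest_start(deadline: datetime, expected_runtime: timedelta,
                 safety_margin: timedelta = timedelta(minutes=30)) -> datetime:
    """Latest time the batch can start and still finish before the deadline.

    The default safety margin is an assumed buffer, not a recommendation.
    """
    return deadline - expected_runtime - safety_margin
```

For example, with a 9 AM deadline and an expected two-hour run, latest_start suggests starting no later than 6:30 AM, which leaves room for a retry if the first attempt fails early.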
