Google Cloud Architecture Framework — System Design Architecture guidelines
Introduction
In today’s digital landscape, designing robust and adaptable cloud systems is crucial for businesses of all sizes. The Google Cloud Architecture Framework (GCAF) empowers you to achieve this by providing a comprehensive set of best practices and recommendations. This blog post delves into the System Design pillar of the GCAF, equipping you with the knowledge to create secure, scalable, and high-performing systems on Google Cloud Platform (GCP).
What is System Design in the GCAF?
The System Design category serves as the foundation of the GCAF. It equips you with the essential tools and knowledge to design the architecture of your cloud systems. This includes:
- Defining System Components: Break down your system into well-defined modules, interfaces, and data structures that effectively meet your system requirements.
- Understanding Google Cloud Products: Explore Google Cloud’s extensive product portfolio to identify the ideal services and features that align with your specific design needs.
Key System Design Principles in the GCAF
- Security: Prioritize security throughout the entire design process. Implement robust authentication, authorization, and encryption mechanisms to safeguard your data and systems.
- Scalability: Design your system to accommodate growth and changing workloads. Leverage Google Cloud’s inherently scalable services to ensure your system can handle spikes in traffic or data volume.
- Reliability: Build fault-tolerant systems that can withstand failures and disruptions. Employ redundancy, load balancing, and self-healing mechanisms to maintain high availability.
- Performance: Optimize your system for speed and efficiency. Utilize caching, load balancing, and efficient data storage solutions to deliver exceptional user experiences.
- Cost-Effectiveness: Design with cost optimization in mind. Select appropriate Google Cloud services and pricing models that align with your budget and resource requirements.
Benefits of Following GCAF System Design Principles
- Enhanced Security: Mitigate security risks and protect your valuable data with Google Cloud’s built-in security features and best practices.
- Improved Scalability: Easily scale your systems up or down as needed, ensuring smooth operation during fluctuations in demand.
- Increased Reliability: Experience minimal downtime and disruptions with fault-tolerant architectures designed for resilience.
- Optimized Performance: Deliver fast and responsive applications, enhancing user satisfaction.
- Cost Efficiency: Reduce cloud expenditures by selecting cost-effective services and optimizing resource utilization.
The details of the 5 pillars that reside on top of the system design layer can be found here:
- Operational Excellence (how to run, manage, and monitor systems that deliver business value) — https://cloud.google.com/architecture/framework/operational-excellence
- Security, privacy, and compliance (how to architect and operate secure services on Google Cloud. You also learn about Google Cloud products and features that support security and compliance) — https://cloud.google.com/architecture/framework/security
- Reliability (how to architect and operate reliable services on a cloud platform. You also learn about some of the Google Cloud products and features that support reliability) — https://cloud.google.com/architecture/framework/reliability
- Cost optimization (how to optimize the cost of workloads in Google Cloud) — https://cloud.google.com/architecture/framework/cost-optimization
- Performance optimization (best practices to optimize the performance of workloads in Google Cloud) — https://cloud.google.com/architecture/framework/performance-optimization
In the system design category of the Architecture Framework, you learn to do the following:
- Apply core principles of system design.
- Select geographic regions to support your business applications.
- Manage cloud resources.
- Choose and manage compute.
- Design your network infrastructure.
- Select and implement a storage strategy.
- Optimize your database.
- Analyze your data.
- Implement machine learning.
- Design your cloud workloads for sustainability.
1. Core principles of system design
This document in the Google Cloud Architecture Framework describes the core principles of system design. A robust system design is secure, reliable, scalable, and independent. It lets you make iterative and reversible changes without disrupting the system, minimize potential risks, and improve operational efficiency. To achieve a robust system design, we recommend that you follow four core principles.
Document everything
When you start to move your workloads to the cloud or build your applications, a major blocker to success is lack of documentation of the system. Documentation is especially important for correctly visualizing the architecture of your current deployments.
A properly documented cloud architecture establishes a common language and standards, which enable cross-functional teams to communicate and collaborate effectively. It also provides the information that’s necessary to identify and guide future design decisions. Documentation should be written with your use cases in mind, to provide context for the design decisions.
Over time, your design decisions will evolve and change. The change history provides the context that your teams require to align initiatives, avoid duplication, and measure performance changes effectively over time. Change logs are particularly valuable when you onboard a new cloud architect who is not yet familiar with your current system design, strategy, or history.
Simplify your design and use fully managed services
Simplicity is crucial for system design. If your architecture is too complex to understand, it will be difficult to implement the design and manage it over time. Where feasible, use fully managed services to minimize the risks, time, and effort associated with managing and maintaining baseline systems.
If you’re already running your workloads in production, test with managed services to see how they might help to reduce operational complexities. If you’re developing new workloads, then start simple, establish a minimal viable product (MVP), and resist the urge to over-engineer. You can identify exceptional use cases, iterate, and improve your systems incrementally over time.
Decouple your architecture
Decoupling is a technique that’s used to separate your applications and service components into smaller components that can operate independently. For example, you might break up a monolithic application stack into separate service components. In a decoupled architecture, an application can run its functions independently, regardless of the various dependencies.
A decoupled architecture gives you increased flexibility to do the following:
- Apply independent upgrades.
- Enforce specific security controls.
- Establish reliability goals for each subsystem.
- Monitor health.
- Granularly control performance and cost parameters.
You can start decoupling early in your design phase or incorporate it as part of your system upgrades as you scale.
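To make the decoupling idea concrete, here is a minimal Python sketch that publishes an order event to a Pub/Sub topic instead of calling downstream services directly. The project ID and topic name are illustrative assumptions, and the topic is assumed to already exist.

```python
# Minimal decoupling sketch: the order service emits an event to Pub/Sub and lets
# independent subscribers (billing, inventory, analytics) consume it on their own.
# The project ID and topic name are illustrative assumptions.
import json

from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "orders")  # assumed to exist


def publish_order(order: dict) -> str:
    """Publish an order event instead of calling downstream services synchronously."""
    future = publisher.publish(topic_path, json.dumps(order).encode("utf-8"))
    return future.result()  # server-assigned message ID


if __name__ == "__main__":
    print(publish_order({"order_id": "1234", "sku": "ABC-1", "quantity": 2}))
```

Each consumer of the topic can then be upgraded, secured, monitored, and scaled independently of the publisher, which is exactly the flexibility listed above.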
Use a stateless architecture
A stateless architecture can increase both the reliability and scalability of your applications.
Stateful applications rely on various dependencies to perform tasks, such as locally cached data. Stateful applications often require additional mechanisms to capture progress and restart gracefully. Stateless applications can perform tasks without significant local dependencies by using shared storage or cached services. A stateless architecture enables your applications to scale up quickly with minimum boot dependencies. The applications can withstand hard restarts, have lower downtime, and provide better performance for end users.
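As a minimal sketch of this distinction (assuming a Memorystore for Redis instance reachable at a private IP), the snippet below keeps session state in a shared cache rather than in the memory of any single application instance, so any replica can serve any request and instances can be restarted or scaled freely.

```python
# Minimal stateless-application sketch: session state lives in a shared cache
# (for example, Memorystore for Redis) instead of local process memory.
# The host IP and key prefix are illustrative assumptions.
import json
from typing import Optional

import redis  # pip install redis

cache = redis.Redis(host="10.0.0.3", port=6379)  # Memorystore private IP (example)


def save_session(session_id: str, data: dict, ttl_seconds: int = 3600) -> None:
    # Because state lives in the shared cache, any replica can handle the next request.
    cache.setex(f"session:{session_id}", ttl_seconds, json.dumps(data))


def load_session(session_id: str) -> Optional[dict]:
    raw = cache.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```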
The system design category describes recommendations to make your applications stateless or to utilize cloud-native features to improve capturing machine state for your stateful applications.
2. Choose Google Cloud deployment archetypes
This document in the Google Cloud Architecture Framework describes six deployment archetypes — zonal, regional, multi-regional, global, hybrid, and multicloud — that you can use to build architectures for your cloud workloads based on your requirements for availability, cost, performance, and operational efficiency.
What is a deployment archetype?
A deployment archetype is an abstract, provider-independent model that you use as the foundation to build application-specific deployment architectures that meet your business and technical requirements. Each deployment archetype specifies a combination of failure domains where an application can run. These failure domains can be one or more Google Cloud zones or regions, and they can extend to include your on-premises data centers or failure domains in other cloud providers.
The following diagram shows six applications deployed in Google Cloud. Each application uses a deployment archetype that meets its specific requirements.
As the preceding diagram shows, in an architecture that uses the hybrid or multicloud deployment archetype, the cloud topology is based on one of the basic archetypes: zonal, regional, multi-regional, or global. In this sense, the hybrid and multicloud deployment archetypes can be considered as composite deployment archetypes that include one of the basic archetypes.
Note: Deployment archetypes are different from location scopes. The location scope of a Google Cloud resource defines its availability boundary. For example, the location scope of a Compute Engine VM is zonal. This means that if the Google Cloud zone in which a VM is provisioned has an outage, the availability of the VM is affected. However, by distributing VMs across multiple zones, you can build a highly available architecture that’s based on the regional deployment archetype.
Choosing a deployment archetype helps to simplify subsequent decisions regarding the Google Cloud products and features that you should use. For example, for a highly available containerized application, if you choose the regional deployment archetype, then regional Google Kubernetes Engine (GKE) clusters are more appropriate than zonal GKE clusters.
When you choose a deployment archetype for an application, you need to consider tradeoffs between factors like availability, cost, and operational complexity. For example, if an application serves users in multiple countries and needs high availability, you might choose the multi-regional deployment archetype. But for an internal application that’s used by employees in a single geographical region, you might prioritize cost over availability and, therefore, choose the regional deployment archetype.
Overview of the deployment archetypes
The framework documentation defines each deployment archetype and summarizes its use cases and design considerations. The archetypes are:
- Zonal
- Regional
- Multi-regional
- Global
- Hybrid
- Multicloud
Use best practices to deploy your system based on geographic requirements
Zonal
Your application runs within a single Google Cloud zone.
Use cases
- Development and test environments.
- Applications that don’t need high availability.
- Low-latency networking between application components.
- Migrating commodity workloads.
- Applications that use license-restricted software.
Design considerations
- Downtime during zone outages.
- For business continuity, you can provision a passive replica of the application in another zone in the same region. If a zone outage occurs, you can restore the application to production by using the passive replica.
For details of the regional, multi-regional, global, hybrid, and multicloud archetypes, see the deployment archetypes documentation in the Architecture Framework.
3. Managing cloud resources
Google Cloud’s resource hierarchy lets you manage common aspects of your resources, such as access control, configuration settings, and policies.
pic credit — https://cloud.google.com/resource-manager/docs/cloud-platform-resource-hierarchy
- Use tags and labels at the outset of your project (see the sketch after this list)
- Set up organization policies such as naming conventions, audits, etc.
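A minimal sketch of the labeling recommendation above, assuming the google-cloud-storage client and a hypothetical bucket: labels such as team, environment, and cost center make billing reports and policy automation easier later on.

```python
# Minimal sketch: label a Cloud Storage bucket so that cost reports and policies
# can be filtered by team and environment. The bucket name and label values are
# illustrative assumptions.
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
bucket = client.get_bucket("example-analytics-bucket")  # hypothetical bucket
bucket.labels = {"team": "data-platform", "env": "prod", "cost-center": "cc-1234"}
bucket.patch()  # persist the label changes
```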
4. Evaluate Google Cloud compute options
- Virtual machines (VMs) with cloud-specific benefits like live migration.
- Bin-packing of containers on cluster machines that can share CPUs.
- Functions and serverless approaches, where your use of CPU time can be metered to the work performed during a single HTTP request.
screen grab of compute options from https://cloud.google.com/architecture/framework/system-design/compute
Choose a compute migration approach
screen grab of compute migration approaches from https://cloud.google.com/architecture/framework/system-design/compute
Best practices for designing workloads
- Evaluate serverless options for simple logic (see the sketch after this list)
- Decouple your applications to be stateless
- Use caching logic when you decouple architectures
- Use live migrations to facilitate upgrades
More best practices to support your system:
- Design workloads to scale — for example, use startup and shutdown scripts for stateful applications
- Manage operations to support your system — for example, use snapshots for instance backups
- Manage capacity, reservations, and isolation — for example, use committed-use discounts to reduce costs
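For the "evaluate serverless options for simple logic" recommendation above, here is a hedged sketch of a small HTTP handler built with the open source Functions Framework, which can be deployed to Cloud Functions or Cloud Run rather than a VM you manage. The endpoint name and payload fields are illustrative assumptions.

```python
# Hedged sketch of "serverless for simple logic": a small HTTP handler built with
# the Functions Framework. CPU is metered only while a request is being handled.
# The endpoint name and payload fields are illustrative assumptions.
import functions_framework  # pip install functions-framework
from flask import Request


@functions_framework.http
def convert_temperature(request: Request):
    """Stateless request/response logic with no VM or container to manage directly."""
    payload = request.get_json(silent=True) or {}
    celsius = float(payload.get("celsius", 0))
    return {"fahrenheit": celsius * 9 / 5 + 32}
```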
5. Design network infrastructure
- Network design helps you optimize for performance and secure application communications with internal and external services
- Google’s private network connects regional locations to more than 100 global network points of presence
- Google Cloud Virtual Private Cloud (VPC) provides networking functionality to Compute Engine virtual machine (VM) instances, Google Kubernetes Engine (GKE) containers, and serverless workloads.
- Google ensures content is delivered with high throughput by using technologies like Bottleneck Bandwidth and Round-trip propagation time (BBR) congestion-control intelligence.
Best practices for designing workload VPC architectures to support your system:
- Consider VPC network design early
- Start with a single VPC network
- Keep VPC network topology simple to ensure a manageable, reliable, and well-understood architecture
- Use VPC networks in custom mode. To ensure that Google Cloud networking integrates seamlessly with your existing networking systems, we recommend that you use custom mode when you create VPC networks. Using custom mode helps you integrate Google Cloud networking into existing IP address management schemes and it lets you control which cloud regions are included in the VPC.
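As a hedged illustration of the custom mode recommendation, the sketch below creates a VPC with auto_create_subnetworks disabled using the google-cloud-compute client library. The project and network names are assumptions, and subnets would then be defined explicitly per region.

```python
# Hedged sketch: create a custom mode VPC (no auto-created subnets) so that IP
# ranges and regions stay under your control. Project and network names are
# illustrative assumptions; subnets would be created separately per region.
from google.cloud import compute_v1  # pip install google-cloud-compute


def create_custom_mode_vpc(project_id: str, network_name: str) -> None:
    network = compute_v1.Network(
        name=network_name,
        auto_create_subnetworks=False,  # custom mode: you define every subnet
    )
    operation = compute_v1.NetworksClient().insert(
        project=project_id, network_resource=network
    )
    operation.result()  # block until the VPC exists


create_custom_mode_vpc("my-project", "prod-custom-vpc")
```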
Best practices for designing inter-VPC connectivity to support your system
- Choose a VPC connection method. To implement multiple VPC networks, you need to connect those networks. VPC networks are isolated tenant spaces within Google’s Andromeda software-defined network (SDN). Choose how you connect your network based on your bandwidth, latency, and service level agreement (SLA) requirements.
- Use Shared VPC to administer multiple working groups
- Use simple naming conventions to understand the purpose of each resource, where it’s located, and how it’s differentiated
- Use connectivity tests to verify network security
- Use Private Service Connect to create private endpoints
- Secure and limit external connectivity
- Use Network Intelligence Center to monitor your cloud networks
6. Implement a storage strategy
Cloud Storage provides reliable, secure object storage services.
screen grab from https://cloud.google.com/architecture/framework/system-design/storage
Best practices for choosing a storage type to support your system
screen grab illustrating storage strategy options from https://cloud.google.com/architecture/framework/system-design/storage
- Choose active or archival storage based on storage access needs
A storage class is a piece of metadata that is used by every object. For data that is served at a high rate with high availability, use the Standard storage class. For data that is infrequently accessed, use the Nearline, Coldline, or Archive storage class (see the lifecycle sketch after this list).
- Evaluate storage location and data protection needs for Cloud Storage
For a Cloud Storage bucket located in a region, the data that it contains is automatically replicated across zones within the region. For dual-region and multi-region buckets, data is also replicated across multiple, geographically separated data centers.
- Use Cloud CDN to improve static object delivery
Cloud CDN uses the Cloud Load Balancing external Application Load Balancer to provide routing, health checking, and anycast IP address support.
- Use best practices to select the optimal storage access pattern and workload type; for example, use Persistent Disk to support high-performance storage access
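Tying the storage class guidance above to code, here is a minimal sketch (bucket name, location, and age threshold are assumptions) that creates a Standard storage bucket with a lifecycle rule that moves objects to Nearline once they are 30 days old.

```python
# Minimal sketch of the storage-class recommendation: start in Standard storage
# and let a lifecycle rule move objects to Nearline after 30 days. The bucket
# name, location, and age threshold are illustrative assumptions.
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
bucket = client.bucket("example-reports-bucket")  # hypothetical bucket
bucket.storage_class = "STANDARD"
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)

new_bucket = client.create_bucket(bucket, location="us-central1")
print(list(new_bucket.lifecycle_rules))
```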
Storage management best practices
- Assign unique names to every bucket — see the bucket naming guidelines and Object naming guidelines.
- Keep Cloud Storage buckets private — Overview of access control
- Assign random object names to distribute load evenly
- To prevent public access at the organization, folder, project, or bucket level, use public access prevention
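A minimal sketch of the last two recommendations, assuming an existing bucket: it enforces uniform bucket-level access and public access prevention so that objects cannot be made publicly readable.

```python
# Minimal sketch: keep a Cloud Storage bucket private by enforcing uniform
# bucket-level access and public access prevention. The bucket name is an
# illustrative assumption.
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
bucket = client.get_bucket("example-private-bucket")  # hypothetical bucket
bucket.iam_configuration.uniform_bucket_level_access_enabled = True
bucket.iam_configuration.public_access_prevention = "enforced"
bucket.patch()
```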
7. Optimize your database
Google Cloud offers a multitude of database services, as listed in the table below.
screen grab of key GC database services from https://cloud.google.com/architecture/framework/system-design/databases
Best practices for choosing a database to support your system
- Consider using a managed database service and evaluate Google Cloud managed database services before you install your own database or database cluster. Installing your own database involves maintenance overhead including installing patches and updates, and managing daily operational activities like monitoring and performing backups.
To migrate databases, use one of the products described in the following table:
screen grab of key GC migration services from https://cloud.google.com/architecture/framework/system-design/databases
- Choose an appropriate migration strategy
screen grab of migration options from https://cloud.google.com/architecture/framework/system-design/databases
- Use Memorystore to support your caching database layer. Memorystore is a fully managed Redis and Memcached service that supports sub-millisecond latency. Memorystore is fully compatible with open source Redis and Memcached.
- Use Bare Metal Solution to run an Oracle database. This approach fits a lift-and-shift migration strategy.
- Use migration as an opportunity to modernize your database and prepare it to support future business needs.
- Use fixed databases with off-the-shelf applications. Commercial off-the-shelf (COTS) applications often require a lift-and-shift migration approach
- Verify your team’s database migration skill set. Use Google Cloud Partner Advantage to find a partner to support you throughout your migration journey.
- Design your databases to meet high availability (HA) and disaster recovery (DR) requirements, and evaluate the tradeoffs between reliability and cost
- Specify cloud regions to support data residency (i.e. where your data physically resides at rest)
- Include disaster recovery in data residency design — refer to 100% reliability is the wrong target and Disaster recovery planning guide.
- Make your database Google Cloud-compliant
Encryption
- Determine encryption requirements based on company security policies and compliance requirements
- Choose an encryption key strategy. Options include Google-managed encryption keys and customer-managed encryption keys (CMEK) through Cloud Key Management Service (Cloud KMS).
Database design and scaling
- Use monitoring metrics to assess scaling needs; see Monitoring Cloud SQL instances, Monitoring with Cloud Monitoring, and Monitoring an instance.
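As a hedged sketch of assessing scaling needs from metrics, the snippet below reads the last hour of Cloud SQL CPU utilization through the Cloud Monitoring API; the project ID is an assumption, and the metric type is the documented cloudsql.googleapis.com/database/cpu/utilization.

```python
# Hedged sketch: pull the last hour of Cloud SQL CPU utilization from Cloud
# Monitoring to decide whether an instance needs to be scaled up. The project ID
# is an illustrative assumption.
import time

from google.cloud import monitoring_v3  # pip install google-cloud-monitoring

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

series = client.list_time_series(
    request={
        "name": "projects/my-project",
        "filter": 'metric.type = "cloudsql.googleapis.com/database/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for ts in series:
    latest = ts.points[0].value.double_value  # points are returned newest first
    print(ts.resource.labels.get("database_id"), f"{latest:.1%}")
```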
Networking and access
- Run databases inside a private network
Run your databases inside your private network and grant restricted access only from the clients who need to interact with the database. You can create Cloud SQL instances inside a VPC. Google Cloud also provides VPC Service Controls for Cloud SQL, Spanner, and Bigtable databases to ensure restricted access
- Grant minimum privileges to users
Identity and Access Management (IAM) controls access to Google Cloud services, including database services.
Automation and right-sizing
- Define database instances as code, which lets you apply a consistent and repeatable approach to creating and updating your databases.
- Use Liquibase to version control your database. Google database services like Cloud SQL and Cloud Spanner support Liquibase. Liquibase helps you to track your database schema changes, roll back schema changes, and perform repeatable migrations.
- Test and tune your database to support scaling
- Choose the right database for your scaling requirements
screen grab of GC db scaling options from https://cloud.google.com/architecture/framework/system-design/databases
Operations — Use Cloud Monitoring to monitor and set up alerts for your database
Licensing — Select between on-demand licenses and existing licenses
8. Analyze your data
- Google Cloud provides you with various services that help you through the entire data lifecycle, from data ingestion through reports and visualization.
- Most of these services are fully managed, and some are serverless. You can also build and manage a data-analytics environment on Compute Engine VMs, for example to self-host Apache Hadoop or Apache Beam.
screen grab of GC cloud analytics services from https://cloud.google.com/architecture/framework/system-design/data-analytics
Data Lifecycle
As part of your system design, you can group the Google Cloud data analytics services around the data lifecycle:
- Ingestion includes services such as Pub/Sub, Storage Transfer Service, Transfer Appliance, and BigQuery.
- Storage includes services such as Cloud Storage, Bigtable, Memorystore, and BigQuery.
- Processing and transformation includes services such as Dataflow, Dataproc, Dataprep, Sensitive Data Protection, and BigQuery.
- Analysis and warehousing includes services such as BigQuery.
- Reporting and visualization includes services such as Looker Studio and Looker.
The following stages and services run across the entire data lifecycle:
- Data integration includes services such as Data Fusion.
- Metadata management and governance includes services such as Data Catalog.
- Workflow management includes services such as Cloud Composer.
Data Ingestion best practices
- Determine the data source for ingestion. Data typically comes from another cloud provider or service (use Cloud Data Fusion, Storage Transfer Service, or BigQuery Data Transfer Service) or from an on-premises location (use Cloud Data Fusion; for large volumes of data, you can use Transfer Appliance or Storage Transfer Service)
- Consider how you want to process your data after you ingest it. For example, Storage Transfer Service only writes data to a Cloud Storage bucket, and BigQuery Data Transfer Service only writes data to a BigQuery dataset. Cloud Data Fusion supports multiple destinations.
- Identify streaming or batch data sources. For example, if you run a global streaming service that has low-latency requirements, you can use Pub/Sub. If you need your data for analytics and reporting, you can stream data into BigQuery (see the sketch after this list). If you need to stream data from a system like Apache Kafka in an on-premises or other cloud environment, use the Kafka to BigQuery Dataflow template.
- Ingest data with automated tools. For example, Cloud Data Fusion provides connectors and plugins to bring in data from external sources with a drag-and-drop GUI. If your teams want to write some code, Dataflow or BigQuery can help to automate data ingestion. Pub/Sub supports both low-code and code-first approaches. To ingest data into storage buckets, use gsutil for data sizes of up to 1 TB. To ingest amounts of data larger than 1 TB, use Storage Transfer Service.
- Use migration tools to ingest from another data warehouse. If you need to migrate from another data warehouse system, such as Teradata, Netezza, or Redshift, you can use the BigQuery Data Transfer Service migration assistance
- Estimate your data ingestion needs. The volume of data that you need to ingest helps you to determine which service to use in your system design. For streaming ingestion of data, Pub/Sub scales to tens of gigabytes per second.
- Use appropriate tools to regularly ingest data on a schedule. Storage Transfer Service and BigQuery Data Transfer Service both let you schedule ingestion jobs.
- Review FTP/SFTP server data ingest needs. If you need a code-free environment to ingest data from an FTP/SFTP server, you can use the FTP copy plugins.
- Use Apache Kafka connectors to ingest data. If you use Pub/Sub, Dataflow, or BigQuery, you can ingest data using one of the Apache Kafka connectors
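A minimal sketch of streaming rows into BigQuery for analytics, as mentioned above. The project, dataset, table, and field names are assumptions, and the table is assumed to already exist with a matching schema.

```python
# Minimal sketch: stream events into an existing BigQuery table for analytics and
# reporting. Project, dataset, table, and field names are illustrative assumptions.
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()
table_id = "my-project.analytics.page_views"  # hypothetical table

rows = [
    {"user_id": "u-1", "page": "/pricing", "ts": "2024-05-01T12:00:00Z"},
    {"user_id": "u-2", "page": "/docs", "ts": "2024-05-01T12:00:05Z"},
]
errors = client.insert_rows_json(table_id, rows)  # streaming insert
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")
```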
Data storage
Apply the following data storage best practices to your own environment.
screen grab illustrating data storage usecase from https://cloud.google.com/architecture/framework/system-design/data-analytics
Data Processing and Data Transformation
- Explore the open source software you can use in Google Cloud. Dataproc is a Hadoop-compatible managed service that lets you host open source software, with little operational burden. Dataproc includes support for Spark, Hive, Pig, Presto, and Zookeeper.
- Determine your ETL or ELT data-processing needs. Google Cloud lets you use either traditional ETL or more modern ELT data-processing systems.
- Use the appropriate framework for your data use case. For a batch data processing system, you can process and transform data in BigQuery with a familiar SQL interface (see the sketch after this list). If you have an existing pipeline that runs on Apache Hadoop or Spark on-premises or in another public cloud, you can use Dataproc. If you have analytics and SQL-focused teams and capabilities, you can also stream data into BigQuery. For real-time use cases, use Dataflow.
- Retain future control over your execution engine. To minimize vendor lock-in and to be able to use a different platform in the future, use the Apache Beam programming model and Dataflow as a managed serverless solution
- Use Dataflow to ingest data from multiple sources. To ingest data from multiple sources, such as Pub/Sub, Cloud Storage, HDFS, S3, or Kafka, use Dataflow
- Discover, identify, and protect sensitive data. Use Sensitive Data Protection to perform actions such as to scan BigQuery data or de-identify and re-identify PII in large-scale datasets.
- Modernize your data transformation processes. Use Dataform to write data transformations as code and to start to use version control by default.
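To make the ELT pattern above concrete, here is a minimal sketch that runs the transformation inside BigQuery with SQL after the raw data has landed. The dataset and table names are illustrative assumptions.

```python
# Minimal ELT sketch: transform raw events inside BigQuery with SQL and write the
# result to a curated table. Dataset and table names are illustrative assumptions.
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()
sql = """
    CREATE OR REPLACE TABLE `my-project.analytics.daily_page_views` AS
    SELECT DATE(ts) AS day, page, COUNT(*) AS views
    FROM `my-project.analytics.page_views`
    GROUP BY day, page
"""
client.query(sql).result()  # the heavy lifting happens inside BigQuery
```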
Data analytics and warehouses best practices
- Review your data storage needs. Data lakes and data warehouses aren’t mutually exclusive. Data lakes are useful for unstructured and semi-structured data storage and processing. Data warehouses are best for analytics and BI.
- Identify opportunities to migrate from a traditional data warehouse to BigQuery. For more information and example scenarios, see Migrating data warehouses to BigQuery.
- Plan for federated access to data. Identify your data federation needs (a federated virtual database pulls data from a range of sources and presents it through a common model), and create an appropriate system design. For example, BigQuery lets you define external tables that can read data from other sources, such as Bigtable, Cloud SQL, Cloud Storage, or Google Drive.
- Use BigQuery flex slots to provide on-demand burst capacity. These flex slots help you when there’s a period of high demand or when you want to complete an important analysis.
- Understand schema differences if you migrate to BigQuery. BigQuery supports both star and snowflake schemas, but by default it uses nested and repeated fields.
Reports and visualization
- Use BigQuery BI Engine to visualize your data
- Modernize your BI processes with Looker. Looker is a modern, enterprise platform for BI, data applications, and embedded analytics. If you have existing BI processes and tools, we recommend that you modernize and use a central platform such as Looker.
- Use workflow management tools to manage and maintain end-to-end data pipelines. Cloud Composer is a fully managed workflow management tool based on the open source Apache Airflow project.
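As a hedged sketch of orchestration with Cloud Composer, the DAG below schedules a daily BigQuery rollup using the Google provider package available in Composer 2 (Airflow 2.x) images. The DAG ID, schedule, and SQL are illustrative assumptions.

```python
# Hedged sketch of a Cloud Composer (Apache Airflow 2.x) DAG that runs a daily
# BigQuery rollup. The DAG id, schedule, and SQL are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

ROLLUP_SQL = """
CREATE OR REPLACE TABLE `my-project.analytics.daily_page_views` AS
SELECT DATE(ts) AS day, page, COUNT(*) AS views
FROM `my-project.analytics.page_views`
GROUP BY day, page
"""

with DAG(
    dag_id="daily_page_view_rollup",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+ parameter name
    catchup=False,
) as dag:
    BigQueryInsertJobOperator(
        task_id="rollup_page_views",
        configuration={"query": {"query": ROLLUP_SQL, "useLegacySql": False}},
    )
```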
9. Implement machine learning
screen grab illustrating GC’s AI and ML services from https://cloud.google.com/architecture/framework/system-design/ai-ml
Data processing best practices
- Ensure that your data meets ML requirements such as accurately labeled data for training
- Store tabular data in BigQuery. If you use tabular data, consider storing all data in BigQuery and using the BigQuery Storage API to read data from it.
- Ensure you have enough data to develop an ML model. To predict a category, the recommended number of examples for each category is 10 times the number of features. The more categories you want to predict, the more data you need.
- Prepare data for consumption. When you configure your data pipeline, make sure that it can process both batch and stream data so that you get consistent results from both types of data.
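For the "store tabular data in BigQuery" recommendation above, here is a minimal sketch of reading training data into a DataFrame. The query and table are assumptions; with google-cloud-bigquery-storage installed, the client downloads results through the BigQuery Storage API.

```python
# Minimal sketch: read ML training data from BigQuery into a DataFrame. With
# google-cloud-bigquery-storage installed, the download uses the BigQuery Storage
# API. The table name is an illustrative assumption.
from google.cloud import bigquery  # pip install google-cloud-bigquery db-dtypes

client = bigquery.Client()
sql = "SELECT * FROM `my-project.ml.training_examples`"  # hypothetical table
df = client.query(sql).to_dataframe()
print(df.shape)
```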
Model development and training best practices
- Choose managed or custom-trained model development. When you build your model, consider the highest level of abstraction possible. Use AutoML when possible so that the development and training tasks are handled for you. Consider the Vertex AI training service instead of self-managed training on Compute Engine VMs or Deep Learning VM containers. For a JupyterLab environment, consider Vertex AI Workbench, which provides both managed and user-managed JupyterLab environments.
- Use pre-built or custom containers for custom-trained models. Pre-built containers are available for Python training applications that are created for specific TensorFlow, scikit-learn, PyTorch, and XGBoost versions.
- Consider distributed training requirements. Some ML frameworks, like TensorFlow and PyTorch, let you run identical training code on multiple machines. These frameworks automatically coordinate division of work based on environment variables that are set on each machine.
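A hedged sketch of the "highest level of abstraction" advice above: training an AutoML tabular classification model with the Vertex AI SDK rather than managing training infrastructure yourself. The project, region, BigQuery source, and target column are illustrative assumptions.

```python
# Hedged sketch: train an AutoML tabular model with the Vertex AI SDK instead of
# self-managed training on VMs. Project, region, BigQuery source, and the target
# column are illustrative assumptions.
from google.cloud import aiplatform  # pip install google-cloud-aiplatform

aiplatform.init(project="my-project", location="us-central1")

dataset = aiplatform.TabularDataset.create(
    display_name="churn-training-data",
    bq_source="bq://my-project.ml.training_examples",
)

job = aiplatform.AutoMLTabularTrainingJob(
    display_name="churn-automl",
    optimization_prediction_type="classification",
)
model = job.run(dataset=dataset, target_column="churned")  # AutoML handles training
print(model.resource_name)
```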
10. Design for environmental sustainability
- Understand your carbon footprint — visit the Carbon Footprint dashboard
- One simple and effective way to reduce carbon emissions is to choose cloud regions with lower carbon emissions — use the Google Cloud Region Picker to balance lowering emissions with other requirements, such as pricing and network latency
- Migrate workloads to managed services. Also consider that many workloads don’t require VMs. Often you can utilize a serverless offering instead. These managed services can optimize cloud resource usage, often automatically, which simultaneously reduces cloud costs and carbon footprint.
- Identify idle or overprovisioned resources and either delete them or rightsize them.
- Reduce emissions for batch workloads — for more info see Reduce emissions for batch workloads
View the complete framework on one page:
https://cloud.google.com/architecture/framework/printable
Conclusion
By adhering to the System Design principles outlined in the GCAF, you can construct secure, scalable, and cost-effective cloud systems on Google Cloud. This framework empowers you to leverage Google Cloud’s vast capabilities to create innovative and high-performing solutions that drive business value.