Google Cloud Data Engineer Study Guide

Section 1: Designing Data Processing Systems
Section 2: Ingesting and Processing the Data
Section 3: Storing the Data
Section 4: Preparing and Using Data for Analysis
Section 5: Maintaining and Automating Data Workloads

Section 1: Designing Data Processing Systems (~22%)

1.1 Designing for Security and Compliance

Identity and Access Management (e.g., Cloud IAM and organization policies)
Data security (encryption and key management)
Privacy (e.g., personally identifiable information and the Cloud Data Loss Prevention API)
Regional considerations (data sovereignty) for data access and storage
Legal and regulatory compliance

1.2 Designing for Reliability and Fidelity

Preparing and cleaning data (e.g., Dataprep, Dataflow, and Cloud Data Fusion)
Monitoring and orchestration of data pipelines
Disaster recovery and fault tolerance
Making decisions related to ACID compliance and availability
Data validation

1.3 Designing for Flexibility and Portability

Mapping current and future business requirements to the architecture
Designing for data and application portability (multi-cloud, data residency)
Data staging, cataloging, and discovery (data governance)

1.4 Designing Data Migrations

Analyzing current stakeholder needs and planning for desired state
Planning migration to Google Cloud (e.g., BigQuery, Datastream)
Designing migration validation strategy
Designing dataset and table architecture for proper governance
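
One simple way to approach the migration validation point above is to compare a source export with an export taken from the target after migration. The sketch below is an illustration only: fingerprint and same_data are hypothetical helper names, the inputs are assumed to be local text exports with one row per line, and sha256sum is assumed to be available (GNU coreutils). Rows are sorted first so that row order does not affect the result.

```shell
#!/usr/bin/env bash
# Sketch only: fingerprint and same_data are hypothetical helpers,
# not part of any Google Cloud tool.

# Print "<row_count> <sha256 of the sorted rows>" for one export file.
fingerprint() {
  local file="$1"
  local rows hash
  rows=$(wc -l < "$file" | tr -d ' ')
  # Sort with a fixed locale so both sides order rows identically.
  hash=$(LC_ALL=C sort "$file" | sha256sum | cut -d' ' -f1)
  echo "$rows $hash"
}

# Succeed only when both exports have the same row count and content.
same_data() {
  [ "$(fingerprint "$1")" = "$(fingerprint "$2")" ]
}
```

A check like this only covers exact, row-level equality. Type conversions or formatting differences between the source and the target would need to be normalized before comparing.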

Important points

BigQuery is Google's fully managed, NoOps, low-cost analytics database. It lets you query terabytes of data without managing any infrastructure or needing a database administrator. BigQuery uses SQL and offers pay-as-you-go pricing, so you can focus on analyzing data to find meaningful insights.
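
To make the pay-as-you-go point concrete: under on-demand pricing, a query is billed by the number of bytes it processes, and running bq query with the --dry_run flag reports that number without executing the query. The sketch below turns a byte count into a rough cost. estimate_query_cost is a hypothetical helper, not a bq command, and the price per TiB is an argument you supply yourself after checking the current rate. Free-tier allowances and per-query minimums are ignored.

```shell
#!/usr/bin/env bash
# Sketch only: estimate_query_cost is a hypothetical helper, not a bq command.
# Arguments:
#   $1  bytes the query would process (for example, from a dry run)
#   $2  on-demand price in USD per TiB (assumption: look up the current rate)
estimate_query_cost() {
  awk -v bytes="$1" -v price="$2" 'BEGIN {
    tib = bytes / (1024 ^ 4)        # bytes to TiB
    printf "%.2f\n", tib * price    # estimated cost in USD
  }'
}
```

The price is kept as a parameter on purpose, because rates vary by region and change over time.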

Sample command to load data from Cloud Storage into BigQuery

bq load \
  --source_format=CSV \
  --autodetect \
  --noreplace \
  nyctaxi.2018trips \
  gs://cloud-training/OCBL013/nyc_tlc_yellow_trips_2018_subset_2.csv
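
Before a load like the one above, it can help to confirm that every row in a local copy of the CSV has the same number of fields as the header, since ragged rows are a common cause of load errors. The sketch below is an illustration under stated assumptions: check_csv_columns is a hypothetical helper, not part of the bq CLI, it expects a local file, and it splits on plain commas, so quoted fields that contain commas will be miscounted.

```shell
#!/usr/bin/env bash
# Sketch only: check_csv_columns is a hypothetical helper, not part of bq.
# It exits non-zero when any row has a different field count than the header.
check_csv_columns() {
  awk -F',' '
    NR == 1        { expected = NF }   # header row defines the width
    NF != expected { bad++ }           # count rows with a different width
    END {
      if (bad > 0) {
        printf "%d row(s) differ from the header width of %d\n", bad, expected
        exit 1
      }
    }
  ' "$1"
}
```

Because the function returns a non-zero status on failure, it can gate the load step in a script, for example by chaining it with && in front of the bq load command.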

Datastream for BigQuery replicates data from operational databases such as AlloyDB, MySQL, PostgreSQL, and Oracle directly into BigQuery, Google Cloud's serverless data warehouse. Its serverless, auto-scaling architecture lets you set up an ELT (Extract, Load, Transform) pipeline for low-latency data replication, which enables near-real-time insights.

Cloud Service Equivalents – GCP vs AWS vs Azure
Category	GCP Service	AWS Equivalent	Azure Equivalent	Purpose
Data Warehouse	BigQuery	Amazon Redshift	Azure Synapse Analytics	Analytical SQL warehouse for large-scale queries.
Data Lake Storage	Cloud Storage (GCS)	Amazon S3	Azure Data Lake Storage (ADLS) Gen2 / Blob Storage	Object storage for raw, semi-/structured data.
Batch/Stream Processing	Dataflow (Apache Beam)	Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) / AWS Glue Streaming ETL	Azure Stream Analytics / Data Factory Mapping Data Flows	Serverless batch + streaming ETL.
Managed Hadoop/Spark	Dataproc	Amazon EMR	Azure HDInsight / Synapse Spark Pools	Managed Hadoop/Spark/Hive/Presto clusters.
Data Orchestration	Cloud Composer (Airflow)	AWS MWAA / Step Functions	Azure Data Factory	Workflow orchestration and scheduling.
Real-time Messaging	Pub/Sub	Kinesis Data Streams / SNS	Event Hubs / Service Bus	Pub/sub messaging for real-time ingestion.
Data Preparation	Dataprep (Trifacta)	AWS Glue DataBrew	ADF Wrangling Data Flows	No-code data prep for analytics.
Database Migration	Database Migration Service (DMS)	AWS DMS	Azure Database Migration Service	Migrate DBs to the cloud.
NoSQL Wide-Column Store	Cloud Bigtable	Amazon Keyspaces / DynamoDB	Cosmos DB (Cassandra API)	Low-latency, high-throughput NoSQL store.
Machine Learning	Vertex AI	SageMaker	Azure Machine Learning	Managed ML platform for training, deployment, MLOps.
