Google Cloud – Professional Data Engineer Certification Exam Notes
Google Cloud Data Engineer Study Guide
Section 1: Designing Data Processing Systems (~22%)
1.1 Designing for Security and Compliance
- Identity and Access Management (e.g., Cloud IAM and organization policies)
- Data security (encryption and key management)
- Privacy (e.g., personally identifiable information, and Cloud Data Loss Prevention API)
- Regional considerations (data sovereignty) for data access and storage
- Legal and regulatory compliance
1.2 Designing for Reliability and Fidelity
- Preparing and cleaning data (e.g., Dataprep, Dataflow, and Cloud Data Fusion)
- Monitoring and orchestration of data pipelines
- Disaster recovery and fault tolerance
- Making decisions related to ACID compliance and availability
- Data validation
1.3 Designing for Flexibility and Portability
- Mapping current and future business requirements to the architecture
- Designing for data and application portability (multi-cloud, data residency)
- Data staging, cataloging, and discovery (data governance)
1.4 Designing Data Migrations
- Analyzing current stakeholder needs and planning for desired state
- Planning migration to Google Cloud (e.g., BigQuery, Datastream)
- Designing migration validation strategy
- Designing dataset and table architecture for proper governance
Important points
- BigQuery is Google's fully managed, NoOps, low-cost analytics database. It lets you query terabytes of data with SQL, with no infrastructure to manage and no database administrator required. Pricing follows a pay-as-you-go model, so you can focus on analyzing data to find meaningful insights.
- Sample command to load a CSV file from Cloud Storage into BigQuery (--autodetect infers the schema; --noreplace appends to the table rather than overwriting it):
bq load \
--source_format=CSV \
--autodetect \
--noreplace \
nyctaxi.2018trips \
gs://cloud-training/OCBL013/nyc_tlc_yellow_trips_2018_subset_2.csv
- Datastream for BigQuery replicates data from operational database sources such as AlloyDB, MySQL, PostgreSQL, and Oracle directly into BigQuery, Google Cloud's serverless data warehouse. With its serverless, auto-scaling architecture, Datastream lets you set up an ELT (Extract, Load, Transform) pipeline for low-latency data replication, enabling real-time insights.
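As a follow-up to the bq load command in the list above, a quick row count is one simple way to confirm that the load landed. The sketch below is only illustrative: the helper name build_count_query and the row_count alias are hypothetical, the dataset and table names are taken from the load example, and the table is wrapped in backticks as a precaution because its name starts with a digit. It assumes the bq CLI is installed and authenticated; if it is not, the script only prints the query it would have run.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a BigQuery load by counting rows in the target table.
# DATASET and TABLE come from the bq load example; everything else is illustrative.

DATASET="nyctaxi"
TABLE="2018trips"

# Hypothetical helper: build a standard SQL row-count query for dataset.table.
build_count_query() {
  printf 'SELECT COUNT(*) AS row_count FROM `%s.%s`' "$1" "$2"
}

QUERY="$(build_count_query "$DATASET" "$TABLE")"

if command -v bq >/dev/null 2>&1; then
  # --use_legacy_sql=false selects standard SQL (GoogleSQL).
  bq query --use_legacy_sql=false "$QUERY"
else
  # No bq CLI on this machine: show the query instead of running it.
  echo "bq not found; would run: $QUERY"
fi
```

Running the count before and after the load gives a rough check that rows were appended, since --noreplace adds to the existing table rather than replacing it.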