Data services in clouds
| Category | GCP Service | AWS Equivalent | Azure Equivalent | Purpose |
|---|---|---|---|---|
| Data Warehouse | BigQuery | Amazon Redshift | Azure Synapse Analytics | Analytical SQL warehouse for large-scale queries. |
| Data Lake Storage | Cloud Storage (GCS) | Amazon S3 | Azure Data Lake Storage (ADLS) Gen2 / Blob Storage | Object storage for raw, structured, and semi-structured data. |
| Batch/Stream Processing | Dataflow (Apache Beam) | Kinesis Data Analytics / AWS Glue Streaming ETL | Azure Stream Analytics / Data Factory Mapping Data Flows | Serverless batch + streaming ETL. |
| Managed Hadoop/Spark | Dataproc | Amazon EMR | Azure HDInsight / Synapse Spark Pools | Managed Hadoop/Spark/Hive/Presto clusters. |
| Data Orchestration | Cloud Composer (Airflow) | AWS MWAA / Step Functions | Azure Data Factory | Workflow orchestration and scheduling. |
| Real-time Messaging | Pub/Sub | Kinesis Data Streams / SNS | Event Hubs / Service Bus | Pub/sub messaging for real-time ingestion. |
| ETL/ELT Service | Dataprep (Trifacta) | AWS Glue DataBrew | ADF Wrangling Data Flows | No-code data prep for analytics. |
| Database Migration | Database Migration Service (DMS) | AWS DMS | Azure Database Migration Service | Migrate databases to the cloud. |
| NoSQL Wide-Column Store | Cloud Bigtable | Amazon DynamoDB | Cosmos DB (Cassandra API) | Low-latency, high-throughput NoSQL store. |
| Machine Learning | Vertex AI | SageMaker | Azure Machine Learning | Managed ML platform for training, deployment, MLOps. |
HDFS Commands
Listing files:
sudo -u hdfs hadoop fs -ls /tmp
Changing file/folder permissions:
sudo -u hdfs hadoop fs -chmod 777 /tmp
Changing the folder permissions recursively:
sudo -u hdfs hadoop fs -chmod -R 777 /tmp
Copying files to local from HDFS:
sudo -u hdfs hadoop fs -copyToLocal <HDFS Path> <LOCAL SERVER PATH>
Copying files to HDFS from local disk:
sudo -u hdfs hadoop fs -copyFromLocal <LOCAL SERVER PATH> <HDFS Path>
Deleting a folder in HDFS:
sudo -u hdfs hadoop fs -rm -r <HDFS Folder>
Deleting a file in HDFS:
sudo -u hdfs hadoop fs -rm <HDFS File>
Working with Sqoop
Apache Sqoop
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz installation
$ tar -xvf sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz
$ mv sqoop-1.4.4.bin__hadoop-1.0.0 sqoop144_h100
$ sudo gedit .bashrc
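The lines added to .bashrc aren't shown in the post; a typical addition, assuming Sqoop was extracted under /home/hduser2 into the directory created by the mv above, is:
# assumed install location, matching the mv above
export SQOOP_HOME=/home/hduser2/sqoop144_h100
export PATH=$PATH:$SQOOP_HOME/bin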
$ . .bashrc
Test the Sqoop installation with the command:
$ sqoop144_h100/bin/sqoop help
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: $HADOOP_HOME is deprecated.
usage: sqoop COMMAND [ARGS]
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information
See 'sqoop help COMMAND' for information on a specific command.
hduser2@bala:~$
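Beyond sqoop help, a quick way to exercise the installation is a table import into HDFS. The JDBC URL, credentials, and table name below are placeholders, not values from the post:
$ sqoop144_h100/bin/sqoop import \
    --connect jdbc:mysql://localhost/testdb \
    --username dbuser -P \
    --table customers \
    --target-dir /user/hduser2/customers \
    -m 1
-m 1 runs a single mapper, so no --split-by column is needed; the imported rows land as part files under /user/hduser2/customers in HDFS.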
Working with Apache Hive on Ubuntu
tar -xvf apache-hive-0.13.1-bin.tar.gz
mv apache-hive-0.13.1-bin hive0131
editing .bashrc file
hduser2@bala:~$ sudo gedit .bashrc
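As with Sqoop, the .bashrc entries aren't shown; a typical addition, assuming the install directory the post refers to later, is:
# assumed install location (the post later refers to /home/hduser2/hive0131)
export HIVE_HOME=/home/hduser2/hive0131
export PATH=$PATH:$HIVE_HOME/bin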
creating warehouse folder in HDFS
hduser2@bala:~$ hadoop111/bin/hadoop fs -mkdir /home/hduser2/tmp/hive/warehouse
giving read write permissions to warehouse folder
hduser2@bala:~$ hadoop111/bin/hadoop fs -chmod g+w /home/hduser2/tmp/hive/warehouse
Adding hadoop path in hive config file
hduser2@bala:~$ sudo gedit hive0131/bin/hive-config.sh
# Allow alternate conf dir location.
HIVE_CONF_DIR="${HIVE_CONF_DIR:-$HIVE_HOME/conf}"
export HIVE_CONF_DIR=$HIVE_CONF_DIR
export HIVE_AUX_JARS_PATH=$HIVE_AUX_JARS_PATH
export HADOOP_HOME=/home/hduser2/hadoop111
# Default to use 256MB
export HADOOP_HEAPSIZE=${HADOOP_HEAPSIZE:-256}
Launch hive
hduser2@bala:~$ hive
Logging initialized using configuration in jar:file:/home/hduser2/hive0131/lib/hive-common-0.13.1.jar!/hive-log4j.properties
hive> show tables;
OK
Time taken: 0.233 seconds
hive> exit;
hduser2@bala:~$
Hive Commands (an example session follows the note below):
Creating a table
Loading data
Inserting data into a table
Dropping a table
Listing tables
Updating table data
Deleting data from a table
UPDATE or DELETE of a record isn't allowed in Hive, but INSERT INTO is acceptable.
A snippet from Hadoop: The Definitive Guide (3rd edition):
Updates, transactions, and indexes are mainstays of traditional databases. Yet, until recently, these features have not been considered a part of Hive's feature set. This is because Hive was built to operate over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by transforming the data into a new table. For a data warehousing application that runs over large portions of the dataset, this works well.
Hive doesn't support updates (or deletes), but it does support INSERT INTO, so it is possible to add new rows to an existing table.
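A minimal hive> session covering the operations listed above; the table name, columns, staging table, and file path are illustrative, not from the post:
hive> CREATE TABLE employees (id INT, name STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA LOCAL INPATH '/home/hduser2/employees.csv' INTO TABLE employees;
hive> INSERT INTO TABLE employees SELECT id, name FROM employees_staging;
hive> SHOW TABLES;
hive> SELECT * FROM employees LIMIT 10;
hive> DROP TABLE employees;
Since UPDATE and DELETE aren't available in this Hive version, changing data means rewriting it, for example with INSERT OVERWRITE TABLE employees SELECT ... from a query that filters or transforms the existing rows.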
Apache Pig Installation on Ubuntu
cd /home/hduser2/
tar -xvf pig-0.13.0.tar.gz
mv pig-0.13.0 pig0130
Set the java home and pig install directory
hduser2@bala:~$ sudo gedit /etc/profile
export PIG_INSTALL=/home/hduser2/pig0130
export PATH=$PATH:$PIG_INSTALL/bin
export JAVA_HOME=/usr/lib/jvm/java-6-oracle
export PIG_CLASSPATH=/home/hduser2/hadoop111/conf/
hduser2@bala:~$ source /etc/profile
Log out from Ubuntu and log in again.
hduser2@bala:~$ pig
Warning: $HADOOP_HOME is deprecated.
14/08/26 07:40:47 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/08/26 07:40:47 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/08/26 07:40:47 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2014-08-26 07:40:47,060 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:29:34
2014-08-26 07:40:47,061 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hduser2/pig_1409019047060.log
2014-08-26 07:40:47,083 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hduser2/.pigbootup not found
2014-08-26 07:40:47,205 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:54310
2014-08-26 07:40:47,392 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:54311
grunt>
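With the grunt shell up, a short word-count script makes a reasonable smoke test; the input path and file are assumptions, not from the post:
grunt> lines = LOAD '/user/hduser2/input/sample.txt' AS (line:chararray);
grunt> words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grunt> grouped = GROUP words BY word;
grunt> counts = FOREACH grouped GENERATE group, COUNT(words);
grunt> DUMP counts;
DUMP runs the MapReduce job and prints (word, count) tuples to the console; a STORE statement would write them to HDFS instead.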
Errors while installing RStudio on Ubuntu 14.04
Install R with sudo apt-get install r-base, then launch RStudio from its bin folder.
Error while installing the xlsx package:
First install rJava, which is a dependency of xlsx:
sudo apt-get install r-cran-rjava
Error while installing the XML package:
> install.packages("XML") Installing package into ‘/home/bala/R/x86_64-pc-linux-gnu-library/3.0’ (as ‘lib’ is unspecified) trying URL 'http://cran.rstudio.com/src/contrib/XML_3.98-1.1.tar.gz' Content type 'application/x-gzip' length 1582216 bytes (1.5 Mb) opened URL ================================================== downloaded 1.5 Mb * installing *source* package ‘XML’ ... ** package ‘XML’ successfully unpacked and MD5 sums checked checking for gcc... gcc checking for C compiler default output file name... rm: cannot remove 'a.out.dSYM': Is a directory a.out checking whether the C compiler works... yes checking whether we are cross compiling... no checking for suffix of executables... checking for suffix of object files... o checking whether we are using the GNU C compiler... yes checking whether gcc accepts -g... yes checking for gcc option to accept ISO C89... none needed checking how to run the C preprocessor... gcc -E checking for sed... /bin/sed checking for pkg-config... /usr/bin/pkg-config checking for xml2-config... no Cannot find xml2-config ERROR: configuration failed for package ‘XML’ * removing ‘/home/bala/R/x86_64-pc-linux-gnu-library/3.0/XML’ Warning in install.packages : installation of package ‘XML’ had non-zero exit status The downloaded source packages are in ‘/tmp/RtmpcGhePy/downloaded_packages’ | |
|
Fix: Run the following commands:
sudo apt-get update
sudo apt-get install libxml2-dev
sudo apt-get install r-cran-xml
JVM not found while starting Eclipse on Ubuntu 14.04
Add the -vm argument to the eclipse.ini file:
-startup
plugins/org.eclipse.equinox.launcher_1.3.0.v20140415-2008.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.gtk.linux.x86_64_1.1.200.v20140603-1326
-product
org.eclipse.epp.package.jee.product
--launcher.defaultAction
openFile
-showsplash
org.eclipse.platform
--launcher.XXMaxPermSize
256m
--launcher.defaultAction
openFile
--launcher.appendVmargs
-vm
/usr/lib/jvm/java-6-oracle/jre/bin/java
-vmargs
-Dosgi.requiredJavaVersion=1.6
-XX:MaxPermSize=256m
-Xms40m
-Xmx512m