Tuesday 26 August 2014

Working with Sqoop

Apache Sqoop

Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz installation

$ tar -xvf sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz

$ mv sqoop-1.4.4.bin__hadoop-1.0.0 sqoop144_h100

$ sudo gedit .bashrc
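The lines added to .bashrc are not shown in the original post; a minimal sketch, assuming Sqoop was extracted and renamed to /home/hduser2/sqoop144_h100 as above (adjust the path to your layout):

```shell
# Sqoop environment (path matches the rename above)
export SQOOP_HOME=/home/hduser2/sqoop144_h100
export PATH=$PATH:$SQOOP_HOME/bin
```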

$ . .bashrc

Test the Sqoop installation with the following command

$ sqoop144_h100/bin/sqoop help

Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: $HADOOP_HOME is deprecated.

usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

hduser2@bala:~$
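With the installation verified, a typical import looks like the sketch below. The MySQL host, database, table, and username here are hypothetical placeholders, and the JDBC connector JAR for your database must be available to Sqoop:

```shell
# Import a table from a (hypothetical) MySQL database into HDFS
sqoop144_h100/bin/sqoop import \
  --connect jdbc:mysql://localhost/testdb \
  --username dbuser -P \
  --table employees \
  --target-dir /user/hduser2/employees
```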




Monday 18 August 2014

Working with Apache Hive on Ubuntu

tar -xvf apache-hive-0.13.1-bin.tar.gz

mv apache-hive-0.13.1-bin hive0131


editing .bashrc file

hduser2@bala:~$ sudo gedit .bashrc
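The exact lines added to .bashrc are not shown; a minimal sketch, assuming Hive lives at /home/hduser2/hive0131 (the path used later in this post):

```shell
# Hive environment (path matches the rest of this post)
export HIVE_HOME=/home/hduser2/hive0131
export PATH=$PATH:$HIVE_HOME/bin
```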

creating warehouse folder in HDFS

hduser2@bala:~$ hadoop111/bin/hadoop fs -mkdir /home/hduser2/tmp/hive/warehouse

giving read write permissions to warehouse folder

hduser2@bala:~$ hadoop111/bin/hadoop fs -chmod g+w /home/hduser2/tmp/hive/warehouse

Adding hadoop path in hive config file

hduser2@bala:~$ sudo gedit hive0131/bin/hive-config.sh

# Allow alternate conf dir location.
HIVE_CONF_DIR="${HIVE_CONF_DIR:-$HIVE_HOME/conf}"

export HIVE_CONF_DIR=$HIVE_CONF_DIR
export HIVE_AUX_JARS_PATH=$HIVE_AUX_JARS_PATH
export HADOOP_HOME=/home/hduser2/hadoop111
# Default to use 256MB
export HADOOP_HEAPSIZE=${HADOOP_HEAPSIZE:-256}

Launch hive

hduser2@bala:~$ hive

Logging initialized using configuration in jar:file:/home/hduser2/hive0131/lib/hive-common-0.13.1.jar!/hive-log4j.properties
hive> show tables;
OK
Time taken: 0.233 seconds
hive> exit;
hduser2@bala:~$


Hive Commands:

Creating table
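The screenshot from the original post is missing here; a HiveQL sketch, using a hypothetical employees table:

```sql
-- Create a simple table; fields in the source file are comma-separated
CREATE TABLE employees (
  id INT,
  name STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
```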



Loading data
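A sketch for the hypothetical employees table above; the local file path is a placeholder:

```sql
-- Load a local CSV file into the table
LOAD DATA LOCAL INPATH '/home/hduser2/employees.csv'
INTO TABLE employees;
```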

Inserting data into the data
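Hive 0.13 has no INSERT ... VALUES syntax, so rows are inserted from a query. A sketch, with employees and the staging table both hypothetical:

```sql
-- Append rows selected from another table
INSERT INTO TABLE employees
SELECT id, name, salary FROM employees_staging;
```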

dropping the table
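A sketch for the hypothetical employees table:

```sql
DROP TABLE IF EXISTS employees;
```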

listing the tables
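Listing tables in the current database, and describing one (employees is hypothetical):

```sql
SHOW TABLES;
DESCRIBE employees;
```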


Updating the table data
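Hive 0.13 has no UPDATE statement; the usual workaround is to rewrite the table with INSERT OVERWRITE, transforming the rows that should change. A sketch using a hypothetical employees table and an arbitrary id:

```sql
-- "Update": rewrite every row, changing only the ones that match
INSERT OVERWRITE TABLE employees
SELECT id, name,
       CASE WHEN id = 42 THEN salary * 1.10 ELSE salary END
FROM employees;
```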

deleting the data from the table
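Likewise there is no DELETE in Hive 0.13; deletion is simulated by rewriting the table with only the rows to keep. A sketch using a hypothetical employees table:

```sql
-- "Delete": keep everything except the rows to remove
INSERT OVERWRITE TABLE employees
SELECT id, name, salary FROM employees
WHERE id <> 42;
```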

UPDATE and DELETE statements aren't allowed in Hive (as of 0.13), but INSERT INTO is.
A snippet from Hadoop: The Definitive Guide (3rd edition):
Updates, transactions, and indexes are mainstays of traditional databases. Yet, until recently, these features have not been considered a part of Hive's feature set. This is because Hive was built to operate over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by transforming the data into a new table. For a data warehousing application that runs over large portions of the dataset, this works well.
Hive doesn't support updates (or deletes), but it does support INSERT INTO, so it is possible to add new rows to an existing table.







Apache Pig Installation on Ubuntu

cd /home/hduser2/

tar -xvf pig-0.13.0.tar.gz

mv pig-0.13.0 pig0130

Set the Java home and the Pig install directory

hduser2@bala:~$ sudo gedit /etc/profile

export PIG_INSTALL=/home/hduser2/pig0130

export PATH=$PATH:$PIG_INSTALL/bin

export JAVA_HOME=/usr/lib/jvm/java-6-oracle

export PIG_CLASSPATH=/home/hduser2/hadoop111/conf/

hduser2@bala:~$ source /etc/profile

Log out of Ubuntu and log in again

hduser2@bala:~$ pig
Warning: $HADOOP_HOME is deprecated.

14/08/26 07:40:47 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/08/26 07:40:47 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/08/26 07:40:47 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2014-08-26 07:40:47,060 [main] INFO  org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:29:34
2014-08-26 07:40:47,061 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/hduser2/pig_1409019047060.log
2014-08-26 07:40:47,083 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/hduser2/.pigbootup not found
2014-08-26 07:40:47,205 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:54310
2014-08-26 07:40:47,392 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:54311
grunt>
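From the grunt shell, a minimal sketch of a Pig Latin session; the input file and its schema are hypothetical:

```
grunt> emp = LOAD '/user/hduser2/employees.csv' USING PigStorage(',')
>>           AS (id:int, name:chararray, salary:double);
grunt> high = FILTER emp BY salary > 50000.0;
grunt> DUMP high;
```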


Tuesday 12 August 2014

Errors while installing RStudio on Ubuntu 14.04



Install R with sudo apt-get install r-base, then launch RStudio from its bin folder.

Error while installing xlsx package:

First install rJava, which is a dependency of xlsx:

sudo apt-get install r-cran-rjava

Error while installing the XML package:

> install.packages("XML")
Installing package into ‘/home/bala/R/x86_64-pc-linux-gnu-library/3.0’
(as ‘lib’ is unspecified)
trying URL 'http://cran.rstudio.com/src/contrib/XML_3.98-1.1.tar.gz'
Content type 'application/x-gzip' length 1582216 bytes (1.5 Mb)
opened URL
==================================================
downloaded 1.5 Mb

* installing *source* package ‘XML’ ...
** package ‘XML’ successfully unpacked and MD5 sums checked
checking for gcc... gcc
checking for C compiler default output file name... 
rm: cannot remove 'a.out.dSYM': Is a directory
a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables... 
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking how to run the C preprocessor... gcc -E
checking for sed... /bin/sed
checking for pkg-config... /usr/bin/pkg-config
checking for xml2-config... no
Cannot find xml2-config
ERROR: configuration failed for package ‘XML’
* removing ‘/home/bala/R/x86_64-pc-linux-gnu-library/3.0/XML’
Warning in install.packages :
  installation of package ‘XML’ had non-zero exit status

The downloaded source packages are in
 ‘/tmp/RtmpcGhePy/downloaded_packages’
>



Fix: Run the following commands

sudo apt-get update

sudo apt-get install libxml2-dev
sudo apt-get install r-cran-xml

JVM not found while starting Eclipse on Ubuntu 14.04


Add the -vm argument to the eclipse.ini file. Note that -vm and the JVM path must be on separate lines, and the -vm entry must come before -vmargs:


-startup
plugins/org.eclipse.equinox.launcher_1.3.0.v20140415-2008.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.gtk.linux.x86_64_1.1.200.v20140603-1326
-product
org.eclipse.epp.package.jee.product
--launcher.defaultAction
openFile
-showsplash
org.eclipse.platform
--launcher.XXMaxPermSize
256m
--launcher.defaultAction
openFile
--launcher.appendVmargs
-vm
/usr/lib/jvm/java-6-oracle/jre/bin/java
-vmargs
-Dosgi.requiredJavaVersion=1.6
-XX:MaxPermSize=256m
-Xms40m
-Xmx512m