Hive Notes

Apache Hive is a framework for data warehousing on top of Hadoop. Hive grew out of a need to manage and learn from the huge volumes of data that Facebook was producing every day from its burgeoning social network.


Hive organizes tables into partitions—a way of dividing a table into coarse-grained parts based on the value of a partition column, such as a date. Using partitions can make it faster to do queries on slices of the data.
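
For illustration, here is a minimal sketch (the table, column names, and input path are hypothetical) of a table partitioned by date and a query that touches only one partition:

CREATE TABLE logs (line STRING)
PARTITIONED BY (dt STRING);

-- Load a file into a specific partition.
LOAD DATA INPATH '/input/logs/2024-01-01' INTO TABLE logs PARTITION (dt = '2024-01-01');

-- Filtering on the partition column means only that partition's files are scanned.
SELECT count(*) FROM logs WHERE dt = '2024-01-01';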


Tables or partitions may be subdivided further into buckets to give extra structure to the data that may be used for more efficient queries. For example, bucketing by user ID means we can quickly evaluate a user-based query by running it on a randomized sample of the total set of users.
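
As a sketch (hypothetical table and columns), bucketing is declared with CLUSTERED BY, and TABLESAMPLE can then read just a fraction of the buckets instead of the whole table:

CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;

-- Reads only the first of the four buckets, i.e. roughly a quarter of the users.
SELECT * FROM bucketed_users TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);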


There are two dimensions that govern table storage in Hive: the row format and the file format. The row format dictates how rows, and the fields in a particular row, are stored. In Hive parlance, the row format is defined by a SerDe (Serializer-Deserializer).


The file format dictates the container format for fields in a row. The simplest format is a plain-text file, but there are row-oriented and column-oriented binary formats available, too.
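
For example (table names are illustrative), both the row format and the file format are declared when the table is created:

-- Plain-text storage with a delimited row format.
CREATE TABLE records_text (year STRING, temperature INT)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- The same columns stored in ORC, a column-oriented binary format.
CREATE TABLE records_orc (year STRING, temperature INT)
STORED AS ORC;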


The DROP TABLE statement deletes the data and metadata for a table. In the case of external tables, only the metadata is deleted; the data is left untouched.
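
For example, assuming managed_table is a managed (internal) table and external_table was created with CREATE EXTERNAL TABLE:

DROP TABLE managed_table;    -- deletes both the data and the metadata
DROP TABLE external_table;   -- deletes only the metadata; the files are left untouched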


If you want to delete all the data in a table but keep the table definition, use TRUNCATE TABLE. For example:


TRUNCATE TABLE my_table;


Sometimes you want to control which reducer a particular row goes to, typically so you can perform some subsequent aggregation. This is what Hive’s DISTRIBUTE BY clause does.


Here’s an example to sort the weather dataset by year and temperature, in such a way as to ensure that all the rows for a given year end up in the same reducer partition:

hive> FROM records2
    > SELECT year, temperature
    > DISTRIBUTE BY year
    > SORT BY year ASC, temperature DESC;

If the columns for SORT BY and DISTRIBUTE BY are the same, you can use CLUSTER BY as a shorthand for specifying both.
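
For example, DISTRIBUTE BY year combined with SORT BY year (ascending only) can be abbreviated to:

hive> FROM records2
    > SELECT year, temperature
    > CLUSTER BY year;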


Sometimes the query you want to write can’t be expressed easily (or at all) using the built-in functions that Hive provides. By allowing you to write a user-defined function (UDF), Hive makes it easy to plug in your own processing code and invoke it from a Hive query. UDFs have to be written in Java, the language that Hive itself is written in.


There are three types of UDF in Hive: (regular) UDFs, user-defined aggregate functions (UDAFs), and user-defined table-generating functions (UDTFs). They differ in the number of rows that they accept as input and produce as output:


A UDF operates on a single row and produces a single row as its output. Most functions, such as mathematical functions and string functions, are of this type.


A UDAF works on multiple input rows and creates a single output row. Aggregate functions include such functions as COUNT and MAX.


A UDTF operates on a single row and produces multiple rows—a table—as output.
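
As a sketch of how a UDF is wired in from HiveQL (the jar path, class name, and table name are hypothetical), a compiled Java UDF is first added to the session and then registered under a function name:

ADD JAR /path/to/hive-udfs.jar;
CREATE TEMPORARY FUNCTION strip AS 'com.example.hive.Strip';
SELECT strip('  bee  ') FROM dummy;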


Hive has an optional component known as HiveServer or HiveThrift that allows access to Hive over a single port. Thrift is a software framework for scalable cross-language services development. Thrift allows clients written in many languages, including Java, C++, and Ruby, to programmatically access Hive remotely.
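
For example, the Thrift service can be started from the shell; older releases used hive --service hiveserver, while current releases ship HiveServer2 (listening on port 10000 by default):

hive --service hiveserver2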


Frequently used commands:

Starting the Hive CLI with Tez as the execution engine:

hive --hiveconf hive.execution.engine=tez

To show column headers in query output:


set hive.cli.print.header=true;

The use of reserved keywords as identifiers is restricted; this can be relaxed by setting the following parameters:


set hive.support.sql11.reserved.keywords=false;
set hive.optimize.index.filter=false;

Listing tables:
show tables;
Switching to a db:
use <db-name>;
When a partition doesn't show up in a Hive table, repair the table using the command:
msck repair table <table-name>;
Listing partitions:
show partitions <table-name>;
Dropping a partition:
alter table <table-name> drop partition (year=2018, month=05, day=15);
Displaying the table DDL statement:
show create table <table-name>;
Changing the location of an external Hive table:
ALTER TABLE my_table SET LOCATION "hdfs://path/to/file";
To drop all partitions in a Hive table:
alter table schema.table drop partition (partitioncolumn != '');

Creating Hive tables:
Hive tables can be created as internal (managed) tables or external tables.

Internal table: Data is stored in the directory specified by the property hive.metastore.warehouse.dir, usually a path on HDFS. Hive manages the life cycle of the table: when the table is dropped, both the metadata and the data are deleted.


External table: Data is stored outside Hive's warehouse directory, and Hive manages only the metadata. When the table is dropped, only the metadata is deleted; the data is left untouched.
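
A minimal sketch (table names and the HDFS path are illustrative):

-- Managed (internal) table: data lives under hive.metastore.warehouse.dir.
CREATE TABLE managed_table (id INT, name STRING);

-- External table: Hive records only the metadata; the files under LOCATION
-- survive a DROP TABLE.
CREATE EXTERNAL TABLE external_table (id INT, name STRING)
LOCATION 'hdfs:///data/external_table';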
