Learning Hive – Text Processing

One of the most useful features of Hive is its built-in text processing. These functions are designed for analyzing large-scale text data, such as online comments, in tasks like text mining and sentiment analysis.

1) concat, concat_ws, split and explode functions

hive> select concat(fname, ' ', lname) as full_name from program;
hive> select concat_ws('/', fname, lname) as full_name from program;
hive> select split(full_name, ' ') from
(select concat(fname, ' ', lname) as full_name from program) t;
hive> select explode(split(full_name, ' ')) as x from
(select concat(fname, ' ', lname) as full_name
from program limit 1) t;

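The results:
concat – joins its arguments into a single string, here the first and last name separated by a space.
concat_ws – joins its arguments using the separator given as the first argument, here '/'.
split – returns an array of the substrings found between the delimiters.
explode – breaks the array apart and returns each element as a separate row.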
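
To make the behavior of these four functions concrete, here is a small sketch that uses literal strings instead of a table. The names 'John' and 'Smith' are made-up sample values, and a SELECT without a FROM clause assumes Hive 0.13 or later. The last query shows one possible way to keep the other columns next to the exploded words with LATERAL VIEW, since explode cannot be mixed with other columns in a plain select list; the alias names are only illustrative.

```sql
-- Literal-only queries; the names are sample values, not data from the program table.
SELECT concat('John', ' ', 'Smith');      -- John Smith
SELECT concat_ws('/', 'John', 'Smith');   -- John/Smith
SELECT split('John Smith', ' ');          -- ["John","Smith"]

-- explode returns one row per array element: John, then Smith.
SELECT explode(split('John Smith', ' ')) AS x;

-- Keep fname and lname beside each exploded word.
-- "p", "t" and "word" are illustrative alias names.
SELECT p.fname, p.lname, t.word
FROM program p
LATERAL VIEW explode(split(concat(p.fname, ' ', p.lname), ' ')) t AS word;
```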

2) sentences and ngrams

The sentences and ngrams functions are very useful for text analysis. sentences splits a text into sentences and each sentence into words. The result is a two-dimensional array: the outer layer holds the sentences and the inner layer holds the words of each sentence. ngrams estimates the most frequent n-word combinations; its second argument is n and its third argument is the number of top combinations to return. Here is an example:

hive> select sentences(message) from rating2012 limit 3;
hive> select explode(ngrams(sentences(lower(message)), 2, 6))
as ngram from rating2012 where prod_id=1274673;

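In the results, the sentences query returns one array of word arrays per message, and the ngrams query returns one row for each of the six most frequent two-word combinations, together with its estimated frequency.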
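
As a rough sketch of how these results can be inspected, the first query below runs sentences on a literal string (assuming Hive 0.13 or later for a SELECT without FROM). The second flattens the ngrams output into separate columns. Each element returned by ngrams is a struct with the fields ngram and estfrequency; the aliases s, t, ng and top_bigrams are illustrative names only.

```sql
-- sentences on a literal: punctuation is dropped, words are grouped by sentence.
SELECT sentences('Hello there! How are you?');
-- [["Hello","there"],["How","are","you"]]

-- Top 6 two-word combinations for one product, one row per combination,
-- with the word pair and its estimated frequency in separate columns.
SELECT ng.ngram AS words, ng.estfrequency AS freq
FROM (
  SELECT ngrams(sentences(lower(message)), 2, 6) AS top_bigrams
  FROM rating2012
  WHERE prod_id = 1274673
) s
LATERAL VIEW explode(s.top_bigrams) t AS ng;
```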

3) Parsing URL

The parse_url function extracts individual parts of a URL string. For example, take the URL http://www.qizeresearch.com/click.php?A=112&K=115. The part name is case sensitive and must be written in uppercase:

parse_url(url, 'PROTOCOL') /*output: http*/
parse_url(url, 'HOST') /*output: www.qizeresearch.com*/
parse_url(url, 'PATH') /*output: /click.php*/
parse_url(url, 'QUERY') /*output: A=112&K=115*/
parse_url(url, 'QUERY', 'A') /*output: 112*/
parse_url(url, 'QUERY', 'K') /*output: 115*/
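
The calls above can be tried in a single query. The sketch below wraps the sample URL in a subquery so it does not need a table; this assumes Hive 0.13 or later, and the alias u and the column names are illustrative. The part names are written in uppercase, as the Hive documentation lists them.

```sql
-- One row with each part of the sample URL in its own column.
SELECT
  parse_url(u.url, 'PROTOCOL')   AS protocol,  -- http
  parse_url(u.url, 'HOST')       AS host,      -- www.qizeresearch.com
  parse_url(u.url, 'PATH')       AS path,      -- /click.php
  parse_url(u.url, 'QUERY')      AS query,     -- A=112&K=115
  parse_url(u.url, 'QUERY', 'A') AS a_value,   -- 112
  parse_url(u.url, 'QUERY', 'K') AS k_value    -- 115
FROM (
  SELECT 'http://www.qizeresearch.com/click.php?A=112&K=115' AS url
) u;
```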


One Response to Learning Hive – Text Processing

  1. max says:

    Hi, may I know which delimiter the sentences() function uses to separate sentences? I'm getting different results for the same sentence if I replace the period (.) with something else.
