One of the distinctive features of Hive is its built-in text processing. It is designed for analyzing large-scale text data such as online comments, with applications like text mining and sentiment analysis.

1) concat, split and explode functions

hive> select concat(fname,' ',lname) full_name from program;
hive> select concat_ws('/',fname,lname) full_name from program;
hive> select split(full_name, ' ') from
(select concat(fname,' ',lname) full_name from program) t;
hive> select explode(split(full_name, ' ')) as x from
(select concat(fname,' ',lname) full_name
from program limit 1) t;

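The result:
concat – fname and lname are joined into one string with a space in between.
concat_ws – fname and lname are joined with '/' as the separator.
split – full_name is broken at each space into an array of strings.

explode – explode breaks up the array and displays each element in its own row.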
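
The same functions can be tried on literal values, which makes each step easy to check. This is only a sketch: the name 'John Smith' is made up for illustration, and the first four statements assume a Hive version (0.13 or later) that allows select without a from clause. The last statement reuses the program table from above and shows lateral view, which keeps other columns next to the exploded values.

```sql
-- concat joins its arguments in the order given
select concat('John', ' ', 'Smith');             -- John Smith

-- concat_ws places the separator between the arguments
select concat_ws('/', 'John', 'Smith');          -- John/Smith

-- split returns an array of strings
select split('John Smith', ' ');                 -- ["John","Smith"]

-- explode turns the array into one row per element
select explode(split('John Smith', ' ')) as x;   -- two rows: John, Smith

-- lateral view keeps fname alongside each exploded word
select fname, x
from program
lateral view explode(split(concat(fname, ' ', lname), ' ')) t as x;
```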

2) sentences and ngrams

The sentences and ngrams functions are very useful for text analysis. The sentences function breaks text into sentences and each sentence into words. The result is a two-dimensional array: the outer layer holds each sentence and the inner layer holds each word. ngrams returns the most frequent n-word combinations together with their estimated frequencies; in the example below it returns the 6 most frequent 2-word combinations. Here is the example:

hive> select sentences(message) from rating2012 limit 3;
hive> select explode(ngrams(sentences(lower(message)),2,6))
as ngram from rating2012 where prod_id=1274673;

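Here is the result:

Sentences – one array of sentences per message, where each sentence is itself an array of words.

Ngrams – one row per 2-word combination, each a struct holding the ngram and its estimated frequency (estfreq).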
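
A small sketch may make the shape of these results easier to see. The first statement uses the sample sentence from the Hive documentation and assumes a Hive version that allows select without a from clause. The second reuses the rating2012 query from above and pulls the two fields out of each struct; the aliases x and t are just illustrative names.

```sql
-- sentences drops punctuation and returns an array of word arrays
select sentences('Hello there! How are you?');
-- [["Hello","there"],["How","are","you"]]

-- each exploded row is a struct with the fields ngram and estfreq
select x.ngram, x.estfreq
from (
  select explode(ngrams(sentences(lower(message)), 2, 6)) as x
  from rating2012
  where prod_id = 1274673
) t;
```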

3) Parsing URL

The parse_url function extracts the different parts of a URL string. For example, suppose we have a URL: http://www.qizeresearch.com/click.php?A=112&K=115. The part name must be written in uppercase; a lowercase name such as 'host' returns NULL.

parse_url(url, 'PROTOCOL') /*output: http*/
parse_url(url, 'HOST') /*output: www.qizeresearch.com*/
parse_url(url, 'PATH') /*output: /click.php*/
parse_url(url, 'QUERY') /*output: A=112&K=115*/
parse_url(url, 'QUERY', 'A') /*output: 112*/
parse_url(url, 'QUERY', 'K') /*output: 115*/
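
As a runnable sketch, the calls can be applied directly to the example URL as a literal, with the part names written in uppercase as in the Hive documentation. This assumes a Hive version that allows select without a from clause. The last statement uses parse_url_tuple, which extracts several parts in a single call; the aliases u, t, host, path and a are illustrative names only.

```sql
select parse_url('http://www.qizeresearch.com/click.php?A=112&K=115', 'HOST');
-- www.qizeresearch.com

select parse_url('http://www.qizeresearch.com/click.php?A=112&K=115', 'QUERY', 'K');
-- 115

-- parse_url_tuple returns one column per requested part;
-- 'QUERY:A' asks for the value of key A in the query string
select t.host, t.path, t.a
from (select 'http://www.qizeresearch.com/click.php?A=112&K=115' as url) u
lateral view parse_url_tuple(u.url, 'HOST', 'PATH', 'QUERY:A') t as host, path, a;
```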

  1. max says:

    Hi, may I know which delimiter the sentences() function uses to separate sentences? I'm getting different results for the same sentence if I change the period (.) to something else.

