How to create a Hive external table for Nutch's HBase webpage schema?

To query an HBase table from Hive, an external table that maps onto it must be created first.

CREATE EXTERNAL TABLE webpage_hive (
  key string,
  baseUrl string,
  status int,
  prevFetchTime bigint,
  fetchTime bigint,
  fetchInterval bigint,
  retriesSinceFetch int,
  reprUrl string,
  content string,
  contentType string,
  protocolStatus string,
  modifiedTime bigint,
  prevModifiedTime bigint,
  batchId string,
  title string,
  text string,
  parseStatus int,
  signature string,
  prevSignature string,
  score int,
  headers map<string,string>,
  inlinks map<string,string>,
  outlinks map<string,string>,
  metadata map<string,string>,
  markers map<string,string>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,f:bas,f:st,f:pts#b,f:ts#b,f:fi#b,f:rsf,f:rpr,f:cnt,f:typ,f:prot,f:mod#b,f:pmod#b,f:bid,p:t,p:c,p:st,p:sig,p:psig,s:s,h:,il:,ol:,mtdt:,mk:"
)
TBLPROPERTIES ("hbase.table.name" = "webpage");

After executing this statement, the columns are created as follows:

baseurl string from deserializer
batchid string from deserializer
content string from deserializer
contenttype string from deserializer
fetchinterval bigint from deserializer
fetchtime bigint from deserializer
headers map<string,string> from deserializer
inlinks map<string,string> from deserializer
key string from deserializer
markers map<string,string> from deserializer
metadata map<string,string> from deserializer
modifiedtime bigint from deserializer
outlinks map<string,string> from deserializer
parsestatus int from deserializer
prevfetchtime bigint from deserializer
prevmodifiedtime bigint from deserializer
prevsignature string from deserializer
protocolstatus string from deserializer
reprurl string from deserializer
retriessincefetch int from deserializer
score int from deserializer
signature string from deserializer
status int from deserializer
text string from deserializer
title string from deserializer
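
A quick way to confirm that the mapping works is to describe the table and read back a few rows. This is only a minimal sketch: the choice of columns and the LIMIT value are arbitrary, and it assumes the underlying HBase table already holds crawled pages.

```sql
-- List the columns Hive derived from the table definition
DESCRIBE webpage_hive;

-- Read back a handful of rows to check that values deserialize sensibly
SELECT key, baseurl, status, title
FROM webpage_hive
LIMIT 5;
```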

Some example queries:

The following query converts the bigint epoch value in fetchtime to a readable date format:
select baseurl,from_unixtime(fetchtime, "[dd/MM/yyyy:HH:mm:ss Z]") AS ft from webpage_hive order by baseurl desc;
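
Note that from_unixtime interprets its first argument as seconds since the epoch. If the fetchtime values in your table turn out to be epoch milliseconds (an assumption worth checking against your own data, since it depends on how the crawler wrote them), the query above would return dates far in the future. A sketch of an adjusted query under that assumption, with a simpler format string and an arbitrary LIMIT added for illustration:

```sql
-- Assumption: fetchtime holds epoch milliseconds, so scale down to seconds first
SELECT baseurl,
       from_unixtime(CAST(fetchtime / 1000 AS BIGINT), 'dd/MM/yyyy HH:mm:ss') AS ft
FROM webpage_hive
ORDER BY baseurl DESC
LIMIT 20;
```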

The following query explodes outlinks in a lateral view and displays the entries as key/value pairs:
SELECT baseurl, outl_key,outl_value FROM webpage_hive LATERAL VIEW explode(outlinks) olTable AS outl_key,outl_value;
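
The map columns can also be queried without exploding them. The sketch below counts the outlinks per page with size() and looks up a single entry of the headers map by key. The header name 'Content-Type' is only an assumed example of a key that might be present; substitute one that exists in your data.

```sql
-- Pages with the most outlinks
SELECT baseurl, size(outlinks) AS outlink_count
FROM webpage_hive
ORDER BY outlink_count DESC
LIMIT 10;

-- Direct lookup of one map entry (the key name is an assumption)
SELECT baseurl, headers['Content-Type'] AS content_type_header
FROM webpage_hive
LIMIT 10;
```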

