
Integrating Apache Nutch and HBase using Gora

There are six steps to using HBase as the Gora backend for Nutch.

First, download Nutch 2 from an Apache mirror site and extract it to the directory where you want to install it.

# tar -zxvf apache-nutch-2.X-src.tar.gz

Second, download and install HBase from an Apache mirror site. Gora 0.2 currently supports the HBase 0.90.X branch.

Third, specify the Gora backend in nutch-site.xml. All the usual configuration parameters should also be set in nutch-site.xml before compiling Nutch.

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>

Fourth, make sure the gora-hbase dependency is enabled (uncommented) in ivy/ivy.xml.

<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />

Fifth, specify HBaseStore as the default datastore in the gora.properties file.

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
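
Before compiling, it can help to confirm that the three files edited in steps three to five agree on HBase. The following is a minimal sketch, not something shipped with Nutch: `check_hbase_backend` is a hypothetical helper, and the paths `conf/nutch-site.xml`, `conf/gora.properties`, and `ivy/ivy.xml` assume the usual layout of the Nutch 2 source tree.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): checks that the files edited in
# steps three to five all reference the HBase backend.
# Usage: check_hbase_backend /path/to/nutch-source
check_hbase_backend() {
    nutch_home="${1:-.}"
    store="org.apache.gora.hbase.store.HBaseStore"
    status=0

    # Step three: nutch-site.xml must name HBaseStore as the storage class.
    if ! grep -qF "<value>$store</value>" "$nutch_home/conf/nutch-site.xml" 2>/dev/null; then
        echo "nutch-site.xml: storage.data.store.class is not set to HBaseStore"
        status=1
    fi

    # Step four: ivy.xml must mention the gora-hbase dependency.
    # This simple check cannot tell whether the line is still commented out.
    if ! grep -qF 'name="gora-hbase"' "$nutch_home/ivy/ivy.xml" 2>/dev/null; then
        echo "ivy.xml: gora-hbase dependency not found"
        status=1
    fi

    # Step five: gora.properties must set the default datastore on an
    # uncommented line (the ^ anchor skips lines starting with '#').
    if ! grep -q "^gora.datastore.default=$store" "$nutch_home/conf/gora.properties" 2>/dev/null; then
        echo "gora.properties: gora.datastore.default is not set to HBaseStore"
        status=1
    fi

    [ "$status" -eq 0 ] && echo "HBase backend configured consistently"
    return "$status"
}
```

Calling `check_hbase_backend "$NUTCH_HOME"` prints one line per problem found and returns a non-zero status if any file is missing or does not mention HBase.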

Sixth, compile Nutch (Ant must be installed).

# ant runtime

You should now be able to use Nutch.
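
A first crawl needs a directory of seed URLs to inject. Here is a minimal sketch of a standalone run, assuming it is executed from $NUTCH_HOME/runtime/local and that HBase is already running. The directory name `urls`, the file name `seed.txt`, and the seed URL are placeholders chosen for illustration, and the `-depth` and `-topN` values are only examples.

```shell
#!/bin/sh
# Sketch of a first standalone crawl. Assumes the current directory is
# $NUTCH_HOME/runtime/local and that HBase is already running.

# Create the seed directory; "urls" and "seed.txt" are illustrative names.
mkdir -p urls
printf '%s\n' 'http://nutch.apache.org/' > urls/seed.txt

# Run the crawl only if the Nutch launcher exists in this directory.
if [ -x bin/nutch ]; then
    bin/nutch crawl urls -depth 3 -topN 5
else
    echo "bin/nutch not found; run this from \$NUTCH_HOME/runtime/local" >&2
fi
```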


Comments

  1. Hi, I have followed your steps, but when I run the nutch crawl command I get the exception below.

    Exception in thread "main" java.lang.RuntimeException: job failed: name=generate: null, jobid=job_local1224515128_0002
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:199)
    at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:152)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)

    I use this command. Please let me know your suggestions to fix this issue.

    Thanks,
    RP

  2. Hi, can you give the full crawl command you ran?
    Also, did you run Nutch in distributed or standalone mode?
    Running Nutch from the $NUTCH_HOME/runtime/local folder is standalone mode;
    running it from $NUTCH_HOME/runtime/deploy is distributed mode.
    You may have forgotten to specify the urls directory that should be injected on the first run.

  3. Hi! I have the same error.
    I run Nutch in the standalone folder.
    Here is my command:
    bin/nutch crawl urls -depth 3 -topN 5



  4. Hello! I have the same error, but it happens during fetch. Here's my stack trace:

    Exception in thread "main" java.lang.RuntimeException: job failed: name=fetch, jobid=job_local_0007
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
    at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:194)
    at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:161)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
    I ran the command bin/nutch crawl urls -depth 20 -topN 10000

