Running Nutch 2.0 with Cloudera’s CDH3 – missing plugins

Nutch 2.0 typically won't run on a full SCM installation of Cloudera's CDH3 (Hadoop base services, HBase, Hue).

The same problem occurs on a plain CDH3 distribution (without SCM) once Hue is installed. The error is caused by a behavior change introduced in MAPREDUCE-967, which modifies the way MapReduce unpacks the job jar: previously the whole jar was unpacked, but after the change only classes/ and lib/ are unpacked. As a result, Nutch complains about a missing plugins/ directory.
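
To confirm that this is what's happening, compare what ships inside the job file with what actually gets unpacked on a TaskTracker. A quick sketch, assuming the job file is named apache-nutch-2.0.job and a default CDH3 mapred.local.dir (both the file name and the path are assumptions; adjust them for your install):

# list the plugin entries bundled inside the Nutch job file
# (the jar does contain plugins/; it just never gets unpacked)
jar tf apache-nutch-2.0.job | grep '^plugins/' | head

# on a TaskTracker node, inspect an unpacked job directory: with the
# MAPREDUCE-967 behavior only classes/ and lib/ appear, plugins/ does not
ls /var/lib/hadoop-0.20/cache/mapred/mapred/local/taskTracker/*/jobcache/job_*/jars/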

Workaround:

1) force unpacking of the plugins/ directory by adding the following properties to nutch-site.xml:

<property>
  <name>mapreduce.job.jar.unpack.pattern</name>
  <value>(?:classes/|lib/|plugins/).*</value>
</property>

<property>
  <name>plugin.folders</name>
  <value>${job.local.dir}/../jars/plugins</value>
</property>

2) remove hue-plugins-1.2.0-cdh3u1.jar from the Hadoop lib folder (e.g. /usr/lib/hadoop-0.20/lib)

3) recreate the Nutch job file using Ant, so that the updated nutch-site.xml is packed into the job file

4) set HADOOP_OPTS="-Djob.local.dir=/<MY HOME>/nutch/plugins" in hadoop-env.sh
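
Taken together, steps 2 through 4 boil down to a few shell commands. A rough sketch, assuming a default CDH3 layout (/usr/lib/hadoop-0.20 and /etc/hadoop-0.20/conf) and a Nutch 2.0 source checkout under /<MY HOME>/nutch; the ant target and all of the paths here are assumptions, so adapt them to your setup:

# step 2: move the Hue plugin jar out of Hadoop's lib directory
# (keep a backup instead of deleting it outright)
sudo mv /usr/lib/hadoop-0.20/lib/hue-plugins-1.2.0-cdh3u1.jar /tmp/

# step 3: rebuild the Nutch job file so it picks up the nutch-site.xml changes
# ("ant runtime" is the usual Nutch 2.x target; the job file lands under runtime/deploy)
cd /<MY HOME>/nutch
ant clean runtime

# step 4: add the job.local.dir override to hadoop-env.sh
echo 'export HADOOP_OPTS="$HADOOP_OPTS -Djob.local.dir=/<MY HOME>/nutch/plugins"' | sudo tee -a /etc/hadoop-0.20/conf/hadoop-env.sh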


See the Nutch wiki for more information:

http://wiki.apache.org/nutch/ErrorMessagesInNutch2
