If there are issues with a Hadoop installation where the local directories on each node differ, consider using Puppet: http://projects.reductivelabs.com/projects/puppet/wiki/Big_Picture
I have been using Hadoop to parse web logs and extract multiple features from them. The output is comma-separated, so it can be fed straight into Weka for clustering analysis.
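As a rough illustration of the comma-separated output described above, here is a minimal sketch of a feature record that renders itself as one CSV line. The class name `LogFeatures` and its three fields are hypothetical placeholders, not the features from my actual job. The header row is included on the assumption that the CSV is loaded with Weka's CSV loader, which reads column names from the first row.

```java
// Hypothetical feature record for one parsed log entry. The field names are
// illustrative only and are not taken from the original job.
class LogFeatures {
    final String clientIp;
    final int requestCount;
    final long bytesSent;

    LogFeatures(String clientIp, int requestCount, long bytesSent) {
        this.clientIp = clientIp;
        this.requestCount = requestCount;
        this.bytesSent = bytesSent;
    }

    // Header row naming each column, written once at the top of the file.
    static String csvHeader() {
        return "client_ip,request_count,bytes_sent";
    }

    // One comma-separated line per record, in the same column order as the header.
    String toCsvLine() {
        return clientIp + "," + requestCount + "," + bytesSent;
    }

    public static void main(String[] args) {
        System.out.println(csvHeader());
        System.out.println(new LogFeatures("10.0.0.1", 3, 2048L).toCsvLine());
    }
}
```

This sketch does no quoting or escaping, so it only works when the feature values themselves contain no commas.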
I have been using Weka rather than Apache Mahout. Reasons:
I will move on to Apache Mahout soon, once I understand the relationship of one feature to another.
Here is a link to the research areas outlined for Hadoop:
http://wiki.apache.org/hadoop/ProjectSuggestions
I need to look deeply into the design of Hadoop to start working on some of these projects.
Check out the difference between writing code with and without a combiner class. The code I wrote without a combiner was taking a long time to execute (1.5 days, and it still did not complete), which is extremely long for the size of data that I work with. Looking for the slowest link, I realized that the reduce phase was the bottleneck.
I noticed that my code did not have a combiner class (darn, I should have realized it earlier). After adding a combiner (which, BTW, is the same class as the reducer), the code finished in 15-20 minutes! Now, the reasons for this performance enhancement are obvious:
Lesson: Use a combiner class!
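To make the effect concrete, here is a minimal plain-Java sketch of what a combiner does. It is a simulation and not Hadoop code: the class `CombinerSketch`, its methods, and the sample input are all hypothetical. Each string stands in for one input split, and the "combiner" pre-sums the map output of each split before anything is shuffled to the reducer. In a real job using the old `mapred` API, the combiner is registered on the job with `JobConf.setCombinerClass`. Reusing the reducer as the combiner is only safe when the reduce operation is associative and commutative, as a sum is.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CombinerSketch {
    // "Mapper": emit (token, 1) for every whitespace-separated token in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // Sum the values per key. The same logic serves as combiner and reducer.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Collect the records that would cross the network to the reducer.
    // With the combiner on, each split's map output is pre-summed locally first.
    static List<Map.Entry<String, Integer>> shuffle(String[] splits, boolean useCombiner) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits) {
            List<Map.Entry<String, Integer>> mapOutput = map(split);
            if (useCombiner) {
                shuffled.addAll(sumByKey(mapOutput).entrySet());
            } else {
                shuffled.addAll(mapOutput);
            }
        }
        return shuffled;
    }

    public static void main(String[] args) {
        String[] splits = {"GET GET POST GET", "POST GET GET"};
        System.out.println("shuffled without combiner: " + shuffle(splits, false).size());
        System.out.println("shuffled with combiner:    " + shuffle(splits, true).size());
        System.out.println("final totals: " + sumByKey(shuffle(splits, true)));
    }
}
```

On this toy input the reducer receives 4 records instead of 7 and the final totals are unchanged. On real logs, where the same key repeats many times within a split, the reduction in shuffled data is far larger.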
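ChainMapper and ChainReducer are a way to perform [MAP+ / REDUCE MAP*] operations: one or more mappers chained in the map phase, then a reducer, optionally followed by more mappers, all within a single job.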
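import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

    // Goes inside the driver class (chainMapper here); LineIndexMapper and timeReducer are my own classes.
    public static void main(String[] args) {
        JobConf conf = new JobConf(chainMapper.class);
        conf.setJobName("Indexer");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        // First (and here only) mapper in the chain: (LongWritable, Text) -> (Text, Text)
        JobConf mapAConf = new JobConf(false);
        ChainMapper.addMapper(conf, LineIndexMapper.class, LongWritable.class, Text.class, Text.class, Text.class, true, mapAConf);

        // The reducer of the chain: (Text, Text) -> (Text, Text)
        JobConf reduceConf = new JobConf(false);
        ChainReducer.setReducer(conf, timeReducer.class, Text.class, Text.class, Text.class, Text.class, true, reduceConf);

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // runJob is static, so no JobClient instance is needed
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }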
Notes:
[MAP+ / REDUCE MAP*] can be reduced to [MAP / REDUCE MAP]
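To show the shape of the [MAP+ / REDUCE MAP*] pattern without a cluster, here is a small plain-Java model of it. This is a sketch under heavy simplifications and not Hadoop code: the class `ChainSketch` and its `run` method are hypothetical, each map stage is a one-record-in, one-record-out function, and the reducer is hard-wired to count records per key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

class ChainSketch {
    // mapChain is the MAP+ part (at least one stage), the count per key is the REDUCE,
    // and postReduceChain is the MAP* part (zero or more stages applied to reducer output).
    static Map<String, Integer> run(List<String> records,
                                    List<UnaryOperator<String>> mapChain,
                                    List<UnaryOperator<Integer>> postReduceChain) {
        if (mapChain.isEmpty()) {
            throw new IllegalArgumentException("MAP+ needs at least one mapper");
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (String record : records) {
            String key = record;
            // Each stage's output feeds the next stage directly, with no intermediate job.
            for (UnaryOperator<String> stage : mapChain) {
                key = stage.apply(key);
            }
            counts.merge(key, 1, Integer::sum); // REDUCE: count records per key
        }
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            Integer value = entry.getValue();
            for (UnaryOperator<Integer> stage : postReduceChain) {
                value = stage.apply(value);
            }
            entry.setValue(value);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        records.add(" GET /a ");
        records.add("get /a");
        records.add("GET /b");

        List<UnaryOperator<String>> mapChain = new ArrayList<>();
        mapChain.add(s -> s.trim());        // first mapper: strip whitespace
        mapChain.add(s -> s.toUpperCase()); // second mapper: normalize case

        List<UnaryOperator<Integer>> postReduceChain = new ArrayList<>();
        postReduceChain.add(n -> n * 10);   // mapper after the reducer: rescale the count

        System.out.println(run(records, mapChain, postReduceChain));
    }
}
```

The point of the pattern is that all stages run inside one job, so the intermediate records are handed from stage to stage in memory rather than written out between separate jobs.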
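If you have a file in HDFS, add it to the distributed cache when you set up the job:
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.filecache.DistributedCache;

    try {
        // conf is the job's JobConf
        DistributedCache.addCacheFile(new URI("/user/hadoop/GeoLiteCity.dat"), conf);
    } catch (URISyntaxException e) {
        e.printStackTrace();
    }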
Then pick up the local copies in the configure method of the mapper:
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.StringUtils;

    private Path[] localFiles;

    public void configure(JobConf job) {
        // Get the cached archives/files
        try {
            localFiles = new Path[0];
            localFiles = DistributedCache.getLocalCacheFiles(job);
            // Access the files you put in the cache as localFiles[0].toString() etc.
        } catch (IOException e) {
            System.err.println("Caught exception while getting cached files: " + StringUtils.stringifyException(e));
        }
    }
Karmasphere has released a great Hadoop tool for the NetBeans editor. It visualizes each step of a Hadoop job at its various stages. Check it out, highly recommended.
I have a Hadoop installation going with just 2 boxes right now. I have my eyes on a few more machines. Hopefully I will get access to them, and then my ETL processing times for this enormous 9-terabyte data set will be cut down further. I wonder by what factor…