Gautam's Blog

The technical blog of Gautam!

Browsing Posts in distributedComputing

If there are issues with a Hadoop installation because the local directories differ from node to node, consider using Puppet: http://projects.reductivelabs.com/projects/puppet/wiki/Big_Picture

I have been using Hadoop to parse web logs and extract multiple features from them. The output fields are separated by commas, so the results can be fed into Weka for clustering analysis.
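As a rough illustration of the parse step, here is a minimal sketch in plain Java (no Hadoop dependency) that turns one log line into a comma-separated feature row. The input format (Apache Common Log Format), the chosen features and the class name LogFeatureSketch are all assumptions made for the example; the post does not say which format or features were actually used.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogFeatureSketch {
    // Assumed input: Apache Common Log Format, for example
    // 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[[^\\]]+\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}) (\\d+|-)$");

    /** Returns "ip,method,path,status,bytes", or empty if the line does not match. */
    static Optional<String> toCsv(String line) {
        Matcher m = CLF.matcher(line);
        if (!m.matches()) {
            return Optional.empty();   // skip malformed lines instead of failing the job
        }
        // Common Log Format writes "-" when no bytes were sent; map that to 0.
        String bytes = m.group(5).equals("-") ? "0" : m.group(5);
        return Optional.of(String.join(",", m.group(1), m.group(2), m.group(3), m.group(4), bytes));
    }
}
```

Inside a real mapper, the body of map() would call something like toCsv on each input line and emit the row only when it is present. A path that itself contains a comma would need quoting before the row is handed to Weka.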

I have been using Weka rather than Apache Mahout. Reasons:

  • Weka gives me a visual analysis of results.
  • The GUI-based mechanism helps to identify and understand how one dimension relates to another when they are plotted in a 2-dimensional space.

I will move on to Apache Mahout soon, once I understand the relationship of one feature to another.

Here is a link to the research areas outlined for Hadoop:

http://wiki.apache.org/hadoop/ProjectSuggestions

I need to look deeply into the design of Hadoop before I start working on some of these projects.

Check out the difference between writing code with and without a combiner class. The code I wrote without a combiner was taking a very long time to execute (1.5 days, and it still had not completed), which is extremely long for the size of data that I work with. Looking for the slowest link, I realized that the reduce phase was the bottleneck.

I noticed that my code did not have a combiner class (darn, I should have realized it earlier). After adding a combiner (which, BTW, is the same class as the reducer), the code finished in 15-20 minutes! The reasons for this performance improvement are clear:

  • <K, V> pairs are combined in memory on the map side, so the network traffic to the reducers, and the latency that comes with it, decreases.
  • Disk operations at the reducers are minimal because the combine step has already shrunk their input.

Lesson: Use a combiner class!
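To make the effect concrete, here is a small plain-Java simulation (no Hadoop dependency) that counts how many <K, V> pairs would travel to the reducer with and without a local combine step. The word-count job, the class name CombinerSketch and the sample splits are assumptions for illustration, not the job described above. Note that reusing the reducer as the combiner is only safe when the reduce operation is associative and commutative, as summing is here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CombinerSketch {
    /** Sums values per key; serves as both the combiner and the reducer. */
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return totals;
    }

    /** Map step: emits (word, 1) for every word in one input split. */
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    /** Returns the pairs that would cross the network to the reducer. */
    static List<Map.Entry<String, Integer>> shuffle(List<String> splits, boolean useCombiner) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits) {
            List<Map.Entry<String, Integer>> mapped = map(split);
            if (useCombiner) {
                // Pre-aggregate each split locally before anything is sent.
                shuffled.addAll(sumByKey(mapped).entrySet());
            } else {
                shuffled.addAll(mapped);
            }
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<String> splits = List.of("a b a a", "b b a");
        System.out.println("pairs shuffled without combiner: " + shuffle(splits, false).size());
        System.out.println("pairs shuffled with combiner:    " + shuffle(splits, true).size());
        System.out.println("final totals: " + sumByKey(shuffle(splits, true)));
    }
}
```

For these two tiny splits, 7 pairs shrink to 4 and the final totals are unchanged. On real logs with many repeated keys per split the reduction is far larger, which is consistent with the speed-up described above.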

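ChainMapper and ChainReducer are a way to perform [MAP+ / REDUCE MAP*] operations: one or more map steps, a single reduce, then zero or more map steps, all within one job.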

  • Below is an example main function that sets up a ChainMapper job.
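import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public static void main(String[] args) {
        // chainMapper is presumably the driver class holding this main method; JobConf uses it to locate the job jar.
        JobConf conf = new JobConf(chainMapper.class);
        conf.setJobName("Indexer");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        // MAP step: LineIndexMapper takes (LongWritable, Text) and emits (Text, Text).
        JobConf mapAConf = new JobConf(false);
        ChainMapper.addMapper(conf, LineIndexMapper.class, LongWritable.class, Text.class, Text.class, Text.class, true, mapAConf);
        // REDUCE step: timeReducer takes and emits (Text, Text).
        JobConf reduceConf = new JobConf(false);
        ChainReducer.setReducer(conf, timeReducer.class, Text.class, Text.class, Text.class, Text.class, true, reduceConf);

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        try {
            // runJob is static, so no JobClient instance is needed; it blocks until the job finishes.
            JobClient.runJob(conf);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }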
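The [MAP+ / REDUCE MAP*] shape can also be pictured without Hadoop. The following plain-Java sketch is only an analogy, not the ChainMapper API: the class ChainSketch and its map steps (trim, lower-case) are hypothetical. It shows two chained map steps, one reduce and one post-reduce map step, and it fuses the map steps into a single function, which is one way to see why a chain of mappers can often be collapsed into one.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class ChainSketch {
    // Two hypothetical record-level map steps.
    static final Function<String, String> TRIM = String::trim;
    static final Function<String, String> LOWER = String::toLowerCase;

    /** MAP+ : fuse the mappers into one function and apply it to every record. */
    static List<String> mapChain(List<String> records, List<Function<String, String>> mappers) {
        Function<String, String> fused =
            mappers.stream().reduce(Function.identity(), Function::andThen);
        return records.stream().map(fused).collect(Collectors.toList());
    }

    /** REDUCE : count how often each mapped record occurs. */
    static Map<String, Long> reduce(List<String> mapped) {
        return mapped.stream()
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    /** MAP* after the reducer: format each (key, count) pair as one output line. */
    static List<String> postMap(Map<String, Long> reduced) {
        return reduced.entrySet().stream()
            .map(e -> e.getKey() + "\t" + e.getValue())
            .collect(Collectors.toList());
    }

    /** Runs the whole [MAP+ / REDUCE MAP*] pipeline on in-memory records. */
    static List<String> run(List<String> records) {
        return postMap(reduce(mapChain(records, List.of(TRIM, LOWER))));
    }
}
```

Calling ChainSketch.run(List.of(" GET ", "get", "POST")) yields the lines "get\t2" and "post\t1" (tab-separated), since both spellings of GET normalize to the same key before the reduce.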

Notes:
  • A ChainMapper can be used to simplify processing, but by deftly handling the data, most [MAP+ / REDUCE MAP*] jobs can usually be reduced to [MAP / REDUCE MAP].

  • Add GeoLiteCity.dat to HDFS.
  • Use the distributed cache to refer to the file in your Map program. Some code is available in my previous post: http://singaraju.com/blogs/gautam/2010/02/20/adding-data-to-distributed-caching/
  • Once GeoLiteCity.dat is available in HDFS, use MaxMind’s LookupService to create an object backed by GeoLiteCity.dat.
  • Get a Location object for each IP from the LookupService.
  • A significant number of IPs might come back with a null city/country. Use try and catch blocks to catch NullPointerExceptions and process them accordingly.

Notes:

  • Minimize the creation of LookupService objects; they are resource intensive. Ideally, create one per task and reuse it for every record.
  • Similarly, minimize the creation of Location objects.
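The two notes above, plus the null handling, can be sketched as follows. To stay self-contained, this example does not call the MaxMind library: CityLookup is a hypothetical stand-in for the real lookup object, and GeoLookupSketch, resolve and the "UNKNOWN" marker are names invented for the illustration. It also swaps the try/catch on NullPointerException for an explicit null check, which is an alternative rather than what the steps above describe.

```java
final class GeoLookupSketch {
    /** Hypothetical stand-in for the MaxMind lookup; returns null for unknown IPs. */
    interface CityLookup {
        String cityFor(String ip);
    }

    // Created once (for example in configure()) and reused for every record.
    private final CityLookup lookup;
    private int unresolved = 0;

    GeoLookupSketch(CityLookup lookup) {
        this.lookup = lookup;
    }

    /** Called once per record, like map(); it never builds a new lookup object. */
    String resolve(String ip) {
        String city = lookup.cityFor(ip);
        if (city == null) {
            // Explicit null check instead of catching NullPointerException.
            unresolved++;
            return "UNKNOWN";
        }
        return city;
    }

    /** Number of IPs that came back without a city so far. */
    int unresolvedCount() {
        return unresolved;
    }
}
```

In a mapper, the expensive object would be built once per task and resolve would be called from map(). Counting the unresolved IPs makes it easy to see how significant the null results really are.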

To make a file available to your tasks through the distributed cache:

  • Copy the file to HDFS using hadoop fs -put local_file HDFS_file
  • Create a JobConf object and add the file’s URI to the DistributedCache.
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.filecache.DistributedCache;

try {
            // The URI is the file's path in HDFS; conf is the job's JobConf.
            DistributedCache.addCacheFile(new URI("/user/hadoop/GeoLiteCity.dat"), conf);
        } catch (URISyntaxException e) {
            e.printStackTrace();
        }
  • In your mapper, override the configure method, which is declared by org.apache.hadoop.mapred.JobConfigurable and given an empty default implementation by org.apache.hadoop.mapred.MapReduceBase.
private Path[] localFiles = new Path[0];

@Override
public void configure(JobConf job) {
            // Get the local copies of the cached files.
            try {
                Path[] cached = DistributedCache.getLocalCacheFiles(job);
                if (cached != null) {
                    localFiles = cached;
                }
                // Access the files you put in the cache as localFiles[0].toString() etc.
            } catch (IOException e) {
                System.err.println("Caught exception while getting cached files: " + StringUtils.stringifyException(e));
            }
        }
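When more than one file is cached, relying on localFiles[0] is fragile, so it may be safer to pick the file by name. The sketch below shows the idea in plain Java; it uses java.nio.file.Path as a stand-in for Hadoop's Path so that it runs on its own, and CacheFileSketch and findByName are hypothetical names. The null guard is defensive, on the assumption that the array may be missing when nothing was cached.

```java
import java.nio.file.Path;
import java.util.Optional;

final class CacheFileSketch {
    /** Returns the first cached file whose file name equals fileName, if any. */
    static Optional<Path> findByName(Path[] localFiles, String fileName) {
        if (localFiles == null) {
            return Optional.empty();   // nothing was cached
        }
        for (Path p : localFiles) {
            if (p != null && p.getFileName() != null
                    && p.getFileName().toString().equals(fileName)) {
                return Optional.of(p);
            }
        }
        return Optional.empty();
    }
}
```

With Hadoop's own Path class, the equivalent comparison would use getName() on each entry of localFiles, for example to find "GeoLiteCity.dat".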

Karmasphere has released a great Hadoop tool for the NetBeans editor. It visualizes each step of a Hadoop job at its various stages. Check it out; highly recommended.

http://www.hadoopstudio.org/tutorial-jobdev-walkthru.html

Hadoop!


I have a Hadoop installation running on just 2 boxes right now. I have my eyes on a few more machines. Hopefully I will get access to them, and then the ETL processing times for this enormous 9-terabyte data set will be cut down further. I wonder by what factor…

Running Hadoop on a single-node cluster: http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29

Running Hadoop on a multi-node cluster: http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
