Gautam's Blog

The technical blog of Gautam!

Browsing Posts in Research

If there are issues with a Hadoop installation because the local directories differ from node to node, consider using Puppet to keep the nodes consistent: http://projects.reductivelabs.com/projects/puppet/wiki/Big_Picture

Increase the JVM heap space of Weka from 128M to 1024M in the properties file:

C:\Program Files\Weka-3-6\RunWeka.ini

Change the entry for maxheap to:

maxheap=1024m

I have been using Hadoop to parse web logs and extract multiple features from them. The output is comma-separated, so it can be fed into Weka for clustering analysis.
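
As a rough illustration of the parsing step, the sketch below turns one log line into a comma-separated feature row. It assumes the logs are in Apache Common Log Format and picks six features (IP, timestamp, method, path, status, bytes); the actual log format and feature set are not stated here, and the class name LogLineToCsv is only illustrative. In a Hadoop job, logic like this would sit inside the map method.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineToCsv {
    // Assumed format (Common Log Format):
    // host ident user [timestamp] "method path protocol" status bytes
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}) (\\d+|-)$");

    // Returns "ip,timestamp,method,path,status,bytes", or empty if the line does not match.
    static Optional<String> toCsv(String line) {
        Matcher m = CLF.matcher(line.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        // A "-" in the bytes field means no body was sent; record it as 0.
        String bytes = m.group(6).equals("-") ? "0" : m.group(6);
        return Optional.of(String.join(",",
            m.group(1), m.group(2), m.group(3), m.group(4), m.group(5), bytes));
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        System.out.println(toCsv(line).orElse("unparsed"));
    }
}
```

Request paths that themselves contain commas would need quoting or escaping before the rows are loaded into Weka; the sketch does not handle that.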

I have been using Weka rather than Apache Mahout, for these reasons:

  • Weka gives me a visual analysis of the results.
  • The GUI helps me identify and understand how one dimension relates to another when they are plotted in two-dimensional space.

I will move on to Apache Mahout once I understand how the features relate to one another.

Here is a link to the research areas suggested for Hadoop:

http://wiki.apache.org/hadoop/ProjectSuggestions

I need to look more deeply into the design of Hadoop before I start working on some of these projects.

This is a better mechanism for finding the distance between two IPs. It is faster than the Haversine approach discussed in my previous post because it skips the separate steps of retrieving each latitude and longitude and computing the distance by hand.

import com.maxmind.geoip.Location;
import com.maxmind.geoip.LookupService;

public class GeoDistanceTwoIPs {
    public static void main(String[] args) throws Exception {
        // args[0] is the GeoLite City database file, args[1] is ip1, args[2] is ip2
        LookupService lookupService = new LookupService(args[0]);
        try {
            Location location1 = lookupService.getLocation(args[1]);
            Location location2 = lookupService.getLocation(args[2]);
            // getLocation returns null when the database has no record for an IP
            if (location1 == null || location2 == null) {
                System.err.println("No location found for one or both IPs");
                return;
            }
            System.out.println("Distance: " + location1.distance(location2) + " kilometers");
        } finally {
            // Release the database file handle
            lookupService.close();
        }
    }
}
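
For comparison, the Haversine calculation mentioned above can be sketched roughly as follows. This is a generic version of the formula, not necessarily the code from the earlier post; the Earth radius of 6371 km is an assumed mean value, and the class and method names are illustrative.

```java
class HaversineDistance {
    // Assumed mean Earth radius in kilometers; other values are also in common use.
    private static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance in kilometers between two points given in decimal degrees.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
            + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
            * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        // Clamp to 1.0 so rounding error can never push asin out of its domain.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    public static void main(String[] args) {
        // Illustrative coordinates: roughly New York City and London.
        System.out.printf("%.1f km%n", distanceKm(40.7128, -74.0060, 51.5074, -0.1278));
    }
}
```

The result depends on the radius chosen, so it may differ slightly from what a library call reports for the same pair of points.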

Check out the difference between writing code with and without a combiner class. The code I wrote without a combiner ran for 1.5 days without completing, which is extremely long for the size of data I work with. Looking for the slowest link, I realized that the reduce phase was the bottleneck.

I noticed that my code did not have a combiner class (darn, I should have realized it earlier). After adding a combiner (which, BTW, is the same class as the reducer), the code finished in 15-20 minutes! The reasons for this performance improvement are straightforward:

  • <K, V> pairs are combined in memory on the map side, so less data crosses the network to the reducers.
  • Disk operations at the reducers are minimal because the combine step has already shrunk their input.

Lesson: Use a combiner class!
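
To make the effect concrete, here is a small plain-Java simulation (no Hadoop involved) that counts how many records would be shuffled to the reducers with and without a local combine step. The input splits and the counting job are made up for illustration. In the old mapred API the combiner is registered with JobConf.setCombinerClass. Reusing the reducer as the combiner is only safe when the reduce operation is associative and commutative (sums and counts are), because Hadoop may run the combiner zero, one, or several times.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CombinerSketch {
    // Map phase for one input split: emit (token, 1) for every token.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : split.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<String, Integer>(token, 1));
            }
        }
        return out;
    }

    // Combine phase: sum the counts per key locally, before anything is shuffled.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new HashMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            combined.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Hypothetical input splits, one string per map task.
        String[] splits = {"GET GET POST GET", "POST GET GET GET"};
        int withoutCombiner = 0;
        int withCombiner = 0;
        for (String split : splits) {
            List<Map.Entry<String, Integer>> mapOutput = map(split);
            withoutCombiner += mapOutput.size();
            withCombiner += combine(mapOutput).size();
        }
        System.out.println("Records shuffled without combiner: " + withoutCombiner); // 8
        System.out.println("Records shuffled with combiner: " + withCombiner);       // 4
    }
}
```

The saving grows with the number of repeated keys per split, which is why a job with a few hot keys benefits so much.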

ChainMappers are a way to perform [MAP+ / REDUCE MAP*] operations: one or more mappers, then a reducer, then zero or more further mappers.

  • Below is an example main function that sets up a ChainMapper.
public static void main(String[] args) {
        JobConf conf = new JobConf(chainMapper.class);
        conf.setJobName("Indexer");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        // First (and here only) mapper in the chain: <LongWritable, Text> in, <Text, Text> out
        JobConf mapAConf = new JobConf(false);
        ChainMapper.addMapper(conf, LineIndexMapper.class, LongWritable.class, Text.class, Text.class, Text.class, true, mapAConf);
        // Reducer for the chain: <Text, Text> in, <Text, Text> out
        JobConf reduceConf = new JobConf(false);
        ChainReducer.setReducer(conf, timeReducer.class, Text.class, Text.class, Text.class, Text.class, true, reduceConf);
        // More mappers can be added with ChainMapper.addMapper (before the reducer)
        // or ChainReducer.addMapper (after it).

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        try {
            // runJob is static, so no JobClient instance is needed
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

Notes:
  • A ChainMapper can be used to simplify processing. Usually, though, by deftly handling the data, most [MAP+ / REDUCE MAP*] jobs can be reduced to [MAP / REDUCE MAP].

  • Add GeoLiteCity.dat to HDFS.
  • Use the distributed cache to refer to the file in your Map program. Some code is available in my previous post: http://singaraju.com/blogs/gautam/2010/02/20/adding-data-to-distributed-caching/
  • Once GeoLiteCity.dat is available in HDFS, use Maxmind’s LookupService to create an object linked to GeoLiteCity.dat.
  • Get a Location object by passing the IP to the LookupService.
  • A significant number of IPs might come back with a null city/country. Check for null, or use try and catch blocks to catch NullPointerExceptions, and process those records accordingly.

Notes:

  • Minimize the creation of LookupService objects; they are resource intensive.
  • Similarly, minimize the creation of Location objects.
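
One way to follow both notes is to build the expensive lookup object once per task and remember the answer for each IP already seen. The sketch below shows the pattern in plain Java. The Function<String, String> is only a stand-in for a real MaxMind lookup, and the class name, the "UNKNOWN" placeholder, and the sample data are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class CachedGeoLookup {
    // Stand-in for the expensive lookup object; created once and reused for every record.
    private final Function<String, String> lookup;
    private final Map<String, String> cache = new HashMap<>();

    CachedGeoLookup(Function<String, String> lookup) {
        this.lookup = lookup;
    }

    // Returns the country for an IP, or "UNKNOWN" when the lookup yields null.
    String countryFor(String ip) {
        return cache.computeIfAbsent(ip, key -> {
            String country = lookup.apply(key);
            return country != null ? country : "UNKNOWN";
        });
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical data: only one address (from a documentation range) is "known".
        CachedGeoLookup geo = new CachedGeoLookup(ip -> {
            calls[0]++;
            return ip.equals("192.0.2.1") ? "US" : null;
        });
        for (String ip : new String[] {"192.0.2.1", "198.51.100.7", "192.0.2.1"}) {
            System.out.println(ip + " -> " + geo.countryFor(ip));
        }
        System.out.println("Underlying lookups: " + calls[0]); // 2, the repeat is served from the cache
    }
}
```

In a mapper, an object like this would typically be created in configure rather than in map. The cache here is unbounded, so logs with very many distinct IPs would need a size limit.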

To use files in HDFS through the distributed cache:

  • Copy the files into HDFS using hadoop fs -put local_file HDFS_file
  • Create a JobConf object and add each file's URI to the DistributedCache.
try {
    DistributedCache.addCacheFile(new URI("/user/hadoop/GeoLiteCity.dat"), conf);
} catch (URISyntaxException e) {
    e.printStackTrace();
}
  • In a class that extends org.apache.hadoop.mapred.MapReduceBase, override the configure method declared by org.apache.hadoop.mapred.JobConfigurable
private Path[] localFiles = new Path[0];

@Override
public void configure(JobConf job) {
    try {
        // Get the local copies of the cached files
        localFiles = DistributedCache.getLocalCacheFiles(job);
        // Access the files you put in the cache as localFiles[0].toString() etc.
    } catch (IOException e) {
        System.err.println("Caught exception while getting cached files: " + StringUtils.stringifyException(e));
    }
}
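
Indexing with localFiles[0] relies on the order in which files were added to the cache. When more than one file is cached, selecting by file name may be more robust. The sketch below shows that idea with plain strings and java.nio so it runs without Hadoop; with Hadoop's Path objects the comparison would use their file name instead. The class name and the example paths are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

class CachedFileFinder {
    // Returns the first local path whose file name equals wantedName, if any.
    static Optional<String> findByName(String[] localPaths, String wantedName) {
        if (localPaths == null) {
            return Optional.empty(); // nothing was cached
        }
        for (String p : localPaths) {
            Path name = Paths.get(p).getFileName();
            if (name != null && name.toString().equals(wantedName)) {
                return Optional.of(p);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical local cache paths, as a task might see them.
        String[] local = {"/tmp/cache/stopwords.txt", "/tmp/cache/GeoLiteCity.dat"};
        System.out.println(findByName(local, "GeoLiteCity.dat").orElse("not cached"));
    }
}
```

Returning an Optional makes the caller deal with the case where the file never reached the cache, instead of failing later with a null.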

Some fantastic articles on security:

https://www.eff.org/deeplinks/2009/09/new-cookie-technologies-harder-see-and-remove-wide

https://www.eff.org/deeplinks/2009/09/online-trackers-and-social-networks

http://www.eff.org/deeplinks/2010/01/tracking-by-user-agent

© 2012 Gautam's Blog