If there are issues with a Hadoop installation because the local directories differ from node to node, consider using Puppet: http://projects.reductivelabs.com/projects/puppet/wiki/Big_Picture
Increase the JVM heap space for Weka from 128M to 1024M in its properties file:
C:\Program Files\Weka-3-6\RunWeka.ini
Change the entry for maxheap to:
maxheap=1024m
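One quick way to see what heap limit a JVM actually got is to print it. This is a minimal sketch, assuming RunWeka.ini hands the maxheap value to the JVM as the -Xmx option; the class name HeapCheck is hypothetical, and it reports the heap of the JVM it runs in, so it has to be started with the same flag.

```java
class HeapCheck {
    /** Maximum heap the running JVM may use, in whole mebibytes. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Run as: java -Xmx1024m HeapCheck
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

The number printed can be slightly below 1024 because some garbage collectors reserve part of the heap.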
I have been using Hadoop to parse web logs and extract multiple features from them. The output is comma-separated, so it can be fed straight into Weka for clustering analysis.
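To make the "log line in, comma-separated features out" step concrete, here is a rough sketch of the per-line parsing a mapper might do. It is not my actual job: the Common Log Format, the choice of features, the class name LogLineToCsv and the sample line are all assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineToCsv {
    // Assumed layout: host ident user [timestamp] "method path protocol" status bytes
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}) (\\d+|-)$");

    /** Returns "ip,method,path,status,bytes", or null if the line does not match. */
    static String toCsv(String line) {
        Matcher m = CLF.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // A "-" in the size field means no body was sent; treat it as 0 bytes.
        String bytes = m.group(6).equals("-") ? "0" : m.group(6);
        return String.join(",", m.group(1), m.group(3), m.group(4), m.group(5), bytes);
    }

    public static void main(String[] args) {
        // Made-up sample line, for illustration only.
        String line = "192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /index.html HTTP/1.0\" 200 2326";
        System.out.println(toCsv(line));
    }
}
```

A path that itself contains a comma would break the output, so real data may need quoting or a different separator.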
I have been using Weka rather than Apache Mahout. Reasons:
I will move on to Apache Mahout soon, once I understand how the features relate to one another.
Here is a link to the research areas suggested for Hadoop:
http://wiki.apache.org/hadoop/ProjectSuggestions
I need to look deeply into the design of Hadoop before I start working on some of these projects.
This is a better mechanism for finding the distance between two IPs. It is quicker than using the Haversine formula as discussed in my previous post, because the Location class computes the distance directly, so there is no need to fetch the latitude and longitude and compute the distance yourself.
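import com.maxmind.geoip.Location;
import com.maxmind.geoip.LookupService;

public class GeoDistanceTwoIPs {
    public static void main(String[] args) throws Exception {
        // args[0] is the geolite.dat file, args[1] is ip1, args[2] is ip2
        if (args.length < 3) {
            System.err.println("Usage: GeoDistanceTwoIPs <geolite.dat> <ip1> <ip2>");
            return;
        }
        LookupService lookupService = new LookupService(args[0]);
        Location location1 = lookupService.getLocation(args[1]);
        Location location2 = lookupService.getLocation(args[2]);
        lookupService.close();
        // getLocation returns null when an IP is not in the database
        if (location1 == null || location2 == null) {
            System.err.println("Could not locate one of the IPs");
            return;
        }
        System.out.println("Distance: " + location1.distance(location2) + " kilometers");
    }
}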
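For comparison, the Haversine approach mentioned above can be sketched as follows. This is a minimal sketch, not the code from the previous post: the class name HaversineSketch, the mean Earth radius of 6371 km, and the sample coordinates (roughly London and Paris) are assumptions for illustration.

```java
class HaversineSketch {
    // Assumed mean Earth radius in kilometers.
    private static final double EARTH_RADIUS_KM = 6371.0;

    /** Great-circle distance in kilometers between two points given in degrees. */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
            + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
            * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // Approximate coordinates of London and Paris.
        double km = distanceKm(51.5074, -0.1278, 48.8566, 2.3522);
        System.out.println("Distance: " + km + " kilometers");
    }
}
```

With this approach you first have to look up the latitude and longitude of each IP and then call distanceKm yourself, which is the extra step the Location-based version avoids.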
Check out the difference between running code with and without a combiner class. The code I wrote without a combiner class was taking a long time to execute (1.5 days, and it still had not completed), which is extremely long for the size of data that I work with. Looking for the slowest link, I realized that the reduce phase was the bottleneck.
I noticed that my code did not have a combiner class (darn, should have realized it earlier). After adding a combiner (which, BTW, is the same as the reducer class), the code finished in 15-20 minutes! The reason for this performance gain is straightforward: the combiner aggregates each mapper's output locally, so far less data is shuffled across the network to the reducer.
Lesson: Use a combiner class!
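To see why a combiner helps, here is a small plain-Java simulation of the idea. It does not use the Hadoop API and it is not my actual job: the class name CombinerSketch, the counting logic and the sample keys are assumptions for illustration. Each inner list stands for the input split of one map task, and the same sum method plays the role of both the combiner and the reducer.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {
    /** Sums the counts per key. Used as both the combiner and the reducer. */
    static Map<String, Integer> sum(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    /** Returns the (key, count) records that would travel from the map tasks to the reducer. */
    static List<Map.Entry<String, Integer>> shuffle(List<List<String>> splits, boolean useCombiner) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (List<String> split : splits) {
            // Map phase: emit (key, 1) for every record in this task's split.
            List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
            for (String key : split) {
                mapOutput.add(new AbstractMap.SimpleEntry<>(key, 1));
            }
            if (useCombiner) {
                // Combine phase: pre-aggregate on the map side before shuffling.
                shuffled.addAll(sum(mapOutput).entrySet());
            } else {
                shuffled.addAll(mapOutput);
            }
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("a", "a", "b", "a"),
            Arrays.asList("b", "b", "a"));
        for (boolean useCombiner : new boolean[] {false, true}) {
            List<Map.Entry<String, Integer>> shuffled = shuffle(splits, useCombiner);
            System.out.println("combiner=" + useCombiner
                + " shuffled records=" + shuffled.size()
                + " totals=" + sum(shuffled));
        }
    }
}
```

On this toy input the reducer receives 7 records without the combiner and 4 with it, and the totals are identical. Reusing the reducer as the combiner only works when the operation is associative and commutative, as a sum is, and Hadoop is free to run the combiner zero or more times, so the job must still be correct without it.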
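ChainMapper and ChainReducer are a way to perform [MAP+ / REDUCE MAP*] operations: one or more mappers, followed by a reducer, followed by zero or more mappers that run after the reducer.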
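import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public static void main(String[] args) {
    JobConf conf = new JobConf(chainMapper.class);
    conf.setJobName("Indexer");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // MAP: (LongWritable, Text) -> (Text, Text)
    JobConf mapAConf = new JobConf(false);
    ChainMapper.addMapper(conf, LineIndexMapper.class, LongWritable.class, Text.class, Text.class, Text.class, true, mapAConf);

    // REDUCE: (Text, Text) -> (Text, Text)
    JobConf reduceConf = new JobConf(false);
    ChainReducer.setReducer(conf, timeReducer.class, Text.class, Text.class, Text.class, Text.class, true, reduceConf);

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    try {
        // runJob is static, so no JobClient instance is needed
        JobClient.runJob(conf);
    } catch (Exception e) {
        e.printStackTrace();
    }
}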
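The [MAP+ / REDUCE MAP*] shape can also be seen without Hadoop. The following is only a plain-Java analogy, not the Hadoop API: the class name ChainPatternSketch, the two toy mappers (trim and lower-case), the counting reducer and the sample records are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

class ChainPatternSketch {
    static List<String> run(List<String> records) {
        // MAP+ : two mappers chained; the output of the first feeds the second.
        Function<String, String> trim = String::trim;
        Function<String, String> lowerCase = s -> s.toLowerCase(Locale.ROOT);
        Function<String, String> mapChain = trim.andThen(lowerCase);

        // REDUCE: group identical keys and count them.
        Map<String, Integer> counts = new TreeMap<>();
        for (String record : records) {
            counts.merge(mapChain.apply(record), 1, Integer::sum);
        }

        // MAP* : an optional mapper applied to each record the reducer emits.
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            output.add(entry.getKey() + "\t" + entry.getValue());
        }
        return output;
    }

    public static void main(String[] args) {
        for (String line : run(Arrays.asList(" GET ", "get", "POST"))) {
            System.out.println(line);
        }
    }
}
```

In the real thing, each extra stage before the reducer would be another ChainMapper.addMapper call, and each stage after it a ChainReducer.addMapper call.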
Notes:
[MAP+ / REDUCE MAP*] can be reduced to [MAP / REDUCE MAP]
Notes:
If you have a file in HDFS, add it to the distributed cache when you configure the job:
try {
    DistributedCache.addCacheFile(new URI("/user/hadoop/GeoLiteCity.dat"), conf);
} catch (URISyntaxException e) {
    e.printStackTrace();
}
// In the Mapper (or Reducer) class:
private Path[] localFiles;

public void configure(JobConf job) {
    // Get the cached archives/files
    try {
        localFiles = new Path[0];
        localFiles = DistributedCache.getLocalCacheFiles(job);
        // Access the files you put in the cache as localFiles[0].toString() etc.
    } catch (IOException e) {
        System.err.println("Caught exception while getting cached files: " + StringUtils.stringifyException(e));
    }
}