A fantastic post by Kristóf Kovács: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
File size affects the performance of Hadoop. To evaluate this, I wrote a Java program that creates files of a specific size and uploaded them into Hadoop for testing.
The sum of all file sizes was 512 MB. The Hadoop Streaming job ran cat as the mapper and wc as the reducer. The results are attached below:

It can be seen that, for the same total amount of data, Hadoop is faster when the individual files are larger.
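The file-generating program itself is not shown in this post. A minimal sketch of what such a generator could look like follows, assuming equally sized files whose sizes add up to a fixed total. The class name, file names and filler text are hypothetical and not taken from the original program.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: splits a fixed data volume into equally sized text files. */
final class FixedSizeFileGenerator {

    /** Number of files needed so that the individual sizes add up to totalBytes. */
    static int fileCount(long totalBytes, long fileBytes) {
        if (fileBytes <= 0 || totalBytes % fileBytes != 0) {
            throw new IllegalArgumentException("fileBytes must evenly divide totalBytes");
        }
        return (int) (totalBytes / fileBytes);
    }

    /** Writes one file of exactly fileBytes bytes of repeated text (the last line may be cut short). */
    static void writeFile(Path target, long fileBytes) throws IOException {
        byte[] line = "the quick brown fox jumps over the lazy dog\n"
                .getBytes(StandardCharsets.US_ASCII);
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            long written = 0;
            while (written < fileBytes) {
                int n = (int) Math.min(line.length, fileBytes - written);
                out.write(line, 0, n);
                written += n;
            }
        }
    }

    /** Creates the whole file set under dir and returns how many files were written. */
    static int generate(Path dir, long totalBytes, long fileBytes) throws IOException {
        Files.createDirectories(dir);
        int count = fileCount(totalBytes, fileBytes);
        for (int i = 0; i < count; i++) {
            writeFile(dir.resolve(String.format("part-%05d.txt", i)), fileBytes);
        }
        return count;
    }
}
```

Calling generate(dir, 512L * 1024 * 1024, 64L * 1024 * 1024) would produce eight 64 MB files; the resulting directory could then be uploaded with hadoop fs -put before each run.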
Here is a link to the research areas suggested for Hadoop:
http://wiki.apache.org/hadoop/ProjectSuggestions
I need to look more deeply into the design of Hadoop before I can start working on some of these projects.
My programs take a long time to run because of the large datasets they operate on. To avoid constantly checking the execution status, I wrote a small utility class that sends status notifications via Gtalk. Gtalk is based on the Jabber (XMPP) protocol. The Smack API is a Java implementation of that protocol and can be downloaded here: http://www.igniterealtime.org/projects/smack/
package util;
import org.jivesoftware.smack.*;
import org.jivesoftware.smack.packet.Message;
import java.util.Calendar;
import java.util.HashMap;
import java.util.Iterator;
import java.text.SimpleDateFormat;
/**
* Created by IntelliJ IDEA.
* User: Gautam
* Date: Dec 16, 2009
* Time: 11:32:23 AM
* To change this template use File | Settings | File Templates.
*/
public class gchat {
    static final String DATE_FORMAT_NOW = "yyyy-MM-dd HH:mm:ss";
    // Recipients of the notification; only the keys (Jabber IDs) are used.
    static HashMap<String, String> hp = new HashMap<String, String>();
    static ConnectionConfiguration config;
    static XMPPConnection connection;
    static Chat chat;
    static String user = "";

    // Current local time, formatted for the message body.
    public static String now() {
        Calendar cal = Calendar.getInstance();
        SimpleDateFormat sdf = new SimpleDateFormat(DATE_FORMAT_NOW);
        return sdf.format(cal.getTime());
    }

    // Registers the recipient, then connects and logs in to the Gtalk server.
    public static void setup() throws XMPPException {
        hp.put("ToUser@gmail.com", "");
        //IP of talk.google.com
        config = new ConnectionConfiguration("64.233.169.125", 5222, "gmail.com");
        connection = new XMPPConnection(config);
        connection.connect();
        connection.login("FromUser@gmail.com", "Password");
    }

    // Static like the other methods, so it can be called without an instance.
    public static void disconnect() {
        if (connection != null)
            connection.disconnect();
    }

    // Sends the message to every recipient, connecting first if necessary.
    public static void sendMessage(String messages) throws Exception {
        if (hp.isEmpty() || connection == null || !connection.isConnected())
            setup();
        try {
            Iterator<String> it = hp.keySet().iterator();
            while (it.hasNext()) {
                user = it.next();
                chat = connection.getChatManager().createChat(user, new MessageListener() {
                    public void processMessage(Chat chat, Message message) {
                        System.out.println("Received message: " + message);
                    }
                });
                chat.sendMessage("Automated message from Gautam's robot at: " + now() + ". Message: " + messages);
            }
        }
        catch (XMPPException e) {
            e.printStackTrace();
        }
    }
}
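One way to use such a notifier is to wrap the long-running job so that a message goes out when it finishes or fails. The following is only a sketch under my own assumptions: NotifyingRunner and its Notifier interface are hypothetical names, and it uses Java 8 lambdas, which the class above does not require.

```java
import java.util.concurrent.Callable;

/** Hypothetical wrapper that reports the outcome of a long-running task. */
final class NotifyingRunner {

    /** Minimal notification hook; an implementation could delegate to gchat.sendMessage. */
    interface Notifier {
        void send(String text) throws Exception;
    }

    /** Runs the task, then reports success or failure; a failure is rethrown to the caller. */
    static <T> T run(String name, Callable<T> task, Notifier notifier) throws Exception {
        long start = System.currentTimeMillis();
        T result;
        try {
            result = task.call();
        } catch (Exception e) {
            notifier.send(name + " failed: " + e);
            throw e;
        }
        long elapsed = System.currentTimeMillis() - start;
        notifier.send(name + " finished in " + elapsed + " ms");
        return result;
    }
}
```

With the class above on the classpath, a call could look like NotifyingRunner.run("log parsing", () -> parseLogs(), gchat::sendMessage), where parseLogs() stands in for whatever job is being run.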
My malware research requires breaking the lines of a large number of logs into smaller tokens, so I evaluated the performance of Java’s StringTokenizer, Pattern and Scanner.
Evaluation setup: the x-axis is the number of tokens in the string being split. Each string was split 1000 times.
A new instance of StringTokenizer and Scanner was created for each split, whereas a single Pattern instance was reused. However, Pattern.split() allocates a new array on every call, and the time this takes is included in the measurements.

Conclusion: Use StringTokenizer whenever possible to improve performance. Where more flexibility is needed, use the other constructs.
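The benchmark code is not included in this post. A rough sketch of how the three approaches could be compared is given below, assuming single-space delimiters and a synthetic input line. The class and method names are hypothetical, and the timing loop is naive (no JIT warm-up, list building included), so absolute numbers should be read with caution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

final class TokenizerComparison {
    // One shared Pattern instance, reused for every split.
    private static final Pattern SPACE = Pattern.compile(" ");

    static List<String> withStringTokenizer(String line) {
        List<String> tokens = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(line, " "); // new instance per call
        while (st.hasMoreTokens()) tokens.add(st.nextToken());
        return tokens;
    }

    static List<String> withPattern(String line) {
        List<String> tokens = new ArrayList<String>();
        for (String t : SPACE.split(line)) tokens.add(t); // split() allocates a new array
        return tokens;
    }

    static List<String> withScanner(String line) {
        List<String> tokens = new ArrayList<String>();
        Scanner sc = new Scanner(line).useDelimiter(" "); // new instance per call
        while (sc.hasNext()) tokens.add(sc.next());
        sc.close();
        return tokens;
    }

    /** Builds a synthetic line "tok0 tok1 ..." with the requested number of tokens. */
    static String sampleLine(int tokenCount) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokenCount; i++) {
            if (i > 0) sb.append(' ');
            sb.append("tok").append(i);
        }
        return sb.toString();
    }

    /** Runs the action the given number of times and returns the elapsed microseconds. */
    static long time(int rounds, Runnable action) {
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) action.run();
        return (System.nanoTime() - start) / 1000;
    }

    public static void main(String[] args) {
        final String line = sampleLine(100);
        int rounds = 1000; // each string is split 1000 times
        System.out.println("StringTokenizer: " + time(rounds, () -> withStringTokenizer(line)) + " us");
        System.out.println("Pattern.split:   " + time(rounds, () -> withPattern(line)) + " us");
        System.out.println("Scanner:         " + time(rounds, () -> withScanner(line)) + " us");
    }
}
```

Varying the argument to sampleLine would give the different token counts on the x-axis.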