I have been using Hadoop to parse web logs and extract multiple features from them. The output is comma-separated, so it can then be fed into Weka to perform clustering analysis.
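To illustrate the parsing step, here is a minimal sketch that turns one log line into a comma-separated feature row. It assumes the logs are in Apache Common Log Format and that the features of interest are client IP, HTTP method, path, status code and response size; none of that is stated above. The class and method names (`LogFeatureExtractor`, `toCsv`) are hypothetical. It is plain Java with no Hadoop dependency; in a real job this logic would typically be called from a Mapper's `map()` method.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogFeatureExtractor {
    // Assumed input format (Apache Common Log Format), e.g.
    // 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
    // Groups: 1 = client IP, 2 = method, 3 = path, 4 = status, 5 = bytes
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[[^\\]]+\\] \"(\\S+) (\\S+)[^\"]*\" (\\d{3}) (\\d+|-)$");

    // Returns "ip,method,path,status,bytes", or empty if the line does not match.
    // Note: paths that themselves contain commas would need quoting; not handled here.
    public static Optional<String> toCsv(String line) {
        Matcher m = CLF.matcher(line.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        // Common Log Format uses "-" when no bytes were sent; map that to 0.
        String bytes = m.group(5).equals("-") ? "0" : m.group(5);
        return Optional.of(String.join(",",
            m.group(1), m.group(2), m.group(3), m.group(4), bytes));
    }

    public static void main(String[] args) {
        String sample =
            "127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] \"GET /index.html HTTP/1.0\" 200 2326";
        System.out.println(toCsv(sample).orElse("unparseable"));
    }
}
```

Lines that do not match the assumed format come back as an empty `Optional`, so they can be skipped or counted rather than producing malformed rows in the file handed to Weka.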

I have been using Weka rather than Apache Mahout, for these reasons:

  • Weka gives me a visual analysis of results.
  • The GUI makes it easier to identify and understand how one dimension relates to another when they are plotted in a 2-dimensional space.

I will move on to Apache Mahout soon, once I understand the relationship of one feature with another.