Create dataset for Apache Storm benchmark

Zubair Nabi edited this page May 20, 2015 · 8 revisions

In this step, you will create the dataset file necessary for the StormEmailBenchmark.

Before you begin: Make sure you have completed the previous step, [Preprocess Enron Email Dataset](Preprocess Enron Email Dataset).

Three different datasets can be generated. The generation code for all three is present within the package com.ibm.streamsx.storm.email.benchmark.testing.

  1. Compressed and Serialized: for the main application benchmark
    • Generated using CreateDatasetSequential
    • For use with topologies: EnronTopology, BareboneTopology, and TrivialTopology1
  2. Compressed and Unserialized
    • Generated using CreateCompressedDatasetSequential
    • For use with topology TrivialTopology2
  3. Uncompressed and Serialized
    • Generated using CreateSerializedDatasetSequential
    • For use with topology RestrictedTopology

All three generators take the output of the preprocessing stage as their input, and they accept similar arguments.

For instance, to generate the compressed and serialized dataset (option 1 above):

java -cp target/storm-email-benchmark-1.0-jar-with-dependencies.jar com.ibm.streamsx.storm.email.benchmark.testing.CreateDatasetSequential <input_path: the output of CoalesceEnronDataset> <output_file_path_and_filename_with_ext>
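The same invocation pattern can be sketched for all three generators. The script below only prints the commands (a dry run), so nothing runs until you copy a line. It assumes that CreateCompressedDatasetSequential and CreateSerializedDatasetSequential take the same two arguments as CreateDatasetSequential; this page only says the arguments are "similar", so check each class before relying on it. The `dataset_cmd` helper, the paths, and the `.dat` extension are hypothetical placeholders.

```shell
#!/bin/sh
# Dry-run sketch: print the generation command for each dataset variant.
# ASSUMPTION: every generator accepts <input_path> <output_file>, like
# CreateDatasetSequential. The page only states the arguments are "similar".
JAR=target/storm-email-benchmark-1.0-jar-with-dependencies.jar
PKG=com.ibm.streamsx.storm.email.benchmark.testing

# Hypothetical helper: build the command line for one generator class.
# $1 = generator class name, $2 = input path, $3 = output file
dataset_cmd() {
    printf 'java -cp %s %s.%s %s %s\n' "$JAR" "$PKG" "$1" "$2" "$3"
}

# Placeholder paths for illustration; replace with your own.
INPUT=/path/to/preprocessed
OUTDIR=/path/to/output

for gen in CreateDatasetSequential \
           CreateCompressedDatasetSequential \
           CreateSerializedDatasetSequential; do
    # The .dat extension is an arbitrary placeholder, not a required format.
    dataset_cmd "$gen" "$INPUT" "$OUTDIR/$gen.dat"
done
```

Printing rather than executing keeps the sketch safe to run before the argument lists of the other two generators have been confirmed.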

Next steps:

[Running Apache Storm benchmark](Running Apache Storm benchmark)