Create dataset for Apache Storm benchmark

In this step, you will create the dataset file necessary for the StormEmailBenchmark.

Before you begin: Make sure you have completed the previous step: [Preprocess Enron Email Dataset](Preprocess Enron Email Dataset)

Three different datasets can be generated. The generation code for all three is in the package com.ibm.streamsx.storm.email.benchmark.testing.

- Compressed and Serialized (for the main application benchmark)
  - Generated using CreateDatasetSequential
  - For use with topologies: EnronTopology, BareboneTopology, and TrivialTopology1
- Compressed and Unserialized
  - Generated using CreateCompressedDatasetSequential
  - For use with topology TrivialTopology2
- Uncompressed and Serialized
  - Generated using CreateSerializedDatasetSequential
  - For use with topology RestrictedTopology
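
The pairing of topologies and generators listed above can be captured in a small lookup, which may be convenient in a wrapper script. This is only a sketch: the function name `generator_for_topology` is hypothetical and not part of the benchmark.

```shell
#!/bin/sh
# Hypothetical helper (not part of the benchmark): print the generator class
# that produces the dataset a given topology expects, per the list above.
generator_for_topology() {
    case "$1" in
        EnronTopology|BareboneTopology|TrivialTopology1)
            echo "CreateDatasetSequential" ;;
        TrivialTopology2)
            echo "CreateCompressedDatasetSequential" ;;
        RestrictedTopology)
            echo "CreateSerializedDatasetSequential" ;;
        *)
            # Unknown names are reported on stderr with a non-zero status.
            echo "unknown topology: $1" >&2
            return 1 ;;
    esac
}
```

For example, `generator_for_topology TrivialTopology2` prints the class name to pass to `java` when building the dataset for that topology.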

Each generator takes the output of the preprocessing stage as its input, and all three accept similar arguments.

For instance, to generate the compressed and serialized dataset:

java -cp target/storm-email-benchmark-1.0-jar-with-dependencies.jar com.ibm.streamsx.storm.email.benchmark.testing.CreateDatasetSequential <input_path: the output of CoalesceEnronDataset> <output_file_path_and_filename_with_ext>
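
Because the three generators are described as taking similar arguments, the other two datasets can presumably be produced the same way. The sketch below only prints the commands rather than running them. It assumes each generator accepts the same two positional arguments as CreateDatasetSequential, which should be checked against the source before use. The function name `dataset_command`, the `INPUT_PATH` and `OUTPUT_DIR` defaults, and the `.out` file names are placeholders.

```shell
#!/bin/sh
JAR="target/storm-email-benchmark-1.0-jar-with-dependencies.jar"
PKG="com.ibm.streamsx.storm.email.benchmark.testing"

# Hypothetical helper: print the java command line for one generator class.
# Assumption: every generator takes <input_path> <output_file>, in that order.
dataset_command() {
    echo "java -cp $JAR $PKG.$1 $2 $3"
}

# Placeholder locations; override via the environment or edit in place.
INPUT_PATH="${INPUT_PATH:-/path/to/preprocessed}"
OUTPUT_DIR="${OUTPUT_DIR:-/path/to/output}"

# Print one command per dataset variant (nothing is executed here).
for gen in CreateDatasetSequential \
           CreateCompressedDatasetSequential \
           CreateSerializedDatasetSequential; do
    dataset_command "$gen" "$INPUT_PATH" "$OUTPUT_DIR/$gen.out"
done
```

Review the printed commands, then run the ones you need from the project root so that the relative jar path resolves. Paths containing spaces would need quoting when the commands are executed.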

Next steps:

[Running Apache Storm benchmark](Running Apache Storm benchmark)
