Create dataset for Apache Storm benchmark
In this step, you will create the dataset file necessary for the StormEmailBenchmark.
Before you begin: Make sure you have completed the previous step: [Preprocess Enron Email Dataset](Preprocess Enron Email Dataset)
Three different datasets can be generated. The generation code for all three is in the package `com.ibm.streamsx.storm.email.benchmark.testing`.
- Compressed and Serialized: for the main application benchmark
  - Generated using `CreateDatasetSequential`
  - For use with topologies: `EnronTopology`, `BareboneTopology`, and `TrivialTopology1`
- Compressed and Unserialized
  - Generated using `CreateCompressedDatasetSequential`
  - For use with topology `TrivialTopology2`
- Uncompressed and Serialized
  - Generated using `CreateSerializedDatasetSequential`
  - For use with topology `RestrictedTopology`
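As a rough illustration of what "compressed and serialized" can mean for a Java dataset, the sketch below writes records through Java object serialization wrapped in GZIP and reads them back. The class name `DatasetFormatSketch`, the use of plain `String` records, and the GZIP-plus-`ObjectOutputStream` layout are all assumptions made for illustration; the actual format written by `CreateDatasetSequential` is defined in the benchmark source and may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: NOT the benchmark's real dataset format.
public class DatasetFormatSketch {

    // Serialize each record with Java serialization, then GZIP the stream.
    static byte[] encode(List<String> records) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeInt(records.size());      // record count first
            for (String record : records) {
                out.writeObject(record);       // one serialized object per record
            }
        }                                      // closing flushes and finishes the GZIP stream
        return bytes.toByteArray();
    }

    // Reverse of encode: decompress, then deserialize record by record.
    static List<String> decode(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new GZIPInputStream(new ByteArrayInputStream(data)))) {
            int count = in.readInt();
            List<String> records = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                records.add((String) in.readObject());
            }
            return records;
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> emails = Arrays.asList("first email body", "second email body");
        byte[] encoded = encode(emails);
        System.out.println("Round trip ok: " + decode(encoded).equals(emails));
    }
}
```

In this sketch, dropping the `GZIPOutputStream`/`GZIPInputStream` wrappers would give an uncompressed but still serialized stream, which is one way to picture the difference between the variants listed above.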
All three generators take the output of the preprocessing stage as their input, and their arguments are similar.
For instance, to generate the compressed and serialized dataset:
    java -cp target/storm-email-benchmark-1.0-jar-with-dependencies.jar \
        com.ibm.streamsx.storm.email.benchmark.testing.CreateDatasetSequential \
        <input_path: the output of CoalesceEnronDataset> \
        <output_file_path_and_filename_with_ext>
[Running Apache Storm benchmark](Running Apache Storm benchmark)