
Storm in a Teacup

Reading time: 10 min


I wanted to call this blog article something like “Storm in a Nutshell” but decided against it as

  1. there is probably a book by that name out there somewhere, and I wanted to avoid any unannounced visits in the dead of night from shady-looking types from the copyright police, and
  2. I really wanted to use a corny pun.

So think of a teacup as conceptually similar to a nutshell, but bigger.

On a recent project, we used Apache Storm as the real-time component of a complex, cloud-based environment used for fraud detection. In this article I would like to offer an introductory overview of Storm, showing how to define a simple spout and bolt, as well as highlighting some of the issues that are important when building Storm topologies.

Batch or real-time?

By way of introduction, let’s briefly describe which tools fit into which space in the real-time/batch paradigm.

Storm vs. MapReduce vs. Flink vs. Spark

Apache Storm is basically a streaming tool, but also offers mini-batch capabilities with its Trident abstraction layer. MapReduce is firmly in the batch paradigm, and Apache Spark offers mini-batching (somewhat confusingly referred to as “Spark Streaming”) and batch processing (“Spark SQL”). Apache Flink, like Storm, covers streaming and mini-batch use cases, but at the time of writing is not yet bundled with any of the Hadoop distributions (Hortonworks, Cloudera and MapR).

We chose Storm as we wanted the reliability of a distribution-backed (and tested) component that could deliver streaming capabilities.

Bolts and Spouts

Storm uses three main types of object: spouts, bolts, and topologies, which combine these elements into a chain.

Spout: a spout acts as an input to a flow, or stream, of tuples through a process defined by a topology. There can be multiple spouts in a topology, but typically there will be just one, pulling data from a source such as Kafka or Event Hubs (as we have in Azure). You can also define your own spouts for testing purposes that generate data and emit it to bolts.

Bolt: a bolt consumes tuples from an input stream. Storm comes with abstract classes that you can extend: the simplest of these is the BaseBasicBolt class, where only two methods have to be implemented: the execute() method (where any processing is done) and the declareOutputFields() method, which defines what the objects emitted from the execute() method look like (specifically, what the fields are called and to which output stream they belong). Any “acking” (ack = acknowledgement notification) is implemented by this abstract class behind the scenes, as is tuple chaining (emitting a tuple identifier along with the array of emitted values, so that the topology can track which tuples have made it through all bolts successfully). The BaseRichBolt abstract class, on the other hand, requires that you implement any acking or chaining yourself.

Topology: a topology combines spouts and bolts, defining which output streams exist and by which bolts they are consumed. A topology can be started in local mode (for testing) or in cluster mode. A tuple passing through a topology can optionally be acked so that the spout can take specific actions, such as replaying that tuple (this guarantees “at least once” processing).

A simple example

A spout in its simplest form is listed below. We initialize the collector object in the open() method, which is used for emitting randomly-generated, base64-encoded strings. We use a simple mechanism for limiting the number of tuples that can be active (i.e. not yet acked) in the topology: this is to avoid filling the internal queues to the point of overflow (which can lead to out-of-memory exceptions). We can bypass this simple queue mechanism by setting maxPendingMsgs to UNLIMITED_PENDING in the constructor. There is no replay (the class is intentionally simplified), as failed tuples are simply removed from the pending queue. If a queue limit has been specified (i.e. maxPendingMsgs != UNLIMITED_PENDING), and this limit has been reached, then nothing new is emitted from the spout until there is space in the queue.
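A sketch of such a spout is shown below, written against the org.apache.storm.* API (older releases used backtype.storm.*); the RandomStringSpout class name and the details of the pending-set bookkeeping are illustrative.

```java
import java.util.Base64;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class RandomStringSpout extends BaseRichSpout {

    public static final int UNLIMITED_PENDING = -1;

    private final int maxPendingMsgs;
    private transient SpoutOutputCollector collector;
    private transient Random random;
    private transient Set<Object> pending;   // emitted but not yet acked

    public RandomStringSpout(int maxPendingMsgs) {
        this.maxPendingMsgs = maxPendingMsgs;
    }

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.random = new Random();
        this.pending = ConcurrentHashMap.newKeySet();
    }

    @Override
    public void nextTuple() {
        // If a queue limit is set and has been reached, emit nothing until space frees up
        if (maxPendingMsgs != UNLIMITED_PENDING && pending.size() >= maxPendingMsgs) {
            return;
        }
        byte[] payload = new byte[32];
        random.nextBytes(payload);
        String encoded = Base64.getEncoder().encodeToString(payload);

        Object msgId = UUID.randomUUID().toString();
        pending.add(msgId);
        collector.emit(new Values(encoded), msgId);   // emit with an id so ack/fail are called back
    }

    @Override
    public void ack(Object msgId) {
        pending.remove(msgId);
    }

    @Override
    public void fail(Object msgId) {
        // Intentionally no replay: a failed tuple simply frees its slot in the queue
        pending.remove(msgId);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("encoded"));
    }
}
```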

A simple bolt that consumes data from this spout is listed below.
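A sketch of such a bolt, extending BaseBasicBolt; the ReadMapBolt class name is the one referenced later in the topology, while its body (decoding the string and emitting its length) is purely illustrative.

```java
import java.util.Base64;

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ReadMapBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // Read the field by position (0) rather than by its name ("encoded")
        String encoded = input.getString(0);
        int decodedLength = Base64.getDecoder().decode(encoded).length;

        // Acking and tuple chaining are handled by BaseBasicBolt behind the scenes
        collector.emit(new Values(encoded, decodedLength));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("encoded", "decodedLength"));
    }
}
```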

Note that the information declared in the declareOutputFields() methods must be consistent across the topology (whether tuple fields are later retrieved explicitly by name or, as above, implicitly by position); otherwise the topology will throw an exception on deployment.

Lastly, our topology links the spout and bolt together:
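A sketch of that wiring in local mode, reusing the illustrative RandomStringSpout and ReadMapBolt classes from above; the topology and component names are also illustrative.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

public class TeacupTopology {

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // One spout executor; four bolt executors and four bolt tasks
        builder.setSpout("spout", new RandomStringSpout(1000), 1);
        builder.setBolt("bolt", new ReadMapBolt(), 4)
               .setNumTasks(4)
               .shuffleGrouping("spout");

        Config conf = new Config();
        conf.setNumWorkers(1);

        // Local mode: run for 15 minutes, then shut down
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("storm-in-a-teacup", conf, builder.createTopology());
        Thread.sleep(15 * 60 * 1000);
        cluster.shutdown();
    }
}
```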

This topology launches in local mode and shuts down after 15 minutes.

We now move on to considering some not-so-trivial aspects of a Storm topology that may be of interest.

Parallelism

In the topology above, we had defined our parallelism in these two lines:
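In terms of the sketch above, those two lines look like this (class names are the illustrative ones used earlier):

```java
builder.setSpout("spout", new RandomStringSpout(1000), 1);
builder.setBolt("bolt", new ReadMapBolt(), 4)
       .setNumTasks(4)
       .shuffleGrouping("spout");
```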

We defined our spout as having a parallelism hint of 1, but the bolt was defined with a hint of 4 ( .setBolt("bolt", new ReadMapBolt(), 4)) and also 4 tasks: .setNumTasks(4).

What is the difference?

The first hint – in setSpout() and setBolt() – is actually the number of executors, where an executor is a thread of execution within the JVM. The second hint is the number of tasks, or instances, of a spout or bolt that have been created.

So Storm parallelism is defined by stating how many actual threads should be applied to a spout/bolt, as well as how many instances of this spout/bolt should be initialised on topology deployment. By default, (number of tasks/instances) = (number of executors/threads), but if we set the number of running tasks/instances to a value higher than what we expect to need, then we can adjust the number of threads up (or down again) without having to stop the topology.

At cluster-level we can also set the number of workers (=JVMs): conf.setNumWorkers(1);. A good rule of thumb is:

(number of workers) = (number of worker nodes in cluster) = (number of spout partitions)

e.g. if we have a cluster made up of 4 worker nodes and we are using a spout source that uses partitions (such as Event Hubs), then we should set up our source to have 4 partitions, too. In this way we can have one instance of the spout running in the single JVM on each node, reading from one partition (either exclusively or in round-robin fashion).
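As a sketch of that rule of thumb (assuming the 4-node, 4-partition example), the worker count is simply set on the topology Config:

```java
Config conf = new Config();
// One worker (JVM) per worker node, matching the 4 source partitions
conf.setNumWorkers(4);
```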

Serialization

When linking our bolt to our spout, we defined a shuffleGrouping distribution:
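In the topology sketch above, this is the shuffleGrouping("spout") call on the bolt declaration:

```java
builder.setBolt("bolt", new ReadMapBolt(), 4)
       .setNumTasks(4)
       .shuffleGrouping("spout");   // randomly distributes tuples across the bolt's tasks
```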

This means, according to the javadoc comment, that “tuples are randomly distributed across the bolt’s tasks in a way such that each bolt is guaranteed to get an equal number of tuples.” This makes perfect sense as it goes a long way to guaranteeing a balanced topology, but it incurs the overhead of object serialization, which takes place whenever tuples are passed across a JVM boundary.


Therefore, each object we emit has to be serializable. This is fine for primitives and simple objects, but for more complex ones we may have to implement this ourselves. One approach is to always emit complex objects as a byte array. We make use of Avro classes at certain stages of the topology, and the serialization can be achieved in just a few lines:
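A sketch of that round trip, assuming a generated Avro SpecificRecord class named FraudEvent (the class name and the AvroUtil helper are purely illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;

public final class AvroUtil {

    // Avro record -> byte[]
    public static byte[] serialize(FraudEvent event) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new SpecificDatumWriter<>(FraudEvent.class).write(event, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    // byte[] -> Avro record
    public static FraudEvent deserialize(byte[] bytes) throws IOException {
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        return new SpecificDatumReader<>(FraudEvent.class).read(null, decoder);
    }
}
```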

However, this means that we are serializing and deserializing even when we *don’t* cross a JVM boundary (since with shuffleGrouping every bolt instance emits to all instances of the next bolt in the chain, including those in the same JVM). A better approach is to make use of the Kryo classes within Storm, which take care of the serialization (but only serialize when needed). We can define our serialization code as above, but wrap it in a class that we register with Storm, like this:
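(A sketch of such a wrapper, extending Kryo’s Serializer and reusing the hypothetical AvroUtil and FraudEvent helpers from above.)

```java
import java.io.IOException;

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

public class FraudEventSerializer extends Serializer<FraudEvent> {

    @Override
    public void write(Kryo kryo, Output output, FraudEvent event) {
        try {
            byte[] bytes = AvroUtil.serialize(event);   // Avro -> byte[] as shown above
            output.writeInt(bytes.length, true);
            output.writeBytes(bytes);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public FraudEvent read(Kryo kryo, Input input, Class<FraudEvent> type) {
        try {
            int length = input.readInt(true);
            return AvroUtil.deserialize(input.readBytes(length));   // byte[] -> Avro
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```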

and
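register the serializer against the class it handles on the topology Config (a sketch, again using the illustrative class names):

```java
Config conf = new Config();
conf.registerSerialization(FraudEvent.class, FraudEventSerializer.class);
```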

In this way we only incur the serialization overhead when it is needed.

Initialization

A topology is deployed using the storm command-line tool. Certain checks – e.g. that the topology chain is consistent (i.e. that all defined inputs actually exist), and that local resources and remote systems referenced in the topology set-up (i.e. spout/bolt constructors) are available – are carried out before deployment to the cluster. The instances of spout and bolt are then created on the nodes of the cluster. In terms of the spout/bolt code, this means that any objects instantiated in the constructor have to be serializable: any objects that are not have to be declared transient in the class and instantiated once the prepare() method is called on the node, like this:
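(A sketch of the pattern with a BaseRichBolt; the EnrichBolt name is illustrative, and an HttpClient stands in for any non-serializable object that should only be created once the bolt is running on the node.)

```java
import java.net.http.HttpClient;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class EnrichBolt extends BaseRichBolt {

    // Non-serializable members are marked transient and NOT created in the constructor
    private transient OutputCollector collector;
    private transient HttpClient httpClient;   // e.g. a client for a remote lookup (Java 11+)

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        // prepare() runs on the worker node after deployment, so it is safe
        // to create non-serializable objects here
        this.collector = collector;
        this.httpClient = HttpClient.newHttpClient();
    }

    @Override
    public void execute(Tuple input) {
        // ... call the remote system via httpClient, emit results, then ack ...
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // nothing emitted downstream in this sketch
    }
}
```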

We’re now set – we have looked at a simple topology as well as a couple of issues that may crop up when dealing with more complex use cases.

Happy storming!

Read on …

So you’re interested in processing heaps of data? Have a look at our website and read about the services we offer to our customers.

Join us!

Are you looking for a job in big data processing or analytics? We’re currently hiring!
