Over the past few months I got my first chance to work with Kafka Streams. For those who don’t know, Kafka Streams is a new feature in Kafka 0.10.0 that adds stream processing capabilities to the existing Kafka platform. In this post we are going to take a high-level look at my experience and my opinion on where the sweet spot for Kafka Streams lies. If you want an introduction to what Kafka Streams is, read that here, or for a detailed discussion of the architecture and API, read that here.
Having worked with several other streaming frameworks, I was intrigued by the different approach that Kafka Streams has taken. Most stream processing frameworks (Spark Streaming, Storm, Flink, Samza) require that you have processing infrastructure available, or require the use of a third-party service (AWS Lambda, Google Cloud Dataflow) to accomplish your stream processing. The overhead associated with these sorts of solutions, whether the operational effort required for the infrastructure or the direct cost, can sometimes be prohibitive to getting a project off the ground. Kafka Streams has taken a different approach and does not require any external infrastructure beyond Kafka itself. Instead of relying on external infrastructure, Kafka Streams integrates into your project as a normal library. This approach minimizes the upfront cost and reduces the adoption penalty (i.e. it’s easy to use for just one project), but it does come with some trade-offs.
Since Kafka Streams is implemented as a library with no infrastructure dependencies other than Kafka, it is very easy to get started using it in a project. Once you have Kafka set up, it is only a matter of a few lines of code to set up the configuration and connect to a Kafka instance; then you are ready to begin your stream processing. Unlike some other frameworks, this means you can have a basic stream processing application running in a matter of minutes, rather than the hours to days that may be required for some infrastructure-based solutions. Since it is built on Kafka, Kafka Streams can easily be used to build robust distributed applications by leveraging the distributed, fault-tolerant nature of Kafka.
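To give a feel for how little setup is involved, here is a minimal pass-through application sketched against the 0.10.0-era API; the application id, broker and ZooKeeper addresses, and topic names are all hypothetical placeholders:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class PassThroughApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-app");        // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // your Kafka brokers
        props.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "localhost:2181");  // 0.10.0 still uses ZooKeeper for internal topics
        props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

        // Read one topic and write it straight back out to another.
        KStreamBuilder builder = new KStreamBuilder();
        KStream<String, String> source = builder.stream("input-topic");
        source.to("output-topic");

        new KafkaStreams(builder, props).start();
    }
}
```

That really is the whole application; there is no cluster to provision and nothing to deploy beyond this JVM process.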
Kafka Streams has support for joins, transformations, windowing, and aggregation of streams into other streams or Kafka topics (read more here). This allows you to quickly build applications to handle use cases such as joining two incoming data streams (e.g. data ETL), denormalizing incoming data (e.g. resolving a CompanyID to a Company Name), or creating aggregates (e.g. a rolling average). In fact, these sorts of simple use cases seem to be the sweet spot for Kafka Streams.
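As an illustration of the denormalization case, here is a hedged sketch (continuing from the `builder` in the first example, with hypothetical topics and plain String payloads) that enriches an order stream with company names held in a KTable:

```java
KTable<String, String> companyNames = builder.table("company-names"); // companyId -> company name
KStream<String, String> orders = builder.stream("orders");            // companyId -> order payload

// Left join keeps orders flowing even when no company name has arrived yet.
KStream<String, String> denormalized =
        orders.leftJoin(companyNames, (order, name) -> order + ",companyName=" + name);
denormalized.to("orders-denormalized");
```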
Kafka Streams also has the concept of viewing your stream as a changelog (KStream) or as a snapshot (KTable). This is something they refer to as the stream-table duality. This duality allows you to easily process data in different ways based upon its nature (static vs. dynamic) or how you need to interact with it. If you are interested in the underlying reasoning and scenarios that drive this need, there is a very informative article from LinkedIn Engineering on this topic, which is available here.
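In code, the duality is simply a choice of how you read a topic; assuming a hypothetical user-profiles topic and the same `builder` as above:

```java
// Changelog view: every individual update flows through as an event.
KStream<String, String> profileChanges = builder.stream("user-profiles");

// Snapshot view: a continuously updated table of the latest value per key.
KTable<String, String> latestProfiles = builder.table("user-profiles");
```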
The first release of Kafka Streams provides a lot of functionality out of the box, but we found some of the features to be immature or incomplete. Joining messages is one example of this lack of richness in the feature set: joining requires that the message keys be an exact match. Let’s say that I had the following two incoming stream messages that I want to join into a new message.
Desired Outcome of Stream Join
In this example, since the keys do not match between Message A and Message B, you first have to do an intermediate transformation of Message A’s key from {firstname, lastname} to {lastname, firstname}. This new intermediate Message A has an exact key match, so it can then be joined to Message B.
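That intermediate rekeying looks something like the following sketch (again continuing from the earlier `builder`; the comma-separated String keys, topic names, and one-minute join window are hypothetical):

```java
KStream<String, String> streamA = builder.stream("stream-a"); // key: "firstname,lastname"
KStream<String, String> streamB = builder.stream("stream-b"); // key: "lastname,firstname"

// Swap the key fields of stream A so its keys match stream B exactly,
// writing through an intermediate topic so the rekeyed data is repartitioned.
KStream<String, String> rekeyedA = streamA
        .map((key, value) -> {
            String[] parts = key.split(",");
            return new KeyValue<>(parts[1] + "," + parts[0], value);
        })
        .through("stream-a-rekeyed");

// Now the keys line up and the windowed stream-stream join can proceed.
KStream<String, String> joined = rekeyedA.join(streamB,
        (a, b) -> a + "|" + b,
        JoinWindows.of("a-b-join").within(60 * 1000L));
```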
Actual Process for Creating Stream Join
In the simple case shown above this does not seem too onerous; however, it quickly becomes a real headache when you want to merge and transform objects of different formats into a new message. In the example below I want to create an OrderLineItem containing the Line Item, the Order ID and the Company Name from its component raw data (a LineItem message, an Order message and a Company message).
Current Process for Creating an OrderLineItem from its components
As you can see, it is possible to accomplish this complex join, but it requires multiple additional transforms and joins, which come with the associated overhead and code complexity. While this example is manageable, it quickly becomes unwieldy when you apply it to any reasonably complex problem. I would prefer the option to specify the join keys manually. For concreteness, a sketch of the intermediate steps follows.
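This is a hedged sketch only, with hypothetical topics, plain String payloads, and hypothetical extractOrderId/extractCompanyId helpers that pull the foreign keys out of a payload:

```java
KStream<String, String> lineItems = builder.stream("line-items"); // key: lineItemId
KTable<String, String> orders    = builder.table("orders");       // key: orderId
KTable<String, String> companies = builder.table("companies");    // key: companyId

// Step 1: rekey line items by orderId (via an intermediate topic) and attach the order.
KStream<String, String> withOrder = lineItems
        .map((k, v) -> new KeyValue<>(extractOrderId(v), v))   // extractOrderId is a hypothetical helper
        .through("line-items-by-order")
        .leftJoin(orders, (item, order) -> item + "|" + order);

// Step 2: rekey again by companyId and attach the company name.
KStream<String, String> orderLineItems = withOrder
        .map((k, v) -> new KeyValue<>(extractCompanyId(v), v)) // extractCompanyId is a hypothetical helper
        .through("line-items-by-company")
        .leftJoin(companies, (item, company) -> item + "|" + company);

orderLineItems.to("order-line-items");
```

Every rekeying step costs an extra intermediate topic and another round trip through Kafka, which is exactly the overhead the exact-match join restriction forces on you.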
Unfortunately, this sort of sparsity in the feature set is not unique to just Kafka Streams. While Kafka Streams lacked features around joins, aggregations, and windowing, other parts of the platform had similar issues, such as Kafka Connect pushing all messages with a single default schema (read more here), where you end up needing to write more code than is ideal.
An issue you run into on any reasonably complex project is that, because Kafka Streams is a library, your application is now responsible for managing all the resources it uses. This is a double-edged sword. The minimal dependency on external infrastructure makes your code more self-contained and quicker to develop, but the trade-off is that you now need to handle tasks such as monitoring and application restarts that come as standard capabilities with other streaming platforms. While this is not necessarily a bad thing, it is something to be cognizant of when deciding which streaming platform to use.
As with any new library release, one of the biggest headaches is the lack of documentation and examples available. While the documentation provided by Confluent (see here) is informative and mostly complete, it is lacking in useful real-world examples. Since the tool is so new, it was nearly impossible to find working example code for anything other than the most trivial use case. While this is not unique to Kafka Streams or any newly released product, it is an early-adopter penalty to consider.
Kafka Streams is a very solid first release and has some real benefits for certain use cases. It is a solid addition to the stream processing landscape and offers a great tool for simple stream processing applications written on top of Kafka. The features that are lacking are really just due to the immaturity of the product and I am hopeful that these gaps will be addressed in upcoming releases.
For a more complex stream processing application, or one that is not built solely on Kafka, I would still recommend a more robust stream processing framework such as AWS Lambda, Google Cloud Dataflow, Spark Streaming, Storm, Flink, or Samza. Which one you choose depends more on what sort of processing your application and business need (i.e. continuous vs. windowed, on-premises vs. IaaS, etc.), but that is a discussion for another blog post.