ShoutFlow: Building an event-sourced Twitter clone on Apache Flink (Part 1)

In the following series of posts I will talk about building a Twitter clone on an event-sourced architecture. You can view the codebase so far on GitHub.

The aim of this exercise is to explore CQRS, Event-Sourcing, how the two fit together and how one might go about implementing these patterns in the form of stream processing in Apache Flink.

Event-sourcing is based on the notion that the current state came about from a series of incremental state changes. If we capture every state change in the form of an event and record it, we could start from nothing and derive the current state from this log of events. This renders our state a cache, something we hold onto and update by applying new events. We could lose this state and event-source it from the event log, the application’s canonical data store.
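
To make that concrete, here is a minimal sketch in Java of event-sourcing a single value. The class and event names are illustrative, not taken from the ShoutFlow codebase; the point is that the current state is just a left fold over the event log.

```java
import java.util.List;

// A minimal sketch of event-sourcing a single value: the current state is
// nothing more than a left fold over the event log. All names here are
// illustrative, not from the ShoutFlow codebase.
class FollowerCount {

    interface Event {}
    static class UserFollowed implements Event {}
    static class UserUnfollowed implements Event {}

    // Apply one event to the current state, producing the next state.
    static int apply(int current, Event event) {
        if (event instanceof UserFollowed)   return current + 1;
        if (event instanceof UserUnfollowed) return Math.max(0, current - 1);
        return current; // unknown events leave the state untouched
    }

    // Replaying the full log from an empty state reproduces the current state.
    static int replay(List<Event> log) {
        int state = 0;
        for (Event e : log) {
            state = apply(state, e);
        }
        return state;
    }
}
```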

Command Query Responsibility Segregation is the idea that write and read operations can have very different requirements, and that we might benefit from addressing these requirements more explicitly through a separate command-side and query-side model.

For example, different queries are best served by different types of databases.

Social network traversal queries might be best handled by Neo4j. Serving product search results would be a good fit for something that excels at full-text search, e.g. Elasticsearch. Twitter is an example of an application with a read frequency orders of magnitude greater than its write frequency. We'll do well out of denormalization: preparing and serving documents that have already pulled in data from related entities in order to avoid joins and additional queries. That's a lot of data to keep in sync!

Event-sourcing can help us implement the Q in CQRS without polluting the command side of our application with nasty transactional writes spanning multiple database systems. Instead, writes simply append onto the back of the event log. A set of projectors then consumes the event log and keeps the read models up to date (at the cost of eventual consistency).

If at some point down the line another database technology is required, simply implement another projector that populates the new database. You might learn of a query pattern that has become expensive; for example, a view in the mobile app that requires three different endpoints to be contacted every time it loads. In this scenario you could develop a projector that event-sources a document backing this view in its entirety and returns it to the client in a single query. The client gets its data in fewer round trips and your data stores serve fewer queries.

There are no migrations: the code that processes historical data is the same code that keeps the database up to date from live data. You can deploy these projectors with no downtime to the system. As soon as they've caught up with the back of the event log, you can start directing client-facing APIs to the new read models.

If you’d like to hear more about the ideas surrounding building a system around streams of events, I recommend you watch “Turning the database inside out with Apache Samza” by Martin Kleppmann.

Technology

One of the technologies I’m exploring is Apache Flink - the latest hype in stream processing.

Apache Flink equips you with various operators as a means of expressing computation on continuous streams of data. These operators let you map, filter, partition, aggregate and otherwise transform streams, one element at a time, into more streams. By applying operators to streams and chaining the resulting streams with further operators, we can express computation as a dataflow graph.
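
As a rough illustration, the sketch below chains a few of these operators into a small dataflow. The Tweet type and the in-memory source are stand-ins of my own; in the real jobs the source would be a Kafka topic.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// A sketch of expressing computation as a chain of operators over a stream.
public class DataflowSketch {

    public static class Tweet {
        public String userId;
        public String text;
    }

    private static Tweet tweet(String userId, String text) {
        Tweet t = new Tweet();
        t.userId = userId;
        t.text = text;
        return t;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in source; in practice this would be a Kafka consumer.
        DataStream<Tweet> tweets = env.fromElements(
                tweet("alice", "hello"), tweet("alice", "streams"), tweet("bob", "hi"));

        // filter -> map -> keyBy -> aggregate: a small dataflow graph
        DataStream<Tuple2<String, Integer>> tweetCounts = tweets
                .filter(t -> t.text != null && !t.text.isEmpty())
                .map(t -> Tuple2.of(t.userId, 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT)) // help Java's type erasure
                .keyBy(t -> t.f0)                              // partition by user id
                .sum(1);                                       // running count per user

        tweetCounts.print();
        env.execute("dataflow-sketch");
    }
}
```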

Flink also features fault-tolerant state for use within your operators: state that can be updated while processing one element and read when processing subsequent ones. Flink can periodically snapshot this state for you and restore it when recovering from failure.
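
A minimal, illustrative example of that keyed state: a process function that keeps one counter per key and updates it as elements arrive. The names and types here are mine, not the application's.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// One counter per key, updated as elements arrive and restored from
// checkpoints after a failure.
public class PerKeyCounter extends KeyedProcessFunction<String, String, Long> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("per-key-count", Long.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<Long> out) throws Exception {
        Long current = count.value();              // scoped to the current key
        long next = (current == null ? 0L : current) + 1;
        count.update(next);                        // snapshotted by Flink's checkpoints
        out.collect(next);
    }
}
```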

In this talk, Stephan Ewen discusses the emergence of Flink as a platform upon which an application's core logic is executed. This is significant because stream processors originally came to prominence by solving large-scale data analytics tasks: aggregations and various other windowed reductions, with results perhaps fed into internal dashboards to support operations. This is the talk that inspired my choice to build an application on a stream processor.

Deepstream.io

Deepstream.io is a self-hosted, free and open-source alternative to Firebase. It offers data synchronisation, publish/subscribe and request/response functionality. There are clients for JavaScript, Java and Objective-C (transpiled from Java).

The data sync feature mirrors Firebase: a client subscribes to some data, receives the latest value and is then notified of every change that follows. The difference is in the data model: Deepstream.io maintains many records and lets you subscribe to a particular record, whereas Firebase maintains one big document and lets clients address and subscribe to an element within that document.

Architecture

```mermaid
graph TD
  client[HTTP Client]
  subgraph Command-Side
    restapi[REST API]
  end
  subgraph Apache Flink
    application[Application Job]
    projection[Projection Job]
    analytics[Analytics Job]
  end
  subgraph Apache Kafka
    commands[Commands Topic]
    events[Events Topic]
  end
  subgraph Read-Side
    deepstream[DeepStream.io]
    influxdb[InfluxDB]
  end
  dashboard[Internal Dashboard]

  client--HTTP Request-->restapi
  restapi--Commands-->commands
  commands-->application
  application--Events-->events
  events-->projection
  events-->analytics
  projection--Read Models-->deepstream
  client--WS Subscription-->deepstream
  analytics--Metrics-->influxdb
  dashboard-->influxdb
```

Commands capture a user's intent to make changes to the application.

The HTTP API takes user requests, transforms them to commands and pushes them to Apache Kafka.
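
As a rough sketch, the command side might look something like the following. The topic name, the choice of the user ID as the message key and the JSON payload are assumptions for illustration, not the actual implementation.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// The REST API turns an HTTP request into a command and appends it to a
// Kafka topic. Topic name, key choice and JSON shape are illustrative.
public class CommandPublisher {

    private final KafkaProducer<String, String> producer;

    public CommandPublisher() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producer = new KafkaProducer<>(props);
    }

    // Keying by user id keeps all of a user's commands in order on one partition.
    public void publishPostTweet(String userId, String text) {
        String command = String.format(
                "{\"type\":\"PostTweet\",\"userId\":\"%s\",\"text\":\"%s\"}", userId, text);
        producer.send(new ProducerRecord<>("commands", userId, command));
    }
}
```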

The application job partitions the command stream by user. Commands are run against the user entity and, if valid, an event is produced.
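
Something along these lines, building on the keyed-state pattern shown earlier. The command and event classes, and the single "registered" flag standing in for the user entity, are illustrative only.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Illustrative command and event types; the real codebase has its own.
class Command { public String userId; }
class RegisterUser extends Command { }
class PostTweet extends Command { public String text; }

class Event { public String userId; }
class UserRegistered extends Event { }
class TweetPosted extends Event { public String text; }

// Commands keyed by user are checked against that user's entity state and,
// when valid, produce events.
public class CommandHandling {

    public static DataStream<Event> handleCommands(DataStream<Command> commands) {
        return commands
                .keyBy(cmd -> cmd.userId)        // one user's commands always hit the same task
                .process(new CommandHandler());
    }

    public static class CommandHandler extends KeyedProcessFunction<String, Command, Event> {

        // Minimal "user entity": just whether this user has registered yet.
        private transient ValueState<Boolean> registered;

        @Override
        public void open(Configuration parameters) {
            registered = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("registered", Boolean.class));
        }

        @Override
        public void processElement(Command cmd, Context ctx, Collector<Event> out) throws Exception {
            boolean exists = Boolean.TRUE.equals(registered.value());

            if (cmd instanceof RegisterUser && !exists) {
                registered.update(true);
                UserRegistered event = new UserRegistered();
                event.userId = cmd.userId;
                out.collect(event);                      // valid: emit an event
            } else if (cmd instanceof PostTweet && exists) {
                TweetPosted event = new TweetPosted();
                event.userId = cmd.userId;
                event.text = ((PostTweet) cmd).text;
                out.collect(event);
            }
            // invalid commands are dropped here; a real job might emit a rejection event
        }
    }
}
```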

Flink can help us respond to a load increase. If we provision it with more resources, it will redistribute the work to take advantage of the additional machines.

Events capture application state changes and are recorded as an append-only sequence of immutable messages, persisted in Kafka.

In some sense, Kafka becomes the database. We can’t query it but we can make durable writes and sequential reads very quickly. Each write of an event describes our application going from one consistent state to another.

The projection job filters the event stream for specific event types, transforms them to derive further events, partitions the events (for example, by the ID of the user they affect), joins them with other event streams and feeds the results into sink operators, which write to a store the web client will query.
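
A sketch of what such a projection job could look like, reusing the illustrative event classes from the command-handling sketch above. The Deepstream.io write itself is left as a comment, since that part lives in a custom sink.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Filter the event stream down to the events this read model cares about,
// key it by the affected user and write updated documents from a sink.
public class UserProfileProjection {

    public static void project(DataStream<Event> events) {
        events
                .filter(e -> e instanceof UserRegistered || e instanceof TweetPosted)
                .keyBy(e -> e.userId)
                .addSink(new DocumentSink());
    }

    public static class DocumentSink extends RichSinkFunction<Event> {
        @Override
        public void invoke(Event event) {
            // Here the real job would update the projected document for
            // event.userId and write it to Deepstream.io, e.g. incrementing
            // a tweet count when it sees a TweetPosted event.
        }
    }
}
```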

Projectors build up documents that will ultimately back components the user will see in the web browser. This means that any relevant information is returned by querying for that particular entity. Querying for a user will return their tweet, following, follower and like counts. Querying a post will return the number of likes on the post, the user’s current display name and avatar hash. There’s no need for the client to make additional queries in order to get supplementary information.
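
For illustration, a projected user-profile document might look something like this; the field names are assumptions, not the actual read model.

```java
// A denormalized read-model document for a user profile. Every field a client
// component needs is stored on the document itself, so one query (or
// subscription) returns everything. Field names are illustrative.
public class UserProfileDocument {

    public String userId;
    public String displayName;
    public String avatarHash;
    public long tweetCount;
    public long followerCount;
    public long followingCount;
    public long likeCount;

    // Projectors mutate the document as events arrive, e.g.:
    public void onTweetPosted()   { tweetCount++; }
    public void onUserFollowed()  { followerCount++; }
    public void onDisplayNameChanged(String newName) { displayName = newName; }
}
```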

We’re getting rid of query complexity at the expense of storage space.

DeepStream.io pushes document changes to clients in real-time.

The projected documents are persisted in Deepstream.io, available to be subscribed to by clients. Existing subscribers are notified of data changes.

Summary

In this post I tried to give an overview of the big ideas and how they fit together. In the following series of posts I’ll talk about the implementation - from the backend, through the HTTP API to the frontend.

Once again, you can view the implementation so far on GitHub.

See you next time!