Real-Time Data Infra Stack

A look at the popular options available

Photo by Amanda Jones on Unsplash

Previously, we covered data infrastructure development, where we evolved the original data monolith step by step into an architecture that can support both real-time analytics and various data governance requirements.

However, that article did not describe which technology options were available; it only gave a high-level view of the architecture development process.

In this article, we will focus on real-time analytics stacks and list some options that are hot or slowly gaining popularity.

Before starting, let’s briefly introduce the whole architecture of real-time analytics.

As we mentioned earlier, the core of the entire real-time infrastructure is streaming.

In order to process all events from event producers quickly, streaming plays an important role, and it consists of two parts: stream platforms and stream processors. The stream platform is a streaming broker, which stores the stream and delivers it to processors. After processing the stream, the processor sends it back to the platform.

Why doesn’t the processor deliver the processed stream straight to its destination instead of back to the platform?

Because the stream can pass through more than one processor. To cover the full use case, the stream may go through multiple processors, each focusing on its own task. It’s a similar concept to a data pipeline.
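To make this concrete, here is a minimal sketch of a single processing stage, assuming Kafka as the stream platform and the kafka-python client; the topic names and the enrichment logic are purely illustrative.

```python
# A minimal sketch of one processing stage: consume from one topic,
# transform, and publish the result back to the platform for the next stage.
# Topic names and the enrichment logic are illustrative only.
import json

from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

consumer = KafkaConsumer(
    "orders.raw",                       # upstream topic (hypothetical)
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    group_id="order-enricher",
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    event = message.value
    # Example enrichment: convert the amount into USD.
    event["amount_usd"] = event["amount"] * event.get("fx_rate", 1.0)
    # Send the enriched event back to the platform so the next processor can pick it up.
    producer.send("orders.enriched", event)
```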

When a stream has been processed, it is persisted and made available to users as needed. Therefore, the serving layer should be a data store with high throughput that supports a variety of complex queries. A traditional RDBMS cannot meet the throughput requirement, so the serving layer is usually not a relational database.

Finally, the data is presented to the end user, whether as tables, diagrams, or even complete reports.

We already know that event producers are responsible for generating various “events”, but what kind of events exactly?

There are three types.

  • Existing OLTP database
  • Event tracker
  • Language SDK

Existing OLTP database

Any system will have a database, whether it is a relational database or a NoSQL database, and as long as the application has storage requirements, it will use the database that best suits its needs.

To capture data changes from these databases and deliver them to the streaming platform, we often use Debezium.
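As a rough sketch, a Debezium connector is typically registered through the Kafka Connect REST API. The host names, credentials, and configuration keys below are placeholders, and the exact option names depend on the connector version you run.

```python
# Register a Debezium MySQL connector via the Kafka Connect REST API.
# Host names, credentials, and config keys are placeholders; check the
# Debezium documentation for the options matching your connector version.
import requests

connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "change-me",
        "database.server.id": "184054",
        "topic.prefix": "inventory-db",        # prefix for the CDC topics
        "database.include.list": "inventory",  # which databases to capture
        # Schema history settings (key names vary between Debezium versions):
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-changes.inventory",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
```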

Event tracker

When a user operates a system, be it a web frontend or a mobile application, we always want to capture those events for later analysis of user behavior.

We want trackers to digest the events generated for us quickly, but we also want some room for customization, such as enriching events. Therefore, plug-ins and customization are a priority when choosing a tracker.

Common options are listed below.

Snowplow provides an open-source version, while Segment does not.
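To illustrate the enrichment idea, here is a hypothetical tracker helper; it is not Snowplow’s or Segment’s actual API, just a sketch of how shared context can be attached to every event before it is shipped to a collector.

```python
# Hypothetical tracker helper illustrating client-side enrichment; this is
# not the Snowplow or Segment API, only a sketch of the plug-in idea.
import time
import uuid
from typing import Any, Callable, Dict

Event = Dict[str, Any]

def make_tracker(send: Callable[[Event], None], app_id: str) -> Callable[[str, Event], None]:
    session_id = str(uuid.uuid4())  # shared context for every event in this session

    def track(event_name: str, payload: Event) -> None:
        enriched = {
            "event": event_name,
            "app_id": app_id,
            "session_id": session_id,
            "ts": time.time(),      # enrichment: when the event happened
            **payload,
        }
        send(enriched)              # hand off to the collector / stream platform

    return track

# Usage: plug in any delivery function, e.g. an HTTP call to a collector.
track = make_tracker(send=print, app_id="web-frontend")
track("page_view", {"path": "/pricing"})
```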

Language SDK

The last type is events generated by various application backends and distributed through the SDK provided by the stream platform. Here the technical selection depends on the stream platform and the programming language in use.
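For example, assuming Kafka as the stream platform, a backend service could publish events with a client SDK such as kafka-python; the topic name and payload are only illustrative.

```python
# Publish a backend-generated event straight to the stream platform.
# Topic name and payload are illustrative.
import json

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("user.signup", {"user_id": 42, "plan": "free"})
producer.flush()  # block until the event is actually delivered
```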

The concept of a stream platform is very simple: it is a broker with high throughput.
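As a small operational sketch, creating a topic on such a broker might look like the following, assuming Kafka and the kafka-python admin client; the partition and replication numbers are placeholders to tune for throughput and fault tolerance.

```python
# Create a topic on the stream platform; the partition count drives
# parallelism and the replication factor drives fault tolerance.
# The numbers here are placeholders.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(name="orders.raw", num_partitions=6, replication_factor=3)
])
```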

The most common option is Kafka, but there are also many other open-source projects and managed services. By the way, the following lists are not in any particular order of recommendation.

Open source

Managed service

A stream processor, as the name suggests, is a role that handles streams. It must have scalability, availability, and fault tolerance, and the ability to support a variety of data sources and sinks is also an important consideration.

The one most often mentioned is Apache Flink, but it is only one of the options; there are many others.
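Since Flink also ships a Python API, a minimal PyFlink sketch could look like the one below; the topics, schema, and connector options are assumptions, and the Kafka SQL connector JAR must be available to Flink for it to actually run.

```python
# Minimal PyFlink Table API sketch: read a Kafka topic, aggregate, write back.
# Topics, schema, and connector options are illustrative; the Kafka SQL
# connector JAR must be on Flink's classpath for this to run.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: the enriched order events produced upstream.
t_env.execute_sql("""
    CREATE TABLE orders_enriched (
        user_id BIGINT,
        amount_usd DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders.enriched',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

# Sink: a continuously updated aggregate, written back to the platform.
t_env.execute_sql("""
    CREATE TABLE spend_per_user (
        user_id BIGINT,
        total_usd DOUBLE,
        PRIMARY KEY (user_id) NOT ENFORCED
    ) WITH (
        'connector' = 'upsert-kafka',
        'topic' = 'spend.per.user',
        'properties.bootstrap.servers' = 'localhost:9092',
        'key.format' = 'json',
        'value.format' = 'json'
    )
""")

# The actual processing: a streaming aggregation.
t_env.execute_sql("""
    INSERT INTO spend_per_user
    SELECT user_id, SUM(amount_usd) AS total_usd
    FROM orders_enriched
    GROUP BY user_id
""")
```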

Open source

Managed service

The function of the serving layer is to maintain the results of stream processing and make them readily available to users. Therefore, it must meet two important conditions: first, high throughput, and second, the ability to support more complex query operations.

In general, there are two different approaches. One is to choose a common NoSQL database such as MongoDB, Elasticsearch, or Apache Cassandra. All of these NoSQL databases have good scalability and can support complex queries. Furthermore, they are very mature, so the learning curve for both usage and operation is low.
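For instance, a typical serving-layer query against MongoDB might be an aggregation like the one below; the database, collection, and field names are made up for illustration.

```python
# Example serving-layer query: top spenders in the last hour, read straight
# from the processed results. Database, collection, and field names are made up.
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")
orders = client["analytics"]["orders_enriched"]

one_hour_ago = datetime.now(timezone.utc) - timedelta(hours=1)
pipeline = [
    {"$match": {"created_at": {"$gte": one_hour_ago}}},
    {"$group": {"_id": "$user_id", "total_usd": {"$sum": "$amount_usd"}}},
    {"$sort": {"total_usd": -1}},
    {"$limit": 10},
]
for row in orders.aggregate(pipeline):
    print(row)
```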

In addition, some NoSQL databases aimed at low latency and large data volumes are also on the rise, for example, ScyllaDB.

On the other hand, there have been many newcomers in the SQL family. These new SQL-compatible databases have completely different implementations from traditional relational databases, and thus offer higher throughput and can also interact directly with stream platforms.

Furthermore, these databases have scalability that a traditional RDBMS does not, and can still deliver low query latency in big-data scenarios.
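The queries these newer SQL engines serve look like ordinary analytics SQL. The sketch below only assumes the chosen database ships a DB-API (PEP 249) style Python driver; the connect function and table name are placeholders.

```python
# The same serving query expressed as analytics SQL. This assumes the chosen
# database provides a DB-API (PEP 249) style Python driver; `connect` and the
# table name are placeholders to swap for the real driver and schema.
def top_spenders(connect):
    conn = connect()              # e.g. the vendor driver's connect() function
    cur = conn.cursor()
    cur.execute("""
        SELECT user_id, SUM(amount_usd) AS total_usd
        FROM orders_enriched
        WHERE created_at > NOW() - INTERVAL '1' HOUR
        GROUP BY user_id
        ORDER BY total_usd DESC
        LIMIT 10
    """)
    return cur.fetchall()
```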

On the frontend, the most common approach is still to build services with one of the widely used web frameworks.

Furthermore, low-code frameworks have become popular in recent years. Low code means that developers only need to write a small amount of code to make use of many pre-defined functions, which can significantly reduce development time and speed up releases.

Since the aim of real-time data is to create value from the data as quickly as possible, agile development is also a reasonable approach in a production environment, and there are a couple of popular low-code frameworks to choose from.

Finally, there are various data visualization platforms.

Although this article lists many candidates for technical selection, there are certainly others I haven’t covered, which may be out-of-date or less-used options such as Apache Storm, or simply off my radar from the start, like the Java ecosystem.

In addition, I did not include links for the three major public cloud platforms (AWS, GCP, Azure), whose offerings are already relatively mature, because plenty of resources about them can be found on the Internet.

Even though these technical stacks are listed by category, some of them actually overlap. For example, although some projects are classified as stream processors, it makes sense to treat them as a serving layer because they are essentially streaming databases; the same is true of ksqlDB.
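To make that overlap concrete, ksqlDB accepts SQL over its REST API and keeps the resulting table continuously up to date, which is exactly serving-layer behavior; the stream, topic, and column names below are assumptions.

```python
# Define a continuously maintained table in ksqlDB through its REST API.
# Stream, topic, and column names are assumptions for illustration.
import requests

statement = """
  CREATE STREAM orders_enriched (user_id BIGINT, amount_usd DOUBLE)
    WITH (KAFKA_TOPIC = 'orders.enriched', VALUE_FORMAT = 'JSON');

  CREATE TABLE spend_per_user AS
    SELECT user_id, SUM(amount_usd) AS total_usd
    FROM orders_enriched
    GROUP BY user_id;
"""

resp = requests.post(
    "http://localhost:8088/ksql",
    json={"ksql": statement, "streamsProperties": {}},
)
resp.raise_for_status()
```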

Each project has its own strengths and applications. When making a selection, it is important to consider the objectives you want to accomplish, as well as existing practices and stacks within the organization, so that you can find the right answer from among the many options.

If there is a project that I haven’t listed and you think is worth mentioning, please don’t hesitate to leave me a comment and I’ll find time to survey it.
