A blog post on streaming data processing

Published: 2021-01-07 16:31:06 | Category: Big Data | Source: compiled from the web

The world beyond batch: Streaming 101

A high-level tour of modern data-processing concepts.

By Tyler Akidau

August 5, 2015

Three women wading in a stream gathering leeches (source: Wellcome Library, London).

Editor's note: This is the first post in a two-part series about the evolution of data processing, with a focus on streaming systems, unbounded data sets, and the future of big data. See part two.

Streaming data processing is a big deal in big data these days, and for good reasons. Amongst them:

Businesses crave ever more timely data, and switching to streaming is a good way to achieve lower latency.
The massive, unbounded data sets that are increasingly common in modern business are more easily tamed using a system designed for such never-ending volumes of data.
Processing data as they arrive spreads workloads out more evenly over time, yielding more consistent and predictable consumption of resources.

Despite this business-driven surge of interest in streaming, the majority of streaming systems in existence remain relatively immature compared to their batch brethren, which has resulted in a lot of exciting, active development in the space recently.

As someone who’s worked on massive-scale streaming systems at Google for the last five+ years (MillWheel, Cloud Dataflow), I’m delighted by this streaming zeitgeist, to say the least. I’m also interested in making sure that folks understand everything that streaming systems are capable of and how they are best put to use, particularly given the semantic gap that remains between most existing batch and streaming systems. To that end, the fine folks at O’Reilly have invited me to contribute a written rendition of my Say Goodbye to Batch talk from Strata + Hadoop World London 2015. Since I have quite a bit to cover, I’ll be splitting this across two separate posts:

Streaming 101: This first post will cover some basic background information and clarify some terminology before diving into details about time domains and a high-level overview of common approaches to data processing, both batch and streaming.
The Dataflow Model: The second post will consist primarily of a whirlwind tour of the unified batch + streaming model used by Cloud Dataflow, facilitated by a concrete example applied across a diverse set of use cases. After that, I’ll conclude with a brief semantic comparison of existing batch and streaming systems.

So, long-winded introductions out of the way, let’s get nerdy.

Background

To begin with, I’ll cover some important background information that will help frame the rest of the topics I want to discuss. We’ll do this in three specific sections:

Terminology: To talk precisely about complex topics requires precise definitions of terms. For some terms that have overloaded interpretations in current use, I’ll try to nail down exactly what I mean when I say them.
Capabilities: I’ll remark on the oft-perceived shortcomings of streaming systems. I’ll also propose the frame of mind that I believe data processing system builders need to adopt in order to address the needs of modern data consumers going forward.
Time domains: I’ll introduce the two primary domains of time that are relevant in data processing, show how they relate, and point out some of the difficulties these two domains impose.

Terminology: What is streaming?

Before going any further, I’d like to get one thing out of the way: what is streaming? The term “streaming” is used today to mean a variety of different things (and for simplicity, I’ve been using it somewhat loosely up until now), which can lead to misunderstandings about what streaming really is, or what streaming systems are actually capable of. As such, I would prefer to define the term somewhat precisely.

The crux of the problem is that many things that ought to be described by what they are (e.g., unbounded data processing, approximate results, etc.) have come to be described colloquially by how they historically have been accomplished (i.e., via streaming execution engines). This lack of precision in terminology clouds what streaming really means, and in some cases, burdens streaming systems themselves with the implication that their capabilities are limited to characteristics frequently described as “streaming,” such as approximate or speculative results. Given that well-designed streaming systems are just as capable (technically more so) of producing correct, consistent, repeatable results as any existing batch engine, I prefer to isolate the term streaming to a very specific meaning: a type of data processing engine that is designed with infinite data sets in mind. Nothing more. (For completeness, it’s perhaps worth calling out that this definition includes both true streaming and micro-batch implementations.)
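
To make that definition a little more concrete, here is a minimal Python sketch. It is purely illustrative and is not taken from MillWheel, Cloud Dataflow, or any other real engine; the names `unbounded_source`, `per_record_running_sum`, and `micro_batch_running_sum` are hypothetical. It assumes a toy source that never ends and shows two ways of consuming it, one record at a time and in small batches. Both count as "streaming" under the definition above, because neither needs the input to ever finish.

```python
import itertools
from typing import Iterable, Iterator


def unbounded_source() -> Iterator[int]:
    """Hypothetical never-ending input: yields 0, 1, 2, ... forever."""
    return itertools.count()


def per_record_running_sum(records: Iterable[int]) -> Iterator[int]:
    """'True streaming' style: emit an updated total after every record."""
    total = 0
    for record in records:
        total += record
        yield total


def micro_batch_running_sum(records: Iterable[int], batch_size: int) -> Iterator[int]:
    """Micro-batch style: pull a small batch, then emit one updated total."""
    iterator = iter(records)
    total = 0
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:  # only reached if the input turns out to be finite
            return
        total += sum(batch)
        yield total


# Neither loop can be run to completion on an infinite input, so we only
# look at a finite prefix of each output.
per_record = list(itertools.islice(per_record_running_sum(unbounded_source()), 6))
micro_batch = list(itertools.islice(micro_batch_running_sum(unbounded_source(), 3), 2))
print(per_record)   # [0, 1, 3, 6, 10, 15]
print(micro_batch)  # [3, 15]
```

The micro-batch totals match the per-record totals at each batch boundary (after the third and sixth records); the two styles differ in how often results are emitted, not in what they compute.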

As to other common uses of “streaming,” here are a few that I hear regularly, each presented with the more precise, descriptive terms that I suggest we as a community should try to adopt:
