A Blog Post on Streaming Data Processing
In an ideal world, event time and processing time would always be equal, with events being processed immediately as they occur. Reality is not so kind, however, and the skew between event time and processing time is not only non-zero, but often a highly variable function of the characteristics of the underlying input sources, execution engine, and hardware. Things that can affect the level of skew include:

- Shared resource limitations, like network congestion, network partitions, or shared CPU in a non-dedicated environment.
- Software causes, such as distributed system logic and contention.
- Features of the data themselves, like key distribution, variance in throughput, or variance in disorder (e.g., a plane full of people taking their phones out of airplane mode after having used them offline for the entire flight).
As a result, if you plot the progress of event time and processing time in any real-world system, you typically end up with something that looks a bit like the red line in Figure 1.

Figure 1: Example time domain mapping. The X-axis represents event time completeness in the system, i.e. the time X in event time up to which all data with event times less than X have been observed. The Y-axis represents the progress of processing time, i.e. normal clock time as observed by the data processing system as it executes. Image: Tyler Akidau.

The black dashed line with a slope of one represents the ideal, where processing time and event time are exactly equal; the red line represents reality. In this example, the system lags a bit at the beginning of processing time, veers closer toward the ideal in the middle, then lags again a bit toward the end. The horizontal distance between the ideal and the red line is the skew between processing time and event time. That skew is essentially the latency introduced by the processing pipeline.

Since the mapping between event time and processing time is not static, you cannot analyze your data solely within the context of when they are observed in your pipeline if you care about their event times (i.e., when the events actually occurred). Unfortunately, this is the way most existing systems designed for unbounded data operate. To cope with the infinite nature of unbounded data sets, these systems typically provide some notion of windowing the incoming data. We'll discuss windowing in great depth below, but it essentially means chopping up a data set into finite pieces along temporal boundaries.

If you care about correctness and are interested in analyzing your data in the context of their event times, you cannot define those temporal boundaries using processing time (i.e., processing time windowing), as most existing systems do. With no consistent correlation between processing time and event time, some of your event time data are going to end up in the wrong processing time windows (due to the inherent lag in distributed systems, the online/offline nature of many types of input sources, etc.), throwing correctness out the window, as it were. We'll look at this problem in more detail in a number of examples below, as well as in the next post.

Unfortunately, the picture isn't exactly rosy when windowing by event time, either. In the context of unbounded data, disorder and variable skew induce a completeness problem for event time windows: lacking a predictable mapping between processing time and event time, how can you determine when you've observed all the data for a given event time X? For many real-world data sources, you simply can't. The vast majority of data processing systems in use today rely on some notion of completeness, which puts them at a severe disadvantage when applied to unbounded data sets.

I propose that instead of attempting to groom unbounded data into finite batches of information that eventually become complete, we should be designing tools that allow us to live in the world of uncertainty imposed by these complex data sets. New data will arrive, old data may be retracted or updated, and any system we build should be able to cope with these facts on its own, with notions of completeness being a convenient optimization rather than a semantic necessity.
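To make the contrast between the two windowing choices concrete, here is a minimal sketch (not part of the original post) of fixed one-minute windows computed two ways, assuming each event carries both the timestamp at which it occurred and the wall-clock time at which the pipeline observed it. The Event class, the window_start helper, and the sample stream are all illustrative assumptions, not any particular system's API.

```python
# A minimal sketch contrasting processing-time and event-time windowing
# for a stream of out-of-order events. All names are illustrative.
from collections import defaultdict
from dataclasses import dataclass

WINDOW_SIZE = 60  # fixed one-minute windows, in seconds


@dataclass
class Event:
    event_time: float    # when the event actually occurred
    arrival_time: float  # when the pipeline observed it (processing time)
    value: int


def window_start(timestamp: float) -> float:
    """Align a timestamp to the start of its fixed window."""
    return timestamp - (timestamp % WINDOW_SIZE)


def processing_time_windows(events):
    """Group events by when they were observed: simple, but out-of-order
    data lands in whatever window happens to be open on arrival."""
    windows = defaultdict(list)
    for e in events:
        windows[window_start(e.arrival_time)].append(e.value)
    return dict(windows)


def event_time_windows(events):
    """Group events by when they occurred: correct with respect to event
    time, but you can never be sure a window has seen all of its data."""
    windows = defaultdict(list)
    for e in events:
        windows[window_start(e.event_time)].append(e.value)
    return dict(windows)


if __name__ == "__main__":
    # The third event occurred in the first minute but arrived two minutes
    # later, e.g. a phone coming back online after being offline.
    stream = [
        Event(event_time=5,  arrival_time=10,  value=1),
        Event(event_time=70, arrival_time=75,  value=2),
        Event(event_time=30, arrival_time=150, value=3),  # late data
    ]
    print(processing_time_windows(stream))  # {0: [1], 60: [2], 120: [3]} -- value 3 misplaced
    print(event_time_windows(stream))       # {0: [1, 3], 60: [2]} -- correct in event time
```

Note that the event-time grouping only stays correct if you are willing to keep its windows open indefinitely; deciding when a window such as [0, 60) has actually seen all of its data is exactly the completeness problem described above.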