A Blog Post on Streaming Data Processing
In an ideal world, event time and processing time would always be equal, with events being processed immediately as they occur. Reality is not so kind, however, and the skew between event time and processing time is not only non-zero, but often a highly variable function of the characteristics of the underlying input sources, execution engine, and hardware. Things that can affect the level of skew include:

- Shared resource limitations, like network congestion, network partitions, or shared CPU in a non-dedicated environment.
- Software causes, such as distributed system logic and contention.
- Features of the data themselves, like key distribution, variance in throughput, or variance in disorder (e.g., a plane full of people taking their phones out of airplane mode after having used them offline for the entire flight).
As a result, if you plot the progress of event time and processing time in any real-world system, you typically end up with something that looks a bit like the red line in Figure 1.

Figure 1: Example time domain mapping. The X-axis represents event time completeness in the system, i.e. the time X in event time up to which all data with event times less than X have been observed. The Y-axis represents the progress of processing time, i.e. normal clock time as observed by the data processing system as it executes. Image: Tyler Akidau.

The black dashed line with a slope of one represents the ideal, where processing time and event time are exactly equal; the red line represents reality. In this example, the system lags a bit at the beginning of processing time, veers closer toward the ideal in the middle, then lags again a bit toward the end. The horizontal distance between the ideal and the red line is the skew between processing time and event time. That skew is essentially the latency introduced by the processing pipeline.

Since the mapping between event time and processing time is not static, you cannot analyze your data solely within the context of when they are observed in your pipeline if you care about their event times (i.e., when the events actually occurred). Unfortunately, this is the way most existing systems designed for unbounded data operate. To cope with the infinite nature of unbounded data sets, these systems typically provide some notion of windowing the incoming data. We'll discuss windowing in great depth below, but it essentially means chopping up a data set into finite pieces along temporal boundaries.

If you care about correctness and are interested in analyzing your data in the context of their event times, you cannot define those temporal boundaries using processing time (i.e., processing time windowing), as most existing systems do. With no consistent correlation between processing time and event time, some of your event time data are going to end up in the wrong processing time windows (due to the inherent lag in distributed systems, the online/offline nature of many types of input sources, etc.), throwing correctness out the window, as it were. We'll look at this problem in more detail in a number of examples below, as well as in the next post.

Unfortunately, the picture isn't exactly rosy when windowing by event time, either. In the context of unbounded data, disorder and variable skew induce a completeness problem for event time windows: lacking a predictable mapping between processing time and event time, how can you determine when you've observed all the data for a given event time X? For many real-world data sources, you simply can't. The vast majority of data processing systems in use today rely on some notion of completeness, which puts them at a severe disadvantage when applied to unbounded data sets.

I propose that instead of attempting to groom unbounded data into finite batches of information that eventually become complete, we should be designing tools that allow us to live in the world of uncertainty imposed by these complex data sets. New data will arrive, old data may be retracted or updated, and any system we build should be able to cope with these facts on its own, with notions of completeness being a convenient optimization rather than a semantic necessity.
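To make the contrast between the two windowing choices concrete, here is a minimal sketch (not part of the original post) of fixed one-minute windows computed two ways, assuming each event carries both the timestamp at which it occurred and the wall-clock time at which the pipeline observed it. The Event class, the window_start helper, and the sample stream are all illustrative assumptions, not any particular system's API.

```python
# A minimal sketch contrasting processing-time and event-time windowing
# for a stream of out-of-order events. All names are illustrative.
from collections import defaultdict
from dataclasses import dataclass

WINDOW_SIZE = 60  # fixed one-minute windows, in seconds


@dataclass
class Event:
    event_time: float    # when the event actually occurred
    arrival_time: float  # when the pipeline observed it (processing time)
    value: int


def window_start(timestamp: float) -> float:
    """Align a timestamp to the start of its fixed window."""
    return timestamp - (timestamp % WINDOW_SIZE)


def processing_time_windows(events):
    """Group events by when they were observed: simple, but out-of-order
    data lands in whatever window happens to be open on arrival."""
    windows = defaultdict(list)
    for e in events:
        windows[window_start(e.arrival_time)].append(e.value)
    return dict(windows)


def event_time_windows(events):
    """Group events by when they occurred: correct with respect to event
    time, but you can never be sure a window has seen all of its data."""
    windows = defaultdict(list)
    for e in events:
        windows[window_start(e.event_time)].append(e.value)
    return dict(windows)


if __name__ == "__main__":
    # The third event occurred in the first minute but arrived two minutes
    # later, e.g. a phone coming back online after being offline.
    stream = [
        Event(event_time=5,  arrival_time=10,  value=1),
        Event(event_time=70, arrival_time=75,  value=2),
        Event(event_time=30, arrival_time=150, value=3),  # late data
    ]
    print(processing_time_windows(stream))  # {0: [1], 60: [2], 120: [3]} -- value 3 misplaced
    print(event_time_windows(stream))       # {0: [1, 3], 60: [2]} -- correct in event time
```

Note that the event-time grouping only stays correct if you are willing to keep its windows open indefinitely; deciding when a window such as [0, 60) has actually seen all of its data is exactly the completeness problem described above.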