A blog post on streaming data processing
Good points aside, there is one very big downside to processing time windowing: if the data in question have event times associated with them, those data must arrive in event time order if the processing time windows are to reflect the reality of when those events actually happened. Unfortunately, event-time ordered data are uncommon in many real-world, distributed input sources.

As a simple example, imagine any mobile app that gathers usage statistics for later processing. In cases where a given mobile device goes offline for any amount of time (brief loss of connectivity, airplane mode while flying across the country, etc.), the data recorded during that period won’t be uploaded until the device comes online again. That means data might arrive with an event time skew of minutes, hours, days, weeks, or more. It’s essentially impossible to draw any sort of useful inferences from such a data set when windowed by processing time.

As another example, many distributed input sources may seem to provide event-time ordered (or very nearly so) data when the overall system is healthy. Unfortunately, the fact that event-time skew is low for the input source when healthy does not mean it will always stay that way. Consider a global service that processes data collected on multiple continents. If network issues across a bandwidth-constrained transcontinental line (which, sadly, are surprisingly common) further decrease bandwidth and/or increase latency, suddenly a portion of your input data may start arriving with much greater skew than before. If you are windowing that data by processing time, your windows are no longer representative of the data that actually occurred within them; instead, they represent the windows of time as the events arrived at the processing pipeline, which is some arbitrary mix of old and current data.

What we really want in both of those cases is to window data by their event times in a way that is robust to the order of arrival of events. What we really want is event time windowing.

Windowing by event time

Event time windowing is what you use when you need to observe a data source in finite chunks that reflect the times at which those events actually happened. It’s the gold standard of windowing. Sadly, most data processing systems in use today lack native support for it (though any system with a decent consistency model, like Hadoop or Spark Streaming, could act as a reasonable substrate for building such a windowing system). This diagram shows an example of windowing an unbounded source into one-hour fixed windows:

Figure 10: Windowing into fixed windows by event time. Data are collected into windows based on the times they occurred. The white arrows call out example data that arrived in processing time windows that differed from the event time windows to which they belonged. Image: Tyler Akidau.

The solid white lines in the diagram call out two particular data of interest. Those two data both arrived in processing time windows that did not match the event time windows to which they belonged. As such, if these data had been windowed into processing time windows for a use case that cared about event times, the calculated results would have been incorrect. As one would expect, event time correctness is one nice thing about using event time windows.
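To make the mechanics concrete, here is a minimal Python sketch (not from the original post) of assigning events to one-hour fixed windows by event time; the window_by_event_time and window_start helpers are hypothetical names introduced for illustration, and the one-hour window size is chosen to match Figure 10. Because each event is bucketed by the timestamp it carries rather than by when it arrives, the grouping stays correct no matter how out of order the events show up:

    from collections import defaultdict

    WINDOW_SECONDS = 3600  # one-hour fixed windows, matching Figure 10

    def window_start(event_time):
        # Align an event timestamp (in seconds) to the start of its window.
        return event_time - (event_time % WINDOW_SECONDS)

    def window_by_event_time(events):
        # Group (event_time, value) pairs into fixed event-time windows.
        # Arrival order is irrelevant: each event is assigned solely by
        # the timestamp it carries, not by when the pipeline saw it.
        windows = defaultdict(list)
        for event_time, value in events:
            windows[window_start(event_time)].append(value)
        return dict(windows)

    # Events listed in arrival (processing-time) order, but out of
    # event-time order (e.g., a mobile device uploading late data).
    arrivals = [
        (7300, "b"),  # event time 7300 belongs to window [7200, 10800)
        (1000, "a"),  # event time 1000 belongs to window [0, 3600)
        (3500, "c"),  # late arrival, still lands in window [0, 3600)
    ]
    for start, values in sorted(window_by_event_time(arrivals).items()):
        print("window [%d, %d): %s" % (start, start + WINDOW_SECONDS, values))

Contrast this with processing time windowing over the same stream: the late-arriving event "c" would have been grouped with whatever happened to be current at upload time, polluting a window it did not belong to.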