A Blog Post on Stream Data Processing
Contrary to the ad hoc nature of most batch-based unbounded data processing approaches, streaming systems are built for unbounded data. As I noted earlier, for many real-world, distributed input sources, you not only find yourself dealing with unbounded data, but also with data that are unordered with respect to event time and of varying event-time skew.
There are a handful of approaches one can take when dealing with data that have these characteristics. I generally categorize these approaches into four groups.
We’ll now spend a little bit of time looking at each of these approaches.

Time-agnostic

Time-agnostic processing is used in cases where time is essentially irrelevant, i.e., all relevant logic is data driven. Since everything about such use cases is dictated by the arrival of more data, there’s really nothing special a streaming engine has to support other than basic data delivery. As a result, essentially all streaming systems in existence support time-agnostic use cases out of the box (modulo system-to-system variances in consistency guarantees, for those of you who care about correctness). Batch systems are also well suited for time-agnostic processing of unbounded data sources, by simply chopping the unbounded source into an arbitrary sequence of bounded data sets and processing those data sets independently. We’ll look at a couple of concrete examples in this section, but given the straightforwardness of handling time-agnostic processing, won’t spend much more time on it beyond that.

Filtering

A very basic form of time-agnostic processing is filtering. Imagine you’re processing Web traffic logs, and you want to filter out all traffic that didn’t originate from a specific domain. You would look at each record as it arrived, see if it belonged to the domain of interest, and drop it if not. Since this sort of thing depends only on a single element at any time, the fact that the data source is unbounded, unordered, and of varying event time skew is irrelevant.

Figure 5: Filtering unbounded data. A collection of data (flowing left to right) of varying types is filtered into a homogeneous collection containing a single type. Image: Tyler Akidau.

Inner-joins

Another time-agnostic example is an inner-join (or hash-join). When joining two unbounded data sources, if you only care about the results of a join when an element from both sources arrives, there’s no temporal element to the logic.
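Both time-agnostic cases discussed so far, filtering and the inner join, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular streaming engine implements them: the names `filter_by_domain`, `StreamingInnerJoin`, and `on_element`, and the `"domain"` record field, are hypothetical, and an in-memory dictionary stands in for persistent state. Note that neither piece of logic ever consults a timestamp.

```python
from collections import defaultdict
from typing import Any, Iterable, Iterator, List, Tuple


def filter_by_domain(records: Iterable[dict], domain: str) -> Iterator[dict]:
    """Yield only the records whose (hypothetical) "domain" field matches.

    Each decision depends on a single element, so arrival order and
    event-time skew do not matter.
    """
    for record in records:
        if record.get("domain") == domain:
            yield record


class StreamingInnerJoin:
    """Buffer values per key and emit joined records as partners arrive.

    Assumption: state lives in memory and is never garbage collected.
    A real pipeline would keep it in persistent state and would probably
    expire partial joins that never complete.
    """

    def __init__(self) -> None:
        self._buffers = {"left": defaultdict(list), "right": defaultdict(list)}

    def on_element(self, side: str, key: str, value: Any) -> List[Tuple]:
        """Record one element and return any joined records it completes."""
        other = "right" if side == "left" else "left"
        self._buffers[side][key].append(value)
        # .get() avoids creating empty buffers for keys seen on one side only.
        partners = self._buffers[other].get(key, [])
        if side == "left":
            return [(key, value, partner) for partner in partners]
        return [(key, partner, value) for partner in partners]
```

Calling `on_element("left", "u1", "click")` on a fresh join returns an empty list, because nothing has arrived on the right side for that key yet. A later `on_element("right", "u1", "purchase")` returns the joined record for `u1`. The output is driven purely by which elements have arrived.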
Upon seeing a value from one source, you can simply buffer it up in persistent state; you only need to emit the joined record once the second value from the other source arrives. (In truth, you’d likely want some sort of garbage collection policy for unemitted partial joins, which would likely be time based. But for a use case with few or no uncompleted joins, such a thing might not be an issue.)

Figure 6: Performing an inner join on unbounded data. Joins are produced when matching elements from both sources are observed. Image: Tyler Akidau.

Switching semantics to some sort of outer join introduces the data completeness problem we’ve talked about: once you’ve seen one side of the join, how do you know whether the other side is ever going to arrive or not? Truth be told, you don’t, so you have to introduce some notion of a timeout, which introduces an element of time. That element of time is essentially a form of windowing, which we’ll look at more closely in a moment.

Approximation algorithms