Of course, powerful semantics rarely come for free, and event time windows are no exception. Event time windows have two notable drawbacks, due to the fact that windows must often live longer (in processing time) than the actual length of the window itself:
- Buffering: Due to extended window lifetimes, more buffering of data is required. Thankfully, persistent storage is generally the cheapest of the resource types most data processing systems depend on (the others being primarily CPU, network bandwidth, and RAM). As such, this problem is typically much less of a concern than one might think when using any well-designed data-processing system with strongly consistent persistent state and a decent in-memory caching layer. Also, many useful aggregations do not require the entire input set to be buffered (e.g., sum or average), but instead can be performed incrementally, with a much smaller, intermediate aggregate stored in persistent state.
- Completeness: Given that we often have no good way of knowing when we’ve seen all the data for a given window, how do we know when the results for the window are ready to materialize? In truth, we simply don’t. For many types of inputs, the system can give a reasonably accurate heuristic estimate of window completion via something like MillWheel’s watermarks (which I’ll talk about more in Part 2). But in cases where absolute correctness is paramount (again, think billing), the only real option is to provide a way for the pipeline builder to express when they want results for windows to be materialized, and how those results should be refined over time. Dealing with window completeness (or lack thereof) is a fascinating topic, but one perhaps best explored in the context of concrete examples, which we’ll look at next time.
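The incremental-aggregation point in the buffering bullet can be made concrete with a minimal sketch. This is a hypothetical illustration, not any particular engine’s API: the only idea it demonstrates is that a mean over a window needs just a (count, sum) pair of intermediate state, not the full buffered input.

```python
# Hypothetical sketch: an incrementally updatable mean aggregate.
# Instead of buffering every element of a window, we persist only a
# (count, total) pair as intermediate state between elements.

class IncrementalMean:
    """Keeps O(1) state regardless of how many elements arrive."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0

    def add(self, value: float) -> None:
        # Fold each new element into the small intermediate aggregate.
        self.count += 1
        self.total += value

    def result(self) -> float:
        return self.total / self.count if self.count else 0.0

agg = IncrementalMean()
for v in [3.0, 5.0, 10.0]:
    agg.add(v)
print(agg.result())  # prints 6.0
```

A real engine would checkpoint that (count, total) pair in its persistent state backend per key and window; the point is simply that the state stays constant-sized while the input grows.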
Conclusion
Whew! That was a lot of information. To those of you who have made it this far: you are to be commended! At this point we are roughly halfway through the material I want to cover, so it’s probably reasonable to step back, recap what I’ve covered so far, and let things settle a bit before diving into Part 2. The upside of all this is that Part 1 is the boring post; Part 2 is where the fun really begins.
Recap
To summarize, in this post I’ve:
- Clarified terminology, specifically narrowing the definition of “streaming” to apply to execution engines only, while using more descriptive terms like unbounded data and approximate/speculative results for distinct concepts often categorized under the “streaming” umbrella.
- Assessed the relative capabilities of well-designed batch and streaming systems, positing that streaming is in fact a strict superset of batch, and that notions like the Lambda Architecture, which are predicated on streaming being inferior to batch, are destined for retirement as streaming systems mature.
- Proposed two high-level concepts necessary for streaming systems to both catch up to and ultimately surpass batch: correctness and tools for reasoning about time.
- Established the important differences between event time and processing time, characterized the difficulties those differences impose when analyzing data in the context of when they occurred, and proposed a shift in approach away from notions of completeness and toward simply adapting to changes in data over time.
- Looked at the major data processing approaches in common use today for bounded and unbounded data, via both batch and streaming engines, roughly categorizing the unbounded approaches into: time-agnostic, approximation, windowing by processing time, and windowing by event time.
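To make the last distinction tangible, here is a minimal, hypothetical sketch (the records and timestamps are invented for illustration) of bucketing the same records into fixed two-minute windows keyed by event time versus processing time. Notice how a late arrival lands in the “correct” window under event time but in a much later window under processing time:

```python
# Hypothetical sketch: fixed windowing by event time vs. processing time.
from collections import defaultdict

WINDOW = 120  # window size in seconds

# (value, event_time, processing_time); record "b" arrives late.
events = [
    ("a", 0, 10),
    ("b", 60, 250),   # occurred in the first window, arrived much later
    ("c", 130, 140),
]

def window_by(records, time_index):
    """Group record values into fixed windows by the chosen timestamp."""
    windows = defaultdict(list)
    for record in records:
        windows[record[time_index] // WINDOW].append(record[0])
    return dict(windows)

by_event_time = window_by(events, 1)       # {0: ['a', 'b'], 1: ['c']}
by_processing_time = window_by(events, 2)  # {0: ['a'], 2: ['b'], 1: ['c']}
print(by_event_time, by_processing_time)
```

Under event time, “b” joins “a” in window 0 where it belongs; under processing time it is stranded in window 2, which is exactly the skew problem discussed earlier.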
Next time
This post provides the context necessary for the concrete examples I’ll be exploring in Part 2. That post will consist of roughly the following:
- A conceptual look at how we’ve broken up the notion of data processing in the Dataflow Model across four related axes: what, where, when, and how.
- A detailed look at processing a simple, concrete example data set across multiple scenarios, highlighting the plurality of use cases enabled by the Dataflow Model and the concrete APIs involved. These examples will help drive home the notions of event time and processing time introduced in this post, while additionally exploring new concepts such as watermarks.
- A comparison of existing data-processing systems across the important characteristics covered in both posts, to better enable educated choices among them, and to encourage improvement in areas that are lacking. My ultimate goal is the betterment of data processing systems in general, and streaming systems in particular, across the entire big data community.
Should be a good time. See you then!