A blog post on streaming data processing
At this point in time, we have enough background established that we can start looking at the core types of usage patterns common across bounded and unbounded data processing today. We'll look at both types of processing, and where relevant, within the context of the two main types of engines we care about (batch and streaming, where in this context, I'm essentially lumping micro-batch in with streaming since the differences between the two aren't terribly important at this level).

Bounded data

Processing bounded data is quite straightforward, and likely familiar to everyone. In the diagram below, we start out on the left with a data set full of entropy. We run it through some data processing engine (typically batch, though a well-designed streaming engine would work just as well), such as MapReduce, and on the right end up with a new structured data set with greater inherent value:

Figure 2: Bounded data processing with a classic batch engine. A finite pool of unstructured data on the left is run through a data processing engine, resulting in corresponding structured data on the right. Image: Tyler Akidau.

Though there are, of course, infinite variations on what you can actually calculate as part of this scheme, the overall model is quite simple. Much more interesting is the task of processing an unbounded data set. Let's now look at the various ways unbounded data are typically processed, starting with the approaches used with traditional batch engines, and then ending up with the approaches one can take with a system designed for unbounded data, such as most streaming or micro-batch engines.

Unbounded data — batch

Batch engines, though not explicitly designed with unbounded data in mind, have been used to process unbounded data sets since batch systems were first conceived. As one might expect, such approaches revolve around slicing up the unbounded data into a collection of bounded data sets appropriate for batch processing.
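Before looking at those slicing approaches in detail, the bounded pattern from Figure 2 can be made concrete with a toy MapReduce-style sketch. The input records and the extracted keys here are entirely hypothetical; the point is only the shape of the transformation from unstructured input to structured output:

```python
from collections import Counter

# A bounded (finite) input: raw, unstructured log lines (hypothetical data).
raw_events = [
    "user1 click", "user2 view", "user1 click",
    "user3 view", "user2 click",
]

# "Map" phase: extract a (key, value) pair from each record.
pairs = [(line.split()[1], 1) for line in raw_events]

# "Reduce" phase: aggregate per key, producing structured output.
counts = Counter()
for action, n in pairs:
    counts[action] += n

print(dict(counts))  # {'click': 3, 'view': 2}
```

Because the input is finite, the job simply runs to completion; completeness is never in question, which is exactly the property the unbounded approaches below have to work to recover.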
Fixed windows

The most common way to process an unbounded data set using repeated runs of a batch engine is by windowing the input data into fixed-sized windows, then processing each of those windows as a separate, bounded data source. Particularly for input sources like logs, where events can be written into directory and file hierarchies whose names encode the window they correspond to, this sort of thing appears quite straightforward at first blush, since you've essentially performed the time-based shuffle to get data into the appropriate event-time windows ahead of time.

In reality, most systems still have a completeness problem to deal with: what if some of your events are delayed en route to the logs due to a network partition? What if your events are collected globally and must be transferred to a common location before processing? What if your events come from mobile devices? This means some sort of mitigation may be necessary (e.g., delaying processing until you're sure all events have been collected, or re-processing the entire batch for a given window whenever data arrive late).

Figure 3: Unbounded data processing via ad hoc fixed windows with a classic batch engine. An unbounded data set is collected up front into finite, fixed-size windows of bounded data that are then processed via successive runs of a classic batch engine. Image: Tyler Akidau.

Sessions

This approach breaks down even more when you try to use a batch engine to process unbounded data into more sophisticated windowing strategies, like sessions. Sessions are typically defined as periods of activity (e.g., for a specific user) terminated by a gap of inactivity. When calculating sessions using a typical batch engine, you often end up with sessions that are split across batches, as indicated by the red marks in the diagram below. The number of splits can be reduced by increasing batch sizes, but at the cost of increased latency.
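The fixed-window slicing described above boils down to mapping each event's timestamp onto the window that contains it. A minimal sketch, with hypothetical event times and a one-minute window size:

```python
# Hypothetical events: (event_time_in_seconds, payload).
events = [(12, "a"), (47, "b"), (65, "c"), (118, "d"), (125, "e")]

WINDOW_SIZE = 60  # fixed one-minute windows

def window_start(ts, size=WINDOW_SIZE):
    # Each event falls into the window covering [start, start + size).
    return ts - (ts % size)

# Group events by the window they belong to, keyed by window start time.
windows = {}
for ts, payload in events:
    windows.setdefault(window_start(ts), []).append(payload)

print(windows)  # {0: ['a', 'b'], 60: ['c', 'd'], 120: ['e']}
```

Each resulting group is a bounded data set that a batch engine can process independently; the completeness caveats above are about events that arrive after their window's batch has already run.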
Another option is to add additional logic to stitch up sessions from previous runs, but at the cost of further complexity.
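A gap-based session computation of the kind described above can be sketched as follows. The gap threshold and timestamps are hypothetical; the key point is that a session boundary is wherever the gap between consecutive events exceeds the inactivity threshold:

```python
GAP = 30  # hypothetical inactivity gap, in seconds

def sessionize(timestamps, gap=GAP):
    """Group a single user's event times into sessions separated by gaps."""
    sessions, current = [], []
    for ts in sorted(timestamps):
        # A gap larger than the threshold closes the current session.
        if current and ts - current[-1] > gap:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

print(sessionize([5, 10, 50, 55, 120]))  # [[5, 10], [50, 55], [120]]
```

Note that this only works correctly when all of a user's events are visible to one run; if the input is chopped into fixed batches, a session straddling a batch boundary gets split in two, which is exactly the stitching problem described above.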