
The World Beyond Batch: Streaming 101

A high-level tour of modern data-processing concepts.

By Tyler Akidau, August 5, 2015

[Figure: Three women wading in a stream gathering leeches (source: Wellcome Library, London).]

  1. Unbounded data: A type of ever-growing, essentially infinite data set. These are often referred to as “streaming data.” However, the terms streaming or batch are problematic when applied to data sets, because as noted above, they imply the use of a certain type of execution engine for processing those data sets. The key distinction between the two types of data sets in question is, in reality, their finiteness, and it’s thus preferable to characterize them by terms that capture this distinction. As such, I will refer to infinite “streaming” data sets as unbounded data, and finite “batch” data sets as bounded data.
  2. Unbounded data processing: An ongoing mode of data processing, applied to the aforementioned type of unbounded data. As much as I personally like the use of the term streaming to describe this type of data processing, its use in this context again implies the employment of a streaming execution engine, which is at best misleading; repeated runs of batch engines have been used to process unbounded data since batch systems were first conceived (and conversely, well-designed streaming systems are more than capable of handling “batch” workloads over bounded data). As such, for the sake of clarity, I will simply refer to this as unbounded data processing.
  3. Low-latency, approximate, and/or speculative results: These types of results are most often associated with streaming engines. The fact that batch systems have traditionally not been designed with low-latency or speculative results in mind is a historical artifact, and nothing more. And of course, batch engines are perfectly capable of producing approximate results if instructed to. Thus, as with the terms above, it’s far better to describe these results as what they are (low-latency, approximate, and/or speculative) than by how they have historically been manifested (via streaming engines).
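The bounded/unbounded distinction above can be made concrete with a minimal sketch (my own illustration, not from the original article): a bounded source is finite, so a single pass yields a complete, final answer, while an unbounded source never terminates, so any consumer can only ever materialize a finite prefix and emit ongoing, partial results.

```python
import itertools

def bounded_source():
    """A bounded ("batch") data set: finite, so processing can run to completion."""
    return [3, 1, 4, 1, 5]

def unbounded_source():
    """An unbounded ("streaming") data set: an infinite generator.
    A consumer can never "finish" it; it must process incrementally."""
    n = 0
    while True:
        yield n
        n += 1

# Bounded data: one pass computes a final, complete result.
total = sum(bounded_source())

# Unbounded data: we can only take a finite prefix and report a
# partial, ongoing result; here, the first five elements.
prefix = list(itertools.islice(unbounded_source(), 5))
running_total = sum(prefix)
```

Note that nothing in the sketch says which kind of engine does the consuming; as argued above, either property (finite or infinite) can be handled by either style of engine.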

From here on out, any time I use the term “streaming,” you can safely assume I mean an execution engine designed for unbounded data sets, and nothing more. When I mean any of the other terms above, I will explicitly say unbounded data, or low-latency / approximate / speculative results. These are the terms we’ve adopted within Cloud Dataflow, and I encourage others to take a similar stance.

On the greatly exaggerated limitations of streaming

Next up, let’s talk a bit about what streaming systems can and can’t do, with an emphasis on can; one of the biggest things I want to get across in these posts is just how capable a well-designed streaming system can be. Streaming systems have long been relegated to a somewhat niche market of providing low-latency, inaccurate/speculative results, often in conjunction with a more capable batch system to provide eventually correct results, i.e., the Lambda Architecture.

For those of you not already familiar with the Lambda Architecture, the basic idea is that you run a streaming system alongside a batch system, both performing essentially the same calculation. The streaming system gives you low-latency, inaccurate results (either because of the use of an approximation algorithm, or because the streaming system itself does not provide correctness), and some time later a batch system rolls along and provides you with correct output. Originally proposed by Twitter’s Nathan Marz (creator of Storm), it ended up being quite successful because it was, in fact, a fantastic idea for the time; streaming engines were a bit of a letdown in the correctness department, and batch engines were as inherently unwieldy as you’d expect, so Lambda gave you a way to have your proverbial cake and eat it, too. Unfortunately, maintaining a Lambda system is a hassle: you need to build, provision, and maintain two independent versions of your pipeline, and then also somehow merge the results from the two pipelines at the end.
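The dual-pipeline-plus-merge shape described above can be sketched in a few lines. This is a toy illustration under my own assumptions, not code from the article; all names (streaming_layer, batch_layer, serve) are hypothetical. The streaming layer counts events as they arrive and can over-count on retries, the batch layer periodically recomputes correct counts from the full log, and a serving step merges the two, with batch results overriding speculative ones.

```python
def streaming_layer(event_keys):
    """Speculative, low-latency counts: processes events as they arrive.
    With no exactly-once guarantee, a retried event is counted twice."""
    counts = {}
    for key in event_keys:
        counts[key] = counts.get(key, 0) + 1
    return counts

def batch_layer(events_log):
    """Eventually-correct counts: recomputes from the full log of
    (event_id, key) records, deduplicating by event id."""
    seen, counts = set(), {}
    for event_id, key in events_log:
        if event_id in seen:
            continue
        seen.add(event_id)
        counts[key] = counts.get(key, 0) + 1
    return counts

def serve(batch_counts, speculative_counts):
    """Serving layer: batch results override speculative ones where present."""
    merged = dict(speculative_counts)
    merged.update(batch_counts)
    return merged

# Event id 2 was retried, so the stream sees "b" twice.
events_log = [(1, "a"), (2, "b"), (2, "b"), (3, "a")]
speculative = streaming_layer([key for _, key in events_log])  # over-counts "b"
corrected = batch_layer(events_log)
served = serve(corrected, speculative)
```

Even in this toy form, the hassle is visible: the counting logic exists twice, in two different shapes, and the merge step is a third piece of code to get right.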

As someone who has spent years working on a strongly consistent streaming engine, I also found the entire principle of the Lambda Architecture a bit unsavory. Unsurprisingly, I was a huge fan of Jay Kreps’ Questioning the Lambda Architecture post when it came out. Here was one of the first highly visible statements against the necessity of dual-mode execution; delightful. Kreps addressed the issue of repeatability in the context of using a replayable system like Kafka as the streaming interconnect, and went so far as to propose the Kappa Architecture, which basically means running a single pipeline using a well-designed system that’s appropriately built for the job at hand. I’m not convinced that notion itself requires a name, but I fully support the idea in principle.
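The key enabler of the Kappa idea is the replayable log: instead of keeping a second, batch copy of the pipeline, you reprocess by replaying the retained log through new pipeline code. A minimal sketch of that idea, with a deliberately simplified, hypothetical Kafka-like log class of my own invention:

```python
class ReplayableLog:
    """A minimal, Kafka-like replayable log (hypothetical): appended
    records are retained, and any consumer can re-read from offset 0."""
    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def read(self, from_offset=0):
        return iter(self._records[from_offset:])

def pipeline_v1(records):
    # Original logic: count every record.
    return sum(1 for _ in records)

def pipeline_v2(records):
    # Revised logic: count only even values. Rather than running v1 and
    # v2 side by side (Lambda-style), we simply replay the log through
    # the new code to rebuild results.
    return sum(1 for r in records if r % 2 == 0)

log = ReplayableLog()
for r in [1, 2, 3, 4, 5, 6]:
    log.append(r)

v1_result = pipeline_v1(log.read())  # original output
v2_result = pipeline_v2(log.read())  # full reprocess via replay from offset 0
```

One pipeline definition at a time, one source of truth for the data; repeatability comes from the log, not from a parallel batch system.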



