
Drill Within the Big Data Ecosystem


Having worked as a developer in the big data space for a while now, I find that answers are emerging to the question: how does Drill fit into the Big Data ecosystem, and when would someone use Drill versus the many other excellent tools available? This note summarizes that (evolving) understanding.

Context

While there are many definitions of Big Data, a useful one here is simply the idea that an application has so much data that it goes beyond the capabilities of "traditional" technologies such as a relational database or data warehouse. Why this definition? An application developer just needs to get the job done. If Oracle (or Teradata, SQL Server, etc.) can handle the task, just use it and get the app built.

The challenge comes when the volume of data is simply too large for existing tools to handle (or to handle within the project budget). Most traditional tools were "scale-up": as data grew, the application needed ever larger machines or expensive clusters of specialized hardware. Enter the idea of spreading the data across a large number of commodity machines, then writing simple tasks to process the data step-by-step.

Without a vendor to provide a proprietary storage format, data storage defaulted to simple, generic formats: CSV, JSON, and later others such as ORC and Parquet.

At first, Google and others hand-crafted the processing. Google quickly realized that most of the framework was generic and created Map-Reduce, which led to the open-source Hadoop. In short order, many DAG-based projects came about: Microsoft Dryad, Hadoop, Spark, Storm and many more.

The distributed DAG model has data transforms as nodes and (typically network) exchanges as edges. Writing a data transform is a specialized skill. Soon Hive came along to automate the process: submit HiveQL in one end and get a DAG out the other.

Old-time Hadoop used disk files as exchanges, which was convenient but slow. Dryad used network exchanges to keep data in memory, as did Storm and others.

The purpose here is not to provide a complete history of Big Data, but rather to provide the context for Drill. Apache Drill is a DAG-based, distributed execution engine with in-memory processing and network exchanges between nodes. Drill's defining feature is that it uses SQL to define the DAG: Drill is responsible for planning a DAG that efficiently implements a set of data transforms expressed as SQL.
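
One way to see this in practice: Drill's EXPLAIN PLAN FOR statement prints the DAG that Drill plans for a given query without running it. Below is a minimal sketch using Drill's JDBC driver; it assumes a Drillbit reachable on the local machine (with the drill-jdbc-all jar on the classpath) and uses the sample cp.`employee.json` table that ships with Drill. Treat the connection URL and column handling as illustrative rather than definitive.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ShowDrillPlan {
    public static void main(String[] args) throws Exception {
        // Assumes a Drillbit running on this machine; adjust the URL for a real cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement stmt = conn.createStatement();
             // EXPLAIN PLAN FOR returns the operator DAG Drill planned for the SQL,
             // instead of executing the query.
             ResultSet rs = stmt.executeQuery(
                 "EXPLAIN PLAN FOR "
                 + "SELECT gender, COUNT(*) FROM cp.`employee.json` GROUP BY gender")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // first column holds the textual plan
            }
        }
    }
}
```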

Understanding Drill within the Big Data Ecosystem

This definition of Drill helps us understand several ways that Drill differs from, or is like, the other members of the Big Data menagerie.

SQL-based Transforms

Drill transforms SQL into a DAG of data transforms. This means that Drill, not the user, defines the set of transform tasks. Unlike with Spark, Storm, Apex and others, the user has no ability to code up a custom operator. Instead, the set of operators available is the set needed to execute a SQL statement.

As it turns out, SQL has a long history and a firm theoretical foundation; SQL can express a very large percentage of the operations that people perform on tabular (or table-like) data. Still, there are operations for which SQL is a poor fit. SQL is not ideal for machine-learning tasks (it has no way to express iteration, build learning models, and so on). SQL is also not intended for stream processing (though it is great for working on such data once it is "parked" in storage).

This fact is the first way to differentiate Drill. When a data transform task can be represented as SQL, Drill automates all the work required to execute the operation: just submit the query and get the results -- a huge savings relative to any tool that requires hand-created transforms or DAGs.
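
As a concrete illustration of "just submit the query and get the results," here is a minimal sketch using Drill's JDBC driver. The "zk=local" URL assumes Drill running in embedded mode; the same statement could equally be run from sqlline or the Web UI.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RunDrillQuery {
    public static void main(String[] args) throws Exception {
        // "zk=local" assumes embedded-mode Drill; point zk= at ZooKeeper for a cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement();
             // Drill plans the DAG, runs it, and streams the results back;
             // no hand-written transform or DAG code is involved.
             ResultSet rs = stmt.executeQuery(
                 "SELECT full_name, salary FROM cp.`employee.json` "
                 + "ORDER BY salary DESC LIMIT 5")) {
            while (rs.next()) {
                System.out.println(rs.getString("full_name") + "\t" + rs.getDouble("salary"));
            }
        }
    }
}
```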

In-Memory and Network Exchanges

Unlike old-time Map-Reduce (or Hive), Drill does not use disk files for its exchanges. Instead, Drill buffers data in memory (in so-called "record batches") and transfers batches over the network between nodes (more precisely, between slices of the query plan, called "fragments", that run on those nodes).

The result is far faster processing than Map-Reduce or Hive can achieve. (Though Hive LLAP is beginning to offer network exchanges as well.) However, this speed-up comes at an important cost: Drill needs plenty of memory to hold in-flight batches. There is no free lunch: Drill lets an application trade more memory for faster processing (relative to a tool that uses disk-based exchanges).
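
That trade-off is tunable. As a hedged example, Drill exposes a per-node memory budget for buffering operators through the planner.memory.max_query_memory_per_node option (value in bytes); option names and defaults can vary across Drill versions, so check the options list on your build before relying on this sketch.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TuneDrillMemory {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement()) {
            // Raise the per-node budget for memory-hungry operators (sorts, hash joins)
            // for this session only; 4294967296 bytes = 4 GB.
            stmt.execute(
                "ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 4294967296");
        }
    }
}
```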

Late-Schema, Many File Formats

As noted above, Big Data is a loose collection of tools. There is no single engine (such as Oracle, DB2, SQL Server, etc.) to define the on-disk format (typically highly optimized for that one engine). Instead, Big Data offers the "Data Lake": a distributed storage system, such as HDFS, into which one can toss files. Without a reason to standardize on one format or another, Data Lakes often take the approach of "let 100 formats bloom": CSV, JSON, Parquet, ORC, Sequence File, you name it. And this does not count the very large number of specialized formats (Pcap, web logs, binary formats and zillions of others).

The lack of a standard, optimized format presents challenges to any Big Data tool. The most obvious is the need for parsers to transform some peculiar format into the record (or "row" or "tuple") format needed by a particular tool.

For example, Map-Reduce allows developers to write record readers that parse an arbitrary file into a standardized Hadoop key/value format. Drill supports "storage plugins" and "format plugins" to solve the same problem.
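
From the query author's point of view, those plugins show up simply as queryable names and paths: the storage plugin (cp, dfs, hive, and so on) identifies the storage system, and the format plugin turns each file type into rows. A small sketch follows; cp.`employee.json` ships with Drill, while the dfs path in the comment is purely hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class QueryMixedFormats {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement()) {
            // The cp (classpath) storage plugin plus the JSON format plugin parse this
            // bundled sample file into rows -- no schema is declared up front.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT full_name FROM cp.`employee.json` LIMIT 3")) {
                while (rs.next()) {
                    System.out.println(rs.getString("full_name"));
                }
            }

            // Hypothetical path: the dfs plugin plus the Parquet format plugin would read
            // a file like this the same way, selected by extension or workspace config.
            // stmt.executeQuery("SELECT * FROM dfs.`/data/logs/2017/06/events.parquet` LIMIT 3");
        }
    }
}
```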
