Apache Impala

In this post I will be discussing Apache Impala.

Apache Impala is an incubating project which provides high-performance, low-latency SQL queries on data stored in Apache Hadoop file formats.

So the next question that comes to mind is: why Impala, when Hive and Pig already exist?

Let us go back to the origins of Hive and Pig.

Origins of Pig and Hive:

Hadoop was used extensively at Yahoo! and Facebook to store, process and analyze huge amounts of data, and to process that data they used MapReduce. Both companies spent a lot of time writing Java code to run under MapReduce.

Both wanted to work faster and to let non-programmers get at their data directly. Yahoo! invented a language of its own, called Apache Pig, for that purpose. Facebook, likely preferring to use existing skills rather than train people in new ones, built the SQL system called Hive.

Both Hive and Pig work similarly. A user types a query in either Pig or Hive. A parser reads the query, figures out what the user wants and runs a series of MapReduce jobs that do the work.

(Of course, Hive is a declarative language while Pig is a procedural dataflow language; I don't want to get into the differences between Hive and Pig now.)

A criticism of Hadoop is that MapReduce is a batch data processing engine. It turns out to be a hugely useful batch data processing engine that has drastically changed the data management industry, but it's a fair criticism: by design, MapReduce is a heavyweight, high-latency execution framework. Jobs are slow to start and have all kinds of overhead.

Hive and Pig inherit these properties.

But users expect to get the result of a query in very little time, which is not possible through Hive and Pig, since both of them use MapReduce, which takes a long time to process a query.

The answer to this problem is Apache Impala.

Apache Impala :

Impala has its own MPP (massively parallel processing) engine, unlike Hive and Pig, which use MapReduce. As a result, Impala doesn't suffer the latencies that MapReduce operations impose. An Impala query begins to deliver results right away. Because it's designed from the ground up for SQL query execution, not as a general-purpose distributed processing system like MapReduce, it's able to deliver much better performance for that particular workload.

Every node in a Cloudera cluster has Impala code installed, waiting for SQL queries to execute. That code lives right alongside the MapReduce engine on every node, as well as the Apache HBase engine (not based on MapReduce) and Apache Spark, which users may choose to deploy. Those engines all have access to exactly the same data, and users can choose which of them to use based on the problem they're trying to solve.

Impala Binaries And Architecture :

Impala ships as a few binaries:
1. Impalad : the Impala daemon, installed on all N instances of an N-node cluster. Impala daemons are the ones that do the "real work". They form the core of the Impala execution engine and are the ones reading data from HDFS/HBase and aggregating it. All Impalads are equivalent.
2. Statestored : the statestore daemon, installed on 1 instance of an N-node cluster. The statestore daemon is a name service, i.e. it keeps track of which Impalads are up and running, and relays this information to all the Impalads in the cluster so they are aware of it when distributing tasks to other Impalads. This isn't a single point of failure, since Impala daemons continue to function even when statestored dies, albeit with stale name-service data.
3. Catalogd : the catalog daemon, installed on 1 instance of an N-node cluster. It distributes metadata (table names, column names, types, etc.) to Impala daemons via the statestore (which acts as a pub-sub mechanism). This isn't a single point of failure either, since Impala daemons continue to function without it. The user has to manually run a command to invalidate the metadata if there is an update to it.
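
The metadata-invalidation command mentioned above is run from impala-shell. A minimal sketch (the table name is a placeholder):

```sql
-- Discard and reload cached metadata for the whole catalog
-- (expensive if the catalog is large)
INVALIDATE METADATA;

-- Or invalidate metadata for a single table only
-- (my_table is a hypothetical name)
INVALIDATE METADATA my_table;
```

Invalidating a single table is usually preferable, since a full invalidation forces every Impalad to re-fetch all metadata on the next query.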

Architecture : 
An Impala daemon can be divided into 3 parts: planner, coordinator and executor.
1. Planner : responsible for parsing the query and creating a Directed Acyclic Graph (DAG) of operators from it. An operator is something that operates on the data; select, where and group by are all examples of operators. Planning happens in 2 parts: first, a single-node plan is made, as if all the data in the cluster resided on just one node; second, this single-node plan is converted to a distributed plan based on the location of the various data sources in the cluster (thereby leveraging data locality).
2. Coordinator : responsible for coordinating the execution of the entire query. In particular, it sends requests to the various executors to read and process data. It then receives the data back from these executors and streams it to the client via JDBC/ODBC.
3. Executor : responsible for reading data from HDFS/HBase and/or doing aggregations on data (which is read locally or, if not available locally, streamed from the executors of other Impala daemons).
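
You can inspect the plan the planner produces with Impala's EXPLAIN statement. A minimal sketch (table and column names are made up for illustration):

```sql
-- Show the query plan without executing the query
EXPLAIN
SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;
```

The output lists the plan as numbered operators (scans, aggregations and exchanges between nodes), which maps directly onto the single-node-plan-then-distributed-plan process described above.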

Refer to the following diagram for an overview of the Impala architecture.


In my personal experience, Impala reduces execution time to roughly one-sixth of what Hive takes.

Since Impala is in the incubating stage, there are many things it does not support compared to Hive.

Some of them include:

1) Slower performance of join operations. (You can tune things before performing a join to reduce execution time, for example by computing stats on the tables involved so that Impala knows their statistics and can optimize the join itself.)

2) It does not support setting of variables.

3) It lacks many of the built-in functions we generally use in PL/SQL, like to_char.

4) It does not support insertion into Avro-format tables, but it does support insertion into Parquet-format tables.
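
As a sketch of points 1) and 4) above (all table and column names here are hypothetical), the stats computation and a Parquet-table insert look like this in impala-shell:

```sql
-- Gather table and column statistics so the planner
-- can pick a better join strategy
COMPUTE STATS orders;
COMPUTE STATS customers;

-- A join that benefits from the statistics gathered above
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

-- INSERT works for Parquet tables (but not Avro tables)
CREATE TABLE orders_parquet LIKE orders STORED AS PARQUET;
INSERT INTO orders_parquet SELECT * FROM orders;
```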

Note : 

YARN is disabled for Impala internally, so the memory limit can be set manually for Impala. But once YARN is enabled, YARN's memory allocation will override our manual setting.
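
For instance, the memory limit can be set per session as a query option in impala-shell (the 2gb value is just an illustrative choice):

```sql
-- Cap the memory this session's queries may use on each node
SET MEM_LIMIT=2gb;
```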

I hope this article gives you a basic understanding of Apache Impala.