The Impala daemon is a core component of the Impala architecture. The daemon process runs on each data node and is the process to which clients (Hue, JDBC, ODBC) connect to issue queries. When a query is submitted to an Impala daemon, that node serves as the coordinator node for the query. The coordinating daemon parallelizes the query and distributes the work to the other nodes in the Impala cluster. Those nodes transmit partial results back to the coordinator, which constructs the final result set for the query.
It is a recommended practice to run impalad on each of the data nodes in a cluster, as Impala takes advantage of data locality when processing queries. Clients therefore typically connect to one of the data nodes to run their queries, which creates a single point of failure if they always issue queries to the same node. In addition, the node acting as the coordinator for a query potentially requires more memory and CPU cycles than the other nodes that process that query. For clusters running production workloads, high availability from the Impala clients' standpoint and load distribution across the nodes can be achieved by placing a proxy server or load balancer in front of the Impala daemons and issuing queries to them using round-robin scheduling.
HAProxy is a free, open source load balancer that can be used as a proxy server to distribute load across the Impala daemons. The high-level architecture for this setup is shown below.
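To make the scheduling idea concrete, round-robin simply cycles through the list of backends so that each new connection lands on the next daemon in turn. Below is a minimal shell sketch of that rotation; the impalad-N backend names are hypothetical, and HAProxy of course implements this internally.

```shell
# Round-robin scheduling sketch: each call returns the next backend
# in the list, wrapping around to the start. Backend names are
# hypothetical stand-ins for Impala daemon nodes.
backends="impalad-1 impalad-2 impalad-3"
i=0

pick_backend() {
  set -- $backends            # split the backend list into positional args
  shift $(( i % $# ))         # advance to the current rotation position
  echo "$1"                   # this connection's coordinator node
  i=$(( i + 1 ))              # the next call moves to the next backend
}
```

Four successive calls yield impalad-1, impalad-2, impalad-3, and then wrap back to impalad-1; HAProxy's balance roundrobin applies the same rotation to incoming client connections.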
Install the load balancer:
HAProxy can be installed and configured on Red Hat Enterprise Linux and CentOS systems using the following instructions.
yum install haproxy
Set up the configuration file: /etc/haproxy/haproxy.cfg.
The following is a sample configuration file:
global
    log         127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon

    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                tcp
    log                 global
    retries             3
    timeout connect     50000s
    timeout client      50000s
    timeout server      50000s
    maxconn             3000

#---------------------------------------------------------------------
# main frontend which proxies to the backends - change the port
# if you want
#---------------------------------------------------------------------
frontend main *:5000
    acl url_static       path_beg       -i /static /images /javascript /stylesheets
    acl url_static       path_end       -i .jpg .gif .png .css .js

    use_backend static          if url_static
    default_backend             impala

#---------------------------------------------------------------------
# static backend for serving up images, stylesheets and such
#---------------------------------------------------------------------
backend static
    balance     roundrobin
    server      static 127.0.0.1:4331 check

#---------------------------------------------------------------------
# round robin balancing between the various backends
#---------------------------------------------------------------------
backend impala
    mode tcp
    option tcplog
    balance roundrobin
    #balance leastconn
    #-----------------------------------------------------------------
    # Replace the IP addresses with your Impala daemon nodes' addresses
    #-----------------------------------------------------------------
    server client1 192.168.3.163:21000
    server client2 192.168.3.164:21000
    server client3 192.168.3.165:21000
After making the changes, reload HAProxy with the following command:
service haproxy reload
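Before reloading, it is worth validating the edited file. HAProxy's -c flag checks a configuration for syntax errors without starting the proxy:

```shell
# Check the configuration file for errors before reloading
haproxy -c -f /etc/haproxy/haproxy.cfg
```

If the check reports a problem, the running proxy keeps its old configuration until a valid file is reloaded.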
Note:
The key configuration options are balance and server in the backend impala section, along with the timeout options in the defaults section. When the balance parameter is set to leastconn, each new connection goes to the server with the lowest number of active connections. When balance is set to roundrobin, the proxy cycles through the servers so that each new connection uses a different coordinator node.
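For example, to prefer the least-loaded coordinator instead of strict rotation, the backend impala section shown earlier changes only in its balance directive (the server addresses are the same sample values used above):

```
backend impala
    mode tcp
    option tcplog
    balance leastconn
    server client1 192.168.3.163:21000
    server client2 192.168.3.164:21000
    server client3 192.168.3.165:21000
```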
- On systems managed by Cloudera Manager, specify a value for the Impala Daemons Load Balancer field, giving the address of the load balancer in host:port format. This setting lets Cloudera Manager route all appropriate Impala-related operations through the proxy server.
- For any scripts, jobs, or configuration settings for applications that formerly connected to a specific data node to run Impala SQL statements, change the connection information (such as the -i option in impala-shell) to point to the load balancer instead.
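For example, an impala-shell invocation would point at the proxy rather than any one daemon. The port matches the sample HAProxy frontend above; the host name is a placeholder for wherever HAProxy runs in your environment:

```
# Connect through the load balancer rather than a specific daemon
impala-shell -i haproxy-host:5000
```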
Test Impala through a Proxy for High Availability:
Manual testing with HAProxy:
Stop the Impala daemon service on one node at a time while running queries through the proxy, and verify that the queries still succeed, confirming that Impala high availability is working.
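On a CDH package installation the daemon typically runs as the impala-server service; a manual failover test might look like the following sketch (the service name and proxy address are assumptions for your environment):

```
# On one of the Impala nodes: stop the daemon
sudo service impala-server stop

# From a client: queries through the proxy should still succeed
impala-shell -i localhost:5000 -q "select 1"

# Restart the daemon before testing the next node
sudo service impala-server start
```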
Test Impala high availability with a shell script using HAProxy:
Run the following shell script to test Impala high availability through HAProxy.
Note: Please change the ‘table_name’ and ‘database_name’ placeholders.
for (( i = 0; i < 5; i++ ))
do
  impala-shell -i localhost:5000 -q "select * from {table_name}" -d {database_name}
done
Result: Run the above script and observe how the load balancer distributes the queries. Each iteration's query should execute on a different Impala daemon node (when balance is set to roundrobin).