
implyr: A dplyr Backend for Apache Impala

This talk introduces implyr, a new dplyr backend for Apache Impala.

Impala is a massively parallel processing query engine that enables low-latency SQL queries on data stored in the Hadoop Distributed File System (HDFS), Apache HBase, Apache Kudu, and Amazon Simple Storage Service (S3).

The distributed architecture of Impala enables fast interactive queries on petabyte-scale data, but it imposes limitations on the dplyr interface. For example, row ordering of a result set must be performed in the final phase of query processing. The author describes the methods used to work around this and other limitations.
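As a rough illustration of the ordering constraint, the sketch below places arrange() as the last verb before collect(), so that sorting happens in the final phase of the query. It is only a sketch under stated assumptions: the DSN name "Impala DSN", the table flights, and its columns carrier and arr_delay are hypothetical placeholders, and a reachable Impala cluster with a configured ODBC driver is assumed.

```r
library(dplyr)
library(implyr)

# Hypothetical connection: "Impala DSN" is a placeholder for an ODBC data
# source configured on your machine; extra arguments are passed to dbConnect().
impala <- src_impala(drv = odbc::odbc(), dsn = "Impala DSN")

# Hypothetical table and column names, used only for illustration.
flights <- tbl(impala, "flights")

delays <- flights %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay)) %>%  # sort as the last step of the pipeline
  collect()                     # pull the ordered result into R
```

Sorting earlier in the pipeline, before the grouping and summarising steps, would not be a reliable way to get ordered output here, which is why the sort is deferred to the end.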

Finally, the author discusses broader issues regarding the DBI-compatible interfaces that dplyr requires for underlying connectivity to database sources.

implyr is designed to work with any DBI-compatible interface to Impala, such as the general packages odbc and RJDBC, whereas other dplyr database backends typically rely on one particular package or mode of connectivity.
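The following sketch shows how the same src_impala() call might sit on top of either connectivity package. All connection details are assumptions for illustration: the DSN name, the JAR directory, the JDBC driver class name (which depends on the driver version installed), the host, the port, and the credentials are placeholders.

```r
library(implyr)

# Option 1: ODBC, via the odbc package.
# "Impala DSN" is a hypothetical data source name.
impala_odbc <- src_impala(drv = odbc::odbc(), dsn = "Impala DSN")

# Option 2: JDBC, via the RJDBC package.
library(RJDBC)

# Hypothetical directory holding the Impala JDBC driver JAR files.
jars <- list.files("/path/to/impala/jdbc", pattern = "\\.jar$", full.names = TRUE)

# The driver class name is an assumption; check the documentation of the
# JDBC driver you have installed.
drv <- JDBC("com.cloudera.impala.jdbc41.Driver", jars, identifier.quote = "`")

# Placeholder URL and credentials.
impala_jdbc <- src_impala(drv, "jdbc:impala://impala-host.example.com:21050",
                          "username", "password")
```

In either case the resulting object is used with dplyr verbs in the same way; only the driver passed as drv and the connection arguments differ.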



Source: useR! 2017