Dear Hive development community members, I am interested in learning more about the current support for non-equijoins in Hive and/or other Hadoop SQL engines, and in getting feedback about community interest in more extensive support for such a feature. I intend to work on this challenge, assuming people find it compelling, and I intend to contribute results to the community. Where possible, it would be great to receive feedback and engage in collaborations along the way (for a bit more context, see the postscript of this message).
My initial goal is to support query conditions such as the following:

  A.x < B.y
  A.x in_range [B.y, B.z]
  distance(A.x, B.y) < D

where A and B are distinct tables/files. It is my understanding that current support for performing non-equijoins like those above is quite limited, and where some forms are supported (as in Cloudera's Impala), that support is based on a potentially expensive cross-product join. Depending on the data types involved, I believe that joins with these conditions can be made tractable (at least in the average case) with join algorithms that exploit properties of the data types, possibly with some pre-scanning of the data.

I am asking for feedback on the community's interest in and need for this work, as well as any pointers to similar work. In particular, I would appreciate answers to the following questions:

- Is my understanding of the state of the art in Hive and similar tools accurate? Are there groups currently working on similar or related issues, or tools that already accomplish some or all of what I have proposed?
- Is there significant value to the community in supporting such a feature? In other words, are the manual workarounds made necessary by the absence of non-equijoins like these enough of a pain to justify the work I propose?
- Being aware that the potential pre-scanning adds to the cost of the join, and that the data could still blow up in the worst case, am I missing any other important considerations and tradeoffs for this problem?
- What would be a good avenue for contributing this feature to the community (e.g., as a standalone tool on top of Hadoop, or as a Hive extension or plugin)?
- What is the best way to get started working with the community?

Thanks for your attention and any info you can provide!

Andres Quiroz

P.S. If you are interested in some context, and why/how I am proposing to do this work, please read on.
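To make the tractability claim above concrete, here is a minimal Python sketch of the kind of algorithm I have in mind for the first condition, A.x < B.y: sort one side once, then answer each probe with a binary search, so work is proportional to the input plus the actual join output rather than to the full cross product. All table names and values here are invented for illustration; this is only a single-machine sketch of the idea, not a proposed implementation.

```python
import bisect

# Toy stand-ins for columns A.x and B.y (made-up values).
A_x = [5, 12, 30]
B_y = [3, 10, 20, 40]

# Sort B.y once (O(|B| log |B|)); then each A.x needs only one binary
# search to find the first qualifying B row, so non-matching pairs are
# never examined. The output itself can still be large in the worst
# case, but we pay only for pairs that actually satisfy A.x < B.y.
B_sorted = sorted(B_y)
pairs = []
for x in A_x:
    i = bisect.bisect_right(B_sorted, x)   # first index with B.y > x
    pairs.extend((x, y) for y in B_sorted[i:])

print(pairs)
```

The same sort-then-probe shape generalizes to the in_range and distance conditions whenever the data type admits an ordering or a metric, which is why the data-type properties matter so much here.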
I am part of a small project team at PARC working on the general problems of data integration and automated ETL. We have proposed a tool called HiperFuse that is designed to accept declarative, high-level queries in order to produce joined (fused) data sets from multiple heterogeneous raw data sources. In our preliminary work, which you can find here (pointer to the paper), we designed the architecture of the tool and obtained separate results on the problems of automated data cleansing, data type inference, and query planning.

One of the planned prototype implementations of HiperFuse relies on Hadoop MR, and because the declarative language we proposed is closely related to SQL, we thought we could exploit the existing work in Hive and/or other open-source tools to handle the SQL part and layer our work on top of it. For example, the query given in the paper can easily be expressed in SQL-like form with a non-equijoin condition:

  SELECT web_access_log.ip, census.income
  FROM web_access_log, ip2zip, census
  WHERE web_access_log.ip in_range [ip2zip.ip_low, ip2zip.ip_high]
  AND ip2zip.zip = census.zip

As you can see, the first impasse we hit in bringing the elements together to solve this query end-to-end was the realization and performance of the non-equality join. The intent now is to tackle this problem in a general sense and provide a solution for a wide range of queries. The work I propose would be based on three main components within HiperFuse:

- Enhancements to the extensible data type framework in HiperFuse that would categorize data types based on the properties needed to support the join algorithms, making it possible to write join-ready, domain-specific data type libraries.
- The join algorithms themselves, based on Hive or directly on Hadoop MR.
- A query planner, which would determine the right algorithm to apply and automatically schedule any necessary pre-scanning of the data.
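As one illustration of how the in_range join in the query above could avoid a cross product, here is a hedged Python sketch, assuming the IP ranges in ip2zip are non-overlapping (typical of IP-to-location tables). All table contents are toy values, and IPs are shown as plain integers; a real implementation would first convert dotted quads to integers.

```python
import bisect

# Invented stand-ins for the three tables in the query.
web_access_log = [100, 260, 305]                     # ip (as int)
ip2zip = [(0, 150, "94304"), (200, 300, "10001")]    # (ip_low, ip_high, zip)
census = {"94304": 120000, "10001": 85000}           # zip -> income

# Sort the ranges once by ip_low. Because ranges do not overlap, each
# log IP falls in at most one range, so one binary search per probe
# replaces a scan over the whole ip2zip table.
ip2zip.sort(key=lambda r: r[0])
lows = [r[0] for r in ip2zip]

result = []
for ip in web_access_log:
    i = bisect.bisect_right(lows, ip) - 1        # candidate range
    if i >= 0 and ip <= ip2zip[i][1]:            # check ip_high bound
        result.append((ip, census[ip2zip[i][2]]))  # equijoin on zip

print(result)
```

In a Hadoop setting, the sorted ip2zip side is small enough to broadcast to every mapper (as in a map-side join), which is part of what the pre-scanning step in the query planner would establish.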