Re: Request for feedback on work intent for non-equijoin support

Lefty Leverenz Wed, 01 Apr 2015 21:43:14 -0700

Andrés, I followed that link and got the dread 404 Not Found:

"The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf was not
found on this server."


-- Lefty

On Wed, Apr 1, 2015 at 7:23 PM, <andres.qui...@parc.com> wrote:

> Dear Lefty,
>
> Thank you very much for pointing that out and for your initial pointers.
> Here is the missing link:
>
> ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf
>
> Regards,
>
> Andrés
>
> -----Original Message-----
> From: Lefty Leverenz [mailto:leftylever...@gmail.com]
> Sent: Wednesday, April 01, 2015 12:48 AM
> To: dev@hive.apache.org
> Subject: Re: Request for feedback on work intent for non-equijoin support
>
> Hello Andres, the link to your paper is missing:
>
> In our preliminary work, which you can find here (pointer to the paper) ...
>
>
> You can find general information about contributing to Hive in the wiki:
> Resources for Contributors
> <https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors>
> and How to Contribute
> <https://cwiki.apache.org/confluence/display/Hive/HowToContribute>.
>
> -- Lefty
>
> On Tue, Mar 31, 2015 at 10:42 PM, <andres.qui...@parc.com> wrote:
>
> >  Dear Hive development community members,
> >
> >
> >
> > I am interested in learning more about the current support for
> > non-equijoins in Hive and/or other Hadoop SQL engines, and in getting
> > feedback about community interest in more extensive support for such a
> > feature. I intend to work on this challenge, assuming people find it
> > compelling, and I intend to contribute results to the community. Where
> > possible, it would be great to receive feedback and engage in
> > collaborations along the way (for a bit more context, see the
> > postscript of this message).
> >
> >
> >
> > My initial goal is to support query conditions such as the following:
> >
> >
> >
> > A.x < B.y
> >
> > A.x in_range [B.y, B.z]
> >
> > distance(A.x, B.y) < D
> >
> >
> >
> > where A and B are distinct tables/files. It is my understanding that
> > current support for performing non-equijoins like those above is quite
> > limited, and where some forms are supported (like in Cloudera's
> > Impala), this support is based on doing a potentially expensive cross
> > product join.
> > Depending on the data types involved, I believe that joins with these
> > conditions can be made tractable (at least on average) with
> > join algorithms that exploit properties of the data types, possibly
> > with some pre-scanning of the data.
> >
> >
> >
> > I am asking for feedback on the community's interest in and need for
> > this work, as well as any pointers to similar work. In particular, I
> > would appreciate any answers people could give to the following
> > questions:
> >
> >
> >
> > - Is my understanding of the state of the art in Hive and similar
> > tools accurate? Are there groups currently working on similar or
> > related issues, or tools that already accomplish some or all of what I
> > have proposed?
> >
> > - Is there significant value to the community in the support of such a
> > feature? In other words, are the manual workarounds necessary because
> > of the absence of non-equijoins such as these enough of a pain to
> > justify the work I propose?
> >
> > - Being aware that the potential pre-scanning adds to the cost of the
> > join, and that data could still blow up in the worst case, am I
> > missing any other important considerations and tradeoffs for this
> > problem?
> >
> > - What would be a good avenue to contribute this feature to the
> > community (e.g. as a standalone tool on top of Hadoop, or as a Hive
> > extension or plugin)?
> >
> > - What is the best way to get started in working with the community?
> >
> >
> >
> > Thanks for your attention and any info you can provide!
> >
> >
> >
> > Andres Quiroz
> >
> >
> >
> > P.S. If you are interested in some context, and why/how I am proposing
> > to do this work, please read on.
> >
> >
> >
> > I am part of a small project team at PARC working on the general
> > problems of data integration and automated ETL. We have proposed a
> > tool called HiperFuse that is designed to accept declarative,
> > high-level queries in order to produce joined (fused) data sets from
> > multiple heterogeneous raw data sources. In our preliminary work,
> > which you can find here (pointer to the paper), we designed the
> > architecture of the tool and obtained some results separately on the
> > problems of automated data cleansing, data type inference, and query
> > planning. One of the planned prototype implementations of HiperFuse
> > relies on Hadoop MR, and because the declarative language we proposed
> > was closely related to SQL, we thought that we could exploit the
> > existing work in Hive and/or other open-source tools for handling the
> > SQL part and layer our work on top of that. For example, the query
> > given in the paper could easily be expressed in SQL-like form with a
> > non-equijoin condition:
> >
> >
> >
> > SELECT web_access_log.ip, census.income
> >
> > FROM web_access_log, ip2zip, census
> >
> > WHERE web_access_log.ip in_range [ip2zip.ip_low, ip2zip.ip_high]
> >
> > AND ip2zip.zip = census.zip
> >
> >
> >
> > As you can see, the first obstacle we hit in bringing the elements
> > together to solve this query end-to-end was the implementation and
> > performance of the non-equijoin in the query. The intent now
> > is to tackle this problem in a general sense and provide a solution
> > for a wide range of queries.
> >
> >
> >
> > The work I propose to do would be based on three main components
> > within HiperFuse:
> >
> >
> >
> > - Enhancements to the extensible data type framework in HiperFuse that
> > would categorize data types based on the properties needed to support
> > the join algorithms, in order to write join-ready domain-specific data
> > type libraries.
> >
> > - The join algorithms themselves, based on Hive or directly on Hadoop MR.
> >
> > - A query planner, which would determine the right algorithm to apply
> > and automatically schedule any necessary pre-scanning of the data.
> >
> >
> >
>
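As an illustration of the in_range condition in the quoted query, here is a
minimal sketch in Python of how such a join can avoid a cross product by
sorting the ranges once and using binary search for each lookup. It is not
taken from the HiperFuse paper or from Hive. The function name range_join,
the tuple layouts, and the sample data are hypothetical, and the sketch
assumes closed ranges that do not overlap, which the thread does not state.

```python
from bisect import bisect_right


def range_join(points, ranges):
    """Match each (key, x) in points to the range [low, high] containing x.

    ranges is a list of (low, high, payload) tuples. Assumption made for
    this sketch: the ranges are closed and do not overlap.
    """
    ordered = sorted(ranges, key=lambda r: r[0])  # the "pre-scan": sort once
    lows = [r[0] for r in ordered]
    out = []
    for key, x in points:
        i = bisect_right(lows, x) - 1  # last range whose low bound is <= x
        if i >= 0 and x <= ordered[i][1]:
            out.append((key, ordered[i][2]))
    return out


# Hypothetical data: IP addresses as integers, ranges mapped to zip codes.
ip2zip = [(300, 399, "10001"), (100, 199, "94304")]
accesses = [("a", 150), ("b", 250), ("c", 300)]
matches = range_join(accesses, ip2zip)
```

With n points and m ranges this does O((n + m) log m) work instead of
evaluating the condition on all n * m pairs. Overlapping ranges would need a
different structure, such as an interval tree.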

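The simpler condition A.x < B.y from the quoted message can be sketched the
same way. The Python below is an illustrative assumption, not a description of
how Hive or Impala evaluate such joins, and less_than_join is a hypothetical
name. It sorts one side once, so each value of x finds its matching values of
y with a binary search rather than a scan of every pair.

```python
from bisect import bisect_right


def less_than_join(a_values, b_values):
    """Return all pairs (x, y) with x < y, x from a_values, y from b_values."""
    b_sorted = sorted(b_values)
    pairs = []
    for x in a_values:
        start = bisect_right(b_sorted, x)  # index of the first y with y > x
        pairs.extend((x, y) for y in b_sorted[start:])
    return pairs


pairs = less_than_join([5, 1, 9], [4, 5, 8])
```

The output can still contain up to len(a_values) * len(b_values) pairs, which
is the worst-case blow-up mentioned in the quoted message. The saving is only
in skipping comparisons for pairs that cannot match.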