Re: Request for feedback on work intent for non-equijoin support

Andres.Quiroz Fri, 15 May 2015 13:39:48 -0700

Ok, that would be great! Except for Monday and Friday, I could meet any
day next week in the afternoon (Pacific time), since it is the end of the
day for me.


Thanks a lot,

Andrés

On 5/15/15, 4:13 PM, "Thejas Nair" <thejas.n...@gmail.com> wrote:

>Hi Andres,
>Glad to hear about the progress!
>
>Vikram is a Hive join implementation expert. He can guide you through
>this.
>We can set up a WebEx or Google Hangout and discuss this. Does sometime
>next week work for you? (Please let us know some hours that work for
>you, in the Pacific time zone.)
>
>Anybody else who is interested in the theta join work is also welcome
>to join the discussion. Please let me know.
>
>Thanks,
>Thejas
>
>
>On Fri, May 15, 2015 at 12:48 PM,  <andres.qui...@parc.com> wrote:
>> Hello,
>>
>> At this point, I have implemented a standalone version of the
>> 1-bucket-theta join algorithm described in the Northeastern paper on
>> Hadoop MR, and would like to start porting it to Hive.
>>
>> I have been looking at the code and believe that the main goal would be
>> to implement a new JoinOperator. However, it's still not very clear to me
>> how this class interacts with the rest of the platform (i.e., how it fits
>> in the overall query processing workflow).
>>
>> Could someone please provide or point me to a crash course on
>> implementing a join operator? If nothing else, a list of steps and other
>> classes that I may have to touch or add would be a very helpful starting
>> point.
>>
>> Also, I suppose Tez is preferred for the implementation, right?
>>
>> Thanks for your help,
>>
>> Andrés
>>
>> On 4/8/15, 2:32 PM, "Thejas Nair" <thejas.n...@gmail.com> wrote:
>>
>>>Yes, the theta join paper from Northeastern is a good place to start.
>>>There is also a presentation from the folks on YouTube, which is
>>>very useful.
>>>I had a look at this issue as well earlier, and I had written up a
>>>rough proposal.  I had not organized the document well enough for
>>>sharing publicly, but in case you find it useful, I have attached it
>>>to wiki -
>>>https://cwiki.apache.org/confluence/download/attachments/27362075/theta%20join%20proposal%20-%20thejas.pdf?version=1&modificationDate=1428517702954&api=v2
>>>It also includes a list of some of the changes that are needed (it is
>>>probably not comprehensive enough).
>>>
>>>
>>>On Wed, Apr 8, 2015 at 5:49 AM,  <andres.qui...@parc.com> wrote:
>>>> So, I'd like to get started on this. The description in the design doc
>>>> and the theta join paper from Northeastern seem like a good place to
>>>> start, to have a baseline that I can later use for the more specific
>>>> join algorithms I want to try.
>>>>
>>>> I created a JIRA account, and my username is Andres.Quiroz
>>>>
>>>> Brock, since I'm completely new to this code, could you (or anyone
>>>> else) please point me to the relevant modules to start learning and
>>>> ramping up? Also, please let me know if I can contact you directly for
>>>> discussing this specific topic, or if I should always send a message to
>>>> the mailing list.
>>>>
>>>> Thank you,
>>>>
>>>> Andrés
>>>>
>>>> -----Original Message-----
>>>> From: andres.qui...@parc.com [mailto:andres.qui...@parc.com]
>>>> Sent: Thursday, April 02, 2015 9:07 AM
>>>> To: dev@hive.apache.org
>>>> Subject: RE: Request for feedback on work intent for non-equijoin
>>>> support
>>>>
>>>> This is a great pointer, Szehon and Brock, thank you. I will catch up
>>>> with the material on theta joins and circle back.
>>>>
>>>> Andrés
>>>>
>>>> -----Original Message-----
>>>> From: Brock Noland [mailto:br...@apache.org]
>>>> Sent: Thursday, April 02, 2015 1:31 AM
>>>> To: dev@hive.apache.org
>>>> Subject: Re: Request for feedback on work intent for non-equijoin
>>>> support
>>>>
>>>> Nice, it'd be great if someone finally implemented this :)
>>>>
>>>> On Wed, Apr 1, 2015 at 10:10 PM, Szehon Ho <sze...@cloudera.com>
>>>> wrote:
>>>>> From Hive side, there has been some thought on the subject here:
>>>>> https://cwiki.apache.org/confluence/display/Hive/Theta+Join, it has
>>>>> some ideas but nobody has gotten around to giving it a try.  It might
>>>>> be of interest.
>>>>>
>>>>> Thanks
>>>>> Szehon
>>>>>
>>>>>
>>>>> On Wed, Apr 1, 2015 at 10:05 PM, Lefty Leverenz
>>>>> <leftylever...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> D'oh!  Thanks Chao.
>>>>>>
>>>>>> -- Lefty
>>>>>>
>>>>>> On Thu, Apr 2, 2015 at 12:59 AM, Chao Sun <c...@cloudera.com> wrote:
>>>>>>
>>>>>> > Hey Lefty,
>>>>>> >
>>>>>> > You need to use the ftp protocol, not http.
>>>>>> > After clicking the link, you'll need to remove "http://" from the
>>>>>> > address bar.
>>>>>> >
>>>>>> > Best,
>>>>>> > Chao
>>>>>> >
>>>>>> > On Wed, Apr 1, 2015 at 9:41 PM, Lefty Leverenz
>>>>>> > <leftylever...@gmail.com>
>>>>>> > wrote:
>>>>>> >
>>>>>> > > Andrés, I followed that link and got the dread 404 Not Found:
>>>>>> > >
>>>>>> > > "The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf
>>>>>> > > was not found on this server."
>>>>>> > >
>>>>>> > > -- Lefty
>>>>>> > >
>>>>>> > > On Wed, Apr 1, 2015 at 7:23 PM, <andres.qui...@parc.com> wrote:
>>>>>> > >
>>>>>> > > > Dear Lefty,
>>>>>> > > >
>>>>>> > > > Thank you very much for pointing that out and for your initial
>>>>>> > > > pointers. Here is the missing link:
>>>>>> > > >
>>>>>> > > > ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf
>>>>>> > > >
>>>>>> > > > Regards,
>>>>>> > > >
>>>>>> > > > Andrés
>>>>>> > > >
>>>>>> > > > -----Original Message-----
>>>>>> > > > From: Lefty Leverenz [mailto:leftylever...@gmail.com]
>>>>>> > > > Sent: Wednesday, April 01, 2015 12:48 AM
>>>>>> > > > To: dev@hive.apache.org
>>>>>> > > > Subject: Re: Request for feedback on work intent for
>>>>>> > > > non-equijoin support
>>>>>> > > >
>>>>>> > > > Hello Andres, the link to your paper is missing:
>>>>>> > > >
>>>>>> > > > In our preliminary work, which you can find here (pointer to
>>>>>> > > > the paper) ...
>>>>>> > > >
>>>>>> > > >
>>>>>> > > > You can find general information about contributing to Hive in
>>>>>> > > > the wiki: Resources for Contributors
>>>>>> > > > <https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors>,
>>>>>> > > > How to Contribute
>>>>>> > > > <https://cwiki.apache.org/confluence/display/Hive/HowToContribute>.
>>>>>> > > >
>>>>>> > > > -- Lefty
>>>>>> > > >
>>>>>> > > > On Tue, Mar 31, 2015 at 10:42 PM, <andres.qui...@parc.com> wrote:
>>>>>> > > >
>>>>>> > > > >  Dear Hive development community members,
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > > I am interested in learning more about the current support
>>>>>> > > > > for non-equijoins in Hive and/or other Hadoop SQL engines,
>>>>>> > > > > and in getting feedback about community interest in more
>>>>>> > > > > extensive support for such a feature. I intend to work on
>>>>>> > > > > this challenge, assuming people find it compelling, and I
>>>>>> > > > > intend to contribute results to the community. Where
>>>>>> > > > > possible, it would be great to receive feedback and engage
>>>>>> > > > > in collaborations along the way (for a bit more context, see
>>>>>> > > > > the postscript of this message).
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > > My initial goal is to support query conditions such as the
>>>>>> > > > > following:
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > > A.x < B.y
>>>>>> > > > >
>>>>>> > > > > A.x in_range [B.y, B.z]
>>>>>> > > > >
>>>>>> > > > > distance(A.x, B.y) < D
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > > where A and B are distinct tables/files. It is my
>>>>>> > > > > understanding that current support for performing
>>>>>> > > > > non-equijoins like those above is quite limited, and where
>>>>>> > > > > some forms are supported (like in Cloudera's Impala), this
>>>>>> > > > > support is based on doing a potentially expensive cross
>>>>>> > > > > product join. Depending on the data types involved, I
>>>>>> > > > > believe that joins with these conditions can be made to be
>>>>>> > > > > tractable (at least on the average) with join algorithms
>>>>>> > > > > that exploit properties of the data types, possibly with
>>>>>> > > > > some pre-scanning of the data.
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > > I am asking for feedback on the interest & need in the
>>>>>> > > > > community for this work, as well as any pointers to similar
>>>>>> > > > > work. In particular, I would appreciate any answers people
>>>>>> > > > > could give on the following questions:
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > > - Is my understanding of the state of the art in Hive and
>>>>>> > > > > similar tools accurate? Are there groups currently working
>>>>>> > > > > on similar or related issues, or tools that already
>>>>>> > > > > accomplish some or all of what I have proposed?
>>>>>> > > > >
>>>>>> > > > > - Is there significant value to the community in the support
>>>>>> > > > > of such a feature? In other words, are the manual
>>>>>> > > > > workarounds necessary because of the absence of
>>>>>> > > > > non-equijoins such as these enough of a pain to justify the
>>>>>> > > > > work I propose?
>>>>>> > > > >
>>>>>> > > > > - Being aware that the potential pre-scanning adds to the
>>>>>> > > > > cost of the join, and that data could still blow up in the
>>>>>> > > > > worst case, am I missing any other important considerations
>>>>>> > > > > and tradeoffs for this problem?
>>>>>> > > > >
>>>>>> > > > > - What would be a good avenue to contribute this feature to
>>>>>> > > > > the community (e.g. as a standalone tool on top of Hadoop,
>>>>>> > > > > or as a Hive extension or plugin)?
>>>>>> > > > >
>>>>>> > > > > - What is the best way to get started in working with the
>>>>>> > > > > community?
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > > Thanks for your attention and any info you can provide!
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > > Andres Quiroz
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > > P.S. If you are interested in some context, and why/how I am
>>>>>> > > > > proposing to do this work, please read on.
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > > I am part of a small project team at PARC working on the
>>>>>> > > > > general problems of data integration and automated ETL. We
>>>>>> > > > > have proposed a tool called HiperFuse that is designed to
>>>>>> > > > > accept declarative, high-level queries in order to produce
>>>>>> > > > > joined (fused) data sets from multiple heterogeneous raw
>>>>>> > > > > data sources. In our preliminary work, which you can find
>>>>>> > > > > here (pointer to the paper), we designed the architecture of
>>>>>> > > > > the tool and obtained some results separately on the
>>>>>> > > > > problems of automated data cleansing, data type inference,
>>>>>> > > > > and query planning. One of the planned prototype
>>>>>> > > > > implementations of HiperFuse relies on Hadoop MR, and
>>>>>> > > > > because the declarative language we proposed was closely
>>>>>> > > > > related to SQL, we thought that we could exploit the
>>>>>> > > > > existing work in Hive and/or other open-source tools for
>>>>>> > > > > handling the SQL part and layer our work on top of that. For
>>>>>> > > > > example, the query given in the paper could easily be
>>>>>> > > > > expressed in SQL-like form with a non-equijoin condition:
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > > SELECT web_access_log.ip, census.income
>>>>>> > > > >
>>>>>> > > > > FROM web_access_log, ip2zip, census
>>>>>> > > > >
>>>>>> > > > > WHERE web_access_log.ip in_range [ip2zip.ip_low,
>>>>>> > > > > ip2zip.ip_high]
>>>>>> > > > >
>>>>>> > > > > AND ip2zip.zip = census.zip
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > > As you can see, the first impasse that we hit in order to
>>>>>> > > > > bring the elements together to solve this query end-to-end
>>>>>> > > > > was the realization and performance of the non-equality join
>>>>>> > > > > in the query. The intent now is to tackle this problem in a
>>>>>> > > > > general sense and provide a solution for a wide range of
>>>>>> > > > > queries.
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > > The work I propose to do would be based on three main
>>>>>> > > > > components within HiperFuse:
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > > - Enhancements to the extensible data type framework in
>>>>>> > > > > HiperFuse that would categorize data types based on the
>>>>>> > > > > properties needed to support the join algorithms, in order
>>>>>> > > > > to write join-ready domain-specific data type libraries.
>>>>>> > > > >
>>>>>> > > > > - The join algorithms themselves, based on Hive or directly
>>>>>> > > > > on Hadoop MR.
>>>>>> > > > >
>>>>>> > > > > - A query planner, which would determine the right algorithm
>>>>>> > > > > to apply and automatically schedule any necessary
>>>>>> > > > > pre-scanning of the data.
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > >
>>>>>> > >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > Best,
>>>>>> > Chao
>>>>>> >
>>>>>>
>>
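
For readers following the thread, the 1-bucket-theta join mentioned above can be illustrated with a small single-process sketch. This is a simplified reading of the idea, not the code discussed in the thread and not anything from Hive. It assumes a fixed grid of rows x cols "reducer" cells; the function names, the grid size, and the sample data are all hypothetical. Each left tuple picks a random row and is copied to every cell in that row; each right tuple picks a random column and is copied to every cell in that column. Every (left, right) pair therefore meets in exactly one cell, where an arbitrary join predicate can be evaluated.

```python
import random
from collections import defaultdict


def one_bucket_theta_join(left, right, predicate, rows=2, cols=2, seed=0):
    """Simulate a grid-partitioned theta join in one process (illustrative only)."""
    rng = random.Random(seed)
    # Each grid cell stands in for one reducer: (left tuples, right tuples).
    cells = defaultdict(lambda: ([], []))

    # "Map" side: replicate each left tuple across one randomly chosen row.
    for a in left:
        i = rng.randrange(rows)
        for j in range(cols):
            cells[(i, j)][0].append(a)

    # "Map" side: replicate each right tuple down one randomly chosen column.
    for b in right:
        j = rng.randrange(cols)
        for i in range(rows):
            cells[(i, j)][1].append(b)

    # "Reduce" side: each cell checks the predicate on its local pairs.
    out = []
    for lefts, rights in cells.values():
        for a in lefts:
            for b in rights:
                if predicate(a, b):
                    out.append((a, b))
    return out


def naive_theta_join(left, right, predicate):
    """Reference implementation: filter the full cross product."""
    return [(a, b) for a in left for b in right if predicate(a, b)]


if __name__ == "__main__":
    # Hypothetical data for a condition like: A.x in_range [B.y, B.z]
    a_values = [1, 5, 9, 12]
    b_ranges = [(0, 4), (4, 10), (11, 20)]
    in_range = lambda x, r: r[0] <= x <= r[1]
    print(sorted(one_bucket_theta_join(a_values, b_ranges, in_range)))
```

Because the grid covers the whole cross product, the result matches the naive nested-loop join for any predicate; the random placement only affects how evenly the comparison work is spread across cells.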
