Ok, that would be great! Except for Monday and Friday, I could meet any day next week in the afternoon (Pacific time), since it is the end of the day for me.
Thanks a lot,

Andrés

On 5/15/15, 4:13 PM, "Thejas Nair" <thejas.n...@gmail.com> wrote:

>Hi Andres,
>Glad to hear about the progress!
>
>Vikram is a hive join implementation expert. He can guide you through
>this.
>We can setup a webex or google hangout and discuss this. Does sometime
>next week work for you ? (Please let us know some hours that work for
>you, in Pacific time zone).
>
>Anybody else who is interested in the theta join work is also welcome
>to join the discussion. Please let me know.
>
>Thanks,
>Thejas
>
>On Fri, May 15, 2015 at 12:48 PM, <andres.qui...@parc.com> wrote:
>> Hello,
>>
>> At this point, I have implemented a standalone version of the
>> 1-bucket-theta join algorithm described in the northeastern paper on
>> Hadoop MR, and would like to start porting it to Hive.
>>
>> I have been looking at the code and believe that the main goal would be
>> to implement a new JoinOperator. However, it's still not very clear to
>> me how this class interacts with the rest of the platform (i.e. How it
>> fits in the overall query processing workflow).
>>
>> Could someone please provide or point me to a crash course on
>> implementing a join operator? If nothing else, a list of steps and
>> other classes that I may have to touch or add would be a very helpful
>> starting point.
>>
>> Also, I suppose tez is preferred for the implementation, right?
>>
>> Thanks for your help,
>>
>> Andrés
>>
>> On 4/8/15, 2:32 PM, "Thejas Nair" <thejas.n...@gmail.com> wrote:
>>
>>>Yes, the theta join paper in northeastern is a good place to start.
>>>There is also a presentation from the folks in youtube, which is also
>>>very useful.
>>>I had a look at this issue as well earlier, and I had written up a
>>>rough proposal.
>>>I had not organized the document well enough for
>>>sharing publicly, but in case you find it useful, I have attached it
>>>to wiki -
>>>https://cwiki.apache.org/confluence/download/attachments/27362075/theta%20join%20proposal%20-%20thejas.pdf?version=1&modificationDate=1428517702954&api=v2
>>>It also includes a list of some of the changes that are needed (it is
>>>probably not comprehensive enough).
>>>
>>>On Wed, Apr 8, 2015 at 5:49 AM, <andres.qui...@parc.com> wrote:
>>>> So, I'd like to get started on this. The description in the design doc
>>>> and the theta join paper from Northeastern seem like a good place to
>>>> start, to have a baseline that I can later use for the more specific
>>>> join algorithms I want to try.
>>>>
>>>> I created a JIRA account, and my username is Andres.Quiroz
>>>>
>>>> Brock, since I'm completely new to this code, could you (or anyone
>>>> else) please point me to the relevant modules to start learning and
>>>> ramping up? Also, please let me know if I can contact you directly for
>>>> discussing this specific topic, or if I should always send a message to
>>>> the mailing list.
>>>>
>>>> Thank you,
>>>>
>>>> Andrés
>>>>
>>>> -----Original Message-----
>>>> From: andres.qui...@parc.com [mailto:andres.qui...@parc.com]
>>>> Sent: Thursday, April 02, 2015 9:07 AM
>>>> To: dev@hive.apache.org
>>>> Subject: RE: Request for feedback on work intent for non-equijoin support
>>>>
>>>> This is a great pointer, Szehon and Brock, thank you. I will catch up
>>>> with the material on theta joins and circle back.
>>>>
>>>> Andrés
>>>>
>>>> -----Original Message-----
>>>> From: Brock Noland [mailto:br...@apache.org]
>>>> Sent: Thursday, April 02, 2015 1:31 AM
>>>> To: dev@hive.apache.org
>>>> Subject: Re: Request for feedback on work intent for non-equijoin support
>>>>
>>>> Nice, it'd be great if someone finally implemented this :)
>>>>
>>>> On Wed, Apr 1, 2015 at 10:10 PM, Szehon Ho <sze...@cloudera.com> wrote:
>>>>> From Hive side, there has been some thought on the subject here:
>>>>> https://cwiki.apache.org/confluence/display/Hive/Theta+Join, it has
>>>>> some ideas but nobody has gotten around to giving it a try. It might
>>>>> be of interest.
>>>>>
>>>>> Thanks
>>>>> Szehon
>>>>>
>>>>> On Wed, Apr 1, 2015 at 10:05 PM, Lefty Leverenz <leftylever...@gmail.com> wrote:
>>>>>
>>>>>> D'oh! Thanks Chao.
>>>>>>
>>>>>> -- Lefty
>>>>>>
>>>>>> On Thu, Apr 2, 2015 at 12:59 AM, Chao Sun <c...@cloudera.com> wrote:
>>>>>>
>>>>>> > Hey Lefty,
>>>>>> >
>>>>>> > You need to use the ftp protocol, not http.
>>>>>> > After clicking the link, you'll need to remove "http://" from the
>>>>>> > address bar.
>>>>>> >
>>>>>> > Best,
>>>>>> > Chao
>>>>>> >
>>>>>> > On Wed, Apr 1, 2015 at 9:41 PM, Lefty Leverenz <leftylever...@gmail.com> wrote:
>>>>>> >
>>>>>> > > Andrés, I followed that link and got the dread 404 Not Found:
>>>>>> > >
>>>>>> > > "The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf
>>>>>> > > was not found on this server."
>>>>>> > >
>>>>>> > > -- Lefty
>>>>>> > >
>>>>>> > > On Wed, Apr 1, 2015 at 7:23 PM, <andres.qui...@parc.com> wrote:
>>>>>> > >
>>>>>> > > > Dear Lefty,
>>>>>> > > >
>>>>>> > > > Thank you very much for pointing that out and for your initial
>>>>>> > > > pointers.
>>>>>> > > > Here is the missing link:
>>>>>> > > >
>>>>>> > > > ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf
>>>>>> > > >
>>>>>> > > > Regards,
>>>>>> > > >
>>>>>> > > > Andrés
>>>>>> > > >
>>>>>> > > > -----Original Message-----
>>>>>> > > > From: Lefty Leverenz [mailto:leftylever...@gmail.com]
>>>>>> > > > Sent: Wednesday, April 01, 2015 12:48 AM
>>>>>> > > > To: dev@hive.apache.org
>>>>>> > > > Subject: Re: Request for feedback on work intent for non-equijoin support
>>>>>> > > >
>>>>>> > > > Hello Andres, the link to your paper is missing:
>>>>>> > > >
>>>>>> > > > In our preliminary work, which you can find here (pointer to the paper) ...
>>>>>> > > >
>>>>>> > > > You can find general information about contributing to Hive in the
>>>>>> > > > wiki: Resources for Contributors
>>>>>> > > > <https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors>,
>>>>>> > > > How to Contribute
>>>>>> > > > <https://cwiki.apache.org/confluence/display/Hive/HowToContribute>.
>>>>>> > > >
>>>>>> > > > -- Lefty
>>>>>> > > >
>>>>>> > > > On Tue, Mar 31, 2015 at 10:42 PM, <andres.qui...@parc.com> wrote:
>>>>>> > > >
>>>>>> > > > > Dear Hive development community members,
>>>>>> > > > >
>>>>>> > > > > I am interested in learning more about the current support
>>>>>> > > > > for non-equijoins in Hive and/or other Hadoop SQL engines,
>>>>>> > > > > and in getting feedback about community interest in more
>>>>>> > > > > extensive support for such a feature. I intend to work on
>>>>>> > > > > this challenge, assuming people find it compelling, and I
>>>>>> > > > > intend to contribute results to the community.
>>>>>> > > > > Where possible, it would be great to receive feedback and
>>>>>> > > > > engage in collaborations along the way (for a bit more
>>>>>> > > > > context, see the postscript of this message).
>>>>>> > > > >
>>>>>> > > > > My initial goal is to support query conditions such as the
>>>>>> > > > > following:
>>>>>> > > > >
>>>>>> > > > > A.x < B.y
>>>>>> > > > >
>>>>>> > > > > A.x in_range [B.y, B.z]
>>>>>> > > > >
>>>>>> > > > > distance(A.x, B.y) < D
>>>>>> > > > >
>>>>>> > > > > where A and B are distinct tables/files. It is my
>>>>>> > > > > understanding that current support for performing
>>>>>> > > > > non-equijoins like those above is quite limited, and where
>>>>>> > > > > some forms are supported (like in Cloudera's Impala), this
>>>>>> > > > > support is based on doing a potentially expensive cross
>>>>>> > > > > product join.
>>>>>> > > > > Depending on the data types involved, I believe that joins
>>>>>> > > > > with these conditions can be made to be tractable (at least
>>>>>> > > > > on the average) with join algorithms that exploit properties
>>>>>> > > > > of the data types, possibly with some pre-scanning of the
>>>>>> > > > > data.
>>>>>> > > > >
>>>>>> > > > > I am asking for feedback on the interest & need in the
>>>>>> > > > > community for this work, as well as any pointers to similar
>>>>>> > > > > work. In particular, I would appreciate any answers people
>>>>>> > > > > could give on the following questions:
>>>>>> > > > >
>>>>>> > > > > - Is my understanding of the state of the art in Hive and
>>>>>> > > > > similar tools accurate?
>>>>>> > > > > Are there groups currently working on similar or related
>>>>>> > > > > issues, or tools that already accomplish some or all of
>>>>>> > > > > what I have proposed?
>>>>>> > > > >
>>>>>> > > > > - Is there significant value to the community in the support
>>>>>> > > > > of such a feature? In other words, are the manual
>>>>>> > > > > workarounds necessary because of the absence of
>>>>>> > > > > non-equijoins such as these enough of a pain to justify the
>>>>>> > > > > work I propose?
>>>>>> > > > >
>>>>>> > > > > - Being aware that the potential pre-scanning adds to the
>>>>>> > > > > cost of the join, and that data could still blow-up in the
>>>>>> > > > > worst case, am I missing any other important considerations
>>>>>> > > > > and tradeoffs for this problem?
>>>>>> > > > >
>>>>>> > > > > - What would be a good avenue to contribute this feature to
>>>>>> > > > > the community (e.g. as a standalone tool on top of Hadoop,
>>>>>> > > > > or as a Hive extension or plugin)?
>>>>>> > > > >
>>>>>> > > > > - What is the best way to get started in working with the
>>>>>> > > > > community?
>>>>>> > > > >
>>>>>> > > > > Thanks for your attention and any info you can provide!
>>>>>> > > > >
>>>>>> > > > > Andres Quiroz
>>>>>> > > > >
>>>>>> > > > > P.S. If you are interested in some context, and why/how I am
>>>>>> > > > > proposing to do this work, please read on.
>>>>>> > > > >
>>>>>> > > > > I am part of a small project team at PARC working on the
>>>>>> > > > > general problems of data integration and automated ETL.
>>>>>> > > > > We have proposed a tool called HiperFuse that is designed to
>>>>>> > > > > accept declarative, high-level queries in order to produce
>>>>>> > > > > joined (fused) data sets from multiple heterogeneous raw
>>>>>> > > > > data sources. In our preliminary work, which you can find
>>>>>> > > > > here (pointer to the paper), we designed the architecture of
>>>>>> > > > > the tool and obtained some results separately on the
>>>>>> > > > > problems of automated data cleansing, data type inference,
>>>>>> > > > > and query planning. One of the planned prototype
>>>>>> > > > > implementations of HiperFuse relies on Hadoop MR, and
>>>>>> > > > > because the declarative language we proposed was closely
>>>>>> > > > > related to SQL, we thought that we could exploit the
>>>>>> > > > > existing work in Hive and/or other open-source tools for
>>>>>> > > > > handling the SQL part and layer our work on top of that. For
>>>>>> > > > > example, the query given in the paper could easily be
>>>>>> > > > > expressed in SQL-like form with a non-equijoin condition:
>>>>>> > > > >
>>>>>> > > > > SELECT web_access_log.ip, census.income
>>>>>> > > > >
>>>>>> > > > > FROM web_access_log, ip2zip, census
>>>>>> > > > >
>>>>>> > > > > WHERE web_access_log.ip in_range [ip2zip.ip_low, ip2zip.ip_high]
>>>>>> > > > >
>>>>>> > > > > AND ip2zip.zip = census.zip
>>>>>> > > > >
>>>>>> > > > > As you can see, the first impasse that we hit in order to
>>>>>> > > > > bring the elements together to solve this query end-to-end
>>>>>> > > > > was the realization and performance of the non-equality join
>>>>>> > > > > in the query. The intent now is to tackle this problem in a
>>>>>> > > > > general sense and provide a solution for a wide range of
>>>>>> > > > > queries.
>>>>>> > > > >
>>>>>> > > > > The work I propose to do would be based on three main
>>>>>> > > > > components within HiperFuse:
>>>>>> > > > >
>>>>>> > > > > - Enhancements to the extensible data type framework in
>>>>>> > > > > HiperFuse that would categorize data types based on the
>>>>>> > > > > properties needed to support the join algorithms, in order
>>>>>> > > > > to write join-ready domain-specific data type libraries.
>>>>>> > > > >
>>>>>> > > > > - The join algorithms themselves, based on Hive or directly
>>>>>> > > > > on Hadoop MR.
>>>>>> > > > >
>>>>>> > > > > - A query planner, which would determine the right algorithm
>>>>>> > > > > to apply and automatically schedule any necessary
>>>>>> > > > > pre-scanning of the data.
>>>>>> > > > >
>>>>>> >
>>>>>> > --
>>>>>> > Best,
>>>>>> > Chao
>>>>>> >
>>>>>>
>>
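
For readers following the thread, the in_range condition from the original message can be illustrated with a small Python sketch. It is not the 1-bucket-theta algorithm and not anything from Hive or HiperFuse; it only contrasts the cross-product evaluation mentioned in the thread with a join that pre-scans (sorts) one side. The function names, the toy data, and the assumption that the ranges do not overlap are all hypothetical and chosen for illustration.

```python
from bisect import bisect_right


def cross_product_range_join(log_ips, ip2zip):
    """Baseline: test every (ip, range) pair, O(|A| * |B|) comparisons."""
    return [(ip, zip_code)
            for ip in log_ips
            for (low, high, zip_code) in ip2zip
            if low <= ip <= high]


def sorted_range_join(log_ips, ip2zip):
    """Sort the ranges once (a pre-scan), then binary-search each ip.

    Assumption: ranges do not overlap, so an ip matches at most one range.
    """
    ranges = sorted(ip2zip)                  # pre-scanning step
    lows = [low for (low, _, _) in ranges]
    out = []
    for ip in log_ips:
        i = bisect_right(lows, ip) - 1       # last range whose low <= ip
        if i >= 0 and ip <= ranges[i][1]:    # check the upper bound
            out.append((ip, ranges[i][2]))
    return out


# Hypothetical toy data: IPs as integers, ranges as (ip_low, ip_high, zip).
ip2zip = [(100, 199, "94304"), (300, 399, "02115"), (200, 249, "10001")]
log_ips = [150, 250, 305, 99, 200]

# Both strategies agree on this data; only the amount of work differs.
assert sorted_range_join(log_ips, ip2zip) == cross_product_range_join(log_ips, ip2zip)
```

The sorted variant relies on a property of the data type (a total order on IPs), which is the kind of property the original message suggests a join algorithm could exploit. With overlapping ranges this simple lookup would miss matches, so a different structure would be needed.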