Never mind, I got it working with UDF. I just pass the file location to my evaluate function. Thanks! :) From: tben...@hotmail.com To: user@hive.apache.org Subject: RE: Limiting outer join Date: Tue, 7 Jul 2015 09:59:22 -0700
Thanks for your replies. I see how extracting the first country would work, however I was hoping to speed up my query by stopping the search once a country has been found. Are you suggesting that I pass the whole IP table to a UDF and perform the search myself? I've only programmed simple UDFs so far (ex: reformat a string), so any additional details would be appreciated. I am mostly concerned about importing said table (currently stored in Hive) and distributing the task across nodes (note that I use Tez). Regards, B > Date: Mon, 6 Jul 2015 18:18:44 -0700 > Subject: Re: Limiting outer join > From: gop...@apache.org > To: user@hive.apache.org > > > > In the following query, it is possible to limit the amount of entries > >returned by an outer join to a single value? I want to obtain a single > >country from ipv4geotable for each entry in logontable. > > Yes, the PTF DENSE_RANK()/ROW_NUMBER() basically gives you that - you can > read the first row out of each logon.IP except, there¹s no way to force > which country wins over the other without an order by country in the > OVER() clause as well. > > That said, it will only get slower to produce 1 row per group, because of > the distributed nature of the SQL engine, the reduction of data happens > after a ordering shuffle. > > You¹re doing range joins in a SQL engine without theta joins and MapReduce > had no way to implement those at runtime (Tez has, with EdgeManager > plugins). > > The easiest/traditional approach out of doing geo-IP lookups is a compact > UDF model without any joins at all. > > There¹s some old threads on discussing this as a built-in & some code > (with potential licensing issues) - > http://markmail.org/message/w54j4upwg2wbh3xg > > Cheers, > Gopal > >