Re: Unions causing many scans of input - workaround?

Tim Robertson Mon, 08 Nov 2010 08:22:37 -0800

For anyone stumbling across this thread the (pretty much) working UDTF
can be seen:
  
http://code.google.com/p/gbif-occurrencestore/source/browse/trunk/occurrence-spatial/src/main/java/org/gbif/occurrence/spatial/udf/TaxonDensityUDTF.java


allowing for the following query:

create table taxon_density as
select taxonId,tileX,tileY,zoom,clusterX,clusterY,sum(count) from
occurrence_record lateral view taxonDensityUDTF(kingdom_concept_id,
phylum_concept_id, class_concept_id, order_concept_id,
family_concept_id, genus_concept_id,
species_concept_id,nub_concept_id, latitude, longitude, 10) e
as taxonId,tileX,tileY,zoom,clusterX,clusterY,count
group by taxonId,tileX,tileY,zoom,clusterX,clusterY;

Thanks again for the pointers Sonal and Namit, and also on the other thread,

Tim




On Mon, Nov 8, 2010 at 9:17 AM, Tim Robertson <timrobertson...@gmail.com> wrote:
> I am writing a GenericUDTF now, but notice on
> http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF
>
> the method docs show:
> /**
>   * Called to notify the UDTF that there are no more rows to process. Note 
> that
>   * forward() should not be called in this function. Only clean up code should
>   * be run.
>   */
>  public abstract void close() throws HiveException;
>
> but the example does exactly that:
> @Override
>  public void close() throws HiveException {
>    forwardObj[0] = count;
>    forward(forwardObj);
>    forward(forwardObj);
>  }
>
> I'll assume the example is correct and continue, but it might be worth
> fixing that page.
>
> Cheers,
> Tim
>
>
>
>
>
> On Mon, Nov 8, 2010 at 7:35 AM, Tim Robertson <timrobertson...@gmail.com> 
> wrote:
>> Thank you both,
>>
>> A quick glance looks like that is what I am looking for.  When I get
>> it working, I'll post the solution.
>>
>> Cheers,
>> Tim
>>
>> On Mon, Nov 8, 2010 at 6:55 AM, Namit Jain <nj...@facebook.com> wrote:
>>> Other option would be to create a wrapper script (not use either UDF or
>>> UDTF)
>>> That script, in any language, can emit any number of output rows per input
>>> row.
>>>
>>> Look at:
>>> http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform
>>> for details
>>>
>>> ________________________________
>>> From: Sonal Goyal [sonalgoy...@gmail.com]
>>> Sent: Sunday, November 07, 2010 8:40 PM
>>> To: user@hive.apache.org
>>> Subject: Re: Unions causing many scans of input - workaround?
>>>
>>> Hey Tim,
>>>
>>> You have an interesting problem. Have you tried creating a UDTF for your
>>> case, so that you can possibly emit more than one record for each row of
>>> your input?
>>>
>>> http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF
>>>
>>> Thanks and Regards,
>>> Sonal
>>>
>>> Sonal Goyal | Founder and CEO | Nube Technologies LLP
>>> http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Nov 8, 2010 at 2:31 AM, Tim Robertson <timrobertson...@gmail.com>
>>> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I am porting custom MR code to Hive and have written working UDFs
>>>> where I need them.  Is there a work around to having to do this in
>>>> Hive:
>>>>
>>>> select * from
>>>> (
>>>>    select name_id, toTileX(longitude,0) as x, toTileY(latitude,0) as
>>>> y, 0 as zoom, funct2(lontgitude, 0) as f2_x, funct2(latitude,0) as
>>>> f2_y, count (1) as count
>>>>    from table
>>>>    group by name_id, x, y, f2_x, f2_y
>>>>
>>>>    UNION ALL
>>>>
>>>>    select name_id, toTileX(longitude,1) as x, toTileY(latitude,1) as
>>>> y, 1 as zoom, funct2(lontgitude, 1) as f2_x, funct2(latitude,1) as
>>>> f2_y, count (1) as count
>>>>    from table
>>>>    group by name_id, x, y, f2_x, f2_y
>>>>
>>>>   --- etc etc increasing in zoom
>>>> )
>>>>
>>>> The issue being that this does many passes over the table, whereas
>>>> previously in my Map() I would just emit many times from the same
>>>> input record and then let it all group in the shuffle and sort.
>>>> I actually emit 184 times for an input record (23 zoom levels of
>>>> google maps, and 8 ways to derive the name_id) for a single record
>>>> which means 184 union statements - Is it possible in hive to force it
>>>> to emit many times from the source record in the stage-1 map?
>>>>
>>>> (ahem) Does anyone know if Pig can do this if not in Hive?
>>>>
>>>> I hope I have explained this well enough to make sense.
>>>>
>>>> Thanks in advance,
>>>> Tim
>>>
>>>
>>
>

Re: Unions causing many scans of input - workaround?

Reply via email to