Glad to hear!

On Fri, Apr 5, 2013 at 3:02 PM, Sadananda Hegde <saduhe...@gmail.com> wrote:

> Thanks, Mark.
>
> I found the problem. For some reason, Hive is not able to write an Avro
> output file when the schema has a complex field with a NULL option. It reads
> without any problem, but it cannot write with that structure. For example,
> the INSERT was failing on this array-of-struct field:
>
> { "name": "Passenger", "type":
>                        [{"type":"array","items":
>                            {"type":"record",
>                              "name": "PAXStruct",
>                              "fields": [
>                                        { "name":"PAXCode",
> "type":["string", "null"] },
>                                        {
> "name":"PAXQuantity","type":["int", "null"] }
>                                        ]
>                            }
>                         }, "null"]
>      }
>
> I removed the last "null" clause and it's working okay now.
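>
> In case it helps anyone else, this is roughly how the external table is
> declared; the table name, location, and schema path below are placeholders
> rather than the real ones. The column types come from the .avsc file, so
> the fix above lives in the schema file, not in the DDL:
>
>   -- Avro-backed external table (names and paths are illustrative)
>   CREATE EXTERNAL TABLE pax_avro
>   ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>   STORED AS
>     INPUTFORMAT  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>     OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>   LOCATION '/data/avro/passenger'
>   TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/passenger.avsc');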
>
> Regards,
> Sadu
>
>
> On Thu, Apr 4, 2013 at 12:36 AM, Mark Grover 
> <grover.markgro...@gmail.com>wrote:
>
>> Can you please check your JobTracker logs? That is a generic error related
>> to grabbing the Task Attempt Log URL; the real error is in the JT logs.
>>
>>
>> On Wed, Apr 3, 2013 at 7:17 PM, Sadananda Hegde <saduhe...@gmail.com>wrote:
>>
>>> Hi Dean,
>>>
>>> I tried inserting into a bucketed Hive table from a non-bucketed table using
>>> an INSERT OVERWRITE ... SELECT clause, but I get the following error.
>>>
>>> ----------------------------------------------------------------------------------
>>> Exception in thread "Thread-225" java.lang.NullPointerException
>>>         at
>>> org.apache.hadoop.hive.shims.Hadoop23Shims.getTaskAttemptLogUrl(Hadoop23Shims.java:44)
>>>         at
>>> org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.getTaskInfos(JobDebugger.java:186)
>>>         at
>>> org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.run(JobDebugger.java:142)
>>>         at java.lang.Thread.run(Thread.java:662)
>>> FAILED: Execution Error, return code 2 from
>>> org.apache.hadoop.hive.ql.exec.MapRedTask
>>>
>>> --------------------------------------------------------------------------------------------------------------------------
>>>
>>> Both tables have the same structure, except that one has a CLUSTERED BY
>>> clause and the other does not.
>>>
>>> Some columns are defined as ARRAYs of STRUCTs. The INSERT statement works
>>> fine if I take out those complex columns. Are there any known issues with
>>> loading STRUCT or ARRAY<STRUCT> fields?
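>>>
>>> For what it's worth, the bucketed target table is declared along these
>>> lines; the table name, column names, and bucket count here are simplified
>>> stand-ins for the real ones:
>>>
>>>   -- bucketed target table (names and bucket count are illustrative)
>>>   CREATE TABLE flights_bucketed (
>>>     flight_id STRING,
>>>     Passenger ARRAY<STRUCT<PAXCode:STRING, PAXQuantity:INT>>
>>>   )
>>>   PARTITIONED BY (ds STRING)
>>>   CLUSTERED BY (flight_id) INTO 32 BUCKETS;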
>>>
>>>
>>> Thanks for your time and help.
>>>
>>> Sadu
>>>
>>>
>>>
>>>
>>> On Sat, Mar 30, 2013 at 7:00 PM, Dean Wampler <
>>> dean.wamp...@thinkbiganalytics.com> wrote:
>>>
>>>> The table can be external. You should be able to use this data with
>>>> other tools, because all bucketing does is ensure that all records with a
>>>> given key are written into the same bucket (file). This is why
>>>> clustered/bucketed data can be joined on those keys using map-side joins;
>>>> Hive knows it can cache an individual bucket in memory, and that bucket will
>>>> hold all the records across the table for the keys in that bucket.
>>>>
>>>> So, Java MR apps and Pig can still read the records, but they won't
>>>> necessarily understand how the data is organized. I.e., it might appear
>>>> unsorted. Perhaps HCatalog will allow other tools to exploit the structure,
>>>> but I'm not sure.
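>>>>
>>>> As a rough illustration of the bucket map join idea (a minimal sketch; it
>>>> assumes both tables are bucketed on the join key with compatible bucket
>>>> counts, and the table and column names are made up):
>>>>
>>>>   -- turn on bucket map joins; older Hive also needs the MAPJOIN hint
>>>>   SET hive.optimize.bucketmapjoin = true;
>>>>
>>>>   SELECT /*+ MAPJOIN(d) */ f.flight_id, d.status
>>>>   FROM   flights_bucketed f
>>>>   JOIN   flight_details_bucketed d ON (f.flight_id = d.flight_id);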
>>>>
>>>> dean
>>>>
>>>>
>>>> On Sat, Mar 30, 2013 at 5:44 PM, Sadananda Hegde 
>>>> <saduhe...@gmail.com>wrote:
>>>>
>>>>> Thanks, Dean.
>>>>>
>>>>> Does that mean this bucketing is exclusively a Hive feature and not
>>>>> available to other tools like Java, Pig, etc.?
>>>>>
>>>>> And also, my final tables have to be managed tables, not external
>>>>> tables, right?
>>>>>
>>>>> Thanks again for your time and help.
>>>>>
>>>>> Sadu
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Mar 29, 2013 at 5:57 PM, Dean Wampler <
>>>>> dean.wamp...@thinkbiganalytics.com> wrote:
>>>>>
>>>>>> I don't know of any way to avoid creating new tables and moving the
>>>>>> data. In fact, that's the official way to do it, from a temp table to the
>>>>>> final table, so Hive can ensure the bucketing is done correctly:
>>>>>>
>>>>>>  https://cwiki.apache.org/Hive/languagemanual-ddl-bucketedtables.html
>>>>>>
>>>>>> In other words, you might have a big move now, but going forward,
>>>>>> you'll want to stage your data in a temp table, use this procedure to put
>>>>>> it in the final location, then delete the temp data.
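>>>>>>
>>>>>> Roughly, the recipe on that page boils down to something like this
>>>>>> (table, column, and partition names here are placeholders, not yours):
>>>>>>
>>>>>>   -- have Hive produce one output file per bucket
>>>>>>   SET hive.enforce.bucketing = true;
>>>>>>
>>>>>>   -- temp (staging) table -> final bucketed table, one partition at a time
>>>>>>   INSERT OVERWRITE TABLE final_bucketed PARTITION (ds = '2013-03-29')
>>>>>>   SELECT key_col, value_col
>>>>>>   FROM   temp_staging
>>>>>>   WHERE  ds = '2013-03-29';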
>>>>>>
>>>>>> dean
>>>>>>
>>>>>> On Fri, Mar 29, 2013 at 4:58 PM, Sadananda Hegde <saduhe...@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> We run M/R jobs to parse and process large and highly complex XML
>>>>>>> files into Avro files. Then we build external Hive tables on top of the
>>>>>>> parsed Avro files. The Hive tables are partitioned by day, but the
>>>>>>> partitions are still huge and joins do not perform that well. So I would
>>>>>>> like to try creating buckets on the join key. How do I create the buckets
>>>>>>> on the existing HDFS files? I would prefer to avoid creating another,
>>>>>>> bucketed, set of tables and loading data from the non-bucketed tables into
>>>>>>> the bucketed ones, if at all possible. Is it possible to do the bucketing
>>>>>>> in Java as part of the M/R jobs while creating the Avro files?
>>>>>>>
>>>>>>> Any help / insight would greatly be appreciated.
>>>>>>>
>>>>>>> Thank you very much for your time and help.
>>>>>>>
>>>>>>> Sadu
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Dean Wampler, Ph.D.*
>>>>>> thinkbiganalytics.com
>>>>>> +1-312-339-1330
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> *Dean Wampler, Ph.D.*
>>>> thinkbiganalytics.com
>>>> +1-312-339-1330
>>>>
>>>>
>>>
>>
>
