[ https://issues.apache.org/jira/browse/HIVE-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189838#comment-14189838 ]

Sushanth Sowmyan commented on HIVE-4329:
----------------------------------------

Despite my initial reservations about the approach, I've been trying to extend 
this patch, make it work, and get it into 0.14, because the functionality it 
introduces is important. Last week I pinged Vikram to get it okayed for 0.14. 
However, after further review and debugging, the patch is still incomplete.

The test failure from 
org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler.testPigPopulation 
reported above happens because this patch never calls 
FileSinkOperator.checkOutputSpecs, so the "actualOutputFormat" never gets 
populated and PassthroughOutputFormat thinks its underlying OutputFormat is 
null. It's also not a simple matter of just calling that function, since it 
depends on the FileSinkOperator having been instantiated and having a TableDesc 
in its context. That, at least, is fixable: HCatalog does have access to a 
TableDesc, so it could detect whether the underlying OutputFormat is a 
PassthroughOutputFormat and, if so, initialize it appropriately by calling a 
refactored FileSinkOperator.checkOutputSpecs that does not require the Operator 
itself. A rough sketch of that detection follows.
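
Purely as an illustration (this is a sketch, not code from the patch or from 
Hive/HCatalog today): checkOutputSpecsFromTableDesc below is a hypothetical 
name for the refactored FileSinkOperator.checkOutputSpecs described above, and 
matching on the class name is just one possible way to spot the passthrough 
wrapper without a hard compile-time dependency on it.

{code}
import org.apache.hadoop.hive.ql.plan.TableDesc;
import org.apache.hadoop.mapred.JobConf;

public class PassthroughDetectionSketch {

  // Hypothetical stand-in for a refactored FileSinkOperator.checkOutputSpecs
  // that only needs a TableDesc and a JobConf, not a live operator instance.
  static void checkOutputSpecsFromTableDesc(TableDesc tableDesc, JobConf job) {
    // ...would populate the underlying ("actual") OutputFormat for the
    // passthrough wrapper, as checkOutputSpecs does today.
  }

  static void prepareForWrite(TableDesc tableDesc, JobConf job) {
    Class<?> ofClass = tableDesc.getOutputFileFormatClass();
    // Detect the passthrough wrapper; checking the class name is only one
    // illustrative way to do this.
    if (ofClass != null
        && ofClass.getName().toLowerCase().contains("passthrough")) {
      checkOutputSpecsFromTableDesc(tableDesc, job);
    }
  }
}
{code}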

This also still breaks the traditional M/R OutputFormat use case under 
HCatalog. At this point, I think it's easier to fix the underlying issue of 
making Avro work with HCatalog than to rush this patch into the 0.14 timeframe.

( Having said that, PassthroughOutputFormat is itself pretty broken, since it 
stores the real output format as a static string in HiveFileFormatUtils. That 
breaks current use cases such as calling HBase through HS2 and then attempting 
to use any other M/R OutputFormat like Accumulo, because HS2 is a persistent 
process that retains the stale value of that static variable. It does not break 
from the Hive command line itself, as long as a query writes to only one 
M/R-OF-based output. That is a separate bug and not this patch's fault, but 
this patch makes HCatalog depend on PassthroughOutputFormat, and HCat does get 
used in multiple-use-per-process scenarios that this affects. (I'll file 
another JIRA on that issue soon - I've been debugging it.) We may rely on 
PassthroughOutputFormat in the short term, but we really need to move off it 
and support M/R OutputFormats natively (with native M/R OutputCommitter 
semantics) in Hive. A minimal illustration of the static-state problem follows 
below. )
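
To make the static-state problem concrete, here is a minimal, self-contained 
illustration in plain Java (not Hive code; the format names in the strings are 
made up): whichever query planned last wins for every writer in the same 
long-lived process.

{code}
public class StaticWrapperStateDemo {

  // Stand-in for the static string HiveFileFormatUtils keeps for the
  // passthrough wrapper's underlying OutputFormat.
  private static String realOutputFormatClassName;

  static void planQuery(String underlyingOutputFormat) {
    realOutputFormatClassName = underlyingOutputFormat;
  }

  static String resolveAtWriteTime() {
    // Writers resolve the wrapper against whatever was set most recently,
    // regardless of which query they belong to.
    return realOutputFormatClassName;
  }

  public static void main(String[] args) {
    planQuery("SomeHBaseOutputFormat");    // first query in the HS2 process
    planQuery("SomeAccumuloOutputFormat"); // second query, same process
    // The first query's writers now see the second query's format.
    System.out.println("first query resolves: " + resolveAtWriteTime());
  }
}
{code}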


> HCatalog should use getHiveRecordWriter rather than getRecordWriter
> -------------------------------------------------------------------
>
>                 Key: HIVE-4329
>                 URL: https://issues.apache.org/jira/browse/HIVE-4329
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog, Serializers/Deserializers
>    Affects Versions: 0.14.0
>         Environment: discovered in Pig, but it looks like the root cause 
> impacts all non-Hive users
>            Reporter: Sean Busbey
>            Assignee: David Chen
>            Priority: Critical
>             Fix For: 0.14.0
>
>         Attachments: HIVE-4329.0.patch, HIVE-4329.1.patch, HIVE-4329.2.patch, 
> HIVE-4329.3.patch, HIVE-4329.4.patch, HIVE-4329.5.patch
>
>
> Attempting to write to a HCatalog defined table backed by the AvroSerde fails 
> with the following stacktrace:
> {code}
> java.lang.ClassCastException: org.apache.hadoop.io.NullWritable cannot be 
> cast to org.apache.hadoop.io.LongWritable
>       at 
> org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat$1.write(AvroContainerOutputFormat.java:84)
>       at 
> org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:253)
>       at 
> org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:53)
>       at 
> org.apache.hcatalog.pig.HCatBaseStorer.putNext(HCatBaseStorer.java:242)
>       at org.apache.hcatalog.pig.HCatStorer.putNext(HCatStorer.java:52)
>       at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
>       at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
>       at 
> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:559)
>       at 
> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
> {code}
> The proximal cause of this failure is that the AvroContainerOutputFormat's 
> signature mandates a LongWritable key and HCat's FileRecordWriterContainer 
> forces a NullWritable. I'm not sure of a general fix, other than redefining 
> HiveOutputFormat to mandate a WritableComparable.
> It looks like accepting WritableComparable is what the other Hive 
> OutputFormats do, and there's no reason AvroContainerOutputFormat couldn't 
> also be changed, since it ignores the key. That way, fixing things so that 
> FileRecordWriterContainer can always use NullWritable could be spun off into 
> a separate issue?
> The underlying cause for failure to write to AvroSerde tables is that 
> AvroContainerOutputFormat doesn't meaningfully implement getRecordWriter, so 
> fixing the above will just push the failure into the placeholder RecordWriter.
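
For readers hitting this, the following standalone sketch (ordinary Java and 
Hadoop Writables, not the actual Hive/HCatalog classes) reproduces the shape of 
the failure described above: a writer whose signature pins the key to 
LongWritable, called through a raw/erased reference with a NullWritable key, 
fails with exactly this kind of ClassCastException, whereas a signature over 
WritableComparable that ignores the key would not.

{code}
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class KeyMismatchSketch {

  // Generic writer, analogous to org.apache.hadoop.mapred.RecordWriter.
  interface Writer<K, V> {
    void write(K key, V value);
  }

  // Writer whose signature pins the key to LongWritable, as
  // AvroContainerOutputFormat's anonymous RecordWriter does, even though
  // the key is never actually used.
  static class AvroLikeWriter implements Writer<LongWritable, String> {
    public void write(LongWritable key, String value) {
      // key ignored; only the value is written
    }
  }

  @SuppressWarnings({"rawtypes", "unchecked"})
  public static void main(String[] args) {
    // FileRecordWriterContainer effectively calls the writer through an
    // erased reference and always supplies a NullWritable key, so the
    // erasure bridge's cast to LongWritable fails at runtime.
    Writer raw = new AvroLikeWriter();
    raw.write(NullWritable.get(), "row");
    // -> java.lang.ClassCastException: NullWritable cannot be cast to
    //    LongWritable
    // A writer declared over WritableComparable (and ignoring the key) would
    // accept the NullWritable without any cast.
  }
}
{code}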



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
