Thanks much Zhan Zhang. I will open a JIRA saying ORC files created using hiveContext.sql can't be read by the DataFrame reader.
Regards,
Umesh

On Oct 4, 2015 10:14, "Zhan Zhang" <zzh...@hortonworks.com> wrote:

> Hi Umesh,
>
> It depends on how you create and read the ORC file, although everything
> happens inside of Spark. There are two paths in Spark to create a table:
> one is through Hive, and the other is through the data frame API. Due to
> version compatibility issues, there may be conflicts between these two
> paths. You have to use dataframe.write and dataframe.read to avoid such
> issues (a sketch of this round trip follows after the thread). The ORC
> path has to be upgraded to the same version as Hive to solve this issue.
>
> ORC has become an independent project now, and we are waiting for it to
> be totally isolated from Hive. Then we can upgrade ORC to the latest
> version and put it into SqlContext. I think you can open a JIRA to track
> this upgrade.
>
> BTW, my name is Zhan Zhang, not Zang.
>
> Thanks.
>
> Zhan Zhang
>
> On Oct 3, 2015, at 2:18 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>
> Hi Zhan, any idea why this is happening? I can load ORC files created by
> a Hive table, but I can't load ORC files created by Spark itself. It
> looks like a bug.
>
> On Wed, Sep 30, 2015 at 12:03 PM, Umesh Kacha <umesh.ka...@gmail.com>
> wrote:
>
>> Hi Zhan, thanks much. Please find the code below.
>>
>> Working code, loading data from a path created by a Hive table using the
>> Hive console outside of Spark:
>>
>> DataFrame df =
>> hiveContext.read().format("orc").load("/hdfs/path/to/hive/table/partition")
>>
>> Not working code, loading data from a path created inside Spark by
>> hiveContext.sql insert-into-partition queries:
>>
>> DataFrame df =
>> hiveContext.read().format("orc").load("/hdfs/path/to/hive/table/partition/created/by/spark")
>>
>> As you can see, the code is the same in both cases; the second one just
>> tries to load ORC data created by Spark.
>>
>> On Sep 30, 2015 11:22 AM, "Zhan Zhang" <zzh...@hortonworks.com> wrote:
>>
>>> Hi Umesh,
>>>
>>> The potential reason is that Hive and Spark do not use the same
>>> OrcInputFormat. Newer Hive versions have a NewOrcInputFormat, but it is
>>> not in Spark because of backward compatibility (it is not available in
>>> hive-0.12).
>>> Do you mind posting the code that works and the code that does not work
>>> for you?
>>>
>>> Thanks.
>>>
>>> Zhan Zhang
>>>
>>> On Sep 29, 2015, at 10:05 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>>>
>>> Hi, I can read/load ORC data created by a Hive table into a dataframe,
>>> so why does it throw a Malformed ORC exception when I try to load data
>>> created by hiveContext.sql into a dataframe?
>>>
>>> On Sep 30, 2015 2:37 AM, "Hortonworks" <zzh...@hortonworks.com> wrote:
>>>
>>>> You can try to use the data frame API for both read and write.
>>>>
>>>> Thanks
>>>>
>>>> Zhan Zhang
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Sep 29, 2015, at 1:56 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>>>>
>>>> Hi Zhan, thanks for the response. The table is created using Spark
>>>> hiveContext.sql, and the data is inserted into the table (an insert
>>>> into a partitioned table), also using hiveContext.sql. When I try to
>>>> load the ORC data into a dataframe, I am loading a particular
>>>> partition's data stored in a path like
>>>> /user/xyz/Hive/xyz.db/sparktable/partition1=abc
>>>>
>>>> Regards,
>>>> Umesh
>>>>
>>>> On Sep 30, 2015 02:21, "Hortonworks" <zzh...@hortonworks.com> wrote:
>>>>
>>>>> How was the table generated, by Hive or by Spark?
>>>>>
>>>>> If you generate the table using Hive but read it with the data frame
>>>>> reader, there may be some compatibility issues.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Zhan Zhang
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> > On Sep 29, 2015, at 1:47 PM, unk1102 <umesh.ka...@gmail.com> wrote:
>>>>> >
>>>>> > Hi, I have a Spark job which creates Hive tables in ORC format with
>>>>> > partitions. It works well; I can read the data back into a Hive
>>>>> > table using the Hive console. But if I try to further process the
>>>>> > ORC files generated by the Spark job by loading them into a
>>>>> > dataframe, I get the following exception:
>>>>> >
>>>>> > Caused by: java.io.IOException: Malformed ORC file
>>>>> > hdfs://localhost:9000/user/hive/warehouse/partorc/part_tiny.txt.
>>>>> > Invalid postscript.
>>>>> >
>>>>> > DataFrame df = hiveContext.read().format("orc").load("to/path");
>>>>> >
>>>>> > Please guide.
>>>>> >
>>>>> > --
>>>>> > View this message in context:
>>>>> > http://apache-spark-user-list.1001560.n3.nabble.com/Hive-ORC-Malformed-while-loading-into-spark-data-frame-tp24876.html
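
[Editor's note] Below is a minimal, hypothetical sketch of the dataframe.write / dataframe.read round trip that Zhan Zhang recommends, written against the same Spark 1.x Java API used in the snippets above. The app name, source table, output path, and partition column are assumptions for illustration, not values from the thread; the point is only that the same ORC data-source path handles both the write and the read.

    // Hypothetical sketch: write ORC through the DataFrame writer instead of
    // an "INSERT INTO ... PARTITION" statement issued via hiveContext.sql,
    // then read it back through the DataFrame reader.
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.hive.HiveContext;

    public class OrcRoundTripSketch {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("OrcRoundTripSketch");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        HiveContext hiveContext = new HiveContext(jsc.sc());

        // Some source data; the table name is a placeholder.
        DataFrame sourceDf = hiveContext.sql("SELECT * FROM some_source_table");

        // Write partitioned ORC via the DataFrame writer (data frame path).
        sourceDf.write()
            .format("orc")
            .partitionBy("partition1")          // hypothetical partition column
            .mode(SaveMode.Append)
            .save("/hdfs/path/to/orc/output");  // hypothetical output path

        // Read it back via the DataFrame reader, filtering on the partition.
        DataFrame df = hiveContext.read()
            .format("orc")
            .load("/hdfs/path/to/orc/output")
            .filter("partition1 = 'abc'");
        df.show();

        jsc.stop();
      }
    }

Writing with hiveContext.sql("INSERT INTO ... PARTITION ...") instead goes through the Hive code path, which is where the version mismatch described in the thread can cause the reader to reject the files.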
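For comparison, the partition Spark wrote can also be read through the Hive metastore table rather than the raw partition directory, which exercises the Hive read path the thread says works from the Hive console. A short hypothetical snippet reusing the hiveContext from the sketch above; the table and partition names are taken from the path Umesh mentions (/user/xyz/Hive/xyz.db/sparktable/partition1=abc):

    // Read the same data via the metastore table (Hive path) instead of
    // loading the partition directory with the ORC data source directly.
    DataFrame viaTable = hiveContext.sql(
        "SELECT * FROM xyz.sparktable WHERE partition1 = 'abc'");
    viaTable.show();

If this works while the path-based load() fails, it supports the explanation that the mismatch lies between the two ORC code paths rather than in the data itself.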