Re: Double pass over ORC data files even after supplying schema and setting inferSchema = false

Dongjoon Hyun Wed, 21 Nov 2018 09:46:49 -0800

Hi, Thakrar.

Which version are you using now? If it is older than Spark 2.4.0, please try
2.4.0.

There was an improvement related to that.

https://issues.apache.org/jira/browse/SPARK-25126

Bests,
Dongjoon.


On Wed, Nov 21, 2018 at 6:17 AM Thakrar, Jayesh <
jthak...@conversantmedia.com> wrote:

> Hi All,
>
> We have some batch processing where we read hundreds of thousands of ORC
> files.
>
> What I found is that this was taking too much time AND that there was a
> long pause between the point where the read begins in the code and the
> point where the executors get into action.
>
> That period is about 1.5+ hours, during which only the driver seems to be
> busy.
>
> I have a feeling that this is due to a double pass over the data for schema
> inference AND validation (e.g. if one of the files has a missing field,
> there is an exception).
>
> I tried providing the schema upfront as well as setting inferSchema to
> false, yet the same thing happens.
>
> Is there any explanation for this, and is there any way to avoid it?
>
> Thanks,
>
> Jayesh
>
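
For reference, here is a minimal sketch of what "providing the schema upfront" usually looks like when reading ORC, written in PySpark style. The helper name read_orc_with_schema, the path, and the column names are hypothetical and only for illustration; the original job's code is not shown in this thread. As far as I know, inferSchema is an option of the CSV source and is not a documented ORC option, so the explicit schema() call is the part that matters here. This is only a sketch and does not by itself guarantee that the driver-side pause described above goes away.

```python
def read_orc_with_schema(spark, path, schema_ddl):
    """Read ORC files with a caller-supplied schema instead of an inferred one.

    `spark` is an existing SparkSession. `schema_ddl` is a DDL-formatted
    string such as "id BIGINT, name STRING"; DataFrameReader.schema also
    accepts a StructType.
    """
    # Set the schema on the reader first, then point it at the ORC path.
    return spark.read.schema(schema_ddl).orc(path)


# Hypothetical usage (needs pyspark installed and a running session):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("orc-read").getOrCreate()
#   df = read_orc_with_schema(spark, "/data/events", "id BIGINT, name STRING")
```

The helper takes the session as an argument so it can be exercised without a cluster; in a real job the same two calls would normally be written inline.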
