Double pass over ORC data files even after supplying schema and setting inferSchema = false

Thakrar, Jayesh Wed, 21 Nov 2018 06:17:47 -0800

Hi All,

We have some batch processing where we read hundreds of thousands of ORC files.
I found that this takes too much time, and that there is a long pause between
the point where the read begins in the code and the point where the executors
get into action.
That pause is 1.5+ hours, during which only the driver seems to be busy.


I have a feeling that this is due to a double pass over the data for schema 
inference AND validation (e.g. if one of the files has a missing field, there 
is an exception).
I tried providing the schema upfront as well as setting inferSchema to false, 
yet the same thing happens.
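
For reference, here is a minimal sketch of what "providing the schema upfront" might look like in PySpark. It is not the actual job code: the helper name, the path, and the column names are all hypothetical, and it assumes a Spark version whose DataFrameReader.schema() accepts a DDL string.

```python
def read_orc_with_schema(spark, path, schema_ddl):
    """Read ORC files using a caller-supplied schema.

    `spark` is assumed to be an existing SparkSession.
    `schema_ddl` is a DDL-style schema string such as "id BIGINT, name STRING".
    """
    # Hand the schema to the reader so it does not need to derive one
    # from the files themselves, then point it at the ORC location.
    return spark.read.schema(schema_ddl).orc(path)


# Hypothetical usage (path and columns are placeholders):
# df = read_orc_with_schema(spark, "/data/events/*.orc", "id BIGINT, name STRING")
```
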

Is there any explanation for this and is there any way to avoid it?

Thanks,
Jayesh
