Hi, Thakrar. Which version are you using now? If it's below Spark 2.4.0, please try to use 2.4.0.
There was an improvement related to that.

https://issues.apache.org/jira/browse/SPARK-25126

Bests,
Dongjoon.

On Wed, Nov 21, 2018 at 6:17 AM Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:

> Hi All,
>
> We have some batch processing where we read 100s of thousands of ORC files.
> What I found is that this was taking too much time AND that there was a
> long pause between the point the read begins in the code and the executors
> get into action.
> That period is about 1.5+ hours where only the driver seems to be busy.
>
> I have a feeling that this is due to double pass over the data for schema
> inference AND validation (e.g. if one of the files has a missing field,
> there is an exception).
> I tried providing the schema upfront as well as setting inferSchema to
> false, yet the same thing happens.
>
> Is there any explanation for this and is there any way to avoid it?
>
> Thanks,
> Jayesh