[ https://issues.apache.org/jira/browse/HADOOP-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arghya Saha updated HADOOP-17755:
---------------------------------
    Description: 
Hi, I am running some transformations with Spark 3.1.1 (Hadoop 3.2) on K8s, reading input over s3a.

There is around 700 GB of data to read, and around 200 executors (5 vCores and 30 GB each).

The problematic stage (Scan orc => Filter => Project) reads most of the files successfully but fails on a few files at the end with the error below. The file named in the error is around 140 MB, and all the other files are of similar size.

I am able to read and rewrite the specific file mentioned, which suggests the file is not corrupted.
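A minimal standalone check along the lines below can confirm the file reads end to end (a sketch, not the exact job: it assumes orc-core and hadoop-aws on the classpath, and the path placeholder matches the one in the stack trace):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class OrcReadCheck {
  public static void main(String[] args) throws Exception {
    // Placeholder path, same as in the stack trace below.
    Path path = new Path("s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc");
    Reader reader = OrcFile.createReader(path, OrcFile.readerOptions(new Configuration()));
    VectorizedRowBatch batch = reader.getSchema().createRowBatch();
    long rows = 0;
    try (RecordReader rr = reader.rows()) {
      while (rr.nextBatch(batch)) {
        rows += batch.size;  // forces every stripe to be read, like the failing scan
      }
    }
    System.out.println("Read " + rows + " rows without error");
  }
}
{code}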

Let me know if further information is required.

 
{code:java}
java.io.IOException: Error reading file: s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1331)
	at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
	at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:96)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:511)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.io.EOFException: End of file reached before reading fully.
	at org.apache.hadoop.fs.s3a.S3AInputStream.readFully(S3AInputStream.java:702)
	at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
	at org.apache.orc.impl.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:566)
	at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readFileData(RecordReaderUtils.java:285)
	at org.apache.orc.impl.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:1237)
	at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1105)
	at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1256)
	at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1291)
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1327)
	... 20 more
{code}
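The nested EOFException comes from the positioned-read contract: readFully must either fill the entire requested range or throw, so it fails when the ORC reader asks for a byte range that runs past the end of the object. A minimal sketch of that contract (hypothetical local file standing in for the S3A stream):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFullyContract {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.getLocal(new Configuration());
    Path p = new Path("/tmp/readfully-demo.bin");
    try (FSDataOutputStream out = fs.create(p, true)) {
      out.write(new byte[64]);  // the file is only 64 bytes long
    }
    try (FSDataInputStream in = fs.open(p)) {
      byte[] buf = new byte[128];
      // Requests 128 bytes starting at offset 0; only 64 exist, so this
      // throws java.io.EOFException, the same exception class as in the trace.
      in.readFully(0, buf);
    }
  }
}
{code}

So the failing read is asking for more bytes than the stream can supply, which suggests the stream is ending early relative to the range the ORC reader requested, rather than a problem with the ORC encoding itself.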
 

 

  was:
Hi, I am running some transformations with Spark 3.1.1 (Hadoop 3.2) on K8s, reading input over s3a.

There is around 700 GB of data to read, and around 200 executors (5 vCores and 30 GB each).

The problematic stage (Scan orc => Filter => Project) reads most of the files successfully but fails on a few files at the end with the error below.

I am able to read and rewrite the specific file mentioned, which suggests the file is not corrupted.

Let me know if further information is required.

 


> EOF reached error reading ORC file on S3A
> -----------------------------------------
>
>                 Key: HADOOP-17755
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17755
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 3.2.0
>         Environment: Hadoop 3.2.0
>            Reporter: Arghya Saha
>            Priority: Major


