You might want to try gzip-compressed output as opposed to Parquet. The only way I ever reliably got Parquet to work on S3 was by using Alluxio as a buffer, but that's a decent amount of work.
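For what it's worth, a rough sketch of what that swap could look like in pyspark is below. It's only an illustration: the DataFrame, the s3a:// paths, and the choice of gzipped JSON as the substitute format are placeholders, and whether a row-based gzip format is acceptable depends on what reads the output downstream.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gzip-instead-of-parquet").getOrCreate()

# hypothetical input; in the real jobs this would be whatever produces `df`
df = spark.read.json("s3a://some-bucket/input/")

# instead of the columnar Parquet write ...
# df.write.mode("overwrite").parquet("s3a://some-bucket/output/")

# ... write gzip-compressed JSON (CSV works the same way), which is a plain
# sequential object write per partition
df.write.mode("overwrite") \
    .option("compression", "gzip") \
    .json("s3a://some-bucket/output-json-gz/")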
On Thu, May 11, 2017 at 11:50 AM, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:
> Also, and this is unrelated to the actual question... Why don't these
> messages show up in the archive?
>
> http://apache-spark-user-list.1001560.n3.nabble.com/
>
> Ideally I'd want to post a link to our internal wiki for these questions,
> but I can't find them in the archive.
>
> On 11 May 2017 at 07:16, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:
>>
>> Looks like this isn't viable in Spark 2.0.0 (and greater, I presume). I'm
>> pretty sure I came across this blog before and ignored it for that reason.
>>
>> Any other thoughts? The linked tickets look relevant too:
>> https://issues.apache.org/jira/browse/SPARK-10063
>> https://issues.apache.org/jira/browse/HADOOP-13786
>> https://issues.apache.org/jira/browse/HADOOP-9565
>>
>> On 10 May 2017 at 22:24, Miguel Morales <therevolti...@gmail.com> wrote:
>>>
>>> Try using the DirectParquetOutputCommitter:
>>> http://dev.sortable.com/spark-directparquetoutputcommitter/
>>>
>>> On Wed, May 10, 2017 at 10:07 PM, lucas.g...@gmail.com
>>> <lucas.g...@gmail.com> wrote:
>>> > Hi users, we have a bunch of pyspark jobs that use S3 for loading,
>>> > intermediate steps, and final output of Parquet files.
>>> >
>>> > We're running into the following issues on a semi-regular basis:
>>> > * These are intermittent errors, i.e. we have about 300 jobs that run
>>> > nightly, and a fairly random but small-ish percentage of them fail
>>> > with the following classes of errors.
>>> >
>>> > S3 write errors:
>>> >
>>> >> "ERROR Utils: Aborting task
>>> >> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 404,
>>> >> AWS Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null,
>>> >> AWS Error Message: Not Found, S3 Extended Request ID: BlaBlahEtc="
>>> >
>>> >> "Py4JJavaError: An error occurred while calling o43.parquet.
>>> >> : com.amazonaws.services.s3.model.MultiObjectDeleteException: Status
>>> >> Code: 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null,
>>> >> AWS Error Message: One or more objects could not be deleted, S3
>>> >> Extended Request ID: null"
>>> >
>>> > S3 read errors:
>>> >
>>> >> [Stage 1:=================================================> (27 + 4) / 31]
>>> >> 17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in stage 1.0 (TID 11)
>>> >> java.net.SocketException: Connection reset
>>> >>   at java.net.SocketInputStream.read(SocketInputStream.java:196)
>>> >>   at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>> >>   at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
>>> >>   at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
>>> >>   at sun.security.ssl.InputRecord.read(InputRecord.java:509)
>>> >>   at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
>>> >>   at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
>>> >>   at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
>>> >>   at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
>>> >>   at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
>>> >>   at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
>>> >>   at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
>>> >>   at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:168)
>>> >>   at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
>>> >>   at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
>>> >>   at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> >>   at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> >>   at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> >>   at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> >>   at com.amazonaws.services.s3.model.S3Object.close(S3Object.java:203)
>>> >>   at org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:187)
>>> >
>>> > We have literally tons of logs we can add, but they would make this
>>> > email unwieldy. If it would be helpful I'll drop them in a pastebin
>>> > or something.
>>> >
>>> > Our config is along the lines of:
>>> >
>>> > spark-2.1.0-bin-hadoop2.7
>>> > '--packages
>>> > com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0
>>> > pyspark-shell'
>>> >
>>> > Given the Stack Overflow reading / googling I've been doing, I know
>>> > we're not the only org with these issues, but I haven't found a good
>>> > set of solutions in those spaces yet.
>>> >
>>> > Thanks!
>>> >
>>> > Gary Lucas
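For anyone reading this later: the DirectParquetOutputCommitter suggestion quoted above comes down to one setting on Spark 1.x, and as noted it was removed in Spark 2.0 (SPARK-10063), so it won't help on spark-2.1.0. A hedged sketch of roughly what the linked blog post describes follows; the exact package path of the committer class moved between 1.x releases, so treat the class name here as approximate.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        # write Parquet output straight to its final S3 location instead of
        # writing to a temporary directory and renaming it afterwards
        # (rename on S3 is a slow copy-and-delete, not an atomic move)
        .set("spark.sql.parquet.output.committer.class",
             "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
        # direct committers can't safely clean up after speculative or
        # re-run tasks, so speculation should stay off
        .set("spark.speculation", "false"))

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)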
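One more observation on the quoted config, offered as an assumption rather than a verified fix: hadoop-aws 2.6.0 is being pulled in on top of a spark-2.1.0-bin-hadoop2.7 build, and hadoop-aws 2.7.x was compiled against aws-java-sdk 1.7.4, so that pairing of artifact versions looks mismatched. If the jobs set their packages through PYSPARK_SUBMIT_ARGS (a guess based on the pyspark-shell suffix in the quoted line), matched versions would look something like:

import os

# versions chosen to match the Hadoop 2.7 build bundled with Spark 2.1.0;
# hadoop-aws 2.7.x expects the old aws-java-sdk 1.7.4, not 1.10.x
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages "
    "com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 "
    "pyspark-shell"
)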