Some things just didn't work the way I first expected.  For example,
writing a Spark collection to an Alluxio destination didn't
automatically persist the files to S3.

I remember having to use the Alluxio library directly to force the
files to persist to S3 after Spark finished writing to Alluxio.
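
The other route, instead of calling the Alluxio client yourself, is to switch
the client write type so writes go through to the under store as they happen.
A rough pyspark sketch of what I mean (untested; the property name is from
Alluxio 1.x, and the master host and output path are placeholders):

    # Rough sketch, not a recipe: assumes an Alluxio 1.x client jar on the Spark
    # classpath and an Alluxio master at alluxio-master:19998 (placeholder).
    from pyspark.sql import SparkSession

    write_through = "-Dalluxio.user.file.writetype.default=CACHE_THROUGH"

    spark = (SparkSession.builder
             .appName("alluxio-write-through-sketch")
             # CACHE_THROUGH writes synchronously to Alluxio *and* the under store
             # (S3), instead of the default MUST_CACHE, which only lands in Alluxio.
             .config("spark.driver.extraJavaOptions", write_through)
             .config("spark.executor.extraJavaOptions", write_through)
             .getOrCreate())

    df = spark.range(1000)
    # With CACHE_THROUGH this write should reach S3 without a separate persist call.
    df.write.parquet("alluxio://alluxio-master:19998/output/sample_table")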

On Fri, May 12, 2017 at 6:52 AM, Gene Pang <gene.p...@gmail.com> wrote:
> Hi,
>
> Yes, you can use Alluxio with Spark to read/write to S3. Here is a blog post
> on Spark + Alluxio + S3, and here is some documentation for configuring
> Alluxio + S3 and configuring Spark + Alluxio.
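>
> Roughly, the Spark side of that wiring looks like the sketch below (untested;
> the client jar, master address, bucket, and credentials are placeholders, so
> check the linked docs for the exact property names for your Alluxio version):
>
>     # Sketch: Spark reading/writing alluxio:// paths backed by an S3 under store.
>     # Assumes the Alluxio client jar is on the classpath, e.g.
>     #   spark-submit --jars /path/to/alluxio-client.jar ...
>     # and that alluxio-site.properties points the under store at S3, e.g.
>     #   alluxio.underfs.address=s3a://your-bucket/alluxio
>     #   aws.accessKeyId=<key> / aws.secretKey=<secret>
>     from pyspark.sql import SparkSession
>
>     spark = SparkSession.builder.appName("alluxio-s3-sketch").getOrCreate()
>
>     df = spark.read.parquet("alluxio://alluxio-master:19998/input/events")
>     df.write.parquet("alluxio://alluxio-master:19998/output/events_clean")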
>
> You mentioned that it required a lot of effort to get working. May I ask
> what you ran into, and how you got it to work?
>
> Thanks,
> Gene
>
> On Thu, May 11, 2017 at 11:55 AM, Miguel Morales <therevolti...@gmail.com>
> wrote:
>>
>> Might want to try gzip as opposed to parquet.  The only way I ever
>> reliably got parquet to work on S3 is by using Alluxio as a buffer,
>> but that's a decent amount of work.
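>>
>> For example (untested sketch; bucket and paths are placeholders), writing
>> gzip-compressed CSV instead of parquet from pyspark:
>>
>>     # Assumes an existing SparkSession `spark` (Spark 2.x).
>>     df = spark.read.parquet("s3a://your-bucket/input/")
>>     # Write gzip-compressed CSV rather than parquet output.
>>     df.write.csv("s3a://your-bucket/output/", compression="gzip")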
>>
>> On Thu, May 11, 2017 at 11:50 AM, lucas.g...@gmail.com
>> <lucas.g...@gmail.com> wrote:
>> > Also, and this is unrelated to the actual question... Why don't these
>> > messages show up in the archive?
>> >
>> > http://apache-spark-user-list.1001560.n3.nabble.com/
>> >
>> > Ideally I'd want to link to these questions from our internal wiki,
>> > but I can't find them in the archive.
>> >
>> > On 11 May 2017 at 07:16, lucas.g...@gmail.com <lucas.g...@gmail.com>
>> > wrote:
>> >>
>> >> Looks like this isn't viable in Spark 2.0.0 (and greater, I presume).
>> >> I'm pretty sure I came across this blog and ignored it for that reason.
>> >>
>> >> Any other thoughts?  The tickets linked from these look relevant too:
>> >> https://issues.apache.org/jira/browse/SPARK-10063
>> >> https://issues.apache.org/jira/browse/HADOOP-13786
>> >> https://issues.apache.org/jira/browse/HADOOP-9565
>> >>
>> >> On 10 May 2017 at 22:24, Miguel Morales <therevolti...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Try using the DirectParquetOutputCommitter:
>> >>> http://dev.sortable.com/spark-directparquetoutputcommitter/
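>> >>>
>> >>> On Spark 1.6.x that amounts to one conf setting (sketch below, untested;
>> >>> the class was removed in Spark 2.0 and its package moved between 1.x
>> >>> releases, so double-check it against your build):
>> >>>
>> >>>     from pyspark import SparkConf
>> >>>
>> >>>     # Sketch for Spark 1.6.x only; DirectParquetOutputCommitter writes output
>> >>>     # files directly, skipping the temp-directory rename that is slow on S3.
>> >>>     conf = (SparkConf()
>> >>>             .set("spark.sql.parquet.output.committer.class",
>> >>>                  "org.apache.spark.sql.execution.datasources.parquet."
>> >>>                  "DirectParquetOutputCommitter"))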
>> >>>
>> >>> On Wed, May 10, 2017 at 10:07 PM, lucas.g...@gmail.com
>> >>> <lucas.g...@gmail.com> wrote:
>> >>> > Hi users, we have a bunch of pyspark jobs that use S3 for loading /
>> >>> > intermediate steps and final output of parquet files.
>> >>> >
>> >>> > We're running into the following issues on a semi-regular basis:
>> >>> > * These are intermittent errors, i.e. we have about 300 jobs that run
>> >>> > nightly, and a fairly random but small-ish percentage of them fail with
>> >>> > the following classes of errors.
>> >>> >
>> >>> > S3 write errors:
>> >>> >
>> >>> >> "ERROR Utils: Aborting task
>> >>> >> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code:
>> >>> >> 404,
>> >>> >> AWS
>> >>> >> Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null,
>> >>> >> AWS
>> >>> >> Error
>> >>> >> Message: Not Found, S3 Extended Request ID: BlaBlahEtc="
>> >>> >
>> >>> >
>> >>> >>
>> >>> >> "Py4JJavaError: An error occurred while calling o43.parquet.
>> >>> >> : com.amazonaws.services.s3.model.MultiObjectDeleteException:
>> >>> >> Status
>> >>> >> Code:
>> >>> >> 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null,
>> >>> >> AWS
>> >>> >> Error
>> >>> >> Message: One or more objects could not be deleted, S3 Extended
>> >>> >> Request
>> >>> >> ID:
>> >>> >> null"
>> >>> >
>> >>> >
>> >>> >
>> >>> > S3 Read Errors:
>> >>> >
>> >>> >> [Stage 1:=================================================> (27 + 4) / 31]
>> >>> >> 17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in stage 1.0 (TID 11)
>> >>> >> java.net.SocketException: Connection reset
>> >>> >>   at java.net.SocketInputStream.read(SocketInputStream.java:196)
>> >>> >>   at java.net.SocketInputStream.read(SocketInputStream.java:122)
>> >>> >>   at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
>> >>> >>   at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
>> >>> >>   at sun.security.ssl.InputRecord.read(InputRecord.java:509)
>> >>> >>   at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
>> >>> >>   at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
>> >>> >>   at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
>> >>> >>   at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
>> >>> >>   at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
>> >>> >>   at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
>> >>> >>   at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
>> >>> >>   at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:168)
>> >>> >>   at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
>> >>> >>   at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
>> >>> >>   at java.io.FilterInputStream.close(FilterInputStream.java:181)
>> >>> >>   at java.io.FilterInputStream.close(FilterInputStream.java:181)
>> >>> >>   at java.io.FilterInputStream.close(FilterInputStream.java:181)
>> >>> >>   at java.io.FilterInputStream.close(FilterInputStream.java:181)
>> >>> >>   at com.amazonaws.services.s3.model.S3Object.close(S3Object.java:203)
>> >>> >>   at org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:187)
>> >>> >
>> >>> >
>> >>> > We have tons of logs we could add, but they would make this email
>> >>> > unwieldy.  If it would be helpful I'll drop them in a pastebin or
>> >>> > something.
>> >>> >
>> >>> > Our config is along the lines of:
>> >>> >
>> >>> > spark-2.1.0-bin-hadoop2.7
>> >>> > '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'
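>> >>> >
>> >>> > A minimal pyspark sketch of the kind of read/write we do (keys, bucket,
>> >>> > and paths below are placeholders, not our real values):
>> >>> >
>> >>> >     # Assumes the pyspark shell's existing SparkContext `sc` and SparkSession `spark`.
>> >>> >     hconf = sc._jsc.hadoopConfiguration()
>> >>> >     hconf.set("fs.s3a.access.key", "<access-key>")
>> >>> >     hconf.set("fs.s3a.secret.key", "<secret-key>")
>> >>> >     df = spark.read.parquet("s3a://our-bucket/intermediate/step_1/")
>> >>> >     df.write.parquet("s3a://our-bucket/output/final/")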
>> >>> >
>> >>> > Given the Stack Overflow / googling I've been doing, I know we're not
>> >>> > the only org with these issues, but I haven't found a good set of
>> >>> > solutions in those spaces yet.
>> >>> >
>> >>> > Thanks!
>> >>> >
>> >>> > Gary Lucas
>> >>
>> >>
>> >
>>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
