Hi!
You are probably missing the S3 filesystem module from your classpath.
If I remember correctly, you must include the
https://search.maven.org/search?q=a:beam-sdks-java-io-amazon-web-services
artifact in your classpath/fat jar.
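In Maven terms that would be roughly the following (a sketch; the
version is pinned to match the beam-2.15.0 you mention):

    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-sdks-java-io-amazon-web-services</artifactId>
      <version>2.15.0</version>
    </dependency>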
/Magnus
On 2019-09-19 23:13, Koprivica,Preston Blake wrote:
Hello everyone. I’m getting the following error when attempting to use
the FileIO APIs (beam-2.15.0) with a third-party filesystem, in this
case AWS S3:
java.lang.IllegalArgumentException: No filesystem found for scheme s3
    at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:456)
    at org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:526)
    at org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1149)
    at org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1105)
    at org.apache.beam.sdk.coders.Coder.decode(Coder.java:159)
    at org.apache.beam.sdk.transforms.join.UnionCoder.decode(UnionCoder.java:83)
    at org.apache.beam.sdk.transforms.join.UnionCoder.decode(UnionCoder.java:32)
    at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:543)
    at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:534)
    at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:480)
    at org.apache.beam.runners.flink.translation.types.CoderTypeSerializer.deserialize(CoderTypeSerializer.java:93)
    at org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:55)
    at org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:106)
    at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:72)
    at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47)
    at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73)
    at org.apache.flink.runtime.operators.FlatMapDriver.run(FlatMapDriver.java:107)
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:503)
    at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:368)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
    at java.lang.Thread.run(Thread.java:748)
For reference, the write code resembles this:
FileIO.Write<?, GenericRecord> write = FileIO.<GenericRecord>write()
    .via(ParquetIO.sink(schema))
    .to(options.getOutputDir()) // will be something like: s3://<bucket>/<path>
    .withSuffix(".parquet");

records.apply(String.format("Write(%s)", options.getOutputDir()), write);
I have set up the PipelineOptions with all the relevant AWS options, and
the issue does not appear to be related to ParquetIO.sink() directly: I
am able to reliably reproduce it using JSON-formatted records and
TextIO.sink() as well.
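(For completeness, the options setup looks roughly like the sketch
below; the region value and the argument parsing are illustrative
assumptions, not my exact code.)

import org.apache.beam.sdk.io.aws.options.AwsOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class S3WriteOptionsSketch {
  public static void main(String[] args) {
    // Parse the args into AwsOptions (provided by
    // beam-sdks-java-io-amazon-web-services) so the S3 filesystem
    // registrar can configure its client from these options.
    AwsOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(AwsOptions.class);
    options.setAwsRegion("us-east-1"); // placeholder region, for illustration
    // ... build and run the pipeline with these options ...
  }
}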
Just trying some different knobs, I went ahead and set the following
option:
write = write.withNoSpilling();
This actually seemed to fix the issue, only to have it reemerge as I
scaled up the data set size. The stack trace, while very similar, reads:
java.lang.IllegalArgumentException: No filesystem found for scheme s3
    at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:456)
    at org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:526)
    at org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1149)
    at org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1105)
    at org.apache.beam.sdk.coders.Coder.decode(Coder.java:159)
    at org.apache.beam.sdk.coders.KvCoder.decode(KvCoder.java:82)
    at org.apache.beam.sdk.coders.KvCoder.decode(KvCoder.java:36)
    at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:543)
    at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:534)
    at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:480)
    at org.apache.beam.runners.flink.translation.types.CoderTypeSerializer.deserialize(CoderTypeSerializer.java:93)
    at org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:55)
    at org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:106)
    at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:72)
    at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47)
    at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73)
    at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:94)
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:503)
    at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:368)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
    at java.lang.Thread.run(Thread.java:748)
I’d be interested to hear some theories on the differences and
similarities between the two stacks. Lastly, I tried adding the
following deprecated option (both with and without the withNoSpilling()
option):
write = write.withIgnoreWindowing();
This seemed to fix the issue altogether, but aside from having to rely
on a deprecated feature, there is the bigger question of why.
In reading through some of the source, it seems to be a common pattern
to manually register the pipeline options to seed the filesystem
registry during the setup phase of the operator lifecycle, e.g.:
https://github.com/apache/beam/blob/release-2.15.0/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/DoFnOperator.java#L304-L313
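For anyone following along, the registration in the linked code boils
down to roughly this (a paraphrased sketch, not the verbatim Beam
source):

import org.apache.beam.runners.core.construction.SerializablePipelineOptions;
import org.apache.beam.sdk.io.FileSystems;

class FileSystemSeedingSketch {
  // Invoked from the operator's setup()/open() lifecycle hook. The
  // FileSystems registry is a static, per-JVM structure, so each task
  // manager must re-seed it from the serialized pipeline options before
  // any coder that calls FileSystems.matchNewResource() (e.g.
  // FileResultCoder) decodes a record.
  static void seedFileSystems(SerializablePipelineOptions serializedOptions) {
    FileSystems.setDefaultPipelineOptions(serializedOptions.get());
  }
}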
Is it possible that I have hit upon a couple of scenarios where that
registration has not taken place? Unfortunately, I’m not yet in a
position to suggest a fix, but I’m guessing there is some missing
initialization code in one or more of the batch operators. If this is
indeed a legitimate issue, I’ll be happy to log it, but I’ll hold off
until the community gets a chance to look at it.
Thanks,
Preston