Hello,

I am a developer trying to use Apache Beam in my Java application, and I'm
running into an issue with reading multiple Parquet files from a directory
in S3. I'm able to run this line of code successfully, where tempPath =
"s3://<bucket-name>/*.parquet":

PCollection<GenericRecord> records =
    pipeline.apply("Read parquet file in as Generic Records",
        ParquetIO.read(schema).from(tempPath));
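
In case it helps to see the surrounding setup, here is roughly how that read
sits in my pipeline (a trimmed-down sketch: the class/main-method boilerplate
and the S3 credential options are omitted, and assume "schema" is the Avro
schema I'm trying to obtain in the first place):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

PipelineOptions options = PipelineOptionsFactory.create();
Pipeline pipeline = Pipeline.create(options);

// Glob that matches every Parquet file directly under the bucket.
String tempPath = "s3://<bucket-name>/*.parquet";

// This part works once I have the Avro schema in hand.
PCollection<GenericRecord> records =
    pipeline.apply("Read parquet file in as Generic Records",
        ParquetIO.read(schema).from(tempPath));

pipeline.run().waitUntilFinish();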

My problem is reading the schema beforehand. At runtime, I only have the
name of the S3 bucket, which has all the Parquet files I need underneath
it. However, I am unable to use that same tempPath to retrieve my schema.
Because the path does not point to a single Parquet file, the ParquetFileReader
class from Apache Parquet throws an error: No such file or directory:
s3a://<bucket-name>/*.parquet.

To read my schema, I'm using this chunk of code:

import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

// Configure the S3A filesystem (temporary credentials, proxy, and encryption).
Configuration configuration = new Configuration();
configuration.set("fs.s3a.access.key", "<access_key>");
configuration.set("fs.s3a.secret.key", "<secret_key>");
configuration.set("fs.s3a.session.token", "<session_token>");
configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
configuration.set("fs.s3a.server-side-encryption-algorithm", "<algorithm>");
configuration.set("fs.s3a.proxy.host", "<proxy_host>");
configuration.set("fs.s3a.proxy.port", "<proxy_port>");
configuration.set("fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider");

// Open the Parquet footer and convert its schema to an Avro schema.
Path hadoopFilePath = new Path("s3a://<bucket-name>/*.parquet");
ParquetFileReader reader =
    ParquetFileReader.open(HadoopInputFile.fromPath(hadoopFilePath, configuration));
MessageType messageType = reader.getFooter().getFileMetaData().getSchema();
AvroSchemaConverter converter = new AvroSchemaConverter();
Schema schema = converter.convert(messageType);

The ParquetFileReader.open(...) call is where the code fails. Is there a Hadoop
Configuration property I can set to force Hadoop to read recursively?
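
One workaround I've been sketching out (rough and untested; it assumes
FileSystem.globStatus expands the wildcard against S3A the way I expect, and
that every file in the bucket shares the same schema) is to resolve the glob
myself and read the footer of the first match, reusing the configuration
object from the snippet above:

import java.net.URI;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;

// Expand the wildcard into concrete file paths, then read the schema from the
// first match instead of pointing ParquetFileReader at the glob itself.
FileSystem fs = FileSystem.get(URI.create("s3a://<bucket-name>/"), configuration);
FileStatus[] matches = fs.globStatus(new Path("s3a://<bucket-name>/*.parquet"));
if (matches == null || matches.length == 0) {
    throw new IllegalStateException("No parquet files found in the bucket");
}

Schema schema;
try (ParquetFileReader reader =
         ParquetFileReader.open(HadoopInputFile.fromPath(matches[0].getPath(), configuration))) {
    MessageType messageType = reader.getFooter().getFileMetaData().getSchema();
    schema = new AvroSchemaConverter().convert(messageType);
}

I'd still prefer a supported configuration over hand-rolling the glob
expansion, if one exists.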

I realize this is kind of a Beam-adjacent problem, but I've been struggling
with this for a while, so any help would be appreciated!

Thanks and sincerely,
Ramya
