Hello,

I am a developer trying to use Apache Beam in my Java application, and I'm running into an issue with reading multiple Parquet files from a directory in S3. I'm able to successfully run this line of code, where tempPath = "s3://<bucket-name>/*.parquet":

    PCollection<GenericRecord> records = pipeline.apply(
        "Read parquet file in as Generic Records",
        ParquetIO.read(schema).from(tempPath));

My problem is reading the schema beforehand. At runtime, I only have the name of the S3 bucket, which has all the Parquet files I need underneath it. However, I am unable to use that same tempPath to retrieve my schema. Because the path does not point to a single Parquet file, the ParquetFileReader class (from Apache Parquet's parquet-hadoop module) throws an error:

    No such file or directory: s3a://<bucket-name>/*.parquet

To read my schema, I'm using this chunk of code:

    Configuration configuration = new Configuration();
    configuration.set("fs.s3a.access.key", "<access_key>");
    configuration.set("fs.s3a.secret.key", "<secret_key>");
    configuration.set("fs.s3a.session.token", "<session_token>");
    configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
    configuration.set("fs.s3a.server-side-encryption-algorithm", "<algorithm>");
    configuration.set("fs.s3a.proxy.host", "<proxy_host>");
    configuration.set("fs.s3a.proxy.port", "<proxy_port>");
    configuration.set("fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider");

    Path hadoopFilePath = new Path("s3a://<bucket-name>/*.parquet");
    ParquetFileReader r = ParquetFileReader.open(
        HadoopInputFile.fromPath(hadoopFilePath, configuration));
    MessageType messageType = r.getFooter().getFileMetaData().getSchema();
    AvroSchemaConverter converter = new AvroSchemaConverter();
    Schema schema = converter.convert(messageType);

The ParquetFileReader.open(...) call is where the code is failing. Is there maybe a Hadoop Configuration I can set to force Hadoop to read recursively?

I realize this is kind of a Beam-adjacent problem, but I've been struggling with this for a while, so any help would be appreciated!

Thanks and sincerely,
Ramya
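
One possible workaround, sketched here under assumptions and not tested against S3: ParquetFileReader needs a single concrete file, so the glob could be expanded first with Hadoop's FileSystem.globStatus and the schema read from the first match. The class name GlobSchemaReader and the helpers firstMatch and readAvroSchema are hypothetical names for illustration. The sketch assumes every file under the bucket shares one schema, and that the same Configuration (credentials, proxy, and so on) is passed in.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class GlobSchemaReader {

    // Hypothetical helper: expands the glob and returns the first regular file it matches.
    static Path firstMatch(Path globPattern, Configuration conf) throws IOException {
        FileSystem fs = globPattern.getFileSystem(conf);
        // globStatus may return null or an empty array when nothing matches.
        FileStatus[] matches = fs.globStatus(globPattern);
        if (matches != null) {
            for (FileStatus status : matches) {
                if (status.isFile()) {
                    return status.getPath();
                }
            }
        }
        throw new FileNotFoundException("No file matches " + globPattern);
    }

    // Hypothetical helper: reads the footer of one matching file and converts
    // its Parquet schema to Avro. Assumes all matching files share that schema.
    static Schema readAvroSchema(Path globPattern, Configuration conf) throws IOException {
        Path file = firstMatch(globPattern, conf);
        try (ParquetFileReader reader =
                ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
            MessageType messageType = reader.getFooter().getFileMetaData().getSchema();
            return new AvroSchemaConverter().convert(messageType);
        }
    }
}
```

With this in place, the schema lookup might become GlobSchemaReader.readAvroSchema(new Path("s3a://<bucket-name>/*.parquet"), configuration), while ParquetIO keeps using the original glob. Note that the pattern only matches one directory level; it does not recurse.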