Tao, my experience with ParquetIO has been good (versions 2.11, 2.18, and 2.21). We mainly use it for our Hadoop sink, converting Avro records to Parquet. We have checked for data loss and data quality and found no problems, and we have seen no performance issues.
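The custom sink in the snippet that follows drops partition fields from each record, because the partition value is already encoded in the HDFS path. As a rough, dependency-free illustration of that idea only: the sketch below uses a plain Map in place of an Avro GenericRecord, and assumes Hive-style "field=value" folder names. The object and function names are hypothetical and this is not the actual DarwinParquetIO code.

```scala
// Hypothetical sketch: a Map stands in for an Avro GenericRecord.
object PartitionSketch {
  type Record = Map[String, Any]

  // Build the partition folder from the record's partition fields,
  // e.g. "dt=2020-12-01/country=US" (Hive-style naming is an assumption).
  def partitionFolder(record: Record, partitionFields: Seq[String]): String =
    partitionFields.map(f => s"$f=${record(f)}").mkString("/")

  // Drop the partition fields from the record unless the caller asks to
  // keep them; the values remain recoverable from the folder path.
  def outputRecord(record: Record,
                   partitionFields: Seq[String],
                   includePartitionFields: Boolean): Record =
    if (includePartitionFields) record
    else record.filterNot { case (name, _) => partitionFields.contains(name) }
}
```

For a record with fields id and dt partitioned on dt, partitionFolder yields the folder name and outputRecord returns only the id field when includePartitionFields is false.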
Here is one code snippet from our pipeline. We have our own ParquetIO so that we can remove the partition fields from the record when the user requires it: a Hive/Spark partitioned table already includes those values in the HDFS path and uses them for scan filtering.

    def toHadoop(basePath: String,
                 recordPartition: RecordPartition,
                 fileNaming: FileNaming,
                 shardNum: Int,
                 includePartitionFields: Boolean = false): Unit = {
      val baseDir = HadoopClient.resolve(basePath, env)
      pCollection.apply("darwin.write.hadoop.parquet." + postfix,
        FileIO.writeDynamic[String, GenericRecord]()
          // Route each record to a partition folder (the String destination).
          .by(recordPartition.partitionFunc)
          .withDestinationCoder(StringUtf8Coder.of())
          // Custom sink: optionally strips partition fields from the output schema.
          .via(DarwinParquetIO.sink(
            recordPartition.getOutputSchema(avroSchema, includePartitionFields),
            recordPartition.getPartitionFields(),
            includePartitionFields))
          .to(baseDir)
          .withCompression(Compression.LZO)
          // Name files relative to baseDir/<partitionFolder>.
          .withNaming((partitionFolder: String) =>
            relativeFileNaming(
              StaticValueProvider.of[String](baseDir + Path.SEPARATOR + partitionFolder),
              fileNaming))
          .withNumShards(shardNum))
    }

On Tue, Dec 1, 2020 at 3:44 AM Alexey Romanenko <aromanenko....@gmail.com> wrote:

> ParquetIO has existed in Beam since the 2.5.0 release, so it can be
> considered quite stable and mature. I’m not aware of any open major
> issues, and you can check the performance here [1][2].
>
> On the other hand, you are right: it’s annotated with @Experimental, like
> many other Beam Java IOs and components, which confuses people. There is
> a long story behind this in Beam, and we have had several related
> discussions (the latest one [3]) on how to reduce the number of these
> "experimental"s.
>
> [1] http://metrics.beam.apache.org/d/bnlHKP3Wz/java-io-it-tests-dataflow?panelId=16&fullscreen&orgId=1
> [2] http://metrics.beam.apache.org/d/bnlHKP3Wz/java-io-it-tests-dataflow?panelId=17&fullscreen&orgId=1
> [3] https://lists.apache.org/thread.html/0f769736be1cf2fc5227f7a25dd3fdbb9296afe8a071761cb91f588a%40%3Cdev.beam.apache.org%3E
>
> On 30 Nov 2020, at 22:13, Tao Li <t...@zillow.com> wrote:
>
> Hi Beam community,
>
> According to this link the ParquetIO is still considered experimental:
> https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html
>
> Does it mean it’s not yet ready for prod usage? If that’s the case, when
> will it be ready?
>
> Also, is there any known performance/scalability/reliability issue with
> ParquetIO?
>
> Thanks a lot!

--
Yours Sincerely
Kobe Feng