Re: Quick question regarding production readiness of ParquetIO

Kobe Feng Tue, 01 Dec 2020 15:05:52 -0800

Tao, my experience with ParquetIO has been good (versions 2.11, 2.18, 2.21).
We mainly use it as a Hadoop sink, converting Avro records to Parquet.
We checked for data loss and data quality problems and found none, and we
have seen no performance issues.


Here is a code snippet. (We maintain our own ParquetIO so that we can remove
the partition fields from the record when users require it, because a
Hive/Spark partitioned table already encodes those values in the HDFS path
and uses them for scan filtering.)

def toHadoop(basePath: String,
             recordPartition: RecordPartition,
             fileNaming: FileNaming,
             shardNum: Int,
             includePartitionFields: Boolean = false): Unit = {
  val baseDir = HadoopClient.resolve(basePath, env)
  pCollection.apply("darwin.write.hadoop.parquet." + postfix,
    FileIO.writeDynamic[String, GenericRecord]()
      .by(recordPartition.partitionFunc)
      .withDestinationCoder(StringUtf8Coder.of())
      .via(DarwinParquetIO.sink(
        recordPartition.getOutputSchema(avroSchema, includePartitionFields),
        recordPartition.getPartitionFields(),
        includePartitionFields))
      .to(baseDir)
      .withCompression(Compression.LZO)
      .withNaming((partitionFolder: String) =>
        relativeFileNaming(
          StaticValueProvider.of[String](baseDir + Path.SEPARATOR + partitionFolder),
          fileNaming))
      .withNumShards(shardNum))
}
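
To illustrate the idea of dropping partition fields, here is a minimal
sketch in plain Scala. It is not the DarwinParquetIO code: plain maps stand
in for Avro records, and the names PartitionSketch, partitionFolder, and
dropPartitionFields are hypothetical, chosen only for this example. It
assumes a Hive-style "field=value" folder layout.

```scala
// Hypothetical sketch: plain Scala maps stand in for Avro GenericRecords.
object PartitionSketch {
  type Record = Map[String, Any]

  // Build a Hive-style partition folder, e.g. "dt=2020-12-01/country=US".
  // Assumes every partition field is present in the record.
  def partitionFolder(record: Record, partitionFields: Seq[String]): String =
    partitionFields.map(f => s"$f=${record(f)}").mkString("/")

  // Drop the partition fields from the record, since their values are
  // already encoded in the folder path.
  def dropPartitionFields(record: Record, partitionFields: Seq[String]): Record =
    record.filterNot { case (name, _) => partitionFields.contains(name) }
}
```

For a record with fields id, dt, and country, partitioned by dt and country,
partitionFolder gives the folder the file lands in and dropPartitionFields
gives the record body that would actually be written, containing only id.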


On Tue, Dec 1, 2020 at 3:44 AM Alexey Romanenko <[email protected]>
wrote:

> ParquetIO has been in Beam since the 2.5.0 release, so it can be considered
> quite stable and mature. I’m not aware of any open major issues, and you
> can check the performance here [1][2].
>
> On the other hand, you are right: it’s annotated with @Experimental, like
> many other Beam Java IOs and components, which confuses people. There
> is a long story behind this in Beam, and we have had several related
> discussions (the latest one [3]) on how to reduce the number of these
> "experimental"s.
>
> [1]
> http://metrics.beam.apache.org/d/bnlHKP3Wz/java-io-it-tests-dataflow?panelId=16&fullscreen&orgId=1
> [2]
> http://metrics.beam.apache.org/d/bnlHKP3Wz/java-io-it-tests-dataflow?panelId=17&fullscreen&orgId=1
> [3]
> https://lists.apache.org/thread.html/0f769736be1cf2fc5227f7a25dd3fdbb9296afe8a071761cb91f588a%40%3Cdev.beam.apache.org%3E
>
> On 30 Nov 2020, at 22:13, Tao Li <[email protected]> wrote:
>
> Hi Beam community,
>
> According to this link, ParquetIO is still considered experimental:
> https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html
>
> Does it mean it’s not yet ready for prod usage? If that’s the case, when
> will it be ready?
>
> Also, are there any known performance/scalability/reliability issues with
> ParquetIO?
>
> Thanks a lot!
>
>
>

-- 
Yours Sincerely
Kobe Feng
