Thanks David. Yes, I looked at passing it through the HadoopConfiguration, but it seems row-group size is not picked up from there — or at least ParquetWriter.Builder seems to set it directly from its own rowGroupSize property. I filed BEAM-11969 <https://issues.apache.org/jira/browse/BEAM-11969> for this, so if you can contribute your patch plumbing this through, that would be great. Otherwise, I can send a PR.
Regards -B

On Fri, Mar 12, 2021 at 8:25 AM David Hollands <david.holla...@bbc.co.uk> wrote:

> Hi Bashir,
>
> I think it is just a case of somebody bothering to plumb it in explicitly, e.g.
>
>     /** Specifies row group size. By default, DEFAULT_BLOCK_SIZE. */
>     public Sink withRowGroupSize(int rowGroupSize) {
>       return toBuilder().setRowGroupSize(rowGroupSize).build();
>     }
>
> and
>
>     this.writer =
>         AvroParquetWriter.<GenericRecord>builder(beamParquetOutputFile)
>             .withRowGroupSize(getRowGroupSize()) // Ze patch to set RowGroupSize
>             .withSchema(schema)
>             .withCompressionCodec(getCompressionCodec())
>             .withWriteMode(OVERWRITE)
>             .build();
>
> etc.
>
> *However, it might be worth exploring whether it can be set via the HadoopConfiguration "parquet.block.size" property, but I'm not sure that it actually can.*
>
> We patched in something explicitly last year but didn't contribute it upstream, as there was quite a bit of activity on ParquetIO (e.g. the conversion to SDF) at the time.
>
> The use case we had at the time was that some downstream consumers of the Parquet files (AWS S3 Select) couldn't handle row group sizes > 64MB uncompressed. I'm sure there are other use cases out there that need this fine-grained control.
>
> Cheers, David
>
> *David Hollands*
> BBC Broadcast Centre, London, W12
> Email: david.holla...@bbc.co.uk
>
> *From: *Bashir Sadjad <bas...@google.com>
> *Reply to: *"user@beam.apache.org" <user@beam.apache.org>
> *Date: *Friday, 12 March 2021 at 07:58
> *To: *"user@beam.apache.org" <user@beam.apache.org>
> *Subject: *Setting rowGroupSize in ParquetIO
>
> Hi all,
>
> I wonder how I can set the row group size for files generated by ParquetIO.Sink <https://beam.apache.org/releases/javadoc/2.20.0/org/apache/beam/sdk/io/parquet/ParquetIO.Sink.html>. It doesn't seem to provide an option for setting that, and IIUC from the code <https://github.com/apache/beam/blob/fffb85a35df6ae3bdb2934c077856f6b27559aa7/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L1117>, it uses the default value in ParquetWriter.Builder <https://github.com/apache/parquet-mr/blob/bdf935a43bd377c8052840a4328cf5b7603aa70a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L636>, which is 128MB. Is there any reason not to expose this parameter in ParquetIO?
>
> Thanks
>
> -B
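[Editor's note] The plumbing David describes can be illustrated with a small self-contained sketch. This is a toy model, not actual Beam or parquet-mr code: the class `SinkSketch` and its methods are hypothetical stand-ins for ParquetIO.Sink's AutoValue-style builder, showing how a `withRowGroupSize` knob would carry a value from the user through to the writer configuration. Only the 128MB default mirrors parquet-mr's `ParquetWriter.DEFAULT_BLOCK_SIZE`.

```java
// Toy sketch (hypothetical names) of the builder-pattern plumbing proposed in
// the thread: an immutable Sink-like value with a with* method that returns a
// copy carrying the overridden row group size down to where the writer is built.
public class SinkSketch {
    // Mirrors parquet-mr's ParquetWriter.DEFAULT_BLOCK_SIZE (128 MB).
    static final int DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024;

    private final int rowGroupSize;

    private SinkSketch(int rowGroupSize) {
        this.rowGroupSize = rowGroupSize;
    }

    // Factory with the library default, like ParquetIO.sink(...).
    static SinkSketch create() {
        return new SinkSketch(DEFAULT_BLOCK_SIZE);
    }

    // The proposed knob: returns a copy with the row group size overridden,
    // in the style of toBuilder().setRowGroupSize(...).build().
    SinkSketch withRowGroupSize(int rowGroupSize) {
        return new SinkSketch(rowGroupSize);
    }

    // Where the real code would call
    // AvroParquetWriter.builder(...).withRowGroupSize(getRowGroupSize()),
    // the sketch just exposes the carried value.
    int getRowGroupSize() {
        return rowGroupSize;
    }

    public static void main(String[] args) {
        // e.g. cap row groups at 64 MB uncompressed for S3 Select consumers
        SinkSketch sink = SinkSketch.create().withRowGroupSize(64 * 1024 * 1024);
        System.out.println(sink.getRowGroupSize());
    }
}
```

The immutable-copy style matters here because Beam transforms are shared across workers; each `with*` call yields a new configured instance rather than mutating shared state.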