Thanks David. Yes, I looked at passing it through the HadoopConfiguration, but it seems row-group size is not picked up from there — or at least ParquetWriter.Builder seems to set it directly from its own rowGroupSize property. I filed BEAM-11969 <https://issues.apache.org/jira/browse/BEAM-11969> for this, so if you can contribute your patch plumbing this through, that would be great. Otherwise, I can send a PR.
Regards -B

On Fri, Mar 12, 2021 at 8:25 AM David Hollands <david.holla...@bbc.co.uk> wrote:

> Hi Bashir,
>
> I think it is just a case of somebody bothering to plumb it in explicitly, e.g.
>
>     /** Specifies row group size. By default, DEFAULT_BLOCK_SIZE. */
>     public Sink withRowGroupSize(int rowGroupSize) {
>       return toBuilder().setRowGroupSize(rowGroupSize).build();
>     }
>
> and
>
>     this.writer =
>         AvroParquetWriter.<GenericRecord>builder(beamParquetOutputFile)
>             .withRowGroupSize(getRowGroupSize()) // Ze patch to set RowGroupSize
>             .withSchema(schema)
>             .withCompressionCodec(getCompressionCodec())
>             .withWriteMode(OVERWRITE)
>             .build();
>
> etc.
>
> *However, it might be worth exploring whether it can be set via the HadoopConfiguration "parquet.block.size" property, but I'm not sure that it actually can.*
>
> We patched in something explicitly last year but didn't contribute it upstream, as there was quite a bit of activity on ParquetIO (e.g. the conversion to SDF) at the time.
>
> The use case we had at the time was that some downstream consumers of the Parquet files (AWS S3 Select) couldn't handle row group sizes > 64MB uncompressed. I'm sure there are other use cases out there that need this fine-grained control.
>
> Cheers, David
>
> *David Hollands*
> BBC Broadcast Centre, London, W12
> Email: david.holla...@bbc.co.uk
>
> *From: *Bashir Sadjad <bas...@google.com>
> *Reply to: *"user@beam.apache.org" <user@beam.apache.org>
> *Date: *Friday, 12 March 2021 at 07:58
> *To: *"user@beam.apache.org" <user@beam.apache.org>
> *Subject: *Setting rowGroupSize in ParquetIO
>
> Hi all,
>
> I wonder how I can set the row group size for files generated by ParquetIO.Sink <https://beam.apache.org/releases/javadoc/2.20.0/org/apache/beam/sdk/io/parquet/ParquetIO.Sink.html>. It doesn't seem to provide an option for setting that, and IIUC from the code <https://github.com/apache/beam/blob/fffb85a35df6ae3bdb2934c077856f6b27559aa7/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L1117>, it uses the default value in ParquetWriter.Builder <https://github.com/apache/parquet-mr/blob/bdf935a43bd377c8052840a4328cf5b7603aa70a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L636>, which is 128MB. Is there any reason not to expose this parameter in ParquetIO?
>
> Thanks
>
> -B
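[Editor's note] The plumbing David describes can be illustrated with a small self-contained sketch. This is a toy model, not actual Beam or parquet-mr code: the class `SinkSketch` and its methods are hypothetical stand-ins for ParquetIO.Sink's AutoValue-style builder, showing how a `withRowGroupSize` knob would carry a value from the user through to the writer configuration. Only the 128MB default mirrors parquet-mr's `ParquetWriter.DEFAULT_BLOCK_SIZE`.

```java
// Toy sketch (hypothetical names) of the builder-pattern plumbing proposed in
// the thread: an immutable Sink-like value with a with* method that returns a
// copy carrying the overridden row group size down to where the writer is built.
public class SinkSketch {
    // Mirrors parquet-mr's ParquetWriter.DEFAULT_BLOCK_SIZE (128 MB).
    static final int DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024;

    private final int rowGroupSize;

    private SinkSketch(int rowGroupSize) {
        this.rowGroupSize = rowGroupSize;
    }

    // Factory with the library default, like ParquetIO.sink(...).
    static SinkSketch create() {
        return new SinkSketch(DEFAULT_BLOCK_SIZE);
    }

    // The proposed knob: returns a copy with the row group size overridden,
    // in the style of toBuilder().setRowGroupSize(...).build().
    SinkSketch withRowGroupSize(int rowGroupSize) {
        return new SinkSketch(rowGroupSize);
    }

    // Where the real code would call
    // AvroParquetWriter.builder(...).withRowGroupSize(getRowGroupSize()),
    // the sketch just exposes the carried value.
    int getRowGroupSize() {
        return rowGroupSize;
    }

    public static void main(String[] args) {
        // e.g. cap row groups at 64 MB uncompressed for S3 Select consumers
        SinkSketch sink = SinkSketch.create().withRowGroupSize(64 * 1024 * 1024);
        System.out.println(sink.getRowGroupSize());
    }
}
```

The immutable-copy style matters here because Beam transforms are shared across workers; each `with*` call yields a new configured instance rather than mutating shared state.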