Thank you for your contribution, Bashir!

Alexey

> On 17 Mar 2021, at 15:32, Bashir Sadjad <bas...@google.com> wrote:
> 
> To close the loop here: The fix is merged and a configuration parameter is 
> added to ParquetIO.Sink now for rowGroupSize (which is not a great name 
> <https://github.com/apache/beam/pull/14227#discussion_r595409910>, BTW).
> 
> Thanks
> 
> -B
> 
> On Fri, Mar 12, 2021 at 11:55 AM David Hollands <david.holla...@bbc.co.uk> wrote:
> Tbh mate, I reckon it would be quicker if you progress your PR.
> 
> Cheers,
> David
> 
> From: Bashir Sadjad <bas...@google.com>
> Sent: 12 March 2021 16:29
> To: user@beam.apache.org
> Subject: Re: Setting rowGroupSize in ParquetIO
>  
> Thanks David. Yes, I looked at passing it through the HadoopConfiguration but 
> it seems row-group size is not there or at least ParquetWriter.Builder seems 
> to set that directly from its rowGroupSize property. I filed BEAM-11969 
> <https://issues.apache.org/jira/browse/BEAM-11969> for this so if you can 
> contribute your patch for plumbing this, that would be great. Otherwise, I 
> can send a PR.
> 
> Regards
> 
> -B
> 
> On Fri, Mar 12, 2021 at 8:25 AM David Hollands <david.holla...@bbc.co.uk> wrote:
> Hi Bashir,
> 
>  
> 
> I think it is just a case of somebody bothering to plumb it in explicitly, 
> e.g.
> 
>  
> 
> /** Specifies row group size. By default, DEFAULT_BLOCK_SIZE. */
> public Sink withRowGroupSize(int rowGroupSize) {
>   return toBuilder().setRowGroupSize(rowGroupSize).build();
> }
> 
>  
> 
> and
> 
>  
> 
> this.writer =
>     AvroParquetWriter.<GenericRecord>builder(beamParquetOutputFile)
>         .withRowGroupSize(getRowGroupSize()) // Ze patch to set RowGroupSize
>         .withSchema(schema)
>         .withCompressionCodec(getCompressionCodec())
>         .withWriteMode(OVERWRITE)
>         .build();
> 
>  
> 
> Etc.
> 
>  
> 
> However, it might be worth exploring whether it can be set via the 
> HadoopConfiguration “parquet.block.size” property, but I’m not sure that it 
> actually can.
> 
>  
> 
> We patched in something explicitly last year but didn’t contribute upstream 
> as there was quite a bit of activity on the ParquetIO (e.g. conversion to 
> SDF) at the time.
> 
>  
> 
> The use case we had at the time was that some downstream consumers of the 
> Parquet files (AWS S3 Select) couldn’t handle row groups > 64MB uncompressed. 
> I’m sure there are other use cases out there that need this fine-grained 
> control.
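
To make the plumbing described in this message concrete, here is a minimal, self-contained sketch of an immutable "with" setter that falls back to a default when no row group size is set. It is not Beam's ParquetIO.Sink: the class RowGroupSinkConfig and the method effectiveRowGroupSize are hypothetical names invented for illustration, the default of 128 MiB is taken from the figure quoted in this thread, and treating 0 as "not set" is an assumption.

```java
/** Hypothetical stand-in for a sink configuration; NOT Beam's ParquetIO.Sink. */
final class RowGroupSinkConfig {
    /** Assumed default: 128 MiB, the figure quoted in this thread. */
    static final int DEFAULT_ROW_GROUP_SIZE = 128 * 1024 * 1024;

    /** Row group size in bytes; 0 means "not set, use the default" (an assumption). */
    private final int rowGroupSize;

    private RowGroupSinkConfig(int rowGroupSize) {
        this.rowGroupSize = rowGroupSize;
    }

    /** Creates a configuration with no explicit row group size. */
    static RowGroupSinkConfig create() {
        return new RowGroupSinkConfig(0);
    }

    /** Returns a copy with the given row group size; the receiver is unchanged. */
    RowGroupSinkConfig withRowGroupSize(int rowGroupSize) {
        if (rowGroupSize <= 0) {
            throw new IllegalArgumentException("rowGroupSize must be positive");
        }
        return new RowGroupSinkConfig(rowGroupSize);
    }

    /** The value a writer would be given: the explicit size if set, else the default. */
    int effectiveRowGroupSize() {
        return rowGroupSize > 0 ? rowGroupSize : DEFAULT_ROW_GROUP_SIZE;
    }
}
```

With this sketch, a consumer limited to 64MB row groups would call RowGroupSinkConfig.create().withRowGroupSize(64 * 1024 * 1024), and the writer side would read effectiveRowGroupSize() where the snippet earlier in this message calls getRowGroupSize().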
> 
>  
> 
> Cheers, David
> 
>  
> 
> David Hollands 
> 
> BBC Broadcast Centre, London, W12
> 
> Email: david.holla...@bbc.co.uk
>  
> 
>  
> 
> From: Bashir Sadjad <bas...@google.com>
> Reply to: user@beam.apache.org
> Date: Friday, 12 March 2021 at 07:58
> To: user@beam.apache.org
> Subject: Setting rowGroupSize in ParquetIO
> 
>  
> 
> Hi all,
> 
>  
> 
> I wonder how I can set the row group size for files generated by 
> ParquetIO.Sink 
> <https://beam.apache.org/releases/javadoc/2.20.0/org/apache/beam/sdk/io/parquet/ParquetIO.Sink.html>.
>  It doesn't seem to provide the option for setting that and IIUC from the 
> code 
> <https://github.com/apache/beam/blob/fffb85a35df6ae3bdb2934c077856f6b27559aa7/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L1117>,
>  it uses the default value in ParquetWriter.Builder 
> <https://github.com/apache/parquet-mr/blob/bdf935a43bd377c8052840a4328cf5b7603aa70a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L636>
>  which is 128MB. Is there any reason not to expose this parameter in 
> ParquetIO?
> 
>  
> 
> Thanks
> 
>  
> 
> -B
> 
>  
> 
