Hi Sri,

AFAIK, you have to create a PCollection of GenericRecords and define your
Avro schema manually to write your data into Parquet files.
In this case, you will need to create a ParDo for this translation. Also, this
assumes that your schema is the same for all CSV files.
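A rough sketch of what that ParDo plus ParquetIO.sink() could look like, assuming
a simple two-column CSV ("name,age") with a header row; the schema, field names,
and paths here are made up for the example:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class CsvToParquet {

  // Avro schema defined manually; it has to match the columns of all input CSV files.
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply(TextIO.read().from("/path/to/input/*.csv"))
        // The ParDo that translates each CSV line into a GenericRecord.
        .apply(ParDo.of(new DoFn<String, GenericRecord>() {
          @ProcessElement
          public void processElement(@Element String line,
              OutputReceiver<GenericRecord> out) {
            if (line.equals("name,age")) {
              return; // naive header skip; real CSV data needs a proper parser
            }
            String[] cols = line.split(",");
            GenericRecord record = new GenericData.Record(SCHEMA);
            record.put("name", cols[0]);
            record.put("age", Integer.parseInt(cols[1]));
            out.output(record);
          }
        }))
        // GenericRecord has no default coder, so set one explicitly.
        .setCoder(AvroCoder.of(GenericRecord.class, SCHEMA))
        .apply(FileIO.<GenericRecord>write()
            .via(ParquetIO.sink(SCHEMA))
            .to("/path/to/output/")
            .withSuffix(".parquet"));

    p.run().waitUntilFinish();
  }
}

Note that ParquetIO lives in a separate module, so you also need
"beam-sdks-java-io-parquet" on the classpath.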

A basic example of using the Parquet sink with the Java SDK can be found here [1]

[1] https://git.io/fhcfV


> On 14 Jan 2019, at 02:00, Sridevi Nookala <snook...@parallelwireless.com> 
> wrote:
> 
> hi,
> 
> I have a bunch of CSV data files that I need to store in Parquet format. I 
> did look at the basic documentation on ParquetIO, and ParquetIO.sink() can be 
> used to achieve this.
> However, there is a dependency on the Avro schema.
> How do I infer/generate an Avro schema from CSV document data?
> Does Beam have an API for this?
> I tried using the Kite SDK's CSVUtil / JsonUtil but had no luck generating an 
> Avro schema.
> My CSV data files have headers in them, and quite a few of the header fields 
> are hyphenated, which Kite's CSVUtil does not accept.
> 
> I think it would be a redundant effort to convert the CSV documents to JSON 
> documents.
> Any suggestions on how to infer an Avro schema from CSV data or a JSON schema 
> would be helpful.
> 
> thanks
> Sri