Hi Sri,

AFAIK, you have to create a PCollection of GenericRecords and define your Avro schema manually to write your data into Parquet files. In this case, you will need a ParDo for this translation. Note that this assumes your schema is the same for all CSV files.
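Something like the following sketch, for example (untested; the schema fields, paths, and the naive comma split are assumptions you would replace with your real header layout and a proper CSV parser):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.AvroCoder;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.parquet.ParquetIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class CsvToParquet {
      // Hand-written Avro schema; the field names here are placeholders,
      // adjust them to match your CSV header. Declared static so each
      // worker JVM parses it once (Avro Schema is not Serializable).
      private static final Schema SCHEMA = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
              + "{\"name\":\"id\",\"type\":\"string\"},"
              + "{\"name\":\"value\",\"type\":\"string\"}]}");

      public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        p.apply(TextIO.read().from("/path/to/input/*.csv"))
            // ParDo translating each CSV line into a GenericRecord.
            // A real pipeline should skip the header row and use a CSV
            // parser that handles quoting instead of a plain split.
            .apply(ParDo.of(new DoFn<String, GenericRecord>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                String[] cols = c.element().split(",");
                GenericRecord record = new GenericData.Record(SCHEMA);
                record.put("id", cols[0]);
                record.put("value", cols[1]);
                c.output(record);
              }
            }))
            .setCoder(AvroCoder.of(GenericRecord.class, SCHEMA))
            // Write the records as Parquet files via ParquetIO.sink().
            .apply(FileIO.<GenericRecord>write()
                .via(ParquetIO.sink(SCHEMA))
                .to("/path/to/output/"));
        p.run().waitUntilFinish();
      }
    }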
A basic example of using the Parquet sink with the Java SDK can be found here [1].

[1] https://git.io/fhcfV

> On 14 Jan 2019, at 02:00, Sridevi Nookala <snook...@parallelwireless.com> wrote:
>
> hi,
>
> I have a bunch of CSV data files that I need to store in Parquet format. I
> did look at the basic documentation on ParquetIO, and ParquetIO.sink() can
> be used to achieve this.
> However, there is a dependency on the Avro schema.
> How do I infer/generate an Avro schema from the CSV document data?
> Does Beam have any API for this?
> I tried using the Kite SDK's CSVUtil / JsonUtil but had no luck generating
> an Avro schema. My CSV data files have headers, and quite a few of the
> header fields are hyphenated, which Kite's CSVUtil does not accept.
>
> I think it would be a redundant effort to convert the CSV documents to JSON
> documents. Any suggestions on how to infer an Avro schema from CSV data or
> from a JSON schema would be helpful.
>
> thanks
> Sri