Re: ParquetIO write of CSV document data

2019-01-25 Thread Sridevi Nookala
…e them as one big Parquet file on a daily basis; the source provided 15-minute Parquet chunks. Any suggestions here will be helpful. Thanks, Sri
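
A minimal Beam sketch of the daily compaction described here (reading the 15-minute Parquet chunks and rewriting them as one daily file) might look like the following; the schema file, the paths, and the single-shard choice are assumptions for illustration, not details from the thread.

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DailyParquetCompaction {

  public static void main(String[] args) throws IOException {
    // Hypothetical schema file and paths; the schema must match the 15-minute chunk files.
    Schema schema = new Schema.Parser().parse(new File("/schemas/row.avsc"));

    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadChunks", ParquetIO.read(schema).from("/data/chunks/2019-01-25/*.parquet"))
        .apply("WriteDaily", FileIO.<GenericRecord>write()
            .via(ParquetIO.sink(schema))
            .to("/data/daily/2019-01-25/")
            .withNumShards(1)            // one output file per day; trades parallelism for a single file
            .withSuffix(".parquet"));

    p.run().waitUntilFinish();
  }
}

Forcing a single shard pushes every record through one writer, so for large daily volumes a small fixed number of shards is usually a better trade-off than exactly one file.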

Re: ParquetIO write of CSV document data

2019-01-25 Thread Alexey Romanenko
> …not there yet to solve Beam Jiras, but it will help immensely if Avro schema inference is avoided, something like Python pandas/pyarrow does. Thanks for your help, Sri

Re: ParquetIO write of CSV document data

2019-01-23 Thread Sridevi Nookala
…schema inference is avoided, something like Python pandas/pyarrow does. Thanks for your help, Sri > Hi Alex, Thanks for the sugg…
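
The pandas/pyarrow-style convenience being asked for can be approximated outside the pipeline by deriving a schema from the CSV header. The sketch below types every column as an optional string, so it sidesteps real type inference (which pandas/pyarrow do by sampling values); the record name and namespace are assumptions.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class CsvHeaderSchema {

  // Build an Avro record schema whose fields mirror the CSV header columns.
  // Every field is an optional string; inferring numeric or date types from
  // sample rows is out of scope for this sketch.
  public static Schema fromHeader(String csvPath) throws IOException {
    try (BufferedReader reader = Files.newBufferedReader(Paths.get(csvPath))) {
      String header = reader.readLine();
      SchemaBuilder.FieldAssembler<Schema> fields =
          SchemaBuilder.record("CsvRow").namespace("com.example").fields();
      for (String column : header.split(",")) {
        // Avro field names must be valid identifiers; sanitize the header
        // values first if they contain spaces or other illegal characters.
        fields = fields.optionalString(column.trim());
      }
      return fields.endRecord();
    }
  }
}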

Re: ParquetIO write of CSV document data

2019-01-23 Thread Sridevi Nookala
> Hi Sri, it's exactly as Alexey says, although there are plans/ideas to improve ParquetIO in a way that would not require defining the schema manually. Some Jiras that might be interest…

Re: ParquetIO write of CSV document data

2019-01-15 Thread Łukasz Gajowy
Hi Sri, it's exactly as Alexey says, although there are plans/ideas to improve ParquetIO in a way that would not require defining the schema manually. Some Jiras that might be interesting in this topic but not yet resolved (maybe you are willing to contribute?): https://issues.apache.org/jira/bro…

Re: ParquetIO write of CSV document data

2019-01-14 Thread Alexey Romanenko
Hi Sri, AFAIK, you have to create a "PCollection" of "GenericRecord"s and define your Avro schema manually to write your data into Parquet files. In this case, you will need to create a ParDo for this translation. Also, I expect that your schema is the same for all CSV files. Basic example of us…
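
Alexey's basic example is cut off by the archive, so it is not reproduced here. A minimal sketch of the approach he outlines (read the CSV lines, convert each line to a GenericRecord in a ParDo, then write with FileIO and ParquetIO.sink) could look like this; the three-column schema, the naive comma split, and the file paths are all hypothetical.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class CsvToParquet {

  // Hypothetical three-column schema; replace with the real CSV layout.
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"CsvRow\",\"fields\":["
          + "{\"name\":\"timestamp\",\"type\":\"string\"},"
          + "{\"name\":\"metric_name\",\"type\":\"string\"},"
          + "{\"name\":\"metric_value\",\"type\":\"string\"}]}";

  public static void main(String[] args) {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadCsv", TextIO.read().from("/data/input/*.csv"))
        .apply("CsvToGenericRecord", ParDo.of(new DoFn<String, GenericRecord>() {
          // Re-parse the schema lazily inside the DoFn so nothing
          // non-serializable is captured from the enclosing scope.
          private transient Schema rowSchema;

          @ProcessElement
          public void processElement(ProcessContext c) {
            if (rowSchema == null) {
              rowSchema = new Schema.Parser().parse(SCHEMA_JSON);
            }
            String[] cols = c.element().split(",", -1); // naive split: no quoting or escaping
            GenericRecord record = new GenericData.Record(rowSchema);
            record.put("timestamp", cols[0]);
            record.put("metric_name", cols[1]);
            record.put("metric_value", cols[2]);
            c.output(record);
          }
        }))
        .setCoder(AvroCoder.of(GenericRecord.class, schema))
        .apply("WriteParquet", FileIO.<GenericRecord>write()
            .via(ParquetIO.sink(schema))
            .to("/data/output/")
            .withSuffix(".parquet"));

    p.run().waitUntilFinish();
  }
}

A real pipeline would use a proper CSV parser instead of String.split; note also that newer Beam releases moved AvroCoder into the separate Avro extension module.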

ParquetIO write of CSV document data

2019-01-13 Thread Sridevi Nookala
Hi, I have a bunch of CSV data files that I need to store in Parquet format. I did look at basic documentation on ParquetIO, and ParquetIO.sink() can be used to achieve the same. However, there is a dependency on the Avro schema. How do I infer/generate an Avro schema from CSV document data? Doe…
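
Since ParquetIO.sink() takes an Avro schema, the most direct route at the time of this thread was to write the schema by hand. A minimal sketch using Avro's SchemaBuilder, assuming a hypothetical three-column CSV, might look like this; the record name, namespace, field names, and types are all illustrative.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class ManualCsvSchema {

  // Hand-written schema for a hypothetical CSV with three columns;
  // adapt the field names and types to the actual CSV layout.
  public static final Schema CSV_SCHEMA =
      SchemaBuilder.record("CsvRow")
          .namespace("com.example")
          .fields()
          .optionalString("timestamp")
          .optionalString("metric_name")
          .optionalDouble("metric_value")
          .endRecord();
}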