Re: retention policy for spark structured streaming dataset

2018-03-14 Thread Sunil Parmar
Can you use partitioning (by day)? That will make it easier to drop data older than x days outside the streaming job.

Sunil Parmar

On Wed, Mar 14, 2018 at 11:36 AM, Lian Jiang wrote:
> I have a spark structured streaming job which dumps data into a parquet
> file. To avoid the parque
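
A minimal Scala sketch of the day-partitioned sink plus an out-of-band retention sweep; the Kafka broker, topic, output path /data/events, and the seven-day cutoff are all assumptions for illustration:

    import java.time.LocalDate
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("partitioned-sink").getOrCreate()

    // Streaming query: derive a day column from the Kafka record timestamp
    // and partition the parquet output by it.
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // assumed broker
      .option("subscribe", "events")                    // assumed topic
      .load()
      .withColumn("day", to_date(col("timestamp")))
      .writeStream
      .format("parquet")
      .option("path", "/data/events")                   // assumed output path
      .option("checkpointLocation", "/data/events/_chk")
      .partitionBy("day")
      .start()

    // Separate maintenance job, run outside the streaming query: delete
    // day=... partition directories older than the cutoff.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val cutoff = LocalDate.now().minusDays(7)           // x = 7 days here
    fs.listStatus(new Path("/data/events"))
      .filter(_.getPath.getName.startsWith("day="))
      .filter(s => LocalDate.parse(s.getPath.getName.stripPrefix("day=")).isBefore(cutoff))
      .foreach(s => fs.delete(s.getPath, true))

One caveat with this sketch: the file sink's _spark_metadata log still references the deleted files, so Spark readers that consult the log may complain; readers that scan the directory directly (e.g. Impala or Hive) are unaffected.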

Re: [Beginner] How to save Kafka Dstream data to parquet ?

2018-03-05 Thread Sunil Parmar
We use Impala to access the parquet files in those directories. Any pointers on achieving at-least-once semantics with Spark Streaming, or on handling partial files?

Sunil Parmar

On Fri, Mar 2, 2018 at 2:57 PM, Tathagata Das wrote:
> Structured Streaming's file sink solves these problems by writing
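
For context, a minimal Scala sketch of the built-in file sink Tathagata is referring to; the broker, topic, and paths are assumptions. The sink records each completed file in a _spark_metadata log tied to the checkpoint, so Spark readers that honor the log never observe partial output:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("file-sink").getOrCreate()

    // Kafka -> parquet through the file sink. A file is added to the sink's
    // _spark_metadata log only after it is fully written, and the checkpoint
    // ties committed offsets to committed files.
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // assumed broker
      .option("subscribe", "events")                    // assumed topic
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .writeStream
      .format("parquet")
      .option("path", "/warehouse/events")              // assumed path
      .option("checkpointLocation", "/warehouse/events/_chk")
      .start()

Note the guarantee applies to readers that consult _spark_metadata; Impala scans the directory directly, which is exactly the gap the question above is about.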

Re: [Beginner] How to save Kafka Dstream data to parquet ?

2018-03-02 Thread Sunil Parmar
We're trying to deal with partial files by writing .tmp files and renaming them as the last step. We only commit offsets after the rename is successful. This way we get at-least-once semantics and avoid the partial file write issue. Thoughts?

Sunil Parmar

On Wed, Feb 28, 2018 at 1:59 PM, Tathagata Das wrote:
> The
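
A minimal Scala sketch of this write-then-rename pattern with the spark-streaming-kafka-0-10 direct stream; the staging and table paths are assumptions, and `stream` is assumed to come from KafkaUtils.createDirectStream:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.kafka.clients.consumer.ConsumerRecord
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    def writeThenRename(spark: SparkSession,
                        stream: DStream[ConsumerRecord[String, String]]): Unit = {
      import spark.implicits._
      stream.foreachRDD { rdd =>
        if (!rdd.isEmpty) {
          val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          val batchId = System.currentTimeMillis
          val staging = new Path(s"/data/staging/batch-$batchId") // assumed path
          val fin     = new Path(s"/data/table/batch-$batchId")   // assumed path

          // 1. Write the whole batch somewhere the table readers never scan.
          rdd.map(r => (r.key, r.value)).toDF("key", "value")
             .write.parquet(staging.toString)

          // 2. Move it into the table directory; an HDFS rename within one
          //    filesystem is atomic, so readers see all of the batch or none.
          val fs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)
          fs.rename(staging, fin)

          // 3. Commit offsets only after the rename. A crash before this point
          //    replays the batch on restart: at-least-once, never partial.
          stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
        }
      }
    }

A replayed batch lands under a new batch id, so duplicates are possible; deduplication downstream (or a deterministic batch id plus an existence check) would tighten this toward exactly-once.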