Can you use partitioning (by day)? That would make it easier to drop
data older than x days outside the streaming job.
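
A minimal sketch of what that could look like (the Kafka source, paths,
column names, and 30-day retention are all assumptions, not from the
thread):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.to_date
    import org.apache.hadoop.fs.{FileSystem, Path}
    import java.time.LocalDate

    val spark = SparkSession.builder.appName("events-sink").getOrCreate()
    import spark.implicits._

    // 1) Streaming write, partitioned by day:
    //    /data/events/day=2018-03-14/part-*.parquet
    val events = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS payload",
                  "timestamp AS event_time")

    events
      .withColumn("day", to_date($"event_time"))
      .writeStream
      .format("parquet")
      .option("path", "/data/events")
      .option("checkpointLocation", "/checkpoints/events")
      .partitionBy("day")
      .start()

    // 2) Housekeeping run outside the streaming job (e.g. a daily cron):
    //    delete day=... partition directories past the retention window.
    //    (If Impala has a matching partitioned table, also run
    //    ALTER TABLE ... DROP PARTITION / REFRESH on it afterwards.)
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val cutoff = LocalDate.now().minusDays(30)
    fs.listStatus(new Path("/data/events"))
      .filter(_.getPath.getName.startsWith("day="))
      .filter(s => LocalDate.parse(s.getPath.getName.stripPrefix("day="))
                     .isBefore(cutoff))
      .foreach(s => fs.delete(s.getPath, true))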
Sunil Parmar
On Wed, Mar 14, 2018 at 11:36 AM, Lian Jiang wrote:
> I have a Spark structured streaming job which dumps data into a parquet
> file. To avoid the parquet …
We use Impala to access the parquet files in those directories. Any
pointers on achieving at-least-once semantics with Spark streaming, or
on dealing with partial files?
Sunil Parmar
On Fri, Mar 2, 2018 at 2:57 PM, Tathagata Das
wrote:
> Structured Streaming's file sink solves these problems by writing
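
For context, the built-in file sink is used as below; it tracks each
completed file in a _spark_metadata log under the output path, which
Spark readers honor but external readers such as Impala do not (the
rate source and paths here are only to keep the example self-contained):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("file-sink").getOrCreate()

    // Any streaming source works; the rate source keeps this runnable.
    val df = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()

    // The parquet file sink records each completed file in the
    // _spark_metadata log under /data/out before making it visible to
    // Spark readers; external readers scan the raw directory instead.
    df.writeStream
      .format("parquet")
      .option("path", "/data/out")
      .option("checkpointLocation", "/checkpoints/out")
      .start()
      .awaitTermination()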
We are trying to deal with partial files by writing .tmp files and
renaming them as the last step; we only commit the offset after the
rename succeeds (see the sketch below). This way we get at-least-once
semantics and avoid the partial-file-write issue.
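
A minimal sketch of that write-then-rename protocol; writeBatchAsParquet
and commitOffsets are hypothetical stand-ins for whatever the real
pipeline uses, which the thread does not show:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Commit protocol: write .tmp, rename (atomic on HDFS), then commit
    // offsets. A crash before the offset commit replays the batch, which
    // yields at-least-once output with no visible partial files.
    def commitFile(fs: FileSystem, dir: String, name: String,
                   writeBatchAsParquet: Path => Unit,   // hypothetical
                   commitOffsets: () => Unit): Unit = { // hypothetical
      val tmp = new Path(s"$dir/$name.tmp")
      val dst = new Path(s"$dir/$name")
      writeBatchAsParquet(tmp)          // 1) write the batch to a .tmp file
      if (!fs.rename(tmp, dst))         // 2) atomic rename on HDFS
        throw new RuntimeException(s"rename failed: $tmp -> $dst")
      commitOffsets()                   // 3) offsets only after the rename
    }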
Thoughts?
Sunil Parmar
On Wed, Feb 28, 2018 at 1:59 PM, Tathagata Das
wrote:
> The