Deleting the latest .compact file would lose the ability to achieve exactly-once
and cause Spark to fail to read from the output directory. If you're reading
the output directory from a non-Spark application, then the metadata on the output
directory doesn't matter to you, but there's also no exactly-once guarantee
(exactly-once is achieved by leveraging the metadata on the output directory).
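For reference, here is a shell-style sketch of the internal settings that govern the file sink metadata log (the _spark_metadata directory and its .compact files). These names and defaults come from Spark's SQLConf source rather than the public configuration docs, so treat them as assumptions that may change between versions:

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: internal file-sink metadata log knobs, names as found in SQLConf.
val spark = SparkSession.builder()
  .appName("file-sink-metadata-sketch")
  // how many delta log files are merged into each .compact file (default: 10)
  .config("spark.sql.streaming.fileSink.log.compactInterval", "10")
  // whether expired log files are deleted at all (default: true)
  .config("spark.sql.streaming.fileSink.log.deletion", "true")
  // how long expired log files are kept around before deletion (default: 10 minutes)
  .config("spark.sql.streaming.fileSink.log.cleanupDelay", "10m")
  // minimum number of batches whose metadata/state must be retained (default: 100)
  .config("spark.sql.streaming.minBatchesToRetain", "100")
  .getOrCreate()
```

Note that compaction carries earlier entries forward into each new .compact file, which is why the latest .compact file itself must not be deleted.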
SEE: http://spark.apache.org/docs/2.3.1/streaming-programming-guide.html#checkpointing
"Note that checkpointing of RDDs incurs the cost of saving to reliable storage.
This may cause an increase in the processing time of those batches where RDDs
get checkpointed."
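That note refers to the DStream API, where the checkpoint interval is tunable per stream. A shell-style sketch (the app name, socket source, and paths are placeholders, not from the original post) of raising the interval to amortize that cost, following the guide's suggestion of 5-10x the batch interval:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("checkpoint-interval-sketch")
val ssc  = new StreamingContext(conf, Seconds(10))        // 10-second batch interval
ssc.checkpoint("hdfs:///tmp/checkpoints/sketch")          // reliable storage for checkpoint data

val counts = ssc.socketTextStream("localhost", 9999)      // placeholder source
  .flatMap(_.split(" "))
  .map((_, 1L))
  .updateStateByKey[Long]((values, state) => Some(state.getOrElse(0L) + values.sum))

// Stateful DStreams are checkpointed periodically by default; setting the interval
// to 5-10x the batch interval (per the guide) amortizes the cost of saving RDDs
// to reliable storage.
counts.checkpoint(Seconds(50))
counts.print()

ssc.start()
ssc.awaitTermination()
```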
As far as I know, the official
Are Spark Structured Streaming checkpoint files expected to grow over time
indefinitely? Is there a recommended way to safely age-off old checkpoint data?
Currently we have a Spark Structured Streaming process reading from Kafka and
writing to an HDFS sink, with checkpointing enabled and writing