Hi,

I have a very simple Spark Structured Streaming (SSS) pipeline which does:

    val query = df
      .dropDuplicates(Array("Id", "receivedAt"))
      .withColumn(timePartitionCol, timestamp_udfnc(col("receivedAt")))
      .writeStream
      .format("parquet")
      .partitionBy("availabilityDomain", timePartitionCol)
      .trigger(Trigger.ProcessingTime(5, TimeUnit.MINUTES))
      .option("path", "/data")
      .option("checkpointLocation", "/data_checkpoint")
      .start()

After ingesting 2T records, the state under the checkpoint folder on HDFS (replication factor 2) has grown to 2T bytes. My cluster has only 2T bytes of storage, which means it can barely handle any further data growth.

The online Spark documentation (https://docs.databricks.com/spark/latest/structured-streaming/production.html) says that using RocksDB helps an SSS job reduce JVM memory overhead, but I cannot find any documentation on how to set up RocksDB for SSS. The Spark class CheckpointReader seems to only handle HDFS.

Any suggestions? Thanks!
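
For reference, this is the kind of configuration switch I was hoping to find. A minimal sketch, assuming Spark 3.2 or later, where a built-in RocksDBStateStoreProvider exists and can be selected via spark.sql.streaming.stateStore.providerClass (on earlier versions this provider does not ship with open-source Spark, so a third-party implementation would be needed):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("sss-rocksdb-sketch")
      // Swap out the default HDFSBackedStateStoreProvider, which keeps
      // working state in executor JVM memory and snapshots it to the
      // checkpoint location, for the RocksDB-backed provider (Spark 3.2+).
      .config(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
      .getOrCreate()

From what I understand, this would only change where executors keep their working state; the checkpointed state on HDFS would still grow with the number of distinct (Id, receivedAt) pairs, since dropDuplicates without a watermark retains every key indefinitely.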
val query = df .dropDuplicates(Array("Id", "receivedAt")) .withColumn(timePartitionCol, timestamp_udfnc(col("receivedAt"))) .writeStream .format("parquet") .partitionBy("availabilityDomain", timePartitionCol) .trigger(Trigger.ProcessingTime(5, TimeUnit.MINUTES)) .option("path", "/data") .option("checkpointLocation", "/data_checkpoint") .start() After ingesting 2T records, the state under checkpoint folder on HDFS (replicator factor 2) grows to 2T bytes. My cluster has only 2T bytes which means the cluster can barely handle further data growth. Online spark documents (https://docs.databricks.com/spark/latest/structured-streaming/production.html) says using rocksdb help SSS job reduce JVM memory overhead. But I cannot find any document how to setup rocksdb for SSS. Spark class CheckpointReader seems to only handle HDFS. Any suggestions? Thanks!