Hi,

I have a very simple SSS pipeline which does:

import java.util.concurrent.TimeUnit

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

// timePartitionCol and timestamp_udfnc are defined elsewhere in my job.
val query = df
  .dropDuplicates(Array("Id", "receivedAt"))
  .withColumn(timePartitionCol, timestamp_udfnc(col("receivedAt")))
  .writeStream
  .format("parquet")
  .partitionBy("availabilityDomain", timePartitionCol)
  .trigger(Trigger.ProcessingTime(5, TimeUnit.MINUTES))
  .option("path", "/data")
  .option("checkpointLocation", "/data_checkpoint")
  .start()

After ingesting 2T records, the state under the checkpoint folder on HDFS
(replication factor 2) has grown to 2 TB. My cluster has only 2 TB of
storage, which means it can barely handle further data growth.
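
I suspect the state grows without bound because dropDuplicates has no
watermark, so Spark keeps every key it has ever seen. A minimal sketch of
bounding it, assuming receivedAt is an event-time timestamp and that
duplicates arriving more than 24 hours apart can safely be ignored (the
"24 hours" threshold is just a placeholder):

val query = df
  .withWatermark("receivedAt", "24 hours")   // lets Spark evict dedup state older than the watermark
  .dropDuplicates(Array("Id", "receivedAt")) // receivedAt is a dedup key, so old state can be dropped
  .withColumn(timePartitionCol, timestamp_udfnc(col("receivedAt")))
  .writeStream
  // ... rest of the sink options unchanged ...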

The online Spark documentation
(https://docs.databricks.com/spark/latest/structured-streaming/production.html)
says that using RocksDB helps an SSS job reduce JVM memory overhead, but I
cannot find any documentation on how to set up RocksDB for SSS. The Spark
class CheckpointReader seems to only handle HDFS.
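
For what it's worth, the only relevant knob I can find is
spark.sql.streaming.stateStore.providerClass. A minimal sketch, assuming
the RocksDBStateStoreProvider that ships with open-source Spark 3.2+ (on
earlier releases a RocksDB provider appears to be Databricks-only):

// Must be set before the streaming query is started.
spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")

// Note: this moves per-executor working state off the JVM heap into
// RocksDB; checkpoints are still written to the checkpointLocation on HDFS.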

Any suggestions? Thanks!
