A performant approach would be to partition your dataset into reasonably small chunks and use a Bloom filter to check whether an entity might be in your set before you make a lookup.
Check the Bloom filter: if the entity might be in the set, rely on partition pruning to read and backfill only the relevant partition. If the entity isn't in the set, just append it as new data. Sooner or later you will probably want to compact the appended partitions to reduce the number of small files. Delta Lake gives you update and compaction semantics, unless you want to do it manually.

Since 2.4.0, Spark is also able to prune buckets, but as far as I know there's no way to backfill a single bucket. If there were, the combination of partition and bucket pruning could dramatically limit the amount of data you'd need to read from and write to disk.

RDD vs DataFrame: I'm not sure exactly how, or whether, Tungsten is used when you work with RDDs. Because of that I always try to stick to DataFrames and the built-in functions for as long as possible, just to get the off-heap allocation and the "expressions to bytecode" code generation along with the Catalyst optimizations. That will probably do more for your performance than anything else. The memory overhead of JVM objects and the GC runs can be brutal for your performance and memory usage, depending on your dataset and use case.

I've pasted a few rough sketches of all of this below the sign-off.

br,
molotch
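
A rough, untested sketch of the Bloom filter check, using Spark's built-in df.stat.bloomFilter. All paths, column names and values below are made up:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("bloom-check").getOrCreate()

// Bloom filter over the entity ids of one partition; partition pruning means
// only day=2019-05-01 is actually read. 1M expected items, 1% false positive rate.
val partitionDf = spark.read.parquet("/data/events").where(col("day") === "2019-05-01")
val bloom = partitionDf.stat.bloomFilter("entity_id", 1000000L, 0.01)

if (bloom.mightContain("entity-42")) {
  // Might already exist: read just that partition, merge, and backfill it.
} else {
  // Definitely new: just append it.
}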
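
A sketch of the Delta Lake side, assuming the delta package is on the classpath: replaceWhere overwrites a single partition, and repartition plus dataChange=false is the manual compaction pattern from the Delta documentation. Paths and partition values are again invented:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("delta-backfill").getOrCreate()

// Hypothetical staging location holding the merged/backfilled rows for the partition.
val merged = spark.read.parquet("/tmp/staging/day=2019-05-01")

// Backfill: overwrite only that partition of the Delta table.
merged.write
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", "day = '2019-05-01'")
  .save("/delta/events")

// Compaction: rewrite the partition into fewer files without changing the data.
spark.read.format("delta").load("/delta/events")
  .where("day = '2019-05-01'")
  .repartition(4)
  .write
  .format("delta")
  .mode("overwrite")
  .option("dataChange", "false")
  .option("replaceWhere", "day = '2019-05-01'")
  .save("/delta/events")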
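
For the bucket pruning, a sketch of what a bucketed table could look like; note that bucketBy only works together with saveAsTable. Table name, column names and bucket count are made up:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("bucketing").getOrCreate()
val events = spark.read.parquet("/data/events")

// Partitioned and bucketed table; bucketBy requires saveAsTable.
events.write
  .partitionBy("day")
  .bucketBy(64, "entity_id")
  .sortBy("entity_id")
  .format("parquet")
  .saveAsTable("events_bucketed")

// An equality filter on the bucket column lets Spark 2.4+ read only the matching
// bucket files, on top of the partition pruning on day.
spark.table("events_bucketed")
  .where(col("day") === "2019-05-01" && col("entity_id") === "entity-42")
  .show()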
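
And what I mean by leaning on the built-in functions: the first version stays inside Catalyst/Tungsten (code generation, off-heap rows), while the RDD version deserializes every row into plain JVM objects, which is where the GC pain comes from. Column names are made up:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper, to_date}

val spark = SparkSession.builder().appName("builtins-vs-rdd").getOrCreate()
val events = spark.read.parquet("/data/events")

// Built-in functions: Catalyst can optimize and codegen this, rows stay in Tungsten's format.
val cleaned = events
  .withColumn("entity_id", upper(col("entity_id")))
  .withColumn("day", to_date(col("ts")))

// The same kind of transformation on the RDD: opaque lambda, full JVM objects, GC pressure.
val cleanedRdd = events.rdd.map { row =>
  row.getString(row.fieldIndex("entity_id")).toUpperCase
}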