Re: pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK

2016-09-10 Thread Josh Rosen
Based on Ben's helpful error description, I managed to reproduce this bug and found the root cause: there's a bug in MemoryStore's PartiallySerializedBlock class: it doesn't close the serialization stream before attempting to deserialize its serialized values, causing it to miss any data still sitting in the serializer's internal buffers.
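
For anyone curious about the mechanics, here is a minimal sketch of the failure mode in plain Python (standing in for the Scala internals, so none of the names below are Spark APIs): reading the underlying bytes before the buffered serialization stream is flushed or closed silently drops whatever is still in the stream's buffer.

    import io
    import pickle

    raw = io.BytesIO()                                 # underlying byte store
    stream = io.BufferedWriter(raw, buffer_size=8192)  # buffered layer the pickler writes through

    # Serialize some records through the buffered stream.
    pickle.dump(["some", "cached", "records"], stream)

    # Reading the bytes *before* flushing/closing the stream misses the
    # records still sitting in the stream's internal buffer.
    print(len(raw.getvalue()))            # 0 -- nothing has reached `raw` yet

    stream.flush()                        # flushing/closing pushes the buffered bytes down
    print(pickle.loads(raw.getvalue()))   # ['some', 'cached', 'records']

Closing the serialization stream before handing its bytes to the deserializer avoids the problem, which matches the root cause described above.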

Re: pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK

2016-09-09 Thread Josh Rosen
cache() / persist() is definitely *not* supposed to affect the result of a program, so the behavior that you're seeing is unexpected. I'll try to reproduce this myself by caching in PySpark under heavy memory pressure, but in the meantime the following questions will help me to debug: - Does t…
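
Separately, here is the rough shape of the reproduction I'll try, in case anyone wants to check on their end (a sketch only; the RDD contents, sizes, and app name are made up):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="persist-consistency-check")

    def build_rdd():
        # Wide-ish rows to put pressure on the memory store; sizes are arbitrary.
        return sc.parallelize(range(1000000), 100).map(lambda i: (i, "x" * 100))

    expected = build_rdd().count()   # ground truth: no caching involved

    for level in (StorageLevel.MEMORY_ONLY, StorageLevel.MEMORY_AND_DISK):
        rdd = build_rdd().persist(level)
        first = rdd.count()     # materializes and (partially) caches the blocks
        second = rdd.count()    # re-reads whatever was cached
        print(level, expected, first, second)   # all three counts should agree
        rdd.unpersist()

If the MEMORY_ONLY run under memory pressure ever reports a smaller count than the uncached run, that would confirm the behavior described in this thread.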