The concern is really just the runtime overhead and memory footprint of Java-serializing an already-serialized byte array again. We originally noticed this when using RDD.toLocalIterator(), which serializes the entire 64MB partition. We worked around the issue by kryo-serializing and snappy-compressing the partition on the executor side before returning it to the driver, but that extra pass felt redundant.
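For reference, the workaround looks roughly like this. It is a minimal sketch assuming the kryo and snappy-java libraries; packPartition/unpackPartition are just illustrative helper names, not Spark API:

    import java.io.ByteArrayOutputStream

    import com.esotericsoftware.kryo.Kryo
    import com.esotericsoftware.kryo.io.{Input, Output}
    import org.xerial.snappy.Snappy

    object PartitionPacking {
      // Executor side: kryo-serialize the partition's elements into one
      // buffer, then snappy-compress it before it is shipped to the driver.
      def packPartition[T](iter: Iterator[T]): Array[Byte] = {
        val kryo = new Kryo()
        kryo.setRegistrationRequired(false)  // sketch only; register classes in real use
        val buffer = new ByteArrayOutputStream()
        val output = new Output(buffer)
        iter.foreach(elem => kryo.writeClassAndObject(output, elem))
        output.close()
        Snappy.compress(buffer.toByteArray)
      }

      // Driver side: uncompress, then kryo-deserialize the elements back out.
      def unpackPartition[T](packed: Array[Byte]): Iterator[T] = {
        val kryo = new Kryo()
        kryo.setRegistrationRequired(false)
        val bytes = Snappy.uncompress(packed)
        val input = new Input(bytes)
        new Iterator[T] {
          def hasNext: Boolean = input.position() < bytes.length
          def next(): T = kryo.readClassAndObject(input).asInstanceOf[T]
        }
      }
    }

With helpers like these, the pattern is roughly rdd.mapPartitions(it => Iterator.single(PartitionPacking.packPartition(it))).toLocalIterator.flatMap(PartitionPacking.unpackPartition[MyRecord]), where MyRecord is whatever the RDD holds.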
Your explanation about reporting the time taken makes it clearer why it's designed this way. Since the byte array for the serialized task result shouldn't account for the majority of the memory footprint anyway, I'm okay with leaving it as is, then.

Thanks,
Mingyu

On 3/4/15, 5:07 PM, "Patrick Wendell" <pwend...@gmail.com> wrote:

>Hey Mingyu,
>
>I think it's broken out separately so we can record the time taken to
>serialize the result. Once we serialize it once, the second
>serialization should be really simple, since it's just wrapping
>something that has already been turned into a byte buffer. Do you see
>a specific issue with serializing it twice?
>
>I think you need to have two steps if you want to record the time
>taken to serialize the result, since that needs to be sent back to the
>driver when the task completes.
>
>- Patrick
>
>On Wed, Mar 4, 2015 at 4:01 PM, Mingyu Kim <m...@palantir.com> wrote:
>> Hi all,
>>
>> It looks like the result of a task is serialized twice: once by the
>> serializer (i.e. Java/Kryo, depending on configuration) and once again
>> by the closure serializer (i.e. Java). To link the actual code,
>>
>> The first one:
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L213
>> The second one:
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L226
>>
>> This serializes the "value", i.e. the result of the task run, twice,
>> which affects things like collect(), takeSample(), and
>> toLocalIterator(). Would it make sense to simply serialize the
>> DirectTaskResult once using the regular "serializer" (as opposed to
>> the closure serializer)? Would it cause problems when the Accumulator
>> values are not Kryo-serializable?
>>
>> Alternatively, if we can assume that Accumulator values are small, we
>> can closure-serialize those, put the serialized byte array in the
>> DirectTaskResult along with the raw task result "value", and serialize
>> the DirectTaskResult.
>>
>> What do people think?
>>
>> Thanks,
>> Mingyu
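For readers following the thread, the two passes under discussion look roughly like this. This is a simplified sketch, not the actual Executor.scala code: TaskResultEnvelope is a stand-in for Spark's DirectTaskResult, and plain Java serialization stands in for both the configurable result serializer and the closure serializer.

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}

    object DoubleSerializationSketch {
      // Stand-in for DirectTaskResult: the already-serialized value plus the
      // accumulator updates and the recorded serialization time.
      case class TaskResultEnvelope(valueBytes: Array[Byte],
                                    accumUpdates: Map[Long, Any],
                                    serializationTimeMs: Long)

      def javaSerialize(obj: AnyRef): Array[Byte] = {
        val buffer = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(buffer)
        out.writeObject(obj)
        out.close()
        buffer.toByteArray
      }

      def packageResult(value: AnyRef, accumUpdates: Map[Long, Any]): Array[Byte] = {
        // Pass 1: the task-result serializer (Java or Kryo in real Spark); this
        // is the step whose duration gets recorded and reported with the result.
        val start = System.currentTimeMillis()
        val valueBytes = javaSerialize(value)
        val elapsed = System.currentTimeMillis() - start
        // Pass 2: the closure serializer (always Java) over the envelope. For
        // valueBytes this is nearly free, since it is already a byte array,
        // but the accumulator values must be Java-serializable at this point.
        javaSerialize(TaskResultEnvelope(valueBytes, accumUpdates, elapsed))
      }
    }

Pass 2 costs almost nothing for valueBytes itself; the envelope serialization only succeeds if the accumulator values are serializable by the closure serializer, which is the compatibility question raised above.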