We ended up implementing a custom Hadoop InputFormat and RecordReader by
extending FileInputFormat / RecordReader, and using sc.newAPIHadoopFile to
read the file as an RDD.
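For reference, a minimal sketch of that approach. The format details here
(FixedRecordInputFormat, a fixed recordSize of 128 bytes, one BytesWritable
per record) are hypothetical stand-ins for our actual serialization format:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

// Hypothetical reader for a fixed-width-record binary format.
class FixedRecordInputFormat extends FileInputFormat[LongWritable, BytesWritable] {
  // Simplest correct behavior: never split a file mid-record.
  override protected def isSplitable(context: JobContext, file: Path): Boolean = false

  override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
      : RecordReader[LongWritable, BytesWritable] = new FixedRecordReader
}

class FixedRecordReader extends RecordReader[LongWritable, BytesWritable] {
  private val recordSize = 128 // assumption: fixed-width records
  private var in: java.io.DataInputStream = _
  private var pos = 0L
  private var end = 0L
  private val key = new LongWritable
  private val value = new BytesWritable(new Array[Byte](recordSize))

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
    val fileSplit = split.asInstanceOf[FileSplit]
    val fs = fileSplit.getPath.getFileSystem(context.getConfiguration)
    in = fs.open(fileSplit.getPath)
    // isSplitable is false, so the split covers the whole file.
    end = fileSplit.getLength
  }

  override def nextKeyValue(): Boolean = {
    if (pos >= end) return false
    key.set(pos) // key = byte offset of the record
    // Assumes the file length is a multiple of recordSize.
    in.readFully(value.getBytes, 0, recordSize)
    pos += recordSize
    true
  }

  override def getCurrentKey: LongWritable = key
  override def getCurrentValue: BytesWritable = value
  override def getProgress: Float = if (end == 0) 1f else pos.toFloat / end
  override def close(): Unit = if (in != null) in.close()
}

With that in place, reading it back is a one-liner:

val rdd = sc.newAPIHadoopFile(
  "/data/records.bin", // hypothetical path
  classOf[FixedRecordInputFormat],
  classOf[LongWritable],
  classOf[BytesWritable])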
On Wed, Mar 9, 2016 at 9:15 AM Ruslan Dautkhanov wrote:
> We have a huge binary file in a custom serialization format (e.g. heade
We have a somewhat complex pipeline that writes multiple output files to
HDFS, and we'd like the materialization of those outputs to happen
concurrently.
Internal to Spark, any "save" call creates a new "job", which runs
synchronously -- that is, the line of code after your save() executes only
once that job has completed.
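So if you want the output jobs to overlap, one option is to issue the
blocking save() calls from separate driver threads and let the scheduler
run them concurrently. A minimal sketch, assuming two hypothetical RDDs
rddA and rddB and hypothetical output paths:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each save() blocks only the thread that called it, so wrapping the
// calls in Futures lets the two jobs be scheduled at the same time.
val saves = Seq(
  Future { rddA.saveAsTextFile("hdfs:///out/a") }, // hypothetical path
  Future { rddB.saveAsTextFile("hdfs:///out/b") }) // hypothetical path

// Block the driver until both output jobs have finished.
Await.result(Future.sequence(saves), Duration.Inf)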
Done, thanks.
https://issues.apache.org/jira/browse/SPARK-13631
Will continue discussion there.
On Wed, Mar 2, 2016 at 4:09 PM Shixiong(Ryan) Zhu wrote:
> I think it's a bug. Could you open a ticket here:
> https://issues.apache.org/jira/browse/SPARK
>
> On Wed, Mar 2, 2016
We are seeing something that looks a lot like a regression from Spark 1.2.
When we run jobs with multiple threads, we get a crash somewhere inside
getPreferredLocations, much like the one fixed in SPARK-4454. Except now it's
inside org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs
instead.
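For anyone trying to reproduce: reduced to a hypothetical minimum (this is
not our actual failing job), the shape of the workload is several driver
threads running actions over the shuffle output of a shared RDD:

// Hypothetical reduction: multiple driver threads each run an action
// on the same shuffle output, which is where we see the crash.
val shuffled = sc.parallelize(1 to 1000000)
  .map(i => (i % 100, i.toLong))
  .reduceByKey(_ + _)

val threads = (1 to 4).map { _ =>
  new Thread(new Runnable {
    def run(): Unit = shuffled.count()
  })
}
threads.foreach(_.start())
threads.foreach(_.join())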