We ended up implementing a custom Hadoop InputFormat and RecordReader by
extending FileInputFormat / RecordReader, and using sc.newAPIHadoopFile to
read the file as an RDD.
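For reference, a minimal sketch of that approach. The format details here
(FixedRecordInputFormat, a fixed recordSize of 128 bytes, one BytesWritable
per record) are hypothetical stand-ins for our actual serialization format:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

// Hypothetical reader for a fixed-width-record binary format.
class FixedRecordInputFormat extends FileInputFormat[LongWritable, BytesWritable] {
  // Simplest correct behavior: never split a file mid-record.
  override protected def isSplitable(context: JobContext, file: Path): Boolean = false

  override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
      : RecordReader[LongWritable, BytesWritable] = new FixedRecordReader
}

class FixedRecordReader extends RecordReader[LongWritable, BytesWritable] {
  private val recordSize = 128 // assumption: fixed-width records
  private var in: java.io.DataInputStream = _
  private var pos = 0L
  private var end = 0L
  private val key = new LongWritable
  private val value = new BytesWritable(new Array[Byte](recordSize))

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
    val fileSplit = split.asInstanceOf[FileSplit]
    val fs = fileSplit.getPath.getFileSystem(context.getConfiguration)
    in = fs.open(fileSplit.getPath)
    // isSplitable is false, so the split covers the whole file.
    end = fileSplit.getLength
  }

  override def nextKeyValue(): Boolean = {
    if (pos >= end) return false
    key.set(pos) // key = byte offset of the record
    // Assumes the file length is a multiple of recordSize.
    in.readFully(value.getBytes, 0, recordSize)
    pos += recordSize
    true
  }

  override def getCurrentKey: LongWritable = key
  override def getCurrentValue: BytesWritable = value
  override def getProgress: Float = if (end == 0) 1f else pos.toFloat / end
  override def close(): Unit = if (in != null) in.close()
}

With that in place, reading it back is a one-liner:

val rdd = sc.newAPIHadoopFile(
  "/data/records.bin", // hypothetical path
  classOf[FixedRecordInputFormat],
  classOf[LongWritable],
  classOf[BytesWritable])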
On Wed, Mar 9, 2016 at 9:15 AM Ruslan Dautkhanov wrote:
> We have a huge binary file in a custom serialization format (e.g. heade
We have a somewhat complex pipeline that writes multiple output files to
HDFS, and we'd like the materialization of those outputs to happen
concurrently.
Internal to Spark, any "save" call creates a new "job", which runs
synchronously -- that is, the line of code after your save() executes only
once that job has completed.
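So if you want the output jobs to overlap, one option is to issue the
blocking save() calls from separate driver threads and let the scheduler
run them concurrently. A minimal sketch, assuming two hypothetical RDDs
rddA and rddB and hypothetical output paths:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each save() blocks only the thread that called it, so wrapping the
// calls in Futures lets the two jobs be scheduled at the same time.
val saves = Seq(
  Future { rddA.saveAsTextFile("hdfs:///out/a") }, // hypothetical path
  Future { rddB.saveAsTextFile("hdfs:///out/b") }) // hypothetical path

// Block the driver until both output jobs have finished.
Await.result(Future.sequence(saves), Duration.Inf)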
Done, thanks.
https://issues.apache.org/jira/browse/SPARK-13631
Will continue discussion there.
On Wed, Mar 2, 2016 at 4:09 PM Shixiong(Ryan) Zhu wrote:
> I think it's a bug. Could you open a ticket here:
> https://issues.apache.org/jira/browse/SPARK
>
> On Wed, Mar 2, 2016
We are seeing something that looks a lot like a regression from Spark 1.2.
When we run jobs with multiple threads, we get a crash somewhere inside
getPreferredLocations, much like the one fixed in SPARK-4454. Except now it's
inside org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs
instead.
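For anyone trying to reproduce: reduced to a hypothetical minimum (this is
not our actual failing job), the shape of the workload is several driver
threads running actions over the shuffle output of a shared RDD:

// Hypothetical reduction: multiple driver threads each run an action
// on the same shuffle output, which is where we see the crash.
val shuffled = sc.parallelize(1 to 1000000)
  .map(i => (i % 100, i.toLong))
  .reduceByKey(_ + _)

val threads = (1 to 4).map { _ =>
  new Thread(new Runnable {
    def run(): Unit = shuffled.count()
  })
}
threads.foreach(_.start())
threads.foreach(_.join())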