Where do you load all the IDs of your dataset? In your custom
InputFormat#getSplits?  getSplits is invoked on the driver side to build
the Partitions, which are serialized to the executors as part of each task.

Do you put all the IDs in the InputSplit? That would make it pretty large.
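For reference, such a split might look roughly like the sketch below (the
class and field names are hypothetical, not your actual code). Every ID in
the array is written out when the task is serialized, so the split's wire
size grows linearly with the number of IDs it carries:

import java.io.{DataInput, DataOutput}
import org.apache.hadoop.io.Writable
import org.apache.hadoop.mapreduce.InputSplit

// Hypothetical split holding a subset of entity IDs.
class IdListSplit(var ids: Array[Long] = Array.empty)
    extends InputSplit with Writable {

  // Length here is just the ID count; no locality information.
  override def getLength: Long = ids.length
  override def getLocations: Array[String] = Array.empty

  // The whole ID array is serialized along with the split/task.
  override def write(out: DataOutput): Unit = {
    out.writeInt(ids.length)
    ids.foreach(out.writeLong)
  }

  override def readFields(in: DataInput): Unit = {
    val n = in.readInt()
    ids = Array.fill(n)(in.readLong())
  }
}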

In your case, I think you can load the IDs directly rather than creating a
custom Hadoop InputFormat, e.g.

sc.textFile(id_file, 100).map(load data using the id)

Please make sure to use a high partition number (I use 100 here) in
sc.textFile to get high parallelism.
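Roughly, that looks like the following (e.g. in spark-shell; loadEntity and
MyEntity are placeholders for however you fetch one bean from your data
source):

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder for your entity bean and the per-ID load call.
case class MyEntity(id: Long /* ...other fields... */)
def loadEntity(id: Long): MyEntity = MyEntity(id) // replace with the real lookup

val sc = new SparkContext(new SparkConf().setAppName("load-by-id"))

// id_file has one entity ID per line; 100 is the minimum number of
// partitions, so the loads are spread across many parallel tasks.
val entities = sc.textFile("id_file", 100)
  .map(_.trim.toLong)
  .map(loadEntity)

You can then convert the resulting RDD to a DataFrame the same way you do
today.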

On Fri, Nov 27, 2015 at 2:06 PM, Anfernee Xu <anfernee...@gmail.com> wrote:

> Hi Spark experts,
>
> First of all, happy Thanksgiving!
>
> Now to my question: I have implemented a custom Hadoop InputFormat to
> load millions of entities from my data source into Spark (as a JavaRDD,
> then transformed into a DataFrame). The approach I took in implementing
> the custom Hadoop RDD is to load all IDs of my data entities (each entity
> has a unique ID: Long) and split the ID list (containing, say, 3 million
> Long numbers) into configured splits, each holding a subset of IDs; in
> turn my custom RecordReader loads the full entity (a plain Java Bean)
> from my data source for each ID in its split.
>
> My first observation is that some Spark tasks timed out, and it looks
> like a Spark broadcast variable is being used to distribute my splits;
> is that correct? If so, from a performance perspective, what enhancements
> can I make to improve it?
>
> Thanks
>
> --
> --Anfernee
>



-- 
Best Regards

Jeff Zhang
