Re: FlatMapPair run for longer time

Dean Wampler Tue, 07 Apr 2015 10:15:14 -0700

It's hard for us to diagnose your performance problems, because we don't
have your environment and fixing one will simply reveal the next one to be
fixed. So, I suggest you use the following strategy to figure out what
takes the most time and hence what you might try to optimize. Try replacing
expressions that might be expensive with "stubs". Your calls to HBase for
example. What happens if your replace the call with fake, hard-coded data?
Does performance improve dramatically? If so, then optimize how you query
HBase. If it makes no significant difference, try something else.

Also try looking at the Spark source code to understand what happens "under
the hood." At this point, your best bet is to develop your intuition about
the performance overhead of various constructs in real-world scenarios and
how Spark distributes computation. Then you'll find it easier to know what
to optimize. You'll want to understand what happens in flatMap, filter,
join, groupBy, reduce, etc.

Don't forget this guide, too:
https://spark.apache.org/docs/latest/tuning.html. The Learning Spark book
from O'Reilly is also really helpful.

I also recommend that you switch to Java 8 and Lambdas, or go all the way
to Scala, so all that noisy code shrinks down to simpler expressions.
You'll be surprised how helpful that is for comprehending your code and
reasoning about it.

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Tue, Apr 7, 2015 at 12:54 PM, Jeetendra Gangele <gangele...@gmail.com>
wrote:

> Hi All I am running the below code and its running for very long time
> where input to flatMapTopair is record of 50K. and I am calling Hbase for
> 50K times just a range scan query to should not take time. can anybody
> guide me what is wrong here?
>
> JavaPairRDD<VendorRecord, Iterable<VendorRecord>> pairvendorData
> =matchRdd.flatMapToPair( new PairFlatMapFunction<VendorRecord,
> VendorRecord, VendorRecord>(){
>
> @Override
> public Iterable<Tuple2<VendorRecord,VendorRecord>> call(
> VendorRecord t) throws Exception {
> List<Tuple2<VendorRecord, VendorRecord>> pairs = new
> LinkedList<Tuple2<VendorRecord, VendorRecord>>();
> MatcherKeys matchkeys=CompanyMatcherHelper.getBlockinkeys(t);
> List<VendorRecord> Matchedrecords
> =ckdao.getMatchingRecordsWithscan(matchkeys);
> for(int i=0;i<Matchedrecords.size();i++){
> pairs.add( new Tuple2<VendorRecord,VendorRecord>(t,Matchedrecords.get(i)));
> }
>  return pairs;
> }
>  }
> ).groupByKey(200).persist(StorageLevel.DISK_ONLY_2());
>

Re: FlatMapPair run for longer time

Reply via email to