It's hard for us to diagnose your performance problems, because we don't have your environment, and fixing one problem will simply reveal the next one. So I suggest the following strategy to figure out what takes the most time, and hence what you might try to optimize. Try replacing expressions that might be expensive with "stubs", for example your calls to HBase. What happens if you replace the call with fake, hard-coded data? Does performance improve dramatically? If so, then optimize how you query HBase. If it makes no significant difference, try something else.
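
Here is a rough sketch of the stubbing idea in plain Java, with no Spark or HBase on the classpath. Everything in it is made up for illustration: StubTiming, slowLookup, stubLookup, and pairUp are hypothetical names, the records are plain strings, and the "slow" lookup just sleeps for 2 ms to imitate a remote call. In your job, the real call would be something like ckdao.getMatchingRecordsWithscan, and you would swap in the stub there.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class StubTiming {

    // Hypothetical stand-in for the real lookup. It only sleeps to imitate
    // per-call latency; it is NOT an HBase client.
    static List<String> slowLookup(String key) {
        try {
            Thread.sleep(2); // assumption: pretend each remote scan costs about 2 ms
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return Arrays.asList(key + "-m1", key + "-m2");
    }

    // The stub: same result shape, fake hard-coded data, no I/O at all.
    static List<String> stubLookup(String key) {
        return Arrays.asList("fake-1", "fake-2");
    }

    // Pairs every record with each of its matches; the lookup is passed in
    // so the real call and the stub can be swapped without touching the loop.
    static List<String[]> pairUp(List<String> records,
                                 Function<String, List<String>> lookup) {
        List<String[]> pairs = new ArrayList<>();
        for (String r : records) {
            for (String m : lookup.apply(r)) {
                pairs.add(new String[] {r, m});
            }
        }
        return pairs;
    }

    // Wall-clock time in milliseconds for one full pass over the records.
    static long timeMillis(List<String> records,
                           Function<String, List<String>> lookup) {
        long start = System.nanoTime();
        pairUp(records, lookup);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 200; i++) {
            records.add("rec" + i);
        }
        System.out.println("simulated lookup: "
                + timeMillis(records, StubTiming::slowLookup) + " ms");
        System.out.println("stubbed lookup:   "
                + timeMillis(records, StubTiming::stubLookup) + " ms");
    }
}
```

If the stubbed run is dramatically faster, the lookup is where the time goes. If the two numbers are close, the cost is somewhere else and you should stub out the next suspect.
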

Also try looking at the Spark source code to understand what happens "under the hood." At this point, your best bet is to develop your intuition about the performance overhead of various constructs in real-world scenarios and how Spark distributes computation. Then you'll find it easier to know what to optimize. You'll want to understand what happens in flatMap, filter, join, groupBy, reduce, etc. Don't forget this guide, too: https://spark.apache.org/docs/latest/tuning.html. The Learning Spark book from O'Reilly is also really helpful.

I also recommend that you switch to Java 8 and lambdas, or go all the way to Scala, so all that noisy code shrinks down to simpler expressions. You'll be surprised how helpful that is for comprehending your code and reasoning about it.

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Tue, Apr 7, 2015 at 12:54 PM, Jeetendra Gangele <gangele...@gmail.com> wrote:

> Hi All,
>
> I am running the code below, and it is running for a very long time. The
> input to flatMapToPair is 50K records, and I am calling HBase 50K times.
> It is just a range scan query, so it should not take much time. Can
> anybody guide me on what is wrong here?
>
> JavaPairRDD<VendorRecord, Iterable<VendorRecord>> pairvendorData =
>     matchRdd.flatMapToPair(
>         new PairFlatMapFunction<VendorRecord, VendorRecord, VendorRecord>() {
>
>           @Override
>           public Iterable<Tuple2<VendorRecord, VendorRecord>> call(VendorRecord t)
>               throws Exception {
>             List<Tuple2<VendorRecord, VendorRecord>> pairs =
>                 new LinkedList<Tuple2<VendorRecord, VendorRecord>>();
>             MatcherKeys matchkeys = CompanyMatcherHelper.getBlockinkeys(t);
>             List<VendorRecord> Matchedrecords =
>                 ckdao.getMatchingRecordsWithscan(matchkeys);
>             for (int i = 0; i < Matchedrecords.size(); i++) {
>               pairs.add(new Tuple2<VendorRecord, VendorRecord>(t, Matchedrecords.get(i)));
>             }
>             return pairs;
>           }
>         }
>     ).groupByKey(200).persist(StorageLevel.DISK_ONLY_2());
>