Andrew, you have pretty much consolidated my entire experience. Please give a presentation at a meetup on this, and send across the links :)
Regards,
Gourav

On Wed, Jul 20, 2016 at 4:35 AM, Andrew Ehrlich <and...@aehrlich.com> wrote:

> Try:
>
> - filtering down the data as soon as possible in the job, dropping columns you don't need
> - processing fewer partitions of the Hive tables at a time
> - caching frequently accessed data, for example dimension tables, lookup tables, or other datasets that are repeatedly accessed
> - using the Spark UI to identify the bottlenecked resource
> - removing features or columns from the output data until it runs, then adding them back in one at a time
> - creating a static dataset small enough to work with, then editing the query and retesting repeatedly until you cut the execution time by a significant fraction
> - using the Spark UI or spark-shell to check for skew and make sure partitions are evenly distributed
>
> On Jul 18, 2016, at 3:33 AM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID> wrote:
>
> Thanks a lot for your reply.
>
> In effect, here we tried to run the SQL on Kettle, Hive, and Spark Hive (via HiveContext) respectively, and the job seems to freeze before finishing.
>
> From the 6 tables, we need to read different columns in different tables for specific information, then do some simple calculation before output. Join operations are used most in the SQL.
>
> Best wishes!
>
> On Monday, July 18, 2016 6:24 PM, Chanh Le <giaosu...@gmail.com> wrote:
>
> Hi,
> What about the network (bandwidth) between Hive and Spark?
> Did it run in Hive before you moved to Spark?
> Because it's complex, you can use something like the EXPLAIN command to show what is going on.
>
> On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID> wrote:
>
> The SQL logic in the program is very complex, so I will not describe the detailed code here.
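Andrew's last tip, checking that partitions are evenly distributed, can be sketched in a few lines. In PySpark, per-partition record counts could come from something like `rdd.glom().map(len).collect()`; the helper below is a plain-Python illustration of what to do with such counts (the function name and threshold are made up for this sketch, not a Spark API):

```python
# Minimal sketch of the skew check: given per-partition record counts,
# report how unbalanced the largest partition is relative to the mean.
# (In PySpark, counts could come from rdd.glom().map(len).collect().)

def skew_ratio(partition_counts):
    """Return max partition size divided by the mean partition size."""
    if not partition_counts:
        return 0.0
    mean = sum(partition_counts) / len(partition_counts)
    return max(partition_counts) / mean if mean else 0.0

# Evenly distributed partitions give a ratio near 1.0 ...
even = [1000, 990, 1010, 1005]
# ... while one hot partition pushes the ratio well above 1.0,
# meaning one task does most of the work and the stage crawls.
skewed = [100, 120, 90, 5000]

print(skew_ratio(even))
print(skew_ratio(skewed))
```

A ratio far above 1.0 on a join stage usually points at a hot key; repartitioning or salting that key is the typical fix.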
> On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID> wrote:
>
> Hi All,
>
> Here we have one application that needs to extract different columns from 6 Hive tables and then do some easy calculation. There are around 100,000 rows in each table, and it finally needs to output another table or file (with a consistent set of columns).
>
> However, after many days of trying, the Spark Hive job is unthinkably slow - sometimes almost frozen. There are 5 nodes in the Spark cluster.
>
> Could anyone offer some help? Any idea or clue would also be good.
>
> Thanks in advance~
>
> Zhiliang
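For a join-heavy job like the one described, Andrew's first two tips (filter early, drop unneeded columns) are usually the biggest win. The plain-Python sketch below shows the idea; the table names and columns are invented for illustration, and in Spark the same shape would be `select()`/`filter()` (or an explicit column list plus WHERE in SQL) applied before the JOIN:

```python
# Sketch of "filter rows and drop unneeded columns before joining".
# The tables and column names here are invented for illustration.

orders = [
    {"order_id": 1, "cust_id": 10, "amount": 25.0, "note": "gift wrap"},
    {"order_id": 2, "cust_id": 11, "amount": 0.0,  "note": ""},
    {"order_id": 3, "cust_id": 10, "amount": 40.0, "note": ""},
]
customers = [
    {"cust_id": 10, "name": "Ann", "address": "..."},
    {"cust_id": 11, "name": "Bob", "address": "..."},
]

# 1. Filter rows and keep only the columns the output needs,
#    *before* the join, so the join touches less data.
slim_orders = [
    {"cust_id": o["cust_id"], "amount": o["amount"]}
    for o in orders if o["amount"] > 0
]

# 2. Build a lookup from the small (dimension) side; this is the
#    plain-Python analogue of caching/broadcasting a small table.
name_by_cust = {c["cust_id"]: c["name"] for c in customers}

# 3. Join the reduced data.
result = [
    {"name": name_by_cust[o["cust_id"]], "amount": o["amount"]}
    for o in slim_orders
]
print(result)
```

With 6 tables of ~100,000 rows each, pruning columns and rows before each join keeps the shuffled data small; doing the joins first and filtering last forces Spark to shuffle everything.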