Re: Data Locality Issue

2015-11-15 Thread Renu Yadav
What are the parameters on which locality depends? On Sun, Nov 15, 2015 at 5:54 PM, Renu Yadav wrote: > Hi, > > I am working on Spark 1.4, reading an ORC table using a DataFrame and > converting that DF to an RDD. > > In the Spark UI I observe that 50% of tasks are running with locality ANY and > very few
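
Task locality depends mainly on where the input blocks or cached partitions live, whether executors are running on those hosts, and how long the scheduler is willing to wait for a local slot to free up. A minimal sketch of the scheduler-side setting, with an illustrative value only:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative value, not a recommendation.
    val conf = new SparkConf()
      .setAppName("locality-wait-sketch")
      // How long to wait for a data-local slot before falling back to a
      // less-local level; milliseconds here (newer releases also accept "3s").
      .set("spark.locality.wait", "3000")
    val sc = new SparkContext(conf)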

Re: Data locality with HDFS not being seen

2015-08-21 Thread Sameer Farooqui
Hi Sunil, Have you seen this fix in Spark 1.5 that may address the locality issue: https://issues.apache.org/jira/browse/SPARK-4352 On Thu, Aug 20, 2015 at 4:09 AM, Sunil wrote: > Hello. I am seeing some unexpected issues with achieving HDFS > data > locality. I expect the tasks to be ex
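
One quick way to check whether the scheduler has any locality information to work with is to print each partition's preferred locations and compare them with the executor hosts; a rough sketch, with a hypothetical HDFS path:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("locality-check-sketch"))

    // Hypothetical input path.
    val rdd = sc.textFile("hdfs:///data/events/2015-08-20")

    // Preferred hosts per partition, as reported by the underlying input format.
    rdd.partitions.take(5).foreach { p =>
      println(s"partition ${p.index}: ${rdd.preferredLocations(p).mkString(", ")}")
    }

If none of these hosts are running executors, every task can only run at locality ANY.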

Re: Data locality across jobs

2015-04-03 Thread Ajay Srivastava
You can read the same partition from every hour's output, union these RDDs, and then repartition them as a single partition. This will be done for all partitions one by one. It may not necessarily improve performance; that will depend on the size of spills in the job when all the data was processed together.
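
A rough sketch of that idea for one partition index, assuming hourly output directories with identical part-file names (paths and layout are hypothetical); the same steps would be repeated for each partition index:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("merge-hourly-partition-sketch"))

    // Hypothetical layout: one directory per hour, same part-file names in each.
    val hours = (0 to 23).map(h => f"hdfs:///output/hour=$h%02d")

    // Read partition 0 of every hour, union the pieces, collapse to one partition.
    val partZero = hours.map(dir => sc.textFile(s"$dir/part-00000"))
    val merged   = sc.union(partZero).coalesce(1)

    merged.saveAsTextFile("hdfs:///output/merged/partition-00000")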

Re: Data locality across jobs

2015-04-02 Thread Sandy Ryza
This isn't currently a capability that Spark has, though it has definitely been discussed: https://issues.apache.org/jira/browse/SPARK-1061. The primary obstacle at this point is that Hadoop's FileInputFormat doesn't guarantee that each file corresponds to a single split, so the records correspond

Re: data locality in logs

2015-01-28 Thread hnahak
Hi, How to set a preferred location for an InputSplit in Spark standalone? I have data on a specific machine and I want to read it using splits created for that node only, by assigning some property which helps Spark to create the split on that node only.
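
If the records can be enumerated on the driver, the simplest way to attach a location preference is SparkContext.makeRDD, which accepts a list of preferred hostnames per element; a small sketch with made-up hostnames:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("make-rdd-locality-sketch"))

    // Hypothetical records, each tagged with the host(s) that should process it.
    val data = Seq(
      ("record-a", Seq("node1.example.com")),
      ("record-b", Seq("node2.example.com"))
    )

    // The explicit type parameter selects the (T, Seq[String]) overload, which
    // uses the hostnames as preferred locations for the resulting partitions.
    val rdd = sc.makeRDD[String](data)

These are preferences, not guarantees: once spark.locality.wait expires, a task may still run elsewhere.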

Re: Data Locality

2015-01-28 Thread hnahak
I have written a custom input split and I want to pin it to the specific node where my data is stored, but currently the split can start on any node and pick data from a different node in the cluster. Any suggestion on how to set the host in Spark?
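
For a custom source, the usual hook is to subclass RDD and override getPreferredLocations so the scheduler knows which host holds each partition's data; a minimal sketch, with a hypothetical split-to-host mapping standing in for the real input format:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical partition that remembers which host holds its data.
    class HostPartition(override val index: Int, val host: String) extends Partition

    class HostAwareRDD(sc: SparkContext, hostToRecords: Seq[(String, Seq[String])])
      extends RDD[String](sc, Nil) {

      override def getPartitions: Array[Partition] =
        hostToRecords.zipWithIndex.map { case ((host, _), i) => new HostPartition(i, host) }.toArray

      // Tell the scheduler where each partition's data lives.
      override def getPreferredLocations(split: Partition): Seq[String] =
        Seq(split.asInstanceOf[HostPartition].host)

      // Stand-in for reading from the node-local store.
      override def compute(split: Partition, context: TaskContext): Iterator[String] =
        hostToRecords(split.index)._2.iterator
    }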

Re: Data Locality

2015-01-28 Thread Harihar Nahak
Hi guys, I have a similar question and doubt. How does Spark create an executor on the same node where the data block is stored? Does it first take information from the HDFS name node, get the block information, and then place the executor on the same node if the spark-worker daemon is installed?

Re: Data locality running Spark on Mesos

2015-01-11 Thread Michael V Le
execution time roughly the same) spark.storage.memoryFraction 0.9 Mike From: Timothy Chen To: Michael V Le/Watson/IBM@IBMUS Cc: user Date: 01/10/2015 04:31 AM Subject: Re: Data locality running Spark on Mesos Hi Michael, I see you capped the cores to 60. I wonder

Re: Data locality running Spark on Mesos

2015-01-10 Thread Timothy Chen
1 PM---How did you run this > benchmark, and is there an open version I can try it with? > > From: Tim Chen > To: Michael V Le/Watson/IBM@IBMUS > Cc: user > Date: 01/08/2015 03:04 PM > Subject: Re: Data locality running Spark on Mesos > > > > How did

Re: Data locality running Spark on Mesos

2015-01-09 Thread Michael V Le
least in the fine-grained case. BTW: I'm using Spark version 1.1.0 and Mesos version 0.20.0. Thanks, Mike From: Tim Chen To: Michael V Le/Watson/IBM@IBMUS Cc: user Date: 01/08/2015 03:04 PM Subject: Re: Data locality running Spark on Mesos How did you run this benchmark, an

Re: Data locality running Spark on Mesos

2015-01-08 Thread Tim Chen
How did you run this benchmark, and is there an open version I can try it with? And what are your configurations, like spark.locality.wait, etc? Tim On Thu, Jan 8, 2015 at 11:44 AM, mvle wrote: > Hi, > > I've noticed running Spark apps on Mesos is significantly slower compared > to > stand-alone
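
For reference, the configuration being asked about might look roughly like this on Mesos; the values are illustrative, loosely echoing the thread (60-core cap, storage.memoryFraction 0.9), not recommendations:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("mesos-locality-sketch")
      .setMaster("mesos://zk://zk1:2181/mesos")   // hypothetical Mesos master URL
      .set("spark.mesos.coarse", "false")         // fine-grained mode, as in the thread
      .set("spark.cores.max", "60")               // cap on total cores
      .set("spark.locality.wait", "3000")         // ms to wait for a data-local slot
      .set("spark.storage.memoryFraction", "0.9")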

Re: Data Locality

2015-01-06 Thread Andrew Ash
You can also read about locality here in the docs: http://spark.apache.org/docs/latest/tuning.html#data-locality On Tue, Jan 6, 2015 at 8:37 AM, Cody Koeninger wrote: > No, not all rdds have location information, and in any case tasks may be > scheduled on non-local nodes if there is idle capaci

Re: Data Locality

2015-01-06 Thread Cody Koeninger
No, not all RDDs have location information, and in any case tasks may be scheduled on non-local nodes if there is idle capacity. See spark.locality.wait: http://spark.apache.org/docs/latest/configuration.html On Tue, Jan 6, 2015 at 10:17 AM, gtinside wrote: > Does spark guarantee to push the

Re: data locality, task distribution

2014-11-13 Thread Aaron Davidson
You mentioned that the 3.1 min run was the one that did the actual caching, so did that run before any data was cached, or after? I would recommend checking the Storage tab of the UI, and clicking on the RDD, to see both how full the executors' storage memory is (which may be significantly less th
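
Besides the Storage tab, the same numbers can be read programmatically from the spark-shell (where sc is predefined) via SparkContext.getRDDStorageInfo, a developer API that may change between releases; a small sketch:

    // Reports each persisted RDD that has actually been materialized.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
        s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
    }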

Re: data locality, task distribution

2014-11-13 Thread Nathan Kronenfeld
I am seeing skewed execution times. As far as I can tell, they are attributable to differences in data locality - tasks with locality PROCESS_LOCAL run fast, NODE_LOCAL, slower, and ANY, slowest. This seems entirely as it should be - the question is, why the different locality levels? I am seein

Re: data locality, task distribution

2014-11-12 Thread Aaron Davidson
Spark's scheduling is pretty simple: it will allocate tasks to open cores on executors, preferring ones where the data is local. It even performs "delay scheduling", which means waiting a bit to see if an executor where the data resides locally becomes available. Are your tasks seeing very skewed
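
The delay in question is configurable per locality level; a sketch of the relevant settings (each falls back to spark.locality.wait, default 3 seconds, when unset; values here are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.locality.wait.process", "3000")  // wait for a PROCESS_LOCAL slot (data cached in-process)
      .set("spark.locality.wait.node", "3000")     // then for a NODE_LOCAL slot
      .set("spark.locality.wait.rack", "3000")     // then for a RACK_LOCAL slot, before falling back to ANY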

Re: data locality, task distribution

2014-11-12 Thread Nathan Kronenfeld
Sorry, I think I was not clear in what I meant. I didn't mean it went down within a run, with the same instance. I meant I'd run the whole app, and one time it would cache 100%, and the next run it might cache only 83%. Within a run, it doesn't change. On Wed, Nov 12, 2014 at 11:31 PM, Aaron Da

Re: data locality, task distribution

2014-11-12 Thread Aaron Davidson
The fact that the caching percentage went down is highly suspicious. It should generally not decrease unless other cached data took its place, or unless executors were dying. Do you know if either of these was the case? On Tue, Nov 11, 2014 at 8:58 AM, Nathan Kronenfeld < nkronenf...@oculusinf
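
If eviction by other cached data turns out to be the cause, one common mitigation is to persist with a storage level that spills to local disk instead of dropping blocks; a sketch from the spark-shell (where sc is predefined), with a hypothetical path:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///data/events").map(_.toUpperCase)

    // MEMORY_AND_DISK writes partitions that no longer fit in memory to local
    // disk instead of dropping them, so the cached fraction stays at 100%.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count()  // materialize the cache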

Re: data locality

2014-08-30 Thread Chris Fregly
's location when Spark is running as >> a **standalone** cluster. >> >> Please correct me if I'm wrong. Thank you for your patience! >> >> >> -- >> >> *From:* Sandy Ryza [mailto:sandy.r...@cloudera.com] >> *Sent:* July 22, 2014 9:47 >> >>

Re: data locality

2014-07-25 Thread Tsai Li Ming
ndy Ryza [mailto:sandy.r...@cloudera.com] > Sent: July 22, 2014 9:47 > > > To: user@spark.apache.org > Subject: Re: data locality > > This currently only works for YARN. The standalone default is to place an > executor on every node for every job. > >

Re: data locality

2014-07-22 Thread Sandy Ryza
ease correct me if I'm wrong. Thank you for your patience! > > > -- > > *From:* Sandy Ryza [mailto:sandy.r...@cloudera.com] > *Sent:* July 22, 2014 9:47 > > *To:* user@spark.apache.org > *Subject:* Re: data locality > > > > Th

RE: data locality

2014-07-21 Thread Haopu Wang
'm wrong. Thank you for your patience! From: Sandy Ryza [mailto:sandy.r...@cloudera.com] Sent: July 22, 2014 9:47 To: user@spark.apache.org Subject: Re: data locality This currently only works for YARN. The standalone default is to place an executor on every

Re: data locality

2014-07-21 Thread Sandy Ryza
hosts, how does Spark decide how many total > executors to use for this application? > > > > Thanks again! > > > -- > > *From:* Sandy Ryza [mailto:sandy.r...@cloudera.com] > *Sent:* Friday, July 18, 2014 3:44 PM > *To:* user@spark.apac

RE: data locality

2014-07-18 Thread Haopu Wang
executors to use for this application? Thanks again! From: Sandy Ryza [mailto:sandy.r...@cloudera.com] Sent: Friday, July 18, 2014 3:44 PM To: user@spark.apache.org Subject: Re: data locality Hi Haopu, Spark will ask HDFS for file block locations and

Re: data locality

2014-07-18 Thread Sandy Ryza
Hi Haopu, Spark will ask HDFS for file block locations and try to assign tasks based on these. There is a snag. Spark schedules its tasks inside of "executor" processes that stick around for the lifetime of a Spark application. Spark requests executors before it runs any jobs, i.e. before it ha
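
The block-to-host mapping that Spark gets from HDFS can also be inspected directly with the Hadoop FileSystem API; a rough sketch, with a hypothetical path:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    val fs = FileSystem.get(new Configuration())
    val status = fs.getFileStatus(new Path("hdfs:///data/events/part-00000"))

    // One entry per block, listing the DataNode hosts that hold a replica.
    fs.getFileBlockLocations(status, 0, status.getLen).foreach { block =>
      println(s"offset ${block.getOffset}: hosts ${block.getHosts.mkString(", ")}")
    }

These hosts are what become the preferred locations of the corresponding tasks when the file is read as an RDD.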