what are the parameters on which locality depends
On Sun, Nov 15, 2015 at 5:54 PM, Renu Yadav wrote:
> Hi,
>
> I am working on Spark 1.4, reading an ORC table using a DataFrame, and
> converting that DF to an RDD.
>
> In the Spark UI I observe that 50% of tasks are running with locality ANY and
> very few
Hi Sunil,
Have you seen this fix in Spark 1.5 that may fix the locality issue?:
https://issues.apache.org/jira/browse/SPARK-4352
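For context, SPARK-4352 (as I understand it) makes executor requests under
dynamic allocation take data locality into account, so it only applies if
dynamic allocation is enabled. A hedged sketch of the relevant settings
(values illustrative; the external shuffle service is also required):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")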
On Thu, Aug 20, 2015 at 4:09 AM, Sunil wrote:
> Hello. I am seeing some unexpected issues with achieving HDFS
> data
> locality. I expect the tasks to be ex
You can read the same partition from every hour's output, union these RDDs, and
then repartition them into a single partition. This would be done for all
partitions one by one. It may not necessarily improve performance; that will
depend on the size of the spills in the job when all the data was processed
together.
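A minimal sketch of that idea in Scala (the directory layout and paths are
hypothetical, purely for illustration):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("merge-hourly-partitions"))

// Read the same partition (here partition 0) from every hour's output...
val hourlyParts = (0 until 24).map { h =>
  sc.textFile(f"/data/day1/hour=$h%02d/part-00000")
}

// ...union them, then collapse the result into a single partition.
val merged = sc.union(hourlyParts).repartition(1)
merged.saveAsTextFile("/data/day1/merged/partition-00000")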
This isn't currently a capability that Spark has, though it has definitely
been discussed: https://issues.apache.org/jira/browse/SPARK-1061. The
primary obstacle at this point is that Hadoop's FileInputFormat doesn't
guarantee that each file corresponds to a single split, so the records
correspond
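If the splitting behavior itself is the problem, one known workaround with the
old mapred API (not what SPARK-1061 proposes, just a common trick) is to
subclass the input format and refuse to split files, so each file maps to
exactly one split and hence one partition:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Never split a file: one file = one split = one partition.
class WholeFileTextInputFormat extends TextInputFormat {
  override protected def isSplitable(fs: FileSystem, file: Path): Boolean = false
}

// Usage from the spark-shell (sc is the shell's SparkContext; path hypothetical):
val oneFilePerPartition = sc.hadoopFile(
  "/data/input",
  classOf[WholeFileTextInputFormat],
  classOf[LongWritable],
  classOf[Text])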
Hi
How do I set a preferred location for an InputSplit in Spark standalone?
I have data on a specific machine and I want to read it using splits that are
created for that node only, by assigning some property which helps Spark
to create the split on that node only.
I have written a custom InputSplit and I want to pin it to the specific node
where my data is stored, but currently the split can start at any node and pick
data from a different node in the cluster. Any suggestion on how to set the
host in Spark?
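For what it's worth, with Hadoop-style input formats the host hint lives on the
split itself: Spark's HadoopRDD calls getLocations() on each split and uses the
returned hosts as the task's preferred locations. A minimal sketch with the old
mapred API (class and field names are made up):

import java.io.{DataInput, DataOutput}
import org.apache.hadoop.mapred.InputSplit

class PinnedSplit(var host: String, var length: Long) extends InputSplit {
  def this() = this(null, 0L)  // no-arg constructor for Writable deserialization

  // Spark reads this to prefer scheduling the task on the given host.
  override def getLocations: Array[String] = Array(host)
  override def getLength: Long = length

  override def write(out: DataOutput): Unit = {
    out.writeUTF(host); out.writeLong(length)
  }
  override def readFields(in: DataInput): Unit = {
    host = in.readUTF(); length = in.readLong()
  }
}

Whether a task actually lands on that host still depends on an executor being
available there within spark.locality.wait.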
Hi Guys,
I have a similar question and doubt. How does Spark create an executor on the
same node where the data block is stored? Does it first take information from
the HDFS NameNode, get the block information, and then place the executor on
the same node if the Spark worker daemon is installed there?
execution time roughly the same)
spark.storage.memoryFraction 0.9
Mike
From: Timothy Chen
To: Michael V Le/Watson/IBM@IBMUS
Cc: user
Date: 01/10/2015 04:31 AM
Subject: Re: Data locality running Spark on Mesos
Hi Michael,
I see you capped the cores to 60.
I wonder
> How did you run this benchmark, and is there an open version I can try it
> with?
>
> From: Tim Chen
> To: Michael V Le/Watson/IBM@IBMUS
> Cc: user
> Date: 01/08/2015 03:04 PM
> Subject: Re: Data locality running Spark on Mesos
>
>
>
> How did
least in the fine-grained case.
BTW: I'm using Spark ver 1.1.0 and Mesos ver 0.20.0
Thanks,
Mike
From: Tim Chen
To: Michael V Le/Watson/IBM@IBMUS
Cc: user
Date: 01/08/2015 03:04 PM
Subject: Re: Data locality running Spark on Mesos
How did you run this benchmark, an
How did you run this benchmark, and is there an open version I can try it
with?
And what is your configurations, like spark.locality.wait, etc?
Tim
On Thu, Jan 8, 2015 at 11:44 AM, mvle wrote:
> Hi,
>
> I've noticed running Spark apps on Mesos is significantly slower compared
> to
> stand-alone
You can also read about locality here in the docs:
http://spark.apache.org/docs/latest/tuning.html#data-locality
On Tue, Jan 6, 2015 at 8:37 AM, Cody Koeninger wrote:
> No, not all rdds have location information, and in any case tasks may be
> scheduled on non-local nodes if there is idle capaci
No, not all rdds have location information, and in any case tasks may be
scheduled on non-local nodes if there is idle capacity.
see spark.locality.wait
http://spark.apache.org/docs/latest/configuration.html
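For reference, the relevant knobs (values purely illustrative; in Spark 1.x
these are in milliseconds):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // How long to wait for a free slot at a given locality level before falling
  // back to a less-local one (PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL -> ANY).
  // The default is 3000 ms.
  .set("spark.locality.wait", "6000")
  // Per-level overrides, if you want them to differ:
  .set("spark.locality.wait.node", "6000")
  .set("spark.locality.wait.rack", "3000")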
On Tue, Jan 6, 2015 at 10:17 AM, gtinside wrote:
> Does spark guarantee to push the
You mentioned that the 3.1 min run was the one that did the actual caching,
so did that run before any data was cached, or after?
I would recommend checking the Storage tab of the UI, and clicking on the
RDD, to see both how full the executors' storage memory is (which may be
significantly less th
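If you'd rather check programmatically than via the Storage tab, something like
this works from the spark-shell (getRDDStorageInfo is a developer API; the path
is hypothetical):

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("/data/input").persist(StorageLevel.MEMORY_ONLY)
data.count()  // materialize the cache

sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions} of ${info.numPartitions} " +
    s"partitions cached, ${info.memSize} bytes in memory")
}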
I am seeing skewed execution times. As far as I can tell, they are
attributable to differences in data locality - tasks with locality
PROCESS_LOCAL run fast, NODE_LOCAL, slower, and ANY, slowest.
This seems entirely as it should be - the question is, why the different
locality levels?
I am seein
Spark's scheduling is pretty simple: it will allocate tasks to open cores
on executors, preferring ones where the data is local. It even performs
"delay scheduling", which means waiting a bit to see if an executor where
the data resides locally becomes available.
Are your tasks seeing very skewed
Sorry, I think I was not clear in what I meant.
I didn't mean it went down within a run, with the same instance.
I meant I'd run the whole app, and one time, it would cache 100%, and the
next run, it might cache only 83%
Within a run, it doesn't change.
On Wed, Nov 12, 2014 at 11:31 PM, Aaron Da
The fact that the caching percentage went down is highly suspicious. It
should generally not decrease unless other cached data took its place, or
unless executors were dying. Do you know if either of these was the
case?
On Tue, Nov 11, 2014 at 8:58 AM, Nathan Kronenfeld <
nkronenf...@oculusinf
's location when Spark is running as
>> a **standalone** cluster.
>>
>> Please correct me if I'm wrong. Thank you for your patience!
>>
>>
>> --
>>
>> *From:* Sandy Ryza [mailto:sandy.r...@cloudera.com]
>> *Sent:* July 22, 2014 9:47
>>
>>
ndy Ryza [mailto:sandy.r...@cloudera.com]
> Sent: July 22, 2014 9:47
>
>
> To: user@spark.apache.org
> Subject: Re: data locality
>
>
>
> This currently only works for YARN. The standalone default is to place an
> executor on every node for every job.
>
>
ease correct me if I'm wrong. Thank you for your patience!
>
>
> --
>
> *From:* Sandy Ryza [mailto:sandy.r...@cloudera.com]
> *Sent:* July 22, 2014 9:47
>
> *To:* user@spark.apache.org
> *Subject:* Re: data locality
>
>
>
> Th
'm wrong. Thank you for your patience!
From: Sandy Ryza [mailto:sandy.r...@cloudera.com]
Sent: July 22, 2014 9:47
To: user@spark.apache.org
Subject: Re: data locality
This currently only works for YARN. The standalone default is to place an
executor on every
hosts, how does Spark decide how many total
> executors to use for this application?
>
>
>
> Thanks again!
>
>
> --
>
> *From:* Sandy Ryza [mailto:sandy.r...@cloudera.com]
> *Sent:* Friday, July 18, 2014 3:44 PM
> *To:* user@spark.apac
executors to use for this application?
Thanks again!
From: Sandy Ryza [mailto:sandy.r...@cloudera.com]
Sent: Friday, July 18, 2014 3:44 PM
To: user@spark.apache.org
Subject: Re: data locality
Hi Haopu,
Spark will ask HDFS for file block locations and
Hi Haopu,
Spark will ask HDFS for file block locations and try to assign tasks based
on these.
There is a snag. Spark schedules its tasks inside of "executor" processes
that stick around for the lifetime of a Spark application. Spark requests
executors before it runs any jobs, i.e. before it ha
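You can actually peek at the preferences Spark derived from the block
locations, at least for debugging, from the spark-shell (getPreferredLocs is a
developer API; the path is hypothetical):

val rdd = sc.textFile("hdfs:///data/input")
rdd.partitions.foreach { p =>
  // Hosts holding the HDFS blocks that back this partition.
  println(s"partition ${p.index}: ${sc.getPreferredLocs(rdd, p.index)}")
}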