Hi,
Some of my jobs failed due to "no space left on device", and on those jobs I
was monitoring the shuffle space... when a job failed, the shuffle space was
not cleaned up and I had to clean it manually...
Is there a JIRA already tracking this issue? If no one has been assigned
to it, I can take a look.
T
+1 (non-binding, of course)
1. Compiled on OS X 10.10 (Yosemite) OK. Total time: 14:50 min
mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11
2. Tested PySpark, MLlib - running as well as comparing results with 1.1.x &
1.2.x
2.1. statisti
According to the Hive documentation, "sort by" is supposed to order the results
within each reducer. So if we set a single reducer, then the results should be
sorted, right? But this is not happening. Any idea why? It looks like the
settings I am using to restrict the number of reducers are not having an
effe
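In case it helps, here is a minimal sketch of forcing a single reduce partition through Spark SQL's HiveContext; the table name "src" is a placeholder, and the relevance of spark.sql.shuffle.partitions assumes the query actually runs through Spark SQL rather than plain Hive:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  val sc = new SparkContext(new SparkConf().setAppName("sort-by-check").setMaster("local[*]"))
  val hive = new HiveContext(sc)

  // Spark SQL sizes its shuffles from spark.sql.shuffle.partitions, so restrict
  // that rather than (or in addition to) Hive's mapred.reduce.tasks.
  hive.sql("SET spark.sql.shuffle.partitions=1")

  // With a single reduce partition, SORT BY should yield a totally ordered result.
  // "src" is a placeholder table name used only for illustration.
  hive.sql("SELECT key FROM src SORT BY key").collect().foreach(println)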
Ah okay, I turned on spark.localExecution.enabled and the performance
returned to what Spark 1.0.2 had. However I can see how users can
inadvertently incur memory and network strain in fetching the whole
partition to the driver.
I'll evaluate on my side if we want to turn this on or not. Thanks fo
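For anyone else comparing, a minimal sketch of re-enabling that flag; the master setting and data sizes here are placeholders for illustration:

  import org.apache.spark.{SparkConf, SparkContext}

  // Re-enable driver-local execution of take(), the behavior that was the default
  // before 1.1.1. Caveat from the thread: the driver may pull a whole partition,
  // which can cost memory and network bandwidth for large partitions.
  val conf = new SparkConf()
    .setAppName("take-latency-check")
    .setMaster("local[*]")
    .set("spark.localExecution.enabled", "true")
  val sc = new SparkContext(conf)

  val first10 = sc.parallelize(1 to 1000000, 100).take(10)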
You might be seeing the result of this patch:
https://github.com/apache/spark/commit/d069c5d9d2f6ce06389ca2ddf0b3ae4db72c5797
which was introduced in 1.1.1. This patch disabled the ability for take()
to run without launching a Spark job, which means that the latency is
significantly increased for
I actually tested Spark 1.2.0 with the code in the rdd.take() method
swapped out for what was in Spark 1.0.2. The run time was still slower,
which indicates to me that something lower in the stack is at work.
-Matt Cheah
On 2/18/15, 4:54 PM, "Patrick Wendell" wrote:
>I believe the heuristic governing t
I believe the heuristic governing the way that take() decides to fetch
partitions changed between these versions. It could be that in certain
cases the new heuristic is worse, but it might be good to just look at
the source code and see, for your number of elements taken and number
of partitions, i
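Roughly, the kind of scale-up heuristic in question looks like the sketch below; it is an illustration rather than the exact Spark source, so the constants are assumptions:

  // Illustrative only: decide how many more partitions a take(num) pass should scan,
  // given how many were already scanned and how many results those produced.
  def nextPartsToTry(num: Int, partsScanned: Int, resultsSoFar: Int): Int = {
    if (partsScanned == 0) 1                       // first pass: try a single partition
    else if (resultsSoFar == 0) partsScanned * 4   // nothing found yet: grow aggressively
    else {
      // extrapolate from the result density seen so far, with a safety margin
      math.max(1, (1.5 * num * partsScanned / resultsSoFar).toInt - partsScanned)
    }
  }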
Hi everyone,
Between Spark 1.0.2 and Spark 1.1.1, I have noticed that rdd.take()
consistently has a slower execution time on the later release. I was
wondering if anyone else has had similar observations.
I have two setups where this reproduces. The first is a local test. I
launched a spark clust
On Wed, Feb 18, 2015 at 6:13 PM, Patrick Wendell wrote:
>> Patrick this link gives a 404:
>> https://people.apache.org/keys/committer/pwendell.asc
>
> Works for me. Maybe it's some ephemeral issue?
Yes, it works now; I swear it didn't before! That's all set now. The
signing key is in that file.
Another alternative would be to compress the partition in memory in a
streaming fashion instead of calling .toArray on the iterator. Would it be
an easier mitigation to the problem? Or, is it hard to compress the rows
one by one without materializing the full partition in memory using the
compressi
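Something along these lines might work as a sketch; it leans on SparkEnv's configured serializer plus gzip and is not an existing Spark helper, so the API fit is an assumption:

  import java.io.ByteArrayOutputStream
  import java.util.zip.GZIPOutputStream
  import scala.reflect.ClassTag
  import org.apache.spark.SparkEnv

  // Serialize and compress a partition's rows one at a time instead of calling
  // .toArray on the iterator. Only the compressed bytes are buffered, never the
  // fully materialized partition.
  def compressPartition[T: ClassTag](iter: Iterator[T]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = SparkEnv.get.serializer.newInstance().serializeStream(new GZIPOutputStream(bytes))
    iter.foreach(out.writeObject(_))
    out.close()   // flushes and closes the underlying gzip stream as well
    bytes.toByteArray
  }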
You have a JIRA for it...
https://issues.apache.org/jira/browse/SPARK-3066
I added the PR on the JIRA...
On Wed, Feb 18, 2015 at 3:07 PM, Xiangrui Meng wrote:
> Please create a JIRA for it and we should discuss the APIs first
> before updating the code. -Xiangrui
>
> On Tue, Feb 17, 2015 at 4:
Please create a JIRA for it and we should discuss the APIs first
before updating the code. -Xiangrui
On Tue, Feb 17, 2015 at 4:10 PM, Debasish Das wrote:
> It will really help us if we merge it, but I guess it has already diverged
> from the new ALS... I will also take a look at it again and try
Yes, that's a bug and should be using the standard serializer.
On Wed, Feb 18, 2015 at 2:58 PM, Sean Owen wrote:
> That looks, at the least, inconsistent. As far as I know this should
> be changed so that the zero value is always cloned via the non-closure
> serializer. Any objection to that?
>
That looks, at the least, inconsistent. As far as I know this should
be changed so that the zero value is always cloned via the non-closure
serializer. Any objection to that?
On Wed, Feb 18, 2015 at 10:28 PM, Matt Cheah wrote:
> But RDD.aggregate() has this code:
>
> // Clone the zero value s
Now in JIRA form: https://issues.apache.org/jira/browse/SPARK-5844
On Tue, Feb 17, 2015 at 3:12 PM, Xiangrui Meng wrote:
> There are three different regParams defined in the grid and there are
> three folds. For simplicity, we didn't split the dataset into three and
> reuse them, but do the split
But RDD.aggregate() has this code:
// Clone the zero value since we will also be serializing it as part of tasks
var jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance())
I do see SparkEnv.get.serializer used in aggregateByKey, however. Perhaps
we just missed it an
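If the conclusion is to always go through the standard serializer (as aggregateByKey does), the clone could look like this sketch; the helper name is made up for illustration:

  import scala.reflect.ClassTag
  import org.apache.spark.SparkEnv

  // Round-trip the zero value through whatever spark.serializer is configured to,
  // instead of the closure serializer, so each use gets a fresh copy.
  def cloneZero[U: ClassTag](zeroValue: U): U = {
    val ser = SparkEnv.get.serializer.newInstance()
    ser.deserialize[U](ser.serialize(zeroValue))
  }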
this is done.
On Wed, Feb 18, 2015 at 2:00 PM, shane knapp wrote:
> i'm actually going to do this now -- it's really quiet today.
>
> there are two spark pull request builds running, which i will kill and
> retrigger once jenkins is back up:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPull
i'm actually going to do this now -- it's really quiet today.
there are two spark pull request builds running, which i will kill and
retrigger once jenkins is back up:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27689/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequ
This would be pretty tricky to do -- the issue is that right now
sparkContext.runJob has you pass in a function from a partition to *one*
result object that gets serialized and sent back: Iterator[T] => U, and
that idea is baked pretty deep into a lot of the internals, DAGScheduler,
Task, Executors
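For context, the current per-partition shape looks roughly like this; the data and master are placeholders:

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("run-job-shape").setMaster("local[*]"))
  val rdd = sc.parallelize(1 to 100, 4)

  // Iterator[Int] => Int: exactly one result object per partition is serialized
  // back to the driver, collected into an Array[Int] with one entry per partition.
  val perPartitionSums: Array[Int] = sc.runJob(rdd, (iter: Iterator[Int]) => iter.sum)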
i'll be kicking jenkins to up the open file limits on the workers. it
should be a very short downtime, and i'll post updates on my progress
tomorrow.
shane
> UISeleniumSuite:
> *** RUN ABORTED ***
> java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal
> ...
This is a newer test suite. There is something flaky about it and we
should definitely fix it, but IMO it's not a blocker.
>
> Patrick this link gives a 404:
> https://people.apache.org
I've recently run into problems caused by ticket SPARK-5008
https://issues.apache.org/jira/browse/SPARK-5008
This seems like quite a serious regression in 1.2.0, meaning that it's not
really possible to use persistent-hdfs. The config for the persistent-hdfs
points to the wrong part of the filesy
Hi Spark devs,
I'm creating a streaming export functionality for RDDs and am having some
trouble with large partitions. The RDD.toLocalIterator() call pulls over a
partition at a time to the driver, and then streams the RDD out from that
partition before pulling in the next one. When you have la
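For reference, the basic pattern being described, with a placeholder local file sink:

  import java.io.PrintWriter
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("streaming-export").setMaster("local[*]"))
  val rdd = sc.parallelize(1 to 1000000, 100)

  // toLocalIterator runs one job per partition and holds only the current
  // partition's elements on the driver, so memory use is bounded by the largest partition.
  val out = new PrintWriter("export.txt")   // placeholder sink
  try {
    rdd.toLocalIterator.foreach(x => out.println(x))
  } finally {
    out.close()
  }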
It looks like this was fixed in
https://issues.apache.org/jira/browse/SPARK-4743 /
https://github.com/apache/spark/pull/3605. Can you see whether that patch
fixes this issue for you?
On Tue, Feb 17, 2015 at 8:31 PM, Matt Cheah wrote:
> Hi everyone,
>
> I was using JavaPairRDD’s combineByKey()
On OS X and Ubuntu I see the following test failure in the source
release for 1.3.0-RC1:
UISeleniumSuite:
*** RUN ABORTED ***
java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal
...
Patrick this link gives a 404:
https://people.apache.org/keys/committer/pwendell.asc
Finally, I alrea
The serializer is created with
val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
Which is definitely not the closure serializer and so should respect
what you are setting with spark.serializer.
Maybe you can do a quick bit of debugging to see where that assumption
break
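One quick way to check that assumption; the Kryo setting here is just an example value:

  import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

  val conf = new SparkConf()
    .setAppName("serializer-check")
    .setMaster("local[*]")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  val sc = new SparkContext(conf)

  // Should print the Kryo serializer class if spark.serializer is being respected.
  println(SparkEnv.get.serializer.getClass.getName)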
I do not think it makes sense to make the web server configurable.
Mostly because there's no real problem in running an HTTP service
internally based on Netty while you run your own HTTP service based on
something else like Tomcat. What's the problem?
On Wed, Feb 18, 2015 at 3:14 AM, Niranda Perer
Hey Committers,
Now that Spark 1.3 rc1 is cut, please restrict branch-1.3 merges to
the following:
1. Fixes for issues blocking the 1.3 release (i.e. 1.2.X regressions)
2. Documentation and tests.
3. Fixes for non-blocker issues that are surgical, low-risk, and/or
outside of the core.
If there i
Please vote on releasing the following candidate as Apache Spark version 1.3.0!
The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2
The release files, including signatures, digests, etc. ca