Hi,
This is on a 4-node cluster, each node with 32 cores/256GB RAM.
Spark (0.9.0) is deployed in standalone mode.
Each worker is configured with 192GB, and Spark executor memory is also 192GB.
This is on the first iteration. K=50. Here’s the code I use:
http://pastebin.com/2yXL3y8i, which is a copy-
Hi Tsai,
Could you share more information about the machine you used and the
training parameters (runs, k, and iterations)? It can help solve your
issues. Thanks!
Best,
Xiangrui
On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming wrote:
> Hi,
>
> At the reduceByKey stage, it takes a few minutes befo
Ognen - just so I understand: the issue is that there weren't enough
inodes and this was causing a "No space left on device" error? Is that
correct? If so, that's good to know because it's definitely
counterintuitive.
On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
wrote:
> I would love to work
As Mark said you can actually access this easily. The main issue I've
seen from a performance perspective is people having a bunch of really
small partitions. This will still work but the performance will
improve if you coalesce the partitions using rdd.coalesce().
This can happen for example if y
It's much simpler: rdd.partitions.size
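As a quick illustrative sketch only (assuming an existing RDD named rdd; the coalesce target of 16 is an arbitrary example), checking and reducing the partition count looks roughly like:
val numPartitions = rdd.partitions.size   // how many partitions the RDD currently has
val coalesced = rdd.coalesce(16)          // merge many small partitions into fewer, larger ones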
On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:
> Hey there fellow Dukes of Data,
>
> How can I tell how many partitions my RDD is split into?
>
> I'm interested in knowing because, from what I gather, having a good
>
Hey there fellow Dukes of Data,
How can I tell how many partitions my RDD is split into?
I'm interested in knowing because, from what I gather, having a good number
of partitions is good for performance. If I'm looking to understand how my
pipeline is performing, say for a parallelized write out
I would love to work on this (and other) stuff if I can bother someone
with questions offline or on a dev mailing list.
Ognen
On 3/23/14, 10:04 PM, Aaron Davidson wrote:
Thanks for bringing this up; 100% inode utilization is an issue I
haven't seen raised before and this raises another issue wh
Thanks for bringing this up; 100% inode utilization is an issue I haven't
seen raised before, and this raises another issue which is not on our
current roadmap for state cleanup (cleaning up data which was not fully
cleaned up from a crashed process).
On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevs
Ah, interesting. count() without distinct is streaming and does not require
that a single partition fits in memory, for instance. That said, the
behavior may change if you increase the number of partitions in your input
RDD by using RDD.repartition()
On Sun, Mar 23, 2014 at 11:47 AM, Kane wrote:
Bleh, strike that, one of my slaves was at 100% inode utilization on the
file system. It was /tmp/spark* leftovers that apparently did not get
cleaned up properly after failed or interrupted jobs.
Mental note - run a cron job on all slaves and master to clean up
/tmp/spark* regularly.
Thanks (
Aaron, thanks for replying. I am very much puzzled as to what is going
on. A job that used to run on the same cluster is failing with this
mysterious message about not having enough disk space when in fact I can
see through "watch df -h" that the free space is always hovering around
3+GB on the
I don’t see the errors anymore. Thanks Aaron.
On 24-Mar-2014, at 12:52 am, Aaron Davidson wrote:
> These errors should be fixed on master with Sean's PR:
> https://github.com/apache/spark/pull/209
>
> The orbit errors are quite possibly due to using https instead of http,
> whether or not the
Hey Jeremy, what happens if you pass batchSize=10 as an argument to your
SparkContext? It tries to serialize that many objects together at a time, which
might be too much. By default the batchSize is 1024.
Matei
On Mar 23, 2014, at 10:11 AM, Jeremy Freeman wrote:
> Hi all,
>
> Hitting a myst
Hello,
In Spark we can use *newAPIHadoopRDD* to access different distributed
systems like HDFS, HBase, and MongoDB via different InputFormats.
Is it possible to access the *InputSplit* in Spark directly? Spark can
cache data in local memory.
Perform local computation/aggregation on the local
inpu
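For reference, a hedged sketch of the general pattern (not a confirmed answer to the InputSplit question): read through the new Hadoop API and aggregate per partition, where each partition corresponds to one input split. The path and the line-count aggregation below are only examples.
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val rdd = sc.newAPIHadoopFile(
  "hdfs:///path/to/input",        // hypothetical input path
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])

// local computation per input split: count the lines held by each partition
val perSplitCounts = rdd.mapPartitions(iter => Iterator(iter.size))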
By default, with P partitions (for both the pre-shuffle stage and
post-shuffle), there are P^2 files created.
With spark.shuffle.consolidateFiles turned on, we would instead create only
P files. Disk space consumption, however, is largely unaffected by the
number of partitions unless each partition
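If it helps, a hedged sketch of turning the setting on via SparkConf (the master URL and app name are placeholders):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master:7077")              // placeholder master URL
  .setAppName("shuffle-consolidation-example")   // placeholder app name
  .set("spark.shuffle.consolidateFiles", "true") // create fewer, consolidated shuffle files
val sc = new SparkContext(conf)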
Hi
Thanks for reporting this. It'll be great if you can check a couple of
things:
1. Are you trying to use this with Hadoop2 by any chance? There was an
incompatible ASM version bug that we fixed for Hadoop2
https://github.com/amplab-extras/SparkR-pkg/issues/17 and we verified it,
but I just wan
I am really interested in using Spark from R and have tried to use SparkR,
but always get the same error.
This is how I installed:
- I successfully installed Spark version 0.9.0 with Scala 2.10.3 (OpenJDK
64-Bit Server VM, Java 1.7.0_45)
I can run examples from spark-shell and Python
On 3/23/14, 5:49 PM, Matei Zaharia wrote:
You can set spark.local.dir to put this data somewhere other than /tmp
if /tmp is full. Actually it’s recommended to have multiple local
disks and set it to a comma-separated list of directories, one per disk.
Matei, does the number of tasks/partitions i
Hey All,
I think the old thread is here:
https://groups.google.com/forum/#!msg/spark-users/gVtOp1xaPdU/Uyy9cQz9H_8J
The method proposed in that thread is to create a utility class for
doing single-pass aggregations. Using Algebird is a pretty good way to
do this and is a bit more flexible since y
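As a rough illustrative sketch of the single-pass idea without Algebird (assuming an RDD[(Double, Double)] named rdd: sum the first column, count the rows, and average the second column):
val (sumA, count, sumB) = rdd
  .map { case (a, b) => (a, 1L, b) }   // carry (sum of col A, row count, sum of col B) in one tuple
  .reduce { case ((a1, c1, b1), (a2, c2, b2)) => (a1 + a2, c1 + c2, b1 + b2) }
val avgB = sumB / count                // average of the second column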
On 3/23/14, 5:35 PM, Aaron Davidson wrote:
On some systems, /tmp/ is an in-memory tmpfs file system, with its own
size limit. It's possible that this limit has been exceeded. You might
try running the "df" command to check to free space of "/tmp" or root
if tmp isn't listed.
3 GB also seems
You can set spark.local.dir to put this data somewhere other than /tmp if /tmp
is full. Actually it’s recommended to have multiple local disks and set it to a
comma-separated list of directories, one per disk.
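A hedged example of setting this via SparkConf (the directory paths below are placeholders, one per local disk):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")  // one entry per local disk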
Matei
On Mar 23, 2014, at 3:35 PM, Aaron Davidson wrote:
> On some systems, /tmp/ i
On some systems, /tmp/ is an in-memory tmpfs file system, with its own size
limit. It's possible that this limit has been exceeded. You might try
running the "df" command to check to free space of "/tmp" or root if tmp
isn't listed.
3 GB also seems pretty low for the remaining free space of a disk
Hello,
I have a weird error showing up when I run a job on my Spark cluster.
The version of Spark is 0.9 and I have 3+ GB free on the disk when this
error shows up. Any ideas what I should be looking for?
[error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task
167.0:3 failed
I typically do something like this:
scala> val rdd = sc.parallelize(1 to 10)
scala> import com.twitter.algebird.Operators._
scala> import com.twitter.algebird.{Max, Min}
scala> rdd.map{ x => (
| 1L,
| Min(x),
| Max(x),
| x
| )}.reduce(_ + _)
res0: (Long,
These errors should be fixed on master with Sean's PR:
https://github.com/apache/spark/pull/209
The orbit errors are quite possibly due to using https instead of http,
whether or not the SSL cert was bad. Let us know if they go away with
reverting to http.
On Sun, Mar 23, 2014 at 11:48 AM, Debas
I am getting these weird errors which I have not seen before:
[error] Server access Error: handshake alert: unrecognized_name url=
https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.servlet/2.5.0.v201103041518/javax.servlet-2.5.0.v201103041518.orbit
[info] Resolving org.eclipse.je
Yes, there was an error in the data; after fixing it, count fails with an Out
of Memory error.
Hi Koert, Patrick,
do you already have an elegant solution to combine multiple operations on a
single RDD?
Say for example that I want to do a sum over one column, a count and an
average over another column,
thanks in advance,
Richard
On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling wrote:
>
Hi all,
Hitting a mysterious error loading large text files, specific to PySpark
0.9.0.
In PySpark 0.8.1, this works:
data = sc.textFile("path/to/myfile")
data.count()
But in 0.9.0, it stalls. There are indications of completion up to:
14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in 1
Andrew, this should be fixed in 0.9.1, assuming it is the same hash
collision error we found there.
Kane, is it possible your bigger data is corrupt, such that any
operations on it fail?
On Sat, Mar 22, 2014 at 10:39 PM, Andrew Ash wrote:
> FWIW I've seen correctness errors with spark.shu
I'm also seeing this. It was also working for me previously AFAIK.
The proximate cause is my well-intentioned change that uses HTTPS to
access all artifact repos. The default for Maven Central before would have
been HTTP. While it's a good idea to use HTTPS, it may run into
complications.
I see:
I am facing a weird failure where "sbt/sbt assembly" shows a lot of SSL
certificate errors for repo.maven.apache.org. Is anyone else facing the same
problems? Any idea why this is happening? Yesterday I was able to successfully
run it.
Loading https://repo.maven.apache.org shows an invalid cert
Hi,
At the reduceByKey stage, it takes a few minutes before the tasks start
working.
I have set -Dspark.default.parallelism=127, i.e. total cores minus one (n-1).
CPU/Network/IO is idling across all nodes when this is happening.
And there is nothing particular on the master log file. From the spark-shell:
14/03/23 1