Hi Everyone,
I saved a 2 GB PDF file into MongoDB using GridFS. Now I want to process that
GridFS collection data using Java Spark MapReduce. Previously I have
successfully processed normal MongoDB collections (not GridFS) with Apache
Spark using the Mongo-Hadoop connector. Now I'm unable to handle input
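For reference, the path that works for normal collections looks roughly like the sketch below (a Scala equivalent of the Java job, run from spark-shell so sc already exists; the URI, database, and collection names are placeholders, and GridFS itself splits the file across fs.files/fs.chunks, so this only reads plain documents):

    import org.apache.hadoop.conf.Configuration
    import org.bson.BSONObject
    import com.mongodb.hadoop.MongoInputFormat

    // Placeholder URI; point this at your own mongod and collection.
    val mongoConf = new Configuration()
    mongoConf.set("mongo.input.uri", "mongodb://localhost:27017/mydb.mycollection")

    // Each record arrives as a (key, BSONObject) pair.
    val docs = sc.newAPIHadoopRDD(
      mongoConf,
      classOf[MongoInputFormat],
      classOf[Object],
      classOf[BSONObject])

    println(docs.count())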
Hi all,
Here I am sharing a blog post for beginners about creating a standalone Spark
Streaming application and bundling the app as a single runnable jar. Take a
look and drop your comments on the blog page.
http://prabstechblog.blogspot.in/2014/04/a-standalone-spark-application-in-scala.html
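For readers who just want a starting point before bundling anything, a minimal standalone Spark Streaming app looks roughly like the sketch below; the master URL, host, and port are placeholders:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    object StreamingWordCount {
      def main(args: Array[String]) {
        // "local[2]" and the socket host/port are placeholders for illustration.
        val ssc = new StreamingContext("local[2]", "StreamingWordCount", Seconds(5))
        val lines = ssc.socketTextStream("localhost", 9999)
        val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }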
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller wrote:
> I am running the latest version of PySpark branch-0.9 and having some
> trouble with join.
>
> One RDD is about 100G (25GB compressed and serialized in memory) with
> 130K records, the other RDD is about 10G (2.5G compressed and
> serialized in
Great!
When I built it on another disk formatted as ext4, it works now.
hadoop@ubuntu-1:~$ df -Th
Filesystem  Type      Size  Used  Avail  Use%  Mounted on
/dev/sdb6   ext4      135G  8.6G  119G     7%  /
udev        devtmpfs  7.7G  4.0K  7.7G     1%  /dev
tmpfs
1.: I will paste the full content of the environment page of the example
application running against the cluster at the end of this message.
2. and 3.: Following #2 I was able to see that the count was incorrectly 0
when running against the cluster, and following #3 I was able to get the
messa
OK, yeah, we are using StageInfo and TaskInfo too...
On Mon, Apr 7, 2014 at 8:51 PM, Andrew Or wrote:
> Hi Koert,
>
> Other users have expressed interest for us to expose similar classes too
> (i.e. StageInfo, TaskInfo). In the newest release, they will be available
> as part of the developer API
Hi Koert,
Other users have expressed interest for us to expose similar classes too
(i.e. StageInfo, TaskInfo). In the newest release, they will be available
as part of the developer API. The particular PR that will change this is:
https://github.com/apache/spark/pull/274.
Cheers,
Andrew
On Mon,
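As a rough illustration of consuming StageInfo/TaskInfo once they are exposed, a SparkListener along these lines could work; exact event and field names vary between releases, so treat this as a sketch rather than the final developer API:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

    // Sketch of reading stage/task status through a listener.
    class StatusListener extends SparkListener {
      override def onStageCompleted(stageCompleted: SparkListenerStageCompleted) {
        val info = stageCompleted.stageInfo
        println("stage " + info.stageId + " finished: " + info.name)
      }

      override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
        println("task " + taskEnd.taskInfo.taskId + " took " + taskEnd.taskInfo.duration + " ms")
      }
    }

    // sc.addSparkListener(new StatusListener())   // register on the SparkContext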
Any reason why RDDInfo suddenly became private in SPARK-1132?
We are using it to show users the status of RDDs.
A few things that would be helpful:
1. Environment settings - you can find them on the environment tab in the
Spark application UI.
2. Are you setting the HDFS configuration correctly in your Spark program?
For example, can you write an HDFS file from a Spark program (say
spark-shell) to your HDFS ins
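For point 2, a quick sanity check from spark-shell could look like the sketch below; the namenode URI and path are placeholders:

    // If this fails, the HDFS configuration (or the placeholder namenode URI)
    // is the likely culprit.
    val data = sc.parallelize(1 to 100)
    data.saveAsTextFile("hdfs://namenode-host:8020/tmp/spark-hdfs-check")
    println(sc.textFile("hdfs://namenode-host:8020/tmp/spark-hdfs-check").count())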
Hi,
I am looking for users of Spark to join my teams here at Amazon. If you are
reading this, you probably qualify.
I am looking for developers of ANY level, but with an interest in Spark. My
teams are leveraging Spark to solve real business scenarios.
If you are interested, just shoot me a note a
Thanks Shivaram! Will give it a try and let you know.
Regards,
Pawan Venugopal
On Mon, Apr 7, 2014 at 3:38 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:
> You can create standalone jobs in SparkR as just R files that are run
> using the sparkR script. These commands will be sen
Hello, Spark community! My name is Paul. I am a Spark newbie, evaluating
version 0.9.0 without any Hadoop at all, and need some help. I run into the
following error with the StatefulNetworkWordCount example (and similarly in my
prototype app, when I use the updateStateByKey operation). I get t
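The error text is cut off above, but for reference, a minimal updateStateByKey sketch looks roughly like the following; note that stateful operations need a checkpoint directory, and the host, port, and paths here are placeholders:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    // Stateful word count; a checkpoint directory is required for updateStateByKey.
    val ssc = new StreamingContext("local[2]", "StatefulWordCount", Seconds(1))
    ssc.checkpoint("/tmp/spark-checkpoint")

    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      Some(values.sum + state.getOrElse(0))
    }

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val stateDstream = words.map((_, 1)).updateStateByKey[Int](updateFunc)
    stateDstream.print()
    ssc.start()
    ssc.awaitTermination()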
You can create standalone jobs in SparkR as just R files that are run using
the sparkR script. These commands will be sent to a Spark cluster and the
examples on the SparkR repository (
https://github.com/amplab-extras/SparkR-pkg#examples-unit-tests) are in
fact standalone jobs.
However I don't th
Hmm -- that is strange. Can you paste the command you are using to launch
the instances? The typical workflow is to use the spark-ec2 wrapper script,
following the guidelines at http://spark.apache.org/docs/latest/ec2-scripts.html
Shivaram
On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini <
silvio.co
Hi Guys,
I would like to understand why the driver's RAM goes down. Does the
processing occur only on the workers?
Thanks
# Start Tests
computer1 (Worker / Source Stream)
23:57:18 up 12:03,  1 user,  load average: 0.03, 0.31, 0.44
             total       used       free     shared    buffers
Hi,
Is it possible to create a standalone job in Scala using SparkR? If so,
can you provide me with information about the setup process
(like the dependencies in SBT and where to include the JAR files)?
This is my use case:
1. I have a Spark Streaming standalone job running in local machin
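As a rough idea of the SBT side (SparkR itself is an R package and is not pulled in through SBT), a build.sbt for a standalone Spark Streaming job might look like the sketch below; the version numbers are examples and should match your cluster:

    // Example build.sbt; adjust versions to your cluster.
    name := "my-streaming-job"

    version := "0.1"

    scalaVersion := "2.10.3"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "0.9.1",
      "org.apache.spark" %% "spark-streaming" % "0.9.1"
    )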
Hi Shivaram,
OK, so let's assume the script CANNOT take a different user and that it must
be 'root'. The typical workaround is, as you said, to allow ssh with the
root user. Now, don't laugh, but this worked last Friday, yet today
(Monday) it no longer works. :D Why? ...
...It seems that NOW, whe
Right now the spark-ec2 scripts assume that you have root access, and a lot
of internal scripts assume the user's home directory is hard-coded as
/root. However, all the Spark AMIs we build should have root ssh access --
do you find this not to be the case?
You can also enable root ssh access i
Got it, thanks.
On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng wrote:
> This is fixed in https://github.com/apache/spark/pull/281. Please try
> again with the latest master. -Xiangrui
>
> On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers wrote:
> > i noticed that for spark 1.0.0-SNAPSHOT which i chec
This is fixed in https://github.com/apache/spark/pull/281. Please try
again with the latest master. -Xiangrui
On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers wrote:
> i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few days ago
> (apr 5) that the "application detail ui" no longer show
I noticed that for Spark 1.0.0-SNAPSHOT, which I checked out a few days ago
(Apr 5), the "application detail UI" no longer shows any RDDs on the
storage tab, despite the fact that they are definitely cached.
I am running Spark in standalone mode.
I might be wrong here, but I don't believe it's discouraged. Maybe part
of the reason there aren't a lot of examples is that sql2rdd returns an
RDD (a TableRDD, that is:
https://github.com/amplab/shark/blob/master/src/main/scala/shark/SharkContext.scala).
I haven't done anything too complicated yet but m
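A rough sketch of what using sql2rdd from the Shark shell looks like (sc here is assumed to be a SharkContext; the table, columns, and the row accessor are assumptions for illustration, so check shark.api.Row for the exact accessor names in your Shark version):

    // Runs a Hive-compatible query and hands back an RDD of rows.
    val tableRdd = sc.sql2rdd("SELECT page, hits FROM weblogs WHERE hits > 100")
    println(tableRdd.count())
    // val pages = tableRdd.map(row => row.getString(0))   // hypothetical accessor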
That work is under submission at an academic conference and will be made
available if/when the paper is published.
In terms of algorithms for hyperparameter tuning, we consider Grid Search,
Random Search, a couple of older derivative-free optimization methods, and
a few newer methods - TPE (aka Hy
Hi,
I'm trying to use SparkContext.addFile() to propagate a file to worker
nodes, in a standalone cluster (2 nodes, 1 master, 1 worker connected to the
master). I don't have HDFS or any distributed file system. Just playing with
basic stuff.
Here's the code in my driver (actually spark-shell runnin
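The driver code is cut off above, but the usual addFile pattern looks roughly like the sketch below (the file path is a placeholder; sc is the spark-shell SparkContext):

    import org.apache.spark.SparkFiles

    // The driver registers the file; each task resolves its local copy
    // through SparkFiles.get.
    sc.addFile("/home/user/lookup.txt")

    val firstLines = sc.parallelize(1 to 4, 4).map { _ =>
      val localPath = SparkFiles.get("lookup.txt")
      scala.io.Source.fromFile(localPath).getLines().next()
    }
    println(firstLines.collect().mkString(", "))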
Hi all,
On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh.
Also, it is the default user for the Spark-EC2 script.
Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
instead of 'root'.
I can see that the Spark-EC2 script allows you to specify which user to l
Hi TD
Could you explain this code part to me?
.reduceByKeyAndWindow(
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 + i2; }
    },
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 - i2;
It might help if I clarify my questions. :-)
1. Is persist() applied during the transformation right before the
persist() call in the graph? Or is it applied after the transform's
processing is complete? In the case of things like GroupBy, is the Seq
backed by disk as it is being created? We're tr
Yeah, the reason it happens is that sortByKey tries to sample the data to
figure out the right range partitions for it. But we could do this later, as
the suggestion in there says.
Matei
On Apr 7, 2014, at 10:06 AM, Diana Carroll wrote:
> Aha! Well I'm not crazy then, thanks.
>
>
> On Mon,
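A small illustration of what Matei describes: the sortByKey call itself launches a sampling job to pick range-partition boundaries, before any explicit action runs (a sketch against a toy RDD):

    import org.apache.spark.SparkContext._   // pair-RDD functions (implicit in spark-shell)

    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
    val sorted = pairs.sortByKey()   // a sampling job already appears in the UI here
    sorted.collect().foreach(println)  // the actual sort runs when an action is called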
Hi,
We have a situation where a PySpark script works fine as a local process
("local" URL) on the master and the worker nodes, which would indicate that
all Python dependencies are set up properly on each machine.
But when we try to run the script at the cluster level (using the master's
URL), if
Thanks Rahul, let me try that.
On Apr 7, 2014 7:33 PM, "Rahul Singhal" wrote:
> Hi Sai,
>
> I recently also ran into this problem on 0.9.1. The problem is that
> spark tries to read yarn's class path but when it finds it be empty does
> not fallback to it's default value. To resolve this, eith
> For issue #2 I was concerned that the build & packaging had to be
> internal. So I am using the already packaged make-distribution.sh
> (modified to use a Maven build) to create a tarball, which I then package
> using an RPM spec file.
Hi Rahul, so the issue for downstream operating system dis
Hi Sai,
I recently also ran into this problem on 0.9.1. The problem is that Spark tries
to read YARN's class path, but when it finds it to be empty it does not fall back
to its default value. To resolve this, either set yarn.application.classpath in
yarn-site.xml to its default value or put in a bug f
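For reference, a rough yarn-site.xml entry along these lines should work; the value shown is the usual yarn-default.xml default for Hadoop 2.x, so check your distribution's yarn-default.xml for the exact string:

    <!-- Sketch: set the application classpath explicitly in yarn-site.xml. -->
    <property>
      <name>yarn.application.classpath</name>
      <value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*</value>
    </property>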
Hi All,
I want to get Spark on YARN up and running.
I did "SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true ./sbt/sbt assembly"
Then I ran
"SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar
SPARK_YARN_APP_JAR=examples/target/scala-2.9.3/spark-examples_2.9.3-0.8.1
Hi,
I was going through Matei's Advanced Spark presentation at
https://www.youtube.com/watch?v=w0Tisli7zn4 and had a few questions.
The slides for the video are at
http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf
The PageRank example int
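For readers who have not watched the talk, the PageRank loop in Spark generally looks like the sketch below (a toy link graph, 10 iterations, untuned):

    import org.apache.spark.SparkContext._   // pair-RDD functions (implicit in spark-shell)

    // Toy link graph: (url, list of outgoing neighbors).
    val links = sc.parallelize(Seq(
      ("a", Seq("b", "c")), ("b", Seq("c")), ("c", Seq("a")))).cache()

    var ranks = links.mapValues(_ => 1.0)
    for (i <- 1 to 10) {
      val contribs = links.join(ranks).flatMap { case (_, (urls, rank)) =>
        urls.map(dest => (dest, rank / urls.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }
    ranks.collect().foreach(println)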
Hi Shark,
Should I assume that Shark users should not use the Shark APIs, since there
is no documentation for them? If there is documentation, can you point it
out?
Best Regards,
Jerry
On Thu, Apr 3, 2014 at 9:24 PM, Jerry Lam wrote:
> Hello everyone,
>
> I have successfully installed Shark
Hi,
Any thoughts on this? Thanks.
-Suren
On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman <
suren.hira...@velos.io> wrote:
> Hi,
>
> I know if we call persist with the right options, we can have Spark
> persist an RDD's data on disk.
>
> I am wondering what happens in intermediate operat
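For context, the "right options" are a StorageLevel passed to persist; a sketch of keeping grouped data on disk (the input path is a placeholder, run from spark-shell):

    import org.apache.spark.SparkContext._   // pair-RDD functions (implicit in spark-shell)
    import org.apache.spark.storage.StorageLevel

    // Ask Spark to keep the grouped data on disk rather than in memory.
    val grouped = sc.textFile("hdfs:///data/events")
      .map(line => (line.split(",")(0), line))
      .groupByKey()
      .persist(StorageLevel.DISK_ONLY)

    println(grouped.count())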
Dear all,
We have a Spark 0.8.1 cluster on Mesos 0.15. Some of my colleagues are
familiar with Python, but some features are developed in Java. I am
looking for a way to integrate Java and Python on Spark.
I notice that the initialization of PySpark does not include a field to
distribute ja
I am seeing a small standalone cluster (master, slave) hang when I reach a
certain memory threshold, but I cannot figure out how to configure memory to
avoid this.
I added memory by configuring SPARK_DAEMON_MEMORY=2G, and I can see this
allocated, but it does not help.
The reduce is by key to get th
Can you provide an example?