Hi all!
I use Spark SQL 1.2 to start the Thrift server on YARN.
I want to use the fair scheduler in the Thrift server.
I set the properties in spark-defaults.conf like this:
spark.scheduler.mode FAIR
spark.scheduler.allocation.file
/opt/spark-1.2.0-bin-2.4.1/conf/fairscheduler.xml
In the thrift server
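As a general illustration (not specific to the Thrift server), a minimal Scala sketch of how a fair-scheduler pool is selected once spark.scheduler.mode and spark.scheduler.allocation.file are set as above; the pool name "thrift-pool" is a made-up example that would have to exist in fairscheduler.xml:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("FairPoolSketch"))
// jobs submitted from this thread go to the (hypothetical) "thrift-pool" pool
sc.setLocalProperty("spark.scheduler.pool", "thrift-pool")
sc.parallelize(1 to 100).count()
// clear the property to fall back to the default pool
sc.setLocalProperty("spark.scheduler.pool", null)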
Thanks for your answer, Xuefeng Wu.
But I don't understand how to save a graph as an object. :(
Do you have any sample code?
2014-12-31 13:27 GMT+09:00 Jason Hong :
> Thanks for your answer, Xuefeng Wu.
>
> But, I don't understand how to save a graph as object. :(
>
> Do you have any sample code
This is still using a non-existent hadoop-2.5 profile, and
-Dscala-2.10 won't do anything. These don't matter though; this error
is just some scalac problem. I don't see this error when compiling.
On Wed, Dec 31, 2014 at 12:48 AM, j_soft wrote:
> no, it still fails using mvn -Pyarn -Phadoop-2.5 -Dhad
Thanks Matei.
-D
On Tue, Dec 30, 2014 at 4:49 PM, Matei Zaharia
wrote:
> This file needs to be on your CLASSPATH actually, not just in a directory.
> The best way to pass it in is probably to package it into your application
> JAR. You can put it in src/main/resources in a Maven or SBT project,
Does anyone have suggestions?
On Tue, Dec 23, 2014 at 3:08 PM, Chen Song wrote:
> Silly question, what is the best way to shuffle protobuf messages in Spark
> (Streaming) job? Can I use Kryo on top of protobuf Message type?
>
> --
> Chen Song
>
>
--
Chen Song
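Not an authoritative answer, but one approach sometimes used is to register the protobuf-generated classes with Kryo through a custom registrator. A rough sketch, assuming the chill-protobuf artifact is on the classpath and using a made-up generated class com.example.protos.MyProtoMessage:

import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.protobuf.ProtobufSerializer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class ProtoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // register each generated Message type with a protobuf-aware serializer
    kryo.register(classOf[com.example.protos.MyProtoMessage], new ProtobufSerializer())
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "ProtoRegistrator") // use the fully-qualified name if the class lives in a package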
I have a Spark app that involves a series of mapPartitions operations and
then a keyBy operation. I have measured the time inside the mapPartitions
function blocks; these blocks take trivial time. Still, the application takes
far too much time, and even the Spark UI shows that much time.
So I was wondering where d
https://issues.apache.org/jira/browse/SPARK-1911 is one of several tickets on the problem.
> On Dec 30, 2014, at 8:36 PM, Davies Liu wrote:
>
> Could you share a link about this? It's common to use Java 7, so it would
> be nice if we can fix this.
>
> On Mon, Dec 29, 2014 at 1:27 PM, Eric Fried
I am running the job on 1.1.1.
I will let the job run overnight and send you more info on computation vs GC
time tomorrow.
BTW, do you know what the stage description named "getCallSite at
DStream.scala:294" might mean?
Thanks, RK
On Tuesday, December 30, 2014 6:02 PM, Tathagata Das
wrot
Which version of Spark Streaming are you using?
When the batch processing time increases to 15-20 seconds, could you
compare the task times with the task times when the application
has just launched? Basically, is the increase from 6 seconds to 15-20
seconds caused by an increase in computati
That is kind of expected due to data locality. Though you should see
some tasks running on the executors as the data gets replicated to
other nodes, which can therefore run tasks based on locality. You have
two solutions:
1. kafkaStream.repartition() to explicitly repartition the received
data across
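A rough sketch of option 1, building on the kafkaStream variable mentioned above (the partition count is an arbitrary example; match it to your total executor cores):

// spread the received blocks across the cluster before the heavy processing starts
val repartitioned = kafkaStream.repartition(16)
repartitioned.foreachRDD { rdd =>
  println(s"partitions after repartition: ${rdd.partitions.length}")
}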
1. Of course, a single block / partition has many Kafka messages, and
from different Kafka topics interleaved together. The message count is
not related to the block count. Any message received within a
particular block interval will go in the same block.
2. Yes, the receiver will be started on an
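A small related sketch: the number of blocks per batch is driven by spark.streaming.blockInterval (value assumed to be in milliseconds in this Spark version), so a 10-second batch with a 200 ms block interval yields roughly 50 blocks, i.e. about 50 tasks per receiver per batch.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("BlockIntervalSketch")
  .set("spark.streaming.blockInterval", "200") // milliseconds (assumed default is 200)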
There is a known bug with local scheduler, will be fixed by
https://github.com/apache/spark/pull/3779
On Sun, Dec 21, 2014 at 10:57 PM, Samarth Mailinglist
wrote:
> I’m trying to run the stateful network word count at
> https://github.com/apache/spark/blob/master/examples/src/main/python/streamin
Could you share a link about this? It's common to use Java 7, so it would
be nice if we can fix this.
On Mon, Dec 29, 2014 at 1:27 PM, Eric Friedman
wrote:
> Was your spark assembly jarred with Java 7? There's a known issue with jar
> files made with that version. It prevents them from being used
Hey Josh,
I am still trying to prune this to a minimal example, but it has been
tricky since scale seems to be a factor. The job runs over ~720GB of data
(the cluster's total RAM is around ~900GB, split across 32 executors). I've
managed to run it over a vastly smaller data set without issues. Cur
This file needs to be on your CLASSPATH actually, not just in a directory. The
best way to pass it in is probably to package it into your application JAR. You
can put it in src/main/resources in a Maven or SBT project, and check that it
makes it into the JAR using jar tf yourfile.jar.
Matei
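A quick sanity check along the same lines (just a sketch; "myconf.properties" is a placeholder for whatever file you packaged): ask the classloader at runtime whether the file is actually visible on the classpath.

val found = Option(getClass.getClassLoader.getResource("myconf.properties"))
println(found.map(_.toString).getOrElse("myconf.properties is NOT on the classpath"))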
>
no, it still fails using mvn -Pyarn -Phadoop-2.5 -Dhadoop.version=2.5.0
-Dscala-2.10 -X -DskipTests clean package
...
[DEBUG] /opt/xdsp/spark-1.2.0/core/src/main/scala
[DEBUG] includes = [**/*.scala,**/*.java,]
[DEBUG] excludes = []
[WARNING] Zinc server is not available a
Hi Patrick, to follow up on the below discussion, I am including a short code
snippet that produces the problem on 1.1. This is kind of stupid code since
it’s a greatly simplified version of what I’m actually doing but it has a
number of the key components in place. I’m also including some examp
I am not sure of the way I can pass the jets3t.properties file to
spark-submit. The --file option does not seem to work.
Can someone please help me? My production Spark jobs sporadically hang when
reading S3 files.
Thanks,
-D
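For reference, the spark-submit flag is --files (plural), and the file is shipped to each executor's working directory. A hedged example invocation (class and JAR names are placeholders):

spark-submit \
  --files /path/to/jets3t.properties \
  --class com.example.MyS3Job \
  my-job-assembly.jar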
Hi,
I am trying to use MultipleTextOutputFormat to rename the output files
of my Spark job to something different from the default "part-N".
I have implemented a custom MultipleTextOutputFormat class as follows:
*class DriveOutputRenameMultipleTextOutputFormat extends
MultipleTextOutputFor
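For anyone searching later, a minimal sketch of this kind of override; the key-to-filename mapping here is only an example:

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class DriveOutputRenameMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // name each output file after the record key instead of the default "part-NNNNN"
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString
}

// used with the old (mapred) API, e.g.:
// pairRdd.saveAsHadoopFile("/out/path", classOf[String], classOf[String],
//   classOf[DriveOutputRenameMultipleTextOutputFormat])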
Hi Experts,
A few general queries:
1. Can a single block/partition in an RDD have more than one Kafka message,
or will there be one and only one Kafka message per block? More broadly, is
the message count related to the block in any way, or is it just that any
message received within a particular b
Here is the code for my streaming job.
val sparkConf = new SparkConf().setAppName("SparkStreamingJob")
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.default.parallelism", "100")
sparkConf.s
I'm submitting a script using spark-submit in local mode for testing, and
I'm having trouble figuring out where the logs are stored. The
documentation indicates that they should be in the work folder in the
directory in which Spark lives on my system, but I see no such folder there.
I've set the S
Thanks. Will look at other options.
On Tue, Dec 30, 2014 at 11:43 AM, Tathagata Das wrote:
> I am not sure that can be done. Receivers are designed to be run only
> on the executors/workers, whereas a SQLContext (for using Spark SQL)
> can only be defined on the driver.
>
>
> On Mon, Dec 29, 201
To configure the Python executable used by PySpark, see the "Using the
Shell" Python section in the Spark Programming Guide:
https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell
You can set the PYSPARK_PYTHON environment variable to choose the Python
executable that will be
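A minimal sketch (the Anaconda path is a placeholder for wherever it is installed on your nodes; it must exist at the same path on every worker):

export PYSPARK_PYTHON=/opt/anaconda/bin/python
bin/pyspark            # or: spark-submit your_script.py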
Hi Sven,
Do you have a small example program that you can share which will allow me
to reproduce this issue? If you have a workload that runs into this, you
should be able to keep iteratively simplifying the job and reducing the
data set size until you hit a fairly minimal reproduction (assuming
For windows that large (1 hour), you will probably also have to
increase the batch interval for efficiency.
TD
On Mon, Dec 29, 2014 at 12:16 AM, Akhil Das wrote:
> You can use reduceByKeyAndWindow for that. Here's a pretty clean example
> https://github.com/apache/spark/blob/master/examples/src/
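A self-contained sketch of a large-window count, assuming a socket text source and a 30-second batch interval (both placeholders); the inverse-function form keeps the window update incremental and requires checkpointing:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(30))
ssc.checkpoint("/tmp/window-checkpoint") // required for the inverse-function form
val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
val pairs = lines.flatMap(_.split(" ")).map((_, 1))
// 1-hour window sliding every 5 minutes; the second function removes values leaving the window
val counts = pairs.reduceByKeyAndWindow(_ + _, _ - _, Minutes(60), Minutes(5))
counts.print()
ssc.start()
ssc.awaitTermination()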
I am not sure that can be done. Receivers are designed to be run only
on the executors/workers, whereas a SQLContext (for using Spark SQL)
can only be defined on the driver.
On Mon, Dec 29, 2014 at 6:45 PM, sranga wrote:
> Hi
>
> Could Spark-SQL be used from within a custom actor that acts as a
Yes. I can do a just in time init… I can see that the first map was done.
However, I can't see that the last map was done, I think... and the shutdown
is the key part. Without it, all my daemon threads won't properly exit and
I will not have all messages sent over the wire.
On Sun, Dec 28, 2014 at
Hi
Does Spark have a built-in possibility of exposing the current value of an
Accumulator [1] using Monitoring and Instrumentation [2]?
Unfortunately, I couldn't find anything in the Sources which could be used.
Does it mean the only way to expose the current accumulator value is to
implement a new Source which would hook
Hi
I am using Anaconda Python. Is there any way to specify the Python that we
have to use for running PySpark in a cluster?
Best regards
Jagan
On Tue, Dec 30, 2014 at 6:27 PM, Eric Friedman
wrote:
> The Python installed in your cluster is 2.5. You need at least 2.6.
>
>
> Eric Friedman
>
Anytime you see "java.lang.NoSuchMethodError" it means that you have
multiple conflicting versions of a library on the classpath, or you are
trying to run code that was compiled against the wrong version of a library.
On Tue, Dec 30, 2014 at 1:43 AM, sachin Singh
wrote:
> I have a table(csv file
Without caching, each action is recomputed. So assuming rdd2 and rdd3
result in separate actions, the answer is yes.
On Mon, Dec 29, 2014 at 7:53 PM, Corey Nolet wrote:
> If I have 2 RDDs which depend on the same RDD like the following:
>
> val rdd1 = ...
>
> val rdd2 = rdd1.groupBy()...
>
> val rdd3
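A small sketch of the usual fix (with synthetic data): cache rdd1 so the two downstream actions reuse it instead of each recomputing it from the source.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("CacheSketch").setMaster("local[2]"))
val rdd1 = sc.parallelize(1 to 1000000).map(_ * 2).cache()
val rdd2 = rdd1.filter(_ % 3 == 0)
val rdd3 = rdd1.filter(_ % 5 == 0)
println(rdd2.count()) // first action computes rdd1 and caches it
println(rdd3.count()) // second action reuses the cached rdd1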
Hi all,
I'm investigating Spark for a new project and I'm trying to use
spark-jobserver because... I need to reuse and share RDDs, and from what I
read in the forum that's the "standard" :D
It turns out that spark-jobserver doesn't seem to work on YARN, or at least
it does not on 1.1.1.
My config
This here may also be of help:
http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html.
Make sure to spread your objects across multiple partitions to not be rate
limited by S3.
-Sven
On Mon, Dec 22, 2014 at 10:20 AM, durga katakam wrote:
> Yes . I am reading thousan
I'm half-way there.
I followed these steps:
1. compiled and installed the OpenBLAS library
2. ln -s libopenblas_sandybridgep-r0.2.13.so /usr/lib/libblas.so.3
3. compiled and built Spark:
mvn -Pnetlib-lgpl -DskipTests clean compile package
So far so good. Then I ran into problems when testing the solution:
bin/run-exampl
Hey all,
Since upgrading to 1.2.0 a pyspark job that worked fine in 1.1.1 fails
during shuffle. I've tried reverting from the sort-based shuffle back to
the hash one, and that fails as well. Does anyone see similar problems or
has an idea on where to look next?
For the sort-based shuffle I get a
Hi Michael,
I’ve looked through the example and the test cases and I think I understand
what we need to do - so I’ll give it a go.
I think what I’d like to try to do is allow files to be added at anytime, so
perhaps I can cache partition info, and also what may be useful for us would be
to d
Did you check firewall rules in security groups?
On Tue, Dec 30, 2014, 9:34 PM Laeeq Ahmed
wrote:
> Hi,
>
> I am using spark standalone on EC2. I can access ephemeral hdfs from
> spark-shell interface but I can't access hdfs in standalone application. I
> am using spark 1.2.0 with hadoop 2.4.0 a
I've been working with Spark 1.2 and Mesos 0.21.0 and while I have set the
spark.executor.uri within spark-env.sh (and directly within bash as well),
the Mesos slaves do not seem to be able to access the spark tgz file via
HTTP or HDFS as per the message below.
14/12/30 15:57:35 INFO SparkILoop:
Hi,
I am using Spark standalone on EC2. I can access ephemeral HDFS from the
spark-shell interface, but I can't access HDFS in a standalone application. I
am using Spark 1.2.0 with Hadoop 2.4.0 and launched the cluster from the ec2
folder on my local machine. In my POM file I have given the Hadoop client as 2.4.
Do your debug printlns show values? I.e., what would you see if in rowToString
you output println("row to string " + row + " " + sub)?
Another thing to check would be to do schemaRDD.take(3) or something to
make sure you actually have data.
You can also try this: rowToString(schemaRDD.first, list) a
Some time ago I did the (2) approach, I installed Anaconda on every node.
But to avoid screwing RedHat (it was CentOS in my case, which is the same)
I installed Anaconda on every node using the user "yarn" and made it the
default python only for that user.
After you install it, Anaconda asks if i
I'm not sure exactly what you're trying to do, but take a look at
rdd.toLocalIterator if you haven't already.
On Tue, Dec 30, 2014 at 6:16 AM, Sean Owen wrote:
> collect()-ing a partition still implies copying it to the driver, but
> you're suggesting you can't collect() the whole data set to th
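A minimal sketch of the toLocalIterator suggestion, with synthetic data standing in for the real 43-partition data set: it pulls one partition at a time to the driver instead of materializing everything with collect().

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("LocalIteratorSketch").setMaster("local[2]"))
val largeRdd = sc.parallelize(1 to 1000000, 43)
largeRdd.toLocalIterator.foreach { record =>
  // handle one record at a time on the driver (placeholder logic)
  if (record % 250000 == 0) println(s"processed up to $record")
}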
Hi.
I'm trying to configure a Spark standalone cluster, with three master nodes
(bigdata1, bigdata2 and bigdata3) managed by ZooKeeper.
It seems there's a configuration problem, since every one of them says it is
the cluster leader:
.
14/12/30 13:54:59 INFO Master: I have been elec
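For comparison, ZooKeeper-based master election is normally enabled with the following properties in conf/spark-env.sh on every master (hosts and paths below are examples); if they are missing, each master falls back to single-node recovery and considers itself the leader.

# conf/spark-env.sh on bigdata1, bigdata2 and bigdata3 (example values)
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=bigdata1:2181,bigdata2:2181,bigdata3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"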
How about saving it as an object file?
Yours respectfully, Xuefeng Wu (吴雪峰)
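A minimal Scala sketch of what "save as object" can look like with GraphX: save the vertex and edge RDDs as object files and rebuild the Graph later (the paths and the tiny example graph are placeholders; both jobs need compatible Spark/Scala versions since object files use Java serialization).

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

val sc = new SparkContext(new SparkConf().setAppName("GraphSaveSketch").setMaster("local[2]"))
// tiny stand-in graph; vertex and edge attributes are just Strings here
val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows")))
val graph = Graph(vertices, edges)

// save vertices and edges separately as object files
graph.vertices.saveAsObjectFile("hdfs:///tmp/graph/vertices")
graph.edges.saveAsObjectFile("hdfs:///tmp/graph/edges")

// later, in another job: load them back and rebuild the graph
val loadedVertices = sc.objectFile[(VertexId, String)]("hdfs:///tmp/graph/vertices")
val loadedEdges = sc.objectFile[Edge[String]]("hdfs:///tmp/graph/edges")
val reloaded: Graph[String, String] = Graph(loadedVertices, loadedEdges)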
> On Dec 30, 2014, at 9:27 PM, Jason Hong wrote:
>
> Dear all:)
>
> We're trying to make a graph using large input data and get a subgraph
> by applying some filter.
>
> Now, we want to save this graph to HDFS so that we can load it later.
>
> Is it p
Dear all :)
We're trying to make a graph using large input data and get a subgraph
by applying some filter.
Now, we want to save this graph to HDFS so that we can load it later.
Is it possible to store a graph or subgraph directly into HDFS and load it
as a graph for future use?
We will be glad for your su
The Python installed in your cluster is 2.5. You need at least 2.6.
Eric Friedman
> On Dec 30, 2014, at 7:45 AM, Jaggu wrote:
>
> Hi Team,
>
> I was trying to execute a PySpark code in a cluster. It gives me the following
> error. (When I run the same job locally it works fine too :-(
Hi Team,
I was trying to execute a PySpark code in a cluster. It gives me the following
error. (When I run the same job locally it works fine too :-()
Error from python worker:
/usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/context.py:209: Warning:
'with' will become a reserved keyw
On Tue, Dec 30, 2014 at 11:45 AM, xhudik wrote:
> I also tried Java8 and scala 2.11 (no -Dscala.usejavacp=true), but I failed
> for some other problem:
>
> /mvn -Pyarn -Phadoop-2.5 -Dhadoop.version=2.5.0 -Dscala-2.11 -X -DskipTests
> clean package
There is no "hadoop-2.5" profile, as the output t
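For the record, the Spark 1.2 build documentation handles Hadoop 2.5.x under the hadoop-2.4 profile, so an invocation along these lines is the usual one (hedged; adjust to your environment):

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0 -DskipTests clean package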
collect()-ing a partition still implies copying it to the driver, but
you're suggesting you can't collect() the whole data set to the
driver. What do you mean: collect() 1 partition? or collect() some
smaller result from each partition?
On Tue, Dec 30, 2014 at 11:54 AM, DEVAN M.S. wrote:
> Hi all
Thanks Abhishek. We are good now with an answer to try.
Hi all,
I have one large data set. When I get the number of partitions, it shows 43.
We can't collect() the large data set into memory, so I am thinking of
collect()-ing each partition separately so that each piece will be small in
size.
Any thoughts?
Hi,
well, Spark 1.2 was prepared for Scala 2.10. If you want a stable and fully
functional tool, I'd compile it with this default compiler.
*I was able to compile Spark 1.2 with Java 7 and Scala 2.10 seamlessly.*
I also tried Java 8 and Scala 2.11 (no -Dscala.usejavacp=true), but I failed
for some other prob
Frankly speaking, I never tried this volume in practice, but I believe it
should work.
On 30 Dec 2014 15:26, "Sasi [via Apache Spark User List]" <
ml-node+s1001560n20902...@n3.nabble.com> wrote:
> Thanks Abhishek. We understand your point and will try using REST URL.
> However one concern, we ha
This poor soul had the exact same problem and solution:
http://stackoverflow.com/questions/24083332/write-and-read-raw-byte-arrays-in-spark-using-sequence-file-sequencefile
On Tue, Dec 30, 2014 at 10:58 AM, Enno Shioji wrote:
> Hi, I'm facing a weird issue. Any help appreciated.
>
> When I e
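For anyone hitting this later, the usual cause is that Hadoop reuses and pads the BytesWritable backing array, so only the first getLength bytes are valid. A small sketch of the read-side fix (the path is a placeholder):

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("SeqFileCopySketch").setMaster("local[2]"))
val raw = sc.sequenceFile("/tmp/input.seq", classOf[NullWritable], classOf[BytesWritable])
// copy only the valid portion of the (reused, padded) backing array
val fixed = raw.map { case (_, bw) => java.util.Arrays.copyOfRange(bw.getBytes, 0, bw.getLength) }
fixed.map(bytes => new String(bytes, "UTF-8")).take(3).foreach(println)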
Thanks, Sean.
That was helpful.
Regards,
Sam
On Dec 30, 2014, at 4:12 PM, Sean Owen wrote:
> The DStream model is one RDD of data per interval, yes. foreachRDD
> performs an operation on each RDD in the stream, which means it is
> executed once* for the one RDD in each interval.
>
> * ignoring th
Hi, I'm facing a weird issue. Any help appreciated.
When I execute the below code and compare "input" and "output", each record
in the output has some extra trailing data appended to it, and hence
corrupted. I'm just reading and writing, so the input and output should be
exactly the same.
I'm usi
The DStream model is one RDD of data per interval, yes. foreachRDD
performs an operation on each RDD in the stream, which means it is
executed once* for the one RDD in each interval.
* ignoring the possibility here of failure and retry of course
On Mon, Dec 29, 2014 at 4:49 PM, SamyaMaiti wrote:
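A compact sketch of that model, assuming a 10-second batch interval and a socket source (both placeholders): the foreachRDD body runs once per interval with the single RDD holding that interval's data.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("ForeachRDDSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
lines.foreachRDD { rdd =>
  // executed on the driver once per 10-second interval
  println(s"records in this batch: ${rdd.count()}")
}
ssc.start()
ssc.awaitTermination()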
While doing a JOIN operation on three tables using Spark 1.1.1, I always got
the following error. However, I never hit this exception in Spark 1.1.0 with
the same operation and the same data. Has anyone met this problem?
14/12/30 17:49:33 ERROR CliDriver:
org.apache.hadoop.hive.ql.metadata.Hi
foreach iterates through the partitions in the RDD and executes the operations
for each partition, I guess.
> On 29-Dec-2014, at 10:19 pm, SamyaMaiti wrote:
>
> Hi All,
>
> Please clarify.
>
> Can we say 1 RDD is generated every batch interval?
>
> If the above is true. Then, is the foreachRDD()
Thanks Abhishek. We understand your point and will try using the REST URL.
However, one concern: we currently have around 100,000 (1 lakh) rows in our
Cassandra table. Can the REST URL result withstand that response size?
I have a table (CSV file); I loaded data from it by creating a POJO as per the
table structure, and created a SchemaRDD as follows:
JavaRDD testSchema =
sc.textFile("D:/testTable.csv").map(GetTableData); /* GetTableData will
transform all the table data into testTable objects */
JavaSchemaRDD schemaTest = sqlContext.ap
Thanks Sandy, it was the issue with the number of cores.
Another issue I was facing is that tasks are not getting distributed evenly
among all executors and are running at the NODE_LOCAL locality level, i.e.
all the tasks are running on the same executor where my Kafka receiver(s)
are running, even thoug
Ohh...
Just curious: we did a similar use case to yours, getting data out of
Cassandra. Since the job server is a REST architecture, all we need is a URL
to access it. Why does integrating with your framework matter here when all
we need is a URL?
On 30 Dec 2014 14:05, "Sasi [via Apache Spark User List]" <
KMeans really needs the number of clusters identified in advance. There
are multiple algorithms (X-Means, ART, ...) which do not need this
information. Unfortunately, none of them is implemented in MLlib for the
moment (you can give a hand and help the community).
Anyway, it seems to me you will not
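As a rough illustration of working around the fixed-k requirement with what MLlib does provide (synthetic data): train for several candidate k values and compare the within-set cost.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("KMeansSketch").setMaster("local[2]"))
val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1))).cache()
for (k <- 2 to 3) {
  val model = KMeans.train(data, k, 20) // k clusters, 20 iterations
  println(s"k=$k cost=${model.computeCost(data)}")
}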
The reason being, we have a Vaadin (Java framework) application which displays
data from a Spark RDD, which in turn gets data from Cassandra. As we know, we
need to use Maven for building against the Spark API in Java.
We tested the spark-jobserver using SBT and were able to run it. However, for
our requirement, we nee
Hey,
why specifically Maven?
We set up a Spark job server through SBT, which is an easy way to get the
job server up and running.
On 30 Dec 2014 13:32, "Sasi [via Apache Spark User List]" <
ml-node+s1001560n20896...@n3.nabble.com> wrote:
>
> Does my question make sense, or does it require some elaboration?
>
> Sasi
>
> _
Does my question make sense, or does it require some elaboration?
Sasi