Hi all,
I'm trying to run the master version of Spark in order to test some alpha
components in the ml package.
I followed the Building Spark documentation and built it with:
$ mvn clean package
The build is successful, but when I try to run spark-shell I get the
following error:
Exception in thr
Hi All,
I have a requirement where I need to consume messages from ActiveMQ and do
live stream processing as well as batch processing using Spark. Is there a
spark-plugin or library that can enable this? If not, then do you know any
other way this could be done?
Regards
Mohit
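For reference, a minimal sketch of one way this could be done with a custom
Spark Streaming receiver wrapping a plain JMS consumer. The broker URL, queue
name, and class names below are hypothetical, and the ActiveMQ client jar would
need to be on the classpath:

import javax.jms.{Connection, Message, MessageListener, Session, TextMessage}
import org.apache.activemq.ActiveMQConnectionFactory
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Custom receiver: every JMS text message is handed to Spark Streaming via store().
class ActiveMQReceiver(brokerUrl: String, queueName: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  @transient private var connection: Connection = _

  override def onStart(): Unit = {
    connection = new ActiveMQConnectionFactory(brokerUrl).createConnection()
    val session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
    val consumer = session.createConsumer(session.createQueue(queueName))
    consumer.setMessageListener(new MessageListener {
      override def onMessage(msg: Message): Unit = msg match {
        case t: TextMessage => store(t.getText)  // push the payload into the DStream
        case _              => // ignore non-text messages in this sketch
      }
    })
    connection.start()
  }

  override def onStop(): Unit = if (connection != null) connection.close()
}

// Usage (assuming an existing StreamingContext ssc):
// val messages = ssc.receiverStream(new ActiveMQReceiver("tcp://localhost:61616", "events"))

The resulting DStream can then be processed live, while the same messages can
also be persisted (e.g. to HDFS) for separate batch processing.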
Hi,
Has anyone implemented the default Pig Loader in Spark? (loading delimited
text files with .pig_schema)
Thanks,
--
Jianshi Huang
LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
Hi,
I want to increase maxPrintString in the Spark REPL to look at SQL query
plans, as they are truncated by default at 800 chars, but I don't know how to
set this. You don't seem to be able to do it in the same way as you would
with the Scala REPL.
Anyone know how to set this?
Also anyone kno
I want to write a whole SchemaRDD to a single file in HDFS, but I am facing the
following exception:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
No lease on /test/data/data1.csv (inode 402042): File does not exist. Holder
DFSClient_NONMAPREDUCE_-564238432_57 doe
Hi,
To also process the older (already present) files, you can use fileStream
instead of textFileStream. It has a parameter that tells it to look for
already-present files.
For deleting the processed files, one way is to get the list of all files in
the DStream. This can be done by using the foreachRDD API of th
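A minimal sketch of the fileStream call being described, assuming an existing
StreamingContext ssc and a hypothetical input directory; newFilesOnly = false
is the flag that makes it pick up files already present in the directory:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "hdfs:///data/incoming",
    (path: Path) => !path.getName.startsWith("."),  // skip hidden/temporary files
    newFilesOnly = false
  ).map { case (_, text) => text.toString }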
Greetings!
Thanks for the response.
Below is an example of the exception I saw. I'd rather not post code at the
moment, so I realize it is completely unreasonable to ask for a
diagnosis. However, I will say that adding a "partitionBy()" was the last change
before this error appeared.
Thanks fo
I am not sure whether this will help you, but in my situation I could not see
any input in the terminal after some work finished via spark-shell; running
the command "stty echo" fixed it.
Best,
Amoners
Hello Akhil,
Thank you for taking your time for a detailed answer. I managed to solve it
in a very similar manner.
Kind regards,
Emre Sevinç
On Mon, Feb 2, 2015 at 8:22 PM, Akhil Das
wrote:
> Hi Emre,
>
> This is how you do that in scala:
>
> val lines = ssc.fileStream[LongWritable, Text,
> T
Yes, I see this too. I think the Jetty shading still needs a tweak.
It's not finding the servlet API classes. Let's converge on SPARK-5557
to discuss.
On Tue, Feb 3, 2015 at 2:04 AM, Jaonary Rabarisoa wrote:
> Hi all,
>
> I'm trying to run the master version of spark in order to test some alpha
>
Hi,
I am using Spark 0.9.1 and I am looking for a proper viz tool that
supports that specific version. As far as I have seen, all relevant tools
(e.g. spark-notebook, zeppelin-project etc) only support 1.1 or 1.2; no
mentions about older versions of Spark. Any ideas or suggestions?
*// Adamantio
I have an RDD which is of type
org.apache.spark.rdd.RDD[(String, (Array[String], Option[Array[String]]))]
I want to write it as a CSV file.
Please suggest how this can be done.
myrdd.map(line => (line._1 + "," + line._2._1.mkString(",") + "," +
line._2._2.mkString(','))).saveAsTextFile("hdfs://.
this is more of a scala question, so probably next time you'd like to
address a Scala forum eg. http://stackoverflow.com/questions/tagged/scala
val optArrStr:Option[Array[String]] = ???
optArrStr.map(arr => arr.mkString(",")).getOrElse("") // empty string or
whatever default value you have for th
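Putting that together for the RDD type from the original question (the output
path is hypothetical):

// myrdd: RDD[(String, (Array[String], Option[Array[String]]))] as in the question
val csv = myrdd.map { case (key, (required, optional)) =>
  // empty string when the Option is None, otherwise the comma-joined elements
  val tail = optional.map(_.mkString(",")).getOrElse("")
  key + "," + required.mkString(",") + "," + tail
}
csv.saveAsTextFile("hdfs:///path/to/output")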
Thanks Gerard !!
This is working.
On Tue, Feb 3, 2015 at 6:44 PM, Gerard Maas wrote:
> this is more of a scala question, so probably next time you'd like to
> address a Scala forum eg. http://stackoverflow.com/questions/tagged/scala
>
> val optArrStr:Option[Array[String]] = ???
> optArrStr.map(
Hello Adamantios,
Thanks for the poke and the interest.
Actually, you're the second person asking about backporting it. Yesterday (late),
I created a branch for it... and the simple local Spark test worked! \o/.
However, it'll be the 'old' UI :-/. Since I didn't port the code using
play 2.2.6 to the ne
You might also try "stty sane".
From: amoners
I am not sure whether this will help you, but in my situation I could not see
any input in the terminal after some work finished via spark-shell; running
the command "stty echo" fixed it.
Hi,
I just built Spark 1.3 master using maven via make-distribution.sh;
./make-distribution.sh --name mapr3 --skip-java-test --tgz -Pmapr3 -Phive
-Phive-thriftserver -Phive-0.12.0
When trying to start the standalone spark master on a cluster I get the
following stack trace;
15/02/04 08:53:56 I
This has already come up several times today:
https://issues.apache.org/jira/browse/SPARK-5557
On Tue, Feb 3, 2015 at 8:04 AM, Night Wolf wrote:
> Hi,
>
> I just built Spark 1.3 master using maven via make-distribution.sh;
>
> ./make-distribution.sh --name mapr3 --skip-java-test --tgz -Pmapr3 -Phive
> -Ph
I think this is a separate issue with how the EdgeRDDImpl partitions
edges. If you can merge this change in and rebuild, it should work:
https://github.com/apache/spark/pull/4136/files
If you can't, I just called the Graph.partitionBy() method right after
constructing my graph but before perfo
Hi Everyone,
Is LogisticRegressionWithSGD in MLlib scalable?
If so, what is the idea behind the scalable implementation?
Thanks in advance,
Peng
-
Peng Zhang
Hi All,
In Spark 1.2.0-rc1, I have tried to set hive.metastore.warehouse.dir to
share the Hive warehouse location on HDFS; however, it does NOT work in
yarn-cluster mode. In the Namenode audit log, I see that Spark is trying to
access the default Hive warehouse location, which is
/user/
Hi all,
the issue has been resolved when I used
rdd.foreachRDD(new Function, Void>() {
@Override
public Void call(JavaRDD rdd) throws Exception {
if(rdd!=null)
{
List result = rdd.col
Hi Gen
Thanks for your feedback. We do have a business reason to run Spark on Windows.
We have an existing application that is built on C# .NET running on Windows. We
are considering adding Spark to the application for parallel processing of
large data. We want Spark to run on Windows so it int
I want to process about 800 GB of data on an Amazon EC2 cluster. So, I need
to store the input in HDFS somehow.
I currently have a cluster of 5 x m3.xlarge, each of which has 80GB disk.
Each HDFS node reports 73 GB, and the total capacity is ~370 GB.
If I want to process 800 GB of data (assuming
You could also just push the data to Amazon S3, which would un-link the
size of the cluster needed to process the data from the size of the data.
DR
On 02/03/2015 11:43 AM, Joe Wass wrote:
I want to process about 800 GB of data on an Amazon EC2 cluster. So, I need
to store the input in HDFS so
Hi,
After some research I have decided that Spark (SQL) would be ideal for
building an OLAP engine. My goal is to push aggregated data (to Cassandra
or other low-latency data storage) and then be able to project the results
on a web page (web service). New data will be added (aggregated) once a
da
The version I'm using was already pre-built for Hadoop 2.3.
Hi,
Any thoughts ?
Thanks,
On Sun, Feb 1, 2015 at 12:26 PM, Manoj Samel
wrote:
> Spark 1.2
>
> SchemaRDD has schema with decimal columns created like
>
> x1 = new StructField("a", DecimalType(14,4), true)
>
> x2 = new StructField("b", DecimalType(14,4), true)
>
> Registering as SQL Temp table
The data is coming from S3 in the first place, and the results will be
uploaded back there. But even in the same availability zone, fetching 170
GB (that's gzipped) is slow. From what I understand of the pipelines,
multiple transforms on the same RDD might involve re-reading the input,
which very q
We use S3 as a main storage for all our input data and our generated
(output) data. (10's of terabytes of data daily.) We read gzipped data
directly from S3 in our Hadoop/Spark jobs - it's not crazily slow, as
long as you parallelize the work well by distributing the processing
across enough
I have about 500 MB of data and I'm trying to process it on a single
`local` instance. I'm getting an Out of Memory exception. Stack trace at
the end.
Spark 1.1.1
My JVM has -Xmx2g
spark.driver.memory = 1000M
spark.executor.memory = 1000M
spark.kryoserializer.buffer.mb = 256
spark.kryoserializer
We have gone down a similar path at Webtrends; Spark has worked amazingly well
for us in this use case. Our solution goes from REST, directly into Spark, and
back out to the UI instantly.
Here is the resulting product in case you are curious (and please pardon the
self promotion):
https://www
Write out the rdd to a cassandra table. The datastax driver provides
saveToCassandra() for this purpose.
On Tue Feb 03 2015 at 8:59:15 AM Adamantios Corais <
adamantios.cor...@gmail.com> wrote:
> Hi,
>
> After some research I have decided that Spark (SQL) would be ideal for
> building an OLAP en
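A sketch of what that looks like with the DataStax spark-cassandra-connector;
the keyspace, table, and columns below are hypothetical, and
spark.cassandra.connection.host must be set in the SparkConf:

import com.datastax.spark.connector._  // adds saveToCassandra to RDDs

case class DailyAggregate(day: String, metric: String, value: Long)

// Write pre-aggregated rows into a low-latency Cassandra table.
val aggregates = sc.parallelize(Seq(DailyAggregate("2015-02-03", "page_views", 1234L)))
aggregates.saveToCassandra("olap", "daily_aggregates", SomeColumns("day", "metric", "value"))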
Thanks very much, that's good to know, I'll certainly give it a look.
Can you give me a hint about how you unzip your input files on the fly? I
thought it wasn't possible to parallelize zipped inputs unless they
were unzipped before being passed to Spark?
Joe
On 3 February 2015 at 17:48, David Rosen
Using the s3a protocol (introduced in Hadoop 2.6.0) would be faster than s3.
The upcoming Hadoop 2.7.0 contains some bug fixes for s3a.
FYI
On Tue, Feb 3, 2015 at 9:48 AM, David Rosenstrauch
wrote:
> We use S3 as a main storage for all our input data and our generated
> (output) data. (10'
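A sketch of reading directly from S3 over s3a; the bucket, paths, and the use
of environment variables for credentials are illustrative only:

// Hadoop 2.6+ ships the s3a filesystem; point it at your credentials.
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// Gzipped text is decompressed transparently, but each .gz file is read by a
// single task, so parallelism comes from having many input files.
val logs = sc.textFile("s3a://my-bucket/input/*.gz")
logs.saveAsTextFile("s3a://my-bucket/output/")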
Hi Folks,
I'm new to GraphX and Scala and my sendMsg function needs to index into an
input list to my algorithm based on the pregel()() iteration number, but I
don't see a way to access that. I see in
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/Preg
I don't think it's possible to access it. What I've done before is to send the
current or next iteration index with the message, where the message is a
case class.
HTH
Dan
On Tue, Feb 3, 2015 at 10:20 AM, Matthew Cornell
wrote:
> Hi Folks,
>
> I'm new to GraphX and Scala and my sendMsg function needs
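A sketch of that pattern (the vertex/message types and the update logic are
made up for illustration): carry the iteration index inside the message and
record it in the vertex attribute, since Pregel()() does not expose the
superstep number to sendMsg:

import org.apache.spark.graphx._

case class VData(iter: Int, value: Double)
case class Msg(iter: Int, value: Double)

def run(graph: Graph[VData, Double], maxIters: Int): Graph[VData, Double] =
  graph.pregel(Msg(iter = 0, value = 0.0), maxIters, EdgeDirection.Out)(
    // vertex program: the iteration that produced this message is available here,
    // e.g. to index into a per-iteration input list
    (id, attr, msg) => VData(msg.iter, attr.value + msg.value),
    // send the *next* iteration index along with the payload
    triplet => Iterator((triplet.dstId, Msg(triplet.srcAttr.iter + 1, triplet.srcAttr.value))),
    // merge concurrent messages destined for the same vertex
    (a, b) => Msg(math.max(a.iter, b.iter), a.value + b.value)
  )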
You should be able to do something like:
sbt -Dscala.repl.maxprintstring=64000 hive/console
Here's an overview of catalyst:
https://docs.google.com/a/databricks.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit#heading=h.vp2tej73rtm2
On Tue, Feb 3, 2015 at 1:37 AM, Mick Davies
wr
I'll add that I usually just do
println(query.queryExecution)
On Tue, Feb 3, 2015 at 11:34 AM, Michael Armbrust
wrote:
> You should be able to do something like:
>
> sbt -Dscala.repl.maxprintstring=64000 hive/console
>
> Here's an overview of catalyst:
> https://docs.google.com/a/databricks.com/docu
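For example, against a hypothetical registered table; printing the plan
explicitly sidesteps the REPL's 800-character truncation:

val query = sqlContext.sql("SELECT dept, COUNT(*) FROM employees GROUP BY dept")
println(query.queryExecution)  // full logical, optimized, and physical plans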
Not all of our input files are zipped. The ones that are, obviously, are
not parallelized - they're just processed by a single task. Not a big
issue for us, though, as those zipped files aren't too big.
DR
On 02/03/2015 01:08 PM, Joe Wass wrote:
Thanks very much, that's good to know, I'll
Adamantios,
As said, I backported it to 0.9.x and now it's pushed on this branch:
https://github.com/andypetrella/spark-notebook/tree/spark-0.9.x.
I didn't create a dist atm, because I'd prefer to do it only if necessary
:-).
So, if you want to try it out, just clone the repo, check it out in this
Hey Joe,
With the ephemeral HDFS, you get the instance store of your worker nodes.
For m3.xlarge that will be two 40 GB SSDs local to each instance, which are
very fast.
For the persistent HDFS, you get whatever EBS volumes the launch script
configured. EBS volumes are always network drives, so t
I am trying to implement secondary sort in Spark, as we do in MapReduce.
Here is my data (tab separated, without the header c1, c2, c3):
c1  c2  c3
1 2 4
1 3 6
2 4 7
2 6 8
3 5 5
3 1 8
3 2 0
To do secondary sort, I crea
Just to add, I am suing Spark 1.1.0
Nitin,
Suing Spark is not going to help. Perhaps you should sue someone else :-) Just
kidding!
Mohammed
-Original Message-
From: nitinkak001 [mailto:nitinkak...@gmail.com]
Sent: Tuesday, February 3, 2015 1:57 PM
To: user@spark.apache.org
Subject: Re: Sort based shuffle not working prop
Hm, I don't think the sort partitioner is going to cause the result to
be ordered by c1,c2 if you only partitioned on c1. I mean, it's not
even guaranteed that the type of c2 has an ordering, right?
On Tue, Feb 3, 2015 at 3:38 PM, nitinkak001 wrote:
> I am trying to implement secondary sort in sp
I thought that's what sort-based shuffle did: sort the keys going to the
same partition.
I have tried (c1, c2) as (Int, Int) tuple as well. I don't think that
ordering of c2 type is the problem here.
On Tue, Feb 3, 2015 at 5:21 PM, Sean Owen wrote:
> Hm, I don't think the sort partitioner is go
This is an excerpt from the design document of the sort-based shuffle
implementation. I am thinking I might be wrong in my understanding of sort-based
shuffle; I don't completely understand it, though.
*Motivation*
A sort-based shuffle can be more scalable than Spark's current hash-based
one because
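For reference, the usual way to get a real secondary sort is to key on (c1, c2)
but partition on c1 only, then sort within partitions. A sketch, with
hypothetical paths and partition count; repartitionAndSortWithinPartitions is
available from Spark 1.2, so on 1.1.0 an equivalent per-partition sort would be
needed:

import org.apache.spark.Partitioner

// Route rows by c1 only, so the Ordering on (c1, c2) keys yields rows sorted
// by c2 within each c1 group.
class FirstFieldPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (c1: Int, _) => math.abs(c1.hashCode) % numPartitions
  }
}

val rows = sc.textFile("hdfs:///path/to/input")
  .map(_.split("\t"))
  .map(f => ((f(0).toInt, f(1).toInt), f(2)))  // key = (c1, c2), value = c3

val sorted = rows.repartitionAndSortWithinPartitions(new FirstFieldPartitioner(4))
sorted.map { case ((c1, c2), c3) => s"$c1\t$c2\t$c3" }.saveAsTextFile("hdfs:///path/to/output")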
Michael,
you are right, there is definitely some limit at 2GB. Here is a trivial
example to demonstrate it:
import org.apache.spark.storage.StorageLevel
val d = sc.parallelize(1 to 1e6.toInt, 1).map{i => new
Array[Byte](5e3.toInt)}.persist(StorageLevel.DISK_ONLY)
d.count()
It gives the same err
To be clear, there is no distinction between partitions and blocks for RDD
caching (each RDD partition corresponds to 1 cache block). The distinction
is important for shuffling, where by definition N partitions are shuffled
into M partitions, creating N*M intermediate blocks. Each of these blocks
m
cc dev list
How are you saving the data? There are two relevant 2GB limits:
1. Caching
2. Shuffle
For caching, a partition is turned into a single block.
For shuffle, each map partition is partitioned into R blocks, where R =
number of reduce tasks. It is unlikely a shuffle block > 2G, altho
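Translated to the example earlier in the thread, a sketch of the usual
workaround: spread the same data over enough partitions that no single cached
block approaches 2 GB:

import org.apache.spark.storage.StorageLevel

// The same ~5 GB of data, but in 1000 partitions of ~5 MB each
// instead of a single 5 GB partition (i.e. a single cache block).
val d = sc.parallelize(1 to 1e6.toInt, 1000)
  .map(i => new Array[Byte](5e3.toInt))
  .persist(StorageLevel.DISK_ONLY)
d.count()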
Thank you!
This is very helpful.
-Mike
From: Aaron Davidson
To: Imran Rashid
Cc: Michael Albert ; Sean Owen ;
"user@spark.apache.org"
Sent: Tuesday, February 3, 2015 6:13 PM
Subject: Re: 2GB limit for partitions?
To be clear, there is no distinction between partitions and blocks
Thanks for the explanations, that makes sense. For the record, it looks like this
was worked on a while back (and maybe the work is even close to a solution?)
https://issues.apache.org/jira/browse/SPARK-1476
and perhaps an independent solution was worked on here?
https://issues.apache.org/jira/browse/SP
In case anyone needs to merge all of their part-n files (small result
set only) into a single *.csv file or needs to generically flatten case
classes, tuples, etc., into comma separated values:
http://deploymentzone.com/2015/01/30/spark-and-merged-csv-files/
On Tue Feb 03 2015 at 8:23:59 AM k
Greetings!
First, my sincere thanks to all who have given me advice. Following previous
discussion, I've rearranged my code to try to keep the partitions to more
manageable sizes. Thanks to all who commented.
At the moment, the input set I'm trying to work with is about 90GB (avro
parquet format).
Hi Peng,
Short answer: Yes. It has been run on billions of rows and tens of
millions of columns.
Long answer: There are many ways to implement LR in a distributed fashion,
and their dependence on the dataset dimensions and compute cluster size
varies.
The implementation distributes the gradient
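A minimal sketch of calling it, with toy in-memory data (a real workload would
load an RDD[LabeledPoint] from HDFS, and numIterations here is arbitrary):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.5)),
  LabeledPoint(0.0, Vectors.dense(-1.0, -0.5))
))

// Each iteration computes the gradient as a distributed aggregate over the RDD.
val model = LogisticRegressionWithSGD.train(training, numIterations = 100)
println(model.weights)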
A great presentation by Evan Chan on utilizing Cassandra as Jonathan noted
is at: OLAP with Cassandra and Spark
http://www.slideshare.net/EvanChan2/2014-07olapcassspark.
On Tue Feb 03 2015 at 10:03:34 AM Jonathan Haddad wrote:
> Write out the rdd to a cassandra table. The datastax driver provid
That is fairly out of date (we used to run some of our jobs on it ... But
that is forked off 1.1 actually).
Regards
Mridul
On Tuesday, February 3, 2015, Imran Rashid wrote:
> Thanks for the explanations, makes sense. For the record looks like this
> was worked on a while back (and maybe the wo
Spark doesn't support it, but this connector is open source; you can get it
from GitHub.
The difference between these two DBs depends on what type of solution
you are looking for. Please refer to this link:
http://blog.nahurst.com/visual-guide-to-nosql-systems
FYI, from the list of NoSQL in
I am trying to combine multiple RDDs into 1 RDD, and I am using the union
function. I wonder if anyone has seen StackOverflowError as follows:
Exception in thread "main" java.lang.StackOverflowError
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.Union
Use SparkContext#union[T](rdds: Seq[RDD[T]])
On Tue, Feb 3, 2015 at 7:43 PM, Thomas Kwan wrote:
> I am trying to combine multiple RDDs into 1 RDD, and I am using the union
> function. I wonder if anyone has seen StackOverflowError as follows:
>
> Exception in thread "main" java.lang.StackOverflo
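The difference, as a sketch: a single n-way union keeps the lineage flat,
whereas folding RDD.union pairwise builds a dependency chain as deep as the
number of RDDs, which is what overflows the stack:

val rdds: Seq[org.apache.spark.rdd.RDD[Int]] =
  (1 to 500).map(i => sc.parallelize(Seq(i)))

// Instead of: rdds.reduce(_ union _)   // ~500-deep lineage
val combined = sc.union(rdds)           // one UnionRDD over all of them
println(combined.count())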
I'm having a really bad dependency conflict right now with Guava versions
between my Spark application in Yarn and (I believe) Hadoop's version.
The problem is, my driver has the version of Guava which my application is
expecting (15.0) while it appears the Spark executors that are working on
my R
Try spark.yarn.user.classpath.first (see
https://issues.apache.org/jira/browse/SPARK-2996 - only works for YARN).
Also thread at
http://apache-spark-user-list.1001560.n3.nabble.com/netty-on-classpath-when-using-spark-submit-td18030.html.
HTH,
Markus
On 02/03/2015 11:20 PM, Corey Nolet wrote:
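As a sketch, the same setting can also go directly into the application's
SparkConf (property name as in SPARK-2996; whether it fully resolves the Guava
conflict depends on the deployment):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("guava-15-app")  // hypothetical app name
  .set("spark.yarn.user.classpath.first", "true")  // prefer the user's jars on YARN executors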
Hi,
I've been trying to use HiveContext (instead of SQLContext) in my Spark SQL
application, and when I run multiple instances of the application simultaneously,
only the first call works; every other call throws the following error:
ERROR Datastore.Schema: Failed initialising database.
Failed to start database
Hi Sean,
I'm interested in trying something similar. How was your performance when you
had many concurrent queries running against spark? I know this will work well
where you have a low volume of queries against a large dataset, but am
concerned about having a high volume of queries against t
I have a cluster running CDH 5.1.0 with the Spark component.
Because the default version of Spark in CDH 5.1.0 is 1.0.0 and I want to
use some features of Spark 1.2.0, I compiled another Spark with Maven.
But when I ran spark-shell and created a new SparkContext, I got the
error below:
15
Corey,
Which version of Spark do you use? I am using Spark 1.2.0, and guava 15.0.
It seems fine.
Best,
Bo
On Tue, Feb 3, 2015 at 8:56 PM, M. Dale wrote:
> Try spark.yarn.user.classpath.first (see
> https://issues.apache.org/jira/browse/SPARK-2996 - only works for YARN).
> Also thread at
> h
Hi Ningjun,
I have been working with Spark 1.2 on Windows 7 and Windows 2008 R2 (purely
for development purposes). I had most recently installed them utilizing
Java 1.8, Scala 2.10.4, and Spark 1.2 Precompiled for Hadoop 2.4+. A handy
thread concerning the null\bin\winutils issue is addressed in
Hi all,
I need some help.
When I try to run my Spark project, it shows: "Exception in
thread "main" java.lang.SecurityException: class
"javax.servlet.ServletRegistration"'s signer information does not match
signer information of other classes in the same package".
After deleting "/home/d
I have 3 text files in HDFS which I am reading using Spark SQL and
registering as tables. After that I am doing about 5-6 operations,
including joins, group by, etc., and this whole process takes hardly 6-7
seconds (source file size: 3 GB, with almost 20 million rows).
As a final step of