Thanks for the info, Frank.
Twitter chill's Avro serializer looks great.
But how does Spark identify it as a serializer, since it's not extending
KryoSerializer?
(Sorry, Scala is an alien language to me.)
-
Thanks & Regards,
Mohan
Hello everyone,
What should be the normal time difference between Scala and Python using
Spark? I mean running the same program in the same cluster environment.
In my case I am using numpy array structures for the Python code and
vectors for the Scala code, both for handling my data. The time dif
Hello!
Could you please add us to your "powered by" page?
Project name: Ubix.io
Link: http://ubix.io
Components: Spark, Shark, Spark SQL, MLlib, GraphX, Spark Streaming, Adam
project
Description:
Hey, I don't think that's the issue. foreach is called on 'results', which is
a DStream of floats, so naturally it passes RDDs to its function.
And either way, changing the code in the first mapper to comment out the map
reduce process on the RDD
Float f = 1.0f; //nnRdd.map(new Function() {
Hello everybody,
I'm new to Spark Streaming and have played around a bit with WordCount and a
PageRank algorithm in a cluster environment.
Am I right that in the cluster each executor computes its data stream
separately, and that the result of each executor is independent of the other
executors?
In the
Well it looks like this is indeed a protobuf issue. Poked a little more
with Kryo. Since protobuf messages are serializable, I tried just making
Kryo use the JavaSerializer for my messages. The resulting stack trace
made it look like protobuf GeneratedMessageLite is actually using the
classloade
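(For reference, a minimal sketch of that approach: telling Kryo to fall back to its JavaSerializer for protobuf-generated message classes via a custom KryoRegistrator. This is untested and the registrator class name is made up.)

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.JavaSerializer
import com.google.protobuf.GeneratedMessageLite
import org.apache.spark.serializer.KryoRegistrator

class ProtoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // have Kryo use plain Java serialization for all GeneratedMessageLite-derived
    // (protobuf-generated) message classes
    kryo.addDefaultSerializer(classOf[GeneratedMessageLite], classOf[JavaSerializer])
  }
}

// wired up via SparkConf:
//   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//   .set("spark.kryo.registrator", "com.example.ProtoRegistrator")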
Derp, one caveat to my "solution": I guess Spark doesn't use Kryo for
Function serde :(
On Fri, Sep 19, 2014 at 12:44 AM, Paul Wais wrote:
> Well it looks like this is indeed a protobuf issue. Poked a little more
> with Kryo. Since protobuf messages are serializable, I tried just making
> Kryo
Hi,
I'd made some modifications to the Spark source code on the master and
propagated them to the slaves using rsync.
I followed this command:
rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname:path/to/destdirectory
This worked perfectly. But, I wanted to simultaneously rs
Hi,
On Fri, Sep 19, 2014 at 5:02 PM, rapelly kartheek
wrote:
>
> This worked perfectly. But, I wanted to simultaneously rsync all the
> slaves. So, added the other slaves as following:
>
> rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname
> :path/to/destdirectory username@sl
Hi Tobias,
I've copied the files from master to all the slaves.
On Fri, Sep 19, 2014 at 1:37 PM, Tobias Pfeiffer wrote:
> Hi,
>
On Fri, Sep 19, 2014 at 5:02 PM, rapelly kartheek wrote:
>>
>> This worked perfectly. But, I wanted to simultaneously rsync all the
>> slaves. So, added the other
* you have copied a lot of files from various hosts to username@slave3:path*
only from one node to all the other nodes...
On Fri, Sep 19, 2014 at 1:45 PM, rapelly kartheek
wrote:
> Hi Tobias,
>
> I've copied the files from master to all the slaves.
>
> On Fri, Sep 19, 2014 at 1:37 PM, Tobias
Here's what we've tried so far as a first example of a custom Mongo receiver:

class MongoStreamReceiver(host: String)
  extends NetworkReceiver[String] {

  protected lazy val blocksGenerator: BlockGenerator =
    new BlockGenerator(StorageLevel.MEMORY_AND_DISK_SER_2)

  protected def onStart()
-- Forwarded message --
From: rapelly kartheek
Date: Fri, Sep 19, 2014 at 1:51 PM
Subject: Re: rsync problem
To: Tobias Pfeiffer
Any idea why the cluster is dying down?
On Fri, Sep 19, 2014 at 1:47 PM, rapelly kartheek
wrote:
> ,
>
>
> * you have copied a lot of files from
Hi,
On Fri, Sep 19, 2014 at 5:17 PM, rapelly kartheek
wrote:
> > ,
>
> * you have copied a lot of files from various hosts to
> username@slave3:path*
> only from one node to all the other nodes...
>
I don't think rsync can do that in one command as you described. My guess
is that now you have a
I followed this command:
rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname:path/to/destdirectory
Anyway, for now, I did it individually for each node.
I have copied to each node individually using the above command.
So, I guess the copying may not contain any
No, it is actually a quite different 'alpha' project under the same name:
linear algebra DSL on top of H2O and also Spark. It is not really about
algorithm implementations now.
On Sep 19, 2014 1:25 AM, "Matthew Farrellee" wrote:
> On 09/18/2014 05:40 PM, Sean Owen wrote:
>
>> No, the architecture
The product of each mapPartitions call can be an Iterable of one big Map.
You still need to write some extra custom code like what lookup() does to
exploit this data structure.
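For example, roughly (a sketch, not tested; the toy data is only for illustration):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setAppName("map-per-partition").setMaster("local"))
val pairs = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c")).partitionBy(new HashPartitioner(2))

// one big Map per partition
val maps = pairs.mapPartitions(iter => Iterator(iter.toMap), preservesPartitioning = true)

// a lookup()-like helper; this simple version just scans each partition's Map
def lookup(key: Int): Seq[String] = maps.flatMap(_.get(key)).collect()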
On Sep 18, 2014 11:07 PM, "Harsha HN" <99harsha.h@gmail.com> wrote:
> Hi All,
>
> My question is related to improving
Hi,
Is there a way to bulk-load to HBase from RDD?
HBase offers HFileOutputFormat class for bulk loading by MapReduce job, but
I cannot figure out how to use it with saveAsHadoopDataset.
Thanks.
One possible reason may be that the checkpointing directory
$SPARK_HOME/work is rsynced as well.
Try emptying the contents of the work folder on each node and try again.
On Fri, Sep 19, 2014 at 4:53 AM, rapelly kartheek
wrote:
> I
> * followed this command:rsync -avL --progress path/to/spark
Hi,
After reading several documents, it seems that saveAsHadoopDataset cannot
use HFileOutputFormat.
That's because the saveAsHadoopDataset method uses JobConf, so it belongs to the
old Hadoop API, while HFileOutputFormat is a member of the mapreduce package,
which is for the new Hadoop API.
Am I rig
Hi,
Sorry, I just found saveAsNewAPIHadoopDataset.
Then, can I use HFileOutputFormat with saveAsNewAPIHadoopDataset? Is there
any example code for that?
Thanks.
From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr]
Sent: Friday, September 19, 2014 8:18 PM
To: user@spark
On 09/19/2014 05:06 AM, Sean Owen wrote:
No, it is actually a quite different 'alpha' project under the same
name: linear algebra DSL on top of H2O and also Spark. It is not really
about algorithm implementations now.
On Sep 19, 2014 1:25 AM, "Matthew Farrellee" <m...@redhat.com> wrote:
I have been using saveAsNewAPIHadoopDataset but I use TableOutputFormat
instead of HFileOutputFormat. But, hopefully this should help you:
val hbaseZookeeperQuorum =
s"$zookeeperHost:$zookeeperPort:$zookeeperHbasePath"
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", hbase
Thank you for the example code.
Currently I use foreachPartition() + Put(), but your example code can be used
to clean up my code.
BTW, since the data uploaded by Put() goes through the normal HBase write path, it
can be slow.
So, it would be nice if bulk-load could be used, since it bypasse
In fact, it seems that Put can be used with HFileOutputFormat, so the Put object
itself may not be the problem.
The problem is that TableOutputFormat uses the Put object in the normal way
(which goes through the normal write path), while HFileOutputFormat uses it to
directly build the HFile.
From: innowi
Apologies for the delay in getting back on this. It seems the Kinesis example
does not run on Spark 1.1.0 even when it is built using the kinesis-asl profile,
because of a dependency conflict in the HTTP client (same issue as
http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccajob8btdxks-7-spjj5j
Agreed that the bulk import would be faster. In my case, I wasn't expecting
a lot of data to be uploaded to HBase, and I also didn't want to take the
pain of importing generated HFiles into HBase. Is there a way to invoke the
HBase HFile import batch script programmatically?
On 19 September 2014 17:58
Hi Mohan,
It’s a bit convoluted to follow in their source, but they essentially typedef
KSerializer as being a KryoSerializer, and then their serializers all extend
KSerializer. Spark should identify them properly as Kryo Serializers, but I
haven’t tried it myself.
Regards,
Frank Austin Notha
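In case it helps, a rough sketch of the wiring: Spark doesn't discover such a serializer by type, you register it with Kryo through a KryoRegistrator and point spark.kryo.registrator at that class. The record and serializer class names below are hypothetical and this is untested:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // AvroRecord is your record class, AvroRecordSerializer the chill-style
    // Kryo serializer for it (both names are placeholders)
    kryo.register(classOf[AvroRecord], new AvroRecordSerializer)
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyRegistrator")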
Jatin,
If you file the JIRA and don't want to work on it, I'd be happy to step in
and take a stab at it.
RJ
On Thu, Sep 18, 2014 at 4:08 PM, Xiangrui Meng wrote:
> Hi Jatin,
>
> HashingTF should be able to solve the memory problem if you use a
> small feature dimension in HashingTF. Please do
I'm running out of options trying to integrate Cassandra, Spark, and the
spark-cassandra-connector.
I quickly found out that just grabbing the latest versions of everything
(drivers, etc.) doesn't work -- binary incompatibilities, it would seem.
So last I tried using versions of drivers from the
spark-ca
Thanks, Shivaram.
Kui
> On Sep 19, 2014, at 12:58 AM, Shivaram Venkataraman
> wrote:
>
> As R is single-threaded, SparkR launches one R process per-executor on
> the worker side.
>
> Thanks
> Shivaram
>
> On Thu, Sep 18, 2014 at 7:49 AM, oppokui wrote:
>> Shivaram,
>>
>> As I know, SparkR
onStart should be non-blocking. You may try to create a thread in onStart
instead.
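For example, a rough sketch using the Receiver API from Spark 1.x (rather than the older NetworkReceiver used below); the Mongo-specific parts are left out:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MongoStreamReceiver(host: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  def onStart(): Unit = {
    // onStart must return quickly, so do the blocking work on its own thread
    new Thread("mongo-receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = { /* signal the receive loop to stop */ }

  private def receive(): Unit = {
    while (!isStopped()) {
      // read documents from Mongo here (omitted) and hand them to Spark
      store("one-document-as-json")
    }
  }
}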
- Original Message -
From: "t1ny"
To: u...@spark.incubator.apache.org
Sent: Friday, September 19, 2014 1:26:42 AM
Subject: Re: Spark Streaming and ReactiveMongo
Here's what we've tried so far as a first e
Thank you, Soumya Simanta and Tobias. I've deleted the contents of the work
folder on all the nodes.
Now it's working perfectly, as it was before.
Thank you
Karthik
On Fri, Sep 19, 2014 at 4:46 PM, Soumya Simanta
wrote:
> One possible reason is maybe that the checkpointing directory
> $SPARK_HOME
It turns out that it was the Hadoop version that was the issue.
spark-1.0.2-hadoop1 and spark-1.1.0-hadoop1 both work.
spark-1.0.2-hadoop2, spark-1.1.0-hadoop2.4 and spark-1.1.0-hadoop2.4 do not
work.
It's strange because for this little test I am not even using HDFS at all.
-- Eric
On Thu,
Excellent - that's exactly what I needed. I saw iterator() but missed the
toLocalIterator() method.
Hi,
I am working with the SVMWithSGD classification algorithm on Spark. It
works fine for me; however, I would like to distinguish the instances that
are classified with high confidence from those with low confidence. How do we
define the threshold here? Ultimately, I want to keep only those for which
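If it helps, a rough sketch of one way to do this with MLlib (assuming training and test are RDD[LabeledPoint]s; the 1.0 cutoff is arbitrary and this is untested):

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val model = SVMWithSGD.train(training, 100)

// clearThreshold() makes predict() return the raw margin instead of a 0/1 label,
// so its magnitude can serve as a rough confidence measure
model.clearThreshold()

val scored: RDD[(Double, LabeledPoint)] =
  test.map(p => (model.predict(p.features), p))

// keep only the "confident" predictions
val confident = scored.filter { case (margin, _) => math.abs(margin) > 1.0 }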
Hi, Spark experts,
I have the following issue when using the AWS Java SDK in my Spark application.
I narrowed it down to the following steps to reproduce the problem:
1) I have Spark 1.1.0 with Hadoop 2.4 installed on a 3-node cluster.
2) From the master node, I did the following steps:
spark-shell --
Hi Chinchu,
SparkEnv is an internal class that is only meant to be used within Spark.
Outside of Spark, it will be null because there are no executors or driver
to start an environment for. Similarly, SparkFiles is meant to be used
internally (though its privacy settings should be modified to ref
I think it's normal.
On Fri, Sep 19, 2014 at 12:07 AM, Luis Guerra wrote:
> Hello everyone,
>
> What should be the normal time difference between Scala and Python using
> Spark? I mean running the same program in the same cluster environment.
>
> In my case I am using numpy array structures for t
Hi,
I wrote a little Java job to try and figure out how RDD pipe works.
Below is my test shell script. If I turn on debugging in the script, I get
output in my console. If debugging is turned off in the shell script, I do
not see anything in my console. Is this a bug or a feature?
I am running t
Hi,
I have a program similar to the BinaryClassifier example that I am running
using my data (which is fairly small). I run this for 100 iterations. I
observed the following performance:
Standalone mode cluster with 10 nodes (with Spark 1.0.2): 5 minutes
Standalone mode cluster with 10 nodes (wi
What is in 'rdd' here, just to double-check? Do you mean the Spark shell when
you say console? At the end you're grepping some redirected output,
but where is that from?
On Sep 19, 2014 7:21 PM, "Andy Davidson"
wrote:
> Hi
>
> I am wrote a little java job to try and figure out how RDD pipe
Hi Andy,
That's a feature -- you'll have to print out the return value from
collect() if you want the contents to show up on stdout.
Probably something like this:
for (Iterator iter = rdd.pipe(pwd +
    "/src/main/bin/RDDPipe.sh").collect().iterator(); iter.hasNext();)
  System.out.println(iter.next());
I am struggling to reproduce the functionality of a Hadoop reducer on
Spark (in Java)
In Hadoop I have a function
public void doReduce(K key, Iterator values)
In Hadoop there is also a consumer (context.write) which can be seen as
consume(key,value)
In my code
1) knowing the key is important to
I'm launching a Spark shell with the following parameters
./spark-shell --master yarn-client --executor-memory 32g --driver-memory 4g
--executor-cores 32 --num-executors 8
but when I look at the Spark UI it shows only 209.3 GB total memory.
Executors (10)
- Memory: 55.9 GB Used (209.3 GB
How many cores do you have in your boxes?
It looks like you are assigning 32 cores per executor - is that what you want?
Are there other applications running on the cluster? You might want to check the
YARN UI to see how many containers are getting allocated to your application.
It might not be the same as a real Hadoop reducer, but I think it would
accomplish the same thing. Take a look at:
import org.apache.spark.SparkContext._
// val rdd: RDD[(K, V)]
// def zero(value: V): S
// def reduce(agg: S, value: V): S
// def merge(agg1: S, agg2: S): S
val reducedUnsorted: RDD[(K, S)]
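For example, a small self-contained sketch of that shape, with K = String, V = Int, and the aggregate S being a (sum, count) pair:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setAppName("combine-sketch").setMaster("local"))
val rdd = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 2)))

val reduced = rdd.combineByKey(
  (value: Int) => (value, 1),                                    // zero: V => S
  (agg: (Int, Int), value: Int) => (agg._1 + value, agg._2 + 1), // reduce: (S, V) => S
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)   // merge: (S, S) => S
)
// reduced: RDD[(String, (Int, Int))], one (sum, count) per key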
Please see http://hbase.apache.org/book.html#completebulkload
LoadIncrementalHFiles has a main() method.
On Fri, Sep 19, 2014 at 5:41 AM, Aniket Bhatnagar <
aniket.bhatna...@gmail.com> wrote:
> Agreed that the bulk import would be faster. In my case, I wasn't
> expecting a lot of data to be upl
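For reference, a rough sketch of calling it programmatically instead of via the shell script (the path and table name are made up, and this assumes the HFiles have already been written with HFileOutputFormat):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

val conf = HBaseConfiguration.create()
val table = new HTable(conf, "my_table")

// moves the generated HFiles into the regions of the target table
new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles"), table)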
I'm trying to run Spark ALS using the Netflix dataset but it failed due to a "No
space on device" exception. It seems the exception is thrown after the
training phase. It's not clear to me what is being written and where the
output directory is.
I was able to run the same code on the provided test.data
Hi Jey,
Many thanks for the code example. Here is what I really want to do. I want
to use Spark Streaming and Python. Unfortunately PySpark does not support
streaming yet. It was suggested that the way to work around this was to use an
RDD pipe. The example below was a little experiment.
You can think of m
I have created JIRA ticket 3610 for the issue. Thanks.
Your proposed use of rdd.pipe("foo") to communicate with an external
process seems fine. The "foo" program should read its input from
stdin, perform its computations, and write its results back to stdout.
Note that "foo" will be run on the workers, invoked once per
partition, and the result will be
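For example, a minimal sketch (the script path is made up; the script only needs to read lines from stdin and write lines to stdout):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("pipe-sketch").setMaster("local"))
val rdd = sc.parallelize(1 to 10, 2)

// "foo.sh" could be as simple as: while read x; do echo "got $x"; done
val piped = rdd.pipe("/path/to/foo.sh")   // run once per partition, on the workers

// the external program's stdout becomes the elements of piped;
// collect() brings them back to the driver so they show up on the console
piped.collect().foreach(println)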
Hey just a minor clarification, you _can_ use SparkFiles.get in your
application only if it runs on the executors, e.g. in the following way:
sc.parallelize(1 to 100).map { i => SparkFiles.get("my.file") }.collect()
But not in general (otherwise NPE, as in your case). Perhaps this should be
docum
Hi,
I am using the latest release Spark 1.1.0. I am trying to build the
streaming examples (under examples/streaming) as a standalone project with
the following streaming.sbt file. When I run sbt assembly, I get an error
stating that object algebird is not a member of package com.twitter. I
tried
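(In case it is simply a missing dependency: the algebird-based examples need algebird-core on the classpath, e.g. something like the line below in streaming.sbt; the version is only a guess.)

libraryDependencies += "com.twitter" %% "algebird-core" % "0.1.11"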
Hi Evan,
here is an improved version, thanks for your advice. But you know, the last step,
the saveAsTextFile, is very slow :(
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import java.net.URL
import java.text.SimpleDateFormat
import c
I successfully did this once.
Map the RDD to RDD[(ImmutableBytesWritable, KeyValue)]
then
val conf = HBaseConfiguration.create()
val job = new Job (conf, "CEF2HFile")
job.setMapOutputKeyClass (classOf[ImmutableBytesWritable]);
job.setMapOutputValueClass (classOf[KeyValue]);
val table = new HTable(con
Have you set spark.local.dir (I think this is the config setting)?
It needs to point to a volume with plenty of space.
By default, if I recall, it points to /tmp.
Sent from my iPhone
> On 19 Sep 2014, at 23:35, "jw.cmu" wrote:
>
> I'm trying to run Spark ALS using the netflix dataset but failed d
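(For example, something along these lines in the driver's SparkConf, with the path being only an example; the same property can also go into spark-defaults or spark-env:)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("als")
  // point shuffle/spill files at a volume with plenty of free space
  .set("spark.local.dir", "/mnt/bigdisk/spark-tmp")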
What Sean said.
You should also definitely turn on Kryo serialization. The default
Java serialization is really, really slow if you're going to move around
lots of data. Also make sure you use a cluster with high network
bandwidth.
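(That is, something along these lines in the SparkConf:)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")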
On Thu, Sep 18, 2014 at 3:06 AM, Sean Owen wrote:
> Base 64 i
Hi,
I'm using Spark Streaming 1.0.
Say I have a source of website click stream, like the following:
('2014-09-19 00:00:00', '192.168.1.1', 'home_page')
('2014-09-19 00:00:01', '192.168.1.2', 'list_page')
...
And I want to calculate the page views (PV, number of logs) and unique
user (UV, identi
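A minimal sketch of per-batch PV/UV (the socket source and the tab-separated field layout are assumptions, and this is untested):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("pv-uv"), Seconds(60))
val lines = ssc.socketTextStream("localhost", 9999)

// each line: "timestamp<TAB>ip<TAB>page"
val logs = lines.map(_.split("\t"))

val pv = logs.count()                                    // page views per batch
val uv = logs.map(_(1)).transform(_.distinct()).count()  // unique users (by IP) per batch

pv.print()
uv.print()

ssc.start()
ssc.awaitTermination()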
So sorry about teasing you with the Scala. But the method is there in Java
too, I just checked.
On Fri, Sep 19, 2014 at 2:02 PM, Victor Tso-Guillen wrote:
> It might not be the same as a real hadoop reducer, but I think it would
> accomplish the same. Take a look at:
>
> import org.apache.spark.
Hi,
I'm developing an application with spark-streaming-kafka, which
depends on spark-streaming and kafka. Since spark-streaming is
provided at runtime, I want to exclude its jars from the assembly. I
tried the following configuration:
libraryDependencies ++= {
val sparkVersion = "1.0.2"
Seq(
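(Something along these lines should work: marking spark-streaming as "provided" while keeping the Kafka connector, and therefore Kafka itself, in the assembly. The version is taken from the snippet above:)

libraryDependencies ++= {
  val sparkVersion = "1.0.2"
  Seq(
    "org.apache.spark" %% "spark-streaming"       % sparkVersion % "provided",
    "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion
  )
}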
Thanks Andrew, that helps.
On Fri, Sep 19, 2014 at 5:47 PM, Andrew Or-2 [via Apache Spark User List] <
ml-node+s1001560n14708...@n3.nabble.com> wrote:
> Hey just a minor clarification, you _can_ use SparkFiles.get in your
> application only if it runs on the executors, e.g. in the following way:
>