Command exited with code 137

2014-06-12 Thread libl
I submit tasks in standalone mode, but I often get an error. The stack trace is as follows: 2014-06-12 11:37:36,578 [INFO] [org.apache.spark.Logging$class] [Method:logInfo] [Line:49] [Thread:spark-akka.actor.default-dispatcher-18] - Executor updated: app-20140612092238-0007/0 is now FAILED (Command exited with

multiple passes in mapPartitions

2014-06-12 Thread zhen
I want to take multiple passes through my data in mapPartitions. However, the iterator only allows you to take one pass through the data. If I transform the iterator into an array using iter.toArray, it is too slow, since it copies all the data into a new Scala array. Also it takes twice the memo

Re: Powered by Spark addition

2014-06-12 Thread Sonal Goyal
Hi, Can we get added too? Here are the details: Name: Nube Technologies URL: www.nubetech.co Description: Nube provides solutions for data curation at scale helping customer targeting, accurate inventory and efficient analysis. Thanks! Best Regards, Sonal Nube Technologies

Fwd: ApacheCon CFP closes June 25

2014-06-12 Thread Matei Zaharia
(I’m forwarding this message on behalf of the ApacheCon organizers, who’d like to see involvement from every Apache project!) As you may be aware, ApacheCon will be held this year in Budapest, on November 17-23. (See http://apachecon.eu for more info.) The Call For Papers for that conference is

Re: Native library can not be loaded when using Mllib PCA

2014-06-12 Thread Evan R. Sparks
I wouldn't be surprised if the default BLAS that ships with jblas is not optimized for your target platform. Breeze (which we call into) uses jblas and falls back to netlib if jblas can't be loaded. I'd recommend using jblas if you can. You probably want to compile a native BLAS library sp

Re: Spark 1.0.0 Standalone AppClient cannot connect Master

2014-06-12 Thread Hao Wang
Hi, Andrew Got it, Thanks! Hao Regards, Wang Hao(王灏) CloudTeam | School of Software Engineering Shanghai Jiao Tong University Address:800 Dongchuan Road, Minhang District, Shanghai, 200240 Email:wh.s...@gmail.com On Fri, Jun 13, 2014 at 12:42 AM, Andrew Or wrote: > Hi Wang Hao, > > This is

Re: Native library can not be loaded when using Mllib PCA

2014-06-12 Thread yangliuyu
Finally, we solved this problem by building our own netlib-java native .so files on CentOS. It works without any warning, but the performance is far from that on a MacBook Pro. The matrix size is rows: 6778, columns: 2487. The MBP took 10 seconds to get the PCA result, but CentOS took 110s, event

Re: How to specify executor memory in EC2 ?

2014-06-12 Thread Aaron Davidson
The scripts for Spark 1.0 actually specify this property in /root/spark/conf/spark-defaults.conf. I didn't know that this would override the --executor-memory flag, though; that's pretty odd. On Thu, Jun 12, 2014 at 6:02 PM, Aliaksei Litouka < aliaksei.lito...@gmail.com> wrote: > Yes, I am launc

Re: wholeTextFiles not working with HDFS

2014-06-12 Thread yinxusen
Hi Sguj, Could you give me the exception stack? I test it on my laptop and find that it gets the wrong FileSystem. It should be DistributedFileSystem, but it finds the RawLocalFileSystem. If we get the same exception stack, I'll try to fix it. Here is my exception stack: java.io.FileNotFoundEx

RE: Spilled shuffle files not being cleared

2014-06-12 Thread Shao, Saisai
Hi Michael, I think you can set up spark.cleaner.ttl=xxx to enable time-based metadata cleaner, which will clean old un-used shuffle data when it is timeout. For Spark 1.0 another way is to clean shuffle data using weak reference (reference tracking based, configuration is spark.cleaner.referen
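A minimal sketch of the time-based cleaner setting described above (the application name and TTL value are illustrative, not from the thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Enable the time-based metadata cleaner: shuffle data and other metadata
// older than 3600 seconds becomes eligible for cleanup.
val conf = new SparkConf()
  .setAppName("ShuffleCleanupExample")
  .set("spark.cleaner.ttl", "3600")
val sc = new SparkContext(conf)
```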

Re: Spilled shuffle files not being cleared

2014-06-12 Thread Michael Chang
Bump On Mon, Jun 9, 2014 at 3:22 PM, Michael Chang wrote: > Hi all, > > I'm seeing exceptions that look like the below in Spark 0.9.1. It looks > like I'm running out of inodes on my machines (I have around 300k each in a > 12 machine cluster). I took a quick look and I'm seeing some shuffle

Re: When to use CombineByKey vs reduceByKey?

2014-06-12 Thread Diana Hu
Matei, Thanks for the answer; this clarifies things very much. Based on my usage I would use combineByKey, since the output is another custom data structure. I found that my issues with combineByKey were relieved after doing more tuning with the level of parallelism. I've found that it really depend

Re: specifying fields for join()

2014-06-12 Thread SK
This issue is resolved. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/specifying-fields-for-join-tp7528p7544.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: An attempt to implement dbscan algorithm on top of Spark

2014-06-12 Thread Aliaksei Litouka
Vipul, Thanks for your feedback. As far as I understand, you mean RDD[(Double, Double)] (note the parentheses), and each of these Double values is supposed to contain one coordinate of a point. That limits us to 2-dimensional space, which is not suitable for many tasks. I want the algorithm to be able to

Re: Spark SQL - input command in web ui/event log

2014-06-12 Thread Michael Armbrust
Yeah, we should probably add that. Feel free to file a JIRA. You can get it manually by calling sc.setJobDescription with the query text before running the query. Michael On Thu, Jun 12, 2014 at 5:49 PM, shlee0605 wrote: > In shark, the input SQL string was shown at the description of the we
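A minimal sketch of that manual workaround (the query text and the SQL context variable are illustrative):

```scala
// Set the job description to the SQL text before running the query,
// so the web UI shows it as the description of the resulting job/stages.
val query = "SELECT page, COUNT(*) FROM logs GROUP BY page"
sc.setJobDescription(query)
val result = sqlContext.sql(query).collect()
```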

Re: spark EC2 bring-up problems

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 9:10 PM, Zongheng Yang wrote: > Hi Toby, > > It is usually the case that even if the EC2 console says the nodes are > up, they are not really fully initialized. For 16 nodes I have found > `--wait 800` to be the norm that makes things work. > It seems so! resume worked f

RE: Question about RDD cache, unpersist, materialization

2014-06-12 Thread innowireless TaeYun Kim
(I've clarified the statement (1) of my previous mail. See below.) From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr] Sent: Friday, June 13, 2014 10:05 AM To: user@spark.apache.org Subject: RE: Question about RDD cache, unpersist, materialization Currently I use rdd.coun

Re: use spark-shell in the source

2014-06-12 Thread Kevin Jung
Thanks for the answer. Yes, I tried to launch an interactive REPL in the middle of my application :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/use-spark-shell-in-the-source-tp7453p7539.html Sent from the Apache Spark User List mailing list archive at Nabb

RE: Question about RDD cache, unpersist, materialization

2014-06-12 Thread innowireless TaeYun Kim
Currently I use rdd.count() to force computation, as Nick Pentreath suggested. I think it would be nice to have a method that forcefully computes an RDD, so that unnecessary RDDs can be safely unpersist()ed. Let’s consider a case where rdd_a is a parent of both: (1) a short-term rd
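A sketch of the pattern being described (RDD names and functions are illustrative):

```scala
// rddA is a parent of both a short-lived rddB and a long-lived, cached rddC.
val rddA = input.map(parse).cache()
val rddB = rddA.filter(isInteresting)     // used once, soon discarded
val rddC = rddA.map(enrich).cache()       // reused by later stages

rddC.count()      // force computation so rddC is materialized while rddA is still cached
rddA.unpersist()  // rddA's blocks can now be dropped without forcing recomputation later
```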

Re: How to specify executor memory in EC2 ?

2014-06-12 Thread Aliaksei Litouka
Yes, I am launching a cluster with the spark_ec2 script. I checked /root/spark/conf/spark-env.sh on the master node and on slaves and it looks like this: #!/usr/bin/env bash > export SPARK_LOCAL_DIRS="/mnt/spark" > # Standalone cluster options > export SPARK_MASTER_OPTS="" > export SPARK_WORKER_IN

Spark SQL - input command in web ui/event log

2014-06-12 Thread shlee0605
In Shark, the input SQL string was shown in the description on the web UI, so it was easy to keep track of which stage corresponds to which query. However, spark-sql does not show any information about the input SQL string. It rather shows the lower-level operations in the web UI/event log. (s

Re: Question about RDD cache, unpersist, materialization

2014-06-12 Thread Nicholas Chammas
FYI: Here is a related discussion about this. On Thu, Jun 12, 2014 at 8:10 PM, innowireless TaeYun Kim < taeyun@innowireless.co.kr> wrote: > Maybe It would be nice that unpersist() ‘triggers’ the computat

Re: running Spark Streaming just once and stop it

2014-06-12 Thread Tathagata Das
You should be able to see the streaming tab in the Spark web UI (running on port 4040) if you have created a StreamingContext and you are using Spark 1.0. TD On Thu, Jun 12, 2014 at 1:06 AM, Ravi Hemnani wrote: > Hey, > > I did sparkcontext.addstreaminglistener(streaminglistener object) in my >

Re: NullPointerException on reading checkpoint files

2014-06-12 Thread Tathagata Das
Are you using broadcast variables in the streaming program? That might be the cause of this. As far as I can see, broadcast variables are probably not supported through restarts and checkpoints. TD On Thu, Jun 12, 2014 at 12:16 PM, Kiran wrote: > I am also seeing similar problem when trying t

Re: How to specify executor memory in EC2 ?

2014-06-12 Thread Matei Zaharia
Are you launching this using our EC2 scripts? Or have you set up a cluster by hand? Matei On Jun 12, 2014, at 2:32 PM, Aliaksei Litouka wrote: > spark-env.sh doesn't seem to contain any settings related to memory size :( I > will continue searching for a solution and will post it if I find i

RE: Question about RDD cache, unpersist, materialization

2014-06-12 Thread innowireless TaeYun Kim
Maybe it would be nice if unpersist() ‘triggered’ the computation of other RDDs that depend on it but have not yet been computed. The pseudo code could be as follows: unpersist() { if (this rdd has not been persisted) return; for (all rdds that depend on this rdd but are not yet computed)

Re: Doubts about MLlib.linalg in python

2014-06-12 Thread ericjohnston1989
Hi Congrui Yi, Spark is implemented in Scala, so new features are first available in Scala/Java. PySpark is a Python wrapper for the Scala code, so it won't always have the latest features. This is especially true for the machine learning library. Eric -- View this message in context:

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-12 Thread Nan Zhu
Ah, I see. I think it’s hard to do something like fs.delete() in Spark code (it’s scary, as we discussed in the previous PR), so if you want (C), I guess you have to do some delete work manually. Best, -- Nan Zhu On Thursday, June 12, 2014 at 3:31 PM, Daniel Siegmann wrote: > I do no

specifying fields for join()

2014-06-12 Thread SK
Hi, I want to join 2 rdds on specific fields. The first RDD is a set of tuples of the form: (ID, ACTION, TIMESTAMP, LOCATION) The second RDD is a set of tuples of the form: (ID, TIMESTAMP). rdd2 is a subset of rdd1. ID is a string. I want to join the two so that I can get the location correspo
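One way to do this (not from the thread): key both RDDs on the shared (ID, TIMESTAMP) fields and join. A sketch following the field layout described above:

```scala
// rdd1: (ID, ACTION, TIMESTAMP, LOCATION), rdd2: (ID, TIMESTAMP)
val keyed1 = rdd1.map { case (id, action, ts, loc) => ((id, ts), loc) }
val keyed2 = rdd2.map { case (id, ts) => ((id, ts), ()) }

// Keep the location for every (ID, TIMESTAMP) pair that appears in rdd2.
val locations = keyed1.join(keyed2)
  .map { case ((id, ts), (loc, _)) => (id, ts, loc) }
```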

Re: NullPointerExceptions when using val or broadcast on a standalone cluster.

2014-06-12 Thread Gerard Maas
That stack trace is quite similar to the one that is generated when trying to do a collect within a closure. In this case, it feels "wrong" to collect in a closure, but I wonder what's the reason behind the NPE. Curious to know whether they are related. Here's a very simple example: rrd1.flatMap(x=>
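An illustrative, hypothetical reduction of that pattern: nesting one RDD inside another RDD's closure fails on a cluster, while collecting the small RDD on the driver first works:

```scala
val rdd1 = sc.parallelize(1 to 10)
val rdd2 = sc.parallelize(100 to 110)

// Anti-pattern: an RDD operation inside another RDD's closure. Fails on a cluster.
// val broken = rdd1.flatMap(x => rdd2.collect().map(_ + x))

// Safer: collect (or broadcast) the small dataset on the driver first.
val local2 = rdd2.collect()
val ok = rdd1.flatMap(x => local2.map(_ + x))
```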

Re: An attempt to implement dbscan algorithm on top of Spark

2014-06-12 Thread Vipul Pandey
Great! I was going to implement one of my own - but I may not need to do that any more :) I haven't had a chance to look deep into your code but I would recommend accepting an RDD[Double,Double] as well, instead of just a file. val data = IOHelper.readDataset(sc, "/path/to/my/data.csv") And othe

Java Custom Receiver onStart method never called

2014-06-12 Thread jsabin
I create a Java custom receiver and then call ssc.receiverStream(new MyReceiver("localhost", 8081)); // where ssc is the JavaStreamingContext I am expecting that the receiver's onStart method gets called but it does not. Can anyone give me some guidance? What am I not doing? Here's the dependen

NullPointerExceptions when using val or broadcast on a standalone cluster.

2014-06-12 Thread bdamos
Hi, I'm consistently getting NullPointerExceptions when trying to use String val objects defined in my main application -- even for broadcast vals! I'm deploying on a standalone cluster with a master and 4 workers on the same machine, which is not the machine I'm submitting from. The following exa

Re: Normalizations in MLBase

2014-06-12 Thread DB Tsai
Hi Aslan, I'm not sure if the MLbase code is maintained for the current Spark master. The following is the code we use for standardization at my company. I intend to clean it up and submit a PR. You could use it for now. def standardize(data: RDD[Vector]): RDD[Vector] = { val summarizer = new
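The code in the message is cut off above; as an independent reference, here is a minimal z-score standardization sketch over plain Array[Double] feature vectors (this is not the implementation referred to in the thread):

```scala
import org.apache.spark.rdd.RDD

def standardize(data: RDD[Array[Double]]): RDD[Array[Double]] = {
  val n = data.count().toDouble
  val d = data.first().length

  // Per-column sums, then means.
  val sums = data.aggregate(new Array[Double](d))(
    (acc, v) => { var i = 0; while (i < d) { acc(i) += v(i); i += 1 }; acc },
    (a, b)   => { var i = 0; while (i < d) { a(i) += b(i); i += 1 }; a })
  val means = sums.map(_ / n)

  // Per-column sums of squared deviations, then standard deviations.
  val sq = data.aggregate(new Array[Double](d))(
    (acc, v) => { var i = 0; while (i < d) { val dev = v(i) - means(i); acc(i) += dev * dev; i += 1 }; acc },
    (a, b)   => { var i = 0; while (i < d) { a(i) += b(i); i += 1 }; a })
  val stddevs = sq.map(s => math.sqrt(s / n))

  // z-score each column; zero-variance columns map to 0.0.
  data.map(v => Array.tabulate(d)(i =>
    if (stddevs(i) == 0.0) 0.0 else (v(i) - means(i)) / stddevs(i)))
}
```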

Re: How to specify executor memory in EC2 ?

2014-06-12 Thread Aliaksei Litouka
spark-env.sh doesn't seem to contain any settings related to memory size :( I will continue searching for a solution and will post it if I find it :) Thank you, anyway On Wed, Jun 11, 2014 at 12:19 AM, Matei Zaharia wrote: > It might be that conf/spark-env.sh on EC2 is configured to set it to 5

An attempt to implement dbscan algorithm on top of Spark

2014-06-12 Thread Aliaksei Litouka
Hi. I'm not sure if messages like this are appropriate in this list; I just want to share with you an application I am working on. This is my personal project which I started to learn more about Spark and Scala, and, if it succeeds, to contribute it to the Spark community. Maybe someone will find

Re: Increase storage.MemoryStore size

2014-06-12 Thread Gianluca Privitera
If you are launching your application with spark-submit you can manually edit the spark-class file to make it 1g as baseline. It’s pretty easy to do and to figure out how once you open the file. This worked for me even if it’s not a final solution of course. Gianluca On 12 Jun 2014, at 15:16, e

Re: Question about RDD cache, unpersist, materialization

2014-06-12 Thread Daniel Siegmann
I've run into this issue. The goal of caching / persist seems to be to avoid recomputing an RDD when its data will be needed multiple times. However, once the following RDDs are computed the cache is no longer needed. The current design provides no obvious way to detect when the cache is no longe

Increase storage.MemoryStore size

2014-06-12 Thread ericjohnston1989
Hey everyone, I'm having some trouble increasing the default storage size for a broadcast variable. It looks like it defaults to a little less than 512MB every time, and I can't figure out which configuration to change to increase this. INFO storage.MemoryStore: Block broadcast_0 stored as values
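A hedged note, not from the thread: the MemoryStore capacity is derived from the JVM heap size and spark.storage.memoryFraction, so raising executor (or driver) memory is what moves it past the ~512 MB default. A sketch with Spark 1.0-era property names:

```scala
import org.apache.spark.SparkConf

// Larger executor heap, and (optionally) the share of it used for storage.
val conf = new SparkConf()
  .setAppName("BiggerMemoryStore")
  .set("spark.executor.memory", "4g")            // executor JVM heap
  .set("spark.storage.memoryFraction", "0.6")    // fraction of the heap backing the MemoryStore
```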

Re: Not fully cached when there is enough memory

2014-06-12 Thread Daniel Siegmann
I too have seen cached RDDs not hit 100%, even when they are DISK_ONLY. Just saw that yesterday in fact. In some cases RDDs I expected didn't show up in the list at all. I have no idea if this is an issue with Spark or something I'm not understanding about how persist works (probably the latter).

Re: spark EC2 bring-up problems

2014-06-12 Thread Zongheng Yang
Hi Toby, It is usually the case that even if the EC2 console says the nodes are up, they are not really fully initialized. For 16 nodes I have found `--wait 800` to be the norm that makes things work. In my previous experience I have found this to be the culprit, so if you immediately do 'launch

Re: spark EC2 bring-up problems

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 8:50 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Yes, you need Python 2.7 to run spark-ec2 and most AMIs come with 2.6 > Ah, yes - I mean to say, Amazon Linux. > .Have you tried either: > >1. Retrying launch with the --resume option? >2. Increasing

Re: creating new ami image for spark ec2 commands

2014-06-12 Thread Nicholas Chammas
Yeah, we badly need new AMIs that include at a minimum package/security updates and Python 2.7. There is an open issue to track the 2.7 AMI update, at least. On Thu, Jun 12, 2014 at 3:34 PM, wrote: > Creating AMIs from scratch is a complete pain

Re: spark EC2 bring-up problems

2014-06-12 Thread Nicholas Chammas
Yes, you need Python 2.7 to run spark-ec2 and most AMIs come with 2.6. Have you tried either: 1. Retrying launch with the --resume option? 2. Increasing the value of the --wait option? Nick ​ On Thu, Jun 12, 2014 at 3:38 PM, Toby Douglass wrote: > Gents, > > I have been bringing up a c

spark EC2 bring-up problems

2014-06-12 Thread Toby Douglass
Gents, I have been bringing up a cluster on EC2 using the spark_ec2.py script. This works if the cluster has a single slave. This fails if the cluster has sixteen slaves, during the work to transfer the SSH key to the slaves. I cannot currently bring up a large cluster. Can anyone shed any lig

Re: creating new ami image for spark ec2 commands

2014-06-12 Thread unorthodox . engineers
Creating AMIs from scratch is a complete pain in the ass. If you have a spare week, sure. I understand why the team avoids it. The easiest way is probably to spin up a working instance and then use Amazon's "save as new AMI", but that has some major limitations, especially with software not expe

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-12 Thread Daniel Siegmann
I do not want the behavior of (A) - that is dangerous and should only be enabled to account for legacy code. Personally, I think this option should eventually be removed. I want the option (C), to have Spark delete any existing part files before creating any new output. I don't necessarily want th

Re: NullPointerException on reading checkpoint files

2014-06-12 Thread Kiran
I am also seeing similar problem when trying to continue job using saved checkpoint. Can somebody help in solving this problem? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-on-reading-checkpoint-files-tp7306p7507.html Sent from the Ap

Re: Spark-Streaming window processing

2014-06-12 Thread unorthodox . engineers
To get the streaming latency I just look at the stats on the application driver's UI webpage. I don't know if you can do that programmatically, but you could curl and parse the page if you had to. Jeremy Lee BCompSci (Hons) The Unorthodox Engineers > On 10 Jun 2014, at 3:36 pm, Yingjun Wu wr

Re: Merging all Spark Streaming RDDs to one RDD

2014-06-12 Thread unorthodox . engineers
I have much the same issue. While I haven't totally solved it yet, I have found the "window" method useful for batching up archive blocks - but updateStateByKey is probably what we want to use, perhaps multiple times. If that works. My bigger worry now is storage. Unlike non-streaming apps, we

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-12 Thread Nan Zhu
Actually this has been merged to the master branch https://github.com/apache/spark/pull/947 -- Nan Zhu On Thursday, June 12, 2014 at 2:39 PM, Daniel Siegmann wrote: > The old behavior (A) was dangerous, so it's good that (B) is now the default. > But in some cases I really do want to repla

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-12 Thread Daniel Siegmann
The old behavior (A) was dangerous, so it's good that (B) is now the default. But in some cases I really do want to replace the old data, as per (C). For example, I may rerun a previous computation (perhaps the input data was corrupt and I'm rerunning with good input). Currently I have to write se

Re: Writing data to HBase using Spark

2014-06-12 Thread Nan Zhu
you are using spark streaming? master = “local[n]” where n > 1? Best, -- Nan Zhu On Wednesday, June 11, 2014 at 4:23 AM, gaurav.dasgupta wrote: > Hi Kanwaldeep, > > I have tried your code but arrived into a problem. The code is working fine > in local mode. But if I run the same code
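For context, a minimal sketch of the point about master = "local[n]" with n > 1: a receiver occupies one thread, so at least one more is needed to process the received data (the app name and batch interval are illustrative):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "local[2]": one thread for the receiver, one for processing.
val ssc = new StreamingContext("local[2]", "HBaseWriterExample", Seconds(5))
```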

Re: Writing data to HBase using Spark

2014-06-12 Thread Kanwaldeep
We are running code in a cluster using Spark Standalone cluster and not seeing this issue. The new data is being saved to HBase. Can you check if you are getting any errors and the reading from Kafka actually stops? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabbl

Re: Powered by Spark addition

2014-06-12 Thread Derek Mansen
Awesome, thank you! On Wed, Jun 11, 2014 at 6:53 PM, Matei Zaharia wrote: > Alright, added you. > > Matei > > On Jun 11, 2014, at 1:28 PM, Derek Mansen wrote: > > Hello, I was wondering if we could add our organization to the "Powered by > Spark" page. The information is: > > Name: Vistar Medi

Re: overwriting output directory

2014-06-12 Thread Nan Zhu
Hi, SK For 1.0.0 you have to delete it manually in 1.0.1 there will be a parameter to enable overwriting https://github.com/apache/spark/pull/947/files Best, -- Nan Zhu On Thursday, June 12, 2014 at 1:57 PM, SK wrote: > Hi, > > When we have multiple runs of a program writing to the sam

overwriting output directory

2014-06-12 Thread SK
Hi, When we have multiple runs of a program writing to the same output file, the execution fails if the output directory already exists from a previous run. Is there some way we can have it overwrite the existing directory, so that we don't have to manually delete it after each run? Thanks for you

Announcing MesosCon schedule, incl Spark presentation

2014-06-12 Thread Dave Lester
Spark friends, Yesterday the Mesos community announced the program for #MesosCon, the first-ever Apache Mesos conference taking place August 21-22, 2014 in Chicago, IL. Given the close rel

Re: Adding external jar to spark-shell classpath in spark 1.0

2014-06-12 Thread Shivani Rao
@Marcelo: The command ./bin/spark-shell --jars jar1,jar2,etc,etc did not work for me on a Linux machine. What I did was to append the classpath in the bin/compute-classpath.sh file, run the script, then start the spark shell, and that worked. Thanks Shivani On Wed, Jun 11, 2014 at 10:52 AM, A

Re: Hanging Spark jobs

2014-06-12 Thread Shivani Rao
I learned this from my co-worker, but it is relevant here. Spark uses lazy evaluation by default, which means that none of your code gets executed until you run your "saveAsTextFile", so the failure there does not tell you much about where the problem is actually occurring. In order to debug this better, you might

Re: Access DStream content?

2014-06-12 Thread Gianluca Privitera
You can use foreachRDD and then access the RDD data. Hope this works for you. Gianluca On 12 Jun 2014, at 10:06, Wolfinger, Fred wrote: Good morning. I have a question related to Spark Streaming. I have reduced some data down to a simple count value (by window),
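A minimal sketch of that approach (the DStream name, threshold, sendAlert action, and output path are illustrative):

```scala
countsByWindow.foreachRDD { rdd =>
  // The windowed count is a single value; act on it before (or while) persisting it.
  rdd.collect().headOption.foreach { count =>
    if (count > threshold) sendAlert(count)   // hypothetical immediate action
  }
  rdd.saveAsTextFile(s"hdfs:///counts/batch-${System.currentTimeMillis}")
}
```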

Re: use spark-shell in the source

2014-06-12 Thread Andrew Or
Not sure if this is what you're looking for, but have you looked at java's ProcessBuilder? You can do something like for (line <- lines) { val command = line.split(" ") // You may need to deal with quoted strings val process = new ProcessBuilder(command) // redirect output of process to main
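A slightly fuller sketch of that idea (the commands are illustrative; quoted-string handling is still omitted):

```scala
import scala.collection.JavaConverters._

val lines = Seq("echo hello", "ls -l /tmp")
for (line <- lines) {
  val command = line.split(" ").toSeq          // naive split; quoted args need real parsing
  val process = new ProcessBuilder(command.asJava)
    .inheritIO()                               // send the child's output to this process's stdout/stderr
    .start()
  process.waitFor()
}
```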

Re: Spark 1.0.0 Standalone AppClient cannot connect Master

2014-06-12 Thread Andrew Or
Hi Wang Hao, This is not removed. We moved it here: http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html If you're building with SBT, and you don't specify the SPARK_HADOOP_VERSION, then it defaults to 1.0.4. Andrew 2014-06-12 6:24 GMT-07:00 Hao Wang : > Hi, all > > Why do

Re: Writing data to HBase using Spark

2014-06-12 Thread Mayur Rustagi
Are you able to use the Hadoop input/output reader for HBase with the new Hadoop API reader? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Jun 12, 2014 at 7:49 AM, gaurav.dasgupta wrote: > Is there anyone else who is facing

wholeTextFiles not working with HDFS

2014-06-12 Thread Sguj
I'm trying to get a list of every filename in a directory from HDFS using pySpark, and the only thing that seems like it would return the filenames is the wholeTextFiles function. My code for just trying to collect that data is this: files = sc.wholeTextFiles("hdfs://localhost:port/users/me
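For reference, a Scala sketch of the same idea (the question above is about PySpark; the HDFS URI and port here are illustrative). wholeTextFiles returns (path, content) pairs, so the keys alone give the file names:

```scala
val files = sc.wholeTextFiles("hdfs://localhost:9000/users/me/data")
val fileNames = files.keys.collect()   // one fully-qualified path per file
fileNames.foreach(println)
```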

Re: initial basic question from new user

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 4:48 PM, Andre Schumacher < schum...@icsi.berkeley.edu> wrote: > On 06/12/2014 05:47 PM, Toby Douglass wrote: > > > In these future jobs, when I come to load the aggregted RDD, will Spark > > load and only load the columns being accessed by the query? or will > Spark > > l

Re: HELP!? Re: Streaming trouble (reduceByKey causes all printing to stop)

2014-06-12 Thread Michael Campbell
I don't know if this matters, but I'm looking at the web site that Spark puts up, and I see under the streaming tab: - Started at: Thu Jun 12 11:42:10 EDT 2014 - Time since start: 6 minutes 3 seconds - Network receivers: 1 - Batch interval: 5 seconds - Processed batches:

Re: initial basic question from new user

2014-06-12 Thread Andre Schumacher
Hi, On 06/12/2014 05:47 PM, Toby Douglass wrote: > In these future jobs, when I come to load the aggregted RDD, will Spark > load and only load the columns being accessed by the query? or will Spark > load everything, to convert it into an internal representation, and then > execute the query?

Re: Using Spark to crack passwords

2014-06-12 Thread Sean Owen
You need a use case where a lot of computation is applied to a little data. How about any of the various distributed computing projects out there? Although the SETI@home use case seems like a cool example, I doubt you want to reimplement its client. It might be far simpler to reimplement a search

Re: Using Spark to crack passwords

2014-06-12 Thread Nicholas Chammas
Indeed, rainbow tables are helpful for working on unsalted hashes. They turn a large amount of computational work into a bit of computational work and a bit of lookup work. The rainbow tables could easily be captured as RDDs. I guess I derailed my own discussion by focusing on password cracking, s

Access DStream content?

2014-06-12 Thread Wolfinger, Fred
Good morning. I have a question related to Spark Streaming. I have reduced some data down to a simple count value (by window), and would like to take immediate action on that value before storing in HDFS. I don't see any DStream member functions that would allow me to access its contents. Is what

Re: initial basic question from new user

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 11:36 AM, Gerard Maas wrote: > If you need to ad-hoc persist to files, you can can save RDDs using > rdd.saveAsObjectFile(...) [1] and load them afterwards using > sparkContext.objectFile(...) > Appears not available from Python.
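For the Scala API being discussed, a minimal sketch (the paths and the parseRecord function are illustrative):

```scala
// Persist an aggregated RDD for reuse by later jobs...
val agg = sc.textFile("hdfs:///logs/2014/06/12")
  .map(parseRecord)                        // hypothetical: yields (key: String, value: Long)
  .reduceByKey(_ + _)
agg.saveAsObjectFile("hdfs:///aggregates/2014-06-12")

// ...and reload it in a later job.
val reloaded = sc.objectFile[(String, Long)]("hdfs:///aggregates/2014-06-12")
```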

Re: initial basic question from new user

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 3:15 PM, FRANK AUSTIN NOTHAFT wrote: > RE: > > > Given that our agg sizes will exceed memory, we expect to cache them to > disk, so save-as-object (assuming there are no out of the ordinary > performance issues) may solve the problem, but I was hoping to store data > is a

Re: initial basic question from new user

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 3:03 PM, Christopher Nguyen wrote: > Toby, #saveAsTextFile() and #saveAsObjectFile() are probably what you want > for your use case. > Yes. Thank you. I'm about to see if they exist for Python. > As for Parquet support, that's newly arrived in Spark 1.0.0 together with

HELP!? Re: Streaming trouble (reduceByKey causes all printing to stop)

2014-06-12 Thread Michael Campbell
Ad... it's NOT working. Here's the code: val bytes = kafkaStream.map({ case (key, messageBytes) => messageBytes}) // Map to just get the bytes part out... val things = bytes.flatMap(bytesArrayToThings) // convert to a thing val srcDestinations = things.map(thing => (ip

Re: initial basic question from new user

2014-06-12 Thread FRANK AUSTIN NOTHAFT
RE: > Given that our agg sizes will exceed memory, we expect to cache them to disk, so save-as-object (assuming there are no out of the ordinary performance issues) may solve the problem, but I was hoping to store data in a column-oriented format. However I think this in general is not possible

Re: initial basic question from new user

2014-06-12 Thread Christopher Nguyen
Toby, #saveAsTextFile() and #saveAsObjectFile() are probably what you want for your use case. As for Parquet support, that's newly arrived in Spark 1.0.0 together with SparkSQL so continue to watch this space. Gerard's suggestion to look at JobServer, which you can generalize as "building a long-r

Spark 1.0.0 Standalone AppClient cannot connect Master

2014-06-12 Thread Hao Wang
Hi, all Why does the Spark 1.0.0 official doc remove the section on how to build Spark with a corresponding Hadoop version? Does it mean that I don't need to specify the Hadoop version when I build my Spark 1.0.0 with `sbt/sbt assembly`? Regards, Wang Hao(王灏) CloudTeam | School of Software Engineering Shanghai

RE: shuffling using netty in spark streaming

2014-06-12 Thread Shao, Saisai
Hi, 1. The performance depends on your hardware and system configuration; you can test it yourself. In my test, the two shuffle implementations have no particular performance difference in the latest version. 2. That’s the correct way to turn on netty-based shuffle, and there’s no shuffle fetch

How to use SequenceFileRDDFunctions.saveAsSequenceFile() in Java?

2014-06-12 Thread innowireless TaeYun Kim
Hi, How to use SequenceFileRDDFunctions.saveAsSequenceFile() in Java? A simple example will be a great help. Thanks in advance.

Re: Writing data to HBase using Spark

2014-06-12 Thread gaurav.dasgupta
Is there anyone else who is facing this problem of writing to HBase when running Spark on YARN mode or Standalone mode using this example? If not, then do I need to explicitly, specify something in the classpath? Regards, Gaurav On Wed, Jun 11, 2014 at 1:53 PM, Gaurav Dasgupta wrote: > Hi Kan

Re: pmml with augustus

2014-06-12 Thread filipus
@villu: thank you for your help. I promise I'm gonna try it! That's cool :-) Do you also know the other way around, from PMML to a model object in Spark? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pmml-with-augustus-tp7313p7473.html Sent from the Apache S

Re: initial basic question from new user

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 11:36 AM, Gerard Maas wrote: > The goal of rdd.persist is to created a cached rdd that breaks the DAG > lineage. Therefore, computations *in the same job* that use that RDD can > re-use that intermediate result, but it's not meant to survive between job > runs. > As I und

Re: HDFS Server/Client IPC version mismatch while trying to access HDFS files using Spark-0.9.1

2014-06-12 Thread bijoy deb
Hi, The problem was due to a pre-built/binary Tachyon-0.4.1 jar in the SPARK_CLASSPATH, and that Tachyon jar had been built against Hadoop-1.0.4. Building the Tachyon against Hadoop-2.0.0 resolved the issue. Thanks On Wed, Jun 11, 2014 at 11:34 PM, Marcelo Vanzin wrote: > The error is saying t

Re: initial basic question from new user

2014-06-12 Thread Gerard Maas
The goal of rdd.persist is to created a cached rdd that breaks the DAG lineage. Therefore, computations *in the same job* that use that RDD can re-use that intermediate result, but it's not meant to survive between job runs. for example: val baseData = rawDataRdd.map(...).flatMap(...).reduceByKey
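A sketch continuing the example above (RDD and function names are illustrative):

```scala
val baseData = rawDataRdd.map(parse).flatMap(expand).reduceByKey(_ + _)
baseData.persist()                                    // shared intermediate result

val errorCount = baseData.filter { case (k, _) => isError(k) }.count()  // reuses the cached data
val report     = baseData.mapValues(score).collect()                    // no recomputation of the lineage
```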

Re: how to set spark.executor.memory and heap size

2014-06-12 Thread Laurent T
Hi, Can you give us a little more insight on how you used that file to solve your problem ? We're having the same OOM as you were and haven't been able to solve it yet. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory

How to read a snappy-compressed text file?

2014-06-12 Thread innowireless TaeYun Kim
Hi, Maybe this is a newbie question: How to read a snappy-compressed text file? The OS is Windows 7. Currently, I've done the following steps: 1. Built Hadoop 2.4.0 with snappy option. 'hadoop checknative' command displays the following line: snappy: true D:\hadoop-2.4.0\bin\snappy.d

Re: Using Spark to crack passwords

2014-06-12 Thread Marek Wiewiorka
This is actually what I've already mentioned - with rainbow tables kept in memory it could be really fast! Marek 2014-06-12 9:25 GMT+02:00 Michael Cutler : > Hi Nick, > > The great thing about any *unsalted* hashes is you can precompute them > ahead of time, then it is just a lookup to find the p

initial basic question from new user

2014-06-12 Thread Toby Douglass
Gents, I am investigating Spark with a view to perform reporting on a large data set, where the large data set receives additional data in the form of log files on an hourly basis. Where the data set is large there is a possibility we will create a range of aggregate tables, to reduce the volume

Re: Normalizations in MLBase

2014-06-12 Thread Aslan Bekirov
Hi DB, I found a piece of code that uses znorm to normalize data. /** * build training data set from sample and summary data */ val train_data = sample_data.map( v => Array.tabulate[Double](field_cnt)( i => zscore(v._2(i),sample_mean(i),sample_stddev(i)) ) ).cache Please make you

Re: Normalizations in MLBase

2014-06-12 Thread Aslan Bekirov
Thanks a lot DB. I will try to do Znorm normalization using map transformation. BR, Aslan On Thu, Jun 12, 2014 at 12:16 AM, DB Tsai wrote: > Hi Aslan, > > Currently, we don't have the utility function to do so. However, you > can easily implement this by another map transformation. I'm worki

Re: running Spark Streaming just once and stop it

2014-06-12 Thread Ravi Hemnani
Hey, I did sparkcontext.addstreaminglistener(streaminglistener object) in my code and I am able to see some stats in the logs but can't see anything in the web UI. How do I add the Streaming tab to the web UI? I need to get queuing delays and related information. -- View this message in context:

Re: Spark SQL incorrect result on GROUP BY query

2014-06-12 Thread Michael Armbrust
Thanks for verifying! On Thu, Jun 12, 2014 at 12:28 AM, Pei-Lun Lee wrote: > I reran with master and looks like it is fixed. > > > > 2014-06-12 1:26 GMT+08:00 Michael Armbrust : > > I'd try rerunning with master. It is likely you are running into >> SPARK-1994

Re: Spark SQL incorrect result on GROUP BY query

2014-06-12 Thread Pei-Lun Lee
I reran with master and looks like it is fixed. 2014-06-12 1:26 GMT+08:00 Michael Armbrust : > I'd try rerunning with master. It is likely you are running into > SPARK-1994 . > > Michael > > > On Wed, Jun 11, 2014 at 3:01 AM, Pei-Lun Lee wrote

Re: Using Spark to crack passwords

2014-06-12 Thread Michael Cutler
Hi Nick, The great thing about any *unsalted* hashes is you can precompute them ahead of time, then it is just a lookup to find the password which matches the hash in seconds -- always makes for a more exciting demo than "come back in a few hours". It is a no-brainer to write a generator function
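A minimal sketch of the precompute-then-lookup idea (the candidate space, hash choice, and names are illustrative):

```scala
import java.security.MessageDigest

def sha1Hex(s: String): String =
  MessageDigest.getInstance("SHA-1").digest(s.getBytes("UTF-8"))
    .map("%02x".format(_)).mkString

// Precompute hash -> password for a small candidate space (here, 6-digit PINs).
val candidates = sc.parallelize(0 until 1000000).map(i => f"$i%06d")
val table = candidates.map(p => (sha1Hex(p), p))

// Cracking an unsalted hash is then just a lookup.
val cracked = table.lookup(sha1Hex("123456"))   // Seq("123456")
```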

Re: json parsing with json4s

2014-06-12 Thread Tobias Pfeiffer
Hi, I usually use pattern matching for that, like json \ "key" match { case JInt(i) => i; case _ => 0 /* default value */ } Tobias On Thu, Jun 12, 2014 at 7:39 AM, Michael Cutler wrote: > Hello, > > You're absolutely right, the syntax you're using is returning the json4s > value objects, not na
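A self-contained sketch completing the snippet above (the sample JSON is illustrative):

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods._

val json = parse("""{"key": 42, "name": "spark"}""")

// Pattern match on the extracted JValue, with a default for missing/non-integer fields.
val value: BigInt = json \ "key" match {
  case JInt(i) => i
  case _       => BigInt(0)
}
```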