Re: Problem in Spark Streaming

2014-06-11 Thread nilmish
I used these commands to show the GC timings: -verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps Following is the output I got on the standard output: 4.092: [GC 4.092: [ParNew: 274752K->27199K(309056K), 0.0421460 secs] 274752K->27199K(995776K), 0.0422720 secs] [Times: user=0.33 sys=0.11,
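As a side note for anyone reproducing this: with HotSpot flags a leading "+" enables an option and "-" disables it, so detailed GC lines normally come from -XX:+PrintGCDetails. A minimal sketch of wiring such flags into the executors, assuming the Spark 1.0 spark.executor.extraJavaOptions property is available in your deployment (the app name is a placeholder):

    // Sketch: forward GC logging flags to every executor JVM
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("gc-logging-example")
      .set("spark.executor.extraJavaOptions",
           "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    val sc = new SparkContext(conf)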

Re: Problem in Spark Streaming

2014-06-11 Thread vinay Bajaj
http://stackoverflow.com/questions/895444/java-garbage-collection-log-messages http://stackoverflow.com/questions/16794783/how-to-read-a-verbosegc-output I think this will help in understanding the logs. On Wed, Jun 11, 2014 at 12:53 PM, nilmish wrote: > > I used these commands to show the GC

Number of Spark streams in Yarn cluster

2014-06-11 Thread tnegi
Hi, I am trying to get a sense of the number of streams we can process in parallel on a Spark Streaming cluster (Hadoop YARN). Is there any benchmark for the same? We need a large number of streams (original + transformed) to be processed in parallel. The number is approximately 30, tha

Re: Spark Kafka streaming - ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaReceiver

2014-06-11 Thread gaurav.dasgupta
Thanks Tobias for replying. The problem was that I had to provide the dependency jars' paths to the StreamingContext within the code. Providing all the jar paths resolved my problem. Refer to the code snippet below: *JavaStreamingContext ssc = new JavaStreamingContext(args[0], "S
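For readers hitting the same ClassNotFoundException, a rough Scala sketch of the fix follows. It assumes the Spark 1.0 StreamingContext constructor that accepts a sparkHome and a list of jars; the jar paths, master URL and app name are placeholders, not the poster's actual values:

    // Hand the dependency jars to the streaming context so they are shipped to executors
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val jars = Seq("/path/to/spark-streaming-kafka_2.10.jar", "/path/to/kafka.jar")
    val ssc = new StreamingContext("spark://master:7077", "KafkaWordCount",
      Seconds(2), System.getenv("SPARK_HOME"), jars)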

Re: Writing data to HBase using Spark

2014-06-11 Thread gaurav.dasgupta
Hi Kanwaldeep, I have tried your code but ran into a problem. The code works fine in *local* mode, but if I run the same code in Spark standalone mode or YARN mode, it keeps executing without saving anything to the HBase table. I guess it is stopping data streaming once

Re: HDFS Server/Client IPC version mismatch while trying to access HDFS files using Spark-0.9.1

2014-06-11 Thread bijoy deb
Any suggestions from anyone? Thanks Bijoy On Tue, Jun 10, 2014 at 11:46 PM, bijoy deb wrote: > Hi all, > > I have build Shark-0.9.1 using sbt using the below command: > > *SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.6.0 sbt/sbt assembly* > > My Hadoop cluster is also having version 2.0.0-mr1-cdh4.6.0.

Re: Hanging Spark jobs

2014-06-11 Thread Daniel Darabos
These stack traces come from the stuck node? Looks like it's waiting on data in BlockFetcherIterator. Waiting for data from another node. But you say all other nodes were done? Very curious. Maybe you could try turning on debug logging, and try to figure out what happens in BlockFetcherIterator (

Spark SQL incorrect result on GROUP BY query

2014-06-11 Thread Pei-Lun Lee
Hi, I am using Spark 1.0.0 and found that in Spark SQL some queries using GROUP BY give weird results. To reproduce, type the following commands in spark-shell connecting to a standalone server: case class Foo(k: String, v: Int) val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.
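The preview cuts off before the actual queries; purely as an illustration of the reported pattern (not the reporter's data or results), a GROUP BY repro in the Spark 1.0 shell typically looks like the sketch below, assuming the registerAsTable/createSchemaRDD API of that release:

    case class Foo(k: String, v: Int)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD

    // made-up rows, just to have a few groups to aggregate over
    val foos = sc.parallelize(List.fill(100)(Foo("a", 1)) ++ List.fill(200)(Foo("b", 2)))
    foos.registerAsTable("foo")
    sqlContext.sql("SELECT k, SUM(v) FROM foo GROUP BY k").collect().foreach(println)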

Error During ReceivingConnection

2014-06-11 Thread Surendranauth Hiraman
I have a somewhat large job (10 GB input data but generates about 500 GB of data after many stages). Most tasks completed but a few stragglers on the same node/executor are still active (but doing nothing) after about 16 hours. At about 3 to 4 hours in, the tasks that are hanging have the followi

Normalizations in MLBase

2014-06-11 Thread Aslan Bekirov
Hi All, I have to normalize a set of values in the range 0-500 to the [0-1] range. Is there any util method in MLBase to normalize large set of data? BR, Aslan

RE: Spark SQL incorrect result on GROUP BY query

2014-06-11 Thread Cheng, Hao
That’s a good catch, but I think it’s suggested to use HiveContext currently. ( https://github.com/apache/spark/tree/master/sql) Catalyst$> sbt/sbt hive/console case class Foo(k: String, v: Int) val rows = List.fill(100)(Foo("a", 1)) ++ List.fill(200)(Foo("b", 2)) ++ List.fill(300)(Foo("c", 3))

Re: Using Spark on Data size larger than Memory size

2014-06-11 Thread Surendranauth Hiraman
My team has been using DISK_ONLY. The challenge with this approach is knowing when to unpersist if your job creates a lot of intermediate data. The "right solution" would be to mark a transient RDD as being capable of spilling to disk, rather than having to persist it to force this behavior. Hopefu
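Until such a "spillable transient RDD" exists, the manual pattern is persist-then-unpersist. A small self-contained sketch (the data and names are illustrative only):

    import org.apache.spark.SparkContext._
    import org.apache.spark.storage.StorageLevel

    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, i.toLong))
    val intermediate = pairs.persist(StorageLevel.DISK_ONLY)  // spill to disk, not memory
    val result = intermediate.reduceByKey(_ + _).collect()
    intermediate.unpersist()  // free the on-disk blocks once downstream stages are done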

Re: NoSuchMethodError in KafkaReciever

2014-06-11 Thread mpieck
The createRawStream method by xtrahotsauce, which operates on byte arrays, could be proposed as a workaround for this bug until it is fixed. The messages would then have to be decoded in the map/reduce phase, but it's better than nothing. -- View this message in context: http://apache-spark-user-list.1001560

MLLib: Decision Tree not getting built for 5 or more levels (maxDepth=5) and the one built for 3 levels is performing poorly

2014-06-11 Thread SURAJ SHETH
Hi, I have been trying to build a Decision Tree using a dataset that I have. Dataset Description: Train data size = 689,763 Test data size = 8,387,813 Each row in the dataset has 321 numerical features, out of which the 139th value is the ground truth. The number of positives in the dataset is low.
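For context, a hedged sketch of how such a tree is typically trained with Spark 1.0 MLlib follows. It assumes the DecisionTree.train(data, algo, impurity, maxDepth) signature of that release; the file path and CSV layout are stand-ins for the poster's 321-column dataset (label in column 139):

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.tree.configuration.Algo
    import org.apache.spark.mllib.tree.impurity.Gini

    val data = sc.textFile("hdfs:///path/to/train.csv").map { line =>
      val cols = line.split(',').map(_.toDouble)
      // column index 138 (the 139th value) is treated as the label
      LabeledPoint(cols(138), Vectors.dense(cols.take(138) ++ cols.drop(139)))
    }
    val model = DecisionTree.train(data, Algo.Classification, Gini, 5)  // maxDepth = 5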

Re: Information on Spark UI

2014-06-11 Thread Daniel Darabos
About more succeeded tasks than total tasks: - This can happen if you have enabled speculative execution. Some partitions can get processed multiple times. - More commonly, the result of the stage may be used in a later calculation, and has to be recalculated. This happens if some of the results

Re: Error During ReceivingConnection

2014-06-11 Thread Surendranauth Hiraman
It looks like this was due to another executor on a different node closing the connection on its side. I found the entries below in the remote side's logs. Can anyone comment on why one ConnectionManager would close its connection to another node and what could be tuned to avoid this? It did not h

Re: Information on Spark UI

2014-06-11 Thread Shuo Xiang
Daniel, Thanks for the explanation. On Wed, Jun 11, 2014 at 8:57 AM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > About more succeeded tasks than total tasks: > - This can happen if you have enabled speculative execution. Some > partitions can get processed multiple times. > -

Re: MLLib: Decision Tree not getting built for 5 or more levels (maxDepth=5) and the one built for 3 levels is performing poorly

2014-06-11 Thread filipus
Well, I guess your problem is quite unbalanced, and due to the information value as a splitting criterion I guess the algo stops after very few splits. A workaround is oversampling: build many training datasets, e.g. randomly take 50% of the positives and the same amount from the negatives, or let's say

Re: Information on Spark UI

2014-06-11 Thread Neville Li
Does cache eviction affect disk storage level too? I tried cranking up replication but still seeing this. On Wednesday, June 11, 2014, Shuo Xiang wrote: > Daniel, > Thanks for the explanation. > > > On Wed, Jun 11, 2014 at 8:57 AM, Daniel Darabos < > daniel.dara...@lynxanalytics.com > > wrote:

Re: Spark SQL incorrect result on GROUP BY query

2014-06-11 Thread Michael Armbrust
I'd try rerunning with master. It is likely you are running into SPARK-1994. Michael On Wed, Jun 11, 2014 at 3:01 AM, Pei-Lun Lee wrote: > Hi, > > I am using spark 1.0.0 and found in spark sql some queries use GROUP BY > give weird results. >

Adding external jar to spark-shell classpath in spark 1.0

2014-06-11 Thread Ulanov, Alexander
Hi, I am currently using spark 1.0 locally on Windows 7. I would like to use classes from external jar in the spark-shell. I followed the instruction in: http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3CCALrNVjWWF6k=c_jrhoe9w_qaacjld4+kbduhfv0pitr8h1f...@mail.gmail.com%3E I ha

Re: Adding external jar to spark-shell classpath in spark 1.0

2014-06-11 Thread Marcelo Vanzin
Just tried this and it worked fine for me: ./bin/spark-shell --jars jar1,jar2,etc,etc On Wed, Jun 11, 2014 at 10:25 AM, Ulanov, Alexander wrote: > Hi, > > > > I am currently using spark 1.0 locally on Windows 7. I would like to use > classes from external jar in the spark-shell. I followed the i

Re: Adding external jar to spark-shell classpath in spark 1.0

2014-06-11 Thread Marcelo Vanzin
Ah, not that it should matter, but I'm on Linux and you seem to be on Windows... maybe there is something weird going on with the Windows launcher? On Wed, Jun 11, 2014 at 10:34 AM, Marcelo Vanzin wrote: > Just tried this and it worked fine for me: > > ./bin/spark-shell --jars jar1,jar2,etc,etc >

Re: pmml with augustus

2014-06-11 Thread Villu Ruusmann
Hello Spark/PMML enthusiasts, It's pretty trivial to integrate the JPMML-Evaluator library with Spark. In brief, take the following steps in your Spark application code: 1) Create a Java Map ("arguments") that represents the input data record. You need to specify a key-value mapping for every acti

RE: Adding external jar to spark-shell classpath in spark 1.0

2014-06-11 Thread Ulanov, Alexander
Are you able to import any class from your jars within spark-shell? -Original Message- From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: Wednesday, June 11, 2014 9:36 PM To: user@spark.apache.org Subject: Re: Adding external jar to spark-shell classpath in spark 1.0 Ah, not that it

Re: Adding external jar to spark-shell classpath in spark 1.0

2014-06-11 Thread Andrew Or
This is a known issue: https://issues.apache.org/jira/browse/SPARK-1919. We haven't found a fix yet, but for now, you can workaround this by including your simple class in your application jar. 2014-06-11 10:25 GMT-07:00 Ulanov, Alexander : > Hi, > > > > I am currently using spark 1.0 locally o

RE: Adding external jar to spark-shell classpath in spark 1.0

2014-06-11 Thread Ulanov, Alexander
Could you elaborate on this? I don’t have an application, I just use spark shell. From: Andrew Or [mailto:and...@databricks.com] Sent: Wednesday, June 11, 2014 9:40 PM To: user@spark.apache.org Subject: Re: Adding external jar to spark-shell classpath in spark 1.0 This is a known issue: https://

Having trouble with streaming (updateStateByKey)

2014-06-11 Thread Michael Campbell
I'm having a little trouble getting an "updateStateByKey()" call to work; was wondering if anyone could help. In my chain of calls from getting Kafka messages out of the queue to converting the message to a set of "things", then pulling out 2 attributes of those things to a Tuple2, everything work
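For reference, the canonical shape of a working updateStateByKey call is sketched below; a socket source and word counts stand in for the Kafka pipeline described above, and stateful operators also need a checkpoint directory:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("/tmp/streaming-checkpoint")   // required for stateful operations

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // merge this batch's values into the running state for each key
    val updateFunc = (values: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + values.sum)

    val runningCounts = pairs.updateStateByKey[Int](updateFunc)
    runningCounts.print()

    ssc.start()
    ssc.awaitTermination()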

Re: Adding external jar to spark-shell classpath in spark 1.0

2014-06-11 Thread Andrew Or
Ah, of course, there are no application jars in spark-shell, then it seems that there are no workarounds for this at the moment. We will look into a fix shortly, but for now you will have to create an application and use spark-submit (or use spark-shell on Linux). 2014-06-11 10:42 GMT-07:00 Ulano

Re: HDFS Server/Client IPC version mismatch while trying to access HDFS files using Spark-0.9.1

2014-06-11 Thread Marcelo Vanzin
The error is saying that your client libraries are older than what your server is using (2.0.0-mr1-cdh4.6.0 is IPC version 7). Try double-checking that your build is actually using that version (e.g., by looking at the hadoop jar files in lib_managed/jars). On Wed, Jun 11, 2014 at 2:07 AM, bijoy

Re: Information on Spark UI

2014-06-11 Thread Shuo Xiang
Using MEMORY_AND_DISK_SER to persist the input RDD[Rating] seems to work right for me now. I'm testing on a larger dataset and will see how it goes. On Wed, Jun 11, 2014 at 9:56 AM, Neville Li wrote: > Does cache eviction affect disk storage level too? I tried cranking up > replication but stil

Re: MLLib: Decision Tree not getting built for 5 or more levels (maxDepth=5) and the one built for 3 levels is performing poorly

2014-06-11 Thread SURAJ SHETH
Hi Filipus, The train data is already oversampled. The number of positives I mentioned above is for the test dataset: 12028 (apologies for not making this clear earlier). The train dataset has 61,264 positives out of 689,763 total rows. The number of negatives is 628,499. Oversampling was done for

Powered by Spark addition

2014-06-11 Thread Derek Mansen
Hello, I was wondering if we could add our organization to the "Powered by Spark" page. The information is: Name: Vistar Media URL: www.vistarmedia.com Description: Location technology company enabling brands to reach on-the-go consumers. Let me know if you need anything else. Thanks! Derek Mans

Re: Having trouble with streaming (updateStateByKey)

2014-06-11 Thread Michael Campbell
I rearranged my code to do a reduceByKey which I think is working. I also don't think the problem was that updateState call, but something else; unfortunately I changed a lot in looking for this issue, so not sure what the actual fix might have been, but I think it's working now. On Wed, Jun 11,

Kafka client - specify offsets?

2014-06-11 Thread Michael Campbell
Is there a way in the Apache Spark Kafka Utils to specify an offset to start reading? Specifically, from the start of the queue, or failing that, a specific point?

Re: Normalizations in MLBase

2014-06-11 Thread DB Tsai
Hi Aslan, Currently, we don't have a utility function to do so. However, you can easily implement this with another map transformation. I'm working on this feature now, and there will be a couple of different normalization options users can choose from. Sincerely, DB Tsai -
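A map-based min-max scaling, as suggested, can be sketched in a few lines; the sample values are arbitrary, and the bounds are computed from the data rather than assuming the 0-500 range from the question:

    val values = sc.parallelize(Seq(3.0, 125.0, 250.0, 499.0))
    val lo = values.reduce((a, b) => math.min(a, b))
    val hi = values.reduce((a, b) => math.max(a, b))
    // scale every value into [0, 1]
    val normalized = values.map(v => (v - lo) / (hi - lo))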

Compression with DISK_ONLY persistence

2014-06-11 Thread Surendranauth Hiraman
Hi, Will spark.rdd.compress=true enable compression when using DISK_ONLY persistence? SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hiraman@velos.io W: www.velos.io

When to use CombineByKey vs reduceByKey?

2014-06-11 Thread Diana Hu
Hello all, I've seen some performance improvements using combineByKey as opposed to reduceByKey or a groupByKey+map function. I have a couple of questions; it'd be great if anyone can shed some light on this. 1) When should I use combineByKey vs reduceByKey? 2) Do the containers need to be im

json parsing with json4s

2014-06-11 Thread SK
I have the following piece of code that parses a json file and extracts the age and TypeID: val p = sc.textFile(log_file) .map(line => { parse(line) }) .map(json => { val v1 = json \ "person" \ "age" val v2 = js

Re: json parsing with json4s

2014-06-11 Thread Michael Cutler
Hello, You're absolutely right, the syntax you're using is returning the json4s value objects, not native types like Int, Long etc. fix that problem and then everything else (filters) will work as you expect. This is a short snippet of a larger example: [1] val lines = sc.textFile("likes.jso
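The gist of the fix, sketched with json4s's extract[T]; the field names mirror the question, the json4s jackson backend is assumed to be on the classpath, and the input path is a placeholder:

    import org.json4s._
    import org.json4s.jackson.JsonMethods._

    val parsed = sc.textFile("/path/to/log.json").map { line =>
      implicit val formats = DefaultFormats        // needed by extract[T]
      val json = parse(line)
      val age    = (json \ "person" \ "age").extract[Int]       // native Int, not a JValue
      val typeId = (json \ "person" \ "TypeID").extract[String]
      (age, typeId)
    }
    parsed.filter(_._1 > 30).take(5).foreach(println)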

Re: Compression with DISK_ONLY persistence

2014-06-11 Thread Matei Zaharia
Yes, actually even if you don’t set it to true, on-disk data is compressed. (This setting only affects serialized data in memory). Matei On Jun 11, 2014, at 2:56 PM, Surendranauth Hiraman wrote: > Hi, > > Will spark.rdd.compress=true enable compression when using DISK_ONLY > persistence? >

Re: Powered by Spark addition

2014-06-11 Thread Matei Zaharia
Alright, added you. Matei On Jun 11, 2014, at 1:28 PM, Derek Mansen wrote: > Hello, I was wondering if we could add our organization to the "Powered by > Spark" page. The information is: > > Name: Vistar Media > URL: www.vistarmedia.com > Description: Location technology company enabling bran

Re: Not fully cached when there is enough memory

2014-06-11 Thread Xiangrui Meng
Could you try to click one that RDD and see the storage info per partition? I tried continuously caching RDDs, so new ones kick old ones out when there is not enough memory. I saw similar glitches but the storage info per partition is correct. If you find a way to reproduce this error, please creat

Re: Using Spark on Data size larger than Memory size

2014-06-11 Thread Allen Chang
Thanks. We've run into timeout issues at scale as well. We were able to work around them by setting the following JVM options: -Dspark.akka.askTimeout=300 -Dspark.akka.timeout=300 -Dspark.worker.timeout=300 NOTE: these JVM options *must* be set on worker nodes (and not just the driver/master) for

Re: problem starting the history server on EC2

2014-06-11 Thread zhen
I tried everything including sudo, but it still did not work using the local directory. However, I finally got it working by getting the history server to log into hdfs. I first created a directory in hdfs like the following: ./ephemeral-hdfs/bin/hadoop fs -mkdir /spark_logs Then I started the star

Using Spark to crack passwords

2014-06-11 Thread Nick Chammas
Spark is obviously well-suited to crunching massive amounts of data. How about to crunch massive amounts of numbers? A few years ago I put together a little demo for some co-workers to demonstrate the dangers of using SHA1 to hash and store pas

Hive classes for Catalyst

2014-06-11 Thread Stephen Boesch
Hi, The documentation of Catalyst describes using HiveContext; however, the scala classes do not exist in Master or 1.0.0 Branch. What is the replacement/equivalent in Master? Package does not exist: org.apache.spark.sql.hive Here is code from SQL on Spark meetup slides referencing that packag

Re: Hive classes for Catalyst

2014-06-11 Thread Michael Armbrust
You will need to compile spark with SPARK_HIVE=true. On Wed, Jun 11, 2014 at 5:37 PM, Stephen Boesch wrote: > Hi, > The documentation of Catalyst describes using HiveContext; however, the > scala classes do not exist in Master or 1.0.0 Branch. What is the > replacement/equivalent in Master?

Re: Hive classes for Catalyst

2014-06-11 Thread Mark Hamstra
And the code is right here: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala On Wed, Jun 11, 2014 at 5:38 PM, Michael Armbrust wrote: > You will need to compile spark with SPARK_HIVE=true. > > > On Wed, Jun 11, 2014 at 5:37 PM, Step
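Once Spark is rebuilt with SPARK_HIVE=true, using it from the shell is a one-liner; a quick sketch, assuming the Spark 1.0 hql API:

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.hql("SHOW TABLES").collect().foreach(println)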

Re: Using Spark to crack passwords

2014-06-11 Thread DB Tsai
I think creating the samples in the search space within RDD will be too expensive, and the amount of data will probably be larger than any cluster. However, you could create a RDD of searching ranges, and each range will be searched by one map operation. As a result, in this design, the # of row i
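A toy sketch of this "RDD of ranges" layout is below; the 4-letter lowercase search space, the range size and the SHA-1 target are all made-up illustrations, not a claim about real cracking throughput:

    import java.security.MessageDigest

    def sha1Hex(s: String): String =
      MessageDigest.getInstance("SHA-1").digest(s.getBytes("UTF-8"))
        .map("%02x".format(_)).mkString

    val target   = sha1Hex("zzzz")                 // the hash we are trying to invert
    val alphabet = ('a' to 'z').map(_.toString)
    val total    = math.pow(26, 4).toLong          // size of the 4-char lowercase space
    val step     = 10000L

    // only small range descriptors live in the RDD; candidates are generated inside the map
    def candidate(i: Long): String =
      (0 until 4).map(p => alphabet(((i / math.pow(26, p).toLong) % 26).toInt)).mkString

    val ranges = sc.parallelize(0L until total by step)
    val hits = ranges.flatMap { start =>
      (start until math.min(start + step, total)).map(candidate).filter(sha1Hex(_) == target)
    }
    hits.collect().foreach(println)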

Re: Hive classes for Catalyst

2014-06-11 Thread Stephen Boesch
Thanks for the (super) quick replies. My bad - i was looking under spark/sql/*catalyst* instead of /spark/sql/hive 2014-06-11 17:40 GMT-07:00 Mark Hamstra : > And the code is right here: > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext

Re: Using Spark to crack passwords

2014-06-11 Thread Marek Wiewiorka
What about rainbow tables? http://en.wikipedia.org/wiki/Rainbow_table M. 2014-06-12 2:41 GMT+02:00 DB Tsai : > I think creating the samples in the search space within RDD will be > too expensive, and the amount of data will probably be larger than any > cluster. > > However, you could create a

Re: Not fully cached when there is enough memory

2014-06-11 Thread Shuo Xiang
Xiangrui, clicking into the RDD link, it gives the same message, say only 96 of 100 partitions are cached. The disk/memory usage are the same, which is far below the limit. Is this what you want to check or other issue? On Wed, Jun 11, 2014 at 4:38 PM, Xiangrui Meng wrote: > Could you try to cl

Re: When to use CombineByKey vs reduceByKey?

2014-06-11 Thread Matei Zaharia
combineByKey is designed for when your return type from the aggregation is different from the values being aggregated (e.g. you group together objects), and it should allow you to modify the leftmost argument of each function (mergeCombiners, mergeValue, etc) and return that instead of allocatin
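A small sketch of that distinction on made-up data: a per-key average needs a combiner type, (sum, count), that differs from the Double values being aggregated, which is the case combineByKey is designed for, whereas a plain sum fits reduceByKey:

    import org.apache.spark.SparkContext._

    val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 5.0)))

    // reduceByKey: values in, value of the same type out
    val sums = pairs.reduceByKey(_ + _)

    // combineByKey: aggregate into (sum, count), then derive the mean per key
    val avgs = pairs.combineByKey(
      (v: Double) => (v, 1L),                                                   // createCombiner
      (c: (Double, Long), v: Double) => (c._1 + v, c._2 + 1L),                  // mergeValue
      (c1: (Double, Long), c2: (Double, Long)) => (c1._1 + c2._1, c1._2 + c2._2) // mergeCombiners
    ).mapValues { case (sum, count) => sum / count }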

Re: Using Spark to crack passwords

2014-06-11 Thread Nicholas Chammas
Yes, I mean the RDD would just have elements to define partitions or ranges within the search space, not have actual hashes. It's really just using the RDD as a control structure, rather than a real data set. As you noted, we don't need to store any hashes. We just need to check them as they are

History Server rendered page not suitable for load balancing

2014-06-11 Thread elyast
Hi, Small issue but still. I run the history server through Marathon and balance it through haproxy. The problem is that links generated by HistoryPage (links to completed applications) are absolute, e.g. http://some-server:port/history... , but instead they should be relative, just /history, so they ca

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-06-11 Thread elyast
Hi, I tried to use SPARK_JAVA_OPTS in spark-env.sh as well as the conf/java-opts file to set additional Java system properties. In this case I could connect to tachyon without any problem. However, when I tried setting the executor and driver extraJavaOptions in spark-defaults.conf, it doesn't work. I suspect

Re: History Server rendered page not suitable for load balancing

2014-06-11 Thread Aaron Davidson
A pull request would be great! On Wed, Jun 11, 2014 at 7:53 PM, elyast wrote: > Hi, > > Small issue but still. > > I run history server through Marathon and balance it through haproxy. The > problem is that links generated by HistoryPage (links to completed > applications) are absolute, e.g. ht

use spark-shell in the source

2014-06-11 Thread JaeBoo Jung
Hi all, Can I use spark-shell programmatically in my Spark application (in Java or Scala)? Because I want to convert Scala lines to a string array and run them automatically in my application. For example, for( var line <- lines){ //run this

Re: How to achieve reasonable performance on Spark Streaming?

2014-06-11 Thread Boduo Li
It seems that the slow "reduce" tasks are caused by slow shuffling. Here are the logs regarding one slow "reduce" task: 14/06/11 23:42:45 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Got remote block shuffle_69_88_18 after 5029 ms 14/06/11 23:42:45 INFO BlockFetcherIterator$BasicBlockFetch

Re: Using Spark to crack passwords

2014-06-11 Thread Akhil Das
You can have a huge dictionary of hashes in one RDD and use a map function to generate a hash for the given password and look it up in your dictionary RDD. Not sure about the performance though. Would be nice to see if you design it. Thanks Best Regards On Thu, Jun 12, 2014 at 7:23 AM, Nicholas Cham

shuffling using netty in spark streaming

2014-06-11 Thread onpoq l
Hi, 1. Does netty perform better than the basic method for shuffling? I found the latency caused by shuffling in a streaming job is not stable with the basic method. 2. However, after I turn on netty for shuffling, I can only see the results for the first two batches, and then no output at all. I