Re: Running foreach on a list of rdds in parallel

2015-07-15 Thread Vetle Leinonen-Roeim
On Thu, Jul 16, 2015 at 7:37 AM Brandon White wrote: > Hello, > > I have a list of rdds > > List(rdd1, rdd2, rdd3,rdd4) > > I would like to save these rdds in parallel. Right now, it is running each > operation sequentially. I tried using a rdd of rdd but that does not work. > > list.foreach { rd

Re: Running foreach on a list of rdds in parallel

2015-07-15 Thread Davies Liu
sc.union(rdds).saveAsTextFile() On Wed, Jul 15, 2015 at 10:37 PM, Brandon White wrote: > Hello, > > I have a list of rdds > > List(rdd1, rdd2, rdd3,rdd4) > > I would like to save these rdds in parallel. Right now, it is running each > operation sequentially. I tried using a rdd of rdd but that do
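A minimal sketch (not from the thread) of the two usual ways around the sequential saves, assuming the RDDs hold text and the output paths below are placeholders:

    // Option 1 -- Davies' suggestion: union everything into one RDD and save once.
    sc.union(rdds).saveAsTextFile("/tmp/cache/all")

    // Option 2 -- submit the save jobs from separate threads; Spark runs independent
    // jobs concurrently when they are submitted concurrently. Each RDD needs its own path.
    rdds.zipWithIndex.par.foreach { case (rdd, i) =>
      rdd.saveAsTextFile(s"/tmp/cache/rdd_$i")
    }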

Running foreach on a list of rdds in parallel

2015-07-15 Thread Brandon White
Hello, I have a list of rdds List(rdd1, rdd2, rdd3,rdd4) I would like to save these rdds in parallel. Right now, it is running each operation sequentially. I tried using a rdd of rdd but that does not work. list.foreach { rdd => rdd.saveAsTextFile("/tmp/cache/") } Any ideas?

Re: Java 8 vs Scala

2015-07-15 Thread spark user
I struggled a lot in Scala, almost 10 days with no improvement, but when I switched to Java 8 things were so smooth, and I used Data Frame with Redshift and Hive and all are looking good. If you are very good in Scala then go with Scala, otherwise Java is the best fit. This is just my opinion because I am Ja

RE: HiveThriftServer2.startWithContext error with registerTempTable

2015-07-15 Thread Cheng, Hao
Have you ever tried to query “select * from temp_table” from the spark shell? Or can you try the option --jars while starting the spark shell? From: Srikanth [mailto:srikanth...@gmail.com] Sent: Thursday, July 16, 2015 9:36 AM To: user Subject: Re: HiveThriftServer2.startWithContext error with reg

Re: Re: HiBench test for hadoop/hive/spark cluster

2015-07-15 Thread luohui20001
Hi Ted, Thanks for your advice. I found that there is something wrong with the "hadoop fs -get" command, because I believe the localization of hdfs://spark-study:9000/HiBench/Aggregation/temp/user_agents to /tmp/hadoop-root/mapred/local/1437016615898/user_agents is a behaviour like "hadoop fs -g

Re: HiBench test for hadoop/hive/spark cluster

2015-07-15 Thread Ted Yu
From log file: 15/07/16 11:16:56 INFO mapred.LocalDistributedCacheManager: Creating symlink: /tmp/hadoop-root/mapred/local/1437016615898/user_agents <- /opt/HiBench-master/user_agents 15/07/16 11:16:56 INFO mapred.LocalDistributedCacheManager: Localized hdfs://spark-study:9000/HiBench/Aggregation

Re: fileStream with old files

2015-07-15 Thread Terry Hole
Hi Hunter, what behavior do you see with HDFS? The local file system and HDFS should have the same behavior. Thanks! - Terry. Hunter Morgan wrote on Thursday, July 16, 2015 at 2:04 AM: > After moving the setting of the parameter to SparkConf initialization > instead of after the context is already i

Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Ted Yu
Looks like this method should serve Jon's needs: def reduceByWindow( reduceFunc: (T, T) => T, windowDuration: Duration, slideDuration: Duration On Wed, Jul 15, 2015 at 8:23 PM, N B wrote: > Hi Jon, > > In Spark streaming, 1 batch = 1 RDD. Essentially, the terms are used > in
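A hedged usage sketch of that method, assuming a DStream[Long] of per-batch counts (names and durations below are illustrative only):

    import org.apache.spark.streaming.Seconds

    // Combine every element that falls inside the window into a single value per slide.
    val windowedTotal = counts.reduceByWindow(
      (a, b) => a + b,   // associative reduce function over window elements
      Seconds(60),       // window duration: look back over the last 60 seconds of batches
      Seconds(10))       // slide duration: produce one combined value every 10 seconds
    windowedTotal.print()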

Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread N B
Hi Jon, In Spark streaming, 1 batch = 1 RDD. Essentially, the terms are used interchangeably. If you are trying to collect multiple batches across a DStream into a single RDD, look at the window() operations. Hope this helps Nikunj On Wed, Jul 15, 2015 at 7:00 PM, Jon Chase wrote: > I should

Spark streaming Processing time keeps increasing

2015-07-15 Thread N B
Hello, We have a Spark streaming application and the problem that we are encountering is that the batch processing time keeps on increasing and eventually causes the application to start lagging. I am hoping that someone here can point me to any underlying cause of why this might happen. The batc

RE: [SparkR] creating dataframe from json file

2015-07-15 Thread Sun, Rui
You can try selectExpr() of DataFrame. for example, y<-selectExpr(df, "concat(hashtags.text[0],hashtags.text[1])") # [] operator is used to extract an item from an array or sql(hiveContext, "select concat(hashtags.text[0],hashtags.text[1]) from table") Yeah, the documentation of SparkR is

Re: Running mllib from R in Spark 1.4

2015-07-15 Thread madhu phatak
Hi, Thank you. On Wed, Jul 15, 2015 at 9:07 PM, Burak Yavuz wrote: > Hi, > There is no MLlib support in SparkR in 1.4. There will be some support in > 1.5. You can check these JIRAs for progress: > https://issues.apache.org/jira/browse/SPARK-6805 > https://issues.apache.org/jira/browse/SPARK-682

Re: Spark 1.4.0 compute-classpath.sh

2015-07-15 Thread Lokesh Kumar Padhnavis
Thanks a lot :) On Wed, Jul 15, 2015 at 11:48 PM Marcelo Vanzin wrote: > That has never been the correct way to set your app's classpath. > > Instead, look at http://spark.apache.org/docs/latest/configuration.html > and search for "extraClassPath". > > On Wed, Jul 15, 2015 at 9:43 AM, lokeshkumar

Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Jon Chase
I should note that the amount of data in each batch is very small, so I'm not concerned with performance implications of grouping into a single RDD. On Wed, Jul 15, 2015 at 9:58 PM, Jon Chase wrote: > I'm currently doing something like this in my Spark Streaming program > (Java): > > dSt

Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Jon Chase
I'm currently doing something like this in my Spark Streaming program (Java): dStream.foreachRDD((rdd, batchTime) -> { log.info("processing RDD from batch {}", batchTime); // my rdd processing code }); Instead of having my

Re: HiveThriftServer2.startWithContext error with registerTempTable

2015-07-15 Thread Srikanth
Hello, Re-sending this to see if I'm second time lucky! I've not managed to move past this error. Srikanth On Mon, Jul 13, 2015 at 9:14 PM, Srikanth wrote: > Hello, > > I want to expose result of Spark computation to external tools. I plan to > do this with Thrift server JDBC interface by regi

RE: Python DataFrames: length of ArrayType

2015-07-15 Thread Cheng, Hao
Actually it's supposed to be part of Spark 1.5 release, see https://issues.apache.org/jira/browse/SPARK-8230 You're definitely welcome to contribute to it, let me know if you have any question on implementing it. Cheng Hao -Original Message- From: pedro [mailto:ski.rodrig...@gmail.com]

Re: Spark and HDFS

2015-07-15 Thread Marcelo Vanzin
On Wed, Jul 15, 2015 at 5:36 AM, Jeskanen, Elina wrote: > I have Spark 1.4 on my local machine and I would like to connect to our > local 4 nodes Cloudera cluster. But how? > > > > In the example it says text_file = spark.textFile("hdfs://..."), but can > you advise me in where to get this "hdfs

Re: Java 8 vs Scala

2015-07-15 Thread Alan Burlison
On 15/07/2015 21:17, Ted Yu wrote: jshell is nice but it is targeting Java 9 Yes I know, just pointing out that eventually Java would have a REPL as well. -- Alan Burlison -- - To unsubscribe, e-mail: user-unsubscr...@spar

Spark cluster read local files

2015-07-15 Thread Julien Beaudan
Hi all, Is it possible to use Spark to assign each machine in a cluster the same task, but on files in each machine's local file system, and then have the results sent back to the driver program? Thank you in advance! Julien

Re: NotSerializableException in spark 1.4.0

2015-07-15 Thread Chen Song
Ah, cool. Thanks. On Wed, Jul 15, 2015 at 5:58 PM, Tathagata Das wrote: > Spark 1.4.1 just got released! So just download that. Yay for timing. > > On Wed, Jul 15, 2015 at 2:47 PM, Ted Yu wrote: > >> Should be this one: >> [SPARK-7180] [SPARK-8090] [SPARK-8091] Fix a number of >> Serializat

Python DataFrames: length of ArrayType

2015-07-15 Thread pedro
Resubmitting after fixing subscription to mailing list. Based on the list of functions here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions there doesn't seem to be a way to get the length of an array in a dataframe without defining a UDF. What I w

Python DataFrames, length of array

2015-07-15 Thread pedro
Based on the list of functions here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions there doesn't seem to be a way to get the length of an array in a dataframe without defining a UDF. What I would be looking for is something like this (except length_

get java.io.FileNotFoundException when use addFile Function

2015-07-15 Thread prateek arora
Hi I am trying to write a simple program using the addFile function but getting an error in my worker node that the file does not exist. Stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, slave2.novalocal): java.io.FileNotFoundException: File file:/tmp/
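A common cause of this error is reading the file on the executor by its original driver-side path instead of resolving the shipped copy; a small sketch of the SparkFiles.get pattern (the file name and the `lines` RDD are hypothetical):

    import org.apache.spark.SparkFiles

    sc.addFile("/tmp/lookup.txt")   // path on the machine that submits the job

    val filtered = lines.mapPartitions { iter =>
      // On the executor, resolve the local copy by file name only, not by the driver-side path.
      val localPath = SparkFiles.get("lookup.txt")
      val lookup = scala.io.Source.fromFile(localPath).getLines().toSet
      iter.filter(lookup.contains)
    }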

get java.io.FileNotFoundException when use addFile Function

2015-07-15 Thread prateek arora
I am trying to write a simple program using the addFile function but getting an error in my worker node that the file does not exist. Stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, slave2.novalocal): java.io.FileNotFoundException: File file:/tmp/spar

Re: NotSerializableException in spark 1.4.0

2015-07-15 Thread Tathagata Das
Spark 1.4.1 just got released! So just download that. Yay for timing. On Wed, Jul 15, 2015 at 2:47 PM, Ted Yu wrote: > Should be this one: > [SPARK-7180] [SPARK-8090] [SPARK-8091] Fix a number of > SerializationDebugger bugs and limitations > ... > Closes #6625 from tdas/SPARK-7180 and s

Announcing Spark 1.4.1!

2015-07-15 Thread Patrick Wendell
Hi All, I'm happy to announce the Spark 1.4.1 maintenance release. We recommend all users on the 1.4 branch upgrade to this release, which contains several important bug fixes. Download Spark 1.4.1 - http://spark.apache.org/downloads.html Release notes - http://spark.apache.org/releases/spark-rele

Re: NotSerializableException in spark 1.4.0

2015-07-15 Thread Ted Yu
Should be this one: [SPARK-7180] [SPARK-8090] [SPARK-8091] Fix a number of SerializationDebugger bugs and limitations ... Closes #6625 from tdas/SPARK-7180 and squashes the following commits: On Wed, Jul 15, 2015 at 2:37 PM, Chen Song wrote: > Thanks > > Can you point me to the patch to

Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Andrew Or
Yeah, we could make it log a warning instead. 2015-07-15 14:29 GMT-07:00 Kelly, Jonathan : > Thanks! Is there an existing JIRA I should watch? > > > ~ Jonathan > > From: Sandy Ryza > Date: Wednesday, July 15, 2015 at 2:27 PM > To: Jonathan Kelly > Cc: "user@spark.apache.org" > Subject: R

Re: NotSerializableException in spark 1.4.0

2015-07-15 Thread Chen Song
Thanks Can you point me to the patch to fix the serialization stack? Maybe I can pull it in and rerun my job. Chen On Wed, Jul 15, 2015 at 4:40 PM, Tathagata Das wrote: > Your streaming job may have been seemingly running ok, but the DStream > checkpointing must have been failing in the backgr

RE: Any beginner samples for using ML / MLIB to produce a moving average of a (K, iterable[V])

2015-07-15 Thread Mohammed Guller
I could be wrong, but it looks like the only implementation available right now is MultivariateOnlineSummarizer. Mohammed From: Nkechi Achara [mailto:nkach...@googlemail.com] Sent: Wednesday, July 15, 2015 4:31 AM To: user@spark.apache.org Subject: Any beginner samples for using ML / MLIB to pro

Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Kelly, Jonathan
Thanks! Is there an existing JIRA I should watch? ~ Jonathan From: Sandy Ryza mailto:sandy.r...@cloudera.com>> Date: Wednesday, July 15, 2015 at 2:27 PM To: Jonathan Kelly mailto:jonat...@amazon.com>> Cc: "user@spark.apache.org" mailto:user@spark.apache.org>> Subjec

Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Sandy Ryza
Hi Jonathan, This is a problem that has come up for us as well, because we'd like dynamic allocation to be turned on by default in some setups, but not break existing users with these properties. I'm hoping to figure out a way to reconcile these by Spark 1.5. -Sandy On Wed, Jul 15, 2015 at 3:18

Re: ALS run method versus ALS train versus ALS fit and transform

2015-07-15 Thread Sean Owen
The first two examples are from the .mllib API. Really, the "new ALS()...run()" form is underneath both of the first two. In the second case, you're calling a convenience method that calls something similar to the first example. The second example is from the new .ml "pipelines" API. Similar ideas

ALS run method versus ALS train versus ALS fit and transform

2015-07-15 Thread Carol McDonald
In the Spark mllib examples MovieLensALS.scala ALS run is used, however in the movie recommendation with mllib tutorial ALS train is used , What is the difference, when should you use one versus the other val model = new ALS() .setRank(params.rank) .setIterations(params.numIterati
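For reference, a side-by-side sketch of the two .mllib forms (parameter values here are illustrative):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // ratings: RDD[Rating]
    // Builder form, as used in MovieLensALS.scala:
    val model1 = new ALS()
      .setRank(10)
      .setIterations(20)
      .setLambda(0.01)
      .run(ratings)

    // Static convenience form, as used in the MLlib tutorial -- it builds and runs
    // an equivalent ALS instance underneath:
    val model2 = ALS.train(ratings, 10, 20, 0.01)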

Re: NotSerializableException in spark 1.4.0

2015-07-15 Thread Tathagata Das
Your streaming job may have been seemingly running ok, but the DStream checkpointing must have been failing in the background. It would have been visible in the log4j logs. In 1.4.0, we enabled fast-failure for that so that checkpointing failures dont get hidden in the background. The fact that th

Re: spark streaming job to hbase write

2015-07-15 Thread Todd Nist
There are three connector packages listed on the spark packages web site: http://spark-packages.org/?q=hbase HTH. -Todd On Wed, Jul 15, 2015 at 2:46 PM, Shushant Arora wrote: > Hi > > I have a requirement of writing in hbase table from Spark streaming app > after some processing. > Is Hbase put o

Re: Java 8 vs Scala

2015-07-15 Thread Ted Yu
jshell is nice but it is targeting Java 9 Cheers On Wed, Jul 15, 2015 at 5:31 AM, Alan Burlison wrote: > On 15/07/2015 08:31, Ignacio Blasco wrote: > > The main advantage of using scala vs java 8 is being able to use a console >> > > https://bugs.openjdk.java.net/browse/JDK-8043364 > > -- > Al

Re: NotSerializableException in spark 1.4.0

2015-07-15 Thread Ted Yu
Can you show us your function(s) ? Thanks On Wed, Jul 15, 2015 at 12:46 PM, Chen Song wrote: > The streaming job has been running ok in 1.2 and 1.3. After I upgraded to > 1.4, I started seeing error as below. It appears that it fails in validate > method in StreamingContext. Is there anything c

Re: Info from the event timeline appears to contradict dstat info

2015-07-15 Thread Tom Hubregtsen
I think I found my answer at https://github.com/kayousterhout/trace-analysis: "One thing to keep in mind is that Spark does not currently include instrumentation to measure the time spent reading input data from disk or writing job output to disk (the 'Output write wait' shown in the waterfall is

NotSerializableException in spark 1.4.0

2015-07-15 Thread Chen Song
The streaming job has been running ok in 1.2 and 1.3. After I upgraded to 1.4, I started seeing the error below. It appears that it fails in the validate method in StreamingContext. Is there anything changed in 1.4.0 w.r.t. DStream checkpointing? Detailed error from driver: 15/07/15 18:00:39 ERROR yarn

small accumulator gives out of memory error

2015-07-15 Thread AlexG
When I call the following minimal working example, the accumulator matrix is 32-by-100K, and each executor has 64G but I get an out of memory error: Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit Here BDM is a Breeze DenseMatrix object BDMAccumulator

Re: Research ideas using spark

2015-07-15 Thread vaquar khan
I would suggest studying Spark, Flink and Storm and, based on your understanding and findings, preparing your research paper. Maybe you will invent a new Spark ☺ Regards, Vaquar khan On 16 Jul 2015 00:47, "Michael Segel" wrote: > Silly question… > > When thinking about a PhD thesis… do you want to tie it

Re: Java 8 vs Scala

2015-07-15 Thread vaquar khan
My choice is java 8 On 15 Jul 2015 18:03, "Alan Burlison" wrote: > On 15/07/2015 08:31, Ignacio Blasco wrote: > > The main advantage of using scala vs java 8 is being able to use a console >> > > https://bugs.openjdk.java.net/browse/JDK-8043364 > > -- > Alan Burlison > -- > > ---

Re: Spark Intro

2015-07-15 Thread vaquar khan
Totally agreed with Hafsa, you need to identify your requirements and needs before choosing Spark. If you want to handle data with fast access go with NoSQL (Mongo, Aerospike, etc.); if you need data analytics then Spark is best. Regards, Vaquar khan On 14 Jul 2015 20:39, "Hafsa Asif" wrote: > Hi,

Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Kelly, Jonathan
Would there be any problem in having spark.executor.instances (or --num-executors) be completely ignored (i.e., even for non-zero values) if spark.dynamicAllocation.enabled is true (i.e., rather than throwing an exception)? I can see how the exception would be helpful if, say, you tried to pass
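For reference, a hedged spark-defaults.conf sketch of the combination under discussion (values are placeholders):

    spark.dynamicAllocation.enabled        true
    spark.dynamicAllocation.minExecutors   2
    spark.dynamicAllocation.maxExecutors   20
    spark.shuffle.service.enabled          true
    # spark.executor.instances             10   <- in 1.4 this setting (or --num-executors) must be
    #                                             removed, otherwise the conflict raises an exception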

Re: Research ideas using spark

2015-07-15 Thread Michael Segel
Silly question… When thinking about a PhD thesis… do you want to tie it to a specific technology or do you want to investigate an idea but then use a specific technology. Or is this an outdated way of thinking? "I am doing my PHD thesis on large scale machine learning e.g Online learning,

Re: Strange Error: "java.lang.OutOfMemoryError: GC overhead limit exceeded"

2015-07-15 Thread Saeed Shahrivari
Yes there is. But the RDD is more than 10 TB and compression does not help. On Wed, Jul 15, 2015 at 8:36 PM, Ted Yu wrote: > bq. serializeUncompressed() > > Is there a method which enables compression ? > > Just wondering if that would reduce the memory footprint. > > Cheers > > On Wed, Jul 15,

spark streaming job to hbase write

2015-07-15 Thread Shushant Arora
Hi I have a requirement of writing in hbase table from Spark streaming app after some processing. Is Hbase put operation the only way of writing to hbase or is there any specialised connector or rdd of spark for hbase write. Should Bulk load to hbase from streaming app be avoided if output of ea
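One plain route (a sketch, not necessarily the recommended connector) is the standard HBase 1.x client Put issued per partition from foreachRDD; the table, column family and DStream type below are made up:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    // events: DStream[(String, String)] of (rowKey, value)
    events.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // One connection per partition, never one per record.
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("my_table"))
        partition.foreach { case (rowKey, value) =>
          val put = new Put(Bytes.toBytes(rowKey))
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
          table.put(put)
        }
        table.close()
        conn.close()
      }
    }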

Re: Research ideas using spark

2015-07-15 Thread Ravindra
Look at this : http://www.forbes.com/sites/lisabrownlee/2015/07/10/the-11-trillion-internet-of-things-big-data-and-pattern-of-life-pol-analytics/ On Wed, Jul 15, 2015 at 10:19 PM shahid ashraf wrote: > Sorry Guys! > > I mistakenly added my question to this thread( Research ideas using > spark).

Re: Spark 1.4.0 compute-classpath.sh

2015-07-15 Thread Marcelo Vanzin
That has never been the correct way to set your app's classpath. Instead, look at http://spark.apache.org/docs/latest/configuration.html and search for "extraClassPath". On Wed, Jul 15, 2015 at 9:43 AM, lokeshkumar wrote: > Hi forum > > I have downloaded the latest spark version 1.4.0 and starte
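For completeness, a hedged example of the properties Marcelo is pointing at (the jar path is a placeholder); the same values can also be passed with --conf on spark-submit:

    spark.driver.extraClassPath     /opt/libs/thirdparty.jar
    spark.executor.extraClassPath   /opt/libs/thirdparty.jar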

Info from the event timeline appears to contradict dstat info

2015-07-15 Thread Tom Hubregtsen
I am trying to analyze my program, in particular to see what the bottleneck is (IO, CPU, network), and started using the event timeline for this. When looking at my Job 0, Stage 0 (the sampler function taking up 5.6 minutes of my 40 minute program), I see in the event timeline that all time is spe

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-15 Thread Wush Wu
Dear Sujit, Thanks for your suggestion. After testing, the `joinWithCassandraTable` does the trick like what you mentioned. The rdd2 only query those data which have the same key in rdd1. Best, Wush 2015-07-16 0:00 GMT+08:00 Sujit Pal : > Hi Wush, > > One option may be to try a replicated join

RE: fileStream with old files

2015-07-15 Thread Hunter Morgan
After moving the setting of the parameter to SparkConf initialization instead of after the context is already initialized, I have it operating reliably on local filesystem, but not on hdfs. Are there any differences in behavior between these two cases I should be aware of? I don’t usually maili

Re: Sessionization using updateStateByKey

2015-07-15 Thread algermissen1971
On 15 Jul 2015, at 17:38, Cody Koeninger wrote: > An in-memory hash key data structure of some kind so that you're close to > linear on the number of items in a batch, not the number of outstanding keys. > That's more complex, because you have to deal with expiration for keys that > never ge

Re: Sessionization using updateStateByKey

2015-07-15 Thread Cody Koeninger
Don't get me wrong, we've been able to use updateStateByKey for some jobs, and it's certainly convenient. At a certain point though, iterating through every key on every batch is a less viable solution. On Wed, Jul 15, 2015 at 12:38 PM, Sean McNamara wrote: > I would just like to add that we d

Re: Sessionization using updateStateByKey

2015-07-15 Thread Sean McNamara
I would just like to add that we do the very same/similar thing at Webtrends, updateStateByKey has been a life-saver for our sessionization use-cases. Cheers, Sean On Jul 15, 2015, at 11:20 AM, Silvio Fiorito mailto:silvio.fior...@granturing.com>> wrote: Hi Cody, I’ve had success using upda

Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Kelly, Jonathan
bump From: Jonathan Kelly mailto:jonat...@amazon.com>> Date: Tuesday, July 14, 2015 at 4:23 PM To: "user@spark.apache.org" mailto:user@spark.apache.org>> Subject: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf I've set up

Re: Sessionization using updateStateByKey

2015-07-15 Thread Silvio Fiorito
Hi Cody, I’ve had success using updateStateByKey for real-time sessionization by aging off timed-out sessions (returning None in the update function). This was on a large commercial website with millions of hits per day. This was over a year ago so I don’t have access to the stats any longer fo
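A minimal sketch of that aging-off pattern (types, field names and the 30-minute timeout are illustrative, and it uses wall-clock time rather than batch time):

    case class Session(lastSeen: Long, count: Int)

    val sessionTimeoutMs = 30 * 60 * 1000L

    val updateSession = (events: Seq[String], state: Option[Session]) => {
      val now = System.currentTimeMillis()
      val current = state.getOrElse(Session(now, 0))
      if (events.nonEmpty) Some(Session(now, current.count + events.size))
      else if (now - current.lastSeen > sessionTimeoutMs) None  // returning None ages the session out
      else Some(current)
    }

    // eventsByUser: DStream[(String, String)] keyed by user/session id
    val sessions = eventsByUser.updateStateByKey(updateSession)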

Spark job returns a different result on each run

2015-07-15 Thread sbvarre
I am working on Scala code which performs Linear Regression on certain datasets. Right now I am using 20 cores and 25 executors and every time I run a Spark job I get a different result. The input sizes of the files are 2GB and 400 MB. However, when I run the job with 20 cores and 1 executor, I get

Re: Research ideas using spark

2015-07-15 Thread shahid ashraf
Sorry Guys! I mistakenly added my question to this thread (Research ideas using spark). Moreover, people can ask any question; this spark user group is for that. Cheers! 😊 On Wed, Jul 15, 2015 at 9:43 PM, Robin East wrote: > Well said Will. I would add that you might want to investigate GraphC

Re: Research ideas using spark

2015-07-15 Thread Jörn Franke
Well one of the strength of spark is standardized general distributed processing allowing many different types of processing, such as graph processing, stream processing etc. The limitation is that it is less performant than one system focusing only on one type of processing (eg graph processing).

Spark 1.4.0 compute-classpath.sh

2015-07-15 Thread lokeshkumar
Hi forum I have downloaded the latest spark version 1.4.0 and started using it. But I couldn't find the compute-classpath.sh file in bin/ which I am using in previous versions to provide third party libraries to my application. Can anyone please let me know where I can provide CLASSPATH with my

Re: Spark and HDFS

2015-07-15 Thread Naveen Madhire
Yes. I did this recently. You need to copy the cloudera cluster related conf files into the local machine and set HADOOP_CONF_DIR or YARN_CONF_DIR. And also local machine should be able to ssh to the cloudera cluster. On Wed, Jul 15, 2015 at 8:51 AM, ayan guha wrote: > Assuming you run spark lo

Spark Accumulator Issue - java.io.IOException: java.lang.StackOverflowError

2015-07-15 Thread Jadhav Shweta
Hi, I am trying one transformation by calling a scala method. This scala method returns MutableList[AvroObject] def processRecords(id: String, list1: Iterable[(String, GenericRecord)]): scala.collection.mutable.MutableList[AvroObject]  Hence, the output of the transformation is RDD[MutableList[AvroOb

Re: Research ideas using spark

2015-07-15 Thread Robin East
Well said Will. I would add that you might want to investigate GraphChi which claims to be able to run a number of large-scale graph processing tasks on a workstation much quicker than a very large Hadoop cluster. It would be interesting to know how widely applicable the approach GraphChi takes

Re: Strange Error: "java.lang.OutOfMemoryError: GC overhead limit exceeded"

2015-07-15 Thread Ted Yu
bq. serializeUncompressed() Is there a method which enables compression ? Just wondering if that would reduce the memory footprint. Cheers On Wed, Jul 15, 2015 at 8:06 AM, Saeed Shahrivari < saeed.shahriv...@gmail.com> wrote: > I use a simple map/reduce step in a Java/Spark program to remove >

out of memory error in treeAggregate

2015-07-15 Thread AlexG
I'm using the following function to compute B*A where B is a 32-by-8mil Breeze DenseMatrix and A is a 8mil-by-100K IndexedRowMatrix. // computes BA where B is a local matrix and A is distributed: let b_i denote the // ith col of B and a_i denote the ith row of A, then BA = sum(b_i a_i) def leftM

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-15 Thread Sujit Pal
Hi Wush, One option may be to try a replicated join. Since your rdd1 is small, read it into a collection and broadcast it to the workers, then filter your larger rdd2 against the collection on the workers. -sujit On Tue, Jul 14, 2015 at 11:33 PM, Deepak Jain wrote: > Leftouterjoin and join ap
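A compact sketch of that replicated (map-side) join, with rdd1 assumed small enough to collect and both RDDs keyed by the same id (names and types are illustrative):

    // rdd1: RDD[(String, Int)] (small), rdd2: RDD[(String, String)] (large Cassandra-backed RDD)
    val smallSide = sc.broadcast(rdd1.collectAsMap())

    val joined = rdd2.flatMap { case (key, right) =>
      // Keep only rows whose key exists on the small side -- no shuffle of rdd2 required.
      smallSide.value.get(key).map(left => (key, (left, right)))
    }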

Re: Spark on EMR with S3 example (Python)

2015-07-15 Thread Sujit Pal
Hi Roberto, I think you would need to as Akhil said. Just checked from this page: http://aws.amazon.com/public-data-sets/ and clicking through to a few dataset links, all of them are available on s3 (some are available via http and ftp, but I think the point of these datasets is that they are u

Tasks unevenly distributed in Spark 1.4.0

2015-07-15 Thread gisleyt
Hello all, I upgraded from spark 1.3.1 to 1.4.0, but I'm experiencing a massive drop in performance for the application I'm running. I've (somewhat) reproduced this behaviour in the attached file. My current spark setup may not be optimal exactly for this reproduction, but I see that Spark 1.4.0

Re: Sessionization using updateStateByKey

2015-07-15 Thread Cody Koeninger
An in-memory hash key data structure of some kind so that you're close to linear on the number of items in a batch, not the number of outstanding keys. That's more complex, because you have to deal with expiration for keys that never get hit, and for unusually long sessions you have to either drop

Re: Running mllib from R in Spark 1.4

2015-07-15 Thread Burak Yavuz
Hi, There is no MLlib support in SparkR in 1.4. There will be some support in 1.5. You can check these JIRAs for progress: https://issues.apache.org/jira/browse/SPARK-6805 https://issues.apache.org/jira/browse/SPARK-6823 Best, Burak On Wed, Jul 15, 2015 at 6:00 AM, madhu phatak wrote: > Hi, > I

java.lang.NoClassDefFoundError: Could not initialize class org.fusesource.jansi.internal.Kernel32

2015-07-15 Thread Wang, Ningjun (LNG-NPV)
I just installed spark 1.3.1 on windows 2008 server. When I start spark-shell, I got the following error Failed to created SparkJLineReader: java.lang.NoClassDefFoundError: Could not initialize class org.fusesource.jansi.internal.Kernel32 Please advise. Thanks. Ningjun

Strange Error: "java.lang.OutOfMemoryError: GC overhead limit exceeded"

2015-07-15 Thread Saeed Shahrivari
I use a simple map/reduce step in a Java/Spark program to remove duplicated documents from a large (10 TB compressed) sequence file containing some html pages. Here is the partial code: JavaPairRDD inputRecords = sc.sequenceFile(args[0], BytesWritable.class, NullWritable.class).coalesce(numMap

DataFrame more efficient than RDD?

2015-07-15 Thread k0ala
Hi, I have been working a bit with RDD, and am now taking a look at DataFrames. The schema definition using case classes looks very attractive; https://spark.apache.org/docs/1.4.0/sql-programming-guide.html#inferring-the-schema-using-reflection

Job aborted due to stage failure: Task not serializable:

2015-07-15 Thread Naveen Dabas
I am using the below code with the Kryo serializer. When I run this code I get this error: Task not serializable (in the commented line). 2) How are broadcast variables treated in the executor? Are they local variables or can they be used in any function defined as global variables? object StreamingLogIn

java heap error

2015-07-15 Thread AlexG
I'm trying to compute the Frobenius norm error in approximating an IndexedRowMatrix A with the product L*R where L and R are Breeze DenseMatrices. I've written the following function that computes the squared error over each partition of rows then sums up to get the total squared error (ignore th

Re: Sessionization using updateStateByKey

2015-07-15 Thread algermissen1971
Hi Cody, oh ... I thought that was one of *the* use cases for it. Do you have a suggestion / best practice how to achieve the same thing with better scaling characteristics? Jan On 15 Jul 2015, at 15:33, Cody Koeninger wrote: > I personally would try to avoid updateStateByKey for sessionizati

Re: Spark and HDFS

2015-07-15 Thread ayan guha
Assuming you run spark locally (i.e. either local mode or standalone cluster on your local m/c) 1. You need to have hadoop binaries locally 2. You need to have hdfs-site on the Spark classpath of your local m/c I would suggest you start off with local files to play around. If you need to run spark on

Re: Research ideas using spark

2015-07-15 Thread William Temperley
There seems to be a bit of confusion here - the OP (doing the PhD) had the thread hijacked by someone with a similar name asking a mundane question. It would be a shame to send someone away so rudely, who may do valuable work on Spark. Sashidar (not Sashid!) I'm personally interested in running g

Re: [SparkR] creating dataframe from json file

2015-07-15 Thread jianshu Weng
Thanks. t <- getField(df$hashtags, "text") does return a Column. But when I tried to call t <- getField(df$hashtags, "text"), it would give an error: Error: All select() inputs must resolve to integer column positions. The following do not: * getField(df$hashtags, "text") In fact, the "text" fi

Re: Sessionization using updateStateByKey

2015-07-15 Thread Cody Koeninger
I personally would try to avoid updateStateByKey for sessionization when you have long sessions / a lot of keys, because it's linear on the number of keys. On Tue, Jul 14, 2015 at 6:25 PM, Tathagata Das wrote: > [Apologies for repost, for those who have seen this response already in > the dev ma

Re: Sorted Multiple Outputs

2015-07-15 Thread Eugene Morozov
Yiannis, it looks like you might explore another approach. sc.textFile("input/path") .map() // your own implementation .partitionBy(new HashPartitioner(num)) .groupBy() //your own implementation, as a result - PairRDD of key vs Iterable of values .foreachPartition() On the last step you could so
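Filled out slightly, a sketch of that outline might look like the following (the key extraction, partition count and final write are placeholders):

    import org.apache.spark.HashPartitioner

    val numPartitions = 100  // illustrative

    sc.textFile("input/path")
      .map(line => (line.split(",")(0), line))           // key each record (your own logic)
      .partitionBy(new HashPartitioner(numPartitions))    // co-locate all records for a key
      .groupByKey()                                       // key -> Iterable of its records
      .foreachPartition { part =>
        part.foreach { case (key, values) =>
          val sorted = values.toSeq.sorted                // sort each key's values in memory
          // write `sorted` to the per-key output here, e.g. via the Hadoop FileSystem API
        }
      }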

Running mllib from R in Spark 1.4

2015-07-15 Thread madhu phatak
Hi, I have been playing with the Spark R API that was introduced in the Spark 1.4 version. Can we use any mllib functionality from R as of now? From the documentation it looks like we can only use SQL/Dataframe functionality as of now. I know there is a separate SparkR project but it does not seem

RE: [SparkR] creating dataframe from json file

2015-07-15 Thread Sun, Rui
suppose df <- jsonFile(sqlContext, "") You can extract hashtags.text as a Column object using the following command: t <- getField(df$hashtags, "text") and then you can perform operations on the column. You can extract hashtags.text as a DataFrame using the following command: t <- select(d

updateStateByKey schedule time

2015-07-15 Thread Michel Hubert
Hi, I want to implement a time-out mechanism in the updateStateByKey(...) routine. But is there a way to retrieve the time of the start of the batch corresponding to the call to my updateStateByKey routines? Suppose the streaming has built up some delay; then a System.currentTimeMillis() will

Spark and HDFS

2015-07-15 Thread Jeskanen, Elina
I have Spark 1.4 on my local machine and I would like to connect to our local 4 nodes Cloudera cluster. But how? In the example it says text_file = spark.textFile("hdfs://..."), but can you advise me in where to get this "hdfs://..." -address? Thanks! Elina
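A hedged pointer: the host and port usually come from the fs.defaultFS property in the cluster's core-site.xml; the hostname below is only an example:

    val textFile = sc.textFile("hdfs://namenode.example.com:8020/user/elina/data.txt")

    // If HADOOP_CONF_DIR points at the cluster's config directory, the scheme and host can be
    // omitted and Spark resolves the default file system from the Hadoop configuration:
    val sameFile = sc.textFile("/user/elina/data.txt")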

Re: Java 8 vs Scala

2015-07-15 Thread Alan Burlison
On 15/07/2015 08:31, Ignacio Blasco wrote: The main advantage of using scala vs java 8 is being able to use a console https://bugs.openjdk.java.net/browse/JDK-8043364 -- Alan Burlison -- - To unsubscribe, e-mail: user-unsubs

compression behaviour inconsistency between 1.3 and 1.4

2015-07-15 Thread Marcin Cylke
Hi I've observed an inconsistent behaviour in .saveAsTextFile. Up until version 1.3 it was possible to save RDDs as snappy compressed files with the invocation of rdd.saveAsTextFile(targetFile) but after upgrading to 1.4 this no longer works. I need to specify a codec for that: rdd.saveAsText
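For reference, a short sketch of the explicit-codec call described above:

    import org.apache.hadoop.io.compress.SnappyCodec

    rdd.saveAsTextFile(targetFile, classOf[SnappyCodec])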

Any beginner samples for using ML / MLIB to produce a moving average of a (K, iterable[V])

2015-07-15 Thread Nkechi Achara
Hi all, I am trying to get some summary statistics to retrieve the moving average for several devices that have an array or latency in seconds in this kind of format: deviceLatencyMap = [K:String, Iterable[V: Double]] I understand that there is a MultivariateSummary, but as this is a trait, but

DataFrame.write().partitionBy("some_column").parquet(path) produces OutOfMemory with very few items

2015-07-15 Thread Nikos Viorres
Hi, I am trying to test partitioning for DataFrames with parquet usage so i attempted to do df.write().partitionBy("some_column").parquet(path) on a small dataset of 20.000 records which when saved as parquet locally with gzip take 4mb of disk space. However, on my dev machine with -Dspark.master=

Re: what is metadata in StructField ?

2015-07-15 Thread Peter Rudenko
Hi Mathieu, metadata is very useful if you need to save some data about a column (e.g. count of null values, cardinality, domain, min/max/std, etc.). It's currently used in the ml package in attributes: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/attribute/
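A small illustrative sketch of attaching such metadata to a field (the column name and statistics are made up):

    import org.apache.spark.sql.types.{DoubleType, MetadataBuilder, StructField}

    val meta = new MetadataBuilder()
      .putLong("nullCount", 0L)
      .putDouble("max", 42.0)
      .build()

    val field = StructField("price", DoubleType, nullable = true, metadata = meta)
    // The metadata travels with the schema; read it back with field.metadata.getDouble("max").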

Re: spark-submit can not resolve spark-hive_2.10

2015-07-15 Thread Hao Ren
Thanks for the reply. Actually, I don't think excluding spark-hive from spark-submit --packages is a good idea. I don't want to recompile spark by assembly for my cluster, every time a new spark release is out. I prefer using binary version of spark and then adding some jars for job execution. e

Re: Java 8 vs Scala

2015-07-15 Thread Gourav Sengupta
Why would you create a class and then instantiate it to store data and change the class every time you have to add a new element? In OOPS terminology a class represents an object, and an object has states - does it not? Purely from a data warehousing perspective - one of the fundamental principles

what is metadata in StructField ?

2015-07-15 Thread matd
I see in StructField that we can provide metadata. What is it meant for ? How is it used by Spark later on ? Are there any rules on what we can/cannot do with it ? I'm building some DataFrame processing, and I need to maintain a set of (meta)data along with the DF. I was wondering if I can use S

Re: creating a distributed index

2015-07-15 Thread Jem Tucker
This is very interesting, do you know if this version will be backwards compatible with older versions of Spark (1.2.0)? Thanks, Jem On Wed, Jul 15, 2015 at 10:04 AM Ankur Dave wrote: > The latest version of IndexedRDD supports any key type with a defined > serializer >

Re: Strange behavior of CoalescedRDD

2015-07-15 Thread Konstantin Knizhnik
Looks like the source of the problem is in the SqlNewHadoopRDD.compute method: the created Parquet file reader is intended to be closed at task completion time. This reader contains a lot of references to parquet.bytes.BytesInput objects, which in turn contain references to large byte arrays (some of t

Spark Stream suitability

2015-07-15 Thread polariz
Hi, I am am evaluating my options for a project that injects a rich data feed, does some aggregate calculations and allows the user to query on these. The (protobuf) data feed is rich in the sense that it contains several data fields which can be used to calculate several different KPI figures.
