Re: 答复: 答复: RDD usage

2014-03-29 Thread Chieh-Yen
Got it. Thanks for your help!! Chieh-Yen On Tue, Mar 25, 2014 at 6:51 PM, hequn cheng wrote: > Hi~ I wrote a program to test. The non-idempotent "compute" function in > foreach does change the value of the RDD. It may look a little crazy to do so > since modifying the RDD will make it impossible to ke
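
A minimal sketch of the behavior under discussion, assuming a hypothetical mutable element class and an existing SparkContext sc; whether a later read sees the in-place change depends on caching and on whether the partition is recomputed from lineage:

    class Box(var value: Int) extends Serializable  // hypothetical mutable element

    val rdd = sc.parallelize(1 to 5).map(new Box(_)).cache()
    rdd.foreach(b => b.value += 1)  // non-idempotent update in place
    // May reflect the mutation only while the cached blocks survive;
    // a recomputed partition replays map() and loses the increment.
    println(rdd.map(_.value).collect().toSeq)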

working with MultiTableInputFormat

2014-03-29 Thread Livni, Dana
I'm trying to create an RDD from multiple scans. I tried to set the configuration this way: Configuration config = HBaseConfiguration.create(); config.setStrings(MultiTableInputFormat.SCANS, scanStrings); I created each scan string in the scanStrings array this way: Scan scan = new Scan(); sc
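
A hedged sketch of the full setup (in Scala, assuming the HBase 0.96-era mapreduce API and an existing SparkContext sc; table names are placeholders). Each Scan carries its target table as an attribute and is serialized with TableMapReduceUtil.convertScanToString:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Result, Scan}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.{MultiTableInputFormat, TableMapReduceUtil}
    import org.apache.hadoop.hbase.util.Bytes

    val config = HBaseConfiguration.create()
    val scanStrings = Seq("table1", "table2").map { table =>
      val scan = new Scan()
      scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes(table))
      TableMapReduceUtil.convertScanToString(scan)
    }
    config.setStrings(MultiTableInputFormat.SCANS, scanStrings: _*)

    val rdd = sc.newAPIHadoopRDD(config, classOf[MultiTableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])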

Zip or map elements to create new RDD

2014-03-29 Thread yh18190
Hi, I have an RDD of elements and want to create a new RDD by zipping it with another RDD in order, giving result[RDD] with the sequence of elements 10, 20, 30, 40, 50... I am facing problems because the index is not an RDD, and it gives an error. Could anyone help me with how to zip it or map it in order to obtain the following result: (

Re: Do all classes involving RDD operation need to be registered?

2014-03-29 Thread Sonal Goyal
From my limited knowledge, all classes involved in RDD operations should extend Serializable if you want Java serialization (the default). However, if you want Kryo serialization, you can use conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer"); If you also want to per
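
A minimal sketch of both steps on the 0.9-era API; MyClass and MyRegistrator are hypothetical names, and the registrator must be referenced by its fully qualified class name:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    class MyClass(val x: Int)  // hypothetical class used in RDD operations

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[MyClass])
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "MyRegistrator")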

Re: Zip or map elements to create new RDD

2014-03-29 Thread Sonal Goyal
zipWithIndex works on the git clone; not sure if it's part of a released version. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala Best Regards, Sonal Nube Technologies On Sat, Mar
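
A minimal usage sketch, assuming a build that includes RDD.zipWithIndex and an existing SparkContext sc:

    val k = sc.parallelize(Seq(10, 20, 30, 40, 50))
    // zipWithIndex pairs each element with its position; swap to get (index, value).
    val indexed = k.zipWithIndex().map { case (value, i) => (i, value) }
    // indexed contains (0,10), (1,20), (2,30), (3,40), (4,50)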

Re: Zip or map elements to create new RDD

2014-03-29 Thread yh18190
Thanks Sonal. Is there any other way to map values with increasing indexes, so that I can write map(t=>(i,t)) where the value of 'i' increases after each map operation on an element? Please help me in this aspect. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Zi

How to index each map operation????

2014-03-29 Thread yh18190
Hi, I want to perform a map operation on an RDD of elements such that the resulting RDD is a key-value pair (counter, value). For example, given var k:RDD[Int] = 10,20,30,40,40,60..., k.map(t=>(i,t)) where the value 'i' should act like a counter that increments after each map operation. Please help me. I tried
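
A hedged sketch of one way to get stable, increasing indices without zipWithIndex: count each partition's elements, then offset the within-partition indices by the cumulative sizes (essentially what zipWithIndex does internally); sc is an existing SparkContext:

    val k = sc.parallelize(Seq(10, 20, 30, 40, 40, 60), 3)

    // Size of each partition, in partition order.
    val sizes = k.mapPartitionsWithIndex((pid, it) => Iterator((pid, it.size)))
      .collect().sortBy(_._1).map(_._2)
    // offsets(pid) = number of elements in all earlier partitions.
    val offsets = sizes.scanLeft(0L)(_ + _)

    val indexed = k.mapPartitionsWithIndex { (pid, it) =>
      it.zipWithIndex.map { case (t, i) => (offsets(pid) + i, t) }
    }
    // indexed contains (0,10), (1,20), (2,30), (3,40), (4,40), (5,60)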

Re: Do all classes involving RDD operation need to be registered?

2014-03-29 Thread anny9699
Thanks so much Sonal! I am much clearer now. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Do-all-classes-involving-RDD-operation-need-to-be-registered-tp3439p3472.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Strange behavior of RDD.cartesian

2014-03-29 Thread Andrew Ash
Is this Spark 0.9.0? Try setting spark.shuffle.spill=false. There was a hash collision bug, fixed in 0.9.1, that might cause you to have too few results in that join. Sent from my mobile phone On Mar 28, 2014 8:04 PM, "Matei Zaharia" wrote: > Weird, how exactly are you pulling out the sample
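
For reference, a minimal sketch of applying the suggested workaround through the 0.9-era configuration API (master and app name are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Disables shuffle spilling, avoiding the 0.9.0 code path with the
    // hash-collision bug; upgrading to 0.9.1 is the real fix.
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("spill-workaround")
      .set("spark.shuffle.spill", "false")
    val sc = new SparkContext(conf)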

Re: Announcing Spark SQL

2014-03-29 Thread Michael Armbrust
On Fri, Mar 28, 2014 at 9:53 PM, Rohit Rai wrote: > > Upon discussion with a couple of our clients, it seems the reason they would > prefer using Hive is that they have already invested a lot in it, mostly in > UDFs and HiveQL. > 1. Are there any plans to develop the SQL Parser to handle more compl
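
For context, a hedged sketch of the Hive-compatibility entry point in the newly announced Spark SQL, following the 1.0-era alpha API (the table name is a placeholder and an existing SparkContext sc is assumed):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // HiveQL, including existing Hive UDFs, runs through the Hive parser.
    val rows = hiveContext.hql("SELECT key, value FROM src WHERE key < 10")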

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-29 Thread Nicolas Bär
Hi, Is there any workaround to this problem? I'm trying to implement a KafkaReceiver using the SimpleConsumer API [1] of Kafka and handle the partition assignment manually. The easiest setup in this case would be to bind the number of parallel jobs to the number of partitions in Kafka. This is bas
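
A hedged sketch of the usual high-level workaround, short of a SimpleConsumer-based receiver: create one input stream per Kafka partition and union them, so the receiving parallelism matches the partition count (ZooKeeper address, group, and topic are placeholders; conf is an existing SparkConf):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(conf, Seconds(2))
    val numPartitions = 4  // assumed to match the topic's partition count
    val streams = (1 to numPartitions).map { _ =>
      KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
    }
    val unified = ssc.union(streams)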

Re: pySpark memory usage

2014-03-29 Thread Jim Blomo
I've only tried 0.9, in which I ran into the `stdin writer to Python finished early` error so frequently that I wasn't able to load even a 1GB file. Let me know if I can provide any other info! On Thu, Mar 27, 2014 at 8:48 PM, Matei Zaharia wrote: > I see, did this also fail with previous versions of Spark

Limiting number of reducers performance implications

2014-03-29 Thread Matthew Cheah
Hi everyone, I'm using Spark on machines where I can't change the maximum number of open files. As a result, I'm limiting the number of reducers to 500. I'm also only using a single machine that has 32 cores and emulating a cluster by running 4 worker daemons with 8 cores (maximum) each. What I'm
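
A minimal sketch of capping the reducer count via the numPartitions argument that the shuffle operations accept (input path is a placeholder, existing SparkContext sc assumed):

    import org.apache.spark.SparkContext._  // pair-RDD implicits in 0.9

    val pairs = sc.textFile("hdfs:///data/input")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
    // 500 reduce partitions => each map task opens at most 500 shuffle files.
    val counts = pairs.reduceByKey(_ + _, 500)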

SQL on Spark - Shark or SparkSQL

2014-03-29 Thread Manoj Samel
Hi, In the context of the recent Spark SQL announcement ( http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html ): if there is no existing investment in Hive/Shark, would it be worth starting new SQL work using SparkSQL rather than Shark? * It seems Shark S

Cross validation is missing in machine learning examples

2014-03-29 Thread Aureliano Buendia
Hi, I noticed the Spark machine learning examples use training data to validate regression models. For instance, in the linear regression example: // Evaluate model on training examples and compute training error val valuesAndPreds = parsedData.map { poi
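
A hedged sketch of holding out a test set instead of scoring on the training data, on the 0.9-era MLlib API where LabeledPoint features are Array[Double]; parsedData is the RDD[LabeledPoint] from the example, and the seeded per-partition filter stands in for randomSplit, which may not exist in this release:

    import org.apache.spark.SparkContext._  // mean() implicit in 0.9
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
    import org.apache.spark.rdd.RDD

    // Deterministic 80/20 split: the same seed yields complementary subsets.
    def split(data: RDD[LabeledPoint], seed: Int, train: Boolean) =
      data.mapPartitionsWithIndex { (pid, it) =>
        val rng = new scala.util.Random(seed + pid)
        it.filter(_ => (rng.nextDouble() < 0.8) == train)
      }

    val training = split(parsedData, 42, train = true).cache()
    val test = split(parsedData, 42, train = false)

    val model = LinearRegressionWithSGD.train(training, 100)
    val testMSE = test.map { p =>
      val err = p.label - model.predict(p.features)
      err * err
    }.mean()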

Re: pySpark memory usage

2014-03-29 Thread Jim Blomo
I think the problem I ran into in 0.9 is covered in https://issues.apache.org/jira/browse/SPARK-1323 When I kill the python process, the stack trace I get indicates that this happens at initialization. It looks like the initial write to the Python process does not go through, and then the iterato

Re: SQL on Spark - Shark or SparkSQL

2014-03-29 Thread Nicholas Chammas
This is a great question. We are in the same position, having not invested in Hive yet and looking at various options for SQL-on-Hadoop. On Sat, Mar 29, 2014 at 9:48 PM, Manoj Samel wrote: > Hi, > > In context of the recent Spark SQL announcement ( > http://databricks.com/blog/2014/03/26/Spark-S

Re: WikipediaPageRank Data Set

2014-03-29 Thread Tsai Li Ming
I’m interested in obtaining the data set too. Thanks! On 27 Mar, 2014, at 9:45 pm, Niko Stahl wrote: > Hello, > > I would like to run the WikipediaPageRank example, but the Wikipedia dump XML > files are no longer available on Freebase. Does anyone know an alternative > source for the data? >