RE: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-15 Thread linkpatrickliu
Seems like the thriftServer cannot connect to Zookeeper, so it cannot get the lock. This is how the log looks when I run SparkSQL: "load data inpath "kv1.txt" into table src;" log: 14/09/16 14:40:47 INFO Driver: 14/09/16 14:40:47 INFO ClientCnxn: Opening socket connection to server SVR4044HW2285.h

Re: spark and mesos issue

2014-09-15 Thread Gurvinder Singh
It might not be related only to the memory issue. The memory issue is also there, as you mentioned; I have seen that one too. The fine mode issue is mainly Spark considering that it got two different block managers for the same ID, whereas if I search for the ID in the Mesos slave, it exists only on the one slave

Re: Invalid signature file digest for Manifest main attributes with spark job built using maven

2014-09-15 Thread Kevin Peng
Sean, Thanks. That worked. Kevin On Mon, Sep 15, 2014 at 3:37 PM, Sean Owen wrote: > This is more of a Java / Maven issue than Spark per se. I would use > the shade plugin to remove signature files in your final META-INF/ > dir. As Spark does, in its : > > > > *:* > > org/d

Re: Complexity/Efficiency of SortByKey

2014-09-15 Thread Matei Zaharia
sortByKey is indeed O(n log n): it does a first pass to figure out even-sized partitions (by sampling the RDD), then a second pass to do a distributed merge-sort (first partitioning the data on each machine, then running a reduce phase that merges the data for each partition). The point where it becomes u
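A minimal sketch of the call being described, assuming an RDD of (Int, String) pairs built in the spark-shell; the data and partition count are arbitrary:

    // sortByKey samples the keys to pick range-partition boundaries,
    // then sorts each partition locally, as described above
    val pairs = sc.parallelize(1 to 1000000).map(i => (scala.util.Random.nextInt(), s"record-$i"))
    val sorted = pairs.sortByKey(ascending = true, numPartitions = 8)
    sorted.take(5).foreach(println)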

RE: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-15 Thread Cheng, Hao
Sorry, I am not able to reproduce that. Can you try adding the following entries into the hive-site.xml? I know they have default values, but let's make them explicit. hive.server2.thrift.port hive.server2.thrift.bind.host hive.server2.authentication (NONE, KERBEROS, LDAP, PAM or CUSTOM) -Origi

Re: NullWritable not serializable

2014-09-15 Thread Matei Zaharia
Can you post the exact code for the test that worked in 1.0? I can't think of much that could've changed. The one possibility is if  we had some operations that were computed locally on the driver (this happens with things like first() and take(), which will try to do the first partition locally

Complexity/Efficiency of SortByKey

2014-09-15 Thread cjwang
I wonder what algorithm is used to implement sortByKey? I assume it is some O(n*log(n)) parallelized on x number of nodes, right? Then, what size of data would make it worthwhile to use sortByKey on multiple processors rather than use standard Scala sort functions on a single processor (consider

RE: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-15 Thread linkpatrickliu
Besides, When I use bin/spark-sql, I can Load data and drop table freely. Only when I use sbin/start-thriftserver.sh and connect with beeline, the client will hang! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-1-1-hang-when-DROP-or-LOAD-tp14222

How to set executor num on spark on yarn

2014-09-15 Thread hequn cheng
hi~ I want to set the executor number to 16, but it is very strange that executor cores may affect executor num on Spark on YARN. I don't know why, or how to set the executor number. = ./bin/spark-submit --class com.hequn.spark.SparkJoins \ --master yarn-c
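For reference, a hedged sketch of the flags involved: on YARN the executor count is set with --num-executors and the per-executor cores with --executor-cores (the class, jar name and memory size below are placeholders, not the original poster's values):

    ./bin/spark-submit --class com.example.MyApp \
      --master yarn-cluster \
      --num-executors 16 \
      --executor-cores 2 \
      --executor-memory 4g \
      myapp.jar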

RE: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-15 Thread linkpatrickliu
Hi, Hao Cheng, This is my spark assembly jar name: spark-assembly-1.1.0-hadoop2.0.0-cdh4.6.0.jar I compiled spark 1.1.0 with following cmd: export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" mvn -Dhadoop.version=2.0.0-cdh4.6.0 -Phive -Pspark-ganglia-lgpl -DskipTests pa

Re: Re: About SpakSQL OR MLlib

2014-09-15 Thread boyingk...@163.com
case class Car(id:String,age:Int,tkm:Int,emissions:Int,date:Date, km:Int, fuel:Int) 1. Create a PairedRDD of (age,Car) tuples (pairedRDD) 2. Create a new function fc //returns the interval lower and upper bound def fc(x:Int, interval:Int) : (Int,Int) = { val floor = x - (x%interval)

RE: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-15 Thread Cheng, Hao
The Hadoop client jar should be assembled into the uber-jar, but (I suspect) it's probably not compatible with your Hadoop cluster. Can you also paste the Spark uber-jar name? Usually it will be under the path lib/spark-assembly-1.1.0-xxx-hadoopxxx.jar. -Original Message- From: linkpatrick

RE: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-15 Thread linkpatrickliu
Hi, Hao Cheng, Here is the Spark\Hadoop version: Spark version = 1.1.0 Hadoop version = 2.0.0-cdh4.6.0 And hive-site.xml: fs.default.name hdfs://ns dfs.nameservices ns dfs.ha.namenodes.ns machine01,machine02 dfs.namenode.rpc-address.ns.mach

Re: NullWritable not serializable

2014-09-15 Thread Du Li
Hi Matei, Thanks for your reply. The Writable classes have never been serializable and this is why it is weird. I did try as you suggested to map the Writables to integers and strings. It didn’t pass, either. Similar exceptions were thrown except that the messages became IntWritable, Text are

Re: About SpakSQL OR MLlib

2014-09-15 Thread Soumya Simanta
case class Car(id:String,age:Int,tkm:Int,emissions:Int,date:Date, km:Int, fuel:Int) 1. Create a PairedRDD of (age,Car) tuples (pairedRDD) 2. Create a new function fc //returns the interval lower and upper bound def fc(x:Int, interval:Int) : (Int,Int) = { val floor = x - (x%interval)
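The preview cuts the fc helper off; a plausible completion, assuming the intended upper bound is simply floor + interval:

    // returns the interval lower and upper bound (the second line is a guess at the truncated body)
    def fc(x: Int, interval: Int): (Int, Int) = {
      val floor = x - (x % interval)
      (floor, floor + interval)
    }

    fc(34, 10)  // (30, 40)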

Re: SPARK_MASTER_IP

2014-09-15 Thread Koert Kuipers
hey mark, you think that this is on purpose, or is it an omission? thanks, koert On Mon, Sep 15, 2014 at 8:32 PM, Mark Grover wrote: > Hi Koert, > I work on Bigtop and CDH packaging and you are right, based on my quick > glance, it doesn't seem to be used. > > Mark > > From: Koert Kuipers > Dat

About SpakSQL OR MLlib

2014-09-15 Thread boyingk...@163.com
Hi: I have a dataset with the structure [id, driverAge, TotalKiloMeter, Emissions, date, KiloMeter, fuel], and the data looks like this: [1-980,34,221926,9,2005-2-8,123,14] [1-981,49,271321,15,2005-2-8,181,82] [1-982,36,189149,18,2005-2-8,162,51] [1-983,51,232753,5,2005-2-8,106,92] [1-984,56,45338,8,2005-2-8,156,

RE: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-15 Thread Cheng, Hao
What's your Spark / Hadoop version? And also the hive-site.xml? Most cases like that are caused by an incompatible Hadoop client jar and Hadoop cluster. -Original Message- From: linkpatrickliu [mailto:linkpatrick...@live.com] Sent: Monday, September 15, 2014 2:35 PM To: u...@spark.incubat

Re: scala 2.11?

2014-09-15 Thread Mark Hamstra
Okay, that's consistent with what I was expecting. Thanks, Matei. On Mon, Sep 15, 2014 at 5:20 PM, Matei Zaharia wrote: > I think the current plan is to put it in 1.2.0, so that's what I meant by > "soon". It might be possible to backport it too, but I'd be hesitant to do > that as a maintenanc

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-15 Thread Christian Chua
Hi Paul. I would recommend building your own 1.1.0 distribution. ./make-distribution.sh --name hadoop-personal-build-2.4 --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests I downloaded the "Pre-built for Hadoop 2.4" binary, and it had this strange behavior where sp

Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-15 Thread Paul Wais
Dear List, I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for reading SequenceFiles. In particular, I'm seeing: Exception in thread "main" org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at org.apache.hadoop.ipc.Client.ca

Re: Does Spark always wait for stragglers to finish running?

2014-09-15 Thread Matei Zaharia
It's true that it does not send a kill command right now -- we should probably add that. This code was written before tasks were killable AFAIK. However, the *job* should still finish while a speculative task is running as far as I know, and it will just leave that task behind. Matei On Septem

SPARK_MASTER_IP

2014-09-15 Thread Mark Grover
Hi Koert, I work on Bigtop and CDH packaging and you are right, based on my quick glance, it doesn't seem to be used. Mark From: Koert Kuipers Date: Sat, Sep 13, 2014 at 7:03 AM Subject: SPARK_MASTER_IP To: user@spark.apache.org a grep for SPARK_MASTER_IP shows that sbin/start-master.sh and sb

Re: scala 2.11?

2014-09-15 Thread Matei Zaharia
I think the current plan is to put it in 1.2.0, so that's what I meant by "soon". It might be possible to backport it too, but I'd be hesitant to do that as a maintenance release on 1.1.x and 1.0.x since it would require nontrivial changes to the build that could break things on Scala 2.10. Mat

Re: Does Spark always wait for stragglers to finish running?

2014-09-15 Thread Pramod Biligiri
I'm already running with speculation set to true and the speculated tasks are launching, but the issue I'm observing is that Spark does not kill the long running task even if the shorter alternative has finished successfully. Therefore the overall turnaround time is still the same as without specul

"apply at Option.scala:120" callback in Spark 1.1, but no user code involved?

2014-09-15 Thread John Salvatier
In Spark 1.1, I'm seeing tasks with callbacks that don't involve my code at all! I'd seen something like this before in 1.0.0, but the behavior seems to be back apply at Option.scala:120 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.

Re: minPartitions for non-text files?

2014-09-15 Thread Eric Friedman
Yes, it's AvroParquetInputFormat, which is splittable. If I force a repartitioning, it works. If I don't, spark chokes on my not-terribly-large 250Mb files. PySpark's documentation says that the dictionary is turned into a Configuration object. @param conf: Hadoop configuration, passed in as a d

Re: Does Spark always wait for stragglers to finish running?

2014-09-15 Thread Du Li
There is a parameter spark.speculation that is turned off by default. Look at the configuration doc: http://spark.apache.org/docs/latest/configuration.html From: Pramod Biligiri mailto:pramodbilig...@gmail.com>> Date: Monday, September 15, 2014 at 3:30 PM To: "user@spark.apache.org
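For reference, a sketch of the speculation settings from that configuration page, set programmatically; the values shown are the documented defaults apart from enabling the flag itself:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.speculation", "true")             // off by default
      .set("spark.speculation.interval", "100")     // how often to check for stragglers, in ms
      .set("spark.speculation.multiplier", "1.5")   // how many times slower than the median before speculating
      .set("spark.speculation.quantile", "0.75")    // fraction of tasks that must finish before checking
    // pass this conf to the SparkContext (or set the same keys via spark-submit --conf)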

Convert GraphX Graph to Sparse Matrix

2014-09-15 Thread crockpotveggies
Hi everyone, I'm looking to implement Markov algorithms in GraphX and I'm wondering if it's already possible to auto-convert the Graph into a Sparse Double Matrix? I've seen this implemented in other graphs before, namely JUNG, but still familiarizing myself with GraphX. Example: https://code.goog

Re: MLLib sparse vector

2014-09-15 Thread Chris Gore
Probably worth noting that the factory methods in mllib create an object of type org.apache.spark.mllib.linalg.Vector which stores data in a similar format as Breeze vectors Chris On Sep 15, 2014, at 3:24 PM, Xiangrui Meng wrote: > Or you can use the factory method `Vectors.sparse`: > > val

Re: Invalid signature file digest for Manifest main attributes with spark job built using maven

2014-09-15 Thread Sean Owen
This is more of a Java / Maven issue than Spark per se. I would use the shade plugin to remove signature files in your final META-INF/ dir. As Spark does, in its : *:* org/datanucleus/** META-INF/*.SF META-INF/*.DSA META-INF/*.RSA On Mon, Sep 15, 2014
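The filter the flattened snippet above refers to looks roughly like this in a pom.xml; this is reconstructed from the excerpt, so treat it as a sketch rather than Spark's exact build file:

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <configuration>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>org/datanucleus/**</exclude>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
      </configuration>
    </plugin>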

Invalid signature file digest for Manifest main attributes with spark job built using maven

2014-09-15 Thread kpeng1
Hi All, I am trying to submit a spark job that I have built in maven using the following command: /usr/bin/spark-submit --deploy-mode client --class com.spark.TheMain --master local[1] /home/cloudera/myjar.jar 100 But I seem to be getting the following error: Exception in thread "main" java.lang.

Does Spark always wait for stragglers to finish running?

2014-09-15 Thread Pramod Biligiri
Hi, I'm running Spark tasks with speculation enabled. I'm noticing that Spark seems to wait in a given stage for all stragglers to finish, even though the speculated alternative might have finished sooner. Is that correct? Is there a way to indicate to Spark not to wait for stragglers to finish?

Re: minPartitions for non-text files?

2014-09-15 Thread Sean Owen
Heh, it's still just a suggestion to Hadoop I guess, not guaranteed. Is it a splittable format? for example, some compressed formats are not splittable and Hadoop has to process whole files at a time. I'm also not sure if this is something to do with pyspark, since the underlying Scala API takes

Re: MLLib sparse vector

2014-09-15 Thread Xiangrui Meng
Or you can use the factory method `Vectors.sparse`: val sv = Vectors.sparse(numProducts, productIds.map(x => (x, 1.0))) where numProducts should be the largest product id plus one. Best, Xiangrui On Mon, Sep 15, 2014 at 12:46 PM, Chris Gore wrote: > Hi Sameer, > > MLLib uses Breeze’s vector fo
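A self-contained sketch of that factory call; the product ids below are made up:

    import org.apache.spark.mllib.linalg.Vectors

    // user interacted with products 1, 4 and 7 out of 10 products
    val numProducts = 10
    val productIds = Seq(1, 4, 7)
    val sv = Vectors.sparse(numProducts, productIds.map(i => (i, 1.0)))
    // sv: (10,[1,4,7],[1.0,1.0,1.0])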

Re: minPartitions for non-text files?

2014-09-15 Thread Eric Friedman
That would be awesome, but doesn't seem to have any effect. In PySpark, I created a dict with that key and a numeric value, then passed it into newAPIHadoopFile as a value for the "conf" keyword. The returned RDD still has a single partition. On Mon, Sep 15, 2014 at 1:56 PM, Sean Owen wrote: >

Re: Accuracy hit in classification with Spark

2014-09-15 Thread Xiangrui Meng
Thanks for the update! -Xiangrui On Sun, Sep 14, 2014 at 11:33 PM, jatinpreet wrote: > Hi, > > I have been able to get the same accuracy with MLlib as Mahout's. The > pre-processing phase of Mahout was the reason behind the accuracy mismatch. > After studying and applying the same logic in my co

Re: Define the name of the outputs with Java-Spark.

2014-09-15 Thread Xiangrui Meng
Spark doesn't support MultipleOutput at this time. You can cache the parent RDD. Then create RDDs from it and save them separately. -Xiangrui On Fri, Sep 12, 2014 at 7:45 AM, Guillermo Ortiz wrote: > > I would like to define the names of my output in Spark, I have a process > which write many fai
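A minimal sketch of that workaround: cache the parent once, then derive and save one RDD per output path (the keys and paths are placeholders):

    // parent RDD keyed by, e.g., the date embedded in each line
    val parent = sc.textFile("/input").map(line => (line.split(",")(0), line)).cache()

    val keys = Seq("2014-09-14", "2014-09-15")
    keys.foreach { k =>
      parent.filter { case (key, _) => key == k }
            .values
            .saveAsTextFile("/output/" + k)
    }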

Re: PathFilter for newAPIHadoopFile?

2014-09-15 Thread Davies Liu
Or maybe you could give this one a try: https://labs.spotify.com/2013/05/07/snakebite/ On Mon, Sep 15, 2014 at 2:51 PM, Davies Liu wrote: > There is one way by do it in bash: hadoop fs -ls , maybe you could > end up with a bash scripts to do the things. > > On Mon, Sep 15, 2014 at 1:01 PM, Er

Re: PathFilter for newAPIHadoopFile?

2014-09-15 Thread Davies Liu
There is one way to do it in bash: hadoop fs -ls ; maybe you could end up with a bash script to do these things. On Mon, Sep 15, 2014 at 1:01 PM, Eric Friedman wrote: > That's a good idea and one I had considered too. Unfortunately I'm not > aware of an API in PySpark for enumerating paths on

Re: Weird aggregation results when reusing objects inside reduceByKey

2014-09-15 Thread Sean Owen
It isn't a question of an item being reduced twice, but of when objects may be reused to represent other items. I don't think you have a guarantee that you can safely reuse the objects in this argument, but I'd also be interested if there was a case where this is guaranteed. For example I'm guess

Re: vertex active/inactive feature in Pregel API ?

2014-09-15 Thread Ankur Dave
At 2014-09-15 16:25:04 +0200, Yifan LI wrote: > I am wondering if the vertex active/inactive(corresponding the change of its > value between two supersteps) feature is introduced in Pregel API of GraphX? Vertex activeness in Pregel is controlled by messages: if a vertex did not receive a messag
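A small sketch of what that looks like in a sendMessage function, here using a shortest-path style computation rather than the HashMap-valued vertices from the original question: a destination vertex that receives no message is skipped (treated as inactive) in the next superstep.

    import org.apache.spark.graphx._

    // vertices carry a Double (current distance), edges carry an Int weight
    def sendMessage(edge: EdgeTriplet[Double, Int]): Iterator[(VertexId, Double)] = {
      val candidate = edge.srcAttr + edge.attr
      if (candidate < edge.dstAttr)
        Iterator((edge.dstId, candidate))  // value would change: dst stays active
      else
        Iterator.empty                     // no message: dst is not recomputed next superstep
    }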

Weird aggregation results when reusing objects inside reduceByKey

2014-09-15 Thread kriskalish
I have a pretty simple scala spark aggregation job that is summing up number of occurrences of two types of events. I have run into situations where it seems to generate bad values that are clearly incorrect after reviewing the raw data. First I have a Record object which I use to do my aggregati
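A sketch of the safer pattern implied by the reply above: build a fresh, immutable record in the reduce function instead of mutating either argument (the Record fields here are assumptions, not the original poster's exact class):

    case class Record(eventA: Long, eventB: Long)

    val events = sc.parallelize(Seq(
      ("user1", Record(1, 0)),
      ("user1", Record(0, 1)),
      ("user2", Record(1, 1))))

    // returning a new Record avoids relying on safe reuse of the input objects
    val totals = events.reduceByKey((a, b) => Record(a.eventA + b.eventA, a.eventB + b.eventB))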

Re: minPartitions for non-text files?

2014-09-15 Thread Sean Owen
I think the reason is simply that there is no longer an explicit min-partitions argument for Hadoop InputSplits in the new Hadoop APIs. At least, I didn't see it when I glanced just now. However, you should be able to get the same effect by setting a Configuration property, and you can do so throu
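A sketch of that approach in Scala, capping the split size through a Configuration so the new-API input format produces more splits. The property name shown is the mapreduce-2 one and may differ on older Hadoop versions, and TextInputFormat stands in for the questioner's AvroParquetInputFormat:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("mapreduce.input.fileinputformat.split.maxsize", (16 * 1024 * 1024).toString) // 16 MB splits

    val rdd = sc.newAPIHadoopFile("/data/file.txt",
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)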

minPartitions for non-text files?

2014-09-15 Thread Eric Friedman
sc.textFile takes a minimum # of partitions to use. Is there a way to get sc.newAPIHadoopFile to do the same? I know I can repartition() and get a shuffle. I'm wondering if there's a way to tell the underlying InputFormat (AvroParquet, in my case) how many partitions to use at the outset. What

Re: Spark Streaming union expected behaviour?

2014-09-15 Thread Varad Joshi
I am seeing the same exact behavior. Shrikar, did you get any response to your post? Varad -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-union-expected-behaviour-tp7206p14284.html Sent from the Apache Spark User List mailing list archive a

Re: Efficient way to sum multiple columns

2014-09-15 Thread Xiangrui Meng
Please check the colStats method defined under mllib.stat.Statistics. -Xiangrui On Mon, Sep 15, 2014 at 1:00 PM, jamborta wrote: > Hi all, > > I have an RDD that contains around 50 columns. I need to sum each column, > which I am doing by running it through a for loop, creating an array and > run
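A small sketch of that suggestion: wrap each row in an MLlib Vector, compute the column statistics in one pass, and derive the sums from the per-column means (the toy data is made up):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    val data = sc.parallelize(Seq(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0)))
    val summary = Statistics.colStats(data.map(arr => Vectors.dense(arr)))
    // sum of each column = mean * row count
    val columnSums = summary.mean.toArray.map(_ * summary.count)  // Array(5.0, 7.0, 9.0)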

Re: PathFilter for newAPIHadoopFile?

2014-09-15 Thread Eric Friedman
That's a good idea and one I had considered too. Unfortunately I'm not aware of an API in PySpark for enumerating paths on HDFS. Have I overlooked one? On Mon, Sep 15, 2014 at 10:01 AM, Davies Liu wrote: > In PySpark, I think you could enumerate all the valid files, and create > RDD by > newAP

Efficient way to sum multiple columns

2014-09-15 Thread jamborta
Hi all, I have an RDD that contains around 50 columns. I need to sum each column, which I am doing by running it through a for loop, creating an array and running the sum function as follows: for (i <- 0 until 10) yield { data.map(x => x(i)).sum } is there a better way to do this? thanks,

Re: MLLib sparse vector

2014-09-15 Thread Chris Gore
Hi Sameer, MLLib uses Breeze’s vector format under the hood. You can use that. http://www.scalanlp.org/api/breeze/index.html#breeze.linalg.SparseVector For example: import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV} val numClasses = classes.distinct.count.toInt val

Re: Write 1 RDD to multiple output paths in one go

2014-09-15 Thread Nicholas Chammas
Davies, That’s pretty neat. I heard there was a pure Python clone of Spark out there—so you were one of the people behind it! I’ve created a JIRA issue about this. SPARK-3533: Add saveAsTextFileByKey() method to RDDs Sean, I think you might be

Re: scala 2.11?

2014-09-15 Thread Mark Hamstra
Are we going to put 2.11 support into 1.1 or 1.0? Else "will be in soon" applies to the master development branch, but actually in the Spark 1.2.0 release won't occur until the second half of November at the earliest. On Mon, Sep 15, 2014 at 12:11 PM, Matei Zaharia wrote: > Scala 2.11 work is u

Re: scala 2.11?

2014-09-15 Thread Matei Zaharia
Scala 2.11 work is under way in open pull requests though, so hopefully it will be in soon. Matei On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja...@gmail.com) wrote: ah...thanks! On Mon, Sep 15, 2014 at 9:47 AM, Mark Hamstra wrote: No, not yet.  Spark SQL is using org.scalamacros:q

Re: How to initiate a shutdown of Spark Streaming context?

2014-09-15 Thread Jeoffrey Lim
What we did for gracefully shutting down the spark streaming context is extend a Spark Web UI Tab and perform a SparkContext.SparkUI.attachTab(). However, the custom Scala Web UI extensions need to be under the package org.apache.spark.ui to get around the package access restrictions. Would

Dealing with Time Series Data

2014-09-15 Thread Gary Malouf
I have a use case for our data in HDFS that involves sorting chunks of data into time series format by a specific characteristic and doing computations from that. At large scale, what is the most efficient way to do this? Obviously, having the data sharded by that characteristic would make the pe

Example of Geoprocessing with Spark

2014-09-15 Thread Abel Coronado Iruegas
Here an example of a working code that takes a csv with lat lon points and intersects with polygons of municipalities of Mexico, generating a new version of the file with new attributes. Do you think that could be improved? Thanks. The Code: import org.apache.spark.SparkContext import org.apach

MLLib sparse vector

2014-09-15 Thread Sameer Tilak
Hi All, I have transformed the data into the following format: the first column is the user id, and all the other columns are class ids. For a user, only the class ids that appear in that row have value 1 and the others are 0. I need to create a sparse vector from this. Does the API for creating a sparse vector

Need help with ThriftServer/Spark1.1.0

2014-09-15 Thread Yana Kadiyska
Hi ladies and gents, trying to get Thrift server up and running in an effort to replace Shark. My first attempt to run sbin/start-thriftserver resulted in: 14/09/15 17:09:05 ERROR TThreadPoolServer: Error occurred during processing of message. java.lang.RuntimeException: org.apache.thrift.transp

Re: Low Level Kafka Consumer for Spark

2014-09-15 Thread Dibyendu Bhattacharya
Hi Tim, I have not tried persisting the RDD. There is some discussion of rate limiting Spark Streaming in this thread: http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-rate-limiting-from-kafka-td8590.html There is a Pull Request https://github.com/apache/spark/pull/945/

Re: Write 1 RDD to multiple output paths in one go

2014-09-15 Thread Davies Liu
Maybe we should provide an API like saveTextFilesByKey(path); could you create a JIRA for it? There is one in DPark [1] actually. [1] https://github.com/douban/dpark/blob/master/dpark/rdd.py#L309 On Mon, Sep 15, 2014 at 7:08 AM, Nicholas Chammas wrote: > Any tips from anybody on how to do thi

Re: PathFilter for newAPIHadoopFile?

2014-09-15 Thread Davies Liu
In PySpark, I think you could enumerate all the valid files, and create RDD by newAPIHadoopFile(), then union them together. On Mon, Sep 15, 2014 at 5:49 AM, Eric Friedman wrote: > I neglected to specify that I'm using pyspark. Doesn't look like these APIs > have been bridged. > > > Eric Fr

Re: Low Level Kafka Consumer for Spark

2014-09-15 Thread Tim Smith
Hi Dibyendu, I am a little confused about the need for rate limiting input from kafka. If the stream coming in from kafka has higher message/second rate than what a Spark job can process then it should simply build a backlog in Spark if the RDDs are cached on disk using persist(). Right? Thanks,

Re: Broadcast error

2014-09-15 Thread Davies Liu
I think the 1.1 will be really helpful for you; it's all compatible with 1.0, so it's not hard to upgrade to 1.1. On Mon, Sep 15, 2014 at 2:35 AM, Chengi Liu wrote: > So.. same result with parallelize (matrix,1000) > with broadcast.. seems like I got jvm core dump :-/ 4/09/15 02:31:22 INFO Blo

Re: File I/O in spark

2014-09-15 Thread Frank Austin Nothaft
Kartheek, What exactly are you trying to do? Those APIs are only for local file access. If you want to access data in HDFS, you’ll want to use one of the reader methods in org.apache.spark.SparkContext which will give you an RDD (e.g., newAPIHadoopFile, sequenceFile, or textFile). If you want t

Re: File I/O in spark

2014-09-15 Thread Mohit Jaggi
If your underlying filesystem is HDFS, you need to use the HDFS APIs. A google search brought up this link, which appears reasonable: http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample If you want to use java.io APIs, you have to make sure your filesystem is accessible from all nodes in your clust
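A minimal sketch of writing through the HDFS client API from the driver instead of java.io; the path is a placeholder and this assumes the cluster's default filesystem is HDFS:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val out = fs.create(new Path("/tmp/test.txt"))   // resolved against the default FS (HDFS here)
    out.write("Hello Scala".getBytes("UTF-8"))
    out.close()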

Re: File I/O in spark

2014-09-15 Thread rapelly kartheek
Can you please direct me to the right way of doing this. On Mon, Sep 15, 2014 at 10:18 PM, rapelly kartheek wrote: > I came across these APIs in one the scala tutorials over the net. > > On Mon, Sep 15, 2014 at 10:14 PM, Mohit Jaggi > wrote: > >> But the above APIs are not for HDFS. >> >> On Mo

Re: File I/O in spark

2014-09-15 Thread rapelly kartheek
I came across these APIs in one of the Scala tutorials on the net. On Mon, Sep 15, 2014 at 10:14 PM, Mohit Jaggi wrote: > But the above APIs are not for HDFS. > > On Mon, Sep 15, 2014 at 9:40 AM, rapelly kartheek > wrote: > >> Yes. I have HDFS. My cluster has 5 nodes. When I run the above comman

Re: scala 2.11?

2014-09-15 Thread Mohit Jaggi
ah...thanks! On Mon, Sep 15, 2014 at 9:47 AM, Mark Hamstra wrote: > No, not yet. Spark SQL is using org.scalamacros:quasiquotes_2.10. > > On Mon, Sep 15, 2014 at 9:28 AM, Mohit Jaggi wrote: > >> Folks, >> I understand Spark SQL uses quasiquotes. Does that mean Spark has now >> moved to Scala 2

Re: scala 2.11?

2014-09-15 Thread Mark Hamstra
No, not yet. Spark SQL is using org.scalamacros:quasiquotes_2.10. On Mon, Sep 15, 2014 at 9:28 AM, Mohit Jaggi wrote: > Folks, > I understand Spark SQL uses quasiquotes. Does that mean Spark has now > moved to Scala 2.11? > > Mohit. >

Re: File I/O in spark

2014-09-15 Thread Mohit Jaggi
But the above APIs are not for HDFS. On Mon, Sep 15, 2014 at 9:40 AM, rapelly kartheek wrote: > Yes. I have HDFS. My cluster has 5 nodes. When I run the above commands, I > see that the file gets created in the master node. But, there wont be any > data written to it. > > > On Mon, Sep 15, 2014

Re: File I/O in spark

2014-09-15 Thread rapelly kartheek
The file gets created on the fly, so I don't know how to make sure that it's accessible to all nodes. On Mon, Sep 15, 2014 at 10:10 PM, rapelly kartheek wrote: > Yes. I have HDFS. My cluster has 5 nodes. When I run the above commands, I > see that the file gets created in the master node. But, the

Re: File I/O in spark

2014-09-15 Thread rapelly kartheek
Yes. I have HDFS. My cluster has 5 nodes. When I run the above commands, I see that the file gets created in the master node. But, there won't be any data written to it. On Mon, Sep 15, 2014 at 10:06 PM, Mohit Jaggi wrote: > Is this code running in an executor? You need to make sure the file is

Re: Compiler issues for multiple map on RDD

2014-09-15 Thread Sean Owen
(Adding back the user list) Boromir says: Thanks much Sean, verified 1.1.0 does not have this issue. On Mon, Sep 15, 2014 at 4:47 PM, Sean Owen wrote: > Looks like another instance of > https://issues.apache.org/jira/browse/SPARK-1199 which was intended to > be fixed in 1.1.0. > > I'm not clear

Re: File I/O in spark

2014-09-15 Thread Mohit Jaggi
Is this code running in an executor? You need to make sure the file is accessible on ALL executors. One way to do that is to use a distributed filesystem like HDFS or GlusterFS. On Mon, Sep 15, 2014 at 8:51 AM, rapelly kartheek wrote: > Hi > > I am trying to perform some read/write file operatio

Re: How to initiate a shutdown of Spark Streaming context?

2014-09-15 Thread stanley
Thank you. Would the following approaches to addressing this problem be overkill? a. create a ServerSocket in a different thread from the main thread that created the Spark StreamingContext, and listen for the shutdown command there b. create a web service that wraps around the main thread that create
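A rough sketch of option (a), under the assumption that a graceful stop via StreamingContext.stop is acceptable; the port is arbitrary and there is no authentication, so this is illustration only:

    import java.net.ServerSocket
    import org.apache.spark.streaming.StreamingContext

    def listenForShutdown(ssc: StreamingContext, port: Int): Unit = {
      new Thread("streaming-shutdown-listener") {
        override def run(): Unit = {
          val server = new ServerSocket(port)
          server.accept()   // block until anything connects to the control port
          server.close()
          ssc.stop(stopSparkContext = true, stopGracefully = true)
        }
      }.start()
    }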

scala 2.11?

2014-09-15 Thread Mohit Jaggi
Folks, I understand Spark SQL uses quasiquotes. Does that mean Spark has now moved to Scala 2.11? Mohit.

Re: Compiler issues for multiple map on RDD

2014-09-15 Thread Sean Owen
Looks like another instance of https://issues.apache.org/jira/browse/SPARK-1199 which was intended to be fixed in 1.1.0. I'm not clear whether https://issues.apache.org/jira/browse/SPARK-2620 is the same issue and therefore whether it too is resolved in 1.1? On Mon, Sep 15, 2014 at 4:37 PM, Borom

File I/O in spark

2014-09-15 Thread rapelly kartheek
Hi I am trying to perform some read/write file operations in spark. Somehow I am neither able to write to a file nor read. import java.io._ val writer = new PrintWriter(new File("test.txt" )) writer.write("Hello Scala") Can someone please tell me how to perform file I/O in spark.

Compiler issues for multiple map on RDD

2014-09-15 Thread Boromir Widas
Hello Folks, I am trying to chain a couple of map operations and it seems the second map fails with a mismatch in arguments (even though the compiler prints them to be the same). I checked the function and variable types using :t and they look ok to me. Have you seen this earlier? I am posting th

Re: About SparkSQL 1.1.0 join between more than two table

2014-09-15 Thread Yin Huai
1.0.1 does not have support for outer joins (that was added in 1.1). Your query should be fine in 1.1. On Mon, Sep 15, 2014 at 5:35 AM, Yanbo Liang wrote: > Spark SQL can support SQL and HiveSQL which used SQLContext and > HiveContext separate. > As far as I know, SQLContext of Spark SQL 1.1.0 can not

Found both spark.driver.extraClassPath and SPARK_CLASSPATH

2014-09-15 Thread Koert Kuipers
in spark 1.1.0 i get this error: 2014-09-14 23:17:01 ERROR actor.OneForOneStrategy: Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former. i checked my application. i do not set spark.driver.extraClassPath or SPARK_CLASSPATH. SPARK_CLASSPATH is set in spark-env.sh since

Re: Write 1 RDD to multiple output paths in one go

2014-09-15 Thread Sean Owen
AFAIK there is no direct equivalent in Spark. You can cache or persist an RDD, and then run N separate operations to output different things from it, which is pretty close. I think you might be able to get this working with a subclass of MultipleTextOutputFormat, which overrides generateFileNameF
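A sketch of that MultipleTextOutputFormat route using the old mapred API; the key values, output path and class name are placeholders:

    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      // write each record under a subdirectory named after its key
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString + "/" + name
    }

    val pairs = sc.parallelize(Seq(("2014-09-14", "a"), ("2014-09-15", "b")))
    pairs.saveAsHadoopFile("/output", classOf[String], classOf[String], classOf[KeyBasedOutput])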

vertex active/inactive feature in Pregel API ?

2014-09-15 Thread Yifan LI
Hi, I am wondering if the vertex active/inactive(corresponding the change of its value between two supersteps) feature is introduced in Pregel API of GraphX? if it is not a default setting, how to call it below? def sendMessage(edge: EdgeTriplet[(Int,HashMap[VertexId, Double]), Int]) = It

Re: Write 1 RDD to multiple output paths in one go

2014-09-15 Thread Nicholas Chammas
Any tips from anybody on how to do this in PySpark? (Or regular Spark, for that matter.) On Sat, Sep 13, 2014 at 1:25 PM, Nick Chammas wrote: > Howdy doody Spark Users, > > I’d like to somehow write out a single RDD to multiple paths in one go. > Here’s an example. > > I have an RDD of (key, val

Upgrading a standalone cluster on ec2 from 1.0.2 to 1.1.0

2014-09-15 Thread Tomer Benyamini
Hi, I would like to upgrade a standalone cluster to 1.1.0. What's the best way to do it? Should I just replace the existing /root/spark folder with the uncompressed folder from http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz ? What about hdfs and other installations? I have spark 1.

Re: PathFilter for newAPIHadoopFile?

2014-09-15 Thread Eric Friedman
I neglected to specify that I'm using pyspark. Doesn't look like these APIs have been bridged. Eric Friedman > On Sep 14, 2014, at 11:02 PM, Nat Padmanabhan wrote: > > Hi Eric, > > Something along the lines of the following should work > > val fs = getFileSystem(...) // standard hadoop

Re: Serving data

2014-09-15 Thread Marius Soutier
Nice, I’ll check it out. At first glance, writing Parquet files seems to be a bit complicated. On 15.09.2014, at 13:54, andy petrella wrote: > nope. > It's an efficient storage for genomics data :-D > > aℕdy ℙetrella > about.me/noootsab > > > > On Mon, Sep 15, 2014 at 1:52 PM, Marius Soutie
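For what it's worth, in Spark 1.1 the simplest way to write Parquet is probably through Spark SQL's SchemaRDD; a minimal sketch with a made-up case class and path:

    import org.apache.spark.sql.SQLContext

    case class Record(key: String, value: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicitly turns an RDD of case classes into a SchemaRDD

    val records = sc.parallelize(Seq(Record("a", 1), Record("b", 2)))
    records.saveAsParquetFile("/tmp/records.parquet")            // write
    val loaded = sqlContext.parquetFile("/tmp/records.parquet")  // read it back as a SchemaRDD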

Re: Serving data

2014-09-15 Thread andy petrella
nope. It's an efficient storage for genomics data :-D aℕdy ℙetrella about.me/noootsab [image: aℕdy ℙetrella on about.me] On Mon, Sep 15, 2014 at 1:52 PM, Marius Soutier wrote: > So you are living the dream of using HDFS as a database? ;) > > On 15.09.2014, at 13:50,

Re: Serving data

2014-09-15 Thread Marius Soutier
So you are living the dream of using HDFS as a database? ;) On 15.09.2014, at 13:50, andy petrella wrote: > I'm using Parquet in ADAM, and I can say that it works pretty fine! > Enjoy ;-) > > aℕdy ℙetrella > about.me/noootsab > > > > On Mon, Sep 15, 2014 at 1:41 PM, Marius Soutier wrote: >

Re: Serving data

2014-09-15 Thread andy petrella
I'm using Parquet in ADAM, and I can say that it works pretty fine! Enjoy ;-) aℕdy ℙetrella about.me/noootsab [image: aℕdy ℙetrella on about.me] On Mon, Sep 15, 2014 at 1:41 PM, Marius Soutier wrote: > Thank you guys, I’ll try Parquet and if that’s not quick enough I

Re: Serving data

2014-09-15 Thread Marius Soutier
Thank you guys, I’ll try Parquet and if that’s not quick enough I’ll go the usual route with either read-only or normal database. On 13.09.2014, at 12:45, andy petrella wrote: > however, the cache is not guaranteed to remain, if other jobs are launched in > the cluster and require more memory

Re: Low Level Kafka Consumer for Spark

2014-09-15 Thread Dibyendu Bhattacharya
Hi Alon, No, this does not guarantee that the same set of messages will come in the same RDD. This fix just re-plays the messages from the last processed offset in the same order. Again, this is just an interim fix we needed to solve our use case. If you do not need this message re-play feature, just do not perfo

Re: Low Level Kafka Consumer for Spark

2014-09-15 Thread Alon Pe'er
Hi Dibyendu, Thanks for your great work! I'm new to Spark Streaming, so I just want to make sure I understand Driver failure issue correctly. In my use case, I want to make sure that messages coming in from Kafka are always broken into the same set of RDDs, meaning that if a set of messages are

Re: Broadcast error

2014-09-15 Thread Chengi Liu
So.. same result with parallelize (matrix,1000) with broadcast.. seems like I got jvm core dump :-/ 4/09/15 02:31:22 INFO BlockManagerInfo: Registering block manager host:47978 with 19.2 GB RAM 14/09/15 02:31:22 INFO BlockManagerInfo: Registering block manager host:43360 with 19.2 GB RAM Unhandled

Re: About SparkSQL 1.1.0 join between more than two table

2014-09-15 Thread Yanbo Liang
Spark SQL can support SQL and HiveSQL, which use SQLContext and HiveContext separately. As far as I know, SQLContext of Spark SQL 1.1.0 cannot support a three-table join directly. However you can modify your query with a subquery such as SELECT * FROM (SELECT * FROM youhao_data left join youhao_age on

Re: Broadcast error

2014-09-15 Thread Akhil Das
Try: rdd = sc.broadcast(matrix) Or rdd = sc.parallelize(matrix,100) // Just increase the number of slices, give it a try. Thanks Best Regards On Mon, Sep 15, 2014 at 2:18 PM, Chengi Liu wrote: > Hi Akhil, > So with your config (specifically with set("spark.akka.frameSize ", > "1000")

Re: Broadcast error

2014-09-15 Thread Chengi Liu
Hi Akhil, So with your config (specifically with set("spark.akka.frameSize ", "1000")) , I see the error: org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 0:0 was 401970046 bytes which exceeds spark.akka.frameSize (10485760 bytes). Consider using broadcast va

Re: Broadcast error

2014-09-15 Thread Akhil Das
Can you give this a try: conf = SparkConf().set("spark.executor.memory", "32G").set("spark.akka.frameSize ", "1000").set("spark.broadcast.factory","org.apache.spark.broadcast.TorrentBroadcastFactory") sc = SparkContext(conf = conf) rdd = sc.parallelize(matrix,5) from pyspark.mllib.

Re: Dependency Problem with Spark / ScalaTest / SBT

2014-09-15 Thread Thorsten Bergler
Hello, When I remove the line and try to execute "sbt run", I end up with the following lines: 14/09/15 10:11:35 INFO ui.SparkUI: Stopped Spark web UI at http://base:4040 [...] 14/09/15 10:11:05 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster
