Re: Spark partition

2014-07-30 Thread Haiyang Fu
Hi, you may refer to http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism and http://spark.apache.org/docs/latest/programming-guide.html#parallelized-collections, both of which are about RDD partitions. As you are going to load data from HDFS, you may also need to know h
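
A minimal sketch of the two knobs those pages describe, assuming a hypothetical TSV file on HDFS and an already-created SparkContext `sc`: request a minimum number of input partitions when loading, or repartition afterwards.

    // Ask for at least 64 input partitions when reading from HDFS (path is a placeholder).
    val lines = sc.textFile("hdfs:///data/input.tsv", 64)
    // Or explicitly reshuffle an existing RDD to a chosen number of partitions.
    val repartitioned = lines.repartition(128)
    println(repartitioned.partitions.length)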

Re: spark streaming actor receiver doesn't play well with kryoserializer

2014-07-30 Thread Prashant Sharma
This looks like a bug to me. This happens because we serialize the code that starts the receiver and send it across. And since we have not registered the classes of the Akka library, it does not work. I have not tried it myself, but maybe by including something like chill-akka ( https://github.com/xitrum-

Re: HiveContext is creating metastore warehouse locally instead of in hdfs

2014-07-30 Thread chenjie
Hi, Michael. I have the same problem. My warehouse directory is always created locally. I copied the default hive-site.xml into the $SPARK_HOME/conf directory on each node. After I executed the code below, val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) hiveContext.hql("CREA

RE: Data from Mysql using JdbcRDD

2014-07-30 Thread Cheng, Hao
Probably you need to update the SQL to something like "SELECT * FROM student_info where id >= ? and id <= ?". -Original Message- From: srinivas [mailto:kusamsrini...@gmail.com] Sent: Thursday, July 31, 2014 6:55 AM To: u...@spark.incubator.apache.org Subject: Data from Mysql using JdbcRDD Hi, I am

Spark partition

2014-07-30 Thread Sameer Tilak
Hi All, From the documentation, RDDs are already partitioned and distributed. However, there is a way to repartition a given RDD using the following function. Can someone please point out the best practices for using this? I have a 10 GB TSV file stored in HDFS and I have a 4 node cluster with 1 ma

Re: Index calculation will cause integer overflow of numPartitions > 10362 in sortByKey

2014-07-30 Thread Jianshi Huang
Looks like I cannot assign it. On Thu, Jul 31, 2014 at 11:56 AM, Larry Xiao wrote: > Hi > > Can you assign it to me? Thanks > > Larry > > > On 7/31/14, 10:47 AM, Jianshi Huang wrote: > >> I created this JIRA issue, somebody please pick it up? >> >> https://issues.apache.org/jira/browse/SPARK-27

Re: Logging in Spark through YARN.

2014-07-30 Thread Archit Thakur
Hi Marcelo, Thanks for your quick comment. This doesn't seem to work in the 1.0.0 release. -Archit Thakur. On Thu, Jul 31, 2014 at 3:18 AM, Marcelo Vanzin wrote: > Hi Archit, > > Are you using spark-submit? If so, can you try adding the following to > its command line: > > --files /dir/log4j.p

Re: GraphX Connected Components

2014-07-30 Thread Ankur Dave
On Wed, Jul 30, 2014 at 11:32 PM, Jeffrey Picard wrote: > That worked! The entire thing ran in about an hour and a half, thanks! Great! > Is there by chance an easy way to build spark apps using the master branch > build of spark? I’ve been having to use the spark-shell. The easiest way is pro

Re: Index calculation will cause integer overflow of numPartitions > 10362 in sortByKey

2014-07-30 Thread Larry Xiao
Hi Can you assign it to me? Thanks Larry On 7/31/14, 10:47 AM, Jianshi Huang wrote: I created this JIRA issue, somebody please pick it up? https://issues.apache.org/jira/browse/SPARK-2728 -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Spark on Yarn

2014-07-30 Thread li.ching.090
Hi, So far I have been using Spark in standalone mode. I am planning to use Spark on YARN; as it has been around for some time, can you please suggest any pros and cons? Thanks,

Re: GraphX Connected Components

2014-07-30 Thread Jeffrey Picard
On Jul 30, 2014, at 4:39 PM, Ankur Dave wrote: > Jeffrey Picard writes: >> I tried unpersisting the edges and vertices of the graph by hand, then >> persisting the graph with persist(StorageLevel.MEMORY_AND_DISK). I still see >> the same behavior in connected components however, and the same th

Re: spark.scheduler.pool seems not working in spark streaming

2014-07-30 Thread liuwei
Hi, Tathagata Das: I followed your advice and solved this problem, thank you very much! On Jul 31, 2014, at 3:07 AM, Tathagata Das wrote: > This is because setLocalProperty makes all Spark jobs submitted using > the current thread belong to the set pool. However, in Spark > Streaming, all the jobs a

RE: spark.shuffle.consolidateFiles seems not working

2014-07-30 Thread Shao, Saisai
I don't think it's a bug in consolidated shuffle; it's a Linux configuration problem. The default open-file limit in Linux is 1024, and when your open file count exceeds 1024 you will get the error you mentioned below. So you can raise the open-file limit by: ulimit -n xxx or write in

Index calculation will cause integer overflow of numPartitions > 10362 in sortByKey

2014-07-30 Thread Jianshi Huang
I created this JIRA issue, somebody please pick it up? https://issues.apache.org/jira/browse/SPARK-2728 -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: spark.shuffle.consolidateFiles seems not working

2014-07-30 Thread Jianshi Huang
Ok... but my question is why spark.shuffle.consolidateFiles is working (or is it)? Is this a bug? On Wed, Jul 30, 2014 at 4:29 PM, Larry Xiao wrote: > Hi Jianshi, > > I've met similar situation before. > And my solution was 'ulimit', you can use > > -a to see your current settings > -n to set o

RDD.coalesce got compilation error

2014-07-30 Thread Jianshi Huang
In my code I have something like val rdd: RDD[(WrapWithComparable[(Array[Byte], Array[Byte], Array[Byte])], Externalizer[KeyValue])] = ... val rdd_coalesced = rdd.coalesce(Math.min(1000, rdd.partitions.length)) My purpose is to limit the number of partitions (later sortByKey always reported "too many

Re: Keep state inside map function

2014-07-30 Thread Tobias Pfeiffer
Hi, On Thu, Jul 31, 2014 at 2:23 AM, Sean Owen wrote: > > ... you can run setup code before mapping a bunch of records, and > after, like so: > > rdd.mapPartitions { partition => >// Some setup code here >partition.map(yourfunction) >// Some cleanup code here > } > Please be careful

Re: Converting matrix format

2014-07-30 Thread Chengi Liu
Thanks. What if it's a big matrix, like billions of rows and millions of columns? On Wednesday, July 30, 2014, Davies Liu wrote: > It will depend on the size of your matrix. If it can fit in memory, > then you can > > sparse = sparse_matrix(matrix) # sparse_matrix is the function you had > written > sc.par

Re: Deploying spark applications from within Eclipse?

2014-07-30 Thread Tobias Pfeiffer
Hi, if your Spark driver runs in Eclipse, you should be able to get what you want. That is, create a main() function that creates a SparkContext with the correct master URL of your cluster and execute it with your standard Eclipse method. For example, I am using Mesos and the following works fine:
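
A minimal sketch of that approach (the master URL and jar path below are placeholders, not values from this thread):

    import org.apache.spark.{SparkConf, SparkContext}

    object EclipseDriver {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("eclipse-driver-example")
          .setMaster("spark://your-master:7077")        // your cluster's master URL
          .setJars(Seq("target/my-app-assembly.jar"))   // jar containing your job classes
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 100).count())       // output appears in the Eclipse console
        sc.stop()
      }
    }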

Re: How true is this about spark streaming?

2014-07-30 Thread Tathagata Das
Hi Rohit, Can you point which thread has this statement? Maybe the additional context would help us disambiguate the original idea. TD On Tue, Jul 29, 2014 at 6:15 PM, Tobias Pfeiffer wrote: > Hi, > > that quoted statement doesn't make too much sense for me, either. Maybe if > you had a link fo

Re: How do you debug a PythonException?

2014-07-30 Thread Davies Liu
The exception in Python means that the worker tried to read a command from the JVM, but it reached the end of the socket (the socket had been closed). So it's possible that another exception happened in the JVM. Could you change the log4j log level, then check whether there is any problem inside the JVM? Davies On Wed,

Re: Converting matrix format

2014-07-30 Thread Davies Liu
It will depend on the size of your matrix. If it can fit in memory, then you can sparse = sparse_matrix(matrix) # sparse_matrix is the function you had written sc.parallelize(sparse, NUM_OF_PARTITIONS) On Tue, Jul 29, 2014 at 11:39 PM, Chengi Liu wrote: > Hi, > I have an rdd with n rows and

Re: Graphx : Perfomance comparison over cluster

2014-07-30 Thread Ankur Dave
ShreyanshB writes: >> The version with in-memory shuffle is here: >> https://github.com/amplab/graphx2/commits/vldb. > > It'd be great if you can tell me how to configure and invoke this spark > version. Sorry for the delay on this. Assuming you're planning to launch an EC2 cluster, here's how t

Re: Number of partitions and Number of concurrent tasks

2014-07-30 Thread Darin McBeath
Thanks. So to make sure I understand: since I'm using a 'stand-alone' cluster, I would set SPARK_WORKER_INSTANCES to something like 2 (instead of the default value of 1). Is that correct? But it also sounds like I need to explicitly set a value for SPARK_WORKER_CORES (based on what the do

Deploying spark applications from within Eclipse?

2014-07-30 Thread nunarob
Hi all, I'm a new Spark user and I'm interested in deploying Java-based Spark applications directly from Eclipse. Specifically, I'd like to write a Spark/Java application, have it directly submitted to a Spark cluster, and then see the output of the Spark execution directly in the Eclipse console.

Spark Deployment Patterns - Automated Deployment & Performance Testing

2014-07-30 Thread nightwolf
Hi all, We are developing an application which uses Spark & Hive to do static and ad-hoc reporting. For these static reports, they take a number of parameters and then run over a data set. We would like to make it easier to test performance of these reports on a cluster. If we have a test cluster

Re: Getting the number of slaves

2014-07-30 Thread Sung Hwan Chung
Is there a more fixed way of doing this? E.g., if I submit a job with 10 executors, I want to see 10 all the time, and not a fluctuating number based on currently available executors. In a tight cluster with lots of jobs running, I can see that this number goes up slowly and even down (when an exe

Re: Data from Mysql using JdbcRDD

2014-07-30 Thread Josh Mahonin
Hi Srini, I believe the JdbcRDD requires input splits based on ranges within the query itself. As an example, you could adjust your query to something like: SELECT * FROM student_info WHERE id >= ? AND id <= ? Note that the values you've passed in '1, 20, 2' correspond to the lower bound index, u
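
A hedged sketch of the corrected call, assuming student_info has an integer id column and a second string column, and that sc, url, username and password are already defined:

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val mysqlRdd = new JdbcRDD(
      sc,
      () => {
        Class.forName("com.mysql.jdbc.Driver")
        DriverManager.getConnection(url, username, password)
      },
      "SELECT * FROM student_info WHERE id >= ? AND id <= ?",
      1, 20, 2,                                  // lower bound, upper bound, number of partitions
      rs => (rs.getInt("id"), rs.getString(2)))  // map each ResultSet row to a tuple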

A task gets stuck after following messages in the std error log:

2014-07-30 Thread Sung Hwan Chung
14/07/30 16:08:00 INFO Executor: Running task ID 1199 14/07/30 16:08:00 INFO BlockManager: Found block broadcast_0 locally 14/07/30 16:08:00 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/07/30 16:08:00 INFO BlockFetcherIterator$Basic

Spark fault tolerance after a executor failure.

2014-07-30 Thread Sung Hwan Chung
I sometimes see that after fully caching the data, if one of the executors fails for some reason, that portion of cache gets lost and does not get re-cached, even though there are plenty of resources. Is this a bug or a normal behavior (V1.0.1)?

Re: Data from Mysql using JdbcRDD

2014-07-30 Thread chaitu reddy
Kc On Jul 30, 2014 3:55 PM, "srinivas" wrote: > Hi, > I am trying to get data from mysql using JdbcRDD using code The table have > three columns > > val url = "jdbc:mysql://localhost:3306/studentdata" > val username = "root" > val password = "root" > val mysqlrdd = new org.apache.sp

Data from Mysql using JdbcRDD

2014-07-30 Thread srinivas
Hi, I am trying to get data from mysql using JdbcRDD using code The table have three columns val url = "jdbc:mysql://localhost:3306/studentdata" val username = "root" val password = "root" val mysqlrdd = new org.apache.spark.rdd.JdbcRDD(sc,() => { Class.forName("com.mysql.jd

Re: Number of partitions and Number of concurrent tasks

2014-07-30 Thread Daniel Siegmann
This is correct behavior. Each "core" can execute exactly one task at a time, with each task corresponding to a partition. If your cluster only has 24 cores, you can only run at most 24 tasks at once. You could run multiple workers per node to get more executors. That would give you more cores in

Re: Logging in Spark through YARN.

2014-07-30 Thread Marcelo Vanzin
Hi Archit, Are you using spark-submit? If so, can you try adding the following to its command line: --files /dir/log4j.properties I tested that locally with the master branch and it works, but I don't have a 1.0.x install readily available for me to test at the moment. Handling of SPARK_LOG4J_

Re: Is there a way to write spark RDD to Avro files

2014-07-30 Thread Marcelo Vanzin
Hi Fengyun, Have you tried to use saveAsHadoopFile() (or saveAsNewAPIHadoopFile())? You should be able to do something with that API by using AvroKeyValueOutputFormat. The API is defined here: http://spark.apache.org/docs/1.0.0/api/scala/#org.apache.spark.rdd.PairRDDFunctions Lots of RDD types i
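
A rough, untested sketch of that idea using the key-only AvroKeyOutputFormat (AvroKeyValueOutputFormat works the same way with an additional value schema); `records`, `schema` and the output path are placeholders:

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.Job

    val job = new Job(sc.hadoopConfiguration)
    AvroJob.setOutputKeySchema(job, schema)       // the Avro writer schema
    records                                       // an RDD[GenericRecord]
      .map(r => (new AvroKey[GenericRecord](r), NullWritable.get()))
      .saveAsNewAPIHadoopFile(
        "hdfs:///tmp/avro-out",
        classOf[AvroKey[GenericRecord]],
        classOf[NullWritable],
        classOf[AvroKeyOutputFormat[GenericRecord]],
        job.getConfiguration)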

Installing Spark 1.0.1

2014-07-30 Thread cetaylor
Hello, I am attempting to install Spark 1.0.1 on a Windows machine but I've been running into some difficulties. When I attempt to run some examples I am always met with the same response: Failed to find Spark assembly JAR. You need to build Spark with sbt\sbt assembly before running this progra

Re: GraphX Connected Components

2014-07-30 Thread Ankur Dave
Jeffrey Picard writes: > I tried unpersisting the edges and vertices of the graph by hand, then > persisting the graph with persist(StorageLevel.MEMORY_AND_DISK). I still see > the same behavior in connected components however, and the same thing you > described in the storage page. Unfortunately

Re: Spark SQL JDBC Connectivity

2014-07-30 Thread Michael Armbrust
Very cool. Glad you found a solution that works. On Wed, Jul 30, 2014 at 1:04 PM, Venkat Subramanian wrote: > For the time being, we decided to take a different route. We created a Rest > API layer in our app and allowed SQL query passing via the Rest. Internally > we pass that query to the Sp

Re: GraphX Connected Components

2014-07-30 Thread Jeffrey Picard
On Jul 30, 2014, at 5:18 AM, Ankur Dave wrote: > Jeffrey Picard writes: >> As the program runs I’m seeing each iteration take longer and longer to >> complete, this seems counter intuitive to me, especially since I am seeing >> the shuffle read/write amounts decrease with each iteration. I wo

Re: Spark SQL JDBC Connectivity

2014-07-30 Thread Venkat Subramanian
For the time being, we decided to take a different route. We created a Rest API layer in our app and allowed SQL query passing via the Rest. Internally we pass that query to the SparkSQL layer on the RDD and return back the results. With this Spark SQL is supported for our RDDs via this rest API no

Re: Spark Streaming Checkpoint: SparkContext is not serializable class

2014-07-30 Thread RodrigoB
Hi, I don't think you can do that. The code inside the foreach runs at the node level, and you're trying to create another RDD within the node's specific execution context. Try to load the text file before the streaming context on the driver app and use it later as a cached RDD in the following
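
A rough sketch of that suggestion (the stream, path and join key are hypothetical): load the side data once on the driver, cache it, and reuse it inside the streaming output operation instead of creating an RDD per record.

    // Build the lookup RDD once on the driver and cache it.
    val lookup = ssc.sparkContext.textFile("hdfs:///data/lookup.txt")
      .map(line => (line.split(",")(0), line))
      .cache()

    stream.foreachRDD { rdd =>
      // Reuse the cached RDD in every batch rather than re-reading the file on the nodes.
      val joined = rdd.map(x => (x, x)).join(lookup)
      joined.count()
    }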

Number of partitions and Number of concurrent tasks

2014-07-30 Thread Darin McBeath
I have a cluster with 3 nodes (each with 8 cores) using Spark 1.0.1. I have an RDD which I've repartitioned so it has 100 partitions (hoping to increase the parallelism). When I do a transformation (such as filter) on this RDD, I can't seem to get more than 24 tasks (my total number of cores a

Re: Decision Tree requires regression LabeledPoint

2014-07-30 Thread SK
I have also used labeledPoint or libSVM format (for sparse data) for DecisionTree. When I had categorical labels (not features), I mapped the categories to numerical data as part of the data transformation step (i.e. before creating the LabeledPoint). -- View this message in context: http://apa

Re: evaluating classification accuracy

2014-07-30 Thread SK
I am using 1.0.1 and I am running locally (I am not providing any master URL). But the zip() does not produce the correct count as I mentioned above. So not sure if the issue has been fixed in 1.0.1. However, instead of using zip, I am now using the code that Sean has mentioned and am getting the c

Re: spark.scheduler.pool seems not working in spark streaming

2014-07-30 Thread Tathagata Das
This is because setLocalProperty makes all Spark jobs submitted using the current thread belong to the set pool. However, in Spark Streaming, all the jobs are actually launched in the background from a different thread. So this setting does not work. However, there is a work around. If you are doi
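
For reference, the mechanism being discussed is a thread-local property, which is why jobs launched from Spark Streaming's internal thread don't pick it up; in a plain batch job it looks roughly like this:

    // Jobs submitted from this thread after the call run in "myPool".
    sc.setLocalProperty("spark.scheduler.pool", "myPool")
    sc.parallelize(1 to 100).count()
    // Clear the property so later jobs from this thread fall back to the default pool.
    sc.setLocalProperty("spark.scheduler.pool", null)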

Re: Implementing percentile through top Vs take

2014-07-30 Thread Sean Owen
No, it's definitely not done on the driver. It works as you say. Look at the source code for RDD.takeOrdered, which is what top calls. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1130 On Wed, Jul 30, 2014 at 7:07 PM, Bharath Ravi Kumar wrote: >
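
A small sketch of the distributed behavior described above (records is an assumed RDD of a hypothetical Rec type, not from the thread): each partition keeps only its own top candidates, and the driver merges those bounded arrays instead of sorting the whole data set.

    case class Rec(id: String, rank: Double)
    implicit val byRank: Ordering[Rec] = Ordering.by[Rec, Double](_.rank)

    // Largest 100 by rank; only up to 100 elements per partition ever reach the driver.
    val top100 = records.top(100)
    // Equivalent via takeOrdered with the reversed ordering.
    val same = records.takeOrdered(100)(byRank.reverse)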

Partioner to process data in the same order for each key

2014-07-30 Thread Venkat Subramanian
I have a data file that I need to process using Spark . The file has multiple events for different users and I need to process the events for each user in the order it is in the file. User 1 : Event 1 User 2: Event 1 User 1 : Event 2 User 3: Event 1 User 2: Event 2 User 3: Event 2 etc.. I want t

Re: Do I need to know Scala to take full advantage of spark?

2014-07-30 Thread Matei Zaharia
Java is very close to Scala across the board, the only thing missing in it right now is GraphX (which is still alpha). Python is missing GraphX, streaming and a few of the ML algorithms, though most of them are there. So it should be fine to start with any of them. See http://spark.apache.org/

Re: Keep state inside map function

2014-07-30 Thread Kevin
Thanks to both of you for your inputs. Looks like I'll play with the mapPartitions function to start porting MapReduce algorithms to Spark. On Wed, Jul 30, 2014 at 1:23 PM, Sean Owen wrote: > Really, the analog of a Mapper is not map(), but mapPartitions(). Instead > of: > > rdd.map(yourFun

Do I need to know Scala to take full advantage of spark?

2014-07-30 Thread Majid Azimi
Hi guys, I'm very new in Spark land, coming from the old-school MapReduce world. I have no idea about Scala. Can the Java/Python APIs compete with the native Scala API? Is Spark heavily Scala-centric, with bindings for other languages only for starting up and testing, while serious work in Spark will re

Implementing percentile through top Vs take

2014-07-30 Thread Bharath Ravi Kumar
I'm looking to select the top n records (by rank) from a data set of a few hundred GB's. My understanding is that JavaRDD.top(n, comparator) is entirely a driver-side operation in that all records are sorted in the driver's memory. I prefer an approach where the records are sorted on the cluster an

Re: Worker logs

2014-07-30 Thread Andrew Or
They are found in the executors' logs (not the worker's). In general, all code inside foreach or map etc. are executed on the executors. You can find these either through the Master UI (under Running Applications) or manually on the worker machines (under $SPARK_HOME/work). -Andrew 2014-07-30 10

Re: the EC2 setup script often will not allow me to SSH into my machines. Ideas?

2014-07-30 Thread Zongheng Yang
To add to this: for this many (>= 20) machines I usually use at least --wait 600. On Wed, Jul 30, 2014 at 9:10 AM, Nicholas Chammas wrote: > William, > > The error you are seeing is misleading. There is no need to terminate the > cluster and start over. > > Just re-run your launch command, but wi

Re: Keep state inside map function

2014-07-30 Thread Sean Owen
Really, the analog of a Mapper is not map(), but mapPartitions(). Instead of: rdd.map(yourFunction) ... you can run setup code before mapping a bunch of records, and after, like so: rdd.mapPartitions { partition => // Some setup code here partition.map(yourfunction) // Some cleanup code
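
A self-contained sketch of per-partition state in that style (the running counter is just an assumed illustration, not from the thread), assuming an existing SparkContext `sc`:

    val rdd = sc.parallelize(1 to 10, 2)
    val tagged = rdd.mapPartitionsWithIndex { (partitionId, iter) =>
      var seen = 0L                    // state held for the lifetime of this partition's task
      iter.map { x =>
        seen += 1
        (partitionId, seen, x)         // emit the running per-partition count with each value
      }
    }
    tagged.collect().foreach(println)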

Re: Keep state inside map function

2014-07-30 Thread aaronjosephs
use mapPartitions to get the equivalent functionality to hadoop -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Keep-state-inside-map-function-tp10968p10969.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Keep state inside map function

2014-07-30 Thread Kevin
Hi, Is it possible to maintain state inside a Spark map function? With Hadoop MapReduce, Mappers and Reducers are classes that can have their own state using instance variables. Can this be done with Spark? Are there any examples? Most examples I have seen do a simple operation on the value passe

Worker logs

2014-07-30 Thread Ruchir Jha
I have a very simple spark-job which has some System.outs in a rdd.forEach() and they don't show up in the spark-submit logs. Where or which file are these being written to?

Re: How do you debug a PythonException?

2014-07-30 Thread Nicholas Chammas
Any clues? This looks like a bug, but I can't report it without more precise information. On Tue, Jul 29, 2014 at 9:56 PM, Nick Chammas wrote: > I’m in the PySpark shell and I’m trying to do this: > > a = > sc.textFile('s3n://path-to-handful-of-very-large-files-totalling-1tb/*.json', > minPar

Re: the EC2 setup script often will not allow me to SSH into my machines. Ideas?

2014-07-30 Thread Nicholas Chammas
William, The error you are seeing is misleading. There is no need to terminate the cluster and start over. Just re-run your launch command, but with the additional --resume option tacked on the end. As Akhil explained, this happens because AWS is not starting up the instances as quickly as the s

Re: the EC2 setup script often will not allow me to SSH into my machines. Ideas?

2014-07-30 Thread Akhil Das
You need to increase the wait time, (-w) the default is 120 seconds, you may set it to a higher number like 300-400. The problem is that EC2 takes some time to initiate the machine (which is > 120 seconds sometimes.) Thanks Best Regards On Wed, Jul 30, 2014 at 8:52 PM, William Cox wrote: > *TL

Re: why a machine learning application run slowly on the spark cluster

2014-07-30 Thread Xiangrui Meng
It looks reasonable. You can also try the treeAggregate ( https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L89) instead of normal aggregate if the driver needs to collect a large weight vector from each partition. -Xiangrui On Wed, Jul

Re: Is there a way to write spark RDD to Avro files

2014-07-30 Thread Lewis John Mcgibbney
Hi, Have you checked out SchemaRDD? There should be an example of writing to Parquet files there. BTW, FYI I was discussing this with the SparkSQL developers last week and possibly using Apache Gora [0] for achieving this. HTH Lewis [0] http://gora.apache.org On Wed, Jul 30, 2014 at 5:14 AM, Fen
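
A minimal Spark 1.0-style sketch of the SchemaRDD/Parquet route mentioned above (the case class and output path are placeholders):

    import org.apache.spark.sql.SQLContext

    case class Record(id: Int, name: String)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit conversion from an RDD of case classes to SchemaRDD

    val records = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
    records.saveAsParquetFile("hdfs:///tmp/records.parquet")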

Re: Streaming on different store types

2014-07-30 Thread Jörn Franke
Hello, I fear you have to write your own transaction logic for it (coordination, e.g. via ZooKeeper, a transaction log, or, depending on your requirements, Raft/Paxos etc.). However, before you embark on this journey, ask yourself whether your application really needs it and what data load you expect.

the EC2 setup script often will not allow me to SSH into my machines. Ideas?

2014-07-30 Thread William Cox
*TL;DR >50% of the time I can't SSH into either my master or slave nodes and have to terminate all the machines and restart the EC2 cluster setup process.* Hello, I'm trying to setup a Spark cluster on Amazon EC2. I am finding the setup script to be delicate and unpredictable in terms of reliably

Re: Spark 0.9.1 - saveAsTextFile() exception: _temporary doesn't exist!

2014-07-30 Thread Andrew Ash
Hi Oleg, Did you ever figure this out? I'm observing the same exception also in 0.9.1 and think it might be related to setting spark.speculation=true. My theory is that multiple attempts at the same task start, the first finishes and cleans up the _temporary directory, and then the second fails

Streaming on different store types

2014-07-30 Thread Flavio Pompermaier
Hi everybody, I have a scenario where I would like to stream data to different persistency types (i.e. SQL db, graph db, HDFS, etc.) and perform some filtering and transformation as the data comes in. The problem is to maintain consistency between all datastores (maybe some operation could fail)

Re: Avro Schema + GenericRecord to HadoopRDD

2014-07-30 Thread Laird, Benjamin
That makes sense, thanks Chris. I'm currently reworking my code to use the newAPIHadoopRDD with an AvroSequenceFileInputFormat (see below), but I think I'll run into the same issue. I'll give your suggestion a try. val avroRdd = sc.newAPIHadoopFile(fp, classOf[AvroSequenceFileInputFormat[AvroKey[

Re: How to specify the job to run on the specific nodes(machines) in the hadoop yarn cluster?

2014-07-30 Thread Steve Nunez
This is a common request. Unfortunately, AFAIK, you can't do it yet. Once labels (YARN-796) are out we should see this capability and be able to 'pin' jobs to labels. If anyone figures out a way to do this in the meantime, I'd love to hear about it

RE: Example standalone app error!

2014-07-30 Thread Alex Minnaar
Hi Andrew, I'm not sure why an assembly would help (I don't have the Spark source code, I have just included Spark core and Spark streaming in my dependencies in my build file). I did try it though and the error is still occurring. I have tried cleaning and refreshing SBT many times as well.

Initialize custom serializer on YARN

2014-07-30 Thread Anthony F
I am creating an API that can access data stored using an Avro schema. The API can only know the Avro schema at runtime, when it is passed as a param by a user of the API. I need to initialize a custom serializer with the Avro schema on remote worker and driver processes. I've tried to set the sch

Re: GraphX Pragel implementation

2014-07-30 Thread Arun Kumar
Hello Ankur, For my implementation to work, the vprog function, which is responsible for handling incoming messages, and the sendMsg function should be aware of which superstep they are in. Is it possible to pass superstep information to these methods? Can you throw some light on how to approach t

Re: How to submit Pyspark job in mesos?

2014-07-30 Thread daijia
:240] Master ID: 20140730-165621-1526966464-5050-23977 Hostname: CentOS-19 I0730 16:56:21.127964 23977 master.cpp:322] Master started on 192.168.3.91:5050 I0730 16:56:21.127989 23977 master.cpp:332] Master allowing unauthenticated frameworks to register!! I0730 16:56:21.130705 23979 master.cpp:757] The

Re: the pregel operator of graphx throws NullPointerException

2014-07-30 Thread Denis RP
spark-submit actually works, thanks for the reply! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/the-pregel-operator-of-graphx-throws-NullPointerException-tp10865p10949.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: GraphX Connected Components

2014-07-30 Thread Ankur Dave
Jeffrey Picard writes: > As the program runs I’m seeing each iteration take longer and longer to > complete, this seems counter intuitive to me, especially since I am seeing > the shuffle read/write amounts decrease with each iteration. I would think > that as more and more vertices converged t

Is there a way to write spark RDD to Avro files

2014-07-30 Thread Fengyun RAO
We used MapReduce for ETL and stored results in Avro files, which are loaded into Hive/Impala for query. Now we are trying to migrate to Spark, but didn't find a way to write the resulting RDD to Avro files. I wonder if there is a way to do it, or if not, why Spark doesn't support Avro as well as ma

Re: Debugging "Task not serializable"

2014-07-30 Thread Sourav Chandra
While running the application, set this: -Dsun.io.serialization.extendedDebugInfo=true This is applicable to Java versions after 1.6. On Wed, Jul 30, 2014 at 2:13 PM, Juan Rodríguez Hortalá < juan.rodriguez.hort...@gmail.com> wrote: > Akhil, Andry, thanks a lot for your suggestions. I will take a look to >
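
To get the same flag onto the executors as well, a hedged sketch using a standard Spark configuration property (the driver JVM still needs the -D option on its own command line):

    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-Dsun.io.serialization.extendedDebugInfo=true")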

spark.scheduler.pool seems not working in spark streaming

2014-07-30 Thread liuwei
In my spark streaming program, I set scheduler pool, just as follows: val myFairSchedulerFile = "xxx.xml" val myStreamingPool = "xxx" System.setProperty("spark.scheduler.allocation.file", myFairSchedulerFile) val conf = new SparkConf() val ssc = new StreamingContext(conf, batchInterval) ssc.spark

Re: Debugging "Task not serializable"

2014-07-30 Thread Juan Rodríguez Hortalá
Akhil, Andry, thanks a lot for your suggestions. I will take a look to those JVM options. Greetings, Juan 2014-07-28 18:56 GMT+02:00 andy petrella : > Also check the guides for the JVM option that prints messages for such > problems. > Sorry, sent from phone and don't know it by heart :/ > Le

Re: spark.shuffle.consolidateFiles seems not working

2014-07-30 Thread Larry Xiao
Hi Jianshi, I've met a similar situation before. And my solution was 'ulimit'; you can use -a to see your current settings -n to set the open files limit (and other limits also) And I set -n to 10240. I see spark.shuffle.consolidateFiles helps by reusing open files. (so I don't know to what extent d

Re: why a machine learning application run slowly on the spark cluster

2014-07-30 Thread Tan Tim
I modified the code from lines.map(parsePoint).persist(StorageLevel.MEMORY_ONLY) to lines.map(parsePoint).repartition(64).persist(StorageLevel.MEMORY_ONLY) Every stage now runs fast, about 30 seconds (reduced from 3.5 minutes). But I found the total number of tasks reduced from 200 to 64 after the first stage, just like thi

spark.shuffle.consolidateFiles seems not working

2014-07-30 Thread Jianshi Huang
I'm using Spark 1.0.1 on Yarn-Client mode. SortByKey always reports a FileNotFoundExceptions with messages says "too many open files". I already set spark.shuffle.consolidateFiles to true: conf.set("spark.shuffle.consolidateFiles", "true") But it seems not working. What are the other possible

Re: Spark Streaming Json file groupby function

2014-07-30 Thread lalit1303
you can try repartition/coalesce and make the final RDD into a single partition before saveAsTextFile. This should bring the content of the whole RDD into a single part- - Lalit Yadav la...@sigmoidanalytics.com -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.c

Re: RDD Cleanup

2014-07-30 Thread nightwolf
Hi premdass, Where did you set spark.cleaner.referenceTracking = true/false? Was this in your job-server conf? Cheers. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cleanup-tp9182p10939.html Sent from the Apache Spark User List mailing list archiv

NotSerializableException

2014-07-30 Thread Ron Gonzalez
Hi, I took avro 1.7.7 and recompiled my distribution to be able to fix the issue when dealing with avro GenericRecord. The issue I got was resolved. I'm referring to AVRO-1476. I also enabled kryo registration in SparkConf. That said, I am still seeing a NotSerializableException for Schema
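
For reference, a hedged sketch of the Spark 1.0-era way to enable Kryo with a registrator (the registrator class name is a placeholder):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.MyKryoRegistrator")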

Re: Is it possible to read file head in each partition?

2014-07-30 Thread Fengyun RAO
of course we can filter them out. A typical file head is as below: #Software: Microsoft Internet Information Services 7.5 #Version: 1.0 #Date: 2013-07-04 20:00:00 #Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status

Spark & Ooyala Job Server

2014-07-30 Thread nightwolf
Hi all, I'm trying to get the jobserver working with Spark 1.0.1. I've got it building, tests passing and it connects to my Spark master (e.g. spark://hadoop-001:7077). I can also pre-create contexts. These show up in the Spark master console i.e. on hadoop-001:8080 The problem is that after I c

Spark Streaming : CassandraRDD not getting refreshed with new rows in column family

2014-07-30 Thread Praful CJ
Hi, I have the following Spark Streaming application which streams from a Kafka topic, does some processing, and publishes the result to another topic. In between, I am reading records from a Cassandra CF. The issue is that when the application is running, if a new row is inserted into the Cassandra CF, that n