Re: Spark on EMR with S3 example (Python)

2015-07-14 Thread Akhil Das
I think any requests going to s3*:// require the credentials. If they have made it public (via http) then you won't require the keys. Thanks Best Regards On Wed, Jul 15, 2015 at 2:26 AM, Pagliari, Roberto wrote: > Hi Sujit, > > I just wanted to access public datasets on Amazon. Do I still need

Re: MLlib LogisticRegressionWithLBFGS error

2015-07-14 Thread Vi Ngo Van
This is a LibSVM format. I can use this data with the libsvm library. In this sample, they are not sorted. I will sort them and try it again. Thank you, On Wed, Jul 15, 2015 at 1:47 PM, Burak Yavuz wrote: > Hi, > > Is this in LibSVM format? If so, the indices should be sorted in > increasing order.

Re: MLlib LogisticRegressionWithLBFGS error

2015-07-14 Thread Burak Yavuz
Hi, Is this in LibSVM format? If so, the indices should be sorted in increasing order. It seems like they are not sorted. Best, Burak On Tue, Jul 14, 2015 at 7:31 PM, Vi Ngo Van wrote: > Hi All, > I've met a issue with MLlib when i use LogisticRegressionWithLBFGS > > my sample data : > > *0 86
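
A minimal sketch of the fix Burak describes, assuming the data is parsed by hand rather than with MLUtils.loadLibSVMFile; the feature count and 1-based indexing are assumptions about the input:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical parser for one LibSVM-style line ("label idx:val idx:val ...").
// The (index, value) pairs are sorted so the sparse vector is built with
// strictly increasing indices, as the thread says LogisticRegressionWithLBFGS expects.
def parseLine(line: String, numFeatures: Int): LabeledPoint = {
  val parts = line.trim.split("\\s+")
  val label = parts.head.toDouble
  val features = parts.tail.map { item =>
    val Array(idx, value) = item.split(":")
    (idx.toInt - 1, value.toDouble)   // assumes 1-based LibSVM indices
  }.sortBy(_._1)                      // sort indices into increasing order
  LabeledPoint(label, Vectors.sparse(numFeatures, features))
}
```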

Re: creating a distributed index

2015-07-14 Thread Burak Yavuz
Hi Swetha, IndexedRDD is available as a package on Spark Packages . Best, Burak On Tue, Jul 14, 2015 at 5:23 PM, swetha wrote: > Hi Ankur, > > Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark > Streaming to do

Re: Strange behavior of pyspark with --jars option

2015-07-14 Thread Burak Yavuz
Hi, I believe the HiveContext uses a different class loader. It then falls back to the system class loader if it can't find the classes in the context class loader. The system class loader contains the classpath passed through --driver-class-path and spark.executor.extraClassPath. The JVM is alread

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread Deepak Jain
Leftouterjoin and join apis are super slow in spark. 100x slower than hadoop Sent from my iPhone > On 14-Jul-2015, at 10:59 PM, Wush Wu wrote: > > I don't understand. > > By the way, the `joinWithCassandraTable` does improve my query time > from 40 mins to 3 mins. > > > 2015-07-15 13:19 GMT

Re: Research ideas using spark

2015-07-14 Thread Akhil Das
Try to repartition it to a higher number (at least 3-4 times the total # of cpu cores). What operation are you doing? It may happen that if you are doing a join/groupBy sort of operation that task which is taking time is having all the values, in that case you need to use a Partitioner which will e
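
A one-line sketch of the repartitioning suggestion, assuming `sc` and an existing `rdd`; the multiplier of 4 is just the rule of thumb from the reply:

```scala
// Bump the partition count to a few times the total number of cores before
// the expensive stage, so work is spread more evenly across the cluster.
val repartitioned = rdd.repartition(sc.defaultParallelism * 4)
```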

Strange behavior of pyspark with --jars option

2015-07-14 Thread gen tang
Hi, I met some interesting problems with the --jars option. As I use the third-party dependency elasticsearch-spark, I pass this jar with the following command: ./bin/spark-submit --jars path-to-dependencies ... It works well. However, if I use HiveContext.sql, spark will lose the dependencies that

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread Wush Wu
I don't understand. By the way, the `joinWithCassandraTable` does improve my query time from 40 mins to 3 mins. 2015-07-15 13:19 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) : > I have explored spark joins for last few months (you can search my posts) > and its frustrating useless. > > On Tue, Jul 14, 2015 at 9:35 P

Re: Java 8 vs Scala

2015-07-14 Thread Tristan Blakers
We have had excellent results operating on RDDs using Java 8 with Lambdas. It’s slightly more verbose than Scala, but I haven’t found this an issue, and haven’t missed any functionality. The new DataFrame API makes the Spark platform even more language agnostic. Tristan On 15 July 2015 at 06:40,

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread ๏̯͡๏
I have explored spark joins for the last few months (you can search my posts) and it's frustratingly useless. On Tue, Jul 14, 2015 at 9:35 PM, Wush Wu wrote: > Dear all, > > I have found a post discussing the same thing: > > https://groups.google.com/a/lists.datastax.com/forum/#!searchin/spark-connecto

spark cache issue while doing saveAsTextFile and saveAsParquetFile

2015-07-14 Thread mathewvinoj
Hi There, I am using cache and mapPartitions to do some processing and cache the result as below. I am storing the file in both formats (Parquet and text file), and recomputation is happening both times. Even though I put the cache, it's not working as expected. Below is the code snippet. Any help is really

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread Wush Wu
Dear all, I have found a post discussing the same thing: https://groups.google.com/a/lists.datastax.com/forum/#!searchin/spark-connector-user/join/spark-connector-user/q3GotS-n0Wk/g-LPTteCEg0J The solution is using "joinWithCassandraTable" and the documentation is here: https://github.com/datast
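
A hedged sketch of that approach using the spark-cassandra-connector API from the links above; the keyspace, table, and key column names are placeholders, and the small RDD's elements are assumed to map onto the table's partition key:

```scala
import com.datastax.spark.connector._

// Hypothetical key type matching the Cassandra table's partition key column.
case class Key(id: String)

// The ~33000-record side stays as a plain RDD...
val smallRdd = sc.textFile("hdfs:///path/to/small/input")
  .map(line => Key(line.split(",")(0)))

// ...and instead of a leftOuterJoin against ~3 billion Cassandra rows,
// only the matching partitions are fetched from Cassandra directly.
val joined = smallRdd.joinWithCassandraTable("my_keyspace", "my_table")
joined.take(10).foreach(println)
```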

Re: SparkSQL 1.4 can't accept registration of UDF?

2015-07-14 Thread Okehee Goh
The command "list jar" doesn't seem accepted in beeline with Spark's ThriftServer in both Spark 1.3.1 and Spark1.4. 0: jdbc:hive2://localhost:1> list jar; Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 'list' 'jar' ''; line 1 pos 0 (state=,code=0) Thanks On Tue,

Re: Ordering of Batches in Spark streaming

2015-07-14 Thread Tathagata Das
This has been discussed in a number of threads in this mailing list. Here is a summary. 1. Processing of batch T+1 always starts after all the processing of batch T has completed. But here a "batch" is defined by the data that all the receivers running in the system receive within the batch interv

Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread Wush Wu
Dear all, I am trying to join two RDDs, named rdd1 and rdd2. rdd1 is loaded from a textfile with about 33000 records. rdd2 is loaded from a table in cassandra which has about 3 billions records. I tried the following code: ```scala val rdd1 : (String, XXX) = sc.textFile(...).map(...) import o

Re: fileStream with old files

2015-07-14 Thread Tathagata Das
It was added, but its not documented publicly. I am planning to change the name of the conf to spark.streaming.fileStream.minRememberDuration to make it easier to understand On Mon, Jul 13, 2015 at 9:43 PM, Terry Hole wrote: > A new configuration named *spark.streaming.minRememberDuration* was a
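
A hedged sketch of setting the property Terry mentions so fileStream also remembers older files; the value and its "…s" format are assumptions, and the property is undocumented as TD notes:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Raise the (undocumented) remember duration so files older than the default
// window are still picked up by the file stream. "86400s" (one day) is an
// assumed example value.
val conf = new SparkConf()
  .setAppName("file-stream-old-files")
  .set("spark.streaming.minRememberDuration", "86400s")
val ssc = new StreamingContext(conf, Seconds(30))
val lines = ssc.textFileStream("hdfs:///path/to/incoming")
lines.print()
ssc.start()
ssc.awaitTermination()
```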

Re: Spark Intro

2015-07-14 Thread vinod kumar
Thank you Hafsa On Tue, Jul 14, 2015 at 11:09 AM, Hafsa Asif wrote: > Hi, > I was also in the same situation as we were using MySQL. Let me give some > clearfications: > 1. Spark provides a great methodology for big data analysis. So, if you > want to make your system more analytical and want de

Re: SparkSQL 1.4 can't accept registration of UDF?

2015-07-14 Thread prosp4300
What's the result of "list jar" in both 1.3.1 and 1.4.0, please check if there is any difference At 2015-07-15 08:10:44, "ogoh" wrote: >Hello, >I am using SparkSQL along with ThriftServer so that we can access using Hive >queries. >With Spark 1.3.1, I can register UDF function. But, Spark

Re: Stopping StreamingContext before receiver has started

2015-07-14 Thread Tathagata Das
This is a known race condition - root cause of SPARK-5681 On Mon, Jul 13, 2015 at 3:35 AM, Juan Rodríguez Hortalá < juan.rodriguez.hort...@gmail.com> wrote: > Hi, > > I have noticed that when StreamingContext.stop is called when no receiver > has

Using reference for RDD is safe?

2015-07-14 Thread Abarah
Hello, I am wondering what will happen if I use a reference for transforming rdd, for example: def func1(rdd: RDD[Int]): RDD[Int] = { rdd.map(x => x * 2) // example transformation, but I am using a more complex function } def main() { . val myrdd = sc.parallelize(1 to 100) va

Re: spark streaming with kafka reset offset

2015-07-14 Thread Tathagata Das
Of course, exactly-once receiving is not the same as exactly once. In the case of the direct Kafka stream, the data may actually be pulled multiple times. But even if the data of a batch is pulled twice because of some failure, the final result (that is, transformed data accessed through foreachRDD) will always

Re: Spark Streaming - Inserting into Tables

2015-07-14 Thread Tathagata Das
Why is .remember not ideal? On Sun, Jul 12, 2015 at 7:22 PM, Brandon White wrote: > Hi Yin, > > Yes there were no new rows. I fixed it by doing a .remember on the > context. Obviously, this is not ideal. > > On Sun, Jul 12, 2015 at 6:31 PM, Yin Huai wrote: > >> Hi Brandon, >> >> Can you explai

Re: spark streaming with kafka reset offset

2015-07-14 Thread Chen Song
Thanks TD. As for 1), if timing is not guaranteed, how are exactly-once semantics supported? It feels like exactly-once receiving is not necessarily exactly-once processing. Chen On Tue, Jul 14, 2015 at 10:16 PM, Tathagata Das wrote: > > > On Tue, Jul 14, 2015 at 6:42 PM, Chen Song wrote: >

Re: rest on streaming

2015-07-14 Thread Chen Song
Thanks TD, that is very useful. On Tue, Jul 14, 2015 at 10:19 PM, Tathagata Das wrote: > You can do this. > > // global variable to keep track of latest stuff > var latestTime = _ > var latestRDD = _ > > > dstream.foreachRDD((rdd: RDD[..], time: Time) => { > latestTime = time > latestRDD

MLlib LogisticRegressionWithLBFGS error

2015-07-14 Thread Vi Ngo Van
Hi All, I've met an issue with MLlib when I use LogisticRegressionWithLBFGS. My sample data: *0 863:1 40646:1 37697:1 1423:1 38648:1 4230:1 23823:1 41594:1 27614:1 5689:1 18493:1 44187:1 5694:1 27799:1 12010:1* *0 863:1 40646:1 37697:1 1423:1 38648:1 4230:1 23823:1 41594:1 27614:1 5689:1 18493:1 4

Re: rest on streaming

2015-07-14 Thread Tathagata Das
You can do this. // global variable to keep track of latest stuff var latestTime = _ var latestRDD = _ dstream.foreachRDD((rdd: RDD[..], time: Time) => { latestTime = time latestRDD = rdd }) Now you can asynchronously access the latest RDD. However if you are going to run jobs on the la
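
A slightly fleshed-out, hedged version of TD's sketch; `dstream` is assumed to exist, and the element type and the way the request handler is wired up are placeholders:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

// Global references to the most recent batch, updated from the streaming job.
@volatile var latestTime: Time = null
@volatile var latestRDD: RDD[(String, Long)] = null   // element type is a placeholder

dstream.foreachRDD { (rdd: RDD[(String, Long)], time: Time) =>
  rdd.cache()                                   // so asynchronous jobs don't recompute it
  if (latestRDD != null) latestRDD.unpersist()  // release the previous batch (simplified)
  latestTime = time
  latestRDD = rdd
}

// Called from, e.g., an HTTP handler thread to query the latest state.
def handleRequest(): Array[(String, Long)] =
  if (latestRDD != null) latestRDD.take(10) else Array.empty
```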

Re: spark streaming with kafka reset offset

2015-07-14 Thread Tathagata Das
On Tue, Jul 14, 2015 at 6:42 PM, Chen Song wrote: > Thanks TD and Cody. I saw that. > > 1. By doing that (foreachRDD), does KafkaDStream checkpoints its offsets > on HDFS at the end of each batch interval? > The timing is not guaranteed. > 2. In the code, if I first apply transformations and a

rest on streaming

2015-07-14 Thread Chen Song
I have been doing a POC of adding a REST service in a Spark Streaming job. Say I create a stateful DStream X by using updateStateByKey, and each time there is a HTTP request, I want to apply some transformations/actions on the latest RDD of X and collect the results immediately but not scheduled by streaming

Re: spark streaming with kafka reset offset

2015-07-14 Thread Chen Song
Thanks TD and Cody. I saw that. 1. By doing that (foreachRDD), does KafkaDStream checkpoints its offsets on HDFS at the end of each batch interval? 2. In the code, if I first apply transformations and actions on the directKafkaStream and then use foreachRDD on the original KafkaDStream to commit o

RE: How do you access a cached Spark SQL Table from a JDBC connection?

2015-07-14 Thread Cheng, Hao
So you're using different HiveContext instances for the caching. We are not expected to see tables cached with the other HiveContext instance. From: Brandon White [mailto:bwwintheho...@gmail.com] Sent: Wednesday, July 15, 2015 8:48 AM To: Cheng, Hao Cc: user Subject: Re: How do you acc

Re: Is IndexedRDD available in Spark 1.4.0?

2015-07-14 Thread Ted Yu
bq. that is, key-value stores Please consider HBase for this purpose :-) On Tue, Jul 14, 2015 at 5:55 PM, Tathagata Das wrote: > I do not recommend using IndexRDD for state management in Spark Streaming. > What it does not solve out-of-the-box is checkpointing of indexRDDs, which > important be

Re: Is IndexedRDD available in Spark 1.4.0?

2015-07-14 Thread Tathagata Das
I do not recommend using IndexRDD for state management in Spark Streaming. What it does not solve out-of-the-box is checkpointing of indexRDDs, which is important because long-running streaming jobs can lead to an infinite chain of RDDs. Spark Streaming solves it for the updateStateByKey operation which

Re: How do you access a cached Spark SQL Table from a JDBC connection?

2015-07-14 Thread Brandon White
I cache the table with hiveContext.cacheTable("tableName") On Tue, Jul 14, 2015 at 5:43 PM, Cheng, Hao wrote: > Can you describe how did you cache the tables? In another HiveContext? > AFAIK, cached table only be visible within the same HiveContext, you > probably need to execute the sql query

Re: Is IndexedRDD available in Spark 1.4.0?

2015-07-14 Thread Ted Yu
Please take a look at SPARK-2365 which is in progress. On Tue, Jul 14, 2015 at 5:18 PM, swetha wrote: > Hi, > > Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark > Streaming to do lookups/updates/deletes in RDDs using keys by storing them > as key/value pairs. > > Thanks

RE: How do you access a cached Spark SQL Table from a JDBC connection?

2015-07-14 Thread Cheng, Hao
Can you describe how you cached the tables? In another HiveContext? AFAIK, a cached table is only visible within the same HiveContext; you probably need to execute a SQL query like "CACHE TABLE mytable AS SELECT xxx" in the JDBC connection as well. Cheng Hao From: Brandon White [mailto:bwwinthe
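
A hedged sketch of issuing that statement programmatically over the same Thrift-server connection (the Beeline equivalent would be the same SQL); the URL, credentials, and table names are placeholders:

```scala
import java.sql.DriverManager

// Connect to the Spark SQL Thrift server with the Hive JDBC driver and cache
// the table inside *its* HiveContext, so later JDBC queries hit the cache.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
val stmt = conn.createStatement()
stmt.execute("CACHE TABLE my_cached_table AS SELECT * FROM some_hive_table")

val rs = stmt.executeQuery("SELECT COUNT(*) FROM my_cached_table")
while (rs.next()) println(rs.getLong(1))
conn.close()
```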

How do you access a cached Spark SQL Table from a JDBC connection?

2015-07-14 Thread Brandon White
Hello there, I have a JDBC connection set up to my Spark cluster but I cannot see the tables that I cache in memory. The only tables I can see are those that are in my Hive instance. I use a HiveContext to register a table and cache it in memory. How can I enable my JDBC connection to query this in

Re: creating a distributed index

2015-07-14 Thread swetha
Hi Ankur, Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark Streaming to do lookups/updates/deletes in RDDs using keys by storing them as key/value pairs. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/cr

Sorted Multiple Outputs

2015-07-14 Thread Yiannis Gkoufas
Hi there, I have been using the approach described here: http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job In addition to that, I was wondering if there is a way to customize the order of those values contained in each file. Thanks a lot!

Is IndexedRDD available in Spark 1.4.0?

2015-07-14 Thread swetha
Hi, Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark Streaming to do lookups/updates/deletes in RDDs using keys by storing them as key/value pairs. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-IndexedRDD-avail

SparkSQL 1.4 can't accept registration of UDF?

2015-07-14 Thread ogoh
Hello, I am using SparkSQL along with ThriftServer so that we can access using Hive queries. With Spark 1.3.1, I can register UDF function. But, Spark 1.4.0 doesn't work for that. The jar of the udf is same. Below is logs: I appreciate any advice. == With Spark 1.4 Beeline version 1.4.0 by Apache

Re: DataFrame.withColumn() recomputes columns even after cache()

2015-07-14 Thread pnpritchard
I was able to workaround this by converting the DataFrame to an RDD and then back to DataFrame. This seems very weird to me, so any insight would be much appreciated! Thanks, Nick P.S. Here's the updated code with the workaround: ``` // Examples udf's that println when called val twice =
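
A minimal sketch of that workaround, assuming an existing `sqlContext` and a DataFrame `df` with UDF-derived columns:

```scala
// Rebuild the DataFrame from its RDD and schema; the UDF columns are then
// carried as plain data instead of being re-evaluated by later operations.
val materialized = sqlContext.createDataFrame(df.rdd, df.schema)
materialized.cache()
materialized.count()   // force one evaluation so the cache is populated
```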

Getting not implemented by the TFS FileSystem implementation

2015-07-14 Thread Jerrick Hoang
Hi all, I'm upgrading from spark1.3 to spark1.4 and when trying to run spark-sql CLI. It gave a ```java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation``` exception. I did not get this error with 1.3 and I don't use any TFS FileSystem. Full stack trace is

Re: java.lang.IllegalStateException: unread block data

2015-07-14 Thread Arthur Chan
I found the reason, it is about sc. Thanks On Tue, Jul 14, 2015 at 9:45 PM, Akhil Das wrote: > Someone else also reported this error with spark 1.4.0 > > Thanks > Best Regards > > On Tue, Jul 14, 2015 at 6:57 PM, Arthur Chan > wrote: > >> Hi, Below is the log form the worker. >> >> >> 15/07/14

Re: Sessionization using updateStateByKey

2015-07-14 Thread Tathagata Das
[Apologies for repost, for those who have seen this response already in the dev mailing list] 1. When you set ssc.checkpoint(checkpointDir), the spark streaming periodically saves the state RDD (which is a snapshot of all the state data) to HDFS using RDD checkpointing. In fact, a streaming app wi
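
A minimal hedged sketch of the setup TD describes; the input source, state type, and checkpoint path are placeholders:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
// Enables periodic checkpointing of the state RDD to HDFS, which also cuts
// the otherwise ever-growing RDD lineage of updateStateByKey.
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")

val events = ssc.socketTextStream("localhost", 9999).map(line => (line, 1L))

// Running count per key, as a stand-in for real session state.
val updateFunc = (newValues: Seq[Long], state: Option[Long]) =>
  Some(newValues.sum + state.getOrElse(0L))

val sessions = events.updateStateByKey[Long](updateFunc)
sessions.print()

ssc.start()
ssc.awaitTermination()
```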

Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-14 Thread Kelly, Jonathan
I've set up my cluster with a pre-calculated value for spark.executor.instances in spark-defaults.conf such that I can run a job and have it maximize the utilization of the cluster resources by default. However, if I want to run a job with dynamicAllocation (by passing -c spark.dynamicAllocation

Sessionization using updateStateByKey

2015-07-14 Thread swetha
Hi, I have a question regarding sessionization using updateStateByKey. If near real time state needs to be maintained in a Streaming application, what happens when the number of RDDs to maintain the state becomes very large? Does it automatically get saved to HDFS and reload when needed or do I h

Re: Why does SparkSubmit process takes so much virtual memory in yarn-cluster mode ?

2015-07-14 Thread Marcelo Vanzin
On Tue, Jul 14, 2015 at 3:42 PM, Elkhan Dadashov wrote: > I looked into Virtual memory usage (jmap+jvisualvm) does not show that > 11.5 g Virtual Memory usage - it is much less. I get 11.5 g Virtual memory > usage using top -p pid command for SparkSubmit process. > If you're looking at top you w

Re: Why does SparkSubmit process takes so much virtual memory in yarn-cluster mode ?

2015-07-14 Thread Elkhan Dadashov
Thanks, Marcelo. That article confused me, thanks for correcting it & helpful tips. I looked into Virtual memory usage (jmap+jvisualvm) does not show that 11.5 g Virtual Memory usage - it is much less. I get 11.5 g Virtual memory usage using top -p pid command for SparkSubmit process. The virtua

Re: spark streaming with kafka reset offset

2015-07-14 Thread Tathagata Das
Relevant documentation - https://spark.apache.org/docs/latest/streaming-kafka-integration.html, towards the end. directKafkaStream.foreachRDD { rdd => val offsetRanges = rdd.asInstanceOf[HasOffsetRanges] // offsetRanges.length = # of Kafka partitions being consumed ... } On Tue,

Re: spark streaming with kafka reset offset

2015-07-14 Thread Cody Koeninger
You have access to the offset ranges for a given rdd in the stream by typecasting to HasOffsetRanges. You can then store the offsets wherever you need to. On Tue, Jul 14, 2015 at 5:00 PM, Chen Song wrote: > A follow up question. > > When using createDirectStream approach, the offsets are checkp
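
A hedged sketch of the typecast Cody mentions, assuming `directKafkaStream` was created with KafkaUtils.createDirectStream; where the offsets go afterwards (REST endpoint, database, ZooKeeper) is up to the application:

```scala
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

directKafkaStream.foreachRDD { rdd =>
  // One OffsetRange per Kafka partition consumed in this batch.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"topic=${o.topic} partition=${o.partition} " +
      s"from=${o.fromOffset} until=${o.untilOffset}")
  }
  // ... process rdd, then persist offsetRanges together with the results.
}
```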

Re: spark streaming with kafka reset offset

2015-07-14 Thread Chen Song
A follow up question. When using createDirectStream approach, the offsets are checkpointed to HDFS and it is understandable by Spark Streaming job. Is there a way to expose the offsets via a REST api to end users. Or alternatively, is there a way to have offsets committed to Kafka Offset Manager s

Data Frame for nested json

2015-07-14 Thread spark user
Does DataFrame support nested JSON to dump directly to a database? For simple JSON it is working fine: {"id":2,"name":"Gerald","email":"gbarn...@zimbio.com","city":"Štoky","country":"Czech Republic","ip":"92.158.154.75"}, But for nested JSON it failed to load: root |-- rows: array (nullable = true)
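
One hedged way to handle the nested case, assuming the array column is called `rows` as in the schema above and its element fields match the simple record shown; flattening first makes it writable like the flat JSON:

```scala
import org.apache.spark.sql.functions.explode

val df = sqlContext.read.json("path/to/nested.json")

// Explode the "rows" array into one record per element, then pull the struct
// fields up into top-level columns so the result can be written to a database
// (e.g. with DataFrame.write.jdbc) just like the simple, non-nested JSON.
val flat = df
  .select(explode(df("rows")).as("r"))
  .select("r.id", "r.name", "r.email", "r.city", "r.country", "r.ip")
flat.printSchema()
```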

Re: Master vs. Slave Nodes Clarification

2015-07-14 Thread Tathagata Das
Yep :) On Tue, Jul 14, 2015 at 2:44 PM, algermissen1971 wrote: > > On 14 Jul 2015, at 23:26, Tathagata Das wrote: > > > Just to be clear, you mean the Spark Standalone cluster manager's > "master" and not the applications "driver", right. > > Sorry, by now I have understood that I would not nec

Re: Master vs. Slave Nodes Clarification

2015-07-14 Thread algermissen1971
On 14 Jul 2015, at 23:26, Tathagata Das wrote: > Just to be clear, you mean the Spark Standalone cluster manager's "master" > and not the applications "driver", right. Sorry, by now I have understood that I would not necessarily put the driver app on the master node and that not making that

Re: master compile broken for scala 2.11

2015-07-14 Thread Josh Rosen
I've opened a PR to fix this; please take a look: https://github.com/apache/spark/pull/7405 On Tue, Jul 14, 2015 at 11:22 AM, Koert Kuipers wrote: > it works for scala 2.10, but for 2.11 i get: > > [ERROR] > /home/koert/src/spark/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeEx

Misaligned Rows with UDF

2015-07-14 Thread pedro
Hi, I am working at finding the root cause of a bug where rows in dataframes seem to have misaligned data. My dataframes have two types of columns: columns from data and columns from UDFs. I seem to be having trouble where for a given row, the row data doesn't match the data used to compute the UD

Re: Master vs. Slave Nodes Clarification

2015-07-14 Thread Tathagata Das
Just to be clear, you mean the Spark Standalone cluster manager's "master" and not the applications "driver", right. In that case, the earlier responses are correct. TD On Tue, Jul 14, 2015 at 11:26 AM, Mohammed Guller wrote: > The master node does not have to be similar to the worker nodes. It

RE: Spark on EMR with S3 example (Python)

2015-07-14 Thread Pagliari, Roberto
Hi Sujit, I just wanted to access public datasets on Amazon. Do I still need to provide the keys? Thank you, From: Sujit Pal [mailto:sujitatgt...@gmail.com] Sent: Tuesday, July 14, 2015 3:14 PM To: Pagliari, Roberto Cc: user@spark.apache.org Subject: Re: Spark on EMR with S3 example (Python) H

Re: How to speed up Spark process

2015-07-14 Thread ๏̯͡๏
Any solutions to solve this exception ? org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1 at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:389) at org.a

Re: Java 8 vs Scala

2015-07-14 Thread Vineel Yalamarthy
Good question. Like you , many are in the same boat(coming from Java background). Looking forward to response from the community. Regards Vineel On Tue, Jul 14, 2015 at 2:30 PM, spark user wrote: > Hi All > > To Start new project in Spark , which technology is good .Java8 OR Scala . > > I

Re: Java 8 vs Scala

2015-07-14 Thread Ted Yu
See previous thread: http://search-hadoop.com/m/q3RTtaXamv1nFTGR On Tue, Jul 14, 2015 at 1:30 PM, spark user wrote: > Hi All > > To Start new project in Spark , which technology is good .Java8 OR Scala . > > I am Java developer , Can i start with Java 8 or I Need to learn Scala . > > which one

DataFrame.withColumn() recomputes columns even after cache()

2015-07-14 Thread pnpritchard
Hi! I am seeing some unexpected behavior with regards to cache() in DataFrames. Here goes: In my Scala application, I have created a DataFrame that I run multiple operations on. It is expensive to recompute the DataFrame, so I have called cache() after it gets created. I notice that the cache()

Java 8 vs Scala

2015-07-14 Thread spark user
Hi All, To start a new project in Spark, which technology is good: Java 8 or Scala? I am a Java developer. Can I start with Java 8, or do I need to learn Scala? Which one is the better technology for quickly starting any POC project? Thanks - su

Re: To access elements of a org.apache.spark.mllib.linalg.Vector

2015-07-14 Thread Dan Dong
Yes, it works! Thanks a lot Burak! Cheers, Dan 2015-07-14 14:34 GMT-05:00 Burak Yavuz : > Hi Dan, > > You could zip the indices with the values if you like. > > ``` > val sVec = sparseVector(1).asInstanceOf[ > org.apache.spark.mllib.linalg.SparseVector] > val map = sVec.indices.zip(sVec.values)

Re: spark on yarn

2015-07-14 Thread Marcelo Vanzin
On Tue, Jul 14, 2015 at 12:03 PM, Shushant Arora wrote: > Can a container have multiple JVMs running in YARN? > Yes and no. A container runs a single command, but that process can start other processes, and those also count towards the resource usage of the container (mostly memory). For example

Re: To access elements of a org.apache.spark.mllib.linalg.Vector

2015-07-14 Thread Burak Yavuz
Hi Dan, You could zip the indices with the values if you like. ``` val sVec = sparseVector(1).asInstanceOf[ org.apache.spark.mllib.linalg.SparseVector] val map = sVec.indices.zip(sVec.values).toMap ``` Best, Burak On Tue, Jul 14, 2015 at 12:23 PM, Dan Dong wrote: > Hi, > I'm wondering how t

To access elements of a org.apache.spark.mllib.linalg.Vector

2015-07-14 Thread Dan Dong
Hi, I'm wondering how to access elements of a linalg.Vector, e.g: sparseVector: Seq[org.apache.spark.mllib.linalg.Vector] = List((3,[1,2],[1.0,2.0]), (3,[0,1,2],[3.0,4.0,5.0])) scala> sparseVector(1) res16: org.apache.spark.mllib.linalg.Vector = (3,[0,1,2],[3.0,4.0,5.0]) How to get the indices

Re: Spark on EMR with S3 example (Python)

2015-07-14 Thread Sujit Pal
Hi Roberto, I have written PySpark code that reads from private S3 buckets, it should be similar for public S3 buckets as well. You need to set the AWS access and secret keys into the SparkContext, then you can access the S3 folders and files with their s3n:// paths. Something like this: sc = Spa
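
Sujit's example is PySpark; a hedged Scala equivalent of the same idea looks like the following, with the bucket path as a placeholder (for a truly public bucket the two credential lines may not be needed):

```scala
// Put the AWS credentials into the Hadoop configuration used by this
// SparkContext, then read the objects through their s3n:// paths.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

val lines = sc.textFile("s3n://some-bucket/path/to/data/*")
println(lines.count())
```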

Re: spark on yarn

2015-07-14 Thread Shushant Arora
Can a container have multiple JVMs running in YARN? I am comparing Hadoop MapReduce running on yarn vs spark running on yarn here: 1. Is the difference that in a Hadoop MapReduce job - say I specify 20 reducers and my job uses 10 map tasks - it needs a total of 30 containers or 30 vcores? I guess 30 v

Re: How to maintain multiple JavaRDD created within another method like javaStreamRDD.forEachRDD

2015-07-14 Thread Jong Wook Kim
Your question is not very clear, but from what I understand, you want to deal with a stream of MyTable that has parsed records from your Kafka topics. What you need is JavaDStream, and you can use transform()

Re: ProcessBuilder in SparkLauncher is memory inefficient for launching new process

2015-07-14 Thread Jong Wook Kim
The article you've linked, is specific to an embedded system. the JVM built for that architecture (which the author didn't mention) might not be as stable and well-supported as HotSpot. ProcessBuilder is a stable Java API and despite somewhat limited functionality it is the standard method to l

Re: HDFS performances + unexpected death of executors.

2015-07-14 Thread Max Demoulin
I will try a fresh setup very soon. Actually, I tried to compile spark by myself, against hadoop 2.5.2, but I had the issue that I mentioned in this thread: http://apache-spark-user-list.1001560.n3.nabble.com/Master-doesn-t-start-no-logs-td23651.html I was wondering if maybe serialization/deseria

Re: spark on yarn

2015-07-14 Thread Jong Wook Kim
it's probably because your YARN cluster has only 40 vCores available. Go to your resource manager and check if "VCores Total" and "Memory Total" exceeds what you have set. (40 cores and 5120 MB) If that looks fine, go to "Scheduler" page and find the queue on which your jobs run, and check the

RE: Master vs. Slave Nodes Clarification

2015-07-14 Thread Mohammed Guller
The master node does not have to be similar to the worker nodes. It can be a smaller machine. In case of C*, again you don't need to have C* on the master node. You need C* and Spark workers co-located. Master can be on one of the C* node or a non-C* node. Mohammed -Original Message-

master compile broken for scala 2.11

2015-07-14 Thread Koert Kuipers
it works for scala 2.10, but for 2.11 i get: [ERROR] /home/koert/src/spark/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java:135: error: is not abstract and does not override abstract method minBy(Function1,Ordering) in TraversableOnce [ERROR] return new

Re: spark on yarn

2015-07-14 Thread Marcelo Vanzin
On Tue, Jul 14, 2015 at 11:13 AM, Shushant Arora wrote: > spark-submit --class classname --num-executors 10 --executor-cores 4 > --master masteradd jarname > > Will it allocate 10 containers throughout the life of streaming > application on same nodes until any node failure happens and > It will

Re: spark on yarn

2015-07-14 Thread Shushant Arora
Ok thanks a lot! few more doubts : What happens in a streaming application say with spark-submit --class classname --num-executors 10 --executor-cores 4 --master masteradd jarname Will it allocate 10 containers throughout the life of streaming application on same nodes until any node failure hap

Re: spark on yarn

2015-07-14 Thread Marcelo Vanzin
On Tue, Jul 14, 2015 at 10:55 AM, Shushant Arora wrote: > Is yarn.scheduler.maximum-allocation-vcores the setting for max vcores per > container? > I don't remember YARN config names by heart, but that sounds promising. I'd look at the YARN documentation for details. > Whats the setting for ma

Re: spark on yarn

2015-07-14 Thread Ted Yu
Please see YARN-193 where 'yarn.scheduler.maximum-allocation-vcores' was introduced. See also YARN-3823 which changed default value. Cheers On Tue, Jul 14, 2015 at 10:55 AM, Shushant Arora wrote: > Is yarn.scheduler.maximum-allocation-vcores the setting for max vcores per > container? > > What

Re: Why does SparkSubmit process takes so much virtual memory in yarn-cluster mode ?

2015-07-14 Thread Marcelo Vanzin
On Tue, Jul 14, 2015 at 9:53 AM, Elkhan Dadashov wrote: > While the program is running, these are the stats of how much memory each > process takes: > > SparkSubmit process : 11.266 *gigabyte* Virtual Memory > > ApplicationMaster process: 2303480 *byte *Virtual Memory > That SparkSubmit number l

Re: spark on yarn

2015-07-14 Thread Ted Yu
Shushant : Please also see 'Debugging your Application' section of https://spark.apache.org/docs/latest/running-on-yarn.html On Tue, Jul 14, 2015 at 10:48 AM, Marcelo Vanzin wrote: > On Tue, Jul 14, 2015 at 10:40 AM, Shushant Arora < > shushantaror...@gmail.com> wrote: > >> My understanding wa

Re: spark on yarn

2015-07-14 Thread Shushant Arora
Is yarn.scheduler.maximum-allocation-vcores the setting for max vcores per container? Whats the setting for max limit of --num-executors ? On Tue, Jul 14, 2015 at 11:18 PM, Marcelo Vanzin wrote: > On Tue, Jul 14, 2015 at 10:40 AM, Shushant Arora < > shushantaror...@gmail.com> wrote: > >> My und

Spark on EMR with S3 example (Python)

2015-07-14 Thread Pagliari, Roberto
Is there an example about how to load data from a public S3 bucket in Python? I haven't found any. Thank you,

Re: spark on yarn

2015-07-14 Thread Marcelo Vanzin
On Tue, Jul 14, 2015 at 10:40 AM, Shushant Arora wrote: > My understanding was --executor-cores(5 here) are maximum concurrent > tasks possible in an executor and --num-executors (10 here)are no of > executors or containers demanded by Application master/Spark driver program > to yarn RM. > --e

Re: spark on yarn

2015-07-14 Thread Shushant Arora
got the below exception in logs: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException): Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=5, maxVirtualCores=4 at org.apache.ha

Re: Few basic spark questions

2015-07-14 Thread Feynman Liang
You could implement the receiver as a Spark Streaming Receiver ; the data received would be available for any streaming applications which operate on DStreams (e.g. Streaming KMeans
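
A hedged sketch of such a receiver; the polling source (`fetchNextRecord`) is hypothetical, and the storage level is just one reasonable choice:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  override def onStart(): Unit = {
    // Receive on a separate thread; data handed to store() is managed by
    // Spark Streaming and shows up in the DStream returned by receiverStream.
    new Thread("my-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store(fetchNextRecord())
        }
      }
    }.start()
  }

  override def onStop(): Unit = { /* release any source resources here */ }

  // Hypothetical stand-in for the real data source.
  private def fetchNextRecord(): String = { Thread.sleep(100); "record" }
}

// Usage: val stream = ssc.receiverStream(new MyReceiver())
```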

Re: spark on yarn

2015-07-14 Thread Marcelo Vanzin
On Tue, Jul 14, 2015 at 9:57 AM, Shushant Arora wrote: > When I specify --executor-cores > 4 it fails to start the application. > When I give --executor-cores as 4 , it works fine. > Do you have any NM that advertises more than 4 available cores? Also, it's always worth it to check if there's a

Re: Finding moving average using Spark and Scala

2015-07-14 Thread Feynman Liang
If your rows may have NAs in them, I would process each column individually by first projecting the column ( map(x => x.nameOfColumn) ), filtering out the NAs, then running a summarizer over each column. Even if you have many rows, after summarizing you will only have a vector of length #columns.
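
A small sketch of that per-column pattern, using the built-in stats() (a StatCounter) as the summarizer; the case class and its `price` field are placeholders:

```scala
case class Record(price: Double, volume: Double)

val rows = sc.parallelize(Seq(
  Record(1.0, 10.0), Record(Double.NaN, 20.0), Record(3.0, 30.0)))

val priceStats = rows
  .map(_.price)        // project the column
  .filter(!_.isNaN)    // drop the NAs
  .stats()             // count, mean, stdev, min, max in one pass

println(s"count=${priceStats.count} mean=${priceStats.mean} stdev=${priceStats.stdev}")
```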

spark on yarn

2015-07-14 Thread Shushant Arora
I am running spark application on yarn managed cluster. When I specify --executor-cores > 4 it fails to start the application. I am starting the app as spark-submit --class classname --num-executors 10 --executor-cores 5 --master masteradd jarname Exception in thread "main" org.apache.spark.Spar

Why does SparkSubmit process takes so much virtual memory in yarn-cluster mode ?

2015-07-14 Thread Elkhan Dadashov
More particular example: I run pi.py Spark Python example in *yarn-cluster* mode (--master) through SparkLauncher in Java. While the program is running, these are the stats of how much memory each process takes: SparkSubmit process : 11.266 *gigabyte* Virtual Memory ApplicationMaster process: 2

Re: Spark application with a RESTful API

2015-07-14 Thread Debasish Das
How do you manage the spark context elastically when your load grows from 1000 users to 1 users ? On Tue, Jul 14, 2015 at 8:31 AM, Hafsa Asif wrote: > I have almost the same case. I will tell you what I am actually doing, if > it > is according to your requirement, then I will love to help y

ProcessBuilder in SparkLauncher is memory inefficient for launching new process

2015-07-14 Thread Elkhan Dadashov
Hi all, If you want to launch a Spark job from Java in a programmatic way, then you need to use SparkLauncher. SparkLauncher uses ProcessBuilder for creating the new process - Java seems to handle process creation in an inefficient way. "When you execute a process, you must first fork() and then exec(). F

Re: correct Scala Imports for creating DFs from RDDs?

2015-07-14 Thread DW @ Gmail
You are mixing the 1.0.0 Spark SQL jar with Spark 1.4.0 jars in your build file Sent from my rotary phone. > On Jul 14, 2015, at 7:57 AM, ashwang168 wrote: > > Hello! > > I am currently using Spark 1.4.0, scala 2.10.4, and sbt 0.13.8 to try and > create a jar file from a scala file (attached
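
A hedged build.sbt sketch of the fix: keep every Spark module pinned to the same version so the spark-sql classes match the spark-core ones at runtime (the module list and "provided" scope are assumptions about the project):

```scala
scalaVersion := "2.10.4"

val sparkVersion = "1.4.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
)
```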

How to maintain multiple JavaRDD created within another method like javaStreamRDD.forEachRDD

2015-07-14 Thread unk1102
I use Spark Streaming where messages read from Kafka topics are stored into a JavaDStream; this RDD contains the actual data. Now, after going through the documentation and other help, I have found we traverse a JavaDStream using foreachRDD: javaDStreamRdd.foreachRDD(new Function,Void>() { public void call(Ja

Re: Create RDD from output of unix command

2015-07-14 Thread Igor Berman
haven't you thought about spark streaming? there is thread that could help https://www.mail-archive.com/user%40spark.apache.org/msg30105.html On 14 July 2015 at 18:20, Hafsa Asif wrote: > Your question is very interesting. What I suggest is, that copy your output > in some text file. Read text f

Re: Few basic spark questions

2015-07-14 Thread Oded Maimon
Hi, Thanks for all the help. I'm still missing something very basic. If I won't use SparkR, which doesn't support streaming (I will use MLlib instead as Debasish suggested), and I have my Scala receiver working, how should the receiver save the data in memory? I do see the store method, so if I use it

Re: Spark application with a RESTful API

2015-07-14 Thread Hafsa Asif
I have almost the same case. I will tell you what I am actually doing, if it is according to your requirement, then I will love to help you. 1. my database is aerospike. I get data from it. 2. written standalone spark app (it does not run in standalone mode, but with simple java command or maven c

Re: Create RDD from output of unix command

2015-07-14 Thread Hafsa Asif
Your question is very interesting. What I suggest is, that copy your output in some text file. Read text file in your code and apply RDD. Just consider wordcount example by Spark. I love this example with Java client. Well, Spark is an analytical engine and it has a slogan to analyze big big data s
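
A hedged sketch of that idea without the intermediate file: capture the command's output on the driver and parallelize it (fine for modest output; for a continuous feed the streaming suggestion in the other reply fits better). The `df -h` command is just an example:

```scala
import scala.sys.process._

// Run the command, capture stdout as one string, split into lines,
// and turn the lines into an RDD on the driver.
val output = Seq("bash", "-c", "df -h").!!.split("\n").toSeq
val rdd = sc.parallelize(output)
println(rdd.count())
```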

Re: Including additional scala libraries in sparkR

2015-07-14 Thread Shivaram Venkataraman
There was a fix for `--jars` that went into 1.4.1 https://github.com/apache/spark/commit/2579948bf5d89ac2d822ace605a6a4afce5258d6 Shivaram On Tue, Jul 14, 2015 at 4:18 AM, Sun, Rui wrote: > Could you give more details about the mis-behavior of --jars for SparkR? > maybe it's a bug. > __
