Re: Error when connecting to Spark SQL via Hive JDBC driver

2015-06-18 Thread rahulkumar-aws
It looks like your Spark Hive jars are not compatible with Spark; compile the Spark source with the Hive 13 flag: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package. It will solve your problem. - Software Developer Sigmoid (SigmoidAnalytics), India

Re: RE: Spark or Storm

2015-06-18 Thread Enno Shioji
Tbh I find the doc around this a bit confusing. If it says "end-to-end exactly-once semantics (if your updates to downstream systems are idempotent or transactional)", I think most people will interpret it as meaning that as long as you use a storage which has atomicity (like MySQL/Postgres etc.), a successfu

RE: Build spark application into uber jar

2015-06-18 Thread prajod.vettiyattil
> but when I run the application locally, it complains that spark related stuff > is missing I use the uber jar option. What do you mean by “locally” ? In the Spark scala shell ? In the From: bit1...@163.com [mailto:bit1...@163.com] Sent: 19 June 2015 08:11 To: user Subject: Build spark applica

Re: Error when connecting to Spark SQL via Hive JDBC driver

2015-06-18 Thread ogoh
Hello, I am not sure what is wrong, but in my case I followed the instruction from http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HiveJDBCDriver.html. It worked fine with SQuirreL SQL Client (http://squirrel-sql.sourceforge.net/), and SQL Workbench J (http://www.sql-workbenc

Re: SparkSubmit with Ivy jars is very slow to load with no internet access

2015-06-18 Thread Burak Yavuz
Hey Nathan, I like the first idea better. Let's see what others think. I'd be happy to review your PR afterwards! Best, Burak On Thu, Jun 18, 2015 at 9:53 PM, Nathan McCarthy < nathan.mccar...@quantium.com.au> wrote: > Hey, > > Spark Submit adds maven central & spark bintray to the ChainResol

Re: HiveContext saveAsTable create wrong partition

2015-06-18 Thread Yin Huai
If you are writing to an existing hive table, our insert into operator follows hive's requirement, which is "the dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause." You can find requir
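For illustration, that ordering looks roughly like this when issued through a HiveContext (a minimal sketch; the table and column names are hypothetical):

    // the dynamic partition column (dt) comes last in the SELECT,
    // matching its position in the PARTITION clause
    sqlContext.sql(
      "INSERT INTO TABLE events PARTITION (dt) " +
      "SELECT user_id, action, dt FROM staging_events")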

SparkSubmit with Ivy jars is very slow to load with no internet access

2015-06-18 Thread Nathan McCarthy
Hey, Spark Submit adds maven central & spark bintray to the ChainResolver before it adds any external resolvers. https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L821 When running on a cluster without internet access, this means the s

Re: HiveContext saveAsTable create wrong partition

2015-06-18 Thread Yin Huai
Are you writing to an existing hive orc table? On Wed, Jun 17, 2015 at 3:25 PM, Cheng Lian wrote: > Thanks for reporting this. Would you mind to help creating a JIRA for this? > > > On 6/16/15 2:25 AM, patcharee wrote: > >> I found if I move the partitioned columns in schemaString and in Row to

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-18 Thread Nipun Arora
Hi Tathagata, When you say please mark spark-core and spark-streaming as dependencies how do you mean? I have installed the pre-built spark-1.4 for Hadoop 2.6 from spark downloads. In my maven pom.xml, I am using version 1.4 as described. Please let me know how I can fix that. Thanks Nipun On T

how to change /tmp folder for spark ut use sbt

2015-06-18 Thread yuemeng (A)
Hi all, if I want to change the /tmp folder to any other folder for Spark unit tests using sbt, how can I do that?
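One possible sketch in build.sbt, assuming redirecting the JVM temp dir is enough for your tests (the path is a placeholder):

    // javaOptions only take effect when tests run in a forked JVM
    fork in Test := true
    javaOptions in Test += "-Djava.io.tmpdir=/data/spark-tmp"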

Build spark application into uber jar

2015-06-18 Thread bit1...@163.com
Hi, sparks, I have a Spark Streaming application that is a maven project, and I would like to build it into an uber jar and run it in the cluster. I have found two options to build the uber jar; each of them has its shortcomings, so I would ask how you guys do it. Thanks. 1. Use the maven shade ja

Re: Latency between the RDD in Streaming

2015-06-18 Thread Tathagata Das
Why do you need to uniquely identify the message? All you need is the time when the message was inserted by the receiver, and when it is processed, isn't it? On Thu, Jun 18, 2015 at 2:28 PM, anshu shukla wrote: > Thanks a lot, but I have already tried the second way. Problem with that > is tha

Re: createDirectStream and Stats

2015-06-18 Thread Tim Smith
Thanks for the super-fast response, TD :) I will now go bug my hadoop vendor to upgrade from 1.3 to 1.4. Cloudera, are you listening? :D On Thu, Jun 18, 2015 at 7:02 PM, Tathagata Das wrote: > Are you using Spark 1.3.x ? That explains. This issue has been fixed in > Spark 1.4.0. Bonus you g

Spark-sql(yarn-client) java.lang.NoClassDefFoundError: org/apache/spark/deploy/yarn/ExecutorLauncher

2015-06-18 Thread Sea
Hi, all: I want to run spark sql on yarn(yarn-client), but ... I already set "spark.yarn.jar" and "spark.jars" in conf/spark-defaults.conf. ./bin/spark-sql -f game.sql --executor-memory 2g --num-executors 100 > game.txt Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spar

Re: Does MLLib has attribute importance?

2015-06-18 Thread Debasish Das
Running L1 and picking non-zero coefficients gives a good estimate of interesting features as well... On Jun 17, 2015 4:51 PM, "Xiangrui Meng" wrote: > We don't have it in MLlib. The closest would be the ChiSqSelector, > which works for categorical data. -Xiangrui > > On Thu, Jun 11, 2015 at 4:3
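A rough Scala sketch of that L1 idea (assuming trainingData is an RDD[LabeledPoint]; the regularization and iteration values are placeholders):

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.optimization.L1Updater

    // train with an L1 penalty, then keep the indices of the non-zero weights
    val lr = new LogisticRegressionWithSGD()
    lr.optimizer.setUpdater(new L1Updater).setRegParam(0.1).setNumIterations(100)
    val model = lr.run(trainingData)
    val selectedFeatures = model.weights.toArray.zipWithIndex
      .collect { case (w, i) if w != 0.0 => i }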

Re: createDirectStream and Stats

2015-06-18 Thread Tathagata Das
Are you using Spark 1.3.x ? That explains. This issue has been fixed in Spark 1.4.0. Bonus you get a fancy new streaming UI with more awesome stats. :) On Thu, Jun 18, 2015 at 7:01 PM, Tim Smith wrote: > Hi, > > I just switched from "createStream" to the "createDirectStream" API for > kafka and

createDirectStream and Stats

2015-06-18 Thread Tim Smith
Hi, I just switched from "createStream" to the "createDirectStream" API for kafka and while things otherwise seem happy, the first thing I noticed is that stream/receiver stats are gone from the Spark UI :( Those stats were very handy for keeping an eye on health of the app. What's the best way t

RE: RE: Spark or Storm

2015-06-18 Thread prajod.vettiyattil
More details on the Direct API of Spark 1.3 is at the databricks blog: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html Note the use of checkpoints to persist the Kafka offsets in Spark Streaming itself, and not in zookeeper. Also this statement:”

Re: understanding on the "waiting batches" and "scheduling delay" in Streaming UI

2015-06-18 Thread Tathagata Das
Also, could you give a screenshot of the streaming UI. Even better, could you run it on Spark 1.4 which has a new streaming UI and then use that for debugging/screenshot? TD On Thu, Jun 18, 2015 at 3:05 AM, Akhil Das wrote: > Which version of spark? and what is your data source? For some reason

NaiveBayes for MLPipeline is absent

2015-06-18 Thread Justin Yip
Hello, Currently there is no NaiveBayes implementation for MLpipeline. I couldn't find the JIRA ticket related to it either (or maybe I missed it). Is there a plan to implement it? If no one has the bandwidth, I can work on it. Thanks. Justin -- View this message in context: http://apache-spark

Re: confusing ScalaReflectionException with DataFrames in 1.4

2015-06-18 Thread Michael Armbrust
I saw another report so I filed it already: Filed as: https://issues.apache.org/jira/browse/SPARK-8470 On Thu, Jun 18, 2015 at 4:07 PM, Chad Urso McDaniel wrote: > We're using the normal command line: > --- > bin/spark-submit --properties-file ./spark-submit.conf --class > com.rr.data.visits.Vis

Re: [SparkSQL]. MissingRequirementError when creating dataframe from RDD (new error in 1.4)

2015-06-18 Thread Michael Armbrust
Thanks for reporting. Filed as: https://issues.apache.org/jira/browse/SPARK-8470 On Thu, Jun 18, 2015 at 5:35 PM, Adam Lewandowski < adam.lewandow...@gmail.com> wrote: > Since upgrading to Spark 1.4, I'm getting a > scala.reflect.internal.MissingRequirementError when creating a DataFrame > from

Coalescing with shuffle = false in imbalanced cluster

2015-06-18 Thread Corey Nolet
I'm confused about this. The comment on the function seems to indicate that there is absolutely no shuffle or network IO but it also states that it assigns an even number of parent partitions to each final partition group. I'm having trouble seeing how this can be guaranteed without some data pass

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Corey Nolet
Sorry Du, repartition means coalesce(shuffle = true) as per [1]. They are the same operation. Coalescing with shuffle = false means you are specifying the maximum number of partitions after the coalesce (if there are fewer partitions already, you will end up with that lesser amount). [1] https://github.com/apac
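Paraphrasing the linked RDD source, repartition is just coalesce with the shuffle flag turned on:

    def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] =
      coalesce(numPartitions, shuffle = true)

    rdd.repartition(100)   // same as rdd.coalesce(100, shuffle = true)
    rdd.coalesce(10)       // shuffle = false: can only merge partitions, never increase them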

[SparkSQL]. MissingRequirementError when creating dataframe from RDD (new error in 1.4)

2015-06-18 Thread Adam Lewandowski
Since upgrading to Spark 1.4, I'm getting a scala.reflect.internal.MissingRequirementError when creating a DataFrame from an RDD. The error references a case class in the application (the RDD's type parameter), which has been verified to be present. Items of note: 1) This is running on AWS EMR (YAR

RE: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-18 Thread Haopu Wang
Akhil, From my test, I can see the files in the last batch will always be reprocessed upon restarting from checkpoint, even for a graceful shutdown. I think usually the file is expected to be processed only once. Maybe this is a bug in fileStream? Or do you know any approach to work around it?

Interaction between StringIndexer feature transformer and CrossValidator

2015-06-18 Thread cyz
Hi, I encountered errors fitting a model using a CrossValidator. The training set contained a feature which was initially a String with many unique values. I used a StringIndexer to transform this feature column into label indices. Fitting a model with a regular pipeline worked fine, but I ran int
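For reference, a minimal StringIndexer usage sketch (column names are hypothetical; this shows the transformer itself, not the CrossValidator failure):

    import org.apache.spark.ml.feature.StringIndexer

    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
    val indexed = indexer.fit(df).transform(df)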

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-18 Thread Rogers Jeffrey
I am submitting the application from a python notebook. I am launching pyspark as follows: SPARK_PUBLIC_DNS=ec2-54-165-202-17.compute-1.amazonaws.com SPARK_WORKER_CORES=8 SPARK_WORKER_MEMORY=15g SPARK_MEM=30g OUR_JAVA_MEM=30g SPARK_DAEMON_JAVA_OPTS="-XX:MaxPermSize=30g -Xms30g -Xmx30g" IPYTHON=1

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Du Li
repartition() means coalesce(shuffle=false) On Thursday, June 18, 2015 4:07 PM, Corey Nolet wrote: Doesn't repartition call coalesce(shuffle=true)? On Jun 18, 2015 6:53 PM, "Du Li" wrote: I got the same problem with rdd.repartition() in my streaming app, which generated a few huge

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-18 Thread Xiangrui Meng
With 80,000 features and 1000 clusters, you need 80,000,000 doubles to store the cluster centers. That is ~600MB. If there are 10 partitions, you might need 6GB on the driver to collect updates from workers. I guess the driver died. Did you specify driver memory with spark-submit? -Xiangrui On Thu

Re: Submitting Spark Applications using Spark Submit

2015-06-18 Thread maxdml
You can specify the jars of your application to be included with spark-submit with the --jars switch. Otherwise, are you sure that your newly compiled spark jar assembly is in assembly/target/scala-2.10/? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Su

Re: confusing ScalaReflectionException with DataFrames in 1.4

2015-06-18 Thread Chad Urso McDaniel
We're using the normal command line: --- bin/spark-submit --properties-file ./spark-submit.conf --class com.rr.data.visits.VisitSequencerRunner ./mvt-master-SNAPSHOT-jar-with-dependencies.jar --- Our jar contains both com.rr.data.visits.orc.OrcReadWrite (which you can see in the stack trace) and t

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Corey Nolet
Doesn't repartition call coalesce(shuffle=true)? On Jun 18, 2015 6:53 PM, "Du Li" wrote: > I got the same problem with rdd.repartition() in my streaming app, which > generated a few huge partitions and many tiny partitions. The resulting > high data skew makes the processing time of a batch unpre

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Du Li
I got the same problem with rdd.repartition() in my streaming app, which generated a few huge partitions and many tiny partitions. The resulting high data skew makes the processing time of a batch unpredictable, often exceeding the batch interval. I eventually solved the problem by using rdd

Re: confusing ScalaReflectionException with DataFrames in 1.4

2015-06-18 Thread Michael Armbrust
How are you adding com.rr.data.Visit to spark? With --jars? It is possible we are using the wrong classloader. Could you open a JIRA? On Thu, Jun 18, 2015 at 2:56 PM, Chad Urso McDaniel wrote: > We are seeing class exceptions when converting to a DataFrame. > Anyone out there with some sugges

Re: Spark-sql versus Impala versus Hive

2015-06-18 Thread Michael Armbrust
I would also love to see a more recent version of Spark SQL. There have been a lot of performance improvements between 1.2 and 1.4 :) On Thu, Jun 18, 2015 at 3:18 PM, Steve Nunez wrote: > Interesting. What where the Hive settings? Specifically it would be > useful to know if this was Hive on

Re: Spark-sql versus Impala versus Hive

2015-06-18 Thread Steve Nunez
Interesting. What were the Hive settings? Specifically it would be useful to know if this was Hive on Tez. - Steve From: Sanjay Subramanian Reply-To: Sanjay Subramanian Date: Thursday, June 18, 2015 at 11:08 To: "user@spark.apache.org" Subject: Spark-sql versus Imp

confusing ScalaReflectionException with DataFrames in 1.4

2015-06-18 Thread Chad Urso McDaniel
We are seeing class exceptions when converting to a DataFrame. Anyone out there with some suggestions on what is going on? Our original intention was to use a HiveContext to write ORC and we saw the error there and have narrowed it down. This is an example of our code: --- def saveVisitsAsOrcFil

Re: Latency between the RDD in Streaming

2015-06-18 Thread anshu shukla
Thanks a lot, but I have already tried the second way. The problem with that is how to identify a particular RDD from source to sink (as we can do by passing a msg id in storm). For that I just updated the RDD and added a msgID (as a static variable), but while dumping them to file some of the

Re: Latency between the RDD in Streaming

2015-06-18 Thread Tathagata Das
Couple of ways. 1. Easy but approximate way: Find scheduling delay and processing time using the StreamingListener interface, and then calculate "end-to-end delay = 0.5 * batch interval + scheduling delay + processing time". The 0.5 * batch interval is the approximate average batching delay across all the record
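A rough sketch of that first approach (batchIntervalMs is your own constant; this yields an approximation, not an exact per-record latency):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class DelayListener(batchIntervalMs: Long) extends StreamingListener {
      override def onBatchCompleted(completed: StreamingListenerBatchCompleted): Unit = {
        val info = completed.batchInfo
        val endToEndMs = 0.5 * batchIntervalMs +
          info.schedulingDelay.getOrElse(0L) +
          info.processingDelay.getOrElse(0L)
        println(s"approx end-to-end delay: $endToEndMs ms")
      }
    }
    // ssc.addStreamingListener(new DelayListener(2000))  // 2000 = batch interval in ms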

Re: Latency between the RDD in Streaming

2015-06-18 Thread anshu shukla
Sorry, I missed the LATENCY word. For a large streaming query, how to find the time taken by a particular RDD to travel from the initial D-STREAM to the final/last D-STREAM? Help please!! On Fri, Jun 19, 2015 at 12:40 AM, Tathagata Das wrote: > Its not clear what you are asking. Find "what

Re: The "Initial job has not accepted any resources" error; can't seem to set

2015-06-18 Thread dgoldenberg
I just realized that --conf needs to be one key-value pair per line. And somehow I needed --conf "spark.cores.max=2" \ However, when it was --conf "spark.deploy.defaultCores=2" \ then one job would take up all 16 cores on the box. What's the actual model here? We've got 10 app

Re: Does MLLib has attribute importance?

2015-06-18 Thread Ruslan Dautkhanov
Got it. Thanks! -- Ruslan Dautkhanov On Thu, Jun 18, 2015 at 1:02 PM, Xiangrui Meng wrote: > ChiSqSelector calls an RDD of labeled points, where the label is the > target. See > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scal

Re: Serial batching with Spark Streaming

2015-06-18 Thread Michal Čizmazia
Tathagata, Please could you confirm that batches are not processed in parallel during retries in Spark 1.4? See Binh's email copied below. Any pointers for workarounds if necessary? Thanks! On 18 June 2015 at 14:29, Binh Nguyen Van wrote: > I haven’t tried with 1.4 but I tried with 1.3 a while

The "Initial job has not accepted any resources" error; can't seem to set

2015-06-18 Thread dgoldenberg
Hi, I'm running Spark Standalone on a single node with 16 cores. Master and 4 workers are running. I'm trying to submit two applications via spark-submit and am getting the following error when submitting the second one: "Initial job has not accepted any resources; check your cluster UI to ensure

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-18 Thread Tathagata Das
I think you may be including a different version of Spark Streaming in your assembly. Please mark spark-core and spark-streaming as provided dependencies. Any installation of Spark will automatically provide Spark in the classpath so you do not have to bundle it. On Thu, Jun 18, 2015 at 8:44 AM, Ni
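In a Maven pom.xml that would look roughly like the following (shown for Scala 2.10 and Spark 1.4.0; adjust to your build, and do the same for spark-core_2.10):

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.4.0</version>
      <scope>provided</scope>
    </dependency>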

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Nick Pentreath
Yup, numpy calls into BLAS for matrix multiply. Sent from my iPad > On 18 Jun 2015, at 8:54 PM, Ayman Farahat wrote: > > Thanks all for the help. > It turned out that using the numpy matrix multiplication made a huge > difference in performance. I suspect that Numpy already uses BLAS optimize

MLIB-KMEANS: Py4JNetworkError: An error occurred while trying to connect to the Java server , on a huge data set

2015-06-18 Thread rogersjeffreyl
Hi All, I am trying to run KMeans clustering on a large data set with 12,000 points and 80,000 dimensions. I have a spark cluster in Ec2 stand alone mode with 8 workers running on 2 slaves with 160 GB Ram and 40 VCPU. My Code is as Follows: def convert_into_sparse_vector(A): non_nan_i

Settings for K-Means Clustering in Mlib for large data set

2015-06-18 Thread Rogers Jeffrey
Hi All, I am trying to run KMeans clustering on a large data set with 12,000 points and 80,000 dimensions. I have a spark cluster in Ec2 stand alone mode with 8 workers running on 2 slaves with 160 GB Ram and 40 VCPU. My Code is as Follows: def convert_into_sparse_vector(A): non_nan_indic

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-18 Thread Tathagata Das
Glad to hear that. :) On Thu, Jun 18, 2015 at 6:25 AM, Ji ZHANG wrote: > Hi, > > We switched from ParallelGC to CMS, and the symptom is gone. > > On Thu, Jun 4, 2015 at 3:37 PM, Ji ZHANG wrote: > >> Hi, >> >> I set spark.shuffle.io.preferDirectBufs to false in SparkConf and this >> setting can

Re: Is there programmatic way running Spark job on Yarn cluster without using spark-submit script ?

2015-06-18 Thread Corey Nolet
> This is not independent programmatic way of running of Spark job on Yarn cluster. The example I created simply demonstrates how to wire up the classpath so that spark submit can be called programmatically. For my use case, I wanted to hold open a connection so I could send tasks to the executors

Re: Latency between the RDD in Streaming

2015-06-18 Thread Tathagata Das
Its not clear what you are asking. Find "what" among RDD? On Thu, Jun 18, 2015 at 11:24 AM, anshu shukla wrote: > Is there any fixed way to find among RDD in stream processing systems , > in the Distributed set-up . > > -- > Thanks & Regards, > Anshu Shukla >

Re: Does MLLib has attribute importance?

2015-06-18 Thread Xiangrui Meng
ChiSqSelector takes an RDD of labeled points, where the label is the target. See https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala#L120 On Wed, Jun 17, 2015 at 10:22 PM, Ruslan Dautkhanov wrote: > Thank you Xiangrui. > > Oracle's
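A minimal usage sketch (assuming data is an RDD[LabeledPoint] with categorical features; the number of selected features is a placeholder):

    import org.apache.spark.mllib.feature.ChiSqSelector

    val selector = new ChiSqSelector(numTopFeatures = 50)
    val model = selector.fit(data)          // the labels act as the target
    val reduced = data.map(lp => lp.copy(features = model.transform(lp.features)))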

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Ayman Farahat
Thanks all for the help. It turned out that using the numpy matrix multiplication made a huge difference in performance. I suspect that Numpy already uses BLAS optimized code. Here is Python code #This is where i load and directly test the predictions myModel = MatrixFactorizationModel.load(sc

Specify number of partitions with which to run DataFrame.join?

2015-06-18 Thread Matt Cheah
Hi everyone, I'm looking into switching raw RDD operations to DataFrames operations. When I used JavaPairRDD.join(), I had the option to specify the number of partitions with which to do the join. However, I don't see an equivalent option in DataFrame.join(). Is there a way to specify the partitio
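If there is indeed no per-join argument in 1.4, one workaround sketch is to widen the global shuffle parallelism for SQL/DataFrame operations (the value and column name are placeholders):

    sqlContext.setConf("spark.sql.shuffle.partitions", "400")   // default is 200
    val joined = df1.join(df2, df1("id") === df2("id"))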

Re: Serial batching with Spark Streaming

2015-06-18 Thread Binh Nguyen Van
I haven’t tried with 1.4, but I tried with 1.3 a while ago and I could not get the serial behavior using the default scheduler when there is a failure and retry, so I created a customized stream like this. class EachSeqRDD[T: ClassTag] ( parent: DStream[T], eachSeqFunc: (RDD[T], Time) => Unit

Latency between the RDD in Streaming

2015-06-18 Thread anshu shukla
Is there any fixed way to find [the latency] among RDDs in stream processing systems, in the distributed set-up? -- Thanks & Regards, Anshu Shukla

Spark-sql versus Impala versus Hive

2015-06-18 Thread Sanjay Subramanian
I just published results of my findings here: https://bigdatalatte.wordpress.com/2015/06/18/spark-sql-versus-impala-versus-hive/

Re: Got the exception when joining RDD with spark streamRDD

2015-06-18 Thread Davies Liu
This seems to be a bug, could you file a JIRA for it? An RDD should be serializable for a Streaming job. On Thu, Jun 18, 2015 at 4:25 AM, Groupme wrote: > Hi, > > I am writing pyspark stream program. I have the training data set to compute > the regression model. I want to use the stream data set to t

Re: Serial batching with Spark Streaming

2015-06-18 Thread Michal Čizmazia
Tathagata, thanks for your response. You are right! Everything seems to work as expected. Could you please help me understand why the time for processing of all jobs for a batch is always less than 4 seconds? Please see my playground code below. The last modified time of the input (lines) RDD dump f

Hivecontext going out-of-sync issue

2015-06-18 Thread Ranadip Chatterjee
Hi All. I have a partitioned table in Hive. The use case is to drop one of the partitions before inserting new data every time the Spark process runs. I am using the Hivecontext to read and write (dynamic partitions) and also to alter the table to drop the partition before insert. Everything runs

problem with pants building

2015-06-18 Thread peixin li
Hi, We use pants to build a python project into an executable python file (pex). But we cannot run the pex file until we add all necessary library paths to PYTHONPATH and use pip to install the necessary packages for $SPARK_HOME/python/lib/pyspark.zip specifically. Since we have already added the unzipped pyspark

Re: Issue with PySpark UDF on a column of Vectors

2015-06-18 Thread Xiangrui Meng
This is a known issue. See https://issues.apache.org/jira/browse/SPARK-7902 -Xiangrui On Thu, Jun 18, 2015 at 6:41 AM, calstad wrote: > I am having trouble using a UDF on a column of Vectors in PySpark which can > be illustrated here: > > from pyspark import SparkContext > from pyspark.sql import

different schemas per row with DataFrames

2015-06-18 Thread Alex Nastetsky
I am reading JSON data that has different schemas for every record. That is, for a given field that would have a null value, it's simply absent from that record (and therefore, its schema). I would like to use the DataFrame API to select specific fields from this data, and for fields that are miss

Re: [Spark Streaming] Iterative programming on an ordered spark stream using Java?

2015-06-18 Thread Nipun Arora
@Twinkle - what did you mean by "Regarding not keeping whole dataset in memory, you can tweak the parameter of remember, such that it does checkpoint at appropriate time"? On Thu, Jun 18, 2015 at 11:40 AM, Nipun Arora wrote: > Hi All, > > I appreciate the help :) > > Here is a sample code where

[Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-18 Thread Nipun Arora
Hi, I have the following piece of code, where I am trying to transform a spark stream and add the min and max of each RDD to it. However, I get an error saying the max call does not exist, at run-time (it compiles properly). I am using spark-1.4. I have added the question to stackoverflow as well: http://stac

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Debasish Das
Also in my experiments, it's much faster to do blocked BLAS through cartesian rather than doing sc.union. Here are the details on the experiments: https://issues.apache.org/jira/browse/SPARK-4823 On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das wrote: > Also not sure how threading helps here because

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Debasish Das
Also not sure how threading helps here, because Spark puts a partition on each core. On each core there may be multiple threads if you are using Intel hyperthreading, but I will let Spark handle the threading. On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das wrote: > We added SPARK-3066 for this.

Re: [Spark Streaming] Iterative programming on an ordered spark stream using Java?

2015-06-18 Thread Nipun Arora
Hi All, I appreciate the help :) Here is a sample code where I am trying to keep the data of the previous RDD and the current RDD in a foreachRDD in a spark stream. I do not know if the code below technically works as I cannot compile it, but I am trying, in a way, to keep the historical reference

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Debasish Das
We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS dgemm based calculation. On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat < ayman.fara...@yahoo.com.invalid> wrote: > Thanks Sabarish and Nick > Would you happen to have some code snippets that you can share. > Best > Ayman > >

Re: Submitting Spark Applications using Spark Submit

2015-06-18 Thread lovelylavs
Hi, To make the jar files as part of the jar which you would like to use, you should create a uber jar. Please refer to the following: https://maven.apache.org/plugins/maven-shade-plugin/examples/includes-excludes.html -- View this message in context: http://apache-spark-user-list.1001560.n3

Re: [Spark Streaming] Iterative programming on an ordered spark stream using Java?

2015-06-18 Thread twinkle sachdeva
Hi, UpdateStateByKey: if you can briefly describe the issue you are facing with this, that will be great. Regarding not keeping the whole dataset in memory, you can tweak the remember parameter such that it does checkpoint at the appropriate time. Thanks Twinkle On Thursday, June 18, 2015, Nipun Arora wrote

Re: Spark-sql(yarn-client) java.lang.NoClassDefFoundError: org/apache/spark/deploy/yarn/ExecutorLauncher

2015-06-18 Thread Yin Huai
BTW, the user list will be a better place for this thread. On Thu, Jun 18, 2015 at 8:19 AM, Yin Huai wrote: > Is it the full stack trace? > > On Thu, Jun 18, 2015 at 6:39 AM, Sea <261810...@qq.com> wrote: > >> Hi, all: >> >> I want to run spark sql on yarn(yarn-client), but ... I already set >> "sp

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Ayman Farahat
Thanks Sabarish and Nick Would you happen to have some code snippets that you can share. Best Ayman On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan wrote: > Nick is right. I too have implemented this way and it works just fine. In my > case, there can be even more products. You simply broadc

Re: Loading lots of parquet files into dataframe from s3

2015-06-18 Thread lovelylavs
You can do something like this: ObjectListing objectListing; do { objectListing = s3Client.listObjects(listObjectsRequest); for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) { if ((objectSummary.get

Re: RE: Spark or Storm

2015-06-18 Thread Cody Koeninger
That general description is accurate, but not really a specific issue of the direct stream. It applies to anything consuming from kafka (or, as Matei already said, any streaming system really). You can't have exactly-once semantics unless you know something more about how you're storing results.

Re: Executor memory allocations

2015-06-18 Thread Richard Marscher
It would be the "40%", although it's probably better to think of it as shuffle vs. data cache and the remainder goes to tasks. As the comments for the shuffle memory fraction configuration clarify that it will be taking memory at the expense of the storage/data cache fraction: spark.shuffle.memory

Issue with PySpark UDF on a column of Vectors

2015-06-18 Thread calstad
I am having trouble using a UDF on a column of Vectors in PySpark which can be illustrated here: from pyspark import SparkContext from pyspark.sql import Row from pyspark.sql.types import DoubleType from pyspark.sql.functions import udf from pyspark.mllib.linalg import Vectors FeatureRow = Row('i

Re: Spark and Google Cloud Storage

2015-06-18 Thread Nick Pentreath
I believe it is available here: https://cloud.google.com/hadoop/google-cloud-storage-connector 2015-06-18 15:31 GMT+02:00 Klaus Schaefers : > Hi, > > is there a kind adapter to use GoogleCloudStorage with Spark? > > > Cheers, > > Klaus > > -- > > -- > > Klaus Schaefers > Senior Optimization Manag

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Eugen Cepoi
BTW I suggest this instead of using thread locals, as I am not sure in which situations Spark will reuse them or not. For example, if an error happens inside a thread, will Spark then create a new one, or is the error caught inside the thread, preventing it from stopping? So in short, does spark guarantee th

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-18 Thread Ji ZHANG
Hi, We switched from ParallelGC to CMS, and the symptom is gone. On Thu, Jun 4, 2015 at 3:37 PM, Ji ZHANG wrote: > Hi, > > I set spark.shuffle.io.preferDirectBufs to false in SparkConf and this > setting can be seen in web ui's environment tab. But, it still eats memory, > i.e. -Xmx set to 512M

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Eugen Cepoi
2015-06-18 15:17 GMT+02:00 Guillaume Pitel : > I was thinking exactly the same. I'm going to try it, It doesn't really > matter if I lose an executor, since its sketch will be lost, but then > reexecuted somewhere else. > > I mean that between the action that will update the sketches and the acti

Spark and Google Cloud Storage

2015-06-18 Thread Klaus Schaefers
Hi, is there a kind adapter to use GoogleCloudStorage with Spark? Cheers, Klaus -- -- Klaus Schaefers Senior Optimization Manager Ligatus GmbH Hohenstaufenring 30-32 D-50674 Köln Tel.: +49 (0) 221 / 56939 -784 Fax: +49 (0) 221 / 56 939 - 599 E-Mail: klaus.schaef...@ligatus.com Web: www

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Guillaume Pitel
I was thinking exactly the same. I'm going to try it. It doesn't really matter if I lose an executor, since its sketch will be lost, but then reexecuted somewhere else. And anyway, it's an approximate data structure, and what matters are ratios, not exact values. I mostly need to take care o

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Eugen Cepoi
Yeah that's the problem. There is probably some "perfect" number of partitions that provides the best balance between partition size and memory and merge overhead. Though it's not an ideal solution :( There could be another way, but very hacky... for example if you store one sketch in a singleton per j

Re: Best way to randomly distribute elements

2015-06-18 Thread Himanshu Mehra
Hi abellet, You can try RDD.randomSplit(weights array), where the weights array is the array of weights you want to put in the consecutive splits. For example, RDD.randomSplit(Array(0.7, 0.3)) will split the RDD in two, with 70% of the data in one and 30% in the other, randomly selecting the elements

Re: kafka spark streaming working example

2015-06-18 Thread Akhil Das
.setMaster("local") set it to local[2] or local[*] Thanks Best Regards On Thu, Jun 18, 2015 at 5:59 PM, Bartek Radziszewski wrote: > hi, > I'm trying to run simple kafka spark streaming example over spark-shell: > > sc.stop > import org.apache.spark.SparkConf > import org.apache.spark.SparkCont

kafka spark streaming working example

2015-06-18 Thread Bartek Radziszewski
hi, I'm trying to run simple kafka spark streaming example over spark-shell: sc.stop import org.apache.spark.SparkConf import org.apache.spark.SparkContext._ import kafka.serializer.DefaultDecoder import org.apache.spark.streaming._ import org.apache.spark.streaming.kafka._ import org.apache.spar

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Guillaume Pitel
Hi, Thank you for this confirmation. Coalescing is what we do now. It creates, however, very big partitions. Guillaume Hey, I am not 100% sure but from my understanding accumulators are per partition (so per task as its the same) and are sent back to the driver with the task result and merg

Re: connect mobile app with Spark backend

2015-06-18 Thread Akhil Das
Why not something like this: your mobile app pushes data to your webserver, which pushes the data to Kafka or Cassandra or any other database, and you have a Spark streaming job running all the time, operating on the incoming data and pushing the calculated values back. This way, you don't have to start a spark

Re: Best way to randomly distribute elements

2015-06-18 Thread ayan guha
how about generating the key using some 1-way hashing like md5? On Thu, Jun 18, 2015 at 9:59 PM, Guillaume Pitel wrote: > > I think you can randomly reshuffle your elements just by emitting a random > key (mapping a PairRdd's key triggers a reshuffle IIRC) > > yourrdd.map{ x => (rand(), x)} > >

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Eugen Cepoi
Hey, I am not 100% sure, but from my understanding accumulators are per partition (so per task, as it's the same) and are sent back to the driver with the task result and merged. When a task needs to be run n times (multiple rdds depend on this one, some partition loss later in the chain, etc.) then th

Re: Best way to randomly distribute elements

2015-06-18 Thread Guillaume Pitel
I think you can randomly reshuffle your elements just by emitting a random key (mapping a PairRdd's key triggers a reshuffle IIRC): yourrdd.map{ x => (rand(), x)} There is obviously a risk that rand() will give the same sequence of numbers in each partition, so you may need to use mapPartitionsWit
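A sketch of that approach with a per-partition seed, to avoid identical random sequences across partitions (the partition count is kept the same here):

    import scala.util.Random
    import org.apache.spark.HashPartitioner

    val shuffled = rdd
      .mapPartitionsWithIndex { (idx, iter) =>
        val rng = new Random(idx + System.nanoTime())   // different seed per partition
        iter.map(x => (rng.nextInt(), x))
      }
      .partitionBy(new HashPartitioner(rdd.partitions.length))
      .values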

Re: RE: Spark or Storm

2015-06-18 Thread bit1...@163.com
I am wondering how the direct stream API ensures end-to-end exactly-once semantics. I think there are two things involved: 1. From the Spark Streaming end, the driver will replay the offset range when it's down and restarted, which means that the new tasks will process some already processed data. 2.

Best way to randomly distribute elements

2015-06-18 Thread abellet
Hello, In the context of a machine learning algorithm, I need to be able to randomly distribute the elements of a large RDD across partitions (i.e., essentially assign each element to a random partition). How could I achieve this? I have tried to call repartition() with the current number of parti

Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Guillaume Pitel
Hi, I'm trying to figure out the smartest way to implement a global count-min-sketch on accumulators. For now, we are doing that with RDDs. It works well, but with one sketch per partition, merging takes too long. As you probably know, a count-min sketch is a big mutable array of array of in
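A rough sketch of a single global sketch via an Accumulable. Here CountMinSketch is a hypothetical mutable class of your own with add and mergeInPlace methods, and the merged value is only readable on the driver:

    import org.apache.spark.AccumulableParam

    class CMSParam extends AccumulableParam[CountMinSketch, String] {
      def zero(initial: CountMinSketch): CountMinSketch = initial
      def addAccumulator(sketch: CountMinSketch, item: String): CountMinSketch = {
        sketch.add(item); sketch
      }
      def addInPlace(a: CountMinSketch, b: CountMinSketch): CountMinSketch = {
        a.mergeInPlace(b); a
      }
    }
    // val acc = sc.accumulable(new CountMinSketch(depth = 5, width = 10000))(new CMSParam)
    // rdd.foreach(item => acc += item)
    // acc.value   // merged sketch, driver side only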

Re: deployment options for Spark and YARN w/ many app jar library dependencies

2015-06-18 Thread Sweeney, Matt
Thank you, Sandy! I'll investigate use of the extraClassPath variable. Both options are helpful. Thanks, Matt On Jun 17, 2015, at 8:01 PM, Sandy Ryza mailto:sandy.r...@cloudera.com>> wrote: Hi Matt, If you place your jars on HDFS in a public location, YARN will cache them on each node after

Got the exception when joining RDD with spark streamRDD

2015-06-18 Thread Groupme
Hi, I am writing a pyspark stream program. I have the training data set to compute the regression model. I want to use the stream data set to test the model. So, I join the RDD with the stream RDD, but I got the exception. Following are my source code, and the exception I got. Any help is appreciat

connect mobile app with Spark backend

2015-06-18 Thread Ralph Bergmann
Hi, I'm new to Spark and need some architecture tips :-) I need a way to connect the mobile app with the Spark backend, to upload data to and download data from the Spark backend. The use case is that the user does something with the app. These changes are uploaded to the backend. Spark calculate
