Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-05 Thread Rahul Bindlish
Tobias, understood, and thanks for the quick resolution of the problem. Thanks ~Rahul -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/serialization-issue-in-case-of-case-class-is-more-than-1-tp20334p20446.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Profiling GraphX codes.

2014-12-05 Thread Deep Pradhan
Is there any tool to profile GraphX code in a cluster? Is there a way to know the messages exchanged among the nodes in a cluster? The WebUI does not give all the information. Thank You

R: Clarifications on Spark

2014-12-05 Thread Paolo Platter
Hi, 1) Yes, you can. Spark supports a lot of file formats on HDFS/S3, and it also supports Cassandra and JDBC in general. 2) Yes. Spark has a JDBC Thrift server to which you can attach BI tools. I suggest you pay attention to your query response-time requirements. 3) No, you can go with Ca

Spark streaming for v1.1.1 - unable to start application

2014-12-05 Thread Sourav Chandra
Hi, I am getting the below error, and because of this there are no completed stages; all are waiting. 14/12/05 03:31:59 WARN AkkaUtils: Error sending message in 1 attempts java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPro

Issue on [SPARK-3877][YARN]: Return code of the spark-submit in yarn-cluster mode

2014-12-05 Thread LinQili
Hi, all: According to https://github.com/apache/spark/pull/2732, when a Spark job fails or exits nonzero in yarn-cluster mode, spark-submit should get the corresponding return code of the Spark job. But when I tried this on a Spark 1.1.1 YARN cluster, spark-submit returned zero anyway. Here is my spark code

RE: Issue on [SPARK-3877][YARN]: Return code of the spark-submit in yarn-cluster mode

2014-12-05 Thread LinQili
I tried in Spark client mode, and spark-submit can get the correct return code from the Spark job. But in yarn-cluster mode, it failed. From: lin_q...@outlook.com To: u...@spark.incubator.apache.org Subject: Issue on [SPARK-3877][YARN]: Return code of the spark-submit in yarn-cluster mode Date: Fri, 5 D

Re: SQL query in scala API

2014-12-05 Thread Cheng Lian
Oh, sorry. So neither SQL nor Spark SQL is preferred. Then you may write your own aggregation with aggregateByKey:

    users.aggregateByKey((0, Set.empty[String]))(
      { case ((count, seen), user) => (count + 1, seen + user) },
      { case ((count0, seen0), (count1, seen1)) => (count0 + count1, seen0 ++
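
A self-contained sketch of the aggregateByKey approach above, assuming `users` is an RDD of (key, userName) pairs; the data and names below are illustrative, not from the thread:

    // runnable e.g. in spark-shell, where a SparkContext `sc` already exists
    val users = sc.parallelize(Seq(("page1", "alice"), ("page1", "bob"), ("page2", "alice")))
    val stats = users.aggregateByKey((0, Set.empty[String]))(
      { case ((count, seen), user) => (count + 1, seen + user) },  // fold one record into the per-partition accumulator
      { case ((c0, s0), (c1, s1)) => (c0 + c1, s0 ++ s1) }         // merge accumulators across partitions
    )
    stats.collect().foreach(println)  // e.g. (page1,(2,Set(alice, bob))) and (page2,(1,Set(alice)))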

RE: Spark streaming for v1.1.1 - unable to start application

2014-12-05 Thread Shao, Saisai
Hi, I don’t think it’s a problem with Spark Streaming; judging from the call stack, the problem occurs when the BlockManager starts initializing itself. Would you mind checking your Spark configuration, hardware, and deployment? Mostly I think it’s not a problem in Spark itself. Thanks Saisai From:

RE: Issue on [SPARK-3877][YARN]: Return code of the spark-submit in yarn-cluster mode

2014-12-05 Thread LinQili
I tried another test code:

    def main(args: Array[String]) {
      if (args.length != 1) {
        Util.printLog("ERROR", "Args error - arg1: BASE_DIR")
        exit(101)
      }
      val currentFile = args(0).toString
      val DB = "test_spark"
      val tableName = "src"
      val sparkConf = new SparkConf().setApp

Re: Issue on [SPARK-3877][YARN]: Return code of the spark-submit in yarn-cluster mode

2014-12-05 Thread Shixiong Zhu
What's the status of this application in the yarn web UI? Best Regards, Shixiong Zhu 2014-12-05 17:22 GMT+08:00 LinQili : > I tried another test code: > def main(args: Array[String]) { > if (args.length != 1) { > Util.printLog("ERROR", "Args error - arg1: BASE_DIR") > exit(101)

Re: NullPointerException When Reading Avro Sequence Files

2014-12-05 Thread cjdc
Hi all, I've tried the above example on Gist, but it doesn't work (at least for me). Did anyone get this: 14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class w

Increasing the number of retry in case of job failure

2014-12-05 Thread shahab
Hello, For some (unknown) reason, some of my tasks that fetch data from Cassandra are failing quite often, and apparently the master removes a task that fails more than 4 times (in my case). Is there any way to increase the number of retries? best, /Shahab

[Graphx] which way is better to access faraway neighbors?

2014-12-05 Thread Yifan LI
Hi, I have a graph in which each vertex keeps several messages for some faraway neighbours (I mean, not only immediate neighbours, but neighbours at most k hops away, e.g. k = 5). Now, I propose to distribute these messages to their corresponding destinations (say, the "faraway neighbours"): - by using the pregel api

scala.MatchError on SparkSQL when creating ArrayType of StructType

2014-12-05 Thread Hao Ren
Hi, I am using SparkSQL on the 1.1.0 branch. The following code leads to a scala.MatchError at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247):

    val scm = StructType(inputRDD.schema.fields.init :+
      StructField("list", ArrayType(
        StructType(

Re: Spark-Streaming: output to cassandra

2014-12-05 Thread Jay Vyas
Here's an example of a Cassandra ETL that you can follow, which should exit on its own. I'm using it as a blueprint for revolving spark streaming apps on top of. For me, I kill the streaming app with System.exit after a sufficient amount of data is collected. That seems to work for most any scena

Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schma RDD

2014-12-05 Thread sahanbull
It worked man.. Thanks a lot :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-sparkSQL-to-convert-a-collection-of-python-dictionary-of-dictionaries-to-schma-RDD-tp20228p20461.html Sent from the Apache Spark User List mailing list archive at Nabble.com

How can I compile only the core and streaming (so that I can get test utilities of streaming)?

2014-12-05 Thread Emre Sevinc
Hello, I'm currently developing a Spark Streaming application and trying to write my first unit test. I've used Java for this application, and I also need to use Java (and JUnit) for writing the unit tests. I could not find any documentation that focuses on Spark Streaming unit testing; all I could find

Re: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2014-12-05 Thread Daniel Darabos
Hi, Alexey, I'm getting the same error on startup with Spark 1.1.0. Everything works fine fortunately. The error is mentioned in the logs in https://issues.apache.org/jira/browse/SPARK-4498, so maybe it will also be fixed in Spark 1.2.0 and 1.1.2. I have no insight into it unfortunately. On Tue,

Re: Does filter on an RDD scan every data item ?

2014-12-05 Thread nsareen
Any thoughts on how Spark SQL could help in our scenario? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-filter-on-an-RDD-scan-every-data-item-tp20170p20465.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: Spark-Streaming: output to cassandra

2014-12-05 Thread Helena Edelson
You can just do something like this; the Spark Cassandra Connector handles the rest:

    KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Map(KafkaTopicRaw -> 10), StorageLevel.DISK_ONLY_2)
      .map { case (_, line) => line.split(",") }
      .map(Ra

cartesian on pyspark not paralleised

2014-12-05 Thread Antony Mayi
Hi, using pyspark 1.1.0 on YARN 2.5.0. All operations run nicely in parallel - I can see multiple python processes spawned on each nodemanager - but for some reason, when running cartesian, there is only a single python process running on each node. The task is indicating thousands of partitions so

Re: Increasing the number of retry in case of job failure

2014-12-05 Thread Daniel Darabos
It is controlled by "spark.task.maxFailures". See http://spark.apache.org/docs/latest/configuration.html#scheduling. On Fri, Dec 5, 2014 at 11:02 AM, shahab wrote: > Hello, > > By some (unknown) reasons some of my tasks, that fetch data from > Cassandra, are failing so often, and apparently the
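
For reference, a minimal sketch of raising that limit when building the SparkConf (the app name and value are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cassandra-fetch-job")
      .set("spark.task.maxFailures", "8")   // default is 4
    val sc = new SparkContext(conf)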

Re: Market Basket Analysis

2014-12-05 Thread Sean Owen
Generally I don't think frequent-item-set algorithms are that useful. They're simple and not probabilistic; they don't tell you what sets occurred unusually frequently. Usually people ask for frequent item set algos when they really mean they want to compute item similarity or make recommendations.

spark streaming kafa best practices ?

2014-12-05 Thread david
hi, What is the best way to process a batch window in Spark Streaming:

    kafkaStream.foreachRDD(rdd => {
      rdd.collect().foreach(event => {
        // process the event
        process(event)
      })
    })

Or

    kafkaStream.foreachRDD(rdd => {
      rdd.map(event => {
        // pro

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-05 Thread Daniel Darabos
On Fri, Dec 5, 2014 at 7:12 AM, Tobias Pfeiffer wrote: > Rahul, > > On Fri, Dec 5, 2014 at 2:50 PM, Rahul Bindlish < > rahul.bindl...@nectechnologies.in> wrote: >> >> I have done so thats why spark is able to load objectfile [e.g. >> person_obj] >> and spark has maintained serialVersionUID [perso

Re: How can I compile only the core and streaming (so that I can get test utilities of streaming)?

2014-12-05 Thread Ted Yu
Please specify '-DskipTests' on the command line. Cheers On Dec 5, 2014, at 3:52 AM, Emre Sevinc wrote: > Hello, > > I'm currently developing a Spark Streaming application and trying to write my > first unit test. I've used Java for this application, and I also need to use > Java (and JUnit) for wr

subscribe me to the list

2014-12-05 Thread Wang, Ningjun (LNG-NPV)
I would like to subscribe to the user@spark.apache.org Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541

Re: subscribe me to the list

2014-12-05 Thread 张鹏
Hi Ningjun Please send email to this address to get subscribed: user-subscr...@spark.apache.org On Dec 5, 2014, at 10:36 PM, Wang, Ningjun (LNG-NPV) wrote: > I would like to subscribe to the user@spark.apache.org > > > Regards, > > Ningjun Wang > Consulting Software Engineer > LexisNexi

Why my default partition size is set to 52 ?

2014-12-05 Thread Jaonary Rabarisoa
Hi all, I'm trying to run some spark job with spark-shell. What I want to do is just to count the number of lines in a file. I start the spark-shell with the default argument i.e just with ./bin/spark-shell. Load the text file with sc.textFile("path") and then call count on my data. When I do th

Re: Why my default partition size is set to 52 ?

2014-12-05 Thread Sean Owen
How big is your file? it's probably of a size that the Hadoop InputFormat would make 52 splits for it. Data drives partitions, not processing resource. Really, 8 splits is the minimum parallelism you want. Several times your # of cores is better. On Fri, Dec 5, 2014 at 8:51 AM, Jaonary Rabarisoa
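
If you do want more parallelism than the input splits give you, a small sketch of the two usual knobs (the path and numbers are illustrative):

    // ask textFile for a minimum number of partitions up front...
    val lines = sc.textFile("hdfs:///path/to/data.txt", 64)
    // ...or repartition an existing RDD after the fact
    val moreParts = lines.repartition(64)
    println(moreParts.partitions.length)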

Re: Spark-Streaming: output to cassandra

2014-12-05 Thread m.sarosh
Hi Akhil, Vyas, Helena, Thank you for your suggestions. As Akhil suggested earlier, I have implemented the batch Duration in the JavaStreamingContext and waitForTermination(Duration). The approach Helena suggested is Scala-oriented. But the issue now is that I want to set Cassandra as my output.

Re: Why my default partition size is set to 52 ?

2014-12-05 Thread Jaonary Rabarisoa
OK, I misunderstood the meaning of the partitions. In fact, my file is 1.7 GB, and with a smaller file I get a different number of partitions. Thanks for this clarification. On Fri, Dec 5, 2014 at 4:15 PM, Sean Owen wrote: > How big is your file? it's probably of a size that the Hadoop > InputForm

Re: Spark-Streaming: output to cassandra

2014-12-05 Thread Helena Edelson
I think what you are looking for is something like:

    JavaRDD pricesRDD = javaFunctions(sc).cassandraTable("ks", "tab", mapColumnTo(Double.class)).select("price");
    JavaRDD rdd = javaFunctions(sc).cassandraTable("ks", "people", mapRowTo(Person.class));

noted here: https://github.com/datastax/spa

Why KMeans with mllib is so slow ?

2014-12-05 Thread Jaonary Rabarisoa
Hi all, I'm trying to run clustering with the k-means algorithm. The size of my data set is about 240k vectors of dimension 384. Solving the problem with the kmeans available in julia (kmeans++) http://clusteringjl.readthedocs.org/en/latest/kmeans.html takes about 8 minutes on a single core. Solvin

Re: How can I compile only the core and streaming (so that I can get test utilities of streaming)?

2014-12-05 Thread Emre Sevinc
Hello, Specifying '-DskipTests' on the command line worked, though I can't be sure whether first running 'sbt assembly' also contributed to the solution. (I tried 'sbt assembly' because branch-1.1's README says to use sbt.) Thanks for the answer. Kind regards, Emre Sevinç

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Sean Owen
Spark has much more overhead, since it's set up to distribute the computation. Julia isn't distributed, and so has no such overhead in a completely in-core implementation. You generally use Spark when you have a problem large enough to warrant distributing, or, your data already lives in a distribu

pyspark exception catch

2014-12-05 Thread Igor Mazor
Hi, Is it possible to catch exceptions using pyspark so that in case of error the program will not fail and exit? For example, if I am using (key, value) RDD functionality but the data doesn't actually have the (key, value) format, pyspark will throw an exception (like ValueError) that I am unable to catch.

Re: Spark-Streaming: output to cassandra

2014-12-05 Thread m.sarosh
Thank you Helena, But I would like to explain my problem space: The output is supposed to be Cassandra. To achieve that, I have to use the spark-cassandra-connector APIs. So, going in a bottom-up approach, to write to cassandra, I have to use: javaFunctions( rdd, TestTable.class).saveToCassandra

Re: Market Basket Analysis

2014-12-05 Thread Rohit Pujari
This is a typical use case: "people who buy electric razors also tend to buy batteries and shaving gel along with them". The goal is to build a model which will look through POS records and find which product categories have a higher likelihood of appearing together in a given transaction. What would

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Jaonary Rabarisoa
Hmm, here I use Spark in local mode on my laptop with 8 cores. The data is on my local filesystem. Even though there is an overhead due to the distributed computation, I find the difference between the runtimes of the two implementations really, really huge. Is there a benchmark on how well the alg

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Davies Liu
Could you post your script to reproduce the results (and also how to generate the dataset)? That will help us to investigate it. On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa wrote: > Hmm, here I use Spark in local mode on my laptop with 8 cores. The data is > on my local filesystem. Even though

Re: How can I compile only the core and streaming (so that I can get test utilities of streaming)?

2014-12-05 Thread Ted Yu
I don't think 'sbt assembly' would touch local maven repo for Spark. Looking at dependency:tree output: [INFO] org.apache.spark:spark-streaming_2.10:jar:1.1.0-SNAPSHOT [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0-SNAPSHOT:compile spark-streaming only depends on spark-core other than thir

Spark Streaming Reusing JDBC Connections

2014-12-05 Thread Asim Jalis
Is there a way I can have a JDBC connection open throughout a streaming job? I have a foreach which is running once per batch. However, I don’t want to open the connection for each batch but would rather have a persistent connection that I can reuse. How can I do this? Thanks. Asim

Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-05 Thread Ashic Mahtab
Hi, It seems adding the cassandra connector and spark streaming causes "issues". I've added my build and code file. Running "sbt compile" gives weird errors like Seconds is not part of org.apache.spark.streaming and object Receiver is not a member of package org.apache.spark.streaming.receiver. If

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Jaonary Rabarisoa
The code is really simple:

    object TestKMeans {
      def main(args: Array[String]) {
        val conf = new SparkConf()
          .setAppName("Test KMeans")
          .setMaster("local[8]")
          .set("spark.executor.memory", "8g")
        val sc = new SparkContext(conf)
        val numClusters = 500;

RE: Spark Streaming Reusing JDBC Connections

2014-12-05 Thread Ashic Mahtab
I've done this: 1. foreachPartition 2. Open connection. 3. foreach inside the partition. 4. close the connection. Slightly crufty, but works. Would love to see a better approach. Regards, Ashic. Date: Fri, 5 Dec 2014 12:32:24 -0500 Subject: Spark Streaming Reusing JDBC Connections From: asimja.
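
A sketch of that foreachPartition pattern on a DStream, assuming `dstream` is your input stream; the JDBC URL, credentials, and SQL are placeholders, not from the original message:

    import java.sql.DriverManager

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // 2. open one connection per partition
        val conn = DriverManager.getConnection("jdbc:postgresql://dbhost/mydb", "user", "pass")
        val stmt = conn.prepareStatement("INSERT INTO events(payload) VALUES (?)")
        // 3. write every record in this partition through the same connection
        records.foreach { r =>
          stmt.setString(1, r.toString)
          stmt.executeUpdate()
        }
        // 4. close once the partition is done
        conn.close()
      }
    }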

RE: Market Basket Analysis

2014-12-05 Thread Ashic Mahtab
This can definitely be useful. "Frequently bought together" is something amazon does, though surprisingly, you don't get a discount. Perhaps it can lead to offering (or avoiding!) deals on frequent itemsets. This is a good resource for frequent itemsets implementations: http://infolab.stanford.

Optimized spark configuration

2014-12-05 Thread vdiwakar.malladi
Hi, Could anyone help with what would be a better / optimized configuration for driver memory, worker memory, degree of parallelism, etc., when we are running 1 master node (which itself also acts as a slave node) and 1 slave node? Both have 32 GB RAM and 4 cores. On this, I

I am having problems reading files in the 4GB range

2014-12-05 Thread Steve Lewis
I am using a custom hadoop input format which works well on smaller files but fails with a file at about 4GB size - the format is generating about 800 splits and all variables in my code are longs - Any suggestions? Is anyone reading files of this size? Exception in thread "main" org.apache.spark

Re: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-05 Thread Ted Yu
Can you try with maven?

    diff --git a/streaming/pom.xml b/streaming/pom.xml
    index b8b8f2e..6cc8102 100644
    --- a/streaming/pom.xml
    +++ b/streaming/pom.xml
    @@ -68,6 +68,11 @@
           junit-interface
           test
    +
    +      com.datastax.spark
    +      spark-cassandra-connector_2.10
    +      1.1.0
    +

Re: Spark streaming for v1.1.1 - unable to start application

2014-12-05 Thread Andrew Or
Hey Sourav, are you able to run a simple shuffle in a spark-shell? 2014-12-05 1:20 GMT-08:00 Shao, Saisai : > Hi, > > > > I don’t think it’s a problem of Spark Streaming, seeing for call stack, > it’s the problem when BlockManager starting to initializing itself. Would > you mind checking your c

Re: Unable to run applications on clusters on EC2

2014-12-05 Thread Andrew Or
Hey, the default port is 7077. Not sure if you actually meant to put 7070. As a rule of thumb, you can go to the Master web UI and copy and paste the URL at the top left corner. That almost always works unless your cluster has a weird proxy set up. 2014-12-04 14:26 GMT-08:00 Xingwei Yang : > I th
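
For example, with a placeholder host, class, and jar (copy the real URL from the Master web UI):

    ./bin/spark-submit \
      --master spark://master-host:7077 \
      --class com.example.MyApp \
      my-app.jar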

Re: spark-submit on YARN is slow

2014-12-05 Thread Andrew Or
Hey Tobias, As you suspect, the reason why it's slow is because the resource manager in YARN takes a while to grant resources. This is because YARN needs to first set up the application master container, and then this AM needs to request more containers for Spark executors. I think this accounts f

Re: [Graphx] which way is better to access faraway neighbors?

2014-12-05 Thread Ankur Dave
At 2014-12-05 02:26:52 -0800, Yifan LI wrote: > I have a graph in where each vertex keep several messages to some faraway > neighbours(I mean, not to only immediate neighbours, at most k-hops far, e.g. > k = 5). > > now, I propose to distribute these messages to their corresponding > destinatio

Re: spark-submit on YARN is slow

2014-12-05 Thread Sandy Ryza
Hi Tobias, What version are you using? In some recent versions, we had a couple of large hardcoded sleeps on the Spark side. -Sandy On Fri, Dec 5, 2014 at 11:15 AM, Andrew Or wrote: > Hey Tobias, > > As you suspect, the reason why it's slow is because the resource manager > in YARN takes a wh

Re: Monitoring Spark

2014-12-05 Thread Andrew Or
If you're only interested in a particular instant, a simpler way is to check the executors page on the Spark UI: http://spark.apache.org/docs/latest/monitoring.html. By default each executor runs one task per core, so you can see how many tasks are being run at a given time and this translates dire

RE: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-05 Thread Ashic Mahtab
Sorry...I really don't have enough maven know-how to do this quickly. I tried the pom below, and IntelliJ could find org.apache.spark.streaming.StreamingContext and org.apache.spark.streaming.Seconds, but not org.apache.spark.streaming.receiver.Receiver. Is there something specific I can try? I'l

Re: Issue in executing Spark Application from Eclipse

2014-12-05 Thread Andrew Or
Hey Stuti, Did you start your standalone Master and Workers? You can do this through sbin/start-all.sh (see http://spark.apache.org/docs/latest/spark-standalone.html). Otherwise, I would recommend launching your application from the command line through bin/spark-submit. I am not sure if we offici

Java RDD Union

2014-12-05 Thread Ron Ayoub
I'm a bit confused regarding expected behavior of unions. I'm running on 8 cores. I have an RDD that is used to collect cluster associations (cluster id, content id, distance) for internal clusters as well as leaf clusters since I'm doing hierarchical k-means and need all distances for sorting d

Re: Any ideas why a few tasks would stall

2014-12-05 Thread Andrew Or
Hi Steve et al., It is possible that there's just a lot of skew in your data, in which case repartitioning is a good idea. Depending on how large your input data is and how much skew you have, you may want to repartition to a larger number of partitions. By the way you can just call rdd.repartitio

Re: spark-submit on YARN is slow

2014-12-05 Thread Denny Lee
My submissions of Spark on YARN (CDH 5.2) resulted in a few thousand steps. If I was running this on standalone cluster mode the query finished in 55s but on YARN, the query was still running 30min later. Would the hard coded sleeps potentially be in play here? On Fri, Dec 5, 2014 at 11:23 Sandy Ry

Re: Increasing the number of retry in case of job failure

2014-12-05 Thread Andrew Or
Increasing max failures is a way to do it, but it's probably a better idea to keep your tasks from failing in the first place. Are your tasks failing with exceptions from Spark or your application code? If from Spark, what is the stack trace? There might be a legitimate Spark bug such that even inc

Re: drop table if exists throws exception

2014-12-05 Thread Michael Armbrust
The command runs fine for me on master. Note that Hive does print an exception in the logs, but that exception does not propagate to user code. On Thu, Dec 4, 2014 at 11:31 PM, Jianshi Huang wrote: > Hi, > > I got an exception saying Hive: NoSuchObjectException(message: table > not found) > > when

Re: drop table if exists throws exception

2014-12-05 Thread Mark Hamstra
And that is no different from how Hive has worked for a long time. On Fri, Dec 5, 2014 at 11:42 AM, Michael Armbrust wrote: > The command runs fine for me on master. Note that Hive does print an > exception in the logs, but that exception does not propagate to user code. > > On Thu, Dec 4, 2014

Re: SchemaRDD partition on specific column values?

2014-12-05 Thread Michael Armbrust
It does not appear that the in-memory caching currently preserves the information about the partitioning of the data so this optimization will probably not work. On Thu, Dec 4, 2014 at 8:42 PM, nitin wrote: > With some quick googling, I learnt that I can we can provide "distribute by > " in hive

Re: spark-submit on YARN is slow

2014-12-05 Thread Sameer Farooqui
Just an FYI - I can submit the SparkPi app to YARN in cluster mode on a 1-node m3.xlarge EC2 instance instance and the app finishes running successfully in about 40 seconds. I just figured the 30 - 40 sec run time was normal b/c of the submitting overhead that Andrew mentioned. Denny, you can mayb

Re: spark-submit on YARN is slow

2014-12-05 Thread Sandy Ryza
Hi Denny, Those sleeps were only at startup, so if jobs are taking significantly longer on YARN, that should be a different problem. When you ran on YARN, did you use the --executor-cores, --executor-memory, and --num-executors arguments? When running against a standalone cluster, by default Spa

Re: Java RDD Union

2014-12-05 Thread Sean Owen
No, RDDs are immutable. union() creates a new RDD, and does not modify an existing RDD. Maybe this obviates the question. I'm not sure what you mean about releasing from memory. If you want to repartition the unioned RDD, you repartition the result of union(), not anything else. On Fri, Dec 5, 201

Including data nucleus tools

2014-12-05 Thread spark.dubovsky.jakub
Hi all, I have created an assembly jar from the 1.2 snapshot source by running [1], which sets the correct version of hadoop for our cluster and uses the hive profile. I have also written a relatively simple test program which starts by reading data from parquet using a hive context. I compile the code against as

Re: Java RDD Union

2014-12-05 Thread Sameer Farooqui
Hi Ron, Out of curiosity, why do you think that union is modifying an existing RDD in place? In general all transformations, including union, will create new RDDs, not modify old RDDs in place. Here's a quick test:

    scala> val firstRDD = sc.parallelize(1 to 5)
    firstRDD: org.apache.spark.rdd.RDD[I

Re: Market Basket Analysis

2014-12-05 Thread Sean Owen
I doubt Amazon uses Apriori for this, but who knows. Usually you want "also bought" functionality, which is a form of similar-item computation. But you don't want to favor items that are simply frequently purchased in general. You probably want to look at pairs of items that co-occur in purchase

Re: spark-submit on YARN is slow

2014-12-05 Thread Arun Ahuja
Hey Sandy, What are those sleeps for and do they still exist? We have seen about a 1min to 1:30 executor startup time, which is a large chunk for jobs that run in ~10min. Thanks, Arun On Fri, Dec 5, 2014 at 3:20 PM, Sandy Ryza wrote: > Hi Denny, > > Those sleeps were only at startup, so if jo

Re: spark-submit on YARN is slow

2014-12-05 Thread Ashish Rangole
Likely this is not the case here, yet one thing to point out with YARN parameters like --num-executors is that they should be specified *before* the app jar and app args on the spark-submit command line; otherwise the app only gets the default number of containers, which is 2. On Dec 5, 2014 12:22 PM, "Sandy Ryz
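
In sketch form, with placeholder class, jar, and values (options before the jar are consumed by spark-submit; anything after the jar goes to the application itself):

    ./bin/spark-submit \
      --master yarn-cluster \
      --num-executors 10 \
      --executor-cores 2 \
      --executor-memory 4g \
      --class com.example.MyApp \
      my-app.jar appArg1 appArg2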

Re: spark-submit on YARN is slow

2014-12-05 Thread Sandy Ryza
Hey Arun, The sleeps would only cause maximum like 5 second overhead. The idea was to give executors some time to register. On more recent versions, they were replaced with the spark.scheduler.minRegisteredResourcesRatio and spark.scheduler.maxRegisteredResourcesWaitingTime. As of 1.1, by defau

Re: Java RDD Union

2014-12-05 Thread Sean Owen
foreach also creates a new RDD, and does not modify an existing RDD. However, in practice, nothing stops you from fiddling with the Java objects inside an RDD when you get a reference to them in a method like this. This is definitely a bad idea, as there is certainly no guarantee that any other ope

R: Optimized spark configuration

2014-12-05 Thread Paolo Platter
What kind of query are you performing? You should set something like 2 partitions per core, which would be 400 MB per partition. As you have a lot of RAM, I suggest caching the whole table; performance will increase a lot. Paolo Sent from my Windows Phone From: vd
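
A rough sketch of the caching suggestion in Spark SQL (the table name and query are illustrative):

    // e.g. in spark-shell, with an existing SQLContext or HiveContext named sqlContext
    sqlContext.cacheTable("events")   // keeps the whole table in memory after the first scan
    sqlContext.sql("SELECT category, count(*) FROM events GROUP BY category").collect()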

Re: scala.MatchError on SparkSQL when creating ArrayType of StructType

2014-12-05 Thread Michael Armbrust
All values in Hive are always nullable, though you should still not be seeing this error. It should be addressed by this patch: https://github.com/apache/spark/pull/3150 On Fri, Dec 5, 2014 at 2:36 AM, Hao Ren wrote: > Hi, > > I am using SparkSQL on 1.1.0 branch. > > The following code leads to

Re: spark-submit on YARN is slow

2014-12-05 Thread Andrew Or
Hey Arun I've seen that behavior before. It happens when the cluster doesn't have enough resources to offer and the RM hasn't given us our containers yet. Can you check the RM Web UI at port 8088 to see whether your application is requesting more resources than the cluster has to offer? 2014-12-05

Re: Using data in RDD to specify HDFS directory to write to

2014-12-05 Thread Nathan Murthy
I'm experiencing the same problem when I try to run my app in a standalone Spark cluster. My use case, however, is closer to the problem documented in this thread: http://apache-spark-user-list.1001560.n3.nabble.com/Please-help-running-a-standalone-app-on-a-Spark-cluster-td1596.html. The solution

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread DB Tsai
Also, are you using the latest master in this experiment? A PR merged into master a couple of days ago will speed up k-means by three times. See https://github.com/apache/spark/commit/7fc49ed91168999d24ae7b4cc46fbb4ec87febc1 Sincerely, DB Tsai

Re: Including data nucleus tools

2014-12-05 Thread DB Tsai
Can you try to run the same job using the assembly packaged by make-distribution as we discussed in the other thread. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Dec 5, 2014 at 12

Cannot PredictOnValues or PredictOn base on the model build with StreamingLinearRegressionWithSGD

2014-12-05 Thread Bui, Tri
Hi, The following example code is able to build the correct model.weights, but its prediction values are zero. Am I passing the input to predictOnValues incorrectly? I also coded a batch version based on LinearRegressionWithSGD() with the same training and test data, iteration, and step-size settings, and it was
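
For comparison, the canonical streaming usage looks roughly like this; numFeatures and the input DStreams are placeholders, and predictOnValues keeps the label as the key so it can be compared with the prediction:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

    // trainingData and testData are DStream[LabeledPoint], e.g. built from textFileStream(...).map(LabeledPoint.parse)
    val numFeatures = 3
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures))

    model.trainOn(trainingData)
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()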

Running two different Spark jobs vs multi-threading RDDs

2014-12-05 Thread Corey Nolet
I've read in the documentation that RDDs can be run concurrently when submitted in separate threads. I'm curious how the scheduler would handle propagating these down to the tasks. I have 3 RDDs: - one RDD which loads some initial data, transforms it and caches it - two RDDs which use the cached R
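
A minimal sketch of the threading side (data and transformations are placeholders): two actions submitted from separate threads share the same SparkContext, and the cached parent RDD is reused once it has been materialized.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global

    // the shared parent RDD: load, transform, cache
    val base = sc.textFile("hdfs:///path/to/input").map(_.toLowerCase).cache()

    // two downstream jobs submitted concurrently from separate threads
    val job1 = Future { base.filter(_.contains("error")).count() }
    val job2 = Future { base.distinct().saveAsTextFile("hdfs:///path/to/output") }

    println(Await.result(job1, Duration.Inf))
    Await.result(job2, Duration.Inf)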

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Marcelo Vanzin
You can set SPARK_PREPEND_CLASSES=1 and it should pick up your new mllib classes whenever you compile them. I don't see anything similar for examples/, so if you modify example code you need to re-build the examples module ("package" or "install" - just "compile" won't work, since you need to build t
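
A sketch of what that workflow can look like on the command line (the module names are just examples):

    export SPARK_PREPEND_CLASSES=1          # launch scripts then prefer freshly compiled classes
    mvn -pl mllib compile                   # enough for library modules such as mllib
    mvn -pl examples package -DskipTests    # examples need package/install, per the note above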

Re: Market Basket Analysis

2014-12-05 Thread Debasish Das
Apriori can be thought of as post-processing on a product-similarity graph...I call it product similarity, but for each product you build a node which keeps the distinct users visiting the product, and two product nodes are connected by an edge if the intersection > 0...you are assuming that if no one user visit

Transfer from RDD to JavaRDD

2014-12-05 Thread Xingwei Yang
I use Spark in Java. I want to access the vectors of a RowMatrix M, so I use M.rows(), which is an RDD. I want to transform it to a JavaRDD, so I used the following command: JavaRDD data = JavaRDD.fromRDD(M.rows(), scala.reflect.ClassTag$.MODULE$.apply(Vector.class)); However, it shows an error like th

Re: spark streaming kafa best practices ?

2014-12-05 Thread Patrick Wendell
The second choice is better. Once you call collect() you are pulling all of the data onto a single node, you want to do most of the processing in parallel on the cluster, which is what map() will do. Ideally you'd try to summarize the data or reduce it before calling collect(). On Fri, Dec 5, 201
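
In sketch form, the second variant keeps the work on the executors and only a small summary comes back to the driver (process(...) is the user's own function from the question; the count is illustrative):

    kafkaStream.foreachRDD { rdd =>
      // map runs in parallel on the executors; nothing is pulled back yet
      val processed = rdd.map(event => process(event))
      // bring only a small summary to the driver
      val n = processed.count()
      println("processed " + n + " events in this batch")
    }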

Re: Stateful mapPartitions

2014-12-05 Thread Patrick Wendell
Yeah the main way to do this would be to have your own static cache of connections. These could be using an object in Scala or just a static variable in Java (for instance a set of connections that you can borrow from). - Patrick On Thu, Dec 4, 2014 at 5:26 PM, Tobias Pfeiffer wrote: > Hi, > > O
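
A minimal sketch of that static-cache idea in Scala, assuming `dstream` is your input stream; the object name and JDBC URL are made up for illustration. The object lives once per executor JVM, so the connection is created lazily and reused across batches.

    import java.sql.{Connection, DriverManager}

    object ConnectionHolder {                        // one instance per executor JVM
      private var conn: Connection = _
      def get(url: String): Connection = synchronized {
        if (conn == null || conn.isClosed) {
          conn = DriverManager.getConnection(url)    // created on first use, then reused
        }
        conn
      }
    }

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conn = ConnectionHolder.get("jdbc:postgresql://dbhost/mydb")
        records.foreach { r =>
          // write r using conn, e.g. via a prepared statement
          ()
        }
        // intentionally not closed here, so later batches can reuse it
      }
    }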

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Koert Kuipers
i suddenly also ran into the issue that maven is trying to download snapshots that don't exist for other sub-projects. did something change in the maven build? does maven not have the capability to smartly compile the other sub-projects that a sub-project depends on? i'd rather avoid "mvn install" sin

Re: drop table if exists throws exception

2014-12-05 Thread Jianshi Huang
I see. The resulting SchemaRDD is returned, so, like Michael said, the exception does not propagate to user code. However, printing out the following log is confusing :) scala> sql("drop table if exists abc") 14/12/05 16:27:02 INFO ParseDriver: Parsing command: drop table if exists abc 14/12/05 16:2

Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-05 Thread Jianshi Huang
Hi, I had to use Pig for some preprocessing and to generate Parquet files for Spark to consume. However, due to Pig's limitation, the generated schema contains Pig's identifier e.g. sorted::id, sorted::cre_ts, ... I tried to put the schema inside CREATE EXTERNAL TABLE, e.g. create external t

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Koert Kuipers
i think what changed is that core now has dependencies on other sub projects. ok... so i am forced to install stuff because maven cannot compile "what is needed". i will install On Fri, Dec 5, 2014 at 7:12 PM, Koert Kuipers wrote: > i suddenly also run into the issue that maven is trying to down

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Sean Owen
Maven definitely compiles "what is needed", but not if you tell it to only compile one module alone. Unless you have previously built and installed the other local snapshot artifacts it needs, that invocation can't proceed because you have restricted it to build one module whose dependencies don't

Re: Transfer from RDD to JavaRDD

2014-12-05 Thread Sean Owen
You can probably get around it with casting, but I ended up using wrapRDD -- which is not a static method -- from another JavaRDD in scope to address this more directly without casting or warnings. It's not ideal but both should work, just a matter of which you think is less hacky. On Fri, Dec 5,

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Marcelo Vanzin
I've never used it, but reading the help it seems the "-am" option might help here. On Fri, Dec 5, 2014 at 4:47 PM, Sean Owen wrote: > Maven definitely compiles "what is needed", but not if you tell it to > only compile one module alone. Unless you have previously built and > installed the other

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Ted Yu
I tried the following:

    rm -rf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.3.0-SNAPSHOT/
    mvn -am -pl streaming package -DskipTests

    [INFO] Reactor Summary:
    [INFO]
    [INFO] Spark Project Parent POM .. SUCCESS [4.976s]
    [INFO] Spark Project Networking ..

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-05 Thread Jianshi Huang
Here's the solution I got after talking with Liancheng: 1) using backquotes `..` to wrap up all illegal characters:

    val rdd = parquetFile(file)
    val schema = rdd.schema.fields.map(f => s"`${f.name}` ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")
    val ddl_13 = s"""

Problems creating and reading a large test file

2014-12-05 Thread Steve Lewis
I am trying to look at problems reading a data file over 4G. In my testing I am trying to create such a file. My plan is to create a fasta file (a simple format used in biology) looking like >1 TCCTTACGGAGTTCGGGTGTTTATCTTACTTATCGCGGTTCGCTGCCGCTCCGGGAGCCCGGATAGGCTGCGTTAATACCTAAGGAGCGCGTATTG >2 G

Re: rdd.saveAsTextFile problem

2014-12-05 Thread dylanhogg
Try the workaround for Windows found here: http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7. This fixed the issue when calling rdd.saveAsTextFile(..) for me with Spark v1.1.0 on Windows 8.1 in local mode. Summary of steps: 1) download the compiled winutils.exe from http://social.m

Re: Issue on [SPARK-3877][YARN]: Return code of the spark-submit in yarn-cluster mode

2014-12-05 Thread Shixiong Zhu
There were two exits in this code. If the args were wrong, spark-submit would get the return code 101, but if the args were correct, spark-submit could not get the second return code 100. What’s the difference between these two exits? I was so confused. I’m also confused. When I tried your code, spar
