Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Thomas Dudziak
I read the other day that there will be a fair number of improvements in 1.4 for Mesos. Could I ask for one more (if it isn't already in there): a configurable limit for the number of tasks for jobs run on Mesos? This would be a very simple yet effective way to prevent a job from dominating the cluster.

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Matei Zaharia
Hey Tom, Are you using the fine-grained or coarse-grained scheduler? For the coarse-grained scheduler, there is a spark.cores.max config setting that will limit the total # of cores it grabs. This was there in earlier versions too. Matei > On May 19, 2015, at 12:39 PM, Thomas Dudziak wrote: >
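A minimal Scala sketch of the setting Matei mentions, assuming the coarse-grained Mesos scheduler; the master URL and the cap value are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Cap the total number of cores this application can grab across the cluster.
    val conf = new SparkConf()
      .setMaster("mesos://zk://zk1:2181/mesos") // placeholder Mesos master URL
      .setAppName("capped-app")
      .set("spark.cores.max", "48")             // hypothetical cap
    val sc = new SparkContext(conf)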

PanTera Big Data Visualization built with Spark

2015-05-19 Thread Cyrus Handy
Hi, Can you please add us to the list of Spark Users Org: PanTera URL: http://pantera.io Components we are using: - PanTera uses direct access to the Spark Scala API - Spark Core - SparkContext, JavaSparkContext, SparkConf, RDD, JavaRDD, - Accumulable, AccumulableParam, Accumulator, A

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Thomas Dudziak
I'm using fine-grained for a multi-tenant environment which is why I would welcome the limit of tasks per job :) cheers, Tom On Tue, May 19, 2015 at 10:05 AM, Matei Zaharia wrote: > Hey Tom, > > Are you using the fine-grained or coarse-grained scheduler? For the > coarse-grained scheduler, ther

Re: spark streaming doubt

2015-05-19 Thread Shushant Arora
Thanks Akhil and Dibyendu. In high-level receiver-based streaming, do executors run on the receiver nodes themselves to get data locality? Or is the data always transferred to executor nodes, with executor nodes differing in each run of the job while the receiver node remains the same (same machines) throughout the life of the stre

Re: Getting the best parameter set back from CrossValidatorModel

2015-05-19 Thread Joseph Bradley
Hi Justin & Ram, To clarify, PipelineModel.stages is not private[ml]; only the PipelineModel constructor is private[ml]. So it's safe to use pipelineModel.stages as a Spark user. Ram's example looks good. Btw, in Spark 1.4 (and the current master build), we've made a number of improvements to P

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Matei Zaharia
Yeah, this definitely seems useful there. There might also be some ways to cap the application in Mesos, but I'm not sure. Matei > On May 19, 2015, at 1:11 PM, Thomas Dudziak wrote: > > I'm using fine-grained for a multi-tenant environment which is why I would > welcome the limit of tasks per

Spark Streaming + Kafka failure recovery

2015-05-19 Thread Bill Jay
Hi all, I am currently using Spark streaming to consume and save logs every hour in our production pipeline. The current setting is to run a crontab job to check every minute whether the job is still there and if not resubmit a Spark streaming job. I am currently using the direct approach for Kafk

Re: Spark Streaming + Kafka failure recovery

2015-05-19 Thread Cody Koeninger
Have you read https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md ? 1. There's nothing preventing that. 2. Checkpointing will give you at-least-once semantics, provided you have sufficient kafka retention. Be aware that checkpoints aren't recoverable if you upgrade code. On
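A sketch of the checkpoint-based recovery path the blog post covers, assuming Spark 1.3+ with the Kafka direct stream; broker, topic, checkpoint directory, and output path are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val checkpointDir = "hdfs:///tmp/streaming-checkpoint" // placeholder

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("kafka-direct-recovery")
      val ssc = new StreamingContext(conf, Seconds(60))
      ssc.checkpoint(checkpointDir)
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("logs")) // placeholder topic
      stream.foreachRDD(rdd => rdd.saveAsTextFile("hdfs:///tmp/out-" + System.currentTimeMillis)) // placeholder sink
      ssc
    }

    // On restart, consumption resumes from the checkpointed offsets if a checkpoint exists.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()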

Problem querying RDD using HiveThriftServer2.startWithContext functionality

2015-05-19 Thread fdmitriy
Hi, I am trying to query a Spark RDD using the HiveThriftServer2.startWithContext functionality and getting the following Exception: 15/05/19 13:26:43 WARN thrift.ThriftCLIService: Error executing statement: java.lang.RuntimeException: java.lang.NullPointerException at org.apache.hive.se

Code error

2015-05-19 Thread Ricardo Goncalves da Silva
Hi, Can anybody see what's wrong in this piece of code: ./bin/spark-shell --num-executors 2 --executor-memory 512m --master yarn-client import org.apache.spark.mllib.clustering.KMeans import org.apache.spark.mllib.linalg.Vectors val data = sc.textFile("/user/p_loadbd/fraude5.csv").map(x => x.
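The snippet is cut off; a minimal runnable sketch of the standard MLlib k-means flow it appears to follow (file layout, k, and iteration count are assumptions):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.textFile("/user/p_loadbd/fraude5.csv")
    // Assumes comma-separated numeric columns; adjust the parsing to the real layout.
    val parsed = data.map(line => Vectors.dense(line.split(',').map(_.toDouble))).cache()
    val model = KMeans.train(parsed, 5, 20) // k = 5, maxIterations = 20 (placeholders)
    val wssse = model.computeCost(parsed)   // computeCost returns a Double
    println(s"Within-set sum of squared errors = $wssse")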

Re: MLlib libsvm isssues with data

2015-05-19 Thread Xiangrui Meng
The index should start from 1 for LIBSVM format, as defined in the README of LIBSVM (https://github.com/cjlin1/libsvm/blob/master/README#L64). The only exception is the precomputed kernel, which MLlib doesn't support. -Xiangrui On Wed, May 6, 2015 at 1:42 AM, doyere wrote: > Hi all, > > After do

Re: RandomSplit with Spark-ML and Dataframe

2015-05-19 Thread Xiangrui Meng
In 1.4, we added RAND as a DataFrame expression, which can be used for random split. Please check the example here: https://github.com/apache/spark/blob/master/python/pyspark/ml/tuning.py#L214. -Xiangrui On Thu, May 7, 2015 at 8:39 AM, Olivier Girardot wrote: > Hi, > is there any best practice to
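A Scala sketch of a rand()-based split, assuming Spark 1.4 where rand is available as a DataFrame expression and an 80/20 split is wanted:

    import org.apache.spark.sql.functions.rand

    // df is an existing DataFrame (assumption); rand(42) adds a uniform [0, 1) column.
    val withRnd = df.withColumn("rnd", rand(42))
    val training = withRnd.filter(withRnd("rnd") < 0.8).drop("rnd")
    val test = withRnd.filter(withRnd("rnd") >= 0.8).drop("rnd")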

Re: User Defined Type (UDT)

2015-05-19 Thread Xiangrui Meng
(Note that UDT is not a public API yet.) On Thu, May 7, 2015 at 7:11 AM, wjur wrote: > Hi all! > > I'm using Spark 1.3.0 and I'm struggling with a definition of a new type for > a project I'm working on. I've created a case class Person(name: String) and > now I'm trying to make Spark to be able

Re: Code error

2015-05-19 Thread Stephen Boesch
Hi Ricardo, providing the error output would help. But in any case you need to do a collect() on the rdd returned from computeCost. 2015-05-19 11:59 GMT-07:00 Ricardo Goncalves da Silva < ricardog.si...@telefonica.com>: > Hi, > > > > Can anybody see what’s wrong in this piece of code: > > > >

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-05-19 Thread Xiangrui Meng
In implicit feedback model, the coefficients were already penalized (towards zero) by the number of unobserved ratings. So I think it is fair to keep the 1.3.0 weighting (by the number of total users/items). Again, I don't think we have a clear answer. It would be nice to run some experiments and s

Re: Spark Streaming + Kafka failure recovery

2015-05-19 Thread Bill Jay
If a Spark streaming job stops at 12:01 and I resume the job at 12:02, will it still start to consume the data that were produced to Kafka at 12:01? Or will it just start consuming from the current time? On Tue, May 19, 2015 at 10:58 AM, Cody Koeninger wrote: > Have you read > https://github.co

Re: Discretization

2015-05-19 Thread Xiangrui Meng
Thanks for asking! We should improve the documentation. The sample dataset is actually mimicking the MNIST digits dataset, where the values are gray levels (0-255). So by dividing by 16, we want to map it to 16 coarse bins for the gray levels. Actually, there is a bug in the doc, we should convert

Re: RandomSplit with Spark-ML and Dataframe

2015-05-19 Thread Olivier Girardot
Thank you! On Tue, May 19, 2015 at 21:08, Xiangrui Meng wrote: > In 1.4, we added RAND as a DataFrame expression, which can be used for > random split. Please check the example here: > > https://github.com/apache/spark/blob/master/python/pyspark/ml/tuning.py#L214. >

Add to Powered by Spark page

2015-05-19 Thread Michal Klos
Hi, We would like to be added to the Powered by Spark list: organization name: Localytics URL: http://eng.localytics.com/ a list of which Spark components you are using: Spark, Spark Streaming, MLLib a short description of your use case: Batch, real-time, and predictive analytics driving our mobi

Re: How to implement an Evaluator for a ML pipeline?

2015-05-19 Thread Xiangrui Meng
The documentation needs to be updated to state that higher metric values are better (https://issues.apache.org/jira/browse/SPARK-7740). I don't know why if you negate the return value of the Evaluator you still get the highest regularization parameter candidate. Maybe you should check the log messa

Re: Find KNN in Spark SQL

2015-05-19 Thread Xiangrui Meng
Spark SQL doesn't provide spatial features. Large-scale KNN is usually combined with locality-sensitive hashing (LSH). This Spark package may be helpful: http://spark-packages.org/package/mrsqueeze/spark-hash. -Xiangrui On Sat, May 9, 2015 at 9:25 PM, Dong Li wrote: > Hello experts, > > I’m new t

Re: SQL UserDefinedType can't be saved in parquet file when using assembly jar

2015-05-19 Thread Xiangrui Meng
Hey Jaonary, I saw this line in the error message: org.apache.spark.sql.types.DataType$CaseClassStringParser$.apply(dataTypes.scala:163) CaseClassStringParser is only used in older versions of Spark to parse schema from JSON. So I suspect that the cluster was running on a old version of Spark wh

Re: spark mllib kmeans

2015-05-19 Thread Xiangrui Meng
Just curious, what distance measure do you need? -Xiangrui On Mon, May 11, 2015 at 8:28 AM, Jaonary Rabarisoa wrote: > take a look at this > https://github.com/derrickburns/generalized-kmeans-clustering > > Best, > > Jao > > On Mon, May 11, 2015 at 3:55 PM, Driesprong, Fokko > wrote: >> >> Hi Pa

Re: Stratified sampling with DataFrames

2015-05-19 Thread Xiangrui Meng
You need to convert DataFrame to RDD, call sampleByKey, and then apply the schema back to create DataFrame. val df: DataFrame = ... val schema = df.schema val sampledRDD = df.rdd.keyBy(r => r.getAs[Int](0)).sampleByKey(...).values val sampled = sqlContext.createDataFrame(sampledRDD, schema) Hopef
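Expanding the outline above into a fuller sketch; the stratum key column, fractions, and input table are assumptions:

    import org.apache.spark.sql.DataFrame

    // Assumes the stratum key is an Int in column 0 and we keep 10% of key 1 and 50% of key 2.
    val df: DataFrame = sqlContext.table("events") // placeholder input
    val schema = df.schema
    val fractions = Map(1 -> 0.1, 2 -> 0.5)
    val sampledRDD = df.rdd
      .keyBy(r => r.getAs[Int](0))
      .sampleByKey(withReplacement = false, fractions = fractions)
      .values
    val sampled = sqlContext.createDataFrame(sampledRDD, schema)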

Re: question about customize kmeans distance measure

2015-05-19 Thread Xiangrui Meng
MLlib only supports Euclidean distance for k-means. You can find Bregman divergence support in Derrick's package: http://spark-packages.org/package/derrickburns/generalized-kmeans-clustering. Which distance measure do you want to use? -Xiangrui On Tue, May 12, 2015 at 7:23 PM, June wrote: > Dear

Re: Increase maximum amount of columns for covariance matrix for principal components

2015-05-19 Thread Xiangrui Meng
We use a dense array to store the covariance matrix on the driver node. So its length is limited by the integer range, which is 65536 * 65536 (actually half). -Xiangrui On Wed, May 13, 2015 at 1:57 AM, Sebastian Alfers wrote: > Hello, > > > in order to compute a huge dataset, the amount of column

Re: Word2Vec with billion-word corpora

2015-05-19 Thread Xiangrui Meng
With vocabulary size 4M and 400 vector size, you need 400 * 4M = 16B floats to store the model. That is 64GB. We store the model on the driver node in the current implementation. So I don't think it would work. You might try increasing the minCount to decrease the vocabulary size and reduce the vec
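A sketch of the two knobs mentioned above, assuming a tokenized corpus and a Spark version where setMinCount is available:

    import org.apache.spark.mllib.feature.Word2Vec
    import org.apache.spark.rdd.RDD

    val tokenizedCorpus: RDD[Seq[String]] = ??? // assumed to exist
    val w2v = new Word2Vec()
      .setVectorSize(200) // smaller vectors -> smaller model held on the driver
      .setMinCount(25)    // drop rare words to shrink the vocabulary
    val model = w2v.fit(tokenizedCorpus)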

Re: Spark Streaming + Kafka failure recovery

2015-05-19 Thread Cody Koeninger
If you checkpoint, the job will start from the successfully consumed offsets. If you don't checkpoint, by default it will start from the highest available offset, and you will potentially lose data. Is the link I posted, or for that matter the scaladoc, really not clear on that point? The scalad
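A small sketch of the relevant knob for the no-checkpoint case, assuming the Kafka 0.8 direct stream; the broker list is a placeholder:

    // With no checkpoint to recover from, the direct stream starts according to auto.offset.reset:
    // "largest" (the default) starts at the current end of the log, "smallest" at the earliest retained offset.
    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092", // placeholder
      "auto.offset.reset" -> "smallest"
    )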

RE: Word2Vec with billion-word corpora

2015-05-19 Thread nate
Might also want to look at Y! post, looks like they are experimenting with similar efforts in large scale word2vec: http://yahooeng.tumblr.com/post/118860853846/distributed-word2vec-on-top-of-pistachio -Original Message- From: Xiangrui Meng [mailto:men...@gmail.com] Sent: Tuesday, May

Re: k-means core function for temporal geo data

2015-05-19 Thread Xiangrui Meng
I'm not sure whether k-means would converge with this customized distance measure. You can list (weighted) time as a feature along with coordinates, and then use Euclidean distance. For other supported distance measures, you can check Derrick's package: http://spark-packages.org/package/derrickburn

Re: LogisticRegressionWithLBFGS with large feature set

2015-05-19 Thread Xiangrui Meng
For ML applications, the best setting is to match the number of partitions to the number of cores to reduce shuffle size. You have 3072 partitions but 128 executors, which causes the overhead. For the MultivariateOnlineSummarizer, we plan to add flags to specify what needs to be computed to reduce
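A sketch of matching partitions to cores as suggested above; the core count and training data are assumptions:

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    val training: RDD[LabeledPoint] = ??? // assumed to exist
    // Match the number of partitions to the total executor cores (128 here, an assumption)
    // so each core gets one task per stage and shuffle overhead stays low.
    val repartitioned = training.repartition(128).cache()
    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(repartitioned)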

rdd.sample() methods very slow

2015-05-19 Thread Wang, Ningjun (LNG-NPV)
Hi I have an RDD[Document] that contains 7 million objects and is saved in the file system as an object file. I want to get a random sample of about 70 objects from it using the rdd.sample() method. It is very slow val rdd : RDD[Document] = sc.objectFile[Document]("C:/temp/docs.obj").sample(false, 0.0

Spark 1.3 classPath problem

2015-05-19 Thread Bill Q
Hi, We have some Spark job that ran well under Spark 1.2 using spark-submit --conf "spark.executor.extraClassPath=/etc/hbase/conf" and the Java HBase driver code the Spark called can pick up the settings for HBase such as ZooKeeper addresses. But after upgrade to CDH 5.4.1 Spark 1.3, the Spark cod

Re: rdd.sample() methods very slow

2015-05-19 Thread Sean Owen
The way these files are accessed is inherently sequential-access. There isn't a way to in general know where record N is in a file like this and jump to it. So they must be read to be sampled. On Tue, May 19, 2015 at 9:44 PM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.com> wrote: > Hi >

Exception when using CLUSTER BY or ORDER BY

2015-05-19 Thread Thomas Dudziak
Under certain circumstances that I haven't yet been able to isolate, I get the following error when doing a HQL query using HiveContext (Spark 1.3.1 on Mesos, fine-grained mode). Is this a known problem or should I file a JIRA for it ? org.apache.spark.SparkException: Can only zip RDDs with same

EOFException using KryoSerializer

2015-05-19 Thread Jim Carroll
I'm seeing the following exception ONLY when I run on a Mesos cluster. If I run the exact same code with master set to "local[N]" I have no problem: 2015-05-19 16:45:43,484 [task-result-getter-0] WARN TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0, 10.253.1.101): java.io.EOFException

Naming an DF aggregated column

2015-05-19 Thread Cesar Flores
I would like to ask if there is a way of specifying the column name of a data frame aggregation. For example If I do: customerDF.groupBy("state").agg(max($"discount")) the name of my aggregated column will be: MAX('discount) Is there a way of changing the name of that column to something else on

How to set the file size for parquet Part

2015-05-19 Thread Richard Grossman
Hi I'm using spark 1.3.1 and now I can't set the size of the part files generated for Parquet. The size is only 512KB, which is really too small; I need to make them bigger. How can I set this? Thanks

Re: Does Python 2.7 have to be installed on every cluster node?

2015-05-19 Thread Davies Liu
PySpark work with CPython by default, and you can specify which version of Python to use by: PYSPARK_PYTHON=path/to/path bin/spark-submit xxx.py When you do the upgrade, you could install python 2.7 on every machine in the cluster, test it with PYSPARK_PYTHON=python2.7 bin/spark-submit xxx.py

Re: Multi user setup and saving a DataFrame / RDD to a network exported file system

2015-05-19 Thread Davies Liu
It surprises me, could you list the owner information of /mnt/lustre/bigdata/med_home/tmp/test19EE/ ? On Tue, May 19, 2015 at 8:15 AM, Tomasz Fruboes wrote: > Dear Experts, > > we have a spark cluster (standalone mode) in which master and workers are > started from root account. Everything runs

Hive 1.0 support in Spark

2015-05-19 Thread Kannan Rajah
Does Spark 1.3.1 support Hive 1.0? If not, which version of Spark will start supporting Hive 1.0? -- Kannan

Re: Spark SQL on large number of columns

2015-05-19 Thread ayan guha
One option is batch up columns and do the batches in sequence. On 20 May 2015 00:20, "madhu phatak" wrote: > Hi, > Another update, when run on more that 1000 columns I am getting > > Could not write class > __wrapper$1$40255d281a0d4eacab06bcad6cf89b0d/__wrapper$1$40255d281a0d4eacab06bcad6cf89b0d$

Re: Naming an DF aggregated column

2015-05-19 Thread Michael Armbrust
customerDF.groupBy("state").agg(max($"discount").alias("newName")) (or .as(...), both functions can take a String or a Symbol) On Tue, May 19, 2015 at 2:11 PM, Cesar Flores wrote: > > I would like to ask if there is a way of specifying the column name of a > data frame aggregation. For example

sparkSQL - Hive metastore connection hangs with MS SQL server

2015-05-19 Thread jamborta
Hi all, I am trying to setup an external metastore using Microsoft SQL on Azure, it works ok initially but after about 5 mins inactivity it hangs, then times out after 15 mins with this error: 15/05/20 00:02:49 ERROR ConnectionHandle: Database access problem. Killing off this connection and all r

Spark users

2015-05-19 Thread Ricardo Goncalves da Silva
Hi I'm learning Spark focused on data and machine learning, migrating from SAS. Is there a group for it? My questions are basic for now and I am getting very few answers. Tal Rick. Sent from my Samsung Galaxy smartphone. This message and its attachments are addressed

Re: Reading Binary files in Spark program

2015-05-19 Thread Tapan Sharma
Thanks. I will try and let you know. But what exactly is the issue? Any pointers? Regards Tapan On Tue, May 19, 2015 at 6:26 PM, Akhil Das wrote: > Try something like: > > JavaPairRDD output = sc.newAPIHadoopFile(inputDir, > org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.cla

Re: EOFException using KryoSerializer

2015-05-19 Thread Imran Rashid
Hi Jim, this is definitley strange. It sure sounds like a bug, but it also is a very commonly used code path, so it at the very least you must be hitting a corner case. Could you share a little more info with us? What version of spark are you using? How big is the object you are trying to broa

spark 1.3.1 jars in repo1.maven.org

2015-05-19 Thread Edward Sargisson
Hi, I'd like to confirm an observation I've just made. Specifically that spark is only available in repo1.maven.org for one Hadoop variant. The Spark source can be compiled against a number of different Hadoops using profiles. Yay. However, the spark jars in repo1.maven.org appear to be compiled a

Re: Reading Binary files in Spark program

2015-05-19 Thread Tapan Sharma
Problem is still there. Exception is not coming at the time of reading. Also the count of JavaPairRDD is as expected. It is when we are calling collect() or toArray() methods, the exception is coming. Something to do with Text class even though I haven't used it in the program. Regards Tapan On T
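A sketch of a common workaround, assuming the failure comes from collecting Hadoop Writables (Text and BytesWritable are not Java-serializable): copy them into plain types before collect():

    import org.apache.hadoop.io.{BytesWritable, Text}

    val images = sc.sequenceFile("/path/to/seqfiles", classOf[Text], classOf[BytesWritable]) // placeholder path
      // Copy out of the reused, non-serializable Writables before shuffling or collecting.
      .map { case (name, bytes) => (name.toString, bytes.copyBytes()) }
    val materialized = images.collect() // now an Array[(String, Array[Byte])]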

Re: spark 1.3.1 jars in repo1.maven.org

2015-05-19 Thread Ted Yu
I think your observation is correct. e.g. http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10/1.3.1 shows that it depends on hadoop-client from hadoop 2.2 Cheers On Tue, May 19, 2015 at 6:17 PM, Edward Sargisson w

Re: How to use spark to access HBase with Security enabled

2015-05-19 Thread donhoff_h
Sorry, this ref does not help me. I have set up the configuration in hbase-site.xml. But it seems there are still some extra configurations to be set or APIs to be called to make my spark program be able to pass the authentication with the HBase. Does anybody know how to set authentication to

Spark Job not using all nodes in cluster

2015-05-19 Thread Shailesh Birari
Hi, I have a 4 node Spark 1.3.1 cluster. All four nodes have 4 cores and 64 GB of RAM. I have around 600,000+ Json files on HDFS. Each file is small around 1KB in size. Total data is around 16GB. Hadoop block size is 256MB. My application reads these files with sc.textFile() (or sc.jsonFile() tri

RE: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-19 Thread Chandra Mohan, Ananda Vel Murugan
Hi, Thanks for the response. I was looking for a java solution. I will check the scala and python ones. Regards, Anand.C From: Todd Nist [mailto:tsind...@gmail.com] Sent: Tuesday, May 19, 2015 6:17 PM To: Chandra Mohan, Ananda Vel Murugan Cc: ayan guha; user Subject: Re: Spark sql error while w

Spark logo license

2015-05-19 Thread Justin Pihony
What is the license on using the spark logo. Is it free to be used for displaying commercially? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-logo-license-tp22952.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: Find KNN in Spark SQL

2015-05-19 Thread Debasish Das
The batch version of this is part of the rowSimilarities JIRA 4823... if your query points can fit in memory there is a broadcast version which we are experimenting with internally... we are using brute-force KNN right now in the PR... based on the FLANN paper, LSH did not work well, but before you go to approx

Re: Spark logo license

2015-05-19 Thread Matei Zaharia
Check out Apache's trademark guidelines here: http://www.apache.org/foundation/marks/ Matei > On May 20, 2015, at 12:02 AM, Justin Pihony wrote: > > What is the license on using the spark logo. Is it free to be used for > displaying commercially? > >

Re: Spark Job not using all nodes in cluster

2015-05-19 Thread ayan guha
What does your spark-env file say? Are you setting the number of executors in the Spark context? On 20 May 2015 13:16, "Shailesh Birari" wrote: > Hi, > > I have a 4 node Spark 1.3.1 cluster. All four nodes have 4 cores and 64 GB > of RAM. > I have around 600,000+ Json files on HDFS. Each file is small aro

Re: Spark logo license

2015-05-19 Thread Justin Pihony
Thanks! On Wed, May 20, 2015 at 12:41 AM, Matei Zaharia wrote: > Check out Apache's trademark guidelines here: > http://www.apache.org/foundation/marks/ > > Matei > > On May 20, 2015, at 12:02 AM, Justin Pihony > wrote: > > What is the license on using the spark logo. Is it free to be used for

Hive on Spark VS Spark SQL

2015-05-19 Thread guoqing0...@yahoo.com.hk
Hive on Spark and Spark SQL: which should be better, and what are the key characteristics, advantages, and disadvantages of each? guoqing0...@yahoo.com.hk

Spark Streaming to Kafka

2015-05-19 Thread twinkle sachdeva
Hi, As Spark streaming is being nicely integrated with consuming messages from Kafka, I thought of asking the forum: is there any implementation available for pushing data to Kafka from Spark Streaming too? Any link(s) will be helpful. Thanks and Regards, Twinkle

Re: Spark Streaming to Kafka

2015-05-19 Thread Saisai Shao
I think here is the PR https://github.com/apache/spark/pull/2994 you could refer to. 2015-05-20 13:41 GMT+08:00 twinkle sachdeva : > Hi, > > As Spark streaming is being nicely integrated with consuming messages from > Kafka, so I thought of asking the forum, that is there any implementation > ava
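Spark 1.3 has no built-in Kafka sink, so besides the PR above a common hand-rolled pattern is a producer per partition inside foreachRDD; a sketch assuming the Kafka 0.8.2 producer API, with broker and topic as placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.streaming.dstream.DStream

    // Writes every record of a string DStream to a Kafka topic; one producer per partition,
    // since producers are not serializable and should not be created per record.
    def writeToKafka(stream: DStream[String], topic: String): Unit = {
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          val props = new Properties()
          props.put("bootstrap.servers", "broker1:9092") // placeholder
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          val producer = new KafkaProducer[String, String](props)
          partition.foreach(msg => producer.send(new ProducerRecord[String, String](topic, msg)))
          producer.close()
        }
      }
    }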

java program Get Stuck at broadcasting

2015-05-19 Thread allanjie
Hi All, The variable I need to broadcast is just 468 MB. When broadcasting, it just "stops" here: 15/05/20 11:36:14 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 15/05/20 11:36:14 INFO Configuration.deprecation: mapred.task.id is deprecated. I

Re: Hive on Spark VS Spark SQL

2015-05-19 Thread Debasish Das
SparkSQL was built to improve upon Hive on Spark runtime further... On Tue, May 19, 2015 at 10:37 PM, guoqing0...@yahoo.com.hk < guoqing0...@yahoo.com.hk> wrote: > Hive on Spark and SparkSQL which should be better , and what are the key > characteristics and the advantages and the disadvantages b

Re: How to implement an Evaluator for a ML pipeline?

2015-05-19 Thread Stefan H.
Thanks, Xiangrui, for clarifying the metric and creating that JIRA issue. I made an error while composing my earlier mail: "paramMap.get(als.regParam)" in my Evaluator actually returns "None". I just happened to use "getOrElse(1.0)" in my tests, which explains why negating the metric did not

Re: Spark Streaming to Kafka

2015-05-19 Thread twinkle sachdeva
Thanks Saisai. On Wed, May 20, 2015 at 11:23 AM, Saisai Shao wrote: > I think here is the PR https://github.com/apache/spark/pull/2994 you > could refer to. > > 2015-05-20 13:41 GMT+08:00 twinkle sachdeva : > >> Hi, >> >> As Spark streaming is being nicely integrated with consuming messages >> f

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-19 Thread Dibyendu Bhattacharya
You mean to say within Runtime.getRuntime().addShutdownHook I call ssc.stop(stopSparkContext = true, stopGracefully = true) ? This won't work anymore in 1.4. The SparkContext got stopped before Receiver processed all received blocks and I see below exception in logs. But if I add the Utils.addS

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-19 Thread Dibyendu Bhattacharya
By the way, this happens when I stopped the Driver process ... On Tue, May 19, 2015 at 12:29 PM, Dibyendu Bhattacharya < dibyendu.bhattach...@gmail.com> wrote: > You mean to say within Runtime.getRuntime().addShutdownHook I call > ssc.stop(stopSparkContext = true, stopGracefully = true) ? > > Th

Re: TwitterUtils on Windows

2015-05-19 Thread Akhil Das
Hi Justin, Can you try with sbt, may be that will help. -> Install sbt for windows http://www.scala-sbt.org/0.13/tutorial/Installing-sbt-on-Windows.html -> Create a lib directory in your project directory -> Place these jars in it: - spark-streaming-twitter_2.10-1.3.1.jar - twitter4j-async-3.0.3

Re: org.apache.spark.shuffle.FetchFailedException :: Migration from Spark 1.2 to 1.3

2015-05-19 Thread Akhil Das
There was some similar discussion on JIRA https://issues.apache.org/jira/browse/SPARK-3633, maybe that will give you some insights. Thanks Best Regards On Mon, May 18, 2015 at 10:49 PM, zia_kayani wrote: > Hi, I'm getting this exception after shifting my code from Spark 1.2 to > Spark

group by and distinct performance issue

2015-05-19 Thread Peer, Oded
I am running Spark over Cassandra to process a single table. My task reads a single days' worth of data from the table and performs 50 group by and distinct operations, counting distinct userIds by different grouping keys. My code looks like this: JavaRdd rdd = sc.parallelize().mapPartitions(

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-19 Thread Sean Owen
I don't think you should rely on a shutdown hook. Ideally you try to stop it in the main exit path of your program, even in case of an exception. On Tue, May 19, 2015 at 7:59 AM, Dibyendu Bhattacharya wrote: > You mean to say within Runtime.getRuntime().addShutdownHook I call > ssc.stop(stopSpark

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-19 Thread Dibyendu Bhattacharya
Thanks Sean, you are right. If the driver program is running then I can handle shutdown in the main exit path. But if the driver machine has crashed (or you just stop the application, for example by killing the driver process), then the shutdown hook is the only option, isn't it? What I try to say is, just doing s

spark streaming doubt

2015-05-19 Thread Shushant Arora
What happens if, in a streaming application, one job is not yet finished and the stream interval is reached? Does it start the next job, or wait for the first to finish while the rest of the jobs keep accumulating in the queue? Say I have a streaming application with a stream interval of 1 sec, but my job takes 2 min to pro

Re: group by and distinct performance issue

2015-05-19 Thread Akhil Das
Hi Peer, If you open the driver UI (running on port 4040) you can see the stages and the tasks happening inside it. Best way to identify the bottleneck for a stage is to see if there's any time spending on GC, and how many tasks are there per stage (it should be a number > total # cores to achieve

Re: Spark and Flink

2015-05-19 Thread Pa Rö
That sounds good. Maybe you can send me a pseudo structure; this is my first Maven project. best regards, paul 2015-05-18 14:05 GMT+02:00 Robert Metzger : > Hi, > I would really recommend you to put your Flink and Spark dependencies into > different maven modules. > Having them both in the same proj

Re: spark streaming doubt

2015-05-19 Thread Akhil Das
It will be a single job running at a time by default (you can also configure the spark.streaming.concurrentJobs to run jobs parallel which is not recommended to put in production). Now, your batch duration being 1 sec and processing time being 2 minutes, if you are using a receiver based streaming

Re: Broadcast variables can be rebroadcast?

2015-05-19 Thread N B
Hi Imran, If I understood you correctly, you are suggesting to simply call broadcast again from the driver program. This is exactly what I am hoping will work as I have the Broadcast data wrapped up and I am indeed (re)broadcasting the wrapper over again when the underlying data changes. However,

Re: TwitterUtils on Windows

2015-05-19 Thread Steve Loughran
> On 19 May 2015, at 03:08, Justin Pihony wrote: > > > 15/05/18 22:03:14 INFO Executor: Fetching > http://192.168.56.1:49752/jars/twitter4j-media-support-3.0.3.jar with > timestamp 1432000973058 > 15/05/18 22:03:14 INFO Utils: Fetching > http://192.168.56.1:49752/jars/twitter4j-media-support-3.

Re: Working with slides. How do I know how many times a RDD has been processed?

2015-05-19 Thread Guillermo Ortiz
I tried to insert a "flag" into the RDD, so I could set a counter in the last position; when the counter reaches X, I could do something. But in each slide the original RDD comes through even though I modified it. I did this code to check if this is possible but it doesn't work. val rdd1WithFlag = rdd1.map {

Spark 1.3.1 Performance Tuning/Patterns for Data Generation Heavy/Throughput Jobs

2015-05-19 Thread Night Wolf
Hi all, I have a job that, for every row, creates about 20 new objects (i.e. RDD of 100 rows in = RDD 2000 rows out). The reason for this is each row is tagged with a list of the 'buckets' or 'windows' it belongs to. The actual data is about 10 billion rows. Each executor has 60GB of memory. Cur

AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Ewan Leith
Hi all, I might be missing something, but does the new Spark 1.3 sqlContext save interface support using Avro as the schema structure when writing Parquet files, in a similar way to AvroParquetWriter (which I've got working)? I've seen how you can load an avro file and save it as parquet from

Re: py-files (and others?) not properly set up in cluster-mode Spark Yarn job?

2015-05-19 Thread Shay Rojansky
Thanks for the quick response and confirmation, Marcelo, I just opened https://issues.apache.org/jira/browse/SPARK-7725. On Mon, May 18, 2015 at 9:02 PM, Marcelo Vanzin wrote: > Hi Shay, > > Yeah, that seems to be a bug; it doesn't seem to be related to the default > FS nor compareFs either - I

Re: spark streaming doubt

2015-05-19 Thread Akhil Das
spark.streaming.concurrentJobs takes an integer value, not boolean. If you set it as 2 then 2 jobs will run parallel. Default value is 1 and the next job will start once it completes the current one. > Actually, in the current implementation of Spark Streaming and under > default configuration, o
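A minimal sketch of the setting described above; the value is illustrative, and as noted it is not recommended for production:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-app")
      .set("spark.streaming.concurrentJobs", "2") // allow two streaming jobs to run in parallel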

How to use spark to access HBase with Security enabled

2015-05-19 Thread donhoff_h
Hi, experts. I ran the "HBaseTest" program which is an example from the Apache Spark source code to learn how to use spark to access HBase. But I met the following exception: Exception in thread "main" org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptio

RE: Spark 1.3.1 Performance Tuning/Patterns for Data Generation Heavy/Throughput Jobs

2015-05-19 Thread Evo Eftimov
Is that a Spark or a Spark Streaming application? Re the map transformation which is required, you can also try flatMap. Finally, an Executor is essentially a JVM spawned by a Spark Worker Node or YARN – giving 60GB RAM to a single JVM will certainly result in “off the charts” GC. I would sugges
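A sketch of the flatMap shape suggested above for the 1-row-in, ~20-rows-out expansion; the record type and bucketing rule are hypothetical:

    // Hypothetical record and windowing rule, only to show the 1 -> ~20 expansion with flatMap.
    case class Event(ts: Long, value: Double)
    def bucketsFor(e: Event): Seq[Long] = (0L until 20L).map(w => e.ts / 60000 - w)

    val events = sc.parallelize(Seq(Event(1432000000000L, 1.0), Event(1432000060000L, 2.0)))
    val tagged = events.flatMap(e => bucketsFor(e).map(bucket => (bucket, e))) // ~20 output rows per input row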

Re: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Cheng Lian
Hi Ewan, Different from AvroParquetWriter, in Spark SQL we use StructType as the intermediate schema format. So when converting Avro files to Parquet files, we internally convert the Avro schema to a Spark SQL StructType first, and then convert the StructType to a Parquet schema. Cheng On 5/19/15 4:4

Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, I am trying to run Spark SQL aggregation on a file with 26k columns. The number of rows is very small. I am running into an issue where Spark is taking a huge amount of time to parse the SQL and create a logical plan. Even if I have just one row, it's taking more than 1 hour just to get past the parsing. Any i

RE: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Ewan Leith
Thanks Cheng, that makes sense. So for new dataframe creation (not conversion from Avro but from JSON or CSV inputs) in Spark we shouldn't worry about using Avro at all, just use the Spark SQL StructType when building new Dataframes? If so, that will be a lot simpler! Thanks, Ewan From: Cheng

Reading Binary files in Spark program

2015-05-19 Thread Tapan Sharma
Hi Team, I am new to Spark and learning. I am trying to read image files into spark job. This is how I am doing: Step 1. Created sequence files with FileName as Key and Binary image as value. i.e. Text and BytesWritable. I am able to read these sequence files into Map Reduce programs. Step 2. I

Re: Spark SQL on large number of columns

2015-05-19 Thread ayan guha
can you kindly share your code? On Tue, May 19, 2015 at 8:04 PM, madhu phatak wrote: > Hi, > I am trying run spark sql aggregation on a file with 26k columns. No of > rows is very small. I am running into issue that spark is taking huge > amount of time to parse the sql and create a logical pla

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, I have fields from field_0 to field_26000. The query is a select of max(cast($columnName as double)), min(cast($columnName as double)), avg(cast($columnName as double)), count(*) for all those 26000 fields in one query. Regards, Madhukara Phatak http://datamantra.io/ On Tue, May 19, 2
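A sketch of building the aggregations programmatically and batching columns as suggested earlier in the thread; the batch size and DataFrame are assumptions:

    import org.apache.spark.sql.functions.{avg, col, count, lit, max, min}

    // df is assumed to be the 26k-column DataFrame; aggregating a few hundred columns per
    // query keeps each analyzed plan small.
    val results = df.columns.grouped(200).map { batch =>
      val aggs = batch.flatMap { c =>
        val d = col(c).cast("double")
        Seq(max(d).alias(s"max_$c"), min(d).alias(s"min_$c"), avg(d).alias(s"avg_$c"))
      } :+ count(lit(1)).alias("cnt")
      df.agg(aggs.head, aggs.tail: _*).collect()
    }.toVector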

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, Additional information: the table is backed by a CSV file which is read using spark-csv from Databricks. Regards, Madhukara Phatak http://datamantra.io/ On Tue, May 19, 2015 at 4:05 PM, madhu phatak wrote: > Hi, > I have fields from field_0 to fied_26000. The query is select on > > ma

Re: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Cheng Lian
That's right. Also, Spark SQL can automatically infer schema from JSON datasets. You don't need to specify an Avro schema: sqlContext.jsonFile("json/path").saveAsParquetFile("parquet/path") or with the new reader/writer API introduced in 1.4-SNAPSHOT: sqlContext.read.json("json/path").write

RE: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Ewan Leith
Thanks Cheng, that's brilliant, you've saved me a headache. Ewan From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: 19 May 2015 11:58 To: Ewan Leith; user@spark.apache.org Subject: Re: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces? That's right. Also

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, I am using spark 1.3.1 Regards, Madhukara Phatak http://datamantra.io/ On Tue, May 19, 2015 at 4:34 PM, Wangfei (X) wrote: > And which version are you using > > Sent from my iPhone > > On May 19, 2015, at 18:29, "ayan guha" wrote: > > can you kindly share your code? > > On Tue, May 19, 2015 at 8:04 PM,

Re: Spark SQL on large number of columns

2015-05-19 Thread Wangfei (X)
And which version are you using Sent from my iPhone On May 19, 2015, at 18:29, "ayan guha" <guha.a...@gmail.com> wrote: can you kindly share your code? On Tue, May 19, 2015 at 8:04 PM, madhu phatak <phatak@gmail.com> wrote: Hi, I am trying run spark sql aggregation on a file with 26k columns.

Re: How to use spark to access HBase with Security enabled

2015-05-19 Thread Ted Yu
Which user did you run your program as ? Have you granted proper permission on hbase side ? You should also check master log to see if there was some clue. Cheers > On May 19, 2015, at 2:41 AM, donhoff_h <165612...@qq.com> wrote: > > Hi, experts. > > I ran the "HBaseTest" program which is

Re: How to use spark to access HBase with Security enabled

2015-05-19 Thread donhoff_h
The principal is sp...@bgdt.dev.hrb. It is the user that I used to run my spark programs. I am sure I have run the kinit command to make it take effect. And I also used the HBase Shell to verify that this user has the right to scan and put the tables in HBase. Now I still have no idea how to s

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, Tested for calculating values for 300 columns. Analyser takes around 4 minutes to generate the plan. Is this normal? Regards, Madhukara Phatak http://datamantra.io/ On Tue, May 19, 2015 at 4:35 PM, madhu phatak wrote: > Hi, > I am using spark 1.3.1 > > > > > Regards, > Madhukara Phatak >
