Re: Error in creating spark RDD

2015-04-22 Thread Akhil Das
Here's a complete scala example https://github.com/bbux-proteus/spark-accumulo-examples/blob/1dace96a115f29c44325903195c8135edf828c86/src/main/scala/org/bbux/spark/AccumuloMetadataCount.scala Thanks Best Regards On Thu, Apr 23, 2015 at 12:19 PM, Akhil Das wrote: > Change your import from mapred

Re: Error in creating spark RDD

2015-04-22 Thread Akhil Das
Change your import from mapred to mapreduce. like : import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat; Thanks Best Regards On Wed, Apr 22, 2015 at 2:42 PM, madhvi wrote: > Hi, > > I am creating a spark RDD through accumulo writing like: > > JavaPairRDD accumuloRDD = > sc.new

RE: Error in creating spark RDD

2015-04-22 Thread Sun, Rui
Hi, SparkContext.newAPIHadoopRDD() is for working with the new Hadoop mapreduce API. So, you should import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat; instead of org.apache.accumulo.core.client.mapred.AccumuloInputFormat; -Original Message- From: madhvi [mailt
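A minimal Scala sketch of the new-API call, for illustration only (the Accumulo connection settings are omitted and the variable names are assumptions, since the original code is truncated):

    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
    import org.apache.accumulo.core.data.{Key, Value}
    import org.apache.hadoop.conf.Configuration

    val conf = new Configuration()
    // Accumulo instance, credentials and table would be configured on `conf` here.
    val accumuloRDD = sc.newAPIHadoopRDD(
      conf,
      classOf[AccumuloInputFormat],
      classOf[Key],
      classOf[Value])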

Re: problem writing to s3

2015-04-22 Thread Akhil Das
Can you try writing to a different S3 bucket and confirm that? Thanks Best Regards On Thu, Apr 23, 2015 at 12:11 AM, Daniel Mahler wrote: > Hi Akhil, > > It works fine when outprefix is a hdfs:///localhost/... url. > > It looks to me as if there is something about spark writing to the same s3 >

Re: IOUtils cannot write anything in Spark?

2015-04-22 Thread Holden Karau
It seems like saveAsTextFile might do what you are looking for. On Wednesday, April 22, 2015, Xi Shen wrote: > Hi, > > I have a RDD of some processed data. I want to write these files to HDFS, > but not for future M/R processing. I want to write plain old style text > file. I tried: > > rdd fore
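A minimal sketch of that suggestion (the formatting step is a placeholder):

    // Turn each record into a line of text and let Spark write plain text files to HDFS,
    // instead of opening HDFS streams manually inside foreach.
    rdd.map(d => d.toString)                      // replace with your own formatting
       .saveAsTextFile("hdfs:///user/xi/output")  // path is an example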

Re: Distinct is very slow

2015-04-22 Thread Jeetendra Gangele
Anyone any thoughts on this? On 22 April 2015 at 22:49, Jeetendra Gangele wrote: > I made 7000 tasks in mapToPair and in distinct I also made the same number of > tasks. > Still lots of shuffle read and write is happening, and the application is > running for a much longer time. > Any idea? > > On 17 April

IOUtils cannot write anything in Spark?

2015-04-22 Thread Xi Shen
Hi, I have a RDD of some processed data. I want to write these files to HDFS, but not for future M/R processing. I want to write plain old style text file. I tried: rdd foreach {d => val file = // create the file using a HDFS FileSystem val lines = d map { // format data into string }

Re: StandardScaler failing with OOM errors in PySpark

2015-04-22 Thread Rok Roskar
the feature dimension is 800k. yes, I believe the driver memory is likely the problem since it doesn't crash until the very last part of the tree aggregation. I'm running it via pyspark through YARN -- I have to run in client mode so I can't set spark.driver.memory -- I've tried setting the sp

Re: StackOverflow Error when run ALS with 100 iterations

2015-04-22 Thread Xiangrui Meng
ALS.setCheckpointInterval was added in Spark 1.3.1. You need to upgrade Spark to use this feature. -Xiangrui On Wed, Apr 22, 2015 at 9:03 PM, amghost wrote: > Hi, would you please explain how to checkpoint the training set RDD, since everything > is done in the ALS.train method. > > > > -- > View this messag
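A sketch of what this looks like on Spark 1.3.1+ (the parameter values are arbitrary):

    import org.apache.spark.mllib.recommendation.ALS

    sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")  // checkpointing needs a directory
    val model = new ALS()
      .setRank(10)
      .setIterations(100)
      .setCheckpointInterval(10)  // break the long lineage every 10 iterations
      .run(ratings)               // ratings: RDD[Rating]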

Re: setting cost in linear SVM [Python]

2015-04-22 Thread Xiangrui Meng
If by "C" you mean the parameter C in LIBLINEAR, the corresponding parameter in MLlib is regParam: https://github.com/apache/spark/blob/master/python/pyspark/mllib/classification.py#L273, while regParam = 1/C. -Xiangrui On Wed, Apr 22, 2015 at 3:25 PM, Pagliari, Roberto wrote: > Is there a way to

Loading lots of .parquet files in Spark 1.3.1 (Hadoop 2.4)

2015-04-22 Thread cosmincatalin
I am trying to read a few hundred .parquet files from S3 into an EMR cluster. The .parquet files are structured by date and have _common_metadata in each of the folders (as well as _metadata). The sqlContext.parquetFile operation takes a very long time, opening for reading each of the .parquet

Re: the indices of SparseVector must be ordered while computing SVD

2015-04-22 Thread Xiangrui Meng
Having ordered indices is a contract of SparseVector: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector. We do not verify it for performance. -Xiangrui On Wed, Apr 22, 2015 at 8:26 AM, yaochunnan wrote: > Hi all, > I am using Spark 1.3.1 to write
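For illustration, the indices handed to Vectors.sparse must be strictly increasing:

    import org.apache.spark.mllib.linalg.Vectors

    // fine: indices are strictly increasing
    val ok  = Vectors.sparse(5, Array(0, 2, 4), Array(1.0, 3.0, 5.0))
    // not fine: the contract is violated, and MLlib does not verify it,
    // so operations such as SVD can silently misbehave
    val bad = Vectors.sparse(5, Array(2, 0, 4), Array(3.0, 1.0, 5.0))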

Re: [MLlib] fail to run word2vec

2015-04-22 Thread Xiangrui Meng
We store the vectors on the driver node. So it is hard to handle a really large vocabulary. You can use setMinCount to filter out infrequent word to reduce the model size. -Xiangrui On Wed, Apr 22, 2015 at 12:32 AM, gm yu wrote: > When use Mllib.Word2Vec, I meet the following error: > > allocati
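A small sketch (the threshold and vector size are arbitrary):

    import org.apache.spark.mllib.feature.Word2Vec

    val word2vec = new Word2Vec()
      .setVectorSize(100)
      .setMinCount(10)   // drop words seen fewer than 10 times to shrink the model on the driver
    val model = word2vec.fit(tokenizedCorpus)   // tokenizedCorpus: RDD[Iterable[String]]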

Re: Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-22 Thread Xiangrui Meng
The patched was merged and it will be included in 1.3.2 and 1.4.0. Thanks for reporting the bug! -Xiangrui On Tue, Apr 21, 2015 at 2:51 PM, ayan guha wrote: > Thank you all. > > On 22 Apr 2015 04:29, "Xiangrui Meng" wrote: >> >> SchemaRDD subclasses RDD in 1.2, but DataFrame is no longer an RDD

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-22 Thread Xiangrui Meng
This is the size of the serialized task closure. Is stage 246 part of ALS iterations, or something before or after it? -Xiangrui On Tue, Apr 21, 2015 at 10:36 AM, Christian S. Perone wrote: > Hi Sean, thanks for the answer. I tried to call repartition() on the input > with many different sizes an

How to debug Spark on Yarn?

2015-04-22 Thread ๏̯͡๏
I submit a spark app to YARN and i get these messages 15/04/22 22:45:04 INFO yarn.Client: Application report for application_1429087638744_101363 (state: RUNNING) 15/04/22 22:45:04 INFO yarn.Client: Application report for application_1429087638744_101363 (state: RUNNING). ... 1) I can go to S

Re: StandardScaler failing with OOM errors in PySpark

2015-04-22 Thread Xiangrui Meng
What is the feature dimension? Did you set the driver memory? -Xiangrui On Tue, Apr 21, 2015 at 6:59 AM, rok wrote: > I'm trying to use the StandardScaler in pyspark on a relatively small (a few > hundred Mb) dataset of sparse vectors with 800k features. The fit method of > StandardScaler crashes

Re: Problem with using Spark ML

2015-04-22 Thread Xiangrui Meng
Please try reducing the step size. The native BLAS library is not required. -Xiangrui On Tue, Apr 21, 2015 at 5:15 AM, Staffan wrote: > Hi, > I've written an application that performs some machine learning on some > data. I've validated that the data _should_ give a good output with a decent > RM

Re: RE: ElasticSearch for Spark times out

2015-04-22 Thread Nick Pentreath
Is your ES cluster reachable from your Spark cluster via network / firewall? Can you run the same query from the spark master and slave nodes via curl / one of the other clients? Seems odd that GC issues would be a problem from the scan but not when running query from a browser plugin... Sou

Re: LDA code little error @Xiangrui Meng

2015-04-22 Thread Xiangrui Meng
Thanks! That's a bug .. -Xiangrui On Wed, Apr 22, 2015 at 9:34 PM, buring wrote: > Hi: > there is a little error in the source code of LDA.scala at line 180, as > follows: > def setBeta(beta: Double): this.type = setBeta(beta) > > which causes "java.lang.StackOverflowError". It's easy to see th

Re: Building Spark : Adding new DataType in Catalyst

2015-04-22 Thread kmader
Unless you are directly concerned with the query optimization you needn't modify catalyst or any of the core Spark SQL code. You can simply create a new project with Spark SQL as a dependency and like is done in MLLib Vectors (in 1.3, the newer versions have it for matrices as well) Use the @SQLU

LDA code little error @Xiangrui Meng

2015-04-22 Thread buring
Hi: there is a little error in the source code of LDA.scala at line 180, as follows: def setBeta(beta: Double): this.type = setBeta(beta) which causes a "java.lang.StackOverflowError". It's easy to see the error: the method calls itself. -- View this message in context: http://apache-spark-user-list.1001560.n3.n

Re: StackOverflow Error when run ALS with 100 iterations

2015-04-22 Thread amghost
Hi, would you please explain how to checkpoint the training set RDD, since everything is done in the ALS.train method. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/StackOverflow-Error-when-run-ALS-with-100-iterations-tp4296p22619.html Sent from the Apache Spark User

Spark RDD Lifecycle: whether RDD will be reclaimed out of scope

2015-04-22 Thread Jeffery
Hi, Dear Spark Users/Devs: In a method, I create a new RDD and cache it. Will Spark unpersist the RDD automatically after the RDD goes out of scope? I think so, but want to make sure with you, the experts :) Thanks, Jeffery Yuan -- View this message in context: http://apache-spar

Re: Not able run multiple tasks in parallel, spark streaming

2015-04-22 Thread Abhay Bansal
Thanks for your suggestions. sc.set("spark.streaming.concurrentJobs","2") works, but I am not sure of using it in production. @TD: The number of streams that we are interacting with are very large. Managing these many applications would just be an overhead. Moreover there are other operation whic

.toPairDStreamFunctions method not found

2015-04-22 Thread avseq
Dear all, I installed Spark 1.2.1 on CentOS 6.6 x64. I ran the NetworkWordCount example and it works. But when I paste the same code into IntelliJ, package it into a jar and execute it, it always throws the following exception: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.strea

Re: RE: ElasticSearch for Spark times out

2015-04-22 Thread Otis Gospodnetic
Hi, If you get ES response back in 1-5 seconds that's pretty slow. Are these ES aggregation queries? Costin may be right about GC possibly causing timeouts. SPM can give you all Spark and all key Elasticsearch metrics, including various JVM metrics. If the problem is

Re: Start ThriftServer Error

2015-04-22 Thread Denny Lee
You may need to specify the hive port itself. For example, my own Thrift start command is in the form: ./sbin/start-thriftserver.sh --master spark://$myserver:7077 --driver-class-path $CLASSPATH --hiveconf hive.server2.thrift.bind.host $myserver --hiveconf hive.server2.thrift.port 1 HTH! O

Re: Parquet Hive table become very slow on 1.3?

2015-04-22 Thread Rex Xiong
Yin, Thanks for your reply. We already patched this PR into our 1.3.0. As Xudong mentioned, we have thousands of parquet files; the first read is very, very slow, and another app will add more files and refresh the table regularly. Cheng Lian's PR-5334 seems able to resolve this issue; it will skip read all f

How to access HBase on Spark SQL

2015-04-22 Thread doovsaid
I notice that Databricks provides an external datasource API for Spark SQL, but I can't find any reference documents on how to access HBase with it directly. Does anyone know? Or please give me some related links. Thanks. ZhangYi (张逸) BigEye website: http://w

Re: [spark-csv] Trouble working with Spark-CSV package (error: object databricks is not a member of package com) (#54)

2015-04-22 Thread Mohammed Omer
Yes, `import com.brkyvz.spark.WordCounter` failed as well :/ On Wed, Apr 22, 2015 at 6:45 PM, Burak Yavuz wrote: > Hi @momer , Could you please try spark-shell > with --packages brkyvz:demo-scala-python:0.1.3 just to test whether this > is a package related issue or Sp

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-22 Thread Tathagata Das
Vaguely makes sense. :) Wow that's an interesting corner case. On Wed, Apr 22, 2015 at 1:57 PM, Jean-Pascal Billaud wrote: > I have now a fair understanding of the situation after looking at javap > output. So as a reminder: > > dstream.map(tuple => { > val w = StreamState.fetch[K,W](state.prefi

spark-ec2 s3a filesystem support and hadoop versions

2015-04-22 Thread Daniel Mahler
I would like to easily launch a cluster that supports s3a file systems. if I launch a cluster with `spark-ec2 --hadoop-major-version=2`, what determines the minor version of hadoop? Does it depend on the spark version being launched? Are there other allowed values for --hadoop-major-version besi

Re: Map Question

2015-04-22 Thread Tathagata Das
Can you give full code? especially the myfunc? On Wed, Apr 22, 2015 at 2:20 PM, Vadim Bichutskiy < vadim.bichuts...@gmail.com> wrote: > Here's what I did: > > print 'BROADCASTING...' > broadcastVar = sc.broadcast(mylist) > print broadcastVar > print broadcastVar.value > print 'FINISHED BROADCASTI

RE: Trouble working with Spark-CSV package (error: object databricks is not a member of package com)

2015-04-22 Thread yana
You can try pulling the jar with wget and passing it to spark-shell with --jars. I used 1.0.3 with Spark 1.3.0 but with a different version of Scala. From the stack trace it looks like the Spark shell is just not seeing the csv jar... Sent on the new Sprint Network from my Samsung Galaxy S®4. -

why does groupByKey return RDD[(K, Iterable[V])] not RDD[(K, CompactBuffer[V])] ?

2015-04-22 Thread Hao Ren
Hi, Just a quick question, Regarding the source code of groupByKey: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L453 In the end, it cast CompactBuffer to Iterable. But why ? Any advantage? Thank you. Hao. -- View this message

Re: Hive table creation - possible bug in Spark 1.3?

2015-04-22 Thread Michael Armbrust
Sorry for the confusion. We should be clearer about the semantics in the documentation. (PRs welcome :) ) .saveAsTable does not create a Hive table, but instead creates a Spark Data Source table. Here the metadata is persisted into Hive, but Hive cannot read the tables (as this API support ML
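For contrast, a minimal sketch of one way to produce a table Hive itself can read, assuming sqlContext is a HiveContext (an illustration only, not necessarily what the truncated reply recommends):

    df.registerTempTable("my_df")
    sqlContext.sql("CREATE TABLE my_hive_table AS SELECT * FROM my_df")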

setting cost in linear SVM [Python]

2015-04-22 Thread Pagliari, Roberto
Is there a way to set the cost value C when using linear SVM?

beeline that comes with spark 1.3.0 doesn't work with "--hiveconf" or "--hivevar", which substitute variables in hive scripts.

2015-04-22 Thread ogoh
Hello, I am using Spark 1.3 for SparkSQL (Hive) & ThriftServer & Beeline. Beeline doesn't work with "--hiveconf" or "--hivevar", which substitute variables in hive scripts. I found the following JIRAs saying that Hive 0.13 resolved that issue. I wonder if this is a well-known issue? https://i

Re: RE: ElasticSearch for Spark times out

2015-04-22 Thread Costin Leau
Hi, First off, for Elasticsearch questions it is worth pinging the Elastic mailing list, as that is more closely monitored than this one. Back to your question, Jeetendra is right that the exception indicates no data is flowing back to the es-connector and Spark. The default is 1m [1] which should be mor
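If the scroll reads are genuinely slow rather than stuck, the es-hadoop read timeout can be raised in the connector configuration; a hedged sketch (setting names per the es-hadoop documentation, values are examples):

    val conf = new org.apache.spark.SparkConf()
      .set("es.nodes", "es-host:9200")
      .set("es.http.timeout", "5m")   // connector HTTP timeout, default 1m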

Re: Map Question

2015-04-22 Thread Vadim Bichutskiy
Here's what I did: print 'BROADCASTING...' broadcastVar = sc.broadcast(mylist) print broadcastVar print broadcastVar.value print 'FINISHED BROADCASTING...' The above works fine, but when I call myrdd.map(myfunc) I get *NameError: global name 'broadcastVar' is not defined* The myfunc function is

Re: SparkSQL performance

2015-04-22 Thread Michael Armbrust
https://github.com/databricks/spark-avro On Tue, Apr 21, 2015 at 3:09 PM, Renato Marroquín Mogrovejo < renatoj.marroq...@gmail.com> wrote: > Thanks Michael! > I have tried applying my schema programatically but I didn't get any > improvement on performance :( > Could you point me to some code exa

Re: Scheduling across applications - Need suggestion

2015-04-22 Thread Lan Jiang
YARN capacity scheduler support hierarchical queues, which you can assign cluster resource as percentage. Your spark application/shell can be submitted to different queues. Mesos supports fine-grained mode, which allows the machines/cores used each executors ramp up and down. Lan On Wed, Apr 22,

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-22 Thread Jean-Pascal Billaud
I have now a fair understanding of the situation after looking at javap output. So as a reminder: dstream.map(tuple => { val w = StreamState.fetch[K,W](state.prefixKey, tuple._1) (tuple._1, (tuple._2, w)) }) And StreamState being a very simple standalone object: object StreamState { def fetch[

Re: RDD.filter vs. RDD.join--advice please

2015-04-22 Thread dsgriffin
Test it out, but I would be willing to bet the join is going to be a good deal faster. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-filter-vs-RDD-join-advice-please-tp22612p22614.html Sent from the Apache Spark User List mailing list archive at Nabble

RE: ElasticSearch for Spark times out

2015-04-22 Thread Adrian Mocanu
Hi Thanks for the help. My ES is up. Out of curiosity, do you know what the timeout value is? There are probably other things happening to cause the timeout; I don't think my ES is that slow but it's possible that ES is taking too long to find the data. What I see happening is that it uses scro

RE: Scheduling across applications - Need suggestion

2015-04-22 Thread yana
Yes. The fair scheduler only helps concurrency within an application. With multiple shells you'd either need something like YARN/Mesos or careful math on resources, as you said. Sent on the new Sprint Network from my Samsung Galaxy S®4. Original message From: Arun Patel Date:04/2

Re: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-04-22 Thread Sujee Maniyam
Thanks all... btw, s3n load works without any issues with spark-1.3.1-built-for-hadoop 2.4 I tried this on 1.3.1-hadoop26 > sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem") > val f = sc.textFile("s3n://bucket/file") > f.count No it can't find the im

Spark SQL: SchemaRDD, DataFrame. Multi-value, Nested attributes

2015-04-22 Thread Eugene Morozov
Hi! I’m trying to query a dataset that reads data from csv and provides a SQL on top of it. The problem I have is I have a hierarchy of objects that I need to represent as a table so that users might use SQL to query it and do some aggregations. I do have multi value attributes (that in csv fil

Re: Efficient saveAsTextFile by key, directory for each key?

2015-04-22 Thread Arun Luthra
I ended up post-processing the result in hive with a dynamic partition insert query to get a table partitioned by the key. Looking further, it seems that 'dynamic partition' insert is in Spark SQL and working well in Spark SQL versions > 1.2.0: https://issues.apache.org/jira/browse/SPARK-3007 On
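For reference, a rough sketch of the dynamic-partition insert through a HiveContext (table and column names are made up):

    sqlContext.sql("SET hive.exec.dynamic.partition = true")
    sqlContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
    sqlContext.sql(
      """INSERT OVERWRITE TABLE events_by_key PARTITION (key)
        |SELECT value, key FROM staged_events""".stripMargin)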

Re: ElasticSearch for Spark times out

2015-04-22 Thread Jeetendra Gangele
Basically, a read timeout means that no data arrived within the specified receive timeout period. A few things I would suggest: 1. Is your ES cluster up and running? 2. If 1 is yes, then reduce the size of the index, make it a few KB, and then test. On 23 April 2015 at 00:19, Adrian Mocanu wrote: > Hi >

RE: ElasticSearch for Spark times out

2015-04-22 Thread Adrian Mocanu
Hi The gist of it is this: I have data indexed into ES. Each index stores monthly data and the query will get data for some date range (across several ES indexes or within 1 if the date range is within 1 month). Then I merge these RDDs into an uberRdd and performs some operations then print the

Re: ElasticSearch for Spark times out

2015-04-22 Thread Jeetendra Gangele
will you be able to paste the code? On 23 April 2015 at 00:19, Adrian Mocanu wrote: > Hi > > > > I use the ElasticSearch package for Spark and very often it times out > reading data from ES into an RDD. > > How can I keep the connection alive (why doesn't it? Bug?) > > > > Here's the exception

ElasticSearch for Spark times out

2015-04-22 Thread Adrian Mocanu
Hi I use the ElasticSearch package for Spark and very often it times out reading data from ES into an RDD. How can I keep the connection alive (why doesn't it? Bug?) Here's the exception I get: org.elasticsearch.hadoop.serialization.EsHadoopSerializationException: java.net.SocketTimeoutExceptio

Re: problem writing to s3

2015-04-22 Thread Daniel Mahler
Hi Akhil, It works fine when outprefix is a hdfs:///localhost/... url. It looks to me as if there is something about spark writing to the same s3 bucket it is reading from. That is the only real difference between the 2 saveAsTextFile whet outprefix is on s3, inpath is also on s3 but in a differ

Re: Map Question

2015-04-22 Thread Tathagata Das
Absolutely. The same code would work for local as well as distributed mode! On Wed, Apr 22, 2015 at 11:08 AM, Vadim Bichutskiy < vadim.bichuts...@gmail.com> wrote: > Can I use broadcast vars in local mode? > ᐧ > > On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das > wrote: > >> Yep. Not efficient. P

Re: Spark Streaming updatyeStateByKey throws OutOfMemory Error

2015-04-22 Thread Tathagata Das
It could very well be that your executor memory is not enough to store the state RDDs AND operate on the data. 1G per executor is quite low. Definitely give more memory. And have you tried increasing the number of partitions (specify number of partitions in updateStateByKey) ? On Wed, Apr 22, 2015
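For reference, a sketch of the updateStateByKey overload that takes a partition count (the update function, value type, and pairDStream name are simplified assumptions):

    val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
      Option(newValues.sum + state.getOrElse(0))

    // more partitions means each task carries a smaller slice of the state RDD
    val stateDStream = pairDStream.updateStateByKey(updateFunc, numPartitions = 200)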

Re: Map Question

2015-04-22 Thread Vadim Bichutskiy
Can I use broadcast vars in local mode? ᐧ On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das wrote: > Yep. Not efficient. Pretty bad actually. That's why broadcast variable > were introduced right at the very beginning of Spark. > > > > On Wed, Apr 22, 2015 at 10:58 AM, Vadim Bichutskiy < > vadim.bi

Re: Map Question

2015-04-22 Thread Tathagata Das
Yep. Not efficient. Pretty bad actually. That's why broadcast variable were introduced right at the very beginning of Spark. On Wed, Apr 22, 2015 at 10:58 AM, Vadim Bichutskiy < vadim.bichuts...@gmail.com> wrote: > Thanks TD. I was looking into broadcast variables. > > Right now I am running it

Trouble working with Spark-CSV package (error: object databricks is not a member of package com)

2015-04-22 Thread Mohammed Omer
Afternoon all, I'm working with Scala 2.11.6, and Spark 1.3.1 built from source via: `mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package` The error is encountered when running spark shell via: `spark-shell --packages com.databricks:spark-csv_2.11:1.0.3` The full trace of the comman

Re: Not able run multiple tasks in parallel, spark streaming

2015-04-22 Thread Tathagata Das
Furthermore, just to explain, doing arr.par.foreach does not help because it is not really running anything, it is only doing the setup of the computation. Doing the setup in parallel does not mean that the jobs will be done concurrently. Also, from your code it seems like your pairs of dstreams don't intera

Re: Map Question

2015-04-22 Thread Vadim Bichutskiy
Thanks TD. I was looking into broadcast variables. Right now I am running it locally...and I plan to move it to "production" on EC2. The way I fixed it is by doing myrdd.map(lambda x: (x, mylist)).map(myfunc) but I don't think it's efficient? mylist is filled only once at the start and never cha

Re: Spark streaming action running the same work in parallel

2015-04-22 Thread Tathagata Das
Unfortunately, none of the indicated logs, etc. is visible in the mail we got through the mailing list. On Wed, Apr 22, 2015 at 10:16 AM, ColinMc wrote: > Hi, > > I'm running a unit test that keeps failing to work with the code I wrote in > Spark. > > Here is the output logs from my test that I

Re: HiveContext setConf seems not stable

2015-04-22 Thread madhu phatak
Hi, calling getConf doesn't solve the issue. Even many Hive-specific queries are broken. It seems like no Hive configurations are getting passed properly. Regards, Madhukara Phatak http://datamantra.io/ On Wed, Apr 22, 2015 at 2:19 AM, Michael Armbrust wrote: > As a workaround, can you call getCo

Re: Map Question

2015-04-22 Thread Tathagata Das
Is the mylist present on every executor? If not, then you have to pass it on. And broadcasts are the best way to pass them on. But note that once broadcast it will be immutable at the executors, and if you update the list at the driver, you will have to broadcast it again. TD On Wed, Apr 22, 2015
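The thread is about PySpark, but the pattern is the same in either API; a Scala sketch with placeholder names (myList, myRdd, myFunc):

    val metadataBc = sc.broadcast(myList)       // ship the list to every executor once

    val enriched = myRdd.map { x =>
      // read the broadcast value inside the task instead of capturing the driver-side list
      myFunc(x, metadataBc.value)
    }
    // if the list changes on the driver, broadcast it again and rebuild the RDD operations on it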

Re: Distinct is very slow

2015-04-22 Thread Jeetendra Gangele
I made 7000 tasks in mapToPair and in distinct I also made the same number of tasks. Still lots of shuffle read and write is happening, and the application is running for a much longer time. Any idea? On 17 April 2015 at 11:55, Akhil Das wrote: > How many tasks are you seeing in your mapToPair stage? Is it

Re: regarding ZipWithIndex

2015-04-22 Thread Jeetendra Gangele
Sure thanks. if you can guide me how to do this will be great help. On 17 April 2015 at 22:05, Ted Yu wrote: > I have some assignments on hand at the moment. > > Will try to come up with sample code after I clear the assignments. > > FYI > > On Thu, Apr 16, 2015 at 2:00 PM, Jeetendra Gangele >

Spark streaming action running the same work in parallel

2015-04-22 Thread ColinMc
Hi, I'm running a unit test that keeps failing to work with the code I wrote in Spark. Here are the output logs from my test, which gets the customers from incoming events in the JSON called QueueEvent; I am trying to convert the incoming events for each customer to an alert. From

RDD.filter vs. RDD.join--advice please

2015-04-22 Thread hokiegeek2
Hi Everyone, I have two options of filtering the RDD resulting from the Graph.vertices method as illustrated with the following pseudo code: 1. Filter val vertexSet = Set("vertexOne","vertexTwo"...); val filteredVertices = Graph.vertices.filter(x => vertexSet.contains(x._2.vertexName)) 2. Join

RE: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-04-22 Thread Shuai Zheng
Below is my code to access s3n without problem (only for 1.3.1. there is a bug in 1.3.0). Configuration hadoopConf = ctx.hadoopConfiguration(); hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem"); hadoopConf.set("fs.s3n

Re: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-04-22 Thread Ted Yu
This thread from hadoop mailing list should give you some clue: http://search-hadoop.com/m/LgpTk2df7822 On Wed, Apr 22, 2015 at 9:45 AM, Sujee Maniyam wrote: > Hi all > I am unable to access s3n:// urls using sc.textFile().. getting 'no > file system for scheme s3n://' error. > > a bug or so

spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-04-22 Thread Sujee Maniyam
Hi all, I am unable to access s3n:// urls using sc.textFile()... getting a 'no file system for scheme s3n://' error. A bug, or some conf settings missing? See below for details: env variables : AWS_SECRET_ACCESS_KEY=set AWS_ACCESS_KEY_ID=set spark/RELEASE : Spark 1.3.1 (git revision 908a0bf) bui

Map Question

2015-04-22 Thread Vadim Bichutskiy
I am using Spark Streaming with Python. For each RDD, I call a map, i.e., myrdd.map(myfunc), myfunc is in a separate Python module. In yet another separate Python module I have a global list, i.e. mylist, that's populated with metadata. I can't get myfunc to see mylist...it's always empty. Alternat

Master <-chatter -> Worker

2015-04-22 Thread James King
Is there a good resource that covers what kind of chatter (communication) that goes on between driver, master and worker processes? Thanks

Re: Spark Performance on Yarn

2015-04-22 Thread Neelesh Salian
Does it still hit the memory limit for the container? An expensive transformation? On Wed, Apr 22, 2015 at 8:45 AM, Ted Yu wrote: > In master branch, overhead is now 10%. > That would be 500 MB > > FYI > > > > > On Apr 22, 2015, at 8:26 AM, nsalian wrote: > > > > +1 to executor-memory to 5g. >

Re: Spark Performance on Yarn

2015-04-22 Thread Ted Yu
In master branch, overhead is now 10%. That would be 500 MB FYI > On Apr 22, 2015, at 8:26 AM, nsalian wrote: > > +1 to executor-memory to 5g. > Do check the overhead space for both the driver and the executor as per > Wilfred's suggestion. > > Typically, 384 MB should suffice. > > > >

Re: Multiple HA spark clusters managed by 1 ZK cluster?

2015-04-22 Thread Akhil Das
nice, thanks for the information. Thanks Best Regards On Wed, Apr 22, 2015 at 8:53 PM, Jeff Nadler wrote: > > You can run multiple Spark clusters against one ZK cluster. Just use > this config to set independent ZK roots for each cluster: > > spark.deploy.zookeeper.dir > The directo

the indices of SparseVector must be ordered while computing SVD

2015-04-22 Thread yaochunnan
Hi all, I am using Spark 1.3.1 to write a Spectral Clustering algorithm. This really confused me today. At first I thought my implementation is wrong. It turns out it's an issue in MLlib. Fortunately, I've figured it out. I suggest to add a hint on user document of MLlib ( as far as I know, ther

Re: Spark Performance on Yarn

2015-04-22 Thread nsalian
+1 to executor-memory to 5g. Do check the overhead space for both the driver and the executor as per Wilfred's suggestion. Typically, 384 MB should suffice. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p22610.html Sent fr

Re: Multiple HA spark clusters managed by 1 ZK cluster?

2015-04-22 Thread Jeff Nadler
You can run multiple Spark clusters against one ZK cluster. Just use this config to set independent ZK roots for each cluster: spark.deploy.zookeeper.dir The directory in ZooKeeper to store recovery state (default: /spark). -Jeff From: Sean Owen To: Akhil Das Cc: Michal Klos , Use
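For illustration, each cluster's masters would point at the same ZooKeeper ensemble but a different root, e.g. in spark-env.sh (hostnames and paths are placeholders):

    # masters of cluster A
    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark-cluster-a"

    # masters of cluster B: same zookeeper.url, but -Dspark.deploy.zookeeper.dir=/spark-cluster-b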

Re: Parquet Hive table become very slow on 1.3?

2015-04-22 Thread Yin Huai
Xudong and Rex, Can you try 1.3.1? With PR 5339 , after we get a hive parquet from metastore and convert it to our native parquet code path, we will cache the converted relation. For now, the first access to that hive parquet table reads all of the footer

Re: Convert DStream to DataFrame

2015-04-22 Thread Tathagata Das
Aaah, that. That is probably a limitation of the SQLContext (cc'ing Yin for more information). On Wed, Apr 22, 2015 at 7:07 AM, Sergio Jiménez Barrio < drarse.a...@gmail.com> wrote: > Sorry, this is the error: > > [error] /home/sergio/Escritorio/hello/streaming.scala:77: Implementation > restric

Re: Building Spark : Building just one module.

2015-04-22 Thread Iulian Dragoș
One way is to use export SPARK_PREPEND_CLASSES=true. This will instruct the launcher to prepend the target directories for each project to the spark assembly. I’ve had mixed experiences with it lately, but in principle that's the only way I know. ​ On Wed, Apr 22, 2015 at 3:42 PM, zia_kayani wrot
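A hedged sketch of that workflow (sbt project names as in the Spark build of that era):

    export SPARK_PREPEND_CLASSES=true
    build/sbt catalyst/compile sql/compile   # recompile only the modules you changed
    ./bin/spark-shell                        # the launcher now prepends those target dirs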

Can I index a column in parquet file to make it join faster

2015-04-22 Thread Wang, Ningjun (LNG-NPV)
I have two RDDs, each saved in a parquet file. I need to join these two RDDs by the "id" column. Can I create an index on the id column so they can join faster? Here is the code: case class Example(val id: String, val category: String) case class DocVector(val id: String, val vector: Vector) val

Re: Convert DStream to DataFrame

2015-04-22 Thread Sergio Jiménez Barrio
Sorry, this is the error: [error] /home/sergio/Escritorio/hello/streaming.scala:77: Implementation restriction: case classes cannot have more than 22 parameters. 2015-04-22 16:06 GMT+02:00 Sergio Jiménez Barrio : > I tried the solution from the guide, but I exceeded the size limit of the case class > Row:
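One way around the 22-field case-class limit is to describe the schema programmatically and use createDataFrame; a sketch with placeholder field names and a placeholder parse step:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // as many fields as needed -- StructType has no 22-column restriction
    val schema = StructType(Seq(
      StructField("field1", StringType, nullable = true),
      StructField("field2", IntegerType, nullable = true)
      // ... more StructFields ...
    ))

    messages.map(_._2).foreachRDD { rdd =>
      val rowRDD = rdd.map(line => Row.fromSeq(parseFields(line)))  // parseFields is a placeholder
      val df = sqlContext.createDataFrame(rowRDD, schema)
      df.registerTempTable("events")
    }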

Re: Auto Starting a Spark Job on Cluster Starup

2015-04-22 Thread Ted Yu
This thread seems related: http://search-hadoop.com/m/JW1q51W02V Cheers On Wed, Apr 22, 2015 at 6:09 AM, James King wrote: > What's the best way to start-up a spark job as part of starting-up the > Spark cluster. > > I have an single uber jar for my job and want to make the start-up as easy > a

Re: Convert DStream to DataFrame

2015-04-22 Thread Sergio Jiménez Barrio
I tried the solution from the guide, but I exceeded the size limit of the case class Row: 2015-04-22 15:22 GMT+02:00 Tathagata Das : > Did you check out the latest streaming programming guide? > > > http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations > > You also

RE: implicits is not a member of org.apache.spark.sql.SQLContext

2015-04-22 Thread Wang, Ningjun (LNG-NPV)
I only have import sqlContext.implicits._ but did not have import sqlContext.implicits._ I don’t think this is a problem because it compiles fine using sbt. I think it is a problem with IntelliJ IDEA. I just deleted the project and reimported it into IntelliJ IDEA and now it works. Thanks Ni

Re: How to merge two dataframes with same schema

2015-04-22 Thread Peter Rudenko
Just use the unionAll method: df1.show() gives rows (name, id) = (a, 1), (b, 2); df2.show() gives (c, 3), (d, 4); df1.unionAll(df2).show() gives (a, 1), (b, 2), (c, 3), (d, 4). Thanks, Peter Rudenko On 2015-04-22 16:38, bipin wrote: I have looked into

Building Spark : Building just one module.

2015-04-22 Thread zia_kayani
Hi, I have to add custom things to the Spark SQL and Catalyst modules... But every time I change a line of code I have to compile the whole of Spark; if I only compile the sql/core and sql/catalyst modules, those changes aren't visible when I run the job on that Spark build. What am I missing? Any other way to

How to merge two dataframes with same schema

2015-04-22 Thread bipin
I have looked into sqlContext documentation but there is nothing on how to merge two data-frames. How can I do this ? Thanks Bipin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-merge-two-dataframes-with-same-schema-tp22606.html Sent from the Apache

Re: Convert DStream to DataFrame

2015-04-22 Thread Tathagata Das
Did you checkout the latest streaming programming guide? http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations You also need to be aware of that to convert json RDDs to dataframe, sqlContext has to make a pass on the data to learn the schema. This will

Re: Clustering algorithms in Spark

2015-04-22 Thread Jeetendra Gangele
Does anybody have any thoughts on this? On 21 April 2015 at 20:57, Jeetendra Gangele wrote: > The problem with k-means is we have to define the number of clusters, which I > don't want in this case. > So I am thinking of something like hierarchical clustering. Any ideas and > suggestions? > > > > On 21 April 2

Re: Convert DStream to DataFrame

2015-04-22 Thread ayan guha
What about sqlcontext.createDataframe(rdd)? On 22 Apr 2015 23:04, "Sergio Jiménez Barrio" wrote: > Hi, > > I am using Kafka with Apache Stream to send JSON to Apache Spark: > > val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, > StringDecoder](ssc, kafkaParams, topicsSe

Auto Starting a Spark Job on Cluster Starup

2015-04-22 Thread James King
What's the best way to start up a Spark job as part of starting up the Spark cluster? I have a single uber jar for my job and want to make the start-up as easy as possible. Thanks jk

Convert DStream to DataFrame

2015-04-22 Thread Sergio Jiménez Barrio
Hi, I am using Kafka with Spark Streaming to send JSON to Apache Spark: val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet) Now, I want to parse the created DStream to a DataFrame, but I don't know if Spark 1.3 has some easy way for t

Re: Shuffle question

2015-04-22 Thread Marius Danciu
Thank you Iulian ! That's precisely what I discovered today. Best, Marius On Wed, Apr 22, 2015 at 3:31 PM Iulian Dragoș wrote: > On Tue, Apr 21, 2015 at 2:38 PM, Marius Danciu > wrote: > >> Hello anyone, >> >> I have a question regarding the sort shuffle. Roughly I'm doing something >> like:

Re: Shuffle question

2015-04-22 Thread Iulian Dragoș
On Tue, Apr 21, 2015 at 2:38 PM, Marius Danciu wrote: > Hello anyone, > > I have a question regarding the sort shuffle. Roughly I'm doing something > like: > > rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2) > > The problem is that in f2 I don't see the keys being sorted. The key

Re: Start ThriftServer Error

2015-04-22 Thread Yiannis Gkoufas
Hi Himanshu, I am using: ./start-thriftserver.sh --master spark://localhost:7077 Do I need to specify something additional to the command? Thanks! On 22 April 2015 at 13:14, Himanshu Parashar wrote: > what command are you using to start the Thrift server? > > On Wed, Apr 22, 2015 at 3:52 PM,

Re: Start ThriftServer Error

2015-04-22 Thread Himanshu Parashar
what command are you using to start the Thrift server? On Wed, Apr 22, 2015 at 3:52 PM, Yiannis Gkoufas wrote: > Hi all, > > I am trying to start the thriftserver and I get some errors. > I have hive running and placed hive-site.xml under the conf directory. > From the logs I can see that the er
