Re: Addition of new Metrics for killed executors.

2015-04-22 Thread twinkle sachdeva
Hi, Looks interesting. It would be quite interesting to know the reason for not showing these stats in the UI. As per the description of Patrick W in https://spark-project.atlassian.net/browse/SPARK-999, it does not mention any exception w.r.t failed tasks/executors. Can somebo

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-22 Thread Sourav Chandra
Hi Olivier, the update function is as below:

  val updateFunc = (values: Seq[IConcurrentUsers], state: Option[(Long, Long)]) => {
    val previousCount = state.getOrElse((0L, 0L))._2
    var startValue: IConcurrentUsers = ConcurrentViewers(0)
    var currentCount = 0L
    val las

Re: Join on DataFrames from the same source (Pyspark)

2015-04-22 Thread Karlson
DataFrames do not have the attributes 'alias' or 'as' in the Python API. On 2015-04-21 20:41, Michael Armbrust wrote: This is https://issues.apache.org/jira/browse/SPARK-6231 Unfortunately this is pretty hard to fix as it's hard for us to differentiate these without aliases. However you can add

Building Spark : Adding new DataType in Catalyst

2015-04-22 Thread zia_kayani
Hi, I am working on adding Geometry, i.e. a new DataType, into Spark Catalyst so that a Row can hold that object as well. I've made progress, but it is time-consuming because I have to compile the whole Spark project; otherwise the changes aren't visible. I've tried to just build the Spark SQL and Catalyst modules but

[MLlib] fail to run word2vec

2015-04-22 Thread gm yu
When using MLlib's Word2Vec, I hit the following error: allocating large array--thread_id[0x7ff2741ca000]--thread_name[Driver]--array_size[1146093680 bytes]--array_length[1146093656 elements] prio=10 tid=0x7ff2741ca000 nid=0x1405f runnable at java.util.Arrays.copyOf(Arrays.java:2786)

Re: sparksql - HiveConf not found during task deserialization

2015-04-22 Thread Akhil Das
I see. Now try a slightly tricky approach: add the hive jar to the SPARK_CLASSPATH (in the conf/spark-env.sh file on all machines) and make sure that jar is available on all the machines in the cluster at the same path. Thanks Best Regards On Wed, Apr 22, 2015 at 11:24 AM, Manku Timma wrote: > Akhil, Th

Hive table creation - possible bug in Spark 1.3?

2015-04-22 Thread Ophir Cohen
I wrote a few mails here regarding this issue. After further investigation I think there is a bug in Spark 1.3 in saving Hive tables. (hc is HiveContext) 1. Verify the needed configuration exists: scala> hc.sql("set hive.exec.compress.output").collect res4: Array[org.apache.spark.sql.Row] = Array([

Re: How does GraphX store the routing table?

2015-04-22 Thread MUHAMMAD AAMIR
Hi Ankur, Thanks for the answer. However I still have the following queries. On Wed, Apr 22, 2015 at 8:39 AM, Ankur Dave wrote: > On Tue, Apr 21, 2015 at 10:39 AM, mas wrote: > >> How does GraphX store the routing table? Is it stored on the master node >> or >> are chunks of the routing table sent

Error in creating spark RDD

2015-04-22 Thread madhvi
Hi, I am creating a Spark RDD over Accumulo like: JavaPairRDD accumuloRDD = sc.newAPIHadoopRDD(accumuloJob.getConfiguration(), AccumuloInputFormat.class, Key.class, Value.class); But I am getting the following error and it does not compile: Bound mismatch: The generic method

Re: Shuffle question

2015-04-22 Thread Marius Danciu
Anyone ? On Tue, Apr 21, 2015 at 3:38 PM Marius Danciu wrote: > Hello anyone, > > I have a question regarding the sort shuffle. Roughly I'm doing something > like: > > rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2) > > The problem is that in f2 I don't see the keys being sorte

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-22 Thread Sourav Chandra
Anyone? On Wed, Apr 22, 2015 at 12:29 PM, Sourav Chandra < sourav.chan...@livestream.com> wrote: > Hi Olivier, > > the update function is as below: > > val updateFunc = (values: Seq[IConcurrentUsers], state: Option[(Long, > Long)]) => { > val previousCount = state.getOrElse((0L, 0L))._

Re: Multiple HA spark clusters managed by 1 ZK cluster?

2015-04-22 Thread Sean Owen
Not that I've tried it, but why couldn't you use one ZK server? I don't see a reason. On Wed, Apr 22, 2015 at 7:40 AM, Akhil Das wrote: > It isn't mentioned anywhere in the doc, but you will probably need separate > ZK for each of your HA cluster. > > Thanks > Best Regards > > On Wed, Apr 22, 20

Start ThriftServer Error

2015-04-22 Thread Yiannis Gkoufas
Hi all, I am trying to start the thriftserver and I get some errors. I have hive running and placed hive-site.xml under the conf directory. From the logs I can see that the error is: Call From localhost to localhost:54310 failed. I am assuming that it tries to connect to the wrong port for the n

Re: Multiple HA spark clusters managed by 1 ZK cluster?

2015-04-22 Thread Steve Loughran
The key thing would be to use different ZK paths for each cluster. You shouldn't need more than 2 ZK quorums even for large (few-thousand-node) Hadoop clusters: one for the HA bits of the infrastructure (HDFS, YARN) and one for the applications to abuse. It's easy for apps using ZK to stick to

Scheduling across applications - Need suggestion

2015-04-22 Thread Arun Patel
I believe we can use the properties like --executor-memory --total-executor-cores to configure the resources allocated for each application. But, in a multi user environment, shells and applications are being submitted by multiple users at the same time. All users are requesting resources with d

How to write Hive's map(key, value, ...) in Spark SQL DSL

2015-04-22 Thread Jianshi Huang
Hi, I want to write this in Spark SQL DSL: select map('c1', c1, 'c2', c2) as m from table Is there a way? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
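
In Spark 1.3 the DataFrame DSL has, as far as I know, no map-constructor function, so the safest route is HiveQL on a HiveContext. A minimal sketch (hc and the table name t are placeholders, not from the thread):

  // Hive's built-in map(k1, v1, k2, v2, ...) works through plain HiveQL
  val df = hc.sql("SELECT map('c1', c1, 'c2', c2) AS m FROM t")
  df.printSchema() // m should come back as a map type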

String literal in dataframe.select(...)

2015-04-22 Thread Jianshi Huang
Hi, I want to do this in Spark SQL DSL: select '2015-04-22' as date from table How to do this? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: String literal in dataframe.select(...)

2015-04-22 Thread Jianshi Huang
Oh, I found it out. Need to import sql.functions._ Then I can do table.select(lit("2015-04-22").as("date")) Jianshi On Wed, Apr 22, 2015 at 7:27 PM, Jianshi Huang wrote: > Hi, > > I want to do this in Spark SQL DSL: > > select '2015-04-22' as date > from table > > How to do this? > >
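
Putting that fix together as a runnable fragment (assuming an existing DataFrame named table and a SQLContext in scope):

  import org.apache.spark.sql.functions._ // brings lit() into scope

  // select the string literal as a column named "date"
  val withDate = table.select(lit("2015-04-22").as("date"))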

Spark SQL performance issue.

2015-04-22 Thread Nikolay Tikhonov
Hi, I have a Spark SQL performance issue. My code contains a simple JavaBean: public class Person implements Externalizable { > private int id; > private String name; > private double salary; > > } > Apply a schema to an RDD and register the table. JavaRDD rdds = .

Re: Start ThriftServer Error

2015-04-22 Thread Himanshu Parashar
What command are you using to start the Thrift server? On Wed, Apr 22, 2015 at 3:52 PM, Yiannis Gkoufas wrote: > Hi all, > > I am trying to start the thriftserver and I get some errors. > I have hive running and placed hive-site.xml under the conf directory. > From the logs I can see that the er

Re: Start ThriftServer Error

2015-04-22 Thread Yiannis Gkoufas
Hi Himanshu, I am using: ./start-thriftserver.sh --master spark://localhost:7077 Do I need to specify something additional to the command? Thanks! On 22 April 2015 at 13:14, Himanshu Parashar wrote: > what command are you using to start the Thrift server? > > On Wed, Apr 22, 2015 at 3:52 PM,

Re: Shuffle question

2015-04-22 Thread Iulian Dragoș
On Tue, Apr 21, 2015 at 2:38 PM, Marius Danciu wrote: > Hello anyone, > > I have a question regarding the sort shuffle. Roughly I'm doing something > like: > > rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2) > > The problem is that in f2 I don't see the keys being sorted. The key

Re: Shuffle question

2015-04-22 Thread Marius Danciu
Thank you Iulian ! That's precisely what I discovered today. Best, Marius On Wed, Apr 22, 2015 at 3:31 PM Iulian Dragoș wrote: > On Tue, Apr 21, 2015 at 2:38 PM, Marius Danciu > wrote: > >> Hello anyone, >> >> I have a question regarding the sort shuffle. Roughly I'm doing something >> like:

Convert DStream to DataFrame

2015-04-22 Thread Sergio Jiménez Barrio
Hi, I am using Kafka with Spark Streaming to send JSON to Apache Spark: val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet) Now I want to parse the DStream into a DataFrame, but I don't know if Spark 1.3 has some easy way for t

Auto Starting a Spark Job on Cluster Starup

2015-04-22 Thread James King
What's the best way to start up a Spark job as part of starting up the Spark cluster? I have a single uber jar for my job and want to make the start-up as easy as possible. Thanks jk

Re: Convert DStream to DataFrame

2015-04-22 Thread ayan guha
What about sqlcontext.createDataframe(rdd)? On 22 Apr 2015 23:04, "Sergio Jiménez Barrio" wrote: > Hi, > > I am using Kafka with Apache Stream to send JSON to Apache Spark: > > val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, > StringDecoder](ssc, kafkaParams, topicsSe

Re: Clustering algorithms in Spark

2015-04-22 Thread Jeetendra Gangele
Does anybody have any thoughts on this? On 21 April 2015 at 20:57, Jeetendra Gangele wrote: > The problem with k-means is we have to define the number of clusters, which I > don't want in this case. > So I'm thinking of something like hierarchical clustering; any ideas and > suggestions? > > > > On 21 April 2

Re: Convert DStream to DataFrame

2015-04-22 Thread Tathagata Das
Did you check out the latest streaming programming guide? http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations You also need to be aware that to convert JSON RDDs to a DataFrame, sqlContext has to make a pass over the data to learn the schema. This will
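
A sketch of the pattern from that guide section, assuming the JSON payload is the second element of each Kafka tuple (ssc and messages as in the question above); note that jsonRDD infers the schema with an extra pass over the data, which is the cost mentioned here:

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(ssc.sparkContext)
  messages.map(_._2).foreachRDD { rdd =>
    val df = sqlContext.jsonRDD(rdd) // schema inference pass happens here
    df.registerTempTable("events")
    sqlContext.sql("SELECT count(*) FROM events").show()
  }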

How to merge two dataframes with same schema

2015-04-22 Thread bipin
I have looked into the sqlContext documentation but there is nothing on how to merge two data-frames. How can I do this? Thanks Bipin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-merge-two-dataframes-with-same-schema-tp22606.html Sent from the Apache

Building Spark : Building just one module.

2015-04-22 Thread zia_kayani
Hi, I have to add custom things into the Spark SQL and Catalyst modules, but every time I change a line of code I have to compile the whole of Spark. If I only compile the sql/core and sql/catalyst modules, those changes aren't visible when I run a job on that Spark build. What am I missing? Any other way to

Re: How to merge two dataframes with same schema

2015-04-22 Thread Peter Rudenko
Just use the unionAll method:

  df1.show()
  name id
  a    1
  b    2

  df2.show()
  name id
  c    3
  d    4

  df1.unionAll(df2).show()
  name id
  a    1
  b    2
  c    3
  d    4

Thanks, Peter Rudenko On 2015-04-22 16:38, bipin wrote: I have looked into

RE: implicits is not a member of org.apache.spark.sql.SQLContext

2015-04-22 Thread Wang, Ningjun (LNG-NPV)
I only have import sqlContext.implicits._ but did not have import sqlContext.implicits._ I don't think this is a problem because it compiles fine using sbt. I think it is a problem of IntelliJ IDEA. I just deleted the project and reimported it into IntelliJ IDEA and now it works. Thanks Ni

Re: Convert DStream to DataFrame

2015-04-22 Thread Sergio Jiménez Barrio
I tried the solution from the guide, but I exceeded the size limit of the case class for Row: 2015-04-22 15:22 GMT+02:00 Tathagata Das : > Did you check out the latest streaming programming guide? > > > http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations > > You also

Re: Auto Starting a Spark Job on Cluster Starup

2015-04-22 Thread Ted Yu
This thread seems related: http://search-hadoop.com/m/JW1q51W02V Cheers On Wed, Apr 22, 2015 at 6:09 AM, James King wrote: > What's the best way to start-up a spark job as part of starting-up the > Spark cluster. > > I have an single uber jar for my job and want to make the start-up as easy > a

Re: Convert DStream to DataFrame

2015-04-22 Thread Sergio Jiménez Barrio
Sorry, this is the error: [error] /home/sergio/Escritorio/hello/streaming.scala:77: Implementation restriction: case classes cannot have more than 22 parameters. 2015-04-22 16:06 GMT+02:00 Sergio Jiménez Barrio : > I tried the solution of the guide, but I exceded the size of case class > Row:

Can I index a column in parquet file to make it join faster

2015-04-22 Thread Wang, Ningjun (LNG-NPV)
I have two RDDs, each saved in a parquet file. I need to join these two RDDs on the "id" column. Can I create an index on the id column so they can join faster? Here is the code case class Example(val id: String, val category: String) case class DocVector(val id: String, val vector: Vector) val
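
As far as I know, Parquet (as of Spark 1.3) has no secondary index that the join could use, but the DataFrame join itself is straightforward. A sketch, assuming the two files were written with saveAsParquetFile and a SQLContext named sqlContext (the paths are placeholders):

  val examples = sqlContext.parquetFile("/data/examples.parquet")
  val vectors = sqlContext.parquetFile("/data/docVectors.parquet")

  // equi-join on the shared "id" column
  val joined = examples.join(vectors, examples("id") === vectors("id"))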

Re: Building Spark : Building just one module.

2015-04-22 Thread Iulian Dragoș
One way is to use export SPARK_PREPEND_CLASSES=true. This will instruct the launcher to prepend the target directories for each project to the spark assembly. I’ve had mixed experiences with it lately, but in principle that's the only way I know. ​ On Wed, Apr 22, 2015 at 3:42 PM, zia_kayani wrot
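
A sketch of that workflow (the module paths are taken from the 1.3 source tree; -am pulls in upstream modules on the first build):

  export SPARK_PREPEND_CLASSES=true
  # rebuild only catalyst and sql/core instead of the full assembly
  mvn -pl sql/catalyst,sql/core -am -DskipTests package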

Re: Convert DStream to DataFrame

2015-04-22 Thread Tathagata Das
Aaah, that. That is probably a limitation of the SQLContext (cc'ing Yin for more information). On Wed, Apr 22, 2015 at 7:07 AM, Sergio Jiménez Barrio < drarse.a...@gmail.com> wrote: > Sorry, this is the error: > > [error] /home/sergio/Escritorio/hello/streaming.scala:77: Implementation > restric
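
The 22-field ceiling comes from Scala 2.10 case classes; the usual workaround is to skip case classes and build the schema programmatically. A sketch, assuming a SQLContext named sqlContext and an rdd of string arrays with one entry per column (field names and count are placeholders):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{StructField, StructType, StringType}

  // one StructField per column, generated in a loop, so no 22-field limit
  val schema = StructType((1 to 30).map(i => StructField(s"field$i", StringType, nullable = true)))
  val df = sqlContext.createDataFrame(rdd.map(fields => Row.fromSeq(fields)), schema)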

Re: Parquet Hive table become very slow on 1.3?

2015-04-22 Thread Yin Huai
Xudong and Rex, Can you try 1.3.1? With PR 5339, after we get a Hive Parquet table from the metastore and convert it to our native Parquet code path, we will cache the converted relation. For now, the first access to that Hive Parquet table reads all of the footer

Re: Multiple HA spark clusters managed by 1 ZK cluster?

2015-04-22 Thread Jeff Nadler
You can run multiple Spark clusters against one ZK cluster. Just use this config to set independent ZK roots for each cluster: spark.deploy.zookeeper.dir The directory in ZooKeeper to store recovery state (default: /spark). -Jeff
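
For reference, a sketch of the standalone-HA settings involved, set in conf/spark-env.sh on each master (hostnames and the per-cluster /spark-cluster1 path are placeholders):

  export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
    -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
    -Dspark.deploy.zookeeper.dir=/spark-cluster1"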

Re: Spark Performance on Yarn

2015-04-22 Thread nsalian
+1 to executor-memory to 5g. Do check the overhead space for both the driver and the executor as per Wilfred's suggestion. Typically, 384 MB should suffice. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p22610.html Sent fr

the indices of SparseVector must be ordered while computing SVD

2015-04-22 Thread yaochunnan
Hi all, I am using Spark 1.3.1 to write a Spectral Clustering algorithm. This really confused me today. At first I thought my implementation was wrong. It turns out it's an issue in MLlib. Fortunately, I've figured it out. I suggest adding a hint to the MLlib user documentation (as far as I know, ther

Re: Multiple HA spark clusters managed by 1 ZK cluster?

2015-04-22 Thread Akhil Das
nice, thanks for the information. Thanks Best Regards On Wed, Apr 22, 2015 at 8:53 PM, Jeff Nadler wrote: > > You can run multiple Spark clusters against one ZK cluster. Just use > this config to set independent ZK roots for each cluster: > > spark.deploy.zookeeper.dir > The directo

Re: Spark Performance on Yarn

2015-04-22 Thread Ted Yu
In master branch, overhead is now 10%. That would be 500 MB FYI > On Apr 22, 2015, at 8:26 AM, nsalian wrote: > > +1 to executor-memory to 5g. > Do check the overhead space for both the driver and the executor as per > Wilfred's suggestion. > > Typically, 384 MB should suffice. > > > >

Re: Spark Performance on Yarn

2015-04-22 Thread Neelesh Salian
Does it still hit the memory limit for the container? An expensive transformation? On Wed, Apr 22, 2015 at 8:45 AM, Ted Yu wrote: > In master branch, overhead is now 10%. > That would be 500 MB > > FYI > > > > > On Apr 22, 2015, at 8:26 AM, nsalian wrote: > > > > +1 to executor-memory to 5g. >

Master <-chatter -> Worker

2015-04-22 Thread James King
Is there a good resource that covers what kind of chatter (communication) that goes on between driver, master and worker processes? Thanks

Map Question

2015-04-22 Thread Vadim Bichutskiy
I am using Spark Streaming with Python. For each RDD, I call a map, i.e., myrdd.map(myfunc), myfunc is in a separate Python module. In yet another separate Python module I have a global list, i.e. mylist, that's populated with metadata. I can't get myfunc to see mylist...it's always empty. Alternat

spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-04-22 Thread Sujee Maniyam
Hi all I am unable to access s3n:// urls using sc.textFile().. getting a 'no file system for scheme s3n://' error. A bug or some conf settings missing? See below for details: env variables : AWS_SECRET_ACCESS_KEY=set AWS_ACCESS_KEY_ID=set spark/RELEASE : Spark 1.3.1 (git revision 908a0bf) bui

Re: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-04-22 Thread Ted Yu
This thread from hadoop mailing list should give you some clue: http://search-hadoop.com/m/LgpTk2df7822 On Wed, Apr 22, 2015 at 9:45 AM, Sujee Maniyam wrote: > Hi all > I am unable to access s3n:// urls using sc.textFile().. getting 'no > file system for scheme s3n://' error. > > a bug or so

RE: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-04-22 Thread Shuai Zheng
Below is my code to access s3n without problem (only for 1.3.1; there is a bug in 1.3.0). Configuration hadoopConf = ctx.hadoopConfiguration(); hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem"); hadoopConf.set("fs.s3n
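
The same approach as a complete Scala fragment (the credential keys are the standard Hadoop s3n ones; the bucket and key are placeholders):

  val hadoopConf = sc.hadoopConfiguration
  hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
  hadoopConf.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
  hadoopConf.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

  val lines = sc.textFile("s3n://some-bucket/some/key")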

RDD.filter vs. RDD.join--advice please

2015-04-22 Thread hokiegeek2
Hi Everyone, I have two options of filtering the RDD resulting from the Graph.vertices method as illustrated with the following pseudo code: 1. Filter val vertexSet = Set("vertexOne","vertexTwo"...); val filteredVertices = Graph.vertices.filter(x => vertexSet.contains(x._2.vertexName)) 2. Join

Spark streaming action running the same work in parallel

2015-04-22 Thread ColinMc
Hi, I'm running a unit test that keeps failing with the code I wrote in Spark. Here are the output logs from the test I ran, which gets the customers from incoming events in the JSON called QueueEvent; I am trying to convert the incoming events for each customer to an alert. From

Re: regarding ZipWithIndex

2015-04-22 Thread Jeetendra Gangele
Sure thanks. if you can guide me how to do this will be great help. On 17 April 2015 at 22:05, Ted Yu wrote: > I have some assignments on hand at the moment. > > Will try to come up with sample code after I clear the assignments. > > FYI > > On Thu, Apr 16, 2015 at 2:00 PM, Jeetendra Gangele >

Re: Distinct is very slow

2015-04-22 Thread Jeetendra Gangele
I made 7000 tasks in mapToPair and in distinct I also made the same number of tasks. Still, lots of shuffle read and write is happening and the application runs for a much longer time. Any idea? On 17 April 2015 at 11:55, Akhil Das wrote: > How many tasks are you seeing in your mapToPair stage? Is it

Re: Map Question

2015-04-22 Thread Tathagata Das
Is the mylist present on every executor? If not, then you have to pass it on. And broadcasts are the best way to pass them on. But note that once broadcasted it will be immutable at the executors, and if you update the list at the driver, you will have to broadcast it again. TD On Wed, Apr 22, 2015

Re: HiveContext setConf seems not stable

2015-04-22 Thread madhu phatak
Hi, calling getConf doesn't solve the issue. Even many hive-specific queries are broken. It seems like no hive configurations are getting passed properly. Regards, Madhukara Phatak http://datamantra.io/ On Wed, Apr 22, 2015 at 2:19 AM, Michael Armbrust wrote: > As a workaround, can you call getCo

Re: Spark streaming action running the same work in parallel

2015-04-22 Thread Tathagata Das
Unfortunately, none of the indicated logs, etc. is visible in the mail we got through the mailing list. On Wed, Apr 22, 2015 at 10:16 AM, ColinMc wrote: > Hi, > > I'm running a unit test that keeps failing to work with the code I wrote in > Spark. > > Here is the output logs from my test that I

Re: Map Question

2015-04-22 Thread Vadim Bichutskiy
Thanks TD. I was looking into broadcast variables. Right now I am running it locally...and I plan to move it to "production" on EC2. The way I fixed it is by doing myrdd.map(lambda x: (x, mylist)).map(myfunc) but I don't think it's efficient? mylist is filled only once at the start and never cha

Re: Not able run multiple tasks in parallel, spark streaming

2015-04-22 Thread Tathagata Das
Furthermore, just to explain, doing arr.par.foreach does not help because it is not really running anything; it is only doing the setup of the computation. Doing the setup in parallel does not mean that the jobs will be run concurrently. Also, from your code it seems like your pairs of dstreams don't intera

Trouble working with Spark-CSV package (error: object databricks is not a member of package com)

2015-04-22 Thread Mohammed Omer
Afternoon all, I'm working with Scala 2.11.6, and Spark 1.3.1 built from source via: `mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package` The error is encountered when running spark shell via: `spark-shell --packages com.databricks:spark-csv_2.11:1.0.3` The full trace of the comman

Re: Map Question

2015-04-22 Thread Tathagata Das
Yep. Not efficient. Pretty bad actually. That's why broadcast variables were introduced right at the very beginning of Spark. On Wed, Apr 22, 2015 at 10:58 AM, Vadim Bichutskiy < vadim.bichuts...@gmail.com> wrote: > Thanks TD. I was looking into broadcast variables. > > Right now I am running it

Re: Map Question

2015-04-22 Thread Vadim Bichutskiy
Can I use broadcast vars in local mode? On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das wrote: > Yep. Not efficient. Pretty bad actually. That's why broadcast variables > were introduced right at the very beginning of Spark. > > > > On Wed, Apr 22, 2015 at 10:58 AM, Vadim Bichutskiy < > vadim.bi

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-22 Thread Tathagata Das
It could very well be that your executor memory is not enough to store the state RDDs AND operate on the data. 1G per executor is quite low. Definitely give more memory. And have you tried increasing the number of partitions (specify number of partitions in updateStateByKey) ? On Wed, Apr 22, 2015
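
The second suggestion as a sketch: updateStateByKey has an overload that takes an explicit partition count (the DStream name and the value 64 are placeholders to tune):

  val stateDStream = eventsByKey.updateStateByKey(updateFunc, numPartitions = 64)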

Re: Map Question

2015-04-22 Thread Tathagata Das
Absolutely. The same code would work for local as well as distributed mode! On Wed, Apr 22, 2015 at 11:08 AM, Vadim Bichutskiy < vadim.bichuts...@gmail.com> wrote: > Can I use broadcast vars in local mode? > ᐧ > > On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das > wrote: > >> Yep. Not efficient. P

Re: problem writing to s3

2015-04-22 Thread Daniel Mahler
Hi Akhil, It works fine when outprefix is an hdfs:///localhost/... url. It looks to me as if there is something about spark writing to the same s3 bucket it is reading from. That is the only real difference between the 2 saveAsTextFile calls: when outprefix is on s3, inpath is also on s3 but in a differ

ElasticSearch for Spark times out

2015-04-22 Thread Adrian Mocanu
Hi I use the ElasticSearch package for Spark and very often it times out reading data from ES into an RDD. How can I keep the connection alive (why doesn't it stay alive? A bug?) Here's the exception I get: org.elasticsearch.hadoop.serialization.EsHadoopSerializationException: java.net.SocketTimeoutExceptio

Re: ElasticSearch for Spark times out

2015-04-22 Thread Jeetendra Gangele
will you be able to paste the code? On 23 April 2015 at 00:19, Adrian Mocanu wrote: > Hi > > > > I use the ElasticSearch package for Spark and very often it times out > reading data from ES into an RDD. > > How can I keep the connection alive (why doesn't it? Bug?) > > > > Here's the exception

RE: ElasticSearch for Spark times out

2015-04-22 Thread Adrian Mocanu
Hi The gist of it is this: I have data indexed into ES. Each index stores monthly data and the query will get data for some date range (across several ES indexes or within 1 if the date range is within 1 month). Then I merge these RDDs into an uberRdd and performs some operations then print the

Re: ElasticSearch for Spark times out

2015-04-22 Thread Jeetendra Gangele
Basically, read timeout means that no data arrived within the specified receive timeout period. A few things I would suggest: 1. Is your ES cluster up and running? 2. If yes, then reduce the size of the index, make it a few KB, and then test. On 23 April 2015 at 00:19, Adrian Mocanu wrote: > Hi >

Re: Efficient saveAsTextFile by key, directory for each key?

2015-04-22 Thread Arun Luthra
I ended up post-processing the result in hive with a dynamic partition insert query to get a table partitioned by the key. Looking further, it seems that 'dynamic partition' insert is in Spark SQL and working well in Spark SQL versions > 1.2.0: https://issues.apache.org/jira/browse/SPARK-3007 On
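
For reference, a sketch of the shape of that post-processing step against a HiveContext hc (table and column names are placeholders):

  hc.sql("SET hive.exec.dynamic.partition = true")
  hc.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
  // the partition column must come last in the SELECT list
  hc.sql("INSERT OVERWRITE TABLE output PARTITION (key) SELECT value, key FROM staging")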

Spark SQL: SchemaRDD, DataFrame. Multi-value, Nested attributes

2015-04-22 Thread Eugene Morozov
Hi! I’m trying to query a dataset that reads data from csv and provides SQL on top of it. The problem I have is that I have a hierarchy of objects that I need to represent as a table so that users can use SQL to query it and do some aggregations. I do have multi-value attributes (that in csv fil

Re: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-04-22 Thread Sujee Maniyam
Thanks all... btw, s3n load works without any issues with spark-1.3.1-built-for-hadoop-2.4. I tried this on 1.3.1-hadoop26 > sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem") > val f = sc.textFile("s3n://bucket/file") > f.count No it can't find the im

RE: Scheduling across applications - Need suggestion

2015-04-22 Thread yana
Yes. Fair scheduler only helps concurrency within an application. With multiple shells you'd either need something like Yarn/Mesos or careful math on resources, as you said. Original message From: Arun Patel Date: 04/2

RE: ElasticSearch for Spark times out

2015-04-22 Thread Adrian Mocanu
Hi Thanks for the help. My ES is up. Out of curiosity, do you know what the timeout value is? There are probably other things happening to cause the timeout; I don't think my ES is that slow but it's possible that ES is taking too long to find the data. What I see happening is that it uses scro

Re: RDD.filter vs. RDD.join--advice please

2015-04-22 Thread dsgriffin
Test it out, but I would be willing to bet the join is going to be a good deal faster. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-filter-vs-RDD-join-advice-please-tp22612p22614.html Sent from the Apache Spark User List mailing list archive at Nabble

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-22 Thread Jean-Pascal Billaud
I have now a fair understanding of the situation after looking at javap output. So as a reminder: dstream.map(tuple => { val w = StreamState.fetch[K,W](state.prefixKey, tuple._1) (tuple._1, (tuple._2, w)) }) And StreamState being a very simple standalone object: object StreamState { def fetch[

Re: Scheduling across applications - Need suggestion

2015-04-22 Thread Lan Jiang
The YARN capacity scheduler supports hierarchical queues, to which you can assign cluster resources as percentages. Your spark application/shell can be submitted to different queues. Mesos supports fine-grained mode, which allows the machines/cores used by each executor to ramp up and down. Lan On Wed, Apr 22,

Re: SparkSQL performance

2015-04-22 Thread Michael Armbrust
https://github.com/databricks/spark-avro On Tue, Apr 21, 2015 at 3:09 PM, Renato Marroquín Mogrovejo < renatoj.marroq...@gmail.com> wrote: > Thanks Michael! > I have tried applying my schema programatically but I didn't get any > improvement on performance :( > Could you point me to some code exa

Re: Map Question

2015-04-22 Thread Vadim Bichutskiy
Here's what I did: print 'BROADCASTING...' broadcastVar = sc.broadcast(mylist) print broadcastVar print broadcastVar.value print 'FINISHED BROADCASTING...' The above works fine, but when I call myrdd.map(myfunc) I get NameError: global name 'broadcastVar' is not defined. The myfunc function is

Re: RE: ElasticSearch for Spark times out

2015-04-22 Thread Costin Leau
Hi, First off, for Elasticsearch questions it is worth pinging the Elastic mailing list as that is more closely monitored than this one. Back to your question, Jeetendra is right that the exception indicates no data is flowing back to the es-connector and Spark. The default is 1m [1] which should be mor

Beeline that comes with Spark 1.3.0 doesn't work with "--hiveconf" or "--hivevar", which substitute variables in hive scripts.

2015-04-22 Thread ogoh
Hello, I am using Spark 1.3 for SparkSQL (hive) & ThriftServer & Beeline. Beeline doesn't work with "--hiveconf" or "--hivevar", which substitute variables in hive scripts. I found the following jiras saying that Hive 0.13 resolved that issue. I wonder if this is a well-known issue? https://i

setting cost in linear SVM [Python]

2015-04-22 Thread Pagliari, Roberto
Is there a way to set the cost value C when using linear SVM?

Re: Hive table creation - possible bug in Spark 1.3?

2015-04-22 Thread Michael Armbrust
Sorry for the confusion. We should be more clear about the semantics in the documentation. (PRs welcome :) ) .saveAsTable does not create a hive table, but instead creates a Spark Data Source table. Here the metadata is persisted into Hive, but hive cannot read the tables (as this API support ML
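
If the table must stay readable from Hive itself, one workaround is to route the write through HiveQL rather than .saveAsTable (a sketch; df and the table names are placeholders):

  df.registerTempTable("tmp_df")
  hc.sql("CREATE TABLE my_hive_table AS SELECT * FROM tmp_df")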

why does groupByKey return RDD[(K, Iterable[V])] not RDD[(K, CompactBuffer[V])] ?

2015-04-22 Thread Hao Ren
Hi, Just a quick question regarding the source code of groupByKey: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L453 In the end, it casts CompactBuffer to Iterable. But why? Any advantage? Thank you. Hao. -- View this message

RE: Trouble working with Spark-CSV package (error: object databricks is not a member of package com)

2015-04-22 Thread yana
You can try pulling the jar with wget and using it with --jars with Spark shell. I used 1.0.3 with Spark 1.3.0 but with a different version of scala. From the stack trace it looks like Spark shell is just not seeing the csv jar...

Re: Map Question

2015-04-22 Thread Tathagata Das
Can you give full code? especially the myfunc? On Wed, Apr 22, 2015 at 2:20 PM, Vadim Bichutskiy < vadim.bichuts...@gmail.com> wrote: > Here's what I did: > > print 'BROADCASTING...' > broadcastVar = sc.broadcast(mylist) > print broadcastVar > print broadcastVar.value > print 'FINISHED BROADCASTI

spark-ec2 s3a filesystem support and hadoop versions

2015-04-22 Thread Daniel Mahler
I would like to easily launch a cluster that supports s3a file systems. If I launch a cluster with `spark-ec2 --hadoop-major-version=2`, what determines the minor version of hadoop? Does it depend on the spark version being launched? Are there other allowed values for --hadoop-major-version besi

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-22 Thread Tathagata Das
Vaguely makes sense. :) Wow that's an interesting corner case. On Wed, Apr 22, 2015 at 1:57 PM, Jean-Pascal Billaud wrote: > I have now a fair understanding of the situation after looking at javap > output. So as a reminder: > > dstream.map(tuple => { > val w = StreamState.fetch[K,W](state.prefi

Re: [spark-csv] Trouble working with Spark-CSV package (error: object databricks is not a member of package com) (#54)

2015-04-22 Thread Mohammed Omer
Yes, `import com.brkyvz.spark.WordCounter` failed as well :/ On Wed, Apr 22, 2015 at 6:45 PM, Burak Yavuz wrote: > Hi @momer , Could you please try spark-shell > with --packages brkyvz:demo-scala-python:0.1.3 just to test whether this > is a package related issue or Sp

How to access HBase on Spark SQL

2015-04-22 Thread doovsaid
I notice that Databricks provides an external data source API for Spark SQL, but I can't find any reference documents on how to access HBase with it directly. Does anyone know, or could you give me some related links? Thanks. ZhangYi (张逸) BigEye website: http://w

Re: Parquet Hive table become very slow on 1.3?

2015-04-22 Thread Rex Xiong
Yin, Thanks for your reply. We already applied this PR to our 1.3.0. As Xudong mentioned, we have thousands of parquet files; it's very, very slow on first read, and another app will add more files and refresh the table regularly. Cheng Lian's PR-5334 seems like it can resolve this issue; it will skip reading all f

Re: Start ThriftServer Error

2015-04-22 Thread Denny Lee
You may need to specify the hive port itself. For example, my own Thrift start command is in the form: ./sbin/start-thriftserver.sh --master spark://$myserver:7077 --driver-class-path $CLASSPATH --hiveconf hive.server2.thrift.bind.host $myserver --hiveconf hive.server2.thrift.port 1 HTH! O

Re: RE: ElasticSearch for Spark times out

2015-04-22 Thread Otis Gospodnetic
Hi, If you get ES response back in 1-5 seconds that's pretty slow. Are these ES aggregation queries? Costin may be right about GC possibly causing timeouts. SPM can give you all Spark and all key Elasticsearch metrics, including various JVM metrics. If the problem is

.toPairDStreamFunctions method not found

2015-04-22 Thread avseq
Dear all, I installed Spark 1.2.1 on CentOS 6.6 x64. I ran the NetworkWordCount example and it works. But when I paste the same code into IntelliJ, package it into a jar, and execute it, it always throws the following exception: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.strea

Re: Not able run multiple tasks in parallel, spark streaming

2015-04-22 Thread Abhay Bansal
Thanks for your suggestions. sc.set("spark.streaming.concurrentJobs","2") works, but I am not sure about using it in production. @TD: The number of streams that we are interacting with is very large. Managing that many applications would just be an overhead. Moreover there are other operation whic

Spark RDD Lifecycle: whether RDD will be reclaimed out of scope

2015-04-22 Thread Jeffery
Hi, Dear Spark Users/Devs: In a method, I create a new RDD and cache it. Will Spark unpersist the RDD automatically after the RDD is out of scope? I think so, but want to make sure with you, the experts :) Thanks, Jeffery Yuan -- View this message in context: http://apache-spar

Re: StackOverflow Error when run ALS with 100 iterations

2015-04-22 Thread amghost
Hi, would you please explain how to checkpoint the training set rdd, since everything is done inside the ALS.train method. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/StackOverflow-Error-when-run-ALS-with-100-iterations-tp4296p22619.html Sent from the Apache Spark User

LDA code little error @Xiangrui Meng

2015-04-22 Thread buring
Hi: there is a little error in the source code, LDA.scala at line 180, as follows: def setBeta(beta: Double): this.type = setBeta(beta) which causes "java.lang.StackOverflowError". It's easy to see the error. -- View this message in context: http://apache-spark-user-list.1001560.n3.n

Re: Building Spark : Adding new DataType in Catalyst

2015-04-22 Thread kmader
Unless you are directly concerned with the query optimization, you needn't modify Catalyst or any of the core Spark SQL code. You can simply create a new project with Spark SQL as a dependency and, as is done for MLlib Vectors (in 1.3; the newer versions have it for matrices as well), use the @SQLU
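
A skeleton of that approach (Geometry here is a stand-in holding a coordinate array, and the array-of-doubles storage format is only an illustration; a real implementation would pick a proper serialized form):

  import org.apache.spark.sql.types._

  @SQLUserDefinedType(udt = classOf[GeometryUDT])
  class Geometry(val coords: Array[Double]) extends Serializable

  class GeometryUDT extends UserDefinedType[Geometry] {
    // how the value is represented inside a Row
    override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)
    override def serialize(obj: Any): Any = obj match {
      case g: Geometry => g.coords.toSeq
    }
    override def deserialize(datum: Any): Geometry = datum match {
      case coords: Seq[_] => new Geometry(coords.map(_.asInstanceOf[Double]).toArray)
    }
    override def userClass: Class[Geometry] = classOf[Geometry]
  }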

Re: LDA code little error @Xiangrui Meng

2015-04-22 Thread Xiangrui Meng
Thanks! That's a bug .. -Xiangrui On Wed, Apr 22, 2015 at 9:34 PM, buring wrote: > Hi: > there is a little error in source code LDA.scala at line 180, as > follows: >def setBeta(beta: Double): this.type = setBeta(beta) > >which cause "java.lang.StackOverflowError". It's easy to see th
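
The intended definition is presumably a delegation to the real setter rather than a self-call (in the 1.3 LDA, beta is an alias for topicConcentration):

  // before: recurses into itself and overflows the stack
  def setBeta(beta: Double): this.type = setBeta(beta)
  // intended: delegate to the underlying parameter
  def setBeta(beta: Double): this.type = setTopicConcentration(beta)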
