Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
There's some skew. Two task rows from the web UI (index / task ID / attempt / status / locality / executor / launch time / duration / GC time / shuffle read / spill memory / spill disk): 64 / 6164 / 0 / SUCCESS / PROCESS_LOCAL / 200 / 2015/03/04 23:45:47 / 1.1 min / 6 s / 198.6 MB / 21.1 GB / 240.8 MB, and 59 / 6159 / 0 / SUCCESS / PROCESS_LOCAL / 30 / 2015/03/04 23:45:47 / 44 s / 5 s / 200.7 MB / 4.8 GB / 154.0 MB. But I expect this kind of skewness to be quite common. Jianshi On Thu, Mar 5, 2015 at 3:48 PM, Jiansh

RE: Having lots of FetchFailedException in join

2015-03-04 Thread Shao, Saisai
Yes, if one key has too many values, there is still a chance of hitting the OOM. Thanks Jerry From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Thursday, March 5, 2015 3:49 PM To: Shao, Saisai Cc: Cheng, Hao; user Subject: Re: Having lots of FetchFailedException in join I see. I'm using c

using log4j2 with spark

2015-03-04 Thread Lior Chaga
Hi, I'm trying to run Spark 1.2.1 w/ Hadoop 1.0.4 on a cluster and configure it to run with log4j2. The problem is that spark-assembly.jar contains log4j and slf4j classes compatible with log4j 1.2, so it detects that it should use log4j 1.2 ( https://github.com/apache/spark/blob/54e7b456dd56c9e52132154

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
I see. I'm using core's join. The data might have some skewness (checking). I understand shuffle can spill data to disk, but when consuming it, say in cogroup or groupByKey, it still needs to read all the elements of a group, right? I guess the OOM happened there when reading very large groups. Jianshi

RE: Passing around SparkContext within the Driver

2015-03-04 Thread Kapil Malik
Replace val sqlContext = new SQLContext(sparkContext) with @transient val sqlContext = new SQLContext(sparkContext) -Original Message- From: kpeng1 [mailto:kpe...@gmail.com] Sent: 04 March 2015 23:39 To: user@spark.apache.org Subject: Passing around SparkContext with in the Driver Hi
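A minimal sketch of the pattern suggested above, assuming a hypothetical JobHelper wrapper class that lives only on the driver:

  import org.apache.spark.SparkContext
  import org.apache.spark.sql.SQLContext

  class JobHelper(@transient val sc: SparkContext) extends Serializable {
    // Driver-only handles are marked @transient so they are skipped if the
    // instance ever gets pulled into a serialized closure.
    @transient lazy val sqlContext = new SQLContext(sc)

    def lineCount(path: String): Long = sc.textFile(path).count()
  }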

RE: Having lots of FetchFailedException in join

2015-03-04 Thread Shao, Saisai
I think what you could do is to monitor through the web UI to see if there's any skew or other symptoms in shuffle write and read. For GC you could use the configuration below as you mentioned. From the Spark core side, all the shuffle-related operations can spill the data to disk, and there is no need to read

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
Hi Saisai, What are your suggested settings for monitoring shuffle? I've enabled -XX:+PrintGCDetails -XX:+PrintGCTimeStamps for GC logging. I found SPARK-3461 (Support external groupByKey using repartitionAndSortWithinPartitions), which aims to make groupByKey use external storage. It's still in open status
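For reference, a hedged sketch of wiring those GC flags into the executor JVMs through spark.executor.extraJavaOptions (they can equally go into spark-defaults.conf or be passed with --conf on spark-submit; the app name is a placeholder):

  import org.apache.spark.{SparkConf, SparkContext}

  // Sketch only: pass the GC-logging flags mentioned above to every executor JVM.
  val conf = new SparkConf()
    .setAppName("join-job")
    .set("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
  val sc = new SparkContext(conf)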

RE: Having lots of FetchFailedException in join

2015-03-04 Thread Shao, Saisai
Hi Jianshi, From my understanding, it may not be a problem of NIO or Netty. Looking at your stack trace, the OOM occurred in EAOM (ExternalAppendOnlyMap); theoretically EAOM can spill the data to disk if memory is not enough, but there might be some issues when the join key is skewed or the key numb

RE: distribution of receivers in spark streaming

2015-03-04 Thread Shao, Saisai
Yes, hostname is enough. I think currently it is hard for user code to get the worker list from the standalone master. If you can get the Master object, you could get the worker list, but AFAIK it may be difficult to get this object. All you can do is manually get the worker list and assign

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
One really interesting thing is that when I'm using the netty-based spark.shuffle.blockTransferService, there are no OOM error messages (java.lang.OutOfMemoryError: Java heap space). Any idea why it doesn't show up here? I'm using Spark 1.2.1. Jianshi On Thu, Mar 5, 2015 at 1:56 PM, Jianshi Huang wrote: > I ch

Re: distribution of receivers in spark streaming

2015-03-04 Thread Du Li
Hi Jerry, Thanks for your response. Is there a way to get the list of currently registered/live workers? Even in order to provide preferredLocation, it would be safer to know which workers are active. I guess I only need to provide the hostname, right? Thanks, Du On Wednesday, March 4, 2015 1

RE: distribution of receivers in spark streaming

2015-03-04 Thread Shao, Saisai
Hi Du, You could try to sleep for several seconds after creating the streaming context to let all the executors register; then the receivers can be distributed across the nodes more evenly. Setting the locality is another way, as you mentioned. Thanks Jerry From: Du Li [mailto:l...@yahoo-inc.com.IN

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
I changed spark.shuffle.blockTransferService to nio and now I'm getting OOM errors, I'm doing a big join operation. 15/03/04 19:04:07 ERROR Executor: Exception in task 107.0 in stage 2.0 (TID 6207) java.lang.OutOfMemoryError: Java heap space at org.apache.spark.util.collection.CompactBuff

Re: distribution of receivers in spark streaming

2015-03-04 Thread Du Li
Figured it out: I need to override method preferredLocation() in MyReceiver class. On Wednesday, March 4, 2015 3:35 PM, Du Li wrote: Hi, I have a set of machines (say 5) and want to evenly launch a number (say 8) of kafka receivers on those machines. In my code I did something like
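A hedged sketch of that override, with a placeholder receiver (the actual Kafka-fetching logic is omitted and the class name is illustrative):

  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  class MyKafkaReceiver(host: String)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

    // Ask Spark to schedule this receiver instance on the given worker host.
    override def preferredLocation: Option[String] = Some(host)

    def onStart(): Unit = {
      // start a background thread that calls store(record) for each Kafka message
    }
    def onStop(): Unit = {}
  }

Each receiver can then be constructed with one of the worker hostnames, e.g. (1 to numReceivers).map(i => ssc.receiverStream(new MyKafkaReceiver(hosts(i % hosts.length)))).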

Re: spark master shut down suddenly

2015-03-04 Thread Benjamin Stickel
Generally the logs are located in /var/log/mesos, but the definitive configuration can be found via the /etc/mesos-master/... configuration files. There should be a configuration file labeled log_dir. ps -ax | grep mesos should also show the output of the configuration if it is configured. Another

How to parse Json formatted Kafka message in spark streaming

2015-03-04 Thread Cui Lin
Friends, I'm trying to parse JSON-formatted Kafka messages and then send them back to Cassandra. I have two problems: 1. I got the exception below. How do I check for an empty RDD? Exception in thread "main" java.lang.UnsupportedOperationException: empty collection at org.apache.spark.rdd.RDD$$anonfun$
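On the first problem, a hedged sketch of guarding against empty micro-batches (rdd.take(1) avoids the exception that first()/reduce() throw on an empty RDD; the stream name and the inline "parser" are placeholders):

  import org.apache.spark.streaming.dstream.DStream

  def process(messages: DStream[String]): Unit = {
    messages.foreachRDD { rdd =>
      if (rdd.take(1).nonEmpty) {                      // skip empty micro-batches
        val parsed = rdd.map(msg => Map("raw" -> msg)) // placeholder for real JSON parsing
        parsed.count()                                 // stand-in for the write to Cassandra
      }
    }
  }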

Re: spark master shut down suddenly

2015-03-04 Thread Denny Lee
It depends on your setup, but one of the locations is /var/log/mesos On Wed, Mar 4, 2015 at 19:11 lisendong wrote: > I'm sorry, but how do I look at the mesos logs? > Where are they? > > > > On Mar 4, 2015, at 6:06 PM, Akhil Das wrote: > > > You can check in the mesos logs and see what's really happening. > >

Unable to Read/Write Avro RDD on cluster.

2015-03-04 Thread ๏̯͡๏
I am trying to read RDD avro, transform and write. I am able to run it locally fine but when i run onto cluster, i see issues with Avro. export SPARK_HOME=/home/dvasthimal/spark/spark-1.0.2-bin-2.4.1 export SPARK_YARN_USER_ENV="CLASSPATH=/apache/hadoop/conf" export HADOOP_CONF_DIR=/apache/hadoop/

Re: Where can I find more information about the R interface for Spark?

2015-03-04 Thread Ted Yu
Please follow SPARK-5654 On Wed, Mar 4, 2015 at 7:22 PM, Haopu Wang wrote: > Thanks, it's an active project. > > > > Will it be released with Spark 1.3.0? > > > -- > > *From:* 鹰 [mailto:980548...@qq.com] > *Sent:* Thursday, March 05, 2015 11:19 AM > *To:* Haopu Wang

RE: Where can I find more information about the R interface for Spark?

2015-03-04 Thread Haopu Wang
Thanks, it's an active project. Will it be released with Spark 1.3.0? From: 鹰 [mailto:980548...@qq.com] Sent: Thursday, March 05, 2015 11:19 AM To: Haopu Wang; user Subject: Re: Where can I find more information about the R interface forSpark? you can

Re: Where can I find more information about the R interface for Spark?

2015-03-04 Thread 鹰
You can search for SparkR on Google or find it on GitHub.

Re: Where can I find more information about the R interface for Spark?

2015-03-04 Thread haopu
Do you have any update on SparkR? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Where-can-I-find-more-information-about-the-R-interface-for-Spark-tp155p21922.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: spark master shut down suddenly

2015-03-04 Thread lisendong
I'm sorry, but how do I look at the mesos logs? Where are they? > On Mar 4, 2015, at 6:06 PM, Akhil Das wrote: > > You can check in the mesos logs and see what's really happening. > > Thanks > Best Regards > > On Wed, Mar 4, 2015 at 3:10 PM, lisendong > wrote: > 15/03/04 09:26:3

how to update als in mllib?

2015-03-04 Thread lisendong
I'm using Spark 1.0.0 with Cloudera, but I want to use the new ALS code, which supports more features, such as RDD cache level (MEMORY_ONLY), checkpointing, and so on. What is the easiest way to use the new ALS code? I only need the MLlib ALS code, so maybe I don't need to update all the Spark & MLlib o

Re: Driver disassociated

2015-03-04 Thread Ted Yu
See this thread: https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs Here're the relevant config parameters in Spark: val akkaHeartBeatPauses = conf.getInt("spark.akka.heartbeat.pauses", 6000) val akkaHeartBeatInterval = conf.getInt("spark.akka.heartbeat.interval", 1000) Cheers

Re: TreeNodeException: Unresolved attributes

2015-03-04 Thread Zhan Zhang
Which Spark version did you use? I tried spark-1.2.1 and didn't hit this problem. scala> val m = hiveContext.sql(" select * from testtable where value like '%Restaurant%'") 15/03/05 02:02:30 INFO ParseDriver: Parsing command: select * from testtable where value like '%Restaurant%' 15/03/05 0

Extra output from Spark run

2015-03-04 Thread cjwang
When I run Spark 1.2.1, I see output that wasn't there in previous releases: [Stage 12:=> (6 + 1) / 16] [Stage 12:> (8 + 1) / 16] [Stage 12:==>

Re: RDD coalesce or repartition by #records or #bytes?

2015-03-04 Thread Zhan Zhang
It uses a HashPartitioner to distribute the records to different partitions, and the key is just an integer spread evenly across the output partitions. From the code, each resulting partition will get a very similar number of records. Thanks. Zhan Zhang On Mar 4, 2015, at 3:47 PM, Du Li mailto:l...@yahoo-in
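A quick hedged illustration of that behaviour, assuming an existing SparkContext sc; the input sizes are arbitrary:

  val rdd = sc.parallelize(1 to 1000000, 5)   // input spread over 5 partitions
  val evened = rdd.repartition(16)
  // count records per output partition; each should be close to 1000000 / 16
  evened.mapPartitions(it => Iterator(it.size)).collect().foreach(println)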

In the HA master mode, how to identify the alive master?

2015-03-04 Thread Xuelin Cao
Hi, In our project, we use standalone dual masters + ZooKeeper for HA of the Spark master. Now the problem is: how do we know which master is the currently alive master? We tried to read the info that the master stores in ZooKeeper, but we found there is no information to
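One hedged workaround (not from this thread, and worth verifying against your Spark version): the standalone master web UI serves a JSON summary on port 8080 that includes a status field, so polling each configured master can reveal which one is currently ALIVE. Hostnames below are placeholders.

  import scala.io.Source

  val masters = Seq("master1", "master2")
  val alive = masters.find { host =>
    // a STANDBY or unreachable master either lacks "ALIVE" or throws
    try Source.fromURL(s"http://$host:8080/json").mkString.contains("ALIVE")
    catch { case _: Throwable => false }
  }
  println(alive.getOrElse("no ALIVE master reachable"))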

Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Kevin Peng
Marcelo, Thanks. The one in the CDH repo fixed it :) On Wed, Mar 4, 2015 at 4:37 PM, Marcelo Vanzin wrote: > Hi Kevin, > > If you're using CDH, I'd recommend using the CDH repo [1], and also > the CDH version when building your app. > > [1] > http://www.cloudera.com/content/cloudera/en/documen

Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Marcelo Vanzin
Hi Kevin, If you're using CDH, I'd recommend using the CDH repo [1], and also the CDH version when building your app. [1] http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_vd_cdh5_maven_repo.html On Wed, Mar 4, 2015 at 4:34 PM, Kevin Peng wrote: > Ted, > > I am c

Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Kevin Peng
Ted, I am currently using CDH 5.3 distro, which has Spark 1.2.0, so I am not too sure about the compatibility issues between 1.2.0 and 1.2.1, that is why I would want to stick to 1.2.0. On Wed, Mar 4, 2015 at 4:25 PM, Ted Yu wrote: > Kevin: > You can try with 1.2.1 > > See this thread: http:/

Re: scala.Double vs java.lang.Double in RDD

2015-03-04 Thread Tobias Pfeiffer
Hi, On Thu, Mar 5, 2015 at 12:20 AM, Imran Rashid wrote: > This doesn't involve spark at all, I think this is entirely an issue with > how scala deals w/ primitives and boxing. Often it can hide the details > for you, but IMO it just leads to far more confusing errors when things > don't work o

Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Kevin Peng
Ted, I have tried wiping out ~/.m2/org.../spark directory multiple times. It doesn't seem to work. On Wed, Mar 4, 2015 at 4:20 PM, Ted Yu wrote: > kpeng1: > Try wiping out ~/.m2 and build again. > > Cheers > > On Wed, Mar 4, 2015 at 4:10 PM, Marcelo Vanzin > wrote: > >> Seems like someone s

Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Ted Yu
Kevin: You can try with 1.2.1 See this thread: http://search-hadoop.com/m/JW1q5Vfe6X1 Cheers On Wed, Mar 4, 2015 at 4:18 PM, Kevin Peng wrote: > Marcelo, > > Yes that is correct, I am going through a mirror, but 1.1.0 works > properly, while 1.2.0 does not. I suspect there is crc in the 1.2.0

Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Ted Yu
kpeng1: Try wiping out ~/.m2 and build again. Cheers On Wed, Mar 4, 2015 at 4:10 PM, Marcelo Vanzin wrote: > Seems like someone set up "m2.mines.com" as a mirror in your pom file > or ~/.m2/settings.xml, and it doesn't mirror Spark 1.2 (or does but is > in a messed up state). > > On Wed, Mar 4,

Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Kevin Peng
Marcelo, Yes that is correct, I am going through a mirror, but 1.1.0 works properly while 1.2.0 does not. I suspect there is a CRC issue in the 1.2.0 pom file. On Wed, Mar 4, 2015 at 4:10 PM, Marcelo Vanzin wrote: > Seems like someone set up "m2.mines.com" as a mirror in your pom file > or ~/.m2/sett

Re: Driver disassociated

2015-03-04 Thread Thomas Gerber
Also, I was experiencing another problem which might be related: "Error communicating with MapOutputTracker" (see email in the ML today). I just thought I would mention it in case it is relevant. On Wed, Mar 4, 2015 at 4:07 PM, Thomas Gerber wrote: > 1.2.1 > > Also, I was using the following p

Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Marcelo Vanzin
Seems like someone set up "m2.mines.com" as a mirror in your pom file or ~/.m2/settings.xml, and it doesn't mirror Spark 1.2 (or does but is in a messed up state). On Wed, Mar 4, 2015 at 3:49 PM, kpeng1 wrote: > Hi All, > > I am currently having problem with the maven dependencies for version 1.2

Re: Driver disassociated

2015-03-04 Thread Thomas Gerber
1.2.1 Also, I was using the following parameters, which are 10 times the default ones: spark.akka.timeout 1000 spark.akka.heartbeat.pauses 60000 spark.akka.failure-detector.threshold 3000.0 spark.akka.heartbeat.interval 10000 which should have helped *avoid* the problem if I understand correctly.

Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread kpeng1
Hi All, I am currently having a problem with the maven dependencies for version 1.2.0 of spark-core and spark-hive. Here are my dependencies: org.apache.spark:spark-core_2.10:1.2.0 and org.apache.spark:spark-hive_2.10:1.2.0. When the dependencies

RDD coalesce or repartition by #records or #bytes?

2015-03-04 Thread Du Li
Hi, My RDDs are created from a Kafka stream. After receiving an RDD, I want to coalesce/repartition it so that the data will be processed on a set of machines in parallel as evenly as possible. The number of processing nodes is larger than the number of receiving nodes. My question is how the coalesce/repa

distribution of receivers in spark streaming

2015-03-04 Thread Du Li
Hi, I have a set of machines (say 5) and want to evenly launch a number (say 8) of kafka receivers on those machines. In my code I did something like the following, as suggested in the spark docs: val streams = (1 to numReceivers).map(_ => ssc.receiverStream(new MyKafkaReceiver()))

Re: Driver disassociated

2015-03-04 Thread Ted Yu
What release are you using ? SPARK-3923 went into 1.2.0 release. Cheers On Wed, Mar 4, 2015 at 1:39 PM, Thomas Gerber wrote: > Hello, > > sometimes, in the *middle* of a job, the job stops (status is then seen > as FINISHED in the master). > > There isn't anything wrong in the shell/submit out

Re: issue Running Spark Job on Yarn Cluster

2015-03-04 Thread roni
Look at the logs: yarn logs --applicationId <application ID>. That should give the error. On Wed, Mar 4, 2015 at 9:21 AM, sachin Singh wrote: > Not yet. > Please let me know if you found a solution. > > Regards > Sachin > On 4 Mar 2015 21:45, "mael2210 [via Apache Spark User List]" <[hidden > email]

Re: spark sql median and standard deviation

2015-03-04 Thread Ted Yu
Please take a look at DoubleRDDFunctions.scala : /** Compute the mean of this RDD's elements. */ def mean(): Double = stats().mean /** Compute the variance of this RDD's elements. */ def variance(): Double = stats().variance /** Compute the standard deviation of this RDD's elements. */
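Building on that, a hedged sketch of getting mean/stdev from stats() and an exact median via a sort; schemaRdd and the column index are assumptions taken from the question, and on large data a sampled approximation is usually preferable to a full sort:

  val values = schemaRdd.map(_.getDouble(0))      // pull the numeric column out as Double
  val st = values.stats()
  println(s"mean=${st.mean} stdev=${st.stdev}")

  val n = values.count()
  val median = values.sortBy(identity)
    .zipWithIndex()
    .filter { case (_, idx) => idx == n / 2 }     // middle element (upper median for even n)
    .map(_._1)
    .first()
  println(s"median=$median")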

Driver disassociated

2015-03-04 Thread Thomas Gerber
Hello, sometimes, in the *middle* of a job, the job stops (status is then seen as FINISHED in the master). There isn't anything wrong in the shell/submit output. When looking at the executor logs, I see logs like this: 15/03/04 21:24:51 INFO MapOutputTrackerWorker: Doing the fetch; tracker acto

Re: Spark SQL Static Analysis

2015-03-04 Thread Justin Pihony
Thanks! On Wed, Mar 4, 2015 at 3:58 PM, Michael Armbrust wrote: > It is somewhat out of data, but here is what we have so far: > https://github.com/marmbrus/sql-typed > > On Wed, Mar 4, 2015 at 12:53 PM, Justin Pihony > wrote: > >> I am pretty sure that I saw a presentation where SparkSQL could

Re: Spark SQL Static Analysis

2015-03-04 Thread Michael Armbrust
It is somewhat out of date, but here is what we have so far: https://github.com/marmbrus/sql-typed On Wed, Mar 4, 2015 at 12:53 PM, Justin Pihony wrote: > I am pretty sure that I saw a presentation where SparkSQL could be executed > with static analysis, however I cannot find the presentation no

Spark SQL Static Analysis

2015-03-04 Thread Justin Pihony
I am pretty sure that I saw a presentation where SparkSQL could be executed with static analysis, however I cannot find the presentation now, nor can I find any documentation or research papers on the topic. So, I am curious if there is indeed any work going on for this topic. The two things I woul

Integer column in schema RDD from parquet being considered as string

2015-03-04 Thread gtinside
Hi, I am converting a jsonRDD to Parquet by saving it as a parquet file (saveAsParquetFile): cacheContext.jsonFile("file:///u1/sample.json").saveAsParquetFile("sample.parquet") I am reading the parquet file and registering it as a table: val parquet = cacheContext.parquetFile("sample_trades.parquet") par

Re: Save and read parquet from the same path

2015-03-04 Thread Michael Armbrust
No, this is not safe to do. On Wed, Mar 4, 2015 at 7:14 AM, Karlson wrote: > Hi all, > > what would happen if I save a RDD via saveAsParquetFile to the same path > that RDD is originally read from? Is that a safe thing to do in Pyspark? > > Thanks! > > > -

configure number of cached partition in memory on SparkSQL

2015-03-04 Thread Judy Nash
Hi, I am tuning a Hive dataset on Spark SQL deployed via the thrift server. How can I change the number of partitions after caching the table on the thrift server? I have tried the following but am still getting the same number of partitions after caching: spark.default.parallelism spark.sql.inMemoryColu

Re: Spark Monitoring UI for Hadoop Yarn Cluster

2015-03-04 Thread Srini Karri
Hi Marcelo, I found the problem from http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/%3cCAL+LEBfzzjugOoB2iFFdz_=9TQsH=DaiKY=cvydfydg3ac5...@mail.gmail.com%3e this link. The problem is that the application I am running is not generating the "APPLICATION_COMPLETE" file. If I add this file ma

spark sql median and standard deviation

2015-03-04 Thread tridib
Hello, Is there a built-in function for getting the median and standard deviation in Spark SQL? Currently I am converting the SchemaRDD to a DoubleRDD and calling doubleRDD.stats(), but it still does not have the median. What is the most efficient way to get the median? Thanks & Regards Tridib -- View thi

Re: Does anyone integrate HBASE on Spark

2015-03-04 Thread gen tang
Hi, There are some examples in spark/examples and there are also some examples in the spark packages site. And I find this blog quite good. Hope it

Re: Error communicating with MapOutputTracker

2015-03-04 Thread Thomas Gerber
I meant spark.default.parallelism of course. On Wed, Mar 4, 2015 at 10:24 AM, Thomas Gerber wrote: > Follow up: > We re-retried, this time after *decreasing* spark.parallelism. It was set > to 16000 before, (5 times the number of cores in our cluster). It is now > down to 6400 (2 times the numbe

Re: Error communicating with MapOutputTracker

2015-03-04 Thread Thomas Gerber
Follow up: We retried again, this time after *decreasing* spark.parallelism. It was set to 16000 before (5 times the number of cores in our cluster). It is now down to 6400 (2 times the number of cores), and it got past the point where it failed before. Does the MapOutputTracker have a limit on the

Spark logs in standalone clusters

2015-03-04 Thread Thomas Gerber
Hello, I was wondering where all the log files are located on a standalone cluster: 1. the executor logs are in the work directory on each slave machine (stdout/stderr) - I've noticed that GC information is in stdout, and stage information in stderr - *Could we get more i

Re: Spark Monitoring UI for Hadoop Yarn Cluster

2015-03-04 Thread Srini Karri
Yes. I do see files, actually I missed copying the other settings: spark.master spark:// skarri-lt05.redmond.corp.microsoft.com:7077 spark.eventLog.enabled true spark.rdd.compress true spark.storage.memoryFraction 1 spark.core.connection.ack.wait.timeout 6000 spark.ak

Re: Spark Monitoring UI for Hadoop Yarn Cluster

2015-03-04 Thread Marcelo Vanzin
On Wed, Mar 4, 2015 at 10:08 AM, Srini Karri wrote: > spark.executor.extraClassPath > D:\\Apache\\spark-1.2.1-bin-hadoop2\\spark-1.2.1-bin-hadoop2.4\\bin\\classes > spark.eventLog.dir > D:/Apache/spark-1.2.1-bin-hadoop2/spark-1.2.1-bin-hadoop2.4/bin/tmp/spark-events > spark.history.fs.logDirectory

Re: Spark Monitoring UI for Hadoop Yarn Cluster

2015-03-04 Thread Srini Karri
Hi Todd and Marcelo, Thanks for helping me. I was able to launch the history server on Windows without any issues. One problem I am running into right now: I always get the message "No completed applications found" in the history server UI, but I was able to browse through these applications from Spa

Passing around SparkContext within the Driver

2015-03-04 Thread kpeng1
Hi All, I am trying to create a class that wraps functionality that I need; some of these functions require access to the SparkContext, which I would like to pass in. I know that the SparkContext is not serializable, and I am not planning on passing it to worker nodes or anything, I just want to

Does anyone integrate HBASE on Spark

2015-03-04 Thread sandeep vura
Hi Sparkers, How do I integrate HBase with Spark? Appreciate any replies! Regards, Sandeep.v

Re: issue Running Spark Job on Yarn Cluster

2015-03-04 Thread sachin Singh
Not yet. Please let me know if you found a solution. Regards Sachin On 4 Mar 2015 21:45, "mael2210 [via Apache Spark User List]" < ml-node+s1001560n21909...@n3.nabble.com> wrote: > Hello, > > I am facing the exact same issue. Could you solve the problem ? > > Kind regards > > -

Re: TreeNodeException: Unresolved attributes

2015-03-04 Thread Anusha Shamanur
I tried. I still get the same error. 15/03/04 09:01:50 INFO parse.ParseDriver: Parsing command: select * from TableName where value like '%Restaurant%' 15/03/04 09:01:50 INFO parse.ParseDriver: Parse Completed. 15/03/04 09:01:50 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=TableNa

Re: Does SparkSQL support "..... having count (fieldname)" in SQL statement?

2015-03-04 Thread shahab
Thanks Cheng, my problem was a misspelling, which I just fixed; unfortunately the exception message sometimes does not pinpoint the exact reason. Sorry, my bad. On Wed, Mar 4, 2015 at 5:02 PM, Cheng, Hao wrote: > I've tried with latest code, seems it works, which version are you usi

Re: GraphX path traversal

2015-03-04 Thread Robin East
Actually your Pregel code works for me: import org.apache.spark._ import org.apache.spark.graphx._ import org.apache.spark.rdd.RDD val vertexlist = Array((1L,"One"), (2L,"Two"), (3L,"Three"), (4L,"Four"),(5L,"Five"),(6L,"Six")) val edgelist = Array(Edge(6,5,"6 to 5"),Edge(5,4,"5 to 4"),Edge(4,3,
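For completeness, a hedged sketch of turning those arrays into a Graph; the tail of the edge chain is assumed from the visible pattern, and the triplet printout is just a sanity check, not Robin's original Pregel logic:

  import org.apache.spark.graphx._

  val vertexlist = Array((1L, "One"), (2L, "Two"), (3L, "Three"), (4L, "Four"), (5L, "Five"), (6L, "Six"))
  val edgelist = Array(Edge(6L, 5L, "6 to 5"), Edge(5L, 4L, "5 to 4"), Edge(4L, 3L, "4 to 3"),
    Edge(3L, 2L, "3 to 2"), Edge(2L, 1L, "2 to 1"))   // remaining edges assumed from the pattern
  val graph = Graph(sc.parallelize(vertexlist), sc.parallelize(edgelist))
  // print each edge as "source attribute -> destination attribute"
  graph.triplets.map(t => s"${t.srcAttr} -> ${t.dstAttr}").collect().foreach(println)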

Re: how to save Word2VecModel

2015-03-04 Thread Xiangrui Meng
+user On Wed, Mar 4, 2015, 8:21 AM Xiangrui Meng wrote: > You can use the save/load implementation in naive Bayes as reference: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala > > Ping me on the JIRA page to get the ticket

Error communicating with MapOutputTracker

2015-03-04 Thread Thomas Gerber
Hello, We are using spark 1.2.1 on a very large cluster (100 c3.8xlarge workers). We use spark-submit to start an application. We got the following error which leads to a failed stage: Job aborted due to stage failure: Task 3095 in stage 140.0 failed 4 times, most recent failure: Lost task 3095.

Re: Is the RDD's Partitions determined before hand ?

2015-03-04 Thread Imran Rashid
You can set the number of partitions dynamically -- it's just a parameter to a method, so you can compute it however you want; it doesn't need to be some static constant: val dataSizeEstimate = yourMagicFunctionToEstimateDataSize() val numberOfPartitions = yourConversionFromDataSizeToNumPartitions(
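A hedged way to flesh that idea out, assuming an existing SparkContext sc; the 128 MB target, the input path, and the crude size estimator are illustrative choices, not anything from the thread:

  val targetBytesPerPartition = 128L * 1024 * 1024   // aim for roughly 128 MB per partition
  val inputPath = "hdfs:///data/input"                // hypothetical path

  // crude estimate: total length of all lines in the input
  val dataSizeEstimate = sc.textFile(inputPath).map(_.length.toLong).reduce(_ + _)
  val numberOfPartitions = math.max(1, (dataSizeEstimate / targetBytesPerPartition).toInt)
  val shaped = sc.textFile(inputPath).repartition(numberOfPartitions)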

RE: Does SparkSQL support "..... having count (fieldname)" in SQL statement?

2015-03-04 Thread Cheng, Hao
I've tried with the latest code and it seems to work; which version are you using, Shahab? From: yana [mailto:yana.kadiy...@gmail.com] Sent: Wednesday, March 4, 2015 8:47 PM To: shahab; user@spark.apache.org Subject: RE: Does SparkSQL support "..... having count (fieldname)" in SQL statement? I think the

Re: Nested Case Classes (Found and Required Same)

2015-03-04 Thread Bojan Kostic
Did you find any other way around this issue? I just found out that I have a dataset with more than 22 columns... And now I am searching for the best solution. Has anyone else experienced this problem? Best Bojan -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Nested-Case

Re: scala.Double vs java.lang.Double in RDD

2015-03-04 Thread Imran Rashid
This doesn't involve spark at all, I think this is entirely an issue with how scala deals w/ primitives and boxing. Often it can hide the details for you, but IMO it just leads to far more confusing errors when things don't work out. The issue here is that your map has value type Any, which leads
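A small hedged example of the unboxing dance being described; the map and key are made up for illustration:

  val m: Map[String, Any] = Map("x" -> 1.5)   // 1.5 is stored as a boxed java.lang.Double
  val x: Double = m("x") match {
    case d: Double => d                        // the type pattern unboxes back to scala.Double
    case other     => sys.error(s"not a double: $other")
  }
  println(x + 1.0)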

Save and read parquet from the same path

2015-03-04 Thread Karlson
Hi all, what would happen if I save a RDD via saveAsParquetFile to the same path that RDD is originally read from? Is that a safe thing to do in Pyspark? Thanks! - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org Fo

RE: Spark Streaming Switchover Time

2015-03-04 Thread Nastooh Avessta (navesta)
Indeed. And I am wondering if this switchover time can be decreased. Cheers, Nastooh Avessta ENGINEER.SOFTWARE ENGINEERING nave...@cisco.com Phone: +1 604 647 1527 Cisco Systems Limited 595 Burrard Street, Suite 2123 Three Bentall

Re: Unable to submit spark job to mesos cluster

2015-03-04 Thread Akhil Das
Looks like you have 2 Netty jars in the classpath. Thanks Best Regards On Wed, Mar 4, 2015 at 5:14 PM, Sarath Chandra < sarathchandra.jos...@algofusiontech.com> wrote: > From the lines pointed to in the exception log, I figured out that my code is > unable to get the spark context. To isolate

Re: TreeNodeException: Unresolved attributes

2015-03-04 Thread Arush Kharbanda
Why don't you build the query string before you pass it to the hql function (by appending strings)? Also, the hql function is deprecated; you should use sql. http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext On Wed, Mar 4, 2015 at 6:15 AM, Anusha Shamanur wrote: >

Re: Speed Benchmark

2015-03-04 Thread Guillaume Guy
Sorry for the confusion. All are running Hadoop services. Node 1 is the namenode whereas Nodes 2 and 3 are datanodes. Best, Guillaume Guy * +1 919 - 972 - 8750* On Sat, Feb 28, 2015 at 1:09 AM, Sean Owen wrote: > Is machine 1 the only one running an HDFS data node? You describe it as > one

Re: Unable to submit spark job to mesos cluster

2015-03-04 Thread Arush Kharbanda
You can try increasing the Akka timeout in the config; you can set the following in your config: spark.core.connection.ack.wait.timeout: 600 spark.akka.timeout: 1000 (in secs) spark.akka.frameSize: 50 On Wed, Mar 4, 2015 at 5:14 PM, Sarath Chandra < sarathchandra.jos...@algofusiontech.com> wrote:
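If it helps, a hedged sketch of setting those same values programmatically; they can also live in spark-defaults.conf or be passed with --conf on spark-submit:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.core.connection.ack.wait.timeout", "600")  // seconds
    .set("spark.akka.timeout", "1000")                     // seconds
    .set("spark.akka.frameSize", "50")                     // MB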

Re: LBGFS optimizer performace

2015-03-04 Thread Gustavo Enrique Salazar Torres
Yeah, without caching it gets really slow. I will try to minimize the number of columns in my tables; that may save lots of memory and it will eventually work. I will let you know. Thanks! Gustavo On Tue, Mar 3, 2015 at 8:58 PM, Joseph Bradley wrote: > I would recommend caching; if you can't

RE: Does SparkSQL support "..... having count (fieldname)" in SQL statement?

2015-03-04 Thread yana
I think the problem is that you are using an alias in the HAVING clause. I am not able to try just now, but see if HAVING count(*) > 2 works (i.e. don't use cnt). Sent on the new Sprint Network from my Samsung Galaxy S®4. Original message From: shahab Date: 03/04/2015 7:22 AM (G
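In other words, repeat the aggregate instead of the alias; a hedged sketch through HiveContext, where hiveContext and the products table are taken from the original question:

  val result = hiveContext.sql(
    """SELECT category, count(1) AS cnt
      |FROM products
      |GROUP BY category
      |HAVING count(1) > 10""".stripMargin)
  result.collect().foreach(println)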

Does SparkSQL support "..... having count (fieldname)" in SQL statement?

2015-03-04 Thread shahab
Hi, It seems that SparkSQL, even the HiveContext, does not support SQL statements like : SELECT category, count(1) AS cnt FROM products GROUP BY category HAVING cnt > 10; I get this exception: Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: CAST(('

Re: Problems running version 1.3.0-rc1

2015-03-04 Thread Yiannis Gkoufas
Hi Sean, thanks a lot for helping me pinpoint the issue! The zips I tried to download were just fine. The problem was fixed when I deleted my .m2 folder. Probably something wasn't downloaded properly from Maven Repositories. Thanks! On 3 March 2015 at 09:26, Sean Owen wrote: > Is that really t

Re: Unable to submit spark job to mesos cluster

2015-03-04 Thread Sarath Chandra
From the lines pointed to in the exception log, I figured out that my code is unable to get the Spark context. To isolate the problem, I've written a small piece of code as below - import org.apache.spark.SparkConf; import org.apache.spark.SparkContext; public class Test { public static void

Unable to submit spark job to mesos cluster

2015-03-04 Thread Sarath Chandra
Hi, I have a cluster running on CDH 5.2.1 and I have a Mesos cluster (version 0.18.1). Through an Oozie Java action I want to submit a Spark job to the Mesos cluster. Before configuring it as an Oozie job I'm testing the Java action from the command line and getting the exception below. While running I'm poin

Re: insert Hive table with RDD

2015-03-04 Thread patcharee
Hi, I guess the toDF() API is in Spark 1.3, which requires building from source code? Patcharee On 03. mars 2015 13:42, Cheng, Hao wrote: Using the SchemaRDD / DataFrame API via HiveContext. Assume you're using the latest code, something probably like: val hc = new HiveContext(sc) import hc.im
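For reference, a hedged sketch of what the 1.3-style API looks like once a 1.3 build is on the classpath; the case class and Hive table names are placeholders:

  import org.apache.spark.sql.hive.HiveContext

  case class Record(key: Int, value: String)

  val hc = new HiveContext(sc)
  import hc.implicits._

  val df = sc.parallelize(Seq(Record(1, "a"), Record(2, "b"))).toDF()
  df.registerTempTable("records")
  hc.sql("INSERT INTO TABLE my_hive_table SELECT key, value FROM records")  // my_hive_table is hypothetical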

Re: Is the RDD's Partitions determined before hand ?

2015-03-04 Thread Sean Owen
Parallelism doesn't really affect the throughput as long as it's: - not less than the number of available execution slots, - ... and probably some low multiple of them to even out task size effects - not so high that the bookkeeping overhead dominates Although you may need to select different sca

Re: Why can't Spark Streaming recover from the checkpoint directory when using a third party library for processingmulti-line JSON?

2015-03-04 Thread Emre Sevinc
I'm adding this 3rd-party library to my Maven pom.xml file so that it's embedded into the JAR I send to spark-submit: groupId json-mapreduce, artifactId json-mapreduce, version 1.0-SNAPSHOT, with exclusions for javax.servlet:* and commons-io:*

Re: Why can't Spark Streaming recover from the checkpoint directory when using a third party library for processingmulti-line JSON?

2015-03-04 Thread Tathagata Das
That could be a corner case bug. How do you add the 3rd party library to the class path of the driver? Through spark-submit? Could you give the command you used? TD On Wed, Mar 4, 2015 at 12:42 AM, Emre Sevinc wrote: > I've also tried the following: > > Configuration hadoopConfiguration = n

Re: Is the RDD's Partitions determined before hand ?

2015-03-04 Thread Jeff Zhang
Hi Sean, > If you know a stage needs unusually high parallelism for example you can repartition further for that stage. The problem is we may not know whether high parallelism is needed, e.g. for the join operator, high parallelism may only be necessary for some datasets where lots of data can j

Re: Is FileInputDStream returned by fileStream method a reliable receiver?

2015-03-04 Thread Tathagata Das
The file stream does not use a receiver. Maybe that was not clear in the programming guide; I am updating it for the 1.3 release right now and will make it more clear. And the file stream has full reliability. Read this in the programming guide. http://spark.apache.org/docs/latest/streaming-programming-guide
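A hedged usage sketch of the receiver-less file stream being described; the directory and batch interval are placeholders:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(new SparkConf().setAppName("file-stream"), Seconds(30))
  val lines = ssc.textFileStream("hdfs:///incoming/")  // monitored directory; new files become RDDs
  lines.count().print()
  ssc.start()
  ssc.awaitTermination()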

Is FileInputDStream returned by fileStream method a reliable receiver?

2015-03-04 Thread Emre Sevinc
Is FileInputDStream returned by fileStream method a reliable receiver? In the Spark Streaming Guide it says: "There can be two kinds of data sources based on their *reliability*. Sources (like Kafka and Flume) allow the transferred data to be acknowledged. If the system receiving data from

Re: spark master shut down suddenly

2015-03-04 Thread Akhil Das
You can check in the mesos logs and see what's really happening. Thanks Best Regards On Wed, Mar 4, 2015 at 3:10 PM, lisendong wrote: > 15/03/04 09:26:36 INFO ClientCnxn: Client session timed out, have not heard > from server in 26679ms for sessionid 0x34bbf3313a8001b, closing socket > connectio

Spark Streaming and SchemaRDD usage

2015-03-04 Thread Haopu Wang
Hi, in the roadmap of Spark in 2015 (link: http://files.meetup.com/3138542/Spark%20in%202015%20Talk%20-%20Wendell.pptx), I saw SchemaRDD is designed to be the basis of BOTH Spark Streaming and Spark SQL. My question is: what's the typical usage of SchemaRDD in a Spark Streaming application? Thank

spark master shut down suddenly

2015-03-04 Thread lisendong
15/03/04 09:26:36 INFO ClientCnxn: Client session timed out, have not heard from server in 26679ms for sessionid 0x34bbf3313a8001b, closing socket connection and attempting reconnect 15/03/04 09:26:36 INFO ConnectionStateManager: State change: SUSPENDED 15/03/04 09:26:36 INFO ZooKeeperLeaderElectio

Re: Is the RDD's Partitions determined before hand ?

2015-03-04 Thread Sean Owen
Hm, what do you mean? You can control, to some extent, the number of partitions when you read the data, and you can repartition if needed. You can set the default parallelism too so that it takes effect for most ops that create an RDD. One # of partitions is usually about right for all work (2x or so

scala.Double vs java.lang.Double in RDD

2015-03-04 Thread Tobias Pfeiffer
Hi, I have a function with signature def aggFun1(rdd: RDD[(Long, (Long, Double))]): RDD[(Long, Any)] and one with def aggFun2[_Key: ClassTag, _Index](rdd: RDD[(_Key, (_Index, Double))]): RDD[(_Key, Double)] where all "Double" classes involved are "scala.Double" classes (according t

Re: Connecting a PHP/Java applications to Spark SQL Thrift Server

2015-03-04 Thread أنس الليثي
Thanks very much, I used it and it works fine for me. On 4 March 2015 at 11:56, Arush Kharbanda wrote: > For Java you can use hive-jdbc connectivity jars to connect to Spark-SQL. > > The driver is inside the hive-jdbc Jar. > > *http://hive.apache.org/javadocs/r0.11.0/api/org/apache/hadoop/hive/j

Spark RDD Python, Numpy Shape command

2015-03-04 Thread rui li
I am a beginner with Spark and have some simple questions regarding the use of RDDs in Python. Suppose I have a matrix called data_matrix; I pass it to an RDD using RDD_matrix = sc.parallelize(data_matrix), but I will have a problem if I want to know the dimensions of the matrix in Spark, because Sparkk
