Re: Checkpoint file not found

2015-08-03 Thread Tathagata Das
Can you tell us more about your streaming app? What DStream operations are you using? On Sun, Aug 2, 2015 at 9:14 PM, Anand Nalya wrote: > Hi, > > I'm writing a Streaming application in Spark 1.3. After running for some > time, I'm getting the following exception. I'm sure that no other process is > modi

RE: SparkLauncher not notified about finished job - hangs infinitely.

2015-08-03 Thread Tomasz Guziałek
Reading from the input stream and the error stream (in separate threads) indeed unblocked the launcher and it exited properly. Thanks for your responses! Best regards, Tomasz From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Friday, July 31, 2015 19:20 To: Elkhan Dadashov Cc: Tomasz Guziałek; user
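
A minimal sketch of that fix, assuming a SparkLauncher-based driver (the jar path, main class, and master are placeholders): launch() returns a plain java.lang.Process, and draining its stdout/stderr in background threads keeps the child from blocking on a full pipe buffer.

import java.io.{BufferedReader, InputStream, InputStreamReader}
import org.apache.spark.launcher.SparkLauncher

val process = new SparkLauncher()
  .setAppResource("/path/to/app.jar")   // placeholder jar path
  .setMainClass("com.example.MyApp")    // placeholder main class
  .setMaster("yarn-client")
  .launch()

// Drain a stream in a daemon thread so the child process never blocks
// on a full stdout/stderr pipe.
def drain(in: InputStream, label: String): Unit = {
  val t = new Thread(new Runnable {
    override def run(): Unit = {
      val reader = new BufferedReader(new InputStreamReader(in))
      var line = reader.readLine()
      while (line != null) { println(s"[$label] $line"); line = reader.readLine() }
    }
  })
  t.setDaemon(true)
  t.start()
}

drain(process.getInputStream, "stdout")
drain(process.getErrorStream, "stderr")
val exitCode = process.waitFor()  // now returns once the job finishes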

spark --files permission error

2015-08-03 Thread Shushant Arora
Is there any setting to allow --files to copy jars from the driver to executor nodes? When I am passing some jar files using --files to executors and adding them to the executor classpath, it throws a File not found exception: 15/08/03 07:59:50 WARN TaskSetManager: Lost task 8.0 in stage 0.0 (TID 8, ip
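
For what it's worth, the usual way to ship dependency jars is --jars rather than --files, since --jars both distributes the files and puts them on the executor classpath. A hedged example (paths and class names are placeholders):

spark-submit \
  --master yarn-client \
  --jars /local/path/dep1.jar,/local/path/dep2.jar \
  --class com.example.MyJob \
  my-app.jar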

Re: Checkpoint file not found

2015-08-03 Thread Anand Nalya
Hi, It's an application that maintains some state from the DStream using the updateStateByKey() operation. It then selects some of the records from the current batch using some criteria over the current values and the state, and carries over the remaining values to the next batch. Following is the pseudo code: va
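
A minimal sketch of that pattern (not the poster's actual code, which is truncated above): updateStateByKey keeps per-key state across batches, and returning None drops a key from the state.

// assuming `ssc` is a StreamingContext and `pairs` is a DStream[(String, Long)]
ssc.checkpoint("hdfs:///tmp/checkpoints")  // stateful ops require checkpointing; path is a placeholder

val updateFunc = (newValues: Seq[Long], state: Option[Long]) => {
  val current = newValues.sum + state.getOrElse(0L)
  // carry small values over to the next batch; drop the rest from state
  if (current < 100L) Some(current) else None  // threshold is illustrative
}
val stateStream = pairs.updateStateByKey(updateFunc)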

Is it possible to disable AM page proxy in Yarn client mode?

2015-08-03 Thread Rex Xiong
In Yarn client mode, the Spark driver URL will be redirected to the Yarn web proxy server, but I don't want to use this dynamic name. Is it possible to still use <host>:<port> as in standalone mode?

Re: About memory leak in spark 1.4.1

2015-08-03 Thread Barak Gitsis
Sea, it exists, trust me. We have Spark in production under Yarn. If you want more control, use Yarn if you can; at least it kills the executor if it hogs memory. I am explicitly setting spark.yarn.executor.memoryOverhead to the same size as the heap for one of our processes. For example: spark.execut
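
For reference, a hedged example of that setting (sizes are illustrative; spark.yarn.executor.memoryOverhead is specified in megabytes):

spark-submit \
  --executor-memory 4g \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  ...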

Running multiple batch jobs in parallel using Spark on Mesos

2015-08-03 Thread Akash Mishra
Hello *, We are trying to build some batch jobs using Spark on Mesos. Mesos offers two main modes of deploying a Spark job: 1. Fine-grained 2. Coarse-grained. When we run Spark jobs in fine-grained mode, Spark uses the maximum amount of offers from Mesos to run the job. Runnin

Re: spark cluster setup

2015-08-03 Thread Akhil Das
Are you sitting behind a firewall and accessing a remote master machine? In that case, have a look at this http://spark.apache.org/docs/latest/configuration.html#networking; you might want to fix a few properties like spark.driver.host, spark.driver.port, etc. Thanks Best Regards On Mon, Aug 3, 2015
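
A sketch of the kind of pinning meant here, with illustrative values, so that a firewall can be opened for a fixed address and port:

spark-submit \
  --conf spark.driver.host=192.168.1.10 \
  --conf spark.driver.port=7001 \
  ...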

Re: Extremely poor predictive performance with RF in mllib

2015-08-03 Thread Barak Gitsis
hi, I've run into some poor RF behavior, although not as pronounced as yours.. would be great to get more insight into this one. Thanks! On Mon, Aug 3, 2015 at 8:21 AM pkphlam wrote: > Hi, > > This might be a long shot, but has anybody run into very poor predictive > performance using RandomForest

org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit

2015-08-03 Thread Rajeshkumar J
Hi Everyone, I have been using Apache Spark for 2 weeks, and as of now I am querying Hive tables using the Spark Java API. It works fine in Hadoop single-node mode, but when I tried the same code on a Hadoop multi-node cluster it throws "org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't

Re: About memory leak in spark 1.4.1

2015-08-03 Thread Igor Berman
In general, what is your configuration? Use --conf "spark.logConf=true". We have 1.4.1 in a production standalone cluster and haven't experienced what you are describing. Can you verify in the web UI that Spark indeed got your 50g per executor limit? I mean in the configuration page.. maybe you are using

spark streaming program failed on Spark 1.4.1

2015-08-03 Thread Netwaver
Hi All, I have a Spark Streaming + Kafka program written in Scala. It works well on Spark 1.3.1, but after I migrated my Spark cluster to 1.4.1 and reran this program, I hit the below exception: ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0: Error starting

Re: spark streaming program failed on Spark 1.4.1

2015-08-03 Thread Cody Koeninger
Just to be clear, did you rebuild your job against spark 1.4.1 as well as upgrading the cluster? On Mon, Aug 3, 2015 at 8:36 AM, Netwaver wrote: > Hi All, > I have a spark streaming + kafka program written by Scala, it > works well on Spark 1.3.1, but after I migrate my Spark cluster to

How to calculate standard deviation of grouped data in a DataFrame?

2015-08-03 Thread the3rdNotch
I have user logs that I have taken from a csv and converted into a DataFrame in order to leverage the SparkSQL querying features. A single user will create numerous entries per hour, and I would like to gather some basic statistical information for each user; really just the count of the user inst
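
Spark 1.3/1.4 DataFrames have no built-in stddev aggregate, but it can be derived from avg(x) and avg(x*x). A sketch, assuming a DataFrame `df` with columns `user` and `value`:

import org.apache.spark.sql.functions._

val stats = df.groupBy("user").agg(
  count("value").as("n"),
  avg("value").as("mean"),
  avg(col("value") * col("value")).as("meanSq"))

// population stddev = sqrt(E[x^2] - E[x]^2); scale by n/(n-1) under the
// square root for the sample variant
val withStd = stats.withColumn("stddev",
  sqrt(col("meanSq") - col("mean") * col("mean")))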

How do I Process Streams that span multiple lines?

2015-08-03 Thread Spark Enthusiast
All examples of Spark Stream programming that I see assume streams of lines that are then tokenised and acted upon (like the WordCount example). How do I process streams that span multiple lines? Are there examples that I can use?

Re: How to control Spark Executors from getting Lost when using YARN client mode?

2015-08-03 Thread Umesh Kacha
Hi all, any help will be much appreciated. My Spark job runs fine, but in the middle it starts losing executors because of a MetadataFetchFailed exception saying shuffle not found at the location, since the executor is lost. On Jul 31, 2015 11:41 PM, "Umesh Kacha" wrote: > Hi thanks for the response. It looks

large scheduler delay in pyspark

2015-08-03 Thread gen tang
Hi, Recently I met some problems with scheduler delay in PySpark. I have worked on this problem for several days without success, so I have come here to ask for help. I have a key-value pair RDD like rdd[(key, list[dict])] and I tried to merge values by "adding" two lists. If I do reduceByKey as fo

Re: Cannot Import Package (spark-csv)

2015-08-03 Thread Burak Yavuz
Hi, there was this issue for Scala 2.11. https://issues.apache.org/jira/browse/SPARK-7944 It should be fixed on master branch. You may be hitting that. Best, Burak On Sun, Aug 2, 2015 at 9:06 PM, Ted Yu wrote: > I tried the following command on master branch: > bin/spark-shell --packages com.da

Re: Cannot Import Package (spark-csv)

2015-08-03 Thread Burak Yavuz
In addition, you do not need to use --jars with --packages. --packages will get the jar for you. Best, Burak On Mon, Aug 3, 2015 at 9:01 AM, Burak Yavuz wrote: > Hi, there was this issue for Scala 2.11. > https://issues.apache.org/jira/browse/SPARK-7944 > It should be fixed on master branch. Yo
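
For reference, the --packages form alone is enough; the resolved jar lands on both the driver and executor classpaths (the version shown is illustrative):

bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3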

Re: HiveQL to SparkSQL

2015-08-03 Thread Bigdata techguy
Did anybody try to convert HiveQL queries to SparkSQL? If so, would you share the experience, pros & cons please? Thank you. On Thu, Jul 30, 2015 at 10:37 AM, Bigdata techguy wrote: > Thanks Jorn for the response and for the pointer questions to Hive > optimization tips. > > I believe I have don

Does RDD.cartesian involve shuffling?

2015-08-03 Thread Meihua Wu
Does RDD.cartesian involve shuffling? Thanks! - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Standalone Cluster Local Authentication

2015-08-03 Thread MrJew
Hello, Similar to other cluster systems, e.g. Zookeeper and Hazelcast, Spark has the problem that, while it is protected from the outside world, anyone having access to the host can run a Spark node without the need for authentication. Currently we are using Spark 1.3.1. Is there a way to enable authentica

EOFException when transmitting a class that extends Externalizable

2015-08-03 Thread Michael Knapp
Hi, I am having a problem serializing a custom partitioner that I have written that extends Externalizable. The partitioner wraps a java TreeSet which stores table splits. There are thousands of splits. I noticed earlier that my spark job was taking over 30 seconds just to transmit a task to ea
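
A minimal sketch of an Externalizable range partitioner over sorted split boundaries (names are illustrative, not the poster's code). The two details that commonly cause EOFException when they are wrong: a public no-arg constructor, and readExternal reading back exactly what writeExternal wrote, in the same order.

import java.io.{Externalizable, ObjectInput, ObjectOutput}
import java.util.TreeSet
import org.apache.spark.Partitioner

class SplitPartitioner(private var splits: TreeSet[String])
    extends Partitioner with Externalizable {

  def this() = this(new TreeSet[String]())  // required no-arg constructor

  override def numPartitions: Int = splits.size + 1

  // partition index = number of split boundaries <= key
  override def getPartition(key: Any): Int =
    splits.headSet(key.asInstanceOf[String], true).size

  override def writeExternal(out: ObjectOutput): Unit = {
    out.writeInt(splits.size)
    val it = splits.iterator()
    while (it.hasNext) out.writeUTF(it.next())
  }

  override def readExternal(in: ObjectInput): Unit = {
    val n = in.readInt()
    splits = new TreeSet[String]()
    var i = 0
    while (i < n) { splits.add(in.readUTF()); i += 1 }
  }
}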

Re: Is it possible to disable AM page proxy in Yarn client mode?

2015-08-03 Thread Steve Loughran
the reason that redirect is there is for security reasons; in a kerberos enabled cluster the RM proxy does the authentication, then forwards the requests to the running application. There's no obvious way to disable it in the spark application master, and I wouldn't recommend doing this anyway,

Re: How to increase parallelism of a Spark cluster?

2015-08-03 Thread Sujit Pal
@Silvio: the mapPartitions instantiates a HttpSolrServer, then for each query string in the partition, sends the query to Solr using SolrJ, and gets back the top N results. It then reformats the result data into one long string and returns the key value pair as (query string, result string). @Igor
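
A sketch of that per-partition client pattern, assuming SolrJ's HttpSolrServer and a queries: RDD[String] (the URL and row count are placeholders):

import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrServer

val results = queries.mapPartitions { iter =>
  // one client per partition, reused for every query in it
  val server = new HttpSolrServer("http://solr-host:8983/solr/core")
  iter.map { q =>
    val docs = server.query(new SolrQuery(q).setRows(10)).getResults
    // flatten the top-N docs into one long string, as described above
    (q, (0 until docs.size()).map(i => docs.get(i).toString).mkString("|"))
  }
}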

Re: Standalone Cluster Local Authentication

2015-08-03 Thread Ted Yu
Looks like related work is in progress. e.g. SPARK-5158 Cheers On Mon, Aug 3, 2015 at 10:05 AM, MrJew wrote: > Hello, > Similar to other cluster systems e.g Zookeeper, Hazelcast. Spark has the > problem that is protected from the outside world however anyone having > access to the host can run

Re: Standalone Cluster Local Authentication

2015-08-03 Thread Steve Loughran
> On 3 Aug 2015, at 10:05, MrJew wrote: > > Hello, > Similar to other cluster systems e.g Zookeeper, Actually, Zookeeper supports SASL authentication of your Kerberos tokens. https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zookeeper+and+SASL > Hazelcast. Spark has the > problem that i

Re: Package Release Annoucement: Spark SQL on HBase "Astro"

2015-08-03 Thread Ted Yu
When I tried to compile against hbase 1.1.1, I got:
[ERROR] /home/hbase/ssoh/src/main/scala/org/apache/spark/sql/hbase/SparkSqlRegionObserver.scala:124: overloaded method next needs result type
[ERROR] override def next(result: java.util.List[Cell], limit: Int) = next(result)
Is there a plan to s

Combine code for RDD and DStream

2015-08-03 Thread Sidd S
Hello! I am developing a Spark program that uses both batch and streaming (separately). They are both pretty much the exact same programs, except the inputs come from different sources. Unfortunately, RDDs and DStreams define all of their transformations in their own files, and so I have two dif

Re: How do I Process Streams that span multiple lines?

2015-08-03 Thread Michal Čizmazia
Are you looking for RDD.wholeTextFiles? On 3 August 2015 at 10:57, Spark Enthusiast wrote: > All examples of Spark Stream programming that I see assume streams of > lines that are then tokenised and acted upon (like the WordCount example). > > How do I process Streams that span multiple lines?

Re: How do I Process Streams that span multiple lines?

2015-08-03 Thread Michal Čizmazia
Sorry. SparkContext.wholeTextFiles Not sure about streams. On 3 August 2015 at 14:50, Michal Čizmazia wrote: > Are you looking for RDD.wholeTextFiles? > > On 3 August 2015 at 10:57, Spark Enthusiast > wrote: > >> All examples of Spark Stream programming that I see assume streams of >> lines
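
For the batch case, a sketch: wholeTextFiles yields one (path, content) pair per file, so records that span multiple lines stay together (the path and record delimiter are assumptions):

val files = sc.wholeTextFiles("hdfs:///logs/")
val records = files.flatMap { case (path, content) =>
  content.split("\n\n")  // assuming blank-line-delimited records
}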

Writing to HDFS

2015-08-03 Thread Jasleen Kaur
I am executing a spark job on a cluster as a yarn-client (Yarn cluster not an option due to permission issues).
- num-executors 800
- spark.akka.frameSize=1024
- spark.default.parallelism=25600
- driver-memory=4G
- executor-memory=32G
- My input size is around 1.5TB. My problem

Re: How to increase parallelism of a Spark cluster?

2015-08-03 Thread Ajay Singal
Hi Sujit, From experimenting with Spark (and other documentation), my understanding is as follows:
1. Each application consists of one or more Jobs
2. Each Job has one or more Stages
3. Each Stage creates one or more Tasks (normally, one Task per Partition)
4. Master

Re: How to increase parallelism of a Spark cluster?

2015-08-03 Thread shahid ashraf
hi sujit, Can you spin it up with 4 (servers) * 4 (cores) = 16 cores, i.e. there should be 16 cores in your cluster; try to use the same no. of partitions. Also look at the http://apache-spark-user-list.1001560.n3.nabble.com/No-of-Task-vs-No-of-Executors-td23824.html On Tue, Aug 4, 2015 at 1:46 AM, Ajay Singal

Re: Python, Spark and HBase

2015-08-03 Thread ericbless
I wanted to confirm whether this is now supported, such as in Spark v1.3.0 I've read varying info online & just thought I'd verify. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-and-HBase-tp6142p24117.html Sent from the Apache Spark U

Re: how to ignore MatchError then processing a large json file in spark-sql

2015-08-03 Thread Michael Armbrust
This sounds like a bug. What version of Spark, and can you provide the stack trace? On Sun, Aug 2, 2015 at 11:27 AM, fuellee lee wrote: > I'm trying to process a bunch of large json log files with spark, but it > fails every time with `scala.MatchError`, whether I give it a schema or not. > > I j

Re: Combine code for RDD and DStream

2015-08-03 Thread Sidd S
DStreams "transform" function helps me solve this issue elegantly. Thanks! On Mon, Aug 3, 2015 at 1:42 PM, Sidd S wrote: > Hello! > > I am developing a Spark program that uses both batch and streaming > (separately). They are both pretty much the exact same programs, except the > inputs come fro

Re: how to convert a sequence of TimeStamp to a dataframe

2015-08-03 Thread Michael Armbrust
In general it needs to be a Seq of Tuples for the implicit toDF to work (which is a little tricky when there is only one column). scala> Seq(Tuple1(new java.sql.Timestamp(System.currentTimeMillis))).toDF("a") res3: org.apache.spark.sql.DataFrame = [a: timestamp] or with multiple columns scala> S

shutdown local hivecontext?

2015-08-03 Thread Cesar Flores
We are using a local hive context in order to run unit tests. Our unit tests run perfectly fine if we run them one by one using sbt, as in the next example: >sbt test-only com.company.pipeline.scalers.ScalerSuite.scala >sbt test-only com.company.pipeline.labels.ActiveUsersLabelsSuite.scala However, if we

NullPointException Help while using accumulators

2015-08-03 Thread Anubhav Agarwal
Hi, I am trying to modify my code to use HDFS and multiple nodes. The code works fine when I run it locally on a single machine with a single worker. I have been trying to modify it and I get the following error. Any hint would be helpful. java.lang.NullPointerException at thomsonreuters.

Re: NullPointException Help while using accumulators

2015-08-03 Thread Ted Yu
Can you show related code in DriverAccumulator.java ? Which Spark release do you use ? Cheers On Mon, Aug 3, 2015 at 3:13 PM, Anubhav Agarwal wrote: > Hi, > I am trying to modify my code to use HDFS and multiple nodes. The code > works fine when I run it locally in a single machine with a sing

Re: NullPointException Help while using accumulators

2015-08-03 Thread Anubhav Agarwal
The code was written in 1.4 but I am compiling it and running it with 1.3. import it.unimi.dsi.fastutil.objects.Object2ObjectOpenHashMap; import org.apache.spark.AccumulableParam; import scala.Tuple4; import thomsonreuters.trailblazer.operation.DriverCalc; import thomsonreuters.trailblazer.operati

Re: NullPointException Help while using accumulators

2015-08-03 Thread Ted Yu
Putting your code in a file, I find the following on line 17: stepAcc = new StepAccumulator(); However, I don't think that was where the NPE was thrown. Another thing I don't understand is that there were two addAccumulator() calls at the top of the stack trace, while in your code I don'

Re: NullPointException Help while using accumulators

2015-08-03 Thread Anubhav Agarwal
.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:647)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:647)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.

Contributors group and starter task

2015-08-03 Thread Namit Katariya
My username on the Apache JIRA is katariya.namit. Could one of the admins please add me to the contributors group so that I can have a starter task assigned to myself? Thanks, Namit

Re: Contributors group and starter task

2015-08-03 Thread Marcelo Vanzin
Hi Namit, There's no need to assign a bug to yourself to say you're working on it. The recommended way is to just post a PR on github - the bot will update the bug saying that you have a patch open to fix the issue. On Mon, Aug 3, 2015 at 3:50 PM, Namit Katariya wrote: > My username on the Apa

Re: Contributors group and starter task

2015-08-03 Thread Ted Yu
Once you submit a pull request for some JIRA, the JIRA would be assigned to you. Cheers On Mon, Aug 3, 2015 at 3:50 PM, Namit Katariya wrote: > My username on the Apache JIRA is katariya.namit. Could one of the admins > please add me to the contributors group so that I can have a starter task >

Re: shutdown local hivecontext?

2015-08-03 Thread Michael Armbrust
TestHive takes care of creating a temporary directory for each invocation so that multiple test runs won't conflict. On Mon, Aug 3, 2015 at 3:09 PM, Cesar Flores wrote: > > We are using a local hive context in order to run unit tests. Our unit > tests runs perfectly fine if we run why by one usi
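
A hedged sketch of using TestHive (it lives in the spark-hive test artifact, so the exact dependency wiring may differ in your build):

import org.apache.spark.sql.hive.test.TestHive

TestHive.sql("CREATE TABLE IF NOT EXISTS t (key INT, value STRING)")
val df = TestHive.sql("SELECT count(*) FROM t")
TestHive.reset()  // clears tables and state between suites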

SparkR broadcast variables

2015-08-03 Thread Deborah Siegel
Hello, In looking at the SparkR codebase, it seems as if broadcast variables ought to be working, based on the tests. I have tried the following in the SparkR shell, and similar code in RStudio, but in both cases got the same message > randomMat <- matrix(nrow=10, ncol=10, data=rnorm(100)) > randomMa

Re: SparkR broadcast variables

2015-08-03 Thread Deborah Siegel
I think I just answered my own question. The privatization of the RDD API might have resulted in my error, because this worked: > randomMatBr <- SparkR:::broadcast(sc, randomMat) On Mon, Aug 3, 2015 at 4:59 PM, Deborah Siegel wrote: > Hello, > > In looking at the SparkR codebase, it seems as if

How does DataFrame except work?

2015-08-03 Thread Srikanth
Hello, I'm planning to use DF1.except(DF2) to get the difference between two dataframes, and I'd like to know how exactly this API works. Both explain() and the Spark UI show "except" as an operation on its own. Internally, does it do a hash partition of both dataframes? If so, will it do auto broadcast i

Multiple UpdateStateByKey Functions in the same job?

2015-08-03 Thread swetha
Hi, Can I use multiple updateStateByKey functions in the Streaming job? Suppose I need to maintain the state of the user session in the form of a Json, and counts of various other metrics which have different keys? Can I use multiple updateStateByKey functions to maintain the state for different ke

Re: Writing to HDFS

2015-08-03 Thread ayan guha
Is your data skewed? What happens if you do rdd.count()? On 4 Aug 2015 05:49, "Jasleen Kaur" wrote: > I am executing a spark job on a cluster as a yarn-client(Yarn cluster not > an option due to permission issues). > >- num-executors 800 >- spark.akka.frameSize=1024 >- spark.default.p

Topology.py -- Cannot run on Spark Gateway on Cloudera 5.4.4.

2015-08-03 Thread Upen N
Hi, I recently installed Cloudera CDH 5.4.4. Spark comes shipped with this version. I created Spark gateways, but I get the following error when running the Spark shell from the gateway. Does anyone have any similar experience? If so, please share the solution. Google shows to copy the conf files from da

Re: Topology.py -- Cannot run on Spark Gateway on Cloudera 5.4.4.

2015-08-03 Thread Marcelo Vanzin
That should not be a fatal error, it's just a noisy exception. Anyway, it should go away if you add YARN gateways to those nodes (aside from Spark gateways). On Mon, Aug 3, 2015 at 7:10 PM, Upen N wrote: > Hi, > I recently installed Cloudera CDH 5.4.4. Sparks comes shipped with this > version.

Re: Topology.py -- Cannot run on Spark Gateway on Cloudera 5.4.4.

2015-08-03 Thread Guru Medasani
Hi Upen, Did you deploy the client configs after assigning the gateway roles? You should be able to do this from Cloudera Manager. Can you try this and let us know what you see when you run spark-shell? Guru Medasani gdm...@gmail.com > On Aug 3, 2015, at 9:10 PM, Upen N wrote: > > Hi, > I

Unable to compete with performance of single-threaded Scala application

2015-08-03 Thread Philip Weaver
Hello, I am running Spark 1.4.0 on Mesos 0.22.1, and usually I run my jobs in coarse-grained mode. I have written some single-threaded standalone Scala applications for a problem that I am working on, and I am unable to get a Spark solution that comes close to the performance of this application.

Safe to write to parquet at the same time?

2015-08-03 Thread Philip Weaver
I think this question applies regardless of whether I have two completely separate Spark jobs or tasks on different machines, or two cores that are part of the same task on the same machine. If two jobs/tasks/cores/stages both save to the same parquet directory in parallel like this: df1.write.mode(SaveM
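
One conservative workaround, if concurrent writers to a single directory turn out to be unsafe: give each writer its own subdirectory and read them back together (paths are placeholders):

import org.apache.spark.sql.SaveMode

df1.write.mode(SaveMode.Append).parquet("hdfs:///out/job=1")
df2.write.mode(SaveMode.Append).parquet("hdfs:///out/job=2")
val all = sqlContext.read.parquet("hdfs:///out/job=1", "hdfs:///out/job=2")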

Re: Spark-Submit error

2015-08-03 Thread Guru Medasani
Hi Satish, Can you add more error or log info to the email? Guru Medasani gdm...@gmail.com > On Jul 31, 2015, at 1:06 AM, satish chandra j > wrote: > > HI, > I have submitted a Spark Job with options jars,class,master as local but i am > getting an error as below > > dse spark-submit spa

spark streaming max receiver rate doubts

2015-08-03 Thread Shushant Arora
1. In Spark 1.3 (non-receiver) - if my batch interval is 1 sec and I don't set spark.streaming.kafka.maxRatePerPartition - is the default behaviour to bring all messages from Kafka from the last offset to the current offset? Say the no. of messages was large and it took 5 sec to process them, so will all jobs
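
For reference, a hedged example of the rate cap being asked about (the limit is records per second per Kafka partition; the value is illustrative):

spark-submit --conf spark.streaming.kafka.maxRatePerPartition=1000 ...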

Re: Spark-Submit error

2015-08-03 Thread satish chandra j
Hi Guru, I am executing this on a DataStax Enterprise Spark node and the ~/.dserc file exists, which contains the Cassandra credentials, but I am still getting the error. Below is the given command: dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars ///home/missingmerch/postgresql-9.4-120

Re: Spark-Submit error

2015-08-03 Thread Guru Medasani
Thanks Satish. I only see the INFO messages and don’t see any error messages in the output you pasted. Can you paste the log with the error messages? Guru Medasani gdm...@gmail.com > On Aug 3, 2015, at 11:12 PM, satish chandra j > wrote: > > Hi Guru, > I am executing this on DataStax Ente

Repartition question

2015-08-03 Thread Naveen Madhire
Hi All, I am running the Wikipedia parsing example from the "Advanced Analytics with Spark" book: https://github.com/sryza/aas/blob/d3f62ef3ed43a59140f4ae8afbe2ef81fc643ef2/ch06-lsa/src/main/scala/com/cloudera/datascience/lsa/ParseWikipedia.scala#l112 The partitions of the RDD returned by

Re: Data from PostgreSQL to Spark

2015-08-03 Thread Jeetendra Gangele
Here is the solution; this looks perfect for me. Thanks for all your help. http://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/ On 28 July 2015 at 23:27, Jörn Franke wrote: > Can you put some transparent cache in front of the database? Or some jdbc > proxy? >

Re: Unable to query existing hive table from spark sql 1.3.0

2015-08-03 Thread Ishwardeep Singh
Which database is your table in - default or result? By default Spark will try to look for the table in the "default" database. If the table exists in the "result" database, try prefixing the table name with the database name, like "select * from result.salarytest", or set the database by executing "use <database>" -

Spark SQL support for Hive 0.14

2015-08-03 Thread Ishwardeep Singh
Hi, Does Spark SQL support Hive 0.14? The documentation refers to Hive 0.13. Is there a way to compile Spark with Hive 0.14? Currently we are using Spark 1.3.1. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-support-for-Hive-0-14-tp2412

Re: Schema evolution in tables

2015-08-03 Thread Brandon White
Sim did you find anything? :) On Sun, Jul 26, 2015 at 9:31 AM, sim wrote: > The schema merging section > of the Spark SQL documentation shows an example of schema evolution > in a partitioned table. > > Is this fun

Re: Running multiple batch jobs in parallel using Spark on Mesos

2015-08-03 Thread Akhil Das
One approach would be to use a Jobserver in between and create SparkContexts in it. Let's say you create two: one configured to run coarse-grained and another set to fine-grained. Let the high-priority jobs hit the coarse-grained SparkContext and the other jobs use the fine-grained one. Th
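
A sketch of that two-context idea: spark.mesos.coarse toggles the mode per context. In practice a job server hosts each context in its own JVM, since a single JVM normally supports only one active SparkContext (the master URL and app names are placeholders).

import org.apache.spark.{SparkConf, SparkContext}

// high-priority jobs: coarse-grained mode
val coarse = new SparkContext(new SparkConf()
  .setMaster("mesos://master:5050")
  .setAppName("high-priority")
  .set("spark.mesos.coarse", "true"))

// everything else: fine-grained mode
val fine = new SparkContext(new SparkConf()
  .setMaster("mesos://master:5050")
  .setAppName("batch")
  .set("spark.mesos.coarse", "false"))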