Re: [spark1.5.1] HiveQl.parse throws org.apache.spark.sql.AnalysisException: null

2015-10-22 Thread Xiao Li
Hi, Sebastian, To use private APIs, you have to be very familiar with the code path; otherwise, it is very easy to hit an exception or a bug. My suggestion is to use IntelliJ to step-by-step step in the function hiveContext.sql until you hit the parseSql API. Then, you will know if you have to ca

Re: How to close connection in mapPartitions?

2015-10-22 Thread Aniket Bhatnagar
Are you sure RedisClientPool is being initialized properly in the constructor of RedisCache? Can you please copy paste the code that you use to initialize RedisClientPool inside the constructor of RedisCache? Thanks, Aniket On Fri, Oct 23, 2015 at 11:47 AM Bin Wang wrote: > BTW, "lines" is a DS

Re: spark streaming failing to replicate blocks

2015-10-22 Thread Akhil Das
This is most likely a network issue; you need to check your network configuration from the AWS console and make sure the ports are accessible within the cluster. Thanks Best Regards On Thu, Oct 22, 2015 at 8:53 PM, Eugen Cepoi wrote: > Huh indeed this worked, thanks. Do you know why this happens, is that so

Re: How to close connection in mapPartitions?

2015-10-22 Thread Bin Wang
BTW, "lines" is a DStream. Bin Wang 于2015年10月23日周五 下午2:16写道: > I use mapPartitions to open connections to Redis, I write it like this: > > val seqs = lines.mapPartitions { lines => > val cache = new RedisCache(redisUrl, redisPort) > val result = lines.map(line => Parser.parseBody(

How to close connection in mapPartitions?

2015-10-22 Thread Bin Wang
I use mapPartitions to open connections to Redis, I write it like this: val seqs = lines.mapPartitions { lines => val cache = new RedisCache(redisUrl, redisPort) val result = lines.map(line => Parser.parseBody(line, cache)) cache.redisPool.close result } But it see
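
The close call above runs before the lazy iterator returned by lines.map is ever consumed, which is the usual cause of this kind of problem. A minimal sketch of one common fix, reusing the poster's RedisCache and Parser names (illustrative, not the poster's actual code), materializes the partition before closing:

    val seqs = lines.mapPartitions { iter =>
      val cache = new RedisCache(redisUrl, redisPort)          // one connection per partition
      // force evaluation while the connection is still open
      val parsed = iter.map(line => Parser.parseBody(line, cache)).toList
      cache.redisPool.close()                                  // safe: every record is already parsed
      parsed.iterator
    }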

Re: Save to parquet files failed

2015-10-22 Thread Raghavendra Pandey
Can you increase the number of partitions and try... Also, I don't think you need to cache the DFs before saving them... You can do away with that as well... Raghav On Oct 23, 2015 7:45 AM, "Ram VISWANADHA" wrote: > Hi , > I am trying to load 931MB file into an RDD, then create a DataFrame and > store the

Re: How to restart a failed Spark Streaming Application automatically in client mode on YARN

2015-10-22 Thread Saisai Shao
Looks like currently there's no way for Spark Streaming to restart automatically in yarn-client mode: in yarn-client mode the AM and the driver are two separate processes, and YARN only controls the restart of the AM, not the driver, so it is not supported. You can write some scripts to monitor you

How to restart a failed Spark Streaming Application automatically in client mode on YARN

2015-10-22 Thread y
I'm managing Spark Streaming applications which run on Cloud Dataproc (https://cloud.google.com/dataproc/). Spark Streaming applications running on a Cloud Dataproc cluster seem to run in client mode on YARN. Some of my applications sometimes stop due to the application failure. I'd like YARN to

Whether Spark will use disk when the memory is not enough on MEMORY_ONLY Storage Level

2015-10-22 Thread JoneZhang
1. Will Spark use the disk when memory is not enough at the MEMORY_ONLY storage level? 2. If not, how can I set the storage level when I use Hive on Spark? 3. Does Spark have any intention of dynamically choosing between Hive on MapReduce and Hive on Spark, based on SQL features? Thanks in advance Best regar
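
For reference, MEMORY_ONLY never spills; partitions that do not fit are simply dropped and recomputed from lineage when needed. Spilling requires asking for a different storage level explicitly, as in this sketch for a plain RDD (the path is illustrative; how to apply this under Hive on Spark is a separate question):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///data/big-table")      // illustrative path
    rdd.persist(StorageLevel.MEMORY_AND_DISK)            // keeps what fits in memory, spills the rest
    rdd.count()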

(SOLVED) Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-22 Thread t3l
I was able to solve this by myself. What I did was change the way Spark computes the partitioning for binary files.

Saving offset while reading from kafka

2015-10-22 Thread Ramkumar V
Hi, I had written a Spark Streaming application using a Kafka stream, and it writes to HDFS every hour (batch time). I would like to know how to get or commit the offset of the Kafka stream while writing to HDFS, so that if there is any issue or redeployment, I'll start from the point where I did a

Re: Spark 1.5 on CDH 5.4.0

2015-10-22 Thread Sandy Ryza
Hi Deenar, The version of Spark you have may not be compiled with YARN support. If you inspect the contents of the assembly jar, does org.apache.spark.deploy.yarn.ExecutorLauncher exist? If not, you'll need to find a version that does have the YARN classes. You can also build your own using the

Saving RDDs in Tachyon

2015-10-22 Thread mark
I have Avro records stored in Parquet files in HDFS. I want to read these out as an RDD and save that RDD in Tachyon for any spark job that wants the data. How do I save the RDD in Tachyon? What format do I use? Which RDD 'saveAs...' method do I want? Thanks
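
Two approaches that were commonly suggested at the time are sketched below; the Tachyon URL and paths are illustrative, and which one fits depends on how the other jobs will read the data:

    import org.apache.spark.storage.StorageLevel

    val records = sqlContext.read.parquet("hdfs:///data/avro-records")   // illustrative path

    // 1) keep the blocks in Tachyon through Spark's off-heap block store
    records.rdd.persist(StorageLevel.OFF_HEAP)

    // 2) or write the data out explicitly to a Tachyon path, e.g. as Parquet again
    records.write.parquet("tachyon://tachyon-master:19998/cache/avro-records")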

Save to parquet files failed

2015-10-22 Thread Ram VISWANADHA
Hi, I am trying to load a 931MB file into an RDD, then create a DataFrame and store the data in a Parquet file. The save method of the Parquet file is hanging. I have set the timeout to 1800 but the system still fails to respond and hangs. I can’t spot any errors in my code. Can someone help me? Than

Best way to use Spark UDFs via Hive (Spark Thrift Server)

2015-10-22 Thread Dave Moyers
Hi, We have several udf's written in Scala that we use within jobs submitted into Spark. They work perfectly with the sqlContext after being registered. We also allow access to saved tables via the Hive Thrift server bundled with Spark. However, we would like to allow Hive connections to use th

Re: Spark issue running jar on Linux vs Windows

2015-10-22 Thread Ted Yu
RemoteActorRefProvider is in akka-remote_2.10-2.3.11.jar jar tvf ~/.m2/repository/com/typesafe/akka/akka-remote_2.10/2.3.11/akka-remote_2.10-2.3.11.jar | grep RemoteActorRefProvi 1761 Fri May 08 16:13:02 PDT 2015 akka/remote/RemoteActorRefProvider$$anonfun$5.class 1416 Fri May 08 16:13:02 PDT

Spark issue running jar on Linux vs Windows

2015-10-22 Thread Michael Lewis
Hi, I have a Spark driver process that I have built into a single ‘fat jar’. This runs fine in Cygwin on my development machine; I can run: scala -cp my-fat-jar-1.0.0.jar com.foo.MyMainClass This works fine, it will submit Spark jobs, they process, all good. However, on Linux (all Jars Spar

Re: Maven Repository Hosting for Spark SQL 1.5.1

2015-10-22 Thread William Li
Thanks Sean. I did that but it didn't seem to help. However, I manually downloaded both the pom and jar files from the site, and then ran mvn dependency:purge-local-repository to clean up the local repo (+ re-download them all). All are good and the error went away. Thanks a lot for your

Getting ClassNotFoundException: scala.Some on Spark 1.5.x

2015-10-22 Thread Babar Tareen
Hi, I am getting the following exception when submitting a job to Spark 1.5.x from Scala. The same code works with Spark 1.4.1. Any clues as to what might be causing the exception? Code: App.scala import org.apache.spark.SparkContext object App { def main(args: Array[String]) = { val l = List(1

Re: Spark 1.5.1+Hadoop2.6 .. unable to write to S3 (HADOOP-12420)

2015-10-22 Thread Ashish Shrowty
Thanks Steve. I built it from source. On Thu, Oct 22, 2015 at 4:01 PM Steve Loughran wrote: > > > On 22 Oct 2015, at 15:12, Ashish Shrowty > wrote: > > > > I understand that there is some incompatibility with the API between > Hadoop > > 2.6/2.7 and Amazon AWS SDK where they changed a signatur

Spark YARN Shuffle service wire compatibility

2015-10-22 Thread Jong Wook Kim
Hi, I’d like to know if there is a guarantee that the Spark YARN shuffle service has wire compatibility between 1.x versions. I could run a Spark 1.5 job with YARN nodemanagers having shuffle service 1.4, but it might’ve been just a coincidence. Now we’re upgrading CDH from 5.3 to 5.4, whose NodeManage

Re: [Spark Streaming] Design Patterns forEachRDD

2015-10-22 Thread Nipun Arora
Hi Sandip, Thanks for your response.. I am not sure if this is the same thing. I am looking for a way to connect to external network as shown in the example. @All - Can anyone else let me know if they have a better solution? Thanks Nipun On Wed, Oct 21, 2015 at 2:07 PM, Sandip Mehta wrote: >

[SPARK STREAMING] polling based operation instead of event based operation

2015-10-22 Thread Nipun Arora
Hi, In general, in Spark Streaming one can do transformations (filter, map, etc.) or output operations (collect, forEach, etc.) in an event-driven paradigm... i.e. the action happens only if a message is received. Is it possible to do actions every few seconds in a polling-based fashion, regardless if a
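
One hedged sketch of how this is sometimes approximated: foreachRDD fires once per batch interval even when the batch is empty, so driver-side work can be hung off it (dstream and doPeriodicWork are illustrative names, not from the original message):

    dstream.foreachRDD { rdd =>
      doPeriodicWork()              // illustrative: runs on the driver every batch interval
      if (!rdd.isEmpty()) {
        // handle the messages that did arrive, as usual
      }
    }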

Re: Maven Repository Hosting for Spark SQL 1.5.1

2015-10-22 Thread Sean Owen
Maven, in general, does some local caching to avoid hitting the repo every time. It's possible this is why you're not seeing 1.5.1. On the command line you can for example add "mvn -U ..." Not sure of the equivalent in IntelliJ, but it will be updating the same repo IJ sees. Try that. The repo defin

Re: Save RandomForest Model from ML package

2015-10-22 Thread Sujit Pal
Hi Sebastian, You can save models to disk and load them back up. In the snippet below (copied out of a working Databricks notebook), I train a model, then save it to disk, then retrieve it back into model2 from disk. import org.apache.spark.mllib.tree.RandomForest > import org.apache.spark.mllib.
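
A minimal sketch of the MLlib-side save and load being described; the paths and training parameters are illustrative:

    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.tree.model.RandomForestModel
    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    val model = RandomForest.trainClassifier(data, numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](), numTrees = 10,
      featureSubsetStrategy = "auto", impurity = "gini", maxDepth = 4, maxBins = 32)

    model.save(sc, "hdfs:///models/rf")                          // illustrative path
    val model2 = RandomForestModel.load(sc, "hdfs:///models/rf") // same model, read back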

Re: Spark 1.5.1+Hadoop2.6 .. unable to write to S3 (HADOOP-12420)

2015-10-22 Thread Steve Loughran
> On 22 Oct 2015, at 15:12, Ashish Shrowty wrote: > > I understand that there is some incompatibility with the API between Hadoop > 2.6/2.7 and Amazon AWS SDK where they changed a signature of > com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold. > The J

Re: Spark_1.5.1_on_HortonWorks

2015-10-22 Thread Steve Loughran
On 22 Oct 2015, at 02:47, Ajay Chander wrote: Thanks for your time. I have followed your inputs and downloaded "spark-1.5.1-bin-hadoop2.6" on one of the nodes, say node1. And when I did a pi test everything seems to be working fine, except that the spark-history -se

Re: Incremental load of RDD from HDFS?

2015-10-22 Thread Chris Spagnoli
Found the problem. There was an rdd.distinct hiding in my code that I had overlooked, and that caused this behavior (because instead of iterating over the raw RDD, I was instead iterating over the RDD which had been derived from it). Thank you everyone! - Chris

Fwd: sqlContext load by offset

2015-10-22 Thread Kayode Odeyemi
Hi, I've been trying to load a postgres table using the following expression: val cachedIndex = cache.get("latest_legacy_group_index") val mappingsDF = sqlContext.load("jdbc", Map( "url" -> Config.dataSourceUrl(mode, Some("mappings")), "dbtable" -> s"(select userid, yid, username from legacyusers
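
The message is cut off above; if the goal is to read the table in ranges rather than as one big query, the JDBC source's partitioning options may be what is wanted. A sketch with illustrative values (not the poster's actual configuration):

    val mappingsDF = sqlContext.read.format("jdbc").options(Map(
      "url"     -> "jdbc:postgresql://db-host:5432/legacy",                      // illustrative
      "dbtable" -> "(select userid, yid, username from legacyusers) as mappings",
      "driver"  -> "org.postgresql.Driver",
      // split the read into ranges of an indexed numeric column
      "partitionColumn" -> "userid",
      "lowerBound"      -> "0",
      "upperBound"      -> "1000000",
      "numPartitions"   -> "10"
    )).load()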

spark.deploy.zookeeper.url

2015-10-22 Thread Michal Čizmazia
Does the spark.deploy.zookeeper.url configuration work correctly when I point it to a single virtual IP address with more hosts behind it (load balancer or round robin)? https://spark.apache.org/docs/latest/spark-standalone.html#high-availability ZooKeeper FAQ also discusses this topic: https://

Re: Maven Repository Hosting for Spark SQL 1.5.1

2015-10-22 Thread William Li
Thanks Deenar for your response. I am able to get version 1.5.0 and other lower versions; they are all fine, but just not 1.5.1. It's hard to believe it's the proxy settings. What is interesting is that IntelliJ does a few things when downloading this jar: putting into .m2 repository

Re: Running 2 spark application in parallel

2015-10-22 Thread Simon Elliston Ball
If YARN has capacity to run both simultaneously, it will. You should ensure you are not allocating too many executors for the first app (and leave some space for the second). You may want to run the applications on different YARN queues to control resource allocation. If you run as a different user

Re: [Spark Streaming] How do we reset the updateStateByKey values.

2015-10-22 Thread Sander van Dijk
I don't think it is possible in the way you try to do it. It is important to remember that the statements you mention only set up the stream stages, before the stream is actually running. Once it's running, you cannot change, remove or add stages. I am not sure how you determine your condition and

Re: [Spark Streaming] How do we reset the updateStateByKey values.

2015-10-22 Thread Uthayan Suthakar
I need to take the value from an RDD and update the state of the other RDD. Is this possible? On 22 October 2015 at 16:06, Uthayan Suthakar wrote: > Hello guys, > > I have a stream job that will carry out computations and update the state > (SUM the value). At some point, I would like to reset the

Running 2 spark application in parallel

2015-10-22 Thread Suman Somasundar
Hi all, Is there a way to run 2 spark applications in parallel under Yarn in the same cluster? Currently, if I submit 2 applications, one of them waits till the other one is completed. I want both of them to start and run at the same time. Thanks, Suman.

Re: Maven Repository Hosting for Spark SQL 1.5.1

2015-10-22 Thread Deenar Toraskar
I can see this artifact in public repos http://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10/1.5.1 http://central.maven.org/maven2/org/apache/spark/spark-sql_2.10/1.5.1/spark-sql_2.10-1.5.1.jar check your proxy settings or the list of repos you are using. Deenar On 22 October 2015

Re: Large number of conf broadcasts

2015-10-22 Thread Anders Arpteg
Yes, seems unnecessary. I actually tried patching the com.databricks.spark.avro reader to only broadcast once per dataset, instead of every single file/partition. It seems to work just as well, and there are significantly fewer broadcasts and I am not seeing out-of-memory issues any more. Strange that mo

Re: Request for submitting Spark jobs in code purely, without jar

2015-10-22 Thread Ali Tajeldin EDU
The Spark job-server project may help (https://github.com/spark-jobserver/spark-jobserver). -- Ali On Oct 21, 2015, at 11:43 PM, ?? wrote: > Hi developers, I've encountered some problem with Spark, and before opening > an issue, I'd like to hear your thoughts. > > Currently, if you want t

"java.io.IOException: Connection reset by peer" thrown on the resource manager when launching Spark on Yarn

2015-10-22 Thread PashMic
Hi all, I am trying to launch a Spark job using yarn-client mode on a cluster. I have already tried spark-shell with YARN and I can launch the application. But I also would like to be able to run the driver program from, say, Eclipse, while using the cluster to run the tasks. I have also added spark-

Re: Large number of conf broadcasts

2015-10-22 Thread Koert Kuipers
I am seeing the same thing. It's gone completely crazy creating broadcasts for the last 15 mins or so. Killing it... On Thu, Sep 24, 2015 at 1:24 PM, Anders Arpteg wrote: > Hi, > > Running spark 1.5.0 in yarn-client mode, and am curios in why there are so > many broadcast being done when loading

Fwd: multiple pyspark instances simultaneously (same time)

2015-10-22 Thread Jeff Sadowski
On Thu, Oct 22, 2015 at 5:40 AM, Akhil Das wrote: > Did you read > https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application > I did. I had set the option spark.scheduler.mode FAIR in conf/spark-defaults.conf and created fairscheduler.xml with the two pools pro

Maven Repository Hosting for Spark SQL 1.5.1

2015-10-22 Thread William Li
Hi - I tried to download Spark SQL 2.10, version 1.5.1, from IntelliJ using the Maven library: -Project Structure -Global Library, click on the + to select Maven Repository -Type in org.apache.spark to see the list. -The list result only shows versions up to spark-sql_2.10-1.1.1 -I tried to

Python worker exited unexpectedly (crashed)

2015-10-22 Thread shahid
Hi, I am running a 10-node standalone cluster on AWS and loading 100G of data onto HDFS, doing a first groupBy operation and then generating pairs from the grouped RDD (key,[a1,b1],key,[a,b,c]), generating the pairs like (a1,b1),(a,b),(a,c) ... n The PairRDD will get large in size. Some stats from the UI when st

RE: Spark groupby and agg inconsistent and missing data

2015-10-22 Thread Saif.A.Ellafi
Never mind my last email. res2 is filtered, so my test does not make sense. The issue is not reproduced there. I have the problem somewhere else. From: Ellafi, Saif A. Sent: Thursday, October 22, 2015 12:57 PM To: 'Xiao Li' Cc: user Subject: RE: Spark groupby and agg inconsistent and missing data T

Re: Error in starting Spark Streaming Context

2015-10-22 Thread Tiago Albineli Motta
Solved! The problem has nothing to do with the class and object refactoring. But in the process of this refactoring I made a change that is similar to your code. Before this refactoring, I processed the DStream inside the function that I sent to StreamingContext.getOrCreate. After, I started processing th

Spark 1.5 on CDH 5.4.0

2015-10-22 Thread Deenar Toraskar
Hi I have got the prebuilt version of Spark 1.5 for Hadoop 2.6 ( http://www.apache.org/dyn/closer.lua/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz) working with CDH 5.4.0 in local mode on a cluster with Kerberos. It works well including connecting to the Hive metastore. I am facing an issue runn

Spark SQL: Issues with using DirectParquetOutputCommitter with APPEND mode and OVERWRITE mode

2015-10-22 Thread Jerry Lam
Hi Spark users and developers, I read the ticket [SPARK-8578] (Should ignore user defined output committer when appending data) which ignore DirectParquetOutputCommitter if append mode is selected. The logic was that it is unsafe to use because it is not possible to revert a failed job in append m

RE: Spark groupby and agg inconsistent and missing data

2015-10-22 Thread Saif.A.Ellafi
Thanks, sorry I cannot share the data, and I am not sure how significant it would be for you. I am reproducing the issue on a smaller piece of the content to see whether I find a reason for the inconsistency. val res2 = data.filter($"closed" === $"ever_closed").groupBy("product", "band ", "aget",

Re: [jira] Ankit shared "SPARK-11213: Documentation for remote spark Submit for R Scripts from 1.5 on CDH 5.4" with you

2015-10-22 Thread Anubhav Agarwal
Hi Ankit, Here is my solution for this:- 1) Download the latest Spark 1.5.1 (I just copied the following link from spark.apache.org; if it doesn't work then grab a new one from the website.) wget http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.6.tgz 2) Unzip the folder and rename/move t

Re: Spark groupby and agg inconsistent and missing data

2015-10-22 Thread Xiao Li
Hi, Saif, Could you post your code here? It might help others reproduce the errors and give you a correct answer. Thanks, Xiao Li 2015-10-22 8:27 GMT-07:00 : > Hello everyone, > > I am doing some analytics experiments under a 4 server stand-alone cluster > in a spark shell, mostly involving a

[jira] Ankit shared "SPARK-11213: Documentation for remote spark Submit for R Scripts from 1.5 on CDH 5.4" with you

2015-10-22 Thread Ankit (JIRA)
Ankit shared an issue with you --- > Documentation for remote spark Submit for R Scripts from 1.5 on CDH 5.4 > --- > > Key: SPARK-11213 > URL: https://issue

Spark groupby and agg inconsistent and missing data

2015-10-22 Thread Saif.A.Ellafi
Hello everyone, I am doing some analytics experiments under a 4 server stand-alone cluster in a spark shell, mostly involving a huge database with groupBy and aggregations. I am picking 6 groupBy columns and returning various aggregated results in a dataframe. GroupBy fields are of two types, m

Re: spark streaming failing to replicate blocks

2015-10-22 Thread Eugen Cepoi
Huh indeed this worked, thanks. Do you know why this happens, is that some known issue? Thanks, Eugen 2015-10-22 19:08 GMT+07:00 Akhil Das : > Can you try fixing spark.blockManager.port to specific port and see if the > issue exists? > > Thanks > Best Regards > > On Mon, Oct 19, 2015 at 6:21 PM,

Re: Analyzing consecutive elements

2015-10-22 Thread Sampo Niskanen
Hi, Excellent, the sliding method seems to be just what I'm looking for. Hope it becomes part of the stable API, I'd assume there to be lots of uses with time-related data. Dylan's suggestion seems reasonable as well, if DeveloperApi is not an option. Thanks! Best regards, *Sampo Niskane

Re: Spark 1.5.1+Hadoop2.6 .. unable to write to S3 (HADOOP-12420)

2015-10-22 Thread Igor Berman
Many use it. How do you add the AWS SDK to the classpath? Check in the Environment UI what is in the cp. You should make sure that the version in your cp is compatible with the one that Spark was compiled with. I think 1.7.4 is compatible (at least we use it). Make sure that you don't get other versions from other transitiv

[Spark Streaming] How do we reset the updateStateByKey values.

2015-10-22 Thread Uthayan Suthakar
Hello guys, I have a stream job that will carry out computations and update the state (SUM the value). At some point, I would like to reset the state. I could drop the state by setting 'None' but I don't want to drop it. I would like to keep the state but update the state. For example: JavaPairD
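
A minimal Scala sketch of one way this is sometimes done (the thread's code is Java): send a sentinel value through the stream for the keys that should start again from zero, and handle it in the update function. The -1 marker and the pairs stream are illustrative assumptions, not from the original message:

    def updateFunc(newValues: Seq[Long], state: Option[Long]): Option[Long] = {
      if (newValues.contains(-1L)) Some(0L)              // reset marker seen: keep the key, zero the sum
      else Some(state.getOrElse(0L) + newValues.sum)     // normal case: keep accumulating
    }

    val sums = pairs.updateStateByKey(updateFunc _)      // pairs: DStream[(String, Long)]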

Re: Storing Compressed data in HDFS into Spark

2015-10-22 Thread Adnan Haider
I believe spark.rdd.compress requires the data to be serialized. In my case, I have data already compressed which becomes decompressed as I try to cache it. I believe even when I set spark.rdd.compress to true, Spark will still decompress the data and then serialize it and then compress the seria

Re: Accessing external Kerberised resources from Spark executors in Yarn client/cluster mode

2015-10-22 Thread Doug Balog
Another thing to check is to make sure each one of you executor nodes has the JCE jars installed. try{ javax.crypto.Cipher.getMaxAllowedKeyLength("AES") > 128 } catch { case e:java.security.NoSuchAlgorithmException => false } Setting "-Dsun.security.krb5.debug=true” and “-Dsun.security.jgss.d

Re: Sporadic error after moving from kafka receiver to kafka direct stream

2015-10-22 Thread Cody Koeninger
That sounds like a networking issue to me. Stuff to try - make sure every executor node can talk to every kafka broker on relevant ports - look at firewalls / network config. Even if you can make the initial connection, something may be happening after a while (we've seen ... "interesting"... is

Spark 1.5.1+Hadoop2.6 .. unable to write to S3 (HADOOP-12420)

2015-10-22 Thread Ashish Shrowty
I understand that there is some incompatibility with the API between Hadoop 2.6/2.7 and Amazon AWS SDK where they changed a signature of com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold. The JIRA indicates that this would be fixed in Hadoop 2.8. (https://

Re: driver ClassNotFoundException when MySQL JDBC exceptions are thrown on executor

2015-10-22 Thread Xiao Li
A few months ago, I used the DB2 jdbc drivers. I hit a couple of issues when using --driver-class-path. At the end, I used the following command to bypass most of issues: ./bin/spark-submit --jars /Users/smile/db2driver/db2jcc.jar,/Users/smile/db2driver/db2jcc_license_cisuz.jar --master local[*] -

Sporadic error after moving from kafka receiver to kafka direct stream

2015-10-22 Thread Conor Fennell
Hi, Firstly want to say a big thanks to Cody for contributing the kafka direct stream. I have been using the receiver based approach for months but the direct stream is a much better solution for my use case. The job in question is now ported over to the direct stream doing idempotent outputs to

Re: How to distinguish columns when joining DataFrames with shared parent?

2015-10-22 Thread Xiao Li
Actually, I found a design issue in self joins. When we have multiple-layer projections above alias, the information of alias relation between alias and actual columns are lost. Thus, when resolving the alias in self joins, the rules treat the alias (e.g., in Projection) as normal columns. This onl

Re: Analyzing consecutive elements

2015-10-22 Thread Adrian Tanase
Drop is a method on scala’s collections (array, list, etc) - not on Spark’s RDDs. You can look at it as long as you use mapPartitions or something like reduceByKey, but it totally depends on the use-cases you have for analytics. The others have suggested better solutions using only spark’s APIs.

Re: Error in starting Spark Streaming Context

2015-10-22 Thread Tiago Albineli Motta
Can't say what is happening, and I have a similar problem here. While for you the source is: org.apache.spark.streaming.dstream.WindowedDStream@532d0784 has not been initialized For me it is: org.apache.spark.SparkException: org.apache.spark.streaming.dstream.MapPartitionedDStream@7a2d07cc has n

RE: Analyzing consecutive elements

2015-10-22 Thread Andrianasolo Fanilo
Hi Sampo, There is a sliding method you could try inside the org.apache.spark.mllib.rdd.RDDFunctions package, though it’s DeveloperApi stuff (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.rdd.RDDFunctions) import org.apache.spark.{SparkConf, SparkContext} i
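
The example above is cut off; here is a self-contained sketch of the sliding approach on the thread's sample data (using the developer API noted in the message):

    import org.apache.spark.mllib.rdd.RDDFunctions._

    val arr = Array((1, "A"), (8, "D"), (7, "C"), (3, "B"), (9, "E"))
    val sorted = sc.parallelize(arr).sortByKey()
    // sliding(2) yields every window of two consecutive elements after the sort
    val pairs = sorted.sliding(2).map { case Array(prev, next) => (prev, next) }
    pairs.collect()
    // Array(((1,A),(3,B)), ((3,B),(7,C)), ((7,C),(8,D)), ((8,D),(9,E)))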

Save RandomForest Model from ML package

2015-10-22 Thread Sebastian Kuepers
Hey, I am trying to figure out the best practice for saving and loading models which have been fitted with the ML package - i.e. with the RandomForest classifier. There is PMML support in the MLlib package afaik, but not in ML - is that correct? How do you approach this, so that you do not have to fit yo

Re: Can we split partition

2015-10-22 Thread Akhil Das
Did you try coalesce? It doesn't shuffle the data around. Thanks Best Regards On Wed, Oct 21, 2015 at 10:27 AM, shahid wrote: > Hi > > I have a large partition(data skewed) i need to split it to no. of > partitions, repartitioning causes lot of shuffle. Can we do that..? > > > > -- > View this

Re: Storing Compressed data in HDFS into Spark

2015-10-22 Thread Igor Berman
check spark.rdd.compress On 19 October 2015 at 21:13, ahaider3 wrote: > Hi, > A lot of the data I have in HDFS is compressed. I noticed when I load this > data into spark and cache it, Spark unrolls the data like normal but stores > the data uncompressed in memory. For example, suppose /data/ is
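
A sketch of how that setting is typically used; note it only takes effect for serialized caches, so it is paired here with MEMORY_ONLY_SER (the app name and path are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf().setAppName("compressed-cache").set("spark.rdd.compress", "true")
    val sc = new SparkContext(conf)

    val data = sc.textFile("hdfs:///data/compressed-input")   // illustrative path
    data.persist(StorageLevel.MEMORY_ONLY_SER)                // stored serialized, then compressed
    data.count()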

Re: java TwitterUtils.createStream() how create "user stream" ???

2015-10-22 Thread Akhil Das
I don't think the one that comes with spark would listen to specific user feeds, but yes you can filter out the public tweets by passing the filters argument. Here's an example for you to start https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/T

Re: Storing Compressed data in HDFS into Spark

2015-10-22 Thread Akhil Das
Convert your data to parquet, it saves space and time. Thanks Best Regards On Mon, Oct 19, 2015 at 11:43 PM, ahaider3 wrote: > Hi, > A lot of the data I have in HDFS is compressed. I noticed when I load this > data into spark and cache it, Spark unrolls the data like normal but stores > the dat

Re: spark streaming failing to replicate blocks

2015-10-22 Thread Akhil Das
Can you try fixing spark.blockManager.port to specific port and see if the issue exists? Thanks Best Regards On Mon, Oct 19, 2015 at 6:21 PM, Eugen Cepoi wrote: > Hi, > > I am running spark streaming 1.4.1 on EMR (AMI 3.9) over YARN. > The job is reading data from Kinesis and the batch size is
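
For reference, a hedged sketch of what pinning the port looks like; the port number is illustrative and just has to be a free port that the security groups allow between the nodes:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.blockManager.port", "31111")   // illustrative fixed port for block transfers/replication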

Re: Output println info in LogMessage Info ?

2015-10-22 Thread Akhil Das
Yes, using log4j you can log everything. Here's a thread with example http://stackoverflow.com/questions/28454080/how-to-log-using-log4j-to-local-file-system-inside-a-spark-application-that-runs Thanks Best Regards On Sun, Oct 18, 2015 at 12:10 AM, kali.tumm...@gmail.com < kali.tumm...@gmail.com>
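
A minimal sketch of that pattern (the object, method, and messages are illustrative); where the output lands is controlled by log4j.properties:

    import org.apache.log4j.Logger
    import org.apache.spark.SparkContext

    object MyJob {
      @transient lazy val log = Logger.getLogger(getClass.getName)

      def run(sc: SparkContext): Unit = {
        log.info("starting job")                 // goes to the configured log, not stdout
        val n = sc.parallelize(1 to 100).count()
        log.info(s"counted $n records")
      }
    }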

Accessing external Kerberised resources from Spark executors in Yarn client/cluster mode

2015-10-22 Thread Deenar Toraskar
Hi All I am trying to access a SQLServer that uses Kerberos for authentication from Spark. I can successfully connect to the SQLServer from the driver node, but any connections to SQLServer from executors fails with "Failed to find any Kerberos tgt". org.apache.hadoop.security.UserGroupInformatio

Re: driver ClassNotFoundException when MySQL JDBC exceptions are thrown on executor

2015-10-22 Thread Akhil Das
Did you try passing the mysql connector jar through --driver-class-path Thanks Best Regards On Sat, Oct 17, 2015 at 6:33 AM, Hurshal Patel wrote: > Hi all, > > I've been struggling with a particularly puzzling issue after upgrading to > Spark 1.5.1 from Spark 1.4.1. > > When I use the MySQL JDB

Re: Analyzing consecutive elements

2015-10-22 Thread Dylan Hogg
Hi Sampo, You could try zipWithIndex followed by a self join with shifted index values like this: val arr = Array((1, "A"), (8, "D"), (7, "C"), (3, "B"), (9, "E")) val rdd = sc.parallelize(arr) val sorted = rdd.sortByKey(true) val zipped = sorted.zipWithIndex.map(x => (x._2, x._1)) val pairs = z
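
The code above is cut off; a sketch of how the remaining join step can look, continuing from the zipped RDD in the snippet (this is not Dylan's original text):

    // zipped: (index, element) pairs in sorted order
    val shifted = zipped.map { case (i, kv) => (i - 1, kv) }   // element i re-keyed as i-1
    val consecutive = zipped.join(shifted).sortByKey().values  // (elem_i, elem_{i+1})
    consecutive.collect()
    // Array(((1,A),(3,B)), ((3,B),(7,C)), ((7,C),(8,D)), ((8,D),(9,E)))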

Re: multiple pyspark instances simultaneously (same time)

2015-10-22 Thread Akhil Das
Did you read https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application Thanks Best Regards On Thu, Oct 15, 2015 at 11:31 PM, jeff.sadow...@gmail.com < jeff.sadow...@gmail.com> wrote: > I am having issues trying to setup spark to run jobs simultaneously. > > I thou

Re: Can I convert RDD[My_OWN_JAVA_CLASS] to DataFrame in Spark 1.3.x?

2015-10-22 Thread Akhil Das
Have a look at http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection if you haven't seen that already. Thanks Best Regards On Thu, Oct 15, 2015 at 10:56 PM, java8964 wrote: > Hi, Sparkers: > > I wonder if I can convert a RDD of my own Java class in
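
A minimal sketch of that reflection-based approach with an illustrative case class; for a Java bean the equivalent is sqlContext.createDataFrame(rdd, MyClass.class):

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)     // illustrative class

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
    val df = people.toDF()                         // schema inferred from the case class fields
    df.printSchema()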

Re: Get the previous state string

2015-10-22 Thread Akhil Das
That way, you will eventually end up bloating up that list. Instead, you could push the stream to a noSQL database (like hbase or cassandra etc) and then read it back and join it with your current stream if that's what you are looking for. Thanks Best Regards On Thu, Oct 15, 2015 at 6:11 PM, Yoge

Re: Analyzing consecutive elements

2015-10-22 Thread Sampo Niskanen
Hi, Sorry, I'm not very familiar with those methods and cannot find the 'drop' method anywhere. As an example: val arr = Array((1, "A"), (8, "D"), (7, "C"), (3, "B"), (9, "E")) val rdd = sc.parallelize(arr) val sorted = rdd.sortByKey(true) // ... then what? Thanks. Best regards, *Sampo

Re: Spark StreamingStatefull information

2015-10-22 Thread Adrian Tanase
The result of updateStateByKey is a DStream that emits the entire state every batch - as an RDD - nothing special about it. It is easy to join / cogroup with another RDD if you have the correct keys in both. You could load this one when the job starts and/or have it update with updateStateByKey as
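
A hedged sketch of that join, with an illustrative socket source and lookup data standing in for the Cassandra/broadcast lookups from the question:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints/state-join")              // required by updateStateByKey

    // static lookup keyed the same way as the streaming state
    val geofenceInfo: RDD[(String, String)] =
      sc.parallelize(Seq(("client-1", "zone-A"), ("client-2", "zone-B")))

    val events = ssc.socketTextStream("host", 9999).map(id => (id, 1L))   // illustrative source
    val state = events.updateStateByKey[Long]((vs: Seq[Long], s: Option[Long]) => Some(s.getOrElse(0L) + vs.sum))

    // the state DStream emits the full state as an RDD every batch, so a plain join works
    val joined = state.transform(stateRdd => stateRdd.join(geofenceInfo))
    joined.print()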

Spark StreamingStatefull information

2015-10-22 Thread Arttii
Hi, So I am working on a use case where clients are walking in and out of geofences and sending messages based on that. I currently have some in-memory broadcast vars to do certain lookups for client and geofence info; some of this is also coming from Cassandra. My current quandary is that I need to

Re: Analyzing consecutive elements

2015-10-22 Thread Adrian Tanase
I'm not sure if there is a better way to do it directly using Spark APIs but I would try to use mapPartitions and then within each partition Iterable to: rdd.zip(rdd.drop(1)) - using the Scala collection APIs This should give you what you need inside a partition. I'm hoping that you can partiti
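
A sketch of that per-partition idea using the Scala iterator API, on the thread's sample data (note the caveat: pairs straddling a partition boundary are not produced, which is why the partitioning matters):

    val arr = Array((1, "A"), (8, "D"), (7, "C"), (3, "B"), (9, "E"))
    val sorted = sc.parallelize(arr).sortByKey()

    val consecutive = sorted.mapPartitions { iter =>
      // pair up neighbours inside the partition only
      iter.sliding(2).collect { case Seq(a, b) => (a, b) }
    }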

RE: Spark_1.5.1_on_HortonWorks

2015-10-22 Thread Sun, Rui
Frans, SparkR runs with R 3.1+. If possible, the latest version of R is recommended. From: Saisai Shao [mailto:sai.sai.s...@gmail.com] Sent: Thursday, October 22, 2015 11:17 AM To: Frans Thamura Cc: Ajay Chander; Doug Balog; user spark mailing list Subject: Re: Spark_1.5.1_on_HortonWorks SparkR is s