Abnormally large deserialisation time for some tasks

2016-02-16 Thread Abhishek Modi
I'm doing a mapPartitions on an RDD cached in memory followed by a reduce. Here is my code snippet // myRdd is an rdd consisting of Tuple2[Int,Long] myRdd.mapPartitions(rangify).reduce( (x,y) => (x._1+y._1,x._2 ++ y._2)) //The rangify function def rangify(l: Iterator[ Tuple2[Int,Long] ]) : Iterato

Stored proc with spark

2016-02-16 Thread Gaurav Agarwal
Hi, can I load the data into Spark from an Oracle stored proc? Thanks

Unusually large deserialisation time

2016-02-16 Thread Abhishek Modi
I'm doing a mapPartitions on an RDD cached in memory followed by a reduce. Here is my code snippet // myRdd is an rdd consisting of Tuple2[Int,Long] myRdd.mapPartitions(rangify).reduce( (x,y) => (x._1+y._1,x._2 ++ y._2)) //The rangify function def rangify(l: Iterator[ Tuple2[Int,Long] ]) : Iterato
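
The rangify body is truncated above, so the following is only a hypothetical Scala sketch of the shape being described: a mapPartitions whose per-partition function returns one (count, values) pair, followed by the reduce from the snippet. The rangify used here is a stand-in, not the original implementation.

  import org.apache.spark.rdd.RDD

  // Stand-in for the truncated rangify: per partition, emit one tuple with the
  // number of records and the list of their Long values.
  def rangify(l: Iterator[(Int, Long)]): Iterator[(Int, List[Long])] = {
    val records = l.toList
    Iterator((records.size, records.map(_._2)))
  }

  // myRdd: RDD[(Int, Long)], cached in memory as described above.
  def runJob(myRdd: RDD[(Int, Long)]): (Int, List[Long]) =
    myRdd.mapPartitions(rangify).reduce((x, y) => (x._1 + y._1, x._2 ++ y._2))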

Re: Stored proc with spark

2016-02-16 Thread Gourav Sengupta
Hi Gaurav, do you mean stored proc that returns a table? Regards, Gourav On Tue, Feb 16, 2016 at 9:04 AM, Gaurav Agarwal wrote: > Hi > Can I load the data into spark from oracle storedproc > > Thanks >

Re: Stored proc with spark

2016-02-16 Thread Alonso Isidoro Roman
relational databases? what about sqoop? https://en.wikipedia.org/wiki/Sqoop Alonso Isidoro Roman. My favorite quotes (of today): "If debugging is the process of removing software bugs, then programming must be the process of putting them in..." - Edsger Dijkstra My favorite quotes (to

New line lost in streaming output file

2016-02-16 Thread Ashutosh Kumar
I am getting multiple empty files for streaming output for each interval. To avoid this I tried kStream.foreachRDD(new VoidFunction2<JavaRDD<String>,Time>(){ public void call(JavaRDD<String> rdd, Time time) throws Exception { if(!rdd.isEmpty()){ rdd.saveAsTextFile("filename_"+time.milliseconds()+".c

Re: Stored proc with spark

2016-02-16 Thread Jörn Franke
There are many facets to this topic, you could use Sqoop or the spark jdbc driver or oracle Hadoop loader or external tables in oracle that use coprocessors to stream directly to compressed csv files that are imported by spark. It all depends on volumes, non-functional and functional requirements

Re: New line lost in streaming output file

2016-02-16 Thread Chandeep Singh
!rdd.isEmpty() should work but an alternative could be rdd.take(1) != 0 > On Feb 16, 2016, at 9:33 AM, Ashutosh Kumar wrote: > > I am getting multiple empty files for streaming output for each interval. > To Avoid this I tried > > kStream.foreachRDD(new VoidFunction2,Time>(){ >
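
A minimal Scala sketch of the empty-batch guard being discussed, assuming a DStream[String] named kStream; note that with the take(1) variant it is the emptiness of the returned array that has to be tested, not the array itself against 0.

  import org.apache.spark.streaming.Time
  import org.apache.spark.streaming.dstream.DStream

  def saveNonEmptyBatches(kStream: DStream[String]): Unit =
    kStream.foreachRDD { (rdd, time: Time) =>
      // Skip the write entirely for empty micro-batches.
      if (!rdd.isEmpty()) {            // or: if (rdd.take(1).nonEmpty)
        rdd.saveAsTextFile("filename_" + time.milliseconds())
      }
    }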

reading spark dataframe in python

2016-02-16 Thread Devesh Raj Singh
Hi, I want to read a spark dataframe using python and then convert the spark dataframe to a pandas dataframe, then convert the pandas dataframe back to a spark dataframe (after doing some data analysis). Please suggest. -- Warm regards, Devesh.

Submit custom python packages from current project

2016-02-16 Thread Mohannad Ali
Hello Everyone, I have code inside my project organized in packages and modules, however I keep getting the error "ImportError: No module named " when I run spark on YARN. My directory structure is something like this: project/ package/ module.py __init__.py bin/

Re: reading spark dataframe in python

2016-02-16 Thread Mohannad Ali
I think you need to consider using something like this: http://sparklingpandas.com/ On Tue, Feb 16, 2016 at 10:59 AM, Devesh Raj Singh wrote: > Hi, > > I want to read a spark dataframe using python and then convert the spark > dataframe to pandas dataframe then convert the pandas dataframe back

Re: Check if column exists in Schema

2016-02-16 Thread Sebastian Piu
I'm trying to do it before the column is bound to a dataframe. I guess I'm looking for something like col("x").isNull but that would return whether a column exists or not On Mon, Feb 15, 2016 at 10:31 PM Mohammed Guller wrote: > The DataFrame class has a method named columns, which returns all c
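
For reference, a small Scala sketch of the columns-based check Mohammed described; it does require an already-bound DataFrame, which is exactly the limitation being discussed here.

  import org.apache.spark.sql.DataFrame

  // True if the bound DataFrame's schema has a column with the given name.
  def hasColumn(df: DataFrame, name: String): Boolean =
    df.columns.contains(name)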

Re: New line lost in streaming output file

2016-02-16 Thread Ashutosh Kumar
Hi Chandeep, Thanks for the response. The issue is that the newline feed is lost. All records appear on one line only. Thanks Ashutosh On Tue, Feb 16, 2016 at 3:26 PM, Chandeep Singh wrote: > !rdd.isEmpty() should work but an alternative could be rdd.take(1) != 0 > > On Feb 16, 2016, at 9:33 AM, Ashutosh K

Re: New line lost in streaming output file

2016-02-16 Thread UMESH CHAUDHARY
Try to print RDD before writing to validate that you are getting '\n' from Kafka. On Tue, Feb 16, 2016 at 4:19 PM, Ashutosh Kumar wrote: > Hi Chandeep, > Thanks for response. Issue is the new line feed is lost. All records > appear in one line only. > > Thanks > Ashutosh > > On Tue, Feb 16, 2016

Scala from Jupyter

2016-02-16 Thread AlexModestov
Hello! I want to use Scala from Jupyter (or maybe something else if you could recommend anything. I mean an IDE). Does anyone know how I can do this? Thank you! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Scala-from-Jupyter-tp26234.html Sent from the Apa

RE: Unusually large deserialisation time

2016-02-16 Thread Darren Govoni
I think this is part of the bigger issue of serious deadlock conditions occurring in spark many of us have posted on. Would the task in question be the past task of a stage by chance? Sent from my Verizon Wireless 4G LTE smartphone Original message From: Abhishek Modi

Re: Scala from Jupyter

2016-02-16 Thread Rajeev Reddy
Hello, Let me understand your query correctly. Case 1. You have a jupyter installation for python and you want to use it for scala. Solution: You can install kernels other than python Ref Case 2. You want to use spark

Re: Scala from Jupyter

2016-02-16 Thread Gourav Sengupta
Apache Zeppelin will be the right solution with built-in plugins for python and visualizations as well. Are you planning to use this in EMR? Regards, Gourav On Tue, Feb 16, 2016 at 12:04 PM, Rajeev Reddy wrote: > Hello, > > Let me understand your query correctly. > > Case 1. You have a jupyte

Re: Unusually large deserialisation time

2016-02-16 Thread Ted Yu
Darren: Can you post link to the deadlock issue you mentioned ? Thanks > On Feb 16, 2016, at 6:55 AM, Darren Govoni wrote: > > I think this is part of the bigger issue of serious deadlock conditions > occurring in spark many of us have posted on. > > Would the task in question be the past tas

Re: Scala from Jupyter

2016-02-16 Thread andy petrella
Hello Alex! Rajeev is right, come over the spark notebook gitter room, you'll be helped by many experienced people if you have some troubles: https://gitter.im/andypetrella/spark-notebook The spark notebook has many integrated, reactive (scala) and extendable (scala) plotting capabilities. cheer

Re: Scala from Jupyter

2016-02-16 Thread Aleksandr Modestov
Thank you! I will test Spark Notebook. On Tue, Feb 16, 2016 at 3:37 PM, andy petrella wrote: > Hello Alex! > > Rajeev is right, come over the spark notebook gitter room, you'll be > helped by many experienced people if you have some troubles: > https://gitter.im/andypetrella/spark-notebook > > T

Re: Side effects of using var inside a class object in a Rdd

2016-02-16 Thread Ted Yu
RDD is immutable. How about making the class with a, b and c populated a base class? The class with e and f populated would be a subclass. On Mon, Feb 15, 2016 at 11:55 PM, Hemalatha A < hemalatha.amru...@googlemail.com> wrote: > Hello, > > Yes Age was just for an illustration. Actual scenario is as bel

How to debug spark-core with function call stack?

2016-02-16 Thread DaeJin Jung
Hello everyone, I would like to draw the call stack of Spark core by analyzing the source code. But I'm not sure how to apply a debugging tool like gdb which can support the backtrace command. Please let me know if you have any suggestions. Best Regards, Daejin Jung

Re: Submit custom python packages from current project

2016-02-16 Thread Ramanathan R
Have you tried setting PYTHONPATH? $ export PYTHONPATH="/path/to/project" $ spark-submit --master yarn-client /path/to/project/main_script.py Regards, Ram On 16 February 2016 at 15:33, Mohannad Ali wrote: > Hello Everyone, > > I have code inside my project organized in packages and modules, ho

RE: Unusually large deserialisation time

2016-02-16 Thread Darren Govoni
I meant to write 'last task in stage'. Sent from my Verizon Wireless 4G LTE smartphone Original message From: Darren Govoni Date: 02/16/2016 6:55 AM (GMT-05:00) To: Abhishek Modi , user@spark.apache.org Subject: RE: Unusually large deserialisation time I thi

RE: Memory problems and missing heartbeats

2016-02-16 Thread JOAQUIN GUANTER GONZALBEZ
Bumping this thread in hopes that someone will answer. Ximo -Original Message- From: JOAQUIN GUANTER GONZALBEZ [mailto:joaquin.guantergonzal...@telefonica.com] Sent: Monday, 15 February 2016 16:43 To: user@spark.apache.org Subject: Memory problems and missing heartbeats Hello,

Re: Submit custom python packages from current project

2016-02-16 Thread Mohannad Ali
Hello Ramanathan, Unfortunately I tried this already and it doesn't work. Mo On Tue, Feb 16, 2016 at 2:13 PM, Ramanathan R wrote: > Have you tried setting PYTHONPATH? > $ export PYTHONPATH="/path/to/project" > $ spark-submit --master yarn-client /path/to/project/main_script.py > > Regards, > R

Re: Memory problems and missing heartbeats

2016-02-16 Thread Iulian Dragoș
Regarding your 2nd problem, my best guess is that you’re seeing GC pauses. It’s not unusual, given you’re using 40GB heaps. See for instance this blog post: From conducting numerous tests, we have concluded that unless

Re: Spark runs only on Mesos v0.21?

2016-02-16 Thread Iulian Dragoș
Spark cares about Mesos, and it's safe to run it with Mesos 0.27. 0.21 is a minimum requirement. On Fri, Feb 12, 2016 at 9:42 AM, Tamas Szuromi < tamas.szur...@odigeo.com.invalid> wrote: > Hello Petr, > > We're running Spark 1.5.2 and 1.6.0 on Mesos 0.25.0 without any problem. > We upgraded from

Re: Scala from Jupyter

2016-02-16 Thread Gourav Sengupta
take a look here as well http://zeppelin-project.org/ it executes Scala and Python and Markup document in the same notebook and draws beautiful visualisations as well. It comes built in AWS EMR as well. Regards, Gourav On Tue, Feb 16, 2016 at 12:43 PM, Aleksandr Modestov < aleksandrmodes...@gmai

Re: Scala from Jupyter

2016-02-16 Thread Teng Qiu
Hi Gourav, Hi Alex, you can try this https://github.com/zalando/spark-appliance this docker image (registry.opensource.zalan.do/bi/spark:1.6.0-1) is integrated with Jupyter notebook, plugins (kernels) for spark and R are installed, some python libs, like NumPy,SciPy and matplotlib are already inst

Re: Memory problems and missing heartbeats

2016-02-16 Thread Arkadiusz Bicz
I had a problem similar to #2 when I used a lot of caching and then did shuffling. It looks like when I cached too much there was not enough space for other Spark tasks and they just hung. You can try to cache less and see if that improves things; executor logs also help a lot (watch out for logs with informatio

Re: Scala from Jupyter

2016-02-16 Thread andy petrella
That's great man. Note, if you don't want to struggle too much, you just have to download the version you need here: http://spark-notebook.io/ It can be whatever you want: zip, tgz, docker or even deb ^^. So pick the flavor you like the most. I'd recommend two things: 1. build on master to

Re: off-heap certain operations

2016-02-16 Thread Li Ming Tsai
Hi Sean, > Personally, I would leave this off. Is this not production ready, thus we should disable it? Thanks, Liming From: Sean Owen Sent: Saturday, February 13, 2016 2:18 AM To: Ovidiu-Cristian MARCU Cc: Ted Yu; Sea; user@spark.apache.org Subject: R

Re: Stored proc with spark

2016-02-16 Thread Mich Talebzadeh
You can use JDBC to Oracle to get that data from a given table. What does the Oracle stored procedure do anyway? How many tables are involved? JDBC is pretty neat. In the example below I use JDBC to load two Dimension tables from Oracle in the Spark shell and read the FACT table of 100 million rows from Hive va
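
A minimal Scala sketch of the kind of JDBC load described here, for the Spark 1.x shell; the URL, credentials, and table names below are placeholders, not the ones from the original example.

  // Load an Oracle dimension table over JDBC and expose it to Spark SQL.
  val channels = sqlContext.read
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/MYSERVICE")
    .option("dbtable", "sh.channels")
    .option("user", "scratchpad")
    .option("password", "***")
    .load()

  channels.registerTempTable("channels")  // join against Hive tables via SQL from here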

Re: Stored proc with spark

2016-02-16 Thread Gaurav Agarwal
Thanks I will try with the options On Feb 16, 2016 9:15 PM, "Mich Talebzadeh" wrote: > You can use JDBC to oracle to get that data from a given table. What > Oracle stored procedure does anyway? How many tables are involved? > > JDBC is pretty neat. In example below I use JDBC to load two > Dimen

Re: Scala from Jupyter

2016-02-16 Thread Rajeev Reddy
Hello, As andy mentioned spark notebook works great out of the box and also you can create your own custom spark config (memory stuff, injecting custom jars into notebook, etc) to run for each notebook. I have been using it to write and test my production level spark applications using a huge chun

RE: Memory problems and missing heartbeats

2016-02-16 Thread JOAQUIN GUANTER GONZALBEZ
A GC pause fits nicely with what I’m seeing. Many thanks for the link! Ximo From: Iulian Dragoș [mailto:iulian.dra...@typesafe.com] Sent: Tuesday, 16 February 2016 15:14 To: JOAQUIN GUANTER GONZALBEZ CC: user@spark.apache.org Subject: Re: Memory problems and missing heartbeats Regardi

RE: Stored proc with spark

2016-02-16 Thread Mich Talebzadeh
One thing to be aware of is that you had better convert Oracle NUMBER and NUMBER(m,n) columns to varchar (--> TO_CHAR()) at source, as Spark will throw overflow errors. It is better to use TO_CHAR() in Oracle rather than writing a UDF in Spark. UDFs in any language are slower compared to the generic
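
The TO_CHAR() conversion can be pushed down by giving the JDBC source a subquery as its dbtable instead of a bare table name. A sketch with made-up column and table names, which drops into the same kind of jdbc read shown earlier in the thread:

  // Convert NUMBER columns to strings on the Oracle side so Spark never has to
  // map an unbounded NUMBER; the subquery (aliased "t") replaces the dbtable value.
  val dbtable =
    """(SELECT TO_CHAR(prod_id) AS prod_id,
      |        TO_CHAR(amount_sold) AS amount_sold,
      |        cust_id
      |   FROM sh.sales) t""".stripMargin

  val sales = sqlContext.read
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/MYSERVICE")
    .option("dbtable", dbtable)
    .option("user", "scratchpad")
    .option("password", "***")
    .load()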

RE: Memory problems and missing heartbeats

2016-02-16 Thread JOAQUIN GUANTER GONZALBEZ
Thanks. I'll take a look at Graphite to see if that helps me out with my first problem. Ximo. -Original Message- From: Arkadiusz Bicz [mailto:arkadiusz.b...@gmail.com] Sent: Tuesday, 16 February 2016 16:06 To: Iulian Dragoș CC: JOAQUIN GUANTER GONZALBEZ ; user@spark.apache.or

Re: Saving Kafka Offsets to Cassandra at begining of each batch in Spark Streaming

2016-02-16 Thread Cody Koeninger
You could use sc.parallelize... but the offsets are already available at the driver, and they're a (hopefully) small enough amount of data that it's probably more straightforward to just use the normal cassandra client to save them from the driver. On Tue, Feb 16, 2016 at 1:15 AM, Abhishek Anand

Use case for RDD and Data Frame

2016-02-16 Thread Ashok Kumar
Gurus, What are the main differences between a Resilient Distributed Dataset (RDD) and a DataFrame (DF)? Where can one use an RDD without transforming it to a DF? Regards and obliged

In Spark Dataframes, does dropDuplicates retain the first row?

2016-02-16 Thread tmoffwood
If we sort a Spark Dataframe by col1 and then dropDuplicates on col2, will it retain the first row (which will have the highest col1 value for each col2)? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/In-Spark-Dataframes-does-dropDuplicates-retain-the-firs
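
As a sketch, the pattern being asked about looks like the following (df, col1 and col2 are placeholders); whether the sort order survives the shuffle that dropDuplicates introduces is precisely the open question, and the API does not document which duplicate is kept.

  import org.apache.spark.sql.functions.col

  // Sort by col1 descending, then deduplicate on col2, hoping the highest-col1
  // row per col2 value is the one that survives.
  val deduped = df.sort(col("col1").desc).dropDuplicates(Seq("col2"))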

Re: Use case for RDD and Data Frame

2016-02-16 Thread Andy Grove
This blog post should be helpful http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/ Thanks, Andy. -- Andy Grove Chief Architect AgilData - Simple Streaming SQL that Scales www.agildata.com On Tue, Feb 16, 2016 at 9:05 AM, Ashok Kumar wrote: > Gurus, > > What are the main di

Frustration over Spark and Jackson

2016-02-16 Thread Martin Skøtt
Hi, I recently started experimenting with Spark Streaming for ingesting and enriching content from a Kafka stream. Being new to Spark I expected a bit of a learning curve, but not with something as simple as using JSON data! I have a JAR with common classes used across a number of Java projects wh

Re: Frustration over Spark and Jackson

2016-02-16 Thread Sean Owen
Shading is the answer. It should be transparent to you though if you only apply it at the module where you create the deployable assembly JAR. On Tue, Feb 16, 2016 at 5:08 PM, Martin Skøtt wrote: > Hi, > > I recently started experimenting with Spark Streaming for ingesting and > enriching content

Re: Re: Unusually large deserialisation time

2016-02-16 Thread Abhishek Modi
Darren: this is not the last task of the stage. Thank you, Abhishek e: abshkm...@gmail.com p: 91-8233540996 On Tue, Feb 16, 2016 at 6:52 PM, Darren Govoni wrote: > There were some posts in this group about it. Another person also saw the > deadlock on next to last or last stage task. > > I've

Re: Re: Unusually large deserialisation time

2016-02-16 Thread Abhishek Modi
PS - I don't get this behaviour in all the cases. I did many runs of the same job & I get this behaviour in around 40% of the cases. Task 4 is the bottom row in the metrics table Thank you, Abhishek e: abshkm...@gmail.com p: 91-8233540996 On Tue, Feb 16, 2016 at 11:19 PM, Abhishek Modi wrote:

RE: Use case for RDD and Data Frame

2016-02-16 Thread Mich Talebzadeh
Hi, A Resilient Distributed Dataset (RDD) is a heap of data distributed among all nodes of the cluster. It is basically raw data, and that is all there is to it, with little optimization on it. Remember, data is not of much value until it is turned into information. On the other hand a DataFrame i
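
A tiny Scala illustration of the contrast, assuming a spark-shell session (so sc and sqlContext exist); the Sale class and its values are made up.

  case class Sale(product: String, amount: Double)

  // RDD view: opaque Sale objects, transformations are plain Scala functions.
  val salesRdd = sc.parallelize(Seq(Sale("tv", 400.0), Sale("radio", 35.0)))
  val rddTotal = salesRdd.map(_.amount).sum()

  // DataFrame view: named, typed columns that the Catalyst optimizer can reason about.
  import sqlContext.implicits._
  val salesDf = salesRdd.toDF()
  salesDf.agg(org.apache.spark.sql.functions.sum("amount")).show()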

Re: Saving Kafka Offsets to Cassandra at begining of each batch in Spark Streaming

2016-02-16 Thread Todd Nist
You could use the "withSessionDo" of the SparkCassandrConnector to preform the simple insert: CassandraConnector(conf).withSessionDo { session => session.execute() } -Todd On Tue, Feb 16, 2016 at 11:01 AM, Cody Koeninger wrote: > You could use sc.parallelize... but the offsets are already

Re: Use case for RDD and Data Frame

2016-02-16 Thread Chandeep Singh
Here is another interesting post. http://www.kdnuggets.com/2016/02/apache-spark-rdd-dataframe-dataset.html?utm_content=buffer31ce5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

Re: Frustration over Spark and Jackson

2016-02-16 Thread Chandeep Singh
Shading worked pretty well for me when I ran into an issue similar to yours. POM is all you need to change. org.apache.maven.plugins maven-shade-plugin 1.6 package s

How to use a custom partitioner in a dataframe in Spark

2016-02-16 Thread SRK
Hi, How do I use a custom partitioner when I do a saveAsTable in a dataframe. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-a-custom-partitioner-in-a-dataframe-in-Spark-tp26240.html Sent from the Apache Spark User List mailing

RE: Use case for RDD and Data Frame

2016-02-16 Thread Mich Talebzadeh
Thanks Chandeep. Andy Grove, the author earlier on pointed to that article in an earlier thread :) Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com

Re: Use case for RDD and Data Frame

2016-02-16 Thread Chandeep Singh
Ah. My bad! :) > On Feb 16, 2016, at 6:24 PM, Mich Talebzadeh wrote: > > Thanks Chandeep. > > Andy Grove, the author earlier on pointed to that article in an earlier > thread :) > > > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP

Fair Scheduler Pools with Kafka Streaming

2016-02-16 Thread p pathiyil
Hi, I am trying to use Fair Scheduler Pools with Kafka Streaming. I am assigning each Kafka partition to its own pool. The attempt is to give each partition an equal share of compute time irrespective of the number of messages in each time window for each partition. However, I do not see fair sha

RE: Spark LBFGS Error with ANN

2016-02-16 Thread Ulanov, Alexander
Hi Hayri, The MLP classifier is multi-class (one class per instance) but not multi-label (multiple classes per instance). The top layer of the network is softmax http://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier that requires the outputs sum
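
For reference, a sketch of how the spark.ml classifier is configured; the layer sizes and the trainingData DataFrame (with the default "features" and "label" columns) are placeholders. The last element of layers is the number of classes handled by the softmax output described above.

  import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

  // 4 input features, two hidden layers, softmax output over 3 classes.
  val layers = Array[Int](4, 5, 4, 3)

  val trainer = new MultilayerPerceptronClassifier()
    .setLayers(layers)
    .setBlockSize(128)
    .setMaxIter(100)

  val model = trainer.fit(trainingData)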

RE: Learning Fails with 4 Number of Layes at ANN Training with SGDOptimizer

2016-02-16 Thread Ulanov, Alexander
Hi Hayri, The default MLP optimizer is LBFGS. SGD is available only through the private interface and its use is discouraged due to multiple reasons. With regards to SGD in general, the parameters are very specific to the dataset and network configuration, one needs to find them empirically. The

Spark SQL step with many tasks takes a long time to begin processing

2016-02-16 Thread Dukek, Dillon
Hello, I have been working on a project that allows a BI tool to query roughly 25 TB of application event data from 2015 using the thrift server and Spark SQL. In general the jobs that are submitted have a step that submits many tasks, on the order of hundreds of thousands, which is equal to the num

Re: Fair Scheduler Pools with Kafka Streaming

2016-02-16 Thread Sebastian Piu
Yes it is related to concurrentJobs, so you need to increase that. But that will mean that if you get overlapping batches then those will be executed in parallel too On Tue, 16 Feb 2016, 18:33 p pathiyil wrote: > Hi, > > I am trying to use Fair Scheduler Pools with Kafka Streaming. I am > assig
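
A sketch of the settings being discussed, fair scheduling plus the (undocumented) concurrent-jobs knob, with placeholder values and pool names:

  val conf = new org.apache.spark.SparkConf()
    .set("spark.scheduler.mode", "FAIR")                       // turn on fair scheduling
    .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    .set("spark.streaming.concurrentJobs", "4")                // allow more than one job at a time

  // Work submitted for a given Kafka partition can then pick its pool:
  sc.setLocalProperty("spark.scheduler.pool", "partition-0")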

Re: Spark LBFGS Error with ANN

2016-02-16 Thread Hayri Volkan Agun
Hi Alexander, Thanks for the info. I modified the code and used sigmoid at the last layer. It worked correctly with 2-3 layers. Thanks... On Tue, Feb 16, 2016 at 8:51 PM, Ulanov, Alexander wrote: > Hi Hayri, > > > > The MLP classifier is multi-class (one class per instance) but not > multi-label

Re: How to join an RDD with a hive table?

2016-02-16 Thread swetha kasireddy
How to use a custom partitioner hashed by userId inside saveAsTable using a dataframe? On Mon, Feb 15, 2016 at 11:24 AM, swetha kasireddy < swethakasire...@gmail.com> wrote: > How about saving the dataframe as a table partitioned by userId? My User > records have userId, number of sessions, visit
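
For the "table partitioned by userId" idea from the quoted message, a sketch of the write-side partitioning (the table name and the userSessions dataframe are placeholders). Note this is directory-style partitioning on write, not a custom hash Partitioner, which the DataFrame writer does not take.

  import org.apache.spark.sql.SaveMode

  // Writes one directory per distinct userId under the table's location.
  userSessions.write
    .mode(SaveMode.Overwrite)
    .partitionBy("userId")
    .saveAsTable("user_sessions")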

Re: Spark SQL step with many tasks takes a long time to begin processing

2016-02-16 Thread Teng Qiu
I believe this is a known issue when using Spark/Hive with files on S3. This huge delay on the driver side is caused by partition listing and split computation, and it is more of an issue with Hive; since you are using the thrift server, the SQL queries are running in HiveContext. Qubole made some optimizat

How to delete a record from parquet files using dataframes

2016-02-16 Thread SRK
Hi, I am saving my records in the form of parquet files using dataframes in hdfs. How to delete the records using dataframes? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-delete-a-record-from-parquet-files-using-dataframes-tp26242.html Se

spark examples Analytics ConnectedComponents - keep running, nothing in output

2016-02-16 Thread Ovidiu-Cristian MARCU
Hi I’m trying to run Analytics cc (ConnectedComponents) but it is running without ending. Logs are fine, but I just keep getting Job xyz finished, reduce took some time: ... INFO DAGScheduler: Job 29 finished: reduce at VertexRDDImpl.scala:90, took 14.828033 s INFO DAGScheduler: Job 30 finished

Optimize the performance of inserting data to Cassandra with Kafka and Spark Streaming

2016-02-16 Thread Jerry
Hello, I have questions using Spark streaming to consume data from Kafka and insert to Cassandra database. 5 AWS instances (each one does have 8 cores, 30GB memory) for Spark, Hadoop, Cassandra Scala: 2.10.5 Spark: 1.2.2 Hadoop: 1.2.1 Cassandra 2.0.18 3 AWS instances for Kafka cluster (each one

Lost executors failed job unable to execute spark examples Triangle Count (Analytics triangles)

2016-02-16 Thread Ovidiu-Cristian MARCU
Hi, I am able to run the Triangle Count example with some smaller graphs but when I am using http://snap.stanford.edu/data/com-Friendster.html I am not able to get the job finished ok. For some reason Spark loses its executors. No matter what

Re: off-heap certain operations

2016-02-16 Thread Ovidiu-Cristian MARCU
Well, the off-heap setting is quite important, and now I am curious about other parameters; I hope everything else is well documented and not misleading. Best, Ovidiu > On 12 Feb 2016, at 19:18, Sean Owen wrote: > > I don't think much more is said since in fact it would affect parts of > the i

RE: Spark SQL step with many tasks takes a long time to begin processing

2016-02-16 Thread Dukek, Dillon
Thanks for your response. I have a couple of questions if you don’t mind. 1) Are you saying that if I was to bring all of the data into hdfs on the cluster that I could avoid this problem? My compressed parquet form of the data that is stored on s3 is only about 3 TB so that could be an op

streaming application redundant dag stage execution/performance/caching

2016-02-16 Thread krishna ramachandran
We have a streaming application containing approximately 12 stages every batch, running in streaming mode (4 sec batches). Each stage persists output to Cassandra. The pipeline stages: stage 1 ---> receive Stream A --> map --> filter --> (union with another stream B) --> map --> groupByKey --> trans

Spark Streaming with Kafka DirectStream

2016-02-16 Thread Cyril Scetbon
Hi guys, I'm making some tests with Spark and Kafka using a Python script. I use the second method that doesn't need any receiver (Direct Approach). It should adapt the number of RDDs to the number of partitions in the topic. I'm trying to verify it. What's the easiest way to verify it ? I also

Re: Spark Application Master on Yarn client mode - Virtual memory limit

2016-02-16 Thread Nirav Patel
I think you are not getting my question. I know how to tune executor memory settings and parallelism. That's not an issue. It's a specific question about what happens when the physical memory limit of a given executor is reached. Now the YARN NodeManager has a specific setting about provisioning virtual m

Re: Spark Streaming with Kafka DirectStream

2016-02-16 Thread ayan guha
I have a slightly different understanding. Direct stream generates 1 RDD per batch, however, number of partitions in that RDD = number of partitions in kafka topic. On Wed, Feb 17, 2016 at 12:18 PM, Cyril Scetbon wrote: > Hi guys, > > I'm making some tests with Spark and Kafka using a Python sc

How to update data saved as parquet in hdfs using Dataframes

2016-02-16 Thread SRK
Hi, How do I update data saved as Parquet in hdfs using dataframes? If I use SaveMode.Append, it just seems to append the data but does not seem to update if the record is already existing. Do I have to just modify it using Dataframes api or sql using sqlContext? Thanks, Swetha -- View this m

Re: Spark Streaming with Kafka DirectStream

2016-02-16 Thread Cyril Scetbon
Your understanding is the right one (having re-read the documentation). Still wondering how I can verify that 5 partitions have been created. My job is reading from a topic in Kafka that has 5 partitions and sends the data to E/S. I can see that when there is one task to read from Kafka there ar

Re: Spark Streaming with Kafka DirectStream

2016-02-16 Thread ayan guha
Hi You can always use RDD properties, which already has partition information. https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html On Wed, Feb 17, 2016 at 2:36 PM, Cyril Scetbon wrote: > Your understanding i
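
A one-liner (here in Scala) for checking it per batch, along the lines of the page linked above; directStream stands for whatever your createDirectStream call returned.

  // Print how many partitions each micro-batch RDD has. For the direct,
  // receiver-less approach this should equal the topic's Kafka partition count.
  directStream.foreachRDD { rdd =>
    println(s"partitions in this batch: ${rdd.partitions.length}")
  }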

RE: Memory problems and missing heartbeats

2016-02-16 Thread Ignacio Blasco
Hi Ximo. Regarding #1, you can try to increase the number of partitions used for cogroup or reduce. AFAIK Spark needs to have enough memory space to handle in memory all the data processed by a given partition; by increasing the number of partitions you can reduce that load. Probably we need to know

Re: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-16 Thread Yanbo Liang
Hi Stuti, The features should be standardized before training the model. Currently AFTSurvivalRegression does not support standardization. Here is the work around for this issue, and I will send a PR to fix this issue soon. val ovarian = sqlContext.read .format("com.databricks.spark.csv")
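
The work-around snippet is cut off above, so here is only a hedged sketch of the general idea: standardize the feature vector, then fit the AFT model on the scaled column. The "features", "label" and "censor" column names are assumptions; ovarian is the DataFrame read in the quoted snippet.

  import org.apache.spark.ml.feature.StandardScaler
  import org.apache.spark.ml.regression.AFTSurvivalRegression

  // Scale features to unit variance (and zero mean) before fitting.
  val scaler = new StandardScaler()
    .setInputCol("features")
    .setOutputCol("scaledFeatures")
    .setWithMean(true)
    .setWithStd(true)

  val scaled = scaler.fit(ovarian).transform(ovarian)

  val aft = new AFTSurvivalRegression()
    .setFeaturesCol("scaledFeatures")
    .setLabelCol("label")
    .setCensorCol("censor")

  val model = aft.fit(scaled)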

Re: Saving Kafka Offsets to Cassandra at begining of each batch in Spark Streaming

2016-02-16 Thread Abhishek Anand
Hi Cody, I am able to do it using this piece of code kafkaStreamRdd.foreachRDD((rdd,batchMilliSec) -> { Date currentBatchTime = new Date(); currentBatchTime.setTime(batchMilliSec.milliseconds()); List r = new ArrayList(); OffsetRange[] offsetRanges = ((HasOffsetRanges)rdd.rdd()).offsetRanges(); for(

RE: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-16 Thread Stuti Awasthi
Thanks a lot Yanbo, this will really help. Since I was unaware of this, I was speculating if my vectors were not getting generated correctly. Thanks !! Thanks &Regards Stuti Awasthi From: Yanbo Liang [mailto:yblia...@gmail.com] Sent: Wednesday, February 17, 2016 11:51 AM To: Stuti Awasthi Cc: u