Best Practices for Spark-Python App Deployment

2016-09-14 Thread RK Aduri
Dear All: We are trying to deploy (using Jenkins) a Spark-Python app on an edge node; the dilemma, however, is whether to clone the git repo to all the nodes in the cluster. The reason is that if we choose the deploy mode as cluster and the master as yarn, then the driver expects the cur

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread RK Aduri
> <https://github.com/apache/spark/blob/master/python/pyspark/cloudpickle.py#L241>

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread RK Aduri
>> RDD contains data but not JVM byte code, i.e. data which is read from the source and to which transformations have been applied. This is an ideal case to persist RDDs. As Nirav mentioned, this data will be serialized before persisting t

Re: Breaking down text String into Array elements

2016-08-23 Thread RK Aduri
That’s because of this:

scala> val text = Array((1,"hNjLJEgjxn"),(2,"lgryHkVlCN"),(3,"ukswqcanVC"),(4,"ZFULVxzAsv"),(5,"LNzOozHZPF"),(6,"KZPYXTqMkY"),(7,"DVjpOvVJTw"),(8,"LKRYrrLrLh"),(9,"acheneIPDM"),(10,"iGZTrKfXNr"))
text: Array[(Int, String)] = Array((1,hNjLJEgjxn), (2,lgryHkVlCN), (3,ukswqc

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread RK Aduri
On another note, if you have a streaming app, you checkpoint the RDDs so that they can be accessed in case of a failure. And yes, RDDs are persisted to DISK. You can open Spark’s UI and see them listed under the Storage tab. If RDDs are persisted in memory, you avoid any disk I/O, so that any look
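For reference, a minimal sketch of both mechanisms, assuming an existing SparkContext sc; the paths and storage level are only illustrative:

    import org.apache.spark.storage.StorageLevel

    // checkpointing truncates the lineage and writes the RDD to reliable storage,
    // so a streaming app can recover it after a failure
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")      // hypothetical path
    val events = sc.textFile("hdfs:///data/events")     // hypothetical input
    events.checkpoint()

    // persisting with MEMORY_AND_DISK spills partitions that do not fit in memory to disk;
    // persisted RDDs then show up under the Storage tab of the Spark UI
    events.persist(StorageLevel.MEMORY_AND_DISK)
    events.count()                                      // an action materializes the persisted data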

Sparking Water (Spark 1.6.0 + H2O 3.8.2.6 ) on CDH 5.7.1

2016-08-09 Thread RK Aduri
All, Ran into one strange issue. If I initialize an H2O context and start it (NOT using it anywhere), the count action on a Spark data frame results in an error. The same count action on the Spark data frame works fine when the H2O context is not initialized. hc = H2OContext(sc).sta

Question: collect action returning to driver

2016-08-05 Thread RK Aduri
Rather, this is a fundamental question: is it an architectural constraint that the collect action always returns the results to the driver? It gobbles up all the driver’s memory (in case of cache), so why can’t we have an exclusive executor that shares the load and “somehow” merges the results
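As a hedged illustration of ways to avoid pulling everything back to the driver (assuming an existing SparkContext sc; paths are made up):

    // pull one partition at a time to the driver instead of the whole result set
    val lengths = sc.textFile("hdfs:///data/input").map(_.length)
    lengths.toLocalIterator.take(10).foreach(println)

    // or keep the driver out of it entirely and let the executors write the output
    lengths.saveAsTextFile("hdfs:///data/output")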

Re: Spark SQL overwrite/append for partitioned tables

2016-07-25 Thread RK Aduri
You can write the data you would like to overwrite to a temporary location, and then swap it with the existing partition whose data you want to wipe away. The swap can be done by a simple rename of the partition, after which you just repair the table to pick up the new partition. I am not sure if that
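A rough sketch of that swap, assuming a HiveContext-backed sqlContext and a table partitioned by dt; the table name and paths are made up, and the directory rename itself would happen outside Spark (e.g. hdfs dfs -mv):

    // 1. write the replacement data for the partition to a temporary location
    df.write.parquet("hdfs:///warehouse/tmp/sales_dt=2016-07-25")

    // 2. point the partition at the new location (or rename the directory
    //    outside Spark and skip this step), then ...
    sqlContext.sql(
      "ALTER TABLE sales PARTITION (dt='2016-07-25') " +
      "SET LOCATION 'hdfs:///warehouse/tmp/sales_dt=2016-07-25'")

    // 3. ... repair the table so the metastore picks up the partition layout
    sqlContext.sql("MSCK REPAIR TABLE sales")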

Re: ERROR Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

2016-07-22 Thread RK Aduri
I can see a large number of collect operations happening on the driver and, eventually, the driver running out of memory (I am not sure whether you have persisted any RDD or data frame). You may want to avoid doing so many collects, or avoid persisting unwanted data in memory. To begin with, you may want to
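For illustration only, assuming largeDF is the cached DataFrame in question (a hypothetical name):

    // instead of collecting a whole cached DataFrame onto the driver ...
    // val everything = largeDF.collect()     // this is what fills up driver memory

    // ... bring back only what is really needed there
    val sample = largeDF.limit(1000).collect()

    // and drop cached data that is no longer used
    largeDF.unpersist()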

Re: MultiThreading in Spark 1.6.0

2016-07-21 Thread RK Aduri
Thanks for the idea, Maciej. The data is roughly 10 gigs. I’m wondering if there is any way to avoid the collect for each unit operation and somehow capture all such resultant arrays and collect them at once. > On Jul 20, 2016, at 2:52 PM, Maciej Bryński wrote: > > RK Aduri, > Anothe
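One hedged way to express "collect them at once", assuming each unit operation can be kept as an RDD; largeRowDF, uIds and the column name u_id are made-up stand-ins:

    // hypothetical per-unit computation that returns an RDD instead of collecting
    def unitOperation(id: Int): org.apache.spark.rdd.RDD[(Int, Long)] =
      largeRowDF.filter(largeRowDF("u_id") === id)
        .rdd.map(r => (id, r.getLong(1)))    // assumes column 1 is a Long

    val perUnit  = uIds.map(unitOperation)   // Seq[RDD[(Int, Long)]], nothing collected yet
    val combined = sc.union(perUnit)         // one RDD holding all intermediate results
    val all      = combined.collect()        // single collect at the very end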

Re: Understanding Spark UI DAGs

2016-07-21 Thread RK Aduri
That -1 is coming from here:

PythonRDD.writeIteratorToStream(inputIterator, dataOut)
dataOut.writeInt(SpecialLengths.END_OF_DATA_SECTION)   // —> val END_OF_DATA_SECTION = -1
dataOut.writeInt(SpecialLengths.END_OF_STREAM)
dataOut.flush()

> On Jul 21, 2016, at 12:24 PM, Jacek Laskowski wrote: > >

Re: spark.driver.extraJavaOptions

2016-07-21 Thread RK Aduri
This has worked for me: --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/some/path/search-spark-service-log4j-Driver.properties" \ You may want to try it. If that doesn't work, then you may use --properties-file.

MultiThreading in Spark 1.6.0

2016-07-20 Thread RK Aduri
Spark version: 1.6.0. So, here is the background: I have a data frame (Large_Row_DataFrame) which I have created from an array of row objects. I also have another array of unique ids (U_ID) which I’m going to use to look up into the Large_Row_DataFrame (which is cached) to do a customized
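A hedged sketch of that lookup pattern, assuming the id column is called u_id and that largeRowDF/uIds are stand-in names; broadcasting the small id list keeps the lookup on the executors instead of looping on the driver:

    import org.apache.spark.sql.functions.broadcast

    largeRowDF.cache()                                   // the large, cached DataFrame

    // the small list of unique ids as a one-column DataFrame
    val idDF = sqlContext.createDataFrame(uIds.map(Tuple1(_))).toDF("u_id")

    // broadcast join: every executor gets the id list, so there is no per-id
    // round trip to the driver
    val matched = largeRowDF.join(broadcast(idDF), "u_id")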

Re: Spark driver getting out of memory

2016-07-20 Thread RK Aduri
e on RDD. Is this reason of high RAM utilization.
>
> Thanks,
> Saurav Sinha
>
> On Tue, Jul 19, 2016 at 10:14 PM, RK Aduri wrote:
>
>> Just want to see if this helps.
>>
>> Are you doing heavy collects and persist that? If that is so, you might w

Re: Task not serializable: java.io.NotSerializableException: org.json4s.Serialization$$anon$1

2016-07-19 Thread RK Aduri
Did you check this? In case class Example(name : String, age ; Int) there is a semicolon; it should have been (age : Int).

Re: Spark driver getting out of memory

2016-07-19 Thread RK Aduri
Just want to see if this helps. Are you doing heavy collects and persisting them? If so, you might want to parallelize that collection by converting it to an RDD. Thanks, RK

On Tue, Jul 19, 2016 at 12:09 AM, Saurav Sinha wrote:
> Hi Mich,
> 1. In what mode are you running the spark stan
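A small sketch of that suggestion; bigRDD and the output path are hypothetical, and the example assumes the collected array has to exist on the driver at some point:

    // the array that was collected onto the driver (hypothetical)
    val collected: Array[(String, Int)] = bigRDD.collect()

    // push it back out as an RDD so the follow-up work runs on the executors
    val redistributed = sc.parallelize(collected, 200)
    redistributed.mapValues(_ * 2).saveAsTextFile("hdfs:///tmp/out")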

Re: Spark Streaming - Best Practices to handle multiple datapoints arriving at different time interval

2016-07-15 Thread RK Aduri
You can probably define sliding windows and set larger batch intervals.
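A minimal Streaming sketch of that idea, with made-up durations and a socket source purely for illustration:

    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    // larger batch interval: one batch every 30 seconds
    val ssc = new StreamingContext(sc, Seconds(30))

    val lines = ssc.socketTextStream("localhost", 9999)   // illustrative source

    // sliding window: look at the last 5 minutes of data, sliding every minute,
    // so datapoints arriving at different times still land in a common window
    val windowed = lines.window(Minutes(5), Minutes(1))
    windowed.count().print()

    ssc.start()
    ssc.awaitTermination()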

Re: java.lang.OutOfMemoryError related to Graphframe bfs

2016-07-15 Thread RK Aduri
Did you try with a different driver memory setting? Increasing the driver's memory can be one option. Can you print the GC activity and post the GC times?
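One way those two suggestions could look, sketched with SparkConf; the memory size and GC flags are only examples, and in client mode these driver settings normally have to be given to spark-submit (--driver-memory, --driver-java-options) because the driver JVM is already running when the conf is read:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("graphframe-bfs")
      // more driver heap (in client mode pass --driver-memory at submit time instead)
      .set("spark.driver.memory", "8g")
      // log GC activity so the pause times can be inspected and shared
      .set("spark.driver.extraJavaOptions",
           "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")

    val sc = new SparkContext(conf)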

Re: RDD and Dataframes

2016-07-15 Thread RK Aduri
DataFrames use RDDs as the internal implementation of their structure. A DataFrame doesn't convert to an RDD but uses RDD partitions to produce the logical plan.
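A quick way to see both sides of that, assuming any existing DataFrame df:

    // the logical and physical plans the DataFrame will execute
    df.explain(true)

    // the same data exposed as an RDD of Row objects, keeping the partitioning
    // of the underlying RDD
    val asRdd = df.rdd
    println(asRdd.partitions.length)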