pyspark dataframe sort issue

2016-05-07 Thread Buntu Dev
I'm using the pyspark dataframe API to sort by a specific column and then save the dataframe as a parquet file, but the resulting parquet file doesn't seem to be sorted. Applying the sort and doing a head() on the result shows the correct rows, sorted by the 'value' column in descending order, as shown below:
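
A hedged Scala sketch of the pattern described (the column name "value" and the paths are assumptions): a global sort orders the rows, but each partition is written to its own part-file, so reading the parquet back does not guarantee the total order and the sort has to be re-applied if ordering matters.

~~~
// Hedged sketch: column name "value" and paths are assumptions.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.desc

val sqlCtx = new SQLContext(sc)  // sc: an existing SparkContext
val df = sqlCtx.read.parquet("/tmp/input.parquet")

// The global sort orders rows across partitions, but each partition is
// written to its own parquet part-file, so reading the files back does
// not preserve a single total order.
df.sort(desc("value")).write.parquet("/tmp/sorted.parquet")

// Re-apply the sort after reading if a total order is required.
val reread = sqlCtx.read.parquet("/tmp/sorted.parquet").sort(desc("value"))
reread.show(5)
~~~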

Joining a RDD to a Dataframe

2016-05-07 Thread Cyril Scetbon
Hi, I have an RDD built during a Spark Streaming job and I'd like to join it to a DataFrame (E/S input) to enrich it. It seems that I can't join the RDD and the DF without first converting the RDD to a DF (tell me if I'm wrong). Here are the schemas of both DFs: scala> df res32: org.apache.spark
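
A hedged sketch of the convert-then-join approach asked about above; the Event case class, the "id" join column and the enrichment source are assumptions, not the poster's actual schemas.

~~~
// Hedged sketch: the Event case class, column names and paths are assumptions.
import org.apache.spark.sql.SQLContext

case class Event(id: String, payload: String)

val sqlCtx = new SQLContext(sc)
import sqlCtx.implicits._

// The RDD produced by the streaming job (illustrative data).
val eventsRdd = sc.parallelize(Seq(Event("1", "a"), Event("2", "b")))

// The DataFrame used for enrichment (the E/S input in the thread).
val enrichmentDf = sqlCtx.read.json("/tmp/es_input.json")

// An RDD cannot be joined to a DataFrame directly; convert it first.
val eventsDf = eventsRdd.toDF()
val enriched = eventsDf.join(enrichmentDf, eventsDf("id") === enrichmentDf("id"))
enriched.show()
~~~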

Rename hive orc table caused no content in spark

2016-05-07 Thread yansqrt3
Hi, I'm trying to rename an ORC table (whether in Hive or Spark makes no difference). After that, all the content in the table becomes invisible in Spark, while it is still available in Hive. The problem can always be recreated with very simple steps: spark shell output--
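
A hedged sketch of reproduction steps along the lines described above (table name, columns and data are assumptions); per the thread, the final SELECT returns no rows in Spark after the rename, while the same query in Hive still shows the data.

~~~
// Hedged sketch: table name, columns and data are assumptions.
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
import hiveCtx.implicits._

// Create and populate a small ORC table.
Seq((1, "a"), (2, "b")).toDF("id", "name")
  .write.format("orc").saveAsTable("t_orc")

// Rename the table, then query it from Spark again.
hiveCtx.sql("ALTER TABLE t_orc RENAME TO t_orc_renamed")
hiveCtx.sql("SELECT * FROM t_orc_renamed").show()
~~~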

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Nirav Patel
I have 20 executors, 6 cores each. There are 5 stages in total and it fails on the 5th one. All of them have 6474 tasks. The 5th stage is a count operation, and it also performs an aggregateByKey as part of its lazy evaluation. I am setting: spark.driver.memory=10G, spark.yarn.am.memory=2G and spark.driver.maxResultSize=9G
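
For reference, a sketch of how the settings quoted above could be expressed via SparkConf (the values come from the post; in practice they are usually passed with --conf or spark-defaults.conf, and spark.driver.memory only takes effect if set before the driver JVM starts).

~~~
// Sketch of the settings quoted above; the app name is an assumption.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-heavy-job")
  .set("spark.driver.memory", "10g")     // effective only if set before the driver JVM starts
  .set("spark.yarn.am.memory", "2g")
  .set("spark.driver.maxResultSize", "9g")

val sc = new SparkContext(conf)
~~~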

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Ashish Dubey
The driver maintains the complete metadata of the application (scheduling of executors and maintaining the messaging that controls the execution). This code seems to be failing in that code path only. That said, there is JVM overhead that depends on the number of executors, stages and tasks in your app. Do you know yo

Re: sqlCtx.read.parquet yields lots of small tasks

2016-05-07 Thread Ashish Dubey
How big is your file, and can you also share the code snippet? On Saturday, May 7, 2016, Johnny W. wrote: > hi spark-user, > > I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a > dataframe from a parquet data source with a single parquet file, it yields > a stage with lots of sma

sqlCtx.read.parquet yields lots of small tasks

2016-05-07 Thread Johnny W.
hi spark-user, I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a dataframe from a parquet data source with a single parquet file, it yields a stage with lots of small tasks. It seems the number of tasks depends on how many executors I have instead of how many parquet files/partit
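
A hedged sketch of how one might observe the behaviour described above by checking how many partitions (and therefore tasks) the read produces; the path and partition counts are assumptions.

~~~
// Hedged sketch: the path and target partition count are assumptions.
import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)
val df = sqlCtx.read.parquet("/data/single-file.parquet")

// The number of partitions drives the number of tasks in the read stage.
println(df.rdd.partitions.length)

// Coalescing is one way to avoid a stage of many tiny tasks downstream.
val coalesced = df.coalesce(4)
println(coalesced.rdd.partitions.length)
~~~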

Re: Correct way of setting executor numbers and executor cores in Spark 1.6.1 for non-clustered mode ?

2016-05-07 Thread kmurph
Hi Simon, Thanks. I did actually have "SPARK_WORKER_CORES=8" in spark-env.sh - it's commented as 'to set the number of cores to use on this machine'. Not sure how this would interact with SPARK_EXECUTOR_INSTANCES and SPARK_EXECUTOR_CORES, but I removed it and still see no scale-up with increasing

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Nirav Patel
Right, but these logs are from the Spark driver, and the Spark driver seems to use Akka. ERROR [sparkDriver-akka.actor.default-dispatcher-17] akka.actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down ActorSystem [sparkDriver]. I saw the following

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Ted Yu
bq. at akka.serialization.JavaSerializer.toBinary(Serializer.scala:129) It was Akka which used JavaSerializer. Cheers On Sat, May 7, 2016 at 11:13 AM, Nirav Patel wrote: > Hi, > > I thought I was using kryo serializer for shuffle. I could verify it from > spark UI - Environment tab that > sp

How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Nirav Patel
Hi, I thought I was using the Kryo serializer for shuffle. I could verify from the Spark UI Environment tab that spark.serializer is org.apache.spark.serializer.KryoSerializer and spark.kryo.registrator is com.myapp.spark.jobs.conf.SparkSerializerRegistrator. But when I see the following error in the driver logs it
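
A minimal sketch of the configuration described above, using the property names and registrator class quoted in the post; as noted later in the thread, Akka's own control-plane messages still go through Java serialization, so JavaSerializer can appear in driver stack traces even when Kryo is used for shuffle data.

~~~
// Sketch: property values follow the post; the app name is an assumption.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-check")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.myapp.spark.jobs.conf.SparkSerializerRegistrator")

val sc = new SparkContext(conf)

// The Environment tab of the UI, or the conf itself, confirms the setting.
println(sc.getConf.get("spark.serializer"))
~~~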

Re: Updating Values Inside Foreach Rdd loop

2016-05-07 Thread HARSH TAKKAR
Hi Ted, the following is my use case. I have a prediction algorithm where I need to update some records to predict the target. For example, I have an equation Y = mX + c. I need to change the value of Xi for some records and calculate sum(Yi); if the prediction is not close to the target value, then repeat the pro
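
A hedged sketch of one way to express this kind of iteration without mutating records inside foreach: rebuild the RDD each round and aggregate on the driver. The values of m, c, the target, the tolerance and the update rule are all assumptions, not the poster's actual algorithm.

~~~
// Hedged sketch: m, c, target, tolerance and the update rule are assumptions.
import org.apache.spark.rdd.RDD

val m = 2.0
val c = 1.0
val target = 1000.0
val tolerance = 1.0

var xs: RDD[Double] = sc.parallelize((1 to 100).map(_.toDouble))
var sumY = xs.map(x => m * x + c).sum()

// Rescale the X values until sum(Y) is close enough to the target.
while (math.abs(sumY - target) > tolerance) {
  val factor = target / sumY
  xs = xs.map(_ * factor)
  sumY = xs.map(x => m * x + c).sum()
}
println(s"final sum(Y) = $sumY")
~~~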

Finding max value in spark streaming sliding window

2016-05-07 Thread Mich Talebzadeh
Hi, What is the easiest way of finding max(price) in the code below? object CEP_AVG { def main(args: Array[String]) { // Create a local StreamingContext with two working threads and a batch interval of 10 seconds. val sparkConf = new SparkConf(). setAppName("CEP_AVG"). setMast
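
One possible answer, as a hedged sketch: reduceByWindow with math.max over a DStream of prices. The socket source, port, window and slide durations are assumptions, since the thread's code is only partially visible.

~~~
// Hedged sketch: the socket source, port, window and slide are assumptions.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("CEP_AVG").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(10))

val lines = ssc.socketTextStream("localhost", 9999)
val prices = lines.map(_.toDouble)

// Maximum price over a 60-second window, sliding every 10 seconds.
val maxPrice = prices.reduceByWindow((a, b) => math.max(a, b), Seconds(60), Seconds(10))
maxPrice.print()

ssc.start()
ssc.awaitTermination()
~~~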

Re: createDirectStream with offsets

2016-05-07 Thread Eric Friedman
Thanks Cody. It turns out that there was an even simpler explanation (the flaw you pointed out was accurate too). I had mutable.Map instances being passed where KafkaUtils wants immutable ones. On Fri, May 6, 2016 at 8:32 AM, Cody Koeninger wrote: > Look carefully at the error message, the typ
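
For reference, a hedged sketch of the createDirectStream overload that takes explicit offsets in the Spark 1.x Kafka API; the broker, topic and offset values are assumptions. The fromOffsets argument must be an immutable Map, which was the fix described above.

~~~
// Hedged sketch: broker, topic and offsets are assumptions.
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("offsets-demo"), Seconds(5))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

// Plain Map(...) is scala.collection.immutable.Map, which is what KafkaUtils expects.
val fromOffsets = Map(TopicAndPartition("events", 0) -> 0L)

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

stream.print()
~~~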

Re: Correct way of setting executor numbers and executor cores in Spark 1.6.1 for non-clustered mode ?

2016-05-07 Thread Mich Talebzadeh
Check how much free memory you have on your host with /usr/bin/free. As heuristic values, start with these: export SPARK_EXECUTOR_CORES=4 ## Number of cores for the workers (Default: 1). export SPARK_EXECUTOR_MEMORY=8G ## Memory per worker (e.g. 1000M, 2G) (Default: 1G) export SPARK_DRIVER_MEMOR

Locality aware tree reduction

2016-05-07 Thread Ayman Khalil
Hello, Is there a way to instruct treeReduce() to reduce RDD partitions on the same node locally? In my case, I'm using treeReduce() to reduce map results in parallel. My reduce function is just arithmetically adding map results (i.e. no notion of aggregation by key). As far as I understand, a sh
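
For context, a minimal sketch of treeReduce with a plain arithmetic combine, as in the use case above; the depth parameter controls how many levels of partial aggregation run on the executors, but as far as I know it does not by itself guarantee that partitions on the same node are combined together.

~~~
// Minimal sketch: data size, partition count and depth are illustrative.
val values = sc.parallelize(1 to 1000000, numSlices = 100)

// Per-partition partial results are combined in a tree of the given depth
// before the final value reaches the driver.
val total = values.treeReduce(_ + _, depth = 2)
println(total)
~~~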

Correct way of setting executor numbers and executor cores in Spark 1.6.1 for non-clustered mode ?

2016-05-07 Thread kmurph
Hi, I'm running Spark 1.6.1 on a single machine, initially a small one (8 cores, 16GB RAM), passing "--master local[*]" to spark-submit, and I'm trying to see scaling with increasing cores, unsuccessfully. Initially I'm setting SPARK_EXECUTOR_INSTANCES=1 and increasing the cores for each executor. Th
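
A hedged sketch of the local-mode situation: with --master local[*] everything runs in a single JVM, so the YARN/standalone variables SPARK_EXECUTOR_INSTANCES and SPARK_EXECUTOR_CORES are not what controls parallelism; the N in local[N] is. The values below are assumptions.

~~~
// Hedged sketch: the core count and app name are assumptions.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("local-scaling-test")
  .setMaster("local[8]")        // 8 worker threads inside the single local JVM

val sc = new SparkContext(conf)
println(sc.defaultParallelism)  // normally matches the N in local[N]
~~~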

Re: Found Data Quality check package for Spark

2016-05-07 Thread Rick Moritz
Hi Divya, I haven't actually used the package yet, but maybe you should check out the Gitter room where the creator is quite active. You can find it at https://gitter.im/FRosner/drunken-data-quality . There you should be able to get the information you need. Best, Rick On 6 May 2016 12:34, "Div