Re: Unexpected caching behavior

2017-10-26 Thread pnpritchard
I'm not sure why the example code didn't come through, so I'll try again: val x = spark.range(100) val y = x.map(_.toString) println(x.storageLevel) //StorageLevel(1 replicas) println(y.storageLevel) //StorageLevel(1 replicas) x.cache().foreachPartition(_ => ()) y.cache().foreachPartition(_ => (
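Since the snippet above is cut off, here is a self-contained sketch of what the full reproduction presumably looks like (Spark 2.x, assuming a SparkSession named `spark`; the unpersist lines at the end are a reconstruction based on the behavior described in the original post below):
```
// Reconstruction of the truncated repro above; assumes an existing SparkSession `spark`.
import spark.implicits._

val x = spark.range(100)
val y = x.map(_.toString)

println(x.storageLevel) // StorageLevel(1 replicas), i.e. not cached yet
println(y.storageLevel) // StorageLevel(1 replicas)

// Cache both and force materialization with a no-op action.
x.cache().foreachPartition(_ => ())
y.cache().foreachPartition(_ => ())

println(x.storageLevel) // now shows a real memory/disk storage level
println(y.storageLevel)

// Unpersisting the upstream Dataset also clears the downstream one,
// which is the unexpected behavior reported in this thread.
x.unpersist()
println(y.storageLevel) // back to StorageLevel(1 replicas)
```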

Unexpected caching behavior

2017-10-26 Thread pnpritchard
I've noticed that when unpersisting an "upstream" Dataset, the "downstream" Dataset is also unpersisted. I did not expect this behavior, and RDDs do not behave this way. Below I've pasted a simple reproducible case. There are two datasets, x and y, where y is created by a
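For contrast, a minimal sketch of the RDD behavior alluded to here (an illustration, not code from the original message): unpersisting an upstream RDD does not unpersist an RDD derived from it.
```
// Illustrative only: RDD lineage does not cascade unpersist() calls.
// Assumes an existing SparkContext `sc`.
val a = sc.parallelize(1 to 100)
val b = a.map(_.toString)

a.cache().count()
b.cache().count()

a.unpersist()
println(b.getStorageLevel) // b stays cached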

Parquet datasource optimization for distinct query

2015-12-16 Thread pnpritchard
I have a parquet file that is partitioned by a column, as shown in http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery. I like this storage technique because the datasource can push down filters on the partitioned column, making some queries a lot faster. However
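A minimal sketch of the setup being described, assuming spark-shell 1.5/1.6 with an existing SparkContext `sc`; the path and column names are hypothetical, since the actual schema is not in the message:
```
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical data and path, written partitioned by the "date" column
// as in the partition-discovery docs linked above.
val df = sc.parallelize(Seq(("a", "2015-12-15"), ("b", "2015-12-16"))).toDF("id", "date")
df.write.partitionBy("date").parquet("/tmp/events")

val events = sqlContext.read.parquet("/tmp/events")

// A filter on the partition column can be answered by pruning directories...
events.filter($"date" === "2015-12-16").count()

// ...while the distinct-on-partition-column query from the question looks like:
events.select("date").distinct().show()
```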

Stage retry limit

2015-11-16 Thread pnpritchard
In my app, I see a condition where a stage fails and Spark retries it endlessly. I see the configuration for task retry limit (spark.task.maxFailures), but is there a configuration for limiting the number of stage retries? -- View this message in context: http://apache-spark-user-list.1001560.n
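For reference, a minimal sketch of setting the task-level limit named above (the value and app name are placeholders; whether a stage-level equivalent exists in this Spark version is exactly the open question, so none is shown):
```
import org.apache.spark.SparkConf

// Task-level retry limit mentioned in the question.
val conf = new SparkConf()
  .setAppName("retry-limit-example")
  .set("spark.task.maxFailures", "8")
```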

Do the Standalone cluster and applications need to be the same Spark version?

2015-11-02 Thread pnpritchard
The title gives the gist of it: do the Standalone cluster and applications need to be the same Spark version? For example, say I have a Standalone cluster running version 1.5.0. Can I run an application that was built with the Spark 1.5.1 library, using the spark-submit script from the 1.5.1 release

Re: SPARK SQL Error

2015-10-15 Thread pnpritchard
Going back to your code, I see that you instantiate the spark context as: val sc = new SparkContext(args(0), "Csv loading example") which will set the master url to "args(0)" and app name to "Csv loading example". In your case, args(0) is "hdfs://quickstart.cloudera:8020/people_csv", which obviou
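A short sketch of the distinction being drawn (the rest of the message is cut off, so the alternative shown here is one common option, not necessarily the one recommended in the thread): the two-argument constructor expects the master URL first, so a safer pattern is to set only the app name and let spark-submit supply the master.
```
import org.apache.spark.{SparkConf, SparkContext}

// The two-argument constructor treats the FIRST argument as the master URL:
//   new SparkContext(master, appName)
// so passing an HDFS path there is what triggers the error discussed in this thread.

// Common alternative: leave the master to spark-submit and only set the app name.
val conf = new SparkConf().setAppName("Csv loading example")
val sc = new SparkContext(conf)

// The HDFS path can then be passed as an ordinary program argument, e.g. args(0).
```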

Re: SPARK SQL Error

2015-10-14 Thread pnpritchard
I think the stack trace is quite informative. Assuming line 10 of CsvDataSource is "val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> args(1),"header"->"true"))", then the "args(1)" call is throwing an ArrayIndexOutOfBoundsException. The reason is that you aren't passi
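A sketch of the kind of guard this points toward, reusing the load call quoted above; the class structure and usage string are hypothetical stand-ins for the poster's CsvDataSource:
```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CsvDataSource {
  def main(args: Array[String]): Unit = {
    // Fail fast with a clear message instead of an ArrayIndexOutOfBoundsException
    // when the CSV path isn't passed as the second program argument.
    if (args.length < 2) {
      System.err.println("Usage: CsvDataSource <first-arg> <csv-path>")
      sys.exit(1)
    }

    // Master supplied via spark-submit; app name only here.
    val sc = new SparkContext(new SparkConf().setAppName("Csv loading example"))
    val sqlContext = new SQLContext(sc)

    // The load call quoted above, now guarded.
    val df = sqlContext.load(
      "com.databricks.spark.csv",
      Map("path" -> args(1), "header" -> "true"))

    df.show()
  }
}
```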

Re: SPARK SQL Error

2015-10-13 Thread pnpritchard
Your app jar should be at the end of the command, without the --jars prefix. That option is only necessary if you have more than one jar to put on the classpath (i.e. dependency jars that aren't packaged inside your app jar). spark-submit --master yarn --class org.spark.apache.CsvDataSource --file
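An illustration of the ordering described above, using the class name from this thread and placeholder paths:
```
# The application jar goes last, with no --jars prefix; program arguments follow it.
# --jars would only be added for extra dependency jars (paths here are placeholders).
spark-submit \
  --master yarn \
  --class org.spark.apache.CsvDataSource \
  /path/to/your-app.jar \
  hdfs://quickstart.cloudera:8020/people_csv
```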

Re: ClassCastException when use spark1.5.1

2015-10-12 Thread pnpritchard
I'm not sure why this would have changed from 1.4.1 to 1.5.1, but I have seen similar exceptions in my code. It seems to me that values with SQL type "ArrayType" are stored internally as an instance of the Scala "WrappedArray" class (regardless of whether it was originally an instance of Scala "List")
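A small sketch of the workaround this implies (an illustration, not code from the thread), assuming spark-shell 1.5 where `sc` and `sqlContext` already exist: read ArrayType columns back as a Seq instead of casting to List.
```
import sqlContext.implicits._

case class Record(tags: List[String])
val df = sc.parallelize(Seq(Record(List("a", "b")))).toDF()

val row = df.first()

// Safe: ArrayType columns come back as a Seq (in practice a WrappedArray).
val tags: Seq[String] = row.getAs[Seq[String]]("tags")

// Risky: this cast can throw ClassCastException, because the stored value
// is a WrappedArray rather than a scala.List.
// val badTags = row.getAs[List[String]]("tags")
```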

Re: why would a spark Job fail without throwing run-time exceptions?

2015-10-12 Thread pnpritchard
I'm not sure why Spark is not showing the runtime exception in the logs. However, I think I can point out why the stage is failing. 1. "lineMapToStockPriceInfoObjectRDD.map(new stockDataFilter(_).requirementsMet.get)" The ".get" will throw a runtime exception when "requirementsMet" is None. I wou
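A sketch of the safer pattern this points toward, assuming spark-shell with an existing `sc`; the types below are hypothetical stand-ins for the poster's stockDataFilter classes:
```
import scala.util.Try

// Stand-ins for the poster's types, just to make the pattern concrete.
case class StockPriceInfo(symbol: String, price: Double)
class StockDataFilter(line: String) {
  // Hypothetical: Some(info) only when the line meets the requirements.
  def requirementsMet: Option[StockPriceInfo] =
    line.split(",") match {
      case Array(symbol, price) => Try(StockPriceInfo(symbol, price.toDouble)).toOption
      case _                    => None
    }
}

val lineRDD = sc.parallelize(Seq("AAPL,120.5", "not a valid line"))

// Instead of .map(...).get, flatMap over the Option so None rows are dropped
// rather than failing the task with an exception.
val filtered = lineRDD.flatMap(line => new StockDataFilter(line).requirementsMet)
filtered.collect().foreach(println)
```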

Spark UI consuming lots of memory

2015-10-12 Thread pnpritchard
Hi, In my application, the Spark UI is consuming a lot of memory, especially the SQL tab. I have set the following configurations to reduce the memory consumption: - spark.ui.retainedJobs=20 - spark.ui.retainedStages=40 - spark.sql.ui.retainedExecutions=0 However, I still get OOM errors in the dr
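For reference, a minimal sketch of setting the three values listed above programmatically (spark-defaults.conf or spark-submit --conf would work equally well):
```
import org.apache.spark.SparkConf

// The three UI-retention settings mentioned in the post, with the same values.
val conf = new SparkConf()
  .setAppName("ui-memory-example")
  .set("spark.ui.retainedJobs", "20")
  .set("spark.ui.retainedStages", "40")
  .set("spark.sql.ui.retainedExecutions", "0")
```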

Help needed to reproduce bug

2015-10-06 Thread pnpritchard
Hi Spark community, I was hoping someone could help me by running the code snippet below in the Spark shell and checking whether they see the same buggy behavior I do. Full details of the bug can be found in this JIRA issue I filed: https://issues.apache.org/jira/browse/SPARK-10942. The issue was closed

Re: DataFrame.withColumn() recomputes columns even after cache()

2015-07-14 Thread pnpritchard
I was able to workaround this by converting the DataFrame to an RDD and then back to DataFrame. This seems very weird to me, so any insight would be much appreciated! Thanks, Nick P.S. Here's the updated code with the workaround: ``` // Examples udf's that println when called val twice =
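Since the code above is cut off, here is a self-contained sketch of the round-trip workaround being described, assuming Spark 1.4-style SQLContext with an existing `sc`; the column names and tiny input are illustrative:
```
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Example UDF that prints when called, to make any recomputation visible.
val twice = udf { (x: Int) => println(s"computing twice($x)"); x * 2 }

val df = sc.parallelize(1 to 3).map(Tuple1(_)).toDF("n").withColumn("n2", twice($"n"))

// Workaround from the post: round-trip through an RDD so the computed
// columns are materialized, then cache the rebuilt DataFrame.
val materialized = sqlContext.createDataFrame(df.rdd, df.schema).cache()
materialized.collect() // the UDF should print here, on the first action...
materialized.collect() // ...and not again, since the cached rows are reused
```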

DataFrame.withColumn() recomputes columns even after cache()

2015-07-14 Thread pnpritchard
Hi! I am seeing some unexpected behavior with regard to cache() in DataFrames. Here goes: In my Scala application, I have created a DataFrame that I run multiple operations on. It is expensive to recompute the DataFrame, so I have called cache() after it gets created. I notice that the cache()

Fair Scheduler Pools

2015-02-24 Thread pnpritchard
Hi, I am trying to use the fair scheduler pools (http://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools) to schedule two jobs at the same time. In my simple example, I have configured spark in local mode with 2 cores ("local[2]"). I have also configured two pools in fairsche
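A sketch of the usual wiring for this, with a hypothetical allocation-file path and pool name since the fairscheduler.xml contents are cut off above:
```
import org.apache.spark.{SparkConf, SparkContext}

// Enable the FAIR scheduler and point at an allocation file defining the pools.
// The file path and pool name are stand-ins for the poster's setup.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("fair-pools-example")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

val sc = new SparkContext(conf)

// Jobs submitted from a thread use the pool set as a thread-local property.
sc.setLocalProperty("spark.scheduler.pool", "pool1")
sc.parallelize(1 to 100).count()
```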

Force RDD evaluation

2015-02-20 Thread pnpritchard
Is there a technique for forcing the evaluation of an RDD? I have used actions to do so but even the most basic "count" has a non-negligible cost (even on a cached RDD, repeated calls to count take time). My use case is for logging the execution time of the major components in my application. At
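One commonly suggested alternative to count, shown here as an illustration rather than the answer given in the (truncated) thread: a no-op foreachPartition, which for a cached RDD still materializes every partition in the block manager.
```
// Assumes an existing SparkContext `sc`.
val rdd = sc.parallelize(1 to 1000000).map(_ * 2).cache()

// For a cached RDD, requesting each partition materializes it for storage,
// so a no-op foreachPartition forces evaluation without count()'s
// per-element aggregation cost.
val start = System.nanoTime()
rdd.foreachPartition(_ => ())
println(s"materialization took ${(System.nanoTime() - start) / 1e6} ms")
```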

Stopping a Custom Receiver

2015-02-20 Thread pnpritchard
Hi, I have a use case for creating a DStream from a single file. I have created a custom receiver that reads the file, calls 'store' with the contents, then calls 'stop'. However, I'm second-guessing whether this is the correct approach due to the Spark logs I see. I always see these logs, and the 'ER
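For context, a minimal sketch of the kind of receiver being described, with illustrative names and a line-oriented file; the real receiver and the exact log lines are not shown in the truncated message:
```
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import scala.io.Source

// Illustrative single-file receiver: read the file once, store its lines,
// then ask the supervisor to stop the receiver.
class SingleFileReceiver(path: String)
    extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  override def onStart(): Unit = {
    new Thread("single-file-receiver") {
      override def run(): Unit = {
        val source = Source.fromFile(path)
        try source.getLines().foreach(line => store(line)) finally source.close()
        stop("Finished reading " + path)
      }
    }.start()
  }

  override def onStop(): Unit = () // nothing to clean up in this sketch
}
```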

StreamingContext getOrCreate with queueStream

2015-02-05 Thread pnpritchard
I am trying to use the StreamingContext "getOrCreate" method in my app. I started by running the example ( RecoverableNetworkWordCount ), which worked as ex
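For reference, a sketch of the getOrCreate wiring from that example with a queueStream substituted in; the path and batch interval are placeholders, and this only shows the wiring, not a claim that queueStream recovers cleanly from a checkpoint:
```
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/tmp/checkpoint" // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setMaster("local[2]").setAppName("queueStream-getOrCreate")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)

  // A queueStream fed by a single locally built RDD, just for illustration.
  val queue = mutable.Queue[RDD[Int]]()
  queue += ssc.sparkContext.parallelize(1 to 10)
  ssc.queueStream(queue).print()

  ssc
}

// Reuses a checkpointed context if one exists, otherwise builds a new one.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```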