Re: OutputMetrics empty for DF writes - any hints?

2017-12-12 Thread Jason White
It should be in the first email in this chain. On Tue, Dec 12, 2017, 7:10 PM Ryan Blue wrote: > Great. What's the JIRA issue? > On Mon, Dec 11, 2017 at 8:12 PM, Jason White wrote: >> Yes, the fix has been merged and should make it into the 2.3 release.

Re: OutputMetrics empty for DF writes - any hints?

2017-12-11 Thread Jason White
2017 at 12:59 PM, Jason White wrote: >> It doesn't look like the insert command has any metrics in it. I don't see any commands with metrics, but I could be missing something.

Re: OutputMetrics empty for DF writes - any hints?

2017-11-27 Thread Jason White
It doesn't look like the insert command has any metrics in it. I don't see any commands with metrics, but I could be missing something.

Re: OutputMetrics empty for DF writes - any hints?

2017-11-27 Thread Jason White
I think the difference lies somewhere in here:
- RDD writes are done with SparkHadoopMapReduceWriter.executeTask, which calls outputMetrics.setRecordsWritten
- DF writes are done with InsertIntoHadoopFsRelationCommand.run, which I'm not entirely sure how it works. executeTask appears to be run on
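
A minimal sketch of the two write paths being compared, assuming Spark 2.x and illustrative output paths (not code from the thread itself):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("write-paths").getOrCreate()

    // RDD path: goes through SparkHadoopMapReduceWriter.executeTask,
    // which updates outputMetrics per task.
    spark.sparkContext.parallelize(1 to 100).saveAsTextFile("/tmp/rdd-out")

    // DataFrame path: goes through InsertIntoHadoopFsRelationCommand.run,
    // which, in the versions discussed in this thread, left output metrics empty.
    spark.range(100).write.mode("overwrite").parquet("/tmp/df-out")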

OutputMetrics empty for DF writes - any hints?

2017-11-27 Thread Jason White
I'd like to use the SparkListenerInterface to listen for some metrics for monitoring/logging/metadata purposes. The first ones I'm interested in hooking into are recordsWritten and bytesWritten as a measure of throughput. I'm using PySpark to write Parquet files from DataFrames. I'm able to extrac
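
A rough sketch of hooking these metrics through a SparkListener, in Scala rather than PySpark and with a made-up output path; a PySpark listener would need a py4j-backed implementation and is omitted here:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("output-metrics-probe").getOrCreate()

    // Print per-task output metrics as tasks finish.
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        Option(taskEnd.taskMetrics).foreach { m =>
          val out = m.outputMetrics
          println(s"recordsWritten=${out.recordsWritten} bytesWritten=${out.bytesWritten}")
        }
      }
    })

    // The kind of DataFrame write whose metrics this thread found to be empty.
    spark.range(1000).write.mode("overwrite").parquet("/tmp/listener-test")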

Re: distributed computation of median

2017-04-17 Thread Jason White
Have you looked at t-digests? Calculating percentiles (including medians) is something that is inherently difficult/inefficient to do in a distributed system. T-digests provide a useful probabilistic structure to allow you to compute any percentile with a known (and tunable) margin of error. http
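
For illustration, Spark also ships an error-bounded approximate quantile (built on a different sketch than t-digest, but with the same tunable-error trade-off); a small sketch with a hypothetical "latency_ms" column:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("median-sketch").getOrCreate()

    // Hypothetical dataset: one numeric column named "latency_ms".
    val df = spark.range(1000000).selectExpr("id % 997 as latency_ms")

    // Approximate median with a 1% relative error bound; a tighter bound
    // costs more memory and communication, the same trade-off t-digest tunes.
    val Array(median) = df.stat.approxQuantile("latency_ms", Array(0.5), 0.01)
    println(s"approximate median: $median")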

Re: Why are DataFrames always read with nullable=True?

2017-03-21 Thread Jason White
Thanks for pointing to those JIRA tickets, I hadn't seen them. Encouraging that they are recent. I hope we can find a solution there.

Why are DataFrames always read with nullable=True?

2017-03-20 Thread Jason White
If I create a dataframe in Spark with non-nullable columns, and then save that to disk as a Parquet file, the columns are properly marked as non-nullable. I confirmed this using parquet-tools. Then, when loading it back, Spark forces the nullable back to True. https://github.com/apache/spark/blob/
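
A small reproduction sketch of the behaviour described here, assuming Spark 2.x and an illustrative path:

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    val spark = SparkSession.builder().appName("nullable-roundtrip").getOrCreate()

    val schema = StructType(Seq(StructField("id", LongType, nullable = false)))
    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row(1L), Row(2L))), schema)

    df.printSchema()   // id: long (nullable = false)
    df.write.mode("overwrite").parquet("/tmp/nonnull-test")

    // On read, Spark marks the column nullable again, even though the
    // Parquet metadata recorded it as required (non-nullable).
    spark.read.parquet("/tmp/nonnull-test").printSchema()  // id: long (nullable = true)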

Re: ArrayType support in Spark SQL

2016-09-25 Thread Jason White
Continuing to dig, I encountered: https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/LiteralExpressionSuite.scala#L125 // TODO(davies): add tests for ArrayType, MapType and StructType I guess others have thought of this already, jus

ArrayType support in Spark SQL

2016-09-25 Thread Jason White
It seems that `functions.lit` doesn't support ArrayTypes. To reproduce: org.apache.spark.sql.functions.lit(2 :: 1 :: Nil) java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon List(2, 1) at org.apache.spark.sql.catalyst.expressions.Literal$.apply(lit
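
A hedged workaround sketch, not part of the original report: build the array literal element-wise with functions.array; later Spark releases also added functions.typedLit, which accepts Scala collections directly.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{array, lit}

    val spark = SparkSession.builder().appName("array-literal").getOrCreate()

    // Fails with "Unsupported literal type" in the version discussed here:
    // org.apache.spark.sql.functions.lit(2 :: 1 :: Nil)

    // Workaround: lift each element with lit and assemble an array column.
    val arrayCol = array(lit(2), lit(1))
    spark.range(1).select(arrayCol.alias("xs")).show()  // xs -> [2, 1]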

Re: How to run PySpark tests?

2016-02-18 Thread Jason White
Compiling with `build/mvn -Pyarn -Phadoop-2.4 -Phive -Dhadoop.version=2.4.0 -DskipTests clean package` followed by `python/run-tests` seemed to do the trick! Thanks!

How to run PySpark tests?

2016-02-18 Thread Jason White
Hi, I'm trying to finish up a PR (https://github.com/apache/spark/pull/10089) which is currently failing PySpark tests. The instructions to run the test suite seem a little dated. I was able to find these: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals http://spark.apache.org/