Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-01 Thread Nick Pentreath
Hi Maciej, If you're seeing a regression from 1.6 -> 2.0 *both using DataFrames*, then that seems to point to some other underlying issue as the root cause. Even though adding checkpointing should help, we should understand why it's different between 1.6 and 2.0. On Thu, 2 Feb 2017 at 08:22 Liang

Re: approx_percentile computation

2017-02-01 Thread Liang-Chi Hsieh
Hi, You don't need to run approxPercentile against a list. Since it is an aggregation function, you can simply run: // Just to illustrate the idea. val approxPercentile = new ApproximatePercentile(v1, Literal(percentage)) val agg_approx_percentile = Column(approxPercentile.toAggregateExpression
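
A minimal sketch of the same idea using the public API instead of the internal ApproximatePercentile expression; it assumes Spark 2.1+ and a hypothetical DataFrame with a numeric column named "value":

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.expr

  val spark = SparkSession.builder().appName("approx-percentile-sketch").master("local[*]").getOrCreate()
  import spark.implicits._

  val df = Seq(1.0, 2.0, 3.0, 4.0, 5.0).toDF("value")

  // percentile_approx is the SQL-facing name of ApproximatePercentile,
  // so it can be used as an ordinary aggregation expression.
  val medianDf = df.agg(expr("percentile_approx(value, 0.5)").as("approx_median"))
  medianDf.show()

  // DataFrameStatFunctions also exposes approxQuantile directly.
  val Array(q50) = df.stat.approxQuantile("value", Array(0.5), 0.01)
  println(s"approximate median = $q50")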

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-01 Thread Liang-Chi Hsieh
Hi Maciej, FYI, the PR is at https://github.com/apache/spark/pull/16775. Liang-Chi Hsieh wrote > Hi Maciej, > > Basically the fitting algorithm in Pipeline is an iterative operation. > Running iterative algorithm on Dataset would have RDD lineages and query > plans that grow fast. Without cach

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-01 Thread Liang-Chi Hsieh
Hi Maciej, Basically the fitting algorithm in Pipeline is an iterative operation. Running an iterative algorithm on a Dataset produces RDD lineages and query plans that grow quickly. Without cache and checkpoint, it gets slower as the iteration count increases. I think that is why, when you run a Pipe
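
To make the lineage-growth point concrete, here is a minimal sketch (not from the original thread) of an iterative loop over a DataFrame that truncates the plan periodically; it assumes Spark 2.1+, where Dataset.checkpoint is available, an existing SparkSession named spark, and hypothetical names (updateStep, numIter):

  import org.apache.spark.sql.DataFrame

  // `spark` is an existing SparkSession.
  spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

  def runIterations(start: DataFrame, numIter: Int)(updateStep: DataFrame => DataFrame): DataFrame = {
    var current = start
    for (i <- 1 to numIter) {
      current = updateStep(current)
      // Without this, the query plan (and RDD lineage) keeps growing with each
      // iteration, and planning/optimization time eventually dominates.
      // Checkpointing (or caching) every few iterations truncates the lineage.
      if (i % 10 == 0) current = current.checkpoint(eager = true)
    }
    current
  }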

Re: Structured Streaming Schema Issue

2017-02-01 Thread Sam Elamin
There isn't a query per se. I'm writing the entire dataframe from the output of the read stream. Once I got that working I was planning to test the query aspect. I'll do a bit more digging. Thank you very much for your help. Structured Streaming is very exciting and I really am enjoying writing a con

Re: Structured Streaming Schema Issue

2017-02-01 Thread Tathagata Das
What is the query you are applying writeStream on? Essentially, can you print the whole query? Also, you can do StreamingQuery.explain() to see in full detail how the logical plan changes to a physical plan for a batch of data. That might help. Try doing that with some other sink to make sure the sour
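
A minimal sketch of that suggestion; the source format string is a placeholder, not the actual BigQuery connector, and the console sink stands in for "some other sink":

  import org.apache.spark.sql.streaming.StreamingQuery

  val input = spark.readStream
    .format("your.custom.source")   // placeholder for the custom source
    .load()

  // Use a known sink (console) to rule the sink out while debugging the source.
  val query: StreamingQuery = input.writeStream
    .format("console")
    .start()

  // Shows in detail how the logical plan is turned into a physical plan for a batch.
  query.explain(extended = true)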

Re: Structured Streaming Schema Issue

2017-02-01 Thread Sam Elamin
Yeah, sorry, I'm still working on it; it's on a branch you can find here, ignore the logging messages. I was trying to work out how the APIs work and unfortunately

Re: Structured Streaming Schema Issue

2017-02-01 Thread Tathagata Das
I am assuming that you have written your own BigQuerySource (I don't see that code in the link you posted). In that source, you must have implemented getBatch, which uses offsets to return the DataFrame containing the data of a batch. Can you double check whether this DataFrame returned by getBatch has the
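
For reference, a hypothetical skeleton of a Structured Streaming Source (an internal API in Spark 2.x) showing where that check applies; the class name and field names are illustrative, not taken from the connector being discussed:

  import org.apache.spark.sql.{DataFrame, SQLContext}
  import org.apache.spark.sql.execution.streaming.{Offset, Source}
  import org.apache.spark.sql.types.{StructField, StringType, StructType}

  class ExampleSource(sqlContext: SQLContext) extends Source {
    override val schema: StructType =
      StructType(Seq(StructField("payload", StringType)))

    override def getOffset: Option[Offset] = ???   // track source progress here

    override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
      val df: DataFrame = ???                      // build the DataFrame for this batch
      // The DataFrame returned here should carry the same schema the source
      // declares; if it differs, downstream operators see a mismatched schema.
      assert(df.schema == schema, s"getBatch schema ${df.schema} != declared $schema")
      df
    }

    override def stop(): Unit = ()
  }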

Re: Structured Streaming Schema Issue

2017-02-01 Thread Sam Elamin
Thanks for the quick response TD! I've been trying to identify where exactly this transformation happens. The readStream returns a DataFrame with the correct schema. The minute I call writeStream, by the time I get to the addBatch method, the DataFrame there has an incorrect schema. So I'm skeptical

Re: Structured Streaming Schema Issue

2017-02-01 Thread Tathagata Das
You should make sure that the schema of the streaming Dataset returned by `readStream` matches the schema of the DataFrame returned by the source's getBatch. On Wed, Feb 1, 2017 at 3:25 PM, Sam Elamin wrote: > Hi All > > I am writing a bigquery connector here >
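
A quick way to compare the two from the outside, sketched with hypothetical names (customSource, endOffset):

  // Schema the streaming side reports:
  val streamingSchema = spark.readStream.format("your.custom.source").load().schema

  // Schema of the DataFrame the source actually builds for a batch:
  val batchSchema = customSource.getBatch(None, endOffset).schema

  println(streamingSchema.treeString)
  println(batchSchema.treeString)
  assert(streamingSchema == batchSchema,
    "schema from readStream differs from the schema of the DataFrame built in getBatch")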

Structured Streaming Schema Issue

2017-02-01 Thread Sam Elamin
Hi All, I am writing a bigquery connector here and I am getting a strange error with schemas being overwritten when a dataframe is passed over to the Sink. For example, the source returns this StructType: WARN streaming.BigQuerySource: StructType(StructFi

Re: Spark SQL Dataframe resulting from an except( ) is unusable

2017-02-01 Thread Liang-Chi Hsieh
Hi Vinayak, Thanks for reporting this. I don't think it is left out intentionally for UserDefinedType. If you already know how the UDT is represented in its internal format, you can explicitly convert the UDT column to other SQL types; then you may get around this problem. It is a bit hacky, anyway.
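
As a concrete illustration of that workaround, a hedged sketch assuming the UDT column is a spark.ml Vector; the DataFrames df1/df2 and the column name "features" are hypothetical:

  import org.apache.spark.ml.linalg.Vector
  import org.apache.spark.sql.functions.{col, udf}

  // Convert the UDT column into a plain SQL type (array<double>) that the
  // set operation can compare, then run except() on the converted frames.
  val vecToArray = udf((v: Vector) => v.toArray)

  val left  = df1.withColumn("features_arr", vecToArray(col("features"))).drop("features")
  val right = df2.withColumn("features_arr", vecToArray(col("features"))).drop("features")

  val diff = left.except(right)
  diff.show()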