Re: Structured Streaming Schema Issue

2017-02-03 Thread Sam Elamin
Hey TD, I figured out what was happening. My source would return the correct schema, but the schema on the returned DataFrame was actually different. I'm loading JSON data from cloud storage, and its schema gets inferred instead of set. So basically the schema I return on the source provider wasn't actually being
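
A minimal sketch of the kind of mismatch described above, in plain Python rather than Spark. It assumes a hypothetical declared schema and hypothetical sample records (neither comes from the thread), and the type names and helper functions are illustrative only, not Spark APIs. It shows how a type inferred from JSON data can differ from the declared one, and how comparing the two catches the problem early:

```python
import json

# Hypothetical declared schema: field name -> type name.
# This is NOT Spark's StructType, just a stand-in for illustration.
DECLARED = {"customer_id": "long", "name": "string"}


def infer_type(value):
    """Map a decoded JSON value to a coarse type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    return "unknown"


def infer_schema(lines):
    """Infer field types from JSON lines; conflicting types fall back to string."""
    schema = {}
    for line in lines:
        for field, value in json.loads(line).items():
            inferred = infer_type(value)
            if field in schema and schema[field] != inferred:
                inferred = "string"
            schema[field] = inferred
    return schema


def schema_diff(declared, actual):
    """Return {field: (declared_type, actual_type)} for every mismatch."""
    fields = sorted(set(declared) | set(actual))
    return {
        f: (declared.get(f), actual.get(f))
        for f in fields
        if declared.get(f) != actual.get(f)
    }


# Hypothetical data: customer_id is a quoted string, so inference sees "string".
sample = [
    '{"customer_id": "42", "name": "a"}',
    '{"customer_id": "43", "name": "b"}',
]
inferred = infer_schema(sample)
mismatches = schema_diff(DECLARED, inferred)
print(mismatches)
```

Running a comparison like this on the data a source actually produces, against the schema it declares, surfaces the disagreement before it reaches the sink.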

Re: Structured Streaming Schema Issue

2017-02-02 Thread Sam Elamin
Hi All, I've done a bit more digging into where exactly this happens. It seems like the schema is inferred again after the data leaves the source and before it comes into the sink. Below is a stack trace: the schema at the BigQuerySource has a LongType for customer id, but then at the sink, the data received

Re: Structured Streaming Schema Issue

2017-02-01 Thread Sam Elamin
There isn't a query per se. I'm writing the entire DataFrame from the output of the read stream. Once I got that working, I was planning to test the query aspect. I'll do a bit more digging. Thank you very much for your help. Structured Streaming is very exciting and I really am enjoying writing a con

Re: Structured Streaming Schema Issue

2017-02-01 Thread Tathagata Das
What is the query you are applying writeStream on? Essentially, can you print the whole query? Also, you can call StreamingQuery.explain() to see in full detail how the logical plan changes to the physical plan for a batch of data. That might help. Try doing that with some other sink to make sure the sour

Re: Structured Streaming Schema Issue

2017-02-01 Thread Sam Elamin
Yeah, sorry, I'm still working on it. It's on a branch you can find here. Ignore the logging messages; I was trying to work out how the APIs work and unfortunately

Re: Structured Streaming Schema Issue

2017-02-01 Thread Tathagata Das
I am assuming that you have written your own BigQuerySource (I don't see that code in the link you posted). In that source, you must have implemented getBatch, which uses offsets to return the DataFrame having the data of a batch. Can you double-check whether this DataFrame returned by getBatch has the

Re: Structured Streaming Schema Issue

2017-02-01 Thread Sam Elamin
Thanks for the quick response, TD! I've been trying to identify where exactly this transformation happens. The readStream returns a DataFrame with the correct schema. The minute I call writeStream, by the time I get to the addBatch method, the DataFrame there has an incorrect schema. So I'm skeptical

Re: Structured Streaming Schema Issue

2017-02-01 Thread Tathagata Das
You should make sure that the schema of the streaming Dataset returned by `readStream` and the schema of the DataFrame returned by the source's getBatch match. On Wed, Feb 1, 2017 at 3:25 PM, Sam Elamin wrote: > Hi All > > I am writing a BigQuery connector here >