Re: [Pyspark, SQL] Very slow IN operator

2017-04-06 Thread Fred Reiss
If you just want to emulate pushing down a join, you can just wrap the IN list query in a JDBCRelation directly: scala> val r_df = spark.read.format("jdbc").option("url", > "jdbc:h2:/tmp/testdb").option("dbtable", "R").load() > r_df: org.apache.spark.sql.DataFrame = [A: int] > scala> r_df.show > +

Re: Mini-Proposal: Make it easier to contribute to the contributing to Spark Guide

2016-10-20 Thread Fred Reiss
Great idea! If the developer docs are in github, then new contributors who find errors or omissions can update the docs as an introduction to the PR process. Fred On Wed, Oct 19, 2016 at 5:46 PM, Reynold Xin wrote: > For the contributing guide I think it makes more sense to put it in > apache/s

Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-14 Thread Fred Reiss
nd quality of service characteristics for multiple users. Then your >> only latency concerns are event to update, not request to response. >> >> On Thu, Oct 13, 2016 at 10:39 AM, Fred Reiss >> wrote: >> > On Tue, Oct 11, 2016 at 11:02 AM, Shivaram Venkataraman >

Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-13 Thread Fred Reiss
On Tue, Oct 11, 2016 at 11:02 AM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > > > > Could you expand a little bit more on stability ? Is it just bursty > workloads in terms of peak vs. average throughput ? Also what level of > latencies do you find users care about ? Is it on the o

Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-12 Thread Fred Reiss
On Tue, Oct 11, 2016 at 10:57 AM, Reynold Xin wrote: > > On Tue, Oct 11, 2016 at 10:55 AM, Michael Armbrust > wrote: > >> *Complex event processing and state management:* Several groups I've >>> talked to want to run a large number (tens or hundreds of thousands now, >>> millions in the near fut

StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-11 Thread Fred Reiss
On Thu, Oct 6, 2016 at 12:37 PM, Michael Armbrust > wrote: > > [snip!] > Relatedly, I'm curious to hear more about the types of questions you are > getting. I think the dev list is a good place to discuss applications and > if/how structured streaming can handle them. > Details are difficult to s

Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-05 Thread Fred Reiss
Thanks for the thoughtful comments, Michael and Shivaram. From what I’ve seen in this thread and on JIRA, it looks like the current plan with regard to application-facing APIs for sinks is roughly: 1. Rewrite incremental query compilation for Structured Streaming. 2. Redesign Structured Streaming's

Re: welcoming Xiao Li as a committer

2016-10-05 Thread Fred Reiss
Congratulations, Xiao! Fred On Tuesday, October 4, 2016, Joseph Bradley wrote: > Congrats! > > On Tue, Oct 4, 2016 at 4:09 PM, Kousuke Saruta > wrote: > >> Congratulations Xiao! >> >> - Kousuke >> On 2016/10/05 7:44, Bryan Cutler wrote: >> >> Congrats Xiao! >> >> On Tue, Oct 4, 2016 at 11:14 A

Re: Test fails when compiling spark with tests

2016-09-14 Thread Fred Reiss
Also try doing a fresh clone of the git repository. I've seen some of those rare failure modes corrupt parts of my local copy in the past. FWIW the main branch as of yesterday afternoon is building fine in my environment. Fred On Tue, Sep 13, 2016 at 6:29 PM, Jakob Odersky wrote: > There are s

Re: Spark SQL - Applying transformation on a struct inside an array

2016-09-14 Thread Fred Reiss
+1 to this request. I talked last week with a product group within IBM that is struggling with the same issue. It's pretty common in data cleaning applications for data in the early stages to have nested lists or sets with inconsistent or incomplete schema information. Fred On Tue, Sep 13, 2016 at 8:0

Re: FileStreamSource source checks path eagerly?

2016-09-08 Thread Fred Reiss
The input directory does need to be visible from the driver process, since FileStreamSource does its polling from the driver. FileStreamSource creates a Dataset for each microbatch. I suppose the type-inference-time check for the presence of the input directory could be moved to the FileStreamSour

Re: Structured Streaming with Kafka sources/sinks

2016-08-29 Thread Fred Reiss
I think that the community really needs some feedback on the progress of this very important task. Many existing Spark Streaming applications can't be ported to Structured Streaming without Kafka support. Is there a design document somewhere? Or can someone from the Databricks team break down the

Re: Source API requires unbounded distributed storage?

2016-08-08 Thread Fred Reiss
ge in the future if we do async checkpointing of > internal state. > > You are totally right that we should relay this info back to the source. > Opening a JIRA sounds like a good first step. > > On Thu, Aug 4, 2016 at 4:38 PM, Fred Reiss wrote: > >> Hi, >>

Source API requires unbounded distributed storage?

2016-08-04 Thread Fred Reiss
Hi, I've been looking over the Source API in org.apache.spark.sql.execution.streaming, and I'm at a loss for how the current API can be implemented in a practical way. The API defines a single getBatch() method for fetching records from the source, with the following Scaladoc comments defining the