Re: Allowing Unicode Whitespace in Lexer

2024-03-25 Thread Alex Cruise
While we're at it, maybe consider allowing "smart quotes" too :) -0xe1a On Sat, Mar 23, 2024 at 5:29 PM serge rielau.com wrote: > Hello, > > I have a PR https://github.com/apache/spark/pull/45620 ready to go that > will extend the definition of whitespace (what separates token) from the > smal

Query hints visible to DSV2 connectors?

2023-08-02 Thread Alex Cruise
Hey folks, I'm adding an optional feature to my DSV2 connector where it can choose between a row-based or columnar PartitionReader dynamically depending on a query's schema. I'd like to be able to supply a hint at query time that's visible to the connector, but at the moment I can't see any way to

Re: Late materialization?

2023-05-31 Thread Alex Cruise
DML or compactions are happening behind the query's back, but presumably Spark users already have this class of problem, it's just less serious when the end-to-end execution time of a query is shorter. WDYT? -0xe1a On Wed, May 31, 2023 at 11:03 AM Alex Cruise wrote: > Hey folks, I

Late materialization?

2023-05-31 Thread Alex Cruise
Hey folks, I'm building a Spark connector for my company's proprietary data lake... That project is going fine despite the near total lack of documentation. ;) In parallel, I'm also trying to figure out a better story for when humans inevitably `select * from 100_trillion_rows`, glance at the firs

planInputPartitions being called twice

2023-05-12 Thread Alex Cruise
(I posted this on Slack originally) Hey folks, I’m writing a batch connector for an in-house data lake and doing some performance work now… I’ve noticed my ScanBuilder creates a Scan exactly once, but its toBatch method is being called three times, returning the identical object every time, then t

Recent paper that might be relevant to pushdown and other optimizations

2023-04-21 Thread Alex Cruise
Optimizing Query Predicates with Disjunctions for Column Stores https://arxiv.org/pdf/2002.00540.pdf [abstract at the end of my message] I just googled [predicate pushdown cnf] and it's WILD to me that this paper came up in the first page of search results, and was published last year. It mention

Re: Adding new connectors

2023-03-27 Thread Alex Cruise
On Fri, Mar 24, 2023 at 11:23 AM Alex Cruise wrote: > I found ExternalCatalog a few days ago and have been implementing one of > those, but it seems like DataSourceRegister / SupportsCatalogOptions is > another popular approach. I'm not sure offhand how they overlap/intersect

Re: Adding new connectors

2023-03-24 Thread Alex Cruise
On Fri, Mar 24, 2023 at 3:18 PM John Zhuge wrote: > Is this similar to Iceberg's hidden partitioning > ? > Check out the details in the spec: > https://iceberg.apache.org/spec/#partition-transforms > Yes, it's ver

Re: Adding new connectors

2023-03-24 Thread Alex Cruise
On Fri, Mar 24, 2023 at 1:46 PM John Zhuge wrote: > Have you checked out SparkCatalog > > in > Apache Iceberg project? More docs at > https://iceberg.apache.org/docs/latest/s

Adding new connectors

2023-03-24 Thread Alex Cruise
Hey folks, please let me know if this is more of a user@ post! I'm building a Spark connector for my company's data-lake-ish product, and it looks like there's very little documentation about how to go about it. I found ExternalCatalog a few days ago and have been implementing one of those, but it s