Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-06-29 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-3016555361 I have been thinking about this one a lot and I am sorry I haven't written this up before now. I was trying to collect my thoughts. I feel the core issue is a tension betw

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-06-08 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2954201095 Sorry for not having a chance to test this work earlier @clflushopt I really look forward to checking it out and will try to do so later this week. -- This is an

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-06-07 Thread via GitHub
kevinjqliu commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2952854828 This is great, thanks @clflushopt! I couldn't find a way to use datafusion to write multiple parquet files, but I think this is a limitation with datafusion's `COPY` co

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-06-05 Thread via GitHub
clflushopt commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2948048431 Hey @alamb following suggestions from @kevinjqliu I am happy to say that https://github.com/clflushopt/datafusion-tpch provides a ux on par with duckdb and what we discussed

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-05-25 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2907767246 Thanks @clflushopt -- I'll try and check this out tomorrow or Tuesday

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-05-23 Thread via GitHub
clflushopt commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2906345545 Hey @alamb @kevinjqliu I have individual TPCH table generators working fine https://github.com/clflushopt/datafusion-tpch/blob/main/src/lib.rs but I am still scratching my he

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-05-06 Thread via GitHub
clflushopt commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2856948236 I was stuck trying to decide between a scalar function or a table function for `tpchgen(sf)`. I really like your suggestion @alamb, thanks for unblocking. I'll have a v0.1.0 do

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-05-05 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2851315858 I think we could use a user defined **TABLE** function: https://datafusion.apache.org/library-user-guide/adding-udfs.html#adding-a-user-defined-table-function So tha

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-05-03 Thread via GitHub
kevinjqliu commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2848681742 Sounds good to me! `SELECT * FROM lineitem(1.0)` makes sense. `SELECT 1 FROM tpchgen(1.0)` looks a bit odd, but I can't think of a better alternative.

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-04-30 Thread via GitHub
clflushopt commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2844075397 I agree with @alamb on this one, regarding the separation of creation & storing the files on disk explicitly. One suggestion I would propose is that I would add a scalar func

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-04-28 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2837210261 > duckdb's CALL dbgen(sf = 1); creates tables in the current schema and then populates those tables with data using its own format. The other thing we can do is to just make

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-04-19 Thread via GitHub
kevinjqliu commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2816875067 > In order to try and make progress on this, I decided to go with having a single function that builds all tables for a single scale factor similar to how DuckDB does it. My

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-04-18 Thread via GitHub
clflushopt commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2811379764 In order to try and make progress on this, I decided to go with having a single function that builds all tables for a single scale factor similar to how DuckDB does it. My re

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-04-16 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2810241760 > [@alamb](https://github.com/alamb) Yes once I address the couple of prioritized issues I have open for `v1.0.0` the next step will be to work on the integration, I agree with ha

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-04-11 Thread via GitHub
clflushopt commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2798250520 @alamb Yes once I address the couple of prioritized issues I have open for `v1.0.0` the next step will be to work on the integration, I agree with having table functions but

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-04-11 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2797923658 > I just read your blogpost today, and I am really happy to have a faster generator. The post focussed on generating tpc-h to files, but I see you also discussed something like th

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-04-11 Thread via GitHub
m-mueller678 commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2797014499 I just read your blogpost today, and I am really happy to have a faster generator. The post focussed on generating tpc-h to files, but I see you also discussed something l

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-04-10 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2789578819 @clflushopt -- do you have any next steps planned for this project? I think tpchgen is basically ready / done (though I predict we may get a flurry of additional interest on

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-04-04 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2779688162 We have drafted a blog about this project in case anyone wants to review / check it out: - https://github.com/apache/datafusion-site/pull/67

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-04-01 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2770127700 @scsmithr of GlareDB integrated the tpchgen library in glaredb as a table function - https://github.com/GlareDB/glaredb/pull/3549 Which is quite cool ```shell g

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-03-30 Thread via GitHub
matthewmturner commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2764969834 @clflushopt awesome, I'm really excited to add this to dft - it will be the next item I work on. Will let you know if I have any questions / comments.

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-03-30 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2764746130 I think the next question in my mind is exactly how to integrate this into datafusion-cli We could follow the model of duckdb and create a table function like `dbgen(sf = 1

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-03-30 Thread via GitHub
clflushopt commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2764733544 @lmwnshn @matthewmturner we now have a live crate for integrations https://crates.io/crates/tpchgen and a cli available https://github.com/clflushopt/tpchgen-rs special thank

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-03-24 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2749315813 > Good to see this rust generator. We have adopted it in our database projection for benchmarking. Thanks @niebayes -- here is a preview of what I am currently working on

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-03-23 Thread via GitHub
niebayes commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2746782663 Good to see this rust generator. We have adopted it in our database projection for benchmarking.

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-03-13 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2722682304 @clflushopt has some very cool ideas for testing in tpchdbgen-rs Specifically we verified that the output data (for SF 0.001 and SF 0.01) is byte-for-byte identical with db

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-03-13 Thread via GitHub
lmwnshn commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2721519020 @clflushopt Nice work! Re: randomness, the TPC-H spec has a "qualification database" (dataset) with specific "query validation" tests (instantiating the SQL queries with specifi

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-03-10 Thread via GitHub
matthewmturner commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2712539973 @clflushopt this is _awesome_. Once you release I will likely add this to [dft](https://github.com/datafusion-contrib/datafusion-dft).

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-03-10 Thread via GitHub
clflushopt commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2712464986 For anyone following this issue I have a full port here https://github.com/clflushopt/tpchgen-rs and I am working on completing a first release (I have issues to track that m

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-03-10 Thread via GitHub
clflushopt commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2710564179 @alamb Hey, yeah, sorry, it's just a habit: I like to complete things before "releasing" them, but I just made it open!

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-03-10 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2710281339 > Hey [@alamb](https://github.com/alamb) as of today I have a fully working implementation that matches Apache Trino and OLTPBenchmark's, I found the issue I mentioned in the mes

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-03-09 Thread via GitHub
clflushopt commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2709364715 Hey @alamb as of today I have a fully working implementation that matches Apache Trino and OLTPBenchmark's, I found the issue I mentioned in the message above which was due

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-03-09 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2708856273 > My goal is to potentially donate it to the [datafusion-contrib ](https://github.com/datafusion-contrib) organization and then keep maintaining it there this way we can coordinat

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-03-08 Thread via GitHub
clflushopt commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2708672112 take

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-03-08 Thread via GitHub
clflushopt commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2708663987 Hey @alamb @lmwnshn I've been actually following the CMU 15-799 course (nights and weekends mostly) and started working on a Rust port of the benchbase Java implementation a

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-02-11 Thread via GitHub
lmwnshn commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600 If you prefer Java to C, CMU-DB's BenchBase project does implement support for generating and loading TPC-H data in parallel: https://github.com/cmu-db/benchbase/tree/main/src/m

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-02-11 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2651062245 Thanks @lmwnshn -- the Java implementation might be easier to transliterate to Rust... Also BTW I am pretty sure other rust data projects would be interested in a Rust imp