Thanks for taking the time to review. It works great for my use case, as I am doing a one-shot load and manipulation of big data, and the data is essentially immutable for the rest of the process's lifetime, read-only after that. But I know from testing and benchmarking that it is limited by the fact that we can't easily perform in-place operations. I'm still a novice at the kernel portion of Arrow, so I'm sure there are opportunities to improve performance.
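To make that limitation concrete, here is a minimal sketch of what I mean (the helper name is hypothetical, not code from the library): Arrow's compute kernels are value-returning, so even a simple transform allocates a new array instead of reusing the input buffers.

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Hypothetical helper: illustrates that an Arrow compute kernel returns a
// new array rather than mutating the input buffers in place.
arrow::Result<std::shared_ptr<arrow::Array>> AddConstant(
    const std::shared_ptr<arrow::Array>& values, int64_t delta) {
  ARROW_ASSIGN_OR_RAISE(
      arrow::Datum result,
      arrow::compute::CallFunction(
          "add", {arrow::Datum(values), arrow::Datum(delta)}));
  // `values` is untouched; `result` owns freshly allocated buffers.
  return result.make_array();
}

Every step like this allocates new buffers, which is where I suspect most of the overhead in my benchmarks comes from.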
On Fri, Jan 27, 2023 at 9:53 AM Weston Pace <weston.p...@gmail.com> wrote:
> The new kernels are interesting. There has been some ask recently[1]
> for weighted averages and I think you have some of the pieces (if not
> all of it) here. We also recently plumbed in support for binary
> aggregates into Acero[2] so having more binary aggregate kernels would
> be nice.
>
> Outside of the kernels I agree with Kou that this probably doesn't
> need to be a part of the main repo. There is already some discussion
> of splitting the main repo itself[3] in the interest of having smaller
> composable pieces over monolithic pieces.
>
> We do want tools like this to exist and flourish so I think Benson's
> idea of a blog post is nice. We also have a "powered by" page[4].
>
> As for the library itself:
>
> This appears to be a fairly faithful reproduction of the pandas API
> and I imagine would be quite friendly to those coming from Python.
> Since you are using compute kernels directly you are going to be
> limited to operating on what you can fit in memory (though I'm sure
> there are plenty of valid use cases in this space). I think the
> primary challenge you will encounter in seeking users will be that the
> audience of C++ data scientists is pretty small. You aren't, for
> example, going to get a significant performance boost over pandas /
> numpy (as they use C/C++ for the heavy lifting already) and so the
> real benefit will only be for those that are stuck using C++ already.
>
> [1] https://github.com/apache/arrow/issues/15103
> [2] https://github.com/apache/arrow/pull/15083
> [3] https://github.com/apache/arrow/issues/15280
> [4] https://arrow.apache.org/powered_by/
>
> On Sun, Jan 22, 2023 at 2:44 AM Adesola Adedewe <ava6...@g.rit.edu> wrote:
> >
> > Yes I will. I haven't taken enough time to clean up the README; it was
> > generated from my source code with ChatGPT. I will do that later in the
> > week.
> >
> > On Sun, Jan 22, 2023 at 2:36 AM Benson Muite <benson_mu...@emailplus.org>
> > wrote:
> >
> > > On 1/22/23 13:15, Adesola Adedewe wrote:
> > > > I'm working on a project where big financial data needs to be
> > > > loaded, stored, and manipulated. The data is stored as Parquet. My
> > > > initial version had Arrow just load the Parquet data and I used a
> > > > basic unordered_map, but this limited me to only one data type. I
> > > > found I could make my database more generic with Arrow and gain its
> > > > performance benefits. Unfortunately my team is mostly Python devs,
> > > > so I decided to write a cleaner interface over Arrow, with
> > > > interfaces closer to pandas. This enabled us to use fewer lines of
> > > > code as well, and still enjoy the benefits. I will write a blog post
> > > > later; I was mostly looking for other developers interested in
> > > > collaborating, or who may need this as well. Not necessarily to add
> > > > it to the main library, but I'm not opposed to that. I also
> > > > implemented some custom kernels like covariance, correlation,
> > > > cumprod, shift, and pctchange.
> > >
> > > The context is very helpful. A blog post would certainly alert others
> > > in the Arrow community to your work. Most developers are overburdened,
> > > so explaining a use case and how it may help them would encourage
> > > exploration and review of your repository; I would therefore encourage
> > > a blog post that alerts the wider Arrow developer community to your
> > > work. Updating the README of your repository would also encourage use.
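P.S. On the weighted-average point above: a rough sketch of how it could be composed from the existing "multiply", "sum", and "divide" kernels, assuming double-typed, null-free inputs of equal length (the helper name is hypothetical, not an existing kernel).

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Weighted mean = sum(values * weights) / sum(weights), built from existing
// kernels rather than a dedicated binary aggregate.
arrow::Result<double> WeightedMean(
    const std::shared_ptr<arrow::Array>& values,
    const std::shared_ptr<arrow::Array>& weights) {
  namespace cp = arrow::compute;
  ARROW_ASSIGN_OR_RAISE(arrow::Datum weighted,
                        cp::CallFunction("multiply", {values, weights}));
  ARROW_ASSIGN_OR_RAISE(arrow::Datum numerator, cp::Sum(weighted));
  ARROW_ASSIGN_OR_RAISE(arrow::Datum denominator, cp::Sum(weights));
  ARROW_ASSIGN_OR_RAISE(arrow::Datum ratio,
                        cp::Divide(numerator, denominator));
  // With double inputs, both sums and the ratio are DoubleScalars.
  return ratio.scalar_as<arrow::DoubleScalar>().value;
}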