core.matrix maintainer here.

I think it would be great to have more work on dataframe-type support. I 
think the right strategy is as follows:
a) Make use of the core.matrix Dataset protocols where possible (or add new 
ones)
b) Create implementation(s) for these protocols for whatever back-end data 
frame implementation is being used

The beauty of core.matrix is that we *can* support multiple implementations 
without fragmentation, because the protocol-based approach means that every 
implementation can use the same API. This is already working well for the 
array programming APIs (it's easy to mix and match Clojure data structures, 
Vectorz Java-based arrays, and GPU-backed arrays in computations). We just 
need to do the same for DataFrames.
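
To make the idea concrete, here is a minimal sketch in plain Clojure. Note 
that PDataFrame, column-names, column, ColumnFrame, RowFrame and column-sum 
are all hypothetical names made up for this illustration; they are NOT the 
actual core.matrix Dataset protocols. The point is only that client code 
written against a protocol works unchanged across back ends.

```clojure
;; Hypothetical protocol, for illustration only.
;; This is NOT the real core.matrix Dataset protocol.
(defprotocol PDataFrame
  (column-names [df] "Returns the column names as a sequence.")
  (column [df col-name] "Returns the values of one column as a sequence."))

;; Hypothetical back end 1: column-oriented storage,
;; a map from column name to a vector of values.
(defrecord ColumnFrame [columns]
  PDataFrame
  (column-names [_] (keys columns))
  (column [_ col-name] (get columns col-name)))

;; Hypothetical back end 2: row-oriented storage,
;; a vector of maps that all share the same keys.
(defrecord RowFrame [rows]
  PDataFrame
  (column-names [_] (keys (first rows)))
  (column [_ col-name] (map #(get % col-name) rows)))

;; Client code depends only on the protocol, never on a concrete back end.
(defn column-sum [df col-name]
  (reduce + (column df col-name)))

(def by-column (->ColumnFrame {:x [1 2 3] :y [10 20 30]}))
(def by-row    (->RowFrame [{:x 1 :y 10} {:x 2 :y 20} {:x 3 :y 30}]))

(column-sum by-column :y) ; => 60
(column-sum by-row :y)    ; => 60
```

A Spark- or Renjin-backed implementation would, under the same assumption, 
just be another type extending the protocol, and column-sum would not need 
to change.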

Now: the current core.matrix Dataset API is a bit focused on 2D data 
tables, but I think it can be extended to general N-dimensional dataframe 
capability. This would be a great project for someone to take on; I'm happy 
to give guidance and help merge in changes as needed.

I don't have a particularly strong opinion on which Dataframe 
implementations are best, but it looks like Spark and Renjin are both great 
candidates and would be very useful additions to the Clojure numerical 
ecosystem. If we do things right, they should interoperate easily with the 
core.matrix APIs, making Clojure ideal for "glue" code across such 
implementations.

On Thursday, 10 March 2016 04:57:31 UTC+8, arthur.ma...@gmail.com wrote:
>
> Is there any desire or need for a Clojure DataFrame?
>
>
> By DataFrame, I mean a structure similar to R's data.frame, and Python's 
> pandas.DataFrame.
>
> Incanter's DataSet may already be fulfilling this purpose, and if so, I'd 
> like to know if and how people are using it.
>
> From quickly researching, I see that some prior work has been done in this 
> space, such as:
>
> * https://github.com/cardillo/joinery
> * https://github.com/mattrepl/data-frame
> * http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes
>
> Rather than going off and creating a competing implementation 
> (https://xkcd.com/927/), I'd like to know if anyone here is actively 
> working on, or would like to work on, a DataFrame and related utilities for 
> Clojure (and by extension Java)? Is it something that's sorely needed, or 
> is everybody happy with using Incanter or some other library that I'm not 
> aware of? If there's already a de facto standard out there, would anyone 
> care to please point it out?
>
> As background information:
>
> My specific use-case is in NLP and ML, where I often explore and prototype 
> in Python, but I'm then left to deal with a smattering of libraries on the 
> JVM (Mallet, Weka, Mahout, ND4J, DeepLearning4j, CoreNLP, etc.), each with 
> their own ad-hoc implementations of algorithms, matrices, and utilities for 
> reading data. It would be great to have a unified way to explore my data in 
> the Clojure REPL, and then serve the same code and models in production.
>
> I would love for Clojure to have a broadly compatible ecosystem similar to 
> Python's NumPy/Pandas/Scikit-*/SciPy/matplotlib/GenSim, etc. core.matrix and 
> Incanter appear to fulfill a large chunk of those roles, but I am not aware 
> if they've yet become the de facto standards in the community.
>
> Any feedback is greatly appreciated.
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en