Clojurians,

Good morning from (again) snowy Boulder!


Following lots of discussion and interaction with many people around the
Clojure and ML worlds, TechAscent has built a foundation intended to let
the average Clojurist do high-quality machine learning of the kind they
are likely to encounter in their day-to-day work.


This isn't a deep learning framework; I have already tried that in a
bespoke fashion, and I think the MXNet bindings are great.


This is specifically for the use case where you have data coming in from
multiple sources and you need to do the cleaning, processing, and feature
augmentation before running some set of simple models.  Then you gridsearch
across a range of models and go about your business from there.  Think
small to medium-sized Datomic databases and the like.  Everyone has a
little data before they have a lot, and I think this scale captures a far
wider range of possible use cases.


The foundation comes in two parts.


The first is the ETL library:

https://github.com/techascent/tech.ml.dataset

This library is a column-store design sitting on top of Tablesaw.  The
Clojure ML group profiled lots of different libraries, and we found that
Tablesaw works great.

The ETL language is composed of three sub-languages.  First, a
set-invariant column-selection language.  Second, a minimal functional math
language along the lines of APL or J.  Finally, a pipeline concept that
lets you describe an ETL pipeline as data: you create the pipeline and run
it on training data, and it records context.  Then during inference you
just use the saved pipeline from that first run.
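Here is a minimal, self-contained sketch of that train-then-replay idea in
plain Clojure maps.  The function names and op shapes are illustrative
only, not tech.ml.dataset's actual API:

```clojure
;; Sketch of "record context during training, replay it at inference".
;; NOT tech.ml.dataset's API -- just the concept, on plain Clojure maps
;; where a dataset is {column-keyword [values ...]}.
(defn mean [xs] (/ (reduce + xs) (count xs)))

;; One pipeline op: replace missing (nil) values in a column.  During
;; :fit it computes and records the training mean; during :transform it
;; reuses the recorded value instead of recomputing from the new data.
(defn replace-missing [col]
  (fn [ds ctx mode]
    (let [fill (if (= mode :fit)
                 (mean (remove nil? (ds col)))
                 (get ctx [:fill col]))]
      [(update ds col #(mapv (fnil identity fill) %))
       (assoc ctx [:fill col] fill)])))

(defn run-pipeline [ops ds ctx mode]
  (reduce (fn [[ds ctx] op] (op ds ctx mode)) [ds ctx] ops))

(def fitted
  (let [ops [(replace-missing :x)]
        ;; training records the context...
        [train-ds ctx] (run-pipeline ops {:x [1 nil 3]} {} :fit)
        ;; ...inference replays it, filling with the *training* mean (2)
        [infer-ds _]   (run-pipeline ops {:x [nil 5]} ctx :transform)]
    {:train train-ds :infer infer-ds}))
;; (:train fitted) => {:x [1 2 3]}
;; (:infer fitted) => {:x [2 5]}
```

The point is that the fitted context travels with the pipeline, so
inference can never drift from what training saw.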

This is the second large ETL system I have worked on; the first was
Alteryx.


The next library is a general ML framework:

https://github.com/techascent/tech.ml

The library has bindings to XGBoost, Smile, and LIBSVM.  LIBSVM doesn't
get the credit it deserves, btw, as it works extremely well on small-n
problems.  XGBoost works well on everything, and Smile contains lots of
different types of models that may or may not work well depending on the
problem, as well as clustering and a lot of other machine-learny things.
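The gridsearch step mentioned above boils down to enumerating candidate
option maps and scoring each one.  A rough sketch in plain Clojure, with a
stand-in scoring function (the option keys and `validation-loss` are
hypothetical, not tech.ml's API):

```clojure
;; Hypothetical gridsearch sketch -- not tech.ml's actual API.
(defn grid
  "All combinations of the given per-option value sequences."
  [opt-ranges]
  (reduce (fn [acc [k vs]]
            (for [m acc, v vs] (assoc m k v)))
          [{}]
          opt-ranges))

;; Stand-in scoring function; in practice this would train a model with
;; the given options and return its cross-validated loss.
(defn validation-loss [{:keys [max-depth]}]
  (Math/abs (- max-depth 6)))

(def candidates
  (grid {:model-type [:xgboost/regression :smile.regression/lasso]
         :max-depth  [3 6 9]}))
;; (count candidates) => 6, i.e. 2 model types x 3 depths

(def best (apply min-key validation-loss candidates))
;; best has :max-depth 6, the depth with the lowest stand-in loss
```

Because the candidates are just data, the same search loop works across
every model backend the framework binds to.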

For this case, my interest wasn't so much a clear exposition of all the
different things Smile can do as it was getting a wide enough range of
model generators to be effective.  For a more thorough binding to Smile,
check out:

https://github.com/generateme/fastmath


As a proof of concept, I built a Clojure version of a very involved Kaggle
problem using clojupyter and oz:


https://github.com/cnuernber/ames-house-prices/blob/master/ames-housing-prices-clojure.md


Enjoy :-).

Compliments of the TechAscent Crew & Clojure ML Working Group

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en