The student application opened, so I'm going to submit my proposal. Does anyone have suggestions or requests?
Cheers, Matteo Il giorno lunedì 3 marzo 2014 15:36:43 UTC+1, Matteo Ceccarello ha scritto: > > Hello everybody, > > I’m Matteo Ceccarello, a PhD student in Computer Engineering from > university of > Padova, Italy. Recently, I’ve started working with Clojure, a language > that I > find both powerful and fun. I’ve participated to GSoC 2013 and 2012 with > JPF <http://babelfish.arc.nasa.gov/trac/jpf/wiki>. I’m interested in > participating to GSoC 2014 with Clojure, in order to gain confidence with > clojure under the guidance of a mentor while contributing to the > community. I’ll > describe what I’d like to do below. > Motivation > > Currently, alongside my research on parallel graph algorithms, I’ve started > writing a web crawler in Clojure. The reasons I want to do this in clojure > are > many: > > - The web crawling software we are using > (Heritrix <https://crawler.archive.org>) performs blocking IO. I think > that in > big crawls this is a showstopper, since to achieve a good throughput > one must > use many Java threads (in the order of 500) that end up spending most > of their > time performing blocking waits. > - Clojure has the wonderful core.async library, with all the magic > that go > blocks and channels bring. > - The http-kit library makes possible to perform async requests to web > servers in a very simple way. > - Clojure has an awesome support for concurrency. > > What I want to achieve eventually is a concurrent asynchronous web crawler, > without a single blocking call. > > Since I’m new to the language, before starting to get some real work done, > I’ve > worked on a few small projects, to get a feel of the language. What I > learned is > that everything you put in a ref or atom must be immutable. A fundamental > component of a web crawler is a data structure that tells you if you have > already visited a URL. Bloom filters are a common choice for the > implementation > of this component. Since this bloom filter will be put in a ref, I want > it to > be immutable and, for efficiency reasons, persistent. After some research > on the > web, I found out that an immutable persistent implementation of the bloom > filter > is still missing from the Clojure ecosystem. > > Since it seems that Clojure is missing a persistent implementation of many > probabilistic data structures, I came up with the following idea. > Persistent probabilistic data structures for Clojure > > Clojure seems to miss persistent implementations of many useful > probabilistic > data structures, in particular: > > - Bloom filters <http://webhdd.ru/library/files/10.1.1.127.9672.pdf> > - Counting Bloom > filters<http://webhdd.ru/library/files/10.1.1.127.9672.pdf> > - Compact approximators <http://arxiv.org/pdf/cs.DS/0306046.pdf> > - HyperLogLog > counters<http://www.dmtcs.org/dmtcs-ojs/index.php/proceedings/article/viewPDFInterstitial/dmAH0110/2100> > > - Count-min > sketches<http://twiki.di.uniroma1.it/pub/Ing_algo/WebHome/p14_Cormode_JAl_05.pdf> > > > Of course one can use one of the various implementations of these data > structures for > Java, however, being these implementations mutable, they cannot be used > in idiomatic concurrent clojure code (as for my understanding of idiomatic > concurrent clojure code). > > What I propose to realize is an optimized persistent implementation of > these > libraries. Hence I plan to explore different paths using > benchmarks <https://github.com/hugoduncan/criterium> as a guide, > for instance to decide whether it’s more convenient to use standard > clojure vectors > to represent bit vectors or to provide a custom persistent > bit vector implementation. > > Since I like the feeling of having a static type checker that > prevents common errors, the library will be annotated with > Typed Clojure <http://typedclojure.org/>. Moreover I find interesting the > idea > of using tests as machine checkable documentation, and > midje-doc <http://docs.caudate.me/lein-midje-doc/> seems the right tool > for the job. > > Is there someone interested in mentoring me with this project? > > Yours sincerely > Matteo > -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en --- You received this message because you are subscribed to the Google Groups "Clojure" group. To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.