I am sorry, I must have misunderstood the purpose of this thread. I read "Even if you have a vague idea, you can contribute" and tried to give a couple of vague ideas.
I did not really mean that I would be able or have time to mentor such a project.

2015-02-18 11:01 GMT+01:00 Sven Van Caekenberghe <s...@stfx.eu>:
> OK, try making a proposal then, http://gsoc.pharo.org has the instructions
> and the current list, you probably know more about data science than I do.
>
>> On 18 Feb 2015, at 10:53, Andrea Ferretti <ferrettiand...@gmail.com> wrote:
>>
>> I am sorry if the previous messages came off as too harsh. The Neo
>> tools are perfectly fine for their intended use.
>>
>> What I was trying to say is that a good idea for a SoC project would
>> be to develop a framework for data analysis that would be useful for
>> data scientists, and in particular this would include something to
>> import unstructured data more freely.
>>
>> 2015-02-18 10:39 GMT+01:00 Sven Van Caekenberghe <s...@stfx.eu>:
>>> Well, you are certainly free to contribute.
>>>
>>> Heuristic interpretation of data could be useful, but looks like an
>>> addition on top; the core library should be fast and efficient.
>>>
>>>> On 18 Feb 2015, at 10:35, Andrea Ferretti <ferrettiand...@gmail.com> wrote:
>>>>
>>>> For an example of what I am talking about, see
>>>>
>>>> http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#csv-text-files
>>>>
>>>> I agree that these are definitely too many options, but it gets the job
>>>> done for quick and dirty exploration.
>>>>
>>>> The fact is that working with a dump of a table from your DB, whose
>>>> content you know, requires different tools than exploring the latest
>>>> open data that your local municipality has put online, using yet
>>>> another messy format.
>>>>
>>>> Enterprise programmers deal more often with the former, data
>>>> scientists with the latter, and I think there is room for both kinds of
>>>> tools.
>>>>
>>>> 2015-02-18 10:26 GMT+01:00 Andrea Ferretti <ferrettiand...@gmail.com>:
>>>>> Thank you Sven. I think this should be emphasized and prominent on the
>>>>> home page*. Still, libraries such as pandas are even more lenient,
>>>>> doing things such as:
>>>>>
>>>>> - autodetecting which fields are numeric in CSV files
>>>>> - allowing you to fill in missing data based on statistics (for instance, you
>>>>> can say: where the field `age` is missing, use the average age)
>>>>>
>>>>> Probably there is room for something built on top of Neo.
>>>>>
>>>>>
>>>>> * By the way, I suggest that the documentation on Neo could benefit
>>>>> from a reorganization. Right now, the first topic of the NeoJSON
>>>>> paper introduces JSON itself. I would argue that everyone who tries
>>>>> to use the library already knows what JSON is. Still, there is no
>>>>> example of how to read JSON from a file in the whole document.
>>>>>
>>>>> 2015-02-18 10:12 GMT+01:00 Sven Van Caekenberghe <s...@stfx.eu>:
>>>>>>
>>>>>>> On 18 Feb 2015, at 09:52, Andrea Ferretti <ferrettiand...@gmail.com> wrote:
>>>>>>>
>>>>>>> Also, these tasks
>>>>>>> often involve consuming data from various sources, such as CSV and
>>>>>>> JSON files. NeoCSV and NeoJSON are still a little too rigid for the
>>>>>>> task - libraries like pandas let you just feed in a CSV file and try to
>>>>>>> make heads or tails of the content without having to define too much of
>>>>>>> a schema beforehand.
>>>>>>
>>>>>> Both NeoCSV and NeoJSON can operate in two ways: (1) without the
>>>>>> definition of any schemas, or (2) with the definition of schemas and
>>>>>> mappings. The quick and dirty explore style is most certainly possible.
>>>>>>
>>>>>> 'my-data.csv' asFileReference readStreamDo: [ :in | (NeoCSVReader on: in) upToEnd ].
>>>>>>
>>>>>> => an array of arrays
>>>>>>
>>>>>> 'my-data.json' asFileReference readStreamDo: [ :in | (NeoJSONReader on: in) next ].
>>>>>>
>>>>>> => objects structured using dictionaries and arrays
>>>>>>
>>>>>> Sven
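
To make the comparison with pandas concrete, here is a minimal, purely illustrative sketch of the kind of lenient layer discussed in the thread, built on top of the schema-less NeoCSVReader call Sven shows above. It is not part of NeoCSV: the file name 'my-data.csv', the assumption that the first row is a header over rectangular data, and the naive looksNumeric test are all assumptions made for the example.

    "Sketch: read a CSV without a schema, then heuristically convert columns
     whose cells all look numeric (a rough, pandas-like autodetection)."
    | rows header data looksNumeric |
    looksNumeric := [ :cell |
        cell isString
            and: [ cell notEmpty
            and: [ cell allSatisfy: [ :char | char isDigit or: [ '.-' includes: char ] ] ] ] ].
    rows := 'my-data.csv' asFileReference readStreamDo: [ :in |
        (NeoCSVReader on: in) upToEnd ].
    header := rows first.       "assumes the first row is a header"
    data := rows allButFirst.   "assumes every row has the same number of columns"
    1 to: header size do: [ :col |
        (data allSatisfy: [ :row | looksNumeric value: (row at: col) ]) ifTrue: [
            data do: [ :row | row at: col put: (row at: col) asNumber ] ] ].

Columns that do not parse cleanly stay as strings; a real implementation would also have to deal with empty cells, quoting, and locale-specific number formats, and a similar pass could fill missing values from column statistics, in the spirit of the pandas behaviour Andrea describes.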