17 - Tabular Data Structures for Data Analysis - Oleksandr Zaytsev

p...@highoctane.be Thu, 11 May 2017 02:45:08 -0700

---------- Forwarded message ----------
From: "p...@highoctane.be" <p...@highoctane.be>
Date: 11 May 2017 10:54
Subject: Re: 11/05/17 - Tabular Data Structures for Data Analysis -
Oleksandr Zaytsev
To: "Nick Papoylias" <npapoyl...@gmail.com>
Cc:




On Thu, May 11, 2017 at 10:20 AM, Nick Papoylias <npapoyl...@gmail.com>
wrote:

>
>
> On Thu, May 11, 2017 at 5:24 AM, Oleksandr Zaytsev <olk.zayt...@gmail.com>
> wrote:
>
>>
>> *A. Work done*
>>
>>    - Downloaded the threaded VM as suggested by Esteban Lorenzano to
>>    make Iceberg work. And it does! I have successfully pushed my
>>    NeuralNetwork code to GitHub: https://github.com/olekscode/MLNeuralNetwork
>>    - Joined the PolyMath organization on GitHub
>>    - Created a repository for the TabularDataset project,
>>    https://github.com/PolyMathOrg/TabularDataset, as part of the PolyMath
>>    organization on GitHub
>>    - Fixed PolyMath issue #25 and made a PR
>>    - Read an article from the Wolfram Mathematica documentation regarding
>>    Dataset. It was one of the reading suggestions sent to me by Nick
>>    Papoylias
>>
>>
>> *B. Next steps*
>>
>>    - Fix more PolyMath issues using Iceberg. I have to get used to
>>    it by the time the coding phase starts
>>    - Read the rest of Nick Papoylias's suggestions
>>
>>
>> *C. Help needed*
>>
>>    - The Dataset in Wolfram, as well as Pandas in Python, has a very
>>    advanced indexing system. Smalltalk has its own conventions for
>>    indexing, so it would be great to get familiar with them.
>>    Could you suggest some reading on this topic (what are the indexing
>>    conventions in Smalltalk?)
>>    For example, in Wolfram, I can write *dataset[[-1]]* to extract the
>>    last row. But in Pharo indexes cannot be negative. In Pharo I would say
>>    *dataset last*. But how about *dataset[[-5]]*?
>>
> This would be a good exercise for you ;) In Pharo you can easily add
> negative indexing yourself.
>
> *Hint:* You know the index of the last element, since this is the size of
> the collection, so... ;)
>
No need for changes, this exists already.

Use atWrap: index put: value and atWrap: with negative indexes.
'hello' atWrap: -2

There is a specific version for Array using a primitive.
#( 10 20 30 40 ) atWrap: -1

atWrap: 0 gives you the last item (40).
atWrap: -1 gives 30, the second-to-last item.

This is different from 0-based languages such as Python, where index -1 is
the last item.

The interesting thing about atWrap: is that it uses modulo internally, so you
do not need to care about that.

($/ split: 'abc/def/ghi/jkl') atWrap: -1
--> 'ghi'
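
For readers coming from 0-based languages, the wrap-around arithmetic can be sketched outside Pharo. The Python function below is a hypothetical model, not Pharo's source: at_wrap is an invented name, and it assumes that atWrap: maps a 1-based index to ((index - 1) mod size) + 1, which agrees with the results quoted above.

```python
def at_wrap(items, index):
    """Hypothetical model of Pharo's atWrap: for 1-based indexes."""
    if not items:
        raise IndexError("cannot wrap into an empty collection")
    # Shift the 1-based index to 0-based and wrap it with modulo.
    # Python's % is non-negative for a positive modulus, so the result
    # is already a valid 0-based position.
    position = (index - 1) % len(items)
    return items[position]


values = [10, 20, 30, 40]
print(at_wrap(values, 1))    # first item
print(at_wrap(values, 0))    # wraps to the last item
print(at_wrap(values, -1))   # second-to-last item
print(at_wrap("abc/def/ghi/jkl".split("/"), -1))
```

Under this assumption, index 0 rather than -1 lands on the last element, which is the off-by-one to keep in mind when porting Wolfram or Python style negative indexes.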

The Matrix class has a bunch of things API-wise, but the class is highly
inefficient, doing copies all the time etc. It would be nice to have some
kind of futures/copy-on-write style things in there.

I miss cbind and rbind. These are useful. I have some half-baked, very
inefficient implementations of these things for Matrix.

https://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html

The ability to name columns is also nice to have.

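In R one does:

df <- data.frame(c(1, 2, 3))       # start with one column
df <- cbind(df, c(4, 5, 6))        # cbind returns a new data frame, so reassign
df <- cbind(df, c(7, 8, 9))
names(df) <- c("C1", "C2", "C3")   # one name per column

The names can be retrieved with:

names(df)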

A Smalltalkish style would be welcome.
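
To make the cbind/rbind idea concrete without committing to any Pharo design, here is a minimal sketch in Python. The Table class and its method names are hypothetical, chosen only to mirror the R calls above; it stores one list per column and checks lengths on every bind.

```python
class Table:
    """Hypothetical column-oriented table; not an existing Pharo or R API."""

    def __init__(self):
        self._names = []      # column names, in insertion order
        self._columns = []    # one list of values per column

    def cbind(self, name, values):
        """Append a named column; it must match the existing row count."""
        values = list(values)
        if self._columns and len(values) != len(self._columns[0]):
            raise ValueError("column length does not match existing rows")
        if name in self._names:
            raise ValueError(f"duplicate column name: {name}")
        self._names.append(name)
        self._columns.append(values)
        return self  # returning self allows chained calls

    def rbind(self, row):
        """Append one row; it needs exactly one value per column."""
        row = list(row)
        if len(row) != len(self._columns):
            raise ValueError("row width does not match column count")
        for column, value in zip(self._columns, row):
            column.append(value)
        return self

    def names(self):
        """Return a copy of the column names."""
        return list(self._names)

    def column(self, name):
        """Return a copy of the values in the named column."""
        return list(self._columns[self._names.index(name)])


df = Table().cbind("C1", [1, 2, 3]).cbind("C2", [4, 5, 6]).cbind("C3", [7, 8, 9])
df.rbind([10, 11, 12])
print(df.names())
print(df.column("C2"))
```

Naming the column at bind time, rather than in a separate names() assignment, is one possible reading of a more message-oriented style; it is a design guess, not something proposed in this thread.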

Maybe looking at the Voyage queries can be helpful.

Phil



> Try adding an extension method to OrderedCollection or SequenceableCollection.
>
> If the Pharo by Example chapter or the MOOC is not enough, read the source
> itself in the core to see how basic methods are implemented (it is less
> scary than it sounds).
>
> You can also try Chapters 9, 10, 11 of the blue book (some API changes may
> apply):
>
> http://sdmeta.gforge.inria.fr/FreeBooks/BlueBook/Bluebook.pdf
>
>
>>    - Or what is the best way of implementing this index:
>>    *dataset[["name"]]* (extracts a named row), *dataset[[1]]* (extracts
>>    the first row)? Should I create two separate messages: *dataset
>>    rowNamed: 'name'* and *dataset rowAt: 1*?
>>
rowNamed:
rowAt:

Yes, it looks like it.

But if we want to model things like R data frames, for example, this has to
be seen as a vectorized operation, so that you can use row slices, column
slices, and logical indexes.

Check this out:

http://www.r-tutor.com/r-introduction/data-frame/data-frame-row-slice
https://www.r-bloggers.com/working-with-data-frames/
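
The access styles mentioned here (positional row, named row, row slice, column slice, logical index) can be sketched side by side. The Python class below is a hypothetical illustration with invented names and made-up data; row_at and row_named stand in for the proposed rowAt: and rowNamed:, and indexes are 1-based to match Pharo.

```python
class Dataset:
    """Hypothetical row-oriented dataset with named rows and columns."""

    def __init__(self, column_names, row_names, rows):
        self._column_names = list(column_names)
        self._row_names = list(row_names)
        self._rows = [list(row) for row in rows]

    def row_at(self, index):
        """Positional access with a 1-based index."""
        if not 1 <= index <= len(self._rows):
            raise IndexError(index)
        return list(self._rows[index - 1])

    def row_named(self, name):
        """Access a row by its name."""
        return list(self._rows[self._row_names.index(name)])

    def rows_from_to(self, first, last):
        """Row slice, 1-based and inclusive at both ends."""
        return [list(row) for row in self._rows[first - 1:last]]

    def columns(self, names):
        """Column slice: keep the named columns in the requested order."""
        positions = [self._column_names.index(n) for n in names]
        return [[row[p] for p in positions] for row in self._rows]

    def column(self, name):
        """Values of a single column as a flat list."""
        position = self._column_names.index(name)
        return [row[position] for row in self._rows]

    def rows_where(self, mask):
        """Logical index: keep the rows whose mask entry is true."""
        if len(mask) != len(self._rows):
            raise ValueError("mask length must equal the row count")
        return [list(row) for row, keep in zip(self._rows, mask) if keep]


ds = Dataset(["x", "y"], ["a", "b", "c"], [[1, 10], [2, 20], [3, 30]])
print(ds.row_at(1))
print(ds.row_named("b"))
print(ds.rows_from_to(2, 3))
print(ds.columns(["y"]))
mask = [value > 15 for value in ds.column("y")]   # build a logical index
print(ds.rows_where(mask))
```

Every accessor returns a copy, so the internal list-of-rows layout never leaks to the caller and could later be swapped for a column-oriented one.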



> The internal representation of your data-structure can be anything at the
> moment, *as long as you encapsulate it.*
>
> (i.e. it can be nested OrderedCollections with meta-data mapping column
> names to indexes, or a dictionary of collections, etc.).
>
> *If you don't expose it to the user* (i.e. return it from the public API,
> or expect knowledge of it in argument passing),
> we can easily change it later. So *first make it work, and we optimize
> later ;)*
>
> For your case it will be a little bit trickier because *you also have the
> notions of a) rows and b) columns*, which
> are exposed to the user. So *you would need to create abstractions* for
> these too.
>
> Cheers,
>
> Nick
>
>>
>>
>>
>> If someone else is having problems with Iceberg on Linux, try downloading
>> the threaded VM:
>>
>> wget -O- get.pharo.org/vmT60 | bash
>>
>> And use an SSH (not HTTPS) remote URL.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Pharo Google Summer of Code" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to pharo-gsoc+unsubscr...@googlegroups.com.
>> To post to this group, send email to pharo-g...@googlegroups.com.
>> To view this discussion on the web visit https://groups.google.com/d/ms
>> gid/pharo-gsoc/CAEp0Uzu-8fw3dA6ezVoj-QptvLcB8cWPHvZ1tfLg1Ce8
>> qkTqfQ%40mail.gmail.com
>> <https://groups.google.com/d/msgid/pharo-gsoc/CAEp0Uzu-8fw3dA6ezVoj-QptvLcB8cWPHvZ1tfLg1Ce8qkTqfQ%40mail.gmail.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>

[Pharo-users] Fwd: Re: 11/05/17 - Tabular Data Structures for Data Analysis - Oleksandr Zaytsev
