17 - Tabular Data Structures for Data Analysis - Oleksandr Zaytsev

Stephane Ducasse Wed, 17 May 2017 10:57:06 -0700

I'm interested in helping with such new "containers".
Maybe we should proceed that way:



On Tue, May 16, 2017 at 7:44 PM, p...@highoctane.be <p...@highoctane.be>
wrote:

> We may also use Discord and do something "somewhat live"
>
> Phil
>
> On Tue, May 16, 2017 at 7:23 PM, <serge.stinckw...@gmail.com> wrote:
>
>> I was asking Philippe, but I hope to see you at ESUG too!
>>
>> Sent from my iPhone
>>
>> On 16 May 2017 at 19:02, Oleksandr Zaytsev <olk.zayt...@gmail.com>
>> wrote:
>>
>> I would love to, but to go to Lille from my country I would need a visa.
>> Which is not that easy to acquire.
>> So maybe I will come to PharoDays 2018.
>> And I will definitely try to come to ESUG Conference in September.
>>
>> Oleks
>>
>> On Tue, May 16, 2017 at 7:26 PM, <serge.stinckw...@gmail.com> wrote:
>>
>>>
>>>
>>> Sent from my iPhone
>>>
>>> On 11 May 2017 at 11:43, "p...@highoctane.be" <p...@highoctane.be>
>>> wrote:
>>>
>>> ---------- Forwarded message ----------
>>> From: "p...@highoctane.be" <p...@highoctane.be>
>>> Date: 11 May 2017 10:54
>>> Subject: Re: 11/05/17 - Tabular Data Structures for Data Analysis -
>>> Oleksandr Zaytsev
>>> To: "Nick Papoylias" <npapoyl...@gmail.com>
>>> Cc:
>>>
>>>
>>>
>>> On Thu, May 11, 2017 at 10:20 AM, Nick Papoylias <npapoyl...@gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Thu, May 11, 2017 at 5:24 AM, Oleksandr Zaytsev <
>>>> olk.zayt...@gmail.com> wrote:
>>>>
>>>>>
>>>>> *A. Work done*
>>>>>
>>>>>    - Downloaded the threaded VM as suggested by Esteban Lorenzano to
>>>>>    make Iceberg work. And it does! I have successfully pushed my
>>>>>    NeuralNetwork code to GitHub:
>>>>>    https://github.com/olekscode/MLNeuralNetwork
>>>>>    - Joined the PolyMath organization on GitHub
>>>>>    - Created a repository for the TabularDataset project
>>>>>    https://github.com/PolyMathOrg/TabularDataset as a part of the
>>>>>    PolyMath organization on GitHub
>>>>>    - Fixed PolyMath issue #25 and made a PR
>>>>>    - Read an article from the Wolfram Mathematica documentation
>>>>>    regarding Dataset. It was one of the reading suggestions sent to me
>>>>>    by Nick Papoylias
>>>>>
>>>>>
>>>>> *B. Next steps*
>>>>>
>>>>>    - Fix more issues of PolyMath, using Iceberg. I have to get used
>>>>>    to it by the time the coding phase starts
>>>>>    - Read the rest of Nick Papoylias's suggestions
>>>>>
>>>>>
>>>>> *C. Help needed*
>>>>>
>>>>>    - The Dataset in Wolfram, as well as Pandas in Python, has a very
>>>>>    advanced indexing system. Smalltalk has its own conventions for
>>>>>    indexing, so I think it would be great if I got familiar with them.
>>>>>    Could you suggest some reading on this topic (what are the indexing
>>>>>    conventions in Smalltalk?).
>>>>>    For example, in Wolfram, I can write *dataset[[-1]]* to extract
>>>>>    the last row. But in Pharo indexes cannot be negative. In Pharo I
>>>>>    would say *dataset last*. But how about *dataset[[-5]]*?
>>>>>
>>>> This would be a good exercise for you ;) In Pharo you can easily add
>>>> negative indexing yourself.
>>>>
>>>> *Hint:* You know the index of the last element, since this is the size
>>>> of the collection, so... ;)
>>>>
>>>> No need for changes, this exists already.
>>>
>>> Use atWrap: and atWrap:put: with negative indexes.
>>> 'hello' atWrap: -2
>>>
>>> There is a specific version for Array using a primitive.
>>> #( 10 20 30 40 ) atWrap: -1
>>>
>>> atWrap: 0 gives you the last item (40).
>>> atWrap: -1 gives 30.
>>>
>>> This is different from languages with 0-based indexes.
>>>
>>> The interesting thing about atWrap: is that it uses modulo internally, so
>>> you do not need to care about that.
>>>
>>> ($/ split: 'abc/def/ghi/jkl') atWrap: -1
>>> --> 'ghi'
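
The wrap-around arithmetic behind atWrap: can be sketched in a Playground. This is only an illustration: it assumes the usual 1-based modulo formula (index - 1 \\ size + 1), and the block name wrap is hypothetical, not part of Pharo.

```smalltalk
"Playground sketch of 1-based wrap-around indexing."
| items wrap |
items := #(10 20 30 40).

"Hypothetical helper block: shift to 0-based, take the modulo, shift back.
 Binary messages evaluate left to right, so this is ((index - 1) \\ size) + 1."
wrap := [ :index | items at: index - 1 \\ items size + 1 ].

wrap value: 0.     "40, the last element"
wrap value: -1.    "30, the second to last"
wrap value: 5.     "10, wraps past the end back to the first element"

"The block should agree with the built-in message."
(wrap value: -1) = (items atWrap: -1).
```

Note that with this convention 0, not -1, addresses the last element, which differs from the Wolfram *dataset[[-1]]* notation quoted earlier in the thread.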
>>>
>>> The Matrix class has a bunch of things API wise but the class is highly
>>> inefficient, doing copies all the time etc. It would be nice to have some
>>> kind of futures/copy on write style things in there.
>>>
>>> I miss cbind and rbind. These are useful. I have some half baked super
>>> inefficient implementations of these things for Matrix.
>>>
>>> https://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html
>>>
>>> The ability to name columns is also nice to have.
>>>
>>> In R one does:
>>>
>>> df <- data.frame(c(1,2,3))
>>> df <- cbind(df, c(4,5,6))
>>> df <- cbind(df, c(7,8,9))
>>> names(df) <- c("C1", "C2", "C3")
>>>
>>> Names can be retrieved with:
>>>
>>> names(df)
>>>
>>> A Smalltalkish style would be welcome.
>>>
>>>
>>>
>>>
>>> Interesting! Are you coming to PharoDays? We can talk about that if we
>>> find time.
>>>
>>> Maybe looking at the Voyage queries can be helpful.
>>>
>>> Phil
>>>
>>>
>>>
>>>> Try adding an extension method to OrderedCollection or
>>>> SequenceableCollection.
>>>>
>>>> If the Pharo by Example chapter or the MOOC is not enough, read the
>>>> source itself in the core to see how basic methods are implemented
>>>> (it is less scary than it sounds).
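
One possible shape for such an extension method is sketched below. The selector atSigned: is a made-up name for illustration and does not exist in Pharo. The sketch assumes the Wolfram-style convention in which -1 is the last element, which is not the same as atWrap:. The Class >> selector header is just the usual notation for showing a method; in the image the method would be added through the browser.

```smalltalk
"Hypothetical extension method; atSigned: is an assumed name."
SequenceableCollection >> atSigned: index
    "Positive indexes count from the start, negative ones from the end,
     so -1 is the last element and (self size negated) is the first.
     Index 0 is passed to at: unchanged and is therefore out of bounds."
    ^ index < 0
        ifTrue: [ self at: self size + index + 1 ]
        ifFalse: [ self at: index ]
```

With this in place, #(10 20 30 40) atSigned: -1 would answer 40 and #(10 20 30 40) atSigned: -4 would answer 10, following the hint that the last index equals the size of the collection.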
>>>>
>>>> You can also try Chapters 9, 10, 11 of the blue book (some API changes
>>>> may apply):
>>>>
>>>> http://sdmeta.gforge.inria.fr/FreeBooks/BlueBook/Bluebook.pdf
>>>>
>>>>
>>>>>    - Or what is the best way of implementing this index:
>>>>>    *dataset[["name"]]* (extracts a named row), *dataset[[1]]*
>>>>>    (extracts the first row)? Should I create two separate messages:
>>>>>    *dataset rowNamed: 'name'* and *dataset rowAt: 1*?
>>>>>
>>>>> rowNamed:
>>> rowAt:
>>>
>>> Yes, looks like it.
>>>
>>> But if we want to model things like R data frames, for example, this has
>>> to be seen as a vectorized operation, so you can use row slices, column
>>> slices, and logical indexes.
>>>
>>> Check this out:
>>>
>>> http://www.r-tutor.com/r-introduction/data-frame/data-frame-row-slice
>>> https://www.r-bloggers.com/working-with-data-frames/
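
As a rough idea of what row slices and logical indexes could look like with plain Pharo collections, here is a Playground sketch. The sample rows and all variable names are made up for illustration, and nothing here is an existing dataset API.

```smalltalk
"Playground sketch: row slice and logical index over an Array of rows."
| rows firstTwo mask adults |
rows := #( #('Ann' 31) #('Bob' 17) #('Cy' 45) ).

"Row slice: the first two rows."
firstTwo := rows copyFrom: 1 to: 2.

"Logical index: one Boolean per row, then keep the rows whose flag is true."
mask := rows collect: [ :row | row second >= 18 ].
adults := (1 to: rows size)
    select: [ :i | mask at: i ]
    thenCollect: [ :i | rows at: i ].

adults.    "the rows for 'Ann' and 'Cy'"
```

A real data frame would probably hide the mask behind a message of its own rather than expose the index arithmetic.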
>>>
>>>
>>>
>>>> The internal representation of your data-structure can be anything at
>>>> the moment, *as long as you encapsulate it.*
>>>>
>>>> (i.e. it can be nested OrderedCollections with metadata mapping column
>>>> names to indexes, or a dictionary of collections, etc.).
>>>>
>>>> *If you don't expose it to the user* (i.e. return it from the public
>>>> API, or expect knowledge of it in argument passing),
>>>> we can easily change it later. So *first make it work, and we optimize
>>>> later ;)*
>>>>
>>>> For your case it will be a little bit trickier because *you also have
>>>> the notions of a) rows and b) columns*, which
>>>> are exposed to the user. So *you would need to create abstractions*
>>>> for these too.
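
A minimal sketch of that encapsulation idea follows. The class name SketchDataset, its package name, and the selector addRow:named: are assumptions made for this example; only rowAt: and rowNamed: come from the discussion above. Methods are shown in the usual Class >> selector notation.

```smalltalk
"Hypothetical class; the internal representation is two parallel collections."
Object subclass: #SketchDataset
    instanceVariableNames: 'rows rowNames'
    classVariableNames: ''
    package: 'Sketch-TabularDataset'.

SketchDataset >> initialize
    super initialize.
    rows := OrderedCollection new.
    rowNames := OrderedCollection new

SketchDataset >> addRow: aCollection named: aString
    "Store the row and its name at the same position."
    rows add: aCollection asArray.
    rowNames add: aString

SketchDataset >> rowAt: index
    "Answer a copy so that callers never hold the internal storage."
    ^ (rows at: index) copy

SketchDataset >> rowNamed: aString
    "indexOf: answers 0 for an unknown name, so rowAt: then signals an error."
    ^ self rowAt: (rowNames indexOf: aString)
```

In a Playground this could be used as:

```smalltalk
| ds |
ds := SketchDataset new.
ds addRow: #(1 2 3) named: 'first'.
ds addRow: #(4 5 6) named: 'second'.
ds rowNamed: 'second'.    "#(4 5 6)"
ds rowAt: 1.              "#(1 2 3)"
```

Because callers only see rowAt: and rowNamed:, the two OrderedCollections could later be swapped for a dictionary or a column-oriented layout without changing client code.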
>>>>
>>>> Cheers,
>>>>
>>>> Nick
>>>>
>>>>>
>>>>>
>>>>> If someone else is having problems with Iceberg on Linux, try
>>>>> downloading the threaded VM:
>>>>>
>>>>> wget -O- get.pharo.org/vmT60 | bash
>>>>>
>>>>> And use SSH (not HTTPS) remote URL.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Pharo Google Summer of Code" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to pharo-gsoc+unsubscr...@googlegroups.com.
>>>>> To post to this group, send email to pharo-g...@googlegroups.com.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/pharo-gsoc/CAEp0Uzu-8fw3dA6ezVoj-QptvLcB8cWPHvZ1tfLg1Ce8qkTqfQ%40mail.gmail.com
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>>
>>>
>>>
>>
>

Re: [Pharo-users] Fwd: Re: 11/05/17 - Tabular Data Structures for Data Analysis - Oleksandr Zaytsev
