17 - Tabular Data Structures for Data Analysis - Oleksandr Zaytsev

Stephane Ducasse Wed, 17 May 2017 10:57:35 -0700

write some tests and ask for good implementations.

Crazy implementors like Henrik can probably beat us all :)



On Wed, May 17, 2017 at 7:55 PM, Stephane Ducasse <stepharo.s...@gmail.com>
wrote:

> I'm interested in helping with such new "containers".
> Maybe we should proceed this way:
>
>
> On Tue, May 16, 2017 at 7:44 PM, p...@highoctane.be <p...@highoctane.be>
> wrote:
>
>> We may also use Discord and do something "somewhat live"
>>
>> Phil
>>
>> On Tue, May 16, 2017 at 7:23 PM, <serge.stinckw...@gmail.com> wrote:
>>
>>> I was asking Philippe, but I hope to see you at ESUG too!
>>>
>>> Sent from my iPhone
>>>
>>> On 16 May 2017 at 19:02, Oleksandr Zaytsev <olk.zayt...@gmail.com>
>>> wrote:
>>>
>>> I would love to, but to go to Lille from my country I would need a visa,
>>> which is not that easy to acquire.
>>> So maybe I will come to PharoDays 2018.
>>> And I will definitely try to come to the ESUG Conference in September.
>>>
>>> Oleks
>>>
>>> On Tue, May 16, 2017 at 7:26 PM, <serge.stinckw...@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On 11 May 2017 at 11:43, "p...@highoctane.be" <p...@highoctane.be>
>>>> wrote:
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: "p...@highoctane.be" <p...@highoctane.be>
>>>> Date: 11 May 2017 10:54
>>>> Subject: Re: 11/05/17 - Tabular Data Structures for Data Analysis -
>>>> Oleksandr Zaytsev
>>>> To: "Nick Papoylias" <npapoyl...@gmail.com>
>>>> Cc:
>>>>
>>>>
>>>>
>>>> On Thu, May 11, 2017 at 10:20 AM, Nick Papoylias <npapoyl...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thu, May 11, 2017 at 5:24 AM, Oleksandr Zaytsev <
>>>>> olk.zayt...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> *A. Work done*
>>>>>>
>>>>>>    - Downloaded the threaded VM as suggested by Esteban Lorenzano to
>>>>>>    make Iceberg work. And it does! I have successfully pushed my
>>>>>>    NeuralNetwork code to GitHub:
>>>>>>    https://github.com/olekscode/MLNeuralNetwork
>>>>>>    - Joined the PolyMath organization on GitHub
>>>>>>    - Created a repository for the TabularDataset project
>>>>>>    https://github.com/PolyMathOrg/TabularDataset as a part of the
>>>>>>    PolyMath organization on GitHub
>>>>>>    - Fixed PolyMath issue #25 and made a PR
>>>>>>    - Read an article from the Wolfram Mathematica documentation
>>>>>>    regarding Dataset. It was one of the reading suggestions sent to me
>>>>>>    by Nick Papoylias
>>>>>>
>>>>>>
>>>>>> *B. Next steps*
>>>>>>
>>>>>>    - Fix more issues of PolyMath, using Iceberg. I have to get used
>>>>>>    to it by the time the coding phase starts
>>>>>>    - Read the rest of Nick Papoylias's suggestions
>>>>>>
>>>>>>
>>>>>> *C. Help needed*
>>>>>>
>>>>>>    - The Dataset in Wolfram, as well as Pandas in Python, has a very
>>>>>>    advanced indexing system. Smalltalk has its own conventions for
>>>>>>    indexing, so I think it would be great if I got familiar with them.
>>>>>>    Could you suggest some reading on this topic (what are the indexing
>>>>>>    conventions in Smalltalk?).
>>>>>>    For example, in Wolfram, I can write *dataset[[-1]]* to extract
>>>>>>    the last row. But in Pharo indexes cannot be negative. In Pharo I
>>>>>>    would say *dataset last*. But how about *dataset[[-5]]*?
>>>>>>
>>>>> This would be a good exercise for you ;) In Pharo you can easily add
>>>>> negative indexing yourself.
>>>>>
>>>>> *Hint:* You know the index of the last element, since this is the
>>>>> size of the collection, so... ;)
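The hint above comes down to one line of arithmetic. Here is a minimal sketch in Python (picked only because the thread already compares against Pandas); `at_from_end` is a hypothetical helper, not a Pharo or Wolfram API. It keeps Pharo's 1-based positions and maps a negative index onto them using the collection size.

```python
def at_from_end(items, index):
    """Return the element at a 1-based index; negative indexes count from the end.

    Hypothetical helper: -1 is the last element, as in Wolfram's dataset[[-1]].
    """
    size = len(items)
    if index < 0:
        index = size + 1 + index  # e.g. size 6, index -5 -> position 2
    if not 1 <= index <= size:
        raise IndexError(f"index out of bounds: {index}")
    return items[index - 1]  # Python lists are 0-based internally


rows = ["r1", "r2", "r3", "r4", "r5", "r6"]
print(at_from_end(rows, -1))  # r6
print(at_from_end(rows, -5))  # r2
```

An extension method on a Pharo collection could follow the same arithmetic; the bounds check is kept so that an out-of-range negative index fails loudly instead of wrapping.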
>>>>>
>>>> No need for changes, this exists already.
>>>>
>>>> Use atWrap: index put: value and atWrap: with negative indexes.
>>>> 'hello' atWrap: -2
>>>>
>>>> There is a specific version for Array using a primitive.
>>>> #( 10 20 30 40 ) atWrap: -1
>>>>
>>>> atWrap: 0 gives you the last item (40).
>>>> atWrap: -1 gives 30.
>>>>
>>>> This is different from 0-based index languages.
>>>>
>>>> The interesting thing about atWrap: is that it uses modulo internally so
>>>> you do not need to care about that.
>>>>
>>>> ($/ split: 'abc/def/ghi/jkl') atWrap: -1
>>>> --> 'ghi'
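To make the wrap-around concrete, here is a small Python sketch of the arithmetic. It assumes atWrap: computes the 1-based position as (index - 1) modulo size, plus 1; that formula is an assumption, though it matches every result quoted above. `at_wrap` is a hypothetical name used only for illustration.

```python
def at_wrap(items, index):
    """Mimic the assumed arithmetic behind atWrap: with a wrapping 1-based index."""
    position = (index - 1) % len(items) + 1  # always lands in 1..size
    return items[position - 1]               # Python lists are 0-based


values = [10, 20, 30, 40]
print(at_wrap(values, 0))    # 40, the last item
print(at_wrap(values, -1))   # 30
print(at_wrap(values, 5))    # 10, wraps past the end
print(at_wrap("hello", -2))  # l
```

Note the contrast with Python's own negative indexing, where -1 (not 0) is the last element.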
>>>>
>>>> The Matrix class has a bunch of things API wise but the class is highly
>>>> inefficient, doing copies all the time etc. It would be nice to have some
>>>> kind of futures/copy on write style things in there.
>>>>
>>>> I miss cbind and rbind. These are useful. I have some half baked super
>>>> inefficient implementations of these things for Matrix.
>>>>
>>>> https://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html
>>>>
>>>> The ability to name columns is also nice to have.
>>>>
>>>> In R one does:
>>>>
>>>> df <- data.frame(c(1,2,3))
>>>> df <- cbind(df, c(4,5,6))
>>>> df <- cbind(df, c(7,8,9))
>>>> names(df) <- c("C1", "C2", "C3")
>>>>
>>>> The names can be retrieved with:
>>>>
>>>> names(df)
>>>>
>>>> A Smalltalkish style would be welcome.
>>>>
>>>>
>>>>
>>>>
>>>> Interesting! Are you coming to PharoDays? We can talk about that if
>>>> we find time.
>>>>
>>>> Maybe looking at the Voyage queries can be helpful.
>>>>
>>>> Phil
>>>>
>>>>
>>>>
>>>>> Try adding an extension method to OrderedCollection or
>>>>> SequenceableCollection.
>>>>>
>>>>> If the Pharo by Example chapter or the MOOC is not enough, read the
>>>>> source itself in the core to see how basic methods are implemented (it
>>>>> is less scary than it sounds).
>>>>>
>>>>> You can also try Chapters 9, 10, 11 of the blue book (some API changes
>>>>> may apply):
>>>>>
>>>>> http://sdmeta.gforge.inria.fr/FreeBooks/BlueBook/Bluebook.pdf
>>>>>
>>>>>
>>>>>>    - Or what is the best way of implementing this index:
>>>>>>    *dataset[["name"]]* (extracts a named row), *dataset[[1]]*
>>>>>>    (extracts the first row)? Should I create two separate messages:
>>>>>>    *dataset rowNamed: 'name'* and *dataset rowAt: 1*?
>>>>>>
>>>> rowNamed:
>>>> rowAt:
>>>>
>>>> Yes, looks like it.
>>>>
>>>> But if we want to model things like R data frames, for example, this has
>>>> to be seen as a vectorized operation, so you can use row slices, column
>>>> slices, and logical indexes.
>>>>
>>>> Check this out:
>>>>
>>>> http://www.r-tutor.com/r-introduction/data-frame/data-frame-row-slice
>>>> https://www.r-bloggers.com/working-with-data-frames/
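For readers who have not used R data frames, the three kinds of selection mentioned above can be sketched with plain Python lists. The table contents and column names below are made-up illustration data, and nothing here is an existing Pharo or R API.

```python
# Hypothetical table: a list of rows plus a parallel list of column names.
columns = ["name", "age", "city"]
rows = [
    ["Ann", 31, "Lille"],
    ["Bob", 25, "Kyiv"],
    ["Cy", 47, "Bern"],
]

# Row slice: the first two rows.
row_slice = rows[0:2]

# Column slice: keep only the "name" and "city" columns.
keep = [columns.index(c) for c in ("name", "city")]
column_slice = [[row[i] for i in keep] for row in rows]

# Logical index: one boolean per row, then keep the rows marked True.
age = columns.index("age")
mask = [row[age] > 30 for row in rows]
selected = [row for row, flag in zip(rows, mask) if flag]

print(row_slice)
print(column_slice)
print(selected)
```

The logical index is the "vectorized" part: the condition is evaluated for every row at once, and the resulting mask does the selection.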
>>>>
>>>>
>>>>
>>>>> The internal representation of your data-structure can be anything at
>>>>> the moment, *as long as you encapsulate it.*
>>>>>
>>>>> (ie it can be nested OrderedCollections with meta-data for
>>>>> column-names to indexes, or dictionary of collections etc).
>>>>>
>>>>> *If you don't expose it to the user* (ie return it from the public
>>>>> api, or expect knowledge of it in argument passing),
>>>>> we can easily change it later. So *first make it work, and we
>>>>> optimize later ;)*
>>>>>
>>>>> For your case it will be a little bit trickier because *you also have
>>>>> the notions of a) rows and b) columns*, which
>>>>> are exposed to the user. So *you would need to create abstractions*
>>>>> for these too.
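A rough sketch of that encapsulation advice, written in Python for brevity: the class name `Table` and its methods are hypothetical, chosen to mirror the proposed rowAt: and rowNamed: messages. The storage (here a list of lists) stays private, and callers only ever receive copies, so it could later be swapped for a dictionary of columns without changing the public interface.

```python
class Table:
    """Hypothetical sketch: the storage stays private behind row/column accessors."""

    def __init__(self, column_names, rows, row_names=None):
        self._column_names = list(column_names)
        self._rows = [list(row) for row in rows]  # private copy, never returned
        self._row_names = list(row_names) if row_names else []

    def row_at(self, index):
        """1-based row access, mirroring the proposed rowAt:."""
        if not 1 <= index <= len(self._rows):
            raise IndexError(f"row index out of bounds: {index}")
        return tuple(self._rows[index - 1])

    def row_named(self, name):
        """Row access by name, mirroring the proposed rowNamed:."""
        return self.row_at(self._row_names.index(name) + 1)

    def column_named(self, name):
        """Column access by name, built from the private rows on demand."""
        position = self._column_names.index(name)
        return tuple(row[position] for row in self._rows)


table = Table(["x", "y"], [[1, 2], [3, 4]], row_names=["first", "second"])
print(table.row_at(1))            # (1, 2)
print(table.row_named("second"))  # (3, 4)
print(table.column_named("y"))    # (2, 4)
```

Returning tuples rather than the internal lists is one simple way to avoid leaking the representation; a fuller design would likely introduce dedicated row and column objects.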
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Nick
>>>>>
>>>>>>
>>>>>>
>>>>>> If someone else is having problems with Iceberg on Linux, try
>>>>>> downloading the threaded VM:
>>>>>>
>>>>>> wget -O- get.pharo.org/vmT60 | bash
>>>>>>
>>>>>> And use the SSH (not HTTPS) remote URL.
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "Pharo Google Summer of Code" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to pharo-gsoc+unsubscr...@googlegroups.com.
>>>>>> To post to this group, send email to pharo-g...@googlegroups.com.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/pharo-gsoc/CAEp0Uzu-8fw3dA6ezVoj-QptvLcB8cWPHvZ1tfLg1Ce8qkTqfQ%40mail.gmail.com
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
