On Thursday, 15 October 2015 at 07:57:51 UTC, Russel Winder wrote:
On Thu, 2015-10-15 at 06:48 +0000, data pulverizer via Digitalmars-d-learn wrote:
[…]
A journey of a thousand miles ...
Exactly.
I tried to start creating a data table type object by investigating variantArray:
http://forum.dlang.org/thread/hhzavwrkbrkjzfohc...@forum.dlang.org
but hit the snag that D is a statically typed language and may not allow the kind of dynamic behaviour you need in data-table-like objects.
I envisage such an object as being composed of an array of vectors, where each vector represents a column in the table, as in R; that makes model matrix creation easier. Some people believe you should instead work with arrays of tuple rows, which may be more big-data friendly (both layouts are sketched just after this quote). I am not overly wedded to either approach.
Anyway, it seems I have hit an inherent limitation of the language. Correct me if I am wrong. The data frame needs dynamic behaviour: bind rows and columns, return parts of itself as a data table, and so on; since D is a static language, we cannot do this.
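For concreteness, here is a minimal sketch in D of the two layouts being weighed; ColumnTable and Row are illustrative names, not an existing library:

import std.typecons : Tuple;

// Column-oriented, as in R: one typed array per column.
// Convenient for building a model matrix column by column.
struct ColumnTable
{
    double[] price;
    string[] ticker;
}

// Row-oriented: an array of named tuples, one per record.
// Rows stream naturally, which is what makes this layout
// more "big data friendly".
alias Row = Tuple!(double, "price", string, "ticker");
alias RowTable = Row[];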
Just because D doesn't have this now doesn't mean it cannot. C doesn't have such a capability, but R and Python do, even though the R and CPython interpreters are themselves written in C.
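In fact, the dynamic behaviour asked for above can already be approximated in static D. A minimal sketch using std.variant.Variant, where each column is a typed array held behind a runtime-typed handle (the column names are made up):

import std.variant : Variant;
import std.stdio : writeln;

void main()
{
    // Each column is a typed array stored behind a Variant,
    // so columns of different types share one container.
    Variant[string] columns;
    columns["id"]   = [1, 2, 3];
    columns["name"] = ["a", "b", "c"];

    // "Binding" a new column is a runtime operation.
    columns["score"] = [0.5, 0.7, 0.9];

    // Recover the static type before computing with a column.
    auto ids = columns["id"].get!(int[]);
    writeln(ids); // [1, 2, 3]
}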
Pandas data structures rely on the NumPy n-dimensional array implementation; it is not beyond the bounds of possibility that that data structure could be realized as a D module.
Is R's data.table written in R or in C? In either case, it is
not beyond the bounds of possibility that that data structure
could be realized as a D module.
The core issue is to have a seriously efficient n-dimensional
array that is amenable to data parallelism and is extensible.
As far as I am aware currently (I will investigate more), the NumPy array is a good native-code array, but it has some issues with data parallelism, and Pandas has to do quite a lot of work to get the extensibility. I wonder how the R data.table works.
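As a hedged sketch of what data parallelism over such an array can look like in D today, using std.parallelism (the flat buffer and the squaring operation are arbitrary stand-ins):

import std.array : array;
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio : writeln;

void main()
{
    // A flat buffer standing in for the storage of an
    // n-dimensional array.
    auto data = iota(0L, 10_000_000L).array;

    // Element-wise map, spread across all cores.
    auto squared = taskPool.amap!(x => x * x)(data);

    writeln(squared[0 .. 5]); // [0, 1, 4, 9, 16]
}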
I have this nagging feeling that, like NumPy, data.table seems a lot better than it actually is. From small experiments, D (and Chapel even more so) is hugely faster than Python/NumPy at things Python people think NumPy is brilliant for. Expectations of Python programmers are set by the scale of Python performance, so NumPy seems brilliant. Compared to the scale set by D and Chapel, NumPy is very disappointing. I bet the same is true of R (I have never really used R).
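The sort of small experiment meant here is easy to reproduce. A sketch of timing a simple reduction in D (array size and operation chosen arbitrarily), to be compared against the equivalent NumPy expression on the same machine:

import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

void main()
{
    auto data = new double[](10_000_000);
    data[] = 1.5;

    // Time a plain element-wise reduction.
    auto sw = StopWatch(AutoStart.yes);
    double sum = 0;
    foreach (x; data)
        sum += x * x;
    sw.stop();

    writeln("sum = ", sum, " in ", sw.peek.total!"msecs", " ms");
}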
This is therefore an opportunity for D to step in. However, it is a journey of a thousand miles to get something production-worthy. Python/NumPy/Pandas have had a very large number of programmer hours expended on them. Doing this poorly as a D module is likely worse than not doing it at all.
I think it's much better to start, which means solving your own
problems in a way that is acceptable to you rather than letting
perfection be the enemy of the good. It's always easier to do
something a second time too, as you learn from successes and
mistakes and you have a better idea about what you want. Of course it's better to put some thought into design early on, but that shouldn't end in analysis paralysis. John Colvin and others are putting quite a lot of thought into dlang science, it seems to me, and John in particular is getting stuff done. Running D in a Jupyter notebook is something very useful. It doesn't matter that it's cosmetically imperfect at this stage, and it won't stay that way. And that's just a small step towards the bigger goal.