Thank you sir for the clarification.

On Sat, Jun 19, 2021 at 7:24 PM Konrad Hinsen <konrad.hin...@cnrs.fr> wrote:
> Dear Balaji,
>
> > I am working on implementing the DataFrame>>dtypes feature which
> > checks the datatypes of columns in a DataFrame, as part of my GSOC
> > project. I have tried to explain my theoretical work done so far on
> > this blog post. Please kindly go through it, as I need advice on what
> > could be the optimal way to implement this feature. Any kind of input
> > and discussions is most welcome.
>
> Your post looks like an overall accurate description of the current
> state of everything - with one exception, and that is Pandas. You say
> you didn't look at the Pandas code yet, so that's not surprising.
>
> You seem to assume that Pandas stores Python objects as elements of
> DataFrames, but that isn't true. Pandas uses NumPy arrays instead. And
> NumPy arrays are very different from standard Python objects, because
> their internal data layout is by design the same as used in C or
> Fortran. For a full description, see
>
> https://numpy.org/doc/stable/user/basics.rec.html
>
> However, I am not sure you need to understand this in all detail, as I
> am pretty sure that you do not want to copy this approach in Pharo.
>
> The one point that does matter for you is where NumPy and Pandas take a
> column's dtype from. The answer is that it's defined when a DataFrame is
> created, and it cannot be changed afterwards. If a column is "integer",
> it will remain "integer" forever. If you try to assign a string to an
> element of such a column, you get an error message. When you create a
> DataFrame from existing data, e.g. by reading a CSV file, Pandas scans
> the data and determines a suitable dtype, much in the same way as V1.0
> in Pharo/PolyMath did. But since Pandas doesn't allow any later change,
> there is no serious performance issue.
>
> So that's an option you can add to your list: define the dtypes once and
> for all when the DataFrame is created. The main drawback is that you
> would have to change the API for DataFrame creation to make this work.
>
> Cheers,
> Konrad
> --
> ---------------------------------------------------------------------
> Konrad Hinsen
> Centre de Biophysique Moléculaire, CNRS Orléans
> Synchrotron Soleil - Division Expériences
> Saint Aubin - BP 48
> 91192 Gif sur Yvette Cedex, France
> Tel. +33-1 69 35 97 15
> E-Mail: konrad DOT hinsen AT cnrs DOT fr
> http://dirac.cnrs-orleans.fr/~hinsen/
> ORCID: https://orcid.org/0000-0003-0330-9428
> Twitter: @khinsen
> ---------------------------------------------------------------------
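
To make the "define the dtypes once and for all" option from the quoted message a bit more concrete, here is a minimal sketch in plain Python. It is not how Pandas or NumPy are implemented, and it is not Pharo code. The names `TypedColumn` and `infer_dtype` are hypothetical and exist only for this illustration. The idea is simply that the data is scanned once at creation, and every later assignment is checked against the stored dtype instead of triggering a new scan.

```python
from typing import Any, Iterable, List


def infer_dtype(values: List[Any]) -> type:
    """Scan the values once and pick a type that fits all of them."""
    kinds = {type(v) for v in values}
    if kinds <= {int}:
        return int
    if kinds <= {int, float}:
        return float
    if kinds <= {str}:
        return str
    return object  # mixed content: fall back to a generic type


class TypedColumn:
    """Hypothetical column whose dtype is fixed when it is created."""

    def __init__(self, values: Iterable[Any]) -> None:
        data = list(values)
        self.dtype = infer_dtype(data)  # the only full scan of the data
        if self.dtype is float:
            data = [float(v) for v in data]  # store ints as floats
        self._data = data

    def __getitem__(self, index: int) -> Any:
        return self._data[index]

    def __setitem__(self, index: int, value: Any) -> None:
        # An int fits into a float column, so convert it silently.
        if self.dtype is float and type(value) is int:
            value = float(value)
        # Anything else must match the dtype chosen at creation time.
        if self.dtype is not object and type(value) is not self.dtype:
            raise TypeError(
                f"cannot store {type(value).__name__} "
                f"in a column of dtype {self.dtype.__name__}"
            )
        self._data[index] = value


ages = TypedColumn([31, 45, 27])
ages[0] = 32  # accepted: still an int
try:
    ages[1] = "forty-five"  # rejected: the column stays "integer"
except TypeError as err:
    print(err)
```

Because the check in `__setitem__` looks at a single value, asking for the dtype afterwards costs nothing, which is the performance point made in the quoted message. Raising on a mismatch follows the description given there and matches what a NumPy integer array does when assigned a non-numeric string; how a given Pandas version reacts to the same assignment is worth checking against its documentation before relying on it.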