Hernán,

If you need some help, you can also find a smart student and ask ESUG to sponsor him for a SummerTalk.

Stef

On 26/1/15 21:03, Hernán Morales Durand wrote:


2015-01-26 9:01 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:

    Hernán,

    > On 26 Jan 2015, at 08:00, Hernán Morales Durand <hernan.mora...@gmail.com> wrote:
    >
    > It is possible :)
    > I work with DNA sequences; there could be millions of common
    SNPs in a genome.

    Still weird for CSV. How many records are there then?

We genotyped a few individuals (24 records), but now we have a genotyping platform (GeneTitan) with array plates allowing up to 96 samples, which means up to 2.6 million markers. The first run I completed generated CSVs of 1 million records (see attachment). Sadly, the high-level analysis of this data (annotation, clustering, discrimination) is currently performed in R with packages like SNPolisher.

And this is just microarray analysis; NGS platforms produce larger volumes of data in a shorter period of time (several genomes in a day). See http://www.slideshare.net/allenday/renaissance-in-medicine-strata-nosql-and-genomics for the 2014-2020 predictions.

Feel free to contact me if you want to experiment with metrics.

    I assume they all have the same number of fields?


Yes, I have never seen a CSV file with a variable number of fields (in this domain).

    Anyway, could you point me to the specification of the format you
    want to read?


Actually I am in no rush for this; I just want to avoid awk, sed and shell scripts in the next run. I would also like to avoid Python, but it spreads like a virus.

I will be working mostly with CSVs from Axiom annotation files [1] and genotyping results. Other file formats I use are genotype file formats for programs like PLINK [2] (PED files, column 7 onwards) and HaploView. It is worse than you might think, because you have to transpose the output generated by genotyping platforms (millions of records), and then filter and cut it by chromosome, because those Java programs cannot deal with all chromosomes at the same time.
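The per-chromosome split could be done with NeoCSV directly. A rough sketch (the file name 'genotypes.csv', the tab separator and the chromosome sitting in the second column are assumptions here, not the actual Axiom layout):

| reader byChromosome |
byChromosome := Dictionary new.
reader := NeoCSVReader on: 'genotypes.csv' asFileReference readStream.
reader separator: Character tab.
reader skipHeader.
"collect each record under its chromosome key (second column in this sketch)"
reader do: [ :record |
    (byChromosome at: (record at: 2) ifAbsentPut: [ OrderedCollection new ])
        add: record ].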

    And to the older one that you used to use?


http://www.smalltalkhub.com/#!/~hernan/CSV

Cheers,
Hernán


[1] http://www.affymetrix.com/support/technical/annotationfilesmain.affx
[2] http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped


    Thx,

    Sven

    > Cheers,
    >
    > Hernán
    >
    >
    > 2015-01-26 3:33 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
    >
    > > On 26 Jan 2015, at 06:32, Hernán Morales Durand <hernan.mora...@gmail.com> wrote:
    > >
    > >
    > >
    > > 2015-01-23 18:00 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
    > >
    > > > On 23 Jan 2015, at 20:53, Hernán Morales Durand <hernan.mora...@gmail.com> wrote:
    > > >
    > > > Hi Sven,
    > > >
    > > > 2015-01-23 16:06 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
    > > > Hi Hernán,
    > > >
    > > > > On 23 Jan 2015, at 19:50, Hernán Morales Durand <hernan.mora...@gmail.com> wrote:
    > > > >
    > > > > I used to use a CSV parser from Squeak where I could attach conditional iterations:
    > > > >
    > > > > csvParser rowsSkipFirst: 2 do: [ :row | "some action ignoring first 2 fields on each row" ].
    > > > > csvParser rowsSkipLast: 2 do: [ :row | "some action ignoring last 2 fields on each row" ].
    > > >
    > > > With NeoCSVReader you can describe how each field is read
    and converted; using the same mechanism you can ignore fields.
    Have a look at the senders of #addIgnoredField in the unit tests.
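    > > >
    > > > For instance, a field with a custom converter block might
    look like this (a made-up two-column input, keeping only the
    first column as a number):
    > > >
    > > > (NeoCSVReader on: '100,abc' readStream)
    > > >     addFieldConverter: [ :string | string asNumber ];
    > > >     addIgnoredField;
    > > >     upToEnd.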
    > > >
    > > >
    > > > I am trying to understand the implementation; I see you
    included #addIgnoredFields: for consecutive fields in
    Neo-CSV-Core-SvenVanCaekenberghe.21.
    > > > A usage question, then: does adding ignored field(s)
    require adding field types for all the remaining fields?
    > >
    > > Yes, like this:
    > >
    > > testReadWithIgnoredField
    > >     | input |
    > >     input := String crlf join: #( '1,2,a,3' '1,2,b,3' '1,2,c,3' '' ).
    > >     self
    > >         assert: ((NeoCSVReader on: input readStream)
    > >             addIntegerField;
    > >             addIntegerField;
    > >             addIgnoredField;
    > >             addIntegerField;
    > >             upToEnd)
    > >         equals: {
    > >             #(1 2 3).
    > >             #(1 2 3).
    > >             #(1 2 3) }
    > >
    > >
    > >
    > > Maybe you would like to know, in case you make another pass
    over NeoCSV: for some data sets I have 1 million columns, so an
    addFieldsInterval: or some such would be nice.
    >
    > 1 million columns? How is that possible, or useful?
    >
    > The reader is like a builder. You could try to do this yourself
    by writing a little loop or two.
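    >
    > For example, a rough sketch of such a loop (the 2000-column
    layout and the 1000-1500 interval are made-up placeholders):
    >
    > | reader |
    > reader := NeoCSVReader on: input readStream.
    > 999 timesRepeat: [ reader addIgnoredField ].  "columns 1 to 999"
    > 501 timesRepeat: [ reader addIntegerField ].  "columns 1000 to 1500"
    > 500 timesRepeat: [ reader addIgnoredField ].  "columns 1501 to 2000"
    > reader upToEnd.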
    >
    > But still, 1 million?
    >
    > > Thank you.
    > >
    > > Hernán
    > >
    >
    >
    >



