There are some cool ideas here:

https://github.com/BurntSushi/xsv

> On 31 Jan 2015, at 14:26, stepharo <steph...@free.fr> wrote:
> 
> hernan
> 
> if you need some help, you can also find a smart student and ask ESUG to
> sponsor him during a SummerTalk.
> 
> Stef
> 
> On 26/1/15 21:03, Hernán Morales Durand wrote:
>> 
>> 
>> 2015-01-26 9:01 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
>> Hernán,
>> 
>> > On 26 Jan 2015, at 08:00, Hernán Morales Durand <hernan.mora...@gmail.com> 
>> > wrote:
>> >
>> > It is possible :)
>> > I work with DNA sequences; there could be millions of common SNPs in a
>> > genome.
>> 
>> Still weird for CSV. How many records are there then?
>> 
>> We genotyped a few individuals (24 records), but now we have a genotyping
>> platform (GeneTitan) with array plates allowing up to 96 samples, which
>> means up to 2.6 million markers. The first run I completed generated CSVs
>> of 1 million records (see attachment). Sadly, the high-level analysis of
>> this data (annotation, clustering, discrimination) is currently performed
>> in R with packages like SNPolisher.
>> 
>> And this is just microarray analysis; NGS platforms produce larger volumes
>> of data in a shorter period of time (several genomes in a day). See
>> http://www.slideshare.net/allenday/renaissance-in-medicine-strata-nosql-and-genomics
>> for the 2014-2020 predictions.
>> 
>> Feel free to contact me if you want to experiment with metrics.
>> 
>> I assume they all have the same number of fields?
>> 
>> Yes, I have never seen a CSV file with a variable number of fields (in
>> this domain).
>> 
>> Anyway, could you point me to the specification of the format you want to
>> read?
>> 
>> Actually I am in no rush for this; I just want to avoid awk, sed, and
>> shell scripts in the next run. I would also like to avoid Python, but it
>> spreads like a virus.
>> 
>> I will be working mostly with CSVs from Axiom annotation files [1] and
>> genotyping results. Other file formats I use are genotype file formats for
>> programs like PLINK [2] (PED files, column 7 onwards) and HaploView. It is
>> worse than you might think, because you have to transpose the output
>> generated by the genotyping platforms (millions of records), and then
>> filter & cut it by chromosome, because those Java programs cannot deal
>> with all chromosomes at the same time.
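>> 
>> Just to illustrate, a naive transpose in Pharo could look like this rough
>> sketch (untested; it loads everything into memory, so it would not scale
>> to millions of records; 'input' stands for the CSV contents as a String):
>> 
>> | rows transposed writer |
>> rows := (NeoCSVReader on: input readStream) upToEnd.
>> "Column i of every input row becomes row i of the transpose."
>> transposed := (1 to: rows first size) collect: [ :i |
>>         rows collect: [ :row | row at: i ] ].
>> writer := NeoCSVWriter on: String new writeStream.
>> writer nextPutAll: transposed.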
>>  
>> And to the older one that you used to use?
>> 
>> 
>> http://www.smalltalkhub.com/#!/~hernan/CSV
>> 
>> Cheers,
>> Hernán
>> 
>> 
>> [1] http://www.affymetrix.com/support/technical/annotationfilesmain.affx
>> [2] http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped
>> 
>> 
>>  
>> Thx,
>> 
>> Sven
>> 
>> > Cheers,
>> >
>> > Hernán
>> >
>> >
>> > 2015-01-26 3:33 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
>> >
>> > > On 26 Jan 2015, at 06:32, Hernán Morales Durand 
>> > > <hernan.mora...@gmail.com> wrote:
>> > >
>> > >
>> > >
>> > > 2015-01-23 18:00 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
>> > >
>> > > > On 23 Jan 2015, at 20:53, Hernán Morales Durand 
>> > > > <hernan.mora...@gmail.com> wrote:
>> > > >
>> > > > Hi Sven,
>> > > >
>> > > > 2015-01-23 16:06 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
>> > > > Hi Hernán,
>> > > >
>> > > > > On 23 Jan 2015, at 19:50, Hernán Morales Durand 
>> > > > > <hernan.mora...@gmail.com> wrote:
>> > > > >
>> > > > > I used to use a CSV parser from Squeak where I could attach
>> > > > > conditional iterations:
>> > > > >
>> > > > > csvParser rowsSkipFirst: 2 do: [ :row | "some action ignoring first 2 fields on each row" ].
>> > > > > csvParser rowsSkipLast: 2 do: [ :row | "some action ignoring last 2 fields on each row" ].
>> > > >
>> > > > With NeoCSVReader you can describe how each field is read and
>> > > > converted; using the same mechanism you can ignore fields. Have a
>> > > > look at the senders of #addIgnoredField in the unit tests.
>> > > >
>> > > >
>> > > > I am trying to understand the implementation. I see you included
>> > > > #addIgnoredFields: for consecutive fields in
>> > > > Neo-CSV-Core-SvenVanCaekenberghe.21.
>> > > > A question about usage then: does adding ignored field(s) require
>> > > > adding field types on all the other remaining fields?
>> > >
>> > > Yes, like this:
>> > >
>> > > testReadWithIgnoredField
>> > >         | input |
>> > >         input := String crlf join: #( '1,2,a,3' '1,2,b,3' '1,2,c,3' '' ).
>> > >         self
>> > >                 assert: ((NeoCSVReader on: input readStream)
>> > >                                         addIntegerField;
>> > >                                         addIntegerField;
>> > >                                         addIgnoredField;
>> > >                                         addIntegerField;
>> > >                                         upToEnd)
>> > >                 equals: {
>> > >                         #(1 2 3).
>> > >                         #(1 2 3).
>> > >                         #(1 2 3) }
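>> > > 
>> > > For a large file you would not collect everything with #upToEnd but
>> > > stream over the records instead, along these lines (just a sketch):
>> > > 
>> > > (NeoCSVReader on: input readStream)
>> > >         addIntegerField;
>> > >         addIntegerField;
>> > >         addIgnoredField;
>> > >         addIntegerField;
>> > >         do: [ :record |
>> > >                 "process each record here, one at a time, without
>> > >                 holding the whole file in memory" ]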
>> > >
>> > >
>> > >
>> > > In case you make another pass over NeoCSV, maybe you would like to
>> > > know: for some data sets I have 1 million columns, so an
>> > > #addFieldsInterval: or some such would be nice.
>> >
>> > 1 million columns? How is that possible, or useful?
>> >
>> > The reader is like a builder. You could try to do this yourself by writing 
>> > a little loop or two.
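>> > 
>> > Something like this, maybe (a sketch; numberOfColumns, start and stop
>> > stand for your actual values):
>> > 
>> > | reader |
>> > reader := NeoCSVReader on: input readStream.
>> > "Keep the columns in the interval start..stop, ignore the rest."
>> > 1 to: numberOfColumns do: [ :i |
>> >         (i between: start and: stop)
>> >                 ifTrue: [ reader addIntegerField ]
>> >                 ifFalse: [ reader addIgnoredField ] ].
>> > reader upToEnd.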
>> >
>> > But still, 1 million?
>> >
>> > > Thank you.
>> > >
>> > > Hernán
>> > >
>> >
>> >
>> >
>> 
>> 
>> 
> 

