There are some cool ideas here: https://github.com/BurntSushi/xsv
> On 31 Jan 2015, at 14:26, stepharo <steph...@free.fr> wrote:
>
> Hernán
>
> If you need some help you can also find a smart student and ask ESUG
> to sponsor him during a SummerTalk.
>
> Stef
>
> On 26/1/15 21:03, Hernán Morales Durand wrote:
>>
>> 2015-01-26 9:01 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
>> Hernán,
>>
>> > On 26 Jan 2015, at 08:00, Hernán Morales Durand
>> > <hernan.mora...@gmail.com> wrote:
>> >
>> > It is possible :)
>> > I work with DNA sequences; there can be millions of common SNPs in
>> > a genome.
>>
>> Still weird for CSV. How many records are there then?
>>
>> We genotyped a few individuals (24 records), but now we have a
>> genotyping platform (GeneTitan) with array plates allowing up to 96
>> samples, which means up to 2.6 million markers. The first run I
>> completed generated CSVs of 1 million records (see attachment).
>> Sadly, the high-level analysis of this data (annotation, clustering,
>> discrimination) is currently performed in R with packages like
>> SNPolisher.
>>
>> And this is microarray analysis; NGS platforms produce larger
>> volumes of data in a shorter period of time (several genomes in a
>> day). See
>> http://www.slideshare.net/allenday/renaissance-in-medicine-strata-nosql-and-genomics
>> for the 2014-2020 predictions.
>>
>> Feel free to contact me if you want to experiment with metrics.
>>
>> I assume they all have the same number of fields?
>>
>> Yes, I have never seen a CSV file with a variable number of fields
>> (in this domain).
>>
>> Anyway, could you point me to the specification of the format you
>> want to read?
>>
>> Actually, I am in no rush for this; I just want to avoid awk, sed
>> and shell scripts in the next run. I would also like to avoid
>> Python, but it spreads like a virus.
>>
>> I will be working mostly with CSVs from Axiom annotation files [1]
>> and genotyping results. Other file formats I use are genotype file
>> formats for programs like PLINK [2] (PED files, column 7 onwards)
>> and HaploView. It is worse than you might think, because you have to
>> transpose the output generated by the genotyping platforms (millions
>> of records), and then filter and cut it by chromosome, because those
>> Java programs cannot deal with all chromosomes at the same time.
>>
>> And to the older one that you used to use?
>>
>> http://www.smalltalkhub.com/#!/~hernan/CSV
>>
>> Cheers,
>> Hernán
>>
>> [1] http://www.affymetrix.com/support/technical/annotationfilesmain.affx
>> [2] http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped
>>
>> Thx,
>>
>> Sven
>>
>> > Cheers,
>> >
>> > Hernán
>> >
>> > 2015-01-26 3:33 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
>> >
>> > > On 26 Jan 2015, at 06:32, Hernán Morales Durand
>> > > <hernan.mora...@gmail.com> wrote:
>> > >
>> > > 2015-01-23 18:00 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
>> > >
>> > > > On 23 Jan 2015, at 20:53, Hernán Morales Durand
>> > > > <hernan.mora...@gmail.com> wrote:
>> > > >
>> > > > Hi Sven,
>> > > >
>> > > > 2015-01-23 16:06 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
>> > > > Hi Hernán,
>> > > >
>> > > > > On 23 Jan 2015, at 19:50, Hernán Morales Durand
>> > > > > <hernan.mora...@gmail.com> wrote:
>> > > > >
>> > > > > I used to use a CSV parser from Squeak where I could attach
>> > > > > conditional iterations:
>> > > > >
>> > > > > csvParser rowsSkipFirst: 2 do: [ :row |
>> > > > >     "some action ignoring the first 2 fields on each row" ].
>> > > > > csvParser rowsSkipLast: 2 do: [ :row |
>> > > > >     "some action ignoring the last 2 fields on each row" ].
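Since the thread above is about CSVs with a million records, it is
worth noting that NeoCSVReader works against a read stream, so a file
of that size can be processed record by record without loading it
whole. A minimal sketch, assuming Pharo's FileReference API and
NeoCSV's reader protocol ('genotypes.csv' is a hypothetical file name):

    | count |
    count := 0.
    'genotypes.csv' asFileReference readStreamDo: [ :stream |
        (NeoCSVReader on: stream)
            skipHeader;
            do: [ :row | count := count + 1 ] ].
    Transcript show: count printString; cr.
    "Counts the data records one at a time, never holding them all in memory."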
>> > > >
>> > > > With NeoCSVReader you can describe how each field is read and
>> > > > converted; using the same mechanism you can ignore fields. Have
>> > > > a look at the senders of #addIgnoredField from the unit tests.
>> > > >
>> > > > I am trying to understand the implementation. I see you
>> > > > included #addIgnoredFields: for consecutive fields in
>> > > > Neo-CSV-Core-SvenVanCaekenberghe.21.
>> > > > A question about usage then: does adding ignored field(s)
>> > > > require adding field types on all the other remaining fields?
>> > >
>> > > Yes, like this:
>> > >
>> > > testReadWithIgnoredField
>> > >     | input |
>> > >     input := String crlf join: #('1,2,a,3' '1,2,b,3' '1,2,c,3' '').
>> > >     self
>> > >         assert: ((NeoCSVReader on: input readStream)
>> > >             addIntegerField;
>> > >             addIntegerField;
>> > >             addIgnoredField;
>> > >             addIntegerField;
>> > >             upToEnd)
>> > >         equals: {
>> > >             #(1 2 3).
>> > >             #(1 2 3).
>> > >             #(1 2 3) }
>> > >
>> > > Maybe you would like to know, in case you make a pass over
>> > > NeoCSV: for some data sets I have 1 million columns, so an
>> > > addFieldsInterval: or some such would be nice.
>> >
>> > 1 million columns? How is that possible, or useful?
>> >
>> > The reader is like a builder. You could try to do this yourself by
>> > writing a little loop or two.
>> >
>> > But still, 1 million?
>> >
>> > > Thank you.
>> > >
>> > > Hernán
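Since the reader is a builder, the addFieldsInterval: that Hernán asks
for can be approximated today with the "little loop" Sven mentions. A
minimal sketch, assuming the total column count is known up front (10
columns here as a small stand-in for the million-column case, keeping
columns 3 to 5):

    | reader |
    reader := NeoCSVReader on: '1,2,3,4,5,6,7,8,9,10' readStream.
    1 to: 10 do: [ :i |
        (i between: 3 and: 5)
            ifTrue: [ reader addIntegerField ]
            ifFalse: [ reader addIgnoredField ] ].
    reader upToEnd.
    "Returns an array with the single record #(3 4 5)."

Wrapping that loop in a convenience method would give the proposed
addFieldsInterval: behavior without any changes to NeoCSV itself.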