I will add that to the topic list!
On 22/2/15 22:15, Sven Van Caekenberghe wrote:
There are some cool ideas here:
On 31 Jan 2015, at 14:26, stepharo <steph...@free.fr> wrote:
If you need some help, you can also find a smart student and ask ESUG to sponsor
him through a SummerTalk.
On 26/1/15 21:03, Hernán Morales Durand wrote:
2015-01-26 9:01 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
On 26 Jan 2015, at 08:00, Hernán Morales Durand <hernan.mora...@gmail.com>
It is possible :)
I work with DNA sequences; there can be millions of common SNPs in a genome.
Still weird for CSV. How many records are there then?
We genotyped a few individuals (24 records), but now we have a genotyping platform
(GeneTitan) with array plates allowing up to 96 samples, which means up to 2.6
million markers. The first run I completed generated CSVs of 1 million
records (see attachment). Sadly, the high-level analysis of this data (annotation,
clustering, discrimination) is now performed in R with packages like
And this is only microarray analysis; NGS platforms produce larger volumes of data
in a shorter period of time (several genomes in a day). See
for the 2014-2020 predictions.
Feel free to contact me if you want to experiment with metrics.
I assume they all have the same number of fields ?
Yes, I have never seen a CSV file with a variable number of fields (in this domain).
Anyway, could you point me to the specification of the format you want to read ?
Actually, I am in no rush for this; I just want to avoid awk, sed and shell scripts in
the next run. I would like to avoid Python too, but it spreads like a virus.
I will be working mostly with CSVs from Axiom annotation files [1] and genotyping
results. Other file formats I use are genotype file formats for programs like PLINK
[2] (PED files, column 7 onwards) and HaploView. It is worse than you might think,
because you have to transpose the output generated by genotyping platforms
(millions of records), and then filter and cut it by chromosome, because those
Java programs cannot deal with all chromosomes at the same time.
And to the older one that you used to use?
[1] http://www.affymetrix.com/support/technical/annotationfilesmain.affx
[2] http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped
2015-01-26 3:33 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
On 26 Jan 2015, at 06:32, Hernán Morales Durand <hernan.mora...@gmail.com>
2015-01-23 18:00 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
On 23 Jan 2015, at 20:53, Hernán Morales Durand <hernan.mora...@gmail.com>
Hi Sven,
2015-01-23 16:06 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
Hi Hernán,
On 23 Jan 2015, at 19:50, Hernán Morales Durand <hernan.mora...@gmail.com>
I used to use a CSV parser from Squeak where I could attach conditional
csvParser rowsSkipFirst: 2 do: [ :row | "some action ignoring first 2 fields on each row" ].
csvParser rowsSkipLast: 2 do: [ :row | "some action ignoring last 2 fields on each row" ].
With NeoCSVParser you can describe how each field is read and converted, using
the same mechanism you can ignore fields. Have a look at the senders of
#addIgnoredField from the unit tests.
I am trying to understand the implementation. I see you included
#addIgnoredFields: for consecutive fields in Neo-CSV-Core-SvenVanCaekenberghe.21.
A question about usage then: does adding ignored field(s) require adding field
types for all the other remaining fields?
Yes, like this:
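| input |
input := String crlf join: #( '1,2,a,3' '1,2,b,3' '1,2,c,3' '' ).
self
	assert: ((NeoCSVReader on: input readStream)
		addIntegerField;
		addIntegerField;
		addIgnoredField; "skip the third column (a, b, c)"
		addIntegerField;
		upToEnd)
	equals: {
		#(1 2 3).
		#(1 2 3).
		#(1 2 3) }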
Maybe you would like to know this if you make another pass over NeoCSV: for some data
sets I have 1 million columns, so an addFieldsInterval: or something similar would be nice.
1 million columns? How is that possible, or useful?
The reader is like a builder. You could try to do this yourself by writing a
little loop or two.
But still, 1 million ?
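The "little loop or two" suggested above could look roughly like the following. This is only a sketch: the sample input and the column counts (2 ignored, 3 integer) are made up for illustration, and it relies only on #addIgnoredField, #addIntegerField and #upToEnd as they appear in the thread.

```smalltalk
| input reader |
"Hypothetical sample data: two leading columns to skip, then three integer columns."
input := String crlf join: #( 'id1,x,1,2,3' 'id2,y,4,5,6' '' ).
reader := NeoCSVReader on: input readStream.
"The reader works like a builder, so fields can be declared in a loop."
2 timesRepeat: [ reader addIgnoredField ].
3 timesRepeat: [ reader addIntegerField ].
"Should answer #( #(1 2 3) #(4 5 6) )."
reader upToEnd
```

For a very wide file the same pattern applies with larger counts. The #addIgnoredFields: selector mentioned earlier in the thread may replace the first loop for consecutive ignored columns.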
Thank you.