Hernán,
If you need some help, you can also find a smart student and ask ESUG to
sponsor them for a SummerTalk project.
Stef
On 26/01/15 21:03, Hernán Morales Durand wrote:
2015-01-26 9:01 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
Hernán,
> On 26 Jan 2015, at 08:00, Hernán Morales Durand <hernan.mora...@gmail.com> wrote:
>
> It is possible :)
> I work with DNA sequences; there can be millions of common SNPs in a genome.
Still weird for CSV. How many records are there then?
We genotyped a few individuals (24 records), but now we have a genotyping
platform (GeneTitan) with array plates allowing up to 96 samples,
which means up to 2.6 million markers. The first run I completed
generated CSVs of 1 million records (see attachment). Sadly, the
high-level analysis of this data (annotation, clustering,
discrimination) is currently performed in R with packages like SNPolisher.
And this is microarray analysis; NGS platforms produce larger volumes
of data in less time (several genomes in a day). See
http://www.slideshare.net/allenday/renaissance-in-medicine-strata-nosql-and-genomics
for the 2014-2020 predictions.
Feel free to contact me if you want to experiment with metrics.
I assume they all have the same number of fields?
Yes, I have never seen a CSV file with a variable number of fields (in this domain).
Anyway, could you point me to the specification of the format you
want to read?
Actually I am in no rush for this; I just want to avoid awk, sed, and shell
scripts in the next run. I would also like to avoid Python, but it spreads
like a virus.
I will be working mostly with CSVs from Axiom annotation files [1]
and genotyping results. Other file formats I use are genotype file
formats for programs like PLINK [2] (PED files, column 7 onwards) and
HaploView. It is worse than you might think, because you have to
transpose the output generated by genotyping platforms (millions of
records), and then filter and cut it by chromosome, because those Java
programs cannot deal with all chromosomes at the same time.
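For what it's worth, that per-chromosome filtering step could be sketched
with NeoCSV along these lines (a minimal sketch, assuming a hypothetical
'genotypes.csv' whose first column holds the chromosome name; only the
standard NeoCSVReader stream protocol is used; untested):

| reader chr1Records |
reader := NeoCSVReader on: 'genotypes.csv' asFileReference readStream.
chr1Records := OrderedCollection new.
"Stream over the records one at a time, keeping only one chromosome."
reader do: [ :record |
    record first = 'chr1' ifTrue: [ chr1Records add: record ] ].
reader close.

Since records are streamed rather than materialized, this avoids loading
the whole million-record file into memory before filtering.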
And a pointer to the older one that you used to use?
http://www.smalltalkhub.com/#!/~hernan/CSV
Cheers,
Hernán
[1] http://www.affymetrix.com/support/technical/annotationfilesmain.affx
[2] http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped
Thx,
Sven
> Cheers,
>
> Hernán
>
>
> 2015-01-26 3:33 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
>
> > On 26 Jan 2015, at 06:32, Hernán Morales Durand <hernan.mora...@gmail.com> wrote:
> >
> >
> >
> > 2015-01-23 18:00 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
> >
> > > On 23 Jan 2015, at 20:53, Hernán Morales Durand <hernan.mora...@gmail.com> wrote:
> > >
> > > Hi Sven,
> > >
> > > 2015-01-23 16:06 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
> > > Hi Hernán,
> > >
> > > > On 23 Jan 2015, at 19:50, Hernán Morales Durand <hernan.mora...@gmail.com> wrote:
> > > >
> > > > I used to use a CSV parser from Squeak where I could attach conditional iterations:
> > > >
> > > > csvParser rowsSkipFirst: 2 do: [ :row | "some action ignoring the first 2 fields of each row" ].
> > > > csvParser rowsSkipLast: 2 do: [ :row | "some action ignoring the last 2 fields of each row" ].
> > >
> > > With NeoCSVReader you can describe how each field is read and converted; using the same mechanism you can ignore fields. Have a look at the senders of #addIgnoredField in the unit tests.
> > >
> > >
> > > I am trying to understand the implementation. I see you included #addIgnoredFields: for consecutive fields in Neo-CSV-Core-SvenVanCaekenberghe.21.
> > > A question about usage then: does adding ignored field(s) require adding field types for all the other remaining fields?
> >
> > Yes, like this:
> >
> > testReadWithIgnoredField
> >     | input |
> >     input := String crlf join: #( '1,2,a,3' '1,2,b,3' '1,2,c,3' '' ).
> >     self
> >         assert: ((NeoCSVReader on: input readStream)
> >             addIntegerField;
> >             addIntegerField;
> >             addIgnoredField;
> >             addIntegerField;
> >             upToEnd)
> >         equals: {
> >             #(1 2 3).
> >             #(1 2 3).
> >             #(1 2 3) }
> >
> >
> >
> > Maybe you would like to know, in case you make a pass over NeoCSV: for some data sets I have 1 million columns, so something like an addFieldsInterval: would be nice.
>
> 1 million columns? How is that possible, or useful?
>
> The reader is like a builder. You could try to do this yourself
by writing a little loop or two.
>
> But still, 1 million?
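A little loop of that kind might look like this; a minimal sketch, assuming
hypothetical start/stop/columnCount values, built only from the existing
#addField / #addIgnoredField builder messages:

| reader start stop columnCount |
start := 10. stop := 20. columnCount := 100. "hypothetical interval and width"
reader := NeoCSVReader on: input readStream.
"Declare each column: read it as a string inside the interval, ignore it outside."
1 to: columnCount do: [ :i |
    (i between: start and: stop)
        ifTrue: [ reader addField ]
        ifFalse: [ reader addIgnoredField ] ].
reader upToEnd.

Whether declaring a million per-column definitions this way stays practical
is another matter; that is where a built-in addFieldsInterval: could help.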
>
> > Thank you.
> >
> > Hernán
> >