Hi Gabe,

It would help to have a common baseline. Please show the output for writing the Illumina file you sent originally:

library(VariantAnnotation)
fl <- "NA12877_S1.genome.vcf.gz"
vcf <- readVcf(fl, "", param=ScanVcfParam(info=NA))
dim(vcf)

gc()
print(system.time(writeVcf(vcf, tempfile())))
gc()

Here's my output. For this file, writing takes between 8-10 minutes depending on machine load. (Machine has 378 GIG of RAM.)

dim(vcf)
[1] 51612762        1

gc()
             used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  157824501 8428.8  298615851 15947.9  261241334 13951.8
Vcells 1109937571 8468.2 1778479081 13568.8 1693642432 12921.5
print(system.time(writeVcf(vcf, tempfile())))
   user  system elapsed
601.265   9.261 614.230
gc()
             used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  157829989 8429.1  298615851 15947.9  261241334 13951.8
Vcells 1109941106 8468.2 2270282109 17320.9 2243573149 17117.2

I'm not seeing substantial differences in writeVcf with expanded vs collapsed VCF. However, as an aside, note that calling writeVcf on VRanges (or equivalently ExpandedVCF) the new file will have multiple rows per position if there were multiple ALT alleles. I'm not sure this is technically valid as per the VCF specs but am assuming Michael knew about this.

Five positions in original VCF:
fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, "")
> dim(vcf)
[1] 5 3

vr <- as(vcf, "VRanges")
> length(vr)
[1] 21

Before writing out, VRanges is coerced to ExpandedVCF; 7 rows are written out.
> dim(as(vr, "VCF"))
[1] 7 3

Can you provide a portion of 'vcfgeno' for testing?

Valerie

sessionInfo()
R version 3.1.0 Patched (2014-06-24 r66016)
Platform: x86_64-unknown-linux-gnu (64-bit)

...

other attached packages:
[1] VariantAnnotation_1.11.35 Rsamtools_1.17.34
[3] Biostrings_2.33.14        XVector_0.5.8
[5] GenomicRanges_1.17.42     GenomeInfoDb_1.1.23
[7] IRanges_1.99.28           S4Vectors_0.2.4
[9] BiocGenerics_0.11.5



On 09/30/2014 08:36 AM, Gabe Becker wrote:
Valerie,

Apologies for this taking much longer than it should have. The changes
in Bioc-devel have wreaked havoc on the code we use to to generate and
process the data we need to write out, but the fault is mine for not
getting on top of it sooner.

I'm not seeing the speed you mentioned above in the latest devel version
(1.11.35). It took ~1.5hrs to write the an expanded vcf with  56M rows
(print output and sessionInfo() follow). I'll try reading in the
illumina platinum and writing it back out to see if it is something
about our specific vcf object (could ExpandedVCF vs VCF be an issue?).

 > vcfgeno
*class: ExpandedVCF *
*dim: 50307989 1 *
rowData(vcf):
   GRanges with 4 metadata columns: REF, ALT, QUAL, FILTER
info(vcf):
   DataFrame with 1 column: END
   Fields with no header: END
geno(vcf):
   SimpleList of length 7: AD, DP, FT, GT, GQ, PL, MIN_DP
geno(header(vcf)):
           Number Type    Description
    AD     2      Integer Allelic depths (number of reads in each
observed al...
    DP     1      Integer Total read depth
    FT     1      String  Variant filters
    GT     1      String  Genotype
    GQ     1      Integer Genotype quality
    PL     3      Integer Normalized, Phred-scaled likelihoods for
genotypes
    MIN_DP 1      Integer Minimum DP observed within the GVCF block
 > sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
  [1] VariantCallingPaper_0.0.3 GenomicFeatures_1.17.17
  [3] AnnotationDbi_1.27.16     Biobase_2.25.0
  [5] gmapR_1.7.8               VTGenotyping_0.0.1
  [7] BiocParallel_0.99.22      futile.logger_1.3.7
  [9] VariantTools_1.7.5 *VariantAnnotation_1.11.35*
[11] Rsamtools_1.17.34         Biostrings_2.33.14
[13] XVector_0.5.8             rtracklayer_1.25.16
[15] GenomicRanges_1.17.42     GenomeInfoDb_1.1.23
[17] IRanges_1.99.28           S4Vectors_0.2.4
[19] BiocGenerics_0.11.5       switchr_0.2.1

loaded via a namespace (and not attached):
  [1] annotate_1.43.5                base64enc_0.1-2
  [3] BatchJobs_1.4                  BBmisc_1.7
  [5] biomaRt_2.21.1                 bitops_1.0-6
  [7] brew_1.0-6                     BSgenome_1.33.9
  [9] CGPtools_2.2.0                 checkmate_1.4
[11] codetools_0.2-9                DBI_0.3.1
[13] DESeq_1.17.0                   digest_0.6.4
[15] fail_1.2                       foreach_1.4.2
[17] futile.options_1.0.0           genefilter_1.47.6
[19] geneplotter_1.43.0             GenomicAlignments_1.1.29
[21] genoset_1.19.32                gneDB_0.4.18
[23] grid_3.1.1                     iterators_1.0.7
[25] lambda.r_1.1.6                 lattice_0.20-29
[27] Matrix_1.1-4                   RColorBrewer_1.0-5
[29] RCurl_1.95-4.3                 rjson_0.2.14
[31] RSQLite_0.11.4                 sendmailR_1.2-1
[33] splines_3.1.1                  stringr_0.6.2
[35] survival_2.37-7                tools_3.1.1
[37] TxDb.Hsapiens.BioMart.igis_2.3 XML_3.98-1.1
[39] xtable_1.7-4                   zlibbioc_1.11.1

On Wed, Sep 17, 2014 at 2:08 PM, Valerie Obenchain <voben...@fhcrc.org
<mailto:voben...@fhcrc.org>> wrote:

    Hi Gabe,

    Have you had a chance to test writeVcf? The changes made over the
    past week have shaved off more time. It now takes ~ 9 minutes to
    write the NA12877 example.

            dim(vcf)

        [1] 51612762        1

            gc()

                      used   (Mb) gc trigger    (Mb)   max used    (Mb)
        Ncells  157818565 8428.5  298615851 15947.9  261235336 13951.5
        Vcells 1109849222 8467.5 1778386307 13568.1 1693553890 12920.8

            print(system.time(writeVcf(__vcf, tempfile())))

            user  system elapsed
        555.282   6.700 565.700

            gc()

                      used   (Mb) gc trigger    (Mb)   max used    (Mb)
        Ncells  157821990 8428.7  329305975 17586.9  261482807 13964.7
        Vcells 1176960717 8979.5 2183277445 <tel:2183277445> 16657.1
        2171401955 16566.5



    In the most recent version (1.11.35) I've added chunking for files
    with > 1e5 records. Right now the choice of # records per chunk is
    simple, based on total records only. We are still experimenting with
    this. You can override default chunking with 'nchunk'. Examples on
    the man page.

    Valerie


    On 09/08/14 08:43, Gabe Becker wrote:

        Val,

        That is great. I'll check this out and test it on our end.

        ~G

        On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain
        <voben...@fhcrc.org <mailto:voben...@fhcrc.org>
        <mailto:voben...@fhcrc.org <mailto:voben...@fhcrc.org>>> wrote:

             The new writeVcf code is in 1.11.28.

             Using the illumina file you suggested, geno fields only,
        writing now
             takes about 17 minutes.

              > hdr
             class: VCFHeader
             samples(1): NA12877
             meta(6): fileformat ApplyRecalibration ... reference source
             fixed(1): FILTER
             info(22): AC AF ... culprit set
             geno(8): GT GQX ... PL VF

              > param = ScanVcfParam(info=NA)
              > vcf = readVcf(fl, "", param=param)
              > dim(vcf)
             [1] 51612762        1

              > system.time(writeVcf(vcf, "out.vcf"))
                  user   system  elapsed
               971.032    6.568 1004.593

             In 1.11.28, parsing of geno data was moved to C. If this didn't
             speed things up enough we were planning to implement 'chunking'
             through the VCF and/or move the parsing of info to C,
        however, it
             looks like geno was the bottleneck.

             I've tested a number of samples/fields combinations in
        files with >=
             .5 million rows and the improvement over writeVcf() in
        release is ~ 90%.

             Valerie




             On 09/04/14 15:28, Valerie Obenchain wrote:

                 Thanks Gabe. I should have something for you on Monday.

                 Val


                 On 09/04/2014 01:56 PM, Gabe Becker wrote:

                     Val and Martin,

                     Apologies for the delay.

                     We realized that the Illumina platinum genome vcf
        files make
                     a good test
                     case, assuming you strip out all the info (info=NA when
                     reading it into
                     R) stuff.

        
ftp://platgene:G3n3s4me@ussd-____ftp.illumina.com/NA12877_S1.____genome.vcf.gz
        
<ftp://platgene:G3n3s4me@ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz>

        
<ftp://platgene:G3n3s4me@ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz
        
<ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz>>
                     took about ~4.2 hrs to write out, and is about 1.5x
        the size
                     of the
                     files we are actually dealing with (~50M ranges vs
        our ~30M).

                     Looking forward a new vastly improved writeVcf :).

                     ~G


                     On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
                     <lawrence.mich...@gene.com
        <mailto:lawrence.mich...@gene.com>
                     <mailto:lawrence.michael@gene.__com
        <mailto:lawrence.mich...@gene.com>>
                     <mailto:lawrence.michael@gene.
        <mailto:lawrence.michael@gene.>____com
                     <mailto:lawrence.michael@gene.__com
        <mailto:lawrence.mich...@gene.com>>>> wrote:

                          Yes, it's very clear that the scaling is
        non-linear,
                     and Gabe has
                          been experimenting with a chunk-wise +
        parallel algorithm.
                          Unfortunately there is some frustrating
        overhead with the
                          parallelism. But I'm glad Val is arriving at
        something
                     quicker.

                          Michael


                          On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
                     <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>
        <mailto:mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>>
                          <mailto:mtmor...@fhcrc.org
        <mailto:mtmor...@fhcrc.org>
                     <mailto:mtmor...@fhcrc.org
        <mailto:mtmor...@fhcrc.org>>>> wrote:

                              On 08/27/2014 11:56 AM, Gabe Becker wrote:

                                  The profiling I attached in my
        previous email
                     is for 24 geno
                                  fields, as I said,
                                  but our typical usecase involves only ~4-6
                     fields, and is
                                  faster but still on
                                  the order of dozens of minutes.


                              I think Val is arriving at a (much) more
        efficient
                              implementation, but...

                              I wanted to share my guess that the poor
        _scaling_
                     is because
                              the garbage collector runs multiple times
        as the
                     different
                              strings are pasted together, and has to
        traverse,
                     in linear
                              time, increasing numbers of allocated
        SEXPs. So
                     times scale
                              approximately quadratically with the
        number of rows
                     in the VCF

                              An efficiency is to reduce the number of
        SEXPs in
                     play by
                              writing out in chunks -- as each chunk is
        written,
                     the SEXPs
                              become available for collection and are
        re-used.
                     Here's my toy
                              example

                              time.R
                              ======
                              splitIndices <- function (nx, ncl)
                              {
                                   i <- seq_len(nx)
                                   if (ncl == 0L)
                                       list()
                                   else if (ncl == 1L || nx == 1L)
                                       list(i)
                                   else {
                                       fuzz <- min((nx - 1L)/1000, 0.4 *
        nx/ncl)
                                       breaks <- seq(1 - fuzz, nx +
        fuzz, length
                     = ncl + 1L)
                                       structure(split(i, cut(i, breaks,
                     labels=FALSE)), names
                              = NULL)
                                   }
                              }

                              x = as.character(seq_len(1e7)); y = sample(x)
                              if (!is.na <http://is.na> <http://is.na>
                     <http://is.na>(Sys.getenv("____SPLIT", NA))) {

                                   idx <- splitIndices(length(x), 20)
                                   system.time(for (i in idx)
        paste(x[i], y[i],
                     sep=":"))
                              } else {
                                   system.time(paste(x, y, sep=":"))
                              }


                              running under R-devel with $ SPLIT=TRUE R
        --no-save
                     --quiet -f
                              time.R the relevant time is

                                  user  system elapsed
                                15.320   0.064  15.381

                              versus with $ R --no-save --quiet -f
        time.R it is

                                  user  system elapsed
                                95.360   0.164  95.511

                              I think this is likely an overall strategy
        when
                     dealing with
                              character data -- processing in
        independent chunks
                     of moderate
                              (1M?) size (enabling as a consequence parallel
                     evaluation in
                              modest memory) that are sufficient to
        benefit from
                              vectorization, but that do not entail
        allocation of
                     large
                              numbers of in-use SEXPs.

                              Martin


                                  Sorry for the confusion.
                                  ~G


                                  On Wed, Aug 27, 2014 at 11:45 AM, Gabe
        Becker
                                  <becke...@gene.com
        <mailto:becke...@gene.com> <mailto:becke...@gene.com
        <mailto:becke...@gene.com>>
                     <mailto:becke...@gene.com
        <mailto:becke...@gene.com> <mailto:becke...@gene.com
        <mailto:becke...@gene.com>>>
                                  <mailto:becke...@gene.com
        <mailto:becke...@gene.com>
                     <mailto:becke...@gene.com
        <mailto:becke...@gene.com>> <mailto:becke...@gene.com
        <mailto:becke...@gene.com>
                     <mailto:becke...@gene.com
        <mailto:becke...@gene.com>>>>> wrote:

                                       Martin and Val.

                                       I re-ran writeVcf on our (G)VCF data
                     (34790518 ranges,
                                  24 geno fields) with
                                       profiling enabled. The results of
                     summaryRprof for that
                                  run are attached,
                                       though for a variety of reasons
        they are
                     pretty
                     misleading.

                                       It took over an hour to write
                     (3700+seconds), so it's
                                  definitely a
                                       bottleneck when the data get very
        large,
                     even if it
                                  isn't for smaller data.

                                       Michael and I both think the
        culprit is
                     all the pasting
                                  and cbinding that is
                                       going on, and more to the point, that
                     memory for an
                                  internal representation
                                       to be written out is allocated at
        all.
                     Streaming
                                  across the object, looping
                                       by rows and writing directly to
        file (e.g.
                     from C)
                                  should be blisteringly
                                       fast in comparison.

                                       ~G


                                       On Tue, Aug 26, 2014 at 11:57 AM,
        Michael
                     Lawrence
                                  <micha...@gene.com
        <mailto:micha...@gene.com> <mailto:micha...@gene.com
        <mailto:micha...@gene.com>>
                     <mailto:micha...@gene.com
        <mailto:micha...@gene.com> <mailto:micha...@gene.com
        <mailto:micha...@gene.com>>>
                                       <mailto:micha...@gene.com
        <mailto:micha...@gene.com>
                     <mailto:micha...@gene.com
        <mailto:micha...@gene.com>> <mailto:micha...@gene.com
        <mailto:micha...@gene.com>
                     <mailto:micha...@gene.com
        <mailto:micha...@gene.com>>>>>
                                  wrote:

                                           Gabe is still
        testing/profiling, but
                     we'll send
                                  something randomized
                                           along eventually.


                                           On Tue, Aug 26, 2014 at 11:15 AM,
                     Martin Morgan
                                  <mtmor...@fhcrc.org
        <mailto:mtmor...@fhcrc.org> <mailto:mtmor...@fhcrc.org
        <mailto:mtmor...@fhcrc.org>>
                     <mailto:mtmor...@fhcrc.org
        <mailto:mtmor...@fhcrc.org> <mailto:mtmor...@fhcrc.org
        <mailto:mtmor...@fhcrc.org>>>
                                           <mailto:mtmor...@fhcrc.org
        <mailto:mtmor...@fhcrc.org>
                     <mailto:mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>>
                                  <mailto:mtmor...@fhcrc.org
        <mailto:mtmor...@fhcrc.org>
                     <mailto:mtmor...@fhcrc.org
        <mailto:mtmor...@fhcrc.org>>>>> wrote:

                                               I didn't see in the
        original thread a
                                  reproducible (simulated, I
                                               guess) example, to be
        explicit
                     about what the
                                  problem is??

                                               Martin


                                               On 08/26/2014 10:47 AM,
        Michael
                     Lawrence wrote:

                                                   My understanding is
        that the heap
                                  optimization provided marginal
                                                   gains, and
                                                   that we need to think
        harder
                     about how to
                                  optimize the all of
                                                   the string
                                                   manipulation in
        writeVcf. We
                     either need to
                                  reduce it or reduce its
                                                   overhead (i.e., the
        CHARSXP
                     allocation).
                                  Gabe is doing more tests.


                                                   On Tue, Aug 26, 2014
        at 9:43
                     AM, Valerie
                                  Obenchain
                                                   <voben...@fhcrc.org
        <mailto:voben...@fhcrc.org>
                     <mailto:voben...@fhcrc.org <mailto:voben...@fhcrc.org>>
                                  <mailto:voben...@fhcrc.org
        <mailto:voben...@fhcrc.org>
                     <mailto:voben...@fhcrc.org
        <mailto:voben...@fhcrc.org>>> <mailto:voben...@fhcrc.org
        <mailto:voben...@fhcrc.org>
                     <mailto:voben...@fhcrc.org <mailto:voben...@fhcrc.org>>
                                  <mailto:voben...@fhcrc.org
        <mailto:voben...@fhcrc.org>
                     <mailto:voben...@fhcrc.org
        <mailto:voben...@fhcrc.org>>>>>

                                                   wrote:

                                                       Hi Gabe,

                                                       Martin responded,
        and so
                     did Michael,


        
https://stat.ethz.ch/________pipermail/bioc-devel/2014-________August/006082.html
        
<https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html>

        
<https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html
        
<https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html>>



        
<https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html
        
<https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html>

        
<https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
        
<https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>>>




        
<https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html
        
<https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html>

        
<https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
        <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>>


        
<https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
        <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>

        <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html
        <https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>>>>

                                                       It sounded like
        Michael
                     was ok with
                                  working with/around heap
                                                       initialization.

                                                       Michael, is that
        right or
                     should we
                                  still consider this on
                                                       the table?


                                                       Val


                                                       On 08/26/2014
        09:34 AM,
                     Gabe Becker
                     wrote:

                                                           Val,

                                                           Has there
        been any
                     movement on
                                  this? This remains a
                                                           substantial
                                                           bottleneck
        for us when
                     writing very
                                  large VCF files (e.g.

        variants+genotypes for
                     whole genome
                                  NGS samples).

                                                           I was able to
        see a
                     ~25% speedup
                                  with 4 cores and  an
                                                           "optimal" speedup
                                                           of ~2x with 10-12
                     cores for a VCF
                                  with 500k rows  using
                                                           a very naive
                                                           parallelization
                     strategy and no
                                  other changes. I suspect
                                                           this could be
                                                           improved on
        quite a
                     bit, or
                                  possibly made irrelevant
                                                           with
        judicious use
                                                           of serial C code.

                                                           Did you and
        Martin
                     make any plans
                                  regarding optimizing
                                                           writeVcf?

                                                           Best
                                                           ~G


                                                           On Tue, Aug
        5, 2014 at
                     2:33 PM,
                                  Valerie Obenchain

        <voben...@fhcrc.org <mailto:voben...@fhcrc.org>
                     <mailto:voben...@fhcrc.org <mailto:voben...@fhcrc.org>>
                                  <mailto:voben...@fhcrc.org
        <mailto:voben...@fhcrc.org>
                     <mailto:voben...@fhcrc.org
        <mailto:voben...@fhcrc.org>>> <mailto:voben...@fhcrc.org
        <mailto:voben...@fhcrc.org>
                     <mailto:voben...@fhcrc.org <mailto:voben...@fhcrc.org>>
                                  <mailto:voben...@fhcrc.org
        <mailto:voben...@fhcrc.org>
                     <mailto:voben...@fhcrc.org
        <mailto:voben...@fhcrc.org>>>>

                       <mailto:voben...@fhcrc.org
        <mailto:voben...@fhcrc.org> <mailto:voben...@fhcrc.org
        <mailto:voben...@fhcrc.org>>
                                  <mailto:voben...@fhcrc.org
        <mailto:voben...@fhcrc.org>
                     <mailto:voben...@fhcrc.org
        <mailto:voben...@fhcrc.org>>> <mailto:voben...@fhcrc.org
        <mailto:voben...@fhcrc.org>
                     <mailto:voben...@fhcrc.org <mailto:voben...@fhcrc.org>>
                                  <mailto:voben...@fhcrc.org
        <mailto:voben...@fhcrc.org>
                     <mailto:voben...@fhcrc.org
        <mailto:voben...@fhcrc.org>>>>>__>

                                                           wrote:

                                                                 Hi Michael,

                                                                 I'm
        interested
                     in working on
                                  this. I'll discuss
                                                           with Martin next
                                                                 week
        when we're
                     both back in
                                  the office.

                                                                 Val





                                                                 On 08/05/14
                     07:46, Michael
                                  Lawrence wrote:

                                                                     Hi guys
                     (Val, Martin,
                     Herve):


        Anyone have
                     an itch for
                                  optimization? The
                                                           writeVcf
        function is

        currently a

        bottleneck
                     in our WGS
                                  genotyping pipeline. For
                                                           a typical 50

        million row

        gVCF, it was
                     taking 2.25
                                  hours prior to
                                                           yesterday's
        improvements

                       (pasteCollapseRows) that
                                  brought it down to
                                                           about 1 hour,
        which
                                                                     is
        still
                                                                     too
        long by
                     my standards
                                  (> 0). Only takes 3
                                                           minutes to
        call the

        genotypes

        (and associated
                                  likelihoods etc) from the
                                                           variant calls
        (using
                                                                     80
        cores and
                                                                     450
        GB RAM
                     on one node),
                                  so the output is an
                                                           issue. Profiling

        suggests that
                                                                     the
        running
                     time scales
                                  non-linearly in the
                                                           number of rows.


        Digging a
                     little deeper,
                                  it seems to be
                                                           something
        with R's

        string/memory

        allocation.
                     Below,
                                  pasting 1 million strings
                                                           takes 6
        seconds, but
                                                           10
                                                                     million
                     strings takes
                                  over 2 minutes. It gets
                                                           way worse with 50

        million. I

        suspect it
                     has something
                                  to do with R's string
                                                           hash table.


        set.seed(1000)
                                                                     end <-
                     sample(1e8, 1e6)

                       system.time(paste0("END",
                                  "=", end))

          user
                     system elapsed

        6.396
                       0.028   6.420

                                                                     end <-
                     sample(1e8, 1e7)

                       system.time(paste0("END",
                                  "=", end))

          user
                     system elapsed
                                                                     134.714
                       0.352 134.978


        Indeed, even
                     this takes a
                                  long time (in a
                                                           fresh session):


        set.seed(1000)
                                                                     end <-
                     sample(1e8, 1e6)
                                                                     end <-
                     sample(1e8, 1e7)

                                  system.time(as.character(end))

          user
                     system elapsed

          57.224
                       0.156  57.366

                                                                     But
        running
                     it a second
                                  time is faster (about
                                                           what one would

        expect?):


                       system.time(levels <-
                                  as.character(end))

          user
                     system elapsed

          23.582
                       0.021  23.589

                                                                     I
        did some
                     simple
                                  profiling of R to find that
                                                           the resizing of
                                                                     the
        string

        hash table
                     is not a
                                  significant component of
                                                           the time. So
        maybe

        something
                                                                     to
        do with
                     the R heap/gc?
                                  No time right now to
                                                           go deeper. But I

        know Martin

        likes this
                     sort of
                     thing ;)

                                                                     Michael


                     [[alternative
                                  HTML version deleted]]





        _________________________________________________________
        Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
        <mailto:Bioc-devel@r-project.__org
        <mailto:Bioc-devel@r-project.org>>
                     <mailto:Bioc-devel@r-project.
        <mailto:Bioc-devel@r-project.>____org
                     <mailto:Bioc-devel@r-project.__org
        <mailto:Bioc-devel@r-project.org>>>

                       <mailto:Bioc-devel@r-project
        <mailto:Bioc-devel@r-project>.
                     <mailto:Bioc-devel@r-project
        <mailto:Bioc-devel@r-project>.>______org
                                  <mailto:Bioc-devel@r-project.
        <mailto:Bioc-devel@r-project.>____org
                     <mailto:Bioc-devel@r-project.__org
        <mailto:Bioc-devel@r-project.org>>>>

                       <mailto:Bioc-devel@r-project
        <mailto:Bioc-devel@r-project> <mailto:Bioc-devel@r-project
        <mailto:Bioc-devel@r-project>>.
                                  <mailto:Bioc-devel@r-project
        <mailto:Bioc-devel@r-project>
                     <mailto:Bioc-devel@r-project
        <mailto:Bioc-devel@r-project>>.__>______org

                       <mailto:Bioc-devel@r-project
        <mailto:Bioc-devel@r-project>.
                     <mailto:Bioc-devel@r-project
        <mailto:Bioc-devel@r-project>.>______org
                                  <mailto:Bioc-devel@r-project.
        <mailto:Bioc-devel@r-project.>____org
                     <mailto:Bioc-devel@r-project.__org
        <mailto:Bioc-devel@r-project.org>>>>>

        mailing list

        https://stat.ethz.ch/mailman/__________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>>


        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>>



        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>


        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>>




        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>


        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>



        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>


        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
                     <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>>





        _________________________________________________________
        Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
        <mailto:Bioc-devel@r-project.__org
        <mailto:Bioc-devel@r-project.org>>
                     <mailto:Bioc-devel@r-project.
        <mailto:Bioc-devel@r-project.>____org
                     <mailto:Bioc-devel@r-project.__org
        <mailto:Bioc-devel@r-project.org>>>

                       <mailto:Bioc-devel@r-project
        <mailto:Bioc-devel@r-project>.
                     <mailto:Bioc-devel@r-project
        <mailto:Bioc-devel@r-project>.>______org
                                  <mailto:Bioc-devel@r-project.
        <mailto:Bioc-devel@r-project.>____org
                     <mailto:Bioc-devel@r-project.__org
        <mailto:Bioc-devel@r-project.org>>>>

                       <mailto:Bioc-devel@r-project
        <mailto:Bioc-devel@r-project> <mailto:Bioc-devel@r-project
        <mailto:Bioc-devel@r-project>>.
                                  <mailto:Bioc-devel@r-project
        <mailto:Bioc-devel@r-project>
                     <mailto:Bioc-devel@r-project
        <mailto:Bioc-devel@r-project>>.__>______org

                       <mailto:Bioc-devel@r-project
        <mailto:Bioc-devel@r-project>.
                     <mailto:Bioc-devel@r-project
        <mailto:Bioc-devel@r-project>.>______org
                                  <mailto:Bioc-devel@r-project.
        <mailto:Bioc-devel@r-project.>____org
                     <mailto:Bioc-devel@r-project.__org
        <mailto:Bioc-devel@r-project.org>>>>> mailing
                                                           list
        https://stat.ethz.ch/mailman/__________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>>


        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>>



        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>


        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>>






        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>


        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>



        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>


        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
                     <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>>




                                                           --
                                                           Computational
        Biologist
                                                           Genentech
        Research






          [[alternative HTML
                     version
                     deleted]]



                     _______________________________________________________
        Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
        <mailto:Bioc-devel@r-project.__org
        <mailto:Bioc-devel@r-project.org>>
                     <mailto:Bioc-devel@r-project.
        <mailto:Bioc-devel@r-project.>____org
                     <mailto:Bioc-devel@r-project.__org
        <mailto:Bioc-devel@r-project.org>>>
                                  <mailto:Bioc-devel@r-project
        <mailto:Bioc-devel@r-project>.
                     <mailto:Bioc-devel@r-project
        <mailto:Bioc-devel@r-project>.>______org
                                  <mailto:Bioc-devel@r-project.
        <mailto:Bioc-devel@r-project.>____org
                     <mailto:Bioc-devel@r-project.__org
        <mailto:Bioc-devel@r-project.org>>>>
                                                   mailing list
        https://stat.ethz.ch/mailman/________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>


        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>



        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>


        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
                     <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>



                                               --
                                               Computational Biology / Fred
                     Hutchinson Cancer
                                  Research Center
                                               1100 Fairview Ave. N.
                                               PO Box 19024 Seattle, WA
        98109

                                               Location: Arnold Building
        M1 B861
                                               Phone: (206) 667-2793
        <tel:%28206%29%20667-2793>
                     <tel:%28206%29%20667-2793>
                                  <tel:%28206%29%20667-2793>
                     <tel:%28206%29%20667-2793>






                                       --
                                       Computational Biologist
                                       Genentech Research




                                  --
                                  Computational Biologist
                                  Genentech Research



                              --
                              Computational Biology / Fred Hutchinson Cancer
                     Research Center
                              1100 Fairview Ave. N.
                              PO Box 19024 Seattle, WA 98109

                              Location: Arnold Building M1 B861
                              Phone: (206) 667-2793
        <tel:%28206%29%20667-2793> <tel:%28206%29%20667-2793>
                     <tel:%28206%29%20667-2793>





                     --
                     Computational Biologist
                     Genentech Research


                 ___________________________________________________
        Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
        <mailto:Bioc-devel@r-project.__org
        <mailto:Bioc-devel@r-project.org>>
                 mailing list
        https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
                 <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>





        --
        Computational Biologist
        Genentech Research





--
Computational Biologist
Genentech Research

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to