On 03/10/2018 09:03 AM, Pariksheet Nanda wrote:
Hi Claris,
On Sat, Mar 10, 2018 at 2:49 AM, Claris Baby via Bioc-devel <
bioc-devel@r-project.org> wrote:
[1] "The following files are over 5MB in size:
'dataset/Caenorhabditis_elegans.WBcel235.dna.chromosome.I.fa'....."
This file, as well as other data such as .gff files that are used
for the reference-based assembly, is much larger than 5 MB.
But the total package size is less than 500 MB.
Assuming that's not a typo, 500 MB is very large and inappropriate for a
package. It's generally good practice to separate code and data where
possible, not least because data bloats code version control. If your
package size is close to 500 MB, you should think about stashing the data
and accessing it using something like AnnotationHub or BiocFileCache.
Yes, large files should be made available by a package that uses
AnnotationHub or ExperimentHub for the resources. Also, it's often
possible to re-use existing resources and, in a vignette, to
_illustrate_ package functionality rather than redo a complete 'real'
analysis.
See
http://bioconductor.org/packages/devel/bioc/vignettes/AnnotationHub/inst/doc/CreateAnAnnotationPackage.html
Martin
(some others on the mailing list might have better and more specific
suggestions as I've not yet had to deal with this particular problem, if
you confirm that the package is indeed that big).
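
The BiocFileCache suggestion above could look roughly like the sketch
below. It assumes the data file is hosted somewhere outside the package;
the helper name fetch_cached(), the cache name "mypkg", and the example
URL are all hypothetical. The file is downloaded on first use and served
from the local cache afterwards, so nothing large ships in the package.

```r
# Hypothetical helper: download a remote file once and reuse the cached copy.
# "mypkg" is a placeholder package name, not something from this thread.
fetch_cached <- function(url,
                         cache_dir = tools::R_user_dir("mypkg", which = "cache")) {
  if (!requireNamespace("BiocFileCache", quietly = TRUE)) {
    stop("Please install BiocFileCache to use the cached download")
  }
  # ask = FALSE creates the cache directory without prompting.
  bfc <- BiocFileCache::BiocFileCache(cache_dir, ask = FALSE)
  # bfcrpath() returns the local path, downloading the resource if needed.
  BiocFileCache::bfcrpath(bfc, url)
}

# Usage (placeholder URL, not a real location):
# fa <- fetch_cached("https://example.org/path/to/chromosome.I.fa.gz")
```

An AnnotationHub or ExperimentHub data package, as Martin describes, is
the better fit when the files should be shared with other users rather
than only cached locally.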
Is it essential that each file within the package is less than
5 MB? If so, it would be very kind if anyone could suggest how
to reduce the size of the genomic data files.
Can you gzip-compress those data files? Text-based files usually compress
quite well and many functions like import() from rtracklayer will
automagically decompress them so you might not even need to change much in
your code.
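
As a rough illustration of the point about text files, here is a
base-R-only sketch that writes a made-up GFF3-like file, gzips it, and
reads it back without an explicit decompression step. The records are
invented for the example, and how well real data compresses will differ.

```r
# Build a small GFF3-like text file (made-up records, for illustration only).
n <- 2000
lines <- c(
  "##gff-version 3",
  sprintf("I\texample\tgene\t%d\t%d\t.\t+\t.\tID=gene%d",
          seq_len(n) * 100L, seq_len(n) * 100L + 50L, seq_len(n))
)
raw_path <- tempfile(fileext = ".gff3")
writeLines(lines, raw_path)

# Write the same content through a gzip connection.
gz_path <- paste0(raw_path, ".gz")
con <- gzfile(gz_path, "w")
writeLines(lines, con)
close(con)

raw_size <- file.size(raw_path)
gz_size <- file.size(gz_path)
message(sprintf("raw: %.0f bytes, gzip: %.0f bytes", raw_size, gz_size))

# Base R detects the compression when reading, so no gunzip step is needed.
round_trip <- readLines(gz_path)
```

The same .gz file can then be passed to readers that accept compressed
input, such as the import() function mentioned above.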
.gz isn't the most disk-efficient compression format out there; .bz2
often compresses better, and .xz typically yields even better lossless
compression (R's save() supports all three through its compress argument,
with gzip as the default), but for the cross-platform compatibility that
Bioconductor strives for, .gz might be best to try first.
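
To compare the three formats on your own data, something like the
following sketch may help. It uses random made-up sequences as a stand-in
for a real object, so the sizes it prints say little about genomic files;
only the save() and load() calls carry over.

```r
# Made-up sequence strings, standing in for a real data object.
set.seed(1)
seqs <- replicate(
  200,
  paste(sample(c("A", "C", "G", "T"), 500, replace = TRUE), collapse = "")
)

# Size of the uncompressed .RData file, as a baseline.
raw_file <- tempfile(fileext = ".RData")
save(seqs, file = raw_file, compress = FALSE)
raw_size <- file.size(raw_file)

# Save the same object with each compression type that save() supports.
sizes <- c(gzip = NA_real_, bzip2 = NA_real_, xz = NA_real_)
for (m in names(sizes)) {
  path <- tempfile(fileext = ".RData")
  save(seqs, file = path, compress = m)
  sizes[[m]] <- file.size(path)
}
print(c(none = raw_size, sizes))

# load() works out the compression type by itself.
xz_file <- tempfile(fileext = ".RData")
save(seqs, file = xz_file, compress = "xz")
env <- new.env()
load(xz_file, envir = env)
```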
Claris Baby
Pariksheet
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel