Hi,

On 03/11/2014 09:47 AM, Michael Lawrence wrote:
Except for the checksum, the existing File classes should support this,
where the package provides a dataset via data() that is just the serialized
File object (path). One could create a FileWithChecksum class that
decorates a File object with a checksum. Any attempts to read the file are
intercepted by the decorator, which verifies the checksum, and then
delegates.
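
A rough sketch of the decorator idea, for illustration only. The class name FileWithChecksum comes from the message above, but readVerified(), the slot names, and the choice of MD5 via tools::md5sum() are assumptions, and a plain character path stands in for a real File object.

```r
library(methods)

# Hypothetical decorator: a path plus the MD5 checksum recorded at creation.
setClass("FileWithChecksum",
         representation(path = "character", checksum = "character"))

FileWithChecksum <- function(path) {
    new("FileWithChecksum", path = path,
        checksum = unname(tools::md5sum(path)))
}

# Intercept the read: verify the checksum, then delegate to the real reader.
readVerified <- function(x, reader = readLines, ...) {
    current <- unname(tools::md5sum(x@path))
    if (!identical(current, x@checksum))
        stop("checksum mismatch for ", x@path)
    reader(x@path, ...)
}

# Usage with a throwaway file.
tmp <- tempfile(fileext = ".txt")
writeLines("chr1\t100\t200", tmp)
f <- FileWithChecksum(tmp)
readVerified(f)
```

Passing the reader as an argument keeps the decorator independent of any particular file type; a real implementation would more likely dispatch on the wrapped File class.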

Neat. Sounds like this is worth pursuing.


Michael


On Tue, Mar 11, 2014 at 8:53 AM, Vincent Carey <st...@channing.harvard.edu> wrote:

I'm going to suggest a use case that may motivate this type of development.

Up to 2010 or so, data packages generally made sense.  You have about
100-500MB of serialized or pre-serialized stuff.  Installing it in an R
package is unpleasant from a resource consumption perspective but it works,
you can use data/extdata and work with data with programmatic access,
documentation and checkability.

More recently, it is easy to come across data resources that we'd like to
have package-like control over/access to, but installing such packages
makes no sense.  The volume is too big, and you want to work with the
resource with non-R tools as well from time to time.  You don't want to
move the data.

We should have a protocol for "packaging" data without installing it.  A
digest of the raw data resource should be computed and kept in the
registry.  A registered file can be part of a package that can be checked
and installed, but the data themselves do not move.  Genomic data in S3
buckets should provide a basic use case.

The digest is recomputed whenever we want to start working with the
registry/package to verify that we are working with the intended artifact.
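
One way the "register without installing" protocol might look, as a sketch under stated assumptions: .dataRegistry, registerResource() and resourcePath() are hypothetical names, MD5 via tools::md5sum() is just one possible digest, and a local temporary file stands in for a remote resource such as an object in an S3 bucket.

```r
# Hypothetical registry: resource name -> location and recorded digest.
.dataRegistry <- new.env(parent = emptyenv())

registerResource <- function(name, path) {
    assign(name,
           list(path = path, digest = unname(tools::md5sum(path))),
           envir = .dataRegistry)
    invisible(name)
}

# Recompute the digest before handing back the location; the data never move.
resourcePath <- function(name) {
    entry <- get(name, envir = .dataRegistry)
    if (!identical(unname(tools::md5sum(entry$path)), entry$digest))
        stop("resource '", name, "' no longer matches its registered digest")
    entry$path
}

# Usage with a throwaway file standing in for the real resource.
tmp <- tempfile(fileext = ".bed")
writeLines("chr1\t100\t200", tmp)
registerResource("peaks", tmp)
resourcePath("peaks")
```

Only the name, location and digest would need to ship in the installable package.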


On Tue, Mar 11, 2014 at 11:11 AM, Gabriel Becker <gmbec...@ucdavis.edu> wrote:

Would it be better to let the user (registerer) specify a function, which
could be a simple class constructor or something more complex in cases
where that would be useful?

Yes, good suggestion.


This could allow the concept to generalize to other things, such as
databases that might need some startup machinery called before they are
actually useful to the user.
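
To make the suggestion concrete, here is a minimal sketch of registering a function rather than a class name. All names (.handlerRegistry, registerHandler(), openResource(), the "TextFile" and "ToyDatabase" classes) are hypothetical and not part of GenomicFileViews.

```r
# Hypothetical registry keyed by regular expression; each value is a function.
.handlerRegistry <- new.env(parent = emptyenv())

registerHandler <- function(pattern, fun) {
    stopifnot(is.function(fun))
    assign(pattern, fun, envir = .handlerRegistry)
    invisible(pattern)
}

# Find the first pattern that matches and call its function on the resource.
openResource <- function(x, ...) {
    for (pattern in ls(.handlerRegistry, all.names = TRUE)) {
        if (grepl(pattern, x))
            return(get(pattern, envir = .handlerRegistry)(x, ...))
    }
    stop("no handler registered for ", x)
}

# A plain constructor-like function ...
registerHandler("\\.txt$", function(path)
    structure(list(path = path), class = "TextFile"))
# ... and one that runs startup machinery first (a stand-in for a database).
registerHandler("\\.db$", function(path) {
    message("running setup for ", path)
    structure(list(path = path, ready = TRUE), class = "ToyDatabase")
})

openResource("notes.txt")
```

Because a function carries its own enclosing environment, no separate package or "where" argument is needed in this variant.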


The intent of the registry was to provide a way to look up files by their extension. I'm not sure how this applies to the database example. Do you envision creating multiple databases throughout an R session (vs. a single setup at load time)? For example, if the file has a type 'X' extension, it becomes a type 'X' database, etc.?


This would also deal with Michael's point about package/where since
functions have their own "where" information. Unless I'm missing some
other intent for specifying a specific package?

~G


On Tue, Mar 11, 2014 at 5:59 AM, Michael Lawrence <lawrence.mich...@gene.com> wrote:

rtracklayer essentially has this, although registration is implicit
through extension of RTLFile or RsamtoolsFile, and the extension is taken
from the class name. There is a BigWigFile, corresponding to ".bigwig",
and that is extended by BWFile to support the ".bw" extension. The
expectation is that other packages would extend RTLFile to implicitly
register handlers.  I'm not sure there is a use case for generalization,
but this proposal makes registration more explicit, which is probably a
good thing. rtracklayer was just piggybacking on S4 registration.
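
As an illustration of taking the extension from the class name, here is a toy helper. This is not rtracklayer's code; extensionFromClass() is a hypothetical name and the rule (strip a trailing "File", lower-case the rest) is an assumption based only on the two examples given.

```r
# Hypothetical helper: derive the extension from a class name ending in "File".
extensionFromClass <- function(className) {
    tolower(sub("File$", "", className))
}

extensionFromClass("BigWigFile")
extensionFromClass("BWFile")
```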

I'm a little bit confused by the use of Lists rather than individual File
objects. Are you also proposing that all RTLFiles would need a
corresponding List, and that there would need to be an RTLFileList method
for the various generics?

No, I don't want to force the 'List' route. I was using them in GenomicFileViews so that's what I registered. The 'class' should be any class that has a constructor of the same name. Thinking about this more, the 'class' should probably be the individual File object instead of the List object. Coercion to List can be done inside the helper.


It may not be necessary to specify the package name. There should be an
environment (where) argument that defaults to topenv(parent.frame()), and
that should suffice.
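
A sketch of what a 'where' argument could look like. The two-argument signature, the returned list and constructorFor() are assumptions made for illustration; only the default topenv(parent.frame()) comes from the suggestion above.

```r
# Hypothetical signature: 'where' replaces an explicit package name. When
# called from package code, the default resolves to that package's namespace.
registerFileType <- function(class, pattern, where = topenv(parent.frame())) {
    list(class = class, pattern = pattern, where = where)
}

# Look the constructor up in the recorded environment (and its enclosures).
constructorFor <- function(entry) {
    get(entry$class, envir = entry$where, mode = "function")
}

# Toy constructor; 'where' is passed explicitly here so the example does not
# depend on how the script is evaluated. A package would rely on the default.
ToyFile <- function(path) structure(list(path = path), class = "ToyFile")
entry <- registerFileType("ToyFile", "\\.toy$", where = environment())
constructorFor(entry)("a.toy")
```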

I'll look into this.


Any comments on whether this should be its own package or in an existing one?


Thanks for the input.
Valerie



Michael


On Mon, Mar 10, 2014 at 8:46 PM, Valerie Obenchain <voben...@fhcrc.org> wrote:

Hi all,

I'm soliciting feedback on the idea of a general file 'registry' that
would identify file types by their extensions. This is similar in spirit
to FileForFormat() in rtracklayer but a more general abstraction that
could be used across packages. The goal is to allow a user to supply only
file name(s) to a method instead of first creating a 'File' class such as
BamFile, FaFile, BigWigFile etc.

A first attempt at this is in the GenomicFileViews package
(https://github.com/Bioconductor/GenomicFileViews). A registry (lookup)
is created as an environment at load time:

.fileTypeRegistry <- new.env(parent=emptyenv())

Files are registered with an information triplet consisting of class,
package and regular expression to identify the extension. In
GenomicFileViews we register FaFileList, BamFileList and BigWigFileList
but any 'File' class can be registered that has a constructor of the same
name.

.onLoad <- function(libname, pkgname)
{
     registerFileType("FaFileList", "Rsamtools", "\\.fa$")
     registerFileType("FaFileList", "Rsamtools", "\\.fasta$")
     registerFileType("BamFileList", "Rsamtools", "\\.bam$")
     registerFileType("BigWigFileList", "rtracklayer", "\\.bw$")
}

The makeFileType() helper creates the appropriate class. This function is
used behind the scenes to do the lookup and coerce to the correct 'File'
class.

makeFileType(c("foo.bam", "bar.bam"))
BamFileList of length 2
names(2): foo.bam bar.bam

New types can be added at any time with registerFileType():

registerFileType("NewClass", "NewPackage", "\\.NewExtension$")
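
For readers who want to experiment, here is a minimal self-contained sketch of how such a registry could work. It is not the GenomicFileViews implementation: the function bodies are assumptions, it handles a single file name rather than a vector, and base R's gzfile() (a constructor with the same name as its connection class) stands in for a Bioconductor 'File' class so the example runs without those packages.

```r
# Registry: regular expression -> (class, package).
.fileTypeRegistry <- new.env(parent = emptyenv())

registerFileType <- function(class, package, regex) {
    assign(regex, list(class = class, package = package),
           envir = .fileTypeRegistry)
    invisible(regex)
}

# Look up the single matching entry and call the constructor of that name.
makeFileType <- function(file) {
    patterns <- ls(.fileTypeRegistry, all.names = TRUE)
    hit <- patterns[vapply(patterns, grepl, logical(1), x = file)]
    if (length(hit) != 1L)
        stop("expected exactly one registered type for ", file,
             ", found ", length(hit))
    entry <- get(hit, envir = .fileTypeRegistry)
    constructor <- getExportedValue(entry$package, entry$class)
    constructor(file)
}

# Stand-in registration: class "gzfile" from package "base".
registerFileType("gzfile", "base", "\\.gz$")
con <- makeFileType("reads.fastq.gz")
class(con)
close(con)
```

Failing when zero or several patterns match is one way to surface the ambiguity between packages that register the same extension.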


Thoughts:

(1) If this sounds generally useful, where should it live? rtracklayer,
GenomicFileViews or other? Alternatively it could be its own lightweight
package (FileRegister) that creates the registry and provides the
helpers. It would be up to the package authors that depend on
FileRegister to register their own file types at load time.

(2) To avoid potential ambiguities maybe searching should be by regex and
package name. Still a work in progress.


Valerie



_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




--
Gabriel Becker
Graduate Student
Statistics Department
University of California, Davis




--
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: voben...@fhcrc.org
Phone:  (206) 667-3158
Fax:    (206) 667-1319
