Except for the checksum, the existing File classes should already support
this: the package provides a dataset via data() that is just the serialized
File object (i.e., the path). One could create a FileWithChecksum class that
decorates a File object with a checksum. Any attempt to read the file is
intercepted by the decorator, which verifies the checksum and then delegates
to the wrapped File.
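
A minimal sketch of such a decorator, under stated assumptions: PlainFile and
readFile() are hypothetical stand-ins for the existing File classes and their
read methods, and MD5 via tools::md5sum() is just one possible digest.

```r
library(methods)
library(tools)   # provides md5sum()

# Hypothetical stand-in for an existing File class; it only holds a path.
setClass("PlainFile", representation(path = "character"))

# Hypothetical read generic; real File classes have their own import methods.
setGeneric("readFile", function(file, ...) standardGeneric("readFile"))
setMethod("readFile", "PlainFile", function(file, ...) readLines(file@path))

# Decorator: wraps a file object together with the checksum recorded for it.
setClass("FileWithChecksum",
         representation(file = "ANY", checksum = "character"))

# By default the checksum is computed at construction (registration) time.
FileWithChecksum <- function(file, checksum = unname(md5sum(file@path))) {
    new("FileWithChecksum", file = file, checksum = checksum)
}

# Intercept the read: verify the checksum, then delegate to the wrapped file.
setMethod("readFile", "FileWithChecksum", function(file, ...) {
    current <- unname(md5sum(file@file@path))
    if (!identical(current, file@checksum))
        stop("checksum mismatch for ", file@file@path)
    readFile(file@file, ...)
})

# Usage: the read succeeds until the underlying file changes.
tmp <- tempfile(fileext = ".txt")
writeLines(c("chr1", "chr2"), tmp)
f <- FileWithChecksum(new("PlainFile", path = tmp))
readFile(f)
```

Recomputing the digest on every read is the simplest choice but is costly for
large files; verifying once per session would be a reasonable variation.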

Michael


On Tue, Mar 11, 2014 at 8:53 AM, Vincent Carey
<st...@channing.harvard.edu> wrote:

> I'm going to suggest a use case that may motivate this type of development.
>
> Up to 2010 or so, data packages generally made sense.  You have about
> 100-500MB of serialized or pre-serialized stuff.  Installing it in an R
> package is unpleasant from a resource consumption perspective but it works,
> you can use data/extdata and work with data with programmatic access,
> documentation and checkability.
>
> More recently, it is easy to come across data resources that we'd like to
> have package-like control over/access to, but installing such packages
> makes no sense.  The volume is too big, and you want to work with the
> resource with non-R tools as well from time to time.  You don't want to
> move the data.
>
> We should have a protocol for "packaging" data without installing it.  A
> digest of the raw data resource should be computed and kept in the
> registry.  A registered file can be part of a package that can be checked
> and installed, but the data themselves do not move.  Genomic data in S3
> buckets should provide a basic use case.
>
> The digest is recomputed whenever we want to start working with the
> registry/package to verify that we are working with the intended artifact.
>
>
> On Tue, Mar 11, 2014 at 11:11 AM, Gabriel Becker <gmbec...@ucdavis.edu> wrote:
>
>> Would it be better to let the user (registerer) specify a function, which
>> could be a simple class constructor or something more complex in cases
>> where that would be useful?
>>
>> This could allow the concept to generalize to other things, such as
>> databases that might need some startup machinery called before they are
>> actually useful to the user.
>>
>> This would also deal with Michael's point about package/where since
>> functions have their own "where" information. Unless I'm missing some
>> other intent for specifying a specific package?
>>
>> ~G
>>
>>
>> On Tue, Mar 11, 2014 at 5:59 AM, Michael Lawrence
>> <lawrence.mich...@gene.com> wrote:
>>
>> > rtracklayer essentially has this, although registration is implicit
>> > through extension of RTLFile or RsamtoolsFile, and the extension is
>> > taken from the class name. There is a BigWigFile, corresponding to
>> > ".bigwig", and that is extended by BWFile to support the ".bw"
>> > extension. The expectation is that other packages would extend RTLFile
>> > to implicitly register handlers.  I'm not sure there is a use case for
>> > generalization, but this proposal makes registration more explicit,
>> > which is probably a good thing. rtracklayer was just piggybacking on S4
>> > registration.
>> >
>> > I'm a little bit confused by the use of Lists rather than individual
>> > File objects. Are you also proposing that all RTLFiles would need a
>> > corresponding List, and that there would need to be an RTLFileList
>> > method for the various generics?
>> >
>> > It may not be necessary to specify the package name. There should be an
>> > environment (where) argument that defaults to topenv(parent.frame()),
>> > and that should suffice.
>> >
>> > Michael
>> >
>> >
>> > On Mon, Mar 10, 2014 at 8:46 PM, Valerie Obenchain <voben...@fhcrc.org>
>> > wrote:
>> >
>> > > Hi all,
>> > >
>> > > I'm soliciting feedback on the idea of a general file 'registry' that
>> > > would identify file types by their extensions. This is similar in
>> > > spirit to FileForFormat() in rtracklayer but a more general
>> > > abstraction that could be used across packages. The goal is to allow a
>> > > user to supply only file name(s) to a method instead of first creating
>> > > an instance of a 'File' class such as BamFile, FaFile, BigWigFile etc.
>> > >
>> > > A first attempt at this is in the GenomicFileViews package
>> > > (https://github.com/Bioconductor/GenomicFileViews). A registry
>> > > (lookup) is created as an environment at load time:
>> > >
>> > > .fileTypeRegistry <- new.env(parent=emptyenv())
>> > >
>> > > Files are registered with an information triplet consisting of class,
>> > > package and regular expression to identify the extension. In
>> > > GenomicFileViews we register FaFileList, BamFileList and
>> > > BigWigFileList but any 'File' class can be registered that has a
>> > > constructor of the same name.
>> > >
>> > > .onLoad <- function(libname, pkgname)
>> > > {
>> > >     registerFileType("FaFileList", "Rsamtools", "\\.fa$")
>> > >     registerFileType("FaFileList", "Rsamtools", "\\.fasta$")
>> > >     registerFileType("BamFileList", "Rsamtools", "\\.bam$")
>> > >     registerFileType("BigWigFileList", "rtracklayer", "\\.bw$")
>> > > }
>> > >
>> > > The makeFileType() helper creates the appropriate class. This
>> > > function is used behind the scenes to do the lookup and coerce to the
>> > > correct 'File' class.
>> > >
>> > > > makeFileType(c("foo.bam", "bar.bam"))
>> > > BamFileList of length 2
>> > > names(2): foo.bam bar.bam
>> > >
>> > > New types can be added at any time with registerFileType():
>> > >
>> > > registerFileType("NewClass", "NewPackage", "\\.NewExtension$")
>> > >
>> > >
>> > > Thoughts:
>> > >
>> > > (1) If this sounds generally useful, where should it live?
>> > > rtracklayer, GenomicFileViews or other? Alternatively it could be its
>> > > own lightweight package (FileRegister) that creates the registry and
>> > > provides the helpers. It would be up to the package authors that
>> > > depend on FileRegister to register their own file types at load time.
>> > >
>> > > (2) To avoid potential ambiguities maybe searching should be by regex
>> > > and package name. Still a work in progress.
>> > >
>> > >
>> > > Valerie
>> > >
>> >
>> >         [[alternative HTML version deleted]]
>> >
>> > _______________________________________________
>> > Bioc-devel@r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> >
>>
>>
>>
>> --
>> Gabriel Becker
>> Graduate Student
>> Statistics Department
>> University of California, Davis