Hi all,
I am thinking of an open-source platform built on top of IPFS (https://medium.com/@ConsenSys/an-introduction-to-ipfs-9bba4860abd0). I recently found out that NCBI/NIH is struggling with the large amounts of data being generated from non-human genomes and is now sending that data off to Europe (https://ncbiinsights.ncbi.nlm.nih.gov/2017/05/09/phasing-out-support-for-non-human-genome-organism-data-in-dbsnp-and-dbvar/). I am guessing that even EVA may soon start suffering from the data overload, as the volume of genomic data is projected to grow 4-5 fold over the next decade (PLOS: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195). GB/$ and bandwidth/$ are unlikely to grow at this pace, so the problem is likely to get worse. As I see it, the prohibitive cost will drive these agencies toward some kind of cost sharing using distributed storage and networking.

In short, IPFS is a p2p distributed file system with built-in incentive systems that encourage users to locally cache and distribute data; it combines pre-existing and new systems (git, DHTs, Kademlia) that came out of ARPA/DARPA/IETF/Bell Labs (https://github.com/ipfs/ipfs#more-about-ipfs). I am thinking of this more as a middle layer between the platform (potentially Bioconductor) and IPFS. Each dataset would be assigned an IPFS hash, and the hash would be maintained on a website along with any metadata describing the information contained in the dataset. So instead of users having to query specific servers through server-specific APIs, they would query the dataset through IPFS with a uniform API (ipfs get /ipfs/<hash>).

Potential advantages:
1. Standardizes data access/pipelines across multiple organizations, replacing multiple server-specific APIs with a single simple interface (ipfs get <hash>)
2. Reduces the cost of data storage/distribution by spreading storage and access costs over the entire network
3. Proven to work with large datasets
4.
Backward compatible with existing data transport networks
5. Built-in incentives for users to store and distribute data via Bitswap, Filecoin and Ethereum

IPFS is designed as a networked file system, so it should be integrated the way other software platforms integrate with file systems. I was therefore thinking it is best to have it as a default package within the platform, perhaps as a middle layer. There are already amazing platforms out there, so I am not proposing to build an entire platform, but rather to integrate IPFS with a great existing one so that it benefits potential suppliers and consumers of data. Because the problem is felt most acutely in the genomics community, and Bioconductor is one of the most widely used pieces of software there, I was thinking it is best to integrate with Bioconductor. However, now that I have read a little more about it, I see that Bioconductor is a set of packages. Do you think it is better to integrate with RStudio/R, the platform on which Bioconductor is developed?

Are there any existing projects that already do this? Or similar projects I could look into to get ideas from? Do you see any holes in my logic? I can go into more detail on the use-case scenario I am currently thinking of.

Thanks,
Paul

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel