thanks

On Thu, May 23, 2024 at 5:36 PM Chris Wilks (gmail) <broadsw...@gmail.com> wrote:
> Thanks Vince, understood about the Core's focus right now.
>
> I think this is something that Leo and I can fix among ourselves for the
> time being.
>
> Looking forward, as you brought up, if we were to refresh recount or
> produce a recount4 (discussed) we'd certainly consider additional coverage
> formats.
>
> I'm aware of tiledb though not duckdb (I'll have to check it out), thanks
> for the pointer.
>
> There's also the D4 format from Aaron Quinlan's lab from a few years ago
> which was explicitly designed to replace bigwigs:
> https://www.nature.com/articles/s43588-021-00085-0
>
> All that said, we're pretty committed to bigwigs at this point given the
> ~750,000 sequence runs we've encoded using them for recount3.
>
> On Wed, May 22, 2024 at 7:17 AM Vincent Carey <st...@channing.harvard.edu>
> wrote:
>
>> Really glad to see this discussion moving forward. I would say that the
>> core is wrangling with some even lower-level technical concerns right
>> now, so I can't jump in just now. I just want to raise the question of
>> whether bigWig files are a technologically sound format to continue
>> investing in for the use case of targeted remote query resolution on
>> genomic coordinates. A number of new concepts have come into play since
>> bigWig was designed and implemented. I'll naively mention duckdb and
>> tiledb, which seem to have very good remote performance. Maybe these are
>> too generic ... are there other concepts in GA4GH that might be relevant
>> to leverage for recount-like projects in the future?
>>
>> On Wed, May 22, 2024 at 6:58 AM Chris Wilks (gmail) <broadsw...@gmail.com>
>> wrote:
>>
>>> Thanks for sharing Leo, this does interest me, especially since so much
>>> is built on BigWig access via rtracklayer at least in the recount2
>>> ecosystem.
>>>
>>> As you alluded to, Megadepth currently supports remote access of BigWigs
>>> (and BAMs) over HTTPS on all platforms (Linux, MacOS, and Windows),
>>> getting back just the byte ranges overlapping the set of regions
>>> requested so it should work for at least recount2/recount3 and anything
>>> that uses HTTP/s.
>>>
>>> I'd be open to exploring updates to the Megadepth C/C++ code side to
>>> support Rle if that makes sense to replace rtracklayer.
>>> But to do that you'd need to be involved in updating all the R packages
>>> if you're willing (both megadepth and those that currently rely on
>>> rtracklayer for this functionality).
>>>
>>> Let me know if you want to chat about this over Zoom,
>>> Chris
>>>
>>> On Tue, May 21, 2024 at 2:41 PM Leonardo Collado Torres <lcollado...@gmail.com> wrote:
>>>
>>> > Hi Bioc-devel,
>>> >
>>> > As some of you are aware, rtracklayer::import() has long provided
>>> > access to import BigWig files. Those files can be shared on servers
>>> > and accessed remotely thanks to all the effort from many of you in
>>> > building and maintaining rtracklayer.
>>> >
>>> > From my side, derfinder::loadCoverage() relies on
>>> > rtracklayer::import.bw(), and recount::expressed_regions() +
>>> > recount::coverage_matrix() use derfinder::loadCoverage().
>>> > recountWorkflow showcases those recount functions on larger datasets.
>>> > brainflowprobes by Amanda Price, Nina Rajpurohit and others also ends
>>> > up relying on rtracklayer::import.bw() through these functions.
>>> >
>>> > At https://github.com/lawremi/rtracklayer/issues/83 I initially
>>> > reported some issues once our recount2/3 data host changed, but
>>> > previously Brian Schilder also reported that one could no longer read
>>> > remote files https://github.com/lawremi/rtracklayer/issues/73.
>>> > https://github.com/lawremi/rtracklayer/issues/63 and/or
>>> > https://github.com/lawremi/rtracklayer/issues/65 might have been
>>> > related.
>>> >
>>> > Yesterday I updated
>>> > https://github.com/lawremi/rtracklayer/issues/83#issuecomment-2121313270
>>> > with a comment showing some small reproducible code, and that the
>>> > workaround of downloading the data first, then using
>>> > rtracklayer::import() on the local data does work. However, this
>>> > workaround does involve a lot of, hmm, wasteful data transfer.
>>> >
>>> > On the recount vignette at some point I access just chrY of a bigWig
>>> > file that is about 1300 MB. On the recountWorkflow vignette I do
>>> > something similar for a 7GB bigWig file. Previously accessing just
>>> > chrY on these files was a small data transfer.
>>> >
>>> > On recountWorkflow version 1.29.2
>>> > https://github.com/LieberInstitute/recountWorkflow, I've included
>>> > pre-computed results (~2 MB) to avoid downloading tons of data, though
>>> > the vignette code shows how to actually fully reproduce the results if
>>> > you don't mind downloading those large files. I also implemented some
>>> > workarounds on recount, though I haven't yet gone the full route of
>>> > including pre-computed results. I have yet to try implementing a
>>> > workaround for brainflowprobes.
>>> >
>>> > My understanding is that rtracklayer's root issues are elsewhere and
>>> > changes in dependencies rtracklayer has likely created these problems.
>>> > These problems are not always in the control of rtracklayer authors to
>>> > resolve, and also create an unexpected burden on them.
>>> >
>>> > If one considers alternatives to rtracklayer, I see that there's a new
>>> > package https://github.com/PoisonAlien/trackplot/tree/master that uses
>>> > bwtool (a system dependency), and older alternative
>>> > https://github.com/andrelmartins/bigWig that hasn't had updates in 4
>>> > years, and a CRAN package
>>> > (https://cran.r-project.org/web/packages/wig/readme/README.html) that
>>> > recommends using rtracklayer for larger files.
>>> > I guess that I could also try using megadepth
>>> > https://research.libd.org/megadepth/, though
>>> > derfinder::loadCoverage uses rtracklayer::import(as = "RleList") for
>>> > efficiency
>>> > https://github.com/lcolladotor/derfinder/blob/f9cd986e0c1b9ea6551d0d8d2077d4501216a661/R/loadCoverage.R#L401
>>> > and lots of functions in that package were built for that structure
>>> > (RleList objects). I likely missed other alternatives.
>>> >
>>> > My current line of thought is to keep implementing workarounds using
>>> > local data (sometimes with pre-computed results) for recount,
>>> > recountWorkflow, and brainflowprobes (derfinder only has tests with
>>> > local bigWig files) without really altering the internals of those
>>> > packages. That is, assume that the remote BigWig file access via
>>> > rtracklayer will indefinitely be suspended, though it could be
>>> > supported again at some point and when it does, those packages will
>>> > work again with remote BigWig files as if nothing ever happened. But I
>>> > wanted to check in if this is what others who use BigWig files are
>>> > thinking of doing.
>>> >
>>> > Thanks!
>>> >
>>> > Best,
>>> > Leo
>>> >
>>> > Leonardo Collado Torres, Ph. D.
>>> > Investigator, LIEBER INSTITUTE for BRAIN DEVELOPMENT
>>> > Assistant Professor, Department of Biostatistics
>>> > Johns Hopkins Bloomberg School of Public Health
>>> > 855 N. Wolfe St., Room 382
>>> > Baltimore, MD 21205
>>> > lcolladotor.github.io
>>> > lcollado...@gmail.com
>>>
>>> _______________________________________________
>>> Bioc-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>> The information in this email is intended only for the person to whom it
>> is addressed. If you believe this e-mail was sent to you in error and
>> the email contains patient information, please contact the Partners
>> Compliance HelpLine at http://www.partners.org/complianceline . If the
>> email was sent to you in error but does not contain patient information,
>> please contact the sender and properly dispose of the email.