FWIW I've documented the process of making a TxDb object for T2T-CHM13v2.0 here:

  https://github.com/Bioconductor/GenomicFeatures/issues/65

Please comment there for any follow-up. Note that we're considering wrapping
this in a TxDb package that we'll make available to the community. It's a
work in progress.

Thanks!

H.

On 12/12/23 07:29, James W. MacDonald wrote:
> Hi Christian,
>
> This conversation is off-topic, both for this listserv (it’s meant to help
> people developing Bioconductor packages) and for the support site (which is
> meant to help people with, again, Bioconductor packages). I’ll answer your
> questions one more time, but if you have other questions, please move to
> biostars.org, or just ask the ArchR people directly, since it’s their package.
>
> I believe you are misinterpreting what an OrgDb is intended to provide. There
> is no positional data in an OrgDb, and what the CHM13 project has done is
> completely positional (the data provided in the ‘Gene Annotation’ section of
> the CHM13 GitHub are all GFF files, which are meant to provide positional
> information for genes on a genome).
>
> The OrgDb package provides functional and within-annotation mappings. You can
> map an NCBI Gene ID to Ensembl, or to the HGNC gene symbol, or a GO term,
> etc. For example, I can map gene symbol TP53 to NCBI Gene ID 7157, or to its
> UniProt ID K7PPA8. If the new genome build says TP53 has moved to a new
> genomic position, that has no effect on what UniProt thinks the ID for that
> gene’s protein should be, or what ID NCBI uses, or what GO terms are assigned
> to that gene. Functionally it’s the same gene. We just might think it is
> located in a different place in the genome.
>
> CHM13 differs from GRCh38 in much the same way that GRCh38 differs from
> GRCh37 (each represents the current knowledge of the genome at a point in
> time), and while we supply TxDb packages for GRCh38 and GRCh37 (and variants
> based on NCBI’s mappings as well as Ensembl’s mappings), we have never
> supplied more than one human OrgDb package, because the positional and
> functional information are orthogonal.
>
> It seems pretty simple to make what you need though.
>
>> library(GenomicFeatures)  # provides makeTxDbFromGFF() (version 1.54.1 used here)
>> tx <- makeTxDbFromGFF("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz")
> Import genomic features from the file as a GRanges object ... trying URL
> 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz'
> Content type 'application/x-gzip' length 79009538 bytes (75.3 MB)
> downloaded 75.3 MB
>
> OK
> Prepare the 'metadata' data frame ... OK
> Make the TxDb object ... OK
> Warning messages:
> 1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, :
>   some transcripts have no "transcript_id" attribute ==> their name
>   ("tx_name" column in the TxDb object) was set to NA
> 2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, :
>   the transcript names ("tx_name" column in the TxDb object) imported from
>   the "transcript_id" attribute are not unique
> 3: In .find_exon_cds(exons, cds) :
>   The following transcripts have exons that contain more than one CDS (only
>   the first CDS was kept for each exon): rna-NM_001134939.1,
>   rna-NM_001172437.2, rna-NM_001184961.1, rna-NM_001301020.1,
>   rna-NM_001301302.1, rna-NM_001301371.1, rna-NM_002537.3, rna-NM_004152.3,
>   rna-NM_015068.3, rna-NM_016178.2
>> tx
> TxDb object:
> # Db type: TxDb
> # Supporting package: GenomicFeatures
> # Data source: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz
> # Organism: NA
> # Taxonomy ID: NA
> # miRBase build ID: NA
> # Genome: NA
> # Nb of transcripts: 188205
> # Db created by: GenomicFeatures package from Bioconductor
> # Creation time: 2023-12-12 10:17:34 -0500 (Tue, 12 Dec 2023)
> # GenomicFeatures version at creation time: 1.54.1
> # RSQLite version at creation time: 2.3.1
> # DBSCHEMAVERSION: 1.2
>
>> library(ArchR)  # provides createGenomeAnnotation() and createGeneAnnotation()
>> library(BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0)
>> library(org.Hs.eg.db)
>> genomeAnnotation <- createGenomeAnnotation(BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0)
>> geneAnnotation <- createGeneAnnotation(TxDb = tx, OrgDb = org.Hs.eg.db)
>
> Best,
>
> Jim
>
> From: Christian Arnold <chrarn...@web.de>
> Sent: Tuesday, December 12, 2023 9:35 AM
> To: Vincent Carey <st...@channing.harvard.edu>; James W. MacDonald <jmac...@uw.edu>
> Cc: bioc-devel@r-project.org
> Subject: Re: [Bioc-devel] Missing CHM13v2.0 TxDB and OrgDb objects
>
> Dear Vincent and others,
>
> thanks for the reply! Irrespective of whether a different OrgDb is required,
> the name itself suggested that there "should be" also corresponding OrgDb and
> TxDb packages. I see that I can build one on my own, but is there anyone who
> works on providing the TxDb object for Bioc?
>
> I am also asking this because the T2T people specifically provide an
> "updated" gene annotation dataset which may differ from what's inside OrgDb
> and may be incompatible with it. See here: https://github.com/marbl/CHM13
>
> JHU RefSeqv110 + Liftoff v5.1
> <https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13v2.0_RefSeq_Liftoff_v5.1.gff3.gz>:
> This contains curated annotations of the ampliconic genes on the Y
> chromosome, correcting annotation errors in GENCODEv35 CAT/Liftoff and
> RefSeqv110 annotation. Additional copies found in T2T-Y were annotated to the
> closest available gene in RefSeq, allowing multiple genes to have the same
> common name. This file has been modified to correct special character issues
> from the original file.
>
> For ArchR, I tried to understand how one can create a new genome by checking
> here: https://www.archrproject.com/bookdown/getting-set-up.html
> There, they explicitly mention the TxDb and OrgDb objects that are needed
> for building a custom genome. There seems to be another option when either or
> both of these are not available ("Alternatively, if you dont have a TxDb and
> OrgDb object, you can create a geneAnnotation object from the following
> information"), but I first tried to do it the easy way as I want to properly
> embed it in a pipeline with as little "custom" code as possible.
>
> Thanks,
> Christian
>
> On 11/12/2023 15:30, Vincent Carey wrote:
> Thanks Jim, I tend to agree with you. Christian, I had a look at ArchR but
> could not tell where the system contacts the Bioc annotation elements. Can
> you give some hints? I'd like to be able to verify compatibility.
>
> On Mon, Dec 11, 2023 at 9:19 AM James W. MacDonald <jmac...@uw.edu> wrote:
> I don't believe a different OrgDb is required. The OrgDb package is meant to
> provide annotations for genes such as gene symbol or GO term, etc., which are
> orthogonal to the sequence of the genome, so the current version should
> suffice.
>
> -----Original Message-----
> From: Bioc-devel <bioc-devel-boun...@r-project.org> On Behalf Of Vincent Carey
> Sent: Sunday, December 10, 2023 1:44 PM
> To: Christian Arnold <chrarn...@web.de>
> Cc: bioc-devel@r-project.org
> Subject: Re: [Bioc-devel] Missing CHM13v2.0 TxDB and OrgDb objects
>
> Good question. I believe these will be forthcoming soon. In the meantime
> you can create your own.
> See, for example
>
> https://github.com/vjcitn/BiocT2T/blob/devel/inst/scripts/makeTxDb.R
>
> It's an active area so you can pull a gff file from
> https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=T2T/CHM13/assemblies/annotation/
> and adjust the code noted above for the TxDb.
>
> For the org.db I have to get back to you.
>
> On Sun, Dec 10, 2023 at 12:06 PM Christian Arnold via Bioc-devel
> <bioc-devel@r-project.org> wrote:
>
>> Hello, I am working with the new human T2T-CHM13v2.0 assembly and
>> while a BSgenome package already exists
>> (BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0), I could not find the
>> corresponding TxDb and OrgDb packages. Is there any information when
>> they may also become available so it is easier to work with the new
>> genome for packages like ArchR, which support a custom genome but need
>> these standard annotation packages for their creation?
>>
>>
>> Thanks a lot for any information regarding this!
>>
>> Best, Christian
>>
>> _______________________________________________
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel