Re: [Bioc-devel] Missing CHM13v2.0 TxDB and OrgDb objects

Hervé Pagès Tue, 12 Dec 2023 11:28:48 -0800

FWIW I've documented the process of making a TxDb object for 
T2T-CHM13v2.0 there:


https://github.com/Bioconductor/GenomicFeatures/issues/65

Please comment there for any follow-up.

Note that we're considering wrapping this in a TxDb package that we'll 
make available to the community. It's a work in progress.
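
Until such a package exists, a locally built TxDb can be cached on disk and reloaded in later sessions. The following is only a minimal sketch, not the procedure documented in the issue above. It assumes that GenomicFeatures and AnnotationDbi are installed and that the NCBI RefSeq GFF (the URL quoted later in this thread) is reachable; the output file name is a placeholder chosen for illustration. In Bioconductor releases newer than the one used in this thread, makeTxDbFromGFF() is provided by the txdbmaker package instead.

```r
library(GenomicFeatures)  # provides makeTxDbFromGFF() in the 1.54.x series
library(AnnotationDbi)    # provides saveDb() and loadDb()

# NCBI RefSeq annotation for T2T-CHM13v2.0 (URL taken from this thread)
gff_url <- paste0(
  "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/",
  "GCF_009914755.1_T2T-CHM13v2.0/",
  "GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz"
)

# Placeholder file name, chosen for illustration only
txdb_file <- "TxDb.T2T-CHM13v2.0.refseq.sqlite"

if (!file.exists(txdb_file)) {
  # Downloads and parses the GFF (about 75 MB); expect import warnings
  txdb <- makeTxDbFromGFF(gff_url)
  saveDb(txdb, file = txdb_file)
}

# Later sessions can skip the download and load the cached database
txdb <- loadDb(txdb_file)
```

The cached SQLite file can then be passed around or loaded wherever a TxDb object is expected.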

Thanks!

H.

On 12/12/23 07:29, James W. MacDonald wrote:
> Hi Christian,
>
> This conversation is off-topic, both for this listserv (it’s meant to help 
> people developing Bioconductor packages) and for the support site (which is, 
> again, meant to help people with Bioconductor packages). I’ll answer your 
> questions one more time, but if you have other questions, please move to 
> biostars.org, or just ask the ArchR people directly, since it’s their package.
>
> I believe you are misinterpreting what an OrgDb is intended to provide. There 
> is no positional data in an OrgDb, and what the CHM13 project has done is 
> completely positional (the data provided in the ‘Gene Annotation’ 
> section of the CHM13 GitHub are all GFF files, which are meant to provide 
> positional information for genes on a genome).
>
> The OrgDb package provides functional and within-annotation mappings. You can 
> map an NCBI Gene ID to Ensembl, or to the HGNC gene symbol, or a GO term, 
> etc. For example, I can map Gene symbol P53 to NCBI Gene ID 7157, or to the 
> UniProt accession K7PPA8. If the new genome build says P53 has moved to a new 
> genomic position, that has no effect on what UniProt thinks the ID for that 
> gene’s protein should be, or what ID NCBI uses, or what GO terms are attached 
> to that gene. Functionally it’s the same gene. We just might think it is 
> located in a different place in the genome.
>
> The difference between CHM13 and GRCh38 is not materially different from 
> that between GRCh37 and GRCh38 (each represents the current knowledge of 
> the genome at a point in time), and while we supply TxDb packages for GRCh38 
> and GRCh37 (and variants based on NCBI’s mappings as well as Ensembl’s 
> mappings), we have never supplied more than one human OrgDb package, because 
> the positional and functional information are orthogonal.
>
> It seems pretty simple to make what you need though.
>
>> library(GenomicFeatures)
>> tx <- makeTxDbFromGFF("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz")
> Import genomic features from the file as a GRanges object ... trying URL 
> 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz'
> Content type 'application/x-gzip' length 79009538 bytes (75.3 MB)
> downloaded 75.3 MB
>
> OK
> Prepare the 'metadata' data frame ... OK
> Make the TxDb object ... OK
> Warning messages:
> 1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
>    some transcripts have no
>    "transcript_id" attribute ==>
>    their name ("tx_name" column in
>    the TxDb object) was set to NA
> 2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
>    the transcript names ("tx_name"
>    column in the TxDb object)
>    imported from the
>    "transcript_id" attribute are
>    not unique
> 3: In .find_exon_cds(exons, cds) : The following transcripts have
>    exons that contain more than one
>    CDS (only the first CDS was kept
>    for each exon):
>    rna-NM_001134939.1,
>    rna-NM_001172437.2,
>    rna-NM_001184961.1,
>    rna-NM_001301020.1,
>    rna-NM_001301302.1,
>    rna-NM_001301371.1,
>    rna-NM_002537.3,
>    rna-NM_004152.3,
>    rna-NM_015068.3, rna-NM_016178.2
>> tx
> TxDb object:
> # Db type: TxDb
> # Supporting package: GenomicFeatures
> # Data source: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz
> # Organism: NA
> # Taxonomy ID: NA
> # miRBase build ID: NA
> # Genome: NA
> # Nb of transcripts: 188205
> # Db created by: GenomicFeatures package from Bioconductor
> # Creation time: 2023-12-12 10:17:34 -0500 (Tue, 12 Dec 2023)
> # GenomicFeatures version at creation time: 1.54.1
> # RSQLite version at creation time: 2.3.1
> # DBSCHEMAVERSION: 1.2
>
> library(ArchR)
> library(BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0)
> library(org.Hs.eg.db)
> genomeAnnotation <- createGenomeAnnotation(BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0)
> geneAnnotation <- createGeneAnnotation(TxDb = tx, OrgDb = org.Hs.eg.db)
>
>
> Best,
>
> Jim
>
> From: Christian Arnold <chrarn...@web.de>
> Sent: Tuesday, December 12, 2023 9:35 AM
> To: Vincent Carey <st...@channing.harvard.edu>; James W. MacDonald <jmac...@uw.edu>
> Cc: bioc-devel@r-project.org
> Subject: Re: [Bioc-devel] Missing CHM13v2.0 TxDB and OrgDb objects
>
> Dear Vincent and others,
>
> thanks for the reply! Irrespective of whether a different OrgDb is required, 
> the name itself suggested that there "should be" also corresponding OrgDb and 
> TxDb packages. I see that I can build one on my own, but is there anyone 
> working on providing the TxDb object for Bioc?
>
> I am also asking this because the T2T people specifically provide an 
> "updated" gene annotation dataset which may differ from what's inside OrgDb 
> and may be incompatible with it? See here: https://github.com/marbl/CHM13:
>
> JHU RefSeqv110 + Liftoff v5.1 
> <https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13v2.0_RefSeq_Liftoff_v5.1.gff3.gz>:
>  This contains curated annotations of the ampliconic genes on the Y 
> chromosome, correcting annotation errors in GENCODEv35 CAT/Liftoff and 
> RefSeqv110 annotation. Additional copies found in T2T-Y were annotated to the 
> closest available gene in RefSeq, allowing multiple genes to have the same 
> common name. This file has been modified to correct special character issues 
> from the original file.
>
>
>
>
> For ArchR, I tried to understand how one can create a new genome by checking 
> here: https://www.archrproject.com/bookdown/getting-set-up.html.
> There, they explicitly mention the TxDb and OrgDb objects that are needed 
> for building a custom genome. There seems to be another option when either or 
> both of these are not available ("Alternatively, if you dont have a TxDb and 
> OrgDb object, you can create a geneAnnotation object from the following 
> information"), but I first tried to do it the easy way as I want to properly 
> embed it in a pipeline with as little "custom" code as possible.
>
>
>
> Thanks,
> Christian
>
>
>
>
> On 11/12/2023 15:30, Vincent Carey wrote:
> Thanks Jim, I tend to agree with you.  Christian, I had a look at ArchR but 
> could not tell where the
> system contacts the Bioc annotation elements.  Can you give some hints?  I'd 
> like to be able to
> verify compatibility.
>
> On Mon, Dec 11, 2023 at 9:19 AM James W. MacDonald <jmac...@uw.edu> wrote:
> I don't believe a different OrgDb is required. The OrgDb package is meant to 
> provide annotations for genes such as gene symbol or GO term, etc, which are 
> orthogonal to the sequence of the genome, so the current version should 
> suffice.
>
> -----Original Message-----
> From: Bioc-devel <bioc-devel-boun...@r-project.org> On Behalf Of Vincent Carey
> Sent: Sunday, December 10, 2023 1:44 PM
> To: Christian Arnold <chrarn...@web.de>
> Cc: bioc-devel@r-project.org
> Subject: Re: [Bioc-devel] Missing CHM13v2.0 TxDB and OrgDb objects
>
> Good question.  I believe these will be forthcoming soon.  In the meantime 
> you can create your own.  See, for example
>
> https://github.com/vjcitn/BiocT2T/blob/devel/inst/scripts/makeTxDb.R
>
> It's an active area so you can pull a gff file from 
> https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=T2T/CHM13/assemblies/annotation/
> and adjust the code noted above for the TxDb.
>
> For the org.db I have to get back to you.
>
> On Sun, Dec 10, 2023 at 12:06 PM Christian Arnold via Bioc-devel 
> <bioc-devel@r-project.org> wrote:
>
>> Hello, I am working with the new human T2T-CHM13v2.0  assembly and
>> while a BSgenome package already exists
>> (BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0), I could not find the
>> corresponding TxDb and OrgDb packages. Is there any information on when
>> they may also become available so it is easier to work with the new
>> genome for packages like ArchR, which support a custom genome but need
>> these standard annotation packages for their creation?
>>
>>
>> Thanks a lot for any information regarding this!
>>
>> Best, Christian
>>
>> _______________________________________________
>> Bioc-devel@r-project.org  mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
