Re: [Bioc-devel] Missing CHM13v2.0 TxDB and OrgDb objects

James W. MacDonald Tue, 12 Dec 2023 07:33:29 -0800

Hi Christian,

This conversation is off-topic, both for this listserv (it’s meant to help 
people developing Bioconductor packages) and for the support site (which is 
meant to help people with, again, Bioconductor packages). I’ll answer your 
questions one more time, but if you have other questions, please move to 
biostars.org, or just ask the ArchR people directly, since it’s their package.

I believe you are misinterpreting what an OrgDb is intended to provide. There 
is no positional data in an OrgDb, and what the CHM13 project has done is 
entirely positional (the files provided in the ‘Gene Annotation’ section of 
the CHM13 GitHub are all GFF files, which are meant to provide positional 
information for genes on a genome).

The OrgDb package provides functional and within-annotation mappings. You can 
map an NCBI Gene ID to Ensembl, or to the HGNC gene symbol, or a GO term, etc. 
For example, I can map the gene symbol TP53 (p53) to NCBI Gene ID 7157, or to 
its UniProt accession K7PPA8. If the new genome build says TP53 has moved to a 
new genomic position, that has no effect on what UniProt thinks the ID for 
that gene’s protein should be, or what ID NCBI uses, or what GO terms are 
attached to that gene. Functionally it’s the same gene. We just might think it 
is located in a different place in the genome.
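
As a minimal sketch of that kind of lookup, assuming org.Hs.eg.db is 
installed, the calls below use TP53 (the official HGNC symbol for p53) as the 
key. The object names ids and go are just illustrative. Nothing in these 
calls refers to a genome build.

```r
# Sketch only: assumes the Bioconductor annotation package org.Hs.eg.db
# is installed.
library(org.Hs.eg.db)  # also attaches AnnotationDbi

# Look up identifiers for one gene symbol. No genome build is involved.
ids <- AnnotationDbi::select(
  org.Hs.eg.db,
  keys    = "TP53",
  keytype = "SYMBOL",
  columns = c("ENTREZID", "UNIPROT")
)

# A symbol can map to several UniProt accessions, so expect multiple rows.
print(ids)

# The same kind of lookup works for other columns, e.g. GO terms by Gene ID.
go <- AnnotationDbi::select(
  org.Hs.eg.db,
  keys    = unique(ids$ENTREZID),
  keytype = "ENTREZID",
  columns = "GO"
)
head(go)
```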

The difference between CHM13 and GRCh38 is not materially different from the 
difference between GRCh37 and GRCh38 (they represent the current knowledge of 
the genome at a point in time), and while we supply TxDb packages for GRCh38 
and GRCh37 (and variants based on NCBI’s mappings as well as Ensembl’s 
mappings), we have never supplied more than one human OrgDb package, because 
the positional and functional information are orthogonal.

It seems pretty simple to make what you need though.

> library(GenomicFeatures)
> tx <- makeTxDbFromGFF("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz")
Import genomic features from the file as a GRanges object ... trying URL 
'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz'
Content type 'application/x-gzip' length 79009538 bytes (75.3 MB)
downloaded 75.3 MB

OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  some transcripts have no
  "transcript_id" attribute ==>
  their name ("tx_name" column in
  the TxDb object) was set to NA
2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  the transcript names ("tx_name"
  column in the TxDb object)
  imported from the
  "transcript_id" attribute are
  not unique
3: In .find_exon_cds(exons, cds) : The following transcripts have
  exons that contain more than one
  CDS (only the first CDS was kept
  for each exon):
  rna-NM_001134939.1,
  rna-NM_001172437.2,
  rna-NM_001184961.1,
  rna-NM_001301020.1,
  rna-NM_001301302.1,
  rna-NM_001301371.1,
  rna-NM_002537.3,
  rna-NM_004152.3,
  rna-NM_015068.3, rna-NM_016178.2
> tx
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: 
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 188205
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2023-12-12 10:17:34 -0500 (Tue, 12 Dec 2023)
# GenomicFeatures version at creation time: 1.54.1
# RSQLite version at creation time: 2.3.1
# DBSCHEMAVERSION: 1.2

genomeAnnotation <- createGenomeAnnotation(BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0)
geneAnnotation <- createGeneAnnotation(TxDb = tx, OrgDb = org.Hs.eg.db)
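
If this ends up in a pipeline, one option is to cache the TxDb on disk so the 
GFF is not downloaded and parsed on every run. A minimal sketch, assuming 
GenomicFeatures 1.54.x as in the session above (in newer Bioconductor releases 
makeTxDbFromGFF lives in the txdbmaker package, so the call may need 
adjusting). The helper get_txdb() and the cache file name are hypothetical.

```r
# Sketch only: get_txdb() and the cache file name are hypothetical.
gff_url <- paste0(
  "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/",
  "GCF_009914755.1_T2T-CHM13v2.0/",
  "GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz"
)

get_txdb <- function(gff, cache_file) {
  # Reuse the SQLite file if an earlier run already saved it.
  if (file.exists(cache_file)) {
    return(AnnotationDbi::loadDb(cache_file))
  }
  # Otherwise build the TxDb from the GFF and save it for next time.
  txdb <- GenomicFeatures::makeTxDbFromGFF(gff)
  AnnotationDbi::saveDb(txdb, file = cache_file)
  txdb
}

# Usage (not run here; the first call downloads the ~75 MB GFF):
# tx <- get_txdb(gff_url, "T2T-CHM13v2.0.refseq.sqlite")
```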

Best,

Jim

From: Christian Arnold <chrarn...@web.de>
Sent: Tuesday, December 12, 2023 9:35 AM
To: Vincent Carey <st...@channing.harvard.edu>; James W. MacDonald 
<jmac...@uw.edu>
Cc: bioc-devel@r-project.org
Subject: Re: [Bioc-devel] Missing CHM13v2.0 TxDB and OrgDb objects


Dear Vincent and others,

thanks for the reply! Irrespective of whether a different OrgDb is required, 
the name itself suggested that there "should be" corresponding OrgDb and TxDb 
packages as well. I see that I can build one on my own, but is anyone working 
on providing the TxDb object for Bioc?

I am also asking this because the T2T people specifically provide an "updated" 
gene annotation dataset which may differ from what's inside the OrgDb and may 
be incompatible with it? See here: https://github.com/marbl/CHM13:

JHU RefSeqv110 + Liftoff v5.1 
<https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13v2.0_RefSeq_Liftoff_v5.1.gff3.gz>:
 This contains curated annotations of the ampliconic genes on the Y chromosome, 
correcting annotation errors in GENCODEv35 CAT/Liftoff and RefSeqv110 
annotation. Additional copies found in T2T-Y were annotated to the closest 
available gene in RefSeq, allowing multiple genes to have the same common name. 
This file has been modified to correct special character issues from the 
original file.

For ArchR, I tried to understand how one can create a new genome by checking 
https://www.archrproject.com/bookdown/getting-set-up.html. There, they 
explicitly mention the TxDb and OrgDb objects that are needed for building a 
custom genome. There seems to be another option when either or both of these 
are not available ("Alternatively, if you dont have a TxDb and OrgDb object, 
you can create a geneAnnotation object from the following information"), but I 
first tried to do it the easy way, as I want to embed it properly in a 
pipeline with as little "custom" code as possible.

Thanks,
Christian

On 11/12/2023 15:30, Vincent Carey wrote:
Thanks Jim, I tend to agree with you.  Christian, I had a look at ArchR but 
could not tell where the system contacts the Bioc annotation elements.  Can 
you give some hints?  I'd like to be able to verify compatibility.

On Mon, Dec 11, 2023 at 9:19 AM James W. MacDonald <jmac...@uw.edu> wrote:
I don't believe a different OrgDb is required. The OrgDb package is meant to 
provide annotations for genes such as gene symbol or GO term, etc, which are 
orthogonal to the sequence of the genome, so the current version should suffice.

-----Original Message-----
From: Bioc-devel <bioc-devel-boun...@r-project.org> On Behalf Of Vincent Carey
Sent: Sunday, December 10, 2023 1:44 PM
To: Christian Arnold <chrarn...@web.de>
Cc: bioc-devel@r-project.org
Subject: Re: [Bioc-devel] Missing CHM13v2.0 TxDB and OrgDb objects

Good question.  I believe these will be forthcoming soon.  In the meantime you 
can create your own.  See, for example

https://github.com/vjcitn/BiocT2T/blob/devel/inst/scripts/makeTxDb.R

It's an active area so you can pull a gff file from 
https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=T2T/CHM13/assemblies/annotation/
and adjust the code noted above for the TxDb.

For the org.db I have to get back to you.

On Sun, Dec 10, 2023 at 12:06 PM Christian Arnold via Bioc-devel 
<bioc-devel@r-project.org> wrote:

> Hello, I am working with the new human T2T-CHM13v2.0  assembly and
> while a BSgenome package already exists
> (BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0), I could not find the
> corresponding TxDb and OrgDb packages. Is there any information when
> they may also become available so it is easier to work with the new
> genome for packages like ArchR, which support a custom genome but need
> these standard annotation packages for their creation?
>
>
> Thanks a lot for any information regarding this!
>
> Best, Christian

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
