Hi Malcolm,

Not much here that hasn't been mentioned before, so I bet many people can skip this one.

GFF and GTF files raise many of the same issues when you use them. Both are used today for things like transcriptomes, which is a pretty specific use case. But both formats were designed a while ago, so some kinds of information that are crucial for a transcriptome (like exon rank) are still optional when writing a GFF or GTF file. Also, because the specifications are so flexible and general, a file can be either overly sparse or overly loaded with stuff you don't need, depending on what you plan to use it for. So it is entirely possible that the Ensembl file is smaller and still contains what you need. Or it might not be smaller. You will simply have to check and see how it compares.

If you are using my function makeTranscriptDbFromGFF() from the GenomicFeatures package, it will try to check that the file has all the required information as it processes it into a TranscriptDb object. The only thing you really have to be "extra careful" about is the exon rank attribute. The function can "guess" at that information for you, but I am betting you don't want that if you can avoid it (which is why you get a warning when it happens). So for these data, you really want to point the function at an attribute that holds that information, if there is one.

Besides files with too much or too little information, you will sometimes see a file formatted in some peculiar way that requires you to translate it into a more typical looking GFF or GTF file first. This happens because, as I mentioned above, the formats are fairly general and open to interpretation by whoever writes them out. I think the most important piece of advice is to always look at a GFF or GTF file yourself before you try to use it, because otherwise you can't really be sure what kind of information is in there.
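
As a rough illustration of what "looking at the file" can mean, here is a minimal Python sketch (not part of GenomicFeatures) that tallies the feature types in a GFF or GTF file and the attribute keys seen for each type. The function name summarize_annotation, the sample records, and the "rank" attribute are all made up for the example; a real FlyBase or Ensembl file may use different attribute names, or none at all, for exon rank.

```python
import io
import re
from collections import Counter, defaultdict

# Hypothetical sample records in GFF3 style; the IDs and the "rank"
# attribute are invented for this illustration.
SAMPLE = (
    "##gff-version 3\n"
    "2L\tFlyBase\tgene\t100\t900\t.\t+\t.\tID=FBgn0000001;Name=demo\n"
    "2L\tFlyBase\tmRNA\t100\t900\t.\t+\t.\tID=FBtr0000001;Parent=FBgn0000001\n"
    "2L\tFlyBase\texon\t100\t300\t.\t+\t.\tID=exon1;Parent=FBtr0000001\n"
    "2L\tFlyBase\texon\t500\t900\t.\t+\t.\tID=exon2;Parent=FBtr0000001;rank=2\n"
)

# The attribute key is the first run of characters that are not
# whitespace, '=' or '"'. This covers both GFF3 (key=value) and
# GTF (key "value") entries.
KEY_PATTERN = re.compile(r'([^\s="]+)')


def summarize_annotation(lines):
    """Count feature types and the attribute keys seen for each type.

    Simplification: attributes are split on ';', so a semicolon inside
    a quoted GTF value would be split incorrectly.
    """
    feature_counts = Counter()
    attribute_keys = defaultdict(Counter)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; stop here
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a standard nine-column record
        feature = fields[2]
        feature_counts[feature] += 1
        for entry in fields[8].split(";"):
            match = KEY_PATTERN.match(entry.strip())
            if match:
                attribute_keys[feature][match.group(1)] += 1
    return feature_counts, attribute_keys


counts, keys = summarize_annotation(io.StringIO(SAMPLE))
for feature, n in counts.items():
    print(feature, n, dict(keys[feature]))
```

For a real file, pass an open file handle instead of the StringIO object (or a gzip.open(path, "rt") handle for a compressed file). If the exon rows show no rank-like key, or show one on only some rows as in the sample above, that is the situation where the exon rank would have to be guessed.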

The bottom line is that both Ensembl and FlyBase are reputable places to get data from. But because they are different places, they may produce dramatically different looking GFF or GTF files.


Related to this, please be sure to use the very latest version of makeTranscriptDbFromGFF() from the devel branch, as I have made some performance improvements since the release.


I hope this helps,



  Marc




On 02/11/2013 03:13 PM, Cook, Malcolm wrote:
Marc et al.,

A colleague of mine (cc:ed) is experiencing memory bloat using 
makeTranscriptDbFromGFF on dmel GFF from Flybase.org.

I told him of my success in using Ensembl's GTF-ization but that I would check 
in with you (et al).

So....

Do you have any advice/warnings/gotchas/toldyasos/caveats re: applying 
makeTranscriptDbFromGFF to FlyBase?

Thanks!

Cheers,

Malcolm


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
