Hi Peter,
Thank you for your very helpful answer. Seriously, it is rare to get such a good answer on such a topic. I actually read your response on academia.sx before I saw your email, and I should have guessed that such a good answer would have come from a Debian person. Also, I see you registered the same day as your answer. :-)
I'm keeping debian-devel and debian-med cc'd for now, because I do have some general questions about biological data licensing. If the lists want me to go away, just say so.
Since you posted your answer publicly, I'm assuming you don't mind if I quote it. I recommend you post your answer to the Debian lists, since there is no guarantee that academia.sx will be around forever.
See responses inline. I'm afraid there are a lot of questions, but I really can't pass up the opportunity to get some answers for once. Sorry about that. If you don't want to answer my questions (and let's face it, you probably don't), perhaps you can suggest some suitable mailing list(s)/forum(s)?
On Mon, 16 Sep 2013, Peter Rice wrote:
On 16/09/2013 11:31, Faheem Mitha wrote:
Hi,
This is really not Debian-related, except insofar as the software in question is something that might have been in Debian one day. I talked about that with people on debian-med recently. So, it is technically off-topic.
I posted a reply on stackexchange with instructions to get the data from the EBI SRS server.
However, I have run into this issue before in the context of biological database entries and Debian so it may be worth discussing here. There were objections to including SwissProt entries in the example data for the EMBOSS package because the licensing of SwissProt does not allow them to be edited. That was resolved by agreeing that scientific facts should not be edited so that the files could be accepted as part of a Debian package even though they could not be changed. A fine compromise I feel.
So, what license did these files go into Debian as?
regards, Peter Rice EMBOSS team
The copyright is probably on the full database release flatfile and the formatted entries ... you will find similar conditions for UniProt/SwissProt so it is not so unusual.
Yes, but I'm not trying to download their entire database, just a small portion of it.
The restrictions on scripts are common to prevent server performance hits from a large number of requests.
Is such a restriction legally enforceable? I don't see how one can distinguish between a human user downloading with, say, curl, and a script using curl with random pauses between downloads. Or is acceding to such a request just a matter of common courtesy?
You can simply invite reviewers to download the data from some other server, for example from the EBI SRS server. The URL for entry A00673 would be
"http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?[IMGTLIGM-ID:a00673]+-view+FastaSeqs+-ascii"
Wow, that works for me! Cool. I've tried before to download data from other biological data web services, but have always fallen down confused at the complexity of the sites and the multiplicity of their options. IMGT is practically the only such site I have found that I was able to navigate without getting brain fever.
http://www.ebi.ac.uk/miriam/main/collections/MIR:00000287
So, a few possibly dumb questions.
Question 1: Is there no general agreement on the licensing of biological data such as the kind we are talking about? This seems strange. Aren't such data biological "facts", as you put it in your message? To me, it makes about as much sense as treating the list of prime numbers, or any other such mathematical facts, as proprietary information. Specifically, I don't understand how IMGT can claim to own this data, to the extent of forbidding its redistribution. They didn't produce this data themselves, did they?
Question 2: It looks like EBI is hosting a copy of the IMGT database. Is that right? Also, there are a lot of different kinds of accession numbers. Which accession numbers is IMGT using here? Also, do you know of other servers that have the same data?
You can also use a list of accessions, for example A00673 or A01650
"http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?[IMGTLIGM-ID:a00673|a01650]+-view+FastaSeqs+-ascii"
If downloading many entries you should pause between requests, but putting lists into the URLs may reduce the number of requests to few enough not to cause a problem. I doubt EBI would be upset by 200 requests - they would be concerned about thousands.
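The batching and pausing advice above can be sketched in Python. This is only an illustration under stated assumptions: the function names, the batch size of 50, and the 1 to 3 second pause are made up for the example, the accessions are lower-cased simply to match the example URLs, and it assumes the SRS server still answers at the address quoted in this thread.

```python
import random
import time
from urllib.request import urlopen

# Base address taken from the example URLs in this thread.
SRS_BASE = "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"


def srs_fasta_url(accessions):
    """Build one wgetz URL for one or more IMGT/LIGM accessions."""
    ids = "|".join(a.lower() for a in accessions)
    return f"{SRS_BASE}?[IMGTLIGM-ID:{ids}]+-view+FastaSeqs+-ascii"


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_all(accessions, batch_size=50, min_pause=1.0, max_pause=3.0):
    """Fetch all accessions in a few list requests, pausing between them.

    batch_size and the pause range are arbitrary, hypothetical defaults.
    """
    chunks = []
    for i, batch in enumerate(batches(accessions, batch_size)):
        if i > 0:
            # Be polite to the server: wait a little between requests.
            time.sleep(random.uniform(min_pause, max_pause))
        with urlopen(srs_fasta_url(batch)) as response:
            chunks.append(response.read().decode("ascii", errors="replace"))
    return "".join(chunks)
```

With lists of this size, a couple of hundred accessions would need only a handful of requests, for example `fetch_all(["A00673", "A01650"])` issues a single one.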
This is *really* useful. I see each of these "list" requests produces one fasta file with multiple sequences in it. I think this would be a better way to go than producing hundreds of fasta files, each containing a single sequence, as I have been doing. Also, unlike IMGT, one just downloads a FASTA file directly, without having to trim off HTML stuff. I suspect that each request corresponds at the backend to a SQL query, and if so, I'm sure the system would prefer one larger SQL query to many small ones. Can one do the same trick with the IMGT servers?
In my case, I'm downloading gene segments which contain one or more Recombination Signal Sequences (RSS). I'm doing this for human RSS (63 segments) and mouse RSS (146 segments), so maybe it would make sense to download the human segments and the mouse segments as one fasta file each.
This might be a good place to ask:
Question 3: I'm using the files http://www.itb.cnr.it/rss/stats/HS12RSS.fasta and http://www.itb.cnr.it/rss/stats/MM12RSS.fasta as the source of the RSS. In each of these files, before the listing of each sequence, there is an annotation (I hope this is the right word). E.g. in HS12RSS.fasta there is
HPRT_12
CACACACACACACACACACACAAATACA
So, here for example "HPRT_12" is the annotation, but I have no idea what it refers to. In some cases I was able to look up these annotations at IMGT. For example, again in HS12RSS.fasta, there is
TRAJ3*01
CACTGTGGGTAAGGTCTTTGAGATAACC
and I was able to look up TRAJ3*01 in
http://www.imgt.org/IMGTrepertoire/LocusGenes/#h1_32
-> http://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=genetable&species=human&group=TRAJ
But in many cases I was not able to. Do you have any idea what those other strings refer to? Here are a couple more, also from HS12RSS.fasta.
LCK
CACACACACACACACACAAGCCAAAACC
LMO2
CACAGTATTGTCTTACCCAGCAATAATT
There are various fasta formats available for IMGT data, you need to find a server that produces fasta files compatible with your input requirements.
I thought there was one standard fasta file format.
Alternatively of course your reviewers could download the whole database from IMGT or any of the other servers (including ftp://ftp.ebi.ac.uk/pub/databases/imgt/) and generate their own fasta subset from the list of accessions/ids
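Generating a fasta subset from a list of accessions could look roughly like the following Python sketch. It assumes standard FASTA records whose header line starts with ">", and, as a labelled guess, that the accession is the first token of the header, delimited by whitespace or "|". Header layouts differ between servers, so that part would need adjusting; the function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def subset_by_accession(lines, wanted):
    """Keep only records whose leading header token is a wanted accession.

    Assumption: the accession is the first whitespace- or '|'-delimited
    token of the header. Matching is case-insensitive.
    """
    wanted = {w.upper() for w in wanted}
    for header, sequence in read_fasta(lines):
        tokens = header.replace("|", " ").split()
        if tokens and tokens[0].upper() in wanted:
            yield header, sequence
```

A reviewer could then run something like `subset_by_accession(open("imgt.fasta"), ["A00673", "A01650"])`, where `imgt.fasta` is a placeholder name for whatever file they downloaded.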
They could, but I don't see the point of it. The reviewers may not have any special interest in the data unless they happen to be biologists working with that sort of data, and will probably want to expend as little effort working with the data as they can.
Regards, Faheem