Hi Julian,

I am not certain that I have a better way, but we counted the CUIs in the
sources using a program in the UMLS-Interface package called findDFS.pl.
It takes a configuration file as input and returns some statistics about
the sources/relations defined in the file.

For example, for SNOMEDCT with the PAR/CHD relations, the program returns:

root : C0000000
max_depth : 29
avg_depth : 10.9971431860278
sd_depth  : 3.84534200775956
paths_to_root : 14656618
sources : SNOMEDCT
max_branch : 2643
avg_branch : 5.12090364789381
leaf_count : 202004
node_count : 84679
decendents : 286654
avg_leaf_depth : 11.7492425892557
min_leaf_depth : 3
max_leaf_depth : 29
avg_node_depth : 9.20299011561308
min_node_depth : 1
max_node_depth : 25


The node + leaf count should give the total number of 'nodes' (286,683 in
this case, which I think is close to yours, taking into account that this
is an earlier version).
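
As a rough sketch of that sum (this is not part of findDFS.pl; the helper
names parse_stats and total_nodes are made up for illustration), the
"key : value" lines of the output can be parsed and the two counts added:

```python
# Hypothetical helpers, not part of UMLS-Interface: parse "key : value"
# lines in the style printed by findDFS.pl and add node_count to leaf_count.

def parse_stats(text):
    """Return a dict of the 'key : value' pairs found in the text."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            stats[key.strip()] = value.strip()
    return stats


def total_nodes(stats):
    """Internal nodes plus leaves."""
    return int(stats["node_count"]) + int(stats["leaf_count"])


# A few lines copied from the SNOMEDCT output above.
SNOMEDCT_OUTPUT = """\
root : C0000000
max_depth : 29
leaf_count : 202004
node_count : 84679
"""

if __name__ == "__main__":
    print(total_nodes(parse_stats(SNOMEDCT_OUTPUT)))
```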

Here are the results for MSH with PAR/CHD:

root : C0000000
max_depth : 20
avg_depth : 9.3143175074184
sd_depth  : 1.99530674707347
paths_to_root : 76656
sources : MSH
max_branch : 172
avg_branch : 4.13110886417256
leaf_count : 18059
node_count : 8901
decendents : 26959
avg_leaf_depth : 9.819259095188
min_leaf_depth : 2
max_leaf_depth : 20
avg_node_depth : 8.28985507246377
min_node_depth : 1
max_node_depth : 15

The node + leaf count is 26,960 for 2013AA using findDFS.pl.

So, I agree the MSH numbers for 2013AB feel high given the above
results, since I would expect SNOMEDCT to have more nodes than MSH
because it is supposed to be the larger source.  I have yet to finish
loading the new version of the UMLS -- I am a little behind with that.
Hopefully I should be able to get it finished by the end of this week. I
will check the numbers then as well, both with mysql and findDFS.pl.
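
One thing that may be worth comparing when checking with mysql: the
MRCONSO query counts every CUI that has a string from the source, while
findDFS.pl only looks at what is defined by the PAR/CHD relations in the
configuration. Below is a minimal sketch of that comparison. It is only an
illustration under assumptions: it uses Python's sqlite3 with tiny made-up
tables instead of the real MySQL install, it creates only the MRCONSO and
MRREL columns it needs, and the two function names are hypothetical. I have
not verified that the second count matches what findDFS.pl reports.

```python
import sqlite3

# Toy stand-ins for MRCONSO and MRREL with made-up rows; only the
# columns used by the two queries below are created.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MRCONSO (CUI TEXT, SAB TEXT, STR TEXT);
CREATE TABLE MRREL (CUI1 TEXT, REL TEXT, CUI2 TEXT, SAB TEXT);
INSERT INTO MRCONSO VALUES
  ('C0000001', 'MSH', 'made-up term a'),
  ('C0000001', 'MSH', 'made-up synonym of a'),
  ('C0000002', 'MSH', 'made-up term b'),
  ('C0000003', 'MSH', 'made-up term with no PAR/CHD relation'),
  ('C0000004', 'SNOMEDCT', 'made-up term from another source');
INSERT INTO MRREL VALUES
  ('C0000001', 'CHD', 'C0000002', 'MSH'),
  ('C0000002', 'PAR', 'C0000001', 'MSH');
""")


def count_all_cuis(conn, sab):
    """Distinct CUIs with at least one string from the source."""
    query = "SELECT COUNT(DISTINCT CUI) FROM MRCONSO WHERE SAB = ?"
    return conn.execute(query, (sab,)).fetchone()[0]


def count_hierarchy_cuis(conn, sab):
    """Distinct CUIs on either side of a PAR/CHD relation from the source."""
    query = """
        SELECT COUNT(*) FROM (
            SELECT CUI1 AS CUI FROM MRREL
             WHERE SAB = ? AND REL IN ('PAR', 'CHD')
            UNION
            SELECT CUI2 AS CUI FROM MRREL
             WHERE SAB = ? AND REL IN ('PAR', 'CHD')
        )
    """
    return conn.execute(query, (sab, sab)).fetchone()[0]


if __name__ == "__main__":
    print(count_all_cuis(conn, "MSH"), count_hierarchy_cuis(conn, "MSH"))
```

If the first count is much larger than the second on the real tables, that
would suggest many of the CUIs are not in the PAR/CHD hierarchy at all.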

Does any of this help?

Let me know if you have any questions.

Thanks,

Bridget



On Thu, Jan 2, 2014 at 6:40 AM, julian varghese <[email protected]> wrote:

> Hi Bridget,
> in your paper from http://aclweb.org/anthology/N/N13/N13-3007.pdf
> you gave some numbers on how many concepts there are for different
> vocabularies (snomedct, msh, fma).
>
> I'm wondering about an appropriate SQL query to get the number of
> concepts from the table MRCONSO. I tried this query:
> Select Count(Distinct CUI) from umls.MRCONSO where SAB="SNOMEDCT"
> which gives around 320,000 elements, which sounds reasonable.
> However, if I do it for MSH I get 335642, which seems to be too high...
> Do you know any better way/query to count the distinct concepts?
>
> Thanks!
>
> Julian
>
> 2013/12/27 Bridget McInnes <[email protected]>
>
>>
>> Hi Julian,
>>
>> Thank you! I didn't know that mysql command. That will be very useful. I
>> have approximately 2.5 G in the umlsinterfaceindex database which contains
>> the index for SNOMEDCT and MSH using the PAR/CHD relations.
>>
>> I have not installed the 2013AB version and am uncertain whether there
>> has been an increase in the number of concepts. It seems that there must
>> be, given such an increase in the size of the index. I do remember seeing emails
>> about changes to SNOMEDCT but I did not look closely at them. I need to
>> upgrade to the newer version so I will look at installing this sooner
>> rather than later and send you the new numbers.
>>
>> I think using SNOMEDCT_US is okay -- the configuration file goes off of
>> what sources are installed on the local machine. I do not hard code those
>> anywhere. There must have been a name change in the sources for this
>> version.
>>
>> I will start installing the new version and get back to you.
>>
>> Thanks!
>>
>> Bridget
>>
>>
>> On Fri, Dec 27, 2013 at 9:52 AM, julian varghese <
>> [email protected]> wrote:
>>
>>> Hi Bridget,
>>> I installed UMLS version 2013AB, and when I used the configuration you
>>> listed, an error message came up along the lines of
>>> "SNOMEDCT is not listed in your umls view", so after having a look at the
>>> table MRSAB I changed it to:
>>>
>>>   SAB :: include SNOMEDCT_US
>>>   REL :: include PAR, CHD
>>>
>>
>>
>>> which works without any errors but takes a long time (it has been running
>>> for 4 days so far and has still not terminated...).
>>> Do you think "SNOMEDCT_US" is OK for getting the SnomedCT vocabulary?
>>> (As mentioned, "SNOMEDCT" on its own just gives me the error.)
>>>
>>> >> The size is approximately 2.5 G
>>> Does this refer to the umlsinterfaceindex database for the SNOMEDCT
>>> configuration alone? (In my case a database explicitly called
>>> umlsinterfaceindex is being created.)
>>> You can check the size of the databases easily via (just copy and
>>> paste):
>>>
>>> SELECT table_schema                                        databases,
>>>    Round(Sum(data_length + index_length) / 1024 / 1024, 1) "DB Size in MB"
>>> FROM   information_schema.tables
>>> GROUP  BY table_schema;
>>>
>>>
>>> Thanks,
>>> Julian
>>>
>>>
>>>  2013/12/27 Bridget McInnes <[email protected]>
>>>
>>>> Hi Julian,
>>>>
>>>> I have the UMLS version 2013AA installed. The index using the following
>>>> configuration contains 13,520,173 rows. The size is approximately 2.5 G
>>>> when I look specifically at the size of the tables for that configuration.
>>>> I don't think that would be a direct comparison though. I have a number of
>>>> indexes created, so I can only give what I think is an estimate --
>>>> in total I am using 575G.
>>>>
>>>> Here is the configuration:
>>>>   SAB :: include SNOMEDCT
>>>>   REL :: include PAR, CHD
>>>>
>>>> Just in case this is different from yours.
>>>>
>>>> I do not remember how long it took exactly. I started it on a Friday
>>>> and went away for the weekend. SNOMEDCT is the largest source so it
>>>> normally takes a couple days to process all of the nodes to create the
>>>> index.
>>>>
>>>> I hope this helps somewhat. Please let us know if you have additional
>>>> questions!
>>>>
>>>> Best regards,
>>>>
>>>> Bridget
>>>>
>>>>
>>>> On Fri, Dec 27, 2013 at 6:41 AM, julian varghese <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Bridget and Ted,
>>>>>
>>>>> my apologies if this question showed up before. I tried to add the
>>>>> question to the yahoo mailing group but I guess it did not work, so to
>>>>> make sure, I'm writing this question directly to you:
>>>>>
>>>>> I have installed UMLS-Similarity, the UMLS database from 2013, and
>>>>> the latest version of UMLS-Similarity/Interface on an Ubuntu system
>>>>> (8 GB allocated memory, Intel Xeon 3 GHz).
>>>>> The UMLS index for SnomedCT is being created (database:
>>>>> umlsinterfaceindex).
>>>>> To do that I listed "SNOMEDCT_US" in the SAB-source.
>>>>>
>>>>>
>>>>> Can somebody tell me how large the umlsinterfaceindex database for
>>>>> SnomedCT is or will be and how long it usually takes to have the
>>>>> SnomedCT-index created?
>>>>>
>>>>> Right now I have installed only the important UMLS tables, which take
>>>>> up 11 GB, but the umlsinterfaceindex database is getting larger and
>>>>> larger; it's now at 14 GB... I just want to know whether this is
>>>>> normal, since the index is getting bigger than the database itself...
>>>>>
>>>>> Thanks,
>>>>> Julian
>>>>>
>>>>
>>>>
>>>
>>
>
