Hi Abhishek,
You have some interesting timing ...
I can give you the xml specifications that you require if you send me the
format of your dictionary.
Since you are new to the current dictionary module setup, I might also have a
simpler solution for you ...
A couple of days ago I checked a new module into Sandbox called
ctakes-dictionary-lookup2 (how novel a name). It is a complete replacement of
the current dictionary lookup module, but both can sit side-by-side in your
local trunk sandbox or build. It has an example descriptor that tells it to
read a bar-separated value file (BSV) as a dictionary, storing it (indexed) in
memory for fast lookup. There is an example dictionary and xml descriptor for
that dictionary. It accepts two- or three-column files in the format CUI|Text
or CUI|TUI|Text. It automatically detects the number of columns, but they must
be in that order. The text fields do not need to be tokenized in advance: it
accepts "Tumor, malignant" as well as "tumor , malignant", because it performs
the tokenization itself when it reads the file.
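To make the BSV format concrete, here is a minimal Python sketch of how such a
file could be read. It is an illustration only, not the module's code (cTAKES
itself is Java): the function names, the lowercasing, the simple regex
tokenizer, and the per-line column detection are all assumptions, and the CUI
and TUI values are made-up placeholders.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens.

    This is an assumed, simplified tokenizer used only for illustration.
    """
    return re.findall(r"\w+|[^\w\s]", text.lower())


def parse_bsv_line(line):
    """Parse one line of the form CUI|Text or CUI|TUI|Text.

    Returns (cui, tui, tokens); tui is None for a two-column line.
    """
    fields = [field.strip() for field in line.split("|")]
    if len(fields) == 2:
        cui, text = fields
        tui = None
    elif len(fields) == 3:
        cui, tui, text = fields
    else:
        raise ValueError("expected 2 or 3 bar-separated columns: %r" % line)
    return cui, tui, tuple(tokenize(text))


def load_bsv(lines):
    """Build an in-memory index from token tuples to (cui, tui) entries."""
    index = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        cui, tui, tokens = parse_bsv_line(line)
        index[tokens].append((cui, tui))
    return index


# Placeholder identifiers, not real UMLS codes.
sample = [
    "C0000001|Tumor, malignant",
    "C0000001|T000|tumor , malignant",
]
dictionary = load_bsv(sample)
```

Because both sample lines tokenize to the same three tokens, they end up under
the same key in this sketch, which is why pre-tokenized and untokenized text
can be mixed in one file.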
Because the dictionary will be stored in memory, it should not be huge. If you
do have a very large number of terms (more than 50k), then I recommend an hsql
db instead. The new module will take an hsql db with the fixed field names CUI,
TUI, RINDEX, TCOUNT, TEXT, RWORD. I will explain what those mean in some
documentation that I plan to check into sandbox later today, but I can help you
build an hsql dictionary db ...
Yesterday I checked into sandbox a project named "dictionarytool". It is
source-only, but I can give you a jar if you want one. Out of the box, it will
build various dictionaries from a UMLS download. It can build BSV, hsql (new
format), and hsql (current format) dictionaries for use by the new or current
dictionary lookup modules.
This devlist announcement is a little premature on my part. I will not get
usage documentation into sandbox for a day or two, but I can send you copies as
I go if you are in a hurry, or just give you xml snippets for the current
module descriptors. If you send the format of your dictionary then that can be
done quickly. I just wanted to let you know that there is another option for
dictionary lookup.
Sean
-----Original Message-----
From: Abhishek De [mailto:abhishek...@alumnux.com]
Sent: Friday, February 28, 2014 6:58 AM
To: dev@ctakes.apache.org
Subject: How to add a new dictionary database to cTAKES
Hi,
How do I add a new database to the cTAKES pipeline to perform lookup from? How
do I specify what columns to look up and how to annotate the text with the
returned hits? I have gone through the DictionaryLookupAnnotatorDB.xml and
LookupDesc_Db.xml files. However, I could not understand the meanings of the
terms like "lookupField", "metaField", "maxPermutationLevel" and
"exclusionTags". If I add a new database, I need to configure this xml file
properly. Please guide me regarding these problems.
Thanks and Regards,
Abhishek De