Maybe start out with Apache Tika to extract the text from the PDFs, then run
Apache cTAKES on the resulting text?
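
Neither tool applies the measure's selection rule for you, so a separate step would still be needed after extraction. Below is a rough Python sketch of that step, not a tested pipeline. It assumes each note already carries a visit date and an encounter type assigned upstream (for example from document metadata or section headers), and that readings are written like "BP 138/86". The Note class, the encounter labels, and the regular expression are all hypothetical and are not part of Tika or cTAKES.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple

# Hypothetical encounter labels to exclude, following the measure wording
# (inpatient stay, ED visit, diagnostic test, surgical procedure).
EXCLUDED = {"inpatient", "ed", "diagnostic_test", "surgical_procedure"}

# Assumed notation: "BP 138/86" or "Blood pressure: 120/80".
BP_PATTERN = re.compile(
    r"(?:BP|blood pressure)\D{0,10}(\d{2,3})\s*/\s*(\d{2,3})",
    re.IGNORECASE,
)


@dataclass
class Note:
    visit_date: date
    encounter_type: str  # assumed to be assigned by an upstream step
    text: str            # plain text, e.g. as extracted from the PDF


def latest_outpatient_bp(
    notes: List[Note], htn_diagnosis_date: date
) -> Optional[Tuple[date, int, int]]:
    """Return (visit date, systolic, diastolic) for the most recent
    eligible reading, or None if no eligible note contains one."""
    # Keeping only notes strictly after the diagnosis date is an assumption;
    # the measure text does not say how a same-day reading is treated.
    eligible = [
        n for n in notes
        if n.visit_date > htn_diagnosis_date
        and n.encounter_type not in EXCLUDED
    ]
    # Walk from newest to oldest and stop at the first note with a reading.
    for note in sorted(eligible, key=lambda n: n.visit_date, reverse=True):
        match = BP_PATTERN.search(note.text)
        if match:
            return note.visit_date, int(match.group(1)), int(match.group(2))
    return None
```

The hard part of the question, deciding whether a note belongs to an outpatient visit, is exactly what the encounter_type field hides here. How reliably that label can be derived from these documents is something to check on a sample first.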



On 6/25/17, 5:30 PM, "Hari, Sekhar" <sekhar.h...@cgi.com> wrote:

    Hello there -
    
    I have a task in hand to process 7,000,000 patient records (PDF files)
    containing different clinical documents. Each PDF has 20 pages, and one
    PDF = one patient.
    
    The information to retrieve from these documents, for the patient
    quality measure 'Controlling High Blood Pressure', is as follows:
    
    "Extract most recently documented blood pressure occurring after the
    diagnosis of hypertension (Do not use BP readings from inpatient stay, ED
    visit, diagnostic test, or surgical procedure). Blood pressure should be
    routinely assessed as part of a physical exam at each outpatient visit."
    
    Can cTAKES distinguish outpatient visits from non-outpatient visits?
    Are there specific pipelines that we should use to solve this problem?
    
    Many thanks,
    Sekhar H.
    

