These are already readable PDFs, not images. The clinical documents came to me as scanned images, which we then converted into readable PDFs using OCR. cTAKES is able to read the text. What I want to understand is whether it can distinguish a BP reading taken during an outpatient visit from one taken during a non-outpatient visit (such as an inpatient stay, ED visit, diagnostic test, or surgical procedure). The text is cluttered with different types of clinical documents (progress notes, radiology notes, H&P notes, etc.).
Thanks

Sekhar Hari | Program Lead Health Sciences Business Innovation ASDC CGI Health Solutions
Electronic City, Bangalore Karnataka, India 560100
814 7027 779 (C) 080 6642 2536 (D)

-----Original Message-----
From: Chris Mattmann [mailto:mattm...@apache.org]
Sent: 26 June 2017 10:03
To: dev@ctakes.apache.org; u...@ctakes.apache.org
Subject: Re: Visit segregation and extraction

Maybe start out with Apache Tika for text extraction from the PDFs, then run Apache cTAKES on the resultant text?

On 6/25/17, 5:30 PM, "Hari, Sekhar" <sekhar.h...@cgi.com> wrote:

Hello there - I have a task in hand to process 7,000,000 patient records (PDF files) containing different clinical documents. Each PDF has 20 pages and one PDF = one patient. The information to retrieve from these documents is like this for a patient quality measure namely 'Controlling High Blood Pressure' -

"Extract most recently documented blood pressure occurring after the diagnosis of hypertension (Do not use BP readings from inpatient stay, ED visit, diagnostic test, or surgical procedure). Blood pressure should be routinely assessed as part of a physical exam at each outpatient visit."

Can cTAKES identify non-outpatient visits and outpatient visits separately? Are there specific pipelines that we should use to solve this problem?

Many thanks,
Sekhar H.
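
For illustration only, the measure logic quoted above could be sketched roughly as follows. This is not a cTAKES pipeline and uses no cTAKES or Tika API; it is plain Python with keyword matching and a regular expression. It assumes the extracted text has already been split into individual notes, each with a header and a date, which is the hard part in practice. The keyword lists, the note dictionary layout, the function names, and the reading of "after the diagnosis" as "on or after the diagnosis date" are all assumptions made for this sketch. In particular, treating every "progress note" as outpatient unless the header says otherwise is a simplification.

```python
import re
from datetime import date

# Hypothetical keyword lists; real note headers will vary by source system.
NON_OUTPATIENT_KEYWORDS = (
    "emergency", "inpatient", "discharge", "operative", "radiology", "procedure",
)
OUTPATIENT_KEYWORDS = ("outpatient", "office visit", "clinic", "progress note")

# Matches readings such as "BP: 150/95" or "Blood pressure 138/88".
BP_PATTERN = re.compile(
    r"\b(?:BP|blood pressure)\s*[:=]?\s*(\d{2,3})\s*/\s*(\d{2,3})\b",
    re.IGNORECASE,
)


def classify_setting(header):
    """Label a note header as 'outpatient', 'non-outpatient', or 'unknown'."""
    text = header.lower()
    # Non-outpatient keywords are checked first so that a header like
    # "Inpatient Progress Note" is excluded.
    if any(keyword in text for keyword in NON_OUTPATIENT_KEYWORDS):
        return "non-outpatient"
    if any(keyword in text for keyword in OUTPATIENT_KEYWORDS):
        return "outpatient"
    return "unknown"


def latest_outpatient_bp(notes, htn_diagnosis_date):
    """Return (note_date, systolic, diastolic) for the most recent outpatient
    reading on or after the diagnosis date, or None if there is none."""
    candidates = []
    for note in notes:
        if note["date"] < htn_diagnosis_date:
            continue  # reading predates the hypertension diagnosis
        if classify_setting(note["header"]) != "outpatient":
            continue  # skip ED, inpatient, procedure, and unclassified notes
        for match in BP_PATTERN.finditer(note["text"]):
            candidates.append(
                (note["date"], int(match.group(1)), int(match.group(2)))
            )
    if not candidates:
        return None
    return max(candidates, key=lambda reading: reading[0])


# Made-up sample notes for one patient (not real data).
sample_notes = [
    {"date": date(2015, 5, 1), "header": "Progress Note", "text": "BP 160/100"},
    {"date": date(2016, 1, 10), "header": "Progress Note", "text": "BP: 150/95"},
    {"date": date(2016, 2, 15), "header": "Office Visit", "text": "Blood pressure 138/88"},
    {"date": date(2016, 3, 2), "header": "Emergency Department Note", "text": "BP 180/110"},
]
result = latest_outpatient_bp(sample_notes, date(2016, 1, 1))
print(result)
```

With the sample notes, the ED reading from March is skipped even though it is the most recent, and the 2015 reading is skipped because it predates the assumed diagnosis date. Notes whose headers match neither list are left out rather than guessed at, so they could be routed to manual review.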