vivek joshi created TIKA-1227:
---------------------------------

             Summary: Apache Tika 1.4 Duplicate extract data
                 Key: TIKA-1227
                 URL: https://issues.apache.org/jira/browse/TIKA-1227
             Project: Tika
          Issue Type: Bug
          Components: general
    Affects Versions: 1.4
         Environment: Ubuntu 12.04, Python 2.7, Apache Tika 1.4
            Reporter: vivek joshi


When extracting text with Apache Tika 1.4, the extracted text is duplicated.

import os
import subprocess

APACHE_TIKA_PATH = os.path.abspath(os.path.join(PROJECT_ROOT, 'apache_tika/tika-app-1.4.jar'))

sout = subprocess.check_output("java -jar %s -t %s" % (APACHE_TIKA_PATH, document), shell=True)

sout contains duplicate text.

The issue occurs for both DOC and PDF files.
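
For anyone trying to reproduce this, the following is a minimal sketch of how the duplication might be checked programmatically. It assumes the same `java -jar ... -t` invocation as above; the helper names (build_tika_command, extract_text, repeated_lines) are hypothetical and not part of Tika. Passing the arguments as a list avoids the shell, so paths containing spaces are handled safely.

```python
import subprocess


def build_tika_command(jar_path, document):
    """Return the argument list for `java -jar <jar> -t <document>`."""
    return ["java", "-jar", jar_path, "-t", document]


def extract_text(jar_path, document):
    """Run tika-app (requires java on PATH) and return its stdout as text."""
    out = subprocess.check_output(build_tika_command(jar_path, document))
    return out.decode("utf-8", "replace")


def repeated_lines(text):
    """Return the non-blank lines that occur more than once, in first-seen order."""
    counts = {}
    order = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line not in counts:
            order.append(line)
        counts[line] = counts.get(line, 0) + 1
    return [line for line in order if counts[line] > 1]
```

Calling repeated_lines(extract_text(APACHE_TIKA_PATH, document)) would then list the lines that appear more than once. Lines that legitimately repeat in the source document will also be reported, so this is only a rough indicator of the problem.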





--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
