vivek joshi created TIKA-1227:
---------------------------------

             Summary: Apache Tika 1.4 Duplicate extract data
                 Key: TIKA-1227
                 URL: https://issues.apache.org/jira/browse/TIKA-1227
             Project: Tika
          Issue Type: Bug
          Components: general
    Affects Versions: 1.4
         Environment: Ubuntu12.04, Python 2.7, Apache Tika 1.4
            Reporter: vivek joshi

When extracting text with Apache Tika 1.4, the extracted text is duplicated.

    APACHE_TIKA_PATH = os.path.abspath(os.path.join(PROJECT_ROOT, 'apache_tika/tika-app-1.4.jar'))
    sout = subprocess.check_output("java -jar %s -t %s" % (APACHE_TIKA_PATH, document), shell=True)

sout contains duplicate text. The issue occurs for both DOC and PDF files.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
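
For anyone trying to reproduce this, the following is a minimal sketch, not part of the original report. It runs the same `java -jar ... -t` command with an argument list instead of `shell=True`, and adds a rough check for the simplest form of duplication, where the whole output is one block of lines repeated twice. The helper names `extract_text` and `is_doubled` are hypothetical, and decoding the output as UTF-8 is an assumption about the platform. The report does not say what form the duplication takes, so this check may miss partial or reordered repeats.

```python
import subprocess


def extract_text(tika_jar, document):
    """Run tika-app in plain-text mode (-t) and return its stdout as text.

    Hypothetical helper: tika_jar is the path to tika-app-1.4.jar and
    document is the path of the file to extract.
    """
    # An argument list avoids shell quoting problems with paths that
    # contain spaces or other special characters.
    out = subprocess.check_output(["java", "-jar", tika_jar, "-t", document])
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return out.decode("utf-8", "replace")


def is_doubled(text):
    """Return True if the non-blank lines are one block repeated exactly twice."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    half, odd = divmod(len(lines), 2)
    # Need an even, non-zero number of lines whose two halves match.
    return half > 0 and odd == 0 and lines[:half] == lines[half:]
```

With these helpers, `is_doubled(extract_text(APACHE_TIKA_PATH, document))` would give a quick yes or no for a given file. Attaching a sample document and the raw output to the issue would still be more useful than the check alone.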