[ https://issues.apache.org/jira/browse/SOLR-11640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eric Pugh reassigned SOLR-11640:
--------------------------------

    Assignee: Eric Pugh

> bin/post should obey fileType list in all modes
> -----------------------------------------------
>
>                 Key: SOLR-11640
>                 URL: https://issues.apache.org/jira/browse/SOLR-11640
>             Project: Solr
>          Issue Type: Bug
>          Components: documentation, scripts and tools
>    Affects Versions: 8.0
>            Reporter: Jason Gerlowski
>            Assignee: Eric Pugh
>            Priority: Trivial
>         Attachments: SOLR-11640.patch
>
>
> Currently, the QuickStart tutorial included in the ref guide involves running
> the following command to index some example documents:
> {{bin/post -c techproducts example/exampledocs/*}}
> This ends up attempting to index _all_ the files in that directory, which
> includes the expected example files, but also a bash script called
> {{test_utf8.sh}} and the {{post.jar}} JAR file itself.
> The subsequent tutorial step involves searching results, which can bring up
> the ugly result:
> {code}
> {
> "id":"/home/jason/checkouts/lucene-solr/solr/example/exampledocs/post.jar",
> "resourcename":"/home/jason/checkouts/lucene-solr/solr/example/exampledocs/post.jar",
> "content_type":["application/java-archive"],
> "content":[" \n \n \n \n \n \n \n \n \n \n \n
> META-INF/MANIFEST.MF \n Manifest-Version: 1.0\r\nAnt-Version: Apache Ant
> 1.9.6\r\nCreated-By: 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 (Oracle
> Corporati\r\n on)\r\nMain-Class: org.apache.solr.util.SimplePostTool\r\n\r\n
> \n\n \n \n org/apache/solr/util/RTimer$1.class \n package
> org.apache.solr.util;\n synchronized class RTimer$1 {\n}\n \n\n \n \n
> org/apache/solr/util/RTimer$NanoTimeTimerImpl.class \n package
> org.apache.solr.util;\n synchronized class RTimer$NanoTimeTimerImpl
> implements RTimer$TimerImpl {\n private long start ;\n private void
> RTimer$NanoTimeTimerImpl();\n public void start ();\n public double
> elapsed ();\n}\n \n\n \n \n org/apache/solr/util/RTimer$TimerImpl.class \n
> package org.apache.solr.util;\n public abstract interface RTimer$TimerImpl
> {\n public abstract void start ();\n public abstract double elapsed
> ();\n}\n \n\n \n \n org/apache/solr/util/RTimer.class \n package
> org.apache.solr.util;\n public synchronized class RTimer {\n public static
> final int STARTED = 0;\n public static final int STOPPED = 1;\n public
> static final int PAUSED = 2;\n protected int s
> ......[remaining code skipped for brevity]........"],
> "_version_":1583971861929132032},
> {code}
> It's honestly pretty cool that TIKA can extract code from our post.jar file.
> It makes sense, but I didn't expect it. But it's probably not what we
> intended to show to new users. Especially considering that the bin/post
> invocation in the quick-start tutorial claims to be choosy about what
> filetypes it will index:
> {code}
> Entering auto mode. File endings considered are
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> {code}
> From a quick glance at things, it looks like {{bin/post}} does pass a list of
> permissible filetypes to the underlying {{SimplePostTool}}, but that
> SimplePostTool doesn't follow this extension whitelist in the particular mode
> being invoked by the quickstart tutorial. So this is probably a wider bug,
> that the quickstart/tutorial just happens to expose.
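
The behavior the issue asks for, honoring the extension list no matter how the files were handed to the tool, can be illustrated with a minimal sketch. This is not SimplePostTool's code (which is Java) and not the attached patch; the function names `parse_file_types`, `is_allowed`, and `filter_by_file_type` are hypothetical, and the only detail taken from the issue is the list of file endings printed in the "Entering auto mode" message.

```python
from pathlib import Path

# File endings copied from the "Entering auto mode" message quoted in the issue.
DEFAULT_FILE_TYPES = (
    "xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,"
    "odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log"
)


def parse_file_types(spec):
    """Turn a comma-separated list such as "xml,json" into a set of lowercase endings."""
    return {part.strip().lower() for part in spec.split(",") if part.strip()}


def is_allowed(path, allowed):
    """Return True when the file's final extension is in the allowed set."""
    ending = Path(path).suffix.lower().lstrip(".")
    return ending in allowed


def filter_by_file_type(paths, spec=DEFAULT_FILE_TYPES):
    """Keep only the paths whose ending appears in the comma-separated spec.

    Hypothetical helper: the point is that the same check runs for every
    path, whether it came from a directory walk or from shell glob expansion.
    """
    allowed = parse_file_types(spec)
    return [p for p in paths if is_allowed(p, allowed)]


if __name__ == "__main__":
    # Paths of the kind the shell expands example/exampledocs/* into;
    # money.xml stands in for the expected example documents.
    candidates = [
        "example/exampledocs/money.xml",
        "example/exampledocs/post.jar",
        "example/exampledocs/test_utf8.sh",
    ]
    print(filter_by_file_type(candidates))
```

With a filter like this applied to explicitly listed files as well, `post.jar` and `test_utf8.sh` would be skipped because `jar` and `sh` are not among the listed endings. Whether skipping silently or printing a warning is preferable is a design choice the issue text leaves open.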