Re: How to capture number of page e number of line in file pdf indexed?

2014-07-06 Thread Arlei Ferreira Farnetani Junior
50% completed... I managed to map the pages, and the position of the cut and capture content properly. Now we need to navigate back and capture the topics and subtopics. Ok...thanks... 2014-07-06 12:56 GMT-03:00 Erick Erickson : > This isn't a Solr problem, but a PDF problem. The Tika > projec

How to handle words that stem to stop words

2014-07-06 Thread Arjen van der Meijden
Hello list, We have a fairly large Lucene database for a 30+ million post forum. Users post and search for all kinds of things. To make sure users don't have to type exact matches, we combine a WordDelimiterFilter with a (Dutch) SnowballFilter. Unfortunately users sometimes find examples of

small segment sizes

2014-07-06 Thread Kireet Reddy
I am trying to understand why I am seeing very small segment sizes during indexing. I am using elasticsearch and one node sees heavy merge activity. After enabling info stream logs it seems that the node is doing more, smaller merges than the other nodes. In the TMP logs, I see a lot of merges o

Re: How to capture number of page e number of line in file pdf indexed?

2014-07-06 Thread Erick Erickson
This isn't a Solr problem, but a PDF problem. The Tika project is what's used to extract the PDF info, including a bunch of metadata. Tika uses PDFBox, which at least allows you to extract a page at a time and maybe much more (I just barely looked at the interface)... You can use Tika from a Java
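Once the text has been extracted one page at a time, as the reply above suggests, finding the page and line of a match is simple bookkeeping. The following is a minimal sketch in plain Python, not Tika or PDFBox API. It assumes the per-page text is already available as a list of strings; the function name `find_hits` and the sample pages are hypothetical.

```python
from typing import Iterator, List, Tuple


def find_hits(pages: List[str], needle: str) -> Iterator[Tuple[int, int]]:
    """Yield (page_number, line_number), both 1-based, for every line containing needle.

    `pages` is assumed to hold the extracted text of each PDF page, in order.
    Matching is a case-insensitive substring test.
    """
    needle_lower = needle.lower()
    for page_no, page_text in enumerate(pages, start=1):
        for line_no, line in enumerate(page_text.splitlines(), start=1):
            if needle_lower in line.lower():
                yield page_no, line_no


# Hypothetical sample: two pages of already-extracted text.
pages = [
    "Annual report\nPrepared by Maria Silva",
    "Appendix\nContacts\nMaria Silva, Joao Souza",
]
hits = list(find_hits(pages, "maria silva"))
```

Line numbers here follow the line breaks in the extracted text, which may not match the visual layout of the PDF.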

How to capture number of page e number of line in file pdf indexed?

2014-07-06 Thread Arlei Ferreira Farnetani Junior
I'm building a new system that will hold several PDF files. The content I need in my indexes is: 1. Name 2. No. of Pages 3. Data File 4. Archive. When I run a search in the system, I will type full names that are stored within the file in the index, then I need that syste

Re: Having problem with indexing/ searching with _ or -

2014-07-06 Thread Ganesh
Hi Smitha, You need to have your own custom analyzer which breaks the word by - or _. Use the same analyzer for indexing and searching. Regards Aditya www.findbestopensource.com On 7/4/2014 11:41 AM, Smitha Kuldeep (smtt) wrote: Hello team, We are using lucen-core-2.9.1.jar for indexing and
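The advice above (split on `-` or `_`, and use the same analysis at index time and at query time) can be illustrated with a small sketch. This is plain Python, not the Lucene `Analyzer` API; the function name `analyze` and the sample strings are hypothetical, and lowercasing is an added assumption.

```python
import re
from typing import List

# Split on whitespace, underscore, or hyphen (one or more in a row).
_SPLIT = re.compile(r"[\s_\-]+")


def analyze(text: str) -> List[str]:
    """Break text into lowercase tokens, treating '-' and '_' as separators."""
    return [token.lower() for token in _SPLIT.split(text) if token]


# The same function is applied to the indexed text and to the query,
# so a query for "LOG" matches the token produced from "error_log-2014".
indexed_tokens = analyze("error_log-2014 Report")
query_tokens = analyze("LOG")
matches = set(query_tokens) <= set(indexed_tokens)
```

If the two sides were analyzed differently, for example with the index keeping `error_log-2014` as a single token, the query token `log` would find nothing.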