About 50% done... I managed to map the pages and the positions where the content is cut and captured. The next step is to navigate backwards from each occurrence and capture the topics and subtopics.
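A minimal sketch of that backward navigation, assuming the text has already been extracted page by page (for example with PDFBox, as suggested in the quoted reply) into a list of pages, each one a list of lines. The names `TopicLocator` and `Hit`, and the two predicates that decide what counts as a topic or subtopic line, are hypothetical; none of this is Lucene, Tika or PDFBox API, and the real heading rules depend on the documents.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper: not part of Lucene, Tika or PDFBox.
final class TopicLocator {

    /** One occurrence. page and line are 1-based; column is a 0-based offset in the line. */
    static final class Hit {
        final int page, line, column;
        final String topic, subtopic, block;

        Hit(int page, int line, int column, String topic, String subtopic, String block) {
            this.page = page; this.line = line; this.column = column;
            this.topic = topic; this.subtopic = subtopic; this.block = block;
        }
    }

    /** Returns the first occurrence of name, or null if it is not found. */
    static Hit locate(List<List<String>> pages, String name,
                      Predicate<String> isTopic, Predicate<String> isSubtopic) {
        for (int p = 0; p < pages.size(); p++) {
            List<String> lines = pages.get(p);
            for (int l = 0; l < lines.size(); l++) {
                int column = lines.get(l).indexOf(name);
                if (column < 0) continue;
                String[] h = headingsBefore(pages, p, l, isTopic, isSubtopic);
                return new Hit(p + 1, l + 1, column, h[0], h[1], blockAround(lines, l));
            }
        }
        return null;
    }

    /** Walks backwards from the hit, crossing page boundaries, until a topic line is met. */
    static String[] headingsBefore(List<List<String>> pages, int hitPage, int hitLine,
                                   Predicate<String> isTopic, Predicate<String> isSubtopic) {
        String topic = null, subtopic = null;
        for (int p = hitPage; p >= 0 && topic == null; p--) {
            List<String> lines = pages.get(p);
            int start = (p == hitPage) ? hitLine - 1 : lines.size() - 1;
            for (int i = start; i >= 0; i--) {
                String text = lines.get(i).trim();
                if (isTopic.test(text)) { topic = text; break; }
                // Only the first subtopic met (the one nearest the hit) is kept.
                if (subtopic == null && isSubtopic.test(text)) subtopic = text;
            }
        }
        return new String[] {topic, subtopic};
    }

    /** Returns the run of non-blank lines, on the hit's page, that contains the hit line. */
    static String blockAround(List<String> lines, int hitLine) {
        int from = hitLine, to = hitLine;
        while (from > 0 && !lines.get(from - 1).trim().isEmpty()) from--;
        while (to < lines.size() - 1 && !lines.get(to + 1).trim().isEmpty()) to++;
        return String.join("\n", lines.subList(from, to + 1));
    }

    public static void main(String[] args) {
        // Toy data shaped like the example in the quoted message.
        List<List<String>> pages = Arrays.asList(
            Arrays.asList("", "TTTTT", "", "YYYY", "", "XXXX", ""),
            Arrays.asList("unrelated page"),
            Arrays.asList("", "first line", "second line", "ADV: JOHN MCLAEN (OAB)", "last line", ""));
        Hit hit = locate(pages, "JOHN MCLAEN",
            s -> s.equals("TTTTT"), s -> s.equals("XXXX") || s.equals("YYYY"));
        System.out.println("Page " + hit.page + ", line " + hit.line + ", column " + hit.column);
        System.out.println("Topic " + hit.topic + ", subtopic " + hit.subtopic);
        System.out.println(hit.block);
    }
}
```

The sketch stops at the first topic line it meets, so a subtopic is only reported if it lies between that topic and the occurrence. The block is limited to the page of the occurrence; a block that spans a page break would need the same cross-page walk as the headings.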
Ok...thanks...

2014-07-06 12:56 GMT-03:00 Erick Erickson <erickerick...@gmail.com>:

> This isn't a Solr problem, but a PDF problem. The Tika
> project is what's used to extract the PDF info, including
> a bunch of metadata.
>
> Tika uses PDFBox, which at least allows you to
> extract a page at a time and maybe much more (I just
> barely looked at the interface)...
>
> You can use Tika from a Java program and send the
> doc to Solr, here's a place to get started:
> http://searchhub.org/2012/02/14/indexing-with-solrj/
>
> But the bottom line here is you'll have to do the
> extraction & etc yourself, build up the information you
> need to identify pages of your text and go from there.
> There's nothing OOB that does what you want.
>
> Best,
> Erick
>
> On Sun, Jul 6, 2014 at 7:28 AM, Arlei Ferreira Farnetani Junior
> <farnet...@gmail.com> wrote:
> > I'm building a new system that will hold several PDF files.
> >
> > The content I need to have in my indexes is:
> > 1. Name
> > 2. No. of Pages
> > 3. Data File
> > 4. Archive
> >
> > When I run a search in the system, I will type full names that are
> > stored within the files in the index, and I need the system to return:
> >
> > - All the variables above (file name, file date) and especially the page
> > number where the occurrence happened, the line number and, if possible,
> > the exact position in the line where the occurrence starts.
> >
> > I need this because, from the occurrence, I have to go back looking for
> > words that identify topics and subtopics: traversing the file line by
> > line backwards lets me identify the first subtopic and capture it, and
> > then do the same when I find the topic. The subtopic and the topic will
> > not always be on the same page as the occurrence.
> >
> > example:
> >
> > Document: 00001.pdf
> >
> > *page 115*
> > Line 1:
> > Line 2: *TTTTT* - TITLE occurrence (will be captured as the first
> > occurrence of a title)
> > Line 3:
> > Line 4: YYYY - SECOND SUBTITLE (will be ignored because the system will
> > have already caught the first subtopic in line 6)
> > Line 5:
> > Line 6: *XXXX* - First subtitle (will be captured as the first occurrence
> > of the sought caption)
> > Line 7:
> > ... page ...116
> > ... page ...121
> > *page 122*
> > Line 1: line break
> > Line 2: Content pertaining to occurrence ...
> > Line 3: content from occurrence ...
> > Line 4: FOUND TO OCCUR FOR EXAMPLE: *JOHN MCLAEN*
> > Line 5: content from occurrence ...
> > Line 6: line break
> > Line 7:
> >
> > The big problem is that I do not know how to obtain the page number and
> > the line number. Is there any functionality for this when I convert the
> > PDF file to a String in the index, or will I have to store the file in
> > the Lucene index line by line, somehow recording the page each line
> > belongs to?
> >
> > In the example above, I need the system to return:
> >
> > 1 occurrence on page 122 with topic = TTTTT and subtopic = XXXX, with
> > all the content that is before the name *JOHN MCLAEN* until the line
> > break.
> >
> > That is, a string containing the result of the occurrence, starting at
> > line 2 (after the line break) on page 122 and ending the block at
> > line 5 (before the line break).
> >
> > *Example of result:*
> >
> > --------------------------------------------------------------------
> > *Page: 122 - File: 00001.pdf*
> > *TOPIC: TTTTT*
> > *SUB-TOPIC: XXXX*
> >
> > Processo 0001933-62.2000.8.26.0081 (001.01.2000.001933) - Procedimento
> > Ordinário - Contratos Bancários - Auto Posto Murillo Ltda - - Murillo
> > Jaccoud - - Murillo Jaccoud Junior - Banco Santander (brasil) Sa - Fica o
> > executado Banco SantanderS/A devidamente intimado através de seu
> > advogado a efetuar o pagamento do valor de R$ 90.200,42 (noventa mil,
> > duzentos reais e quarenta e dois centavos) no prazo de 15 dias, sob pena
> > de multa de 10%, nos termos do artigo 475-J. - ADV: *JOHN MCLAEN* (OAB
> > 103587/SP), MARISA REGINA AMARO MIYASHIRO (OAB 121739/SP), RODRIGO JARA
> > (OAB 275050/SP)
> > --------------------------------------------------------------------
> >
> > Is this possible?
> >
> > Any help or hint will be of great value.
> >
> > Thank you very much.