Your search engine would extract the text content from a PDF file, and all
markup, pictures, etc. would be lost. So when you search, you would get
only text, highlighted or not.
Best Regards
Alexander Aristov
On 18 February 2011 21:29, Gong Li wrote:
> Hi,
>
> I am developing a PDF search engine,
Hey,
I am not an expert on this, but I think you should look into
CJKAnalyzer / CJKTokenizer.
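
For readers unfamiliar with it: CJKTokenizer emits every two adjacent CJK
characters as an overlapping token. The plain-Java sketch below only
illustrates that bigram idea. It does not use Lucene, and the class and
method names (OverlappingBigrams, bigrams) are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlappingBigrams {

    // Returns every pair of adjacent code points as one token, with overlap.
    // Hypothetical helper for illustration; this is not Lucene's tokenizer.
    static List<String> bigrams(String text) {
        int[] codePoints = text.codePoints().toArray();
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < codePoints.length; i++) {
            tokens.add(new String(codePoints, i, 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Four CJK characters (U+5168 U+6587 U+691C U+7D22), written as
        // escapes so the source compiles regardless of file encoding.
        // Four input characters yield three overlapping bigrams.
        System.out.println(bigrams("\u5168\u6587\u691c\u7d22"));
    }
}
```

A query is split the same way, so a phrase matches when its bigrams match.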
simon
On Thu, Feb 17, 2011 at 8:05 PM, CassUser CassUser wrote:
> Hey all,
>
> I'm somewhat new to Lucene. Meaning I used it some time ago for a parser we
> wrote to tokenize a document into word grams.
Hi Gong Li,
Your question is out of scope for this mailing list.
Thanks,
simon
On Fri, Feb 18, 2011 at 7:29 PM, Gong Li wrote:
> Hi,
>
> I am developing a local PDF search engine. I have used the PDFBox and
> Lucene APIs.
>
> I must show the user the PDF page containing the keywords (if highlight
Hi Gong Li,
Your question is out of scope for this list. It seems like you can find
your docs - this is what Lucene does for you. PDF creation is entirely
out of scope.
simon
On Sat, Feb 19, 2011 at 2:44 PM, Gong Li wrote:
> Hi,
>
> I use PDFBox to extract the text in the PDF and then use Lucene to
Hi,
I use PDFBox to extract the text in the PDF and then use Lucene to index and
search. Finally, I can find the context of the keyword, but only as a String.
Question: I need to create a new PDF which contains the context of the
keyword. The format is like the original one, but only contains the context
> Instead of docFreq, did you mean numUniqueTerms?
Right.
> But you have to
> use a terms index impl that supports ord (eg FixedGap).
Ok, and the VariableGap is the new standard because the FST is much
more efficient as a terms index? Perhaps I'd need to create a codec
(or patch the existing) t
I don't quite understand your question Jason...
Seeking to the first term of the field just gets you the smallest term
(in unsigned byte[] order, ie Unicode order if the byte[] is UTF-8)
across all docs.
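
As a rough illustration of that ordering, here is a minimal plain-Java
sketch that compares UTF-8 encoded terms byte by byte as unsigned values
and picks the smallest one. It does not touch Lucene's terms dictionary.
The names UnsignedTermOrder, compareUnsigned and smallestTerm are
hypothetical and exist only for this example.

```java
import java.nio.charset.StandardCharsets;

public class UnsignedTermOrder {

    // Compares two byte arrays lexicographically, treating each byte as
    // an unsigned value in 0..255 (Java bytes are signed by default).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        // A term that is a prefix of another sorts first.
        return a.length - b.length;
    }

    // Returns the term whose UTF-8 bytes sort first, or null if no terms
    // are given. Hypothetical stand-in for "the first term of the field".
    static String smallestTerm(String... terms) {
        byte[] best = null;
        for (String term : terms) {
            byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
            if (best == null || compareUnsigned(bytes, best) < 0) {
                best = bytes;
            }
        }
        return best == null ? null : new String(best, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u00e9clair" starts with byte 0xC3 in UTF-8. Read as a signed
        // byte it is negative and would sort before "apple"; read as
        // unsigned it sorts after "zebra".
        System.out.println(smallestTerm("zebra", "\u00e9clair", "apple"));
    }
}
```

The masking with 0xFF is the important part: without it, any term starting
with a non-ASCII character would wrongly sort ahead of all ASCII terms.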
Instead of docFreq, did you mean numUniqueTerms? Ie, you want to seek
to the largest term for