Re: Lucene: If I have picture, table, or something else in the PDF

2011-02-19 Thread Alexander Aristov
Your search engine would extract only the text content from a PDF file; all markup, pictures, etc. would be lost. So when you search, you would get only text, highlighted or not. Best Regards, Alexander Aristov On 18 February 2011 21:29, Gong Li wrote: > Hi, > > I am developing a PDF search engine,

Re: Splitting word tokens - other languages

2011-02-19 Thread Simon Willnauer
Hey, I am not an expert on this, but I think you should look into CJKAnalyzer / CJKTokenizer. simon On Thu, Feb 17, 2011 at 8:05 PM, CassUser CassUser wrote: > Hey all, > > I'm somewhat new to Lucene.  Meaning I used it some time ago for a parser we > wrote to tokenize a document into word grams.

Re: Lucene: If I have picture, table, or something else in the PDF

2011-02-19 Thread Simon Willnauer
Hi Gong Li, your question is out of scope for this mailing list. Thanks, simon On Fri, Feb 18, 2011 at 7:29 PM, Gong Li wrote: > Hi, > > I am developing a PDF search engine, locally. I have used API: pdfbox and > lucene. > > I must show the user the PDF page containing the keywords(if highlight

Re: About PDF+Lucene

2011-02-19 Thread Simon Willnauer
Hi Gong Li, your question is out of scope for this list. It seems like you can find your docs, and that is what Lucene does for you. PDF creation is entirely out of scope. simon On Sat, Feb 19, 2011 at 2:44 PM, Gong Li wrote: > Hi, > > I use PDFBOX to extract the text in the PDF and then use Lucene to

About PDF+Lucene

2011-02-19 Thread Gong Li
Hi, I use PDFBox to extract the text in the PDF and then use Lucene to index and search. In the end, I can find the context of the keyword, but only as a String. Question: I need to create a new PDF which contains the context of the keyword. The format is like the original one, but only contains the context

Re: Last/max term in Lucene 4.x

2011-02-19 Thread Jason Rutherglen
> Instead of docFreq, did you mean numUniqueTerms? Right. > But you have to > use a terms index impl that supports ord (eg FixedGap). Ok, and the VariableGap is the new standard because the FST is much more efficient as a terms index? Perhaps I'd need to create a codec (or patch the existing) t

Re: Last/max term in Lucene 4.x

2011-02-19 Thread Michael McCandless
I don't quite understand your question, Jason... Seeking to the first term of the field just gets you the smallest term (in unsigned byte[] order, i.e. Unicode order if the byte[] is UTF-8) across all docs. Instead of docFreq, did you mean numUniqueTerms? I.e., you want to seek to the largest term for
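
For readers following the term-ordering point above, here is a minimal sketch of what "unsigned byte[] order" means for UTF-8 terms. It is plain Java (9 or later) and does not use any Lucene API; the class name `TermOrderSketch` and its helper methods are hypothetical, chosen only for illustration, and the list of terms stands in for the terms of a field.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class, not part of Lucene.
class TermOrderSketch {

    // Compare two terms by their UTF-8 bytes, treating each byte as unsigned (0..255).
    static int compareTerms(String a, String b) {
        return Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_8),
                b.getBytes(StandardCharsets.UTF_8));
    }

    // The term a "seek to first" would land on under this ordering.
    static String smallestTerm(List<String> terms) {
        return Collections.min(terms, TermOrderSketch::compareTerms);
    }

    // The last/max term under the same ordering.
    static String largestTerm(List<String> terms) {
        return Collections.max(terms, TermOrderSketch::compareTerms);
    }

    public static void main(String[] args) {
        // "\u00e9clair" starts with the UTF-8 bytes 0xC3 0xA9, which are
        // negative as signed Java bytes but sort after 'z' (0x7A) when unsigned.
        List<String> terms = List.of("zebra", "\u00e9clair", "apple");
        System.out.println("smallest: " + smallestTerm(terms));
        System.out.println("largest:  " + largestTerm(terms));
    }
}
```

The non-ASCII term is included on purpose: with a signed byte comparison it would sort before "apple", whereas the unsigned comparison places it after "zebra", matching Unicode code point order.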