Which analyzer to use for non-english unicoded text?

2009-05-22 Thread KK
Hi All, I've been trying to index some non-English text [Indian languages] in Unicode (UTF-8). For all these languages we don't have any stemmers or tokenizers etc. To keep the searching simple I'd like to be able to do exact word searches/matches as a first step. I'd like to know which will be the simpl
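
As a rough illustration of what "exact word matches" on whitespace-separated Unicode text amounts to, here is a minimal sketch in plain Java. It is not Lucene's API: the class and method names are hypothetical, and it only mimics the idea behind a whitespace-based analyzer such as Lucene's WhitespaceAnalyzer (split on white space, no stemming, no lowercasing). The sample words are Devanagari written as Unicode escapes so the source stays encoding-safe.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Illustrative only: plain Java, not Lucene. All names here are hypothetical.
class WhitespaceExactMatch {
    // With UNICODE_CHARACTER_CLASS, the whitespace class covers any Unicode white space, not just ASCII.
    private static final Pattern WHITESPACE =
            Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Split text into tokens on runs of white space; tokens are kept exactly as written.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : WHITESPACE.split(text)) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    void add(int docId, String text) {
        for (String token : tokenize(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Exact lookup: a word matches only if the whole token is identical.
    Set<Integer> search(String word) {
        return postings.getOrDefault(word, new TreeSet<>());
    }

    public static void main(String[] args) {
        WhitespaceExactMatch index = new WhitespaceExactMatch();
        // Document 0 reads "bharat ek desh hai", document 1 reads "ek din".
        index.add(0, "\u092D\u093E\u0930\u0924 \u090F\u0915 \u0926\u0947\u0936 \u0939\u0948");
        index.add(1, "\u090F\u0915 \u0926\u093F\u0928");
        System.out.println(index.search("\u090F\u0915")); // "ek" occurs in both documents
        System.out.println(index.search("\u0926\u0947")); // only a prefix of "desh": no exact match
    }
}
```

The point of the sketch is that no language-specific stemmer or tokenizer is involved, which is why this approach is a reasonable first step for languages that lack them.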

semantic search event

2009-05-22 Thread AJ Chen
Semantic search is a nice addition to full-text search. Recently, there has been a lot of development in applying semantics to optimize search engines, including the much-hyped launch of Wolfram Alpha. Semantic search will be one of the main themes at this year's Semantic Technology Conference, June 14-18,

Re: Term frequencies within a search

2009-05-22 Thread Robert Young
For all the docs, and in fact, I think it might be the document frequency. Basically I need to be able to do a query and get a list of terms with how many documents in the result set contain that term. I'm not so worried about how often the term appears in each document. Thanks Rob On Thu, May 21
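
To make the request concrete, here is a minimal sketch of the counting itself in plain Java, assuming the terms of each matching document are already available as a list. It does not use Lucene; the class and method names are hypothetical, and how the terms are obtained from the index is left open.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: counts, per term, how many documents in a result set contain it.
class ResultSetTermCounts {
    static Map<String, Integer> documentFrequencies(List<List<String>> resultDocs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> docTerms : resultDocs) {
            // De-duplicate within one document so a repeated term is counted once per document.
            for (String term : new HashSet<>(docTerms)) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> results = Arrays.asList(
                Arrays.asList("lucene", "index", "lucene"),
                Arrays.asList("lucene", "search"),
                Arrays.asList("index"));
        System.out.println(documentFrequencies(results)); // {index=2, lucene=2, search=1}
    }
}
```

Note that "lucene" appears twice in the first document but still contributes only one to its count, which matches the stated need: documents containing the term, not occurrences.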

Re: Parsing large xml files

2009-05-22 Thread Michael Wechner
crack...@comcast.net wrote: once you get comfortable with vtd-xml, few people will ever get back to DOM and SAX... maybe you want to consider contributing a vtd-xml based parsing implementation to Lucene ;-) Thanks Michael - Original Message - From: "Sithu D. Sudarsan" To:

Re: Parsing large xml files

2009-05-22 Thread crackeur
once you get comfortable with vtd-xml, few people will ever get back to DOM and SAX... - Original Message - From: "Sithu D. Sudarsan" To: java-user@lucene.apache.org Sent: Friday, May 22, 2009 6:39:33 AM GMT -08:00 US/Canada Pacific Subject: RE: Parsing large xml files Thanks every

Re: Parsing large xml files

2009-05-22 Thread prasanna pradhan
We had a similar problem where we had to parse 1 GB XML files. Better to transform them to an array-like format such as JSON and write a custom search API using Lucene. On Thu, May 21, 2009 at 8:12 PM, Sudarsan, Sithu D. < sithu.sudar...@fda.hhs.gov> wrote: > > Hi, > > While trying to parse xml documents of about 50MB siz

Re: Searching index problems with tomcat

2009-05-22 Thread Marco Lazzara
Thanks a lot. But now I'm going to work (waiter). When I come back I'll do that immediately. Thanks again. You are so kind. 2009/5/22 Matthew Hall > humor me. > > Open up your indexing software package. > > Step 1: In all places where you reference your index, replace whatever the > heck you have

Re: Searching index problems with tomcat

2009-05-22 Thread Matthew Hall
humor me. Open up your indexing software package. Step 1: In all places where you reference your index, replace whatever the heck you have there with the following EXACT STRING: /home/marco/testIndex Do not leave off the leading slash. After you have made these changes to the indexing softw

Re: Searching index problems with tomcat

2009-05-22 Thread Marco Lazzara
I don't know how to solve the problem. I've tried all the rational things. Maybe the last thing is to try to index not with FSDirectory but with something else. I have to peruse the API documentation. But WHAT IF IT IS A LUCENE BUG??? 2009/5/22 Matthew Hall > because that's the default index write

Re: Searching index problems with tomcat

2009-05-22 Thread Matthew Hall
because that's the default index write behavior. It will create any directory that you ask it to. Matt Marco Lazzara wrote: OK. I understand what you really mean, but it doesn't work. I understand one thing. For example, when I try to open an index in the following location: "RDFIndexLucene/" but

Re: Searching index problems with tomcat

2009-05-22 Thread Marco Lazzara
OK. I understand what you really mean, but it doesn't work. I understand one thing. For example, when I try to open an index in the following location: "RDFIndexLucene/" but the folder doesn't exist, *Lucene creates an empty folder named "RDFIndexLucene"* in my home folder... WHY??? MARCO LAZZARA 2009/

RE: Searching index problems with tomcat

2009-05-22 Thread Digy
home/marco/RdfIndexLucene and media/disk/users/fratelli/RDFIndexLucene are relative paths. Use /media/disk/users/fratelli/RDFIndexLucene etc. instead. DIGY -Original Message- From: Marco Lazzara [mailto:marco.lazz...@gmail.com] Sent: Friday, May 22, 2009 12:48 AM To: java-user@lucene.apa
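
The difference is easy to check from Java itself. The following is a minimal sketch using only java.io.File; the class and method names are hypothetical, and the comments about the leading slash assume a Unix-like system. A path without a leading slash is resolved against the JVM's working directory (the user.dir system property), which under Tomcat is usually not the directory you expect.

```java
import java.io.File;

// Illustrative only: shows how a relative index path is resolved by the JVM.
class IndexPathCheck {
    static boolean isAbsolute(String path) {
        return new File(path).isAbsolute();
    }

    // Relative paths are resolved against the current working directory (user.dir).
    static String resolve(String path) {
        return new File(path).getAbsolutePath();
    }

    public static void main(String[] args) {
        String relative = "media/disk/users/fratelli/RDFIndexLucene";
        String absolute = "/media/disk/users/fratelli/RDFIndexLucene";
        System.out.println("working directory: " + System.getProperty("user.dir"));
        System.out.println(relative + " -> " + resolve(relative));
        System.out.println(absolute + " -> " + resolve(absolute));
    }
}
```

Printing the resolved path right before opening the index, in both the indexing program and the web application, shows whether the two are really looking at the same directory.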

Re: Searching index problems with tomcat

2009-05-22 Thread Matthew Hall
For writing indexes? Well I guess it depends on what you want.. but I personally use this: (2.3.2 API) File INDEX_DIR = new File("/data/searchtool/thisismyindexdirectory"); Analyzer analyzer = new WhateverConcreteAnalyzerYouWant(); IndexWriter writer = new IndexWriter(INDEX_DIR, analyzer, true); Your best bet w

Re: Searching index problems with tomcat

2009-05-22 Thread Marco Lazzara
I was talking with my teacher. Is it correct to use FSDirectory? Could you please look again at the code I've posted here? Should I choose a different way of indexing? Marco Lazzara 2009/5/22 Ian Lea > OK. I'd still like to see some evidence, but never mind. > > Next suggestion is the old

Re: Parsing large xml files

2009-05-22 Thread Matthew Hall
Yeah, there's a setting on Windows that allows you to use up to .. erm 3G I think it was. The limitation there is due to the silly Windows file system. I don't remember offhand exactly what that setting was, but I'm 100% certain that it's there. If you do a google search for jvm maximum me

RE: Parsing large xml files

2009-05-22 Thread Sudarsan, Sithu D.
Hi Matt, We use a 32-bit JVM. Though it is supposed to have up to 4GB, any assignment above 2GB in Windows XP fails. The machine has quad-core dual processors. On Linux we're able to use 4GB though! If there is any setting that will let us use 4GB do let me know. Thanks, Sithu D Sudarsan -O

RE: Parsing large xml files

2009-05-22 Thread Sudarsan, Sithu D.
Thanks everyone for your useful suggestions/links. Lucene uses DOM and we tried with SAX. XML Pull & vtd-xml as well as Piccolo seem good. However, for now, we've broken the file into smaller chunks and are parsing those. When we get some time, we'd like to refactor with the suggested ones. Er
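
For readers who want to see what a pull-style parse looks like, here is a minimal sketch using StAX (javax.xml.stream), which ships with the JDK. It is a swapped-in illustration, not the XML Pull, vtd-xml, or Piccolo APIs named in the thread, and the class name, method name, and sample element names are hypothetical. The idea is the same: read the document as a stream of events instead of building the whole tree in memory.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative only: collects the text of every element with a given name, one event at a time.
class StreamingXmlSketch {
    static List<String> textOf(Reader xml, String elementName) {
        List<String> values = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            while (reader.hasNext()) {
                // Only the current event is held in memory, never the whole document.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    // Assumes the element contains text only (no child elements).
                    values.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return values;
    }

    public static void main(String[] args) {
        String xml = "<docs><doc><title>A</title></doc><doc><title>B</title></doc></docs>";
        System.out.println(textOf(new StringReader(xml), "title")); // [A, B]
    }
}
```

In a real indexing run the Reader would wrap the large file, and each extracted value would be handed to the indexer as it is read rather than collected in a list.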

Re: Parsing large xml files

2009-05-22 Thread Matthew Hall
2g... should not be a maximum for any Jvm that I know of. Assuming you are running a 32 bit Jvm you are actually able to address a bit under 4G of memory, I've always used around 3.6G when trying to max out a 32 bit jvm. Technically speaking it should be able to address 4g under a 32 bit or,
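
One way to see what the JVM actually got, as opposed to what was requested with -Xmx, is to ask the runtime. This is a minimal sketch using only Runtime.maxMemory(); the class and method names are hypothetical, and the reported figure is typically somewhat below the -Xmx value.

```java
// Illustrative only: reports the maximum heap the running JVM will attempt to use.
class HeapLimitCheck {
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("os.arch:  " + System.getProperty("os.arch"));
        System.out.println("max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it as, for example, `java -Xmx1536m HeapLimitCheck` and then raising the value step by step shows where a given machine and operating system stop accepting the setting.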

Re: hit highlighting in lucene ?

2009-05-22 Thread Robert Muir
Hello, I think if you analyze text correctly, then your highlighting will work too. Your problem is you need an analyzer that analyzes text correctly, then I think everything will work! Here's a short intro with some links: You can get code that applies these algorithms here: http://site.icu-proje

Re: Retrieving payloads for terms matched by a query

2009-05-22 Thread Grant Ingersoll
On May 22, 2009, at 12:28 AM, Dmitri Bichko wrote: Hi, I may be missing something obvious, but how do I get the payloads for the specific token positions that were matched by a query? See SpanQuery.getPayloadSpans() and its SpanQuery derivatives. For example, if I have a phrase query li

Re: Searching index problems with tomcat

2009-05-22 Thread Ian Lea
OK. I'd still like to see some evidence, but never mind. Next suggestion is the old standby - cut the code down to the absolute minimum to demonstrate the problem and post it here. I know you've already posted some code, but maybe not all of it, and definitely not cut down to the absolute minimu

Re: Parsing large xml files

2009-05-22 Thread Michael Wechner
crack...@comcast.net wrote: http://vtd-xml.sf.net - Original Message - From: "Sithu D. Sudarsan" To: java-user@lucene.apache.org Sent: Thursday, May 21, 2009 7:42:59 AM GMT -08:00 US/Canada Pacific Subject: Parsing large xml files Hi, While trying to parse xml documents of