Re: Collector is collecting more than the specified hits

2014-02-18 Thread saisantoshi
There might be an issue with the approach below: the docID that is saved might be deleted before the next call to search, and I am not sure whether that breaks the search functionality. Thanks, Ranjith. -- View this message in context: http://lucene.472066.n3.nabble.com/

Re: Collector is collecting more than the specified hits

2014-02-18 Thread saisantoshi
The above works fine, but how do I get the state of the *last docID*? Also, there will be multiple users accessing this, and we need to maintain the integrity of the last docID. Can we know the last docID from the collector's collect call? Thanks, Ranjith. -- View this message in context: http://lucene.4

Re: Collector is collecting more than the specified hits

2014-02-17 Thread saisantoshi
As I mentioned in my original post, I am calling it as below: MyCollector collector; TopScoreDocCollector topScore = TopScoreDocCollector.create(firstIndex+numHits, true); IndexSearcher searcher = new IndexSearcher(reader); try {

Re: Collector is collecting more than the specified hits

2014-02-17 Thread saisantoshi
Could you please elaborate on the above? I am not sure whether the collector is already doing it or whether I need to call another API. Thanks, Sai. -- View this message in context: http://lucene.472066.n3.nabble.com/Collector-is-collecting-more-than-the-specified-hits-tp4117329p4117883.html Sent from

Re: Collector is collecting more than the specified hits

2014-02-17 Thread saisantoshi
The collector is collecting all the documents. Let's say I have 50k documents and I want the collector to give me the results for a given start and maxHits. Can we get this functionality from Lucene? For example, the very first time I want to collect from 0-100, and the next time I want to collect from 1
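
A common way to serve a (start, maxHits) window is to collect the top start + maxHits hits and then drop the first start of them. The sketch below shows only that slicing arithmetic in plain Java. The class and method names are made up for illustration, and the ranked list stands in for whatever the collector returns; as far as I recall, Lucene's TopDocsCollector.topDocs(start, howMany) performs a similar slice, but please check the javadocs for your version.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper, not part of Lucene: slices one page out of a ranked hit list. */
final class PagingSketch {

    /** Returns the hits in [start, start + maxHits), clamped to the size of the list. */
    static List<Integer> page(List<Integer> rankedDocIds, int start, int maxHits) {
        if (start < 0 || maxHits < 0) {
            throw new IllegalArgumentException("start and maxHits must be non-negative");
        }
        if (start >= rankedDocIds.size()) {
            return Collections.emptyList(); // requested page is past the last hit
        }
        // Use long arithmetic so start + maxHits cannot overflow.
        int end = (int) Math.min((long) rankedDocIds.size(), (long) start + maxHits);
        return new ArrayList<>(rankedDocIds.subList(start, end));
    }
}
```

With 250 ranked hits, page(hits, 0, 100) and page(hits, 100, 100) return full pages, page(hits, 200, 100) returns the remaining 50, and page(hits, 300, 100) is empty. Note that the collector still has to gather start + maxHits hits for each page, so deep pages cost more.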

Re: Collector is collecting more than the specified hits

2014-02-14 Thread saisantoshi
I am not interested in the scores at all. My requirement is simple: I only need the first 100 hits, or the numHits I specify (irrespective of their scores). The collector should stop after collecting the numHits specified. Is there a way to tell the collector to stop after collecting the numHits

Collector is collecting more than the specified hits

2014-02-13 Thread saisantoshi
The problem with the collector below is that the collect method does not stop after the numHits count has been reached. Is there a way to stop the collector from collecting docs after it has reached the numHits specified? For example: * TopScoreDocCollector topScore = TopScoreDocCollector.create(numHits, t
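
For context: a top-N collector keeps only the best numHits results, but its collect method is still called for every matching document, because it cannot know the top N without seeing them all. If scores do not matter and the first numHits matches are enough, one workaround that is often suggested is to throw an unchecked exception from collect and catch it around the search call. The sketch below shows that pattern in plain Java. HitSink, EnoughHitsException, and the simulated search loop are hypothetical stand-ins, not Lucene types.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a per-hit callback; not a Lucene type. */
interface HitSink {
    void collect(int docId);
}

/** Unchecked signal used to abort the hit loop once enough hits are collected. */
final class EnoughHitsException extends RuntimeException {
    EnoughHitsException() {
        // No message, cause, suppression, or stack trace: this is control flow, not an error.
        super(null, null, false, false);
    }
}

/** Collects at most numHits doc IDs, then aborts the caller's loop. */
final class FirstNHitSink implements HitSink {
    private final int numHits;
    private final List<Integer> docIds = new ArrayList<>();

    FirstNHitSink(int numHits) {
        if (numHits <= 0) {
            throw new IllegalArgumentException("numHits must be positive");
        }
        this.numHits = numHits;
    }

    @Override
    public void collect(int docId) {
        docIds.add(docId);
        if (docIds.size() >= numHits) {
            throw new EnoughHitsException(); // stop right after the numHits-th hit
        }
    }

    List<Integer> docIds() {
        return docIds;
    }
}

final class EarlyStopSketch {
    /** Simulated search loop: offers every matching doc to the sink in docID order. */
    static void searchAll(int matchCount, HitSink sink) {
        for (int docId = 0; docId < matchCount; docId++) {
            sink.collect(docId);
        }
    }

    /** Runs the simulated search and returns the first numHits doc IDs. */
    static List<Integer> firstHits(int matchCount, int numHits) {
        FirstNHitSink sink = new FirstNHitSink(numHits);
        try {
            searchAll(matchCount, sink);
        } catch (EnoughHitsException expected) {
            // Reached numHits; the remaining matches were never visited.
        }
        return sink.docIds();
    }
}
```

The trade-off is that the hits come back in visiting order rather than by relevance, so this only fits the "first numHits, scores irrelevant" requirement.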

Lucene 4.0 chokes on multiple requests

2014-02-05 Thread saisantoshi
We recently upgraded to Lucene 4.0 and found performance issues when searching. Upon some analysis, we found that it chokes when multiple Lucene search requests come in at once. User1 -> Search User2 -> search User3 -> search The search request done by User Search1 is still waiting

Re: Handling special characters in Lucene 4.0

2013-10-20 Thread saisantoshi
What about other characters like & and ' (quote)? We have a requirement that a text can start with 'sampletext', and when I search with '* it does not return any results, but when I search with sample* it does return the result. Thanks, Ranjith, -- View this message in context:

Re: Handling special characters in Lucene 4.0

2013-10-20 Thread saisantoshi
Thanks. So, if I understand correctly, StandardAnalyzer won't work for the query below, as it strips out the special characters and searches only on searchText (in this case). queryText = *&&searchText* If we want to do a search like "*&&**" then we need to use WhitespaceAnalyzer. Please l

Re: Handling special characters in Lucene 4.0

2013-10-20 Thread saisantoshi
StandardAnalyzer both at index and search time. We use the default one and don't have any custom analyzers. Thanks, Sai -- View this message in context: http://lucene.472066.n3.nabble.com/Handling-special-characters-in-Lucene-4-0-tp4096674p4096710.html Sent from the Lucene - Java Users mailing

Handling special characters in Lucene 4.0

2013-10-20 Thread saisantoshi
I have created strings like the ones below: &&searchtext +sampletext. When I try to search for them using *&&** or *+**, it does not give any result. I am using the QueryParser.escape(String s) method to handle the special characters, but it does not look like it did anything. Also, when I search some
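
Two things are easy to mix up here. Escaping only changes how the query string is parsed; whether a term such as &&searchtext can match still depends on what the analyzer put into the index (a later message in this thread notes that StandardAnalyzer strips the special characters). Also, if the whole input including the trailing * is escaped, the wildcard becomes a literal character. The plain-Java sketch below imitates backslash-escaping and appends the wildcard afterwards. EscapeSketch and its character list are assumptions made for illustration, not the library's code; use QueryParser.escape in real code.

```java
/** Hypothetical re-implementation for illustration only; prefer QueryParser.escape. */
final class EscapeSketch {

    // Assumed set of query-syntax characters; check the QueryParser docs for your version.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    /** Prefixes every special character with a backslash. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 2);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Escapes the literal prefix first, then appends an unescaped trailing wildcard. */
    static String prefixQuery(String literalPrefix) {
        return escape(literalPrefix) + "*";
    }
}
```

For the strings above, prefixQuery("&&search") yields \&\&search* and prefixQuery("+sample") yields \+sample*, while escape("&&*") would turn the wildcard into a literal \*.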

RE: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-03-08 Thread saisantoshi
Could someone please comment on the above? Thanks, Sai -- View this message in context: http://lucene.472066.n3.nabble.com/TopDocCollector-vs-TopScoreDocCollector-semantics-changed-in-4-0-not-backward-comptabile-tp4035806p4045855.html Sent from the Lucene - Java Users mailing list archive at N

RE: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-03-06 Thread saisantoshi
Thanks for the response; I really appreciate your help. I have read the documentation but could not get it on the first read, as I was new to Lucene. I have changed it to AtomicReader and it seems to be working fine. One last clarification: do we also need to use AtomicReader for the following b

RE: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-02-28 Thread saisantoshi
Thanks a lot. I really appreciate your help here. I have read through the document and understand that the IndexReader uses sub-readers (to look into the index files) and AtomicReader does not. But how does this affect things from a search standpoint? I think search results should be consistent

RE: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-02-28 Thread saisantoshi
Could someone please comment on the above code snippet? Also, one observation is that our search results are not consistent between *IndexReader and AtomicReader*. Could this be a problem? Thanks, Sai. -- View this message in context: http://lucene.472066.n3.nabble.com/TopDocCollector

RE: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-02-27 Thread saisantoshi
Here is how I am using it: public class MyCollector extends PositiveScoresOnlyCollector { private IndexReader indexReader; public MyCollector(IndexReader indexReader, PositiveScoresOnlyCollector topScore) { super(topScore); this.indexReader = indexReader;

RE: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-02-27 Thread saisantoshi
Thanks. Is there any issue with the way we are calling indexReader.getDocument(doc)? I am not sure how to get an AtomicReaderContext in the method below. Any pointers on how to get that instance are appreciated. public void collect(int doc) throws IOException { // ADD YOUR CUSTOM LOGIC

Re: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-02-27 Thread saisantoshi
I want to get the Document in the code below, and that's why I need an indexReader: public void collect(int doc) throws IOException { // ADD YOUR CUSTOM LOGIC HERE *Document doc = indexReader.document(doc)* delegate.collect(doc); } But this seems to be the problem as the in

Boolean Query not working in Lucene 4.0

2013-02-26 Thread saisantoshi
The following query does not seem to work after we upgraded from 2.4 to 4.0: *+type:sometype +title:sometitle** Any ideas as to where to look? Is the syntax of the above query correct? I would appreciate it if you could advise on the above. Thanks, Sai. -- View this message in contex

IndexSearcher.close() removed in 4.0

2013-02-18 Thread saisantoshi
I understand from the JIRA ticket (LUCENE-3640) that IndexSearcher.close() is a no-op, but I am not very clear on why. Could someone shed some light on this? We were using this method in the older versions; is it safe now to remove this call? Just want to understand the conseq

Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-02-05 Thread saisantoshi
I am looking at the versions supported by the newer version of Tika (1.3) and am not sure which version(s) of Microsoft Office (97/2000/2010/2013) it supports for each of the below: http://tika.apache.org/1.3/formats.html#Microsoft_Office_document_formats Microsoft Word (also does it support bot

Re: Example settings for TieredMergePolicy : Lucene 4.0

2013-02-01 Thread saisantoshi
Thanks. I read this (and also tried it out in my code) and understand that forceMerge(1) is not advisable for performance reasons. My question is: if we don't have a way to compress these files, it will produce an enormous number of files, which will lead to some file system issues (such as excee

Example settings for TieredMergePolicy : Lucene 4.0

2013-02-01 Thread saisantoshi
I am using the TieredMergePolicy and using the compound index: TieredMergePolicy mergePolicy = new TieredMergePolicy(); indexWriterConfig.setMergePolicy(mergePolicy.setNoCFSRatio(1.0d)); Prior to 4.0, there was an optimize() in the IndexWriter which was merging the index files. Is there any sett

Re: IndexWriterConfig.OpenMode.CREATE vs OpenMode.APPEND (index files)

2013-02-01 Thread saisantoshi
>>Are you closing or committing your IndexWriter after each added document? Because if you add 100 docs you should not see 100 versions of these files, only one set of files in the end (many docs are written to one segment). No. What I meant to say here is if 100 users have updated the document

Re: IndexWriterConfig.OpenMode.CREATE vs OpenMode.APPEND (index files)

2013-01-31 Thread saisantoshi
Is it by design? The older API (2.4) does not have this problem. Let's say I have 100 updates or so; then it will create 100 versions of those files in the index. This would increase the number of files in the index directory, and we might run into some file issues. It would be good to just have th

Re: IndexWriterConfig.OpenMode.CREATE vs OpenMode.APPEND (index files)

2013-01-31 Thread saisantoshi
It's _0.si (typo). For the second update, create = "false". Thanks, Sai. -- View this message in context: http://lucene.472066.n3.nabble.com/IndexWriterConfig-OpenMode-CREATE-vs-OpenMode-APPEND-index-files-tp4037766p4037785.html Sent from the Lucene - Java Users mailing list archive at Nabble.com

IndexWriterConfig.OpenMode.CREATE vs OpenMode.APPEND (index files)

2013-01-31 Thread saisantoshi
I am using the following for creating the IndexWriter (for my indexing): IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_40, new LimitTokenCountAnalyzer(analyzer, MAX_FIELD_SCAN_LENGTH)); if (create) { // create will be true for indexing

Re: List of files that Lucene 4.0 generates during indexing

2013-01-30 Thread saisantoshi
The following files are originally created files (upon an initial indexing): _0.fdt _0.fdx _0.fnm _0.si _0_Lucene40_0.frq _0_Lucene40_0.prx _0_Lucene40_0.tim _0_Lucene40_0.tip _0_nrm.cfe _0_nrm.cfs index.v0008

Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread saisantoshi
We are not using Solr, just the Lucene core 4.0 engine. I am trying to see if we can use the Tika library to extract textual information from PDF/Word/Excel documents. I am mainly interested in reading the contents inside the documents and indexing them using Lucene. My question here is, is tika framewor

Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-25 Thread saisantoshi
I want to index document content (such as PDF/Word/Excel) and am wondering if there are any good readers that I can integrate with Lucene 4.0. Any pointers or example code are appreciated. The Lucene in Action book mentions "tika" as the library to use, but I am not sure if this is the preferre

Re: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-01-25 Thread saisantoshi
I am not looking for negative scores and want to skip them. Thanks, Sai -- View this message in context: http://lucene.472066.n3.nabble.com/TopDocCollector-vs-TopScoreDocCollector-semantics-changed-in-4-0-not-backward-comptabile-tp4035806p4036378.html Sent from the Lucene - Java Users mailing li

Re: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-01-25 Thread saisantoshi
Thanks a lot. Can we wrap TopScoreDocCollector in PositiveScoresOnlyCollector? I need only positive scores, and I don't think TopScoreDocCollector can handle that by itself, right? Thanks, Sai -- View this message in context: http://lucene.472066.n3.nabble.com/TopDocCollector-v

Re: List of files that Lucene 4.0 generates during indexing

2013-01-24 Thread saisantoshi
Thanks. Could you please also comment on the following as well? http://lucene.472066.n3.nabble.com/TopDocCollector-vs-TopScoreDocCollector-semantics-changed-in-4-0-not-backward-comptabile-td4035806.html Thanks and really appreciate your help. Thanks, Sai. -- View this message in context: htt

RE: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-01-24 Thread saisantoshi
Can someone please help us here to validate the above? Thanks, Sai. -- View this message in context: http://lucene.472066.n3.nabble.com/TopDocCollector-vs-TopScoreDocCollector-semantics-changed-in-4-0-not-backward-comptabile-tp4035806p4036093.html Sent from the Lucene - Java Users mailing list

Re: List of files that Lucene 4.0 generates during indexing

2013-01-24 Thread saisantoshi
Thanks a lot. One last question, how do we set it? IndexWriter.??? Thanks, Ranjith. -- View this message in context: http://lucene.472066.n3.nabble.com/List-of-files-that-Lucene-4-0-generates-during-indexing-tp4035993p4036091.html Sent from the Lucene - Java Users mailing list archive at Nabbl

Re: List of files that Lucene 4.0 generates during indexing

2013-01-24 Thread saisantoshi
Thanks. Are there any best practices to follow here, or should we leave the default (which is the hybrid approach, as you mentioned)? -- View this message in context: http://lucene.472066.n3.nabble.com/List-of-files-that-Lucene-4-0-generates-during-indexing-tp4035993p4036086.html Sent from the Lucene - J

Re: List of files that Lucene 4.0 generates during indexing

2013-01-24 Thread saisantoshi
Thanks Michael. The additional file in the list is just a typo. One more question: we were using 2.4 before, and it only generated a few files: _0.cfs _0.cfx // segment files I am assuming that the 2.4 version has the compound index structure enabled by default. Do we need to set it explicitly w

Re: List of files that Lucene 4.0 generates during indexing

2013-01-24 Thread saisantoshi
Thanks. I checked it out. Here is the list of files that have been generated: _0.fdt _0.fdx _0.fnm _0.si _0_Lucene40_0.frq _0_Lucene40_0.prx _0_Lucene40_0.tim _0_Lucene40_0.tip _0_nrm.cfe _0_nrm.cfs index.v000

RE: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-01-23 Thread saisantoshi
Here is the way I implemented a collector class. I would appreciate it if you could let me know of any issues. public class MyCollector extends PositiveScoresOnlyCollector { private IndexReader indexReader; public MyCollector (IndexReader indexReader,PositiveScoresOnlyCollector topScor

Re: Extending TopScoreDocCollector to write a custom collector

2013-01-23 Thread saisantoshi
Here is the way I implemented a collector class. I would appreciate it if you could let me know of any issues. public class MyCollector extends PositiveScoresOnlyCollector { private IndexReader indexReader; public MyCollector (IndexReader indexReader,PositiveScoresOnlyCollector topScor

Extending TopScoreDocCollector to write a custom collector

2013-01-23 Thread saisantoshi
I would like to write a custom collector (similar to the ones inside the source of TopScoreDocCollector, such as InOrderTopScoreDocCollector). The reason for extending this is that InOrderTopScoreDocCollector and OutOfOrderTopScoreDocCollector are private to the class and I really wanted

RE: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-01-23 Thread saisantoshi
I am sorry, but I am confused looking at the change logs and the enhancements done, since we are jumping from 2.4 to 4.0. Could you please point me to any example code that extends one of the new collectors? That would help a lot. Or it would be great if you could give some pointers on how we can mo

TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-01-23 Thread saisantoshi
Our current search implementation (based on 2.4.0) uses a collector extending the TopDocCollector class public class MyHitCollector extends TopDocsCollector { private IndexReader indexReader; private CustomFilter customFilter; public MyHitCollector (IndexReader indexReader, int numbe

IndexWriter.optimize() is removed in 4.0?

2013-01-23 Thread saisantoshi
There is no optimize() method in 4.0. I looked at the 3.6 docs and they mention the following. Does it mean that we no longer need this method and that it should not be used anymore? Is there a replacement method that we need to use, as it is deprecated as of version 3.6.0? /* Depr

RE: Are Search Index directories backward comptabile? ( when upgrading to latest lucene version)

2013-01-23 Thread saisantoshi
Thanks. We decided to delete the existing index directories and recreate them once we upgrade to 4.0 (unless we hit any major API blockers during compilation, in which case we would prefer to go to 3.6.2 first and then later to 4.0). Thanks, Sai. -- View this message in context: http://lucene.472066.n3.nabble.

RE: Are Search Index directories backward comptabile? ( when upgrading to latest lucene version)

2013-01-22 Thread saisantoshi
Also, I am not sure about the following statement: >>A direct update from 2.x to 4.0 is not possible Are you saying that it's impossible to upgrade from 2.4 to 4.0? Why can't we upgrade? Are there any technical limitations in Lucene that prevent upgrading from 2.4 to 4.x? I am d

RE: Are Search Index directories backward comptabile? ( when upgrading to latest lucene version)

2013-01-22 Thread saisantoshi
We are upgrading from 2.4 to 4.0. What are the options here (for example, deleting the existing index directories and reindexing with the upgraded version)? We don't want to take any intermediate steps that would cause more work again to upgrade to the latest version. Thanks, Sai. -- View this message in cont

RE: IndexSearcher.search(Weight weight, Filter filter, HitCollector results) is not there in 4.0 version

2013-01-22 Thread saisantoshi
Thanks. Can we use the following method in 4.0 as a replacement for the above method? We will rewrite this to use FilteredQuery later, but don't want to refactor a lot right now. public void search(Query query, Filter filter, Collector results) throws IOException ht

IndexSearcher.search(Weight weight, Filter filter, HitCollector results) is not there in 4.0 version

2013-01-22 Thread saisantoshi
We are using the method below with Lucene 2.4.0: public void search(Weight weight, Filter filter, HitCollector results) throws IOException We are upgrading to the latest version and, looking at the API (4.0), the above signature has been d

Re: Upgrade Lucene to latest version (4.0) from 2.4.0

2013-01-11 Thread saisantoshi
I was just compiling our sources against 4.0, and it seems there are quite a few API changes. Are there any major functionality changes from 2.4 to 4.0 with respect to indexing/searching? Thanks, Sai. -- View this message in context: http://lucene.472066.n3.nabble.com/Upgrade-Lucene-to-lates

Re: Upgrade Lucene to latest version (4.0) from 2.4.0

2013-01-10 Thread saisantoshi
Thanks for all the responses. Apart from the API changes, is there any major functionality change from 2.4.0 to 4.x? I know we need to update our code to the latest API, but I am curious whether we need to be aware of any functional changes so that we can test more thoroughly. Thanks, Sai.

Re: Field.Store.YES vs Field.Store.NO

2013-01-10 Thread saisantoshi
I am not sure what the following means: >>using Field.Store.NO the field itself is definitely searchable. You will not be able to retrieve the field value itself For example, if we have a file that we upload using some keywords and if the keyword (is of type Field.Store.NO but is analyzed)

Field.Store.YES vs Field.Store.NO

2013-01-10 Thread saisantoshi
I am new to Lucene and am trying to understand the impact on search of using Field.Store.NO vs Field.Store.YES. I know the former does not store the value in the index and the latter does. Would that mean that a field that uses Field.Store.NO is not searchable? new Field

StandardAnalyzer: Support for Japanese

2013-01-10 Thread saisantoshi
We are using StandardAnalyzer for indexing some Japanese keywords. It works fine so far, but I wanted to confirm whether StandardAnalyzer fully supports it (I read somewhere in the Lucene in Action book that StandardAnalyzer does support CJK). Just want to confirm if my understanding is corr

Re: Upgrade Lucene to latest version (4.0) from 2.4.0

2013-01-09 Thread saisantoshi
Are there any best practices that we can follow? We want to get to the latest version and are wondering if we can go directly from 2.4.0 to 4.x (as opposed to 2.x to 3.x and then 3.x to 4.x), so that we save not only time but also a testing cycle at each migration hop. Are there any limitations in dire

Re: Upgrade Lucene to latest version (4.0) from 2.4.0

2013-01-09 Thread saisantoshi
Is there any migration guide from 2.x to 3.x? (As per the suggestion, I would like to upgrade first from 2.4.0 to 2.9.0 and then from 2.9.0 to 3.6, and later we will decide if we want to upgrade from 3.6 to 4.x.) -- View this message in context: http://lucene.472066.n3.nabble.com/Upgrade-Lucene-t

Re: Upgrade Lucene to latest version (4.0) from 2.4.0

2013-01-09 Thread saisantoshi
Thanks. Could you please elaborate on what is needed other than replacing the jars? Are the jars listed the only ones, or are any additional jars required? Is the API not backward compatible? I mean, are the API calls we are using in 2.4.0 not supported by 4.0? Has the signature modifi

Upgrade Lucene to latest version (4.0) from 2.4.0

2013-01-09 Thread saisantoshi
We have an existing application which uses Lucene 2.4.0. We are thinking of upgrading it to the latest version (4.0). I am not sure about the process involved in upgrading to the latest version. Is it just copying the jars? If yes, which jars do we need to copy over? Will it be backward

Re: Is StandardAnalyzer good enough for multi languages...

2013-01-09 Thread saisantoshi
Thanks for all the responses. From the above, it sounds like there are two options. 1. Use ICUTokenizer (is it in Lucene 4.0 or 4.1?). If it's in 4.1, then we cannot use it at this time as it has not been released. 2. Write a custom analyzer by extending (StandardAnalyzer) and add filters for addition

Is StandardAnalyzer good enough for multi languages...

2013-01-08 Thread saisantoshi
Does Lucene's StandardAnalyzer work for all languages when tokenizing before indexing (since we are using Java, I think the content is converted to UTF-8 before tokenizing/indexing)? Or do we need to use special analyzers for each language? In this case, if a document has a mixed case ( eng

Lucene support for multi byte characters : 2.4.0 (version).

2013-01-08 Thread saisantoshi
We are using Lucene (2.4.0 libraries) to implement search in our application, with StandardAnalyzer as the analyzer. Our application has a document upload feature which lets you upload documents and enter some keywords (while uploading). When we search (using t