hit highlighter bug

2007-01-09 Thread Jason

Hi all,
	I have come across what I think is a curious but insidious bug with the 
java Lucene hit highlighter. I updated to the latest version of Lucene 
and the highlighter because I first found this problem in the Lucene 
v1.4 release; unfortunately it's still there in v2.0.0.


I am indexing XML documents and am also using the hit highlighter for 
search results. This works perfectly in almost every case except for one.


In my formatter class I have this:

public class LuceneSearch implements
        org.apache.lucene.search.highlight.Formatter
{
...
    public String highlightTerm(String originalText, TokenGroup group)
    {
        if (group.getTotalScore() <= 0)
        {
            return originalText;
        }
        return "<em>" + originalText + "</em>";
    }
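
For context, a formatter like this is typically wired into the contrib Highlighter 
roughly as follows (a minimal sketch; the analyzer, field name, and variable names 
are illustrative, not from the original message):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

    // Inside LuceneSearch (which implements Formatter):
    public String highlight(Query query, String documentText) throws Exception {
        // 'this' supplies highlightTerm() above for each matching token group.
        Highlighter highlighter = new Highlighter(this, new QueryScorer(query));
        return highlighter.getBestFragment(new StandardAnalyzer(), "contents", documentText);
    }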

When I search for -> Acquisition Plan <-
in my search results I get:
(ancillary stuff deleted)
attached to the <em>Acquisition</em>
< em>Plan</em> and signed

Notice the space between the < and the e in the second < em>.
This only occurs for these search terms and for this document (as far as
I know), but because it's part of a much larger XML document it breaks the
whole thing.


The original XML is unremarkable, with no strange characters surrounding
these terms. A snippet from the relevant paragraph from which these
highlighted terms come:


-> attached to the Acquisition Plan and signed off<-

Has anyone seen anything like this before? Is this a genuine new bug, or
something of which the Lucene folk (or at least whoever wrote the
highlighter) are aware? Can anyone think of a way to fix this without
scanning every element in my result text for rogue spaces?


Thanks in advance
Jason.









how can I filter my search to not include items containing a particular field and value?

2007-01-10 Thread Jason
how can I filter my search to not include items containing a particular 
field and value?


I effectively want to add -myfieldname:myvalue to the end of my search
query, but I can't see how to do this via the API.
I have a complex query built up via the API and just want to filter it
based on field name/value pairs.


I'm sure it must be simple - I just can't see how to do it.

thanks.
Jason.




Re: how can I filter my search to not include items containing a particular field and value?

2007-01-11 Thread Jason

Thanks Erick,
	this is more or less what I ended up doing, but I'm not really happy
with it:

 
hparser.setDefaultOperator(QueryParser.Operator.AND);
Query hideQuery = hparser.parse("properties@" + hideterm + ":" + hidevalue);

cquery.add(hideQuery, BooleanClause.Occur.MUST_NOT);

What I was really looking for was a way to specify the fields I wanted
hidden without such a loose interface. I was expecting something like a
fieldQuery object but couldn't find anything appropriate.


Thank you for your help.

Erick Erickson wrote:

Would something like the following work for you?

BooleanQuery bq = new BooleanQuery();
bq.add(yourBuiltUpQuery, BooleanClause.Occur.MUST);
bq.add(yourNotClause, BooleanClause.Occur.MUST_NOT);


Now you can use your bq as your query to search.


NOTE: there is continual confusion about what the - syntax really does; you
might want to search the mail archive for one of several explanations if you
are thinking of the NOT operator as a boolean logic operator. It's not,
quite.


On 1/10/07, Jason <[EMAIL PROTECTED]> wrote:


how can I filter my search to not include items containing a particular
field and value?

I want effectively to add -myfieldname:myvalue to the end of  my search
query, but I cant see how to do this via the api.
I have a complex query built up via the api and just want to filter it
based on field name/value pairs.

I'm sure it must be simple - I just cant see how to do it.

thanks.
Jason.










where is the proper place to report lucene bugs?

2007-01-11 Thread Jason
Can someone please tell me where the most appropriate place to report
bugs might be - in this case, for the hit-highlighter contribution?


Thanks
Jason.




Re: where is the proper place to report lucene bugs?

2007-01-11 Thread Jason

Thanks Grant,
	I did make an initial posting on this list but got zero responses so 
I'm guessing nobody else has seen the problem.


Basically, from this:

public String highlightTerm(String originalText, TokenGroup group)
{
    if (group.getTotalScore() <= 0)
    {
        return originalText;
    }
    return "<em>" + originalText + "</em>";
}

I'm getting '< em>some text</em>' as the result in just one case. This
is an XML fragment, and clearly the introduced space between the '<' and
the 'e' is not a good thing.


Thanks for the response.
Jason.



Grant Ingersoll wrote:
 From the resources section of the website, the Issue Tracking link is: 
http://issues.apache.org/jira/browse/LUCENE


Also, it is helpful if you have done a preliminary search on the topic
and some reasonable investigation to confirm that it is in fact a bug.
If you're not sure, please ask on this list.


-Grant

On Jan 11, 2007, at 9:49 PM, Jason wrote:

can someone please tell me where the most appropriate place to report 
bugs might be - in this case for the hit-highlighter contribution


Thanks
Jason.




--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ










about the wordnet program.

2006-01-12 Thread jason
hi,

I am trying to use the Lucene WordNet program for my application. However, I
ran into some problems.

When I incorporate these files - Syns2Index.java, SynLookup.java, and
SynExpand.java - I find some methods and fields are not defined.

For instance, in Syns2Index.java:

writer.setMergeFactor(writer.getMergeFactor() * 2);
// writer.getMergeFactor() is not found in the Lucene package.

writer.setMaxBufferedDocs(writer.getMaxBufferedDocs() * 2);
// writer.getMaxBufferedDocs() is not found.

Also, I do not find the definitions of Field.Store.YES and
Field.Index.UN_TOKENIZED in Field.java.

How can I handle this problem?

regards
Xing


Re: about the wordnet program.

2006-01-13 Thread jason
Hi,

Thanks for your reply.

I have checked the source code and worked out the changes needed for the
Lucene version I am using.

For Syns2Index:

doc.add(new Field(F_WORD, g, Field.Store.YES, Field.Index.UN_TOKENIZED))
    --> doc.add(Field.Keyword(F_WORD, g))
doc.add(new Field(F_SYN, cur, Field.Store.YES, Field.Index.NO))
    --> doc.add(Field.UnIndexed(F_SYN, cur))

For SynLookup and SynExpand:

tmp.add(tq, BooleanClause.Occur.SHOULD)  --> tmp.add(tq, true, false)


On 1/13/06, Daniel Naber <[EMAIL PROTECTED]> wrote:
>
> On Thursday, 12 January 2006 16:25, jason wrote:
>
> > When i incorporate these files, Syns2Index.java, SynLookup.java, and
> > SynExpand.java, I find some variables are not defined.
>
> It depends on Lucene in SVN; some things in the Lucene API have changed
> since Lucene 1.4. So you need to get the latest development version from
> SVN.
>
> Regards
> Daniel
>
> --
> http://www.danielnaber.de
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


One problem of using the lucene

2006-01-16 Thread jason
Hi,

I have a problem using Lucene.

I wrote a SynonymFilter which adds synonyms from WordNet. Meanwhile,
I used the SnowballFilter for term stemming. However, I ran into a problem
when combining the two filters.

For instance, I have 17 documents containing the term "support", and the
following is the tokenStream() method of the SynonymAnalyzer I wrote.

public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    if (stopword != null) {
        result = new StopFilter(result, stopword);
    }

    result = new SnowballFilter(result, "Lovins");

    result = new SynonymFilter(result, engine);

    return result;
}

If I use only the SnowballFilter, I can find "support" in the 17
documents. However, after adding the SynonymFilter, "support" can only
be found in 10 documents. It seems the term "support" cannot be found in the
remaining 7 documents. I don't know what's wrong with it.
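
As an aside, a quick way to see exactly what terms end up in the index is to run
the analyzer standalone and print its tokens - the "analyze your analyzer"
approach Erik suggests later in the thread. A minimal sketch against the old
TokenStream API (the stop word list and sample sentence are illustrative, and it
assumes the SynonymAnalyzer class this method belongs to takes a stop word array):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class AnalyzerDebug {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new SynonymAnalyzer(new String[] { "the", "a", "of" });
        TokenStream stream = analyzer.tokenStream("contents",
                new StringReader("we support association rules"));
        // Print each token with its position increment; injected synonyms
        // should show up with posIncr=0 at the same position as the original.
        for (Token t = stream.next(); t != null; t = stream.next()) {
            System.out.println(t.termText() + " (posIncr=" + t.getPositionIncrement() + ")");
        }
    }
}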

regards

jiang xing


Re: One problem of using the lucene

2006-01-16 Thread jason
Hi,

the following code is the SynonymFilter i wrote.


import org.apache.lucene.analysis.*;


import java.io.*;
import java.util.*;
/**
 * @author JIANG XING
 *
 * Jan 15, 2006
 */
public class SynonymFilter extends TokenFilter {

    public static final String TOKEN_TYPE_SYNONYM = "SYNONYM";

    private Stack synonymStack;
    private WordNetSynonymEngine engine;

    public SynonymFilter(TokenStream in, WordNetSynonymEngine engine) {
        super(in);
        synonymStack = new Stack();
        this.engine = engine;
    }

    public Token next() throws IOException {
        if (synonymStack.size() > 0) {
            return (Token) synonymStack.pop();
        }

        Token token = input.next();
        if (token == null) {
            return null;
        }

        addAliasesToStack(token);

        return token;
    }

    private void addAliasesToStack(Token token) throws IOException {
        String[] synonyms = engine.getSynonyms(token.termText());
        if (synonyms == null) return;

        for (int i = 0; i < synonyms.length; i++) {
            Token synToken = new Token(synonyms[i], token.startOffset(),
                    token.endOffset(), TOKEN_TYPE_SYNONYM);
            synToken.setPositionIncrement(0);
            synonymStack.push(synToken);
        }
    }
}
It adds tokens at the same position as the original token. And then
I used the QueryParser for searching, with the Snowball analyzer for parsing.

The following is the SynonymAnalyzer I wrote.

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.analysis.snowball.*;

import java.io.*;
import java.util.*;

/**
 * @author JIANG XING
 *
 * Jan 15, 2006
 */
public class SynonymAnalyzer extends Analyzer {

    private WordNetSynonymEngine engine;
    private Set stopword;

    public SynonymAnalyzer(String[] word) {
        try {
            engine = new WordNetSynonymEngine(
                    new File("C:\\PDF2Text\\SearchEngine\\WordNetIndex"));
            stopword = StopFilter.makeStopSet(word);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        if (stopword != null) {
            result = new StopFilter(result, stopword);
        }

        result = new SnowballFilter(result, "Lovins");

        result = new SynonymFilter(result, engine);

        return result;
    }
}
I added some debugging code to the SnowballFilter (the "support" check in
next(), below). If I use only the SnowballFilter, the term "support" can be
found in all 17 documents. However, if the line "result = new
SynonymFilter(result, engine);" is used, the term "support" cannot be found
in some documents.


public class SnowballFilter extends TokenFilter {
  private static final Object [] EMPTY_ARGS = new Object[0];

  private SnowballProgram stemmer;
  private Method stemMethod;

  /** Construct the named stemming filter.
   *
   * @param in the input tokens to stem
   * @param name the name of a stemmer
   */
  public SnowballFilter(TokenStream in, String name) {
super(in);
try {
  Class stemClass =
Class.forName("net.sf.snowball.ext." + name + "Stemmer");
  stemmer = (SnowballProgram) stemClass.newInstance();
  // why doesn't the SnowballProgram class have an (abstract?) stem
method?
  stemMethod = stemClass.getMethod("stem", new Class[0]);
} catch (Exception e) {
  throw new RuntimeException(e.toString());
}
  }

  /** Returns the next input Token, after being stemmed */
  public final Token next() throws IOException {
Token token = input.next();
if (token == null)
  return null;
stemmer.setCurrent(token.termText());
try {
  stemMethod.invoke(stemmer, EMPTY_ARGS);
} catch (Exception e) {
  throw new RuntimeException(e.toString());
}

Token newToken = new Token(stemmer.getCurrent(),
  token.startOffset(), token.endOffset(), token.type());
//check the tokens.
if(newToken.termText().equals("support")){
System.out.println("the term support is found");
}

newToken.setPositionIncrement(token.getPositionIncrement());
return newToken;
  }
}



On 1/16/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>
> Could you share the details of your SynonymFilter?  Is it adding
> tokens into the same position as the original tokens (position
> increment of 0)?   Are you using QueryParser for searching?  If so,
> try TermQuery to eliminate the parser's analysis from the picture for
> the time being while troubleshooting.
>
> If you are using QueryParser, are you using the same analyzer?  If
> this is the case, what is the .toString of the generated Query?

Re: One problem of using the lucene

2006-01-17 Thread jason
Hi,

thanks for your replies.

I have tested the SnowballFilter and it does not stem the term "support".
That means the term "support" should be found in all the papers. However,
when I add the SynonymFilter, "support" is missing.

I think I have to read the Lucene source code again.

yours truly

Jiang Xing

On 1/17/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>
>
> On Jan 17, 2006, at 12:14 AM, jason wrote:
> > It is adding tokens into the same position as the original token.
> > And then,
> > I used the QueryParser for searching and the snowball analyzer for
> > parsing.
>
> Ok, so you're only using the SynonymAnalyzer for indexing, and the
> SnowballAnalyzer for QueryParser, correct?  If so, that is reasonable.
>
> > public TokenStream tokenStream(String fieldName, Reader reader){
> >
> > TokenStream result = new StandardTokenizer(reader);
> > result = new StandardFilter(result);
> > result = new LowerCaseFilter(result);
> > if (stopword != null){
> >   result = new StopFilter(result, stopword);
> > }
> >
> > result = new SnowballFilter(result, "Lovins");
> >
> > result = new SynonymFilter(result, engine);
> >
> > return result;
> > }
> >
> > }
> > I write some code in the snowballfitler (line 75-79). If i only
> > used the
> > snowballfilter, the term "support" can be found in all the 17
> > documents.
> > However, if the code "result = new SynonymFilter(result, engine);"
> > is used.
> > The term "support" cannot be found in some documents.
>
>
> It looks like you borrowed SynonymAnalyzer from the Lucene in Action
> code.  But you've tweaked some things.  One thing that is clearly
> amiss is that you're looking up synonyms for stemmed words, which is
> not going to work (unless you stemmed the WordNet words beforehand,
> but I doubt you did that and it would be quite odd to do so).  You're
> probably not injecting many synonyms at all.
>
> I encourage you to "analyze your analyzer" by running some utilities
> such as the Analyzer demo that comes with Lucene in Action's code.
> You'll have some more insight into this issue when trying this out in
> isolation from query parsing and other complexities.
>
> >   /** Returns the next input Token, after being stemmed */
> >   public final Token next() throws IOException {
> > Token token = input.next();
> > if (token == null)
> >   return null;
> > stemmer.setCurrent(token.termText());
> > try {
> >   stemMethod.invoke(stemmer, EMPTY_ARGS);
> > } catch (Exception e) {
> >   throw new RuntimeException(e.toString());
> > }
> >
> > Token newToken = new Token(stemmer.getCurrent(),
> >   token.startOffset(), token.endOffset(),
> > token.type());
> > //check the tokens.
> > if(newToken.termText().equals("support")){
> > System.out.println("the term support is found");
> > }
>
> I'm not sure what the exact solution to your dilemma is, but doing
> more testing with your analyzer will likely shed light on it for you.
>
>Erik
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: One problem of using the lucene

2006-01-17 Thread jason
OK, I will try it.


On 1/17/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>
>
> On Jan 17, 2006, at 5:58 AM, jason wrote:
> > I have test the snowballFilter and it does not stem the term
> > "support". It
> > means the term "support" should be in all the papers. However, i
> > add the
> > synonymFilter, the "support" is missing.
>
> Two very valuable troubleshooting techniques:
>
>1) Run your analyzer used for indexing standalone on the trouble
> text.
>
>2) Look at the Query.toString() of the parsed query.
>
> These two things will very likely point to the issue.
>
>Erik
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Use the lucene for searching in the Semantic Web.

2006-01-17 Thread jason
Hi friends,

What do you think about using Lucene for searching in the Semantic Web? I am
trying to use Lucene to search documents with ontological annotations,
but I have not found a good model for combining the keyword information and
the ontological information.

regards
jiang xing


Re: Use the lucene for searching in the Semantic Web.

2006-01-17 Thread jason
Hi Erik,

thanks for your reply.

I think Kowari is a system for searching information in RDF files; it is
only for finding information in the metadata files. However, I think one
problem of the Semantic Web is this: if we have a document and its RDF
annotation, how do we retrieve the document? Right now, we can use
keyword-based methods to find documents relevant to a user's query, and
some other kinds of technologies for finding metadata files. But can we
combine the two processes, and how can we combine them?

regards
Jiang Xing


On 1/17/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>
> Have a look at Kowari - http://www.kowari.org
>
> It is a scalable RDF engine that also has full-text search support
> via Lucene.
>
> Professionally I tinker with semweb and search topics, and eventually
> we'll have something to show for these efforts :)
>
>    Erik
>
>
> On Jan 17, 2006, at 9:34 AM, jason wrote:
>
> > Hi friends,
> >
> > How do you think use the lucene for searching in the Semantic Web?
> > I am
> > trying using the lucene for searching documents with ontological
> > annotation.
> > But i do not get a better model to combine the keywords information
> > and the
> > ontological information.
> >
> > regards
> > jiang xing
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: Question.

2006-02-05 Thread jason
You can get the term frequency matrix first. Then, select the most frequent
terms.

An earlier message on this list described how to build the term frequency
matrix.
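
A minimal sketch of that approach with the raw index statistics (the index
path and field name are illustrative); it finds the term with the largest
total occurrence count by summing per-document frequencies:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class MostFrequentTerm {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("indexDir");
        TermEnum terms = reader.terms();
        String bestTerm = null;
        long bestCount = -1;
        while (terms.next()) {
            Term term = terms.term();
            if (!"contents".equals(term.field())) continue;  // one field only
            long count = 0;
            TermDocs docs = reader.termDocs(term);
            while (docs.next()) {
                count += docs.freq();  // occurrences of this term in this doc
            }
            docs.close();
            if (count > bestCount) {
                bestCount = count;
                bestTerm = term.text();
            }
        }
        terms.close();
        reader.close();
        System.out.println(bestTerm + " occurs " + bestCount + " times");
    }
}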

regards
jiang xing


On 2/6/06, Pranay Jain <[EMAIL PROTECTED]> wrote:
>
> I have earlier used lucene and I must say it has performed bug free for
> the
> limited use I deployed it for. I now want to deploy lucene to do something
> more. Once indexed, I want to know, which is the word which occurs maximum
> times among all the rest in a document set. Does lucene already provide
> such
> a feature?
>
> I would appreciate an answer to my question.
>
> Thanks in advance.
> Pranay
>
>


Re: two problems of using the lucene.

2006-02-05 Thread jason
Hi,

I try to read the source code of the lucene. But i only find in the
TermScorer.java where the tf/idf measure is really implemented. I guess that
whether the Queryparser class will convert each word into a termquery first.
Then, queries such as the the Booleanquery are built.

The source code of the Queryparser.java is hard to read.


regards
jiang xing

On 2/5/06, Klaus <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> you have to write your own similarity object and pass it to your analyzer.
>
>
> http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html
>
> Cheers,
>
> Klaus
> -Original Message-
> From: xing jiang [mailto:[EMAIL PROTECTED]
> Sent: Sunday, 5 February 2006 04:27
> To: java-user@lucene.apache.org
> Subject: two problems of using the lucene.
>
> Hi,
>
> I got two problems of using the lucene and may need your help.
>
> 1. For each word, how does Lucene calculate its weight? I only know that
> each word in the document will be weighted by its tf/idf value.
>
> 2. Can I modify Lucene so that I use the term frequency instead of the
> tf/idf value to calculate the similarity between documents and queries?
>
> --
> Regards
>
> Jiang Xing
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


understand the queryNorm and the fieldNorm.

2006-02-06 Thread jason
Hi,

I have a problem understanding the queryNorm and fieldNorm.

The following is an example. I tried to follow what is said in the Javadoc -
"Computes the normalization value for a query given the sum of the squared
weights of each of the query terms" - but the result I compute is different.

ID:0 C:/PDF2Text/SearchEngine/File/SIG/sigkdd/p374-zhang.pdf|initial rank: 0
0.31900567 = sum of:
  0.03968133 = weight(contents:associ in 920), product of:
0.60161763 = queryWeight(contents:associ), product of:
  1.326625 = idf(docFreq=830)
  0.45349488 = queryNorm
0.065957725 = fieldWeight(contents:associ in 920), product of:
  4.2426405 = tf(termFreq(contents:associ)=18)
  1.326625 = idf(docFreq=830)
  0.01171875 = fieldNorm(field=contents, doc=920)
  0.27932435 = weight(contents:rule in 920), product of:
0.7987842 = queryWeight(contents:rule), product of:
  1.7613963 = idf(docFreq=537)
  0.45349488 = queryNorm
0.34968686 = fieldWeight(contents:rule in 920), product of:
  16.941074 = tf(termFreq(contents:rule)=287)
  1.7613963 = idf(docFreq=537)
  0.01171875 = fieldNorm(field=contents, doc=920)

regards
jiang xing



Re: understand the queryNorm and the fieldNorm.

2006-02-06 Thread jason
Hi, thanks.

I think I forgot the ^0.5.

Cheers,
Jason
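
Written out, the check Yonik does below is just the quoted Javadoc formula with
boosts of 1:

queryNorm = \frac{1}{\sqrt{\sum_{t \in q} \mathrm{idf}(t)^2}}
          = \frac{1}{\sqrt{1.326625^2 + 1.7613963^2}} \approx 0.45349488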


On 2/6/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
> Hi Jason,
> I get the same thing for the queryNorm when I calculate it by hand:
> 1/((1.7613963**2 + 1.326625**2)**.5)  = 0.45349488111693986
>
> -Yonik
>
> On 2/6/06, jason <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I have a problem of understanding the queryNorm and fieldNorm.
> >
> > The following is an example. I try to follow what said in the Javadoc
> > "Computes the normalization value for a query given the sum of the
> squared
> > weights of each of the query terms". But the result is different.
> >
> > ID:0 C:/PDF2Text/SearchEngine/File/SIG/sigkdd/p374-zhang.pdf|initialrank: 0
> > 0.31900567 = sum of:
> >   0.03968133 = weight(contents:associ in 920), product of:
> > 0.60161763 = queryWeight(contents:associ), product of:
> >   1.326625 = idf(docFreq=830)
> >   0.45349488 = queryNorm
> > 0.065957725 = fieldWeight(contents:associ in 920), product of:
> >   4.2426405 = tf(termFreq(contents:associ)=18)
> >   1.326625 = idf(docFreq=830)
> >   0.01171875 = fieldNorm(field=contents, doc=920)
> >   0.27932435 = weight(contents:rule in 920), product of:
> > 0.7987842 = queryWeight(contents:rule), product of:
> >   1.7613963 = idf(docFreq=537)
> >   0.45349488 = queryNorm
> > 0.34968686 = fieldWeight(contents:rule in 920), product of:
> >   16.941074 = tf(termFreq(contents:rule)=287)
> >   1.7613963 = idf(docFreq=537)
> >   0.01171875 = fieldNorm(field=contents, doc=920)
> >
> > regards
> > jiang xing
> >
> >
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: Stemmer algorithms

2006-02-13 Thread jason
Hi,

I have tested some stemmer algorithms in my application. However, I think
we'd be better off with a weaker algorithm. I mean, the Porter and some
other algorithms are too strong; maybe an algorithm which can convert plural
nouns to singular is enough.

On 2/14/06, Yilmazel, Sibel <[EMAIL PROTECTED]> wrote:
>
> Hello all,
>
> We have done some preliminary research on Porter2 and K-stem algorithms
> and have some questions.
>
> Porter2 was found to be a 'strong' stemming algorithm where it strips
> off both inflectional suffixes (-s, -es, -ed) and derivational suffixes
> (-able, -aciousness, -ability). K-Stem seemed to be a weak stemming
> algorithm as it strips off only the inflectional suffixes (-s, -es,
> -ed).
>
> In IR, it is usually recommended using a "weak" stemmer, as the "weak"
> stemmer seldom hurts performance, but it usually provides significant
> improvement with precision.
>
> However, Porter2 is the most widely used stemming algorithm AND it is a
> 'strong' stemmer which is contrary to what is said above.
>
> Can you share your ideas, experiences with stemmer algorithms? Thanks in
> advance.
>
> Sibel
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Add a module to the lucene

2006-03-14 Thread jason
Hi,

Can we add a module to Lucene so that we are able to use our own similarity
measure to calculate the similarity between documents and queries? As Lucene
has defined its own measure, we can do little with it.

Considering that documents and queries are represented as vectors, we only
need one class to read the vectors and use our own measure to
calculate their similarity.

What do you think of it?

regards
jason



Add more module to the lucene

2006-03-14 Thread jason
Hi,

Can we add more modules to Lucene so that we can easily use our own measures
to calculate similarity between documents and queries? I have read some of
the original Lucene code, and I don't think it is easy to change the
similarity measure used. But I think we can build a module which can read
the document vectors from the index structure. Then we can use our own
similarity measures.
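
A minimal sketch of that idea, assuming the field was indexed with term
vectors enabled (the index path, field name, and query terms are illustrative):

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class CosineFromTermVector {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("indexDir");
        // Term vector of document 0 for the "contents" field (null if the
        // field was not indexed with term vectors).
        TermFreqVector tfv = reader.getTermFreqVector(0, "contents");
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();

        // Toy query vector: each query term with weight 1.
        Map query = new HashMap();
        query.put("association", new Integer(1));
        query.put("rule", new Integer(1));

        double dot = 0, docNorm = 0, queryNorm = 0;
        for (int i = 0; i < terms.length; i++) {
            docNorm += (double) freqs[i] * freqs[i];
            Integer qw = (Integer) query.get(terms[i]);
            if (qw != null) dot += freqs[i] * qw.intValue();
        }
        for (Iterator it = query.values().iterator(); it.hasNext();) {
            int w = ((Integer) it.next()).intValue();
            queryNorm += (double) w * w;
        }
        double cosine = dot / (Math.sqrt(docNorm) * Math.sqrt(queryNorm));
        System.out.println("cosine(doc 0, query) = " + cosine);
        reader.close();
    }
}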


FYI.

Regards

jason.



Re: how to cluster documents

2006-03-21 Thread jason
I guess you should use some text mining tools; you can use Google to find
them. I remember UIUC recently released one such tool. It is very good.

On 3/21/06, Valerio Schiavoni <[EMAIL PROTECTED]> wrote:
>
> Hello,
> not sure if the term 'cluster' is the correct one, but here what i would
> like to do:
> given I have a small set of categories; i manually defined some keywords
> for
> each category.
> ie:
>
> -spielberg: ET, munich, indiana jones;
> -sport: football, basket, volley, etc etc;
>
> then, i have a quite large archive of documents (html, pdf, doc) (~5000,
> still growing) and I want to 'assign' each document
> to those categories, using Lucene possibly (if it can help!).
>
> what approach could I adopt ?
>
> thanks,
> valerio
>
> --
> To Iterate is Human, to Recurse, Divine
> James O. Coplien, Bell Labs
> (how good is to be human indeed)
>
>


for the similarity measure

2006-04-27 Thread jason
Hi,

After reading the code, I found that the similarity measure in Lucene is not
the same as the cosine coefficient commonly used. I don't know whether that
is correct. And I wonder whether I can use the cosine coefficient in Lucene,
or maybe Dice's coefficient, Jaccard's coefficient, or the overlap
coefficient.
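
For reference, the standard set-based definitions of these measures, with X and
Y as the two term sets being compared:

\mathrm{cosine}(X,Y) = \frac{|X \cap Y|}{\sqrt{|X|\,|Y|}}, \qquad
\mathrm{Dice}(X,Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}, \qquad
\mathrm{Jaccard}(X,Y) = \frac{|X \cap Y|}{|X \cup Y|}, \qquad
\mathrm{overlap}(X,Y) = \frac{|X \cap Y|}{\min(|X|,\,|Y|)}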


Re: Vector space model

2006-04-28 Thread jason
Hi,

I am also interested in this problem.

Regards
Jason

On 4/28/06, trupti mulajkar <[EMAIL PROTECTED]> wrote:
>
> hi
>
> i am trying to implement the vector space model for lucene.
> i did find some code for generating the vectors, but can any1 suggest a
> better
> way of creating the IndexReader object as it is the only way that can
> return
> the index created.
>
> cheers,
> trupti mulajkar
> MSc Advanced Computer Science
>
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: frequent keyword computation within a search ( and timeinterval )

2012-01-05 Thread Jason Rutherglen
> Short answer is that no, there isn't an aggregate
> function. And you shouldn't even try

If that is the case why does a 'stats' component exist for Solr with
the SUM function built in?

http://wiki.apache.org/solr/StatsComponent
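
For example, a minimal stats request (the field and core are illustrative;
stats=true and stats.field are the parameters documented on that wiki page):

http://localhost:8983/solr/select?q=item:fruits&rows=0&stats=true&stats.field=price

The stats section of the response then carries sum, min, max, count, and mean
for the price field over the documents matching the query.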

On Thu, Jan 5, 2012 at 1:37 PM, Erick Erickson  wrote:
> You will encounter endless grief until you stop
> thinking of Solr/Lucene as a replacement for
> an RDBMS. It is a *text search engine*.
> Whenever you start asking "how do I implement
> a SQL statement in Solr", you have to stop
> and reconsider *why* you are trying to do that.
> Then recast the question in terms of searching.
>
> Short answer is that no, there isn't an aggregate
> function. And you shouldn't even try.
>
> Best
> Erick
>
> On Thu, Jan 5, 2012 at 12:53 PM, prasenjit mukherjee
>  wrote:
>> Thanks Eric for the response.
>>
>> Will lucene/solr provide me aggregations ( of field vaues ) satisying
>> a query criteria ? e.g. SELECT SUM(price) WHERE item=fruits
>>
>> Or I need to use hitCollector to achieve that ?
>>
>> Any sample solr/lucene query to compte aggregates ( like SUM ) will be great.
>>
>> -Thanks,
>> Prasenjit
>>
>> On Thu, Jan 5, 2012 at 7:10 PM, Erick Erickson  
>> wrote:
>>> the time interval is just a RangeQuery in the Lucene
>>> world. The rest is pretty standard search stuff.
>>>
>>> You probably want to have a look at the NRT
>>> (near real time) stuff in trunk.
>>>
>>> Your reads/writes are pretty high, so you'll need
>>> some experimentation to size your site
>>> correctly.
>>>
>>> Best
>>> Erick
>>>
>>> On Wed, Jan 4, 2012 at 12:17 AM, prasenjit mukherjee
>>>  wrote:
 I have a requirement where reads and writes are quite high ( @ 100-500
 per-sec ). A document has the following fields : timestamp,
 unique-docid,  content-text, keyword. Average content-text length is ~
 20 bytes, there is only 1 keyword for a given docid.

 At runtime, given a query-term ( which could be null ) and a
 time-interval,  I need to find out top-k frequent keywords which
 contains the query-term ( optional if its null )  in its context-text
 field within that time-interval. I can purge the data every day, hence
 no need for me to have more than a days data.

 I have quite a few options here : Starting with MySQL, NoSQLs (
 Cassandra, Mongo, Couch, Riak, Redis ) , Search-Engine based (
 lucene/solr ) each having its own pros/cons.

 In MySQL we can achieve this via : GROUP-BY/COUNT  clause
 In NoSQL I can probably write a map/reduce task to query these
 numbers. Although I am not very sure about the query response time.
 Not sure of we can achieve it via lucene/solr OOB.

 Any suggestions on what would be a good choice for this use case ?

 -Thanks,
 prasenjit

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org

>>>
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>




Re: frequent keyword computation within a search ( and timeinterval )

2012-01-05 Thread Jason Rutherglen
> Although I still question whether this is a *good* use of Solr

It's a great use of Lucene, which can be made into a superior
horizontally scalable database when compared with open source
relational database systems.

My only concern, going back to *other* conversation(s) is whether or
not the field cache used by stats component is operated on per-segment
or not.  If *true* then the stats part of Solr can be checked off as
NRT / soft commit capable / efficient.

I think the answer is *FALSE* based on these lines in StatsComponent
which seem to be operating on the top-level reader (eg, NOT
per-segment).

  si = FieldCache.DEFAULT.getTermsIndex(searcher.getIndexReader(), fieldName);

  UnInvertedField uif = UnInvertedField.getUnInvertedField(f, searcher);

On Thu, Jan 5, 2012 at 4:54 PM, Erick Erickson  wrote:
> Hmmm, guess you're right, the stats component
> does return that data. It's been a long day...
>
> Although I still question whether this is a *good*
> use of Solr, I'd still re-examine my approach
> whenever I found myself trying to translate
> SQL queries into Solr
>
> But if, after that examination I still required
> SUM, stats would do it.
>
> Erick
>
> On Thu, Jan 5, 2012 at 7:23 PM, Jason Rutherglen
>  wrote:
>>> Short answer is that no, there isn't an aggregate
>>> function. And you shouldn't even try
>>
>> If that is the case why does a 'stats' component exist for Solr with
>> the SUM function built in?
>>
>> http://wiki.apache.org/solr/StatsComponent
>>
>> On Thu, Jan 5, 2012 at 1:37 PM, Erick Erickson  
>> wrote:
>>> You will encounter endless grief until you stop
>>> thinking of Solr/Lucene as a replacement for
>>> an RDBMS. It is a *text search engine*.
>>> Whenever you start asking "how do I implement
>>> a SQL statement in Solr", you have to stop
>>> and reconsider *why* you are trying to do that.
>>> Then recast the question in terms of searching.
>>>
>>> Short answer is that no, there isn't an aggregate
>>> function. And you shouldn't even try.
>>>
>>> Best
>>> Erick
>>>
>>> On Thu, Jan 5, 2012 at 12:53 PM, prasenjit mukherjee
>>>  wrote:
>>>> Thanks Eric for the response.
>>>>
>>>> Will lucene/solr provide me aggregations ( of field vaues ) satisying
>>>> a query criteria ? e.g. SELECT SUM(price) WHERE item=fruits
>>>>
>>>> Or I need to use hitCollector to achieve that ?
>>>>
>>>> Any sample solr/lucene query to compte aggregates ( like SUM ) will be 
>>>> great.
>>>>
>>>> -Thanks,
>>>> Prasenjit
>>>>
>>>> On Thu, Jan 5, 2012 at 7:10 PM, Erick Erickson  
>>>> wrote:
>>>>> the time interval is just a RangeQuery in the Lucene
>>>>> world. The rest is pretty standard search stuff.
>>>>>
>>>>> You probably want to have a look at the NRT
>>>>> (near real time) stuff in trunk.
>>>>>
>>>>> Your reads/writes are pretty high, so you'll need
>>>>> some experimentation to size your site
>>>>> correctly.
>>>>>
>>>>> Best
>>>>> Erick
>>>>>
>>>>> On Wed, Jan 4, 2012 at 12:17 AM, prasenjit mukherjee
>>>>>  wrote:
>>>>>> I have a requirement where reads and writes are quite high ( @ 100-500
>>>>>> per-sec ). A document has the following fields : timestamp,
>>>>>> unique-docid,  content-text, keyword. Average content-text length is ~
>>>>>> 20 bytes, there is only 1 keyword for a given docid.
>>>>>>
>>>>>> At runtime, given a query-term ( which could be null ) and a
>>>>>> time-interval,  I need to find out top-k frequent keywords which
>>>>>> contains the query-term ( optional if its null )  in its context-text
>>>>>> field within that time-interval. I can purge the data every day, hence
>>>>>> no need for me to have more than a days data.
>>>>>>
>>>>>> I have quite a few options here : Starting with MySQL, NoSQLs (
>>>>>> Cassandra, Mongo, Couch, Riak, Redis ) , Search-Engine based (
>>>>>> lucene/solr ) each having its own pros/cons.
>>>>>>
>>>>>> In MySQL we can achieve this via : GROUP-BY/COUNT  clause
>>>>>> In NoSQL I can probably write a map/reduce task to query 

date issues

2012-02-22 Thread Jason Toy
I have a Solr instance with about 400m docs. For text searches it is perfectly
fine. When I do searches that calculate the number of times a word appeared in
the doc set for every day of a month, it usually causes Solr to crash with out
of memory errors.
I calculate this by running ~30 queries, one for each day, to get the count for
that day.
Is there a better way I could do this?

Currently the date fields are stored as:


and the timestamps are stored in the format of:
2012-02-22T21:11:14Z

We have no need to store anything beyond the date. Will just changing the time 
portion to zeros make things faster:
2012-02-22T00:00:00Z

I thought that to optimize this there would be an actual date type that doesn't
store the time component, but looking through the Solr docs, I don't see
anything specifically for a date as opposed to a timestamp. Would it be faster
for me to store dates in an sint format? What is the optimal format I should
use? If the format is to continue to be TrieDateField, is it not a waste to
store the hours/minutes/seconds even if they are not being used?

Is there anything else I can do to make this more efficient?

I have looked around on the mailing list and on google and not sure what to 
use, I would appreciate any pointers.  Thanks.

Jason



Re: date issues

2012-02-22 Thread Jason Toy
Can I still do range searches on a string? It seems like it would be more 
efficient to store as an integer.
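
For what it's worth: as long as the string is zero-padded with the most
significant part first (like the YYYYMMDD format suggested below), lexicographic
order matches chronological order, so a plain range query works; the field name
here is illustrative:

day_s:[20120201 TO 20120229]

Storing the same YYYYMMDD value in an integer field also works and keeps range
queries efficient; either way the unused time-of-day component is dropped.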
> Hi,
> 
> You could consider storing the date field as a String in "YYYYMMDD" format.
> This will save space and it will perform better.
> 
> Regards
> Aditya
> www.findbestopensource.com
> 
> 
> On Thu, Feb 23, 2012 at 11:55 AM, Jason Toy  wrote:
> 
>> I  have a solr instance with about 400m docs. For text searches it is
>> perfectly fine. When I do searches that calculate  the amount of times a
>> word appeared in the doc set for every day of a month, it usually causes
>> solr to crash with out of memory errors.
>> I calculate this by running  ~30 queries, one for each day to see the
>> count for that day.
>> Is there a better way I could do this?
>> 
>> Currently the date fields are stored as:
>> > precisionStep="0" positionIncrementGap="0"/>
>> 
>> and the timestamps are stored in the format of:
>> 2012-02-22T21:11:14Z
>> 
>> We have no need to store anything beyond the date. Will just changing the
>> time portion to zeros make things faster:
>> 2012-02-22T00:00:00Z
>> 
>> I thought that to optimize this, there would be an actual date type that
>> doesnt store the time component, but looking through the solr docs, I don't
>> see anything specifically for a date as opposed to a timestamp.  Would it
>> be faster for me to store dates in an sint format?  What is the optimal
>> format I should use? If the format is to continue to use TrieDateField,  is
>> it not a waste to store the hour/minute/seconds even if they are not being
>> used?
>> 
>> Is there anything else I can do to make this more efficient?
>> 
>> I have looked around on the mailing list and on google and not sure what
>> to use, I would appreciate any pointers.  Thanks.
>> 
>> Jason
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
>> 





Re: RAMDirectory unexpectedly slows

2012-06-04 Thread Jason Rutherglen
If you want the index to be stored completely in RAM, there is the
ByteBuffer directory [1].  Though I do not see the point in putting an
index in RAM; it will be cached in RAM regardless by the OS IO
cache.

1. 
https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/apache/lucene/store/bytebuffer/ByteBufferDirectory.java
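
For the on-disk route Jack recommends below, a minimal sketch of opening an
index with MMapDirectory (3.x-style API; the path is illustrative):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

public class OpenMMap {
    public static void main(String[] args) throws Exception {
        // Memory-maps the index files; the OS file system cache keeps the hot
        // pages in RAM without growing the Java heap.
        Directory dir = new MMapDirectory(new File("/path/to/index"));
        IndexReader reader = IndexReader.open(dir);
        System.out.println("docs: " + reader.numDocs());
        reader.close();
        dir.close();
    }
}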

On Mon, Jun 4, 2012 at 10:55 AM, Cheng  wrote:
> My indexes are 500MB+. So it seems like that RAMDirectory is not good for
> that big a size.
>
> My challenge, on the other side, is that I need to update the indexes very
> frequently. So, do you think  MMapDirectory is the solution?
>
> Thanks.
>
> On Mon, Jun 4, 2012 at 10:30 PM, Jack Krupansky 
> wrote:
>
>> From the javadoc for RAMDirectory:
>>
>> "Warning: This class is not intended to work with huge indexes. Everything
>> beyond several hundred megabytes will waste resources (GC cycles), because
>> it uses an internal buffer size of 1024 bytes, producing millions of
>> byte[1024] arrays. This class is optimized for small memory-resident
>> indexes. It also has bad concurrency on multithreaded environments.
>>
>> It is recommended to materialize large indexes on disk and use
>> MMapDirectory, which is a high-performance directory implementation working
>> directly on the file system cache of the operating system, so copying data
>> to Java heap space is not useful."
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Cheng
>> Sent: Monday, June 04, 2012 10:08 AM
>> To: java-user@lucene.apache.org
>> Subject: RAMDirectory unexpectedly slows
>>
>>
>> Hi,
>>
>> My apps need to read from and write to some big indexes frequently. So I
>> use RAMDirectory instead of FSDirectory, and give JVM about 2GB memory
>> size.
>>
>> I notice that the speed of reading and writing unexpectedly slows as the
>> size of the indexes increases. Since the usage of RAM is less than 20%, I
>> think by default the RAMDirectory doesn't take advantage of the memory I
>> assigned to JVM.
>>
>> What are the steps to improve the reading and writing speed of
>> RAMDirectory?
>>
>> Thanks!
>> Jeff
>>
>>
>>




Re: RAMDirectory unexpectedly slows

2012-06-04 Thread Jason Rutherglen
> What about the ByteBufferDirectory? Can this specific directory utilize the
> 2GB memory I grant to the app?

BBD places the byte objects outside of the heap, so increasing the
heap size is only going to rob the system IO cache.  With Lucene the
heap is only used for field caches and the terms dictionary index.

On Mon, Jun 4, 2012 at 11:04 AM, Cheng  wrote:
> Please shed more insight into the difference between JVM heap size and the
> memory size used by Lucene.
>
> What I am getting at is that no matter however much ram I give my apps,
> Lucene can't utilize it. Is that right?
>
> What about the ByteBufferDirectory? Can this specific directory utilize the
> 2GB memory I grant to the app?
>
> On Mon, Jun 4, 2012 at 10:58 PM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> If you want the index to be stored completely in RAM, there is the
>> ByteBuffer directory [1].  Though I do not see the point in putting an
>> index in RAM, it will be cached in RAM regardless in the OS system IO
>> cache.
>>
>> 1.
>> https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/apache/lucene/store/bytebuffer/ByteBufferDirectory.java
>>
>> On Mon, Jun 4, 2012 at 10:55 AM, Cheng  wrote:
>> > My indexes are 500MB+. So it seems like that RAMDirectory is not good for
>> > that big a size.
>> >
>> > My challenge, on the other side, is that I need to update the indexes
>> very
>> > frequently. So, do you think  MMapDirectory is the solution?
>> >
>> > Thanks.
>> >
>> > On Mon, Jun 4, 2012 at 10:30 PM, Jack Krupansky > >wrote:
>> >
>> >> From the javadoc for RAMDirectory:
>> >>
>> >> "Warning: This class is not intended to work with huge indexes.
>> Everything
>> >> beyond several hundred megabytes will waste resources (GC cycles),
>> because
>> >> it uses an internal buffer size of 1024 bytes, producing millions of
>> >> byte[1024] arrays. This class is optimized for small memory-resident
>> >> indexes. It also has bad concurrency on multithreaded environments.
>> >>
>> >> It is recommended to materialize large indexes on disk and use
>> >> MMapDirectory, which is a high-performance directory implementation
>> working
>> >> directly on the file system cache of the operating system, so copying
>> data
>> >> to Java heap space is not useful."
>> >>
>> >> -- Jack Krupansky
>> >>
>> >> -Original Message- From: Cheng
>> >> Sent: Monday, June 04, 2012 10:08 AM
>> >> To: java-user@lucene.apache.org
>> >> Subject: RAMDirectory unexpectedly slows
>> >>
>> >>
>> >> Hi,
>> >>
>> >> My apps need to read from and write to some big indexes frequently. So I
>> >> use RAMDirectory instead of FSDirectory, and give JVM about 2GB memory
>> >> size.
>> >>
>> >> I notice that the speed of reading and writing unexpectedly slows as the
>> >> size of the indexes increases. Since the usage of RAM is less than 20%,
>> I
>> >> think by default the RAMDirectory doesn't take advantage of the memory I
>> >> assigned to JVM.
>> >>
>> >> What are the steps to improve the reading and writing speed of
>> >> RAMDirectory?
>> >>
>> >> Thanks!
>> >> Jeff
>> >>
>> >>
>> >>
>> >>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>




Looking for case studies for 'Lucene and Solr: The Definitive Guide' from O'Reilly

2012-12-17 Thread Jason Rutherglen
This is a great chance to get your Lucene/Solr project included in
'Lucene and Solr: The Definitive Guide' from O'Reilly.  Your case
study should be between 2  - 15 pages in length and may include code
from any programming language, diagrams, schemas, etc.

Topics of interest for the case studies chapter are:

* Lucene and Solr in the Enterprise

* Spatial search

* Relevancy tuning

* Big Data with Solr/Lucene

* Scalability and Performance Tuning

* Real Time Search

* Faceting

* Multiple-languages

* Indexing and Analysis Techniques

* Datastax Enterprise Solr

* Solr Cloud

* Hadoop integration


Thanks,

Jason Rutherglen, Jack Krupansky, and Ryan Tabora

http://shop.oreilly.com/product/0636920028765.do




Lucene VSM scoring

2013-07-09 Thread Jason Z.
Hi,

In the Lucene docs it mentions that Lucene implements a tf-idf weighting
scheme for scoring. Is there any way to modify Lucene to implement a custom
weighting scheme for the VSM?

Thank you.
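
One common route (a minimal sketch against the Lucene 3.x API; in 4.x the
Similarity classes live in org.apache.lucene.search.similarities and some
signatures differ) is to subclass DefaultSimilarity and override the pieces of
the tf-idf formula you want to change:

import org.apache.lucene.search.DefaultSimilarity;

public class RawTfSimilarity extends DefaultSimilarity {
    // Use the raw term frequency instead of the default sqrt(freq).
    public float tf(float freq) {
        return freq;
    }
    // Disable the idf component entirely.
    public float idf(int docFreq, int numDocs) {
        return 1.0f;
    }
}

// Usage: install it on both the IndexWriter (so norms are computed the same
// way) and the searcher, e.g.
//   writer.setSimilarity(new RawTfSimilarity());
//   searcher.setSimilarity(new RawTfSimilarity());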


Monitoring low level IO

2010-06-03 Thread Jason Rutherglen
This is more of a unix related question than Lucene specific
however because Lucene is being used, I'm asking here as perhaps
other people have run into a similar issue.

On Amazon EC2, merge, read, and write operations are possibly
blocking due to underlying IO. Is there a tool that you have
used to monitor this type of thing?




CFP for Surge Scalability Conference 2010

2010-06-14 Thread Jason Dixon
We're excited to announce Surge, the Scalability and Performance
Conference, to be held in Baltimore on Sept 30 and Oct 1, 2010.  The
event focuses on case studies that demonstrate successes (and failures)
in Web applications and Internet architectures.

Our Keynote speakers include John Allspaw and Theo Schlossnagle.  We are
currently accepting submissions for the Call For Papers through July
9th.  You can find more information, including our current list of
speakers, online:

http://omniti.com/surge/2010

If you've been to Velocity, or wanted to but couldn't afford it, then
Surge is just what you've been waiting for.  For more information,
including CFP, sponsorship of the event, or participating as an
exhibitor, please contact us at su...@omniti.com.

Thanks,

-- 
Jason Dixon
OmniTI Computer Consulting, Inc.
jdi...@omniti.com
443.325.1357 x.241




Re: Last Call: Lucene Revolution CFP Closes Tomorrow Wednesday, June 23, 2010, 12 Midnight PDT

2010-06-22 Thread Jason Rutherglen
Grant,

I can probably do the 3 billion document one from Prague, or a
realtime search one... I spaced on submitting for ApacheCon.

Are there cool places in the Carolinas to hang?

Cheers bro,

Jason



On Tue, Jun 22, 2010 at 10:51 AM, Grant Ingersoll
 wrote:
> Lucene Revolution Call For Participation - Boston, Massachusetts October 7 & 
> 8, 2010
>
> The first US conference dedicated to Lucene and Solr is coming to Boston, 
> October 7 & 8, 2010. The conference is sponsored by Lucid Imagination with 
> additional support from community and other commercial co‐sponsors. The 
> audience will include those experienced Solr and Lucene application 
> development, along with those experienced in other enterprise search 
> technologies interested becoming more familiar with Solr and Lucene 
> technologies and the opportunities they present.
>
> We are soliciting 45‐minute presentations for the conference.
>
> Key Dates:
> May 12, 2010         Call For Participation Open
> June 23, 2010        Call For Participation Closes
> June 28, 2010        Speaker Acceptance/Rejection Notification
> October 5‐6, 2010  Lucene and Solr Pre‐conference Training Sessions
> October 7‐8, 2010  Conference Sessions
>
>
> Topics of interest include:
> Lucene and Solr in the Enterprise (case studies, implementation, return on 
> investment, etc.)
>  “How We Did It” Development Case Studies
> Spatial/Geo search
>  Lucene and Solr in the Cloud (Deployment cases as well as tutorials)
> Scalability and Performance Tuning
> Large Scale Search
> Real Time Search
> Data Integration/Data Management
> Lucene & Solr for Mobile Applications
>
> All accepted speakers will qualify for discounted conference admission. 
> Financial assistance is available for speakers that qualify.
>
> To submit a 45‐minute presentation proposal, please send an email to
> c...@lucenerevolution.org with a Subject containing your name and the proposed
> session title, and include the following information in plain text.
>
> If you have more than one topic proposed, send a separate email. Do not 
> attach Word or other text file documents.
>
> Return all fields completed as follows:
> 1.    Your full name, title, and organization
> 2.    Contact information, including your address, email, phone number
> 3.    The name of your proposed session (keep your title simple, interesting, 
> and relevant to the topic)
> 4.    A 75‐200 word overview of your presentation; in addition to the topic, 
> describe whether your
> presentation is intended as a tutorial, description of an implementation, an 
> theoretical/academic
> discussion, etc.
> 5.    A 100‐200‐word speaker bio that includes prior conference speaking or 
> related experience
> To be considered, proposals must be received by 12 Midnight PDT Wednesday, 
> June 23, 2010.
>
> Please email any general questions regarding the conference to 
> i...@lucenerevolution.org. To be added to the conference mailing list, please 
> email sig...@lucenerevolution.org. If your organization is interested in 
> sponsorship opportunities, email spon...@lucenerevolution.org.
>
> We look forward to seeing you in Boston!




CFP for Surge Scalability Conference 2010

2010-07-02 Thread Jason Dixon
A quick reminder that there's one week left to submit your abstract for
this year's Surge Scalability Conference.  The event is taking place on
Sept 30 and Oct 1, 2010 in Baltimore, MD.  Surge focuses on case studies
that address production failures and the re-engineering efforts that led
to victory in Web Applications or Internet Architectures.

Our Keynote speakers include John Allspaw and Theo Schlossnagle.  We are
currently accepting submissions for the Call For Papers through July
9th.  You can find more information, including suggested topics and our
current list of speakers, online:

http://omniti.com/surge/2010

I'd also like to urge folks who are planning to attend, to get your
session passes sooner rather than later.  We have limited seating and we
are on track to sell out early.  For more information, including the
CFP, sponsorship of the event, or participating as an exhibitor, please
visit the Surge website or contact us at su...@omniti.com.

Thanks,

-- 
Jason Dixon
OmniTI Computer Consulting, Inc.
jdi...@omniti.com
443.325.1357 x.241




Last day to submit your Surge 2010 CFP!

2010-07-09 Thread Jason Dixon
Today is your last chance to submit a CFP abstract for the 2010 Surge
Scalability Conference.  The event is taking place on Sept 30 and Oct 1,
2010 in Baltimore, MD.  Surge focuses on case studies that address
production failures and the re-engineering efforts that led to victory
in Web Applications or Internet Architectures.

You can find more information, including suggested topics and our
current list of speakers, online:

http://omniti.com/surge/2010

The final lineup should be available on the conference website next
week.  If you have questions about the CFP, attending Surge, or having
your business sponsor/exhibit at Surge 2010, please contact us at
su...@omniti.com.

Thanks!

-- 
Jason Dixon
OmniTI Computer Consulting, Inc.
jdi...@omniti.com
443.325.1357 x.241




Register now for Surge 2010

2010-08-02 Thread Jason Dixon
Registration for Surge Scalability Conference 2010 is open for all
attendees!  We have an awesome lineup of leaders from across the various
communities that support highly scalable architectures, as well as the
companies that implement them.  Here's a small sampling from our list of
speakers:

John Allspaw, Etsy
Theo Schlossnagle, OmniTI
Rasmus Lerdorf, creator of PHP
Tom Cook, Facebook
Benjamin Black, fast_ip
Artur Bergman, Wikia
Christopher Brown, Opscode
Bryan Cantrill, Joyent
Baron Schwartz, Percona
Paul Querna, Cloudkick

Surge 2010 focuses on real case studies from production environments;
the lessons learned from failure and how to re-engineer your way to a
successful, highly scalable Internet architecture.  The conference takes
place at the Tremont Grand Historic Venue on Sept 30 and Oct 1, 2010 in
Baltimore, MD.  Register now to enjoy the Early Bird discount and
guarantee your seat to this year's event!

http://omniti.com/surge/2010/register

Thanks,

-- 
Jason Dixon
OmniTI Computer Consulting, Inc.
jdi...@omniti.com
443.325.1357 x.241

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Surge 2010 Early Registration ends Tuesday!

2010-08-27 Thread Jason Dixon
Early Bird Registration for Surge Scalability Conference 2010 ends next
Tuesday, August 31.  We have a killer lineup of speakers and architects
from across the Internet.  Listen to experts talk about the newest
methods and technologies for scaling your Web presence.

http://omniti.com/surge/2010/register

This year's event is all about the challenges faced (and overcome) in
real-life production architectures.  Meet the engineering talent from
some of the best and brightest throughout the Internet:

John Allspaw, Etsy
Theo Schlossnagle, OmniTI
Bryan Cantrill, Joyent
Rasmus Lerdorf, creator of PHP
Tom Cook, Facebook
Benjamin Black, fast_ip
Christopher Brown, Opscode
Artur Bergman, Wikia
Baron Schwartz, Percona
Paul Querna, Cloudkick

Surge 2010 takes place at the Tremont Grand Historic Venue on Sept 30
and Oct 1, 2010 in Baltimore, MD.  Register NOW for the Early Bird
discount and guarantee your seat to this year's event!


-- 
Jason Dixon
OmniTI Computer Consulting, Inc.
jdi...@omniti.com
443.325.1357 x.241

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Recreate segment infos

2010-10-04 Thread Jason Rutherglen
Let's say the segment infos file is missing; I'm aware of
CheckIndex, but is there a tool to recreate a segment infos file?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Recreate segment infos

2010-10-05 Thread Jason Rutherglen
I'm not sure how it vanished, I think it was on a Solr rsync based
replication operation, and a lack of disk space.  Need to move to the
Java replication and get larger SSD drives, working on both, at least
they're SSDs, making some progress.  I was going to recover using the
IDs in the terms dict however there should be 130 mil and there were
only 16 mil.  So even if I had a way to recover, the index is far too
incomplete.  This is where re-indexing in Hadoop is coming in handy.

On Tue, Oct 5, 2010 at 3:26 AM, Michael McCandless
 wrote:
> How did you lose your segments file...?
>
> This was discussed before but I don't think the idea ever turned into a tool.
>
> I think it should be possible.  You'd have to sort all files, deriving
> segment names from the prefixes.  Then, you have to reconstruct the
> metadata required for SegmentInfo.  EG open the fdx file to get
> numDocs, the .del file to get delCount, check for prx file to set
> .haxProx, etc.
>
> You'd have to carefully map segment -> doc store segment.  Multiple
> segments in a row may share the same docStore segment.  In this case
> the docStore segment is given the same name as the first segment that
> shares it.  However, unfortunately, because of merging, it's possible
> that this mapping is not easy (maybe not possible, depending on the
> merge policy...) to reconstruct.  I think this'll be the hardest part
> :)
>
> Mike
>
> On Mon, Oct 4, 2010 at 3:25 PM, Jason Rutherglen
>  wrote:
>> Lets say the segment infos file is missing, and I'm aware of
>> CheckIndex, however is there a tool to recreate a segment infos file?
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: API access to in-memory tii file (3.x not flex).

2010-11-10 Thread Jason Rutherglen
In a word, no.  You'd need to customize the Lucene source to accomplish this.

On Wed, Nov 10, 2010 at 1:02 PM, Burton-West, Tom  wrote:
> Hello all,
>
> We have an extremely large number of terms in our indexes.  I want to be able 
> to extract a sample of the terms, say something like every 128th term.   If I 
> use code based on org.apache.lucene.misc.HighFreqTerms or 
> org.apache.lucene.index.CheckIndex I would get a TermsEnum, call 
> termEnum.next() 128 times, grab the term and then call next another 128 times.
> termEnum = reader.terms();
> while (termEnum.next()) {
> }
>
> Since the tii file contains every 128th (or IndexInterval ) term and it is 
> loaded into memory, is there some programmatic way (in the public API) to 
> read that data structure in memory rather than having to force Lucene to 
> actually read the entire tis file by using termEnum.next() ?
>
>
> Tom Burton-West
> http://www.hathitrust.org/blogs/large-scale-search
>
>
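
For reference, a runnable version of the sampling loop sketched above, using
only the public 3.x API. It still walks the entire .tis file via next(), which
is exactly the cost Tom is hoping to avoid; it is only meant to make the
pseudocode concrete (imports of TermEnum, Term, List and ArrayList omitted):

TermEnum termEnum = reader.terms();          // IndexReader opened elsewhere
int interval = 128;                          // matches the default terms index interval
int count = 0;
List<Term> sample = new ArrayList<Term>();
while (termEnum.next()) {
  if (count++ % interval == 0) {
    sample.add(termEnum.term());             // keep every 128th term
  }
}
termEnum.close();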

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: API access to in-memory tii file (3.x not flex).

2010-11-10 Thread Jason Rutherglen
Yeah that's customizing the Lucene source. :)  I should have gone into
more detail, I will next time.

On Wed, Nov 10, 2010 at 2:10 PM, Michael McCandless
 wrote:
> Actually, the .tii file pre-flex (3.x) is nearly identical to the .tis
> file, just that it only contains every 128th term.
>
> If you just make SegmentTermEnum public (or, sneak your class into
> oal.index package) then you can instantiate SegmentTermsEnum passing
> it an IndexInput opened on the .tii file.
>
> Then you can enum the terms directly...
>
> Mike
>
> On Wed, Nov 10, 2010 at 4:02 PM, Burton-West, Tom  wrote:
>> Hello all,
>>
>> We have an extremely large number of terms in our indexes.  I want to be 
>> able to extract a sample of the terms, say something like every 128th term.  
>>  If I use code based on org.apache.lucene.misc.HighFreqTerms or 
>> org.apache.lucene.index.CheckIndex I would get a TermsEnum, call 
>> termEnum.next() 128 times, grab the term and then call next another 128 
>> times.
>> termEnum = reader.terms();
>> while (termEnum.next()) {
>> }
>>
>> Since the tii file contains every 128th (or IndexInterval ) term and it is 
>> loaded into memory, is there some programmatic way (in the public API) to 
>> read that data structure in memory rather than having to force Lucene to 
>> actually read the entire tis file by using termEnum.next() ?
>>
>>
>> Tom Burton-West
>> http://www.hathitrust.org/blogs/large-scale-search
>>
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Storing an ID alongside a document

2011-02-02 Thread Jason Rutherglen
I'm curious if there's a new way (using flex or term states) to store
IDs alongside a document and retrieve the IDs of the top N results?
The goal would be to minimize HD seeks, and not use field caches
(because they consume too much heap space) or the doc stores (which
require two seeks).  One possible way using the pre-flex system is to
place the IDs into a payload posting that would match all documents,
and then [somehow] retrieve the payload only when needed.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Storing an ID alongside a document

2011-02-02 Thread Jason Rutherglen
Is it?  I thought it would load the values into heap RAM like the
field cache and in addition save the values to disk?  Does it also
read the values directly from disk?

On Wed, Feb 2, 2011 at 2:00 PM, Yonik Seeley  wrote:
> That's exactly what the CSF feature is for, right?  (docvalues branch)
>
> -Yonik
> http://lucidimagination.com
>
>
> On Wed, Feb 2, 2011 at 1:03 PM, Jason Rutherglen > wrote:
>
>> I'm curious if there's a new way (using flex or term states) to store
>> IDs alongside a document and retrieve the IDs of the top N results?
>> The goal would be to minimize HD seeks, and not use field caches
>> (because they consume too much heap space) or the doc stores (which
>> require two seeks).  One possible way using the pre-flex system is to
>> place the IDs into a payload posting that would match all documents,
>> and then [somehow] retrieve the payload only when needed.
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Storing an ID alongside a document

2011-02-03 Thread Jason Rutherglen
> there is a entire RAM resident part and a Iterator API that reads /
> streams data directly from disk.
> look at DocValuesEnum vs, Source

Nice, thanks!
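
For later readers, a minimal sketch of what this looks like with the doc
values API as it eventually shipped in Lucene 4.x (the field name "id" is
illustrative, and null checks are omitted):

// index time: store the ID as a per-document value (column-stride field)
Document doc = new Document();
doc.add(new NumericDocValuesField("id", 12345L));
writer.addDocument(doc);

// search time: resolve IDs for the top N hits without touching stored fields
TopDocs top = searcher.search(query, 10);
List<AtomicReaderContext> leaves = searcher.getIndexReader().leaves();
for (ScoreDoc sd : top.scoreDocs) {
  AtomicReaderContext leaf = leaves.get(0);
  for (AtomicReaderContext ctx : leaves) {
    if (ctx.docBase <= sd.doc) {
      leaf = ctx;                            // leaves are ordered by docBase
    }
  }
  NumericDocValues ids = leaf.reader().getNumericDocValues("id");
  long id = ids.get(sd.doc - leaf.docBase);  // RAM-resident or disk-backed, per codec
}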

On Thu, Feb 3, 2011 at 12:20 AM, Simon Willnauer
 wrote:
> On Thu, Feb 3, 2011 at 3:23 AM, Jason Rutherglen
>  wrote:
>> Is it?  I thought it would load the values into heap RAM like the
>> field cache and in addition save the values to disk?  Does it also
>> read the values directly from disk?
>
> there is a entire RAM resident part and a Iterator API that reads /
> streams data directly from disk.
> look at DocValuesEnum vs, Source
>
> simon
>>
>> On Wed, Feb 2, 2011 at 2:00 PM, Yonik Seeley  
>> wrote:
>>> That's exactly what the CSF feature is for, right?  (docvalues branch)
>>>
>>> -Yonik
>>> http://lucidimagination.com
>>>
>>>
>>> On Wed, Feb 2, 2011 at 1:03 PM, Jason Rutherglen >>> wrote:
>>>
>>>> I'm curious if there's a new way (using flex or term states) to store
>>>> IDs alongside a document and retrieve the IDs of the top N results?
>>>> The goal would be to minimize HD seeks, and not use field caches
>>>> (because they consume too much heap space) or the doc stores (which
>>>> require two seeks).  One possible way using the pre-flex system is to
>>>> place the IDs into a payload posting that would match all documents,
>>>> and then [somehow] retrieve the payload only when needed.
>>>>
>>>> -
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>
>>>>
>>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Last/max term in Lucene 4.x

2011-02-18 Thread Jason Rutherglen
This could be a rhetorical question.  The way to find the last/max
term that is a unique per document is to use TermsEnum to seek to the
first term of a field, then call seek to the docFreq-1 for the last
ord, then get the term, or is there a better/faster way?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Last/max term in Lucene 4.x

2011-02-19 Thread Jason Rutherglen
> Instead of docFreq, did you mean numUniqueTerms?

Right.

> But you have to
> use a terms index impl that supports ord (eg FixedGap).

Ok, and the VariableGap is the new standard because the FST is much
more efficient as a terms index?  Perhaps I'd need to create a codec
(or patch the existing) to automatically store the max term?
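
A rough sketch of the ord-based seek being discussed, against the 4.x API as
it finally shipped; it only works with a terms index that supports ord (e.g.
FixedGap), and the field name is illustrative:

BytesRef max = null;
for (AtomicReaderContext leaf : reader.leaves()) {   // ords are per segment
  Terms terms = leaf.reader().terms("id");
  if (terms == null || terms.size() <= 0) {
    continue;                                        // size() is -1 if the codec cannot report it
  }
  TermsEnum te = terms.iterator(null);
  te.seekExact(terms.size() - 1);                    // seek by ord to the last term
  BytesRef t = te.term();
  if (max == null || t.compareTo(max) > 0) {
    max = BytesRef.deepCopyOf(t);                    // keep the largest across segments
  }
}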

On Sat, Feb 19, 2011 at 3:33 AM, Michael McCandless
 wrote:
> I don't quite understand your question Jason...
>
> Seeking to the first term of the field just gets you the smallest term
> (in unsigned byte[] order, ie Unicode order if the byte[] is UTF8)
> across all docs.
>
> Instead of docFreq, did you mean numUniqueTerms?  Ie, you want to seek
> to the largest term for that field?  In which case, yes seeking by
> term ord to numUniqueTerms-1 gets you to that term.  But you have to
> use a terms index impl that supports ord (eg FixedGap).
>
> Mike
>
> On Fri, Feb 18, 2011 at 9:24 PM, Jason Rutherglen
>  wrote:
>> This could be a rhetorical question.  The way to find the last/max
>> term that is a unique per document is to use TermsEnum to seek to the
>> first term of a field, then call seek to the docFreq-1 for the last
>> ord, then get the term, or is there a better/faster way?
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
>
>
> --
> Mike
>
> http://blog.mikemccandless.com
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Last/max term in Lucene 4.x

2011-02-20 Thread Jason Rutherglen
> Though, if you just want to get to the last term... VarGap's terms
> index can quickly tell you the last indexed term, and from there you
> can scan to the last term?  (It'd be at most 32 (by default) scans).

In VariableGapTermsIndexReader, IndexEnum doesn't support ord.  How
would I seek to the last term in the index using VarGaps?  Or do I
need to interact directly with the FST class (and if so I'm not sure
what to do there either).

Thanks Mike.

On Sun, Feb 20, 2011 at 2:51 PM, Michael McCandless
 wrote:
> On Sat, Feb 19, 2011 at 8:42 AM, Jason Rutherglen
>  wrote:
>
>>> But you have to
>>> use a terms index impl that supports ord (eg FixedGap).
>>
>> Ok, and the VariableGap is the new standard because the FST is much
>> more efficient as a terms index?  Perhaps I'd need to create a codec
>> (or patch the existing) to automatically store the max term?
>
> It's really easy to make a codec that eg copies Standard but swaps in
> FixedGap terms index instead...
>
> Though, if you just want to get to the last term... VarGap's terms
> index can quickly tell you the last indexed term, and from there you
> can scan to the last term?  (It'd be at most 32 (by default) scans).
>
> --
> Mike
>
> http://blog.mikemccandless.com
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Last/max term in Lucene 4.x

2011-02-21 Thread Jason Rutherglen
> Maybe we need a seekFloor in the TermsEnum?  (What we have now is
> really seekCeil).  But, what's the larger use case here..?

I opened an issue LUCENE-2930 to simply store the last/max term,
however the seekFloor would work just as well.  The use case is
finding the last of the ordered IDs stored in the index, so that
remaining documents (that lets say were left in RAM prior to process
termination) can be indexed.  It's an inferred transaction checkpoint.

On Mon, Feb 21, 2011 at 5:31 AM, Michael McCandless
 wrote:
> On Sun, Feb 20, 2011 at 8:47 PM, Jason Rutherglen
>  wrote:
>>> Though, if you just want to get to the last term... VarGap's terms
>>> index can quickly tell you the last indexed term, and from there you
>>> can scan to the last term?  (It'd be at most 32 (by default) scans).
>>
>> In VariableGapTermsIndexReader, IndexEnum doesn't support ord.  How
>> would I seek to the last term in the index using VarGaps?  Or do I
>> need to interact directly with the FST class (and if so I'm not sure
>> what to do there either).
>
> Right, you'd have to work directly w/ the FSTEnum (ie, code changes).
> The FSTEnum "feels" like a TreeMap, so, you can eg seekFloor to eg
> 0x, get the term, seek the TermsEnum there, then .next() until
> you hit the end.
>
> Maybe we need a seekFloor in the TermsEnum?  (What we have now is
> really seekCeil).  But, what's the larger use case here..?
>
> Mike
>
> --
> Mike
>
> http://blog.mikemccandless.com
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Proper way to deal with shared indexer exception

2011-02-25 Thread Jason Tesser
We are having issues with FileChannelClosed and are NOT calling
Thread.interrupt.  We also start to see AlreadyClosedException on Reader.


we are running the latest 3.0.3


We have code in my lucene Util class like this  http://pastebin.com/ifbxhVLi


we have a single shared searcher and a single writer which is only checked
out once not shared single threaded http://pastebin.com/YF8nmwg0


we used to call destroy on all the caches in the first pastebin, which I
think is a problem


 1. what would be the recommended way here?


in other words if I catch AlreadyClosedException ace OR
ClosedChannelException OR IOException what would be the best to do with my
shared searcher


2.  is reopen enough?  or should I get a brand new searcher?


Thanks,
Jason Tesser
dotCMS Lead Development Manager
1-305-858-1422
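
For what it's worth, a minimal sketch of the reopen-and-swap pattern on 3.0.x
(the shared field names and the lock object are illustrative; real code needs
reference counting so in-flight searches are not cut off):

synchronized (searcherLock) {
  IndexReader current = searcher.getIndexReader();
  IndexReader newReader = current.reopen();   // returns the same instance if nothing changed
  if (newReader != current) {
    searcher = new IndexSearcher(newReader);
    // searches should incRef()/decRef() the reader they use; this decRef()
    // then closes the old reader only once the last in-flight search finishes
    current.decRef();
  }
}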


Is ConcurrentMergeScheduler useful for multiple running IndexWriter's?

2011-03-04 Thread Jason Rutherglen
ConcurrentMergeScheduler is tied to a specific IndexWriter, however if
we're running in an environment (such as Solr's multiple cores, and
other similar scenarios) then we'd have a CMS per IW.  I think this
effectively disables CMS's max thread merge throttling feature?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Append Codec random testing

2011-03-21 Thread Jason Rutherglen
I'm seeing an error when using the misc Append codec.

java.lang.AssertionError
at 
org.apache.lucene.store.ByteArrayDataInput.readBytes(ByteArrayDataInput.java:107)
at 
org.apache.lucene.index.codecs.BlockTermsReader$FieldReader$SegmentTermsEnum._next(BlockTermsReader.java:661)
at 
org.apache.lucene.index.codecs.BlockTermsReader$FieldReader$SegmentTermsEnum.next(BlockTermsReader.java:639)

And am wondering if it's related to the codec or not.  The unit tests
for Append are minimal?  Perhaps we can/should include it in the
random codecs chosen for unit testing?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: DocIdSet to represent small numberr of hits in large Document set

2011-04-05 Thread Jason Rutherglen
I think Solr has a HashDocSet implementation?
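
On the "no way to build a SortedVIntList" point: one sketch, against the 2.9.x
API, is to collect the matching doc ids in order and hand the sorted array to
the constructor (collector boilerplate trimmed; the ids must be increasing):

final List<Integer> hits = new ArrayList<Integer>();
searcher.search(query, new Collector() {
  private int base;
  public void setScorer(Scorer scorer) {}
  public void setNextReader(IndexReader reader, int docBase) { base = docBase; }
  public void collect(int doc) { hits.add(base + doc); }
  public boolean acceptsDocsOutOfOrder() { return false; }
});
int[] sorted = new int[hits.size()];
for (int i = 0; i < sorted.length; i++) {
  sorted[i] = hits.get(i);
}
DocIdSet compact = new SortedVIntList(sorted);   // far smaller than a BitSet for sparse sets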

On Tue, Apr 5, 2011 at 3:19 AM, Michael McCandless
 wrote:
> Can we simply factor out (poach!) those useful-sounding classes from
> Nutch into Lucene?
>
> Mike
>
> http://blog.mikemccandless.com
>
> On Tue, Apr 5, 2011 at 2:24 AM, Antony Bowesman  
> wrote:
>> I'm converting a Lucene 2.3.2 to 2.4.1 (with a view to going to 2.9.4).
>>
>> Many of our indexes are 5M+ Documents, however, only a small subset of these
>> are relevant to any user.  As a DocIdSet, backed by a BitSet or OpenBitSet,
>> is rather inefficient in terms of memory use, what is the recommended way to
>> DocIdSet implementation to use in this scenario?
>>
>> Seems like SortedVIntList can be used to store the info, but it has no
>> methods to build the list in the first place, requiring an array or bitset
>> in the constructor.
>>
>> I had used Nutch's DocSet and HashDocSet implementations in my 2.3.2
>> deployment, but want to move away from that Nutch dependency, so wondered if
>> Lucene had a way to do this?
>>
>> Thanks
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Lucene Util question

2011-04-08 Thread Jason Rutherglen
Is http://code.google.com/a/apache-extras.org/p/luceneutil/ designed
to replace or augment the contrib benchmark?  For example it looks
like SearchPerfTest would be useful for executing queries over a
pre-built index.  Though there's no indexing tool in the code tree?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



found a bug, not sure if its lucene or solr

2011-06-03 Thread Jason Toy
Greetings all,
I found a bug today while trying to upgrade solr  from 1.4.1 to 3.1  I'm not
sure if this is a lucene or solr problem, I normally use lucene through
solr. I've already posted this to the solr mailing list, but I wanted to
notify the lucene group also.

In 1.4.1 I was able to insert this  doc:
User
14914457UserSan
Franciscojtoyjtoylife
hacker0.05


And then I can run the query:

http://localhost:8983/solr/select?q=life&qf=description_text&defType=dismax&sort=scores:rails_f+desc

and I will get results.

If I insert the same document into solr 3.1 and run the same query I get the
error:

Problem accessing /solr/select. Reason:

undefined field scores

For some reason, solr has cutoff the column name from the colon
forward so "scores:rails_f" becomes "scores"

I can see in the lucene index that the data for scores:rails_f is in
the document. For that reason I believe the bug is in solr and not in
lucene, but I'm not certain.





Jason Toy
socmetrics
http://socmetrics.com
@jtoy


Re: Index size and performance degradation

2011-06-13 Thread Jason Rutherglen
> I don't think we'd do the post-filtering solution, but instead maybe
> resolve the deletes "live" and store them in a transactional data

I think Michael B. aptly described the sequence ID approach for 'live' deletes?

On Mon, Jun 13, 2011 at 3:00 PM, Michael McCandless
 wrote:
> Yes, adding deletes to Twitter's approach will be a challenge!
>
> I don't think we'd do the post-filtering solution, but instead maybe
> resolve the deletes "live" and store them in a transactional data
> structure of some kind... but even then we will pay a perf hit to
> lookup del docs against it.
>
> So, yeah, there will presumably be a tradeoff with this approach too.
> However, turning around changes from the adds should be faster (no
> segment gets flushed).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Mon, Jun 13, 2011 at 5:06 PM, Itamar Syn-Hershko  
> wrote:
>> Thanks Mike, much appreciated.
>>
>>
>> Wouldn't Twitter's approach fall for the exact same pit-hole you described
>> Zoie does (or did) when it'll handle deletes too? I don't thing there is any
>> other way of handling deletes other than post-filtering results. But perhaps
>> the IW cache would be smaller than Zoie's RAMDirectory(ies)?
>>
>>
>> I'll give all that a serious dive and report back with results or if more
>> input will be required...
>>
>>
>> Itamar.
>>
>>
>> On 13/06/2011 19:01, Michael McCandless wrote:
>>
>>> Here's a blog post describing some details of Twitter's approach:
>>>
>>>
>>> http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html
>>>
>>> And here's a talk Michael did last October (Lucene Revolutions):
>>>
>>>
>>> http://www.lucidimagination.com/events/revolution2010/video-Realtime-Search-With-Lucene-presented-by-Michael-Busch-of-Twitter
>>>
>>> Twitter's case is simpler since they never delete ;)  So we have to
>>> fix that to do it in Lucene... there are also various open issues that
>>> begin to explore some of the ideas here.
>>>
>>> But this ("immediate consistency") would be a deep and complex change,
>>> and I don't see many apps that actually require it.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Sun, Jun 12, 2011 at 4:46 PM, Itamar Syn-Hershko
>>>  wrote:

 Thanks for your detailed answer. We'll have to tackle this and see whats
 more important to us then. I'd definitely love to hear Zoie has overcame
 all
 that...


 Any pointers to Michael Busch's approach? I take this has something to do
 with the core itself or index format, probably using the Flex version?


 Itamar.


 On 12/06/2011 23:12, Michael McCandless wrote:

>>  From what I understand of Zoie (and it's been some time since I last
>
> looked... so this could be wrong now), the biggest difference vs NRT
> is that Zoie aims for "immediate consistency", ie index changes are
> always made visible to the very next query, vs NRT which is
> "controlled consistency", a blend between immediate and eventual
> consistency where your app decides when the changes must become
> visible.
>
> But in exchange for that, Zoie pays a price: each search has a higher
> cost per collected hit, since it must post-filter for deleted docs.
> And since Zoie necessarily adds complexity, there's more risk; eg
> there were some nasty Zoie bugs that took quite some time to track
> down (under https://issues.apache.org/jira/browse/LUCENE-2729).
>
> Anyway, I don't think that's a good tradeoff, in general, for our
> users, because very few apps truly require immediate consistency from
> Lucene (can anyone give an example where their app depends on
> immediate consistency...?).  I think it's better to spend time during
> reopen so that searches aren't slower.
>
> That said, Lucene has already incorporated one big part of Zoie
> (caching small segments in RAM) via the new NRTCachingDirectory (in
> contrib/misc).  Also, the upcoming NRTManager
> (https://issues.apache.org/jira/browse/LUCENE-2955) adds control over
> visibility of specific indexing changes to queries that need to see
> the changes.
>
> Finally, even better would be to not have to make any tradeoff
> whatsoever ;)  Twitter's approach (created by Michael Busch) seems to
> bring immediate consistency with no search performance hit, so if we
> do anything here likely it'll be similar to what Michael has done
> (though, those changes are not simple either!).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Sun, Jun 12, 2011 at 2:25 PM, Itamar Syn-Hershko
>  wrote:
>>
>> Mike,
>>
>>
>> Speaking of NRT, and completely off-topic, I know: Lucene's NRT
>> apparently
>> isn't fast enough if Zoie was needed, and now that Zoie is around are
>> there
>> any plans to make it Lucene's default? or: why would one sti

Re: Index size and performance degradation

2011-06-13 Thread Jason Rutherglen
> deletions made by readers merely mark it for
> deletion, and once a doc has been marked for deletions it is deleted for all
> intents and purposes, right?

There's the point-in-timeness of a reader to consider.

> Does the N in NRT represent only the cost of reopening a searcher?

Aptly put, and yes basically.

> the only thing that comes to mind is the IW unflushed buffer

This is LUCENE-2312.

On Mon, Jun 13, 2011 at 3:19 PM, Itamar Syn-Hershko  wrote:
> Since there should only be one writer, I'm not sure why you'd need
> transactional storage for that? deletions made by readers merely mark it for
> deletion, and once a doc has been marked for deletions it is deleted for all
> intents and purposes, right? But perhaps I need to refresh my memory on the
> internals, it has been a while.
>
> Does the N in NRT represent only the cost of reopening a searcher? meaning,
> if I could ensure reopening always happens fast and returns a searcher for
> the correct index revision, would it guarantee a real real-time search? or
> is there anything else standing in between? the only thing that comes to
> mind is the IW unflushed buffer - which only Twitter's approach seem to
> handle (not even Zoie).
>
> Itamar.
>
> On 14/06/2011 01:00, Michael McCandless wrote:
>>
>> Yes, adding deletes to Twitter's approach will be a challenge!
>>
>> I don't think we'd do the post-filtering solution, but instead maybe
>> resolve the deletes "live" and store them in a transactional data
>> structure of some kind... but even then we will pay a perf hit to
>> lookup del docs against it.
>>
>> So, yeah, there will presumably be a tradeoff with this approach too.
>> However, turning around changes from the adds should be faster (no
>> segment gets flushed).
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Mon, Jun 13, 2011 at 5:06 PM, Itamar Syn-Hershko
>>  wrote:
>>>
>>> Thanks Mike, much appreciated.
>>>
>>>
>>> Wouldn't Twitter's approach fall for the exact same pit-hole you
>>> described
>>> Zoie does (or did) when it'll handle deletes too? I don't thing there is
>>> any
>>> other way of handling deletes other than post-filtering results. But
>>> perhaps
>>> the IW cache would be smaller than Zoie's RAMDirectory(ies)?
>>>
>>>
>>> I'll give all that a serious dive and report back with results or if more
>>> input will be required...
>>>
>>>
>>> Itamar.
>>>
>>>
>>> On 13/06/2011 19:01, Michael McCandless wrote:
>>>
 Here's a blog post describing some details of Twitter's approach:



 http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html

 And here's a talk Michael did last October (Lucene Revolutions):



 http://www.lucidimagination.com/events/revolution2010/video-Realtime-Search-With-Lucene-presented-by-Michael-Busch-of-Twitter

 Twitter's case is simpler since they never delete ;)  So we have to
 fix that to do it in Lucene... there are also various open issues that
 begin to explore some of the ideas here.

 But this ("immediate consistency") would be a deep and complex change,
 and I don't see many apps that actually require it.

 Mike McCandless

 http://blog.mikemccandless.com

 On Sun, Jun 12, 2011 at 4:46 PM, Itamar Syn-Hershko
  wrote:
>
> Thanks for your detailed answer. We'll have to tackle this and see
> whats
> more important to us then. I'd definitely love to hear Zoie has
> overcame
> all
> that...
>
>
> Any pointers to Michael Busch's approach? I take this has something to
> do
> with the core itself or index format, probably using the Flex version?
>
>
> Itamar.
>
>
> On 12/06/2011 23:12, Michael McCandless wrote:
>
>>>  From what I understand of Zoie (and it's been some time since I last
>>
>> looked... so this could be wrong now), the biggest difference vs NRT
>> is that Zoie aims for "immediate consistency", ie index changes are
>> always made visible to the very next query, vs NRT which is
>> "controlled consistency", a blend between immediate and eventual
>> consistency where your app decides when the changes must become
>> visible.
>>
>> But in exchange for that, Zoie pays a price: each search has a higher
>> cost per collected hit, since it must post-filter for deleted docs.
>> And since Zoie necessarily adds complexity, there's more risk; eg
>> there were some nasty Zoie bugs that took quite some time to track
>> down (under https://issues.apache.org/jira/browse/LUCENE-2729).
>>
>> Anyway, I don't think that's a good tradeoff, in general, for our
>> users, because very few apps truly require immediate consistency from
>> Lucene (can anyone give an example where their app depends on
>> immediate consistency...?).  I think it's better to spend time during
>> reopen so that searches aren't slower.
>>
>> That sa

how to approach phrase queries and term grouping

2011-06-22 Thread Jason Guild

Hi All:

I am new to Lucene and my project is to provide specialized search for a set
of booklets. I am using Lucene Java 3.1.

The basic idea is to run queries to find out which booklet and page numbers
match, in order to help people know where to look for information in the
(rather large and dry) booklets. Therefore each Document in my index
represents a particular page in one of the booklets.

So far I have been able to successfully scrape the raw text from the
booklets, insert it into an index, and query it just fine using
StandardAnalyzer on both ends.

So here's my general question:
Many queries on the index will involve searching for place names mentioned in
the booklets. Some place names use notational variants. For instance, in the
body text it will be called "Ship Creek", but in a diagram it might be listed
as "Ship Cr." or elsewhere as "Ship Ck.".

If I search for (Ship AND (Cr Ck Creek)) this does not give me what I want,
because other words may appear between [ship] and [cr]/[ck]/[creek], leading
to false positives.

What I need to know is how to approach treating the two consecutive words as
a single term and adding the notational variants as synonyms. So, in a
nutshell, I need the basic stuff provided by StandardAnalyzer, but with term
grouping to emit place names as complete terms and insert synonymous terms to
cover the variants.

For instance, the text "...allowed from the mouth of Ship Creek upstream to
..." would result in the tokens [allowed],[mouth],[ship creek],[upstream].
Perhaps via a TokenFilter along the way, the [ship creek] term would expand
into [ship creek][ship ck][ship cr].

As a bonus it would be nice to treat the trickier text "..except in Ship,
Bird, and Campbell creeks where the limit is..." as [except],[ship
creek],[bird creek],[campbell creek],[where],[limit].

Should the detection and merging be done in a TokenFilter? Some of the term
grouping can probably be done heuristically ([*],[creek] becomes [* creek]),
but I also have an exhaustive list of places mentioned in the text if that
helps.
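
Below is a minimal sketch of the TokenFilter idea, assuming lowercased input
(e.g. from StandardAnalyzer) and a hard-coded variant list; the class name,
the variant set and the single-token lookahead are all illustrative, and
offset handling is omitted for brevity:

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class PlaceNameFilter extends TokenFilter {

  private static final Set<String> CREEK_VARIANTS =
      new HashSet<String>(Arrays.asList("creek", "cr", "ck"));

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);

  private final Deque<String> pending = new ArrayDeque<String>(); // synonyms to emit at the same position
  private String buffered;                                        // one token of lookahead

  public PlaceNameFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!pending.isEmpty()) {                 // emit queued synonyms, stacked at the same position
      termAtt.setEmpty().append(pending.poll());
      posIncAtt.setPositionIncrement(0);
      return true;
    }

    String current = buffered;
    buffered = null;
    if (current == null) {
      if (!input.incrementToken()) {
        return false;
      }
      current = termAtt.toString();
    }

    if (input.incrementToken()) {             // peek one token ahead
      String next = termAtt.toString();
      if (CREEK_VARIANTS.contains(next)) {
        termAtt.setEmpty().append(current).append(" creek");  // merged term, e.g. "ship creek"
        posIncAtt.setPositionIncrement(1);
        pending.add(current + " cr");                          // notational variants as synonyms
        pending.add(current + " ck");
        return true;
      }
      buffered = next;                        // not a variant: hold it for the next call
    }

    termAtt.setEmpty().append(current);       // pass the token through unchanged
    posIncAtt.setPositionIncrement(1);
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending.clear();
    buffered = null;
  }
}

The exhaustive place list could replace the blanket "any word before a creek
variant" heuristic by checking the merged string against that list before
emitting it as a single term.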


Thanks for any help you can provide.
Jason



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

RE: i'm having some trouble with class FSDirectory

2011-08-24 Thread Sendros, Jason
Hi Mostafa,

Try looking through the API for help with these types of questions:
http://lucene.apache.org/java/3_3_0/api/all/org/apache/lucene/store/FSDirectory.html

You can use a number of FSDirectory subclasses depending on your
circumstances.

Hope this helps!

Jason
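
For example, the usual 3.x idiom is the static factory rather than the
constructor; it picks a concrete subclass (NIOFSDirectory, MMapDirectory or
SimpleFSDirectory) suited to the platform:

Directory dir = FSDirectory.open(new File(indexDir));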

-Original Message-
From: Mostafa Hadian [mailto:hadian...@gmail.com] 
Sent: Wednesday, August 24, 2011 9:55 AM
To: java-user@lucene.apache.org
Subject: i'm having some trouble with class FSDirectory

hello.
there is this piece of code in the book "lucene in action" :
Directory dir = new FSDirectory(new File(indexDir), null);
but class FSDirectory is an abstract class and cannot be instantiated
like
this.
thank you very much for your helping.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Lucene scoring and random result order

2011-08-25 Thread Sendros, Jason
You can sort on multiple values. Keep the primary sort as a relevancy
sort, and choose something else to sort on to keep the rest of the
responses fairly static.

http://lucene.apache.org/java/3_3_0/api/core/org/apache/lucene/search/Sort.html

Example:
Sort sortBy = new Sort(new SortField[] { SortField.FIELD_SCORE, new
SortField("POSITION",SortField.INT) });

-Original Message-
From: Yanick Gamelin [mailto:yanick.game...@ericsson.com] 
Sent: Thursday, August 25, 2011 3:02 PM
To: java-user@lucene.apache.org
Subject: Lucene scoring and random result order

Hi all,

I have the following problem with Lucene being not deterministic.

I use a MultiSearcher to process a search and when I get hits with same
score, those are returned in a random order.
I wouldn't care much about the order of the hits with same score if I
could get them all, so I could sort them myself.
But if we request a maximum number of results lower than the amount of
hits with same score, we only get a subset of those hits and that result
list of hits will change because the order is not guarantied.
Sometimes the first part of the result list is consistent because scoring is
different for those hits, but then we have a big block with equal scoring, so
Lucene only takes what it needs to fill the rest of the list. Lucene takes
randomly what it needs from the big block of equal scores.

As an example imagine x,y,and z which have a high scoring, all other
letters have same score
3 consecutive searches will give
[x,y,z,a,b,c,d,f,g,h,i,j]
[x,y,z,q,w,e,r,t,u,i,o,p]
[x,y,z,m,n,b,v,c,a,s,d,g]

Pretty annoying eh? So, what can I do about that?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: deleting with sorting and max document

2011-09-14 Thread Sendros, Jason
Vincent,

I think you may be looking for the following method:
http://lucene.apache.org/java/2_9_2/api/all/org/apache/lucene/index/IndexWriter.html#deleteDocuments%28org.apache.lucene.search.Query%29

Jason
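
A rough sketch of what that can look like on 2.9.x, assuming the timestamp was
indexed as a numeric long and using illustrative field names; note that unlike
search(query, n, sort) there is no way to cap the number of deletions, so the
range itself has to define the cutoff:

BooleanQuery toDelete = new BooleanQuery();
toDelete.add(new TermQuery(new Term("type", "audit")), BooleanClause.Occur.MUST);
toDelete.add(NumericRangeQuery.newLongRange("timestamp", null, cutoffMillis, true, true),
             BooleanClause.Occur.MUST);
writer.deleteDocuments(toDelete);
writer.commit();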

-Original Message-
From: v.se...@lombardodier.com [mailto:v.se...@lombardodier.com] 
Sent: Wednesday, September 14, 2011 9:24 AM
To: java-user@lucene.apache.org
Subject: deleting with sorting and max document

Hi,

I have an index with 35 millions docs in it. every day I need to delete 
some of the oldest docs that meet some criteria.

I can easily do this on the searcher by using search(Query query, int n,

Sort sort)

but there is nothing equivalent for the deleteDocuments.

what are my options?

thanks,

vincent

 DISCLAIMER 
This message is intended only for use by the person to
whom it is addressed. It may contain information that is
privileged and confidential. Its content does not
constitute a formal commitment by Lombard Odier
Darier Hentsch & Cie or any of its branches or affiliates.
If you are not the intended recipient of this message,
kindly notify the sender immediately and destroy this
message. Thank You.
*

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: searching / sorting on timestamp and update efficiency

2011-09-22 Thread Sendros, Jason
Storing the date as a long and then searching with NumericRangeQuery will 
provide you with exactly what you're looking for. This is an efficient search 
solution for numeric data.

Optimize() will reduce the size of your index and improve search time at the 
cost of a large burst of overhead. Unless your searches are getting noticeably 
slower or your index is expanding rapidly, you're better off using 
IndexReader.reopen() for regular updates and optimize() occasionally.

Note that when using IndexReader.reopen() you should close the original 
IndexReader if it is still open to avoid memory leaks.

Jason
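
A minimal sketch of that combination on 3.x (the field name "ts" is
illustrative):

// index time: millisecond timestamp as a trie-encoded long
doc.add(new NumericField("ts", Field.Store.YES, true).setLongValue(timestampMillis));

// search time: everything from the last 24 hours
long now = System.currentTimeMillis();
Query q = NumericRangeQuery.newLongRange("ts", now - 24L * 3600 * 1000, now, true, true);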

-Original Message-
From: Sam Jiang [mailto:sam.ji...@karoshealth.com] 
Sent: Thursday, September 22, 2011 10:18 AM
To: java-user@lucene.apache.org
Subject: searching / sorting on timestamp and update efficiency

Hi all

I have some questions about how I should store timestamps.

From my readings, I can see two ways of indexing timestamps:
DateTools (which uses formatted timestamp strings) and
NumericUtils (which uses a long?).

I'm not sure which one gives more performance in my scenario:
For each of my document, it needs to have an indexed millisecond resolution
timestamp. Almost all searches would be invoked with a range filter
(searching at hour resolution is sufficient).
There are usually 2-4 updates to this timestamp field for recently indexed
documents. And afterwards, updates to this field or any other fields are
rare.

It would be great if somebody can advice me which format should I use.
p.s. should I be calling optimize() often given my frequent updates?

thanks

-- 
Sam Jiang | karoshealth
(っ゚Д゚;)っ hidden cat here
7 Father David Bauer Drive, Suite 201
Waterloo, ON, N2L 0A2, Canada
www.karoshealth.com


RE: Case insensitive sortable column

2011-10-11 Thread Sendros, Jason
If that's not an option, create another column with the same data
lowercased and search on the new column while displaying the original
column.

Jason
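
A small sketch of that on 3.x, with illustrative field names:

// keep the original for display, add an un-analyzed lowercased copy for sorting
doc.add(new Field("title", title, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("title_sort", title.toLowerCase(Locale.ENGLISH),
                  Field.Store.NO, Field.Index.NOT_ANALYZED_NO_NORMS));

// sort on the lowercased copy
Sort sort = new Sort(new SortField("title_sort", SortField.STRING));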

-Original Message-
From: Greg Bowyer [mailto:gbow...@shopzilla.com] 
Sent: Tuesday, October 11, 2011 10:43 PM
To: java-user@lucene.apache.org
Subject: Re: Case insensitive sortable column

I might be missing something here but cant you just lowercase during 
indexing ?

On 11/10/11 09:48, Senthil V S wrote:
> Hi,
>
> I'm new to Lucene. I have records and I wanna sort them by fields.
I've
> created indexes for those fields with 'not_analyzed'.
> The sort is case sensitive. In a sense,
> *A...*
> *X...*
> *b...*
> is the order, while what I would prefer is,
> *A...*
> *b...*
> *X...*
> *
> *
> I believe it's a trivial one to do but not sure how. Any idea?
> *
> *
>
>
> Senthil V S
> Y!: siliconsenthil2003,GTalk:vss123
> <http://pleasantrian.blogspot.com/>
>


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ElasticSearch

2011-11-16 Thread Jason Rutherglen
> even high complexity as ES supports lucene-like query nesting via JSON

That sounds interesting.  Where is it described in the ES docs?  Thanks.

On Wed, Nov 16, 2011 at 1:36 PM, Peter Karich  wrote:
>  Hi,
>
> its not really fair to compare NRT of Solr to ElasticSearch.
> ElasticSearch provides NRT for distributed indices as well... also when
> doing heavy indexing Solr lacks real NRT.
>
> The only main disadvantages of ElasticSearch are:
>  * only one (main) committer
>  * no autowarming
>
>
>> the ES team in the end has found it good as a storage but difficult to
> extend for a lucene expert.
>
> The nice thing with ES is that you can e.g. create lucene queries with
> even high complexity as ES supports lucene-like query nesting via JSON.
> Also when implementing server side stuff you can take advantage of full
> lucene power.
>
> Ah, before I forgot it: it is very important to test the software
> yourself. Do not trust me or anybody else :), also the software should
> fit to your environment, requirements + team!
>
> Regards,
> Peter.
>
>
> PS: here is my different comparison:
> http://karussell.wordpress.com/2011/05/12/elasticsearch-vs-solr-lucene/
>
>
>> On Wed, Nov 16, 2011 at 10:36 AM, Shashi Kant  wrote:
>>> I had posted this earlier on this list, hope this provides some answers
>>>
>>> http://engineering.socialcast.com/2011/05/realtime-search-solr-vs-elasticsearch/
>> Except it's an out of date comparison.
>> We have NRT (near real time search) in Solr now.
>>
>> http://wiki.apache.org/solr/NearRealtimeSearch
>>
>> -Yonik
>> http://www.lucidimagination.com
>
>
> --
> http://jetsli.de news reader for geeks
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ElasticSearch

2011-11-16 Thread Jason Rutherglen
The docs are slim on examples.

On Wed, Nov 16, 2011 at 3:35 PM, Peter Karich  wrote:
>
>>> even high complexity as ES supports lucene-like query nesting via JSON
>> That sounds interesting.  Where is it described in the ES docs?  Thanks.
>
> "Think of the Query DSL as an AST of queries"
> http://www.elasticsearch.org/guide/reference/query-dsl/
>
> For further info ask on ES mailing list.
>
> Regards,
> Peter.
>
> --
> http://jetsli.de news reader for geeks
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



BigInteger usage in numeric Trie range queries

2011-11-28 Thread Jason Rutherglen
Even though the NumericRangeQuery.new* methods do not support
BigInteger, the underlying recursive algorithm supports any sized
number.

Has this been explored?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



deleteDocuments(Term... terms) takes a long time to do nothing.

2013-12-13 Thread Jason Corekin
Let me start by stating that I am almost certain that I am doing something
wrong, and I hope that I am, because if not there is a VERY large bug
in Lucene.   What I am trying to do is use the method


deleteDocuments(Term... terms)


 out of the IndexWriter class to delete several Term object Arrays, each
fed to it via a separate Thread.  Each array has around 460k+ Term object
in it.  The issue is that after running for around 30 minutes or more the
method finishes, I then have a commit run and nothing changes with my files.
To be fair, I am running a custom Directory implementation that might be
causing problems, but I do not think that this is the case as I do not even
see any of the my Directory methods in the stack trace.  In fact when I set
break points inside the delete methods of my Directory implementation they
never even get hit. To be clear replacing the custom Directory
implementation with a standard one is not an option due to the nature of
the data which is made up of terabytes of small (1k and less) files.  So,
if the issue is in the Directory implementation I have to figure out how to
fix it.


Below are the pieces of code that I think are relevant to this issue as
well as a copy of the stack trace thread that was doing work when I paused
the debug session.  As you are likely to notice, the thread is called a
DBCloner because it is being used to clone the underlying Index based
database (needed to avoid storing trillions of files directly on disk).  The
idea is to duplicate the selected group of terms into a new database and
then delete the original terms from the original database.  The duplication
works wonderfully, but no matter what I do, including cutting the program
down to one thread, I cannot shrink the database, and the time to do
the deletes is drastically too long.


In an attempt to be as helpful as possible, I will say this.  I have been
tracing this problem for a few days and have seen that

BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BytesRef)

is where the majority of the execution time is spent.  I have also noticed
that this method returns false MUCH more often than it returns true.  I have
been trying to figure out how the mechanics of this process work just in
case the issue was not in my code and I might have been able  to find the
problem.  But I have yet to find the problem either in Lucene 4.5.1 or
Lucene 4.6.  If anyone has any ideas as to what I might be doing wrong, I
would really appreciate reading what you have to say.  Thanks in advance.



Jason



private void cloneDB() throws QueryNodeException {



Document doc;

ArrayList fileNames;

int start = docRanges[(threadNumber * 2)];

int stop = docRanges[(threadNumber * 2) +
1];



try {



fileNames = new
ArrayList(docsPerThread);

for (int i = start; i <
stop; i++) {

doc =
searcher.doc(i);

try {


adder.addDoc(doc);


fileNames.add(doc.get("FileName"));

} catch
(TransactionExceptionRE | TransactionException | LockConflictException te) {


adder.txnAbort();


System.err.println(Thread.currentThread().getName() + ": Adding a message
failed, retrying.");

}

}


deleters[threadNumber].deleteTerms("FileName",
fileNames);


deleters[threadNumber].commit();



} catch (IOException | ParseException ex) {


Logger.getLogger(DocReader.class.getName()).log(Level.SEVERE,
null, ex);

}

}





public void deleteTerms(String
dbField,ArrayList fieldTexts) throws IOException {

Term[] terms = new Term[fieldTexts.size()];

for(int i=0;i<fieldTexts.size();i++) {
    // loop body reconstructed as an assumption: map each value to a Term and
    // pass the whole array to deleteDocuments
    terms[i] = new Term(dbField, fieldTexts.get(i));
}
writer.deleteDocuments(terms);
}

FST.readFirstRealTargetArc(long, Arc, BytesReader)
line: 979

FST.findTargetArc(int, Arc, Arc, BytesReader)
line: 1220


BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BytesRef)
line: 1679

BufferedUpdatesStream.applyTermDeletes(Iterable,
ReadersAndUpdates, SegmentReader) line: 414

BufferedUpdatesStream.applyDeletesAndUpdates(ReaderPool,
List) line: 283

IndexWriter.applyAllDeletesAndUpdates() line: 3112

IndexWriter.applyDeletesAndPurge(boolean) line: 4641


DocumentsWriter$ApplyDeletesEvent.process(IndexWriter,
boolean, boolean) line: 673

IndexWriter.processEvents(Qu

Re: deleteDocuments(Term... terms) takes a long time to do nothing.

2013-12-14 Thread Jason Corekin
I knew that I had forgotten something.  Below is the line that I use to
create the field that I am trying to use to delete the entries with.  I
hope this avoids some confusion.  Thank you very much to anyone that takes
the time to read these messages.

doc.add(new StringField("FileName",filename, Field.Store.YES));


On Sat, Dec 14, 2013 at 1:28 AM, Jason Corekin wrote:

> Let me start by stating that I almost certain that I am doing something
> wrong, and that I hope that I am because if not there is a VERY large bug
> in Lucene.   What I am trying to do is use the method
>
>
> deleteDocuments(Term... terms)
>
>
>  out of the IndexWriter class to delete several Term object Arrays, each
> fed to it via a separate Thread.  Each array has around 460k+ Term object
> in it.  The issue is that after running for around 30 minutes or more the
> method finishes, I then have a commit run and nothing changes with my files.
> To be fair, I am running a custom Directory implementation that might be
> causing problems, but I do not think that this is the case as I do not even
> see any of the my Directory methods in the stack trace.  In fact when I
> set break points inside the delete methods of my Directory implementation
> they never even get hit. To be clear replacing the custom Directory
> implementation with a standard one is not an option due to the nature of
> the data which is made up of terabytes of small (1k and less) files.  So,
> if the issue is in the Directory implementation I have to figure out how to
> fix it.
>
>
> Below are the pieces of code that I think are relevant to this issue as
> well as a copy of the stack trace thread that was doing work when I paused
> the debug session.  As you are likely to notice, the thread is called a
> DBCloner because it is being used to clone the underlying Index based
> database (needed to avoid storing trillions of files directly on disk).  The
> idea is to duplicate the selected group of terms into a new database and
> then delete to original terms from the original database.  The duplicate
> work wonderfully, but not matter what I do including cutting the program
> down to one thread I cannot shrink the database and the time to try to do
> the deletes takes drastically too long.
>
>
> In an attempt to be as helpful as possible, I will say this.  I have been
> tracing this problem for a few days and have seen that
>
> BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BytesRef)
>
> is where that majority of the execution time is spent.  I have also
> noticed that this method return false MUCH more often than it returns true.
> I have been trying to figure out how the mechanics of this process work
> just in case the issue was not in my code and I might have been able  to
> find the problem.  But I have yet to find the problem either in Lucene
> 4.5.1 or Lucene 4.6.  If anyone has any ideas as to what I might be doing
> wrong, I would really appreciate reading what you have to say.  Thanks in
> advance.
>
>
>
> Jason
>
>
>
> private void cloneDB() throws QueryNodeException {
>
>
>
> Document doc;
>
> ArrayList fileNames;
>
> int start = docRanges[(threadNumber * 2)];
>
> int stop = docRanges[(threadNumber * 2) +
> 1];
>
>
>
> try {
>
>
>
> fileNames = new
> ArrayList(docsPerThread);
>
> for (int i = start; i <
> stop; i++) {
>
> doc =
> searcher.doc(i);
>
> try {
>
>
> adder.addDoc(doc);
>
>
> fileNames.add(doc.get("FileName"));
>
> } catch
> (TransactionExceptionRE | TransactionException | LockConflictException te) {
>
>
> adder.txnAbort();
>
>
> System.err.println(Thread.currentThread().getName() + ": Adding a message
> failed, retrying.");
>
> }
>
> }
>
> 
> deleters[threadNumber].deleteTerms("FileName",
> fileNames);
>
>
> deleters[threadNumber].commit();
>
>
>
> } catch (IOException | ParseException ex)
> {
>
> 
> Logger.getLogger(DocReader.class.getName()).log(Leve

Re: deleteDocuments(Term... terms) takes a long time to do nothing.

2013-12-14 Thread Jason Corekin
Mike,

Thanks for the input, it will take me some time to digest and trying
everything you wrote about.  I will post back the answers to your questions
and results to from the suggestions you made once I have gone over
everything.  Thanks for the quick reply,

Jason


On Sat, Dec 14, 2013 at 5:13 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> It sounds like there are at least two issues.
>
> First, that it takes so long to do the delete.
>
> Unfortunately, deleting by Term is at heart a costly operation.  It
> entails up to one disk seek per segment in your index; a custom
> Directory impl that makes seeking costly would slow things down, or if
> the OS doesn't have enough RAM to cache the "hot" pages (if your Dir
> impl is using the OS).  Is seeking somehow costly in your custom Dir
> impl?
>
> If you are deleting ~1M terms in ~30 minutes that works out to ~2 msec
> per Term, which may actually be expected.
>
> How many terms in your index?  Can you run CheckIndex and post the output?
>
> You could index your ID field using MemoryPostingsFormat, which should
> be a good speedup, but will consume more RAM.
>
> Is it possible to delete by query instead?  Ie, create a query that
> matches the 460K docs and pass that to
> IndexWriter.deleteDocuments(Query).
>
> Also, try passing fewer ids at once to Lucene, e.g. break the 460K
> into smaller chunks.  Lucene buffers up all deleted terms from one
> call, and then applies them, so my guess is you're using way too much
> intermediate memory by passign 460K in a single call.
>
> Instead of indexing everything into one index, and then deleting tons
> of docs to "clone" to a new index, why not just index to two separate
> indices to begin with?
>
> The second issue is that after all that work, nothing in fact changed.
>  For that, I think you should make a small test case that just tries
> to delete one document, and iterate/debug until that works.  Your
> StringField indexing line looks correct; make sure you're passing
> precisely the same field name and value?  Make sure you're not
> deleting already-deleted documents?  (Your for loop seems to ignore
> already deleted documents).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Sat, Dec 14, 2013 at 11:38 AM, Jason Corekin 
> wrote:
> > I knew that I had forgotten something.  Below is the line that I use to
> > create the field that I am trying to use to delete the entries with.  I
> > hope this avoids some confusion.  Thank you very much to anyone that
> takes
> > the time to read these messages.
> >
> > doc.add(new StringField("FileName",filename, Field.Store.YES));
> >
> >
> > On Sat, Dec 14, 2013 at 1:28 AM, Jason Corekin  >wrote:
> >
> >> Let me start by stating that I almost certain that I am doing something
> >> wrong, and that I hope that I am because if not there is a VERY large
> bug
> >> in Lucene.   What I am trying to do is use the method
> >>
> >>
> >> deleteDocuments(Term... terms)
> >>
> >>
> >>  out of the IndexWriter class to delete several Term object Arrays, each
> >> fed to it via a separate Thread.  Each array has around 460k+ Term
> object
> >> in it.  The issue is that after running for around 30 minutes or more
> the
> >> method finishes, I then have a commit run and nothing changes with my
> files.
> >> To be fair, I am running a custom Directory implementation that might be
> >> causing problems, but I do not think that this is the case as I do not
> even
> >> see any of the my Directory methods in the stack trace.  In fact when I
> >> set break points inside the delete methods of my Directory
> implementation
> >> they never even get hit. To be clear replacing the custom Directory
> >> implementation with a standard one is not an option due to the nature of
> >> the data which is made up of terabytes of small (1k and less) files.
>  So,
> >> if the issue is in the Directory implementation I have to figure out
> how to
> >> fix it.
> >>
> >>
> >> Below are the pieces of code that I think are relevant to this issue as
> >> well as a copy of the stack trace thread that was doing work when I
> paused
> >> the debug session.  As you are likely to notice, the thread is called a
> >> DBCloner because it is being used to clone the underlying Index based
> >> database (needed to avoid storing trillions of files directly on disk).
>  The
> >> idea is to duplicate the selected group of terms into a new database and
> >

Re: deleteDocuments(Term... terms) takes a long time to do nothing.

2013-12-16 Thread Jason Corekin
Mike,



Thank you for your help.  Below are a few comments in direct reply to
your questions, but in general your suggestions got me on the right track
and I believe I have been able to solve the Lucene component of my
problems.  The short answer is that when I previously tried the query-based
approach, I used the filenames stored in each document as the query, which
was essentially equivalent to deleting by term.  Your email helped me
realize this and in turn change my query to be time-range based, which
now takes seconds to run.
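
For anyone who hits the same wall, the range-based delete is essentially the
following sketch (the "timestamp" field name and the long-valued range are
illustrative assumptions, not my exact schema):

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

// Delete every document whose (assumed) numeric "timestamp" field falls in [start, end].
void deleteRange(IndexWriter writer, long start, long end) throws IOException {
    Query range = NumericRangeQuery.newLongRange("timestamp", start, end, true, true);
    writer.deleteDocuments(range);   // one query instead of ~460K individual terms
    writer.commit();
}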



Thank You



Jason Corekin



>It sounds like there are at least two issues.

>

>First, that it takes so long to do the delete.

>

>Unfortunately, deleting by Term is at heart a costly operation.  It

>entails up to one disk seek per segment in your index; a custom

>Directory impl that makes seeking costly would slow things down, or if

>the OS doesn't have enough RAM to cache the "hot" pages (if your Dir

>impl is using the OS).  Is seeking somehow costly in your custom Dir

>impl?



No, seeks are not slow at all.

>

>If you are deleting ~1M terms in ~30 minutes that works out to ~2 msec

>per Term, which may actually be expected.

>

>How many terms in your index?  Can you run CheckIndex and post the output?

In the main test case that was causing problems I believe there are
around 3.7 million terms, and this is tiny in comparison to what will need
to be held.  Unfortunately I forgot to save the CheckIndex output that I
created from this test set while the problem was occurring, and now that the
problem is solved I do not think it is worth going back to recreate it.



>

>You could index your ID field using MemoryPostingsFormat, which should

>be a good speedup, but will consume more RAM.

>

>Is it possible to delete by query instead?  Ie, create a query that

>matches the 460K docs and pass that to

>IndexWriter.deleteDocuments(Query).

>

Thanks so much for this suggestion, I had thought of it on my own.



>Also, try passing fewer ids at once to Lucene, e.g. break the 460K

>into smaller chunks.  Lucene buffers up all deleted terms from one

>call, and then applies them, so my guess is you're using way too much

>intermediate memory by passign 460K in a single call.



This does not seem to be the issue now, but I will keep it in mind.
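
Should it ever become an issue again, the batching idea would look roughly
like this (the batch size is arbitrary):

import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Feed the deletes to Lucene in smaller batches so the buffered deletes stay bounded.
static void deleteInBatches(IndexWriter writer, List<Term> terms, int batchSize) throws IOException {
    for (int i = 0; i < terms.size(); i += batchSize) {
        List<Term> batch = terms.subList(i, Math.min(i + batchSize, terms.size()));
        writer.deleteDocuments(batch.toArray(new Term[batch.size()]));
        writer.commit();   // committing per batch trades speed for bounded memory
    }
}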

>

>Instead of indexing everything into one index, and then deleting tons

>of docs to "clone" to a new index, why not just index to two separate

>indices to begin with?

>

The clone idea is only a test; the final design is to be able to copy date
ranges of data out of the main index into secondary indexes that will
be backed up and removed from the main system at a regular interval.  The
copy component of this idea seems to work just fine; it is getting the
deletion from the main index to work that is giving me all the trouble.



>The second issue is that after all that work, nothing in fact changed.

> For that, I think you should make a small test case that just tries

>to delete one document, and iterate/debug until that works.  Your

>StringField indexing line looks correct; make sure you're passing

>precisely the same field name and value?  Make sure you're not

>deleting already-deleted documents?  (Your for loop seems to ignore

>already deleted documents).



This was caused by an incorrect use of the underlying data structure.  This
is partially fixed now and is what I am currently working on.  I have it
fixed enough to identify that it should no longer be related to Lucene.



>

>Mike McCandless


On Sat, Dec 14, 2013 at 5:58 PM, Jason Corekin wrote:

> Mike,
>
> Thanks for the input, it will take me some time to digest and trying
> everything you wrote about.  I will post back the answers to your questions
> and results to from the suggestions you made once I have gone over
> everything.  Thanks for the quick reply,
>
> Jason
>
>
> On Sat, Dec 14, 2013 at 5:13 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> It sounds like there are at least two issues.
>>
>> First, that it takes so long to do the delete.
>>
>> Unfortunately, deleting by Term is at heart a costly operation.  It
>> entails up to one disk seek per segment in your index; a custom
>> Directory impl that makes seeking costly would slow things down, or if
>> the OS doesn't have enough RAM to cache the "hot" pages (if your Dir
>> impl is using the OS).  Is seeking somehow costly in your custom Dir
>> impl?
>>
>> If you are deleting ~1M terms in ~30 minutes that works out to ~2 msec
>> per Term, which may actually be expected.
>>
>> How many terms in your index?  Can you run CheckIndex and post the output?
&

codec mismatch

2014-02-14 Thread Jason Wee
Hello,

This is my first question to the Lucene mailing list, sorry if the question
sounds funny.

I have been experimenting with storing Lucene index files on Cassandra, but
unfortunately I am overwhelmed by exceptions. Below is the stack trace.

org.apache.lucene.index.CorruptIndexException: codec mismatch: actual
codec=CompoundFileWriterData vs expected codec=Lucene46FieldInfos
(resource: SlicedIndexInput(SlicedIndexInput(_0.fnm in
lucene-cassandra-desc) in lucene-cassandra-desc slice=31:340))
at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:140)
at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:130)
at
org.apache.lucene.codecs.lucene46.Lucene46FieldInfosReader.read(Lucene46FieldInfosReader.java:56)
at
org.apache.lucene.index.SegmentReader.readFieldInfos(SegmentReader.java:214)
at org.apache.lucene.index.SegmentReader.(SegmentReader.java:94)
at
org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:62)
at
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:843)
at
org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:66)
at org.apache.lucene.store.Search.(Search.java:41)
at org.apache.lucene.store.Search.main(Search.java:34)

I'm not sure what it means; can anybody help?

When I checked the hex representation of _0.fnm in Cassandra and translated
it to ASCII, it looked something like this:
??l??Lucene46FieldInfos??path?Q??PerFieldPostingsFormat.format?Lucene41?PerFieldPostingsFormat.suffix?0?modified?Q??PerFieldPostingsFormat.format?Lucene41?PerFieldPostingsFormat.suffix?0?contentsPerFieldPostingsFormat.format?Lucene41?PerFieldPostingsFormat.suffix?0

It looks to me as though the expected codec is present in the _0.fnm file, or am I wrong?

Thank you and please let me know if you need additional information.


Re: codec mismatch

2014-02-17 Thread Jason Wee
Hi Mike,

Thank you.

This exception makes it pretty clear that Lucene hit an NPE while executing
readInternal(...) on _0.cfs. The root cause is that the object being read,
FileBlock, is null. As far as I can tell, it happens only while reading
_0.cfs and not on the index files that were read before (that is,
for example, segments.gen, segments_1, _0.cfs).

It's pretty mind-boggling to understand without a better description of how
Lucene reads the file. I tried searching Google, the Lucene wiki, the Lucene
source repository and your blog, but without much avail. Could you give
some pointers, or write a general description of what happens
after IndexReader reader = DirectoryReader.open(cassandraDirectory); ?

2014-02-17 16:40:48 CassandraDirectory [INFO] called length() and returning
1034
2014-02-17 16:40:48 BufferedIndexInput [INFO] length = '1034'
2014-02-17 16:40:48 CassandraDirectory [INFO] called length() and returning
1034
2014-02-17 16:40:48 CassandraDirectory [TRACE] read internal to bytes with
offset 0 and length 309
2014-02-17 16:40:48 CassandraDirectory [INFO] fileDescriptor name =
'_0.cfs' fileLength = '1034'
2014-02-17 16:40:48 CassandraDirectory [INFO] fileDescriptor length 1034
fileDescriptor blockSize 1
java.lang.NullPointerException
at
org.apache.lucene.store.CassandraDirectory$CassandraIndexInput.readInternal(CassandraDirectory.java:1850)
at
org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:178)
at
org.apache.lucene.store.Directory$SlicedIndexInput.readInternal(Directory.java:306)
at
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:298)
at
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:50)
at org.apache.lucene.store.DataInput.readInt(DataInput.java:84)
at
org.apache.lucene.store.BufferedIndexInput.readInt(BufferedIndexInput.java:202)
at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:126)
at
org.apache.lucene.codecs.lucene46.Lucene46FieldInfosReader.read(Lucene46FieldInfosReader.java:56)
at
org.apache.lucene.index.SegmentReader.readFieldInfos(SegmentReader.java:214)
at org.apache.lucene.index.SegmentReader.(SegmentReader.java:94)
at
org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:62)
at
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:843)
at
org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:66)
at org.apache.lucene.store.Search.(Search.java:41)
at org.apache.lucene.store.Search.main(Search.java:34)


On Fri, Feb 14, 2014 at 7:14 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> This means Lucene was attempting to open _0.fnm but somehow got the
> contents of _0.cfs instead; seems likely that it's a bug in the
> Cassanda Directory implementation?  Somehow it's opening the wrong
> file name?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Feb 14, 2014 at 3:13 AM, Jason Wee  wrote:
> > Hello,
> >
> > This is my first question to lucene mailing list, sorry if the question
> > sounds funny.
> >
> > I have been experimenting to store lucene index files on cassandra,
> > unfortunately the exception got overwhelmed. Below are the stacktrace.
> >
> > org.apache.lucene.index.CorruptIndexException: codec mismatch: actual
> > codec=CompoundFileWriterData vs expected codec=Lucene46FieldInfos
> > (resource: SlicedIndexInput(SlicedIndexInput(_0.fnm in
> > lucene-cassandra-desc) in lucene-cassandra-desc slice=31:340))
> > at
> org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:140)
> > at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:130)
> > at
> >
> org.apache.lucene.codecs.lucene46.Lucene46FieldInfosReader.read(Lucene46FieldInfosReader.java:56)
> > at
> >
> org.apache.lucene.index.SegmentReader.readFieldInfos(SegmentReader.java:214)
> > at org.apache.lucene.index.SegmentReader.(SegmentReader.java:94)
> > at
> >
> org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:62)
> > at
> >
> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:843)
> > at
> >
> org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
> > at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:66)
> > at org.apache.lucene.store.Search.(Search.java:41)
> > at org.apache.lucene.store.Search.main(Search.java:34)
> >
> > I'm not sure what does it means, can anybody help?
> >
> > When I check the hex representation of _0.fnm in cassandra, and
> translated
> > to ascii. It

Re: codec mismatch

2014-03-06 Thread Jason Wee
Hello Mike,

Thank you, and you were right in your first comment: the expected codec,
Lucene46FieldInfos, is within the file _0.cfs. We took a closer and more
detailed look. The problem was that copying the bytes from Cassandra into
the byte array was wrong because the source offset was set incorrectly: it
was always set to 0 when it should have been set based on the position
Lucene passed to seek(position). Thank you again.
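
In case it helps anyone else building a Directory on top of a key-value
store, the shape of the fix was roughly the following (fetchBlock and
blockSize stand in for our Cassandra access code, and the sketch ignores
reads that span block boundaries):

// Inside our BufferedIndexInput subclass; single-block reads only, for illustration.
@Override
protected void readInternal(byte[] b, int offset, int length) throws IOException {
    long position = getFilePointer();                  // honours the seek() Lucene just performed
    byte[] block = fetchBlock(position);               // placeholder: fetch the enclosing block from Cassandra
    int blockOffset = (int) (position % blockSize);    // previously hard-coded to 0 -- that was the bug
    System.arraycopy(block, blockOffset, b, offset, length);
}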

Jack, it is for educational purposes; we think Lucene is fantastic software
and we would like to learn it in detail.

Jason


On Mon, Feb 17, 2014 at 10:31 PM, Jack Krupansky wrote:

> Are you using or aware of Solandra? See:
>
> https://github.com/tjake/Solandra
>
> Solandra has been superceded by a commercial product, DataStax Enterprise
> that combines Solr/Lucene and Cassandra. Solr/Lucene indexing of Cassandra
> data is supported, but the actual Lucene indexes are stored in the native
> file system for greater performance. Solrandra stored the Lucene indexes in
> Cassandra, but the performance penalty was too high.
>
> -- Jack Krupansky
>
> -Original Message- From: Jason Wee
> Sent: Friday, February 14, 2014 3:13 AM
> To: java-user@lucene.apache.org
> Subject: codec mismatch
>
>
> Hello,
>
> This is my first question to lucene mailing list, sorry if the question
> sounds funny.
>
> I have been experimenting to store lucene index files on cassandra,
> unfortunately the exception got overwhelmed. Below are the stacktrace.
>
> org.apache.lucene.index.CorruptIndexException: codec mismatch: actual
> codec=CompoundFileWriterData vs expected codec=Lucene46FieldInfos
> (resource: SlicedIndexInput(SlicedIndexInput(_0.fnm in
> lucene-cassandra-desc) in lucene-cassandra-desc slice=31:340))
> at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(
> CodecUtil.java:140)
> at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:130)
> at
> org.apache.lucene.codecs.lucene46.Lucene46FieldInfosReader.read(
> Lucene46FieldInfosReader.java:56)
> at
> org.apache.lucene.index.SegmentReader.readFieldInfos(
> SegmentReader.java:214)
> at org.apache.lucene.index.SegmentReader.(SegmentReader.java:94)
> at
> org.apache.lucene.index.StandardDirectoryReader$1.doBody(
> StandardDirectoryReader.java:62)
> at
> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.
> run(SegmentInfos.java:843)
> at
> org.apache.lucene.index.StandardDirectoryReader.open(
> StandardDirectoryReader.java:52)
> at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:66)
> at org.apache.lucene.store.Search.(Search.java:41)
> at org.apache.lucene.store.Search.main(Search.java:34)
>
> I'm not sure what does it means, can anybody help?
>
> When I check the hex representation of _0.fnm in cassandra, and translated
> to ascii. It is something like this:
> ??l??Lucene46FieldInfos??path?Q??
> PerFieldPostingsFormat.format?Lucene41?PerFieldPostingsFormat.suffix?
> 0?modified?Q??PerFieldPostingsFormat.format?Lucene41?
> PerFieldPostingsFormat.suffix?0?contents
> PerFieldPostingsFormat.format?Lucene41?PerFieldPostingsFormat.suffix?0
>
> It looks to me the expected codec is found in the _0.fnm file or am I
> wrong?
>
> Thank you and please let me know if you need additional information.
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


background merge hit exception

2014-04-03 Thread Jason Wee
Hello again,

A little background on our experiment. We are storing Lucene (version
4.6.0) on top of Cassandra. We are using the demo IndexFiles.java from
Lucene with a minor modification so that the directory used is a reference
to the CassandraDirectory.

With a large dataset (that is, indexing more than 5 of files), after the
index is done and forceMerge(1) is called, we get the following exception.


BufferedIndexInput readBytes [ERROR] bufferStart = '0' bufferPosition =
'1024' len = '9252' after = '10276'
BufferedIndexInput readBytes [ERROR] length = '8192'
 caught a class java.io.IOException
 with message: background merge hit exception: _1(4.6):c10250
_0(4.6):c10355 _2(4.6):c10297 _3(4.6):c10217 _4(4.6):c8882 into _5
[maxNumSegments=1]
java.io.IOException: background merge hit exception: _1(4.6):c10250
_0(4.6):c10355 _2(4.6):c10297 _3(4.6):c10217 _4(4.6):c8882 into _5
[maxNumSegments=1]
at
org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1755)
at
org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1691)
at org.apache.lucene.store.IndexFiles.main(IndexFiles.java:159)
Caused by: java.io.IOException: read past EOF:
CassandraSimpleFSIndexInput(_1.nvd in path="_1.cfs" slice=5557885:5566077)
at
org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:186)
at
org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:125)
at
org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.loadNumeric(Lucene42DocValuesProducer.java:230)
at
org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.getNumeric(Lucene42DocValuesProducer.java:186)
at
org.apache.lucene.index.SegmentCoreReaders.getNormValues(SegmentCoreReaders.java:159)
at
org.apache.lucene.index.SegmentReader.getNormValues(SegmentReader.java:516)
at
org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:232)
at
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:127)
at
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4057)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3654)
at
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)


We do not know what is wrong, as our understanding of Lucene is limited. Can
someone explain what is happening, or what the possible error source might
be?

Thank you and any advice is appreciated.

/Jason


Re: background merge hit exception

2014-04-08 Thread Jason Wee
Hello Jose,

Thank you for your response, I took a closer look. Below are my responses:


> Seems that you want to force a max number of segments to 1,

  // you're done adding documents to it):
  //
  writer.forceMerge(1);

  writer.close();

Yes, that line of code is uncommented because we want to understand how
it works when indexing big data sets. Should this be a concern?


> On a previous thread someone answered that the number of segments will
> affect the Index Size, and is not related with Index Integrity (like size
> of index may vary according with number of segments).

Okay, I have no idea what the above actually means, but I would guess that
perhaps the code we added causes this exception?

if (file.isDirectory()) {
    String[] files = file.list();
    // an IO error could occur
    if (files != null) {
        for (int i = 0; i < files.length; i++) {
            indexDocs(writer, new File(file, files[i]), forceMerge);
            if (forceMerge && writer.hasPendingMerges()) {
                if (i % 1000 == 0 && i != 0) {
                    logger.trace("forcing merge now.");
                    try {
                        writer.forceMerge(50);
                        writer.commit();
                    } catch (OutOfMemoryError e) {
                        logger.error("out of memory during merging ", e);
                        throw new OutOfMemoryError(e.toString());
                    }
                }
            }
        }
    }

} else {
    FileInputStream fis;


> Should be...

> Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
>  IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46,
> analyzer);

Yes, we were and still are referencing LUCENE_46 in our analyzer.


/Jason



On Sat, Apr 5, 2014 at 9:01 PM, Jose Carlos Canova <
jose.carlos.can...@gmail.com> wrote:

> Seems that you want to force a max number of segments to 1,
> On a previous thread someone answered that the number of segments will
> affect the Index Size, and is not related with Index Integrity (like size
> of index may vary according with number of segments).
>
> on version 4.6 there is a small issue on sample that is
>
> Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
>   IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40,
> analyzer);
>
>
> Should be...
>
>
> Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
>   IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46,
> analyzer);
>
>
> With this probably the line related to the codec will change too.
>
>
>
> On Fri, Apr 4, 2014 at 3:52 AM, Jason Wee  wrote:
>
> > Hello again,
> >
> > A little background of our experiment. We are storing lucene (version
> > 4.6.0) on top of cassandra. We are using the demo IndexFiles.java from
> the
> > lucene with minor modification such that the directory used is reference
> to
> > the CassandraDirectory.
> >
> > With large dataset (that is, index more than 5 of files), after index
> > is done, and set forceMerge(1) and get the following exception.
> >
> >
> > BufferedIndexInput readBytes [ERROR] bufferStart = '0' bufferPosition =
> > '1024' len = '9252' after = '10276'
> > BufferedIndexInput readBytes [ERROR] length = '8192'
> >  caught a class java.io.IOException
> >  with message: background merge hit exception: _1(4.6):c10250
> > _0(4.6):c10355 _2(4.6):c10297 _3(4.6):c10217 _4(4.6):c8882 into _5
> > [maxNumSegments=1]
> > java.io.IOException: background merge hit exception: _1(4.6):c10250
> > _0(4.6):c10355 _2(4.6):c10297 _3(4.6):c10217 _4(4.6):c8882 into _5
> > [maxNumSegments=1]
> > at
> > org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1755)
> > at
> > org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1691)
> > at org.apache.lucene.store.IndexFiles.main(IndexFiles.java:159)
> > Caused by: java.io.IOException: read past EOF:
> > CassandraSimpleFSIndexInput(_1.nvd in path="_1.cfs"
> slice=5557885:5566077)
> > at
> >
> >
> org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:186)
> > at
> >
> >
> org.apache.lucene.store.BufferedIndexInpu

Re: background merge hit exception

2014-04-09 Thread Jason Wee
Hi Jose,

Thank you for very informative response.

I have commented out the lines of code that do the forceMerge(50) and
commit() while the indexing is happening. I also increased the RAM buffer
size

iwc.setRAMBufferSizeMB(512.0);

and only after the index is done do I forceMerge and commit, this time with
a larger number of merge segments, that is 50.

if (writer != null && forceMerge) {
    writer.forceMerge(50);
    writer.commit();
}

With these changes, the exceptions reported initially are no longer
happening.

Thank you again.

Jason


On Tue, Apr 8, 2014 at 8:50 PM, Jose Carlos Canova <
jose.carlos.can...@gmail.com> wrote:

> Hi Jason,
>
> No, the StrackTrace shows clearly the cause of the errror occurred during
> the merge into a single index file segment(forgeMerge parameter defines the
> number of desired segments at end).
>
> During the indexing of a document, Lucene might decide to create a new
> segment of the information extracted from a document that you have created
> to index it, somewhere on
> Lucene<http://lucene.apache.org/core/3_0_3/fileformats.html>documentation
> has a description of each file extension and its usage by the
> program.
>
> ForceMerge is an option:
>
> You can also avoid the "forceMerge" letting all segments "as is", the
> retrieval of results will work as same manner, maybe a little slowly
> because the IndexReader will be mounted over several "index segments" but
> works as the same manner, which means the forceMerge to minimize the number
> of index segments can be avoided without harm the search results.
>
> Regarding how to index files,
>
> I did something different to index files found on a directory structure.
> I used the FileVisitor<
> http://docs.oracle.com/javase/7/docs/api/java/nio/file/FileVisitor.html>to
> accumulate which files would be targeted to index, which means first
> scan the files,
> then after the scan, extract their content using
> tika<http://tika.apache.org/> (a
> choice) to finally index them.
>
> With this you can avoid some memory issues and separate the "scan process
> (locate the files)" from the content extraction process (tika extraction or
> other file read routine)  from the "index process(lucene)", because all of
> them are
> memory consuming (for example large pdf files of big string segments).
>
> The disadvantage is that a little bit slow process (if all tasks run's on
> same jvm will obligate to coordinate all threads), but with advantage is
> that permit you to divide the tasks into "sub tasks" and distribute them
> using a cache or a message queue like "activemq<
> http://activemq.apache.org/>",
>  subtasks using a "message queue" also lets you to distribute among
> different processes (jvm's) and machines. On practice take a little bit
> time since you have to write some blocks of line of code to manage all of
> those subtasks.
>
>
>
> att.
>
>
>
>
> On Tue, Apr 8, 2014 at 4:02 AM, Jason Wee  wrote:
>
> > Hello Jose,
> >
> > Thank you for your response, I took a closer look. Below are my
> responses:
> >
> >
> > > Seems that you want to force a max number of segments to 1,
> >
> >   // you're done adding documents to it):
> >   //
> >   writer.forceMerge(1);
> >
> >   writer.close();
> >
> > Yes, the line of code is uncommented because we want to understand how
> > it work when index big data sets. Should this be a concern?
> >
> >
> > > On a previous thread someone answered that the number of segments will
> > > affect the Index Size, and is not related with Index Integrity (like
> size
> > > of index may vary according with number of segments).
> >
> > okay, no idea what the above actually mean but I would guess perhaps
> > the code we added, cause this exception?
> >
> >   if (file.isDirectory()) {
> > String[] files = file.list();
> > // an IO error could occur
> > if (files != null) {
> > for (int i = 0; i < files.length; i++) {
> > indexDocs(writer, new File(file, files[i]),
> > forceMerge);
> > if (forceMerge && writer.hasPendingMerges())
> {
> > if (i % 1000 == 0 && i != 0) {
> > logger.trace("forcing merge now.");
> > try {
> >  

make data search as index progress.

2014-04-14 Thread Jason Wee
https://lucene.apache.org/core/4_6_0/demo/overview-summary.html
https://lucene.apache.org/core/4_6_0/demo/src-html/org/apache/lucene/demo/IndexFiles.html

Hello,

We are using lucene 4.6.0 and storing index on top of cassandra.

As far as I understand, in order to make the index searchable, the
commit() method has to be called in IndexFiles. Is there any other way to
make the index searchable other than calling commit()?

I took a look at NRTCachingDirectory, but our search and indexing
applications live in two separate JVMs; as far as NRT is concerned, the
NRTCachingDirectory instance needs to be passed to both the IndexWriter and
the DirectoryReader to make it searchable.
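
To make the question concrete, the only two-JVM pattern we can see is a
commit() in the indexing JVM plus a reader reopen in the search JVM, roughly
like this (directory here stands for our CassandraDirectory instance):

// Indexing JVM: nothing is visible to the other JVM until this commit.
writer.addDocument(doc);
writer.commit();

// Search JVM: poll for new commits and swap in a fresh reader when one appears.
DirectoryReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);

DirectoryReader changed = DirectoryReader.openIfChanged(reader);   // null if nothing new was committed
if (changed != null) {
    reader.close();
    reader = changed;
    searcher = new IndexSearcher(reader);
}

Is there any way for the search JVM to see documents before that commit, or
is committing more frequently the only option?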

Thanks and appreciate any advice.

/Jason


Re: make data search as index progress.

2014-04-15 Thread Jason Wee
Hello Jose,

Thank you for your insight.

It sounds to me that, before commit is called, if any error happens (for
example, a power failure or human error), then the index will be lost?

> (like at each X docs you commit the index and close it)
iwc.setMaxBufferedDocs(10);

With that, the indexing speed gets very, very slow (around 10-20 docs per
second), unfortunately, and at times, after indexing N files, it just stalls
forever; I am not sure what went wrong.
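
If I understand the suggestion correctly, the intent is a periodic commit
rather than a tiny maxBufferedDocs, something like this sketch (the document
source and the 10000 cadence are made up):

int count = 0;
for (Document doc : docs) {            // docs: whatever iterates our source data
    writer.addDocument(doc);
    if (++count % 10000 == 0) {
        writer.commit();               // everything added so far becomes visible to a reopened reader
    }
}
writer.commit();
writer.close();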

/Jason







On Mon, Apr 14, 2014 at 9:01 PM, Jose Carlos Canova <
jose.carlos.can...@gmail.com> wrote:

> Hello,
>
> That's because NRTCachingDirectory uses a in cache memory to "mimic in
> memory the Directory that you used to index your files ", in theory the
> commit is needed because you need to flush the documents recently added
> otherwise this document will not be available for search until the end of
> the indexing when you really need to flush all documents to the index to
> close properly the "task that you created to index the documents", you can
> adopt other strategies for NRT, one alternative is work with several index
> segments with a fixed document length (like at each X docs you commit the
> index and close it) using a new instance of a CompositeReader to perform
> the search, works at same manner, since the CompositeReader as the name
> says open an IndexReader for a IndexSearcher using list of Indexes.
>
> Will work at same manner but with the disadvantage is that you have to
> create your own code.
>
>
>
>
> On Mon, Apr 14, 2014 at 9:29 AM, Jason Wee  wrote:
>
> > https://lucene.apache.org/core/4_6_0/demo/overview-summary.html
> >
> >
> https://lucene.apache.org/core/4_6_0/demo/src-html/org/apache/lucene/demo/IndexFiles.html
> >
> > Hello,
> >
> > We are using lucene 4.6.0 and storing index on top of cassandra.
> >
> > As far as I understand, in order to make the index searchable, in the
> > IndexFiles, method commit() has to be called, is there any other way so
> > that the index is searchable other than calling commit() ?
> >
> > Took a look on the NRTCachingDirectory,  but our search and index
> > application exists in two separate jvm, as far as NRT is concern,
> instance
> > of NRTCachingDirectory needed to pass in IndexWriter and DirectoryReader
> to
> > make it searchable.
> >
> > Thanks and appreciate any advice.
> >
> > /Jason
> >
>


Re: make data search as index progress.

2014-05-02 Thread Jason Wee
Hello Jose,

Well, yes, we are using append, but if during indexing you do not commit
and only close when the index is done, then the index will persist. Other
than that, during indexing, if commit is not performed, then anything
indexed during that period is lost. Yes, we are trying different settings
for the index writer config and merge policy.

Thank you for the lengthy information; we have also made our code reachable
via github.com

/Jason




On Wed, Apr 16, 2014 at 10:55 AM, Jose Carlos Canova <
jose.carlos.can...@gmail.com> wrote:

> No, the index remains, you can reopen using OpenMode.Append (an enum
> somewhere) if there is any exception like power loss you must delete the
> lock file(which lock the index stream for other index process), i solved
> some issues on performance using a multithread task since the IndexWriter
> is thread safe, the problem for indexing (and probably the samples that in
> fact i never read ;-) ) is that the task for index is single thread, and
> the commit with few segments on large index will cause a performance
> decrease. Probably one alternative (that i haven't tested yet) is since the
> Index grow you can increase the number of segments allowed for your Index.
>
> Since I don't trust on anybody, i use a Database (Postgres) to manage the
> log for indexing, this keeps the task on track to recover from where was
> stopped, i haven't finished my pseudo project yet, and have another solid
> alternatives like  Hibernate Search which is built on top of lucene, my
> problem is that I don't agree with 3rd part frameworks on top of the Lucene
> component because they are updating and enhancing the component faster than
> the 3rd part companies that uses Lucene, but Hibernate Search and Neo4j are
> Industry standards and both use Lucene.
>
>
>
>
> On Tue, Apr 15, 2014 at 9:57 PM, Jason Wee  wrote:
>
> > Hello Jose,
> >
> > Thank you for your insight.
> >
> > It sounds to me that, before method commit is called, then if there is
> any
> > error happened, example, power failure or human error, then the index
> will
> > be lost?
> >
> > > (like at each X docs you commit the index and close it)
> > iwc.setMaxBufferedDocs(10);
> >
> > the index speed get very very slow (like 10-20doc per second)
> unfortunately
> > and at times, after index on N files, it just stalled forever, am not
> sure
> > what went wrong.
> >
> > /Jason
> >
> >
> >
> >
> >
> >
> >
> > On Mon, Apr 14, 2014 at 9:01 PM, Jose Carlos Canova <
> > jose.carlos.can...@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > That's because NRTCachingDirectory uses a in cache memory to "mimic in
> > > memory the Directory that you used to index your files ", in theory the
> > > commit is needed because you need to flush the documents recently added
> > > otherwise this document will not be available for search until the end
> of
> > > the indexing when you really need to flush all documents to the index
> to
> > > close properly the "task that you created to index the documents", you
> > can
> > > adopt other strategies for NRT, one alternative is work with several
> > index
> > > segments with a fixed document length (like at each X docs you commit
> the
> > > index and close it) using a new instance of a CompositeReader to
> perform
> > > the search, works at same manner, since the CompositeReader as the name
> > > says open an IndexReader for a IndexSearcher using list of Indexes.
> > >
> > > Will work at same manner but with the disadvantage is that you have to
> > > create your own code.
> > >
> > >
> > >
> > >
> > > On Mon, Apr 14, 2014 at 9:29 AM, Jason Wee  wrote:
> > >
> > > > https://lucene.apache.org/core/4_6_0/demo/overview-summary.html
> > > >
> > > >
> > >
> >
> https://lucene.apache.org/core/4_6_0/demo/src-html/org/apache/lucene/demo/IndexFiles.html
> > > >
> > > > Hello,
> > > >
> > > > We are using lucene 4.6.0 and storing index on top of cassandra.
> > > >
> > > > As far as I understand, in order to make the index searchable, in the
> > > > IndexFiles, method commit() has to be called, is there any other way
> so
> > > > that the index is searchable other than calling commit() ?
> > > >
> > > > Took a look on the NRTCachingDirectory,  but our search and index
> > > > application exists in two separate jvm, as far as NRT is concern,
> > > instance
> > > > of NRTCachingDirectory needed to pass in IndexWriter and
> > DirectoryReader
> > > to
> > > > make it searchable.
> > > >
> > > > Thanks and appreciate any advice.
> > > >
> > > > /Jason
> > > >
> > >
> >
>


Lucene Indexing performance issue

2014-10-22 Thread Jason Wu
Hi Team,

I am a new user of Lucene 4.8.1. I encountered a Lucene indexing
performance issue which slows down my application greatly. I tried several
approaches found through Google searches but still couldn't resolve it. Any
suggestions from you experts would help me a lot.

One of my applications uses a Lucene index for fast data searching. When I
start the application, I index all the necessary data from the database,
which comes to 88 MB of index data after indexing is done. In this case,
indexing takes less than 4 minutes.

I have another shell-script task running every night, which sends a JMX call
to my application to re-index all the data. The re-indexing method clears
the current index directory, reads the data from the database and recreates
the index from the ground up. Everything works fine at the beginning;
re-indexing takes only a little more than 3 minutes. But after my
application has been running for a while (one day or two), the re-indexing
slows down greatly and now takes more than 22 minutes.

Here is the procedure for my Lucene indexing and re-indexing (a sketch follows the list):

   1. If index data exists inside index directory, remove all the index
   data.
   2. Create IndexWriter with 200MB RAMBUFFERSIZE, (6.6) MaxMergesAndThreads
   3. Process DB result set
   - When I loop the result set, I reuse the same Document instance.
   - At the end of each loop, I call indexWriter.addDocument(doc)
   4. IndexWriter.commit()
   5. IndexWriter.close();
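
For completeness, a stripped-down sketch of the procedure above, imports
omitted (the path, field name and JDBC loop are placeholders, not my real
code):

Directory dir = FSDirectory.open(new File("/path/to/index"));             // placeholder path
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48,
        new StandardAnalyzer(Version.LUCENE_48));
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);                        // step 1: wipe the old index
iwc.setRAMBufferSizeMB(200.0);                                             // step 2
IndexWriter writer = new IndexWriter(dir, iwc);

Document doc = new Document();                                             // step 3: reused instance
StringField id = new StringField("id", "", Field.Store.YES);               // placeholder field
doc.add(id);
while (rs.next()) {                                                        // rs: the DB result set
    id.setStringValue(rs.getString("id"));
    writer.addDocument(doc);
}
writer.commit();                                                           // step 4
writer.close();                                                            // step 5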


I did some profiling when it was slow and found that the
indexWriter.addDocument method took most of the time. Then I put in some
logging code as below:

long start = System.currentTimeMillis();
indexWriter.addDocument(doc);
totalAddDocTime += (System.currentTimeMillis() - start);

After several tests, when the indexing has slowed down, the total time taken
by indexWriter.addDocument(doc) is about 20 minutes.

During indexing, I also observed the CPU usage sometimes going above 100%.

6 GB of memory is assigned to my application. While indexing, the other
processing modules are all suspended waiting for indexing to finish, and I
don't see any memory leak in my application.

Can you give me some suggestions about my issue?

Thank you,

Jason


Re: Making lucene indexing multi threaded

2014-10-27 Thread Jason Wu
Hi Nischal,

I had a similar indexing issue. My Lucene indexing took 22 minutes for 70 MB
of docs. When I debugged the problem, I found that
indexWriter.addDocument(doc) was taking a really long time.

Have you already found the solution about it?

Thank you,
Jason



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Making-lucene-indexing-multi-threaded-tp4087830p4166094.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Making lucene indexing multi threaded

2014-10-27 Thread Jason Wu
Hi Fuad,

Thanks for your suggestions and quick response. I am using a single-threaded
approach to add docs. I will try multi-threaded indexing to see if my issue
gets resolved.

This issue only appeared after I upgraded Lucene from 2.4.1 (with Java
1.6) to 4.8.1 (with Java 1.7). I don't have this problem with the old Lucene
version.

The indexing speed is fast when I start the application; indexing takes only
3 minutes. But after my application has been running for a while (a day,
etc.), once I send a JMX call to my application to reindex the docs, the
indexing slows down and takes 22 minutes.

Did you have any similar experience like the above before?

Thank you,
Jason



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Making-lucene-indexing-multi-threaded-tp4087830p4166116.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Making lucene indexing multi threaded

2014-10-27 Thread Jason Wu
Hi Gary,

Thanks for your response. I only call commit once all my docs are added.

Here is the procedure of my Lucene indexing and re-indexing: 

   1. If index data exists inside index directory, remove all the index 
   data. 
   2. Create IndexWriter with 256MB RAMBUFFERSIZE
   3. Process DB result set 
   - When I loop the result set, I reuse the same Document instance.
   - At the end of each loop, I call indexWriter.addDocument(doc)
   4. After all docs are added, call IndexWriter.commit() 
   5. IndexWriter.close(); 

Thank you,
Jason



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Making-lucene-indexing-multi-threaded-tp4087830p4166123.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Java8 and lucene version

2015-05-06 Thread Jason Wee
The hard way is to go through the Lucene code and check whether it works
with Java 8. If you can duplicate the index created with Lucene 2.9.4,
perhaps you can try upgrading Java in a test environment; it should give
some direct indication or result (for example, an exception, or the index
cannot be written/read, etc.) immediately.

hth

jason

On Thu, May 7, 2015 at 4:19 AM, Pushyami Gundala  wrote:

> Hi, We are using lucene 2.9.4 version for our application that has search.
> We are planning on upgrading our application to run on java 8. My Question
> is when we move to java 8 does the lucene-2.9.4 version still work? or i
> need to upgrade to new version of lucene  to support  java 8.
> Regards,
> Pushyami
>


Re: Global ordinal based query time join documentation

2015-06-06 Thread Jason Wee
https://svn.apache.org/viewvc/lucene/dev/branches/branch_5x/lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java?view=markup&pathrev=1671777
https://svn.apache.org/viewvc?view=revision&revision=1671777
https://issues.apache.org/jira/browse/LUCENE-6352

hth

jason

On Fri, Jun 5, 2015 at 11:35 PM, Eduardo Manrique 
wrote:

> Hi everyone,
>
> I’d like to try the new “Global ordinal based query time join”, but I
> couldn’t find any docs or examples. Do you guys have a link, or example on
> how to use it?
>
> Thanks,
> Eduardo Manrique
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Request for help with Lucene search engine

2015-06-26 Thread Jason Wee
maybe start with this?
https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/core/src/test/org/apache/lucene/search/TestDocValuesScoring.java

hth

jason

On Fri, Jun 26, 2015 at 7:40 PM, Rim REKIK  wrote:

> Dear,
> I m trying Lucene to work with Lucene search engine. But I m asking if
> there are ready examples for scoring documents.
> Thank you.
>
> Regards --
> Rim REKIK rim.re...@ieee.org
>


Lucene IndexSearcher PrefixQuery seach getting really slow after a while

2016-11-03 Thread Jason Wu
Hi Team,

We have been using Lucene 4.8.1 to do some info searches every day for years.
However, recently we have encountered some performance issues which greatly
slow down the Lucene search.

After the application has been running for a while, we face the issue below,
where an IndexSearcher PrefixQuery takes a much longer time to search:

[inline screenshot omitted: IndexSearcher PrefixQuery timings]

Our CPU and memory are fine; no leak found:
[inline screenshot omitted: CPU/memory monitoring]


However, for the exact same Java instance running on another box, searching
for the same info is very fast.

I/O, memory and CPUs are all fine on both boxes.

So, do you know of any reasons that could cause this performance issue?

Thank you,
J.W





Re: term frequency

2016-11-24 Thread Jason Wee
The exception line number does not match the code you pasted, but do make
sure your object is actually not null before accessing its methods.
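
For example, dropping a guard into your snippet (the /terms handler hint is
an assumption about your Solr config):

TermsResponse termResp = rsp.getTermsResponse();
if (termResp == null) {
    // No terms section came back -- typically the request did not reach a handler
    // that includes the TermsComponent (try query.setRequestHandler("/terms")).
    System.out.println("no terms in response");
} else {
    List<TermsResponse.Term> terms = termResp.getTerms("name");
    System.out.println(terms == null ? 0 : terms.size());
}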

On Thu, Nov 24, 2016 at 5:42 PM, huda barakat
 wrote:
> I'm using SOLRJ to find term frequency for each term in a field, I wrote
> this code but it is not working:
>
>
>1. String urlString = "http://localhost:8983/solr/huda";;
>2. SolrClient solr = new HttpSolrClient.Builder(urlString).build();
>3.
>4. SolrQuery query = new SolrQuery();
>5. query.setTerms(true);
>6. query.addTermsField("name");
>7. SolrRequest req = new QueryRequest(query);
>8. QueryResponse rsp = req.process(solr);
>9.
>10. System.out.println(rsp);
>11.
>12. System.out.println("numFound: " +
> rsp.getResults().getNumFound());
>13.
>14. TermsResponse termResp =rsp.getTermsResponse();
>15. List terms = termResp.getTerms("name");
>16. System.out.print(terms.size());
>
>
> I got this error:
>
> Exception in thread "main" java.lang.NullPointerException at
> solr_test.solr.App2.main(App2.java:50)

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: [VOTE] Lucene logo contest

2020-06-16 Thread Jason Gerlowski
Option "A"

On Tue, Jun 16, 2020 at 8:37 PM Man with No Name
 wrote:
>
> A, clean and modern.
>
> On Mon, Jun 15, 2020 at 6:08 PM Ryan Ernst  wrote:
>>
>> Dear Lucene and Solr developers!
>>
>> In February a contest was started to design a new logo for Lucene [1]. That 
>> contest concluded, and I am now (admittedly a little late!) calling a vote.
>>
>> The entries are labeled as follows:
>>
>> A. Submitted by Dustin Haver [2]
>>
>> B. Submitted by Stamatis Zampetakis [3] Note that this has several variants. 
>> Within the linked entry there are 7 patterns and 7 color palettes. Any vote 
>> for B should contain the pattern number, like B1 or B3. If a B variant wins, 
>> we will have a followup vote on the color palette.
>>
>> C. The current Lucene logo [4]
>>
>> Please vote for one of the three (or nine depending on your perspective!) 
>> above choices. Note that anyone in the Lucene+Solr community is invited to 
>> express their opinion, though only Lucene+Solr PMC cast binding votes 
>> (indicate non-binding votes in your reply, please). This vote will close one 
>> week from today, Mon, June 22, 2020.
>>
>> Thanks!
>>
>> [1] https://issues.apache.org/jira/browse/LUCENE-9221
>> [2] 
>> https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png
>> [3] https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf
>> [4] https://lucene.apache.org/theme/images/lucene/lucene_logo_green_300.png
>
> --
> Sent from Gmail for IPhone

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: [VOTE] Lucene logo contest, third time's a charm

2020-09-02 Thread Jason Gerlowski
A1, A2, D (binding)

On Wed, Sep 2, 2020 at 10:47 AM Michael McCandless
 wrote:
>
> A2, A1, C5, D (binding)
>
> Thank you to everyone for working so hard to make such cool looking possible 
> future Lucene logos!  And to Ryan for the challenging job of calling this 
> VOTE :)
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Sep 1, 2020 at 4:21 PM Ryan Ernst  wrote:
>>
>> Dear Lucene and Solr developers!
>>
>> Sorry for the multiple threads. This should be the last one.
>>
>> In February a contest was started to design a new logo for Lucene 
>> [jira-issue]. The initial attempt [first-vote] to call a vote resulted in 
>> some confusion on the rules, as well the request for one additional 
>> submission. The second attempt [second-vote] yesterday had incorrect links 
>> for one of the submissions. I would like to call a new vote, now with more 
>> explicit instructions on how to vote, and corrected links.
>>
>> Please read the following rules carefully before submitting your vote.
>>
>> Who can vote?
>>
>> Anyone is welcome to cast a vote in support of their favorite submission(s). 
>> Note that only PMC member's votes are binding. If you are a PMC member, 
>> please indicate with your vote that the vote is binding, to ease collection 
>> of votes. In tallying the votes, I will attempt to verify only those marked 
>> as binding.
>>
>> How do I vote?
>>
>> Votes can be cast simply by replying to this email. It is a ranked-choice 
>> vote [rank-choice-voting]. Multiple selections may be made, where the order 
>> of preference must be specified. If an entry gets more than half the votes, 
>> it is the winner. Otherwise, the entry with the lowest number of votes is 
>> removed, and the votes are retallied, taking into account the next preferred 
>> entry for those whose first entry was removed. This process repeats until 
>> there is a winner.
>>
>> The entries are broken up by variants, since some entries have multiple 
>> color or style variations. The entry identifiers are first a capital letter, 
>> followed by a variation id (described with each entry below), if applicable. 
>> As an example, if you prefer variant 1 of entry A, followed by variant 2 of 
>> entry A, variant 3 of entry C, entry D, and lastly variant 4e of entry B, 
>> the following should be in your reply:
>>
>> (binding)
>> vote: A1, A2, C3, D, B4e
>>
>> Entries
>>
>> The entries are as follows:
>>
>> A. Submitted by Dustin Haver. This entry has two variants, A1 and A2.
>>
>> [A1] 
>> https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png
>> [A2] https://issues.apache.org/jira/secure/attachment/12997172/LuceneLogo.png
>>
>> B. Submitted by Stamatis Zampetakis. This has several variants. Within the 
>> linked entry there are 7 patterns and 7 color palettes. Any vote for B 
>> should contain the pattern number followed by the lowercase letter of the 
>> color palette. For example, B3e or B1a.
>>
>> [B] https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf
>>
>> C. Submitted by Baris Kazar. This entry has 8 variants.
>>
>> [C1] 
>> https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo1_full.pdf
>> [C2] 
>> https://issues.apache.org/jira/secure/attachment/13006393/lucene_logo2_full.pdf
>> [C3] 
>> https://issues.apache.org/jira/secure/attachment/13006394/lucene_logo3_full.pdf
>> [C4] 
>> https://issues.apache.org/jira/secure/attachment/13006395/lucene_logo4_full.pdf
>> [C5] 
>> https://issues.apache.org/jira/secure/attachment/13006396/lucene_logo5_full.pdf
>> [C6] 
>> https://issues.apache.org/jira/secure/attachment/13006397/lucene_logo6_full.pdf
>> [C7] 
>> https://issues.apache.org/jira/secure/attachment/13006398/lucene_logo7_full.pdf
>> [C8] 
>> https://issues.apache.org/jira/secure/attachment/13006399/lucene_logo8_full.pdf
>>
>> D. The current Lucene logo.
>>
>> [D] https://lucene.apache.org/theme/images/lucene/lucene_logo_green_300.png
>>
>> Please vote for one of the above choices. This vote will close about one 
>> week from today, Mon, Sept 7, 2020 at 11:59PM.
>>
>> Thanks!
>>
>> [jira-issue] https://issues.apache.org/jira/browse/LUCENE-9221
>> [first-vote] 
>> http://mail-archives.apache.org/mod_mbox/lucene-dev/202006.mbox/%3cCA+DiXd74Mz4H6o9SmUNLUuHQc6Q1-9mzUR7xfxR03ntGwo=d...@mail.gmail.com%3e
>> [second-vote] 
>> http://mail-archives.apache.org/mod_mbox/lucene-dev/202009.mbox/%3cCA+DiXd7eBrQu5+aJQ3jKaUtUTJUqaG2U6o+kUZfNe-m=smn...@mail.gmail.com%3e
>> [rank-choice-voting] https://en.wikipedia.org/wiki/Instant-runoff_voting

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



[ANNOUNCE] Apache Lucene 8.6.3 released

2020-10-08 Thread Jason Gerlowski
The Lucene PMC is pleased to announce the release of Apache Lucene 8.6.3.

Apache Lucene is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for
nearly any application that requires full-text search, especially
cross-platform.

This release contains no additional bug fixes over the previous
version 8.6.2. The release is available for immediate download at:

  

Note: The Apache Software Foundation uses an extensive mirroring network for
distributing releases. It is possible that the mirror you are using may not have
replicated the release yet. If that is the case, please try another mirror.

This also applies to Maven access.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Replicating Lucene Index with out SOLR

2008-08-28 Thread Jason Rutherglen
Hello,

I have been emailing Otis regarding some of the replication issues and
it is good to get them into the Lucene forums to obtain feedback and
try to agree on what is most advantageous.  Solr replication uses what
I call segment replication.  Ocean can do segment replication but
usually simply serializes the documents.  The analyzing is redundant
but I believe it is a small cost.  IO is the largest cost.  I believe
these issues are solved, and a software system that allows high
quantities of cheap hardware will lessen the IO cost.  The hard
part of the many servers problem is getting the replication to
function consistently in the event of failure of the master node in a
cell of nodes.  There is a reason Google chose to develop BigTable
rather than continue building out large clusters of Mysql servers.
One of the main issues probably had to do with the master slave
failover issues with 1000s of servers.  It is probably simply too hard
to try to rely on master slave alone to insure all the transactions
are completed to all nodes.  It is also too difficult to make the
model work over a geographically distributed set of servers though not
impossible.  In any case the goal with Ocean is to build something a
little bit better than what is currently available, but also something
simpler and easier to understand than what is currently available.

For Ocean I have been attempting to develop a system that successfully
implements conflict resolution without a master slave approach.  I
detail this somewhat in
http://wiki.apache.org/lucene-java/OceanRealtimeSearch replication
section.  I had problems implementing master failover using the Paxos
algorithm.  I tried implementing my own failover algorithms however
they just never worked.  Doug Cutting has been interested in how
CouchDB implements event based replication though I quite frankly do
not want to learn the Erlang language to figure it out.  The
motivations of CouchDB seem to be similar in that I do not think they
have a master slave architecture.  In any case many Mysql
installations implement their own conflict resolution and they are
just now looking at implementing it as a standard part of Mysql
http://mysqlmusings.blogspot.com/2007/06/replication-poll-and-our-plans-for.html.
 For Ocean I want the replication to work out of the box without
master slave as it seems like the right thing to do.

One requirement is the ability to perform an update in parallel and
then not worry if it made it to all nodes.  Then let the nodes get the
lost update (a rare case) by a polling mechanism that involves
comparing transaction ids.  Even in master slave it is possible to
lose transactions during the master slave failover process.  If the
client only performs a transaction once with a unique id, then if the
transaction fails and the client tries again, there would be a new id
and the resolution would not create duplicates.

It seems that Google has implemented this type of asynchronous
replication judging by this
http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=337
article.  It is just plain easier as the number of servers is added.
The problem with master slave with many servers is knowing which
server is the master at any given point.  It would seem easier to
build an architecture where it becomes irrelevant using asynchronous
conflict resolution.  This also allows the servers to be distributed
geographically where the latency is higher, which is what a
multi-master architecture solves using SQL databases.  Multi master in
SQL databases uses asynchronous conflict resolution.

An important point from the ACM article that is relevant to Lucene is
the section called "Do databases let schemas evolve for a set of items
using a bottom-up consensus/tipping point?"  Lucene is the type of
system that solves this problem with SQL databases.  I believe it is a
fundamental advantage over SQL databases.  If the Ocean system can
scale well then it can offer some unique advantages over SQL
databases, while also providing all of the powerful search
functionality offered by Lucene such as phrase queries, span queries,
payloads, custom scoring, functions, etc.

I am not sure how much more to put here right now as I may just be
blathering on.  I welcome feedback and will try to place the most
current thoughts on the wiki
http://wiki.apache.org/lucene-java/OceanRealtimeSearch.  One thing to
note is that currently Ocean generates ids based on a server number.
This way each id generated can be traced back to a server but still
increments.  This is helpful with conflict resolution.  Right now I am
writing code to use this id for the Ocean conflict resolution.

Cheers,
Jason

On Thu, Aug 28, 2008 at 12:57 PM, Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
>
> Yes, I think you pinpointed what I see over and over with Solr.  The two 
> desires pull in opposite directions.  I think Jason Rutherglen is very keen 
> to start talking about Luce

Realtime Search for Social Networks Collaboration

2008-09-03 Thread Jason Rutherglen
Hello all,

I don't mean this to sound like a solicitation.  I've been working on
realtime search and created some Lucene patches etc.  I am wondering
if there are social networks (or anyone else) out there who would be
interested in collaborating with Apache on realtime search to get it
to the point it can be used in production.  It is a challenging
problem that only Google has solved and made to scale.  I've been
working on the problem for a while and though a lot has been
completed, there is still a lot more to do and collaboration amongst
the most probable users (social networks) seems like a good thing to
try to do at this point.  I guess I'm saying it seems like a hard
enough problem that perhaps it's best to work together on it rather
than each company try to complete their own.  However I could be
wrong.

Realtime search benefits social networks by providing a scalable
searchable alternative to large Mysql implementations.  Mysql I have
heard is difficult to scale at a certain point.  Apparently Google has
created things like BigTable (a large database) and an online service
called GData (which Google has not published any whitepapers on the
technology underneath) to address scaling large database systems.
BigTable does not offer search.   GData does and is used by all of
Google's web services instead of something like Mysql (this is at
least how I understand it).  Social networks usually grow and so
scaling is continually an issue.  It is possible to build a realtime
search system that scales linearly, something that I have heard
becomes difficult with Mysql.  There is an article that discusses some
of these issues
http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=337  I
don't think the current GData implementation is perfect and there is a
lot that can be improved on.  It might be helpful to figure out
together what helpful things can be added.

If this sounds like something of interest to anyone feel free to send
your input.

Take care,
Jason

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Realtime Search for Social Networks Collaboration

2008-09-03 Thread Jason Rutherglen
Hi Yonik,

The SOLR 2 list looks good.  The question is, who is going to do the
work?  I tried to simplify the scope of Ocean as much as possible to
make it possible (and slowly at that over time) for me to eventually
finish what is mentioned on the wiki.  I think SOLR is very cool and
was a major step forward when it came out.  I also think it's got a
lot of things now which make integration difficult to do properly.  I
did try to integrate and received a lukewarm response and so decided
to just move ahead separately until folks have time to collaborate.
We probably should try to integrate SOLR and Ocean somehow however we
may want to simply reduce the scope a bit and figure what is needed
most, with the main use case being social networks.

I think the problem with integration with SOLR is it was designed with
a different problem set in mind than Ocean, originally the CNET
shopping application.  Facets were important, realtime was not needed
because pricing doesn't change very often.  I designed Ocean for
social networks and actually further into the future realtime
messaging based mobile applications.

SOLR needs to be backward compatible and support its existing user
base.  How do you plan on doing this for a SOLR 2 if the architecture
is changed dramatically?  SOLR solves a problem set that is very
common making SOLR very useful in many situations.  However I wanted
Ocean to be like GData.  So I wanted the scalability of Google, which
SOLR doesn't quite have yet, plus the realtime behavior, and I figured the
other stuff could be added later: the stuff people currently seem to
spend a lot of time on in the SOLR community (spellchecker, db imports,
many others).  I did use some of the SOLR terminology in building
Ocean, like snapshots!  But most of it is a digression.  I tried to
use schemas, but they just make the system harder to use.  For
distributed search I prefer serialized objects as this enables things
like SpanQueries and payloads without writing request handlers and
such.  Also there is no need to write new request handlers and deploy
them (an expensive operation for systems that run on hundreds of
servers), as any new classes are simply dynamically loaded by the
server from the client.
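
As a small illustration of the serialized-objects point (a sketch only;
the transport and dynamic class-loading pieces are left out): Query
implementations in the Lucene 2.x line are java.io.Serializable, so a
client can build something like a SpanNearQuery and ship the object
itself to a node, rather than a string for a request handler to
re-parse.

import java.io.*;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class QueryShippingSketch {
  public static void main(String[] args) throws Exception {
    // Build a span query on the client side.
    SpanQuery[] clauses = new SpanQuery[] {
        new SpanTermQuery(new Term("body", "acquisition")),
        new SpanTermQuery(new Term("body", "plan"))
    };
    Query query = new SpanNearQuery(clauses, 2, true);

    // Serialize it; the bytes can be sent to each search node.
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    out.writeObject(query);
    out.close();

    // The node deserializes and executes the same object directly,
    // with no new request handler to write or deploy.
    ObjectInputStream in = new ObjectInputStream(
        new ByteArrayInputStream(bytes.toByteArray()));
    Query remoteCopy = (Query) in.readObject();
    System.out.println(remoteCopy);
  }
}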

A lot is now outlined on the wiki site
http://wiki.apache.org/lucene-java/OceanRealtimeSearch and there
will be a lot more javadocs in the forthcoming patch.  The latest code
is also available all the time at
http://oceansearch.googlecode.com/svn/trunk/trunk/oceanlucene

I do welcome more discussion and if there are Solr developers who wish
to work on Ocean feel free to drop me a line.  Most of all though I
think it would be useful for social networks interested in realtime
search to get involved as it may be something that is difficult for
one company to have enough resources to implement to a production
level.  I think this is where open source collaboration is
particularly useful.

Cheers,

Jason Rutherglen
[EMAIL PROTECTED]

On Wed, Sep 3, 2008 at 4:56 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On Wed, Sep 3, 2008 at 3:20 PM, Jason Rutherglen
> <[EMAIL PROTECTED]> wrote:
>> I am wondering
>> if there are social networks (or anyone else) out there who would be
>> interested in collaborating with Apache on realtime search to get it
>> to the point it can be used in production.
>
> Good timing Jason, I think you'll find some other people right here
> at Apache (solr-dev) that want to collaborate in this area:
>
> http://www.nabble.com/solr2%3A-Onward-and-Upward-td19224805.html
>
> I've looked at your wiki briefly, and all the high level goals/features seem
> to really be synergistic with where we are going with Solr2.
>
> -Yonik
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Realtime Search for Social Networks Collaboration

2008-09-04 Thread Jason Rutherglen
Hi Cam,

Thanks!  It has not been easy; it has probably taken 3 years or so to get
this far.  At first I thought the new reopen code would be the
solution.  I used it, but then needed to modify it to do a clone
instead of referencing the old deleted docs.  Then as I iterated, I
realized that just using reopen on a RAMDirectory would not be quite
fast enough because of the merging.  Then I started using
InstantiatedIndex, which provides an in-memory version of the documents
without the overhead of merging during the transaction.  There are
other complexities as well.  The basic code works if you are
interested in trying it out.
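
For anyone who wants to start from the plain-Lucene end of this, here
is roughly the reopen pattern I started with (a minimal sketch against
the stock 2.3-era API, nothing Ocean-specific); the merging latency
mentioned above is exactly what this sketch does not solve.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;

public class ReopenSketch {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

    Document doc = new Document();
    doc.add(new Field("body", "first document",
        Field.Store.YES, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.close();                    // make the change visible

    IndexReader reader = IndexReader.open(dir);

    // ... later, after more documents have been written and flushed ...
    IndexReader newReader = reader.reopen();   // cheap if nothing changed
    if (newReader != reader) {
      reader.close();
      reader = newReader;
    }
    IndexSearcher searcher = new IndexSearcher(reader);
    // Search as usual; segment merging still happens underneath,
    // which is the latency problem described above.
  }
}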

Take care,
Jason

On Thu, Sep 4, 2008 at 9:08 AM, Cam Bazz <[EMAIL PROTECTED]> wrote:
> Hello Jason,
> I have been trying to do this for a long time on my own. keep up the good
> work.
>
> What I tried was a document cache using apache collections. and before a
> indexwrite/delete i would sync the cache with index.
>
> I am waiting for lucene 2.4 to proceed. (query by delete)
>
> Best.
>
> On Wed, Sep 3, 2008 at 10:20 PM, Jason Rutherglen <
> [EMAIL PROTECTED]> wrote:
>
>> Hello all,
>>
>> I don't mean this to sound like a solicitation.  I've been working on
>> realtime search and created some Lucene patches etc.  I am wondering
>> if there are social networks (or anyone else) out there who would be
>> interested in collaborating with Apache on realtime search to get it
>> to the point it can be used in production.  It is a challenging
>> problem that only Google has solved and made to scale.  I've been
>> working on the problem for a while and though a lot has been
>> completed, there is still a lot more to do and collaboration amongst
>> the most probable users (social networks) seems like a good thing to
>> try to do at this point.  I guess I'm saying it seems like a hard
>> enough problem that perhaps it's best to work together on it rather
>> than each company try to complete their own.  However I could be
>> wrong.
>>
>> Realtime search benefits social networks by providing a scalable
>> searchable alternative to large Mysql implementations.  Mysql I have
>> heard is difficult to scale at a certain point.  Apparently Google has
>> created things like BigTable (a large database) and an online service
>> called GData (which Google has not published any whitepapers on the
>> technology underneath) to address scaling large database systems.
>> BigTable does not offer search.   GData does and is used by all of
>> Google's web services instead of something like Mysql (this is at
>> least how I understand it).  Social networks usually grow and so
>> scaling is continually an issue.  It is possible to build a realtime
>> search system that scales linearly, something that I have heard
>> becomes difficult with Mysql.  There is an article that discusses some
>> of these issues
>> http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=337  I
>> don't think the current GData implementation is perfect and there is a
>> lot that can be improved on.  It might be helpful to figure out
>> together what helpful things can be added.
>>
>> If this sounds like something of interest to anyone feel free to send
>> your input.
>>
>> Take care,
>> Jason
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How can we know if 2 lucene indexes are same?

2008-09-05 Thread Jason Rutherglen
In Ocean I had to use a transaction log and execute everything that
way, like SQL database replication, and then let each node handle its
own merging process.  Syncing the indexes is used to get a new node up
to speed; otherwise it is avoided for the reasons mentioned in the
previous email.
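
Roughly, the shape of it is this (an illustrative sketch only, not the
actual Ocean code, and the log record layout is made up): every node
consumes the same ordered stream of add/delete operations and applies
them to its own local IndexWriter, so the logical contents stay in sync
while segment merging is left entirely to each node.

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Hypothetical log record: either an add or a delete, in log order.
class LogRecord {
  final Document docToAdd;   // null for deletes
  final Term deleteTerm;     // null for adds
  LogRecord(Document d, Term t) { docToAdd = d; deleteTerm = t; }
}

class LogReplayer {
  // Each replica runs this against its own IndexWriter; merge policy
  // and segment layout are free to differ per node.
  void replay(Iterable<LogRecord> log, IndexWriter writer) throws IOException {
    for (LogRecord rec : log) {
      if (rec.docToAdd != null) {
        writer.addDocument(rec.docToAdd);
      } else {
        writer.deleteDocuments(rec.deleteTerm);
      }
    }
  }
}

Copying index files is then only needed to bring a brand new node up to
speed, as mentioned above.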

On Fri, Sep 5, 2008 at 8:33 AM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
>
> Shalin Shekhar Mangar wrote:
>
>> Let me try to explain.
>>
>> I have a master where indexing is done. I have multiple slaves for
>> querying.
>>
>> If I commit+optimize on the master and then rsync the index, the data
>> transferred on the network is huge. An alternate way is to commit on
>> master,
>> transfer the delta to the slave and issue an optimize on the slave. This
>> is
>> very fast because less data is transferred on the network.
>
> Large segment merges will also send huge traffic.  You may just want to send
> all updates (document adds/deletes) to all slaves directly?  It'd be nice if
> you could somehow NOT sync the effects of segment merging, but do sync doc
> add/deletes... not sure how to do that.
>
>> However, we need to ensure that the index on the slave is actually in sync
>> with the master. So that on another commit, we can blindly transfer the
>> delta to the slave.
>
> I assume your app ensures that no deltas arrive to the slave while it's
> running optimize?  So then the question becomes (I think) "if two indices
> are identical to begin with, and I separately run optimize on each, will the
> resulting two optimized indices be identical?".
>
> By "in sync" you also require the final segment name (after optimize) is
> identical right?
>
> I think the answer is yes, but I'm not certain unless I think more about it.
>  Also this behavior is not "promised" in Lucene's API.
>
> Merges for optimize are now allowed to run concurrently (by default, with
> ConcurrentMergeScheduler), except for the final (< mergeFactor segments)
> merge, which waits until others have finished.  So if there are 7 obvious
> merges necessary to optimize, 3 will run concurrently, while 4 wait.  Those
> 4 then run as the merges finish over time, which may happen in different
> orders for each index and so different merges may run.  Then the final merge
> will run and I *think* the net number of merges that ran should always be
> the same and so the final segment name should be the same.
>
> Mike
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Incremental Indexing.

2008-09-08 Thread Jason Rutherglen
Hi Jang,

I've been working on Tag Index to address this issue.  It seems like a
popular feature and I have not had time to fully implement it yet.
http://issues.apache.org/jira/browse/LUCENE-1292  To be technical it
handles UN_TOKENIZED fields (did this name change now?) and some
specialized things to allow updating of parts of the inverted index.
If you're interested in working on it, feel free to let me know.
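
In the meantime, the delete-then-add update you describe below is
wrapped by IndexWriter.updateDocument, keyed on a unique term such as
GOOD_ID.  A minimal sketch against the stock 2.x API (the Store/Index
settings here are just a guess for illustration):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class GoodsUpdater {
  // Rebuild the document for GOOD_ID 1 with the new PRICE and
  // UPDATEDATE, then replace the old copy (delete by term + add).
  void updateGood(IndexWriter writer) throws IOException {
    Document doc = new Document();
    doc.add(new Field("GOOD_ID", "1", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("NAME", "ipod", Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("PRICE", "35000", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("CREATEDATE", "2008-11-10:11:00", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("UPDATEDATE", "2008-11-10:12:00", Field.Store.YES, Field.Index.UN_TOKENIZED));
    writer.updateDocument(new Term("GOOD_ID", "1"), doc);
  }
}

Doing this for a few thousand changed records in one IndexWriter
session is a common pattern, though I would benchmark it against your
own data.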

Cheers,
Jason

2008/9/8 장용석 <[EMAIL PROTECTED]>:
> Hi~.
> I have a question about Lucene incremental indexing.
>
> I want to do incremental indexing of my goods data.
> For example, I have 4 product records with the columns
> "GOOD_ID", "NAME", "PRICE", "CREATEDATE", "UPDATEDATE".
>
> 1, ipod, 3, 2008-11-10:11:00, 2008-11-10:11:00
> 2, java book, 2, 2008-11-10:11:00, 2008-11-10:11:00
> 3, calendar, 1, 2008-11-10:11:00, 2008-11-10:11:00
> 4, lucene book, 5000, 2008-11-10:11:00, 2008-11-10:11:00
>
> If I index these records, each will get a unique docid.
>
> Then I update the record that has good_id "1", changing the price column from 3 to 35000
> and the UPDATEDATE column from 2008-11-10:11:00 to 2008-11-10:12:00.
>
> In this case, I want to update my index with the new data for good_id "1".
>
> The book says that if I want to update my index, I should delete the target data
> from the index and then add the new data to the index.
> If there is only one target record, I think it is no problem for me or my applications.
> But if there are 3000 (or more) target records, the application must
> delete and re-add data 3000 (or more) times.
> I am worried that this will be a problem for my applications.
> Or is this not a problem?
> I need your help. :-)
> I need your helps.. :-)
>
> many thanks.
> Jang.
>
> --
> DEV용식
> http://devyongsik.tistory.com
>

