RE: A really hairy token graph case

2014-10-24 Thread Will Martin
Hi Benson:

This is the case with n-gramming (though you have a more complicated start 
chooser than most I imagine).  Does that help get your ideas unblocked?

Will

-Original Message-
From: Benson Margulies [mailto:bimargul...@gmail.com] 
Sent: Friday, October 24, 2014 4:43 PM
To: java-user@lucene.apache.org
Subject: A really hairy token graph case

Consider a case where we have a token which can be subdivided in several ways. 
This can happen in German. We'd like to represent this with 
positionIncrement/positionLength, but it does not seem possible.

Once the position has moved out from one set of 'subtokens', we see no way to 
move it back for the second set of alternatives.

Is this something that was considered?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org






RE: A really hairy token graph case

2014-10-24 Thread Will Martin
Benson: I'm in danger of trying to remember CPL's German decompounder and how 
we used it. That would be a very unreliable memory.

However, at the link below David and Rupert have a resoundingly informative 
discussion about making something similar work for synonyms. The knowledge-base 
info captured there may bear reading through.

https://github.com/OpenSextant/SolrTextTagger/issues/10




-Original Message-
From: Benson Margulies [mailto:ben...@basistech.com] 
Sent: Friday, October 24, 2014 5:54 PM
To: java-user@lucene.apache.org; Richard Barnes
Subject: Re: A really hairy token graph case

I don't think so ... Let me be specific:

First, consider the case of one 'analysis': an input token maps to a lemma and 
a sequence of components.

So, we produce:

 surface form
 lemma   PI 0
 comp1   PI 0
 comp2   PI 1
 ...

with PL set appropriately to cover the pieces. All the information is there.

Now, if we have another analysis, we want to 'rewind' position, and deliver 
another lemma and another set of components, but, of course, we can't do that.

The best we could do is something like:

surface form
lemma1   PI 0
lemma2   PI 0
...
lemmaN   PI 0

comp0-1  PI 0
comp1-1  PI 0
...
comp0-N
...
compM-N

That is, group all the first-components, and all the second-components.

But now the bits and pieces of the compounds are interspersed. Maybe that's OK.
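
For readers of the archive, a minimal sketch of the flattening described above (illustrative names only, not Benson's code; lookupAlternative stands in for a real decompounder). posInc=0 stacks a token on the current position, and once input.incrementToken() has advanced the position, nothing can rewind it for a second set of alternatives:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
import org.apache.lucene.util.AttributeSource;

// Illustrative: stacks one alternative term on top of each surface token.
final class StackAlternativeFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);
  private final PositionLengthAttribute posLenAtt =
      addAttribute(PositionLengthAttribute.class);

  private String pending;              // alternative queued behind the current token
  private AttributeSource.State saved; // captured attributes of the surface form

  StackAlternativeFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pending != null) {
      restoreState(saved);                // reuse offsets of the surface form
      termAtt.setEmpty().append(pending);
      posIncAtt.setPositionIncrement(0);  // stack at the same position (PI 0)
      posLenAtt.setPositionLength(1);
      pending = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    pending = lookupAlternative(termAtt.toString()); // hypothetical lookup
    saved = (pending != null) ? captureState() : null;
    return true;
  }

  private String lookupAlternative(String surface) {
    return null; // placeholder: return a lemma or component, or null for none
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending = null;
    saved = null;
  }
}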


On Fri, Oct 24, 2014 at 5:44 PM, Will Martin  wrote:

> Hi Benson:
>
> This is the case with n-gramming (though you have a more complicated 
> start chooser than most I imagine).  Does that help get your ideas unblocked?
>
> Will
>
> -Original Message-
> From: Benson Margulies [mailto:bimargul...@gmail.com]
> Sent: Friday, October 24, 2014 4:43 PM
> To: java-user@lucene.apache.org
> Subject: A really hairy token graph case
>
> Consider a case where we have a token which can be subdivided in 
> several ways. This can happen in German. We'd like to represent this 
> with positionIncrement/positionLength, but it does not seem possible.
>
> Once the position has moved out from one set of 'subtokens', we see no 
> way to move it back for the second set of alternatives.
>
> Is this something that was considered?
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
>
>
>


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: hello,I have a problem about lucene,please help me to explain ,thank you

2015-09-22 Thread will martin
Hi: 
Would you mind doing a web search and cataloging the relevant pages into a
primer?
Thx,
Will
-Original Message-
From: 王建军 [mailto:jianjun200...@163.com] 
Sent: Tuesday, September 22, 2015 4:02 AM
To: java-user@lucene.apache.org
Subject: hello,I have a problem about lucene,please help me to explain
,thank you

There is a class, org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter,
which has two parameters: DEFAULT_MIN_BLOCK_SIZE and DEFAULT_MAX_BLOCK_SIZE.
Their default values are 25 and 48. When I make their values bigger, for
example 200 and 398, and then build the index, the result is that memory use
goes down and, what's more, performance is good.
Can you tell me why? Also, if I change them, will that cause other problems?

Thank you very much.
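
Roughly: larger blocks mean fewer blocks, so the in-heap terms index (the FST of block pointers) shrinks, which is why memory drops; the trade-off is more scanning inside each block on term lookup. A minimal sketch of trying custom sizes without patching the constants, assuming the Lucene 5.3-era codec API (the analyzer is a placeholder):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat;
import org.apache.lucene.codecs.lucene53.Lucene53Codec;
import org.apache.lucene.index.IndexWriterConfig;

// 25/48 are BlockTreeTermsWriter's defaults; 200/398 mirror the experiment above.
final PostingsFormat biggerBlocks = new Lucene50PostingsFormat(200, 398);

IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
iwc.setCodec(new Lucene53Codec() {
  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    return biggerBlocks; // custom block sizes for every field
  }
});

Block sizes are baked into each segment at write time, so existing segments keep their old layout until they are merged or rewritten.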


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Solr java.lang.OutOfMemoryError: Java heap space

2015-09-28 Thread will martin
http://opensourceconnections.com/blog/2014/07/13/reindexing-collections-with-solrs-cursor-support/
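
The gist of that post: deep paging with start=N makes Solr build a priority queue of start+rows entries (the TopDocsCollector.topDocs frame below), which is what exhausts the heap at start=50M. Cursor paging streams batches in roughly constant memory, but it needs Solr 4.7+, so a 4.4.0 index would have to be upgraded or reindexed first. A hedged SolrJ sketch (URL, sort field, and batch size are placeholders; SolrJ 5.x client API assumed):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

// (inside a method declaring throws Exception)
HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/collection1");
SolrQuery q = new SolrQuery("*:*");
q.setRows(10000);
q.setSort(SolrQuery.SortClause.asc("id")); // sort must include the uniqueKey field
String cursor = CursorMarkParams.CURSOR_MARK_START;
while (true) {
  q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
  QueryResponse rsp = solr.query(q);
  // ... append rsp.getResults() to the current split file ...
  String next = rsp.getNextCursorMark();
  if (cursor.equals(next)) {
    break; // cursor stopped moving: all documents exported
  }
  cursor = next;
}
solr.close();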



-Original Message-
From: Ajinkya Kale [mailto:kaleajin...@gmail.com] 
Sent: Monday, September 28, 2015 2:46 PM
To: solr-u...@lucene.apache.org; java-user@lucene.apache.org
Subject: Solr java.lang.OutOfMemoryError: Java heap space

Hi,

I am trying to retrieve all the documents from a Solr index in a batched manner.
I have 100M documents. I am retrieving them using the method proposed here: 
https://nowontap.wordpress.com/2014/04/04/solr-exporting-an-index-to-an-external-file/
I am dumping 10M-document splits into each file. I get "OutOfMemoryError" if 
start is at 50M; I get the same error even with rows=10 for start=50M.
Curl with start=0 and rows=50M in one go works fine. But things go bad when start 
is at 50M.
My Solr version is 4.4.0.

Caused by: java.lang.OutOfMemoryError: Java heap space at
org.apache.lucene.search.TopDocsCollector.topDocs(TopDocsCollector.java:146)
at
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1502)
at
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1363)
at
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:474)
at
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:434)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)

--aj


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Lucene 5 : any merge performance metrics compared to 4.x?

2015-09-29 Thread will martin
So, if it's new, it adds to the pre-existing time? So it is a cost that needs to be 
understood, I think.

 

And, I'm really curious, what happens to the result of the post merge 
checkIntegrity IFF (if and only if) there was corruption pre-merge: I mean if 
you let it merge anyway could you get a false positive for integrity?  [see the 
concept of lazy-evaluation]

 

These are, imo, the kinds of engineering questions Selva's post raised in my 
triage mode of the scenario.

 

 

-Original Message-
From: Adrien Grand [mailto:jpou...@gmail.com] 
Sent: Tuesday, September 29, 2015 8:46 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?

 

Indeed this is new but I'm a bit surprised this is the source of your issues as 
it should be much faster than the merge itself. I don't understand your 
proposal to check the index after merge: the goal is to make sure that we do 
not propagate corruptions so it's better to check the index before the merge 
starts so that we don't even try to merge if there are corruptions?

 

On Tue, Sep 15, 2015 at 00:40, Selva Kumar < 
 selva.kumar.at.w...@gmail.com> wrote:

 

> it appears Lucene 5.2 index merge is running checkIntegrity on 

> existing index prior to merging additional indices.

> This seems to be new.

> 

> We have an existing checkIndex but this is run post index merge.

> 

> Two follow up questions :

> * Is there a way to turn off the built-in checkIntegrity? Just for my understanding.

> No plan to turn this off.

> * Is running checkIntegrity prior to index merge better than running 

> post merge?

> 

> 

> On Mon, Sep 14, 2015 at 12:24 PM, Selva Kumar < 

>   selva.kumar.at.w...@gmail.com

> > wrote:

> 

> > We observe some merge slowness after we migrated from 4.10 to 5.2.

> > Is this expected? Any new tunable merge parameters in Lucene 5 ?

> >

> > -Selva

> >

> >

> 



RE: Lucene 5 : any merge performance metrics compared to 4.x?

2015-09-29 Thread will martin
This sounds robust. Is the index batch creation workflow a separate process?
Distributed shared filesystems?

--will

-Original Message-
From: McKinley, James T [mailto:james.mckin...@cengage.com] 
Sent: Tuesday, September 29, 2015 2:22 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?

Hi Adrien and Will,

Thanks for your responses.  I work with Selva and he's busy right now with
other things, so I'll add some more context to his question in an attempt to
improve clarity.

The merge in question is part of our batch indexing workflow wherein we
index new content for a given partition and then merge this new index with
the big index of everything that was previously loaded on the given
partition.  The increase in merge time we've seen since upgrading from 4.10
to 5.2 is on the order of 25%.  It varies from partition to partition, but
25% is a good ballpark estimate I think.  Maybe our case is non-standard, we
have a large number of fields (> 200).

The reason we perform an index check after the merge is that this is the
final index state that will be used for a given batch.  Since we have a
batch-oriented workflow we are able to roll back to a previous batch if we
find a problem with a given batch (Lucene or other problem).  However due to
disk space constraints we can only keep a couple batches.  If our indexing
workflow completes without errors but the index is corrupt, we may not know
right away and we might delete the previous good batch thinking the latest
batch is OK, which would be very bad requiring a full reload of all our
content.

Checking the index prior to the merge would no doubt catch many issues, but
it might not catch corruption that occurs during the merge step itself, so
we implemented a check step once the index is in its final state to ensure
that it is OK.

So, since we want to do the check post-merge, is there a way to disable the
check during merge so we don't have to do two checks?

Thanks!

Jim 

____
From: will martin 
Sent: 29 September 2015 12:08
To: java-user@lucene.apache.org
Subject: RE: Lucene 5 : any merge performance metrics compared to 4.x?

So, if it's new, it adds to the pre-existing time? So it is a cost that needs to
be understood, I think.



And, I'm really curious, what happens to the result of the post merge
checkIntegrity IFF (if and only if) there was corruption pre-merge: I mean
if you let it merge anyway could you get a false positive for integrity?
[see the concept of lazy-evaluation]



These are, imo, the kinds of engineering questions Selva's post raised in my
triage mode of the scenario.





-Original Message-
From: Adrien Grand [mailto:jpou...@gmail.com]
Sent: Tuesday, September 29, 2015 8:46 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?



Indeed this is new but I'm a bit surprised this is the source of your issues
as it should be much faster than the merge itself. I don't understand your
proposal to check the index after merge: the goal is to make sure that we do
not propagate corruptions so it's better to check the index before the merge
starts so that we don't even try to merge if there are corruptions?



On Tue, Sep 15, 2015 at 00:40, Selva Kumar <
<mailto:selva.kumar.at.w...@gmail.com> selva.kumar.at.w...@gmail.com>
wrote:



> it appears Lucene 5.2 index merge is running checkIntegrity on

> existing index prior to merging additional indices.

> This seems to be new.

>

> We have an existing checkIndex but this is run post index merge.

>

> Two follow up questions :

> * Is there a way to turn off the built-in checkIntegrity? Just for my
understanding.

> No plan to turn this off.

> * Is running checkIntegrity prior to index merge better than running

> post merge?

>

>

> On Mon, Sep 14, 2015 at 12:24 PM, Selva Kumar <

>  <mailto:selva.kumar.at.w...@gmail.com> selva.kumar.at.w...@gmail.com

> > wrote:

>

> > We observe some merge slowness after we migrated from 4.10 to 5.2.

> > Is this expected? Any new tunable merge parameters in Lucene 5 ?

> >

> > -Selva

> >

> >

>


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org






RE: Lucene 5 : any merge performance metrics compared to 4.x?

2015-09-29 Thread will martin
OK, so I'm a little confused:

The 4.10 JavaDoc for LiveIndexWriterConfig supports volatile access on a
flag to setCheckIntegrityAtMerge ... 

Method states it controls pre-merge cost.

Ref: 

https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/index/LiveIndexWriterConfig.html#setCheckIntegrityAtMerge%28boolean%29

And it seems to be gone in 5.3, folks? Meaning Adrien's comment is a whole
lot more significant? Merges ALWAYS run checkIntegrity pre-merge? Is this a 5.0
feature drop? You can't deprecate, um, er, totally remove an index-time audit
feature on a point release of any level IMHO.


-Original Message-
From: McKinley, James T [mailto:james.mckin...@cengage.com] 
Sent: Tuesday, September 29, 2015 2:42 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?

Yes, the indexing workflow is completely separate from the runtime system.
The file system is EMC Isilon via NFS.

Jim

____
From: will martin 
Sent: 29 September 2015 14:29
To: java-user@lucene.apache.org
Subject: RE: Lucene 5 : any merge performance metrics compared to 4.x?

This sounds robust. Is the index batch creation workflow a separate process?
Distributed shared filesystems?

--will

-Original Message-
From: McKinley, James T [mailto:james.mckin...@cengage.com]
Sent: Tuesday, September 29, 2015 2:22 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?

Hi Adrien and Will,

Thanks for your responses.  I work with Selva and he's busy right now with
other things, so I'll add some more context to his question in an attempt to
improve clarity.

The merge in question is part of our batch indexing workflow wherein we
index new content for a given partition and then merge this new index with
the big index of everything that was previously loaded on the given
partition.  The increase in merge time we've seen since upgrading from 4.10
to 5.2 is on the order of 25%.  It varies from partition to partition, but
25% is a good ballpark estimate I think.  Maybe our case is non-standard, we
have a large number of fields (> 200).

The reason we perform an index check after the merge is that this is the
final index state that will be used for a given batch.  Since we have a
batch-oriented workflow we are able to roll back to a previous batch if we
find a problem with a given batch (Lucene or other problem).  However due to
disk space constraints we can only keep a couple batches.  If our indexing
workflow completes without errors but the index is corrupt, we may not know
right away and we might delete the previous good batch thinking the latest
batch is OK, which would be very bad requiring a full reload of all our
content.

Checking the index prior to the merge would no doubt catch many issues, but
it might not catch corruption that occurs during the merge step itself, so
we implemented a check step once the index is in its final state to ensure
that it is OK.

So, since we want to do the check post-merge, is there a way to disable the
check during merge so we don't have to do two checks?

Thanks!

Jim

____
From: will martin 
Sent: 29 September 2015 12:08
To: java-user@lucene.apache.org
Subject: RE: Lucene 5 : any merge performance metrics compared to 4.x?

So, if it's new, it adds to the pre-existing time? So it is a cost that needs to
be understood, I think.



And, I'm really curious, what happens to the result of the post merge
checkIntegrity IFF (if and only if) there was corruption pre-merge: I mean
if you let it merge anyway could you get a false positive for integrity?
[see the concept of lazy-evaluation]



These are, imo, the kinds of engineering questions Selva's post raised in my
triage mode of the scenario.





-Original Message-
From: Adrien Grand [mailto:jpou...@gmail.com]
Sent: Tuesday, September 29, 2015 8:46 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?



Indeed this is new but I'm a bit surprised this is the source of your issues
as it should be much faster than the merge itself. I don't understand your
proposal to check the index after merge: the goal is to make sure that we do
not propagate corruptions so it's better to check the index before the merge
starts so that we don't even try to merge if there are corruptions?



On Tue, Sep 15, 2015 at 00:40, Selva Kumar <
<mailto:selva.kumar.at.w...@gmail.com> selva.kumar.at.w...@gmail.com>
wrote:



> it appears Lucene 5.2 index merge is running checkIntegrity on

> existing index prior to merging additional indices.

> This seems to be new.

>

> We have an existing checkIndex but this is run post index merge.

>

> Two follow up questions :

> * Is there a way to turn off the built-in checkIntegrity? Just for my
understanding.

> No p

RE: Lucene 5 : any merge performance metrics compared to 4.x?

2015-09-30 Thread will martin
Thanks Mike. This is very informative. 



-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Tuesday, September 29, 2015 3:22 PM
To: Lucene Users
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?

No, it is not possible to disable, and, yes, we removed that API in 5.x because 
1) the risk of silent index corruption is too high to warrant this small 
optimization and 2) we re-worked how merging works so that this checkIntegrity 
has IO locality with what's being merged next.

There were other performance gains for merging in 5.x, e.g. using much less 
memory in the many-fields case, not decompressing + recompressing stored fields 
and term vectors, etc.

As Adrien pointed out, the cost should be much lower than 25% for a local 
filesystem ... I suspect something about your NFS setup is making it more 
costly.

NFS is in general a dangerous filesystem to use with Lucene (no delete on last 
close, locking is tricky to get right, incoherent client file contents and 
directory listing caching).

If you want to also checkIntegrity of the merged segment you could e.g. install 
an IndexReaderWarmer in your IW and call IndexReader.checkIntegrity.

Mike McCandless

http://blog.mikemccandless.com
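
A minimal sketch of that warmer suggestion, assuming the Lucene 5.x API (the analyzer is a placeholder):

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LeafReader;

IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
// Verify each newly merged segment as soon as it is produced, instead of
// re-reading the whole index in a post-merge CheckIndex pass.
iwc.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
  @Override
  public void warm(LeafReader reader) throws IOException {
    reader.checkIntegrity(); // throws CorruptIndexException on checksum mismatch
  }
});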


On Tue, Sep 29, 2015 at 9:00 PM, will martin  wrote:
> Ok So I'm a little confused:
>
> The 4.10 JavaDoc for LiveIndexWriterConfig supports volatile access on 
> a flag to setCheckIntegrityAtMerge ...
>
> Method states it controls pre-merge cost.
>
> Ref:
>
> https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/index/LiveIndexWriterConfig.html#setCheckIntegrityAtMerge%28boolean%29
>
> And it seems to be gone in 5.3 folks? Meaning Adrien's comment is a 
> whole lot significant? Merges ALWAYS pre-merge CheckIntegrity? Is this 
> a 5.0 feature drop? You can't deprecate, um, er totally remove an 
> index time audit feature on a point release of any level IMHO.
>
>
> -Original Message-
> From: McKinley, James T [mailto:james.mckin...@cengage.com]
> Sent: Tuesday, September 29, 2015 2:42 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?
>
> Yes, the indexing workflow is completely separate from the runtime system.
> The file system is EMC Isilon via NFS.
>
> Jim
>
> 
> From: will martin 
> Sent: 29 September 2015 14:29
> To: java-user@lucene.apache.org
> Subject: RE: Lucene 5 : any merge performance metrics compared to 4.x?
>
> This sounds robust. Is the index batch creation workflow a separate process?
> Distributed shared filesystems?
>
> --will
>
> -Original Message-
> From: McKinley, James T [mailto:james.mckin...@cengage.com]
> Sent: Tuesday, September 29, 2015 2:22 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?
>
> Hi Adrien and Will,
>
> Thanks for your responses.  I work with Selva and he's busy right now 
> with other things, so I'll add some more context to his question in an 
> attempt to improve clarity.
>
> The merge in question is part of our batch indexing workflow wherein 
> we index new content for a given partition and then merge this new 
> index with the big index of everything that was previously loaded on 
> the given partition.  The increase in merge time we've seen since 
> upgrading from 4.10 to 5.2 is on the order of 25%.  It varies from 
> partition to partition, but 25% is a good ballpark estimate I think.  
> Maybe our case is non-standard, we have a large number of fields (> 200).
>
> The reason we perform an index check after the merge is that this is 
> the final index state that will be used for a given batch.  Since we 
> have a batch-oriented workflow we are able to roll back to a previous 
> batch if we find a problem with a given batch (Lucene or other 
> problem).  However due to disk space constraints we can only keep a 
> couple batches.  If our indexing workflow completes without errors but 
> the index is corrupt, we may not know right away and we might delete 
> the previous good batch thinking the latest batch is OK, which would 
> be very bad requiring a full reload of all our content.
>
> Checking the index prior to the merge would no doubt catch many 
> issues, but it might not catch corruption that occurs during the merge 
> step itself, so we implemented a check step once the index is in its 
> final state to ensure that it is OK.
>
> So, since we want to do the check post-merge, is there a way to 
> disable the check during merge so we don't have to do two checks?
>
> Thanks!
>
> Jim
>
> 
> Fro

Re: debugging growing index size

2015-11-13 Thread will martin
Hi Rob:


Doesn't this look like the known Java SE issue JDK-4724038, discussed by Peter Levart 
and Uwe Schindler on a lucene-dev thread on 9/9/2015?

MappedByteBuffer… what OS are you on, Rob? What JVM?

http://bugs.java.com/view_bug.do?bug_id=4724038

http://mail-archives.apache.org/mod_mbox/lucene-dev/201509.mbox/%3c55f0461a.2070...@gmail.com%3E

hth 
-will
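
In short: with mmap, segment files Lucene has already deleted can keep their disk space (on POSIX, deleted-but-mapped files are not freed) until the MappedByteBuffer is unmapped or garbage-collected, which is the JDK-4724038 story. A sketch of the defensive choice (the path is illustrative):

import java.nio.file.Paths;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NIOFSDirectory;

// If the JVM gives Lucene no way to unmap buffers eagerly, prefer NIOFSDirectory
// so closed and deleted segment files release their disk space promptly.
Directory dir = MMapDirectory.UNMAP_SUPPORTED
    ? new MMapDirectory(Paths.get("/var/lib/app/index"))
    : new NIOFSDirectory(Paths.get("/var/lib/app/index"));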



> On Nov 13, 2015, at 11:23 AM, Rob Audenaerde  wrote:
> 
> I'm currently running using NIOFS. It seems to prevent the issue from
> appearing.
> 
> This is a second run (with applied deletes etc)
> 
> raudenaerd@:/<6>index/index$sudo ls -lSra *.dvd
> -rw-r--r--. 1 apache apache  7993 Nov 13 16:09 _y_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache  39048886 Nov 13 17:12 _xod_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache  53699972 Nov 13 17:17 _110e_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache 112855516 Nov 13 17:19 _12r5_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache 151149886 Nov 13 17:13 _y0s_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache 222062059 Nov 13 17:17 _z20_Lucene50_0.dvd
> 
> raudenaerde:/<6>index/index$sudo ls -lSaa *.dvd
> -rw-r--r--. 1 apache apache 222062059 Nov 13 17:17 _z20_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache 151149886 Nov 13 17:13 _y0s_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache 112855516 Nov 13 17:19 _12r5_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache  53699972 Nov 13 17:17 _110e_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache  39048886 Nov 13 17:12 _xod_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache  7993 Nov 13 16:09 _y_Lucene50_0.dvd
> 
> 
> 
> On Thu, Nov 12, 2015 at 3:40 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
> 
>> Hi Rob,
>> 
>> A couple more things:
>> 
>> Can you print the value of MMapDirectory.UNMAP_SUPPORTED?
>> 
>> Also, can you try your test using NIOFSDirectory instead?  Curious if
>> that changes things...
>> 
>> Mike McCandless
>> 
>> http://blog.mikemccandless.com
>> 
>> 
>> On Thu, Nov 12, 2015 at 7:28 AM, Rob Audenaerde
>>  wrote:
>>> Curious indeed!
>>> 
>>> I will turn on the IndexFileDeleter.VERBOSE_REF_COUNTS and recreate the
>>> logs. Will get back with them in a day hopefully.
>>> 
>>> Thanks for the extra logging!
>>> 
>>> -Rob
>>> 
>>> On Thu, Nov 12, 2015 at 11:34 AM, Michael McCandless <
>>> luc...@mikemccandless.com> wrote:
>>> 
 Hmm, curious.
 
 I looked at the [large] infoStream output and I see segment _3ou7
 present on init of IW, a few getReader calls referencing it, then a
 forceMerge that indeed merges it away, yet I do NOT see IW attempting
 deletion of its files.
 
 And indeed I see plenty (too many: many times per second?) of commits
 after that, so the index itself is no longer referencing _3ou7.
 
 If you are failing to close all NRT readers then I would expect _3ou7
 to be in the lsof output, but it's not.
 
 The NRT readers close method has logic that notifies IndexWriter when
 it's done "needing" the files, to emulate "delete on last close"
 semantics for filesystems like HDFS that don't do that ... it's
 possible something is wrong here.
 
 Can you set the (public, static) boolean
 IndexFileDeleter.VERBOSE_REF_COUNTS to true, and then re-generate this
 log?  This causes IW to log the ref count of each file it's tracking
 ...
 
 I'll also add a bit more verbosity to IW when NRT readers are opened
 and close, for 5.4.0.
 
 Mike McCandless
 
 http://blog.mikemccandless.com
 
 
 On Wed, Nov 11, 2015 at 6:09 AM, Rob Audenaerde
  wrote:
> Hi all,
> 
> I'm still debugging the growing-index size. I think closing index
>> readers
> might help (work in progress), but I can't really see them holding on
>> to
> files (at least, using lsof ). Restarting the application sheds some
 light,
> I see logging on files that are no longer referenced.
> 
> What I see is that there are files in the index-directory, that seem
>> to
> longer referenced..
> 
> I put the output of the infoStream online, because is it rather big
>> (30MB
> gzipped):  http://www.audenaerde.org/lucene/merges.log.gz
> 
> Output of lsof:  (executed 'sudo lsof *' in the index directory  ).
>> This
 is
> on an CentOS box (maybe that influences stuff as well?)
> 
> COMMAND   PID   USER   FD   TYPE DEVICE   SIZE/OFF NODE NAME
> java30581 apache  memREG  253,0 3176094924 18880508
> _4gs5_Lucene50_0.dvd
> java30581 apache  memREG  253,0  505758610 18880546 _4gs5.fdt
> java30581 apache  memREG  253,0  369563337 18880631
> _4gs5_Lucene50_0.tim
> java30581 apache  memREG  253,0  176344058 18880623
> _4gs5_Lucene50_0.pos
> java30581 apache  memREG  253,0  378055201 18880606
> _4gs5_Lucene50_0.doc
> java30581 apache  memREG  253,0  372579599 18880400
> _4i5a_Lucene50_0.dvd
> java30581 apache  memREG  253,0   82017

Re: Jensen–Shannon divergence

2015-12-13 Thread will martin
Expand your due diligence beyond Wikipedia, e.g.:

http://ciir.cs.umass.edu/pubfiles/ir-464.pdf



> On Dec 13, 2015, at 8:30 AM, Shay Hummel  wrote:
> 
> LMDiricletbut its feasibilit


Re: Jensen–Shannon divergence

2015-12-13 Thread will martin
Sorry, it was early.

If you go looking on the web you can find, as I did, reputable work on 
implementing Dirichlet language models. However, at this hour you might get 
answers here. Extrapolating others' work into a Lucene implementation is only 
slightly different from getting answers here. imo

g'luck


> On Dec 13, 2015, at 10:55 AM, Shay Hummel  wrote:
> 
> Hi
> 
> I am sorry but I didn't understand your answer. Can you please elaborate?
> 
> Shay
> 
> On Sun, Dec 13, 2015 at 3:41 PM will martin  wrote:
> 
>> expand your due diligence beyond wikipedia:
>> i.e.
>> 
>> http://ciir.cs.umass.edu/pubfiles/ir-464.pdf
>> 
>> 
>> 
>>> On Dec 13, 2015, at 8:30 AM, Shay Hummel  wrote:
>>> 
>>> LMDiricletbut its feasibilit
>> 
> -- 
> Regards,
> Shay Hummel


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Jensen–Shannon divergence

2015-12-14 Thread will martin
Cool list. Thanks, Uwe.

Opportunities to gain competitive advantage in selected domains.
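
Uwe's SimilarityBase suggestion (quoted below) looks like this in outline; a sketch only, with an illustrative Jelinek-Mercer-style body rather than a finished JS-divergence ranker (Lucene 5.4 API assumed):

import org.apache.lucene.search.similarities.BasicStats;
import org.apache.lucene.search.similarities.SimilarityBase;

public class JSDivergenceSimilarity extends SimilarityBase {
  private final float lambda = 0.5f; // illustrative smoothing constant

  @Override
  protected float score(BasicStats stats, float freq, float docLen) {
    // Smoothed P(t|d): document evidence blended with the collection model.
    float collectionProb =
        (stats.getTotalTermFreq() + 1F) / (stats.getNumberOfFieldTokens() + 1F);
    float pTd = (1 - lambda) * freq / docLen + lambda * collectionProb;
    // Placeholder body: a real JS-divergence ranker would compare query and
    // document language models per term and aggregate the divergence.
    return stats.getTotalBoost() * (float) Math.log(1 + pTd);
  }

  @Override
  public String toString() {
    return "JSDivergence(lambda=" + lambda + ")";
  }
}

Install it with IndexWriterConfig.setSimilarity(...) at index time and IndexSearcher.setSimilarity(...) at query time so norms and scoring agree.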

> On Dec 14, 2015, at 6:02 PM, Uwe Schindler  wrote:
> 
> Hi,
> 
> Next to BM25 and TF-IDF, Lucene also provides many more similarity 
> implementations:
> 
> https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html
> https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html
> https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/similarities/IBSimilarity.html
> https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/similarities/DFRSimilarity.html
> 
> If you want to implement your own, choose the closest one and implement the 
> formula as you described. I'd start with SimilarityBase, which is an ideal base 
> class for types like Dirichlet / DFR / ..., because it has a default 
> implementation for stuff like phrases.
> 
>> LMDiricletbut its feasibilit
> 
> I am not sure what you want to say with this mistyped sentence fragment.
> 
> Uwe
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
>> -Original Message-
>> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
>> Sent: Monday, December 14, 2015 11:21 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Jensen–Shannon divergence
>> 
>> Is there any particular reason that you find Lucene's builtin TF/IDF and
>> BM25 similarity models insufficient for your needs? In any case,
>> examination of their source code should get you started if you with to do
>> your own:
>> 
>> https://lucene.apache.org/core/5_3_0/core/org/apache/lucene/search/simi
>> larities/TFIDFSimilarity.html
>> https://lucene.apache.org/core/5_3_0/core/org/apache/lucene/search/simi
>> larities/BM25Similarity.html
>> 
>> -- Jack Krupansky
>> 
>> On Sun, Dec 13, 2015 at 8:30 AM, Shay Hummel 
>> wrote:
>> 
>>> Hi
>>> 
>>> I need help to implement similarity between query model and document
>> model.
>>> I would like to use the JS-Divergence
>>> 
>> for
>>> ranking documents. The documents and the query will be represented
>>> according to the language models approach - specifically the LMDiriclet.
>>> The similarity will be calculated using the JS-Div between the document
>>> model and the query model.
>>> Is it possible?
>>> if so how?
>>> 
>>> Thank you,
>>> Shay Hummel
>>> --
>>> Regards,
>>> Shay Hummel
>>> 
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Any lucene query sorts docs by Hamming distance?

2015-12-22 Thread will martin
Yonghui:

Do you mean sort, rank or score?

Thanks,
Will



> On Dec 22, 2015, at 4:02 AM, Yonghui Zhao  wrote:
> 
> Hi,
> 
> Is there any query that can sort docs by Hamming distance if field values are
> the same length?
> 
> Fuzzy query seems to only work on edit distance.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: range query highlighting

2015-12-23 Thread will martin
Todd:

"This trick just converts the multi term queries like PrefixQuery or RangeQuery 
to boolean query by expanding the terms using index reader."

http://stackoverflow.com/questions/7662829/lucene-net-range-queries-highlighting

beware cost. (my comment)


g’luck
will
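
In Lucene-Java terms the trick looks roughly like this (4.x/5.x-era classes; reader and field names are placeholders). Note the cost warning above, and note that numeric trie fields expand to encoded terms rather than readable text, so this is most useful for string ranges:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TermRangeQuery;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

// (inside a method declaring throws Exception; reader is an open IndexReader)
TermRangeQuery range =
    TermRangeQuery.newStringRange("date", "2015", "2016", true, true);
range.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);

BooleanQuery bq = new BooleanQuery();
bq.add(range, BooleanClause.Occur.MUST);
bq.add(new TermQuery(new Term("body", "lucene")), BooleanClause.Occur.MUST);

Query rewritten = bq.rewrite(reader); // expands the range into concrete terms
Highlighter highlighter = new Highlighter(new QueryScorer(rewritten));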

> On Dec 23, 2015, at 4:49 PM, Fielder, Todd Patrick  wrote:
> 
> I have a NumericRangeQuery and a TermQuery that I am combining into a Boolean 
> query.  I would then like to pass the Boolean query to the highlighter to 
> highlight both the range and term hits.  Currently, only the terms are being 
> highlighted.
> 
> Any help on how to get the range values to highlight would be appreciated
> 
> Thanks
> 
> -Todd


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Any lucene query sorts docs by Hamming distance?

2015-12-24 Thread will martin
Here's a thought from the algorithm world:

for equal-length strings, Hamming is an upper bound on Levenshtein.

Does that help you?

-w
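
Spelled out: for equal-length strings every mismatching position can be fixed by one substitution, so Levenshtein(a, b) <= Hamming(a, b). Anything within Hamming distance 3 is therefore inside the edit-distance-3 candidate set, which can be post-filtered and sorted exactly (note FuzzyQuery caps maxEdits at 2, so distance 3 needs a custom automaton or a scan). A sketch of the exact post-pass:

// Exact Hamming distance for equal-length strings: filter candidates to
// distance <= 3, then sort ascending from 0 to 3.
static int hamming(String a, String b) {
  if (a.length() != b.length()) {
    throw new IllegalArgumentException("Hamming distance needs equal lengths");
  }
  int d = 0;
  for (int i = 0; i < a.length(); i++) {
    if (a.charAt(i) != b.charAt(i)) {
      d++;
    }
  }
  return d;
}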


> On Dec 24, 2015, at 4:10 AM, Yonghui Zhao  wrote:
> 
> I mean sort and filter. I want to filter all documents within some
> Hamming distance, say 3, and sort them from distance 0 to 3.
> 
> 2015-12-22 21:42 GMT+08:00 will martin :
> 
>> Yonghui:
>> 
>> Do you mean sort, rank or score?
>> 
>> Thanks,
>> Will
>> 
>> 
>> 
>>> On Dec 22, 2015, at 4:02 AM, Yonghui Zhao  wrote:
>>> 
>>> Hi,
>>> 
>>> Is there any query can sort docs by hamming distance if field values are
>>> same length,
>>> 
>>> Seems fuzzy query only works on edit distance.
>> 
>> 
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
>> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SolrIndexSearcher throws Misleading Error Message When timeAllowed is Specified.

2016-01-08 Thread will martin
Please read the javadoc for System.nanoTime(): its values are only meaningful as 
differences against an arbitrary origin, not as wall-clock readings, so that big 
number is not a duration. I won't bore you with the details about how computer 
clocks work.
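
Decoded, the log is self-consistent: 5804342454166 - 5804340135470 ≈ 2.3 ms, i.e. the request was cancelled about 2.3 ms past its deadline. The absolute values are offsets from an arbitrary origin (often JVM start), never durations. The usual idiom:

import java.util.concurrent.TimeUnit;

// timeAllowed=50 from the query above; deadlines are compared by difference.
long timeoutAt = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(50);
// ... iterate terms ...
if (System.nanoTime() - timeoutAt > 0) {
  // timed out: only the subtraction is meaningful, never the raw values
}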

> On Jan 8, 2016, at 4:14 AM, Vishnu Mishra  wrote:
> 
> I am using Solr 5.3.1 and we are facing an OutOfMemory exception while doing
> some complex wildcard and proximity queries (even for simple wildcard queries).
> We are doing distributed Solr search using shards across 20 cores. 
> 
> The problem description is given below.
> 
> For example simple query like
> 
> *q=Tile:(eleme* OR proces*)&timeAllowed=50*
> 
> It gives warning given below
> 
> *2016-01-08 14:14:03,874 WARN  org.apache.solr.search.SolrIndexSearcher  –
> Query: Tile:(eleme* OR proces*); The request took too long to iterate over
> terms. Timeout: timeoutAt: 5804340135470 (System.nanoTime(): 5804342454166),
> TermsEnum=org.apache.lucene.codecs.blocktree.IntersectTermsEnum@1d2d4fb*
> 
> I don't understand why the timeout thrown by SolrIndexSearcher shows
> 5804340135470 nanoseconds (about 5804.34 seconds) when I already gave a
> timeout of 1000 ms (1 second). Is the log message correct? Help me to
> understand this problem.
> 
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SolrIndexSearcher-throws-Misleading-Error-Message-When-timeAllowed-is-Specified-tp4249356.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: how to backup index files with Replicator

2016-01-23 Thread will martin
Hi Dancer:

Found this thread with good info that may or may not be relevant to your scenario, 
but this in particular struck me:

 writer.waitForMerges();
 writer.commit();
 replicator.replicate(new IndexRevision(writer));
 writer.close();
—
even though writer.close() can trigger a commit. hmmm


thread:

http://grokbase.com/t/lucene/java-user/143dsnrxh8/replicator-how-to-use-it 


-will



> On Jan 23, 2016, at 4:39 AM, Dancer <462921...@qq.com> wrote:
> 
> Hi,
> here is my code to backup index files with Lucene Replicator,but It doesn't 
> work well, No files were backuped.
> Could you check my code and give me your advice?
> 
> 
> public class IndexFiles {
> 
> 
>   private static Directory dir;
>   private static Path bakPath;
>   private static LocalReplicator replicator;
> 
> 
>   public static LocalReplicator getInstance() {
>   if (replicator == null) {
>   replicator = new LocalReplicator();
>   }
>   return replicator;
>   }
>   public static Directory getDirInstance() {
>   if (dir == null) {
>   try {
>   dir = FSDirectory.open(Paths.get("/tmp/index"));
>   } catch (IOException e) {
>   e.printStackTrace();
>   }
>   }
>   return dir;
>   }
>   public static Path getPathInstance() {
>   if (bakPath == null) {
>   bakPath = Paths.get("/tmp/indexBak");
>   }
>   return bakPath;
>   }
> 
> 
>   
>   /** Index all text files under a directory. */
>   public static void main(String[] args) {
>   String id = "-oderfilssdhsjs";
>   String title = "足球周刊";
>   String body = "今天野狗,我们将关注欧冠赛场,曼联在客场先进一球的情况下,遭对手沃尔夫斯堡以总比分3:2淘汰,"
>   + 
> "遗憾出局,将参加欧联杯的比赛,当红球星马夏尔贡献一球,狼堡进了一个乌龙球,狼堡十号球员德拉克斯勒" + 
> "表现惊艳,多次导演攻势,希望22岁的他能在足球之路上走的更远。";
>   try {
>   // Directory dir = 
> FSDirectory.open(Paths.get(indexPath));
>   Analyzer analyzer = new IKAnalyzer(true);
>   IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
>   iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
>   SnapshotDeletionPolicy snapshotter = new 
> SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
>   iwc.setIndexDeletionPolicy(snapshotter);
>   IndexWriter writer = new 
> IndexWriter(IndexFiles.getDirInstance(), iwc);// the
>   LocalReplicator replicator = IndexFiles.getInstance();
> 
> 
>   Document doc = new Document();
>   Field articleId = new StringField("id", id, 
> Field.Store.YES);
>   doc.add(articleId);
>   Field articleTitle = new TextField("title", title, 
> Field.Store.YES);
>   doc.add(articleTitle);
>   Field articleBody = new TextField("body", body, 
> Field.Store.NO);
>   doc.add(articleBody);
>   Field tag1 = new TextField("tags", "野狗", 
> Field.Store.NO);
>   doc.add(tag1);
>   // Field tag2 = new TextField("tags", "运动", 
> Field.Store.NO);
>   // doc.add(tag2);
>   // Field tag3 = new TextField("tags", "国足", 
> Field.Store.NO);
>   // doc.add(tag3);
>   // Field tag4 = new TextField("tags", "席大大", 
> Field.Store.NO);
>   // doc.add(tag4);
> 
> 
>   writer.updateDocument(new Term("id", id), doc);
>   writer.commit();
>   ReplicatorThread p = new ReplicatorThread(); 
>   new Thread(p, "ReplicatorThread").start();
>   replicator.publish(new IndexRevision(writer));
>   Thread.sleep(5);
>   writer.close();
>   } catch (IOException e) {
>   System.out.println(" caught a " + e.getClass() + "\n 
> with message: " + e.getMessage());
>   } catch (InterruptedException e) {
>   e.printStackTrace();
>   }
>   }
> }
> 
> 
> class ReplicatorThread implements Runnable {
> 
> 
>   public void run() {
>   Callable callback = null; 
>   ReplicationHandler handler = null;
>   try {
>   handler = new 
> IndexReplicationHandler(IndexFiles.getDirInstance(), callback);
>   SourceDirectoryFactory factory = new 
> PerSessionDirectoryFactory(IndexFiles.getPathInstance());
>   ReplicationClient clien

Re: Searching in a bitMask

2016-08-27 Thread will martin
Hi,

Aren't we waltzing terribly close to the use of a bit vector in your field 
caches? There's no reason not to filter longword operations on a cache if 
alignment is consistent across multiple caches.

Just be sure to abstract your operations away from individual bits…. imo



-will
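
For the record, the doc-values flavor of that idea (field name illustrative; pre-7.0 random-access NumericDocValues API assumed), as opposed to the one-boolean-field-per-bit route settled on below:

import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.SimpleCollector;

// Collects docs whose long "mask" doc value satisfies (mask & 0xF) == 0xF.
public class BitmaskCollector extends SimpleCollector {
  private NumericDocValues mask;
  private int docBase;

  @Override
  protected void doSetNextReader(LeafReaderContext context) throws IOException {
    mask = context.reader().getNumericDocValues("mask");
    docBase = context.docBase;
  }

  @Override
  public void collect(int doc) throws IOException {
    if (mask != null && (mask.get(doc) & 0xFL) == 0xFL) {
      int globalDoc = docBase + doc; // matched: record or count globalDoc here
    }
  }

  @Override
  public boolean needsScores() {
    return false;
  }
}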

> On Aug 27, 2016, at 2:30 PM, Cristian Lorenzetto 
>  wrote:
> 
> Yes, thinking a bit more about my question, I understood that making a query
> process every document would not be a good solution. I preferred to use
> boolean properties with a traditional inverted index. Thanks for the
> confirmation :)
> 
> 2016-08-27 20:24 GMT+02:00 Mikhail Khludnev :
> 
>> My guess is that you need to implement own MultyTermQuery, and I guess it's
>> gonna be slow.
>> 
>> On Sat, Aug 27, 2016 at 8:41 AM, Cristian Lorenzetto <
>> cristian.lorenze...@gmail.com> wrote:
>> 
>>> How it is possible to search in a bitmask for soddisfying a request as
>>> 
>>> bitmask&0xf == 0xf ?
>>> 
>> 
>> 
>> 
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Multi-field IDF

2016-11-17 Thread Will Martin
Are you familiar with pivoted normalized document length, in practice or 
theory? Or Croft's recent work on relevance algorithms accounting for 
structured field presence?




On 11/17/2016 5:20 PM, Nicolás Lichtmaier wrote:
That depends on what you want. In this case I want to use a 
discrimination power based in all the body text, not just the titles. 
Because otherwise terms that are really not that relevant end up being 
very high!



On 17/11/16 at 18:25, Ahmet Arslan wrote:

Hi Nicholas,

IDF, among others, is a measure of term specificity. If 'or' is not 
so usual in titles, then it has some discrimination power in that 
domain.


I think it's OK for 'or' to get a high IDF value in this case.

Ahmet



On Thursday, November 17, 2016 9:09 PM, Nicolás Lichtmaier 
 wrote:

IDF measures the selectivity of a term. But the calculation is
per-field. That can be bad for very short fields (like titles). One
example of this problem: If I don't delete stop words, then "or", "and",
etc. should be dealt with low IDF values, however "or" is, perhaps, not
so usual in titles. Then, "or" will have a high IDF value and be treated
as an important term. That's bad.

One solution I see is to modify the Similarity to have a global, or
multi-field, IDF value. This value would include in its calculation
longer fields that have more "normal text"-like stats. However this is
not trivial because I can't just add document-frequencies (I would be
counting some documents several times if "or" is present in more than
one field). I would need to OR the bit-vectors that signal the
presence of the term, right? Not trivial.

Has anyone encountered this issue? Has it been solved? Is my thinking 
wrong?


Should I also try the developers' list?

Thanks!

Nicolás.-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org










Re: Multi-field IDF

2016-11-18 Thread Will Martin

In this work, we aim to improve the field weighting for structured document
retrieval. We first introduce the notion of field relevance as the
generalization of field weights, and discuss how it can be estimated using
relevant documents, which effectively implements relevance feedback for
field weighting. We then propose a framework for estimating field relevance
based on the combination of several sources. Evaluation on several
structured document collections shows that field weighting based on the
suggested framework improves retrieval effectiveness significantly.


https://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=1051
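
Ahmet's BlendedTermQuery pointer (quoted below) is the closest off-the-shelf piece: it scores one logical term across several fields with blended statistics, so a stopword's rarity in short title fields stops dominating. A minimal sketch, assuming a recent Lucene (5.3+/6.x):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BlendedTermQuery;
import org.apache.lucene.search.Query;

// Blend the statistics of "or" across title and body before scoring.
Query q = new BlendedTermQuery.Builder()
    .add(new Term("title", "or"))
    .add(new Term("body", "or"))
    .setRewriteMethod(BlendedTermQuery.DISJUNCTION_MAX_REWRITE)
    .build();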




On 11/18/2016 3:57 AM, Ahmet Arslan wrote:

Hi Nicholas,

Aha, I see that you are into field-based scoring, which is an unsolved problem.

Then, you might find BlendedTermQuery and SynonymQuery relevant.

Ahmet




On Friday, November 18, 2016 12:22 AM, Nicolás Lichtmaier 
 wrote:
That depends on what you want. In this case I want to use a
discrimination power based in all the body text, not just the titles.
Because otherwise terms that are really not that relevant end up being
very high!


On 17/11/16 at 18:25, Ahmet Arslan wrote:

Hi Nicholas,

IDF, among others, is a measure of term specificity. If 'or' is not so usual in 
titles, then it has some discrimination power in that domain.

I think it's OK for 'or' to get a high IDF value in this case.

Ahmet



On Thursday, November 17, 2016 9:09 PM, Nicolás Lichtmaier 
 wrote:
IDF measures the selectivity of a term. But the calculation is
per-field. That can be bad for very short fields (like titles). One
example of this problem: If I don't delete stop words, then "or", "and",
etc. should be dealt with low IDF values, however "or" is, perhaps, not
so usual in titles. Then, "or" will have a high IDF value and be treated
as an important term. That's bad.

One solution I see is to modify the Similarity to have a global, or
multi-field, IDF value. This value would include in its calculation
longer fields that have more "normal text"-like stats. However this is
not trivial because I can't just add document-frequencies (I would be
counting some documents several times if "or" is present in more than
one field). I would need to OR the bit-vectors that signal the
presence of the term, right? Not trivial.

Has anyone encountered this issue? Has it been solved? Is my thinking wrong?

Should I also try the developers' list?

Thanks!

Nicolás.-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org






Re: Explain Scoring function in LMJelinekMercerSimilarity Class

2016-12-20 Thread Will Martin

https://doi.org/10.3115/981574.981579
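
That reference covers the smoothing. Mapped onto the code in the question (standard Jelinek-Mercer query likelihood; a sketch, since Lucene's collection statistics may add minor smoothing constants):

\mathrm{score}(t,d) = \mathrm{boost}\cdot\log\!\left(1 + \frac{(1-\lambda)\,\mathrm{tf}_{t,d}/|d|}{\lambda\,P(t\mid C)}\right),
\qquad P(t\mid C) \approx \frac{\mathrm{cf}_t}{|C|}

Here tf/|d| is freq/docLen from the code; P(t|C) is what getCollectionProbability() returns, i.e. collection frequency of the term over total tokens in the field (so yes, col_freq(t)/col_size, up to smoothing); and getTotalBoost() is just the aggregated query-time boost applied outside the log.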



On 12/20/2016 12:21 PM, Dwaipayan Roy wrote:

Hello,

Can anyone help me understand the scoring function in the
LMJelinekMercerSimilarity class?

The scoring function in LMJelinekMercerSimilarity is shown below:

float score = stats.getTotalBoost() *
(float)Math.log(1 + ((1 - lambda) * freq / docLen) / (lambda *
((LMStats)stats).getCollectionProbability()));


Can anyone help explain the equation? I can understand the in-document part
of the score, i.e. (1 - lambda) * freq / docLen.

I hope getCollectionProbability() returns col_freq(t) / col_size. Am I
right?

Also the boosting part is not clear to me (stats.getTotalBoost()).

I want to reproduce the result of the scoring using LM-JM. Hence I want the
details.

Thanks.
Dwaipayan Roy..




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Format of Wikipedia Index

2018-01-22 Thread Will Martin

From the javadoc for DocMaker:


 * *doc.stored* - specifies whether fields should be stored (default
   *false*).
 * *doc.body.stored* - specifies whether the body field should be
   stored (default = *doc.stored*).

So ootb you won't get content stored. Does this help?

regards
-will
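
A minimal .alg sketch for contrib/benchmark that turns storage on (property names as in the DocMaker javadoc above; the dump path and doc count are placeholders):

# store fields so retrieved documents carry more than just docid
content.source=org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource
docs.file=/path/to/enwiki-latest-pages-articles-multistream.xml.bz2
doc.stored=true
doc.body.stored=true

CreateIndex
{ AddDoc } : 100000
CloseIndex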


On 1/22/2018 10:27 PM, Armins Stepanjans wrote:

Hi,

I have a question regarding the format of the Index created by DocMaker,
from EnWikiContentSource.

After creating the Index from dump of all Wikipedia's articles (
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-
pages-articles-multistream.xml.bz2), I'm having trouble understanding the
format of Documents created, because when I get a document from the Index,
its only field is docid.
Is this an indicator of incorrect indexation and if not, how should I use
the index, in order to search for occurrences of a term, within an article
(I was imagining of doing a boolean query, with on sub-query being the
article's name and the other the term I'm searching for within the article)?

Regards,
Armīns