RE: Retrieving values for a NumericDocValuesField [SEC=UNOFFICIAL]

2013-10-23 Thread Stephen GRAY
UNOFFICIAL Hi Duke, Thanks for your post. So how did you retrieve the NumericDocValues? Did you use the MultiDocValues method or the AtomicReader method? Thanks, Steve -Original Message- From: Duke DAI [mailto:duke.dai@gmail.com] Sent: Thursday, 24 October 2013 1:12 PM To: java-use

Re: corrupted index Lucene 4.4

2013-10-23 Thread Chris
Hi Mike, Thanks, I have asked there also, they are investigating, will let you know if something turns up on that front :) On Thu, Oct 24, 2013 at 1:30 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > Hi Chris, > > Sorry, I don't know much about Solr cloud; maybe ask on the solr-use

Re: Retrieving values for a NumericDocValuesField [SEC=UNOFFICIAL]

2013-10-23 Thread Duke DAI
Hi Stephen, I have the same scenario as you. I verified this with a simple pure-Lucene test, the same way Mike mentioned: retrieving values via NumericDocValues is 10x faster than retrieving a stored field. I hope you see a similar performance measurement. Best regards, Duke If not now, when? If not me, who?

RE: Retrieving values for a NumericDocValuesField [SEC=UNOFFICIAL]

2013-10-23 Thread Stephen GRAY
UNOFFICIAL Thanks to Mike and Adrien for the helpful replies. I actually need to loop through a large number of documents (50,000 - 100,000) calculating a number of statistics (min, max, sum) so I really need the most efficient/fastest solution available. It sounds like it would be best to just
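The statistics mentioned above (min, max, sum) can all be gathered in a single pass over the per-document values. The following is only a minimal sketch under stated assumptions: the `long[]` array stands in for values already read from a numeric doc-values source, and the class name `ValueStats` and its method `of` are hypothetical, not part of Lucene.

```java
// Hypothetical helper: single-pass min/max/sum over per-document values.
// The long[] is an assumed stand-in for values read from doc values.
final class ValueStats {
    final long min;
    final long max;
    final long sum;
    final int count;

    private ValueStats(long min, long max, long sum, int count) {
        this.min = min;
        this.max = max;
        this.sum = sum;
        this.count = count;
    }

    static ValueStats of(long[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        long min = values[0];
        long max = values[0];
        long sum = 0;
        // One loop updates all three statistics, so each value is read once.
        for (long v : values) {
            if (v < min) min = v;
            if (v > max) max = v;
            sum += v; // note: may overflow for very large inputs
        }
        return new ValueStats(min, max, sum, values.length);
    }

    public static void main(String[] args) {
        ValueStats s = ValueStats.of(new long[] {7, -2, 15, 4});
        System.out.println("min=" + s.min + " max=" + s.max + " sum=" + s.sum);
    }
}
```

Keeping everything in one loop means the cost for 50,000 to 100,000 documents is dominated by how the values are fetched, not by the arithmetic.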

Re: Merging ordered segments without re-sorting.

2013-10-23 Thread Arvind Kalyan
On Wed, Oct 23, 2013 at 2:45 PM, Adrien Grand wrote: > Hi, > > On Wed, Oct 23, 2013 at 10:19 PM, Arvind Kalyan wrote: > > Sorting is not an option for our case so we will most likely implement a > > variant that merges the segments in one pass. Using TimSort is great but > in > > our case the 2

Re: Merging ordered segments without re-sorting.

2013-10-23 Thread Adrien Grand
Hi, On Wed, Oct 23, 2013 at 10:19 PM, Arvind Kalyan wrote: > Sorting is not an option for our case so we will most likely implement a > variant that merges the segments in one pass. Using TimSort is great but in > our case the 2 segments will be highly interspersed and would not benefit > from th

Re: Merging ordered segments without re-sorting.

2013-10-23 Thread Arvind Kalyan
Thanks again. Sorting is not an option for our case so we will most likely implement a variant that merges the segments in one pass. Using TimSort is great but in our case the 2 segments will be highly interspersed and would not benefit from the galloping in TimSort. In addition, if anyone else
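The one-pass merge described above is the classic two-way merge of already-sorted inputs. The sketch below illustrates the idea on plain arrays of sort keys; it is not Lucene's merge code and not the variant the poster implemented, and the class name `SortedMerge` is hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration: merge two already-sorted key arrays in one pass,
// without re-sorting the combined output.
final class SortedMerge {
    static long[] mergeSorted(long[] a, long[] b) {
        long[] out = new long[a.length + b.length];
        int i = 0, j = 0, k = 0;
        // Repeatedly take the smaller head element; <= keeps the merge stable
        // (on ties, the element from the first input comes first).
        while (i < a.length && j < b.length) {
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        }
        // At most one of the inputs still has elements left; copy them over.
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return out;
    }

    public static void main(String[] args) {
        long[] merged = mergeSorted(new long[] {1, 4, 9}, new long[] {2, 3, 10});
        System.out.println(Arrays.toString(merged));
    }
}
```

Each element is compared at most once per output position, so the merge is linear in the total number of documents even when the two inputs are highly interspersed.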

Re: corrupted index Lucene 4.4

2013-10-23 Thread Michael McCandless
Hi Chris, Sorry, I don't know much about Solr cloud; maybe ask on the solr-user list, and give details about what went wrong? Mike McCandless http://blog.mikemccandless.com On Wed, Oct 23, 2013 at 11:25 AM, Chris wrote: > Wow !!! Thanks a lot for the helpful tips I will implement this in the

Re: Merging ordered segments without re-sorting.

2013-10-23 Thread Shai Erera
SortingAtomicReader uses the TimSort algorithm, which performs well when the two segments are already sorted. Anyway, that's the way to do it, even if it looks like it does more work than it should. Shai On Wed, Oct 23, 2013 at 10:46 PM, Arvind Kalyan wrote: > Thanks, my understanding is that

Re: Merging ordered segments without re-sorting.

2013-10-23 Thread Arvind Kalyan
Thanks, my understanding is that SortingMergePolicy performs sorting after wrapping the 2 segments, correct? As I mentioned in my original email I would like to avoid the re-sorting and exploit the fact that the input segments are already sorted. On Wed, Oct 23, 2013 at 11:02 AM, Shai Erera wr

Re: Merging ordered segments without re-sorting.

2013-10-23 Thread Shai Erera
Hi You can use SortingMergePolicy and SortingAtomicReader to achieve that. You can read more about index sorting here: http://shaierera.blogspot.com/2013/04/index-sorting-with-lucene.html Shai On Wed, Oct 23, 2013 at 8:13 PM, Arvind Kalyan wrote: > Hi there, I'm looking for pointers, suggesti

How to use Lucene-spatial

2013-10-23 Thread Smiley, David W.
Hi folks, If anyone reading this is interested in how to use the spatial module in Lucene, you might be interested in a recent two-part blog post by Steven Citron-Pousty on the OpenShift blog: https://www.openshift.com/blogs/free-text-and-spatial-search-with-spatial4j-and-lucene-spatial https://

Merging ordered segments without re-sorting.

2013-10-23 Thread Arvind Kalyan
Hi there, I'm looking for pointers, suggestions on how to approach this in Lucene 4.5. Say I am creating an index using a sequence of addDocument() calls and end up with segments that each contain documents in a specified ordering. It is guaranteed that there won't be updates/deletes/reads etc hap

Re: corrupted index Lucene 4.4

2013-10-23 Thread Chris
Wow !!! Thanks a lot for the helpful tips. I will implement this in the next two days & report back with my indexing speed. I have one more question... I tried committing to solr cloud, but then something was not correct as it would not index after a few documents... Also, there seems to be som

JLemmaGen project

2013-10-23 Thread Michal Hlavac
Hi, I rewrote the lemmatizer project LemmaGen (http://lemmatise.ijs.si/) in Java. Originally it was written in C#. The LemmaGen project uses rules to lemmatize words. The algorithm is described here: http://lemmatise.ijs.si/Download/File/Documentation%23JournalPaper.pdf The project is licensed under GPLv3. Sources

Re: corrupted index Lucene 4.4

2013-10-23 Thread Michael McCandless
Indexing 100M web pages really should not take months; if you fix committing after every row that should make things much faster. Use multiple index threads, set a highish RAM buffer (~512 MB), use a local disk not a remote mounted fileserver, ideally an SSD, etc. See http://wiki.apache.org/lucen

Re: Lucene in-memory index

2013-10-23 Thread Michael McCandless
On Tue, Oct 22, 2013 at 9:43 AM, Igor Shalyminov wrote: > Thanks for the link, I'll definitely dig into SpanQuery internals very soon. You could also just make a custom query. If you start from the ProxBooleanTermQuery on that issue, but change it so that it rejects hits that didn't have terms

Re: corrupted index Lucene 4.4

2013-10-23 Thread Chris
Actually, it contains about 100 million webpages and was built out of a web index for NLP processing :( I did the indexing & crawling on one small server, and researching and getting it all to this stage took me this much time... and now my index is unusable :( On Wed, Oct 23, 2013 at

Re: corrupted index Lucene 4.4

2013-10-23 Thread Michael McCandless
On Wed, Oct 23, 2013 at 10:33 AM, Chris wrote: > I am not exactly sure if the commit() was run, as i am inserting each row & > doing a commit right away. My solr will not load the index I'm confused: if you are doing a commit right away after every row (which is REALLY bad practice: that's in

Re: corrupted index Lucene 4.4

2013-10-23 Thread Chris
I am not exactly sure if the commit() was run, as I am inserting each row & doing a commit right away. My Solr will not load the index. Is there any way that I can fix this? I have a huge index & will lose months if I try to reindex :( I didn't know Lucene was not stable, I thought it was On W

Re: corrupted index Lucene 4.4

2013-10-23 Thread Michael McCandless
Hmm. Had you actually run a commit() on the index prior to the power loss? If so, a power loss should have left the index as of that last commit. Unfortunately, without a segments_N file, CheckIndex is unusable; a readable segments_N file is currently necessary to recover anything from the index

Re: corrupted index Lucene 4.4

2013-10-23 Thread Chris
Hi Mike, Thanks for the reply. I think it was due to power outage. I don't see any segments file except for segments.gen this is what i see in the folder. Please help - - _a73s_7sy.del _s91x.tvx _sa7s_Lucene41_0.tip _a73s.fdt _s9ez_9.del _sa7s.nvd _a

Re: corrupted index Lucene 4.4

2013-10-23 Thread Michael McCandless
How did this corruption happen? If you "ls" your index directory, is there any segments_N file? Mike McCandless http://blog.mikemccandless.com On Wed, Oct 23, 2013 at 9:01 AM, Chris wrote: > Hi, > > I am running solr 4.4 & one of my collections seems to have a corrupted > index... > > I tried

corrupted index Lucene 4.4

2013-10-23 Thread Chris
Hi, I am running solr 4.4 & one of my collections seems to have a corrupted index... I tried doing - java -cp lucene-core-4.4.0.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /solr2/example/solr/w1/data/index/ -fix But it didn't help... it gives - ERROR: could not read any segments

Re: Retrieving values for a NumericDocValuesField [SEC=UNOFFICIAL]

2013-10-23 Thread Michael McCandless
You can also use MultiDocValues.getNumericDocValues(reader, field): it returns a "wrapper" that will do the binary search on every doc lookup. If you are only looking up a small number of hits (e.g. the current "page" for the user) then typically this cost is fine. Mike McCandless http://blog.mi
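The per-lookup binary search mentioned above maps a top-level document ID to the segment that holds it. The sketch below shows the general idea only; it is not the code inside MultiDocValues. The class name `SegmentLookup` and the `docStarts` array are hypothetical, and the array is assumed to be strictly ascending (no empty segments) with `docStarts[0] == 0`.

```java
import java.util.Arrays;

// Hypothetical illustration: find which segment a global docID belongs to.
// docStarts[i] is assumed to be the first global docID of segment i.
final class SegmentLookup {
    static int segmentOf(int globalDoc, int[] docStarts) {
        int pos = Arrays.binarySearch(docStarts, globalDoc);
        // Exact match: globalDoc is the first document of segment pos.
        // Otherwise binarySearch returns -(insertionPoint) - 1, and the
        // owning segment is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    // Segment-local docID: offset of the document within its segment.
    static int localDoc(int globalDoc, int[] docStarts) {
        return globalDoc - docStarts[segmentOf(globalDoc, docStarts)];
    }

    public static void main(String[] args) {
        int[] docStarts = {0, 100, 250};
        System.out.println(segmentOf(249, docStarts) + " / " + localDoc(249, docStarts));
    }
}
```

The search costs O(log S) per lookup for S segments, which is negligible for one page of hits but adds up when iterating over many documents; in that case, walking the segments one at a time avoids the lookup entirely.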

Re: Retrieving values for a NumericDocValuesField [SEC=UNOFFICIAL]

2013-10-23 Thread Adrien Grand
Hi Stephen, On Wed, Oct 23, 2013 at 9:29 AM, Stephen GRAY wrote: > UNOFFICIAL > Hi everyone, > > I have a question about how to retrieve the values in a > NumericDocValuesField. I understand how to do this in situations where you > have an AtomicReaderContext available > (context.reader().getN

Retrieving values for a NumericDocValuesField [SEC=UNOFFICIAL]

2013-10-23 Thread Stephen GRAY
UNOFFICIAL Hi everyone, I have a question about how to retrieve the values in a NumericDocValuesField. I understand how to do this in situations where you have an AtomicReaderContext available (context.reader().getNumericDocValues(field)). However in a situation where I have just done a search