a complete solution for building a website search with lucene

2010-01-07 Thread jyzhou817
Hi, I am new to Lucene. To build a web search function, it needs to have a backend indexing function. But before that, should I run a crawler? Because Lucene indexes HTML documents, while a crawler can convert the website pages to HTML documents. Am I right? If so, please can anyone suggest

Re: Is there a way to limit the size of an index?

2010-01-07 Thread Babak Farhang
>> I wonder if renaming that to maxSegSizeMergeMB would make it more obvious >> what this does? How about using the *able* moniker to make it clear we're referring to the size of the to-be-merged segment, not the resultant merged segment? I.e. naming it something like "maxMergeableSegSizeMB" ..

Re: Implementing filtering based on multiple fields

2010-01-07 Thread Otis Gospodnetic
Ah, well, masking it didn't help. Yes, ignore Bixo, Nutch, and Droids then. Consider DataImportHandler from Solr or wait a bit for Lucene Connectors Framework to materialize. Or use LuSql, or DbSight, or Sematext's Database Indexer. Yes, I was suggesting a separate index for each user. That's

Re: Implementing filtering based on multiple fields

2010-01-07 Thread Yaniv Ben Yosef
Thanks Otis. If I understand correctly - Bixo, Nutch and Droids are technologies to use for crawling the web and building an index. My project is actually about indexing a large database, where you can think of every row as a web page, and a particular column is the equivalent of a web site. (I di

Re: Implementing filtering based on multiple fields

2010-01-07 Thread Otis Gospodnetic
For something like CSE, I think you want to isolate users and their data/indices. I'd look at Bixo or Nutch or Droids ==> Lucene or Solr Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message > From: Yaniv Ben Yosef > To: java-user@lucene.apache.org > S

Re: Is there a way to limit the size of an index?

2010-01-07 Thread Jason Rutherglen
The naming is unclear, when I looked at this I had to thumb through the code a fair bit before discerning if it was the input segments or the output segment of a merge (it's the former). Though I find the current functionality somewhat odd because it will inherently exceed the given size with a mer

Re: Problems with IndexWriter#commit() on Linux

2010-01-07 Thread Naama Kraus
Thanks all for the hints, I'll get back to my code and do some additional checks. Naama On Thu, Jan 7, 2010 at 6:57 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > kill -9 is harsh, but, perfectly fine from Lucene's standpoint. > Likewise if the OS or JVM crashes, power is suddenly l

Implementing filtering based on multiple fields

2010-01-07 Thread Yaniv Ben Yosef
Hi, I'm very new to Lucene. In fact, I'm at the beginning of an evaluation phase, trying to figure whether Lucene is the right fit for my needs. The project I'm involved in requires something similar to the Google Custom Search Engine (CSE). In CSE, each user can defin

Re: Is there a way to limit the size of an index?

2010-01-07 Thread Otis Gospodnetic
Sure, sounds good, maybe even drop "ing". Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message > From: Michael McCandless > To: java-user@lucene.apache.org > Sent: Thu, January 7, 2010 2:28:15 PM > Subject: Re: Is there a way to limit the size of an in

Re: Is there a way to limit the size of an index?

2010-01-07 Thread Kay Kay
On 1/7/10 11:23 AM, Otis Gospodnetic wrote: Merge factor controls how many segments are merged at once. The default is 10. The maxMergeMB setting sets the max size for a given segment to be included in a merge. I wonder if renaming that to maxSegSizeMergeMB would make it more obvious wha

Re: Is there a way to limit the size of an index?

2010-01-07 Thread Michael McCandless
On Thu, Jan 7, 2010 at 2:23 PM, Otis Gospodnetic wrote: >> Merge factor controls how many segments are merged at once.  The default is >> 10. >> >> The maxMergeMB setting sets the max size for a given segment to be >> included in a merge. > > I wonder if renaming that to maxSegSizeMergeMB would m

Re: Is there a way to limit the size of an index?

2010-01-07 Thread Otis Gospodnetic
> Merge factor controls how many segments are merged at once. The default is > 10. > > The maxMergeMB setting sets the max size for a given segment to be > included in a merge. I wonder if renaming that to maxSegSizeMergeMB would make it more obvious what this does? Otis -- Sematext -- http:/

Re: R-Tree in lucene thoughts?

2010-01-07 Thread Marvin Humphrey
On Thu, Jan 07, 2010 at 01:03:27PM -0500, Ryan McKinley wrote: > Perhaps each segment has its own r-tree. For each query, the r-tree would > only be used to say if something could be in the results, it does not > contribute to the score, so I don't think interleaving would be any > different t

Re: R-Tree in lucene thoughts?

2010-01-07 Thread Ryan McKinley
On Jan 7, 2010, at 12:21 PM, Marvin Humphrey wrote: On Thu, Jan 07, 2010 at 10:13:00AM -0500, Ryan McKinley wrote: With the new flexible indexing stuff, would it be possible to natively write an rtree to disk in the index process? The question I'd have is, how would you handle interleaving

Re: R-Tree in lucene thoughts?

2010-01-07 Thread Marvin Humphrey
On Thu, Jan 07, 2010 at 10:13:00AM -0500, Ryan McKinley wrote: > With the new flexible indexing stuff, would it be possible to natively > write an rtree to disk in the index process? The question I'd have is, how would you handle interleaving of hits from different segments? Meaning, if you have

Re: Is there a way to limit the size of an index?

2010-01-07 Thread Michael McCandless
Merge factor controls how many segments are merged at once. The default is 10. The maxMergeMB setting sets the max size for a given segment to be included in a merge. Roughly, the upper bound on merged segments is the sum of their sizes. So the rough upper bound on any segment's size is mergeFa
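[The rule of thumb Mike sketches above amounts to simple arithmetic; a minimal illustration (the class and method names are invented for this example, and this is only the back-of-the-envelope bound, not a Lucene API call):]

```java
public class MergeSizeBound {
    // A merge combines up to mergeFactor input segments, each of which
    // may be as large as maxMergeMB, so the merged output segment can
    // approach the sum of their sizes: mergeFactor * maxMergeMB.
    static double roughUpperBoundMB(int mergeFactor, double maxMergeMB) {
        return mergeFactor * maxMergeMB;
    }

    public static void main(String[] args) {
        // With the default mergeFactor of 10 and maxMergeMB set to 1.0,
        // no merged segment should grow much beyond ~10 MB.
        System.out.println(roughUpperBoundMB(10, 1.0) + " MB");
    }
}
```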

Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

2010-01-07 Thread Michael McCandless
Do your documents have many different indexed fields? If you do, and norms are enabled, that could be the cause (norms are not stored sparsely). But: what actual sizes are we talking about? Mike On Thu, Jan 7, 2010 at 11:50 AM, Yuliya Palchaninava wrote: > Otis, > > thanks for the answer. > >

Re: Problems with IndexWriter#commit() on Linux

2010-01-07 Thread Michael McCandless
kill -9 is harsh, but, perfectly fine from Lucene's standpoint. Likewise if the OS or JVM crashes, power is suddenly lost, the index will just fallback to the last successful commit. What will cause corruption is if you have bit errors happening somewhere in the machine... or if two writers are ac

Re: AW: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

2010-01-07 Thread Otis Gospodnetic
Maybe you can paste a directory listing before optimization and after optimization? Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message > From: Yuliya Palchaninava > To: "java-user@lucene.apache.org" > Sent: Thu, January 7, 2010 11:50:29 AM > Subjec

Re: Problems with IndexWriter#commit() on Linux

2010-01-07 Thread Erick Erickson
Can you show us the code where you commit? And how do you kill your process? Kill -9 is...er...harsh Yeah, I'm wondering whether the index file size *stays* changed after you kill your process. If it keeps growing on every run (after you kill your process multiple times), then I'd suspect

AW: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

2010-01-07 Thread Yuliya Palchaninava
Otis, thanks for the answer. Unfortunately the index *directory* remains larger *after* the optimization. In our case the optimization was/is completed successfully and, as you say, there is only one segment in the directory. Some other ideas? Thanks, Yuliya > -Original Message

Re: Performance Results on changing the way fields are stored

2010-01-07 Thread Otis Gospodnetic
You could try Avro instead of JSON/XML/Java Serialization. It's compact (and new). http://hadoop.apache.org/avro/ Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message > From: Paul Taylor > To: java-user@lucene.apache.org > Sent: Tue, January 5, 2010

Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

2010-01-07 Thread Simon Willnauer
Do you have a reader open on the index which was opened before your index was optimized? Maybe there is a reader around holding on to the references to the merged segments. simon On Thu, Jan 7, 2010 at 5:23 PM, Yuliya Palchaninava wrote: > Hi, > > According to the api documentation: "In genera

Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

2010-01-07 Thread Otis Gospodnetic
Yuliya, The index *directory* will be larger *while* you are optimizing. After the optimization is completed successfully, the index directory will be smaller. It is possible that your index directory is large(r) because you have some left-over segments (e.g. from some earlier failed/interrup

Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

2010-01-07 Thread Yuliya Palchaninava
Hi, According to the api documentation: "In general, once the optimize completes, the total size of the index will be less than the size of the starting index. It could be quite a bit smaller (if there were many pending deletes) or just slightly smaller". In our case the index becomes not small

Re: Is there a way to limit the size of an index?

2010-01-07 Thread Dvora
Can you explain how the combination of merge factor and max merge size controls the size of files? For example, if one would like to limit the file size to 3, 4 or 7MB - how can these parameter values be predicted? Michael McCandless-2 wrote: > > > This tells the IndexWriter NOT to merge any

Re: Problems with IndexWriter#commit() on Linux

2010-01-07 Thread Naama Kraus
Thanks for the input. 1. While the process is running, I do see the index files growing on disk and the time stamps changing. Should I see a change in size right after killing the process, is that what you mean? 2. Yes, the same directory is being used for indexing and search. 3. Didn't try Luke, goo

R-Tree in lucene thoughts?

2010-01-07 Thread Ryan McKinley
I'm getting back into the spatial search stuff and wanted to get some feedback before starting down any path... I'm building an app, where I need R-tree like functionality -- that is search for all items within some extent / that touch some extent. If anyone has ideas for how this could ma

Re: IndexingChain and TermHash

2010-01-07 Thread Michael McCandless
LUCENE-1410 has the PFor impl, that the PFor codec needs. Mike On Thu, Jan 7, 2010 at 7:46 AM, Renaud Delbru wrote: > Hi Michael, > > I have started to look at the PFOR codec. However, when I include the codec > files inside the flex_1458 branch, it misses the > org.apache.lucene.util.pfor.PFor

Re: Is there a way to limit the size of an index?

2010-01-07 Thread Michael McCandless
On Thu, Jan 7, 2010 at 7:56 AM, Dvora wrote: > Yep, this seems to work :-) Thanks! Super! > The other way of a custom Directory implementation seems interesting, and I > would like to try it (at least as an opportunity to get to know Lucene better) - is > there a simple guide describing how to do that? Perhap

Re: Problems with IndexWriter#commit() on Linux

2010-01-07 Thread Erick Erickson
Several questions: 1> are the index files larger after you kill your process? Or have the timestamps changed? 2> are you absolutely sure that your indexer, when you add documents, is pointing at the same directory your search is pointing to? 3> Have you gotten a copy of Luke and exami

Re: Is there a way to limit the size of an index?

2010-01-07 Thread Dvora
Yep, this seems to work :-) Thanks! The other way of a custom Directory implementation seems interesting, and I would like to try it (at least as an opportunity to get to know Lucene better) - is there a simple guide describing how to do that? Perhaps some other similar Directory implementations? Michael McCa

Re: IndexingChain and TermHash

2010-01-07 Thread Renaud Delbru
Hi Michael, I have started to look at the PFOR codec. However, when I include the codec files inside the flex_1458 branch, it misses the org.apache.lucene.util.pfor.PFor class which is the core of the codec. Where can I find this class ? Thanks, Regards -- Renaud Delbru On 16/11/09 14:01, M

Problems with IndexWriter#commit() on Linux

2010-01-07 Thread Naama Kraus
Hi, I am using the IndexWriter#commit() method in my program to commit document additions to the index. I do that once in a while, after a bunch of documents were added. Since my indexing process is long, I want to make sure I don't lose too many additions in case of a crash. When running on Windows
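[A hedged sketch of the batching pattern Naama describes; IndexWriter.addDocument and commit are real Lucene methods, but the class name, helper name, and batch size here are invented for illustration, and the fragment is untested:]

```java
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

class PeriodicCommit {
    // Commit every BATCH additions so a crash loses at most the last
    // (uncommitted) batch, not the whole indexing run.
    static final int BATCH = 1000;

    static void indexAll(IndexWriter writer, List<Document> docs) throws Exception {
        int added = 0;
        for (Document doc : docs) {
            writer.addDocument(doc);
            if (++added % BATCH == 0) {
                writer.commit(); // durable point: earlier additions survive a crash
            }
        }
        writer.commit(); // commit the final partial batch
    }
}
```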

Re: Is there a way to limit the size of an index?

2010-01-07 Thread Michael McCandless
Is the creation of the index in your control? If so, try something like this: ((LogByteSizeMergePolicy) writer.getMergePolicy()).setMaxMergeMB(1.0); (NOTE: not tested). This works because [currently] LogByteSizeMergePolicy is the default for IndexWriter. Would be safer to create your own Lo
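[Spelling out Mike's one-liner as a slightly fuller configuration fragment - still untested, as he says; `writer` is assumed to be an existing IndexWriter, and the 10.0 MB figure echoes the App Engine file-size limit mentioned elsewhere in the thread:]

```java
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.LogByteSizeMergePolicy;

class MergePolicyConfig {
    // Untested sketch: the default merge policy on IndexWriter is
    // [currently] LogByteSizeMergePolicy, so cast it and cap the size
    // of segments eligible to be merge *inputs*. As discussed in this
    // thread, a merge of mergeFactor such segments can still produce
    // an output segment larger than this cap.
    static void capMergeInputSize(IndexWriter writer) {
        LogByteSizeMergePolicy policy =
            (LogByteSizeMergePolicy) writer.getMergePolicy();
        policy.setMaxMergeMB(10.0); // segments over 10 MB stay out of merges
    }
}
```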

Re: Is there a way to limit the size of an index?

2010-01-07 Thread Dvora
Thanks for the reply. Can you please add some detailed explanations? I'm trying to upload Lucene index to Google App Engine, and the files size must not exceed 10MB. Michael McCandless-2 wrote: > > I don't think this is implemented [yet] today. You'd have to > implement the Directory, IndexI

Re: Is there a way to limit the size of an index?

2010-01-07 Thread Michael McCandless
I don't think this is implemented [yet] today. You'd have to implement the Directory, IndexInput and IndexOutput classes, to make this work. Do you have a hard 2GB limit on index files? Setting maxMergeDocs/Size (via LogByteSizeMergePolicy) lets you approximately limit the size... Mike On Thu,

Is there a way to limit the size of an index?

2010-01-07 Thread Dvora
Hi all, According to the FAQ, "An even more complex and optimal solution: Write a version of FSDirectory that, when a file exceeds 2GB, creates a subdirectory and represents the file as a series of files." Is that solution implemented already? If not, can you guide me please how can I achieve t