Re: Sort runs out of memory

2012-05-21 Thread Toke Eskildsen
, 2000 and 5678), you might map them down (to 0, 1, 2 and 3 for this example) and store them as a byte. Currently Lucene only supports atomic types for numerics in the FieldCache, so the smallest one is byte. It is possible to use only ceil(log2(#unique_values)) bits/document, although that
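A minimal sketch of the packing idea in this snippet, assuming one numeric value per document and a non-empty input. The class PackedOrdinals and its method names are hypothetical and not part of Lucene. It maps each value to its ordinal among the sorted unique values and stores that ordinal in ceil(log2(#unique_values)) bits (at least 1).

```java
import java.util.Arrays;

/** Hypothetical helper, not a Lucene class: one packed ordinal per document. */
final class PackedOrdinals {
    private final long[] sortedUnique; // ordinal -> original value
    private final long[] blocks;       // ordinals packed back to back
    private final int bitsPerValue;

    PackedOrdinals(long[] valuePerDoc) {
        if (valuePerDoc.length == 0) {
            throw new IllegalArgumentException("need at least one document");
        }
        sortedUnique = Arrays.stream(valuePerDoc).distinct().sorted().toArray();
        // ceil(log2(#unique_values)), but never less than 1 bit
        bitsPerValue = Math.max(1, 64 - Long.numberOfLeadingZeros(sortedUnique.length - 1L));
        long totalBits = (long) valuePerDoc.length * bitsPerValue;
        blocks = new long[(int) ((totalBits + 63) / 64)];
        for (int doc = 0; doc < valuePerDoc.length; doc++) {
            set(doc, Arrays.binarySearch(sortedUnique, valuePerDoc[doc]));
        }
    }

    private void set(int doc, long ordinal) {
        long bitPos = (long) doc * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        blocks[block] |= ordinal << shift;
        if (shift + bitsPerValue > 64) { // the value straddles two longs
            blocks[block + 1] |= ordinal >>> (64 - shift);
        }
    }

    /** Ordinal of the value stored for the given document. */
    int ordinal(int doc) {
        long bitPos = (long) doc * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = blocks[block] >>> shift;
        if (shift + bitsPerValue > 64) {
            v |= blocks[block + 1] << (64 - shift);
        }
        return (int) (v & ((1L << bitsPerValue) - 1));
    }

    /** Original value stored for the given document. */
    long value(int doc) {
        return sortedUnique[ordinal(doc)];
    }

    int bitsPerValue() {
        return bitsPerValue;
    }
}
```

With four unique values this uses 2 bits per document instead of the 8 bits of a byte. The trade-off is the extra shifting and masking on every lookup.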

Re: RAM or SSD...

2012-07-18 Thread Toke Eskildsen
ing drive for the large slow stuff. Nowadays, a 30GB index (or 100GB for that matter) falls into the small low-latency bucket. SSDs speeds up almost everything, saves RAM and spares a lot of work hours optimizing I/O-speed. Regard

Re: how do I paginate Lucene search results deeply

2013-03-14 Thread Toke Eskildsen
On Thu, 2013-03-14 at 04:11 +0100, dizh wrote: > each document has a timestamp identify the time which it is indexed, I > want search the documents using sort, the sort field is the timestamp, [...] > but when you do paging, for example in a web app , the user want to go > to the last 4980-50

Re: how do I paginate Lucene search results deeply

2013-03-14 Thread Toke Eskildsen
On Thu, 2013-03-14 at 11:03 +0100, Toke Eskildsen wrote: > (timestamp_in_ms << 10) & counter++ This should be (timestamp_in_ms << 10) | counter++ - To unsubscribe, e-mail: java-user-unsubscr...@lu
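A small sketch of the corrected expression, assuming a 10-bit counter as in the snippet. The class name UniqueSortKey and its methods are hypothetical. The counter is masked so that it cannot spill into the timestamp bits; keys for the same millisecond stay distinct as long as fewer than 1024 of them are generated in a row. With & instead of |, the result is 0 for any counter below 1024, since the low 10 bits of the shifted timestamp are always zero.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: builds sort keys from a timestamp plus a counter. */
final class UniqueSortKey {
    private static final int COUNTER_BITS = 10;
    private static final long COUNTER_MASK = (1L << COUNTER_BITS) - 1;

    private final AtomicLong counter = new AtomicLong();

    /** (timestamp_in_ms << 10) | counter++, with the counter kept to 10 bits. */
    long next(long timestampMs) {
        return (timestampMs << COUNTER_BITS) | (counter.getAndIncrement() & COUNTER_MASK);
    }

    /** Recovers the millisecond timestamp from a key. */
    static long timestampOf(long key) {
        return key >>> COUNTER_BITS;
    }
}
```

Sorting on such a key orders documents by timestamp first, and the key can serve as a unique position when paging deeply through sorted results.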

RE: search-time facetting in Lucene

2013-05-06 Thread Toke Eskildsen
a.com/wiki/display/BOBO/Create+a+Browse+Index The implicit requirement is that the values for your facet fields are already indexed so that the analyzed content fits your faceting requirements. - Toke Eskildsen

Re: why did I build index slower and slower ?

2013-05-13 Thread Toke Eskildsen
ight want to switch to a setup where the index writer is persistent. - Toke Eskildsen, State and University Library, Denmark

Re: In memory index (current status in Lucene)

2013-07-02 Thread Toke Eskildsen
jvm.html Testing the Zing with MMapDirectory vs. RAMDirectory would be a great addition to Mike's blog post. I wonder if Java's ByteBuffer could be used to make a more GC-friendly RAMDirectory? Regards, Toke Eskildsen, State and

Re: Lucene handling of duplicate terms

2013-09-05 Thread Toke Eskildsen
o's and discard the ones that only has one? That is simpler: (A:foo AND B:foo) OR A:"foo foo"~1000 OR B:"foo foo"~1000 This all works under the assumption that you have less than 1000 terms in each instance of your fields. Adjust accordingly. - Toke Eskildsen, State

Re: Aw: Re: Strange performance of Lucene 4.4.0

2013-09-09 Thread Toke Eskildsen
On Sun, 2013-09-08 at 15:15 +0200, Mirko Sertic wrote: > I have to check, but my usecase does not require sorting or even > scoring at all. I still do not get what the difference is... Please describe how you perform your measurements. How do you ensure that the index is warmed equally for the two

RE: Any Solid State Drive performance comparisons?

2010-06-13 Thread Toke Eskildsen
Rob Bygrave [robin.bygr...@gmail.com] wrote: > Has anyone done a performance comparison for an index on a Solid State Drive > (vs any other hard drive ... SATA/SCSI)? We did a fair amount of testing two years ago and put some graphs at http://wiki.statsbiblioteket.dk/summa/Hardware The short vers

Re: is this the right way to go?

2010-06-15 Thread Toke Eskildsen
On Thu, 2010-06-10 at 04:03 +0200, fujian wrote: > Another thing is about unique. I thought it was unique "field value". If it > means unique term, for English even loading all around 300,000 terms it > won't take much memory, right? (Suppose the average length of term is 10, > the total memory usa

Re: Best practices for searcher memory usage?

2010-07-14 Thread Toke Eskildsen
On Tue, 2010-07-13 at 23:49 +0200, Christopher Condit wrote: > * 20 million documents [...] > * 140GB total index size > * Optimized into a single segment I take it that you do not have frequent updates? Have you tried to see if you can get by with more segments without significant slowdown? > Th

RE: Best practices for searcher memory usage?

2010-07-15 Thread Toke Eskildsen
u can also take a look at the rank for the most common terms. If it is very high this would explain the long execution times for compound queries that use one or more of these terms. A stopword filter would help in this case if such a filter is acceptable for you. Regards, Toke Eskildsen

RE: Best practices for searcher memory usage?

2010-07-16 Thread Toke Eskildsen
and for that we used our standard setup with logged queries in order to emulate the production setting. Regards, Toke Eskildsen - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: scalability limit in terms of numbers of large documents

2010-08-16 Thread Toke Eskildsen
m, but most of their thoughts and solutions can be used for clean data too. Regards, Toke Eskildsen

Re: slow search threads during a disk copy

2010-08-24 Thread Toke Eskildsen
unting with sync instead of async which gave us much better response times during copying at the cost of a substantially slower copy. dirsync should also be worth looking into. Regards, Toke Eskildsen

RE: Sorting a Lucene index

2010-08-25 Thread Toke Eskildsen
ifying an existing order-array is cheaper than a full re-sort or not depends on your batch size. Regards, Toke Eskildsen

Re: Bettering search performance

2010-08-27 Thread Toke Eskildsen
On Fri, 2010-08-27 at 05:34 +0200, Shelly_Singh wrote: > I have a lucene index of 100 million documents. [...] total index size is > 7GB. [...] > I get a response time of over 2 seconds. How many documents match such a query and how many of those documents do you process (i.e. extract a term f

Re: How to find performance bottleneck

2010-10-06 Thread Toke Eskildsen
s. Switching to the Java part, try using visualvm https://visualvm.dev.java.net/ with the Visual GC-plugin to see where the time is spent. - Toke Eskildsen

Re: how to index large number of files?

2010-10-21 Thread Toke Eskildsen
On Thu, 2010-10-21 at 05:01 +0200, Sahin Buyrukbilen wrote: > Unfortunately both methods didnt go through. I am getting memory error even > at reading the directory contents. Then your problem is probably not Lucene related, but the sheer number of files returned by listFiles. A Java File contain

Re: Lucene Software/Hardware Setup Question

2010-10-26 Thread Toke Eskildsen
ision is relatively modest machines with quad-core i7, 16GB of RAM and consumer-grade SSDs (Intel or SandForce). As we have mirrored servers and since no one dies if they can't find a book at our library, using enterprise-

Re: Next Word - Any Suggestions?

2010-10-26 Thread Toke Eskildsen
e same as above. For the faceting method, just reverse the order in the bi-grams. Regards, Toke Eskildsen

RE: Lucene Software/Hardware Setup Question

2010-10-27 Thread Toke Eskildsen
size makes the existing 256GB/machine a tight fit. I seem to remember that there are two free slots in our servers, so adding 2 new consumer-class SSDs is the obvious upgrade. We're switching to a more memory- and CPU-efficient way of handling sorting and faceting, so we should not need to boost

Re: a proof that every word is indexing properly

2010-12-02 Thread Toke Eskildsen
On Thu, 2010-12-02 at 03:54 +0100, David Linde wrote: > Has anyone figured out a way to logically prove that lucene indexes ever > word properly? The "Precision and recall in lucene"-thread seems relevant here. > Our company has done alot of research into lucene, all of our IT department > is rea

Re: Scale up design

2010-12-15 Thread Toke Eskildsen
On Wed, 2010-12-15 at 09:42 +0100, Ganesh wrote: > What is the advantage of going for 64 Bit. Larger maximum heap, more memory in the machine. > People claim performance and usage of more RAM. Yes, pointers normally take up 64bit on a 64bit machine. Depending on the application, the overhead can

Re: Re: Scale up design

2010-12-16 Thread Toke Eskildsen
e shard, then multiply the performance of a single created by merging 10 shards with that number. Regards, Toke Eskildsen

Re: Scale out design patterns

2011-02-03 Thread Toke Eskildsen
On Fri, 2011-02-04 at 05:54 +0100, Ganesh wrote: > 2. Consider a scenario I am sharding based on the User, I am having single > search server and It is handling 1000 members. Now as the memory consumption > is high, I have added one more search server. New users could access the > second server

RE: Lucene search result produced wrong result (due to java Collation)?

2011-02-28 Thread Toke Eskildsen
On Mon, 2011-02-28 at 22:44 +0100, Zhang, Lisheng wrote: > Very sorry I made a typo, what I meant to say is that lucene sort produced > wrong > result in English names (String ASC): > > liu yu > l yy The standard Java Collator ignores whitespace. It can be hacked, but you will have to write your

Re: Sharding Techniques

2011-05-10 Thread Toke Eskildsen
On Mon, 2011-05-09 at 13:56 +0200, Samarendra Pratap wrote: > We have an index directory of 30 GB which is divided into 3 subdirectories > (idx1, idx2, idx3) which are again divided into 21 sub-subdirectories > (idx1-1, idx1-2, , idx2-1, , idx3-1, , idx3-21). So each part is about ½ G

Re: Sharding Techniques

2011-05-13 Thread Toke Eskildsen
On Fri, 2011-05-13 at 12:11 +0200, Samarendra Pratap wrote: > Comparison between - single index Vs 21 indexes > Total Size - 18 GB > Queries run - 500 > % improvement - roughly 18% I was expecting a lot more. Could you test whether this is an IO-issue by selecting a slow query and performing the e

Re: Indexing speed on NTFS

2011-05-31 Thread Toke Eskildsen
On Tue, 2011-05-31 at 08:52 +0200, Maciej Klimczuk wrote: > I did some testing with 3.1.0 demo on Windows and encountered some strange > bahaviour. I tried to index ~6 small text documents using the demo. > - First trial took about 18 minutes. > - Second and third trial took about 2 minutes.

Re: Federated relevance ranking

2011-06-06 Thread Toke Eskildsen
On Thu, 2011-06-02 at 21:51 +0200, Clint Gilbert wrote: > We're also considering a home-grown scheme involving normalizing the > denominators of all the index components in all our indices, based on > the sums of counts obtained from all the indices. This feels like > re-inventing the wheel, and i

Re: RAMDirectory doesn't win over FSDirectory all the time, why?

2011-06-07 Thread Toke Eskildsen
On Mon, 2011-06-06 at 15:29 +0200, zhoucheng2008 wrote: > I read the lucene in action book and just tested the > FSversusRAMDirectoryTest.java with the following uncommented: > [...]Here is the output: > > RAMDirectory Time: 805 ms > > FSDirectory Time : 728 ms This is the code, right? http://ja

Re: Boosting a document at query time, based on a field value/range

2011-06-10 Thread Toke Eskildsen
On Fri, 2011-06-10 at 10:38 +0200, Sowmya V.B. wrote: > I am looking for a possibility of boosting a given document at query-time, > based on the values of a particular field : instead of plainly sorting the > normal lucene results based on this field. I think you misunderstand Eric's answer, as h

Re: Index size and performance degradation

2011-06-14 Thread Toke Eskildsen
e vs. performance looked like the power law: Heavy performance degradation in the beginning, less later. It makes sense when we look at caching and it means that if you do not require stellar performance, you can have very large indexes on few machines (cu

Re: Boosting a document at query time, based on a field value/range

2011-06-15 Thread Toke Eskildsen
build your Query by code, you can use ConstantScoreRangeQuery or RangeQuery for the range part, where you can call setBoost(float). - Toke Eskildsen

RE: field sorted searches with unbounded hit count

2011-06-23 Thread Toke Eskildsen
On Thu, 2011-06-23 at 22:41 +0200, Tim Eck wrote: > I don't want to accuse anyone of bad code but always preallocating a > potentially large array in org.apache.lucene.util.PriorityQueue seems > non-ideal for the search I want to run. The current implementation of IndexSearcher uses threaded s

RE: distributing the indexing process

2011-06-30 Thread Toke Eskildsen
On Thu, 2011-06-30 at 11:45 +0200, Guru Chandar wrote: > Thanks for the response. The documents are all distinct. My (limited) > understanding on partitioning the indexes will lead to results being > different from the case where you have all in one partition, due to > Lucene currently not supp

Re: deleting 8,000,000 indexes takes forever!!!! any solution to this...

2011-07-06 Thread Toke Eskildsen
On Tue, 2011-07-05 at 17:50 +0200, Hiller, Dean x66079 wrote: > We are using a sort of nosql environment and deleting 200 gig on one machine > from the database is fast, but then we go and delete 5 gigs of indexes that > were created and it takes forever 8 million indexes is at a minimum 16

Re: SSD Experience

2011-08-23 Thread Toke Eskildsen
On Mon, 2011-08-22 at 18:49 +0200, Rich Cariens wrote: > Does anyone have any experiences or stories they can share about how SSDs > impacted search performance for better or worse? Our measurements are getting old, but since spinning disks haven't improved and SSDs have improved substantially since

Re: SSD Experience

2011-08-23 Thread Toke Eskildsen
On Tue, 2011-08-23 at 10:23 +0200, Dawid Weiss wrote: > This one is humorous (watch for foul language though). It does get to > the point, however, and Bergman is a clever guy: > http://www.livestream.com/oreillyconfs/video?clipId=pla_3beec3a2-54f5-4a19-8aaf-35a839b6ecaa We installed SSDs in all

Re: SSD Experience (on developer machine)

2011-08-23 Thread Toke Eskildsen
On Tue, 2011-08-23 at 11:52 +0200, Federico Fissore wrote: > we are probably running out of topic here, but for the record, there is > also someone lamenting about ssd I find all of this highly on-topic. SSD reliability is an important issue. We use consumer-grade SSDs (Intel 510 were the latest

Re: SSD Experience (on developer machine)

2011-08-23 Thread Toke Eskildsen
On Tue, 2011-08-23 at 14:07 +0200, Marvin Humphrey wrote: > I'm a little confused. What do you mean by a "full to-hardware flush" > and how is that different from the sync()/fsync() calls that Lucene > makes by default on each IndexWriter commit()? A standard flush from the operating system flu

Re: SSD Experience (on developer machine)

2011-08-23 Thread Toke Eskildsen
. I would suggest checking with a S.M.A.R.T. tool to see if it provides you with write statistics. I would be surprised if they were that high. Regards, Toke Eskildsen

Re: SSD Experience (on developer machine)

2011-08-23 Thread Toke Eskildsen
hem out is unfounded. Regards, Toke Eskildsen

Re: SSD Experience (on developer machine)

2011-08-23 Thread Toke Eskildsen
statistics on models and recalls would come in handy. > fede Heh. I'm sorry, but in Danish "fede" means "fatty". On the other hand, I also know what "Toke" means in English. Regards, Toke Eskildsen

Re: SSD Experience (on developer machine)

2011-08-26 Thread Toke Eskildsen
be a very bad wear-leveling strategy. Keeping a counter for each cell and selecting the free cell with the lowest count is trivial. However, given the bumpy road to great SSDs, I am sure that some vendors have done it this way. Regards, Toke Eskildsen

Re: SSD Experience (on developer machine)

2011-08-26 Thread Toke Eskildsen
On Wed, 2011-08-24 at 11:46 +0200, David Nemeskey wrote: > Theoretically, in the case described above, it would be possible to move > 'static' data (data of cells that have not been written to for a long time) > to > the 5GB in question and use the 'fresher' cells as free space; this could be >

Re: Memory issues

2011-09-05 Thread Toke Eskildsen
On Sat, 2011-09-03 at 20:09 +0200, Michael Bell wrote: > To be exact, there are about 300 million documents. This is running on a 64 > bit JVM/64 bit OS with 24 GB(!) RAM allocated. How much memory is allocated to the JVM? > Now, their searches are working fine IF you do not SORT the results. If

Re: Question on the increase in the index space for larger indexes

2011-09-07 Thread Toke Eskildsen
On Tue, 2011-09-06 at 17:32 +0200, Saurabh Gokhale wrote: > Then I saw index size started exponentially increasing and by the end of 1 > year worth of data processing, I was expecting the index to be 60 to 70 GB > but the size grew to more than 120GB. > > 1. Is it an expected behavior? No, quite

Re: Extracting all documents for a given search

2011-09-19 Thread Toke Eskildsen
On Sat, 2011-09-17 at 03:57 +0200, Charlie Hubbard wrote: > I really just want to be called back when a new document is found by the > searcher, and I can load the Document, find my object, and drop that to a > file. I thought that's essentially what a Collector is, being an interface > that is c

RE: Stored fields and OS file caching

2014-04-05 Thread Toke Eskildsen
. The 2K does not always make sense BTW: Newer harddrives use 4K as the smallest physical entity: http://en.wikipedia.org/wiki/Disk_sector - Toke Eskildsen

RE: Compare scores from multiple indices

2014-04-17 Thread Toke Eskildsen
, then the scores between them will be very poorly comparable. > If so, what can I do to make the scores from multiple indices comparable? Wait for https://issues.apache.org/jira/browse/SOLR-1632 or ensure that the content (and sizes) of your indices are homogenou

RE: Lucene: Index Writer to write in multiple file instead make one heavy file

2014-05-13 Thread Toke Eskildsen
produce a single large file. I guess you are performing an optimize. Don't do that (it is not really recommended anyway) and you should have multiple smaller files. If that was not clear, then please show us the part of your code that handles index updates. -

Re: Can RAMDirectory work for gigabyte data which needs refreshing of the index all the time?

2014-05-16 Thread Toke Eskildsen
give you poor performance: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html Stick to MMapDirectory: As you are considering using a RAMDirectory, your index must be smaller than the amount of free RAM, which means that everything will be fully cached and fast. - Toke Eskildsen,

RE: search time & number of segments

2014-05-17 Thread Toke Eskildsen
problem or switch to SSD. - Toke Eskildsen

Re: search time & number of segments

2014-05-19 Thread Toke Eskildsen
adays and even the enterprise ones are not that pricey. Same goes for RAM as long as we're talking about a relatively small amount such as 32GB. - Toke Eskildsen, State and University Library, Denmark

Re: NewBie To Lucene || Perfect configuration on a 64 bit server

2014-05-20 Thread Toke Eskildsen
tter to keep it a couple of minutes? That way further searches from the same client would be fast. Overall, I worry about your architecture. It scales badly with the number of documents/client. You might not have any clients with more than 500 documents right now, but can you be sure that this will no

Re: NewBie To Lucene || Perfect configuration on a 64 bit server

2014-05-20 Thread Toke Eskildsen
ith Lucene seems like the absolute worst of both worlds. Does the DB-selector do anything that cannot easily be replicated in Lucene? - Toke Eskildsen, State and University Library, Denmark

Re: NewBie To Lucene || Perfect configuration on a 64 bit server

2014-05-20 Thread Toke Eskildsen
-queries are all simple matching? No complex joins and such? If so, this calls even more for a full Lucene-index solution, which handles all aspects of the search process. - Toke Eskildsen, State and University Library, Denmark

Re: search time & number of segments

2014-05-20 Thread Toke Eskildsen
. Some observations you might find relevant: https://sbdevel.wordpress.com/2013/06/06/memory-is-overrated/ - Toke Eskildsen, State and University Library, Denmark

RE: search time & number of segments

2014-05-20 Thread Toke Eskildsen
why you get so many more I/O operations with your 16 segments. Do you have some typical response times from the optimized index and the segmented one, after some hundred or thousand queries have been processed and the OS cache is properly warmed? Can you give us a representa

Re: search performance

2014-06-02 Thread Toke Eskildsen
the right number? I do not see a hardware upgrade changing that with the fine machine you're using. What is your search speed if you disable continuous updates? When you restart the searcher, how long does the first search take? - Toke Eskildsen, State and University Li

Re: search performance

2014-06-03 Thread Toke Eskildsen
ng updates - Limit page size - Limit lookup of returned fields - Disable highlighting - Simpler queries - Whatever else you might think of At some point along the way I would expect a sharp increase in performance. > I've requested access to the indexes so that we can perform further testing.

RE: search performance

2014-06-03 Thread Toke Eskildsen
e. A searchAfter that takes a position would either need to use some clever caching or perform the giant sorted collection when called. - Toke Eskildsen

Re: frozen in PriorityQueue.downHeap for more than 25 minutes

2014-06-23 Thread Toke Eskildsen
he most compact way and perform sorting on the full collection afterwards. - Toke Eskildsen, State and University Library, Denmark

Re: frozen in PriorityQueue.downHeap for more than 25 minutes

2014-06-23 Thread Toke Eskildsen
rence. I am no expert there, but I will advise you to check how much free memory your JVM has when it is running searches. GC tweaks do not help much if the JVM is nearly out of memory. - Toke Eskildsen, State and University Library, Denmark

Re: frozen in PriorityQueue.downHeap for more than 25 minutes

2014-06-23 Thread Toke Eskildsen
this with Lucene? If so, which API functions do I need to call? InPlaceMergeSorter is a nice one to extend. But again, with 50K result sets, this seems like overkill. - Toke Eskildsen, State and University Library, Denmark

Re: Searching on Large Indexes

2014-06-27 Thread Toke Eskildsen
atency? Increasing throughput? More complex queries? - Toke Eskildsen, State and University Library, Denmark

Re: Speed up searching in multiple-thread?

2014-09-15 Thread Toke Eskildsen
outcome of your test? - Toke Eskildsen, State and University Library, Denmark

Re: 回复: Speed up searching in multiple-thread?

2014-09-15 Thread Toke Eskildsen
other services that are the bottleneck. - Toke Eskildsen, State and University Library, Denmark

RE: Lucene Java Caching Question

2014-10-02 Thread Toke Eskildsen
get right, but the only somewhat-sound approximation of real world performance. - Toke Eskildsen

RE: Caused by: java.lang.OutOfMemoryError: Map failed

2014-11-07 Thread Toke Eskildsen
ChannelImpl.map(FileChannelImpl.java:846) That error can also be thrown when the number of open files exceeds the given limit. "OutOfMemory" should really have been named "OutOfResources". Check the maximum number of open files with 'ulimit -n'. Try r

RE: Caused by: java.lang.OutOfMemoryError: Map failed

2014-11-07 Thread Toke Eskildsen
to 16k, which had been working well. If you don't use compound indexes and all your indexes are handled under the same process constraint, then 16K seems quite low for hundreds of indexes. You could check by issuing a file count on your index fol

RE: Memory consumption on lucene 2.4

2014-11-21 Thread Toke Eskildsen
) and how ca we do such request ? Luke has term statistics built-in. I don't remember the details, but I recall that it was straightforward. - Toke Eskildsen

RE: how to "load" mmap directory into memory?

2014-12-03 Thread Toke Eskildsen
se 'cp' as it is smart enough to bypass the operation if the destination is /dev/null. Caveat: This does not guarantee that your index stays fully cached. It can be evicted just like all other disk cache, if other programs

RE: A question on performance

2015-01-07 Thread Toke Eskildsen
our response times grow roughly linearly (with a bump at one point, due to the switch from sparse to non-sparse docset) as a function of hitcount, there is not much to do about it besides sharding, with the current single-threaded processing of lucene qu

Re: lucene scalability query

2015-01-08 Thread Toke Eskildsen
ndex time, your indexes are tiny. What you are seeing is probably just statistical flukes. Try re-running your tests a few times and you will see the numbers change. - Toke Eskildsen

Re: large amount of data cause performance problem

2015-02-05 Thread Toke Eskildsen
oblems If that does not help, give us some information to work with: How large is your index (byte size and document count), what hardware do you have, how large is your JVM heap, how many documents do you request at a time, what is a typical query? - Toke Eskildsen, State and University Library, D

Re: BytesRef violates the principle of least astonishment

2015-05-20 Thread Toke Eskildsen
onvention and the special method being BytesRef#shallowCopyOf(BytesRef). But we are where we are, so I don't find it viable to change behaviour. More explicit documentation, as Dawid suggests, seems the best band aid. - Toke Eskil

Re: Compressing docValues with variable length bytes[] by block of 16k ?

2015-08-09 Thread Toke Eskildsen
instead. This seems contrary to http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/document/BinaryDocValuesField.html Maybe you could update the JavaDoc for that field to warn against using it? - Toke Eskildsen

Re: Compressing docValues with variable length bytes[] by block of 16k ?

2015-08-09 Thread Toke Eskildsen
Arjen van der Meijden wrote: > On 9-8-2015 16:22, Toke Eskildsen wrote: > > Maybe you could update the JavaDoc for that field to warn against using it? > It (probably) depends on the contents of the values. That was my impression too, but we both seem to be second-guessing Robert

Re: Lucene 5 : are FixedBitSet and SparseFixedBitSet thread-safe?

2015-09-13 Thread Toke Eskildsen
ces to change code in order for it to take advantage of a changed FixedBitSet. What is it you are trying to achieve? - Toke Eskildsen

Re: one large index vs many small indexes

2015-11-11 Thread Toke Eskildsen
t okay to have a slow first-search but faster subsequent searches? - Toke Eskildsen

Re: 500 millions document for loop.

2015-11-12 Thread Toke Eskildsen
simple to emulate in your Lucene handling code: http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/ - Toke Eskildsen

Re: Why Two Levels of Indirection in BytesRefHash class ?

2016-12-11 Thread Toke Eskildsen
l, there is no need to store the length of the BytesRefs. They can be calculated with bytesStarts[id+1] - bytesStarts[id]. This saves 1-2 bytes per entry and upholds memory locality, so it should have the same performance as now (needs to be tested of course).
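A rough sketch of the offsets-only layout described in this snippet, assuming entries are appended in id order so that entry id+1 starts where entry id ends. OffsetOnlyBytePool is a hypothetical class and does not mirror the actual BytesRefHash implementation; only the name bytesStarts is taken from the snippet.

```java
import java.util.Arrays;

/** Hypothetical append-only pool: lengths are derived from neighbouring start offsets. */
final class OffsetOnlyBytePool {
    private byte[] bytes = new byte[16];
    private int[] bytesStarts = new int[8]; // bytesStarts[id] = offset of entry id
    private int count = 0;                  // bytesStarts[count] = end of the last entry

    /** Appends a value and returns its id. */
    int add(byte[] value) {
        int start = bytesStarts[count];
        int end = start + value.length;
        if (end > bytes.length) {
            bytes = Arrays.copyOf(bytes, Math.max(end, bytes.length * 2));
        }
        System.arraycopy(value, 0, bytes, start, value.length);
        if (count + 2 > bytesStarts.length) {
            bytesStarts = Arrays.copyOf(bytesStarts, bytesStarts.length * 2);
        }
        bytesStarts[count + 1] = end; // one extra slot marks the end of the last entry
        return count++;
    }

    /** Length without a stored length prefix. */
    int length(int id) {
        return bytesStarts[id + 1] - bytesStarts[id];
    }

    /** Copy of the bytes for the given id. */
    byte[] get(int id) {
        return Arrays.copyOfRange(bytes, bytesStarts[id], bytesStarts[id + 1]);
    }
}
```

The cost of this layout is one extra int in the offsets array overall, in exchange for dropping the per-entry length bytes.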

Re: Question about threading in search

2018-08-17 Thread Toke Eskildsen
sue is that fewer larger segments get slower DocValues retrieval, compared to more smaller segments. So a force merge to 1 segment can result in worse performance. - Toke Eskildsen, the Royal Danish Library, Denmark

Re: Static index, fastest way to do forceMerge

2018-11-02 Thread Toke Eskildsen
k. And +1 to the issue BTW. It does not matter too much for us now, as we have shifted to a setup where we build more indexes in parallel, but 3 years ago our process was sequential so the 8 hour delay before building the next part was a bit of

Re: Optimizing a boolean query for 100s of term clauses

2020-06-24 Thread Toke Eskildsen
is simply the number of set bits at the same locations: An AND and a POPCNT of the bitmaps. This does imply a sequential pass of all potential documents, which means that it won't scale well. On the other hand each comparison is a fast check with very low memory overhead, so I hope it will wor
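A minimal sketch of the AND plus POPCNT step, assuming each bitmap is held as a long[]. BitmapOverlap and its methods are hypothetical names. Long.bitCount is the standard Java call for counting set bits; whether it compiles down to a hardware POPCNT instruction depends on the JVM and CPU.

```java
/** Hypothetical helper: counts the positions that are set in both bitmaps. */
final class BitmapOverlap {
    /** AND the words pairwise and sum the set bits. */
    static int overlap(long[] a, long[] b) {
        int words = Math.min(a.length, b.length);
        int count = 0;
        for (int i = 0; i < words; i++) {
            count += Long.bitCount(a[i] & b[i]);
        }
        return count;
    }

    /** Builds a bitmap of the given size with the listed bit positions set. */
    static long[] bitmap(int size, int... setBits) {
        long[] words = new long[(size + 63) / 64];
        for (int bit : setBits) {
            words[bit >>> 6] |= 1L << (bit & 63);
        }
        return words;
    }
}
```

Each comparison touches only the two bitmaps, which matches the low memory overhead mentioned in the snippet. The sequential pass over all candidates remains.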

Re: Optimizing a boolean query for 100s of term clauses

2020-06-25 Thread Toke Eskildsen
d them but I figured I'd share. A few of them, but not all. And your notes on the articles are great. Thanks, Toke Eskildsen, Royal Danish Library

Re: ramdisks

2008-09-05 Thread Toke Eskildsen
On Thu, 2008-09-04 at 17:58 +0200, Cam Bazz wrote: > anyone using ramdisks for storage? there is ramsam and there is also fusion > io. but they are kinda expensive. any other alternatives I wonder? We've done some comparisons of RAM (Lucene RAMDirectory) vs. Flash-SSD vs. conventional harddrives.

Re: ramdisks

2008-09-05 Thread Toke Eskildsen
On Fri, 2008-09-05 at 10:33 +0200, Cam Bazz wrote: [RAM vs. Flash-SSD vs. harddrives] > I have done similar test with ram vs. disk, and IO was the bottleneck. > What flash ssd did you try with? For disks (as in conventional 10.000/15.000 RPM harddrives), IO is clearly the bottleneck for us also.

Re: ramdisks

2008-09-05 Thread Toke Eskildsen
On Fri, 2008-09-05 at 11:00 +0200, Toke Eskildsen wrote: > As for Flash-SSDs, we've tried 2 * MTRON 6000 32GB RAID 0, 2 * SanDisk > 5000 32GB RAID 0 and SanDisk something (64GB model) both as single drive > and 4 drives in RAID 0. Update: The "SanDisk something" tu

RE: Multi -threaded indexing of large number of PDF documents

2008-10-24 Thread Toke Eskildsen
On Fri, 2008-10-24 at 16:01 +0200, Sudarsan, Sithu D. wrote: > 4. We've tried using larger JVM space by defining -Xms1800m and > -Xmx1800m, but it runs out of memory. Only -Xms1080m and -Xmx1080m seems > stable. That is strange as we have 32 GB of RAM and 34GB swap space. > Typically no other appli

RE: Multi -threaded indexing of large number of PDF documents

2008-10-24 Thread Toke Eskildsen
Sudarsan, Sithu D. [EMAIL PROTECTED] wrote: > There have been some earlier messages, where memory consumption issue > for Lucene Documents due to 64 bit (double that of 32 bit). All pointers are doubled, yes. While not a doubling in total RAM consumption, it does give a substantial overhead. > We

Re: Performance of never optimizing

2008-11-03 Thread Toke Eskildsen
On Mon, 2008-11-03 at 04:42 +0100, Justus Pendleton wrote: > 1. Why does the merge factor of 4 appear to be faster than the merge > factor of 2? Because you alternate between updating the index and searching? With 4 segments, chances are that most of the segment-data will be unchanged between sear

Re: Performance of never optimizing

2008-11-04 Thread Toke Eskildsen
On Mon, 2008-11-03 at 23:37 +0100, Justus Pendleton wrote: > What constitutes a "proper warm up before measuring"? The simplest way is to do a number of searches before you start measuring. The first searches are always very slow, compared to later searches. If you look at http://wiki.statsbiblio

Query time document group boosting

2008-11-26 Thread Toke Eskildsen
We use Lucene at our library for indexing from different sources into the same logical index. The sources are very diverse and are prioritized differently at index-time with document boosts. However, different groups of users (or individual users for that matter) have different preferences for the

Re: Query time document group boosting

2008-11-27 Thread Toke Eskildsen
On Thu, 2008-11-27 at 07:30 +0100, Karl Wettin wrote: > The most scary part is that that you will have to score each and every > document that has a source, probably all of the documents in your > corpus. I now see my query-logic was flawed. In order to avoid matching all documents every time,

Re: Query time document group boosting

2008-12-01 Thread Toke Eskildsen
On Thu, 2008-11-27 at 20:55 +0100, Karl Wettin wrote: > A cosmetic remark, I would personally choose a single field for the > boosts and then one token per source. (groupboost:A^10 groupboost:B^1 > groupboost:C^0.1). Agreed. Thanks. > If I'm not misstaken CustomScoreQuery is a non matching qu
