If you can categorize the documents based on user permissions, that is the route I would go. For example users 1, 2, and 3 are allowed to search documents a and b. In addition, user 1 can search documents c and d, while users 2 and 3 can search documents e and f. I would create 3 indexes: one for docs a and b, one for docs c and d, and finally one for docs e and f. Then using your method of choice, you can restrict documents based on the users permission.

I realize scaling may cause an issue, but this route would allow you to normalize your data and reduce duplication in the system.

Shane

ariel goldberg wrote:
Greetings,





I'm creating an application that requires the indexing of millions of documents on behalf of a large group of users, and was hoping to get an opinion on whether I should use one index per user or one index per day.





My application will have to handle the following:





- the indexing of about 1 million 5K documents per day, with each document containing about 5 fields



- expiration of documents, since after a while, my hard drive would run out of room



- queries that consist of boolean expressions (e.g., the body field contains "a" AND "b", and the title field contains "c"), as well as ranges (e.g., the document needs to have been indexed between 2/25/07 10:00 am and 2/28/07 9:00 pm)



- permissions; in other words, user A might be able to search on documents X and Y, but user B might be able to search on documents Y and Z.



- up to 1,000 users





So, I was considering the following:





1) Using one index per user





This would entail creating and using up to 1,000 indices. Document Y in the example above would have to be duplicated. Expiration is performed via IndexWriter.deleteDocuments. The advantage here is that querying should be reasonably quick, because each index would only contain tens of thousands of documents, instead of millions. The disadvantages: I'm concerned about the "too many open files" error, and I'm also concerned about the performance of deleteDocuments.





2) Using one index per day





Each day, I create a new index. Again, document Y in the example above would have to be duplicated (is there any way around this?) The advantage here is that expiring documents means simply deleting the index corresponding to a particular day. The disadvantage is the query performance, since the queries, which are already very complex, would have to be performed using MultiSearcher (if expiration is after 10 days, that's 10 indices to search across).





Tough to know for sure which option is better without testing, but does anyone have a gut reaction? Any advice would be greatly appreciated!





Thanks,



Ariel






____________________________________________________________________________________
Need Mail bonding?
Go to the Yahoo! Mail Q&A for great tips from Yahoo! Answers users.
http://answers.yahoo.com/dir/?link=list&sid=396546091

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to