for 2.x and 3.x you can simply use this code:

    Directory dir = FSDirectory.open(new File("./testindex"));
    IndexReader reader = IndexReader.open(dir);
    List<String> urls = new ArrayList<String>(reader.numDocs());
    for (int i = 0; i < reader.maxDoc(); i++) {
        if (!reader.isDeleted(i)) {
            Document doc = reader.document(i);
            urls.add(doc.get("url"));
        }
    }

if the url field is indexed, you can use FieldCache.StringIndex to speed this up.

as for trunk (4.x), I can't find the isDeleted(int) method. Could anyone tell me why this method was removed?

On Mon, Feb 13, 2012 at 10:31 PM, SearchTech <searc...@gmail.com> wrote:
> Hi there,
>
> I am currently working on a search engine based on Lucene and have some
> issues because Java is not my regular programming language, which makes
> things a bit hard.
> What I was wondering is whether you would be available for a small custom
> (paid) job to solve one of my issues.
>
> I am basically looking for a way to extract and save all links from a .fdt
> file to a text file.
>
> The reason for this is simple: the engine I am building indexes remote
> sites based on the dmoz dump. The issue is that my MySQL database, where
> all URLs are stored, contains 2 million entries, but when I have indexed
> everything I get about 1.8 million documents, because some time out, some
> redirect to another domain, and some just fail. So my mission is to
> extract all URLs from the final .fdt files and then enter them into my
> database again, to have a "fresh" set of URLs to index without the need
> to run the crawler on all domains again and waste bandwidth.
>
> That said, I was wondering if you would possibly be available for a quick
> project to write me a small Java tool which works like:
>
> java tool.jar index.fdt links.txt
>
> which would basically export all found links from the .fdt file and save
> them line by line to links.txt.
>
> This would be really wonderful and would enable me to finalize my project
> :)
>
> If you are up for this, please let me know, and also let me know how much
> you would charge for it.
>
> Thank you for your time reading this.
>
> Juergen
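P.S. As far as I can tell, in 4.x the per-document deleted flag moved into the live-docs bitset, so a rough 4.x equivalent of the loop above might look like the sketch below (assuming the 4.x DirectoryReader/MultiFields API; I have not run this against trunk):

```java
// Sketch for Lucene 4.x: isDeleted(int) is gone; liveness is exposed
// as a Bits bitset instead. Untested assumption based on the 4.x API.
Directory dir = FSDirectory.open(new File("./testindex"));
DirectoryReader reader = DirectoryReader.open(dir);
// null means the index has no deletions at all
Bits liveDocs = MultiFields.getLiveDocs(reader);
List<String> urls = new ArrayList<String>(reader.numDocs());
for (int i = 0; i < reader.maxDoc(); i++) {
    if (liveDocs != null && !liveDocs.get(i)) {
        continue; // document i has been deleted
    }
    Document doc = reader.document(i);
    urls.add(doc.get("url"));
}
reader.close();
```

Corrections welcome if the live-docs approach is not the intended replacement.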