RE: [ANNOUNCE] Web Crawler

2013-07-15 Thread Ramakrishna
So there is no way to crawl a website if its owner has blocked crawling? I have one idea, though it seems a little foolish (it may not work, or I would have to modify the whole architecture), but I'll mention it anyway: what if I use an HTML parser (Jsoup) instead of the fetcher? An HTML parser easily takes all the contents of a web page. Can I

Re: ngrams in Lucene 4.3.0

2013-07-15 Thread Malgorzata Urbanska
thanks !!
On Mon, Jul 15, 2013 at 1:31 PM, Ivan Krišto wrote:
> On 07/15/2013 07:50 PM, Malgorzata Urbanska wrote:
>> Hi,
>>
>> I've been trying to figure out how to use ngrams in Lucene 4.3.0
>> I found some examples for earlier version but I'm still confused.
>> How I understand it, I should

Re: ngrams in Lucene 4.3.0

2013-07-15 Thread Ivan Krišto
On 07/15/2013 07:50 PM, Malgorzata Urbanska wrote:
> Hi,
>
> I've been trying to figure out how to use ngrams in Lucene 4.3.0
> I found some examples for earlier version but I'm still confused.
> How I understand it, I should:
> 1. create a new analyzer which uses ngrams
> 2. apply it to my indexe

ngrams in Lucene 4.3.0

2013-07-15 Thread Malgorzata Urbanska
Hi,
I've been trying to figure out how to use ngrams in Lucene 4.3.0. I found some examples for earlier versions, but I'm still confused. As I understand it, I should:
1. create a new analyzer which uses ngrams
2. apply it to my indexer
3. search using the same analyzer
I found in the documentation:
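To illustrate what step 1 of the question above is asking for, here is a minimal sketch in plain Java of what character n-gram tokenization produces. It deliberately does not use Lucene: `NGramSketch`, `charNGrams`, `minGram` and `maxGram` are hypothetical names chosen for this illustration, and the order in which grams are emitted is an arbitrary choice that may differ from what Lucene's own n-gram tokenizers in `org.apache.lucene.analysis.ngram` emit.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a stand-alone character n-gram generator, not Lucene code.
class NGramSketch {

    // Returns every substring of `text` whose length is between minGram and
    // maxGram (inclusive), grouped by start position.
    static List<String> charNGrams(String text, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("need 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            // Stop growing the gram once it would run past the end of the text.
            for (int n = minGram; n <= maxGram && start + n <= text.length(); n++) {
                grams.add(text.substring(start, start + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [lu, luc, uc, uce, ce, cen, en, ene, ne]
        System.out.println(charNGrams("lucene", 2, 3));
    }
}
```

The reason steps 2 and 3 insist on the same analyzer is visible here: a query term only matches if it is split into the same grams that were written to the index.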

RE: [ANNOUNCE] Web Crawler

2013-07-15 Thread karl.wright
Usually, if a webmaster finds that your crawler has ignored their robots.txt, they will block your machine, or maybe even your entire IP block, from accessing their site.
Karl
-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Monday, July 15, 2013 9:30 AM
To

Re: MemoryIndex in Lucene 4.x

2013-07-15 Thread Simon Willnauer
Hey, can you share your benchmark and/or tell us a little more about what your data looks like and how you analyze it? There might be analysis changes that contribute to that.
simon
On Sun, Jul 14, 2013 at 7:56 PM, cischmidt77 wrote:
> I use Lucene/MemoryIndex for a large number of quer

Re: [ANNOUNCE] Web Crawler

2013-07-15 Thread Jack Krupansky
Lucene does not provide any capabilities for crawling websites. You would have to contact the Nutch project, the ManifoldCF project, or other web crawling projects. As far as bypassing robots.txt, that is a very unethical thing to do. It is rather offensive that you seem to be suggesting that
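For readers unfamiliar with what honoring robots.txt involves, the following is a rough sketch of the check a well-behaved crawler makes before fetching a path. It is a simplification written for this illustration, not code from Nutch, ManifoldCF or any other project: `RobotsSketch`, `disallowedForAll` and `mayFetch` are hypothetical names, only `User-agent: *` groups and plain `Disallow` prefixes are handled, and `Allow` lines and wildcards are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: a deliberately simplified robots.txt check.
class RobotsSketch {

    // Collects the Disallow prefixes from groups that apply to every agent ("User-agent: *").
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean applies = false;       // does the current group apply to "*"?
        boolean lastWasAgent = false;  // consecutive User-agent lines share one group
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.replaceAll("#.*", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                applies = (lastWasAgent && applies) || value.equals("*");
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    // A path may be fetched only if it starts with none of the disallowed prefixes.
    static boolean mayFetch(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp\n";
        List<String> rules = disallowedForAll(robots);
        System.out.println(mayFetch("/index.html", rules));      // prints true
        System.out.println(mayFetch("/private/a.html", rules));  // prints false
    }
}
```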

Re: [ANNOUNCE] Web Crawler

2013-07-15 Thread Ramakrishna
Hi, I'm trying to use Nutch to crawl some websites. Unfortunately, they restrict crawling of their sites through robots.txt. By using Crawl-Anywhere, can I crawl any website irrespective of that website's robots.txt? If yes, please send me the materials/links to study Crawl-Anywhere, or else p

Re: NRT + static rank based sorting

2013-07-15 Thread Michael McCandless
Also, it's in general not good to check for IR reopen on every search request: this could be way too often if you suddenly hit high search load, and if it's a big reopen (a large segment merge just completed) you slow down that one unlucky search too much; it's better to have a background thread th
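The pattern described above, moving the reopen off the search path and onto a periodic background task, can be sketched in plain Java as follows. This is an assumption-laden illustration rather than Lucene code: `BackgroundRefresher`, `refreshOnce` and the `reopen` supplier are hypothetical names, and a real reader would also need reference counting so that an old instance is closed only after in-flight searches finish (Lucene's `SearcherManager`, with `maybeRefresh()`, `acquire()` and `release()`, exists for that purpose).

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Illustration only: searches call get() and never pay the cost of a reopen.
class BackgroundRefresher<T> implements AutoCloseable {
    private final AtomicReference<T> current;
    private final Supplier<T> reopen; // hypothetical: returns a fresh reader/searcher
    private final ScheduledExecutorService scheduler;

    BackgroundRefresher(Supplier<T> reopen, long periodMillis) {
        this.reopen = reopen;
        this.current = new AtomicReference<>(reopen.get());
        this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "reader-refresher");
            t.setDaemon(true); // do not keep the JVM alive just for refreshing
            return t;
        });
        scheduler.scheduleWithFixedDelay(this::refreshOnce,
                periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    // Performs one refresh; on failure the previous instance stays in use.
    void refreshOnce() {
        try {
            current.set(reopen.get());
        } catch (RuntimeException e) {
            // Swallowed so that one failed reopen does not cancel future runs.
        }
    }

    // Called on every search: a cheap read of the latest published instance.
    T get() {
        return current.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

With this shape, a burst of search traffic does not translate into a burst of reopen checks, and a slow reopen after a large merge delays only the background thread.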

Re: Features added after Lucene 4

2013-07-15 Thread Ian Lea
The Changes and Migration Guide on http://lucene.apache.org/core/4_3_1/ (or 4_2_x) should help. They usually link through to JIRA pages which will have more detail. If you want info about lower level stuff such as Codecs, try googling "lucene codecs" or whatever it is you're interested in. -- I

Features added after Lucene 4

2013-07-15 Thread VIGNESH S
Hi, I am trying to upgrade our older index from Lucene 3.6 to Lucene 4.2. I need to understand the changes in the index structure. Can anyone please post some articles and links through which I can understand the indexing and search changes? -- Thanks and Regards Vignesh Srinivasan 9739135

Re: Lucene in Action

2013-07-15 Thread Ian Lea
Have you read and worked through http://lucene.apache.org/core/4_3_1/demo/overview-summary.html? To build and run applications using Lucene you need either lucene-4.3.1.tgz or lucene-4.3.1.zip. If you're on Unix you might go for the gzipped tar file; Windows users might prefer the zip file. The

RE: Lucene in Action

2013-07-15 Thread Vinh Đặng
Dear everyone,
Sorry for raising this question again. I have quoted my previous email here.
"I am still looking for a detailed tutorial which guides me step by step, from
>> downloading Lucene to configuring the IDE (I unzipped Lucene and received a
>> complex folder - I don't know what I should do next)
>>
>> And,