Actually, I was hoping you could try leaving the getHTML calls in but increasing the heap size of your Tomcat instance, e.g.:
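Something along these lines in Tomcat's startup environment should do it (bin/setenv.sh and the 1 GB value are just examples; pick whatever fits your machine):

    # e.g. in $CATALINA_HOME/bin/setenv.sh (create it if it doesn't exist);
    # raise the JVM's max heap from the default to, say, 1 GB
    CATALINA_OPTS="-Xmx1024m"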
The idea is to be sure there really is a leak, versus the JRE simply not being given enough memory.

I do like your hypothesis, but looking at HTMLParser it seems like the thread should exit after parsing the HTML. Or maybe there's something about the particular HTML documents you're parsing? I just tested this test case:

    public void testHTMLParserLeak() throws Exception {
      for (int i = 0; i < 100000; i++) {
        InputStream is = new ByteArrayInputStream("<title>Here</title>".getBytes());
        HTMLParser parser = new HTMLParser(is);
        String title = parser.getTitle();
        assertEquals("Here", title);
        is.close();
      }
    }

It runs fine, and memory seems stable. Can you try that test case, but swap in some of your own HTML docs?

Also: can you run "kill -QUIT" on your app to get a full thread dump? (Hmm, I think you may be on Windows; I'm not sure what the equivalent operation is.)
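One option that should work on both platforms, assuming you have a full JDK (not just a JRE) on the box, is jstack:

    jps -l          # list running JVMs and their pids; find Tomcat's
    jstack <pid>    # print a full thread dump for that JVM to stdout

That dump should show us exactly where all those waiting threads are parked.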
Mike

Chetan Shah <chetankrs...@gmail.com> wrote:
>
> I highly appreciate your replies, Michael.
>
> No, I don't hit OOME if I comment out the call to getHTMLTitle. The heap
> behaves perfectly.
>
> I completely agree with you: the thread count goes haywire the moment I
> call HTMLParser.getTitle(). I have seen a thread count of around 600
> before I hit OOME (with the getTitle() call on), and 90% of those threads
> are in the wait state. They are not doing anything, just sitting there
> forever; I am sure they are consuming heap and never giving it back.
>
> Does my hypothesis make sense?
>
> Michael McCandless wrote:
>>
>> Odd. I don't know of any memory leaks w/ the demo HTMLParser, hmm,
>> though it's doing some fairly scary stuff in its getReader() method.
>> E.g., it spawns a new thread every time you run it. And it's parsing
>> the entire HTML document even though you only want the title.
>>
>> You may want to switch to a better-supported HTML parser, e.g. NekoHTML.
>>
>> Plus, it would be better if you extracted the title during indexing
>> and stored it in the document, rather than doing all this work at
>> search time. You want CPU at search time to be minimized (think of all
>> the electricity...).
>>
>> But: if you increase the heap, do you still eventually hit OOME?
>>
>> Mike
>>
>> Chetan Shah <chetankrs...@gmail.com> wrote:
>>>
>>> After some more research I discovered that the following code snippet
>>> seems to be the culprit. I have to call this to get the "title" of the
>>> indexed HTML page, and it is called 10 times, as I display 10 results
>>> on a page.
>>>
>>> Any suggestions on how to achieve this without the OOME issue?
>>>
>>>     File f = new File(htmlFileName);
>>>     FileInputStream fis = new FileInputStream(f);
>>>     HTMLParser parser = new HTMLParser(fis);
>>>     String title = parser.getTitle();
>>>     /* the following was added for my sanity :) */
>>>     parser = null;
>>>     fis.close();
>>>     fis = null;
>>>     f = null;
>>>     /* till here */
>>>     return title;
>>>
>>> Chetan Shah wrote:
>>>>
>>>> I am initiating a simple search, and after profiling my application
>>>> using NetBeans I see constant heap consumption and eventually a
>>>> server (Tomcat) crash due to an "out of memory" error. The thread
>>>> count also keeps increasing, with most of the threads in the "wait"
>>>> state.
>>>>
>>>> Please let me know what I am doing wrong here so that I can avoid
>>>> the server crash. I am using Lucene 2.4.0.
>>>>
>>>>     IndexSearcher indexSearcher =
>>>>         IndexSearcherFactory.getInstance().getIndexSearcher();
>>>>
>>>>     // Create the query and search
>>>>     QueryParser queryParser =
>>>>         new QueryParser("contents", new StandardAnalyzer());
>>>>     Query query = queryParser.parse(searchCriteria);
>>>>
>>>>     TermsFilter categoryFilter = null;
>>>>
>>>>     // Create the filter if it is needed.
>>>>     if (filter != null) {
>>>>         Term aTerm = new Term(Constants.WATCH_LIST_TYPE_TERM);
>>>>         categoryFilter = new TermsFilter();
>>>>         for (int i = 0; i < filter.length; i++) {
>>>>             aTerm = aTerm.createTerm(filter[i]);
>>>>             categoryFilter.addTerm(aTerm);
>>>>         }
>>>>     }
>>>>
>>>>     // Create sort criteria
>>>>     SortField[] sortFields = new SortField[2];
>>>>     SortField watchList =
>>>>         new SortField(Constants.WATCH_LIST_TYPE_TERM, SortField.STRING);
>>>>     SortField score = SortField.FIELD_SCORE;
>>>>     if (sortByWatchList) {
>>>>         sortFields[0] = watchList;
>>>>         sortFields[1] = score;
>>>>     } else {
>>>>         sortFields[0] = score;
>>>>         sortFields[1] = watchList;
>>>>     }
>>>>     Sort sort = new Sort(sortFields);
>>>>
>>>>     // Collect results
>>>>     TopDocs topDocs = indexSearcher.search(query, categoryFilter,
>>>>         Constants.MAX_HITS, sort);
>>>>     ScoreDoc scoreDoc[] = topDocs.scoreDocs;
>>>>     int numDocs = scoreDoc.length;
>>>>     if (numDocs > 0) results = scoreDoc;
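PS: if you do move the title extraction to indexing time, as I suggested in my earlier reply (quoted above), the change would look roughly like this. This is only a sketch: the "title" field name and the writer/htmlFile/indexSearcher variables are assumptions on my part, not your actual code:

    // at indexing time: parse the HTML once and store the title on the document
    Document doc = new Document();
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("contents", new FileReader(htmlFile)));  // indexed, not stored
    writer.addDocument(doc);

    // at search time: no HTMLParser call at all, just read the stored field back
    String title = indexSearcher.doc(scoreDoc[i].doc).get("title");

That keeps all the HTML parsing in the indexing path, so search-time memory and thread usage are unaffected by it.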