Actually, I was hoping you could try leaving the getHTMLTitle calls in, but
increase the heap size of your Tomcat instance.

That is, to be sure there really is a leak, versus you just not giving the
JRE enough memory.

I do like your hypothesis, but looking at HTMLParser it seems like the
thread should exit after parsing the HTML.  Or, maybe there's
something about the particular HTML documents you're parsing?  I just
ran this test case:

  public void testHTMLParserLeak() throws Exception {
    for (int i = 0; i < 100000; i++) {
      InputStream is = new ByteArrayInputStream("<title>Here</title>".getBytes());
      HTMLParser parser = new HTMLParser(is);
      String title = parser.getTitle();
      assertEquals("Here", title);
      is.close();
    }
  }

And it runs fine and memory seems stable.  Can you try that test case,
but swap in some of your own HTML docs?
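
One way to watch for runaway threads while such a loop runs is to compare the JVM's live thread count before and after. Here is a minimal sketch, assuming Java 8 or later; the class name ThreadGrowthCheck and the method threadGrowth are made up for illustration and are not part of Lucene. The Runnable you pass in would parse one of your own HTML docs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper (not part of Lucene): reports how much the live
// thread count changes while a task is run repeatedly.
public class ThreadGrowthCheck {

    /** Runs the task the given number of times and returns the change in live thread count. */
    public static int threadGrowth(Runnable task, int iterations) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        int before = threads.getThreadCount();   // live threads, daemon and non-daemon
        for (int i = 0; i < iterations; i++) {
            task.run();
        }
        return threads.getThreadCount() - before;
    }
}
```

The number is only a rough signal, since the JVM starts and stops threads of its own, but if it keeps climbing as you raise the iteration count, threads are being started and never exiting.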

Also: can you run "kill -QUIT" on your app's process to get a full thread dump?
(Hmm, I think you may be on Windows; I'm not sure what the equivalent
operation is.)

Mike

Chetan Shah <chetankrs...@gmail.com> wrote:
>
> I highly appreciate your replies, Michael.
>
> No, I don't hit OOME if I comment out the call to getHTMLTitle. The heap
> behaves perfectly.
>
> I completely agree with you: the thread count goes haywire the moment I call
> HTMLParser.getTitle(). I have seen a thread count of around 600 before
> I hit OOME (with the getTitle() call on), and 90% of those threads are in
> wait state. They are not doing anything but just sitting there forever; I am
> sure they are consuming the heap and never giving it back.
>
> Does my hypothesis make sense?
>
> Michael McCandless-2 wrote:
>>
>> Odd.  I don't know of any memory leaks with the demo HTMLParser, hmm,
>> though it's doing some fairly scary stuff in its getReader() method.
>> E.g., it spawns a new thread every time you run it.  And, it's parsing
>> the entire HTML document even though you only want the title.
>>
>> You may want to switch to a better-supported HTML parser, e.g. NekoHTML.
>>
>> Plus, it would be better if you extracted the title during indexing
>> and stored it in the document, rather than doing all this work at search time.
>> You want CPU at search time to be minimized (think of all the
>> electricity...).
>>
>> But: if you increase the heap size, do you still eventually hit OOME?
>>
>> Mike
>>
>> Chetan Shah <chetankrs...@gmail.com> wrote:
>>>
>>> After some more research I discovered that the following code snippet
>>> seems to be the culprit. I have to call this to get the "title" of the
>>> indexed HTML page. It is called 10 times, as I display 10 results on
>>> a page.
>>>
>>> Any suggestions on how to achieve this without the OOME issue?
>>>
>>>
>>>                File f = new File(htmlFileName);
>>>                FileInputStream fis = new FileInputStream(f);
>>>                HTMLParser parser = new HTMLParser(fis);
>>>                String title = parser.getTitle();
>>>                /* the following was added for my sanity :) */
>>>                parser = null;
>>>                fis.close();
>>>                fis = null;
>>>                f = null;
>>>                /* till here */
>>>                return title;
>>>
>>>
>>> Chetan Shah wrote:
>>>>
>>>> I am running a simple search, and while profiling my application
>>>> using NetBeans I see steadily growing heap consumption and eventually a
>>>> server (Tomcat) crash due to an "out of memory" error. The thread count
>>>> also keeps increasing, and most of the threads are in "wait" state.
>>>>
>>>> Please let me know what I am doing wrong here so that I can avoid the
>>>> server crash. I am using Lucene 2.4.0.
>>>>
>>>>
>>>>                       IndexSearcher indexSearcher = IndexSearcherFactory.getInstance().getIndexSearcher();
>>>>
>>>>                       //Create the query and search
>>>>                       QueryParser queryParser = new QueryParser("contents", new StandardAnalyzer());
>>>>                       Query query = queryParser.parse(searchCriteria);
>>>>
>>>>
>>>>                       TermsFilter categoryFilter = null;
>>>>
>>>>                       // Create the filter if it is needed.
>>>>                       if (filter != null) {
>>>>                               Term aTerm = new Term(Constants.WATCH_LIST_TYPE_TERM);
>>>>                               categoryFilter = new TermsFilter();
>>>>                               for (int i = 0; i < filter.length; i++) {
>>>>                                       aTerm = aTerm.createTerm(filter[i]);
>>>>                                       categoryFilter.addTerm(aTerm);
>>>>                               }
>>>>                       }
>>>>
>>>>                       // Create sort criteria
>>>>                       SortField [] sortFields = new SortField[2];
>>>>                       SortField watchList = new SortField(Constants.WATCH_LIST_TYPE_TERM, SortField.STRING);
>>>>                       SortField score = SortField.FIELD_SCORE;
>>>>                       if (sortByWatchList) {
>>>>                               sortFields[0] = watchList;
>>>>                               sortFields[1] = score;
>>>>                       } else {
>>>>                               sortFields[0] = score;
>>>>                               sortFields[1] = watchList;
>>>>                       }
>>>>                       Sort sort = new Sort(sortFields);
>>>>
>>>>                       // Collect results
>>>>                       TopDocs topDocs = indexSearcher.search(query, categoryFilter, Constants.MAX_HITS, sort);
>>>>                       ScoreDoc scoreDoc[] = topDocs.scoreDocs;
>>>>                       int numDocs = scoreDoc.length;
>>>>                       if (numDocs > 0) results = scoreDoc;
>>>>
>>>>
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Memory-Leak--tp22663917p22685294.html
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/Memory-Leak--tp22663917p22686500.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
