Re: parsing an xml document chunk by chunk

Jeff Greif Wed, 08 Apr 2009 10:43:23 -0700

I think you have two problems here, a network handling problem and an
XML parsing problem.  The options you describe are by no means the
only ones.  In particular, you don't need a thread per request;
however, you do need an XML parser exclusively handling each single
document if you are using Xerces.

In the limit where your machine is heavily loaded processing chunks
from many requests, it becomes less important whether you process the
chunks when they're all accumulated or in more piecemeal fashion.  So
if you're expecting heavy loads, you could do the easier thing.

To use fewer threads, you can receive the document chunks using one of
the networking patterns like a Reactor, which handles many socket
connections at once, backed by, for example, a message queue and
associated parser for the chunks of each document.  The associated
parser can be reading from a stream wrapped around the message queue.
The chunks accumulate in the message queue until one of a few worker
threads gets around to processing those that have accumulated since
the last time the queue was accessed.  The worker threads handle all
the parsing for however many documents are active at any one time, in
round-robin or some priority-based fashion.
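
As a rough illustration of the "stream wrapped around the message queue"
part only, here is a minimal Java sketch. The class name
ChunkQueueInputStream and its methods offerChunk/endOfDocument are my own
inventions for this example, not part of Xerces or the JDK. Note that
read() still blocks the calling thread while the queue is empty, so by
itself this does not give you the small worker pool described above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical helper: an InputStream view over a queue of received chunks. */
class ChunkQueueInputStream extends InputStream {
    // Sentinel object marking the end of the document (compared by identity).
    private static final byte[] END = new byte[0];

    private final BlockingQueue<byte[]> chunks = new LinkedBlockingQueue<>();
    private byte[] current = new byte[0];
    private int pos = 0;
    private boolean finished = false;

    /** Called by the network side for each chunk that arrives. */
    void offerChunk(byte[] chunk) {
        if (chunk.length > 0) {
            chunks.add(chunk.clone()); // copy, so the caller may reuse its buffer
        }
    }

    /** Called by the network side once the last chunk has been delivered. */
    void endOfDocument() {
        chunks.add(END);
    }

    /** Ensures 'current' has unread bytes; returns false at end of document. */
    private boolean fill() throws IOException {
        while (pos >= current.length) {
            if (finished) {
                return false;
            }
            try {
                byte[] next = chunks.take(); // blocks while no chunk is queued
                if (next == END) {
                    finished = true;
                    return false;
                }
                current = next;
                pos = 0;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while waiting for a chunk", e);
            }
        }
        return true;
    }

    @Override
    public int read() throws IOException {
        return fill() ? (current[pos++] & 0xFF) : -1;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (!fill()) {
            return -1;
        }
        int n = Math.min(len, current.length - pos);
        System.arraycopy(current, pos, buf, off, n);
        pos += n;
        return n;
    }
}
```

The parser for a given document would be handed one such stream, and the
network code would call offerChunk as data arrives for that document.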

I believe that the Java nio classes are designed to provide the
Reactor, or similar patterns.  You can Google for Reactor Pattern or
look in the "Pattern-Oriented Software Architecture" volume 2.  One of
the authors of that book is Douglas Schmidt, who has a large online
collection of papers on subjects in this field.
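
For what it's worth, a Reactor-style loop on top of java.nio might look
roughly like the sketch below. MiniReactor, register and
runUntilAllClosed are hypothetical names made up for this example; the
callback receives the document id attached at registration plus the
bytes just read, or null when that channel reaches end of stream. It is
a sketch under those assumptions, not a production event loop (no write
handling, no error recovery).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.Iterator;
import java.util.function.BiConsumer;

/** Hypothetical single-threaded reactor: one Selector serving many channels. */
class MiniReactor {
    private final Selector selector;
    // Callback arguments: (document id, chunk); a null chunk means end of stream.
    private final BiConsumer<Object, byte[]> onChunk;
    private int open = 0;

    MiniReactor(BiConsumer<Object, byte[]> onChunk) throws IOException {
        this.selector = Selector.open();
        this.onChunk = onChunk;
    }

    /** Registers a readable channel (e.g. a SocketChannel) for one document. */
    <C extends SelectableChannel & ReadableByteChannel>
    void register(C channel, Object docId) throws IOException {
        channel.configureBlocking(false);
        channel.register(selector, SelectionKey.OP_READ, docId);
        open++;
    }

    /** Dispatches chunks until every registered channel has reached end of stream. */
    void runUntilAllClosed() throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(4096);
        while (open > 0) {
            selector.select(); // waits until at least one channel is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (!key.isValid() || !key.isReadable()) {
                    continue;
                }
                ReadableByteChannel ch = (ReadableByteChannel) key.channel();
                buf.clear();
                int n = ch.read(buf);
                if (n < 0) {
                    // End of stream: stop watching this channel and notify.
                    key.cancel();
                    ch.close();
                    open--;
                    onChunk.accept(key.attachment(), null);
                } else if (n > 0) {
                    byte[] chunk = new byte[n];
                    buf.flip();
                    buf.get(chunk);
                    onChunk.accept(key.attachment(), chunk);
                }
            }
        }
        selector.close();
    }
}
```

The callback is where each chunk would be appended to the message queue
for its document, leaving the parsing itself to other threads.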

It's possible to put a message queue on the far side of the parser to
hold the generated SAX events.  This might make it possible for the
application processing the received documents to be less complicated.
But the freedom to provide your own event-handlers to the parser might
remove the need for such queues.  For example, if the XML docs were
simple ones, suitable for incremental action before they were
completely received, such as a linear sequence of instructions of some
kind, the endElement event could have an application-specific handler
that processed one instruction if the name of the element demarcated
the completion of an instruction.
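
To make the endElement idea concrete, here is a small sketch using the
standard JAXP/SAX classes. The element name "instruction", the class
InstructionHandler and the helper run are assumptions for illustration
only; a real handler would act on each instruction instead of merely
recording it in a list.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: acts on each <instruction> as soon as it is closed. */
class InstructionHandler extends DefaultHandler {
    // Stand-in for "processing": the instructions seen so far, in order.
    final List<String> executed = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if ("instruction".equals(qName)) {
            text.setLength(0); // start collecting text for a new instruction
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("instruction".equals(qName)) {
            // The instruction is complete here, even if the rest of the
            // document has not been received yet.
            executed.add(text.toString().trim());
        }
    }

    /** Convenience for trying the handler on a complete in-memory document. */
    static List<String> run(String xml) throws Exception {
        InstructionHandler handler = new InstructionHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                handler);
        return handler.executed;
    }
}
```

With a streaming input in place of the ByteArrayInputStream, the same
handler would fire as each instruction element is closed, without
waiting for the end of the document.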

Jeff

On Wed, Apr 8, 2009 at 1:01 AM, Marco Testa <marco.te...@funambol.com> wrote:
> Hi,
> I have to parse an XML document that is actually received in many chunks,
> and unfortunately I have to parse it chunk by chunk rather than at the end,
> when I've received all the pieces.
> I was thinking of a SAX parser, since I have to push data to the parser as I
> receive it.
> I was also thinking of an OutputStream to write the chunks to as I
> receive them, piped to an InputStream that is passed to
> the parser.
> But I think there is no way to let the parser read from the InputStream in
> the same thread.
> So I have to create a thread for every document being received, but since the
> program may receive many different documents at the same time and
> the chunks may arrive with long delays, I would have to create many threads
> that will be mostly idle while waiting for the chunks.
> Is there a way to bypass the piped input and output streams and directly
> call the parser on a single chunk when it is received?
> Is there a non-blocking parser that does not wait if the input stream is
> not ready?
> In other words, is there a way to call a parse in the same thread for just
> one piece of an XML document, and call it many times until the document is
> completely received?
> thank you very much,
> marco
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: j-users-unsubscr...@xerces.apache.org
> For additional commands, e-mail: j-users-h...@xerces.apache.org
>
>
>

