On Thu, Sep 18, 2014 at 11:06 AM, Vinayakumar B <vinayakum...@apache.org> wrote:
> bq. I don't know about the merits of this, but I do know that native
> filesystems
> implement this by not raising the EOF exception on the seek() but only on
> the read ... some of the non-HDFS filesystems Hadoop support work this way.

Pretty much all of them should.  POSIX specifies that seeking past the
end of a file is not an error.  Reading past the end of the file gives
an EOF, but the seek always succeeds.
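
For example, the local-filesystem behavior looks like this in plain Java (a quick demonstration, nothing HDFS-specific):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class SeekPastEof {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("seek", ".dat");
        f.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(new byte[]{1, 2, 3});   // file is 3 bytes long
        }
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek(100);                    // well past EOF: succeeds, per POSIX lseek
            System.out.println(raf.read());   // only the read reports EOF: prints -1
        }
    }
}
```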

It would be nice if HDFS had this behavior as well.  It seems like
this would have to be a 3.0 thing, since it's a potential
incompatibility.

> I agree with you, Steve: only the read will throw EOF. But when we know the
> file is still being written and may have fresh data, polling can be done
> by calling available(); afterwards we can continue reading or call seek.


InputStream#available() has a really specific function in Java:
telling you approximately how much data is currently buffered by the
stream.

As a side note, InputStream#available seems to be one of the most
misunderstood APIs in Java.  It's pretty common for people to assume
that it means "how much data is left in the stream" or something like
that.  I think I made that mistake at least once when getting started
with Java.  I guess the JavaDoc is kind of vague-- it specifies that
available returns "an estimate of the number of bytes that can be read
(or skipped over) from this input stream without blocking."  But in
practice, that means how much is buffered (for a file-backed stream,
to pull more bytes from the OS would require a syscall, which is
"blocking."  Similarly for network-backed streams.)

In any case, we certainly could create a new API to refresh
inputstream data.  I guess the idea would be to check if the last
block we knew about had reached full length-- if so, we would ask the
NameNode for any new block locations.  So it would be a DN operation
in most cases, but sometimes a NN operation.
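
To make the idea concrete, a polling loop over such an API might look like the sketch below. refreshLength(), getPos() as used here, and the loop structure are all invented for illustration -- none of this exists in DFSInputStream today:

```java
// Hypothetical sketch only -- refreshLength() is an imagined API,
// not something DFSInputStream provides.
while (readerWantsMore) {
    int b = in.read();
    if (b == -1) {
        // Old behavior: permanent EOF.  Proposed behavior: ask the DN
        // whether the last block has grown (or the NN for new block
        // locations if the last known block reached full length).
        long newLength = in.refreshLength();
        if (newLength == in.getPos()) {
            Thread.sleep(pollIntervalMs);   // nothing new yet; back off
        }
    } else {
        process(b);
    }
}
```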

Have you looked at https://issues.apache.org/jira/browse/HDFS-6633:
Support reading new data in a being written file until the file is
closed?  That patch seems to take the approach of turning a read past
the end of the file into an operation that blocks until there is new
data (when dfs.client.read.tail-follow is set).  I think I prefer
the idea of a new refresh API, just because it puts more control in
the hands of the user.
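
For what it's worth, local streams already behave the way the refresh API would: the same FileInputStream keeps seeing data appended after it hit EOF, with no reopen needed. A small plain-Java demonstration (no Hadoop involved):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class LocalTail {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("tail", ".log");
        f.deleteOnExit();
        try (FileOutputStream writer = new FileOutputStream(f);
             FileInputStream reader = new FileInputStream(f)) {
            writer.write("first\n".getBytes());
            byte[] buf = new byte[64];
            int n = reader.read(buf);
            System.out.print(new String(buf, 0, n));   // prints "first"

            // The reader is now at EOF, but the same stream sees
            // bytes appended afterwards -- no reopen needed.
            writer.write("second\n".getBytes());
            n = reader.read(buf);
            System.out.print(new String(buf, 0, n));   // prints "second"
        }
    }
}
```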

Another thing to consider is how this all interacts with the proposed
HDFS truncate operation (see HDFS-3107).

best,
Colin


>
> One simple example use case is tailing a file.
>
> Regards,
> Vinay
>
> On Thu, Sep 18, 2014 at 3:35 PM, Steve Loughran <ste...@hortonworks.com>
> wrote:
>
>> I don't know about the merits of this, but I do know that native
>> filesystems implement this by not raising the EOF exception on the seek()
>> but only on the read ... some of the non-HDFS filesystems Hadoop support
>> work this way.
>>
>> -I haven't ever looked to see what code assumes that it is the seek that
>> fails, not the read.
>> -PositionedReadable had better handle this too, even if it isn't done via a
>> seek()-read()-seek() sequence
>>
>>
>> On 18 September 2014 08:48, Vinayakumar B <vinayakum...@apache.org> wrote:
>>
>> > Hi all,
>> >
>> > Currently *DFSInputStream* doesn't allow reading a write-in-progress file
>> > beyond the bytes that had been written by the time the stream was opened.
>> >
>> > To read any further updates to the same file, a client has to open
>> > another stream to the same file.
>> >
>> > Instead, how about refreshing the length of such open files when the
>> > current position reaches the earlier EOF?
>> >
>> > Maybe this could be done in the *available()* method, so that clients who
>> > know the original writer will not close the file can continuously poll
>> > for new data using the same stream?
>> >
>> > PS: This is possible in local disk read using FileInputStream
>> >
>> > Regards,
>> > Vinay
>> >
>>
