Re: Heuristic for detecting 'binary' data vs. 'text' data [was: FW: Generating a dump file using a powershell script]

Julian Foad Tue, 22 Jun 2010 08:25:59 -0700

(I'm just changing the subject line.)
- Julian
 

On Tue, 2010-06-22 at 16:58 +0200, Bert Huijben wrote:
> > -----Original Message-----
> > From: Geoff Worboys [mailto:ge...@telesiscomputing.com.au]
> > Sent: Tuesday, 22 June 2010 16:37
> > To: us...@subversion.apache.org
> > Subject: Generating a dump file using a powershell script
> > 
> 
> <snip>
> 
> > Q2:  When writing the code to try and identify text versus
> > binary files I decided to look at what subversion did ... but
> > now I am confused.  In libsvn_subr\io.c function
> > svn_io_detect_mimetype2 a comment says:
> >      going to examine the first block of data, and make sure that 85%
> >      of the bytes are such that their value is in the ranges 0x07-0x0D
> >      or 0x20-0x7F, and that 100% of those bytes is not 0x00.
> > but my reading of this code
> >       if (((binary_count * 1000) / amt_read) > 850)
> >         {
> >           *mimetype = generic_binary;
> >           return SVN_NO_ERROR;
> >         }
> > suggests that it is actually setting the type to binary only
> > if it finds that more than 85% of the bytes are binary (earlier
> > code forces a file to binary if any null byte is found).
> > 
> > Can anyone explain this?  A bug or am I missing something?
> 
> Looking at the code, this looks like a bug to me. But it's not a bug
> that I'd like to fix without further review, because the current code might
> work better than the intended behavior for users of different character
> sets.
> 
> So it might be safer to just fix the documentation.
> 
>       Bert
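
To make the discrepancy concrete, here is a minimal sketch in C that puts
the two readings side by side. It is not the Subversion source: the helper
names (is_text_byte, exceeds_binary_threshold, is_binary_per_comment,
is_binary_per_code) are hypothetical, and treating an empty buffer as text
is an assumption of the sketch. Only the byte ranges, the NUL rule, and the
"> 850" test come from the quoted comment and code.

```c
#include <stdbool.h>
#include <stddef.h>

/* A byte counts as "text" if it lies in 0x07-0x0D or 0x20-0x7F,
   the ranges named in the quoted comment. */
static bool is_text_byte(unsigned char c)
{
  return (c >= 0x07 && c <= 0x0D) || (c >= 0x20 && c <= 0x7F);
}

/* Hypothetical helper: returns true if the block contains a NUL byte,
   or if the share of non-text bytes, in parts per thousand, exceeds
   PER_MILLE.  An empty block is treated as text (an assumption). */
static bool exceeds_binary_threshold(const unsigned char *buf, size_t len,
                                     size_t per_mille)
{
  size_t binary_count = 0;
  size_t i;

  if (len == 0)
    return false;

  for (i = 0; i < len; i++)
    {
      if (buf[i] == 0)
        return true;              /* any NUL byte forces "binary" */
      if (!is_text_byte(buf[i]))
        binary_count++;
    }

  return (binary_count * 1000) / len > per_mille;
}

/* Reading 1, the comment: text needs at least 85% text bytes, so the
   block is binary once more than 15% of its bytes are non-text. */
static bool is_binary_per_comment(const unsigned char *buf, size_t len)
{
  return exceeds_binary_threshold(buf, len, 150);
}

/* Reading 2, the quoted test: the block is binary only when more than
   85% of its bytes are non-text. */
static bool is_binary_per_code(const unsigned char *buf, size_t len)
{
  return exceeds_binary_threshold(buf, len, 850);
}
```

A block with 8 ASCII letters followed by the two bytes 0xC3 0xA9 (20%
non-text) is binary under the first reading and text under the second,
which is the kind of input where a different character set changes the
outcome.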
