Thanks Mark, appreciate these code pointers! (I’m cc’ing in the mailing list so others can comment)
Chris

> On 4 Jan 2016, at 8:21 PM, Mark Hung <mark...@gmail.com> wrote:
>
> I meant there is a chance for SvParser::GetNextChar() to switch encoding,
> but yes, it is less relevant.
>
> Grepping for content-type under ucb, there is some suspicious code:
>
> http://opengrok.libreoffice.org/xref/core/ucb/source/ucp/webdav-neon/ContentProperties.cxx#454
> http://opengrok.libreoffice.org/xref/core/ucb/source/ucp/webdav/ContentProperties.cxx#471
>
> which seems inconsistent with:
>
> http://opengrok.libreoffice.org/xref/core/sc/source/filter/html/htmlpars.cxx#264
>
> 2016-01-04 16:17 GMT+08:00 Chris Sherlock <chris.sherloc...@gmail.com>:
> Hi Mark,
>
> BOM detection is irrelevant here. The HTTP header states that the content
> should be UTF-8, but this is not being honoured.
>
> There is something further down the stack that isn't recording the HTTP
> headers.
>
> Chris
>
>> On 4 Jan 2016, at 4:23 PM, Mark Hung <mark...@gmail.com> wrote:
>>
>> Hi Chris,
>>
>> As I've recently been working on SvParser and HTMLParser:
>>
>> BOM detection is in SvParser::GetNextChar().
>>
>> From a quick look at eehtml, EditHTMLParser::EditHTMLParser seems relevant.
>>
>> Best regards.
>>
>> 2016-01-04 12:02 GMT+08:00 Chris Sherlock <chris.sherloc...@gmail.com>:
>> Hey guys,
>>
>> Probably nobody saw this because of the time of year (Happy New Year,
>> incidentally!!!).
>>
>> Just a quick ping to the list to see if anyone can give me some pointers.
>>
>> Chris
>>
>>> On 30 Dec 2015, at 12:15 PM, Chris Sherlock <chris.sherloc...@gmail.com> wrote:
>>>
>>> Hi guys,
>>>
>>> In bug 95217 - https://bugs.documentfoundation.org/show_bug.cgi?id=95217 -
>>> Persian text in a webpage encoded as UTF-8 is being corrupted.
>>>
>>> If I take the webpage and save it to an HTML file encoded as UTF-8, there
>>> are no problems and the Persian text comes through fine. However, when I
>>> connect to a webserver directly the text is corrupted, even though the
>>> HTTP header correctly gives the content type as UTF-8.
>>>
>>> I did a test using Charles Proxy with its SSL interception feature turned
>>> on and pointed Safari at
>>> https://bugs.documentfoundation.org/attachment.cgi?id=119818
>>>
>>> The following headers are gathered:
>>>
>>> HTTP/1.1 200 OK
>>> Server: nginx/1.2.1
>>> Date: Sat, 26 Dec 2015 01:41:30 GMT
>>> Content-Type: text/html; name="text.html"; charset=UTF-8
>>> Content-Length: 982
>>> Connection: keep-alive
>>> X-xss-protection: 1; mode=block
>>> Content-disposition: inline; filename="text.html"
>>> X-content-type-options: nosniff
>>>
>>> Some warnings are spat out that editeng's eehtml can't detect the
>>> encoding. I initially thought it was looking for a BOM, which makes no
>>> sense for a webpage, but that's wrong. Instead, for some reason the
>>> headers don't seem to be processed, and the HTML parser is falling back
>>> to ISO-8859-1 rather than UTF-8 as the character encoding.
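To make this concrete: everything above hinges on pulling the charset
parameter out of a Content-Type value like the one in the capture. Here is a
minimal self-contained sketch of that step; the helper name and parsing
details are illustrative only, not what ContentProperties.cxx or htmlpars.cxx
actually do:

// Self-contained sketch; extractCharset() is a hypothetical helper,
// not a LibreOffice function.
#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>

// Extract the value of the charset parameter from a Content-Type header
// value, e.g. "text/html; name=\"text.html\"; charset=UTF-8" -> "UTF-8".
std::string extractCharset(const std::string& rContentType)
{
    // Parameter names are case-insensitive (RFC 2045), so search a
    // lower-cased copy but slice the original to keep the value's case.
    std::string aLower(rContentType);
    std::transform(aLower.begin(), aLower.end(), aLower.begin(),
                   [](unsigned char c) { return std::tolower(c); });

    const std::string aKey("charset=");
    std::string::size_type nPos = aLower.find(aKey);
    if (nPos == std::string::npos)
        return std::string(); // no charset parameter at all

    nPos += aKey.size();
    std::string::size_type nEnd = rContentType.find(';', nPos);
    std::string aCharset = rContentType.substr(nPos,
        nEnd == std::string::npos ? std::string::npos : nEnd - nPos);

    // The value may legally be quoted: charset="UTF-8".
    if (aCharset.size() >= 2 && aCharset.front() == '"'
        && aCharset.back() == '"')
        aCharset = aCharset.substr(1, aCharset.size() - 2);

    return aCharset;
}

int main()
{
    // The exact Content-Type line from the Charles Proxy capture above:
    std::cout << extractCharset(
        "text/html; name=\"text.html\"; charset=UTF-8") << '\n'; // UTF-8
}

If the ucb call sites and htmlpars.cxx disagree on any of those details (the
extra name="..." parameter, quoting, case handling), that alone would explain
the inconsistency Mark spotted.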
>>> We seem to use Neon to make the GET request to the webserver. A few
>>> observations:
>>>
>>> 1. We detect a server OK response as an error.
>>>
>>> 2. (Probably more to the point) I believe PROPFIND is being used, but in
>>> fact, even though the function involved indicates a PROPFIND verb, a GET
>>> is issued as normal; the headers just aren't being stored. This means
>>> that when the parser looks for the headers to find the encoding, it
>>> finds nothing, resulting in a fallback to ISO-8859-1. (A sketch of
>>> pulling the headers back out of Neon follows at the bottom of this mail.)
>>>
>>> One easy thing (it doesn't solve the root issue): wouldn't it be better
>>> to fall back to UTF-8 rather than ISO-8859-1, given that the ASCII range
>>> the two encodings share is valid UTF-8 anyway?
>>>
>>> Any pointers on how to get to the bottom of this would be appreciated;
>>> I'm honestly not up on WebDAV or Neon.
>>>
>>> Chris Sherlock
>
> --
> Mark Hung
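On the second observation: I can't yet see where the ucp/webdav code stores
the response headers, but they are definitely available at the Neon layer
once a request has been dispatched. A minimal standalone sketch, assuming
only neon's public C API (0.25 or later for ne_get_response_header); the
host and path are placeholders, and this is not the ucp code itself:

// Standalone sketch; build against libneon, e.g.
//   g++ test.cxx $(pkg-config --cflags --libs neon)
#include <cstdio>

#include <ne_request.h>
#include <ne_session.h>
#include <ne_socket.h>
#include <ne_utils.h>

int main()
{
    ne_sock_init(); // initialise neon's socket layer once per process

    // Placeholder server and path, standing in for the real one.
    ne_session* pSession = ne_session_create("http", "example.com", 80);
    ne_request* pRequest = ne_request_create(pSession, "GET", "/text.html");

    if (ne_request_dispatch(pRequest) == NE_OK)
    {
        // The response headers are right here once dispatch returns; if
        // the ucp/webdav code queried Content-Type like this and passed
        // it along, the parser would know the page is UTF-8.
        const char* pContentType =
            ne_get_response_header(pRequest, "Content-Type");
        std::printf("Content-Type: %s\n",
                    pContentType ? pContentType : "(not present)");
    }

    ne_request_destroy(pRequest);
    ne_session_destroy(pSession);
    ne_sock_exit();
    return 0;
}

If the charset can be funnelled from there into whatever SvParser and
EditHTMLParser consult, the fallback question becomes moot for this page,
because the header is explicit.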
_______________________________________________
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice