bug#32528: http-post breaks with XML response payload containing boundary

2018-08-28 Thread Mark H Weaver
Ricardo Wurmus  writes:

> I’m having a problem with http-post and I think it might be a bug.  I’m
> talking to a Debbugs SOAP service over HTTP by sending (via POST) an XML
> request.  The Debbugs SOAP service responds with a string of XML.
>
> Here’s a simplified version of what I do:
>
>   (use-modules (web http))
>   (let ((req-xml ""))
>     (receive (response body)
>         (http-post uri
>                    #:body req-xml
>                    #:headers
>                    `((content-type . (text/xml))
>                      (content-length . ,(string-length req-xml))))
>       ;; Do something with the response body
>       (xml->sxml body #:trim-whitespace? #t)))
>
> This fails for some requests with an error like this:
>
> web/http.scm:1609:23: Bad Content-Type header: multipart/related; 
> type="text/xml"; start=""; boundary="=-=-="

[...]

> The reason why it fails is that Guile processes the response and treats
> the *payload* contained in the XML response as HTTP.

No, this was a good guess, but it's not actually the problem.

If you add --save-headers to the wget command line, you'll see the full
response, and the HTTP headers are what's being parsed, as it should be.
It looks like this (except that I removed the carriage returns below):

  HTTP/1.1 200 OK
  Date: Tue, 28 Aug 2018 21:40:30 GMT
  Server: Apache
  SOAPServer: SOAP::Lite/Perl/1.11
  Strict-Transport-Security: max-age=63072000
  Content-Length: 32650
  X-Content-Type-Options: nosniff
  X-Frame-Options: sameorigin
  X-XSS-Protection: 1; mode=block
  Keep-Alive: timeout=5, max=100
  Connection: Keep-Alive
  Content-Type: multipart/related; type="text/xml"; start=""; boundary="=-=-="
  
  

bug#32528: http-post breaks with XML response payload containing boundary

2018-08-28 Thread Mark H Weaver
Mark H Weaver  writes:

> Ricardo Wurmus  writes:
>
>> I’m having a problem with http-post and I think it might be a bug.  I’m
>> talking to a Debbugs SOAP service over HTTP by sending (via POST) an XML
>> request.  The Debbugs SOAP service responds with a string of XML.
[...]
> The problem is simply that our Content-Type header parser is broken.
> It's very simplistic and merely splits the string wherever ';' is found,
> and then checks to make sure there's only one '=' in each parameter,
> without taking into account that quoted strings in the parameters might
> include those characters.
>
> I'll work on a proper parser for Content-Type headers.
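
For illustration, here is a minimal sketch of quote-aware parameter
parsing in plain Scheme. It is not the code from the patches: the names
split-outside-quotes, unquote-value, parse-parameter and
parse-media-type are hypothetical, and the sketch returns strings
rather than the symbols that (web http) produces. It only shows the two
ideas involved: ignore ';' inside quoted strings, and split each
parameter at the first '=' only.

```scheme
(use-modules (srfi srfi-13))            ; string-index, string-trim-both

;; Hypothetical helpers for illustration only; they are not part of
;; module/web/http.scm.

(define (split-outside-quotes str delim)
  "Split STR at DELIM characters that are not inside double quotes."
  (let loop ((chars (string->list str))
             (in-quotes? #f)
             (current '())
             (parts '()))
    (cond
     ((null? chars)
      (reverse (cons (list->string (reverse current)) parts)))
     ;; Inside quotes, a backslash escapes the following character.
     ((and in-quotes? (char=? (car chars) #\\) (pair? (cdr chars)))
      (loop (cddr chars) #t
            (cons (cadr chars) (cons #\\ current))
            parts))
     ((char=? (car chars) #\")
      (loop (cdr chars) (not in-quotes?) (cons #\" current) parts))
     ((and (not in-quotes?) (char=? (car chars) delim))
      (loop (cdr chars) #f '()
            (cons (list->string (reverse current)) parts)))
     (else
      (loop (cdr chars) in-quotes? (cons (car chars) current) parts)))))

(define (unquote-value str)
  "Strip surrounding double quotes from STR and undo backslash escapes."
  (let ((len (string-length str)))
    (if (and (>= len 2)
             (char=? (string-ref str 0) #\")
             (char=? (string-ref str (- len 1)) #\"))
        (let loop ((chars (string->list str 1 (- len 1))) (out '()))
          (cond
           ((null? chars) (list->string (reverse out)))
           ((and (char=? (car chars) #\\) (pair? (cdr chars)))
            (loop (cddr chars) (cons (cadr chars) out)))
           (else (loop (cdr chars) (cons (car chars) out)))))
        str)))

(define (parse-parameter str)
  "Parse name=value into (NAME . VALUE), splitting at the first '=' only."
  (let* ((s (string-trim-both str))
         (pos (string-index s #\=)))
    (if pos
        (cons (substring s 0 pos)
              (unquote-value (substring s (+ pos 1))))
        (cons s #f))))

(define (parse-media-type str)
  "Return (TYPE PARAM ...) for the Content-Type value STR."
  (let ((parts (split-outside-quotes str #\;)))
    (cons (string-trim-both (car parts))
          (map parse-parameter
               (filter (lambda (p)
                         (not (string-null? (string-trim-both p))))
                       (cdr parts))))))

(parse-media-type
 "multipart/related; type=\"text/xml\"; start=\"\"; boundary=\"=-=-=\"")
;; => ("multipart/related" ("type" . "text/xml") ("start" . "")
;;     ("boundary" . "=-=-="))
```

The boundary value "=-=-=" contains several '=' characters, which is
what trips a check that expects exactly one '=' per parameter.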

I've attached preliminary patches to fix the Content-Type header parser,
and also to fix the parsing of response header lines to support
continuation lines.
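
As a rough sketch of what continuation-line support means (again, not
the patch itself): a header line that begins with a space or tab is
treated as part of the previous header. The procedure names
continuation-line? and unfold-header-lines are hypothetical, and the
sketch assumes the lines have already been read into a list with their
CR/LF terminators removed, whereas the patch works on the port inside
read-header-line.

```scheme
(use-modules (srfi srfi-13))            ; string-trim, string-trim-right

;; Hypothetical helpers for illustration only.

(define (continuation-line? line)
  "Return #t if LINE starts with a space or a tab."
  (and (not (string-null? line))
       (memv (string-ref line 0) '(#\space #\tab))
       #t))

(define (unfold-header-lines lines)
  "Join each continuation line in LINES onto the header line before it."
  (let loop ((lines lines) (acc '()))
    (cond
     ((null? lines) (reverse acc))
     ((and (pair? acc) (continuation-line? (car lines)))
      ;; Replace the fold (line break plus leading whitespace) with a
      ;; single space and append to the most recent header line.
      (loop (cdr lines)
            (cons (string-append (string-trim-right (car acc))
                                 " "
                                 (string-trim (car lines)))
                  (cdr acc))))
     (else
      (loop (cdr lines) (cons (car lines) acc))))))

(unfold-header-lines
 '("Connection: Keep-Alive"
   "Content-Type: multipart/related;"
   "  boundary=\"=-=-=\""))
;; => ("Connection: Keep-Alive"
;;     "Content-Type: multipart/related; boundary=\"=-=-=\"")
```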

With these patches applied, I'm able to fetch and decode the SOAP
response that you fetched with your 'wget' example, as follows:

--8<---cut here---start->8---
mhw@jojen ~/guile-stable-2.2 [env]$ meta/guile
GNU Guile 2.2.4.10-4c91d
Copyright (C) 1995-2017 Free Software Foundation, Inc.

Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
This program is free software, and you are welcome to redistribute it
under certain conditions; type `,show c' for details.

Enter `,help' for help.
scheme@(guile-user)> (use-modules (web http) (web uri) (web client) (sxml simple) (ice-9 receive))
scheme@(guile-user)> ,pp (let ((req-xml "http://schemas.xmlsoap.org/soap/envelope/\"
xmlns:xsi=\"http://www.w3.org/1999/XMLSchema-instance\"
xmlns:xsd=\"http://www.w3.org/1999/XMLSchema\"
xmlns:soapenc=\"http://schemas.xmlsoap.org/soap/encoding/\"
soapenc:encodingStyle=\"http://schemas.xmlsoap.org/soap/encoding/\">http://schemas.xmlsoap.org/soap/encoding/\">32514"))
           (receive (response body-port)
               (http-post "https://debbugs.gnu.org/cgi/soap.cgi"
                          #:streaming? #t
                          #:body req-xml
                          #:headers
                          `((content-type . (text/xml))
                            (content-length . ,(string-length req-xml))))
             (set-port-encoding! body-port "UTF-8")
             (xml->sxml body-port #:trim-whitespace? #t)))
$1 = (*TOP* (*PI* xml "version=\"1.0\" encoding=\"UTF-8\"")
   (http://schemas.xmlsoap.org/soap/envelope/:Envelope
 (@ (http://schemas.xmlsoap.org/soap/envelope/:encodingStyle
  "http://schemas.xmlsoap.org/soap/encoding/"))
 (http://schemas.xmlsoap.org/soap/envelope/:Body
   (urn:Debbugs/SOAP:get_bug_logResponse
 (http://schemas.xmlsoap.org/soap/encoding/:Array
   (@ (http://www.w3.org/1999/XMLSchema-instance:type
"soapenc:Array")
  (http://schemas.xmlsoap.org/soap/encoding/:arrayType
"xsd:ur-type[4]"))
   (urn:Debbugs/SOAP:item
 (urn:Debbugs/SOAP:header
   (@ (http://www.w3.org/1999/XMLSchema-instance:type
"xsd:string"))
   "Received: (at submit) by debbugs.gnu.org; 23 Aug 2018 
20:17:46 +\nFrom debbugs-submit-boun...@debbugs.gnu.org [...]
[...]
--8<---cut here---end--->8---

Note that I needed to make two other changes to your preliminary code,
namely:

* I passed "#:streaming? #t" to 'http-post', to ask for a port to read
  the response body instead of reading it eagerly.

* I explicitly set the port encoding to "UTF-8" on that port before
  using 'xml->sxml' to read it.

Otherwise, the entire 'body' response will be returned as a bytevector,
because the response Content-Type is not recognized as a textual type.
The HTTP Content-Type is "multipart/related", with a parameter:
type="text/xml".  I'm not sure if we should be automatically
interpreting that as a textual type or not.

There's no 'charset' parameter in the Content-Type header, but the XML
internally specifies: encoding="UTF-8".
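
If the non-streaming call is kept, one possible workaround is to decode
the bytevector by hand before parsing. The sketch below assumes the
payload really is UTF-8, as the XML declaration claims;
body->utf8-string is a hypothetical helper name, while bytevector?,
utf8->string and string->utf8 come from (rnrs bytevectors).

```scheme
(use-modules (rnrs bytevectors))        ; bytevector?, utf8->string

;; Hypothetical helper: accept whatever 'http-post' returned as the
;; body and hand back a string, assuming UTF-8 for binary bodies.
(define (body->utf8-string body)
  (cond
   ((string? body) body)
   ((bytevector? body) (utf8->string body))
   (else (error "unexpected response body" body))))

;; Possible use with the non-streaming call from the original report:
;;   (xml->sxml (body->utf8-string body) #:trim-whitespace? #t)

(body->utf8-string (string->utf8 "<a/>"))
;; => "<a/>"
```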

Anyway, here are the preliminary patches.

   Mark


From 41764d60dba80126b3c97f883d0225510b55f3fa Mon Sep 17 00:00:00 2001
From: Mark H Weaver 
Date: Tue, 28 Aug 2018 18:39:34 -0400
Subject: [PATCH 1/2] web: Add support for HTTP header continuation lines.

* module/web/http.scm (spaces-and-tabs, space-or-tab?): New variables.
(read-header-line): After reading a header, if a space or tab follows,
then read the continuation lines and append them all together.
---
 module/web/http.scm | 31 ++++++++++++++++++++++++-------
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/module/web/http.scm b/module/web/http.scm
index de61c9495..15f173173 100644
--- a/module/web/http.scm
+++ b/module/web/http.scm
@@ -1,6 +1,6 @@
 ;;; HTTP messages
 
-;; Copyright (C)  2010-2017