Re: [Bug debuginfod/27983] ignore duplicate urls

2021-07-20 Thread Mark Wielaard
Hi Noah,

On Mon, Jul 19, 2021 at 09:31:17AM -0400, Noah Sanci wrote:
> On Wed, Jul 14, 2021 at 12:36 PM Mark Wielaard  wrote:
> > You deduplicate the full URLs after they are fully constructed.  Would
> > it make sense to do the deduplication on server_url, maybe even as
> > part of the Count number of URLs code? That might make the code
> > simpler. And you can change num_urls upfront.
> Deduplication before fully building the URL would work well, however I
> was concerned about the slashbuildid situation. I would need to alter
> all urls to either have a trailing '/' or no trailing '/' to ensure
> comparison between 'http://127.0.0.1:8000/' and 'http://127.0.0.1:8000'
> is considered equal. This is possible, but I ultimately decided to
> wait until full construction as those issues would have been handled.
> I would be glad to make the change if you want.

I see what you mean, we have:

  /* Build handle url. Tolerate both  http://foo:999  and
 http://foo:999/  forms */
  char *slashbuildid;
  if (strlen(server_url) > 1 && server_url[strlen(server_url)-1] == '/')
slashbuildid = "buildid";
  else
slashbuildid = "/buildid";

I think the code could become simpler if we did the deduplication of
the server_url list up-front. That also gives you the oppertunity to
make sure all server_urls end with a slash. But if you think that is
too much work for this patch we can do it as a followup patch. You
already included a test, which makes checking such a refactoring
easier (sadly the run-debuginfod-find.sh test is somewhat fragile).

> From be4e07a8732dd688595b99f92ba275ef5060a34d Mon Sep 17 00:00:00 2001
> From: Noah Sanci 
> Date: Fri, 9 Jul 2021 14:53:10 -0400
> Subject: [PATCH] debuginfod: PR27983 - ignore duplicate urls
> 
> Gazing at server logs, one sees a minority of clients who appear to have
> duplicate query traffic coming in: the same URL, milliseconds apart.
> Chances are the user accidentally doubled her $DEBUGINFOD_URLS somehow,
> and the client library is dutifully asking the servers TWICE. Bug #27863
> reduces the pain on the servers' CPU, but dupe network traffic is still
> being paid.  We should reject sending outright duplicate concurrent
> traffic.
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=27983


> diff --git a/debuginfod/debuginfod-client.c b/debuginfod/debuginfod-client.c
> index f417b40a..a9447cae 100644
> --- a/debuginfod/debuginfod-client.c
> +++ b/debuginfod/debuginfod-client.c
> @@ -795,6 +795,7 @@ debuginfod_query_server (debuginfod_client *c,
>
>char *strtok_saveptr;
>char *server_url = strtok_r(server_urls, url_delim, &strtok_saveptr);
> +  int unduplicated_urls = 0;
> 
>/* Initialize each handle.  */
>for (int i = 0; i < num_urls && server_url != NULL; i++)

Maybe this should be:

/* Initialize each handle.  */
   int i = 0;
   while (server_url != NULL)

With an i++ at the end after the data[i] handle has been completely assigned.
See below (*) for why.

> +  char *tmp_url = calloc(PATH_MAX, sizeof(char));

This can fail. So you'll have to handle tmp_url == NULL. Otherwise
snprintf will crash.

>if (filename) /* must start with / */
> -snprintf(data[i].url, PATH_MAX, "%s%s/%s/%s%s", server_url,
> +snprintf(tmp_url, PATH_MAX, "%s%s/%s/%s%s", server_url,
>   slashbuildid, build_id_bytes, type, filename);
>else
> -snprintf(data[i].url, PATH_MAX, "%s%s/%s/%s", server_url,
> +snprintf(tmp_url, PATH_MAX, "%s%s/%s/%s", server_url,
>   slashbuildid, build_id_bytes, type);
>  
> +  /* PR 27983: If the url is already set to be used use, skip it */
> +  int url_index = -1;
> +  for (url_index = 0; url_index < i; ++url_index)

No need to initialize url_index twice. Just declar int url_index; it
will be assigned 0 at the next line.

> +{
> +  if(strcmp(tmp_url, data[url_index].url) == 0)
> +{
> +  url_index = -1;
> +  break;
> +}
> +}
> +  if (url_index == -1)
> +{
> +  if (vfd >= 0)
> +dprintf(vfd, "duplicate url: %s, skipping\n", tmp_url);
> +  free(tmp_url);
> +  // Ensure that next iteration doesn't skip over an index mid-array
> +  i--;
> +  server_url = strtok_r(NULL, url_delim, &strtok_saveptr);
> +  continue;

(*) this i--; confused me a little. Which is why I suggest turning
that for loop into a while loop.

> +}
> +  else
> +{
> +  unduplicated_urls++;
> +  strncpy(data[i].url, tmp_url, PATH_MAX);

Both strings are max PATH_MAX chars, and tmp_url is zero terminated,
so you can use strcpy.

> +  rc = -ENETUNREACH;
> +   goto out1;
> +}
> +  data[i].client = c;
>curl_easy_setopt(data[i].handle, CURLOPT_URL, data[i].url);
>if (vfd >= 0)
>   curl_easy_setopt(data[i].handle, CURLOPT_ERRORBUFFER, data[i].e

Re: [Bug debuginfod/28034] %-escape url characters

2021-07-20 Thread Mark Wielaard
Hi Noah,

On Mon, Jul 19, 2021 at 10:53:17AM -0400, Noah Sanci via Elfutils-devel wrote:
> When requesting some source files, some URL-inconvenient chars
> sometimes pop up.  Example from f33 libstdc++:
> /buildid/44d8485cb75512c2ca5c8f70afbd475cae30af4f/source/usr/src/debug/
> gcc-10.3.1-1.fc33.x86_64/obj-x86_64-redhat-linux/x86_64-redhat-linux/
> libstdc++-v3/src/c++11/../../../../../libstdc++-v3/src/c++11/
> condition_variable.cc
> As this URL is passed into debuginfod's handler_cb, it appears that the
> + signs are helpfully unescaped to spaces by libmicrohttpd, which
> 'course breaks everything.  We need to suppress this HTTP URL
> processing step.  Also worth checking that %HEX decoding is also
> suppressed.

I find this description slightly confusing. It ends with "Also worth
checking..." but that is actually what is done in this patch. The part
before that about what debuginfod/microhttpd does on the server side
is interesting, but not really what this patch is about (just why this
patch is necessary, but it seems necessary on the client side always
whatever server is used).

Could you rewrite the commit message to describe what is done in this
patch?

> https://sourceware.org/bugzilla/show_bug.cgi?id=28034

> +2021-07-16  Noah Sanci  
> +
> + PR28034
> + * debuginfod-client.c (debuginfod_query_server): % escape filename
> + so the completed url is processed properly.

> diff --git a/debuginfod/debuginfod-client.c b/debuginfod/debuginfod-client.c
> index f12f609c..e30f73eb 100644
> --- a/debuginfod/debuginfod-client.c
> +++ b/debuginfod/debuginfod-client.c
> @@ -831,9 +831,15 @@ debuginfod_query_server (debuginfod_client *c,
>else
>  slashbuildid = "/buildid";
>  
> -  if (filename) /* must start with / */
> -snprintf(data[i].url, PATH_MAX, "%s%s/%s/%s%s", server_url,
> - slashbuildid, build_id_bytes, type, filename);
> +  if (filename)/* must start with / */
> +{
> +  /* PR28034 escape characters in completed url to %hh format. */
> +  char *escaped_string;
> +  escaped_string = curl_easy_escape(data[i].handle, filename, 0);

curl_easy_escape () can return NULL on failure.

> +  snprintf(data[i].url, PATH_MAX, "%s%s/%s/%s%s", server_url,
> +   slashbuildid, build_id_bytes, type, escaped_string);
> +  curl_free(escaped_string);
> +}

I note that filename is actually the full path component of the URL so
includes slashes ('/'). curl_easy_escape seems to convert these to %2F
(if I am correct). Is this intended?

>else
>  snprintf(data[i].url, PATH_MAX, "%s%s/%s/%s", server_url,
>   slashbuildid, build_id_bytes, type);
> diff --git a/doc/debuginfod.8 b/doc/debuginfod.8
> index 1adf703a..90cdcf94 100644
> --- a/doc/debuginfod.8
> +++ b/doc/debuginfod.8
> @@ -299,6 +299,7 @@ l l.
>  \../bar/foo.c AT_comp_dir=/zoo/  
> /buildid/BUILDID/source/zoo//../bar/foo.c
>  .TE
>  
> +Note: the client should %-escape characters in /SOURCE/FILE that are not 
> shown as "unreserved" in section 2.3 of RFC3986.

This is a very long line. Could you break it up?
Also, maybe just give the information instead of only a reference.
(The "unreserved" characters are "a"-"z"", "A"-"Z", "0"-"9", "-", ".", "_" and 
"~")

Also same question as above. slash ('/') is not an unreserved
character, should it be encoded?

>  .SS /metrics
>  
>  This endpoint returns a Prometheus formatted text/plain dump of a
> diff --git a/tests/ChangeLog b/tests/ChangeLog
> index 1460b422..cfa0dee4 100644
> --- a/tests/ChangeLog
> +++ b/tests/ChangeLog
> @@ -1,3 +1,9 @@
> +2021-07-16  Noah Sanci  
> +
> + PR28034
> + * run-debuginfod-find.sh: Added a test case ensuring files with %
> + escapable characters in their paths are accessible.

There are also a couple of changes (fixes?) to the testcases.
Could those be split out?

>  2021-07-06  Alice Zhang  
>  
>  PR27531
> diff --git a/tests/run-debuginfod-find.sh b/tests/run-debuginfod-find.sh
> index 73bbe65b..11c1a1a1 100755
> --- a/tests/run-debuginfod-find.sh
> +++ b/tests/run-debuginfod-find.sh
> @@ -44,6 +44,7 @@ cleanup()
>if [ $PID2 -ne 0 ]; then kill $PID2; wait $PID2; fi
>if [ $PID3 -ne 0 ]; then kill $PID3; wait $PID3; fi
>if [ $PID4 -ne 0 ]; then kill $PID4; wait $PID4; fi
> +  sleep 5; # Give time to ensure debuginfod cleans and closes sqlite 
> databases.
>rm -rf F R D L Z ${PWD}/foobar ${PWD}/mocktree ${PWD}/.client_cache* 
> ${PWD}/tmp*
>exit_cleanup
>  }

This seems a testsuite fixup?

> @@ -54,7 +55,7 @@ trap cleanup 0 1 2 3 5 9 15
>  errfiles_list=
>  err() {
>  echo ERROR REPORTS
> -for ports in $PORT1 $PORT2
> +for ports in $PORT1 $PORT2 $PORT3
>  do
>  echo ERROR REPORT $port metrics
>  curl -s http://127.0.0.1:$port/metrics

As is this?

> @@ -149,18 +150,18 @@ ps -q $PID1 -e -L -o '%p %c %a' | grep traverse
>  # Compile a simple program, strip 

Re: [Bug debuginfod/28034] %-escape url characters

2021-07-20 Thread Frank Ch. Eigler via Elfutils-devel
Hi -

> Could you rewrite the commit message to describe what is done in this
> patch?

(Yeah, Noah's commit text on his branch was corrected.)

> [...]
> I note that filename is actually the full path component of the URL so
> includes slashes ('/'). curl_easy_escape seems to convert these to %2F
> (if I am correct). Is this intended?

It's harmless.

> > +Note: the client should %-escape characters in /SOURCE/FILE that are not 
> > shown as "unreserved" in section 2.3 of RFC3986.
> 
> This is a very long line. Could you break it up?
> Also, maybe just give the information instead of only a reference.
> (The "unreserved" characters are "a"-"z"", "A"-"Z", "0"-"9", "-", ".", "_" 
> and "~")
> Also same question as above. slash ('/') is not an unreserved
> character, should it be encoded?

As we know from the status quo working for a year+, it doesn't matter
for "/".  But RFC3986 does not give a character class that corresponds
exactly to what MUST be encoded, so for documentation purposes this
simple SHOULD guidance seems fine.


- FChE



Re: [Bug debuginfod/28034] %-escape url characters

2021-07-20 Thread Mark Wielaard
Hi Noah,

On Tue, Jul 20, 2021 at 07:50:11PM +0200, Mark Wielaard wrote:
> > +   * run-debuginfod-find.sh: Added a test case ensuring files with %
> > +   escapable characters in their paths are accessible.
> 
> There are also a couple of changes (fixes?) to the testcases.
> Could those be split out?

I think you almost had the right fix for a race in killing the last
debuginfod server. Does the attached work for you?

Thanks,

Mark
>From 83b7eb24a5796a4aecc5d32eb0c3f459788c4690 Mon Sep 17 00:00:00 2001
From: Mark Wielaard 
Date: Tue, 20 Jul 2021 20:50:48 +0200
Subject: [PATCH] tests: wait for PID4 before setting to zero

A debuginfod server might take a while to shut down to clean and close
the sqlite databases. Wait for the process after killing it and
clearing the PID variable so it won't be killed again.

Reported-by: Noah Sanci 
Signed-off-by: Mark Wielaard 
---
 tests/ChangeLog  | 4 
 tests/run-debuginfod-find.sh | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/tests/ChangeLog b/tests/ChangeLog
index 1196d6b2..b0303e00 100644
--- a/tests/ChangeLog
+++ b/tests/ChangeLog
@@ -1,3 +1,7 @@
+2021-07-20  Mark Wielaard  
+
+	* tests/run-debuginfod-find.sh: wait for PID4 before setting to zero.
+
 2021-06-28  Noah Sanci 
 
 	PR25978
diff --git a/tests/run-debuginfod-find.sh b/tests/run-debuginfod-find.sh
index 1d664be9..23eac329 100755
--- a/tests/run-debuginfod-find.sh
+++ b/tests/run-debuginfod-find.sh
@@ -754,6 +754,8 @@ wait_ready $PORT3 'groom{statistic="files scanned (#)"}' 0
 wait_ready $PORT3 'groom{statistic="files scanned (mb)"}' 0
 
 kill $PID4
+wait $PID4
+PID4=0
 
 
 # set up tests for retrying failed queries.
-- 
2.32.0