Re: [Bug debuginfod/27983] ignore duplicate urls
Hi Noah, On Mon, Jul 19, 2021 at 09:31:17AM -0400, Noah Sanci wrote: > On Wed, Jul 14, 2021 at 12:36 PM Mark Wielaard wrote: > > You deduplicate the full URLs after they are fully constructed. Would > > it make sense to do the deduplication on server_url, maybe even as > > part of the Count number of URLs code? That might make the code > > simpler. And you can change num_urls upfront. > Deduplication before fully building the URL would work well, however I > was concerned about the slashbuildid situation. I would need to alter > all urls to either have a trailing '/' or no trailing '/' to ensure > comparison between 'http://127.0.0.1:8000/' and 'http://127.0.0.1:8000' > is considered equal. This is possible, but I ultimately decided to > wait until full construction as those issues would have been handled. > I would be glad to make the change if you want. I see what you mean, we have: /* Build handle url. Tolerate both http://foo:999 and http://foo:999/ forms */ char *slashbuildid; if (strlen(server_url) > 1 && server_url[strlen(server_url)-1] == '/') slashbuildid = "buildid"; else slashbuildid = "/buildid"; I think the code could become simpler if we did the deduplication of the server_url list up-front. That also gives you the oppertunity to make sure all server_urls end with a slash. But if you think that is too much work for this patch we can do it as a followup patch. You already included a test, which makes checking such a refactoring easier (sadly the run-debuginfod-find.sh test is somewhat fragile). > From be4e07a8732dd688595b99f92ba275ef5060a34d Mon Sep 17 00:00:00 2001 > From: Noah Sanci > Date: Fri, 9 Jul 2021 14:53:10 -0400 > Subject: [PATCH] debuginfod: PR27983 - ignore duplicate urls > > Gazing at server logs, one sees a minority of clients who appear to have > duplicate query traffic coming in: the same URL, milliseconds apart. > Chances are the user accidentally doubled her $DEBUGINFOD_URLS somehow, > and the client library is dutifully asking the servers TWICE. Bug #27863 > reduces the pain on the servers' CPU, but dupe network traffic is still > being paid. We should reject sending outright duplicate concurrent > traffic. > > https://sourceware.org/bugzilla/show_bug.cgi?id=27983 > diff --git a/debuginfod/debuginfod-client.c b/debuginfod/debuginfod-client.c > index f417b40a..a9447cae 100644 > --- a/debuginfod/debuginfod-client.c > +++ b/debuginfod/debuginfod-client.c > @@ -795,6 +795,7 @@ debuginfod_query_server (debuginfod_client *c, > >char *strtok_saveptr; >char *server_url = strtok_r(server_urls, url_delim, &strtok_saveptr); > + int unduplicated_urls = 0; > >/* Initialize each handle. */ >for (int i = 0; i < num_urls && server_url != NULL; i++) Maybe this should be: /* Initialize each handle. */ int i = 0; while (server_url != NULL) With an i++ at the end after the data[i] handle has been completely assigned. See below (*) for why. > + char *tmp_url = calloc(PATH_MAX, sizeof(char)); This can fail. So you'll have to handle tmp_url == NULL. Otherwise snprintf will crash. >if (filename) /* must start with / */ > -snprintf(data[i].url, PATH_MAX, "%s%s/%s/%s%s", server_url, > +snprintf(tmp_url, PATH_MAX, "%s%s/%s/%s%s", server_url, > slashbuildid, build_id_bytes, type, filename); >else > -snprintf(data[i].url, PATH_MAX, "%s%s/%s/%s", server_url, > +snprintf(tmp_url, PATH_MAX, "%s%s/%s/%s", server_url, > slashbuildid, build_id_bytes, type); > > + /* PR 27983: If the url is already set to be used use, skip it */ > + int url_index = -1; > + for (url_index = 0; url_index < i; ++url_index) No need to initialize url_index twice. Just declar int url_index; it will be assigned 0 at the next line. > +{ > + if(strcmp(tmp_url, data[url_index].url) == 0) > +{ > + url_index = -1; > + break; > +} > +} > + if (url_index == -1) > +{ > + if (vfd >= 0) > +dprintf(vfd, "duplicate url: %s, skipping\n", tmp_url); > + free(tmp_url); > + // Ensure that next iteration doesn't skip over an index mid-array > + i--; > + server_url = strtok_r(NULL, url_delim, &strtok_saveptr); > + continue; (*) this i--; confused me a little. Which is why I suggest turning that for loop into a while loop. > +} > + else > +{ > + unduplicated_urls++; > + strncpy(data[i].url, tmp_url, PATH_MAX); Both strings are max PATH_MAX chars, and tmp_url is zero terminated, so you can use strcpy. > + rc = -ENETUNREACH; > + goto out1; > +} > + data[i].client = c; >curl_easy_setopt(data[i].handle, CURLOPT_URL, data[i].url); >if (vfd >= 0) > curl_easy_setopt(data[i].handle, CURLOPT_ERRORBUFFER, data[i].e
Re: [Bug debuginfod/28034] %-escape url characters
Hi Noah, On Mon, Jul 19, 2021 at 10:53:17AM -0400, Noah Sanci via Elfutils-devel wrote: > When requesting some source files, some URL-inconvenient chars > sometimes pop up. Example from f33 libstdc++: > /buildid/44d8485cb75512c2ca5c8f70afbd475cae30af4f/source/usr/src/debug/ > gcc-10.3.1-1.fc33.x86_64/obj-x86_64-redhat-linux/x86_64-redhat-linux/ > libstdc++-v3/src/c++11/../../../../../libstdc++-v3/src/c++11/ > condition_variable.cc > As this URL is passed into debuginfod's handler_cb, it appears that the > + signs are helpfully unescaped to spaces by libmicrohttpd, which > 'course breaks everything. We need to suppress this HTTP URL > processing step. Also worth checking that %HEX decoding is also > suppressed. I find this description slightly confusing. It ends with "Also worth checking..." but that is actually what is done in this patch. The part before that about what debuginfod/microhttpd does on the server side is interesting, but not really what this patch is about (just why this patch is necessary, but it seems necessary on the client side always whatever server is used). Could you rewrite the commit message to describe what is done in this patch? > https://sourceware.org/bugzilla/show_bug.cgi?id=28034 > +2021-07-16 Noah Sanci > + > + PR28034 > + * debuginfod-client.c (debuginfod_query_server): % escape filename > + so the completed url is processed properly. > diff --git a/debuginfod/debuginfod-client.c b/debuginfod/debuginfod-client.c > index f12f609c..e30f73eb 100644 > --- a/debuginfod/debuginfod-client.c > +++ b/debuginfod/debuginfod-client.c > @@ -831,9 +831,15 @@ debuginfod_query_server (debuginfod_client *c, >else > slashbuildid = "/buildid"; > > - if (filename) /* must start with / */ > -snprintf(data[i].url, PATH_MAX, "%s%s/%s/%s%s", server_url, > - slashbuildid, build_id_bytes, type, filename); > + if (filename)/* must start with / */ > +{ > + /* PR28034 escape characters in completed url to %hh format. */ > + char *escaped_string; > + escaped_string = curl_easy_escape(data[i].handle, filename, 0); curl_easy_escape () can return NULL on failure. > + snprintf(data[i].url, PATH_MAX, "%s%s/%s/%s%s", server_url, > + slashbuildid, build_id_bytes, type, escaped_string); > + curl_free(escaped_string); > +} I note that filename is actually the full path component of the URL so includes slashes ('/'). curl_easy_escape seems to convert these to %2F (if I am correct). Is this intended? >else > snprintf(data[i].url, PATH_MAX, "%s%s/%s/%s", server_url, > slashbuildid, build_id_bytes, type); > diff --git a/doc/debuginfod.8 b/doc/debuginfod.8 > index 1adf703a..90cdcf94 100644 > --- a/doc/debuginfod.8 > +++ b/doc/debuginfod.8 > @@ -299,6 +299,7 @@ l l. > \../bar/foo.c AT_comp_dir=/zoo/ > /buildid/BUILDID/source/zoo//../bar/foo.c > .TE > > +Note: the client should %-escape characters in /SOURCE/FILE that are not > shown as "unreserved" in section 2.3 of RFC3986. This is a very long line. Could you break it up? Also, maybe just give the information instead of only a reference. (The "unreserved" characters are "a"-"z"", "A"-"Z", "0"-"9", "-", ".", "_" and "~") Also same question as above. slash ('/') is not an unreserved character, should it be encoded? > .SS /metrics > > This endpoint returns a Prometheus formatted text/plain dump of a > diff --git a/tests/ChangeLog b/tests/ChangeLog > index 1460b422..cfa0dee4 100644 > --- a/tests/ChangeLog > +++ b/tests/ChangeLog > @@ -1,3 +1,9 @@ > +2021-07-16 Noah Sanci > + > + PR28034 > + * run-debuginfod-find.sh: Added a test case ensuring files with % > + escapable characters in their paths are accessible. There are also a couple of changes (fixes?) to the testcases. Could those be split out? > 2021-07-06 Alice Zhang > > PR27531 > diff --git a/tests/run-debuginfod-find.sh b/tests/run-debuginfod-find.sh > index 73bbe65b..11c1a1a1 100755 > --- a/tests/run-debuginfod-find.sh > +++ b/tests/run-debuginfod-find.sh > @@ -44,6 +44,7 @@ cleanup() >if [ $PID2 -ne 0 ]; then kill $PID2; wait $PID2; fi >if [ $PID3 -ne 0 ]; then kill $PID3; wait $PID3; fi >if [ $PID4 -ne 0 ]; then kill $PID4; wait $PID4; fi > + sleep 5; # Give time to ensure debuginfod cleans and closes sqlite > databases. >rm -rf F R D L Z ${PWD}/foobar ${PWD}/mocktree ${PWD}/.client_cache* > ${PWD}/tmp* >exit_cleanup > } This seems a testsuite fixup? > @@ -54,7 +55,7 @@ trap cleanup 0 1 2 3 5 9 15 > errfiles_list= > err() { > echo ERROR REPORTS > -for ports in $PORT1 $PORT2 > +for ports in $PORT1 $PORT2 $PORT3 > do > echo ERROR REPORT $port metrics > curl -s http://127.0.0.1:$port/metrics As is this? > @@ -149,18 +150,18 @@ ps -q $PID1 -e -L -o '%p %c %a' | grep traverse > # Compile a simple program, strip
Re: [Bug debuginfod/28034] %-escape url characters
Hi - > Could you rewrite the commit message to describe what is done in this > patch? (Yeah, Noah's commit text on his branch was corrected.) > [...] > I note that filename is actually the full path component of the URL so > includes slashes ('/'). curl_easy_escape seems to convert these to %2F > (if I am correct). Is this intended? It's harmless. > > +Note: the client should %-escape characters in /SOURCE/FILE that are not > > shown as "unreserved" in section 2.3 of RFC3986. > > This is a very long line. Could you break it up? > Also, maybe just give the information instead of only a reference. > (The "unreserved" characters are "a"-"z"", "A"-"Z", "0"-"9", "-", ".", "_" > and "~") > Also same question as above. slash ('/') is not an unreserved > character, should it be encoded? As we know from the status quo working for a year+, it doesn't matter for "/". But RFC3986 does not give a character class that corresponds exactly to what MUST be encoded, so for documentation purposes this simple SHOULD guidance seems fine. - FChE
Re: [Bug debuginfod/28034] %-escape url characters
Hi Noah, On Tue, Jul 20, 2021 at 07:50:11PM +0200, Mark Wielaard wrote: > > + * run-debuginfod-find.sh: Added a test case ensuring files with % > > + escapable characters in their paths are accessible. > > There are also a couple of changes (fixes?) to the testcases. > Could those be split out? I think you almost had the right fix for a race in killing the last debuginfod server. Does the attached work for you? Thanks, Mark >From 83b7eb24a5796a4aecc5d32eb0c3f459788c4690 Mon Sep 17 00:00:00 2001 From: Mark Wielaard Date: Tue, 20 Jul 2021 20:50:48 +0200 Subject: [PATCH] tests: wait for PID4 before setting to zero A debuginfod server might take a while to shut down to clean and close the sqlite databases. Wait for the process after killing it and clearing the PID variable so it won't be killed again. Reported-by: Noah Sanci Signed-off-by: Mark Wielaard --- tests/ChangeLog | 4 tests/run-debuginfod-find.sh | 2 ++ 2 files changed, 6 insertions(+) diff --git a/tests/ChangeLog b/tests/ChangeLog index 1196d6b2..b0303e00 100644 --- a/tests/ChangeLog +++ b/tests/ChangeLog @@ -1,3 +1,7 @@ +2021-07-20 Mark Wielaard + + * tests/run-debuginfod-find.sh: wait for PID4 before setting to zero. + 2021-06-28 Noah Sanci PR25978 diff --git a/tests/run-debuginfod-find.sh b/tests/run-debuginfod-find.sh index 1d664be9..23eac329 100755 --- a/tests/run-debuginfod-find.sh +++ b/tests/run-debuginfod-find.sh @@ -754,6 +754,8 @@ wait_ready $PORT3 'groom{statistic="files scanned (#)"}' 0 wait_ready $PORT3 'groom{statistic="files scanned (mb)"}' 0 kill $PID4 +wait $PID4 +PID4=0 # set up tests for retrying failed queries. -- 2.32.0