Created a PR
https://github.com/dovecot/core/pull/155
On 2021-02-11 13:25, Joan Moreau wrote:
Hello
Checking further, and adding log statements throughout the dovecot
code, I see that the core sends the initial document (not decoded)
FIRST, then the decoded version SECOND.
This is really weird, and the indexer then indexes a lot of binary
crap.
I am struggling to find where in the code this double call is made.
Does anyone know?
On 2021-02-10 00:05, John Fawcett wrote:
On 09/02/2021 15:33, Joan Moreau wrote:
If I place the following code in the plugin's
fts_backend_xxx_update_build_more function (lucene, squat and xapian,
as solr refuses to work properly on my setup):
{
	/* Log up to the first 20 bytes of the data passed to the backend. */
	char *s = i_strdup("EMPTY");
	if (data != NULL) {
		i_free(s);
		s = i_strndup(data, 20);
	}
	i_info("fts_backend_update_build_more: data like '%s'", s);
	i_free(s);
}
and if I send a PDF by email, the data shown in the log is "%PDF-1.7 ",
which means the decoded data is not properly transmitted to the
plugin.
Something is wrong in the data transmission.
Joan
I too see something similar with fts_solr. I do see the raw %PDF string
and PDF binary data being passed through to the
fts_backend_xxx_update_build_more function, but I disagree with the
conclusion you draw from it.
After the raw data I also see the decoded data, so at least in my case
it is possible to see both the raw and decoded data in the
fts_backend_xxx_update_build_more function. In the rawlog I no longer
see the binary data (only some blank lines), so something is filtering
it. I do see the decoded data in the rawlog. I do get hits on the solr
search for the decoded text.
John
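
To tell the raw and the decoded calls apart in the logs, a helper along
these lines might be useful. This is only a sketch in plain C with no
Dovecot dependencies: chunk_preview and chunk_looks_binary are
hypothetical names, not part of the Dovecot API, and it assumes the
caller knows the length of the buffer it was handed. The binary check is
a rough heuristic (NUL and control bytes only), so a chunk holding just
the "%PDF-1.7" header line would still be classed as text.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write an escaped preview of at most max_bytes of
 * data into out (capacity out_len). Never reads past size bytes.
 * Returns the number of characters written, excluding the NUL. */
static size_t chunk_preview(const unsigned char *data, size_t size,
                            size_t max_bytes, char *out, size_t out_len)
{
	size_t n = size < max_bytes ? size : max_bytes;
	size_t pos = 0;

	assert(out != NULL || out_len == 0);
	if (out_len == 0)
		return 0;

	for (size_t i = 0; i < n; i++) {
		if (isprint(data[i]) && data[i] != '\\') {
			/* need room for 1 char plus the trailing NUL */
			if (pos + 1 >= out_len)
				break;
			out[pos++] = (char)data[i];
		} else {
			/* need room for "\xNN" (4 chars) plus the trailing NUL */
			if (pos + 4 >= out_len)
				break;
			snprintf(out + pos, out_len - pos, "\\x%02x", data[i]);
			pos += 4;
		}
	}
	out[pos] = '\0';
	return pos;
}

/* Hypothetical heuristic: returns 1 if the chunk contains a NUL or a
 * control byte other than tab, newline or carriage return, else 0. */
static int chunk_looks_binary(const unsigned char *data, size_t size)
{
	for (size_t i = 0; i < size; i++) {
		unsigned char c = data[i];

		if (c == 0 || (c < 0x20 && c != '\t' && c != '\n' && c != '\r'))
			return 1;
	}
	return 0;
}
```

Logging the preview together with the chunk size and the binary flag on
every call should show whether a given message produces a raw chunk
followed by a decoded one.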