Hi Graham, All,

I suspect the N-Triples spec has changed over the years, specifically when Turtle became a W3C standard.

For example, the N-Triples 1.1 spec says [1]:


Encoding considerations:
The syntax of N-Triples is expressed over code points in Unicode [UNICODE]. The encoding is always UTF-8 [UTF-8]. Unicode code points may also be expressed using an \uXXXX (U+0 to U+FFFF) or \UXXXXXXXX syntax (for U+10000 onwards) where X is a hexadecimal digit [0-9A-F]

And also the note here

6.1 Other Media Types

N-Triples has been historically provided with other media types. N-Triples may also be provided as text/plain. When used in this way N-Triples MUST use the escaped form of any character outside US-ASCII.

Hope that helps in working out whether it is a bug or not.

Regards,
Jerven

[1] https://www.w3.org/TR/2014/REC-n-triples-20140225/

On 01/08/2022 21:13, Graham Higgins wrote:


On Monday, August 1, 2022 at 5:43:54 PM UTC Etienne Posthumus wrote:

    Thanks for the excellent spelunking Graham.


Happy to help, thanks for the kind words.

    Is it common practice nowadays for most serializers to just do UTF-8
    and not do \-escape sequences anymore? I guess if this has been the
    behaviour in rdflib for years now and no-one complains too much, we
    can just assume it is OK and keep on doing it.


I don't know about "common practice" but I treat Jena's behaviour as a useful ad hoc yardstick: if it passes muster with Andy Seaborn then it's probably the right way to go.

    Maybe it is a good idea for us to add a line in the docs that the
    rdflib serializer intentionally deviates from the spec.


Yes, either document the difference or, given that known-working code still exists, perhaps enabling strictness via an *args flag might be a viable solution ... something along the lines of:

diff --git a/rdflib/plugins/serializers/nt.py b/rdflib/plugins/serializers/nt.py
index 913dbedf..b73f223f 100644
--- a/rdflib/plugins/serializers/nt.py
+++ b/rdflib/plugins/serializers/nt.py
@@ -38,7 +38,11 @@ class NTSerializer(Serializer):
              )

          for triple in self.store:
-            stream.write(_nt_row(triple).encode())
+            stream.write(
+                _nt_row(triple).encode("ascii", "_rdflib_nt_escape")
+                if "w3c" in args
+                else _nt_row(triple).encode()
+            )


  class NT11Serializer(NTSerializer):

Which, on casual testing, behaves as desired, producing “<urn:aap> <urn:noot> "mi\u00EBs" .” with the flag set and “<urn:aap> <urn:noot> "miës" .” when not set.
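
For anyone unfamiliar with the mechanism the diff relies on, here is a minimal standalone sketch of how a codec error handler can turn non-ASCII characters into \uXXXX / \UXXXXXXXX escapes during an ASCII encode. It is an illustration only, not rdflib's own handler: the handler name "demo_nt_escape" and the function nt_escape_errors are made up for this example.

```python
import codecs


def nt_escape_errors(err):
    """Codec error handler: swap unencodable characters for N-Triples escapes."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    # err.object[err.start:err.end] is the run of characters ASCII could not encode.
    chunk = err.object[err.start:err.end]
    escaped = "".join(
        f"\\u{ord(ch):04X}" if ord(ch) <= 0xFFFF else f"\\U{ord(ch):08X}"
        for ch in chunk
    )
    # Return the replacement text and the position at which encoding resumes.
    return escaped, err.end


# "demo_nt_escape" is a hypothetical name used only in this sketch.
codecs.register_error("demo_nt_escape", nt_escape_errors)

row = '<urn:aap> <urn:noot> "miës" .\n'
strict = row.encode("ascii", "demo_nt_escape")  # escaped, ASCII-only bytes
relaxed = row.encode()                          # plain UTF-8 bytes

print(strict.decode("ascii"), end="")
print(relaxed.decode("utf-8"), end="")
```

The point of doing it through the error handler is that the row-building code stays untouched; only the final encode call decides whether the output is escaped ASCII or raw UTF-8.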

What does the team think?

Cheers,
Graham

--
http://github.com/RDFLib <http://github.com/RDFLib>
---
You received this message because you are subscribed to the Google Groups "rdflib-dev" group. To unsubscribe from this group and stop receiving emails from it, send an email to rdflib-dev+unsubscr...@googlegroups.com <mailto:rdflib-dev+unsubscr...@googlegroups.com>. To view this discussion on the web visit https://groups.google.com/d/msgid/rdflib-dev/1b1503b0-dc7a-40a5-963e-0875a6f4b843n%40googlegroups.com <https://groups.google.com/d/msgid/rdflib-dev/1b1503b0-dc7a-40a5-963e-0875a6f4b843n%40googlegroups.com?utm_medium=email&utm_source=footer>.

--

        *Jerven Tjalling Bolleman*
Principal Software Developer
*SIB | Swiss Institute of Bioinformatics*
1, rue Michel Servet - CH 1211 Geneva 4 - Switzerland
t +41 22 379 58 85
Jerven.Bolleman@sib.swiss - www.sib.swiss
