Questions about RFC 8771

from Andy Balaam's Blog

Mar 17, 2021, 9:30:41 AM

During my work on RFC 8771 The Internationalized Deliberately Unreadable Network NOtation (I-DUNNO) I have come across a number of questions. I am documenting them here so I can send them to the authors and try to improve my understanding of the intention.

This is an excellent RFC, and I thank the authors for their efforts in creating it.

1. Non-printable characters

In 4.2. Satisfactory Confusion Level, the RFC states that encodings may be deemed Satisfactory if they contain ‘At least one non-printable character’ (as well as one other condition in this section).

Both of the existing implementations that I know of ( and ) interpret “printable” to mean the same as Python’s isprintable() function: that is Nonprintable characters are those characters defined in the Unicode character database as “Other” or “Separator”, excepting the ASCII space (0x20) which is considered printable.

However, the definition of this function may be rather Python-specific, since its intention appears to be related to language internals like the repr function.

It would be helpful to find out exactly what is meant by “non-printable character” in the RFC.

2. Are Modifier Symbols, Symbols?

Also in section 4.2, the RFC mentions ‘”A character from the “Symbol” category’.

The Python implementation excludes Modifier Symbols from its definition. I believe this is incorrect, and have logged an issue on the topic: Some Symbol characters not recognised as such.

It would be helpful to have clarification on this point.

3. What does “different directionalities” mean?

Unicode classifies characters into several Bidi_Classes (for example, U+CED6E is Left_To_Right). In section 4.3. Delightful Confusion Level, the RFC refers to ‘Characters from scripts with different directionalities’.

As far as I can see there are two possible interpretations of this phrase:

  1. The encoding should contain characters from at least two different Bidi_Classes, or
  2. The encoding should contain characters that are both left-to-right and right-to-left in direction, either weakly or strongly.

Both current implementations interpret this statement like number 1, but I suspect the intention may actually be something more like number 2.

If number 2 was meant, I think that would mean ignoring characters with Neutral directionality, and treating weakly directional characters as the same directionality as strongly directional ones.

4. What is a Confusable character?

Section 4.3 mentions ‘Character classified as “Confusables”‘. Both implementations interpret this quite loosely, as meaning something like ‘the encoding contains any character or substring which might be confused with any other character or substring’.

This means a lot of “normal” characters are included: all of the ASCII digits, and many of the Latin letters.

Was this is intention?

That’s all my questions. It has been great fun working on this RFC.