Gobbledygook - Engaged versus Picospan

Incompatible Character Encodings on The WELL

Any text characters are fine when posted and viewed in the same character encoding. Engaged posts are encoded as WIN-1252 and when viewed as WIN-1252, no problem. Picospan posts use UTF-8 encoded characters[1] and when viewed as UTF-8, no problem. Most internet content is now UTF-8, so copy and paste from the web and smart quotes from phones are often where the trouble starts.

When Picospan posts “I’m” with smart quotes in UTF-8, it becomes “I’m† when viewed in Engaged as WIN-1252, a problem known as mojibake or gobbledygook. Conversely, when Engaged posts “I’m” with smart quotes in WIN-1252, it becomes �I�m� when viewed in Picospan as UTF-8. These characters are encoded differently and incompatibly by WIN-1252 and UTF-8:

€  ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ   Ž  ‘ ’ “ ” • – — ˜  ™ š › œ   ž Ÿ  ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬   ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
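Both failure directions can be reproduced in a few lines of Python. This is only an illustrative sketch: the function names are invented here, and Python's cp1252 codec stands in for whatever decoding Engaged, Picospan, and the browser or terminal actually perform.

```python
def utf8_seen_as_win1252(text: str) -> str:
    """Encode as UTF-8 (a Picospan post), then misread the bytes as WIN-1252."""
    # errors="replace" because a few bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D)
    # have no character assigned in Python's cp1252 codec.
    return text.encode("utf-8").decode("cp1252", errors="replace")


def win1252_seen_as_utf8(text: str) -> str:
    """Encode as WIN-1252 (an Engaged post), then misread the bytes as UTF-8."""
    # Invalid UTF-8 sequences become U+FFFD, the replacement character.
    return text.encode("cp1252").decode("utf-8", errors="replace")


print(utf8_seen_as_win1252("I’m"))  # I’m
print(win1252_seen_as_utf8("I’m"))  # I�m
```

The single byte that WIN-1252 uses for ’ is not a valid UTF-8 sequence on its own, which is why the second case shows a replacement character rather than a different letter.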

Multibyte Unicode characters in Engaged posts are automatically converted into HTML entities, so posting 𝓔𝓷𝓰𝓪𝓰𝓮𝓭 (a fancy Unicode text trick) turns into &#120020;&#120055;&#120048;&#120042;&#120048;&#120046;&#120045;. That is plain text, so it looks the same in Engaged and Picospan, but no one can read it. It’s possible to post 𝓔𝓷𝓰𝓪𝓰𝓮𝓭 for viewing in Engaged by using backslash escapes, like this: \&#120020;\&#120055;\&#120048;\&#120042;\&#120048;\&#120046;\&#120045;. That is seen as 𝓔𝓷𝓰𝓪𝓰𝓮𝓭 by Engaged users, but it isn’t readable in Picospan.
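The entity conversion can be imitated in Python for illustration. The sketch assumes the entities are decimal numeric character references, which is what browsers commonly substitute when a form's encoding cannot represent a character; it is not taken from Engaged's own code.

```python
import html

fancy = "𝓔𝓷𝓰𝓪𝓰𝓮𝓭"

# Characters that WIN-1252 cannot represent are replaced by decimal references.
as_entities = fancy.encode("cp1252", errors="xmlcharrefreplace").decode("ascii")
print(as_entities)  # &#120020;&#120055;&#120048;&#120042;&#120048;&#120046;&#120045;

# Decoding the references restores the original characters.
print(html.unescape(as_entities) == fancy)  # True
```

The entity string is pure ASCII, which is why it looks identical in both systems.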

A browser extension to bridge the gap

The extension solves these problems for Engaged viewing. With ☑ Show I�m as I’m checked, it shows posts as UTF-8 when they have no encoding errors (byte sequences that are not valid UTF-8) and falls back to WIN-1252 when they do. It also converts HTML entities into readable UTF-8, so no gobbledygook is ever seen. Clicking show � displays the UTF-8 version together with the WIN-1252 version when they differ. For viewing in UTF-8 only, uncheck ☐ Show I�m as I’m. That avoids reloading pages, but it will display �’s for bad Unicode.
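The fallback rule can be sketched as follows. This is not the extension's actual code (which runs in a browser); `decode_post` is a hypothetical name, and the sketch assumes the raw bytes of a post are available.

```python
import html


def decode_post(raw: bytes) -> str:
    """Decode a post as UTF-8 if it is valid UTF-8, otherwise as WIN-1252."""
    try:
        text = raw.decode("utf-8")  # strict: raises on any invalid sequence
    except UnicodeDecodeError:
        # Fall back to WIN-1252; bytes it leaves unassigned become U+FFFD.
        text = raw.decode("cp1252", errors="replace")
    # Turn entities such as &#120020; into the characters they stand for.
    return html.unescape(text)


print(decode_post("I’m".encode("utf-8")))   # I’m
print(decode_post("I’m".encode("cp1252")))  # I’m
```

The approach works because text containing WIN-1252 smart quotes or accented letters is almost never valid UTF-8 by accident, so a failed strict decode is a reliable signal.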

Posting is not as readily fixed, because other users are seeing different encodings. UTF-8 and WIN-1252 encode printable ASCII identically, and they are compatible for only the following characters:

! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~
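A draft can be checked against this list before posting. The helper below is a hypothetical sketch, not part of the extension; it treats space, tab, and line breaks as safe in addition to the characters listed.

```python
def unsafe_characters(draft: str) -> list[str]:
    """Return the distinct characters outside printable ASCII, in first-seen order."""
    found = []
    for ch in draft:
        printable_ascii = " " <= ch <= "~"
        if not printable_ascii and ch not in "\n\r\t" and ch not in found:
            found.append(ch)
    return found


print(unsafe_characters("résumé costs €5"))  # ['é', '€']
```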

For posting compatibility, then, no diacritical marks or symbols are allowed (no résumé € £ © ‘ ’ “ ” and so on), because their bytes mean different things in the two encodings and so create gobbledygook. The Post in ASCII part of the extension converts those examples to resume EUR GBP ' ' " " using only ASCII characters. However, there are no such equivalents for most Unicode characters.
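A conversion of this kind might look like the sketch below. The replacement table holds only the substitutions mentioned above, and stripping accents through Unicode decomposition is an assumption about how such a converter could work, not a description of the extension's code. Characters with no ASCII equivalent are left unchanged.

```python
import unicodedata

# Substitutions taken from the examples above; a real table would be longer.
REPLACEMENTS = {"€": "EUR", "£": "GBP", "‘": "'", "’": "'", "“": '"', "”": '"'}


def post_in_ascii(text: str) -> str:
    """Replace known symbols and strip accents, leaving other characters as-is."""
    out = []
    # NFC first, so a letter plus a separate combining accent is handled as one unit.
    for ch in unicodedata.normalize("NFC", text):
        if ch in REPLACEMENTS:
            out.append(REPLACEMENTS[ch])
            continue
        # Decompose (é becomes e plus a combining accent) and drop the accent.
        base = "".join(
            c for c in unicodedata.normalize("NFD", ch)
            if not unicodedata.combining(c)
        )
        out.append(base if base and base.isascii() else ch)
    return "".join(out)


print(post_in_ascii("résumé € £ ‘ ’ “ ”"))  # resume EUR GBP ' ' " "
```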

When Unicode characters are needed, and it is acceptable that some users will not be able to read them, the extension lets Engaged users post in UTF-8 by unchecking Post in ASCII. That is the same encoding Pico users post in (usually!), so the post will be readable for Pico users and for those Engaged users who can view UTF-8, whether with this extension or another one that can display UTF-8. A post with converted ASCII can also have UTF-8 added to it.

If all Engaged users used the extension, the encoding gap would vanish and Post in ASCII would never be needed. If The WELL itself is updated to fully support UTF-8, that will also eliminate the problem. The extension is useful in the meantime, and it demonstrates that posts can be stored and retrieved from the server using either encoding on request.


[1] Usually. UTF-8 is the default charset for all current terminal software used with Picospan.