Tuesday, April 14, 2020

Converting UTF-8 and UTF-16 arrays to strings in Javascript and vice versa

Support for UTF-8 and UTF-16 conversion is not that great in Javascript. Node.js has modules such as string_decoder (which provides StringDecoder), but you have to require them, and they won't work in the browser. In browser Javascript you can use TextEncoder, but it doesn't work consistently in all browsers, and in Node.js it is only available via the util module. So if you want (like me) something that converts UTF-8 byte arrays and UTF-16 character arrays into strings and vice versa, with exactly the same code working in both Node.js and browsers and with no dependencies, you might begin to understand my problem.

A few people recommend using unescape(encodeURIComponent(s)) to encode UTF-8 and decodeURIComponent(escape(s)) to decode it, but both escape and unescape are deprecated. This method also only produces strings, not Uint8Arrays, and it doesn't handle the UTF-16 case. Why would you need an array of UTF-8 bytes or UTF-16 characters? Because char and byte arrays can be compared and indexed into more easily. Files also store string data in these formats, especially UTF-8, and if only parts of your file are in UTF-8 then you have to convert the string parts piecemeal. There are probably other uses too, or else Uint8Array and Uint16Array wouldn't exist.

UTF-8

For UTF-8 conversion Javascript already has two functions that do most of the work: encodeURIComponent and decodeURIComponent. encodeURIComponent takes a string and percent-escapes a few reserved characters, plus every character with a code point above 127, as a sequence of UTF-8 bytes, each written as '%' followed by two hex digits. So '%' becomes '%25' and 'ó' becomes '%C3%B3'. This also works for Unicode characters outside the Basic Multilingual Plane, for example the Gothic letter Hwair, 𐍈, which is escaped to '%F0%90%8D%88'. Once we have the escaped sequence it is fairly easy to take each byte and store it as an 8-bit integer in a Uint8Array. The reverse process (Uint8Array to string) is also simple: any byte less than 128 can be converted back into a character using String.fromCodePoint(n), where n is the 8-bit value. The one exception is the '%' byte (0x25), which has to stay escaped as '%25' so that decodeURIComponent does not mistake it for the start of an escape. Bytes from 128 to 255 are converted back into their '%XX' escape form. The string built up this way can then be passed through decodeURIComponent to produce the original string.
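To make the escaping step concrete, here is a small sketch of what encodeURIComponent returns for the examples above, followed by a compact version of the byte extraction. The helper name percentEscapedToBytes is made up for this illustration and is not part of the class further down.

```javascript
// What encodeURIComponent produces for the examples in the text.
const escapedPercent = encodeURIComponent("%");        // "%25"
const escapedAccent = encodeURIComponent("\u00F3");    // "%C3%B3" (the letter ó)
const escapedHwair = encodeURIComponent("\u{10348}");  // "%F0%90%8D%88" (Hwair)

// Hypothetical helper: turn a percent-escaped string into UTF-8 bytes.
function percentEscapedToBytes(escaped) {
    const bytes = [];
    for (let i = 0; i < escaped.length; i++) {
        if (escaped[i] === "%") {
            // "%XX" is one byte written as two hex digits
            bytes.push(parseInt(escaped.substring(i + 1, i + 3), 16));
            i += 2;
        } else {
            // characters left unescaped are always plain ASCII
            bytes.push(escaped.charCodeAt(i));
        }
    }
    return new Uint8Array(bytes);
}

const hwairBytes = percentEscapedToBytes(escapedHwair);
console.log(Array.from(hwairBytes));   // [ 240, 144, 141, 136 ]
```

The four bytes printed are 0xF0, 0x90, 0x8D and 0x88 in decimal, matching the escape sequence.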

UTF-16

UTF-16 is even easier, since all Javascript strings are already encoded in UTF-16. To convert a string to an array we can use str.charCodeAt(index), where str is our string and index is the position within it. If a character's code point does not fit in 16 bits it is encoded as a 'surrogate pair', and charCodeAt extracts it as two separate 16-bit integers. Indeed, str.length is the number of 16-bit UTF-16 code units, not the number of Unicode characters, which will be smaller because each surrogate pair is only one character. To reverse the process we can use String.fromCharCode, which converts each half of the surrogate pair separately; once the two halves sit next to each other in the string they form the original character again.
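The surrogate-pair behaviour is easy to see with the Hwair character from the UTF-8 section. This is only a short illustration; the variable names are arbitrary.

```javascript
const hwair = "\u{10348}";            // Gothic letter Hwair, outside the BMP

console.log(hwair.length);            // 2: two UTF-16 code units
console.log([...hwair].length);       // 1: one Unicode code point

// charCodeAt returns each half of the surrogate pair separately
const high = hwair.charCodeAt(0);     // 0xD800
const low = hwair.charCodeAt(1);      // 0xDF48

// fromCharCode converts each half on its own; concatenated, they
// form the original character again
const rebuilt = String.fromCharCode(high) + String.fromCharCode(low);
console.log(rebuilt === hwair);       // true
```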

Here's my code, wrapped in an HTML page so you can test it easily in the browser. For Node.js, just trim it down to the class definition and add module.exports=unicode.

<!DOCTYPE html>
<html>
<head><script>
/**
 * A simple class to convert utf8 or utf16 byte arrays to strings etc
 * Works in Node.js OR in any browser. No dependencies.
 */
class unicode {
    /**
     * Convert a Uint8Array in UTF-8 to a Javascript string
     * @param uint8_array a Uint8Array in UTF-8
     * @return a Javascript string encoded in standard UTF-16
     */
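    static utf8_to_string(uint8_array) {
        var str = "";
        for ( var i=0;i<uint8_array.length;i++ ) {
            var b = uint8_array[i];
            // plain ASCII passes through, except '%' (0x25), which must be
            // escaped as %25 or decodeURIComponent would misread it
            if ( b < 128 && b != 0x25 )
                str += String.fromCodePoint(b);
            else
                str += '%'+b.toString(16);
        }
        // throws URIError if the bytes are not valid UTF-8
        return decodeURIComponent(str);
    }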
    /**
     * Convert a javascript string to Uint8Array UTF-8. 
     * @param str the string to convert
     * @return a Uint8Array in UTF-8
     */
    static string_to_utf8(str) {
        // NB throws URIError if str contains a lone surrogate
        var encoded = encodeURIComponent(str);
        // NB % sign itself encoded as %25
        var bytes = [];
        var state = 0;
        for ( var i=0;i<encoded.length;i++ ) {
            switch ( state ) {
                case 0:    // convert characters to bytes
                    if ( encoded[i] == '%' )
                        state = 1;
                    else
                        bytes.push(encoded.codePointAt(i));
                    break;
                case 1:    // seen '%'
                    state = 2;
                    break;
                case 2:    // seen %H
                    bytes.push(parseInt(encoded.substring(i-1,i+1),16));
                    state = 0;
                    break;
            }
        }
        return new Uint8Array(bytes);
    }
    /**
     * Convert a javascript string to Uint16Array UTF-16. 
     * @param str the string to convert
     * @return a Uint16Array in UTF-16
     */
    static string_to_utf16(str) {
        var arr = new Uint16Array(str.length);
        for ( var i=0;i<str.length;i++ ) 
            arr[i] = str.charCodeAt(i);
        return arr;
    }
    /**
     * Convert a Uint16Array in UTF-16 to a Javascript string
     * @param uint16_array a Uint16Array in utf-16
     * @return a Javascript string
     */
    static utf16_to_string(uint16_array) {
        var str = "";
        for ( var i=0;i<uint16_array.length;i++ ) 
            str += String.fromCharCode(uint16_array[i]);
        return str;
    }
}
function test() {
    var u8_arr = unicode.string_to_utf8("dógs lov€ 黏s");
    var str = unicode.utf8_to_string(u8_arr);
    console.log(("dógs lov€ 黏s"==str)?"utf-8 test passed":"utf-8 test failed");
    var u16_arr = unicode.string_to_utf16("dógs lov€ 黏s");
    str = unicode.utf16_to_string(u16_arr);
    console.log(("dógs lov€ 黏s"==str)?"utf-16 test passed":"utf-16 test failed");
}
</script>
</head>
<body>
<p><input type="button" value="test" onclick="test()"> (read result in console)</p>
</body>
</html>
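One way to gain some confidence in the decoding direction is to compare it with the platform's TextDecoder wherever that exists. The following is a minimal sketch, assuming TextDecoder is available as a global (as it is in recent browsers and recent Node.js versions). The names fromUtf8 and agreesWithTextDecoder are invented for this illustration, and fromUtf8 is a compact stand-in that escapes every byte rather than the class method above.

```javascript
// Illustrative stand-in for the decodeURIComponent approach.
function fromUtf8(bytes) {
    let escaped = "";
    for (const b of bytes) {
        // escape every byte, so a literal '%' (0x25) is handled too
        escaped += "%" + b.toString(16).padStart(2, "0");
    }
    return decodeURIComponent(escaped);
}

// Returns true or false, or null when TextDecoder is not available.
function agreesWithTextDecoder(bytes) {
    if (typeof TextDecoder === "undefined") return null;
    return new TextDecoder("utf-8").decode(bytes) === fromUtf8(bytes);
}

// "ó 100% 𐍈" as UTF-8 bytes
const sample = new Uint8Array([
    0xC3, 0xB3, 0x20, 0x31, 0x30, 0x30, 0x25, 0x20, 0xF0, 0x90, 0x8D, 0x88
]);
console.log(fromUtf8(sample));
console.log(agreesWithTextDecoder(sample));
```

The two decoders are not interchangeable on bad input: decodeURIComponent throws a URIError on malformed UTF-8, while TextDecoder by default substitutes U+FFFD, so the comparison is only meaningful for valid byte sequences.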
