Extract substring by UTF-8 byte positions

I have a string, plus a start and a length with which to extract a substring. Both positions (start and length) are based on byte offsets in the original UTF-8 string.

However, there is a problem:

The start and length are in bytes, so I cannot use "substring". The UTF-8 string contains several multi-byte characters. Is there a hyper-efficient way of doing this? (I don't need to decode the bytes...)

Example: var orig = '你好吗?'

The s, e might be 3,3 to extract the second character (好). I'm looking for

var result = orig.substringBytes(3,3);

Help!

Update #1 In C/C++ I would just cast it to a byte array, but I'm not sure whether there is an equivalent in JavaScript. BTW, yes, we could parse it into a byte array and parse it back into a string, but it seems there should be a quick way to cut it at the right place. Imagine that "orig" is 1000000 characters, s = 6 bytes and l = 3 bytes.

Update #2 Thanks to zerkms's helpful redirection, I ended up with the following, which does NOT work correctly: it works right for multibyte characters but messes up for single-byte ones.

function substrBytes(str, start, length)
{
    var ch, startIx = 0, endIx = 0, re = '';
    for (var i = 0; 0 < str.length; i++)
    {
        startIx = endIx++;

        ch = str.charCodeAt(i);
        do {
            ch = ch >> 8; // a better way may exist to measure ch len
            endIx++;
        }
        while (ch);

        if (endIx > start + length)
        {
            return re;
        }
        else if (startIx >= start)
        {
            re += str[i];
        }
    }
}

Update #3 I don't think shifting the char code really works. I am reading two bytes when the correct answer is three... somehow I always forget this. The code point is the same for UTF-8 and UTF-16, but the number of bytes needed for encoding depends on the encoding!!! So this is not the right way to do it.
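To make the point about encodings concrete, here is a minimal sketch of how the byte count for one code point differs between UTF-8 and UTF-16. The function names are illustrative only and do not come from this thread or from any library.

```javascript
// Number of bytes UTF-8 needs for a single Unicode code point.
function utf8LengthOfCodePoint(cp) {
    if (cp < 0x80) return 1;      // U+0000..U+007F
    if (cp < 0x800) return 2;     // U+0080..U+07FF
    if (cp < 0x10000) return 3;   // U+0800..U+FFFF
    return 4;                     // U+10000..U+10FFFF
}

// Number of bytes UTF-16 needs for the same code point
// (one 16-bit unit, or a surrogate pair above U+FFFF).
function utf16LengthOfCodePoint(cp) {
    return cp < 0x10000 ? 2 : 4;
}

var cp = '好'.codePointAt(0);
console.log(utf8LengthOfCodePoint(cp));  // 3
console.log(utf16LengthOfCodePoint(cp)); // 2
```

The same code point therefore yields 3 bytes in UTF-8 but 2 bytes in UTF-16, which is the mismatch described in the update.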

2012-06-26 tofutim

The start and length for 'substr' are in characters, not bytes. – nhahtdh

http://stackoverflow.com/q/1240408/251311 – zerkms

@zerkms - I found that too, although I think that converting the whole string to bytes, grabbing the substring and converting back would be really inefficient. What if there are 10000000 characters and I want bytes 6-12? Converting the whole string seems like a terrible idea. – tofutim

I had a great time fiddling with this. Hope this helps.

Because JavaScript does not allow direct byte access on a string, the start position can only be found by a forward scan.

> Update #3 I don't think shifting the char code really works. I am reading two bytes when the correct answer is three... somehow I always forget this. The code point is the same for UTF-8 and UTF-16, but the number of bytes needed for encoding depends on the encoding!!! So this is not the right way to do it.

This is not correct. Actually, there is no such thing as a UTF-8 string in JavaScript. According to the ECMAScript 262 specification, all strings, regardless of the input encoding, must be stored internally as UTF-16 ("[sequence of] 16-bit unsigned integers").

Considering this, the 8-bit shift is correct (but unnecessary).

What is wrong is the assumption that your character is stored as a 3-byte sequence...
In fact, all characters in a JS (ECMA-262) string are 16 bits (2 bytes) long.
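As a small illustration of this point, the sketch below compares the byte count obtained by shifting the 16-bit code unit (as in the question's attempt) with the actual UTF-8 length. The helper names are made up for this example, and it assumes an environment that provides the standard TextEncoder (modern browsers and recent Node.js versions).

```javascript
// Count the bytes occupied by a 16-bit code unit value by shifting,
// the way the question's attempt does.
function bytesByShifting(codeUnit) {
    var n = 0;
    do {
        codeUnit = codeUnit >> 8;
        n++;
    } while (codeUnit);
    return n;
}

// Actual UTF-8 length of a string, measured by encoding it.
function utf8ByteLength(s) {
    return new TextEncoder().encode(s).length;
}

// '\u00fc' is "ü"
['a', '\u00fc', '你'].forEach(function (ch) {
    console.log(ch, bytesByShifting(ch.charCodeAt(0)), utf8ByteLength(ch));
});
// a 1 1
// ü 1 2
// 你 2 3
```

The shift measures the width of the UTF-16 code unit, so it can never report more than 2 bytes, while UTF-8 needs 2 bytes for "ü" and 3 for "你".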

This can be worked around by converting the multibyte characters to UTF-8 manually, as shown in the code below.

See the details explained in my example code:

function encode_utf8(s) 
{ 
    return unescape(encodeURIComponent(s)); 
} 

function substr_utf8_bytes(str, startInBytes, lengthInBytes) { 

    /* this function scans a multibyte string and returns a substring. 
    * arguments are start position and length, both defined in bytes. 
    * 
    * this is tricky, because javascript only allows character level 
    * and not byte level access on strings. Also, all strings are stored 
    * in utf-16 internally - so we need to convert characters to utf-8 
    * to detect their length in utf-8 encoding. 
    * 
    * the startInBytes and lengthInBytes parameters are based on byte 
    * positions in a utf-8 encoded string. 
    * in utf-8, for example:
    *  "a" is 1 byte,
    *  "ü" is 2 bytes,
    *  and "你" is 3 bytes.
    *
    * NOTE:
    * according to ECMAScript 262 all strings are stored as a sequence
    * of 16-bit characters. so we need an encode_utf8() function to safely
    * detect the length our character would have in a utf8 representation.
    * 
    * http://www.ecma-international.org/publications/files/ecma-st/ECMA-262.pdf 
    * see "4.3.16 String Value": 
    * > Although each value usually represents a single 16-bit unit of 
    * > UTF-16 text, the language does not place any restrictions or 
    * > requirements on the values except that they be 16-bit unsigned 
    * > integers. 
    */ 

    var resultStr = ''; 
    var startInChars = 0; 

    // scan string forward to find index of first character 
    // (convert start position in byte to start position in characters) 

    for (var bytePos = 0; bytePos < startInBytes; startInChars++) {

        // get numeric code of character (is >= 128 for multibyte character)
        // and increase "bytePos" for each byte of the character sequence

        var ch = str.charCodeAt(startInChars);
        bytePos += (ch < 128) ? 1 : encode_utf8(str[startInChars]).length;
    }

    // now that we have the position of the starting character,
    // we can build the resulting substring

    // as we don't know the end position in chars yet, we start with a mix of
    // chars and bytes. we decrease "end" by the byte count of each selected
    // character to end up in the right position
    var end = startInChars + lengthInBytes - 1;

    for (var n = startInChars; startInChars <= end; n++) {
        // get numeric code of character (is >= 128 for multibyte character)
        // and decrease "end" for each byte of the character sequence
        ch = str.charCodeAt(n);
        end -= (ch < 128) ? 1 : encode_utf8(str[n]).length;

        resultStr += str[n];
    }

    return resultStr; 
} 

var orig = 'abc你好吗？'; 

alert('res: ' + substr_utf8_bytes(orig, 0, 2)); // alerts: "ab" 
alert('res: ' + substr_utf8_bytes(orig, 2, 1)); // alerts: "c" 
alert('res: ' + substr_utf8_bytes(orig, 3, 3)); // alerts: "你" 
alert('res: ' + substr_utf8_bytes(orig, 6, 6)); // alerts: "好吗"
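As a variation on the forward scan above, the following sketch computes each character's UTF-8 length arithmetically from its code point instead of calling encodeURIComponent, and steps over surrogate pairs so that characters outside the Basic Multilingual Plane count as 4 bytes. The function name utf8SubstrByBytes is hypothetical, and the handling of offsets that fall inside a character is an assumption of this sketch, not something specified in the thread.

```javascript
// Sketch: extract a substring by UTF-8 byte offset and byte length.
// Assumption: a start offset inside a character is rounded forward to the
// next character boundary, and a character that begins before the end
// offset is included whole.
function utf8SubstrByBytes(str, startInBytes, lengthInBytes) {
    var endInBytes = startInBytes + lengthInBytes;
    var bytePos = 0;      // UTF-8 byte offset of the character at index i
    var startIdx = -1;    // UTF-16 index where the result starts
    var i = 0;

    while (i < str.length && bytePos < endInBytes) {
        if (startIdx < 0 && bytePos >= startInBytes) {
            startIdx = i;
        }
        var cp = str.codePointAt(i);
        // UTF-8 length of this code point
        bytePos += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        // code points above U+FFFF occupy two UTF-16 code units
        i += cp > 0xFFFF ? 2 : 1;
    }
    return startIdx < 0 ? '' : str.slice(startIdx, i);
}

var sample = 'abc你好吗？';
console.log(utf8SubstrByBytes(sample, 0, 2)); // ab
console.log(utf8SubstrByBytes(sample, 3, 3)); // 你
console.log(utf8SubstrByBytes(sample, 6, 6)); // 好吗
```

The scan still has to walk the string from the beginning, but it stops as soon as the end offset is reached, so extracting bytes 6-12 of a very long string only touches the first few characters.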

2012-06-26 18:06:09 Kaii

Updated to make this function compatible with UTF-8 input (i.e. when the string was originally UTF-8 and the byte positions also refer to a UTF-8 string). – Kaii

System.ArraySegment is useful, but you need to construct it with the input array, an offset and a count.

2012-06-26 04:22:41

Is that in JavaScript? Or just a C# library? – tofutim

function substrBytes(str, start, length)
{
    // Buffer.from replaces the deprecated `new Buffer(str)` constructor
    var buf = Buffer.from(str, 'utf8');
    return buf.slice(start, start + length).toString('utf8');
}

AYB

2012-06-26 09:56:44 tofutim

I tried this, but I don't have a Buffer() object. Which framework did you use? – Kaii

It's found in node.js – tofutim

This doesn't work for me in Node.js. It returns a bunch of question marks. Regular substr works fine. – Gavin
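For environments without Buffer, the same encode, slice, decode idea can be sketched with the standard TextEncoder and TextDecoder classes. This is only an illustration under the assumption that both are available (modern browsers and recent Node.js versions); it is not the code from the answer above, and it still encodes the whole string, which is the cost the question wanted to avoid.

```javascript
// Sketch: encode to UTF-8 bytes, take a view of the requested byte range,
// and decode that range back to a string.
function substrBytes(str, start, length) {
    var bytes = new TextEncoder().encode(str);   // Uint8Array of UTF-8 bytes
    var part = bytes.subarray(start, start + length);
    return new TextDecoder('utf-8').decode(part);
}

console.log(substrBytes('abc你好吗', 3, 3)); // 你
```

If the offsets split a multibyte sequence, the decoder in its default mode emits U+FFFD replacement characters instead of throwing, so byte ranges should fall on character boundaries.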

@Kaii's answer is almost correct, but there is a bug in it. It fails to handle Unicode characters from 128 to 255. Here is the revised version (just changed 256 to 128):

function encode_utf8(s) 
{ 
    return unescape(encodeURIComponent(s)); 
} 

function substr_utf8_bytes(str, startInBytes, lengthInBytes) { 

    /* this function scans a multibyte string and returns a substring. 
    * arguments are start position and length, both defined in bytes. 
    * 
    * this is tricky, because javascript only allows character level 
    * and not byte level access on strings. Also, all strings are stored 
    * in utf-16 internally - so we need to convert characters to utf-8 
    * to detect their length in utf-8 encoding. 
    * 
    * the startInBytes and lengthInBytes parameters are based on byte 
    * positions in a utf-8 encoded string. 
    * in utf-8, for example:
    *  "a" is 1 byte,
    *  "ü" is 2 bytes,
    *  and "你" is 3 bytes.
    *
    * NOTE:
    * according to ECMAScript 262 all strings are stored as a sequence
    * of 16-bit characters. so we need an encode_utf8() function to safely
    * detect the length our character would have in a utf8 representation.
    * 
    * http://www.ecma-international.org/publications/files/ecma-st/ECMA-262.pdf 
    * see "4.3.16 String Value": 
    * > Although each value usually represents a single 16-bit unit of 
    * > UTF-16 text, the language does not place any restrictions or 
    * > requirements on the values except that they be 16-bit unsigned 
    * > integers. 
    */ 

    var resultStr = ''; 
    var startInChars = 0; 

    // scan string forward to find index of first character 
    // (convert start position in byte to start position in characters) 

    for (var bytePos = 0; bytePos < startInBytes; startInChars++) {

        // get numeric code of character (is >= 128 for multibyte character)
        // and increase "bytePos" for each byte of the character sequence

        var ch = str.charCodeAt(startInChars);
        bytePos += (ch < 128) ? 1 : encode_utf8(str[startInChars]).length;
    }

    // now that we have the position of the starting character,
    // we can build the resulting substring

    // as we don't know the end position in chars yet, we start with a mix of
    // chars and bytes. we decrease "end" by the byte count of each selected
    // character to end up in the right position
    var end = startInChars + lengthInBytes - 1;

    for (var n = startInChars; startInChars <= end; n++) {
        // get numeric code of character (is >= 128 for multibyte character)
        // and decrease "end" for each byte of the character sequence
        ch = str.charCodeAt(n);
        end -= (ch < 128) ? 1 : encode_utf8(str[n]).length;

        resultStr += str[n];
    }

    return resultStr; 
} 

var orig = 'abc你好吗？©'; 

alert('res: ' + substr_utf8_bytes(orig, 0, 2)); // alerts: "ab" 
alert('res: ' + substr_utf8_bytes(orig, 2, 1)); // alerts: "c" 
alert('res: ' + substr_utf8_bytes(orig, 3, 3)); // alerts: "你" 
alert('res: ' + substr_utf8_bytes(orig, 6, 6)); // alerts: "好吗" 
alert('res: ' + substr_utf8_bytes(orig, 15, 2)); // alerts: "©"

By the way, this is a bug fix, and it should be useful for those who have the same problem. Why did the reviewers reject my suggested edit as "too much" or "too minor"? @Adam Eberlin @Kjuly @Jasonw

2012-11-02 17:17:12 sunzhuoshi

I credited this and edited my answer. Thanks for your sharp eyes – Kaii

For IE users, the code in the answer above outputs undefined, because IE does not support str[n]; in other words, you cannot use a string as an array. You need to replace str[n] with str.charAt(n). The code should be:

function encode_utf8(s) { 
    return unescape(encodeURIComponent(s)); 
} 

function substr_utf8_bytes(str, startInBytes, lengthInBytes) { 

    var resultStr = ''; 
    var startInChars = 0; 

    for (var bytePos = 0; bytePos < startInBytes; startInChars++) {
        var ch = str.charCodeAt(startInChars);
        bytePos += (ch < 128) ? 1 : encode_utf8(str.charAt(startInChars)).length;
    }

    var end = startInChars + lengthInBytes - 1;

    for (var n = startInChars; startInChars <= end; n++) {
        ch = str.charCodeAt(n);
        end -= (ch < 128) ? 1 : encode_utf8(str.charAt(n)).length;

        resultStr += str.charAt(n);
    }

    return resultStr; 
}

2014-03-11 12:06:55

Maybe use this to count bytes; an example follows. It counts the character 你 as 2 bytes, instead of the 3 bytes that @Kaii's function gives:

jQuery.byteLength = function(target) {
    try {
        var i = 0;
        var length = 0;
        var count = 0;
        var character = '';
        //
        target = jQuery.castString(target);
        length = target.length;
        //
        for (i = 0; i < length; i++) {
            // take one character and convert it to its Unicode value
            character = target.charCodeAt(i);
            //
            // half-width characters in Unicode: 0x0 - 0x80, 0xf8f0,
            // 0xff61 - 0xff9f, 0xf8f1 - 0xf8f3
            if ((character >= 0x0 && character < 0x81)
                    || (character == 0xf8f0)
                    || (character > 0xff60 && character < 0xffa0)
                    || (character > 0xf8f0 && character < 0xf8f4)) {
                // 1-byte character
                count += 1;
            } else {
                // 2-byte character
                count += 2;
            }
        }
        //
        return (count);
    } catch (e) {
        jQuery.showErrorDetail(e, 'byteLength');
        return (0);
    }
};

for (var j = 1, len = value.length; j <= len; j++) { 
    var slice = value.slice(0, j); 
    var slength = $.byteLength(slice); 
    if (slength == 106) { 
     $(this).val(slice); 
     break; 
    } 
}
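The loop above re-slices and re-counts the prefix on every iteration and only stops when the count hits the limit exactly. As a hedged alternative for the UTF-8 case discussed in this thread, the sketch below truncates a string to a maximum number of UTF-8 bytes in a single pass. The name truncateToUtf8Bytes is hypothetical, and it counts real UTF-8 lengths (3 bytes for 你), not the 1-or-2-byte scheme of byteLength.

```javascript
// Sketch: return the longest prefix of str whose UTF-8 encoding
// fits into maxBytes, never cutting a character in half.
function truncateToUtf8Bytes(str, maxBytes) {
    var bytes = 0;
    var i = 0;
    while (i < str.length) {
        var cp = str.codePointAt(i);
        var size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (bytes + size > maxBytes) {
            break;  // the next character would exceed the limit
        }
        bytes += size;
        i += cp > 0xFFFF ? 2 : 1;  // skip both halves of a surrogate pair
    }
    return str.slice(0, i);
}

console.log(truncateToUtf8Bytes('abc你好', 6)); // abc你
console.log(truncateToUtf8Bytes('abc你好', 5)); // abc
```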

2017-09-07 03:37:26 user3331563
