Why Do English Characters Need Fewer Bytes to Represent Them Than Characters in Other Alphabets?

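While most of us have probably never stopped to think about it, alphabetical characters do not all take the same number of bytes to represent. But why is that? Today's SuperUser Q&A post has the answers to a curious reader's question.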

Today's Question & Answer session comes to us courtesy of SuperUser, a subdivision of Stack Exchange, a community-driven grouping of Q&A web sites.

Partial ASCII chart screenshot courtesy of Wikipedia.

The Question

SuperUser reader khajvah wants to know why different alphabets take up different amounts of disk space when saved:

When I put ‘a’ in a text file and save it, it makes it 2 bytes in size. But when I put a character like ‘ա’ (a letter from the Armenian alphabet) in, it makes it 3 bytes in size.

What is the difference between alphabets on a computer? Why does English take up less space when saved?

Letters are letters, right? Maybe not! What is the answer to this alphabetical mystery?

The Answer

SuperUser contributors Doktoro Reichard and ernie have the answer for us. First up, Doktoro Reichard:

One of the first encoding schemes to be developed for use in mainstream computers is the ASCII (American Standard Code for Information Interchange) standard. It was developed in the 1960s in the United States.

The English alphabet uses part of the Latin alphabet (for instance, there are few accented words in English). There are 26 individual letters in that alphabet, not considering case. And there would also have to exist the individual numbers and punctuation marks in any scheme that pretends to encode the English alphabet.

The 1960s was also a time when computers did not have the amount of memory or disk space that we have now. ASCII was developed to be a standard representation of a functional alphabet across all American computers. At the time, the decision to make every ASCII character 8 bits (1 byte) long was made due to technical details of the time (the Wikipedia article mentions the fact that perforated tape held 8 bits in a position at a time). In fact, the original ASCII scheme can be transmitted using 7 bits, and the eighth could be used for parity checks. Later developments expanded the original ASCII scheme to include several accented, mathematical, and terminal characters.
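As an editorial aside (not part of the quoted answer), the "7 bits plus a parity bit" idea can be sketched in a few lines of Python. The function name `add_even_parity` is hypothetical, and even parity is an assumption made for illustration; the answer does not say which parity convention was used.

```python
def add_even_parity(code7):
    """Return an 8-bit value: a 7-bit code with an even-parity bit in bit 7."""
    if not 0 <= code7 < 128:
        raise ValueError("not a 7-bit code")
    ones = bin(code7).count("1")
    # Set the top bit only when the count of 1 bits is odd,
    # so the full 8-bit value always has an even number of 1 bits.
    parity = ones % 2
    return (parity << 7) | code7


for ch in "aA":
    code = ord(ch)
    print(ch, format(code, "07b"), format(add_even_parity(code), "08b"))
```

A receiver that counts an odd number of 1 bits in a byte knows that a bit was flipped in transit, which is all a single parity bit can detect.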

With the recent increase of computer usage across the world, more and more people from different languages had access to a computer. That meant that, for each language, new encoding schemes had to be developed, independently from other schemes, which would conflict if read from different language terminals.
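To make the conflict concrete (again as an editorial illustration rather than part of the quoted answer), the sketch below decodes one and the same byte with several legacy single-byte encodings. The codecs chosen here are examples that ship with Python, not ones named in the answer, and `decode_with` is a hypothetical helper.

```python
def decode_with(raw, codec):
    """Interpret the same raw bytes under a given legacy 8-bit encoding."""
    return raw.decode(codec)


raw = bytes([0xE9])  # a single byte with the high (8th) bit set

# Western European, Cyrillic, and Greek code pages respectively.
decoded = {codec: decode_with(raw, codec)
           for codec in ("latin-1", "cp1251", "cp1253")}

for codec, text in decoded.items():
    print(codec, repr(text))
```

The byte never changes, only the table used to read it, which is why text written on one language's terminal could turn into nonsense on another.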

Unicode came into being as a solution to the existence of different terminals by merging all possible meaningful characters into a single abstract character set.

UTF-8 is one way to encode the Unicode character set. It is a variable-width encoding (i.e. different characters can have different sizes) and it was designed for backwards compatibility with the former ASCII scheme. As such, the ASCII character set will remain one byte in size whilst any other characters are two or more bytes in size. UTF-16 is another way to encode the Unicode character set. In comparison to UTF-8, characters are encoded as either a set of one or two 16-bit code units.
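The difference between the two encodings is easy to check. The following is a minimal sketch, not part of the quoted answer. The sample characters are chosen for illustration, `encoded_sizes` is a hypothetical helper, and `utf-16-le` is used so that Python does not prepend a byte order mark to the result.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-le" fixes the byte order, so no 2-byte BOM is added.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


samples = ["a", "\u0561", "\u20ac", "\U0001F600"]  # a, Armenian ayb, euro sign, an emoji

for ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: {utf8_len} byte(s) in UTF-8, {utf16_len} byte(s) in UTF-16")
```

ASCII characters take one byte in UTF-8 but two in UTF-16, while characters outside the Basic Multilingual Plane take four bytes in both.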

As stated in other comments, the ‘a’ character occupies a single byte while ‘ա’ occupies two bytes, denoting a UTF-8 encoding. The extra byte in the original question was due to the existence of a newline character at the end.
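The file sizes from the question can be reproduced with a short script. This is a sketch under two assumptions: the editor saved the file as UTF-8, and it appended a single line feed. `saved_size` is a hypothetical helper that writes to a temporary file.

```python
import os
import tempfile


def saved_size(text):
    """Write text to a temporary UTF-8 file and return its size in bytes."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # newline="\n" stops Python from translating "\n" to "\r\n" on Windows.
        with open(path, "w", encoding="utf-8", newline="\n") as f:
            f.write(text)
        return os.path.getsize(path)
    finally:
        os.remove(path)


print(saved_size("a\n"))          # the letter plus a trailing newline
print(saved_size("\u0561\n"))     # Armenian ayb plus a trailing newline
print("\u0561".encode("utf-8"))   # the two bytes that encode the Armenian letter
```

On a system where the editor writes Windows-style line endings, each file would be one byte larger, since the newline itself would take two bytes.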

Followed by the answer from ernie:

1 byte is 8 bits, and can thus represent up to 256 (2^8) different values.

For languages that require more possibilities than this, a simple 1 to 1 mapping can not be maintained, so more data is needed to store a character.

Note that generally, most encodings use the first 7 bits (128 values) for ASCII characters. That leaves the 8th bit, or 128 more values for more characters. Add in accented characters, Asian languages, Cyrillic, etc. and you can easily see why 1 byte is not sufficient for holding all characters.
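The arithmetic behind this can be checked directly. The sketch below is an editorial illustration with a hypothetical helper, `fits_in_one_byte`. It uses Latin-1 as one example of an 8-bit encoding that spends its upper 128 values on accented characters.

```python
def fits_in_one_byte(ch, codec):
    """True if the codec can represent ch, and does so in exactly one byte."""
    try:
        return len(ch.encode(codec)) == 1
    except UnicodeEncodeError:
        # The codec has no slot at all for this character.
        return False


print(2 ** 7, "values fit in 7 bits;", 2 ** 8, "values fit in 8 bits")

for ch, codec in [("a", "ascii"), ("\u00e9", "ascii"),
                  ("\u00e9", "latin-1"), ("\u0561", "latin-1")]:
    print(f"U+{ord(ch):04X} in {codec}: {fits_in_one_byte(ch, codec)}")
```

An accented letter such as U+00E9 fits once the eighth bit is put to use, but the Armenian letter has no slot in that 256-entry table, so an encoding that covers it alongside everything else needs more than one byte.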

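Have something to add to the explanation? Sound off in the comments. Want to read more answers from other tech-savvy Stack Exchange users? Check out the full discussion thread here.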