How computers represent text and symbols.
Computers, at their core, only understand binary - 0s and 1s. So, how do they store and display the text you're reading right now, or your name, or even emojis? The answer lies in character sets!
This lesson will explore how computers use character sets, focusing on two important ones: ASCII and Unicode.
1. In your own words, what do you think a "character set" means in the context of computers?
2. How many different, unique things can you represent with:
3. Have you ever seen strange symbols like "οΏ½" or "β–‘" when text doesn't display correctly on a website or in a document? What do you think might cause this?
Based on the introduction and the starter activity, try to write your own definition of a "character set" in the context of computers:
Here's a more formal way to think about it:
A character set is essentially a dictionary that computers use. It's a defined list of characters (like letters, numbers, punctuation marks, symbols, and even control characters like 'newline' or 'tab') that a computer system can recognise and work with.
Crucially, each character in the set is assigned a unique numerical value, called a character code. This code is often represented in binary, and it is what the computer actually stores and processes.
Key Definition:
A Character Set is a defined list of characters recognised by the computer hardware and software.
One of the earliest and most influential character sets is ASCII (the American Standard Code for Information Interchange). Originally, it used 7 bits to represent characters, allowing for 2⁷ = 128 different characters. This included uppercase and lowercase English letters, the digits 0-9, punctuation marks, and control characters.
Later, Extended ASCII became common, using 8 bits (1 byte) per character. This allowed for 2⁸ = 256 distinct characters. The extra characters were often used for accented letters (like Γ©, ΓΌ), special symbols, and box-drawing characters.
ASCII Facts:
Given that in ASCII, 'A' is decimal 65 (binary 01000001) and 'B' is decimal 66 (binary 01000010), how do you think the computer would store the word "CAB" using 8-bit ASCII?
When you type text, each character is converted into its corresponding ASCII character code. This code, which is a binary number, is what the computer stores.
Example: The word "CAB" would be stored as:
C (67) = 01000011
A (65) = 01000001
B (66) = 01000010
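To try this mapping yourself, here is a minimal Python sketch (Python is chosen because the lesson later mentions its ord() and chr() functions) that converts each character of "CAB" into its 8-bit ASCII code:

# Convert each character of a word into its 8-bit ASCII code.
word = "CAB"
for character in word:
    code = ord(character)            # character -> numeric character code, e.g. 'C' -> 67
    binary = format(code, "08b")     # pad to 8 bits, e.g. 67 -> '01000011'
    print(f"{character} ({code}) = {binary}")

# Expected output:
# C (67) = 01000011
# A (65) = 01000001
# B (66) = 01000010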
ASCII, with its 256 character limit, was great for English. But think about global communication and all the symbols we use today. What are some potential limitations of only having 256 characters?
You're right to think about it! 256 characters is nowhere near enough to represent the characters of every language in the world (e.g., Chinese, Japanese, Arabic and Cyrillic alphabets), let alone the vast array of symbols and emojis we use today. This is where Unicode comes in!
To address the limitations of ASCII and other character sets, Unicode was developed: a universal character encoding standard designed to support text in most of the world's writing systems. Its goal is to provide a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
Unicode Facts:
Unicode defines a unique number (a "code point") for every character. But if every character has a number, why do we have different "encodings" like UTF-8, UTF-16, and UTF-32? What problem might these different encodings be trying to solve?
That's a great question! The main reason for different encodings is efficiency and compatibility. How these code points are stored in bytes is determined by an encoding scheme: a method for converting Unicode code points into sequences of bytes for storage or transmission. Common ones include UTF-8 (a variable-width encoding that uses 1 to 4 bytes per character and is backward-compatible with ASCII), UTF-16 (2 or 4 bytes per character) and UTF-32 (a fixed 4 bytes per character).
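As a rough illustration of the efficiency trade-off, this Python sketch (the three characters are just example choices) encodes the same characters with each scheme and compares how many bytes they need:

# Compare how many bytes different Unicode encodings need for the same character.
for character in ["A", "β‚¬", "πŸ˜Š"]:
    sizes = {
        "UTF-8": len(character.encode("utf-8")),
        "UTF-16": len(character.encode("utf-16-be")),   # 'be' (big-endian) avoids the 2-byte byte-order mark
        "UTF-32": len(character.encode("utf-32-be")),
    }
    print(character, sizes)

# Expected output (bytes per character):
# A {'UTF-8': 1, 'UTF-16': 2, 'UTF-32': 4}
# β‚¬ {'UTF-8': 3, 'UTF-16': 2, 'UTF-32': 4}
# πŸ˜Š {'UTF-8': 4, 'UTF-16': 4, 'UTF-32': 4}

Notice how UTF-8 stores plain English letters in a single byte, which is one reason it is so widely used on the web.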
Let's see how a character might be represented:
Character: A
ASCII (8-bit): 01000001 (Decimal: 65)
Unicode (UTF-16, as U+0041): 00000000 01000001 (Hex: 0041)
Character: β‚¬ (Euro Sign)
ASCII (8-bit): Not directly representable in standard 7-bit or common 8-bit ASCII. Some extended ASCII versions might include it at a specific code.
Unicode (UTF-16, as U+20AC): 00100000 10101100 (Hex: 20AC)
Character: πŸ˜Š (Smiling Face Emoji)
ASCII (8-bit): Definitely not representable!
Unicode (UTF-16, as U+1F60A): 1101100000111101 1101111000001010 (Hex: D83D DE0A - this is a surrogate pair in UTF-16)
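If you want to check these values yourself, the following Python sketch prints the code point and the UTF-16 bytes for each character above; the hex output should match the examples:

# Show the Unicode code point and UTF-16 byte sequence for each example character.
for character in ["A", "β‚¬", "πŸ˜Š"]:
    code_point = ord(character)
    utf16_bytes = character.encode("utf-16-be")           # big-endian, no byte-order mark
    hex_bytes = " ".join(f"{b:02X}" for b in utf16_bytes)
    print(f"{character}: U+{code_point:04X} -> UTF-16 bytes {hex_bytes}")

# Expected output:
# A: U+0041 -> UTF-16 bytes 00 41
# β‚¬: U+20AC -> UTF-16 bytes 20 AC
# πŸ˜Š: U+1F60A -> UTF-16 bytes D8 3D DE 0A  (a surrogate pair)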
Test your understanding of the differences between ASCII and Unicode.
1. Which character set typically uses 8 bits and can represent up to 256 characters?
2. Which character set is designed to represent characters from almost all writing systems in the world, including emojis?
3. If a character set uses 16 bits for each character, how many unique characters can it represent?
4. Is Unicode a superset of ASCII (meaning it includes all ASCII characters)?
Remember that each ASCII character has a unique 8-bit binary code. Try inputting an 8-bit binary number below to see which character it represents!
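You can recreate that idea in Python with a couple of lines; the binary string below is just an example input:

# Interpret an 8-bit binary string as an ASCII character code.
binary_input = "01000011"                 # example input; try your own 8-bit pattern
code = int(binary_input, 2)               # binary string -> integer (here 67)
print(f"{binary_input} = {code} = '{chr(code)}'")   # -> 01000011 = 67 = 'C'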
Look at the text snippets below. For each one, decide if it could be fully represented by standard 8-bit ASCII, or if Unicode would be necessary.
Hello World!
Can this be represented by 8-bit ASCII?
RΓ©sumΓ© & CafΓ©
Can this be represented by 8-bit ASCII?
δ½ ε₯½δΈ–η•Œ 😊🌍
Can this be represented by 8-bit ASCII?
Consider the string: "Code Club πŸ’»"
Think about how much storage this string might take up using different character sets/encodings. This isn't a direct calculation task for now, but a conceptual one.
1. Using 8-bit ASCII:
2. Using Unicode (e.g., UTF-8 encoding):
Discussion Points:
For 8-bit ASCII: the ten letters and spaces in "Code Club " could each be stored in 1 byte (10 bytes in total), but the emoji has no ASCII code at all, so the string cannot be stored completely.
For Unicode (UTF-8): the ten ASCII characters still take 1 byte each, while the emoji takes 4 bytes, so the whole string needs about 14 bytes.
This shows that Unicode (particularly UTF-8) can be very efficient for text that is mostly ASCII characters, while still being able to represent a much wider range of characters; the non-ASCII characters simply take a few more bytes each.
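To make the comparison concrete, here is a small Python sketch that counts the bytes the example string needs in UTF-8 (the laptop emoji is assumed from the string above; any emoji outside the 16-bit range behaves the same way):

# Count how many bytes the example string needs in UTF-8.
text = "Code Club πŸ’»"                       # 10 ASCII characters plus one emoji (assumed)
utf8_bytes = text.encode("utf-8")
print(len(text), "characters")             # -> 11 characters
print(len(utf8_bytes), "bytes in UTF-8")   # -> 14 bytes (10 x 1 byte + 4 bytes for the emoji)

# Plain 8-bit ASCII could store the 10 letters and spaces in 10 bytes,
# but it has no code at all for the emoji.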
Computers use numerical codes for every character. Let's explore this! This is similar to how Python's ord() (character to code) and chr() (code to character) functions work.
Note: The "Code Point to Character" part will show special messages for non-printable control characters (like ASCII 0-31, 127 and Unicode 128-159).
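A rough Python sketch of that behaviour, using the control-character ranges mentioned in the note, might look like this:

# Convert a code point to a character, flagging non-printable control codes.
def describe_code_point(code):
    if 0 <= code <= 31 or code == 127 or 128 <= code <= 159:
        return f"Code {code} is a non-printable control character"
    return f"Code {code} is the character '{chr(code)}'"

print(describe_code_point(65))     # -> Code 65 is the character 'A'
print(describe_code_point(10))     # -> Code 10 is a non-printable control character (10 is newline)
print(describe_code_point(8364))   # -> Code 8364 is the character 'β‚¬'  (8364 is U+20AC)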
Essentially, any time a computer needs to store, process, or display textual information, character sets are involved!
Myth: ASCII is completely outdated and no longer used.
Buster: While Unicode (especially UTF-8) is dominant for general text, ASCII is still foundational. The first 128 characters of Unicode are identical to ASCII. This means UTF-8 is backward-compatible with ASCII for these characters. Also, some simple systems, embedded devices, or older protocols might still use plain ASCII or Extended ASCII for efficiency or legacy reasons.
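You can see that backward compatibility directly in Python: encoding plain English text as UTF-8 produces exactly the same byte values as its ASCII codes (a tiny sketch with an arbitrary example string):

# For characters in the ASCII range, UTF-8 bytes are identical to the ASCII codes.
text = "Hello"
print(text.encode("utf-8"))         # -> b'Hello'
print(text.encode("ascii"))         # -> b'Hello' (exactly the same bytes)
print(list(text.encode("utf-8")))   # -> [72, 101, 108, 108, 111]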
Myth: All Unicode characters take up 16 bits (2 bytes) of storage.
Buster: Not necessarily! Unicode is a standard that assigns a unique number (code point) to each character. How that number is stored in bytes depends on the Unicode encoding used: UTF-8, UTF-16 and UTF-32 each define a different way of mapping code points to byte sequences.
Myth: You need to install different fonts to see different languages.
Buster: While fonts are necessary to *display* characters, the ability to *represent* characters from different languages is the job of the character set (like Unicode). A font is a collection of graphical representations (glyphs) for characters. If your system has a font that includes the glyphs for the Unicode characters in a document, it can display them. If not, you might see placeholder boxes (like β–‘) or question marks, even if the underlying Unicode data is correct. Modern operating systems come with fonts that cover a very wide range of Unicode characters.
1. Define the term 'character set'. (1 mark)
Answer: A defined list of characters (1) recognised by computer hardware/software (and each character has a unique code).
(Accept variations like: A set of symbols/letters/numbers that a computer can understand/represent.)
2. Explain one reason why Unicode was introduced. (2 marks)
Answer (any one of these for 2 marks, or similar points):
ASCII/Extended ASCII can only represent 128/256 characters (1), which is not enough to cover the characters of all the world's languages, symbols and emojis (1).
Unicode was introduced to give every character in every writing system a unique code (1), so text can be represented consistently across platforms, programs and languages (1).
3. A computer uses 8-bit ASCII to store text. The ASCII code for 'C' is 67 and for 'a' is 97. Show the 8-bit binary representation for the word "Ca". (2 marks)
Answer:
C (67) = 01000011 (1 mark)
a (97) = 01100001 (1 mark)
So, "Ca" = 0100001101100001 (or 01000011 01100001)
(Award one mark for the correct 8-bit binary code of each character. Deduct if the codes are not written as 8 bits each, since the question specifies 8-bit ASCII.)
4. Compare ASCII and Unicode in terms of the number of characters they can represent and their suitability for representing text in different languages. (4 marks)
Answer (award marks for points like these, up to 4):
ASCII uses 7 bits (or 8 bits in its extended form), so it can represent at most 128 (or 256) characters (1).
Unicode assigns a unique code point to every character and can represent a far greater number of characters, well over 100,000 (1).
ASCII is only really suitable for English text, basic punctuation and control characters (1).
Unicode can represent text in almost every writing system (e.g., Chinese, Arabic, Cyrillic) as well as symbols and emojis, making it suitable for international text (1).
Unicode is a superset of ASCII: its first 128 code points match ASCII exactly (1).
Now that you understand how text is represented, you might want to explore: