How computers represent text and symbols.
Computers, at their core, only understand binary - 0s and 1s. So, how do they store and display the text you're reading right now, or your name, or even emojis? The answer lies in character sets!
This lesson will explore how computers use character sets, focusing on two important ones: ASCII and Unicode.
1. In your own words, what do you think a "character set" means in the context of computers?
2. How many different, unique things can you represent with:
3. Have you ever seen strange symbols like "οΏ½" or "β–‘" when text doesn't display correctly on a website or in a document? What do you think might cause this?
Based on the introduction and the starter activity, try to write your own definition of a "character set" in the context of computers:
Here's a more formal way to think about it:
A character set is essentially a dictionary that computers use. It's a defined list of characters (like letters, numbers, punctuation marks, symbols, and even control characters like 'newline' or 'tab') that a computer system can recognise and work with.
Crucially, each character in the set is assigned a unique numerical value, called a character code. This code is often represented in binary, and it is what the computer actually stores and processes.
Key Definition:
A Character Set is a defined list of characters recognised by the computer hardware and software.
One of the earliest and most influential character sets is ASCII (the American Standard Code for Information Interchange). Originally, it used 7 bits to represent characters, allowing for 2⁷ = 128 different characters. This included uppercase and lowercase English letters, the digits 0-9, punctuation marks, and control characters.
Later, Extended ASCII became common, using 8 bits (1 byte) per character. This allowed for 2⁸ = 256 distinct characters. The extra characters were often used for accented letters (like Γ©, ΓΌ), special symbols, and box-drawing characters.
ASCII Facts:
Given that in ASCII, 'A' is decimal 65 (binary 01000001) and 'B' is decimal 66 (binary 01000010), how do you think the computer would store the word "CAB" using 8-bit ASCII?
When you type text, each character is converted into its corresponding ASCII character code. This code, which is a binary number, is what the computer stores.
Example: The word "CAB" would be stored as:
C (67) = 01000011
A (65) = 01000001
B (66) = 01000010
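To try this mapping yourself, here is a minimal Python sketch (Python is chosen because the lesson later mentions its ord() and chr() functions) that converts each character of "CAB" into its 8-bit ASCII code:

# Convert each character of a word into its 8-bit ASCII code.
word = "CAB"
for character in word:
    code = ord(character)            # character -> numeric character code, e.g. 'C' -> 67
    binary = format(code, "08b")     # pad to 8 bits, e.g. 67 -> '01000011'
    print(f"{character} ({code}) = {binary}")

# Expected output:
# C (67) = 01000011
# A (65) = 01000001
# B (66) = 01000010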
ASCII, with its 256 character limit, was great for English. But think about global communication and all the symbols we use today. What are some potential limitations of only having 256 characters?
You're right to think about it! 256 characters is nowhere near enough to represent the characters of every language in the world (e.g., Chinese, Japanese, Arabic and Cyrillic alphabets), let alone the vast array of symbols and emojis we use today. This is where Unicode comes in!
To address the limitations of ASCII and other character sets, Unicode was developed: a universal character encoding standard designed to support text in most of the world's writing systems. Its goal is to provide a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
Unicode Facts:
Unicode defines a unique number (a "code point") for every character. But if every character has a number, why do we have different "encodings" like UTF-8, UTF-16, and UTF-32? What problem might these different encodings be trying to solve?
That's a great question! The main reason for different encodings is efficiency and compatibility. How these code points are stored in bytes is determined by an encoding scheme: a method for converting Unicode code points into sequences of bytes for storage or transmission. Common ones include UTF-8 (a variable-width encoding that uses 1 to 4 bytes per character and is backward-compatible with ASCII), UTF-16 (2 or 4 bytes per character) and UTF-32 (a fixed 4 bytes per character).
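As a rough illustration of the efficiency trade-off, this Python sketch (the three characters are just example choices) encodes the same characters with each scheme and compares how many bytes they need:

# Compare how many bytes different Unicode encodings need for the same character.
for character in ["A", "β‚¬", "πŸ˜Š"]:
    sizes = {
        "UTF-8": len(character.encode("utf-8")),
        "UTF-16": len(character.encode("utf-16-be")),   # 'be' (big-endian) avoids the 2-byte byte-order mark
        "UTF-32": len(character.encode("utf-32-be")),
    }
    print(character, sizes)

# Expected output (bytes per character):
# A {'UTF-8': 1, 'UTF-16': 2, 'UTF-32': 4}
# β‚¬ {'UTF-8': 3, 'UTF-16': 2, 'UTF-32': 4}
# πŸ˜Š {'UTF-8': 4, 'UTF-16': 4, 'UTF-32': 4}

Notice how UTF-8 stores plain English letters in a single byte, which is one reason it is so widely used on the web.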
Let's see how a character might be represented:
Character: A
ASCII (8-bit): 01000001 (Decimal: 65)
Unicode (UTF-16, as U+0041): 00000000 01000001 (Hex: 0041)
Character: β‚¬ (Euro Sign)
ASCII (8-bit): Not directly representable in standard 7-bit or common 8-bit ASCII. Some extended ASCII versions might include it at a specific code.
Unicode (UTF-16, as U+20AC): 00100000 10101100 (Hex: 20AC)
Character: πŸ˜Š (Smiling Face Emoji)
ASCII (8-bit): Definitely not representable!
Unicode (UTF-16, as U+1F60A): 1101100000111101 1101111000001010 (Hex: D83D DE0A - this is a surrogate pair in UTF-16)
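If you want to check these values yourself, the following Python sketch prints the code point and the UTF-16 bytes for each character above; the hex output should match the examples:

# Show the Unicode code point and UTF-16 byte sequence for each example character.
for character in ["A", "β‚¬", "πŸ˜Š"]:
    code_point = ord(character)
    utf16_bytes = character.encode("utf-16-be")           # big-endian, no byte-order mark
    hex_bytes = " ".join(f"{b:02X}" for b in utf16_bytes)
    print(f"{character}: U+{code_point:04X} -> UTF-16 bytes {hex_bytes}")

# Expected output:
# A: U+0041 -> UTF-16 bytes 00 41
# β‚¬: U+20AC -> UTF-16 bytes 20 AC
# πŸ˜Š: U+1F60A -> UTF-16 bytes D8 3D DE 0A  (a surrogate pair)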
Test your understanding of the differences between ASCII and Unicode.
1. Which character set typically uses 8 bits and can represent up to 256 characters?
2. Which character set is designed to represent characters from almost all writing systems in the world, including emojis?
3. If a character set uses 16 bits for each character, how many unique characters can it represent?
4. Is Unicode a superset of ASCII (meaning it includes all ASCII characters)?
Remember that each ASCII character has a unique 8-bit binary code. Try inputting an 8-bit binary number below to see which character it represents!
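You can recreate that idea in Python with a couple of lines; the binary string below is just an example input:

# Interpret an 8-bit binary string as an ASCII character code.
binary_input = "01000011"                 # example input; try your own 8-bit pattern
code = int(binary_input, 2)               # binary string -> integer (here 67)
print(f"{binary_input} = {code} = '{chr(code)}'")   # -> 01000011 = 67 = 'C'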
Look at the text snippets below. For each one, decide if it could be fully represented by standard 8-bit ASCII, or if Unicode would be necessary.
Hello World!
Can this be represented by 8-bit ASCII?
RΓ©sumΓ© & CafΓ©
Can this be represented by 8-bit ASCII?
δ½ ε₯½δΈ–η•Œ 😊🌍
Can this be represented by 8-bit ASCII?
Consider the string: "Code Club πŸ’»"
Think about how much storage this string might take up using different character sets/encodings. This isn't a direct calculation task for now, but a conceptual one.
1. Using 8-bit ASCII:
2. Using Unicode (e.g., UTF-8 encoding):
Discussion Points:
For 8-bit ASCII: the ten letters and spaces in "Code Club " could each be stored in 1 byte (10 bytes in total), but the emoji has no ASCII code at all, so the string cannot be stored completely.
For Unicode (UTF-8): the ten ASCII characters still take 1 byte each, while the emoji takes 4 bytes, so the whole string needs about 14 bytes.
This shows that Unicode (particularly UTF-8) can be very efficient for text that is mostly ASCII characters, while still being able to represent a much wider range of characters; the non-ASCII characters simply take a few more bytes each.
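To make the comparison concrete, here is a small Python sketch that counts the bytes the example string needs in UTF-8 (the laptop emoji is assumed from the string above; any emoji outside the 16-bit range behaves the same way):

# Count how many bytes the example string needs in UTF-8.
text = "Code Club πŸ’»"                       # 10 ASCII characters plus one emoji (assumed)
utf8_bytes = text.encode("utf-8")
print(len(text), "characters")             # -> 11 characters
print(len(utf8_bytes), "bytes in UTF-8")   # -> 14 bytes (10 x 1 byte + 4 bytes for the emoji)

# Plain 8-bit ASCII could store the 10 letters and spaces in 10 bytes,
# but it has no code at all for the emoji.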
Computers use numerical codes for every character. Let's explore this! This is similar to how Python's ord() (character to code) and chr() (code to character) functions work.
Note: The "Code Point to Character" part will show special messages for non-printable control characters (like ASCII 0-31, 127 and Unicode 128-159).
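A rough Python sketch of that behaviour, using the control-character ranges mentioned in the note, might look like this:

# Convert a code point to a character, flagging non-printable control codes.
def describe_code_point(code):
    if 0 <= code <= 31 or code == 127 or 128 <= code <= 159:
        return f"Code {code} is a non-printable control character"
    return f"Code {code} is the character '{chr(code)}'"

print(describe_code_point(65))     # -> Code 65 is the character 'A'
print(describe_code_point(10))     # -> Code 10 is a non-printable control character (10 is newline)
print(describe_code_point(8364))   # -> Code 8364 is the character 'β‚¬'  (8364 is U+20AC)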
Essentially, any time a computer needs to store, process, or display textual information, character sets are involved!
Myth: ASCII is completely outdated and no longer used.
Buster: While Unicode (especially UTF-8) is dominant for general text, ASCII is still foundational. The first 128 characters of Unicode are identical to ASCII. This means UTF-8 is backward-compatible with ASCII for these characters. Also, some simple systems, embedded devices, or older protocols might still use plain ASCII or Extended ASCII for efficiency or legacy reasons.
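You can see that backward compatibility directly in Python: encoding plain English text as UTF-8 produces exactly the same byte values as its ASCII codes (a tiny sketch with an arbitrary example string):

# For characters in the ASCII range, UTF-8 bytes are identical to the ASCII codes.
text = "Hello"
print(text.encode("utf-8"))         # -> b'Hello'
print(text.encode("ascii"))         # -> b'Hello' (exactly the same bytes)
print(list(text.encode("utf-8")))   # -> [72, 101, 108, 108, 111]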
Myth: All Unicode characters take up 16 bits (2 bytes) of storage.
Buster: Not necessarily! Unicode is a standard that assigns a unique number (code point) to each character. How that number is stored in bytes depends on the Unicode encoding used: UTF-8, UTF-16 and UTF-32 each define a different way of mapping code points to byte sequences.
Myth: You need to install different fonts to see different languages.
Buster: While fonts are necessary to *display* characters, the ability to *represent* characters from different languages is the job of the character set (like Unicode). A font is a collection of graphical representations (glyphs) for characters. If your system has a font that includes the glyphs for the Unicode characters in a document, it can display them. If not, you might see placeholder boxes (like β–‘) or question marks, even if the underlying Unicode data is correct. Modern operating systems come with fonts that cover a very wide range of Unicode characters.
1. Define the term 'character set'. (1 mark)
Answer: A defined list of characters (1) recognised by computer hardware/software (and each character has a unique code).
(Accept variations like: A set of symbols/letters/numbers that a computer can understand/represent.)
2. Explain one reason why Unicode was introduced. (2 marks)
Answer (any one of these for 2 marks, or similar points):
ASCII/Extended ASCII can only represent 128/256 characters (1), which is not enough to cover the characters of all the world's languages, symbols and emojis (1).
Unicode was introduced to give every character in every writing system a unique code (1), so text can be represented consistently across platforms, programs and languages (1).
3. A computer uses 8-bit ASCII to store text. The ASCII code for 'C' is 67 and for 'a' is 97. Show the 8-bit binary representation for the word "Ca". (2 marks)
Answer:
C (67) = 01000011 (1 mark)
a (97) = 01100001 (1 mark)
So, "Ca" = 0100001101100001 (or 01000011 01100001)
(Award one mark for the correct 8-bit binary code of each character. Deduct if the codes are not written as 8 bits each, since the question specifies 8-bit ASCII.)
4. Compare ASCII and Unicode in terms of the number of characters they can represent and their suitability for representing text in different languages. (4 marks)
Answer (award marks for points like these, up to 4):
ASCII uses 7 bits (or 8 bits in its extended form), so it can represent at most 128 (or 256) characters (1).
Unicode assigns a unique code point to every character and can represent a far greater number of characters, well over 100,000 (1).
ASCII is only really suitable for English text, basic punctuation and control characters (1).
Unicode can represent text in almost every writing system (e.g., Chinese, Arabic, Cyrillic) as well as symbols and emojis, making it suitable for international text (1).
Unicode is a superset of ASCII: its first 128 code points match ASCII exactly (1).
Now that you understand how text is represented, you might want to explore: