
Understanding Character Sets: ASCII & Unicode

How computers represent text and symbols.

Welcome!

Computers, at their core, only understand binary - 0s and 1s. So, how do they store and display the text you're reading right now, or your name, or even emojis? The answer lies in character sets: a defined list of characters recognised by the computer hardware and software, where each character is assigned a unique numerical code!

This lesson will explore how computers use character sets, focusing on two important ones: ASCII and Unicode.

Lesson Outcomes:

  • Define what a 'character set' is.
  • Explain how ASCII (American Standard Code for Information Interchange, an early character set standard typically using 7 or 8 bits per character) is used to represent text and discuss its limitations.
  • Explain how Unicode (a universal character encoding standard designed to support text in most of the world's writing systems, including emojis) is used to represent text and its advantages over ASCII.
  • Compare the key features of ASCII and Unicode.
  • Understand how character codes (binary numbers) are used to represent individual characters.

Starter Activity: Warm-up Your Brain!

1. In your own words, what do you think a "character set" means in the context of computers?

2. How many different, unique things can you represent with:

  • 1 bit?
  • 2 bits?
  • 3 bits?

3. Have you ever seen strange symbols like "�" or "▯" when text doesn't display correctly on a website or in a document? What do you think might cause this?

Task 1: Defining "Character Set"

Based on the introduction and the starter activity, try to write your own definition of a "character set" in the context of computers:

Task 2: ASCII (American Standard Code for Information Interchange)

One of the earliest and most influential character sets is ASCII. Originally, it used 7 bits to represent characters, allowing for 2⁷ = 128 different characters. This included uppercase and lowercase English letters, digits 0-9, punctuation marks, and control characters.

Later, Extended ASCII became common, using 8 bits (1 byte) per character. This allowed for 2⁸ = 256 distinct characters. The extra characters were often used for accented letters (like é, ü), special symbols, and box-drawing characters.

ASCII Facts:

  • Typically uses 8 bits (1 byte) per character in its extended form.
  • Can represent 2⁸ = 256 distinct characters.
  • Contains characters for English and some other Western European languages.
  • Each character is assigned a unique character code (a binary number). For example, 'A' is 65 (binary 01000001), 'B' is 66 (01000010), 'a' is 97 (01100001).
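
The character codes in the last bullet are easy to check with Python's built-in ord() function. Here is a minimal sketch (any standard Python 3 interpreter will do; it is not part of the lesson's interactive tools):

# Look up the numeric code for each character, then show it as 8-bit binary
for ch in "ABa":
    code = ord(ch)                        # character -> code, e.g. 'A' -> 65
    print(ch, code, format(code, "08b"))  # "08b" pads the binary to 8 digits

Running it prints A 65 01000001, B 66 01000010 and a 97 01100001, matching the codes listed above.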

Interactive: ASCII Lookup & Text to Binary

Challenge: How ASCII Represents Text

Given that in ASCII, 'A' is decimal 65 (binary 01000001) and 'B' is decimal 66 (binary 01000010), how do you think the computer would store the word "CAB" using 8-bit ASCII?
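
Once you have had a go yourself, one way to check your answer is to let Python do the same lookup. This is only a quick sketch using the built-in ord() function:

# Convert each letter of "CAB" to its 8-bit ASCII code
word = "CAB"
print(" ".join(format(ord(ch), "08b") for ch in word))
# prints: 01000011 01000001 01000010  (C = 67, A = 65, B = 66)

The computer simply stores those three 8-bit codes one after another.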

Brainstorm: Limitations of ASCII

ASCII, with its 256 character limit, was great for English. But think about global communication and all the symbols we use today. What are some potential limitations of only having 256 characters?

Task 3: Unicode - A Universal Solution

To address the limitations of ASCII and other character sets, Unicode (a universal character encoding standard designed to support text in most of the world's writing systems) was developed. Its goal is to provide a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

Unicode Facts:

  • Can use 16, 24, or 32 bits (or more) per character, allowing for a massive number of characters. Encodings like UTF-16 use 16 bits (2 bytes) per character, UTF-32 uses 32 bits (4 bytes), and UTF-8 is variable width.
    • e.g., 16-bit Unicode (like basic UTF-16) can represent 2¹⁶ = 65,536 characters.
    • Modern Unicode can represent over a million characters!
  • Designed to be a superset of ASCII: the first 128 characters of Unicode are identical to the ASCII character set, ensuring compatibility and making transition easier.
  • Can represent characters from virtually all known writing systems (Latin, Greek, Cyrillic, Arabic, Hebrew, Chinese, Japanese, Korean, and many more), plus a vast collection of symbols and emojis. Yes, emojis are part of the Unicode standard, and each one has its own unique code point! 👍🎉😊
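
To get a feel for how far beyond ASCII these code points go, here is a tiny Python sketch (ord() works on any Unicode character, not just ASCII):

# An emoji's code point is far larger than anything 8-bit ASCII can hold
print(ord("😊"))       # 128522
print(hex(ord("😊")))  # 0x1f60a, usually written U+1F60A

128522 is well outside ASCII's 0-255 range, and even outside the 65,536 values a single 16-bit code can hold.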

Why Different Unicode Encodings? (UTF-8, UTF-16, UTF-32)

Unicode defines a unique number (a "code point") for every character. But if every character has a number, why do we have different "encodings" like UTF-8, UTF-16, and UTF-32? What problem might these different encodings be trying to solve?

Interactive: Character Representation

Let's see how a character might be represented:

Character: A

ASCII (8-bit): 01000001 (Decimal: 65)

Unicode (UTF-16, as U+0041): 00000000 01000001 (Hex: 0041)


Character: € (Euro Sign)

ASCII (8-bit): Not directly representable in standard 7-bit or common 8-bit ASCII. Some extended ASCII versions might include it at a specific code.

Unicode (UTF-16, as U+20AC): 00100000 10101100 (Hex: 20AC)


Character: 😊 (Smiling Face Emoji)

ASCII (8-bit): Definitely not representable!

Unicode (UTF-16, as U+1F60A): 1101100000111101 1101111000001010 (Hex: D83D DE0A - this is a surrogate pair in UTF-16)
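
If you are curious where that surrogate pair comes from, the splitting rule is defined by the UTF-16 standard itself and is easy to reproduce. A short Python sketch (the constants 0x10000, 0xD800 and 0xDC00 come from the UTF-16 specification):

# Split U+1F60A into a UTF-16 surrogate pair
cp = 0x1F60A - 0x10000        # offset above the Basic Multilingual Plane
high = 0xD800 + (cp >> 10)    # top 10 bits -> high surrogate
low = 0xDC00 + (cp & 0x3FF)   # bottom 10 bits -> low surrogate
print(hex(high), hex(low))    # 0xd83d 0xde0a

print("😊".encode("utf-16-be").hex())  # 'd83dde0a', the same pair straight from Python

Characters inside the Basic Multilingual Plane (like 'A' and '€' above) need no surrogate pair and fit in a single 16-bit unit.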

Task 4: ASCII vs. Unicode - Quick Quiz!

Test your understanding of the differences between ASCII and Unicode.

1. Which character set typically uses 8 bits and can represent up to 256 characters?

2. Which character set is designed to represent characters from almost all writing systems in the world, including emojis?

3. If a character set uses 16 bits for each character, how many unique characters can it represent?

4. Is Unicode a superset of ASCII (meaning it includes all ASCII characters)?

Task 5: Binary to ASCII Character

Remember that each ASCII character has a unique 8-bit binary code. Try inputting an 8-bit binary number below to see which character it represents!
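
The same conversion can be tried in a couple of lines of Python; this is just an illustrative sketch:

# Turn an 8-bit binary string back into its character
binary = "01000001"
print(chr(int(binary, 2)))  # int(..., 2) reads the string as binary; chr() gives 'A'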

Task 6: Which Character Set is Needed?

Look at the text snippets below. For each one, decide if it could be fully represented by standard 8-bit ASCII, or if Unicode would be necessary.

Hello World!

Can this be represented by 8-bit ASCII?

RΓ©sumΓ© & CafΓ©

Can this be represented by 8-bit ASCII?

δ½ ε₯½δΈ–η•Œ πŸ˜ŠπŸ‘

Can this be represented by 8-bit ASCII?

Task 7: Storage Size - Food for Thought!

Consider the string: "Code Club πŸ’»"

Think about how much storage this string might take up using different character sets/encodings. This isn't a direct calculation task for now, but a conceptual one.

1. Using 8-bit ASCII:

  • How many characters in "Code Club " can be represented by standard ASCII?
  • What happens to the 'πŸ’»' emoji? Can standard ASCII represent it?
  • Roughly, how many bytes would "Code Club " (without the emoji) take in ASCII?

2. Using Unicode (e.g., UTF-8 encoding):

  • UTF-8 uses 1 byte for standard English/ASCII characters. How many bytes for "Code Club "?
  • Emojis in UTF-8 often take 3 or 4 bytes. If 'πŸ’»' takes 4 bytes, what's the total for the whole string "Code Club πŸ’»" in UTF-8?
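
If you want to check those estimates, Python can count the bytes directly. A minimal sketch (str.encode() returns the actual byte sequence for a chosen encoding):

# Count characters and UTF-8 bytes for the string
text = "Code Club 💻"
print(len(text))                  # 11 characters (10 ASCII characters + 1 emoji)
print(len(text.encode("utf-8")))  # 14 bytes: 10 x 1 byte + 4 bytes for 💻

So the ASCII-only part "Code Club " takes 10 bytes, and the emoji adds another 4 in UTF-8.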

Task 8: Interactive Character Code Explorer

Computers use numerical codes for every character. Let's explore this! This is similar to how Python's ord() (character to code) and chr() (code to character) functions work.
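
As a quick taste of those two functions, here is a small Python sketch showing the round trip in both directions (the U+ notation is just the code point written in hexadecimal):

# Character -> code point, and code point -> character
print(ord("€"))             # 8364, the code point as a decimal number
print(f"U+{ord('€'):04X}")  # U+20AC, the same number in hexadecimal
print(chr(0x20AC))          # €, back from the code point to the character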

Character to Code Point

Code Point to Character

Challenge: String to Code Points

Note: The "Code Point to Character" part will show special messages for non-printable control characters (like ASCII 0-31, 127 and Unicode 128-159).

Real-World Context & Importance

Essentially, any time a computer needs to store, process, or display textual information, character sets are involved!

Character Set Myth Busters!

Myth: ASCII is completely outdated and no longer used.

Buster: While Unicode (especially UTF-8) is dominant for general text, ASCII is still foundational. The first 128 characters of Unicode are identical to ASCII. This means UTF-8 is backward-compatible with ASCII for these characters. Also, some simple systems, embedded devices, or older protocols might still use plain ASCII or Extended ASCII for efficiency or legacy reasons.

Myth: All Unicode characters take up 16 bits (2 bytes) of storage.

Buster: Not necessarily! Unicode is a standard that assigns a unique number (code point) to each character. How that number is stored in bytes depends on the Unicode encoding used: methods like UTF-8, UTF-16, and UTF-32 define how code points are mapped to byte sequences.

  • UTF-8 is a variable-width encoding: ASCII characters use 1 byte, while other characters can use 2, 3, or 4 bytes. This is very efficient for documents that are mostly English/ASCII.
  • UTF-16 typically uses 2 bytes for most common characters, but can use 4 bytes (a surrogate pair) for characters outside the Basic Multilingual Plane (BMP).
  • UTF-32 uses a fixed 4 bytes for every character.
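
You can see these differences directly in Python. A short sketch comparing byte counts (the "-be" encodings are used only so no byte-order mark is added to the count):

# Bytes needed per character in each encoding
for ch in ["A", "é", "€", "😊"]:
    print(ch,
          len(ch.encode("utf-8")),      # 1, 2, 3, 4 bytes
          len(ch.encode("utf-16-be")),  # 2, 2, 2, 4 bytes
          len(ch.encode("utf-32-be")))  # always 4 bytes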

Myth: You need to install different fonts to see different languages.

Buster: While fonts are necessary to *display* characters, the ability to *represent* characters from different languages is the job of the character set (like Unicode). A font is a collection of graphical representations (glyphs) for characters. If your system has a font that includes the glyphs for the Unicode characters in a document, it can display them. If not, you might see placeholder boxes (like ☐) or question marks, even if the underlying Unicode data is correct. Modern operating systems come with fonts that cover a very wide range of Unicode characters.

Exam-Style Questions

1. Define the term 'character set'. (1 mark)

Answer: A defined list of characters (1) recognised by computer hardware/software (and each character has a unique code).
(Accept variations like: A set of symbols/letters/numbers that a computer can understand/represent.)

2. Explain one reason why Unicode was introduced. (2 marks)

Answer (any one of these for 2 marks, or similar points):

  • ASCII (or other older character sets) could not represent enough characters (1) for all world languages/symbols/emojis (1).
  • To provide a universal standard (1) that could represent characters from (almost) all languages (1).
  • To overcome the limitation of ASCII's small character capacity (1) which was insufficient for global communication/software (1).

3. A computer uses 8-bit ASCII to store text. The ASCII code for 'C' is 67 and for 'a' is 97. Show the 8-bit binary representation for the word "Ca". (2 marks)

Answer:

C (67) = 01000011 (1 mark)

a (97) = 01100001 (1 mark)

So, "Ca" = 0100001101100001 (or 01000011 01100001)

(Award marks for the correct binary for each character. The question specifies 8-bit ASCII, so each code must be shown as 8 bits.)

4. Compare ASCII and Unicode in terms of the number of characters they can represent and their suitability for representing text in different languages. (4 marks)

Answer (award marks for points like these, up to 4):

  • Number of Characters:
    • ASCII (8-bit) can represent 256 characters. (1 mark)
    • Unicode can represent a vastly larger number of characters (e.g., over a million / 65,536 for 16-bit). (1 mark)
  • Suitability for Languages:
    • ASCII is suitable for English and some Western European languages but limited for others. (1 mark)
    • Unicode is suitable for representing characters from (almost) all world languages, including complex scripts and emojis. (1 mark)
  • (Also accept: ASCII uses fewer bits per character (typically 1 byte) vs Unicode which can use more (e.g. 2 or 4 bytes, or variable with UTF-8). Unicode is a superset of ASCII.)

Lesson Summary: Key Takeaways

Where to Next?

Now that you understand how text is represented, you might want to explore:

Going Further: Extension Activities