
2.1.3 Character Encoding (ASCII, Unicode, UTF-8)

Okay, let's unlock another secret of how computers understand our language: Character Encoding!

2.1.3 Character Encoding: How Computers Read Your Words

You're reading these words right now, but how does your computer know that the pattern of 0s and 1s it stores is supposed to show you the letter 'A' or the symbol '?'? It's because of something called character encoding.

Think of character encoding like a giant secret codebook or a dictionary that both your computer and the programs it runs agree to use. This codebook assigns a unique number to every letter, number, symbol, and even emojis you see on your screen. When you type a letter, the computer looks it up in the codebook, finds its number, and then stores that number in binary (as 0s and 1s). When the computer needs to show you that letter, it looks up the binary number, finds the corresponding character in the codebook, and displays it!
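Here's a tiny sketch of that round trip in Python (just one way to peek at the codebook; Python's built-in ord() and chr() functions do the lookup for us):

  # The "codebook" lookup, in both directions:
  letter = 'A'
  number = ord(letter)    # character -> number: 'A' is 65
  print(bin(number))      # 0b1000001 -- the 0s and 1s actually stored
  print(chr(number))      # number -> character: 65 back to 'A'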

Let's look at the most important "codebooks" computers use:

ASCII (American Standard Code for Information Interchange): The Original English Code

  • What it is: ASCII was one of the very first and most widely used character encoding systems, created in the 1960s.
  • How it works: It's quite simple! ASCII uses a 7-bit code, which means it can represent 2⁷ = 128 different characters (see the short sketch after this list). These 128 characters include:
    • All uppercase English letters (A-Z)
    • All lowercase English letters (a-z)
    • Numbers (0-9)
    • Common punctuation marks (like ?, !, .)
    • Some special control characters (like the "Enter" key or "Tab" key)
  • Why it was great: For a long time, especially when computers were mostly used in English-speaking countries, ASCII worked perfectly. It was efficient because each character took up just one byte of memory (since computers usually work with bytes, and 7 bits fit neatly into an 8-bit byte).
  • Its Limitation: The problem with ASCII is that 128 characters simply aren't enough for all the languages in the world! It couldn't handle letters with accents (like é or ü), characters from languages like Chinese or Arabic, or even cool symbols like a copyright sign (©).
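Here's a minimal sketch (Python again) of a few ASCII codes; notice that every one of them fits into 7 bits, while an accented letter falls outside the range:

  # Every ASCII character has a code between 0 and 127 (7 bits).
  for ch in ['A', 'z', '0', '?']:
      print(ch, ord(ch), format(ord(ch), '07b'))  # code and its 7-bit binary
  # A 65 1000001
  # z 122 1111010
  # 0 48 0110000
  # ? 63 0111111

  # An accented letter is not among the 128 -- ASCII has no code for it:
  print(ord('é'))  # 233: too big to fit in 7 bits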

Unicode: A Universal Code for All Languages

  • What it is: As computers became global, we needed a much bigger and more inclusive codebook. That's where Unicode comes in! Unicode is a huge international standard that aims to include every character from every language on Earth, plus mathematical symbols, emojis, and more.
  • How it works: Instead of just 128 characters, Unicode can define over a million different characters! Each character gets its own unique number, called a "code point." So, the letter 'A' still has a number, but so does 'é', each of the two characters in '你好' (nǐ hǎo), and even the 😂 (Face with Tears of Joy) emoji! (See the short sketch after this list.)
  • Why it was needed: Unicode solves the problem of language barriers in computers. Now, a document written in Japanese can be opened and read correctly on a computer in America, as long as both machines understand Unicode.
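As a quick sketch, here are the code points for a few of the characters mentioned above (code points are conventionally written as U+ followed by a hexadecimal number, like U+0041 for 'A'):

  # Every character, from any language, has exactly one code point.
  for ch in ['A', 'é', '你', '好', '😂']:
      print(ch, 'U+{:04X}'.format(ord(ch)))
  # A  U+0041
  # é  U+00E9
  # 你 U+4F60
  # 好 U+597D
  # 😂 U+1F602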

UTF-8 (Unicode Transformation Format - 8-bit): The Smart Storage Solution

  • What it is: Unicode defines the numbers for all characters, but it doesn't say how those numbers should be stored as binary 0s and 1s. That's where UTF-8 comes in. UTF-8 is the most popular way to actually encode (turn into binary) Unicode characters for storage and transmission.
  • How it works: UTF-8 is clever because it's a "variable-width" encoding (the byte-counting sketch after this list shows it in action). This means:
    • Small characters stay small: For characters that were already in ASCII (like English letters and numbers), UTF-8 uses just one byte (8 bits). This is great because it makes files containing mostly English text very efficient and backward-compatible with older ASCII systems.
    • Other characters use more bytes: For characters from other languages, or emojis, UTF-8 uses more than one byte (like two, three, or four bytes). This allows it to represent all the characters in Unicode without wasting space for simple text.
  • Why it's the standard: UTF-8 is widely used on the internet and in most modern software because it's efficient, flexible, and can handle text from almost any language in the world. When you send a text message with emojis, or open a webpage with different languages, chances are UTF-8 is making it all work behind the scenes!
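Here's a minimal byte-counting sketch to round things off (Python's encode('utf-8') method turns each character into its actual UTF-8 bytes):

  # UTF-8 is variable-width: 1 byte for ASCII, up to 4 for emoji.
  for ch in ['A', 'é', '你', '😂']:
      data = ch.encode('utf-8')
      print(ch, len(data), 'byte(s):', data.hex(' '))
  # A  1 byte(s): 41
  # é  2 byte(s): c3 a9
  # 你 3 byte(s): e4 bd a0
  # 😂 4 byte(s): f0 9f 98 82

Notice that the first example is the very same byte an old ASCII system would have used, which is exactly the backward compatibility described above.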

So, thanks to character encoding systems like ASCII, Unicode, and especially UTF-8, our computers can understand and display all the diverse text and symbols that make up our digital world!

