Binary-to-Text Encoding: Base64, Base32, Base16, Base62, and Base58

In the course of web development, one frequently encounters encoding schemes such as Base64. Until recently, I had thought of it merely as a standard for encoding binary data, but on closer investigation the subject proves remarkably intriguing. This post examines what character encoding is, how base-N encoding systems are named, and what distinguishes standards such as Base64, Base62, Base58, Base32, and Base16.

Character Encoding

Encoding is the transformation of human-readable data, such as text and images, into signals a computer can process (sequences of 0s and 1s).
Character encoding specifically addresses the conversion of textual data into such signals.
Representative character encoding methodologies:
- ASCII (American Standard Code for Information Interchange)
  - One of the earliest standards for textual representation in computing systems (first published in 1963)
  - A 7-bit code covering only English letters, digits, punctuation, and control characters
- Unicode
  - A comprehensive standard enabling consistent representation of all global writing systems within computational environments
  - Achieves unification of previously incompatible character encoding methodologies
  - UTF encoding and UCS encoding constitute transformation methodologies for Unicode, with UTF-8 representing one of the most prevalent implementations
For a fuller account of the varieties of character encoding and their history, consult the Wikipedia reference.
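As a rough illustration of the difference between ASCII and a Unicode encoding such as UTF-8, the following minimal Python sketch inspects two sample characters. The characters and variable names are chosen here purely for illustration.

```python
# ASCII assigns each character a 7-bit number.
ascii_code = ord("A")                    # 65
ascii_bits = format(ascii_code, "08b")   # "01000001"

# UTF-8 is one encoding of Unicode; characters outside ASCII
# are stored as multi-byte sequences.
utf8_bytes = "あ".encode("utf-8")        # b"\xe3\x81\x82" (3 bytes)

# ASCII itself cannot represent characters outside its repertoire.
try:
    "あ".encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False

print(ascii_code, ascii_bits, utf8_bytes, ascii_can_encode)
```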

Binary-to-Text Encoding: Fundamental Concepts

Binary-to-text encoding is the process of encoding binary data into strings of printable characters.
Email is a principal application: the SMTP protocol transmits data as text. The original protocol specification mandates 7-bit ASCII transmission, whereas contemporary systems store data in 8-bit units (bytes). Binary-to-text encoding is therefore essential for lossless transmission of anything that does not fit in 7 bits per character, including text in non-English languages as well as image, audio, and video content.
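The 7-bit constraint can be sketched as follows, assuming Python's standard `base64` module and an arbitrary sample string. The UTF-8 bytes of non-English text contain values above 0x7F, whereas the Base64 output stays within 7-bit ASCII and decodes back to the original bytes.

```python
import base64

text = "héllo"                      # sample text with one non-ASCII character
raw = text.encode("utf-8")          # bytes as stored in memory

# At least one byte has its high bit set, so it is not 7-bit safe.
needs_eighth_bit = any(b > 0x7F for b in raw)

# After Base64 encoding, every byte is a printable 7-bit ASCII character.
encoded = base64.b64encode(raw)
is_seven_bit_safe = all(b < 0x80 for b in encoded)

# Decoding restores the original bytes without loss.
restored = base64.b64decode(encoded)

print(needs_eighth_bit, is_seven_bit_safe, restored == raw)
```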

Let us examine the most prevalent binary-to-text encoding methodologies, proceeding from Base64 through Base32, Base16, Base62, and Base58.

Base64

Base-N denotes an N-ary numeral system; consequently, Base64 is a base-64 (64-ary) system. (It should not be confused with sexagesimal, which is base 60.) This encoding transforms 8-bit binary data into character strings composed exclusively of ASCII characters that are unaffected by differences between character codes. (Source: Wikipedia)
- Diverse character encoding standards (ISO-8859, Unicode, et cetera) arose from the need to accommodate the world's writing systems, yet each covers a different set of characters. Ultimately, a representation that is independent of character codes requires conversion to ASCII.
- The methodology partitions 24 bits (three 8-bit segments) into four 6-bit segments. (2^6 = 64)
- Composition: A-Z, a-z, 0-9 (62 characters total) plus the symbols “+” and “/” = 64 characters (with “=” serving as the padding character)
  - Decimal: “0123456789”
  - Hexadecimal: “0123456789abcdef”
  - Base64: “ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/”
  - → 64 is the largest power of two that can be expressed using printable ASCII characters. ASCII defines 128 codes in total, of which only 95 are printable (the rest are control characters), so 128 characters are not available.
- However, the standard Base64 alphabet includes characters that have special meaning in URLs.
  - (Referencing RFC 3986): “+” and “/” are reserved characters
- → The “URL and Filename Safe” Base64 variant addresses this limitation (defined in RFC 4648)
  - Substitutes “-” and “_” for “+” and “/”
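The difference between the two alphabets can be seen with Python's standard `base64` module. In this small sketch the three input bytes are picked deliberately so that every 6-bit group maps to index 62 or 63, the only positions where the alphabets differ.

```python
import base64

# 0xFB 0xFF 0xFE -> 6-bit groups 62, 63, 63, 62
data = bytes([0xFB, 0xFF, 0xFE])

standard = base64.b64encode(data).decode("ascii")          # uses "+" and "/"
url_safe = base64.urlsafe_b64encode(data).decode("ascii")  # uses "-" and "_"

print(standard)  # +//+
print(url_safe)  # -__-
```

Both strings decode back to the same bytes with `base64.b64decode` and `base64.urlsafe_b64decode` respectively.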

Base64 Encoding Methodology

During conversion, data rarely aligns perfectly with 24-bit boundaries. Padding mechanisms accommodate this constraint.


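When the final group holds fewer than 24 bits, the vacant bit positions in a partially filled 6-bit segment are filled with zeros, while 6-bit segments that contain no input bits at all are written as the padding character “=”.
→ Encoding expands the data to approximately 4/3 of its original size, since every 3 input bytes become 4 output characters.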
Note:
- An octet comprises eight bits. While typically equivalent to one byte, early computing systems did not uniformly define bytes as 8-bit units; hence, the term “octet” was coined to unambiguously specify 8-bit groupings.
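The grouping and padding steps described in this section can be sketched by hand. This is a minimal illustration, not a replacement for a library routine; the function name `base64_encode` is made up for this example, and the result is compared against Python's standard `base64` module.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def base64_encode(data: bytes) -> str:
    """Hand-rolled Base64 encoder (illustrative only)."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        missing = 3 - len(chunk)          # 0, 1, or 2 absent octets
        # Fill the vacant bit positions with zeros to reach 24 bits.
        block = int.from_bytes(chunk + b"\x00" * missing, "big")
        # Split the 24 bits into four 6-bit indices.
        chars = [ALPHABET[(block >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Segments that received no input bits at all become "=".
        if missing:
            chars[-missing:] = "=" * missing
        out.extend(chars)
    return "".join(out)

for sample in (b"M", b"Ma", b"Man"):
    print(sample, base64_encode(sample))
# b'M'   -> TQ==
# b'Ma'  -> TWE=
# b'Man' -> TWFu
```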

Base32

A 32-ary numeral representation in which each character encodes 5 bits (2^5 = 32).
Advantages compared to Base64:
- Beneficial for case-insensitive file systems, DNS nomenclature, and vocal communication
- Excludes the Unix path separator (’/’), enabling usage in filenames
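- Avoids characters that look alike to human readers (the alphabet omits 1, 8, and 0, which may be confused with I, B, and O)
Disadvantages compared to Base64:
- Occupies about 20% more space than the same data in Base64 (8 characters per 5 bytes instead of 4 characters per 3 bytes)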
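The size difference and the case-insensitivity can be checked with Python's standard `base64` module. In this sketch the 15-byte input is an arbitrary choice, picked because it is a multiple of both 3 and 5, so neither encoding needs padding.

```python
import base64

data = bytes(range(15))   # 15 bytes: no padding needed in either encoding

b32 = base64.b32encode(data).decode("ascii")
b64 = base64.b64encode(data).decode("ascii")

print(len(b32), len(b64))       # 24 20
print(len(b32) / len(b64))      # 1.2

# Base32 uses a single letter case, so lowercase input can still be decoded.
restored = base64.b32decode(b32.lower(), casefold=True)
print(restored == data)         # True
```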

Base16 (Hexadecimal)

A 16-ary numeral representation.
- Converts each 8-bit octet (1 byte) into two characters, each representing 4 bits (2 bytes of output)
Advantages:
- Most programming languages possess inherent functionality for parsing ASCII-encoded hexadecimal notation
- Unlike Base32 and Base64, requires no padding (each character covers exactly half of an 8-bit byte)
- Hexadecimal notation enjoys widespread usage, enabling immediate comprehension without reference tables
Disadvantages:
- Space efficiency drops to 50%: the encoded output is twice the size of the input
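As a brief sketch of these points in Python, whose built-in `bytes` type can produce and parse hexadecimal directly (the sample input is arbitrary):

```python
import base64

data = b"Hi!"

hex_text = data.hex()                 # two characters per byte, no padding
restored = bytes.fromhex(hex_text)    # built-in parsing

# The standard base64 module offers the same encoding in uppercase.
b16 = base64.b16encode(data).decode("ascii")

print(hex_text)                       # 486921
print(len(hex_text), len(data))       # 6 3
```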

Base62

A variant of Base64 that excludes special characters and consists exclusively of A-Z, a-z, and 0-9.
Useful where special characters would otherwise have to be escaped or substituted (URLs being a prominent example).
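Because 62 is not a power of two, Base62 is commonly applied to integers (for example, numeric IDs in short URLs) by repeated division. The following is a minimal sketch under that assumption. There is no single standard Base62 alphabet, so the ordering used here (digits, then uppercase, then lowercase) and the function names are choices made for this example.

```python
import string

# Assumed ordering: 0-9, A-Z, a-z. Other implementations order these differently.
ALPHABET62 = string.digits + string.ascii_uppercase + string.ascii_lowercase

def base62_encode(n: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET62[0]
    digits = []
    while n:
        n, remainder = divmod(n, 62)
        digits.append(ALPHABET62[remainder])
    return "".join(reversed(digits))

def base62_decode(text: str) -> int:
    """Decode a Base62 string back into an integer."""
    n = 0
    for ch in text:
        n = n * 62 + ALPHABET62.index(ch)
    return n

print(base62_encode(61))   # z
print(base62_encode(62))   # 10
```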

Base58Check

A Base64 derivative that eliminates visually ambiguous characters (0, O, I, l) and special symbols (+, /).
Developed for Bitcoin address representation:
- Removing ambiguous characters keeps distinct Bitcoin addresses from looking identical
- An alphanumeric-only string can be selected in full by double-clicking
- Special characters can trigger line breaks in email, which an alphanumeric-only string avoids
Incorporates four bytes (32 bits) of SHA256-based error-checking code so that invalid addresses can be detected.
Because 58 is not a power of two, it is better suited to encoding integers than to encoding arbitrary binary data.
- A Bitcoin address is a 25-byte (200-bit) number, which can be represented in at most 35 Base58 characters
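The construction can be sketched as follows. This is a minimal illustration rather than production code: the function names are made up for this example, the alphabet is the one used by Bitcoin, and the checksum is the first four bytes of a double SHA-256 over the version byte plus payload. Leading zero bytes are written as the character “1”, because they would otherwise vanish when the bytes are treated as an integer.

```python
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def _checksum(body: bytes) -> bytes:
    # First 4 bytes of SHA-256 applied twice.
    return hashlib.sha256(hashlib.sha256(body).digest()).digest()[:4]

def base58check_encode(version: bytes, payload: bytes) -> str:
    raw = version + payload + _checksum(version + payload)
    n = int.from_bytes(raw, "big")          # treat all bytes as one big integer
    chars = []
    while n:
        n, remainder = divmod(n, 58)
        chars.append(B58_ALPHABET[remainder])
    leading_zeros = len(raw) - len(raw.lstrip(b"\x00"))
    return "1" * leading_zeros + "".join(reversed(chars))

def base58check_decode(text: str) -> bytes:
    n = 0
    for ch in text:
        n = n * 58 + B58_ALPHABET.index(ch)
    leading_ones = len(text) - len(text.lstrip("1"))
    raw = b"\x00" * leading_ones + n.to_bytes((n.bit_length() + 7) // 8, "big")
    body, checksum = raw[:-4], raw[-4:]
    if _checksum(body) != checksum:
        raise ValueError("checksum mismatch: invalid or mistyped string")
    return body                              # version byte + payload

# 1 version byte + 20-byte payload + 4 checksum bytes = 25 bytes
address = base58check_encode(b"\x00", bytes(range(20)))
print(address, len(address))
```

Changing a character of `address` should cause `base58check_decode` to raise `ValueError`, which is how mistyped addresses are detected.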
