Mastering Unicode in Modern C++: A Comprehensive Guide to Wide Characters, Encodings, and Best Practices

Apr 29, 2023

Introduction

The growing globalization of software and the increasing importance of diverse languages and scripts have made a standardized character encoding system necessary. Unicode, the Universal Character Set (UCS), and various character encoding formats have emerged to meet this need. In this article, we will explore the concepts of Unicode, UCS, and WideChar in the context of C++17 and C++20 programming, with a focus on modern C++ practices. We will also provide examples to illustrate each concept, highlighting the benefits of these character encoding standards.

Unicode: The Universal Language

Unicode is a character encoding standard developed by the Unicode Consortium to represent text in modern software, handling virtually all written languages worldwide. Unicode provides a unique number, known as a code point, for every character, regardless of the platform or language. Code points range from U+0000 to U+10FFFF. As of Unicode version 14.0, there are over 144,000 characters from various scripts, including alphabets, symbols, and emojis.
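
To make the code point range concrete, here is a minimal sketch that checks whether a 32-bit value is a Unicode scalar value, meaning a code point that can actually denote a character. The helper name is_scalar_value is our own and not part of the standard library.

```cpp
// Unicode code points range from U+0000 to U+10FFFF. The surrogate range
// U+D800..U+DFFF is reserved for UTF-16 and never denotes a character.
constexpr bool is_scalar_value(char32_t cp) {
    const bool in_range = cp <= 0x10FFFF;
    const bool is_surrogate = cp >= 0xD800 && cp <= 0xDFFF;
    return in_range && !is_surrogate;
}
```

A check like this is useful at the boundary of a program, before untrusted numeric values are treated as characters.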

UTF Encoding Formats

UTF (Unicode Transformation Format) is a family of character encodings that map Unicode code points to sequences of fixed-size code units. UTF-8 and UTF-16 are variable-length, while UTF-32 is fixed-length. The most common UTF formats are:

UTF-8

UTF-8 is a variable-length encoding using 8-bit code units, with each code point taking one to four bytes. It is the most widely used Unicode encoding and the dominant encoding for web content, in part because it is backward-compatible with ASCII: every ASCII text is also valid UTF-8.
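
The variable-length scheme can be sketched in a few lines. The function below, encode_utf8, is a hypothetical helper written for this article; it encodes a single code point and returns an empty string for values that are not valid Unicode scalar values.

```cpp
#include <string>

// Encodes one code point as UTF-8 (1 to 4 bytes).
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);                          // 0xxxxxxx (ASCII)
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;          // surrogates are invalid
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encode_utf8(U'A') yields the single byte 0x41, while U+20AC (the euro sign) yields the three bytes E2 82 AC.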

UTF-16

UTF-16 is another variable-length encoding, using 16-bit code units. Code points in the Basic Multilingual Plane (BMP), which covers most of the world’s scripts in common use, take a single code unit. Code points outside the BMP are encoded as two code units known as a surrogate pair.
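
As an illustration of how UTF-16 reaches code points beyond the BMP, the following sketch encodes one code point into one or two 16-bit code units. The names Utf16Units and encode_utf16 are hypothetical, and the function assumes its input is a valid scalar value.

```cpp
// One or two UTF-16 code units for a single code point.
struct Utf16Units {
    char16_t units[2];
    int count;  // 1 for BMP code points, 2 for a surrogate pair
};

// Assumes cp is a valid Unicode scalar value (not itself a surrogate).
constexpr Utf16Units encode_utf16(char32_t cp) {
    if (cp <= 0xFFFF) {
        return {{static_cast<char16_t>(cp), 0}, 1};
    }
    const char32_t v = cp - 0x10000;  // 20 bits remain
    return {{static_cast<char16_t>(0xD800 + (v >> 10)),     // high surrogate
             static_cast<char16_t>(0xDC00 + (v & 0x3FF))},  // low surrogate
            2};
}
```

U+1F600, for instance, becomes the pair 0xD83D 0xDE00.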

UTF-32

UTF-32 is a fixed-length encoding using 32-bit code units. Each Unicode code point is represented directly as a single 32-bit integer. This encoding is less common due to its larger memory footprint, but every code point occupies exactly one code unit, so the n-th code point can be accessed by index in constant time.
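
The difference in access cost can be sketched as follows. Both helper names are our own, and count_code_points assumes its input is well-formed UTF-8. The code requires C++17 for std::string_view.

```cpp
#include <cstddef>
#include <string_view>

// Counts code points in well-formed UTF-8 by skipping continuation bytes
// (those of the form 10xxxxxx). This is a linear scan over the whole string.
constexpr std::size_t count_code_points(std::string_view utf8) {
    std::size_t n = 0;
    for (char c : utf8) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) ++n;
    }
    return n;
}

// In UTF-32 the i-th code point is simply the i-th code unit.
// The caller must ensure that i is within bounds.
constexpr char32_t code_point_at(std::u32string_view utf32, std::size_t i) {
    return utf32[i];
}
```

Note that a code point is still not necessarily what a user perceives as one character, since combining marks and emoji sequences span several code points.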

Universal Character Set (UCS)

The Universal Character Set (UCS) is defined by ISO/IEC 10646, an international standard developed by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). It defines a set of characters for representing text in various languages. UCS and Unicode have a close relationship: the two standards are kept synchronized, so they share the same character repertoire and code point assignments. The Unicode Standard additionally specifies character properties, algorithms (such as normalization, collation, and bidirectional text handling), and implementation guidelines that ISO/IEC 10646 does not.

UCS Encoding Formats

UCS has historically defined two encoding formats:

UCS-2

UCS-2 is a fixed-length encoding using 16-bit code units. It is capable of representing characters from the BMP but cannot encode characters outside this plane. It is now considered obsolete and has been superseded by UTF-16.

UCS-4

UCS-4 is a fixed-length encoding using 32-bit code units, similar to UTF-32. It can represent all Unicode code points directly.
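
The practical consequence of the UCS-2 limitation is that a conversion from UCS-4 (UTF-32) to UCS-2 can lose information. The sketch below uses a hypothetical helper, to_ucs2_lossy, which substitutes U+FFFD for every code point that UCS-2 cannot represent; other policies, such as reporting an error, are equally valid.

```cpp
#include <string>

// Converts UTF-32 text to UCS-2. Code points outside the BMP have no UCS-2
// representation, so this sketch substitutes U+FFFD (REPLACEMENT CHARACTER).
std::u16string to_ucs2_lossy(const std::u32string& text) {
    std::u16string out;
    out.reserve(text.size());
    for (char32_t cp : text) {
        out += cp <= 0xFFFF ? static_cast<char16_t>(cp) : u'\uFFFD';
    }
    return out;
}
```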

WideChar in Modern C++

In C++, the wchar_t type, known as WideChar, is used to represent wide characters. The size of wchar_t is implementation-dependent and may vary between platforms and compilers. Typically, wchar_t is 16 bits on Windows, where it holds UTF-16 code units, and 32 bits on Unix-like systems, where it holds UTF-32 code units.

One way to handle Unicode text in C++ applications is to use wide string literals, which are arrays of wchar_t. These wide string literals are defined using the L prefix, as shown in the example below:
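
Because the width differs between platforms, portable code should not assume it. The following sketch computes the width at compile time; the names wide_char_bits and wchar_holds_any_code_point are our own.

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t in bits on the current platform.
constexpr std::size_t wide_char_bits = sizeof(wchar_t) * CHAR_BIT;

// True when a single wchar_t can hold any Unicode code point
// (values up to U+10FFFF need 21 bits).
constexpr bool wchar_holds_any_code_point() {
    return wide_char_bits >= 21;
}
```

Code that depends on one wchar_t per code point can guard that assumption with static_assert(wchar_holds_any_code_point()), which would fail to compile on platforms with a 16-bit wchar_t.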

#include <iostream> 
int main() {
  const wchar_t* wide_string = L"Hello, World!";
  std::wcout << wide_string << std::endl;
  return 0; 
}

C++ Standard Library Support for Wide Characters

The C++ Standard Library provides various utilities and functions for working with wide characters and strings. Some of these include:

Wide Character I/O

The C++ Standard Library offers wide character input and output streams, such as std::wcin, std::wcout, std::wcerr, and std::wclog. These streams allow reading and writing wide characters or strings to standard input, output, error, and log streams.

Wide Character String Manipulation

C++ inherits a set of wide character string manipulation functions from C through the <cwchar> header, such as wcslen, wcscmp, wcsncpy, and wcstok. These functions are similar to their char counterparts, but they operate on null-terminated wchar_t arrays and count code units rather than characters.
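
As a brief sketch of two of these functions, the wrappers below call std::wcslen and std::wcscmp. The wrapper names wide_length and wide_equal are hypothetical and exist only to make the behavior explicit.

```cpp
#include <cstddef>
#include <cwchar>

// Length in wchar_t code units, not in user-perceived characters.
std::size_t wide_length(const wchar_t* text) {
    return std::wcslen(text);
}

// Code-unit-by-code-unit comparison; no Unicode collation is involved.
bool wide_equal(const wchar_t* a, const wchar_t* b) {
    return std::wcscmp(a, b) == 0;
}
```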

std::wstring and std::wstringstream

std::wstring is a specialization of std::basic_string for wchar_t. It provides a convenient way to work with wide character strings. std::wstringstream is a specialization of std::basic_stringstream for wchar_t, allowing you to perform formatted input and output operations on wide character strings.
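
A minimal sketch of both classes working together is shown below. It uses std::wostringstream, the output-only sibling of std::wstringstream, and the function name describe_item is our own.

```cpp
#include <sstream>
#include <string>

// Builds a wide string with formatted output, much like writing to std::wcout.
std::wstring describe_item(const std::wstring& name, int quantity) {
    std::wostringstream out;
    out << name << L" x " << quantity;
    return out.str();
}
```

Calling describe_item(L"Tea", 3) returns the wide string L"Tea x 3".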

Converting Between Wide and Narrow Character Strings

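To convert between wide and narrow character strings, C++ provides the following facilities:

mbstowcs and wcstombs

These functions, together with their restartable variants mbsrtowcs and wcsrtombs, convert between multi-byte character strings (char-based) and wide character strings (wchar_t-based). They use the character encoding of the current C locale, which is set with std::setlocale. The example below uses mbsrtowcs: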

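#include <clocale>
#include <cwchar>
#include <iostream>
#include <iterator>
int main() {
  // Use the user's locale so multibyte input is decoded with its encoding
  std::setlocale(LC_ALL, "");
  const char* narrow_string = "Hello, World!";
  wchar_t wide_string[256];
  std::mbstate_t state{};
  // Pass the destination capacity, not strlen, so the terminator is written
  std::size_t converted = std::mbsrtowcs(
      wide_string, &narrow_string, std::size(wide_string), &state);
  if (converted == static_cast<std::size_t>(-1)) {
    std::cerr << "Invalid multibyte sequence" << std::endl;
    return 1;
  }
  std::wcout << wide_string << std::endl;
  return 0;
}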

codecvt

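The std::codecvt facet has been part of the locale library since C++98. C++11 added the <codecvt> header, with facets such as std::codecvt_utf8 and std::codecvt_utf8_utf16, together with std::wstring_convert. These allow conversions between UTF-8, UTF-16, and UTF-32 that do not depend on the current locale's encoding, as well as conversions between wide and narrow character strings.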

#include <locale>
#include <iostream>
#include <codecvt>
#include <string>
int main() {
  std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
  std::string narrow_string = "Hello, World!";
  std::wstring wide_string = converter.from_bytes(narrow_string);
  std::wcout << wide_string << std::endl;
  return 0;
}

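However, it is worth noting that the <codecvt> header and std::wstring_convert have been deprecated since C++17. They are still available in C++20 and are removed only in C++26, but the standard offers no replacement. Developers are therefore encouraged to use external libraries, such as ICU or Boost.Locale, for character encoding conversions.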
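
For simple cases, a conversion can also be written by hand. The following is a minimal sketch of a UTF-8 to UTF-32 decoder, assuming that replacing malformed input with U+FFFD is acceptable. The function decode_utf8 is a hypothetical helper, not a standard or library API, and it requires C++17 for std::string_view.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Decodes UTF-8 into UTF-32. Malformed input is replaced with U+FFFD and the
// decoder resynchronizes one byte at a time, so a single bad sequence may
// produce more than one replacement character.
std::u32string decode_utf8(std::string_view in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        char32_t cp = 0;        // code point being assembled
        char32_t min = 0;       // smallest value allowed for this length
        if (lead < 0x80)                { cp = lead;        extra = 0; min = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min = 0x10000; }
        else { out += U'\uFFFD'; ++i; continue; }  // stray or invalid byte

        bool ok = i + extra < in.size();  // all continuation bytes present?
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))) {
            ok = false;
        }
        if (ok) { out += cp; i += extra + 1; }
        else    { out += U'\uFFFD'; ++i; }
    }
    return out;
}
```

A sketch like this is fine for learning and for controlled input, but a well-tested library remains the safer choice for production code.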

Best Practices for Unicode in Modern C++

When working with Unicode in modern C++, consider the following best practices:

Prefer UTF-8 for Storage and Transmission

UTF-8 is the most widely-used and recommended encoding for storage and transmission of Unicode text. It is compact, backward-compatible with ASCII, and suitable for most text-processing scenarios.

Use UTF-16 or UTF-32 for Internal Processing

For internal processing, consider using UTF-16 or UTF-32, depending on the platform and use case. UTF-32 gives one code unit per code point at the cost of more memory, while UTF-16 is the native encoding of the Windows API and of libraries such as ICU. Remember that wchar_t size may vary between platforms, so use explicit types such as char16_t and char32_t for portable code.
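
The explicit types come with their own literal prefixes, u for char16_t and U for char32_t. The sketch below stores the same code point, U+1F600, in each encoding to show how the code unit count differs. The variable names are our own, and the code requires C++17 for the string view types.

```cpp
#include <string_view>

// U+1F600 lies outside the BMP, so its code unit count differs per encoding.
// The UTF-8 bytes are spelled out because the element type of u8 literals
// changed from char to char8_t in C++20.
constexpr std::string_view    utf8_text  = "\xF0\x9F\x98\x80";  // 4 code units
constexpr std::u16string_view utf16_text = u"\U0001F600";       // 2 code units
constexpr std::u32string_view utf32_text = U"\U0001F600";       // 1 code unit
```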

Employ Unicode-aware Functions and Libraries

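When manipulating Unicode text, use Unicode-aware functions and libraries instead of relying on legacy functions that assume one char per character. Keep in mind that the standard string classes, including std::wstring, std::wstring_view, and std::wstringstream, treat text as a sequence of code units and know nothing about code points or user-perceived characters. Choose a string type whose code unit matches your encoding, such as std::string for UTF-8, std::u16string for UTF-16, std::u32string for UTF-32, or std::wstring where platform APIs require it.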

Avoid Mixing Narrow and Wide Character Strings

Mixing narrow and wide character strings can lead to unexpected behavior and hard-to-find bugs. Keep the two separate and only convert between them when necessary, using well-defined conversion functions or libraries.

Use External Libraries for Complex Unicode Operations

For complex Unicode operations, such as normalization, case folding, or text segmentation, consider using external libraries like ICU (International Components for Unicode) or Boost.Locale. These libraries offer robust support for Unicode handling in C++.

Example: Reading a UTF-8 File and Processing Unicode Text

In this example, we will demonstrate how to read a UTF-8 encoded text file, convert the content to UTF-32 for processing, and then convert the result back to UTF-8 for output.

#include <clocale>
#include <codecvt>  // deprecated since C++17, but still available in C++20
#include <cwchar>
#include <cwctype>
#include <fstream>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
// Read the content of a UTF-8 encoded file as raw bytes
std::string read_utf8_file(const std::string& filename) {
  std::ifstream input_file(filename, std::ios::binary);
  return std::string(
      (std::istreambuf_iterator<char>(input_file)),
      std::istreambuf_iterator<char>()
  );
}
// Process Unicode text (e.g., convert to uppercase)
// Note: std::towupper depends on the C locale and only performs simple
// one-to-one case mappings; full Unicode case mapping needs a library like ICU
std::u32string process_unicode_text(const std::u32string& input) {
  std::u32string result;
  result.reserve(input.size());
  for (char32_t ch : input) {
    char32_t upper_case_ch = ch;
    // wchar_t is only 16 bits on some platforms, so leave larger values as-is
    if (ch <= static_cast<char32_t>(WCHAR_MAX)) {
      upper_case_ch = static_cast<char32_t>(
          std::towupper(static_cast<std::wint_t>(ch)));
    }
    result.push_back(upper_case_ch);
  }
  return result;
}
int main() {
  // Select the user's locale so that std::towupper can map non-ASCII letters
  std::setlocale(LC_ALL, "");
  // Read a UTF-8 encoded file
  std::string utf8_content = read_utf8_file("example.txt");
  // Convert UTF-8 content to UTF-32 (throws std::range_error on invalid input)
  std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
  std::u32string utf32_content = converter.from_bytes(utf8_content);
  // Process Unicode text
  std::u32string processed_content = process_unicode_text(utf32_content);
  // Convert the result back to UTF-8
  std::string output_utf8_content = converter.to_bytes(processed_content);
  // Write the output to stdout
  std::cout << output_utf8_content << std::endl;
  return 0;
}

Conclusion

In this article, we have explored the concepts of Unicode, UCS, and WideChar in the context of modern C++ programming. We have discussed various encoding formats, including UTF-8, UTF-16, UTF-32, UCS-2, and UCS-4, and demonstrated how to work with wide characters and strings in C++. We have also covered best practices for handling Unicode text in C++ and provided examples to illustrate each concept.

By understanding and implementing these concepts and practices, C++ developers can create software that is more robust, efficient, and capable of handling a wide range of languages and scripts. This will enable developers to cater to a global audience and meet the increasing demands of modern software development.

Written by Salik Tariq