You write C++ statements using a basic source character set. This is simply the set of characters that you’re allowed to use explicitly in a C++ source file. The character set that you can use to define a name is a subset of this. The basic source character set in no way constrains the character data that you work with in your code. Your program can create strings consisting of characters outside this set in various ways, as you’ll see. The basic source character set consists of the following characters: the letters a to z and A to Z; the digits 0 to 9; the whitespace characters space, horizontal tab, vertical tab, form feed, and newline; and the 29 characters _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
This is easy and straightforward. You have 96 characters that you can use, and it’s likely that these will accommodate your needs most of the time.
This definition of the characters that you can use in C++ does not say how the characters are encoded. Your particular compiler will determine how the characters that you use to write your C++ source code are represented in the computer. On a PC, these characters will typically be represented in the machine by American Standard Code for Information Interchange (ASCII) codes, or by an encoding that extends ASCII such as ISO Latin-1 or UTF-8, but other ways of encoding characters may be used.
Most of the time the basic source character set will be adequate, but occasionally you’ll need characters that aren’t included in the basic set. You saw earlier that you can include UCS characters in a name. You can also include UCS characters in other parts of your program, such as when you specify character data. In the next section I elaborate a little on what UCS is all about.
The Universal Character Set
UCS is specified by the standard ISO/IEC 10646, and it defines codes for the characters used in virtually all the national languages in current use and many more besides. The ISO/IEC 10646 standard defines several character-encoding forms. The simplest is UCS-2, which represents characters as 16-bit codes, so it can accommodate 65,536 different character codes that can be written as four hexadecimal digits, dddd. The set of characters that these 16-bit codes cover is called the basic multilingual plane; it accommodates the characters most commonly needed for languages in current use, so for much everyday text you won’t need more than this. UCS-4 is another encoding within the ISO/IEC 10646 standard that represents characters as 32-bit codes that you can express as eight hexadecimal digits, dddddddd. A 32-bit code has room for far more characters than have been defined (the standard now restricts valid codes to the range 0 to 10FFFF hexadecimal, a little over a million values), so UCS-4 provides the capacity for accommodating all the character sets that you might ever need.
This isn’t all there is to UCS, though. For example, there’s another 16-bit encoding called UTF-16 (UTF stands for Unicode Transformation Format) that differs from UCS-2 in that it accommodates more than 65,536 characters by encoding each character outside the first 65,536 as a surrogate pair, that is, as two 16-bit code values. There are other UCS encodings too, such as UTF-8, which uses one to four 8-bit code units per character. Generally, a given character has the same code value whichever UCS encoding you choose; the encodings differ only in how that value is stored. The values of codes in US-ASCII are the same as the codes for the corresponding characters in UCS.
Regardless of whether a compiler supports an extended character set for writing source statements, you can include characters from the UCS in your source code by specifying them in the form of a hexadecimal representation of their codes, either as \udddd or \Udddddddd, where d is a hexadecimal digit. Note the lowercase u in the first case and the uppercase U in the second. However, you must not specify any of the characters in the basic source character set in this way. This is because the codes for these characters will be determined by the compiler, and they may not be consistent with the UCS codes.
If your compiler supports an extended character set with characters outside the basic source character set, you’ll be able to use these characters in your source code and the compiler will translate the characters to the internal representation before compilation begins.
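Trigraph Sequences
You’re unlikely to see this in use very often, if ever, but the C++ standard has long allowed you to specify certain characters as trigraph sequences. A trigraph sequence is a sequence of three characters that’s used to identify another character. This was necessary way back in the dark ages of computing to accommodate characters that were missing from some keyboards. Table 1-1 shows the characters that may be specified in this way in C++. There are nine of them: ??= for #, ??( for [, ??) for ], ??/ for \, ??' for ^, ??< for {, ??> for }, ??! for |, and ??- for ~. Trigraphs were removed from the language in C++17, and some compilers disable them by default even for earlier standards.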
The compiler will replace all trigraph sequences with their equivalent characters before any other processing of the source code.
Escape Sequences
When you want to use character constants in a program, certain characters can be problematic. A character constant is a data item that your program will use in some way, and it can be either a single character or a character string such as the one in the earlier simple example. Obviously, you can’t enter characters such as newline or tab directly as character constants, as they’ll just do what they’re supposed to do: go to a new line or tab to the next tab position in your source code file. What you want in a character constant is the appropriate code for the character.
You can enter control characters as constants by means of an escape sequence. An escape sequence is an indirect way of specifying a character, and it always begins with a backslash (\). Table 1-2 shows the escape sequences that represent control characters.
There are some other characters that are a problem to represent directly. Clearly, the backslash character itself is difficult, because it signals the start of an escape sequence, and there are others with special significance too. Table 1-3 shows the “problem” characters you can specify with an escape sequence.
Because the backslash signals the start of an escape sequence, the only way to enter a backslash as a character constant is by using two successive backslashes (\\).
Escape sequences also provide a general way of representing characters such as those in languages other than the one your keyboard supports, because you can use a hexadecimal (base 16) or octal (base 8) number after the backslash to specify the code for a character. Because you’re using a numeric code, you can specify any character whose code fits in the character type you’re using. A hexadecimal escape sequence starts with \x (the x must be lowercase) followed by one or more hexadecimal digits, so \x41 and \xE3 are examples of escape sequences in this format.
You can also specify a character by using up to three octal digits after the backslash: \165, for example. The absence of x determines that the code will be interpreted as an octal number.
Try It Out: Using Escape Sequences
You can produce an example of a program that uses escape sequences to specify a message to be displayed on the screen. To see the results, you’ll need to enter, compile, link, and execute the following program.
As I explained in the Introduction, exactly how you perform these steps will depend on your compiler, and you’ll need to consult your compiler’s documentation for more information. If you look up “edit”, “compile”, and “link” (and, with some compilers, “build”), you should be able to find out what you need to do.
When you do manage to compile, link, and run this program, you should see the following output displayed, beginning with a blank line:
"Least said
                soonest mended."
You should also hear a beep or some equivalent noise from whatever sound output facility your computer has.
HOW IT WORKS
The output you get is determined by what’s between the outermost double quotes in the statement
cout << "\n\"Least said\n\t\tsoonest mended.\"\n\a";
In principle, everything between the outer double quotes in the preceding statement gets sent to cout. A string of characters between a pair of double quotes is called a string literal. The double quote characters just identify the beginning and end of the string literal; they aren’t part of the string. I said “in principle” because the compiler converts each escape sequence in the string literal to the character it represents, so that character is sent to cout, not the escape sequence itself. A backslash in a string literal always indicates the start of an escape sequence, so the first character that’s sent to cout is a newline character. This positions the screen cursor at the beginning of the next line.
The next character in the string is specified by another escape sequence, \", so a double quote will be sent to cout and displayed on the screen, followed by the characters Least said. Next is another newline character corresponding to \n, so the cursor will move to the beginning of the next line. You then send two tab characters to cout with \t\t, so the cursor will be moved two tab positions to the right. The characters soonest mended. will then be displayed from that point on, followed by another double quote from the escape sequence \". Lastly, you have another newline character, which will move the cursor to the start of the next line, followed by the character equivalent of the \a escape sequence that will cause the beep to sound.
The double quote characters that are interior to the string aren’t interpreted as marking the end of the string literal because each of them is preceded by a backslash and is therefore recognized as an escape sequence. If you didn’t have the escape sequence, \", available, you would have no way of outputting a double quote because it would otherwise be interpreted as indicating the end of the string.
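The name endl is defined in the <iostream> header, and sending it to an output stream writes a newline character and then flushes the stream’s output buffer.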
Using endl, the statement in the preceding code to output the string could be written as follows:
cout << endl << "\"Least said" << endl << "\t\tsoonest mended.\"\a" << endl;
This statement sends five separate things in sequence to cout: endl, "\"Least said", endl, "\t\tsoonest mended.\"\a", and endl. This produces the same visible output as the original statement, although the alert character is now sent just before the final newline instead of just after it. Of course, for this statement to compile as written, you would need to add another using declaration at the beginning of the program:
using std::endl;
You don’t have to choose between using either endl or the escape sequence for newline. They aren’t mutually exclusive, so you can mix them to suit yourself. For example, you could produce the same result again with this statement:
cout << endl << "\"Least said\n\t\tsoonest mended.\"\a" << endl;
Here you’ve just used endl for the first and last newline characters. The one in the middle is still produced by an escape sequence. Of course, each instance of endl in the output will result in the output buffer being flushed after writing a newline character to the stream.