
C++ Source Characters

This article will give you a good understanding of the basic concepts and practices of the C++ language, so that you will have the foundation to eventually learn these ideas in detail as you continue working with the language. It is excerpted from Ivor Horton's Beginning ANSI C++ The Complete Language (Apress, 2004; ISBN 1590592271).

By: Apress Publishing
March 23, 2005


You write C++ statements using a basic source character set. This is simply the set of characters that you’re allowed to use explicitly in a C++ source file. Obviously, the character set that you can use to define a name is going to be a subset of this. Of course, the basic source character set in no way constrains the character data that you work with in your code. Your program can create strings consisting of characters outside this set in various ways, as you’ll see. The basic source character set consists of the following characters:

  • The letters a to z and A to Z

  • The digits 0 to 9

  • The space character and the control characters representing horizontal tab, vertical tab, form-feed, and newline

  • The characters _ {}[]#()<>%:;.?*+-/^&|~!=,\"'

This is easy and straightforward. You have 96 characters that you can use, and it’s likely that these will accommodate your needs most of the time.

This definition of the characters that you can use in C++ does not say how the characters are encoded. Your particular compiler will determine how the characters that you use to write your C++ source code are represented in the computer. On a PC, these characters will typically be represented in the machine by American Standard Code for Information Interchange (ASCII) codes, or by an extension of ASCII such as ISO Latin-1, but other ways of encoding characters may be used.

Most of the time the basic source character set will be adequate, but occasionally you’ll need characters that aren’t included in the basic set. You saw earlier that you can include UCS characters in a name. You can also include UCS characters in other parts of your program, such as when you specify character data. In the next section I elaborate a little on what UCS is all about.

The Universal Character Set

UCS is specified by the standard ISO/IEC 10646, and it defines codes for characters used in all the national languages that are current and many more besides. The ISO/IEC 10646 standard defines several character-encoding forms. The simplest is UCS-2, which represents characters as 16-bit codes, so it can accommodate 65,536 different character codes that can be written as four hexadecimal digits, dddd. The range of codes that UCS-2 covers is called the basic multilingual plane because it accommodates the characters of almost all the languages in current use, and the likelihood of you ever wanting more than this is remote. UCS-4 is another encoding within the ISO/IEC 10646 standard that represents characters as 32-bit codes that you can express as eight hexadecimal digits, dddddddd. With more than 4 billion different codes, UCS-4 provides the capacity for accommodating all the character sets that you might ever need.

This isn’t all there is to UCS, though. For example, there’s another 16-bit encoding called UTF-16 (UTF stands for Unicode Transformation Format) that is different from UCS-2 in that it accommodates more than 65,536 characters by encoding each character outside of the first 65,536 as a pair of 16-bit code values, referred to as a surrogate pair. There are other character encodings with UCS too. Generally, a given character will have a code with the same value in any UCS encoding that you choose. The values of codes in US-ASCII are the same as those in UCS character encodings.

Regardless of whether a compiler supports an extended character set for writing source statements, you can include characters from the UCS in your source code by specifying them in the form of a hexadecimal representation of their codes, either as \udddd or \Udddddddd, where d is a hexadecimal digit. Note the lowercase u in the first case and the uppercase U in the second. However, you must not specify any of the characters in the basic source character set in this way. This is because the codes for these characters will be determined by the compiler, and they may not be consistent with the UCS codes.

If your compiler supports an extended character set with characters outside the basic source character set, you’ll be able to use these characters in your source code and the compiler will translate the characters to the internal representation before compilation begins.


NOTE  The character codes defined by the UCS standard are identical to codes defined by Unicode, so Unicode is essentially UCS by another name. If you are keen to explore the delights of UCS and Unicode in detail, http://www.unicode.org is a good place to start.

Trigraph Sequences

You’re unlikely to see this in use very often, if ever, but the C++ standard covered here allows you to specify certain characters as trigraph sequences (the feature was later removed in C++17). A trigraph sequence is a sequence of three characters that’s used to identify another character. This was necessary way back in the dark ages of computing to accommodate characters that were missing from some keyboards. Table 1-1 shows the characters that may be specified in this way in C++.

Table 1-1. Trigraph Sequence Characters

Character   Trigraph Sequence
#           ??=
[           ??(
]           ??)
\           ??/
{           ??<
}           ??>
^           ??'
|           ??!
~           ??-

The compiler will replace all trigraph sequences with their equivalent characters before any other processing of the source code.

Escape Sequences

When you want to use character constants in a program, certain characters can be problematic. A character constant is a data item that your program will use in some way, and it can be either a single character or a character string such as the one in the earlier simple example. Obviously, you can’t enter characters such as newline or tab directly as character constants, as they’ll just do what they’re supposed to do: go to a new line or tab to the next tab position in your source code file. What you want in a character constant is the appropriate code for the character.

You can enter control characters as constants by means of an escape sequence. An escape sequence is an indirect way of specifying a character, and it always begins with a backslash (\). Table 1-2 shows the escape sequences that represent control characters.

Table 1-2. Escape Sequences That Represent Control Characters

Escape Sequence   Control Character
\n                Newline
\t                Horizontal tab
\v                Vertical tab
\b                Backspace
\r                Carriage return
\f                Form feed
\a                Alert/bell


There are some other characters that are a problem to represent directly. Clearly, the backslash character itself is difficult, because it signals the start of an escape sequence, and there are others with special significance too. Table 1-3 shows the “problem” characters you can specify with an escape sequence.

Table 1-3. Escape Sequences That Represent “Problem” Characters

Escape Sequence   “Problem” Character
\\                Backslash
\'                Single quote
\"                Double quote
\?                Question mark


Because the backslash signals the start of an escape sequence, the only way to enter a backslash as a character constant is by using two successive backslashes (\\).

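Escape sequences also provide a general way of representing characters such as those in languages other than the one your keyboard supports, because you can use a hexadecimal (base 16) or octal (base 8) number after the backslash to specify the code for a character. Because you’re using a numeric code, you can specify any character in this way, provided the code fits in the character type you’re using. A hexadecimal escape sequence consists of a lowercase x followed by one or more hexadecimal digits, so \xE3 and \x99A are examples of escape sequences in this format, although a code as large as \x99A only fits in a wide character. An uppercase X is not accepted here, so \XE3 is not a valid escape sequence.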

You can also specify a character by using up to three octal digits after the backslash (\165, for example). The absence of x determines that the code will be interpreted as an octal number.


Try It Out: Using Escape Sequences

You can produce an example of a program that uses escape sequences to specify a message to be displayed on the screen. To see the results, you’ll need to enter, compile, link, and execute the following program.

As I explained in the Introduction, exactly how you perform these steps will depend on your compiler, and you’ll need to consult your compiler’s documentation for more information. If you look up “edit”, “compile”, and “link” (and, with some compilers, “build”), you should be able to find out what you need to do.

// Program 1.2 Using escape sequences
#include <iostream>
using std::cout;
int main() {
  cout << "\n\"Least said\n\t\tsoonest mended.\"\n\a";
  return 0;
}

When you do manage to compile, link, and run this program, you should see the following output displayed:

============================================================
"Least said
                soonest mended."
============================================================

You should also hear a beep or some equivalent noise from whatever sound output facility your computer has.

HOW IT WORKS

The output you get is determined by what’s between the outermost double quotes in the statement

 cout << "\n\"Least said\n\t\tsoonest mended.\"\n\a";

In principle, everything between the outer double quotes in the preceding statement gets sent to cout. A string of characters between a pair of double quotes is called a string literal. The double quote characters just identify the beginning and end of the string literal; they aren’t part of the string. I said “in principle” because any escape sequence in the string literal would have been converted by the compiler to the character it represents, so the character will be sent to cout, not the escape sequence itself. A backslash in a string literal always indicates the start of an escape sequence, so the first character that’s sent to cout is a newline character. This positions the screen cursor at the beginning of the next line.

The next character in the string is specified by another escape sequence, \", so a double quote will be sent to cout and displayed on the screen, followed by the characters Least said. Next is another newline character corresponding to \n, so the cursor will move to the beginning of the next line. You then send two tab characters to cout with \t\t, so the cursor will be moved two tab positions to the right. The characters soonest mended. will then be displayed from that point on, followed by another double quote from the escape sequence \". Lastly, you have another newline character, which will move the cursor to the start of the next line, followed by the character equivalent of the \a escape sequence that will cause the beep to sound.

The double quote characters that are interior to the string aren’t interpreted as marking the end of the string literal because each of them is preceded by a backslash and is therefore recognized as an escape sequence. If you didn’t have the escape sequence, \", available, you would have no way of outputting a double quote because it would otherwise be interpreted as indicating the end of the string.

The name endl is defined in the <iostream> header, and its effect when you use it in an output statement is to write a single newline character so you can use endl instead of \n. \n and endl aren’t exactly equivalent, though, because using endl will result in the output buffer being flushed so any characters still in memory will be written to the output device. This won’t be the case with \n. Obviously, you can’t include endl in a string literal because it would be interpreted as simply four letters, e, n, d, and l.


CAUTION  Be aware that the final character of endl is the letter l, not the number 1. It can sometimes be difficult to tell the two apart.

Using endl, the statement in the preceding code to output the string could be written as follows:

cout << endl
     << "\"Least said"
     << endl
     << "\t\tsoonest mended.\"\a"
     << endl;

This statement sends five separate things in sequence to cout: endl, "\"Least said", endl, "\t\tsoonest mended.\"\a", and endl. This will produce the same visible output as the original statement, although the alert character is now sent just before the final newline rather than just after it. Of course, for this statement to compile as written, you would need to add another using declaration at the beginning of the program:

using std::endl;

You don’t have to choose between using either endl or the escape sequence for newline. They aren’t mutually exclusive, so you can mix them to suit yourself. For example, you could produce the same result as the original again with this statement:

cout << endl
      << "\"Least said\n\t\tsoonest mended.\"\a"
      << endl;

Here you’ve just used endl for the first and last newline characters. The one in the middle is still produced by an escape sequence. Of course, each instance of endl in the output will result in the output buffer being flushed after writing a newline character to the stream.

