Floating-Point Operations - Practices

This article looks at some of the basic data types that are built into C++. If you're learning how to use C++, you will want to keep reading, since you'll be using these data types in all of your programs. It is taken from chapter two of the book Beginning ANSI C++: The Complete Language, by Ivor Horton (Apress, 2004; ISBN: 1590592271).

By: Apress Publishing
September 08, 2005


Numerical values that aren’t integers are stored as floating-point numbers. Internally, floating-point numbers have three parts: a sign (positive or negative), a mantissa (which is a value greater than or equal to 1 and less than 2 that has a fixed number of digits), and an exponent. Inside your computer, of course, both the mantissa and the exponent are binary values, but for the purposes of explaining how floating-point numbers work, I’ll talk about them as decimal values.

The value of a floating-point number is the signed value of the mantissa, multiplied by 10 to the power of the exponent, as shown in Table 2-7.

Table 2-7. Floating-Point Number Value

Sign (+/-)  Mantissa  Exponent  Value
    -       1.2345       3      -1.2345 x 10^3 (which is -1234.5)


You can write a floating-point literal in three basic forms:

  • As a decimal value including a decimal point (for example, 110.0).


  • With an exponent (for example, 11E1) in which the decimal part is multiplied by the power of 10 specified after the E (for exponent). You have the option of using either an upper- or a lowercase letter E to precede the exponent.


  • Using both a decimal point and an exponent (for example, 1.1E2).


All three examples correspond to the same value, 110.0. Note that spaces aren’t allowed within floating-point literals, so you must not write 1.1 E2, for example. The latter would be interpreted by the compiler as two separate things: the floating-point literal 1.1 and the name E2.

NOTE A floating-point literal must contain a decimal point, or an exponent, or both. If you write a numeric literal with neither, then you have an integer.
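
The three literal forms, and the rule in the note, can be checked with a short sketch. It assumes a C++11 or later compiler (for static_assert and decltype, which postdate the ANSI standard this text describes), and the helper name literal_forms_agree is made up for illustration.

```cpp
#include <type_traits>

// A literal with a decimal point, an exponent, or both is a double;
// a numeric literal with neither is an integer.
static_assert(std::is_same<decltype(110.0), double>::value, "decimal point only");
static_assert(std::is_same<decltype(11E1), double>::value, "exponent only");
static_assert(std::is_same<decltype(1.1e2), double>::value, "point and exponent");
static_assert(std::is_same<decltype(110), int>::value, "neither: an integer");

// Hypothetical helper: true if all three forms denote the same value, 110.0.
bool literal_forms_agree() {
  const double with_point = 110.0;    // decimal point only
  const double with_exponent = 11E1;  // 11 multiplied by 10 to the power 1
  const double with_both = 1.1e2;     // lowercase e works as well
  return with_point == with_exponent && with_exponent == with_both;
}
```

Calling literal_forms_agree() from main() should return true, because 110 is exactly representable as a double.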

Floating-Point Data Types

There are three floating-point data types that you can use, as described in Table 2-8.

Table 2-8. Floating-Point Data Types

Data Type    Description
float        Single precision floating-point values
double       Double precision floating-point values
long double  Double-extended precision floating-point values

The term “precision” here refers to the number of significant digits in the mantissa. The data types are in order of increasing precision, with float providing the lowest number of digits in the mantissa and long double the highest. Note that the precision only determines the number of digits in the mantissa. The range of numbers that can be represented by a particular type is determined by the range of possible exponents.

The precision and range of values aren’t prescribed by the ANSI standard for C++, so what you get with each of these types depends on your compiler. This will usually make the best of the floating-point hardware facilities provided by your computer. Generally, type long double will provide a precision that’s greater than or equal to that of type double, which in turn will provide a precision that is greater than or equal to that of type float.

Typically, you’ll find that type float will provide 7 digits precision, type double will provide 15 digits precision, and type long double will provide 19 digits precision, although double and long double turn out to be the same with some compilers. As well as increased precision, you’ll usually get an increased range of values with types double and long double.

Typical ranges of values that you can represent with the floating-point types on an Intel processor are shown in Table 2-9.

Table 2-9. Floating-Point Type Ranges

Type         Precision (Decimal Digits)  Range (+ or -)
float        7                           1.2 x 10^-38 to 3.4 x 10^38
double       15                          2.2 x 10^-308 to 1.8 x 10^308
long double  19                          3.3 x 10^-4932 to 1.2 x 10^4932

The numbers of decimal digits of precision in Table 2-9 are approximate. Zero can be represented exactly for each of these types, but values between zero and the lower limit in the positive or negative range can’t be represented, so these lower limits for the ranges are the smallest nonzero values that you can have.

Simple floating-point literals with just a decimal point are of type double, so let’s look at how to define variables of that type first. You can specify a floating-point variable using the keyword double, as in this statement:

double inches_to_mm = 25.4;

This declares the variable inches_to_mm to be of type double and initializes it with the value 25.4. You can also use const when declaring floating-point variables, and this is a case in which you could sensibly do so. If you want to fix the value of the variable, the declaration statement might be

const double inches_to_mm = 25.4; // A constant conversion factor

If you don’t need the precision and range of values that variables of type double provide, you can opt to use the keyword float to declare your floating-point variable, for example:

float pi = 3.14159f;

This statement defines a variable pi with the initial value 3.14159. The f at the end of the literal specifies it to be a float type. Without the f, the literal would have been of type double, which wouldn’t cause a problem in this case, although you may get a warning message from your compiler. You can also use an uppercase letter F to indicate that a floating-point literal is of type float.

To specify a literal of type long double, you append an upper- or lowercase L to the number. You could therefore declare and initialize a variable of this type with the statement

long double root2 = 1.4142135623730950488L; // Square root of 2
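
The effect of the suffixes on a literal's type can be confirmed at compile time. This sketch again assumes a C++11 or later compiler; the helper suffix_changes_stored_value is a made-up name, and its result assumes the usual IEEE 754 float and double formats.

```cpp
#include <type_traits>

// The suffix, or its absence, determines the type of the literal.
static_assert(std::is_same<decltype(25.4), double>::value, "no suffix: double");
static_assert(std::is_same<decltype(3.14159f), float>::value, "f: float");
static_assert(std::is_same<decltype(3.14159F), float>::value, "F: float");
static_assert(std::is_same<decltype(1.4142135623730950488L), long double>::value,
              "L: long double");
static_assert(std::is_same<decltype(1.4142135623730950488l), long double>::value,
              "l: long double");

// Hypothetical helper: a float holds fewer mantissa digits than a double,
// so the same decimal text is stored as a slightly different value
// (assumption: IEEE 754 single and double precision).
bool suffix_changes_stored_value() {
  const float narrow = 3.14159f;
  const double wide = 3.14159;
  return static_cast<double>(narrow) != wide;
}
```

The lowercase l is legal but easy to confuse with the digit 1, which is a reason to prefer the uppercase L.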

Floating-Point Operations

The modulus operator, %, can’t be used with floating-point operands, but all the other binary arithmetic operators that you have seen, +, -, *, and /, can be. You can also apply the prefix and postfix increment and decrement operators, ++ and --, to a floating-point variable with the same effect as for an integer: the variable will be incremented or decremented by 1.
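
As a small illustration of these operators, here is a sketch with two helper functions. The names to_millimetres and incremented are invented for this example; the conversion factor is the one used earlier in the text.

```cpp
// Hypothetical helper: multiplies by the conversion factor from the text.
double to_millimetres(double inches) {
  const double inches_to_mm = 25.4;  // a constant conversion factor
  return inches * inches_to_mm;
}

// Hypothetical helper: applies the prefix increment operator to a
// floating-point value, which adds 1 to it.
double incremented(double value) {
  ++value;
  return value;
}

// The following line is left as a comment because it would not compile:
// the % operator is not defined for floating-point operands.
// double remainder = 7.5 % 2.0;
```

For example, incremented(2.5) yields 3.5, and to_millimetres(2.0) yields 50.8.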

As with integer operands, the result of division by zero is undefined so far as the standard is concerned, but specific C++ implementations generally have their own way of dealing with this, so consult your product documentation.

With most computers today, the hardware floating-point operations are implemented according to the IEEE 754 standard (also known as IEC 559). Although IEEE 754 isn’t required by the C++ standard, it does provide for identification of some aspects of floating-point operations on machines on which IEEE 754 applies. The floating-point standard defines special values having a binary mantissa of all zeros and an exponent of all ones to represent +infinity or -infinity, depending on the sign. When you divide a positive nonzero value by zero, the result will be +infinity, and dividing a negative value by zero will result in -infinity. Another special floating-point value defined by IEEE 754 is called Not a Number, usually abbreviated to NaN. This is used to represent a result that isn’t mathematically defined, such as arises when you divide zero by zero or you divide infinity by infinity.

Any subsequent operation in which either or both operands are a value of NaN results in NaN. Once an operation in your program results in a value of ±infinity, this will pollute all subsequent operations in which it participates. Combining a normal value with ±infinity results in ±infinity. Dividing ±infinity by ±infinity or multiplying ±infinity by zero results in NaN. Table 2-10 summarizes all these possibilities.

Table 2-10. Floating-Point Operations with NaN Operands

Operation        Result      Operation            Result
±N/0             ±Infinity   0/0                  NaN
±Infinity ± N    ±Infinity   ±Infinity/±Infinity  NaN
±Infinity*N      ±Infinity   Infinity-Infinity    NaN
±Infinity/N      ±Infinity   Infinity*0           NaN
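
The entries in Table 2-10 can be tried out with the sketch below. It is only meaningful where the implementation follows IEEE 754, which std::numeric_limits<double>::is_iec559 reports; remember that division by zero is formally undefined by the C++ standard, so the divide calls rely on that assumption. The function names are invented for this example, and std::isinf and std::isnan assume a C++11 or later <cmath>.

```cpp
#include <cmath>
#include <limits>

// True when the implementation reports IEEE 754 (IEC 559) behavior for double.
bool has_ieee754() {
  return std::numeric_limits<double>::is_iec559;
}

// Hypothetical helper: performs the division at run time.
double divide(double numerator, double denominator) {
  return numerator / denominator;
}

// Hypothetical helper: checks the rows of Table 2-10
// (assumption: IEEE 754 arithmetic with default compiler settings).
bool special_values_behave_as_tabulated() {
  const double inf = std::numeric_limits<double>::infinity();
  const double nan = std::numeric_limits<double>::quiet_NaN();
  const double zero = 0.0;
  return std::isinf(inf + 1.0)                                    // Infinity + N
      && std::isinf(inf * 2.0)                                    // Infinity * N
      && std::isinf(inf / 2.0)                                    // Infinity / N
      && std::isnan(inf - inf)                                    // Infinity - Infinity
      && std::isnan(inf * zero)                                   // Infinity * 0
      && std::isnan(inf / inf)                                    // Infinity / Infinity
      && std::isnan(nan + 1.0)                                    // NaN operand
      && std::isinf(divide(1.0, zero)) && divide(1.0, zero) > 0   // +N / 0
      && std::isinf(divide(-1.0, zero)) && divide(-1.0, zero) < 0 // -N / 0
      && std::isnan(divide(zero, zero));                          // 0 / 0
}
```

A cautious program would call has_ieee754() first and only rely on the tabulated results when it returns true.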

Using floating-point variables is really quite straightforward, but there’s no substitute for experience, so let’s try an example.
