MAT12004 Floating point formats

Floating point formats

The IEEE standard consists of a set of binary representations of real numbers. A floating point number consists of three parts: the sign ($+$ or $-$), a mantissa, which contains the string of significant bits, and an exponent. The three parts are stored together in a single computer word.

There are three commonly used levels of precision for floating point numbers: single precision, double precision, and extended precision, also known as long-double precision. The number of bits allocated for each floating point number in the three formats is 32, 64, and 80, respectively. The bits are divided among the parts as follows:
\begin{tabular}{|l|c|c|c|}
\hline precision & sign & exponent & mantissa \\
\hline \hline single & 1 & 8 & 23 \\
\hline double & 1 & 11 & 52 \\
\hline long double & 1 & 15 & 64 \\
\hline
\end{tabular}
All three types of precision work essentially the same way. The form of a normalized IEEE floating point number is
$$\pm 1 . b b b \ldots b \times 2^p \tag{0.6}$$
where each of the $N$ $b$'s is 0 or 1, and $p$ is an $M$-bit binary number representing the exponent. Normalization means that, as shown in (0.6), the leading (leftmost) bit must be 1.
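
The double precision row of the table can be inspected directly, since a Python `float` is an IEEE double on common platforms. The following is a minimal sketch: the helper name `double_fields` is made up for illustration, and the exponent bias of 1023 is a property of the double format that this section does not cover.

```python
import struct

def double_fields(x):
    """Split a Python float (an IEEE double) into (sign, exponent, mantissa) bit fields."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # 1 bit
    exponent = (bits >> 52) & 0x7FF      # 11 bits, stored with a bias of 1023
    mantissa = bits & ((1 << 52) - 1)    # 52 bits to the right of the leading 1
    return sign, exponent, mantissa

sign, exponent, mantissa = double_fields(9.0)
# Print the sign bit, the unbiased exponent p, and the 52 mantissa bits.
print(sign, exponent - 1023, format(mantissa, "052b"))
```

The leading 1 of the normalized form does not appear in the 52-bit field; only the bits after the radix point are stored.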

When a binary number is stored as a normalized floating point number, it is “left-justified,” meaning that the leftmost 1 is shifted just to the left of the radix point. The shift is compensated by a change in the exponent. For example, the decimal number 9, which is 1001 in binary, would be stored as
$$+1.001 \times 2^3$$
because a shift of 3 bits, or multiplication by $2^3$, is necessary to move the leftmost one to the correct position.
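
The left-justification step can be reproduced with the standard library. This is a small sketch, not part of the original text; `normalize` is a hypothetical helper built on `math.frexp`, which uses a slightly different convention (a fraction in $[0.5, 1)$), so one extra shift is needed.

```python
import math

def normalize(x):
    """Return (m, p) with x = m * 2**p and 1 <= |m| < 2, for finite nonzero x."""
    m, e = math.frexp(x)     # frexp gives x = m * 2**e with 0.5 <= |m| < 1
    return m * 2, e - 1      # shift one more place so that 1 <= |m| < 2

m, p = normalize(9.0)
print(m, p)   # 1.125 3
```

Here 1.125 is $1.001$ in binary, matching $+1.001 \times 2^3$.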

IEEE Rounding to Nearest Rule

For double precision, if the 53rd bit to the right of the binary point is 0, then round down (truncate after the 52nd bit). If the 53rd bit is 1, then round up (add 1 to bit 52), unless all known bits to the right of the 1 are 0's, in which case 1 is added to bit 52 if and only if bit 52 is 1.

For the number 9.4 discussed previously, which is $1001.\overline{0110}$ in binary, the 53rd bit to the right of the binary point is a 1 and is followed by other nonzero bits. The Rounding to Nearest Rule says to round up, or add 1 to bit 52. Therefore, the floating point number that represents 9.4 is
$$+1.0010110011001100110011001100110011001100110011001101 \times 2^3 .$$
Denote the IEEE double precision floating point number associated to $x$, using the Rounding to Nearest Rule, by $\mathrm{fl}(x)$.
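
The bit string for $\mathrm{fl}(9.4)$ can be checked against what the machine actually stores. The sketch below assumes Python floats are IEEE doubles and that the literal `9.4` is converted with round to nearest, which holds on common platforms. The variable names are illustrative only.

```python
import struct
from fractions import Fraction

# The 52 mantissa bits quoted in the text for fl(9.4).
expected = "0010110011001100110011001100110011001100110011001101"

# Reinterpret the double 9.4 as a 64-bit integer and pull out its fields.
(bits,) = struct.unpack(">Q", struct.pack(">d", 9.4))
mantissa_bits = format(bits & ((1 << 52) - 1), "052b")
exponent = ((bits >> 52) & 0x7FF) - 1023   # remove the bias of 1023

print(mantissa_bits == expected, exponent)   # True 3

# Exact rounding error fl(9.4) - 9.4; positive because the rule rounded up.
error = Fraction(9.4) - Fraction(47, 5)
print(error)
```

`Fraction(9.4)` converts the stored double exactly, so `error` is the true difference between $\mathrm{fl}(9.4)$ and $9.4 = 47/5$.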

Representation of a floating point number
To represent a real number as a double precision floating point number, convert the number to binary, and carry out two steps:

1. Justify. Shift the radix point to the right of the leftmost 1, and compensate with the exponent.
2. Round. Apply a rounding rule, such as the IEEE Rounding to Nearest Rule, to reduce the mantissa to 52 bits.

To find $\mathrm{fl}(1 / 6)$, note that $1 / 6$ is equal to $0.0 \overline{01}=0.001010101 \ldots$ in binary.
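
The justify and round steps can be sketched in exact rational arithmetic and compared with the hardware result. This is an illustrative sketch under simplifying assumptions: the function name `fl` simply mirrors the notation above, and it handles only positive rationals in the normalized range, ignoring overflow and subnormal numbers.

```python
from fractions import Fraction

def fl(x):
    """Round a positive Fraction x to the nearest double, returned as an exact Fraction."""
    # Step 1, justify: find p with 1 <= x / 2**p < 2.
    p = 0
    while x / Fraction(2) ** p >= 2:
        p += 1
    while x / Fraction(2) ** p < 1:
        p -= 1
    # Step 2, round: keep the leading 1 and 52 bits after the radix point.
    scaled = x / Fraction(2) ** p * 2**52            # 1.bbb...b shifted left 52 places
    kept = scaled.numerator // scaled.denominator    # the leading 1 plus the first 52 bits
    rest = scaled - kept                             # everything from bit 53 onward
    half = Fraction(1, 2)
    # Round up if the tail exceeds one half; on an exact tie, only if bit 52 is 1.
    if rest > half or (rest == half and kept % 2 == 1):
        kept += 1
    return Fraction(kept, 2**52) * Fraction(2) ** p

x = Fraction(1, 6)
print(fl(x) == Fraction(1 / 6))   # compare with the double computed by the machine
```

For $1/6 = 1.0101\ldots \times 2^{-3}$, bit 53 of the mantissa is 0, so the rule truncates and $\mathrm{fl}(1/6)$ is slightly smaller than $1/6$.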


