\centerline{\bf XII FLOATING POINT ARITHMETIC}\bigskip
A (nonzero, normalized) base $\beta$, $t$-digit floating point number has the mathematical representation
$$\pm.d_1d_2\cdots d_t\times \beta^e,$$
where
$$m\le e\le M,\qquad 1\le d_1\le\beta-1, \qquad 0\le d_i\le\beta-1\hbox{ for }i=2,\ldots,t.$$
A real number $x$ is said to be within floating point range if $x=0$ or $\beta^{m-1}\le|x|\le \beta^M(1-\beta^{-t})$, that is, if $|x|$ lies between the smallest and the largest normalized floating point numbers.
Let $fl(x)$ denote the floating point representation of a real number $x$ within floating point range. Then
$$fl(x)=x(1+\delta),\qquad |\delta|\le u=\cases{\beta^{1-t}&chopped,\cr {1\over2}\beta^{1-t}&rounded,\cr}$$
depending on whether $fl(x)$ is obtained by chopping or rounding the base $\beta$ expansion of $x$. The bound $u$ is called the {\it unit round-off}.

\noindent{\bf Assumption}. Let $\bowtie$ denote any of the floating point arithmetical operations $+$, $-$, $*$, or $/$. Then for any floating point numbers $x$ and $y$ for which $x\bowtie y$ is within floating point range,
$$fl(x\bowtie y)=(x\bowtie y)(1+ \delta), \qquad |\delta|\le u.$$

\noindent{\bf Example}. For floating point numbers $x_i$ and $y_i$, with the sum accumulated from left to right,
$$fl\left(\sum_{i=1}^3 x_iy_i\right)=\left\{\left[ x_1y_1 (1+\delta_1) +x_2y_2(1+\delta_2) \right](1+\delta_3) + x_3y_3(1+ \delta_4)\right\} (1+\delta_5)$$
where all $|\delta_i|\le u$.

\noindent{\bf Note}. An equivalent definition of the unit round-off $u$ is that $u$ is the smallest floating point number such that $fl(1+u)>1$. In the rounded case this holds only if ties are rounded away from zero: under the round-to-nearest-even rule, which is the IEEE 754 default, $1+{1\over2}\beta^{1-t}$ is a tie that rounds back to 1.
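\medskip\noindent{\bf Worked bound (a sketch)}. The expansion in the example above can be turned into a first-order error bound. The symbols $\theta_i$ are introduced here only for illustration and are not part of the original notation. Distributing the outer factors over the three products gives
$$fl\left(\sum_{i=1}^3 x_iy_i\right)=\sum_{i=1}^3 x_iy_i(1+\theta_i),$$
where
$$\eqalign{1+\theta_1&=(1+\delta_1)(1+\delta_3)(1+\delta_5),\cr
1+\theta_2&=(1+\delta_2)(1+\delta_3)(1+\delta_5),\cr
1+\theta_3&=(1+\delta_4)(1+\delta_5).\cr}$$
Since every $|\delta_i|\le u$, each $\theta_i$ satisfies $|\theta_i|\le(1+u)^3-1=3u+O(u^2)$, and therefore
$$\left|fl\left(\sum_{i=1}^3 x_iy_i\right)-\sum_{i=1}^3 x_iy_i\right|\le\left[(1+u)^3-1\right]\sum_{i=1}^3|x_iy_i|.$$
The bound is relative to $\sum|x_iy_i|$ rather than to $\left|\sum x_iy_i\right|$, so it says little about the relative error when the products nearly cancel.\par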
\bigskip\leftline{\sl Internal machine representation of floating point numbers---\bf IEEE 754 Standard.}\medskip
\line{\vtop{\hbox{\bf 32-bit format}\smallskip\table{&\hfil\strut#\cr
&\multispan{5}31\hfil 0\cr
\tablerule \vrule&\ sign bit &\vrule&\ biased exponent &\vrule&\ significand & \vrule\cr
\tablerule &1\hfil&&8\hfil&&$\mathord\uparrow1.$\hfill23\hfill\cr}}\hfil
\vtop{\hsize=3in\leftline{Exponent bias $=7F_{16}$.}
\leftline{$\beta=2$, $t=24$, $-126\le e\le127$.}}}
\bigskip
\line{\vtop{\hbox{\bf 64-bit format}\smallskip\table{&\hfil\strut#\cr
&\multispan{5}63\hfil 0\cr
\tablerule \vrule&\ sign bit &\vrule&\ biased exponent &\vrule&\ significand & \vrule\cr
\tablerule &1\hfil&&11\hfil&&$\mathord\uparrow1.$\hfill52\hfill\cr}}\hfil
\vtop{\hsize=3in\leftline{Exponent bias $=3FF_{16}$.}
\leftline{$\beta=2$, $t=53$, $-1022\le e\le1023$.}}}
\bigskip
\noindent Except for zero and denormals, the stored significand bits are understood to follow an implicit leading 1.\ (in binary), so the represented value is $\pm1.f\times2^e$ with $f$ the stored bits, and $t$ exceeds the stored field width by one. The stored exponent is $E={\rm bias}+e$.
\vfil\eject\hrule\medskip
Memory storage for Intel 80*, MIPS R*, DEC Alpha chips (little-endian: the least significant 16 bits are at the lowest address):
$$\vbox {\table{#\tabskip=7pt&&#\cr
\strut15\hfill0&31\hfill16&47\hfill32&63\hfill48\cr
\hrulefill&\hrulefill&\hrulefill&\hrulefill\cr
\strut\vrule\ address $A$ \vrule& \vrule\ address $A+1$ \vrule& \vrule\ address $A+2$ \vrule& \vrule\ address $A+3$ \vrule\cr
\hrulefill&\hrulefill&\hrulefill&\hrulefill\cr}}$$
Memory storage for Motorola 68*, IBM RS6000, SUN Sparc, Power 60* chips (big-endian: the most significant 16 bits are at the lowest address):
$$\vbox {\table{#\tabskip=7pt&&#\cr
\strut63\hfill48&47\hfill32&31\hfill16&15\hfill0\cr
\hrulefill&\hrulefill&\hrulefill&\hrulefill\cr
\strut\vrule\ address $A$ \vrule& \vrule\ address $A+1$ \vrule& \vrule\ address $A+2$ \vrule& \vrule\ address $A+3$ \vrule\cr
\hrulefill&\hrulefill&\hrulefill&\hrulefill\cr}}$$
\hrule\medskip
\centerline{\bf Some other floating point representations.}\bigskip
\line{\vtop{\hbox{\bf Honeywell 68/60} \smallskip\table{&\hfil\strut#\cr
\tablerule \vrule&\ sign bit &\vrule&\ exponent &\vrule&\ sign bit
&\vrule& \ mantissa & \vrule\cr \tablerule &1\hfil&&7\hfil&&1\hfil&&27$|$63\hfil \cr}}\hfil \vtop{\hsize=3in\parskip=0pt \leftline{Exponent: fixed point binary integer.} \leftline{Mantissa: fixed point binary fraction.} \noindent The sign bits are part of the exponent and mantissa.\par \leftline{$\beta=2$, $t=27\mid63$, $-128\le e\le127$.}}} \bigskip \hrule\medskip \line{\vtop{\hbox{\bf CDC 6000 series} \smallskip\table{&\hfil\strut#\cr \tablerule \vrule&\ sign bit &\vrule&\ biased exponent &\vrule& \ integer coefficient &\vrule\cr \tablerule &1\hfil&&11\hfil&&48\hfil \cr}}\hfil \vtop{\hsize=3in\parskip=0pt \leftline{Exponent bias $=2^{10}$.} \noindent Note that the mantissa is an integer, not a fraction.\par \leftline{$\beta=2$, $t=48$, $-1024\le e\le1023$.}}} \bigskip \hrule\medskip \line{\vtop{\hbox{\bf IBM SYSTEM/360 series} \smallskip\table{&\hfil\strut#\cr \tablerule \vrule&\ sign bit &\vrule&\ biased exponent &\vrule&\ mantissa & \vrule\cr \tablerule &1\hfil&&7\hfil&&24$|$56\hfil \cr}}\hfil \vtop{\hsize=3in\leftline{Exponent bias $=2^6$.} \leftline{$\beta=16$, $t=6\mid14$, $-64\le e\le63$.}}} \bigskip \hrule\medskip \line{\vtop{\hbox{\bf DEC VAX series} \smallskip\table{&\hfil\strut#\cr &15\hfil&&14\hfill7 &&\ 6\hfill0\cr \tablerule \vrule&\ sign bit &\vrule&\ biased exponent &\vrule& \ mantissa &\vrule\cr \tablerule &1\hfil&&8\hfil&&$\mathord\uparrow.1$\hfill7\hfill\cr \noalign{\medskip} &\multispan{5}31\hfil16\cr \tablerule \vrule&\multispan{5} coefficient continued (part 2)\hfil&\vrule\cr \tablerule \noalign{\medskip} &\multispan{5}47\hfil32\cr \tablerule \vrule&\multispan{5} coefficient continued (part 3)\hfil&\vrule\cr \tablerule \noalign{\medskip} &\multispan{5}63\hfil48\cr \tablerule \vrule&\multispan{5} coefficient continued (part 4)\hfil&\vrule\cr \tablerule }}\hfil \vtop{\hsize=3in\leftline{Exponent bias $=2^7$.} \leftline{The most significant mantissa bit is not stored.} \leftline{$\beta=2$, $t=24\mid56$, $-128\le e\le127$.}}} \bigskip 
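\noindent{\bf Worked values (a sketch)}. Substituting the parameters $\beta$ and $t$ listed for some of the formats above into the formula for the unit round-off, $u=\beta^{1-t}$ (chopped) or $u={1\over2}\beta^{1-t}$ (rounded), gives the figures below. Both values are shown for each format, since no assumption is made here about which mode a given machine actually uses. The decimal figures are rounded to three significant digits.
$$\vbox{\halign{#\hfil\quad&\hfil$#$\hfil\quad&\hfil$#$\hfil\quad&$#$\hfil\quad&$#$\hfil\cr
format&\beta&t&u\hbox{ (chopped)}&u\hbox{ (rounded)}\cr
\noalign{\smallskip}
IEEE 32-bit&2&24&2^{-23}\approx1.19\times10^{-7}&2^{-24}\approx5.96\times10^{-8}\cr
IEEE 64-bit&2&53&2^{-52}\approx2.22\times10^{-16}&2^{-53}\approx1.11\times10^{-16}\cr
IBM System/360&16&6&16^{-5}=2^{-20}\approx9.54\times10^{-7}&{1\over2}16^{-5}=2^{-21}\approx4.77\times10^{-7}\cr}}$$
Thus a 6-digit hexadecimal mantissa, although it occupies 24 bits, has a unit round-off eight times larger than that of a 24-bit binary significand under the same rounding mode.\par\medskip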
\hrule\vfil\eject