Regarding Chapter 3, there are many specific comments
to be made:
- p. 29, line 5, change "10 percent" to "less
than 10 percent"
- p. 29, line 2 of first bulleted entry, change "in
the text" to "in the text or attached to it"
- p. 29, 2nd bullet, consider the quotation "To
be or not to be, that is the question" and what is left after
stopword removal.
- p. 29, last two bullets, consider the abbreviation
AT&T, how the embedded punctuation is handled,
and what would be left from indexing after stopword
removal (where single letters are usually removed too).
- p. 29, top paragraph: this is rather unclear. Usually,
one thinks of the process as taking a word and finding the
information for it, typically a list of record identifiers
indicating where it occurs. The list can then
be read and used.
- p. 30, early part of 3.2, consider the inverted file
as a way to start with a string or number, and find a set
of occurrences. This can be done in one step, by using a
B-tree with variable length leaves, or in two steps by
using other search schemes, and a pointer to the
description of the set of occurrences. Note that a
scheme that maintains a sorted order has the added
benefit of also supporting alphabetical browsing and
truncation operations on the list of terms.
- p. 32, 9th line from bottom, change "3.4.1" to
"3.4.1 or 3.4.2".
- p. 33, bottom, exercise for the researcher -
look at a real collection and plot a histogram of the number
of postings.
- p. 33, exercise for the researcher - why and
when would it make sense to use a binary search over an
inverted file? Why should the lists be doubly linked?
What is the cost of updating a list?
- p. 35, Figure 3.6, in-depth reading - for the
data in Figure 3.4, show these representations.
- p. 35, 2nd line, exercise for the researcher -
why use a right-threaded tree?
- p. 36, Table 3.1, in-depth reading - what is the
main benefit of new vs. old?
- p. 37, 3rd line of section 3.4.2, change "principles" to "facts".
- p. 40, middle of page, exercise for the researcher
- How does the buffering help? What is the effect of
buffer size?
- p. 41, Figure 3.9, in-depth reading - What is
the value of M for this example? Can you explain
how the 4 for term 11 in Load 2 gets to
that particular place, from its place in the entry in Load 2 for document 4?
- p. 42, Table 3.2, fix the column headings for the
two right columns.
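
To make the p. 29 and p. 30 comments above more concrete, here is a minimal sketch in Python. It is an illustration only and is not taken from the chapter. The stopword list, the function names, and the choice of a sorted Python list with binary search (standing in for a B-tree or other ordered structure) are all assumptions made for this example.

```python
import bisect
import re

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"a", "and", "at", "be", "is", "not", "or", "that", "the", "to"}


def index_terms(text, split_on_punctuation=True):
    """Return the terms left after lowercasing, tokenizing, and stopword removal."""
    # When split_on_punctuation is False, '&' is kept inside a token,
    # which is one possible way to treat embedded punctuation.
    pattern = r"[a-z0-9]+" if split_on_punctuation else r"[a-z0-9&]+"
    tokens = re.findall(pattern, text.lower())
    # Single letters are dropped along with the stopwords.
    return [t for t in tokens if len(t) > 1 and t not in STOPWORDS]


def build_inverted_file(docs):
    """Map each term to a sorted list of the record identifiers where it occurs."""
    postings = {}
    for doc_id, text in docs.items():
        for term in set(index_terms(text, split_on_punctuation=False)):
            postings.setdefault(term, []).append(doc_id)
    for ids in postings.values():
        ids.sort()
    return postings


def truncation_search(sorted_terms, prefix):
    """Return every term that starts with prefix, via binary search."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    # "\uffff" sorts after any character used in these terms.
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]
```

With this stopword list, index_terms("To be or not to be, that is the question") leaves only "question". For "AT&T", splitting on punctuation leaves nothing at all, while keeping the ampersand inside the token leaves "at&t". Keeping the term list sorted, for example with sorted(build_inverted_file(docs)), is what allows truncation_search to answer a prefix query such as "inve" by reading one contiguous slice of the list.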