In general, this is a good chapter, with introduction, history,
experimental results, guidelines for implementers, enhancements, and a
summary of related topics.
The organization makes sense, but does cause some findings to be
mentioned several times.
This chapter should be read before Chapter 11, though there are cross
references and some repetition between the two chapters.
Regarding Chapter 14, there are many specific comments to be made:
- p. 363, very bottom, note that the robustness claimed for natural
language queries depends purely on the queries being long enough that
incorrect terms can be ignored.
- p. 364, very top, note that it is not certain how precise one can be
with natural language queries when searching large full-text collections.
- p. 365, figure, bottom section, in each group in the table, the
third line is misleading. The vector shown is the AND of the query
vector and the record vector, and the number is their dot product.
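To make the intended reading of that third line concrete, here is a minimal sketch with made-up binary vectors (the actual terms and vectors in the figure are of course the book's own):

```python
# Hypothetical binary term vectors for a query and a record.
# The third line of each group should show the AND of the two vectors;
# the number shown is their dot product (the count of shared 1s).
query  = [1, 0, 1, 1, 0]
record = [1, 1, 0, 1, 0]

overlap = [q & r for q, r in zip(query, record)]   # AND of query and record
score = sum(q * r for q, r in zip(query, record))  # dot product

print(overlap)  # [1, 0, 0, 1, 0]
print(score)    # 2
```

For binary vectors the dot product simply counts the positions where the AND vector is 1, which is why the two quantities belong on the same line of the figure.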
- p. 368, figure, note that there are two names for each of the
equations at the bottom. Thus one can use w^1 or F_1
to refer to the first equation.
- p. 369, 4th paragraph, discusses two types of experiments. The
retrospective ones give an upper bound on performance using this
approach, showing what would be optimal in those circumstances. The
predictive experiments suggest how the method would work in
actual practice.
- p. 370, first equation, add the missing left parenthesis. Regarding
the second equation, see also p. 375, where it is repeated. When the
value is less than or equal to 0.5, we have the equation shown in the
middle of p. 375 for w_iq, which was developed by Fox [1].
- p. 370-1, omit Sections 14.3.3-4.
- p. 372, end of Section 14.4.1, note that tf*idf, or something very
much like it, is being discussed here.
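For readers who want to connect the discussion to the formula, a minimal tf*idf sketch over a toy collection (the documents and terms here are hypothetical, not from the chapter):

```python
import math

# Toy collection of three hypothetical documents.
docs = [
    "information retrieval ranking",
    "ranking of documents by weights",
    "weights for information",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)  # collection size

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term)             # raw term frequency in the document
    df = sum(term in d for d in tokenized)  # number of documents containing the term
    idf = math.log(N / df) if df else 0.0   # inverse document frequency
    return tf * idf

print(tf_idf("ranking", tokenized[0]))  # tf=1, df=2 -> log(3/2)
```

Many variants exist (log-scaled tf, smoothed idf); this is only the plain product form.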
- p. 373, top, note that latent semantic indexing (LSI) aims to
find the key concepts, or orthogonal semantic aspects, reducing the
dimensionality of the document-term matrix. Recently, there have been
some new algorithms to speed up the indexing.
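The dimensionality reduction at the heart of LSI is a truncated singular value decomposition of the document-term matrix; a small sketch (with a made-up 4-term by 3-document matrix, using NumPy for the SVD):

```python
import numpy as np

# Hypothetical 4-term x 3-document matrix. LSI keeps only the top k
# singular values/vectors, projecting terms and documents into a
# k-dimensional space of "concepts" (orthogonal semantic aspects).
A = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                         # reduced dimensionality
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation of A

print(np.round(A_k, 2))
```

Queries are folded into the same k-dimensional space before similarity is computed, which is what lets LSI match documents that share concepts but not exact terms.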
- p. 374, item 2, note that this encourages use of tf*idf. However,
the tf component may be ill-advised when there are
very short documents (like those used with MARIAN).
- p. 377, item 1, the first line is really just a conjecture: we do
not yet know how well one can do with large collections of long texts
using only ranking, avoiding adjacency and field restrictions. The
same applies to the similar conjecture underlying the discussion
under item 2.
- p. 378, par. after the figure, recall p. 34 and Figure 3.5.
- p. 380, 2nd from last par., recall Chapter 4. Note that a sort is
not necessary if one only wants the top k items. In that case, the
job can be done in linear time using an array of size k that stores
the document number and weight of the k highest-similarity items seen
so far. This process of accumulators and top-k selection was used in
the REVTOLC experiment at VPI&SU.
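A minimal sketch of that top-k selection (the document numbers and similarity values below are hypothetical): a fixed array of k slots is scanned once per candidate, replacing the current minimum when a larger weight arrives, so the work is O(n*k), effectively linear in n for small, fixed k.

```python
# Select the k highest-weighted (doc_id, weight) pairs without a full sort.
def top_k(similarities, k):
    best = []  # at most k (doc_id, weight) pairs seen so far
    for doc_id, weight in similarities:
        if len(best) < k:
            best.append((doc_id, weight))
        else:
            # locate the weakest of the current top k
            m = min(range(k), key=lambda i: best[i][1])
            if weight > best[m][1]:
                best[m] = (doc_id, weight)
    # only the k survivors need ordering
    return sorted(best, key=lambda p: -p[1])

sims = [(1, 0.2), (2, 0.9), (3, 0.5), (4, 0.7), (5, 0.1)]
print(top_k(sims, 2))  # [(2, 0.9), (4, 0.7)]
```

A heap of size k would reduce the per-item cost to O(log k), but for the small k typical of ranked retrieval the flat array described in the text is simple and fast.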
- p. 387, you can omit Section 14.8.