A Bayesian classifier

In pattern classification, the probabilistic approach is the most common and results in a Bayesian classifier. It is assumed that $ P(y\vert\theta)$ and $ P(\theta)$ are given: $ P(y\vert\theta)$ is the distribution of features for a given class, and $ P(\theta)$ gives the overall frequency of class occurrences. If large, preclassified data sets are available, then these distributions can be learned reliably. The feature space is often continuous, which results in a density $ p(y\vert\theta)$, even though $ P(\theta)$ remains a discrete probability distribution. An optimal classifier, $ \pi^*$, is designed according to (9.26). It performs classification by receiving a feature vector, $ y$, and then declaring the class to be $ u = \pi^*(y)$. The expected cost under (9.32) is the probability of error.

Example 9.11 (Optical Character Recognition)   A simplified optical character recognition (OCR) problem provides an example of classification. Suppose that a camera creates a digital image of a page of text. Segmentation is first performed to determine the location of each letter. Following this, each individual letter must be classified correctly. Let $ \Theta = \{A, B, C, D, E,
F, G, H\}$; ordinarily, $ \Theta$ would include all letters of the alphabet.

Suppose that there are three different image processing algorithms:

  1. Shape extractor: This returns $ s = 0$ if the letter is composed of straight edges only, and $ s = 1$ if it contains at least one curve.
  2. End counter: This returns $ e$, the number of segment ends. For example, $ O$ has none and $ X$ has four.
  3. Hole counter: This returns $ h$, the number of holes enclosed by the character. For example, $ X$ has none and $ O$ has one.
The feature vector is $ y = (s,e,h)$. The values that should be reported under ideal conditions are shown in Figure 9.1, which indicates the sets of letters consistent with each feature value: $ \Theta(s)$, $ \Theta(e)$, and $ \Theta(h)$. Intersecting these yields $ \Theta(y)$ for any combination of $ s$, $ e$, and $ h$.

Figure 9.1: A mapping from letters to feature values.

    Feature   Value   Letters
    s         0       A E F H
    s         1       B C D G
    e         0       B D
    e         2       A C G
    e         3       E F
    e         4       H
    h         0       C E F G H
    h         1       A D
    h         2       B

Imagine doing classification under the nondeterministic model, with the assumption that the features always provide correct information. For $ y = (0,2,1)$, the only possible letter is $ A$. For $ y = (1,0,2)$, the only letter is $ B$. If each $ (s,e,h)$ were consistent with at most one letter, then a perfect classifier could be constructed. Unfortunately, $ (0,3,0)$ is consistent with both $ E$ and $ F$. In the worst case, the cost under (9.32) is $ 1$.
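Under the nondeterministic model, classification is simply set intersection. A minimal sketch, assuming the ideal feature values of Figure 9.1 (the function name `theta_of` is invented for illustration):

```python
# Ideal feature values from Figure 9.1: for each feature value, the set of
# letters consistent with it.
S = {0: set("AEFH"), 1: set("BCDG")}                           # shape: straight vs. curved
E = {0: set("BD"), 2: set("ACG"), 3: set("EF"), 4: set("H")}   # number of segment ends
H = {0: set("CEFGH"), 1: set("AD"), 2: set("B")}               # number of enclosed holes

def theta_of(y):
    """Return Theta(y): the set of letters consistent with y = (s, e, h)."""
    s, e, h = y
    return S.get(s, set()) & E.get(e, set()) & H.get(h, set())

print(theta_of((0, 2, 1)))  # {'A'}  -- uniquely determined
print(theta_of((1, 0, 2)))  # {'B'}
print(theta_of((0, 3, 0)))  # ambiguous: both 'E' and 'F'
```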

One way to fix this is to introduce a new feature. Suppose that an image processing algorithm is used to detect corners, places at which two segments meet at a right ($ 90$-degree) angle. Let $ c$ denote the number of corners, and let the new feature vector be $ y = (s,e,h,c)$. The new feature nicely distinguishes $ E$ from $ F$, for which $ c=2$ and $ c=1$, respectively. Now all letters can be classified without error.

Of course, in practice, the image processing algorithms occasionally make mistakes. A Bayesian classifier can be designed to maximize the probability of success. The observations are assumed to be conditionally independent given the class; a classifier built under this assumption is called naive. Suppose that the four image processing algorithms are run over a training data set and the results are recorded. In each case, the correct classification is determined by hand to obtain the probabilities $ P(s\vert\theta)$, $ P(e\vert\theta)$, $ P(h\vert\theta)$, and $ P(c\vert\theta)$. For example, suppose that the hole counter receives the letter $ A$ as input. After running the algorithm over many occurrences of $ A$ in text, it may be determined that $ P(h = 1 \vert\; \theta = A) = 0.9$, which is the correct answer. Smaller probabilities account for mistakes, perhaps $ P(h = 0 \vert\; \theta = A) = 0.09$ and $ P(h = 2 \vert\; \theta = A) = 0.01$. Assuming that the output of each image processing algorithm is independent given the input letter, a joint probability can be assigned as

$\displaystyle P(y\vert\theta) = P(s,e,h,c\vert \theta) = P(s\vert\theta)P(e\vert\theta)P(h\vert\theta)P(c\vert\theta) .$ (9.33)
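Under the independence assumption, evaluating (9.33) reduces to a product of table lookups. A minimal sketch for $ \theta = A$: the hole-counter row uses the probabilities stated in the text, while the shape, end, and corner tables are made-up placeholders.

```python
# Per-feature conditionals for theta = 'A'. P_h uses the values from the
# text; P_s, P_e, and P_c are placeholder values for illustration only.
P_s = {0: 0.92, 1: 0.08}
P_e = {2: 0.88, 3: 0.07, 4: 0.05}
P_h = {0: 0.09, 1: 0.90, 2: 0.01}
P_c = {3: 0.85, 2: 0.10, 4: 0.05}

def likelihood(y):
    """P(y | theta = A) = P(s|A) P(e|A) P(h|A) P(c|A), as in (9.33)."""
    s, e, h, c = y
    return P_s.get(s, 0.0) * P_e.get(e, 0.0) * P_h.get(h, 0.0) * P_c.get(c, 0.0)

print(likelihood((0, 2, 1, 3)))  # 0.92 * 0.88 * 0.90 * 0.85
```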

The value of the prior $ P(\theta)$ can be obtained by running the classifier over large amounts of hand-classified text and recording the relative numbers of occurrences of each letter. It is interesting to note that some context-specific information can be incorporated. If the text is known to be written in Spanish, then $ P(\theta)$ should be different from that for text written in English. Tailoring $ P(\theta)$ to the type of text that will appear improves the performance of the resulting classifier.
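Estimating the prior amounts to counting relative letter frequencies in hand-classified text. A minimal sketch (the sample string is a made-up placeholder):

```python
from collections import Counter

# Count occurrences of each letter of Theta in a hand-classified sample,
# then normalize the counts to obtain the prior P(theta).
text = "ABAD CABBAGE BEACH"  # placeholder training text
counts = Counter(ch for ch in text if ch in "ABCDEFGH")
total = sum(counts.values())
prior = {letter: counts[letter] / total for letter in counts}

print(prior["A"])  # relative frequency of 'A' in the sample
```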

The classifier makes its decision by choosing the class that minimizes the probability of error, which is equivalent to maximizing the posterior $ P(\theta\vert y)$. Neglecting the constant $ P(y)$ in the denominator of Bayes' rule in (9.26), the optimal classifier selects the $ \theta \in \Theta$ that maximizes

$\displaystyle P(s\vert\theta)P(e\vert\theta)P(h\vert\theta)P(c\vert\theta)P(\theta) .$ (9.34)

$ \blacksquare$
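Putting the pieces together, a naive Bayes classifier evaluates (9.34) for each class and picks the maximizer. A minimal sketch restricted to distinguishing $ E$ from $ F$: all probability tables are made-up placeholders, with the corner conditionals reflecting $ c=2$ for $ E$ and $ c=1$ for $ F$.

```python
# Placeholder conditionals P(s|theta), P(e|theta), P(h|theta), P(c|theta)
# for the two classes that the three original features cannot separate.
cond = {
    "E": {"s": {0: 0.90, 1: 0.10}, "e": {3: 0.80, 2: 0.15, 4: 0.05},
          "h": {0: 0.95, 1: 0.05}, "c": {2: 0.70, 1: 0.20, 3: 0.10}},
    "F": {"s": {0: 0.90, 1: 0.10}, "e": {3: 0.85, 2: 0.10, 4: 0.05},
          "h": {0: 0.95, 1: 0.05}, "c": {1: 0.70, 2: 0.20, 0: 0.10}},
}
prior = {"E": 0.6, "F": 0.4}  # placeholder class frequencies

def score(theta, y):
    """Unnormalized posterior (9.34): product of conditionals times the prior."""
    s, e, h, c = y
    t = cond[theta]
    return (t["s"].get(s, 0.0) * t["e"].get(e, 0.0)
            * t["h"].get(h, 0.0) * t["c"].get(c, 0.0) * prior[theta])

def classify(y):
    """Choose the class maximizing the unnormalized posterior."""
    return max(cond, key=lambda theta: score(theta, y))

print(classify((0, 3, 0, 2)))  # 'E': the corner feature breaks the tie
```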

Steven M LaValle 2012-04-20