*NEWS*: since June 2016 the vision-ary project has joined ARGO Vision, an innovative firm that excels in visual recognition. For inquiries about cascades and more, please contact ARGO Vision.

This note discusses how real-time face tracking in video can be achieved by relying on a Bayesian approach realized in a multi-threaded architecture. The system is based on a probabilistic interpretation of the output provided by a cascade of AdaBoost classifiers. Results show that such an integrated approach is appealing with respect both to robustness (note how wrong hypotheses are deleted and how almost-correct ones are “tuned”) and to computational efficiency (in 2009, on a dual-core PC, face tracking ran at ~100 fps).

**Introduction**

Face tracking can be performed either according to a frame-based approach (e.g., [1]) or according to a detection-and-tracking approach, where faces are detected in the first frame and tracked through the video sequence (for a review, refer to [2]). Clearly, in the first case, temporal information is not exploited, and the intrinsic independence among successive detections makes it difficult to reconstruct the track of each subject. In the second case a loss of information may occur (e.g., new faces entering the scene) and, in general, the output of face detection is used only at initialization, while tracking relies upon low-level features (color histograms [3], contours [4], etc.), which are very sensitive to the acquisition conditions. To overcome these drawbacks, a tighter coupling between face detection and tracking has been proposed [2]. Such an approach can be given a simple and elegant form in the Bayesian framework.

Each face is characterized at frame t of the video stream by a state vector xt, e.g., a face bounding box. The tracking goal is to estimate the correct state xt given all the measurements Zt = {z1, … , zt} up to that moment, or equivalently to construct the posterior probability density function (pdf) p(xt|Zt). The theoretically optimal solution is provided by recursive Bayesian filtering: in the prediction step it uses the dynamic equation and the already computed pdf of the state at time t−1, p(xt−1|Zt−1), to derive the prior pdf of the current state, p(xt|Zt−1); then, in the update step, it employs the face likelihood function p(zt|xt) of the current measurement to compute the posterior pdf p(xt|Zt). Thus, the key issue of a tight coupling approach is to provide a face detection algorithm suitable to calculate the face likelihood function p(zt|xt) and, if a Particle Filtering (PF, [4]) implementation is adopted, to generate proper particle weighting. For instance, Verma et al. [2] adopt the Schneiderman and Kanade detector [5], which provides a suitable probabilistic output, but unfortunately is inadequate for real-time implementation.
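The two steps just described can be written compactly in the standard Bayes-filter form, using the same notation as above (x for the state, Z for the measurement history):

```latex
% Prediction (Chapman-Kolmogorov): propagate the previous posterior
% through the dynamic model p(x_t | x_{t-1})
p(x_t \mid Z_{t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid Z_{t-1})\, dx_{t-1}

% Update (Bayes' rule): reweight the prior by the face likelihood
p(x_t \mid Z_t) \propto p(z_t \mid x_t)\, p(x_t \mid Z_{t-1})
```

A particle filter approximates this recursion by representing each pdf with a weighted set of samples, which is why a detector able to supply p(zt|xt) plugs in so naturally.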

A more appealing solution, which we propose here, could be the real-time face detection scheme proposed by Viola and Jones (VJ, [6]) – basically a cascade of AdaBoost classifiers – which is arguably the most commonly employed detection method [7]. Note that the combination of AdaBoost and PF has been proposed in [3], but detection results were heuristically combined in the proposal function, rather than being exploited to model a likelihood function in a principled way. In [8], PF is integrated with an AdaBoost monolithic classifier via the probabilistic interpretation given by Friedman et al. in [9]. Unfortunately, tight coupling is spoiled by the introduction of a mean shift iteration, based on color features, to support the prediction of new hypotheses in adjacent regions.
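For reference, the probabilistic interpretation of a monolithic AdaBoost classifier given by Friedman et al. [9], on which [8] relies, reads the strong classifier output F(x) (the sum of the weak classifier responses) as half the log-odds of the class label, so that

```latex
P(y = +1 \mid x) \;=\; \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}} \;=\; \frac{1}{1 + e^{-2F(x)}}
```

This mapping holds for a single monolithic classifier; as noted next, it does not carry over directly to a cascade, where each stage is trained only on the examples that survived the previous stages.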

In contrast, we exploit the cascaded classifiers described in [10]. In this case, the probabilistic interpretation valid for monolithic classifiers does not apply directly; thus we work out a probabilistic interpretation of the VJ algorithm suitable for our purposes. We discuss the proposed PF implementation by defining the state parametrization and the adopted dynamical model. A PF may perform poorly when the posterior is multi-modal as a result of ambiguities or multiple targets, and issues such as appearance and disappearance of faces should be handled in a robust way. Interestingly enough, the latter issue, which is considered critical in PF-based tracking, is very easily solved in human vision, where multiple object tracking “runs” in parallel with background motion alerting processes capable of triggering, through pop-out effects, the tracking of new objects entering the scene and the loss of attention for disappearing objects. These problems have been tackled by incorporating such concurrency of processes in the system architecture.
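To make the PF machinery concrete, here is a minimal bootstrap particle filter step in NumPy. It is a sketch, not the paper's implementation: the state parametrization [x, y, scale], the Gaussian random-walk dynamics, and the pluggable `likelihood` callable are illustrative assumptions — in the actual system the likelihood would come from the probabilistic interpretation of the VJ cascade.

```python
import numpy as np

def particle_filter_step(particles, weights, likelihood,
                         sigma=(4.0, 4.0, 0.02), rng=None):
    """One predict/update/resample cycle of a bootstrap particle filter.

    particles : (N, 3) array of hypothesized states [x, y, scale]
                (illustrative parametrization, not the paper's exact one)
    weights   : (N,) array of normalized particle weights
    likelihood: callable mapping an (N, 3) state array to per-particle
                values of p(z_t | x_t)
    sigma     : std. dev. of the Gaussian random-walk dynamics per dimension
    """
    rng = rng or np.random.default_rng()
    n = len(particles)

    # Prediction: propagate each hypothesis through p(x_t | x_{t-1})
    particles = particles + rng.normal(0.0, sigma, size=particles.shape)

    # Update: reweight by the measurement likelihood p(z_t | x_t)
    weights = weights * likelihood(particles)
    weights = weights / weights.sum()

    # Systematic resampling, to avoid degeneracy when few particles
    # carry most of the weight
    positions = (rng.random() + np.arange(n)) / n
    idx = np.searchsorted(np.cumsum(weights), positions)
    idx = np.minimum(idx, n - 1)  # guard against float round-off at 1.0
    particles = particles[idx]
    weights = np.full(n, 1.0 / n)

    # Point estimate: mean of the (now equally weighted) posterior samples
    return particles, weights, particles.mean(axis=0)
```

Note that resampling handles weight degeneracy but not the multi-modality and appearance/disappearance issues discussed above; those require the architectural measures described in the text.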

**References**

2. Verma, R., Schmid, C., Mikolajczyk, K.: Face Detection and Tracking in a Video by Propagating Detection Probabilities. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(10), 1215–1228 (2003)

3. Okuma, K., Taleghani, A., de Freitas, N., Little, J., Lowe, D.: A Boosted Particle Filter: Multitarget Detection and Tracking. LNCS, pp. 28–39. Springer, Heidelberg (2004)

4. Isard, M., Blake, A.: CONDENSATION – Conditional Density Propagation for Visual Tracking. International Journal of Computer Vision 29(1), 5–28 (1998)

5. Schneiderman, H., Kanade, T.: A statistical method for 3D object detection applied to faces and cars. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition (2000)

6. Viola, P., Jones, M.: Robust Real-Time Face Detection. International Journal of Computer Vision 57(2), 137–154 (2004)

7. Pantic, M., Pentland, A., Nijholt, A., Huang, T.: Human Computing and Machine Understanding of Human Behavior: A Survey. In: Huang, T.S., Nijholt, A., Pantic, M., Pentland, A. (eds.) ICMI/IJCAI Workshops 2007. LNCS (LNAI), vol. 4451, pp. 47–71. Springer, Heidelberg (2007)

8. Li, P., Wang, H.: Probabilistic Object Tracking Based on Machine Learning and Importance Sampling. In: Marques, J.S., Perez de la Blanca, N., Pina, P. (eds.) IbPRIA 2005. LNCS, vol. 3522, pp. 161–167. Springer, Heidelberg (2005)

9. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. Technical report, Stanford University (1998)

10. Lienhart, R., Kuranov, A., Pisarevsky, V.: Empirical analysis of detection cascades of boosted classifiers for rapid object detection. In: 25th DAGM Pattern Recognition Symposium (2003)

11. Elgammal, A., Duraiswami, R., Harwood, D., Davis, L.S.: Background and foreground modeling using nonparametric kernel density estimation for visual surveillance. Proceedings of the IEEE 90(2) (2002)

12. OpenCV library: http://sourceforge.net/projects/opencvlibrary/