Adult content Web filtering and face detection using data-mining based skin-color model

This paper presents a novel approach for robust skin-color detection using data-mining techniques. The goal of skin-color detection is to select an appropriate color model that allows skin pixels to be verified under different lighting conditions and other variations. Selecting an appropriate color model implies good skin-color classifier properties for skin detection. This model has been successfully applied to face detection and Web-based adult content filtering.


Introduction
This paper describes the construction of statistical color models from a data set of unprecedented size: our model includes nearly 1 billion labelled training pixels obtained from the CRL dataset [6]. From this data we construct a color model as well as separate skin and non-skin color models. Using our skin classifier, which operates on the color of a single pixel, we construct a system for detecting images containing naked people and facial regions.
The remainder of this paper is organized as follows.
Skin detection using a color model is presented in Section 2. The web site filtering system is discussed in Section 3. The face detection system is presented in Section 4. Section 5 draws conclusions on our work.

Skin Detection Using Color Model
The color of skin in the visible spectrum depends primarily on the concentration of melanin and hemoglobin [18]. The distribution of skin color across different ethnic groups under controlled conditions of illumination has been shown to be quite compact, with variations expressible in terms of the concentration of skin pigments [19]. However, under arbitrary conditions of illumination the variation in skin color will be less constrained. This is particularly true for web images captured under a wide variety of imaging conditions. However, given a sufficiently large collection of labelled training pixels we can still model the distribution of skin and non-skin colors accurately [6].

Learning database
Skin colors change from person to person. Several color spaces have been proposed in the literature for skin detection applications [5]. YCbCr has been widely used since the skin pixels form a compact cluster in the Cb-Cr plane. Because YCbCr is also used in video coding, no transcoding is needed, so this color space has been used in skin detection applications where the video sequence is compressed [1, 11]. In [12] two components of the normalized RGB color space (rg) have been proposed to minimize luminance dependence. Finally, CIE L*u*v* has been used in [14]. However, it is still not clear in which color space skin detection performs best.
To create a skin-color model using color information we use the CRL database of skin-color and non-skin-color images [6]; all intensity information on color pixels was extracted using binary masks.
For each pixel we compute its representation in the following normalized color spaces: RGB, HSY, YIQ, YCbCr, CMY, in order to find the most discriminative set of color axes.
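The per-pixel feature computation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `pixel_features` is hypothetical, the YCbCr coefficients are the full-range BT.601 (JFIF) variant, which the paper does not specify, and HSY is omitted because the text does not define it.

```python
import colorsys


def pixel_features(r, g, b):
    """Map one 8-bit RGB pixel to coordinates in several color spaces."""
    # Scale to [0, 1] for the conversions that expect unit-range input.
    rn, gn, bn = r / 255.0, g / 255.0, b / 255.0

    # Normalized RGB (chromaticity): each channel divided by the channel sum.
    total = r + g + b
    if total == 0:
        # Black has no defined chromaticity; use the neutral point (assumption).
        norm_rgb = (1 / 3, 1 / 3, 1 / 3)
    else:
        norm_rgb = (r / total, g / total, b / total)

    # YIQ via the standard library.
    yiq = colorsys.rgb_to_yiq(rn, gn, bn)

    # YCbCr with full-range BT.601 (JFIF) coefficients (assumed variant).
    ycbcr = (
        0.299 * r + 0.587 * g + 0.114 * b,
        128 - 0.168736 * r - 0.331264 * g + 0.5 * b,
        128 + 0.5 * r - 0.418688 * g - 0.081312 * b,
    )

    # CMY is the complement of unit-range RGB.
    cmy = (1 - rn, 1 - gn, 1 - bn)

    return {"rgb_norm": norm_rgb, "yiq": yiq, "ycbcr": ycbcr, "cmy": cmy}
```

Each labelled training pixel would then contribute one such feature record, and the learning stage can compare the axes for discriminative power.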

Data mining-based Skin Classifier
A number of classification techniques from the statistics and machine learning communities have been proposed [3, 7, 8, 13]. A well-accepted method of classification is the induction of decision trees [2, 7]. In our approach we use the SIPINA method [17].
SIPINA is a widely used technique for data-mining. The effectiveness of SIPINA is superior to classical methods such as ID3 and C4.5 [8], because the distribution equivalency can be considered in a population-wise manner, irrespective of any fixed solutions proposed previously [16]. This accounting mechanism accurately charts usage distribution and leads to the highest performance among the methods considered, particularly for skin-color modelling.
As a result of applying this method to a training set, a hierarchical structure of classifying rules of the type "IF ... THEN ..." is created. This structure has the form of a tree. We find that HSY is the most discriminative color space. Because of copyright policy we are not authorized to publish detailed values of the skin-color model decision rules; however, we present experimental results evaluating this model for face detection in video and adult web content filtering applications.
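To illustrate how tree induction can rank color axes by discriminative power, here is a minimal sketch that scores each axis by the best single-threshold information gain, the criterion used by ID3-style methods. It is not SIPINA and does not reproduce the unpublished rules; the function names and any data passed to them are hypothetical.

```python
import math


def entropy(labels):
    """Binary entropy of a list of 0/1 labels (1 = skin, 0 = non-skin)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_split(values, labels):
    """Return (gain, threshold) for the best rule 'IF value <= threshold'."""
    base = entropy(labels)
    best = (0.0, None)
    # Try every distinct value except the largest as a threshold.
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - weighted
        if gain > best[0]:
            best = (gain, t)
    return best


def most_discriminative_axis(columns, labels):
    """Pick the axis name whose best split yields the highest gain."""
    return max(columns, key=lambda name: best_split(columns[name], labels)[0])
```

A full tree repeats this choice recursively on each branch, producing nested "IF ... THEN ..." rules.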

Web site filtering system
By taking advantage of the fact that there is a strong correlation between images with large patches of skin and adult or pornographic images, the skin detector can be used as the basis for an adult image detector.
There is a growing industry aimed at filtering and blocking adult content from Web indexes and browsers. Some representative companies are www.surfcontrol.com and www.netnanny.com. All of these services currently operate by maintaining lists of objectionable URLs and newsgroups and require constant manual updating [6].
In this section, we propose an adult content detection and filtering system that improves detection accuracy by using both image signatures and textual clues for adult web site filtering.

WebGuard's Architecture
The formulation of « WebGuard » is as follows:
- Fully automated adult content detection and filtering
- Categorization into "black list" (access denied) and "white list" (access allowed) to speed up navigation
- If the site is not recorded on the "black list" or "white list", the engine will then analyse both the visual and textual information and make a further decision on the site's access allowed/denied status. The black list/white list file is then updated.
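The lookup-then-analyse flow listed above can be sketched as follows. This is a minimal illustration, not WebGuard's implementation: the function `check_site` and the `analyse` callback (standing in for the combined textual and visual analysis) are hypothetical names.

```python
def check_site(url, black_list, white_list, analyse):
    """Return True if access is allowed, updating the lists as a side effect.

    black_list and white_list are sets of URLs; analyse(url) is a
    hypothetical callback returning True when the site is judged adult.
    """
    # Fast path: sites already categorized need no further analysis.
    if url in black_list:
        return False
    if url in white_list:
        return True

    # Unknown site: run the full analysis, then record the decision.
    is_adult = analyse(url)
    if is_adult:
        black_list.add(url)
    else:
        white_list.add(url)
    return not is_adult
```

Because each decision is cached in one of the two lists, the analysis runs at most once per URL in this sketch.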

Adult Web sites classification
WebGuard uses data-mining for both text and image classification. The creation of a database for the data-mining process requires a large number of sites of each category: 1000 pornographic and 1000 non-pornographic sites were added to the database. Once the feature vectors of all the URLs have been constructed, the task is to use a classifier to categorize these URLs into two types: adult content URLs and non-adult content URLs.

Dictionary and weighting system
In WebGuard 2.0 the efficiency and quality of the classification methods were improved by enhancing the dictionary that defines which words are sexually explicit: the richer the word base, the more accurate the data recovered by the parser. Its richness also depends on the languages it addresses. This database contains English, French, Spanish and Italian words. Thus, the classification rules and the analysis of a site are more accurate.
In the improved version of the system, "WebGuard 2.0", several text classification methods are used (ID3, C4.5, SIPINA with two different values for the admissibility constraint, and Improved C4.5), which can be combined to ensure a higher degree of accuracy. Three methods were used for evaluating the performance of each algorithm: the random error rates method, cross validation, and bootstrap. In order to determine whether a website is pornographic or not, we use a weighting method: each algorithm has a coefficient and the sum of these five coefficients is equal to one.
For each algorithm, the system examines the web site and returns a Boolean value equal to one when the site is classified as pornographic and zero when it is classified as clean. Using the Boolean results from each algorithm and the weighting coefficients, we obtain a decimal number between zero and one. A higher final score means a higher probability that the website is pornographic.
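The weighting scheme described above amounts to a weighted sum of Boolean votes. A minimal sketch, assuming hypothetical equal coefficients of 0.2 (the paper does not give the actual values) and hypothetical key names for the five classifiers:

```python
def weighted_score(votes, weights):
    """Combine Boolean classifier votes into a score between 0 and 1.

    votes maps classifier name -> True (pornographic) or False (clean);
    weights maps the same names -> coefficients that must sum to one.
    """
    if set(votes) != set(weights):
        raise ValueError("votes and weights must cover the same classifiers")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    return sum(weights[name] * int(votes[name]) for name in votes)


# Hypothetical equal weights for the five classifiers named in the text.
EXAMPLE_WEIGHTS = {
    "ID3": 0.2,
    "C4.5": 0.2,
    "SIPINA_a": 0.2,
    "SIPINA_b": 0.2,
    "Improved C4.5": 0.2,
}
```

For instance, if only ID3 and C4.5 flag a site under these example weights, the score is 0.4; the cut-off applied to the score is a separate design choice not stated in the text.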

Textual and visual analysis
Most of the existing systems that rely only on URL and textual information can be readily outsmarted by adult content providers. Still, textual information remains the most significant index for fast filtering and will be used as the first stage in the detection and filtering of the contents of adult sites.

When our method of text-based analysis is used, WebGuard 2.0 is able to filter 94% of adult web sites.
To improve the classification performance we use image analysis which is based on the skin color model.
According to the percentage of skin-color pixels in the image, one can decide whether the image is suspect or not. This increases accuracy to more than 98%. We evaluated our technique using textual analysis only (A) and then textual + visual analysis using the data-mining based skin-color model (B) [4]. For the purposes of our experiment we used 1000 web sites which were manually classified into adult and non-adult sets.
Figure 2 shows the improvement of the WebGuard system obtained by the use of image analysis.
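The image test based on the percentage of skin-color pixels can be sketched as below. The per-pixel predicate `is_skin` stands in for the unpublished skin-color model, and the default threshold of 0.3 is a hypothetical value chosen only for illustration.

```python
def skin_ratio(pixels, is_skin):
    """Fraction of pixels that the given per-pixel classifier labels as skin."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(1 for p in pixels if is_skin(p)) / len(pixels)


def is_suspect(pixels, is_skin, threshold=0.3):
    """Flag an image when its skin ratio reaches the threshold.

    The threshold is an assumption; the source does not state the value used.
    """
    return skin_ratio(pixels, is_skin) >= threshold
```

Since the classifier works on single pixels, the ratio can be computed in one pass over the image without any spatial analysis.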

A central task in visual learning is the construction of statistical models of image appearance from pixel data. When the amount of available training data is small, sophisticated learning algorithms may be required to interpolate between samples. However, as a result of the World Wide Web and the proliferation of on-line image collections, the vision community today has access to image libraries of unprecedented size.
Figure 2. Rates of text-only analysis and text-and-image analysis.

Experimental Evaluation
We performed two experiments and present the results here. All experiments were performed with video captured to files, so that pure performance (faster than real time) could be evaluated. The skin-color models used in the experiments were: data-mining based (A) and CRL [6] (B). We note that the face detection in video system was tuned for fast performance and some features were disabled. Meanwhile, skin-color preprocessing takes less than 1% of total computational complexity; therefore the only parameter evaluated is the total number of detected persons per hour of video.

Table 1. Face detection in video using skin color.
Correct face detection rate: A - 64 (B - 60) different persons per hour
False face detection rate: 6 subjects per hour