Xi'an Technological University
Subject: Computer Science, Software Engineering
eISSN: 2470-8038
SEARCH WITHIN CONTENT
Changyuan Wang ^{*} / Guang Li ^{*} / Pengxiang Xue ^{*} / Qiyou Wu ^{*}
Keywords : Classification Algorithm, Machine Learning, Face Recognition, Model Evaluation
Citation Information : International Journal of Advanced Network, Monitoring and Controls. Volume 5, Issue 3, Pages 23-29, DOI: https://doi.org/10.21307/ijanmc-2020-024
License : (CC-BY-NC-ND 4.0)
Published Online: 14-October-2020
Due to the different classification effects and accuracy of different classification algorithms in machine learning, it is inconvenient for scientific researchers to choose which classification algorithm to use. This paper uses the face data published by Cambridge University as an experiment. The experiment first reduces the dimensionality of the data through the principal component analysis (PCA) algorithm, extracts the main features of the data, and then respectively through linear logic classification, linear discrimination LDA, nearest neighbor algorithm KNN, support vector machine SVM and the integrated algorithm Adaboost are used for classification. By comparing the advantages and disadvantages of the classification performance and complexity of different algorithms, the final review reviews accuracy, recall, f1-score, and AUC as evaluation indicators.
With the rise of artificial intelligence and machine learning, face recognition technology is widely used in life, such as station security, time and attendance punching, and secure payment [1-3], but different face recognition devices use different algorithms. Therefore, this paper analyzes and compares the commonly used classification algorithms in face recognition. The data set in this paper uses the ORL face data set published by Cambridge University in the United Kingdom. The methods used involve linear logistic regression, linear discriminant analysis (LDA), K-Nearest Neighbor (KNN), support vector machine (SVM), Naïve Bayes (NB) and other methods. The definition and advantages and disadvantages of the act are briefly explained. Finally, the five methods are compared and analyzed according to the evaluation indicators such as the accuracy rate, recall rate, F1-score, and AUC area commonly used in machine learning.
The data set contains a total of 400 photos. We use the machine learning library Scikit-learn provided by python to process the data, and display part of the data set pictures as shown in Figure 1.
The experimental data has 4096 features per picture. Since the number of features is much greater than the number of samples, it is easy to regenerate overfitting during training. Therefore, a principal component analysis algorithm is required to reduce the dimensions of the features and select K main features as the input of the data set. The main idea of PCA [4] uses the covariance matrix to calculate the degree of dispersion of samples in different directions, and selects the direction with the largest variance as the main direction of the sample set. Processing process:
a) Data preprocessing normalizes and scales the data first. Normalization makes the mean value of data features 0, and scaling is to solve the case where feature values are different by an order of magnitude.
b) Calculate the covariance matrix and eigenvectors of the processed data. The eigenvectors can be obtained by singular value decomposition.
c) Retain the feature vector and feature value corresponding to the largest first K feature values to form an orthogonal basis.
d) Project the sample into a low-dimensional space, and the acquired dimensionality-reduced data can represent the original sample approximately, but with a certain degree of distortion.
This paper use formula (1) to calculate the distortion of PCA.
Through scikit-learn processing, the reduction rate after dimensionality reduction is obtained from the PCA model diagram is shown in Figure 2. It can be seen from the figure that the larger the value of k, the smaller the distortion rate. As k continues to increase, the data reduction rate will approach 1 Using this rule, choose between 10 and 300, and perform sampling calculation every 15th. Under the k features of all samples, the reduction rate is obtained after processing by the PCA algorithm. We select the reduction ratios at 98%, 90%, 80%, and 70%, and the corresponding k values are 195, 75, 32, and 20. The corresponding pictures after PCA processing are displayed. The first line of the image is the original image, and then each column corresponds to the It is a picture at different reduction rates. The lower the reduction rate, the more blurred the image is shown in Figure 3.
Supervised learning is the most widely used branch of machine learning in industry. Classification and regression are the main methods in supervised learning. Linear classifier is the most basic and commonly used machine learning model. This paper uses linear logistic regression to classify and recognize faces.
The prediction function of linear regression is:
Where is the prediction function, x is the feature vector, To handle the classification problem, this paper hope that the value of the function is [0,1], This paper introduce the Sigmoid function:
Combined with linear regression prediction function:
If there is a simple binary classification of class A or class B, then this paper can use Sigmoid as the probability density function of the sample, and the result of the input classification can be expressed by probability:
The cost function is a function that describes the difference between the predicted value and the true value of the model. If there are multiple data samples, the average value of the replacement price function is obtained. It is expressed by J(θ). Close to, based on the maximum likelihood estimate available cost function J (θ).
Linear Discriminant Analysis (LDA)^{[5, 6]}, also known as Fisher Linear Discriminant (FLD), was introduced into the field of machine learning by Belhumeur. LDA is a dimensionality reduction technique in supervised learning, which can not only reduce dimensionality but also classify, and mainly project data features from high latitude to low latitude space. The core idea is that after projection, the projection points of the same category of data should be as close as possible, and the distance between the category centers of different categories of data should be increased as much as possible [5]. If the data set has two data sets, for the center of the two classes
Then within-class scatter matrix (within-class scatter matrix):
Between-class scatter matrix:
Among N training samples, find the k nearest neighbors of the test sample x. Suppose there are m training samples in the data set, and there are c categories, namely $$\{{\varpi}_{1},\dots {\varpi}_{c}\}$$, and the test sample is x. Then the KNN algorithm can be described as: Find k neighbors of x in m training samples, among which the number of samples belonging to category w_{i} in k neighbors of x are k_{1}, k_{2}, … kc, then the discriminant function is
The core idea of K-nearest neighbor algorithm ^{[7]} is to calculate the distance between unlabeled data samples and each sample in the data set, take the K nearest samples, and then K neighbors vote to decide the type of unlabeled samples.
KNN classification steps:
a) Prepare the training sample set X, which contains n training samples, and select an appropriate distance measurement method according to specific requirements. This paper use dis(xa,xb) to represent the distance between ax and bx in the sample set.
b) For the test sample x, use the distance measurement formula to calculate the distance between the test sample x and n samples to obtain the distance set Dis, where
.c) Sort the distance set, select the smallest k elements from it, and get k samples corresponding to k elements.
d) Count the categories of these k samples, and obtain the final classification results by voting.
Assuming that x_test is an unlabeled sample and x_train is a labeled sample, the algorithm is as follows:
For distance measurement, Euclidean distance, Manhattan distance, Chebyshev distance, etc. are usually used. Generally, Euclidean distance is mostly used, such as the distance between two points and two points in N-dimensional Euclidean space.
Support Vector Machine (SVM)[7] for short is a very important and extensive machine learning algorithm. Its starting point is to find the optimal decision boundary as far as possible, which is the farthest from the two types of data points. Furthermore, is the farthest from the boundary of the two types of data points, so the data point closest to the boundary is defined as a support vector. Finally, our goal becomes to find such a straight line (multi-dimensional called hyperplane), which has the largest distance from the support vector. Make the generalization ability of the model as good as possible, so SVM prediction of future data is also more accurate, as shown in Figure 5 below. Find the best Dahua margin.
Let this plane be represented by g(x)=0, its normal vector is represented by w, the actual distance between a point and the plane is r point, and the distance from the plane can be measured by the absolute value of g(x) (called the function interval).
The penalty function is also called the penalty function, which is a kind of restriction function. For constrained nonlinear programming, its constraint function is called a penalty function, and the coefficient is called a penalty factor. The objective function of SVM (soft interval support vector machine) with penalty factor C is:
Advantages of SVM:
a) Non-linear mapping is the theoretical basis of SVM method, SVM uses inner product kernel function to replace the nonlinear mapping to high-dimensional space;
b) The optimal hyperplane to divide the feature space is the goal of SVM, and the idea of maximizing the classification margin is the core of the SVM method;
c) Support vector is the training result of SVM. It is the support vector that plays a decisive role in SVM classification decision.
d) SVM is a novel small sample learning method with a solid theoretical foundation. It basically does not involve probability measurement and the law of large numbers, so it is different from the existing statistical methods. In essence, it avoids the traditional process from induction to deduction, realizes efficient “transduction inference” from training samples to forecast samples, and greatly simplifies the usual classification and regression problems.
Naive Bayes [8, 9] is a conditional independence assumption, and there is no correlation between leave and leave. Suppose there is a labeled data set, where the data set has a total of categories, and for new samples, this paper predict the value. This paper use statistical methods to deal with this problem, so this paper can understand it as the probability of the category to which the sample belongs. The conditional probability formula is:
Therefor C_{k} ∈ [C_{1}, C_{2} …, C_{m}], this paper only require the probabilities of m categories, the category belongs to the largest value, and use Bayes’ theorem to solve:
And because x has n feature vectors, it is in the determined data set, C_{k}, P(x) are fixed values, according to the joint probability formula:
The data set in this paper uses the ORL face data set of Cambridge University, a total of 400 photos. After experimental analysis, this paper chose PCA to reduce the information rate after dimensionality reduction to 98%, and select 195 as the main features of each picture as data input. In order to make the experimental results more generalized, 80% of the data set is used as the training set and 20% is used as the test set. The 10-fold cross-validation method is used during model training. In order to ensure that the experimental results run the same data every time, a fixed random seed is set Is 7. For the test standard, this paper selected the accuracy rate (P), recall rate (R), F value, and AUC area. Compared with our prediction results, the accuracy rate indicates how many true positive samples are in the positive prediction samples. There are two possibilities for the prediction results, that is, the positive prediction is positive (TP), and the negative prediction is positive (FP), the formula is:
The recall rate refers to our original sample, indicating how many positive examples in the sample were predicted correctly. There are also two possibilities, that is, the original positive class is predicted as a positive class (TP), and the other is to predict the original positive class as a negative class (FN).
F1 combines the results of precision rate and recall rate. When F1 is higher, it means that the verification method is more effective.
Face recognition technology [10] has become one of the most popular research directions in computer vision and has made great achievements. With the development of computer technology, more and more classification methods will appear. This thesis is only through several common mainstream classification methods for experimental analysis and comparison. Through experiments, it is found that the PCA + linear logic classification method has obvious advantages in accuracy rate and recall rate. However, specific analysis should be combined with specific issues. Then I am ready to do experiments on different data sets, understand more classification algorithms, and constantly improve my results.