Preserving authenticity: transfer learning methods for detecting and verifying facial image manipulation

Facial retouching in supporting documents can have adverse effects, undermining the credibility and authenticity of the information presented. This paper presents a comprehensive investigation into the classification of retouched face images using a fine-tuned pre-trained VGG16 model. We explore the impact of different train-test split strategies on the performance of the model and also evaluate the effectiveness of two distinct optimizers. The proposed fine-tuned VGG16 model with "ImageNet" weights achieves a training accuracy of 99.34 % and a validation accuracy of 97.91 % over 30 epochs on the ND-IIITD retouched faces dataset. The VGG16_Adam model gives a maximum classification accuracy of 96.34 % for retouched faces and an overall accuracy of 98.08 %. The experimental results show that the 50 % -25 % train-test split ratio outperforms the other split ratios considered in the paper. The demonstrated work shows that using a transfer learning approach reduces computational complexity and training time, with a maximum training duration of 39.34 min for the proposed model.


INTRODUCTION
In our modern life, digital images have become indispensable. Unfortunately, the widespread availability of advanced image processing tools on the Internet has led to a proliferation of fake images. While some of these images may seem harmless, they have been exploited for nefarious purposes, such as creating counterfeit legal documents, manipulating evidence in legal proceedings, and distorting historical events. Furthermore, the prevalence of retouched images on social media platforms, often using filters to create flawless appearances, has fostered unrealistic beauty standards. Beauty and celebrity magazines also contribute to this phenomenon, perpetuating unrealistic expectations by showcasing heavily altered appearances, as depicted in Figure 1.
Image forgery poses a significant challenge, as it can be visually imperceptible when executed with precision. Reference [1] demonstrated how such alterations can negatively impact individuals' self-esteem by promoting unrealistic beauty standards. The introduction of the Photoshop Law in Israel further emphasizes the need for algorithms to detect tampering, reflecting the prevalence of this issue [2]. Moreover, beyond health and moral concerns, synthetic alterations affect biometric systems, potentially hindering accurate identification and auto-matching of bonafide faces [3]. In order to maintain the integrity and authenticity of visual content, this research attempts to create an effective framework for the detection and classification of facial retouching using a transfer learning approach. This requires building a strong computational framework to recognize minute modifications made to facial images.

Literature review
Facial image retouching, photo spoofing or morphing, and makeup detection are widely studied areas and are considered closely related to detecting retouching. In earlier studies, Reference [4] trained an SVR (support vector regression) to discriminate between altered and real photos. In 2015, to assess whether makeup is present, Reference [5] suggested SVM and Alligator classifiers that detect makeup using shape and texture features retrieved from the complete face. In 2016, Reference [6] used a supervised deep Boltzmann machine algorithm to detect retouching on the ND-IIITD retouched faces dataset. The dataset introduced contains 2600 real face images and 2275 face images retouched with the PortraitPro Max photo editing tool. Geometric and photometric features are used to train an SVR for classifying retouched images, with a total of 4 facial patches used to detect retouching.
In 2017, Reference [7] introduced a new dataset, namely MDRF (Multi-Demographic Retouched Faces), containing real and retouched face images of three ethnicities: Caucasian, Chinese, and Indian. The classification of retouched images is done using semi-supervised autoencoders, with the model trained on 4 patches of the face images. An algorithm that recognizes altered face photos using a gradient-based classification approach is presented by [8]. In 2018, a CNN (convolutional neural network) architecture was introduced to detect retouching on the standard ND-IIITD dataset [8]. Different non-overlapping face patches of size (128×128) and (64×64) are used to detect retouching, with SVM and thresholding as classifiers. In 2019, Reference [9] used 5 different photo editing tools to retouch 800 bonafide face images and detected retouching using a PRNU (photo-response non-uniformity) scheme. The PRNU-based detection scheme demonstrates robust discrimination between unaltered bonafide images and retouched images, achieving an average detection equal error rate of 13.7 %. In order to detect morphed faces, a physical reflection model was introduced that calculates the direction of light sources for the nose and eye regions [10]. Biological signals (photoplethysmography) and 129 image quality features were employed to detect fake images via binary classification [11,12].
In 2023, an improvised patch-based deep convolutional neural network (IPDCN2) was presented in Reference [13], which effectively classifies facial images as either original or retouched through three stages: pre-processing using facial landmarks, high-level feature extraction with a CNN based on residual learning, and classification using fully-connected layers. The experimental results achieved an accuracy of 99.84 % (patch-based) on the ND-IIITD dataset and classification accuracies of 95.80 %, 83.70 %, and 97.30 % on the YMU, VMU, and MIW makeup datasets, respectively. Deep learning is a significant AI accomplishment [14]. Convolutional neural networks (CNN) are a common type of deep learning architecture [15]. Transfer learning helps avoid reinventing the wheel and lowers the cost of learning. Several extensively used pre-trained models include VGG16, VGG19, ResNet50, InceptionV3, and EfficientNet [16].
Reference [17] introduces the Retouching-FFHQ dataset, built specifically for detecting retouching. The TP (true positive), TN (true negative), and accuracy of binary classification are analyzed using multi-granularity attention modules and compared across different transfer learning models such as VGG16, InceptionV3, ResNet50, DenseNet121, and EfficientNet. In Reference [18], transfer learning was employed for a classification task using pre-trained models (MobileNet V2, ResNet50, and VGG19), with VGG19 achieving the highest classification accuracy (95 %) and F1-score on a previously unseen dataset, despite a longer execution time of 7 hours, 5 minutes, and 52 seconds. The pre-trained VGG16 architecture was utilized for skin cancer image classification [19], exploring different color scales (HSV, YCbCr, and grayscale). The evaluation shows that a classification accuracy of 84.242 % was achieved on a dataset created from RGB and YCbCr images. That work extracted feature parameters from different layers and analyzed VGG16's performance across color scales to determine its effectiveness in classifying diseases.
Moreover, very little research has been carried out so far on face images retouched using photo editing tools. When employing a DL (deep learning) model to identify retouching on facial photos [20], there are many difficulties, as presented in Reference [21]: to train the model to recognize retouching accurately, a large number of images, well-labeled metadata, and a facial dataset comprising both legitimate and manipulated photo images are required. In this context, transfer learning (TL) addresses these challenges and enables optimal detection accuracy [22]. TL offers several advantages in machine learning and deep learning tasks, including reduced training time, lower data requirements, and improved generalization.
Our contributions are as follows:
- For the proposed work, VGG16 with ImageNet weights is used, as it gives a top-5 error of 9.3 % [23].
- For detecting retouching, the ND-IIITD retouched faces dataset is used. The dataset is divided into 80 % -20 %, 70 % -30 %, 60 % -40 %, and 50 % -50 % train-test split ratios.
- Two distinct first-order optimizers, Adam and RMSprop, are used during fine-tuning.
- A total of 8 distinct experiments are performed on the proposed TL model, and the classification accuracy of the model is evaluated for the different train-test split ratios.
- To the best of the authors' knowledge, no prior research has conducted timing analysis for training a model, an aspect that is evaluated in this study.
The rest of this paper is organized as follows: the proposed methodology, a brief overview of the VGG16 architecture, the optimizers, and the facial dataset are outlined in Section 2. The result analysis of the proposed models is summarized in Section 3. Conclusions and future work are discussed in Section 4.

PROPOSED METHODOLOGY
In this work, we propose a TL method to classify bonafide (real) vs. fake (retouched) face images from the ND-IIITD retouched faces dataset by utilizing pre-trained VGG16 TL models with ImageNet weights. The steps of the proposed method are pictured in Figure 2. The ND-IIITD retouched face images dataset [24] is used, and we split the dataset into train and test (validation) sets of 80 % -20 %, 70 % -30 %, 60 % -40 %, and 50 % -50 %. Data transformation is applied with data augmentation. Two different optimizers are used with TL VGG16 during fine-tuning. Training and evaluation are then performed on these fine-tuned TL VGG16 models with the different train-test split and optimizer combinations. The test images are evaluated on all eight proposed models, and the classification results are compared and analyzed. We evaluate these TL models, compare them, and suggest the best fine-tuned TL VGG16 model.

The VGG16 model is a deep CNN presented by researchers in the Visual Geometry Group (VGG) at the University of Oxford [23]. It is widely used for image classification tasks and has achieved state-of-the-art performance on many benchmark datasets. VGG16 is a sequential architecture with 13 convolutional layers and 3 FC (fully connected) layers. The convolutional layers extract features from the input image, while the FC layers perform the classification. The architecture learns complex features from the input while keeping the parameter count low. The FC layers of the original VGG16 are removed, and a new FC layer is added for retouching classification. The trainable and non-trainable parameters before and after fine-tuning are given in Table 1, and the architecture of the modified VGG16 is briefly described in Table 2. During initial training of the model, all convolution layers are frozen and only the weights of the newly added FC layer are updated. During fine-tuning, a few convolution layers of the modified model are unfrozen, so the weights of the unfrozen convolution layers and the FC layer are updated.

Deep learning optimizers are a crucial part of computer vision because they ensure that the training process produces optimal outcomes. The optimizer's task is to minimize the loss, which gauges the discrepancy between expected and actual results, by repeatedly adjusting the model's parameters. The right optimizer can have a significant impact on training efficiency, speed, and accuracy, as well as on the outcomes themselves [25]. As a result, optimizers are crucial in deep learning applications for computer vision.
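To make this two-stage setup concrete, the following is a minimal Keras sketch of a modified VGG16 of the kind described above. The head size (a single new FC layer feeding a sigmoid output) and the choice of unfreezing only the deepest convolutional block ("block5") during fine-tuning are illustrative assumptions; the exact layer configuration of the paper's model is given in Tables 1 and 2.

```python
import tensorflow as tf

# Load the VGG16 convolutional base with ImageNet weights; drop the original FC layers.
base = tf.keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze all 13 convolution layers for initial training

# New FC head for bonafide-vs-retouched classification (illustrative sizes).
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),   # assumed head width
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary: real vs. retouched
])

def unfreeze_last_block(conv_base):
    """For fine-tuning: unfreeze only the deepest convolution block."""
    conv_base.trainable = True
    for layer in conv_base.layers:
        layer.trainable = layer.name.startswith("block5")
```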

Adam (adaptive moment estimation)
Adam is an optimization algorithm that combines the benefits of both RMSprop and momentum. It maintains both an exponentially decaying average of past squared gradients (as in RMSprop) and an average of past gradients (as in momentum). The name "Adam" is derived from "adaptive moment estimation".
The update rule is

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t \qquad (5)$$

where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first and second moment estimates of the gradients, respectively; $\beta_1$ and $\beta_2$ are hyperparameters that control the exponential decay rates of the moment estimates; $\eta$ is the learning rate; and $\epsilon$ is a small value added for numerical stability.
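As a worked illustration of Equation (5), here is a minimal NumPy sketch of one Adam step; this is the textbook algorithm [25], not code taken from the paper's implementation.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta given gradient grad at step t >= 1."""
    m = beta1 * m + (1 - beta1) * grad        # decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # update rule, Equation (5)
    return theta, m, v
```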

RMSProp (root mean square propagation)
RMSprop is an optimizer that addresses the shortcomings of traditional stochastic gradient descent (SGD) by adapting the learning rate of each parameter individually based on a historical average of the squared gradients.
The squared-gradient average and parameter update are

$$v_t = \rho\, v_{t-1} + (1-\rho)\, g_t^2 \qquad (6)$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t \qquad (7)$$

where $v_t$ is the exponentially decaying average of squared gradients; $\rho$ is a hyperparameter that controls the exponential decay rate of that average; $\eta$ is the learning rate; and $\epsilon$ is a small value added for numerical stability.
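A corresponding NumPy sketch of one RMSprop step follows; again, this is the standard update rather than the paper's code, with ρ = 0.9 assumed as the decay rate.

```python
import numpy as np

def rmsprop_step(theta, grad, v, lr=0.0001, rho=0.9, eps=1e-8):
    """One RMSprop update; v is the decaying average of squared gradients."""
    v = rho * v + (1 - rho) * grad ** 2             # Equation (6)
    theta = theta - lr * grad / (np.sqrt(v) + eps)  # Equation (7)
    return theta, v
```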

ND-IIITD retouched faces dataset
For retouching detection, the real and retouched images of the ND-IIITD retouched faces dataset are taken. The ND-IIITD dataset [26] contains real (bonafide) and retouched images of a total of 325 subjects, of which 211 are male faces and 114 are female faces. Each subject has 7 real face images taken during different time spans or under different light conditions, backgrounds, and poses. These 7 real probes are retouched or altered with PortraitPro Max, a photo editing tool, at different levels of retouching. Retouching or brushing is applied to the nose, eye, lip, cheek, and hair areas of the bonafide images [6]. Table 3 shows the description of the original ND-IIITD dataset and Figure 3 shows some samples of retouching.

The accuracy of the model depends on several parameters such as learning rate, number of epochs, optimizers, data size, etc. The ND-IIITD dataset is divided into different train-test split ratios and the performance of all is compared for maximum classification accuracy. The details of the division of the dataset into training and testing sets are described in Table 4. A ratio of 80 % -20 % means that 80 % of the images are used for the train dataset and the remaining 20 % are equally divided into the validation and test datasets. The model is trained on the training dataset and evaluated and tested on the testing dataset. Bonafide (real) and retouched samples of ~105 male and ~57 female subjects are used for training and the remaining half (i.e., ~105 males and ~57 females) are used for evaluation. Each dataset is formed with the same number of real samples (images) as retouched samples.
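The splitting scheme can be sketched as below for the 80 % -20 % case, where the held-out 20 % is divided equally into validation and test sets. The `paths` and `labels` lists are hypothetical stand-ins for the ND-IIITD file names and their real/retouched labels; stratification keeps the real-to-retouched ratio balanced, matching the paper's equal-sample construction.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the ND-IIITD file list and its labels (1 = retouched).
paths = [f"img_{i}.png" for i in range(1000)]
labels = [i % 2 for i in range(1000)]

# 80 % train, 20 % held out, stratified so both classes stay balanced.
train_x, rest_x, train_y, rest_y = train_test_split(
    paths, labels, train_size=0.80, stratify=labels, random_state=42)

# Split the held-out 20 % equally into validation and test sets (10 % each).
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.50, stratify=rest_y, random_state=42)
```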

Experimental arrangement
The model training and evaluation tasks are conducted on Google Colab with a GPU runtime, utilizing TensorFlow, a machine learning package developed by Google. A Python data loader is employed to load the data with a batch size of 32, and Google Drive serves as the storage location for both the dataset's file names and checkpoints. All graphs are plotted using the Matplotlib library. For the initial training of the modified fine-tuned VGG16 model, we use the Adam optimizer with a learning rate (LR) of 0.001, β1 and β2 of 0.9 and 0.999, respectively, and the number of epochs set to 10. During fine-tuning of the model, we use the Adam and RMSprop optimizers (momentum 0) with an LR of 0.0001 and the number of epochs set to 20. With the above hyperparameter settings, a total of eight experiments are conducted. The modified VGG16 model is trained on the training (train) dataset and evaluated on the testing (validation) dataset. On the train and validation sets, we record the cross-entropy loss and accuracy for each epoch.
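Under these settings, the two training stages can be outlined as in the sketch below. It assumes the `model`, `base`, and `unfreeze_last_block` names from the earlier sketch and hypothetical `train_ds`/`val_ds` tf.data datasets; the RMSprop experiments simply swap the stage-2 optimizer as noted in the comment.

```python
import tensorflow as tf

# Stage 1: train only the new FC head (conv base frozen), 10 epochs.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999),
    loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)

# Stage 2: unfreeze the last conv block and fine-tune for 20 more epochs (11-30).
unfreeze_last_block(base)
fine_tune_opt = tf.keras.optimizers.Adam(learning_rate=0.0001)
# RMSprop variant: tf.keras.optimizers.RMSprop(learning_rate=0.0001, momentum=0.0)
model.compile(optimizer=fine_tune_opt,
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, initial_epoch=10, epochs=30)
```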

Performance metrics for evaluation
For the classification task, the performance of the model is evaluated using 4 different metrics, namely precision, sensitivity, F1-score, and accuracy, for bonafide and retouched face images. The ability of the models to give the maximum TPR (true positive rate) is compared using the ROC (receiver operating characteristic) curve. Precision (P) is the proportion of predicted positive results that are correct, stated as:

$$P = \frac{x}{x+y} \qquad (8)$$

where x: true positive samples and y: false positive samples.
Recall (R) is the ratio of TP (true positives) to the sum of TP and FN (false negatives), expressed as:

$$R = \frac{x}{x+z} \qquad (9)$$

where x: true positive samples and z: false negative samples. The F1-score (F1) analyses a model's performance on a class-by-class basis to determine how predictive it is, and is the harmonic mean of precision and recall:

$$F1 = \frac{2PR}{P+R} \qquad (10)$$
Accuracy (A) is calculated as the proportion of correctly identified predictions to all predictions:

$$A = \frac{x+w}{x+w+y+z} \qquad (11)$$

where w: true negative samples. The ROC (receiver operating characteristic) curve is a graphical measure of the diagnostic ability of the model; it plots the TPR (true positive rate) against the FPR (false positive rate) for the proposed model.

As per Table 5, when the Adam optimizer is used during fine-tuning, the training accuracy achieved for all train-test split ratios is ~99 %, higher than in the corresponding cases with the RMSprop optimizer. For the 50 % -50 % train-test split, the model accuracy improves by 30.21 % over 30 epochs. The cross-entropy is reduced by ~1 to 2 % for all train-test splits when the Adam optimizer is used during fine-tuning. According to Table 6, the maximum accuracy and minimum cross-entropy loss are achieved with the Adam optimizer and the 60 % -40 % train-test split, although the 50 % -50 % split with Adam gives nearly equal performance in terms of accuracy and loss. The epoch-wise comparison for all 8 experiments is depicted in Figure 4, revealing that the model's accuracy starts to improve after epoch 10, coinciding with the commencement of fine-tuning.
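For reference, Equations (8)-(11) and the ROC points can be computed directly from predictions with scikit-learn, as in this sketch with hypothetical labels and scores:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_curve)

y_true = [1, 0, 1, 1, 0, 1]               # hypothetical labels: 1 = retouched
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7]   # hypothetical model probabilities
y_pred = [int(p >= 0.5) for p in y_prob]  # threshold at 0.5

print("Precision:", precision_score(y_true, y_pred))  # Equation (8)
print("Recall:   ", recall_score(y_true, y_pred))     # Equation (9)
print("F1-score: ", f1_score(y_true, y_pred))         # Equation (10)
print("Accuracy: ", accuracy_score(y_true, y_pred))   # Equation (11)
fpr, tpr, _ = roc_curve(y_true, y_prob)               # TPR vs. FPR for the ROC plot
```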

Result analysis based on training time
The information in Table 7 is obtained by running Python's %time magic command before every training of the model. This returns the wall time, or clock time, required to train the model on the ND-IIITD dataset. Roughly, these timings depend on various factors such as CPU usage and the random samples taken by the model during batch normalization. Hence, a strict comparison across optimizers and split ratios is not possible. In our analysis, we focus on training-time parameters that have not been previously explored or mentioned in the papers discussed in Subsection 1.2. We aim to uncover new insights and potential optimizations that could improve the efficiency and speed of the training process, providing a novel contribution relative to existing methods.
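Outside of IPython, the same wall-clock measurement can be reproduced with `time.perf_counter`, as in this minimal sketch (the `model.fit` call is a placeholder for the actual training run):

```python
import time

start = time.perf_counter()
# model.fit(train_ds, validation_data=val_ds, epochs=30)  # placeholder training call
elapsed = time.perf_counter() - start
print(f"Wall time: {elapsed / 60:.2f} min")  # cf. the 39.34 min maximum reported
```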

Result analysis based on performance metrics
As per Figures 5.1 (a) & (b), the Adam and RMSprop optimizers used during fine-tuning perform equally, with ~100 % precision for retouched (fake) images. On the other hand, for the 70 %, 60 %, and 50 % splits, the Adam optimizer gives better precision for bonafide (real) images. Across all split ratios, the 50 % train-test split achieves the maximum precision for real samples: 96.45 % with Adam and 95.81 % with RMSprop. As shown in Figures 5.2 (a) & (b), the 50 % split also gives the best recall for classifying retouched face images. The recall values measured by the proposed fine-tuned VGG16 model with the Adam optimizer are 96.34 % (max), 91.34 %, 91.25 %, and 77.49 % for the 50 %, 60 %, 70 %, and 80 % split ratios, respectively; with the RMSprop optimizer they are 95.64 % (max), 89.61 %, 85.42 %, and 88.74 %. For classifying real images, the proposed TL VGG16 gives an accuracy of ~100 % for all split ratios with both optimizers, as shown in Figure 5.2(b). As depicted in Figure 5.3, over all split ratios, the maximum accuracy achieved with the Adam and RMSprop optimizers used during fine-tuning of the proposed TL VGG16 model is 98.08 % and 97.82 %, respectively, for the 50 % -25 % train-test split ratio.

Comparison with existing works
The proposed model achieved the highest overall classification accuracy of 98.08 %, with classification accuracies for retouched and real images of 99.83 % and 96.34 %, respectively. Moreover, the proposed model shows improvements of 16.18 % and 10.98 % in classifying retouching on the same dataset, as shown in Table 8. The ROC comparison of the proposed work with [6] in Figure 5.4 also demonstrates that the suggested model gives better overall performance in terms of true positive rate versus false positive rate.
Compared to the recent Reference [13], the proposed work improves the classification accuracy by 10.08 % when the model is trained and evaluated on whole images rather than face patches. These findings lead to the conclusion that the proposed model exhibits superior performance in discerning genuine from retouched images compared to state-of-the-art models. Most prior studies on retouching have trained and evaluated models using facial patches defined by specific landmarks, which often do not achieve optimal accuracy when analyzing entire images. In contrast, our research demonstrates enhanced accuracy and classification performance when training the model on entire images.

CONCLUSIONS
This paper presents a transfer learning approach to detect digital manipulation of facial images. The pre-trained VGG16 model gives better performance on a small dataset compared to existing methods. Furthermore, leveraging transfer learning and fine-tuning reduces computational complexity and time. The experimental results demonstrate the significance of choosing appropriate data partitioning and optimization techniques in enhancing the overall performance of the VGG16-based classifier for retouched face image recognition tasks. The work shows that the fine-tuned VGG16 model with the Adam optimizer performs best at classifying real and retouched faces for the 50 % -25 % train-test split ratio on the ND-IIITD retouched faces dataset.
In the future, we can explore the use of other facial datasets and incorporate more pre-trained models to investigate proper data partitioning and optimization techniques. By experimenting with different datasets and optimizers, we can potentially enhance the fine-tuned VGG16 model's performance in facial retouching detection. Additionally, further research on data augmentation and fine-tuning strategies may help improve the generalization and robustness of the model for real-world applications.

Figure 1. Showcasing examples of facial tampering using Photoshop. The first image is real and the second image of the same person is retouched.

Figure 2. Steps for classification of retouching over ND-IIITD using the TL fine-tuned VGG16 model.

Figure 5.3. Accuracy analysis.

Table 1. Trainable parameters before and after fine-tuning.

Table 3. Dataset description [Pr: degree/percentage of the retouching. Pr 1: least alteration and

Table 4. Train-test splitting of bonafide and retouched images.

Table 5. Comparison of accuracy and cross-entropy for the training dataset.

Table 6. Comparison of accuracy and cross-entropy for the testing (validation) dataset.

Table 7. Comparison of training time for different train-test split ratios.

Table 8. Comparison of the proposed model with all existing models.