Pre-trained object-recognition models provide strong CNN backbones. One related study used the well-known VGG network and improved its training with focal loss ( Wang & Shi, 2021 ). Image-processing operations can generate complex text-based CAPTCHAs, but CAPTCHAs built on common languages remain at high risk of being broken. One study used the Python Pillow library to create Bengali-, Tamil-, and Hindi-language CAPTCHAs and solved them with a D-CNN, showing that the model could also handle these three languages ( Ahmed & Anand, 2021 ). A new technique for automatically creating and solving CAPTCHAs with a simple 15-layer CNN was proposed to remove the manual annotation problem.
Various fine-tuning techniques have been used to break five-digit CAPTCHAs, achieving classification accuracies of 80% ( Bostik et al., 2021 ). A seven-layer CNN that exploits correlated features of text-based CAPTCHAs was trained on a privately collected dataset and achieved 99.7% accuracy on that image database ( Kumar & Singh, 2021 ). Another, similar approach was based on handwritten digit recognition: a CNN was first introduced and then applied to twisted and noise-added CAPTCHA images ( Cao, 2021 ). A depthwise-separable CNN for four-character CAPTCHA recognition achieved 100% accuracy after fine-tuning the separable convolutions with respect to their depth; combining a fine-tuned, pre-trained model with the proposed architecture significantly reduced the training parameters while increasing efficiency ( Dankwa & Yang, 2021 ).
Visual-reasoning CAPTCHAs (also known as Visual Turing Tests, VTTs) have been used in security authentication but can be broken with holistic and modular attacks. One study focused on visual-reasoning CAPTCHAs and reported an accuracy of 67.3% against holistic CAPTCHAs and 88% against VTT CAPTCHAs; designing VTT CAPTCHAs that resist such malicious attacks was left as future work ( Gao et al., 2021 ). To make text-based CAPTCHAs more secure, a CAPTCHA defense algorithm was proposed that generates multi-character CAPTCHAs using adversarial perturbation; the reported results show that this more complex CAPTCHA generation reduces the accuracy of a CAPTCHA breaker to as low as 0.06% ( Wang, Zhao & Liu, 2021 ). A Generative Adversarial Network (GAN) has been used to simplify CAPTCHA images before segmentation and classification: the resulting CAPTCHA solver achieved a 96% character-recognition success rate, and a 74% recognition rate across the other evaluated CAPTCHA schemes; these findings offer guidance for improved CAPTCHA generation ( Wang et al., 2021 ). A binary-image-based CAPTCHA recognition framework was proposed that generates a number of image copies from a given CAPTCHA image to train a CNN model; the four-character recognition accuracy on the testing set was 92.68% for the Weibo dataset and 54.20% for the Gregwar dataset ( Thobhani et al., 2020 ).
reCAPTCHA images are a specific type of security layer used by some sites, with Google's challenges setting the benchmark. This kind of CAPTCHA displays a reference image, and the user must pick the matching images, a task humans solve efficiently. Machine-learning-based studies have also targeted this kind of CAPTCHA ( Alqahtani & Alsulaiman, 2020 ). Drag-and-drop image CAPTCHAs are also used: a piece of the image is missing, and the user must drag a fragment of the correct shape into the correct location. However, such schemes can be broken by locating the blank region from neighborhood differences of pixels, so they are not robust against malicious attacks ( Ouyang et al., 2021 ).
Adversarial attacks are a rising challenge for deceiving deep learning models. To prevent deep-learning-based CAPTCHA attacks, various kinds of adversarial noise have been introduced into security challenges that present similar-looking images among which the user must find a target. Noise-perturbed sample images are generated and shown in a puzzle that an attentive human eye can still solve ( Shi et al., 2021 ; Osadchy et al., 2017 ). However, these schemes demand extra effort: noise-perturbed images can cost users more time, and some adversarial noise-generation methods can produce samples that are unsolvable for real users.
The studies discussed above cover text-based CAPTCHAs as well as other CAPTCHA types. Most of them use DL methods to break CAPTCHAs, and solving time and unsolvable CAPTCHAs remain open challenges. More efficient DL methods are needed that stay robust on datasets they were not developed on. Many studies rely on locally developed datasets, which makes the proposed approaches less robust; publicly available datasets should be used instead to provide more robust and trustworthy solutions.
Recent deep-learning studies have shown excellent CAPTCHA-solving results. However, simple CNN approaches may lose information when incoming features are pooled between convolution and pooling layers. The proposed study therefore utilizes a skip connection, and a five-fold validation approach is adopted to further remove bias. The proposed CAPTCHA solver framework consists of several steps, as shown in Fig. 1 . The data are normalized using several image-processing steps to make them easier for the deep learning model to interpret. The normalized data are then segmented per character so that an OCR-style deep learning model can recognize each character individually. Finally, the five-fold validation results are reported and are promising.
The two datasets used for CAPTCHA recognition contain five and four characters per image, respectively. The five-character dataset has a horizontal line running through overlapping text; segmenting and recognizing such text is challenging because it is unclear. The four-character dataset is less challenging to segment because no line intersects the characters, although character rotation and scaling must still be considered. Their preprocessing and segmentation are explained in the next section, and the datasets are explored in detail before and after these steps.
Two public Kaggle datasets are used in the proposed study, one with five characters per image and one with four, and they contain different numbers of numeric and alphabetic characters. There are 1,040 images in the five-character dataset ( d 1 ) and 9,955 images in the four-character dataset ( d 2 ), with 19 character types in d 1 and 32 in d 2 . Their respective dimensions and file formats before and after segmentation are shown in Table 2 , and the frequency of each character in both datasets is shown in Fig. 2 .
Properties | d1 (five-character) | d2 (four-character)
---|---|---
Image dimension | 50 × 200 × 3 | 24 × 72 × 3
Extension | PNG | PNG
Number of images | 1,040 | 9,955
Character types | 19 | 32
Resized image dimension (per character) | 20 × 24 × 1 | 20 × 24 × 1
The frequency of each character varies between the datasets, and so does the number of character types. The d 2 dataset has no complex line intersection or merging of text, but it has more character types and higher character frequencies. The d 1 dataset, by contrast, has more complex data and fewer character types and lower frequencies than d 2 . Initially, d 1 images have dimensions 50 × 200 × 3, where 50 is the number of rows, 200 the number of columns, and 3 the color depth; d 2 images have dimensions 24 × 72 × 3, where 24 is the rows, 72 the columns, and 3 the color depth. The character locations are nearly fixed in both datasets, so the images can be manually cropped to train the model on each character in isolation. The cropped dimensions vary per character, however, so the crops need to be resized to a common size. The input images of both datasets were in Portable Network Graphics (PNG) format and did not need conversion. After segmenting the images of both datasets, each character is resized to 20 × 24; this size preserves the visual binary patterns of each character. The dataset details before and after resizing are shown in Table 2 .
Table 2 summarizes the datasets used in the proposed study. The resized per-character dimension reflects the fact that, when characters are segmented from the dataset images, their sizes vary across datasets and character types. A size of 20 rows by 24 columns was found to preserve the image data of each character and is therefore set for every character.
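As an illustrative MATLAB sketch of this cropping-and-resizing step (the crop rectangle is a hypothetical example, not the positions used by the authors; the 24 × 20 target follows the network input size listed in Table 3):

```matlab
% Illustrative only: isolate one character region and normalize its size.
img  = imread('captcha_sample.png');        % hypothetical d1-style image, 50 x 200 x 3
gray = rgb2gray(img);
charRegion = imcrop(gray, [5 1 39 49]);     % [x y width height]; example crop values only
charNorm   = imresize(charRegion, [24 20]); % height x width of the network input (Table 3)
```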
d 2 dataset images do not need any complex image processing to segment them into a normalized form, whereas d 1 requires extra operations to remove the line intersecting each character so that every character can be isolated correctly. Therefore, three steps are performed on the d 2 dataset: the images are first converted to greyscale, then converted to a binary form, and their complement is lastly taken. For the d 1 dataset, two additional steps, erosion and area-wise selection, are performed to remove the intersecting line and clean the character edges. The primary steps for both datasets and the isolation of each character are shown in Fig. 3 .
Binarization is the key step for capturing the structural morphology of a character in a given image. Grayscale conversion is therefore performed first, and the grayscale images are then converted to a binary format. An RGB image has three channels: red, green, and blue. Let I ( x , y ) be the input RGB image, as shown in Eq. (1) .
In Eq. (1) , I is the given image, and x and y represent the rows and columns. The grayscale conversion is performed using Eq. (2) :
In Eq. (2) , i is the iterating row position, j is the iterating column position of the pixel being processed, and R , G , and B are the red, green, and blue values of that pixel. The multiplying constants map the three channel values to a new grey level in the range 0–255, giving the output grey level of the pixel at that iteration. After conversion to grey level, binarization is performed using Bradley's method, which computes a neighborhood-based threshold to map the two-dimensional grey-level matrix to 1 and 0 values. The neighborhood threshold operation is performed using Eq. (3) .
In Eq. (3) , the output is the neighborhood-based threshold, computed over a neighborhood of roughly 1/8th of the image size; the floor is taken to obtain a lower value and avoid a miscalculated threshold. This is also known as an adaptive threshold, and the neighborhood size can be changed to strengthen or weaken the binarization of a given image. After obtaining the binary image, its complement is taken, a simple inversion calculated as shown in Eq. (4) , to highlight the object in the image.
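A minimal MATLAB sketch of these steps, assuming the Image Processing Toolbox (adaptthresh implements Bradley's technique, and its default window of 2*floor(size(I)/16)+1 is roughly 1/8th of the image size, matching the description above):

```matlab
% Grayscale -> Bradley (adaptive) threshold -> binarize -> complement.
gray  = rgb2gray(imread('captcha_sample.png'));             % hypothetical input image
nhood = 2*floor(size(gray)/16) + 1;                         % odd window, about 1/8th of each dimension
T     = adaptthresh(gray, 0.5, 'NeighborhoodSize', nhood);  % Bradley's local threshold (Eq. (3))
bw    = imbinarize(gray, T);                                % 0/1 image
bw    = imcomplement(bw);                                   % invert so characters become foreground (Eq. (4))
```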
In Eq. (4) , the 0 and 1 values at each pixel position x and y are inverted. For the d 2 dataset, the inverted image is used directly for character isolation. For d 1 , further erosion is needed. Erosion uses a structuring element of a particular shape to remove pixels from a binary image; for these CAPTCHA images, the intersecting line is removed using a line-shaped structuring element that operates on pixel neighborhoods. In the proposed study, a line of length 5 at an angle of 90 degrees is used, and the line intersecting each character in the binary image is removed, as can be seen in Fig. 3 , row 1. The erosion operation with the length-5, 90-degree line element is calculated as shown in Eq. (5) .
In Eq. (5) , C is the binary image, L is the line-shaped structuring element, and B x is the eroded result, a subset extracted from the input binary image C . After erosion, some images contain noise that may lead to a wrong interpretation of the character. To remove it, a neighborhood operation is used again: connected regions are examined with 8-connectivity, and regions whose area of 1-valued pixels falls below a threshold of 20 pixels are removed, since noise regions remain smaller than the character in the binary image. This requires computing the area of each region pixel by pixel; regions of up to 20 pixels are discarded, and larger regions remain in the output image. The area of a region, accumulating the pixels whose value is 1, is calculated as shown in Eq. (6) .
In Eq. (6) , the rows ( i ) and columns ( j ) of a specific eroded image B x are iterated to accumulate the 1-valued pixels into an area, which is then compared with the threshold value T . Noise regions are thus removed, and a final isolation step separates each normalized character.
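A corresponding MATLAB sketch of the erosion and area-based noise removal described above (bw is the complemented binary image from the previous sketch):

```matlab
% Remove the intersecting line and small noise regions (d1 dataset).
se     = strel('line', 5, 90);        % line structuring element, length 5, angle 90 degrees
eroded = imerode(bw, se);             % erosion removes the thin strike-through line (Eq. (5))
clean  = bwareaopen(eroded, 20);      % drop connected regions with fewer than 20 pixels (Eq. (6))
```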
The network's 2-D convolution operates on an image I ( x , y ), where x and y are the rows and columns, using a convolving window W of N R rows and N C columns. The window is multiplied element-wise with the corresponding image values and the products are summed to produce the resultant feature map, iterating the window indices b over rows and a over columns, starting from 1.
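A plausible reconstruction of this operation, consistent with the description above (the exact indexing convention is an assumption), is:

\[
R(x, y) \;=\; \sum_{b=1}^{N_R} \sum_{a=1}^{N_C} I\big(x + b - 1,\; y + a - 1\big)\, W(b, a),
\]

which is the cross-correlation form of convolution typically computed by CNN layers.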
Batch normalization first computes a single normalized component value: the new value Bat′ is obtained from a given input value a , its mean M [ a ], and, in the denominator, its variance var( a ). The normalized value is then refined layer by layer into the final output with the help of learnable scale and shift parameters, so that the extended batch-normalization formulation in each layer builds on the previous Bat′ value, as shown below.
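A plausible reconstruction of these two steps, assuming the standard scale-and-shift form of batch normalization with a small stability constant ε (the learnable parameters γ and β are assumptions based on that standard form), is:

\[
\mathrm{Bat}' = \frac{a - M[a]}{\sqrt{\mathrm{var}(a) + \epsilon}},
\qquad
y = \gamma \,\mathrm{Bat}' + \beta .
\]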
ReLU suppresses negative input values and retains positive ones. Its equation can be written as f ( x ) = max(0, x ), where x is the input value: the value is passed through directly if it is greater than zero, and negative values are replaced with 0.
The skip connection carries earlier pictorial information forward to later convolved feature maps of the network. In the proposed network, the output of the first ReLU layer is saved and, after the second and third ReLU layers, is merged back in through an addition layer. Adding this skip connection distinguishes the network from conventional deep learning classification approaches. A visualization of the added feature information is shown in Fig. 1 .
The average pooling layer is straightforward: a window of size m × n , where m is the number of rows and n the number of columns, slides over the input from the previous layer, moving horizontally and vertically according to the stride parameters, and outputs the average of the values under the window.
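As a formal sketch (assuming the same stride s in both directions; the indexing is an assumption, not taken from the paper), the average-pooling output over an m × n window of an incoming feature map F is:

\[
P(i, j) = \frac{1}{mn} \sum_{u=1}^{m} \sum_{v=1}^{n} F\big((i-1)s + u,\; (j-1)s + v\big).
\]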
Many previously introduced deep-learning algorithms, as summarized in Table 1 , ultimately rely on CNN-based methods. However, traditional CNN pipelines built from convolution blocks, as well as transfer-learning approaches, may lose important information when they pool down the feature maps coming from previous layers. Likewise, evaluation with a single conventional train/validation/test split may be biased because far less data is tested than trained. The proposed study therefore uses a single skip connection while keeping the other convolution blocks and, inspired by the K-fold validation method, splits each dataset into five folds. The folds are trained and tested in sequence, and the five per-fold results are averaged to report the final accuracy. The proposed CNN contains 16 layers in total, organized into three main blocks of convolution, batch normalization, and ReLU layers. After these nine layers, an addition layer merges the skip connection with the output of the third ReLU layer. Average pooling, fully connected, and softmax layers follow the skip connection. All layer parameters and details are shown in Table 3 .
Number | Layer name | Category | Parameters | Weights/Offset | Padding | Stride
---|---|---|---|---|---|---
1 | Input | Image Input | 24 × 20 × 1 | – | – | –
2 | Conv (1) | Convolution | 24 × 20 × 8 | 3 × 3 × 1 × 8 | Same | 1
3 | BN (1) | Batch Normalization | 24 × 20 × 8 | 1 × 1 × 8 | – | –
4 | ReLU (1) | ReLU | 24 × 20 × 8 | – | – | –
5 | Conv (2) | Convolution | 12 × 10 × 16 | 3 × 3 × 8 × 16 | Same | 2
6 | BN (2) | Batch Normalization | 12 × 10 × 16 | 1 × 1 × 16 | – | –
7 | ReLU (2) | ReLU | 12 × 10 × 16 | – | – | –
8 | Conv (3) | Convolution | 12 × 10 × 32 | 3 × 3 × 16 × 32 | Same | 1
9 | BN (3) | Batch Normalization | 12 × 10 × 32 | 1 × 1 × 32 | – | –
10 | ReLU (3) | ReLU | 12 × 10 × 32 | – | – | –
11 | Skip-connection | Convolution | 12 × 10 × 32 | 1 × 1 × 8 × 32 | 0 | 2
12 | Add | Addition | 12 × 10 × 32 | – | – | –
13 | Pool | Average Pooling | 6 × 5 × 32 | – | 0 | 2
14 | FC | Fully connected | 1 × 1 × 19 (d1) / 1 × 1 × 32 (d2) | 19 × 960 (d1) / 32 × 960 (d2) | – | –
15 | Softmax | Softmax | 1 × 1 × 19 (d1) / 1 × 1 × 32 (d2) | – | – | –
16 | Class Output | Classification | – | – | – | –
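As a minimal sketch (not the authors' released MATLAB code), the Table 3 topology could be assembled with the Deep Learning Toolbox roughly as follows; layer names are illustrative, and the fully connected size is 19 for d1 (32 for d2):

```matlab
% Sketch of the 16-layer skip-connection CNN from Table 3.
layers = [
    imageInputLayer([24 20 1], 'Name', 'input')
    convolution2dLayer(3, 8,  'Padding', 'same', 'Stride', 1, 'Name', 'conv1')
    batchNormalizationLayer('Name', 'bn1')
    reluLayer('Name', 'relu1')
    convolution2dLayer(3, 16, 'Padding', 'same', 'Stride', 2, 'Name', 'conv2')
    batchNormalizationLayer('Name', 'bn2')
    reluLayer('Name', 'relu2')
    convolution2dLayer(3, 32, 'Padding', 'same', 'Stride', 1, 'Name', 'conv3')
    batchNormalizationLayer('Name', 'bn3')
    reluLayer('Name', 'relu3')
    additionLayer(2, 'Name', 'add')                 % relu3 feeds add/in1 via the serial connection
    averagePooling2dLayer(2, 'Stride', 2, 'Name', 'pool')
    fullyConnectedLayer(19, 'Name', 'fc')           % 32 output classes for the d2 dataset
    softmaxLayer('Name', 'softmax')
    classificationLayer('Name', 'output')];
lgraph = layerGraph(layers);

% 1 x 1 skip convolution from relu1 to the addition layer (stride 2 matches the spatial sizes).
skip   = convolution2dLayer(1, 32, 'Stride', 2, 'Name', 'skip');
lgraph = addLayers(lgraph, skip);
lgraph = connectLayers(lgraph, 'relu1', 'skip');
lgraph = connectLayers(lgraph, 'skip', 'add/in2');
```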
Table 3 lists the learnable weights of each layer. The two datasets have different numbers of character categories; therefore, in the dense layer, the output size was 19 classes for the five d1 models and 32 classes for the five d2 models. The skip connection contributes weights beyond the main convolution path. The weight learning of each model is compared in Fig. 4 .
The figure shows the weights of the first convolution, the batch normalization, and the skip connection. The deeper layers have a larger number of weights, or learnable parameters, and the distinct contributing connection weights are shown in Fig. 4 . Multiple types of feature maps are included in the figure; the weights shown are from one dataset and may vary slightly for the other. The skip-connection weights contain multiple feature types that a plain convolution layer does not capture. We can therefore say that the proposed CNN architecture learns multiple types of features, in contrast to previous studies that use a traditional CNN, and this connection may also be useful in other text- and object-recognition and classification tasks.
Having obtained these significant, multi-aspect features, the proposed study applies K-fold validation by splitting the data into five folds. These multiple splits remove bias between the training and testing data, and the testing result is taken as the mean over all models; in this way, no data is left out of training and none goes untested, so the results are more trustworthy than those of conventional single-split CNN evaluations. The segmented d2 images contain clear, well-structured characters, while the isolated d1 character images are less clear. The classification results therefore remain lower for d1, whereas for d2 they remain high and usable as a CAPTCHA solver. The results for each character and dataset on each fold are discussed in the next section.
As discussed earlier, the proposed framework uses two datasets with different numbers of categories and images, so they are evaluated separately in this section. First, the five-character dataset is used to train five CNN models of the same architecture, each with a different split of the data; second, the four-character dataset is used with the same architecture but a different number of output classes.
The five-character dataset has 1,040 images; after segmenting each character, there are 5,200 character images in total. The data are split into five folds of 931, 941, 925, 937, and 924 test images, respectively; the splits were formed by randomly selecting roughly 20% of the total data for each fold, with the remaining images assigned to the training set. Training on four folds and testing on the held-out fold gives the results shown in Table 4 .
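A sketch of how such a five-fold loop could be organized in MATLAB, assuming the segmented character images are stacked in a 4-D array X with categorical labels Y, and that lgraph is the layer graph sketched after Table 3 (the training options are illustrative assumptions, not the authors' settings):

```matlab
% Five-fold training and evaluation; the reported accuracy is the mean over folds.
cv   = cvpartition(Y, 'KFold', 5);                                  % stratified 5-fold split
opts = trainingOptions('sgdm', 'MaxEpochs', 20, 'Verbose', false);  % illustrative options
acc  = zeros(cv.NumTestSets, 1);
for k = 1:cv.NumTestSets
    XTrain = X(:, :, :, training(cv, k));  YTrain = Y(training(cv, k));
    XTest  = X(:, :, :, test(cv, k));      YTest  = Y(test(cv, k));
    net    = trainNetwork(XTrain, YTrain, lgraph, opts);            % train on four folds
    acc(k) = mean(classify(net, XTest) == YTest);                   % test on the held-out fold
end
meanAcc = mean(acc);                                                % 5-fold mean accuracy
```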
Character | Fold 1 accuracy (%) | Fold 2 accuracy (%) | Fold 3 accuracy (%) | Fold 4 accuracy (%) | Fold 5 accuracy (%) | 5-fold mean accuracy (%) | 5-fold mean F1-measure (%)
---|---|---|---|---|---|---|---
– | 87.23 | 83.33 | 89.63 | 84.21 | 83.14 | 84.48 | 86.772
– | 87.76 | 75.51 | 87.75 | 90.323 | 89.32 | 86.12 | 86.0792
– | 84.31 | 88.46 | 90.196 | 91.089 | 90.385 | 89.06 | 89.4066
– | 84.31 | 80.39 | 90.00 | 90.566 | 84.84 | 86.56 | 85.2644
– | 86.95 | 76.59 | 82.61 | 87.50 | 82.22 | 87.58 | 85.2164
– | 89.36 | 87.23 | 86.95 | 86.957 | 88.636 | 86.68 | 87.3026
– | 89.58 | 79.16 | 91.66 | 93.47 | 89.362 | 87.49 | 89.5418
– | 81.81 | 73.33 | 97.72 | 86.04 | 90.90 | 85.03 | 87.7406
– | 87.23 | 79.16 | 85.10 | 82.60 | 80.0 | 82.64 | 81.0632
– | 91.30 | 78.26 | 91.30 | 87.91 | 88.66 | 88.67 | 86.7954
– | 62.79 | 79.54 | 79.07 | 85.41 | 81.928 | 78.73 | 79.4416
– | 92.00 | 84.00 | 93.87 | 93.069 | 82.47 | 89.1 | 87.5008
– | 95.83 | 91.83 | 100 | 95.833 | 94.73 | 95.06 | 94.522
– | 64.00 | 56.00 | 53.061 | 70.47 | 67.34 | 62.08 | 63.8372
– | 81.40 | 79.07 | 87.59 | 79.04 | 78.65 | 81.43 | 77.8656
– | 97.78 | 78.26 | 82.22 | 91.67 | 98.87 | 90.34 | 92.0304
– | 95.24 | 83.72 | 90.47 | 96.66 | 87.50 | 90.55 | 91.3156
– | 89.58 | 87.50 | 82.97 | 87.23 | 82.105 | 85.68 | 86.067
– | 93.02 | 95.45 | 97.67 | 95.43 | 95.349 | 95.40 | 95.8234
Overall | 86.14 | 80.77 | 87.24 | 88.183 | 86.1265 | 85.52 | 85.9782
Table 4 lists the fold-by-fold accuracies of the 19 character types, their per-character means, and, in the last row, the overall mean of each fold and across all folds. The Y character has the highest validation accuracy (95.40%), probably because its structure differs from almost every other character, and G follows closely with 95.06%. These two are the only characters above 95%; the remaining characters mostly fall between 81% and 90%. The least accurate character is M, at 62.08%, varying between 53% and 74% across the five folds; M is evidently confused with other characters, so its structural design would need particular attention. To keep CAPTCHAs hard for machines while remaining easy for humans, the characters recognized with high accuracy would need larger angular and structural changes so that they resist machine-learning models; such complexity could also be probed by further fine-tuning the CNN, for example by adding or removing skip connections, which may likewise change the achievable accuracy. The four-character dataset is important because it has 32 character types and many more images, and the lower accuracy on this five-character dataset may also stem from its small size and limited training data. Other character-recognition studies report higher accuracies on similar datasets, but they may be less trustworthy than the proposed study because they lack an unbiased validation method. For further validation, the precision- and recall-based F1-scores averaged over the five folds are also shown in Table 4: the Y character again obtains the highest F1-measure (95.82%), confirming it as the most reliably broken character, and G obtains the second-highest F1-score (94.522%) among the 19 characters. The overall mean F1-score across the five folds is 85.97%, slightly higher than the overall accuracy. Since the F1-score is the harmonic mean of precision and recall, it can be more suitable than accuracy here because it accounts for class imbalance across categories; in terms of the F1-score, the proposed study can therefore be considered a more robust approach. The four-character dataset recognition results are discussed in the next section.
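For reference, the F1-measure reported in Tables 4 and 5 is the harmonic mean of precision and recall:

\[
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.
\]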
The four-character dataset has a higher frequency of each character than the five-character dataset, as well as more character types. The same five-fold splitting was applied to its segmented characters: the five folds contained 7,607, 7,624, 7,602, 7,617, and 7,595 of the 38,045 individual character images, respectively, and for each fold the remaining images formed the training set. The results for each character and each fold, together with the overall means, are given in Table 5 .
Character | Fold 1 accuracy (%) | Fold 2 accuracy (%) | Fold 3 accuracy (%) | Fold 4 accuracy (%) | Fold 5 accuracy (%) | 5-fold mean accuracy (%) | 5-fold mean F1-measure (%)
---|---|---|---|---|---|---|---
– | 97.84 | 99.14 | 99.57 | 98.92 | 99.13 | 98.79 | 98.923
– | 97.02 | 94.92 | 98.72 | 97.403 | 97.204 | 96.52 | 97.5056
– | 97.87 | 97.46 | 99.15 | 98.934 | 98.526 | 98.55 | 98.4708
– | 98.76 | 98.76 | 99.17 | 97.97 | 98.144 | 99.01 | 98.0812
– | 100 | 95.65 | 99.56 | 99.346 | 99.127 | 98.69 | 98.947
– | 98.80 | 99.60 | 99.19 | 99.203 | 98.603 | 99.36 | 98.9624
– | 99.15 | 98.72 | 97.42 | 98.283 | 98.073 | 98.29 | 98.1656
– | 98.85 | 96.55 | 98.08 | 98.092 | 99.617 | 98.39 | 98.4258
– | 97.85 | 98.71 | 99.13 | 98.712 | 97.645 | 98.54 | 98.2034
– | 99.57 | 96.59 | 98.72 | 97.89 | 96.567 | 97.95 | 97.912
– | 99.58 | 98.75 | 99.16 | 99.379 | 99.374 | 99.25 | 99.334
– | 100 | 100 | 100 | 99.787 | 99.153 | 99.92 | 99.6612
– | 99.18 | 97.57 | 100 | 98.994 | 98.374 | 98.94 | 98.6188
– | 98.69 | 98.26 | 100 | 98.253 | 98.253 | 98.52 | 98.3076
– | 98.76 | 97.93 | 100 | 98.319 | 98.551 | 98.43 | 98.7944
– | 99.58 | 97.90 | 100 | 98.347 | 99.371 | 99.33 | 99.1232
– | 100 | 98.72 | 99.57 | 99.788 | 99.574 | 99.66 | 99.4458
– | 99.15 | 99.58 | 100 | 99.156 | 99.371 | 99.58 | 99.1606
– | 97.41 | 98.28 | 100 | 99.355 | 99.352 | 98.79 | 99.1344
– | 99.16 | 96.23 | 99.16 | 99.17 | 99.532 | 98.58 | 98.9816
– | 99.58 | 97.10 | 99.17 | 99.793 | 98.755 | 98.83 | 98.652
– | 98.35 | 97.94 | 98.77 | 98.347 | 96.881 | 97.86 | 97.8568
– | 100 | 100 | 99.58 | 99.576 | 99.787 | 99.75 | 99.7456
– | 99.58 | 99.17 | 99.17 | 99.174 | 98.319 | 99.00 | 99.0834
– | 98.75 | 99.58 | 100 | 99.583 | 99.156 | 99.42 | 99.4118
– | 97.47 | 97.90 | 98.73 | 98.305 | 98.312 | 97.98 | 99.558
– | 100 | 97.43 | 99.57 | 99.134 | 98.925 | 98.80 | 99.1794
– | 100 | 98.67 | 98.67 | 99.332 | 98.441 | 98.47 | 98.8488
– | 100 | 100 | 100 | 99.376 | 99.167 | 99.67 | 99.418
– | 99.15 | 97.46 | 100 | 99.573 | 99.788 | 99.15 | 99.3174
– | 97.90 | 98.33 | 98.74 | 98.156 | 99.371 | 98.66 | 98.7866
– | 99.17 | 98.75 | 99.16 | 98.965 | 99.163 | 99.16 | 99.0832
Overall | 98.97 | 98.18 | 99.32 | 98.894 | 98.737 | 98.82 | 98.846
From Table 5 , it can be observed that almost every character is recognized with around 99% accuracy. The highest mean accuracy, 99.92% for character D, comes from four folds at 100% and one fold at 99.57%. This also indicates that the proposed study removed whatever bias existed in the dataset through the splits; making folds in a deep learning evaluation is therefore important, whereas most studies use only a single split, which carries a high risk of bias. It is also notable that character M, which achieved the lowest accuracy on the five-character CAPTCHA, was recognized with 98.58% accuracy on this four-character CAPTCHA; the structural morphology of M in the five-character CAPTCHA is thus better at resisting CAPTCHA solvers. The F1-scores in Table 5 lie between 97% and 99% for all characters, and their variation across folds closely follows the fold accuracies. The mean F1-score per character further confirms how reliably each character type is broken. Class imbalance across the 32 classes could reduce confidence in the accuracy figures alone, but the F1-scores reported in Table 5 cross-validate the performance of the proposed study. These high results show that this four-character CAPTCHA design is at high risk; adding line intersections, joined characters, and correlated distortions might help it resist being broken. Many approaches have been proposed to recognize CAPTCHAs, and most use a conventional structure; the proposed study uses a more trustworthy validation approach with multi-aspect feature extraction, so it can serve as a more effective tool for breaking CAPTCHA images and for testing the designs produced by CAPTCHA designers. In this way, CAPTCHA designs can be hardened against new deep learning approaches. The validation accuracies and losses for both datasets on all folds are illustrated in Fig. 5 .
Figure 5 shows the fold-wise validation losses and accuracies for the five- and four-character CAPTCHAs. All folds of the five-character CAPTCHA reach close to 90% accuracy, and only the second fold stays at 80.77%; that fold evidently contains cases that single-split deep learning approaches would not cover, leaving their results at risk. Similarly, the four-character CAPTCHA, with more samples and less complex characters, should not be used, as it breaks easily compared to the five-character CAPTCHA. Previous CAPTCHA-recognition studies have used self-generated or augmented datasets to build their solvers, so the numbers of images, spatial resolutions, styles, and results are not directly comparable. The proposed study focuses mainly on a better validation technique for deep learning, with multi-aspect features obtained via skip connections in a CNN. A comparison with several character-matching studies is nevertheless provided to make the proposed study more reliable.
Table 6 shows that previous studies used different numbers of characters, often with self-collected or generated datasets, so only partial comparisons can be made. Some studies considered the number of dataset characters. The accuracies are not directly comparable because the proposed study uses five-fold validation while the others use only a single split. Within these limits, the proposed study compares favorably in terms of both the CNN framework and its validation scheme.
References | No. of characters | Method | Results
---|---|---|---
Du et al. (2017) | 6 | Faster R-CNN | Accuracy = 98.5%
 | 4 |  | Accuracy = 97.8%
 | 5 |  | Accuracy = 97.5%
Chen et al. (2019) | 4 | Selective D-CNN | Success rate = 95.4%
Bostik et al. (2021) | Different | CNN | Accuracy = 80%
Bostik & Klecka (2018) | Different | KNN | Precision = 98.99%
 |  | SVM | 99.80%
 |  | Feed-forward net | 98.79%
Proposed study | 4 | Skip-CNN with 5-fold validation | Accuracy = 98.82%
 | 5 | – | Accuracy = 85.52%
The proposed study takes a different deep learning approach to solving CAPTCHAs: a skip-connection CNN for breaking text-based CAPTCHAs. Two CAPTCHA datasets are discussed and evaluated character by character. The reported results are trustworthy because dataset bias, if any, is removed by the five-fold validation method, and they improve on previous studies. The high recognition rates indicate that these CAPTCHA designs are at high risk, since a malicious attack could break them on the web; the proposed CNN can therefore be used to test CAPTCHA designs by attempting to solve them reliably, including in real time. Furthermore, the proposed study uses publicly available datasets for training and testing, which makes it a more robust approach to solving text-based CAPTCHAs.
Many studies have used deep learning to break CAPTCHAs, motivated by the need for CAPTCHA designs that do not waste user time yet resist CAPTCHA solvers; such designs would make web systems more secure against malicious attacks. In the future, data augmentation and more robust data-generation methods could be applied to CAPTCHA datasets, in which CAPTCHAs with intersecting lines are harder to break, and similar DL models could also be used to solve CAPTCHAs based on other local languages.
The authors received no funding for this work.
Competing Interests
Shida Lu is employed by State Grid Information & Communication Company, SMEPC, China.
Kai Huang is employed by Shanghai Shineenergy Information Technology Development Co., Ltd., China.
Author Contributions
Shida Lu and Kai Huang conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.
Talha Meraj conceived and designed the experiments, performed the experiments, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.
Hafiz Tayyab Rauf conceived and designed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.
Data Availability
The following information was supplied regarding data availability:
The MATLAB code is available in the Supplemental File . The data is available at Kaggle: https://www.kaggle.com/genesis16/captcha-4-letter and https://www.kaggle.com/fournierp/captcha-version-2-images .