1. Introduction
The research community has been working for many years to recognize printed and written texts [
1]. Many applications to the text recovery system help blind people to navigate specific environments [
2]. The text is filtered with textual contents on the images. The text has high-level scene information in natural pictures. The styles, font sizes, and text may be different. Anyone can view natural text scenes in advertising notes and banners. Text with a background of water, constructions, woods, rocks, etc., mostly has noise. The process of text recognition from natural images can be slowed down with these backgrounds [
3]. This problem may be more prominent when lights are loud, blurred, and texts are directed in images of nature [
3,
4]. Here, this paper’s primary focus is on the recognition of Arabic texts from natural scenes.
For most people, video is a crucial source of information—there are those who understand videos better than other sources of information. Many videos on social websites and TV channels are available. Storage technology can now save a large number of video data. Since the length and annotation of these videos are increasing, quick and productive recovery systems should be available to enable access to the information relevant to the videos. The most significant information in videos is the title. Anybody can understand the text written on a video. The recognition of video text offers help in analyzing the content of the video [
5]. The text data on the videos are highly linked to the people in the video, and they are also linked to the events of the video.
Reading text from a printed document such as an essay differs to natural scene recognition [
6]. This problem has brought attention to the area of natural scene recognition. Scene text recognition has become an essential module for many practical applications due to the advancement of mobile devices such as self-driving cars, smartphones, etc. [
7]. A lot of work has been carried out in recognizing printed text in the form of Optical Character Recognition (OCR) systems. Still, they cannot deal effectively with natural scenes with factors such as image distortion, variable fonts, background noise, etc. [
8]. Gur et al. [
9] described the problems in the text recognition system. According to them, OCR tools are not sufficient for a complete solution; human expertise is also required. Arabic is the fourth most widely spoken language globally, but the work on word-level text recognition for the Arabic language is negligible [
10]. The original image containing one word imagescan be seen in
Figure 1.
Arabic text acknowledgment is an active field of research in pattern recognition today. Arabic printed documents have been a topic of study for many years in the research community. Character recognition [
11] makes up most of the previous work on the Arabic script. Due to its cursive fashion and lack of annotated information, the Arabic script is challenging [
12]. Additionally, the OCR script system is also challenging in regard to Arabic because most English OCR techniques are not very practical for Arabic. For Arabic scripts, many researchers have worked on word segmentation, but now, people work on free text segmentation recognition. Models such as the recurrent neural network are used for this sequence by sequence (RNN). The input to target labels in a series is transcribed in these models [
13]. For handwritten English and Arabic text recognition, the RNN model is used in [
14,
15], respectively. For Arabic video text recognition, this paper uses the same sequence to sequence model. A great deal of work has been carried out in [
16,
17] to recognize video text. The availability of Arabic script datasets is minor compared to English script datasets, but for video text recognition, two datasets are available: the ACTIV dataset [
18] and the ALIF dataset [
19]. Due to the lack of annotated data of Arabic script, the work on word-level text recognition is very scarce [
20].
DCNNs are used for various tasks such as feature extraction, classification, and objection detection. However, the input and output size of this network is not fixed [
21]. If the model receives input data in a series, a sequence of labels should be predicted. RNN models resolve the problem of inputs and outputs of variable sizes. To recall previous input, RNN introduces the concept of memory [
22]; the information used by RNN-based models is usually converted into sequence–function vectors [
23]. A DCNN is used to obtain the functional representations of the input image in the proposed study. These features are transferred to a bidirectional RNN. These characteristics are transmitted to the two-way RNN. The mechanism of attention is applied to the RNN. The mechanism for attention gives the results of each RNN output. Relevant information will arrive at the front when necessary, via the attention mechanism [
24]. The results are predicted based on this attention model.
For natural scene images, several text recognition systems are proposed. Bissacco et al. [
25] recognize characters for English scripts on a single basis, and then the DCNN model recognizes the characters detected. These models’ limitation is that the exactness of the word’s cutting and detection is not very accurate. Due to the complications, this is harder for Arabic script. As far as our knowledge is concerned, there is less work in Arabic text script recognition at the word level. Here, we use images of natural scenes; Using images of natural scenes involves three challenges, including environmental challenges scene complexity, noise, etc. The mages may be blurred, and the content’s text may have variable fonts, guidelines, and text colors. The complex properties of an Arabic script, including its semi-cursive mode, letter shape, paws (Arabic part), dots, and similar letters in different situations present other challenges.
Figure 2 is a representation of Arabic natural scene text.
Jain et al. [
13] proposed a hybrid CNN-RNN model to recognize Arabic text in natural scenes and videos. In this model, an input image is converted into a feature map by the CNN layer. This layer splits the feature map column-wise (
f1,
f2, ...,
ft) and transfers to the RNN layer. The RNN layer reads the feature sequences one by one (mean one time step at a time) and obtains a fixed dimension vector that contains all the information of input feature sequences. Then, a transcription layer uses this fixed vector to generate the output sequence. The main issues of this approach include the fact that all the knowledge of these feature sequences is stored in a fixed-length vector. During the backpropagation, the model has to visit all the information in a fixed vector. This will be more difficult if we have long sequences. Therefore, there should be a mechanism that deals dynamically with these feature sequences (input sequences). The attention mechanism dynamically picks the most relevant information from the feature sequence and feeds it to the context vector; for this, it soft searches where the vital information is present in the input sequences [
26]. During backpropagation, the model will only visit the critical information in the context vector. This paper presents an attention-based CNN-RNN model for Arabic image text recognition.
2. Literature Review
The Arabic language is extremely complex, due to its many properties. It is written cursively from right to left; it has only 28 letters. Most of its letters, except six letters, are connected to the previous letter at the bottom of the line. One or more paws can be an Arabic word (part of the word). A letter can be linked to the letter or linked in the form of a paw to the letter. The letter form makes the script more difficult, and the shape of it depends on the location of the letter in the beginning, in the middle, or at the end [
27]. Ligature fashion [
28] is also used in Arabic scripts, which means that two letters connect and form a new letter that cannot be separated. These characteristics make the automatic recognition of Arabic scripts more difficult. A. Shahin presented a regression-based model to recognize Arabic printed text on paper [
29], which could be linear or nonlinear. First, they thinned the text from the images and segmented the images into sub-words. Following that, a coding scheme was used to represent the relationship between word segments. Each form of the character and font type had a code in the coding scheme. Finally, the linear regression model validated the representations against a truth table. They tested the model on 14,000 different word samples, and the accuracy was 86 percent.
There are two recognition methods for document- and video-based OCR techniques: one is the free segmentation method, and the other is based on segmentation. Mostly, two approaches are used for recognition: the first one is the analytical approach, which segments a line or word into smaller units, and the second one is the global approach, which does not perform any segmentation; it takes the whole line or word as an input. Most of the Arabic text recognition methods are segmentation-based. For example, Ben Halima [
30] uses an analytical approach for Arabic video text recognition. They perform binarization on Arabic text lines and then segmentation by projection profiles. They use the fuzzy-k-nearest-neighbors technique on handcrafted features to classify characters. Lwata [
31] also employs the analytical approach for Arabic text recognition on various video frames. They use a space detection technique to segment the line into words, which are then segmented into characters. Finally, they use an adapted quadratic function classifier to perform character-level recognition. With these approaches, segmentation errors can grow indefinitely and contribute to recognition performance.
A. Bodour presented a deep learning model for Arabic text recognition in their work. Their work was mainly focused on the recognition of Islamic manuscripts of history. To enhance the quality of scanned images and for segmentation, several preprocessing techniques were used. After preprocessing, they used CNNs for recognition and achieved 88.20% validation accuracy. A deep learning model is also used in [
9] for handwritten Arabic text recognition. Various preprocessing techniques were used for image noise reduction. They used a threshold method to segment the characters of words into distinct regions. Feature vectors were constructed by these distinct regions. They evaluated the model on a custom dataset gathered from different writers. The model achieved 83% accuracy.
In research works [
32,
33], a CNN model with three layers is presented for the recognition of handwritten characters of Arabic script. The datasets used for evaluation were AHCD [
34] and AIA9k [
35]. The model achieved 97.5% validation accuracy. Another deep learning model with MD-LSTM is presented in [
36] for handwritten Arabic text recognition. The model used the CTC loss function. The purpose of that work was to evaluate the results of extending the dataset using data enhancing techniques and compare the performance of an extended model with other related models. The dataset used for training and evaluation is handwritten KHAT [
37]. The model achieved 80.02% validation accuracy. In [
13], the authors presented a hybrid deep learning model for printed Urdu text detection from scanned documents. The model used a CNN for feature extraction and bi-LSTM for sequence labeling. The model used the CTC loss function. The APTI and URDU datasets [
38] were used for evaluation. The model achieved 98.80% accuracy. To illustrate,
Figure 3 is a flowchart to separate words from Arabic sentence images.
Zhai [
39] proposed a bidirectional RNN-based segmentation free model with the layer of CTC for the recognition of Chinese text, which was obtained from news channels. For model training, they composed almost two million news headings from 13 TV channels. There is a lot of work on automatic text recognition from natural scenes and videos for English scripts, but for Arabic scripts, the research community is newly active. For English scripts, Bissacco et al. [
25] firstly detect the characters individually, then the DCNN model recognizes these detected characters. The limitation of these models is that the accuracy of cropping and detecting the character from the word is not high. This will be more difficult for the Arabic script due to its complications.
For English script text recognition, Jaderberg et al. [
40] used a DCNN for detection but the recognition region mechanism was used. However, this model is not suitable for Arabic script due to its complications. The number of unique words in English script is less than in Arabic script. Almazán et al. [
41] had two goals to fulfill: the first is word spotting and the second is word recognition on the images. The authors proposed an approach where text strings and word images both emerged to a common vector space. In semantic segmentation, RNN-based models are used to handle the long-distance pixel dependencies [
42]. For example, in [
43], Gatta et al. unrolled the CNN to obtain the semantic connections at different time steps. Authors [
44,
45,
46,
47,
48] used LSTM with four blocks to scan all directions of an image; this helped to learn long range spatial dependencies.
There has been a lot of work performed in the field of text extraction from videos. Karray et al. [
49] extracted text using classification methods. Hua et al. [
50] used edge detection, which is a type of text extraction. Previously, the majority of work on Arabic videos was divided into two categories: dealing with relevant feature extraction and classifying features to target labels. To recognize Chinese handwritten text, authors [
51,
52,
53,
54,
55] used a bidirectional LSTM network. The attention-based model inspired our proposed model. Papers [
56,
57,
58,
59,
60,
61] proposed different attention-based speech recognition model. Authors [
62,
63,
64,
65,
66,
67,
68,
69,
70,
71] proposed various attention-based models for object detection in images as well. Fasha et al. [
72] proposed a hybrid deep learning model, and used five CNN layers and two LSTM layers for Arabic text recognition. They worked on 18 fonts of the Arabic language [
73]. Graves et al. [
74] created Neural Turing machines using an attention-based model with convolutions (NTMs). The NTM computes the attention weights based on content. Authors [
75,
76,
77] present a supervised convolutional neural network (CNN) model that uses batch normalization and dropout regularization settings to extract optimal features from context. When compared to traditional deep learning models, this tries to prevent overfitting and improve generalization performance. In [
78,
79,
80], for feature extraction, the first model employs deep convolutional neural networks (CNNs), whereas word categorization is handled by a fully connected multilayer perceptron neural network (MLP). SimpleHTR, the second model, extracts information from images using CNN and recurrent neural network (RNN) layers to develop the suggested deep CNN (DCNN) architecture. To compare the outcomes, Daniyal et al. presented the Bluechet and Puchserver models. Arafat et al. [
81] introduce Hijja, a novel dataset of Arabic letters produced only by youngsters aged 7–12. A total of 591 individuals contributed 47,434 characters to our dataset. They propose a convolutional neural network-based automatic handwriting recognition model (CNN). Hijja and the Arabic Handwritten Character Dataset (AHCD) are used to train our model. Ali et al. present a method for recognizing cursive caption text that uses a combination of convolutional and recurrent neural networks that are trained in an end-to-end framework. The background is segmented using text lines retrieved from video frames, which are then fed into a convolutional neural network for feature extraction. To learn sequence-to-sequence mapping, the extracted feature sequences are fed to different forms of bidirectional recurrent neural networks along with the ground truth transcription. Finally, the final transcription is created using a connectionist temporal classification layer. Alrehali et al. present a method for identifying text in photographs of the Hadeethe, Tafseer, and Akhidah manuscripts and converting it into a legible text that can be copied and saved for use in future studies. Their algorithm’s major steps are as follows: (1) picture enhancement (preprocessing); (2) segmenting the manuscript image into lines and characters; (3) creating an Arabic character dataset; (4) text recognition (classification). On three produced datasets, they employ a CNN in the classification stage. To solve the Arabic Named Entity Recognition (NER) task, El Bazi et al. developed a deep learning technique. They built a bidirectional LSTM and Conditional Random Fields (CRFs) neural network architecture and tested various commonly used hyperparameters to see how they affected the overall performance of our system. The model takes two types of word information as input: pre-trained word embeddings and character-based representations, and it does not require any task-specific expertise or feature engineering. Arafat et al. proposed a method for detecting, predicting, and recognizing Urdu ligatures in outdoor photographs. For identification and localization purposes, the unique FasterRCNN algorithm was utilized in conjunction with well-known CNNs such as Squeezenet, Googlenet, Resnet18, and Resnet50 in the first phase for images with a resolution of 320 240 pixels. A unique Regression Residual Neural Network (RRNN) is trained and evaluated on datasets including randomly oriented ligatures to predict ligature orientation. A Two Stream Deep Neural Network was used to recognize ligatures (TSDNN).
According to a review of related work, less work exists for natural scene Arabic text recognition, particularly at the word level. Our performed work represents specific steps in this research area and provides tools that can be used to expand and improve the results obtained.
3. Proposed CNN-RNN Attention Model
The architecture is divided into three sections: the first deals with convolutional layers, the second with recurrent layers of a bidirectional LSTM, and the third with recurrent layers of a bidirectional LSTM and an attention mechanism. The first part extracts feature representations from the input image. These representations are fed into the second part. This recurrent layer will label the sequences one by one. The connectionist temporal classification layer (CTC) is used when dealing with aligned data; it is essentially a loss function deal where time is a variable [
48].
This paper uses two datasets, ACTIV and ALIF; they are firstly preprocessed, then fed to the convolutional layer. The size of all images is fixed before giving input to the first part of the architecture. Here, the height of each image is fixed. Convolutional layers will split the feature maps column-wise and create features sequence by sequence (
f1,
f2, ...,
ft). Due to CNN feature extraction, each feature contains important information. As CNNs cannot handle the long dependency, these features are directed as input to the second part of the architecture. This section makes predictions based on the features generated by the first section. In the second part, bidirectional Long Short-Term Memory (LSTM) networks are used. These networks deal with inputs and outputs in a recurrent fashion. The second part of the network will be unrolled based on a time variable, and the time variable will be determined solely by the input data. The disappearing gradient problem occurs in simple recurrent neural networks, but LSTM networks can learn meaningful information. They use different gates and memory cells to resolve the problem of vanishing gradient [
49]. A bidirectional LSMT network is used in text recognition, as the learning pattern of text from right to left and left to right is very important [
23]. The network can be made deeper by stacking these LSTM layers [
50].
Now, ‘
v’ has the information of both sides. The third part of the architecture has an LSTM network with attention, produces the results based on the information in ‘
vi’. The attention mechanism gives scores to each feature sequence; for this, it gives probability to the recognized text.
cont is the context vector at time
t generated from feature sequences. Probability with one LSTM layer is:
Here,
n is a non-linear function that gives the probability output of
yt. Here,
hst is the hidden state of LSTM, which can be calculated as:
The context vector can be calculated as:
Each feature sequence has a weight
Wti and can be calculated as:
esi scores represent how well input around
i related to the output at
t. It can be calculated in the equation below. However, the proposed method is explained in
Figure 4.
The proposed methodology uses an attention mechanism over a bidirectional LSTM; for this, it calculates the attention of each input sequence and then multiplies this attention to the respective input vector. The weighted sum of these attentions with their respective vectors is stored in the context vector. Attentions are the weights of input sequences.
3.1. Dataset and Preprocessing
The model was trained on two datasets, ACTIV and ALIF. There are 21,520 line images in the ACTIV dataset, and these images are from different news channels. The ALIF dataset is smaller than ACTIV, and has 6532 text lines images, and these are also from Arabic news channels. These images are annotated with XML files. Another dataset was curated by downloading freely available images containing Arabic script from Google Images and Facebook; this dataset contains cropped images of market banners, shop banners, etc. This Arabic dataset of 2477 images was manually annotated by experts.
Table 1 and
Table 2 represent the statistical details of the ACTIV dataset and ALIF dataset. Details of the Google curated dataset is given in
Table 3. Segmented word images can be seen in
Figure 5.
3.2. Segmented Dataset
Here, we had two datasets, ACTIV and ALIF, and an image may have had more than one word. We worked on word-level text recognition; we needed one word in an image. For this, we performed segmentation to obtain one word in an image. The working of text recognition process is explained in
Figure 6. A binarization algorithm was applied, which extracted the text from the noisy and shadow affected images.
4. Commonly Used Deep Learning Techniques
Deep learning is a technique of machine learning, which uses multiple layers of neural networks for classifying patterns [
51]. Deep learning models use neural networks referred to as deep neural networks (DNNs). Simple neural networks only use two to three hidden layers, but in DNNs, hidden layers can be more than a hundred (deep mean numbers of hidden layers in network). A simple neural network takes the input at the input layer, and processes this input through hidden layers. Each neuron in the hidden layer is fully connected to all previous layer neurons. In a layer, neurons do not share their information; they operate independently from other neurons [
52]. The last layer of the network will be fully connected. These fully connected neural networks cannot scale to larger images. Convolutional neural networks (CNNs) solve this problem; each neuron in the hidden layer is not fully connected to all previous layer neurons [
53]. Only a few neurons of the previous layer are connected to each neuron. The shortcoming of CNNs is that they are unable to deal with variable size input and output; they only take fixed size input and generate fixed size output [
54]. CNNs are not suitable for some applications such as speech and video processing. The algorithm for data preprocessing in our proposed method is described in Algorithm 1.
Algorithm 1 Proposed algorithm used for Data Preprocessing |
1: | procedure PRE-PROCESSING(Dataset) |
2: | for Each image in dataset do |
3: | ImageBackground = PreBinarization(Image) |
4: | ImageGradient = CalculateGradient(ImageBackground) |
5: | ImageSmoothen = SmoothenGradient(ImageGradient) |
6: | ThresholdedImage = ThresholdedImage(ImageSmoothen) |
7: | DilatedImage = ApplyDilation(ThresholdedImage) |
8: | OutliersRemovedImages = ApplyConnectedComponentAnalysis |
9: | LocalizedTextImages = ApplyTextLocalization |
10: | return LocalizedTextImages |
Recurrent neural networks (RNNs) are designed to connect the previous information with the present task. They tackle the issue of fixed size input and output [
55]. Simple RNNs cannot handle long dependencies, and are only useful when the gap between past information and the needed place is small. Simple RNNs face the problem of a vanishing gradient and exploding. Long short-term memory (LSTM) networks are a special kind of RNN that deal with this problem by having cell memory using linear and logistic units [
56]. An LSTM network consists of three gates: write gate, keep the gate and read gate. Write gate writes the information on the cell, keep gate will keep the information on the cell and read gate allows one to read the information from the gate.
In deep learning, attention mechanisms attracted a lot of interest. Initially, an attention mechanism was applied on sequence for sequence models in neural machine translation, but now they are used in many problems, such as image captioning and others [
57]. In attention-based LSTM, the system gives a score or attention weight to each output of LSTM. Based on these attention weights, the system will predict the output with high accuracy.
Overview of RNN
In 1960, the first RNNs were introduced. Due to their contextual modeling ability, they became more well known. RNNs are especially popular to tackle the time series problem with different patterns. Basically, with recurrent layers, an RNN is a simple multi-layer perceptron (MLP). Suppose an RNN receives an input sequence with B input units, H hidden units, and O output units. The values of activation’s
fh and hidden units
eh of each layer can be calculated with these equations:
Here,
am(
t) is the value of input
m in time
t,
en(
t) and
fn(
t) are the input and the activation of the network to unit
n, respectively, at time
t. From unit
m to unit
n, the connection is represented as
vmn. Θ
h represents the activation function of hidden unit
h. For the first time, RNNs were used for speech recognition by Robison [
20]. For handwritten recognition, RNNs were used by Lee and Kim [
40]. An input image as in
Figure 7 and
Figure 8 was preprocessed to obtain the background and locate the text regions. An image background can be estimated by the pre binarization process. Firstly, a color image was converted into a gray scale image and then further converted into black and white images. Vertical profile projections were used to obtain the segmentation of words.
Figure 9,
Figure 10 and
Figure 11 are the visual representations of text.
A bidirectional RNN was presented by Schuter and Paliwal [
18] in 1997. There are two recurrent layers in sequence processing, one for the forward direction and the second for the backward direction. These recurrent layers are connected to the same input and output layers. To deal with multi-dimensions such as 2D images or 3D videos, Graves [
41] presents a multi-dimension recurrent neural network (MD-RNN). For multi-dimension RNN, consider a point of input series designed to have higher-layer units such as memory cells. These memory cells are specially used to maintain information for a longer time span.
In the 1D case, the input sequence is written as
e(
t), but here,
eq is written as input for multi-dimension cases. The position of the upper index is written as
qm,
m ∈ 1,2,3, …
k. In dimension
g, the step back positions can be presented as
Q′
g =
q1 …
qd−1, …
qk. From
m to
n with dimension
g,
Vmnd is the recurrent connection. The equation with the forward direction of
g dimension MD-RNNs is represented as:
The below equation calculates the backward pass as follows:
where
εqn =
∂E/∂fnq is the output error of
n unit at time
q and
δnq =
q,
∂E/∂en represents the error after accumulation.
One-dimensional recurrence is used by standard RNNs such as one axis of an image, but the scanning of an image by MDRNNs is carried out along both axes. This will enable more contextual variations in four directions (top, bottom, left, and right).
Figure 12 illustrates the scanning direction of 2D-RNN. In the figure,
aq is the input sequence; the two points (
m − 1,
n) and (
m,
n − 1) always reached the end before that point (
m,
n). There were very few practical applications of RNNs due to the problem of vanishing gradient and long-term dependencies. In 1997, Hochreiter designed a new network, LSTM, to tackle long-term dependencies. LSTMs are a special type of RNN that are designed to have hidden-layer units such as memory cells. These memory cells are specially used to maintain information for a longer time span.
5. Description of Implementation
In the proposed method, the convolutional layer follows the VGG architecture [
6]. In VGG architecture, square pooling windows are used, but this architecture uses rectangular max-pooling windows. This convolutional layer generates wider and longer feature maps. These feature maps will be the input of RNN layers. An illustration of LSTM architecture is given in
Figure 13.
Figure 14 compares proposed architecture with different architectures on character recognition rate using Alif dataset. However, for line recognition rate and word recognition rate,
Figure 15 and
Figure 16 describe analytically. With respect to Activ dataset, the comparison of proposed architecture with different architecture on WRR, LRR and CRR are illustrated in
Figure 17,
Figure 18 and
Figure 19 respectively.
All the scene text and video text images are resized to a fixed height and width. The input resized shape for video text is (90,300) and for a scene, the text is (110,320). We have tested this many times, and observe that the performance is not much affected by resizing the input image into the fixed width and height. As is well known, the Arabic language starts from right to left, for this, all the input images are horizontally flipped; then, these will be the input to the convolutional layer. Here, the architecture composes three parts: the convolutional part, RNN part, and attention over RNN; because of them, it will face a training problem. The batch normalization technique is used to handle this problem [
27]. Stochastic gradient descent is used to train this architecture. Backpropagation algorithms are used to calculate gradients. In [
28], the backpropagation through time algorithm is used for differentials of error for RNN layers. The adadelta optimization technique is used to set the learning rate parameter [
29].
6. Experimental Results
The text on natural scene images and video frames can be recognized by the above-discussed architecture; this section will elaborate on the efficient recognition of Arabic text by this architecture. Here, as we know, all the input images are cropped from the original scene image or a video frame. Each image contains only one word of Arabic text. As per our knowledge, most of the work on Arabic text is character-level classification or recognition. This paper introduces new Arabic scene text baseline results. There are two datasets available for Arabic video text, Alif and Activ [
23,
25]; the results are reported on these two datasets.
Table 4 presents the results of recognition on the Alif dataset for video text.
Table 5 shows the proposed architectural results of the AcTiV dataset. A famous Optical character recognition Tesseract (OCR-T) [
47] is applied over our one-word dataset for comparison, because there is no or a lower amount of work on a word-level dataset. Three metrics are used to evaluate the performance: Character Recognition Rate (
ChRR), Word Recognition Rate (
WoRR), and Line Recognition Rate (
LiRR).
The methods used for comparison on video text recognition have different convolutional architecture for the extraction of features from end-to-end trainable architecture. For the video text recognition task, we obtained better character-level and line-level accuracy and set the new state of the art. A Deep Belief Network (DBN) is a feed-forward neural network with one or more layers of hidden units, which are commonly referred to as feature detectors [
59]. The fact that generative weights can be learned layer-by-layer is a unique feature of DBN. A pair of layers’ latent variables can be learned at the same time. MLP_AE_LSTM is an Auto Encoder Long Short-Term Memory based on Multi-Layer Perceptron [
59]. This is a feed-forward network that learns to provide the same output as the input. In this network, they used a three-layered neural network with fully connected hidden units. The ability to collect multi-model features is improved by using more than one hidden layer with nonlinear units in the auto-encoder. The backpropagation algorithm is used to train the network. ABBYY [
59] is a famous Arabic OCR engine Fine reader 12, used for Arabic character recognition; this OCR obtains a 82.4% to 83.26% character recognition rate, but our proposed model obtains a 98.73% character recognition rate. MDLSTM [
60] is a Multi-Dimension Long Short-Term Memory Network; here, MDLSTM was used for Arabic text recognition, and the text variations on both dimensions of the input image are modeled using this architecture.