Article

Dual-Stream Spatiotemporal Networks with Feature Sharing for Monitoring Animals in the Home Cage

1 School of Computer Science, College of Science, University of Lincoln, Brayford Pool, Lincoln LN6 7TS, UK
2 Mary Lyon Centre at MRC Harwell, Oxfordshire OX11 0RD, UK
3 Independent Researcher, Lincoln LN6 7TS, UK
* Author to whom correspondence should be addressed.
Submission received: 9 October 2023 / Revised: 15 November 2023 / Accepted: 24 November 2023 / Published: 30 November 2023
(This article belongs to the Topic Machine Learning and Biomedical Sensors)

Abstract

This paper presents a spatiotemporal deep learning approach for mouse behavioral classification in the home-cage. Using a series of dual-stream architectures with assorted modifications for optimal performance, we introduce a novel feature sharing approach that jointly processes the streams at regular intervals throughout the network. The dataset in focus is an annotated, publicly available dataset of a singly-housed mouse. We achieved even better classification accuracy by ensembling the two best-performing models, an Inception-based network and an attention-based network, both of which utilize feature sharing. Furthermore, we demonstrate through ablation studies that, for all models, the feature sharing architectures consistently outperform their conventional dual-stream counterparts with standalone streams. In particular, the Inception-based architectures showed the largest feature sharing gains, with accuracy increases of between 6.59% and 15.19%. The best-performing models were also further evaluated on other mouse behavioral datasets.

1. Introduction

Over many decades, the ethical implications of using animals in research have undergone considerable discussion and scrutiny [1]. A major landmark in the regulation of research involving animals was the designation of the three ‘R’s (Replacement, Refinement, and Reduction), spearheaded by the National Centre for the 3Rs (NC3Rs) in the United Kingdom. As of 2022, around 2.76 million living animals were used in research procedures in the UK, with 96% of these comprising rodents (rats and mice), birds, and fish [2]. Owing to their genetic, physiological, and anatomical similarities with humans [3], as well as their short lifecycles [4,5], mice are among the most utilized species in biomedical research.
In support of the 3Rs mission, technology has increasingly been used to better understand the different aspects of research involving animals. Behavioral phenotyping is particularly important as it may highlight welfare concerns that arise over the course of an experimental design. However, manual observation of these behaviors is expensive, laborious, and time-consuming, and behavioral studies relying solely on expert observation are not easily reproducible [6,7]. The development of home-cage monitoring (HCM) systems was a major technological breakthrough that has helped to solve many of these issues [8]. HCM systems facilitate non-intrusive, longitudinal observation of mice and may provide a range of outputs such as behavioral annotation, ethogramming, depth sensing and tracking, circadian activity summaries, and pose estimation. Such HCM systems include the Tecniplast Digital Ventilated Cage (DVC) [9], the System for Continuous Observation of Rodents in Home-cage Environment (SCORHE) [10], and IntelliCage [11], to name a few. Cameras are extensively utilized across diverse industries for a number of tasks [12], including autonomous driving, pose estimation [13], and security and surveillance. As such, these home-cage setups may be equipped with either single-view [14] or multi-view cameras [10], depending on design considerations. Nevertheless, there are few commercially available solutions to the problem of detecting behaviors from video footage alone, and many of those that do exist are strongly coupled to commercial hardware rather than operating on video footage in general. Owing to their recent successes in human action recognition and many other domains, deep learning approaches offer a potential solution to the problem of behavioral phenotyping in the home-cage.
In this paper, dual-stream deep learning architectures are proposed for the behavioral classification of mice in the home-cage. The models in question were developed for fully supervised learning, whereby spatiotemporal (ST) blocks of video data are mapped to one of several behavior categories. The dataset utilized is publicly available and contains videos of a singly-housed mouse [7]. Our models are initially trained on the entire main dataset and then tested using the more unambiguous clipped database, an approach that differs from that of the original paper. Nevertheless, comparisons were also made between our proposed methodology and their results [7] using the same cross-validation technique adopted in the original publication. Furthermore, a select few of our models were also evaluated on a more complex, multi-view home-cage dataset [10]. One of the novel aspects of these models is the use of shared layers between the streams of the networks. Here, instead of fusing the individual streams only at the end (termed “late fusion” in [15]), we propose to combine features at regular intervals throughout the architecture. We hypothesize that accurate representations are better enforced when both streams are privy to information from each other (Figure 1). Some instances of shared features have been seen in U-Nets [16] and their many derivative networks, and in some other specialized multi-stream architectures [17,18], albeit in a different manner to that proposed herein for multi-stream networks.
To the best of our knowledge, our work is the first to propose this “horizontal” form of connection in multi-stream deep learning (DL) architectures. The kind of connection present in U-Nets has been referred to as the long skip connection [19] and is integral to the model’s ability to prevent diluted features while transferring useful representations to its decoding stage [20]. The other well-known kind is the short skip connection [19], first introduced in ResNet [21] to solve the problem of vanishing gradients as architectures scaled to increasing depths [22,23]. Some research has even combined both kinds of connection in a single DL design [24]. What the long and short connections have in common is the use of simple operations such as addition or concatenation. In contrast, the forms of feature sharing proposed here range from concatenation to the use of new, joint-processing blocks that are optimized together with the entire architecture. Thus, our study is novel in its presentation of feature sharing between the streams of dual-stream architectures. This investigation forms the highlight of our paper and is evaluated against the conventional, standalone-stream forms of all the architectures developed.

2. Related Work

2.1. Behavioural Classification

The seminal work in mouse behavior classification [7] was developed for individually-housed animals and provides the benchmark dataset on which several other methods (including ours) are evaluated. In that work, a series of hand-crafted shape and motion features were extracted, and a support vector machine (SVM) was used alongside a hidden Markov model (HMM) to classify video clips into eight distinct behaviors. Model training was repeated n = 12 times using a leave-one-out methodology, achieving a classification accuracy of 77.3% across all eight classes, as opposed to the 71.6% accuracy achieved by human annotators. However, this method operated on frame-wise/2D inputs and does not take in the entire spatiotemporal context as ours does.
Since then, deep learning has emerged as the state-of-the-art for the classification of video data in general. Though 2D (i.e., spatial-only) models thrive in many applications, the need for better contextual understanding has become increasingly apparent, and the application of 3D convolutions has enabled this. Better yet, the use of multiple input streams allows video or clip sequences to be encoded into different representations. It is often the case that one of the model streams operates on an image or image sequence (within time frames $t_0$ to $t_n$) while the other operates on optical flow data (computed for $t_1$ to $t_{n+1}$) [25,26]. Other multi-stream variations operate on two image streams of different points of view, resolutions [27], or zoom [15], depending on the goal of classification.
The work by [25] presented a new network called the inflated 3D (I3D) Inception model. The I3D modules differ from the classic Inception module [28] through the addition of 3D convolutions and ‘inflated’ filters, which provide the wider receptive field needed to better learn spatiotemporal data. This dual-stream I3D architecture (pre-trained on ImageNet) was utilized by [29] to classify home-cage mouse behaviors. Its evaluation was carried out on the same dataset [7] using a leave-one-out method, averaging test results across the twelve videos in the main dataset. They achieved an average accuracy of over 90% when testing with various stream weightings.
In another paper, ref. [18] demonstrated the effect of sharing features at the higher levels of multi-stream networks, an operation the authors termed feature fusion. The architecture comprised a frame-wise spatial transformer-based stream and a clip-wise temporal stream. The stream features were combined at two successive final pooling layers, ultimately achieving accuracies of 95.3% on the UCF101 [30] and 72.9% on the HMDB51 [31] datasets. A key difference between that approach and the feature sharing proposed in this work is that ours is implemented at multiple points throughout the dual-stream architecture (as explained in Section 3.3).

2.2. Spatiotemporal Learning

Although applied to different research problems, ref. [32] demonstrated that spatiotemporal cuboids of data form better descriptors than spatial-only data in both human activity recognition and mouse behavioral classification tasks. These spatiotemporal features are better suited to video classification because they carry more information about event context, especially for events that are easily confused when classified at the level of single instances. In rodent phenotyping, a lot of emphasis is placed on themes such as behavioral sequences, periodicity, repetitiveness, or patterns of certain behaviors [33]. Depending on the nature of the biological research, these factors become increasingly important to identify; otherwise, subtle details are missed. A good example is self-grooming behavior, which can be observed as mice transition from their idle periods to high activity [33,34]. However, when in excess, this behavior is also commonly associated with mouse models of both autism spectrum disorders and compulsive disorders [35]. This further attests to the importance of capturing temporal content in machine learning models.
One of the key components in deep spatiotemporal learning is I3D [25], as mentioned previously. I3D was built by changing the 2D convolutional layers in the Inception v1 [28] model to 3D convolutions while still leveraging the efficient structure of the Inception blocks. Unlike other 3D convolutional methods, I3D is deep yet lightweight, and brings the advantages of Inception for static image classification to the spatiotemporal domain. Owing to these advantages, the I3D concept has been applied in some of the architectures proposed in this paper.

2.2.1. Attention Mechanisms

An attention module is characterized by the following elements: a query Q, a key K, and a value V. It maps these to an output, scaling the dot products by the dimension of the keys $d_k$. Multi-head attention (MHA) combines multiple attention instances with trainable projection matrices W and is often utilized to ensure efficient learning over vector sequences [36]. The attention function and multi-head attention are given by:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

$$\mathrm{MHA}(Q,K,V)=\mathrm{Concatenate}(\mathrm{head}_1,\mathrm{head}_2,\ldots,\mathrm{head}_i)\,W^{O}$$

where $\mathrm{head}_i=\mathrm{Attention}(QW_i^{Q},KW_i^{K},VW_i^{V})$.
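To make the two expressions concrete, the following is a minimal NumPy sketch of scaled dot-product attention and multi-head attention; the array shapes and randomly initialized projection matrices are illustrative assumptions and not the configuration used in our models.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, heads, d_model, rng):
    # Each head uses its own (here random, illustrative) projections W_i^Q, W_i^K, W_i^V;
    # the concatenated heads are mixed by an output projection W^O.
    d_head = d_model // heads
    outputs = []
    for _ in range(heads):
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) * 0.02 for _ in range(3))
        outputs.append(attention(X @ Wq, X @ Wk, X @ Wv))
    Wo = rng.standard_normal((heads * d_head, d_model)) * 0.02
    return np.concatenate(outputs, axis=-1) @ Wo

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 64))  # 16 tokens with model width 64
print(multi_head_attention(tokens, heads=4, d_model=64, rng=rng).shape)  # (16, 64)
```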
Transformers are a derivative architecture built on MHA, initially applied to natural language understanding [36], but have also been found to be effective in computer vision. The Vision Transformer (ViT) is one such model that repurposed transformers for image tasks [37]. Further variations of ViTs designed for spatiotemporal learning of videos have achieved state-of-the-art (SOTA) results in activity recognition [38,39]. The work by [38] also showed that multi-head attention captures vital temporal dependencies by focusing on displaced or moving objects within a sequence. Furthermore, its application has proven effective at capturing global features in a multi-stream architecture for video classification [40].

2.2.2. Long-Short Term Memory (LSTM)

LSTMs are architectures that learn to store information using memory cells and gates. The memory cell was designed to achieve constant error flow, using multiplicative input and output gates to protect stored data from perturbation [41]. Subsequent improvements introduced better-defined gate operations, which improved the memory retention of the architecture.
Building upon this, Bidirectional LSTMs (BiLSTM) allow memory to be computed in both directions and have been proven to achieve good results in vision tasks. A BiLSTM is composed of two LSTMs that store relevant dependencies from both the forward (i.e., past to present) and backward (i.e., future to present) state directions [42]. In conjunction with other ML architectures, bidirectional LSTMs have been found to outperform unidirectional LSTMs in several natural language understanding [43,44] and image classification [45] tasks. In the paper by [42], a BiLSTM was used with 1-dimensional convolutions to classify the circadian rhythm of wild-type mice into day or night states. It was trained after dimensionality reduction of five-minute clips, which were further subdivided into three-second frame windows, and was found to outperform the other ML algorithms explored, achieving an area-under-the-curve (AUC) of 0.97. In short, BiLSTMs are capable of efficiently detecting and learning the patterns that define the mapped behaviors.

3. Materials and Methods

3.1. Data

The MIT mice dataset [7] is subdivided into a main dataset and a clipped database. In this work, we utilize all twelve videos from the main dataset for training and validation, while the clipped database, composed of unambiguous behaviors, is used to test the models. Specifically, the video recordings from the 20080423191834F folder were used for validation, while the rest of the main data were used for training. Unlike the leave-one-out methodology of the original authors, we surmise that this approach allows us to better examine the generalization performance of our models. Nonetheless, we also made comparisons to the original cross-validation results. The optical flow data were generated from the raw videos using the dense optical flow method [46]. Both training and test frames were resized to 128 × 128, and further reduced to 128 × 96 by uniformly cropping redundant parts of each frame along the vertical axis. The data were also temporally downsampled using five-frame intervals, and the temporal length used for each clip was T = 8 frames. Thus, each spatiotemporal cuboid represents approximately a 1.33-second window in the original videos. Toward the end of the videos/clips, any frames that could not fit these specifications were discarded. The final input data are of the form N × T × W × H × C, representing the number of clips, temporal length, spatial width, spatial height, and number of channels, respectively. The N values for the final training, validation, and testing sets are 23,444, 4195, and 5171, respectively.
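For illustration, the sketch below shows one way such RGB and optical-flow cuboids could be assembled with OpenCV. The helper name, the crop region, and the loop structure are our own assumptions; only the frame size, five-frame interval, cuboid length, and the use of dense (Farnebäck) optical flow come from the description above.

```python
import cv2
import numpy as np

FRAME_SKIP, T = 5, 8  # five-frame temporal downsampling, 8-frame cuboids

def video_to_cuboids(path):
    """Return (rgb, flow) arrays of cuboids shaped (N, T, height, width, channels)."""
    cap, frames, idx = cv2.VideoCapture(path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % FRAME_SKIP == 0:                  # temporal downsampling
            frame = cv2.resize(frame, (128, 128))  # resize to 128 x 128
            frame = frame[16:112, :]               # crop to 96 rows (illustrative crop region)
            frames.append(frame)
        idx += 1
    cap.release()

    rgb, flow = [], []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for f in frames[1:]:
        gray = cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)
        # Dense (Farnebäck) optical flow between consecutive retained frames.
        flow.append(cv2.calcOpticalFlowFarneback(prev, gray, None,
                                                 0.5, 3, 15, 3, 5, 1.2, 0))
        rgb.append(f)
        prev = gray

    n = len(rgb) // T  # discard trailing frames that do not fill a full cuboid
    rgb = np.array(rgb[:n * T]).reshape(n, T, *rgb[0].shape)
    flow = np.array(flow[:n * T]).reshape(n, T, *flow[0].shape)
    return rgb, flow
```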

3.2. Pre-Processing

Class imbalance was alleviated using class weights [47], which weight the loss so that each class contributes approximately equally regardless of its sample count. Hence, classes with low sample sizes, such as drinking, were assigned higher weights, and the reverse applied to labels with large sample sizes such as micromovement. Additionally, pre-processing was carried out to normalise visual differences present between videos acquired at different times of day. In this particular dataset, only two videos were recorded at night-time (using infrared cameras), while all the rest were recorded during the day. In some deep learning applications, conversion to grayscale has proven effective, but here it was found to degrade the performance of the models. As such, all day videos contained within the dataset were ‘nightified’ (i.e., changed into night-time). This was achieved by first calculating the average R, G, and B channel values from the night videos. These were then used to weight the [0–1] normalized data from the day videos, which were finally expanded back to the [0–255] range. The results give a close approximation of what the videos would look like if recorded at night, and thus lessen the bias in the models caused by the day-night imbalance (Figure 2). No further augmentations were performed on the dataset. More data samples, used in both RGB and flow streams, are available in Appendix D.
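A minimal sketch of these two pre-processing steps is given below, assuming NumPy arrays of frames; the helper names are ours, and the per-channel scaling simply follows the nightification recipe described in the text.

```python
import numpy as np

def class_weights(labels):
    # Inverse-frequency class weights so that rare classes (e.g., drinking)
    # contribute as much to the loss as frequent ones (e.g., micromovement).
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts.sum() / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

def nightify(day_frames, night_frames):
    # Average R, G, B channel values of the night videos ...
    night_mean = night_frames.astype(np.float32).mean(axis=(0, 1, 2))  # shape (3,)
    # ... used to weight the [0-1]-normalised day frames, then rescaled to [0-255].
    scaled = (day_frames.astype(np.float32) / 255.0) * night_mean
    return np.clip(scaled, 0, 255).astype(np.uint8)
```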

3.3. Architectures

All of the models presented are dual-stream and, in this application, operate on raw video and optical flow streams. The building blocks utilized in the networks are depicted in Figure 3. A key aspect of the models presented here is the feature sharing between the two streams of the network. Feature sharing involves the combination and/or joint processing of the stream outputs after operation by the primary modules. This combination is achieved either via addition or concatenation, followed by further processing of the joint streams, which are then projected back to the individual streams. These operations take place at regular intervals throughout the architecture. We hypothesize that this procedure reinforces learned features better than operating on the streams individually. The various implementations of these modules are discussed further under each architecture and in Table 1. An overview of each architecture is also depicted in Appendix E.
The blocks in Figure 3a,c represent the primary processing modules used in the RGB image and optical flow streams, while the blocks in Figure 3b,d are the joint processing modules. The blocks in Figure 3b,c depict 3D versions of modules originally found in the Inception v3 and Inception v1 architectures, respectively [28,48]. In particular, Figure 3b was adapted here to boost the performance of the architectures utilizing the module in Figure 3a via further processing at the junctions where the streams meet. The block in Figure 3d is a custom joint processing module utilized only in the baseline network.
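To make the feature sharing idea concrete, the following Keras sketch (the framework choice is our assumption, not a reproduction of the paper's implementation) shows one possible junction between two 3D-convolutional streams: the streams are combined by addition, processed jointly, and projected back onto each stream. Layer counts and filter sizes are arbitrary and do not correspond to the exact modules in Figure 3.

```python
import tensorflow as tf
from tensorflow.keras import layers

def primary_block(x, filters):
    # Primary per-stream processing: 3D convolution + batch normalisation.
    x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
    return layers.BatchNormalization()(x)

def feature_sharing_junction(rgb, flow, filters):
    # Combine the two streams (here by addition), process them jointly,
    # then project the shared representation back into each stream.
    shared = layers.Add()([rgb, flow])
    shared = layers.Conv3D(filters, 1, padding="same", activation="relu")(shared)
    return layers.Add()([rgb, shared]), layers.Add()([flow, shared])

rgb_in = tf.keras.Input(shape=(8, 96, 128, 3))   # T x H x W x C video cuboid
flow_in = tf.keras.Input(shape=(8, 96, 128, 2))  # dense optical flow (dx, dy)
rgb, flow = primary_block(rgb_in, 16), primary_block(flow_in, 16)
rgb, flow = feature_sharing_junction(rgb, flow, 16)           # first shared junction
rgb, flow = primary_block(rgb, 32), primary_block(flow, 32)
rgb, flow = feature_sharing_junction(rgb, flow, 32)           # second shared junction
merged = layers.GlobalAveragePooling3D()(layers.Add()([rgb, flow]))
out = layers.Dense(8, activation="softmax")(merged)           # eight behavior classes
model = tf.keras.Model([rgb_in, flow_in], out)
```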

3.3.1. Baseline Network (CNN)

This simple architecture consists of blocks with 3D convolutional layers, dropout (with a uniform rate of 20%), and batch normalization (see Figure 3a). The kernel sizes were made uniform within each block (i.e., kernels of a single size m rather than the mixed sizes depicted in Figure 3a). After operation by similar blocks, the outputs from both streams are summed and further processed by dense and dropout layers (Figure 3d) before splitting again into the individual streams.

3.3.2. CNN + Inception v3_D + Attention (CIv3D_MHA)

This architecture builds on the baseline, adding self-attention mechanisms to both streams after the last primary blocks. The kernel size for the 3D convolutions was made to increase and decrease repeatedly (as shown in Figure 3a) between stream blocks. In addition, the simple joint-processing block is replaced by the Inception v3 block D [48] (Figure 3b) throughout the architecture. The self-attention block used here is similar to that of vision transformers [37]; however, it uses batch normalization, and the patch tokens are replaced by the end features of the streams before summation and processing by the last InceptionD block.

3.3.3. CNN + Inception v3_D + BiLSTM (CIv3D_BiLSTM)

This architecture uses the same modifications made in CIv3D_MHA but removes the primary modules’ dropout layers. Bidirectional LSTMs are used in place of the traditional flattening that precedes the fully-connected (FC) layers. The input to these is the summed output of both streams’ final subsection, reshaped from four to two dimensions to allow loading into the LSTMs.
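A minimal Keras sketch of this replacement is shown below: the summed 4D output of the two streams (time, height, width, channels) is reshaped into a 2D sequence of per-frame feature vectors and summarized by a bidirectional LSTM instead of a plain flattening layer. The tensor shape and layer sizes are illustrative only.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Summed output of both streams' final blocks: (time, height, width, channels), illustrative shape.
summed = tf.keras.Input(shape=(4, 6, 8, 96))

# Collapse the spatial dimensions so each time step becomes one feature vector ...
seq = layers.Reshape((4, 6 * 8 * 96))(summed)
# ... and let a bidirectional LSTM summarise the short temporal sequence
# in place of a plain Flatten layer before the fully-connected head.
features = layers.Bidirectional(layers.LSTM(64))(seq)

out = layers.Dense(8, activation="softmax")(features)
head = tf.keras.Model(summed, out)
```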

3.3.4. Purely Inception-Based Networks

Two architectures were built entirely from the 3D Inception v1 block (see Figure 3c). This block was adapted for spatiotemporal operation from the dimensionality reduction module of the classic Inception v1 architecture [28] but omits the standalone 1 × 1 convolution branch of the original (Figure 3c). The first architecture works by reinforcing representation learning in a single stream rather than splitting the features apart. At the bottleneck between successive sub-regions of the network, feature learning is reinforced by repeatedly concatenating strided computations of the original optical flow sequence with the features previously extracted from the RGB stream. Hence, this network was termed the Singly Reinforced Stream (SRS) network. It also adds the design consideration of removing the first and last two frames of the optical flow stream (along with some surrounding dimensions) after the first block operation on both streams. This cropping is carried out only once, under the assumption that the temporal sequence is better represented by the center portions of the mid-four frames. This train of thought is similar to the fovea stream in [15] but takes it further by removing frames at the extremities.
The second architecture was developed to encourage cross-pollination between the streams: just as the optical flow stream reinforces representation learning in the image stream, the image stream is also used to reinforce learning in the optical flow stream, and they alternate in this manner. This is carried out by independently concatenating the past features from each stream’s block with the jointly-processed input fed into subsequent blocks. This operation, however, led to a considerable increase in computation (see parameter count in Table 2). This network was named the Cross Reinforced Streams (CRS) network.
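The sketch below illustrates the reinforcement idea behind the SRS network under our own simplifying assumptions: at each bottleneck, a strided 3D convolution of the raw optical-flow cuboid is concatenated with the running RGB features before the next Inception-style block. The filter counts, strides, and simplified block are not those of the published architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception3d_block(x, filters):
    # Stand-in for the 3D Inception v1 block of Figure 3c (simplified to two branches).
    a = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
    b = layers.Conv3D(filters, 1, padding="same", activation="relu")(x)
    return layers.MaxPool3D(pool_size=(1, 2, 2))(layers.Concatenate()([a, b]))

rgb_in = tf.keras.Input(shape=(8, 96, 128, 3))
flow_in = tf.keras.Input(shape=(8, 96, 128, 2))

x = inception3d_block(rgb_in, 8)
for level, filters in enumerate([16, 32], start=1):
    # Strided "computation" of the original flow cuboid, matched to the current feature size.
    flow_feats = layers.Conv3D(filters, 3, strides=(1, 2 ** level, 2 ** level),
                               padding="same", activation="relu")(flow_in)
    x = layers.Concatenate()([x, flow_feats])  # reinforce the single RGB stream
    x = inception3d_block(x, filters)

out = layers.Dense(8, activation="softmax")(layers.GlobalAveragePooling3D()(x))
srs_sketch = tf.keras.Model([rgb_in, flow_in], out)
```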

3.3.5. Other Networks

To investigate the effectiveness of the shared layers between streams, experiments were also conducted on versions of the above models without the feature sharing modules. The design considerations used in each architecture were left in place, while the blocks of joint processing (i.e., feature sharing) were replicated separately within each stream, all before the common fully-connected layers.

3.4. Model Training

All models were trained using the categorical cross-entropy loss and optimized using stochastic gradient descent (SGD). The number of epochs and batch size were set to 85 and 8, respectively. Training was set to reduce the learning rate by a factor of 0.5 whenever the validation loss plateaued or began to rise, and to stop early if no notable learning was achieved. This prevents overfitting and allows for the early restoration of the best checkpoints. Each model was trained and evaluated n = 4 times with different random seeds, and the results were averaged. By using these averages, we present an accurate representation of each model’s predictive capability. The system used for all the experiments was equipped with 64 GB of RAM and two Nvidia GeForce RTX 3080 GPUs.
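A sketch of this training setup in Keras is given below; the tiny stand-in model, random data, monitored quantity, and patience values are our own illustrative choices, with only the optimizer, loss, epoch count, batch size, and learning-rate factor taken from the text.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Tiny stand-in model and random data so the loop below runs end to end;
# in practice this would be one of the dual-stream networks from Section 3.3.
inp = tf.keras.Input(shape=(8, 96, 128, 3))
x = layers.GlobalAveragePooling3D()(layers.Conv3D(4, 3, padding="same")(inp))
model = tf.keras.Model(inp, layers.Dense(8, activation="softmax")(x))

x_train = np.random.rand(16, 8, 96, 128, 3).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 8, 16), 8)

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])

callbacks = [
    # Halve the learning rate whenever the validation loss plateaus.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),
    # Stop early and restore the best checkpoint if no further improvement occurs.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=8,
                                     restore_best_weights=True),
]

model.fit(x_train, y_train, validation_split=0.25,
          epochs=85, batch_size=8, callbacks=callbacks)
```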

3.5. Metrics

The most popular metric used for classification problems is accuracy. However, we evaluate all the models presented here on several metrics: accuracy, average precision (AP), precision, recall, F1 score, and area under the ROC curve (AUC), where ROC is the receiver operating characteristic. More specifically, the AP metric (which was utilized the most) is the micro-averaged precision, while the precision is the macro-averaged per-class computation. Altogether, these metrics give a holistic view of each model’s performance.
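These metrics can be computed from one-hot labels and predicted class probabilities as sketched below with scikit-learn; the averaging choices mirror the description above, while the toy arrays are placeholders for real model outputs.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

# y_true: one-hot ground truth; y_prob: per-class probabilities from a model (toy values here).
y_true = np.eye(8)[np.random.randint(0, 8, 100)]
y_prob = np.random.dirichlet(np.ones(8), size=100)
y_pred = y_prob.argmax(axis=1)
labels = y_true.argmax(axis=1)

print("Accuracy          :", accuracy_score(labels, y_pred))
print("AP (micro)        :", average_precision_score(y_true, y_prob, average="micro"))
print("Precision (macro) :", precision_score(labels, y_pred, average="macro", zero_division=0))
print("Recall (macro)    :", recall_score(labels, y_pred, average="macro", zero_division=0))
print("F1 score (macro)  :", f1_score(labels, y_pred, average="macro", zero_division=0))
print("AUC (micro)       :", roc_auc_score(y_true, y_prob, average="micro"))
```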

4. Results

4.1. Model Comparison

The results of all seed models for each architecture were averaged to obtain the final results. The full performance for each seed can be found in Appendix A. The best result was obtained for the singly-reinforced stream (SRS) architecture, which achieved an average accuracy of 81.96 ± 2.71%. The averaged performances of all the feature sharing dual-stream models are tabulated in Table 3.

4.2. Ensembles

The ensembles were created by averaging the results of the models at inference. Due to the gap in performance, most ensembles between models did not show any improvements over the SRS model. The final choice of models to ensemble was made by evaluating the validation results for all seed training in each model. For intra-model ensembles (that is, between the top 2 seeds of the same model), the best results were found for the SRS model and achieved 82.37%. The best inter-model ensemble was observed between the SRS and CIv3D_MHA models and achieved 86.28%. Further ensembles between models are shown in Table 4. The confusion matrices and ROC plots for the ensembles can be found in Appendix B.
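Ensembling at inference reduces to averaging the per-class probability outputs of the member models, as in the short sketch below; the model outputs are assumed to be softmax probabilities of shape N × 8.

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average softmax outputs of several models and return the predicted classes."""
    mean_probs = np.mean(np.stack(prob_list, axis=0), axis=0)  # (N, 8)
    return mean_probs.argmax(axis=1)

# e.g., predictions = ensemble_predict([srs_probs, civ3d_mha_probs])  # hypothetical arrays
```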

4.3. Ablation Study

4.3.1. The Case for Feature Sharing

Here, the results of the models and their non-feature sharing variants are presented. The variants were trained and tested on the same dataset, and under the same conditions as those with joint processing. The averaged results across all metrics are tabulated (Table 5). In general, only the accuracy shows significant variation between seed models while the variation in other metrics is negligible. It can be clearly observed that for each architectural pair (i.e., feature sharing vs standalone), the feature sharing models perform better than their standalone forms.
The performance gains are especially pronounced in the Inception-based architectures, with the SRS gains ranging from 6.59% to 15.19% and the CRS gains ranging from 7.79% to 13.29%. The lowest observable gain from applying feature sharing was a 0.33% increase in accuracy, recorded for the baseline model. Conversely, the only loss in accuracy after feature sharing was observed for the CIv3D_BiLSTM model, at −3.66%. Nonetheless, this same architecture was also capable of gains of up to 3.92% over its standalone form.

4.3.2. The Case for Nightification

To justify the choice of nightified spatiotemporal (ST) clips in the image stream, further experiments were conducted with both raw RGB input and grayscale input. These experiments were carried out on the baseline model and the previously identified best-performing models from Section 4.2, trained and tested in the same rigorous manner as the core paper models. The results show that the nightified ST input yields higher accuracy than both grayscale and raw video ST inputs for most models, the only exception being the baseline model. Models using grayscale cuboids initially appeared to perform well when judged by AUC and average precision alone; however, their accuracies were consistently lower than those obtained with nightified cuboids. Observations show that this was due to greater misclassification between visually similar behaviors (such as micromovement and rest), indicating that the grayscale modality did not carry sufficient information for these deep models to distinguish between the behaviors. A similar pattern was observed for the raw video inputs, though we argue that, in this case, the (albeit small) drop in performance was due to the lack of standardization. The results are presented in Table 6.

4.3.3. Varying Temporal Length

The temporal length refers to the number of frames that make up each clip. As previously stated, all architectures were designed for a temporal length of T = 8, corresponding to 1.33 seconds. Further experiments were performed here by varying this preset T value; the new temporal lengths chosen were half (T = 4) and double (T = 16) the original. These experiments were carried out only on the baseline and SRS models and were conducted in the same rigorous manner as the initial runs. Besides changing the input shape, the temporal cropping in the SRS architecture (refer to Section 3.3.4) was also slightly modified: in line with the new T values, this cropping was halved for T = 4 and doubled for T = 16. Hence, there was no change to the architectural complexity; for the baseline model, complexity increased only slightly for T = 16. The results, averaged over the various seeds, are shown in Table 7.
The results show that the preset T = 8 was optimal, as the accuracies obtained in the new experiments fell short of it. In order of performance, the models with an input temporal length of T = 8 were the best, followed by T = 4 and lastly T = 16.

4.3.4. Cross Validation

As stated in the related work section, the authors of the original dataset performed cross-validation using the twelve videos of the main dataset. Here, the same n = 12 cross-validation is carried out and the results are reported for the best ensemble from Section 4.2, comprising the SRS and CIv3D_MHA models. The final results are shown in Table 8.
The result achieved performs better than the human annotators; however, it is lower than that of the proposed method in the original paper. Despite the difference in model contexts (i.e., spatiotemporal versus their frame-wise model), the results achieved support the validity of our methodology. It should be noted, however, that unlike the original publication [7], all our models are trained from scratch and have no prior contact with the MIT main dataset.

4.4. Other Datasets

SCORHE

Further experiments were conducted by applying the pre-trained versions of the top three seeds (across all models) to a different home-cage mouse dataset [10]. As previously shown in Section 4.2, the top-performing seeds (as identified on the validation data) occur in the CIv3D_BiLSTM, CIv3D_MHA, and SRS models. Although 13 unique annotations were originally present (see graph in Appendix C), these were refined to 8 classes by removing samples with ambiguous classes (such as behav_ignore and behav_other), removing samples with extremely low class occurrence (such as discrepancy and rotating), and merging the supported and unsupported rearing classes.
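The label refinement described above amounts to a simple remapping, sketched below; the label strings for the rearing classes are hypothetical placeholders, as only the class names quoted in the text are taken from the dataset.

```python
# Drop ambiguous or extremely rare labels and merge the two rearing classes.
DROP = {"behav_ignore", "behav_other", "discrepancy", "rotating"}
MERGE = {"rear_supported": "rear", "rear_unsupported": "rear"}  # hypothetical label strings

def refine_labels(samples):
    """samples: iterable of (clip, label) pairs -> filtered and merged list."""
    kept = []
    for clip, label in samples:
        if label in DROP:
            continue
        kept.append((clip, MERGE.get(label, label)))
    return kept
```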
The recordings in the SCORHE home cage were captured from multiple viewpoints, as no single viewpoint provides a clear view due to occlusions. To address this, the viewpoints from opposite ends of the SCORHE cage were shaped as 128 × 64 frames and stacked into a single 128 × 128 frame. The same was also performed for the optical flow data stream. No frame skips were used here, to ensure ample training and testing data were available. Data samples for the SCORHE dataset are available in Appendix D.
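A short sketch of this view-stacking step is given below; the inputs stand for synchronized frames from the two opposite SCORHE cameras and are assumed to have already been resized to 64 rows × 128 columns.

```python
import numpy as np

def stack_views(front_view, rear_view):
    """Stack two 64 x 128 camera views vertically into one 128 x 128 frame."""
    assert front_view.shape[:2] == rear_view.shape[:2] == (64, 128)
    return np.concatenate([front_view, rear_view], axis=0)  # -> (128, 128, C)

# e.g., combined = stack_views(front_frame, rear_frame)  # hypothetical frame arrays
```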
For training, the previous FC layers were replaced and all other training parameters were kept the same, except for the learning rate, which was halved to 0.0005. The resulting receiver operating characteristic (ROC) and precision-recall (PR) curves are shown in Figure 4. The accuracies achieved on the SCORHE dataset by the feature sharing CIv3D_BiLSTM, CIv3D_MHA, and SRS models were 80.51%, 79.88%, and 79.13%, respectively. Their non-feature sharing variants achieved 72.18%, 77.95%, and 70.83%, respectively.
A few observations were made on the feature sharing models. The CIv3D_BiLSTM and CIv3D_MHA were good at transferring previously learned spatiotemporal representations of similar behaviors to this more complex home cage. However, despite its lower overall accuracy, SRS performed better at both learning the old classes and balancing its predictions to learn a totally new class, climbing. This is evidenced by the per-class accuracy across the different confusion matrices: while CIv3D_BiLSTM and CIv3D_MHA achieved 22.34% and 33.68% on this class, respectively, the SRS model achieved 53.61%.

5. Discussion

Generally, it was observed that the more dynamic behaviors were better captured by all the models than the less dynamic behaviors. Areas of weak performance across all models were mainly due to the misclassification of resting, grooming, and micromovement behaviors. These behaviors are closely related; during grooming, the mouse is mostly stationary apart from the motion of its forelimbs, and when resting, the mouse is completely immobile. Micromovement describes very small-scale motions, and hence it is likely that the 1.33-second windows of the T = 8 cuboids cannot capture the full range of motion needed to distinguish between these classes. Nonetheless, these ‘misclassifications’ are also indicative of the similarity in the temporal patterns required to perform certain tasks and may be subject to further interpretation by subject experts.
Further experiments in the ablation study also showed that, for time windows shorter or longer than the 1.33-second window, the performance of the models degrades. Thus, other clip sizes would require more intensive hyperparameter tuning and data preprocessing to work with the feature sharing paradigm. In particular, the T = 16 temporal input may also require a deeper architecture (i.e., one with more rungs or blocks) at the cost of increasing the computational complexity of the learning objective. The step up in performance between the feature sharing and standalone baseline models lends credence to the effectiveness of combined streams; by simply summing parallel outputs from both streams and processing them with a dense-dropout pair (depicted in Figure 3d), we observe between a 0.33% and 8.97% improvement in accuracy. This observation was further reinforced in subsequent networks utilizing bidirectional LSTMs and self-attention mechanisms. Though the CIv3D_BiLSTM model was only marginally better in terms of accuracy, it outperformed its non-feature sharing variant on all other metrics. Similarly, we observe a notable improvement across all metrics for the other models, especially the purely 3D Inception-based networks (SRS and CRS), both of which show over 10% improvement in averaged accuracy alone. The ensemble of the SRS and CIv3D_MHA was also seen to achieve better accuracy than the human annotators on cross-validation using the training data. Although this accuracy was not up to par with the proposed methodology in the original paper, it sufficiently demonstrates the workability of the feature sharing paradigm.
Based on both the parameter count and the floating point operations per second (FLOPS), the implementation of feature sharing was also found to mostly reduce the complexity of the architectures, with the exception of the CRS model (see Table 2). On the other hand, utilizing feature sharing requires establishing which feature sharing method is best suited to the architecture, i.e., either simple concatenation or a new processing block (such as Figure 3d or Figure 3b). These investigations generally increase the number of experiments needed, and therefore the time needed to establish its utility.

6. Conclusions

In summary, this paper proposed an approach to mouse behavior classification based on multi-stream convolutional neural networks with feature sharing. By including this architectural consideration, we observed gains ranging from 0.33% to 15.19% across the custom architectures presented. Only in one model type (the CIv3D_BiLSTM) did the feature sharing architecture achieve a lower average accuracy than its standalone variant; nevertheless, upper-limit gains of 3.92% were also possible for this same architecture. We validated this approach using two publicly available datasets, on which it performs favourably compared to the state-of-the-art.
Further work will investigate improving the overall cross-validation performance by employing data augmentations not used in this paper. In addition, feature sharing can be applied to well-established, state-of-the-art supervised models (both convolutional and transformer-based) to further investigate its pros and cons. Finally, future research will also consider the unsupervised detection of behaviors and welfare concerns in the home cage, and whether the unique feature sharing approach will benefit multi-stream models in this learning domain.

Author Contributions

Conceptualisation, training, and evaluation of the ML architectures, E.I.N. and J.M.B.; expert advice on mouse phenotyping (providing biological insight) and funding collaboration, S.W. and R.S.B.; research coordination and supervision, J.M.B. and X.Y.; assistance with resource provision, L.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Centre for the Replacement, Refinement and Reduction of Animals in Research (NC/T002050/1).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

All datasets used here are publicly available. The MIT home-cage mouse dataset by [7] is available at https://cbmm.mit.edu/mouse-dataset, (accessed on 20 September 2021). The SCORHE dataset by [10] is currently archived but can be accessed by going to https://web.archive.org, (accessed on 26 July 2022) and then searching for scorhe.nih.gov, (accessed on 26 July 2022). The required videos are available in the “downloads” section.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Full Performance Table

Table A1. Full performance table for feature sharing models.

| Model | Seed | AUC (micro) | AUC (macro) | AP | F1 Score | Accuracy (%) |
|---|---|---|---|---|---|---|
| Baseline | A | 0.89139 | 0.92472 | 0.602 | 0.53434 | 75.0425 |
| | B | 0.87151 | 0.91915 | 0.609 | 0.57256 | 77.4076 |
| | C | 0.86057 | 0.91478 | 0.564 | 0.51816 | 70.5459 |
| | D | 0.89047 | 0.92020 | 0.593 | 0.58093 | 76.7404 |
| | Average | 0.87849 | 0.91971 | 0.592 | 0.55150 | 74.9341 |
| CIv3D_BiLSTM | A | 0.93746 | 0.95331 | 0.729 | 0.55750 | 76.3368 |
| | B | 0.92415 | 0.95030 | 0.720 | 0.55634 | 75.4044 |
| | C | 0.92848 | 0.94623 | 0.723 | 0.53677 | 75.7875 |
| | D | 0.91047 | 0.95133 | 0.729 | 0.54228 | 78.6451 |
| | Average | 0.92514 | 0.95029 | 0.725 | 0.54822 | 76.5435 |
| CIv3D_MHA | A | 0.92600 | 0.94809 | 0.692 | 0.56606 | 74.2224 |
| | B | 0.94715 | 0.94632 | 0.757 | 0.68050 | 76.2464 |
| | C | 0.97321 | 0.97059 | 0.860 | 0.77441 | 83.6366 |
| | D | 0.95560 | 0.94758 | 0.772 | 0.69051 | 77.8722 |
| | Average | 0.95049 | 0.95315 | 0.770 | 0.67787 | 77.9944 |
| SRS | A | 0.96821 | 0.97703 | 0.839 | 0.74434 | 84.2392 |
| | B | 0.95825 | 0.97383 | 0.783 | 0.65857 | 79.4152 |
| | C | 0.94276 | 0.96119 | 0.727 | 0.62004 | 79.1211 |
| | D | 0.96502 | 0.97416 | 0.816 | 0.72210 | 85.0749 |
| | Average | 0.95856 | 0.97155 | 0.791 | 0.68626 | 81.9626 |
| CRS | A | 0.92582 | 0.95732 | 0.713 | 0.56682 | 77.2607 |
| | B | 0.93584 | 0.96622 | 0.728 | 0.57181 | 79.9323 |
| | C | 0.932344 | 0.95750 | 0.717 | 0.56274 | 78.0969 |
| | D | 0.94216 | 0.95938 | 0.752 | 0.60866 | 78.6730 |
| | Average | 0.93404 | 0.96011 | 0.728 | 0.57751 | 78.4907 |

Appendix B. Additional Performance Plots

Figure A1. ROC and PR plot for SRS + CIv3D_MHA.
Figure A2. Confusion Matrix for SRS + CIv3D_MHA.
Figure A3. ROC and PR plot for SRS + CIv3D_BiLSTM.
Figure A4. Confusion Matrix for SRS + CIv3D_BiLSTM.
Figure A5. ROC and PR plot for SRS + CRS.
Figure A6. Confusion Matrix for SRS + CRS.
Figure A7. ROC and PR plot for CIv3D_BiLSTM + CIv3D_MHA.
Figure A8. Confusion Matrix for CIv3D_BiLSTM + CIv3D_MHA.

Appendix C. Original Data-Label Distributions

Figure A9. Full data summaries (before preprocessing).

Appendix D. SCORHE and MIT Samples (after Preprocessing)

Figure A10. Samples from SCORHE and MIT dataset.

Appendix E. Schematic Diagrams of SRS and CRS Models

The resultant feature shapes of each architecture (at major rungs) and the fully-connected sizes are available in Table 1. As mentioned in the text, the preprocessing steps for all architectures are height-wise cropping and rescaling. The details of the blocks/modules are as follows: the Inception-based primary block in Figure A11 and Figure A12 corresponds to Figure 3c, the custom primary block in Figure A13, Figure A14 and Figure A15 corresponds to Figure 3a, and the Inception-based joint processing in Figure A14 and Figure A15 corresponds to Figure 3b. Furthermore, note that the ST attention block in Figure A15 is the same as a single-head, single encoder stack in [36] but without any positional embedding.
Figure A11. Overview of SRS model.
Figure A12. Overview of CRS model.
Figure A13. Overview of Baseline model.
Figure A14. Overview of CIv3D_BiLSTM model.
Figure A15. Overview of CIv3D_MHA model.

References

  1. Akhtar, A. The flaws and human harms of animal experimentation. Camb. Q. Healthc. Ethics 2015, 24, 407–419. [Google Scholar] [CrossRef] [PubMed]
  2. NC3Rs. How Many Animals Are Used in Research? Available online: https://nc3rs.org.uk/how-many-animals-are-used-research#:~:text=In%20Great%20Britain%20in%202020,and%20monkeys%2C%20are%20also%20used (accessed on 23 November 2023).
  3. Breschi, A.; Gingeras, T.R.; Guigó, R. Comparative transcriptomics in human and mouse. Nat. Rev. Genet. 2017, 18, 425–440. [Google Scholar] [CrossRef] [PubMed]
  4. Ackert-Bicknell, C.L.; Anderson, L.C.; Sheehan, S.; Hill, W.G.; Chang, B.; Churchill, G.A.; Chesler, E.J.; Korstanje, R.; Peters, L.L. Aging research using mouse models. Curr. Protoc. Mouse Biol. 2015, 5, 95–133. [Google Scholar] [CrossRef] [PubMed]
  5. Yanai, S.; Endo, S. Functional aging in male C57BL/6J mice across the life-span: A systematic behavioral analysis of motor, emotional, and memory function to define an aging phenotype. Front. Aging Neurosci. 2021, 13, 697621. [Google Scholar] [CrossRef] [PubMed]
  6. Karl, T.; Pabst, R.; von Hörsten, S. Behavioral phenotyping of mice in pharmacological and toxicological research. Exp. Toxicol. Pathol. 2003, 55, 69–83. [Google Scholar] [CrossRef]
  7. Jhuang, H.; Garrote, E.; Yu, X.; Khilnani, V.; Poggio, T.; Steele, A.D.; Serre, T. Automated home-cage behavioural phenotyping of mice. Nat. Commun. 2010, 1, 1–10. [Google Scholar] [CrossRef] [PubMed]
  8. Voikar, V.; Gaburro, S. Three pillars of automated home-cage phenotyping of mice: Novel findings, refinement, and reproducibility based on literature and experience. Front. Behav. Neurosci. 2020, 14, 575434. [Google Scholar] [CrossRef] [PubMed]
  9. Iannello, F. Non-intrusive high throughput automated data collection from the home cage. Heliyon 2019, 5, e01454. [Google Scholar] [CrossRef]
  10. Salem, G.H.; Dennis, J.U.; Krynitsky, J.; Garmendia-Cedillos, M.; Swaroop, K.; Malley, J.D.; Pajevic, S.; Abuhatzira, L.; Bustin, M.; Gillet, J.P.; et al. SCORHE: A novel and practical approach to video monitoring of laboratory mice housed in vivarium cage racks. Behav. Res. Methods 2015, 47, 235–250. [Google Scholar] [CrossRef]
  11. Kiryk, A.; Janusz, A.; Zglinicki, B.; Turkes, E.; Knapska, E.; Konopka, W.; Lipp, H.P.; Kaczmarek, L. IntelliCage as a tool for measuring mouse behavior–20 years perspective. Behav. Brain Res. 2020, 388, 112620. [Google Scholar] [CrossRef]
  12. Liu, H.; Liu, T.; Chen, Y.; Zhang, Z.; Li, Y.F. EHPE: Skeleton cues-based gaussian coordinate encoding for efficient human pose estimation. IEEE Trans. Multimed. 2022. [Google Scholar] [CrossRef]
  13. Liu, T.; Wang, J.; Yang, B.; Wang, X. NGDNet: Nonuniform Gaussian-label distribution learning for infrared head pose estimation and on-task behavior understanding in the classroom. Neurocomputing 2021, 436, 210–220. [Google Scholar] [CrossRef]
  14. Armstrong, J.D.; Acevedo-Arozena, A.; Bains, R.S.; Cater, H.; Chartsias, A.; Nolan, P.; Sneddon, D.; Sillito, R.; Wells, S. Tracking of Individual Mice in a Social Setting Using Video Tracking Combined with RFID tags. Proc. Meas. Behav. 2016, 10, 413–416. [Google Scholar]
  15. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1725–1732. [Google Scholar]
  16. Han, X. Automatic liver lesion segmentation using a deep convolutional neural network method. arXiv 2017, arXiv:1704.07239. [Google Scholar]
  17. Zhang, H.; Liao, Y.; Yang, H.; Yang, G.; Zhang, L. A local-global dual-stream network for building extraction from very-high-resolution remote sensing images. IEEE Trans. Neural Netw. Learn. Syst. 2020, 33, 1269–1283. [Google Scholar] [CrossRef]
  18. Hou, Y.; Yu, H.; Zhou, D.; Wang, P.; Ge, H.; Zhang, J.; Zhang, Q. Local-aware spatio-temporal attention network with multi-stage feature fusion for human action recognition. Neural Comput. Appl. 2021, 33, 16439–16450. [Google Scholar] [CrossRef]
  19. Drozdzal, M.; Vorontsov, E.; Chartrand, G.; Kadoury, S.; Pal, C. The importance of skip connections in biomedical image segmentation. In International Workshop on Deep Learning in Medical Image Analysis, International Workshop on Large-Scale Annotation of Biomedical Data and Expert Label Synthesis, DLMIA 2016, LABELS 2016: Deep Learning and Data Labeling for Medical Applications; Springer: Berlin, Germany, 2016; pp. 179–187. [Google Scholar]
  20. Zhou, W.; Zhao, Y.; Zhang, F.; Luo, B.; Yu, L.; Chen, B.; Yang, C.; Gui, W. TSDTVOS: Target-guided spatiotemporal dual-stream transformers for video object segmentation. Neurocomputing 2023, 555, 126582. [Google Scholar] [CrossRef]
  21. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  22. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
  23. Weng, O.; Marcano, G.; Loncar, V.; Khodamoradi, A.; Sheybani, N.; Meza, A.; Koushanfar, F.; Denolf, K.; Duarte, J.M.; Kastner, R. Tailor: Altering Skip Connections for Resource-Efficient Inference. arXiv 2023, arXiv:2301.07247. [Google Scholar] [CrossRef]
  24. Bittner, K.; Liebel, L.; Körner, M.; Reinartz, P. Long-short skip connections in deep neural networks for dsm refinement. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2020, 43, 383–390. [Google Scholar] [CrossRef]
  25. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
  26. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. arXiv 2014, arXiv:1406.2199. [Google Scholar]
  27. Wei, J.; Xu, Y.; Cai, W.; Wu, Z.; Chanussot, J.; Wei, Z. A two-stream multiscale deep learning architecture for pan-sharpening. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2020, 13, 5455–5465. [Google Scholar] [CrossRef]
  28. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  29. Nguyen, N.G.; Phan, D.; Lumbanraja, F.R.; Faisal, M.R.; Abapihi, B.; Purnama, B.; Delimayanti, M.K.; Mahmudah, K.R.; Kubo, M.; Satou, K. Applying deep learning models to mouse behavior recognition. J. Biomed. Sci. Eng. 2019, 12, 183–196. [Google Scholar] [CrossRef]
  30. Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402. [Google Scholar]
  31. Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2556–2563. [Google Scholar]
  32. Dollár, P.; Rabaud, V.; Cottrell, G.; Belongie, S. Behavior recognition via sparse spatio-temporal features. In Proceedings of the 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Beijing, China, 15–16 October 2005; pp. 65–72. [Google Scholar]
  33. Kyzar, E.; Gaikwad, S.; Roth, A.; Green, J.; Pham, M.; Stewart, A.; Liang, Y.; Kobla, V.; Kalueff, A.V. Towards high-throughput phenotyping of complex patterned behaviors in rodents: Focus on mouse self-grooming and its sequencing. Behav. Brain Res. 2011, 225, 426–431. [Google Scholar] [CrossRef] [PubMed]
  34. Kalueff, A.V.; Tuohimaa, P. Mouse grooming microstructure is a reliable anxiety marker bidirectionally sensitive to GABAergic drugs. Eur. J. Pharmacol. 2005, 508, 147–153. [Google Scholar] [CrossRef] [PubMed]
  35. Liu, H.; Huang, X.; Xu, J.; Mao, H.; Li, Y.; Ren, K.; Ma, G.; Xue, Q.; Tao, H.; Wu, S.; et al. Dissection of the relationship between anxiety and stereotyped self-grooming using the Shank3B mutant autistic model, acute stress model and chronic pain model. Neurobiol. Stress 2021, 15, 100417. [Google Scholar] [CrossRef]
  36. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  37. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  38. Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding. arXiv 2021, arXiv:2102.05095. [Google Scholar]
  39. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6836–6846. [Google Scholar]
  40. Li, H.; Wang, J.; Han, J.; Zhang, J.; Yang, Y.; Zhao, Y. A novel multi-stream method for violent interaction detection using deep learning. Meas. Control 2020, 53, 796–806. [Google Scholar] [CrossRef]
  41. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  42. Gharagozloo, M.; Amrani, A.; Wittingstall, K.; Hamilton-Wright, A.; Gris, D. Machine Learning in Modeling of Mouse Behavior. Front. Neurosci. 2021, 15. [Google Scholar] [CrossRef] [PubMed]
  43. Graves, A.; Fernández, S.; Schmidhuber, J. Bidirectional LSTM networks for improved phoneme classification and recognition. In Proceedings of the International Conference on Artificial Neural Networks, Warsaw, Poland, 10–15 September 2005; pp. 799–804. [Google Scholar]
  44. Suzuki, S.; Iseki, Y.; Shiino, H.; Zhang, H.; Iwamoto, A.; Takahashi, F. Convolutional Neural Network and Bidirectional LSTM Based Taxonomy Classification Using External Dataset at SIGIR eCom Data Challenge. In Proceedings of the eCOM@ SIGIR, Ann Arbor, MI, USA, 12 July 2018. [Google Scholar]
  45. Hua, Y.; Mou, L.; Zhu, X.X. Recurrently exploring class-wise attention in a hybrid convolutional and bidirectional LSTM network for multi-label aerial image classification. ISPRS J. Photogramm. Remote Sens. 2019, 149, 188–199. [Google Scholar] [CrossRef] [PubMed]
  46. Farnebäck, G. Two-frame motion estimation based on polynomial expansion. In Proceedings of the Scandinavian conference on Image Analysis, SCIA 2023, Sirkka, Finland, 18–21 April 2003; pp. 363–370. [Google Scholar]
  47. King, G.; Zeng, L. Logistic regression in rare events data. Political Anal. 2001, 9, 137–163. [Google Scholar] [CrossRef]
  48. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
Figure 1. Conventional standalone vs. feature sharing dual networks. While the conventional dual-stream only extracts features for its stream, we propose the use of joint-processing layers, which we have termed feature sharing.
Figure 2. Sample frame before and after ‘nightification’.
Figure 3. Primary and feature sharing modules used in the architectures. Note that the variables k, n, and m used here represent the multipliers, filter sizes, and kernel sizes respectively, at different levels in the architectures.
Figure 4. ROC and PR curves between the feature sharing and conventional architectures using the SCORHE dataset. On observation, all the feature sharing architectural forms were found to also outperform their standalone stream variants in both AUC and average precision.
Table 1. Full summary of feature sharing models showing the internal parameters and output sizes for each stacked module.

| | Baseline | CIv3D_BiLSTM | CIv3D_MHA | SRS | CRS |
|---|---|---|---|---|---|
| Primary filters (n) ¹ | 16, 32, 64, 128 | 16, 32, 64, 128 | 16, 32, 64, 128 | 8, 16, 32, 64, 64, 128 | 12, 24, 48, 96, 192 |
| Intra-block filter multipliers (k) ¹ | 1.5 | 1.5 | 1.5 | 1.5, 1.5, 1.5, 1.5, 1.5, 1.0 | 1.0 |
| Kernels (m) | 5, 3, 5, 5 | 7, 5, 5, 3 | 7, 5, 5, 3 | refer to Figure 3c | refer to Figure 3c |
| Stream combination via? | Addition | Addition | Addition | Concatenation | Concatenation |
| Processing at joints? | Yes (refer to Figure 3d) | Yes (refer to Figure 3b) | Yes (refer to Figure 3b) | No | No |
| Further processing before FC? | No | No | No | Single Inception v1 block (Figure 3c) | No |
| Activation function(s) before FC? | Leaky ReLU | Leaky ReLU | Leaky ReLU | ReLU | ReLU |
| Activation at last dense layer? | Softmax | Softmax | Softmax | Softmax | Softmax |
| FC units (descending) | 512, 64, 8 | 512, 64, 8 | 512, 64, 8 | 512, 64, 8 | 512, 64, 8 |
| Output sizes after joint processing blocks ² | | | | | |
| Module 1 | 4 × 24 × 32 × 24 | 4 × 24 × 32 × 24 | 4 × 24 × 32 × 24 | 8 × 96 × 128 × 72 | 8 × 96 × 128 × 72 |
| Module 2 | 2 × 12 × 16 × 48 | 2 × 12 × 16 × 48 | 2 × 12 × 16 × 48 | 4 × 48 × 64 × 144 | 4 × 48 × 64 × 144 |
| Module 3 | 1 × 6 × 8 × 96 | 1 × 6 × 8 × 96 | 1 × 6 × 8 × 96 | 2 × 24 × 32 × 288 | 2 × 24 × 32 × 288 |
| Module 4 | 1 × 3 × 4 × 192 | 1 × 3 × 4 × 192 | 1 × 3 × 4 × 192 | 1 × 12 × 16 × 576 | 1 × 12 × 16 × 576 |
| Module 5 | - | - | - | 1 × 6 × 8 × 576 | 1 × 6 × 8 × 1152 |
| Module 6 | - | - | - | 1 × 3 × 4 × 384 | - |

¹ Filter/unit size n in the joint processors (Figure 3b,d) is based on the filter size after stream combination (i.e., addition or concatenation) and uses a fixed multiplier k = 1.5. ² Striding through the temporal dimension produced more compact feature representations and lessened the overall parameter count.
Table 2. Learning rates, parameter count, and floating point operations per second (FLOPS) for each model.

| | Baseline | CIv3D_BiLSTM | CIv3D_MHA | SRS | CRS |
|---|---|---|---|---|---|
| Learning rate(s) | 0.0005 | 0.001 | 0.001 | 0.001 | 0.001 |
| Parameters (feature sharing) | 11,315,848 | 13,243,712 | 15,554,112 | 9,671,872 | 22,927,824 |
| Parameters (standalone) | 11,348,728 | 16,623,752 | 19,044,744 | 9,670,024 | 22,483,944 |
| FLOPS (feature sharing) | 22.63 × 10⁶ | 30.67 × 10⁶ | 31.10 × 10⁶ | 19.34 × 10⁶ | 45.85 × 10⁶ |
| FLOPS (standalone) | 22.69 × 10⁶ | 37.42 × 10⁶ | 38.07 × 10⁶ | 19.34 × 10⁶ | 44.96 × 10⁶ |
Table 3. Performance of proposed models across chosen metrics showing that the SRS outperforms all other architectures across all metrics.

| Model | AUC (micro) | AUC (macro) | Precision | Recall | F1-Score | Accuracy (%) |
|---|---|---|---|---|---|---|
| Baseline | 0.879 | 0.920 | 0.608 | 0.749 | 0.607 | 74.93 |
| CIv3D_BiLSTM | 0.955 | 0.965 | 0.731 | 0.771 | 0.687 | 77.08 |
| CIv3D_MHA | 0.951 | 0.953 | 0.652 | 0.780 | 0.671 | 77.99 |
| SRS | 0.959 | 0.972 | 0.796 | 0.820 | 0.750 | 81.96 |
| CRS | 0.934 | 0.960 | 0.755 | 0.785 | 0.683 | 78.49 |
Table 4. Result of binary ensembles between SRS, CRS, CIv3D_MHA and CIv3D_BiLSTM models.

| Ensemble | mAUC | AP | Acc (%) |
|---|---|---|---|
| SRS + CRS | 0.958 | 0.795 | 83.31 |
| SRS + CIv3D_MHA | 0.977 | 0.880 | 86.28 |
| SRS + CIv3D_BiLSTM | 0.966 | 0.830 | 83.69 |
| CRS + CIv3D_MHA | 0.963 | 0.814 | 82.62 |
| CRS + CIv3D_BiLSTM | 0.942 | 0.745 | 79.55 |
| CIv3D_MHA + CIv3D_BiLSTM | 0.968 | 0.831 | 82.46 |
Table 5. Detailed comparison between the feature sharing and standalone stream forms of each network. The standard deviation (from the mean metric value) across the four different seed models in each case was also included.

| Model | Stream Kind | mAUC | MAUC | AP | F1 Score | Accuracy (%) |
|---|---|---|---|---|---|---|
| Baseline | sharing | 0.878 ± 0.013 | 0.920 ± 0.004 | 0.592 ± 0.017 | 0.552 ± 0.026 | 74.93 ± 2.68 |
| | standalone | 0.815 ± 0.020 | 0.916 ± 0.010 | 0.562 ± 0.024 | 0.483 ± 0.016 | 70.28 ± 1.64 |
| CIv3D_BiLSTM | sharing | 0.955 ± 0.012 | 0.965 ± 0.003 | 0.789 ± 0.045 | 0.654 ± 0.074 | 77.08 ± 1.77 |
| | standalone | 0.910 ± 0.015 | 0.955 ± 0.004 | 0.668 ± 0.051 | 0.537 ± 0.026 | 76.95 ± 2.02 |
| CIv3D_MHA | sharing | 0.951 ± 0.017 | 0.953 ± 0.017 | 0.770 ± 0.060 | 0.678 ± 0.074 | 77.99 ± 3.50 |
| | standalone | 0.896 ± 0.018 | 0.938 ± 0.007 | 0.635 ± 0.036 | 0.564 ± 0.034 | 73.10 ± 2.44 |
| SRS | sharing | 0.959 ± 0.010 | 0.972 ± 0.006 | 0.791 ± 0.042 | 0.686 ± 0.050 | 81.96 ± 2.71 |
| | standalone | 0.900 ± 0.029 | 0.931 ± 0.009 | 0.632 ± 0.064 | 0.666 ± 0.048 | 71.07 ± 1.59 |
| CRS | sharing | 0.934 ± 0.006 | 0.960 ± 0.004 | 0.728 ± 0.015 | 0.578 ± 0.018 | 78.49 ± 0.95 |
| | standalone | 0.868 ± 0.016 | 0.921 ± 0.008 | 0.562 ± 0.025 | 0.523 ± 0.019 | 67.95 ± 1.80 |
Table 6. Results on grayscale (GS), raw RGB (R), and nightified (N) data show that 3-channel inputs in the raw image stream produced a better performance than the single channel grayscale in all the architectures considered. Of these models, two of the best performers were associated with the nightified data format.

| Model | Input | mAUC | MAUC | AP | F1 Score | Accuracy (%) |
|---|---|---|---|---|---|---|
| Baseline | GS | 0.873 | 0.910 | 0.545 | 0.542 | 73.37 |
| Baseline | R | 0.869 | 0.914 | 0.577 | 0.544 | 76.57 |
| Baseline | N | 0.879 | 0.920 | 0.592 | 0.552 | 74.93 |
| CIv3D_MHA | GS | 0.910 | 0.929 | 0.647 | 0.564 | 73.70 |
| CIv3D_MHA | R | 0.921 | 0.943 | 0.673 | 0.597 | 77.16 |
| CIv3D_MHA | N | 0.950 | 0.953 | 0.770 | 0.678 | 77.99 |
| SRS | GS | 0.962 | 0.972 | 0.810 | 0.733 | 9.24 |
| SRS | R | 0.938 | 0.953 | 0.687 | 0.569 | 77.87 |
| SRS | N | 0.959 | 0.972 | 0.791 | 0.686 | 81.96 |
Table 7. Results for varied temporal lengths show that the 1.33-second window was optimal for the depth and complexity of the architectures presented.

| Model | Temporal Length (T) | mAUC | MAUC | AP | F1 Score | Accuracy (%) |
|---|---|---|---|---|---|---|
| Baseline | 4 | 0.853 | 0.743 | 0.411 | 0.381 | 57.03 |
| | 8 | 0.879 | 0.920 | 0.592 | 0.552 | 74.93 |
| | 16 | 0.831 | 0.736 | 0.365 | 0.349 | 47.63 |
| SRS | 4 | 0.914 | 0.958 | 0.629 | 0.576 | 69.72 |
| | 8 | 0.959 | 0.972 | 0.791 | 0.686 | 81.96 |
| | 16 | 0.916 | 0.930 | 0.658 | 0.607 | 67.43 |
Table 8. Result on cross-validation only shows results comparable to those of human annotators.

| Method | Cross-Validation Acc (%) |
|---|---|
| Human annotators [7] | 71.6 |
| Their method [7] | 77.3 |
| Ours | 71.8 |
