Article

Unsupervised Few Shot Key Frame Extraction for Cow Teat Videos

1 Department of Clinical Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY 14853, USA
2 Department of Population Medicine and Diagnostic Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY 14853, USA
* Author to whom correspondence should be addressed.
Submission received: 22 April 2022 / Revised: 18 May 2022 / Accepted: 18 May 2022 / Published: 23 May 2022

Abstract
A novel method of monitoring the health of dairy cows in large-scale dairy farms is proposed via image-based analysis of cows on rotary-based milking platforms, where deep learning is used to classify the extent of teat-end hyperkeratosis. The videos can be analyzed to segment the teats for feature analysis, which can then be used to assess the risk of infections and other diseases. This analysis can be performed more efficiently by using the key frames of each cow as it passes through the image frame. Extracting key frames from these videos would greatly simplify this analysis, but there are several challenges. First, data collection in the farm setting is harsh, resulting in unpredictable temporal key frame positions; empty, obfuscated, or shifted images of the cow’s teats; frequently empty stalls due to challenges with herding cows into the parlor; and regular interruptions and reversals in the direction of the parlor. Second, supervised learning requires expensive and time-consuming human annotation of key frames, which is impractical in large commercial dairy farms housing thousands of cows. Unsupervised learning methods rely on large frame differences and often suffer from low performance. In this paper, we propose a novel unsupervised few-shot learning model which extracts key frames from large (∼21,000 frames) video streams. Using a simple L1 distance metric that combines both image and deep features between each unlabeled frame and a few (32) labeled key frames, a key frame selection mechanism, and a quality check process, key frames can be extracted with sufficient accuracy (F score 63.6%) and timeliness (<10 min per 21,000 frames) to meet the demands of the commercial dairy farm setting.

1. Introduction

Monitoring the dairy cows’ health is critical in ensuring quality milk production. In the commercial dairy farm setting, monitoring the health of thousands of cows is a time-consuming and expensive task. During the milking process, cows are moved toward large parlors for machine milking as shown in Figure 1. These systems consist of a large and slowly rotating set of stalls, where a cow is guided into a stall, a milking unit is manually attached to the cow’s teats, machine milking commences via vacuum, and the milking unit automatically detaches and retracts from the teats. Thereafter, the cow exits the rotating parlor.
Within a milking session, the opportunity for a veterinarian to assess the health of the dairy cows’ teats is limited to immediately before or after the milking unit attaches or detaches from the teats. Mastitis, or bacterial infections of the udders and/or teats, poses one of the greatest health concerns for dairy cows. The risk of mastitis is increased with changes in the callosity (hyperkeratosis) of the teat end, and this can be assessed via manual inspection. While it is possible to assess the extent of hyperkeratosis in a large proportion of the herd during a milking session, the total time available for the veterinarian to conduct this assessment is limited due to the finite amount of time that the cow is in the stall (typically tens of seconds). It is thus impractical to conduct health assessments of the entire herd in this manner, and industry standards suggest evaluating 20% (or greater) of the herd [1].
Recently, we proposed a digital framework for evaluating the extent of hyperkeratosis by using a digital camera and software [2]. This approach allows remote assessment of the entire population of cows that enter the parlor. A digital approach also permits the opportunity for several experts to conduct assessments of hyperkeratosis independently and mitigate the influence of inter-rater variability. We have also shown that it is feasible to use deep learning to classify the extent of hyperkeratosis [3]. These innovations permitted the opportunity to explore whether such health assessments can be conducted remotely using video-based imaging systems. Later, we proposed a separable confident transductive learning [4] model to minimize the difference between training and test datasets, and we improved the hyperkeratosis recognition accuracy from 61.8% to 77.6%.
While a video-based analysis might seem like a simple extension of this work, analyzing the entire video frame by frame is inefficient since only a small number of frames contain useful diagnostic information. Many vision-based tasks (classification, segmentation) can be performed more efficiently using key frames (KFs) instead of the full video; thus, one option is to select KFs from these cow teat videos for analysis. Most existing key frame extraction (KFE) methods use supervised or unsupervised learning. Supervised learning requires the manual labeling of KFs in large-scale training data to train a model. In the dairy farm setting, it is neither practical nor economical to manually label all video images; thus, unsupervised or semi-supervised learning models are preferred. Unsupervised learning models for detecting KFs rely on significant changes between image frames. The cognitive goal of our problem is to extract key teat frames from video sequences in which changes in objects between frames are less obvious. The practical goal is to extract key frames efficiently and accurately using only a few labeled key frames. Therefore, existing supervised methods (which require massive labels) and unsupervised methods (which require significant frame changes) are ineffective for our problem.
We propose a modified few-shot learning approach that leverages knowledge from several (N = 32) support KFs to identify KFs in unlabeled video image frames (Figure 2). Figure 3 shows 6 of the 32 KFs used in this study. This paper provides three specific contributions:
  • The CowTeatVideo Benchmark. We provide a new, publicly available dataset consisting of dairy cow teat videos for key frame extraction, which can be used for the testing and evaluation of different KFE models.
  • Few-shot generalized learning. We run few-shot learning without a base training dataset and with unlabeled query datasets (cow teat videos). The key frames are detected using the distance between the unlabeled query dataset and the support key frame images.
  • UFSKFE model. We describe a novel unsupervised few-shot learning key frame extraction (UFSKFE) model for our problem. We combine the L1 distance of raw RGB images and extracted deep features to form a robust fusion distance. After selecting key frame candidates, we further propose a quality check process to remove noisy key frames.

2. Related Work

Extracting correct KFs has been a long-standing problem with many applications, such as managing, storing, transmitting, and retrieving video data. Both traditional and deep learning-based methods have been explored.

2.1. Traditional Methods

Traditional KFE models can be divided into two categories: unsupervised learning and supervised learning. Unsupervised KFE often relies on computing relevance, diversity, and representativeness from traditional features such as optical flow [5,6], SIFT [7,8] and SURF [9,10]. The clustering approach is one representative unsupervised KFE method [11]. Mendi and Bayrak [12] developed a dynamic KFE method through three steps: color histogram differences, self-similarity modeling and unsupervised k-means clustering. Priya and Dominic [13] utilized inter-cluster similarity analysis to extract KFs. Vázquez-Martín and Bandera [14] computed similarity by building an auxiliary graph of frame features and then applied spectral clustering to extract KFs. Later, Ioannidis et al. [15] extracted KFs by applying spectral clustering to a composite similarity matrix computed as a weighted sum of all similarity matrices of video frames. Supervised KFE models rely on human-annotated data to train a machine learning model and generate KFs from the test videos. Ghosh et al. [16] and Gygli et al. [17] treated the process of extracting KFs as a regression scoring problem, where frames with higher scores are selected as KFs. Yao et al. [18] proposed a multifeature fusion method (which can capture complicated and changeable dancer motions) to extract KFs from dance videos.

2.2. Deep Learning Models

Recently, deep learning approaches have attracted interest in KFE. Both supervised and unsupervised deep learning models have been proposed to boost the performance of KFE from videos. Supervised deep KFE models usually estimate a frame’s importance via deep neural networks with the aid of ground truth KFs. Zhang et al. [19] first applied long short-term memory (LSTM) units to model variable-range temporal dependency among video frames, and they predicted each frame’s importance via a multi-layer perceptron. Later, Zhao et al. [20] proposed a two-layer LSTM to estimate the key fragments of a video. They further developed a tensor-train embedding layer in a hierarchical architecture of recurrent neural networks to model the long temporal dependency among video frames [21]. Based on [19], Casas and Koblents introduced an attention mechanism to estimate the frame’s importance and select the video KFs. Fajtl et al. [22] utilized self-attention with a two-layer fully connected network to predict the frame’s importance score. Li et al. [23] developed a global diverse attention mechanism based on a pairwise similarity matrix that contains diverse attention weights, which can be further transformed into frame importance scores. Jian et al. [24] extracted the KFs of sports videos by considering the neighboring probability difference of frames, where these probabilities were estimated by a CNN on extracted regions of interest. Yuan et al. [25] introduced a global motion model to extract candidate KFs; spatial–temporal consistency and hierarchical clustering were then used to extract the final KFs.
There are also several unsupervised deep learning models for KFE. Yuan et al. [26] introduced a bidirectional LSTM model to automatically extract KFs. Mahasseni et al. [27] applied generative adversarial networks (GANs) to KFE. They employed an LSTM as a frame selector to confuse the discriminator (which aims to distinguish the original video from the reconstructed video). Yuan et al. [28] utilized a bidirectional LSTM as a frame selector to model the temporal dependency among frames, and KFs were evaluated by two GANs. Yan et al. [29] proposed an automatic self-supervised learning model to detect KFs in videos, generating pseudo labels for each frame from optical flow and RGB image features. Li [30] proposed an end-to-end network embedding for unsupervised KFE for person re-identification. They designed a KFE module by training a CNN with pseudo labels generated by hierarchical clustering. Recently, Elahi and Yang [31] proposed an online learnable module for KFE, and the extracted KFs were used for recognizing actions with deep learning-based classification models.
Our goal is to devise an effective strategy to extract KFs that contain a clear, unambiguous, and high-resolution image of the dairy cow teats for clinical diagnosis. Unsupervised learning models rely on sharp differences between consecutive frames to determine the KFs, but this is not the case in our problem. Unsupervised clustering models can also lead to low performance in our situation, since KFs are similar to each other and may easily be assigned to the same class (see sample KF images in Figure 3).
Few-shot learning aims to accomplish a learning task using very few training examples, typically recognizing the different categories of images in a query dataset given a base training dataset and a support dataset [32,33,34]. Oreshkin et al. [35] trained a normal global classifier on the base dataset to form an auxiliary task, which can co-train the few-shot classifier and create a regularization effect. Gidaris et al. [36] combined self-supervision with few-shot learning, which can learn rich and transferable visual representations from few annotated samples. Hong et al. [37] utilized reinforcement learning to train an attention agent that generates discriminative representations in few-shot learning. Wei and Mahmood [38] optimized few-shot learning tasks by generating new samples using variational autoencoders for face recognition. However, current few-shot models are mostly supervised and rely on labeled examples. Current attempts at unsupervised few-shot learning [39,40] are not suitable for our problem, in which only a few KFs (the support dataset) and unlabeled cow teat videos are provided for learning.

3. Methodology

3.1. Motivation

Given the unique nature of our dataset and problem, we propose to apply few-shot learning in an unsupervised manner for KFE. We then design a framework that takes the knowledge from the few support KF images and finds their nearby neighbors using distances computed from both raw RGB images and pre-trained deep features, as shown in Figure 4.

3.2. Preliminaries

3.2.1. Key Frame Extraction

Given a video $V = \{v_i\}_{i=1}^{n_v}$, where $v_i$ is the $i$-th frame image and $n_v$ is the number of frames in video $V$, the goal of video KFE is to fetch the KF numbers $Y$:
$Y = S(V),$  (1)
where $Y = \{y_j\}_{j=1}^{n_y}$ ($n_y$ is the number of predicted KFs, and $n_y \ll n_v$) and $S$ is an automatic KF selection function. In supervised KFE, the KF numbers $F = \{f_i\}_{i=1}^{n_f}$ of video $V$, or the importance of each frame image, are provided, where $n_f$ is the number of KFs and typically $n_f \ll n_v$. We aim to minimize the error between $Y$ and $F$ during training and generalize the trained model to new video data. In unsupervised KFE, no KFs are known (i.e., $F = \varnothing$). The aim is to predict the $Y$ that can best describe the content of a video $V$.

3.2.2. Few-Shot Learning

In supervised few-shot learning, we have a labeled base training dataset $D_{base} = (X, Z) = \{x_i, z_i\}_{i=1}^{n_d}$ that contains $n_d$ labeled training images from $A$ base classes, i.e., $z_i \in \{1, 2, \ldots, A\}$. In addition, we are given a support dataset $D_S$ of labeled images from $C$ novel classes, and each class has $K$ examples. The goal of few-shot learning is to train a model that can accurately recognize the $C$ novel classes in another query dataset $D_Q$. This learning paradigm is called $C$-way $K$-shot learning. In unsupervised few-shot learning, there are no labels for the base training dataset, i.e., $D_{base} = X = \{x_i\}_{i=1}^{n_d}$. In our KFE problem, the base training dataset is also unavailable, i.e., $D_{base} = \varnothing$. We treat the full video as the query dataset, and it has no labels. In the next section, we discuss how we construct tasks in unsupervised KFE with few-shot learning.

3.3. Unsupervised Few-Shot KFE

In traditional unsupervised KFE, poor performance is often the result of having no labeled KFs. In our videos, there are no distinctive changes between frames, unlike in sports videos. To improve the learning of these KFs, we start with a few KFs (i.e., a support dataset $D_S$ exists). Since we only have one class (KFs) and $K$ KFs ($K$ images, $K = 32$ in our case), our problem can be treated as a one-way 32-shot problem from a few-shot learning perspective. However, the aforementioned base training dataset is not provided. Furthermore, the query dataset is our unlabeled cow teat video ($D_Q = V$). A key question then is how to obtain key frames for each cow in all unlabeled videos with only a few prior KFs. Inspired by few-shot learning, we consider measuring the distance between each video frame image and the support KFs.

3.3.1. Raw Distance Representation

To select KFs from the unlabeled videos, we propose to calculate the distance between the support KF images $D_S = \{s_k\}_{k=1}^{K=32}$ and each frame image of a video. Frames with the lowest distances could be potential KFs. First, we calculate a distance based on each raw frame image and support KF image via the distance matrix $M_{raw} \in \mathbb{R}^{n_v \times K}$ in Equation (2), which represents the L1 difference between each raw video frame image and the $K$ support KF raw images. An element of the distance matrix is defined as
$M_{raw}^{ik} = |s_k - v_i|_1,$  (2)
where $|\cdot|_1$ is the L1 norm of the difference between one support KF image and one video frame ($k \in \{1, \ldots, K\}$ and $i \in \{1, \ldots, n_v\}$), $|s_k - v_i|_1 \in \mathbb{R}^{1 \times 1}$, and hence $M_{raw} \in \mathbb{R}^{n_v \times K}$. We then define the raw distance as
$d_{raw} = \min_r M_{raw},$  (3)
where $\min_r$ returns the minimum of each row of the matrix $M_{raw}$. For each frame $v_i$, its associated raw distance is $d_{raw}^i = \min\{M_{raw}^{ik}\}_{k=1}^{K} \in \mathbb{R}^{1 \times 1}$ and denotes the distance to the closest support KF image. Since a video contains many images of each cow, and many cows, several KFs to compare against an analyzed image are necessary. We aim to have a diverse set of support KFs $s_k$ such that at least one image closely resembles the current frame. For all frames in any video $V$, we can calculate the raw distance $d_{raw} \in \mathbb{R}^{n_v \times 1}$. Note, however, that the raw distance is computed using original images and might not capture all of the important features in a key frame. We therefore also extract deep features from both the video frame images and the support KF images and calculate a deep feature distance, as described in the next section.
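The raw distance can be computed directly with vectorized array operations. The following is a minimal NumPy sketch of Equations (2) and (3); the function name and the assumption that frames are already loaded as uint8 RGB arrays are ours, not part of the released code.

```python
import numpy as np

def raw_distance(frames: np.ndarray, support_kfs: np.ndarray) -> np.ndarray:
    """frames: (n_v, H, W, 3); support_kfs: (K, H, W, 3) -> d_raw: (n_v,)."""
    n_v, K = frames.shape[0], support_kfs.shape[0]
    M_raw = np.empty((n_v, K), dtype=np.float64)
    support = support_kfs.astype(np.int64)
    for i in range(n_v):  # one frame at a time keeps memory usage modest
        diff = np.abs(frames[i].astype(np.int64) - support)   # (K, H, W, 3)
        M_raw[i] = diff.reshape(K, -1).sum(axis=1)             # L1 norm, Equation (2)
    return M_raw.min(axis=1)                                   # row-wise minimum, Equation (3)
```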

3.3.2. Deep Distance Representation

There is no deep model for cow teat video classification or segmentation; thus, our approach is to extract deep features from a pre-trained ImageNet model. Let Φ represent feature extraction from a pre-trained ImageNet model. Similar to the raw distance matrix in Equation (2), an element in a deep distance matrix is denoted as
$M_{deep}^{ik} = |\Phi(s_k) - \Phi(v_i)|_1,$  (4)
where $\Phi(\cdot) \in \mathbb{R}^{D}$ represents the feature vector for a given frame image with dimensionality $D$ (we extract deep features from the layer prior to the last fully connected layer), $M_{deep}^{ik} \in \mathbb{R}^{1 \times 1}$, and $M_{deep} \in \mathbb{R}^{n_v \times K}$. The deep distance is then defined as:
$d_{deep} = \min_r M_{deep}.$  (5)
Again, $d_{deep}$ has size $n_v \times 1$. This deep distance represents the feature difference between the current video frame and its closest support KF. Both $d_{raw}$ and $d_{deep}$ denote the distance between one video frame and the support KFs. Next, we form a robust fusion distance by combining these two distances for KFE.
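Because each video has roughly 21,000 frames, the deep features are most naturally computed in batches. The sketch below assumes an `extract` callable that maps a batch of frames to feature vectors (for example, the ResNet-101 extractor sketched in Section 4.3); the batching scheme and names are illustrative, not taken from the released code.

```python
import numpy as np

def deep_distance(frames, support_kfs, extract, batch_size=64):
    """Equations (4) and (5): L1 distance in feature space to the closest support KF."""
    support_feats = extract(support_kfs)                        # (K, D)
    d_deep = np.empty(len(frames), dtype=np.float64)
    for start in range(0, len(frames), batch_size):             # batch the video frames
        feats = extract(frames[start:start + batch_size])       # (B, D)
        M = np.abs(feats[:, None, :] - support_feats[None, :, :]).sum(axis=2)  # (B, K)
        d_deep[start:start + batch_size] = M.min(axis=1)        # Equation (5)
    return d_deep
```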

3.3.3. Fusion Distance

We combine the raw and deep distances in a new distance function to improve the performance of KF detection in our problem:
$d = \alpha \hat{d}_{raw} + (1 - \alpha) \hat{d}_{deep}.$  (6)
Since the raw distance $d_{raw}$ and the deep distance $d_{deep}$ have different magnitudes, we re-scale them by dividing each by its maximum value, i.e., $\hat{d}_{raw} = d_{raw} / \max(d_{raw})$ and $\hat{d}_{deep} = d_{deep} / \max(d_{deep})$. The parameter $\alpha$ controls the weight between the re-scaled raw distance $\hat{d}_{raw}$ and the re-scaled deep distance $\hat{d}_{deep}$. With this new fusion distance $d$ defined, the next step is to design a KF selection function $S$ that correctly retrieves KFs.
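In code, the fusion step is a short re-scaling and weighted sum; the sketch below mirrors Equation (6) with the paper's default setting of α = 0.4.

```python
import numpy as np

def fusion_distance(d_raw: np.ndarray, d_deep: np.ndarray, alpha: float = 0.4) -> np.ndarray:
    """Equation (6): re-scale each distance by its maximum, then take a weighted sum."""
    d_raw_hat = d_raw / d_raw.max()
    d_deep_hat = d_deep / d_deep.max()
    return alpha * d_raw_hat + (1 - alpha) * d_deep_hat
```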

3.3.4. Key Frame Selection Mechanism

One straightforward way of extracting KFs is to return frames whose distance score is below a threshold. However, establishing an (arbitrary) threshold is prone to errors and redundancy. Different KFs could have very large distances to the support KFs, leading to incorrect selections under a fixed threshold. Alternatively, several frames with distances below the threshold can belong to the same cow (redundancy). For example (Figure 5), the fusion distance when analyzing a video suggests four frames (circles) would be selected as KFs. However, each circle represents one cow, and only one frame is needed for the best view of the key cow teat frame. To address the redundancy problem, we propose to first sort $d$ in ascending order, iteratively take the frame with the smallest remaining distance as a KF, and then remove the potentially redundant frames within a window of $\pm R$ frames around it. This process is summarized in Algorithm 1. This key frame selection $S$ allows us to uniquely obtain KFs from each cow in the video.
Algorithm 1 Key frame selection mechanism (S)
1: Input: fusion distance d, and redundant frame number R = 500
2: Output: selected key frame numbers Y_S
3: [d_sort, d_index] = ascend-sort(d) // return the sorted distance and its index
4: I = d_index
5: for t = 1 to len(I) do
6:   if I_t != −1 then
7:     tem = I_t
8:     I[(I < (I_t + R)) & (I > (I_t − R))] = −1 // Assign −1 to (±R) of one key frame
9:     I_t = tem
10:  end if
11: end for
12: Y_S = unique(I) // Get unique key frame numbers
13: Y_S[Y_S == −1] = [ ] // Remove −1 from the predicted KFs
14: return Y_S
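In Python, Algorithm 1 amounts to a greedy, non-maximum-suppression-style sweep over frames sorted by fusion distance. The sketch below is our re-implementation of the pseudocode, with variable names of our choosing.

```python
import numpy as np

def select_key_frames(d: np.ndarray, R: int = 500) -> np.ndarray:
    """Algorithm 1: keep the lowest-distance frame, suppress its +/- R neighborhood."""
    order = np.argsort(d)                   # frame indices, smallest distance first
    suppressed = np.zeros(len(d), dtype=bool)
    selected = []
    for idx in order:
        if suppressed[idx]:                 # falls inside the window of an earlier KF
            continue
        selected.append(idx)
        lo, hi = max(0, idx - R), min(len(d), idx + R + 1)
        suppressed[lo:hi] = True            # mark the redundant +/- R window
    return np.array(sorted(selected))
```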

3.3.5. Predicted KFs Quality Check

After generating the KF candidates $Y_S$ with Algorithm 1, we conduct a quality check ($QC$) of the predicted KFs. The most common issue for an incorrect KF candidate is when the milking unit is still attached to the dairy cow or obstructs visualization of the dairy cow teats (as shown in Figure 6a). To enforce selected KFs with a clear view of the teat area, we calculate the structural similarity index (SSIM) [41] score between the approximate teat area of the support KFs and the same area of the selected KFs (the position and size of the teat area is x coordinate = 130, y coordinate = 80, width = 170, height = 190, and remains constant since the camera is in a fixed position). If the SSIM score between the most similar support KF and the selected KF is smaller than the threshold ($O = 0.45$, determined empirically), the selected KF is excluded. Let $L$ be the number of selected KF candidates $Y_S$ and $Y_S^l$ be its $l$-th KF number. We then calculate the SSIM between each selected KF and each support KF in the teat position to form a similarity matrix $H \in \mathbb{R}^{L \times K}$. An element of $H$ is defined as
$H^{lk} = \mathrm{SSIM}(s_k^p, v_{Y_S^l}^p),$  (7)
where $p$ represents the sub-region of the image of greatest clinical relevance, and $v_{Y_S^l}$ is the selected KF image. Finally, we determine the KF numbers with the following equation,
$Y = Y_S^{(\max_r H) \geq O},$  (8)
where $\max_r$ returns the maximum of each row of the similarity matrix $H$. The superscript $(\max_r H) \geq O$ selects the frame numbers for which the highest SSIM score is greater than the threshold $O$.
Figure 6a displays a candidate KF image from $S$ in which the milking unit is still attached to the dairy cow teats. To mitigate this issue, we calculate the SSIM between the current KF and the support KFs within the sub-region using Equation (7) and take the highest SSIM score among all $K$ support KFs to obtain its most similar support KF (Figure 6b). The SSIM score is 0.41, which is lower than the threshold $O = 0.45$. Using this method, we are able to exclude the detected KF in Figure 6a.
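A sketch of the quality check of Equations (7) and (8) is given below. The crop coordinates follow the values stated above; converting the crop to grayscale before computing SSIM and the use of scikit-image are our assumptions, not necessarily the exact implementation in the released code.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.metrics import structural_similarity

def quality_check(frames, support_kfs, candidates, O=0.45,
                  x=130, y=80, w=170, h=190):
    """Keep a candidate KF only if its teat-area SSIM to the closest support KF >= O."""
    crop = lambda img: rgb2gray(img[y:y + h, x:x + w])
    support_crops = [crop(s) for s in support_kfs]
    kept = []
    for idx in candidates:
        cand = crop(frames[idx])
        best = max(structural_similarity(cand, s, data_range=1.0)
                   for s in support_crops)          # row-wise maximum of H, Equation (7)
        if best >= O:                               # threshold test, Equation (8)
            kept.append(idx)
    return np.array(kept)
```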

3.4. UFSKFE Model

Figure 4 depicts the overall framework of our proposed UFSKFE model. Combining all steps in Section 3.3, our UFSKFE model is denoted by the function:
$Y = QC(S(d)),$  (9)
where $QC$ is the quality check, $S$ is the selection mechanism in Algorithm 1, and $d$ is the fusion distance. The overall learning algorithm is shown in Algorithm 2.
Algorithm 2 Unsupervised few-shot key frame extraction
1: Input: Cow teat video V, K = 32 support KFs, weight balance factor α, redundant frame number R and similarity threshold O
2: Output: predicted KFs Y
3: for i = 1 to n_v do
4:   for k = 1 to K do
5:     Compute M_raw^{ik} and M_deep^{ik} according to Equations (2) and (4)
6:   end for
7: end for
8: Calculate d_raw and d_deep according to Equations (3) and (5) and form d using Equation (6)
9: Select KF candidates Y_S using Algorithm 1
10: for l = 1 to len(Y_S) do
11:   for k = 1 to K do
12:     Compute the similarity matrix H according to Equation (7)
13:   end for
14: end for
15: Return the predicted key frame numbers Y using Equation (8)
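The sketch below shows how the helpers sketched in Section 3.3 (raw_distance, deep_distance, fusion_distance, select_key_frames, quality_check; all illustrative names, assumed to be defined as above) could be wired together for one video, following Algorithm 2.

```python
import numpy as np

def ufskfe(frames: np.ndarray, support_kfs: np.ndarray, extract,
           alpha: float = 0.4, R: int = 500, O: float = 0.45) -> np.ndarray:
    """End-to-end UFSKFE sketch for a single video, assuming the helpers sketched earlier."""
    d_raw = raw_distance(frames, support_kfs)                  # Equation (3)
    d_deep = deep_distance(frames, support_kfs, extract)       # Equation (5)
    d = fusion_distance(d_raw, d_deep, alpha)                  # Equation (6)
    candidates = select_key_frames(d, R)                       # Algorithm 1
    return quality_check(frames, support_kfs, candidates, O)   # Equations (7) and (8)
```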

4. Experiments

4.1. Datasets

4.1.1. Data Collection

Approximately eight hours of video footage of dairy cow teats on a commercial dairy farm were obtained using a GoPro 10 camera mounted on a tripod with two adjustable LED lights directed towards the teats. The farm houses approximately 1600 Holstein cows which are milked daily on a 60-stall rotary parlor. The 1691 Holstein dairy cows were housed in free-stall pens and milked three times per day. Cows were in the first (697, 41.2%), second (446, 26.4%), and third or greater lactation (548, 32.4%) and between 1 and 738 days in milk (mean and standard deviation, 185 (113)). All procedures were reviewed and approved by the Cornell University Institutional Animal Care and Use Committee (protocol no. 2013-0064). Videos were sampled at 1080 × 1920 × 3 pixels and 59.94 frames per second and saved in MP4 format. The camera used its default settings, and external lighting was used. The images were acquired immediately after removal of the milking cluster.
The rotational speed of the milking rotary parlor was 8.5 s/stall, leading to a rotation time of 510 s (i.e., 8.5 min). This resulted in a theoretical throughput of 423 cows per hour. The average milking duration to milk the 1600 cows was approximately five hours. The speed of rotation of the milking parlor platform does not affect the accuracy of the camera measurements, provided that the video feed is sampled at a sufficiently high rate; our data were sampled at approximately 60 frames per second. Four milking technicians operated the milking parlor and were assigned to four different positions with the following tasks: position 1, manual forestripping of teats and application of pre-milking teat disinfection; position 2, cleaning and drying of teats with a clean cloth towel; position 3, attachment and alignment of the milking unit; and position 4, application of post-milking teat disinfectant with a dip-applicator cup. Post-milking teat disinfectant was applied by an automatic teat spray robot. Cows were led to the holding area by one farm technician.
Plastic covers protected the tripod and lights and were mounted around the camera to minimize contamination from feces and other debris. The camera feed was displayed continuously and regularly checked to ensure that the lens was not obscured by such contaminants, and the camera lens itself was regularly inspected and cleaned throughout the data collection.

4.1.2. Data Analysis

Table 1 shows the statistics of the cow teat videos analyzed in this study. There are only a few KFs in each cow teat video, which makes KFE difficult. Note that cows do not always occupy all the stalls in the rotating parlor, which explains why fewer key frames are detected in videos 1–10. Note also that the videos are relatively large in file size (2.47 gigabytes on average), with 21,191 frames in each video. Here, the number of KFs was checked with an expert for evaluation purposes. There are usually about 500 frames between two successive KFs unless the parlor rotation is interrupted, the parlor stall is empty, or the milking system obfuscates the teats; for these reasons, the redundant frame number R is set to 500. The computation time should be as short as reasonably possible, as long computation times may delay the assessment of a cow’s teat health. We expect the computation time of any KFE algorithm to be less than an hour per video, which is reasonable in a commercial dairy farm setting.

4.2. Evaluation Metric

We use the F score to evaluate the performance of KFE models [27,29]. The F score uses recall ($Re$) and precision ($Pr$) to measure how much the predicted and ground truth KFs overlap, as given in Equation (10); the higher these metrics, the better the model.
$Re = \frac{N_{corr}}{n_f}, \quad Pr = \frac{N_{corr}}{len(Y)}, \quad F = \frac{2 Re \times Pr}{Re + Pr},$  (10)
where $n_f$ is the number of ground truth KFs (third column in Table 1), $len(Y)$ is the length of the predicted KFs $Y$, and $N_{corr}$ is the number of correctly detected KFs. The closer the F score is to 1 (or 100% in Table 2), the better the model. Since the frames near an annotated KF are similar to it and also contain a clear view of the teat area, we treat a prediction within $\pm 20$ frames (approximately 0.3 s) of an annotated KF as a correct prediction (e.g., a predicted KF number of 120 is correct if the annotated KF number is 100). This tolerance will vary with the video frame rate and the rotation rate of the parlor.
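The evaluation can be sketched as follows. The paper does not spell out the matching procedure, so matching each annotated KF to at most one prediction is our assumption.

```python
import numpy as np

def f_score(pred, gt, tol=20):
    """F score with a +/- tol frame tolerance; each ground truth KF is matched at most once."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    matched = np.zeros(len(gt), dtype=bool)
    n_corr = 0
    for p in pred:
        hits = np.where(~matched & (np.abs(gt - p) <= tol))[0]
        if hits.size:                      # correct prediction within tolerance
            matched[hits[0]] = True
            n_corr += 1
    recall = n_corr / len(gt) if len(gt) else 0.0
    precision = n_corr / len(pred) if len(pred) else 0.0
    return 0.0 if recall + precision == 0 else 2 * recall * precision / (recall + precision)
```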

4.3. Implementation Details

In our UFSKFE model, we utilize ResNet-101 [42] as the pre-trained model and extract deep features from the layer prior to the last fully connected layer. We conducted experiments with 12 different ImageNet models in order to justify selecting ResNet-101 for feature extraction; the performance of the different ImageNet models can be found in Appendix A. Frame image features are extracted with an NVIDIA RTX A6000 GPU with 48 GB of memory. The three hyperparameters are set to $\alpha = 0.4$, $R = 500$ and $O = 0.45$; we also conduct a parameter analysis in Section 4.6. Since there are no existing KFE models directly applicable to our problem, we compare several existing models with different frame image extraction methods. In Section 3.3.2, $\Phi$ refers to the feature extractor from an ImageNet model. We can also extract other features, such as SURF features [9,10], binary image features [43] and Sobel edge detection image features [44], and then calculate $d_{SURF}$, $d_{Binary}$ and $d_{Sobel}$. We replace the fusion distance $d$ with these other distances in Algorithm 1 to predict KFs. The details of feature extraction can be found in Appendix A.1. Results in Table 2 are reported with the additional quality check.
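For reference, extracting penultimate-layer ResNet-101 features with torchvision could look like the sketch below; the resizing and normalization choices are the standard ImageNet preprocessing and may differ from the released code.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"
resnet = models.resnet101(pretrained=True).to(device).eval()
# Drop the final fully connected layer; keep everything up to the global average pool.
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract(frames):
    """frames: iterable of HxWx3 uint8 arrays -> (N, 2048) NumPy feature matrix."""
    batch = torch.stack([preprocess(f) for f in frames]).to(device)
    feats = backbone(batch)               # (N, 2048, 1, 1)
    return feats.flatten(1).cpu().numpy()
```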

4.4. Results

Table 2 shows the performance on all 18 cow teat videos. Compared with all other baselines, our UFSKFE model achieves the highest average F score over all videos. Note that $d_{raw}^{crop}$ has the lowest F score; it only calculates the distance between each video frame and the support KF images in the small teat area and ignores other informative areas, e.g., the cow leg area. We find that the performance of the extracted AlexNet [45] features and NASNetLarge [46] features is similar and lower than that of the ResNet-101 [42] features. One reason is that AlexNet is not a high-performance ImageNet model, and its extracted features might focus on shallow features. In comparison, NASNetLarge is a high-performance ImageNet model, but its extracted features may be overly specialized to ImageNet images. The F score of the SURF features is lower than those of the AlexNet and NASNetLarge features, likely because the SURF features only detect a few important points while ignoring background features. $d_{raw}$ achieves the second-best results, which demonstrates that the raw images also contain important features that deep neural networks do not capture. The performance when using binarized images of the videos is similar to that of $d_{raw}$, since binarized images contain similar important features to the raw frame images. They both perform better than the Sobel edge detection images, likely because the Sobel images predominantly capture edge features. In terms of computation time, deep feature extraction with ResNet-101 is faster than with all other models. Although our UFSKFE model takes longer, since it combines the feature extraction time of ResNet-101 and the raw images, the total average time to extract KFs is less than nine minutes and is faster than extracting SURF features, NASNetLarge features, and $H_{SSIM}$ (details are shown in Appendix A.2). The SSIM similarity selection is not an efficient method, with computation times of more than 1.7 h per video. These extensive results demonstrate that our proposed UFSKFE model can quickly and accurately extract KFs.
Figure 7 shows the KFs detected by our model from the GH060066 video using the fusion distance $d$. There are five true KFs (green dots), while our model detected six frames as KFs (red dots). Although there are differences between the green and red dots, those differences are within $\pm 20$ frames and are considered correct predictions. Figure 8 compares the detected KF images with the ground truth. In our UFSKFE model, only one wrong prediction (frame 2611) is detected. This is likely due to the milking apparatus still being attached to the dairy cow’s teats combined with the low field of view; the quality check process does not remove this detected KF (its similarity score of 0.62 exceeds the threshold $O$). The other two methods, $d_{SURF}$ and $d_{Binary}$, also incorrectly identify this cow’s images as a key frame. Compared with the predicted KF images of other models, UFSKFE has a higher F score, and its frames are closer to the ground truth KF images.

4.5. Ablation Study

To demonstrate the effect of the different components on the final F score, we conduct an ablation study of each component of our proposed UFSKFE model (Table 3) with four randomly selected videos (GH060066, GH030072, GH010066, and GH050066). Since the KF selection function $S$ is always required, we conduct the ablation study with $d_{raw}$, $d_{deep}$, and $QC$: $d_{raw}$ selects the KFs using the raw distance with $S$, $d_{deep}$ selects the KFs using the deep distance with $S$, and $d_{raw} + QC$ conducts a quality check after selecting the KFs of the raw distance. We find that the F score for the fusion distance $d$ is higher than when using $d_{raw}$ or $d_{deep}$ alone. The quality check process is also effective in improving the F score. Therefore, all proposed components demonstrate their effectiveness and importance in this KFE task.

4.6. Parameter Analysis

There are three hyperparameters in our model: the weight balance factor $\alpha$, the redundant frame number $R$ and the similarity threshold $O$. To determine the best parameters, we report the F score of three randomly selected videos while these hyperparameters are varied. $\alpha$ is selected from $\{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0\}$, $R \in \{300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800\}$ and $O \in \{0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6\}$. We vary each parameter independently while keeping the others fixed. From Figure 9a–c, we find that the F score is maximized when $\alpha = 0.4$, $R = 500$ and $O = 0.45$.

5. Discussion of Results and Limitations of UFSKFE Model

Our UFSKFE achieves the highest average F score when compared with other methods. There are three possible reasons why this model performs well. First, the proposed unsupervised few-shot learning paradigm leverages knowledge from a few support KFs to all of the video frames. Second, our proposed fusion distance takes advantage of both raw and deep distances from support frames that represent a diverse range of possible key frames. Third, the quality check process acts as an effective method for removing noisy KF candidates, resulting in a substantially improved overall performance.
A limitation of our proposed UFSKFE model is that it cannot remove some incorrect KFs, primarily those images where the milking apparatus remains attached to the dairy cows. Although our quality check can remove some of these images (Figure 6a), the removal of such images becomes more challenging when the camera’s field of view does not adequately image the cows’ teats. This could be circumvented by re-positioning the camera in portrait rather than landscape orientation when collecting the videos. In addition, we only extract one key frame per cow, while a clear view of each teat of the same cow can come from different frames; we could therefore consider extracting key frames for each teat. Exploring other methods for extracting features from video frames in few-shot learning may also be of value in our efforts to improve performance. Furthermore, our process could be performed in real time if the recorded video could be stored directly in the cloud. Future work will focus on the use of other machine learning approaches to assess the extent of hyperkeratosis and the risk of mastitis.
The performance of our key-frame extraction methodology may also be influenced by farm- and cow-related factors. With regard to the farm itself, the lighting conditions and the cleanliness of the farm, stalls, and parlors could affect performance. The rotary parlor is housed inside a large complex, which mitigates the effects of weather, lighting, and other environmental factors that could affect the quality of the video data. Variations from best milking practices, such as inconsistent cleaning of the teat ends, could similarly affect performance. The frequency of key frames with the milking unit still attached to the cow will depend on the settings of the milking system (vacuum pressure and detachment), the parlor rotational speed, and the location of the camera. Finally, the performance of the key-frame extraction method will depend on the size of the dataset. Our model was developed using only 32 labeled key frames. While additional labeled data could improve overall performance, our findings suggest that in the commercial dairy farm setting where such rotating parlors are used, key frames from only a very small fraction of the herd are necessary when using our automated key-frame extraction technique.

6. Conclusions

In this paper, we propose a novel unsupervised few-shot learning key frame extraction model for cow teat videos. We combine the raw and deep distances between each video frame and the support key frame images to form a fusion distance that better captures their differences. An efficient key frame selection mechanism is proposed to first determine the key frame candidates, followed by a quality check procedure to refine the predicted key frames. Extensive experimental results demonstrate that the proposed UFSKFE model can accurately and efficiently extract the key cow teat frames. Our approach provides an opportunity to reduce the redundancy of processing large videos. The extracted key teat-end frames can be collected to monitor the health status of dairy cows.

Author Contributions

Conceptualization, Y.Z. and P.S.B.; methodology, Y.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, P.S.B. and M.W.; supervision, P.S.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Cornell Initiative for Digital Agriculture (CIDA).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Source code and datasets will be available at https://github.com/YoushanZhang/UFSKFE (accessed on 17 May 2022).

Acknowledgments

The authors thank the farm owners and their employees for their willingness to participate in the study and their support during the data collection.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1

To calculate the deep distance, we need to extract deep features for video frames from pre-trained ImageNet models. To select the best ImageNet model for feature extraction from our cow teat videos, we conducted extensive experiments with 12 frequently used ImageNet models, using the layer prior to the last fully connected layer to extract deep features. These 12 ImageNet models are AlexNet [45], VGG16 [47], VGG19 [47], GoogLeNet [48], DenseNet-201 [49], ResNet-18 [42], ResNet-50 [42], ResNet-101 [42], Inception-V3 [50], Xception [51], InceptionResNet-V2 [52], and NASNetLarge [46].
We also utilize t-SNE [53] to visualize the extracted deep features in 2D space, as shown in Figure A1, but it is still difficult to select the best pre-trained ImageNet model from these plots. We thus plot the projection loss from the high-dimensional space to the 2D space of the t-SNE model in Figure A2. We observe that ResNet-101 has the smallest projection loss among the 12 models, which suggests that ResNet-101 is a suitable ImageNet model for extracting deep features. However, this alone does not tell us whether ResNet-101 features perform better than other deep features on our key frame extraction problem. We thus report the performance of all 12 models in Table A1: we first calculate these deep distances, use the key frame selection $S$ to select the key frame candidates, and then perform the quality check to remove noisy key frames. We find that the deep ResNet-101 distance indeed achieves a higher F score than the other models.
Figure A1. T-SNE visualization of extracted features of 12 ImageNet models from GH060066 video. Blue represents video frames, while green dots are the key frame image position.
Figure A2. T-SNE projection loss of different ImageNet models. Y-axis denotes the projected loss from high dimension space to the 2D space.

Appendix A.2. Other Baseline Features

Here, we provide the details of extracting features for the other baselines. Figure A3 shows the SURF detected points, the binary image, and the Sobel edge detection image. In Figure A3b, we only show the 10 strongest SURF points, while a total of 500 points are extracted from each video frame and support key frame image, with 64 features for each point. We therefore extract 500 × 64 = 32,000 SURF features for each image. The SURF distance is defined as follows:
$d_{SURF} = \min_r M_{SURF}, \quad M_{SURF}^{ik} = |\phi(s_k) - \phi(v_i)|_1,$
where $\phi$ refers to the SURF feature extractor and $\phi(\cdot) \in \mathbb{R}^{1 \times 32{,}000}$. In Figure A3c, we calculate the distance between the binary image of each video frame and the support key frame images. The binary distance is defined as follows:
$d_{Binary} = \min_r M_{binary}, \quad M_{binary}^{ik} = |B(s_k) - B(v_i)|_1,$
where $B$ refers to obtaining the binary image. In Figure A3d, we calculate the distance between the Sobel edge detection images of each video frame and the support key frame images. The Sobel distance is defined as follows:
$d_{Sobel} = \min_r M_{Sobel}, \quad M_{Sobel}^{ik} = |E(s_k) - E(v_i)|_1,$
where $E$ refers to obtaining an edge detection image using the Sobel algorithm. After obtaining the SURF distance $d_{SURF}$, the binary distance $d_{Binary}$ and the Sobel distance $d_{Sobel}$, we use the key frame selection $S$ (with the fusion distance $d$ replaced by each of these three distances, respectively) and perform the quality check process to obtain the final extracted key frames.
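For illustration, the binary and Sobel representations can be obtained with OpenCV as sketched below; the Otsu thresholding and the 3 × 3 Sobel kernel are our assumptions, not necessarily the settings used by the authors.

```python
import cv2
import numpy as np

def binary_image(img_bgr):
    """Binarize a frame with Otsu thresholding (assumed thresholding scheme)."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

def sobel_image(img_bgr):
    """Sobel gradient magnitude image (assumed 3x3 kernel)."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    return cv2.magnitude(gx, gy)

def baseline_distance(frames, support_kfs, transform):
    """L1 distance of each transformed frame to its closest transformed support KF."""
    support = [transform(s).astype(np.float64) for s in support_kfs]
    return np.array([min(np.abs(transform(f).astype(np.float64) - s).sum()
                         for s in support) for f in frames])
```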
Figure A3. Comparison of the raw image, the 10 strongest SURF points, the binary image, and the Sobel edge detection image.
We visualize the extracted SURF features (Figure A4) using t-SNE. The key frame features are indistinguishable from the non-key frame features (blue dots), similar to Figure A1. The SURF features also have a higher projection loss (1.819) than the ImageNet models, which implies that the performance of the SURF features might be lower than that of the different ImageNet models. Indeed, the average F score of SURF is 32.3 (Table 2 of the main paper), which is lower than that of most ImageNet models.
Figure A4. T-SNE visualization of extracted SURF features.
In Table 2 of the main paper, we also present the result of $H_{SSIM}$, which uses the SSIM similarity matrix to determine key frames. Specifically, we calculate the SSIM score between the cropped teat area of each video frame image and each support key frame image, select the highest score to detect the key frames, and then perform the quality check process. The SSIM similarity matrix is defined as follows:
$H_{SSIM} = \max_r h_{SSIM}, \quad h_{SSIM}^{ik} = \mathrm{SSIM}(s_k^p, v_i^p),$
where $p$ represents the teat position area, and $\max_r$ returns the maximum of each row of the similarity matrix $h_{SSIM} \in \mathbb{R}^{n_v \times K}$. Hence, $H_{SSIM} \in \mathbb{R}^{n_v \times 1}$. We then have a slightly modified key frame selection function $S$ to determine the key frame candidates, as in Algorithm A1. There are two changes: first, the input is not the fusion distance $d$ but the SSIM similarity matrix $H_{SSIM}$; second, we sort $H_{SSIM}$ in descending order, since a more similar teat area is more likely to be a key frame. Figure A5 shows the process of detecting key frames using this key frame selection function ($S$) on the GH060066 video with the similarity matrix $H_{SSIM}$.
Algorithm A1 Key frame selection mechanism (S)
1: Input: SSIM similarity matrix H_SSIM, and redundant frame number R = 500
2: Output: selected key frame numbers Y_S
3: [H_sort, H_index] = descend-sort(H_SSIM) // return the sorted similarity and its index
4: I = H_index
5: for t = 1 to len(I) do
6:   if I_t != −1 then
7:     tem = I_t
8:     I[(I < (I_t + R)) & (I > (I_t − R))] = −1 // Assign −1 to (±R) of one key frame
9:     I_t = tem
10:  end if
11: end for
12: Y_S = unique(I) // Get unique key frame numbers
13: Y_S[Y_S == −1] = [ ] // Remove −1 from the predicted key frames
14: return Y_S
Figure A5. Key frame extraction with the $H_{SSIM}$ model of the GH060066 cow teat video. $H_{SSIM}$ is the similarity matrix. Red dots are the detected key frames, while green dots are the ground truth key frames.
Table A1. F score (%) and computation time (s) of 12 different ImageNet models (IR: InceptionResNet-V2, NAST: NASNetLarge). Each column reports the deep distance $d_{deep}$ computed with the given backbone as F score / time.

| Videos | AlexNet | VGG16 | VGG19 | GoogLeNet | DenseNet-201 | ResNet-18 |
|---|---|---|---|---|---|---|
| GH060066 | 72.7 / 14.4 | 72.7 / 25.9 | 36.4 / 28.4 | 18.2 / 11.8 | 36.4 / 42.8 | 54.5 / 21.3 |
| GH010063 | 30.8 / 19.0 | 14.3 / 30.5 | 28.6 / 31.0 | 28.6 / 15.7 | 15.4 / 52.4 | 14.3 / 15.6 |
| GH010067 | 22.2 / 18.0 | 60.0 / 27.0 | 20.0 / 27.6 | 66.7 / 14.1 | 22.2 / 46.3 | 60.0 / 14.2 |
| GH020070 | 12.5 / 21.6 | 37.5 / 35.6 | 37.5 / 36.5 | 12.5 / 17.8 | 37.5 / 62.1 | 37.5 / 18.0 |
| GH010069 | 47.1 / 23.9 | 58.8 / 40.4 | 47.1 / 40.7 | 58.8 / 19.9 | 47.1 / 70.6 | 35.3 / 19.8 |
| GH010068 | 52.2 / 32.3 | 27.3 / 56.9 | 43.5 / 61.8 | 43.5 / 28.6 | 52.2 / 101.9 | 43.5 / 25.0 |
| GH010065 | 17.5 / 84.2 | 38.6 / 137.8 | 45.6 / 134.3 | 25.0 / 67.3 | 17.5 / 233.5 | 38.6 / 65.3 |
| GH020071 | 19.2 / 145.3 | 33.3 / 228.1 | 40.5 / 229.4 | 31.0 / 132.3 | 33.3 / 424.3 | 38.4 / 152.3 |
| GH030072 | 25.0 / 181.3 | 8.2 / 228.0 | 32.9 / 242.0 | 22.9 / 142.0 | 25.0 / 428.4 | 35.1 / 157.3 |
| GH040066 | 69.7 / 280.3 | 74.2 / 333.6 | 71.1 / 359.4 | 70.5 / 255.0 | 71.9 / 526.4 | 75.0 / 248.7 |
| GH010071 | 30.9 / 299.4 | 34.7 / 338.3 | 47.4 / 341.9 | 29.2 / 253.2 | 20.6 / 509.7 | 42.4 / 249.2 |
| GH030066 | 43.1 / 282.0 | 33.7 / 322.4 | 39.6 / 343.0 | 33.7 / 242.1 | 26.0 / 505.8 | 51.5 / 249.2 |
| GH010070 | 23.3 / 311.8 | 36.2 / 331.1 | 45.7 / 348.8 | 19.6 / 229.3 | 40.4 / 516.8 | 45.7 / 248.6 |
| GH020072 | 23.1 / 408.7 | 27.2 / 365.4 | 32.4 / 334.9 | 13.6 / 243.3 | 37.3 / 518.6 | 42.3 / 252.1 |
| GH010066 | 36.0 / 327.3 | 41.6 / 325.4 | 45.1 / 338.0 | 33.3 / 247.8 | 31.7 / 495.6 | 45.1 / 252.0 |
| GH020066 | 34.3 / 300.4 | 44.9 / 326.1 | 45.3 / 329.8 | 41.9 / 292.2 | 39.3 / 510.0 | 41.5 / 252.1 |
| GH010072 | 17.1 / 293.5 | 31.8 / 327.3 | 44.9 / 335.6 | 15.1 / 293.2 | 27.2 / 511.3 | 44.9 / 241.7 |
| GH050066 | 47.5 / 277.6 | 54.9 / 330.8 | 54.9 / 347.1 | 21.8 / 264.9 | 25.7 / 511.9 | 47.1 / 250.2 |
| Ave | 34.7 / 184.5 | 40.6 / 211.7 | 42.1 / 217.2 | 32.6 / 153.9 | 33.7 / 337.1 | 44.0 / 151.8 |

| Videos | ResNet-50 | ResNet-101 | Inception-V3 | Xception | IR | NAST |
|---|---|---|---|---|---|---|
| GH060066 | 54.5 / 14.3 | 90.9 / 20.8 | 72.7 / 22.5 | 18.2 / 29.5 | 36.4 / 30.9 | 60.0 / 57.7 |
| GH010063 | 14.3 / 18.0 | 28.6 / 20.4 | 14.3 / 28.0 | 14.3 / 37.7 | 0.0 / 36.3 | 42.9 / 94.0 |
| GH010067 | 60.0 / 16.0 | 60.0 / 17.7 | 60.0 / 24.7 | 60.0 / 33.2 | 22.2 / 33.2 | 44.4 / 73.1 |
| GH020070 | 50.0 / 20.9 | 37.5 / 23.0 | 40.0 / 33.2 | 13.3 / 44.4 | 40.0 / 43.9 | 0.0 / 99.9 |
| GH010069 | 47.1 / 22.8 | 58.8 / 26.5 | 12.5 / 39.5 | 35.3 / 50.2 | 37.5 / 49.2 | 0.0 / 122.4 |
| GH010068 | 52.2 / 40.3 | 69.6 / 34.2 | 60.9 / 53.7 | 17.4 / 66.4 | 17.4 / 65.3 | 34.8 / 158.2 |
| GH010065 | 45.6 / 80.9 | 42.1 / 86.7 | 38.6 / 121.2 | 28.1 / 161.0 | 18.2 / 159.5 | 35.1 / 354.4 |
| GH020071 | 32.9 / 140.4 | 43.8 / 188.0 | 28.2 / 218.0 | 28.6 / 250.4 | 14.1 / 268.5 | 33.3 / 554.0 |
| GH030072 | 32.0 / 155.9 | 44.8 / 181.5 | 22.5 / 239.2 | 16.7 / 287.8 | 16.9 / 278.2 | 16.7 / 593.7 |
| GH040066 | 75.0 / 266.7 | 74.2 / 268.7 | 68.9 / 343.4 | 68.2 / 424.3 | 71.9 / 418.9 | 71.9 / 810.6 |
| GH010071 | 50.5 / 269.1 | 53.1 / 272.9 | 35.4 / 369.2 | 16.5 / 440.5 | 20.8 / 402.8 | 30.6 / 807.6 |
| GH030066 | 39.6 / 300.3 | 49.0 / 289.0 | 30.6 / 342.7 | 31.7 / 435.8 | 28.3 / 423.5 | 37.6 / 777.9 |
| GH010070 | 39.6 / 275.0 | 34.3 / 277.9 | 32.4 / 338.3 | 40.0 / 434.7 | 21.2 / 427.0 | 45.7 / 799.3 |
| GH020072 | 45.7 / 287.1 | 46.2 / 281.8 | 15.8 / 326.8 | 31.1 / 420.8 | 14.0 / 425.6 | 29.7 / 818.0 |
| GH010066 | 52.9 / 259.9 | 48.5 / 281.9 | 32.3 / 361.1 | 38.4 / 439.0 | 14.1 / 399.1 | 43.6 / 799.4 |
| GH020066 | 50.5 / 248.3 | 58.5 / 285.2 | 39.6 / 360.4 | 39.3 / 421.7 | 19.4 / 422.8 | 35.5 / 804.0 |
| GH010072 | 44.9 / 272.7 | 37.4 / 286.7 | 36.9 / 344.3 | 21.2 / 423.2 | 29.1 / 448.8 | 36.2 / 780.3 |
| GH050066 | 47.1 / 269.2 | 59.6 / 286.1 | 33.3 / 332.2 | 36.4 / 423.0 | 22.4 / 437.2 | 25.7 / 785.5 |
| Ave | 46.4 / 164.3 | 52.1 / 173.8 | 37.5 / 216.6 | 30.8 / 268.0 | 24.7 / 265.0 | 34.7 / 516.1 |

Appendix A.3. Other Ablation Study

In this section, we examine small variants of our UFSKFE model. As shown in Table A2, $d_{deep}^{ResNet101-p}$ refers to calculating the deep distance using only the cropped teat area position. Feature $L_2$ norm means balancing the scales of the raw distance and the deep distance using the L2 norm: in the main paper, we use $\hat{d}_{raw} = d_{raw} / \max(d_{raw})$ and $\hat{d}_{deep} = d_{deep} / \max(d_{deep})$ to balance the scale between them, whereas here we instead use $\hat{d}_{raw} = d_{raw} / \|d_{raw}\|_2$ and $\hat{d}_{deep} = d_{deep} / \|d_{deep}\|_2$. The raw $L_2$ distance means that we calculate the L2 distance in Equation (2) of the main paper, i.e., $M_{raw}^{ik} = \|s_k - v_i\|_2$.
Table A2. Ablation study of different variants of UFSKFE.

| Video Name | $d_{deep}^{ResNet101-p}$ | Feature $L_2$ Norm | Raw $L_2$ Distance |
|---|---|---|---|
| GH060066 | 54.5 | 90.9 | 54.5 |
| GH010063 | 14.3 | 57.1 | 57.1 |
| GH010067 | 60.0 | 44.4 | 44.4 |
| GH020070 | 37.5 | 50.0 | 75.0 |
| GH010069 | 11.8 | 58.8 | 58.8 |
| GH010068 | 26.1 | 69.6 | 69.6 |
| GH010065 | 39.3 | 52.6 | 37.0 |
| GH020071 | 30.6 | 54.1 | 60.3 |
| GH030072 | 19.7 | 46.6 | 36.6 |
| GH040066 | 73.3 | 73.3 | 68.9 |
| GH010071 | 24.5 | 57.7 | 52.1 |
| GH030066 | 38.0 | 49.0 | 56.9 |
| GH010070 | 17.6 | 52.9 | 48.5 |
| GH020072 | 25.0 | 48.5 | 53.8 |
| GH010066 | 21.6 | 62.0 | 56.0 |
| GH020066 | 24.5 | 56.6 | 56.6 |
| GH010072 | 11.5 | 57.9 | 50.9 |
| GH050066 | 35.6 | 64.7 | 57.1 |
| Ave | 31.4 | 58.2 | 55.2 |
From Table A2, we find that the performance of $d_{deep}^{ResNet101-p}$ (31.4) is much lower than that of $d_{deep}^{ResNet101}$ (52.1). The reason is that the small teat area tends to ignore other important background features. The performance of the feature $L_2$ norm (58.15) is also lower than that of the simple $L_1$ norm (63.6 in Table 2). In addition, the F score of the raw $L_2$ distance (55.2) is slightly lower than that of the $L_1$ distance (55.4, Table 2). We can conclude that all of the proposed strategies in our UFSKFE model are effective in improving the accuracy of key frame extraction in cow teat videos.

Appendix A.4. Cow Teat Process Video

We attach a demo video, UFSKFE_GH060066.mp4, which demonstrates the process of our UFSKFE model detecting key frames in the GH060066 video. We first plot the fusion distance of all frames and then show the six extracted key frames. The demo is accelerated by skipping every 10 frames, which leads to the apparent oscillation of the demo video. The actual computation time of our model for the GH060066 video is 65.3 s.

References

  1. Reinemann, D.; Rasmussen, M.; LeMire, S.; Neijenhuis, F.; Mein, G.; Hillerton, J.; Morgan, W.; Timms, L.; Cook, N.; Farnsworth, R.; et al. Evaluation of bovine teat condition in commercial dairy herds: 3. Getting the numbers right. In Proceedings of the 2nd International Symposium on Mastitis and Milk Quality, NMC/AABP, Vancouver, BC, Canada, 12–14 September 2001; pp. 357–361.
  2. Basran, P.S.; Wieland, M.; Porter, I.R. A digital technique and platform for assessing dairy cow teat-end condition. J. Dairy Sci. 2020, 103, 10703–10708.
  3. Porter, I.R.; Wieland, M.; Basran, P.S. Feasibility of the use of deep learning classification of teat-end condition in Holstein cattle. J. Dairy Sci. 2021, 104, 4529–4536.
  4. Zhang, Y.; Porter, I.R.; Wieland, M.; Basran, P.S. Separable Confident Transductive Learning for Dairy Cows Teat-End Condition Classification. Animals 2022, 12, 886.
  5. Wolf, W. Key frame selection by motion analysis. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings; IEEE: Atlanta, GA, USA, 1996; Volume 2, pp. 1228–1231.
  6. Kulhare, S.; Sah, S.; Pillai, S.; Ptucha, R. Key frame extraction for salient activity recognition. In 2016 23rd International Conference on Pattern Recognition (ICPR); IEEE: Cancun, Mexico, 2016; pp. 835–840.
  7. Guan, G.; Wang, Z.; Lu, S.; Da Deng, J.; Feng, D.D. Keypoint-based keyframe selection. IEEE Trans. Circuits Syst. Video Technol. 2012, 23, 729–734.
  8. Hannane, R.; Elboushaki, A.; Afdel, K.; Naghabhushan, P.; Javed, M. An efficient method for video shot boundary detection and keyframe extraction using SIFT-point distribution histogram. Int. J. Multimed. Inf. Retr. 2016, 5, 89–104.
  9. Luo, Y.; Zhou, H.; Tan, Q.; Chen, X.; Yun, M. Key frame extraction of surveillance video based on moving object detection and image similarity. Pattern Recognit. Image Anal. 2018, 28, 225–231.
  10. Yu, L.; Cao, J.; Chen, M.; Cui, X. Key frame extraction scheme based on sliding window and features. Peer-to-Peer Netw. Appl. 2018, 11, 1141–1152.
  11. Zhuang, Y.; Rui, Y.; Huang, T.S.; Mehrotra, S. Adaptive key frame extraction using unsupervised clustering. In Proceedings of the 1998 International Conference on Image Processing, Chicago, IL, USA, 7 October 1998; Volume 1, pp. 866–870.
  12. Mendi, E.; Bayrak, C. Shot boundary detection and key-frame extraction from neurosurgical video sequences. Imaging Sci. J. 2012, 60, 90–96.
  13. Priya, G.L.; Domnic, S. Shot based keyframe extraction for ecological video indexing and retrieval. Ecol. Inform. 2014, 23, 107–117.
  14. Vázquez-Martín, R.; Bandera, A. Spatio-temporal feature-based keyframe detection from video shots using spectral clustering. Pattern Recognit. Lett. 2013, 34, 770–779.
  15. Ioannidis, A.; Chasanis, V.; Likas, A. Weighted multi-view key-frame extraction. Pattern Recognit. Lett. 2016, 72, 52–61.
  16. Lee, Y.J.; Ghosh, J.; Grauman, K. Discovering important people and objects for egocentric video summarization. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1346–1353.
  17. Gygli, M.; Grabner, H.; Riemenschneider, H.; Van Gool, L. Creating summaries from user videos. In European Conference on Computer Vision; Springer: Zurich, Switzerland, 2014; pp. 505–520.
  18. Yao, P. Key Frame Extraction Method of Music and Dance Video Based on Multicore Learning Feature Fusion. Sci. Program. 2022, 2022, 9735392.
  19. Zhang, K.; Chao, W.L.; Sha, F.; Grauman, K. Video summarization with long short-term memory. In European Conference on Computer Vision; Springer: Amsterdam, The Netherlands, 2016; pp. 766–782.
  20. Zhao, B.; Li, X.; Lu, X. Hierarchical recurrent neural network for video summarization. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 863–871.
  21. Zhao, B.; Li, X.; Lu, X. TTH-RNN: Tensor-train hierarchical recurrent neural network for video summarization. IEEE Trans. Ind. Electron. 2020, 68, 3629–3637.
  22. Fajtl, J.; Sokeh, H.S.; Argyriou, V.; Monekosso, D.; Remagnino, P. Summarizing videos with attention. In Asian Conference on Computer Vision; Springer: Perth, Australia, 2018; pp. 39–54.
  23. Li, P.; Ye, Q.; Zhang, L.; Yuan, L.; Xu, X.; Shao, L. Exploring global diverse attention via pairwise temporal relation for video summarization. Pattern Recognit. 2021, 111, 107677.
  24. Jian, M.; Zhang, S.; Wu, L.; Zhang, S.; Wang, X.; He, Y. Deep key frame extraction for sport training. Neurocomputing 2019, 328, 147–156.
  25. Yuan, Y.; Lu, Z.; Yang, Z.; Jian, M.; Wu, L.; Li, Z.; Liu, X. Key frame extraction based on global motion statistics for team-sport videos. Multimed. Syst. 2021, 28, 387–401.
  26. Yang, H.; Wang, B.; Lin, S.; Wipf, D.; Guo, M.; Guo, B. Unsupervised extraction of video highlights via robust recurrent auto-encoders. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: Santiago, Chile, 2015; pp. 4633–4641.
  27. Mahasseni, B.; Lam, M.; Todorovic, S. Unsupervised video summarization with adversarial lstm networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Honolulu, HI, USA, 2017; pp. 202–211.
  28. Yuan, L.; Tay, F.E.; Li, P.; Zhou, L.; Feng, J. Cycle-sum: Cycle-consistent adversarial lstm networks for unsupervised video summarization. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI: Honolulu, HI, USA, 2019; Volume 33, pp. 9143–9150.
  29. Yan, X.; Gilani, S.Z.; Feng, M.; Zhang, L.; Qin, H.; Mian, A. Self-supervised learning to detect key frames in videos. Sensors 2020, 20, 6941.
  30. Li, Y.; Luo, X.; Hou, S.; Li, C.; Yin, G. End-to-end Network Embedding Unsupervised Key Frame Extraction for Video-based Person Re-identification. In 11th International Conference on Information Science and Technology (ICIST); IEEE: Kopaonik, Serbia, 2021; pp. 404–410.
  31. Elahi, G.M.E.; Yang, Y.H. Online learnable keyframe extraction in videos and its application with semantic word vector in action recognition. Pattern Recognit. 2022, 122, 108273.
  32. Ravi, S.; Larochelle, H. Optimization as a model for few-shot learning. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, 24–26 April 2017.
  33. Snell, J.; Swersky, K.; Zemel, R.S. Prototypical Networks for Few-Shot Learning. arXiv 2017, arXiv:1703.05175.
  34. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Salt Lake City, UT, USA, 2018; pp. 1199–1208.
  35. Oreshkin, B.; Rodríguez López, P.; Lacoste, A. Tadam: Task dependent adaptive metric for improved few-shot learning. Adv. Neural Inf. Process. Syst. 2018, 31.
  36. Gidaris, S.; Bursuc, A.; Komodakis, N.; Pérez, P.; Cord, M. Boosting few-shot visual learning with self-supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Seoul, Korea, 2019; pp. 8059–8068.
  37. Hong, J.; Fang, P.; Li, W.; Zhang, T.; Simon, C.; Harandi, M.; Petersson, L. Reinforced attention for few-shot learning and beyond. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Nashville, TN, USA, 2021; pp. 913–923.
  38. Wei, R.; Mahmood, A. Optimizing Few-Shot Learning Based on Variational Autoencoders. Entropy 2021, 23, 1390.
  39. Hsu, K.; Levine, S.; Finn, C. Unsupervised Learning via Meta-Learning. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
  40. Ji, Z.; Zou, X.; Huang, T.; Wu, S. Unsupervised few-shot feature learning via self-supervised training. Front. Comput. Neurosci. 2020, 14.
  41. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  42. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Las Vegas, NV, USA, 2016; pp. 770–778. [Google Scholar]
  43. Mentzelopoulos, M.; Psarrou, A. Key-frame extraction algorithm using entropy difference. In Proceedings of the 6th ACM SIGMM International Workshop on Multimedia Information Retrieval; ACM: New York, NY, USA, 2004; pp. 39–45. [Google Scholar]
  44. Nandini, H.M.; Chethan, H.K.; Rashmi, B.S. Shot based keyframe extraction using edge-LBP approach. J. King Saud Univ. Comput. Inf. Sci. 2020; in press. [Google Scholar] [CrossRef]
  45. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems; Morgan Kaufmann Publishers: Lake Tahoe, NV, USA, 2012; pp. 1097–1105. [Google Scholar]
  46. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Salt Lake City, UT, USA, 2018; pp. 8697–8710. [Google Scholar]
  47. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  48. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Boston, MA, USA, 2015; pp. 1–9. [Google Scholar]
  49. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Honolulu, HI, USA, 2017; pp. 4700–4708. [Google Scholar]
  50. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Las Vegas, NV, USA, 2016; pp. 2818–2826. [Google Scholar]
  51. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Honolulu, HI, USA, 2017; pp. 1251–1258. [Google Scholar]
  52. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence; AAAI: San Francisco, CA, USA, 2017. [Google Scholar]
  53. van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. The milking machine of a large dairy farm. Videos are recorded while the parlor rotates. The two cameras shown on the left correspond to the video camera positioned below the rotary parlor.
Figure 2. Differences between existing supervised key frame extraction (KFE), unsupervised KFE, and our proposed unsupervised few-shot KFE (UFSKFE) model.
Figure 3. Six sample key frames (KFs) in the cow teat video. These KFs should provide a clean, unambiguous, and high-resolution image of the dairy cow teats for clinical diagnosis, suppress similar frames, and be diverse enough to reduce redundancy.
Figure 4. The scheme of our proposed unsupervised few-shot key frame extraction (UFSKFE) model. We first calculate the raw distance d_raw between each video frame image and a few support key frame (KF) images. Second, we employ a pre-trained CNN (ResNet-101) to extract deep features for the video frame images, Φ(V), and the 32 support key frames, Φ(S), and then calculate the deep distance d_deep. Lastly, we perform a quality check (QC) to select KFs for each video with a smaller fusion distance (d).
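To make the distance computation in Figure 4 concrete, the sketch below computes a fused L1 distance between one video frame and the support key frames from raw pixels and ResNet-101 features. It is a minimal illustration under our own assumptions (input tensors of shape (3, 224, 224) in [0, 1], a weighted-sum fusion with α, and taking the minimum over the support set); it is not the authors' exact implementation.

```python
# Minimal sketch of the Figure 4 distances; the weighted-sum fusion and the minimum
# over the support set are assumptions made for illustration.
import torch
import torchvision.models as models
import torchvision.transforms as T

backbone = models.resnet101(pretrained=True)
backbone.fc = torch.nn.Identity()   # expose the 2048-d penultimate features
backbone.eval()

normalize = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

@torch.no_grad()
def fusion_distance(frame, support, alpha=0.4):
    """Fused L1 distance between one frame (3, H, W) and S support KFs (S, 3, H, W)."""
    # d_raw: mean absolute pixel difference to each support key frame
    d_raw = (frame.unsqueeze(0) - support).abs().mean(dim=(1, 2, 3))
    # d_deep: mean absolute difference of ResNet-101 features
    feat_frame = backbone(normalize(frame).unsqueeze(0))
    feat_support = backbone(torch.stack([normalize(s) for s in support]))
    d_deep = (feat_frame - feat_support).abs().mean(dim=1)
    # fusion distance d, weighted by alpha; keep the closest support key frame
    d = alpha * d_raw + (1 - alpha) * d_deep
    return d.min().item()
```

With α = 0.4 (the best-performing value in Figure 9), frames with a small fused distance to any support key frame become key frame candidates.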
Figure 5. Threshold-based KFE. The red-circled frames are the selected KFs.
Figure 6. Quality check between a detected key frame (a), which shows the milking apparatus still attached to the dairy cow, and its closest support frame (b). SSIM is computed over a fixed region of interest within the frame (red and green rectangles). Key frame (a) does not pass the quality check because its SSIM score is lower than the predetermined threshold.
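The quality check in Figure 6 compares a detected key frame with its closest support frame using SSIM over a fixed region of interest. Below is a minimal sketch, assuming grayscale inputs, a hand-chosen crop, and a placeholder threshold of 0.5; the crop coordinates, file paths, and threshold are illustrative, not the paper's values.

```python
# Sketch of the SSIM-based quality check; ROI and threshold are placeholders.
import cv2
from skimage.metrics import structural_similarity as ssim

def passes_quality_check(candidate_path, support_path, roi, threshold=0.5):
    top, bottom, left, right = roi
    cand = cv2.imread(candidate_path, cv2.IMREAD_GRAYSCALE)[top:bottom, left:right]
    supp = cv2.imread(support_path, cv2.IMREAD_GRAYSCALE)[top:bottom, left:right]
    score = ssim(cand, supp, data_range=255)   # structural similarity over the fixed ROI
    return score >= threshold                  # reject frames that deviate too much from any support KF

# Example with hypothetical paths and crop:
# passes_quality_check("frames/detected_0123.png", "support/kf_07.png", roi=(100, 400, 200, 600))
```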
Figure 7. Key frame numbers extracted by our UFSKFE model from the GH060066 cow teat video. d is the fusion distance to the support KFs. Red dots are the detected KFs, while green dots are the true KFs.
Figure 8. Comparison of the key frames extracted from the GH060066 video by different methods. A ✓ marks a correct prediction and a ✕ marks a wrong prediction. The number below each image is the video frame number. The F score of each method is also reported. UFSKFE achieves the highest F score.
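Figure 8 and Table 2 report F scores over the detected key frames. The helper below shows one way such a score could be computed, assuming a detection counts as correct when it falls within a small tolerance of an unmatched annotated key frame; the matching rule and tolerance are our assumptions, not the paper's stated protocol.

```python
def key_frame_f_score(detected, annotated, tol=0):
    """F score for detected key frame indices against annotated ones.
    A detection is a true positive if it lies within `tol` frames of an
    annotated key frame that has not already been matched."""
    remaining = sorted(annotated)
    tp = 0
    for d in sorted(detected):
        match = next((a for a in remaining if abs(a - d) <= tol), None)
        if match is not None:
            remaining.remove(match)
            tp += 1
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(annotated) if annotated else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: two of three detections fall within 5 frames of an annotation -> F ~= 0.67
# key_frame_f_score([100, 520, 950], [98, 523, 1400], tol=5)
```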
Figure 9. Parameter analysis for α, R, and O on the F score. When α = 0.4, R = 500, and O = 0.45, the average F score reaches its maximum.
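The parameter study in Figure 9 sweeps α, R, and O and reports the average F score. A minimal grid-search sketch is given below; the parameter grids and the evaluate() helper (which would run UFSKFE over all 18 videos and average the F score) are hypothetical placeholders, not part of the paper's code.

```python
# Hypothetical grid search over (alpha, R, O); evaluate() is a stand-in for running
# UFSKFE on every video and averaging key_frame_f_score over the 18 videos.
from itertools import product

def evaluate(alpha, R, O):
    return 0.0   # placeholder: plug in the real extraction pipeline here

alphas = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
Rs = [100, 300, 500, 700, 900]
Os = [0.35, 0.40, 0.45, 0.50, 0.55]

scores = {(a, r, o): evaluate(a, r, o) for a, r, o in product(alphas, Rs, Os)}
best = max(scores, key=scores.get)
print("best (alpha, R, O):", best, "average F:", scores[best])
```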
Table 1. Statistics of cow teat videos (M: megabyte, G: gigabyte).

# | Video Name | # Frames | # Key Frames | Memory Size
1 | GH060066 | 3,192 | 6 | 382 M
2 | GH010063 | 3,985 | 7 | 478 M
3 | GH010067 | 3,474 | 3 | 416 M
4 | GH020070 | 4,759 | 7 | 570 M
5 | GH010069 | 5,395 | 7 | 647 M
6 | GH010068 | 7,043 | 10 | 844 M
7 | GH010065 | 16,881 | 27 | 1.97 G
8 | GH020071 | 24,399 | 30 | 2.85 G
9 | GH030072 | 25,567 | 27 | 2.99 G
10 | GH040066 | 31,860 | 33 | 3.72 G
11 | GH010071 | 31,860 | 42 | 3.72 G
12 | GH030066 | 31,860 | 44 | 3.72 G
13 | GH010070 | 31,860 | 47 | 3.72 G
14 | GH020072 | 31,860 | 47 | 3.72 G
15 | GH010066 | 31,860 | 43 | 3.72 G
16 | GH020066 | 31,860 | 48 | 3.72 G
17 | GH010072 | 31,860 | 48 | 3.72 G
18 | GH050066 | 31,860 | 43 | 3.72 G
Ave | - | 21,191 | 29 | 2.47 G
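The per-video frame counts and file sizes in Table 1 can be gathered automatically. The sketch below shows one way to do this with OpenCV; the video path is illustrative, and the frame count reported by CAP_PROP_FRAME_COUNT may be approximate for some containers.

```python
# Gather per-video statistics like those in Table 1; the path is hypothetical.
import os
import cv2

def video_stats(path):
    cap = cv2.VideoCapture(path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))   # total number of frames
    cap.release()
    size_mb = os.path.getsize(path) / (1024 ** 2)        # file size in megabytes
    return n_frames, size_mb

# Example:
# print(video_stats("videos/GH060066.MP4"))
```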
Table 2. F score (%) and computation time (s) of cow teat video key frame extraction (QC is conducted; NAST: NASNetLarge). Each cell reports F / Time.

Videos | d_SURF | d_Binary | d_Sobel | H_SSIM | d_raw (crop) | d_raw | d_deep (AlexNet) | d_deep (NAST) | d_deep (ResNet-101) | UFSKFE
GH060066 | 54.5 / 109.5 | 72.7 / 66.9 | 40.0 / 27.6 | 72.7 / 3435.2 | 20.0 / 84.9 | 72.7 / 41.1 | 72.7 / 14.4 | 60.0 / 57.7 | 90.9 / 20.8 | 90.9 / 65.3
GH010063 | 57.1 / 118.6 | 57.1 / 81.0 | 42.9 / 40.2 | 14.3 / 1038.1 | 0.0 / 105.8 | 57.1 / 51.0 | 30.8 / 19.0 | 42.9 / 94.0 | 28.6 / 20.4 | 61.5 / 84.5
GH010067 | 22.2 / 102.6 | 44.4 / 71.8 | 85.7 / 42.3 | 66.7 / 902.5 | 0.0 / 92.2 | 44.4 / 44.2 | 22.2 / 18.0 | 44.4 / 73.1 | 60.0 / 17.7 | 50.0 / 71.3
GH020070 | 26.7 / 142.4 | 62.5 / 100.1 | 30.8 / 57.8 | 25.0 / 1241.6 | 37.5 / 127.5 | 50.0 / 61.1 | 12.5 / 21.6 | 0.0 / 99.9 | 37.5 / 23.0 | 61.5 / 97.5
GH010069 | 23.5 / 160.7 | 70.6 / 110.1 | 53.3 / 65.9 | 23.5 / 1408.2 | 23.5 / 148.6 | 58.8 / 69.3 | 47.1 / 23.9 | 0.0 / 122.4 | 58.8 / 26.5 | 66.7 / 110.4
GH010068 | 34.8 / 228.8 | 45.5 / 144.0 | 20.0 / 84.2 | 43.5 / 1830.9 | 9.1 / 188.4 | 69.6 / 90.1 | 52.2 / 32.3 | 34.8 / 158.2 | 69.6 / 34.2 | 80.0 / 145.8
GH010065 | 28.6 / 639.1 | 45.6 / 346.1 | 46.2 / 219.8 | 10.7 / 4395.2 | 15.4 / 453.2 | 40.7 / 218.9 | 17.5 / 84.2 | 35.1 / 354.4 | 42.1 / 86.7 | 50.0 / 359.7
GH020071 | 34.4 / 1161.7 | 58.3 / 507.3 | 39.4 / 364.7 | 43.2 / 6442.8 | 2.9 / 651.8 | 57.5 / 348.5 | 19.2 / 145.3 | 33.3 / 554.0 | 43.8 / 188.0 | 65.6 / 544.2
GH030072 | 30.8 / 1183.4 | 43.8 / 562.7 | 34.0 / 219.1 | 38.9 / 6756.0 | 3.1 / 653.4 | 39.4 / 375.4 | 25.0 / 181.3 | 16.7 / 593.7 | 44.8 / 181.5 | 55.6 / 581.9
GH040066 | 69.7 / 1151.0 | 68.9 / 695.6 | 70.5 / 308.8 | 67.4 / 8395.9 | 30.6 / 590.2 | 71.1 / 478.9 | 69.7 / 280.3 | 71.9 / 810.6 | 74.2 / 268.7 | 75.9 / 780.6
GH010071 | 34.0 / 1170.4 | 58.6 / 697.3 | 42.7 / 303.9 | 34.3 / 8710.8 | 19.6 / 584.5 | 56.8 / 475.5 | 30.9 / 299.4 | 30.6 / 807.6 | 53.1 / 272.9 | 61.9 / 792.4
GH030066 | 18.8 / 1145.3 | 51.5 / 671.8 | 38.8 / 312.2 | 48.0 / 8794.7 | 23.9 / 579.5 | 53.5 / 462.8 | 43.1 / 282.0 | 37.6 / 777.9 | 49.0 / 289.0 | 52.1 / 784.3
GH010070 | 35.6 / 1169.2 | 55.2 / 682.8 | 29.5 / 305.0 | 36.2 / 8733.1 | 18.6 / 581.4 | 50.5 / 476.5 | 23.3 / 311.8 | 45.7 / 799.3 | 34.3 / 277.9 | 54.5 / 783.0
GH020072 | 27.7 / 1176.1 | 53.3 / 666.0 | 37.8 / 338.7 | 25.5 / 9812.4 | 23.2 / 588.6 | 53.8 / 475.9 | 23.1 / 408.7 | 29.7 / 818.0 | 46.2 / 281.8 | 51.6 / 767.8
GH010066 | 19.1 / 631.6 | 56.9 / 667.1 | 44.2 / 345.5 | 43.3 / 9709.1 | 30.9 / 598.1 | 52.5 / 486.0 | 36.0 / 327.3 | 43.6 / 799.4 | 48.5 / 281.9 | 70.2 / 808.5
GH020066 | 21.4 / 671.6 | 50.5 / 667.4 | 53.3 / 333.7 | 39.3 / 11023.7 | 38.0 / 589.5 | 58.5 / 487.3 | 34.3 / 300.4 | 35.5 / 804.0 | 58.5 / 285.2 | 55.8 / 792.3
GH010072 | 18.4 / 710.2 | 50.9 / 673.6 | 40.9 / 329.4 | 26.7 / 10083.3 | 22.0 / 587.9 | 52.8 / 474.0 | 17.1 / 293.5 | 36.2 / 780.3 | 37.4 / 286.7 | 66.0 / 786.0
GH050066 | 24.7 / 661.0 | 47.5 / 681.8 | 44.4 / 327.7 | 58.6 / 8243.1 | 34.4 / 588.1 | 57.1 / 478.5 | 47.5 / 277.6 | 25.7 / 785.5 | 59.6 / 286.1 | 74.2 / 804.3
Ave | 32.3 / 685.2 | 55.3 / 449.6 | 44.1 / 223.7 | 39.9 / 6164.3 | 19.6 / 433.0 | 55.4 / 310.8 | 34.7 / 184.5 | 34.7 / 516.1 | 52.1 / 173.8 | 63.6 / 508.9
Table 3. F score (%) of ablation study.

Videos | GH060066 | GH030072 | GH010066 | GH050066 | Ave
d_raw | 72.7 | 38.4 | 52.0 | 56.0 | 54.8
d_deep | 72.7 | 40.5 | 47.5 | 54.9 | 53.9
d_raw + QC | 72.7 | 39.4 | 52.5 | 57.1 | 55.4
d_deep + QC | 90.9 | 44.8 | 48.5 | 59.6 | 61.0
d | 90.9 | 45.3 | 64.7 | 64.7 | 66.4
UFSKFE | 90.9 | 55.6 | 70.2 | 74.2 | 72.7
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
