RESNA Annual Conference - 2019

Detecting Hands In Egocentric Videos After Spinal Cord Injury Through A Combination Of Object Detection And Tracking Approaches

Ryan Visée1,2, Jirapat Likitlersuang1,2, José Zariffa1,2

1KITE, Toronto Rehab, University Health Network, Toronto, Ontario, Canada

2Institute of Biomaterials and Biomedical Engineering, University of Toronto, Toronto, Ontario, Canada


Spinal cord injury (SCI) significantly reduces the quality of life of affected individuals and entails an estimated economic cost of $2.7 billion per year in Canada [1]. A major contributor to the loss of independence after SCI is impaired arm and hand function. In fact, individuals with cervical SCI report that they would feel the greatest improvement in their quality of life if they were to regain upper limb function [2]. As a result, new treatments to improve hand function after SCI are needed. Current assessments of the severity of upper limb impairments are typically performed in clinical settings. To accurately capture the true impact of these interventions on patient function and independence, however, evaluation should occur in the home. Currently, no methods directly measure and track the effect of therapy on hand function in patients' daily lives at home.

With the emergence of wearable cameras, such as Google Glass™ and GoPro®, innovative and viable ways to directly measure hand function at home have become available. Although videos from wearable cameras (egocentric videos) can be used to monitor patient activities at home, the automated analysis of egocentric videos using computer vision presents significant technical challenges [3,4]. A key challenge is the detection of hands in egocentric videos, a necessary first step before any hand function analysis. Robust and reliable hand detection and tracking are hampered by factors including partial occlusions, lighting variations, hand articulations, camera motion, and backgrounds or objects that are similar in colour to skin.

Hand detection can be approached as an instance of a more general and fundamental problem in computer vision: object detection. Recently, significant progress has been made in improving the performance of object detection using convolutional neural networks (CNNs). Existing algorithms can be divided into two categories: region-based and regression-based approaches. Region-based approaches generate a set of region or object proposals in an image, extract a robust set of features, and finally perform class-specific scoring and bounding box regression on each proposal. This approach was notably applied in the region-based CNN (R-CNN), but it suffers from high computational cost because region proposals must be calculated and classified in every frame [5]. Faster R-CNN was subsequently introduced and improved both speed and accuracy, but it still performs well below real-time (30 frames per second (FPS)) [6]. Regression-based approaches frame object detection as a regression problem, learning to directly regress the location of the bounding box rather than classify object proposals. You Only Look Once (YOLO) uses a single CNN to simultaneously predict bounding boxes and class probabilities, performing competitively with Faster R-CNN while being significantly faster [7]. Subsequently, the second version of YOLO (YOLOv2) outperformed Faster R-CNN in both accuracy and speed, reaching real-time performance [8].

Object detectors do not learn the identity of a particular object instance or use information from previous frames to estimate the object's new location, and recalculating box proposals in every frame prevents many detectors from running in real-time. Object tracking addresses this problem by using information from the previous frame to estimate the object's location in the next frame. Because tracking is less computationally complex than detection, tracking algorithms can learn online without any pretraining on a specific dataset. These online trackers require manual initialization of a bounding box in the first frame, which serves as a positive example; the tracker then updates its model of the object's appearance with every new frame [9-12]. Although these simple algorithms are efficient, they quickly accumulate errors, resulting in tracker drift and an inability to recover from occlusions and quick movements.

Therefore, the aim of this study was to develop an algorithm for fast and reliable hand detection in egocentric videos captured from individuals with cervical SCI. We hypothesized that integrating object detection techniques with tracking algorithms would increase the accuracy and computational efficiency of hand detection in egocentric videos compared to previous approaches.


Hand detection dataset

The egocentric hand detection dataset used for this study was obtained from previous experiments in which videos were collected using wearable cameras on individuals with SCI, termed the ANS SCI dataset [4]. This dataset contains videos of 17 individuals with cervical SCI performing a variety of activities of daily living (ADLs), collected in a home simulation laboratory at the Toronto Rehabilitation Institute. Videos were recorded at 30 FPS and 1080p resolution using a GoPro® Hero 4 head-mounted wearable camera. The dataset covers ADLs in environments including the kitchen, washroom, living room, dining room, bedroom, and outdoors. Participants were asked to manipulate over 30 objects across over 35 different ADLs as naturally as possible. The ANS SCI dataset (Fig. 1) therefore reflects a range of objects, environments, ADLs, and participants, including different levels of impairment.

Figure 1. Example annotated frames in the ANS SCI dataset. From left to right: hands manipulating a screw and bolt, folding a towel, opening a jar, opening a pill box, and opening a bag with the help of another person. Annotations mark the right hand, left hand, other hands, and not-hand regions.
We generated a large hand detection dataset by manually labelling bounding boxes around hands in frames drawn from every participant, ADL, and environment. The complete dataset consists of 167,622 images containing labels for "left hand"/"right hand" (L/R), which belong to the camera wearer, and "other hands" (O), which belong to anyone else who may appear in the video. It also contains "not hand" (N) labels, used as negative data to mark objects and background regions that the CNN may confuse with hands. Care was taken to ensure a broad distribution across participants, activities, and environments, while also including many difficult annotations such as occlusions, abnormal hand articulations, and quick movements.

Object detection and object tracking

The dataset was used to build upon previous object detection and tracking algorithms adapted to the hand detection problem. For object detection, we tested a state-of-the-art algorithm from each category: Faster R-CNN and YOLOv2 [6,8]. For object tracking, only online trackers were implemented, owing to their simplicity and efficiency. These algorithms are summarized in Table 1.

Table 1.  Implemented object detection and tracking algorithms

Detection Algorithms  Tracking Algorithms
Faster R-CNN [6]      Median Flow (MF) [9]
                      Kernelized Correlation Filter (KCF) [10]
YOLOv2 [8]            Online Boosting (OLB) [11]
                      Multiple Instance Learning (MIL) [12]

Combining object detectors and trackers

After these algorithms were evaluated individually, the top performing ones were combined to evaluate the further improvement that could be achieved. The method employed in this study was to use a detector to automatically initialize an online tracker and to reset it upon failure or after a fixed number of frames. This method was explored because the main weakness of tracking algorithms is their inability to recover from occlusions or quick movements, making it difficult for them to perform adequately after failure; a detection algorithm can aid recovery in these situations. Further, since online trackers require manual initialization, tracking alone is only semi-automatic; using a detector to initialize the tracker fully automates the process. Another problem many online trackers face is tracker drift. Using a detector to reset the tracker after a fixed number of frames minimizes the effect of tracker drift, avoiding the propagation of errors and improving performance. Since YOLOv2 is the fastest and one of the most accurate detection algorithms available, it was used to aid the efficient online trackers listed in Table 1. This combination was expected to improve the accuracy of the trackers while maintaining their efficiency.

Performance evaluation

The detection algorithms in Table 1 were trained using a 3-fold cross-validation approach: the dataset was split into 3 parts and, in turn, 2 parts were used for training while the remaining part was used for testing. To split the dataset evenly, we used the American Spinal Injury Association (ASIA) International Standards for Neurological Classification of Spinal Cord Injury (ISNCSCI) assessment tool [13]. As we are focusing on hand function, we specifically used the upper extremity motor subscore (UEMS) to divide our dataset. To calculate participants' UEMS scores, 5 upper limb muscles were tested, one for each respective segment of the cervical cord, and each was graded on a 0-5 scale; the muscle scores were summed to obtain the total UEMS. The final dataset split, based on average UEMS, is displayed in Table 2. Using a one-way analysis of variance (ANOVA), these means were found to not be statistically different, F(2,14) = 0.12, p = 0.89. As online trackers do not require any pretraining on a specific dataset, they were only tested on 10 randomly chosen videos. The final hand detection performance was evaluated using the F1-score, the harmonic mean of precision and recall, on the test set. The frame rate of the model was also used as an evaluation metric, as the system should ideally run in real-time. For rehabilitation applications, a target of 15-20 FPS would most likely provide the same information as a system running at the definition of real-time (30 FPS). For real-time information provided in the home and community, these FPS targets should be achieved on CPUs rather than GPUs.

Table 2.  Dataset split summary based on average UEMS

              Group A        Group B        Group C
Average UEMS  17.83 ± 5.04   18.80 ± 3.96   19.00 ± 4.10
Total Frames  63,102         36,051         68,469
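The F1-score used for evaluation can be computed from counts of matched and unmatched bounding boxes. The sketch below is illustrative, not the authors' code; matching predictions to ground truth by intersection-over-union (IoU) with a 0.5 threshold is a common convention and is assumed here rather than stated in the text:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def f1_score(true_positives, false_positives, false_negatives):
    """F1 is the harmonic mean of precision and recall."""
    tp, fp, fn = true_positives, false_positives, false_negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A predicted box would count as a true positive when its IoU with a ground-truth box of the same class (L/R/O) meets the threshold; unmatched predictions are false positives and unmatched ground-truth boxes are false negatives.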


Results for the highest performing models of each object detector are displayed in Table 3. Detector performance was evaluated on an NVIDIA® Titan Xp 12 GB GPU and an Intel® Core™ i7-8700K CPU.

Table 3. Quantitative results of top performing models for each detector

                  F1-Score      FPS on GPU  FPS on CPU
YOLOv2 [8]        0.90 ± 0.08   68          1
Faster R-CNN [6]  0.72 ± 0.05   22          0.5



We then evaluated the online trackers listed in Table 1 on an Intel® Core™ i5-7200U CPU. The trackers were not tested on a GPU because they were already efficient on CPUs. They were tested on 10 randomly chosen videos spanning 6 participants, 4 environments, and 19,683 frames. Although this is a small subset of the data, we believe these videos adequately represent the performance of online trackers on the entire dataset. Results are summarized in Table 4.

Table 4. Quantitative results of top performing models for each online tracker

            MF [9]       KCF [10]     OLB [11]     MIL [12]
F1-Score    0.49 ± 0.27  0.32 ± 0.31  0.38 ± 0.28  0.40 ± 0.30
FPS on CPU  155          70           25           17

Finally, we implemented our proposed combination method, using YOLOv2 to initialize the online trackers and to reset the tracker upon failure or after a fixed number of frames. YOLOv2 was chosen due to its higher performance (Table 3). The combination method was tested on the same subset of data as the online trackers. The best combination resulted in a 1.7x improvement in accuracy compared to the best tracker alone (MF) and was 5x faster on a CPU than the fastest detector alone (YOLOv2). These results are summarized in Table 5, with detections performed every 100 frames (chosen empirically) or after tracker failure. Further, to minimize detector usage, the tracker was disabled if it failed and the detector was unable to locate the hand in 3 consecutive frames; the detector then checked every 60 frames until the hand was found.

Table 5. Quantitative results of top performing models for the proposed combination method

            F1-Score     FPS on GPU  FPS on CPU
YOLOv2_MF   0.78 ± 0.18  253         6
YOLOv2_KCF  0.83 ± 0.16  116         5
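The detect-and-track schedule described above (scheduled resets every 100 frames, detection on tracker failure, and 60-frame polling after 3 consecutive detector misses) can be sketched as a control loop. This is an illustrative reconstruction, not the authors' implementation; `detect` and `track` are hypothetical stand-ins for YOLOv2 and an online tracker such as KCF:

```python
# Parameters taken from the text; chosen empirically by the authors.
RESET_INTERVAL = 100   # frames between scheduled detector resets
POLL_INTERVAL = 60     # detector polling rate while the tracker is disabled
MAX_DETECT_MISSES = 3  # consecutive detector misses before disabling the tracker

def run_pipeline(frames, detect, track):
    """Hybrid detector-tracker loop (sketch).

    detect(frame) returns a bounding box or None; track(frame, box) returns
    the tracked box or None on tracking failure. Returns one box (or None)
    per frame.
    """
    boxes = []
    box = None                # current bounding box; None if the hand is lost
    misses = 0                # consecutive detector misses
    frames_since_detect = 0
    tracker_disabled = False
    for frame in frames:
        if tracker_disabled:
            # Tracker off: poll the detector every POLL_INTERVAL frames.
            frames_since_detect += 1
            box = None
            if frames_since_detect >= POLL_INTERVAL:
                frames_since_detect = 0
                box = detect(frame)
                if box is not None:
                    tracker_disabled = False
                    misses = 0
        elif box is None or frames_since_detect >= RESET_INTERVAL:
            # Tracker failed or scheduled reset: run the detector.
            frames_since_detect = 0
            box = detect(frame)
            if box is None:
                misses += 1
                tracker_disabled = misses >= MAX_DETECT_MISSES
            else:
                misses = 0
        else:
            # Normal operation: update the tracker from the previous box.
            frames_since_detect += 1
            box = track(frame, box)
        boxes.append(box)
    return boxes
```

The detector is invoked only on reset, failure, or polling frames, which is why the combined pipeline runs several times faster on a CPU than running YOLOv2 on every frame.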




To deploy a system that can automatically analyze hand function at home in real-time, the algorithm must perform accurately and efficiently on CPUs rather than GPUs. As expected, the implemented object detectors perform relatively well in terms of F1-score but run significantly slower on CPUs. Online trackers, on the other hand, have relatively low F1-scores but are much more efficient on CPUs. They also show high standard deviations, reflecting their reliance on manual user initialization and on the quality of the video. These drawbacks in each domain have hindered the deployment of a complete monitoring system in the home and community. However, combining relatively fast detectors with relatively accurate trackers minimized the weaknesses of each approach, resulting in more accurate and efficient hand detection. This is displayed in Table 5, where the combination of YOLOv2 and KCF performed best even though KCF performed the worst on its own; KCF has difficulty recovering from tracking failures but is otherwise highly accurate. Finally, increasing the number of frames between detections yields frame rates closer to our target of 15-20 FPS, at a trade-off in F1-score.

A method allowing robust and reliable hand detection will aid in evaluating the true impact of new treatments on individuals with SCI living in the community. Hand detection is an essential step before further analysis, including hand segmentation, activity recognition, or interaction detection, can be conducted. However, this step continues to be error-prone, negatively affecting the later stages of analysis and hindering the deployment of a complete monitoring system in the home and community. The development of an accurate hand detection algorithm, in combination with the availability of wearable cameras, will put researchers one step closer to directly measuring hand function in a patient's daily life, thus helping restore independence after SCI.



[1] Krueger H, Noonan VK, Trenaman LM, Joshi P, Rivers CS. The economic burden of traumatic spinal cord injury in Canada. Chronic diseases and injuries in Canada. 2013 Jun 1;33(3).

[2] Anderson KD. Targeting recovery: priorities of the spinal cord-injured population. Journal of Neurotrauma. 2004;21:1371-1383.

[3] Likitlersuang J, Zariffa J. Interaction Detection in Egocentric Video: Toward a Novel Outcome Measure for Upper Extremity Function. IEEE journal of biomedical and health informatics. 2018 Mar;22(2):561-9.

[4] Likitlersuang J, Sumitro ER, Cao T, Visee RJ, Kalsi-Ryan S, Zariffa J. Egocentric Video: A New Tool for Capturing Hand Use of Individuals with Spinal Cord Injury at Home. arXiv preprint arXiv:1809.00928. 2018 Aug 30.

[5] Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition 2014 (pp. 580-587).

[6] Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems 2015 (pp. 91-99).

[7] Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition 2016 (pp. 779-788).

[8] Redmon J, Farhadi A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition 2017 (pp. 7263-7271).

[9] Kalal Z, Mikolajczyk K, Matas J. Forward-backward error: Automatic detection of tracking failures. In Proceedings of the 20th international conference on pattern recognition (ICPR) 2010 (pp. 2756-2759). IEEE.

[10] Henriques JF, Caseiro R, Martins P, Batista J. Exploiting the circulant structure of tracking-by-detection with kernels. In European conference on computer vision 2012 (pp. 702-715). Springer, Berlin, Heidelberg.

[11] Grabner H, Grabner M, Bischof H. Real-time tracking via on-line boosting. In British Machine Vision Conference 2006.

[12] Babenko B, Yang MH, Belongie S. Visual tracking with online multiple instance learning. In Proceedings of the IEEE conference on computer vision and pattern recognition 2009 (pp. 983-990). IEEE.

[13] Kirshblum SC, Burns SP, Biering-Sorensen F, Donovan W, Graves DE, Jha A, et al. International standards for neurological classification of spinal cord injury (revised 2011). The Journal of Spinal Cord Medicine. 2011;34(6):535-546.