Article

RGB-D-Based Robotic Grasping in Fusion Application Environments
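Ruochen Yin, Huapeng Wu, Ming Li, Yong Cheng, Yuntao Song and Heikki Handroos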

1 Institute of Plasma Physics, Chinese Academy of Sciences (ASIPP), Hefei 230031, China
2 University of Science and Technology of China, Hefei 230026, China
3 Laboratory of Intelligent Machines, School of Energy Systems, LUT University, 53850 Lappeenranta, Finland
4 Institute of Energy, Hefei Comprehensive National Science Center, Hefei 230031, China
* Authors to whom correspondence should be addressed.
Submission received: 30 June 2022 / Revised: 23 July 2022 / Accepted: 25 July 2022 / Published: 27 July 2022
(This article belongs to the Section Robotics and Automation)

Abstract

Although deep neural network (DNN)-based robotic grasping has come a long way, the uncertainty of its predictions has kept DNN-based approaches from meeting the stringent requirements of some industrial scenarios. To prevent this uncertainty from affecting the behavior of the robot, we break the whole process down into instance segmentation, clustering and planar extraction; in other words, we insert traditional, controllable steps between the output of the instance segmentation network and the final control decision. Experiments in a challenging environment show that our approach copes well with it and achieves more stable and better results than end-to-end grasping networks.

1. Introduction

Robotic grasping is one of the most effective solutions for handling objects in challenging environments. In nuclear fusion applications, many tasks, including grasping, need to be carried out by a robotic arm, as shown in Figure 1.
As a fundamental robotic task, robotic grasping has been researched for a long time. In early works, researchers relied on traditional Computer Vision (CV) algorithms to obtain the right grasp point. For example, in [1], researchers used a single camera to observe the same towel from different angles of view, implemented border classification and corner grasp point detection algorithms on these data, and eventually achieved the towel folding task. Such algorithms were quickly replaced by Convolutional Neural Network (CNN)-based algorithms after the rise of deep learning. Thus, in this paper, we focus on learning-based approaches.
In [2], based on a monocular camera, Levine et al. proposed a learning-based approach to hand-eye coordination for robotic grasping, and they focused on training the CNN to understand the spatial relationship between the gripper and objects in the scene so the neural network (NN) could predict the probability that the task-space motion of the gripper will result in successful grasps.
Because the final goal of the robotic grasping task is to drive the gripper to the optimal gripping pose, the ideal input data are 3D spatial coordinates, which are hard to obtain from a monocular camera. Thus, after RGB-D cameras were introduced and matured, they quickly became popular sensors among researchers, and most learning-based robotic grasping methods now take as input the RGB-D data provided by depth cameras. As in [3,4], these approaches output an appropriate grasping rectangle for each object through a CNN structure. The centre of the rectangle is the grasp point, and the long and short sides of the rectangle correspond to the grasp angle and grasp width, respectively. Morrison et al. [5] introduced a generative grasping convolutional neural network to infer multiple grasp rectangles for multiple objects in clutter in real time, together with a quality score for each grasp pose. Building on [5], Kumra et al. [6] proposed the Generative Residual Convolutional Neural Network (GR-ConvNet), adjusting the network structure and achieving better accuracy and efficiency.
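To make the rectangle representation concrete, the sketch below shows one possible data structure for such a grasp rectangle; the field names are illustrative and are not taken from [3,4,5,6].

```python
from dataclasses import dataclass

@dataclass
class GraspRectangle:
    """Illustrative container for the grasp rectangle described above."""
    cx: float              # centre of the rectangle = grasp point (pixel coordinates)
    cy: float
    angle: float           # orientation of the long side = grasp angle (radians)
    width: float           # length of the short side = grasp width (gripper opening)
    quality: float = 1.0   # per-grasp quality score, as output by [5,6]
```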
In [7], researchers proposed a generative adversarial network based on monocular RGB images which achieves outstanding grasping results in both simulated and real environments. Tremblay et al. [8] proposed a 6-DoF pose estimation NN which outputs the semantic label of each object; the whole training set for this NN is also built on synthetic data. In [9,10], the RGB-D information is converted into point clouds, and neural networks are developed to give contact points for each object separately based on object recognition in 3D space.
In [11], Wen et al. proposed a method that leverages hand-object contact heatmaps generated in a self-supervised manner in simulation to achieve a dense, point-wise grasping network. They also introduced the "Non-Uniform Normalized Object Coordinate Space" (NUNOCS) representation for learning category-level object 6D poses and 3D scaling, which allows non-uniform scaling across the three dimensions.
With the increasing number of approaches based on generated data, researchers are placing greater demands on the quality of the generated models. In order to make the virtual data as close as possible to the real model, Tobin et al. [12] explored a data generation pipeline for training a deep neural network to perform grasp planning, applying the idea of domain randomization to object synthesis. Meanwhile, Varley et al. [13] proposed a CNN-based generation network which can complete the 3D shape of an object from an incomplete point cloud captured from several viewpoints. With adequate offline training, the generation network can complete the 3D shape accurately and rapidly.
Obtaining training data is always the most important and most difficult part of supervised learning-based methods, so some researchers have begun to explore the use of reinforcement learning (RL) to solve the robotic grasping problem. In [14], researchers trained a vision-based closed-loop grasping RL agent in a simulation environment, then transferred it to the real world and completed the grasping task successfully. Joshi et al. [15] presented a deep reinforcement learning-based method that solves the robotic grasping problem using visuomotor feedback. Quillen et al. [16] compared various RL algorithms in virtual and real environments and selected an off-policy algorithm that handles the task well.
From these related works, the learning-based ones exhibit several trends:
  • Using generated data as training data and transferring the results from virtual to real environments;
  • Moving from grasping individual objects to grasping objects in clutter.
Most robotic grasping methods focus on building an end-to-end grasping network that outputs a result containing the grasp point, pose, and width. This is the right development direction and works well under ideal circumstances. However, in the fusion application environment, all components, including the interior of the vacuum chamber and the components to be assembled, are made of the same metal materials and have smooth surfaces. Almost all RGB-D cameras perform poorly on such non-optimal smooth surfaces with specular reflection, and CNN-based grasping networks can hardly yield stable results in such an environment.
Meanwhile, due to the black-box nature of the processes within neural networks, researchers can hardly eliminate such uncertainty by changing or improving the structural design of an NN, so end-to-end NN approaches transmit this uncertainty directly to the final decision, which is intolerable in many industrial scenarios. To eliminate such uncertainty, some researchers work hard to ensure that the training set is extensive and comprehensive enough to cover all kinds of situations, but this is too expensive. Moreover, despite the availability of many open-source benchmarks, labeled training datasets are still very scarce in some specific scenarios, such as a fusion vacuum vessel (VV), so achieving consistent and high-quality robotic grasping in such a challenging environment remains quite tricky.
A more efficient approach is to abandon the obsessive pursuit of an end-to-end network and add traditional but controllable steps between the network output and the final decision, thus eliminating these uncertainties. In our case, we decompose the grasping task into object recognition based on an instance segmentation network and grasping pose computation based on plane extraction and a clustering algorithm. In this paper, we choose the well-developed instance segmentation network proposed by Xiang et al. [17] as the foundation of our robotic grasping method. We then use a clustering algorithm to obtain more stable segmentation results and calculate the correct grasping points and poses via plane extraction. The final experimental results show that our method achieves a high grasping success rate in this challenging environment.

2. Materials and Methods

The process of our approach is shown in Figure 2. Firstly, the RGB and depth images are fed into the instance segmentation network to obtain a rough result. After filtering these results, we cluster the data and compare whether the number of categories is the same in ① and ②. ③ represents a data pool that stores the latest five frames of data. In ④, we take the intersection of these five frames as the final object data. We then apply planar extraction to the object data and finally calculate the grasp pose.
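To make steps ③ and ④ concrete, the sketch below shows one way such a five-frame pool and mask intersection could be implemented; it assumes each frame is represented as a dictionary of boolean pixel masks keyed by object id, which is an illustrative data layout rather than our exact implementation.

```python
import numpy as np
from collections import deque

# Pool holding the latest five frames of per-object masks (step 3 in Figure 2).
frame_pool = deque(maxlen=5)

def add_frame(masks):
    """masks: dict mapping object id -> boolean pixel mask of that object."""
    frame_pool.append(masks)
    if len(frame_pool) < frame_pool.maxlen:
        return None                       # not enough history yet
    stable = {}
    for obj_id in frame_pool[0]:
        if not all(obj_id in frame for frame in frame_pool):
            continue                      # object not detected in every frame
        # Step 4: pixel-wise intersection of the object's mask over five frames.
        merged = np.logical_and.reduce([frame[obj_id] for frame in frame_pool])
        if merged.any():
            stable[obj_id] = merged
    return stable
```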

2.1. Instance Segmentation Network

As a popular research sub-area of Computer Vision (CV), instance segmentation has been studied extensively and intensively. The Unseen Object Instance Segmentation (UOIS) network proposed in [17] is a CNN-based instance segmentation network that can identify partially occluded objects from RGB-D data. In most cases, this method works well for instance segmentation of objects on a table. However, to better simulate the internal environment of the VV, we used a metal tabletop made of the same material as the peg parts, and the 3D point cloud data become quite unstable, as shown in Figure 3.
Thus, we add a clustering algorithm and a planar extraction algorithm to obtain a stable grasping result in this situation.

2.2. Clustering Algorithm

In this paper, we chose Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [18] as the clustering algorithm. As a density-based clustering algorithm, DBSCAN has the advantage of finding arbitrarily shaped clusters in a noisy spatial database. Considering that there are non-negligible errors in the measurements of depth cameras on non-optimal smooth surfaces, the DBSCAN algorithm is well suited to the problem we face. As shown in Figure 2, after we obtain the instance segmentation result from the output of the NN, we project the 2D information into a 3D point cloud by combining the pixel positions of the segmented instances with the corresponding depth map. The projection formulas are as follows:
$$z = \mathrm{Depth}(u, v), \qquad x = (u - C_x)\,z/f_x, \qquad y = (v - C_y)\,z/f_y \qquad (1)$$
Here, u and v are the pixel coordinates in both the RGB and depth images, C_x and C_y are the coordinates of the camera's principal point, and f_x and f_y are the scale factors of the camera in the u-axis and v-axis directions; all of them can be obtained from the camera intrinsic matrix. Moreover, it is vital that the RGB and depth images are aligned. After that, we perform DBSCAN clustering on the filtered segmented images, as shown in Algorithm 1. Once we have accumulated the latest five frames of data, we take their intersection as the segmentation result.
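Before turning to the clustering itself, a minimal NumPy sketch of the back-projection in Formula (1) is shown below; the intrinsics f_x, f_y, C_x, C_y are read from the camera intrinsic matrix, the depth image is assumed to be aligned with the RGB image as noted above, and the function name is illustrative.

```python
import numpy as np

def project_to_pointcloud(depth, mask, fx, fy, cx, cy):
    """Back-project the masked depth pixels to 3D camera coordinates (Formula (1)).

    depth : HxW depth image aligned with the RGB image (metres)
    mask  : HxW boolean mask of one segmented instance
    fx, fy, cx, cy : focal lengths and principal point from the intrinsic matrix
    """
    v, u = np.nonzero(mask)               # pixel coordinates of the instance
    z = depth[v, u]
    valid = z > 0                         # drop pixels with no depth reading
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)    # N x 3 point cloud
```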
Algorithm 1 DBSCAN
Input: dataset D = {p_1, p_2, ..., p_n}, algorithm parameters (ε, MinPts)
 1: Initialize the set of core points Ω = ∅, the number of clusters k = 0, the unvisited sample set B = D and the set of clusters C = ∅
 2: for each p_i ∈ D do
 3:   Initialize its neighborhood point set N_ε(p_i) = ∅
 4:   for any p_m ≠ p_i and p_m ∈ D do
 5:     Calculate the distance d(i, m) from p_m to p_i
 6:     if d(i, m) < ε then
 7:       N_ε(p_i) = N_ε(p_i) ∪ {p_m}
 8:     end if
 9:   end for
10:   if |N_ε(p_i)| > MinPts then
11:     Ω = Ω ∪ {p_i}
12:   end if
13: end for
14: if Ω = ∅ then
15:   Quit the algorithm
16: end if
17: Randomly choose an item o ∈ Ω, initialize the current core point set Ω_cur = {o}, update k = k + 1, mark the item as cluster k, initialize its cluster set C_k = {o} and update B = B \ {o}
18: if Ω_cur = ∅ then
19:   C_k has been determined; update the set of clusters C = {C_1, C_2, ..., C_k} and go back to line 14
20: end if
21: Update Ω = Ω \ C_k
22: Randomly choose an item o ∈ Ω_cur, determine its neighborhood point set N_ε(o) based on the distance parameter ε, let Δ = N_ε(o) ∩ B, update C_k = C_k ∪ Δ, B = B \ Δ and Ω_cur = (Ω_cur ∪ (Δ ∩ Ω)) \ {o}, then go back to line 18
Output: cluster set C = {C_1, C_2, ..., C_k}
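In practice, Algorithm 1 does not have to be re-implemented by hand; a library implementation behaves equivalently. The sketch below uses scikit-learn's DBSCAN to keep only the largest cluster of an instance's projected points; the eps and min_samples values are placeholders, not the parameters used in our experiments.

```python
import numpy as np
from sklearn.cluster import DBSCAN      # library implementation of Algorithm 1

def largest_cluster(points, eps=0.01, min_samples=30):
    """points: N x 3 array from the back-projection above (metres).
    eps / min_samples correspond to the (epsilon, MinPts) parameters."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    valid = labels >= 0                  # label -1 marks noise points
    if not valid.any():
        return None
    counts = np.bincount(labels[valid])
    return points[labels == np.argmax(counts)]   # keep the densest object cluster
```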

2.3. Plane Extraction, Grasping Pose Calculation and Contact Point Selection

Since all the components to be grasped in this application scenario are very regular solids, there are well-defined planes in these structured data, and the Random Sample Consensus (RANSAC) [19] plane extraction algorithm achieves good results in this case. The RANSAC-based plane extraction used in this paper is shown in Algorithm 2. Once the top plane of the object has been extracted, we can easily calculate the grasping pose.
Algorithm 2 RANSAC plane extraction
Input: point cloud P = {p_1, p_2, ..., p_n}, sample size n = 3, iteration count k = 1000 and distance threshold t = 3 mm
 1: Initialize iteration = 0, Plane_best = Null, InlierSet_best = ∅, OutlierSet_best = ∅, SumDistance_best = 0
 2: for iteration < k do
 3:   Randomly pick a set P_s of n points from P, calculate Plane_cur and initialize InlierSet_cur = ∅, OutlierSet_cur = ∅, SumDistance_cur = 0
 4:   for each p_m ∈ P \ P_s do
 5:     Calculate the distance d from point p_m to Plane_cur
 6:     if d < t then
 7:       Update InlierSet_cur = InlierSet_cur ∪ {p_m}, SumDistance_cur = SumDistance_cur + d
 8:     else
 9:       Update OutlierSet_cur = OutlierSet_cur ∪ {p_m}
10:     end if
11:   end for
12:   if |InlierSet_cur| > |InlierSet_best| then
13:     Plane_best = Plane_cur, InlierSet_best = InlierSet_cur, OutlierSet_best = OutlierSet_cur, SumDistance_best = SumDistance_cur
14:   else if |InlierSet_cur| = |InlierSet_best| and SumDistance_best > SumDistance_cur then
15:     Plane_best = Plane_cur, InlierSet_best = InlierSet_cur, OutlierSet_best = OutlierSet_cur, SumDistance_best = SumDistance_cur
16:   end if
17:   iteration = iteration + 1
18: end for
Output: Plane_best, InlierSet_best, OutlierSet_best
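A compact NumPy version of Algorithm 2 is sketched below under the same parameters (k = 1000 iterations, t = 3 mm); the outlier set is simply the complement of the returned inlier indices, and the function name is illustrative.

```python
import numpy as np

def ransac_plane(points, k=1000, t=0.003, rng=None):
    """Fit a plane ax + by + cz + d = 0 to an N x 3 point cloud (Algorithm 2)."""
    rng = np.random.default_rng() if rng is None else rng
    best_plane, best_inliers, best_sum = None, np.array([], dtype=int), np.inf
    for _ in range(k):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue                              # degenerate (collinear) sample
        normal /= norm
        d = -normal.dot(sample[0])
        dist = np.abs(points @ normal + d)        # point-to-plane distances
        inliers = np.nonzero(dist < t)[0]
        dist_sum = dist[inliers].sum()
        # Prefer more inliers; break ties with the smaller summed distance.
        if len(inliers) > len(best_inliers) or (
            len(inliers) == len(best_inliers) and dist_sum < best_sum
        ):
            best_plane = np.append(normal, d)
            best_inliers, best_sum = inliers, dist_sum
    return best_plane, best_inliers
```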
As shown in Figure 4, the grasping pose consists of several elements: P, W and θ. {P_a, P_b, P_c, P_d} are the four corner points of the extracted plane. For each corner point, P = {x, y, z, u, v}, where {u, v} are the pixel coordinates in the image and {x, y, z} are the 3D coordinates calculated through Formula (1).
The grasping position P = {x_c, y_c, z_c, V}, where {x_c, y_c, z_c} are the 3D coordinates of the centre point of the extracted plane, which can easily be calculated, and V is the rotation vector, calculated as:
$$V_{ab} = (x_b - x_a,\ y_b - y_a,\ z_b - z_a), \quad V_{ac} = (x_c - x_a,\ y_c - y_a,\ z_c - z_a), \quad V = V_{ab} \times V_{ac} \qquad (2)$$
It should be noted that, according to the right-hand rule, the direction of V points into the screen, which is in accordance with the pose of the end-effector in the robot's coordinate system. Since we already have the extracted plane, we can easily obtain the four corner points P_a, P_b, P_c and P_d. Thus, θ can be calculated as:
$$\theta = \arctan\left(\frac{y_a - y_b}{x_a - x_b}\right) \qquad (3)$$
W is the distance between P_a and P_c.
We use a three-finger gripper as the end-effector to grasp the rectangular part of the object, so the three contact points lie on the two long sides of the rectangle. The method we propose, called HUDR, is a hybrid one combining UOIS with DBSCAN and RANSAC.
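Putting Formulas (2) and (3) together, the short sketch below computes the centre point, rotation vector, grasp angle and width from the four extracted corner points; it uses arctan2 instead of a plain arctangent only to avoid a division by zero when x_a = x_b, and the function name is illustrative.

```python
import numpy as np

def grasp_pose_from_corners(pa, pb, pc, pd):
    """Corner points are (x, y, z); Pa-Pb is a long side of the extracted plane."""
    pa, pb, pc, pd = map(np.asarray, (pa, pb, pc, pd))
    centre = (pa + pb + pc + pd) / 4.0                  # centre point (xc, yc, zc)
    v = np.cross(pb - pa, pc - pa)                      # rotation vector V = Vab x Vac
    v /= np.linalg.norm(v)
    theta = np.arctan2(pa[1] - pb[1], pa[0] - pb[0])    # grasp angle, Formula (3)
    width = np.linalg.norm(pa - pc)                     # grasp width W
    return centre, v, theta, width
```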

3. Results

3.1. Experimental Platform

Our platform for this experiment is shown in Figure 5. In our case, the top of the peg is made into a rectangular cuboid, which we call "Thor's hammer", to increase the contact area between the gripper and the peg and thus increase the friction, which is beneficial to the subsequent assembly work. We carry out the experiments on a desktop with an Nvidia RTX 2080Ti GPU, an Intel i7-10700k CPU and 64 GB of RAM. The calibration between the camera and the robot is based on the Tsai–Lenz camera calibration algorithm [20]. The control of the robot and the gripper is implemented through the Universal Robots RTDE package (https://sdurobotics.gitlab.io/ur_rtde/index.html (accessed on 16 June 2022)).
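For reference, a minimal sketch of commanding the arm through the ur_rtde Python bindings is shown below; the controller IP address and the target pose are placeholders, and the gripper is driven through its own interface, which is omitted here.

```python
from rtde_control import RTDEControlInterface

rtde_c = RTDEControlInterface("192.168.1.10")       # robot controller IP (example)

# Target pose [x, y, z, rx, ry, rz]: position in metres plus an axis-angle
# rotation derived from the grasp pose of Section 2.3 (placeholder values).
grasp_pose = [0.40, -0.10, 0.25, 0.0, 3.14, 0.0]
rtde_c.moveL(grasp_pose, 0.25, 0.5)                 # linear move: speed, acceleration
```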

3.2. Visually Intuitive Evaluation

As previously stated, the RGB-D camera produces highly distorted measurements on smooth surfaces, which affects the output of the instance segmentation network; this phenomenon is also evident in our experiments. Figure 6 shows the results of the various stages of the experiment.
The top left picture is the original RGB image; "Thor's hammer" is the grasping target. The discs with holes on the right are for subsequent assembly tasks and are not relevant to this paper. The picture on the upper right is the result of the segmentation network, where there are obvious errors. We project the classification results from the 2D image into the 3D point cloud and apply the clustering algorithm to obtain the bottom left image, where the top red area is the result of planar extraction. The data obtained after planar extraction are displayed independently in the lower right image. From these pictures, we can see that the recognition results are greatly improved after applying our algorithm.

3.3. Quantitative Evaluation

In this subsection, we compare the grasping success rate of our method with the GR-ConvNet and UOIS methods. GR-ConvNet is an end-to-end grasping network that outputs a rectangle representing the grasping pose and width. UOIS is the underlying instance segmentation network of our method. In the experiment, we evaluate the UOIS method by calculating the grasping pose directly from the segmentation results output by the network.
To better simulate the environment in the VV, the object to be grasped was placed on a metal table in this experiment, as shown in Figure 5. To compare the effect of different background materials, we also conducted comparative experiments on a wooden table. We performed 100 grasping operations with each of these methods under the same scenarios and then calculated the success rate. These scenarios comprise two different poses, a standing pose and a lying pose, and two different backgrounds, a wooden table and a metal table, as shown in Figure 7.
The results in Table 1 show that the different backgrounds make a big difference in the results. If the background and the object are very close in material and color, it is difficult for these networks to identify the object correctly. Meanwhile, the performance of GR-ConvNet for the standing poses is much poorer than for the lying ones. This is because, when the object is standing on the table, the height difference of the object becomes more significant, and the error in the depth information from the RGB-D camera becomes more prominent, especially for the cylindrical part at the bottom. On the other hand, GR-ConvNet is more likely to choose the cylindrical part as the grasping target, which means it can hardly find an appropriate grasping pose.
However, UOIS and the method proposed in this paper are based on instance segmentation. When the object is in its standing pose, the instance segmentation network in these methods can better identify the top plane of the object, which is why their performance in the standing pose is better than in the lying pose.
Another factor that influences the proposed method's performance in the lying pose is that, when the object is lying down, the sides of the cylinder have some adverse effects on the clustering algorithm.
Overall, our method performed much better under the different scenarios. In particular, our method still achieved a 97% grasping success rate under the conditions closest to the actual situation (metal table plus standing pose). Most importantly, when our algorithm does not perform well, we can find which part goes wrong and improve it. For our method, the failure cases during grasping are caused by errors in the clustering results. We can adjust the parameters ε and MinPts for different situations; tuning these parameters intuitively changes the behavior of the algorithm, for example, if we want the algorithm to be more conservative, we can reduce the value of MinPts. This makes our approach more reliable and controllable in this challenging environment.

4. Conclusions

In this paper, we focus on robotic grasping in a challenging environment in which the data from the RGB-D camera exhibit significant distortion. To eliminate the uncertainty of the grasping result, we proposed a method which combines a CNN-based instance segmentation network with clustering and plane extraction methods. The experimental results fully demonstrate the superiority of our approach in this highly challenging environment. Meanwhile, our approach lacks universality, so our future work will focus on improving the robustness of our algorithms while maintaining stability and high success rates, and on carrying out subsequent studies on the peg-in-hole assembly task.

Author Contributions

Conceptualization, R.Y. and H.W.; methodology, R.Y. and H.W.; software, R.Y. and M.L.; validation, R.Y., H.W. and H.H.; formal analysis, R.Y.; investigation, R.Y. and H.W.; resources, R.Y. and M.L.; data curation, R.Y. and Y.C.; writing—original draft preparation, R.Y.; writing—review and editing, H.W. and Y.S.; visualization, R.Y.; supervision, H.W., Y.C. and H.H.; project administration, H.W.; funding acquisition, H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been carried out within the framework of the EUROfusion Consortium, funded by the European Union via the Euratom Research and Training Programme (Grant Agreement No. 101052200—EUROfusion). Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the European Commission can be held responsible for them. This work was also supported by the Comprehensive Research Facility for Fusion Technology Program of China under Contract No. 2018-000052-73-01-001228 and the University Synergy Innovation Program of Anhui Province under Grant No. GXXT-2020-010.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Maitin-Shepard, J.; Cusumano-Towner, M.; Lei, J.; Abbeel, P. Cloth grasp point detection based on multiple-view geometric cues with application to robotic towel folding. In Proceedings of the 2010 IEEE International Conference on Robotics and Automation (ICRA), Anchorage, AK, USA, 3–7 May 2010; pp. 2308–2315.
2. Levine, S.; Pastor, P.; Krizhevsky, A.; Ibarz, J.; Quillen, D. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int. J. Robot. Res. 2018, 37, 421–436.
3. Lenz, I.; Lee, H.; Saxena, A. Deep learning for detecting robotic grasps. Int. J. Robot. Res. 2015, 34, 705–724.
4. Kumra, S.; Kanan, C. Robotic grasp detection using deep convolutional neural networks. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 769–776.
5. Morrison, D.; Corke, P.; Leitner, J. Learning robust, real-time, reactive robotic grasping. Int. J. Robot. Res. 2020, 39, 183–201.
6. Kumra, S.; Joshi, S.; Sahin, F. Antipodal robotic grasping using generative residual convolutional neural network. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 9626–9633.
7. Bousmalis, K.; Irpan, A.; Wohlhart, P.; Bai, Y.; Kelcey, M.; Kalakrishnan, M.; Downs, L.; Ibarz, J.; Pastor, P.; Konolige, K.; et al. Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 4243–4250.
8. Tremblay, J.; To, T.; Sundaralingam, B.; Xiang, Y.; Fox, D.; Birchfield, S. Deep object pose estimation for semantic robotic grasping of household objects. arXiv 2018, arXiv:1809.10790.
9. Murali, A.; Mousavian, A.; Eppner, C.; Paxton, C.; Fox, D. 6-DOF grasping for target-driven object manipulation in clutter. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 6232–6238.
10. Sundermeyer, M.; Mousavian, A.; Triebel, R.; Fox, D. Contact-GraspNet: Efficient 6-DoF grasp generation in cluttered scenes. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi'an, China, 30 May–5 June 2021; pp. 13438–13444.
11. Wen, B.; Lian, W.; Bekris, K.; Schaal, S. CaTGrasp: Learning category-level task-relevant grasping in clutter from simulation. arXiv 2021, arXiv:2109.09163.
12. Tobin, J.; Biewald, L.; Duan, R.; Andrychowicz, M.; Handa, A.; Kumar, V.; McGrew, B.; Ray, A.; Schneider, J.; Welinder, P.; et al. Domain randomization and generative models for robotic grasping. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 3482–3489.
13. Varley, J.; DeChant, C.; Richardson, A.; Ruales, J.; Allen, P. Shape completion enabled robotic grasping. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 2442–2447.
14. James, S.; Wohlhart, P.; Kalakrishnan, M.; Kalashnikov, D.; Irpan, A.; Ibarz, J.; Levine, S.; Hadsell, R.; Bousmalis, K. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12627–12637.
15. Joshi, S.; Kumra, S.; Sahin, F. Robotic grasping using deep reinforcement learning. In Proceedings of the 2020 IEEE 16th International Conference on Automation Science and Engineering (CASE), Hong Kong, China, 20–21 August 2020; pp. 1461–1466.
16. Quillen, D.; Jang, E.; Nachum, O.; Finn, C.; Ibarz, J.; Levine, S. Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 6284–6291.
17. Xiang, Y.; Xie, C.; Mousavian, A.; Fox, D. Learning RGB-D feature embeddings for unseen object instance segmentation. In Proceedings of the Conference on Robot Learning (CoRL), Virtual Event, 16–18 November 2020.
18. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD), Portland, OR, USA, 2–4 August 1996; Volume 96, pp. 226–231.
19. Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395.
20. Tsai, R. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE J. Robot. Autom. 1987, 3, 323–344.
Figure 1. In the fusion environment, many tasks need to be carried out via robotic grasping. In this figure, the robotic arm on the right side is gripping some components from the toolbox on the left.
Figure 2. The overall processing of the robotic grasping method in this paper.
Figure 3. RGB-D cameras have poor performance when measuring smooth surfaces; significant breaks and distortions in the 3D point cloud data exist in the figure.
Figure 4. The elements contained within the grasping pose.
Figure 5. Our experimental platform.
Figure 6. The images from the RGB-D camera’s view of the various stages in the experiment.
Figure 7. Experimental scenarios, including different backgrounds: wooden table and metal table, and different poses: standing pose and lying pose.
Table 1. Success rate comparison.

Scenario                        GR-ConvNet   UOIS   HUDR
wooden table + lying pose       78%          86%    93%
wooden table + standing pose    53%          92%    100%
metal table + lying pose        49%          44%    82%
metal table + standing pose     26%          71%    97%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
