Article
Peer-Review Record

CATCHA: Real-Time Camera Tracking Method for Augmented Reality Applications in Cultural Heritage Interiors

ISPRS Int. J. Geo-Inf. 2018, 7(12), 479; https://0-doi-org.brum.beds.ac.uk/10.3390/ijgi7120479
by Piotr Siekański 1,*, Jakub Michoński 1, Eryk Bunsch 2 and Robert Sitnik 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 6 November 2018 / Revised: 30 November 2018 / Accepted: 13 December 2018 / Published: 15 December 2018
(This article belongs to the Special Issue Data Acquisition and Processing in Cultural Heritage)

Round 1

Reviewer 1 Report

Please provide some more details; a few questions:

1) Page 7, rows 204-205: which method do you use?

2) Page 8, row 224: It seems your method works only on FLAT surfaces (walls): what method do you use to determine the orientation (attitude) of the wall? The point normals? Please add some more words on this.

3) Page 9, row 242: Why NOT? Maybe because the walls are not always parallel? Please add some more words on this.

4) Page 15, row 365: I also suggest SLERP camera filtering, which is less expensive (https://en.wikipedia.org/wiki/Slerp).


Author Response

Response to Reviewer 1 Comments

 

The authors of the manuscript would like to thank the reviewer for the constructive and practical suggestions. Especially relevant and helpful were the requests to clarify the orthomap-generation process and the case when the camera points at the ceiling. In our opinion, the paper is more valuable from a scientific and practical point of view with these corrections. Detailed answers to the particular remarks are given below.

Point 1: Page 7, rows 204-205: which method do you use?

Response 1: We greatly appreciate this comment. We use naive point cloud rendering because our cloud is very dense and contains no gaps even when rendered at 16 times greater resolution. Therefore, we can apply antialiasing to our renders to achieve satisfactory results. We added a clarification on this to the manuscript.
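For illustration, a minimal sketch of such supersampled naive rendering in Python (the orthographic projection, the buffer layout, and all names are assumptions for the example, not our actual implementation; scale=4 corresponds to a 16-times-larger render, 4x per axis):

```python
import numpy as np
import cv2

def render_orthomap(points, colors, width, height, scale=4):
    """Splat a dense point cloud into a scale-times-larger buffer, then
    downsample with area averaging so the orthomap is antialiased."""
    big_w, big_h = width * scale, height * scale
    img = np.zeros((big_h, big_w, 3), np.uint8)
    depth = np.full((big_h, big_w), np.inf)
    # Assume points are already expressed in the wall segment's local frame:
    # x/y in pixels of the target orthomap, z as distance from the plane.
    u = (points[:, 0] * scale).astype(int)
    v = (points[:, 1] * scale).astype(int)
    z = points[:, 2]
    ok = (u >= 0) & (u < big_w) & (v >= 0) & (v < big_h)
    for ui, vi, zi, ci in zip(u[ok], v[ok], z[ok], colors[ok]):
        if zi < depth[vi, ui]:          # naive z-buffer: keep the closest point
            depth[vi, ui] = zi
            img[vi, ui] = ci
    # INTER_AREA averages the sub-pixel splats, which acts as antialiasing.
    return cv2.resize(img, (width, height), interpolation=cv2.INTER_AREA)
```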

Point 2: Page 8, row 224: It seems your method works only on FLAT surfaces (walls): what method do you use to determine the orientation (attitude) of the wall? The point normals? Please add some more words on this.

Response 2: We perform point cloud segmentation based on each point's normal vector. Each locally flat segment (i.e., one where the normal vectors are consistent) is then rendered separately. We added this explanation to the manuscript.
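A minimal sketch of this kind of normal-based grouping (the set of dominant directions and the angular threshold are illustrative assumptions):

```python
import numpy as np

def segment_by_normals(normals, directions, max_angle_deg=30.0):
    """Label each point with the dominant direction its normal is closest to,
    or -1 when no direction lies within the cone threshold."""
    n = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    cos_sim = n @ d.T                               # (num_points, num_directions)
    best = cos_sim.argmax(axis=1)
    best_cos = cos_sim[np.arange(len(n)), best]
    cos_thr = np.cos(np.radians(max_angle_deg))
    return np.where(best_cos >= cos_thr, best, -1)  # one flat segment per label

# Example: the six axis-aligned directions of a box-like interior
# (four walls, floor, ceiling); each resulting segment is rendered separately.
box_dirs = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                     [0, -1, 0], [0, 0, 1], [0, 0, -1]], float)
```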

Point 3: Page 9, row 242: Why NOT? Maybe because the walls are not always parallel? Please add some more words on this.

Response 3: Thank you for this comment. Our approach addresses the extreme viewing angles that occur when the user looks at the walls of the interior and the angle between the camera's optical axis and the surface's normal vector is large. In this case, standard wide-baseline descriptors may fail, so we had to develop another approach. When the mounted camera is rotated with respect to the screen of the viewing device and points at the ceiling, the screen is still oriented roughly parallel to the interior walls. One might also render orthomaps for the walls, but in that case there would be no performance gain. We added this clarification to the manuscript.
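As a small illustration of the condition being avoided, the viewing angle in question can be computed as follows (a sketch; the function name is ours, and the failure threshold is an empirical matter):

```python
import numpy as np

def grazing_angle_deg(optical_axis, surface_normal):
    """Angle between the camera's optical axis and the surface normal:
    ~0 deg when looking head-on, approaching 90 deg at grazing views,
    where wide-baseline descriptors tend to fail."""
    a = optical_axis / np.linalg.norm(optical_axis)
    n = surface_normal / np.linalg.norm(surface_normal)
    return np.degrees(np.arccos(np.clip(abs(a @ n), 0.0, 1.0)))
```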

Point 4: Page 15, row 365: I also suggest SLERP camera filtering, which is less expensive (https://en.wikipedia.org/wiki/Slerp).

Response 4: Thank you for this suggestion. We will consider this approach in our future work.
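For completeness, a minimal sketch of the quaternion SLERP the reviewer refers to (the blend factor t would be a tuning parameter; this is not part of the current method):

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0 and q1."""
    q0, q1 = q0 / np.linalg.norm(q0), q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:                       # take the shorter arc on the sphere
        q1, dot = -q1, -dot
    if dot > 0.9995:                    # nearly parallel: linear fallback
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1.0 - t) * theta) * q0
            + np.sin(t * theta) * q1) / np.sin(theta)

# Per-frame smoothing would blend the raw orientation toward the previous
# filtered one, e.g.: filtered = slerp(prev_filtered, raw_orientation, 0.3)
```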

 


Author Response File: Author Response.docx

Reviewer 2 Report

Dear authors, I have reviewed your manuscript "CATCHA: Real-time camera tracking method for Augmented Reality applications in cultural heritage interiors" (IJGI-393662). I have enjoyed reading it.

I think that the paper is quite interesting, both in the application itself (camera tracking for AR implementation) and in the applied methodology. You also describe a detailed state of the art of the different approaches used to solve this problem. I think that the proposed method is very promising, although it lacks further analysis and checks (especially on complex object geometries and difficult textured surfaces). In any case, homogeneously textured surfaces or bright, glass, metal, etc. surfaces are always a challenging problem for extracting information in image-based methods.

I have attached a file with some minor comments. Figure 8 must be improved and explained in the text and, in my opinion, Figure 13 could be simplified (I do not think all the graphs are necessary for all the sequences). Situations (or camera layouts) where higher pose and attitude errors occur should be commented on.

Comments for author File: Comments.pdf

Author Response

Response to Reviewer 2 Comments

 

The authors of the manuscript would like to thank the reviewer for the constructive and practical suggestions. Especially relevant and helpful was the comment about poorly explained figures. In our opinion, the paper is more valuable from a scientific and practical point of view with these corrections. Detailed answers to the particular remarks are given below.

Point 1: Dear authors, I have reviewed your manuscript "CATCHA: Real-time camera tracking method for Augmented Reality applications in cultural heritage interiors" (IJGI-393662). I have enjoyed reading it.

I think that the paper is quite interesting, both in the application itself (camera tracking for AR implementation) and in the applied methodology. You also describe a detailed state of the art of the different approaches used to solve this problem. I think that the proposed method is very promising, although it lacks further analysis and checks (especially on complex object geometries and difficult textured surfaces). In any case, homogeneously textured surfaces or bright, glass, metal, etc. surfaces are always a challenging problem for extracting information in image-based methods.

Response 1: Thank you for this suggestion. In our future work we plan to examine more complex interior geometries and different materials.

Point 2: I have attached a file with some minor comments. Figure 8 must be improved and explained in the text and, in my opinion, Figure 13 could be simplified (I do not think all the graphs are necessary for all the sequences). Situations (or camera layouts) where higher pose and attitude errors occur should be commented on.

Response 2: Thank you for these suggestions. Detailed answers to the particular observations are given below.

Observations:

Point 3: It is written "ancient stoa". Maybe "an ancient stoa", "ancient stoae", or "ancient stoas"?

Response 3: Thank you for pointing this out. It should be "an ancient stoa". The manuscript was corrected.

Point 4: "light technique in 2009 then renovated" - rephrase?

Response 4: The sentence was rephrased.

Point 5: Could you indicate which scanner you used?

Response 5: We used a FARO Focus3D X130 HDR scanner. This information was added to the manuscript.

Point 6: You use orthographic renders. These renders can fit properly when the object is composed of planar surfaces, but even in such cases, what happens with hidden parts that do not appear in the render but are present in the images? Is there any limit for areas not well covered by the render?

Response 6: We agree with this observation. The number of orthomaps depends on the interior's geometry: the more complex the geometry, the more orthomaps are necessary to cover the whole interior. Only areas covered by the orthomaps may be used for tracking, so the algorithm is unable to track areas that are not covered. One possible solution is to integrate a SLAM-based tracker to detect geometry that is unknown, or has changed, at runtime. We added this point to the Discussion section. If enough correspondences are found in the acquired images, the method will estimate the camera pose successfully. We did not perform a detailed examination of this, but the minimum number of non-collinear correspondences required to estimate the camera pose is 4, and we reject all solutions that have fewer than 7 inliers after RANSAC elimination (a threshold set by us). We added this point to the manuscript.
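A minimal sketch of this rejection logic using OpenCV's RANSAC PnP solver (variable names and the reprojection threshold are illustrative; only the 4-point minimum and the 7-inlier cutoff come from the text above):

```python
import numpy as np
import cv2

MIN_INLIERS = 7  # solutions with fewer RANSAC inliers are rejected

def estimate_pose(obj_pts, img_pts, K, dist_coeffs):
    """Return (rvec, tvec) or None when the correspondences are unreliable."""
    if len(obj_pts) < 4:                          # PnP needs at least 4 points
        return None
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(obj_pts, np.float32), np.asarray(img_pts, np.float32),
        K, dist_coeffs, reprojectionError=3.0)
    if not ok or inliers is None or len(inliers) < MIN_INLIERS:
        return None
    return rvec, tvec
```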

Point 7: How are the masks generated?

Response 7: The masks are generated by the user. We used a semi-automated method based on hue, but any method may be used. In our future work we will consider fully automating this process by employing machine learning to filter the noise.
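A minimal sketch of such a hue-based mask in OpenCV (the hue band and the morphological cleanup are assumed example values, not the ones we used):

```python
import cv2

def hue_mask(image_bgr, hue_lo=15, hue_hi=35, min_sat=40, min_val=40):
    """Keep pixels whose hue falls within [hue_lo, hue_hi] (OpenCV hue: 0-179)."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (hue_lo, min_sat, min_val), (hue_hi, 255, 255))
    # Morphological opening removes speckle noise before manual touch-up.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```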

Point 8: I suggest that equation No. 4 be renamed Eq. No. 2; then, in the following paragraphs, you can explain the new equation 2 with the help of eqs. 3 and 4 (former eqs. 2 and 3). I think this is clearer, and you avoid mentioning equation 4 in the text before eqs. 2 and 3.

Response 8: Thank you for this suggestion. This section was reorganized.

Point 9: “orange”? Does “orange” refer to the orange line in Figure 8?

Response 9: Thank you for this comment. "Orange" refers to the line in Figure 8. We removed this reference from the text and added a legend to Figure 8.

Point 10: Please explain Figure 8 in the text (the figure must also be referred to in the text). The blue line is not explained. Why don't you add a legend (orange and blue lines) to Figure 8?

Response 10: We appreciate this comment. The legend was added. The figure is referenced in line 194 of the original manuscript.

Point 11: You compute the camera pose with the P3P algorithm based on a perspective three-point layout of the object, don't you? What happens if the image fails to match the three-point perspective problem (i.e., orthogonal images)?

Response 11: That is correct. If the requirements on the input points for the P3P algorithm are not met (i.e., at least 4 non-collinear points), the algorithm will fail to deliver the camera pose. In this case, the result is not forwarded to the visualization module and the previous frame is rendered. As we achieve a high framerate and this happens only rarely, the effect does not influence the user experience. We assume that the intrinsic camera parameters do not change during tracking and that perspective is always present.
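A minimal sketch of this fallback behaviour (all names are illustrative; estimate_pose stands for any PnP-based pose solver, such as the RANSAC sketch above):

```python
last_pose = None  # most recent successfully estimated pose

def track_frame(frame, find_correspondences, estimate_pose, render):
    """Render with the new pose when estimation succeeds; otherwise keep
    showing the previous view, which at high framerates goes unnoticed."""
    global last_pose
    obj_pts, img_pts = find_correspondences(frame)   # 2D-3D matches
    pose = estimate_pose(obj_pts, img_pts)
    if pose is not None:
        last_pose = pose                             # update only on success
    if last_pose is not None:
        render(frame, last_pose)                     # else re-render previous
```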

Point 12: You have carried out six different tests (3 tests per case study: King's Cabinet (tests 1, 2, 3) and the Al Fresco (tests 4, 5, 6)). The tests (sequences) are numbered in Tables 2 and 3, but in Figure 14 you use letters, although you match the letters with the sequences in the figure caption. In any case, it is very confusing; can you indicate the sequence number in the graphs as well? Another comment on this figure is that it seems redundant, at least to me. Can you select 2 or 3 representative sequences instead of all sequences?

Response 12: Thank you for this suggestion. We also considered simplifying the figure but decided that it presents an important point regarding the system's validation. We added the sequence numbers to the graphs.

Point 13: Are attitude errors (rotations) higher than 4 deg acceptable for this purpose? What camera layouts with respect to the poster give rise to such errors?

Response 13: Occasional rotation errors higher than 4 deg may be acceptable to the user because the camera path is smoothed using Gaussian filtration; thus, they will not affect the user experience if they occur rarely. One example frame with such an error is shown in Figure 15. The cause of high rotation errors is that the PnP algorithm may fall into a local minimum. This does not depend on a specific camera layout but rather on the configuration of matched keypoints found in the image. We discuss this issue in lines 361-367 of the original manuscript and have extended this section to make it clearer.
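A minimal sketch of Gaussian smoothing applied to the translation part of the camera path (sigma is an assumed tuning parameter; orientations would need rotation-aware filtering, e.g., the SLERP sketch in the response to Reviewer 1):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_camera_path(positions, sigma=2.0):
    """positions: (N, 3) array of camera positions over time; each axis is
    low-pass filtered independently, suppressing occasional pose outliers."""
    return gaussian_filter1d(np.asarray(positions, float), sigma=sigma, axis=0)
```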

 

LINE: OBSERVATION

Line 116: It is written "ancient stoa". Maybe "an ancient stoa", "ancient stoae", or "ancient stoas"?

Line 125: "light technique in 2009 then renovated" - rephrase?

Line 132: Could you indicate which scanner you used?

Lines 158-159: You use orthographic renders. These renders can fit properly when the object is composed of planar surfaces, but even in such cases, what happens with hidden parts that do not appear in the render but are present in the images? Is there any limit for areas not well covered by the render?

Lines 168-169: How are the masks generated?

Line 174: I suggest that equation No. 4 be renamed Eq. No. 2; then, in the following paragraphs, you can explain the new equation 2 with the help of eqs. 3 and 4 (former eqs. 2 and 3). I think this is clearer, and you avoid mentioning equation 4 in the text before eqs. 2 and 3.

Line 196: "orange"? Does "orange" refer to the orange line in Figure 8?

Lines 197-198: Please explain Figure 8 in the text (the figure must also be referred to in the text). The blue line is not explained. Why don't you add a legend (orange and blue lines) to Figure 8?

Line 216: You compute the camera pose with the P3P algorithm based on a perspective three-point layout of the object, don't you? What happens if the image fails to match the three-point perspective problem (i.e., orthogonal images)?

Lines 248-249: You have carried out six different tests (3 tests per case study: King's Cabinet (tests 1, 2, 3) and the Al Fresco (tests 4, 5, 6)). The tests (sequences) are numbered in Tables 2 and 3, but in Figure 14 you use letters, although you match the letters with the sequences in the figure caption. In any case, it is very confusing; can you indicate the sequence number in the graphs as well? Another comment on this figure is that it seems redundant, at least to me. Can you select 2 or 3 representative sequences instead of all sequences?

Tables 2, 3 and Figure 8: Are attitude errors (rotations) higher than 4 deg acceptable for this purpose? What camera layouts with respect to the poster give rise to such errors?

 

 


Author Response File: Author Response.docx

Reviewer 3 Report

The paper is well organized; it presents a good background to introduce the scientific context and adequately describes the adopted method.

The results seem encouraging: the idea of using camera tracking to create a new tool that helps restorers and visitors compare different states of conservation certainly deserves further exploration. The work is clearly presented; the discussion is addressed critically, showing strengths and weaknesses, and the conclusions are adequate.


Author Response

Response to Reviewer 3 Comments

 

The authors of the manuscript would like to thank the reviewer for the review.

Point 1: The paper is well organized; it presents a good background to introduce the scientific context and adequately describes the adopted method.

Response 1: Thank you for this comment.

Point 2: The results seem encouraging: the idea of using camera tracking to create a new tool that helps restorers and visitors compare different states of conservation certainly deserves further exploration.

Response 2: As you suggested, we plan further development of our method to deliver better tools for cultural heritage restorers.

Point 3: The work is clearly presented; the discussion is addressed critically, showing strengths and weaknesses, and the conclusions are adequate.

Response 3: We really appreciate this comment.

 


Author Response File: Author Response.docx

Reviewer 4 Report

The paper presents the construction of a new algorithm for Augmented Reality in cultural heritage interiors, named CATCHA.

The algorithm follows an innovative approach based on orthographic model rendering, which seems mostly applicable in specific contexts; indeed, despite the great effort of the authors to solve a number of issues and respect strict requirements, the algorithm does not seem applicable "universally in cultural heritage interiors", as stated in the conclusions. CATCHA seems limited not only to interiors with perpendicular walls (as correctly reported in lines 331-333); it seems capable of working mainly with almost flat wall surfaces and in contexts where the actual condition of the artifact is not much different from the original scan. No reference seems to have been made to the case where items have been removed or where different volumes are present in the interior (columns, windows, steps, etc.).

Besides the above, the method seems quite solid and on a good path to improve even further. My only concern therefore regards the extension to other contexts or even other materials (more glossy surfaces or indoor contexts with furniture, for example).

Furthermore, the reference to cultural heritage seems to be dictated by the two case studies. If the same paper were presented in architecture studies, perhaps nothing in the paper would change. If the reference to cultural heritage is to remain, its importance for CH should be stressed more (i.e., what such an AR application could bring in terms of musealization, artefact study, conservation, ...).

For the sake of clarity, lines 312-313 should be rephrased.

A final minor note: it would be more correct to replace Structure from Motion with Digital Photogrammetry; the first is normally just one step of the other, and the production of a dense point cloud cannot be undertaken with SfM algorithms alone.


Author Response

Response to Reviewer 4 Comments

 

The authors of the manuscript would like to thank the reviewer for the constructive and practical suggestions. Especially relevant and helpful was the suggestion to discuss the case in which the interior has changed significantly between its actual state and the state in which it was scanned. In our opinion, the paper is more valuable from a scientific and practical point of view with these corrections. Detailed answers to the specific points in the review are given below.

Point 1: The paper presents the construction of a new algorithm for Augmented Reality in cultural heritage interiors, named CATCHA.

The algorithm follows an innovative approach based on orthographic model rendering, which seems mostly applicable in specific contexts; indeed, despite the great effort of the authors to solve a number of issues and respect strict requirements, the algorithm does not seem applicable "universally in cultural heritage interiors", as stated in the conclusions.

Response 1: We agree with this opinion and therefore decided to remove the word "universal" so as not to mislead readers.

Point 2: CATCHA seems limited not only to interiors with perpendicular walls (as correctly reported in lines 331-333); it seems capable of working mainly with almost flat wall surfaces and in contexts where the actual condition of the artifact is not much different from the original scan. No reference seems to have been made to the case where items have been removed or where different volumes are present in the interior (columns, windows, steps, etc.).

Response 2: Our approach relies on a scanned 3D model; if a significant change occurs in the interior, the model has to be updated. Minor changes can be effectively eliminated either in the matching filtration step or, based on the 2D-3D correspondences, during RANSAC camera pose estimation. We added this point to the Discussion.

Point 3: Besides the above, the method seems quite solid and on a good path to improve even further. My only concern therefore regards the extension to other contexts or even other materials (more glossy surfaces or indoor contexts with furniture, for example).

Response 3: Glossy surfaces may reflect the light source, so they are generally impractical for standard keypoint-based camera tracking algorithms. Although the King's Chinese Cabinet contains many gildings on the walls, the keypoints are matched robustly. Ceiling-based tracking is more suitable because the ceiling is matte. We plan to integrate line-based tracking to deal with interiors with glossy or almost textureless surfaces. In our future work we plan to examine this issue more deeply.

Point 4: Furthermore, the reference to cultural heritage seems to be dictated by the two case studies. If the same paper were presented in architecture studies, perhaps nothing in the paper would change. If the reference to cultural heritage is to remain, its importance for CH should be stressed more (i.e., what such an AR application could bring in terms of musealization, artefact study, conservation, ...).

Response 4: Our algorithm can be the basis of many potential AR applications, e.g., visual comparison of two conservation states, interactive model analysis (e.g., curvature visualization), and interactive annotations for cultural heritage restorers. A paragraph on potential applications was added to the Discussion section.

Point 5: For the sake of clarity, lines 312-313 should be rephrased.

Response 5: We agree with this opinion. The sentence was rephrased.

Point 6: A final minor note: it would be more correct to replace Structure from Motion with Digital Photogrammetry; the first is normally just one step of the other, and the production of a dense point cloud cannot be undertaken with SfM algorithms alone.

Response 6: We really appreciate this suggestion. We added a clarification to the manuscript.


Author Response File: Author Response.docx
