Article
Peer-Review Record

Building Knowledge Graphs from Unstructured Texts: Applications and Impact Analyses in Cybersecurity Education

by Garima Agrawal 1,*, Yuli Deng 1, Jongchan Park 2, Huan Liu 1 and Ying-Chih Chen 2
Reviewer 1:
Reviewer 2:
Submission received: 27 September 2022 / Revised: 29 October 2022 / Accepted: 2 November 2022 / Published: 4 November 2022
(This article belongs to the Special Issue Knowledge Graph Technology and Its Applications)

Round 1

Reviewer 1 Report

The authors present a bottom-up approach to curate entity-relation pairs and construct knowledge graphs and a question-answering model for cybersecurity education. The authors first used natural language processing (NLP) based Named-Entity Recognition (NER) techniques to extract the entities and relations from the instruction manuals and generated the graphs. Using this visual graph representation of the text, they identified the key entities and relations and defined the ontology for cybersecurity education. To minimize the labeling effort, the authors developed a custom entity matcher in Python to extract the useful text matching the key entities from the remaining manuals. The authors developed a web-based user interface (UI) to view the knowledge graphs. They also built a chatbot using an SVM classifier to answer student queries related to cybersecurity. The web UI containing the knowledge graphs and chatbot was linked to the cybersecurity course portal and given to graduate-level students to focus on key concepts in cybersecurity projects. To evaluate the impact of the new learning paradigm, the authors conducted surveys and interviews with students after each project to gauge the usefulness of the bot and the knowledge graphs. The results show that students found these tools informative for learning the core concepts and used the knowledge graphs as a visual reference to cross-check their progress, which helped them complete their project tasks.
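For readers unfamiliar with the kind of keyword-based entity matcher summarized above, here is a minimal standard-library sketch of the idea. The authors' actual implementation uses spaCy and is not reproduced in this record; the entity list and labels below are purely illustrative.

```python
import re

# Illustrative key entities and labels; the paper's actual curated list is not shown here.
KEY_ENTITIES = {"buffer overflow": "Attack", "firewall": "Defense", "nmap": "Tool"}

def match_entities(text):
    """Return (matched span, label) pairs for every key entity found in the text."""
    found = []
    lowered = text.lower()
    for phrase, label in KEY_ENTITIES.items():
        # Word-boundary match so 'nmap' does not match inside e.g. 'unmapped'.
        for m in re.finditer(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.append((text[m.start():m.end()], label))
    return found

print(match_entities("Use nmap to scan, then exploit the buffer overflow."))
# → [('buffer overflow', 'Attack'), ('nmap', 'Tool')]
```

In practice one would use spaCy's `PhraseMatcher` over tokenized text rather than raw regexes, but the principle is the same: a small curated vocabulary pulls matching spans out of the remaining, unlabeled manuals.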

Major Comments

- Intro: “Our objective is to explore…and standard datasets” → The scalability of the method raises some concerns, as it is tested only on the cybersecurity domain. This may raise questions about bias and about datasets fine-tuned prior to the implementation of the method.

- Intro: “With increase in internet-connected… solving skills for cybersecurity” → I think this paragraph can be omitted, as the importance of cybersecurity is well understood. Removing it would also make the introduction clearer.

- Section 3: “In future, we aim to build a question-answering dataset to train the cybersecurity KG” → cybersecurity KG model? I do not understand what this refers to. Perhaps the authors mean a data-driven model that can answer questions with the help of the KG?

- Subsection 3.1.2: This method appears to be a translation of natural language into triples using some off-the-shelf tools.

It seems this would result in a KG with an extremely large number of classes and properties, which does not seem right.

For instance, will there be a class for every subject and object? Won't there be some sort of clustering for them?

Moreover, this unsupervised method of transforming natural language into triples will almost certainly add noise to the graph. How do the authors control this unwanted result?

Also, is stemming not used? Otherwise, I can think of many examples where words that refer to the same named entity would be considered distinct.

Either way, unsupervised translation of natural language into triples is very risky.
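The stemming concern raised above can be made concrete with a toy example. The naive suffix rules below are a stand-in for a real stemmer or lemmatizer (e.g. spaCy's); without such normalization, the surface forms shown would become three distinct graph nodes instead of one.

```python
# Toy normalizer illustrating the reviewer's point: 'Attacks', 'attack', and
# 'attacking' should map to a single node key, not three. The suffix list is
# deliberately minimal and not a substitute for a real lemmatizer.
SUFFIXES = ("ing", "es", "s")

def normalize(term):
    term = term.lower().strip()
    for suf in SUFFIXES:
        if term.endswith(suf) and len(term) - len(suf) >= 3:
            return term[: -len(suf)]
    return term

surface_forms = ["Attacks", "attack", "attacking"]
print({normalize(t) for t in surface_forms})  # → {'attack'}
```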

- Subsection 3.4: “To avoid the annotation effort…function in spacy library” → I do not think off-the-shelf tools can capture the semantics hidden in natural-language sentences. Can the authors elaborate on whether they capture any semantic information with these tools?

- Subsection 3.5: “csv files to store the triples” → So the authors have not stored the triples in a KG tool, such as Protégé, TopBraid, or GraphDB, to run the KG through a reasoner and check for inconsistencies?

- Subsection 3.6: “Thus, we built a UI layer…gets the answer from the bot” → So are there some predefined questions that, when selected by the user, are translated into SPARQL counterparts?

How this works is not clear to me; I think the authors should elaborate more on it.

- Subsection 3.6.2: “In this work, we considered using…main intents as labels” → Perhaps some screenshots would make this clearer.

Also, since the authors extract information from a KG, I think a SPARQL query example should be provided.

If SPARQL is not used, please elaborate on how you extract the information :).
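To make the request concrete, a query of the kind being asked for might look like the following. The prefix, class, and property names are hypothetical, since the paper's actual schema is not reproduced in this record.

```sparql
# Hypothetical schema: all names below are illustrative, not taken from the paper.
PREFIX cyber: <http://example.org/cybersecurity#>

SELECT ?defense
WHERE {
  cyber:BufferOverflow cyber:mitigatedBy ?defense .
}
```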

- Evaluation: I can see that the evaluation captures, to a point, the completeness of the KG and the information-extraction mechanism.

However, I consider some consistency checking necessary, especially given how the authors insert data (i.e., with noise checking). So I think some SHACL rules are mandatory to have a complete evaluation and a trustworthy KG.
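As an illustration of the kind of SHACL rule being suggested, a single shape might look like this (class and property names are hypothetical, since the paper's schema is not reproduced here):

```turtle
# Hypothetical SHACL shape: every node typed cyber:Attack must have at least
# one cyber:mitigatedBy link whose value is a cyber:Defense.
@prefix sh:    <http://www.w3.org/ns/shacl#> .
@prefix cyber: <http://example.org/cybersecurity#> .

cyber:AttackShape
    a sh:NodeShape ;
    sh:targetClass cyber:Attack ;
    sh:property [
        sh:path     cyber:mitigatedBy ;
        sh:class    cyber:Defense ;
        sh:minCount 1 ;
    ] .
```

Running such shapes with a SHACL validator would flag noisy or incomplete triples introduced by the unsupervised extraction step.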

- Discussion: The Discussion can be expanded with the recommendations I propose in the Evaluation comments above.

Minor Comments

- Intro: “The ontology essentially provides…information from the graph” → I understand what the authors are saying here, but the sentence is a bit harsh; some rephrasing might be needed.

- Intro: “crucial step” → crucial steps

- Subsection 3.3.1: “The the edge” → The edge

Language

The writing is consistent and good; it makes the idea, methodology, and contribution of the paper clear to the reader.

Recommendation (Major Revision)

I am proposing a major revision for two reasons: (i) the scalability of the method raises some concerns, as it is tested only on the cybersecurity domain, which may raise questions about bias and about datasets fine-tuned prior to the implementation of the method; and (ii) noise may be added to the KG with no supervision present, and no evaluation of the KG's consistency is performed. I therefore consider some consistency checking necessary, especially given how the authors insert data (i.e., with noise checking), and I think some SHACL rules are mandatory to have a complete evaluation and a trustworthy KG.

Author Response

Please find our responses attached. Thanks!

Author Response File: Author Response.pdf

Reviewer 2 Report

The authors of this manuscript demonstrate how to build knowledge graphs from unstructured texts and demonstrate their application in cybersecurity education. Highlights include a bottom-up approach to constructing knowledge graphs and question-answering models for cybersecurity education based on entity-relation pairs. While this paper lacks some scientific novelty, it is of significant engineering value to the field of educational technology. The Methods and Results sections appear clear and reasonable.

However, there are still some points that need to be resolved:

1.    We suggest that the Abstract be made shorter and more concise.

2.    We suggest renaming the section "3. Our Knowledge Graph Framework for Cybersecurity Education" to "Methods", following the conventions of academic writing.

3.    Is there any evidence (such as raw survey data) to support the arguments in "4.1. Results"? This should be addressed in Section 4.1 or presented as supplementary files.

4.    The efficiency, performance, and output quality of the pipeline in the manuscript should be compared to similar pipelines or tools, if any exist.

Author Response

Thanks!

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

A)    The authors present a methodology that is much needed in the area of knowledge graph and ontology construction and Linked Open Data. The paper should be accepted, but some small changes are still needed before this happens.

In order to consider the paper a complete study and to address any potential biases the methodology might raise, I recommend testing the method on another domain. It could be a subarea of a domain other than cybersecurity, and just a small subsection indicating that this “testing” took place would be enough.

B)    The paper needs some proofreading, as there are some typos and syntactic issues.

C)     Some technicalities about the technologies can be omitted from the abstract.

D)    “Relationships for Cybersecurity Education – We analyzed… uses” → Was there a specific purpose for defining these properties? That is, was there a corpus for defining them, or did these properties appear to be the most common ones after some analysis of the data?

E)     Figure 5 → I would like a little more analysis of the tuples, as I do not fully understand what each value is.

Author Response

We are really thankful for your comments. Attached are our response and the revised manuscript.

Author Response File: Author Response.pdf

Reviewer 2 Report

The answers to our questions have been provided in a satisfactory manner, and we agree that the manuscript can be accepted as presented.

Author Response

We are really thankful to you for reviewing our paper.

Round 3

Reviewer 1 Report

The authors have addressed all my comments.
