A large share of the European Union’s total budget is spent on regional policy through the structural funds, whose main purpose is to reduce economic disparities between member states and to support job creation, business competitiveness, economic growth, sustainable development and a better quality of life.
In Greece, the National Strategic Reference Framework (NSRF) sets the priorities for spending these funds at national level over a seven-year period, in order to raise the competitiveness of the economy, develop human capital and ensure higher employment and income, as well as better social integration [1]. The General Secretariat for Investments and NSRF of Greece provides online services that give all interested parties access to the NSRF project data and support the transparency of the public sector, in accordance with the provisions of Chapter A of Law 4305/2014 (Government Gazette 237/A) on the open disposition and further use of documents, information and data of the public sector [2]. The data for the NSRF 2007–2013 are publicly available at http://2013.anaptyxi.gov.gr, the official website of the Greek Ministry for Development and Competitiveness, which provides analytical information on the implementation process of the NSRF projects [3].
According to Fazekas & Tóth [4], EU funds may in many cases increase the risk of corruption and have a negative effect on the development and economic growth of some EU members. During the implementation of a project, unexpected events may occur that affect project milestones, contracts, payments, or the quality of the product or service being delivered. Such events mark the project as a Red Flag; however, the existence of a Red Flag does not necessarily mean that there is corruption in the project [5]. The importance of Red Flags has been indicated in [6] and examined with fraud risk indicators related to management fraud [11]. However, the results of these studies do not show which fraud risk indicators are the most important [13].
Prior research on corruption risk in public procurement includes various tools derived from the Subsidystories and Digital Whistleblower EU projects. Subsidystories.eu collects data on how each EU member state allocates its money from the European Structural and Investment Funds (ESIF), making it easier to follow the money. To achieve this, Subsidystories takes raw data from various portals and official documents of each member state and visualizes them by country, so that they are easier to trace. Its downside is that it is limited to visualizations and applies no data mining to the available data. The Digital Whistleblower project, on the other hand, aims to increase fiscal transparency and to assess the impact of good governance policies through the systematic collection, structuring, analysis and dissemination of information on public procurement. As part of this project, the Monitoring European Tenders (MET) risk assessment software was developed. The software supplies public authorities engaged in procurement activities with an easy-to-use tool that helps them identify risky contracts [14].
Another anti-corruption tool that aims to enhance transparency and fight corruption in public procurement is the EU-supported project RedFlags.eu. This tool is an automatic warning system that uses multiple condition-based algorithms to find Red Flags in Hungarian procurement documents from Tenders Electronic Daily (TED). Specifically, its methodology defines 41 indicators to monitor the public procurement documents published at the launch and at the end of the procedure. Each indicator has a separate algorithm that raises a Red Flag if certain conditions are met. Many of those algorithms are based on pattern matching or out-of-range values. In the end, each procurement procedure can potentially have a red flag per indicator, and the more flags a procurement has, the riskier it is. Finally, the European Commission has developed a risk scoring tool called ARACHNE that performs data mining and data enrichment, with the primary objective of supporting the managing authorities of the member states responsible for EU-funded projects in effectively and efficiently detecting the riskiest projects, contracts, contractors and beneficiaries. ARACHNE, like MET, is meant to be used only by public authorities.
Despite the considerable public and policy interest in corruption and risks in the spending of EU funds, citizens, journalists and even public authorities still need an open monitoring tool that identifies Red Flags, in order to maintain transparency policies, take precautionary measures and prevent these warnings from escalating into project failure [15].
In terms of data mining algorithms, DBSCAN has a variety of uses, as it can handle and identify noise, discover clusters of arbitrary shape and automatically determine the number of clusters [19]. DBSCAN is a robust clustering algorithm that has been compared with other data mining algorithms on a variety of datasets. Recent studies showed that it can be used as part of a system that identifies clusters in order to solve single-target and multi-target regression tasks on several datasets [20], and that it can generate fault clustering templates that reduce the influence of noise on diagnostic accuracy for rolling bearing datasets [21]. Additionally, it has been tested on high-dimensional datasets in which clusters are formed by both distance and density structures, where many clustering algorithms fail to identify the clusters correctly [22].
There are various approaches that combine data mining methods and knowledge discovery with Semantic Web data, which support different data mining tasks and improve the Semantic Web [23]. The purpose of this paper is to propose a framework and implement a knowledge-based system to monitor NSRF projects. The system uses open data and semantic web technologies that follow linked data principles, so that the data can be linked with other datasets and retrieved through a SPARQL endpoint; performance indicators to monitor the implementation; data mining techniques to identify Red Flags; and techniques to visualize the results. This knowledge-based system was developed as a web application, RedFlags. The rest of the paper is structured as follows: Section 2 provides the complete design of the knowledge-based system, from the data extraction to the data mining techniques. Section 3 reports the results, Section 4 includes the user requirements and a use case scenario, and Section 5 concludes the paper with some directions for future research.
2. Materials and Methods
In this section, we describe the knowledge discovery process: the NSRF data used in the RedFlags application, the vocabularies used to represent them semantically, and the process of retrieving the needed data with SPARQL queries, defining performance indicators and using data mining techniques to identify Red Flags (Figure 1).
The official website of the Greek Ministry for Development and Competitiveness publishes data related to the implementation process and the economic activity of the NSRF projects for the programming period at http://2013.anaptyxi.gov.gr/. In order to strengthen the transparency of the public sector, the database is updated daily and can be accessed through the Open Data API [24]. These data provide information about two main categories of actions, projects and support-grants.
Projects: “A group of activities aiming at the realisation of a functionally complete and distinct result. Some projects may consist of other subprojects.” [3].
Support-Grants: “An advantage in any form whatsoever conferred on a selective basis to organisations involved in economic activity private or public (’undertakings’) by national public authorities with the potential to distort competition and affect trade between member states of the European Union. The advantage can take different forms of assistance including the direct transfer of resources, such as grants and soft loans, and also indirect assistance, for example, relief from charges that an undertaking normally has to bear, such as a tax exemption or the provision of services, loans, at a favourable rate.” [3].
There is also a category with 181 Priority Projects ...“the selection of which was made by the Greek authorities in cooperation with the qualified European Commission Services, based on criteria related to the maturity, size and importance of their social and economic impact. The Priority Projects consist of other Projects or Support-Grants” [3].
These data include information about the following: public expenditure budget, contracts signed, payment amounts, the start and end date, status, location, description of projects, number and the title of their subprojects, the thematic priority and the operational programme to which they belong, beneficiaries or other involved organisations and various related documents (pictures, pdfs and docs). Also, some projects may involve expropriations. An expropriation is defined as ...“obligatory, according to the law and based on a defined compensation, acquisition of one’s property by the state, for reasons of public necessity or utility” [3]. The expropriation data consist of information about the area, the compensation money and the decisions based on which they are implemented.
2.3. Semantic Data Modeling
Existing vocabularies that could be used to describe fiscal projects and their implementation process are FRAPO, an ontology for describing the administrative information of research projects, and FP6 and FP7, which were used to model information about the European Commission’s Framework Programme research projects. These ontologies are specific to research projects and could not be used in our case, which was to describe the properties of a financial project and its implementation process for the Greek NSRF. The NSRF consists not only of research projects but also of infrastructure projects and projects regarding energy, the environment, culture and tourism.
The absence of an ontology that describes financial projects led us to develop the Vocabulary of Fiscal Projects (VFP) [25] and the National Strategic Reference Framework Greece Vocabulary (NSRF-GR) [26], ontologies that can be used as a basis for the semantic representation of fiscal projects and of the Greek NSRF data, respectively.
VFP is identified by the namespace URI http://purl.org/vocab/vfp#, its preferred prefix is vfp, and it is also available through its GitHub repository. The design is based on a study of other EU countries’ web portals that provide similar information about projects. Table 1 shows the four main classes we defined to optimise the coverage of terminology in the context of fiscal project data.
The main class of the ontology is vfp:Project. A project is always associated with some organisations (vfp:Organization), a location (vfp:Place) and some documents (vfp:Document). A more detailed cross reference of the ontology classes and properties is available on its webpage. Figure 2 depicts the classes and their relations.
The NSRF-GR Vocabulary extends VFP with new classes and properties to describe NSRF data in as much detail as possible. It is identified by the namespace URI http://purl.org/vocab/nsrf-gr#, its preferred prefix is nsrf-gr, and it is also available through its GitHub repository. The classes and their relations are shown in Figure 3. For each project category we created a separate class as a subclass of vfp:Project. More details about the classes and the properties can be found in the cross reference section of the ontology’s web page.
Listing 1. SPARQL query to retrieve basic information about the NSRF projects.
2.4. NSRF Knowledge Graph and Data Retrieval
We retrieved the data through the Open Data API using Python scripts and stored them in a local database. The transformation of the NSRF data into a knowledge graph was done with the UnifiedViews ETL tool. The main advantage of this tool is that it can extract data straight from relational databases and then transform them into RDF triples [27]. After the transformation, the RDF files were uploaded to an OpenLink Virtuoso Server. We then used SPARQL queries to retrieve the data from the server and analyze them.
In order to semantically represent the information we extracted from the Open Data Portal about the Greek NSRF projects, we used properties from the VFP ontology to describe the title (vfp:title) of the project, the public expenditure budget (vfp:budget), the total amount of signed contracts (vfp:contracts), the payment amount (vfp:payments), the current status (vfp:currentStatus), the location (vfp:location), a detailed description of the project (vfp:description), its start (vfp:startDate) and end date (vfp:endDate), a status report (vfp:statusReport) and the report date (vfp:statusDate), as well as the url of the project (vfp:url) and the documents related to this project (vfp:document). Also, we used properties from the NSRF-GR vocabulary to represent the project’s beneficiary (nsrf-gr:body), the operational programme to which it belongs (nsrf-gr:operational), its thematic priority (nsrf-gr:thematic) and the number of subprojects it has (nsrf-gr:countSubproject). Finally, all projects have a unique code, denoted MIS, and were assigned the rdf:type nsrf-gr:Project.
Object properties such as vfp:currentStatus were not assigned literal terms; instead, we chose to use code lists. The code lists were semantically represented using SKOS, since it is a widespread vocabulary that provides a standard way to organize knowledge using RDF and allows the hierarchical ordering of terms [28].
The query in Listing 1 can be executed on the SPARQL endpoint to retrieve information about the title, description, beneficiary, current status, location, operational programme, thematic priority and the url of the NSRF projects.
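As an illustration of the kind of query described here, the following Python sketch assembles a SPARQL SELECT string from the VFP and NSRF-GR properties named in the text. It is not the paper's Listing 1: the function name `build_basic_query`, the variable names, the use of OPTIONAL patterns and the LIMIT clause are assumptions made for this example, and no endpoint address is assumed.

```python
# Namespace URIs are the ones stated for VFP and NSRF-GR in the text.
PREFIXES = (
    "PREFIX vfp: <http://purl.org/vocab/vfp#>\n"
    "PREFIX nsrf-gr: <http://purl.org/vocab/nsrf-gr#>\n"
)

# (variable name, property) pairs. The properties are those listed in the
# text; the variable names are hypothetical.
BASIC_FIELDS = [
    ("title", "vfp:title"),
    ("description", "vfp:description"),
    ("body", "nsrf-gr:body"),
    ("status", "vfp:currentStatus"),
    ("location", "vfp:location"),
    ("operational", "nsrf-gr:operational"),
    ("thematic", "nsrf-gr:thematic"),
    ("url", "vfp:url"),
]


def build_basic_query(limit=10):
    """Return a SELECT query string for basic NSRF project information."""
    variables = " ".join(f"?{name}" for name, _ in BASIC_FIELDS)
    # OPTIONAL keeps projects that lack one of the properties (an assumption).
    patterns = "\n".join(
        f"  OPTIONAL {{ ?project {prop} ?{name} . }}"
        for name, prop in BASIC_FIELDS
    )
    return (
        f"{PREFIXES}"
        f"SELECT ?project {variables}\n"
        "WHERE {\n"
        "  ?project a nsrf-gr:Project .\n"
        f"{patterns}\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    print(build_basic_query(limit=5))
```

The resulting string can be pasted into any SPARQL client pointed at the endpoint that hosts the NSRF graph.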
All the IRIs returned by the SPARQL query are dereferenceable and point to HTML pages with information about the resources. For IRI dereferencing we used RDFBrowser [30], an open source Linked Data content negotiator and HTML description generator. Figure 4 shows the HTML representation of the project resource with MIS code 200000.
The SPARQL query in Listing 2 can be used to retrieve information about the budget, the contracts, the payments, the start and the end date of the NSRF projects. The results are also shown in Table 2. Data consumers can use the SPARQL endpoint to get information about the Greek NSRF projects, relevant documents, expropriations and their decisions. The SKOSified code lists are also available.
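A data consumer who requests results in the standard SPARQL 1.1 JSON results format could flatten them into rows as in the minimal sketch below. The helper name `bindings_to_rows`, the variable names and the sample payload (including its example.org IRIs and amounts) are hypothetical and only illustrate the shape of such a response.

```python
import json


def bindings_to_rows(payload):
    """Flatten a SPARQL 1.1 JSON results document into a list of dicts."""
    doc = json.loads(payload)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        row = {}
        for var in variables:
            # Variables left unbound (e.g. by OPTIONAL) are absent from a binding.
            cell = binding.get(var)
            row[var] = cell["value"] if cell is not None else None
        rows.append(row)
    return rows


# Hypothetical payload: the IRIs and amounts are invented for illustration.
SAMPLE = json.dumps({
    "head": {"vars": ["project", "budget", "payments"]},
    "results": {"bindings": [
        {
            "project": {"type": "uri", "value": "http://example.org/project/1"},
            "budget": {"type": "literal", "value": "1000.0"},
            "payments": {"type": "literal", "value": "400.0"},
        },
        {
            "project": {"type": "uri", "value": "http://example.org/project/2"},
            "budget": {"type": "literal", "value": "250.0"},
        },
    ]},
})

if __name__ == "__main__":
    for row in bindings_to_rows(SAMPLE):
        print(row)
```

Values arrive as strings in this format, so numeric fields such as the budget would still need to be converted before any indicator is computed.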
Listing 2. SPARQL query to retrieve fiscal information about the NSRF projects.
2.5. Performance Indicators
The process of monitoring and evaluating systems is based on indicators that assess the state of a project [6]. We use three indicators derived from the contract, budget and payment amounts in the retrieved data. These indicators track how NSRF projects evolve towards completion and constitute the input features of the clustering algorithm.
The completion index is defined by the Greek Ministry of Economy and Finance as the ratio of payments registered at the moment of data retrieval to the updated budget amount at the moment of data retrieval [3]. We define two other indices, namely payment completion and contract completion, as follows:
Payment completion is defined as the ratio of the payments registered at the moment of data retrieval to the updated contracted amount. The payment completion index shows the status of the payments relative to the contracts at the time we retrieved the data, while the completion index shows the status of the payments relative to the budget of the whole project.
Contract completion is defined as the ratio of the updated contracted amount to the updated budget at the moment of data retrieval.
The indices should lie between 0 and 1. A value above 1 means that the project underwent a significant change that its fiscal plan could not cover, which is why the indicator exceeds the upper limit.
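The three ratios defined above can be sketched in a few lines of Python. The function names and the handling of a zero or missing denominator (returning `None`) are assumptions of this example; the text does not say how such cases are treated.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the ratio is undefined."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def performance_indicators(budget, contracts, payments):
    """Compute the three indices from a project's fiscal amounts."""
    return {
        # payments over the updated budget
        "completion": safe_ratio(payments, budget),
        # payments over the updated contracted amount
        "payment_completion": safe_ratio(payments, contracts),
        # updated contracted amount over the updated budget
        "contract_completion": safe_ratio(contracts, budget),
    }


def out_of_range(indicators):
    """Names of the indices that fall outside the expected [0, 1] interval."""
    return [
        name for name, value in indicators.items()
        if value is not None and not 0 <= value <= 1
    ]


if __name__ == "__main__":
    example = performance_indicators(budget=100.0, contracts=80.0, payments=120.0)
    print(example, out_of_range(example))
```

In the example call, payments exceed both the budget and the contracted amount, so the completion and payment completion indices are reported as out of range while contract completion stays within it.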
Indicators for each project can be calculated and retrieved using the SPARQL query of Listing 3.
Listing 3. SPARQL query to retrieve indicators for the NSRF projects.
2.6. Density Based Clustering
Whether a project is a Red Flag is not recorded in the official data portal of the Ministry. The available data, described in Section 2.2, concern public expenditure budgets, contracts signed, payment amounts, the start and end date, status, location, description of projects, number and the title of their subprojects, the thematic priority and the operational programme to which they belong, beneficiaries or other involved organisations and various related documents (pictures, pdfs and docs). Supervised approaches require prior knowledge of what the output values for the samples should be; since no such labels exist here, unsupervised learning, which acts on data without prior categorization, is appropriate [29]. Partitioning and hierarchical clustering algorithms are effective on compact and well-separated clusters, but not in the presence of noise and outliers in the data [33]. We therefore selected the Density Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm to detect areas of high density (clusters of any shape) in the defined feature space (Figure 5), in order to eventually reveal the projects that could be considered Red Flags.
Having defined the 3-dimensional feature space described in Section 2.5, each project is represented by one point. Let $\varepsilon$ be the radius of a neighbourhood with respect to some point $p$, and let $\mathit{minPts}$ be the minimum number of neighbours within this radius. The notion of density in the feature space is based on the following definitions [19]:
A point $p$ is a core point if at least $\mathit{minPts}$ points are within distance $\varepsilon$ of it. Those points are said to be directly reachable from $p$.
A point $q$ is density reachable from a point $p$ with regard to $\varepsilon$ and $\mathit{minPts}$ if there is a path of core points from $p$ to $q$ where each point of the path is directly reachable from the previous one.
A point $p$ is density connected to a point $q$ with regard to $\varepsilon$ and $\mathit{minPts}$ if there is a point $o$ such that $p$ and $q$ are both density reachable from $o$ with respect to $\varepsilon$ and $\mathit{minPts}$.
A group of density connected points forms a density based cluster, and points that are not reachable from any other point are outliers.
Based on these density conditions, there are three different kinds of points: core points, density reachable points and outliers, as shown in Figure 5.
DBSCAN computes the Euclidean distance between an arbitrarily selected starting point and the other points, and finds the neighbours within distance $\varepsilon$ of the starting point. If the number of neighbours is equal to or greater than $\mathit{minPts}$, they form a cluster. These points are considered “visited”. This process is repeated with the remaining core points until the cluster is fully expanded, and the iterations are then repeated with the unvisited points to form other clusters. If the number of neighbours is less than $\mathit{minPts}$, the point is marked as a Red Flag.
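The procedure described above can be sketched in plain Python as follows. This is a brute-force illustration of the textbook DBSCAN steps, not the implementation used by the RedFlags application (which relies on the R dbscan package). The names `region_query` and `dbscan`, the label 0 for noise and the convention that a point counts among its own neighbours are assumptions of this sketch.

```python
import math

NOISE = 0  # label for points that end up in no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 1, 2, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # reachable from a core point, not core itself
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels


if __name__ == "__main__":
    group_a = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (0.0, 0.01, 0.0),
               (0.0, 0.0, 0.01), (0.01, 0.01, 0.0)]
    group_b = [(1.0, 1.0, 1.0), (0.99, 1.0, 1.0), (1.0, 0.99, 1.0),
               (1.0, 1.0, 0.99), (0.99, 0.99, 1.0)]
    print(dbscan(group_a + group_b + [(0.5, 0.5, 0.5)], eps=0.02, min_pts=4))
```

The brute-force neighbourhood search is quadratic in the number of points; a spatial index would be preferable for datasets of the size discussed in this paper.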
The rule of thumb for specifying $\mathit{minPts}$ is to use at least the number of dimensions of the data set plus one. In this case, with a 3-dimensional feature space, $\mathit{minPts}$ was set to 4. The optimal $\varepsilon$ radius was specified using a 4-dimensional tree which computes the 4-nearest neighbours’ distances of every point. Figure 6 shows the points sorted by distance in ascending order; the optimal $\varepsilon$ is taken at the knee of the curve, the value where a sharp change occurs, which is around 0.015 [19].
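To illustrate how a sorted k-nearest-neighbour distance curve is obtained, the sketch below computes it by brute force rather than with a tree structure. The `knee_index` helper is a simple heuristic added for this example (the point lying farthest below the chord that joins the two ends of the curve); the text reads the knee off the plot and does not prescribe such a rule. Excluding each point from its own neighbour list is also an assumption.

```python
import math


def k_distances(points, k):
    """Sorted distances from every point to its k-th nearest neighbour."""
    result = []
    for i, p in enumerate(points):
        # Distances to all other points; the point itself is excluded.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


def knee_index(curve):
    """Index of the value lying farthest below the chord between the curve ends."""
    n = len(curve)
    if n < 3:
        return 0
    first, last = curve[0], curve[-1]
    best_i, best_gap = 0, 0.0
    for i, y in enumerate(curve):
        chord = first + (last - first) * i / (n - 1)
        gap = chord - y
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


if __name__ == "__main__":
    curve = k_distances([(0.0,), (1.0,), (3.0,), (7.0,)], k=2)
    print(curve, curve[knee_index(curve)])
```

A candidate radius would then be `curve[knee_index(curve)]`, which should still be checked visually against the plotted curve.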
2.7. Red Flags
Red Flags are defined as clusters of projects with extreme behaviour compared to other clusters of projects. Red Flags are warning signs that do not indicate guilt or innocence [37]. Clusters containing at most 5% of the total number of projects are characterized as extreme clusters. The 5% threshold was selected by trial and error, after testing different thresholds.
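The 5% rule can be expressed as a short post-processing step over cluster labels. In this sketch the function names are hypothetical, the label 0 is assumed to denote points outside every cluster, and flagging those points unconditionally is an assumption that follows the earlier statement that points with too few neighbours are marked as Red Flags.

```python
from collections import Counter


def extreme_clusters(labels, threshold=0.05, noise_label=0):
    """Cluster ids whose size is at most `threshold` of all projects."""
    total = len(labels)
    sizes = Counter(label for label in labels if label != noise_label)
    return {cluster for cluster, size in sizes.items() if size <= threshold * total}


def red_flag_mask(labels, threshold=0.05, noise_label=0):
    """True for every project in an extreme cluster or outside any cluster."""
    extreme = extreme_clusters(labels, threshold, noise_label)
    return [label == noise_label or label in extreme for label in labels]


if __name__ == "__main__":
    labels = [1] * 95 + [2] * 3 + [0] * 2
    print(extreme_clusters(labels), sum(red_flag_mask(labels)))
```

Because the threshold is a parameter, the trial-and-error selection mentioned above amounts to re-running this step with different values and comparing the flagged sets.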
The RedFlags application was built in R (version 3.3.2) [38] with RStudio (version 1.0.136) [39], using the packages R Shiny (version 1.0) [40], SPARQL (version 1.16) [41], dbscan (version 1.0.0) [42], plotly (version 4.5.6) [43], ggplot2 (version 2.2.1) [44], rbokeh (version 0.5.0) [45], DT (version 0.2) [46], shinythemes (version 1.1.1) [47], shinyjs (version 0.9) [48] and shinydashboard (version 0.5.3) [49].
4. User Requirements and Use Case Scenario
Red Flags are an indication for monitoring funded projects during their implementation, so that competent authorities can be guided to improve or correct weaknesses and to prevent failures in operations, accounts and systems. The user requirements of the RedFlags platform are therefore:
Rejected projects should raise Red Flags, so that project failure can be averted if possible.
The platform should assist competent authorities in organizing the monitoring process efficiently, without wasting time, by sparing them the examination of most non-problematic projects.
Based on the user requirements, we carried out a use case scenario. In this scenario the competent monitoring authority has to examine NSRF projects approximately 12 months before the end of the programming period. The RedFlags platform helps the authority organize the monitoring process and examine first the projects that raised a Red Flag. Marking a project as a Red Flag means that the project probably has significant problems, such as payments higher than the available budget (completion index) or payments higher than the available contract amounts (payment completion index). These projects have high priority for examination in order to avoid rejection. Since the only available ground truth is rejection at the end of the programming period, when the data were retrieved, we evaluate the performance of the RedFlags platform on 438 rejected projects out of 11558 NSRF projects. The use case scenario therefore shows the performance of the platform on imbalanced data, since rejected projects make up only 3.8% ($438/11558$) of the dataset (low prevalence).
The system retrieved the fiscal information, calculated the indicators of the NSRF projects as described in Section 2.5 and identified the Red Flags. To test whether the user requirements are satisfied, we checked the last update of the projects after the end of the NSRF programming period and extracted the number of rejected projects. The following tables show the results of this use case scenario.
The contingency table (Table 6) of rejected projects and projects classified as Red Flags shows that 312 projects raised a Red Flag and were rejected (True Positives, TP), 126 projects were rejected but did not raise a Red Flag (False Negatives, FN), 8024 projects did not raise a Red Flag and were not rejected (True Negatives, TN), and 3096 projects were classified as Red Flags but were not rejected (False Positives, FP). Out of the 11558 projects, 3408 were marked as Red Flags.
According to Table 7, prevalence is equal to 3.8% ($438/11558$) and is defined as the proportion of rejected projects to the total number of NSRF projects. Low prevalence is expected for a successful NSRF programming period, as a higher value of this metric means that the NSRF programme encountered problems and that more projects failed to complete.
On these terms, Precision (Positive Predictive Value, PPV) and Negative Predictive Value (NPV) are equal to 9% ($TP/(TP+FP) = 312/3408$) and 98% ($TN/(TN+FN) = 8024/8150$), respectively. Precision corresponds to the estimated probability that a project randomly selected from the indicated Red Flags is rejected. Negative Predictive Value corresponds to the probability that a project randomly selected from the projects not indicated as Red Flags is not rejected. However, both metrics depend on the prevalence, which in this case is low, and they are not intrinsic to the test, as recall and true negative rate are [50]. The overall accuracy (ACC) of the RedFlags platform is equal to 72% ($(TP+TN)/(TP+FN+TN+FP) = 8336/11558$).
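The percentages quoted in this part follow directly from the four counts of the contingency table. The short sketch below recomputes them; the function name `summary_metrics` and the dictionary keys are choices made for this example.

```python
# Counts reported in the contingency table of rejected vs. Red Flag projects.
TP, FN, TN, FP = 312, 126, 8024, 3096


def summary_metrics(tp, fn, tn, fp):
    """Prevalence-dependent metrics derived from a 2x2 contingency table."""
    total = tp + fn + tn + fp
    return {
        "prevalence": (tp + fn) / total,     # share of rejected projects
        "ppv": tp / (tp + fp),               # precision
        "npv": tn / (tn + fn),               # negative predictive value
        "accuracy": (tp + tn) / total,
        "flagged_share": (tp + fp) / total,  # share of projects marked as Red Flags
    }


if __name__ == "__main__":
    for name, value in summary_metrics(TP, FN, TN, FP).items():
        print(f"{name}: {value:.1%}")
```

The low precision alongside a high negative predictive value is what one would expect when only about 4% of the projects are rejected.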
Based on Table 7, which presents the joint probabilities for rejected and Red Flag projects, the conditional probabilities of Table 8 were calculated (see also Figure 8). The results show that recall (Sensitivity, or True Positive Rate, TPR), which is the percentage of projects rejected after 12 months that had raised a Red Flag, is 71% ($TP/(TP+FN) = 312/438$). Recall corresponds to the estimated probability that a project randomly selected from the rejected projects has raised a Red Flag.
Moreover, specificity (SPC) is equal to 72% ($TN/(TN+FP) = 8024/11120$) and reflects the ability of the RedFlags platform to correctly refrain from raising Red Flags on projects that will not be rejected at the end of the programming period.
In other words, the auditor will not have to examine first 72% of the projects that will not be rejected, whereas he will first check the remaining 28% of those projects, which raise a Red Flag but are not rejected (False Positive Rate, or false alarm). This is satisfactory according to the user requirements. Marking projects as Red Flags does not necessarily mean that these projects will be rejected after 12 months, whereas a project that has been rejected should have raised a Red Flag.
Furthermore, we calculated the Positive likelihood ratio (LR+), the Negative likelihood ratio (LR-) and the Diagnostic Odds Ratio (DOR). LR+ is defined as the ratio $TPR/(1-SPC)$. The greater the value of LR+, the more likely it is that a Red Flag indication is a warning for a rejected project. In other words, rejected projects are more likely to raise Red Flags than not rejected projects, since the ratio is greater than 1. The algorithm thus avoided an LR+ < 1, which would imply that not rejected projects are more likely than rejected projects to receive Red Flags.
LR- is defined as the ratio $(1-TPR)/SPC$. An LR- < 1 means that a not rejected project is more likely not to raise a Red Flag than a rejected project. A value greater than 1 would imply that rejected projects are more likely not to raise a Red Flag than not rejected projects.
DOR, which is independent of prevalence, measures the effectiveness of the algorithm. DOR is defined as the ratio of LR+ to LR-. The value of DOR is greater than one, meaning that the algorithm is discriminating correctly.
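For concreteness, the likelihood ratios and the diagnostic odds ratio can be recomputed from the contingency counts reported earlier (312 true positives, 126 false negatives, 8024 true negatives, 3096 false positives), using the standard definitions of these quantities. The function name and dictionary keys below are choices made for this sketch.

```python
def diagnostic_ratios(tp, fn, tn, fp):
    """Prevalence-independent ratios derived from a 2x2 contingency table."""
    tpr = tp / (tp + fn)           # recall / sensitivity
    spc = tn / (tn + fp)           # specificity
    lr_plus = tpr / (1 - spc)      # positive likelihood ratio
    lr_minus = (1 - tpr) / spc     # negative likelihood ratio
    return {
        "recall": tpr,
        "specificity": spc,
        "lr_plus": lr_plus,
        "lr_minus": lr_minus,
        "dor": lr_plus / lr_minus,  # equals (tp * tn) / (fp * fn)
    }


if __name__ == "__main__":
    for name, value in diagnostic_ratios(312, 126, 8024, 3096).items():
        print(f"{name}: {value:.3f}")
```

With these counts the sketch gives an LR+ of roughly 2.6, an LR- of roughly 0.4 and a DOR of roughly 6.4, consistent with the qualitative statements above (LR+ above 1, LR- below 1, DOR above 1).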
Therefore, the user requirements of the RedFlags platform are satisfied. In other words, RedFlags is more likely to raise Red Flags on rejected projects and more likely not to raise a Red Flag on not rejected projects, and it can thus assist the competent authorities in organizing the monitoring process efficiently by sparing them the examination of most non-problematic projects.
We presented how open data can be used with semantic web technologies and data mining techniques to identify possible failures as “Red Flags” in National Strategic Reference Framework projects. The identification is implemented by the RedFlags application, constructed as an interactive knowledge-based system. We used data from the Open Data API provided by the Greek Ministry of Economy and Finance. The semantic description of these data involved the development of two ontologies, VFP and NSRF-GR. The NSRF data were transformed into RDF triples and uploaded to an OpenLink Virtuoso Server, while RDFBrowser handled content negotiation and HTML generation. Performance indicators were defined to track the progress of NSRF projects and provided the inputs to the clustering algorithm. The DBSCAN algorithm was used to identify Red Flags.
The RedFlags platform was based on two user requirements. The first is that rejected projects should raise Red Flags, so that failure can be avoided if possible; the second is that auditors need assistance in organizing the monitoring process efficiently, without wasting time, by being spared the examination of most non-problematic projects. In the use case scenario, an auditor has to examine the NSRF projects in Greece approximately 12 months before the end of the programming period. The system retrieved the fiscal information, calculated the indicators of the NSRF projects and used the DBSCAN algorithm to identify the Red Flags. The RedFlags platform marked 29.5% of the projects as Red Flags. A Red Flag indicates that a project probably has significant problems, due to updates of the budget or payment amounts or to other factors, and has high priority for examination in order to avoid rejection. However, the only available ground truth is rejection at the end of the programming period, when the data were retrieved, so we evaluated the performance of the RedFlags platform on 438 rejected projects out of 11558 NSRF projects.
To test whether the user requirements are satisfied, we checked the last update of the projects, made after the end of the NSRF programming period, and extracted the number of rejected projects. The proportion of rejected projects corresponds to the prevalence, which is equal to 3.8% of the total projects. In terms of rejection, low prevalence corresponds to a successful NSRF programming period, as higher values of this metric mean that more projects failed to complete.
The estimated probability that a rejected project had raised a Red Flag was 71% (recall), and the estimated probability that a project that was not rejected did not raise a Red Flag was 72% (specificity). Moreover, the positive likelihood ratio showed that rejected projects are more likely than not rejected projects to receive Red Flags, whereas the negative likelihood ratio showed that not rejected projects are more likely than rejected projects not to raise a Red Flag. Finally, the diagnostic odds ratio, which is independent of prevalence, showed that the RedFlags platform is discriminating correctly. Therefore, the RedFlags platform helps the auditor organize the monitoring process and give high priority to the projects that raised a Red Flag, as rejected projects have a higher probability of raising a Red Flag.
Currently the resources in our data are described with W3C open standards and have HTTP IRIs, so humans can access them and get useful information, but they do not yet have links to other datasets. Our next step will therefore be to create links to IRIs of other published data in order to achieve 5-star Linked Open Data [51]. Specifically, we plan to create semantic links between the NSRF projects and the documents uploaded to Diavgeia, the official repository where all the decisions of governmental and administrative acts are posted, in order to expand the Greek Linked Open Data (LOD) cloud [53]. This will give us access to relevant information about the projects, which can be used to create additional performance indicators and increase the efficiency of the data mining algorithm. In addition, we will further improve the ontology by aligning it with an upper ontology such as BFO and by reusing terms from other ontologies. Moreover, we will look into adding constraints and validating our graphs with technologies such as SHACL. Finally, even though the ontologies have their specification drafts, they need to be updated with more detailed documentation and SPARQL examples, so that consumers outside the data portal can easily compose and execute SPARQL queries using the correct properties.
The findings of this study have been included in the results of the commitment on Linked, Open and Participatory Budgets of the Third Greek Action Plan on Open Government [57]. Public bodies could adopt the RedFlags knowledge-based system as an early warning indicator, in order to devise smarter strategies that prevent the possible failure of projects. Citizens could monitor the progress of a project to find Red Flags, while data journalists could produce data stories about EU funds and relate them to the trends of the Greek economy.