We now answer the questions formulated above. The results show that a larger value of n does not translate into higher selection accuracy, while greatly increasing the cost on easier problems. The budget remaining to find a solution decreases because a larger share is spent on the sample used to estimate the ELA features. Furthermore, changes in the input sample have only minor effects on performance. This implies that the selector is reliable even with noisy features, suggesting that their variance is small. Let us examine the results in detail.
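This trade-off between sampling cost and remaining search budget can be sketched with a toy cost model; the assumption that the ELA sample consumes n·d evaluations out of a fixed total, and all concrete figures below, are illustrative and not the paper's exact accounting:

```python
def remaining_budget(total_evals, n, dim):
    """Budget left for the solver after spending n * dim evaluations
    on the sample used to estimate the ELA features (hypothetical
    cost model for illustration only)."""
    ela_cost = n * dim
    return max(total_evals - ela_cost, 0)

# A larger sample size n leaves less budget for the search itself.
for n in (50, 100, 200):
    print(n, remaining_budget(total_evals=10_000, n=n, dim=10))
```

Under this model, doubling n directly halves nothing about accuracy but linearly shrinks what the selected algorithm has left to work with, which is consistent with the observation above that larger n mostly adds cost.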
The performance of the best algorithm for a given problem is illustrated in Figure 5 for the ICARUS/HOPO and BBLL portfolios. We observe that BIPOP-CMA-ES is the dominant algorithm for dimensions in both portfolios. The Nelder–Doerr algorithm replaces BFGS in the ICARUS/HOPO portfolio, resulting in a deterioration of for . There are specialized algorithms for particular problems regardless of the number of dimensions, e.g., LSstep is the best for .
Table 3 shows the performance of the individual algorithms and of the selectors during HOIO and HOPO. In addition, the results from an oracle, i.e., a method that always selects the best algorithm without incurring any cost, and from a random selector are presented. The performance is measured as , the 95% Confidence Interval (CI) of , , , , , and . In boldface are the best values of each performance measure over each validation method and . The table shows that only the oracles achieve , with the ICARUS portfolio having the best overall performance with during HOPO. The table also shows that is always the lowest for , given that less of the budget is used to calculate the ELA features. The highest is achieved with for the ICARUS portfolios and with for the BBLL portfolio. However, the differences with smaller n may not justify the additional cost. For example, a of and a of are achieved with for the ICARUS sets, a difference of in and in for . Compared with fully random selection, the selectors’ and are at least one order of magnitude smaller.
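The two baselines used here can be made concrete with a small sketch: given a per-problem, per-algorithm performance matrix, the oracle takes the best algorithm on each problem at zero selection cost, while the random selector corresponds (in expectation) to the average over the portfolio. The matrix `perf`, its log-normal distribution, and the normal-approximation CI below are all assumptions for illustration, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical performance matrix: perf[i, j] is the cost (e.g. expected
# running time) of algorithm j on problem i -- lower is better.
perf = rng.lognormal(mean=3.0, sigma=1.0, size=(100, 4))

oracle = perf.min(axis=1)       # best algorithm per problem, no selection cost
random_sel = perf.mean(axis=1)  # uniform random selection, in expectation

# 95% CI of the mean oracle cost via the normal approximation.
m = oracle.mean()
half = 1.96 * oracle.std(ddof=1) / np.sqrt(len(oracle))
print(f"oracle mean: {m:.2f}  95% CI: [{m - half:.2f}, {m + half:.2f}]")
print(f"random / oracle mean ratio: {random_sel.mean() / m:.2f}")
```

By construction the oracle lower-bounds every selector on each problem, which is why only the oracles reach the best values in Table 3 and why the gap to random selection quantifies how much a learned selector can possibly gain.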
Table 3 also shows, on average, a decrease in of , and an increase in and of ≈7% and ≈3%, respectively; the results for the ICARUS sets are below average for and , indicating better performance. Furthermore, is always above 90% for the ICARUS sets, while falls below 90% during HOPO for the BBLL set. Only BIPOP-CMA-ES with can match the overall performance of a selector.
Figure 6 illustrates the against for the complete benchmark set. Figure 6a shows the results of the individual algorithms, while Figure 6b shows the performance of the oracles as solid lines and of the random selectors as dashed lines for the ICARUS/HOIO, ICARUS/HOPO, and BBLL portfolios. As expected from Figure 5a, the ICARUS/HOPO oracle shows degraded performance on the easiest 10% of the problems. On the top 10% of the problems, the performance of the three oracles is equivalent. Between 10% and 90% of the problems, the ICARUS/HOPO oracle shows a small performance improvement.
Figure 6c,d illustrate the performance of the selectors for . The shaded region indicates the theoretical range of performance, with the upper bound representing the ICARUS oracle and the lower bound random selection. Although it is infeasible to match the performance of the oracle, due to the cost of extracting the ELA measures and a less than 100%, the figure shows the improvement over random selection. During HOIO, the ICARUS selector surpasses random selection at with , representing 5.9% of the problems, and at with , representing 31.2% of the problems. During HOPO, the ICARUS selector surpasses random selection at with , representing 9.9% of the problems, and at with , representing 40.3% of the problems.
To understand where the selectors fail, we calculate the average , , , and for a given problem or a given dimension during the HOIO and HOPO validations. Table 4 shows these results, where in boldface are the and values less than 90%, and the and values higher than 10%. The table shows that an instance of either has the highest probability of remaining unsolved, despite the model having information about other instances of these functions. Some problems, such as , have high and low values, which can be explained by their simplicity, as most algorithms can solve them. The performance appears to be evenly distributed across dimensions. Overall, the selector appears to struggle with problems where only one algorithm may be the best. Nevertheless, these results indicate that the architecture of the selector is adequate for testing the effect of systemic errors.