### BBO algorithms

Ising solvers find a solution of a quadratic function, so we approximate the data set \((\varvec{x}, y)\) by the quadratic function \(\hat{y}(\varvec{x}) = \varvec{x}^\mathsf {T}\varvec{A}\varvec{x} + \varvec{b}^\mathsf {T}\varvec{x} + c\), where \(\varvec{x} \in \{-1,1\}^{NK}\) and \(\varvec{A}\), \(\varvec{b}\) and *c* are model parameters. Note that in this surrogate model we use \(\varvec{x}\) instead of \(\varvec{M}\), *y* for the cost associated with a given \(\varvec{x}\) and \(\hat{y}\) for the cost approximated by the surrogate model. This function can be expressed in the simplified quadratic form \(\varvec{x}^{\mathsf {T}}\varvec{A}\varvec{x}\) if we append an additional dimension to the vector \(\varvec{x}\), i.e. \((x_1, \ldots , x_{NK}, 1)\). Each algorithm handles the surrogate model differently.
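The folding of \(\varvec{b}\) and *c* into a single quadratic form can be sketched as follows (a minimal Python illustration; the function names and list-based matrix representation are our own choices, not part of any cited implementation):

```python
def surrogate(x, A, b, c):
    """Evaluate y_hat(x) = x^T A x + b^T x + c for x in {-1, +1}^n."""
    n = len(x)
    quad = sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    lin = sum(b[i] * x[i] for i in range(n))
    return quad + lin + c

def augment(A, b, c):
    """Fold b and c into one (n+1)x(n+1) matrix acting on (x_1, ..., x_n, 1).

    Splitting b/2 between the last row and column reproduces b^T x exactly,
    and c sits in the corner where it multiplies the constant 1."""
    n = len(b)
    A_aug = [[A[i][j] for j in range(n)] + [b[i] / 2] for i in range(n)]
    A_aug.append([b[j] / 2 for j in range(n)] + [c])
    return A_aug
```

With the augmented vector \((x_1, \ldots, x_n, 1)\), the single quadratic form \(\varvec{x}^{\mathsf{T}}\varvec{A}\varvec{x}\) evaluates to the same cost as the original three-term expression.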

Bayesian optimisation of combinatorial structures (BOCS)^{2} treats the parameters of the surrogate model by Bayesian linear regression. The authors proposed to use the horseshoe prior^{14},

$$\begin{aligned} \alpha _k \mid \beta ^2_k, \tau ^2, \sigma ^2&\sim N(0,\beta ^2_k\tau ^2\sigma ^2) \nonumber \\ \tau , \beta _k&\sim C^+(0,1) \nonumber \\ P(\sigma ^2)&= \sigma ^{-2}, \end{aligned}$$

(10)

where \(\alpha _k\) stands for the coefficient of the *k*-th variable in the linear regression and \(C^+(0,1)\) is a half-Cauchy distribution. Note that since the surrogate model is linear in its parameters, the second-order terms \(x_i x_j\) are treated as independent explanatory variables, i.e. \((x_1, \ldots , x_n, x_1x_2, x_1x_3, \ldots , x_{n-1}x_n)\), where \(n = NK\). Thus, the index *k* runs from 1 to \(n+n(n-1)/2\). As the parameter \(\varvec{A}\) of the surrogate model is a distribution in BOCS, a specific value of \(\varvec{A}\) is drawn from that distribution, inspired by Thompson sampling^{15}. In addition to the horseshoe prior, the normal prior \(\alpha _k \sim N(0, \sigma ^2)\) and the normal-gamma prior \(\alpha _k, \sigma ^{-2} \sim \text {NormalGamma}(0, 1, 1, \beta )\) are tested.
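The mapping from \(\varvec{x}\) to the explanatory variables of the linear regression can be sketched as follows (an illustrative snippet; the function name is ours):

```python
from itertools import combinations

def features(x):
    """Map x in {-1, +1}^n to the explanatory variables of the linear
    surrogate: the first-order terms x_i followed by all pairwise
    products x_i * x_j with i < j (n + n(n-1)/2 variables in total)."""
    return list(x) + [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
```

Linear regression on this feature vector is then equivalent to fitting the quadratic surrogate model, with one coefficient \(\alpha_k\) per feature.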

Sampling from the horseshoe distribution is performed using Monte Carlo sampling^{14}, which requires a longer execution time than sampling from the normal and normal-gamma distributions. Sampling of the variables from these distribution functions is accelerated using fast Gaussian samplers^{16,17}.

Factorisation machine with quantum annealing (FMQA)^{4} utilises the factorisation machine (FM)^{18} as the surrogate model. The surrogate model of degree \(d = 2\) is defined as

$$\begin{aligned} \hat{y}(\varvec{x}) := w_0 + \sum _{i=1}^n w_i x_i + \sum _{i=1}^n \sum _{j=i+1}^n \langle \varvec{v}_i,\varvec{v}_j\rangle x_i x_j, \end{aligned}$$

(11)

where \(\langle \cdot ,\cdot \rangle\) represents the dot product of two vectors of size \(k_\text {FM}\),

$$\begin{aligned} \langle \varvec{v}_i,\varvec{v}_j\rangle := \sum _{l=1}^{k_\text {FM}} v_{i,l} v_{j,l}. \end{aligned}$$

(12)
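Equations (11) and (12) can be evaluated directly as follows (a minimal sketch of a degree-2 FM prediction; the function name and argument layout are our own, not the interface of the FM library used in FMQA):

```python
def fm_predict(x, w0, w, V):
    """Degree-2 factorisation machine, Eq. (11):
    w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j,
    where V[i] is the rank-k_FM factor vector v_i and the dot product
    <v_i, v_j> follows Eq. (12)."""
    n = len(x)
    linear = sum(w[i] * x[i] for i in range(n))
    pairs = sum(
        sum(V[i][l] * V[j][l] for l in range(len(V[i]))) * x[i] * x[j]
        for i in range(n) for j in range(i + 1, n)
    )
    return w0 + linear + pairs
```

The low-rank factors \(\varvec{v}_i\) replace a free coefficient matrix for the pairwise terms, which is what introduces the sparsity discussed below.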

The horseshoe prior and the FM introduce sparsity into the surrogate model, whereas the normal and normal-gamma priors do not. The Results section shows the effect of this sparsity on the performance of the algorithms. Each algorithm has hyperparameter(s) to be fixed before conducting the BBO: the variance \(\sigma ^2\) in the normal prior, the shape parameter \(\alpha (=1)\) and the inverse scale parameter \(\beta\) in the normal-gamma prior, and the size parameter \(k_\text {FM}\) in the FM; the horseshoe prior has no hyperparameters. The hyperparameters \(\sigma ^2\) and \(\beta\) are optimised for a specific instance, and the optimal values are then applied to the other instances. We do not optimise the size parameter \(k_\text {FM}\) but choose \(k_\text {FM} = 8\), as proposed in FMQA, as well as \(k_\text {FM} = 12\) to provide enough degrees of freedom to represent \(\alpha _k\).

We refer to the vanilla BOCS as vBOCS, the normal-prior BOCS as nBOCS and the normal-gamma BOCS as gBOCS. The two choices of \(k_\text {FM}\) are distinguished as FMQA08 and FMQA12. In addition, a random search algorithm is referred to as RS, in which each vector \(\varvec{x}\) is randomly sampled without utilising the pre-obtained data set.

### Ising solvers

BOCS and FMQA use an Ising solver to find the minimum of the surrogate model expressed in quadratic form. We evaluate three Ising solvers: simulated annealing, QA and simulated quenching. Simulated annealing (SA) introduces thermal fluctuation into the exploration of the cost function^{19} and is implemented as a Monte Carlo simulation. The temperature in the simulation is a parameter that controls the acceptance probability of variable flips that increase the cost function. Initially, the temperature is high to support a global search of the solution space and to find the candidate basin harbouring the global minimum. Later, the temperature is lowered to find the lowest solution within that basin. If the cooling schedule is slow enough, i.e. \(\propto 1/\log (t)\), SA finds the global minimum^{20}.
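The SA procedure can be sketched in a few lines (a toy Metropolis implementation with a geometric cooling schedule; this is our own illustration, not the Ocean SDK solver used in this work, whose schedule and parameters are described below):

```python
import math
import random

def simulated_annealing(Q, n, sweeps=1000, T0=10.0, T1=0.1, seed=0):
    """Toy Metropolis SA minimising x^T Q x over x in {-1, +1}^n.

    The temperature is lowered geometrically from T0 to T1; a slower
    (logarithmic) schedule is needed for the theoretical convergence
    guarantee mentioned in the text."""
    rng = random.Random(seed)
    x = [rng.choice([-1, 1]) for _ in range(n)]

    def energy(x):
        return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

    e = energy(x)
    for t in range(sweeps):
        T = T0 * (T1 / T0) ** (t / (sweeps - 1))  # geometric cooling
        i = rng.randrange(n)
        x[i] = -x[i]  # propose a single spin flip
        e_new = energy(x)
        # Accept downhill moves always; uphill moves with Boltzmann probability.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / T):
            e = e_new
        else:
            x[i] = -x[i]  # reject: undo the flip
    return x, e
```

At high temperature nearly all flips are accepted (global search); as \(T\) drops, uphill moves become exponentially unlikely and the walk settles into a basin.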

Quantum annealing (QA) replaces the thermal fluctuation of SA with quantum fluctuation through a scheduled transverse field. The quantum system starts from a trivial state (a superposition of all solutions) and reaches the ground state of the cost function at the end of the annealing^{21}. If the transverse field is scheduled as \(\propto t^{-1/(2N-1)}\), the system converges to the ground state at the end^{22}.

We refer to simulated quenching (SQ) as a variation of SA with extremely rapid quenching of the temperature from high to zero. Although this simplifies and accelerates the Monte Carlo calculation, it eliminates the ability to search the solution space globally in the early stage of annealing. Thus, SQ tends to be trapped in local minima more frequently than SA and QA.
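As an illustrative sketch (our own toy Metropolis code, not the solver configuration used in this work), SQ amounts to running the same single-flip dynamics at a fixed low temperature throughout:

```python
import math
import random

def simulated_quenching(Q, n, sweeps=1000, T=0.1, seed=0):
    """Toy SQ: Metropolis dynamics at a fixed low temperature for
    minimising x^T Q x over x in {-1, +1}^n. Without a high-temperature
    exploration phase, the walk descends greedily and is prone to
    getting trapped in local minima."""
    rng = random.Random(seed)
    x = [rng.choice([-1, 1]) for _ in range(n)]

    def energy(x):
        return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

    e = energy(x)
    for _ in range(sweeps):
        i = rng.randrange(n)
        x[i] = -x[i]  # propose a single spin flip
        e_new = energy(x)
        # At constant low T, uphill moves are almost never accepted.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / T):
            e = e_new
        else:
            x[i] = -x[i]  # reject: undo the flip
    return x, e
```

On a landscape without local minima this still reaches the ground state; on rugged landscapes it typically stops at the first basin it falls into.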

We utilise the SA and QA solvers in the D-Wave Ocean SDK with default parameters. The default initial and final temperatures for SA are determined from approximately estimated maximum and minimum effective fields with scaling factors of 2.9 and 0.4, respectively. The default annealing time for QA is 20 microseconds. In SQ, the temperature is not annealed but is kept constant at 0.1. For all solvers, we optimise the surrogate model 10 times in each iteration step to find a better solution.