The results of this study are presented in three parts: the forecasting performance of the baseline stacked LSTM model, the performance of the TL models compared to the conventional model in the target domain, and the effect of applying TL with different volumes of available data. The term conventional model refers to the LSTM model to which no TL has been applied; that is, the conventional LSTM model is trained solely on data from the target PV.

### Baseline model performance

The stacked LSTM model uses the following lag features: (a) measured power output, (b) air temperature, (c) global horizontal irradiance, (d) humidity, (e) month of the year (one-hot encoded) and (f) hour of the day (sine/cosine transformed). These features are fed into the LSTM model in a “5 inputs – 1 output” format of hourly data: the values of each feature over the last five hours are used to predict the PV power output for the next hour (one-hour ahead power output forecast).
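The “5 inputs – 1 output” format and the sine/cosine hour encoding can be illustrated with a minimal sketch. The function names `encode_hour` and `make_windows` and the array layout are assumptions made for illustration; they are not taken from the implementation used in this study.

```python
import numpy as np

def encode_hour(hour):
    """Map the hour of the day (0-23) to a point on the unit circle."""
    angle = 2.0 * np.pi * np.asarray(hour, dtype=float) / 24.0
    return np.sin(angle), np.cos(angle)

def make_windows(features, target, n_lags=5):
    """Build lagged inputs and one-hour-ahead targets (hypothetical helper).

    features: array of shape (T, n_features), hourly rows in time order
    target:   array of shape (T,), PV power output
    Returns X of shape (T - n_lags, n_lags, n_features) and y of shape (T - n_lags,).
    """
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    n_samples = len(target) - n_lags
    # Sample i holds the rows for hours i .. i + n_lags - 1 ...
    X = np.stack([features[i:i + n_lags] for i in range(n_samples)])
    # ... and is paired with the power output of hour i + n_lags.
    y = target[n_lags:]
    return X, y
```

The resulting `X` has the (samples, time steps, features) layout that recurrent layers commonly expect, with five time steps per sample.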

An accurate base model is a prerequisite for accurate predictions in the target domain. The performance of the LSTM model for the base PV is therefore evaluated as follows. The base PV dataset is split chronologically into a training set and an evaluation set using an 80-20 split: the first 80% is used for training and the remaining 20% for testing (17563 observations for training and 4391 observations for evaluation), and the LSTM model is trained on the training set. The accuracy of the model is evaluated by computing the root mean squared error (RMSE) and the mean absolute error (MAE) of the forecasts across the evaluation period, as well as the coefficient of determination \(R^2\) between the forecasts and the real values, as follows:

$$\begin{aligned} RMSE= & {} \sqrt{ \frac{1}{n} \displaystyle \sum _{t=1}^{n} ({y_t-\hat{y_t}})^2 } \end{aligned}$$

(7)

$$\begin{aligned} MAE= & {} \frac{1}{n} \displaystyle \sum _{t=1}^{n} | {y_t-\hat{y_t}|} \end{aligned}$$

(8)

$$\begin{aligned} R^2= & {} 1 - \frac{ \displaystyle \sum _{t=1}^{n} ({y_t-\hat{y_t}})^2}{\displaystyle \sum _{t=1}^{n} ({y_t-{\bar{y}}})^2 } \end{aligned}$$

(9)

where \(y_t\) is the real value of the solar production time series at hourly interval *t* of the evaluation period, \(\hat{y_t}\) is the forecast produced by the model and \({\bar{y}}\) is the average of the real values. Two additional metrics are calculated to make the model evaluation more complete: the mean bias error (MBE) and the normalized root mean squared error (NRMSE). The MBE captures the systematic tendency of a forecasting model to under- or over-forecast, while the NRMSE relates the RMSE to the magnitude of the observed variable (here, its mean value), which makes it suitable for comparing models across different scales. These two metrics are calculated as follows:

$$\begin{aligned} MBE= & {} \frac{1}{n} \displaystyle \sum _{t=1}^{n} ( {y_t-\hat{y_t})} \end{aligned}$$

(10)

$$\begin{aligned} NRMSE= & {} \frac{RMSE}{{\bar{y}}} \end{aligned}$$

(11)
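Equations (7) to (11) translate directly into a few lines of NumPy. The following is a minimal sketch; the function name `forecast_metrics` and the dictionary output are illustrative assumptions. With the sign convention of Eq. (10), a negative MBE corresponds to over-forecasting on average.

```python
import numpy as np

def forecast_metrics(y_true, y_pred):
    """Compute RMSE, MAE, MBE, NRMSE and R^2 as defined in Eqs. (7)-(11)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred                      # y_t - y_hat_t
    rmse = np.sqrt(np.mean(err ** 2))          # Eq. (7)
    mae = np.mean(np.abs(err))                 # Eq. (8)
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot                 # Eq. (9)
    mbe = np.mean(err)                         # Eq. (10)
    nrmse = rmse / np.mean(y_true)             # Eq. (11), normalized by the mean
    return {"RMSE": rmse, "MAE": mae, "MBE": mbe, "NRMSE": nrmse, "R2": r2}
```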

The model achieves high accuracy and captures the daily patterns of the most important variables, as reflected by the utilized metrics (\(MAE = 0.467\), \(RMSE = 0.992\), \(MBE = -\,0.097\), \(NRMSE = 0.301\), \(R^2 = 96.254\%\)). However, even these five error metrics are not enough to compare the proposed model with other models in different geographical locations. According to Yang et al.^{57,58}, the accuracy of solar forecasting models (in general, the term “solar forecasting” may refer to either solar irradiance forecasting or solar power forecasting; throughout this study the term refers to solar power forecasting) must be inter-comparable across different locations and different time periods through a common metric, the forecast skill index. The forecast skill index compares the proposed model to a reference model on a specific error metric. Two questions therefore arise: which reference model and which error metric should be used? The most common reference model for standardizing the verification of solar forecasting models is the persistence model. More specifically, a smart persistence model is highly recommended as the reference, rather than the naive (or simple) persistence model^{59}. Regarding the error metric, the RMSE is the most suitable choice for solar power production, as it is appropriate for capturing large errors^{57}. Thus, the forecast skill index is defined as follows:

$$\begin{aligned} skill = 1 - \frac{RMSE_{proposed}}{RMSE_{reference}} \end{aligned}$$

(12)

where \(RMSE_{proposed}\) is the RMSE value of the developed LSTM model and \(RMSE_{reference}\) is the RMSE value of the smart persistence model.

The last open question concerns the choice of smart persistence model for solar power forecasting. For solar irradiance forecasting, the smart persistence model is obtained by integrating clear-sky conditions into the reference model^{59}. The same applies to PV power forecasts, for which several smart persistence models have been proposed^{60}. More specifically, Engerer and Mills proposed a clear-sky index for the case in which the characteristics of the PV panel are known^{61}, while Huertas and Centeno presented a PV smart persistence model based on scaling global horizontal irradiance to the PV production value^{62}. In this study, the definition of Pedro and Coimbra is adopted, which is based on estimating the expected power output under clear-sky conditions^{63}. The adopted smart persistence model is described by the following equation:

$$\begin{aligned} {\hat{y}}(t+\Delta t) = {\left\{ \begin{array}{ll} y_{c-s}(t+\Delta t), &{} if\ y_{c-s}(t) = 0 \\ y_{c-s}(t+\Delta t) \frac{y(t)}{y_{c-s}(t)} &{} otherwise \end{array}\right. } \end{aligned}$$

(13)

where *y*(*t*) is the measured power output and \(y_{c-s}(t)\) represents the expected power output under clear-sky conditions. The model decomposes the power output under the assumption that the ratio of the measured power output to the clear-sky power output remains constant over short time intervals. At night, when the clear-sky output is zero, the forecast of the smart persistence model is set equal to the clear-sky power output. The clear-sky model is approximated in two steps: first, past power output values are averaged as a function of the hour of the day (between 0 and 23) and the day of the year (between 0 and 365); second, a smooth surface that envelops this function is created^{63}. The power output expected under clear-sky conditions for the base PV (\(PV_1\)) as a function of the hour of the day and the day of the year for the baseline model is presented in Fig. 4.
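A minimal sketch of Eq. (13) is given below. It assumes that the clear-sky power series `y_clear` is already available on the same hourly grid as the measurements; the construction of the clear-sky surface itself is not reproduced here. The function name `smart_persistence` and the `horizon` argument are illustrative assumptions.

```python
import numpy as np

def smart_persistence(y, y_clear, horizon=1):
    """Smart persistence forecast following Eq. (13).

    y:       measured power output, shape (T,)
    y_clear: expected clear-sky power output, shape (T,) (assumed given)
    Returns forecasts of y(t + horizon) for t = 0 .. T - horizon - 1,
    i.e. an array aligned with y[horizon:].
    """
    y = np.asarray(y, dtype=float)
    y_clear = np.asarray(y_clear, dtype=float)
    y_now = y[:-horizon]
    cs_now = y_clear[:-horizon]
    cs_next = y_clear[horizon:]
    # Ratio of measured to clear-sky output; set to 1 where the clear-sky
    # output is zero, so that the forecast falls back to y_clear(t + horizon).
    ratio = np.divide(y_now, cs_now, out=np.ones_like(y_now), where=cs_now != 0)
    return cs_next * ratio
```

For example, with `y = [0, 2, 3, 0]` and `y_clear = [0, 4, 6, 0]`, the first forecast falls back to the clear-sky value 4, while the second scales the next clear-sky value 6 by the current ratio 2/4.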

The performance of the smart persistence model is reflected by the following error metrics: \(MAE = 0.582\), \(RMSE = 1.274\), \(MBE = 0.029\), \(NRMSE = 0.387\), \(R^2 = 93.811\%\). Although the smart persistence model performs well in comparison with the naive persistence model (\(RMSE_{Naive} = 1.985\), \(MAE_{Naive} = 1.110\)), the LSTM clearly outperforms it. This is also reflected in the forecast skill index of the LSTM model, which is equal to 0.221. A positive forecast skill index indicates that the proposed model outperforms the smart persistence model, while a negative one shows that the smart persistence model performs better.
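As a worked check, the skill index of Eq. (12) can be evaluated with the RMSE values reported above. The helper name `forecast_skill` is an illustrative assumption, and the second value (smart persistence relative to naive persistence) is computed here from the reported RMSE values rather than stated in the text.

```python
def forecast_skill(rmse_proposed, rmse_reference):
    """Forecast skill index of Eq. (12); positive means better than the reference."""
    return 1.0 - rmse_proposed / rmse_reference

# LSTM baseline against smart persistence (reported RMSE values).
lstm_skill = forecast_skill(0.992, 1.274)       # about 0.221

# Smart persistence against naive persistence (derived here, not in the text).
smart_vs_naive = forecast_skill(1.274, 1.985)   # about 0.358
```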

Finally, Fig. 5 depicts the results of the forecasting models (LSTM baseline model and smart persistence model) for two different periods. The LSTM model captures seasonality, trends and weather-related variations in both summer and winter periods, and thus offers significantly better forecasts than the smart persistence model.

### Transfer learning methods

The TL models are equipped with exactly the same characteristics as the baseline model, using the baseline pre-trained model to solve exactly the same problem, with the same features and the same expected output, in a different PV plant. Therefore, the features of the TL models are: (a) Power output measured value, (b) air temperature, (c) global horizontal irradiance, (d) humidity, (e) month of the year (one-hot encoding) and (f) hour of the day (sine/cosine transformation) and the model output is a one-hour ahead forecast of the PV power output.

The proposed TL strategies are validated on 6 PV plants with different nominal and peak capacities, located in 4 cities in Portugal. Four architectures are compared: the three presented TL strategies and a conventional model to which no TL has been applied. For the TL models, pre-training is applied on the whole dataset of the base PV (30 months of data). The four models are then trained using one year of data (8760 h) and tested on the rest of the dataset. The size of the test dataset differs for each PV plant depending on data availability, as presented in Table 1. The accuracy of the models is evaluated based on their performance on the evaluation data using *RMSE*, *MBE*, *MAE*, *NRMSE* and \(R^2\).

Each model is trained 20 times in order to reduce the effect of randomness; this number of repetitions is commonly proposed in the literature. The forecasting performance of all models is presented in Table 2, which reports the average values of RMSE, MBE, MAE, NRMSE and \(R^2\).

Firstly, almost all LSTM models outperform the smart persistence model in terms of RMSE, which illustrates the suitability of the selected model and features for this problem. The only exception is the conventional LSTM of \(PV_2\); even in this case, the three TL models have lower errors than the smart persistence model. The forecast skill index ranges from \(-\,0.15\) (the negative value corresponds to \(PV_2\)) to 0.48 for the conventional model, and from 0.28 to 0.56 for the TL strategies. The average percentage increase of the forecast skill index between the conventional and the TL models is \(16.3\%\). Finally, the MBE values show no indication of systematic bias in any of the developed models.

Regarding the comparison between the conventional LSTM and the three TL models, the impact of TL is evident, as the TL strategies are more accurate than the conventional model for all six PVs. The boxplots presented in Fig. 6 also show that the conventional LSTM has a greater average RMSE in all target PV plants, as well as a larger variance than the TL models. Indeed, the models used without TL suffer from high variance, offering considerably different accuracy in each repetition. In contrast, the models trained with the three TL strategies show nearly zero variance, while also achieving more accurate, unbiased forecasts. A notable point is that for \(PV_3\), where the evaluation period is only 38 days (910 hourly point forecasts), the three TL models do not outperform the conventional model to the extent that they do for the other PV plants. This is because the evaluation takes place solely in March (winter period), when the problem is more complex as weather patterns are often disturbed. Another sign of the forecasting difficulty in \(PV_3\) is that the smart persistence model is not able to produce better forecasts either.

### Data availability impact

As mentioned in the introductory section, one calendar year of data is the minimum time interval for a model to be sufficiently trained, so that it incorporates all seasonal and weather patterns of the problem. The presented results also indicate that, when one year of training data is available, TL models perform better than conventional models, while conventional models are still better than the reference smart persistence ones. However, the application of TL makes it possible to obtain reliable and accurate predictive models even when the available training data for the target domain cover less than one year. In this context, the proposed architectures are compared on the target PVs for different training periods, namely 3 months, 6 months and 9 months of available data. It must be noted that, although the training period changes, the testing period is kept the same to allow comparison between the different scenarios.
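The scenario setup can be sketched as follows. The source does not state which part of the year is used for the shorter training periods, so this sketch arbitrarily takes the first months of the one-year training window; this choice, the month length of 730 h (8760 h / 12) and the name `training_scenarios` are all assumptions made for illustration.

```python
import numpy as np

HOURS_PER_MONTH = 730  # assumption: 8760 h divided evenly into 12 months

def training_scenarios(series, months=(3, 6, 9, 12), full_train_hours=8760):
    """Return a dict {months: training slice} and a shared test slice.

    The test slice always starts after the full one-year training window,
    so every scenario is evaluated on the same period.
    """
    series = np.asarray(series)
    test = series[full_train_hours:]
    # Assumption: each scenario uses the first m months of the training year.
    train = {m: series[:m * HOURS_PER_MONTH] for m in months}
    return train, test
```

Keeping the test slice fixed is what makes the RMSE values of the different scenarios directly comparable.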

Figure 7 presents the RMSE of the four models in the four training-period scenarios (3, 6, 9 and 12 months). The results indicate that the TL models are more robust to different volumes of training data and that their performance improves slightly when more data are available. This can be attributed to their prior training on 30 months of hourly data from the base PV. On the other hand, the impact of data scarcity is apparent for the conventional LSTM model, which improves radically as the training period increases and the model identifies new seasonal and weather patterns. It is worth mentioning that none of the 3-month trained conventional models outperforms the smart persistence model, while only three of the 6-month trained conventional models achieve better accuracy than the smart persistence one. The same does not apply to the 3-month trained TL models, which have lower RMSE than both the conventional LSTM and the smart persistence model.

Finally, the difference in RMSE between the conventional model and the best-performing TL model decreases as more training data become available. This is evident in all six target PVs. For example, the difference in RMSE in \(PV_5\) drops from \(150.5\%\) (3-month training models) to \(15.1\%\), a roughly tenfold reduction. Similar decreasing patterns are identified in the other five PVs, further highlighting the importance of TL, especially when less than one calendar year of data is available.