# Classification-based motion analysis of single-molecule trajectories using DiffusionLab | Scientific Reports – Nature.com

### The workflow and concepts of DiffusionLab

The workflow of the classification-based motion analysis in DiffusionLab is given in Fig. 1. In the first step, the trajectories are imported from a tracking software. At the time of writing, DiffusionLab supports the import of trajectories from the software package Localizer28 and the ImageJ plug-in DoM29. Moreover, the import of simulated trajectories from COMSOL Multiphysics® is supported to compare experimental and simulated data sets42. Step-by-step instructions for how to export the trajectories from the respective software are given in the Documentation section 2.2. In the second step, a set of descriptors of each trajectory is computed, which are scalars that capture a property of the trajectory. Examples of these properties are the length and tortuosity of the trajectory. Both two and three-dimensional trajectories are supported; however, not all properties are defined in three dimensions. A description and physical interpretation of the properties is given in the Documentation section 3.1.1.

A classification model is constructed to classify the trajectories into populations with similar motion behaviour. The populations are identified by the user and classification is done via a hierarchical classification tree. The hierarchical classification tree is read from top to bottom for each trajectory, weighing one property at each branch, until the trajectory is classified (example in Fig. 2). Hierarchical classification trees have the advantage that they are easily interpretable and important properties and their thresholds are readily available from the model43. This aids rational design of the classification model when constructed manually. For machine learning, the classification tree model has the advantage that it is very well interpretable compared to other machine learning models. In DiffusionLab, hierarchical classification trees can be constructed manually or via supervised machine learning. For manual construction, DiffusionLab has various tools to find important properties for classification and to set the thresholds. The biplot shows the properties with a high dispersion that are therefore good candidates to implement in the tree (see Sec. S1.3). Histograms and correlation plots can be generated to find thresholds in e.g., bimodal distributions as is shown in Fig. 1, step 3a. These thresholds can be optimized by applying the classification and visually assessing the result. In supervised machine learning, the model maps an input (trajectory properties) to an output (motion behaviour type) based on example input–output pairs. The example pairs are called a “training set”, which are here a set of representative trajectories labelled with their motion behaviour. DiffusionLab offers a tool to manually classify a subset of experimental trajectories to obtain such a training set. This is then used to compute a hierarchical classification tree model. On top of that, MATLAB’s Classification Learner app can be used to construct a classification model with any classification models available in MATLAB, which includes support vector machines and K-nearest neighbours. In the fourth step, the classification model is used to classify all trajectories in populations.

The motion of the classified populations is analysed in the last step (Fig. 1, step 5), and this is where the power of the classification-based approach becomes truly visible. With DiffusionLab, the user can quantify both the motion of individual trajectories and the ensemble behaviour of each population. Analysis of the individual trajectories reveals the heterogeneity in molecular motion on the single-molecule level. However, in data sets with short trajectories, the statistical relevance of a single trajectory is limited. Because the trajectories have been classified into populations with similar motion behaviour, we can overcome this by analysing the ensemble-averaged behaviour and diffusion parameters. T-MSD analysis is known to suffer from bias in the estimated diffusion parameters due to correlation in the displacements8,34. Since the detected motion behaviour is dependent on the relation between the MSD and the delay time, the bias makes it particularly challenging to determine the diffusion model that best characterizes the motion of the emitters8. The time–ensemble averaged MSD (TE-MSD) averages the T-MSD over a subset of trajectories, thereby averaging over any correlations within the trajectories

$${\text{TE-MSD}}\left( {t_{n} } \right) = \frac{1}{{J_{n} }}\mathop \sum \limits_{{j,N_{j} \ge n}} \frac{1}{{N_{j} – n + 1}}\mathop \sum \limits_{i = 0}^{{N_{j} – n}} \left| {{\varvec{x}}_{j,i + n} – {\varvec{x}}_{j,i} } \right|^{2}$$

(3)

with the time series of positions $${\varvec{x}}_{j,0} ,{\varvec{x}}_{j,1} ,…,{\varvec{x}}_{{j,N_{j} }}$$ for trajectory $$j$$. For each delay time $$t_{n}$$, we include an ensemble of $$J_{n}$$ trajectories with $$N_{j} \ge n$$. The bias in the TE-MSD is strongly reduced with respect to the T-MSD, which results in a interpretable relation between the MSD and the delay time over a longer time domain, facilitating the determination of the motion behaviour type8. Moreover, the diffusion constant can be readily extracted from the TE-MSD, which allows direct comparison within and between different experiments. We use the T-MSD for the analysis of individual trajectories, while TE-MSD is used to analyse a population of trajectories. Altogether, using these two complementary approaches we obtain both insight in the motion heterogeneity on the single-molecule level as well as the mean diffusion parameters.

### A simulated example of trajectory classification and motion analysis

We demonstrate the workflow of DiffusionLab on a dataset of simulated trajectories (see Methods). The trajectories were simulated from random walkers in three dimensions that could switch between an immobile ($${D}_{1}=0$$ m2 s−1) and mobile state ($${D}_{2}=1.0\times {10}^{-12}$$ m2 s−1), mimicking transient trapping often observed in porous solids1,7,16,37,38. We imported the data set into DiffusionLab and find 10,600 individual trajectories (Fig. 1, step 1). Before computing the properties in step 2, we removed trajectories with fewer than five localizations17. In this way, we reject trajectories and unconnected localisations whose properties, such as a minimum bounding circle radius, are not or ill-defined. Moreover, the localisation and tracking algorithm sometimes finds localisations due to fluctuations in the camera noise. Since these erroneous localisations are random, the probability that they form a trajectory becomes smaller for longer trajectories. Thus, we can eliminate most of these from the analysis by removing trajectories with fewer than five localizations. Finally, for a trajectory of five localisations, the minimum number of points in the MSD curve is four, which we consider the minimum for MSD analysis. Removing the trajectories with less than five localizations reduces the number of trajectories to 2433. DiffusionLab then computes and stores the trajectory properties, as explained in the Documentation Section 3.1.1.

Next, the hierarchical classification tree was created (Fig. 1, step 3), and we first identified the motion types present in the data set. Although the movement of all emitters was simulated using the same underlying diffusion constants, we observed three qualitatively different types of trajectories, in agreement with the experimental results of Hendriks and Meirer et al.1 (Fig. 2a and Sec. S1.4). For some trajectories, the coordinates are separated from each other by no more than the localisation error. These do likely not reflect actual movement, but the apparent motion is due to the localisation error. We call these trajectories ‘immobile’. Other trajectories show continuous movement without interruptions, which we label ‘mobile’ trajectories. A third group of ‘hybrid’ trajectories has both mobile and immobile periods. Next, we manually classified a hundred trajectories, assigning each to one of the three motion types, and used this as a training set for machine learning. A training set of just a hundred trajectories is typically considered to be very small for machine learning; nevertheless, we show that this can be sufficient to construct a classification tree with reasonable accuracy. The classification tree has a resubstitution loss of 2%, which means that two of the hundred trajectories in our training set were incorrectly classified by the tree. With a small training set as is discussed here, we highly recommend to assess the tree’s performance via the resubstitution loss as well as by visual inspection of the complete data set after classification. The classification tree obtained is shown in Fig. 2b. A discussion of the performance of other estimators can be found in Section S1.5. Reproduction of the motion analysis with segmentation using a different training set yields a very similar tree and classification results (Sec. S1.6). Interestingly, the same track properties, that is, the minimum bounding circle radius (MBCR), the minimum bounding circle centre minus the centre of mass (MBCC–CoM), and the number of points, were again found to be most important as previously reported in the classification tree of Hendriks and Meirer et al.1. The tree in their work was constructed with machine learning from a training set with mobile, hybrid, and immobile trajectories obtained from an experimental dataset of molecules diffusing in a porous catalyst particle. At every split of our classification tree, the histogram of the respective trajectory property is given in Fig. 2b. A bimodal distribution is clearly visible in the histogram of the MBCR and MBCC–CoM. The machine learning algorithm had found good threshold values to split the bimodal distribution without having access to the complete data set (shown in Fig. 2b). The stochastic nature of the random walk results in some overlap between the properties of the various motion types, as can been seen in the histogram, and the classification is successful if a vast majority of the trajectories in a population have been correctly classified.

We can rationalize why the MBCR, MBCC–CoM, and number of points are good properties to classify a data set with transient trapping. It is no surprise that the MBCR is a good property to classify immobile trajectories, because the emitter does not move, and localisations are scattered around the emitter’s true position. In Section S1.7, we calculate the probability that an immobile trajectory falls entirely within a minimum bounding circle with some threshold radius. A threshold value of the MBCR of 118 nm is reasonable compared to the localization errors of 12 and 28 nm of an in-focus and 400 nm out-of-focus fluorophore44,45. The probability that a trajectory with 17 localisations, which is the mean length for the immobile population, fits within the minimum bounding circle is > 99.99% for an in-focus immobile emitter, while it drops to 99.95% for an out-of-focus one. Thus, the value of the MBCR allows for out-of-focus emitters to be classified as immobile, while keeping it as low as possible to not include many hybrid trajectories. The MBCC–CoM is a good property to distinguish hybrid trajectories from mobile ones. Because it describes how far the centre of the minimum bounding circle and the centre of mass are apart, its value will be higher when the trajectory has a spatially asymmetric distribution of localisations. This is a characteristic of a trajectory with an immobile segment followed by a mobile segment, in other words: a hybrid trajectory. The number of points is a good indicator for whether the trajectory is either hybrid or mobile. Mobile trajectories are a result of emitters rapidly moving in- and out-of-focus resulting in short trajectories, while the immobile segments in hybrid trajectories make the trajectory longer. Because these properties are good descriptors for this type of motion behaviour, we demonstrated that the presented classification tree can be manually adapted to classify a data set with transient confinement (Sec. S1.8). Next, we used the obtained classification tree to classify all trajectories in the transient trapping data set (Fig. 1, step 4).

The motion of the different populations is quantified with MSD analysis of individual trajectories and in ensemble (Fig. 1, step 5). The diffusion constant was measured from a linear least-squares fit of the T-MSD curve including 25% of the shortest delay times and at least three points in the fit. A histogram of the measured diffusion constants of individual trajectories is given in Fig. 3a,b. Prior to classification, we find a continuous range of diffusion constants over six orders of magnitude (Fig. 3a). The distribution contains all information about the heterogeneity of mobility and motion behaviour, but it is hard to interpret. By pooling the data into motion behaviour classes, we learn how the mobility and motion behaviour are related and how they change per experiment. To understand the origin of this wide range in diffusion constants, we consider the same histogram after classification (Fig. 3b). Remarkably, the immobile trajectories make up for most of the spread, while we would expect the diffusion constant of immobile emitters to be $$D = D_{1} = 0$$. Localisation noise dominates the measured diffusion constants of these immobile emitters, and the spread does not reflect a true distribution of diffusion constants. As expected, the immobile trajectories have a diffusion constant close to zero. The distribution of the diffusion constant of mobile trajectories is narrower and centred around $$D = D_{2} = 10^{ – 12}$$ m2 s−1, i.e., the true mobility of the mobile diffusion state in the simulation. The hybrid trajectories have a wider range of diffusion constants in between those of the immobile and mobile diffusion states, because of the varying length of mobile and immobile periods in the trajectories.

The TE-MSD curve provides insight in the mean motion behavior of the emitters via the shape of the curve and fit of an appropriate diffusion model. We compare the TE-MSD before and after classification (Fig. 3c,d). The shape of the TE-MSD curve prior to classification in Fig. 3c resembles confined motion, marked by the plateau at long delay times39. However, we do not expect confined motion in this data set, since the transient trapping diffusion model does not impose confinement on the emitters. Interestingly, we find the same plateau after classification in the hybrid trajectories’ TE-MSD curve, while the mobile and immobile TE-MSD curves are a straight line (Fig. 3d). To explain the plateau at long delay times, we first consider the TE-MSD curve of the hybrid trajectories. The correlation plot of the diffusion constant and the number of localizations of the hybrid trajectories shows that trajectories with few localizations have a high diffusion constant and vice versa (Fig. 3e). The number of localisations in the trajectory is highly dependent on the number of immobile segments, because the emitter cannot move out of focus during these immobile periods. At long delay times in the TE-MSD curve, only the trajectories with many localizations (thus a longer time in the immobile state) contribute, which results in a decrease in the slope of the curve. Now that we understand the origin of the plateau at long delay times in the TE-MSD of the hybrid trajectories (Fig. 3e), we can explain the TE-MSD prior to classification (Fig. 3d). Here, the curve also flattens to a plateau value—even more rapidly than for the hybrid trajectories—because short mobile trajectories and long immobile trajectories contribute to the TE-MSD curve as well. In contrast to the hybrid trajectories, the TE-MSD curves of the mobile and immobile trajectories are a straight line. This confirms that the mobile trajectories are due to normal diffusion. The slope of the TE-MSD of the immobile trajectories is zero, i.e., the apparent diffusion constant is zero, which is in line with their immobile motion behaviour.

The diffusion constant of the mobile trajectories is extracted from the TE-MSD curves using the model for normal Brownian diffusion. For our data set, we found D = 9.23 ± 0.14 × 10–13 m2 s−1. This value is slightly but statistically significantly lower than the true diffusion constant of the mobile state of D = 1.0 × 10–12 m2 s−1 that we had put in our simulations. This is attributed to a small number of immobile steps in the mobile trajectories, which reduces the measured diffusion constant of the population. We can estimate the fraction of immobile steps in the mobile population with the cumulative distribution function (CDF) of squared displacements for a delay time of one frame (Fig. 3f). Normal diffusion in two dimensions should yield46,

$$1 – {\text{CDF}}\left( {r^{2} ,t_{n} } \right) = {\text{exp}}\left( {\frac{{ – r^{2} }}{{{\text{MSD}}\left( {t_{n} } \right)}}} \right)$$

(4)

with $$r^{2}$$ the squared displacement at a delay time $$t_{n}$$. This produces a straight line in a semi-log plot of 1—CDF against $$r^{2}$$. Multiple decays in the 1—CDF curve indicate that multiple populations of emitters with a different diffusion constant are present, each with a different MSD resulting in a different slope of the curve. For the mobile trajectories, we find a transition from a fast decay (immobile displacements) to a slower decay (mobile displacements) (see black guides to the eye in Fig. 3f). This confirms that ~4% of the displacements in mobile trajectories are immobile, which mainly explains a slight underestimation of the true simulated diffusion constant in the TE-MSD analysis. It also shows that the classification is not perfect, and that motion analysis could be further improved by fitting the 1—CDF curve directly. The remainder of the discrepancy between the measured and true diffusion constant originates from imperfect linking of the localisations into tracks. For the immobile trajectories, we find D = 0 ± 4 × 10–18 m2 s−1 confirming that the diffusion constant of the immobile trajectories is zero—as expected.

In summary, we have demonstrated how the DiffusionLab software can be used to analyse a set of trajectories originating from complex heterogeneous motion. The software is a powerful tool for a wide range of materials scientists studying the motion of molecules and nanoparticles in porous materials, including but not limited to zeolite-based materials1,16. Adsorption and confinement play a big role in such porous materials, and classification prior to motion analysis can greatly increase interpretability, avoid bias, and allows for more reliable spatial mapping of diffusion heterogeneity. In previous experimental work using the DiffusionLab software, we have demonstrated this approach in a study of molecular diffusion through a porous catalyst particle, which is industrially used to “crack” long organic molecules into smaller—more useful—chemical building blocks1. The pore space of the catalyst particle is heterogeneous in material composition and porosity, with pore sizes ranging from the macropore to the micropore regime. The complex structure was reflected in the measured heterogeneous molecular motion behaviour. By classification in mobile, hybrid, and immobile trajectories, diffusivity of only the mobile trajectories could be compared with the bulk diffusion coefficient of feedstock molecules, which turned out to be very similar in magnitude, thereby validating the approach. In other work using DiffusionLab, we have tracked the motion of single oligomers in the pores of ZSM-5 zeolites16. By performing classification and motion analysis, we could not only compare the average diffusion constant between experiments, but also explain whether differences were the result of a change in diffusivity during and/or frequency of the mobile periods. This had led to a detailed insight in pore geometry–mass transport relationships on the microscopic scale. These successful applications from materials science recommend the DiffusionLab approach for the field of life sciences as well. The motion types supported by DiffusionLab have been frequently observed in cellular environments. For example, directed motion has been reported when vesicles are transported along microtubules by molecular motors, transient trapping when molecules interact with actin proteins, or confined diffusion when molecules are confined by cell membrane components2,8.