Accurate delineation of individual tree crowns in tropical forests from aerial RGB imagery using Mask R-CNN

Tropical forests are a major component of the global carbon cycle and home to two-thirds of terrestrial species. Upper-canopy trees store the majority of forest carbon and can be vulnerable to drought events and storms. Monitoring their growth and mortality is essential to understanding forest resilience to climate change, but in the context of forest carbon storage, large trees are underrepresented in traditional field surveys, so estimates are poorly constrained. Aerial photographs provide spectral and textural information to discriminate between tree crowns in diverse, complex tropical canopies, potentially opening the door to landscape monitoring of large trees. Here we describe a new deep convolutional neural network method, Detectree2, which builds on the Mask R-CNN computer vision framework to recognise the irregular edges of individual tree crowns from airborne RGB imagery. We trained and evaluated this model with 3,797 manually delineated tree crowns at three sites in Malaysian Borneo and one site in French Guiana. As an example application, we combined the delineations with repeat lidar surveys (taken between 3 and 6 years apart) of the four sites to estimate the growth and mortality of upper-canopy trees. Detectree2 delineated 65,000 upper-canopy trees across 14 km² of aerial images. The skill of the automatic method in delineating unseen test trees was good (F1 score = 0.64) and for the tallest category of trees was excellent (F1 score = 0.74). As predicted from previous field studies, we found that growth rate declined with tree height and tall trees had higher mortality rates than intermediate-size trees. Our approach demonstrates that deep learning methods can automatically segment trees in widely accessible RGB imagery. This tool (provided as an open-source Python package) has many potential applications in forest ecology and conservation, from estimating carbon stocks to monitoring forest phenology and restoration. Python package available to install at https://github.com/PatBall1/Detectree2

Table 1. Remote sensing data sources. The exact location of the sites is described in Study sites. Resolution is given as ground resolution for the RGB imagery and as the processed CHM resolution for the lidar scans. Beam divergence is given at the 1/e² points. Sepilok West and Sepilok East were separated for analysis due to the differing characteristics of the Sepilok forest in these two areas.

Study sites. The analyses were conducted at four locations across three tropical field sites.

Fig. 1. The automatic tree crown delineation workflow. Manually delineated crowns are randomly split into training and test sets (although the figure suggests that the sets were determined geographically, this is purely for visual clarity). The Mask R-CNN framework combines the training set with the RGB imagery to learn how to delineate crowns automatically from RGB images. A set of automatic predictions is produced across the entire RGB image and compared to the test set to evaluate the performance of the automatic delineations. Intersection over union (IoU) is used to determine when an automatic crown has been successfully matched with a manual crown.
hyperspectral layers. We used several techniques to improve the accuracy of crown delineation, including manipulating the contrast and saturation of the RGB image to exaggerate differences between the crowns, and using a mask of the lidar data to remove irrelevant parts of the RGB imagery. These techniques meant that the vast majority of tree crowns were separable by eye, but it should be noted that in rare cases tree crowns were near impossible to delineate with certainty and the labeller's best estimate was used. See Sup. Note 2 for further details.

We trained and tested our model with a total of 3,797 manually delineated tree crowns across Paracou (1,267), Danum (521), Sepilok West (1,038) and Sepilok East (971). The crowns from Paracou were validated in the field with an expert local botanist, whereas the crowns in Malaysia were drawn by inspection of the remote sensing products. Four individuals performed the manual delineations, which provided the network with variability in the inputs. When assigning tiles to the training and test sets, a minimum crown polygon area coverage of 40% per tile was required. Including overly sparse tiles was likely to lead to poor algorithm sensitivity, while being too strict with coverage would have limited the amount of training and testing data available.

If training and test crowns are close to one another, spatial autocorrelative effects are likely to inflate the reported performance (53). To avoid this, individual tiles (rather than individual crowns) were assigned to training and test sets, ensuring spatial separation. Approximately 10% of the tiles from each site were reserved at random for testing. To avoid contamination of the test set, tiles with any overlap with the test tiles (including with the buffer) were excluded from the training set. The training tiles were further partitioned into 5 folds for cross-validation. This allowed for the tuning of parameters and the implementation of early stopping (see Training and model selection) without exposing the test set. Details of the data processing are described in Sup. Note 3.
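The tile selection and spatial splitting logic described above can be sketched as follows. This is an illustrative outline only, not the Detectree2 implementation: the file names, the 10 m buffer and the random seeds are assumptions.

    import geopandas as gpd
    import numpy as np
    from sklearn.model_selection import KFold

    # Hypothetical inputs: a grid of square image tiles and the manually delineated crowns.
    tiles = gpd.read_file("tiles.gpkg")
    crowns = gpd.read_file("manual_crowns.gpkg")

    # Fraction of each tile covered by manual crown polygons.
    crown_union = crowns.unary_union
    tiles["coverage"] = tiles.geometry.apply(lambda t: t.intersection(crown_union).area / t.area)

    # Keep only tiles with at least 40% crown coverage.
    tiles = tiles[tiles["coverage"] >= 0.4].reset_index(drop=True)

    # Reserve roughly 10% of tiles for testing; whole tiles (not individual crowns)
    # are assigned so that training and test data are spatially separated.
    rng = np.random.default_rng(42)
    test_idx = rng.choice(len(tiles), size=max(1, len(tiles) // 10), replace=False)
    tiles["split"] = "train"
    tiles.loc[test_idx, "split"] = "test"

    # Exclude training tiles that overlap a buffered test tile to avoid contamination.
    test_union = tiles.loc[tiles["split"] == "test"].geometry.buffer(10).unary_union
    contaminated = (tiles["split"] == "train") & tiles.geometry.intersects(test_union)
    tiles = tiles[~contaminated]

    # Partition the remaining training tiles into 5 folds for cross-validation.
    train_tiles = tiles[tiles["split"] == "train"].reset_index(drop=True)
    folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(train_tiles))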

Model architecture and parameterisation. Instance segmentation combines object detection with object segmentation.

Once an object has been detected in a scene, a region of interest (as a bounding box) is established around the object. A "segmentation" is then carried out to identify which pixels within the region of interest make up the object of interest (and which lie outside; see Fig. S2 for an example).

We adapted Facebook AI's Mask R-CNN architecture as it was the best-in-class algorithm for instance segmentation upon release, when tested on the Microsoft COCO (Common Objects in Context) benchmark (40, 54), and has since been updated (as Detectron2) with improved training efficiency, documentation and transferability for integration into bespoke tools (41). We adapted the Detectron2 computer vision library to handle geospatial inputs/outputs and perform the delineation of individual tree crowns. The library performs instance segmentation by generating object "masks" which exactly circumscribe the objects in the image (see Fig. S2 for an example prediction). It also has a "model zoo" from which specific model architectures, with weights from a variety of pre-training regimes, can be loaded. Taking a pre-trained model (weights) and retraining it to perform a novel task (e.g. delineating trees from aerial imagery) is an example of transfer learning, which can drastically reduce the amount of training data required to achieve acceptable performance on the new task (55). We selected the R101-FPN configuration as it has "the best speed/accuracy tradeoff" of the architectures available (41).

The model hyperparameters (see Table S1) were tuned with a Bayesian hyperparameter sweep implemented on wandb.ai. This is an automated process in which an agent iteratively adjusts hyperparameters to optimise accuracy. The best performing models and the optimal confidence threshold for a given model (see Model architecture and parameterisation) were selected based on the F1 score (see Performance evaluation) on the validation fold.
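As a minimal sketch (not Detectree2's actual configuration code; the number of classes and freeze depth shown are illustrative), a pre-trained R101-FPN Mask R-CNN model can be loaded from the Detectron2 model zoo as follows:

    from detectron2 import model_zoo
    from detectron2.config import get_cfg

    cfg = get_cfg()
    # Base R101-FPN Mask R-CNN configuration and its COCO pre-trained weights.
    cfg.merge_from_file(
        model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
        "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml")
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # a single "tree" class
    cfg.MODEL.BACKBONE.FREEZE_AT = 3     # keep early backbone stages at their pre-trained weights

Retraining such a configuration on the manually delineated crowns is the transfer-learning step described above.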

See Sup. Note 4 for more details on model architecture, training and validation. The Colab (Jupyter) notebooks in the GitHub repository illustrate best practices for training and selecting models.

Performance evaluation. After tuning and training, the best performing models were taken forward to be evaluated against the test tiles. Matches between predictions and manual crowns (i.e. true positives) were identified by assessing the degree of spatial overlap between possible pairs. A minimum area threshold of 16 m² was set for valid crowns. This removed fewer than 2% of manual crowns and introduced a level of consistency between sites and in the effort given by the manual delineators. The threshold was small enough to allow for an inclusive analysis of the variation in performance by tree height.

Crown overlap was calculated as the area intersection over union (see Fig. 1):

IoU = (A ∩ B) / (A ∪ B),

where A is the automatically delineated crown area and B is the manually delineated crown area. An IoU of more than 0.5 for an overlapping pair was considered a match. This is a commonly used threshold in similar studies (e.g. (25, 36)) that allows for small discrepancies in alignment and outline. These "true positives", as well as the unmatched predictions (false positives) and unmatched manual crowns (false negatives), were used to calculate the precision, recall and F1 score of the automatic predictions.
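A sketch of this matching step using shapely is given below; the greedy one-to-one matching is our illustration of the procedure rather than the exact Detectree2 evaluation code.

    from shapely.geometry import Polygon

    def iou(a: Polygon, b: Polygon) -> float:
        """Area intersection over union of two crown polygons."""
        union = a.union(b).area
        return a.intersection(b).area / union if union > 0 else 0.0

    def match_crowns(predicted, manual, threshold=0.5):
        """Count true positives, false positives and false negatives.
        Each manual crown is matched to at most one prediction (IoU > threshold)."""
        unmatched = list(predicted)
        true_pos = 0
        for m in manual:
            best = max(unmatched, key=lambda p: iou(p, m), default=None)
            if best is not None and iou(best, m) > threshold:
                true_pos += 1
                unmatched.remove(best)
        false_pos = len(predicted) - true_pos
        false_neg = len(manual) - true_pos
        return true_pos, false_pos, false_neg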

Despite the best efforts of the manual delineators and the selection of tiles with high manual crown coverage, the manual crowns were inevitably an incomplete representation, so recall (the fraction of relevant instances retrieved) was an insightful metric.

However, to ensure balance with precision, we used the balanced F1 score to assess and compare the accuracy of the models. This approach is not biased by tree crown area and is widely used in tree crown segmentation studies (36, 39). See Sup. Note 5 for more details on the evaluation metrics.
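For reference, these metrics follow their standard definitions:

Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 × Precision × Recall / (Precision + Recall),

where TP, FP and FN are the numbers of matched predictions (true positives), unmatched predictions (false positives) and unmatched manual crowns (false negatives).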

To evaluate the performance of Detectree2 across tree heights, we assigned a height to each test crown (based on the median pixel value of the initial CHM within the crown) and arranged them into 5 m height bins. The shortest bin (0-5 m) at each site was iteratively merged with the next shortest bins until more than 10 individuals were represented (e.g. at Paracou 17 trees with …).

Transferability across sites. To determine whether models were able to generalise across different tropical forest areas, we evaluated the performance of the models when trained on one site and transferred to others. We compared these performances against the effect of using the "combined" training regimes described in Training and model selection.

We applied the best performing models across the orthomosaics (excluding the very edges, where distortion is prominent) to generate site-wide crown maps. We combined these crown delineations with repeat lidar surveys to determine the height changes in individual trees at our four sites. We determined the relationship between tree height and tree growth by fitting a robust least squares regression (56) to the data. Robust least squares was chosen to minimise the effects of outliers and mortality events on the regression. We note that here we are measuring the vertical growth of trees, rather than the growth in diameter at breast height (DBH) which is traditionally measured in forest inventory data.
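Extracting crown heights from the CHM (used both for the height-binned evaluation and for the growth analysis) can be sketched with zonal statistics; the file and column names here are placeholders, not Detectree2's actual interface.

    import geopandas as gpd
    from rasterstats import zonal_stats

    crowns = gpd.read_file("crowns.gpkg")
    # Median CHM value within each crown polygon defines the tree height.
    stats = zonal_stats(crowns, "chm_initial.tif", stats=["median"])
    crowns["height_m"] = [s["median"] for s in stats]

    # 5 m height bins for the performance-by-height analysis.
    crowns["height_bin"] = (crowns["height_m"] // 5) * 5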

To estimate mortality rates, we needed a suitable metric to identify mortality events. We took a statistical approach, defining a mortality event as a negative change in height more than three standard deviations below the robust least squares fit. This allowed for the possibility that a mortality event may uncover another layer of vegetation rather than the forest floor. This choice was validated by manually inspecting trees meeting this threshold and confirming that they constituted mortality events.

Annual rates were determined by dividing by the time between lidar scans.
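The growth fit and mortality criterion can be sketched with a robust linear model; this uses statsmodels' RLM with a Huber norm as a stand-in for the robust least squares estimator, and the variable names are illustrative.

    import numpy as np
    import statsmodels.api as sm

    def growth_and_mortality(height_t0, dh, years):
        """height_t0: tree heights (m) at the first scan; dh: height change (m)
        between scans; years: time between lidar scans. Arrays are numpy arrays."""
        X = sm.add_constant(height_t0)
        fit = sm.RLM(dh, X, M=sm.robust.norms.HuberT()).fit()  # robust least squares
        resid = dh - fit.predict(X)
        sigma = fit.scale                           # robust estimate of residual spread
        dead = resid < -3 * sigma                   # >3 SD below the fit: mortality event
        growth_rate = dh[~dead] / years             # vertical growth in m per year
        mortality_rate = 100 * dead.mean() / years  # percent of trees per year
        return growth_rate, mortality_rate, fit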

Because biases in tree height measurements can arise from differences in lidar scan parameters between surveys, we refrained from reporting a direct comparison of growth and mortality rates between sites. As our focus here was on demonstrating the use of Detectree2 for locating crowns, we considered a detailed exploration of the potential biases in the lidar data to be beyond the scope of the current paper.

Performance by site and tree height. Detectree2 located and delineated trees well (F1 > 0.56) across all sites (see Table 2). Across all sites, accuracy improved with tree height (see Fig. 3). This is likely due to the increased visibility of taller tree crowns.

Fig. 4. Sensitivity of Detectree2 delineation accuracy to the training data used. In (A), "Same" indicates that training and testing took place at the same site. Sepilok West and Danum are "similar" forest types in that they are tall, dipterocarp-dominated forests, in contrast to Sepilok East and Paracou, which are shorter forests with a larger number of trees per hectare. Each site has two "different" sites, and an average F1 score was calculated across them. (B) shows the change in performance that occurs through employing different combinations of the training data: just a single site, all sites at once ("combined"), or all sites at once followed by a limited number of iterations on the site to be tested on.
The models transferred better to sites with a similar forest type to the one they were trained on (see Fig. 4a). For example, the performance at the forests of Sepilok West is significantly …

In general, the model that was trained on all sites at once ("combined") outperformed the models that were trained on just a single site, with the exception of Paracou (see Fig. 4b). Across the board, the best performing models were those that were exposed to data from all sites before being trained for a fixed number of iterations at the site to be predicted on. This suggests that providing a broad range of input data helps the networks to learn the key visual features, but further tuning for local context helps maximise performance.

Application: Growth and mortality. One application of Detectree2 is to study tall tree growth and mortality rates. To do this, we overlaid Detectree2's tree crown predictions at the start date for each site on repeat lidar data (as canopy height models described in Sup. Note 1) to retrieve tree height dynamics over time.

We estimated the relationship between tree height and tree growth for each site by fitting robust least squares regressions (reported in Table S2). The growth rate decreased with tree height at all sites.

We assumed trees had died when their height decreased substantially. To evaluate this quantitatively, we fitted a robust least squares regression to the height change against the original height of the tree, taking trees that were 3 standard deviations below the mean of the regression fit to be mortality events. The robust least squares regression differs from ordinary least squares in that outliers contribute less to the fit. The robust least squares therefore weights the fit towards those trees which did not suffer large height loss, and by taking the threshold to be 3 standard deviations we aim to identify only those trees that fall outside the assumed normal distribution of typical tree growth and measurement error. Furthermore, as the robust least squares still incorporates outliers when fitting the data, three standard deviations was considered sufficient to identify mortality events. Fig. 5 illustrates how certain trees were identified as mortality events, and some visual examples of mortality events from Paracou are given alongside (see Fig. S12 for the other sites).

… gives the growth rate of trees split by height bin. Due to biases in tree height measurements that can arise from differences in lidar scan parameters, we advise against a direct comparison of growth and mortality rates between sites. Uncertainty estimates were determined by bootstrapping.

The mortality rates increased with tree height (Fig. 6).

The best performing models were those that were exposed to training data from all the sites and then "honed" with a limited number of training iterations on the site to be predicted on. This suggests that our trained models (provided freely with the Python package) could be transferred to a new site with very little manual data or training iterations.

We note that the manual delineations were done by different people focusing on different parts of the sites. There was no clear effect of different delineators on the results, but this would be somewhat confounded with site differences.

Application: Growth and mortality rates. Tall trees store the majority of forest carbon and dominate many important forest nutrient cycles. However, they are rare and therefore poorly represented in traditional field inventories (46), which makes estimating their growth and mortality rates particularly challenging (42-45).

We focused on aerial RGB imagery, which is the cheapest and most widely available imaging source for tropical forests. We also benefited from the variety of pre-trained models that come with this data type. However, different data sources may provide additional information that would help to discern differences between crowns. In particular, multi-spectral imagery, which typically includes additional bands in the near-infrared, is commonly used to study differences between trees due to their optical properties. … an orthomosaic. Including this as a layer would add an additional dimension of information that could help to distinguish fine differences in structure. It would be straightforward to include additional (or different) bands in the Detectree2 framework, but doing so would forego the utility of the pre-trained models. Therefore, it is likely that significantly more training data and computational resources would be required to train a model (from scratch) to the desired performance.

Ideally, we could apply this approach to satellite imagery to perform global analyses. Preliminary tests suggest that Detectree2 can accurately delineate trees in RGB imagery at 2 m resolution (see Sup. Note 7), which is equivalent to modern high-resolution satellite imagery. If this proves possible, it will help answer many long-standing questions in forest ecology as well as provide an important tool for forest management. A "random resizing" augmentation step would further help improve generalisability across resolutions, and incorporating "small object" detection features (65) would improve sensitivity to shorter trees.

While we studied the delineation of a single class (tree), Detectree2 can be trained on, and make predictions for, multiple classes.

This may allow for low-cost species identification and mapping. It may also help to automatically assess the occurrence of liana infestation. Previously, hyperspectral data has been employed to address this problem, but with limited success due to the …

Training tiles were separated into 5 folds so that cross-validation could be performed during the hyperparameter tuning phase. Testing tiles were spatially separated from the training tiles and only seen by the network at the evaluation phase.

Supplementary Note 4: Model architecture, tuning and training
A. Model architecture. Mask R-CNN (40) is a framework for instance segmentation. It builds on Faster R-CNN (69), which carries out object detection using a Region Proposal Network (RPN). An RPN is a fully convolutional network, trained end-to-end, that generates Regions of Interest (RoIs) for each image. Each RoI has object bounds and an objectness score attached to it, giving the bounds of the RoI and the likelihood of the RoI containing an object. These RoIs are then passed through fully connected and convolutional network heads to determine the class of the object contained, and the exact mask of each object, within the respective bounding box. For full details on the structure of Mask R-CNN, please refer to the GitHub repository for this work (https://github.com/PatBall1/Detectree2). A schematic of the model architecture is given in Fig. S2.
We selected the R101-FPN configuration. The configuration sets the backbone of the network, which is the part that views and extracts features from the scene as a whole. The R101-FPN backbone consists of a 101-layer-deep ResNet (71) module with a Feature Pyramid Network (72) module. The initial model weights were generated from pre-training of the network on the ImageNet dataset. It is possible to "freeze" the backbone to different depths depending on the amount of flexibility the user wants to introduce in moving away from the pre-trained model weights. An example of the different predictions is given in Fig. S3.
B. Data augmentation. The training data were augmented with a variety of transformations including vertical and horizontal flips, rotation, and varying the saturation and contrast of the images. These augmentations are designed to increase the variety of training data seen by the model, allowing it to generalise more readily to new images.
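In Detectron2, such augmentations can be assembled from the built-in transform classes; the probabilities and ranges below are illustrative rather than the values used to train Detectree2.

    from detectron2.data import transforms as T

    augmentations = [
        T.RandomFlip(prob=0.5, horizontal=True, vertical=False),  # horizontal flip
        T.RandomFlip(prob=0.5, horizontal=False, vertical=True),  # vertical flip
        T.RandomRotation(angle=[-90, 90]),                        # rotation within a range
        T.RandomSaturation(0.8, 1.2),                              # vary saturation
        T.RandomContrast(0.8, 1.2),                                # vary contrast
    ]

Such a list can be passed to a custom dataset mapper when building the training data loader.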
C. Training and hyperparameter tuning. Models were tuned and selected based on a given accuracy metric. We selected the AP50 of the segmentation predictions on the (randomly selected) validation fold as our metric for optimisation. AP50 is the average precision of predictions when a correct match is granted for IoU > 0.5.
AP50 was also used as the metric for early stopping, whereby if model performance failed to improve for a set number of training iterations (the "patience"), training was terminated and the best model up to that point was saved. This is a regularisation technique and prevents wasted computation. As with all deep networks, there were several hyperparameters that controlled how the networks handled data and were trained. The first of these relates to the architecture of the algorithm, namely the number of hidden layers in the ResNet backbone (71) of the network. ResNets are used as the backbone of Mask R-CNN because very deep neural networks are particularly difficult to train due to the vanishing gradients problem (73). ResNets use skip connections, which provide shortcut paths around groups of layers so that gradients can flow more easily during backpropagation, avoiding this problem.
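The early-stopping logic can be summarised as follows; the three callables are hypothetical stand-ins for the actual training, evaluation and checkpointing calls.

    def train_with_early_stopping(train_step, evaluate_ap50, save_best,
                                  patience=5, eval_every=100, max_iter=3000):
        """Stop when validation AP50 has not improved for `patience` evaluations.
        train_step(n) runs n training iterations, evaluate_ap50() returns the current
        validation AP50 and save_best() stores the current weights (all hypothetical)."""
        best_ap50, best_iter, waited = -1.0, 0, 0
        for it in range(eval_every, max_iter + 1, eval_every):
            train_step(eval_every)
            ap50 = evaluate_ap50()
            if ap50 > best_ap50:
                best_ap50, best_iter, waited = ap50, it, 0
                save_best()
            else:
                waited += 1
                if waited >= patience:
                    break
        return best_iter, best_ap50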
Other hyperparameters that can be optimised in Mask R-CNN relate to the training of the network, namely the learning rate, the number of iterations and the batch size. The optimisation of these hyperparameters was carried out using Weights and Biases, which uses grid search to determine the optimal hyperparameters. The hyperparameters selected are given in Table S1.

Table S1. Tunable hyperparameters (with their optimised values) and a description of their purpose.

dl_nums_workers (=2): To speed up the training process, we make use of the num_workers optional attribute of the DataLoader class. The num_workers attribute tells the data loader instance how many sub-processes to use for data loading. By default, num_workers is set to zero, and a value of zero tells the loader to load the data inside the main process, so training works sequentially inside the main process: after a batch is used and another is needed, the batch data are read from the disk. With worker processes, we can make use of the fact that the machine has multiple cores, so the next batch can already be loaded and ready to go by the time the main process is ready for it. This is where the speed-up comes from: batches are loaded by additional worker processes and queued up in memory.
The iteration number to decrease the learning rate by GAMMA.
backbone_freeze_at (=3): Freeze the first several stages of the backbone so they are not trained. There are 5 stages in ResNet: the first is a convolution, and each of the following stages is a group of residual blocks. Freezing a layer prevents its weights from being modified. This technique is often used in transfer learning, where the base model (trained on some other dataset) is frozen.

warmup_iters (=120): Warm-up steps are a few updates with a low learning rate before training proper. After this warm-up, the regular learning rate (or schedule) is used to train the model to convergence. The idea is that this helps the network to adapt to the data gradually.
If the data set is highly differentiated, it can suffer from a sort of "early over-fitting". If the shuffled data happen to include a cluster of related, strongly-featured observations, the model's initial training can skew badly toward those features, or worse, toward incidental features that are not truly related to the topic at all. Warm-up is a way to reduce the primacy effect of the early training examples. Without it, one may need to run a few extra epochs to reach the desired convergence as the model un-learns those early biases. Many frameworks offer this as a command-line option. The learning rate is increased linearly over the warm-up period: if the target learning rate is p and the warm-up period is n, the first batch iteration uses p/n for its learning rate, the second uses 2p/n, and so on, so that iteration i uses i·p/n, until the nominal rate is reached at iteration n. This means that the first iteration gets only 1/n of the primacy effect, which does a reasonable job of balancing that influence. Note that the ramp-up is commonly on the order of one epoch, but occasionally longer for particularly skewed data, or shorter for more homogeneous distributions. One may want to adjust it depending on how extreme the batches can become when the shuffling algorithm is applied to the training set.

momentum (=0.9): Momentum in neural networks is a variant of stochastic gradient descent. It replaces the gradient with a momentum term, which is an aggregate of past gradients. Momentum can increase speed when the cost surface is highly non-spherical because it damps the size of the steps along directions of high curvature, thus yielding a larger effective learning rate along directions of low curvature.

batch_size_per_image (=1024): The batch size defines the number of samples that will be propagated through the network. For instance, suppose there are 1,050 training samples and the batch_size is set to 100. The algorithm takes the first 100 samples (1st to 100th) from the training dataset and trains the network, then takes the next 100 samples (101st to 200th) and trains the network again, and so on until all samples have been propagated through the network. A problem may arise with the last set of samples: 1,050 is not divisible by 100 without remainder, and the simplest solution is to take the final 50 samples and train the network on them. Advantages of using a batch size smaller than the number of all samples: it requires less memory, since the network is trained using fewer samples per step, which is especially important if the whole dataset cannot fit in the machine's memory; and networks typically train faster with mini-batches, because the weights are updated after each propagation (in the example, 11 batches are propagated, 10 of 100 samples and 1 of 50, with a parameter update after each, whereas using all samples at once would give only one update). Disadvantage of using a batch size smaller than the number of all samples: the smaller the batch, the less accurate the estimate of the gradient.

weight_decay (=0.001): Weight decay is a regularisation technique that adds a small penalty, usually the L2 norm of the weights (all the weights of the model), to the loss function: loss = loss + weight_decay_parameter × L2 norm of the weights. Weight decay is used to prevent overfitting and to keep the weights small, avoiding exploding gradients. Because the L2 norm of the weights is added to the loss, each iteration of the network will try to minimise the model weights in addition to the loss.
base_lr (=0.0003389): In machine learning and statistics, the learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function. Since it influences to what extent newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model "learns". In the adaptive control literature, the learning rate is commonly referred to as gain.
In setting a learning rate, there is a trade-off between the rate of convergence and overshooting. While the descent direction is usually determined from the gradient of the loss function, the learning rate determines how big a step is taken in that direction. Too high a learning rate will make the learning jump over minima, whereas too low a learning rate will either take too long to converge or get stuck in an undesirable local minimum. In order to achieve faster convergence, prevent oscillations and avoid getting stuck in undesirable local minima, the learning rate is often varied during training, either in accordance with a learning rate schedule or by using an adaptive learning rate.
max_iter (=462): An iteration describes the number of times a batch of data passes through the algorithm; in the case of neural networks, that means one forward pass and one backward pass, so every time a batch of data is passed through the network, an iteration is completed. Increasing the number of iterations brings the parameters closer to the minimum of the loss, which is what improves performance.
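These tuned values map onto standard Detectron2 configuration fields; the assignment below is our reading of Table S1 rather than a dump of Detectree2's actual configuration.

    from detectron2.config import get_cfg

    cfg = get_cfg()
    cfg.DATALOADER.NUM_WORKERS = 2                    # dl_nums_workers
    cfg.MODEL.BACKBONE.FREEZE_AT = 3                  # backbone_freeze_at
    cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 1024   # batch_size_per_image
    cfg.SOLVER.BASE_LR = 0.0003389                    # base_lr
    cfg.SOLVER.WARMUP_ITERS = 120                     # warmup_iters
    cfg.SOLVER.MOMENTUM = 0.9                         # momentum
    cfg.SOLVER.WEIGHT_DECAY = 0.001                   # weight_decay
    cfg.SOLVER.MAX_ITER = 462                         # max_iter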
We found that while increasing the depth of the ResNet increased the training time of the algorithm, it improved accuracy scores. The batch size was limited by the computing resources available. The number of iterations was the key hyperparameter to optimise, as we found that our model began to overfit the training data if trained for too many iterations. To determine the number of iterations, we plotted both the total training loss and the total validation loss (Fig. S6) to find the minimum of the total validation loss, and hence the optimal number of iterations to train the model. As Mask R-CNN makes three predictions, namely the class, bounding box and mask, the total loss is the sum of these three losses.
We see that after around 800 iterations the model started to overfit the training data at the expense of performance on the validation dataset. As such, we stopped training Mask R-CNN at 800 iterations and used the weights saved at that point to predict tree crowns in Sepilok.

Supplementary Note 5: Evaluation metrics
Two key metrics are used in this work to evaluate the performance of the model: the AP50 score and the F1 score. AP50 relates to the area under the precision-recall curve (AUC-PR) evaluated at a particular threshold for the intersection over union, in our case 0.5. A high value for AP50 means that the model achieves both high precision and high recall at our particular IoU cutoff. The IoU cutoff of 0.5 was selected on the basis of previous studies evaluating the accuracy of individual tree delineation methods. In rare cases where more than one automatic delineation had an IoU of greater than 0.5 with a manual crown, the automatic delineation with the greatest IoU was considered a true positive and the others were classed as false positives. Visual examples of different IoU values are given in Fig. 1.

Supplementary Note 7: Sensitivity to image resolution
We performed a preliminary analysis of how sensitive the delineation accuracy is to changes in image resolution. In one test, we trained our model on the highest resolutions available to us (8 and 10 cm) before testing it on lower resolutions to determine the model's sensitivity to resolution. The lower resolutions were chosen as they are typical of modern, high-resolution satellite imagery. A final test was to train and test the model on low resolutions to determine whether this workflow would increase the skill of the model at lower resolutions. As seen in Fig. S11, our model trained on high-resolution (0.1 m) imagery is most successful when predicting on similar resolutions, and its performance degrades for resolutions an order of magnitude coarser (1 m and 2 m). However, when the model is trained on resolutions similar to those it is tested on, performance is largely maintained; it is only at resolutions coarser than 1 m that performance degrades significantly.

Supplementary Note 8: Growth and mortality details
Growth and mortality rates were estimated from the CHMs of the repeat lidar data (taken at all sites as detailed in Table 1). Tree height was defined as the median value of the CHM within the overlaid crown polygon. To calculate growth rate (in m/yr), we evaluated the difference in height of individual trees between the two years of measurements and divided by the time between the measurements (assuming an approximately constant growth rate).
Mortality estimates were derived by fitting a robust least squares regression to the change in tree height against the original tree height. Mortality events were defined as a drop in height of 3 standard deviations or more below this fit. The mortality rate (%/yr) was calculated as the percentage of detected trees that died, divided by the time between scans. The tree mortality rates were grouped into height bins (depending on the original height of each tree). A drop of three standard deviations is a strict threshold, but it is possible that such drops result from non-mortality events such as snapping or other wind damage. Uncertainty estimates were determined by bootstrapping (repeatedly sampling from the complete set of predicted crowns). Plots analogous to Fig. 5(A) for the other sites are given in Fig. S12. These plots illustrate the different characteristics of each forest site.
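A sketch of the bootstrap used for the uncertainty estimates is given below; the rate function passed in is a placeholder for the growth- or mortality-rate calculation applied to a resampled set of crowns.

    import numpy as np

    def bootstrap_ci(values, rate_fn, n_boot=1000, ci=95, seed=0):
        """Percentile bootstrap confidence interval for a rate computed over crowns.
        `values` holds one entry per predicted crown (e.g. height change) and
        `rate_fn` maps a resampled array to a rate (hypothetical placeholder)."""
        rng = np.random.default_rng(seed)
        rates = [rate_fn(rng.choice(values, size=len(values), replace=True))
                 for _ in range(n_boot)]
        lower = np.percentile(rates, (100 - ci) / 2)
        upper = np.percentile(rates, 100 - (100 - ci) / 2)
        return lower, upper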