Project Overview: Plant Growth Prediction

This application allows researchers to interactively analyze Tree Amplitude outcomes.

1. The Goal

Our primary objective is to determine which environmental factors—such as soil moisture or solar intensity—are the strongest predictors of tree radius amplitude.

2. Methodologies

  • Multiple Regression: A statistical modeling approach used to quantify how environmental variables influence stem radius amplitude, the size of the daily swelling–shrinking cycle of the tree stem. By modeling predictors like VPD, soil moisture, and solar exposure (and their interactions), the multiple linear regression (MLR) helps identify which conditions most strongly increase or dampen these daily changes in stem size.
  • Logistic Regression: A predictive model used to classify daily basal area amplitude into change categories (for example, 'No/Little Change' versus 'Extreme Change').
  • Principal Component Analysis (PCA): A dimensionality reduction technique used to explore the structure of the data and identify natural clusters.
  • K-Nearest Neighbors (KNN): A supervised machine learning model used to classify growth categories.
  • K-Means Clustering: An unsupervised machine learning model used to identify natural clusters in the data.
Filtered Dataset
Multiple Regression (OLS & Mixed Effects)

This section shows how environmental factors, particularly Vapor Pressure Deficit (VPD), soil volumetric water content (VWC), and solar irradiance, influence daily radial stem changes in white spruce at Arctic treeline. Tree dummies in OLS and random intercepts in mixed effects are used to account for the repeated daily measurements per tree, ensuring that baseline differences between trees don’t bias the estimated effects of environmental variables. Understanding how VPD, soil water, and solar irradiance affect daily stem changes helps reveal how Arctic treeline trees respond to environmental stress, which can inform predictions about tree growth and ecosystem resilience under climate change.
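As a rough illustration of the tree-dummy idea, the sketch below fits a least-squares model with one intercept per tree, so that each tree's baseline is absorbed before the environmental slopes are estimated. The data are synthetic, and the names (`fit_ols_with_tree_dummies`, `vpd`, `vwc`, `amplitude`) are assumptions made for this example, not the app's actual code. The mixed effects variant with random intercepts is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): 3 trees, 50 daily measurements each.
n_trees, n_days = 3, 50
tree = np.repeat(np.arange(n_trees), n_days)
vpd = rng.uniform(0.2, 2.0, size=tree.size)
vwc = rng.uniform(0.1, 0.4, size=tree.size)
baseline = np.array([5.0, 8.0, 11.0])[tree]  # each tree has its own baseline
amplitude = baseline + 3.0 * vpd - 4.0 * vwc + rng.normal(0, 0.1, tree.size)


def fit_ols_with_tree_dummies(y, predictors, tree_ids):
    """Least-squares fit with one dummy (intercept) per tree, no global intercept."""
    trees = np.unique(tree_ids)
    dummies = (tree_ids[:, None] == trees[None, :]).astype(float)
    X = np.column_stack([predictors, dummies])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    n_pred = predictors.shape[1]
    return coef[:n_pred], coef[n_pred:]


slopes, tree_intercepts = fit_ols_with_tree_dummies(
    amplitude, np.column_stack([vpd, vwc]), tree
)
print("VPD and VWC slopes:", slopes)
print("Per-tree intercepts:", tree_intercepts)
```

Because every tree gets its own intercept, the slopes are estimated from within-tree variation only, which is the sense in which baseline differences between trees are kept out of the environmental effects.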

Mixed Effects Model Output

                    
OLS Output

                    
Model Performance Metrics
Accuracy Score

                      
Detailed Classification Report

                    
Understanding the Model
Interactive Controls

Use the sidebar on the left to add or remove variables from the model. You can also toggle the graph to see each predictor's impact on the model.

What is Logistic Regression?

Unlike linear regression, which predicts a number, Logistic Regression predicts the probability of an event happening (like a large basal area amplitude).
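A minimal sketch of how a logistic model turns a weighted sum of predictors into a probability, shown for the two-class case. The intercept, coefficients, and feature values are made up for illustration; a multi-category model generalizes the same idea with one score per category.

```python
import math


def predicted_probability(intercept, coefficients, features):
    """Logistic (sigmoid) transform of a linear score into a probability."""
    score = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical coefficients for two standardized predictors.
p = predicted_probability(-1.0, [0.8, -0.5], [2.0, 1.0])
print(round(p, 3))
```

A positive coefficient raises the score, and therefore the probability, as its feature increases; a negative coefficient lowers it.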

Understanding the Outputs
  • Precision: When the model predicts 'Extreme Change', how often is it correct? (Quality).
  • Recall: Out of all actual 'Extreme Change' cases, how many did the model find? (Quantity).
  • Coefficients: Positive values (Green) increase the likelihood; Negative values (Red) decrease it.
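
The precision and recall definitions above can be checked by hand on a small example. The labels below are invented for illustration and are not model output.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one category, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = ["Extreme Change", "Extreme Change", "Extreme Change",
          "Moderate Change", "No/Little Change", "No/Little Change"]
y_pred = ["Extreme Change", "Moderate Change", "No/Little Change",
          "Extreme Change", "No/Little Change", "No/Little Change"]

precision, recall = precision_recall(y_true, y_pred, "Extreme Change")
print(precision, recall)
```

Here the model predicts 'Extreme Change' twice and is right once (precision 0.5), while it finds only one of the three actual 'Extreme Change' cases (recall of about 0.33).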
Feature Importance Graph

Overview

Principal Component Analysis (PCA) is a technique used to condense the feature space by creating new axes from the original features while preserving as much variation as possible. Each component captures a different pattern of variation in the data, with the first components explaining the most variance. Here, we use PCA to reduce our 8 numerical environmental and growth features to a more manageable number while retaining as much information as we can.
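The sketch below shows one common way to compute principal components: standardize the features, take a singular value decomposition, and read off the share of variance explained by each component. The 8-column matrix is synthetic, and the 95% cutoff is used only as an example threshold.

```python
import numpy as np


def pca(X):
    """PCA on standardized features via SVD.

    Returns component scores, component directions (rows), and the
    fraction of total variance explained by each component.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained_ratio = S**2 / np.sum(S**2)
    scores = Z @ Vt.T
    return scores, Vt, explained_ratio


rng = np.random.default_rng(1)
# Synthetic 200 x 8 matrix driven by 3 latent factors plus noise (illustrative).
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 8)) + 0.3 * rng.normal(size=(200, 8))

scores, components, ratio = pca(X)
cumulative = np.cumsum(ratio)
# Smallest number of components whose cumulative variance reaches 95%.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(np.round(ratio, 3), n_keep)
```

Plotting `ratio` against the component index gives a scree plot, and `components` holds the loadings that relate each component back to the original features.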

1. Feature Correlation Matrix
2. Scree Plot -- Variance explained by each additional component

Exploring the Principal Components

PCA Plot Settings
3. PCA Scatter Plot
4. Correlation Matrix (All Variables + PCs)
Understanding This Matrix

We use this larger correlation matrix to show how each principal component relates to the original features. This, alongside the loadings plot below, allows us to obtain a better understanding of what these components represent.

Loadings Plot Settings
5. PCA Loadings Biplot

Principal Component Regression

Principal Component Regression (PCR) combines PCA with linear regression to predict tree growth outcomes. Instead of using the original features directly, PCR builds a regression model on our principal components. This can improve interpretability and potentially provide new insights.
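A minimal sketch of PCR under simple assumptions: features are standardized, projected onto the first k principal components, and an ordinary least-squares model is fit on those scores. The data, the coefficients in `true_beta`, and the function names are hypothetical, and the R² values are in-sample rather than test-set scores.

```python
import numpy as np


def pcr_fit_predict(X, y, n_components):
    """Regress y on the first n_components principal component scores of X."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    design = np.column_stack([np.ones(len(y)), scores])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ beta


def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))  # synthetic stand-in for 8 numeric features
true_beta = np.array([1.5, -1.0, 0.5, 0.0, 0.0, 2.0, 0.0, -0.5])
y = X @ true_beta + rng.normal(scale=0.5, size=300)

# In-sample R^2 as more components are included.
r2_by_k = [r_squared(y, pcr_fit_predict(X, y, k)) for k in range(1, 9)]
print(np.round(r2_by_k, 3))
```

In-sample fit can only improve as components are added, so a held-out test set or cross-validation is the fairer way to choose how many components to keep.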

PCR Metrics

                
6. Model Performance vs Components
Select Principal Components

Choose which PCs to include in the regression:

7. PCR Predicted vs Actual
K-Nearest Neighbors (KNN) Classification

To explore the effects of different environmental conditions on tree amplitude, we used a K-Nearest Neighbors (KNN) algorithm to classify trees into different daily basal area change categories. The KNN model predicts the change category of a new data point based on the most common category among its nearest neighbors.
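The voting rule can be sketched in a few lines. The training points and labels below are invented, and the sketch assumes the features have already been scaled so that Euclidean distance is meaningful.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the most common label among the k training points nearest to query."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy training set with two scaled features per observation (illustrative).
train_X = [(0.2, 0.1), (0.3, 0.2), (0.25, 0.15),
           (1.0, 1.1), (1.1, 0.9), (0.9, 1.0),
           (2.0, 2.1), (2.1, 1.9)]
train_y = ["No/Little Change"] * 3 + ["Moderate Change"] * 3 + ["Extreme Change"] * 2

print(knn_predict(train_X, train_y, (1.0, 1.0), k=3))
print(knn_predict(train_X, train_y, (2.0, 2.0), k=3))
```

The choice of k matters: small values follow local noise, while large values smooth over real differences between categories.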

The preselected variables are the ones that yielded the best model performance. The graphs/figures show the following:

  • The best K value based on the model performance on the test set at various K values
  • The model performance metrics on the test set for the chosen ideal K value via:
    • Confusion Matrix
    • Precision and Recall by category
  • Scatter plot of test set based on PCA values with overlays for true value, predicted value, and correct/incorrect classification
  • Partial effects plot which shows how the model's predictions vary with respect to 2 chosen features while holding all other features at their mean
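
One way to build the grid behind a partial effects plot is sketched below: two chosen features are varied across their observed range while every other feature is fixed at its mean, and the model is queried at each grid point. The `toy_predict` rule and the three-feature rows are placeholders for illustration; only numeric features are handled.

```python
def partial_effects_grid(predict, rows, i, j, steps=5):
    """Predictions over a grid of features i and j, other features held at their mean."""
    n_features = len(rows[0])
    means = [sum(r[f] for r in rows) / len(rows) for f in range(n_features)]

    def axis(f):
        lo, hi = min(r[f] for r in rows), max(r[f] for r in rows)
        return [lo + (hi - lo) * s / (steps - 1) for s in range(steps)]

    xs, ys = axis(i), axis(j)
    grid = []
    for y in ys:
        row = []
        for x in xs:
            point = list(means)  # start from the mean of every feature
            point[i], point[j] = x, y  # overwrite only the two chosen features
            row.append(predict(point))
        grid.append(row)
    return xs, ys, grid


def toy_predict(point):
    # Placeholder classifier: stands in for a fitted model's predict function.
    humidity, soil_water, temperature = point
    return "Extreme Change" if humidity + soil_water > 1.0 else "No/Little Change"


rows = [(0.0, 0.0, 10.0), (1.0, 1.0, 20.0), (0.5, 0.5, 30.0)]
xs, ys, grid = partial_effects_grid(toy_predict, rows, 0, 1)
for row in grid:
    print(row)
```

Each row of `grid` corresponds to one value of the second feature, so the result can be drawn directly as a heatmap.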
Best K Value Selection
Model Performance
Metrics
Performance by Category
KNN Scatter Plot
Partial Effects Plots
K-Means Clustering

To understand how different environmental conditions group into clusters, which we call "forests", we ran a K-Means algorithm on both numeric and categorical variables.

We settled on three forests after examining the elbow plot below. Three and four forests have very similar silhouette scores, and three offers the best balance between structure and interpretability.
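
The values behind an elbow plot can be sketched as follows: run K-Means for several values of k and record the within-cluster sum of squares (WCSS) for each. This is a plain Lloyd's-algorithm sketch on synthetic numeric data with a few random restarts; it does not handle categorical variables, and all names are illustrative.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Basic Lloyd's algorithm. Returns labels, centers, and WCSS."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment against the final centers, then the within-cluster sum of squares.
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    wcss = float(np.sum((X - centers[labels]) ** 2))
    return labels, centers, wcss


rng = np.random.default_rng(3)
# Three synthetic, well-separated groups of 50 points each (illustrative).
blobs = [rng.normal(loc=c, scale=0.5, size=(50, 2))
         for c in [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]]
X = np.vstack(blobs)

wcss_by_k = {}
for k in range(1, 6):
    # Keep the best of a few random restarts to reduce the risk of a poor local optimum.
    wcss_by_k[k] = min(kmeans(X, k, seed=s)[2] for s in range(5))
print({k: round(v, 1) for k, v in wcss_by_k.items()})
```

The "elbow" is the value of k after which WCSS stops dropping sharply; silhouette scores can then be compared for the candidate values near that bend.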

Elbow Plot (Within-Cluster Sum of Squares)
Forests Visualizations
Forests Interpretation (3 Clusters)
Forest | Temp | Humidity | Solar | Soil Water | Pressure | Species | Site | Amplitude
1 |
2 |
3 |
Cluster Diagnostics
Silhouette & Cluster Sizes

                        
Change in Basal Area by Cluster
Tree Model Conclusion

This project models the impact of environmental factors on Tree Stem Amplitude (how much a tree stem expands and shrinks in a day). Tree Amplitude primarily comes from trees absorbing or losing water. As the climate changes (and the Arctic is particularly susceptible to warming), it is unknown how trees will react. Our model aims to clarify which factors affect tree amplitude and, therefore, how climate change may affect trees.

Logistic Regression

To address the research question of distinguishing between physiological regimes, we used a multinomial logistic regression classifier to predict categories of Basal Area Daily Amplitude, a choice well suited to isolating "Extreme Change" events from background noise.

The model achieved an accuracy of 47.7%, outperforming the random baseline (33%), though performance metrics indicate a stronger ability to identify stable periods ("No/Little Change" Recall: 0.61) compared to detecting high-amplitude events ("Extreme Change" Recall: 0.33).

Despite this identification gap, the model's coefficients were consistent with key biological assumptions, indicating that species identity and energy input are important determinants: Picea mariana and high average solar irradiance emerged as the strongest positive drivers of extreme daily amplitude, while Picea glauca served as a significant negative predictor associated with stability.

Principal Component Analysis and Regression

To condense the number of features in our analysis and potentially simplify the model, we performed principal component analysis. This technique led to the identification of 6 principal components that explain roughly 95% of the variance in the data, allowing us to shrink the number of features from 8 to 6.

While some components are more difficult to interpret (particularly the later ones), we do gain valuable insight from others. For example, PC1 and PC2 appear to be related to current tree stress, with PC1 being highly correlated with current stem radius and basal area, while PC2 is negatively correlated with these factors. PC4 seems to be a good general indicator of tree growth, since it is positively correlated with change in stem radius and basal area, as well as humidity and soil water content, which may indicate that additional expansion comes from water absorption.

When we take these components and apply them to a Principal Component Regression (PCR) model, they appear to work very well. As our scree plot demonstrated, the first four components are particularly useful in explaining the change in basal area. Overall, PCA has proven to be a useful tool.

Multiple Regression

Our regression results support the well-established hydraulic mechanism in which high VPD drives reversible trunk shrinkage, followed by recovery when VPD is low. Higher VPD increased daily stem amplitude, while greater soil moisture buffered this effect, consistent with Devine & Harrington (2011).

Our results show that Arctic tree stem dynamics respond strongly to VPD and soil water, not just temperature, supporting Jensen’s argument that moisture-related stress is a key but overlooked driver of Arctic tree physiology under climate change.

This is particularly compelling because several studies have found a strong relationship between shrinkage (tree water deficit, TWD) and hydraulic stress ($\psi$) that persists across all drought conditions until lethal dehydration (e.g., Ziegler et al., 2024). That is, large TWD (shrinkage) amplitudes are strongly linked to high hydraulic stress ($\psi$ approaching lethal levels): a large TWD means living tissues have lost a lot of water (low turgor) to supply transpiration, which signals water stress.

KNN Classification

To approach the question of predicting Basal Area Daily Amplitude from a set of environmental factors, we applied a K-Nearest Neighbors (KNN) classifier. Our model achieved an accuracy of 66%, outperforming the random baseline (33%) by a factor of 2. The model had good predictive performance across all 3 categories, with "Moderate Change" performing the worst yet still having a precision of 57.7% and a recall of 52.2%.

This was found using the following features: Air Pressure, Humidity, Temperature, Solar Radiation, Soil Moisture, Stem Radius, and Species. The addition of variables measuring time of year (month), site location, latitude, and longitude (among others) did not improve performance. Thus, a fairly accurate predictive model can be built using only environmental features while controlling in part for species and tree size. While this model cannot show exactly how influential each variable is, it does provide a useful baseline for further analysis and shows evidence that tree daily amplitude is strongly influenced by environmental factors.

However, we can look at some partial effects plots: heatmaps of two numeric features with an overlay for the predicted change, holding everything else at its mean. These show some of what is happening in the data, but there is a lot of uncertainty. For humidity and soil water content, the most extreme changes are predicted when conditions are either low humidity with dry soil or high humidity with wet soil. For soil water content and temperature, soil water content has a large deciding effect at lower temperatures but not at higher ones, where many values are predicted to be extreme. For many of the plots with solar irradiance, there is a middle band of sunlight where the prediction is largely moderate, with extreme and no/little change occurring at high or low solar irradiance.

K-Means Clustering

To observe how different environmental conditions group trees into distinct "forests," we applied K-Means clustering using both numeric and categorical variables. Based on the elbow plot and nearly identical silhouette scores for three and four clusters, we chose three clusters to keep the model interpretable while still capturing key structure in the data.

Forest 1 represents wetter, hotter, and sunnier sites with extreme changes in mean amplitude, whereas Forest 2 captures drier, colder, and shadier environments where mean amplitude is only moderate. Forest 3 is similar to Forest 1 but has higher humidity and air pressure that result in only moderate mean amplitude. This suggests that small shifts in climate variables can meaningfully change overall tree growth patterns.