Methane emission by beef cattle using Random Forest

This is an example of a predictive model for methane (CH₄) emissions in beef cattle using a novel synthetic data generation method that addresses the scarcity of large, real-world datasets. The core of the approach lies in a rank-based method to generate synthetic databases that preserve non-normal multivariate distributions and the Spearman correlation structure among variables. This enables the creation of biologically plausible data reflective of real-world constraints.

Data source: The model was built on 263 records from 63 published studies, using dietary and animal characteristics to predict methane emissions: body weight (BW), dry matter intake (DMI), crude protein (CP), neutral detergent fiber (NDF), acid detergent fiber (ADF), starch, ether extract (EE), and ash.

Synthetic data:

Step 1: Fit normal and best-fit non-normal distributions to each variable using R's gamlss package.
Step 2: Generate 20,000 synthetic records per distribution type (normal and non-normal).
Step 3: Impose the original Spearman correlation structure on the synthetic variables using a Cholesky decomposition followed by reranking.
Step 4: Clean data for biological plausibility (e.g., total nutrient % < 100, ADF < NDF).
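
The steps above can be sketched in a few lines of Python. This is only an illustration of the general rank-reordering idea (similar in spirit to the Iman-Conover procedure), not the code used in the paper, which relied on R and gamlss. The three marginal distributions, their parameters, and the target Spearman matrix are hypothetical placeholders. The conversion from Spearman to Pearson correlation for the normal scores (r = 2 sin(pi * rho / 6)) is also an assumption of this sketch, not something stated in the source.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L * L^T == a (a must be symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(a[i][i] - s) if i == j else (a[i][j] - s) / low[j][j]
    return low


def ranks(values):
    """0-based rank of each value (ties broken by position)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = rank
    return out


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mean = (len(x) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var


def impose_rank_correlation(columns, target_spearman, rng):
    """Reorder each column so that the rank correlations approximate the target.

    Only the order of values changes, so every marginal distribution is preserved.
    """
    n, k = len(columns[0]), len(columns)
    # Assumption of this sketch: map Spearman rho to the Pearson r of normal scores.
    pearson = [[1.0 if i == j else 2 * math.sin(math.pi * target_spearman[i][j] / 6)
                for j in range(k)] for i in range(k)]
    low = cholesky(pearson)
    indep = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(k)]
    reordered = []
    for i, col in enumerate(columns):
        # Correlated normal scores for variable i (row i of L times the independent scores).
        scores = [sum(low[i][j] * indep[j][r] for j in range(i + 1)) for r in range(n)]
        sorted_col = sorted(col)
        # Rerank: place the sorted synthetic values in the rank order of the scores.
        reordered.append([sorted_col[rank] for rank in ranks(scores)])
    return reordered


rng = random.Random(42)
n = 5000
# Steps 1-2: hypothetical marginals, sampled independently (placeholder parameters).
dmi = [rng.lognormvariate(2.1, 0.25) for _ in range(n)]  # right-skewed, kg/d
ndf = [rng.gauss(40, 8) for _ in range(n)]               # % of dry matter
adf = [rng.gauss(24, 6) for _ in range(n)]               # % of dry matter

# Step 3: hypothetical target Spearman matrix (order: DMI, NDF, ADF).
target = [[1.0, -0.3, -0.2],
          [-0.3, 1.0, 0.8],
          [-0.2, 0.8, 1.0]]
dmi_s, ndf_s, adf_s = impose_rank_correlation([dmi, ndf, adf], target, rng)

# Step 4: keep only biologically plausible records (here: 0 < ADF < NDF < 100).
records = list(zip(dmi_s, ndf_s, adf_s))
plausible = [rec for rec in records if 0 < rec[2] < rec[1] < 100]

print(f"Spearman NDF-ADF: {spearman(ndf_s, adf_s):.3f}")
print(f"Plausible records kept: {len(plausible)} of {n}")
```

Because the reranking only permutes values within each column, the fitted marginal distributions are left unchanged while the rank correlations move toward the target. The plausibility filter in Step 4 removes records afterwards, so the final correlations can shift slightly from the imposed ones.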

Prediction models: Two regression models were trained and compared:

Random Forest (RF): High accuracy (R² = 0.927), but performance dropped when applied to data from mismatched distributions.
Multiple Linear Regression (LM): Lower initial accuracy (R² ≈ 0.62) but more robust across distribution types.

Model evaluation:

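Statistical metrics: Mean bias (MB), mean square error of prediction (MSEP), root MSEP (RMSEP), concordance correlation coefficient (CCC), bias correction factor (Cb), and Akaike information criterion (AIC) were used to assess performance.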
RF showed better fit within distribution-matched datasets but was sensitive to mismatches (i.e., lacked generalizability).
LM was more stable and distribution-agnostic.
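
As a rough guide to how several of these metrics relate to one another, the sketch below computes MB, MSEP, RMSEP, CCC, and Cb from paired observed and predicted values. It assumes MB is defined as the mean of observed minus predicted values and uses Lin's concordance correlation coefficient with population (1/n) variances; the paper may use different conventions. AIC is omitted because it depends on the fitted model's likelihood. The function name and the example values are hypothetical.

```python
import math


def evaluate(observed, predicted):
    """Return agreement metrics for paired observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    # Population (1/n) variances and covariance, as in Lin's CCC.
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)) / n
    msep = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n
    r = cov / math.sqrt(var_o * var_p)  # Pearson correlation (precision)
    ccc = 2 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)
    return {
        "MB": mean_o - mean_p,      # assumed sign convention: observed minus predicted
        "MSEP": msep,
        "RMSEP": math.sqrt(msep),
        "r": r,
        "CCC": ccc,
        "Cb": ccc / r,              # bias correction factor (accuracy), CCC = r * Cb
    }


# Hypothetical CH4 values (g/d): predictions are uniformly 10 g/d too high.
observed = [150.0, 180.0, 210.0, 240.0, 270.0]
predicted = [160.0, 190.0, 220.0, 250.0, 280.0]
metrics = evaluate(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this example the predictions track the observations perfectly (r = 1) but are shifted by a constant, so MB is -10, RMSEP is 10, and CCC and Cb both fall to about 0.973. This shows how Cb isolates systematic bias from scatter.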

Feature importance in CH₄ prediction:

Most influential variables: DMI and EE, followed by ADF, NDF, and starch.
Fiber variables gained importance when non-normal distributions were assumed, highlighting the significance of correctly modeling data structure.

Reference:

Tedeschi, L. O. 2025. ASAS-NANP SYMPOSIUM: MATHEMATICAL MODELING IN ANIMAL NUTRITION: Synthetic Database Generation for Non-Normal Multivariate Distributions: A Rank-Based Method with Application to Ruminant Methane Emissions. J. Anim. Sci.:skaf136. doi: 10.1093/jas/skaf136

This synthetic modeling approach is significant for animal science applications where data availability is limited. It enables researchers to simulate, train, and evaluate AI models without compromising data privacy or quality, offering a promising path for future integration with deep learning and hybrid intelligent-mechanistic models.

We welcome your feedback!

If you find errors or have suggestions for improving the model's predictive accuracy, please don’t hesitate to contact us. We are also interested in hearing your thoughts about the tool and its usefulness in your work.