Crop Predictor

The Predictive product offers Crop Productivity and Quality Estimates.

This information is available in different formats depending on the needs of each entity.

Currently, HEMAV has an AI infrastructure that works for SUGAR CANE, CORN, SOYBEAN, COTTON, BEET, VINEYARD, and PALM crops.

The steps to access this service are as follows.

Input data to the model

The models should receive as much input data as possible. The input data is divided into different blocks, with a total of 140 possible variables; the most important ones are associated with each model according to the input data available.

Customer data (real data)

This is the most important block, because models reflect the quality of the data they are trained on. If there is not enough data, or if it does not meet certain quality standards, the trained models may not achieve the expected results.

This block contains the data that we want the model to learn (production, quality…), and it is the starting point for data management (campaign start date, planting and harvest time).

This section has 40 possible data points to fill in, with the most important being:

  • external_reference_id: Unique identifier per plot.

  • season_label: Management grouping.

  • latitude: Georeferencing of the plot.

  • longitude: Georeferencing of the plot.

  • type_id: Crop.

  • sub_type_id: Variety.

  • init_date: Campaign start date (start_date or plantation_date depending on the crop).

  • harvest_or_estimation_date: Harvest date or estimation date.

  • cut_number: Cut number for crops such as sugarcane or palm.

  • production_per_hectare_real: Actual production obtained. Key variable for estimating the PRODUCTION model.

  • prodph_lastseason: Variable calculated by us, evaluating for that plot the production obtained in the previous season.

  • atr_real: Real ATR obtained. Key variable for ATR model estimation.

  • atr_lastseason: Variable calculated by us, evaluating for that plot the ATR obtained in the previous season.

  • pol_real: Actual polarization obtained. Key variable for POL model estimation.

  • sac_real: Real sucrose obtained. Key variable for SAC model estimation.

  • sac_lastseason: Variable calculated by us, evaluating for that plot the sucrose obtained in the previous season.

  • n_bunches_real: No. of bunches. Key variable for the N_BUNCHES model estimation.

  • plants_per_hectare: Number of plants per hectare. Important variable.

  • irrigation_type: Irrigation system used in the plot.

  • days: Growing days.

  • week: Growing week.

All these variables determine the quality of the requested model. For this reason, a series of reviews is carried out, as described in the Data Review section.
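As an illustration of the kind of basic check a customer record can go through before it reaches the models, here is a minimal sketch. The field names come from the list above; which fields are treated as mandatory, the range checks, and the sample values are assumptions made for this example, not the actual HEMAV validation rules.

```python
from datetime import date

# Assumption: this subset is treated as mandatory for the sketch only.
REQUIRED_FIELDS = (
    "external_reference_id", "season_label", "latitude", "longitude",
    "type_id", "init_date", "harvest_or_estimation_date",
)


def check_record(record: dict) -> list[str]:
    """Return a list of human-readable problems found in one plot record."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS
                if record.get(name) in (None, "")]

    lat, lon = record.get("latitude"), record.get("longitude")
    if isinstance(lat, (int, float)) and not -90 <= lat <= 90:
        problems.append("latitude out of range")
    if isinstance(lon, (int, float)) and not -180 <= lon <= 180:
        problems.append("longitude out of range")

    start = record.get("init_date")
    end = record.get("harvest_or_estimation_date")
    if isinstance(start, date) and isinstance(end, date) and end <= start:
        problems.append("harvest_or_estimation_date is not after init_date")
    return problems


# Hypothetical record, for illustration only.
record = {
    "external_reference_id": "plot-001",
    "season_label": "2024-25",
    "latitude": -21.17,
    "longitude": -47.81,
    "type_id": "sugarcane",
    "init_date": date(2024, 4, 1),
    "harvest_or_estimation_date": date(2025, 5, 15),
    "cut_number": 2,
    "production_per_hectare_real": 78.5,
}

print(check_record(record))
```

An empty list means the record passed these basic checks; the automatic review described later looks at the data in much more depth.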

Spectral data + radar

For each date, at a weekly level, the plot-average value of the following spectral and radar variables is added to the model’s dataset:

  • cloudcoverage: % of plot cloud cover on the day of visit. Only applicable for spectral parameters.

  • sigma0: Radar variable.

  • sigma0_std: Standard deviation of the radar variable. Shows the uniformity of the plot.

  • ndre: Nitrogen - chlorophyll index.

  • ndvi: Vegetation index.

  • ndvi_std: Standard deviation of the vegetation index.

  • b1: Sentinel 2 spectral bands.

  • b2: Sentinel 2 spectral bands.

  • b3: Sentinel 2 spectral bands.

  • b4: Sentinel 2 spectral bands.

  • b5: Sentinel 2 spectral bands.

  • b6: Sentinel 2 spectral bands.

  • b7: Sentinel 2 spectral bands.

  • b8: Sentinel 2 spectral bands.

  • b8a: Sentinel 2 spectral bands.

  • b9: Sentinel 2 spectral bands.

  • b11: Sentinel 2 spectral bands.

  • b12: Sentinel 2 spectral bands.

  • tcari_osavi: Soil-adjusted vegetation index that removes soil influence.

  • gndvi: Green-normalized difference vegetation index.

  • ccci: Chlorophyll index.

  • ndwi: Water status index (NDMI).

  • tcari: Index related to chlorophyll absorption.

  • OSAVI: The OSAVI vegetation index is a modified SAVI that also uses near-infrared and red spectral reflectance.
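Several of the indices above are simple band ratios. The sketch below uses their commonly published definitions; the exact Sentinel 2 bands and constants used in the HEMAV pipeline are not stated in this document, so the band choices here (b8 as near-infrared, b4 as red, b3 as green, b5 as red edge, b11 as short-wave infrared) and the 0.16 OSAVI constant are assumptions, as are the sample reflectance values.

```python
def normalized_difference(a: float, b: float) -> float:
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


def spectral_indices(bands: dict) -> dict:
    """Compute a few vegetation indices from plot-average reflectances."""
    nir, red, green = bands["b8"], bands["b4"], bands["b3"]
    red_edge, swir = bands["b5"], bands["b11"]
    return {
        "ndvi": normalized_difference(nir, red),
        "gndvi": normalized_difference(nir, green),
        "ndre": normalized_difference(nir, red_edge),
        # The document notes that ndwi is the NDMI form (NIR vs. SWIR).
        "ndwi": normalized_difference(nir, swir),
        # OSAVI adds a fixed soil-adjustment term to the denominator.
        "osavi": (nir - red) / (nir + red + 0.16),
    }


# Hypothetical reflectances for one plot and date.
sample = {"b3": 0.08, "b4": 0.05, "b5": 0.12, "b8": 0.45, "b11": 0.20}
for name, value in spectral_indices(sample).items():
    print(f"{name}: {value:.3f}")
```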

Spectral + radar data (temporal)

These are combinations of the previous variables, processed to avoid cloud influence and to indicate temporality or sudden changes in the data. Because we work with data accumulated from the start of the campaign/planting, they help us extract important indicators.

  • ndvi_smoothed: NDVI after applying a smoothing function that removes the influence of clouds.

  • ndwi_smoothed: NDWI (NDMI) after applying a smoothing function that removes the influence of clouds.

  • ndvi_smoothed_temporal_max_diff: Maximum difference between weeks for the NDVI index.

  • ndwi_smoothed_temporal_max_diff: Maximum difference between weeks for the NDWI (NDMI) index.

  • ndvi_smoothed_max: Maximum NDVI value reached during the campaign.

  • ndwi_smoothed_max: Maximum NDWI value reached during the campaign.

  • ndvi_smoothed_temporal_mean_diff: Average difference between weeks for the NDVI index.

  • ndwi_smoothed_temporal_mean_diff: Mean difference between weeks for the NDWI (NDMI) index.

  • ndvi_std_temporal_max_diff: Maximum variability difference within the season.

  • sigma0_temporal_max_diff: Maximum difference of radar value within the campaign.

  • sigma0_max: Maximum radar value reached in the campaign.

  • sigma0_min: Minimum radar value reached in the campaign.

  • sigma0_temporal_mean_diff: Mean radar difference during the campaign.

  • sigma0_std_temporal_max_diff: Maximum difference of the radar standard deviation during the campaign.

  • ndvi_growth_first_month: Maximum NDVI reached in the first month of the campaign.

  • ndwi_growth_first_month: Maximum NDWI (NDMI) reached in the first month of the campaign.
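To make the temporal variables more concrete, the sketch below derives a smoothed series and three of the indicators from a weekly NDVI series. The actual smoothing function is not described in this document, so a centred rolling median is used as a stand-in, and taking absolute week-to-week differences is also an assumption.

```python
from statistics import mean, median


def smooth(values, window=3):
    """Centred rolling median; a stand-in for the real smoothing function."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


def temporal_features(values, prefix):
    """Season-level indicators from a weekly series (needs >= 2 weeks)."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return {
        f"{prefix}_max": max(values),
        f"{prefix}_temporal_max_diff": max(diffs),
        f"{prefix}_temporal_mean_diff": mean(diffs),
    }


# Hypothetical weekly NDVI; the 0.10 in week 3 mimics a cloud-affected value.
weekly_ndvi = [0.20, 0.35, 0.10, 0.55, 0.70, 0.72]
smoothed = smooth(weekly_ndvi)
print(smoothed)
print(temporal_features(smoothed, "ndvi_smoothed"))
```

In this example the cloud-affected dip no longer appears in the smoothed series, so it does not inflate the maximum week-to-week difference.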

Climatological data

Climatological data is very important in the model as it indicates what the crop has been exposed to during the campaign. The variables we measure are the following:

  • pres: Mean pressure (mb).

  • slp: Mean sea level pressure (mb).

  • wind_spd: Average wind speed (m/s by default).

  • wind_gust_spd: Wind gust speed (m/s).

  • max_wind_spd: 2-minute maximum wind speed (m/s).

  • wind_dir: Average wind direction (degrees).

  • max_wind_dir: Direction of maximum 2-minute wind gust (degrees).

  • max_wind_ts: Time of maximum wind gust UTC (Unix Timestamp).

  • temp: Average temperature (Celsius by default).

  • max_temp: Maximum temperature (Celsius by default).

  • min_temp: Minimum temperature (Celsius by default).

  • max_temp_ts: Daily maximum temperature time UTC (Unix Timestamp).

  • min_temp_ts: Daily minimum temperature time UTC (Unix Timestamp).

  • rh: Mean relative humidity (%).

  • dewpt: Average dew point (Celsius by default).

  • clouds: Average cloud cover [satellite-based] (%).

  • precip: Accumulated precipitation (mm by default).

  • precip_gpm: Accumulated precipitation [estimated by satellite/radar] (mm by default).

  • solar_rad: Average solar radiation (W/m^2).

  • t_solar_rad: Total solar radiation (W/m^2).

  • ghi: Average global horizontal solar irradiance (W/m^2).

  • t_ghi: Total daily global horizontal solar irradiance (W/m^2) [Clear sky]

  • max_ghi: Maximum value of global horizontal solar irradiance during the day (W/m^2) [Clear sky]

  • dni: Average direct normal solar irradiance (W/m^2) [Clear sky]

  • t_dni: Total direct normal solar irradiance for the day (W/m^2) [Clear sky]

  • max_dni: Maximum direct normal solar radiation value for the day (W/m^2) [Clear sky]

  • dhi: Average diffuse horizontal solar irradiance (W/m^2) [Clear sky]

  • t_dhi: Total daily diffuse horizontal solar irradiance (W/m^2) [Clear sky]

  • max_dhi: Maximum diffuse horizontal solar irradiance value during the day (W/m^2) [Clear sky]

  • max_uv: Maximum UV index (0-11+)

Agro-climatic

We incorporate agro-climatic data due to their importance in the field of agriculture.

  • bulk_soil_density: Bulk soil density (kg/m^3).

  • skin_temp_max: Maximum skin temperature (C).

  • skin_temp_avg: Average skin temperature (C).

  • skin_temp_min: Minimum skin temperature (C).

  • temp_2m_avg: Average temperature at 2 meters (C).

  • precip: Accumulated precipitation (mm).

  • specific_humidity: Mean specific humidity (kg/kg).

  • evapotranspiration: Reference evapotranspiration - ET0 (mm).

  • pres_avg: Average surface pressure (mb).

  • wind_10m_spd_avg: Average wind speed at 10 meters (m/s).

  • dlwrf_avg: Average hourly downward longwave solar radiation (W/m^2 · H).

  • dlwrf_max: Maximum hourly downward longwave solar radiation (W/m^2 · H).

  • dswrf_avg: Average hourly downward shortwave solar radiation (W/m^2 · H).

  • dswrf_max: Maximum hourly downward shortwave solar radiation (W/m^2 · H).

  • dlwrf_net: Net longwave solar radiation (W/m^2 · D)

  • dswrf_net: Net shortwave solar radiation (W/m^2 · D).

  • soilm_0_10cm: Average Soil moisture content 0 to 10 cm depth (mm).

  • soilm_10_40cm: Average Soil moisture content 10 to 40 cm depth (mm).

  • soilm_40_100cm: Average Soil moisture content 40 to 100 cm depth (mm).

  • soilm_100_200cm: Average Soil moisture content 100 to 200 cm depth (mm).

  • v_soilm_0_10cm: Average volumetric soil moisture content from 0 to 10 cm depth (fraction).

  • v_soilm_10_40cm: Average volumetric soil moisture content from 10 to 40 cm depth (fraction).

  • v_soilm_40_100cm: Average volumetric soil moisture content from 40 to 100 cm depth (fraction).

  • v_soilm_100_200cm: Average volumetric soil moisture content from 100 to 200 cm depth (fraction)

  • soilt_0_10cm: Average soil temperature at 0 to 10 cm depth (C).

  • soilt_10_40cm: Average soil temperature at 10 to 40 cm depth (C).

  • soilt_40_100cm: Average soil temperature at 40 to 100 cm depth (C).

  • soilt_100_200cm: Average soil temperature at 100 to 200 cm depth (C).

Processed climatological and agro-climatic data

  • gdd: Growing degree days accumulated with base temperature according to crop.

  • precip_temporal_max_diff: Maximum difference between weeks in precipitation during the campaign.

  • precip_max: Maximum precipitation value in a week.

  • evapotranspiration_max_diff: Maximum evapotranspiration difference between weeks during the campaign.

  • rh_max_diff: Maximum humidity difference between weeks during the campaign.

  • skin_temporal_max_min_diff: Maximum difference in maximum skin temperature between weeks during the campaign.

  • skin_temporal_min_min_diff: Maximum difference in minimum skin temperature between weeks during the campaign.

  • gdd_min_diff: GDD difference between weeks during the campaign.

  • precip_first_month: Maximum precipitation reached during the first month of the campaign.

  • rh_first_month: Maximum humidity reached during the first month of the campaign.

  • skin_temp_max_first_month: Maximum soil temperature reached during the first month of the season.

  • solar_rad_first_month: Maximum solar radiation reached during the first month of campaign.
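As an example of how a processed variable such as gdd can be obtained, the sketch below uses the standard averaging method for growing degree days. The base temperature of 10 °C and the sample temperatures are hypothetical; the document only states that the base temperature depends on the crop.

```python
def growing_degree_days(daily_min_max, base_temp):
    """Accumulate growing degree days from (min_temp, max_temp) pairs in Celsius.

    Each day contributes max(0, mean daily temperature - base temperature).
    """
    total = 0.0
    for t_min, t_max in daily_min_max:
        total += max(0.0, (t_min + t_max) / 2 - base_temp)
    return total


# Three hypothetical days; the second one is too cold to contribute.
days = [(12, 24), (8, 10), (15, 29)]
print(growing_degree_days(days, base_temp=10))
```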

Data Review

The first step is calculating statistics for all the variables explained in the previous section. Once calculated, an automatic review of them is performed.

Currently, a summary report of this data review is generated, which can detect:

  • Temporal outliers: Such as problems with planting/harvest dates or with predictor variables (anomalous estimations for a crop).

  • Global outliers: Detects any calculation issues across all explained variables.

All of this is represented in a summary report like the following:

REPORT

  1. Name of the report; it indicates that this is a preprocessing of the PROD variable.

  2. Gives us information about the USER ID (be careful not to confuse user_id with customer_id or agrouser_id).

  3. Shows the size of the dataframe.

  4. Shows number of fields and seasons available for that client.

  5. Number of actual data points for training (very important).

  6. Number of data points that could be considered outliers.

  7. Number of seasons with incorrect planting dates. Currently this is recorded if the NDVI in the first 30 days is greater than 0.4 (for “sugarcane”, “beetroot”, “soybean”, “corn”, “cotton”). (File attached in S3).

  8. Number of records where the harvest date is considered incorrect (File attached in S3) taking into account the planting date (e.g., seasons that are too long or too short).

  9. Number of real production data points that are considered outlier candidates, using two standard deviations above the moving average of the data on the days axis. (File attached in S3).

  10. Graph showing the number of columns with missing data and % of missing data per variable.

  11. NDVI evolution graph by cultivation days for season_ids with errors.

  12. Evolution of production over time series. The upper graph shows the amount of data with actual production (green) compared to data without reported actual production (red). The lower graph shows the evolution of reported actual production.

  13. Yield by cultivation days with limit axes to detect outliers based on time series data using population mean and two-thirds (2/3) of standard deviation over the moving average.

  14. NDVI evolution by cultivation days smoothed by plant day-cycle.

  15. Isolation forest: Shows the distribution of scores in a histogram given by the algorithm, and the number of observations considered outliers. (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html)

Outlier candidates flag values that possibly should not be introduced into the model; the actual filtering is performed in the next section.
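Two of the checks listed in the report can be sketched in a few lines. The NDVI threshold (0.4), the 30-day window, and the list of crops come from item 7; the moving-average window size, the use of the overall standard deviation, and the sample data are assumptions made for this illustration.

```python
from statistics import mean, pstdev

# Crops for which the planting-date check applies (item 7 of the report).
CHECKED_CROPS = {"sugarcane", "beetroot", "soybean", "corn", "cotton"}


def suspicious_planting_date(ndvi_by_day, crop, threshold=0.4, window_days=30):
    """ndvi_by_day: list of (days_since_planting, ndvi) pairs."""
    if crop not in CHECKED_CROPS:
        return False
    return any(day <= window_days and ndvi > threshold
               for day, ndvi in ndvi_by_day)


def production_outliers(points, window=5, n_std=2.0):
    """points: list of (days, production). Flag values above the moving
    average (along the days axis) plus n_std standard deviations."""
    ordered = sorted(points)
    values = [production for _, production in ordered]
    spread = pstdev(values)
    half = window // 2
    flagged = []
    for i, (days, value) in enumerate(ordered):
        neighbours = values[max(0, i - half): i + half + 1]
        if value > mean(neighbours) + n_std * spread:
            flagged.append((days, value))
    return flagged


# Hypothetical data: one production value is far above its neighbours.
points = list(zip(range(290, 390, 10),
                  [70, 72, 71, 73, 200, 74, 72, 75, 73, 71]))
print(suspicious_planting_date([(5, 0.15), (20, 0.55)], "sugarcane"))
print(production_outliers(points))
```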

Model generation

Model generation is also performed automatically.

Currently, three models are generated automatically and simultaneously, using hyperparameter tuning techniques to obtain the best model for the provided data.

After automatic training, a summary report of results evaluation is also generated:

MODEL REPORT

  1. Report title.

  2. Indicates the user of the model.

  3. Data that finally enters the training.

  4. Outliers detected with isolation forest.

  5. Number of outliers in preprocessing, as in “faulty_seasons_dict”.

  6. Number of variables that entered the model.

  7. Training date.

  8. The outlier_threshold that was set in the config.

  9. Model user.

  10. Model type.

  11. Model training.

  12. Model type with the best accuracy (RFR -> Random Forest Regressor (helps prevent overfitting); GBR -> Gradient Boosting Regressor (intermediate, but overfits more); XGBR -> eXtreme Gradient Boosting Regressor (overfits the most, but usually gives the best results)).

  13. Summary table of accuracies.

  14. Graph to check actual vs predicted data distinguishing between train and test.

  15. Evolution of actual production over time.

  16. Most important variables of the trained model.

  17. Distribution of errors in train and test. If train and test are very different, then you should review what happened, and why the test data does not represent the training data.

  18. Error distribution over time.

In addition to this summary report, you can ask your KAM/CPM for further graphs for more in-depth data analysis, available in our production model manager (MLflow).
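The comparison behind the summary table of accuracies can be pictured as follows. This is only a sketch: the metrics actually reported are not specified in this document, so mean absolute error and R² are assumed here, and the predictions for the three model types are invented for illustration.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r2(actual, predicted):
    """Coefficient of determination."""
    avg = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - avg) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical test-set values (t/ha) and predictions per model type.
actual = [80, 90, 100]
candidates = {
    "RFR": [82, 88, 101],
    "GBR": [85, 95, 95],
    "XGBR": [80, 91, 99],
}

for name, predicted in candidates.items():
    print(f"{name}: MAE={mae(actual, predicted):.2f}  R2={r2(actual, predicted):.3f}")

# Assumption: the model with the lowest test error is kept.
best = min(candidates, key=lambda name: mae(actual, candidates[name]))
print("best:", best)
```

Computing the same metrics on the training set and comparing them with the test values gives a quick view of the train/test gap mentioned in item 17.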

Prediction Generation

Once the data has been reviewed and the model trained, we use the generated model to make predictions during the campaign.

Predictions are updated on a weekly basis. To validate the results internally, we also generate reports to ensure that everything is running correctly and that no strong anomalies are present.

FORECASTECH REPORT

  1. Report title.

  2. Indicates the user for whom the prediction was made.

  3. Model type.

  4. Whether any season_label filter has been used.

  5. Whether the update was applied only to future dates or to all dates.

  6. Prediction date.

  7. Number of predictions generated (for each season_id there are multiple dates, hence the large number).

  8. Plots that should have their planting date reviewed.

  9. Summary table. Shows the distributions of the data used for training and the predicted data. It is normal that there will always be some deviation between the training mean and the forecast, since the training only takes the harvest date and the forecast takes everything.

  10. Graph reused from preprocessing to detect planting problems.

  11. Distribution graph of the most important variables in the model, to observe the distribution of the model and the data to be predicted

  12. Data frequency according to the value to predict, segmented by trained and predicted

  13. Evolution of estimates over time, along with the data used to train the model (if there is data from the last year used to train it).

  14. Variables with greater importance, showing how their values affect the prediction. The vertical line indicates the expected value (E(X)). Each observation has a value for each variable. The graph above shows how high values of that variable (in red) modify the observation with respect to the expected value. For example, the most important variable shows that red values increase the prediction value relative to the expected value or mean. Conversely, blue colors represent low values of that variable, and in the most important variable, you can see how low values (blue) make the prediction lower.

Crop Predictor API

Crop Predictor provides a REST API for querying the predictions generated by the models programmatically. The interactive documentation (Swagger UI) is available at:

Authentication

All API endpoints require an API key, passed as a query parameter in the URL:

?api_key=YOUR_API_KEY

This API key is the same one used for any interaction with the Layers platform APIs. To obtain it, contact your KAM/CPM or use the /publicapi/getApiKey endpoint of the main Layers API.

If the API key is not provided, the API returns a 401 Not authenticated error. If the API key is invalid or is not found in the system, it returns a 403 API key not found error.
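A minimal sketch of how a client might attach the API key and interpret the two authentication errors is shown below. The host is the one used in the request examples of this document; the helper names build_url and explain_status are hypothetical and are not part of any official client.

```python
from urllib.parse import urlencode

# Host taken from the request examples in this document.
BASE_URL = "https://agropred.layers.hemav.com"


def build_url(path: str, api_key: str) -> str:
    """Append the api_key query parameter, URL-encoding it if needed."""
    return f"{BASE_URL}{path}?{urlencode({'api_key': api_key})}"


def explain_status(status_code: int) -> str:
    """Translate the documented authentication errors into a short hint."""
    if status_code == 401:
        return "API key not provided: add ?api_key=... to the URL"
    if status_code == 403:
        return "API key not found or lacking permissions: check it with your KAM/CPM"
    return "no authentication problem reported"


print(build_url("/forecast", "YOUR_API_KEY"))
print(explain_status(401))
```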

Roles and permissions

Access to data is controlled by the role associated with the API key:

| Role | Access |
| --- | --- |
| Admin | Access to all fields and users |
| Agrouser | Fields of the assigned customers and cooperatives |
| Cooperative | Fields of the member customers |
| Customer (Farmer) | Only their own fields |

If fields are requested that the API key does not have access to, the API returns a 403 error.


Endpoints

Health Check

| Method | Endpoint | Description |
| --- | --- | --- |
| GET / HEAD | / | Service availability check |


POST /forecast

Main endpoint for retrieving the productivity and quality predictions generated by the models.

Query parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| api_key | string | | Authentication API key |

Request body (JSON):

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| field_reference | list of strings | | List of the customer's own identifiers for their fields (external reference). Each customer defines their own references to identify their plots |
| from_date | string (YYYY-MM-DD) | | Start date of the query range |
| to_date | string (YYYY-MM-DD) | | End date of the query range |

Example request:

curl -X 'POST' \
  'https://agropred.layers.hemav.com/forecast?api_key=YOUR_API_KEY' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "field_reference": [
    "10001_500123"
  ],
  "from_date": "2025-01-01",
  "to_date": "2025-12-31"
}'
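The same request body can be assembled in Python before sending it. The sketch below only builds and validates the payload; the helper name forecast_payload and the validation rules are assumptions for illustration, while the three field names and the YYYY-MM-DD format come from the table above.

```python
import json
from datetime import date


def forecast_payload(field_references, from_date: date, to_date: date) -> dict:
    """Build the JSON body for POST /forecast."""
    if not field_references:
        raise ValueError("at least one field reference is needed")
    if to_date < from_date:
        raise ValueError("to_date must not be earlier than from_date")
    return {
        "field_reference": list(field_references),
        # date.isoformat() yields the YYYY-MM-DD format expected by the API.
        "from_date": from_date.isoformat(),
        "to_date": to_date.isoformat(),
    }


payload = forecast_payload(["10001_500123"], date(2025, 1, 1), date(2025, 12, 31))
print(json.dumps(payload, indent=2))
```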

Example of a successful response (200):

The response is a JSON object in which each key is a requested field_reference and each value is a list of weekly prediction records.

{
  "10001_500123": [
    {
      "date": "2025-01-05",
      "season_id": "3012345",
      "season_label": "SAFRA 2024-25",
      "user_id": "10001",
      "crop_type": "sugarcane",
      "field_id": 500123,
      "forecasts_production_per_hectare_real": 78.54,
      "forecasts_atr_real": 132.17,
      "update_date_production_per_hectare_real": "2025-04-10",
      "update_date_atr_real": "2025-04-10",
      "model.metadata.run_id_production_per_hectare_real": "abc123def456",
      "model.metadata.run_id_atr_real": "789ghi012jkl"
    },
    {
      "date": "2025-01-12",
      "season_id": "3012345",
      "season_label": "SAFRA 2024-25",
      "user_id": "10001",
      "crop_type": "sugarcane",
      "field_id": 500123,
      "forecasts_production_per_hectare_real": 79.12,
      "forecasts_atr_real": 131.85,
      "update_date_production_per_hectare_real": "2025-04-10",
      "update_date_atr_real": "2025-04-10",
      "model.metadata.run_id_production_per_hectare_real": "abc123def456",
      "model.metadata.run_id_atr_real": "789ghi012jkl"
    }
  ]
}

Response field descriptions:

| Field | Type | Description |
| --- | --- | --- |
| date | string | Prediction date (weekly resolution) |
| season_id | string | Campaign/season identifier |
| season_label | string | Human-readable campaign label (e.g. “SAFRA 2024-25”) |
| user_id | string | Identifier of the owning user |
| crop_type | string | Crop type (e.g. “sugarcane”, “corn”, “soybean”) |
| field_id | integer | Numeric identifier of the plot |
| forecasts_production_per_hectare_real | float | Predicted production per hectare (t/ha) |
| forecasts_atr_real | float | Predicted ATR (Total Recoverable Sugars, kg/t) |
| update_date_production_per_hectare_real | string | Date of the last update of the production model |
| update_date_atr_real | string | Date of the last update of the ATR model |
| model.metadata.run_id_production_per_hectare_real | string | Run identifier of the production model (MLflow) |
| model.metadata.run_id_atr_real | string | Run identifier of the ATR model (MLflow) |

Note

The available prediction fields depend on the crop and on the models trained for each customer. The most common fields are forecasts_production_per_hectare_real (production) and forecasts_atr_real (ATR), but others such as forecasts_pol_real, forecasts_sac_real, or forecasts_n_bunches_real may be included depending on the model configuration.

Note

The update_date_* and model.metadata.run_id_* values may be empty ("") if the prediction has not yet been generated for that specific model or date.
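Because the set of forecasts_* fields varies by crop and model, client code is safer when it reads them defensively. The sketch below picks, for each field reference, the most recent weekly record that carries a value for a given target. The function name latest_forecasts is hypothetical, and the sample body is a shortened version of the response example above.

```python
def latest_forecasts(response_body: dict, target: str = "production_per_hectare_real"):
    """Return {field_reference: (date, value)} for the newest record with a value."""
    key = f"forecasts_{target}"
    result = {}
    for field_ref, records in response_body.items():
        # Skip records where this model produced no prediction.
        valid = [r for r in records if r.get(key) is not None]
        if valid:
            # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
            newest = max(valid, key=lambda r: r["date"])
            result[field_ref] = (newest["date"], newest[key])
    return result


body = {
    "10001_500123": [
        {"date": "2025-01-05", "forecasts_production_per_hectare_real": 78.54,
         "forecasts_atr_real": 132.17},
        {"date": "2025-01-12", "forecasts_production_per_hectare_real": 79.12,
         "forecasts_atr_real": 131.85},
    ]
}

print(latest_forecasts(body))
print(latest_forecasts(body, target="atr_real"))
print(latest_forecasts(body, target="pol_real"))
```

A target that is not present for a customer, such as pol_real in this sample, simply yields an empty result instead of raising an error.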

Error codes:

| Code | Description |
| --- | --- |
| 401 | API key not provided |
| 403 | API key invalid or lacking permissions for the requested fields |


POST /s3/file_paths

Endpoint for retrieving the paths of the files generated by the system (preprocessing, training, and prediction reports) stored in S3.

Query parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| api_key | string | | Authentication API key |

Request body (JSON):

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| user_id | integer | | Identifier of the user whose files are to be queried |

Example request:

curl -X 'POST' \
  'https://agropred.layers.hemav.com/s3/file_paths?api_key=YOUR_API_KEY' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "user_id": 10001
}'

Example of a successful response (200):

[
  "preprocessing/10001/report_prod_2025-03.pdf",
  "training/10001/model_report_prod_2025-03.pdf",
  "forecast/10001/forecast_report_prod_2025-04.pdf"
]

Error codes:

| Code | Description |
| --- | --- |
| 401 | API key not provided, or lacking permissions for the requested user |
| 403 | API key invalid |
| 404 | No files available for the user |
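Once the list of paths has been retrieved, it can be organised by report type. The sketch below groups paths by their first segment; that layout is inferred only from the response example above and is an assumption, and the function name group_by_stage is hypothetical.

```python
from collections import defaultdict


def group_by_stage(paths):
    """Group S3 paths by their first segment (e.g. preprocessing, training)."""
    grouped = defaultdict(list)
    for path in paths:
        stage, _, _ = path.partition("/")
        grouped[stage].append(path)
    return dict(grouped)


# Paths copied from the response example in this section.
paths = [
    "preprocessing/10001/report_prod_2025-03.pdf",
    "training/10001/model_report_prod_2025-03.pdf",
    "forecast/10001/forecast_report_prod_2025-04.pdf",
]

for stage, files in group_by_stage(paths).items():
    print(stage, files)
```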


Complete usage example

The following is a complete example of integrating with the API using Python:

import requests

API_URL = "https://agropred.layers.hemav.com"
API_KEY = "YOUR_API_KEY"

# Request the predictions for one plot
response = requests.post(
    f"{API_URL}/forecast",
    params={"api_key": API_KEY},
    json={
        "field_reference": ["10001_500123"],
        "from_date": "2025-01-01",
        "to_date": "2025-12-31",
    },
    timeout=30,  # avoid waiting indefinitely if the service does not respond
)

if response.status_code == 200:
    # The body maps each field_reference to its list of weekly records
    data = response.json()
    for field_ref, forecasts in data.items():
        print(f"Plot: {field_ref}")
        for forecast in forecasts:
            print(f"  Date: {forecast['date']}")
            print(f"  Production: {forecast.get('forecasts_production_per_hectare_real', 'N/A')} t/ha")
            print(f"  ATR: {forecast.get('forecasts_atr_real', 'N/A')} kg/t")
else:
    # 401 and 403 are the documented authentication and permission errors
    print(f"Error {response.status_code}: {response.text}")

Tip

You can query multiple plots in a single request by passing several references in the field_reference array. The API processes the queries in parallel to optimize response time.
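When many plots need to be queried, the references can be split into batches so that each request stays a manageable size. The document does not state a maximum number of references per request, so the batch size below is purely hypothetical, as are the helper name chunked and the sample references.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical references; each batch would become one POST /forecast body.
references = ["ref_a", "ref_b", "ref_c", "ref_d", "ref_e"]
for batch in chunked(references, 2):
    body = {
        "field_reference": batch,
        "from_date": "2025-01-01",
        "to_date": "2025-12-31",
    }
    print(body)
```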