# Model Evaluation

The **Eval** simulation mode enables you to **compare multiple larva models** against experimental reference datasets using statistical tests. This is essential for model validation and selection.

---

## Purpose

Use `Eval` mode to:

- ✅ **Validate models** against real experimental data
- ✅ **Select the best model** from multiple candidates
- ✅ **Fingerprint behavior** across 40+ metrics
- ✅ **Test hypotheses** with statistical rigor

For a comparison of simulation modes, see {doc}`../concepts/simulation_modes`.

---

## Quick Start

### CLI

```bash
larvaworld Eval -refID exploration.30controls --modelIDs explorer navigator
```

### Python

```python
from larvaworld.lib.sim import EvalRun

eval_run = EvalRun(
    refID='exploration.30controls',
    modelIDs=['explorer', 'navigator', 'forager'],
    N=3,                  # agents per model (use larger N for real runs)
    screen_kws={},        # headless
)
eval_run.simulate()
eval_run.plot_results(show=False)
```

---

## Workflow

### 1. Select Reference Dataset

**Reference datasets** are experimental recordings imported into Larvaworld.

**Available datasets**:

```python
from larvaworld.lib import reg

# List all reference datasets
ref_ids = reg.conf.Ref.confIDs
print(f"Available: {len(ref_ids)} reference datasets")

# Inspect a dataset
ref_conf = reg.conf.Ref.getID("exploration.30controls")
print(ref_conf)
```

**Loading a reference**:

```python
from larvaworld.lib import reg

ref_dataset = reg.loadRef(id="exploration.30controls", load=True)
print(f"Reference: {ref_dataset.config.refID}")
print(f"Agents: {len(ref_dataset.agent_ids)}")
print(f"Duration: {ref_dataset.config.duration} min")
```

For details on importing datasets, see {doc}`../data_pipeline/lab_formats_import`.

---

### 2. Select Models to Compare

**Predefined models**:

| Model ID         | Description                     |
| ---------------- | ------------------------------- |
| `'explorer'`     | Baseline exploration            |
| `'navigator'`    | Odor-guided navigation          |
| `'forager'`      | Feeding/foraging                |
| `'rover'`        | High-activity forager phenotype |
| `'sitter'`       | Low-activity forager phenotype  |
| `'max_forager'`  | Maximal feeding rate            |
| `'max_feeder'`   | Feeder-focused behavior         |
| `'RLnavigator'`  | RL-enhanced navigation          |
| `'OSNnavigator'` | OSN-based navigation            |

**Inspect models**:

```python
from larvaworld.lib import reg

# List all models
model_ids = reg.conf.Model.confIDs
print(f"Available: {len(model_ids)} models")

# Inspect model configuration
model_conf = reg.conf.Model.getID("explorer")
print(model_conf)
```

---

### 3. Run Evaluation

```python
from larvaworld.lib.sim import EvalRun

eval_run = EvalRun(
    refID='exploration.30controls',          # Reference dataset
    modelIDs=['explorer', 'navigator'],      # Models to compare
    N=3,                                     # Agents per model (increase for real)
    screen_kws={},                           # headless
)

# Run simulations
eval_run.simulate()
```

**What happens**:

1. Load reference dataset
2. For each model:
   - Run one simulation with `N` larvae
   - Compute 40+ behavioral metrics
3. Compare model distributions to reference using **Kolmogorov-Smirnov (KS) tests**

---

### 4. Access Results

```python
# Statistical comparison (endpoint metrics)
print(eval_run.error_dicts["pooled"]["end"])

# Statistical comparison (distribution metrics)
print(eval_run.error_dicts["pooled"]["step"])

# Raw datasets per model
for model_id, datasets in eval_run.model_datasets.items():
    print(f"{model_id}: {len(datasets)} runs")
```
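
**Ranking models**: To pick a best candidate from such a table, one simple option is to rank models by their mean D-statistic. The following is a minimal sketch, not part of Larvaworld. It assumes the results have been converted to a plain nested dict shaped `{model_id: {metric: D}}` (the actual type of `error_dicts["pooled"]["end"]` may differ). The helper `rank_models` and all numbers are hypothetical.

```python
from statistics import mean

# Hypothetical D-statistics, shaped {model_id: {metric: D}}.
# The numbers are invented for illustration only.
d_stats = {
    "explorer": {"v_mu": 0.12, "str_f": 0.30, "pau_t": 0.18},
    "navigator": {"v_mu": 0.25, "str_f": 0.41, "pau_t": 0.33},
}


def rank_models(d_stats):
    """Return (model_id, mean D) pairs sorted from best to worst match."""
    scores = {
        model: mean(list(metrics.values())) for model, metrics in d_stats.items()
    }
    # Lower D means a closer match to the reference, so sort ascending
    return sorted(scores.items(), key=lambda item: item[1])


ranked = rank_models(d_stats)
for model_id, score in ranked:
    print(f"{model_id}: mean D = {score:.3f}")
```

The mean weights every metric equally. Depending on the question, a median or a worst-case (maximum) D may be a more suitable aggregate.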

---

### 5. Visualize Results

#### Statistical Comparison Plots

```python
# Aggregate comparison plots
eval_run.plot_results()  # KS D-statistic heatmaps
```

**Generated plots**:

- **KS D-statistic heatmap**: Models × Metrics
- **Box plots**: Metric distributions per model
- **P-value summary**: Statistical significance

#### Model-Specific Visualizations

```python
# Individual model plots
eval_run.plot_models()  # Trajectories, distributions
```

**Generated plots per model**:

- **Trajectories**: Spatial paths
- **Angular distributions**: Orientation, turns
- **Spatial distributions**: Velocity, dispersal
- **Bout distributions**: Stride/turn/pause durations

---

## Evaluation Metrics

Larvaworld computes **40+ behavioral metrics** across three categories:

### Endpoint Metrics (Summary Statistics)

| Metric      | Description              | Unit  |
| ----------- | ------------------------ | ----- |
| **cum_dur** | Total duration           | s     |
| **cum_sd**  | Total distance           | m     |
| **v_mu**    | Mean linear velocity     | mm/s  |
| **a_mu**    | Mean linear acceleration | mm/s² |
| **av_mu**   | Mean angular velocity    | rad/s |
| **fov_mu**  | Mean forward velocity    | mm/s  |
| **pau_N**   | Number of pauses         | count |
| **str_N**   | Number of strides        | count |
| **run_N**   | Number of runs           | count |
| **str_f**   | Stride frequency         | Hz    |
| **run_t**   | Average run duration     | s     |
| **pau_t**   | Average pause duration   | s     |
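
Endpoint metrics reduce a larva's whole track to single numbers. The sketch below illustrates the idea on a toy track. It is not Larvaworld's implementation: the function `endpoint_summary`, its output keys, and the assumption of `(x, y)` positions in mm sampled at a fixed time step are all hypothetical.

```python
import math


def endpoint_summary(xy_mm, dt):
    """Summarize one track given (x, y) positions in mm sampled every dt seconds."""
    # Distance covered between each pair of consecutive samples
    steps = [math.dist(p, q) for p, q in zip(xy_mm, xy_mm[1:])]
    duration = dt * len(steps)
    distance = sum(steps)
    return {
        "duration_s": duration,
        "distance_mm": distance,
        "mean_speed_mm_s": distance / duration if duration > 0 else 0.0,
    }


# Toy track: two 5 mm steps followed by one stationary sample, at 1 Hz
track = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (6.0, 8.0)]
summary = endpoint_summary(track, dt=1.0)
print(summary)
```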

### Distribution Metrics (Time-Series)

| Metric         | Description                                |
| -------------- | ------------------------------------------ |
| **angular**    | Orientation, angular velocity/acceleration |
| **spatial**    | Linear velocity/acceleration distributions |
| **dispersal**  | Spatial spread over time                   |
| **tortuosity** | Path straightness (sliding windows)        |
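
As an illustration of a sliding-window metric, the sketch below computes tortuosity over fixed-size windows of a path. It assumes one common definition, `1 - net displacement / path length`, which may not match Larvaworld's exact formula. The function names and the toy path are hypothetical.

```python
import math


def tortuosity(xy):
    """1 - net displacement / path length; 0 for a perfectly straight path."""
    path_length = sum(math.dist(p, q) for p, q in zip(xy, xy[1:]))
    if path_length == 0:
        return 0.0  # stationary window: treated as straight by convention here
    return 1.0 - math.dist(xy[0], xy[-1]) / path_length


def sliding_tortuosity(xy, window):
    """Tortuosity of every run of `window` consecutive positions."""
    return [tortuosity(xy[i:i + window]) for i in range(len(xy) - window + 1)]


# Toy path: straight segment followed by a U-turn back to the start
path = [(0, 0), (1, 0), (2, 0), (1, 0), (0, 0)]
values = sliding_tortuosity(path, window=3)
print(values)
```

The window that contains the U-turn ends where it started, so its tortuosity is maximal, while the two straight windows score zero.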

### Bout Metrics (Event-Based)

| Metric              | Description                      |
| ------------------- | -------------------------------- |
| **stride_duration** | Distribution of stride durations |
| **turn_amplitude**  | Distribution of turn amplitudes  |
| **pause_duration**  | Distribution of pause durations  |
| **run_distance**    | Distribution of run distances    |
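
Bout metrics are built from detected events rather than from raw samples. The sketch below extracts pause durations from a speed trace with a simple threshold rule. This rule is an assumption made for illustration, not Larvaworld's bout-detection algorithm, and the function name, threshold, and data are hypothetical.

```python
def pause_durations(speeds, dt, threshold):
    """Durations (s) of consecutive runs of samples with speed below threshold."""
    durations = []
    run_length = 0
    for v in speeds:
        if v < threshold:
            run_length += 1
        elif run_length:
            # Speed rose above the threshold: close the current pause
            durations.append(run_length * dt)
            run_length = 0
    if run_length:  # close a pause that lasts until the end of the track
        durations.append(run_length * dt)
    return durations


# Toy speed trace in mm/s, sampled every 0.5 s (values invented)
speeds = [1.0, 0.9, 0.05, 0.02, 0.03, 1.1, 1.2, 0.01, 0.04, 0.9]
pauses = pause_durations(speeds, dt=0.5, threshold=0.1)
print(pauses)
```

The resulting list of durations is the kind of sample that a distribution comparison such as the KS test would then operate on.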

---

## Statistical Testing

### Kolmogorov-Smirnov (KS) Test

**Purpose**: Compare distributions between model and reference.

**Null Hypothesis**: Model and reference are drawn from the same distribution.

**KS D-Statistic**: Maximum difference between cumulative distributions.

- Formula: `D = max_x |F_model(x) - F_ref(x)|`
- Where `F_model(x)` is the cumulative distribution of the model and `F_ref(x)` that of the reference.
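
The formula can be made concrete with a few lines of standard-library Python. This is a minimal sketch of the two-sample D-statistic using empirical CDFs, not Larvaworld's own code. The function name `ks_d_statistic` and the sample values are hypothetical.

```python
from bisect import bisect_right


def ks_d_statistic(model_values, ref_values):
    """Two-sample KS D: the largest gap between the two empirical CDFs."""
    model_sorted = sorted(model_values)
    ref_sorted = sorted(ref_values)
    if not model_sorted or not ref_sorted:
        raise ValueError("both samples must be non-empty")

    d = 0.0
    # Empirical CDFs only change at observed values, so it is enough
    # to evaluate the gap at every observed value from either sample.
    for x in model_sorted + ref_sorted:
        f_model = bisect_right(model_sorted, x) / len(model_sorted)
        f_ref = bisect_right(ref_sorted, x) / len(ref_sorted)
        d = max(d, abs(f_model - f_ref))
    return d


# Toy samples (made up for illustration, not Larvaworld output)
ref = [0.8, 1.0, 1.1, 1.3, 1.5]
same_model = [0.8, 1.0, 1.1, 1.3, 1.5]
shifted_model = [2.0, 2.2, 2.4, 2.6, 2.8]

print(ks_d_statistic(same_model, ref))     # identical samples
print(ks_d_statistic(shifted_model, ref))  # non-overlapping samples
```

Identical samples give the smallest possible value and fully separated samples give the largest, which brackets the range that D can take.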

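**Interpretation** (rules of thumb, not formal significance thresholds):

- `D = 0`: Perfect match (identical empirical distributions)
- `D < 0.2`: Good match
- `0.2 ≤ D ≤ 0.5`: Intermediate match
- `D > 0.5`: Poor match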

**Computing KS tests manually**:

```python
from larvaworld.lib.process.evaluation import eval_fast

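# Compare a simulated dataset against the reference.
# `model_dataset` and `ref_dataset` must already be loaded, e.g.
# ref_dataset = reg.loadRef(id="exploration.30controls", load=True)
ks_results = eval_fast(
    datasets=[model_dataset],
    refDataset=ref_dataset,
    metric_definition="angular"  # or "spatial", "spatial+angular", "all"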
)

print(ks_results['end'])  # Endpoint metrics
print(ks_results['step'])  # Distribution metrics
```

---

## Example: Rover vs. Sitter Comparison

```python
from larvaworld.lib.sim import EvalRun

# Compare rover vs sitter models (short demo)
eval_run = EvalRun(
    refID='exploration.30controls',
    modelIDs=['rover', 'sitter'],
    N=3,
    screen_kws={},
)

eval_run.simulate()

# Plot comparison
eval_run.plot_results()

# Access D-statistics
ks_end = eval_run.error_dicts["pooled"]['end']
print("Endpoint KS D-statistics:")
for model, metrics in ks_end.items():
    print(f"\n{model}:")
    for metric, d_stat in metrics.items():
        if d_stat < 0.2:
            print(f"  {metric}: {d_stat:.3f} ✅ (good match)")
        elif d_stat > 0.5:
            print(f"  {metric}: {d_stat:.3f} ❌ (poor match)")
        else:
            print(f"  {metric}: {d_stat:.3f} ⚠️ (intermediate)")
```

---

## Custom Metric Selection

By default, Larvaworld auto-selects metrics based on experiment type. You can override this with the `metric_definition` argument:

```python
from larvaworld.lib.sim import EvalRun

eval_run = EvalRun(
    refID='exploration.30controls',
    modelIDs=['explorer'],
    duration=1.0,  # short demo
    N=5,
    screen_kws={},

    # Custom metric selection.
    # Options: "angular", "spatial", "spatial+angular", "all"
    metric_definition="angular",  # only angular metrics
)

eval_run.simulate()
```

---

## Parallelization

Currently, `EvalRun.simulate()` runs in a single process. To parallelize, launch several `EvalRun` instances through your own batching (e.g., shell `xargs` or a task runner) and combine the results manually.
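
The batching and merging steps can be sketched as plain helper functions. This is a hedged illustration only: `make_batches` and `merge_results` are hypothetical helpers, not Larvaworld API. It assumes each batch's results have been reduced to a nested dict `{model_id: {metric: D}}` and that every batch is evaluated against the same reference. The result values are invented.

```python
def make_batches(model_ids, batch_size):
    """Split model IDs into consecutive batches of at most batch_size."""
    return [
        model_ids[i:i + batch_size] for i in range(0, len(model_ids), batch_size)
    ]


def merge_results(batch_results):
    """Merge per-batch {model_id: {metric: D}} dicts into one dict."""
    merged = {}
    for result in batch_results:
        for model_id, metrics in result.items():
            if model_id in merged:
                raise ValueError(f"model {model_id!r} appears in more than one batch")
            merged[model_id] = dict(metrics)
    return merged


batches = make_batches(
    ["explorer", "navigator", "forager", "rover", "sitter"], batch_size=2
)
print(batches)

# Stand-in for the output of three separately launched evaluation runs
# (values invented; each real batch would run its own EvalRun).
fake_results = [
    {"explorer": {"v_mu": 0.12}, "navigator": {"v_mu": 0.25}},
    {"forager": {"v_mu": 0.31}, "rover": {"v_mu": 0.22}},
    {"sitter": {"v_mu": 0.40}},
]
merged = merge_results(fake_results)
print(sorted(merged))
```

Raising on a duplicated model ID is a deliberate choice here: silently overwriting one batch's result with another's would hide a batching mistake.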

---

## Saving Results

```python
# Save evaluation results
eval_run.store()

# Location: DATA/SimGroup/eval_runs/{refID}/
print(f"Saved to: {eval_run.dir}")

# Load later
from larvaworld.lib.sim import EvalRun
eval_run_loaded = EvalRun.load(path=eval_run.dir)
```

---

## Advanced: Custom Reference Data

You can use your own experimental data:

### Step 1: Import Dataset

```python
from larvaworld.lib import reg

lab = reg.gen.LabFormat(labID="Schleyer")
lab.import_dataset(
    parent_dir="exploration",
    merged=True,
    max_Nagents=30,
    min_duration_in_sec=60,
    id="my_experiment",
    refID="my_experiment",
    save_dataset=True,
)
```

For details, see {doc}`../data_pipeline/lab_formats_import`.

### Step 2: Process Dataset

```python
from larvaworld.lib import reg

dataset = reg.loadRef(id="my_experiment", load=True)

dataset.preprocess(filter_f=3.0)
dataset.process(proc_keys=["angular", "spatial"])
dataset.annotate(
    anot_keys=["bout_detection", "bout_distribution", "interference"]
)
```

### Step 3: Evaluate Against Custom Reference

```python
from larvaworld.lib.sim import EvalRun

eval_run = EvalRun(
    refID='my_experiment',
    modelIDs=['explorer', 'navigator'],
    duration=5.0,
)
eval_run.simulate()
```

---

## Related Documentation

- {doc}`../concepts/simulation_modes` - Simulation mode comparison
- {doc}`../data_pipeline/lab_formats_import` - Importing experimental data
- {doc}`../data_pipeline/data_processing` - Data processing pipeline
- {doc}`../data_pipeline/reference_datasets` - Reference dataset management
- {doc}`../tutorials/model_evaluation` - Step-by-step tutorial