How It Works
ARISE sits between your agent and its tool library. Every time your agent runs a task, ARISE records what happened, evaluates how well it went, and — when enough failures accumulate — synthesizes new tools to fill the gaps.
The Evolution Loop
The 5 Steps in Detail
1. Observe
Every call to arise.run(task) produces a Trajectory: a record of the task, every tool call the agent made (with inputs, outputs, and errors), the final outcome, and the reward score.
Trajectories are stored locally in SQLite (or sent to SQS in distributed mode). ARISE keeps the most recent max_trajectories records (default: 1,000).
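As a rough illustration of what such a record and its bounded history might look like, here is a minimal sketch. The names `ToolCall`, `Trajectory`, and `TrajectoryStore` and all field names are assumptions made for this example, not ARISE's actual schema, and an in-memory deque stands in for the SQLite store.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    # Hypothetical shape of one recorded tool invocation.
    name: str
    inputs: dict[str, Any]
    output: Any = None
    error: Optional[str] = None


@dataclass
class Trajectory:
    # Hypothetical shape of one recorded episode.
    task: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    outcome: str = ""
    reward: float = 0.0


class TrajectoryStore:
    """In-memory stand-in for the store: keeps only the newest records."""

    def __init__(self, max_trajectories: int = 1000) -> None:
        # A deque with maxlen drops the oldest item once it is full.
        self._records: deque[Trajectory] = deque(maxlen=max_trajectories)

    def add(self, trajectory: Trajectory) -> None:
        self._records.append(trajectory)

    def recent(self, n: int) -> list[Trajectory]:
        return list(self._records)[-n:] if n > 0 else []

    def __len__(self) -> int:
        return len(self._records)


# With a cap of 3, adding 5 trajectories keeps only the last 3.
store = TrajectoryStore(max_trajectories=3)
for i in range(5):
    store.add(Trajectory(task=f"task-{i}", reward=0.2 * i))
```

The cap bounds storage while still leaving enough recent history for the scoring step to work with.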
2. Score
After each episode, the reward_fn you provide evaluates the trajectory and returns a float in [0.0, 1.0]. Scores below 0.5 are counted as failures. ARISE watches two conditions:
- Failure threshold: if the last failure_threshold episodes (default: 5) are all failures, evolution triggers.
- Plateau detection: if success rate hasn’t improved by plateau_min_improvement (default: 5%) over the last plateau_window (default: 10) episodes, evolution triggers even without a failure streak.
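The two trigger conditions can be sketched as plain functions over a list of reward scores. This is one possible reading, not ARISE's implementation: the function names are hypothetical, and the plateau check assumes that "improvement" means the success rate of the latest window minus that of the window before it.

```python
FAILURE_SCORE = 0.5  # from the text: scores below 0.5 count as failures


def is_failure(score: float) -> bool:
    return score < FAILURE_SCORE


def success_rate(scores: list[float]) -> float:
    return sum(not is_failure(s) for s in scores) / len(scores)


def failure_streak(scores: list[float], failure_threshold: int = 5) -> bool:
    """True if the last `failure_threshold` episodes all failed."""
    if len(scores) < failure_threshold:
        return False
    return all(is_failure(s) for s in scores[-failure_threshold:])


def plateaued(
    scores: list[float],
    plateau_window: int = 10,
    plateau_min_improvement: float = 0.05,
) -> bool:
    """Assumed reading: compare the latest window with the one before it."""
    if len(scores) < 2 * plateau_window:
        return False  # not enough history to compare two windows
    previous = scores[-2 * plateau_window:-plateau_window]
    latest = scores[-plateau_window:]
    improvement = success_rate(latest) - success_rate(previous)
    return improvement < plateau_min_improvement


def should_evolve(scores: list[float]) -> bool:
    return failure_streak(scores) or plateaued(scores)
```

Under this reading, a run of five low scores triggers evolution immediately, while a success rate stuck at 50% across two consecutive ten-episode windows triggers it through the plateau rule.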
3. Detect
When evolution triggers, ARISE sends the recent failure trajectories to an LLM (the cheap model you set in config, not your agent’s model). The LLM analyzes:
- What tasks failed
- What errors appeared in tool calls
- What tools the agent tried to call that didn’t exist
- What the agent said it needed but couldn’t do
The output is a list of GapAnalysis objects — each with a description, evidence, a suggested function name, and a suggested signature.
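To make the shape of that output concrete, here is a small sketch of a gap record and a parser for it. The field names follow the prose above but are guesses at the real attribute names, and the assumption that the LLM replies with a JSON array is made only for this example.

```python
import json
from dataclasses import dataclass


@dataclass
class GapAnalysis:
    # Field names mirror the description above; the real class may differ.
    description: str
    evidence: list[str]
    suggested_name: str
    suggested_signature: str


def parse_gaps(llm_response: str) -> list[GapAnalysis]:
    """Parse a JSON array of gap objects (the JSON format is an assumption)."""
    return [GapAnalysis(**item) for item in json.loads(llm_response)]


# An invented response for illustration only.
example_response = json.dumps([
    {
        "description": "Agent cannot parse CSV files",
        "evidence": [
            "tool call to a non-existent parse_csv",
            "agent said it needed a CSV reader",
        ],
        "suggested_name": "parse_csv",
        "suggested_signature": "def parse_csv(text: str) -> list[dict[str, str]]",
    }
])
gaps = parse_gaps(example_response)
```

Each gap then becomes one unit of work for the synthesis step.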
4. Synthesize
For each detected gap, ARISE:
- Checks the registry (if registry_check_before_synthesis=True): if a proven skill already exists there, ARISE pulls it instead of calling the LLM.
- Calls the LLM to write a Python function implementing the tool, along with a test suite.
- Runs the tests in a sandbox (subprocess or Docker). If tests fail, ARISE refines and retries up to max_refinement_attempts times.
- Runs adversarial validation: a separate LLM call specifically tries to break the tool with edge cases, type boundaries, and security-probing inputs.
- If adversarial validation fails, ARISE refines again and re-tests.
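The generate, test, and refine cycle in the steps above can be sketched as a bounded loop. Everything here is a stand-in: the three callables are hypothetical hooks for the LLM and sandbox, and the toy `exec` call is not a sandbox. Only the `max_refinement_attempts` name comes from the text.

```python
from typing import Callable, Optional


def synthesize_with_refinement(
    generate: Callable[[], str],
    passes_tests: Callable[[str], bool],
    refine: Callable[[str], str],
    max_refinement_attempts: int,
) -> Optional[str]:
    """Return source code that passes its tests, or None if attempts run out."""
    code = generate()
    for attempt in range(max_refinement_attempts + 1):
        if passes_tests(code):
            return code
        if attempt < max_refinement_attempts:
            code = refine(code)
    return None


# Toy stand-ins for the LLM and the sandbox, for demonstration only.
def generate() -> str:
    return "def add(a, b):\n    return a - b\n"  # deliberately buggy first draft


def passes_tests(code: str) -> bool:
    namespace: dict = {}
    exec(code, namespace)  # demo only; real tests would run in isolation
    return namespace["add"](2, 3) == 5


def refine(code: str) -> str:
    return code.replace("a - b", "a + b")


result = synthesize_with_refinement(
    generate, passes_tests, refine, max_refinement_attempts=2
)
```

Capping the number of refinements keeps a stubborn gap from consuming unbounded LLM calls. A gap that never passes simply yields no skill.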
For existing skills that are failing on specific inputs, ARISE instead runs a patch — a minimal targeted fix — and starts an A/B test between the original and the patched version.
Synthesis runs in parallel (up to max_synthesis_workers=3 concurrent threads).
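A bounded worker pool is one straightforward way to get this behavior. The sketch below uses the standard library's `ThreadPoolExecutor`; `synthesize_gap` is a hypothetical placeholder for the per-gap pipeline, and whether ARISE uses this exact mechanism is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def synthesize_gap(gap_name: str) -> str:
    # Placeholder for the per-gap pipeline (LLM call, tests, validation).
    return f"skill_for_{gap_name}"


def synthesize_all(gap_names: list[str], max_synthesis_workers: int = 3) -> list[str]:
    with ThreadPoolExecutor(max_workers=max_synthesis_workers) as pool:
        # map() returns results in input order even though work overlaps.
        return list(pool.map(synthesize_gap, gap_names))


# Four gaps, at most three being synthesized at any moment.
skills = synthesize_all(["parse_csv", "fetch_url", "resize_image", "diff_text"])
```

Threads suit this workload because most of the time is spent waiting on LLM responses and subprocesses rather than running Python code.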
5. Promote
A skill that passes both the sandbox tests and adversarial validation is marked ACTIVE and added to the tool library. On the next arise.run() call, the agent has access to the new tool.
Every promotion is checkpointed in SQLite with a version number. You can roll back to any previous state with arise rollback <version> or arise.rollback(version).
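The idea behind versioned checkpoints can be illustrated with a few lines of `sqlite3`. The table layout, the `CheckpointStore` class, and its method names are all invented for this sketch and say nothing about ARISE's real schema. It stores a snapshot of the active skill names per version and reads one back on rollback.

```python
import json
import sqlite3


class CheckpointStore:
    """Toy versioned snapshot store (hypothetical schema)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "version INTEGER PRIMARY KEY AUTOINCREMENT, "
            "skills TEXT NOT NULL)"
        )

    def checkpoint(self, active_skills: list[str]) -> int:
        """Save a snapshot and return its version number."""
        cur = self.conn.execute(
            "INSERT INTO checkpoints (skills) VALUES (?)",
            (json.dumps(sorted(active_skills)),),
        )
        self.conn.commit()
        return cur.lastrowid

    def rollback(self, version: int) -> list[str]:
        """Return the skill set recorded at the given version."""
        row = self.conn.execute(
            "SELECT skills FROM checkpoints WHERE version = ?", (version,)
        ).fetchone()
        if row is None:
            raise KeyError(f"no checkpoint with version {version}")
        return json.loads(row[0])


checkpoints = CheckpointStore()
v1 = checkpoints.checkpoint(["parse_csv"])
v2 = checkpoints.checkpoint(["parse_csv", "fetch_url"])
restored = checkpoints.rollback(v1)  # the library as it was before fetch_url
```

Storing full snapshots rather than diffs keeps rollback a single lookup, at the cost of some redundancy between versions.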
Skill Lifecycle
TESTING → ACTIVE → DEPRECATED
- TESTING: synthesized but not yet promoted (failed adversarial tests, or in A/B test)
- ACTIVE: promoted, available to the agent
- DEPRECATED: removed (lost A/B test, manually removed, or rollback)
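The lifecycle reads naturally as a small state machine. In the sketch below the three state names come from the list above, while the `SkillStatus` enum, the `transition` helper, and the exact set of allowed moves (including TESTING directly to DEPRECATED for a variant that loses its A/B test) are assumptions.

```python
from enum import Enum


class SkillStatus(Enum):
    TESTING = "testing"
    ACTIVE = "active"
    DEPRECATED = "deprecated"


# Assumed transition table; DEPRECATED is treated as terminal here.
ALLOWED_TRANSITIONS = {
    SkillStatus.TESTING: {SkillStatus.ACTIVE, SkillStatus.DEPRECATED},
    SkillStatus.ACTIVE: {SkillStatus.DEPRECATED},
    SkillStatus.DEPRECATED: set(),
}


def transition(current: SkillStatus, target: SkillStatus) -> SkillStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target


status = SkillStatus.TESTING
status = transition(status, SkillStatus.ACTIVE)      # promoted
status = transition(status, SkillStatus.DEPRECATED)  # later removed
```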
A/B Testing
When ARISE patches an existing skill, it doesn’t replace it immediately. Instead, both versions stay in rotation: each episode randomly uses one variant. After min_episodes (default: 20), the variant with the higher success rate wins and is promoted; the loser is deprecated.
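A minimal sketch of this comparison follows. The `ABTest` class and its methods are hypothetical, and two details the text leaves open are filled in by assumption: `min_episodes` is counted across both variants together, and the original wins ties.

```python
import random
from typing import Optional


class ABTest:
    """Toy A/B test between an original skill and its patched version."""

    VARIANTS = ("original", "patched")

    def __init__(self, min_episodes: int = 20, seed: Optional[int] = None) -> None:
        self.min_episodes = min_episodes
        self.results: dict[str, list[bool]] = {v: [] for v in self.VARIANTS}
        self._rng = random.Random(seed)

    def pick_variant(self) -> str:
        # Each episode uses one variant chosen uniformly at random.
        return self._rng.choice(self.VARIANTS)

    def record(self, variant: str, success: bool) -> None:
        self.results[variant].append(success)

    def success_rate(self, variant: str) -> float:
        outcomes = self.results[variant]
        return sum(outcomes) / len(outcomes) if outcomes else 0.0

    def winner(self) -> Optional[str]:
        """None until enough episodes have run; ties go to the original."""
        total = sum(len(o) for o in self.results.values())
        if total < self.min_episodes:
            return None
        if self.success_rate("patched") > self.success_rate("original"):
            return "patched"
        return "original"


ab = ABTest(min_episodes=20, seed=0)
for i in range(10):
    ab.record("original", success=i < 5)  # 5 of 10 succeed
    ab.record("patched", success=i < 8)   # 8 of 10 succeed
```

After these 20 recorded episodes, `ab.winner()` returns `"patched"`. With fewer episodes it returns `None`, so neither version is promoted or deprecated prematurely.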
Cost Control
Evolution is rate-limited by max_evolutions_per_hour (default: 3). Each evolution cycle makes 3–5 LLM calls, so at the default limit ARISE makes at most 9–15 calls per hour. With gpt-4o-mini, that puts the worst case at roughly $0.01–0.15/hour for tool synthesis.
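One common way to enforce an hourly cap like this is a sliding-window limiter. The sketch below shows that technique, not ARISE's internals: the class name and the injectable clock are assumptions, and only `max_evolutions_per_hour` and its default of 3 come from the text.

```python
import time
from collections import deque
from typing import Callable


class EvolutionRateLimiter:
    """Sliding-window limiter: at most N evolutions in any 3600-second span."""

    WINDOW_SECONDS = 3600.0

    def __init__(
        self,
        max_evolutions_per_hour: int = 3,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_per_hour = max_evolutions_per_hour
        self._clock = clock
        self._starts: deque[float] = deque()

    def try_acquire(self) -> bool:
        """Return True and record the start if an evolution may run now."""
        now = self._clock()
        # Forget evolutions that started an hour or more ago.
        while self._starts and now - self._starts[0] >= self.WINDOW_SECONDS:
            self._starts.popleft()
        if len(self._starts) >= self.max_per_hour:
            return False
        self._starts.append(now)
        return True


# A fake clock makes the behaviour easy to see without waiting an hour.
fake_now = [0.0]
limiter = EvolutionRateLimiter(max_evolutions_per_hour=3, clock=lambda: fake_now[0])
decisions = [limiter.try_acquire() for _ in range(4)]  # the fourth is refused
fake_now[0] = 3600.0  # one hour later the window has emptied
later = limiter.try_acquire()
```

With the default cap of 3 and at most 5 LLM calls per cycle, the limiter bounds synthesis at 15 LLM calls per hour, which is where the hourly cost ceiling comes from.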