Telexed

#0843

#0843Agents & tools Hacker News · AI2 weeks ago

`Forge` Pushes Local 8B Agent Reliability Near Frontier APIs

70radar

ForgeLLM guardrail runtime — improves local tool-call reliability

Guardrails, not bigger weights, drive the jump. The useful takeaway is architectural: retries, recovery, and serving backend choice can matter more than model size.

Ministral 8B with Forge hit 99.3%, versus Claude Sonnet with guardrails at 100% across the reported eval setup.
Without retry nudges, scores dropped 24-49 points. Reliability work belongs in the agent runtime, not only in model selection.
Serving backend changed the same Mistral-Nemo 12B weights from 7% on llama-server native function calling to 83% on Llamafile prompt mode.
Error recovery scored 0% for every tested model without retry logic. Tool agents need explicit recovery paths before production use.

Source: github.com/antoinezambelli/forgeRead original →