---
title: Rewrote Cursor rules into skills and added benchmarks
date: 2026-01-27
draft: false
---
I rewrote my set of Cursor rules as skills and, at the same time, started adding benchmarks. I had to replicate (well, almost) the Cursor context so that the tests stay close to reality. But now I can write skills based on data rather than intuition, using LLM-as-a-judge.
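The LLM-as-a-judge step can be sketched roughly as follows. Every name here (`build_judge_prompt`, the JSON verdict shape, the example checklist) is hypothetical, invented for illustration — this is a minimal sketch of the idea, not AssistFlow's actual API, and the judge's reply is stubbed instead of calling a real model.

```python
import json

def build_judge_prompt(task: str, checklist: list[str], diff: str) -> str:
    """Assemble a prompt asking an LLM judge to grade the real project
    changes (not the agent's reply text) against a per-scenario checklist."""
    items = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(checklist))
    return (
        f"Task: {task}\n"
        f"Checklist:\n{items}\n"
        f"Project diff:\n{diff}\n"
        'Reply with JSON: {"passed": [<item numbers>], "comment": "..."}'
    )

def parse_verdict(reply: str, total: int) -> float:
    """Turn the judge's JSON reply into a 0..1 score."""
    data = json.loads(reply)
    return len(set(data["passed"])) / total

# Hypothetical scenario; the judge reply is stubbed, not a real LLM call.
prompt = build_judge_prompt(
    task="Add a Makefile with a `test` target",
    checklist=["Makefile exists", "`test` target runs the test suite"],
    diff="+Makefile\n+test:\n+\tpytest",
)
score = parse_verdict('{"passed": [1, 2], "comment": "ok"}', total=2)
```

Averaging such scores over repeated runs is what turns rule-writing from intuition into something you can compare between two versions of a skill.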
AssistFlow Benchmarks is an automated testing system for AI agents that run in isolated Docker sandboxes. It judges task quality not by the response text but by the actual project changes (files, git log, git status). A run consists of automatic context assembly (similar to Cursor's), the agent executing bash commands in a closed environment, and evaluation of the result by an independent LLM judge against a checklist defined in the scenario. Each run produces a detailed interactive trace.html report for analyzing every iteration.
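The "judge the project state, not the reply" part might look something like the sketch below. The `ProjectState` shape and the check names are assumptions made up for this example; in this sketch, deterministic checks run on the state captured from the sandbox, and anything subjective is left to the LLM judge.

```python
from dataclasses import dataclass

@dataclass
class ProjectState:
    """Final sandbox state captured after the agent finishes."""
    files: dict[str, str]   # path -> file contents
    git_log: list[str]      # commit messages, oldest first
    git_status: str         # output of `git status --porcelain`

def run_checks(state: ProjectState) -> dict[str, bool]:
    """Deterministic pre-checks on the project's final state.

    These catch objective failures cheaply; the LLM judge then grades
    the remaining, subjective checklist items.
    """
    return {
        "work_committed": state.git_status.strip() == "",
        "readme_present": "README.md" in state.files,
        "single_commit": len(state.git_log) == 1,
    }

# A hypothetical passing run: one clean commit adding a README.
state = ProjectState(
    files={"README.md": "# demo"},
    git_log=["Add README"],
    git_status="",
)
results = run_checks(state)
```

Keeping the checks as pure functions over a captured state snapshot also makes the benchmark itself easy to unit-test without spinning up Docker.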