yaab.governance¶
Registry, lifecycle, guardrails, audit, evals, compliance mappers.
yaab.governance ¶
Governance, registry & compliance: registry, lifecycle, policy, audit, and compliance.
ToolApprovalPlugin ¶
Bases: Plugin
Require human approval before sensitive tool calls run.
ApprovalDecision ¶
Bases: str, Enum
The lifecycle state of a pending approval.
ApprovalRequest ¶
Bases: BaseModel
A durable record of a sensitive tool call awaiting human sign-off.
The correlation ids tie the record back to the parked run: run_id is the
run it belongs to and resume_id is the checkpoint key the loop resumes
from once a reviewer decides. tool and arguments are surfaced to the
reviewer so they can judge the request.
ApprovalStore ¶
Bases: Protocol
Pluggable storage for pending and decided approval records.
Implementations are durable and safe to share across replicas: a request persisted on one process is visible — and decidable — from any other.
create
async
¶
create(req: ApprovalRequest) -> None
Persist a new pending approval request (idempotent on approval_id).
A re-create with an existing id is a no-op — it must never clobber a record a reviewer already decided — so a crash-window re-pause that re-derives the same deterministic id self-heals instead of duplicating.
get
async
¶
get(approval_id: str) -> ApprovalRequest | None
Fetch one request by id, or None if unknown.
list_pending
async
¶
list_pending(*, agent: str | None = None) -> list[ApprovalRequest]
List still-pending requests, optionally scoped to one agent.
decide
async
¶
decide(approval_id: str, *, decision: ApprovalDecision, reviewer: str, reason: str | None = None, override_arguments: dict[str, Any] | None = None, answer: Any = None) -> ApprovalRequest | None
Record a reviewer's decision; returns the updated record or None.
override_arguments (reviewer-edited tool args) and answer (a typed
ask_user answer) are persisted alongside the verdict so the resume
path can flow the decided value back to the held tool. First-write-wins:
a decide on an already-decided record is a no-op returning it unchanged.
for_run
async
¶
for_run(run_id: str) -> list[ApprovalRequest]
All approval records (pending or decided) belonging to a run.
list_by_key
async
¶
list_by_key(correlation_key: str, *, pending_only: bool = True) -> list[ApprovalRequest]
All records carrying correlation_key (a business key lookup).
InMemoryApprovalStore ¶
Process-local approval store — the default for tests and single-process dev.
Holds records in a dict; nothing survives the process, so swap in a durable backend before running more than one replica.
PostgresApprovalStore ¶
Durable approval store backed by Postgres / Aurora PostgreSQL.
Uses psycopg (pip install 'yaab-sdk[postgres]'), imported lazily, so
the dependency is only needed when this backend is actually constructed. The
true multi-replica backend: any pod can list and decide a pending request.
RedisApprovalStore ¶
Durable approval store backed by Redis / ElastiCache / MemoryDB.
Uses redis (pip install 'yaab-sdk[redis]'), imported lazily; a
pre-built client may be injected for tests. Each request is a JSON value in a
per-id hash field, with a pending-id set and per-run id set for fast listing,
so any replica sees and decides the same records.
SQLiteApprovalStore ¶
Durable approval store backed by SQLite for single-node deployments.
Records are stored as JSON keyed by approval_id with indexed run_id,
agent, and decision columns so two views over one database file see
each other's pending records — the floor for resuming a parked run on a
different worker than the one that paused it.
ReviewDecision ¶
Bases: BaseModel
A human's decision on one :class:~yaab.types.Pending.
The only thing agent.run(resume=...) consumes. It is self-correlating: the
approval_id (and the resume_id copied from the store row) locate the
parked run's checkpoint, so resume never needs the original session or any
in-memory object from the process that paused.
DecisionValidationError ¶
Bases: GovernanceError
Raised when a human's payload fails validation before anything is stored.
The pending record stays pending and the run stays paused — a mistyped answer or malformed edit never half-commits a decision.
ResumeBundle ¶
Bases: BaseModel
Several :class:Decision values keyed by approval_id, resumed at once.
Built by :func:multiplex when one model turn guarded multiple tools: decide
each, then agent.run(resume=bundle) resolves every held tool with its
matching decision in a single resume.
AuditEvent ¶
Bases: BaseModel
A single tamper-evident audit entry.
AuditLog ¶
The hash-chained audit ledger.
AuditSink ¶
Bases: Protocol
A destination for audit events (OTel collector, Logfire, SQL, ...).
SQLiteAuditSink ¶
Durable audit sink backed by SQLite.
CallableAuthorizer ¶
Wrap a plain function (tool, args, ctx) -> bool | Decision.
Decision ¶
An authorization decision with an optional human-readable reason.
IdempotencyPlugin ¶
Bases: Plugin
Dedupe side-effecting tool calls within a run (or across runs via a store).
The idempotency key defaults to a hash of the tool name + sorted args; pass
key_fn to derive it from domain fields (e.g. an order id). On a repeat
key the cached result is returned and the tool is not executed again.
By default the cache lives for the plugin's lifetime (shared across runs on
the same Runner). Pass per_run=True to scope it to a single run via
ctx.state.
RBACAuthorizer ¶
Allow/deny tools by name and by required capability.
allow— if set, only these tools may run (allow-list);deny— these tools never run (takes precedence);require_capability— map tool name -> capability string that the caller'sctx.state['capabilities'](a set/list) must contain.
ToolAuthorizationPlugin ¶
Bases: Plugin
Enforce a chain of authorizers before each tool call.
All authorizers must allow; the first denial wins. With hard=True a
denial raises :class:PolicyViolation; otherwise it short-circuits the tool
with an error string fed back to the model (so the agent can adapt). Every
decision that isn't a plain allow is audited.
ToolAuthorizer ¶
Bases: Protocol
Decides whether a tool call may proceed.
Case ¶
Bases: BaseModel
One evaluation example: input, optional expected output, metadata.
Experiment ¶
Runs a task over a dataset and scores it with a set of evaluators.
ExperimentResult ¶
Bases: BaseModel
FunctionEvaluator ¶
Wrap an arbitrary scoring function as an :class:Evaluator.
JSONMatch ¶
Score 1.0 if output parses to JSON equal to case.expected.
Levenshtein ¶
Normalized edit-distance similarity in [0, 1] vs case.expected.
LLMJudge ¶
Score an output's quality 0-1 with a model judge (call :meth:ascore).
NumericTolerance ¶
Score 1.0 if the numeric output is within tol of case.expected.
Regex ¶
Score 1.0 if the output matches a regex (the pattern is case.expected).
ResponseMatch ¶
Token-overlap (ROUGE-style) similarity in [0, 1] vs case.expected.
The fraction of the expected answer's words that appear in the output — a deterministic, offline text-overlap signal that tolerates wording the way exact match cannot, without needing a model judge.
RubricJudge ¶
Score an output against named criteria with a model judge.
Unlike the freeform :class:LLMJudge (one blended number), the judge is asked
to score each rubric criterion separately and return them as a JSON object, so
the result is an inspectable per-criterion breakdown plus an aggregate (the
mean). Use :meth:ascore_rubric for the breakdown, :meth:ascore for just
the aggregate float (so it drops into the same eval pipeline as every metric).
ToolTrajectoryMatch ¶
Score how well an agent's tool-call sequence matches an expected one.
Unlike the output-string metrics above, it scores the process — which tools were called, in what order, with which arguments — which is what you actually want to regression-test for tool-using agents.
The expected trajectory is a list of {"name": str, "arguments"?: dict}
steps, read from case.metadata["expected_tool_trajectory"] (or, as a
convenience, case.expected when it is a list). The actual trajectory
is pulled from the run's events by :meth:Experiment.run and handed to this
evaluator via a context dict — so this evaluator's output argument is the
context dict, not the final string. That is why it is context-aware:
:meth:Experiment.run detects evaluators that accept the context and feeds
it to them, while keeping plain (case, output) evaluators working.
Scoring (strict=False, the default): the fraction of expected steps that
appear in the actual trajectory as an ordered subsequence (so a missing or
reordered step costs proportionally, never more). With strict=True the
actual sequence must equal the expected sequence exactly (1.0 or 0.0).
A step's arguments, when given, must be a subset of the actual call's arguments (extra actual args are fine) — agents often pass defaults the eval author did not pin down, and over-specifying would make tests brittle.
EvalCase ¶
Bases: BaseModel
One portable evaluation example.
A single-turn case is just a one-entry conversation; multi-turn cases
list the user turns in order (the agent/task is expected to drive the turns
in between). expected_tool_trajectory is an ordered list of
{"name": str, "arguments"?: dict} steps for trajectory scoring.
to_case ¶
to_case() -> Case
Convert to a yaab :class:~yaab.governance.eval.Case.
The last user turn is the input the task receives; the full
conversation is preserved under metadata["conversation"] so
multi-turn-aware tasks can replay it, and the expected trajectory is
stashed under metadata["expected_tool_trajectory"] where
:class:~yaab.governance.eval.ToolTrajectoryMatch looks for it.
EvalSet ¶
Bases: BaseModel
A named, versioned, portable collection of :class:EvalCases.
save ¶
Write the set to path as pretty JSON with a schema_version.
Returns the path written. The on-disk object is the model dump plus a
leading schema_version so future readers can branch on format.
load
classmethod
¶
load(path: str | Path) -> EvalSet
Read an evalset back from path (ignores schema_version).
Unknown top-level fields (including schema_version) are dropped by
pydantic, which is what gives us forward compatibility across minor
format additions.
to_dataset ¶
Convert to a yaab :class:~yaab.governance.eval.Dataset.
The returned dataset can be handed straight to
:class:~yaab.governance.eval.Experiment, so an evalset file becomes a
runnable suite with no extra glue.
LLMGuardScanner ¶
Run Protect AI LLM-Guard scanners as a YAAB guardrail.
NeMoGuardrailsScanner ¶
Enforce NeMo Guardrails rails as a YAAB guardrail.
PresidioPIIScanner ¶
Detect & redact PII via Microsoft Presidio.
EvidenceArtifact ¶
Bases: BaseModel
A piece of evidence attached at a lifecycle transition.
LifecycleManager ¶
Drives agents through the model-risk lifecycle with audited transitions.
DriftMonitor ¶
Track eval scores over time and detect material regressions.
baseline is the first baseline_window scores (e.g. validation-time);
drift is flagged when the mean of the last recent_window scores falls
more than threshold below the baseline mean.
TrustScorer ¶
Compute a 0–1 trust score for an agent from eval + audit signals.
The score blends three components (each in [0, 1], higher is better):
performance— mean eval score for the agent (defaults to 1.0 if none);safety— 1 minus the rate of guardrail blocks per run;reliability— 1 minus the rate of errors per run.
Weights are configurable; the breakdown is returned for transparency.
PIIScanner ¶
Detect and redact common PII (email, phone, SSN, credit card).
PolicyEngine ¶
Runs a set of scanners over text at a given stage.
PromptInjectionScanner ¶
Heuristic prompt-injection / jailbreak detector.
SecretScanner ¶
Detect leaked credentials/API keys (blocks on output).
SystemPromptLeakScanner ¶
Prevent the model from echoing its own system prompt.
TopicScanner ¶
Allow/deny list of banned topics (keyword based).
AgentCard ¶
Bases: BaseModel
The registry record for one agent version (A2A-compatible superset).
extra="allow" so a central/enterprise registry can attach its own fields
(e.g. usecase_id, blueprint, cost-center) and have them round-trip
losslessly through model_dump/JSON. Prefer the typed metadata dict for
organization-specific attributes you want to query consistently.
to_a2a_card ¶
Render an A2A-style discovery card for /.well-known/agent.json.
AgentRegistry ¶
The registry facade over a pluggable backend.
inventory ¶
Produce the SR 11-7 / EU AI Act model-inventory view.
DecisionAuthority ¶
Bases: str, Enum
What the agent is allowed to do with its output.
EUActCategory ¶
Bases: str, Enum
EU AI Act risk categories (Reg. 2024/1689).
RemoteRegistryBackend ¶
RegistryBackend backed by a central/enterprise HTTP registry service.
Lets governance enforce against an org-wide system-of-record instead of a
local store: register() writes through to the remote service, and the
enforcing run-gate reads approval status from it on every run.
Expected REST contract (override *_path to adapt to your service):
PUT {base_url}/agents/{agent_id} body: AgentCard JSON -> 2xx
GET {base_url}/agents/{agent_id} -> AgentCard JSON (404 if absent)
GET {base_url}/agents -> [AgentCard, ...] or {"agents": [...]}
Because AgentCard allows extra fields, any custom attributes your central
registry returns (usecase_id, blueprint, ...) round-trip intact.
A pre-built httpx.Client may be injected (handy for tests via
httpx.MockTransport); otherwise one is created from base_url +
headers + timeout. Requires the http extra (pip install
'yaab-sdk[http]').
RiskTier ¶
Bases: str, Enum
Internal risk tiering, orthogonal to any single regulatory regime.
SQLiteRegistryBackend ¶
Durable registry backend backed by SQLite.
GovernanceService ¶
Facade over the governance components, parameterized by mode.
SimulationEvaluator ¶
Wrap :func:simulate as an evaluation that scores a run in [0, 1].
With no metric the score is goal_achieved mapped to 1.0/0.0 — the
simplest useful signal: did the simulated user accomplish what it set out to
do. Pass a metric callable (:class:SimulationResult → float) for richer
scoring (turn efficiency, transcript length, an LLM judge over the
transcript, …).
SimulationResult ¶
Bases: BaseModel
The outcome of a simulated multi-turn conversation.
transcript is the full dialogue as {"role", "content"} dicts in
order (user/assistant alternating); turns counts completed user→agent
exchanges; goal_achieved is the simulator's own final self-assessment;
agent_usage aggregates the agent's token/cost accounting across turns.
UserSimulator ¶
An LLM playing a persona with a goal, driving a multi-turn conversation.
The simulator is itself model-driven: :meth:next_message renders a system
prompt (persona + goal + instructions) plus the conversation so far — but
from the user's point of view, so the agent's turns are presented as the
counterpart's messages — and asks the model for the next user utterance.
A reply of [DONE] (or one containing it) signals the simulator is
finished. :meth:assess_goal issues the final GOAL_ACHIEVED: yes/no
self-assessment.
next_message
async
¶
Produce the next user message, or signal completion.
Returns (message, done). done is True when the model emits the
[DONE] sentinel; in that case message is the empty string (the
sentinel itself is never appended to the transcript).
assess_goal
async
¶
Ask the simulator whether its goal was achieved (final self-assessment).
Parsed leniently: any yes in the reply (case-insensitive) counts as
achieved, so a chatty model that says "GOAL_ACHIEVED: yes, because…"
still scores correctly.
simulate
async
¶
simulate(agent: Any, simulator: UserSimulator, *, session_id: str | None = None) -> SimulationResult
Drive a multi-turn conversation between simulator and agent.
The loop: the simulator produces a user turn → the agent answers (carrying a
stable session_id so it sees the running history) → repeat. It ends when
the simulator emits [DONE], the simulator's stop_when predicate fires
on the agent's reply, or simulator.max_turns is reached. Finally the
simulator self-assesses goal_achieved.
A session_id is always used (generated if not supplied) because
multi-turn evaluation only means something if the agent accumulates history;
without it each agent.run would be amnesiac and the eval would be a
series of unrelated single-turn calls.
simulate_evalset
async
¶
simulate_evalset(agent: Any, evalset: Any, simulator_model: ModelProvider | str, *, max_turns: int = 8, stop_when: Callable[[str], bool] | None = None) -> list[SimulationResult]
Run a persona-driven simulation per case in an :class:EvalSet.
Each :class:~yaab.governance.evalset.EvalCase seeds one simulation: the
persona and goal come from case.metadata['persona'] and
case.metadata['goal']. When a case omits them, the case's conversation is
used as a fallback (the first user turn becomes the goal and the persona
defaults to a generic user) so legacy single-turn cases still drive a
sensible simulation instead of erroring.
Every case gets its own session_id so simulations don't bleed history
into each other. Returns one :class:SimulationResult per case, in order.