yaab.governance¶

Registry, lifecycle, guardrails, audit, evals, compliance mappers.

yaab.governance ¶

Governance, registry & compliance: registry, lifecycle, policy, audit, and compliance.

ToolApprovalPlugin ¶

Bases: Plugin

Require human approval before sensitive tool calls run.

ApprovalDecision ¶

Bases: str, Enum

The lifecycle state of a pending approval.

ApprovalRequest ¶

Bases: BaseModel

A durable record of a sensitive tool call awaiting human sign-off.

The correlation ids tie the record back to the parked run: run_id is the run it belongs to and resume_id is the checkpoint key the loop resumes from once a reviewer decides. tool and arguments are surfaced to the reviewer so they can judge the request.

ApprovalStore ¶

Bases: Protocol

Pluggable storage for pending and decided approval records.

Implementations are durable and safe to share across replicas: a request persisted on one process is visible — and decidable — from any other.

create `async` ¶

create(req: ApprovalRequest) -> None

Persist a new pending approval request (idempotent on approval_id).

A re-create with an existing id is a no-op — it must never clobber a record a reviewer already decided — so a crash-window re-pause that re-derives the same deterministic id self-heals instead of duplicating.

get `async` ¶

get(approval_id: str) -> ApprovalRequest | None

Fetch one request by id, or None if unknown.

list_pending `async` ¶

list_pending(*, agent: str | None = None) -> list[ApprovalRequest]

List still-pending requests, optionally scoped to one agent.

decide `async` ¶

decide(approval_id: str, *, decision: ApprovalDecision, reviewer: str, reason: str | None = None, override_arguments: dict[str, Any] | None = None, answer: Any = None) -> ApprovalRequest | None

Record a reviewer's decision; returns the updated record or None.

override_arguments (reviewer-edited tool args) and answer (a typed ask_user answer) are persisted alongside the verdict so the resume path can flow the decided value back to the held tool. First-write-wins: a decide on an already-decided record is a no-op returning it unchanged.

for_run `async` ¶

for_run(run_id: str) -> list[ApprovalRequest]

All approval records (pending or decided) belonging to a run.

list_by_key `async` ¶

list_by_key(correlation_key: str, *, pending_only: bool = True) -> list[ApprovalRequest]

All records carrying correlation_key (a business key lookup).

InMemoryApprovalStore ¶

Process-local approval store — the default for tests and single-process dev.

Holds records in a dict; nothing survives the process, so swap in a durable backend before running more than one replica.

PostgresApprovalStore ¶

Durable approval store backed by Postgres / Aurora PostgreSQL.

Uses psycopg (pip install 'yaab-sdk[postgres]'), imported lazily, so the dependency is only needed when this backend is actually constructed. The true multi-replica backend: any pod can list and decide a pending request.

RedisApprovalStore ¶

Durable approval store backed by Redis / ElastiCache / MemoryDB.

Uses redis (pip install 'yaab-sdk[redis]'), imported lazily; a pre-built client may be injected for tests. Each request is a JSON value in a per-id hash field, with a pending-id set and per-run id set for fast listing, so any replica sees and decides the same records.

SQLiteApprovalStore ¶

Durable approval store backed by SQLite for single-node deployments.

Records are stored as JSON keyed by approval_id with indexed run_id, agent, and decision columns so two views over one database file see each other's pending records — the floor for resuming a parked run on a different worker than the one that paused it.

ReviewDecision ¶

Bases: BaseModel

A human's decision on one :class:~yaab.types.Pending.

The only thing agent.run(resume=...) consumes. It is self-correlating: the approval_id (and the resume_id copied from the store row) locate the parked run's checkpoint, so resume never needs the original session or any in-memory object from the process that paused.

DecisionValidationError ¶

Bases: GovernanceError

Raised when a human's payload fails validation before anything is stored.

The pending record stays pending and the run stays paused — a mistyped answer or malformed edit never half-commits a decision.

ResumeBundle ¶

Bases: BaseModel

Several :class:Decision values keyed by approval_id, resumed at once.

Built by :func:multiplex when one model turn guarded multiple tools: decide each, then agent.run(resume=bundle) resolves every held tool with its matching decision in a single resume.

AuditEvent ¶

Bases: BaseModel

A single tamper-evident audit entry.

signing_payload ¶

signing_payload() -> str

The canonical string that gets hashed into the chain.

AuditLog ¶

The hash-chained audit ledger.

verify ¶

verify() -> bool

Return True iff the hash chain is intact.

AuditSink ¶

Bases: Protocol

A destination for audit events (OTel collector, Logfire, SQL, ...).

SQLiteAuditSink ¶

Durable audit sink backed by SQLite.

CallableAuthorizer ¶

Wrap a plain function (tool, args, ctx) -> bool | Decision.

Decision ¶

An authorization decision with an optional human-readable reason.

IdempotencyPlugin ¶

Bases: Plugin

Dedupe side-effecting tool calls within a run (or across runs via a store).

The idempotency key defaults to a hash of the tool name + sorted args; pass key_fn to derive it from domain fields (e.g. an order id). On a repeat key the cached result is returned and the tool is not executed again.

By default the cache lives for the plugin's lifetime (shared across runs on the same Runner). Pass per_run=True to scope it to a single run via ctx.state.

RBACAuthorizer ¶

Allow/deny tools by name and by required capability.

allow — if set, only these tools may run (allow-list);
deny — these tools never run (takes precedence);
require_capability — map tool name -> capability string that the caller's ctx.state['capabilities'] (a set/list) must contain.

ToolAuthorizationPlugin ¶

Bases: Plugin

Enforce a chain of authorizers before each tool call.

All authorizers must allow; the first denial wins. With hard=True a denial raises :class:PolicyViolation; otherwise it short-circuits the tool with an error string fed back to the model (so the agent can adapt). Every decision that isn't a plain allow is audited.

ToolAuthorizer ¶

Bases: Protocol

Decides whether a tool call may proceed.

Budget `dataclass` ¶

A spend cap for one key over a rolling window.

InMemorySpendStore ¶

Process-local spend ledger — the default for tests and single-process dev.

PostgresSpendStore ¶

Durable spend ledger backed by Postgres / Aurora — the multi-pod backend.

Uses psycopg (pip install 'yaab-sdk[postgres]'), imported lazily, so a spend cap is enforced against one shared ledger every pod reads and writes — a rate/budget that is global across replicas, not per-pod.

SpendGovernancePlugin ¶

Bases: Plugin

Enforce per-identity / per-tenant spend caps across runs.

before_run blocks a run whose identity or tenant key is already at/over its budget; after_model records each call's cost_usd against those keys. tenant_of maps an identity to a tenant key (None = no tenant tier). clock is injectable for tests.

SpendStore ¶

Bases: Protocol

An append-only spend ledger keyed by an opaque string.

SQLiteSpendStore ¶

Durable spend ledger backed by SQLite — durable on a single node.

Two views over one database file see each other's spend, so a paused/over-budget decision is consistent across worker threads and processes on the same host.

Case ¶

Bases: BaseModel

One evaluation example: input, optional expected output, metadata.

Experiment ¶

Runs a task over a dataset and scores it with a set of evaluators.

ExperimentResult ¶

Bases: BaseModel

aggregate `property` ¶

aggregate: dict[str, float]

Mean score per evaluator across all cases.

FunctionEvaluator ¶

Wrap an arbitrary scoring function as an :class:Evaluator.

JSONMatch ¶

Score 1.0 if output parses to JSON equal to case.expected.

Levenshtein ¶

Normalized edit-distance similarity in [0, 1] vs case.expected.

LLMJudge ¶

Score an output's quality 0-1 with a model judge (call :meth:ascore).

NumericTolerance ¶

Score 1.0 if the numeric output is within tol of case.expected.

Regex ¶

Score 1.0 if the output matches a regex (the pattern is case.expected).

ResponseMatch ¶

Token-overlap (ROUGE-style) similarity in [0, 1] vs case.expected.

The fraction of the expected answer's words that appear in the output — a deterministic, offline text-overlap signal that tolerates wording the way exact match cannot, without needing a model judge.

RubricJudge ¶

Score an output against named criteria with a model judge.

Unlike the freeform :class:LLMJudge (one blended number), the judge is asked to score each rubric criterion separately and return them as a JSON object, so the result is an inspectable per-criterion breakdown plus an aggregate (the mean). Use :meth:ascore_rubric for the breakdown, :meth:ascore for just the aggregate float (so it drops into the same eval pipeline as every metric).

ToolTrajectoryMatch ¶

Score how well an agent's tool-call sequence matches an expected one.

Unlike the output-string metrics above, it scores the process — which tools were called, in what order, with which arguments — which is what you actually want to regression-test for tool-using agents.

The expected trajectory is a list of {"name": str, "arguments"?: dict} steps, read from case.metadata["expected_tool_trajectory"] (or, as a convenience, case.expected when it is a list). The actual trajectory is pulled from the run's events by :meth:Experiment.run and handed to this evaluator via a context dict — so this evaluator's output argument is the context dict, not the final string. That is why it is context-aware: :meth:Experiment.run detects evaluators that accept the context and feeds it to them, while keeping plain (case, output) evaluators working.

Scoring (strict=False, the default): the fraction of expected steps that appear in the actual trajectory as an ordered subsequence (so a missing or reordered step costs proportionally, never more). With strict=True the actual sequence must equal the expected sequence exactly (1.0 or 0.0).

A step's arguments, when given, must be a subset of the actual call's arguments (extra actual args are fine) — agents often pass defaults the eval author did not pin down, and over-specifying would make tests brittle.

EvalCase ¶

Bases: BaseModel

One portable evaluation example.

A single-turn case is just a one-entry conversation; multi-turn cases list the user turns in order (the agent/task is expected to drive the turns in between). expected_tool_trajectory is an ordered list of {"name": str, "arguments"?: dict} steps for trajectory scoring.

to_case ¶

to_case() -> Case

Convert to a yaab :class:~yaab.governance.eval.Case.

The last user turn is the input the task receives; the full conversation is preserved under metadata["conversation"] so multi-turn-aware tasks can replay it, and the expected trajectory is stashed under metadata["expected_tool_trajectory"] where :class:~yaab.governance.eval.ToolTrajectoryMatch looks for it.

from_case `classmethod` ¶

from_case(case: Case) -> EvalCase

Build an :class:EvalCase from a yaab :class:Case.

A conversation already in the case's metadata wins; otherwise the case's inputs becomes a single-turn conversation. The expected tool trajectory is read from metadata if present.

EvalSet ¶

Bases: BaseModel

A named, versioned, portable collection of :class:EvalCases.

save ¶

save(path: str | Path) -> Path

Write the set to path as pretty JSON with a schema_version.

Returns the path written. The on-disk object is the model dump plus a leading schema_version so future readers can branch on format.

load `classmethod` ¶

load(path: str | Path) -> EvalSet

Read an evalset back from path (ignores schema_version).

Unknown top-level fields (including schema_version) are dropped by pydantic, which is what gives us forward compatibility across minor format additions.

to_dataset ¶

to_dataset() -> Dataset

Convert to a yaab :class:~yaab.governance.eval.Dataset.

The returned dataset can be handed straight to :class:~yaab.governance.eval.Experiment, so an evalset file becomes a runnable suite with no extra glue.

from_cases `classmethod` ¶

from_cases(cases: list[Case], *, name: str = 'evalset', version: str = '1') -> EvalSet

Build an :class:EvalSet from existing yaab :class:Case objects.

LLMGuardScanner ¶

Run Protect AI LLM-Guard scanners as a YAAB guardrail.

NeMoGuardrailsScanner ¶

Enforce NeMo Guardrails rails as a YAAB guardrail.

PresidioPIIScanner ¶

Detect & redact PII via Microsoft Presidio.

EvidenceArtifact ¶

Bases: BaseModel

A piece of evidence attached at a lifecycle transition.

LifecycleManager ¶

Drives agents through the model-risk lifecycle with audited transitions.

DriftMonitor ¶

Track eval scores over time and detect material regressions.

baseline is the first baseline_window scores (e.g. validation-time); drift is flagged when the mean of the last recent_window scores falls more than threshold below the baseline mean.

TrustScorer ¶

Compute a 0–1 trust score for an agent from eval + audit signals.

The score blends three components (each in [0, 1], higher is better):

performance — mean eval score for the agent (defaults to 1.0 if none);
safety — 1 minus the rate of guardrail blocks per run;
reliability — 1 minus the rate of errors per run.

Weights are configurable; the breakdown is returned for transparency.

PIIScanner ¶

Detect and redact common PII (email, phone, SSN, credit card).

PolicyEngine ¶

Runs a set of scanners over text at a given stage.

evaluate ¶

evaluate(text: str, stage: Stage) -> list[GuardrailResult]

Run all stage-relevant scanners; redactions chain through the text.

decide `staticmethod` ¶

decide(results: list[GuardrailResult]) -> tuple[Action, str]

Collapse scanner results into a single effective action + text.

PromptInjectionScanner ¶

Heuristic prompt-injection / jailbreak detector.

SecretScanner ¶

Detect leaked credentials/API keys (blocks on output).

SystemPromptLeakScanner ¶

Prevent the model from echoing its own system prompt.

TopicScanner ¶

Allow/deny list of banned topics (keyword based).

AgentCard ¶

Bases: BaseModel

The registry record for one agent version (A2A-compatible superset).

extra="allow" so a central/enterprise registry can attach its own fields (e.g. usecase_id, blueprint, cost-center) and have them round-trip losslessly through model_dump/JSON. Prefer the typed metadata dict for organization-specific attributes you want to query consistently.

to_a2a_card ¶

to_a2a_card(url: str = '') -> dict[str, Any]

Render an A2A-style discovery card for /.well-known/agent.json.

AgentRegistry ¶

The registry facade over a pluggable backend.

inventory ¶

inventory() -> list[dict[str, Any]]

Produce the SR 11-7 / EU AI Act model-inventory view.

DecisionAuthority ¶

Bases: str, Enum

What the agent is allowed to do with its output.

EUActCategory ¶

Bases: str, Enum

EU AI Act risk categories (Reg. 2024/1689).

RemoteRegistryBackend ¶

RegistryBackend backed by a central/enterprise HTTP registry service.

Lets governance enforce against an org-wide system-of-record instead of a local store: register() writes through to the remote service, and the enforcing run-gate reads approval status from it on every run.

Expected REST contract (override *_path to adapt to your service):

PUT  {base_url}/agents/{agent_id}   body: AgentCard JSON  -> 2xx
GET  {base_url}/agents/{agent_id}   -> AgentCard JSON (404 if absent)
GET  {base_url}/agents             -> [AgentCard, ...] or {"agents": [...]}

Because AgentCard allows extra fields, any custom attributes your central registry returns (usecase_id, blueprint, ...) round-trip intact.

A pre-built httpx.Client may be injected (handy for tests via httpx.MockTransport); otherwise one is created from base_url + headers + timeout. Requires the http extra (pip install 'yaab-sdk[http]').

RiskTier ¶

Bases: str, Enum

Internal risk tiering, orthogonal to any single regulatory regime.

SQLiteRegistryBackend ¶

Durable registry backend backed by SQLite.

GovernanceService ¶

Facade over the governance components, parameterized by mode.

check_registered ¶

check_registered(agent_id: str | None, identity: str | None) -> None

Enforce registration + approval before a run (enforcing mode only).

scan ¶

scan(text: str, stage: Stage, *, agent_id: str | None = None, identity: str | None = None) -> str

Run guardrails. Returns possibly-redacted text; may raise on BLOCK.

In OBSERVE mode a BLOCK is recorded but downgraded to a flag (text passes through); in ENFORCING mode a BLOCK raises PolicyViolation.

SimulationEvaluator ¶

Wrap :func:simulate as an evaluation that scores a run in [0, 1].

With no metric the score is goal_achieved mapped to 1.0/0.0 — the simplest useful signal: did the simulated user accomplish what it set out to do. Pass a metric callable (:class:SimulationResult → float) for richer scoring (turn efficiency, transcript length, an LLM judge over the transcript, …).

score ¶

score(result: SimulationResult) -> float

Score an already-computed :class:SimulationResult.

SimulationResult ¶

Bases: BaseModel

The outcome of a simulated multi-turn conversation.

transcript is the full dialogue as {"role", "content"} dicts in order (user/assistant alternating); turns counts completed user→agent exchanges; goal_achieved is the simulator's own final self-assessment; agent_usage aggregates the agent's token/cost accounting across turns.

UserSimulator ¶

An LLM playing a persona with a goal, driving a multi-turn conversation.

The simulator is itself model-driven: :meth:next_message renders a system prompt (persona + goal + instructions) plus the conversation so far — but from the user's point of view, so the agent's turns are presented as the counterpart's messages — and asks the model for the next user utterance. A reply of [DONE] (or one containing it) signals the simulator is finished. :meth:assess_goal issues the final GOAL_ACHIEVED: yes/no self-assessment.

next_message `async` ¶

next_message(transcript: list[dict[str, str]]) -> tuple[str, bool]

Produce the next user message, or signal completion.

Returns (message, done). done is True when the model emits the [DONE] sentinel; in that case message is the empty string (the sentinel itself is never appended to the transcript).

assess_goal `async` ¶

assess_goal(transcript: list[dict[str, str]]) -> bool

Ask the simulator whether its goal was achieved (final self-assessment).

Parsed leniently: any yes in the reply (case-insensitive) counts as achieved, so a chatty model that says "GOAL_ACHIEVED: yes, because…" still scores correctly.

simulate `async` ¶

simulate(agent: Any, simulator: UserSimulator, *, session_id: str | None = None) -> SimulationResult

Drive a multi-turn conversation between simulator and agent.

The loop: the simulator produces a user turn → the agent answers (carrying a stable session_id so it sees the running history) → repeat. It ends when the simulator emits [DONE], the simulator's stop_when predicate fires on the agent's reply, or simulator.max_turns is reached. Finally the simulator self-assesses goal_achieved.

A session_id is always used (generated if not supplied) because multi-turn evaluation only means something if the agent accumulates history; without it each agent.run would be amnesiac and the eval would be a series of unrelated single-turn calls.

simulate_evalset `async` ¶

simulate_evalset(agent: Any, evalset: Any, simulator_model: ModelProvider | str, *, max_turns: int = 8, stop_when: Callable[[str], bool] | None = None) -> list[SimulationResult]

Run a persona-driven simulation per case in an :class:EvalSet.

Each :class:~yaab.governance.evalset.EvalCase seeds one simulation: the persona and goal come from case.metadata['persona'] and case.metadata['goal']. When a case omits them, the case's conversation is used as a fallback (the first user turn becomes the goal and the persona defaults to a generic user) so legacy single-turn cases still drive a sensible simulation instead of erroring.

Every case gets its own session_id so simulations don't bleed history into each other. Returns one :class:SimulationResult per case, in order.

yaab.governance¶

yaab.governance ¶

ToolApprovalPlugin ¶

ApprovalDecision ¶

ApprovalRequest ¶

ApprovalStore ¶

create async ¶

get async ¶

list_pending async ¶

decide async ¶

for_run async ¶

list_by_key async ¶

InMemoryApprovalStore ¶

PostgresApprovalStore ¶

RedisApprovalStore ¶

SQLiteApprovalStore ¶

ReviewDecision ¶

DecisionValidationError ¶

ResumeBundle ¶

AuditEvent ¶

signing_payload ¶

AuditLog ¶

verify ¶

AuditSink ¶

SQLiteAuditSink ¶

CallableAuthorizer ¶

Decision ¶

IdempotencyPlugin ¶

RBACAuthorizer ¶

ToolAuthorizationPlugin ¶

ToolAuthorizer ¶

Budget dataclass ¶

InMemorySpendStore ¶

PostgresSpendStore ¶

SpendGovernancePlugin ¶

SpendStore ¶

SQLiteSpendStore ¶

Case ¶

Experiment ¶

ExperimentResult ¶

aggregate property ¶

FunctionEvaluator ¶

JSONMatch ¶

Levenshtein ¶

LLMJudge ¶

NumericTolerance ¶

Regex ¶

ResponseMatch ¶

RubricJudge ¶

ToolTrajectoryMatch ¶

EvalCase ¶

to_case ¶

from_case classmethod ¶

EvalSet ¶

save ¶

load classmethod ¶

to_dataset ¶

from_cases classmethod ¶

LLMGuardScanner ¶

NeMoGuardrailsScanner ¶

PresidioPIIScanner ¶

EvidenceArtifact ¶

LifecycleManager ¶

DriftMonitor ¶

TrustScorer ¶

PIIScanner ¶

PolicyEngine ¶

evaluate ¶

decide staticmethod ¶

PromptInjectionScanner ¶

SecretScanner ¶

SystemPromptLeakScanner ¶

TopicScanner ¶

AgentCard ¶

to_a2a_card ¶

AgentRegistry ¶

inventory ¶

DecisionAuthority ¶

EUActCategory ¶

RemoteRegistryBackend ¶

create `async` ¶

get `async` ¶

list_pending `async` ¶

decide `async` ¶

for_run `async` ¶

list_by_key `async` ¶

Budget `dataclass` ¶

aggregate `property` ¶

from_case `classmethod` ¶

load `classmethod` ¶

from_cases `classmethod` ¶

decide `staticmethod` ¶

next_message `async` ¶

assess_goal `async` ¶

simulate `async` ¶

simulate_evalset `async` ¶