
Skill Grader

Structured evaluation rubric for Claude Agent Skills. Produces letter grades (A+ through F) on 10 axes plus an overall grade, with specific improvement recommendations for each axis.

Designed for sub-agents and non-expert reviewers who need a mechanical, repeatable process for assessing skill quality without deep domain expertise.


When to Use

✅ Use for:

  • Auditing a single skill's quality
  • Comparing skills against each other
  • Prioritizing which skills to improve first
  • Quality control sweeps across a skill library
  • Generating improvement roadmaps

❌ NOT for:

  • Creating new skills (use skill-architect)
  • Grading code quality or non-skill documents
  • Evaluating agent performance (different from skill quality)

Grading Process

```mermaid
flowchart TD
A[Read SKILL.md + all files] --> B[Score each of 10 axes]
B --> C[Assign letter grade per axis]
C --> D[Compute overall grade]
D --> E[Write improvement recommendations]
E --> F[Produce grading report]
```

Step-by-Step

  1. Read the entire skill folder: SKILL.md, all references, scripts, CHANGELOG, README
  2. Score each axis using the rubric below (0-100 per axis)
  3. Convert each score to a letter grade using the grade scale
  4. Compute the overall grade as a weighted average (Description and Scope carry 2x weight)
  5. Write 1-3 specific improvements for each axis scoring below B+
  6. Produce the grading report in the output format below

The 10 Evaluation Axes

Axis 1: Description Quality (Weight: 2x)

Does the description follow the pattern `[What] [When] [Keywords]. NOT for [Exclusions]`?

| Grade | Criteria |
|-------|----------|
| A | Specific verb+noun, domain keywords users would type, 2-5 explicit NOT exclusions, 25-50 words |
| B | Has keywords and a NOT clause, but slightly vague or missing synonym coverage |
| C | Too generic, missing NOT clause, or >100 words of process detail |
| D | Single vague sentence ("helps with X") or name/description mismatch |
| F | Missing or empty description |
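Parts of this axis can be checked mechanically. The sketch below is a hypothetical helper (not part of any skill format): it lints only the two purely mechanical criteria, word count and the presence of an explicit NOT clause; the remaining criteria still need human or model judgment.

```python
def lint_description(description: str) -> list[str]:
    """Return a list of mechanical findings against Axis 1; empty means no flags."""
    findings = []
    if not description.strip():
        findings.append("F: missing or empty description")
        return findings
    words = len(description.split())
    if not (25 <= words <= 50):
        findings.append(f"word count {words} outside 25-50 range")
    if "NOT" not in description:
        findings.append("no explicit NOT exclusion clause")
    return findings
```

An empty return does not imply an A; it only means the two mechanical checks passed.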

Axis 2: Scope Discipline (Weight: 2x)

Is the skill narrowly focused on one expertise type, or a catch-all?

| Grade | Criteria |
|-------|----------|
| A | One clear expertise domain, "When to Use" and "NOT for" sections both present and specific |
| B | Mostly focused, minor boundary ambiguity |
| C | Covers 2-3 related but distinct domains, should probably be split |
| D | Catch-all skill ("helps with anything related to X") |
| F | No scope boundaries defined at all |

Axis 3: Progressive Disclosure

Does the skill follow the three-layer architecture (metadata → SKILL.md → references)?

| Grade | Criteria |
|-------|----------|
| A | SKILL.md <300 lines, heavy content in references, reference index in SKILL.md with 1-line descriptions |
| B | SKILL.md <500 lines, some references used, index present |
| C | SKILL.md >500 lines, or all content inlined with no references |
| D | SKILL.md >800 lines, or references exist but aren't indexed in SKILL.md |
| F | Single massive file with no structure |
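The line-count bands above can be triaged automatically. A minimal sketch, assuming SKILL.md sits at the skill root and references live under `references/`; the band labels are deliberately hedged because the A/B distinction also depends on a reference index, which this check does not verify:

```python
from pathlib import Path

def axis3_band(skill_dir: str) -> str:
    """Map SKILL.md length and reference presence to a rough Axis 3 band."""
    skill_md = Path(skill_dir) / "SKILL.md"
    lines = len(skill_md.read_text(encoding="utf-8").splitlines())
    has_references = any(Path(skill_dir).glob("references/*"))
    if lines > 800:
        return "D"
    if lines > 500:
        return "C"
    if lines < 300 and has_references:
        return "A-range (verify reference index manually)"
    return "B-range (verify index present)"
```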

Axis 4: Anti-Pattern Coverage

Does the skill encode expert knowledge that prevents common mistakes?

| Grade | Criteria |
|-------|----------|
| A | 3+ anti-patterns with Novice/Expert/Timeline template, LLM-mistake notes |
| B | 1-2 anti-patterns with clear explanation |
| C | Anti-patterns mentioned but no structured template |
| D | No anti-patterns, just positive instructions |
| F | Contains advice that IS an anti-pattern (outdated, harmful) |

Axis 5: Self-Contained Tools

Does the skill include working tools (scripts, MCPs, subagents)?

| Grade | Criteria |
|-------|----------|
| A | Working scripts with CLI interface, error handling, dependency docs; OR valid "no tools needed" justification |
| B | Scripts exist and work but lack error handling or docs |
| C | Scripts referenced but are templates/pseudocode |
| D | Phantom tools (SKILL.md references files that don't exist) |
| F | References non-existent tools AND no acknowledgment |

Note: Not every skill needs tools. A pure decision-tree skill can score A if tools aren't applicable.

Axis 6: Activation Precision

Would the skill activate correctly on relevant queries and stay silent on irrelevant ones?

| Grade | Criteria |
|-------|----------|
| A | Description has specific keywords matching user language, clear NOT clause, no obvious false-positive vectors |
| B | Good keywords, minor false-positive risk |
| C | Generic keywords that overlap with other skills |
| D | No specific keywords, or NOT clause contradicts intended use |
| F | Description would cause constant false activation |

Axis 7: Visual Artifacts

Does the skill use Mermaid diagrams, code examples, and tables effectively?

| Grade | Criteria |
|-------|----------|
| A | Decision trees as Mermaid flowcharts, tables for comparisons, code examples for concrete patterns |
| B | Some diagrams or tables, but key decision trees still in prose |
| C | Tables used but no Mermaid diagrams for processes |
| D | Prose-only, no visual structure |
| F | Wall of text with no formatting aids |

Axis 8: Output Contracts

Does the skill define what it produces in a format consumable by other agents?

| Grade | Criteria |
|-------|----------|
| A | Explicit output format (JSON schema, markdown template, or structured sections), subagent-consumable |
| B | Output format implied but not explicitly documented |
| C | No output format, but content is structured enough to infer |
| D | Unstructured prose output expected |
| F | N/A (pure reference skill): exempt from this axis |

Axis 9: Temporal Awareness

Does the skill track when knowledge was current and what has changed?

| Grade | Criteria |
|-------|----------|
| A | Timelines in anti-patterns, "as of [date]" markers, CHANGELOG with dates |
| B | Some temporal context, CHANGELOG exists |
| C | No dates on knowledge, but CHANGELOG exists |
| D | No temporal context anywhere, knowledge could be stale |
| F | Contains demonstrably outdated advice without warning |

Axis 10: Documentation Quality

README, CHANGELOG, and reference organization.

| Grade | Criteria |
|-------|----------|
| A | README with quick start, CHANGELOG with dated versions, references well-organized with clear filenames |
| B | README and CHANGELOG exist, references present |
| C | SKILL.md is the only file, but it's well-structured |
| D | No README, no CHANGELOG, disorganized references |
| F | SKILL.md is the only file and it's poorly structured |

Grade Scale

| Letter | Score Range | Meaning |
|--------|-------------|---------|
| A+ | 97-100 | Exemplary: sets the standard |
| A | 93-96 | Excellent: minor improvements possible |
| A- | 90-92 | Very good: a few small gaps |
| B+ | 87-89 | Good: notable room for improvement |
| B | 83-86 | Solid: several areas need work |
| B- | 80-82 | Above average: meaningful gaps |
| C+ | 77-79 | Average: significant improvements needed |
| C | 73-76 | Below average: major gaps |
| C- | 70-72 | Barely adequate |
| D+ | 67-69 | Poor: fundamental issues |
| D | 63-66 | Very poor: needs major rework |
| D- | 60-62 | Near-failing quality |
| F | <60 | Failing: start over |

Overall Grade Computation

Axes 1 (Description) and 2 (Scope) carry 2x weight. All others carry 1x weight. If Axis 8 (Output Contracts) is marked exempt, remove it from both the numerator and the denominator (divide by 11 instead of 12).

Overall = (2×Axis1 + 2×Axis2 + Axis3 + Axis4 + Axis5 + Axis6 + Axis7 + Axis8 + Axis9 + Axis10) / 12

Convert the numeric average to a letter grade using the scale above.
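The computation can be sketched in code; the function names and the axis-score dictionary shape below are assumptions for illustration, not part of the rubric:

```python
# Cutoffs taken directly from the grade scale table above.
GRADE_SCALE = [
    (97, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]

def to_letter(score):
    """Convert a 0-100 score to a letter grade; anything below 60 is F."""
    for cutoff, letter in GRADE_SCALE:
        if score >= cutoff:
            return letter
    return "F"

def overall_grade(scores):
    """scores maps axis number (1-10) to a 0-100 score; axis 8 may be None (exempt)."""
    weights = {axis: (2 if axis in (1, 2) else 1) for axis in range(1, 11)}
    total = weight_sum = 0.0
    for axis, score in scores.items():
        if score is None:  # exempt axis: drop from numerator AND denominator
            continue
        total += weights[axis] * score
        weight_sum += weights[axis]
    avg = total / weight_sum
    return to_letter(avg), round(avg, 1)
```

For example, perfect Description and Scope (100 each) with 80 on every other axis yields (2×100 + 2×100 + 8×80) / 12 ≈ 86.7, a B.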


Output Format

Produce this exact structure:

```markdown
# Skill Grading Report: [skill-name]

**Graded**: [date]
**Overall Grade**: [letter] ([score]/100)

## Axis Grades

| # | Axis | Grade | Score | Key Finding |
|---|------|-------|-------|-------------|
| 1 | Description Quality | [grade] | [score] | [1-line finding] |
| 2 | Scope Discipline | [grade] | [score] | [1-line finding] |
| 3 | Progressive Disclosure | [grade] | [score] | [1-line finding] |
| 4 | Anti-Pattern Coverage | [grade] | [score] | [1-line finding] |
| 5 | Self-Contained Tools | [grade] | [score] | [1-line finding] |
| 6 | Activation Precision | [grade] | [score] | [1-line finding] |
| 7 | Visual Artifacts | [grade] | [score] | [1-line finding] |
| 8 | Output Contracts | [grade] | [score] | [1-line finding] |
| 9 | Temporal Awareness | [grade] | [score] | [1-line finding] |
| 10 | Documentation Quality | [grade] | [score] | [1-line finding] |

## Top 3 Improvements (Highest Impact)

1. **[Axis]: [Specific action]** - [Why this matters, expected grade improvement]
2. **[Axis]: [Specific action]** - [Why this matters, expected grade improvement]
3. **[Axis]: [Specific action]** - [Why this matters, expected grade improvement]

## Detailed Notes

### [Axis name] ([grade])
[2-3 sentences of specific feedback with examples from the skill]

[Repeat for each axis scoring below B+]
```

Quick Grading (Abbreviated)

For rapid triage across many skills, produce only:

| Skill | Overall | Desc | Scope | Disc | Anti | Tools | Activ | Visual | Output | Temp | Docs |
|-------|---------|------|-------|------|------|-------|-------|--------|--------|------|------|
| [name] | [grade] | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Anti-Patterns in Grading

Grade Inflation

**Wrong**: Giving B+ because "it's pretty good" without checking criteria.
**Right**: Match observations to the rubric table literally. If the description lacks a NOT clause, it cannot score above C on Axis 1.

Missing Context

**Wrong**: Grading a pure decision-tree skill poorly on Axis 5 (tools) because it has no scripts.
**Right**: Mark Axis 5 as "A: tools not applicable for this skill type."

Ignoring Phantoms

**Wrong**: Scoring Axis 5 as B because scripts are "referenced."
**Right**: Actually check that every referenced file exists. If `scripts/validate.py` is mentioned but doesn't exist, that's a D.
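The phantom check can be scripted. A minimal sketch, assuming file references in SKILL.md look like `scripts/...`, `references/...`, or `assets/...`; the regex is an assumption, so adjust it to match your conventions:

```python
import re
from pathlib import Path

# Path-like references under the conventional skill subdirectories.
PATH_PATTERN = re.compile(r"\b(?:scripts|references|assets)/[\w./-]+")

def find_phantom_refs(skill_dir: str) -> list[str]:
    """Return referenced paths in SKILL.md that do not exist on disk."""
    text = (Path(skill_dir) / "SKILL.md").read_text(encoding="utf-8")
    referenced = set(PATH_PATTERN.findall(text))
    return sorted(p for p in referenced if not (Path(skill_dir) / p).exists())
```

Any non-empty result is a D on Axis 5 under the rubric above.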