# Skill Grader
Structured evaluation rubric for Claude Agent Skills. Produces letter grades (A+ through F) on 10 axes plus an overall grade, with specific improvement recommendations for each axis.
Designed for sub-agents and non-expert reviewers who need a mechanical, repeatable process for assessing skill quality without deep domain expertise.
## When to Use

✅ Use for:
- Auditing a single skill's quality
- Comparing skills against each other
- Prioritizing which skills to improve first
- Quality control sweeps across a skill library
- Generating improvement roadmaps

❌ NOT for:
- Creating new skills (use skill-architect)
- Grading code quality or non-skill documents
- Evaluating agent performance (different from skill quality)
## Grading Process

```mermaid
flowchart TD
    A[Read SKILL.md + all files] --> B[Score each of 10 axes]
    B --> C[Assign letter grade per axis]
    C --> D[Compute overall grade]
    D --> E[Write improvement recommendations]
    E --> F[Produce grading report]
```
### Step-by-Step

1. Read the entire skill folder: SKILL.md, all references, scripts, CHANGELOG, README
2. Score each axis using the rubric below (0-100 per axis)
3. Convert each score to a letter grade (see grade scale)
4. Compute the overall grade as a weighted average (Description and Scope carry 2x weight)
5. Write 1-3 specific improvements for each axis scoring below B+
6. Produce the grading report in the output format below
## The 10 Evaluation Axes
### Axis 1: Description Quality (Weight: 2x)

Does the description follow the `[What] [When] [Keywords]. NOT for [Exclusions]` pattern?
| Grade | Criteria |
|---|---|
| A | Specific verb+noun, domain keywords users would type, 2-5 explicit NOT exclusions, 25-50 words |
| B | Has keywords and NOT clause, but slightly vague or missing synonym coverage |
| C | Too generic, missing NOT clause, or >100 words of process detail |
| D | Single vague sentence ("helps with X") or name/description mismatch |
| F | Missing or empty description |
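The A-row criteria above are mechanical enough to script. A minimal lint sketch in Python, with thresholds copied from the table (the function name and finding messages are illustrative, not part of this skill):

```python
import re

def check_description(desc: str) -> list[str]:
    """Flag Axis 1 issues a grader can verify mechanically."""
    findings = []
    words = len(desc.split())
    if words == 0:
        # F row: missing or empty description
        findings.append("F: missing or empty description")
        return findings
    if not re.search(r"\bNOT for\b", desc, re.IGNORECASE):
        # C row: missing NOT clause caps the grade at C
        findings.append("C or below: no 'NOT for' exclusion clause")
    if words > 100:
        # C row: over 100 words of process detail
        findings.append("C or below: over 100 words of process detail")
    elif not 25 <= words <= 50:
        # A row targets 25-50 words
        findings.append("below A: outside the 25-50 word target")
    return findings
```

Keyword specificity and name/description mismatch still require human judgment; the script only narrows the grade range.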
### Axis 2: Scope Discipline (Weight: 2x)
Is the skill narrowly focused on one expertise type, or a catch-all?
| Grade | Criteria |
|---|---|
| A | One clear expertise domain, "When to Use" and "NOT for" sections both present and specific |
| B | Mostly focused, minor boundary ambiguity |
| C | Covers 2-3 related but distinct domains, should probably be split |
| D | Catch-all skill ("helps with anything related to X") |
| F | No scope boundaries defined at all |
### Axis 3: Progressive Disclosure

Does the skill follow the three-layer architecture (metadata → SKILL.md → references)?
| Grade | Criteria |
|---|---|
| A | SKILL.md <300 lines, heavy content in references, reference index in SKILL.md with 1-line descriptions |
| B | SKILL.md <500 lines, some references used, index present |
| C | SKILL.md >500 lines, or all content inlined with no references |
| D | SKILL.md >800 lines, or references exist but aren't indexed in SKILL.md |
| F | Single massive file with no structure |
### Axis 4: Anti-Pattern Coverage
Does the skill encode expert knowledge that prevents common mistakes?
| Grade | Criteria |
|---|---|
| A | 3+ anti-patterns with Novice/Expert/Timeline template, LLM-mistake notes |
| B | 1-2 anti-patterns with clear explanation |
| C | Anti-patterns mentioned but no structured template |
| D | No anti-patterns, just positive instructions |
| F | Contains advice that IS an anti-pattern (outdated, harmful) |
### Axis 5: Self-Contained Tools
Does the skill include working tools (scripts, MCPs, subagents)?
| Grade | Criteria |
|---|---|
| A | Working scripts with CLI interface, error handling, dependency docs; OR valid "no tools needed" justification |
| B | Scripts exist and work but lack error handling or docs |
| C | Scripts referenced but are templates/pseudocode |
| D | Phantom tools (SKILL.md references files that don't exist) |
| F | References non-existent tools AND no acknowledgment |
Note: Not every skill needs tools. A pure decision-tree skill can score A if tools aren't applicable.
### Axis 6: Activation Precision
Would the skill activate correctly on relevant queries and stay silent on irrelevant ones?
| Grade | Criteria |
|---|---|
| A | Description has specific keywords matching user language, clear NOT clause, no obvious false-positive vectors |
| B | Good keywords, minor false-positive risk |
| C | Generic keywords that overlap with other skills |
| D | No specific keywords, or NOT clause contradicts intended use |
| F | Description would cause constant false activation |
### Axis 7: Visual Artifacts
Does the skill use Mermaid diagrams, code examples, and tables effectively?
| Grade | Criteria |
|---|---|
| A | Decision trees as Mermaid flowcharts, tables for comparisons, code examples for concrete patterns |
| B | Some diagrams or tables, but key decision trees still in prose |
| C | Tables used but no Mermaid diagrams for processes |
| D | Prose-only, no visual structure |
| F | Wall of text with no formatting aids |
### Axis 8: Output Contracts
Does the skill define what it produces in a format consumable by other agents?
| Grade | Criteria |
|---|---|
| A | Explicit output format (JSON schema, markdown template, or structured sections), subagent-consumable |
| B | Output format implied but not explicitly documented |
| C | No output format, but content is structured enough to infer |
| D | Unstructured prose output expected |
| N/A | Pure reference skill – mark exempt and remove this axis from the overall calculation |
### Axis 9: Temporal Awareness
Does the skill track when knowledge was current and what has changed?
| Grade | Criteria |
|---|---|
| A | Timelines in anti-patterns, "as of [date]" markers, CHANGELOG with dates |
| B | Some temporal context, CHANGELOG exists |
| C | No dates on knowledge, but CHANGELOG exists |
| D | No temporal context anywhere, knowledge could be stale |
| F | Contains demonstrably outdated advice without warning |
### Axis 10: Documentation Quality
README, CHANGELOG, and reference organization.
| Grade | Criteria |
|---|---|
| A | README with quick start, CHANGELOG with dated versions, references well-organized with clear filenames |
| B | README and CHANGELOG exist, references present |
| C | SKILL.md is the only file, but it's well-structured |
| D | No README, no CHANGELOG, disorganized references |
| F | SKILL.md is the only file and it's poorly structured |
## Grade Scale
| Letter | Score Range | Meaning |
|---|---|---|
| A+ | 97-100 | Exemplary – sets the standard |
| A | 93-96 | Excellent – minor improvements possible |
| A- | 90-92 | Very good – a few small gaps |
| B+ | 87-89 | Good – notable room for improvement |
| B | 83-86 | Solid – several areas need work |
| B- | 80-82 | Above average – meaningful gaps |
| C+ | 77-79 | Average – significant improvements needed |
| C | 73-76 | Below average – major gaps |
| C- | 70-72 | Barely adequate |
| D+ | 67-69 | Poor – fundamental issues |
| D | 63-66 | Very poor – needs major rework |
| D- | 60-62 | Near-failing quality |
| F | <60 | Failing – start over |
## Overall Grade Computation

Axes 1 (Description) and 2 (Scope) carry 2x weight; all others carry 1x. If Axis 8 (Output Contracts) is marked exempt, remove it from both the numerator and the denominator (divide by 11 instead of 12).

Overall = (2×Axis1 + 2×Axis2 + Axis3 + Axis4 + Axis5 + Axis6 + Axis7 + Axis8 + Axis9 + Axis10) / 12
Convert the numeric average to a letter grade using the scale above.
## Output Format

Produce this exact structure:

```markdown
# Skill Grading Report: [skill-name]

**Graded**: [date]
**Overall Grade**: [letter] ([score]/100)

## Axis Grades

| # | Axis | Grade | Score | Key Finding |
|---|------|-------|-------|-------------|
| 1 | Description Quality | [grade] | [score] | [1-line finding] |
| 2 | Scope Discipline | [grade] | [score] | [1-line finding] |
| 3 | Progressive Disclosure | [grade] | [score] | [1-line finding] |
| 4 | Anti-Pattern Coverage | [grade] | [score] | [1-line finding] |
| 5 | Self-Contained Tools | [grade] | [score] | [1-line finding] |
| 6 | Activation Precision | [grade] | [score] | [1-line finding] |
| 7 | Visual Artifacts | [grade] | [score] | [1-line finding] |
| 8 | Output Contracts | [grade] | [score] | [1-line finding] |
| 9 | Temporal Awareness | [grade] | [score] | [1-line finding] |
| 10 | Documentation Quality | [grade] | [score] | [1-line finding] |

## Top 3 Improvements (Highest Impact)

1. **[Axis]: [Specific action]** – [Why this matters, expected grade improvement]
2. **[Axis]: [Specific action]** – [Why this matters, expected grade improvement]
3. **[Axis]: [Specific action]** – [Why this matters, expected grade improvement]

## Detailed Notes

### [Axis name] ([grade])

[2-3 sentences of specific feedback with examples from the skill]

[Repeat for each axis scoring below B+]
```
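If a grading harness assembles reports programmatically, the header and axis table can be rendered from structured data. A minimal sketch (the function signature is illustrative, not part of this skill's contract):

```python
def render_report(skill: str, date: str, overall: tuple[str, int],
                  axes: list[tuple[int, str, str, int, str]]) -> str:
    """Render the report header and axis-grade table from grading data."""
    letter, score = overall
    lines = [
        f"# Skill Grading Report: {skill}",
        f"**Graded**: {date}",
        f"**Overall Grade**: {letter} ({score}/100)",
        "## Axis Grades",
        "| # | Axis | Grade | Score | Key Finding |",
        "|---|------|-------|-------|-------------|",
    ]
    for num, axis, grade, axis_score, finding in axes:
        lines.append(f"| {num} | {axis} | {grade} | {axis_score} | {finding} |")
    return "\n".join(lines)
```

The Top 3 Improvements and Detailed Notes sections require judgment and are best written by the grader directly.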
## Quick Grading (Abbreviated)
For rapid triage across many skills, produce only:
| Skill | Overall | Desc | Scope | Disc | Anti | Tools | Activ | Visual | Output | Temp | Docs |
|-------|---------|------|-------|------|------|-------|-------|--------|--------|------|------|
| [name] | [grade] | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
## Anti-Patterns in Grading

### Grade Inflation

**Wrong**: Giving B+ because "it's pretty good" without checking criteria.
**Right**: Match observations to the rubric table literally. If the description lacks a NOT clause, it cannot score above C on Axis 1.
### Missing Context

**Wrong**: Grading a pure decision-tree skill poorly on Axis 5 (tools) because it has no scripts.
**Right**: Mark Axis 5 as "A – tools not applicable for this skill type."
### Ignoring Phantoms
**Wrong**: Scoring Axis 5 as B because scripts are "referenced."
**Right**: Actually check that every referenced file exists. If `scripts/validate.py` is mentioned but doesn't exist, that's a D.
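This existence check can be mechanized: scan SKILL.md for path-like tokens and verify each one against the skill folder. A sketch, where the path regex is a heuristic assumption that should be adapted to how your skills cite files:

```python
import re
from pathlib import Path

def find_phantom_refs(skill_dir: str) -> list[str]:
    """Return paths mentioned in SKILL.md that don't exist in the skill folder.

    The pattern catches tokens like scripts/validate.py or
    references/patterns.md; it is a heuristic, not a parser.
    """
    root = Path(skill_dir)
    text = (root / "SKILL.md").read_text(encoding="utf-8")
    candidates = re.findall(r"\b[\w./-]+?/[\w.-]+\.\w+\b", text)
    return sorted({p for p in candidates if not (root / p).exists()})
```

Any non-empty result caps Axis 5 at D under the rubric above; an empty result proves nothing about script quality, only that the references aren't phantoms.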