
Skill Grader

Structured evaluation rubric for Claude Agent Skills. Produces letter grades (A+ through F) on 10 axes plus an overall grade, with specific improvement recommendations for each axis.

Designed for sub-agents and non-expert reviewers who need a mechanical, repeatable process for assessing skill quality without deep domain expertise.


When to Use

✅ Use for:

  • Auditing a single skill's quality
  • Comparing skills against each other
  • Prioritizing which skills to improve first
  • Quality control sweeps across a skill library
  • Generating improvement roadmaps

❌ NOT for:

  • Creating new skills (use skill-architect)
  • Grading code quality or non-skill documents
  • Evaluating agent performance (different from skill quality)

Grading Process

```mermaid
flowchart TD
A[Read SKILL.md + all files] --> B[Score each of 10 axes]
B --> C[Assign letter grade per axis]
C --> D[Compute overall grade]
D --> E[Write improvement recommendations]
E --> F[Produce grading report]
```

Step-by-Step

  1. Read the entire skill folder: SKILL.md, all references, scripts, CHANGELOG, README
  2. Score each axis using the rubric below (0-100 per axis)
  3. Convert each score to a letter grade using the grade scale
  4. Compute the overall grade as a weighted average (Description and Scope carry 2x weight)
  5. Write 1-3 specific improvements for each axis scoring below B+
  6. Produce the grading report in the output format below

The 10 Evaluation Axes

Axis 1: Description Quality (Weight: 2x)

Does the description follow the pattern `[What] [When] [Keywords]. NOT for [Exclusions]`?

| Grade | Criteria |
|-------|----------|
| A | Specific verb+noun, domain keywords users would type, 2-5 explicit NOT exclusions, 25-50 words |
| B | Has keywords and a NOT clause, but slightly vague or missing synonym coverage |
| C | Too generic, missing NOT clause, or >100 words of process detail |
| D | Single vague sentence ("helps with X") or name/description mismatch |
| F | Missing or empty description |
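Parts of this axis can be checked mechanically. The sketch below is a hypothetical helper (not part of any skill format): it lints only the two purely mechanical criteria, word count and the presence of an explicit NOT clause; the remaining criteria still need human or model judgment.

```python
def lint_description(description: str) -> list[str]:
    """Return a list of mechanical findings against Axis 1; empty means no flags."""
    findings = []
    if not description.strip():
        findings.append("F: missing or empty description")
        return findings
    words = len(description.split())
    if not (25 <= words <= 50):
        findings.append(f"word count {words} outside 25-50 range")
    if "NOT" not in description:
        findings.append("no explicit NOT exclusion clause")
    return findings
```

An empty return does not imply an A; it only means the two mechanical checks passed.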

Axis 2: Scope Discipline (Weight: 2x)

Is the skill narrowly focused on one expertise type, or a catch-all?

| Grade | Criteria |
|-------|----------|
| A | One clear expertise domain, "When to Use" and "NOT for" sections both present and specific |
| B | Mostly focused, minor boundary ambiguity |
| C | Covers 2-3 related but distinct domains, should probably be split |
| D | Catch-all skill ("helps with anything related to X") |
| F | No scope boundaries defined at all |

Axis 3: Progressive Disclosure

Does the skill follow the three-layer architecture (metadata → SKILL.md → references)?

| Grade | Criteria |
|-------|----------|
| A | SKILL.md <300 lines, heavy content in references, reference index in SKILL.md with 1-line descriptions |
| B | SKILL.md <500 lines, some references used, index present |
| C | SKILL.md >500 lines, or all content inlined with no references |
| D | SKILL.md >800 lines, or references exist but aren't indexed in SKILL.md |
| F | Single massive file with no structure |
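The line-count bands above can be triaged automatically. A minimal sketch, assuming SKILL.md sits at the skill root and references live under `references/`; the band labels are deliberately hedged because the A/B distinction also depends on a reference index, which this check does not verify:

```python
from pathlib import Path

def axis3_band(skill_dir: str) -> str:
    """Map SKILL.md length and reference presence to a rough Axis 3 band."""
    skill_md = Path(skill_dir) / "SKILL.md"
    lines = len(skill_md.read_text(encoding="utf-8").splitlines())
    has_references = any(Path(skill_dir).glob("references/*"))
    if lines > 800:
        return "D"
    if lines > 500:
        return "C"
    if lines < 300 and has_references:
        return "A-range (verify reference index manually)"
    return "B-range (verify index present)"
```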

Axis 4: Anti-Pattern Coverage

Does the skill encode expert knowledge that prevents common mistakes?

| Grade | Criteria |
|-------|----------|
| A | 3+ anti-patterns with Novice/Expert/Timeline template, LLM-mistake notes |
| B | 1-2 anti-patterns with clear explanation |
| C | Anti-patterns mentioned but no structured template |
| D | No anti-patterns, just positive instructions |
| F | Contains advice that IS an anti-pattern (outdated, harmful) |

Axis 5: Self-Contained Tools

Does the skill include working tools (scripts, MCPs, subagents)?

| Grade | Criteria |
|-------|----------|
| A | Working scripts with CLI interface, error handling, dependency docs; OR valid "no tools needed" justification |
| B | Scripts exist and work but lack error handling or docs |
| C | Scripts referenced but are templates/pseudocode |
| D | Phantom tools (SKILL.md references files that don't exist) |
| F | References non-existent tools AND no acknowledgment |

Note: Not every skill needs tools. A pure decision-tree skill can score A if tools aren't applicable.

Axis 6: Activation Precision

Would the skill activate correctly on relevant queries and stay silent on irrelevant ones?

| Grade | Criteria |
|-------|----------|
| A | Description has specific keywords matching user language, clear NOT clause, no obvious false-positive vectors |
| B | Good keywords, minor false-positive risk |
| C | Generic keywords that overlap with other skills |
| D | No specific keywords, or NOT clause contradicts intended use |
| F | Description would cause constant false activation |

Axis 7: Visual Artifacts

Does the skill use Mermaid diagrams, code examples, and tables effectively?

| Grade | Criteria |
|-------|----------|
| A | Decision trees as Mermaid flowcharts, tables for comparisons, code examples for concrete patterns |
| B | Some diagrams or tables, but key decision trees still in prose |
| C | Tables used but no Mermaid diagrams for processes |
| D | Prose-only, no visual structure |
| F | Wall of text with no formatting aids |

Axis 8: Output Contracts

Does the skill define what it produces in a format consumable by other agents?

| Grade | Criteria |
|-------|----------|
| A | Explicit output format (JSON schema, markdown template, or structured sections), subagent-consumable |
| B | Output format implied but not explicitly documented |
| C | No output format, but content is structured enough to infer |
| D | Unstructured prose output expected |
| F | N/A (pure reference skill): exempt from this axis |

Axis 9: Temporal Awareness

Does the skill track when knowledge was current and what has changed?

| Grade | Criteria |
|-------|----------|
| A | Timelines in anti-patterns, "as of [date]" markers, CHANGELOG with dates |
| B | Some temporal context, CHANGELOG exists |
| C | No dates on knowledge, but CHANGELOG exists |
| D | No temporal context anywhere, knowledge could be stale |
| F | Contains demonstrably outdated advice without warning |

Axis 10: Documentation Quality

README, CHANGELOG, and reference organization.

| Grade | Criteria |
|-------|----------|
| A | README with quick start, CHANGELOG with dated versions, references well-organized with clear filenames |
| B | README and CHANGELOG exist, references present |
| C | SKILL.md is the only file, but it's well-structured |
| D | No README, no CHANGELOG, disorganized references |
| F | SKILL.md is the only file and it's poorly structured |

Grade Scale

| Letter | Score Range | Meaning |
|--------|-------------|---------|
| A+ | 97-100 | Exemplary: sets the standard |
| A | 93-96 | Excellent: minor improvements possible |
| A- | 90-92 | Very good: a few small gaps |
| B+ | 87-89 | Good: notable room for improvement |
| B | 83-86 | Solid: several areas need work |
| B- | 80-82 | Above average: meaningful gaps |
| C+ | 77-79 | Average: significant improvements needed |
| C | 73-76 | Below average: major gaps |
| C- | 70-72 | Barely adequate |
| D+ | 67-69 | Poor: fundamental issues |
| D | 63-66 | Very poor: needs major rework |
| D- | 60-62 | Near-failing quality |
| F | <60 | Failing: start over |

Overall Grade Computation

Axes 1 (Description) and 2 (Scope) carry 2x weight. All others carry 1x weight. If Axis 8 (Output Contracts) is marked exempt, remove it from both the numerator and the denominator (divide by 11 instead of 12).

Overall = (2×Axis1 + 2×Axis2 + Axis3 + Axis4 + Axis5 + Axis6 + Axis7 + Axis8 + Axis9 + Axis10) / 12

Convert the numeric average to a letter grade using the scale above.
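The computation can be sketched in code; the function names and the axis-score dictionary shape below are assumptions for illustration, not part of the rubric:

```python
# Cutoffs taken directly from the grade scale table above.
GRADE_SCALE = [
    (97, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]

def to_letter(score):
    """Convert a 0-100 score to a letter grade; anything below 60 is F."""
    for cutoff, letter in GRADE_SCALE:
        if score >= cutoff:
            return letter
    return "F"

def overall_grade(scores):
    """scores maps axis number (1-10) to a 0-100 score; axis 8 may be None (exempt)."""
    weights = {axis: (2 if axis in (1, 2) else 1) for axis in range(1, 11)}
    total = weight_sum = 0.0
    for axis, score in scores.items():
        if score is None:  # exempt axis: drop from numerator AND denominator
            continue
        total += weights[axis] * score
        weight_sum += weights[axis]
    avg = total / weight_sum
    return to_letter(avg), round(avg, 1)
```

For example, perfect Description and Scope (100 each) with 80 on every other axis yields (2×100 + 2×100 + 8×80) / 12 ≈ 86.7, a B.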


Output Format

Produce this exact structure:

```markdown
# Skill Grading Report: [skill-name]

**Graded**: [date]
**Overall Grade**: [letter] ([score]/100)

## Axis Grades

| # | Axis | Grade | Score | Key Finding |
|---|------|-------|-------|-------------|
| 1 | Description Quality | [grade] | [score] | [1-line finding] |
| 2 | Scope Discipline | [grade] | [score] | [1-line finding] |
| 3 | Progressive Disclosure | [grade] | [score] | [1-line finding] |
| 4 | Anti-Pattern Coverage | [grade] | [score] | [1-line finding] |
| 5 | Self-Contained Tools | [grade] | [score] | [1-line finding] |
| 6 | Activation Precision | [grade] | [score] | [1-line finding] |
| 7 | Visual Artifacts | [grade] | [score] | [1-line finding] |
| 8 | Output Contracts | [grade] | [score] | [1-line finding] |
| 9 | Temporal Awareness | [grade] | [score] | [1-line finding] |
| 10 | Documentation Quality | [grade] | [score] | [1-line finding] |

## Top 3 Improvements (Highest Impact)

1. **[Axis]: [Specific action]** - [Why this matters, expected grade improvement]
2. **[Axis]: [Specific action]** - [Why this matters, expected grade improvement]
3. **[Axis]: [Specific action]** - [Why this matters, expected grade improvement]

## Detailed Notes

### [Axis name] ([grade])
[2-3 sentences of specific feedback with examples from the skill]

[Repeat for each axis scoring below B+]
```

Quick Grading (Abbreviated)

For rapid triage across many skills, produce only:

| Skill | Overall | Desc | Scope | Disc | Anti | Tools | Activ | Visual | Output | Temp | Docs |
|-------|---------|------|-------|------|------|-------|-------|--------|--------|------|------|
| [name] | [grade] | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Anti-Patterns in Grading

Grade Inflation

**Wrong**: Giving B+ because "it's pretty good" without checking criteria.
**Right**: Match observations to the rubric table literally. If the description lacks a NOT clause, it cannot score above C on Axis 1.

Missing Context

**Wrong**: Grading a pure decision-tree skill poorly on Axis 5 (tools) because it has no scripts.
**Right**: Mark Axis 5 as "A: tools not applicable for this skill type."

Ignoring Phantoms

**Wrong**: Scoring Axis 5 as B because scripts are "referenced."
**Right**: Actually check that every referenced file exists. If `scripts/validate.py` is mentioned but doesn't exist, that's a D.
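The phantom check can be scripted. A minimal sketch, assuming file references in SKILL.md look like `scripts/...`, `references/...`, or `assets/...`; the regex is an assumption, so adjust it to match your conventions:

```python
import re
from pathlib import Path

# Path-like references under the conventional skill subdirectories.
PATH_PATTERN = re.compile(r"\b(?:scripts|references|assets)/[\w./-]+")

def find_phantom_refs(skill_dir: str) -> list[str]:
    """Return referenced paths in SKILL.md that do not exist on disk."""
    text = (Path(skill_dir) / "SKILL.md").read_text(encoding="utf-8")
    referenced = set(PATH_PATTERN.findall(text))
    return sorted(p for p in referenced if not (Path(skill_dir) / p).exists())
```

Any non-empty result is a D on Axis 5 under the rubric above.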