Methodology
A structured way to measure whether AI-generated images and video are actually deliverable — technically, perceptually, and commercially. Nine dimensions, three gates, automated scorecards.
In video QA, “artifact” often names a defect. HarteFact scores outputs anyway — assets, streams, pixels, facts.
Local-first. No cloud dependencies. Designed to run on Apple Silicon using open-source components. The framework is incremental — each phase produces infrastructure consumed by later phases.
Core principles
Model-agnostic by design
Most metrics measure properties of the output file — resolution, texture, temporal stability, color accuracy, identity consistency — regardless of which model produced it. Scoring does not require recalibration when models change.
Algorithmic vs. AI-evaluated
Every score is labeled algorithmic or ai_evaluated. VLM scores are reported with mean and variance and are never presented as equivalent to deterministic metrics.
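One way to make that labeling hard to bypass is to carry it in the score type itself. The sketch below is illustrative only: the `Score` class, the helper names, and the choice of population variance are assumptions, not the framework's actual schema.

```python
from dataclasses import dataclass
from statistics import mean, pvariance
from typing import Literal, Optional, Sequence

ScoreKind = Literal["algorithmic", "ai_evaluated"]


@dataclass(frozen=True)
class Score:
    """Hypothetical score record; field names are illustrative."""
    metric: str
    kind: ScoreKind
    value: float                      # deterministic value, or mean of VLM samples
    variance: Optional[float] = None  # populated only for ai_evaluated scores


def algorithmic(metric: str, value: float) -> Score:
    # Deterministic metrics carry no variance.
    return Score(metric, "algorithmic", value)


def ai_evaluated(metric: str, samples: Sequence[float]) -> Score:
    # Assumption: the VLM is queried several times and the samples are aggregated.
    if len(samples) < 2:
        raise ValueError("need at least two VLM samples to report variance")
    return Score(metric, "ai_evaluated", mean(samples), pvariance(samples))
```

Because the label travels with the value, a report renderer can refuse to place an `ai_evaluated` score in the same column as an `algorithmic` one.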
Tiered gating
Three gates avoid wasting compute on content that has already failed. A clip with the wrong codec never consumes GPU cycles on identity-drift analysis.
Versioned, reproducible
Every run logs framework version, calibration version, and model versions. Re-evaluations are new runs, not silent replacements. Score history is queryable per asset.
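A minimal sketch of what an append-only run log could look like, assuming an in-memory store. `RunRecord`, `ScoreHistory`, and every field name here are hypothetical; the actual scorecard format is not specified in this document.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of one evaluation run."""
    asset_id: str
    framework_version: str
    calibration_version: str
    model_versions: Dict[str, str]
    scores: Dict[str, float]
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class ScoreHistory:
    """Append-only history: a re-evaluation adds a run, it never replaces one."""

    def __init__(self) -> None:
        self._runs: Dict[str, List[RunRecord]] = {}

    def append(self, run: RunRecord) -> None:
        self._runs.setdefault(run.asset_id, []).append(run)

    def for_asset(self, asset_id: str) -> List[RunRecord]:
        # Return a copy so callers cannot mutate the stored history.
        return list(self._runs.get(asset_id, []))
```

The store exposes no update or delete method, which is the simplest way to guarantee that earlier runs stay queryable.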
Pipeline architecture
Three gates separate fast, cheap checks from expensive deep analysis. Each gate is a checkpoint: content that fails a gate skips every later stage, which keeps costs low and feedback fast. Failed content gets immediate, specific feedback identifying the failed dimension, without the cost of downstream scoring.
- Gate 1: Technical specs (Dimension 1)
Pass / fail on file specs, codec, resolution, audio packaging.
- Gate 2: Spatial quality (Dimension 2)
Pass / fail on catastrophic spatial failures (severe artifacts, banding).
- Gate 3: Temporal & audio basics (Dimensions 3 + 4, parallel)
Pass / fail on flicker, scene-cut sanity, audio levels, sync offset.
- Deep: Identity, lighting, brand, prompt adherence (Dimensions 5–9)
Per-character analysis, scene integrity, client-compliance scoring.
- Output: Versioned scorecard
Pass/fail summary, per-dimension detail, annotated frame thumbnails, timeline visualization, per-frame metric trends, client threshold reference.
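The short-circuit behavior of the gates can be sketched as a loop that stops at the first failure. Everything below is an assumption made for illustration: the `GateResult` type, the allowed-codec set, the `severe_banding` field, and the dimension labels are hypothetical, and Gate 3's parallel temporal and audio checks are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class GateResult:
    passed: bool
    failed_dimension: Optional[str] = None
    detail: str = ""


# Hypothetical delivery spec; real values would come from a client profile.
ALLOWED_CODECS = {"h264", "hevc", "prores"}


def gate_1_technical(asset: Dict) -> GateResult:
    codec = asset.get("codec")
    if codec not in ALLOWED_CODECS:
        return GateResult(False, "technical_delivery", f"unsupported codec: {codec}")
    return GateResult(True)


def gate_2_spatial(asset: Dict) -> GateResult:
    # Placeholder for a catastrophic-artifact check (hypothetical field name).
    if asset.get("severe_banding", False):
        return GateResult(False, "spatial_texture", "severe banding")
    return GateResult(True)


Gate = Tuple[str, Callable[[Dict], GateResult]]
GATES: List[Gate] = [("gate_1", gate_1_technical), ("gate_2", gate_2_spatial)]


def run_gates(asset: Dict, gates: List[Gate]) -> Tuple[List[str], GateResult]:
    """Run gates in order and stop at the first failure."""
    executed: List[str] = []
    for name, check in gates:
        executed.append(name)
        result = check(asset)
        if not result.passed:
            return executed, result  # later, more expensive gates never run
    return executed, GateResult(True)
```

With this shape, a clip with the wrong codec returns after `gate_1` with the failed dimension attached, and no later check is ever invoked.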
The nine dimensions
Each dimension owns a distinct axis of output quality. Build phases follow the dependency map: each phase produces infrastructure later phases reuse, so no work is thrown away.
Technical Delivery Compliance
File specs, codecs, container, color space, VMAF, audio packaging. The non-negotiable foundation.
- Resolution / frame rate
- Codec & container
- VMAF score
- Color space
Spatial & Texture Integrity
Per-frame visual quality. Compression artifacts, texture noise, banding, VAE seam detection.
- BRISQUE / NIQE
- Laplacian sharpness
- Color banding
- Wavelet noise analysis
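Of the metrics listed, Laplacian sharpness is the simplest to illustrate. A common formulation, assumed here since the framework's exact definition is not given, is the variance of a 4-neighbor Laplacian over a grayscale frame. The pure-Python sketch below takes a frame as a list of rows; a production pipeline would more likely use an array library.

```python
from statistics import pvariance
from typing import List, Sequence


def laplacian_variance(gray: Sequence[Sequence[float]]) -> float:
    """Variance of the 4-neighbor Laplacian over interior pixels.

    Low values suggest a blurry or flat frame; high values suggest strong edges.
    """
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("frame must be at least 3x3")
    responses: List[float] = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
            responses.append(lap)
    return pvariance(responses)
```

The value is only meaningful relative to a calibrated threshold for a given resolution and content type, which this sketch does not attempt to supply.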
Temporal Consistency & Motion
Stability across frames. Background flicker, optical flow consistency, scene-cut detection.
- Background SSIM
- Optical flow
- Flicker detection
- Scene cuts
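As a rough illustration of flicker detection, the sketch below flags frames whose mean luminance spikes away from both neighbors. This is a simplified stand-in heuristic, not the background-SSIM method named above, and the threshold value is an arbitrary assumption.

```python
from typing import List, Sequence


def flicker_frames(mean_luma: Sequence[float], threshold: float = 8.0) -> List[int]:
    """Return indices of frames whose mean luminance departs from both
    neighbors in the same direction by at least `threshold`.

    A steady ramp (fade in or out) is not flagged, because the frame sits
    between its neighbors rather than above or below both.
    """
    flagged: List[int] = []
    for i in range(1, len(mean_luma) - 1):
        d_prev = mean_luma[i] - mean_luma[i - 1]
        d_next = mean_luma[i] - mean_luma[i + 1]
        same_direction = d_prev * d_next > 0
        if same_direction and min(abs(d_prev), abs(d_next)) >= threshold:
            flagged.append(i)
    return flagged
```

A real pipeline would also need to exclude frames next to detected scene cuts, where a large luminance change is expected.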
Audio Quality
Loudness, clipping, sync offset. Runs in parallel with the temporal pipeline.
- LUFS measurement
- Clipping detection
- Sync offset
- Spectral integrity
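Clipping detection can be sketched as a search for runs of consecutive samples pinned at full scale. The ceiling and minimum run length below are illustrative assumptions, and samples are assumed to be floats normalized to [-1.0, 1.0].

```python
from typing import Sequence


def clipping_ratio(
    samples: Sequence[float], ceiling: float = 0.999, min_run: int = 3
) -> float:
    """Fraction of samples that sit in a run of at least `min_run`
    consecutive samples with |x| >= ceiling.

    Requiring a run avoids flagging a single legitimate full-scale peak.
    """
    if not samples:
        return 0.0
    clipped = 0
    run = 0
    for s in samples:
        if abs(s) >= ceiling:
            run += 1
        else:
            if run >= min_run:
                clipped += run
            run = 0
    if run >= min_run:  # account for a run that reaches the end of the buffer
        clipped += run
    return clipped / len(samples)
```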
Lip Sync Precision
Combines mouth aspect ratio (MAR) with audio phoneme timing via DTW alignment.
- MAR extraction
- DTW alignment
- WhisperX phonemes
- Sync drift over time
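The DTW step can be illustrated with the textbook dynamic-programming recurrence. In this sketch the two inputs are assumed to be per-frame MAR values and a per-frame "expected mouth openness" series derived from phoneme timing; how that second series is built is not specified here and is left as an assumption.

```python
from typing import List, Sequence


def dtw_cost(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic time warping cost with absolute-difference distance.

    Returns the minimum cumulative cost of aligning series `a` to series `b`,
    allowing either series to be locally stretched.
    """
    if not a or not b:
        raise ValueError("both series must be non-empty")
    inf = float("inf")
    n, m = len(a), len(b)
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # stretch b
                cost[i][j - 1],      # stretch a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]
```

A low cost means the mouth motion can be warped onto the audio timing cheaply. Measuring sync drift over time would additionally require recovering the warping path, which this sketch omits.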
Character & Identity Integrity
Face identity drift, hand failures, body proportions, teeth, clothing consistency.
- InsightFace cosine similarity
- Hand failure logging
- Body proportions
- Skin tone stability
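Identity drift by cosine similarity can be sketched independently of the embedding model. The code below assumes face embeddings have already been extracted as plain float vectors (for example by InsightFace, whose API is not shown here); the function names and the 0.5 threshold are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    if len(u) != len(v):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero-length embedding")
    return dot / (norm_u * norm_v)


def identity_drift(
    reference: Sequence[float],
    per_frame: Sequence[Sequence[float]],
    min_similarity: float = 0.5,  # illustrative threshold, not a calibrated value
) -> List[int]:
    """Return indices of frames whose face embedding falls below the
    similarity threshold relative to the reference embedding."""
    return [
        i
        for i, emb in enumerate(per_frame)
        if cosine_similarity(reference, emb) < min_similarity
    ]
```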
Lighting & Scene Integrity
Shadow coherence, luminance tracking, color temperature stability, reflection plausibility.
- Shadow masking
- Luminance per region
- Color temperature drift
- Reflection flagging
Brand & Client Compliance
Per-client palette, talent reference, logo placement, LUT comparison, typography.
- Brand HEX Delta-E
- Talent face match
- LUT comparison
- Logo / wordmark presence
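A brand HEX Delta-E check can be sketched with the CIE76 formula, which is the Euclidean distance in CIELAB. The document does not say which Delta-E variant the framework uses, so CIE76 is an assumption chosen for brevity; the conversion assumes sRGB input and a D65 white point.

```python
import math
from typing import Tuple

_D65 = (0.95047, 1.0, 1.08883)  # reference white


def _srgb_to_linear(c: float) -> float:
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def hex_to_lab(hex_color: str) -> Tuple[float, float, float]:
    h = hex_color.lstrip("#")
    r, g, b = (_srgb_to_linear(int(h[i:i + 2], 16) / 255) for i in (0, 2, 4))
    # Linear sRGB -> CIE XYZ (D65)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t: float) -> float:
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / _D65[0]), f(y / _D65[1]), f(z / _D65[2])
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)


def delta_e76(hex_a: str, hex_b: str) -> float:
    """CIE76 color difference between two sRGB hex colors."""
    return math.dist(hex_to_lab(hex_a), hex_to_lab(hex_b))
```

In use, a sampled pixel color would be compared against each entry in the client palette and flagged when the smallest difference exceeds a per-client tolerance.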
Prompt & Action Adherence
VLM-evaluated framing, composition, physics plausibility, object/spatial flagging.
- VLM scene description
- Framing & composition
- Physics flags
- Slideshow detection
Includes ai_evaluated scores; reported with mean + variance.
What this framework is not
- Not a scoring rubric for taste, creativity, or commercial appeal. Aesthetic judgment remains human.
- Not a model leaderboard. The framework benchmarks output properties; model comparisons are a separate activity built on top of the same infrastructure.
- Not a SaaS dashboard. Phase 1 ships a local pipeline and a versioned scorecard format, not a hosted product.
- Not a substitute for human QC on edge cases. The system is designed to scale review, not to replace the final sign-off on high-stakes deliverables.
Print-on-demand extension
A separate addendum extends the framework with print-specific quality metrics: CMYK gamut warnings, ink coverage limits, transparency edge fringing, design placement safety, and pre-generation input validation.
Pilot engagements
Phase 1 (Technical Delivery) and Phase 1b (Identity Consistency) are in active build. We're scoping a small number of pilot engagements with production studios, agencies, and POD operators for the second half of 2026.