Text-to-CAD Benchmark

🛠️ MUSE

Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation

A Text-to-CAD benchmark for complex, editable B-Rep assemblies. Pairs practical design instances with structured Design Specifications and evaluates LLM outputs through a three-stage funnel: code check → geometric check → design-intent alignment.

Xiaoyu Dong¹ · Zhi Li²,* · Xiao-Ming Wu¹,*

¹ The Hong Kong Polytechnic University · ² Curvature Flow Co., Ltd.

* Co-corresponding authors

Abstract

Beyond geometric similarity, toward engineering usability

Large language models (LLMs) have recently advanced text-driven 3D generation, yet Text-to-CAD remains far from supporting industrial product design. Existing benchmarks focus primarily on generating isolated primitive CAD models and evaluate them using geometric similarity metrics that fail to capture functionality, manufacturability, and assemblability.

To address this gap, we introduce MUSE, a Text-to-CAD benchmark focused on complex, editable boundary representation (B-Rep) assemblies. MUSE pairs practical design instances with structured Design Specifications and evaluates generated models through a three-stage protocol: code check, geometric check, and design-intent alignment. The final stage uses design-specific rubrics to assess functionality, manufacturability, and assemblability, moving beyond shape matching toward practical design quality.

To enable scalable evaluation, we use a rubric-based vision-language model (VLM) judge and validate its reliability through human annotation. Experiments on closed-source and open-source LLMs reveal a clear failure cascade from executable code to valid geometry and finally to engineering-ready design: even the strongest models achieve limited success on fine-grained engineering criteria. MUSE thus provides a realistic benchmark and evaluation framework for advancing Text-to-CAD from geometric generation toward true engineering design.

Evaluation Funnel

Three stages; one failure zeroes everything downstream

Every (case, model, sample) triple flows through three stages:

Stage 1 · Code Check

Sandbox Success

Does the generated CadQuery script execute without raising an exception? This is a purely binary signal; the geometry is not inspected yet.

→ Sandbox
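As a rough illustration of the Stage 1 signal, the sketch below runs a generated script in a separate Python process and reports whether it exited cleanly. The helper name `sandbox_success`, the timeout, and the use of a plain subprocess are assumptions made for this example; the benchmark's actual sandbox is not described here and would likely add stronger isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def sandbox_success(script_source: str, timeout_s: float = 60.0) -> bool:
    """Return True if the script runs to completion without raising.

    Hypothetical helper: a subprocess with a timeout stands in for a real
    sandbox, which would also restrict filesystem and network access.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script_path = Path(workdir) / "candidate.py"
        script_path.write_text(script_source, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script_path)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Assumption: a script that hangs counts as a failure.
            return False
    # An uncaught exception makes the interpreter exit with a non-zero code.
    return result.returncode == 0
```

Calling `sandbox_success(generated_code)` yields a single boolean per sample, which matches the binary nature of this stage.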

Stage 2 · Geometry Check

Geometric Validity

Four binary OCCT (Open CASCADE Technology) checks: watertight, manifold, self-intersection-free, and overlap-free. Overall is true only when all four pass.

→ Watertight · Manifold · Self-Int · Overlap
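The Overall flag can be pictured as a logical AND over the four checks. The sketch below only shows that aggregation; the class and field names are hypothetical, and the individual booleans are assumed to come from OCCT-based checks that are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeometryChecks:
    """Outcome of the four binary checks for one generated model.

    Hypothetical container: the field names mirror the metric labels above.
    """

    watertight: bool
    manifold: bool
    self_intersection_free: bool
    overlap_free: bool

    @property
    def overall(self) -> bool:
        # Overall passes only when every individual check passes.
        return all(
            (
                self.watertight,
                self.manifold,
                self.self_intersection_free,
                self.overlap_free,
            )
        )
```

A model that is watertight and manifold but self-intersecting would therefore report three passing checks and a failing Overall.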

Stage 3 · Design Intent

Rubric-Based VLM Judge

Gemini-3.1-Pro scores three pillars against a per-case rubric: Functionality, Manufacturability, and Assemblability. Final is the mean of the three pillar scores.

→ Functionality · Manufacturability · Assemblability
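Putting the three stages together, one reading of the funnel is sketched below. The function name `final_score` is hypothetical. It also assumes that pillar scores are plain numbers on a common scale and that a Stage 1 or Stage 2 failure sets the final score to zero; the page states the zeroing rule only in summary form.

```python
from statistics import fmean


def final_score(
    sandbox_ok: bool,
    geometry_ok: bool,
    functionality: float,
    manufacturability: float,
    assemblability: float,
) -> float:
    """Combine the three stages into one score for a single sample.

    Assumption: failing the code check or the geometry check zeroes the
    downstream design-intent score, so the pillar scores are ignored.
    """
    if not sandbox_ok or not geometry_ok:
        return 0.0
    # Final is the unweighted mean of the three pillar scores.
    return fmean((functionality, manufacturability, assemblability))
```

Under these assumptions, a sample with strong rubric scores still ends at zero if its script fails to execute or its geometry is invalid, which is the failure cascade the funnel is meant to expose.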

Top 5 — leaderboard preview

Closed-source still leads, but not by as much as you'd think

Below are the top 5 models by Final Score. The full table (15+ models, 10 metrics, sortable and filterable) is on the leaderboard page.

#	Model	Sandbox	Geom. Valid	Function	Manuf.	Assembl.	Final ↓

View full leaderboard →

Citation

If you use MUSE, please cite

@misc{dong2026muse,
  title         = {MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation},
  author        = {Xiaoyu Dong and Zhi Li and Xiao-Ming Wu},
  year          = {2026},
  eprint        = {2605.28579},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2605.28579}
}