Text-to-CAD Benchmark

🛠️ MUSE

Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation

A Text-to-CAD benchmark for complex, editable B-Rep assemblies. Pairs practical design instances with structured Design Specifications and evaluates LLM outputs through a three-stage funnel: code check → geometric check → design-intent alignment.

Xiaoyu Dong1  ·  Zhi Li2,*  ·  Xiao-Ming Wu1,*

1 The Hong Kong Polytechnic University    2 Curvature Flow Co., Ltd.

* Co-corresponding authors


Figure 1. A subset of design instances from the MUSE dataset: practical, multi-component CAD assemblies that must satisfy real engineering constraints, not just look like the reference shape.

Abstract

Beyond geometric similarity, toward engineering usability

Large language models (LLMs) have recently advanced text-driven 3D generation, yet Text-to-CAD remains far from supporting industrial product design. Existing benchmarks focus primarily on generating isolated primitive CAD models and evaluate them using geometric similarity metrics that fail to capture functionality, manufacturability, and assemblability.

To address this gap, we introduce MUSE, a Text-to-CAD benchmark focused on complex, editable boundary representation (B-Rep) assemblies. MUSE pairs practical design instances with structured Design Specifications and evaluates generated models through a three-stage protocol: code check, geometric check, and design-intent alignment. The final stage uses design-specific rubrics to assess functionality, manufacturability, and assemblability, moving beyond shape matching toward practical design quality.

To enable scalable evaluation, we use a rubric-based visual language model (VLM) judge and validate its reliability through human annotation. Experiments on closed-source and open-source LLMs reveal a clear failure cascade from executable code to valid geometry and finally to engineering-ready design; even the strongest models achieve limited success on fine-grained engineering criteria. Together, MUSE provides a realistic benchmark and evaluation framework for advancing Text-to-CAD from geometric generation toward true engineering design.

Evaluation Funnel

Three stages, one fail = downstream zero

Every (case, model, sample) flows through:

Stage 1 · Code Check
Sandbox Success
Does the generated CadQuery script execute without raising an exception? Pure binary signal; no geometry inspection yet.
→ Sandbox
Stage 2 · Geometry Check
Geometric Validity
Four binary OCCT (Open CASCADE Technology) checks: watertight, manifold, self-intersection free, overlap free. Overall means all four passed.
→ Watertight · Manifold · Self-Int · Overlap
Stage 3 · Design Intent
Rubric-Based VLM Judge
Gemini-3.1-Pro scores three pillars against a per-case rubric: Functionality, Manufacturability, Assemblability. Final is the mean of the three pillar scores.
→ Functionality · Manufacturability · Assemblability
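
The gating logic of the funnel can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MUSE implementation: the names `code_check`, `GeometryChecks`, `PillarScores`, and `final_score` are hypothetical, pillar scores are assumed to lie in [0, 1], a timeout is assumed to count as a Stage 1 failure, and a plain subprocess stands in for whatever sandbox the benchmark actually uses. The four geometry flags are taken as given here; computing them with OCCT is out of scope for the sketch.

```python
import subprocess
import sys
from dataclasses import dataclass
from statistics import mean
from typing import Optional


def code_check(script: str, timeout_s: float = 60.0) -> bool:
    """Stage 1 stand-in: True if the script exits cleanly in a child process."""
    try:
        proc = subprocess.run(
            [sys.executable, "-"],  # "-" makes the interpreter read the script from stdin
            input=script,
            text=True,
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # assumption: a hang is treated like a failed run
    return proc.returncode == 0


@dataclass
class GeometryChecks:
    """Stage 2: the four binary checks named on this page."""
    watertight: bool
    manifold: bool
    self_intersection_free: bool
    overlap_free: bool

    @property
    def overall(self) -> bool:
        # "Overall" passes only when all four checks pass.
        return all((self.watertight, self.manifold,
                    self.self_intersection_free, self.overlap_free))


@dataclass
class PillarScores:
    """Stage 3: judge scores per pillar (assumed scale: 0.0 to 1.0)."""
    functionality: float
    manufacturability: float
    assemblability: float


def final_score(sandbox_ok: bool,
                geometry: Optional[GeometryChecks],
                pillars: Optional[PillarScores]) -> float:
    """One failed stage zeroes everything downstream; otherwise average the pillars."""
    if not sandbox_ok:
        return 0.0
    if geometry is None or not geometry.overall:
        return 0.0
    if pillars is None:
        return 0.0
    return mean((pillars.functionality,
                 pillars.manufacturability,
                 pillars.assemblability))
```

For example, a sample that runs and passes all four geometry checks with pillar scores 0.9, 0.6, and 0.3 would receive a Final of about 0.6 under these assumptions, while the same pillar scores paired with a failed self-intersection check would receive 0.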

Top 5: leaderboard preview

Closed-source still leads, but not by as much as you'd think

Below: top 5 models by Final Score. See the full table (15+ models, 10 metrics, sortable / filterable) on the leaderboard page.

# Model Sandbox Geom. Valid Function Manuf. Assembl. Final ↓

Citation

If you use MUSE, please cite

@misc{dong2026muse,
  title         = {MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation},
  author        = {Xiaoyu Dong and Zhi Li and Xiao-Ming Wu},
  year          = {2026},
  eprint        = {2605.28579},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2605.28579}
}