ArchSym: Detecting 3D-Grounded Architectural Symmetries in the Wild

Hanyu Chen1, Ruojin Cai2, Steve Marschner1, Noah Snavely1
1Cornell University 2Kempner Institute, Harvard University
ArchSym teaser figure

We introduce a single-view reflectional symmetry detector, trained on ArchSym, a newly curated dataset of in-the-wild architectural symmetries.

Abstract

Symmetry detection is a fundamental problem in computer vision, and symmetries serve as powerful priors for downstream tasks. However, existing learning-based methods for detecting 3D symmetries from single images have been almost exclusively trained and evaluated on object-centric or synthetic datasets, and therefore fail to generalize to real-world scenes. Furthermore, because monocular inputs are inherently scale-ambiguous, localizing a 3D symmetry plane is ill-posed, and many prior methods only predict plane orientation. We address these limitations by introducing the first framework for detecting 3D-grounded reflectional symmetries from single, in-the-wild RGB images, with a focus on architectural landmarks. Our approach has two key components: (1) a scalable annotation pipeline that automatically curates ArchSym, a large-scale dataset of architectural symmetries, from SfM reconstructions via cross-view image matching; and (2) a single-view symmetry detector that localizes symmetries in 3D by parameterizing them as signed distance maps defined relative to predicted scene geometry. We validate the annotation pipeline against geometry-based alternatives and show that the detector significantly outperforms state-of-the-art baselines on the new benchmark.

Interactive Symmetry Viewer

We visualize predicted reflectional symmetry planes from three different methods on a shared point cloud. The input image is shown on the bottom right. See below for more qualitative comparisons.


Dataset Curation

We propose an image-matching-based approach to annotating reflectional symmetries. We first sample image pairs from an SfM reconstruction of each scene. For each pair, we horizontally flip one image, find dense matches against the other image with MASt3R, unproject the matched pixels to 3D points using depth maps, and fit a plane to the resulting point pairs. The final symmetry plane annotations are then obtained by clustering the candidate planes with DBSCAN, as sketched below.
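To make the fitting and clustering steps concrete, here is a minimal sketch in NumPy/scikit-learn, assuming the matched pixels have already been unprojected into an (N, 2, 3) array of 3D point pairs; all function and variable names are illustrative, not the actual pipeline code. The key geometric fact is that a matched pair (p, q) under a reflection constrains the plane to pass through the midpoint (p + q)/2 with normal parallel to p − q, so the fit reduces to averaging difference directions and midpoint projections.

```python
# A minimal sketch, not the released pipeline: fit a mirror plane to
# unprojected cross-view matches, then cluster candidates with DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

def fit_reflection_plane(point_pairs):
    """Fit a mirror plane n.x = d to an (N, 2, 3) array of pairs (p, q)."""
    p, q = point_pairs[:, 0], point_pairs[:, 1]
    diffs = p - q
    # Flip difference vectors into a common hemisphere before averaging.
    ref = diffs[np.argmax(np.linalg.norm(diffs, axis=1))]
    diffs = diffs * np.where(diffs @ ref < 0.0, -1.0, 1.0)[:, None]
    n = diffs.sum(axis=0)
    n /= np.linalg.norm(n)
    # Every reflection plane passes through the midpoints of its pairs.
    midpoints = 0.5 * (p + q)
    return n, float(np.mean(midpoints @ n))

def cluster_planes(planes, eps=0.05):
    """Cluster candidate planes (n, d) with DBSCAN; average each cluster.
    eps mixes normal and offset units, so offsets should be expressed in
    a normalized scene scale."""
    feats = np.array([np.append(n, d) for n, d in planes])
    # Identify (n, d) with (-n, -d): the same plane, opposite orientation.
    feats = feats * np.where(feats[:, [0]] < 0.0, -1.0, 1.0)
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(feats)
    return [feats[labels == k].mean(axis=0) for k in sorted(set(labels) - {-1})]
```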

Why image-aware extraction helps

As a symmetry annotation baseline, we run Langevin, a purely geometry-based method, on the dense COLMAP point clouds to extract reflectional symmetries. We observe that it is highly sensitive to incomplete point clouds (e.g., Frauenkirche, Isa Khan's Tomb) and detects architecturally implausible planes (e.g., a horizontal plane on the Arc de Triomphe and a misaligned plane on the Pisa Cathedral). In contrast, our method extracts semantically correct symmetries.

Comparison of geometry-only versus image-aware symmetry extraction
ArchSym dataset statistics

Dataset Statistics

The ArchSym dataset consists of 93 landmark scenes and a total of 34,177 images. Most scenes contain one or two symmetries, while a few contain four (e.g., clock towers) or eight (e.g., octagonal buildings). The images present a wide range of challenges, including occlusions, partial views, and varying camera parameters and illumination.

Single-View Symmetry Detector

In our single-view symmetry detector, a frozen VGGT backbone first extracts features from the input image. A transformer decoder then produces symmetry-aware instance queries that identify candidate planes, and a conditioned DPT-style prediction head outputs signed distance maps relative to the predicted scene geometry. Final 3D-grounded symmetry planes are recovered by fitting plane parameters to the predicted signed distances and 3D points, as sketched below.

Overview of the ArchSym single-view symmetry detector
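As a sketch of the final fitting step, assume the network outputs a per-pixel point map and, for each symmetry instance, a signed-distance map with a validity mask (variable names here are hypothetical, not the released implementation). The plane parameters then follow from a single linear least-squares solve:

```python
# A minimal sketch of plane recovery from predicted signed distances;
# `pts` is (H, W, 3), `sdm` is (H, W), `mask` is a boolean (H, W) map.
import numpy as np

def recover_plane(pts, sdm, mask):
    """Least-squares plane n.x + d = 0 whose signed distance n.p + d
    best matches the predicted map over all valid pixels."""
    P = pts[mask].reshape(-1, 3)
    s = sdm[mask].ravel()
    A = np.hstack([P, np.ones((len(P), 1))])   # unknowns stacked as (n, d)
    x, *_ = np.linalg.lstsq(A, s, rcond=None)
    n, d = x[:3], x[3]
    scale = np.linalg.norm(n)                  # rescale to a unit normal
    return n / scale, d / scale
```

Because the signed distances and the 3D points live in the same prediction frame, this solve is well-posed regardless of the unknown metric scale of the scene.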

Why signed-distance maps

Directly regressing full plane parameters, a normal $\mathbf{n}$ and an offset $d$, from a single image is ill-posed due to the inherent scale ambiguity of monocular 3D reconstruction. Instead, our model predicts signed-distance maps defined relative to its own point-map predictions. This provides a natural coordinate frame for grounding symmetry planes and a scale-consistent supervision signal, illustrated in the sketch below. Even compared to regressing plane parameters in this fixed coordinate frame, the dense signed-distance formulation yields more accurate alignment with the scene geometry, as shown in the comparisons that follow.
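To make the scale-consistency argument explicit, the sketch below constructs a signed-distance target from a plane expressed in the point map's own coordinate frame; it illustrates the supervision signal rather than reproducing the authors' training code.

```python
# Illustrative only: the supervision target for one symmetry instance.
import numpy as np

def signed_distance_map(pts, n, d):
    """Per-pixel signed distance of point map `pts` (H, W, 3) to the
    plane {x : n.x + d = 0}, for a unit normal n; returns (H, W)."""
    return pts @ n + d

# Rescaling the whole scene by k rescales the targets by k as well:
#   signed_distance_map(k * pts, n, k * d) == k * signed_distance_map(pts, n, d)
# so the supervision never asks the network to resolve absolute scale.
```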

Qualitative Comparisons

We compare our method with a recent single-view symmetry detector, Reflect3D, finetuned on our dataset, and with a simple baseline, Direct, in which symmetry plane parameters are regressed directly from frozen VGGT features. Both baselines often struggle with partially visible symmetries and predict misaligned planes; our signed-distance-based parameterization allows our method to consistently predict planes that are well aligned with the scene geometry.

Reflection symmetry prediction examples

Note: Reflect3D predicts plane normals but not offsets; for visualization, we anchor each predicted plane at a point on the nearest ground-truth plane.
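Concretely, this anchoring amounts to choosing the offset so that the plane with the predicted normal passes through the ground-truth point; a short sketch (names hypothetical):

```python
import numpy as np

def anchor_plane(n_pred, p_anchor):
    """Place a plane with predicted unit normal through an anchor point
    taken from the nearest ground-truth plane: n.x + d = 0 at p_anchor."""
    return n_pred, -float(np.dot(n_pred, p_anchor))
```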

Quantitative Evaluation

Quantitative results are averaged over 19 held-out test scenes. Our method outperforms both baselines on normal-only prediction ($\mathrm{Geo}$, $\mathrm{F}@x^\circ$) and on full-plane localization ($\mathcal{E}_{\text{dense}}$). We refer the reader to the paper for details on the evaluation metrics.

| Method | $\mathrm{Geo} \downarrow$ | $\mathrm{F}@1^\circ \uparrow$ | $\mathrm{F}@5^\circ \uparrow$ | $\mathrm{F}@15^\circ \uparrow$ | $\mathcal{E}_{\text{dense}} \downarrow$ |
| --- | --- | --- | --- | --- | --- |
| Reflect3D | 10.46 | 0.07 | 0.34 | 0.55 | — |
| Direct | 5.06 | 0.16 | 0.64 | 0.81 | 0.18 |
| Ours | 3.71 | 0.25 | 0.70 | 0.84 | 0.13 |

(Reflect3D does not predict plane offsets, so $\mathcal{E}_{\text{dense}}$ is not reported for it.)
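For intuition, here is a hedged sketch of a sign-invariant normal-angle error and a thresholded F-score; the exact matching protocol and the dense localization error $\mathcal{E}_{\text{dense}}$ are defined in the paper, so the greedy matching below is an assumption, not the benchmark's official code.

```python
import numpy as np

def normal_angle_deg(n_pred, n_gt):
    """Geodesic angle between unit normals, ignoring orientation sign."""
    cos = abs(float(np.dot(n_pred, n_gt)))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def f_score(pred_normals, gt_normals, thresh_deg):
    """F1 with one-to-one greedy matching at an angular threshold."""
    matched, tp = set(), 0
    for n_p in pred_normals:
        cands = [(normal_angle_deg(n_p, n_g), j)
                 for j, n_g in enumerate(gt_normals) if j not in matched]
        if cands:
            ang, j = min(cands)
            if ang <= thresh_deg:
                matched.add(j)
                tp += 1
    prec = tp / max(len(pred_normals), 1)
    rec = tp / max(len(gt_normals), 1)
    return 2 * prec * rec / max(prec + rec, 1e-9)
```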

BibTeX

@article{replace_me,
  title   = {ArchSym: Detecting 3D-Grounded Architectural Symmetries in the Wild},
  author  = {Replace with final citation},
  journal = {Replace with venue},
  year    = {Replace with year}
}

Acknowledgments

This work was funded in part by the National Science Foundation (IIS-2212084) and by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean Government (MSIT) (No. RS-2024-00457882, National AI Research Lab Project).