SDTagNet: Leveraging Text-Annotated Navigation Maps for Online HD Map Construction

Accepted for publication at NeurIPS 2025

Fabian Immel¹, Jan-Hendrik Pauls², Richard Fehler¹, Frank Bieder¹, Jonas Merkert², Christoph Stiller²

¹FZI Research Center for Information Technology, ²Karlsruhe Institute of Technology

Example Results of SDTagNet on the Argoverse 2 dataset in comparison with the existing navigation map encoding methods PMapNet and SMERF.

Abstract

Autonomous vehicles rely on detailed and accurate environmental information to operate safely. High-definition (HD) maps offer a promising solution, but their high maintenance cost poses a significant barrier to scalable deployment. This challenge is addressed by online HD map construction methods, which generate local HD maps from live sensor data. However, these methods are inherently limited by the short perception range of onboard sensors. To overcome this limitation and improve general performance, recent approaches have explored the use of standard-definition (SD) maps as priors, which are significantly easier to maintain. We propose SDTagNet, the first online HD map construction method that fully utilizes the information of widely available SD maps, like OpenStreetMap, to enhance far-range detection accuracy. Our approach introduces two key innovations. First, in contrast to previous work, we incorporate not only polyline SD map data with manually selected classes, but also additional semantic information in the form of textual annotations. In this way, we enrich SD vector map tokens with NLP-derived features, eliminating the dependency on predefined specifications or exhaustive class taxonomies. Second, we introduce a point-level SD map encoder together with orthogonal element identifiers to uniformly integrate all types of map elements. Experiments on Argoverse 2 and nuScenes show that this boosts map perception performance by up to +5.9 mAP (+45%) compared to map construction without priors and by up to +3.2 mAP (+20%) compared to previous approaches that already use SD map priors.

Overview of the model architecture of SDTagNet. To fully exploit textual annotations and all element types in large public SD map databases like OpenStreetMap, SDTagNet introduces novel NLP tag embedding and SD map encoder modules. Text annotation embeddings are first computed with a BERT embedding model. They are then fused with scene-level context in an SD map encoder, which uses graph transformer-like methods to flexibly encode points, polylines, and element relations. The encoded information is finally supplied to the base model via cross-attention.
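The last step of the overview, supplying encoded SD map tokens to the base model via cross-attention, can be illustrated with a minimal sketch. This is a generic single-head scaled dot-product cross-attention in plain Python, not the authors' implementation: the learned query/key/value projections, multiple heads, and residual connections of a real transformer layer are omitted, and the function names are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: list of d-dim vectors (here standing in for base model features).
    keys, values: lists of vectors (here standing in for encoded SD map tokens).
    Returns one attended vector per query.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    for q in queries:
        # Similarity of this query to every SD map token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy example: one query that aligns with the first of two map tokens.
attended = cross_attention(
    queries=[[1.0, 0.0]],
    keys=[[10.0, 0.0], [0.0, 10.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
)
```

Because the toy values are one-hot, the attended vector equals the attention weights, which makes it easy to see that the query draws almost all of its information from the first token.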

Visualization of the SD map prior input data utilized by existing methods. Existing approaches are limited to rasterized images or polylines with manually defined classes. SDTagNet is the first method that can handle open-vocabulary textual annotations and diverse element types such as points, polylines, and relational information.
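To make the three kinds of input concrete, the following is a minimal sketch of how text-annotated points, polylines, and relations might be represented before encoding. The class names, field names, and the `key=value` serialization are hypothetical and chosen only for illustration; the tags themselves follow common OpenStreetMap conventions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SDMapElement:
    """A point or polyline with free-form textual annotations (hypothetical schema)."""
    element_id: int
    kind: str  # "point" or "polyline"
    points: List[Tuple[float, float]]
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class SDMapRelation:
    """A relation that links several elements and carries its own tags."""
    member_ids: List[int]
    tags: Dict[str, str] = field(default_factory=dict)


def tags_to_text(tags: Dict[str, str]) -> str:
    """Serialize a tagset to one string that a text embedding model could consume.

    The "key=value; ..." format is an assumption, not taken from the paper.
    """
    return "; ".join(f"{k}={v}" for k, v in sorted(tags.items()))


road = SDMapElement(
    element_id=1,
    kind="polyline",
    points=[(0.0, 0.0), (50.0, 0.0)],
    tags={"highway": "residential", "lanes": "2", "oneway": "yes"},
)
signal = SDMapElement(
    element_id=2,
    kind="point",
    points=[(50.0, 0.0)],
    tags={"highway": "traffic_signals"},
)
turn_rule = SDMapRelation(
    member_ids=[1, 2],
    tags={"type": "restriction", "restriction": "no_left_turn"},
)

road_text = tags_to_text(road.tags)
```

Keeping tags as an open dictionary rather than a fixed class enum is what distinguishes this kind of input from polylines with manually defined classes.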

Example of the tag embedding contrastive pretraining objective. A positive sample is drawn from tagsets that share the same semantically meaningful tags but differ in the non-meaningful ones (such as the street name). Negative samples are drawn from all other unique tagsets. In practice, the number of negative samples is much larger than depicted here to prevent unstable training.
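The sampling rule in this caption can be sketched as follows. The set of "meaningful" keys, the function names, and the use of an InfoNCE-style loss are all assumptions for illustration; the caption specifies only how positives and negatives are chosen, not the exact loss.

```python
import math
import random

# Hypothetical choice of which tag keys count as semantically meaningful.
MEANINGFUL_KEYS = {"highway", "lanes", "oneway"}


def meaningful_part(tagset):
    """Keep only the tags assumed to matter for map perception."""
    return frozenset((k, v) for k, v in tagset.items() if k in MEANINGFUL_KEYS)


def sample_contrastive(anchor, tagsets, num_negatives, rng):
    """Pick one positive and several negatives for an anchor tagset.

    Positive: same meaningful tags as the anchor, different remaining tags.
    Negatives: tagsets whose meaningful tags differ from the anchor's.
    """
    key = meaningful_part(anchor)
    positives = [t for t in tagsets if t != anchor and meaningful_part(t) == key]
    negatives = [t for t in tagsets if meaningful_part(t) != key]
    if not positives:
        raise ValueError("no positive sample available for this anchor")
    k = min(num_negatives, len(negatives))
    return rng.choice(positives), rng.sample(negatives, k)


def info_nce(sim_pos, sim_negs, temperature=0.1):
    """InfoNCE-style loss from anchor-positive and anchor-negative similarities."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]


tagsets = [
    {"highway": "residential", "oneway": "yes", "name": "Main Street"},
    {"highway": "residential", "oneway": "yes", "name": "Oak Avenue"},
    {"highway": "primary", "lanes": "4", "name": "Main Street"},
    {"highway": "footway"},
]
positive, negatives = sample_contrastive(tagsets[0], tagsets, 2, random.Random(0))
```

Here the two residential one-way streets form a positive pair because they differ only in their street name, while the primary road and the footway serve as negatives. The loss is small when the anchor is more similar to the positive than to any negative and grows otherwise.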

Detailed design of the SD map encoder and its queries. Each point query is composed of the positional sin/cos encoding of the point, the respective tag embedding and orthogonal random features (ORF) that function as element identifiers and can model graph edges.
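A point query built from these three parts can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' design: the parts are concatenated here (the caption does not say whether they are concatenated or summed), the frequency schedule is arbitrary, the orthogonal random features are produced by Gram-Schmidt on Gaussian vectors, and all points of one element are assumed to share one identifier. All function names are hypothetical.

```python
import math
import random


def sincos_encoding(x, y, num_freqs=4):
    """Sinusoidal positional encoding of a 2D point (coordinates assumed normalized)."""
    feats = []
    for coord in (x, y):
        for i in range(num_freqs):
            freq = 2.0 ** i
            feats.append(math.sin(freq * coord))
            feats.append(math.cos(freq * coord))
    return feats


def orthogonal_random_features(num_ids, dim, rng):
    """Draw num_ids mutually orthonormal random vectors (requires num_ids <= dim)."""
    if num_ids > dim:
        raise ValueError("cannot have more orthogonal vectors than dimensions")
    basis = []
    while len(basis) < num_ids:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Remove the components along the vectors already chosen.
        for b in basis:
            dot = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - dot * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-8:
            basis.append([vi / norm for vi in v])
    return basis


def build_point_query(point, tag_embedding, element_identifier):
    """Concatenate position encoding, tag embedding, and element identifier."""
    return sincos_encoding(*point) + list(tag_embedding) + list(element_identifier)


rng = random.Random(0)
identifiers = orthogonal_random_features(num_ids=3, dim=8, rng=rng)

# Two points of the same polyline share identifier 0; the tag embedding is a
# placeholder for a vector produced by the text embedding model.
tag_embedding = [0.1, -0.2, 0.3, 0.0]
q_a = build_point_query((0.0, 0.0), tag_embedding, identifiers[0])
q_b = build_point_query((0.5, 0.0), tag_embedding, identifiers[0])
```

Because the identifiers are orthogonal, the dot product between identifier parts is 1 for points of the same element and 0 for points of different elements, which gives attention layers a simple signal for grouping points.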

Comparison of SD map prior encoding methods on Argoverse 2 with a geographical split. *: With the 7 classes from SMERF in the input features, which are not used in the original work. †: With OSM nodes in the input features, which are not used in the original work. All models are trained for 24 epochs.

Ablation study of different encoder components on the Argoverse 2 dataset. All experiments are in the near range setting and all models are trained for 24 epochs. †: With OSM nodes in the input features, which are not used in the original work. + BEV Ft.: With BEV features as an additional prior mode.

Further qualitative comparison of SDTagNet with PMapNet (all info.) and SMERF (all info.) on Argoverse 2 in the far range setting. SDTagNet is able to utilize text-annotated information such as the number of lanes and one-way roads to improve prediction results.

BibTeX

@inproceedings{SDTagNet2025,
  title={SDTagNet: Leveraging Text-Annotated Navigation Maps for Online HD Map Construction},
  author={Fabian Immel and Jan-Hendrik Pauls and Richard Fehler and Frank Bieder and Jonas Merkert and Christoph Stiller},
  booktitle={39th Conference on Neural Information Processing Systems (NeurIPS)},
  year={2025},
  url={https://immel-f.github.io/SDTagNet/}
}
