MarkedBERT: Integrating Traditional IR Cues in Pre-trained Language Models for Passage Retrieval

Published in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, Virtual, China, 2020


The Information Retrieval (IR) community has witnessed a flourishing development of deep neural networks; however, only a few have managed to beat strong baselines. Among them, models like DRMM and DUET achieved better results thanks to their proper handling of exact match signals. More recently, the application of pre-trained language models to IR tasks has achieved impressive results, exceeding all previous work. In this paper, we assume that established IR cues like exact term-matching, proven to be valuable for deep neural models, can be used to augment the direct supervision from labeled data when training these pre-trained models. To study the effectiveness of this assumption, we propose MarkedBERT, a modified version of BERT, one of the most popular models pre-trained via language modeling tasks. MarkedBERT integrates exact match signals using a marking technique that locates and highlights exactly matched query-document terms using marker tokens. Experiments on the MS MARCO Passage Ranking task show that our rather simple approach is effective. We find that augmenting the input with marker tokens allows the model to focus on valuable text sequences for IR.
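To make the marking idea concrete, here is a minimal sketch of how exactly matched query-passage terms could be wrapped in marker tokens before the text is fed to BERT. It is an illustration under stated assumptions, not the paper's implementation: the function name `mark_exact_matches`, the `[eN]` / `[/eN]` marker format, and the lowercase regex tokenization are all hypothetical choices made for this example, and details such as stopword handling or stemming are not covered by the abstract.

```python
from __future__ import annotations

import re


def mark_exact_matches(query: str, passage: str) -> tuple[str, str]:
    """Wrap terms shared by query and passage in numbered marker tokens.

    Assumption: markers look like "[e1] term [/e1]"; the abstract only says
    that marker tokens highlight exactly matched terms.
    """
    # Assumption: simple lowercase word/punctuation tokenization.
    def tokenize(text: str) -> list[str]:
        return re.findall(r"\w+|[^\w\s]", text.lower())

    q_tokens, p_tokens = tokenize(query), tokenize(passage)
    passage_vocab = set(p_tokens)

    # One marker id per distinct query term that also occurs in the passage.
    marker_ids: dict[str, int] = {}
    for tok in q_tokens:
        if tok.isalnum() and tok in passage_vocab and tok not in marker_ids:
            marker_ids[tok] = len(marker_ids) + 1

    def mark(tokens: list[str]) -> str:
        # Matched terms get the same marker pair on both sides.
        return " ".join(
            f"[e{marker_ids[t]}] {t} [/e{marker_ids[t]}]" if t in marker_ids else t
            for t in tokens
        )

    return mark(q_tokens), mark(p_tokens)


marked_query, marked_passage = mark_exact_matches(
    "what is bert", "BERT is a language model."
)
print(marked_query)    # what [e1] is [/e1] [e2] bert [/e2]
print(marked_passage)  # [e2] bert [/e2] [e1] is [/e1] a language model .
```

In a real pipeline the marker strings would also need to be registered with the model's tokenizer as additional special tokens so that they are not split into word pieces; that step is omitted here.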

BibTeX Citation

    @inproceedings{10.1145/3397271.3401194,
    author = {Boualili, Lila and Moreno, Jose G. and Boughanem, Mohand},
    title = {MarkedBERT: Integrating Traditional IR Cues in Pre-Trained Language Models for Passage Retrieval},
    year = {2020},
    isbn = {9781450380164},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3397271.3401194},
    doi = {10.1145/3397271.3401194},
    booktitle = {Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval},
    pages = {1977–1980},
    numpages = {4},
    keywords = {exact matching, deep learning, passage retrieval},
    location = {Virtual Event, China},
    series = {SIGIR '20}