Antoine Louis | research

publications

A collection of my academic publications.

2025

Unpublished
Machine Learning Solutions for Improving Access to Law

Antoine Louis

Doctoral thesis, Maastricht University.

Abstract Thesis Bib

Many individuals are likely to face a legal dispute at some point in their lives, but the lack of understanding of how to navigate these complex issues often renders those with few resources vulnerable. The advancement of machine learning and natural language processing opens new avenues for bridging this legal literacy gap through the development of automated legal aid systems. This thesis investigates the potential of such systems in automatically retrieving relevant legislation and delivering understandable answers to legal questions posed in layperson’s terms. These automated tools could provide a free professional assisting service for people in need, empowering disadvantaged parties in legal disputes and improving equal access to justice.
@mastersthesis{louis2025thesis, author = {Louis, Antoine}, title = {Machine Learning Solutions for Improving Access to Law}, school = {Maastricht University}, address = {Maastricht, The Netherlands}, year = {2025}, url = {https://cris.maastrichtuniversity.nl/en/publications/machine-learning-solutions-for-improving-access-to-law}, note = {Doctoral thesis} }
COLING
Know When to Fuse: Investigating Non-English Hybrid Retrieval in the Legal Domain

Antoine Louis, Gijs Van Dijck, and Gerasimos Spanakis

In Proceedings of the 31st International Conference on Computational Linguistics, pages 4293–4312.

Abstract Article Code Bib

Hybrid search has emerged as an effective strategy to offset the limitations of different matching paradigms, especially in out-of-domain contexts where notable improvements in retrieval quality have been observed. However, existing research predominantly focuses on a limited set of retrieval methods, evaluated in pairs on domain-general datasets exclusively in English. In this work, we study the efficacy of hybrid search across a variety of prominent retrieval models within the unexplored field of law in the French language, assessing both zero-shot and in-domain scenarios. Our findings reveal that in a zero-shot context, fusing different domain-general models consistently enhances performance compared to using a standalone model, regardless of the fusion method. Surprisingly, when models are trained in-domain, we find that fusion generally diminishes performance relative to using the best single system, unless fusing scores with carefully tuned weights. These novel insights, among others, expand the applicability of prior findings across a new field and language, and contribute to a deeper understanding of hybrid search in non-English specialized domains.
@inproceedings{louis2024interpretable, title = {Know When to Fuse: Investigating Non-English Hybrid Retrieval in the Legal Domain}, author = {Louis, Antoine and Van Dijck, Gijs and Spanakis, Gerasimos}, booktitle = {Proceedings of the 31st International Conference on Computational Linguistics}, year = {2025}, address = {Abu Dhabi, UAE}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.coling-main.290}, pages = {4293--4312} }
COLING
ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval

Antoine Louis, Vageesh Saxena, Gijs Van Dijck, and Gerasimos Spanakis

In Proceedings of the 31st International Conference on Computational Linguistics, pages 4370–4383.

Abstract Article Code Bib

State-of-the-art neural retrievers predominantly focus on high-resource languages like English, which impedes their adoption in retrieval scenarios involving other languages. Current approaches circumvent the lack of high-quality labeled data in non-English languages by leveraging multilingual pretrained language models capable of cross-lingual transfer. However, these models require substantial task-specific fine-tuning across multiple languages, often perform poorly in languages with minimal representation in the pretraining corpus, and struggle to incorporate new languages after the pretraining phase. In this work, we present a novel modular dense retrieval model that learns from the rich data of a single high-resource language and effectively zero-shot transfers to a wide array of languages, thereby eliminating the need for language-specific labeled data. Our model, ColBERT-XM, demonstrates competitive performance against existing state-of-the-art multilingual retrievers trained on more extensive datasets in various languages. Further analysis reveals that our modular approach is highly data-efficient, effectively adapts to out-of-distribution data, and significantly reduces energy consumption and carbon emissions. By demonstrating its proficiency in zero-shot scenarios, ColBERT-XM marks a shift towards more sustainable and inclusive retrieval systems, enabling effective information accessibility in numerous languages.
@inproceedings{louis2024interpretablf, title = {ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval}, author = {Louis, Antoine and Saxena, Vageesh and Van Dijck, Gijs and Spanakis, Gerasimos}, booktitle = {Proceedings of the 31st International Conference on Computational Linguistics}, year = {2025}, address = {Abu Dhabi, UAE}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.coling-main.295}, pages = {4370--4383} }

2024

AAAI
Interpretable Long-Form Legal Question Answering with Retrieval-Augmented Large Language Models

Antoine Louis, Gijs Van Dijck, and Gerasimos Spanakis

In Proceedings of the 38th AAAI Conference on Artificial Intelligence, pages 22266–22275.

Abstract Article Code Dataset Bib

Many individuals are likely to face a legal dispute at some point in their lives, but their lack of understanding of how to navigate these complex issues often renders them vulnerable. The advancement of natural language processing opens new avenues for bridging this legal literacy gap through the development of automated legal aid systems. However, existing legal question answering (LQA) approaches often suffer from a narrow scope, being either confined to specific legal domains or limited to brief, uninformative responses. In this work, we propose an end-to-end methodology designed to generate long-form answers to any statutory law questions, utilizing a "retrieve-then-read" pipeline. To support this approach, we introduce and release the Long-form Legal Question Answering (LLeQA) dataset, comprising 1,868 expert-annotated legal questions in the French language, complete with detailed answers rooted in pertinent legal provisions. Our experimental results demonstrate promising performance on automatic evaluation metrics, but a qualitative analysis uncovers areas for refinement. As one of the only comprehensive, expert-annotated long-form LQA dataset, LLeQA has the potential to not only accelerate research towards resolving a significant real-world issue, but also act as a rigorous benchmark for evaluating NLP models in specialized domains. We publicly release our code, data, and models.
@inproceedings{louis2024interpretablg, title = {Interpretable Long-Form Legal Question Answering with Retrieval-Augmented Large Language Models}, author = {Louis, Antoine and Van Dijck, Gijs and Spanakis, Gerasimos}, booktitle = {Proceedings of the 38th AAAI Conference on Artificial Intelligence}, year = {2024}, address = {Vancouver, Canada}, publisher = {AAAI Press}, url = {https://ojs.aaai.org/index.php/AAAI/article/view/30232}, pages = {22266--22275} }

2023

EACL
Finding the Law: Enhancing Statutory Article Retrieval via Graph Neural Networks

Antoine Louis, Gijs Van Dijck, and Gerasimos Spanakis

In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2753–2768.

Abstract Article Code Bib

Statutory article retrieval (SAR), the task of retrieving statute law articles relevant to a legal question, is a promising application of legal text processing. In particular, high-quality SAR systems can improve the work efficiency of legal professionals and provide basic legal assistance to citizens in need at no cost. Unlike traditional ad-hoc information retrieval, where each document is considered a complete source of information, SAR deals with texts whose full sense depends on complementary information from the topological organization of statute law. While existing works ignore these domain-specific dependencies, we propose a novel graph-augmented dense statute retriever (G-DSR) model that incorporates the structure of legislation via a graph neural network to improve dense retrieval performance. Experimental results show that our approach outperforms strong retrieval baselines on a real-world expert-annotated SAR dataset.
@inproceedings{louis2023finding, title = {Finding the Law: Enhancing Statutory Article Retrieval via Graph Neural Networks}, author = {Louis, Antoine and Van Dijck, Gijs and Spanakis, Gerasimos}, booktitle = {Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics}, year = {2023}, address = {Dubrovnik, Croatia}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.eacl-main.203/}, pages = {2753--2768} }

2022

ACL
A Statutory Article Retrieval Dataset in French

Antoine Louis, and Gerasimos Spanakis

In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 6789–6803.

Abstract Article Code Dataset Bib

Statutory article retrieval is the task of automatically retrieving law articles relevant to a legal question. While recent advances in natural language processing have sparked considerable interest in many legal tasks, statutory article retrieval remains primarily untouched due to the scarcity of large-scale and high-quality annotated datasets. To address this bottleneck, we introduce the Belgian Statutory Article Retrieval Dataset (BSARD), which consists of 1,100+ French native legal questions labeled by experienced jurists with relevant articles from a corpus of 22,600+ Belgian law articles. Using BSARD, we benchmark several state-of-the-art retrieval approaches, including lexical and dense architectures, both in zero-shot and supervised setups. We find that fine-tuned dense retrieval models significantly outperform other systems. Our best performing baseline achieves 74.8% R@100, which is promising for the feasibility of the task and indicates there is still room for improvement. By the specificity of the domain and addressed task, BSARD presents a unique challenge problem for future research on legal information retrieval. Our data and source code are publicly available.
@inproceedings{louis2022statutory, title = {A Statutory Article Retrieval Dataset in French}, author = {Louis, Antoine and Spanakis, Gerasimos}, booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics}, year = {2022}, address = {Dublin, Ireland}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2022.acl-long.468/}, pages = {6789--6803} }

2020

Unpublished
NetBERT: A Pre-trained Language Representation Model for Computer Networking

Antoine Louis

Master’s thesis, University of Liège.

Abstract Thesis Code Bib

Obtaining accurate information about products in a fast and efficient way is becoming increasingly important at Cisco as the related documentation rapidly grows. Thanks to recent progress in natural language processing (NLP), extracting valuable information from general domain documents has gained in popularity, and deep learning has boosted the development of effective text mining systems. However, directly applying the advancements in NLP to domain-specific documentation might yield unsatisfactory results due to a word distribution shift from general domain language to domain-specific language. Hence, this work aims to determine if a large language model pre-trained on domain-specific (computer networking) text corpora improves performance over the same model pre-trained exclusively on general domain text, when evaluated on in-domain text mining tasks. To this end, we introduce NetBERT (Bidirectional Encoder Representations from Transform-ers for Computer Networking), a domain-specific language representation model based on BERT and pre-trained on large-scale computer networking corpora. Through several extrinsic and intrinsic evaluations, we compare the performance of our novel model against the general-domain BERT. We demonstrate clear improvements over BERT on the following two representative text mining tasks: networking text classification (0.9% F1 improvement) and networking information retrieval (12.3% improvement on a custom retrieval score). Additional experiments on word similarity and word analogy tend to show that NetBERT capture more meaningful semantic properties and relations between networking concepts than BERT does. We conclude that pre-training BERT on computer networking corpora helps it understand more accurately domain-related text.
@mastersthesis{louis2020netbert, author = {Louis, Antoine}, title = {NetBERT: A Pre-trained Language Representation Model for Computer Networking}, school = {University of Liège}, address = {Liège, Belgium}, year = {2020}, url = {https://matheo.uliege.be/handle/2268.2/9060?locale=en}, note = {Master's thesis} }

publications

2025

2024

2023

2022

2020

teaching