Towards a cleaner document-oriented multilingual crawled corpus J Abadji, PO Suarez, L Romary, B Sagot arXiv preprint arXiv:2201.06642, 2022 | 154 | 2022 |
Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus J Abadji, PJO Suárez, L Romary, B Sagot CMLC 2021-9th Workshop on Challenges in the Management of Large Corpora, 2021 | 65 | 2021 |
Towards a cleaner document-oriented multilingual crawled corpus J Abadji, PO Suarez, L Romary, B Sagot arXiv preprint arXiv:2201.06642, 2022 | 21 | 2022 |
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus J Abadji, P Ortiz Suarez, L Romary, B Sagot arXiv preprint arXiv:2201.06642, 2022 | 7 | 2022 |
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus M Futeral, A Zebaze, PO Suarez, J Abadji, R Lacroix, C Schmid, ... arXiv preprint arXiv:2406.08707, 2024 | | 2024 |