Article domain: Theoretical, Mathematical, and Computational Physics
Statistics of a Large-Scale Romanian Corpus for Language Modelling
T.A. Diac, A.F. Neagoe, M.C. Raportaru, A. Oprea, R.-M. Drăgan, A. Nicolin-Żaczek
Received June 12, 2025
Abstract. We report a series of detailed statistical analyses of a novel large-scale, multi-domain Romanian corpus that we use to train a small language model. We identify the core vocabularies of six domain-specific subcorpora and show that they follow Zipf's law regardless of how words are counted. Moreover, we introduce two novel frequency-word maps for a domain-specific subcorpus: one showing word ranks and one measuring how far the structure of each word deviates from a perfectly alternating vowel-consonant or consonant-vowel pattern. Finally, we show a few examples of prompt/response instances.
Key words: Large-scale multi-domain Romanian corpus, Zipf’s law, Language modelling.
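Illustrative note. Zipf's law states that the frequency of a word is roughly proportional to 1/r^s, where r is its frequency rank and s is close to 1. The following minimal Python sketch shows one common way such a claim can be checked, by fitting a straight line to log-frequency against log-rank. It is not the authors' procedure: the function names `rank_frequency` and `zipf_exponent`, the simple regex tokenisation, and the ordinary least-squares fit are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return word frequencies sorted from most to least frequent.

    Assumption: a word is any run of Unicode word characters, lower-cased.
    Other counting conventions (lemmas, case-sensitive forms) are possible.
    """
    words = re.findall(r"\w+", text.lower())
    return sorted(Counter(words).values(), reverse=True)


def zipf_exponent(freqs):
    """Estimate s in f(r) ~ C / r**s by least squares in log-log space."""
    if len(freqs) < 2:
        raise ValueError("need at least two ranked frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is -s, so flip the sign.
    return -sxy / sxx
```

Calling `zipf_exponent(rank_frequency(corpus_text))` on a large text gives a rough exponent estimate; repeating the call with a different tokenisation in `rank_frequency` is one way to probe how sensitive the result is to the counting convention.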
Article no. 111
Romanian Journal of Physics 70 (7-8), 111 (2025)