
Place of statistics in a language model

Keywords: morphology, corpus linguistics, linguistic variation, text statistics


The article considers how quantitative data may fit into a theoretical model of language. It argues that a language model should include a hypothesis, even a speculative one, about the generation procedure at play. One concrete example shows how quantitative data form an integral part of a model of Estonian morphology; another shows how corpus-based statistical models may lead to dubious statistical calculations. Descriptions of two earlier experiments in statistical learning point to a path worth following in corpus linguistics: more attention should be paid to some less obvious factors that play a role in human language learning, namely transitional probabilities and the linguistic units that should be left out of the computations.


Heiki-Jaan Kaalep (b. 1962), PhD, University of Tartu, Senior Researcher

