Conversely, KRES shows more lemmas associated with legal texts, such as "člen, odstavek, zakon" (article, paragraph, law), so that even though [slWaC.sub.2] has more texts of this type than [slWaC.sub.1], it still has far fewer than KRES.
KRES also has a specific group of lemmas, thematising a person in relation to another person, e.g.
Not surprisingly, a comparison between [slWaC.sub.2] and the Gigafida corpus showed results rather similar to those of the comparison between [slWaC.sub.2] and KRES. The top part of the list again contains content lemmas like "spleten" (Web), "aplikacija" (application), "blog", "uporabnik" (user), "facebook", "sistem" (system), etc., indicating that [slWaC.sub.2] has more computer- and Web-related texts.
Apart from lemmas, it is also interesting to compare how the distribution of morphosyntactic categories of [slWaC.sub.2] differs from that of KRES. To this end, we calculated six LL comparison scores, for uni-, bi- and trigrams of part-of-speech (PoS) tags and of complete morphosyntactic descriptions (MSDs).
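As an illustration of how such n-gram comparisons can be computed, the following is a minimal sketch. It assumes that LL denotes the two-term log-likelihood statistic commonly used in corpus frequency profiling; the exact formulation used for the scores reported here is not stated in the text. The function names (`ngrams`, `log_likelihood`, `compare`) and the toy tag sequences are hypothetical and are not taken from either corpus.

```python
import math
from collections import Counter


def ngrams(tags, n):
    """Return the list of n-grams (as tuples) over a sequence of tags."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def log_likelihood(a, b, c, d):
    """Two-term log-likelihood for an item seen a times in a corpus of
    size c and b times in a corpus of size d (assumed formulation)."""
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    ll = 0.0
    if a > 0:  # 0 * ln(0) is taken to be 0
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def compare(tags_a, tags_b, n):
    """Score every n-gram; a positive sign means relatively more frequent in A.
    Both sequences must contain at least n tags."""
    fa, fb = Counter(ngrams(tags_a, n)), Counter(ngrams(tags_b, n))
    ca, cb = sum(fa.values()), sum(fb.values())
    scores = {}
    for gram in fa.keys() | fb.keys():
        a, b = fa[gram], fb[gram]
        ll = log_likelihood(a, b, ca, cb)
        # Sign the score by which corpus has the higher relative frequency.
        scores[gram] = ll if a / ca >= b / cb else -ll
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy PoS sequences (hypothetical data, for illustration only).
web = ["N", "V", "R", "C", "N", "V", "R", "X"]
ref = ["N", "M", "N", "M", "V", "N", "M", "N"]
for n in (1, 2, 3):
    print(n, compare(web, ref, n)[:3])
```

Running the same `compare` function once over PoS tags and once over full MSDs, for n = 1, 2 and 3, would yield six ranked lists analogous to the six comparisons described above. Signing the score is a presentational choice in this sketch that makes it easy to read off which corpus an n-gram is more characteristic of.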
Conversely, KRES with its numerals shows a preponderance of newspaper texts, which tend to use lots of dates, times, amounts, and sports scores.
are treated as abbreviations, whereas they are common nouns in KRES. The same reasoning applies to combinations with punctuation.
As for MSDs, the differences in unigrams in favour of [slWaC.sub.2] are greatest for the three unknown word types that KRES does not use (Xf: foreign word, Xp: program mistake and Xt: typo), followed by general adverbs in the positive degree, coordinating conjunctions, the first person plural present tense auxiliary verb ("smo") and animate common masculine singular nouns in the accusative, i.e.
We also compared the content of the [slWaC.sub.2] corpus to three other Slovene corpora (the [slWaC.sub.1] corpus, the balanced reference corpus KRES and the reference corpus Gigafida) with frequency profiling on lemmas and grammatical descriptions.
In the lemma comparison with KRES it has fewer legal texts but more user-generated content and more commercial, sports, political and computer-related texts.