Finnish Letter Frequencies
All text files provided are encoded in UTF-8. The frequencies on this page were generated from about 90 million characters of Finnish text, sourced from Wortschatz. The text files containing the counts can be used with ngram_score.py for breaking ciphers; see this page for details, and substitute the n-gram file you want for the English one. To compute the letter frequencies of your own text, you can use this page.
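As a rough illustration of computing letter frequencies for your own text, here is a minimal Python sketch. The function name `letter_frequencies` and the constant `FINNISH_ALPHABET` are hypothetical names chosen for this example, and the choice to ignore everything outside A-Z, Å, Ä and Ö is an assumption, not a description of how the figures on this page were produced.

```python
import unicodedata
from collections import Counter

# Assumed alphabet: A-Z plus the three extra Finnish letters.
FINNISH_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ"

def letter_frequencies(text):
    """Return {letter: percentage} for the alphabet letters found in text."""
    # NFC normalisation turns 'a' + combining diaeresis into a single 'ä'.
    text = unicodedata.normalize("NFC", text).upper()
    counts = Counter(ch for ch in text if ch in FINNISH_ALPHABET)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {letter: 100.0 * count / total for letter, count in counts.items()}

if __name__ == "__main__":
    sample = "Hän sanoi, että sää on tänään hyvä."
    ranked = sorted(letter_frequencies(sample).items(), key=lambda kv: -kv[1])
    for letter, pct in ranked:
        print(f"{letter} : {pct:.2f}")
```

A short sample like the one above will give frequencies that differ noticeably from the corpus figures; the percentages only stabilise on much longer texts.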
Monogram Frequencies §
Finnish single-letter frequencies are as follows (in percent):
A : 12.22    K : 4.97    U : 5.01
B : 0.28     L : 5.76    V : 2.25
C : 0.28     M : 3.20    W : 0.09
D : 1.04     N : 8.83    X : 0.03
E : 7.97     O : 5.61    Y : 1.74
F : 0.19     P : 1.84    Z : 0.05
G : 0.39     Q : 0.01    Å : 0.00
H : 1.85     R : 2.87    Ä : 3.58
I : 10.82    S : 7.86    Ö : 0.44
J : 2.04     T : 8.75
The finnish_monograms.txt file provides the counts used to generate the frequencies above.
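Turning a counts file into percentages like those listed is a matter of dividing each count by the total. The sketch below assumes each line of the file holds an n-gram and its count separated by whitespace (for example `A 123456`); that layout is an assumption about the file format, and `parse_counts` and `to_percentages` are hypothetical helper names. The counts in the example are made up for illustration.

```python
def parse_counts(lines):
    """Parse 'NGRAM COUNT' lines into a dict, skipping blank lines."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last run of whitespace: n-gram on the left, count on the right.
        ngram, count = line.rsplit(None, 1)
        counts[ngram] = int(count)
    return counts

def to_percentages(counts):
    """Convert raw counts to percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ngram: 100.0 * count / total for ngram, count in counts.items()}

# Made-up counts, for illustration only.
example = ["A 3", "Ä 1", ""]
print(to_percentages(parse_counts(example)))
```

To read a real file, an open file object can be passed in directly, e.g. `parse_counts(open("finnish_monograms.txt", encoding="utf-8"))`, since the files are UTF-8 encoded.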
Common Finnish Words §
The following words are the most common words in a 'news' text corpus. The numbers are percentages of occurrence, e.g. 'JA' constitutes 3.94% of all words in the corpus. Different styles of writing give rise to different most-common words, e.g. we would expect 'EN' and 'OLE' to be more common in a more personal writing style.
JA : 3.94        MUTTA : 0.38      JONKA : 0.20
ON : 2.86        TAI : 0.37        KUITENKIN : 0.20
OLI : 1.05       SEN : 0.36        NOIN : 0.19
HÄN : 0.81       ETTÄ : 0.36       MUKAAN : 0.18
MYÖS : 0.71      SEKÄ : 0.31       JOSSA : 0.18
VUONNA : 0.67    HÄNEN : 0.30      VUODEN : 0.18
JOKA : 0.53      JÄLKEEN : 0.24    ELI : 0.17
EI : 0.46        KANSSA : 0.24     VOI : 0.16
SE : 0.45        KUN : 0.21        SUOMEN : 0.16
OVAT : 0.45      KUIN : 0.20       OLLUT : 0.16
The finnish_words.txt file provides the counts used to generate the frequencies above.
Bigram Frequencies §
We can't list all of the bigram frequencies here; the top 30 are as follows (in percent):
EN : 2.14    IT : 1.16    TI : 1.00
IS : 1.95    SA : 1.15    AS : 0.98
IN : 1.95    SE : 1.11    VA : 0.92
TA : 1.90    LI : 1.11    NE : 0.92
AN : 1.71    KA : 1.08    LA : 0.92
ST : 1.54    ON : 1.06    AT : 0.92
SI : 1.36    TE : 1.05    EL : 0.90
AA : 1.33    LL : 1.05    NA : 0.83
TT : 1.23    AI : 1.04    NT : 0.82
AL : 1.21    JA : 1.00    SS : 0.82
The finnish_bigrams.txt file provides the counts used to generate the frequencies above.
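Counts of this kind can be gathered by sliding a window of length n over the letters of a text. The following is a minimal sketch, not the procedure used for the published files: `ngram_counts` is a hypothetical name, and the decision to strip all non-letters first (so that windows run across word boundaries) is an assumption, since this page does not say how word boundaries were handled.

```python
from collections import Counter

# Assumed alphabet: A-Z plus the three extra Finnish letters.
FINNISH_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ"

def ngram_counts(text, n):
    """Count every run of n consecutive letters in text."""
    # Keep letters only; spaces and punctuation are dropped (an assumption).
    letters = "".join(ch for ch in text.upper() if ch in FINNISH_ALPHABET)
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))

if __name__ == "__main__":
    counts = ngram_counts("Talossa asuu vanha mies ja hänen koiransa.", 2)
    for bigram, count in counts.most_common(5):
        print(bigram, count)
```

The same function yields trigram or quadgram counts by passing `n=3` or `n=4`.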
Trigram Frequencies §
We can't list all of the trigram frequencies here; the top 30 are as follows (in percent):
IST : 0.61    LLA : 0.33    ALL : 0.28
STA : 0.58    ITT : 0.32    EST : 0.27
SSA : 0.56    AJA : 0.32    ELL : 0.26
AAN : 0.46    AIS : 0.31    OLI : 0.26
ISE : 0.43    INE : 0.31    ISI : 0.26
NEN : 0.34    ETT : 0.31    ENK : 0.26
SEN : 0.34    LIS : 0.31    LLI : 0.24
TTA : 0.33    TAA : 0.30    ENT : 0.24
EEN : 0.33    IIN : 0.29    ENS : 0.24
KSI : 0.33    TTI : 0.29    AVA : 0.23
The finnish_trigrams.txt file provides the counts used to generate the frequencies above.
Quadgram Frequencies §
We can't list all of the quadgram frequencies here; the top 30 are as follows (in percent):
INEN : 0.23    ALLI : 0.12    AISE : 0.10
ISTA : 0.21    LAIS : 0.12    OITT : 0.10
ISSA : 0.16    ISTE : 0.11    MYÖS : 0.10
ISEN : 0.15    TETT : 0.11    ONNA : 0.10
LLIS : 0.15    TAJA : 0.10    TTII : 0.09
ASSA : 0.15    UKSE : 0.10    ALLA : 0.09
TAAN : 0.13    ISES : 0.10    VUON : 0.09
TIIN : 0.13    ESSA : 0.10    ESTA : 0.09
UTTA : 0.13    ALAI : 0.10    SEST : 0.09
ASTA : 0.13    UONN : 0.10    TAVA : 0.09
The finnish_quadgrams.txt file provides the counts used to generate the frequencies above.
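For cipher breaking, n-gram counts like these are typically turned into a fitness score: the sum of the log probabilities of every n-gram in a candidate plaintext. The sketch below illustrates that idea only; it is not the code of ngram_score.py. The name `build_scorer` is hypothetical, the floor value used for n-grams missing from the table is an assumed choice, and the counts in the example are made up.

```python
import math

def build_scorer(counts):
    """Return a function that scores text by summed log10 n-gram probabilities.

    counts maps n-grams (all of the same length) to raw counts.
    """
    n = len(next(iter(counts)))
    total = sum(counts.values())
    log_probs = {gram: math.log10(c / total) for gram, c in counts.items()}
    # Assumed floor for n-grams that never appear in the table.
    floor = math.log10(0.01 / total)

    def score(text):
        # text is expected to be upper-case letters with no spaces.
        return sum(log_probs.get(text[i:i + n], floor)
                   for i in range(len(text) - n + 1))

    return score

# Made-up counts, for illustration only.
score = build_scorer({"INEN": 3, "ISTA": 1})
print(score("INEN") > score("XQZW"))
```

A higher (less negative) score means the text looks more like the language the counts came from, so a cipher-breaking search keeps the key whose decryption scores highest.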