Susanna Mett, Senja Emilia Salmi, Sami Tiainen, Jakob Lindström, Jajwalya Karajgikar
Abstract: Our project examined how young men and women are described in Finnic runosongs in the Suomen Kansan Vanhat Runot (SKVR) and Eesti Regilaulude Andmebaas (ERAB) databases. The main obstacle at every stage of our research was the nature of the data: the corpus contains a large amount of linguistic, dialectal and regional variation, with 866 936 different word forms altogether. We tokenized the texts, compiled a list of keywords covering words that mean “young woman” or “young man”, and computed the co-occurrences of words. We used the co-occurring words to detect and analyse patterns in how young men and women are described in the songs, focusing on adjectives and proper names. The results indicate that the number of different words used to describe young people is not very large. Of these, the words meaning “young” were the most prominent. Future research should tackle the challenge of high variation and apply both computational and humanistic methods further, including more close reading and a larger set of co-occurring words.
Historically and culturally, boys and girls have often had different roles: the runosong material at hand hints that boys were expected to live on their parents’ property while girls were expected to marry and move to the groom’s home. Thus, it is likely that they are also described differently in runosongs, the traditional song type of Finnic cultures. During this Hackathon, we decided to see what the youth are up to in runosongs, in other words, how young women and men are depicted in the material.
1.1 The data: Finnic oral poetry
The data used in the project are the runosongs of several Finnic groups. This form of oral tradition is common to all Finnic groups and languages, with few exceptions. All runosongs share some features, but there is also a lot of regional and dialectal variation. Kalevala, a literary work created on the basis of Northern Finnic folk poetry, is the most well-known example of runosongs. Kalevala’s poetic form is more regular than the form of runosongs from most areas. One common feature of runosongs is alliteration, which means that words in the same line tend to begin with similar sounds. Not all runosongs have alliteration, but most of them do. Parallelism is another quite common feature of runosongs. The parallelistic features can be grammatical or semantic.
The data used in the project comes from Suomen Kansan Vanhat Runot (SKVR) and Eesti Regilaulude Andmebaas (ERAB). These corpora contain a significant part of the runosongs that have been documented in Karelia, Ingria, Estonia and Finland, mostly during the 19th and early 20th centuries. Altogether, the Finnish SKVR database has over 89 000 runosongs and the Estonian ERAB database 100 034 runosongs. The Finnish corpus mostly contains poems in Northern Finnic languages (Ingrian, Karelian, Finnish, but also Votic) and the Estonian one in Southern Finnic languages (Northern and Southern Estonian, Seto), but there are some overlaps.
1.2 Research questions
The runosong data is extremely varied and interesting, containing many different song types in which different topics are discussed, from mythological songs to joking songs about everyday life. Within the scope of this Hackathon, we made the tough choice to concentrate only on young female and male characters. We wanted to find out what types of young female and male characters are depicted, and how they are depicted in different runosong contexts with regard to descriptions and connections to other song elements. We were also interested in comparing regional similarities and differences and in finding out whether we could computationally discover different archetypes of young women and men.
The theme of young female and male characters has not been thoroughly studied before, let alone by looking at the Estonian and Finnish databases simultaneously. Thus, our project gives a first insight into the depiction of young male and female characters in runosongs.
2. Approach during the hackathon
In order to answer our research questions, we first had to identify the key characters in runosongs and limit the number of character types that we chose to analyse further. We started by close-reading different runosong texts and searching through word frequency lists, and chose to analyse further the female characters “maiden”, “girl”, “sister” and “daughter” and the male characters “boy”, “brother” and “son”. The lists of female and male characters were then explored further using the collocations (here meaning words that often recur near each other in poetic verses) computed from our whole dataset, and by checking details of the meanings of both the individual words and the collocations through close reading of parts of the data, also taking regional variation into account.
In the end, we chose the word stems “flikka”, “impi”, “neitsi”, “tütar/tüdruk”, “piiga”, “sisar”, “õde” and “kapo”, with their many variants, to analyse young women, and “poeg”, “poiss/poika” and “veli/vend” to analyse young men. The main difficulty throughout the project was dealing with the large amount of both linguistic and contextual variation, which derives from poetic and regional differences and from the practices of runosong collectors. Each chosen word has many different forms, and sometimes the meanings of words differ between Karelian, Ingrian, Finnish and Estonian. Sometimes a word form may derive from several stems. Nonetheless, most of the terms observed appeared across the whole area, at least in some form. For example, for “maiden” we had to look at variations of the “neit” stem such as “neito”, “neid”, “näio”, “neitsi” and others, in both singular and plural forms, in different cases, and also as diminutives such as “neiukene”. On top of this, “neid” means both “maiden” and “them” in Estonian. This meant we had to allow some noise and sometimes omit ambiguous or difficult cases.
With such an amount of variation and non-standard language, traditional language technology tools are not applicable. Instead, we needed to rely on manual and unsupervised methods. To tackle the variation in the terms of interest, we developed regular expression queries that capture the most relevant words with as few irrelevant matches as possible. With some words this was easier than with others. The search for maiden words gave quite clean results, but the query using different words meaning brother returned many more unrelated results, such as “boat”, “Russian” or “to dwell”. Luckily, most of the results of all the regular expression queries were relevant, so manually deleting the unrelated results was not an enormous task.
To answer the research questions, we decided to focus on collocations (all the possible word pairs) within a single line to find what the characters in focus are associated with. The initial idea was to focus on adjectives, but we also made some observations about verbs and related characters. At the same time, we looked into a list of proper names that was compiled using computational methods.
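A minimal sketch of what such a stem query might look like in Python. The pattern, the exclusion set and the function name are illustrative assumptions built from the forms quoted in this report; they are not the queries actually used in the project.

```python
import re

# Hypothetical pattern for the "neit" stem family (assumption: built only
# from forms quoted in the text, e.g. "neito", "neidu", "näio", "neiccyt").
MAIDEN_RE = re.compile(r"^(?:nei[tdcuo]|näi[ou])\w*$", re.IGNORECASE)

# Forms judged ambiguous by a human reader can be excluded by hand;
# "neid" also means "them" in Estonian.
AMBIGUOUS = frozenset({"neid"})

def match_keyword_forms(word_forms, pattern=MAIDEN_RE, exclude=AMBIGUOUS):
    """Return the word forms that match the stem pattern and are not excluded."""
    return [w for w in word_forms
            if pattern.match(w) and w.lower() not in exclude]

# Toy vocabulary, not real corpus data
vocabulary = ["neito", "neiukene", "näio", "neitsyt", "neid", "vene", "veli"]
hits = match_keyword_forms(vocabulary)
print(hits)
```

A deliberately loose pattern followed by manual pruning mirrors the trade-off described above: some noise is accepted so that fewer relevant variants are missed.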
2.1 Computational pipeline
Tokenization, i.e. parsing the words of the corpus into a separate list, was one of the first computational tasks. This was essential so that we could start working with sufficiently clean data. It was achieved with regular expressions in Python. In general, we wanted alphabetic strings and other characters (including numbers) parsed as separate words. The exception was words consisting of numbers, the character “:” and letters, in that order, e.g. “3:n”, a numeral with a morphological ending.
In order to find significant co-occurrences of words (i.e. collocations), we searched for words that appear together within the same verse and saved them as pairs with their frequencies. Then we examined their co-occurrence frequencies and individual frequencies. Based on those, we calculated two values (mutual information and the log-likelihood ratio λ) to help us determine the significance of the co-occurrence. Finally, we computed the cosine similarity of term co-occurrence vectors, which gave numerical data on the similarity of the contexts in which the words appear and could subsequently be interpreted. One limitation of this solution is that different orthographic, morphological and dialectal variants of the same word are treated as different words, e.g. “neiut”, “neiduda”, “neidu”, “neiuta” (morphological); “neitshyt”, “neitšüt”, “neitsüt”, “neiccyt”, “neitschyt” (orthographic) and “neiu”, “neidu”, “neido”, “neio”, “neitsit”, “neitsyt” (dialectal). All in all, there are 866 936 words and word forms in our corpus if capital letters and punctuation are not taken into account.
In the corpus, every poetic line begins with a capital letter. To detect proper names, we assumed that every capitalized word that is not the first one in the verse is a name. Distinguishing personal names from place names was not done computationally but was left for human interpretation. By implementing an algorithm based on this assumption, we could query the database for potential names. However, since numbers and other non-letter characters were also counted as words, the solution was not that simple. We had to check that the found capitalized word precedes a valid, alphabet-based word. In addition, in some parts of the data (especially the small 17th–18th-century part) capital letters are used in exceptional ways.
In the end, we also experimented with visualising collocations of selected characters and proper names with Gephi, a software package for network analysis.
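The tokenization rules described above could be sketched roughly as follows. The exact expression used in the project is not given, so the pattern below is an assumption: runs of letters and runs of digits each become one token, every other non-space character becomes its own token, and a numeral followed by “:” and letters is kept together.

```python
import re

# Illustrative tokenizer (assumption, not the project's actual expression).
TOKEN_RE = re.compile(
    r"""
    \d+:[^\W\d_]+   # numeral + ":" + letters, e.g. "3:n", kept as one token
    | [^\W\d_]+     # a run of letters (Unicode-aware, so "ä", "õ", "š" work)
    | \d+           # a run of digits
    | \S            # any other single non-space character
    """,
    re.VERBOSE,
)

def tokenize(verse):
    """Split one verse into a list of tokens."""
    return TOKEN_RE.findall(verse)

print(tokenize("Mitä itket, nuori neiti?"))
print(tokenize("3:n veljen"))
```

The order of the alternatives matters: the “3:n” rule is tried first, so the numeral is not split before the colon is seen.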
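As an illustration of the two association measures named above, here is a hedged sketch using their standard textbook forms: pointwise mutual information, and the log-likelihood statistic G² = −2 ln λ over a 2×2 contingency table as in Dunning 1993. The function names and the choice to count each word once per verse are assumptions made for this example, not a description of the project code.

```python
import math
from collections import Counter
from itertools import combinations

def count_cooccurrences(verses):
    """verses: iterable of token lists. Counts in how many verses each word
    and each unordered word pair occurs (assumption: once per verse)."""
    word_freq, pair_freq = Counter(), Counter()
    n_verses = 0
    for tokens in verses:
        n_verses += 1
        types = sorted(set(tokens))
        word_freq.update(types)
        pair_freq.update(combinations(types, 2))
    return word_freq, pair_freq, n_verses

def pmi(k_ab, k_a, k_b, n):
    """Pointwise mutual information in bits."""
    return math.log2(k_ab * n / (k_a * k_b))

def log_likelihood(k_ab, k_a, k_b, n):
    """G^2 = -2 ln(lambda) for the 2x2 table of verses with/without a and b."""
    observed = [k_ab, k_a - k_ab, k_b - k_ab, n - k_a - k_b + k_ab]
    rows = [k_a, n - k_a]
    cols = [k_b, n - k_b]
    expected = [rows[i] * cols[j] / n for i in (0, 1) for j in (0, 1)]
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

# Toy verses, not real corpus data
verses = [["nuori", "neiti"], ["nuori", "neiti"],
          ["nuori", "veli"], ["hella", "veli"]]
word_freq, pair_freq, n = count_cooccurrences(verses)
pair = ("neiti", "nuori")
print(pmi(pair_freq[pair], word_freq["neiti"], word_freq["nuori"], n))
print(log_likelihood(pair_freq[pair], word_freq["neiti"], word_freq["nuori"], n))
```

Mutual information tends to favour rare pairs, while the log-likelihood statistic is more robust at low frequencies, which is one common reason for reporting both.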
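The capitalization heuristic can be sketched as below. This is one possible reading of the validity check described above, namely that a capitalized word counts as a candidate name only if an alphabetic word has already occurred earlier in the same verse; the tokenizer and function name are assumptions for illustration.

```python
import re

# Simple illustrative tokenizer: letter runs, digit runs, other characters.
TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|\S")

def candidate_names(verse):
    """Return capitalized words that are not the first alphabetic word of
    the verse. Numbers and punctuation at the start are skipped, so a
    verse-initial word after a quote mark or a number is not mistaken
    for a name."""
    names = []
    seen_alpha = False
    for token in TOKEN_RE.findall(verse):
        if token.isalpha():
            if seen_alpha and token[0].isupper():
                names.append(token)
            seen_alpha = True
    return names

print(candidate_names("Neitsyt Maria, emoni äiti,"))
print(candidate_names("12. Lauri poika Lappalainen"))
```

The output is only a list of candidates: deciding whether a hit is a person, a place or an exceptional use of capital letters still requires human interpretation.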
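One simple way to move collocation pairs into Gephi is a CSV edge list. The sketch below assumes a dictionary of scored word pairs and writes the column headers Source, Target and Weight, which Gephi's spreadsheet import is commonly used with; the function name, threshold parameter and toy scores are hypothetical.

```python
import csv
import io

def write_gephi_edges(pair_scores, handle, min_score=0.0):
    """Write (word_a, word_b) -> score pairs as an edge list.
    Returns the number of edges written."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    written = 0
    for (a, b), score in sorted(pair_scores.items()):
        if score >= min_score:
            writer.writerow([a, b, score])
            written += 1
    return written

# Toy scores, not real results
scores = {("neiti", "nuori"): 2.5, ("hella", "veli"): 0.5}
buffer = io.StringIO()
n_edges = write_gephi_edges(scores, buffer, min_score=1.0)
print(buffer.getvalue())
```

Filtering by a minimum score before export keeps the network readable, since a full collocation table of a corpus this size would be far too dense to visualise.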
The results of this project rely a lot on computational and manual observations made about co-occurrences. We used co-occurrences to detect the most prominent adjectives, but also to detect statistically the closeness in the use of the proper names and terms under observation. The terms and proper names were studied computationally, as well as by close-reading text examples. Some network images were made to illustrate how these different adjectives, names and characters relate to each other.
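The statistical closeness of two terms can be illustrated with cosine similarity over co-occurrence vectors. The sketch below is a minimal example under the assumption that each term is represented by the counts of the words it shares a verse with; the function names and toy verses are not from the project.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(verses):
    """Map each word to a Counter of the words it shares a verse with."""
    vectors = defaultdict(Counter)
    for tokens in verses:
        types = set(tokens)
        for word in types:
            for other in types:
                if other != word:
                    vectors[word][other] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (dict-like)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy verses, not real corpus data
verses = [["nuori", "neiti"], ["nuori", "neito"], ["hella", "veli"]]
vectors = cooccurrence_vectors(verses)
print(cosine(vectors["neiti"], vectors["neito"]))  # same context word
print(cosine(vectors["neiti"], vectors["veli"]))   # no shared context
```

In this toy case the two spelling variants of “maiden” receive a high similarity because they share a collocate, which hints at how the measure can bring together variants that the pipeline otherwise treats as different words.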
To our surprise, the list of the most common adjectives connected to the most common character nouns turned out not to be very long: there are a few frequent collocates and some rare ones. The line between adjectives and other words also turned out to be quite difficult to draw. A phrase such as “Lauri poika Lappalainen” can indicate that “Lauri-boy” (sometimes “the son of Lauri”) belongs to the family of Lappalainen, or “Lappalainen” can be an adjective telling that Lauri makes his living by hunting and fishing, or that he dwells in the wilderness, or belongs to the Sami people. Numerals were also somewhat frequent collocates of our keywords.
The most frequent adjective occurring with young women is “young” (“noor”, “nuori” etc.) in both the Finnish and the Estonian corpus. “Young maiden(s)” and “young daughter(s)” are formulas often used at the beginning of songs, but sometimes also in the middle, to address maidens and other young women.
“Mitä itket, nuori neiti? Itken pientä veikkoani, kun läksi sotahan piennä”
(Why are you crying young maiden?/ I am crying for my young brother,/ Because he left for war when he was small).
Young maidens are approached in many ways: sometimes they are asked a question, sometimes they are invited, warned or advised, e.g.
“Tütar noori linnukene Mine ikki sa vanale Vanal o palju varada/ ...”
(“Young daughter, a bird,/ Marry an older man/ The old one has a lot of property/…”).
In addition, the formula “young maiden” is also used to describe the young woman character without directly addressing her.
Other popular adjectives related to maidens in the Finnish corpus mostly describe the appearance of the maidens, such as beautiful (“sorja”, “koria”) or scabby (“rupinen”), whereas in the Estonian corpus there is a greater variety of frequent adjectives. Alongside beautiful (“ilus”, “kaunis”), there are other adjectives, such as lazy (“laisk”), hard-working (“virk”), dull (“tuim”) and sassy (“ninakas”). Sometimes the maidens of different regions are compared in songs, for example:
“Viru neiud, virgad neiud, Harju neiud laisad neiud, Ei nad oska võida tehä,”
(“Viru maidens, hard-working maidens,/ Harju maidens, lazy maidens,/ They are bad at making butter,/ …”).
Maidens referred to by the terms “likka”, “flikka” or “plikka” were connected with almost no adjectives. Instead, they were often mentioned with verbs, such as sitting by the window or sitting in a bedroom. “Likka” is used in newer songs that have a slightly different form from the older ones, for example a stronger tendency to rhyme. The song type in which a girl is claimed as a bride is the most common one where the word “likka” can be found.
“Likka istui ikkunalla, Itki ja huokas”
(“Girl was sitting by the window, / crying and sighing.”)
“Young brother” is also often used in both the Finnish and the Estonian corpus. However, for young men the most frequent collocate is “tender”, usually as “tender brothers”, although this collocate appears mostly in the Estonian corpus and only rarely in the Finnish one. It is an addressing formula similar to “young maidens”, e.g.
“Veli hella vellekene, ära naera neidusida,”
(“Brother, tender brother,/ Don’t laugh at maidens,/ …”).
The Karelian texts in the Finnish corpus have different adjectives for tender or good brothers: “Sulho, viljo veljyeni” (“Groom, my good brother”). In addition to “tender”, there are also cases of “dear” (“kallis”), which also appears, mainly among the collocates of daughters, in the form “kulla”.
The Finnish corpus reveals that boys can be fierce (“tuima”) and poor (“poloinen”).
“Mistäs on jalkasi vereen tullut, p[oikani] p[oloinen]?”
(“Why are your feet/legs bloody, my poor boy?”)
In the Finnish corpus there seems to be some sort of pattern in using different words meaning poor. “Poloinen” is often used with “piika” (girl) and “poika” (boy), but “parka” is used with other words such as “neito” (maiden). This might be because of the poetic style of runosongs favouring alliteration. The form “pilonen” is sometimes used with “piika” instead of “poloinen” possibly to strengthen the alliteration.
“Voi piika, pilone piika, kuin ei oo minua naitu”
(“Oh girl, [I am such a] poor girl,/ for no one has married me.”)
At first glance, describing young men and women with adjectives does not seem to be central in runosongs. However, there are often parallel lines that add some details about the characters in question. For example, an Estonian song about the sorrows of women remaining at home during a war starts with “young maidens” but keeps describing the maidens in the following verses:
“Neiukesed noorukesed, Uduhellad, kenad kallid. Sinised silmad, sinised lilled, Paled nagu ehapunad,”
(“Young maidens/tender ones, beautiful darlings./Blue eyes, blue flowers,/cheeks like the red of a sunset/…”).
Maidens are described not only as young, they are also beautiful and have physical attributes like blue eyes and red cheeks. Yet, in the scope of this project we did not look into the parallel lines so there is a lot more to discover here.
Apart from adjectives, we also noticed frequent collocates among other characters and verbs. In many cases, brothers/sons/boys and sisters/daughters/maidens are mentioned together, but some collocates include mothers and fathers as well. Sons and boys are sometimes mentioned together with brides.
Both Estonian and Finnish runosongs have a frequent verse formula of a young woman or a young man answering what has been said before (“neito/poika vasten vastaeli”, “neidu/poega kostis”, ‘maiden/boy responded’ etc.). The maiden is thus often depicted as someone who understands the concerns or intentions of others and responds to them. For example, there is a song about a young man who tries to persuade a maiden to marry him, but she sees through his sly promises and answers that he is not as wealthy as he claims to be, which is why she refuses to marry him. She answers the boy:
“Neidu mõistis vasta kostis: “Oh sa petis peiukene,”
(“The maiden understood and answered:/ ‘Oh, you imposter, bridegroom’/ …”).
However, it is not only young people who answer the words and acts of others: this is a common verse line for all kinds of characters, including mothers, fathers, first-person characters and named characters. Another frequent verb, which is more often connected to young women but sometimes also to young men, is “to cry”.
The network (Fig. 4.) shows interesting clusters of words that tend to occur with similar collocates in poetic lines. Northern and Southern Finnic languages mostly form their own clusters, but some words are common to both. The clusters also seem to form according to poetic genres and themes. E.g. in the Northern Finnic languages, the terms relating to tragic young men such as “Iivana the son of Kojonen” or “the beautiful son of Kaleva” make one cluster and the maidens of ballad-like stories and hunting charms such as “Marketta” or “Annikki the handsome maiden” and “Annikki the daughter of Tapio, the master of forest” make their own.
3.2 Character names
Based on the network analysis of the collocations relating to common and proper names, and on the relations of linguistic clusters in these visualizations, it appears that proper names may have been more significant in the Finnish dataset, whereas in the Estonian corpus personal pronouns and general nouns such as “I”, “maiden” or “girl” may be more popular.
One name that comes up quite often with maidens is Maria, in many different variations. Most mentions allude to the Virgin Mary, and they are mostly found in different types of charms. This charm from Northern Finland is related to healing toothache:
“Neitsyt Maria, emoni äiti, Puhu sualla puhtahalla, Herran hengellä hyvällä!”
(“Virgin Mary, mother of my mother [or “motherly mother”], / Speak with a clean mouth, /With the good spirit of the Lord!”)
Maiden is not the only type of young woman connected to Virgin Mary. In Karelian songs Virgin Mary is also called “pyhä piika taivahainen” (“Holy girl of the Heavens”). Of our keywords “pyhä” (holy) appears only with “piika” and “poika” (boy). Both of these uses are connected to Christianity and point to either Virgin Mary or Jesus.
Other frequently mentioned names include Annikki, Anni, Marjatta, Mari, Kaleva, Kojonen and Toomas. These are all characters that appear in several kinds of stories (poetic types), and most of them appear nearly all across the Finnic area, although sometimes with regional name variants (e.g. Annikki–Annike; Marketta–Mareta). Interestingly, the similarities sometimes seem to point to the historical spread of a song. There are some mentions of “Mareta koreta neidu” in the Estonian corpus, although “koreta” has no real meaning in Estonian. It is probably a version of the Ingrian and Karelian verse “Marketta korea neito” (Marketta beautiful maiden), which is quite frequent in the Finnish corpus.
The network visualization of the collocations of proper names (Fig. 5) shows that several names are often given similar attributes or are connected to similar poetic contexts, and from our close reading we also know that some often occur together. The names form clusters, here colored by the Gephi modularity algorithm, that seem to be based on factors relating to languages, regional poetic cultures, genres and even poetic types, but that often overlap in interesting ways. For example, the pink nodes are mainly Estonian place names, while the blue nodes include both place names and Southern Finnic proper nouns. Overall, there is a lot more to be discussed and discovered concerning proper names.
3.3 Character names ending with -tar
A special case among female characters is those whose name ends with -tar. The preceding part of the name denotes the family to which this daughter belongs. The families in this form are usually mythical in nature, like sky or air daughter “Ilmatar”, the Wolveriness-beer-smith “Osmotar”, and also Lady of pain “Kiwutar”, and “Loviatar”, a woman who can move between the spirit-world and the world of the living. These names do not specify the personal name of the female character, but in some cases the poetic context tells that they are maidens or mothers.
This form is fairly rare in the corpus, so statistical methods aren’t very useful for creating representations. These names seem to be more prominent in Northern Finland and Karelia than in other regions, and the Estonian corpus does not contain this particular form at all. It’s worth noting that this is also due to linguistic morphology; e.g. “Ilmatütär” or “Ilmaneiu” (daughter of air) is present in the Estonian corpus and translates the same as “Ilmatar”. Discerning whether they are used in similar motifs and themes requires more attention. In any case, it looks like the two forms, the Estonian and the Fenno-Karelian, are closely related.
There are roughly 200 instances of character names and variations ending with -tar. Some of them appear collocated with “maiden”, and are then picked up by the computational methods used. Due to low frequencies they are, however, bound to disappear as background noise in wider computational views, and thus serve as an example of how a statistical approach can in some cases generalise the results too much – some of these characters are described as maidens, but this particular group mostly refers to mythical beings.
4. Current challenges
Our starting point was that, for various reasons, our data is extremely varied and difficult, and that we need to be creative when looking for computational views that tell us something new about it. This also means that quite a lot of manual work, close reading and comparison is needed in order to validate some of our preliminary results.
Due to morphological variation, some compromises were made when performing queries in the corpus. We are aware that some meaningful instances were not captured, in favour of collecting less noise. In practice, correcting this would require more close reading and handpicking, as automating the process would mean handling a great number of exceptions and special cases. Building a dictionary would be useful for future research in order to fully grasp the material, but it requires working with more than 800,000 different word forms.
It is also apparent that we did not manage to go through all the interesting cases which is why our observations are not comprehensive. We also did not concentrate on regions separately, which is why dialects and even languages (Votic, Seto) with less material might have gotten less attention.
It is also important to keep in mind that the data at hand is not an ideal representation of the Finnic oral poetry tradition: there are many biases in the material. The two corpora we used include different amounts of different song types, and not all types are found in all geographical regions. This has some effect on the results. For example, there are many more wedding songs in the Estonian corpus than in the Finnish one, and those in the Finnish corpus are mostly recorded from Ingria and Karelia. From Western Finland there are no wedding songs at all, and only a few long narrative songs. This is due to a number of regional and historical differences, including the dynamics of Lutheran and Russian Orthodox Christianity and of literacy in different parts of the Finnic area. This unevenness affects e.g. how often words such as brother and sister occur, as these are very common in wedding songs. The various choices and preferences of the collectors of the songs also have a significant impact on what the datasets include.
5. Plans for future research
This project was only a very small example of all the things that could be studied by looking at the whole corpus of Finnic oral poetry computationally. We gained an overview of the data regarding the characters in focus, especially regarding the adjectives, but we did not manage to look into regional and thematic differences or to verbalise archetypes as we initially intended. Furthermore, the collocations could be studied more carefully: detecting all the adjectives is hard and slow because of the vast linguistic variation in the data, and requires a lot of manual labour. The same applies to nearly all aspects of this project: everything could be looked at in more detail, and all the queries and comparisons could be polished further.
There are also many other things that could be studied in the dataset used in this project. For example, more attention could be paid to proper names and to whether they describe certain characters or are used in a more general manner. Another possibility would be to look more closely at the verbs related to the keywords of this project, for example whether some verbs occur more often with males or females, or whether women are more often actors or objects of action. Moreover, differences and similarities between regions and dialects, and between poetic genres and types, could be studied further.
In this project, we searched for co-occurrences of words only within one verse. One possibility would be to look at the wider context of the keywords. The co-occurrence search could include the verses before and after the keyword, or even the context of the whole poem could be taken into account. Broadening the research in this manner would surely bring out some interesting patterns that might not be recognized by solely qualitative research.
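A possible shape for such a wider search is sketched below. It is a hypothetical illustration of the idea, not something implemented in the project: for every verse containing a keyword, it counts the words found in that verse and in a configurable number of neighbouring verses. The function name, the `window` parameter and the toy poem (adapted from the Estonian example quoted earlier) are assumptions.

```python
from collections import Counter

def window_cooccurrences(poem, keywords, window=1):
    """poem: list of verses, each a list of tokens; keywords: a set of words.
    Counts (keyword, word) pairs within `window` verses before and after
    each verse that contains the keyword."""
    counts = Counter()
    for i, verse in enumerate(poem):
        present = keywords.intersection(verse)
        if not present:
            continue
        start = max(0, i - window)
        end = min(len(poem), i + window + 1)
        context = set()
        for neighbour in poem[start:end]:
            context.update(neighbour)
        for keyword in present:
            for word in context - {keyword}:
                counts[(keyword, word)] += 1
    return counts

# Toy poem, lower-cased and tokenized by hand
poem = [["neiukesed", "noorukesed"],
        ["uduhellad", "kenad", "kallid"],
        ["sinised", "silmad", "sinised", "lilled"]]
counts = window_cooccurrences(poem, {"neiukesed"}, window=1)
print(sorted(counts))
```

With `window=0` this reduces to the single-verse search used in the project, while larger windows would pick up descriptions carried by parallel lines, at the cost of more noise.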
There is plenty that can still be done with this data in the future and this Hackathon project managed to only scratch the surface. But in this short period, many interesting aspects and patterns were discovered and also many more questions were raised that can hopefully be answered in the future. Combining SKVR and ERAB corpora and using computational methods provides valuable insight that has previously remained inaccessible to the researchers. We are looking forward to the fascinating results of future research on runosongs!
 See Apo, Nenola & Stark-Arola 1998 on gender and oral tradition; Kupiainen, Tarja 2004 on youths in Finnic oral tradition.
 Kallio, Frog & Sarv 2017, 141.
 Sarv 2019.
 Saarinen 2018; Sarv 2017.
 Dunning 1993; Bordag 2008.
 About Virgin Mary in Finnish runosongs, see Timonen 2017.
EÜS = The folklore collection of the Estonian Students’ Society at the Estonian Folklore Archives of the Estonian Literary Museum.
H = The folklore collection compiled by Jakob Hurt at the Estonian Folklore Archives.
SKVR = Suomen Kansan Vanhat Runot (Old Poems of the Finnish People) published by the Finnish Literature Society (1908-1997); online version available at http://skvr.fi/.
Apo, S., Nenola, A. & Stark-Arola, L. (eds.) 1998: Gender and Folklore. Perspectives on Finnish and Karelian Culture. SFF 4. Helsinki: SKS.
Bordag, S. 2008. A Comparison of Co-occurrence and Similarity Measures as Simulations of Context. In: Proceedings of CiCLing 2008. https://www.researchgate.net/publication/221628921_A_Comparison_of_Co-occurrence_and_Similarity_Measures_as_Simulations_of_Context
Dunning, T. 1993. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics 19(1): 61–74.
Kallio, K., Frog, M. & Sarv, M. 2017. What to Call the Poetic Form: Kalevala-Meter or Kalevalaic Verse, regivärss, Runosong, the Finnic Tetrameter, Finnic Alliterative Verse or Something Else? RMN Newsletter 12–13: 139–161. https://www.helsinki.fi/sites/default/files/atoms/files/rmn_12-13_2016-2017.pdf
Kupiainen, Tarja. 2004. Kertovan kansanrunouden nuori nainen ja nuori mies. Helsinki: SKS.
Saarinen, J. 2018. Runolaulun poetiikka: Säe, syntaksi ja parallelismi Arhippa Perttusen runoissa. Helsinki: Helsingin yliopisto. http://urn.fi/URN:ISBN:978-951-51-3919-1 (PDF).
Sarv, M. 2017. Towards a Typology of Parallelism in Estonian Poetic Folklore. Folklore: Electronic Journal of Folklore 67 (2017): 65–92.
Sarv, M. 2019. Poetic metre as a function of language: linguistic grounds for metrical variation in Estonian runosongs. Studia Metrica et Poetica 6(2): 102–148, https://doi.org/10.12697/smp.2019.6.2.04.
Timonen, S. 2017: Suomalainen neitsyt Maria: parantaja, loistava näky, rukoiltava apu. – Kallio, Kati & Lehtonen, Tuomas & Timonen, Senni & Järvinen, Irma-Riitta & Leskelä, Ilkka 2017: Laulut ja kirjoitukset: Suullinen ja kirjallinen kulttuuri uuden ajan alun Suomessa. Helsinki: SKS, pp. 215–233. http://dx.doi.org/10.21435/skst.1427.