Part 2: FILTER database structure

Originally, the FILTER database was meant to serve as a backend for the Web application Runoregi that allows for browsing automatically computed similarities. Right now, it also provides data to the visualization app.

The database is generated from raw data (SKVR in XML format, ERAB in the format of CSV tables containing embedded XML) by a set of small programs ("scripts"), that are collectively referred to as the "FILTER data pipeline". The pipeline is fully automatized, so that reproducing the database or generating a new copy for it is easy. The code repository for the pipeline is not public yet, mainly because it contains the raw data, but there are plans to make it public.

The tables in the FILTER database can be generally divided into three types:

tables storing entities: collectors, places, poems, polygons, types, verses, words
tables storing relations between entities: p_col, p_pl, p_sim, p_typ, v_sim, v_clust, verse_poem, word_occ
tables storing additional information: place_stats, p_year, refs, v_clust_freq, v_clusterings, verses_cl, word_freq

Tables storing entities

These tables store detailed information about entities, i.e. "things that can be named". In addition to the data, each entity is given a database-internal identifier (number), with which it is referenced in other tables.

`collectors`

col_id - database-internal collector ID
col_orig_id - the ID in the original dataset + dataset prefix
name - collector name

`places`

This table stores places as they are referred to in the corpora (SKVR, ERAB).

pl_id - database-internal place ID
place_orig_id - the ID in the original dataset + dataset prefix
name - place (parish/county) name
type - either parish or county
par_id - parent ID (loc_id of the parent location, which is county for parishes and 0 for counties)

`poems`

p_id - database-internal poem ID
nro - poem ID in the original dataset, e.g. skvr07108063 or ekms80206380019
collection - the subcorpus to which the poem belongs (one of: skvr, erab, literary)

`polygons`

This table stores the polygons that are displayed on the map in the visualizations app.

pol_id - polygon ID
lang - the language of the parish
code - short parish code (not used)
name
geometry - the spatial information in ST binary format (more information here)

`types`: the type indices

t_id - database-internal type ID
type_orig_id - type ID in the original dataset + dataset prefix
name - type title
description
par_id - parent ID (t_id of the type one level upper in the hierarchy)

`verses`

v_id - database-internal verse ID
type - the name of the XML element which contained the verse; V for text verses, CPT for titles, L or K for text-internal notes etc.
text - verse text

`words`

w_id - database-internal word ID
text - the word itself

Tables storing relations between entities

This kind of table typically contains two database-internal IDs specifying a relation between two entities (e.g. a poem being ascribed to a parish).

`p_col`: poem to collector

p_id - poem ID
col_id - collector ID

`p_pl`: poem to place

p_id - poem ID
pl_id - place ID

`p_clust`: poem to cluster

p_id - poem ID
clust_id - poem cluster ID

`p_sim`: similarity between poems

p1_id - ID of the first poem
p2_id - ID of the second poem
sim_cos - cosine similarity of verse cluster vectors
sim_al - alignment similarity (proportion of aligned verse pairs to the length of the alignment)
sim_al_l - left alignment similarity (proportion of aligned verse pairs to the length of the first poem; used for "Contained in poems" list in Runoregi)
sim_al_l - right alignment similarity (proportion of aligned verse pairs to the length of the second poem; used for "Contains poems" list in Runoregi)
cont - continuity (how continuous the aligned part of the two poems is)

`p_typ`: poem to type

p_id - poem ID
t_id - type ID
is_minor - 1 if the type is marked as "minor" in SKVR (not used in other datasets); otherwise 0

`pol_pl`: place to polygon

This table maps places from the corpora to polygons on the map. It has been reviewed manually. In case a polygon represents multiple places, exactly one of them is marked as "primary".

pol_id - polygon ID
pl_id - place ID
is_primary

`v_sim`: similarity between verses

v1_id - ID of the first verse
v2_id - ID of the second verse
sim_cos - bigram-based cosine similarity of the verses

`v_clust`: verse to cluster

When using this table, it's important to always add a condition on clustering_id (select which clustering to use). Otherwise the queries might be very slow and give incorrect results.

v_id - verse ID
clust_id - cluster ID
clustering_id - clustering ID

`verse_poem`: verse to poem

p_id - poem ID
pos - position of the verse in the poem
v_id - verse ID

`word_occ`: word to verse

w_id - word ID
pos - position of the word in the verse
v_id - verse ID

Tables storing additional information

`place_stats`

This table stores additional statistics for locations.

pl_id - place ID
npoems - number of poems assigned to the location
nverses - total number of verses
nwords - total number of words (tokens)

`p_clust_freq`: sizes of poem clusters

clust_id - poem cluster ID
freq - the number of poems in the cluster

`p_year`: year of publication of a poem

p_id - poem ID
year

`refs`: footnotes to poems

p_id - poem ID
num - the position of the footnote in the list of footnotes for this poem (note: this matters only for ERAB; in SKVR, all footnotes are contained in one entry)
type - the name of the XML element containing the footnote
text - the content of the footnote

`v_clust_freq`: frequency of verse clusters

clust_id - cluster ID
freq - frequency (the total number of occurrences of verses from this cluster)
clustering_id - clustering ID

`v_clusterings`: information about stored clusterings

This table provides information about the different available clusterings.

clustering_id
name - the short name as shown in Runoregi
description

`verses_cl`: cleaned verses

This table contains the verse texts as they are preprocessed for similarity computation: removal of punctuation, numbers and special characters, lowercasing etc. Verses that are disregarded in the computation (e.g. empty or consisting only of punctuation marks) are missing from this table.

v_id - verse ID
text - the cleaned verse text

`word_freq`: word frequencies

w_id - word ID
freq - frequency

Part 2: FILTER database structure

Tables storing entities

collectors

places

poems

polygons

types: the type indices

verses

words

Tables storing relations between entities

p_col: poem to collector

p_pl: poem to place

p_clust: poem to cluster

p_sim: similarity between poems

p_typ: poem to type

pol_pl: place to polygon

v_sim: similarity between verses

v_clust: verse to cluster

verse_poem: verse to poem

word_occ: word to verse