Part 2: FILTER database structure

Originally, the FILTER database was meant to serve as a backend for the Web application Runoregi that allows for browsing automatically computed similarities. Right now, it also provides data to the visualization app.

The database is generated from raw data (SKVR in XML format, ERAB in the format of CSV tables containing embedded XML) by a set of small programs ("scripts"), that are collectively referred to as the "FILTER data pipeline". The pipeline is fully automatized, so that reproducing the database or generating a new copy for it is easy. The code repository for the pipeline is not public yet, mainly because it contains the raw data, but there are plans to make it public.

The tables in the FILTER database can be generally divided into three types:

  • tables storing entities: collectors, places, poems, polygons, types, verses, words
  • tables storing relations between entities: p_col, p_pl, p_sim, p_typ, v_sim, v_clust, verse_poem, word_occ
  • tables storing additional information: place_stats, p_year, refs, v_clust_freq, v_clusterings, verses_cl, word_freq

Tables storing entities

These tables store detailed information about entities, i.e. "things that can be named". In addition to the data, each entity is given a database-internal identifier (number), with which it is referenced in other tables.

collectors

  • col_id - database-internal collector ID
  • col_orig_id - the ID in the original dataset + dataset prefix
  • name - collector name

places

This table stores places as they are referred to in the corpora (SKVR, ERAB).

  • pl_id - database-internal place ID
  • place_orig_id - the ID in the original dataset + dataset prefix
  • name - place (parish/county) name
  • type - either parish or county
  • par_id - parent ID (loc_id of the parent location, which is county for parishes and 0 for counties)

poems

  • p_id - database-internal poem ID
  • nro - poem ID in the original dataset, e.g. skvr07108063 or ekms80206380019
  • collection - the subcorpus to which the poem belongs (one of: skvr, erab, literary)

polygons

This table stores the polygons that are displayed on the map in the visualizations app.

  • pol_id - polygon ID
  • lang - the language of the parish
  • code - short parish code (not used)
  • name
  • geometry - the spatial information in ST binary format (more information here)

types: the type indices

  • t_id - database-internal type ID
  • type_orig_id - type ID in the original dataset + dataset prefix
  • name - type title
  • description
  • par_id - parent ID (t_id of the type one level upper in the hierarchy)

verses

  • v_id - database-internal verse ID
  • type - the name of the XML element which contained the verse; V for text verses, CPT for titles, L or K for text-internal notes etc.
  • text - verse text

words

  • w_id - database-internal word ID
  • text - the word itself

Tables storing relations between entities

This kind of table typically contains two database-internal IDs specifying a relation between two entities (e.g. a poem being ascribed to a parish).

p_col: poem to collector

  • p_id - poem ID
  • col_id - collector ID

p_pl: poem to place

  • p_id - poem ID
  • pl_id - place ID

p_clust: poem to cluster

  • p_id - poem ID
  • clust_id - poem cluster ID

p_sim: similarity between poems

  • p1_id - ID of the first poem
  • p2_id - ID of the second poem
  • sim_cos - cosine similarity of verse cluster vectors
  • sim_al - alignment similarity (proportion of aligned verse pairs to the length of the alignment)
  • sim_al_l - left alignment similarity (proportion of aligned verse pairs to the length of the first poem; used for "Contained in poems" list in Runoregi)
  • sim_al_l - right alignment similarity (proportion of aligned verse pairs to the length of the second poem; used for "Contains poems" list in Runoregi)
  • cont - continuity (how continuous the aligned part of the two poems is)

p_typ: poem to type

  • p_id - poem ID
  • t_id - type ID
  • is_minor - 1 if the type is marked as "minor" in SKVR (not used in other datasets); otherwise 0

pol_pl: place to polygon

This table maps places from the corpora to polygons on the map. It has been reviewed manually. In case a polygon represents multiple places, exactly one of them is marked as "primary".

  • pol_id - polygon ID
  • pl_id - place ID
  • is_primary

v_sim: similarity between verses

  • v1_id - ID of the first verse
  • v2_id - ID of the second verse
  • sim_cos - bigram-based cosine similarity of the verses

v_clust: verse to cluster

When using this table, it's important to always add a condition on clustering_id (select which clustering to use). Otherwise the queries might be very slow and give incorrect results.

  • v_id - verse ID
  • clust_id - cluster ID
  • clustering_id - clustering ID

verse_poem: verse to poem

  • p_id - poem ID
  • pos - position of the verse in the poem
  • v_id - verse ID

word_occ: word to verse

  • w_id - word ID
  • pos - position of the word in the verse
  • v_id - verse ID

Tables storing additional information

place_stats

This table stores additional statistics for locations.

  • pl_id - place ID
  • npoems - number of poems assigned to the location
  • nverses - total number of verses
  • nwords - total number of words (tokens)

p_clust_freq: sizes of poem clusters

  • clust_id - poem cluster ID
  • freq - the number of poems in the cluster

p_year: year of publication of a poem

  • p_id - poem ID
  • year

refs: footnotes to poems

  • p_id - poem ID
  • num - the position of the footnote in the list of footnotes for this poem (note: this matters only for ERAB; in SKVR, all footnotes are contained in one entry)
  • type - the name of the XML element containing the footnote
  • text - the content of the footnote

v_clust_freq: frequency of verse clusters

  • clust_id - cluster ID
  • freq - frequency (the total number of occurrences of verses from this cluster)
  • clustering_id - clustering ID

v_clusterings: information about stored clusterings

This table provides information about the different available clusterings.

  • clustering_id
  • name - the short name as shown in Runoregi
  • description

verses_cl: cleaned verses

This table contains the verse texts as they are preprocessed for similarity computation: removal of punctuation, numbers and special characters, lowercasing etc. Verses that are disregarded in the computation (e.g. empty or consisting only of punctuation marks) are missing from this table.

  • v_id - verse ID
  • text - the cleaned verse text

word_freq: word frequencies

  • w_id - word ID
  • freq - frequency