Part 2: FILTER database structure
Originally, the FILTER database was meant to serve as a backend for the Web application Runoregi that allows for browsing automatically computed similarities. Right now, it also provides data to the visualization app.
The database is generated from raw data (SKVR in XML format, ERAB in the format of CSV tables containing embedded XML) by a set of small programs ("scripts"), that are collectively referred to as the "FILTER data pipeline". The pipeline is fully automatized, so that reproducing the database or generating a new copy for it is easy. The code repository for the pipeline is not public yet, mainly because it contains the raw data, but there are plans to make it public.
The tables in the FILTER database can be generally divided into three types:
- tables storing entities:
collectors
,places
,poems
,polygons
,types
,verses
,words
- tables storing relations between entities:
p_col
,p_pl
,p_sim
,p_typ
,v_sim
,v_clust
,verse_poem
,word_occ
- tables storing additional information:
place_stats
,p_year
,refs
,v_clust_freq
,v_clusterings
,verses_cl
,word_freq
Tables storing entities
These tables store detailed information about entities, i.e. "things that can be named". In addition to the data, each entity is given a database-internal identifier (number), with which it is referenced in other tables.
collectors
col_id
- database-internal collector IDcol_orig_id
- the ID in the original dataset + dataset prefixname
- collector name
places
This table stores places as they are referred to in the corpora (SKVR, ERAB).
pl_id
- database-internal place IDplace_orig_id
- the ID in the original dataset + dataset prefixname
- place (parish/county) nametype
- eitherparish
orcounty
par_id
- parent ID (loc_id
of the parent location, which is county for parishes and0
for counties)
poems
p_id
- database-internal poem IDnro
- poem ID in the original dataset, e.g.skvr07108063
orekms80206380019
collection
- the subcorpus to which the poem belongs (one of:skvr
,erab
,literary
)
polygons
This table stores the polygons that are displayed on the map in the visualizations app.
pol_id
- polygon IDlang
- the language of the parishcode
- short parish code (not used)name
geometry
- the spatial information in ST binary format (more information here)
types
: the type indices
t_id
- database-internal type IDtype_orig_id
- type ID in the original dataset + dataset prefixname
- type titledescription
par_id
- parent ID (t_id
of the type one level upper in the hierarchy)
verses
v_id
- database-internal verse IDtype
- the name of the XML element which contained the verse;V
for text verses,CPT
for titles,L
orK
for text-internal notes etc.text
- verse text
words
w_id
- database-internal word IDtext
- the word itself
Tables storing relations between entities
This kind of table typically contains two database-internal IDs specifying a relation between two entities (e.g. a poem being ascribed to a parish).
p_col
: poem to collector
p_id
- poem IDcol_id
- collector ID
p_pl
: poem to place
p_id
- poem IDpl_id
- place ID
p_clust
: poem to cluster
p_id
- poem IDclust_id
- poem cluster ID
p_sim
: similarity between poems
p1_id
- ID of the first poemp2_id
- ID of the second poemsim_cos
- cosine similarity of verse cluster vectorssim_al
- alignment similarity (proportion of aligned verse pairs to the length of the alignment)sim_al_l
- left alignment similarity (proportion of aligned verse pairs to the length of the first poem; used for "Contained in poems" list in Runoregi)sim_al_l
- right alignment similarity (proportion of aligned verse pairs to the length of the second poem; used for "Contains poems" list in Runoregi)cont
- continuity (how continuous the aligned part of the two poems is)
p_typ
: poem to type
p_id
- poem IDt_id
- type IDis_minor
-1
if the type is marked as "minor" in SKVR (not used in other datasets); otherwise0
pol_pl
: place to polygon
This table maps places from the corpora to polygons on the map. It has been reviewed manually. In case a polygon represents multiple places, exactly one of them is marked as "primary".
pol_id
- polygon IDpl_id
- place IDis_primary
v_sim
: similarity between verses
v1_id
- ID of the first versev2_id
- ID of the second versesim_cos
- bigram-based cosine similarity of the verses
v_clust
: verse to cluster
When using this table, it's important to always add a condition on
clustering_id
(select which clustering to use). Otherwise the queries might
be very slow and give incorrect results.
v_id
- verse IDclust_id
- cluster IDclustering_id
- clustering ID
verse_poem
: verse to poem
p_id
- poem IDpos
- position of the verse in the poemv_id
- verse ID
word_occ
: word to verse
w_id
- word IDpos
- position of the word in the versev_id
- verse ID
Tables storing additional information
place_stats
This table stores additional statistics for locations.
pl_id
- place IDnpoems
- number of poems assigned to the locationnverses
- total number of versesnwords
- total number of words (tokens)
p_clust_freq
: sizes of poem clusters
clust_id
- poem cluster IDfreq
- the number of poems in the cluster
p_year
: year of publication of a poem
p_id
- poem IDyear
refs
: footnotes to poems
p_id
- poem IDnum
- the position of the footnote in the list of footnotes for this poem (note: this matters only for ERAB; in SKVR, all footnotes are contained in one entry)type
- the name of the XML element containing the footnotetext
- the content of the footnote
v_clust_freq
: frequency of verse clusters
clust_id
- cluster IDfreq
- frequency (the total number of occurrences of verses from this cluster)clustering_id
- clustering ID
v_clusterings
: information about stored clusterings
This table provides information about the different available clusterings.
clustering_id
name
- the short name as shown in Runoregidescription
verses_cl
: cleaned verses
This table contains the verse texts as they are preprocessed for similarity computation: removal of punctuation, numbers and special characters, lowercasing etc. Verses that are disregarded in the computation (e.g. empty or consisting only of punctuation marks) are missing from this table.
v_id
- verse IDtext
- the cleaned verse text
word_freq
: word frequencies
w_id
- word IDfreq
- frequency