Part 1. Introduction to databases

Relational databases

Why use relational databases?

Storing the data in a database has multiple advantages over using simple file formats, like CSV or XML files. In our case, the most important reasons are:

  • efficient storage of large data - not taking more disk space than necessary
  • efficient querying - typically we are only interested in a small part of the data at once (e.g. a single poem): a well-designed database will only read from the disk the parts of the dataset that are currently needed;
  • easy indexing - we can precompute an index for every field that is expected to be often used in search or join operations; this speeds up the queries a lot
  • powerful query language - the database can be queried using the language SQL, which allows for complicated conditions and transformations on the data.

For the sake of completeness, we can also mention good reasons to use relational databases that are not applicable in our case:

  • support for local changes - a small change, like adding a new row to a table, is an easy operation for the database
  • atomic transaction properties - the possibility to define "transactions" - changes to the database that are guaranteed to be executed either entirely or not at all (e.g. in case of power / network failure), but never partially. For example in case of a bank transfer, subtracting the amount from the sender's account and adding it to the receiver's account are two operations that must be inseparable under any conditions.

Basic concepts

Tables

A relational database consists of tables. Each row of the table is an entry (e.g. a poem, a verse, a collector) and each column contains a specific piece of information about an entry (e.g. a collector's name or a verse type):

+--------+-------------+-----------------+
| col_id | col_orig_id | name            |
+--------+-------------+-----------------+
|      1 | 1           | Ahlqvist  A. E. |
|      2 | 2           | Ahtia,  E       |
|      3 | 3           | Andberg, A. F.  |
+--------+-------------+-----------------+
+------+------+-----------------------------------+
| v_id | type | text                              |
+------+------+-----------------------------------+
|    1 | V    | Vaan se on vanha V[äinämöinen]    |
|    2 | V    | Lähtiäks[ensä] käkesi,            |
|    3 | V    | Tullaks[ensa] toiv[otteli]        |
+------+------+-----------------------------------+

The columns have a specified data type: for example an integer number, or a string of characters.

Numeric identifiers

Keys

Connecting to the database

One way to access the database is via one of the popular graphical clients, e.g.: * MySQL workbench * HeidiSQL

Client software

Accessing the database from scripting languages

Many programming language contain a MariaDB client library, so that data can be obtained straight from the database without having to store it in files. E.g. in R, an SQL query might be used to obtain a data frame. The advantage of such approach is that the code will contain the SQL query, documenting exactly how the data was obtained.

Important: connecting to the database requires password - it is bad coding practice to include it in the code, even if it's not widely shared. A simple and safe-enough solution is to use environment variables.

R

The following code shows how to connect to the database in R using the library RMariaDB.

Note: The function Sys.getenv() is used to get the value of an environment variable. This assumes that the environment variables DB_HOST, DB_USER and DB_PASS has been set to your connection data before running this code.

One way of setting them is running the following (substitute the data with your DB connection data):

Sys.setenv(DB_HOST='myhost', DB_USER='myusername', DB_PASS='mypassword')

For security reasons, it's better not to store this part in the code, but rather type/paste it manually each time.

Then, here's the main script:

library(RMariaDB)

# define a function querying a database and getting the result as a data frame
query_db <- function(con, q) {
  res <- dbSendQuery(con, q)
  data <- dbFetch(res)
  dbClearResult(res)
  data
}

# connect to the DB
# set the following environment variables for connecting:
#  DB_HOST -- IP address of the MariaDB server
#  DB_USER -- MariaDB username
#  DB_PASS -- password
con <- dbConnect(MariaDB(), host=Sys.getenv('DB_HOST'), dbname='filter',
                 user=Sys.getenv('DB_USER'), password=Sys.getenv('DB_PASS'),
                 bigint='integer')

# example query: get number of poems by location
q <- '
  SELECT
    l2.name AS county_name,
    l.name AS parish_name,
    count(*) as n
  FROM
    p_loc
    JOIN locations l ON l.loc_id = p_loc.loc_id
    JOIN locations l2 ON l.par_id = l2.loc_id
  GROUP BY l.loc_id;
'

# get the query results
query_result <- query_db(con, q)

Python