Part 1. Introduction to databases
Relational databases
Why use relational databases?
Storing the data in a database has multiple advantages over using simple file formats, like CSV or XML files. In our case, the most important reasons are:
- efficient storage of large data - not taking more disk space than necessary
- efficient querying - typically we are only interested in a small part of the data at once (e.g. a single poem): a well-designed database will only read from the disk the parts of the dataset that are currently needed;
- easy indexing - we can precompute an index for every field that is expected to be often used in search or join operations; this speeds up the queries a lot
- powerful query language - the database can be queried using the language SQL, which allows for complicated conditions and transformations on the data.
For the sake of completeness, we can also mention good reasons to use relational databases that are not applicable in our case:
- support for local changes - a small change, like adding a new row to a table, is an easy operation for the database
- atomic transaction properties - the possibility to define "transactions" - changes to the database that are guaranteed to be executed either entirely or not at all (e.g. in case of power / network failure), but never partially. For example in case of a bank transfer, subtracting the amount from the sender's account and adding it to the receiver's account are two operations that must be inseparable under any conditions.
Basic concepts
Tables
A relational database consists of tables. Each row of the table is an entry (e.g. a poem, a verse, a collector) and each column contains a specific piece of information about an entry (e.g. a collector's name or a verse type):
+--------+-------------+-----------------+
| col_id | col_orig_id | name |
+--------+-------------+-----------------+
| 1 | 1 | Ahlqvist A. E. |
| 2 | 2 | Ahtia, E |
| 3 | 3 | Andberg, A. F. |
+--------+-------------+-----------------+
+------+------+-----------------------------------+
| v_id | type | text |
+------+------+-----------------------------------+
| 1 | V | Vaan se on vanha V[äinämöinen] |
| 2 | V | Lähtiäks[ensä] käkesi, |
| 3 | V | Tullaks[ensa] toiv[otteli] |
+------+------+-----------------------------------+
The columns have a specified data type: for example an integer number, or a string of characters.
Numeric identifiers
Keys
Connecting to the database
One way to access the database is via one of the popular graphical clients, e.g.: * MySQL workbench * HeidiSQL
Client software
Accessing the database from scripting languages
Many programming language contain a MariaDB client library, so that data can be obtained straight from the database without having to store it in files. E.g. in R, an SQL query might be used to obtain a data frame. The advantage of such approach is that the code will contain the SQL query, documenting exactly how the data was obtained.
Important: connecting to the database requires password - it is bad coding practice to include it in the code, even if it's not widely shared. A simple and safe-enough solution is to use environment variables.
R
The following code shows how to connect to the database in R using the library
RMariaDB
.
Note: The function Sys.getenv()
is used to get the value of an
environment variable. This assumes that the environment variables DB_HOST
,
DB_USER
and DB_PASS
has been set to your connection data before running
this code.
One way of setting them is running the following (substitute the data with your DB connection data):
Sys.setenv(DB_HOST='myhost', DB_USER='myusername', DB_PASS='mypassword')
For security reasons, it's better not to store this part in the code, but rather type/paste it manually each time.
Then, here's the main script:
library(RMariaDB)
# define a function querying a database and getting the result as a data frame
query_db <- function(con, q) {
res <- dbSendQuery(con, q)
data <- dbFetch(res)
dbClearResult(res)
data
}
# connect to the DB
# set the following environment variables for connecting:
# DB_HOST -- IP address of the MariaDB server
# DB_USER -- MariaDB username
# DB_PASS -- password
con <- dbConnect(MariaDB(), host=Sys.getenv('DB_HOST'), dbname='filter',
user=Sys.getenv('DB_USER'), password=Sys.getenv('DB_PASS'),
bigint='integer')
# example query: get number of poems by location
q <- '
SELECT
l2.name AS county_name,
l.name AS parish_name,
count(*) as n
FROM
p_loc
JOIN locations l ON l.loc_id = p_loc.loc_id
JOIN locations l2 ON l.par_id = l2.loc_id
GROUP BY l.loc_id;
'
# get the query results
query_result <- query_db(con, q)