Relational Database as a Data Transformation Tool: The MFDB System

TODO: I think the impact of the ms could be stronger if we broaden its scope to the fundamental idea that the coupling of databases with models is largely missing in the context of fishery science, and that it becomes almost essential as we move to more complex ecosystem modelling. I've suggested some text on this towards the beginning. Guys, please tell me if you have a different view and think I'm off track.

TODO: Yes, one thing I've been trying to get across whilst revising the text is that as you move to ecosystem modelling, having to coerce data from multiple sources together is inevitable from the nature of the exercise. Automating that process (to me, at least) seems essential to making multi-species models whilst remaining sane. I also think treating this as a relatively separate exercise from forming the model data is important.

TODO: However, how far do we go in concentrating on this abstract point versus how MFDB in particular helps solve it? One could do most of what MFDB does with data.tables and dplyr, for example.

Didn't we talk about a genetics example whilst in Brussels too? Or have I completely misremembered where I had that conversation?

TODO: Aiming to concentrate on cookbook examples / demonstrations of advantages, rather than the theory of how MFDB works (schema, breakdown of queries, how bootstrapping works, etc.). Good move?

TODO: Convert to RMarkdown to run scripts

Abstract

TODO:

Introduction

Advances in available computing power mean that it is possible to utilise much more data when modelling marine stocks, both in depth of history used when fitting and breadth when constructing multi-species models. For example, the Gadget modelling framework (Begley_and_Howell-2004) offers many opportunities to insert data to improve fit. When data comes from several disparate sources, there are inevitably problems with mismatching levels of detail, differing aggregations, and so on. A data amalgamation step is required.

Whilst any amalgamation could be achieved with a collection of spreadsheets or similar, to do so with a reasonable level of rigour takes time, and the work has to be repeated each time you refine your model and reconsider area aggregations, temporal resolution or length intervals. This imposes an upper limit on the complexity of model that can be created, and can considerably limit the uptake of more complex ecosystem modelling, an essential part of ecosystem-based fisheries management.

In the context of fisheries science, an evident gap exists between source data and the models which are the main tool to evaluate the status of marine resources and ecosystems for fisheries management. (Christensen-2009) demonstrates the potential for database-driven modelling and its application to EwE; however, it does not offer any generic tools for formulating such a database. The DST^2 project produced a database, DSTDW (Kupca-2004), to manage its case study data. Whilst DSTDW can be used with other models, it was primarily designed for the case studies in that project, to interface with Gadget, and has proven difficult to extend.

Finally, there are many freely available databases, for instance PostgreSQL. Given enough knowledge of the tools, it is certainly possible for someone to formulate a database for their given area. However, this requires knowledge of SQL and database optimisation. Moreover, the structure of the data involved will have many similarities across any ecosystem modelling exercise, meaning a lot of development work can be shared between projects. Just as a modelling exercise is more productive when using a framework such as Gadget rather than coding everything yourself, a framework for data collation can also speed model development.

Thus, we believe a single database design can provide the benefits of database-driven modelling without the need to design a database from scratch, especially when combined with a programming language like R to fall back on when the database system does not provide the exact transformations required.

The MFDB system

Generally when databases are mentioned, one envisages a central server to connect to, acting as a hub for information storage. Information is stored and then searched, combining data sets as appropriate. Thanks to advances in computing power, we do not need to install databases onto specialist servers. We can install a database onto our own laptop, and still take advantage of the data transformation features it offers.

The MareFrame Database (MFDB) is not a central database. Instead it is an R package, or toolbox, to help you manage a database on your own computer. It provides:

  • A predefined schema suited to marine ecosystem data, with taxonomies used to validate data on entry
  • Functions to import data from R data.frames
  • Query functions that filter, aggregate and bootstrap-sample the stored data, without the need to write SQL
  • Functions to write query results out as model input files, such as those used by Gadget

MFDB primarily targets Gadget, although the data aggregation functions could be used as part of any other system. For example, data can also be adapted to generate RPath models. See the example-ewe.R example included as part of MFDB.

The overall aim here is to provide tools to assist in amalgamating data sources and to automate the model creation process as much as possible. An obvious benefit of this is the time saved during development, but it is not the only rationale.

An automated process lessens the likelihood of human error in comparison to transforming data with a set of manual steps, some of which may be forgotten. Whenever MFDB is used to update model data files, it will ensure that the model is internally consistent, updating all references to that data in one go.

Unlike a spreadsheet, the relational database has a schema the data has to adhere to, and all of MFDB's aggregation routines are covered by unit tests and integration tests using the unittest package. This rids the model development process of systematic errors, for example a mis-typed spreadsheet formula or a range becoming out of date.

Finally, the end result becomes more transparent and repeatable, as anyone with access to the source data can re-run the script to reproduce the model, as well as alter it for their needs. As noted in (Lucey_Gaichas_Aydin-2020), whilst EwE's graphical user interface is easy to use initially, it causes many problems when trying to reproduce simulations: ensuring everything is set up identically is problematic and error-prone. Scripting the setup in a programming language such as R solves this.

MFDB is an evolution of the ideas found in the DST^2 data warehouse, DSTDW, which also used a database to provide programmatic aggregation of data to generate Gadget data files. Some fundamental differences were decided at the outset, to address shortcomings which became evident in DSTDW:

  • MFDB is installed locally on the modeller's own computer, rather than relying on a central server
  • The schema is generic, rather than tailored to the case studies of one project, making it easier to apply to new areas and to extend
  • All interaction is via R functions, so no knowledge of SQL is required to import or query data

Obviously, it would also be possible to use PostgreSQL and/or R directly to manage data for a model; however, that would require knowledge of database design and SQL to be able to generate queries.

The MFDB system is open source, and can be installed directly from the CRAN repository. For more information on how to install it and PostgreSQL, see the MFDB GitHub page.

Overview

The figure fig-data-flow below demonstrates the flow of information around an MFDB installation, and the order in which you are likely to use the functions available:

data-flow.svg

Data flow in a typical MFDB usage

We will now walk through the steps in order.

Importing data

The first step (step 1 in fig-data-flow) is defining taxonomies. For many aspects of the data, e.g. area, species, vessel or tow, all possible values have to be defined in advance. This means data can be checked on entry, and errors reported if values do not match.
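
For example, a minimal sketch of setting up the taxonomies used by the examples below. The species taxonomy is predefined by MFDB, so 'COD' does not need to be declared; exact column requirements may vary between MFDB versions:

library(mfdb)

# Connect to (creating if necessary) a local case-study database
mdb <- mfdb('example')

# Define the areacells our data refers to; here each division contains
# a single areacell of the same name, so the names can be used in
# area groupings later
mfdb_import_area(mdb, data.frame(
    name = c('A0A1', 'A0A2', 'A0A3', 'A0A4'),
    size = 10))
mfdb_import_division(mdb, list(
    A0A1 = 'A0A1', A0A2 = 'A0A2', A0A3 = 'A0A3', A0A4 = 'A0A4'))

# Register the sampling types our data refers to
mfdb_import_sampling_type(mdb, data.frame(
    name = 'IGFS',
    description = 'Icelandic ground fish survey'))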

Next we import the actual data (step 2 in fig-data-flow). The database is designed for marine ecosystem modelling, and so the concepts it can store relate to that domain. For example...

  • Data from survey fleets, commercial landings logbooks, etc.
  • Stomach content surveys
  • Surface temperature data and other survey indices, e.g. acoustic data

Because we are using the R language, data does not have to be supplied in one particular CSV format. Instead an R data.frame is provided, which could come from, for example...

  • A CSV file imported and/or downloaded using read.table
  • An R package interfacing to an institutional database, e.g. the DATRAS package

We also have all of R available for manipulating data before it is inserted. For example, here we read in some data, add some common columns and import it into MFDB:

> data_in <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
+ year month age length count
+ 2000    1   1    10      8
+ 2000    1   1    20      7
+ 2000    1   2    20      6
+ 2000    1   2    30      5
+ 2000    1   3    30      4
+ 2000    1   3    40      3
+ 2000    1   4    50      2
+ 2000    1   4    60      1
+ 2000    1   5    90      4
+ ")
> data_in$areacell <- 'A0A1'
> data_in$species <- 'COD'
> data_in$sampling_type <- 'IGFS'
> mfdb_import_survey(mdb, data_in, data_source = 'cod-2000')
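
Other data types are imported with analogous functions, e.g. mfdb_import_stomach() for stomach content surveys. As a minimal sketch, temperature data could be imported as follows (the temperature values are invented for illustration):

mfdb_import_temperature(mdb, data.frame(
    year = 2000,
    month = 1:12,
    areacell = 'A0A1',
    temperature = c(4.5, 4.1, 4.2, 4.9, 5.8, 6.9,
                    7.8, 8.1, 7.4, 6.4, 5.5, 4.8)))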

Querying data

Now queries can be made (step 3 in fig-data-flow). MFDB does not require you to learn SQL to write queries; instead it provides functions that generate reports, all of which can...

  • Filter data, for example by species and areas
  • Aggregate data, for example into length groups
  • Perform bootstrap sampling of data, for example by area (see later)

For example, we can query the above data grouped by age and length:

> mfdb_sample_count(mdb, c('age', 'length'), list(
+     age = mfdb_group(imm = 1:3, mat = 4:5),
+     length = mfdb_step_interval("len", by = 20, to = 100)))
$`0.0.0.0.0`
  year step area age length number
1  all  all  all imm   len0      8
2  all  all  all imm  len20     22
3  all  all  all imm  len40      3
4  all  all  all mat  len40      2
5  all  all  all mat  len60      1
6  all  all  all mat  len80      4

...or filter to only see data for given ages grouped by length:

> mfdb_sample_count(mdb, c('length'), list(
+     age = c(1,2,3),
+     length = mfdb_step_interval("len", by = 20, to = 100)))
$`0.0.0.0`
  year step area length number
1  all  all  all   len0      8
2  all  all  all  len20     22
3  all  all  all  len40      3

All MFDB querying functions take two main parameters after the database connection. The first, cols, is a vector of column names you want the output data to be grouped by. In the first example we wanted to group by age and length, and we can see the corresponding columns in the output.

The second, params, is a list of aggregations that tell MFDB how to filter the data if a column is not in the output, or how to group it if it is. Note that in the second query above we did not include the age column, so specifying c(1,2,3) just tells MFDB what range of data we want; if we had included it, the data would have been grouped by ages 1, 2 and 3 separately. There are various aggregation options for different purposes:

NULL (default)
Don't do any filtering or grouping, and output "all" if the column is included. This can be seen in the year/step/area output above.
character / numeric vector
Filter to matching values, grouping by each value individually. For example, year = 1990:2000 will return data for the years 1990 to 2000, grouped by individual year.
mfdb_unaggregated()
Don't do any aggregation for this column; return all possible values. See Generating model input files for an example.
mfdb_group()
Group several discrete items together. For example, age = mfdb_group(imm = 1:3, mat = 4:5) would group data for ages 1..3 together under an "imm" row, and ages 4..5 together under a "mat" row. This can be seen in the first query above.
mfdb_interval()
Group irregular numeric ranges together. For example, length = mfdb_interval('len', c(0, 10, 100, 1000)) results in groups "len0", "len10" and "len100", with 1000 being the upper bound of len100 (see the sketch after this list).
mfdb_step_interval()
Group regular ranges together. For example, length = mfdb_step_interval('len', to = 100, by = 10) results in "len0", "len10", ... , "len90". This can be seen in the length output above.
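
For instance, a minimal sketch of the mfdb_interval() grouping mentioned above, run against the data imported earlier (output omitted):

mfdb_sample_count(mdb, c('length'), list(
    length = mfdb_interval('len', c(0, 10, 100, 1000))))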

Each of these functions produces an mfdb_aggregate object that can be converted to SQL using S3 generic functions. In the case of mfdb_group() we also create a temporary table that contains the grouping. We can see this happening by setting the log level to finest:

> logging::setLevel('FINEST')
> mfdb_sample_count(mdb, c('length'), list(
+     age = mfdb_group(imm = 1:3, mat = 4:5),
+     length = mfdb_step_interval("len", by = 20, to = 100)))
FINEST:mfdb:CREATE TEMPORARY TABLE temp_dompa (sample INT DEFAULT 1 NOT NULL
                                 , name VARCHAR(10)
                                 , value  REAL )
FINEST:mfdb:INSERT INTO temp_dompa (sample,name,value)
    VALUES (0,'imm',1),(0,'imm',2),(0,'imm',3),(0,'mat',4),(0,'mat',5)
FINEST:mfdb:CREATE INDEX ON temp_dompa (value,name,sample)
FINEST:mfdb:SELECT 0 || '.' ||0 || '.' ||0 || '.' ||0  AS bssample
                 , 'all' AS grp_year
                 , MIN(c.year) AS grp_min_year
                 , MAX(c.year) AS grp_max_year
                 , 'all' AS grp_step
                 , MIN(c.month) AS grp_min_step
                 , MAX(c.month) AS grp_max_step
                 , 'all' AS grp_area
                 , 'len' || (least(greatest(floor(c.length)::integer, 0), 80) / 20) * 20 AS grp_length
                 , SUM(c.count) AS number
              FROM sample c, temp_dompa
             WHERE TRUE
               AND c.length >= 0
               AND c.length < 100
               AND c.age = temp_dompa.value
          GROUP BY bssample,grp_year,grp_step,grp_area,grp_length
          ORDER BY bssample,grp_year,grp_step,grp_area,grp_length

The name chosen is a hash of the values in the grouping; thus if two queries use an identical grouping, the table only needs to be created once.
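
To illustrate, here is a simplified, hypothetical sketch of the S3 pattern; the generic and method names are invented for illustration and do not match MFDB's internals:

# Each mfdb_aggregate subclass supplies its own method for
# translating itself into an SQL fragment
sql_filter <- function (agg, col) UseMethod("sql_filter")
sql_filter.mfdb_interval <- function (agg, col) {
    sprintf("%s >= %s AND %s < %s", col, min(agg), col, max(agg))
}

lengths <- structure(c(0, 10, 100, 1000),
                     class = c("mfdb_interval", "numeric"))
sql_filter(lengths, "c.length")
# [1] "c.length >= 0 AND c.length < 1000"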

Generating model input files

Finally, the output of these queries can be easily adapted to suit model input files (step 4 in fig-data-flow). Any associated files, e.g. area and length aggregation files, will be updated at the same time.

For example, data imported above can be fed into a Gadget likelihood component:

> aggdata <- mfdb_sample_count(mdb, c('age', 'length'), list(
+     age = mfdb_unaggregated(),
+     length = mfdb_step_interval("len", by = 20, to = 100)))
> gadget_dir_write(
+     gadget_directory('model_output'),
+     gadget_likelihood_component("catchdistribution",
+         name = "ldist",
+         weight = 1,
+         data = aggdata[[1]]))

As a result, not only is the data stored in the appropriate output file:

> writeLines(readLines("model_output/Data/catchdistribution.ldist.sumofsquares"))
; Generated by mfdb 6.0.0
; -- data --
; year  step    area    age     length  number
all     all     all     1       len0    8
all     all     all     1       len20   7
all     all     all     2       len20   11
all     all     all     3       len20   4
all     all     all     3       len40   3
all     all     all     4       len40   2
all     all     all     4       len60   1
all     all     all     5       len80   4

...but the likelihood component itself has been created:

> writeLines(readLines("model_output/likelihood"))
; Generated by mfdb 6.0.0
;
[component]
name    ldist
weight  1
type    catchdistribution
datafile        Data/catchdistribution.ldist.sumofsquares
function        sumofsquares
aggregationlevel        0
overconsumption 0
epsilon 10
areaaggfile     Aggfiles/catchdistribution.ldist.area.agg
ageaggfile      Aggfiles/catchdistribution.ldist.age.agg
lenaggfile      Aggfiles/catchdistribution.ldist.len.agg
fleetnames
stocknames

...as well as the aggregation files that inform Gadget how the data was generated:

> writeLines(readLines("model_output/Aggfiles/catchdistribution.ldist.len.agg"))
; Generated by mfdb 6.0.0
len0    0       20
len20   20      40
len40   40      60
len60   60      80
len80   80      100

...in the case of age, we specified that it shouldn't be aggregated, so each age is listed individually and Gadget is informed of the available values automatically:

> writeLines(readLines("model_output/Aggfiles/catchdistribution.ldist.age.agg"))
; Generated by mfdb 6.0.0
1       1
2       2
3       3
4       4
5       5

We can redefine the length aggregation just by re-running the above commands, and there is no chance of forgetting to update the length aggregation file.

Using MFDB for bootstrap sampling within Gadget

As well as grouping data, it is possible to use MFDB to perform bootstrap sampling, as used for uncertainty estimation in (Elvarsson-2018). mfdb_bootstrap_group() performs random sampling with replacement on a supplied mfdb_group().

First we construct some extra data based on the previous examples:

> data_in2 <- data_in
> data_in2$areacell <- 'A0A2'
> data_in2$count <- data_in$count * 2
> mfdb_import_survey(mdb, data_in2, data_source = 'cod-2000-A0A2')
> data_in3 <- data_in
> data_in3$areacell <- 'A0A3'
> data_in3$count <- data_in$count * 3
> mfdb_import_survey(mdb, data_in3, data_source = 'cod-2000-A0A3')
> data_in4 <- data_in
> data_in4$areacell <- 'A0A4'
> data_in4$count <- data_in$count * 4
> mfdb_import_survey(mdb, data_in4, data_source = 'cod-2000-A0A4')

Now we can use mfdb_bootstrap_group() to sample within a selection of areas five times:

> aggdata <- mfdb_sample_count(mdb, c(), list(
+     area = mfdb_bootstrap_group(5, mfdb_group(A0234 = c('A0A1', 'A0A2',
+         'A0A3', 'A0A4')), seed = 4)))
> aggdata
$`0.0.1`
  year step  area number
1  all  all A0234    320
$`0.0.2`
  year step  area number
1  all  all A0234    520
$`0.0.3`
  year step  area number
1  all  all A0234    440
$`0.0.4`
  year step  area number
1  all  all A0234    360
$`0.0.5`
  year step  area number
1  all  all A0234    600

Note how each sample uses a different set of areas:

> lapply(aggdata, function (a) attr(a, 'area'))
$`0.0.1`
$`0.0.1`$A0234
[1] "A0A3" "A0A1" "A0A2" "A0A2"
$`0.0.2`
$`0.0.2`$A0234
[1] "A0A4" "A0A2" "A0A3" "A0A4"
$`0.0.3`
$`0.0.3`$A0234
[1] "A0A4" "A0A1" "A0A4" "A0A2"
$`0.0.4`
$`0.0.4`$A0234
[1] "A0A1" "A0A4" "A0A2" "A0A2"
$`0.0.5`
$`0.0.5`$A0234
[1] "A0A4" "A0A3" "A0A4" "A0A4"

A naïve approach to sampling in SQL might use the IN operator to pick out our randomly chosen samples, for example SELECT * FROM table WHERE area IN (1,8,3,3,5). However, since IN is a set operation, area 3 will not be counted twice in the selected data. In MFDB, by contrast, groups are represented by storing them in temporary tables. Entries can be repeated in this temporary table, and when the two tables are joined together the corresponding entries from the sample table are repeated too:

> logging::setLevel('FINEST')
> mfdb_sample_count(mdb, c(), list(
+     area = mfdb_bootstrap_group(2, mfdb_group(A0234 = c('A0A1', 'A0A2', 'A0A3', 'A0A4')), seed = 99)))
FINEST:mfdb:CREATE TEMPORARY TABLE temp_ljyre_bs299
            (sample INT DEFAULT 1 NOT NULL, name VARCHAR(10), value  INT)
FINEST:mfdb:INSERT INTO temp_ljyre_bs299
                 SELECT 1 AS sample
                      , 'A0234' AS name
                      , areacell_id AS value
                   FROM division
                  WHERE division IN ('A0A3','A0A1','A0A4')
FINEST:mfdb:INSERT INTO temp_ljyre_bs299
                 SELECT 1 AS sample
                      , 'A0234' AS name
                      , areacell_id AS value
                   FROM division
                  WHERE division IN ('A0A3')
FINEST:mfdb:INSERT INTO temp_ljyre_bs299
                 SELECT 2 AS sample
                      , 'A0234' AS name
                      , areacell_id AS value
                   FROM division
                  WHERE division IN ('A0A3','A0A4','A0A2')
FINEST:mfdb:INSERT INTO temp_ljyre_bs299
                 SELECT 2 AS sample
                      , 'A0234' AS name
                      , areacell_id AS value
                   FROM division
                  WHERE division IN ('A0A3')
FINEST:mfdb:CREATE INDEX ON temp_ljyre_bs299 (value,name,sample)
FINEST:mfdb:SELECT 0 || '.' ||0 || '.' ||temp_ljyre_bs299.sample  AS bssample
                 , 'all' AS grp_year
                 , MIN(c.year) AS grp_min_year
                 , MAX(c.year) AS grp_max_year
                 , 'all' AS grp_step
                 , MIN(c.month) AS grp_min_step
                 , MAX(c.month) AS grp_max_step
                 , temp_ljyre_bs299.name AS grp_area
                 , SUM(c.count) AS number
              FROM sample c, temp_ljyre_bs299
             WHERE TRUE AND c.areacell_id = temp_ljyre_bs299.value
             GROUP BY bssample,grp_year,grp_step,grp_area
             ORDER BY bssample,grp_year,grp_step,grp_area
$`0.0.1`
  year step  area number
1  all  all A0234    440

$`0.0.2`
  year step  area number
1  all  all A0234    480

Maintaining code quality

TODO: The package also uses the Travis CI system to ensure all versions of the code are tested.

TODO: Versions, changelogs, etc.

TODO: Unit testing

TODO: Testing section, CRAN?

TODO: Released to CRAN, benefits

Working in combination with rgadget

The Rgadget package has taken the idea of programmatically configuring Gadget above and applied it to the entirety of the Gadget configuration format. As a result, a single R script can be formulated that builds the entire model structure.

The Rgadget gadget_update function adds or updates a section of a gadgetfile object and, in exactly the same manner as the gadget_likelihood_component function above, can be provided with the results of MFDB queries to supply aggregation detail as well as data. Chained together using the magrittr %>% operator, a Gadget input file can be quickly built up.

The gadget-models repository at https://github.com/fishvice/gadget-models contains several examples of how this can be used in practice. For example, a catchdistribution component can be made with the following:

# Query an age-length distribution for the eastern gillnet fleet.
# minage, maxage, minlength, maxlength, dl, defaults_early, gd and
# stock_names are all defined earlier in the model setup script.
aldist.gil_e <- mfdb_sample_count(mdb, c('age', 'length'),
                    c(list(sampling_type = 'SEA',
                           data_source = 'iceland-aldist',
                           gear = 'GIL',
                           age = mfdb_step_interval('age', by = 1,
                                                    from = minage, to = maxage,
                                                    open_ended = c("upper")),
                           length = mfdb_interval("len",
                                                  seq(minlength, maxlength, by = dl),
                                                  open_ended = c("upper", "lower"))),
                      defaults_early))

# Add the component to the model's likelihood file, then write it out
gadgetlikelihood('likelihood', gd$dir, missingOkay = TRUE) %>%
  gadget_update("catchdistribution",
                name = "aldist.gil_e",
                weight = 1,
                data = aldist.gil_e[[1]],
                fleetnames = c("gil_e"),
                stocknames = stock_names) %>%
  write.gadget.file(gd$dir)

This has several advantages over keeping just the model configuration files: the model structure is far easier to modify during development, and new data can be incorporated by re-running the script. New models can be started by taking existing scripts and adapting them to a new scenario; Gadget uptake has been vastly simplified by new users taking the model setup scripts in the gadget-models repository and adapting them to their needs.

Conclusion

The package is available and developed on GitHub (http://github.com/mareframe/mfdb/), where you can download the source, view documentation and report any issues you find.

All functions have R help pages. Once installed, you can read an overview of using MFDB with the package?mfdb command. All other MFDB commands are documented; you can see a list with the ??mfdb command.

An obvious next step for MFDB is to utilise the rich GIS extensions available for PostgreSQL, allowing areas to be defined as polygons in the database itself.

MFDB also has obvious overlap with dbplyr, an R library that helps formulate SQL queries; the syntax of dplyr is now familiar to many R users. Some initial integration has been done via the mfdb_dplyr_* functions, which let you use dplyr on MFDB tables. However, these do not currently report aggregations in the same way as the native functions. This will be investigated further in future versions.
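
As a sketch of what this integration looks like, assuming the mfdb_dplyr_sample() accessor for the sample table (see the package documentation for the full list of mfdb_dplyr_* functions):

library(dplyr)

# Lazily query the sample table, letting dplyr/dbplyr translate to SQL
mfdb_dplyr_sample(mdb) %>%
    filter(species == 'COD') %>%
    group_by(age) %>%
    summarise(total = sum(count, na.rm = TRUE)) %>%
    collect()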

Acknowledgements

This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no.613571.

References

[Kupca-2004] Kupca, V. (2004). A standardized database for fisheries data. ICES Document CM 2004/FF:15, p. 34.
[Christensen-2009] Christensen, V., Walters, C., Ahrens, R., Alder, J., Buszowski, J., Bang Christensen, L., Cheung, W., Dunne, J., Froese, R., Karpouzi, V., Kaschner, K., Kearney, K., Lai, S., Lam, V., Palomares, M.L.D., Peters-Mason, A., Piroddi, C., Sarmiento, J.L., Steenbeek, J. and Pauly, D. (2009). Database-driven models of the world's Large Marine Ecosystems. Ecological Modelling 220, 1984–1996. doi:10.1016/j.ecolmodel.2009.04.041
[Begley_and_Howell-2004] Begley, J. and Howell, D. (2004). An overview of Gadget, the Globally applicable Area-Disaggregated General Ecosystem Toolbox. https://pdfs.semanticscholar.org/1eeb/a8c4e218c9cb260f1e16213da6d580ac87b7.pdf
[Lucey_Gaichas_Aydin-2020] Lucey, S.M., Gaichas, S.K. and Aydin, K.Y. (2020). Conducting reproducible ecosystem modeling using the open source mass balance model Rpath. Ecological Modelling. https://www.sciencedirect.com/science/article/pii/S0304380020301290
[Elvarsson-2018] Elvarsson, B.Þ. et al. (2018). Pushing the limits of a data challenged stock: A size- and age-structured assessment of ling (Molva molva) in Icelandic waters using Gadget. Fisheries Research.
[read.table] https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
[R] http://www.r-project.org/
[unittest] https://cran.r-project.org/package=unittest
[dplyr] https://cran.r-project.org/package=dplyr
[dbplyr] https://cran.r-project.org/package=dbplyr
[PostgreSQL] http://postgresql.org/
[r-devtools] https://cran.r-project.org/web/packages/devtools/index.html
[Gadget] http://www.hafro.is/gadget/
[RPath] https://www.sciencedirect.com/science/article/pii/S0304380020301290
[CRAN] https://cran.r-project.org/
[D7.2] http://mareframe-fp7.org/deliverable-7-2-analysed-case-studies-with-respect-to-project-progress-evaluation/
