Dbt is a Ruby library that helps you automate the creation, import and migration of databases. Dbt uses convention over configuration and is designed to be a composable component of a larger automation tool chain. Dbt has existing drivers for SQL Server and PostgreSQL.
Create, Import, Migrate
Dbt expects you to define the database using SQL scripts, or via a tool that generates SQL scripts (e.g. Domgen). Dbt is responsible for executing the scripts in a defined order. Dbt can also load fixture data from yaml files as part of database creation, and defines a mechanism for importing one version of the database into a newer version. Sometimes modifying an existing database in place is easier to manage, and Dbt offers migrations for this scenario.
Convention over Configuration
The library adopts a series of conventions aimed at simplifying large, complex, legacy projects without adding a burden to smaller, early stage projects. In particular, it grew out of several complex projects that made extensive use of schemas, functions, constraints, stored-procedures, triggers, views and custom types. Most of these projects had in excess of 5000 database artifacts. The library is also used in projects where the database is little more than a persistence mechanism, with all the complex business logic outside the database.
Composable Component
The library is just one part of the automation tool chain and is easily integrated into other tools. It is used extensively with tools such as Rake, Buildr and Domgen.
Getting Started
A simple database configuration that explicitly defines the schema, and the set of tables within that schema, follows.
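The definition below is a sketch of what this might look like, assuming the Dbt.add_database entry point; the modules, table_map and search_dirs settings are the ones described in the rest of this document.

    require 'dbt'

    # A minimal, explicit database definition (illustrative API usage).
    Dbt.add_database(:default) do |database|
      # Directories scanned for sql files and fixtures.
      database.search_dirs = ['database']
      # One module, mapping to the MyModule schema by convention.
      database.modules = ['MyModule']
      # The tables in each module; used for fixtures and imports.
      database.table_map = { 'MyModule' => ['MyModule.tblFoo'] }
    end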
If this was declared within a Rakefile, it would define the tasks dbt:create and dbt:drop, which create and drop the database respectively. The create task would create the schema MyModule, scan the search_dirs according to a set of rules and load any sql files found. The task would also load fixture data for the MyModule.tblFoo table if a yaml file was found in the correct location. The drop task would simply delete the database. The runtime configuration required to connect to the database is read from the file config/database.yml.
The above example gives a basic feel for the library but it is not representative of how the library is used in practice. In most cases, Dbt is integrated with external tools such as Domgen and does not have to define the modules or table_map explicitly. Nor does the example define any imports, migrations or datasets.
A more realistic example is given below. It configures a single database, integrates with Domgen to define the modules and tables, and expects Domgen to generate sql into the database/generated directory.
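The sketch below illustrates such a definition; the enable_domgen, add_import and enable_migrations! calls are assumptions about the integration API rather than confirmed method names, while search_dirs follows the conventions described later in this document.

    require 'dbt'

    Dbt.add_database(:default) do |database|
      # Let Domgen supply the module and table definitions rather than
      # declaring modules/table_map by hand (illustrative call).
      database.enable_domgen(:MyRepository, 'domgen:load', 'domgen:sql')
      # Hand-written sql first, then the sql generated by Domgen.
      database.search_dirs = ['database', 'database/generated']
      # An import definition and migrations (illustrative calls).
      database.add_import(:default)
      database.enable_migrations!
    end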
Semantic Model
Conceptually, each Dbt database is composed of 0 or more modules. Each module maps to a schema in the database and, by convention, the schema has the same name as the module. Each database has a search path from which Dbt collects the files used to create the database. Each module is mapped to a sub-directory of each element of the search path.
Thus if a database definition defined a search path of ["database", "database/generated"], then the module "MyModule" would load data from the directories "database/MyModule" and "database/generated/MyModule".
Each module has a number of sub-directories that contain database entities such as triggers, views, stored-procedures etc. A module may also contain a sub-directory that contains fixtures. Fixtures define the contents of specific tables at creation time. The module may also include 0 or more directories defining import scripts or migration scripts.
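The layout below illustrates these conventions for a module named MyModule, using the default directory names described later in this document; the individual file names are purely illustrative.

    database/
      MyModule/
        base.sql               # sql in the module root (the '.' up_dir)
        types/
        views/
        functions/
        stored-procedures/
        misc/
        triggers/              # processed in the finalize phase
        finalize/
        fixtures/
          MyModule.tblFoo.yml  # fixture data for a table
        import/                # import scripts
        migrations/            # migration scripts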
The order in which the sub-directories are processed is dependent on the configuration in the database definition but follows a specific convention based on the activity. Within a directory, the order of the sql files is determined by an index file (named index.txt by default) that lists one sql file per line. If the index file is not present or does not include all the sql files, then the sql files are loaded in alphanumeric ordering.
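For example, a database/MyModule/index.txt containing the following (illustrative file names) ensures tblFoo.sql is loaded before tblBar.sql, regardless of alphanumeric order:

    tblFoo.sql
    tblBar.sql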
Database Creation Workflow
Create the database.
During this step the database is created. Both of the existing database drivers use a control database (master for SQL Server and postgres for PostgreSQL) to create the new database. The driver then changes to the newly created database for the remainder of the database creation steps. In some environments it is not possible to access the control database, so Dbt allows you to disable database creation by setting the no_create flag to true in the runtime configuration file (i.e. database.yml). In this scenario it is expected that a system administrator has already set up the database.
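A minimal sketch of such a runtime configuration follows; the no_create flag is the one described above, while the file structure and remaining keys are illustrative assumptions about what a connection configuration might contain.

    # config/database.yml (illustrative)
    development:
      database: MyApp_development
      host: 127.0.0.1
      username: dbt_user
      password: secret
      no_create: true  # a system administrator has already created the database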
Process the pre_create_dirs.
Load any sql files that need to be loaded prior to processing any modules. The pre_create_dirs setting defaults to ["db-hooks/pre"].
Process the up phase for each module in order.
The purpose of the up phase is to set up the database without enabling or creating any runtime checks. That is, the tables should be created as part of this phase and populated with any initial fixture data, but constraints and triggers may not yet be created or, if created, are left disabled.
Create the database schema for the module.
Unless overridden, the schema name is the same as the module name. Dbt will create the schema if it does not already exist. This allows multiple modules to map to the same database schema, or a module to map to a predefined schema (e.g. dbo under SQL Server or public under PostgreSQL).
Process the up_dirs for the module.
The up_dirs are used to define the artifacts in the database. They are split across several directories purely as a mechanism to simplify the ordering of database creation scripts. The up_dirs setting defaults to ['.', 'types', 'views', 'functions', 'stored-procedures', 'misc'].
Load any fixture tables in the module.
The database definition includes a list of tables in each module. Dbt will look for a yaml file in the fixture_dir_name directory that matches the name of the table. This file is then parsed as yaml and its rows inserted into the database. To ensure the order of the rows, the top level element of the yaml file can be encoded as an omap. The fixture_dir_name setting defaults to 'fixtures'.
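The sketch below shows what such a fixture might look like for the MyModule.tblFoo table, assuming each top level key labels a row of column/value pairs; the file name, row labels and columns are illustrative.

    # database/MyModule/fixtures/MyModule.tblFoo.yml (illustrative)
    --- !omap
    - first_row:
        ID: 1
        Name: 'Foo'
    - second_row:
        ID: 2
        Name: 'Bar'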
Process the import phase for each module.
If the creation is a "create_by_import" create, then the import of the module occurs at this step. See the import section for further details on importing.
Process the finalize phase for each module.
The finalize phase prepares the database for operation. This typically involves enabling triggers, rebuilding indexes and applying any final security grants.
Process the finalize_dirs for the module.
The finalize dirs are configurable via the finalize_dirs setting, which defaults to ['triggers', 'finalize'].
Process the post_create_dirs for the database.
Process the dirs defined by the post_create_dirs setting. The setting defaults to ["db-hooks/post"].
Set up the migrations infrastructure.
The last thing that occurs when creating the database is to set up migrations, if migrations are enabled. The migrations section goes into further detail about how migrations proceed. During database creation, the table that records the migrations is created. By default the creation will assume that there is no need to apply the database migrations and will mark all migrations present as having been applied. This can be overridden via the migrations_applied_at_create setting on the database definition, in which case the migrations will be applied during database creation.
Import a Database
Importing is typically used to upgrade from an old version of the database to a new version. Conceptually it selects data from the old database and inserts it into the new database. However, like the other components of Dbt, this behaviour can be overridden by the user. It is not uncommon to use the import facility to import from several databases.
The import process can be integrated into the database creation process and/or it can be a standalone task. An import that is integrated into the database creation process is much easier to manage, primarily because it occurs prior to the database finalize phase and thus triggers and constraints are not yet enabled. The standalone import process must also assume that the database has already been used, and thus must delete data from the tables as part of the import process.
It should be noted that the import process will only import data for "importable" tables, i.e. any table that does not have a fixture in the fixtures directory.
Process the pre_import_dirs.
The directories processed in this phase are configurable via the pre_import_dirs setting, which defaults to ['import-hooks/pre'].
Process the delete phase for each module in reverse order.
If the import process is standalone, then the import process will delete data from all the importable tables. Data is deleted from the importable tables in reverse order, and each module is also processed in reverse order.
Process the import phase for each module.
Each module will be processed according to the order in which it is declared.
Import each table.
Each importable table will be imported in order. Import files are loaded from a directory defined in the import definition that defaults to import. If a .yml file with the same name as the table is present, it is loaded into the table. If a .sql file with the same name as the table is present, it is processed to import the table. If neither a .yml nor a .sql file is present, then the import process will attempt to import data from the old database using a table name and set of columns that match the new database.
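As an illustration, a hand-written import script for a table might look like the sketch below; the OLD_DB reference is an assumption, as the mechanism for addressing the old database depends on the driver and runtime configuration.

    -- database/MyModule/import/MyModule.tblFoo.sql (illustrative)
    -- Copy the matching columns from the old database into the new one.
    INSERT INTO MyModule.tblFoo (ID, Name)
    SELECT ID, Name
    FROM OLD_DB.MyModule.tblFoo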
Process the post_data_module_import hook at the end of the import phase for each module.
At the end of importing each module, the database driver is given the option to optimise the module using the post_data_module_import hook. The SQL Server driver will attempt to shrink tables to remove unused space and then re-index them. Both steps can be disabled in the runtime configuration.
Process the post_import_dirs.
The post_import_dirs setting controls the directories processed in this phase and defaults to ['import-hooks/post'].
Migrate a Database
Migrations are used to incrementally modify a database without recreating or re-importing the database from scratch. The migrations_dir_name setting controls the directory from which the migration scripts are sourced and defaults to 'migrations'.
The migration scripts are ordered using the same mechanism as standard directories, using an index.txt file or alphanumeric ordering based on the name of the script. When the scripts are applied, a record of the application is recorded in a per-database migrations table. The name of the migration table differs between databases and is [dbo].[tblMigrations] under SQL Server and "public"."tblMigration" under PostgreSQL.
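A migration script is an ordinary sql file; a minimal sketch follows, with an illustrative file name and change.

    -- database/MyModule/migrations/001_add_bar_to_tblfoo.sql (illustrative)
    ALTER TABLE MyModule.tblFoo ADD Bar VARCHAR(50) NULL;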
TODO: Document "release" migrations and how migrations are handled at database creation.
Defining Datasets
TODO: Document me.
Module Groups
TODO: Document me.