Obevo – Design Walkthrough

InfoQ Article

Note: we have published an article on InfoQ that gives an overview of Obevo and some technical details.

The InfoQ article has a slightly more up-to-date and cleaner representation of some of the information on this page. Feel free to look at the InfoQ article first before proceeding on this page.

Obevo Problem Statement - A Compiler for Stateful and Stateless Code Classes

Design goal of Obevo - to give idempotent deploy semantics to a target platform that does not have them by default, the key problems to solve being:

1. There is no existing mechanism to take a full binary of code and deploy it to upgrade an environment
2. Some objects within the environment are stateful, thus requiring incremental commands to modify them (and making #1 difficult)
3. There is no default ordering mechanism for object changes to the environment; the order must be manually defined by users.

Databases happen to hit all 3 points, notably for applying SQL DDL changes to DB schemas, per the overview

#1: There is no mechanism currently defined by DBMS providers to define a full schema definition in source code and apply the results to a database that can work both against a blank and existing db
#2a: Table modifications cannot be expressed by simply presenting a full view of the object definition. Instead, one must provide an incremental modification statement to alter the table, and changes should only be applied once, lest an exception is hit or data applied in duplicate.
#2b: At the same time, other DB objects are stateless (unlike tables), where one can simply drop and add objects to recreate them. Such objects like stored procedures and views share semantics more similar to application code.
#3: When deploying a full database, order must be considered so that dependency objects are deployed before the ones depending on them (e.g. deploying a table before the view that depends on it).

Noting point #2, the two object types (stateful vs. stateless) are maintained in different ways, and a tool should ideally be able to handle both. This is the premise and value-add of Obevo.

Boiling down the core problem and design - it actually does not involve SQL or DB constructs at all. Instead, it is about managing code files in the stateless and stateful manners and applying changes to them to the target environment. The end result is a “mini-compiler” of sorts; hence, the title of this section.

This page details the generic pattern here on compiling stateful and stateless code for an underlying system that only accepts incremental changes. As of today, the implementations have been around RDBMS, but we’ve already started to venture into NoSQL where not all operations need SQL (Hive), and depending on the creative minds out there, we can apply this to more.

So yes, this is a solution looking for a problem :) - but an interesting problem it is!

Problem Terminology

Let’s lay out the main actors involved in a System, at least with regard to this problem statement:

Actor	Role Description	Equivalent for a database setup	Equivalent for an application setup (Java)
Developers	The folks building the System	Developers	Developers
Source Code	The code that Developers write that is eventually built, shipped, and run to provide functionality to a user. But first, it must be put into a form that can be run...	DDLs, SQL scripts, etc.	Java classes
Artifact / Binary	The result of building the source code into something that can be run.	zip archive containing the SQL scripts to deploy	Java Jar
Environment	An instance of a running System (i.e. where the end result of a Source Code build is run). A system can have many Environments, from dev to uat to production.	A database / set of databases on some known schemas	A host with a container to execute Java processes
Deploy Tool	The mechanism that applies a built binary to an Environment	A SQL command-line interface, or a deploy tool like Obevo	General install scripts, FTP, whatever your firm uses
Deploy Team	The individuals that carry out the deployment (i.e. invoke whatever commands or UIs to carry out the deployment using the tool)	Team member/s assigned to this role	Team member/s assigned to this role

Requirements for Stateful objects / run-once statements

For any given deployment activity, a set of statements needs to be applied to a target environment.

As mentioned earlier, statements on stateful objects must only be run once, and the state of an environment over time is the accumulation of such statements applied to it.

e.g.

Binary Version 1 - apply statements 1, 2, 3
Binary Version 2 - apply statements 4, 5
Binary Version 3 - apply statements 6, 7, 8

We present two ways to do this:

1. The Deploy Team notes which statement/s from the Source Code need to be applied for that migration
2. The Deploy Tool takes the full set of statements from the Source Code and figures out which ones need to be applied to the environment

Option 1 was the way teams did this without automation. Tooling-wise, it is quite simple; but it is a riskier endeavor as it leaves more responsibility on a human operator.

Option 2 is what most database deployment tools do. We detail that now.

Deployment Algorithm

Diagram

[Diagram: Deploy Algorithm]

Developer Guide note:

Algorithm in this diagram is implemented in com.gs.obevo.impl.MainDeployer
Deploy Log in the diagram maps to com.gs.obevo.api.platform.ChangeAuditDao
Environment in the diagram maps to com.gs.obevo.api.appdata.Environment
Changeset in the diagram maps to com.gs.obevo.impl.Changeset
Step 1 (reading Source Code changes) is implemented in com.gs.obevo.impl.SourceChangeReader

Recapping the steps in the diagram:

1. Read Changes in Source Code
2. Read Changes from Deploy Log
3. Calculate Changeset between Source Code and Deploy Log
4. Apply Changeset to Environment and Deploy Log
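
To make the four steps concrete, here is a minimal sketch in Python. It is not how com.gs.obevo.impl.MainDeployer is written; the names (Change, deploy, and the helper functions) are hypothetical, the Environment and Deploy Log are plain in-memory lists, and hash validation is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    key: str   # identifies the Change (hypothetical field name)
    text: str  # the statement(s) to run


def read_source_changes(source):
    """Step 1: read Changes from Source Code (here, an in-memory list)."""
    return list(source)


def read_deploy_log(deploy_log):
    """Step 2: read the keys of the Changes that were already applied."""
    return set(deploy_log)


def calculate_changeset(source_changes, deployed_keys):
    """Step 3: the Changes in Source Code that the Deploy Log has not seen."""
    return [c for c in source_changes if c.key not in deployed_keys]


def apply_changeset(changeset, environment, deploy_log):
    """Step 4: run each Change, then record its key in the Deploy Log."""
    for change in changeset:
        environment.append(change.text)  # stand-in for executing the statement
        deploy_log.append(change.key)


def deploy(source, environment, deploy_log):
    """Run all four steps; returns the keys that were applied this time."""
    source_changes = read_source_changes(source)
    deployed_keys = read_deploy_log(deploy_log)
    changeset = calculate_changeset(source_changes, deployed_keys)
    apply_changeset(changeset, environment, deploy_log)
    return [c.key for c in changeset]
```

Calling deploy twice with the same Source Code applies nothing the second time, which is the idempotent behavior the tool is after.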

Details on some of these steps:

ChangeKey to match Source Code and Deploy Log

Step 3 is to calculate the ChangeSet between the Source Code and Deploy Log. This implies some kind of key that can identify a Change to facilitate this match. The choice of type to use (e.g. number, string) does not matter.

The ChangeKey will be stored in the Deploy Log when applying the change so that it can be used later.

Stateful Changes: No modifying or deleting already deployed changes allowed

As mentioned earlier, statements on stateful objects must only be run once, and the state of an environment over time is the accumulation of such statements applied to it. (yes, we mention this fact again - it plays a key role!)

That means we cannot modify a Change entry in the Source Code once it is deployed, as otherwise the deploy tool will not know how to apply the difference. The modification must either be to add a new Change or to rollback/delete the change per the specifications of the tool (if supported).

To guard against such user actions, a hash of the Change text is stored in the Deploy Log. Upon subsequent releases, a hash is taken of the Change in Source Code and compared against the hash from the Deploy Log. (A hash is stored instead of the full text for space efficiency.)

To summarize the hash difference behavior for Stateful Changes:

Hash Comparison	Action
Hashes match in Source Code and Deploy Log	No action
Hash value in Source Code, but not in Deploy Log	Deploy Change
Hash value in Deploy Log, but not in Source Code	Exception - Source Code Change was removed improperly
Hashes differ between Source Code and Deploy Log	Exception - Source Code Change was changed improperly
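
The table above can be sketched as a small validation function. This is an illustration only: the function and exception names are hypothetical, SHA-256 is an arbitrary choice of hash for the sketch, and the Source Code and Deploy Log are modeled as plain dictionaries keyed by ChangeKey.

```python
import hashlib


class ChangeValidationError(Exception):
    """Raised when already-deployed stateful Changes were edited or removed."""


def content_hash(text):
    # Any stable digest works for this sketch; SHA-256 is an assumption.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stateful_changeset(source, deploy_log):
    """source: ChangeKey -> Change text; deploy_log: ChangeKey -> stored hash.

    Returns the ChangeKeys to deploy, or raises if any validation fails.
    """
    errors = []
    to_deploy = []
    for key, text in source.items():
        if key not in deploy_log:
            to_deploy.append(key)                 # new Change: deploy it
        elif deploy_log[key] != content_hash(text):
            errors.append(f"{key}: hash changed")  # edited after deployment
        # hashes match: no action
    for key in deploy_log:
        if key not in source:
            errors.append(f"{key}: removed from Source Code")
    if errors:
        # Nothing is applied if any stateful Change fails validation.
        raise ChangeValidationError("; ".join(errors))
    return to_deploy
```

Collecting all errors before raising lets the operator see every improperly edited or removed Change in one pass.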

The following tables demonstrate an example:

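Each deploy below shows three parts in order: the Changeset Input (Source Code vs. Deploy Log), the Changeset Result, and the Deploy Result (the Deploy Log afterwards).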

Deploy #1 - new environment

Source Code		Deploy Log
ChangeKey	Hash	ChangeKey	Hash
change1	abc
change2	123

change1 and change2 to be deployed

Deploy Log
ChangeKey	Hash
change1	abc
change2	123

Deploy #2 - new release on environment

Source Code		Deploy Log
ChangeKey	Hash	ChangeKey	Hash
change1	abc	change1	abc
change2	123	change2	123
mynewChange3	qwerty
otherChange4	uiop

mynewChange3 and otherChange4 to be deployed

Deploy Log
ChangeKey	Hash
change1	abc
change2	123
mynewChange3	qwerty
otherChange4	uiop

Deploy #3 - bad change; failure

Source Code		Deploy Log
ChangeKey	Hash	ChangeKey	Hash
change1	abc	change1	abc
change2	123	change2	123
mynewChange3	qwertyChanged	mynewChange3	qwerty
~~otherChange4~~	~~uiop~~	otherChange4	uiop

Failed validation: mynewChange3 hash changed, otherChange4 was removed

No changes applied

Stateless Changes: Modifying or deleting already deployed changes IS allowed

What about Stateless Changes?

The hash calculation logic remains the same as for Stateful Changes
Only the behavior on hash differences changes. Stateless Changes can be rerun (or dropped and added), so we have more flexibility here.

To summarize the hash difference behavior for Stateless Changes:

Hash Comparison	Action
Hashes match in Source Code and Deploy Log	No action
Hash value in Source Code, but not in Deploy Log	Deploy Change
Hash value in Deploy Log, but not in Source Code	Remove Change
Hashes differ between Source Code and Deploy Log	Re-deploy Change (drop/add if necessary)
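
For comparison with the stateful case, the stateless rules can be sketched as a function that never fails validation and instead returns an action per Change. The function name and the action labels are hypothetical; both inputs are assumed to be dictionaries mapping ChangeKey to hash.

```python
def stateless_changeset(source_hashes, deploy_log_hashes):
    """Both arguments map ChangeKey -> hash. Returns ChangeKey -> action."""
    actions = {}
    for key, source_hash in source_hashes.items():
        if key not in deploy_log_hashes:
            actions[key] = "deploy"      # new object
        elif deploy_log_hashes[key] != source_hash:
            actions[key] = "redeploy"    # edited in place: drop/add or rerun
        # hashes match: no action
    for key in deploy_log_hashes:
        if key not in source_hashes:
            actions[key] = "remove"      # deleted from Source Code
    return actions
```

For example, a view whose text was edited maps to "redeploy", while a stored procedure deleted from Source Code maps to "remove".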

Source Code Format

We have the algorithm in abstract now - the question becomes: how best to represent the Changes in source code?

We present 2 alternatives - the first is what most DB tools support that works best with stateful objects, and the second is what Obevo uses to work with both stateful and stateless changes.

Methodology Name

1. File per migration (used by most other tools)
2. File per DB object (used by Obevo)

Description

File per migration:

Most closely ties to how teams would do deployments manually
Each file has the set of changes intended for the upcoming migration.
Each file may span multiple database objects, and we can have multiple files for a migration.
In the example for this methodology:
- The original creation statements are in version 1 (V1). Subsequent versions have follow-up updates
- The ChangeKey is the file name, which includes the version number

File per DB object:

Code/Changes for each class/object are kept in their own dedicated files/set of files
An object requiring stateful changes can have many Changes defined in its file
An object requiring stateless changes would only have one Change defined, which would be re-deployed for each change.
This ties closely to how object-oriented languages maintain their code, e.g. Java having one-file-per-class
In the example for this methodology:
- The ChangeKey is the file name + "//// CHANGE" name for the subsection in the file (if applicable).
- Note that the stateless objects do not have "//// CHANGE" to denote multiple changes

Example:

File per migration:

File Names	Content
V1_baseline.sql	CREATE TABLE table1 (...)
	CREATE TABLE table2 (...)
	CREATE TABLE table3 (...)
	CREATE VIEW view1 SELECT * FROM table2 inner join table3 ... where colA=1
V2_table1ColAdd.sql	ALTER TABLE table1 ADD COLUMN colY
	ALTER TABLE table1 ADD COLUMN colZ
V2_addIndexes.sql	CREATE INDEX t1ind1 ON table1 (...)
	CREATE INDEX t1ind2 ON table1 (...)
	CREATE INDEX t2ind1 ON table2 (...)
V3_repTableColAdds.sql	ALTER TABLE table1 ADD COLUMN colB
	ALTER TABLE table3 ADD COLUMN colB
V4_fkAdds.sql	ALTER TABLE table3 ADD FOREIGN KEY ON (colB) REFERENCES table1 (colB)
V4_viewFix1.sql	DROP VIEW view1
	CREATE VIEW view1 SELECT * FROM table2 inner join table3 ... where colA=2
V5_viewFix2.sql	DROP VIEW view1
	CREATE VIEW view1 SELECT * FROM table2 inner join table3 ... where colA=3

File per DB object:

File Names	Content
table1.sql	//// CHANGE name=init
	CREATE TABLE table1 (...)
	//// CHANGE name=alter1
	ALTER TABLE table1 ADD COLUMN colY
	ALTER TABLE table1 ADD COLUMN colZ
	//// CHANGE name=index
	CREATE INDEX t1ind1 ON table1 (...)
	CREATE INDEX t1ind2 ON table1 (...)
	//// CHANGE name=alter2
	ALTER TABLE table1 ADD COLUMN colB
table2.sql	//// CHANGE name=init
	CREATE TABLE table2 (...)
	//// CHANGE name=ind1
	CREATE INDEX t2ind1 ON table2 (...)
table3.sql	//// CHANGE name=init
	CREATE TABLE table3 (...)
	//// CHANGE name=alter2
	ALTER TABLE table3 ADD COLUMN colB
	//// CHANGE name=fk1
	ALTER TABLE table3 ADD FOREIGN KEY ON (colB) REFERENCES table1 (colB)
view1.sql	CREATE VIEW view1 SELECT * FROM table2 inner join table3 ... where colA=1
	(edited on change 2)
	CREATE VIEW view1 SELECT * FROM table2 inner join table3 ... where colA=2
	(edited on change 3)
	CREATE VIEW view1 SELECT * FROM table2 inner join table3 ... where colA=3
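
As an illustration of the file-per-object format, the sketch below splits one object file into its Changes. It is not Obevo's actual reader: the function name is hypothetical, it assumes each "//// CHANGE name=..." marker sits on its own line, it only understands the name attribute, and it writes keys as objectName.changeName.

```python
import re

# Assumed marker shape: "//// CHANGE name=<changeName>" alone on a line.
CHANGE_MARKER = re.compile(r"^//// CHANGE\s+name=(\S+)\s*$")


def split_changes(file_name, content):
    """Return a list of (change_key, text) pairs for one object file.

    A file without any marker (a stateless object) yields a single Change
    keyed by the object name alone. Text before the first marker is
    ignored in this sketch.
    """
    object_name = file_name.rsplit(".", 1)[0]
    changes = []
    current_name, current_lines = None, []
    for line in content.splitlines():
        match = CHANGE_MARKER.match(line.strip())
        if match:
            if current_name is not None:
                text = "\n".join(current_lines).strip()
                changes.append((f"{object_name}.{current_name}", text))
            current_name, current_lines = match.group(1), []
        else:
            current_lines.append(line)
    if current_name is None:
        return [(object_name, content.strip())]
    text = "\n".join(current_lines).strip()
    changes.append((f"{object_name}.{current_name}", text))
    return changes
```

Applied to a table2.sql laid out as in the example, this yields two Changes keyed table2.init and table2.ind1, while view1.sql yields a single Change keyed view1.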

Pros

File per migration:

Easy to adopt initially, as this is similar to the way teams would manually deploy DB changes
Similar changes across multiple tables can easily be done in one file (see the V2_addIndexes.sql example; this is useful if, say, you forgot to apply a particular attribute to a DB object)
Easy to see which changes go into a particular version directly from the file system

File per DB object:

Easier to understand the DB structure just from the file structure, as the two match (much like Java classes)
File count will not grow significantly larger than the number of DB objects
Editing stateless objects in place (following a model similar to that of Java code) is now possible
More conducive to supporting the Level 7 maturity level (see below for more information)
Re-baselining the SQLs will not lead to an excessively large single file, as we would have with the migration-based model.
- In addition, re-baselining can be done on a per-table basis, instead of having to do it for the whole schema. This may be useful in case some tables are still actively developed in consecutive versions and others are not

Cons

File per migration:

Can lead to a cluttered and growing file system
- As each migration is contained in a file, the number of files would continue to grow endlessly over time
- The toy example above only involves 4 database objects over 5 releases, and we have 7 files. Imagine how much more that will grow with the number of tables and versions that get added
Harder to baseline, and leads to a less-desirable baseline file
- Baselining the DDLs would help avoid a cluttered file system
- However, that file itself would end up being very large, as it would contain all the database objects in the schema (you can already see how V1_baseline.sql is larger than the other files).
- Such a file structure would not be as readable (certainly, we don't look to define all our Java classes in a single file)
Very unfriendly approach to use for stateless objects
- Note the view1 example. We change the view 3 times (specifically the where clause), and even though it is a stateless object, the "migration" methodology of the DB deployment tool forces us to duplicate the text across each file. Certainly it would be nice to edit this in place, similar to a Java class
This is less conducive to supporting the Level 7 Maturity Level mentioned below

File per DB object:

Bulk changes to the schema could require changes to multiple files (e.g. note how the single V2_addIndexes.sql change in the migration-based model is now represented in two files)
- However, note that the same consideration would apply to Java class files regardless
- In addition, some concerns about applying bulk changes like permissions are mitigated by Obevo's support for defining grants in a central config file (i.e. no need to write grant statements for each table)
Cannot see in one file the changes that would be done for a particular version
- However, note that the diffs across versions can still be viewed by comparing tags, similar to how you would do this for Java classes. (notice a pattern here in how we can maintain DB objects more like code?)

Sorting Changes in the file-per-object format

The file-per-object format that Obevo chose has many benefits, but it adds some complexity in the tool’s implementation, namely: how to define the order of the changes?

With the file-per-migration format, it is simpler to just order by the files; but the file-per-object format would likely require some ordering constraints across changes. In this section, we discuss how this is handled.

[Developer Guide note - this section is implemented in com.gs.obevo.impl.graph.GraphEnricher]

Let’s look at the source code example above, redrawn as a graph:

Each filled box represents a Change
The stateful table* files consist of many changes, and the stateless view* file only has a single change.
The arrows represent order dependencies among the changes, notably:
- view1 depends on table2 and table3 per the SQL definition (an explicit code dependency)
- table3’s foreign key depends on table1 to be created (an explicit code dependency)
- The Changes listed within a stateful object must be done in order (an implicit dependency based on the order in the file)
Note that changes not related via arrows (whether directly or transitively) don’t have any dependency order, e.g. table1.init and table2.init can be executed in any order.

[Diagram: Dependency Graph]

Given that graph, an acceptable deployment order can be obtained using a topological sort, i.e. an ordering of the nodes in which every Change appears after all the Changes it depends on.

[Developer Guide note - the topological sorting is implemented (with the help of JGraphT) in com.gs.obevo.impl.graph.GraphSorter]

For the example above, we have a number of acceptable topological sorts:

table1.init, table1.alter1, table1.index, table1.alter2, table3.init, table3.alter2, table3.fk1, table2.init, table2.ind1, view1
table2.init, table2.ind1, table3.init, table3.alter2, table1.init, table1.alter1, table1.index, table1.alter2, table3.fk1, view1

For easier supportability by clients, we will tune the topological sorting algorithm to give a consistent sorting.
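
One way to get a consistent order is sketched below: Kahn's algorithm with ties broken by Change name. This is only an illustration of the idea and is not claimed to be what com.gs.obevo.impl.graph.GraphSorter does; the function name and the tie-breaking rule are assumptions. The input maps each Change to the set of Changes that must precede it.

```python
import heapq


def consistent_topological_sort(dependencies):
    """dependencies: node -> set of nodes that must be deployed before it.

    Kahn's algorithm; among the nodes that are ready at any point, the
    alphabetically smallest is picked first, so the same graph always
    yields the same order regardless of input ordering.
    """
    nodes = set(dependencies)
    for deps in dependencies.values():
        nodes.update(deps)
    remaining = {n: set(dependencies.get(n, ())) for n in nodes}
    dependents = {n: [] for n in nodes}
    for node, deps in remaining.items():
        for dep in deps:
            dependents[dep].append(node)

    ready = [n for n, deps in remaining.items() if not deps]
    heapq.heapify(ready)
    order = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for dependent in dependents[node]:
            remaining[dependent].discard(node)
            if not remaining[dependent]:
                heapq.heappush(ready, dependent)
    if len(order) != len(nodes):
        raise ValueError("cycle detected among changes")
    return order
```

Whatever tie-breaking rule is chosen, the point is that it is fixed, so two runs over the same Source Code produce the same deployment order.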

Defining and discovering dependencies in Source Code

One detail left for this algorithm - how is the dependency graph created from the source code? Specifically the explicit code dependencies, which require some sort of parsing of the code. Parsing code is not a trivial problem; to do it correctly, we would need ASTs for the language the code is written in, and those may not be easily available or easy to implement.

Part 1: The simplest methodology is to avoid parsing entirely by defining a metadata attribute on each Change that lists its dependencies. This would technically work; however, it will not scale well for developers if there are hundreds (or thousands) of code files to maintain. This method will still prove useful, though; we will come back to it.

Part 2: We can discover the dependencies in text using a rudimentary technique:

We have the object names (based on the directory structure) and the text of the code
For each Change, search for references to those object names. Whichever object names are found in the text are counted as dependencies.

[Developer Guide note - this technique is implemented in com.gs.obevo.impl.text.TextDependencyExtractor]

From the example above:

The object name list is: table1, table2, table3, view1
The text for view1 is as follows; the object names found in it are view1, table2, and table3

CREATE VIEW view1 SELECT * FROM table2 inner join table3 ... where colA=3

Excluding the object’s own name (view1), we find that table2 and table3 are dependencies.
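
A minimal sketch of this text search follows. It is not the code in com.gs.obevo.impl.text.TextDependencyExtractor; the function name is hypothetical, and matching whole words case-insensitively is an assumption made for the sketch.

```python
import re


def find_dependencies(object_name, text, all_object_names):
    """Return the known object names that appear as whole words in text,
    excluding the object's own name."""
    found = set()
    for candidate in all_object_names:
        if candidate == object_name:
            continue
        # \b keeps "table2" from matching inside "table22".
        pattern = rf"\b{re.escape(candidate)}\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(candidate)
    return found
```

A search like this will also match names that appear in comments or string literals, which is one source of the false positives discussed next.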

Part 3: What if our rudimentary technique doesn’t work? We may find false positives or false negatives.

In this case, we fall back to the metadata attribute option mentioned in Part 1. We let clients do any of the following:

Override the programmatic calculation by defining their own list
Supplement the programmatic calculation by specifying dependencies to add (fixing the false negatives)
Supplement the programmatic calculation by specifying dependencies to remove (fixing the false positives)
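
The three options can be combined in one small function, sketched here. The function and parameter names (override, include, exclude) are hypothetical and only illustrate the precedence: an explicit override wins, otherwise additions and removals are applied to the discovered set.

```python
def resolve_dependencies(discovered, override=None, include=(), exclude=()):
    """Combine text-discovered dependencies with user-supplied metadata.

    override: if given, replaces the discovered set entirely.
    include:  dependencies to add (fixes false negatives).
    exclude:  dependencies to drop (fixes false positives).
    """
    if override is not None:
        return set(override)
    return (set(discovered) | set(include)) - set(exclude)
```

For instance, if the text search wrongly picks up a column name that happens to match an object name, listing it under exclude removes it without giving up the automatic discovery for everything else.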

Putting it together - integrating DB Deployments and Other Platforms

Thus far, we have not had any DB-specific or SQL-specific imprints in the algorithms (the text parsing algorithm could work with any language).

The points where DB-specific (or any other platform-specific) logic would go are:

“Apply Changeset to Platform” - this point in the algorithm takes the Change as input; how to actually apply the change is platform-specific
Deploy Log implementation - this should ideally co-exist with the environment (e.g. a table in the DB for DB platforms), but that is not an absolute requirement.
“Read Changes from Source Code” - each platform implementation can define the source structure it imposes on users as it sees fit.
“Calculate Changeset” and Discovering Dependencies - there are a few minor details for each platform implementation to fill in (e.g. handling case-sensitivity)

[Developer Guide note - com.gs.obevo.api.platform.Platform and com.gs.obevo.api.platform.ChangeType classes are the key interfaces to implement to customize the behavior for each platform.]