Writing model tests - best practices

Here at our institute, we work on a couple of (larger-scale) economics-related GAMS models. Different teams, involved in different projects, make adaptations to the models in order to answer their specific research questions. The models are equipped with a Graphical User Interface (GUI) and are version-controlled through SVN. The general idea is that the SVN master branch represents the most recent, stable model version, which should be capable of running all possible GUI settings (and their combinations), or should return appropriate abort statements if the selected combinations are deemed logically infeasible (e.g. growing Oranges without any Orange trees).

The problem
Imagine a team focuses on a specific problem, e.g. the production of Oranges. As they improve details of the production process of Oranges, e.g. required machinery, labor needs, etc., they now want to commit their changes to the master SVN branch, as their additions improve the overall state and capabilities of the model.
However, due to the structure of the model, the production of Oranges is not separated (and should not be) from the rest of the operation, and it has implications for other parts of the model as well. With the way their changes were implemented, now imagine that the production of Apples has become mathematically infeasible (which it should not be).
Now, with the (erroneous) changes committed, the research team involved in Apple production faces an infeasible model. In order to correct the bug introduced by the other team, they first have to go through the other team's changes, understand them, come up with a fix, and eventually commit to the master branch once again.
After this final commit, the production of Bananas is not working anymore… I guess everyone gets the idea :frowning:

Tests to the rescue - Our current solution
In order to prevent such a scenario, each team defines a set of GUI settings that needs to work for their current project in a so-called batch file. This batch file is a textual representation of the settings selected in the GUI. Before every commit, each team is supposed to run the model, with their local modifications, against the batch-file settings of every team. The model outcomes for past revisions are stored on a server, which keeps selected variable levels for each model run and revision.

Especially with an economic model, this is a tedious task. Commits often bring updates to policies or input/output prices, which are expected to change the model results. However, judging whether the new results are “correct” consequences of the introduced changes, or whether the deviations from their past levels are “too far” off (indicating a bug), often requires hours of manual checking. This is obviously not a popular task, and it quickly facilitates the “this change is so minimal, I don’t need to run the tests” mindset, over time leading again to the problem(s) initially described.

Best practices
I strongly believe that we are not the only group facing this (methodological) issue. Tests are a significant part of software engineering in general, though I have the feeling that with mathematical programming models, the overall procedure for how tests can be conducted is/should be different.
That’s why my actual question is: How do you (fellow modelers) resolve this issue? What are best practices that you implement in your projects/teams that facilitate writing clean, well-tested code? Are there methods that have proven to work well?

Chris,

Your note struck a chord with me. Here at GAMS we also struggle with how best to handle developments and changes in our code.

From your note it sounds like the tests are run manually, by a user via the GUI, with the “so-called batch file” as a sort of recipe. This does indeed sound very time-consuming and tedious. I suppose you have already considered automating this task. It’s possible, and even convenient, to design a model so it can be run equally well in batch mode and from a GUI. We’ve included many features in GAMS to make this possible. We appreciate the power of automation here - some have commented it almost borders on mania! So over the years we have worked hard to make that part of our quality-control and testing schemes. I could say more about that offline if you like.

The next issue is how to compare the results: are the changes made by the Oranges team OK or not? This to me seems very model-dependent. Apart from the obvious things (no failures in GAMS, model and solve status codes the same or at least as expected) I can’t say much.

You mentioned SVN and branches. We were an SVN shop for many years before we switched to git. This is an interesting topic where people have different and sometimes strong views, but even git haters will admit that git offers better support for branching and merging, with repos that are more independent of each other than they are with SVN. Your description of the Oranges and Apples teams working on their own branches reminded me of our situation with the GAMS sources and SVN: this was a motivator to move to git.

HTH,

-Steve

Hi Chris
Same story here. We have a project with 5 models: 3 coded in Matlab, 1 in Python, and mine in GAMS. We have written interfaces for transferring data from one model to another in Matlab, and all models are run in a loop coded in Matlab (each model is an object, to keep it completely encapsulated). Every module has a stand-alone and an integrated switch. We use a mixture of Git and SVN. All in all, enough to give you a real headache.
On my side, I have started writing lots of “assertions” in my GAMS code for the information I get from the other modules, as I don't have enough insight into what my colleagues change.
As most of the data is in a central MySQL database, I can, for example, read the cost inputs in the other teams’ data that are relevant for me and compare them to my own assumptions/data, or check whether they still use the same set of technologies, changed names, etc.
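Just to illustrate the idea (sketched here in R rather than in my GAMS code, and with made-up table names, column names and connection details), such a check boils down to something like:

```r
# minimal sketch of a consistency check against the central MySQL database
# (table name, column names and connection details are made up)
library(DBI)

con <- dbConnect(RMySQL::MySQL(), dbname = "project_db", host = "dbserver",
                 user = "reader", password = Sys.getenv("DB_PASSWORD"))
db_costs <- dbGetQuery(con, "SELECT technology, cost FROM tech_costs")
dbDisconnect(con)

# my own assumptions about the technology set and the cost inputs
my_costs <- data.frame(technology = c("coal", "gas", "wind"),
                       cost       = c(30, 45, 60))

# did the technology set change (elements added, removed or renamed)?
if (!setequal(db_costs$technology, my_costs$technology))
  stop("Technology set in the database differs from my assumptions")

# do the cost inputs deviate by more than 20% from my assumptions?
m <- merge(db_costs, my_costs, by = "technology", suffixes = c("_db", "_mine"))
off <- abs(m$cost_db - m$cost_mine) / m$cost_mine > 0.2
if (any(off))
  stop("Cost inputs deviate from my assumptions: ",
       paste(m$technology[off], collapse = ", "))
```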
Every stand-alone module has one scenario that should reproduce the start data taken from the database (i.e. the electricity generation investment model should reproduce the situation in 2015, and the same holds for the other four models).
It would be great to have a kind of workshop to discuss how to manage projects like ours and to exchange ideas, problems and frustrations.
Cheers
Renger

Thanks Steve and Renger for your replies,

So again a short note about our current procedure:
We don’t do the tests manually anymore, as we wrote an R script that automates the process:

  1. On every SVN update, a post-commit hook on the SVN server runs the R script, which in turn runs the GAMS batch files and creates a CSV file with every scenario defined in the batch file and selected variable levels (see attached pictures)
  2. If a team has made changes to their working copy and now wants to commit these changes to master, they are supposed to run the R script on their local copy:
    The GAMS batch files are run again, the latest version of the CSV file is retrieved from the server and copied to the local GAMS working directory, and a new column is appended to the CSV showing the results of the “local” working copy compared to the last master runs. This is shown in one of the pictures attached: all model versions run on the server are prefixed with an “X”, while the “local” versions just have their version number as the column label.

But even though the task is automated, it is still tedious: on the one hand, the “local” model check takes quite long (ca. 30 min, as every scenario has to be solved), and I don’t see any way of reducing the required time significantly. Furthermore, there are ca. 300 rows to be checked manually to find out whether there are significant changes to a variable in a scenario and whether they make sense or not. Also, with the script written in R (as this is the most widely known language in the Institute besides GAMS), we often face runtime errors on different PCs due to missing/incompatible packages or R versions. With the development of the embedded code facility I was thinking about rewriting the code in Python, but we need to discuss this internally first.
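For illustration, the comparison step boils down to something like this minimal R sketch (the file name, the scenario/variable columns and the 10% tolerance are made up):

```r
# minimal sketch of the comparison step: flag rows where the "local" run
# deviates strongly from the latest master run
# (file name, column layout and the 10% tolerance are made up)
results <- read.csv("scenario_results.csv", stringsAsFactors = FALSE)

master_cols   <- grep("^X", names(results), value = TRUE)  # server runs ("X...")
latest_master <- tail(master_cols, 1)                      # last master column
local_col     <- tail(names(results), 1)                   # appended local column

rel_dev <- abs(results[[local_col]] - results[[latest_master]]) /
           pmax(abs(results[[latest_master]]), 1e-6)

tol <- 0.10  # only rows above this relative deviation get a manual look
flagged <- results[rel_dev > tol,
                   c("scenario", "variable", latest_master, local_col)]
print(flagged)
```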

Thanks, Steve, for the hint towards git. I’m a huge fan of git myself, but as you already said, there are sometimes strong opinions towards the one or the other option. With git, and especially GitHub, tracking issues and pull requests is very nice, and we could probably run the tests on every PR that is submitted, preventing a merge if issues persist (e.g. using something like https://circleci.com/). On the other hand, with git’s decentralized nature, every clone would contain a lot of bulk, especially if the model has been growing over years (or even decades).
I would also love the idea of having a workshop, as Renger suggests. Discussing our problems and brainstorming together would most likely help us come up with better testing solutions, maybe even solutions that could be generalized for testing/developing different kinds of models.
Attachments: qm.PNG, qm2.PNG

Chris,

You wrote about the remaining tedium:

But even though the task is automated, it is still tedious: on the one hand, the “local” model check takes quite long (ca. 30 min, as every scenario has to be solved), and I don’t see any way of reducing the required time significantly. Furthermore, there are ca. 300 rows to be checked manually to find out whether there are significant changes to a variable in a scenario and whether they make sense or not. Also, with the script written in R (as this is the most widely known language in the Institute besides GAMS), we often face runtime errors on different PCs due to missing/incompatible packages or R versions. With the development of the embedded code facility I was thinking about rewriting the code in Python, but we need to discuss this internally first.

I wonder if you could solve some of these scenarios in parallel. There are plenty of issues along such a path, but if you assume each machine has 4 cores - not unreasonable - you could expect a good speedup if your scenarios are independent in nature rather than sequential.
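For instance, if each scenario can be launched as an independent GAMS run, something along these lines would spread the runs over the available cores (the GAMS command line and the scenario names are just placeholders):

```r
# minimal sketch: run independent scenarios in parallel from R
# (the GAMS command line and the scenario names are placeholders)
library(parallel)

scenarios <- c("baseline", "oranges", "apples", "bananas")

run_scenario <- function(s) {
  # pass the scenario name as a compile-time parameter and give each run its
  # own listing and log file so that parallel runs do not collide
  system2("gams", args = c("model.gms", paste0("--scenario=", s),
                           paste0("o=", s, ".lst"),
                           "logoption=2", paste0("logfile=", s, ".log")))
}

cl <- makeCluster(min(4L, detectCores()))  # e.g. 4 cores per machine
invisible(parLapply(cl, scenarios, run_scenario))
stopCluster(cl)
```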

The problem of checking whether “significant changes [in results] make sense or not” is a tough one, of course. It’s as much a question of what to check as of how to do the check.

Re: the runtime errors on different PCs, I would think that setting a minimum standard (e.g. a minimum version of R, like 3.4, and a minimum package set), adding some automatic checks or scripts to verify that a machine meets the standard, and then paying some attention to detail when writing the test scripts would take care of this. I am no Python guru, but I have suffered somewhat due to the large number of versions, variants, and packaging systems that form the Python universe, so I found it ironic that you suggested Python development immediately after mentioning version headaches with R.
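For instance, a short preflight script run on every machine before the tests could check the agreed-upon minimum (the version number and the package names below are only placeholders):

```r
# minimal preflight check: R version and required packages
# (the minimum version and the package list are placeholders)
min_r_version <- "3.4.0"
required_pkgs <- c("gdxrrw", "data.table")

if (getRversion() < min_r_version)
  stop("R ", min_r_version, " or newer is required, found ", getRversion())

missing <- required_pkgs[!vapply(required_pkgs, requireNamespace,
                                 logical(1), quietly = TRUE)]
if (length(missing) > 0)
  stop("Missing packages: ", paste(missing, collapse = ", "))

message("Environment OK")
```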

I would also find a workshop interesting and useful. Plenty of interesting topics to discuss.

-Steve