User Guide

Library of Modeling Knowledge

The library of modeling knowledge consists of templates for modeling processes and entities in the domain of interest. The syntax and semantics of the formalism for specifying templates are described in Section 4 of [2], more specifically in Section 4.2. Section 3.2 of the same document [2] illustrates the use of the formalism on an example of a simple library of knowledge for modeling aquatic ecosystems.

The distribution of ProBMoT includes two example libraries: a library for modeling aquatic ecosystems AquaticEcosystem.pbl, described in Section 3 of [2], and a library for modeling endocytosis EndocytosisLibrary.pbl, described in [4].

The following is a brief description of the syntax used for describing a library of domain knowledge.

In the process-based formalism, a library is specified with the following syntax:

library Name; template_defs

where template_defs is a sequence of entity, process and compartment definitions.

template_defs::= (entity_template_def | process_template_def | compartment_template_def)*

entity_template_def has the following syntax:

template entity Name [: SuperEntity] { vars: template_variable_def (, template_variable_def)*; consts: template_constant_def (, template_constant_def)*; }

where template_variable_def is a variable template definition and template_constant_def is a constant template definition.

process_template_def has the following syntax:

template process Name [( argumentSpec)*] [ : SuperProcess] { consts: template_constant_def (,template_constant_def)*; equations: template_eq_def (,template_eq_def)*; processes: nested_processes_list; }

where template_eq_def is an equation template and argumentSpec is the specification of an argument of the process template, which determines the types of entities that can be involved in the process.

argumentSpec is defined as follows:

argumentSpec ::= argumentName : argumentType[<minCard, maxCard> | <card>]

where argumentName is an argument identifier used in the body of the process template and argumentType is the entity template specifying the type of the argument (the argument has to be an instance of that entity template). The last part of the specification designates the allowed cardinality. Cardinality is specified with a lower and an upper bound as the interval <minCard, maxCard>. If the lower and upper bounds are the same, the definition of cardinality can be shortened to <card>. If both bounds are 1, the cardinality declaration can be omitted, as 1 is the default cardinality for arguments. The definition of an argument then becomes:

argumentName : argumentType
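
The cardinality rules above can be sketched in Python. This is only an illustration under stated assumptions, not ProBMoT's own parser: the function name parse_argument_spec and the regular expression are hypothetical, only integer bounds are handled, and the argument and entity names in the usage note are made up.

```python
import re

# Hypothetical helper (not part of ProBMoT). Accepts "name : Type",
# "name : Type<card>" and "name : Type<minCard, maxCard>".
_SPEC = re.compile(
    r"^\s*(\w+)\s*:\s*(\w+)\s*(?:<\s*(\d+)\s*(?:,\s*(\d+)\s*)?>)?\s*$"
)

def parse_argument_spec(spec):
    """Return (argumentName, argumentType, minCard, maxCard)."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"not a valid argument specification: {spec!r}")
    name, arg_type, low, high = match.groups()
    if low is None:
        # Cardinality omitted: 1 is the default lower and upper bound.
        low = high = "1"
    elif high is None:
        # Short form <card> stands for <card, card>.
        high = low
    return name, arg_type, int(low), int(high)
```

For example, parse_argument_spec("ns : Nutrient<2>") yields the same bounds as parse_argument_spec("ns : Nutrient<2, 2>"), and parse_argument_spec("pp : PrimaryProducer") falls back to the default cardinality of 1.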

The entity templates and process templates form entity and process taxonomies, each in the form of a rooted tree. SuperEntity and SuperProcess are references to an ancestor entity template and an ancestor process template, respectively.

A variable template is specified with the following syntax:

name { range: <lower_bound, upper_bound>; unit: string_value; aggregation: aggregation_name; }

The process-based formalism includes the following aggregation functions: sum, product, average, minimum and maximum, with the sum being the default function when no aggregation is explicitly specified.
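
As a rough illustration, the five aggregation functions and the default can be written as a small Python lookup table. This is a sketch under assumptions: the dictionary keys simply mirror the names listed above (the exact keywords accepted in library files may differ), and the sketch assumes that an aggregation combines several numeric contributions to the same variable.

```python
import math
from statistics import mean

# Keys mirror the names in the text; the exact keywords used in library
# files are an assumption of this sketch.
AGGREGATIONS = {
    "sum": sum,
    "product": math.prod,
    "average": mean,
    "minimum": min,
    "maximum": max,
}

def aggregate(contributions, aggregation=None):
    """Combine several contributions to one variable into a single value."""
    # Sum is the default when no aggregation is explicitly specified.
    return AGGREGATIONS[aggregation or "sum"](contributions)
```

With this sketch, aggregate([1.0, 2.0, 3.0]) uses the default sum, while aggregate([2.0, 3.0], "product") multiplies the contributions.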

A constant template is specified with the following syntax:

name { range: <lower_bound, upper_bound>; unit: string_value; }

Each equation template can refer to an algebraic or a differential equation. The left-hand side of the equation consists of a template variable from an argument of the template process. The time derivative of a variable appearing on the left-hand side is specified by the function td. On the right-hand side of the equation, we distinguish two main components: the mathematical functional form of the right-hand side, and the variables and constants that appear in it. The functional form can be any mathematical function. The process-based formalism supports functions that are expressed as formulas containing the following operators and functions: unary negation (-), addition (+), subtraction (-), multiplication (*), division (/), sine (sin), cosine (cos), signum (sign), power (pow), minimum (min), maximum (max), exponential (exp), natural logarithm (log) and common logarithm (log10). The variables and constants that appear on the right-hand side can be variable or constant templates from the arguments of the process, or local constant templates from the process.
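
To make the list of supported operators and functions concrete, the following Python sketch evaluates a right-hand-side formula given values for its variables and constants. It is an assumption-laden illustration rather than ProBMoT's evaluator: the helper evaluate_rhs is hypothetical, the formula is written with Python operator syntax, and the names growthRate, conc and capacity in the usage note are invented.

```python
import math

# The functions listed above, mapped to Python equivalents.
SUPPORTED_FUNCTIONS = {
    "sin": math.sin,
    "cos": math.cos,
    "sign": lambda x: (x > 0) - (x < 0),  # signum: -1, 0 or 1
    "pow": math.pow,
    "min": min,
    "max": max,
    "exp": math.exp,
    "log": math.log,      # natural logarithm
    "log10": math.log10,  # common logarithm
}

def evaluate_rhs(expression, values):
    """Evaluate a right-hand-side formula (hypothetical helper).

    Only the supported functions and the supplied variable and constant
    values are visible to the expression.
    """
    scope = dict(SUPPORTED_FUNCTIONS)
    scope.update(values)
    return eval(expression, {"__builtins__": {}}, scope)
```

For instance, evaluate_rhs("growthRate * conc * (1 - conc / capacity)", {"growthRate": 0.5, "conc": 2.0, "capacity": 10.0}) evaluates a logistic growth term for the given values.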

Nested processes are used to decompose a large and complex process into several smaller processes. A process can contain an arbitrary number of nested processes. Each nested process can in turn contain its own nested processes. The processes which are not nested in any other process are called top-level processes. All process templates are defined at the library level.

Finally, a compartment template is, just as a library, a named collection of entity, process and compartment templates.

In the process-based formalism, a compartment template is specified with the following syntax:

template compartment Name { entities: TE1, TE2, ...; processes: TP1, TP2, ...; compartments: TK1, TK2, ...; }

where TE1, TE2, ... are identifiers of existing entity templates, TP1, TP2, ... are identifiers of existing process templates, and TK1, TK2, ... are identifiers of existing compartment templates. The order in which entities, processes, and compartments are specified does not have any influence. If the compartment template does not contain any entity templates, the entities part can be omitted. Similarly, processes and compartments can be omitted if there are no process or compartment templates, respectively.

Process-Based Model

A process-based model consists of a set of instances, i.e., specific entities and processes, of the templates from the library of modeling knowledge. The syntax and semantics of the formalism for specifying process-based models are described in Section 4.3 of [2]. Section 3.3 of [2] illustrates the use of the formalism on an example of a process-based model of a simple aquatic ecosystem. The example in Section 3.3 represents a model with completely specified structure (set of entities and processes) and parameters (values of all the constant parameters).

The formalism for specifying process-based models also allows for specifying incomplete models, where some parts of the structure or some values of the parameters are missing. An incomplete model can be transformed into a process-based model with a complete structure by filling in the missing parts of the model structure with templates specified in the library of modeling knowledge. By using alternative templates, an incomplete model can be completed into a number of different process-based models. When using ProBMoT for learning models from data, incomplete models specify the set of candidate model structures considered by the learning algorithm.

The distribution of ProBMoT includes two examples of incomplete models: an incomplete model of an aquatic ecosystem BledIncomplete.pbm and an incomplete model of endocytosis EndocytosisModel.pbm.

The following is a brief description of the syntax used for describing a process-based model.

In the process-based formalism, a model is defined with the following syntax:

model modelName : LibraryName; instance_defs

where instance_defs is a sequence of entity, process and compartment instance definitions.

instance_defs::= (entity_instance_def | process_instance_def | compartment_instance_def)*

entity_instance_def has the following syntax:

entity entityName : TemplateEntity { vars: instance_variable_def (,instance_variable_def)*; consts: instance_constant_def (,instance_constant_def)*; }

where instance_variable_def is a variable instance definition and instance_constant_def is a constant instance definition. Each entity instance must contain the same variables and constants as its template.

process_instance_def has the following syntax:

process processName ( Arguments ) : TemplateProcess { consts: instance_constant_def (,instance_constant_def)*; processes: nested_processes_list; }

The number of arguments of a process must correspond to the number of arguments of its process template. Furthermore, the type and number of the entities in each argument must correspond to the type and cardinality of the argument of the process template. Each argument of a process instance is a set of entity instances.
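
These consistency requirements can be sketched as a small Python check. It is a simplified illustration, not ProBMoT code: the function check_arguments and its data layout are hypothetical, and the type test requires an exact match, ignoring the entity taxonomy.

```python
def check_arguments(template_args, instance_args):
    """Check a process instance against its template (hypothetical helper).

    template_args: list of (argument_type, min_card, max_card) tuples.
    instance_args: list of argument sets, each a list of
                   (entity_name, entity_type) tuples.
    """
    # The number of arguments must match the template.
    if len(template_args) != len(instance_args):
        return False
    for (arg_type, low, high), entities in zip(template_args, instance_args):
        # The number of entities must respect the declared cardinality.
        if not low <= len(entities) <= high:
            return False
        # Simplification: exact type match, no taxonomy lookup.
        if any(entity_type != arg_type for _, entity_type in entities):
            return False
    return True
```

A process template with one argument of a hypothetical type Nutrient and cardinality <1, 2> would accept an instance whose single argument is a set of one or two Nutrient entities and reject a set of three.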

In the instance_variable_def the range, unit, and aggregation function of each variable are the same as in its template and do not have to be explicitly defined. The instantiation however requires definition of role and initial value.

The role of the variable can be either exogenous or endogenous. Exogenous variables are input variables that are used as forcing influences on the system. They are not modeled within the system; their behavior through time comes from external measurements. Endogenous variables, on the other hand, are modeled within the system. Each endogenous variable is assigned an equation (possibly obtained by combining several equation fragments), with which its value is computed. Endogenous variables can be further classified as auxiliary or state. State variables are influenced by differential equations, whereas auxiliary variables are influenced by algebraic equations. A variable cannot be influenced by both algebraic and differential equations. State variables have an initial value, which is the value of the variable at the first time point.

The value of a constant, in both entities and processes, is the only property specified when instantiating the constant. Therefore, it makes sense to make the assignment of a value to a constant straightforward, by simply assigning the value to the constant name itself with the syntax:
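
The classification just described can be summarized in a few lines of Python. The function classify_variable and the string labels are hypothetical and serve only to restate the rules; they are not part of ProBMoT.

```python
def classify_variable(role, equation_kinds=frozenset()):
    """Classify a variable from its role and the equations influencing it.

    role: "exogenous" or "endogenous".
    equation_kinds: subset of {"algebraic", "differential"}.
    """
    kinds = set(equation_kinds)
    if role == "exogenous":
        # Exogenous variables are inputs and are not modeled by equations.
        if kinds:
            raise ValueError("exogenous variables are not modeled by equations")
        return "exogenous"
    if role != "endogenous":
        raise ValueError(f"unknown role: {role!r}")
    if kinds == {"differential"}:
        return "state"      # needs an initial value
    if kinds == {"algebraic"}:
        return "auxiliary"
    # Mixing both kinds (or having none) violates the rules above.
    raise ValueError("an endogenous variable needs exactly one kind of equation")
```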

constName = realNumber.

Process instances do not define equation instances explicitly. Equation instances are created implicitly whenever a process is instantiated. Each equation instance is uniquely determined by the equation template defined in the process template and the arguments of the process.

Similarly to equations, one nested process in the process template corresponds to a set of nested processes in the process instance. In the process-based formalism, nested processes are specified as a single list and not as a list of sets, because the system can infer the placement of nested processes from the order in which they are specified. The number of nested processes in the process template must correspond to the number of nested process sets in the process instance. The type of a process specified as a nested process in a process instance must be compatible with the template of the nested process specified in the process template.

Finally, a compartment is very similar to a model and acts like a mini-model. It also consists of entities, processes and nested compartments. In the process-based formalism, a compartment is defined with the following syntax:

compartment compartmentName : TemplateCompartment { instance_defs }

where compartmentName is the name of the compartment, TemplateCompartment is the name of the compartment template and instance_defs has the same meaning as above.

Compartments can be nested in other compartments, forming a taxonomy of nested compartments. The topmost compartments are at the level of the model.

Data Set

When learning models from data, the user has to specify a data set with measurements of the variables of the observed system at consecutive time points. The format of the data set file is simple: the first row specifies the names of the system variables (separated with spaces), while each of the following rows contains the measured values of the system variables, in the same order as the names in the first row.

The distribution of ProBMoT includes a number of example data sets, including 02.data, measurements of a number of variables in Lake Bled in 2002, and endo.data, measurements of a switch in concentrations of Rab5 and Rab7 domain proteins during the early stages of endocytosis.
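
A reader for this format takes only a few lines of Python. The sketch below is an illustration, not the loader ProBMoT uses: the function parse_data_set, its sep parameter and the column names in the usage note are assumptions.

```python
def parse_data_set(text, sep=" "):
    """Parse a data set into a dict mapping variable names to value lists."""
    def split_row(row):
        # For the space separator, tolerate runs of whitespace.
        return row.split() if sep == " " else row.split(sep)

    rows = [line.strip() for line in text.splitlines() if line.strip()]
    names = split_row(rows[0])          # first row: variable names
    columns = {name: [] for name in names}
    for number, row in enumerate(rows[1:], start=2):
        values = split_row(row)
        if len(values) != len(names):
            raise ValueError(f"row {number} has {len(values)} values, "
                             f"expected {len(names)}")
        # Values appear in the same order as the names in the first row.
        for name, value in zip(names, values):
            columns[name].append(float(value))
    return columns
```

For a file with the header "time temp" followed by the rows "0 4.5" and "1 5.0", the result maps time to [0.0, 1.0] and temp to [4.5, 5.0].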

Task Specification

The inputs and settings for running ProBMoT are specified in a task specification file in XML format. The complete XML schema specifying the structure of the task specification file can be found in probmot_task.xsd. The following explains each of the XML tags in the schema.

The whole task specification consists of a single <task> element. The task specification comprises three groups of elements, each described in one of the following three subsections. The distribution of ProBMoT includes two example task files: pbm for running a simple aquatic ecosystem modeling task and pbm for running the endocytosis modeling task.

ProBMoT Input Files

The two elements <library> and <model> (or <incomplete>) specify the paths, relative to the directory from which ProBMoT is run, of the file with the library of modeling knowledge (.pbl file) and the file with the (in)complete model (.pbm file). The two elements correspond to the first two ProBMoT inputs specified above.

The <data> element specifies the files containing data sets with measurements of the variables of the observed system. It contains a sequence of <d> elements, each of which specifies the relative path to a data set file. The <d> element has two attributes, sep and id. The sep attribute specifies the character used to separate the values (columns) in the data set file (the default being the space character " "). The id attribute specifies the ID used as a reference to the particular data set.

The <output> element declares the output of the simulation of the model as lists of <constants> and <variables>. Each <var> element in <variables> specifies the formula for calculating a single output of the simulation. Most often the output is equal to a single model variable, e.g., BledIncomplete.phyto.conc specifies the conc variable of the phyto entity of the BledIncomplete model. The name attribute of <var> specifies the ID used as a reference to the particular output. The <cons> elements in <constants> specify constant parameters that can be used in the formulas for calculating outputs. The <cons> element has three attributes: a Boolean attribute fit specifying whether the value of the constant parameter is fitted against data, an attribute value specifying the value of the constant parameter when fit=False, and an attribute range specifying the lower and upper limits on the fitted value of the parameter when fit=True.

The <mappings> element provides further declarations used to establish mappings between the variables/outputs of the model and the variables in the data sets. The <dimension> element specifies the names of the dimension variables in the data set files: each <dim> corresponds to a single dimension variable, most often time. Furthermore, each <exo> element in <exogenous> relates the name of a single input/exogenous variable of the model to the name of a single variable in the data set. Similarly, each <endo> element in <endogenous> relates the name of a single state/endogenous variable of the model to the name of a single variable in the data set. Finally, each <out> element in <outputs> relates a single output variable name to a single data set variable name.

The <dim>, <exo>, <endo> and <out> elements have two attributes, name and col. The name attribute specifies the model variable (e.g., BledIncomplete.phyto.conc in the previous example) or output name, while the col attribute specifies the name of the corresponding variable in the data set file.
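
To show how these elements might fit together, the following Python sketch assembles a small task specification with the standard xml.etree.ElementTree module. It is an illustration built on assumptions, not a validated task file: the nesting of the elements under <task>, the choice to store paths and formulas as element text, the output ID phyto and the column names time and phyto are all guesses, and probmot_task.xsd remains the authoritative description.

```python
import xml.etree.ElementTree as ET

def build_task(library_path, model_path, data_path):
    """Assemble a task specification tree (layout is an assumption)."""
    task = ET.Element("task")
    ET.SubElement(task, "library").text = library_path
    ET.SubElement(task, "incomplete").text = model_path

    # One data set, referenced elsewhere by its id.
    data = ET.SubElement(task, "data")
    ET.SubElement(data, "d", {"id": "1", "sep": " "}).text = data_path

    # One output, equal to a single model variable.
    output = ET.SubElement(task, "output")
    variables = ET.SubElement(output, "variables")
    var = ET.SubElement(variables, "var", {"name": "phyto"})
    var.text = "BledIncomplete.phyto.conc"

    # Mappings from model names to data set columns (hypothetical columns).
    mappings = ET.SubElement(task, "mappings")
    dimension = ET.SubElement(mappings, "dimension")
    ET.SubElement(dimension, "dim", {"name": "time", "col": "time"})
    outputs = ET.SubElement(mappings, "outputs")
    ET.SubElement(outputs, "out", {"name": "phyto", "col": "phyto"})
    return task
```

Calling ET.tostring(build_task("AquaticEcosystem.pbl", "BledIncomplete.pbm", "02.data"), encoding="unicode") returns the XML text, which should be checked against the schema before use.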

ProBMoT Command and Output Files

The <command> element specifies the particular task we want to perform with ProBMoT. Currently, ProBMoT supports three tasks on incomplete models:

count: counts the number of candidate model structures that can be generated from the given incomplete model;
enumerate: enumerates the candidate model structures that can be generated from the given incomplete model and outputs each in the process-based model formalism;
exhaustive_search: ranks the candidate model structures according to their degree of fit to the measured behavior of the observed system included in the data sets.

and a single task on complete models:

simulate_model: simulates the given complete model and outputs the simulated behavior to a file.

The <writeDir> element specifies the directory with the output files. The naming convention for the output files is as follows:

simulations/Model#ID.sim, where ID refers to the particular process-based model; each file contains the simulation of the particular model considered during execution of the exhaustive_search and simulate_model commands;
fitPerformance.log file contains the logs of the parameter estimation for each model structure considered during the execution of the exhaustive_search command; in particular, the evolution of the objective function value through iterations is included in the log;
Models.out file contains the list of the models considered during the execution of the exhaustive_search command; each model with fitted parameter values is written in the process-based formalism and can be used as a complete model specification in future ProBMoT runs.

ProBMoT Settings

The <settings> element allows for changing the settings of the three main ProBMoT components: the simulator, the parameter estimator and the model evaluator.

The first group of simulator settings, <initialValuesSpec>, specifies how the initial values of the model state variables are obtained. The first alternative (attribute sameforalldatasets=True) is to use the initial values of the model state/endogenous variables as specified in the model, while the second (attribute usedatasetvalues=True) is to use the initial values from the data sets. Both Boolean attributes are optional; by default, the initial values specified in the model are used.

The other simulator settings are parameters of the CVODE solver for ordinary differential equations (part of the SUNDIALS suite). In particular, ProBMoT uses the Backward Differentiation Formula linear multistep solver combined with Newton's method and SPGMR to simulate process-based models. Users can set the absolute tolerance, relative tolerance and maximal number of steps using the <abstol>, <reltol> and <steps> elements, respectively.

ProBMoT uses Differential Evolution (DE, as implemented in the JMetal suite) for parameter estimation: the parameter estimator uses the DE global optimization method to find values of the model parameters that minimize the objective function measuring the discrepancy between the model simulation and the observed (data sets) and/or desired system behavior. The <evaluation> element specifies how many evaluations of the objective function are performed per model parameter, <population> specifies the population size, <strategy> the DE strategy, <Cr> the crossover probability and <F> the differential weight.

By default, the fitter uses the sum of root mean squared errors as the objective function: it measures the discrepancy between the simulated model behavior and the observed system behavior. The user can select a different objective function using the element <objectives>. ProBMoT contains a set of predefined objective functions, which includes the root mean squared error (RMSEMultiDataset) and the relative root mean squared error (RelativeRMSEObjectiveFunctionMultiDataset). Note also that ProBMoT can be easily extended with custom, user-defined objective functions by providing a Java class that extends fit.objective.TrajectoryObjectiveFunction and implements a constructor with parameters (List<Dataset> measured, BiMap<String, String> outsToCols). The class should be in the package fit.objective.

Finally, the third group of settings is related to model evaluation. By default, ProBMoT evaluates models on the training data set(s). However, the user can specify alternative evaluation scenarios that include separate training and test data sets; these can be specified using the <train> and <test> elements that refer to the data set IDs as specified in the <d> elements introduced above.
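
The roles of the population size, F and Cr can be seen in a minimal pure-Python sketch of one common DE variant (rand/1/bin with in-place replacement). This is not the JMetal implementation that ProBMoT uses; the function name, the default values and the clipping of trial values to the parameter ranges are assumptions made for illustration.

```python
import random

def differential_evolution(objective, bounds, evaluations_per_parameter=1000,
                           population=20, F=0.5, Cr=0.9, seed=0):
    """Minimize objective(list_of_floats) within bounds [(low, high), ...]."""
    rng = random.Random(seed)
    dim = len(bounds)
    budget = evaluations_per_parameter * dim  # total objective evaluations
    pop = [[rng.uniform(low, high) for low, high in bounds]
           for _ in range(population)]
    scores = [objective(individual) for individual in pop]
    used = population
    while used < budget:
        for i in range(population):
            if used >= budget:
                break
            # Three distinct individuals, all different from the current one.
            a, b, c = rng.sample([j for j in range(population) if j != i], 3)
            forced = rng.randrange(dim)  # at least one coordinate is mutated
            trial = []
            for k, (low, high) in enumerate(bounds):
                if k == forced or rng.random() < Cr:
                    # Mutation scaled by the differential weight F.
                    value = pop[a][k] + F * (pop[b][k] - pop[c][k])
                    value = min(max(value, low), high)  # clip to the range
                else:
                    value = pop[i][k]
                trial.append(value)
            score = objective(trial)
            used += 1
            if score <= scores[i]:  # greedy selection, replaced in place
                pop[i], scores[i] = trial, score
    best = min(range(population), key=scores.__getitem__)
    return pop[best], scores[best]
```

In a parameter estimation setting, objective would simulate the model with the candidate parameter values and return the discrepancy from the measured data, and bounds would come from the ranges of the constants being fitted.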
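
For orientation, the two error measures can be sketched in Python as follows. These are textbook definitions written for illustration and are not taken from the ProBMoT source: in particular, normalizing the relative error by the spread of the measured values around their mean is an assumption, and the function names are hypothetical.

```python
import math

def rmse(simulated, measured):
    """Root mean squared error between two equally long trajectories."""
    squared = sum((s - m) ** 2 for s, m in zip(simulated, measured))
    return math.sqrt(squared / len(measured))

def relative_rmse(simulated, measured):
    """RMSE relative to the spread of the measurements (assumed definition).

    A value of 1.0 matches a model that always predicts the measured mean.
    """
    mean = sum(measured) / len(measured)
    spread = sum((m - mean) ** 2 for m in measured)
    squared = sum((s - m) ** 2 for s, m in zip(simulated, measured))
    return math.sqrt(squared / spread)

def sum_of_rmse(pairs):
    """Sum the RMSE over several (simulated, measured) trajectory pairs."""
    return sum(rmse(simulated, measured) for simulated, measured in pairs)
```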

Starting ProBMoT

ProBMoT requires Java JDK 7 or greater in order to run correctly. The path to the Java compiler contained in the JDK distribution should be included in the PATH system variable.

After preparing the necessary input files, ProBMoT is run from the directory with the probmot-1.2.jar file using the command

java -jar probmot-1.2.jar TASK-SPECIFICATION

where TASK-SPECIFICATION is the path to the task specification file of interest. ProBMoT reports basic information about its progress on the standard output; the output files are written to the directory specified in the task specification file (see the description in the previous section).
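
When ProBMoT runs are scripted, the same command can be launched from Python with the standard subprocess module. The sketch below only wraps the command shown above; the helper names and the file name task.xml in the usage note are hypothetical.

```python
import subprocess

def probmot_command(task_file, jar="probmot-1.2.jar"):
    """Build the argument list for the command shown above."""
    return ["java", "-jar", jar, task_file]

def run_probmot(task_file, jar_dir="."):
    """Run ProBMoT from the directory containing the jar file.

    Raises subprocess.CalledProcessError if Java exits with an error.
    """
    return subprocess.run(probmot_command(task_file), cwd=jar_dir, check=True)
```

For example, run_probmot("task.xml", jar_dir="/path/to/probmot") starts a run from the given directory, so relative paths in the task specification are resolved against it.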

References

[1] Bridewell W, Langley P, Todorovski L, Dzeroski S (2008) Inductive process modeling. Machine Learning, 71: 1-32. Available at http://cll.stanford.edu/~willb/publications/bridewell08ML.pdf
[2] Čerepnalkoski D (2013) Process-Based Models of Dynamical Systems: Representation and Induction. Doctoral Dissertation. Available at http://probmot.ijs.si/pubs/Darko_Cerepnalkoski_PhD.pdf
[3] Dzeroski S, Todorovski L (2008) Equation discovery for systems biology: finding the structure and dynamics of biological networks from time course data. Current Opinion in Biotechnology, 19: 360-368. Available at http://www.sciencedirect.com/science/article/pii/S0958166908000839
[4] Tanevski J, Todorovski L, Kalaidzidis Y, Dzeroski S (2015) Domain-specific model selection for structural identification of the Rab5-Rab7 dynamics in endocytosis. BMC Systems Biology, 9:31. Available at http://bmcsystbiol.biomedcentral.com/articles/10.1186/s12918-015-0175-x
[5] Tanevski J, Todorovski L, Dzeroski S (2016) Learning stochastic process-based models of dynamical systems from knowledge and data. BMC Systems Biology, 10:30. Available at http://bmcsystbiol.biomedcentral.com/articles/10.1186/s12918-016-0273-4
