# Design Notes on Subsample As Parameter problem


### Status: completed

The task is completed; the notes below are kept just in case.

### Problem Scope

**These are design notes; they are sketchy and may be incorrect, feel free to change them.**

Currently we have one special model parameter: the subsample number (a.k.a. member or replica). It is created by the runtime as an integer in [0, N-1], where N is the number of subsamples specified as a run option:

```
model.exe -General.Subsamples 16
```

The subsample number plays a fundamental role in the calculation of model Output Expressions. It is the only parameter used to calculate average (CV, SE, SD and all other) output values. For example, if the model runs with 16 subsamples then it produces 16 values for each output accumulator, and the output expression value is an average of those 16 accumulators across subsamples.
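A minimal sketch of that aggregation step, for illustration only. The struct and function names are hypothetical, and the choice of the sample (n - 1) denominator for SD is an assumption; the actual output expressions are calculated in SQL and may use different formulas.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Summary statistics of one output accumulator across subsamples.
// Names and formulas are illustrative assumptions, not the runtime's own.
struct ExprStats {
    double mean; // average of accumulator values
    double sd;   // sample standard deviation (n - 1 denominator)
    double se;   // standard error of the mean: sd / sqrt(n)
    double cv;   // coefficient of variation in percent: 100 * sd / mean
};

// acc[i] is the accumulator value produced by subsample i.
ExprStats aggregateAcrossSubsamples(const std::vector<double>& acc)
{
    assert(acc.size() >= 2); // SD is undefined for fewer than two values
    const double n = static_cast<double>(acc.size());

    double sum = 0.0;
    for (double v : acc) sum += v;
    const double mean = sum / n;

    double sq = 0.0;
    for (double v : acc) sq += (v - mean) * (v - mean);
    const double sd = std::sqrt(sq / (n - 1.0));

    return { mean, sd, sd / std::sqrt(n), 100.0 * sd / mean };
}
```

With 16 subsamples, `acc` would hold 16 values and the function would be called once per output accumulator cell. The CV is not meaningful when the mean is zero; a real implementation would need to handle that case.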

It may not always be necessary to have the subsample number as a special parameter; any other model parameter, or set of parameters, which varies between model runs could play that role. Output expressions could then be calculated as an average (CV, SD, etc.) across any parameter values. However, such a "demotion of the subsample number" is quite a significant change in the model runtime.

Currently the model run cycle looks like this (extremely simplified):

- start model.exe and connect to database
- read all model parameters
- create modeling threads for each model subsample
- run modeling threads: do simulation
- write output accumulators for each subsample to the database
- wait until all subsamples are done (wait for exit from all modeling threads)
- calculate output expression values as average (CV, SE, SD, etc.) of accumulators across subsamples
- report on simulation success and exit from model main
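The cycle above can be sketched as follows. This is a simplified illustration, not the runtime's code: `simulate` is a hypothetical stand-in for the model, the database steps are omitted, and accumulators are kept in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the simulation: returns one accumulator value
// for the given subsample number. A real model would run its event loop here.
double simulate(int subNumber)
{
    return 100.0 + subNumber;
}

// Run one modeling thread per subsample, wait for all of them,
// then return the average of the accumulators (the "output expression").
double runModel(int subCount)
{
    assert(subCount > 0);
    std::vector<double> acc(static_cast<std::size_t>(subCount), 0.0);
    std::vector<std::thread> threads;
    threads.reserve(acc.size());

    for (int sub = 0; sub < subCount; ++sub) {
        // each thread writes only its own slot, so no locking is needed
        threads.emplace_back([&acc, sub] {
            acc[static_cast<std::size_t>(sub)] = simulate(sub);
        });
    }
    for (std::thread& t : threads) t.join(); // wait until all subsamples are done

    double sum = 0.0;
    for (double v : acc) sum += v;
    return sum / subCount;
}
```

Here the subsample number is the only value that differs between threads, which is exactly the special treatment these notes discuss.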

If we decide to "demote subsample", or to put it differently "generalize parameters", then the modeling cycle could look like:

- use some external utility to create a modeling task and prepare a set of input parameters (see Model Run: How to Run the Model)
- (optional) specify a runtime expression to vary some model parameters, e.g. the subsample number parameter
- run the model until the modeling task is completed (until all input is processed) and write all accumulators into the database
- use some external utility to calculate output expressions as average (CV, SE, SD, etc.) across any parameter(s)
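To illustrate the last step, here is a hedged sketch of averaging accumulators grouped by an arbitrary parameter instead of by subsample number. The `AccRow` layout and the integer `paramIndex` key are assumptions made for the example; they do not describe the actual database schema.

```cpp
#include <cassert>
#include <map>
#include <vector>

// One stored accumulator value, tagged with the index of the parameter value
// it belongs to (for example, the index of a taxation level within the task).
// Both field names are hypothetical.
struct AccRow {
    int paramIndex; // which value of the grouping parameter produced this row
    double value;   // accumulator value
};

// Average accumulator values within each distinct paramIndex.
std::map<int, double> averageByParameter(const std::vector<AccRow>& rows)
{
    std::map<int, double> sum;
    std::map<int, int> count;
    for (const AccRow& r : rows) {
        sum[r.paramIndex] += r.value;
        ++count[r.paramIndex];
    }
    std::map<int, double> avg;
    for (const auto& kv : sum) avg[kv.first] = kv.second / count[kv.first];
    return avg;
}
```

An integer index is used as the grouping key rather than the parameter value itself, to avoid comparing floating-point values for equality.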

**Questions and problems:**

1. How to specify model parameter generators (how to calculate model parameters at runtime). Now we have ompp code translated into c++ by the omc compiler to produce all derived (model-generated) parameters. It is not dynamic enough: we don't want to, and should not have to, re-compile the model to specify parameter generator(s). We also have a primitive subsample number parameter generator, which produces [0, N-1]. Such primitive for-loop generators may be good in many situations but are not enough.

Is it enough to have the ability in the model runtime to specify for-loop parameter generator(s) and rely on external utilities (i.e. use our R package) to create more complex modeling tasks?

2. Output expression calculations. Now we use SQL to calculate averages and, in fact, that SQL allows almost arbitrary calculations, but it aggregates only across the subsample number.

How to generalize the SQL to aggregate across any parameter values, not only the subsample number? Do we need to replace SQL with c++ code in the model runtime? Do we need to create a separate "db_aggregator" utility instead of using the model?

3. How to specify parameter generators and output expressions to make them powerful enough and avoid re-inventing R (Octave, Matlab, SPSS, SAS)?
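As a concrete reference point for question 1, a "for-loop" generator could be as small as the sketch below. `RangeGenerator` and `generateValues` are hypothetical names introduced for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// A "for-loop" parameter generator: first value, last value (inclusive), step.
struct RangeGenerator {
    double first;
    double last;
    double step;
};

// Produce all values of the generator.
std::vector<double> generateValues(const RangeGenerator& g)
{
    assert(g.step > 0.0 && g.last >= g.first);
    // Count the steps once and compute each value from its index, to avoid
    // accumulating floating-point error from repeated "+= step".
    // The small epsilon guards against (last - first) / step landing just below an integer.
    const long n = static_cast<long>(std::floor((g.last - g.first) / g.step + 1e-9));
    std::vector<double> values;
    values.reserve(static_cast<std::size_t>(n) + 1);
    for (long i = 0; i <= n; ++i) values.push_back(g.first + i * g.step);
    return values;
}
```

Under these assumptions the existing subsample number generator for 16 subsamples would be `RangeGenerator{0, 15, 1}`. Anything beyond such ranges (random draws, dependent parameters) is where the question of delegating to external tools starts.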

### Example of the problem

Let's assume some hypothetical model with the following input parameters:

- population by age and sex
- taxation level
- election outcome
- workforce strike longevity
- random generator seed

The model output value is household income.

Model input parameters can be divided into the following categories:

- "constant": parameters whose values are known and do not change during modeling
  - population: current and projected values are assumed to be well known and fixed for our model

- "variable": parameter(s) which the user wants to change to study the effect on modeling output results
  - taxation level: varies from 1% to 80% with a 0.1% step

- "uncertainty": parameters whose values are random
  - election outcome parameter: Bernoulli distribution (binary) with mean = 0.6
  - workforce strike: Poisson distribution with rate = 4
  - random number generator seed

In order to study the taxation level effect, the user runs the model about 800 times, each with a different tax percent input value, and calculates about 800 average household income output values. Each output income value is an average of 32 "accumulator" values. Each "accumulator" value is a household income value produced by the model for a specific combination of "uncertainty" parameters:

```
// create 32 input tuples of uncertainty parameters
//
int setId = database.CreateWorkset();   // input set of uncertainty parameters

bool isBluePartyWin = false;    // election results: win of "blue" or "red" party
double strikeDays = 7.5;        // number of strike days per year
int randomSeed = 12345;         // random number generator seed

for (int k = 0; k < 32; k++) {
  isBluePartyWin = Bernoulli(0.6);
  strikeDays = SumOf_Poisson(4.0);
  randomSeed++;
  // write "uncertainty" parameters into database input set: tuple number = k
  database.WriteParameters(setId, k, isBluePartyWin, strikeDays, randomSeed);
}

// run the model once for each taxation level from 1% to 80%
//
for (double tax = 1; tax <= 80; tax += 0.1) {
  model.exe -Parameter.Taxation tax -UncertaintyParameters setId
}

//
// plot output household income depending on taxation level
//
```

The pseudocode above can be implemented in Perl, R or as a shell script. Also, openM++ already supports Modeling Tasks, which allow submitting multiple inputs to the model and varying parameter value(s) in a way similar to the example above.
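The tuple generation part of the pseudocode could also be written in C++ with the standard `<random>` library, as in the sketch below. It is an illustration under assumptions: the struct and function names are hypothetical, strike days are drawn from a single Poisson(4) variate rather than the `SumOf_Poisson` helper, and nothing is written to a database.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// One tuple of "uncertainty" parameters; field names follow the pseudocode above.
struct UncertaintyTuple {
    bool isBluePartyWin; // election outcome
    double strikeDays;   // number of strike days per year
    int randomSeed;      // seed passed to the model for this tuple
};

// Generate `count` tuples. generatorSeed seeds the generator used to draw the
// tuples themselves; firstModelSeed is the model seed stored in tuple 0.
std::vector<UncertaintyTuple> makeTuples(int count, unsigned generatorSeed, int firstModelSeed)
{
    assert(count > 0);
    std::mt19937 rng(generatorSeed);
    std::bernoulli_distribution election(0.6);  // mean = 0.6
    std::poisson_distribution<int> strike(4.0); // rate = 4

    std::vector<UncertaintyTuple> tuples;
    tuples.reserve(static_cast<std::size_t>(count));
    for (int k = 0; k < count; ++k) {
        const bool win = election(rng);
        const double days = static_cast<double>(strike(rng));
        tuples.push_back({ win, days, firstModelSeed + k });
    }
    return tuples;
}
```

Calling `makeTuples(32, 1u, 12345)` would produce the 32 tuples of the example. The drawn values depend on the standard library implementation, so reproducing a run across platforms would require storing the tuples, not just the generator seed.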

### Solution overview

OpenM++ already has most of the components required for our solution.

The following can be done to solve the problem from the example above:

1. **Use existing:** R API to create a Modeling Task with about 800 values of the taxation level parameter.

2. **Add new:** Create tools to generate uncertainty parameters. These can be command-line utilities, GUI tool(s) or part of the model runtime. The last option would allow us to reuse existing c++ code.

3. **Add new:** Change the database schema in order to store tuples of uncertainty parameters as part of model run input. Currently the model uses only a single input set of parameters (workset) with a single value of each parameter. We need to change the database schema and model run initialization (the search for input parameters in the database) in order to supply all 32 tuples of uncertainty parameters for every model run.

4. **Add new:** Change parameter memory management in order to provide a unique value of each uncertainty parameter to each modeling thread. Now all parameters have only one copy of their values, shared between all subsamples (threads and processes); only the subsample number is unique and not shared between threads (see model run on single computer). With the new runtime we need to make sure that only "constant" and "variable" parameters (like population and taxation level above) are shared, and that "uncertainty" parameters (election outcome, strike, random seed) are unique for each thread.

5. **Add new:** If the model runs on an MPI cluster, where there are multiple modeling processes, we need to correctly supply unique values of all uncertainty parameters to each process. Now only the subsample number is unique.

6. **Add new:** Change the database schema, similar to (3) above, for model run parameters. A model run contains a full copy of the input parameters. Today that is only one value for each parameter, and we need to change it in order to store all 32 tuples of uncertainty parameters in the model run results.

7. **Use existing:** Model Output Expressions for aggregation of output results. No changes required. We do not yet have capabilities to compare model run results similar to what ModgenWeb does, but this is out of the problem scope.
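The memory layout described in item 4 might look like the sketch below: one shared, read-only copy of "constant" and "variable" parameters, plus a private tuple of "uncertainty" parameters per modeling thread. All type and function names are hypothetical, and the sketch ignores how the runtime actually stores parameters.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// "Constant" and "variable" parameters: one copy, shared read-only by all threads.
struct SharedParams {
    std::vector<double> population; // by age and sex, flattened
    double taxation;                // taxation level, percent
};

// "Uncertainty" parameters: one tuple per modeling thread.
struct UncertaintyTuple {
    bool isBluePartyWin;
    double strikeDays;
    int randomSeed;
};

// What a single modeling thread sees.
struct ThreadParams {
    const SharedParams& shared;   // reference to the shared copy, not a copy
    UncertaintyTuple uncertainty; // private copy for this thread
    int subId;                    // uncertainty tuple number
};

// Select the parameters for the thread which processes tuple number subId.
ThreadParams paramsForThread(
    const SharedParams& shared, const std::vector<UncertaintyTuple>& tuples, int subId)
{
    assert(subId >= 0 && static_cast<std::size_t>(subId) < tuples.size());
    return ThreadParams{ shared, tuples[static_cast<std::size_t>(subId)], subId };
}
```

Keeping the shared part behind a `const` reference avoids duplicating large parameters such as population for every thread, while each thread still gets its own election outcome, strike days and seed. The MPI case from item 5 would additionally need the tuples distributed between processes, which is not shown here.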

We can split the implementation into two steps:

- First, make all necessary runtime changes (items 3, 4, 5 and 6 above). That would allow us to run the model with uncertainty parameters created by external tools, for example by R.
- Second, implement "parameter generators" (item 2 above) to make it convenient for the model user.

During this two-step process it is also necessary to implement some compatibility logic to supply the parameter "Subsample" in order to keep existing models working.

**Note:**
We should also resolve the ambiguity of the term "subsample", inherited from Modgen. It can be a model integer parameter named "Subsample", in which case it is the same as any other model parameter and no special meaning or treatment is required. It can also be used as the "uncertainty tuple number", which does not necessarily have to be exposed to modeling code: it can be internal to the model runtime and visible in the database schema as `sub_id`, used to order accumulator values and make them comparable between model runs.
