Model Output Expressions

Sub-values: sub-samples, members, replicas

The terms "simulation member", "replica" and "sub-sample" are often used interchangeably in micro-simulation conversations, depending on context. To avoid terminology discussions, openM++ uses "sub-value" as the equivalent of all of the above; some older pages of this wiki may still say "sub-sample" instead.

Model output tables: sub-values, accumulators and expressions

There are two kinds of model output tables:

  • accumulators table: output sub-values (similar to Modgen sub-samples)
  • expressions table: model output values calculated as accumulators aggregated across sub-values (e.g. mean, CV or SE)

All output accumulator tables always contain the same number of sub-values. For example, the model run:

model.exe -OpenM.Subvalues 16

will create 16 sub-values for each accumulator in each output accumulator table.

Sub-values (accumulators) output tables

During the simulation an OpenM++ model collects the results in "accumulators" and, at the end, writes them into the output accumulator table(s). Each output accumulator table contains the results of model execution for all sub-values.

For example:

Model output table "Salary by Sex" has two accumulators and two dimensions:

  • salary: 0 = "Low", 1 = "Medium", 2 = "High"
  • sex: 0 = "Female", 1 = "Male"

If we run that model twice, the first time with one sub-value and the second time with eight sub-values, then the output results may look like:

SELECT 
  run_id, dim0, dim1, acc_id, sub_id, acc_value
FROM modelone_201208171604590148_a0_salarySex
ORDER BY 1, 2, 3, 4, 5;


run_id  dim0    dim1    acc_id  sub_id  acc_value
------  ------  ------  ------  ------  ----------
11      0       0       0       0       50.0
11      0       0       1       0       1.0
11      0       1       0       0       60.0
11      0       1       1       0       2.0
11      1       0       0       0       51.6
11      1       0       1       0       2.0
11      1       1       0       0       62.0
11      1       1       1       0       3.0
11      2       0       0       0       53.2
11      2       0       1       0       3.0
11      2       1       0       0       64.0
11      2       1       1       0       4.0
12      0       0       0       0       50.0
12      0       0       0       1       100.0
12      0       0       0       2       150.0
12      0       0       0       3       200.0
12      0       0       0       4       250.0
12      0       0       0       5       300.0
12      0       0       0       6       350.0
12      0       0       0       7       400.0
12      0       0       1       0       1.0
....more results....
12      2       1       1       7       11.0

Columns are:

  • run_id: unique id of that model run; all model input parameters and output results can be found by run_id;
  • dim0: salary dimension items;
  • dim1: sex dimension items;
  • acc_id: zero-based accumulator number;
  • sub_id: zero-based sub-value number;
  • acc_value: accumulator value;

Accumulators are low-level simulation results and are useful mostly for analyzing the simulation model itself.
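
To make the long layout of the accumulator table concrete, here is a minimal Python sketch using the standard sqlite3 module. It loads a few rows copied from the listing above into an in-memory table and collects the sub-values of one accumulator for one table cell. The table name salarySex_acc is invented for this sketch and is not a real openM++ table name.

```python
import sqlite3

# Hypothetical in-memory copy of a few rows from the listing above;
# the table name "salarySex_acc" is invented for this sketch.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE salarySex_acc ("
    " run_id INT, dim0 INT, dim1 INT, acc_id INT, sub_id INT, acc_value FLOAT)"
)
rows = [
    (12, 0, 0, 0, 0, 50.0),
    (12, 0, 0, 0, 1, 100.0),
    (12, 0, 0, 0, 2, 150.0),
    (12, 0, 0, 0, 3, 200.0),
    (12, 0, 0, 0, 4, 250.0),
    (12, 0, 0, 0, 5, 300.0),
    (12, 0, 0, 0, 6, 350.0),
    (12, 0, 0, 0, 7, 400.0),
]
con.executemany("INSERT INTO salarySex_acc VALUES (?, ?, ?, ?, ?, ?)", rows)

# Collect the sub-values of accumulator 0 for one cell
# (run 12, salary = "Low", sex = "Female"), ordered by sub-value number.
sub_values = [
    v for (v,) in con.execute(
        "SELECT acc_value FROM salarySex_acc"
        " WHERE run_id = 12 AND dim0 = 0 AND dim1 = 0 AND acc_id = 0"
        " ORDER BY sub_id"
    )
]
print(len(sub_values), sub_values[0], sub_values[-1])
```

Each (run_id, dim0, dim1, acc_id) combination therefore owns one value per sub-value, which is the list that output expressions aggregate over.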

Aggregated values output tables

On top of the accumulator values for each sub-value, a model can produce more meaningful output results by using OpenM++ output expressions, e.g. the average value across all sub-values. To do that, the model developer (or model user) can specify an output aggregation expression; for example, the average value is: OM_AVG(acc0).

Each "value" output table can contain a practically unlimited number of aggregation expressions. Each expression must include aggregation function(s) with accumulators as argument(s) and, optionally, other arithmetic operators and basic SQL functions, such as ABS or SQRT.

The following OpenM++ sub-value aggregation functions are supported:

  • OM_COUNT(...expr...) - count of values across all sub-values:
    COUNT(...expr...)
  • OM_SUM(...expr...) - sum of values across all sub-values:
    SUM(...expr...)
  • OM_AVG(...expr...) - average value over sub-values:
    AVG(...expr...)
  • OM_VAR(...expr...) - variance over sub-values:
    SUM( (...expr... - AVG(...expr...)) * (...expr... - AVG(...expr...)) )
    / 
    ( COUNT(...expr...) - 1 )
  • OM_SD(...expr...) - standard deviation:
    SQRT(OM_VAR(...expr...))
  • OM_SE(...expr...) - standard error:
    SQRT(OM_VAR(...expr...) / COUNT(...expr...))
  • OM_CV(...expr...) - coefficient of variation:
    100 * ( OM_SD(...expr...) / AVG(...expr...) )
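
The definitions in this list can be sketched in plain Python as functions over the list of sub-values of one accumulator. This is only an illustration of the formulas, not the openM++ implementation: the function names are lower-case stand-ins, None plays the role of SQL NULL, and edge cases such as an empty list or a single sub-value are not handled.

```python
import math

def non_null(values):
    """Keep only non-NULL (not None) sub-values, as SQL aggregates do."""
    return [v for v in values if v is not None]

def om_count(values):
    return len(non_null(values))

def om_sum(values):
    return sum(non_null(values))

def om_avg(values):
    xs = non_null(values)
    return sum(xs) / len(xs)

def om_var(values):
    # Sample variance: squared deviations from the mean over (count - 1).
    xs = non_null(values)
    mean = sum(xs) / len(xs)
    return sum((x - mean) * (x - mean) for x in xs) / (len(xs) - 1)

def om_sd(values):
    return math.sqrt(om_var(values))

def om_se(values):
    return math.sqrt(om_var(values) / om_count(values))

def om_cv(values):
    return 100 * (om_sd(values) / om_avg(values))

# Hypothetical accumulator values for four sub-values of one table cell.
acc0 = [1.0, 2.0, 3.0, 4.0]
print(om_count(acc0), om_sum(acc0), om_avg(acc0))
print(om_var(acc0), om_sd(acc0), om_se(acc0), om_cv(acc0))
```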

An aggregation expression can be more complex than a single function. For example, OM_SUM(acc0) / OM_COUNT(acc0) is equivalent to OM_AVG(acc0), and OM_SD(acc1) can be written as:

SQRT(OM_SUM( (acc1 - OM_AVG(acc1)) * (acc1 - OM_AVG(acc1)) ) / ( OM_COUNT(acc1) - 1 ))

As you can see, it is possible to combine and nest aggregation functions in an expression.

It is important to understand:

  1. openM++ aggregates across the sub-values; in other words, COUNT() is (almost) always the number of sub-values.
  2. aggregation is done by the underlying SQL database, so only non-NULL accumulator values are aggregated and COUNT() is the number of non-NULL accumulator values across sub-values.
  3. accumulators must always be inside some aggregation function, e.g. this is an error: acc0 + OM_SUM(acc1), because acc0 is not aggregated.
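
The effect of NULL accumulator values on COUNT() and AVG() can be checked with a small sketch in Python and SQLite. The table demo_sub and its contents are invented for this illustration; other database vendors are expected to behave the same way for these standard aggregates, but only SQLite is exercised here.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical one-accumulator table, invented for this sketch.
con.execute("CREATE TABLE demo_sub (sub_id INT NOT NULL, acc0 FLOAT NULL)")
con.executemany(
    "INSERT INTO demo_sub VALUES (?, ?)",
    [(0, 1.0), (1, 2.0), (2, None), (3, 4.0)],  # sub-value 2 is NULL
)

cnt_rows, cnt_acc, avg_acc = con.execute(
    "SELECT COUNT(*), COUNT(acc0), AVG(acc0) FROM demo_sub"
).fetchone()

# COUNT(*) sees all 4 sub-values, COUNT(acc0) only the 3 non-NULL ones,
# and AVG(acc0) divides the sum by 3, not by 4.
print(cnt_rows, cnt_acc, avg_acc)
```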

If you want to aggregate simulation results in your own way, it is always possible to combine openM++ and standard SQL functions in a custom expression. For example, if the sub-values of your model are parts of a large population, then you may want to collect the count and the sum in separate accumulators and, instead of OM_AVG(...), use a custom average expression, like:

OM_SUM(acc0) / OM_SUM(acc1)
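
A short Python sketch with made-up numbers shows why this ratio of sums can differ from a plain average over sub-values. It assumes acc0 holds a per-sub-value sum and acc1 the matching count; both lists are hypothetical.

```python
# Hypothetical data: each sub-value simulates one part of the population.
# acc0 holds the sum of salaries, acc1 the number of persons in that part.
acc0 = [10.0, 40.0]
acc1 = [2.0, 4.0]

# OM_SUM(acc0) / OM_SUM(acc1): pooled mean over the whole population.
pooled_mean = sum(acc0) / sum(acc1)

# For comparison: the average of the per-sub-value means, which weights
# every part equally regardless of its size.
mean_of_means = sum(s / n for s, n in zip(acc0, acc1)) / len(acc0)

print(pooled_mean, mean_of_means)
```

When the parts have different sizes the two numbers differ, and the pooled mean is the one that describes the combined population.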

Examples of aggregation expressions

OpenM++ output table expressions are translated into SQL aggregation queries. For example, if we have the accumulator table:

CREATE TABLE out4_sub
(
  run_id    INT        NOT NULL,
  dim0      INT        NOT NULL,
  dim1      VARCHAR(8) NOT NULL,
  sub_id    INT        NOT NULL,
  acc0      FLOAT NULL,
  PRIMARY KEY (run_id, dim0, dim1, sub_id)
);

SELECT run_id, dim0, dim1, sub_id, acc0 FROM out4_sub ORDER BY run_id, dim0, dim1 DESC;

run_id  dim0      dim1     sub_id   acc0
------  ----      ----     ------   ----
2        10        M        0        1
2        10        M        1        2
2        10        M        2        3
2        10        M        3        4
2        10        F        0        1.5
2        10        F        1        2.5
2        10        F        2        3.5
2        10        F        3        4.5
2        20        M        0        10
2        20        M        1        20
2        20        M        2        30
2        20        M        3        40
2        20        F        0        10.5
2        20        F        1        20.5
2        20        F        2        30.5
2        20        F        3        40.5
3        10        M        0        5
3        10        M        1        6
3        10        F        0        7
3        10        F        1        8
3        20        M        0        50
3        20        M        1        60
3        20        F        0        70
3        20        F        1        80

Please keep in mind: this is a simplified example; in a real openM++ database, sub-value tables look as described at the top of the article.

Then the following results would be produced by openM++ aggregation functions:

Count, Average, Sum, Min and Max:

SELECT 
  S.run_id, S.dim0, S.dim1,
  COUNT(S.acc0) AS "cnt",
  AVG(S.acc0)   AS "avg",
  SUM(S.acc0)   AS "sum",
  MIN(S.acc0)   AS "min",
  MAX(S.acc0)   AS "max"
FROM out4_sub S
GROUP BY S.run_id, S.dim0, S.dim1
ORDER BY S.run_id, S.dim0, S.dim1 DESC;

run_id dim0   dim1   cnt     avg      sum      min    max
------ ----   ----   ---     ---      ---      ---    ---
2       10     M      4       2.5      10       1       4
2       10     F      4       3        12       1.5     4.5
2       20     M      4      25       100      10      40
2       20     F      4      25.5     102      10.5    40.5
3       10     M      2       5.5      11       5       6
3       10     F      2       7.5      15       7       8
3       20     M      2      55       110      50      60
3       20     F      2      75       150      70      80

Count, Average and Variance:

SELECT 
  S.run_id, S.dim0, S.dim1,
  COUNT(S.acc0) AS "cnt",
  AVG(S.acc0)   AS "avg",
  SUM(
    (S.acc0 - (SELECT AVG(VM1.acc0) FROM out4_sub VM1 WHERE VM1.run_id = S.run_id AND VM1.dim0 = S.dim0 AND VM1.dim1 = S.dim1) ) * 
    (S.acc0 - (SELECT AVG(VM2.acc0) FROM out4_sub VM2 WHERE VM2.run_id = S.run_id AND VM2.dim0 = S.dim0 AND VM2.dim1 = S.dim1) )
  ) / 
  ( (SELECT COUNT(VC1.acc0) FROM out4_sub VC1 WHERE VC1.run_id = S.run_id AND VC1.dim0 = S.dim0 AND VC1.dim1 = S.dim1) - 1) AS "var"
FROM out4_sub S
GROUP BY S.run_id, S.dim0, S.dim1
ORDER BY S.run_id, S.dim0, S.dim1 DESC;

run_id dim0     dim1    cnt     avg      var
------ ----     ----    ---     ---      ---
2       10       M       4       2.5      1.66666666666667
2       10       F       4       3        1.66666666666667
2       20       M       4      25      166.666666666667
2       20       F       4      25.5    166.666666666667
3       10       M       2       5.5      0.5
3       10       F       2       7.5      0.5
3       20       M       2      55       50
3       20       F       2      75       50
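
The variance figures can be reproduced outside a full database setup. The sketch below, in Python with the standard sqlite3 module, loads the sample out4_sub rows into an in-memory SQLite database, runs a query of the same shape as the one above, and cross-checks every cell against statistics.variance. One simplification is assumed: the count subquery is replaced by COUNT(S.acc0), which gives the same number inside each group.

```python
import sqlite3
import statistics

# Sample accumulator values from the listing above, keyed by (run_id, dim0, dim1).
data = {
    (2, 10, "M"): [1, 2, 3, 4],
    (2, 10, "F"): [1.5, 2.5, 3.5, 4.5],
    (2, 20, "M"): [10, 20, 30, 40],
    (2, 20, "F"): [10.5, 20.5, 30.5, 40.5],
    (3, 10, "M"): [5, 6],
    (3, 10, "F"): [7, 8],
    (3, 20, "M"): [50, 60],
    (3, 20, "F"): [70, 80],
}

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE out4_sub ("
    " run_id INT NOT NULL, dim0 INT NOT NULL, dim1 VARCHAR(8) NOT NULL,"
    " sub_id INT NOT NULL, acc0 FLOAT NULL,"
    " PRIMARY KEY (run_id, dim0, dim1, sub_id))"
)
for (run_id, dim0, dim1), values in data.items():
    con.executemany(
        "INSERT INTO out4_sub VALUES (?, ?, ?, ?, ?)",
        [(run_id, dim0, dim1, sub_id, v) for sub_id, v in enumerate(values)],
    )

# Deviations from the per-cell mean via correlated subqueries, divided by
# (count - 1). COUNT(S.acc0) stands in for the count subquery (assumption).
var_sql = """
SELECT S.run_id, S.dim0, S.dim1,
  SUM(
    (S.acc0 - (SELECT AVG(V1.acc0) FROM out4_sub V1
               WHERE V1.run_id = S.run_id AND V1.dim0 = S.dim0 AND V1.dim1 = S.dim1)) *
    (S.acc0 - (SELECT AVG(V2.acc0) FROM out4_sub V2
               WHERE V2.run_id = S.run_id AND V2.dim0 = S.dim0 AND V2.dim1 = S.dim1))
  ) / (COUNT(S.acc0) - 1) AS var
FROM out4_sub S
GROUP BY S.run_id, S.dim0, S.dim1
"""
sql_var = {(r, d0, d1): v for r, d0, d1, v in con.execute(var_sql)}

# Cross-check each cell against the sample variance from the standard library.
for key, values in data.items():
    assert abs(sql_var[key] - statistics.variance(values)) < 1e-9
print(sql_var[(2, 10, "M")], sql_var[(3, 20, "F")])
```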

Count, Average and Standard Deviation:

SELECT 
  S.run_id, S.dim0, S.dim1,
  COUNT(S.acc0) AS "cnt",
  AVG(S.acc0)   AS "avg",
  SQRT(
    SUM(
      (S.acc0 - (SELECT AVG(SDM1.acc0) FROM out4_sub SDM1 WHERE SDM1.run_id = S.run_id AND SDM1.dim0 = S.dim0 AND SDM1.dim1 = S.dim1) ) * 
      (S.acc0 - (SELECT AVG(SDM2.acc0) FROM out4_sub SDM2 WHERE SDM2.run_id = S.run_id AND SDM2.dim0 = S.dim0 AND SDM2.dim1 = S.dim1) )
    ) / 
    ( (SELECT COUNT(SDC1.acc0) FROM out4_sub SDC1 WHERE SDC1.run_id = S.run_id AND SDC1.dim0 = S.dim0 AND SDC1.dim1 = S.dim1) - 1) 
  ) AS "sd"
FROM out4_sub S
GROUP BY S.run_id, S.dim0, S.dim1
ORDER BY S.run_id, S.dim0, S.dim1 DESC;

run_id dim0     dim1    cnt     avg     sd
------ ----     ----    ---     ---     --
2       10       M       4       2.5     1.29099444873581
2       10       F       4       3       1.29099444873581
2       20       M       4      25      12.9099444873581
2       20       F       4      25.5    12.9099444873581
3       10       M       2       5.5     0.707106781186548
3       10       F       2       7.5     0.707106781186548
3       20       M       2      55       7.07106781186548
3       20       F       2      75       7.07106781186548

Count, Average, and Standard Error:

SELECT 
  S.run_id, S.dim0, S.dim1,
  COUNT(S.acc0) AS "cnt",
  AVG(S.acc0)   AS "avg",
  SQRT(
    SUM(
      (S.acc0 - (SELECT AVG(SEM1.acc0) FROM out4_sub SEM1 WHERE SEM1.run_id = S.run_id AND SEM1.dim0 = S.dim0 AND SEM1.dim1 = S.dim1) ) * 
      (S.acc0 - (SELECT AVG(SEM2.acc0) FROM out4_sub SEM2 WHERE SEM2.run_id = S.run_id AND SEM2.dim0 = S.dim0 AND SEM2.dim1 = S.dim1) )
    ) / 
    ( (SELECT COUNT(SEC1.acc0) FROM out4_sub SEC1 WHERE SEC1.run_id = S.run_id AND SEC1.dim0 = S.dim0 AND SEC1.dim1 = S.dim1) - 1) /
    (SELECT COUNT(SEC2.acc0) FROM out4_sub SEC2 WHERE SEC2.run_id = S.run_id AND SEC2.dim0 = S.dim0 AND SEC2.dim1 = S.dim1)
  ) AS "se"
FROM out4_sub S
GROUP BY S.run_id, S.dim0, S.dim1
ORDER BY S.run_id, S.dim0, S.dim1 DESC;

run_id dim0     dim1    cnt     avg     se
------ ----     ----    ---     ---     --
2       10       M       4       2.5     0.645497224367903
2       10       F       4       3       0.645497224367903
2       20       M       4      25       6.45497224367903
2       20       F       4      25.5     6.45497224367903
3       10       M       2       5.5     0.5
3       10       F       2       7.5     0.5
3       20       M       2      55       5
3       20       F       2      75       5

Count, Average, and Coefficient of Variation:

SELECT 
  S.run_id, S.dim0, S.dim1,
  COUNT(S.acc0) AS "cnt",
  AVG(S.acc0)   AS "avg",
  100.0 * (
    SQRT(
      SUM(
        (S.acc0 - (SELECT AVG(CVM1.acc0) FROM out4_sub CVM1 WHERE CVM1.run_id = S.run_id AND CVM1.dim0 = S.dim0 AND CVM1.dim1 = S.dim1) ) * 
        (S.acc0 - (SELECT AVG(CVM2.acc0) FROM out4_sub CVM2 WHERE CVM2.run_id = S.run_id AND CVM2.dim0 = S.dim0 AND CVM2.dim1 = S.dim1) )
      ) / 
      ( (SELECT COUNT(CVC1.acc0) FROM out4_sub CVC1 WHERE CVC1.run_id = S.run_id AND CVC1.dim0 = S.dim0 AND CVC1.dim1 = S.dim1) - 1) 
    ) /
    (SELECT AVG(CVM3.acc0) FROM out4_sub CVM3 WHERE CVM3.run_id = S.run_id AND CVM3.dim0 = S.dim0 AND CVM3.dim1 = S.dim1)
  ) AS "cv"
FROM out4_sub S
GROUP BY S.run_id, S.dim0, S.dim1
ORDER BY S.run_id, S.dim0, S.dim1 DESC;

run_id dim0     dim1    cnt     avg     cv
------ ----     ----    ---     ---     --
2       10       M       4       2.5    51.6397779494322
2       10       F       4       3      43.0331482911935
2       20       M       4      25      51.6397779494322
2       20       F       4      25.5    50.6272332837571
3       10       M       2       5.5    12.8564869306645
3       10       F       2       7.5     9.42809041582064
3       20       M       2      55      12.8564869306645
3       20       F       2      75       9.42809041582063


SQL implementation details

In the previous section we used a simplified representation of the accumulator table and an SQL dialect which is not compatible across all vendors. The real SQL aggregation queries can be found in the expr_sql column of the table_expr metadata table. For example, if the source model expression is:

(OM_SUM(acc0) / OM_SUM(acc2))

then the result looks like:

SELECT 
  M1.run_id, M1.dim0, (SUM(M1.acc_value) / SUM(L1A2.acc2)) AS expr1 
FROM RiskPaths_201410071856440009_a2_T03_FertilityByAge M1 
INNER JOIN 
(
  SELECT run_id, dim0, sub_id, acc_value AS acc2 
  FROM RiskPaths_201410071856440009_a2_T03_FertilityByAge 
  WHERE acc_id = 2
) L1A2 
ON (L1A2.run_id = M1.run_id AND L1A2.dim0 = M1.dim0 AND L1A2.sub_id = M1.sub_id) 
WHERE M1.acc_id = 0 
GROUP BY M1.run_id, M1.dim0
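
To see how this join lines up accumulators by sub-value, the query can be run against a tiny stand-in table. The following Python sketch uses SQLite; the table name T03_acc and all values in it are invented for the illustration, and only the query structure is taken from the example above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical, shortened stand-in for the real accumulator table name.
con.execute(
    "CREATE TABLE T03_acc ("
    " run_id INT, dim0 INT, acc_id INT, sub_id INT, acc_value FLOAT)"
)
con.executemany(
    "INSERT INTO T03_acc VALUES (?, ?, ?, ?, ?)",
    [
        # acc_id 0, two sub-values
        (1, 0, 0, 0, 2.0),
        (1, 0, 0, 1, 4.0),
        # acc_id 2, two sub-values
        (1, 0, 2, 0, 10.0),
        (1, 0, 2, 1, 20.0),
    ],
)

# Rows with acc_id = 0 are joined to rows with acc_id = 2 on run, cell and
# sub-value, so both sums run over the same set of sub-values.
expr_sql = """
SELECT M1.run_id, M1.dim0, (SUM(M1.acc_value) / SUM(L1A2.acc2)) AS expr1
FROM T03_acc M1
INNER JOIN
(
  SELECT run_id, dim0, sub_id, acc_value AS acc2
  FROM T03_acc
  WHERE acc_id = 2
) L1A2
ON (L1A2.run_id = M1.run_id AND L1A2.dim0 = M1.dim0 AND L1A2.sub_id = M1.sub_id)
WHERE M1.acc_id = 0
GROUP BY M1.run_id, M1.dim0
"""
result = con.execute(expr_sql).fetchall()
print(result)
```

With these made-up values the expression evaluates to (2 + 4) / (10 + 20) for the single cell of run 1.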
