Fixed models / data files for cmdstan #171

PhilClemson · 2020-04-30T19:08:57Z

This basically fixes the majority of models that were either broken or had missing / incorrect syntax for the R dump data files used by cmdstan. They can be checked using the script found in the performance tests repository (https://github.com/stan-dev/performance-tests-cmdstan), e.g.:

python ./performance-tests-cmdstan/runPerformanceTests.py directories "performance-tests-cmdstan/example-models" --overwrite-gold --method=sample --num-samples=1000

Here is a list of the updated models:

misc/ecology/occupancy/occupancy
misc/ecology/mark-recapture/mark-recapture
misc/ecology/mark-recapture/cjs-K
misc/ecology/mark-recapture/mark-recapture-2
misc/ecology/mark-recapture/mark-recapture-3
misc/ecology/mark-recapture/cjs
misc/linear-regression_regression
misc/linear-regression_regression_std
misc/garch_arch1
misc/garch_koyck
misc/garch_garch1_1
misc/hier_multivariate_hier_multivariate
misc/moving-avg/arma11
misc/moving-avg/arma11_alt
misc/moving-avg/maQ
misc/moving-avg/stochastic-volatility
misc/moving-avg/ma2
misc/moving-avg/stochastic-volatility-optimized
misc/gaussian-process/gp-predict-analytic
misc/gaussian-process/gp-sim-multi
misc/gaussian-process/gp-fit-ARD
misc/gaussian-process/gp-fit-latent
misc/dlm/fx/equicorr
misc/dlm/fx/factor
misc/sur/sur/imrpoper
misc/sur/sur
misc/hmm/hmm
misc/hmm/hmm-analytic
misc/hmm/hmm-semisup
misc/hmm/hmm-sufficient
misc/nnmf/nnmf
misc/nnmf/nnmf/vec
knitr/simplest-regression/fake-data
knitr/pool-binary-trials/pool
knitr/pool-binary-trials/no-pool
knitr/pool-binary-trials/hier
knitr/pool-binary-trials/hier-logit-centered
knitr/pool-binary-trials/hier-logit
knitr/chapter2/worldcup/fixed
knitr/chapter2/worldcup/first/try
knitr/chapter2/worldcup/first/try/noprior
knitr/chapter2/worldcup/with/replication
knitr/chapter2/golf1
knitr/chapter2/worldcup/no/sqrt
knitr/car-iar-poisson/pois/re
knitr/car-iar-poisson/pois
knitr/car-iar-poisson/bym2/islands
knitr/car-iar-poisson/bym2/offset/only
knitr/car-iar-poisson/pois/icar
knitr/car-iar-poisson/bym2
knitr/car-iar-poisson/simple/iar
knitr/chapter1/fake-data
knitr/golf/golf1
knitr/world-cup/worldcup/fixed
knitr/world-cup/worldcup/first/try
knitr/world-cup/worldcup/first/try/noprior
knitr/world-cup/worldcup/with/replication
knitr/world-cup/worldcup/no/sqrt
ARM/Ch.25/earnings2
ARM/Ch.25/earnings
ARM/Ch.25/earnings/pt2
ARM/Ch.25/earnings/pt1
education/tutorial/twopl/reg/dif
education/tutorial/twopl/reg/noncentered
education/tutorial/twopl/reg/centered
education/tutorial/twopl/twopl
education/hierarchical/2pl/hierarchical/2pl
education/dina/independent/dina/independent
education/dina/independent/dina/nostructure
BPA/Ch.04/GLMM1
BPA/Ch.04/GLMM2
BPA/Ch.13/bluebug
BPA/Ch.13/owls/ms1
BPA/Ch.13/owls/ms2
BPA/Ch.13/Dynocc2
BPA/Ch.10/js/ms
BPA/Ch.12/binmix/cov
BPA/Ch.12/binmix
BPA/Ch.12/Nmix0
BPA/Ch.12/Nmix2
BPA/Ch.12/Nmix1
BPA/Ch.07/cjs/trap
BPA/Ch.07/cjs/temp/corr

…d by cmdstan

…tting 'Initialization between (-2, 2) failed after 100 attempts' errors.

…twopl by MC

Added data files for pool-binary-trials

Added missing data for knitr/car-iar-poisson models

Added missing data for golf1 models

…t-regression\ and knitr\chapter1"

…s (by MC)

mitzimorris · 2020-04-30T20:04:59Z

where did you get the data from?
I realize that in the past there have been attempts to add data - cf #151 - these were done in a brute-force way. After the PR associated with that issue went in, I cleaned up part of it by removing generated data from case studies where there wasn't any data.

it's important that the data used to test the model should be generated by the model or the process which the model is trying to approximate. using just the specified constraints on data and parameters to generate test is not enough - there is often domain knowledge which is not specified, e.g., for the icar models, the input data describes an areal map, and there are constraints on the relations between the map edges which aren't directly expressed in the data declarations.

for testing purposes, I would imagine that we're interested in testing well-specified models which are fitting data generated from the model and/or which is somehow pathological. this requires that all test data be annotated explaining what kind of data it is and what kind of behavoir we expect when fitting the model to the data.

mitzimorris · 2020-04-30T20:28:46Z

hi @PhilClemson - please allow me to clarify:

I think it would be better to curate the models used for testing, not make sure that we can use every single one of them. so I don't think this PR is a good idea.

bob-carpenter · 2020-04-30T20:39:41Z

I also think a curated set of test models makes much more sense for testing than all the stuff that's accumulated in our example-models repo.

Having said that, I would like to see something done to clean up example-models, which means:

fixing models that we want to keep
removing junk

In the first category are models there for pedagogical purposes, not for testing purposes, like the ones in directory BPA (from the Bayesian Population Analysis book). It'd be nice to at least clean those up. Pretty much everything in misc is just junk I threw up over the years for the manual or just as an example for someone. That can all be eliminated.

How did you generate new data for things like my hierarchical binary pooling example in knitr/pool-binary-trials?

mitzimorris · 2020-04-30T20:41:58Z

@PhilClemson - w/r/t to the icar-car-poisson models, I don't think your correction of the stan code is correct. did you test it on an areal map data that contained islands/singletons?

PhilClemson · 2020-05-01T14:13:51Z

where did you get the data from?
I realize that in the past there have been attempts to add data - cf #151 - these were done in a brute-force way. After the PR associated with that issue went in, I cleaned up part of it by removing generated data from case studies where there wasn't any data.

it's important that the data used to test the model should be generated by the model or the process which the model is trying to approximate. using just the specified constraints on data and parameters to generate test is not enough - there is often domain knowledge which is not specified, e.g., for the icar models, the input data describes an areal map, and there are constraints on the relations between the map edges which aren't directly expressed in the data declarations.

for testing purposes, I would imagine that we're interested in testing well-specified models which are fitting data generated from the model and/or which is somehow pathological. this requires that all test data be annotated explaining what kind of data it is and what kind of behavoir we expect when fitting the model to the data.

Hi @mitzimorris, thanks for taking a look through this.

Most of the data came from the .R and documentation found in the same folder as the stan files. For the BPA models I downloaded the data from the links provided. My main task was to standardise the models so that they could all be run using cmdstan. Fortunately in most cases it was just a case of generating the data.R file from existing code / data.

There were a few cases where we had to generate "compatible" data but not "curated" data. The knitr models are a good example of this.

hi @PhilClemson - please allow me to clarify:

I think it would be better to curate the models used for testing, not make sure that we can use every single one of them. so I don't think this PR is a good idea.

I understand, we just thought we would share these changes in the hope that they could be applied in general. We've found that having a large selection of models / data is useful, even if just for identifying bugs. It would definitely not be appropriate to judge performance statistics based on individual models, although looking at performance over a large population of models might still be useful.

I also think a curated set of test models makes much more sense for testing than all the stuff that's accumulated in our example-models repo.

Having said that, I would like to see something done to clean up example-models, which means:
* fixing models that we want to keep

* removing junk
In the first category are models there for pedagogical purposes, not for testing purposes, like the ones in directory BPA (from the Bayesian Population Analysis book). It'd be nice to at least clean those up. Pretty much everything in misc is just junk I threw up over the years for the manual or just as an example for someone. That can all be eliminated.

How did you generate new data for things like my hierarchical binary pooling example in knitr/pool-binary-trials?

That sounds like a good idea, although in that case maybe the junk models like those in misc could become a separate repository? I'd have to check with @alecksphillips on those specific examples, but as I've said above I imagine it will be compatible rather than curated data.

@PhilClemson - w/r/t to the icar-car-poisson models, I don't think your correction of the stan code is correct. did you test it on an areal map data that contained islands/singletons?

For these I generated the data from the corresponding .R files. I remember for bym2_islands.stan I was getting errors which seemed to be an indexing issue (which I fixed) but I might be wrong. I believe I used the nycTracts10 data.

alecksphillips · 2020-05-01T16:22:29Z

That sounds like a good idea, although in that case maybe the junk models like those in misc could become a separate repository? I'd have to check with @alecksphillips on those specific examples, but as I've said above I imagine it will be compatible rather than curated data.

Hi all,
Just to add, the bits of data I contributed I generated using the R markdown workbooks that were in those directories.

mitzimorris · 2020-05-01T17:35:27Z

For these I generated the data from the corresponding .R files. I remember for bym2_islands.stan I was getting errors which seemed to be an indexing issue (which I fixed) but I might be wrong. I believe I used the nycTracts10 data.

please revert all changes to the car-icar-poisson models and delete the data files you added. I will try to add appropriate datasets myself.

the bym2_islands.stan is correctly coded - it requires input where the areal map contains at least 2 disconnected subcomponents, whereas the data you're using is an areal map which is fully connected. the bym2_islands.stan model is checked into the case study because it's the next level of in complexity for that model, but I haven't yet extended the case study.

I agree that we should expand the suite of test models, but in the icar model case study, the models that consist of a regression plus just an icar or just a random effects are degenerate versions of the bym2 model when there's no spatial correlation or when there's perfect spatial correlation, so adding these to the test suite isn't going to tell you anything new. those models were included mainly to generate visualizations and stan programming pedegogy. they're just not worth running in a test suite - it's a waste of compute cycles / carbon credits.

This reverts commit 8ba28f0.

This reverts commit cd4f008.

PhilClemson · 2020-05-01T19:26:33Z

please revert all changes to the car-icar-poisson models and delete the data files you added. I will try to add appropriate datasets myself.

Thanks for the explanation, that makes a lot of sense and I agree. I've reverted the relevant commits.

jgabry · 2020-06-09T18:05:02Z

Thanks for doing all this work!

@mitzimorris Now that the icar-poisson commits are reverted, is this ready to merge?

mitzimorris · 2020-06-10T21:37:22Z

yes!

jgabry · 2020-06-10T22:32:48Z

Ok, great, will merge now. Thanks again @PhilClemson for doing all this!

PhilClemson · 2020-06-11T08:02:49Z

Thanks both, although I can't take all of the credit as I had help!

alecksphillips and others added 20 commits February 25, 2020 16:03

Added knitr/worldcup and knitr/chapter2/worldcup* data files

f328d7c

Fixed ecology models by reformatting data files into syntax understoo…

bad63a8

…d by cmdstan

Fixed garch, hier_multivariate and moving-avg models

63a9440

Added data files for pool-binary-trials

6ea6e37

Fixed gaussian-process models (all data file syntax related)

65aec79

Fixed some more models with data file issues and also fixed a typo

98cff00

Fixed remaining models in misc with data file issues

c0d1e8e

Added dump data files for car-iar-poisson models

8ba28f0

Added earnings data files

8fcab99

Added missing data for knitr/car-iar-poisson models

cd4f008

Added missing data for golf1 models

e2f9442

Added dump data files for ARM/Ch.17 models. However, these are now ge…

6926feb

…tting 'Initialization between (-2, 2) failed after 100 attempts' errors.

Fixes of models in education\hierarchical_2pl and education\tutorial_…

033f4ef

…twopl by MC

Merge pull request #1 from alecksphillips/master

a0e611e

Added data files for pool-binary-trials

Merge pull request #2 from alecksphillips/data-car-iar-poisson

a2080b3

Added missing data for knitr/car-iar-poisson models

Merge pull request #3 from alecksphillips/data-golf

96b8ec7

Added missing data for golf1 models

Fixed / added data files for BPA models

6b07319

Added priors in fx_factor.stan and added data files for knitr\simples…

438b8c5

…t-regression\ and knitr\chapter1"

Added data file for BPA\Ch.10\js_ms.stan

6cf55c5

Fixed education/dina_independent models and added data and inits file…

659e916

…s (by MC)

PhilClemson added 2 commits May 1, 2020 20:16

Revert "Added dump data files for car-iar-poisson models"

b3f25ca

This reverts commit 8ba28f0.

Revert "Added missing data for knitr/car-iar-poisson models"

d062ff4

This reverts commit cd4f008.

PhilClemson and others added 2 commits May 22, 2020 12:15

Included new folder 'applications' and added the shared peptide model

b01fc3b

Add Imperial's COVID-19 model

f4848d3

Removed shared peptide model for now

9e196b3

jgabry merged commit 506a9ab into stan-dev:master Jun 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed models / data files for cmdstan #171

Fixed models / data files for cmdstan #171

PhilClemson commented Apr 30, 2020

mitzimorris commented Apr 30, 2020 •

edited

Loading

mitzimorris commented Apr 30, 2020

bob-carpenter commented Apr 30, 2020

mitzimorris commented Apr 30, 2020

PhilClemson commented May 1, 2020

alecksphillips commented May 1, 2020

mitzimorris commented May 1, 2020

PhilClemson commented May 1, 2020

jgabry commented Jun 9, 2020

mitzimorris commented Jun 10, 2020

jgabry commented Jun 10, 2020

PhilClemson commented Jun 11, 2020

Fixed models / data files for cmdstan #171

Fixed models / data files for cmdstan #171

Conversation

PhilClemson commented Apr 30, 2020

mitzimorris commented Apr 30, 2020 • edited Loading

mitzimorris commented Apr 30, 2020

bob-carpenter commented Apr 30, 2020

mitzimorris commented Apr 30, 2020

PhilClemson commented May 1, 2020

alecksphillips commented May 1, 2020

mitzimorris commented May 1, 2020

PhilClemson commented May 1, 2020

jgabry commented Jun 9, 2020

mitzimorris commented Jun 10, 2020

jgabry commented Jun 10, 2020

PhilClemson commented Jun 11, 2020

mitzimorris commented Apr 30, 2020 •

edited

Loading