echoice2.Rmd
This package contains functions for applying choice models based on economic theory. Models can be applied to data from choice experiments (‘Conjoint Analysis’), from purchase transaction histories or a combination of both. The package contains functions for estimation and demand prediction.
By and large, it uses ‘tidy’ data and integrates nicely with many ‘tidyverse’ functions. The idea is to facilitate easy and readable choice modeling workflows.
All estimation and prediction functions are implemented in C++ and use OpenMP for multithreading support (i.e., this package is fast).
Function names follow a simple convention: functions that relate to discrete demand start in `dd_`, while functions for volumetric demand start in `vd_`. Estimation functions continue in `est`, demand simulators in `dem`. Universal functions (discrete and volumetric choice) start in `ec_`.
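Because of this consistent prefixing, the API is easy to explore with tab completion or pattern matching. A small sketch, assuming the package is attached:

```r
library(echoice2)

# list all volumetric-demand estimation functions by their prefix
ls("package:echoice2", pattern = "^vd_est_")

# list all universal helper functions
ls("package:echoice2", pattern = "^ec_")
```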
Here is a quick overview of key functions:
Estimation functions:

- `dd_est_hmnl`: Discrete choice (HMNL)
- `dd_est_hmnl_screen`: Discrete choice, attribute-based screening (not including price)
- `dd_est_hmnl_screenpr`: Discrete choice, attribute-based screening (including price)
- `vd_est_vdm`: Volumetric demand, EV1 errors
- `vd_est_vdmn`: Volumetric demand, Normal errors
- `vd_est_vdm_screen`: Volumetric demand, attribute-based screening, Normal errors
- `vd_est_vdm_screenpr`: Volumetric demand, attribute-based screening including price, Normal errors
- `vd_est_vdm_ss`: Volumetric demand, accounting for set-size variation (1st order), EV1 errors
- `vd_est_vdm_ssq`: Volumetric demand, accounting for set-size variation (2nd order), EV1 errors

Demand simulators:

- `dd_dem`: Discrete choice (HMNL)
- `dd_dem_sr`: Discrete choice, attribute-based screening (not including price)
- `vd_dem_vdm`: Volumetric demand, EV1 errors
- `vd_dem_vdmn`: Volumetric demand, Normal errors
- `vd_dem_vdmnsr`: Volumetric demand, attribute-based screening, Normal errors
- `vd_dem_vdmsrpr`: Volumetric demand, attribute-based screening including price, Normal errors
- `vd_dem_vdmss`: Volumetric demand, accounting for set-size variation (1st order), EV1 errors
- `vd_dem_vdmssq`: Volumetric demand, accounting for set-size variation (2nd order), EV1 errors

The package is a work in progress.
The MCMC algorithm for the volumetric demand model is implemented in `vd_est_vdm`. The model can be applied to data from volumetric conjoint studies, or to unit-level demand data, such as household panel data. This vignette provides an exposition of the model, the input data structure, and an example, including model estimation and evaluation.
We assume that subjects maximize their utility from consuming inside goods \(\mathbf{x}\) and the outside good \(z\):
\[ u(\mathbf{x}, z) = \sum_k \frac{\psi_k}{\gamma} \ln(\gamma x_k + 1) + \ln(z) \]
subject to a budget constraint
\[ \sum_k p_k x_k + z = E \]
where \(\psi_k\) is the baseline utility of the \(k\)-th good, and \(\gamma\) is a parameter that governs the rate of satiation of the inside goods. It is assumed that the rate of satiation \(\gamma\) is the same for all inside goods.
In this implementation, we assume that \[\psi_j = \exp(a_j'\beta + \varepsilon_j)\] where \[\varepsilon_j \sim \text{Type I EV}(0,\sigma)\]
While the assumption of the EV1 error term is common, there are also implementations assuming Normal-distributed errors.
The marginal utilities for the inside and outside goods are:
\[ u_j = \frac{\partial u(\mathbf{x},z)}{\partial x_j} = \frac{\psi_j}{\gamma x_j + 1}, \qquad u_z = \frac{\partial u(\mathbf{x},z)}{\partial z} = \frac{1}{z} \]
Solving for \(\varepsilon_j\) leads to the following expression for the KT conditions:
\[ \varepsilon_j = g_j \quad \text{if} \quad x_j > 0, \qquad \varepsilon_j < g_j \quad \text{if} \quad x_j = 0 \]
where
\[ g_j = -a_j'\beta + \ln(\gamma x_j + 1) + \ln\left( \frac{p_j}{E - \mathbf{p}'\mathbf{x}} \right) \]
We are able to obtain a closed-form expression for the probability that \(R\) of the \(N\) goods are chosen. Without loss of generality, order the goods such that the \(R\) chosen goods come first:
\[ x_1,x_2,\dots,x_R > 0, \qquad x_{R+1}, x_{R+2}, \dots, x_N = 0. \]
The error scale \(\sigma\) is identified in this model because price enters the specification without a separate price coefficient. The likelihood \(\ell(\theta)\) of the model parameters is proportional to the probability of observing \(n_1\) chosen goods and \(n_2\) goods with zero demand. The chosen goods contribute to the likelihood in the form of a probability density, while the goods not chosen contribute as a probability mass.
\[
\begin{aligned}
\ell(\theta) &\propto p(x_{n_1} > 0, x_{n_2} = 0 \mid \theta) \\
&= |J_R| \int_{-\infty}^{g_N} \cdots \int_{-\infty}^{g_{R+1}} f(g_1, \ldots, g_R, \varepsilon_{R+1}, \ldots, \varepsilon_N) \, d\varepsilon_{R+1} \cdots d\varepsilon_N \\
&= |J_R| \left\{ \prod_{j=1}^{R} \frac{\exp(-g_j/\sigma)}{\sigma} \exp\left(-e^{-g_j/\sigma}\right) \right\} \left\{ \prod_{i=R+1}^{N} \exp\left(-e^{-g_i/\sigma}\right) \right\} \\
&= |J_R| \left\{ \prod_{j=1}^{R} \frac{\exp(-g_j/\sigma)}{\sigma} \right\} \exp\left\{ -\sum_{i=1}^{N} \exp(-g_i/\sigma) \right\}
\end{aligned}
\]
where \(|J_R|\) is the Jacobian of the transformation from the random-utility errors (\(\varepsilon\)) to the likelihood of the observed data. In this model, the Jacobian is:
\[ |J_R| = \prod_{k=1}^{R} \left( \frac{\gamma}{\gamma x_k + 1} \right) \left\{ \sum_{k=1}^{R} \frac{\gamma x_k + 1}{\gamma} \cdot \frac{p_k}{E - \mathbf{p}'\mathbf{x}} + 1 \right\} \]
The ‘lower level’ of the hierarchical model applies the direct utility model to a specific respondent’s choice data, and the ‘upper level’ of the model incorporates heterogeneity in respondent coefficients.
Respondent heterogeneity can be incorporated into conjoint analysis using a variety of random-effect models. We use a Normal model for heterogeneity by default.
Denoting all individual-level parameters for respondent \(h\) by \(\theta_h\), we have:
\[ \theta_h \sim \text{Normal}(\bar\theta, \Sigma) \]
where
\[ \theta_h = \left\{ \beta_h', \gamma_h, E_h, \sigma_h \right\} \]
It is possible to add covariates (\(z\)) to the mean of the heterogeneity distribution, as in a regression model:
\[ \theta_h \sim \text{Normal}(\Gamma z_h, \Sigma) \]
The parameters \(\bar\theta\), \(\Gamma\) and \(\Sigma\) are referred to as hyper-parameters because they describe the distribution of other parameters.
Covariates in the above expression might include demographic variables or other variables collected as part of the conjoint survey, e.g., variables describing reasons to purchase. `rVDev` implements a multivariate regression prior. If no covariates are provided, \(z\) simply contains a vector of 1s, and \(\Gamma\) is equivalent to \(\bar\theta\) in the MVN heterogeneity model.
The prior is stored in a list. Its elements are (`Gammabar`, `AGamma`, `nu`, `V`).
Natural conjugate priors are specified:
\[ \theta_h \sim MVN(\Gamma z_h, \Sigma) \]
\[ \Gamma \sim MVN(\bar\Gamma, A_{\Gamma}^{-1}) \]
\[ \Sigma \sim IW(\nu, V) \]
This specification of priors assumes that \(\Gamma\) is independent of \(\Sigma\) and that, conditional on the hyper-parameters, the \(\theta_h\)’s are a priori independent.
By default, priors are set to be weakly informative, with an emphasis on ‘weakly’.
Here we demonstrate the implementation using the `icecream` dataset available within the package. It contains volumetric choice data from 300 respondents who evaluated packaged ice cream.
The package is built around the idea of using long-format data. This makes it easy to select and manipulate data, whether it comes from a choice experiment or from transaction data. The structure of choice data in `echoice2` is simple.
Choice data `data.frame`s or tibbles need to contain the following columns:

- `id` (integer): respondent identifier
- `task` (integer): choice-task number
- `alt` (integer): alternative number within the task
- `x` (numeric): chosen quantity
- `p` (numeric): price
- attribute columns (factors): one column per product attribute

Continuous attributes are not yet supported, but support is forthcoming.
The `icecream` dataset is stored in a tibble:
```r
data(icecream)
icecream %>% head
## # A tibble: 6 x 8
## id task alt x p Brand Flavor Size
## <int> <int> <int> <dbl> <dbl> <fct> <fct> <ord>
## 1 1 1 1 8 0.998 Store Neapolitan 16
## 2 1 1 2 0 0.748 Store VanillaBean 16
## 3 1 1 3 0 1.25 BenNJerry Oreo 16
## 4 1 1 4 0 0.748 BenNJerry Neapolitan 16
## 5 1 1 5 0 2.49 HaagenDa RockyRoad 4
## 6 1 1 6 0 1.25 HaagenDa Oreo 16
```
`ec_summarize_attrlvls` provides a quick glance at the attributes and levels (for categorical attributes).
```r
icecream %>% ec_summarize_attrlvls
## # A tibble: 3 x 2
## attribute levels
## <chr> <chr>
## 1 Brand BenNJerry, BlueBell, BlueBunny, Breyers, Dryers, HaagenDa, Store
## 2 Flavor Chocolate, ChocChip, ChocDough, CookieCream, Neapolitan, Oreo, Rock~
## 3 Size 4, 8, 16
```
Storing choice data in a ‘long’ format allows leveraging all the functionality of `dplyr` and other `tidyverse` packages. We can, for instance, quickly generate a summary of average choice quantities per task and plot a histogram. Most respondents choose 5 or fewer packs of 4 servings per task.
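The chunk producing that summary is not shown above; a minimal sketch using `dplyr` and `ggplot2` (column names as in the `icecream` data) might look like this:

```r
library(dplyr)
library(ggplot2)

# total units chosen per respondent-task, then a histogram of those totals
icecream %>%
  group_by(id, task) %>%
  summarise(total_units = sum(x), .groups = "drop") %>%
  ggplot(aes(x = total_units)) +
  geom_histogram(binwidth = 1)
```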
For hold-out validation, it is common to keep 1 or 2 tasks per respondent. Here, we select 1 task per respondent. Identifiers of the hold-out tasks are stored in `ho_tasks`. `ice_cal` and `ice_ho` contain the calibration and hold-out portions of the `icecream` choice dataset.
```r
#select holdout tasks
set.seed(8438453)
ho_tasks =
  icecream %>%
  distinct(id, task) %>%
  mutate(id = as.integer(id)) %>%
  group_by(id) %>%
  summarise(task = sample(task, 1), .groups = 'drop')
set.seed(NULL)

#split data
ice_cal = icecream %>% mutate(id = as.integer(id)) %>%
  anti_join(ho_tasks, by = c('id', 'task'))
ice_ho  = icecream %>% mutate(id = as.integer(id)) %>%
  semi_join(ho_tasks, by = c('id', 'task'))
```
The standard volumetric demand model is implemented in `vd_est_vdm`. Other volumetric demand models are implemented in functions starting in `vd_est_`; discrete choice models in functions starting in `dd_est_`.
In this example, only 20,000 draws are generated to speed up vignette compilation; this takes about half a minute on a modern 16-core PC.
Estimation uses a simple Metropolis-Hastings algorithm, which is very fast but may require thinning. The function automatically determines the number of physical CPU cores and uses OpenMP for parallel computation. By default, 100,000 draws are generated and every 10th draw is saved. For real-world applications, more draws (`R`) and more thinning (`keep`) are recommended.
```r
icecream_est = ice_cal %>% vd_est_vdm(R = 20000)
## MCMC in progress
## Total Time Elapsed: 0.32 minutes
```
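For a production run, one might instead use more draws and heavier thinning. The settings below are purely illustrative (the argument names `R` and `keep` are those referenced in the text above):

```r
# illustrative production settings: 200,000 draws, keeping every 20th
icecream_est_full <- ice_cal %>% vd_est_vdm(R = 200000, keep = 20)
```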
Summaries of the upper-level coefficients can be obtained using the `ec_estimates_MU` function.
```r
icecream_est %>% ec_estimates_MU
## # A tibble: 21 x 12
## attribute lvl par mean sd `CI-5%` `CI-95%` sig model error
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <chr> <chr>
## 1 <NA> <NA> int -3.25 0.384 -3.47 -3.07 TRUE VD-c~ EV1
## 2 Brand Blue~ Bran~ -0.815 0.151 -1.02 -0.603 TRUE VD-c~ EV1
## 3 Brand Blue~ Bran~ -0.706 0.140 -0.911 -0.500 TRUE VD-c~ EV1
## 4 Brand Brey~ Bran~ -0.0996 0.0857 -0.242 0.0473 FALSE VD-c~ EV1
## 5 Brand Drye~ Bran~ -0.645 0.116 -0.816 -0.497 TRUE VD-c~ EV1
## 6 Brand Haag~ Bran~ -0.386 0.0940 -0.534 -0.244 TRUE VD-c~ EV1
## 7 Brand Store Bran~ -0.581 0.115 -0.745 -0.407 TRUE VD-c~ EV1
## 8 Flavor Choc~ Flav~ -0.396 0.119 -0.592 -0.194 TRUE VD-c~ EV1
## 9 Flavor Choc~ Flav~ -0.418 0.124 -0.616 -0.218 TRUE VD-c~ EV1
## 10 Flavor Cook~ Flav~ -0.413 0.111 -0.590 -0.238 TRUE VD-c~ EV1
## # ... with 11 more rows, and 2 more variables: reference_lvl <chr>,
## # parameter <chr>
```
Diagnostics start with a look at traceplots. Upper-level means can be checked using the `ec_trace_MU` function.
```r
icecream_est %>% ec_trace_MU()
```
We should check in-sample and out-of-sample fit.
First, we compare in-sample fit. `ec_lmd_NR` computes a rough approximation of the marginal likelihood. It does so for 4 equally-sized chunks of draws, which lets us double-check convergence.
```r
icecream_est %>% ec_lmd_NR
## # A tibble: 4 x 2
## part lmd
## <dbl> <dbl>
## 1 0.25 -Inf
## 2 0.5 -12178.
## 3 0.75 -12143.
## 4 1 -12068.
```
Now, we compare out-of-sample fit. These are the steps:
First, ensure that the hold-out data is compatible with the in-sample data. This means that the factor levels of the categorical attributes should be identical. Using the `prep_newprediction` function, we ensure that factor levels match.
Demand simulators are implemented in functions starting in `vd_dem_`. For the standard volumetric demand model, we use the `vd_dem_vdm` function.
Removing burn-in from the draws and additional thinning might be required; `vd_thin_draw` helps do that. By default, it removes the first half of the draws.
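For example, to drop the burn-in before prediction (by default, the first half of the saved draws):

```r
# keep only the second half of the posterior draws
icecream_draws <- icecream_est %>% vd_thin_draw()
```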
`ec_dem_eval` helps assess fit. It compares demand predictions to the true values. For volumetric models, MAE, MSE, RAE, and bias are computed. The RAE (relative absolute error) is not widely used, but it is the closest to an absolute fit statistic; it should be smaller than 1.
```r
#generate predictions
ho_demand =
  ice_ho %>%
  prep_newprediction(ice_cal) %>%
  vd_dem_vdm(icecream_est %>% vd_thin_draw)
## Computation in progress
```
`ho_demand` contains the hold-out data (`ice_ho`) with an added `.demdraws` column. Each cell in that column contains posterior draws of predicted demand.
```r
ho_demand
## # A tibble: 3,600 x 9
## id task alt x p Brand Flavor Size .demdraws
## * <int> <int> <int> <dbl> <dbl> <fct> <fct> <ord> <list>
## 1 1 7 1 0 1.25 Store Neapolitan 16 <dbl [501]>
## 2 1 7 2 0 1.25 Store CookieCream 16 <dbl [501]>
## 3 1 7 3 0 1.74 BenNJerry Neapolitan 8 <dbl [501]>
## 4 1 7 4 0 1.12 BenNJerry Oreo 16 <dbl [501]>
## 5 1 7 5 0 3.49 BlueBell VanillaFudge 4 <dbl [501]>
## 6 1 7 6 0 1.74 BlueBell RockyRoad 8 <dbl [501]>
## 7 1 7 7 0 1.25 BlueBunny VanillaFudge 16 <dbl [501]>
## 8 1 7 8 0 1.25 BlueBunny RockyRoad 16 <dbl [501]>
## 9 1 7 9 0 1.25 Dryers CookieCream 16 <dbl [501]>
## 10 1 7 10 0 2.00 Dryers Chocolate 8 <dbl [501]>
## # ... with 3,590 more rows
```
```r
ho_demand %>%
  ec_dem_eval
## # A tibble: 1 x 4
## MSE MAE bias RAE
## <dbl> <dbl> <dbl> <dbl>
## 1 2.74 0.459 0.0310 0.856
```
Simulating choice quantities and shares is a key feature of `echoice2`. Demand simulators are implemented in functions starting in `vd_dem_`.
Let’s consider the Häagen-Dazs brand. They line-price most of their classic flavors; let’s say they usually charge around $4. Overall, we have 10 flavors in this study.
```r
my_scenario = tibble(
  id = 1L, task = 1L, alt = 1:10,
  Brand  = 'HaagenDa',
  Flavor = c("Chocolate", "ChocChip", "ChocDough", "CookieCream", "Neapolitan",
             "Oreo", "RockyRoad", "Vanilla", "VanillaBean", "VanillaFudge"),
  Size = 8,
  p = 4)
```
We need to make sure factor levels match; the `prep_newprediction` function does that. Notice that attribute levels in `my_scenario` are just strings. `prep_newprediction` converts them to factors, with levels matching those in the calibration dataset that was used to estimate the model.
This makes it easy to define arbitrary scenarios for ‘market’ simulation.
```r
my_scenario <- my_scenario %>% prep_newprediction(ice_cal)
```
Before we can simulate choices for all our respondents, we need to apply this scenario to all 300 respondents from our study.
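The chunk that builds `my_market` is not shown in this excerpt. One simple way to replicate the scenario for every respondent, sketched here as an assumption, is to cross the respondent ids from the calibration data with the scenario (dropping the placeholder `id` column first):

```r
library(dplyr)
library(tidyr)

# replicate the scenario once per respondent in the calibration data
my_market <- ice_cal %>%
  distinct(id) %>%
  crossing(my_scenario %>% select(-id))
```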
Now we can simulate choices:
```r
my_market_demand <- my_market %>%
  vd_dem_vdm(est = icecream_est)
## Computation in progress
```
The `my_market_demand` object contains draws of demand quantities in the column `.demdraws`.
```r
my_market_demand %>% head
## # A tibble: 6 x 8
## id task alt p Brand Flavor Size .demdraws
## <int> <dbl> <int> <dbl> <fct> <fct> <fct> <list>
## 1 1 1 1 4 HaagenDa Chocolate 8 <dbl [2,000]>
## 2 1 1 2 4 HaagenDa ChocChip 8 <dbl [2,000]>
## 3 1 1 3 4 HaagenDa ChocDough 8 <dbl [2,000]>
## 4 1 1 4 4 HaagenDa CookieCream 8 <dbl [2,000]>
## 5 1 1 5 4 HaagenDa Neapolitan 8 <dbl [2,000]>
## 6 1 1 6 4 HaagenDa Oreo 8 <dbl [2,000]>
```
`ec_dem_aggregate` helps aggregate demand draws, e.g., from the individual-task-alternative level to the Brand or Brand-Flavor-Size level. We are not computing posterior means just yet.
```r
my_market_demand %>%
  ec_dem_aggregate(c('Brand','Flavor','Size'))
## # A tibble: 10 x 4
## # Groups: Brand, Flavor [10]
## Brand Flavor Size .demdraws
## <fct> <fct> <fct> <list>
## 1 HaagenDa Chocolate 8 <dbl [2,000]>
## 2 HaagenDa ChocChip 8 <dbl [2,000]>
## 3 HaagenDa ChocDough 8 <dbl [2,000]>
## 4 HaagenDa CookieCream 8 <dbl [2,000]>
## 5 HaagenDa Neapolitan 8 <dbl [2,000]>
## 6 HaagenDa Oreo 8 <dbl [2,000]>
## 7 HaagenDa RockyRoad 8 <dbl [2,000]>
## 8 HaagenDa Vanilla 8 <dbl [2,000]>
## 9 HaagenDa VanillaBean 8 <dbl [2,000]>
## 10 HaagenDa VanillaFudge 8 <dbl [2,000]>
```
`ec_dem_summarise` computes posterior summaries, e.g., posterior means.
```r
my_market_demand %>%
  ec_dem_aggregate(c('Brand','Flavor','Size')) %>%
  ec_dem_summarise()
## # A tibble: 10 x 8
## # Groups: Brand, Flavor [10]
## Brand Flavor Size .demdraws `E(demand)` `S(demand)` `CI-5%` `CI-95%`
## <fct> <fct> <fct> <list> <dbl> <dbl> <dbl> <dbl>
## 1 Haagen~ Chocolate 8 <dbl [2,00~ 14.9 8.93 5.51 27.0
## 2 Haagen~ ChocChip 8 <dbl [2,00~ 20.9 9.25 9.95 35.8
## 3 Haagen~ ChocDough 8 <dbl [2,00~ 16.7 9.02 7.49 29.1
## 4 Haagen~ CookieCre~ 8 <dbl [2,00~ 13.2 8.70 4.78 25.0
## 5 Haagen~ Neapolitan 8 <dbl [2,00~ 13.6 8.61 5.05 25.2
## 6 Haagen~ Oreo 8 <dbl [2,00~ 11.8 8.92 3.77 21.3
## 7 Haagen~ RockyRoad 8 <dbl [2,00~ 18.6 8.83 8.83 30.5
## 8 Haagen~ Vanilla 8 <dbl [2,00~ 18.0 9.08 7.62 30.8
## 9 Haagen~ VanillaBe~ 8 <dbl [2,00~ 21.2 9.26 10.5 36.7
## 10 Haagen~ VanillaFu~ 8 <dbl [2,00~ 18.5 8.67 8.68 30.3
```
We can also look at other levels of aggregation, e.g. brand. Since only one brand was included in the choice scenario, we effectively obtain aggregate demand quantities for all inside goods.
```r
my_market_demand %>%
  ec_dem_aggregate(c('Brand')) %>%
  ec_dem_summarise()
## # A tibble: 1 x 6
## Brand .demdraws `E(demand)` `S(demand)` `CI-5%` `CI-95%`
## <fct> <list> <dbl> <dbl> <dbl> <dbl>
## 1 HaagenDa <dbl [2,000]> 168. 64.6 132. 198.
```
Elasticities in volumetric demand models are not constant. To better understand the implications of pricing decisions, demand curves can be a helpful tool.
The `ec_demcurve` function can be applied to all demand simulators. Based on an initial market scenario, it generates a series of scenarios in which the price of a focal product is varied over an interval. It then runs the demand simulator for each scenario, and we can use the output to draw a demand curve.
Pre-simulating error terms using `ec_gen_err_ev1` helps smooth these demand curves. It generates one error term per draw for each of the alternatives and tasks in the corresponding design dataset.
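The chunk that creates `my_demcurve` is not shown in this excerpt. A heavily hedged sketch of what such a call might look like follows; the focal-product selection, the price grid, and the argument order are all hypothetical, so consult `?ec_demcurve` and `?ec_gen_err_ev1` for the actual interfaces:

```r
# Hypothetical sketch - argument names and order are assumptions, not verified.
# Pre-simulate EV1 errors to smooth the demand curves:
eps_draws <- my_market %>% ec_gen_err_ev1(icecream_est, seed = 1337)

# Vary the focal product's price over a grid, re-simulating demand each time:
my_demcurve <- my_market %>%
  ec_demcurve(my_market %>% filter(Flavor == "Oreo"),  # focal alternative(s)
              c(0.75, 1, 1.25, 1.5, 1.75),             # hypothetical price grid
              vd_dem_vdm,                              # demand simulator to use
              icecream_est,
              eps_draws)
```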
```r
## Computation in progress
## Computation in progress
## Computation in progress
## Computation in progress
## Computation in progress
```
The output of `ec_demcurve` is a list containing demand summaries for each product under the different price scenarios. The list can be stacked into a data.frame, which can then be used to generate plots.
```r
my_demcurve %>% do.call('rbind', .) %>%
  ggplot(aes(x = scenario, y = `E(demand)`, color = Flavor)) + geom_line()
```