How many CBC tasks

"One man’s constant is another man’s variable."
Alan Perlis

Problem

A tempting answer to the title question is "the more, the better". Because CBC tasks are among the exercises respondents dislike most, designers tend to go in the opposite direction and keep the number of tasks low. Based on our own experience and a lot of experience of others, two rules of thumb for the minimal number of tasks have been suggested and used since about 2002. Both assume a set of attributes of about the same length (cardinality, to be precise) and up to 5 alternatives per choice set.

Rule of thumb by "Longest attribute"

The rule is based on showing each level (after a correction for the number of constraints, if any) twice and multiplying the resulting number of tasks by the number of attributes. For example, for 6 attributes of which the longest has 5 (effective) levels, and choice sets of 4 alternatives, the number of tasks should be (5 x 2) / 4 x 6 = 15 or more. In the case of an MXD (MaxDiff with best choices only) with 16 statements, the number of 4-statement tasks should be at least (16 x 2) / 4 x 1 = 8.
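The arithmetic of this rule can be sketched in a few lines of Python. The function name `tasks_by_longest_attribute` and its parameter names are illustrative assumptions, not part of any conjoint software.

```python
def tasks_by_longest_attribute(longest_levels, n_attributes, n_alternatives,
                               shows_per_level=2):
    """Minimal number of tasks by the "longest attribute" rule of thumb.

    longest_levels  -- effective number of levels of the longest attribute
    n_attributes    -- number of attributes in the design
    n_alternatives  -- number of alternatives in one choice set
    shows_per_level -- how often each level should be shown (2 in the rule)
    """
    return longest_levels * shows_per_level / n_alternatives * n_attributes


# The two examples from the text.
print(tasks_by_longest_attribute(5, 6, 4))   # 6 attributes, longest has 5 levels
print(tasks_by_longest_attribute(16, 1, 4))  # MXD with 16 statements
```

A fractional result would normally be rounded up, since the rule gives a lower bound.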

Rule of thumb by "Number of preferences"

In a choice task with J alternatives, the number of preferences of the chosen alternative over the other alternatives is (J - 1). The number of estimated parameters in CBC is the total number of effective levels (after a correction for the number of constraints, if any) minus the number of attributes. The rule is based on the number of preferences in a way analogous to degrees of freedom in linear regression. The number of tasks should be at least twice the number of parameters divided by the number of preferences. For the same examples as above, the number of tasks would be at least (24 / 3) x 2 = 16 and (15 / 3) x 2 = 10, respectively.
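A minimal sketch of this rule, assuming the effective attribute lengths are passed as a plain list. The name `tasks_by_preferences` is hypothetical.

```python
def tasks_by_preferences(effective_levels, n_alternatives):
    """Minimal number of tasks by the "number of preferences" rule of thumb.

    effective_levels -- list with the effective number of levels per attribute
    n_alternatives   -- number of alternatives J in one choice set
    """
    # Parameters: total number of effective levels minus number of attributes.
    n_parameters = sum(effective_levels) - len(effective_levels)
    # Preferences expressed by one choice: chosen alternative over the others.
    n_preferences = n_alternatives - 1
    return 2 * n_parameters / n_preferences


print(tasks_by_preferences([5] * 6, 4))  # 6 attributes with 5 levels each
print(tasks_by_preferences([16], 4))     # MXD with 16 statements
```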

As an aside

Of the two rules above, the latter is the one we prefer and use for simple self-contained CBC exercises, though with some upward correction when the number of alternatives is higher than 5.

Both rules are only indicative, apply to tasks with no more than 5 alternatives, and should be used with caution. The reasons are many. The weight of a choice in maximum likelihood estimation does not grow linearly with the number of alternatives in a choice set. In real cases, attributes have different lengths and different importance. The simple rules are not satisfactory for designs with many alternatives featuring FMCG/CPG products, specific alternatives or overlapping classes of products and, in particular, for problems that use several independent discrete choice blocks as soft constraints. A more substantiated way of controlling the number of tasks of the various types belonging to a DCM study had to be devised.

Solution

Let us consider a choice task with J alternatives and an indicator variable y_j for alternative j. The multinomial model requires that a choice is always made. If the j-th alternative is chosen, y_j = 1 and y_¬j = 0 for all the others. The parameters on which the probabilities depend are estimated by maximizing the LL (log-likelihood) function over all choices obtained in the exercise. For our purpose, the part of the conventional multinomial LL related to one choice task can be written as

$$\mathrm{LL} = \sum_{j=1}^{J} y_j \ln(p_j), \qquad \sum_{j=1}^{J} p_j = 1, \qquad \sum_{j=1}^{J} y_j = 1 \tag{1}$$

where p_j are the instantaneous values of choice probabilities being estimated for the items from the choice task. Note that the probabilities always sum to 1. However, only the value for the actual choice is used in estimation while all the other values are neglected.
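The following sketch evaluates the single-task LL of eq. 1 for made-up probabilities. It illustrates that only the term of the chosen alternative survives the sum. The function name and the example values are assumptions for illustration only.

```python
import math


def task_log_likelihood(probabilities, chosen):
    """Conventional multinomial LL of one choice task (eq. 1).

    probabilities -- estimated choice probabilities p_j, summing to 1
    chosen        -- zero-based index of the alternative actually chosen
    """
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("choice probabilities must sum to 1")
    # Indicator variables: y_j = 1 for the chosen alternative, 0 otherwise.
    y = [1 if j == chosen else 0 for j in range(len(probabilities))]
    # Terms with y_j = 0 contribute nothing and are skipped.
    return sum(math.log(p_j) for y_j, p_j in zip(y, probabilities) if y_j)


p = [0.5, 0.3, 0.2]               # hypothetical probabilities for 3 alternatives
print(task_log_likelihood(p, 0))  # equals ln(0.5)
```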

As an aside

Please note that a constant alternative is ignored on purpose in this treatment. Including a constant alternative always requires an increase in the number of CBC tasks, as the computed parameters often have dubious credibility and interpretation. This is true especially for the alternative "None", which has an unpredictable effect on choices.

The null hypothesis at the stage of designing a CBC exercise is our belief that all the items in the choice task have the same chance of being selected. The initial LL value for a CBC choice task is therefore

$$\mathrm{LL}_{\mathrm{ini}} = \sum_{j=1}^{J} \frac{1}{J} \ln\!\left(\frac{1}{J}\right), \qquad \sum_{j=1}^{J} \frac{1}{J} = 1 \tag{2}$$

When the CBC exercise is answered and the conventional LL maximized, the true LL becomes

$$\mathrm{LL} = \sum_{j=1}^{J} p_j \ln(p_j), \qquad \sum_{j=1}^{J} p_j = 1 \tag{3}$$

where p_j values are the estimated choice probabilities.

Values of these LL functions differ according to the state they represent.

LL_ini for the null hypothesis (eq. 2) has the lowest possible value.
LL for estimated values (eq. 3) must have higher value, otherwise the estimation would not be successful.
Value of the maximized conventional LL (eq. 1) is even higher since it does not include negative contributions from rejected alternatives.

As an aside

The maximized conventional LL is usually the only LL value available from estimation software. When averaged over tasks and exponentiated, it is known as root likelihood.

In problems with discrete phenomena, maximum likelihood estimation is actually minimization of information entropy as a measure of information disorder. Eq. 3 is the basic formula for the Shannon information entropy taken with a negative sign and expressed in nat units. To obtain entropy in the more comprehensible bit units, a value in nat units must be divided by ln(2) = 0.693147. For the given settings of a CBC exercise, we need entropy values representing two states, namely those before and after a choice.

Entropy of the initial state of a choice task is described by the null hypothesis and given as -LL_ini from eq. 2. Entropy of the final state of a choice task (eq. 3) is unknown. However, we can compute its limiting value, which corresponds to the highest achievable LL. This is obtained for a hypothetical alternative j chosen with probability p_j = 1. As the predicted probability for all other alternatives must be p_¬j = 0, the corresponding LL = 0, and so is the information entropy. The maximal gain of information H(J) obtainable from a CBC task with J alternatives is the difference between the two values. Since the entropy of the final state is set to zero, the obtainable information value depends only on the number of alternatives in the task.

H(J) = -LL_ini / ln(2)

(4)
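Because all J terms of the sum in eq. 2 are identical, eq. 4 reduces to a closed form. The short worked step below uses only the definitions given above:

$$\mathrm{LL}_{\mathrm{ini}} = J \cdot \frac{1}{J} \ln\!\left(\frac{1}{J}\right) = -\ln(J), \qquad H(J) = \frac{-\mathrm{LL}_{\mathrm{ini}}}{\ln(2)} = \frac{\ln(J)}{\ln(2)} = \log_2(J)$$

For example, a task with J = 4 alternatives gives H(4) = log_2(4) = 2 preference bits.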

The amount of information H(J) computed this way is as if the decision maker had an absolute preference among the J alternatives in each task. We call this value "preference bits" to distinguish it from the number of preferences mentioned above. Values of H(J) for choice set sizes of up to 10 alternatives are in the table below.

Number of preference bits
`J` - Alternatives in a CBC task:	2	3	4	5	6	7	8	9	10
`H`(`J`) - Obtainable preference bits:	1.00	1.58	2.00	2.32	2.58	2.81	3.00	3.17	3.32
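The tabulated values can be reproduced with a few lines of Python, assuming H(J) = log2(J), which is what eq. 4 gives when eq. 2 is evaluated for equal probabilities 1/J. The function name `preference_bits` is illustrative.

```python
import math


def preference_bits(n_alternatives):
    """Obtainable preference bits H(J) for a task with J alternatives."""
    # -LL_ini / ln(2) with LL_ini = -ln(J) is the base-2 logarithm of J.
    return math.log2(n_alternatives)


for j in range(2, 11):
    print(j, f"{preference_bits(j):.2f}")
```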

The information value grows with the number of alternatives in a task, but with a diminishing tendency. The values are easy to remember: doubling the number of alternatives adds 1 preference bit. Knowing these values, the power of a CBC exercise can be assessed for a given number of tasks and alternatives per task. A reasonable approach is to ensure the necessary number of preference bits per estimated parameter.

The existence of a maximum likelihood estimate requires parameter separability, finiteness and uniqueness. Separability is achieved by a choice from uncorrelated alternatives in a choice set, uniqueness by the linear model, and finiteness by Bayesian estimation. The simplest case of a single CBC choice set with two alternatives is equivalent to estimating 1 individual-based parameter by gaining 1 preference bit from the answer. When there are P parameters to be estimated with the same credibility from several choice tasks, it seems natural that the total information gain should be P preference bits. The minimal number T_min(J) of choice tasks, each with J orthogonal alternatives, should be

T_min(J) = P / H(J)

(5)

This result is intuitive and needs to be confirmed.
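Eq. 5 is simple enough to sketch directly. The function name `min_tasks` is an assumption, and H(J) is taken as log2(J) in line with eq. 4.

```python
import math


def min_tasks(n_parameters, n_alternatives):
    """Minimal number of choice tasks T_min(J) = P / H(J) from eq. 5."""
    return n_parameters / math.log2(n_alternatives)


# The two running examples, both with 4 alternatives per task:
print(min_tasks(24, 4))  # 6 attributes x 5 levels -> 30 - 6 = 24 parameters
print(min_tasks(15, 4))  # MXD with 16 statements  -> 16 - 1 = 15 parameters
```

In a real questionnaire a fractional result would be rounded up, for example with `math.ceil`.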

Proof

The quality of models with different numbers of parameters estimated with the ML method can be compared using an information criterion. The most often cited are the Akaike (AIC), Schwarz Bayesian (BIC) and Hannan-Quinn (HQC) criteria. In the format related to a single observation point, all the criteria can be written as

c(P) = -2 × LL(P) / n + φ × P / n

(6)

where n is the number of observations and φ is the penalty coefficient for the number of parameters P. The penalty coefficients are φ_AIC = 2 (for n → ∞), φ_BIC = ln(n), and φ_HQC = 2×ln(ln(n)). In our view of an individual having answered T(J) independent tasks with J alternatives, the log-likelihood is expressed in terms of information gain.

LL(P, J) = -T(J) × H(J) × ln(2)

(7)

Our aim is to obtain the minimal number of observations, i.e. the number of choice sets with J alternatives described with P parameters that would provide the same quality of parameter estimates as a choice from a choice set with 2 alternatives described with 1 parameter. For these two options we have these criteria:

`c`(1) = 2 × 2 × 1 × ln(2) / 1 + `φ` × 1 / 1	:	`P =` 1, J = 2, `n` = 1	(8)
`c`(`P`\|`J`) = 2 × `T`(`J`) × `H`(`J`) × ln(2) / `T`(`J`) + `φ` × `P` / `T`(`J`)	:	`n` = `T`(`J`)	(9)

Setting penalty coefficient to a constant φ = 2×ln(2) ≅ 1.386 and assuming the same quality of both estimates c(1) = c(P|J), eq. 5 is obtained.

The value of φ used here corresponds to n_BIC = 4 or n_HQC ≅ 7.39, both values being in the range used in marketing research practice for an individual.
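These two sample sizes follow from inverting the BIC and HQC penalty formulas for φ = 2×ln(2). A quick numerical check, with variable names chosen here for illustration:

```python
import math

phi = 2 * math.log(2)                  # penalty coefficient, about 1.386

# BIC: phi = ln(n)          ->  n = exp(phi)
n_bic = math.exp(phi)

# HQC: phi = 2 * ln(ln(n))  ->  n = exp(exp(phi / 2))
n_hqc = math.exp(math.exp(phi / 2))

print(f"phi   = {phi:.3f}")
print(f"n_BIC = {n_bic:.2f}")
print(f"n_HQC = {n_hqc:.2f}")
```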

Properties

The minimal number of CBC tasks estimated from eq. 5 for the former two examples (5^6 and 16^1 designs, 4 alternatives per task) is 12 and 7.5, respectively. These numbers are lower than those from the rules of thumb (15 and 8 by the longest attribute, 16 and 10 by the number of preferences). A more comprehensive comparison of the three rules is in the table below, where just saturated regular orthogonal arrays with strength 2 are used as examples.

Minimal number of CBC tasks estimated by three different rules
Orthogonal array OA(`N`, `L`^`A`)	Number of estimated parameters `P`	Number of CBC alternatives `J`	CBC tasks by longest attribute	CBC tasks by number of preferences	CBC tasks by preference bits
OA(1, 2^1)	1	2	2	2	1
OA(4, 2^3)	3	2	6	6	3
OA(8, 2^7)	7	2	14	14	7
OA(8, 2^7)	7	4	7	4.7	3.5
OA(9, 3^4)	8	3	8	8	5.0
OA(12, 2^11)	11	4	11	7.3	5.5
OA(16, 4^5)	15	4	10	10	7.5
OA(18, 3^7)	14	3	14	14	8.8
OA(25, 5^6)	24	5	12	12	10.3
OA(64, 4^21)	63	8	21	18	21
OA(81, 9^10)	80	9	20	20	25.2
OA(121, 11^12)	120	12	22	21.8	33.5

OA arguments

N = Minimal number of required observation points in estimation by linear regression
L = Length (cardinality) of all attributes
A = Number of attributes
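The comparison can be recomputed for any symmetric design with A attributes of L levels each, for which P = A × (L - 1). The sketch below is a hedged restatement of the three rules as described in this text. All function names are assumptions.

```python
import math


def rule_longest(levels, attributes, alternatives):
    """Rule of thumb by the longest attribute: each level shown twice."""
    return 2 * levels / alternatives * attributes


def rule_preferences(levels, attributes, alternatives):
    """Rule of thumb by number of preferences: 2 * P / (J - 1)."""
    parameters = attributes * (levels - 1)
    return 2 * parameters / (alternatives - 1)


def rule_bits(levels, attributes, alternatives):
    """Preference bits rule (eq. 5): P / log2(J)."""
    parameters = attributes * (levels - 1)
    return parameters / math.log2(alternatives)


# (L, A, J) for the designs listed in the table.
designs = [(2, 1, 2), (2, 3, 2), (2, 7, 2), (2, 7, 4), (3, 4, 3), (2, 11, 4),
           (4, 5, 4), (3, 7, 3), (5, 6, 5), (4, 21, 8), (9, 10, 9), (11, 12, 12)]

for L, A, J in designs:
    print(f"{L}^{A}, J={J}: "
          f"{rule_longest(L, A, J):.1f}  "
          f"{rule_preferences(L, A, J):.1f}  "
          f"{rule_bits(L, A, J):.1f}")
```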

For small CBC studies with 15 or fewer parameters, the rules of thumb (by the longest attribute and by the number of preferences) seem to suggest unnecessarily large numbers of tasks. For larger studies with not more than about 60 parameters and up to 5 alternatives in a choice set, the rules of thumb give only slightly higher numbers than the preference bits rule. If attributes are of substantially different lengths, the rule using the longest attribute length might be preferred.

The rules of thumb generate an insufficient number of tasks for CBC studies with more than 60 parameters or more than 5 alternatives in a choice set. Demand for such studies, nearly always with substantially differing attribute lengths, has become quite common. Huber and Zwerina (1996) show that choice design efficiency requires orthogonality, level balance, minimal overlap and utility balance. Even if these requirements are met in the standard CBC design, level part-worths of long attributes are always determined with lower precision than those of short attributes, simply because of their less frequent appearance in tasks. A remedy, known as soft constraints, is to create auxiliary blocks of choice questions for levels of long attributes and add them to the CBC tasks to get an extended set of tasks for estimation.

Using soft constraints is not without caveats. The design of choice tasks from each block of questions should be set so that no attribute (or some of its levels) prevails over other attributes. Because product utility in the standard DCM model is a sum of part-worths, precision and accuracy of all determined parameters in the study should be approximately equal. This can be achieved by setting the information gain from each contributing data block so that the required total information gain is balanced over all attributes. This approach is possible on the assumption of mutual independence of attributes (with no or negligible interactions), independence of choices and orthogonality of alternatives, which are the conditions essential for additivity of information gain from each choice. This idea was the main motivation for using the theory of information.

The standard method of DCM parameter estimation is the hierarchical Bayesian maximum likelihood method. The number of tasks obtained from eq. 5 should be viewed as a threshold under which the sample means will get excessive influence and sample covariances might get ill conditioned. It is important to realize that some substantive conditions such as known or expected narrow consideration set of items, typically classes of products, products with special features, etc., make the effective length of attributes shorter.

It is impossible to develop general instructions for estimating the effective length of an attribute. Proper knowledge of the studied problem is essential. Let us consider a simple brand-price study. If an individual is supposed to select at most 5 brands from the tested brands, the effective length of the brand attribute can be taken as 8 to have some reserve. If the brands belong to different price categories and/or several package sizes are tested, 5 CBC design levels (e.g. -10%, -5%, +/-0%, +5% and +10%) may span a broad range of prices, which might require about 20 (or more) distinct price values. Obtaining a separate part-worth for each of these values does not require an unconstrained estimate of each one. Price part-worths in a conjoint study can always be constrained, as nothing like "too cheap to be good" exists. The effective length of the price attribute can be guessed as a presumed "virtual" number of linear segments with which the curvature of part-worths plotted against the logarithm of price could be approximated. We usually set the effective length between 4 and 8.

In our experience, estimates with a satisfactory ability to discriminate between individuals are obtained with the number of tasks from eq. 5 if the effective length of attributes is respected. The discrimination can be improved by increasing the number of tasks by 50%. A higher increase is believed to have a negligible effect, but no systematic research has been done in this respect.
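As a rough illustration of how effective lengths enter eq. 5, the sketch below uses the brand-price example with a brand effective length of 8 as in the text. The price effective length of 6 (inside the 4 to 8 range) and the 5 alternatives per task are assumptions made only for this example, as are the function and parameter names.

```python
import math


def min_tasks_from_effective_lengths(effective_lengths, n_alternatives, uplift=0.0):
    """Tasks from eq. 5 with P taken from effective attribute lengths.

    effective_lengths -- effective number of levels per attribute
    n_alternatives    -- number of alternatives J per choice task
    uplift            -- optional relative increase, e.g. 0.5 for +50%
    """
    n_parameters = sum(effective_lengths) - len(effective_lengths)
    return (1 + uplift) * n_parameters / math.log2(n_alternatives)


lengths = [8, 6]  # brand: 8 (from the text); price: 6 (assumed)
J = 5             # alternatives per task (assumed)

print(math.ceil(min_tasks_from_effective_lengths(lengths, J)))       # threshold
print(math.ceil(min_tasks_from_effective_lengths(lengths, J, 0.5)))  # with +50%
```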

Usage

Information entropy has the advantage of being additive not only over a single exercise, but also over several choice exercises and, in principle, over the whole interviewed sample, provided the choices are independent and stochastic and the alternatives in choice sets are uncorrelated.

Application of the theory of information has allowed for some extensions and refinements of the DCM approach:

CSDCA - Common Scale Discrete Choice Analysis

PRIORS

MXD - Maximum Difference Scaling with the best choice only
Used as soft constraints for long attributes (more than about 12 levels)
SCE - Sequential Choice Exercise
Used as soft constraints for levels of short attributes (up to about 12 levels)

MOTIVATORS

Best-Worst Case 2 (Louviere et al., 1995)
A modified approach: H = H(J) + H(J-1), but requires some increase in the number of observations due to obvious answers to dominating attribute levels.

Acceptability thresholds (used in non-compensatory modeling)

Weight of prior thresholds obtained from PRIORS section is corrected according to the number of tasks in CBC section.
Refinement is done by choices made in CBC section.

CBC - Choice Based Conjoint

Attributes with ordinal levels
Number of effective levels is set to the required number of virtual linear segments.
Attributes with both nominal and ordinal levels
Number of effective levels is reduced by the number of virtual linear segments.

Class based design

A class can be represented by several design levels (i.e. the class is repeated) according to prior importance of the class, but without counting the additional levels as effective.
Overlapped property levels from different classes are not counted.

Product portfolio optimization

The persuasive value of a portfolio has been defined as the expected value of the portfolio minus the equivalent of information entropy considered to be responsible for the confusion of a potential customer.