Discrete Choice Modeling for CPG Line Extensions
AI Content Disclaimer: The first draft of this case study was created using my pair-coding agent's review of my codebase plus a two-paragraph project outline I wrote (pure human content). Further context was added with a prompt (human) that detailed expectations, content hierarchy, and preferred markdown structure. The final draft is mostly human.
What drives Pasta Kit Choices?
A Simulated Test of Consumer Preferences Using Discrete Choice Modeling
Overview
Product innovation, including coming up with ideas for brand/line extensions, was part of my responsibility as a brand manager. I've recreated one of my past campaigns with a CPG company using synthetic data.
To understand how my brand's (not a pasta brand IRL) consumers evaluate pasta meal kits, I'd set up a discrete choice modeling (DCM) study using both internal sales data and IRI/Circana data. Since I couldn't use proprietary data here, I built a synthetic behavioral study. The idea was simple: simulate realistic personas and present them with structured product choice sets to uncover which attributes drove preferences and which ones turned people off.
While the original goal was to showcase DCM as a great way to identify new line extensions, building the simulation ended up being a fun project as well.
This case study shows how DCM can distill findings into key takeaways and actionable insights for marketers and product teams. The statistical modeling isn't covered in depth, but I might write a separate post on how I built the simulation for synthetic data.
The consumer "insights" in this post are not applicable anywhere; you'd need to replicate this study with actual data. Caveats, limitations, etc. are detailed toward the end.
Goal
Understand what meal kit combinations consumers are most likely to choose when presented with a variety of pasta-based options. The goal was to model trade-offs in preferences and estimate utility for individual product attributes such as:
- Protein type
- Pasta variety
- Sauce base
- Vegetables
- Price
Methodology: The Not-So-Secret Sauce
Step 1: Synthetic Personas
I defined personas based on varying food preferences (e.g., plant-based, indulgent, protein-heavy). Each persona had a set of weighted preferences across attributes. You can use your existing data to build this out.
Sample Persona
```json
{
  "id": "trendy_foodie",
  "name": "Trendy Foodie",
  "weights": {
    "Protein": {
      "Shrimp": 0.3,
      "Plant-Based Crumbles": 0.2,
      "Tofu": 0.2,
      "Grilled Chicken": 0.15,
      "Italian Sausage": 0.1,
      "Ground Beef": 0.05
    },
    "Sauce Base": {
      "Basil Pesto": 0.3,
      "Vodka Sauce": 0.25,
      "Garlic Butter": 0.2,
      "Creamy Alfredo": 0.15,
      "Tomato Marinara": 0.1
    }
  }
}
```
Redacted for brevity, and because you get the idea.
Respondents are assigned to personas at random via a weighted choice function rather than stratified sampling, as in the sketch below.
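A minimal sketch of that assignment, assuming an illustrative persona pool and Python's `random.choices` (the names and respondent count here are hypothetical, not my exact code):

```python
import random

# Hypothetical persona pool; these weights are assignment probabilities,
# not the attribute weights inside each persona's JSON definition.
personas = ["trendy_foodie", "plant_based", "indulgent", "protein_heavy"]
assignment_weights = [1, 1, 1, 1]  # uniform weights in this sketch

random.seed(42)

# Assign each simulated respondent a persona via a weighted choice function
respondents = {
    f"resp_{i:03d}": random.choices(personas, weights=assignment_weights, k=1)[0]
    for i in range(300)
}
```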
Step 2: Experimental Design
Attributes Used for the Combinations
```json
{
  "Pasta Type": ["Penne", "Fettuccine", "Rigatoni", "Cavatappi", "Shells", "Rotini"],
  "Protein": ["Grilled Chicken", "Italian Sausage", "Shrimp", "Ground Beef", "Tofu", "Plant-Based Crumbles"],
  "Sauce Base": ["Tomato Marinara", "Creamy Alfredo", "Basil Pesto", "Vodka Sauce", "Garlic Butter"],
  "Vegetables": ["Spinach", "Roasted Red Peppers", "Mushrooms", "Broccoli", "Zucchini", "None"],
  "Seasoning Profile": ["Mild Italian Herbs", "Spicy Arrabbiata", "Garlic-Heavy", "Black Pepper & Lemon", "Smoky Paprika"],
  "Cheese Option": ["Mozzarella", "Parmesan", "Ricotta", "No Cheese"],
  "Cooking Time": ["3 minutes", "5 minutes", "7 minutes"],
  "Price Point": ["$4.99", "$6.99", "$8.99"]
}
```
A full-factorial design is generated with the `pyDOE2` package, enumerating all possible attribute combinations. From this, a D-optimal subset of 4,500 profiles is selected using a Fedorov exchange algorithm to maximize design efficiency. These profiles are then globally shuffled and partitioned into choice sets of three options each; every simulated respondent receives six sets.
Choice-set generation pseudocode:
```python
from pyDOE2 import fullfact
import random

# `attributes` is the dict of attribute levels shown above
levels = [len(attributes[attr]) for attr in attributes]

# Build the full factorial design; each row is a vector of level indices
design = fullfact(levels)

# Map design indices back to attribute values (fullfact returns floats)
profiles = [
    {attr: attributes[attr][int(level)] for attr, level in zip(attributes, row)}
    for row in design
]

# Select a D-optimal subset of profiles
# (Fedorov exchange algorithm applied here; see the sketch below)

# Shuffle and partition into choice sets of 3 options each
random.seed(42)
random.shuffle(profiles)
choice_sets = [profiles[i:i + 3] for i in range(0, len(profiles), 3)]
```
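The Fedorov exchange step is only stubbed in the snippet above. As a hedged sketch of the underlying idea, assuming a one-hot coded candidate matrix `X_all` and using det(X'X) as the D-optimality criterion (this is an illustrative greedy exchange, not pyDOE2 functionality or my exact implementation):

```python
import numpy as np

def d_criterion(X):
    """D-optimality criterion: determinant of the information matrix X'X."""
    return np.linalg.det(X.T @ X)

def fedorov_exchange(X_all, n_select, n_iter=1000, seed=42):
    """Greedy Fedorov-style exchange: start from a random subset and swap a
    selected row for a candidate row whenever the swap improves det(X'X)."""
    rng = np.random.default_rng(seed)
    selected = rng.choice(len(X_all), size=n_select, replace=False)
    pool = np.setdiff1d(np.arange(len(X_all)), selected)
    best = d_criterion(X_all[selected])
    for _ in range(n_iter):
        i = rng.integers(n_select)   # position in the selected subset
        j = rng.integers(len(pool))  # candidate profile to swap in
        trial = selected.copy()
        trial[i], swapped_out = pool[j], selected[i]
        score = d_criterion(X_all[trial])
        if score > best:             # keep the swap only if D improves
            best, selected, pool[j] = score, trial, swapped_out
    return selected
```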
Study Design
- Respondents are assigned to personas at random via a weighted choice function (sketched in Step 1), not stratified sampling.
- Persona weight scales were normalized to standardize utility variances across personas.
Grab your resident data scientist or Python expert if this isn't your domain. It took me a hot second to figure this out since I hadn't used this library before.
Step 3: Behavioral Simulation
Each simulated respondent is assigned 6 randomized choice sets drawn from the master design; each set contains 3 options.
Selections are made using a softmax-based probability model that weighs the attribute alignment between each option and the persona's underlying preferences (see the sketch after the list below).
- This mimics real-world decision-making under bounded rationality.
- The simulation avoids deterministic choice and introduces variation across similar profiles.
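A minimal sketch of that choice rule, assuming a hypothetical `persona_utility` helper that sums the persona's attribute weights for a profile (function and argument names are illustrative, not my exact code):

```python
import numpy as np

def persona_utility(profile, weights):
    """Sum the persona's weight for each attribute level in the profile;
    levels the persona has no weight for contribute zero utility."""
    return sum(weights.get(attr, {}).get(level, 0.0) for attr, level in profile.items())

def simulate_choice(choice_set, weights, temperature=1.0, rng=None):
    """Pick one option from a choice set via softmax over persona utilities.
    Lower temperature -> nearly deterministic; higher -> noisier choices."""
    rng = rng or np.random.default_rng()
    utilities = np.array([persona_utility(p, weights) for p in choice_set])
    exp_u = np.exp((utilities - utilities.max()) / temperature)  # numerically stable softmax
    probs = exp_u / exp_u.sum()
    return rng.choice(len(choice_set), p=probs)  # index of the chosen option
```

The temperature term is what keeps the simulation from being deterministic: identical personas can still make different picks across similar profiles.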
Step 4: Conditional Logit Approximation with Multinomial Logit
I used a Conditional Logit Model to analyze the simulated dataset, accounting for choice set clustering. The input was a one-hot encoded matrix of all product attributes built with `drop_first=True`, ensuring no perfect multicollinearity among the dummies.
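That encoding step, sketched with pandas (the raw file name and long-format layout are assumptions about the preprocessing, not documented code):

```python
import pandas as pd

attribute_cols = ["Pasta Type", "Protein", "Sauce Base", "Vegetables",
                  "Seasoning Profile", "Cheese Option", "Cooking Time", "Price Point"]

# One row per respondent x choice set x option, with a binary `chosen` flag
raw = pd.read_csv("dcm_long.csv")

# drop_first=True drops one reference level per attribute, so each dummy's
# coefficient is a part-worth relative to that attribute's baseline level
encoded = pd.get_dummies(raw, columns=attribute_cols, drop_first=True, dtype=float)
encoded.to_csv("dcm_input.csv", index=False)
```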
In discrete choice modeling, a conditional logit explicitly models choices conditional on the choice set, accounting for correlation among alternatives within the same set. In contrast, a multinomial logit (MNL) assumes independence of irrelevant alternatives (IIA) and independent errors across observations, which can be violated in panel or choice set data.
To approximate the conditional logit structure, I fit an MNL model using `statsmodels.MNLogit` with standard errors clustered by respondent ID. This clustering accounts for within-respondent correlation in choices, providing robust inference on attribute utilities despite the MNL's IID error assumption.
This approach was chosen because statsmodels' native `ConditionalLogit` requires a specialized panel-style data format and does not directly support cluster-robust error estimation, making it cumbersome for choice-set data. Using `MNLogit` with clustering offers a straightforward interface while still correcting for within-individual dependence and yielding valid standard errors in a panel context.
Demonstrative Python code snippet with detailed comments:
```python
import pandas as pd
import statsmodels.api as sm

# Load preprocessed data containing the one-hot encoded attributes
df = pd.read_csv("dcm_input.csv")

# Exclude ID and meta columns to isolate the feature matrix X and target y
exclude_cols = ["respondent_id", "persona_id", "persona", "choice_set_id", "option_id", "chosen"]
X = df.drop(columns=exclude_cols)  # one-hot encoded attributes (drop_first=True upstream)
y = df["chosen"]                   # binary indicator of the chosen alternative

# Add an intercept term to model the baseline utility
X = sm.add_constant(X, has_constant='add')

# Cluster by respondent to account for repeated choices within individuals
groups = df["respondent_id"]

# Fit a multinomial logit with clustered standard errors to approximate a conditional logit
model = sm.MNLogit(y, X)
result = model.fit(maxiter=100, cov_type='cluster', cov_kwds={'groups': groups})

# Output the model summary with robust standard errors
print(result.summary())
```
Modeling objectives:
- Estimate part-worth utilities for each attribute level (a short extraction sketch follows this list).
- Identify statistically significant drivers of choice behavior.
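A sketch of that extraction, assuming the binary-outcome `MNLogit` fit above (where `params` and `pvalues` carry a single equation column):

```python
import pandas as pd

# Collect coefficients and cluster-robust p-values into one tidy frame
partworths = pd.DataFrame({
    "part_worth": result.params.iloc[:, 0],
    "p_value": result.pvalues.iloc[:, 0],
})

# Flag statistically significant drivers at the 5% level
partworths["significant"] = partworths["p_value"] < 0.05
print(partworths.sort_values("part_worth", ascending=False))
```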
Key Insights
| Attribute | Effect | Interpretation |
|---|---|---|
| Pasta Type | Rotini (+0.256, p=0.015) | Rotini significantly increases choice odds |
| Sauce Base | Tomato Marinara (+0.241, p=0.020) | Tomato Marinara drives higher utility |
| Vegetables | Mushrooms (+0.209, p=0.043) | Mushrooms increase selection probability |
| Price Point | $8.99 (−0.192, p=0.018) | Highest price deters purchase |
Note: The AI pair-coding agent was given full freedom to develop the markdown structure for the key insights table because I hate making tables in markdown.
Insights
- Rotini stands out as the only pasta type with a significant positive utility.
- Tomato Marinara is the only sauce with a statistically significant positive effect.
- Mushrooms significantly boost choice probability relative to "None."
- The $8.99 price point is significantly less preferred than the baseline ($4.99); $6.99 is not significant.
- All other attributes (protein type, seasoning, cheese option, cooking time) showed no significant effects.
Managerial Implications
- Product Design: Emphasize Rotini as a core pasta shape in line extensions. Feature Tomato Marinara and mushrooms prominently; other sauces and vegetables can be secondary.
- Avoid premium pricing at $8.99; $6.99 performs comparably to the baseline and may serve as a sweet spot.
- Deprioritize differentiation by protein type, seasoning profile, cheese option, and cooking time, as none significantly impacted choices.
- Synthetic Personas for Testing: Use simulation models to evaluate design decisions before launching market research.
Limitations
- The results are based on simulated responses, not real-world surveys.
- Respondents were assigned to personas at random via a weighted choice function, without stratification across demographic segments.
- The MNL model with clustered standard errors is an approximation of a true conditional logit and may not fully capture all choice-set correlations.
- Zero-variance columns were automatically removed during one-hot encoding, which could drop rare attribute levels.
- Persona weight scales were normalized to standardize utility variances. This normalization reduces heterogeneity in utility scales across personas, improving comparability.
- Consumers' actual behaviors may shift due to brand loyalty, emotion, or context.
- The model assumes consistent preference expression, which may vary in practice.
Whatâs Next
- Willingness-to-Pay (WTP): This was built into the study when I ran it with actual data, but I wasn't sure I was building it correctly with the synthetic data. Without WTP this is technically incomplete, since you'd need those numbers to build projections for sales, market share, and other predictive metrics. I'll update this post if and when I run it. (The standard formula is sketched after this list.)
- Latent Class Modeling (LCM): LCM can be used to uncover hidden preference segments across simulated respondents.
- Validation with Real Data: Plan a follow-up pilot with real participants to compare predictive alignment.
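For reference, the conventional WTP calculation divides each attribute part-worth by the negative of a linear price coefficient: WTP_k = −β_k / β_price. A hedged sketch, assuming a refit where price enters as a numeric regressor (the `price_usd` column name is hypothetical; the dummy-coded price points above would need converting first):

```python
# WTP_k = -beta_k / beta_price, in dollars per attribute level
params = result.params.iloc[:, 0]   # single-equation MNLogit fit
beta_price = params["price_usd"]    # hypothetical numeric price term
wtp = -params.drop("price_usd") / beta_price
print(wtp.round(2).sort_values(ascending=False))
```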
Final Note
This experiment demonstrates how simulation plus choice modeling can fast-track product and pricing insights. It offers a powerful framework for early-stage testing and hypothesis development, especially when budgets are tight or real user data is limited. It's not a replacement for real data, though.