Discrete Choice Modeling for CPG Line Extensions

discrete-choice-modeling · synthetic-data · consumer-behavior · CPG · simulation

AI Content Disclaimer: The first draft of this case study was created by my pair-coding agent from a review of my codebase plus a two-paragraph project outline I wrote (pure human content). Further context was added with a human-written prompt that detailed expectations, content hierarchy, and preferred markdown structure. The final draft is mostly human.

What Drives Pasta Kit Choices?

A Simulated Test of Consumer Preferences Using Discrete Choice Modeling

Overview

Product innovation, including coming up with ideas for brand and line extensions, was part of my responsibility as a brand manager. I’ve recreated one of my past campaigns at a CPG company using synthetic data.

To understand how my brand’s (not a pasta brand IRL) consumers evaluate pasta meal kits, I’d set up a discrete choice modeling (DCM) study using both internal sales data and IRI/Circana data. Since I couldn’t use proprietary data here, I built a synthetic behavioral study instead. The idea was simple: simulate realistic personas and present them with structured product choice sets to uncover which attributes drove preference and which ones turned people off.

While the original goal was to showcase DCM as a great way to identify new line extensions, building the simulation ended up being a fun project as well.

This case study shows how DCM can be used to distill findings into key takeaways and actionable insights for marketers and product teams. The statistical modeling isn’t covered in depth, but I might do a separate post on how I built the simulation for synthetic data.

The consumer ‘insights’ in this post are illustrative only and not applicable anywhere; you’d need to replicate this study with actual data before acting on them. Caveats and limitations are detailed toward the end.


Goal

Understand what meal kit combinations consumers are most likely to choose when presented with a variety of pasta-based options. The goal was to model trade-offs in preferences and estimate utility for individual product attributes such as:

  • Protein type
  • Pasta variety
  • Sauce base
  • Vegetables
  • Price

Methodology: The Not-So-Secret Sauce

Step 1: Synthetic Personas

I defined personas based on varying food preferences (e.g., plant-based, indulgent, protein-heavy). Each persona had a set of weighted preferences across attributes. You can use your existing data to build this out.

Sample Persona

{
    "id": "trendy_foodie",
    "name": "Trendy Foodie",
    "weights": {
      "Protein": {
        "Shrimp": 0.3,
        "Plant-Based Crumbles": 0.2,
        "Tofu": 0.2,
        "Grilled Chicken": 0.15,
        "Italian Sausage": 0.1,
        "Ground Beef": 0.05
      },
      "Sauce Base": {
        "Basil Pesto": 0.3,
        "Vodka Sauce": 0.25,
        "Garlic Butter": 0.2,
        "Creamy Alfredo": 0.15,
        "Tomato Marinara": 0.1
      }
    }
  }

Redacted for brevity, and because you get the idea.

Persona assignments are made at random via a weighted choice function, not stratified.
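As a sketch of what a weighted choice function can look like, here is a minimal version using Python’s standard library. The weights come from the sample persona above; the draw count and variable names are hypothetical:

```python
import random

# Weights for one attribute of the sample "trendy_foodie" persona
protein_weights = {
    "Shrimp": 0.3,
    "Plant-Based Crumbles": 0.2,
    "Tofu": 0.2,
    "Grilled Chicken": 0.15,
    "Italian Sausage": 0.1,
    "Ground Beef": 0.05,
}

random.seed(42)
# random.choices draws proportionally to the supplied weights
picks = random.choices(
    population=list(protein_weights.keys()),
    weights=list(protein_weights.values()),
    k=1000,
)

# With 1,000 draws the empirical shares land near the target weights
shares = {p: picks.count(p) / len(picks) for p in protein_weights}
```

With enough draws, each protein’s share converges to its weight, which is what makes the simulated population reflect the persona mix.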

Step 2: Experimental Design

Attributes Used for the Combinations

{
  "Pasta Type": ["Penne", "Fettuccine", "Rigatoni", "Cavatappi", "Shells", "Rotini"],
  "Protein": ["Grilled Chicken", "Italian Sausage", "Shrimp", "Ground Beef", "Tofu", "Plant-Based Crumbles"],
  "Sauce Base": ["Tomato Marinara", "Creamy Alfredo", "Basil Pesto", "Vodka Sauce", "Garlic Butter"],
  "Vegetables": ["Spinach", "Roasted Red Peppers", "Mushrooms", "Broccoli", "Zucchini", "None"],
  "Seasoning Profile": ["Mild Italian Herbs", "Spicy Arrabbiata", "Garlic-Heavy", "Black Pepper & Lemon", "Smoky Paprika"],
  "Cheese Option": ["Mozzarella", "Parmesan", "Ricotta", "No Cheese"],
  "Cooking Time": ["3 minutes", "5 minutes", "7 minutes"],
  "Price Point": ["$4.99", "$6.99", "$8.99"]
}

A full-factorial design is generated using the pyDOE2 package, producing every possible attribute combination. From this, a D-optimal subset of 4,500 profiles is selected using a Fedorov exchange algorithm to maximize design efficiency. These profiles are then globally shuffled and partitioned into choice sets of three options, with six sets presented to each simulated respondent.

Choice-set generation pseudocode:

from pyDOE2 import fullfact
import random

# attributes: dict mapping attribute name -> list of levels (defined above)
# Build full factorial design based on attribute levels
levels = [len(attributes[attr]) for attr in attributes]
design = fullfact(levels)  # each row is a vector of level indices (as floats)

# Map design indices back to attribute values
profiles = [
    {attr: attributes[attr][int(level)] for attr, level in zip(attributes, row)}
    for row in design
]

# Select D-optimal subset of profiles
# (Fedorov exchange algorithm applied here)

# Shuffle and partition into choice sets of size 3
random.seed(42)
random.shuffle(profiles)
choice_sets = [profiles[i:i+3] for i in range(0, len(profiles), 3)]
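The Fedorov exchange step left as a comment in the pseudocode can be sketched as a greedy point-exchange loop: swap a design row for a candidate row whenever the swap raises the D-criterion (the log-determinant of the information matrix X'X). This is a minimal illustration on a small hypothetical candidate set, not the actual implementation used in the study:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical candidate set: 200 effect-coded profiles with 6 columns
candidates = rng.choice([-1.0, 1.0], size=(200, 6))

def log_d_criterion(X):
    # log-determinant of the information matrix X'X; higher = more efficient
    sign, logdet = np.linalg.slogdet(X.T @ X)
    return logdet if sign > 0 else -np.inf

# Start from a random subset of the target size
n_select = 20
idx = list(rng.choice(len(candidates), size=n_select, replace=False))
initial = log_d_criterion(candidates[idx])

# Greedy point exchange: accept any swap that improves the D-criterion
improved = True
while improved:
    improved = False
    for i in range(n_select):
        current = log_d_criterion(candidates[idx])
        for j in range(len(candidates)):
            if j in idx:
                continue
            trial = idx.copy()
            trial[i] = j
            if log_d_criterion(candidates[trial]) > current + 1e-9:
                idx = trial
                improved = True
                break

design = candidates[idx]
```

The loop terminates because every accepted swap strictly increases a bounded criterion; production implementations add smarter candidate ordering and rank-one determinant updates for speed.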

Study Design

  • Persona assignments are made at random via a weighted choice function (not stratified).
  • Persona weight scales were normalized to standardize utility variances.

Grab your resident data scientist or Python expert if this isn’t your domain. Took me a hot second to figure this out since I hadn’t used this library before.

Step 3: Behavioral Simulation

Each persona is assigned to 6 randomized choice sets drawn from the master design. Each set contains 3 options.

Selections are made using a softmax-based probability model that weighs the attribute alignment between each option and the persona’s underlying preferences.

  • This mimics real-world decision-making under bounded rationality.
  • The simulation avoids deterministic choice and introduces variation across similar profiles.
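The softmax selection step described above can be sketched as follows. The utility scores and the temperature parameter are hypothetical stand-ins for the persona-alignment scores in the actual simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_choice(utilities, temperature=1.0):
    """Choose one option index with probability proportional to exp(u / T)."""
    u = np.asarray(utilities, dtype=float) / temperature
    u -= u.max()                       # subtract max for numerical stability
    probs = np.exp(u) / np.exp(u).sum()
    return int(rng.choice(len(probs), p=probs)), probs

# Hypothetical alignment scores between one persona and a 3-option choice set
choice, probs = softmax_choice([0.9, 0.4, 0.1])
```

A lower temperature makes choices nearly deterministic; a higher one adds noise, which is what mimics bounded rationality and keeps similar profiles from always winning.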

Step 4: Conditional Logit Approximation with Multinomial Logit

I analyzed the simulated dataset with a conditional-logit-style model, accounting for choice set clustering. The input was a one-hot encoded matrix of all product attributes with drop_first=True, ensuring no multicollinearity.
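As a sketch of that encoding step (column names and rows are hypothetical), pandas’ get_dummies with drop_first=True drops one reference level per attribute, so every coefficient is read relative to that baseline:

```python
import pandas as pd

# Tiny hypothetical slice of the simulated choice data
df = pd.DataFrame({
    "Pasta Type": ["Penne", "Rotini", "Rotini"],
    "Price Point": ["$4.99", "$8.99", "$6.99"],
})

# drop_first=True removes the first (alphabetically sorted) level of each
# attribute, making it the implicit baseline for the remaining dummy columns
X = pd.get_dummies(df, drop_first=True)
print(X.columns.tolist())
# ['Pasta Type_Rotini', 'Price Point_$6.99', 'Price Point_$8.99']
```

Here Penne and $4.99 become the baselines, which is why the results later report effects like “Rotini vs. baseline” and “$8.99 vs. $4.99.”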

In discrete choice modeling, a conditional logit explicitly models choices conditional on the choice set, accounting for correlation among alternatives within the same set. In contrast, a multinomial logit (MNL) assumes independence of irrelevant alternatives (IIA) and independent errors across observations, which can be violated in panel or choice set data.

To approximate the conditional logit structure, I fit an MNL model using statsmodels.MNLogit with clustered standard errors by respondent ID. This clustering accounts for within-respondent correlation in choices, providing robust inference on attribute utilities despite the MNL’s IID error assumption.

This approach was chosen because statsmodels’ native ConditionalLogit requires a specialized panel‐style data format and does not directly support cluster‐robust error estimation, making it cumbersome for choice‐set data. Using MNLogit with clustering offers a straightforward interface while still correcting for within‐individual dependence and yielding valid standard errors in a panel context.

Demonstrative Python code snippet with detailed comments:

import pandas as pd
import statsmodels.api as sm

# Load preprocessed data containing one-hot encoded attributes
df = pd.read_csv("dcm_input.csv")

# Exclude ID and meta columns to isolate feature matrix X and target y
exclude_cols = ["respondent_id", "persona_id", "persona", "choice_set_id", "option_id", "chosen"]
X = df.drop(columns=exclude_cols)  # Features: one-hot encoded attributes with drop_first=True to avoid multicollinearity
y = df["chosen"]  # Binary indicator of chosen alternative

# Add intercept term to model utility baseline
X = sm.add_constant(X, has_constant='add')

# Define clustering groups by respondent to account for repeated choices within individuals
groups = df["respondent_id"]

# Fit multinomial logit model with clustered standard errors to approximate conditional logit
model = sm.MNLogit(y, X)
result = model.fit(maxiter=100, cov_type='cluster', cov_kwds={'groups': groups})

# Output model summary with robust standard errors
print(result.summary())

Modeling objectives:

  • Estimate part-worth utilities for each attribute level.
  • Identify statistically significant drivers of choice behavior.
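To make part-worth utilities interpretable, the log-odds coefficients from the fit can be exponentiated into odds ratios. The values below are the coefficients reported in the Key Insights table further down, used purely for illustration (column names assume the drop_first encoding):

```python
import numpy as np

# Part-worth utilities (log-odds scale) from the Key Insights table
part_worths = {
    "Pasta Type_Rotini": 0.256,
    "Sauce Base_Tomato Marinara": 0.241,
    "Vegetables_Mushrooms": 0.209,
    "Price Point_$8.99": -0.192,
}

# exp(coefficient) = multiplicative change in choice odds vs. the baseline level
odds_ratios = {k: round(float(np.exp(v)), 3) for k, v in part_worths.items()}
# e.g. Rotini raises the odds of selection by roughly 29% vs. the dropped shape
```

Odds ratios above 1 indicate a level that lifts choice probability relative to its baseline; below 1 (like the $8.99 price) indicates a deterrent.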



Key Insights

| Attribute | Effect | Interpretation |
|---|---|---|
| Pasta Type | Rotini (+0.256, p=0.015) | Rotini significantly increases choice odds |
| Sauce Base | Tomato Marinara (+0.241, p=0.020) | Tomato Marinara drives higher utility |
| Vegetables | Mushrooms (+0.209, p=0.043) | Mushrooms increase selection probability |
| Price Point | $8.99 (–0.192, p=0.018) | Highest price deters purchase |

Note: AI pair-coding agent was given full freedom to develop the markdown structure for key insights table because I hate making tables in markdown.


Insights

  • Rotini stands out as the only pasta type with a significant positive utility.
  • Tomato Marinara is the only sauce with a statistically significant positive effect.
  • Mushrooms significantly boost choice probability relative to “None.”
  • The $8.99 price point is significantly less preferred than the baseline ($4.99); $6.99 is not significant.
  • All other attributes (protein type, seasoning, cheese option, cooking time) showed no significant effects.

Managerial Implications

  • Product Design:
    Emphasize Rotini as a core pasta shape in line extensions. Feature Tomato Marinara and mushrooms prominently; other sauces and vegetables can be secondary.

  • Pricing:
    Avoid premium pricing at $8.99; $6.99 performs comparably to the baseline and may serve as a sweet spot.

  • Differentiation:
    Deprioritize differentiation by protein type, seasoning profile, cheese option, and cooking time, as none significantly impacted choices.

  • Synthetic Personas for Testing:
    Use simulation models to evaluate design decisions before launching market research.


Limitations

  • The results are based on simulated responses, not real-world surveys.
  • Personas were assigned at random via a weighted choice function, without stratification across demographic segments.
  • The MNL model with clustered standard errors is an approximation of a true conditional logit and may not fully capture all choice-set correlations.
  • Zero-variance columns were automatically removed during one-hot encoding, which could drop rare attribute levels.
  • Persona weight scales were normalized to standardize utility variances. (Note: This normalization was applied to reduce heterogeneity in utility scales across personas, improving comparability.)
  • Consumers’ actual behaviors may shift due to brand loyalty, emotion, or context.
  • The model assumes consistent preference expression, which may vary in practice.

What’s Next

  • Willingness-to-Pay (WTP):
    WTP was built into the study when I ran it with actual data, but I wasn’t sure I was building it correctly with the synthetic data. Without WTP, this analysis is technically incomplete: you’d need those numbers to build projections for sales, market share, and other predictive metrics. I’ll update this post if and when I run it.

  • Latent Class Modeling (LCM):
    LCM can be used to uncover hidden preference segments across simulated respondents.

  • Validation with Real Data:
    Plan a follow-up pilot with real participants to compare predictive alignment.
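On the WTP point above: the standard calculation for a logit model divides an attribute’s utility by the marginal disutility of price. Note this requires price to enter utility linearly (per dollar) rather than as the dummy-coded levels used in this study, and both coefficients below are illustrative, not study estimates:

```python
# Illustrative coefficients only; not estimates from this study
beta_attribute = 0.256  # utility of an attribute level vs. its baseline
beta_price = -0.15      # utility change per $1 increase in price (must be < 0)

# WTP: dollars of price increase that exactly offsets the attribute's utility gain
wtp = beta_attribute / -beta_price
print(f"Implied WTP: ${wtp:.2f}")
```

With these numbers the implied WTP is about $1.71, i.e. a respondent would pay roughly $1.71 more for that attribute level before being indifferent.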


Final Note

This experiment demonstrates how simulation + choice modeling can fast-track product and pricing insights. It offers a powerful framework for early-stage testing and hypothesis development, especially when budgets are tight or real user data is limited. It’s not a replacement for real data, though.