doordash case study
inspired by jessica lachs and matheus facure alves
do refunds save customer ltv to some degree?
confounding!
[diagram: min late → refund and min late → ltv; order lateness confounds the relationship between refund and ltv]
order lateness determines treatment 👉 cutoff: 30 minutes
look for a jump!
running variable: min late
cutoff: 30 min
treatment (refunded): 1 - yes; 0 - no
near the cutoff, treatment assignment is "as good as random"
model: ltv = b0 + b1 * (min_late - cutoff) + b2 * refunded + b3 * (min_late - cutoff) * refunded
untreated: intercept at cutoff = b0
treated: intercept at cutoff = b0 + b2
treatment effect: the jump at the cutoff = b2
interaction (b3): min late may affect ltv differently on each side
import numpy as np
import pandas as pd

def generate_dataset(n, std, b0, b1, b2, b3, lower=0, upper=60, cutoff=30):
    """generate customer LTV under given order lateness and refund status"""
    # generate running variable, treatment status, and errors
    min_late = np.random.uniform(lower, upper, n)
    was_refunded = np.where(min_late < cutoff, 0, 1)
    errors = np.random.normal(0, std, n)
    # use above data to "predict" customer LTV
    ltv = (
        b0
        + b1 * (min_late - cutoff)
        + b2 * was_refunded
        + b3 * (min_late - cutoff) * was_refunded
        + errors
    )
    # return results in a dataframe
    df_late = pd.DataFrame({"min_late": min_late, "ltv": ltv})
    df_late["min_late_centered"] = df_late["min_late"] - cutoff
    df_late["refunded"] = df_late["min_late"].apply(lambda x: 1 if x >= cutoff else 0)
    return df_late
# DATA PREPARATION
# generate data on 2,000 late orders
df_late = generate_dataset(n=2000, std=10, b0=50, b1=-0.8, b2=10, b3=-0.1)
df_late.head()
modeling
data viz
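one way to "look for a jump" before modeling (a sketch, not from the original deck; matplotlib and the 2-minute bin width are assumptions here): bin orders by minutes late and plot the mean ltv per bin around the 30-minute cutoff.
import matplotlib.pyplot as plt

# bin orders by minutes late and plot mean ltv per bin
bins = np.arange(0, 61, 2)
binned = df_late.groupby(pd.cut(df_late["min_late"], bins), observed=True)["ltv"].mean()
bin_centers = [interval.mid for interval in binned.index]

plt.scatter(bin_centers, binned.values)
plt.axvline(30, linestyle="--")  # cutoff: 30 minutes late
plt.xlabel("min late")
plt.ylabel("mean ltv")
plt.title("look for a jump at the cutoff")
plt.show()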
import statsmodels.formula.api as smf
# fit model
model_all_data = smf.wls("ltv ~ min_late_centered * refunded", df_late).fit()
# model summary
model_all_data.summary().tables[1]
# USING ALL DATA
treatment effect: 9.49
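the 9.49 is the coefficient on the refunded indicator, i.e. the estimated jump in ltv at the cutoff; it can be pulled out of the fitted model directly (a sketch using standard statsmodels attributes):
# the treatment effect is the coefficient on the refunded indicator
effect = model_all_data.params["refunded"]
ci_low, ci_high = model_all_data.conf_int().loc["refunded"]
print(f"treatment effect: {effect:.2f} (95% ci: {ci_low:.2f} to {ci_high:.2f})")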
def kernel(min_late, cutoff, bandwidth):
    """assign weight to each data point (≈0 if outside bandwidth)"""
    # assign weight based on distance to cutoff
    weight = np.where(
        np.abs(min_late - cutoff) <= bandwidth,
        (1 - np.abs(min_late - cutoff) / bandwidth),
        0.0000000001,  # ≈ 0 outside bandwidth (not 0 to avoid division by 0)
    )
    # return weight of each data point
    return weight

df_late["weight"] = df_late["min_late"].apply(lambda x: kernel(x, 30, 5))
# KERNEL WEIGHTING
the weight column indicates whether a point falls within the bandwidth
# fit model using weighted data
model_kernel_weight = smf.wls(
    "ltv ~ min_late_centered * refunded",
    df_late,
    weights=df_late["weight"],
).fit()
# model summary
model_kernel_weight.summary().tables[1]
treatment effect: 8.41
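comparing the two estimates side by side (a sketch, not from the deck): the kernel focuses the fit on orders near 30 minutes, which typically costs some precision but reduces the influence of orders far from the cutoff.
# compare the estimated jump at the cutoff from both fits
for name, model in [("all data", model_all_data), ("kernel weighted", model_kernel_weight)]:
    print(f"{name}: effect = {model.params['refunded']:.2f}, se = {model.bse['refunded']:.2f}")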
# OTHER NATURAL EXPERIMENTS
stanford open policing project (pierson et al., 2020)
[diagram: birth season → schooling → income, with confounders affecting both schooling and income]
the trick: a variable we can easily find, but don't care about; scale its effect to estimate how treatment affects outcome
such a variable:
1. causally impacts treatment
2. is correlated with outcome
3. is unrelated to confounders
[diagram: a second example with education, income, intelligence, and grades]
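the "scale its effect" step is the instrumental-variables (Wald) idea: divide the variable's effect on the outcome by its effect on the treatment. a toy sketch on simulated data (variable names and coefficients here are illustrative, not from the deck):
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000
ability = rng.normal(size=n)                 # unobserved confounder
instrument = rng.binomial(1, 0.5, size=n)    # e.g. a birth-season indicator
schooling = 12 + 2 * instrument + ability + rng.normal(size=n)
income = 20 + 3 * schooling + 5 * ability + rng.normal(size=n)
df = pd.DataFrame({"instrument": instrument, "schooling": schooling, "income": income})

# naive regression is biased by the confounder
naive = smf.ols("income ~ schooling", df).fit().params["schooling"]

# wald estimator: reduced form / first stage
reduced_form = smf.ols("income ~ instrument", df).fit().params["instrument"]
first_stage = smf.ols("schooling ~ instrument", df).fit().params["instrument"]
iv_estimate = reduced_form / first_stage

print(f"true effect: 3, naive: {naive:.2f}, iv: {iv_estimate:.2f}")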
books + papers + talks
an example syllabus...