
Causal Inference in Campaign Targeting


The following is one of two posts published alongside the JustCause framework, which we developed at inovex as a tool to foster good scientific practice in the field of Causal Inference. If you are not familiar with the field yet, consider reading the first article on the topic, which gives a high-level conceptual overview and also dives into the theory behind treatment effect estimation in more depth. Here, I will work through a synthetic example to show the efficacy of causal inference in campaign targeting.

Treatment effect estimation is a field of research spread across a wide range of industries. From medicine, where the name naturally makes sense, to the social sciences and econometrics, treatments can be found and studied in many places. In all of these scenarios we are essentially interested in estimating the effect some form of treatment has on a user, patient or group. In this post you will learn why this is difficult and how it can be done, using a practical example. When you're done reading, you should have a better understanding of where and why it makes sense to use Causal Inference and how it helps to model a specific sort of problem.

The Campaign Targeting Use-Case

To make the topic more tangible, I want to work through a more or less realistic example: a marketing campaign.

Underlying Data

Imagine we're running an online shop and have a user database with roughly 40,000 entries. We've collected some features for each of them. To be honest, these are not the features you'd expect an online shop to collect. In fact, we're just using this data because it's part of the public UCI Machine Learning Repository. Still, for the sake of the example, let's imagine that we had run a marketing campaign last month targeting some of these users according to a hunch of the marketing department. They figured that people with a higher balance generally spend more. Thus they were convinced that it would be favourable to target these customers.

Head of the pd.DataFrame containing cleaned data from the banking dataset.
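If you want to follow along, loading and trimming the data might look like the following sketch. The file name bank-full.csv, the semicolon separator and the column selection are assumptions based on the public dataset, not necessarily what the notebook uses:

```python
import pandas as pd

# The UCI Bank Marketing data ships as a semicolon-separated CSV;
# the file name may differ depending on which variant you download.
df = pd.read_csv("bank-full.csv", sep=";")

# Keep only the user features needed for this example.
df = df[["age", "job", "marital", "education", "balance"]]
print(df.head())
```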

Formalising Treatments

Before continuing our quest to outdo the marketing team with some simple causal inference, let's formalise the problem.

We model a user \(i\) with their \(d\) features \(X_i = (x_{i1}, \ldots, x_{id})\). The so-called treatment \(T_i\) for user \(i\) is, in our case, whether or not the user received marketing in the last campaign. Specifically, \(T_i = 1\) if the user is among the 10,000 users with the highest balance and \(T_i = 0\) otherwise. Note that we omit the index \(i\) for brevity when describing the distributions below.

What we are interested in is the outcome \(Y\): the spending of the user in the online shop in the month after the campaign. Following the Potential Outcomes framework of Neyman & Rubin, the treatment effect \(\tau_i\) of user \(i\) is defined as the difference between the potential outcome \(Y_i(1)\), had the user received marketing, and the potential outcome \(Y_i(0)\), had they not received it:

\(\tau(x_i) = \tau_i = Y_i(1) - Y_i(0)\).

The two outcomes \(Y(1)\) and \(Y(0)\) are potential in the sense that only one of them is ever realised—factual—while the other remains unobserved, or as Pearl would say, counterfactual.

We denote further by \(Y_{cf}\) the outcomes that are unobserved and by \(Y_f\) or simply \(Y\) the observed outcomes, which are determined by

\(Y_i = Y_i(1) \cdot T_i + Y_i(0) \cdot (1-T_i)\)

The fact that we only ever observe one of the two outcomes for each unit \(i\) is called the Fundamental Problem of Causal Inference (FPCI) by Paul Holland, and it is this FPCI that forces us to use synthetic data in this example: in order to show you, dear reader, the efficacy of the proposed methods for estimating \(\tau_i\), we need ground truth, which is never available for real data. Thus we go ahead and generate outcomes based on a model we make up.

Modelling the Data

Let’s say we model our user behaviour and outcomes as follows:

\(Y(0) = \frac{(85 - X_{age})^2}{5} + I_{manager} \cdot 150 + \mathrm{MinMax}_{(-1000,\,10000)}(X_{balance})\)

where \(I_{manager}\) is an indicator function for the job feature, yielding one if and only if the feature equals 'management', and \(\mathrm{MinMax}\) is a scaling function squashing the values into the range from -1,000 to +10,000. Conceptually, our outcome is the purchase volume in our online shop in the month after the campaign. The treatment effect is thus the difference in a customer's spending depending on whether or not they received marketing.

The intuition we want to model with this simple combination of features is that young people generally spend more, managers spend more than people in other jobs, and people with a higher account balance spend more. Don't ask me how we know the account balance of our customers; we just do.

Now we define the true treatment effect as

\(\tau = (85 - X_{age}) \cdot 10 - I_{edu} \cdot 200 - I_{highedu} \cdot 100 - I_{married} \cdot 100 + \mathcal{N}(0, 10^2)\)

where the intuition is that young people are more likely to respond to marketing. The higher your education, the less likely you are to respond to marketing, because you know it's just a hoax anyway. And if you're married, you have to argue with your significant other about the purchase and thus won't respond as much. It's obvious, isn't it? To round it off, we add some Gaussian noise, because people are different.

The treated outcome is then simply \(Y(1) = Y(0) + \tau\).

Now, let’s return to the hunch of our marketing department. They figured—somehow correctly—that people with higher balance spend more, and thus assigned marketing to the 10,000 people with the highest balance, ignoring everything else.

The results of this previous campaign are what we have for our study of treatment effects. The data is called observational because we only observe it post hoc. If we had instead assigned treatment randomly across all customers, we would have a randomized controlled trial (RCT), which would enable us to estimate the treatment effects much more precisely (read why this is so in the other article). But for now, we want to work with this biased data, because that is closer to what we see in the wild. After all, running an RCT is expensive because of the opportunity cost.

In Python, the modelling of treatments and outcomes looks like the following. The whole notebook, including plots and data preparation, can be found here.
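A minimal sketch of this data generating process, building on the df loaded above. How exactly \(I_{edu}\) and \(I_{highedu}\) map onto the education feature, as well as the random seed, are assumptions on my part:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

def min_max_scale(x, low=-1000, high=10_000):
    """Linearly squash values into the range [low, high]."""
    return low + (x - x.min()) * (high - low) / (x.max() - x.min())

# Untreated outcome Y(0): the young, managers and people with a
# high balance spend more.
y_0 = (
    (85 - df["age"]) ** 2 / 5
    + (df["job"] == "management") * 150
    + min_max_scale(df["balance"])
)

# Treatment effect tau: the young respond to marketing, the educated
# and the married respond less. Assumption: I_edu covers at least
# secondary education, I_highedu adds a further penalty for tertiary.
tau = (
    (85 - df["age"]) * 10
    - df["education"].isin(["secondary", "tertiary"]) * 200
    - (df["education"] == "tertiary") * 100
    - (df["marital"] == "married") * 100
    + rng.normal(0, 10, size=len(df))
)

y_1 = y_0 + tau  # treated outcome Y(1)

# The first campaign targeted the 10,000 users with the highest balance.
t = (df["balance"].rank(ascending=False, method="first") <= 10_000).astype(int)

# Observed (factual) outcome: Y = Y(1) * T + Y(0) * (1 - T)
y = y_1 * t + y_0 * (1 - t)

df = df.assign(tau=tau, t=t, y=y)
```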

Targeting the Most Effective Group

For the next campaign, our boss has imposed tighter austerity measures on us, and we are only allowed to send marketing to 2,000 individuals. Thus, we had better choose them wisely. Let's compare different approaches.

Note: We assume that the response behaviour of the individuals hasn't changed since the last campaign. That is to say, our model of potential outcomes remains the same.

Target Users with Highest Balance

If we stick to the assumption of the first campaign and target the 2,000 people with the highest balance, we only gain a total of 886,450 €. That is to say, the difference between the scenario without marketing and the one with marketing amounts to roughly 890k € given the data generating process above. This makes sense if we look at how we modelled treatment effects: we didn't include account balance in the calculation of \(\tau\). So while it is true that people with a high account balance tend to spend more in general (a higher \(Y(0)\)), they are not responsive to marketing. Essentially, all the benefit we get from targeting the 2,000 people with the highest balance is down to luck.

Influence of the scaled balance on the control outcome and on the treatment effect. We see that the guess of the marketing department is correct, but also that it is useless for our goal of targeting the maximum treatment effect.

Note that we can only calculate this ground truth because we have synthetic data and know the \(\tau_i\) of all instances. We calculate the money earned like so:
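A minimal sketch, using the ground-truth tau column from the data generation snippet above:

```python
def total_gain(df, target_index):
    """Sum of the true treatment effects of the targeted users, i.e. the
    money earned compared to running no campaign at all."""
    return df.loc[target_index, "tau"].sum()

# The marketing department's strategy: highest balance first.
top_balance = df.nlargest(2000, "balance").index
print(f"Balance targeting: {total_gain(df, top_balance):,.0f} €")
```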

Target Users Based on Causal Learning

Now comes the interesting part. We run a very simple T-Learner on the observational data we've collected from our previous marketing campaign and use that learner to estimate the treatment effects of all customers. We then assign treatment to the 2,000 customers with the highest estimated treatment effect. And voilà: 1,597,590 € of total gain. That's almost double the total effect of our previous targeting strategy.

In order to estimate the effect, the T-Learner, where the T stands for two learners, employs two linear regressions: one learns the outcome of treated instances and one the outcome of so-called control (untreated) instances, given the features. We can write this as estimating two expected values:

\(\mu_0(x) \approx \mathbb{E}[Y \mid X=x, T=0]\)

\(\mu_1(x) \approx \mathbb{E}[Y \mid X=x, T=1]\)

If these estimates are correct, we can approximate the treatment effect of each instance as

\(\tau(x) = \mu_1(x) - \mu_0(x)\)

It's really that simple, and yet very powerful in our example. This goes back, not least, to the fact that both the untreated outcome \(Y(0)\) and the treatment effect \(\tau\) are essentially linear combinations of the features (only the age term in \(Y(0)\) is quadratic), which the T-Learner has no trouble learning from the data.
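Spelled out with scikit-learn, a hand-rolled T-Learner might look like this; the one-hot encoding of the categorical features is an implementation detail I am assuming:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Numeric feature matrix: one-hot encode the categorical columns.
X = pd.get_dummies(df[["age", "job", "marital", "education", "balance"]])

treated = df["t"] == 1

# Fit mu_1 on the treated and mu_0 on the control instances.
mu_1 = LinearRegression().fit(X[treated], df.loc[treated, "y"])
mu_0 = LinearRegression().fit(X[~treated], df.loc[~treated, "y"])

# Estimated individual treatment effects for all users.
ite_hat = mu_1.predict(X) - mu_0.predict(X)
```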

Using our JustCause Framework, this is as simple as:
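Something along these lines; treat the exact signatures (TLearner(learner=...), fit(x, t, y), predict_ite(x)) as an assumption and check the current JustCause documentation:

```python
import pandas as pd
from justcause.learners import TLearner
from sklearn.linear_model import LinearRegression

# X, df and total_gain come from the snippets above.
x, t_arr, y_arr = X.to_numpy(), df["t"].to_numpy(), df["y"].to_numpy()

tlearner = TLearner(learner=LinearRegression())
tlearner.fit(x, t_arr, y_arr)
ite_hat = tlearner.predict_ite(x)

# Target the 2,000 customers with the highest estimated effect.
top_ite = pd.Series(ite_hat, index=df.index).nlargest(2000).index
print(f"T-Learner targeting: {total_gain(df, top_ite):,.0f} €")
```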

Target Youngest Users First

Finally, we can compare that to a very informed guess: if we target the 2,000 youngest people, we gain a total of 1,463,361 € more than without the marketing campaign. This is pretty good, and it becomes clear why if we look at the model of \(\tau\), where age plays the major and most distinct role. But still, the T-Learner outperformed our best guess without knowing anything about the data generating process that is not in the data.
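With the helper from above:

```python
# The informed guess: youngest users first.
youngest = df.nsmallest(2000, "age").index
print(f"Age targeting: {total_gain(df, youngest):,.0f} €")
```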

We see that age has only a slight influence on the untreated outcome in general, but a direct effect on the treatment effect. Thus targeting by age is a good approach.

What Happened Here?

The T-Learner and the informed guess both fare well compared to targeting by balance, as our imaginary marketing department recommended. This is because they both rely on the importance of age to target users, and according to the synthetic DGP we defined above, age is the most important driver of the treatment effect. The difference is that the T-Learner finds this importance by looking at the data alone, while the guess must be informed by some background knowledge. In our case, the difficulty of targeting the right group lies in the fact that the effect of treatment (that is, marketing) is not related to the general behaviour of customers.

Check out the notebook if you’re interested in an uplift plot.

Take-Aways

I hope that, after reading this article, you are now familiar with the notation and the idea behind the Potential Outcomes framework and how it relates to the specific use-case of a marketing campaign study. Furthermore, you should be at least somewhat convinced that employing even a simple treatment effect estimation technique can be useful.

If you want to dive deeper into the theory and learn about our framework JustCause, check out the other article.
