Joint & conditional probabilities with pd.crosstab

pd.crosstab is one of those built-in functions in the Pandas API that I routinely forget about. I instinctively reached for df.groupby(['x', 'y']).size().unstack(), but once I wanted to normalize the values, it took more and more steps to get where I wanted.
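To make the comparison concrete, here is a minimal sketch of both routes on a tiny made-up frame. The name df_xy and the columns x and y are placeholders of mine, not data from this post.

```python
import pandas as pd

# Hypothetical toy data; 'x' and 'y' are placeholder column names.
df_xy = pd.DataFrame({
    'x': ['a', 'a', 'b', 'b', 'b'],
    'y': ['u', 'v', 'u', 'u', 'v'],
})

# groupby route: count each (x, y) pair, then pivot y into columns.
by_groupby = df_xy.groupby(['x', 'y']).size().unstack(fill_value=0)
# Normalizing by row needs an extra, explicit division step.
by_groupby_rows = by_groupby.div(by_groupby.sum(axis=1), axis=0)

# crosstab route: one call for counts, one keyword for normalization.
by_crosstab = pd.crosstab(df_xy['x'], df_xy['y'])
by_crosstab_rows = pd.crosstab(df_xy['x'], df_xy['y'], normalize='index')

print(by_groupby_rows)
print(by_crosstab_rows)
```

Both routes should print the same table; the crosstab version just hides the division.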

This was a nice straightforward overview of the pd.crosstab function. To document it for myself, the code below creates a sample DataFrame with two correlated boolean columns, ActiveUsers and CompletedProfile.

import pandas as pd
import numpy as np

# The following code is from GitHub Copilot
np.random.seed(0)
n = 1000  # Number of samples
p = 0.7  # Probability of True in the first column
rho = 0.8  # Probability that the second column matches the first (not a correlation coefficient)

col1 = np.random.choice([True, False], size=n, p=[p, 1-p])
# col2 agrees with col1 with probability rho, which is what correlates the two columns
col2 = np.where(col1,
                np.random.choice([True, False], size=n, p=[rho, 1-rho]),
                np.random.choice([True, False], size=n, p=[1-rho, rho]))

df = pd.DataFrame({'ActiveUsers': col1, 'CompletedProfile': col2})

	ActiveUsers	CompletedProfile
0	True	True
1	False	False
2	True	False
3	...	...

Sample of the constructed DataFrame

Running pd.crosstab(df['ActiveUsers'], df['CompletedProfile']) gives a frequency table: the number of rows for each combination of values. Passing normalize=True divides each count by the grand total, so the cells become joint probabilities that sum to 1.
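As a sanity check, here is a small sketch on a hand-countable frame. The 10-row toy DataFrame is my own made-up data, not the 1000-row sample above. It compares normalize=True against dividing the counts by the grand total.

```python
import pandas as pd

# Hypothetical 10-row toy data, small enough to count by hand.
toy = pd.DataFrame({
    'ActiveUsers':      [True, True, True, True, True, True, False, False, False, False],
    'CompletedProfile': [True, True, True, True, False, False, True, False, False, False],
})

# Raw counts; rows and columns are sorted, so False comes before True.
counts = pd.crosstab(toy['ActiveUsers'], toy['CompletedProfile'])

# normalize=True divides every cell by the grand total (joint probabilities).
joint = pd.crosstab(toy['ActiveUsers'], toy['CompletedProfile'], normalize=True)

# The same thing done by hand.
joint_manual = counts / counts.to_numpy().sum()

print(counts)
print(joint)
```

Here the counts are 3, 1, 2, 4 out of 10 rows, so the joint table should read 0.3, 0.1, 0.2, 0.4.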

With normalize='index', we can find conditional probabilities given the row values. Each row is divided by its own total, so every row sums to 1 and shows the relative frequencies of the column values for that index value. This gives us P(CompletedProfile=Y|ActiveUsers=X).

CompletedProfile	False	True
ActiveUsers
False	.797	.202
True	.220	.779

Result of pd.crosstab(df['ActiveUsers'], df['CompletedProfile'], normalize='index'). Given an ActiveUsers value of True, ~78% of instances have CompletedProfile=True.
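One way to convince myself that these are conditional probabilities is to compute one of them directly with a boolean filter. This is a sketch on a made-up 10-row frame (toy is hypothetical, not the sample above).

```python
import pandas as pd

# Hypothetical 10-row toy data, small enough to count by hand.
toy = pd.DataFrame({
    'ActiveUsers':      [True, True, True, True, True, True, False, False, False, False],
    'CompletedProfile': [True, True, True, True, False, False, True, False, False, False],
})

# Each row is divided by its own total.
given_active = pd.crosstab(toy['ActiveUsers'], toy['CompletedProfile'], normalize='index')

# P(CompletedProfile=True | ActiveUsers=True), computed directly:
# keep only the active rows, then take the share with a completed profile.
direct = toy.loc[toy['ActiveUsers'], 'CompletedProfile'].mean()

# Rows and columns are sorted False, True, so [1, 1] is the (True, True) cell.
from_crosstab = given_active.iloc[1, 1]

print(given_active)
print(direct, from_crosstab)
```

With 4 of the 6 active rows having a completed profile, both numbers should be 2/3.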

With normalize='columns', the conditional probabilities are based on the columns, so we find P(ActiveUsers=Y|CompletedProfile=X). Each column is divided by its own total, so it answers the question: given a value in a column, what is the relative frequency of each row value?

CompletedProfile	False	True
ActiveUsers
False	.597	.096
True	.402	.903

Result of pd.crosstab(df['ActiveUsers'], df['CompletedProfile'], normalize='columns'). Given a CompletedProfile value of True, ~90% of instances have ActiveUsers=True.
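The column-wise version can be checked the same way, by dividing the raw counts by the column totals. Again, this is a sketch on a made-up 10-row frame (toy is hypothetical).

```python
import pandas as pd

# Hypothetical 10-row toy data, small enough to count by hand.
toy = pd.DataFrame({
    'ActiveUsers':      [True, True, True, True, True, True, False, False, False, False],
    'CompletedProfile': [True, True, True, True, False, False, True, False, False, False],
})

counts = pd.crosstab(toy['ActiveUsers'], toy['CompletedProfile'])

# Each column is divided by its own total.
given_completed = pd.crosstab(toy['ActiveUsers'], toy['CompletedProfile'], normalize='columns')

# The same thing by hand: divide each column by its column sum.
given_completed_manual = counts.div(counts.sum(axis=0), axis=1)

print(given_completed)
```

In this toy data 4 of the 5 completed profiles belong to active users, so the (True, True) cell should be 0.8, and each column should sum to 1.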

Just so I don’t forget this:

pd.crosstab with normalize arg	Resulting conditional probability
pd.crosstab(df.ColumnA, df.ColumnB, normalize='index')	P(ColumnB|ColumnA)
pd.crosstab(df.ColumnA, df.ColumnB, normalize='columns')	P(ColumnA|ColumnB)

Resulting conditional probabilities by normalize arg.

Curiously, passing margins=True when normalizing by index or columns gives normalized totals for the columns or rows, respectively. The Pandas documentation doesn't cover this, but it is strange that margins=True doesn't produce totals relative to the values being normalized against.

CompletedProfile	False	True	All
ActiveUsers
False	.597	.096	.291
True	.402	.903	.709

Result of pd.crosstab(df['ActiveUsers'], df['CompletedProfile'], margins=True, normalize='columns'). Values in the All column just reflect the row totals normalized relative to the entire table, matching the All column you would get from pd.crosstab(df['ActiveUsers'], df['CompletedProfile'], normalize=True, margins=True).
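A small sketch of that margins behavior, on a made-up 10-row frame (toy is hypothetical). It is based on what I observe in the pandas versions I have used, so treat the shape checks as an assumption rather than a documented guarantee.

```python
import pandas as pd

# Hypothetical 10-row toy data, small enough to count by hand.
toy = pd.DataFrame({
    'ActiveUsers':      [True, True, True, True, True, True, False, False, False, False],
    'CompletedProfile': [True, True, True, True, False, False, True, False, False, False],
})

# Column-normalized table with margins: an All column is added, but no All row.
by_col = pd.crosstab(toy['ActiveUsers'], toy['CompletedProfile'],
                     margins=True, normalize='columns')

# Fully normalized table with margins: both an All row and an All column.
overall = pd.crosstab(toy['ActiveUsers'], toy['CompletedProfile'],
                      margins=True, normalize=True)

# Plain marginal distribution of ActiveUsers, sorted False then True.
marginal = toy['ActiveUsers'].value_counts(normalize=True).sort_index()

print(by_col)
print(overall)
print(marginal)
```

Here the All column of the column-normalized table should be 0.4 and 0.6, which is just the share of inactive and active users overall.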