pd.crosstab is one of those built-in functions in the Pandas API that I forget about routinely. I instinctively reached for df.groupby('x')['y'].count().unstack(), but when I wanted to normalize the values, it takes more and more steps to get where I wanted.
This was a nice straightforward overview of the pd.crosstab function. To document for myself, below, create a sample correlated DataFrame with integer columns ActiveUsers and CompletedProfile.
import pandas as pd
import numpy as np
# following code from Github Copilot
np.random.seed(0)
n = 1000  # Number of samples
p = 0.7  # Probability of True in the first column
rho = 0.8  # Correlation
col1 = np.random.choice([True, False], size=n, p=[p, 1-p])
col2 = np.where(col1, np.random.choice([True, False], size=n, p=[rho, 1-rho]), 
         np.random.choice([True, False], size=n, p=[1-rho, rho]))
df = pd.DataFrame({'ActiveUsers': col1, 'CompletedProfile': col2})
| ActiveUsers | CompletedProfile | |
| 0 | True | True | 
| 1 | False | False | 
| 2 | True | False | 
| 3 | ... | ... | 
Sample of the constructed DataFrame
Running pd.crosstab(df['ActiveUsers'], df['CompletedProfile']) gets a frequency distribution. Passing normalize=True adjusts the numbers of each group relative to the sum of the groups.
With normalize='index', we can find conditional probabilities given the row values. Given the value in the index, we can see the relative frequencies in the columns; numbers are normalized according to total values of each row. This gives us P(CompletedProfile=Y|ActiveUser=X).
| CompletedProfile | False | True | 
| ActiveUsers | ||
| False | .797 | .202 | 
| True | .220 | .779 | 
Result of pd.crosstab(df[‘ActiveUsers’], df[‘CompletedProfile’], normalize=‘index’). Given a ActiveUser value of True, ~78% of instances are CompletedProfile=True.
With normalize='columns', the conditional probabilities are based on the columns, so we find P(ActiveUser=Y|CompletedProfile=X). Given a value in a column, what is the relative frequency of each row value.
| CompletedProfile | False | True | 
| ActiveUsers | ||
| False | .597 | .096 | 
| True | .402 | .903 | 
Result of pd.crosstab(df[‘ActiveUsers’], df[‘CompletedProfile’], normalize=‘columns’). Given a CompletedProfile value of True, ~90% of instances are ActiveUsers=True.
Just so I don’t forget this:
| pd.crosstab with normalize arg | Resulting conditional probability | 
| pd.crosstab(df.ColumnA, df.ColumnB, normalize='index') | P(ColumnB|ColumnA) | 
| pd.crosstab(df.ColumnA, df.ColumnB, normalize='columns') | P(ColumnA|ColumnB) | 
Resulting conditional probabilities by normalize arg.
Curiously, passing margins=True when normalizing by index or columns gives totals (normalized) for the columns or rows, respectively. The Pandas documentation doesn’t cover this, but it is strange that the margins=True doesn’t result in a totals relative to the values being normalized against.
| CompletedProfile | False | True | All | 
| ActiveUsers | |||
| False | .597 | .096 | .291 | 
| True | .402 | .903 | .709 | 
Result of pd.crosstab(df[‘ActiveUsers’], df[‘CompletedProfile’], margins=True, normalize=‘columns’). Values in the All column just reflect the normalized totals relative to the entire table, as if you ran pd.crosstab(df[‘ActiveUsers’], df[‘CompletedProfile’], normalize=True, margins=True).