pd.crosstab
is one of those built-in functions in the Pandas API that I forget about routinely. I instinctively reached for df.groupby('x')['y'].count().unstack()
, but when I wanted to normalize the values, it takes more and more steps to get where I wanted.
This was a nice straightforward overview of the pd.crosstab
function. To document for myself, below, create a sample correlated DataFrame
with integer columns ActiveUsers
and CompletedProfile
.
import pandas as pd
import numpy as np
# following code from Github Copilot
np.random.seed(0)
n = 1000 # Number of samples
p = 0.7 # Probability of True in the first column
rho = 0.8 # Correlation
col1 = np.random.choice([True, False], size=n, p=[p, 1-p])
col2 = np.where(col1, np.random.choice([True, False], size=n, p=[rho, 1-rho]),
np.random.choice([True, False], size=n, p=[1-rho, rho]))
df = pd.DataFrame({'ActiveUsers': col1, 'CompletedProfile': col2})
ActiveUsers | CompletedProfile | |
0 | True | True |
1 | False | False |
2 | True | False |
3 | ... | ... |
Sample of the constructed DataFrame
Running pd.crosstab(df['ActiveUsers'], df['CompletedProfile'])
gets a frequency distribution. Passing normalize=True
adjusts the numbers of each group relative to the sum of the groups.
With normalize='index'
, we can find conditional probabilities given the row values. Given the value in the index, we can see the relative frequencies in the columns; numbers are normalized according to total values of each row. This gives us P(CompletedProfile=Y|ActiveUser=X).
CompletedProfile | False | True |
ActiveUsers | ||
False | .797 | .202 |
True | .220 | .779 |
Result of pd.crosstab(df[‘ActiveUsers’], df[‘CompletedProfile’], normalize=‘index’). Given a ActiveUser value of True, ~78% of instances are CompletedProfile=True.
With normalize='columns'
, the conditional probabilities are based on the columns, so we find P(ActiveUser=Y|CompletedProfile=X). Given a value in a column, what is the relative frequency of each row value.
CompletedProfile | False | True |
ActiveUsers | ||
False | .597 | .096 |
True | .402 | .903 |
Result of pd.crosstab(df[‘ActiveUsers’], df[‘CompletedProfile’], normalize=‘columns’). Given a CompletedProfile value of True, ~90% of instances are ActiveUsers=True.
Just so I don’t forget this:
pd.crosstab with normalize arg | Resulting conditional probability |
pd.crosstab(df.ColumnA, df.ColumnB, normalize='index') | P(ColumnB|ColumnA) |
pd.crosstab(df.ColumnA, df.ColumnB, normalize='columns') | P(ColumnA|ColumnB) |
Resulting conditional probabilities by normalize arg.
Curiously, passing margins=True
when normalizing by index
or columns
gives totals (normalized) for the columns or rows, respectively. The Pandas documentation doesn’t cover this, but it is strange that the margins=True
doesn’t result in a totals relative to the values being normalized against.
CompletedProfile | False | True | All |
ActiveUsers | |||
False | .597 | .096 | .291 |
True | .402 | .903 | .709 |
Result of pd.crosstab(df[‘ActiveUsers’], df[‘CompletedProfile’], margins=True, normalize=‘columns’). Values in the All column just reflect the normalized totals relative to the entire table, as if you ran pd.crosstab(df[‘ActiveUsers’], df[‘CompletedProfile’], normalize=True, margins=True).