


If you use statistics in your day-to-day job, it's likely that at some point you'll run across a distribution comparison problem. Comparing distributions to determine whether they're distinct can lead to many valuable insights, in particular when different attributes associated with a data set lead to different (statistically significant) outcomes.

To better illustrate this problem, let's do an example. We'll pull data from the 'Adult' dataset, available via the UCI Machine Learning Repository. This data set contains a sample from the 1994 United States census, including information on individuals' salary (>$50K, <=$50K), age, education, marital status, race, and sex, among other factors.

    # Declare the column names of the data set
    # (df is the pandas dataframe holding the Adult data)
    df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                  'marital-status', 'occupation', 'relationship',
                  'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week',
                  'native-country', 'salary']

Snapshot of the UCI data set, uploaded directly into a pandas dataframe.

Now that we have some data, let's visualize it. First, we look at the age distribution across the US population, using the matplotlib hist() function:

    import matplotlib.pyplot as plt

    def generate_distribution_histogram(dataframe, column_name, title,
                                        label_name=None, number_bins=15):
        """
        Args:
            column_name: String. Name of the column to plot.
        Outputs:
            Histogram containing distribution for specific column column_name.
        """
        # The default values for label_name and number_bins are assumptions
        plt.hist(dataframe[column_name], bins=number_bins, label=label_name)
        plt.title(title)
        plt.show()

    generate_distribution_histogram(df, 'age',
                                    title='Age Distribution: US Population')

Age Distribution, subset by salary level ($50K)

As you can see in the visual above, the distributions change when we subset the data by salary level. For the population making less than $50K a year, the distribution peaks around 25 years of age. For the population making greater than $50K a year, the peak occurs around 45 years of age. This makes intuitive sense, as people early in their careers tend to make less money than those later on who are more established.

Now that we've graphed the different age distributions based on salary, is there a way to show statistically that the two differ? Yes, using the Mann-Whitney U Test.

So, what does the Mann-Whitney U Test do exactly? The Mann-Whitney U Test is a null hypothesis test, used to detect differences between two independent data sets. The test is non-parametric, meaning it does not assume a specific distribution for a set of data. Because of this, the Mann-Whitney U Test can be applied to any distribution, whether it is Gaussian or not.

The test statistic is computed from the ranks of the pooled observations. In its standard form:

$$U_1 = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1, \qquad U_2 = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - R_2$$

where R1 refers to the sum of the ranks for the first group, R2 refers to the sum of the ranks for the second group, and n1 and n2 are the two sample sizes.
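To make the Mann-Whitney U Test concrete, here is a minimal sketch that computes the U statistics from rank sums (U = n1·n2 + n(n+1)/2 − R for each group) and an approximate two-sided p-value. It rests on several assumptions: the helper names `average_ranks` and `mann_whitney_u` are hypothetical, the two samples are synthetic stand-ins rather than the Adult data, and the p-value uses the large-sample normal approximation without tie or continuity corrections. In practice a library routine such as `scipy.stats.mannwhitneyu` would likely be the better choice.

```python
import math
import random


def average_ranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U1, U2, approximate two-sided p-value) for samples x and y."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    r1 = sum(ranks[:n1])  # rank sum of the first group
    r2 = sum(ranks[n1:])  # rank sum of the second group
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    # Normal approximation: under the null hypothesis, U has mean n1*n2/2.
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (min(u1, u2) - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u1, u2, p_value


# Synthetic ages (an assumption for illustration, not the census data).
random.seed(0)
ages_low = [random.gauss(30, 10) for _ in range(200)]
ages_high = [random.gauss(45, 10) for _ in range(200)]

u1, u2, p_value = mann_whitney_u(ages_low, ages_high)
print(f"U1={u1:.1f}, U2={u2:.1f}, p={p_value:.3g}")
```

With real data, the two lists would be replaced by the age values of each salary group. A small p-value suggests that the two age distributions differ, and U1 + U2 always equals n1·n2, which is a handy sanity check.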
