Category means

plot_utils.category_means(categorical_array, continuous_array, fig=None, ax=None, figsize=None, dpi=100, title=None, xlabel=None, ylabel=None, rot=0, dropna=False, show_stats=True, sort_by='name', vert=True, plot_violins=True, **extra_kwargs)

Summarize the mean value of continuous_array for each distinct category in categorical_array, and visualize the result with a violin plot. The violin plot shows the distribution of the continuous_array values that correspond to each category in categorical_array.

In addition, a one-way ANOVA test (H0: different categories in categorical_array yield the same average values in continuous_array) is performed, and the F statistic and p-value are returned.
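The summary and test described above can be sketched with NumPy and SciPy. This is a minimal illustration and not the library's actual implementation: the helper name category_means_sketch and the sample data are hypothetical, and scipy.stats.f_oneway is assumed to be a reasonable stand-in for the one-way ANOVA step.

```python
import numpy as np
from scipy import stats


def category_means_sketch(categorical_array, continuous_array):
    """Hypothetical helper: per-category means plus a one-way ANOVA result."""
    cats = np.asarray(categorical_array)
    vals = np.asarray(continuous_array, dtype=float)
    if cats.shape[0] != vals.shape[0]:
        raise ValueError("Both arrays must have the same length.")

    # Collect the continuous values that belong to each distinct category.
    groups = {c: vals[cats == c] for c in np.unique(cats).tolist()}

    # Mean of the continuous values within each category.
    mean_values = {c: float(g.mean()) for c, g in groups.items()}

    # One-way ANOVA; H0 is that all categories share the same mean.
    f_stat, p_value = stats.f_oneway(*groups.values())
    return mean_values, (float(f_stat), float(p_value))


# Made-up sample data: three categories with three observations each.
cats = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
vals = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
mean_values, (f_stat, p_value) = category_means_sketch(cats, vals)
print(mean_values, f_stat, p_value)
```

A small p-value suggests rejecting H0, i.e., that at least one category has a different mean from the others.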

Parameters:
  • categorical_array (list, numpy.ndarray, or pandas.Series) – A vector of categorical values.

  • continuous_array (list, numpy.ndarray, or pandas.Series) – The target variable whose values correspond to the values in categorical_array. Must have the same length as categorical_array. continuous_array would naturally contain continuous values, but if it contains categorical values (expressed as integers, not strings), this function should also work.

  • fig (matplotlib.figure.Figure or None) – Figure object. If None, a new figure will be created.

  • ax (matplotlib.axes._subplots.AxesSubplot or None) – Axes object. If None, a new axes will be created.

  • figsize ((float, float)) – Figure size in inches, as a tuple of two numbers. The figure size of fig (if not None) will override this parameter.

  • dpi (float) – Figure resolution. The dpi of fig (if not None) will override this parameter.

  • title (str) – The title of the violin plot, usually the name of categorical_array.

  • xlabel (str) – The label for the x axis (i.e., categories) of the violin plot. If None and categorical_array is a pandas Series, use the ‘name’ attribute of categorical_array as xlabel.

  • ylabel (str) – The label for the y axis (i.e., average continuous_array values) of the violin plot. If None and continuous_array is a pandas Series, use the ‘name’ attribute of continuous_array as ylabel.

  • rot (float) – The rotation (in degrees) of the x axis labels.

  • dropna (bool) – Whether or not to exclude N/A records in the data.

  • show_stats (bool) – Whether or not to show the statistical test results (F statistic and p-value) on the figure.

  • sort_by ({'name', 'mean', 'median', None}) – Option to arrange the different categories in categorical_array in the violin plot. None means no sorting, i.e., using the hashed order of the category names; ‘mean’ and ‘median’ mean sorting the violins according to the mean/median values of each category; ‘name’ means sorting the violins according to the category names.

  • vert (bool) – Whether to show the violins as vertical.

  • plot_violins (bool) – If True, use violin plots to illustrate the distribution of groups. Otherwise, use multi-histogram (hist_multi()).

  • **extra_kwargs – Keyword arguments to be passed to plt.violinplot() (see https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.violinplot.html) or hist_multi(). Note that this function overrides the default behavior of violinplot: showmeans is overridden to True and showextrema to False.
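The sort_by options and the violinplot overrides described in the parameters above can be sketched directly with matplotlib. This is an illustration under stated assumptions, not the library's code: order_categories and the sample groups are hypothetical, and a plain dict's insertion order stands in for the unsorted (sort_by=None) case.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np


def order_categories(groups, sort_by="name"):
    """Hypothetical helper: return category names in the requested order."""
    if sort_by is None:
        return list(groups)  # no sorting; keep the existing order
    if sort_by == "name":
        return sorted(groups)
    if sort_by == "mean":
        return sorted(groups, key=lambda c: np.mean(groups[c]))
    if sort_by == "median":
        return sorted(groups, key=lambda c: np.median(groups[c]))
    raise ValueError("sort_by must be 'name', 'mean', 'median', or None")


# Made-up sample data: category name -> observed continuous values.
groups = {"b": [5.0, 6.0, 7.0], "a": [8.0, 9.0, 10.0], "c": [1.0, 2.0, 30.0]}
order = order_categories(groups, sort_by="median")

fig, ax = plt.subplots(figsize=(4, 3), dpi=100)
# Same overrides as documented: show the means, hide the extrema.
parts = ax.violinplot([groups[c] for c in order],
                      showmeans=True, showextrema=False)
ax.set_xticks(range(1, len(order) + 1))  # violins sit at positions 1..n
ax.set_xticklabels(order, rotation=0)    # rotation plays the role of rot
```

Sorting by mean and by median can give different orders when a category is skewed, as category "c" is here.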

Returns:

  • fig (matplotlib.figure.Figure) – The figure object being created or being passed into this function.

  • ax (matplotlib.axes._subplots.AxesSubplot) – The axes object being created or being passed into this function.

  • mean_values (dict) – A dictionary whose keys are the categories in categorical_array, and whose values are the corresponding mean values of continuous_array.

  • F_test_result (tuple<float>) – A tuple in the order of (F_stat, p_value), where F_stat is the computed F-value of the one-way ANOVA test, and p_value is the associated p-value from the F-distribution.