Bin and mean

plot_utils.bin_and_mean(xdata, ydata, bins=10, distribution='normal', show_fig=True, fig=None, ax=None, figsize=None, dpi=100, show_bins=True, raw_data_label='raw data', mean_data_label='average', xlabel=None, ylabel=None, logx=False, logy=False, grid_on=True, error_bounds=True, err_bound_type='shade', legend_on=True, subsamp_thres=None, show_stats=True, show_SE=False, err_bound_shade_opacity=0.5)

Calculate the “bin-and-mean” results and optionally show the “bin-and-mean” plot.

A “bin-and-mean” plot is a clearer way to show the dependency of ydata on xdata. The data points (xdata, ydata) are divided into bins according to the values in xdata (via bins); within each bin, the mean values of x and y are calculated and treated as the representative x and y values.

“Bin-and-mean” is preferred when data points are highly skewed (e.g., many data points at small x, but very few at large x). The data points at large x are usually not noise, and could be even more valuable (think of the case where x is earthquake magnitude and y is the related economic loss). To study the relationship between economic loss and earthquake magnitude, we bin-and-mean the raw data and draw conclusions from the mean data points.

The method rests on the assumption that data points with similar x values follow the same distribution. In the simplest case, we assume the data points are normally distributed, so y_mean is the arithmetic mean of the data points within a bin. We also often assume the data points follow a log-normal distribution (if we want to assert that y values are all positive), in which case y_mean is the expected value of the log-normal distribution, while x_mean for every bin is still just the arithmetic mean.
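The idea under the normal assumption can be sketched with plain NumPy. This is only an illustration, not the implementation of bin_and_mean: the helper name bin_and_mean_sketch is hypothetical, and it assumes equal-width bins between the minimum and maximum of x, which may differ from how the function interprets an integer bins.

```python
import numpy as np

def bin_and_mean_sketch(x, y, n_bins=10):
    """Hypothetical helper: arithmetic mean of x and y within bins of x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumption: equal-width bins spanning the range of x.
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # Use interior edges only, so indices run from 0 to n_bins - 1 and
    # the largest x value lands in the last bin.
    idx = np.digitize(x, edges[1:-1])
    x_mean = np.full(n_bins, np.nan)
    y_mean = np.full(n_bins, np.nan)
    for k in range(n_bins):
        mask = idx == k
        if mask.any():  # empty bins stay NaN
            x_mean[k] = x[mask].mean()
            y_mean[k] = y[mask].mean()
    return x_mean, y_mean

# Skewed x: many small values, few large ones.
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=5000)
y = 2.0 * x + rng.normal(scale=0.5, size=x.size)
x_mean, y_mean = bin_and_mean_sketch(x, y, n_bins=8)
```

Each non-empty bin contributes one representative point, so the sparse large-x region carries the same visual weight as the crowded small-x region.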

Notes

For a log-normal distribution, the expected value of Y is:
E(Y) = exp(mu + (1/2)*sigma^2)

and the variance is:
Var(Y) = [exp(sigma^2) - 1] * exp(2*mu + sigma^2)

where mu and sigma are the two parameters of the distribution.

Knowing E(Y) and Var(Y), mu and sigma can be back-calculated:

mu = ln[ E(Y) / sqrt(1 + Var(Y)/E^2(Y)) ]

sigma = sqrt( ln[ 1 + Var(Y)/E^2(Y) ] )

(Reference: https://en.wikipedia.org/wiki/Log-normal_distribution)
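As a quick check of the formulas above, the following sketch converts (mu, sigma) to (E(Y), Var(Y)) and back. The function names lognormal_moments and lognormal_params are hypothetical and are not part of plot_utils.

```python
import numpy as np

def lognormal_moments(mu, sigma):
    """E(Y) and Var(Y) of a log-normal distribution with parameters mu, sigma."""
    mean = np.exp(mu + 0.5 * sigma**2)
    var = (np.exp(sigma**2) - 1.0) * np.exp(2.0 * mu + sigma**2)
    return mean, var

def lognormal_params(mean, var):
    """Back-calculate mu and sigma from E(Y) and Var(Y)."""
    ratio = 1.0 + var / mean**2
    mu = np.log(mean / np.sqrt(ratio))
    sigma = np.sqrt(np.log(ratio))
    return mu, sigma

# Round trip: the recovered parameters should match the originals.
mean, var = lognormal_moments(0.3, 0.8)
mu, sigma = lognormal_params(mean, var)
```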

Parameters:

xdata (list, numpy.ndarray, or pandas.Series) – X data.
ydata (list, numpy.ndarray, or pandas.Series) – Y data.
bins (int, list, numpy.ndarray, or pandas.Series) – Number of bins (an integer), or an array representing the actual bin edges. If bins gives the bin edges, each bin is inclusive on its lower bound, e.g., a value of 2 falls into the bin [2, 3), not the bin [1, 2). Note that the binning is done according to the X values.
distribution ({'normal', 'lognormal'}) – Specifies which distribution the Y values within a bin follow. Use ‘lognormal’ if you want to assert all positive Y values. Only supports normal and log-normal distributions at this time.
show_fig (bool) – Whether or not to show a bin-and-mean plot.
fig (matplotlib.figure.Figure or None) – Figure object. If None, a new figure will be created.
ax (matplotlib.axes._subplots.AxesSubplot or None) – Axes object. If None, a new axes will be created.
figsize ((float, float)) – Figure size in inches, as a tuple of two numbers. The figure size of fig (if not None) will override this parameter.
dpi (float) – Figure resolution. The dpi of fig (if not None) will override this parameter.
show_bins (bool) – Whether or not to show the bin edges as vertical lines on the plots.
raw_data_label (str) – The label name of the raw data to be shown in the legend (such as “raw data”). It has no effect if legend_on is False.
mean_data_label (str) – The label name of the mean data to be shown in the legend (such as “averaged data”). It has no effect if legend_on is False.
xlabel (str or None) – X axis label. If None and xdata is a pandas Series, use xdata’s “name” attribute as xlabel.
ylabel (str or None) – Y axis label. If None and ydata is a pandas Series, use ydata’s “name” attribute as ylabel.
logx (bool) – Whether or not to show the X axis in log scale.
logy (bool) – Whether or not to show the Y axis in log scale.
grid_on (bool) – Whether or not to show grids on the plot.
error_bounds (bool) – Whether or not to show error bounds of each bin.
err_bound_type ({'shade', 'bar'}) – Type of error bound: shaded area or error bars. It has no effect if error_bounds is set to False.
legend_on (bool) – Whether or not to show a legend.
subsamp_thres (int or None) – A positive integer that defines the number of data points in each bin to show in the scatter plot. The smaller this number, the faster the plotting process. If larger than the number of data points in a bin, then all data points from that bin are plotted. If None, then all data points from all bins are plotted.
show_stats (bool) – Whether or not to show the R^2 scores and correlation coefficients of the raw data and the binned averages on the plot.
show_SE (bool) – If True, show the standard error of y_mean (orange dots) of each bin as the shaded area beneath the mean value lines. If False, show the standard deviation of raw Y values (gray dots) within each bin.
err_bound_shade_opacity (float) – The opacity of the shaded area representing the error bound. 0 means completely transparent, and 1 means completely opaque. It has no effect if err_bound_type is 'bar'.
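The lower-bound-inclusive rule described for bins can be illustrated with numpy.digitize, whose default behavior follows the same convention. This is a sketch of the convention only; it is an assumption that bin_and_mean assigns points in a comparable way internally.

```python
import numpy as np

edges = np.array([1, 2, 3, 4])            # bins: [1, 2), [2, 3), [3, 4)
values = np.array([1.0, 1.99, 2.0, 2.5, 3.0])

# np.digitize returns i such that edges[i-1] <= value < edges[i];
# subtracting 1 gives a zero-based bin index.
bin_index = np.digitize(values, edges) - 1
print(bin_index)  # [0 0 1 1 2]
```

The value 2.0 falls into the second bin [2, 3), not the first bin [1, 2).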

Returns:

fig (matplotlib.figure.Figure) – The figure object being created or being passed into this function. None, if show_fig is set to False.
ax (matplotlib.axes._subplots.AxesSubplot) – The axes object being created or being passed into this function. None, if show_fig is set to False.
x_mean (numpy.ndarray) – Mean X values of each data bin (in terms of X values).
y_mean (numpy.ndarray) – Mean Y values of each data bin (in terms of X values).
y_std (numpy.ndarray) – Standard deviation of Y values of each data bin (in terms of X values).
y_SE (numpy.ndarray) – Standard error of y_mean. It describes how far y_mean is from the population mean (or the “true mean value”) within each bin, which is a different concept from y_std. See https://en.wikipedia.org/wiki/Standard_error#Standard_error_of_mean_versus_standard_deviation for further information.
stats_ (tuple<float>) – A tuple in the order of (r2_score_raw, corr_coeff_raw, r2_score_binned, corr_coeff_binned), which are the R^2 score and correlation coefficient of the raw data (xdata and ydata) and the binned averages (x_mean and y_mean).
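The difference between y_std and y_SE can be sketched for a single bin as follows. The sample values are made up for illustration, and the use of the sample standard deviation (ddof=1) is an assumption; the function itself may use a different convention.

```python
import numpy as np

# Hypothetical Y values that fell into one bin.
y_bin = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

y_mean = y_bin.mean()
# Spread of the raw Y values within the bin (assumption: ddof=1).
y_std = y_bin.std(ddof=1)
# Standard error of the mean: the spread divided by sqrt(n).
y_se = y_std / np.sqrt(y_bin.size)
```

With more points in a bin, y_se shrinks roughly as 1/sqrt(n) while y_std stays about the same, which is why the two error bounds selected by show_SE can look very different.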