benford package

benford.benford module

class benford.benford.Base(data, decimals, sign='all', sec_order=False)[source]

Bases: pandas.core.frame.DataFrame

Internalizes and prepares the data for Analysis.

Parameters:
  • data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
  • decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, at a cost in performance.
  • sign – tells which portion of the data to consider. pos: only the positive entries; neg: only negative entries; all: all entries but zeros. Defaults to all.
Raises:

TypeError – if not receiving int or float as input.
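
The following is a minimal sketch of one plausible reading of the data, decimals and sign parameters; it is not taken from the library's source. The helper name prepare, the rounding step and the ValueError for an unknown sign are assumptions made for illustration.

```python
def prepare(values, decimals=2, sign="all"):
    """Hypothetical helper: filter by sign, drop zeros, scale to integers."""
    if sign not in ("all", "pos", "neg"):
        raise ValueError("sign must be 'all', 'pos' or 'neg'")
    prepared = []
    for v in values:
        # Only int and float entries are accepted (bool is excluded on purpose).
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise TypeError("values must be integers or floats")
        # Zeros are always discarded; the sign filter discards the other half.
        if v == 0 or (sign == "pos" and v < 0) or (sign == "neg" and v > 0):
            continue
        # Assumption: keep `decimals` places by scaling the absolute value.
        scaled = int(round(abs(v) * 10 ** decimals))
        if scaled:
            prepared.append(scaled)
    return prepared
```

With decimals=2, an entry such as 1.5 becomes 150, so that digit tests can operate on integers.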

class benford.benford.Test(base, digs, confidence, limit_N=None, sec_order=False)[source]

Bases: pandas.core.frame.DataFrame

Transforms the original number sequence into a DataFrame reduced by the occurrences of the chosen digits, creating other computed columns.

Parameters:
  • base – The Base object with the data prepared for Analysis
  • digs – Tells which test to perform: 1: first digit; 2: first two digits; 3: first three digits; 22: second digit; -2: last two digits.
  • confidence (int, float) – confidence level to draw lower and upper limits when plotting and to limit the top deviations to show.
  • limit_N (int) – sets a limit to N as the sample size for the calculation of the Z scores if the sample is too big. Defaults to None.
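
As a sketch of what each digs code implies, the expected proportions under Benford's Law can be computed as below. The function name expected_proportions is hypothetical, and treating the last two digits as uniform over 00 to 99 is the usual assumption for that test rather than something stated in this reference.

```python
import math


def expected_proportions(digs):
    """Hypothetical helper: expected proportions keyed by digit(s)."""
    if digs in (1, 2, 3):
        # First k digits: P(d) = log10(1 + 1/d), d in 10**(k-1) .. 10**k - 1.
        low, high = 10 ** (digs - 1), 10 ** digs
        return {d: math.log10(1 + 1 / d) for d in range(low, high)}
    if digs == 22:
        # Second digit: sum the first-two-digits probabilities over the first digit.
        return {
            d2: sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))
            for d2 in range(10)
        }
    if digs == -2:
        # Last two digits: assumed uniform over 00..99.
        return {d: 1 / 100 for d in range(100)}
    raise ValueError("digs must be one of 1, 2, 3, 22 or -2")
```

For digs=1 this gives roughly 0.301 for the digit 1 and 0.046 for the digit 9.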
N

Number of records in the sample to consider in computations

ddf

Degrees of Freedom to look up for the critical chi-square value

chi_square

Chi-square statistic for the given test

KS

Kolmogorov-Smirnov statistic for the given test

MAD

Mean Absolute Deviation for the given test

confidence

Confidence level to consider when setting some critical values

digs

numerical representation of the test at hand. 1: F1D; 2: F2D; 3: F3D; 22: SD; -2: L2D.

Type:int
sec_order

True if the test is a Second Order one

Type:bool
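
The N, ddf, chi_square, KS and MAD attributes listed above correspond to standard statistics. A minimal sketch, assuming the textbook definitions (the library's exact computation may differ), with the hypothetical function name digit_statistics:

```python
import math


def digit_statistics(counts, expected):
    """Hypothetical helper: statistics from observed counts and expected proportions."""
    n = sum(counts)
    found = [c / n for c in counts]
    # Chi-square: squared count differences scaled by the expected counts.
    chi_square = sum((c - n * e) ** 2 / (n * e) for c, e in zip(counts, expected))
    # Kolmogorov-Smirnov: largest gap between the cumulative distributions.
    cum_found = cum_expected = ks = 0.0
    for f, e in zip(found, expected):
        cum_found += f
        cum_expected += e
        ks = max(ks, abs(cum_found - cum_expected))
    # Mean Absolute Deviation between found and expected proportions.
    mad = sum(abs(f - e) for f, e in zip(found, expected)) / len(expected)
    return {"N": n, "ddf": len(expected) - 1,
            "chi_square": chi_square, "KS": ks, "MAD": mad}


# First-digit example with counts that are close to Benford's proportions.
benford_f1d = [math.log10(1 + 1 / d) for d in range(1, 10)]
stats = digit_statistics([301, 176, 125, 97, 79, 67, 58, 51, 46], benford_f1d)
```
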
update_confidence(new_conf, check=True)[source]

Sets a new confidence level for the Benford object, so as to be used to produce critical values for the tests

Parameters:
  • new_conf – new confidence level to draw lower and upper limits when plotting and to limit the top deviations to show, as well as to calculate critical values for the tests’ statistics.
  • check – checks the value provided for the confidence. Defaults to True
critical_values

a dictionary with the critical values for the test at hand, according to the current confidence level.

Type:dict
show_plot(save_plot=None, save_plot_kwargs=None)[source]

Draws the test plot.

Parameters:
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when save_plot is a string with the figure file path/name.
report(high_Z='pos', show_plot=True, save_plot=None, save_plot_kwargs=None)[source]

Handles the report specific to the test, considering its statistics and the current confidence level.

Parameters:
  • high_Z (str, int) – chooses which Z scores to use when displaying results, according to the confidence level chosen. Defaults to ‘pos’, which highlights only values higher than the expected frequencies; ‘all’ highlights both extremes (positive and negative); an integer n shows the first n entries, positive and negative, regardless of whether Z is higher than the critical value or not.
  • show_plot – calls the show_plot method, to draw the test plot
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html). Only available when show_plot=True and save_plot is a string with the figure file path/name.
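
To illustrate how high_Z and limit_N could interact, here is a sketch that assumes the usual Z statistic for proportions with a 1/(2N) continuity correction. The helper names z_score and highlight, and the 1.96 critical value (95% confidence, two-sided), are choices made for the example and not the library's implementation.

```python
import math


def z_score(found, expected, n, limit_n=None):
    """Hypothetical helper: Z statistic of one digit's found proportion."""
    if limit_n is not None:
        n = min(n, limit_n)  # cap the sample size used in the computation
    diff = abs(found - expected)
    correction = 1 / (2 * n)
    if correction < diff:  # apply the continuity correction only when it is smaller
        diff -= correction
    return diff / math.sqrt(expected * (1 - expected) / n)


def highlight(rows, critical=1.96, high_z="pos"):
    """rows: (digit, found, expected, z) tuples; returns the rows to display."""
    ranked = sorted(rows, key=lambda r: r[3], reverse=True)
    if high_z == "pos":   # significant and above the expected frequency
        return [r for r in ranked if r[3] > critical and r[1] > r[2]]
    if high_z == "all":   # significant in either direction
        return [r for r in ranked if r[3] > critical]
    return ranked[:high_z]  # an integer: top n rows regardless of significance
```

Capping N matters because, with very large samples, even tiny deviations produce Z scores above the critical value.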
class benford.benford.Summ(base, test)[source]

Bases: pandas.core.frame.DataFrame

Gets the base object and outputs a Summation test object

Parameters:
  • base – The Base object with the data prepared for Analysis
  • test – The test for which to compute the summation
MAD = None

Mean Absolute Deviation for the test

confidence = None

Confidence level to consider when setting some critical values

show_plot(save_plot=None, save_plot_kwargs=None)[source]

Draws the Summation test plot

Parameters:
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when save_plot is a string with the figure file path/name.
report(high_diff=None, show_plot=True, save_plot=None, save_plot_kwargs=None)[source]

Gives the report on the Summation test.

Parameters:
  • high_diff – Number of records to show after ordering by the absolute differences between the found and the expected proportions
  • show_plot – calls the show_plot method, to draw the Summation test plot
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html). Only available when show_plot=True and save_plot is a string with the figure file path/name.
class benford.benford.Mantissas(data, confidence=95, limit_N=None)[source]

Bases: object

Computes and holds the mantissas of the logarithms of the records

Parameters:
  • data – sequence to compute mantissas from. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column.
  • confidence – confidence level for computing the critical values to compare with some statistics
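
As a minimal sketch of the quantity this class holds, the mantissa of a record is the fractional part of the base-10 logarithm of its absolute value. For a Benford-compliant sequence the mantissas are uniformly distributed on [0, 1), so their mean is close to 0.5 and their variance close to 1/12. The function name below is hypothetical, and skipping zeros is an assumption.

```python
import math


def log_mantissas(values):
    """Hypothetical helper: fractional parts of log10(|v|), zeros skipped."""
    result = []
    for v in values:
        if v == 0:
            continue  # log10(0) is undefined
        # Python's % with a positive modulus always returns a value in [0, 1).
        result.append(math.log10(abs(v)) % 1)
    return result
```

For example, 0.02 and 200 share the same mantissa, about 0.301, because they differ only by a power of ten.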
data = None

pandas DataFrame with the mantissas

Type:(DataFrame)
stats
update_confidence(new_conf, check=True)[source]

Sets a new confidence level for the Benford object, so as to be used to produce critical values for the tests

Parameters:
  • new_conf – new confidence level to draw lower and upper limits when plotting and to limit the top deviations to show, as well as to calculate critical values for the tests’ statistics.
  • check – checks the value provided for the confidence. Defaults to True
report(show_plot=True, save_plot=None, save_plot_kwargs=None)[source]

Displays the Mantissas test stats.

Parameters:
  • show_plot – shows the Ordered Mantissas plot and the Arc Test plot. Defaults to True.
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html). Only available when show_plot=True and save_plot is a string with the figure file path/name.
show_plot(figsize=(12, 6), save_plot=None, save_plot_kwargs=None)[source]

Plots the ordered mantissas and a line with the expected inclination.

Parameters:
  • figsize (tuple) – figure size dimensions
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when save_plot is a string with the figure file path/name.
arc_test(grid=True, figsize=12, save_plot=None, save_plot_kwargs=None)[source]

Adds two columns to the Mantissas DataFrame with the “X” and “Y” coordinates of each mantissa, plots them in a scatter plot and calculates the gravity center of the circle.

Parameters:
  • grid – whether to show the plot grid. Defaults to True.
  • figsize (int) – size of the figure to be displayed. Since it is a square, there is no need to provide a tuple, as is usually the case with matplotlib.
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when plot=True.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when plot=True and save_plot is a string with the figure file path/name.
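
The coordinates and gravity center described for the arc test can be sketched as follows, assuming each mantissa m is mapped to the angle 2πm on the unit circle. The function name arc_coordinates is hypothetical, and no plotting is done here.

```python
import math


def arc_coordinates(mantissas):
    """Hypothetical helper: unit-circle coordinates and their gravity center."""
    xs = [math.cos(2 * math.pi * m) for m in mantissas]
    ys = [math.sin(2 * math.pi * m) for m in mantissas]
    # The gravity center is the mean point; uniformly spread mantissas
    # place it near the origin.
    center = (sum(xs) / len(xs), sum(ys) / len(ys))
    return xs, ys, center
```

A gravity center far from (0, 0) indicates that the mantissas cluster on one side of the circle.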
class benford.benford.Benford(data, decimals=2, sign='all', confidence=95, mantissas=True, sec_order=False, summation=False, limit_N=None, verbose=True)[source]

Bases: object

Initializes a Benford Analysis object and computes the proportions for the digits. The test DataFrames are attributes: obj.F1D is the First Digit DataFrame, obj.F2D the First Two Digits one, and so on, with F3D for First Three Digits, SD for Second Digit and L2D for Last Two Digits.

Parameters:
  • data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a tuple with a pandas DataFrame and the name (str) of the chosen column. Values must be integers or floats.
  • decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, at a cost in performance.
  • sign – tells which portion of the data to consider. pos: only the positive entries; neg: only negative entries; all: all entries but zeros. Defaults to all.
  • confidence (int, float) – confidence level to draw lower and upper limits when plotting and to limit the top deviations to show, as well as to calculate critical values for the tests’ statistics. Defaults to 95.
  • mantissas (bool) – whether to also run the Mantissas test. Defaults to True.
  • sec_order – runs the Second Order tests, which are Benford’s tests performed on the differences between consecutive values of the ordered sample (a value minus the one before it, and so on). If the original series is Benford-compliant, this new sequence should also follow Benford. The Second Order tests can also be called separately, through the method sec_order().
  • summation – creates the Summation DataFrames for the First, First Two, and First Three Digits. The summation tests can also be called separately, through the method summation().
  • limit_N (int) – sets a limit to N as the sample size for the calculation of the Z scores if the sample is too big. Defaults to None.
  • verbose – gives some information about the data and the records used and discarded for each test.
data

the raw data provided for the analysis

chosen

the column of the DataFrame to be analysed or the data itself

sign

which number sign(s) to include in the analysis

Type:str
confidence

current confidence level

limit_N

sample size to use in computations

Type:int
verbose

verbose or not

Type:bool
base

the Base, pre-processed object

tests

keeps track of the tests the instance has

Type:list of str
update_confidence(new_conf, tests=None)[source]

Sets (a) new confidence level(s) for the Benford object, so as to be used to produce critical values for the tests.

Parameters:
  • new_conf – new confidence level to draw lower and upper limits when plotting and to limit the top deviations to show, as well as to calculate critical values for the tests’ statistics.
  • tests (list of str) – list of test names (strings) to have their confidence updated. If only one, provide a one-element list, like [‘F1D’]. Defaults to None, in which case it will use the instance’s tests list attribute.
Raises:

ValueError – if the tests argument is not a list or None.

all_confidences

a dictionary with a confidence level for each computed test, when applicable.

Type:dict
mantissas()[source]

Adds a Mantissas object to the tests, with all its statistics and plotting capabilities.

sec_order()[source]

Runs the Second Order tests, which are Benford’s tests performed on the differences between consecutive values of the ordered sample (a value minus the one before it, and so on). If the original series is Benford-compliant, this new sequence should also follow Benford.
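
The differencing step described above can be sketched as follows. The function name is hypothetical, and dropping zero differences (which arise from repeated values) is an assumption, made because zeros are excluded from the analysis elsewhere in this reference.

```python
def second_order_differences(values):
    """Hypothetical helper: differences between consecutive ordered values."""
    ordered = sorted(values)
    # Each value minus the one before it, in ascending order.
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    # Assumption: zero differences carry no leading digit, so they are dropped.
    return [d for d in diffs if d != 0]
```

The resulting sequence is then submitted to the same digit tests as the original data.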

summation()[source]

Creates Summation test DataFrames from Base object

class benford.benford.Source(data, decimals=2, sign='all', sec_order=False, verbose=True, inform=None)[source]

Bases: pandas.core.frame.DataFrame

Prepares the data for Analysis. pandas DataFrame subclass.

Parameters:
  • data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
  • decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, at a cost in performance.
  • sign – tells which portion of the data to consider. pos: only the positive entries; neg: only negative entries; all: all entries but zeros. Defaults to all.
  • sec_order – choice for the Second Order Test, which computes the differences between the ordered entries before running the Tests.
  • verbose (bool) – tells the number of records that are being subjected to the analysis; defaults to True.
Raises:
  • ValueError – if the sign arg is not in [‘all’, ‘pos’, ‘neg’]
  • TypeError – if not receiving int or float as input.
verbose = None

verbose or not

Type:(bool)
mantissas(report=True, show_plot=True, figsize=(15, 8), save_plot=None, save_plot_kwargs=None)[source]

Calculates the mantissas, their mean and variance, and compares them with the mean and variance of a Benford’s sequence.

Parameters:
  • report – prints the mantissas’ mean, variance, skewness and kurtosis for the sequence studied, along with reference values.
  • show_plot – plots the ordered mantissas and a line with the expected inclination. Defaults to True.
  • figsize – tuple that sets the figure dimensions.
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html). Only available when show_plot=True and save_plot is a string with the figure file path/name.
first_digits(digs, confidence=None, high_Z='pos', limit_N=None, MAD=False, MSE=False, chi_square=False, KS=False, show_plot=True, save_plot=None, save_plot_kwargs=None, simple=False, bhat_coeff=False, bhat_dist=False, kl_diverg=False, ret_df=False)[source]

Performs the Benford First Digits test with the series of numbers provided, and populates the mapping dict for future selection of the original series.

Parameters:
  • digs (int) – number of first digits to consider. Must be 1 (first digit), 2 (first two digits) or 3 (first three digits).
  • verbose (bool) – tells the number of records that are being subjected to the analysis; defaults to True.
  • confidence (int, float) – confidence level to draw lower and upper limits when plotting and to limit the top deviations to show, as well as to calculate critical values for the tests’ statistics. Defaults to None.
  • high_Z (str, int) – chooses which Z scores to use when displaying results, according to the confidence level chosen. Defaults to ‘pos’, which highlights only values higher than the expected frequencies; ‘all’ highlights both extremes (positive and negative); an integer n shows the first n entries, positive and negative, regardless of whether Z is higher than the critical value or not.
  • limit_N (int) – sets a limit to N as the sample size for the calculation of the Z scores if the sample is too big. Defaults to None.
  • MAD (bool) – calculates the Mean Absolute Difference between the found and the expected distributions; defaults to False.
  • MSE (bool) – calculates the Mean Square Error of the sample; defaults to False.
  • bhat_coeff (bool) – computes the Bhattacharyya Coefficient between the found and the expected (Benford) digit distributions; defaults to False.
  • bhat_dist (bool) – calculates the Bhattacharyya Distance between the found and the expected (Benford) digit distributions; defaults to False.
  • kl_diverg (bool) – calculates the Kullback-Leibler Divergence between the found and the expected (Benford) digit distributions; defaults to False.
  • show_plot (bool) – draws the test plot. Defaults to True.
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html). Only available when show_plot=True and save_plot is a string with the figure file path/name.
  • ret_df – returns the test DataFrame. Defaults to False. True if run by the test function.
Returns:

DataFrame with the Expected and Found proportions, and the Z scores of the differences
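
The bhat_coeff, bhat_dist and kl_diverg options refer to standard measures between two discrete distributions. A minimal sketch using the textbook definitions follows; taking the divergence of the found distribution from the expected one (rather than the reverse) is an assumption, and the function names are chosen for the example.

```python
import math


def bhattacharyya_coefficient(found, expected):
    """Overlap between two discrete distributions; 1.0 means identical."""
    return sum(math.sqrt(f * e) for f, e in zip(found, expected))


def bhattacharyya_distance(found, expected):
    """Negative natural log of the coefficient; 0.0 means identical."""
    return -math.log(bhattacharyya_coefficient(found, expected))


def kl_divergence(found, expected):
    """Kullback-Leibler divergence of `found` from `expected`, in nats."""
    # Terms with a zero found proportion contribute nothing to the sum.
    return sum(f * math.log(f / e) for f, e in zip(found, expected) if f > 0)


benford_f1d = [math.log10(1 + 1 / d) for d in range(1, 10)]
uniform = [1 / 9] * 9
```

Comparing benford_f1d with uniform gives a coefficient below 1 and a positive distance and divergence, while comparing a distribution with itself gives 1, 0 and 0 up to floating-point error.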

second_digit(confidence=None, high_Z='pos', limit_N=None, MAD=False, MSE=False, chi_square=False, KS=False, bhat_coeff=False, bhat_dist=False, kl_diverg=False, show_plot=True, save_plot=None, save_plot_kwargs=None, simple=False, ret_df=False)[source]

Performs the Benford Second Digit test with the series of numbers provided.

Parameters:
  • verbose (bool) – tells the number of records that are being subjected to the analysis; defaults to True.
  • MAD (bool) – calculates the Mean Absolute Difference between the found and the expected distributions; defaults to False.
  • confidence (int, float) – confidence level to draw lower and upper limits when plotting and to limit the top deviations to show, as well as to calculate critical values for the tests’ statistics. Defaults to None.
  • high_Z (str, int) – chooses which Z scores to use when displaying results, according to the confidence level chosen. Defaults to ‘pos’, which highlights only values higher than the expected frequencies; ‘all’ highlights both extremes (positive and negative); an integer n shows the first n entries, positive and negative, regardless of whether Z is higher than the critical value or not.
  • limit_N (int) – sets a limit to N as the sample size for the calculation of the Z scores if the sample is too big. Defaults to None.
  • MSE (bool) – calculates the Mean Square Error of the sample; defaults to False.
  • bhat_coeff (bool) – computes the Bhattacharyya Coefficient between the found and the expected (Benford) digit distributions; defaults to False.
  • bhat_dist (bool) – calculates the Bhattacharyya Distance between the found and the expected (Benford) digit distributions; defaults to False.
  • kl_diverg (bool) – calculates the Kullback-Leibler Divergence between the found and the expected (Benford) digit distributions; defaults to False.
  • show_plot (bool) – draws the test plot.
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html). Only available when show_plot=True and save_plot is a string with the figure file path/name.
  • ret_df – returns the test DataFrame. Defaults to False. True if run by the test function.
Returns:

DataFrame with the Expected and Found proportions, and the Z scores of the differences

last_two_digits(confidence=None, high_Z='pos', limit_N=None, MAD=False, MSE=False, chi_square=False, KS=False, bhat_coeff=False, bhat_dist=False, kl_diverg=False, show_plot=True, save_plot=None, save_plot_kwargs=None, simple=False, ret_df=False)[source]

Performs the Benford Last Two Digits test with the series of numbers provided.

Parameters:
  • verbose (bool) – tells the number of records that are being subjected to the analysis; defaults to True.
  • MAD (bool) – calculates the Mean Absolute Difference between the found and the expected distributions; defaults to False.
  • confidence (int, float) – confidence level to draw lower and upper limits when plotting and to limit the top deviations to show, as well as to calculate critical values for the tests’ statistics. Defaults to None.
  • high_Z (str, int) – chooses which Z scores to use when displaying results, according to the confidence level chosen. Defaults to ‘pos’, which highlights only values higher than the expected frequencies; ‘all’ highlights both extremes (positive and negative); an integer n shows the first n entries, positive and negative, regardless of whether Z is higher than the critical value or not.
  • limit_N (int) – sets a limit to N as the sample size for the calculation of the Z scores if the sample is too big. Defaults to None.
  • MSE (bool) – calculates the Mean Square Error of the sample; defaults to False.
  • bhat_coeff (bool) – computes the Bhattacharyya Coefficient between the found and the expected (Benford) digit distributions; defaults to False.
  • bhat_dist (bool) – calculates the Bhattacharyya Distance between the found and the expected (Benford) digit distributions; defaults to False.
  • kl_diverg (bool) – calculates the Kullback-Leibler Divergence between the found and the expected (Benford) digit distributions; defaults to False.
  • show_plot (bool) – draws the test plot.
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html). Only available when show_plot=True and save_plot is a string with the figure file path/name.
Returns:

DataFrame with the Expected and Found proportions, and the Z scores of the differences

summation(digs=2, top=20, show_plot=True, save_plot=None, save_plot_kwargs=None, ret_df=False)[source]

Performs the Summation test. In a Benford series, the sums of the entries beginning with the same digits tend to be the same.

Parameters:
  • digs – tells which first digits to use. 1: first; 2: first two; 3: first three. Defaults to 2.
  • top – chooses how many top values to show. Defaults to 20.
  • show_plot – plots the results. Defaults to True.
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html). Only available when show_plot=True and save_plot is a string with the figure file path/name.
Returns:

DataFrame with the Expected and Found proportions, and their absolute differences
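
A sketch of the idea behind the Summation test, assuming positive integer inputs (for instance, data already scaled by its decimal places) and a uniform expected proportion across digit groups. The function name summation_proportions and the string-based digit extraction are illustrative choices only.

```python
from collections import defaultdict


def summation_proportions(values, digs=2):
    """Hypothetical helper: share of the total sum held by each leading-digit group."""
    sums = defaultdict(float)
    for v in values:
        text = str(abs(int(v)))
        # Skip zeros and entries with fewer digits than requested.
        if text == "0" or len(text) < digs:
            continue
        sums[int(text[:digs])] += abs(v)
    total = sum(sums.values())
    # 9 groups for one digit, 90 for two, 900 for three.
    expected = 1 / (9 * 10 ** (digs - 1))
    # Map each group to (found proportion, absolute difference from expected).
    return {d: (s / total, abs(s / total - expected)) for d, s in sorted(sums.items())}
```

Groups whose share of the total is far above the expected proportion point to a few very large entries or to many repeated ones.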

duplicates(top_Rep=20, inform=None)[source]

Performs a duplicates test and maps the duplicates count in descending order.

Parameters:
  • verbose (bool) – tells how many duplicated entries were found and prints the top numbers according to the top_Rep argument. Defaults to True.
  • top_Rep (int or None) – chooses how many of the most repeated entries will be shown. Defaults to 20. If None, returns all the ordered repetitions.
Returns:

DataFrame with the duplicated records and their occurrence counts, in descending order (if verbose is False; if True, prints to terminal).

Raises:

ValueError – if the top_Rep arg is not int or None.
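
The counting behind a duplicates test can be sketched with the standard library alone. The function below is hypothetical: it returns plain tuples instead of a DataFrame and mirrors only the top_Rep behaviour described above.

```python
from collections import Counter


def count_duplicates(values, top_rep=20):
    """Hypothetical helper: (value, count) pairs for repeated values, most frequent first."""
    if top_rep is not None and not isinstance(top_rep, int):
        raise ValueError("top_rep must be an int or None")
    # most_common() sorts by count in descending order.
    repeated = [(v, c) for v, c in Counter(values).most_common() if c > 1]
    return repeated if top_rep is None else repeated[:top_rep]
```
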

class benford.benford.Roll_mad(data, test, window, decimals=2, sign='all')[source]

Bases: object

Applies the MAD to sequential subsets of the Series, returning another Series.

Parameters:
  • data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
  • test – tells which test to use. 1: First Digit; 2: First Two Digits; 3: First Three Digits; 22: Second Digit; and -2: Last Two Digits.
  • window – size of the subset to be used.
  • decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, at a cost in performance.
  • sign – tells which portion of the data to consider. pos: only the positive entries; neg: only negative entries; all: all entries but zeros. Defaults to all.
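
A rolling MAD can be sketched as below for the First Digit test only. The sketch assumes positive, non-zero integers and a window that slides one record at a time; both are assumptions, as are the helper names.

```python
import math

# Expected first-digit proportions under Benford's Law.
EXPECTED_F1D = [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit_mad(values):
    """Mean Absolute Deviation of the first-digit proportions of `values`."""
    counts = [0] * 9
    for v in values:
        counts[int(str(abs(int(v)))[0]) - 1] += 1
    n = len(values)
    return sum(abs(c / n - e) for c, e in zip(counts, EXPECTED_F1D)) / 9


def rolling_mad(values, window):
    """MAD of each consecutive window of `window` records."""
    return [first_digit_mad(values[i:i + window])
            for i in range(len(values) - window + 1)]


# Powers of two are a classic Benford-like sequence.
series = rolling_mad([2 ** k for k in range(1, 41)], window=30)
```

Plotting such a series against the record index shows where in the data the fit to Benford's Law degrades.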
test = None

the test (F1D, SD, F2D…) used for the MAD calculation and critical values

show_plot(figsize=(15, 8), save_plot=None, save_plot_kwargs=None)[source]

Shows the rolling MAD plot

Parameters:
  • figsize – the figure dimensions.
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when save_plot is a string with the figure file path/name.
class benford.benford.Roll_mse(data, test, window, decimals=2, sign='all')[source]

Bases: object

Applies the MSE to sequential subsets of the Series, returning another Series.

Parameters:
  • data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
  • test – tells which test to use. 1: First Digit; 2: First Two Digits; 3: First Three Digits; 22: Second Digit; and -2: Last Two Digits.
  • window – size of the subset to be used.
  • decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, at a cost in performance.
  • sign – tells which portion of the data to consider. ‘pos’: only the positive entries; ‘neg’: only negative entries; ‘all’: all entries but zeros. Defaults to ‘all’.
show_plot(figsize=(15, 8), save_plot=None, save_plot_kwargs=None)[source]

Shows the rolling MSE plot

Parameters:
  • figsize – the figure dimensions.
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when save_plot is a string with the figure file path/name.
benford.benford.first_digits(data, digs, decimals=2, sign='all', verbose=True, confidence=None, high_Z='pos', limit_N=None, MAD=False, MSE=False, chi_square=False, KS=False, show_plot=True, save_plot=None, save_plot_kwargs=None, inform=None)[source]

Performs the Benford First Digits test on the series of numbers provided.

Parameters:
  • data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
  • decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, at a cost in performance.
  • sign – tells which portion of the data to consider. ‘pos’: only the positive entries; ‘neg’: only negative entries; ‘all’: all entries but zeros. Defaults to ‘all’.
  • digs (int) – number of first digits to consider. Must be 1 (first digit), 2 (first two digits) or 3 (first three digits).
  • verbose (bool) – tells the number of records that are being subjected to the analysis and returns the analysis DataFrame sorted by Z score, from highest to lowest. Defaults to True.
  • MAD (bool) – calculates the Mean Absolute Difference between the found and the expected distributions; defaults to False.
  • confidence (int, float) – confidence level to draw lower and upper limits when plotting and to limit the top deviations to show. Defaults to None.
  • high_Z (str, int) – chooses which Z scores to use when displaying results, according to the confidence level chosen. Defaults to ‘pos’, which highlights only values higher than the expected frequencies; ‘all’ highlights both extremes (positive and negative); an integer n shows the first n entries, positive and negative, regardless of whether Z is higher than the critical value or not.
  • limit_N (int) – sets a limit to N as the sample size for the calculation of the Z scores if the sample is too big. Defaults to None.
  • MSE (bool) – calculates the Mean Square Error of the sample; defaults to False.
  • chi_square – calculates the chi_square statistic of the sample and compares it with a critical value, according to the confidence level chosen and the series’s degrees of freedom. Defaults to False. Requires confidence != None.
  • KS – calculates the Kolmogorov-Smirnov test, comparing the cumulative distribution of the sample with the Benford’s, according to the confidence level chosen. Defaults to False. Requires confidence != None.
  • show_plot (bool) – draws the test plot.
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html). Only available when show_plot=True and save_plot is a string with the figure file path/name.
Returns:

DataFrame with the Expected and Found proportions, and the Z scores of the differences if the confidence is not None.

benford.benford.second_digit(data, decimals=2, sign='all', verbose=True, confidence=None, high_Z='pos', limit_N=None, MAD=False, MSE=False, chi_square=False, KS=False, show_plot=True, save_plot=None, save_plot_kwargs=None, inform=None)[source]

Performs the Benford Second Digit test on the series of numbers provided.

Parameters:
  • data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
  • decimals – number of decimal places to consider. Defaults to 2. If the data are integers, set to 0. If set to -infer-, it will remove the zeros and consider up to the fifth decimal place to the right, at a cost in performance.
  • sign – tells which portion of the data to consider. ‘pos’: only the positive entries; ‘neg’: only negative entries; ‘all’: all entries but zeros. Defaults to ‘all’.
  • verbose (bool) – tells the number of records that are being subjected to the analysis and returns the analysis DataFrame sorted by Z score, from highest to lowest. Defaults to True.
  • MAD (bool) – calculates the Mean Absolute Difference between the found and the expected distributions; defaults to False.
  • confidence (int, float) – confidence level to draw lower and upper limits when plotting and to limit the top deviations to show. Defaults to None.
  • high_Z (str, int) – chooses which Z scores to use when displaying results, according to the confidence level chosen. Defaults to ‘pos’, which highlights only values higher than the expected frequencies; ‘all’ highlights both extremes (positive and negative); an integer n shows the first n entries, positive and negative, regardless of whether Z is higher than the critical value or not.
  • limit_N (int) – sets a limit to N as the sample size for the calculation of the Z scores if the sample is too big. Defaults to None.
  • MSE (bool) – calculates the Mean Square Error of the sample; defaults to False.
  • chi_square – calculates the chi_square statistic of the sample and compares it with a critical value, according to the confidence level chosen and the series’s degrees of freedom. Defaults to False. Requires confidence != None.
  • KS – calculates the Kolmogorov-Smirnov test, comparing the cumulative distribution of the sample with the Benford’s, according to the confidence level chosen. Defaults to False. Requires confidence != None.
  • show_plot (bool) – draws the test plot.
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is infered by the file name extension. Only available when plot=True.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html Only available when plot=True and save_plot is a string with the figure file path/name.
Returns:

DataFrame with the Expected and Found proportions, and the Z scores of

the differences if the confidence is not None.
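
Example (illustrative sketch, not the package's own code): the Second Digit test compares the found proportions against Benford's expected second-digit proportions, where the expected share of digit d is the sum of log10(1 + 1/(10k + d)) over the first digits k = 1..9. The helper name expected_second_digit is hypothetical and only the standard library is used.

```python
import math

def expected_second_digit():
    """Return Benford's expected proportion for each second digit 0-9."""
    return {
        # Sum over every possible first digit k followed by second digit d.
        d: sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
        for d in range(10)
    }

probs = expected_second_digit()
for digit, p in probs.items():
    print(digit, round(p, 5))
```

The proportions decrease slowly from digit 0 to digit 9 and add up to 1, which is why the second-digit distribution is much flatter than the first-digit one.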

benford.benford.last_two_digits(data, decimals=2, sign='all', verbose=True, confidence=None, high_Z='pos', limit_N=None, MAD=False, MSE=False, chi_square=False, KS=False, show_plot=True, save_plot=None, save_plot_kwargs=None, inform=None)[source]

Performs the Last Two Digits test on the series of numbers provided.

Parameters:
  • data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
  • decimals – number of decimal places to consider. Defaults to 2. If the values are integers, set to 0. If set to ‘infer’, it will remove the zeros and consider up to the fifth decimal place to the right, at the cost of performance.
  • sign – tells which portion of the data to consider. ‘pos’: only the positive entries; ‘neg’: only the negative entries; ‘all’: all entries but zeros. Defaults to ‘all’.
  • verbose (bool) – reports the number of records subjected to the analysis and returns the analysis DataFrame sorted by Z score in descending order. Defaults to True.
  • confidence (int, float) – confidence level to draw lower and upper limits when plotting and to limit the top deviations to show. Defaults to None.
  • high_Z (str, int) – chooses which Z scores are used when displaying results, according to the confidence level chosen. Defaults to ‘pos’, which highlights only values higher than the expected frequencies; ‘all’ highlights both extremes (positive and negative); an integer n uses the first n entries, positive and negative, regardless of whether Z exceeds the critical value for the confidence level.
  • limit_N (int) – sets a limit to N as the sample size for the calculation of the Z scores if the sample is too big. Defaults to None.
  • MAD (bool) – calculates the Mean Absolute Deviation between the found and the expected distributions. Defaults to False.
  • MSE (bool) – calculates the Mean Square Error of the sample. Defaults to False.
  • chi_square – calculates the chi-square statistic of the sample and compares it with a critical value, according to the confidence level chosen and the series’ degrees of freedom. Defaults to False. Requires confidence != None.
  • KS – calculates the Kolmogorov-Smirnov test, comparing the cumulative distribution of the sample with Benford’s, according to the confidence level chosen. Defaults to False. Requires confidence != None.
  • show_plot (bool) – draws the test plot.
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html). Only available when show_plot=True and save_plot is a string with the figure file path/name.
Returns:

DataFrame with the Expected and Found proportions, and the Z scores of the differences if the confidence is not None.
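
Example (illustrative sketch under stated assumptions): one way to picture how the decimals argument interacts with the Last Two Digits test is to scale each value by 10**decimals and take the remainder modulo 100. The helper last_two_digits_of is hypothetical, and rounding the scaled value is an assumption of this sketch; the package may scale or truncate differently.

```python
def last_two_digits_of(value, decimals=2):
    """Return the last two digits of |value| after scaling by 10**decimals."""
    # Assumption: the scaled value is rounded to the nearest integer.
    scaled = round(abs(value) * 10 ** decimals)
    return scaled % 100

sample = [1234.56, -78.9, 0.07, 1500]
found = [last_two_digits_of(v) for v in sample]

# The last two digits are expected to be uniform: each of 00..99 with share 1/100.
expected_proportion = 1 / 100
print(found, expected_proportion)
```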

benford.benford.mantissas(data, report=True, show_plot=True, arc_test=True, save_plot=None, save_plot_kwargs=None, inform=None)[source]

Extracts the mantissas of the records’ logarithms.

Parameters:
  • data – sequence to compute mantissas from: numpy 1D array, pandas Series or pandas DataFrame column.
  • report – prints the mantissas’ mean, variance, skewness and kurtosis for the sequence studied, along with reference values.
  • show_plot – plots the ordered mantissas and a line with the expected slope. Defaults to True.
  • arc_test – draws the Arc Test plot. Defaults to True.
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html). Only available when show_plot=True and save_plot is a string with the figure file path/name.
Returns:

Series with the data mantissas.
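
Example (illustrative sketch, not the package's own code): the mantissa of a record is the fractional part of its base-10 logarithm, so it always falls in [0, 1). The helper name mantissa is hypothetical. For a Benford-conforming set the mantissas are expected to be uniform on [0, 1), which gives a mean of 0.5 and a variance of 1/12.

```python
import math

def mantissa(value):
    """Fractional part of log10(|value|), always in [0, 1)."""
    # Python's % keeps the result non-negative even when the logarithm is negative.
    return math.log10(abs(value)) % 1

values = [200, 0.02, 3500]
mantissas = [mantissa(v) for v in values]
print(mantissas)
```

Note that 200 and 0.02 share the same mantissa, since they differ only by a power of ten.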

benford.benford.summation(data, digs=2, decimals=2, sign='all', top=20, verbose=True, show_plot=True, save_plot=None, save_plot_kwargs=None, inform=None)[source]

Performs the Summation test. In a Benford series, the sums of the entries beginning with the same digits tend to be the same. Works only with the First Digits (1, 2 or 3) test.

Parameters:
  • digs – tells the first digits to use: 1: first digit; 2: first two digits; 3: first three digits. Defaults to 2.
  • decimals – number of decimal places to consider. Defaults to 2. If the values are integers, set to 0. If set to ‘infer’, it will remove the zeros and consider up to the fifth decimal place to the right, at the cost of performance.
  • top – chooses how many top values to show. Defaults to 20.
  • show_plot – plots the results. Defaults to True.
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html). Only available when show_plot=True and save_plot is a string with the figure file path/name.
Returns:

DataFrame with the Summation test results, sorted in descending order if verbose == True.
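
Example (illustrative sketch under stated assumptions): the idea behind the Summation test can be pictured by grouping the records by their leading digits and comparing each group's share of the total. The helpers first_digits and summation_sketch are hypothetical, and summing absolute values is an assumption of this sketch rather than a statement about the package's internals.

```python
from collections import defaultdict

def first_digits(value, digs):
    """Return the first `digs` significant digits of a nonzero number as an int."""
    # Scientific notation gives "d.ddd...e+XX"; drop the point and keep the leading digits.
    digits_text = f"{abs(value):.15e}".split("e")[0].replace(".", "")
    return int(digits_text[:digs])

def summation_sketch(values, digs=2):
    """Share of the total (absolute) sum held by each leading-digits group."""
    sums = defaultdict(float)
    for v in values:
        if v != 0:  # zeros have no leading digit
            sums[first_digits(v, digs)] += abs(v)
    total = sum(sums.values())
    return {k: s / total for k, s in sorted(sums.items())}

shares = summation_sketch([1200, 125, 3400, 0.0345], digs=2)
print(shares)
```

With digs=2 there are 90 possible groups (10 to 99), so in a Benford series each group's share would be expected to sit near 1/90.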

benford.benford.mad(data, test, decimals=2, sign='all', verbose=False)[source]

Calculates the Mean Absolute Deviation of the Series

Parameters:
  • data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
  • test – informs which base test to use for the mad.
  • decimals – number of decimal places to consider. Defaults to 2. If the values are integers, set to 0. If set to ‘infer’, it will remove the zeros and consider up to the fifth decimal place to the right, at the cost of performance.
  • sign – tells which portion of the data to consider. ‘pos’: only the positive entries; ‘neg’: only the negative entries; ‘all’: all entries but zeros. Defaults to ‘all’.
Returns:

the Mean Absolute Deviation of the Series

Return type:

float
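
Example (illustrative sketch, not the package's own code): the Mean Absolute Deviation is the average of the absolute differences between found and expected proportions. The helper name mean_absolute_deviation is hypothetical and the found proportions below are made-up values for illustration only.

```python
import math

def mean_absolute_deviation(found, expected):
    """Average absolute difference between two aligned lists of proportions."""
    return sum(abs(f - e) for f, e in zip(found, expected)) / len(expected)

# Benford's expected first-digit proportions for digits 1..9.
expected = [math.log10(1 + 1 / d) for d in range(1, 10)]
# Made-up found proportions, for illustration only.
found = [0.32, 0.17, 0.12, 0.10, 0.08, 0.07, 0.05, 0.05, 0.04]

mad_value = mean_absolute_deviation(found, expected)
print(mad_value)
```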

benford.benford.mse(data, test, decimals=2, sign='all', verbose=False)[source]

Calculates the Mean Squared Error of the Series

Parameters:
  • data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
  • test – informs which base test to use for the mse.
  • decimals – number of decimal places to consider. Defaults to 2. If the values are integers, set to 0. If set to ‘infer’, it will remove the zeros and consider up to the fifth decimal place to the right, at the cost of performance.
  • sign – tells which portion of the data to consider. ‘pos’: only the positive entries; ‘neg’: only the negative entries; ‘all’: all entries but zeros. Defaults to ‘all’.
Returns:

the Mean Squared Error of the Series

Return type:

float

benford.benford.bhattacharyya_distance(data, test, decimals, sign='all', verbose=False)[source]

Computes the Bhattacharyya Distance between the Found and the Expected (Benford) digits distributions, according to the test chosen (First, Second, First Two…).

Parameters:
  • data (ndarray, Series) – sequence to be evaluated, with values being integers or floats.
  • test (int, str) – informs which base test to be used.
  • decimals (int) – number of decimal places to consider. If the values are integers, set to 0. If set to ‘infer’, it will remove the zeros and consider up to the fifth decimal place to the right, at the cost of performance.
  • sign (str, optional) – tells which portion of the data to consider. ‘pos’: only the positive entries; ‘neg’: only the negative entries; ‘all’: all entries but zeros. Defaults to ‘all’.
Returns:

the Bhattacharyya Distance between the distributions

Return type:

float
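
Example (illustrative sketch, not the package's own code): for two discrete distributions p and q, the Bhattacharyya distance is -ln(sum(sqrt(p_i * q_i))). The helper name bhattacharyya_sketch is hypothetical and the found proportions are made-up values.

```python
import math

def bhattacharyya_sketch(found, expected):
    """Bhattacharyya distance between two aligned discrete distributions."""
    # The Bhattacharyya coefficient is 1 when the distributions are identical.
    coefficient = sum(math.sqrt(f * e) for f, e in zip(found, expected))
    return -math.log(coefficient)

expected = [math.log10(1 + 1 / d) for d in range(1, 10)]
# Made-up found proportions, for illustration only.
found = [0.32, 0.17, 0.12, 0.10, 0.08, 0.07, 0.05, 0.05, 0.04]

distance = bhattacharyya_sketch(found, expected)
print(distance)
```

The distance is zero for identical distributions and grows as the found proportions drift away from the expected ones.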

benford.benford.kullback_leibler_divergence(data, test, decimals, sign='all', verbose=False)[source]

Computes the Kullback-Leibler Divergence between the Found and the Expected (Benford) digits distributions, according to the test chosen (First, Second, First Two…).

Parameters:
  • data (ndarray, Series) – sequence to be evaluated, with values being integers or floats.
  • test (int, str) – informs which base test to be used.
  • decimals (int) – number of decimal places to consider. If the values are integers, set to 0. If set to ‘infer’, it will remove the zeros and consider up to the fifth decimal place to the right, at the cost of performance.
  • sign (str, optional) – tells which portion of the data to consider. ‘pos’: only the positive entries; ‘neg’: only the negative entries; ‘all’: all entries but zeros. Defaults to ‘all’.
Returns:

the Kullback-Leibler Divergence between the distributions

Return type:

float
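
Example (illustrative sketch under stated assumptions): the Kullback-Leibler divergence of a found distribution p from an expected distribution q is sum(p_i * ln(p_i / q_i)). Which distribution plays the role of p is an assumption here (found first, expected second), and the helper name kl_divergence_sketch is hypothetical.

```python
import math

def kl_divergence_sketch(found, expected):
    """KL divergence D(found || expected) for aligned discrete distributions."""
    # Terms with a zero found proportion contribute nothing to the sum.
    return sum(f * math.log(f / e) for f, e in zip(found, expected) if f > 0)

expected = [math.log10(1 + 1 / d) for d in range(1, 10)]
# Made-up found proportions, for illustration only.
found = [0.32, 0.17, 0.12, 0.10, 0.08, 0.07, 0.05, 0.05, 0.04]

divergence = kl_divergence_sketch(found, expected)
print(divergence)
```

Unlike the Bhattacharyya distance, this measure is not symmetric: swapping the two arguments generally gives a different value.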

benford.benford.mad_summ(data, test, decimals=2, sign='all', verbose=False)[source]

Calculate the Mean Absolute Deviation of the Summation Test

Parameters:
  • data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
  • test – informs which base test to use for the summation mad.
  • decimals – number of decimal places to consider. Defaults to 2. If the values are integers, set to 0. If set to ‘infer’, it will remove the zeros and consider up to the fifth decimal place to the right, at the cost of performance.
  • sign – tells which portion of the data to consider. ‘pos’: only the positive entries; ‘neg’: only the negative entries; ‘all’: all entries but zeros. Defaults to ‘all’.
Returns:

the Mean Absolute Deviation of the Summation Test

Return type:

float

benford.benford.rolling_mad(data, test, window, decimals=2, sign='all', show_plot=False, save_plot=None, save_plot_kwargs=None)[source]

Applies the MAD to sequential subsets of the records.

Parameters:
  • data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
  • test – tells which test to use. 1: First Digit; 2: First Two Digits; 3: First Three Digits; 22: Second Digit; and -2: Last Two Digits.
  • window – size of the subset to be used.
  • decimals – number of decimal places to consider. Defaults to 2. If the values are integers, set to 0. If set to ‘infer’, it will remove the zeros and consider up to the fifth decimal place to the right, at the cost of performance.
  • sign – tells which portion of the data to consider. ‘pos’: only the positive entries; ‘neg’: only the negative entries; ‘all’: all entries but zeros. Defaults to ‘all’.
  • show_plot (bool) – draws the test plot.
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html). Only available when show_plot=True and save_plot is a string with the figure file path/name.
Returns:

Series with sequentially computed MADs.
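
Example (illustrative sketch under stated assumptions): a rolling MAD can be pictured as computing the first-digit MAD on each consecutive window of the records. That the windows slide one record at a time is an assumption of this sketch, and the helper names are hypothetical.

```python
import math

# Benford's expected first-digit proportions for digits 1..9.
EXPECTED = [math.log10(1 + 1 / d) for d in range(1, 10)]

def first_digit(value):
    """First significant digit of a nonzero number."""
    return int(f"{abs(value):.15e}"[0])

def window_mad(values):
    """First-digit MAD of one window of records."""
    counts = [0] * 9
    for v in values:
        counts[first_digit(v) - 1] += 1
    n = len(values)
    return sum(abs(c / n - e) for c, e in zip(counts, EXPECTED)) / 9

def rolling_mad_sketch(values, window):
    """MAD of every window of `window` consecutive records."""
    return [window_mad(values[i:i + window])
            for i in range(len(values) - window + 1)]

data = [2 ** k for k in range(1, 61)]  # powers of two roughly follow Benford's Law
series = rolling_mad_sketch(data, window=30)
print(len(series))
```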

benford.benford.rolling_mse(data, test, window, decimals=2, sign='all', show_plot=False, save_plot=None, save_plot_kwargs=None)[source]

Applies the MSE to sequential subsets of the records.

Parameters:
  • data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
  • test – tells which test to use. 1: First Digit; 2: First Two Digits; 3: First Three Digits; 22: Second Digit; and -2: Last Two Digits.
  • window – size of the subset to be used.
  • decimals – number of decimal places to consider. Defaults to 2. If the values are integers, set to 0. If set to ‘infer’, it will remove the zeros and consider up to the fifth decimal place to the right, at the cost of performance.
  • sign – tells which portion of the data to consider. ‘pos’: only the positive entries; ‘neg’: only the negative entries; ‘all’: all entries but zeros. Defaults to ‘all’.
  • show_plot (bool) – draws the test plot.
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html). Only available when show_plot=True and save_plot is a string with the figure file path/name.
Returns:

Series with sequentially computed MSEs.

benford.benford.duplicates(data, top_Rep=20, verbose=True, inform=None)[source]

Performs a duplicates test and maps the duplicates count in descending order.

Parameters:
  • data – sequence to take the duplicates from. pandas Series or numpy ndarray.
  • verbose (bool) – tells how many duplicated entries were found and prints the top numbers according to the top_Rep argument. Defaults to True.
  • top_Rep – chooses how many duplicated entries will be shown with the top repetitions. int or None. Defaults to 20. If None, returns all the ordered repetitions.
Returns:

DataFrame with the duplicated records and their respective counts

Raises:

ValueError – if the top_Rep arg is not int or None.
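
Example (illustrative sketch, not the package's own code): a duplicates test counts how often each value occurs and lists the repeated ones from most to least frequent. The helper duplicates_sketch and its top_rep argument are hypothetical stand-ins that mirror the behaviour described above using only the standard library.

```python
from collections import Counter

def duplicates_sketch(values, top_rep=20):
    """Return (value, count) pairs for repeated values, most frequent first."""
    if top_rep is not None and not isinstance(top_rep, int):
        raise ValueError("top_rep must be an int or None")
    counts = Counter(values)
    # Keep only values that occur more than once, in descending order of count.
    repeated = [(v, c) for v, c in counts.most_common() if c > 1]
    return repeated if top_rep is None else repeated[:top_rep]

result = duplicates_sketch([10.5, 3, 10.5, 7, 3, 10.5, 8])
print(result)
```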

benford.benford.second_order(data, test, decimals=2, sign='all', verbose=True, MAD=False, confidence=None, high_Z='pos', limit_N=None, MSE=False, show_plot=True, save_plot=None, save_plot_kwargs=None, inform=None)[source]

Performs the chosen test on the differences between consecutive values of the sorted sequence, hence Second Order.

Parameters:
  • data – sequence of numbers to be evaluated. Must be a numpy 1D array, a pandas Series or a pandas DataFrame column, with values being integers or floats.
  • test – the test to be performed: 1 or ‘F1D’: First Digit; 2 or ‘F2D’: First Two Digits; 3 or ‘F3D’: First Three Digits; 22 or ‘SD’: Second Digit; -2 or ‘L2D’: Last Two Digits.
  • decimals – number of decimal places to consider. Defaults to 2. If the values are integers, set to 0. If set to ‘infer’, it will remove the zeros and consider up to the fifth decimal place to the right, at the cost of performance.
  • sign – tells which portion of the data to consider. ‘pos’: only the positive entries; ‘neg’: only the negative entries; ‘all’: all entries but zeros. Defaults to ‘all’.
  • verbose (bool) – reports the number of records subjected to the analysis and returns the analysis DataFrame sorted by Z score in descending order. Defaults to True.
  • MAD (bool) – calculates the Mean Absolute Deviation between the found and the expected distributions. Defaults to False.
  • confidence (int, float) – confidence level to draw lower and upper limits when plotting and to limit the top deviations to show. Defaults to None.
  • high_Z (str, int) – chooses which Z scores are used when displaying results, according to the confidence level chosen. Defaults to ‘pos’, which highlights only values higher than the expected frequencies; ‘all’ highlights both extremes (positive and negative); an integer n uses the first n entries, positive and negative, regardless of whether Z exceeds the critical value for the confidence level.
  • limit_N (int) – sets a limit to N as the sample size for the calculation of the Z scores if the sample is too big. Defaults to None.
  • MSE (bool) – calculates the Mean Square Error of the sample. Defaults to False.
  • chi_square – calculates the chi-square statistic of the sample and compares it with a critical value, according to the confidence level chosen and the series’ degrees of freedom. Defaults to False. Requires confidence != None.
  • KS – calculates the Kolmogorov-Smirnov test, comparing the cumulative distribution of the sample with Benford’s, according to the confidence level chosen. Defaults to False. Requires confidence != None.
  • show_plot (bool) – draws the test plot.
  • save_plot (str) – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when show_plot=True.
  • save_plot_kwargs (dict) – any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html). Only available when show_plot=True and save_plot is a string with the figure file path/name.
Returns:

DataFrame of the test chosen, but applied on Second Order pre-processed data.
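
Example (illustrative sketch under stated assumptions): the Second Order pre-processing can be pictured as sorting the records and taking the differences between neighbours, which are then fed to the chosen digit test. Dropping zero differences is an assumption of this sketch, and the helper name second_order_differences is hypothetical.

```python
def second_order_differences(values):
    """Differences between consecutive values of the sorted sequence."""
    ordered = sorted(values)
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    # Assumption: zero differences (from repeated values) carry no digits and are dropped.
    return [d for d in diffs if d != 0]

diffs = second_order_differences([15, 3, 8, 8, 20])
print(diffs)
```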

benford.expected module

class benford.expected.First(digs, plot=True, save_plot=None, save_plot_kwargs=None)[source]

Bases: pandas.core.frame.DataFrame

Holds the expected probabilities of the First, First Two, or First Three digits according to Benford’s distribution.

Parameters:
  • digs – 1, 2 or 3 - tells which of the first digits to consider: 1 for the First Digit, 2 for the First Two Digits and 3 for the First Three Digits.
  • plot – option to plot a bar chart of the Expected proportions. Defaults to True.
  • save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when plot=True.
  • save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html). Only available when plot=True and save_plot is a string with the figure file path/name.
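
Example (illustrative sketch, not the package's own code): Benford's expected proportion for a leading digit group d is log10(1 + 1/d), where d runs over 1..9, 10..99 or 100..999 for the First, First Two and First Three Digits. The helper name expected_first_digits is hypothetical.

```python
import math

def expected_first_digits(digs):
    """Benford's expected proportions for the first `digs` digits (1, 2 or 3)."""
    start, stop = 10 ** (digs - 1), 10 ** digs
    return {d: math.log10(1 + 1 / d) for d in range(start, stop)}

first = expected_first_digits(1)
first_two = expected_first_digits(2)
print(round(first[1], 5), round(first_two[10], 5))
```

In each case the proportions add up to 1, because the logarithms telescope across the whole digit range.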
class benford.expected.Second(plot=True, save_plot=None, save_plot_kwargs=None)[source]

Bases: pandas.core.frame.DataFrame

Holds the expected probabilities of the Second Digits according to Benford’s distribution.

Parameters:
  • plot – option to plot a bar chart of the Expected proportions. Defaults to True.
  • save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when plot=True.
  • save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html). Only available when plot=True and save_plot is a string with the figure file path/name.
class benford.expected.LastTwo(num=False, plot=True, save_plot=None, save_plot_kwargs=None)[source]

Bases: pandas.core.frame.DataFrame

Holds the expected probabilities of the Last Two Digits according to Benford’s distribution.

Parameters:
  • plot – option to plot a bar chart of the Expected proportions. Defaults to True.
  • save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension. Only available when plot=True.
  • save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html). Only available when plot=True and save_plot is a string with the figure file path/name.

benford.stats module

benford.stats.Z_score(frame, N)[source]

Computes the Z statistics for the proportions studied

Parameters:
  • frame – DataFrame with the expected proportions and the already calculated Absolute Differences between the found and expected proportions
  • N – sample size
Returns:

Series of computed Z scores
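
Example (illustrative sketch under stated assumptions): a common form of the Z statistic for a single proportion divides the absolute difference, less a continuity correction of 1/(2N), by the standard error sqrt(p(1 - p)/N) of the expected proportion p. Applying the correction only when it is smaller than the absolute difference is an assumption of this sketch; the package may apply it unconditionally. The helper name z_score_sketch is hypothetical.

```python
import math

def z_score_sketch(found, expected, n):
    """Z statistic for one found proportion against its expected proportion."""
    abs_diff = abs(found - expected)
    correction = 1 / (2 * n)
    # Assumption: use the continuity correction only if it is smaller than the difference.
    if correction < abs_diff:
        abs_diff -= correction
    return abs_diff / math.sqrt(expected * (1 - expected) / n)

# Made-up found proportion for first digit 1 in a sample of 1000 records.
z = z_score_sketch(found=0.33, expected=math.log10(2), n=1000)
print(z)
```

Because N sits in the denominator of the standard error, very large samples inflate Z for tiny deviations, which is the situation the limit_N argument of the test functions is meant to address.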

benford.stats.chi_sq(frame, ddf, confidence, verbose=True)[source]

Computes the chi-square statistic of the found distributions and compares it with the critical chi-square of such a sample, according to the confidence level chosen and the degrees of freedom (len(sample) - 1).

Parameters:
  • frame – DataFrame with Found, Expected and their difference columns.
  • ddf – Degrees of freedom to consider.
  • confidence – Confidence level to look up critical value.
  • verbose – prints the chi-square result and compares it to the critical chi-square for the sample. Defaults to True.
Returns:

The computed chi-square statistic and the critical chi-square (according to the degrees of freedom and confidence level), for comparison. None if confidence is None.
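
Example (illustrative sketch, not the package's own code): with proportions rather than counts, the chi-square statistic can be written as N * sum((found - expected)^2 / expected). The helper name chi_square_sketch is hypothetical, the found proportions are made-up values, and the critical value is hard-coded from a standard chi-square table instead of being looked up.

```python
import math

def chi_square_sketch(found, expected, n):
    """Chi-square statistic from aligned found/expected proportions and sample size n."""
    return n * sum((f - e) ** 2 / e for f, e in zip(found, expected))

expected = [math.log10(1 + 1 / d) for d in range(1, 10)]
# Made-up found proportions, for illustration only.
found = [0.32, 0.17, 0.12, 0.10, 0.08, 0.07, 0.05, 0.05, 0.04]

statistic = chi_square_sketch(found, expected, n=1000)
ddf = len(expected) - 1  # 8 degrees of freedom for the First Digit test
critical_95 = 15.507     # table value for 8 degrees of freedom at 95% confidence
print(statistic, statistic > critical_95)
```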

benford.stats.chi_sq_2(frame)[source]

Computes the chi-square statistic of the found distributions

Parameters: frame – DataFrame with Found, Expected and their difference columns.
Returns: The computed chi-square statistic.
benford.stats.kolmogorov_smirnov(frame, confidence, N, verbose=True)[source]

Computes the Kolmogorov-Smirnov test of the found distributions and compares it with the critical Kolmogorov-Smirnov value of such a sample, according to the confidence level chosen.

Parameters:
  • frame – DataFrame with Found and Expected distributions.
  • confidence – Confidence level to look up critical value.
  • N – Sample size
  • verbose – prints the KS result and the critical value for the sample. Defaults to True.
Returns:

The supremum, which is the greatest absolute difference between the cumulative Found and Expected proportions, and the Kolmogorov-Smirnov critical value according to the confidence level, for comparison.
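
Example (illustrative sketch under stated assumptions): the Kolmogorov-Smirnov statistic is the largest absolute gap between the cumulative found and cumulative expected proportions. The critical value below uses the common large-sample approximation 1.36 / sqrt(N) for 95% confidence, which is an assumption of this sketch and not necessarily the table the package uses. The helper names are hypothetical.

```python
import math
from itertools import accumulate

def ks_supremum(found, expected):
    """Largest absolute difference between the two cumulative distributions."""
    return max(abs(f - e) for f, e in zip(accumulate(found), accumulate(expected)))

def ks_critical_95(n):
    """Approximate 95% critical value for a sample of size n (assumed formula)."""
    return 1.36 / math.sqrt(n)

expected = [math.log10(1 + 1 / d) for d in range(1, 10)]
# Made-up found proportions, for illustration only.
found = [0.32, 0.17, 0.12, 0.10, 0.08, 0.07, 0.05, 0.05, 0.04]

supremum = ks_supremum(found, expected)
print(supremum, supremum > ks_critical_95(1000))
```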

benford.stats.kolmogorov_smirnov_2(frame)[source]

Computes the Kolmogorov-Smirnov test of the found distributions

Parameters: frame – DataFrame with Found and Expected distributions.
Returns: The supremum, which is the greatest absolute difference between the cumulative Found and Expected proportions.
benford.stats.mad(frame, test, verbose=True)[source]

Computes the Mean Absolute Deviation (MAD) between the found and the expected proportions.

Parameters:
  • frame – DataFrame with the Absolute Deviations already calculated.
  • test – Test to compute the MAD from (F1D, SD, F2D…)
  • verbose – prints the MAD result and compares to limit values of conformity. Defaults to True.
Returns:

The Mean of the Absolute Deviations between the found and expected proportions.

benford.stats.mse(frame, verbose=True)[source]

Computes the test’s Mean Square Error

Parameters:
  • frame – DataFrame with the already computed Absolute Deviations between the found and expected proportions
  • verbose – Prints the MSE. Defaults to True.
Returns:

Mean of the squared differences between the found and the expected proportions.

benford.viz module

benford.viz.plot_expected(df, digs, save_plot=None, save_plot_kwargs=None)[source]

Plots the Expected Benford Distributions

Parameters:
  • df – DataFrame with the Expected Proportions
  • digs – Test’s digit
  • save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension.
  • save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html).
benford.viz.plot_digs(df, x, y_Exp, y_Found, N, figsize, conf_Z, text_x=False, save_plot=None, save_plot_kwargs=None)[source]

Plots the digits tests results

Parameters:
  • df – DataFrame with the data to be plotted
  • x – sequence to be used in the x axis
  • y_Exp – sequence of the expected proportions to be used in the y axis (line)
  • y_Found – sequence of the found proportions to be used in the y axis (bars)
  • N – length of the sequence, to be used when plotting the confidence levels
  • figsize – tuple to state the size of the plot figure
  • conf_Z – Confidence level
  • save_pic – file path to save figure
  • text_x – Forces all x tick labels to be shown. Defaults to False.
  • save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension.
  • save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html).
benford.viz.plot_sum(df, figsize, li, text_x=False, save_plot=None, save_plot_kwargs=None)[source]

Plots the summation test results

Parameters:
  • df – DataFrame with the data to be plotted
  • figsize – sets the dimensions of the plot figure
  • li – value with which to draw the horizontal line
  • save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension.
  • save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html).
benford.viz.plot_ordered_mantissas(col, figsize=(12, 12), save_plot=None, save_plot_kwargs=None)[source]

Plots the ordered mantissas and compares them to the expected straight line that should be formed in a Benford-compliant set.

Parameters:
  • col (Series) – column of mantissas to plot.
  • figsize (tuple) – sets the dimensions of the plot figure.
  • save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension.
  • save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html).
benford.viz.plot_mantissa_arc_test(df, gravity_center, grid=True, figsize=12, save_plot=None, save_plot_kwargs=None)[source]

Draws the Mantissa Arc Test after computing X and Y circular coordinates for every mantissa and the center of gravity for the set.

Parameters:
  • df (DataFrame) – pandas DataFrame with the mantissas and the X and Y coordinates.
  • gravity_center (tuple) – coordinates for plotting the gravity center
  • grid (bool) – show grid. Defaults to True.
  • figsize (int) – figure dimensions. No need to be a tuple, since the figure is a square.
  • save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension.
  • save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html).
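
Example (illustrative sketch under stated assumptions): the circular coordinates used by an arc test are usually obtained by mapping each mantissa m to the unit circle as (cos(2πm), sin(2πm)), with the center of gravity being the mean of those points. That mapping is an assumption of this sketch, and the helper names arc_coordinates and gravity_center_of are hypothetical.

```python
import math

def arc_coordinates(mantissas):
    """Map each mantissa in [0, 1) to a point on the unit circle."""
    return [(math.cos(2 * math.pi * m), math.sin(2 * math.pi * m)) for m in mantissas]

def gravity_center_of(points):
    """Mean X and mean Y of the points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

# Evenly spread mantissas stand in for a perfectly uniform set.
mantissas = [i / 100 for i in range(100)]
center = gravity_center_of(arc_coordinates(mantissas))
print(center)
```

For uniformly spread mantissas the center of gravity falls near the origin; a center pulled toward the circle suggests the mantissas cluster in part of the [0, 1) range.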
benford.viz.plot_roll_mse(roll_series, figsize, save_plot=None, save_plot_kwargs=None)[source]

Shows the rolling MSE plot

Parameters:
  • roll_series – pd.Series resulting from the rolling MSE.
  • figsize – the figure dimensions.
  • save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension.
  • save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html).
benford.viz.plot_roll_mad(roll_mad, figsize, save_plot=None, save_plot_kwargs=None)[source]

Shows the rolling MAD plot

Parameters:
  • roll_mad – pd.Series resulting from the rolling MAD.
  • figsize – the figure dimensions.
  • save_plot – string with the path/name of the file in which the generated plot will be saved. Uses matplotlib.pyplot.savefig(). File format is inferred from the file name extension.
  • save_plot_kwargs – dict with any of the kwargs accepted by matplotlib.pyplot.savefig() (https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html).