# How to calculate Pareto distribution and Zipf’s law in Python

The Pareto distribution, named after the Italian civil engineer, economist, and sociologist Vilfredo Pareto, (Italian: [paˈreːto] US: /pəˈreɪtoʊ/ pə-RAY-toh), is a power-law probability distribution that is used in description of social, quality control, scientific, geophysical, actuarial, and many other types of observable phenomena. Originally applied to describing the distribution of wealth in a society, fitting the trend that a large portion of wealth is held by a small fraction of the population. The Pareto principle or “80–20 rule” stating that 80% of outcomes are due to 20% of causes was named in honour of Pareto, but the concepts are distinct, and only Pareto distributions with shape value (α) of log45 ≈ 1.16 precisely reflect it. Empirical observation has shown that this 80–20 distribution fits a wide range of cases, including natural phenomena and human activities. (https://en.wikipedia.org/wiki/Pareto_distribution)

## Zipf’s law

Zipf’s law (/zɪf/, not /tsɪpf/ as in German) is an empirical law formulated using mathematical statistics that refers to the fact that for many types of data studied in the physical and social sciences, the rank-frequency distribution is an inverse relation. The Zipfian distribution is one of a family of related discrete power law probability distributions. It is related to the zeta distribution, but is not identical.(https://en.wikipedia.org/wiki/Zipf%27s_law)

# numpy.random.pareto

You can get an array that follows the Pareto distribution with np.random.pareto.

`import numpy as npimport matplotlib.pyplot as plta, m = 6., 2.  # distribution width、specify mode   s = (np.random.pareto(a, 1000) + 1) * m# draw histgramcount, bins, _ = plt.hist(s, 100, density=True)# draw line graph  fit = a * m ** a / bins ** (a + 1)plt.plot(bins, max(count) * fit / max(fit), linewidth=2, color='r')plt.show()`

Adjusting the width of the distribution (the value called shape, scale) affects the maximum value of the graph, and it raises the maximum value up.

`import numpy as npimport matplotlib.pyplot as plta, m = 12., 2.  # distribution width、specify modes = (np.random.pareto(a, 1000) + 1) * m# draw histgramcount, bins, _ = plt.hist(s, 100, density=True)# draw line graphfit = a * m ** a / bins ** (a + 1)plt.plot(bins, max(count) * fit / max(fit), linewidth=2, color='r')plt.show()`

Adjusting the mode moves the x-axis start position.

`import numpy as npimport matplotlib.pyplot as plta, m = 6., 4.  # distribution width、specify modes = (np.random.pareto(a, 1000) + 1) * m# draw histgramcount, bins, _ = plt.hist(s, 100, density=True)# draw line graphfit = a * m ** a / bins ** (a + 1)plt.plot(bins, max(count) * fit / max(fit), linewidth=2, color='r')plt.show()`

# scipy.stats.pareto

Let’s try scipy.stats.pareto.

`from scipy.stats import paretob = 2.62# Moment calculation, and output    mean, var, skew, kurt = pareto.stats(b, moments='mvsk')print(mean, var, skew, kurt)=> 1.6172839506172838 1.6101990746886534 nan nan`

Since b <= 3, the skewness and kurtosis are defined at the end.

`from scipy.stats import paretob = 4.5# Moment calculation, outputmean, var, skew, kurt = pareto.stats (b, moments =’mvsk’)print (mean, var, skew, kurt)=> 1.2857142857142858 0.1469387755102041 5.465943944999486 146.44444444444446`

If b> 4 is set, the skewness and kurtosis calculation results will be returned.

`import numpy as npimport matplotlib.pyplot as pltfig, ax = plt.subplots(1, 1)b = 2.62# PDF）、PPF)# draw parete distribution   x = np.linspace(pareto.ppf(0.01, b),pareto.ppf(0.99, b), 100)    ax.plot(x, pareto.pdf(x, b),'r-', lw=5, alpha=0.6, label='pareto pdf')     rv = pareto(b)    ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')`

# scipy.stats.genpareto

scipy.stats.genpareto allows you to calculate the generalized Pareto distribution. Usage is similar to scipy.stats.pareto.

`from scipy.stats import genparetoc = 0.1# moment calculation and output    mean, var, skew, kurt = genpareto.stats(c, moments='mvsk')print(mean, var, skew, kurt)=> 1.1111111111111116 1.543209876543199 2.8110568859996445 14.828571428571088import matplotlib.pyplot as pltfrom scipy.stats import genparetoc = 0.1fig, ax = plt.subplots(1, 1)x = np.linspace(genpareto.ppf(0.01, c),                genpareto.ppf(0.99, c), 100)ax.plot(x, genpareto.pdf(x, c),'r-', lw=5, alpha=0.6, label='genpareto pdf')rv = genpareto(c)ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')`

c represents the Pareto index. An exponential distribution if c = 0, a uniform distribution if c = -1, and a Type 2 Pareto distribution if c <0.

`import matplotlib.pyplot as pltfrom scipy.stats import genpareto# if c = 0, exponential distributionとなるc = 0fig, ax = plt.subplots(1, 1)x = np.linspace(genpareto.ppf(0.01, c),                genpareto.ppf(0.99, c), 100)ax.plot(x, genpareto.pdf(x, c),'r-', lw=5, alpha=0.6, label='genpareto pdf')rv = genpareto(c)ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')`
`import matplotlib.pyplot as pltfrom scipy.stats import genpareto# if c = -1, uniform distributionc = -1fig, ax = plt.subplots(1, 1)x = np.linspace(genpareto.ppf(0.01, c),                genpareto.ppf(0.99, c), 100)ax.plot(x, genpareto.pdf(x, c),'r-', lw=5, alpha=0.6, label='genpareto pdf')rv = genpareto(c)ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')`
`import matplotlib.pyplot as pltfrom scipy.stats import genpareto# if c = -0.80, the Pareto distribution is type 2. The shape is the opposite of the case of c > 0    fig, ax = plt.subplots(1, 1)x = np.linspace(genpareto.ppf(0.01, c),                genpareto.ppf(0.99, c), 100)ax.plot(x, genpareto.pdf(x, c),'r-', lw=5, alpha=0.6, label='genpareto pdf')rv = genpareto(c)ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')`

# numpy.random.zipf

Calculate Zipf’s law (distribution) with numpy.random.zipf.

`import numpy as npimport matplotlib.pyplot as pltfrom scipy import speciala = 2. # parameter# generate array follows Zipf’s laws = np.random.zipf(a, 1000)# draw histgraphcount, bins, ignored = plt.hist(s[s<50], 50, density=True)x = np.arange(1., 50.)# special.zetac is the function to use zeta distributiony = x**(-a) / special.zetac(a)plt.plot(x, y/max(y), linewidth=2, color='r')plt.show()`

# scipy.stats.zipf

The usage of scipy.stats.zipf is similar to scipy.stats.pareto etc.

`from scipy.stats import zipfa = 6.5# calculation moment and outputmean, var, skew, kurt = zipf.stats(a, moments='mvsk')print(mean, var, skew, kurt)=> 1.0130420979439105 0.015940729460401704 12.561822558088188 279.4461024601007from scipy.stats import zipfimport matplotlib.pyplot as plta = 3fig, ax = plt.subplots(1, 1)# obtain 1% point and 99% point of zipfx = np.arange(zipf.ppf(0.01, a),               zipf.ppf(0.99, a))# plot Probability Mass Functionax.plot(x, zipf.pmf(x, a), 'bo', ms=8, label='zipf pmf')# draw a vertical barax.vlines(x, 0, zipf.pmf(x, a), colors='b', lw=5, alpha=0.5)rv = zipf(a)# draw horizontal barax.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-', lw=1, label='frozen pmf')ax.legend(loc='best', frameon=False)plt.show()`

--

--

--

## More from TiShow

80% is the creation, the rest is depression. Frontend developer and data scientist, designer. Looking for Physics Ph.D Twitter: @_t_i_show

Love podcasts or audiobooks? Learn on the go with our new app.

## Introduction to geo-AR + measuring position and orientation in the browser ## Smart Send Time: How our data science team developed our automated send time optimization feature ## How to Build a Web App in 50 lines of code using Plotly and Dash ## How to Overcome Severe Shadows in a Lane Finding Project for Autonomous Vehicles ## Putting China’s Second-Hand Economy on the Map with GeoHash Matching  ## Baltimore Develops a Golden Method for Transit Signal Priority  ## TiShow

80% is the creation, the rest is depression. Frontend developer and data scientist, designer. Looking for Physics Ph.D Twitter: @_t_i_show

## How to compare the ensemble size in gradient Boosting machines ## How to Create a Dataset for Machine Learning ## Hierarchical Clustering 