How to calculate Pareto distribution and Zipf’s law in Python

The Pareto distribution, named after the Italian civil engineer, economist, and sociologist Vilfredo Pareto,[1] (Italian: [paˈreːto] US: /pəˈreɪtoʊ/ pə-RAY-toh),[2] is a power-law probability distribution that is used in description of social, quality control, scientific, geophysical, actuarial, and many other types of observable phenomena. Originally applied to describing the distribution of wealth in a society, fitting the trend that a large portion of wealth is held by a small fraction of the population.[3][4] The Pareto principle or “80–20 rule” stating that 80% of outcomes are due to 20% of causes was named in honour of Pareto, but the concepts are distinct, and only Pareto distributions with shape value (α) of log45 ≈ 1.16 precisely reflect it. Empirical observation has shown that this 80–20 distribution fits a wide range of cases, including natural phenomena[5] and human activities. (https://en.wikipedia.org/wiki/Pareto_distribution)

Zipf’s law

Zipf’s law (/zɪf/, not /tsɪpf/ as in German) is an empirical law formulated using mathematical statistics that refers to the fact that for many types of data studied in the physical and social sciences, the rank-frequency distribution is an inverse relation. The Zipfian distribution is one of a family of related discrete power law probability distributions. It is related to the zeta distribution, but is not identical.(https://en.wikipedia.org/wiki/Zipf%27s_law)

https://unsplash.com/photos/hpjSkU2UYSU

numpy.random.pareto

You can get an array that follows the Pareto distribution with np.random.pareto.

import numpy as np
import matplotlib.pyplot as plt
a, m = 6., 2. # distribution width、specify mode
s = (np.random.pareto(a, 1000) + 1) * m
# draw histgram
count, bins, _ = plt.hist(s, 100, density=True)

# draw line graph
fit = a * m ** a / bins ** (a + 1)
plt.plot(bins, max(count) * fit / max(fit), linewidth=2, color='r')
plt.show()

Adjusting the width of the distribution (the value called shape, scale) affects the maximum value of the graph, and it raises the maximum value up.

import numpy as np
import matplotlib.pyplot as plt
a, m = 12., 2. # distribution width、specify mode
s = (np.random.pareto(a, 1000) + 1) * m
# draw histgram
count, bins, _ = plt.hist(s, 100, density=True)
# draw line graph
fit = a * m ** a / bins ** (a + 1)
plt.plot(bins, max(count) * fit / max(fit), linewidth=2, color='r')
plt.show()

Adjusting the mode moves the x-axis start position.

import numpy as np
import matplotlib.pyplot as plt
a, m = 6., 4. # distribution width、specify mode
s = (np.random.pareto(a, 1000) + 1) * m
# draw histgram
count, bins, _ = plt.hist(s, 100, density=True)

# draw line graph
fit = a * m ** a / bins ** (a + 1)
plt.plot(bins, max(count) * fit / max(fit), linewidth=2, color='r')
plt.show()

scipy.stats.pareto

Let’s try scipy.stats.pareto.

from scipy.stats import paretob = 2.62
# Moment calculation, and output
mean, var, skew, kurt = pareto.stats(b, moments='mvsk')
print(mean, var, skew, kurt)
=> 1.6172839506172838 1.6101990746886534 nan nan

Since b <= 3, the skewness and kurtosis are defined at the end.

from scipy.stats import paretob = 4.5
# Moment calculation, output
mean, var, skew, kurt = pareto.stats (b, moments =’mvsk’)
print (mean, var, skew, kurt)
=> 1.2857142857142858 0.1469387755102041 5.465943944999486 146.44444444444446

If b> 4 is set, the skewness and kurtosis calculation results will be returned.

import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1)
b = 2.62
# PDF)、PPF)
# draw parete distribution
x = np.linspace(pareto.ppf(0.01, b),pareto.ppf(0.99, b), 100)
ax.plot(x, pareto.pdf(x, b),'r-', lw=5, alpha=0.6, label='pareto pdf')
rv = pareto(b)
ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')

scipy.stats.genpareto

scipy.stats.genpareto allows you to calculate the generalized Pareto distribution. Usage is similar to scipy.stats.pareto.

from scipy.stats import genparetoc = 0.1
# moment calculation and output
mean, var, skew, kurt = genpareto.stats(c, moments='mvsk')
print(mean, var, skew, kurt)
=> 1.1111111111111116 1.543209876543199 2.8110568859996445 14.828571428571088import matplotlib.pyplot as plt
from scipy.stats import genpareto
c = 0.1
fig, ax = plt.subplots(1, 1)
x = np.linspace(genpareto.ppf(0.01, c),
genpareto.ppf(0.99, c), 100)
ax.plot(x, genpareto.pdf(x, c),'r-', lw=5, alpha=0.6, label='genpareto pdf')
rv = genpareto(c)
ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')

c represents the Pareto index. An exponential distribution if c = 0, a uniform distribution if c = -1, and a Type 2 Pareto distribution if c <0.

import matplotlib.pyplot as plt
from scipy.stats import genpareto
# if c = 0, exponential distributionとなる
c = 0
fig, ax = plt.subplots(1, 1)
x = np.linspace(genpareto.ppf(0.01, c),
genpareto.ppf(0.99, c), 100)
ax.plot(x, genpareto.pdf(x, c),'r-', lw=5, alpha=0.6, label='genpareto pdf')
rv = genpareto(c)
ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')
import matplotlib.pyplot as plt
from scipy.stats import genpareto
# if c = -1, uniform distribution
c = -1
fig, ax = plt.subplots(1, 1)
x = np.linspace(genpareto.ppf(0.01, c),
genpareto.ppf(0.99, c), 100)
ax.plot(x, genpareto.pdf(x, c),'r-', lw=5, alpha=0.6, label='genpareto pdf')
rv = genpareto(c)
ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')
import matplotlib.pyplot as plt
from scipy.stats import genpareto
# if c = -0.80, the Pareto distribution is type 2. The shape is the opposite of the case of c > 0
fig, ax = plt.subplots(1, 1)
x = np.linspace(genpareto.ppf(0.01, c),
genpareto.ppf(0.99, c), 100)
ax.plot(x, genpareto.pdf(x, c),'r-', lw=5, alpha=0.6, label='genpareto pdf')
rv = genpareto(c)
ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')

numpy.random.zipf

Calculate Zipf’s law (distribution) with numpy.random.zipf.

import numpy as np
import matplotlib.pyplot as plt
from scipy import special
a = 2. # parameter
# generate array follows Zipf’s law
s = np.random.zipf(a, 1000)
# draw histgraph
count, bins, ignored = plt.hist(s[s<50], 50, density=True)
x = np.arange(1., 50.)
# special.zetac is the function to use zeta distribution
y = x**(-a) / special.zetac(a)
plt.plot(x, y/max(y), linewidth=2, color='r')
plt.show()

scipy.stats.zipf

The usage of scipy.stats.zipf is similar to scipy.stats.pareto etc.

from scipy.stats import zipfa = 6.5
# calculation moment and output
mean, var, skew, kurt = zipf.stats(a, moments='mvsk')
print(mean, var, skew, kurt)
=> 1.0130420979439105 0.015940729460401704 12.561822558088188 279.4461024601007from scipy.stats import zipf
import matplotlib.pyplot as plt
a = 3
fig, ax = plt.subplots(1, 1)
# obtain 1% point and 99% point of zipf
x = np.arange(zipf.ppf(0.01, a),
zipf.ppf(0.99, a))
# plot Probability Mass Function
ax.plot(x, zipf.pmf(x, a), 'bo', ms=8, label='zipf pmf')
# draw a vertical bar
ax.vlines(x, 0, zipf.pmf(x, a), colors='b', lw=5, alpha=0.5)
rv = zipf(a)
# draw horizontal bar
ax.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-', lw=1, label='frozen pmf')
ax.legend(loc='best', frameon=False)
plt.show()

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
TiShow

TiShow

80% is the creation, the rest is depression. Frontend developer and data scientist, designer. Looking for Physics Ph.D Twitter: @_t_i_show