Zipf’s law
Zipf’s law is an empirical formula that appears to hold for many natural phenomena. For a set of events, the law relates the number of occurrences of each event to its rank through a power law. The law is stated as follows:
\[f(n) \propto \frac{1}{r^\alpha}\]
where $n$ is the item, $r$ is its rank, and $\alpha$ is a constant that is usually close to 1. For any structure that adheres to Zipf’s law, $f(n)$ predicts an item’s frequency from its rank using the above formula.
The constant $C$ that turns the above proportionality into a computable equation, $f(n) = C / r^\alpha$, is usually the frequency of the highest-ranked item.
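As a minimal sketch of the formula above, the function below evaluates $f(n) = C / r^\alpha$ for a given rank. The function name `zipf_frequency`, its parameter names, and the top frequency of 60,000 are illustrative assumptions, not values taken from this notebook.

```python
def zipf_frequency(rank: int, top_frequency: float, alpha: float = 1.0) -> float:
    """Predicted frequency of the item at `rank` (1-based) under Zipf's law.

    `top_frequency` plays the role of the constant C, i.e. the frequency
    of the highest-ranked item.
    """
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    return top_frequency / rank ** alpha


# Hypothetical corpus whose most frequent word occurs 60,000 times.
predicted = [zipf_frequency(r, top_frequency=60_000) for r in range(1, 6)]
print(predicted)
```

With $\alpha = 1$, the second-ranked item is predicted to occur half as often as the first, the third one third as often, and so on.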
It is rather intriguing that this law also holds for what appears to be a chaotic and creative structure like a spoken language! The frequency of words in a relatively large corpus has been found to follow this law. For example, the image below shows the word frequency distribution for “A Cruising Voyage Round the World”, a classic work of English literature written in the 18th century (read full article here). A clear power-law decay is noticeable!
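To make the idea of a word frequency distribution concrete, here is a small sketch that counts words in a text and ranks them by frequency. The helper name `ranked_word_counts`, the regex tokenizer, and the sample sentence are assumptions made for illustration; the preprocessing used later in this notebook may differ.

```python
import re
from collections import Counter


def ranked_word_counts(text: str) -> list[tuple[int, str, int]]:
    """Return (rank, word, count) triples, most frequent word first."""
    # Naive tokenization: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


sample = "the cat and the dog and the bird"
for rank, word, count in ranked_word_counts(sample):
    print(rank, word, count)
```

Plotting the counts against the ranks for a real corpus gives the kind of curve described above.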
Although many languages happen to follow this Zipfian trend, the formula does not fit exactly. This is why finding the formula that best fits a large corpus is an active area of research. To that end, many variants of Zipf’s law have been introduced. The most widely known variant is the Zipf–Mandelbrot law, which adds more parameters to the formula so that it can be tuned to different corpora.
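For reference, the Zipf–Mandelbrot law is commonly written as $f(r) = C / (r + \beta)^\alpha$, where the extra shift parameter $\beta$ mainly changes the behaviour at the top ranks. The sketch below is only an illustration of that form; the symbol $\beta$ and the function and parameter names are assumptions, since this notebook does not name them.

```python
def zipf_mandelbrot_frequency(
    rank: int, scale: float, alpha: float = 1.0, beta: float = 0.0
) -> float:
    """Predicted frequency at `rank` (1-based) under the Zipf-Mandelbrot law.

    With beta = 0 this reduces to the plain Zipf formula scale / rank**alpha.
    """
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    return scale / (rank + beta) ** alpha


# Compare the plain Zipf prediction with a shifted one for the first ranks.
for r in range(1, 4):
    plain = zipf_mandelbrot_frequency(r, scale=100.0)
    shifted = zipf_mandelbrot_frequency(r, scale=100.0, beta=2.0)
    print(r, plain, shifted)
```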
This notebook will not study the linguistic and physical aspects of Zipf’s formula or the theory behind it in depth. Rather, it studies the computational aspects of the law when applied to a large corpus.
In this notebook, we revisit this topic for a very large English corpus. We also transform the frequency counts to a log-log scale, as this scale depicts the trends among the counts better. On that scale, we study various methods for fitting a regression model to the counts. In short, this notebook will:
- Download datasets from Hugging Face, then filter and preprocess them.
- Present a SciPy regression model that fits Zipf’s model to the corpus.
- Present neural-network-based methods that fit Zipf’s model to the corpus.
- Provide some analysis and discussion of the fitted models.
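The reason the log-log scale is convenient is that taking logarithms of $f = C / r^\alpha$ gives $\log f = \log C - \alpha \log r$, a straight line whose slope is $-\alpha$. The sketch below illustrates this with an ordinary least-squares fit on synthetic counts. The helper `fit_line` and the synthetic data are assumptions for illustration only; they are not the models fitted later in this notebook, and a library routine such as `scipy.stats.linregress` could be used for the same purpose.

```python
import math


def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic counts generated exactly from f(r) = C / r**alpha (not real corpus data).
true_alpha, top_frequency = 1.0, 1000.0
ranks = range(1, 101)
frequencies = [top_frequency / r ** true_alpha for r in ranks]

# Move to the log-log scale and fit a straight line.
log_ranks = [math.log(r) for r in ranks]
log_freqs = [math.log(f) for f in frequencies]
slope, intercept = fit_line(log_ranks, log_freqs)

estimated_alpha = -slope
estimated_c = math.exp(intercept)
print(f"alpha ~ {estimated_alpha:.3f}, C ~ {estimated_c:.1f}")
```

On real word counts the points will not lie exactly on a line, so the fit recovers only an approximate exponent.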
Let us first do some preparation for this experiment.
Prepare
First, let us install some packages.
Installs
pip packages
| !pip install -q datasets
# !pip install -q pySmartDL # download stuff in parallel from the internet
!pip install -q tqdm
!pip install -q --upgrade --no-cache-dir gdown
# !pip install -q apache_beam mwparserfromhell  # only needed if English Wikipedia is downloaded from Hugging Face datasets
|
Imports
Built-in packages:
| import os
import re
import gc
import random
import string
from glob import glob
from pprint import pprint
from collections import defaultdict,Counter
|
Drive-related packages. We are going to use gdown to download part of our corpora from Google Drive.
Jupyter-related. In some cases, the output becomes cluttered, so it is a good idea to clear it programmatically.
| from IPython.display import clear_output
|
Hugging Face datasets is one of the state-of-the-art modules in the field.
Data science packages. Our Swiss-army tools :)
| import torch
import numpy as np
from torch import nn
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score,mean_squared_error
|
plotting via matplotlib
| import matplotlib.pyplot as plt
%matplotlib inline
|
Utility functions.
| from tqdm.auto import tqdm
|
Constants and other Setups
First, for the sake of reproducibility, all random packages are set with a common seed.
| RANDOM_SEED = 42  # assumed value; the original definition is not shown in this section
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
|
| <torch._C.Generator at 0x796f5d1e3710>
|
Load and process the datasets
This section loads the datasets required for this experiment. To have a large enough corpus, we merge a large news dataset with Wikipedia and the Bible. These datasets were also chosen to ensure domain variability.
Get the news datasets
Downloading the gigaword news dataset
| import datasets  # Hugging Face datasets, installed above

hf_english_news_dataset = datasets.load_dataset("gigaword")
hf_english_news_dataset
|
| DatasetDict({
    train: Dataset({
        features: ['document', 'summary'],
        num_rows: 3803957
    })
    validation: Dataset({
        features: ['document', 'summary'],
        num_rows: 189651
    })
    test: Dataset({
        features: ['document', 'summary'],
        num_rows: 1951
    })
})
|
removing empty documents
| en_news_dataset = []
for split in ('train', 'test', 'validation'):
    # keep only the non-empty documents of each split
    en_news_dataset.extend(example['document'] for example in tqdm(hf_english_news_dataset[split]) if example['document'])
len(en_news_dataset)
|
| 3995559
|
Get wikipedia dataset
The Wikipedia dataset can be downloaded from the source using the code in the cell below. This cell also extracts the content using the wikiextractor tool.
| # the old way :)
# !wget -O enwiki.xml.bz2 https://dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles-multistream.xml.bz2
# !wikiextractor -o enwiki enwiki.xml.bz2 --json --processes 4
|
Since this takes a lot of time, I have already done it and saved the result on Google Drive to make it easy to experiment with the dataset. The download may take some time to complete and unzip.
| import gdown  # used to download the corpus archive from Google Drive

if not os.path.isfile('/content/enwiki.zip'):
    corpus_url = 'https://drive.google.com/file/d/1VR7g315mx8KIDXANKjAtkJzrGby0yVfN/view?usp=sharing'
    corpus_id = corpus_url.split('/')[5]  # the file id is the sixth '/'-separated part of the share URL
    corpus_download_url = f'https://drive.google.com/uc?id={corpus_id}'
    corpus_path = "/content/enwiki.zip"
    tqdm._instances.clear() # to fix issues with tqdm progress bar
    gdown.download(corpus_download_url, corpus_path, quiet=False)
    # !unzip -q -d enwiki enwiki.zip # this does not show progress bar :(
    !7z x enwiki.zip -o/content/enwiki # there should be no space between -o and the dest dir :)
    # clear_output() # clear_output after finishing, in case the progress bar comes in many lines, # comment it if not needed
    print('corpus file has been downloaded and extracted.')
|
| Downloading...
From (uriginal): https://drive.google.com/uc?id=1VR7g315mx8KIDXANKjAtkJzrGby0yVfN
From (redirected): https://drive.google.com/uc?id=1VR7g315mx8KIDXANKjAtkJzrGby0yVfN&confirm=t&uuid=68d0c7b1-6175-4905-8f53-d104c4cab192
To: /content/enwiki.zip
100%|██████████| 6.25G/6.25G [00:50<00:00, 124MB/s]
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,8 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (406F0),ASM,AES-NI)
Scanning the drive for archives:
0M Scan 1 file, 6245938389 bytes (5957 MiB)
Extracting archive: enwiki.zip
23% 4096 Open --
Path = enwiki.zip
Type = zip
Physical Size = 6245938389
64-bit = +
90% 15646 - enwiki/BN/wiki_80 90% 15662 - enwiki/BN/wiki_05 90% 15679 - enwiki/BN/wiki_31 90% 15695 - enwiki/BN/wiki_87 90% 15714 - enwiki/DO/wiki_89 90% 15731 - enwiki/DO/wiki_00 90% 15748 - enwiki/DO/wiki_67 91% 15766 - enwiki/DO/wiki_19 91% 15783 - enwiki/DO/wiki_25 91% 15799 - enwiki/DO/wiki_90 91% 15817 - enwiki/FG/wiki_60 91% 15834 - enwiki/FG/wiki_82 91% 15850 - enwiki/FG/wiki_95 91% 15867 - enwiki/FG/wiki_19 91% 15883 - enwiki/FG/wiki_33 91% 15900 - enwiki/FG/wiki_90 91% 15918 - enwiki/CJ/wiki_60 92% 15935 - enwiki/CJ/wiki_82 92% 15952 - enwiki/CJ/wiki_54 92% 15967 - enwiki/CJ/wiki_09 92% 15984 - enwiki/CJ/wiki_33 92% 16001 - enwiki/CJ/wiki_90 92% 16018 - enwiki/AJ/wiki_08 92% 16035 - enwiki/AJ/wiki_40 92% 16044 - enwiki/AJ/wiki_29 92% 16060 - enwiki/AJ/wiki_58 92% 16078 - enwiki/AJ/wiki_79 92% 16095 - enwiki/AJ/wiki_92 93% 16113 - enwiki/BU/wiki_23 93% 16130 - enwiki/BU/wiki_43 93% 16146 - enwiki/BU/wiki_46 93% 16163 - enwiki/BU/wiki_49 93% 16180 - enwiki/BU/wiki_59 93% 16197 - enwiki/BU/wiki_24 93% 16215 - enwiki/BX/wiki_85 93% 16232 - enwiki/BX/wiki_50 93% 16249 - enwiki/BX/wiki_78 93% 16265 - enwiki/BX/wiki_04 94% 16282 - enwiki/BX/wiki_68 94% 16298 - enwiki/BX/wiki_24 94% 16315 - enwiki/CO/wiki_23 94% 16332 - enwiki/CO/wiki_43 94% 16349 - enwiki/CO/wiki_93 94% 16365 - enwiki/CO/wiki_49 94% 16380 - enwiki/CO/wiki_07 94% 16396 - enwiki/CO/wiki_37 94% 16414 - enwiki/CN/wiki_16 94% 16431 - enwiki/CN/wiki_17 94% 16447 - enwiki/CN/wiki_86 95% 16464 - enwiki/CN/wiki_58 95% 16480 - enwiki/CN/wiki_34 95% 16498 - enwiki/CN/wiki_97 95% 16516 - enwiki/EZ/wiki_98 95% 16534 - enwiki/EZ/wiki_43 95% 16552 - enwiki/EZ/wiki_78 95% 16570 - enwiki/EZ/wiki_57 95% 16588 - enwiki/EZ/wiki_31 95% 16605 - enwiki/EZ/wiki_83 95% 16623 - enwiki/CP/wiki_89 96% 16641 - enwiki/CP/wiki_40 96% 16658 - enwiki/CP/wiki_95 96% 16674 - enwiki/CP/wiki_09 96% 16691 - enwiki/CP/wiki_33 96% 16707 - enwiki/CP/wiki_71 96% 16726 - enwiki/GD/wiki_60 96% 16743 - enwiki/GD/wiki_82 96% 16761 - 
enwiki/GD/wiki_73 96% 16779 - enwiki/GD/wiki_30 97% 16796 - enwiki/GD/wiki_02 97% 16814 - enwiki/AP/wiki_72 97% 16831 - enwiki/AP/wiki_36 97% 16847 - enwiki/AP/wiki_81 97% 16864 - enwiki/AP/wiki_88 97% 16880 - enwiki/AP/wiki_30 97% 16897 - enwiki/AP/wiki_02 97% 16914 - enwiki/AK/wiki_66 97% 16930 - enwiki/AK/wiki_35 97% 16947 - enwiki/AK/wiki_28 97% 16964 - enwiki/AK/wiki_39 98% 16981 - enwiki/AK/wiki_30 98% 16998 - enwiki/AK/wiki_02 98% 17016 - enwiki/BI/wiki_72 98% 17033 - enwiki/BI/wiki_36 98% 17050 - enwiki/BI/wiki_64 98% 17067 - enwiki/BI/wiki_96 98% 17084 - enwiki/BI/wiki_38 98% 17101 - enwiki/BI/wiki_74 98% 17119 - enwiki/EA/wiki_51 98% 17137 - enwiki/EA/wiki_14 99% 17155 - enwiki/EA/wiki_29 99% 17172 - enwiki/EA/wiki_91 99% 17191 - enwiki/EA/wiki_68 99% 17209 - enwiki/EA/wiki_03 99% 17227 - enwiki/BQ/wiki_55 99% 17244 - enwiki/BQ/wiki_41 99% 17261 - enwiki/BQ/wiki_32 99% 17277 - enwiki/BQ/wiki_57 99% 17294 - enwiki/BQ/wiki_12 99% 17310 - enwiki/BQ/wiki_03 Everything is Ok
Folders: 173
Files: 17144
Size: 17878455410
Compressed: 6245938389
corpus file has been downloaded and extracted.
|
Load the uncompressed dataset into a Hugging Face dataset object.
| en_hf_wikipedia_dataset = datasets.Dataset.from_json('/content/enwiki/**/wiki_*')
|
| Resolving data files: 0%| | 0/17144 [00:00<?, ?it/s]
Downloading data files: 0%| | 0/1 [00:00<?, ?it/s]
Extracting data files: 0%| | 0/1 [00:00<?, ?it/s]
Generating train split: 0 examples [00:00, ? examples/s]
|
Filter out empty documents.
| en_hf_wikipedia_dataset = en_hf_wikipedia_dataset.filter(lambda example:example['text'])
|
| Filter: 0%| | 0/16699988 [00:00<?, ? examples/s]
|
As this dataset is very large, we take only a portion of it, close in size to the news dataset, so that Wikipedia text does not dominate the aggregated corpus.
| mini_en_hf_wikipedia_dataset = en_hf_wikipedia_dataset.select(range(4*10**6)) # get 4.0M example, to be close to the news dataset size
mini_en_hf_wikipedia_dataset
|
| Dataset({
features: ['id', 'revid', 'url', 'title', 'text'],
num_rows: 4000000
})
|
bible dataset
Get the testament portion of the bible dataset.
| hf_bible_dataset = datasets.load_dataset("versae/bibles",'testament')
hf_bible_dataset
|
| Downloading builder script: 0%| | 0.00/7.23k [00:00<?, ?B/s]
Downloading readme: 0%| | 0.00/178 [00:00<?, ?B/s]
Downloading data files: 0%| | 0/3 [00:00<?, ?it/s]
Downloading data: 0%| | 0.00/37.3M [00:00<?, ?B/s]
Downloading data: 0%| | 0.00/37.3M [00:00<?, ?B/s]
Downloading data: 0%| | 0.00/271M [00:00<?, ?B/s]
Generating train split: 0 examples [00:00, ? examples/s]
Generating validation split: 0 examples [00:00, ? examples/s]
Generating test split: 0 examples [00:00, ? examples/s]
DatasetDict({
train: Dataset({
features: ['text', 'label', 'language', 'year', 'century', 'codebook'],
num_rows: 1570633
})
validation: Dataset({
features: ['text', 'label', 'language', 'year', 'century', 'codebook'],
num_rows: 216493
})
test: Dataset({
features: ['text', 'label', 'language', 'year', 'century', 'codebook'],
num_rows: 216511
})
})
|
| en_bible_dataset = list()
for split in ('train','validation','test'):
    en_bible_dataset.extend(
        example['text'] for example in tqdm(hf_bible_dataset[split])
        if example['language'].lower() == 'eng' and example['text']
    )
len(en_bible_dataset)
|
| 0%| | 0/1570633 [00:00<?, ?it/s]
0%| | 0/216493 [00:00<?, ?it/s]
0%| | 0/216511 [00:00<?, ?it/s]
455268
|
Process the datasets
After collecting the datasets, let us process them. The processing step lowercases the text and removes every character that is not an English letter or whitespace, which also drops digits and punctuation marks.
Utility functions
| ENGLISH_LETTERS = string.ascii_lowercase
|
| def process_english_text(text):
    text = text.lower()
    # surround the listed punctuation marks and digits with spaces so that they split adjacent words
    text = re.sub(r'''([.,!?()\/\\،"'\{\}\(\)\[\]؟<>«»`؛=+\-\*\&\^\%\$\#\@\!:|…123456789;؟–−])''', r' \1 ', text)
    # keep only lowercase English letters and whitespace
    text = ''.join([c for c in text if c in ENGLISH_LETTERS or c.isspace()])
    # collapse runs of two or more whitespace characters into a single space
    text = re.sub(r'\s{2,}',' ',text).strip()
    # note: re.sub(r'\s+',' ',s) also replaces a lone newline with a space,
    # while re.sub(r'\s{2,}',' ',s) leaves a lone newline untouched.
    # handle the remaining special space characters: https://jkorpela.fi/chars/spaces.html
    text = text.replace('\xa0','')
    text = text.replace('\x85','')
    text = text.replace('\u200a',' ')
    text = text.replace('\u2009',' ')
    text = text.replace('\u3000',' ')
    text = text.replace('\u202f',' ')
    text = text.replace('\u2002',' ')
    text = text.replace('\u2003',' ')
    return text.strip()
|
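The note inside the function about the two whitespace patterns can be checked on a small string. This is a minimal sketch; the sample string and variable names are made up for illustration.

```python
import re

sample = 'zipf\nlaw  holds'  # one lone newline and one run of two spaces

# \s+ matches every whitespace run, including the lone newline
collapsed_all = re.sub(r'\s+', ' ', sample)
# \s{2,} only matches runs of two or more whitespace characters
collapsed_runs = re.sub(r'\s{2,}', ' ', sample)

print(repr(collapsed_all))
print(repr(collapsed_runs))
```

The processing function uses the second pattern, so a single newline between two words is kept as the separator rather than being turned into a space.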
Test the processing function.
| texts = [
    'this is a text',
    'he@llo# m%an',
    'kind of a good work',
]
for text in texts:
    print(process_english_text(text))
|
| this is a text
he llo m an
kind of a good work
|
datasets processing
Process each dataset with the above procedure.
| processed_news_dataset = list(map(
    process_english_text,
    tqdm(en_news_dataset),
))
# delete the old variable and run garbage collection
del en_news_dataset
gc.collect()
len(processed_news_dataset)
|
| 0%| | 0/3995559 [00:00<?, ?it/s]
3995559
|
| processed_wikipedia_dataset = mini_en_hf_wikipedia_dataset.map(lambda example: dict(
        processed_text=process_english_text(example['text']),
        **example,
    ),
    num_proc=4,
)['processed_text']
len(processed_wikipedia_dataset)
|
| Map (num_proc=4): 0%| | 0/4000000 [00:00<?, ? examples/s]
4000000
|
| processed_bible_dataset = list(map(
    process_english_text,
    tqdm(en_bible_dataset),
))
# delete the old variable and run garbage collection
del en_bible_dataset
gc.collect()
len(processed_bible_dataset)
|
| 0%| | 0/455268 [00:00<?, ?it/s]
455268
|
Aggregate the datasets
After the datasets are processed, let us merge them into one large list. The total size of this list exceeds 8M documents!
| # list(...) keeps the concatenation valid even if the column is not returned as a plain list
aggregated_dataset = processed_news_dataset+list(processed_wikipedia_dataset)+processed_bible_dataset
len(aggregated_dataset)
|
Regression Modeling
To prepare our dataset for regression, we first extract the word frequency counts.
Build the counters
| dataset_counter = dict(Counter(word for document in tqdm(aggregated_dataset) for word in document.split()))
len(dataset_counter)
|
| 0%| | 0/8450827 [00:00<?, ?it/s]
4729296
|
Sort the counters based on their frequency
| dataset_counter = dict(
    sorted(
        tqdm(dataset_counter.items()),
        key=lambda item:item[1],
        reverse=True,
    ),
)
pprint(list(dataset_counter.items())[:10])
|
| 0%| | 0/4729296 [00:00<?, ?it/s]
[('the', 140588744),
('of', 66324254),
('and', 56793485),
('in', 56742261),
('a', 42652768),
('to', 41913171),
('was', 23486709),
('is', 17780676),
('on', 17076328),
('for', 16814191)]
|
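As a quick sanity check, the top counts printed above can be compared with the idealized form of Zipf's law with $\alpha = 1$, where the word at rank r is expected to be about r times less frequent than the top word. The sketch below only reuses the ten counts from the output above; the variable names are illustrative.

```python
# the ten most frequent words, copied from the output above
top_counts = [
    ('the', 140588744), ('of', 66324254), ('and', 56793485), ('in', 56742261),
    ('a', 42652768), ('to', 41913171), ('was', 23486709), ('is', 17780676),
    ('on', 17076328), ('for', 16814191),
]

top_frequency = top_counts[0][1]
ratios = []
for rank, (word, count) in enumerate(top_counts, start=1):
    # under an ideal Zipf's law with alpha = 1 this ratio would equal the rank
    ratio = top_frequency / count
    ratios.append(ratio)
    print(f'{rank:>2} {word:<4} f(1)/f(r) = {ratio:.2f}')
```

The ratios grow with the rank, as the law suggests, but they only roughly track r, which is one reason the exponent is estimated by regression rather than assumed.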
Plot the counters on the rank/frequency graph
Let us visualize the frequency distribution of the 100 most frequent words in this massive collected dataset.
| plt.figure(figsize=(15,10))
plt.plot(
    list(dataset_counter.keys())[:100],
    list(dataset_counter.values())[:100],
    'x',
)
plt.tick_params(
    axis='x', # changes apply to the x-axis
    which='both', # both major and minor ticks are affected
    bottom=False, # ticks along the bottom edge are off
    top=False, # ticks along the top edge are off
    labelbottom=False, # labels along the bottom edge are off
)
|
Prepare regression data
As the raw rank/frequency curve is hard to fit directly, the frequency counts are transformed to the log-log scale. On this scale, the counts take an almost linear shape that can be easily learnt using linear regression.
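The nearly linear shape follows directly from the formula stated at the beginning of this notebook. Writing the proportionality with the constant $C$ and taking the logarithm of both sides gives:
\[f(n) = \frac{C}{r^\alpha} \quad\Longrightarrow\quad \log f(n) = \log C - \alpha \log r\]
So, with $x = \log r$ and $y = \log f(n)$, an exact Zipf relation is a straight line whose slope is $-\alpha$ and whose intercept is $\log C$. These are the two quantities that the regression estimates.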
Transform data to log-log scale
| X = np.log(
    range(1,len(dataset_counter.keys())+1),
    dtype=np.float64,
)
y = np.log(
    list(dataset_counter.values()),
    dtype=np.float64,
)
|
| X = np.nan_to_num(X,neginf=0)
|
| plt.figure(figsize=(15,10))
plt.plot(
    X,
    y,
    ',',
)
plt.tick_params(
    axis='x', # changes apply to the x-axis
    which='both', # both major and minor ticks are affected
    bottom=False, # ticks along the bottom edge are off
    top=False, # ticks along the top edge are off
    labelbottom=False, # labels along the bottom edge are off
)
|
Using scipy to fit
scipy fits the line with a closed-form least-squares solution, as seen below. The closed-form approach works here because this is a simple linear regression with a single feature and the counts dataset is relatively small. However, for large and complex regression problems, a closed-form solution may be impractical or unavailable. This motivates iterative approaches that use optimization techniques such as Stochastic Gradient Descent and Adam. Let us first see how the closed-form solution fits the counts dataset.
| slope, intercept, *_ = stats.linregress(X, y)
slope,intercept
|
| (-1.5839698830293245, 23.87273975694796)
|
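For a single feature, the closed-form least-squares solution reduces to two formulas: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The following is a minimal pure-Python sketch of that computation on made-up points; `closed_form_fit` is a hypothetical helper and not the scipy implementation.

```python
def closed_form_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    # the fitted line passes through the point of means
    intercept = mean_y - slope * mean_x
    return slope, intercept


# toy points lying exactly on y = -1.5 * x + 20 (illustrative values only)
toy_x = [0.0, 1.0, 2.0, 3.0, 4.0]
toy_y = [-1.5 * x + 20 for x in toy_x]
toy_slope, toy_intercept = closed_form_fit(toy_x, toy_y)
print(toy_slope, toy_intercept)
```

No iteration is involved: one pass over the data for the means and one for the sums is enough, which is why this approach is fast for a problem of this shape.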
| y_pred = slope * X + intercept
r2_score(y,y_pred),mean_squared_error(y,y_pred)
|
| (0.9807104659599317, 0.049347223516296584)
|
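The two reported numbers are the coefficient of determination (R²) and the mean squared error. As a hedged sketch of what these metrics compute, here are the standard definitions written in plain Python; the helper names are hypothetical and this is not the library code called above.

```python
def mean_squared_error_sketch(y_true, y_pred):
    # average of the squared residuals
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_sketch(y_true, y_pred):
    # 1 - (residual sum of squares / total sum of squares around the mean)
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# illustrative values only
toy_true = [3.0, 2.0, 1.0, 0.0]
toy_pred = [2.5, 2.0, 1.0, 0.5]
toy_mse = mean_squared_error_sketch(toy_true, toy_pred)
toy_r2 = r2_score_sketch(toy_true, toy_pred)
print(toy_mse, toy_r2)
```

An R² close to 1 means the line explains most of the variance of the log frequencies, while the mean squared error is measured in squared log-frequency units.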
| plt.figure(figsize=(15,10))
plt.plot(
    X,
    y,
    ',',
)
plt.plot(
    X,
    slope * X + intercept,
    # '.',
)
plt.show()
|
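Since the regression was done on the log-log scale with natural logarithms, the fitted line can be mapped back to the Zipf form: the exponent is the negated slope and the constant is the exponential of the intercept. The sketch below reuses the slope and intercept printed above; `predicted_frequency` and the other names are hypothetical and introduced only for illustration.

```python
import math

# values copied from the linregress output above
fitted_slope, fitted_intercept = -1.5839698830293245, 23.87273975694796

alpha = -fitted_slope                   # Zipf exponent
C = math.exp(fitted_intercept)          # fitted constant, the predicted frequency at rank 1


def predicted_frequency(rank):
    # f = C / rank ** alpha, the power law implied by the fitted line
    return C / rank ** alpha


for rank in (1, 10, 100, 1000):
    print(rank, f'{predicted_frequency(rank):.3e}')
```

Note that the fitted exponent is about 1.58 rather than 1. The fit is unweighted over all ranks, so the very large number of rare words carries most of the weight, and the line is not guaranteed to match the counts of the top-ranked words.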
Using pytorch to fit
Given the limitations of the closed-form least-squares solution discussed above, in this section we apply an iterative method to learn the regression model. We show a linear model with a single linear layer and a more complex model of two layers with a ReLU activation. The results achieved here are very close to those of the closed-form approach. Let us dive into the code and details.
utilities and preparation
| device = 'cuda' if torch.cuda.is_available() else 'cpu'
|
| X = X.reshape(-1,1)
y = y.reshape(-1,1)
X.shape,y.shape
|
| ((4729296, 1), (4729296, 1))
|
| input_size = 1
output_size = 1
|
| inputs = torch.tensor(
    X.astype(np.float32),
    requires_grad=True,
    device=device
)
outputs = torch.from_numpy(y.astype(np.float32)).to(device)
|
| tensor([[18.7613],
[18.0101],
[17.8549],
...,
[ 0.0000],
[ 0.0000],
[ 0.0000]])
|
| def train_model_for_regression(
    model,
    verbose=True,
    n_epochs=1000,
    learning_rate=0.25,
    verbose_loss_every=100,
    criterion_class=nn.MSELoss,
    optimizer_class=torch.optim.Adam,
):
    model.to(device)  # keep the model on the same device as the inputs
    optimizer = optimizer_class(model.parameters(),lr=learning_rate)
    criterion = criterion_class()
    for epoch in range(1,n_epochs+1):
        if verbose and epoch % verbose_loss_every == 0:
            print('Epoch: ',epoch)
        # full-batch step: the whole dataset is used for every update
        optimizer.zero_grad()
        model_outputs = model(inputs)
        loss = criterion(model_outputs,outputs)
        loss.backward()
        optimizer.step()
        if verbose and epoch % verbose_loss_every == 0:
            print('Epoch Loss:',loss.item())
|
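To make the iterative idea concrete without PyTorch, the sketch below fits the same kind of line with plain full-batch gradient descent on the mean squared error. It is an illustration on made-up points, not the training function above: it uses vanilla gradient descent instead of Adam, and the learning rate and number of epochs are arbitrary assumptions.

```python
def gradient_descent_fit(xs, ys, learning_rate=0.05, n_epochs=5000):
    """Fit y = w * x + b by full-batch gradient descent on the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_epochs):
        # residuals of the current line on the whole dataset
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # gradients of mean(error ** 2) with respect to w and b
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# toy points lying exactly on y = -1.5 * x + 20 (illustrative values only)
toy_x = [0.0, 1.0, 2.0, 3.0, 4.0]
toy_y = [-1.5 * x + 20 for x in toy_x]
gd_w, gd_b = gradient_descent_fit(toy_x, toy_y)
print(gd_w, gd_b)
```

Each epoch moves the parameters a small step against the gradient of the loss, so the estimate approaches the least-squares line gradually instead of being computed in one shot.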
one layer fit
| one_layer_model = nn.Linear(
    in_features=input_size,
    out_features=output_size,
    bias=True,
)
one_layer_model
|
| Linear(in_features=1, out_features=1, bias=True)
|
Train the model.
| train_model_for_regression(model=one_layer_model)
|
| Epoch: 100
Epoch Loss: 2.6146554946899414
Epoch: 200
Epoch Loss: 2.0427799224853516
Epoch: 300
Epoch Loss: 1.4714503288269043
Epoch: 400
Epoch Loss: 0.9903616905212402
Epoch: 500
Epoch Loss: 0.6325700283050537
Epoch: 600
Epoch Loss: 0.38691988587379456
Epoch: 700
Epoch Loss: 0.23223881423473358
Epoch: 800
Epoch Loss: 0.14219366014003754
Epoch: 900
Epoch Loss: 0.09339486062526703
Epoch: 1000
Epoch Loss: 0.06888055056333542
|
| # .cpu() is a no-op on CPU and is required before .numpy() when running on a GPU
y_pred = one_layer_model(inputs).detach().cpu().numpy().flatten()
r2_score(y,y_pred),mean_squared_error(y,y_pred)
|
| (0.9731394307735174, 0.06871573520857868)
|
| plt.figure(figsize=(15,10))
plt.plot(
    X,
    y,
    ',',
)
plt.plot(
    X,
    y_pred,
    # '.',
)
plt.show()
|
multi-layers fit
| class MultiLayerRegressionModel(nn.Module):
    def __init__(self, input_size=input_size, output_size=output_size, hidden_size=100):
        super().__init__()
        self.first_layer = nn.Linear(in_features=input_size,out_features=hidden_size)
        self.relu = nn.ReLU()
        self.second_layer = nn.Linear(in_features=hidden_size,out_features=output_size)

    def forward(self, x):
        out = self.first_layer(x)
        out = self.relu(out)
        out = self.second_layer(out)
        return out
|
| multi_layers_model = MultiLayerRegressionModel()
multi_layers_model
|
| MultiLayerRegressionModel(
(first_layer): Linear(in_features=1, out_features=100, bias=True)
(relu): ReLU()
(second_layer): Linear(in_features=100, out_features=1, bias=True)
)
|
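To see what this model computes, the sketch below writes the same forward pass (a linear layer, a ReLU, then a second linear layer) in plain Python for a scalar input, and counts the trainable parameters. The weights are made-up values for a hidden size of 2 and the function names are hypothetical; only the structure mirrors the model above.

```python
def relu(value):
    return max(0.0, value)


def two_layer_forward(x, w1, b1, w2, b2):
    """Scalar-input forward pass: second_layer(relu(first_layer(x)))."""
    hidden = [relu(w * x + b) for w, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def count_parameters(input_size, hidden_size, output_size):
    # weights and biases of the first layer, then of the second layer
    first = input_size * hidden_size + hidden_size
    second = hidden_size * output_size + output_size
    return first + second


# illustrative weights for a hidden size of 2
toy_output = two_layer_forward(3.0, w1=[1.0, -1.0], b1=[0.0, 1.0], w2=[2.0, 3.0], b2=0.5)
n_parameters = count_parameters(1, 100, 1)
print(toy_output)
print(n_parameters)
```

With a ReLU between the two layers, the output is a piecewise-linear function of the input, so this model can bend where a single line cannot, at the cost of many more parameters than the two of the one-layer model.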
| train_model_for_regression(model=multi_layers_model,n_epochs=500)
|
| Epoch: 100
Epoch Loss: 2.3198137283325195
Epoch: 200
Epoch Loss: 1.2398127317428589
Epoch: 300
Epoch Loss: 0.1229717805981636
Epoch: 400
Epoch Loss: 0.13968251645565033
Epoch: 500
Epoch Loss: 0.0495976023375988
|
| # .cpu() is a no-op on CPU and is required before .numpy() when running on a GPU
y_pred = multi_layers_model(inputs).detach().cpu().numpy().flatten()
r2_score(y,y_pred),mean_squared_error(y,y_pred)
|
| (0.980625576872138, 0.04956439002642326)
|
| plt.figure(figsize=(15,10))
plt.plot(
    X,
    y,
    ',',
)
plt.plot(
    X,
    y_pred,
    # '.',
)
plt.show()
|
Resources
- https://www.youtube.com/watch?v=fCn8zs912OE&ab_channel=Vsauce
- https://en.wikipedia.org/wiki/Zipf%E2%80%93Mandelbrot_law