Commit 58bd8a9

Merge branch 'master' of github.com:hipertracker/words_extractor
2 parents 07cfd45 + e7150f0 commit 58bd8a9

10 files changed

+125
-103
lines changed

.gitignore

Lines changed: 2 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -12,4 +12,5 @@
1212
/**/main.dwarf
1313
/**/main
1414
/**/target
15-
/**/.idea
15+
/**/.idea
16+
/**/.venv

README.md

Lines changed: 14 additions & 11 deletions
@@ -1,26 +1,29 @@
11
# words_extractor
22

33
### Info
4+
45
An example of text file parsing in several programming languages. The goal is to extract unique words from UTF-8 files and save the results into separate files.
56

67
### Results
78

8-
The following results are for 936 files (2745 MB) on MacOS 12.2 and MacBook Pro 16" 64GB 2TB M1Max 10 cores. (For more text files go into data/pl/* and duplicate files several times.) All examples are using a similar logic and approach.
9+
The following results are for 123 unique UTF-8 Bible text files in 23 languages (used on the mybible.pl site), 504 MB in total. (The repo contains only a few sample files in the 'data' folder. To test with more data, you can multiply the files by copying each *.txt file (and its associated *.yml file) under a different name.)
10+
11+
* Platform: MacOS 12.2
12+
* Machine: MacBook Pro 16" 64GB 2TB M1Max 10 cores.
913

1014
<pre>
11-
1. Rust v1.58.1 = 7.54s
12-
2. Python v3.10.2 = 15.34s (with multiprocessing)
13-
3. Julia v1.7.1 = 17.00s
14-
4. Crystal v1.3.2 = 26.32s
15-
5. Ruby v3.1.0 = 40.94s (with Parallel)
16-
6. Golang v1.18beta1 = 73.00s
17-
7. Elixir v1.13.2 = 2m43s
15+
1. Rust 1.58 = 0.38s
16+
2. Python 3.10.2 = 2.80s
17+
3. Julia 1.7.1 = 4.522s
18+
4. Crystal 1.3.2 = 5.72s
19+
5. Ruby 3.1.0 = 8.31s
20+
6. Elixir 1.13.2 = 8.37s
21+
22+
Golang 1.17 = UNDER REFACTORING, stay tuned
1823
</pre>
1924

2025
### Conclusion
2126

2227
Rust is clearly the fastest language in this benchmark.
2328

24-
What is surprised is pretty poor Golang's performance on this task. Crystal is faster than Golang but in this task it is still slower than Python which is also surprising. (Neither Golang nor Crystal is my main field of expertise so maybe there is some room for improvement. Although I showed this code to people and nobody so far could improve it in any significant way. But if I find a better implementation I will update this comparison.)
25-
26-
The high Python performance is interesting. Although it is using a multiprocessing standard library for full CPU cores utilization this is still dynamic interpreted language after all, which is rather expected to be slower than statically typed languages.
29+
Python's high performance is interesting. Although it uses the standard multiprocessing library to utilize all CPU cores, it is still a dynamic, interpreted language, which one would expect to be slower than statically typed languages.
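The multiprocessing approach described in the conclusion can be sketched roughly as follows. This is a minimal illustration, not the repository's `words.py`: the helper names (`unique_words`, `process_file`, `process_all`) and the output naming scheme are assumptions, and the YAML metadata lookup is left out so that the example needs only the standard library.

```python
from __future__ import annotations

import multiprocessing as mp
import re
from pathlib import Path

# Split on runs of non-word characters and digits (Unicode-aware for str patterns).
SEPARATORS = re.compile(r"[\W\d]+")


def unique_words(text: str) -> set[str]:
    """Return the set of distinct lowercase words in text."""
    return {w for w in SEPARATORS.split(text.lower()) if w}


def process_file(args: tuple[str, str]) -> tuple[str, int]:
    """Extract words from one file, write them out, return (path, input size)."""
    src, outdir = args
    src_path = Path(src)
    words = unique_words(src_path.read_text(encoding="utf-8"))
    out_path = Path(outdir) / src_path.name  # hypothetical naming scheme
    out_path.write_text("\n".join(sorted(words)), encoding="utf-8")
    return src, src_path.stat().st_size


def process_all(paths: list[str], outdir: str) -> int:
    """Process files on all CPU cores and return the total input size in bytes."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    with mp.Pool(mp.cpu_count()) as pool:
        results = pool.imap_unordered(process_file, [(p, outdir) for p in paths])
        return sum(size for _, size in results)
```

When run as a script, `process_all` should be called under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.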
Lines changed: 17 additions & 18 deletions
@@ -1,47 +1,46 @@
11
require "json"
22
require "yaml"
33

4-
# TODO: Write documentation for `FastWordsCr`
5-
64
module Example::Crystal
7-
VERSION = "0.2.0"
5+
VERSION = "0.3.0"
86
CHARSET = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"
97

10-
def self.main(outpath = "words")
8+
def self.main(outdir = "words")
119
with_sorting = false
1210
concurrent = true
1311

14-
prepare_folder(outpath, "*.txt")
12+
prepare_folder(outdir, "*.txt")
1513

1614
file_count = 0
1715
total_size = 0
1816

1917
channel = Channel(Tuple(String, Int64)).new
2018
srcPath = "../data/??/**/*.yml"
21-
# srcPath = "./bibles/??/**/*.yml"
22-
Dir.glob(srcPath, follow_symlinks: true).each do |path|
19+
paths = Dir.glob(srcPath, follow_symlinks: true)
20+
count = paths.size
21+
paths.each do |path|
2322
if concurrent
2423
spawn do
25-
channel.send worker(path, outpath, with_sorting)
24+
channel.send worker(path, outdir, with_sorting)
2625
end
2726
file_count += 1
2827
else
29-
worker(path, outpath, with_sorting)
28+
worker(path, outdir, with_sorting)
3029
end
3130
end
3231
if concurrent
3332
file_count.times do |i|
3433
path, size = channel.receive
3534
total_size += size
36-
# puts("[#{i + 1}/#{file_count}] #{path}")
35+
puts(::sprintf("[%3d/%d] %s", i + 1, file_count, path))
3736
end
3837
end
39-
total_size = total_size / 1024 / 1024
40-
puts("Total size: #{total_size} MB")
38+
puts("Total size: #{(total_size / 1024 / 1024).round} MB")
4139
end
4240

43-
def self.worker(path, outpath, with_sorting)
41+
def self.worker(path, outdir, with_sorting)
4442
filepath = path.gsub(".yml", ".txt")
43+
filesize = File.size(filepath)
4544
text = File.read(filepath).gsub("\n", " ").downcase
4645

4746
words = text.split(/[^\p{L}]+/).to_set
@@ -51,10 +50,9 @@ module Example::Crystal
5150
end
5251

5352
meta = File.open(path) { |file| YAML.parse(file) }
54-
filepath = %Q(#{outpath}/#{meta["lang"]}-#{meta["code"]}.txt)
55-
File.write(filepath, words.join("\n"))
53+
outfilepath = %Q(#{outdir}/#{meta["lang"]}-#{meta["code"]}.txt)
54+
File.write(outfilepath, words.join("\n"))
5655
filesize = File.size(filepath)
57-
puts([filepath, filesize])
5856
{filepath, filesize}
5957
end
6058

@@ -78,7 +76,8 @@ module Example::Crystal
7876
end
7977
end
8078

81-
elapsed_time = Time.measure do
79+
elapsed = Time.measure do
8280
Example::Crystal.main
8381
end
84-
puts elapsed_time
82+
83+
puts("Total time: #{elapsed}")
Lines changed: 39 additions & 30 deletions
@@ -1,47 +1,56 @@
1+
use Timex
2+
13
defmodule WordsExtractor do
24
@moduledoc nil
35

46
def run do
57
outdir = "words"
6-
clean_dir(outdir)
7-
8-
walk("../data/pl/", ".yml")
9-
|> Task.async_stream(
10-
WordsExtractor,
11-
:worker,
12-
[outdir],
13-
ordered: false,
14-
timeout: :infinity
15-
)
16-
|> Enum.to_list()
8+
File.rm_rf!(outdir)
9+
File.mkdir!(outdir)
10+
t = Duration.now()
11+
paths = Path.wildcard("../data/??/**/*.yml")
12+
count = length(paths)
13+
14+
total_size =
15+
paths
16+
|> Enum.with_index(1)
17+
|> Task.async_stream(
18+
WordsExtractor,
19+
:worker,
20+
[outdir, count],
21+
ordered: false,
22+
timeout: :infinity
23+
)
24+
|> Enum.to_list()
25+
|> Enum.reduce(0, fn {:ok, num}, acc -> acc + num end)
26+
27+
elapsed = Duration.diff(Duration.now(), t, :milliseconds)
28+
IO.puts("Total time: #{elapsed / 1000}s")
29+
IO.puts("Total size: #{(total_size / 1024 / 1024) |> round} MB")
1730
end
1831

19-
def clean_dir(path) do
20-
File.rm_rf!(path)
21-
File.mkdir!(path)
22-
end
32+
def worker({path, i}, outdir, count) do
33+
%{"code" => code, "lang" => lang} = YamlElixir.read_from_file!(path)
34+
pat = ~r/[\W\d]+/u
2335

24-
def worker(path, outdir) do
25-
%{"code" => code} = YamlElixir.read_from_file!(path)
36+
filepath = String.replace(path, ".yml", ".txt")
37+
%File.Stat{:size => filesize} = File.stat!(filepath)
2638

27-
words =
28-
File.read!(String.replace(path, ".yml", ".txt"))
39+
content =
40+
File.read!(filepath)
2941
|> String.downcase()
3042
|> String.trim()
31-
|> then(&Regex.split(~r/[\W\d]+/u, &1))
43+
|> then(&Regex.split(pat, &1))
3244
|> MapSet.new()
33-
# sorting does not respect collation
34-
|> Enum.sort()
45+
|> Enum.join("\n")
3546

36-
File.write!("#{outdir}/extracted-#{code}.txt", Enum.join(words, "\n"))
37-
IO.puts(path)
38-
end
47+
# sorting does not respect collation so it is ignored
48+
# |> Enum.sort()
3949

40-
def walk(path, pattern) do
41-
dir = String.to_charlist(path)
42-
regexp = String.to_charlist(pattern)
50+
it = i |> Integer.to_string() |> String.pad_leading(3)
51+
IO.puts("[#{it}/#{count}] #{path}/#{lang}-#{code}.txt")
4352

44-
:filelib.fold_files(dir, regexp, true, fn file, acc -> [file | acc] end, [])
45-
|> Enum.map(fn filepath -> to_string(filepath) end)
53+
File.write!("./#{outdir}/#{lang}-#{code}.txt", content)
54+
filesize
4655
end
4756
end

example-elixir/mix.exs

Lines changed: 3 additions & 2 deletions
@@ -4,7 +4,7 @@ defmodule WordsExtractor.MixProject do
44
def project do
55
[
66
app: :words_extractor_ex,
7-
version: "0.1.0",
7+
version: "0.2.0",
88
elixir: "~> 1.13",
99
start_permanent: Mix.env() == :prod,
1010
deps: deps()
@@ -21,7 +21,8 @@ defmodule WordsExtractor.MixProject do
2121
# Run "mix help deps" to learn about dependencies.
2222
defp deps do
2323
[
24-
{:yaml_elixir, "~> 2.8"}
24+
{:yaml_elixir, "~> 2.8"},
25+
{:timex, "~> 3.7"}
2526
# {:dep_from_hexpm, "~> 0.3.0"},
2627
# {:dep_from_git, git: "https://github.com/elixir-lang/my_dep.git", tag: "0.1.0"}
2728
]

example-elixir/mix.lock

Lines changed: 12 additions & 0 deletions
@@ -1,4 +1,16 @@
11
%{
2+
"certifi": {:hex, :certifi, "2.8.0", "d4fb0a6bb20b7c9c3643e22507e42f356ac090a1dcea9ab99e27e0376d695eba", [:rebar3], [], "hexpm", "6ac7efc1c6f8600b08d625292d4bbf584e14847ce1b6b5c44d983d273e1097ea"},
3+
"combine": {:hex, :combine, "0.10.0", "eff8224eeb56498a2af13011d142c5e7997a80c8f5b97c499f84c841032e429f", [:mix], [], "hexpm", "1b1dbc1790073076580d0d1d64e42eae2366583e7aecd455d1215b0d16f2451b"},
4+
"gettext": {:hex, :gettext, "0.19.1", "564953fd21f29358e68b91634799d9d26989f8d039d7512622efb3c3b1c97892", [:mix], [], "hexpm", "10c656c0912b8299adba9b061c06947511e3f109ab0d18b44a866a4498e77222"},
5+
"hackney": {:hex, :hackney, "1.18.0", "c4443d960bb9fba6d01161d01cd81173089686717d9490e5d3606644c48d121f", [:rebar3], [{:certifi, "~>2.8.0", [hex: :certifi, repo: "hexpm", optional: false]}, {:idna, "~>6.1.0", [hex: :idna, repo: "hexpm", optional: false]}, {:metrics, "~>1.0.0", [hex: :metrics, repo: "hexpm", optional: false]}, {:mimerl, "~>1.1", [hex: :mimerl, repo: "hexpm", optional: false]}, {:parse_trans, "3.3.1", [hex: :parse_trans, repo: "hexpm", optional: false]}, {:ssl_verify_fun, "~>1.1.0", [hex: :ssl_verify_fun, repo: "hexpm", optional: false]}, {:unicode_util_compat, "~>0.7.0", [hex: :unicode_util_compat, repo: "hexpm", optional: false]}], "hexpm", "9afcda620704d720db8c6a3123e9848d09c87586dc1c10479c42627b905b5c5e"},
6+
"idna": {:hex, :idna, "6.1.1", "8a63070e9f7d0c62eb9d9fcb360a7de382448200fbbd1b106cc96d3d8099df8d", [:rebar3], [{:unicode_util_compat, "~>0.7.0", [hex: :unicode_util_compat, repo: "hexpm", optional: false]}], "hexpm", "92376eb7894412ed19ac475e4a86f7b413c1b9fbb5bd16dccd57934157944cea"},
7+
"metrics": {:hex, :metrics, "1.0.1", "25f094dea2cda98213cecc3aeff09e940299d950904393b2a29d191c346a8486", [:rebar3], [], "hexpm", "69b09adddc4f74a40716ae54d140f93beb0fb8978d8636eaded0c31b6f099f16"},
8+
"mimerl": {:hex, :mimerl, "1.2.0", "67e2d3f571088d5cfd3e550c383094b47159f3eee8ffa08e64106cdf5e981be3", [:rebar3], [], "hexpm", "f278585650aa581986264638ebf698f8bb19df297f66ad91b18910dfc6e19323"},
9+
"parse_trans": {:hex, :parse_trans, "3.3.1", "16328ab840cc09919bd10dab29e431da3af9e9e7e7e6f0089dd5a2d2820011d8", [:rebar3], [], "hexpm", "07cd9577885f56362d414e8c4c4e6bdf10d43a8767abb92d24cbe8b24c54888b"},
10+
"ssl_verify_fun": {:hex, :ssl_verify_fun, "1.1.6", "cf344f5692c82d2cd7554f5ec8fd961548d4fd09e7d22f5b62482e5aeaebd4b0", [:make, :mix, :rebar3], [], "hexpm", "bdb0d2471f453c88ff3908e7686f86f9be327d065cc1ec16fa4540197ea04680"},
11+
"timex": {:hex, :timex, "3.7.6", "502d2347ec550e77fdf419bc12d15bdccd31266bb7d925b30bf478268098282f", [:mix], [{:combine, "~> 0.10", [hex: :combine, repo: "hexpm", optional: false]}, {:gettext, "~> 0.10", [hex: :gettext, repo: "hexpm", optional: false]}, {:tzdata, "~> 1.0", [hex: :tzdata, repo: "hexpm", optional: false]}], "hexpm", "a296327f79cb1ec795b896698c56e662ed7210cc9eb31f0ab365eb3a62e2c589"},
12+
"tzdata": {:hex, :tzdata, "1.1.1", "20c8043476dfda8504952d00adac41c6eda23912278add38edc140ae0c5bcc46", [:mix], [{:hackney, "~> 1.17", [hex: :hackney, repo: "hexpm", optional: false]}], "hexpm", "a69cec8352eafcd2e198dea28a34113b60fdc6cb57eb5ad65c10292a6ba89787"},
13+
"unicode_util_compat": {:hex, :unicode_util_compat, "0.7.0", "bc84380c9ab48177092f43ac89e4dfa2c6d62b40b8bd132b1059ecc7232f9a78", [:rebar3], [], "hexpm", "25eee6d67df61960cf6a794239566599b09e17e668d3700247bc498638152521"},
214
"yamerl": {:hex, :yamerl, "0.10.0", "4ff81fee2f1f6a46f1700c0d880b24d193ddb74bd14ef42cb0bcf46e81ef2f8e", [:rebar3], [], "hexpm", "346adb2963f1051dc837a2364e4acf6eb7d80097c0f53cbdc3046ec8ec4b4e6e"},
315
"yaml_elixir": {:hex, :yaml_elixir, "2.8.0", "c7ff0034daf57279c2ce902788ce6fdb2445532eb4317e8df4b044209fae6832", [:mix], [{:yamerl, "~> 0.8", [hex: :yamerl, repo: "hexpm", optional: false]}], "hexpm", "4b674bd881e373d1ac6a790c64b2ecb69d1fd612c2af3b22de1619c15473830b"},
416
}

example-julia/src/words.jl

Lines changed: 25 additions & 22 deletions
@@ -2,14 +2,15 @@ module words_extractor_jl
22

33
using Distributed
44
using YAML
5+
using Glob
56

6-
const folder = "words"
7+
const outdir = "words"
78

8-
function worker(yaml_path)
9+
function worker(yaml_path, i, count)
910
path = get_filepath(yaml_path)
1011
words = get_words(yaml_path)
1112
write(path, join(words, "\n"))
12-
# println(string("Saved...", path))
13+
println("[$(lpad(i, 3, ' '))/$count] $path")
1314
end
1415

1516
function get_words(yaml_path)
@@ -20,35 +21,37 @@ end
2021

2122
function get_filepath(path)
2223
meta = YAML.load_file(path)
23-
string(folder, "/extracted-words-for-", meta["label"], ".txt")
24+
"""./$outdir/$(meta["lang"])-$(meta["code"]).txt"""
2425
end
2526

26-
function walk(path, file_ext)
27-
res = []
28-
for (root, _, files) in walkdir(path, topdown = true)
29-
for file in files
30-
if endswith(file, file_ext)
31-
filepath = joinpath(root, file)
32-
push!(res, filepath)
33-
end
34-
end
27+
function rdir(dir::AbstractString, pat::Glob.FilenameMatch)
28+
result = String[]
29+
for (root, dirs, files) in walkdir(dir)
30+
filepaths = joinpath.(root, files)
31+
append!(result, filter!(f -> occursin(pat, f), filepaths))
3532
end
36-
res
33+
return result
3734
end
3835

36+
rdir(dir::AbstractString, pat::AbstractString) = rdir(dir, Glob.FilenameMatch(pat))
37+
3938
function main()
40-
if ispath(folder)
41-
rm(folder, recursive = true)
39+
if ispath(outdir)
40+
rm(outdir, recursive = true)
4241
end
43-
mkdir(folder)
44-
Threads.@threads for path in walk("../data/pl/", ".yml")
45-
# println("Spawn $path")
46-
worker(path)
42+
mkdir(outdir)
43+
paths = rdir("../data", fn"../data/??/*.yml")
44+
count = length(paths)
45+
i = 1
46+
Threads.@threads for path in paths
47+
# println("Spawn $path")
48+
worker(path, i, count)
49+
i += 1
4750
end
4851
end
4952

50-
# addprocs()
51-
# println(string("Workers ", nworkers()))
53+
addprocs()
54+
println(string("Workers ", nworkers()))
5255
println(string("Processing... using ", Threads.nthreads(), " threads"))
5356
@time main()
5457
end # module
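A note on the progress counter in the Julia change above: incrementing a shared `i` inside a threaded loop appears to be racy, so the printed indices may repeat or skip. One way around this is to pair each path with its index before dispatching the work. The sketch below shows the idea in Python; `progress_label` and `run_indexed` are hypothetical names, not part of the repository.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor


def progress_label(i: int, count: int, path: str) -> str:
    """Format a progress prefix such as '[  7/123] some/path'."""
    return f"[{i:3d}/{count}] {path}"


def run_indexed(paths: list[str]) -> list[str]:
    """Label every path with its own 1-based index, assigned before dispatch."""
    count = len(paths)
    with ThreadPoolExecutor() as pool:
        # enumerate() fixes the indices up front, so no counter is shared between threads.
        return list(
            pool.map(
                lambda item: progress_label(item[0], count, item[1]),
                enumerate(paths, start=1),
            )
        )
```

Because each task receives its index as an argument, the labels stay correct regardless of the order in which the threads actually run.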

example-python/words.py

Lines changed: 0 additions & 1 deletion
@@ -51,7 +51,6 @@ def worker(path: str, outdir: str, sorting: bool = False) -> Tuple[str, int]:
5151

5252
pool = mp.Pool(mp.cpu_count())
5353

54-
print("Processing")
5554
results = []
5655
paths = glob.glob(src_path, recursive=True)
5756
if not paths:

example-ruby/words.rb

Lines changed: 13 additions & 11 deletions
@@ -1,29 +1,31 @@
11
require 'yaml'
2+
require 'yaml'
23
require 'parallel'
34
require 'etc'
45
require 'fileutils'
56

67
outdir = 'words'
7-
start = Time.now
88

99
FileUtils.rm_rf(outdir)
1010
Dir.mkdir(outdir)
1111

12+
t = Time.now
13+
1214
sorted = false
1315

14-
paths = Dir['../data/pl/**/*.yml']
16+
paths = Dir['../data/??/**/*.yml']
17+
count = paths.count
1518

16-
Parallel.each(paths, in_processes: Etc.nprocessors) do |yaml_path|
19+
sizes = Parallel.map_with_index(paths, in_processes: Etc.nprocessors) do |yaml_path, i|
1720
meta = YAML.load_file(yaml_path)
18-
words = IO.read(yaml_path.gsub('.yml', '.txt')).downcase.strip.split(/[^\p{word}]+/).uniq
19-
if sorted
20-
words = words.sort
21-
end
21+
filepath = yaml_path.gsub('.yml', '.txt')
22+
words = IO.read(filepath).downcase.strip.split(/[^\p{word}]+/).uniq
23+
words = words.sort if sorted
2224
outpath = "#{outdir}/#{meta['lang']}-#{meta['code']}.txt"
23-
puts outpath
25+
puts(format('[%3d/%d] %s/%s', i + 1, count, yaml_path, outpath))
2426
File.write(outpath, words.join("\n"))
27+
File.size(filepath)
2528
end
2629

27-
secs = Time.now - start
28-
puts "Total time: #{secs} s"
29-
30+
puts "Total size: #{(sizes.sum / 1024.0 / 1024).round} MB"
31+
puts "Total time: #{Time.now - t} s"
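The "Total size" line that the Crystal, Elixir, and Ruby versions now print follows the same arithmetic: sum the per-file byte counts returned by the workers, divide by 1024 twice, and round. A small Python sketch of that calculation, with the hypothetical helper name `total_size_line`:

```python
def total_size_line(sizes_in_bytes):
    """Sum per-file sizes in bytes and report the total in whole megabytes."""
    total_mb = sum(sizes_in_bytes) / 1024 / 1024
    return f"Total size: {round(total_mb)} MB"
```

Note that Python's `round` rounds exact halves to the nearest even integer, which may differ from the rounding used in the other languages for values such as 2.5.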

example-rust/README.md

Lines changed: 0 additions & 7 deletions
@@ -7,10 +7,3 @@ cargo build --release
77
target/release/words_extractor_rs
88
99
```
10-
11-
MacOS 12.2
12-
Rust 1.58.1
13-
MBP 16" M1Max 10 cores
14-
Total files: 123
15-
Total size: 504 MB
16-
Total time: 0.3521 s
