Commit bc8ecd5

Merge branch 'master' into golang-example-refactor
2 parents ca5bb38 + 735539d commit bc8ecd5

File tree

21 files changed

+516
-449
lines changed

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -12,4 +12,5 @@
 /**/main.dwarf
 /**/main
 /**/target
-/**/.idea
+/**/.idea
+/**/.venv

README.md

Lines changed: 33 additions & 11 deletions
@@ -1,26 +1,48 @@
 # words_extractor
 
 ### Info
+
 Example of text file parsing in several programming languages. The goal is to extract unique words from utf-8 files and save the results into separate files.
 
+The difficulty in sorting words is due to the need to handle sorting rules according to the grammar of each language. This is quite a complex problem that does not exist for the English language, where the character set does not exceed the basic ASCII standard.
+
 ### Results
 
-The following results are for 936 files (2745 MB) on MacOS 12.2 and MacBook Pro 16" 64GB 2TB M1Max 10 cores. (For more text files go into data/pl/* and duplicate files several times.) All examples are using a similar logic and approach.
+The following results are for 123 unique utf-8 Bible text files in 23 languages (used at the mybible.pl site). They take 504 MB. (The repo contains only a few sample files in the 'data' folder. To test with more data, you can multiply the files by cloning each *.txt (and the associated *.yml) file under a different name.)
+
+* Platform: MacOS 12.2
+* Machine: MacBook Pro 16" 64GB 2TB M1Max 10 cores.
 
 <pre>
-1. Rust v1.58.1 = 7.54s
-2. Python v3.10.2 = 15.34s (with multiprocessing)
-3. Julia v1.7.1 = 17.00s
-4. Crystal v1.3.2 = 26.32s
-5. Ruby v3.1.0 = 40.94s (with Parallel)
-6. Golang v1.18beta1 = 73.00s
-7. Elixir v1.13.2 = 2m43s
+1. Rust 1.58 = 1.14s (with sorting: 1.59s) with Tokio (previous: 1.34s, with sorting: 1.79s)
+2. Golang 1.17.6 = 1.34s (with sorting: 6.55s)
+3. Python 3.10.2 = 2.80s (with multiprocessing)
+4. Julia 1.7.1 = 4.522
+5. Crystal 1.3.2 = 5.72s
+6. Elixir 1.13.2 = 7.82s
+7. Ruby 3.1.0 = 8.31s (with Parallel)
 </pre>
 
 ### Conclusion
 
-Rust is the fastest language beyond doubt.
+The new optimized Golang code version is very fast, slower than Rust but faster than the other languages. Golang is the only language at the moment with full mature i18n support for the arm64/M1 platform.
+
+* Rust = the current example uses [lexical-sort](https://lib.rs/crates/lexical-sort) which is not perfect. [There is no standard mature implementation of i18n in Rust](https://www.arewewebyet.org/topics/i18n/) at the moment.
+
+* Python = has a great implementation of the [ICU](https://icu.unicode.org/related) library; however, it does not support the arm64/M1 platform, hence I couldn't use it in this comparison.
+
+* Ruby = same as Python, no ICU for M1.
+
+* Elixir = same as Python, no ICU for M1.
+
+* Julia = I couldn't find a good i18n library supporting many languages.
+
+* Crystal = currently supports only Turkish collations. Probably because the language is young and does not have a large enough community or company behind it.
+
+* Golang = has rules for many languages. You can see the influence of a large company and community, which makes Golang a mature solution. Sorting slowed the whole task down significantly, but the result is correct (in this case I only checked the results for the Polish language).
+
+### Kudos
 
-What is surprising is Golang's pretty poor performance on this task. Crystal is faster than Golang but in this task it is still slower than Python, which is also surprising. (Neither Golang nor Crystal is my main field of expertise so maybe there is some room for improvement. Although I showed this code to people and nobody so far could improve it in any significant way. But if I find a better implementation I will update this comparison.)
+[@romanatnews](https://github.com/romanatnews) (Golang example refactoring)
 
-The high Python performance is interesting. Although it is using a multiprocessing standard library for full CPU cores utilization, this is still a dynamic interpreted language after all, which is rather expected to be slower than statically typed languages.
+[@pan93412](https://github.com/pan93412) (Rust example refactoring using Tokio runtime)
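
The README's point about language-specific sorting rules can be illustrated with a small standard-library-only Go sketch. It is not the repository's implementation: it assumes a simple letter-by-letter comparison against the Polish alphabet (the string is copied from the `CHARSET` constant in the Crystal example), and the names `rank`, `position`, `less`, and `sortPolish` are hypothetical. Real collation libraries apply more elaborate, multi-level rules.

```go
package main

import (
	"fmt"
	"sort"
)

// polishAlphabet is copied from the CHARSET constant in the Crystal example.
const polishAlphabet = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"

// rank maps each letter to its position in the alphabet (hypothetical helper).
var rank = func() map[rune]int {
	m := make(map[rune]int)
	i := 0
	for _, r := range polishAlphabet {
		m[r] = i
		i++
	}
	return m
}()

// position returns the alphabet position of r. Letters outside the alphabet
// are placed after all known letters, which is an assumption of this sketch.
func position(r rune) int {
	if p, ok := rank[r]; ok {
		return p
	}
	return len(rank) + int(r)
}

// less compares two lowercase words letter by letter in alphabet order.
func less(a, b string) bool {
	ra, rb := []rune(a), []rune(b)
	for i := 0; i < len(ra) && i < len(rb); i++ {
		pa, pb := position(ra[i]), position(rb[i])
		if pa != pb {
			return pa < pb
		}
	}
	return len(ra) < len(rb)
}

// sortPolish returns a sorted copy of words without modifying the input.
func sortPolish(words []string) []string {
	out := append([]string(nil), words...)
	sort.Slice(out, func(i, j int) bool { return less(out[i], out[j]) })
	return out
}

func main() {
	words := []string{"żaba", "zupa", "łódź", "lato", "ćma", "cytryna"}

	naive := append([]string(nil), words...)
	sort.Strings(naive) // byte order: accented letters land after "z"

	fmt.Println("byte order:    ", naive)
	fmt.Println("alphabet order:", sortPolish(words))
}
```

A plain byte-order sort places every word starting with an accented letter after "zupa", while the alphabet-aware comparison keeps "ćma" next to "cytryna" and "łódź" next to "lato".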
Lines changed: 17 additions & 18 deletions
@@ -1,47 +1,46 @@
 require "json"
 require "yaml"
 
-# TODO: Write documentation for `FastWordsCr`
-
 module Example::Crystal
-  VERSION = "0.2.0"
+  VERSION = "0.3.0"
   CHARSET = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"
 
-  def self.main(outpath = "words")
+  def self.main(outdir = "words")
     with_sorting = false
     concurrent = true
 
-    prepare_folder(outpath, "*.txt")
+    prepare_folder(outdir, "*.txt")
 
     file_count = 0
     total_size = 0
 
     channel = Channel(Tuple(String, Int64)).new
     srcPath = "../data/??/**/*.yml"
-    # srcPath = "./bibles/??/**/*.yml"
-    Dir.glob(srcPath, follow_symlinks: true).each do |path|
+    paths = Dir.glob(srcPath, follow_symlinks: true)
+    count = paths.size
+    paths.each do |path|
       if concurrent
         spawn do
-          channel.send worker(path, outpath, with_sorting)
+          channel.send worker(path, outdir, with_sorting)
         end
         file_count += 1
       else
-        worker(path, outpath, with_sorting)
+        worker(path, outdir, with_sorting)
       end
     end
     if concurrent
       file_count.times do |i|
         path, size = channel.receive
         total_size += size
-        # puts("[#{i + 1}/#{file_count}] #{path}")
+        puts(::sprintf("[%3d/%d] %s", i + 1, file_count, path))
       end
     end
-    total_size = total_size / 1024 / 1024
-    puts("Total size: #{total_size} MB")
+    puts("Total size: #{(total_size / 1024 / 1024).round} MB")
   end
 
-  def self.worker(path, outpath, with_sorting)
+  def self.worker(path, outdir, with_sorting)
     filepath = path.gsub(".yml", ".txt")
+    filesize = File.size(filepath)
     text = File.read(filepath).gsub("\n", " ").downcase
 
     words = text.split(/[^\p{L}]+/).to_set
@@ -51,10 +50,9 @@ module Example::Crystal
     end
 
     meta = File.open(path) { |file| YAML.parse(file) }
-    filepath = %Q(#{outpath}/#{meta["lang"]}-#{meta["code"]}.txt)
-    File.write(filepath, words.join("\n"))
+    outfilepath = %Q(#{outdir}/#{meta["lang"]}-#{meta["code"]}.txt)
+    File.write(outfilepath, words.join("\n"))
     filesize = File.size(filepath)
-    puts([filepath, filesize])
     {filepath, filesize}
   end
 
@@ -78,7 +76,8 @@ module Example::Crystal
   end
 end
 
-elapsed_time = Time.measure do
+elapsed = Time.measure do
   Example::Crystal.main
 end
-puts elapsed_time
+
+puts("Total time: #{elapsed}")
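
The core extraction step in the Crystal worker (lowercase the text, split on runs of non-letters with `[^\p{L}]+`, collect the pieces into a set) can be sketched in Go with the standard library alone. This is an illustration, not code from the repository; the function name `uniqueWords` and the sample sentence are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nonLetters matches one or more characters that are not Unicode letters,
// the same pattern the Crystal example passes to split.
var nonLetters = regexp.MustCompile(`[^\p{L}]+`)

// uniqueWords lowercases text, splits it on non-letter runs, and returns
// the distinct words as a set. Empty fragments, which appear when the text
// starts or ends with a separator, are skipped.
func uniqueWords(text string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, w := range nonLetters.Split(strings.ToLower(text), -1) {
		if w != "" {
			set[w] = struct{}{}
		}
	}
	return set
}

func main() {
	words := uniqueWords("Ala ma kota, a kot ma Alę 123.")
	fmt.Println(len(words), "unique words")
}
```

Because `\p{L}` covers letters from every script, accented words such as "alę" stay intact, and digits and punctuation act as separators.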
Lines changed: 40 additions & 30 deletions
@@ -1,47 +1,57 @@
+use Timex
+
 defmodule WordsExtractor do
   @moduledoc nil
 
+  @pat ~r/[\W\d]+/u
+
   def run do
     outdir = "words"
-    clean_dir(outdir)
-
-    walk("../data/pl/", ".yml")
-    |> Task.async_stream(
-      WordsExtractor,
-      :worker,
-      [outdir],
-      ordered: false,
-      timeout: :infinity
-    )
-    |> Enum.to_list()
+    File.rm_rf!(outdir)
+    File.mkdir!(outdir)
+    t = Duration.now()
+    paths = Path.wildcard("../data/??/**/*.yml")
+    count = length(paths)
+
+    total_size =
+      paths
+      |> Enum.with_index(1)
+      |> Task.async_stream(
+        WordsExtractor,
+        :worker,
+        [outdir, count],
+        ordered: false,
+        timeout: :infinity
+      )
+      |> Enum.to_list()
+      |> Enum.reduce(0, fn {:ok, num}, acc -> acc + num end)
+
+    elapsed = Duration.diff(Duration.now(), t, :milliseconds)
+    IO.puts("Total time: #{elapsed / 1000}s")
+    IO.puts("Total size: #{(total_size / 1024 / 1024) |> round} MB")
   end
 
-  def clean_dir(path) do
-    File.rm_rf!(path)
-    File.mkdir!(path)
-  end
+  def worker({path, i}, outdir, count) do
+    %{"code" => code, "lang" => lang} = YamlElixir.read_from_file!(path)
 
-  def worker(path, outdir) do
-    %{"code" => code} = YamlElixir.read_from_file!(path)
+    filepath = String.replace(path, ".yml", ".txt")
+    %File.Stat{:size => filesize} = File.stat!(filepath)
 
-    words =
-      File.read!(String.replace(path, ".yml", ".txt"))
+    content =
+      File.read!(filepath)
       |> String.downcase()
       |> String.trim()
-      |> then(&Regex.split(~r/[\W\d]+/u, &1))
+      |> then(&Regex.split(@pat, &1))
       |> MapSet.new()
-      # sorting does not respect collation
-      |> Enum.sort()
+      |> Enum.join("\n")
 
-    File.write!("#{outdir}/extracted-#{code}.txt", Enum.join(words, "\n"))
-    IO.puts(path)
-  end
+    # sorting does not respect collation so it is ignored
+    # |> Enum.sort()
 
-  def walk(path, pattern) do
-    dir = String.to_charlist(path)
-    regexp = String.to_charlist(pattern)
+    it = i |> Integer.to_string() |> String.pad_leading(3)
+    IO.puts("[#{it}/#{count}] #{path}/#{lang}-#{code}.txt")
 
-    :filelib.fold_files(dir, regexp, true, fn file, acc -> [file | acc] end, [])
-    |> Enum.map(fn filepath -> to_string(filepath) end)
+    File.write!("./#{outdir}/#{lang}-#{code}.txt", content)
+    filesize
   end
 end
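
The worker above derives two paths from each metadata file: the `.txt` source next to the `.yml` file, and the `<outdir>/<lang>-<code>.txt` output file. A minimal Go sketch of that derivation follows, assuming slash-separated paths. The function names `textPath` and `outputPath` and the sample values (`pl`, `bt`, `sample.yml`) are hypothetical.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// textPath returns the .txt file that sits next to a .yml metadata file.
// Only the trailing extension is swapped, so a ".yml" that happens to occur
// earlier in the path is left alone.
func textPath(ymlPath string) string {
	return strings.TrimSuffix(ymlPath, ".yml") + ".txt"
}

// outputPath builds "<outdir>/<lang>-<code>.txt", the naming scheme this
// commit introduces across the examples.
func outputPath(outdir, lang, code string) string {
	return path.Join(outdir, lang+"-"+code+".txt")
}

func main() {
	yml := "../data/pl/sample.yml" // hypothetical input path
	fmt.Println(textPath(yml))
	fmt.Println(outputPath("words", "pl", "bt")) // hypothetical lang and code
}
```

Swapping only the suffix differs slightly from a global string replacement of ".yml", which would also rewrite a directory name containing that text.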

example-elixir/mix.exs

Lines changed: 3 additions & 2 deletions
@@ -4,7 +4,7 @@ defmodule WordsExtractor.MixProject do
   def project do
     [
       app: :words_extractor_ex,
-      version: "0.1.0",
+      version: "0.2.0",
       elixir: "~> 1.13",
       start_permanent: Mix.env() == :prod,
       deps: deps()
@@ -21,7 +21,8 @@ defmodule WordsExtractor.MixProject do
   # Run "mix help deps" to learn about dependencies.
   defp deps do
     [
-      {:yaml_elixir, "~> 2.8"}
+      {:yaml_elixir, "~> 2.8"},
+      {:timex, "~> 3.7"}
       # {:dep_from_hexpm, "~> 0.3.0"},
       # {:dep_from_git, git: "https://github.com/elixir-lang/my_dep.git", tag: "0.1.0"}
     ]

example-elixir/mix.lock

Lines changed: 12 additions & 0 deletions
@@ -1,4 +1,16 @@
 %{
+  "certifi": {:hex, :certifi, "2.8.0", "d4fb0a6bb20b7c9c3643e22507e42f356ac090a1dcea9ab99e27e0376d695eba", [:rebar3], [], "hexpm", "6ac7efc1c6f8600b08d625292d4bbf584e14847ce1b6b5c44d983d273e1097ea"},
+  "combine": {:hex, :combine, "0.10.0", "eff8224eeb56498a2af13011d142c5e7997a80c8f5b97c499f84c841032e429f", [:mix], [], "hexpm", "1b1dbc1790073076580d0d1d64e42eae2366583e7aecd455d1215b0d16f2451b"},
+  "gettext": {:hex, :gettext, "0.19.1", "564953fd21f29358e68b91634799d9d26989f8d039d7512622efb3c3b1c97892", [:mix], [], "hexpm", "10c656c0912b8299adba9b061c06947511e3f109ab0d18b44a866a4498e77222"},
+  "hackney": {:hex, :hackney, "1.18.0", "c4443d960bb9fba6d01161d01cd81173089686717d9490e5d3606644c48d121f", [:rebar3], [{:certifi, "~>2.8.0", [hex: :certifi, repo: "hexpm", optional: false]}, {:idna, "~>6.1.0", [hex: :idna, repo: "hexpm", optional: false]}, {:metrics, "~>1.0.0", [hex: :metrics, repo: "hexpm", optional: false]}, {:mimerl, "~>1.1", [hex: :mimerl, repo: "hexpm", optional: false]}, {:parse_trans, "3.3.1", [hex: :parse_trans, repo: "hexpm", optional: false]}, {:ssl_verify_fun, "~>1.1.0", [hex: :ssl_verify_fun, repo: "hexpm", optional: false]}, {:unicode_util_compat, "~>0.7.0", [hex: :unicode_util_compat, repo: "hexpm", optional: false]}], "hexpm", "9afcda620704d720db8c6a3123e9848d09c87586dc1c10479c42627b905b5c5e"},
+  "idna": {:hex, :idna, "6.1.1", "8a63070e9f7d0c62eb9d9fcb360a7de382448200fbbd1b106cc96d3d8099df8d", [:rebar3], [{:unicode_util_compat, "~>0.7.0", [hex: :unicode_util_compat, repo: "hexpm", optional: false]}], "hexpm", "92376eb7894412ed19ac475e4a86f7b413c1b9fbb5bd16dccd57934157944cea"},
+  "metrics": {:hex, :metrics, "1.0.1", "25f094dea2cda98213cecc3aeff09e940299d950904393b2a29d191c346a8486", [:rebar3], [], "hexpm", "69b09adddc4f74a40716ae54d140f93beb0fb8978d8636eaded0c31b6f099f16"},
+  "mimerl": {:hex, :mimerl, "1.2.0", "67e2d3f571088d5cfd3e550c383094b47159f3eee8ffa08e64106cdf5e981be3", [:rebar3], [], "hexpm", "f278585650aa581986264638ebf698f8bb19df297f66ad91b18910dfc6e19323"},
+  "parse_trans": {:hex, :parse_trans, "3.3.1", "16328ab840cc09919bd10dab29e431da3af9e9e7e7e6f0089dd5a2d2820011d8", [:rebar3], [], "hexpm", "07cd9577885f56362d414e8c4c4e6bdf10d43a8767abb92d24cbe8b24c54888b"},
+  "ssl_verify_fun": {:hex, :ssl_verify_fun, "1.1.6", "cf344f5692c82d2cd7554f5ec8fd961548d4fd09e7d22f5b62482e5aeaebd4b0", [:make, :mix, :rebar3], [], "hexpm", "bdb0d2471f453c88ff3908e7686f86f9be327d065cc1ec16fa4540197ea04680"},
+  "timex": {:hex, :timex, "3.7.6", "502d2347ec550e77fdf419bc12d15bdccd31266bb7d925b30bf478268098282f", [:mix], [{:combine, "~> 0.10", [hex: :combine, repo: "hexpm", optional: false]}, {:gettext, "~> 0.10", [hex: :gettext, repo: "hexpm", optional: false]}, {:tzdata, "~> 1.0", [hex: :tzdata, repo: "hexpm", optional: false]}], "hexpm", "a296327f79cb1ec795b896698c56e662ed7210cc9eb31f0ab365eb3a62e2c589"},
+  "tzdata": {:hex, :tzdata, "1.1.1", "20c8043476dfda8504952d00adac41c6eda23912278add38edc140ae0c5bcc46", [:mix], [{:hackney, "~> 1.17", [hex: :hackney, repo: "hexpm", optional: false]}], "hexpm", "a69cec8352eafcd2e198dea28a34113b60fdc6cb57eb5ad65c10292a6ba89787"},
+  "unicode_util_compat": {:hex, :unicode_util_compat, "0.7.0", "bc84380c9ab48177092f43ac89e4dfa2c6d62b40b8bd132b1059ecc7232f9a78", [:rebar3], [], "hexpm", "25eee6d67df61960cf6a794239566599b09e17e668d3700247bc498638152521"},
   "yamerl": {:hex, :yamerl, "0.10.0", "4ff81fee2f1f6a46f1700c0d880b24d193ddb74bd14ef42cb0bcf46e81ef2f8e", [:rebar3], [], "hexpm", "346adb2963f1051dc837a2364e4acf6eb7d80097c0f53cbdc3046ec8ec4b4e6e"},
   "yaml_elixir": {:hex, :yaml_elixir, "2.8.0", "c7ff0034daf57279c2ce902788ce6fdb2445532eb4317e8df4b044209fae6832", [:mix], [{:yamerl, "~> 0.8", [hex: :yamerl, repo: "hexpm", optional: false]}], "hexpm", "4b674bd881e373d1ac6a790c64b2ecb69d1fd612c2af3b22de1619c15473830b"},
 }

example-golang/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+/coverage.out

example-golang/Makefile

Lines changed: 10 additions & 2 deletions
@@ -8,14 +8,22 @@ build:
 run: build
 	./${BINARY_NAME}
 
+run-sort: build
+	./${BINARY_NAME} -n 10 -s
+
 test:
+	@go test ./... -v
+
+coverage:
 	@go test ./... -v -coverprofile=coverage.out
-
+
+
 cover: test
 	@go tool cover -html=coverage.out
 
 clean:
 	@go clean
 	rm -f coverage.out
 	rm -f ./${BINARY_NAME}
-	rm -rf ./words
+	rm -rf ./words
+

example-golang/README.md

Lines changed: 8 additions & 1 deletion
@@ -4,5 +4,12 @@
 
 ```
 make build
-./main [-n=NUMBER_OF_WORKERS, integer] [-s]
+./main -n 8
 ```
+
+<pre>
+Usage of ./main:
+  -n int
+        Number of workers to run (zero to match the number of available CPUs) (default 10)
+  -s    Sort results
+</pre>
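
The usage text above could be produced by flag definitions along the following lines. This is a sketch using Go's standard `flag` package, not the repository's `main.go` (which is not shown in this diff); the `options` type and `parseOptions` function are hypothetical, and the zero-means-all-CPUs handling is assumed from the help text.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"runtime"
)

// options holds the parsed command-line settings (hypothetical type).
type options struct {
	workers int
	sort    bool
}

// parseOptions parses -n and -s from args. A worker count of zero is
// replaced by the number of available CPUs, as the usage text describes.
func parseOptions(args []string) (options, error) {
	fs := flag.NewFlagSet("main", flag.ContinueOnError)
	n := fs.Int("n", 10, "Number of workers to run (zero to match the number of available CPUs)")
	s := fs.Bool("s", false, "Sort results")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	opts := options{workers: *n, sort: *s}
	if opts.workers == 0 {
		opts.workers = runtime.NumCPU()
	}
	return opts, nil
}

func main() {
	opts, err := parseOptions(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("workers=%d sort=%v\n", opts.workers, opts.sort)
}
```

Using a dedicated `flag.FlagSet` instead of the package-level flags keeps the parsing function callable from tests with an explicit argument slice.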

example-golang/app/app.go

Lines changed: 4 additions & 2 deletions
@@ -5,6 +5,8 @@ import (
 	"os"
 	"path/filepath"
 	"sync"
+
+	"github.com/bmatcuk/doublestar"
 )
 
 const dirPerms = 0755
@@ -15,7 +17,7 @@ type empty struct{}
 // No error handling, no context cancellation is implemented to match implementations
 // in other languages.
 func Run(srcDir, outDir string, numWorkers int, sortResults bool) error {
-	files, err := filepath.Glob(srcDir)
+	files, err := doublestar.Glob(srcDir)
 	if err != nil {
 		return fmt.Errorf(`app: getting list of files "%s": %w`, srcDir, err)
 	}
@@ -41,7 +43,7 @@ func Run(srcDir, outDir string, numWorkers int, sortResults bool) error {
 		}
 
 		src := file[:len(file)-3] + "txt"
-		dst := filepath.Join(outDir, spec.Lang+"-"+spec.Code+".txt")
+		dst := filepath.Join(outDir, spec.Lang+"-"+spec.Code+".txt")
 
 		wg.Add(1)
 		go extract(src, dst, sortResults, spec.Tag, sem, &wg)
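
The `sem` and `wg` arguments passed to `extract` suggest a semaphore channel combined with a `sync.WaitGroup` to cap the number of concurrent workers. The body of `extract` is not part of this diff, so the following is only a sketch of that general pattern under that assumption; `processAll` and its `work` callback are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll calls work for every item, running at most limit goroutines
// at a time, and returns the sum of the results.
func processAll(items []string, limit int, work func(string) int64) int64 {
	if limit < 1 {
		limit = 1 // an unbuffered semaphore would block forever
	}
	sem := make(chan struct{}, limit) // counting semaphore
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int64
	)
	for _, item := range items {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot before starting the goroutine
		go func(it string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			n := work(it)
			mu.Lock()
			total += n
			mu.Unlock()
		}(item)
	}
	wg.Wait()
	return total
}

func main() {
	files := []string{"a.txt", "bb.txt", "ccc.txt"} // hypothetical inputs
	size := processAll(files, 2, func(name string) int64 {
		return int64(len(name)) // stand-in for real per-file work
	})
	fmt.Println("total:", size)
}
```

Acquiring the semaphore in the loop, before the goroutine starts, bounds the number of live goroutines as well as the number doing work.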
