Commit 5457fa1

Merge branch 'm1'
# Conflicts:
#	README.md
#	words_extractor_go/main.go
2 parents 2fb5dee + dba6624 commit 5457fa1

41 files changed, 433 insertions(+), 377 deletions(-)

.gitignore

Lines changed: 5 additions & 1 deletion
@@ -1,6 +1,10 @@
 /.history/
 /.vscode/
-/.DS_Store
+/**/.DS_Store
 
 /words_extractor_pypy/
 /.elixir_ls/
+/.idea/
+/old/
+/**/.mypy_cache
+/**/*.pyc

README.md

Lines changed: 20 additions & 76 deletions
@@ -1,78 +1,22 @@
 # words_extractor
 
-Example of text file parsing in Python, Golang, Elixir, Rust, Crystal and Julia
-
-Text source: 79.4MB in 30 files
-
-- Rust 1.51.0 (parallel) with sorting: 0.91s, without sorting: 0.49s
-- Python 3.9.5 (parallel) with sorting: 3.42s, without sorting: 2.47
-- Crystal 1.0.0 (parallel) with sorting: 5.34s, withouts sorting: 2.48
-- Go 1.16.4 (parallel) with sorting: 6.41s, without sorting: 3.75s
-- Rust 1.51.0 with sorting: 7s, without sorting: 5s (no parallelism)
-- Julia 1.6.1 (8 threads) 8.7s, (1 thread) 9.7s without sorting
-- Python 3.9.5 with sorting: 10s, without sorting 8.32s (no parallelism)
-- Crystal 1.0.0 with sorting: 13s, without sorting: 7s (no parallelism)
-- Go 1.16.4 with sorting: 21s, without sorting: 11s (no parallelism)
-- Elixir 1.12 (parallel) with sorting: 33s (without release build)
-
-macOS 11.3.1, MacBook Pro (Retina, 15-inch, Late 2013)
-
-Python
-
-```bash
-cd words_extractor_py
-python words.py
-```
-
-Rust
-
-```
-cd words_extractor_rs
-cargo build --release
-target/release/words_extractor_rs
-```
-Golang
-
-```
-cd words_extractor_go
-make build
-GOGC=2000 ./main
-
-Crystal
-
-```
-cd words_extractor_cr
-crystal build --release -Dpreview_mt src/fast_words_cr.cr -o main
-CRYSTAL_WORKES=8 ./main
-```
-
-Julia
-
-```
-julia -t 8 src/words_extractor_jl.jl
-```
-
-Elixir
-
-```
-cd words_extractor_ex
-mix run -e "WordsExtractor.run"
-```
-
-## Running Python
-
-1. Install the latest Python 3.9.5
-2. Create venv and dependencies
-
-```bash
-cd words_extractor_py
-python -m venv venv
-source venv/bin/activate
-pip install -r requirements.txt
-```
-
-3. Run the code
-
-```bash
-python words_parallel.py
-```
+| | | | | |
+|--- |--- |--- |--- |--- |
+| | | | | |
+| | | | | |
+| | | | | |
+
+Example of a text file parsing in several programming languages
+
+MacOS 12.2
+Rust 1.58.1
+MBP 16" 64GB 2TB M1Max 10 cores
+Tested on 123 files (504MB)
+
+Results:
+
+1. Rust 1.58.1 -> 0.3521 s
+2. Ruby 3.1 with Parallel -> 2.0542 s
+3. Python 3.10.2 with multiprocessing -> 2.9403 s
+4. Crystal 1.3.2 with channels -> 6.0035 s
+5. Go 1.18beta1 with waitgroup -> 7.2166 s
Lines changed: 6 additions & 6 deletions
@@ -1,6 +1,6 @@
-The MIT License (MIT)
+MIT License
 
-Copyright (c) 2021 Jaroslaw Zabiello <hipertracker@gmail.com>
+Copyright (c) 2022 Jaroslaw Zabiello <jaroslaw.zabiello@easyfairs.com>
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
@@ -9,13 +9,13 @@ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 copies of the Software, and to permit persons to whom the Software is
 furnished to do so, subject to the following conditions:
 
-The above copyright notice and this permission notice shall be included in
-all copies or substantial portions of the Software.
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
 
 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
-THE SOFTWARE.
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

example-crystal/README.md

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+# Crystal
+
+## setup and run
+
+```
+crystal build --release -Dpreview_mt src/example-crystal.cr -o main
+CRYSTAL_WORKES=10 ./main
+```
+
+MacOS 12.2
+Crystal 1.3.2
+MBP 16" M1Max 10 cores
+Total files: 123
+Total size: 504 MB
+Total time: 6.0035 s

example-crystal/shard.yml

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+name: example-crystal
+version: 0.1.0
+
+authors:
+  - Jaroslaw Zabiello <[email protected]>
+
+targets:
+  example-crystal:
+    main: src/example-crystal.cr
+
+crystal: 1.3.2
+
+license: MIT
Lines changed: 4 additions & 4 deletions
@@ -1,21 +1,21 @@
 require "./spec_helper"
 
-describe FastWordsCr do
+describe Example::Crystal do
   it "should be understand PL collations" do
     words = %w(ala ąla ćma ciemno ęle element łódź lody śmierć serial źdźbło żółw zawias)
     expected = %w(ala ąla ciemno ćma element ęle lody łódź serial śmierć zawias źdźbło żółw)
-    words.sort { |x, y| FastWordsCr.cmp(x, y) }.should eq(expected)
+    words.sort { |x, y| Example::Crystal.word_cmp(x, y) }.should eq(expected)
   end
 
   it "should be understand PL collations for q" do
     words = %w(ala ąla ćma ciemno ęle element łódź lody śmierć serial źdźbło żółw zawias querty)
     expected = %w(ala ąla ciemno ćma element ęle lody łódź querty serial śmierć zawias źdźbło żółw)
-    words.sort { |x, y| FastWordsCr.cmp(x, y) }.should eq(expected)
+    words.sort { |x, y| Example::Crystal.word_cmp(x, y) }.should eq(expected)
   end
 
   it "should understand PL collations for upper chars" do
     words = %w(ala ąla Ćma ciemno Ęle Element łódź lody śmierć serial źdźbło żółw zawias querty)
     expected = %w(ala ąla ciemno Ćma Element Ęle lody łódź querty serial śmierć zawias źdźbło żółw)
-    words.sort { |x, y| FastWordsCr.cmp(x, y) }.should eq(expected)
+    words.sort { |x, y| Example::Crystal.word_cmp(x.downcase, y.downcase) }.should eq(expected)
   end
 end
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+require "spec"
+require "../src/example-crystal"
Lines changed: 21 additions & 12 deletions
@@ -3,39 +3,46 @@ require "yaml"
 
 # TODO: Write documentation for `FastWordsCr`
 
-module FastWordsCr
+module Example::Crystal
   VERSION = "0.2.0"
   CHARSET = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"
 
   def self.main(outpath = "words")
-    with_sorting = true
+    with_sorting = false
     concurrent = true
 
     prepare_folder(outpath, "*.txt")
 
     file_count = 0
+    total_size = 0
 
-    processed_files = Channel(Bool).new
-    Dir.glob("../data/pl/**/*.yml").each do |path|
+    channel = Channel(Tuple(String, Int64)).new
+    srcPath = "../data/??/**/*.yml"
+    # srcPath = "./bibles/??/**/*.yml"
+    Dir.glob(srcPath, follow_symlinks: true).each do |path|
       if concurrent
         spawn do
-          worker(path, outpath, with_sorting)
-          processed_files.send true
+          channel.send worker(path, outpath, with_sorting)
         end
         file_count += 1
       else
         worker(path, outpath, with_sorting)
       end
     end
     if concurrent
-      file_count.times do
-        processed_files.receive
+      file_count.times do |i|
+        path, size = channel.receive
+        total_size += size
+        # puts("[#{i + 1}/#{file_count}] #{path}")
       end
     end
+    total_size = total_size / 1024 / 1024
+    puts("Total size: #{total_size} MB")
   end
 
   def self.worker(path, outpath, with_sorting)
-    text = File.read(path.gsub(".yml", ".txt")).gsub("\n", " ").downcase
+    filepath = path.gsub(".yml", ".txt")
+    text = File.read(filepath).gsub("\n", " ").downcase
 
     words = text.split(/[^\p{L}]+/).to_set
 
@@ -44,9 +51,11 @@ module FastWordsCr
     end
 
     meta = File.open(path) { |file| YAML.parse(file) }
-    filepath = %Q(#{outpath}/extracted-words-for-#{meta["label"]}.txt)
+    filepath = %Q(#{outpath}/#{meta["lang"]}-#{meta["code"]}.txt)
     File.write(filepath, words.join("\n"))
-    puts "Saved #{filepath}"
+    filesize = File.size(filepath)
+    puts([filepath, filesize])
+    {filepath, filesize}
   end
 
   def self.prepare_folder(folder : String, pattern : String)
@@ -70,6 +79,6 @@ module FastWordsCr
 end
 
 elapsed_time = Time.measure do
-  FastWordsCr.main
+  Example::Crystal.main
 end
 puts elapsed_time
