Commit 406978a

update README
1 parent efa8a00 commit 406978a

1 file changed: +23 -16 lines changed

README.md

Lines changed: 23 additions & 16 deletions
@@ -4,6 +4,8 @@

 An example of text file parsing in several programming languages. The goal is to extract unique words from utf-8 files and save the results into separate files.
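The extraction step described above could be sketched roughly as follows. This is only an illustration in Python, not the code from this repository: the names `WORD_RE` and `unique_words` are hypothetical, and the choices to lowercase words and to ignore digits are assumptions.

```python
import re
import unicodedata

# In Python 3, \w is Unicode-aware for str patterns, so this class matches
# runs of letters from any script while excluding digits and underscores.
WORD_RE = re.compile(r"[^\W\d_]+")


def unique_words(text: str) -> set[str]:
    """Return the distinct lowercase words found in `text`."""
    # NFC composes base letters with combining marks, so "z" + dot-above
    # and the precomposed letter count as the same word character.
    normalized = unicodedata.normalize("NFC", text)
    return {match.group().lower() for match in WORD_RE.finditer(normalized)}


if __name__ == "__main__":
    sample = "Żółw i żółw. Łódź, łódź; 3 razy!"
    print(unique_words(sample))
```

For a file, the same function could be fed the result of `pathlib.Path(name).read_text(encoding="utf-8")`.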

+The difficulty in sorting words is due to the need to handle sorting rules according to each language's grammar. This is quite a complex problem that does not exist for the English language, where the character set does not exceed the basic ASCII standard.
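To make the sorting problem concrete, here is a small Python sketch that compares plain code-point sorting with a hand-written ordering for Polish. It is a deliberate simplification and not what any of the benchmarked programs do: `POLISH_ALPHABET` and `polish_key` are hypothetical names, and real collation (for example via ICU) handles case, accents, and multi-level comparison far more thoroughly.

```python
# Letters of the Polish alphabet in dictionary order (assumption: lowercase only).
POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"
RANK = {ch: i for i, ch in enumerate(POLISH_ALPHABET)}


def polish_key(word: str) -> list[int]:
    """Map each letter to its position in the Polish alphabet.

    Characters outside the alphabet sort after all known letters,
    ordered among themselves by code point.
    """
    return [RANK.get(ch, len(POLISH_ALPHABET) + ord(ch)) for ch in word.lower()]


words = ["zebra", "łódź", "lody", "ćma", "cud", "żaba", "źrebak"]

# Default sort compares code points, so every non-ASCII letter lands after "z".
print(sorted(words))
# ['cud', 'lody', 'zebra', 'ćma', 'łódź', 'źrebak', 'żaba']

# With the alphabet-aware key, "ć" follows "c" and "ł" follows "l".
print(sorted(words, key=polish_key))
# ['cud', 'ćma', 'lody', 'łódź', 'zebra', 'źrebak', 'żaba']
```

A per-language table like this does not scale to 23 languages, which is why a library with ready-made collation rules matters for this comparison.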
+
 ### Results

 The following results are for 123 unique utf-8 Bible text files in 23 languages (used at the mybible.pl site). They take up 504MB. (The repo contains only a few sample files in the 'data' folder. To test with more data, you can multiply the files by copying the *.txt files (and the associated *.yml files) under different names.)
@@ -12,30 +14,35 @@ The following results are for 123 unique utf-8 Bible text files in 23 languages
 * Machine: MacBook Pro 16" 64GB 2TB M1 Max 10 cores.

 <pre>
-1. Rust 1.58 = 0.38s
-2. Python 3.10.2 = 2.80s
-3. Julia 1.7.1 = 4.522
-4. Crystal 1.3.2 = 5.72s
-5. Elixir 1.13.2 = 7.82s
-6. Ruby 3.1.0 = 8.31s
-
-Golang 1.17.6 = UNDER REFACTORING, stay tuned
+1. Rust 1.58 = 0.38s (with sorting: )
+2. Golang 1.17.6 = 0.xxx (with sorting: )
+3. Python 3.10.2 = 2.80s
+4. Julia 1.7.1 = 4.522
+5. Crystal 1.3.2 = 5.72s
+6. Elixir 1.13.2 = 7.82s
+7. Ruby 3.1.0 = 8.31s
 </pre>

 ### Conclusion

-The difficulty in sorting words is due to the need to handle sorting rules according to the language. This is quite a complex problem that does not exist for the English language where the character set does not exceed the basic ASCII standard.
+Rust is the fastest language beyond doubt. The new, optimized Golang version is very fast: slower than Rust but faster than the other languages. At the moment, Golang is the only language here with full, mature i18n support for the arm64/M1 platform.
+
+* Rust = the current example uses [lexical-sort](https://lib.rs/crates/lexical-sort), which is not perfect. [There is no standard mature implementation of i18n in Rust](https://www.arewewebyet.org/topics/i18n/) at the moment.
+
+
+* Python = has a great implementation of the [ICU](https://icu.unicode.org/related) library; however, it does not support the arm64/M1 platform, hence I couldn't use it in this comparison.
+
+
+* Ruby = same as Python, no ICU for M1.
+

-* Rust = I couldn't find collations for sort rules in other languages.
+* Elixir = same as Python, no ICU for M1.

-* Julia = same as Rust

-* Elixir = same as Rust
+* Julia = I couldn't find a good i18n library supporting many languages.

-* Crystal = currently has Turkish-only collations. Probably because the language is young and does not have a large enough community or company behind it. The manual sorting was not perfect here, the algorithm needs to be improved.

-* Python = has a great implementation of the ICU library but unfortunately, it is still not available for the arm64 / M1 platform, hence I couldn't use it in this comparison.
+* Crystal = currently supports only Turkish collations, probably because the language is young and does not have a large enough community or company behind it.

-* Ruby = same as Python

-* Golang = has rules for many languages. You can see the influence of a large company and community which makes Golang a mature solution. Sorting slowed the whole task down significantly, but the result is correct (in this case I only checked the results for the Polish language)
+* Golang = has rules for many languages. You can see the influence of a large company and community which makes Golang a mature solution. Sorting slowed the whole task down significantly, but the result is correct (in this case I only checked the results for the Polish language)
