README.md: 23 additions & 16 deletions
@@ -4,6 +4,8 @@

Example of text file parsing in several programming languages. The goal is to extract unique words from UTF-8 files and save the results into separate files.

+The difficulty in sorting words comes from the need to follow each language's own sorting rules. This is quite a complex problem that does not arise for English, whose character set does not exceed the basic ASCII standard.
+
### Results

The following results are for 123 unique UTF-8 Bible text files in 23 languages (used at the mybible.pl site). They take up 504MB. (The repo contains only a few sample files in the 'data' folder. To test with more data, you can multiply the files by cloning each *.txt file (and the associated *.yml file) under different names.)
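
The goal stated above (extract unique words from a UTF-8 file and save them to a separate file) can be sketched in Python roughly as follows. This is only a minimal illustration, not the code from this repo: the names `unique_words` and `process_file`, the lowercasing step, and the word pattern are all assumptions made for the example.

```python
import re
from pathlib import Path

# Assumption: a "word" is a run of word characters without digits or underscores.
WORD_RE = re.compile(r"[^\W\d_]+")


def unique_words(text):
    """Return the set of distinct lowercased words found in the text."""
    return {word.lower() for word in WORD_RE.findall(text)}


def process_file(src, dst):
    """Read a UTF-8 file, write its unique words one per line, return the count.

    The words are written in plain code point order here; language-aware
    ordering is a separate problem.
    """
    words = unique_words(Path(src).read_text(encoding="utf-8"))
    Path(dst).write_text("\n".join(sorted(words)) + "\n", encoding="utf-8")
    return len(words)
```

Calling `process_file` once per input file would produce one result file per text, which matches the "separate files" part of the goal.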
@@ -12,30 +14,35 @@ The following results are for 123 unique utf-8 Bible text files in 23 languages
* Machine: MacBook Pro 16" 64GB 2TB M1Max 10 cores.

<pre>
-1. Rust 1.58 = 0.38s
-2. Python 3.10.2 = 2.80s
-3. Julia 1.7.1 = 4.522
-4. Crystal 1.3.2 = 5.72s
-5. Elixir 1.13.2 = 7.82s
-6. Ruby 3.1.0 = 8.31s
-
-Golang 1.17.6 = UNDER REFACTORING, stay tuned
+1. Rust 1.58 = 0.38s (with sorting: )
+2. Golang 1.17.6 = 0.xxx (with sorting: )
+3. Python 3.10.2 = 2.80s
+4. Julia 1.7.1 = 4.522
+5. Crystal 1.3.2 = 5.72s
+6. Elixir 1.13.2 = 7.82s
+7. Ruby 3.1.0 = 8.31s
</pre>

### Conclusion

-The difficulty in sorting words is due to the need to handle sorting rules according to the language. This is quite a complex problem that does not exist for the English language where the character set does not exceed the basic ASCII standard.
+Rust is the fastest language beyond doubt. The new optimized Golang version is very fast: slower than Rust but faster than the other languages. Golang is the only language at the moment with full, mature i18n support on the arm64/M1 platform.
+
+* Rust = the current example uses [lexical-sort](https://lib.rs/crates/lexical-sort), which is not perfect. [There is no standard mature implementation of i18n in Rust](https://www.arewewebyet.org/topics/i18n/) at the moment.
+
+
+* Python = has a great implementation of the [ICU](https://icu.unicode.org/related) library; however, it does not support the arm64/M1 platform, hence I couldn't use it in this comparison.
+
+
+* Ruby = same as Python, no ICU for M1.
+

-* Rust = I couldn't find collations for sort rules in other languages.
+* Elixir = same as Python, no ICU for M1.

-* Julia = same as Rust

-* Elixir = same as Rust
+* Julia = I couldn't find a good i18n library supporting many languages.

-* Crystal = currently has Turkish-only collations. Probably because the language is young and does not have a large enough community or company behind it. The manual sorting was not perfect here, the algorithm needs to be improved.

-* Python = has a great implementation of ICU library but unfortunately, it is not still available for the arm64 / M1 platform hence I couldn't use it in this comparison.
+* Crystal = currently supports only Turkish collations, probably because the language is young and does not have a large enough community or company behind it.

-* Ruby = same as Python

-* Golang = has rules for many languages. You can see the influence of a large company and community which makes Golang a mature solution. Sorting slowed the whole task down significantly, but the result is correct (in this case I only checked the results for the Polish language)
+* Golang = has rules for many languages. You can see the influence of a large company and community, which makes Golang a mature solution. Sorting slowed the whole task down significantly, but the result is correct (in this case I only checked the results for the Polish language).
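
To illustrate why language-specific sorting rules matter, here is a small Python sketch comparing plain code point sorting with a sort that follows the Polish alphabet. It is a simplified illustration only: the hand-written alphabet table and the `polish_key` helper are assumptions made for this example, not what any of the implementations in this repo do, and a real collation library (such as ICU) handles many more cases.

```python
# Assumption: a simplified Polish alphabet table (q, v, x added for loanwords).
POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"
RANK = {ch: i for i, ch in enumerate(POLISH_ALPHABET)}


def polish_key(word):
    """Map each character to its position in the Polish alphabet.

    Characters outside the table are ranked after all known letters.
    """
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word.lower()]


words = ["żaba", "zebra", "ćma", "cud", "łoś", "lis", "ul"]

# Code point order pushes every accented letter past "z".
codepoint_order = sorted(words)
# -> ['cud', 'lis', 'ul', 'zebra', 'ćma', 'łoś', 'żaba']

# Alphabet-aware order keeps "ć" right after "c" and "ł" right after "l".
polish_order = sorted(words, key=polish_key)
# -> ['cud', 'ćma', 'lis', 'łoś', 'ul', 'zebra', 'żaba']

print(codepoint_order)
print(polish_order)
```

A per-language table like this does not scale to 23 languages, which is why mature collation support in the language ecosystem makes such a difference.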