
Commit 5c80429

add improved Ruby version
1 parent 3de44c8 commit 5c80429

3 files changed, +81 -25 lines changed


.gitignore

Lines changed: 2 additions & 1 deletion
@@ -13,4 +13,5 @@
 /**/main
 /**/target
 /**/.idea
-/**/.venv
+/**/.venv
+/data.full/

README.md

Lines changed: 9 additions & 5 deletions
@@ -20,7 +20,7 @@ The following results are for 123 unique utf-8 Bible text files in 23 languages
 4. Julia 1.7.1 = 4.522
 5. Crystal 1.3.2 = 5.72s
 6. Elixir 1.13.2 = 7.82s
-7. Ruby 3.1.0 = 8.31s (with Parallel)
+7. Ruby 3.1.0 = 10.44s (with Parallel), with sorting: 10.51s
 </pre>
 
 ### Conclusion
@@ -31,20 +31,24 @@ The new optimized Golang code version is very fast, slower than Rust but faster
 
 * Python = has a great implementation of the [ICU](https://icu.unicode.org/related) library; however, it does not support the arm64/M1 platform, hence I couldn't use it in this comparison.
 
-* Ruby = same as Python, no ICU for M1.
+* Ruby = can sort Unicode text, but without collations because it can't use ICU on arm64/M1.
 
 * Elixir = same as Python, no ICU for M1.
 
 * Julia = I couldn't find a good i18n library supporting many languages.
 
 * Crystal = currently supports only Turkish collations. Probably because the language is young and does not have a large enough community or company behind it.
 
-* Golang = has rules for many languages. You can see the influence of a large company and community which makes Golang a mature solution. Sorting slowed the whole task down significantly, but the result is correct.
-
-Golang is the only language from that list that has full mature support for natural sorting in lots of languages. This code implements sorting in 23 languages. It is also very fast, close to Rust although it is not obvious how to optimize its code to work so fast. Although Golang has simplified syntax it is not simple to know how to use its full power.
+* Golang = has rules for many languages. You can see the influence of a large company and community which makes Golang a mature solution. Sorting slowed the whole task down significantly, but the result is correct (in this case I only checked the results for the Polish language).
 
 ### Kudos
 
 [@romanatnews](https://github.com/romanatnews) (Golang example refactoring)
 
 [@pan93412](https://github.com/pan93412) (Rust example refactoring using Tokio runtime)
+
+## CHANGES
+
+2022-02-08
+
+Added an improved version of the Ruby code that correctly reads only the plain text to tokenize (it ignores the sigla in each verse) and uses the correct regular expression for extracting words. The code is a little slower, but it works as expected.
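The tokenizer change described in the CHANGES entry can be illustrated with a small sketch. The sample line below is hypothetical (the real data format is not shown here), and `words_old` / `words_new` are illustrative names, not functions from the repository. The `|| []` guard is an addition for this sketch and is not part of the commit.

```ruby
# Hypothetical sample line; the first two tokens stand in for the sigla.
SAMPLE = 'Gen 1:1 Na początku Bóg stworzył niebo i ziemię'

# Pre-commit approach: split the raw text on non-"word" characters.
# \p{word} also matches digits and underscores, so the sigla survive as tokens.
def words_old(text)
  text.downcase.strip.split(/[^\p{word}]+/).uniq
end

# Post-commit approach for a single line: drop the first two tokens,
# then split on anything that is not a letter (\p{L}).
# Assumption: `|| []` guards lines with fewer than two tokens, where the
# range lookup returns nil.
def words_new(line)
  tokens = line.strip.downcase.split(' ')[2...-1] || []
  tokens.join(' ').split(/[^\p{L}]+/).uniq
end

p words_old(SAMPLE)
p words_new(SAMPLE)
p words_new('')
```

Note that the exclusive range `2...-1` also drops the last token of every line; whether that is intended depends on the data format, which is not visible in this commit.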

example-ruby/words.rb

Lines changed: 70 additions & 19 deletions
@@ -1,31 +1,82 @@
 require 'yaml'
-require 'yaml'
 require 'parallel'
 require 'etc'
 require 'fileutils'
+require 'optparse'
 
-outdir = 'words'
+class WordExtractor
+  def initialize(cores: Etc.nprocessors, sorting: false, outdir: 'words', source: '../data/??/**/*.yml')
+    @cores = cores
+    @sorting = sorting
+    @outdir = outdir
+    @source = source
+  end
 
-FileUtils.rm_rf(outdir)
-Dir.mkdir(outdir)
+  def clear_output
+    FileUtils.rm_rf(@outdir)
+    Dir.mkdir(@outdir)
+  end
 
-t = Time.now
+  def get_words(filepath)
+    IO.readlines(filepath).map do |line|
+      line.strip.downcase.split(' ')[2...-1].join(' ').split(/[^\p{L}]+/).uniq
+    end.flatten.uniq
+  end
 
-sorted = false
+  def save_words(words:, meta:, yaml_path:, count:, i:)
+    outpath = "#{@outdir}/#{meta['lang']}-#{meta['code']}.txt"
+    puts(format('[%3d/%d] %s/%s', i, count, yaml_path, outpath))
+    File.write(outpath, words.join("\n"))
+  end
 
-paths = Dir['../data/??/**/*.yml']
-count = paths.count
+  def run
+    print "Running using #{@cores} processes"
+    print ' with sorting' if @sorting
+    puts '...'
+    clear_output
+    start = Time.now
+    paths = Dir[@source]
+    count = paths.count
+    sizes = Parallel.map_with_index(paths, in_processes: @cores) do |yaml_path, i|
+      meta = YAML.load_file(yaml_path)
+      filepath = yaml_path.gsub('.yml', '.txt')
+      words = get_words(filepath)
+      words.sort! if @sorting
+      save_words(words:, meta:, yaml_path:, count:, i:)
+      File.size(filepath)
+    end
+    puts "Total size: #{(sizes.sum / 1024.0 / 1024).round} MB"
+    puts "Total time: #{Time.now - start} s"
+  end
+end
 
-sizes = Parallel.map_with_index(paths, in_processes: Etc.nprocessors) do |yaml_path, i|
-  meta = YAML.load_file(yaml_path)
-  filepath = yaml_path.gsub('.yml', '.txt')
-  words = IO.read(filepath).downcase.strip.split(/[^\p{word}]+/).uniq
-  words = words.sort if sorted
-  outpath = "#{outdir}/#{meta['lang']}-#{meta['code']}.txt"
-  puts(format('[%3d/%d] %s/%s', i, count, yaml_path, outpath))
-  File.write(outpath, words.join("\n"))
-  File.size(filepath)
+if __FILE__ == $PROGRAM_NAME
+  cores = Etc.nprocessors
+  options = { s: false, n: cores }
+  OptionParser.new do |opts|
+    opts.banner = "Usage: ruby #{__FILE__} [options]"
+    opts.on('-n [NUM]', OptionParser::DecimalInteger, "Number of cores to run (default #{cores})") do |val|
+      options[:n] = if val.negative? || val > cores
+                      cores
+                    else
+                      val
+                    end
+    end
+    opts.on('-s', 'Sort results') { |v| options[:s] = v }
+  end.parse!
+  WordExtractor.new(cores: options[:n], sorting: options[:s]).run
 end
 
-puts "Total size: #{(sizes.sum / 1024.0 / 1024).round} MB"
-puts "Total time: #{Time.now - t} s"
+# class A
+#   def initialize(x:, y:, z:)
+#     @x = x
+#     @y = y
+#     @z = z
+#   end
+
+#   def run
+#     puts "x=#{@x}, y=#{@y}, z=#{@z}"
+#   end
+# end
+
+# A.new(y: 'Y', x: 'X', z: 'Z').run
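With `-s`, the script calls `words.sort!`, which compares strings byte by byte instead of applying locale rules. That is what the README means by sorting "without collations". A minimal sketch, using a hypothetical word list, shows the effect on Polish letters:

```ruby
# Hypothetical word list mixing ASCII and Polish letters.
words = %w[zebra łódź apple ćma]

# Array#sort uses String#<=>, which compares the UTF-8 bytes.
# Multi-byte letters such as "ć" and "ł" therefore sort after "z".
sorted = words.sort

p sorted
# => ["apple", "zebra", "ćma", "łódź"]
```

A Polish collation would instead place "ćma" and "łódź" before "zebra", which is the kind of ordering the ICU-based versions are expected to produce.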
