feat: Append ascii name if any 8bit UTF8 chars #9173

Merged
rjsparks merged 2 commits into ietf-tools:main from richsalz:fix-7167
Jul 23, 2025

@richsalz

@richsalz richsalz commented Jul 19, 2025

Collaborator

Inspired by Peter Yee's earlier work.

Fixes: #7167

@codecov

codecov Bot commented Jul 19, 2025


Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.74%. Comparing base (f380b1a) to head (43e4aba).
Report is 19 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff            @@
##             main    #9173    +/-   ##
========================================
  Coverage   88.74%   88.74%            
========================================
  Files         321      320     -1     
  Lines       41853    41649   -204     
========================================
- Hits        37144    36963   -181     
+ Misses       4709     4686    -23     


@bkmgit

bkmgit commented Jul 19, 2025

Contributor

It may be helpful to later add the option of using Meng Sheng Pinyin fonts or Hanzi Pinyin fonts, since romanization to ASCII can lose information for many tonal languages. For example:

  • "mother" (mā, 妈)
  • "ant" (má, 蚂)
  • "horse" (mǎ, 马)
  • "scold" (mà, 骂)

There are also libraries that will do this, for example xpinyin. Support for other languages could be added as the need arises.
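
To make the information loss concrete, here is a small sketch using only the standard library. It is not part of this PR; `strip_diacritics` is a hypothetical helper that stands in for a crude ASCII folding.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The four syllables from the list above, each with a different tone mark.
syllables = ["m\u0101", "m\u00e1", "m\u01ce", "m\u00e0"]  # mā, má, mǎ, mà
folded = [strip_diacritics(s) for s in syllables]
print(folded)  # ['ma', 'ma', 'ma', 'ma']
```

All four words collapse to the same ASCII string, so the tone, and with it the meaning, cannot be recovered from the folded form.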

@richsalz

Collaborator Author

Hiya @bkmgit, could you create a new issue with your comment? That greatly expands the scope of this rather simple approach.

@richsalz richsalz closed this Jul 20, 2025
@richsalz richsalz deleted the fix-7167 branch July 20, 2025 00:50
@richsalz richsalz restored the fix-7167 branch July 20, 2025 00:54
@richsalz

Collaborator Author

Hit wrong GH button, re-opening

@jennifer-richards jennifer-richards left a comment

Member


I think we'll likely want to allow some additional Latin characters (e.g., "É" and other accented characters generally recognizable to readers of US-ASCII) without adding the ascii name, but this is a step forward.
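
For readers following along, a minimal sketch of the behavior being discussed, assuming the rule is "append the ASCII name whenever the name contains any non-ASCII character". This is not the merged code: `display_name`, the parenthesized output format, and the sample names are all assumptions made for illustration.

```python
def display_name(name: str, ascii_name: str) -> str:
    """Hypothetical helper: append the ASCII form if name has non-ASCII chars."""
    # Nothing to add if the name is already ASCII or no distinct ASCII form exists.
    if name.isascii() or not ascii_name or ascii_name == name:
        return name
    return f"{name} ({ascii_name})"


# Entirely non-Latin name: the ASCII form adds real information.
print(display_name("王小明", "Xiaoming Wang"))
# Accented Latin name: the ASCII form is also appended, although it is
# arguably the "pointless extra text" case raised in this review.
print(display_name("\u00c9mile Durand", "Emile Durand"))
# Plain ASCII name: returned unchanged.
print(display_name("Jane Doe", "Jane Doe"))
```

Under this simple rule a single "É" is enough to trigger the appended name, which is the case the review comment suggests relaxing later.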

@richsalz

Collaborator Author

According to https://www.lookuptables.com/text/extended-ascii-table, it looks like all the accented characters are in the decimal range 128-154 in case we want to make an exception for them.

thanks for the review.
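
One caution on the 128-154 range, as a quick check rather than a definitive statement: those numbers look like byte values in an "extended ASCII" code page, not Unicode code points. The sketch below assumes the linked table follows code page 437, which I have not verified against the page itself.

```python
# Bytes 128-154 decoded as code page 437 (assumed to match the linked table).
cp437_letters = bytes(range(128, 155)).decode("cp437")
print(cp437_letters)

# The same letters as Unicode code points land in a different numeric range.
code_points = [ord(ch) for ch in cp437_letters]
print(min(code_points), max(code_points))

# "É" specifically: byte 0x90 (144) in cp437, code point U+00C9 (201) in Unicode.
e_acute = "\u00c9"
print(e_acute.encode("cp437"), ord(e_acute))
```

So an exception written against Python `str` values would need to test Unicode ranges (these particular letters all fall in U+00C0 to U+00FF) rather than 128-154.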

@bkmgit

bkmgit commented Jul 20, 2025

Contributor

If the language is known, pyicu has options for transliteration; there is an example in the cheatsheet.

However, it may be easier to do an NFD (canonical) decomposition of each character and check whether it contains an ASCII letter. If the decompositions of all characters contain ASCII letters, keep the name; otherwise use the ASCII name. This could be done using unicodedata.

Ideally each person would be able to update this field since the readme of Unidecode indicates there will be many corner cases that will be difficult to cover with existing software.
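
A rough sketch of the decomposition check described above, using only `unicodedata`. The function name `is_latin_readable` and the sample names are hypothetical, and this is one possible reading of the idea, not tested against real datatracker data.

```python
import unicodedata


def is_latin_readable(name: str) -> bool:
    """Hypothetical check: every non-ASCII char decomposes to an ASCII letter base."""
    for ch in name:
        if ch.isascii():
            continue  # spaces, hyphens, plain letters, etc.
        decomposed = unicodedata.normalize("NFD", ch)
        # Keep the character only if its decomposition includes an ASCII letter.
        if not any(c.isascii() and c.isalpha() for c in decomposed):
            return False
    return True


print(is_latin_readable("Jos\u00e9 N\u00fa\u00f1ez"))  # True: é, ú, ñ decompose to e, u, n
print(is_latin_readable("王小明"))                      # False: no ASCII base letters
print(is_latin_readable("S\u00f8ren"))                  # False: ø has no decomposition
```

The last case is one of the corner cases mentioned above: letters such as "ø" or "Ł" are Latin but have no canonical decomposition, so this check would still fall back to the ASCII name for them.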

@richsalz

richsalz commented Jul 20, 2025 via email

Collaborator Author

@jennifer-richards

Member

Thanks for the insights, Benson. I had looked briefly and more naively at the unicodedata module, and I agree it will be useful. I think Rich's test as implemented will tell us a lot about where we get bitten by pointless extra text in practice.

(And, perfect being the enemy of the good and all, merging this will fix the entirely non-Latin text cases that inspired the issue this addresses; follow-ups that deal with additional subtleties will be welcome.)

@rjsparks rjsparks merged commit 5a862b2 into ietf-tools:main Jul 23, 2025
17 checks passed
@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Jul 27, 2025
@richsalz richsalz deleted the fix-7167 branch September 5, 2025 17:45
Development

Successfully merging this pull request may close these issues.

UTF8 only in Authors list

4 participants