Commit 6a05a1d

feat: support justhtml parsing library to convert email to plain text
justhtml is a fast, pure-Python, HTML5-compliant parser. It is now an option for converting HTML-only emails to plain text. Its output format differs slightly from that of dehtml or beautifulsoup, mostly by removing extra blank lines.

dehtml.py: Uses the stream parser of justhtml, because I was unable to get the full document parser to strip script and style blocks. If I can fix this and use the standard parser, I can in theory generate markdown from the DOM tree that justhtml builds. Updated the test case to include inline elements that should not cause a line break when they are encountered. Running dehtml as `python roundup/dehtml.py foo.html` loads foo.html and parses it using all available parsers.

configuration.py: justhtml is available as an option.

docs: Updated CHANGES.txt and doc/tracker_config.txt. Added beautifulsoup and justhtml to the optional software section of doc/installation.txt.

test_mailgw.py, .github/workflows/ci-test.yml: Updated tests and install justhtml as part of CI.
1 parent b96c661 commit 6a05a1d

7 files changed: +191 -21 lines changed

.github/workflows/ci-test.yml

Lines changed: 1 addition & 1 deletion
@@ -240,7 +240,7 @@ jobs:
           # pygments for markdown2 to highlight code blocks
           pip install markdown2 pygments
           # docutils for ReStructuredText
-          pip install beautifulsoup4 brotli docutils jinja2 \
+          pip install beautifulsoup4 justhtml brotli docutils jinja2 \
             mistune==0.8.4 pyjwt pytz whoosh
           # gpg on PyPi is currently broken with newer OS platform
           # ubuntu 24.04

CHANGES.txt

Lines changed: 4 additions & 0 deletions
@@ -64,6 +64,10 @@ Features:
   config.ini. (John Rouillard)
 - issue2551152 - added basic PGP setup/use info to admin_guide. (John
   Rouillard)
+- add support for the 'justhtml' html 5 parser library. It is written
+  in pure Python. Used to convert html emails into plain text. Faster
+  than beautifulsoup4 and it passes the html 5 standard browser test
+  suite. Beautifulsoup is still supported. (John Rouillard)

 2025-07-13 2.5.0

doc/installation.txt

Lines changed: 10 additions & 0 deletions
@@ -311,6 +311,14 @@ polib
   roundup-gettext, you must install polib_. See the `developer's
   guide`_ for details on translating your tracker.

+beautifulsoup, justhtml
+  When HTML only email is received, Roundup can convert it into
+  plain text using the native dehtml parser. To convert HTML
+  email into plain text, beautifulsoup4_ or justhtml_ can also be
+  used. You can choose the converter in the tracker's
+  config. Note that justhtml is pure Python, fast and conforms to
+  HTML 5 standards.
+
 pywin32 - Windows Service
   You can run Roundup as a Windows service if pywin32_ is installed.
   Otherwise it must be started manually.
@@ -2423,13 +2431,15 @@ the test.
 .. _`adding MySQL users`:
     https://dev.mysql.com/doc/refman/8.0/en/creating-accounts.html
 .. _apache: https://httpd.apache.org/
+.. _beautifulsoup4: https://pypi.org/project/beautifulsoup4/
 .. _brotli: https://pypi.org/project/Brotli/
 .. _`developer's guide`: developers.html
 .. _defusedxml: https://pypi.org/project/defusedxml/
 .. _docutils: https://pypi.org/project/docutils/
 .. _flup: https://pypi.org/project/flup/
 .. _gpg: https://www.gnupg.org/software/gpgme/index.html
 .. _jinja2: https://palletsprojects.com/projects/jinja/
+.. _justhtml: https://pypi.org/project/justhtml/
 .. _markdown: https://python-markdown.github.io/
 .. _markdown2: https://github.com/trentm/python-markdown2
 .. _mistune: https://pypi.org/project/mistune/

doc/tracker_config.txt

Lines changed: 4 additions & 4 deletions
@@ -1112,12 +1112,12 @@

 # If an email has only text/html parts, use this module
 # to convert the html to text. Choose from beautifulsoup 4,
-# dehtml - (internal code), or none to disable conversion.
-# If 'none' is selected, email without a text/plain part
-# will be returned to the user with a message. If
+# justhtml, dehtml - (internal code), or none to disable
+# conversion. If 'none' is selected, email without a text/plain
+# part will be returned to the user with a message. If
 # beautifulsoup is selected but not installed dehtml will
 # be used instead.
-# Allowed values: beautifulsoup, dehtml, none
+# Allowed values: beautifulsoup, justhtml, dehtml, none
 # Default: none
 convert_htmltotext = none

roundup/configuration.py

Lines changed: 10 additions & 10 deletions
@@ -384,17 +384,17 @@ class HtmlToTextOption(Option):

     """What module should be used to convert emails with only text/html
     parts into text for display in roundup. Choose from beautifulsoup
-    4, dehtml - the internal code or none to disable html to text
-    conversion. If beautifulsoup chosen but not available, dehtml will
-    be used.
+    4, justhtml, dehtml - the internal code or none to disable html to
+    text conversion. If beautifulsoup or justhtml is chosen but not
+    available, dehtml will be used.

     """

-    class_description = "Allowed values: beautifulsoup, dehtml, none"
+    class_description = "Allowed values: beautifulsoup, justhtml, dehtml, none"

     def str2value(self, value):
         _val = value.lower()
-        if _val in ("beautifulsoup", "dehtml", "none"):
+        if _val in ("beautifulsoup", "justhtml", "dehtml", "none"):
             return _val
         else:
             raise OptionValueError(self, value, self.class_description)
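The validation performed by `str2value` can be illustrated outside Roundup. This is a minimal standalone sketch, not the project's code: `ALLOWED_CONVERTERS` and `normalize_converter` are hypothetical names, and a plain `ValueError` stands in for Roundup's `OptionValueError`.

```python
# Hypothetical standalone version of the option check shown in the hunk above.
ALLOWED_CONVERTERS = ("beautifulsoup", "justhtml", "dehtml", "none")


def normalize_converter(value):
    """Return the lower-cased converter name, or raise if it is not allowed."""
    lowered = value.lower()
    if lowered in ALLOWED_CONVERTERS:
        return lowered
    # ValueError is a stand-in for Roundup's OptionValueError.
    raise ValueError(
        "Allowed values: %s (got %r)" % (", ".join(ALLOWED_CONVERTERS), value))


print(normalize_converter("JustHTML"))
```

Matching is case-insensitive because the value is lower-cased before the membership test, so `JustHTML` and `justhtml` are treated the same.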
@@ -1811,11 +1811,11 @@ def str2value(self, value):
         (HtmlToTextOption, "convert_htmltotext", "none",
             "If an email has only text/html parts, use this module\n"
             "to convert the html to text. Choose from beautifulsoup 4,\n"
-            "dehtml - (internal code), or none to disable conversion.\n"
-            "If 'none' is selected, email without a text/plain part\n"
-            "will be returned to the user with a message. If\n"
-            "beautifulsoup is selected but not installed dehtml will\n"
-            "be used instead."),
+            "justhtml, dehtml - (internal code), or none to disable\n"
+            "conversion. If 'none' is selected, email without a text/plain\n"
+            "part will be returned to the user with a message. If\n"
+            "beautifulsoup or justhtml is selected but not installed\n"
+            "dehtml will be used instead."),
         (BooleanOption, "keep_real_from", "no",
             "When handling emails ignore the Resent-From:-header\n"
             "and use the original senders From:-header instead.\n"
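The help text above says that dehtml is used when beautifulsoup or justhtml is selected but not installed. One way to picture that rule is the sketch below. It is an assumption-laden illustration rather than Roundup's implementation: `MODULE_FOR`, `_installed` and `effective_converter` are hypothetical names, and the availability check uses the standard library's `importlib.util.find_spec`.

```python
import importlib.util

# Import names for the optional converters. 'bs4' is the name quoted in the
# test suite's skip message; 'justhtml' is the name the commit imports.
MODULE_FOR = {"beautifulsoup": "bs4", "justhtml": "justhtml"}


def _installed(module_name):
    """Report whether a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def effective_converter(configured, is_installed=_installed):
    """Return the converter that would actually run for a configured value."""
    module = MODULE_FOR.get(configured)
    if module is not None and not is_installed(module):
        return "dehtml"  # optional library missing: use the internal parser
    return configured


print(effective_converter("justhtml"))
```

Passing `is_installed` as a parameter keeps the rule testable on machines where neither optional library is present.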

roundup/dehtml.py

Lines changed: 146 additions & 6 deletions
@@ -5,6 +5,10 @@

 from roundup.anypy.strings import u2s, uchr

+# ruff PLC0415 ignore imports not at top of file
+# ruff RET505 ignore else after return
+# ruff: noqa: PLC0415 RET505
+
 _pyver = sys.version_info[0]

@@ -28,6 +32,108 @@ def html2text(html):

                     return u2s(soup.get_text("\n", strip=True))

+                self.html2text = html2text
+            elif converter == "justhtml":
+                from justhtml import stream
+
+                def html2text(html):
+                    # The below does not work.
+                    # Using stream parser since I couldn't seem to strip
+                    # 'script' and 'style' blocks. But stream doesn't
+                    # have error reporting or stripping of text nodes
+                    # and dropping empty nodes. Also I would like to try
+                    # its GFM markdown output too even though it keeps
+                    # tables as html and doesn't completely convert as
+                    # this would work well for those supporting markdown.
+                    #
+                    # ctx used for testing since I have a truncated
+                    # test doc. It eliminates error from missing DOCTYPE
+                    # and head.
+                    #
+                    #from justhtml import JustHTML
+                    # from justhtml.context import FragmentContext
+                    #
+                    #ctx = FragmentContext('html')
+                    #justhtml = JustHTML(html,collect_errors=True,
+                    # fragment_context=ctx)
+                    # I still have the text output inside style/script tags.
+                    # with :not(style, script). I do get text contents
+                    # with query("style, script").
+                    #
+                    #return u2s("\n".join(
+                    # [elem.to_text(separator="\n", strip=True)
+                    # for elem in justhtml.query(":not(style, script)")])
+                    # )
+
+                    # define inline elements so I can accumulate all unbroken
+                    # text in a single line with embedded inline elements.
+                    # 'br' is inline but should be treated as a line break
+                    # and element before/after should not be accumulated
+                    # together.
+                    inline_elements = (
+                        "a",
+                        "address",
+                        "b",
+                        "cite",
+                        "code",
+                        "em",
+                        "i",
+                        "img",
+                        "mark",
+                        "q",
+                        "s",
+                        "small",
+                        "span",
+                        "strong",
+                        "sub",
+                        "sup",
+                        "time")
+
+                    # each line is appended and joined at the end
+                    text = []
+                    # the accumulator for all text in inline elements
+                    text_accumulator = ""
+                    # if set skip all lines till matching end tag found
+                    # used to skip script/style blocks
+                    skip_till_endtag = None
+                    # used to force text_accumulator into text with added
+                    # newline so we have a blank line between paragraphs.
+                    _need_parabreak = False
+
+                    for event, data in stream(html):
+                        if event == "end" and skip_till_endtag == data:
+                            skip_till_endtag = None
+                            continue
+                        if skip_till_endtag:
+                            continue
+                        if (event == "start" and
+                                data[0] in ('script', 'style')):
+                            skip_till_endtag = data[0]
+                            continue
+                        if (event == "start" and
+                                text_accumulator and
+                                data[0] not in inline_elements):
+                            # add accumulator to "text"
+                            text.append(text_accumulator)
+                            text_accumulator = ""
+                            _need_parabreak = False
+                        elif event == "text":
+                            if not data.isspace():
+                                text_accumulator = text_accumulator + data
+                                _need_parabreak = True
+                        elif (_need_parabreak and
+                                event == "start" and
+                                data[0] == "p"):
+                            text.append(text_accumulator + "\n")
+                            text_accumulator = ""
+                            _need_parabreak = False
+
+                    # save anything left in the accumulator at end of document
+                    if text_accumulator:
+                        # add newline to match dehtml and beautifulsoup
+                        text.append(text_accumulator + "\n")
+                    return u2s("\n".join(text))
+
                 self.html2text = html2text
             else:
                 raise ImportError
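The accumulation strategy in this hunk (keep text flowing across inline elements, start a new line at any other start tag, and ignore everything inside script and style) can be tried without installing justhtml. The sketch below is a stand-in, not the commit's code: it swaps the justhtml `stream` events for callbacks from the standard library's `html.parser.HTMLParser`, and `TextAccumulator` and `html_to_text` are hypothetical names. It leaves out the paragraph-break flag and the trailing newline that the commit appends.

```python
from html.parser import HTMLParser

# Same tag names as the commit's inline_elements tuple.
INLINE = {"a", "address", "b", "cite", "code", "em", "i", "img", "mark",
          "q", "s", "small", "span", "strong", "sub", "sup", "time"}
SKIPPED = {"script", "style"}


class TextAccumulator(HTMLParser):
    """Collect text line by line, driven by stdlib parser callbacks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []       # finished output lines
        self.pending = ""     # text gathered across inline elements
        self.skipping = None  # name of the script/style tag being skipped

    def flush(self):
        """Move any accumulated text into the list of finished lines."""
        if self.pending:
            self.lines.append(self.pending)
            self.pending = ""

    def handle_starttag(self, tag, attrs):
        if self.skipping:
            return
        if tag in SKIPPED:
            self.skipping = tag
        elif tag not in INLINE:
            self.flush()  # a non-inline start tag ends the current line

    def handle_endtag(self, tag):
        if tag == self.skipping:
            self.skipping = None

    def handle_data(self, data):
        # Whitespace-only text nodes are dropped, as in the commit.
        if self.skipping is None and not data.isspace():
            self.pending += data


def html_to_text(html):
    parser = TextAccumulator()
    parser.feed(html)
    parser.close()
    parser.flush()  # save anything left at end of document
    return "\n".join(parser.lines)


print(html_to_text(
    "<p>one <b>two</b> three</p><script>var HELP = 1;</script><p>four</p>"))
```

As in the commit, end tags never break a line; only a start tag outside the inline set does, which is why `one <b>two</b> three` stays on a single output line.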
@@ -96,6 +202,16 @@ def html2text(html):


 if __name__ == "__main__":
+    # ruff: noqa: B011 S101
+
+    try:
+        assert False
+    except AssertionError:
+        pass
+    else:
+        print("Error, assertions turned off. Test fails")
+        sys.exit(1)
+
     html = """
 <body>
 <script>
@@ -128,10 +244,10 @@ def html2text(html):
 <li class="toctree-l2"><a class="reference internal" href="admin_guide.html">Administration Guide</a></li>
 </ul>
 <div class="section" id="prerequisites">
-<h2><a class="toc-backref" href="#id5">Prerequisites</a></h2>
+<H2><a class="toc-backref" href="#id5">Prerequisites</a></H2>
 <p>Roundup requires Python 2.5 or newer (but not Python 3) with a functioning
 anydbm module. Download the latest version from <a class="reference external" href="http://www.python.org/">http://www.python.org/</a>.
-It is highly recommended that users install the latest patch version
+It is highly recommended that users install the <span>latest patch version</span>
 of python as these contain many fixes to serious bugs.</p>
 <p>Some variants of Linux will need an additional &#8220;python dev&#8221; package
 installed for Roundup installation to work. Debian and derivatives, are
@@ -147,18 +263,42 @@ def html2text(html):
 </body>
 """

-    html2text = dehtml("dehtml").html2text
-    if html2text:
-        print(html2text(html))
+    if len(sys.argv) > 1:
+        with open(sys.argv[1]) as h:
+            html = h.read()

+    print("==== beautifulsoup")
     try:
         # trap error seen if N_TOKENS not defined when run.
         html2text = dehtml("beautifulsoup").html2text
         if html2text:
-            print(html2text(html))
+            text = html2text(html)
+            assert ('HELP' not in text)
+            assert ('display:block' not in text)
+            print(text)
     except NameError as e:
         print("captured error %s" % e)

+    print("==== justhtml")
+    try:
+        html2text = dehtml("justhtml").html2text
+        if html2text:
+            text = html2text(html)
+            assert ('HELP' not in text)
+            assert ('display:block' not in text)
+            print(text)
+    except NameError as e:
+        print("captured error %s" % e)
+
+    print("==== dehtml")
+    html2text = dehtml("dehtml").html2text
+    if html2text:
+        text = html2text(html)
+        assert ('HELP' not in text)
+        assert ('display:block' not in text)
+        print(text)
+
+    print("==== disabled html -> text conversion")
     html2text = dehtml("none").html2text
     if html2text:
         print("FAIL: Error, dehtml(none) is returning a function")

test/test_mailgw.py

Lines changed: 16 additions & 0 deletions
@@ -35,6 +35,13 @@
     skip_beautifulsoup = mark_class(pytest.mark.skip(
         reason="Skipping beautifulsoup tests: 'bs4' not installed"))

+try:
+    import justhtml
+    skip_justhtml = lambda func, *args, **kwargs: func
+except ImportError:
+    from .pytest_patcher import mark_class
+    skip_justhtml = mark_class(pytest.mark.skip(
+        reason="Skipping justhtml tests: 'justhtml' not installed"))

 from roundup.anypy.email_ import message_from_bytes
 from roundup.anypy.strings import b2s, u2s, s2b
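The hunk above makes `skip_justhtml` a no-op decorator when the import succeeds and a pytest skip marker otherwise. The same idea can be sketched with the standard library alone. This is a swapped-in technique, not what the commit does: it uses `unittest.skipUnless` in place of the project's `mark_class` helper, and `HAVE_JUSTHTML` and `ConverterSmokeTest` are hypothetical names.

```python
import importlib.util
import unittest

# True only when the optional dependency can be found on this interpreter.
HAVE_JUSTHTML = importlib.util.find_spec("justhtml") is not None


class ConverterSmokeTest(unittest.TestCase):
    @unittest.skipUnless(HAVE_JUSTHTML, "Skipping: 'justhtml' not installed")
    def test_justhtml_importable(self):
        # Runs only when the library is present; reported as skipped otherwise.
        import justhtml  # noqa: F401


if __name__ == "__main__":
    unittest.main()
```

Either way, a missing optional library shows up in the test report as a skip with a reason instead of as a failure.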
@@ -315,6 +322,10 @@ class MailgwTestCase(MailgwTestAbstractBase, StringFragmentCmpHelper, unittest.T
     def testTextHtmlMessageBeautifulSoup(self):
         self.testTextHtmlMessage(converter='beautifulsoup')

+    @skip_justhtml
+    def testTextHtmlMessageJusthtml(self):
+        self.testTextHtmlMessage(converter='justhtml')
+
     def testTextHtmlMessage(self, converter='dehtml'):
         html_message='''Content-Type: text/html;
   charset="iso-8859-1"
@@ -375,10 +386,15 @@ def testTextHtmlMessage(self, converter='dehtml'):
         text_fragments['dehtml'] = ['Roundup\n Home\nDownload\nDocs\nRoundup Features\nInstalling Roundup\nUpgrading to newer versions of Roundup\nRoundup FAQ\nUser Guide\nCustomising Roundup\nAdministration Guide\nPrerequisites\n\nRoundup requires Python 2.6 or newer (but not Python 3) with a functioning\nanydbm module. Download the latest version from http://www.python.org/.\nIt is highly recommended that users install the latest patch version\nof python as these contain many fixes to serious bugs.\n\nSome variants of Linux will need an additional ', ('python dev', u2s(u'\u201cpython dev\u201d')), ' package\ninstalled for Roundup installation to work. Debian and derivatives, are\nknown to require this.\n\nIf you', (u2s(u'\u2019'), ''), 're on windows, you will either need to be using the ActiveState python\ndistribution (at http://www.activestate.com/Products/ActivePython/), or you', (u2s(u'\u2019'), ''), 'll\nhave to install the win32all package separately (get it from\nhttp://starship.python.net/crew/mhammond/win32/).']
         text_fragments['beautifulsoup'] = ['Roundup\nHome\nDownload\nDocs\nRoundup Features\nInstalling Roundup\nUpgrading to newer versions of Roundup\nRoundup FAQ\nUser Guide\nCustomising Roundup\nAdministration Guide\nPrerequisites\nRoundup requires Python 2.6 or newer (but not Python 3) with a functioning\nanydbm module. Download the latest version from\nhttp://www.python.org/\n.\nIt is highly recommended that users install the latest patch version\nof python as these contain many fixes to serious bugs.\nSome variants of Linux will need an additional ', ('python dev', u2s(u'\u201cpython dev\u201d')), ' package\ninstalled for Roundup installation to work. Debian and derivatives, are\nknown to require this.\nIf you', (u2s(u'\u2019'), "'"), 're on windows, you will either need to be using the ActiveState python\ndistribution (at\nhttp://www.activestate.com/Products/ActivePython/\n), or you’ll\nhave to install the win32all package separately (get it from\nhttp://starship.python.net/crew/mhammond/win32/\n).']

+        text_fragments['justhtml'] = ['Roundup\nHome\nDownload\nDocs\nRoundup Features\nInstalling Roundup\nUpgrading to newer versions of Roundup\nRoundup FAQ\nUser Guide\nCustomising Roundup\nAdministration Guide\nPrerequisites\nRoundup requires Python 2.6 or newer (but not Python 3) with a functioning\nanydbm module. Download the latest version from http://www.python.org/.\nIt is highly recommended that users install the latest patch version\nof python as these contain many fixes to serious bugs.\nSome variants of Linux will need an additional ', ('python dev', u2s(u'\u201cpython dev\u201d')), ' package\ninstalled for Roundup installation to work. Debian and derivatives, are\nknown to require this.\nIf you', (u2s(u'\u2019'), "'"), 're on windows, you will either need to be using the ActiveState python\ndistribution (at http://www.activestate.com/Products/ActivePython/), or you’ll\nhave to install the win32all package separately (get it from\nhttp://starship.python.net/crew/mhammond/win32/).']
+        self.maxDiff = 100000
         self.db.config.MAILGW_CONVERT_HTMLTOTEXT = converter
         nodeid = self._handle_mail(html_message)
         assert not os.path.exists(SENDMAILDEBUG)
         msgid = self.db.issue.get(nodeid, 'messages')[0]
+        print(self.db.msg.get(msgid, 'content'))
+        print("\n==== fragment\n")
+        print(text_fragments[converter])
         self.compareStringFragments(self.db.msg.get(msgid, 'content'),
                                     text_fragments[converter])
