Commit a7a6d95

Adding a fix in soup2text for a common pathological case: <br><br> used instead of <p /> to indicate paragraph breaks. This changes the failed diff for /iesg/telechat/detail/354/ to show only three differences, where two are whitespace differences and one shows a difference between '@ietf.org. The' and '@ietf.org . The' and is an artifact of the text extraction. Will look at fixing that next.

Legacy-Id: 300
1 parent da2de83 commit a7a6d95

1 file changed

Lines changed: 2 additions & 0 deletions

ietf/utils/soup2text.py

@@ -66,6 +66,8 @@ def __str__(self, encoding='latin-1',
         return str
 
 def soup2text(html):
+    # some preprocessing to handle common pathological cases
+    html = re.sub("<br */?>[ \t\r\n]*(<br */?>)+", "<p/>", html)
     soup = TextSoup(html)
     return str(soup)
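
The preprocessing step added in this commit can be tried in isolation. The following is a minimal sketch, not the project's code: it reuses the regular expression from the diff, but the helper name collapse_double_br and the sample inputs are assumptions made for illustration, and TextSoup is left out entirely.

```python
import re

# Same pattern as in the diff: one <br> (optionally self-closing), optional
# whitespace, then one or more directly adjacent <br> tags.
DOUBLE_BR = re.compile(r"<br */?>[ \t\r\n]*(<br */?>)+")


def collapse_double_br(html):
    """Hypothetical helper: replace a run of <br> tags with a <p/> marker."""
    return DOUBLE_BR.sub("<p/>", html)


if __name__ == "__main__":
    sample = "First paragraph.<br><br>Second paragraph.<br />\n<br />Third."
    print(collapse_double_br(sample))
```

A single <br> is left untouched, so ordinary line breaks survive. The pattern as written is case-sensitive and only allows whitespace after the first tag, so inputs such as <BR><BR> or three tags separated by spaces are not fully collapsed.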