Skip to content

HTML entities are not kept intact, leading to invalid HTML #116

@aboks

Description

@aboks

The diff algorithm does not always keep HTML entity references intact, which can be problematic when later loading the resulting diff as HTML (e.g. when diffing a list entry). This is best illustrated by the following test:

<options></options>

<oldText>
    <ol><li>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non justo &amp; sapien;</li></ol>
</oldText>

<newText>
    <ol><li>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non sapien et justo;</li></ol>
</newText>

<expected>
    <ol class="diff-list"><li class="normal">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non <ins class="diffins">sapien et </ins>justo<del class="diffdel"> &amp; sapien</del>;</li></ol>
</expected>

This test currently crashes with an error message DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 1 from ListDiffLines.php:409. This is because it tries to load the string

<body>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non <ins class="diffins">sapien et </ins>justo<del class="diffdel"> &amp</del>;<del class="diffdel"> sapien;</del></body>

You can see that the diff algorithm broke up the &amp from its terminating ;, which leads to invalid HTML.

The solution might be to consider HTML entities a special case in the regex in AbstractDiff.php line 457 (e.g. adding an extra &[a-zA-Z0-9]+; case there). That fixes the given test without breaking the others, but I cannot oversee what further possible impact that may have.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions