Data Privacy: Anonymization of HTML fields

Issue Overview

The Data Privacy API fails to detect and anonymize sensitive information stored in fields that support HTML formatting. Certain special characters are automatically encoded when stored in HTML fields, which causes regex-based detection to fail.

Root Cause

When data is saved in HTML fields, special characters are encoded into their HTML entity equivalents. This encoding changes the stored string, so regex patterns that expect the raw characters no longer match.

Common special characters and their HTML encodings (not an exhaustive list)

Character    HTML Encoding
<            &lt;
>            &gt;
+            &#43;
=            &#61;
&            &amp;
'            &#39;
"            &quot;
@            &#64;

These encodings ensure that the data is rendered safely in HTML without being interpreted as markup or script.

Example scenario

Consider the detection of an email address:

Regular expression: \b[\w!#$%&'*+/=?`{|}~^-]{1,64}(?:\.[\w!#$%&'*+/=?`{|}~^-]{1,64}){0,64}@(?:[a-zA-Z0-9-]+\.){1,255}[a-zA-Z]{2,6}\b
Plain Text Input: john.doe@company.com (Detected and anonymized correctly.)
HTML-Encoded Input: john.doe&#64;company.com (Not detected because the @ symbol is encoded as &#64;)

The regex pattern matches the literal @ character but not its encoded form. As a result, data containing the encoded version is not anonymized.

Recommended solution

If a regex is intended to match special characters, it must be enhanced to detect both the raw and the encoded forms so that sensitive data is detected and anonymized accurately.

Regex Enhancement Strategy

Update regex patterns to include alternatives for encoded special characters.
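One way to apply this strategy consistently is to generate the alternation instead of typing it by hand. The following is a minimal sketch in Python, not part of Data Privacy. The helper name raw_or_encoded is hypothetical, and the sketch assumes the platform stores characters either as one of the named entities in the lookup table below or as a decimal numeric entity such as &#64;. Python's re module may also differ in details from the regex engine that evaluates data patterns, so treat the output as a starting point to verify on the instance.

```python
import re

# Named entities assumed for characters that are commonly stored by name.
# Every other character falls back to a decimal numeric entity (assumption).
NAMED_ENTITIES = {"<": "&lt;", ">": "&gt;", "&": "&amp;", '"': "&quot;"}


def raw_or_encoded(char: str) -> str:
    """Return a non-capturing group matching `char` or its HTML-encoded form.

    Hypothetical helper for illustration only.
    """
    entity = NAMED_ENTITIES.get(char, f"&#{ord(char)};")
    # re.escape keeps regex metacharacters such as + or & literal.
    return f"(?:{re.escape(char)}|{re.escape(entity)})"


# Rebuild a simple email pattern with the generated group in place of a bare @.
GENAI_EMAIL = re.compile(r"[\w.-]+" + raw_or_encoded("@") + r"[\w.-]+\.[\w.-]+")

if __name__ == "__main__":
    for sample in ("john.doe@company.com", "john.doe&#64;company.com"):
        print(sample, bool(GENAI_EMAIL.fullmatch(sample)))
```

Generating the group keeps the raw and encoded alternatives together in one place, which makes it harder to update one occurrence of a character in a pattern and miss another.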
For example, consider the Email pattern.

Email Pattern

Out-of-the-box (OOB) regex: \b[\w!#$%&'*+/=?`{|}~^-]{1,64}(?:\.[\w!#$%&'*+/=?`{|}~^-]{1,64}){0,64}@(?:[a-zA-Z0-9-]+\.){1,255}[a-zA-Z]{2,6}\b
Modified regex (supports encoded '@'): \b[\w!#$%&'*+/=?`{|}~^-]{1,64}(?:\.[\w!#$%&'*+/=?`{|}~^-]{1,64}){0,64}(?:@|&#64;)(?:[a-zA-Z0-9-]+\.){1,255}[a-zA-Z]{2,6}\b

The part of the pattern that matched @ is replaced with (?:@|&#64;), which matches either the raw character @ or its HTML encoding &#64;. The modified pattern therefore matches email addresses in both plain-text and encoded form.

Updated regex for OOB data patterns

We reviewed the data pattern records shipped with Data Privacy and found that the patterns below require a regex modification. For each one, an alternative regex is provided that handles both the raw and the encoded form.

Data pattern name: Email
OOB regex: \b[\w!#$%&'*+/=?`{|}~^-]{1,64}(?:\.[\w!#$%&'*+/=?`{|}~^-]{1,64}){0,64}@(?:[a-zA-Z0-9-]+\.){1,255}[a-zA-Z]{2,6}\b
Modified regex: \b[\w!#$%&'*+/=?`{|}~^-]{1,64}(?:\.[\w!#$%&'*+/=?`{|}~^-]{1,64}){0,64}(?:@|&#64;)(?:[a-zA-Z0-9-]+\.){1,255}[a-zA-Z]{2,6}\b

Data pattern name: Email (Generative AI)
OOB regex: [\w.-]+@[\w.-]+\.[\w.-]+
Modified regex: [\w.-]+(?:@|&#64;)[\w.-]+\.[\w.-]+

Data pattern name: Phone Number (GB)
OOB regex: ((0|44|\+44|\+44\s*\(0\)|\+44\s*0))([\s,-]?[0-9]){9,10}
Modified regex: ((0|44|&#43;44|\+44|\+44\s*\(0\)|\+44\s*0))([\s,-]?[0-9]){9,10}

Identifying Impacted Fields

To determine if a field is affected:

1. Use the "Show XML" feature to inspect the record.
2. Compare the displayed value with the stored value.
3. If special characters appear encoded (e.g., &#64; instead of @), the field is subject to this issue.

Key Takeaways

- HTML encoding can interfere with regex-based detection of sensitive data.
- Regex patterns that expect to match special characters must be updated to match both raw and encoded versions.
- Review and update all data patterns accordingly.
- Use XML inspection to identify fields that store encoded data.
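Before updating a data pattern record, it can help to check each modified regex against a plain and an encoded sample. The sketch below does this in Python for the three patterns listed above. It is an illustration under stated assumptions: the sample values are made up, the encoded forms are assumed to be &#64; for @ and &#43; for +, the function name check is hypothetical, and Python's re module stands in for whichever regex engine the platform uses, so results should be confirmed on the instance.

```python
import re

# Shared pieces of the Email pattern, copied from the article.
LOCAL = r"\b[\w!#$%&'*+/=?`{|}~^-]{1,64}(?:\.[\w!#$%&'*+/=?`{|}~^-]{1,64}){0,64}"
DOMAIN = r"(?:[a-zA-Z0-9-]+\.){1,255}[a-zA-Z]{2,6}\b"
DIGITS = r"([\s,-]?[0-9]){9,10}"

# name -> (OOB regex, modified regex)
PATTERNS = {
    "Email": (LOCAL + "@" + DOMAIN, LOCAL + "(?:@|&#64;)" + DOMAIN),
    "Email (Generative AI)": (
        r"[\w.-]+@[\w.-]+\.[\w.-]+",
        r"[\w.-]+(?:@|&#64;)[\w.-]+\.[\w.-]+",
    ),
    "Phone Number (GB)": (
        r"((0|44|\+44|\+44\s*\(0\)|\+44\s*0))" + DIGITS,
        r"((0|44|&#43;44|\+44|\+44\s*\(0\)|\+44\s*0))" + DIGITS,
    ),
}

# name -> (plain sample, HTML-encoded sample); illustrative values only.
SAMPLES = {
    "Email": ("john.doe@company.com", "john.doe&#64;company.com"),
    "Email (Generative AI)": ("john.doe@company.com", "john.doe&#64;company.com"),
    "Phone Number (GB)": ("+44 20 7946 0958", "&#43;44 20 7946 0958"),
}


def check(name: str) -> dict:
    """Report whether each regex matches the whole plain and encoded sample."""
    oob, modified = (re.compile(p) for p in PATTERNS[name])
    plain, encoded = SAMPLES[name]
    return {
        "oob_plain": oob.fullmatch(plain) is not None,
        "oob_encoded": oob.fullmatch(encoded) is not None,
        "modified_plain": modified.fullmatch(plain) is not None,
        "modified_encoded": modified.fullmatch(encoded) is not None,
    }


if __name__ == "__main__":
    for pattern_name in PATTERNS:
        print(pattern_name, check(pattern_name))
```

The check uses a whole-value match on purpose. A regex that matches only part of an encoded value (for example, the digits of a phone number but not the encoded plus sign in front of them) would leave a fragment of the original data un-anonymized.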