p5y

PRIVACY
LANGUAGE.

p5y is a standardized data privacy framework to safely handle any unstructured text that contains personally identifiable and sensitive information.

It does so by translating the data into a privacy language that can then be easily customized for different use cases.

01

Translation

Converts PII into a safe privacy language layer.

02

Compliance

Facilitates GDPR & HIPAA adherence instantly.

1. Why do we need a privacy language framework?

Personal and sensitive information is deeply embedded in language, and handling it is very costly due to different risks and regulations.

To address this issue, various solutions have been developed, but many are proprietary or inaccessible. These anonymization tools are often limited in both scope and accuracy.

Without a common standard or shared language, it is very challenging to compare these solutions and hold data administrators accountable for providing high standards of privacy protection.

To ensure data remains usable, shareable, and compliant with strict privacy regulations, we need a standardized, transparent, and accurate approach to data protection.

2. What is p5y?

p5y is a standardized framework for privacy methods to manage unstructured text containing personally identifiable and sensitive information. These methods include managing, substituting, redacting, and anonymizing personal data.

What makes p5y unique is that it addresses privacy concerns at the language level, reducing risks before they enter more complex and costly systems.

It draws inspiration from i18n (internationalization) and l10n (localization) frameworks. Just as they translate content into different locales, p5y "translates" sensitive data into privacy-safer formats, facilitating compliance with regulations like GDPR, HIPAA, and others.

This new framework streamlines the redaction and anonymization of personal data while preserving the usability and integrity of the original information. By adopting p5y, organizations can automate and standardize the handling of sensitive information, applying "privacy translation" similar to content translation for global markets, maximizing compliance, optimizing business processes, and increasing user trust.

Diagram depicting p5y privacy masking as a translation task

Fig 1: Privacy Masking as a p5y Translation Task

3. A 3-step approach to data privacy

This implementation is similar to the globalization method, which includes the following steps: internationalization (preparing a product to support global markets by separating country- or language-specific content for adaptation), localization (adapting the product for a specific market), and quality assurance.

Diagram showing the 3-step p5y implementation flowchart

Fig 2: Flowchart of the 3-step implementation, including organizational motivation.

3.1 Awareness

The first step in p5y is to gain structured insights from unstructured text. This step scans the data for private and sensitive information and adds markup to these entities. It enables deriving quantitative and qualitative insights about the private data present and assessing risks and business needs.

At this step, we can produce a p5y Awareness Report about the data, including: types of personal data and distribution, personal data density, associated risk assessment, and regulatory readiness.

Diagram showing p5y Awareness Report

Fig 3: p5y Awareness Report (type of personal data and distribution).
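To make this concrete, here is a minimal sketch (in Python) of how an Awareness pass could mark up entities and aggregate a report. The regex patterns and report fields are illustrative assumptions, not part of the p5y specification; a production system would use trained PII detectors rather than regular expressions.

# Sketch of the Awareness step: detect PII entities in unstructured text and
# aggregate them into a simple awareness report. Patterns and report fields
# are illustrative assumptions only.
import re
from collections import Counter

PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "PHONE": r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b",
}

def detect_entities(text):
    """Return (start, end, label) spans for every pattern match in the text."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            spans.append((match.start(), match.end(), label))
    return spans

def awareness_report(texts):
    """Aggregate entity counts and a rough PII density across a corpus."""
    counts = Counter()
    total_chars = pii_chars = 0
    for text in texts:
        total_chars += len(text)
        for start, end, label in detect_entities(text):
            counts[label] += 1
            pii_chars += end - start
    return {"entity_distribution": dict(counts),
            "pii_density": pii_chars / max(total_chars, 1)}

print(awareness_report(["Email john.doe@email.com or call 555-123-4567."]))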

3.2 Protection

The second step in p5y is to control the personal data identified in the texts. This includes deciding what to take out (e.g., directly identifying entities, bias-related attributes) and which anonymization strategy to use (e.g., masking, pseudonymization, k-anonymization).

The strategy depends on factors like how the data will be used, applicable regulations and risks, preferences, permissions, and context. By separating the personal data identification (Awareness) from the data anonymization (Protection), the framework prepares data for different use cases without needing separate anonymization pipelines.

Diagram showcasing different anonymization use-cases

Fig 4: Showcase of different use-cases of anonymization tools.
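A minimal sketch of this separation, assuming entity spans were already produced in the Awareness step: the same detected entities can be masked with plain placeholders or pseudonymized with indexed tags, without re-running detection. Function names and the tag format are illustrative assumptions.

# Sketch of the Protection step: apply different strategies to pre-detected spans.
def mask(text, spans):
    """Replace each (start, end, label) span with a placeholder tag such as [EMAIL]."""
    out, last = [], 0
    for start, end, label in sorted(spans):
        out.append(text[last:start])
        out.append(f"[{label}]")
        last = end
    out.append(text[last:])
    return "".join(out)

def pseudonymize(text, spans):
    """Replace each span with an indexed tag, keeping distinct identities apart."""
    out, last, index = [], 0, {}
    for start, end, label in sorted(spans):
        key = (label, text[start:end])
        index.setdefault(key, len(index) + 1)
        out.append(text[last:start])
        out.append(f"[{label}_{index[key]}]")
        last = end
    out.append(text[last:])
    return "".join(out)

text = "Write to ada@example.com or ada@example.com again."
spans = [(9, 24, "EMAIL"), (28, 43, "EMAIL")]
print(mask(text, spans))          # Write to [EMAIL] or [EMAIL] again.
print(pseudonymize(text, spans))  # Write to [EMAIL_1] or [EMAIL_1] again.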

3.3 Quality Assurance

The final step measures the remaining privacy risk after anonymization, evaluating how well the target entities have been anonymized and whether de-anonymization risks exist. This step involves expert human annotation and models to assess de-anonymization risks.

Diagram showing de-anonymization risks during Quality Assurance

Fig 5: Showcase of the de-anonymization risks assessed during the quality assurance step.

4. Permissible vs Non-Permissible Use Cases

The p5y framework is intended to facilitate data handling and processing while maintaining high standards of privacy protection, in line with regulatory standards. Any use that fails to protect individuals' privacy or that contravenes privacy and AI regulations is not permitted. See the lists below for an overview of permissible and non-permissible uses.

Permissible Use Cases:

  • Data Anonymization for Research and Analysis: Removing or masking PII from datasets to enable their use in scientific research or machine learning model training while preserving individual privacy.
  • Regulatory Compliance: Facilitating adherence to privacy regulations such as GDPR, HIPAA, and CCPA by systematically identifying and protecting sensitive information in various data formats.
  • Secure Data Sharing: Enabling the exchange of information between organizations or departments by redacting sensitive details while maintaining the utility of the underlying data.
  • Privacy-Preserving Publication: Preparing documents or datasets for public release by ensuring all personal identifiers are appropriately masked or removed.
  • Data Minimization: Supporting the principle of data minimization by helping organizations acquire and retain only the necessary non-sensitive information for their operations.

Non-Permissible Use Cases:

  • Targeted Analysis of Personal Data: The framework should not be used to specifically analyze or profile individuals based on their personal information or to inform AI surveillance systems, as this would contradict its primary purpose of privacy protection.
  • Circumvention of Consent Requirements: p5y must not be utilized to process personal data without proper consent, under the guise of anonymization, when such consent is legally required.
  • De-anonymization Attempts: Any efforts to reverse the anonymization process or to cross-reference anonymized data with other sources to re-identify individuals are strictly prohibited.
  • Discriminatory Practices: The framework must not be used to facilitate any form of discrimination based on protected characteristics, even if such characteristics are inferred from anonymized data.
  • Emotion Recognition and Social Scoring: In accordance with the EU AI Act, the p5y framework must not be used to facilitate or support emotion recognition systems in workplace and educational contexts, or to enable social scoring practices. These applications are explicitly banned due to their potential to infringe on individual privacy and fundamental rights.

6. Alignment with the EU AI Act

The permissible use cases of the p5y framework are designed to be compatible with the EU AI Act's emphasis on protecting fundamental rights and ensuring the ethical use of AI systems. Specifically:

  • The framework supports the Act's requirements for transparency by providing clear mechanisms for data anonymization and pseudonymization.
  • By facilitating privacy-preserving techniques, p5y aligns with the Act's focus on data minimization and purpose limitation in AI systems.
  • The framework's emphasis on standardized privacy protection contributes to the Act's goal of creating trustworthy AI systems that respect user privacy.
  • The framework aligns with the EU AI Act's emphasis on fair and non-discriminatory AI systems by providing a methodology for removing sensitive attributes that may induce unfair bias.

7. What does it look like from a practical perspective?

In p5y version 1.0, we are publishing key data concepts including glossary terms, privacy mask data structure, placeholder tag mechanics, synthetic identities, labels, label sets, and machine learning tasks. See the documentation!
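As an informal illustration of the privacy mask and placeholder tag concepts, here is a hedged sketch in Python. The field names are assumptions made for this example; the authoritative definitions are in the p5y v1.0 documentation.

# Sketch of a privacy mask data structure and placeholder tag rendering.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PrivacyToken:
    start: int   # character offset of the masked span in the source text
    end: int     # end offset (exclusive)
    label: str   # label from the label set, e.g. "EMAIL"
    index: int   # identity index, so repeated mentions share a tag

@dataclass
class PrivacyMask:
    source: str
    tokens: List[PrivacyToken] = field(default_factory=list)

    def masked_text(self) -> str:
        """Render the privacy layer by substituting placeholder tags."""
        out, last = [], 0
        for t in sorted(self.tokens, key=lambda t: t.start):
            out.append(self.source[last:t.start])
            out.append(f"[{t.label}_{t.index}]")
            last = t.end
        out.append(self.source[last:])
        return "".join(out)

mask = PrivacyMask("Contact Jane at jane@x.io",
                   [PrivacyToken(8, 12, "NAME", 1), PrivacyToken(16, 25, "EMAIL", 1)])
print(mask.masked_text())  # Contact [NAME_1] at [EMAIL_1]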

9. Authorization & PII Minimization Framework (APMF)

Last updated: Dec 28, 2025

A structured policy and decision-making system that determines what personal data may be accessed for a specific task, by a specific requester, under specific regulatory and ethical constraints.

This framework produces two machine-readable outputs:

  • Privacy Allowance Profile (PAP) – defines what PII may be included (direct, indirect, sensitive)
  • Data Access Certificate (DAC) – the formal authorization artifact that travels with the data

Purpose of the Framework

  • Enforce data minimization, least privilege access and purpose limitation
  • Decide which PII classes are allowed in a given context
  • Decide which anonymization process is necessary or appropriate
  • Integrate with enterprise Data Access Governance (DAG) workflows
  • Provide traceability for privacy audits, compliance, ML governance and reporting
  • Support unstructured data and ML workloads (chats, logs, transcripts, prompts)

Scope

  • Applies to any unstructured dataset or request
  • Covers both internal and third-party processing tools (e.g., cloud LLMs)
  • Covers human and automated requesters
  • Considers regulations, ethics, consent, and organizational policy
  • Outputs certifications, obligations, and PII transformations

How It Works

  1. Inputs collected from the dataset metadata, requester profile, context of use, and regulatory environment
  2. Inputs are matched to rules in the authorization matrix
  3. The matrix determines the Privacy Allowance Profile (e.g., “No direct/indirect PII”)
  4. System produces a Data Access Certificate (DAC) with requirements, restrictions, and audit metadata
  5. Automated anonymization/redaction/scrubbing pipelines transform the data accordingly
  6. Access is granted only to the transformed and approved view of the data

This fits within existing industry DAG, DLP, ML governance, and privacy compliance models.
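A hedged sketch of this flow in Python follows. The rule conditions, field names, and 30-day expiry are illustrative assumptions, not a normative authorization matrix.

# Sketch of the APMF decision flow: match request context to a PAP and emit a DAC.
from datetime import datetime, timedelta, timezone

def select_pap(request):
    """Pick the most restrictive profile compatible with the request context."""
    if request["processing"] == "external":          # 3rd-party LLMs, cloud APIs
        return "P5"                                   # fully de-identified only
    if request["purpose"] == "training" and not request["consent_training"]:
        return "P5"
    if request["purpose"] == "analytics":
        return "P2"                                   # no direct or indirect IDs
    return "P1"                                       # default: strip direct IDs

def issue_dac(request):
    """Produce a Data Access Certificate carrying the selected profile and audit metadata."""
    return {
        "pap": select_pap(request),
        "requester": request["requester"],
        "purpose": request["purpose"],
        "expires": (datetime.now(timezone.utc) + timedelta(days=30)).isoformat(),
        "restrictions": ["no re-identification", "no export outside approved view"],
    }

print(issue_dac({"requester": "analyst-42", "purpose": "analytics",
                 "processing": "internal", "consent_training": True}))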

10. APMF: PAPs

  • P0 — Full Access: Direct, indirect, and sensitive allowed (audit/certification required).
  • P1 — No Direct IDs: Direct removed; indirect and sensitive may remain.
  • P2 — No Direct or Indirect IDs: Direct and indirect removed; sensitive may remain.
  • P3 — No Direct or Sensitive: Direct and sensitive removed; indirect may remain.
  • P4 — No Sensitive Data: Direct and indirect may remain; sensitive masked or removed.
  • P5 — Fully De-Identified / External Processor: All PII removed/redacted; only aggregated or irreversibly anonymized text allowed.
  • P6 — Synthetic / Derived: Only synthetic or fully derived data allowed.
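The tiers above can be expressed as a small machine-readable table, as in this sketch; the "synthetic_only" flag for P6 is an illustrative assumption, and the exact semantics follow the APMF text.

# Sketch of PAP tiers mapped to the PII classes that may remain in released data.
PAP_TIERS = {
    "P0": {"direct": True,  "indirect": True,  "sensitive": True},   # audited full access
    "P1": {"direct": False, "indirect": True,  "sensitive": True},
    "P2": {"direct": False, "indirect": False, "sensitive": True},
    "P3": {"direct": False, "indirect": True,  "sensitive": False},
    "P4": {"direct": True,  "indirect": True,  "sensitive": False},
    "P5": {"direct": False, "indirect": False, "sensitive": False},  # external processors
    "P6": {"direct": False, "indirect": False, "sensitive": False, "synthetic_only": True},
}

def allowed(pap: str, pii_class: str) -> bool:
    """Check whether a PII class may remain in data released under a profile."""
    return PAP_TIERS[pap].get(pii_class, False)

print(allowed("P2", "sensitive"), allowed("P5", "indirect"))  # True False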

11. APMF: Framework Specification (Inputs)

Input information

Data source (automated from data metadata)

Information that should accompany the data source.

  • Data type: Customer chats, emails, call transcripts, health records, financial docs, employee records, prompts, logs, other
  • Data domain: medical, financial, legal, government, other (specify)
  • Data include minors: yes / no / unknown
  • Moderation risks: presence of illegal, harmful, or potentially disturbing content (requires extra care, especially if intended for human access, e.g. labelers)
  • PII classes in data: direct ids, indirect ids, sensitive attributes, none, unknown
  • Jurisdiction: location of the data and of the data subjects, used to determine jurisdiction
  • Consent: consent obtained (yes/no/partial - specify consented use). Capture exact consent strings, timestamps, scope (training allowed, marketing denied), and source (UI, cookie, contract). GDPR/CCPA require proof of legal basis.
  • Data policy: field capturing the provider's policy for accessing and using the data.
  • Data provenance: ties dataset back to original source(s), versions, and transformations for auditing and reproducibility.
  • Re-Use / Re-Training Flags: explicit flag if dataset may be used for subsequent training/fine-tuning (yes/no/conditional).
  • Access log: monitoring who has accessed the data, when, …
  • Storage: where the data is stored, and any copies

Requester (automated from requester profile)
  • Human or automated: human, agent/tool/program
  • Party: internal / third-party (name & region) / regulator / data subject
  • Role: data scientist, annotator, analyst, …
  • Permissions: Training, access level

Request (requester needs to provide)
  • Purpose of access: Training (model), Testing/evaluation, Debugging, Production user-serving, Analytics, Legal/Compliance, Other (specify).
  • Processing location: On-prem, in-cloud (provider: …), hybrid.
  • Processing tool: internal tool, 3rd-party LLM/tool (yes / no. If yes: provider name, contractual status (DPA/BAA/SCCs))
  • Sensitivity of decisions: automated decisions? (yes/no); impacting users? (yes/no); …
  • PII classes requested: direct identifiers / indirect identifiers / sensitive information

Regulations (automated)
  • Jurisdiction: based on data/user location and processing/deployment location - EU, US, UK, Other (specify).
  • Regulations: applicable regulations, determined from all available information
  • Ethical risks: expected bias risk (low/medium/high) and concerns

12. APMF: Framework Specification (Outputs)

Privacy Allowance Profile (PAP)

This output is machine readable and instructs the system to prepare the data according to the allowed profile.

  • Allowed classes of PII: direct identifiers / indirect identifiers / sensitive information
  • Anonymization required: automated workflows to apply redaction or export rules before granting data access for the specific DAC: redaction, obfuscation, …

Data Access Certificate (DAC)

This will accompany the data and be both human and machine readable, to allow audits and automatic detection of policy violations.

  • Request summary: details about the data and what the access is granted for (who can access, for how long (if any time-constraint applies), how the data can be used and not be used, regulations this complies with, any ethical risks …).
  • Re-identification risk score: an automated score to measure how effective de-identification was (useful for regulators and auditors).
  • Derivative policies: for data/model outputs derived from the original data. Model inversion and embedding leakage are industry concerns.
  • Retention and deletion policy: how long the granted view/extract or derived model may persist. Tie issuance to expiry.

Permissions checklist

Outlines what permissions are required and flags any missing permissions for accessing the data as specified by the requester. This can facilitate the request by making clear which permissions are needed and where to request them.
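To show how the DAC described above could travel with the data in both human- and machine-readable form, here is an illustrative sketch serialized as JSON. The field names are assumptions based on the outputs listed in this section.

# Sketch of a DAC serialized as JSON so pipelines and auditors can check it.
import json

dac = {
    "request_summary": {
        "granted_to": "annotation-team",
        "valid_until": "2026-06-30T00:00:00Z",
        "allowed_uses": ["model evaluation"],
        "forbidden_uses": ["re-identification", "marketing"],
        "regulations": ["GDPR", "HIPAA"],
    },
    "privacy_allowance_profile": "P2",
    "reidentification_risk_score": 0.03,  # automated estimate of residual risk
    "derivative_policy": "DAC propagates to models and embeddings derived from this data",
    "retention": {"view_expires_days": 90, "delete_on_expiry": True},
}

print(json.dumps(dac, indent=2))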

13. APMF: Practical Implementation Notes

1. Automate metadata extraction

All data sources should include machine-readable metadata (schema, detected PII, provenance). This lets the system pre-populate 60–70% of required inputs automatically.

2. Treat PII minimization as the default path

Unless explicitly justified, the system should choose the lowest PII tier compatible with the requester’s purpose. This aligns with:

  • GDPR Art. 5(1)(c) (data minimization)
  • HIPAA minimum necessary rule
  • Industry standard DAG principles
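A minimal sketch of this default, under the assumption of an illustrative purpose-to-tier mapping: the granted profile is never less restrictive than the minimum the stated purpose requires.

# Sketch of "minimization by default": clamp the requested tier to the purpose's minimum.
MINIMUM_TIER_FOR_PURPOSE = {
    "analytics": "P5",        # aggregated, de-identified text is enough
    "debugging": "P2",
    "training": "P2",
    "legal_compliance": "P1",
}

# Illustrative total ordering from most to least restrictive; P2-P4 are not
# strictly comparable, so treating them this way is a policy choice.
ORDER = ["P6", "P5", "P2", "P3", "P4", "P1", "P0"]

def resolve_pap(requested: str, purpose: str) -> str:
    """Grant the more restrictive of the requested tier and the purpose's minimum."""
    needed = MINIMUM_TIER_FOR_PURPOSE.get(purpose, "P5")
    return min(requested, needed, key=ORDER.index)

print(resolve_pap("P1", "analytics"))  # P5: minimization overrides the broader request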

3. PAP and DAC must be machine-enforceable

The Privacy Allowance Profile should trigger:

  • Automated redaction
  • Automated de-identification
  • Auto-blocking of disallowed data exports
  • Auto-routing to safe compute environments

The DAC must be:

  • Attached to all data snapshots
  • Propagated through pipelines
  • Included in model cards for any trained model

4. Support “Trusted vs. External compute”

If processing in:

  • Trusted compute (internal servers) → more PII may be allowed
  • External compute (3rd-party LLMs, Cloud APIs) → no direct/indirect identifiers

5. Redaction pipelines must be centrally managed

Do not rely on manual redaction by requesters. Use:

  • Token classification PII detection
  • Domain-specific scrubbing
  • Sensitive attribute masking to reduce bias risks

6. DACs must flow downstream

Whenever the data:

  • trains a model
  • produces embeddings
  • is transformed

…the DAC must propagate, generating derivative policies automatically.

This enforces traceability and helps with:

  • Model governance
  • Responsible AI
  • Audits

7. Consent checks before PAP selection

Consent attributes (scope, expiry, revocation) must be checked before PAP selection. If consent prohibits training → PAP must enforce anonymization or block the request entirely.

8. Time-based access

DACs should include expiry, after which:

  • data access auto-revokes
  • models trained during the window may need to be deleted or retrained, if required by policy

9. Example use case: Human-access scenarios

Human annotators or analysts must receive:

  • filtered data
  • moderated content warnings
  • protection from disturbing content

This is now standard practice at major AI companies and is subject to heightened scrutiny.

14. Error Code Management & Taxonomy

Status: Final

This document defines the Management Methodology and the Error Code Taxonomy for continuous improvement of data anonymisation processes.

Management Framework

Overview

This framework provides a systematic approach to managing and improving data anonymisation quality through continuous error detection, measurement, and reduction. Inspired by Six Sigma[^1] principles, it enables organisations to achieve near-perfect anonymisation by minimising errors over time as part of a Privacy Information Management System (PIMS).

Process Architecture

The anonymisation pipeline consists of three layers:

  1. Source Layer: Original text containing personally identifiable information (PII)
  2. Privacy Layer: Text with PII replaced by privacy tokens (e.g., [NAME], [EMAIL])
  3. Output Layer: Final text after unmasking and any transformations (e.g., translation, summarisation)

Core Detection Function

FUNCTION DetectErrors(
    inputs: {
        source: String,
        actual-mask: PrivacyMask = List<{Start, End, Label, Index}>,
        computed-mask: PrivacyMask = List<{Start, End, Label, Index}>
    }
) -> ErrorMatrix = List<{activation: Code, explanation: String}>:

Objective: Minimise the error function as volume and complexity of requests increase.

Quality Management Workflow:

  • Inference Stage: Given source text and label taxonomy, compute privacy mask
  • Evaluation Stage: Compare computed mask against gold standard annotations
  • Post-Processing Stage: Apply string replacement and validate output quality
  • Analysis Stage: Classify errors by type, calculate metrics, identify improvement areas
  • Improvement Stage: Refine models, rules, and processes based on error patterns
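The sketch below illustrates the Evaluation stage of this workflow in Python: it compares a computed mask against the gold (actual) mask and emits coded errors. For brevity it assumes exact-span matching and covers only a few codes from the taxonomy below (T-001, T-002, L-001); span-boundary and other categories are omitted.

# Sketch of the DetectErrors comparison between gold and computed privacy masks.
def detect_errors(source, actual_mask, computed_mask):
    """Each mask is a list of (start, end, label) spans; returns an error matrix."""
    errors = []
    actual = {(s, e): label for s, e, label in actual_mask}
    computed = {(s, e): label for s, e, label in computed_mask}
    for span, label in actual.items():
        if span not in computed:
            errors.append({"activation": "T-002",
                           "explanation": f"missed {label} at {span}"})
        elif computed[span] != label:
            errors.append({"activation": "L-001",
                           "explanation": f"{computed[span]} instead of {label} at {span}"})
    for span, label in computed.items():
        if span not in actual:
            errors.append({"activation": "T-001",
                           "explanation": f"spurious {label} at {span}"})
    return errors

source = "Email john@x.io or call 555-1234"
gold = [(6, 15, "EMAIL"), (24, 32, "PHONE")]
pred = [(6, 15, "EMAIL"), (24, 32, "PASSWORD")]
print(detect_errors(source, gold, pred))  # one L-001 misclassification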


15. Error Taxonomy: Token Classification Errors (T)

Description: Binary classification failures at the token level, where individual text units are incorrectly identified as containing or not containing PII. Token here refers to the individual units of text obtained by running a tokenizer on the data.

Application: Applies to all anonymisation tasks during the initial token-level detection phase.

Evaluation:

  • Severity: 5/5 — Undertriggering creates direct privacy breaches; overtriggering reduces data utility and may block legitimate use cases
  • Metrics: Precision, Recall, F1-score at token level; False Positive Rate (FPR), False Negative Rate (FNR)

  • T-001 Overtriggered — Token incorrectly marked as PII when it should not be.
    Example[^2]: S: I like apple pie P: I like [COMPANY] pie G: I like apple pie
  • T-002 Undertriggered — Token incorrectly marked as not-PII when it contains personal information.
    Example[^2]: S: Email to john.doe@email.com P: Email to john.doe@email.com G: Email to [EMAIL]
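A minimal sketch of the token-level metrics listed above, assuming per-token binary gold and predicted labels (1 = PII, 0 = not PII):

# Sketch of token-level precision/recall/F1 plus FPR (overtriggering) and FNR (undertriggering).
def token_metrics(gold, pred):
    tp = sum(g == p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    tn = sum(g == p == 0 for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0   # overtriggering rate (T-001)
    fnr = fn / (fn + tp) if fn + tp else 0.0   # undertriggering rate (T-002)
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr, "fnr": fnr}

# Tokens: ["Email", "to", "john.doe@email.com"] — the address is PII but was missed.
print(token_metrics(gold=[0, 0, 1], pred=[0, 0, 0]))  # recall 0.0, fnr 1.0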


16. Error Taxonomy: Entity Span Errors (S)

Description: Errors in determining correct boundaries of PII entities. The system recognises that PII exists but fails to capture the complete span or captures incorrect portions of surrounding text. Entity refers to a specific piece of information that can contribute to identifying an individual or reveal sensitive personal details. Entities can be realized by spans in the text comprising one or multiple tokens.

Application: Applies to all entity-based anonymisation systems. Particularly relevant for named entity recognition and boundary detection tasks involving multi-token entities.

Evaluation:

  • Severity: 4/5 — Moderate impact on both privacy (underannotation) and utility (overannotation). Can cascade into label classification errors
  • Metrics: Exact Match Accuracy, Partial Match Score, Character-level F1, Boundary IoU (Intersection over Union)

  • S-001 Overannotated — More tokens included in an entity span than correct.
    Example: S: Dr. Sarah Johnson's research is outstanding P: [NAME] research is outstanding G: Dr. [NAME]'s research is outstanding
  • S-002 Underannotated — Fewer tokens included in an entity span than correct.
    Example: S: Dr. Sarah Johnson's research is outstanding P: Dr. Sarah [NAME]'s research is outstanding G: Dr. [NAME]'s research is outstanding
  • S-003 Partially Overlapping — Entity span overlaps with predicted span but boundaries don't align.
    Example: S: Dr. Sarah Johnson's research is outstanding P: Dr. Sarah [NAME] is outstanding G: Dr. [NAME]'s research is outstanding
  • S-004 Span Fragmented — Single entity incorrectly split into multiple separate entities.
    Example: S: I live in New York City P: I live in [LOCATION] [LOCATION] [LOCATION] G: I live in [LOCATION]
  • S-005 Spans Merged — Multiple distinct entities incorrectly combined into a single span.
    Example: S: Travel from Paris to London tomorrow P: Travel from [LOCATION] tomorrow G: Travel from [LOCATION] to [LOCATION] tomorrow
  • S-006 Span Misaligned — Entity detected but span boundaries are completely wrong.
    Example: S: I live in New York P: I liv[LOCATION]ork G: I live in [LOCATION]
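A minimal sketch of boundary checks for these codes: character-level IoU between gold and predicted spans, and a rough mapping of the mismatch to an S code. The offsets in the usage line follow the S-001 example above.

# Sketch of span-boundary evaluation: IoU plus a coarse S-code classification.
def span_iou(pred, gold):
    """pred and gold are (start, end) character offsets; IoU over character positions."""
    inter = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union else 0.0

def span_error(pred, gold):
    if pred == gold:
        return None
    if pred[0] <= gold[0] and pred[1] >= gold[1]:
        return "S-001"  # overannotated: prediction covers extra tokens
    if pred[0] >= gold[0] and pred[1] <= gold[1]:
        return "S-002"  # underannotated: prediction misses part of the entity
    if span_iou(pred, gold) > 0:
        return "S-003"  # partially overlapping boundaries
    return "S-006"      # detected something, but boundaries are entirely wrong

# "Dr. Sarah Johnson's research ..." — gold masks "Sarah Johnson" (4-17), the
# prediction also swallows "Dr." and the possessive (0-19), as in S-001 above.
print(span_error(pred=(0, 19), gold=(4, 17)), span_iou((0, 19), (4, 17)))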


17. Error Taxonomy: Entity Nesting Errors (N)

Description: Failures in recognising and representing hierarchical relationships where one entity contains or is contained within another entity.

Application: Applies primarily to structured data contexts (file paths, URLs, addresses, organisational hierarchies) and systems that support nested entity representation. Not applicable to flat entity models.

Evaluation:

  • Severity: 3/5 — Can expose nested PII but often the containing entity provides sufficient protection. Impact varies by nesting depth and entity types involved
  • Metrics: Nested Entity Recognition Rate, Hierarchy Completeness Score, Parent-Child Match Accuracy

  • N-001 Missing Nested Entity — Nested entity within a larger entity not recognised.
    Example: S: /home/john_doe/documents/contract.pdf P: [/home/john_doe/documents/contract.pdf]FILEPATH G: [/home[/john_doe]USERNAME/documents/contract.pdf]FILEPATH
  • N-002 Missing Larger Entity — Larger entity not recognised.
    Example: S: /home/john_doe/documents/contract.pdf P: /home[/john_doe]USERNAME/documents/contract.pdf G: [/home[/john_doe]USERNAME/documents/contract.pdf]FILEPATH


18. Error Taxonomy: Label Classification Errors (Single-Label)(L)

Description: Errors in assigning the correct PII category to an entity when only one label should be applied or is allowed. The span is correctly identified but assigned the wrong type or inappropriate granularity level.

Application: Applies to all classification-based anonymisation systems. Critical when different entity types require different handling (e.g., retention policies, encryption methods).

Evaluation:

  • Severity: 3/5 — Generally lower risk as entity is still protected, but can affect downstream processing, policy compliance, and analytics utility
  • Metrics: Label Accuracy, Confusion Matrix, Macro/Micro F1 per label class, Granularity Appropriateness Score

  • L-001 Misclassified — Completely incorrect label assigned to entity.
    Example: S: For support, call 555-1234 P: For support, call [PASSWORD] G: For support, call [PHONE]
  • L-002 Imprecise — Coarse or fallback category used instead of the correct fine-grained label.
    Example: S: I live in Paris P: I live in [LOCATION] G: I live in [CITY]
  • L-003 Too Specific — Fine-grained category used when a coarse or fallback category is appropriate.
    Example: S: Enter code 3456 P: Enter code [PASSWORD] G: Enter code [NUMERIC_ID]

19. Error Taxonomy: Label Classification Errors (Multi-Label) (M)

Description: Errors in assigning and ranking multiple valid PII categories when entities legitimately belong to several classes. Failures include missing labels, incorrect confidence ranking, or assigning invalid labels.

Application: Applies only to systems supporting multi-label classification, typically for ambiguous entities (e.g., "Jordan" as NAME/LOCATION) or for entities carrying multiple pieces of information (e.g., "Janet" as NAME/(likely) GENDER, or the Italian SSN "RSSRRT60R27F205X", which also encodes parts of NAME, DoB, CITY, and GENDER). Not applicable to strict single-label systems.

Evaluation:

  • Severity: 2/5 — Affects downstream decision-making and sensitive data-aware processing but entity likely remains protected under at least one label
  • Metrics: ranking: nDCG (Normalized Discounted Cumulative Gain) or Label Ranking Average Precision (LRAP); Label assignment: Precision/Recall

  • M-001 Overranked — Label is possible but ranked with higher confidence than warranted.
    Example: S: I want to visit Jordan P: I want to visit Jordan[NAME:0.9, COUNTRY:0.8] G: I want to visit Jordan[COUNTRY:0.8, NAME:0.4]
  • M-002 Underranked — Label is possible but ranked with lower confidence than warranted.
    Example: S: I want to visit Jordan P: I want to visit Jordan[NAME:0.4, COUNTRY:0.3] G: I want to visit Jordan[COUNTRY:0.8, NAME:0.4]
  • M-003 Underlabeled — Too few labels assigned; additional valid labels missing.
    Example: S: I want to visit Jordan P: I want to visit Jordan[COUNTRY:0.8] G: I want to visit Jordan[COUNTRY:0.8, NAME:0.4]
  • M-004 Overlabeled — Too many labels assigned; some labels not contextually valid.
    Example: S: I want to visit my friend Jordan P: I want to visit my friend Jordan[NAME:0.9, COUNTRY:0.4] G: I want to visit my friend Jordan[NAME:0.9]


20. Error Taxonomy: Privacy Token Errors (K)

Description: Errors in how privacy tokens are structured, linked to source text, and coreferenced across the document. These affect token integrity, traceability, and identity consistency. These occur when there is misalignment between source text and privacy layer or when identities are not correctly identified.

Application: Applies to all token-based anonymisation systems that maintain mappings between privacy tokens and original entities. Critical for reversible anonymisation and multi-mention scenarios.

Evaluation:

  • Severity: 4/5 — High severity as errors can break unmask operations, expose PII through poorly formed tokens, or fail to protect repeated entity mentions
  • Metrics: Token-Span Alignment Accuracy, Coreference Resolution F1, Token Format Compliance Rate

  • K-001 Incorrect token length — Privacy token length does not correspond to the actual entity span.
    Example: P: Contact Dr. Smith[NAME] Privacy token: [NAME]byte:12-13 Privacy token: [NAME]byte:12-17
  • K-002 Incorrect token anchors — Privacy token is linked to the incorrect span in the original text, or to no span.
    Example: P: Contact Dr. Smith[NAME] Privacy token: [NAME]byte:4-8 Privacy token: [NAME]byte:12-17
  • K-003 Missing Coreference — Failed to link multiple references to the same entity.
    Example: S: Hannah Smith was born in 1956. Dr. Smith studied in Edinburgh P: [NAME_1] was born in 1956. Dr. [SURNAME_2] studied in Edinburgh G: [NAME_1] was born in 1956. Dr. [SURNAME_1] studied in Edinburgh
  • K-004 Incorrect token label — Privacy token is incorrectly named (label, code).
    Example: S: Patient SSN is 123-45-6789 P: Patient SSN is [SocialNum_001] G: Patient SSN is [SSN_1]
  • K-005 Token label includes PII — Privacy token label includes personal information.
    Example: S: Janet Smith's passport number is DG456789 P: [NAME_FEMALE_1]'s passport number is [US_PASSPORTNO_1] G: [NAME_1]'s passport number is [PASSPORTNO_1]


21. Error Taxonomy: Output Text Errors (O)

Description: Errors occurring during unmasking and output generation, where privacy tokens are incorrectly replaced, positioned, or rendered ungrammatical in the final text.

Application: Applies only to reversible anonymisation systems with unmasking functionality and to pipelines involving text transformation (translation, summarisation, style transfer) after masking.

Evaluation:

  • Severity: 3/5 — Medium severity. Doesn't create privacy breaches but severely impacts usability, comprehension, and trust in the system. O-001 with wrong entity value can create confusion or misinformation
  • Metrics: Unmasking Accuracy, BLEU/ROUGE scores (for fluency), Edit Distance, Grammaticality scores

  • O-001 Privacy mask filled with incorrect entity value — Wrong or no entity value inserted during unmasking.
    Example: S: We met John Doe at the conference P: We met [NAME] at the conference O: We met Janet at the conference G: We met John Doe at the conference
  • O-002 Privacy mask not replaced — Privacy token remains in the output text instead of being unmasked.
    Example: S: We met John Doe at the conference P: We met [NAME] at the conference O: We met [NAME] at the conference G: We met John Doe at the conference
  • O-003 Span incorrectly replaced — The entity value is replaced in an incorrect position in the text.
    Example: S: We met John Doe at the conference P: We met [NAME] at the conference O: We met at the John Doe conference G: We met John Doe at the conference
  • O-004 Unmasked entity value ungrammatical — The unmasked entity value is not adapted to the context in the output text and is ungrammatical, e.g. not in the right case.
    Example: S: Helena's book is excellent P: [NAME] book is excellent O: Helena Buch ist ausgezeichnet G: Helenas Buch ist ausgezeichnet (German possessive 's')
  • O-005 Surrounding output text ungrammatical — The unmasking process introduced grammatical, linguistic, or coherence errors in the surrounding text, often because masking the entity value removed critical information needed for correct processing of the data transformation request, such as gender, number, case, or coreference.
    Example: S: Residency: United States P: Residency: [COUNTRY] O: I live in United States G: I live in the United States
  • O-006 Unmasked entity value not translated — Entity value not adapted to the output language.
    Example: S: Janet has recently married P: [NAME] has recently [MARITAL_STATUS] O: Janet si è married da poco G: Janet si è sposata da poco
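A small sketch of an output-layer check for two of these codes: leftover privacy tokens (O-002) and missing expected entity values (O-001). The placeholder tag pattern is an assumption about the tag format used in the privacy layer.

# Sketch of a post-unmasking validation pass over the output text.
import re

TAG = re.compile(r"\[[A-Z_]+(?:_\d+)?\]")

def check_output(output, expected_values):
    errors = []
    for tag in TAG.findall(output):
        errors.append({"code": "O-002", "explanation": f"{tag} left unmasked"})
    for value in expected_values:
        if value not in output:
            errors.append({"code": "O-001",
                           "explanation": f"expected entity value '{value}' missing"})
    return errors

output = "We met [NAME] at the conference"
print(check_output(output, expected_values=["John Doe"]))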


22. Error Taxonomy: Personalization Errors (P)

Description: Errors in applying user-specific or policy-specific anonymisation preferences, resulting in incorrect handling based on individual requirements or organisational policies.

Application: Applies only to systems supporting personalised anonymisation rules (e.g., user-defined sensitivity levels, custom entity types, jurisdictional requirements).

Evaluation:

  • Severity: 4/5 — High severity as these directly violate user expectations and policy compliance. Can lead to regulatory violations or loss of user trust
  • Metrics: Policy Compliance Rate, Preference Application Accuracy, User Satisfaction Score

  • P-001 Preference Not Applied — User or policy preference ignored in anonymisation.
    Example: S: My IP address is 192.168.1.1 User Preference: Anonymise all IP addresses P: My IP address is 192.168.1.1 G: My IP address is [IP_ADDRESS]
  • P-002 Wrong Preference Applied — Incorrect rule set or policy used.
    Example: S: Employee ID: E12345 Applied Policy: Healthcare (anonymise) P: Employee ID: [EMPLOYEE_ID] Correct Policy: Internal HR (retain) G: Employee ID: E12345

23. Contact Us

If you have any questions or want to know more about how the p5y framework can help your organization, feel free to reach out to us!

© 2026 p5y Framework. Version 2.0.0