Data Cleaning

Upload qualitative data files for cleaning. The tool removes timestamps, filler words, and personally identifying information (PII) in two steps: pattern matching catches structured identifiers, then AI review catches names, locations, and contextual identifiers.
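
As a rough illustration of the pattern-matching step only, the sketch below replaces two kinds of structured identifier with bracketed labels. The patterns, the labels, and the `redact_structured` name are assumptions made for this example, not the tool's actual rules. Names, locations, and contextual identifiers have no fixed shape, which is why they are left to the AI review step.

```python
import re

# Hypothetical patterns for two common structured identifiers.
# The tool's real rule set is not documented on this page.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def redact_structured(text):
    """Replace each match with a bracketed label such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact_structured("Email jo@example.org or call 613-555-0199."))
# prints: Email [EMAIL] or call [PHONE].
```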

Before you upload

Do not upload health records or other sensitive personal information. This tool is built for social services and evaluation data. It should not be used for personal health information, or for data that reveals someone's racial or ethnic origin, religious or philosophical beliefs, sexual orientation, or trade union membership. If your file contains this kind of information about identifiable people, do not upload it until you have checked with your organization.
Confirm your organization allows this. Because step 2 sends data outside Canada, make sure your organization permits cross-border processing and that you have permission to upload the file.
If your file includes information about Quebec residents, your organization may have additional obligations to review before transferring it.
Review before sharing. Automated de-identification is a strong first pass, not a replacement for human review. Always check the cleaned file yourself.

Not sure whether your data is safe to upload? Start with your organization’s privacy lead. Our technical reference sets out exactly how the tool handles data, so you can share it with them to help make the decision.

Drag files here or

.docx, .txt, .csv, .pdf, .vtt — up to 20 MB each, 30 files max
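
If you prepare batches with a script, a pre-check along these lines can catch files that would be rejected. It is a minimal sketch: `check_batch` is a hypothetical helper, and it assumes "20 MB" means 20 × 1024 × 1024 bytes, which this page does not specify.

```python
from pathlib import Path

# Limits copied from the upload note above; the helper itself is hypothetical.
ALLOWED_EXTENSIONS = {".docx", ".txt", ".csv", ".pdf", ".vtt"}
MAX_FILE_BYTES = 20 * 1024 * 1024  # assumption: "20 MB" read as 20 MiB
MAX_FILES = 30


def check_batch(files):
    """Return a list of problems for (name, size_in_bytes) pairs."""
    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"too many files: {len(files)} (limit {MAX_FILES})")
    for name, size in files:
        ext = Path(name).suffix.lower()  # compare extensions case-insensitively
        if ext not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported type {ext or '(none)'}")
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: larger than 20 MB")
    return problems
```

An empty list means the batch fits the stated limits. It does not detect scanned or image-based PDFs, which need to be checked by hand.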

Options

Remove timestamps from transcripts
Remove filler words (um, uh, hmm)
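
To give a sense of what these two options do, here is a small sketch under stated assumptions: it treats WebVTT-style cue timing lines as the timestamps and uses only the three filler words named in the option label. The function name and both patterns are illustrative; the tool may recognize other formats and fillers.

```python
import re

# WebVTT-style cue timing lines, e.g. "00:00:01.000 --> 00:00:04.000".
# Assumption: the real tool may handle other timestamp formats as well.
CUE_TIMING = re.compile(
    r"^\s*(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}.*$"
)
# Filler words from the option label, matched as whole words, together
# with one trailing comma and any whitespace that follows.
FILLER = re.compile(r"\b(?:um|uh|hmm)\b,?\s*", re.IGNORECASE)


def clean_transcript(text, remove_timestamps=True, remove_fillers=True):
    lines = []
    for line in text.splitlines():
        if remove_timestamps and CUE_TIMING.match(line):
            continue  # drop the whole timing line
        if remove_fillers:
            line = FILLER.sub("", line)
            line = re.sub(r"\s{2,}", " ", line).strip()
        lines.append(line)
    return "\n".join(lines)


sample = "00:00:01.000 --> 00:00:04.000\nUm, I think, uh, it went well."
print(clean_transcript(sample))
# prints: I think, it went well.
```

Whole-word matching keeps words such as "umbrella" intact.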

Organization names

Anonymize organization names
Preserve large organizations; flag small groups
Preserve all organization names

Choose "Preserve all organization names" when the names themselves are needed as context for your analysis, for example in narrative data about partner organizations.

Preserve locations and place names

Use when towns, regions, schools, sites, or service locations are needed for analysis. Personal names and contact details are still cleaned.

Terms to preserve (optional — one per line)

List any names, projects, or terms you want left unchanged. We'll skip them during cleaning. Each term is matched literally and case-insensitively.
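
"Matched literally and case-insensitively" can be pictured with the sketch below. It is an illustration, not the tool's code: the function names are hypothetical, and it assumes plain substring matching with no word-boundary check.

```python
import re


def build_preserve_pattern(terms_text):
    """Compile one regex from a 'one term per line' box; None if it is empty."""
    terms = [t.strip() for t in terms_text.splitlines() if t.strip()]
    if not terms:
        return None
    # Longest first, so a longer term wins over a shorter one it contains.
    terms.sort(key=len, reverse=True)
    # re.escape makes characters such as "." or "(" match literally.
    return re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)


def preserved_spans(text, pattern):
    """Return (start, end) offsets that a cleaner should leave untouched."""
    if pattern is None:
        return []
    return [m.span() for m in pattern.finditer(text)]
```

With the terms `Project A.B.` and `Youth Hub`, the text "We met at the YOUTH HUB for project a.b. planning." yields two preserved spans, while "Project AXBX" yields none because the dots are not treated as wildcards.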

Names that are already pseudonyms (optional — one per line)

Use this for participant-chosen names or handles that should stay visible. The final safety scan, which looks for name-like text, also ignores these names.

Before you start

.docx — Word documents (text is extracted automatically)
.txt — Plain text files
.csv — CSV files (you'll choose which columns to clean)
.pdf — PDF documents (text is extracted automatically)
.vtt — WebVTT caption/transcript files

Not supported: Excel (.xlsx) — export as CSV first (File → Save As → CSV). Scanned or image-based PDFs cannot be processed — use a text-based PDF or convert to .docx first.

Time estimate: Around 1–2 minutes per file.