Many research projects eventually face the same challenge:
- a collection of CSV files,
- a few Excel workbooks,
- slightly different naming conventions,
- inconsistent value formats,
- and a growing amount of manual data cleaning.
At first, everything seems manageable. Then new files arrive, contributors use different templates, identifiers change over time, and keeping a consolidated dataset up to date becomes increasingly difficult.
This is the problem files2db aims to solve.
What is files2db?
files2db is a Python package that automates the aggregation, normalization and validation of heterogeneous flat files into a single standardized database.
Instead of manually cleaning and merging datasets every time new files are received, users define a set of normalization and validation rules that are applied automatically during the import process.
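To make the idea of rule-driven imports concrete, here is a minimal sketch in plain Python. It does not use the files2db API: the `RULES` dictionary, the field names, and the `apply_rules` helper are all hypothetical, chosen only to show how declared rules can be applied to each row instead of being repeated by hand.

```python
# Illustrative only: a hypothetical rule format, not files2db's configuration schema.
from datetime import datetime


def normalize_date(value):
    """Convert DD/MM/YYYY to ISO 8601; return other values stripped but unchanged."""
    try:
        return datetime.strptime(value.strip(), "%d/%m/%Y").date().isoformat()
    except ValueError:
        return value.strip()


def is_iso_date(value):
    """Return True when the value parses as YYYY-MM-DD."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


# One entry per field: how to normalize it, then how to check the result.
RULES = {
    "sample_id": {
        "normalize": lambda v: v.strip().upper(),
        "validate": lambda v: v.startswith("S"),
    },
    "collected_on": {"normalize": normalize_date, "validate": is_iso_date},
}


def apply_rules(row, rules):
    """Return (cleaned_row, errors) without mutating the input row."""
    cleaned, errors = dict(row), []
    for field, rule in rules.items():
        if field not in row:
            errors.append(f"{field}: missing")
            continue
        value = rule["normalize"](row[field])
        cleaned[field] = value
        if not rule["validate"](value):
            errors.append(f"{field}: invalid value {value!r}")
    return cleaned, errors


row = {"sample_id": " s001 ", "collected_on": "03/02/2021"}
cleaned, errors = apply_rules(row, RULES)
print(cleaned, errors)
```

Because the rules live in data rather than in ad hoc cleaning steps, the same checks run identically on every new file.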
The result is:
- a consolidated database,
- reproducible processing steps,
- complete traceability back to the original source files,
- and automated error reporting.
All of this is achieved without modifying the original data files. The only requirement is that columns are consistently named across datasets.
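Traceability of this kind can be pictured as attaching provenance to every row while reading, leaving the source untouched. The sketch below is an assumption about how such tagging might look, not a description of files2db internals: the `_source_file` and `_source_line` column names are hypothetical, and in-memory strings stand in for files on disk.

```python
# Illustrative only: provenance tagging with the standard library.
import csv
import io

# In-memory stand-ins for two source files that share column names.
SOURCES = {
    "site_a.csv": "sample_id,value\nS001,1.5\nS002,2.0\n",
    "site_b.csv": "sample_id,value\nS003,3.1\n",
}


def read_with_provenance(name, text):
    """Read CSV text and tag each record with its file name and line number."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        # reader.line_num counts physical lines read so far, header included.
        rows.append({**record, "_source_file": name, "_source_line": reader.line_num})
    return rows


combined = [
    row
    for name, text in SOURCES.items()
    for row in read_with_provenance(name, text)
]

for row in combined:
    print(row)
```

Any value in the consolidated table can then be traced back to a specific line of a specific file, and the inputs are only ever read.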
Because the aggregation, normalization and validation rules are explicitly defined, rebuilding the database from raw files becomes a deterministic and reproducible process.
How files2db Works
After installation (the package is available on conda-forge), files2db requires only three configuration tables:
- Files: lists the files to aggregate and their metadata.
- FieldsRules: defines normalization and validation rules for each field.
- ValuesMap: harmonizes values across different coding schemes.
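As an illustration of what a value map does, the sketch below harmonizes two coding schemes for the same field. The column names (`field`, `source_value`, `target_value`) and the helper functions are hypothetical; the actual layout of the ValuesMap table is described in the project documentation.

```python
# Illustrative only: a hypothetical value map, not the real ValuesMap layout.
VALUES_MAP = [
    {"field": "sex", "source_value": "M", "target_value": "male"},
    {"field": "sex", "source_value": "1", "target_value": "male"},
    {"field": "sex", "source_value": "F", "target_value": "female"},
    {"field": "sex", "source_value": "2", "target_value": "female"},
]


def build_lookup(values_map):
    """Index the map by (field, case-folded source value) for fast lookups."""
    return {
        (entry["field"], entry["source_value"].casefold()): entry["target_value"]
        for entry in values_map
    }


def harmonize(row, lookup):
    """Replace mapped values; leave unmapped values exactly as they were."""
    return {
        field: lookup.get((field, str(value).strip().casefold()), value)
        for field, value in row.items()
    }


lookup = build_lookup(VALUES_MAP)
harmonized = harmonize({"sex": " f ", "age": "34"}, lookup)
print(harmonized)
```

Keeping the mapping in a table means a new coding scheme is handled by adding rows, not by editing code.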
The workflow is then fully automated: Aggregation -> Normalization -> Validation -> Export
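The four stages can be sketched end to end with the standard library. This is a toy pipeline under stated assumptions, not the files2db implementation: the function names, the `observations` table, the sample data, and the choice of an in-memory SQLite database as the export target are all hypothetical.

```python
# Illustrative only: a toy Aggregation -> Normalization -> Validation -> Export chain.
import csv
import io
import sqlite3


def aggregate(sources):
    """Stack rows from every source, tagging each with its file of origin."""
    rows = []
    for name, text in sources.items():
        for record in csv.DictReader(io.StringIO(text)):
            rows.append({**record, "source_file": name})
    return rows


def normalize(rows):
    """Bring the 'site' field to a single lowercase, trimmed spelling."""
    return [{**row, "site": row["site"].strip().lower()} for row in rows]


def validate(rows):
    """Split rows into valid ones and a list of error messages."""
    valid, errors = [], []
    for row in rows:
        try:
            valid.append({**row, "count": int(row["count"])})
        except ValueError:
            errors.append(f"{row['source_file']}: count {row['count']!r} is not an integer")
    return valid, errors


def export(rows, connection):
    """Write the validated rows to a single table."""
    connection.execute(
        "CREATE TABLE observations (site TEXT, count INTEGER, source_file TEXT)"
    )
    connection.executemany(
        "INSERT INTO observations VALUES (:site, :count, :source_file)", rows
    )
    connection.commit()


sources = {
    "a.csv": "site,count\n North ,4\nsouth,7\n",
    "b.csv": "site,count\nNORTH,n/a\neast,2\n",
}
connection = sqlite3.connect(":memory:")
valid_rows, error_report = validate(normalize(aggregate(sources)))
export(valid_rows, connection)
totals = connection.execute(
    "SELECT site, SUM(count) FROM observations GROUP BY site ORDER BY site"
).fetchall()
print(totals)
print(error_report)
```

Invalid rows are reported rather than silently dropped or coerced, which is what makes the error report useful to whoever supplied the file.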
Learn More
The project documentation covers installation instructions, normalization rules, validation mechanisms, and command-line usage.