Normalisation¶
Normalisation steps¶
After aggregating all input files, files2db applies a series of normalisation rules when the --normalize option is enabled.
The rules are defined in the FieldsRules table and are applied sequentially to each field.
These steps are applied in the following order:
- Drop all only NA values rows and columns
- Remove all ascii characters and encode data in UTF-8
- Process each field defined in FieldsRules:
Note
Fields are processed in the order in which they appear in the FieldsRules table. This can affect the result when multiple rules interact.
Important
The rules are applied sequentially, meaning that the output of one rule will be the input for the next rule.
Normalisation rules¶
Separate modalities¶
This step allows to split multiple values in a field using the specified separator.
With Sep as ;
| Before | After | ||
|---|---|---|---|
| MyField | MyField_1 | MyField_2 | MyField_3 |
| A;B;C | A | B | C |
Warning
The separator cannot be a comma (,) as it is used to split the modalities in the FieldsRules table.
Clean the field¶
This step allows to clean the field using regular expression patterns and is case sensitive.
- DelMatch: remove values that completely match the pattern
- DelIn: remove every occurrence of the pattern within the value
- DelStart: remove the pattern when it occurs at the beginning of the value
- DelEnd: remove the pattern when it occurs at the end of the value
- StripFrom: remove the pattern and everything that follows it
With {pattern} as A:
| Type | Regex equivalent | A | AA | AB | BA | BAC |
|---|---|---|---|---|---|---|
| DelMatch | ^{pattern}$ |
AA | AB | BA | BAC | |
| DelIn | {pattern} |
B | B | BC | ||
| DelStart | ^{pattern} |
A | B | BA | BAC | |
| DelEnd | {pattern}$ |
A | AB | B | BAC | |
| StripFrom | {pattern}.*$ |
B | B |
Note
For details on the regular expression syntax, please refer to the Python documentation and test them using the regex101 website.
Convert data type¶
This step allows to convert the data type of the field to one of the following:
lower: string converted to lowercaseUPPER: string converted to uppercaseTitle: string converted to title case (first letter of each word capitalized)date: date in the format DD.MM.YYYYint: integer number (e.g."1"and"1.2"will become1)float: floating-point number (e.g."1.2"and"1,2"will become1.2)string: remains a stringbool: boolean (e.g. case insensitive"TRUE","1"will becomeTruewhile"false"and0will becomeFalse)
Value mapping¶
This step allows to apply value mappings defined in the ValuesMap table.
For each Fieldprovided, the OriginalValues will be split by a comma (,) and each resulting
modalities, if fully matched, will be replaced by the NewValue.
With ValuesMap as:
| Field | OriginalValues | NewValue |
|---|---|---|
| ColA | OldVal1,OldVal2 | NewVal1 |
| ColA | OldVal3 | NewVal2 |
| ColB | OldVal4 | NewVal3 |
Then :
| Before | After | ||
|---|---|---|---|
| ColA | ColB | ColA | ColB |
| OldVal1 | OldVal4 | NewVal1 | NewVal3 |
| OldVal3 | OldVal5 | NewVal2 | OldVal5 |
Note
Multiple mappings can be applied to the same field by defining multiple rows in the ValuesMap table with the same FieldName.
Multiple values can be mapped to the same value by separating them with a comma in the OriginalValues column.
Important
This step is case sensitive except if the field has been converted to lowercase in the previous step.
Validate values¶
This step allows to validate the field values using regular expression patterns. Values that do not match the pattern will generate an error in the error output file.
There is two validation possible:
Contains: list of values separated by a comma (,). This test is case sensitive.MinandMax: for numerical values only.
Split fields¶
This step allows to split the field into multiple fields using a regular expression pattern as separator.
This use named regex pattern to capture the new field.
SepPattern has to be in the form (?P<NewFieldName>{pattern}), KeepLink is a boolean stating if the
NewFieldName should be appended to the old Field or as a separate new field.
Note
For details on the regular expression syntax, please refer to the Python documentation and test them using the regex101 website.
Warning
Only the first match of the pattern will be stored in the new field
Simple example¶
With SepPattern = (?P<Int>\\d+)|(?P<String>[A-Za-z]+)
| Before | After | |||
|---|---|---|---|---|
| MyField | KeepLink = True | KeepLink = False | ||
| MyField_Int | MyField_String | Int | String | |
| 1234ABCD | 1234 | ABCD | 1234 | ABCD |
| ABCD 1234 | 1234 | ABCD | 1234 | ABCD |
| 456 ABCD 1234 | 456 | ABCD | 456 | ABCD |
More complex example¶
A more complex example when you expect either 1 or 2 value, and if two then split it into left and right.
With SepPattern = ((?P<Left>[A-E])(\/*)(?P<Right>[A-E]))|(?P<Mono>[A-E])
| Before | After | ||
|---|---|---|---|
| MyField | Left | Right | Mono |
| A/B | A | B | |
| BE | B | E | |
| D | D | ||