Normalisation

Normalisation steps

After aggregating all input files, files2db applies a series of normalisation rules when the --normalize option is enabled. The rules are defined in the FieldsRules table and are applied sequentially to each field.

These steps are applied in the following order:

  1. Drop all only NA values rows and columns
  2. Remove all ascii characters and encode data in UTF-8
  3. Process each field defined in FieldsRules:
    1. Split values (Sep)
    2. Clean values (DelMatch, DelIn, DelStart, DelEnd and StripFrom)
    3. Convert data type (DataType)
    4. Apply value mappings with ValuesMap table
    5. Validate values
    6. Split fields (SepPattern)

Note

Fields are processed in the order in which they appear in the FieldsRules table. This can affect the result when multiple rules interact.

Important

The rules are applied sequentially, meaning that the output of one rule will be the input for the next rule.

Normalisation rules

Separate modalities

This step allows to split multiple values in a field using the specified separator.

With Sep as ;

Before After
MyField MyField_1 MyField_2 MyField_3
A;B;C A B C

Warning

The separator cannot be a comma (,) as it is used to split the modalities in the FieldsRules table.

Clean the field

This step allows to clean the field using regular expression patterns and is case sensitive.

  • DelMatch: remove values that completely match the pattern
  • DelIn: remove every occurrence of the pattern within the value
  • DelStart: remove the pattern when it occurs at the beginning of the value
  • DelEnd: remove the pattern when it occurs at the end of the value
  • StripFrom: remove the pattern and everything that follows it

With {pattern} as A:

Type Regex equivalent A AA AB BA BAC
DelMatch ^{pattern}$ AA AB BA BAC
DelIn {pattern} B B BC
DelStart ^{pattern} A B BA BAC
DelEnd {pattern}$ A AB B BAC
StripFrom {pattern}.*$ B B

Note

For details on the regular expression syntax, please refer to the Python documentation and test them using the regex101 website.

Convert data type

This step allows to convert the data type of the field to one of the following:

  • lower: string converted to lowercase
  • UPPER: string converted to uppercase
  • Title: string converted to title case (first letter of each word capitalized)
  • date: date in the format DD.MM.YYYY
  • int: integer number (e.g. "1" and "1.2" will become 1)
  • float: floating-point number (e.g. "1.2" and "1,2" will become 1.2)
  • string: remains a string
  • bool: boolean (e.g. case insensitive "TRUE", "1" will become True while "false" and 0 will become False)

Value mapping

This step allows to apply value mappings defined in the ValuesMap table. For each Fieldprovided, the OriginalValues will be split by a comma (,) and each resulting modalities, if fully matched, will be replaced by the NewValue.

With ValuesMap as:

Field OriginalValues NewValue
ColA OldVal1,OldVal2 NewVal1
ColA OldVal3 NewVal2
ColB OldVal4 NewVal3

Then :

Before After
ColA ColB ColA ColB
OldVal1 OldVal4 NewVal1 NewVal3
OldVal3 OldVal5 NewVal2 OldVal5

Note

Multiple mappings can be applied to the same field by defining multiple rows in the ValuesMap table with the same FieldName.

Multiple values can be mapped to the same value by separating them with a comma in the OriginalValues column.

Important

This step is case sensitive except if the field has been converted to lowercase in the previous step.

Validate values

This step allows to validate the field values using regular expression patterns. Values that do not match the pattern will generate an error in the error output file.

There is two validation possible:

  • Contains: list of values separated by a comma (,). This test is case sensitive.
  • Min and Max: for numerical values only.

Split fields

This step allows to split the field into multiple fields using a regular expression pattern as separator. This use named regex pattern to capture the new field. SepPattern has to be in the form (?P<NewFieldName>{pattern}), KeepLink is a boolean stating if the NewFieldName should be appended to the old Field or as a separate new field.

Note

For details on the regular expression syntax, please refer to the Python documentation and test them using the regex101 website.

Warning

Only the first match of the pattern will be stored in the new field

Simple example

With SepPattern = (?P<Int>\\d+)|(?P<String>[A-Za-z]+)

Before After
MyField KeepLink = True KeepLink = False
MyField_Int MyField_String Int String
1234ABCD 1234 ABCD 1234 ABCD
ABCD 1234 1234 ABCD 1234 ABCD
456 ABCD 1234 456 ABCD 456 ABCD

More complex example

A more complex example when you expect either 1 or 2 value, and if two then split it into left and right.

With SepPattern = ((?P<Left>[A-E])(\/*)(?P<Right>[A-E]))|(?P<Mono>[A-E])

Before After
MyField Left Right Mono
A/B A B
BE B E
D D