Normalisation¶

Normalisation steps¶

After aggregating all input files, files2db applies a series of normalisation rules when the --normalize option is enabled. The rules are defined in the FieldsRules table and are applied sequentially to each field.

These steps are applied in the following order:

Drop all only NA values rows and columns
Remove all ascii characters and encode data in UTF-8
Process each field defined in FieldsRules:
Split values (Sep)
Clean values (DelMatch, DelIn, DelStart, DelEnd and StripFrom)
Convert data type (DataType)
Apply value mappings with ValuesMap table
Validate values
Split fields (SepPattern)

Note

Fields are processed in the order in which they appear in the FieldsRules table. This can affect the result when multiple rules interact.

Important

The rules are applied sequentially, meaning that the output of one rule will be the input for the next rule.

Normalisation rules¶

Separate modalities¶

This step allows to split multiple values in a field using the specified separator.

With Sep as ;

Before	After
MyField	MyField_1	MyField_2	MyField_3
A;B;C	A	B	C

Warning

The separator cannot be a comma (,) as it is used to split the modalities in the FieldsRules table.

Clean the field¶

This step allows to clean the field using regular expression patterns and is case sensitive.

DelMatch: remove values that completely match the pattern
DelIn: remove every occurrence of the pattern within the value
DelStart: remove the pattern when it occurs at the beginning of the value
DelEnd: remove the pattern when it occurs at the end of the value
StripFrom: remove the pattern and everything that follows it

With {pattern} as A:

Type	Regex equivalent	AA	AB	BA	BAC
DelMatch	`^{pattern}$`	AA	AB	BA	BAC
DelIn	`{pattern}`		B	B	BC
DelStart	`^{pattern}`	A	B	BA	BAC
DelEnd	`{pattern}$`	A	AB	B	BAC
StripFrom	`{pattern}.*$`			B	B

Note

For details on the regular expression syntax, please refer to the Python documentation and test them using the regex101 website.

Convert data type¶

This step allows to convert the data type of the field to one of the following:

lower: string converted to lowercase
UPPER: string converted to uppercase
Title: string converted to title case (first letter of each word capitalized)
date: date in the format DD.MM.YYYY
int: integer number (e.g. "1" and "1.2" will become 1)
float: floating-point number (e.g. "1.2" and "1,2" will become 1.2)
string: remains a string
bool: boolean (e.g. case insensitive "TRUE", "1" will become True while "false" and 0 will become False)

Value mapping¶

This step allows to apply value mappings defined in the ValuesMap table. For each Fieldprovided, the OriginalValues will be split by a comma (,) and each resulting modalities, if fully matched, will be replaced by the NewValue.

With ValuesMap as:

Field	OriginalValues	NewValue
ColA	OldVal1,OldVal2	NewVal1
ColA	OldVal3	NewVal2
ColB	OldVal4	NewVal3

Then :

Before		After
ColA	ColB	ColA	ColB
OldVal1	OldVal4	NewVal1	NewVal3
OldVal3	OldVal5	NewVal2	OldVal5

Note

Multiple mappings can be applied to the same field by defining multiple rows in the ValuesMap table with the same FieldName.

Multiple values can be mapped to the same value by separating them with a comma in the OriginalValues column.

Important

This step is case sensitive except if the field has been converted to lowercase in the previous step.

Validate values¶

This step allows to validate the field values using regular expression patterns. Values that do not match the pattern will generate an error in the error output file.

There is two validation possible:

Contains: list of values separated by a comma (,). This test is case sensitive.
Min and Max: for numerical values only.

Split fields¶

This step allows to split the field into multiple fields using a regular expression pattern as separator. This use named regex pattern to capture the new field. SepPattern has to be in the form (?P<NewFieldName>{pattern}), KeepLink is a boolean stating if the NewFieldName should be appended to the old Field or as a separate new field.

Note

For details on the regular expression syntax, please refer to the Python documentation and test them using the regex101 website.

Warning

Only the first match of the pattern will be stored in the new field

Simple example¶

With SepPattern = (?P<Int>\\d+)|(?P<String>[A-Za-z]+)

Before	After
MyField	KeepLink = True		KeepLink = False
MyField	MyField_Int	MyField_String	Int	String
1234ABCD	1234	ABCD	1234	ABCD
ABCD 1234	1234	ABCD	1234	ABCD
456 ABCD 1234	456	ABCD	456	ABCD

More complex example¶

A more complex example when you expect either 1 or 2 value, and if two then split it into left and right.

With SepPattern = ((?P<Left>[A-E])(\/*)(?P<Right>[A-E]))|(?P<Mono>[A-E])

Before	After
MyField	Left	Right	Mono
A/B	A	B
BE	B	E
D			D