Quickstart¶

Installation¶

files2db is available on conda-forge, so you can install it with the following command:

conda install -c conda-forge files2db

Table that you will need¶

One excel file with 3 sheets named: Files, FieldsRules, ValuesMap, or 3 different CSV files.

The columns expected for each of these files are the following, but you can find template for each of them in the assets folder.

Files table¶

This table contains the path to each file to aggregate to the database, with information on how to read the data. You therefore need the following columns:

FilePath: the absolute path to the file
SheetName: the sheet name in case of an Excel file
Separator: character used to separate the columns (e.g. ;, \t, ), needed for plain text files (e.g. .csv, .txt, .tsv) (Note: , cannot be used as it is used to split the modalities in the FieldsRules table)
Header: the line number where the column names are written
LineStart: the starting line of the data
LineEnd: the ending line of the data
ColStart: the starting column of the data
ColEnd: the ending column of the data
ToAdd: a true/false value telling if this file should be aggregated or not
AsCorrection: a true/false value telling if this file is a special file to be used for data correction
meta_*: any column starting with meta_ will be added as an additional value for each row of the file.

FilePath	SheetName	Separator	Header	LineStart	LineEnd	ColStart	ColEnd	ToAdd	AsCorrection	meta_DateFileCreation	meta_ColE
test_dataset/test1/fileA.xlsx	Feuil1		3	5	6	C	E	1		2021-12-01
test_dataset/test1/fileB.csv			1	2	3	A	C	1		2024.02.01	MetaNewValue
test_dataset/test1/fileC.csv		;	1	2	4	A	E	1	true	2022.01.10	MetaOldValue

FieldsRules table¶

This table contains the different rules needed to normalize each variable in the aggregated data. More information on the different rules can be found in the documentation.

Field: name of the field to normalize
Category: name of the field category
Sep: character used to separate the modalities in the field (e.g. ,, ;, \t, )
SepPattern: regular expression pattern to separate the modalities in the field
KeepLink: a boolean value telling if the link to the original data should be kept after separating the modalities in the field
DelMatch: a regular expression pattern to delete in the field (fully match)
DelEnd: a regular expression pattern to delete at the end of the field
DelIn: a regular expression pattern to delete in the field (partially match)
DelStart: a regular expression pattern to delete at the start of the field
StripFrom: a regular expression pattern to strip from the field (fully match)
DataType: the data type to assign to the field (e.g. str, int, float, date)
Contains: a regular expression pattern to check if the field contains a specific value
Min: the minimum value for the field (if applicable)
Max: the maximum value for the field (if applicable)

Field	Category	Sep	DelMatch	DelIn	DelEnd	StripFrom	DataType	Contains	Min	Max	SepPattern	KeepLink
ColA	ID	or	NA				int	int	1	4
ColC	Infos	_	WRONG	prefix	suffix	other	string				(?P\d+)\|(?P[A-Za-z]+)	False
DateFileCreation	Other		???,00				date	date

ValuesMap table¶

This table contains modalities correspondance for each field

Field: name of the field to normalize
OriginalValues: modalities list to modify separated by a ,
NewValue: new value to assign to each modality listed in OriginalValue

Field	OriginalValues	NewValue
ValueToMap	A,otherA	MyA
ValueToMap	B	MyB
ValueToMap2	A,B,1	AB1

Explanation:¶

values A and otherA in field ValueToMap will be modify as MyA
values B in field ValueToMap will be modify as MyB
values A, B and 1 in field ValueToMap2 will be modify as AB1