
Normalise dataframe for a Ped object

Usage

norm_ped(
  ped_df,
  na_strings = c("NA", ""),
  missid = NA_character_,
  try_num = FALSE,
  cols_used_del = FALSE
)

Arguments

ped_df

A data.frame with the individuals' information. The minimum columns required are:

  • indId individual identifiers -> id

  • fatherId biological fathers identifiers -> dadid

  • motherId biological mothers identifiers -> momid

  • gender sex of the individual -> sex

  • family family identifiers -> famid

The family column, if provided, will be merged to the ids field separated by an underscore using the upd_famid() function.

The following columns are also recognized and will be transformed with the vect_to_binary() function:

  • sterilisation status -> steril

  • available status -> avail

  • vitalStatus, is the individual dead -> status

  • affection status -> affected

The values recognized for those columns are 1 or 0, TRUE or FALSE.
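
As a rough illustration of the conversion described above, the base R sketch below maps the accepted values to 1, 0 or NA. The helper to_binary_sketch() is hypothetical and written for this page only; it is not the package's vect_to_binary(), whose exact behaviour may differ.

```r
# Hypothetical helper, NOT the package's vect_to_binary():
# maps 1 / TRUE to 1, 0 / FALSE to 0, and anything else to NA.
to_binary_sketch <- function(x) {
    x <- as.character(x)
    out <- rep(NA_real_, length(x))
    out[x %in% c("1", "TRUE")] <- 1
    out[x %in% c("0", "FALSE")] <- 0
    out
}

# Mixed input, similar to the columns of the example data further down
res_binary <- to_binary_sketch(c("TRUE", "FALSE", 1, 0, "A", NA))
res_binary
# 1 0 1 0 NA NA
```

Under this assumption, an unrecognized value such as "A" ends up as NA rather than raising an error.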

na_strings

Vector of strings to be considered as NA values.

missid

A character vector of the identifiers that denote a missing value. Every id, dadid and momid matching one of those values will be set to NA_character_.
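
The recoding can be pictured with a few lines of base R. This is only a sketch of the idea, assuming parents coded as "0" or "" should count as missing; it does not reproduce the internals of norm_ped().

```r
# Illustrative only: recode the identifiers listed in missid to NA
dadid <- c("A", "0", "1", "3", "", "4")
missid <- c("0", "")  # assumed codes meaning "no known individual"

dadid[dadid %in% missid] <- NA_character_
dadid
# "A" NA "1" "3" NA "4"
```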

try_num

Boolean defining if the function should try to convert all the columns to numeric.

cols_used_del

Boolean defining if the columns that will be used should be deleted.

Value

A dataframe with the different variables correctly standardized and with the errors identified in the error column.

Details

Normalise a dataframe and check the correspondence of its columns so that it can be used as input to create a Ped object. Multiple tests are run and errors are checked. Sex is calculated based on the gender column.

The steril column needs to be a boolean: TRUE, FALSE or 'NA'. Any individual with no 'NA' value in the available column is considered available. A duplicated indId nullifies the relationships of that individual. All individuals with errors are removed from the dataframe and transferred to the error dataframe.

A number of checks are done to ensure the dataframe is correct:

On identifiers:

  • No identifier (id, dadid, momid, famid) is empty (!= "")

  • All ids are unique (no duplicates)

  • Every dadid and momid matches a single entry in the id column (no duplicates)

  • id is not the same as dadid or momid

  • Each individual has either both parents or none
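
A few of the identifier checks listed above can be sketched in base R as follows. The data and the flag names (dup_id, self_parent, one_parent) are made up for illustration; the package performs its own checks and may report them differently.

```r
# Toy pedigree with deliberate errors (hypothetical data)
ped <- data.frame(
    id    = c("1", "2", "3", "3", "5"),
    dadid = c(NA, NA, "1", "1", "5"),
    momid = c(NA, NA, "2", NA, "2"),
    stringsAsFactors = FALSE
)

# id appears more than once
dup_id <- ped$id %in% ped$id[duplicated(ped$id)]

# individual listed as its own father or mother
self_parent <- (!is.na(ped$dadid) & ped$id == ped$dadid) |
    (!is.na(ped$momid) & ped$id == ped$momid)

# exactly one of the two parents is known
one_parent <- xor(is.na(ped$dadid), is.na(ped$momid))

data.frame(id = ped$id, dup_id, self_parent, one_parent)
```

Here both rows with id "3" are flagged as duplicated, the second of them is also flagged for having a father but no mother, and individual "5" is flagged as being its own father.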

On sex:

  • All sex codes are either male, female, terminated or unknown.

  • No parent is sterile

  • All fathers are male

  • All mothers are female
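
The sex-related checks can be sketched in the same spirit. Again this is a minimal base R illustration on invented data, assuming sex has already been normalised to the four labels above; the variable names are hypothetical and not part of the package.

```r
# Toy pedigree, already normalised (hypothetical data)
ped <- data.frame(
    id     = c("1", "2", "3", "4"),
    dadid  = c(NA, NA, "1", "2"),
    momid  = c(NA, NA, "2", "1"),
    sex    = c("male", "female", "female", "unknown"),
    steril = c(FALSE, TRUE, FALSE, NA),
    stringsAsFactors = FALSE
)

# sex code outside the accepted set
bad_sex <- !ped$sex %in% c("male", "female", "terminated", "unknown")

# who is used as a father or as a mother
is_father <- ped$id %in% ped$dadid
is_mother <- ped$id %in% ped$momid

# fathers must be male, mothers must be female
bad_father <- is_father & ped$sex != "male"
bad_mother <- is_mother & ped$sex != "female"

# a parent cannot be sterile (NA is treated as "not known to be sterile")
steril_parent <- (is_father | is_mother) & ped$steril %in% TRUE

data.frame(id = ped$id, bad_sex, bad_father, bad_mother, steril_parent)
```

In this toy data, individual "2" is flagged as a female father and as a sterile parent, and individual "1" as a male mother.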

Examples

df <- data.frame(
    indId = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
    fatherId = c("A", 0, 1, 3, 0, 4, 1, 0, 6, 6),
    motherId = c(0, 0, 2, 2, 0, 5, 2, 0, 8, 8),
    gender = c(1, 2, "m", "man", "f", "male", "m", "m", "f", "f"),
    available = c("A", "1", 0, NA, 1, 0, 1, 0, 1, 0),
    famid = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
    sterilisation = c("TRUE", "FALSE", TRUE, FALSE, 1, 0, 1, 0, 1, "TRUE"),
    vitalStatus = c("TRUE", "FALSE", TRUE, FALSE, 1, 0, 1, 0, 1, 0),
    affection = c("TRUE", "FALSE", TRUE, FALSE, 1, 0, 1, 0, 1, 0)
)
tryCatch(
    norm_ped(df),
    error = function(e) print(e)
)
#> <simpleError in check_columns(ped_df, cols_need, cols_used, cols_to_use, others_cols = TRUE,     cols_to_use_init = TRUE, cols_used_init = TRUE, cols_used_del = cols_used_del): Columns : famid are used by the script and would be overwritten.
#> >