Assigning factor values in R or Python based on date: creating a school-break variable

I have the following dataset (Break_data), compiled from the school calendar, with the start and end dates of each break:

    print(Break_data)
           Start        End          Break Year
    1 2016-02-24 2016-02-29   Spring_Break 2016
    2 2016-03-23 2016-03-28  Easter_Recess 2016
    3 2016-10-05 2016-10-10 Mid_Term_Break 2016
    4 2017-03-01 2017-03-06   Spring_Break 2017
    5 2017-04-12 2017-04-17  Easter_Recess 2017
    6 2017-10-04 2017-10-09 Mid_Term_Break 2017
    7 2018-02-28 2018-03-05   Spring_Break 2018
    8 2018-03-28 2018-04-02  Easter_Recess 2018

And this is df, a very large dataset:

    head(df$date)
    [1] "2016-02-05" "2016-02-05" "2016-02-05" "2016-02-05" "2016-02-05" "2016-02-05"
    tail(df$date)
    [1] "2018-07-12" "2018-07-12" "2018-07-12" "2018-07-12" "2018-07-12" "2018-07-12"

I am following the steps provided in: https://stackoverflow.com/a/51052626/9341589

I want to create a similar factor variable, Break, in df by comparing each date against the ranges in Break_data. df contains many variables besides date and runs from 2016-02-05 to 2018-07-12. The sampling interval is 15 minutes, so each day has 96 rows.

In my case, in addition to the values shown in the table, I want every date that does not fall between a Start and an End to be labeled Non_Break.
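To make the intended result concrete, here is a minimal Python sketch of the labeling rule, assuming each timestamp is compared against the inclusive Start and End dates. The names BREAKS and label_for are hypothetical, only the first two breaks from the table are included, and one day of generated 15-minute timestamps stands in for the real df.

```python
from datetime import date, datetime, timedelta

# Illustrative subset of Break_data: (Start, End, Break), both ends inclusive.
BREAKS = [
    (date(2016, 2, 24), date(2016, 2, 29), "Spring_Break"),
    (date(2016, 3, 23), date(2016, 3, 28), "Easter_Recess"),
]

def label_for(ts):
    """Return the break name covering timestamp ts, or "Non_Break" if none does."""
    day = ts.date()  # only the calendar date matters, not the time of day
    for start, end, name in BREAKS:
        if start <= day <= end:
            return name
    return "Non_Break"

# One day sampled every 15 minutes gives 96 rows.
first = datetime(2016, 2, 24)
timestamps = [first + timedelta(minutes=15 * k) for k in range(96)]
labels = [label_for(ts) for ts in timestamps]
```

Every row of a break day receives the same label, and any date outside the listed ranges falls through to Non_Break.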

Following the steps in the link above, this is my modified version of the R code:

    library(lubridate)  # provides ymd()

    Break_data$Start <- ymd(Break_data$Start)
    Break_data$End <- ymd(Break_data$End)
    df$date <- ymd(df$date)

    # expand each Start:End range into individual days, remembering the source row
    LU <- Map(`:`, Break_data$Start, Break_data$End)
    LU <- data.frame(value = unlist(LU),
                     index = rep(seq_along(LU), lapply(LU, length)))

    df$Break <- Break_data$Break[LU$index[match(df$date, LU$value)]]

I suppose that, on top of this, I have to assign Non_Break with a for loop or a simple if statement whenever the date is not within any of the start and end ranges.

Edit: I tried two different approaches.

FIRST: without using the mapping

    for (i in c(1:nrow(df))){
      if (df$date[i] >= "2016-02-24" & df$date <= "2016-02-29") df$Break[i]<-"Spring_Break"
      if (df$date[i] >= "2016-03-23" & df$date <= "2016-03-28") df$Break[i]<-"Easter_Recess"
      if (df$date[i] >= "2016-10-05" & df$date <= "2016-10-10") df$Break[i]<-"Mid_Term_Break"
      if (df$date[i] >= "2017-03-01" & df$date <= "2017-03-06") df$Break[i]<-"Spring_Break"
      if (df$date[i] >= "2017-04-12" & df$date <= "2017-04-17") df$Break[i]<-"Easter_Recess"
      if (df$date[i] >= "2017-10-04" & df$date <= "2017-10-09") df$Break[i]<-"Mid_Term_Break"
      if (df$date[i] >= "2018-02-28" & df$date <= "2018-03-05") df$Break[i]<-"Spring_Break"
      if (df$date[i] >= "2018-03-28" & df$date <= "2018-04-02") df$Break[i]<-"Easter_Recess"
      else (df$Break[i]<-"Not_Break")
    }

The first one runs forever 🙂 and I end up with only two values, Not_Break and Spring_Break.

And these are the warning messages:

    Warning messages:
    1: In if (df$date[i] >= "2016-02-24" & df$date <= "2016-02-29") df$Break[i] = ... :
      the condition has length > 1 and only the first element will be used
    2: In if (df$date[i] >= "2016-03-23" & df$date <= "2016-03-28") df$Break[i] = ... :
      the condition has length > 1 and only the first element will be used
    3: In if (df$date[i] >= "2016-10-05" & df$date <= "2016-10-10") df$Break[i] = ... :
      the condition has length > 1 and only the first element will be used
    4: In if (df$date[i] >= "2017-03-01" & df$date <= "2017-03-06") df$Break[i] = ... :
      the condition has length > 1 and only the first element will be used
    5: In if (df$date[i] >= "2017-04-12" & df$date <= "2017-04-17") df$Break[i] = ... :
      the condition has length > 1 and only the first element will be used
    6: In if (df$date[i] >= "2017-10-04" & df$date <= "2017-10-09") df$Break[i] = ... :
      the condition has length > 1 and only the first element will be used
    7: In if (df$date[i] >= "2018-02-28" & df$date <= "2018-03-05") df$Break[i] = ... :
      the condition has length > 1 and only the first element will be used
    8: In if (df$date[i] >= "2018-03-28" & df$date <= "2018-04-02") df$Break[i] <- "Easter_Recess" else (df$Break[i] <- ... :
      the condition has length > 1 and only the first element will be used

The same eight warnings then repeat in the same order through warning 50.

SECOND: adding the code from the link

    LU <- Map(`:`, Break_data$Start, Break_data$End)
    LU <- data.frame(value = unlist(LU),
                     index = rep(seq_along(LU), lapply(LU, length)))
    for (i in c(1:nrow(df))){
      if (df$Break = "2016-02-05" & df$date <= "2018-07-12") df$Break[i]<-"Not_Break"
    }

The second one also gives me an error. Any modification to the code, or an alternative implementation (in R or Python), would be appreciated.

Is there a more efficient way to do this?
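For the Python side of the question, one option that avoids testing every range for every row is a binary search over the sorted Start dates. This is only a sketch: it assumes the ranges are sorted by Start and do not overlap (true of the table above), and the names BREAKS, STARTS and find_break are hypothetical.

```python
from bisect import bisect_right
from datetime import date

# (Start, End, Break) rows from Break_data, sorted by Start, both ends inclusive.
BREAKS = [
    (date(2016, 2, 24), date(2016, 2, 29), "Spring_Break"),
    (date(2016, 3, 23), date(2016, 3, 28), "Easter_Recess"),
    (date(2016, 10, 5), date(2016, 10, 10), "Mid_Term_Break"),
    (date(2017, 3, 1), date(2017, 3, 6), "Spring_Break"),
    (date(2017, 4, 12), date(2017, 4, 17), "Easter_Recess"),
    (date(2017, 10, 4), date(2017, 10, 9), "Mid_Term_Break"),
    (date(2018, 2, 28), date(2018, 3, 5), "Spring_Break"),
    (date(2018, 3, 28), date(2018, 4, 2), "Easter_Recess"),
]
STARTS = [start for start, _, _ in BREAKS]

def find_break(day):
    """Binary-search the last range starting on or before day, then check its End."""
    i = bisect_right(STARTS, day) - 1
    if i >= 0 and day <= BREAKS[i][1]:
        return BREAKS[i][2]
    return "Non_Break"
```

Because each day has 96 rows with the same date, it may also be worth calling find_break once per distinct date and reusing the result for every row of that day.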

Note: the datasets are publicly available at: https://github.com/tomiscat/data

    library(lubridate)

    # data
    Break_data <- data.table::fread(
    "Start      End        Break          Year
    2016-02-24 2016-02-29 Spring_Break   2016
    2016-03-23 2016-03-28 Easter_Recess  2016
    2016-10-05 2016-10-10 Mid_Term_Break 2016
    2017-03-01 2017-03-06 Spring_Break   2017
    2017-04-12 2017-04-17 Easter_Recess  2017
    2017-10-04 2017-10-09 Mid_Term_Break 2017
    2018-02-28 2018-03-05 Spring_Break   2018
    2018-03-28 2018-04-02 Easter_Recess  2018"
    )

    df <- data.frame(
      date = c("2016-02-05", "2016-02-05", "2016-02-05", "2016-02-05", "2016-02-05",
               "2016-02-05", "2016-02-26", "2016-10-07", "2018-03-30", "2018-07-12",
               "2018-07-12", "2018-07-12", "2018-07-12", "2018-07-12", "2018-07-12")
    )

    # mapping
    Break_data$Start <- ymd(Break_data$Start)
    Break_data$End <- ymd(Break_data$End)
    df$date <- ymd(df$date)

    LU <- Map(`:`, Break_data$Start, Break_data$End)
    LU <- data.frame(value = unlist(LU),
                     index = rep(seq_along(LU), lapply(LU, length)))

    df$Break <- Break_data$Break[LU$index[match(df$date, LU$value)]]

    # if not mapped (df$Break is NA), set it to "Non_Break"
    df$Break <- ifelse(is.na(df$Break), "Non_Break", df$Break)
    df$Break <- factor(df$Break)

    df
    #>          date          Break
    #> 1  2016-02-05      Non_Break
    #> 2  2016-02-05      Non_Break
    #> 3  2016-02-05      Non_Break
    #> 4  2016-02-05      Non_Break
    #> 5  2016-02-05      Non_Break
    #> 6  2016-02-05      Non_Break
    #> 7  2016-02-26   Spring_Break
    #> 8  2016-10-07 Mid_Term_Break
    #> 9  2018-03-30  Easter_Recess
    #> 10 2018-07-12      Non_Break
    #> 11 2018-07-12      Non_Break
    #> 12 2018-07-12      Non_Break
    #> 13 2018-07-12      Non_Break
    #> 14 2018-07-12      Non_Break
    #> 15 2018-07-12      Non_Break

Created on 2018-08-19 by the reprex package (v0.2.0).
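Since the question also asks about Python, the same lookup idea can be sketched with a plain dictionary: expand each range into its individual days (the role LU plays in the R code) and fall back to Non_Break when a date is missing. This is a rough equivalent under that assumption, not a translation of the R internals; the names break_data, lookup, dates and labels are hypothetical.

```python
from datetime import date, timedelta

# (Start, End, Break) rows from Break_data, both ends inclusive.
break_data = [
    (date(2016, 2, 24), date(2016, 2, 29), "Spring_Break"),
    (date(2016, 3, 23), date(2016, 3, 28), "Easter_Recess"),
    (date(2016, 10, 5), date(2016, 10, 10), "Mid_Term_Break"),
    (date(2017, 3, 1), date(2017, 3, 6), "Spring_Break"),
    (date(2017, 4, 12), date(2017, 4, 17), "Easter_Recess"),
    (date(2017, 10, 4), date(2017, 10, 9), "Mid_Term_Break"),
    (date(2018, 2, 28), date(2018, 3, 5), "Spring_Break"),
    (date(2018, 3, 28), date(2018, 4, 2), "Easter_Recess"),
]

# Expand every range into one dictionary entry per calendar day.
lookup = {}
for start, end, name in break_data:
    for offset in range((end - start).days + 1):
        lookup[start + timedelta(days=offset)] = name

# Distinct dates from the sample df; unmatched dates default to "Non_Break".
dates = [date(2016, 2, 5), date(2016, 2, 26), date(2016, 10, 7),
         date(2018, 3, 30), date(2018, 7, 12)]
labels = [lookup.get(d, "Non_Break") for d in dates]
print(labels)
# ['Non_Break', 'Spring_Break', 'Mid_Term_Break', 'Easter_Recess', 'Non_Break']
```

The dictionary holds one entry per break day, so its size depends on the calendar rather than on the number of rows in df.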

Edit: complete solution