Including null values in an Apache Spark join

I would like to include null values in an Apache Spark join. Spark doesn't include rows with null in a join by default.

Here is the default Spark behavior.

    val numbersDf = Seq(
      ("123"),
      ("456"),
      (null),
      ("")
    ).toDF("numbers")

    val lettersDf = Seq(
      ("123", "abc"),
      ("456", "def"),
      (null, "zzz"),
      ("", "hhh")
    ).toDF("numbers", "letters")

    val joinedDf = numbersDf.join(lettersDf, Seq("numbers"))

Here is the output of joinedDf.show():

    +-------+-------+
    |numbers|letters|
    +-------+-------+
    |    123|    abc|
    |    456|    def|
    |       |    hhh|
    +-------+-------+

This is the output I would like:

    +-------+-------+
    |numbers|letters|
    +-------+-------+
    |    123|    abc|
    |    456|    def|
    |       |    hhh|
    |   null|    zzz|
    +-------+-------+

Spark provides a special NULL-safe equality operator:

    numbersDf
      .join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
      .drop(lettersDf("numbers"))
    +-------+-------+
    |numbers|letters|
    +-------+-------+
    |    123|    abc|
    |    456|    def|
    |   null|    zzz|
    |       |    hhh|
    +-------+-------+
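If you prefer a named method over the symbolic operator, the Scala Column API also exposes the same null-safe test as eqNullSafe; a minimal sketch, equivalent to the join above:

    // Same null-safe join, written with the named method instead of <=>
    numbersDf
      .join(lettersDf, numbersDf("numbers").eqNullSafe(lettersDf("numbers")))
      .drop(lettersDf("numbers"))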

Be careful not to use it with Spark 1.5 or earlier. Prior to Spark 1.6 it required a Cartesian product (SPARK-11111 – Fast null-safe join).
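If you are unsure what your version actually does, one way to check is to inspect the physical plan; a quick sketch (on Spark 1.6+ the plan should no longer contain a Cartesian product):

    // Print the physical plan for the null-safe join;
    // on Spark 1.6+ it should not show a CartesianProduct node.
    numbersDf
      .join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
      .explain()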

In Spark 2.3.0 or later you can use Column.eqNullSafe in PySpark:

    numbers_df = sc.parallelize([
        ("123", ), ("456", ), (None, ), ("", )
    ]).toDF(["numbers"])

    letters_df = sc.parallelize([
        ("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")
    ]).toDF(["numbers", "letters"])

    numbers_df.join(letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers))
    +-------+-------+-------+
    |numbers|numbers|letters|
    +-------+-------+-------+
    |    456|    456|    def|
    |   null|   null|    zzz|
    |       |       |    hhh|
    |    123|    123|    abc|
    +-------+-------+-------+

and %<=>% in SparkR:

    numbers_df <- createDataFrame(data.frame(numbers = c("123", "456", NA, "")))
    letters_df <- createDataFrame(data.frame(
      numbers = c("123", "456", NA, ""),
      letters = c("abc", "def", "zzz", "hhh")
    ))

    head(join(numbers_df, letters_df, numbers_df$numbers %<=>% letters_df$numbers))
      numbers numbers letters
    1     456     456     def
    2    <NA>    <NA>     zzz
    3                     hhh
    4     123     123     abc

With SQL (Spark 2.2.0+) you can use IS NOT DISTINCT FROM:

    SELECT * FROM numbers JOIN letters
    ON numbers.numbers IS NOT DISTINCT FROM letters.numbers
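For that query to run as written, the DataFrames have to be visible to Spark SQL under those names; a minimal Scala sketch that registers them as temporary views first (Spark 2.0+ API, assuming the usual spark session object):

    // Register the DataFrames as temporary views so the SQL query can reference them
    numbersDf.createOrReplaceTempView("numbers")
    lettersDf.createOrReplaceTempView("letters")

    spark.sql("""
      SELECT * FROM numbers JOIN letters
      ON numbers.numbers IS NOT DISTINCT FROM letters.numbers
    """).show()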

This can be used with the DataFrame API as well:

    numbersDf.alias("numbers")
      .join(lettersDf.alias("letters"))
      .where("numbers.numbers IS NOT DISTINCT FROM letters.numbers")
Alternatively, you can spell the null-safe condition out by hand (here combined with an outer join, so unmatched rows are kept as well):

    // rename columns so that we can disambiguate them in the join
    val numbers2 = numbersDf.withColumnRenamed("numbers", "num1")
    val letters2 = lettersDf.withColumnRenamed("numbers", "num2")

    val joinedDf = numbers2.join(
      letters2,
      $"num1" === $"num2" || ($"num1".isNull && $"num2".isNull),
      "outer"
    )

    // rename the columns back to the original names
    joinedDf.select("num1", "letters")
      .withColumnRenamed("num1", "numbers")
      .show
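For join purposes this condition behaves like <=>: when only one side is null, $"num1" === $"num2" evaluates to null, which the join treats as false, and the explicit isNull conjunct picks up the case where both sides are null.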