Monthly Archives: January 2019
File System Operation in Pyspark
Sometimes (unfortunately) we need to do file operations directly in Pyspark. Here is the way to do that:
Pyspark vector to list
In Pyspark, when using ml functions, the inputs/outputs are normally vectors, but sometimes we want to convert them to/from lists: list to dense/sparse vector, and dense/sparse vector to list (array).
Pyspark UDF
UDFs are particularly useful when writing Pyspark code. We can define the function we want and then apply it back to dataframes. The steps: import everything, create the function, make it a UDF, call this UDF. Key notes: 1) we need to carefully define the …
Pyspark Coding Quick Start
In most cloud platforms, writing Pyspark code is a must to process data faster compared with HiveQL. Here is the cheat sheet I use myself when writing that code: import most of the sql functions and …