Monthly Archives: January 2019

File System Operation in Pyspark

Sometimes (unfortunately) we need to do file operations directly in Pyspark. Here is one way to do that:
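The excerpt cuts off before the code, so below is a minimal sketch of one common approach: driving the Hadoop FileSystem API through Spark's JVM gateway. Note that `spark._jvm` and `spark._jsc` are internal attributes, and the paths here are placeholders.

```python
# A sketch, not the post's original code: file-system operations via the
# Hadoop FileSystem API exposed on Spark's (internal) JVM gateway.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Handle to the Hadoop FileSystem for the cluster's default configuration
hadoop = spark._jvm.org.apache.hadoop.fs
fs = hadoop.FileSystem.get(spark._jsc.hadoopConfiguration())

src = hadoop.Path("/tmp/source.csv")   # placeholder paths
dst = hadoop.Path("/tmp/dest.csv")

print(fs.exists(src))   # check whether a path exists
fs.rename(src, dst)     # move / rename
fs.delete(dst, True)    # delete; True = recursive (for directories)
```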

Posted in Big Data

Pyspark vector to list

In Pyspark, when using ml functions, the inputs/outputs are normally vectors, but sometimes we want to convert them to/from lists. The two directions covered here are: list to vector, and dense/sparse vector to list (array).
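A minimal sketch of both directions, assuming the `pyspark.ml.linalg` types; the values and column names are illustrative.

```python
# A sketch of both conversions, assuming pyspark.ml.linalg vectors.
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# list -> vector
dv = Vectors.dense([1.0, 2.0, 3.0])            # dense
sv = Vectors.sparse(3, [0, 2], [1.0, 3.0])     # sparse: size, indices, values

# dense/sparse vector -> list (via the underlying numpy array)
dv.toArray().tolist()   # [1.0, 2.0, 3.0]
sv.toArray().tolist()   # [1.0, 0.0, 3.0]

# The same vector -> list conversion on a DataFrame column needs a UDF
vector_to_list = udf(lambda v: v.toArray().tolist(),
                     ArrayType(DoubleType()))
# df = df.withColumn("features_list", vector_to_list("features"))
```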

Posted in Big Data

Pyspark UDF

UDF is particularly useful when writing Pyspark code. We can define the function we want and then apply it back to dataframes. The steps: import everything, create the function, make it a UDF, call this UDF (see the sketch below). Key notes: 1) we need to carefully define the …
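A minimal sketch of those four steps; the function, return type, and column names are illustrative. Note the explicit return type in step 3, which a Pyspark UDF requires you to declare.

```python
# A sketch of the four steps; function and column names are illustrative.
# 1) Import everything
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# 2) Create the function (plain Python)
def capitalize_name(s):
    return s.capitalize() if s is not None else None

# 3) Make it a UDF -- the return type must be declared explicitly
capitalize_udf = udf(capitalize_name, StringType())

# 4) Call this UDF on a column
df.withColumn("name_cap", capitalize_udf("name")).show()
```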

Posted in Big Data

Pyspark Coding Quick Start

In most cloud platforms, writing Pyspark code is a must to process data faster than HiveQL. Here is the cheat sheet I keep for myself when writing that code: import most of the sql functions and …
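The excerpt is truncated, but a typical starting block along those lines is sketched below; the app name, table, and column names are placeholders.

```python
# A sketch of a typical PySpark starting block; names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F       # "most of the sql functions"
from pyspark.sql.types import StringType, DoubleType

spark = (SparkSession.builder
         .appName("quickstart")               # placeholder app name
         .enableHiveSupport()                 # lets spark.sql query Hive
         .getOrCreate())

# Pull a table the way HiveQL would, then transform it as a DataFrame
df = spark.sql("SELECT * FROM some_db.some_table")   # hypothetical table

(df.filter(F.col("amount") > 0)
   .groupBy("category")
   .agg(F.sum("amount").alias("total"))
   .orderBy(F.desc("total"))
   .show(10))
```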

Posted in Big Data