PlaygRound

Parsing XML files made simple by PySpark

Imagine you are given a task: parse thousands of XML files, extract the information, and write the records into tables with proper data types. The task must be done in a timely manner and is repeated every hour. What are you going to do? With Apache Spark, a framework built for embarrassingly parallel processing, it can be done with much less effort.
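
As a rough illustration of the per-file step, here is a minimal sketch that turns one XML document into typed rows using only the Python standard library. The element names (`record`, `id`, `name`, `amount`, `created`) and the sample data are assumptions made up for this example, not the schema used in the post.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def parse_records(xml_text):
    """Turn one XML document into a list of typed row dicts.

    The tag names below are hypothetical; adapt them to the real files.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for rec in root.iter("record"):
        rows.append({
            "id": int(rec.findtext("id")),
            "name": rec.findtext("name"),
            "amount": float(rec.findtext("amount")),
            "created": datetime.strptime(rec.findtext("created"), "%Y-%m-%d").date(),
        })
    return rows


# Made-up sample document with two records.
sample = """<records>
  <record><id>1</id><name>alpha</name><amount>9.5</amount><created>2020-01-31</created></record>
  <record><id>2</id><name>beta</name><amount>12</amount><created>2020-02-01</created></record>
</records>"""

rows = parse_records(sample)
for row in rows:
    print(row)
```

With PySpark, one way to distribute a function like this is `spark.sparkContext.wholeTextFiles(path).flatMap(lambda kv: parse_records(kv[1]))`, where `path` is a placeholder for the input directory; each file is then parsed independently on the executors.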


Data visualization made easy with Flexdashboard

Flexdashboard is a hidden gem, known mainly to people who use R. In my opinion, it is a great visualization tool. You can use flexdashboard in place of expensive commercial tools like Tableau and Power BI. Most importantly, it is totally free! You just need to know how to write code in R.


Terraform At a Glance

This is an excerpt from Qwiklabs. It is a quick introduction to Terraform, an open source Infrastructure as Code tool for creating, changing, and versioning infrastructure safely and efficiently.


Build a serverless text to speech endpoint

Implement a serverless, event-driven HTTP endpoint that converts text to speech using Cloud Functions, the Cloud Text-to-Speech API, and Cloud Storage.

Try it out with this link. You can replace the text with whatever you want. Have fun!


Streaming process NASA web access logs on GCP

This is a draft solution that ingests NASA web access logs, processes and cleans the data, and stores it in a data warehouse for further analysis.

It is implemented with GCP products, including Pub/Sub, Dataflow, and BigQuery.
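
The cleaning step could look something like the sketch below, which parses one access-log line in Common Log Format into a typed record. It assumes the logs follow that format; the output field names and the sample line are illustrative choices, not the schema from the post. In a Dataflow pipeline, a function like this would typically be applied to each Pub/Sub message before writing to BigQuery.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [timestamp] "request" status size
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of typed fields, or None if the line is malformed."""
    m = CLF.match(line)
    if m is None:
        return None
    parts = m.group("request").split()
    return {
        "host": m.group("host"),
        "timestamp": datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z"),
        "method": parts[0] if parts else None,
        "path": parts[1] if len(parts) > 1 else None,
        "status": int(m.group("status")),
        # A size of "-" means no body was sent; store it as 0 bytes.
        "bytes": 0 if m.group("size") == "-" else int(m.group("size")),
    }


# Illustrative line in Common Log Format.
line = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
row = parse_line(line)
print(row)
```

Returning `None` for malformed lines keeps the decision about bad input (drop, count, or route to a dead-letter table) in the caller rather than in the parser.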


Spark source code snippets

I collected the source code snippets from the book Spark: The Definitive Guide in one place. They cover most of the DataFrame and Spark SQL operations and functions that come up in day-to-day Spark code.


Install Docker on Debian/Ubuntu

A quick reference for myself regarding the steps to install Docker on Debian/Ubuntu.