PlaygRound

Data Lake: How to Dig In?

There is an introductory article on WeChat about the AWS services that can help build a data lake. I translated it into English as I found it quite intuitive.


AWS services for Data Engineers

Here is a list of AWS services that a data engineer like me commonly interacts with during daily tasks.


Run a Machine Learning Pipeline with PySpark

Here are my notes on building a machine learning pipeline with PySpark, taken while following a course on DataCamp.


Tackle ETL challenges with Spark

Let's take a deep dive into how Spark can tackle some of the ETL pipeline challenges that data engineers face in their daily work.


Use AWS Lambda to process XML files

I have written a post about using Apache Spark to process multiple XML files and convert them into a tabular format. In this post, we are going to use AWS Lambda, S3 and Athena to achieve the same result.
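The core of the Lambda approach is flattening XML records into rows that Athena can query. Here is a minimal sketch of that transformation in plain Python; the function names and the sample `<orders>` document are illustrative, and in the actual Lambda handler the XML would be read from S3 via boto3 and the CSV written back to S3.

```python
import csv
import io
import xml.etree.ElementTree as ET

def xml_to_rows(xml_string, record_tag):
    """Flatten each <record_tag> element into a dict of child tag -> text."""
    root = ET.fromstring(xml_string)
    return [{child.tag: child.text for child in rec}
            for rec in root.iter(record_tag)]

def rows_to_csv(rows, fieldnames):
    """Serialize rows to CSV text, ready to be stored on S3 and queried by Athena."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

sample = "<orders><order><id>1</id><amount>9.5</amount></order></orders>"
rows = xml_to_rows(sample, "order")
print(rows)  # [{'id': '1', 'amount': '9.5'}]
```

Once the CSV lands in an S3 prefix, an Athena table defined over that prefix makes the data immediately queryable with SQL.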


Manipulate Datetime and Timestamp in PySpark

Apache Spark is a general-purpose computation framework well known for big data processing. This post shows how to manipulate time-series data with PySpark.


Calculate Moving Average with Python, SQL and R

Nowadays time-series data are ubiquitous, from mobile networks and IoT devices to financial markets. The moving average is a simple yet fundamental method in time-series analysis; for example, the MA crossover is one of the strategies used in quantitative trading. Here we will see how to compute a moving average using Python, SQL and R.
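As a taste of the Python version, here is a minimal sketch of a simple moving average using a running sum (the function name is illustrative; the full post also covers the SQL and R equivalents):

```python
def moving_average(values, window):
    """Return the simple moving average of `values` over `window` points.

    Only full windows are averaged, so the result has
    len(values) - window + 1 entries.
    """
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    averages = []
    running = sum(values[:window])  # sum of the first full window
    averages.append(running / window)
    for i in range(window, len(values)):
        # slide the window one step: add the new value, drop the oldest
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages

print(moving_average([1, 2, 3, 4, 5], 3))  # → [2.0, 3.0, 4.0]
```

The running-sum trick keeps the computation O(n) instead of re-summing each window from scratch.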


Upgrade R to 3.5+ on Ubuntu

I needed to upgrade my R installation from 3.4 to 3.5+ on Ubuntu 16.04. Here is how.
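The usual route is to add the CRAN Ubuntu repository that carries R 3.5+ for xenial and install from there. A sketch of the steps follows; the repository URL and signing key are the ones CRAN documents for Ubuntu 16.04, but check cran.r-project.org before running, as they can change.

```shell
# 1. Add the CRAN repository that carries R 3.5+ for Ubuntu 16.04 (xenial)
echo "deb https://cloud.r-project.org/bin/linux/ubuntu xenial-cran35/" | \
    sudo tee -a /etc/apt/sources.list

# 2. Add the signing key for the CRAN Ubuntu repository
sudo apt-key adv --keyserver keyserver.ubuntu.com \
    --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9

# 3. Refresh the package index and install the new R
sudo apt-get update
sudo apt-get install r-base

# 4. Confirm the upgraded version
R --version
```

Previously installed packages built against R 3.4 may need to be reinstalled after the upgrade.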


Customer Segmentation with BigQuery ML

Customer segmentation is a marketing strategy that divides customers into groups based on shared characteristics, so that marketing teams can tailor their content and media to each audience. BigQuery, the analytics data warehouse on Google Cloud, now enables users to create and execute machine learning models with standard SQL, making such a solution easy and quick to build.