Imagine you are given a task to parse thousands of xml files to extract the information, write the records into table format with proper data types, the task must be done in a timely manner and is repeated every hour. What are you going to do? With Apache Spark, the embarrassingly parallel processing framework, it can be done with much less effort.