
Showing posts from August, 2013

Aggregation over Data Partition with Apache PIG with nested Foreach

While working on a Big Data project for a major US retailer, we had to slice and dice the sales transaction table intensively. Some of the built-in Apache PIG features, such as aggregation over a data partition with nested foreach, made our job easier by helping us avoid writing custom code.

Here is a very simplified version of a retailer's transaction table.

Transaction Id   Product Id   Transaction Type   Quantity
1000             1            S                  10
1001             1            R                  3
1002             2            S                  6

Typically, sales transactions are written in journal-entry fashion. Two common types of transactions are Sales and Return. For example, Transaction Ids 1000 and 1002 are sales transactions, where Transaction Type is “S”. Transaction Id 1001 is a return transaction, where Transaction Type is “R”.

Suppose a business report needs to show the Total Sale, Total Return and Net Sale (Total Sale - Total Return) for each Product Id. The report should look like the following table.

Product ID   Total Sale   Total Return   Net Sale
1            10           3              7
2            6            ...
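A report of this shape can be sketched in Pig Latin with a nested foreach. This is a minimal sketch and not necessarily the script used on the project. The input path `transactions.csv`, the comma delimiter, and all field and alias names are assumptions made for illustration. It also assumes that a product with no rows of a given type should show 0, because `SUM` over an empty bag yields null in Pig.

```pig
-- Hypothetical input file and schema; adjust path, delimiter and names to the real data.
txns = LOAD 'transactions.csv' USING PigStorage(',')
       AS (txn_id:int, product_id:int, txn_type:chararray, quantity:int);

-- Partition the transactions by product.
by_product = GROUP txns BY product_id;

-- Nested foreach: filter each product's bag by transaction type, then aggregate.
totals = FOREACH by_product {
    sale_rows   = FILTER txns BY txn_type == 'S';
    return_rows = FILTER txns BY txn_type == 'R';
    GENERATE group AS product_id,
             -- SUM over an empty bag is null, so fall back to 0 (an assumption).
             (SUM(sale_rows.quantity) IS NULL ? 0L : SUM(sale_rows.quantity))     AS total_sale,
             (SUM(return_rows.quantity) IS NULL ? 0L : SUM(return_rows.quantity)) AS total_return;
};

-- Net Sale = Total Sale - Total Return, per product.
report = FOREACH totals GENERATE product_id, total_sale, total_return,
                                 total_sale - total_return AS net_sale;

DUMP report;
```

The two `FILTER` statements inside the braces run against each product's bag separately, so sales and returns are aggregated in a single pass over the grouped data without a custom UDF or a self-join.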