Aggregation over Data Partition with Apache PIG with nested Foreach

August 25, 2013

While working on a Big Data project for a major US retailer, we had to slice and dice the sales transaction table intensively. Some of the built-in Apache PIG features, such as aggregation over a data partition with a nested FOREACH, made our job easier by sparing us from writing custom code.
Here is a very simplified version of a retailer's transaction table.

Transaction Id	Product Id	Transaction Type	Quantity
1000	1	S	10
1001	1	R	3
1002	2	S	6

Typically, sales transactions are written in journal-entry fashion. Two common types of transaction are Sale and Return. For example, Transaction Ids 1000 and 1002 are sale transactions, where Transaction Type is “S”. Transaction Id 1001 is a return transaction, where Transaction Type is “R”.
Suppose a business report needs to show the Total Sale, Total Return and Net Sale (Total Sale - Total Return) for each Product Id. The report should look like the following table.

Product ID	Total Sale	Total Return	Net Sale
1	10	3	7
2	6	0	6

This report requires pivoting and aggregating over the Product Id partition of the dataset. In SQL, the OVER and PARTITION BY syntax would achieve the goal. In Apache PIG, the same can be done with a nested FOREACH block, which is written with curly brackets.
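For comparison, the SQL side can be sketched as follows. This is only an illustration and not part of the original project: it runs against an in-memory table through Python's built-in sqlite3 module, and it uses GROUP BY with conditional SUM rather than OVER and PARTITION BY, because the report wants one row per product. The table name `tran` and the column names are assumptions chosen to mirror the sample table above.

```python
import sqlite3

# Sample rows from the simplified transaction table above.
ROWS = [
    ("1000", 1, "S", 10),
    ("1001", 1, "R", 3),
    ("1002", 2, "S", 6),
]

# Hypothetical table and column names, chosen to mirror the sample table.
REPORT_SQL = """
SELECT product_id,
       SUM(CASE WHEN transaction_type = 'S' THEN quantity ELSE 0 END) AS total_sale,
       SUM(CASE WHEN transaction_type = 'R' THEN quantity ELSE 0 END) AS total_return,
       SUM(CASE WHEN transaction_type = 'S' THEN quantity
                WHEN transaction_type = 'R' THEN -quantity
                ELSE 0 END) AS net_sale
FROM tran
GROUP BY product_id
ORDER BY product_id
"""


def sql_report(rows):
    """Load rows into an in-memory SQLite table and run the report query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE tran (transaction_id TEXT, product_id INTEGER, "
            "transaction_type TEXT, quantity INTEGER)"
        )
        conn.executemany("INSERT INTO tran VALUES (?, ?, ?, ?)", rows)
        return conn.execute(REPORT_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in sql_report(ROWS):
        print(row)
```

With the three sample rows, the query returns one row per product with the same totals as the example report.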

Here is the PIG script with a nested FOREACH:

/* Load the input CSV file */
transaction = LOAD 'tran.csv' USING PigStorage(',')
    AS (transaction_id:chararray, productId:int, transactionType:chararray, quantity:int);

/* Group by the Product Id */
groupByProductId = GROUP transaction BY productId;

/* Aggregation over the partition, inside the curly brackets. Notice that the filter
   and aggregation functions don't operate on the entire dataset. They only work on
   the rows that belong to the current Product Id */
report = FOREACH groupByProductId {
    /* get all the Sale transactions for the current Product Id */
    sale = FILTER transaction BY transactionType == 'S';
    /* get all the Return transactions for the current Product Id */
    return = FILTER transaction BY transactionType == 'R';
    GENERATE group AS productId
        /* get the total sale for the current Product Id */
        , SUM(sale.quantity) AS totalSale
        /* get the total return for the current Product Id. If no return record is found, put zero */
        , (COUNT(return) > 0 ? SUM(return.quantity) : 0) AS totalReturn
        /* get the net sale for the current Product Id */
        , SUM(sale.quantity) - (COUNT(return) > 0 ? SUM(return.quantity) : 0) AS netSale;
}

/* Store the report as comma-separated output */
STORE report INTO 'report.csv' USING PigStorage(',');
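
To make the per-group behaviour concrete, here is a small Python sketch of what the nested FOREACH computes. It is an illustration only, not how PIG executes the script, and the function names are hypothetical. It assumes, as the COUNT check in the script suggests, that SUM over an empty bag yields null (None here), which is why the return total is replaced by zero.

```python
from collections import defaultdict

# Sample rows: (transaction_id, product_id, transaction_type, quantity).
ROWS = [
    ("1000", 1, "S", 10),
    ("1001", 1, "R", 3),
    ("1002", 2, "S", 6),
]


def bag_sum(values):
    """Mimic the assumed SUM behaviour: None for an empty bag."""
    values = list(values)
    return sum(values) if values else None


def nested_foreach_report(rows):
    """Group by product id, then filter and aggregate inside each group."""
    groups = defaultdict(list)  # plays the role of GROUP ... BY productId
    for row in rows:
        groups[row[1]].append(row)

    report = []
    for product_id in sorted(groups):
        bag = groups[product_id]  # only this product's transactions
        sale = [r for r in bag if r[2] == "S"]
        returned = [r for r in bag if r[2] == "R"]
        total_sale = bag_sum(r[3] for r in sale)
        # Same guard as the script: fall back to zero when there are no returns.
        total_return = bag_sum(r[3] for r in returned) if returned else 0
        # Arithmetic on a null operand is modelled as null.
        net_sale = None if total_sale is None else total_sale - total_return
        report.append((product_id, total_sale, total_return, net_sale))
    return report


if __name__ == "__main__":
    for line in nested_foreach_report(ROWS):
        print(line)
```

For the sample rows the sketch reproduces the example report. Under the stated assumption, a product that has only return rows would end up with an empty total sale and net sale, so applying the same COUNT guard to the sale side may be worth considering.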



















