Feb 27, 2013

SQL to ECL - Metamorphosis


As the old saying goes, you can tell whether someone is a data person just by the way they solve a problem:
An imperative programmer goes: for each order in the orders table, find the corresponding order details...
And the SQL programmer goes: take the orders table and join it with the order details on the order ID key...

As a SQL programmer, you look at tables as sets, not as individual records. You don't bother about whether the join happens as a nested loop, hash, merge, or map-reduce for that matter. You want the join to happen the way you intend it to (functionally) and let your query engine find the best way to do it based on the data distribution, size, etc. That is the SQL programming style.

Now, ECL is not SQL, but it has the SQL programming style. I find it to be a pseudo functional-declarative programming style with some object-oriented concepts tossed in. Don't pull your hair out. Not just yet!!!

In ECL, there are only 2 types of statements that I have come across:
  1. Action: Something that produces an output.
  2. Declaration: A single assignment (that is, it cannot be re-assigned). These are called attributes in the ECL world.
In my experiments with ECL, I have created attributes that:
    - hold a value (string, int, set, record set, table, file, etc.)
    - define a function or action
    - hold a definition (more like a table definition)

Most of your ECL code is definitions. An action can call a definition, or you can pipe your definitions to cascade into a result. More on this later.
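
Here is a minimal sketch to show the two statement types side by side (the layout and values below are made up):

// A definition: a record layout (like a table definition)
PersonRec := RECORD
    STRING30  name;
    UNSIGNED1 age;
END;

// A definition: an inline record set
People := DATASET([{'Ann', 34}, {'Bob', 17}], PersonRec);

// A definition: a filtered record set built from another definition
Adults := People(age >= 18);

// An action: the only statement here that actually produces output
OUTPUT(Adults);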

Of course, you did notice that there is no concept of a variable. And that is the only thing to "get" in ECL. If you really think about it, ECL gives you an amazing abstraction over all the threading and grid-ding that happens behind the scenes. How is it able to do that? By signing a contract with you that says: a definition (or an attribute) never, ever changes its state. So the attribute points to the same thing for whoever wants it, or more specifically, whichever machine in the grid wants it. There, it eliminated any race condition you might introduce in your code (a common hazard in the parallel computing world). As long as you adhere to that contract, you can tell HPCC what you need and it will get you the result. With me, so far??

It's not as hard as you think to code without variables. Purely functional languages like Haskell do not allow you to re-assign either.

So, how do you work around this constraint? 
Blunt answer: Create a new definition to hold the mutated value.
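
For example (made-up names), you build new definitions from old ones instead of mutating:

basePrice       := 100;
discountedPrice := basePrice * 0.9;  // a new definition, not a re-assignment
finalPrice      := discountedPrice + 5;
// basePrice := 200;  // illegal: an attribute can never be re-assigned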

Think about it: you don't "really" use variables in the SQL world either.

Now, why definitions? If it's SQL-programmer friendly, why not just use SQL? When you try to answer this question, you will uncover the brilliance of this powerful language. Here is my attempt.

Rewind a few years: SQL Server 2005 introduced common table expressions (CTEs), or the "WITH" clause. What problem was it really trying to solve? To help generate number tables using recursion? No!!!
In production applications, the biggest use of CTEs was to remove code clutter. I was able to pull inline views out and move them to the top of the query. Improved code readability.

Take this cooked-up query in SQL Server 2000:
SELECT *
FROM   OrderDetails A
       LEFT OUTER JOIN (SELECT productid,
                               MAX(transactiondate) AS LatestDate
                        FROM   Transactions
                        WHERE  orderid <> ''
                               AND type = 'Sell'
                        GROUP  BY productid
                        HAVING YEAR(MAX(transactiondate)) > 2011) B
         ON B.productid = A.productid
WHERE  A.status = 'Active'

With CTE, you can change it to:
WITH B
     AS (SELECT productid,
                MAX(transactiondate) AS LatestDate
         FROM   Transactions
         WHERE  orderid <> ''
                AND type = 'Sell'
         GROUP  BY productid
         HAVING YEAR(MAX(transactiondate)) > 2011)
SELECT *
FROM   OrderDetails A
       LEFT OUTER JOIN B
         ON B.productid = A.productid
WHERE  A.status = 'Active'

This is cleaner; I am now looking at two queries: one that generates the table B and another that uses it. Come to think of it, wouldn't it be great if I could move each of the complexities into its own shell (or definition), so that each definition can be reused as needed? Like this:
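
Something along these lines (the record layouts and logical file names below are made up, but the shape is what matters):

// Each complexity lives in its own definition, like a named CTE
DetailRec := RECORD
    UNSIGNED4 productid;
    STRING10  status;
END;
TransRec := RECORD
    STRING20  orderid;
    UNSIGNED4 productid;
    STRING10  type;
    UNSIGNED4 transactiondate;  // stored as YYYYMMDD
END;

OrderDetails := DATASET('~demo::orderdetails', DetailRec, THOR);
Transactions := DATASET('~demo::transactions', TransRec, THOR);

SellTrans := Transactions(orderid <> '', type = 'Sell');

LatestByProduct := TABLE(SellTrans,
                         {productid,
                          UNSIGNED4 LatestDate := MAX(GROUP, transactiondate)},
                         productid);

B := LatestByProduct(LatestDate DIV 10000 > 2011);

ActiveDetails := OrderDetails(status = 'Active');

Result := JOIN(ActiveDetails, B,
               LEFT.productid = RIGHT.productid,
               LEFT OUTER);

OUTPUT(Result);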

Voila!!! This is exactly how the ECL code looks (well, it's about 95% close, and the compiler will help you fix the rest). But that's about it. This will run just fine on a single node with gigabytes of data or on 100 nodes with petabytes of data. Welcome to the world of big data.

Once you get past this stage, you will quickly move beyond SQL's scope. There is a world of constructs to handle any type or size of data. It's got some powerful detergents built in to clean up the dirtiest data you have, and that will blow your mind.
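
For a tiny taste, here is a made-up example using the standard string library (Std.Str) to scrub messy names:

IMPORT Std;

NameRec := RECORD
    STRING40 name;
END;

Raw := DATASET([{'  alice   SMITH '}, {' BOB  jones '}], NameRec);

// CleanSpaces trims the ends and collapses runs of internal whitespace
Cleaned := PROJECT(Raw,
                   TRANSFORM(NameRec,
                             SELF.name := Std.Str.CleanSpaces(LEFT.name)));

OUTPUT(Cleaned);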

And, just so you know, you can write the above query as a single line of code by substituting your definitions inline (recursively). You will even be able to build your SQL Server 2000 query in ECL. But why you would want to do that is a different discussion!!!
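
With the made-up definitions from the sketch above, the inlined version would look something like this (one statement, wrapped across a few lines for readability):

OUTPUT(JOIN(OrderDetails(status = 'Active'),
            TABLE(Transactions(orderid <> '', type = 'Sell'),
                  {productid,
                   UNSIGNED4 LatestDate := MAX(GROUP, transactiondate)},
                  productid)(LatestDate DIV 10000 > 2011),
            LEFT.productid = RIGHT.productid,
            LEFT OUTER));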

Feb 26, 2013

ECL - Big data for the SQLly Inclined

On the last project I worked on, we pushed the limits of the RDBMS (a misfit for our requirements), and I decided that next time, I would look beyond SQL for my data needs. I started exploring the NoSQL world (MongoDB, Neo4j, Redis, etc.) and understood that had we been open to these technologies when we started our last project, we might have had a much easier life.

Eventually, I started seeing Hadoop everywhere: our company was talking, customers were talking, my friends were talking. And I started learning the jargon around Hadoop (MapReduce, Hive, Sqoop, HDFS, HBase). I used to throw these words into my conversations along with a few zoo animals, figured out most of my friends were doing the same, and we had a happy ecosystem going. But deep down, I knew that I was ignoring the elephant in the room that was staring at me. I read through the famed MapReduce research paper and was able to get the concept. But I was not able to really start playing with Hadoop. Setting up was easy, and you can get the word-count sample working in an hour. But after that, I was stuck. I understood the power of what it can do, but I felt I did not know the right language to communicate with it. It's like someone asked me to write a web server in SQL. Of course you can do it, but I don't want to. To me, SQL is "the" reference implementation of a domain-specific language. And the ease with which you can instruct your RDBMS to do a complex task is mind-blowing, as long as you are operating within the problem domain.

In retrospect, I understood that the reason I had so much reluctance to get into Hadoop was because I am not a technology guy (there, I said it). I like to solve logical problems (puzzles or problems, I don't care). From a problem solver's perspective, my problem statement does not change whether I am working with 1 record or 1 gazillion records. It does not change if I have 1 line of text or the entire World Wide Web to process. I wanted that abstraction. SQL was giving me that (almost), till the data spilled over to the next machine. So, I started looking for languages on top of Hadoop that could help me out. I looked at Pig and Hive and a few others, but I felt this was like LINQ to SQL. You can change the programming language, but you cannot change the fundamental building blocks. I don't want to come out wrong. I love LINQ, but not so much when I have to write complex SQL queries in LINQ.

And so, I started exploring options outside Hadoop. Come on, big data is such a "big" pie, and it will not be monopolized. Anyway, my search ended with ECL: a programming language for taming the supercomputing grid called the High-Performance Computing Cluster (HPCC). It was open source, and installation was exactly like it was for SQL Server:
    "Download the VM and download an IDE (looks like your SQL Server Management Studio). Connect to the server and get going."

I played with it for a few days. They have some tutorial videos on their site (Google "HPCC Systems"). My interest was mainly because the programming style was so different, not similar to any language that I knew. But I was able to relate to it. I didn't have to skew my thinking to fit their programming style.

Also, it's been "the" programming language for HPCC for the last 10 years or so, and it has undergone a lot of refinement over the years. So I knew that I could take a deep breath and give some time to understanding this language.

Fast forward a few months... I am still in love with ECL. And some day, someone may write ECL for Hadoop. But, till then, I am taming big data, the ECL way.

If you've got an hour or so, give ECL a try and let me know what you think.

There's a mental switch that you need to flip to start thinking in ECL easily, and with that, it becomes pretty much like SQL; actually, even more elegant in a few cases. I will write about it in my next post anyway... breaking down a complex SQL query and building it in ECL. It should be a lot easier to understand, I hope.