A Data Lake is a term that describes a large enterprise repository of data stored in Hadoop. More and more companies are concluding that a Data Lake is the right solution over a traditional ETL process and a restricted EDW. The Data Lake is inexpensive, scales easily, runs on commodity hardware, provides a very flexible schema, and makes it easy to transform data in parallel.
Schema
So why not use a relational database for this Data Lake? In a relational database, the schema is defined first and then data is forced into it. With Hadoop, you first load the data and then apply a schema as you read it out. This means adding new data to Hadoop is easier and faster because you don’t have to first define the schema.
Parallel
Processing is also done in parallel. You can transform the data using Hadoop tools like Pig and then load the results into a relational data store, or simply keep using the data in Hadoop.
Greenplum Database External Tables
There are use cases where a relational database like Greenplum Database is easier to use and performs better than Hadoop. A great feature of Greenplum Database is the ability to create an External Table that points to Hadoop. These External Tables can be defined to either READ data from or WRITE data to Hadoop. Because Greenplum is an MPP database, each segment connects to Hadoop to READ/WRITE data in parallel. This makes the integration of Greenplum Database with Hadoop much faster than the single-threaded approach you might see with other database products.
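To give a sense of the WRITE direction, here is a minimal sketch of a writable external table, where every segment streams its rows into HDFS in parallel. The sales_export table, its columns, and the /exports/sales path are assumptions for illustration only and would need to match your own cluster.

-- Hypothetical example: each segment writes its rows directly to HDFS in parallel.
-- The table name, columns, and HDFS path below are assumptions, not part of the demo above.
create writable external table sales_export (id int, amount numeric, region text)
LOCATION ('gphdfs://localhost:8020/exports/sales')
FORMAT 'TEXT' (delimiter '|');

insert into sales_export select id, amount, region from sales;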
Transform Example
So let’s say you get an employee file from a company you just acquired. You want to take this file and make it available to the enterprise in Hadoop as well as in a data mart in Greenplum Database.
We will first load the data into a stage directory, run a program to remove a control character, and then put it in the enterprise directory.
hadoop dfs -mkdir /stage
hadoop dfs -mkdir /enterprise
Next, I use a Pig program to remove a control character. In this example, I’m removing the “null character”.
The name of this file is pig_cleaning.pig.
dirty = load '$STAGE' as (data:chararray);
clean = foreach dirty generate REPLACE(data, '\\u0000', '');
store clean into '$TARGET';
Here is the employees.txt file I created. Notice the ^@ symbols; these are control characters as they appear when using vi. The pig_cleaning.pig script will remove them.
Jon|Roberts|123 Main Street|New York|NY|10202|Sales Engineer|NorthEast
Abby|Cromwell|77 Juniper Drive|Raleigh|NC|27605|CIO|NorthWest
Lilly|Vargas|7894 Rayna Rd|Atlanta|GA|30301|VP Sales|SouthEast
Emily|Palmer|224 Warren Drive|St. Louis|MO|63101|VP Marketing|MidWest
Winston|Rigby|84 West Road|^@|CA|^@|President|West
Next, I created a shell script that accepts a filename to load and the name of the target directory in Hadoop in which to store the results.
The name of this file is load.sh.
#!/bin/bash
# $1 is the filename
# $2 is the target directory name
hadoop dfs -put $1 /stage
pig -param STAGE=/stage/$1 -param TARGET=/enterprise/$2 pig_cleaning.pig
hadoop dfs -rm /stage/$1
This script loads the file into the /stage directory, runs the pig program to clean the file of the null character (^@), stores the output to the /enterprise directory, and then removes the stage file.
Executing the script is as easy as:
./load.sh employees.txt employees
Now what about Greenplum Database? Here is how you can READ that Hadoop data from the database. Note that in this example, I have Hadoop and Greenplum Database on the same single host. Typically these will be on separate hosts, and instead of localhost you would use the name of the NameNode, such as hdm1.
create schema enterprise;

create external table enterprise.employees
(fname text, lname text, address1 text, city text, state text, zip text, job text, region text)
LOCATION ('gphdfs://localhost:8020/enterprise/employees/part*')
FORMAT 'TEXT' (delimiter '|');
And now let’s execute a SELECT statement.
gpdb=# select * from enterprise.employees;
  fname  |  lname   |     address1     |   city    | state |  zip  |      job       |  region
---------+----------+------------------+-----------+-------+-------+----------------+-----------
 Jon     | Roberts  | 123 Main Street  | New York  | NY    | 10202 | Sales Engineer | NorthEast
 Abby    | Cromwell | 77 Juniper Drive | Raleigh   | NC    | 27605 | CIO            | NorthWest
 Lilly   | Vargas   | 7894 Rayna Rd    | Atlanta   | GA    | 30301 | VP Sales       | SouthEast
 Emily   | Palmer   | 224 Warren Drive | St. Louis | MO    | 63101 | VP Marketing   | MidWest
 Winston | Rigby    | 84 West Road     |           | CA    |       | President      | West
(5 rows)
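From here, building the data mart mentioned earlier can be as simple as materializing the external data into a regular Greenplum table so queries hit local storage instead of Hadoop. This is only a sketch; the employees_mart name and the lname distribution key are assumptions you would adjust for your own model.

-- Materialize the Hadoop-backed external table into a local data mart table.
-- employees_mart and the lname distribution key are hypothetical choices.
create table enterprise.employees_mart as
select * from enterprise.employees
distributed by (lname);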
Conclusion
Enterprises are using Hadoop to create a Data Lake. Once the data is there, it is fast and easy to transform. And with Greenplum Database, it is easy to use SQL tools to access that data.