The following instructions are valid for connecting Greenplum Database to Hortonworks HDP 2.0. Note that this technique is not officially supported by Pivotal and Pivotal has added supported for HDP 2.1 in Greenplum Database version 4.3.3 and 4.2.8. See this post for more information.
Hortonworks (HDP) 2.0 is based on Apache Hadoop 2.2 which is also what Pivotal HD 2.0 is based on. If you haven’t read already, I was able to demonstrate how Greenplum can read and write data to a Pivotal HD cluster in parallel. http://www.pivotalguru.com/?p=700. You can use this same basic configuration to integrate Grenplum to Hortonworks 2.0!
Follow the same steps in configuring Greenplum database for Pivotal HD 2.0. We’ll use the same client libraries to connect to HDP 2.0. Next create a user in HDP 2.0 named gpadmin. Lastly, create the same types of External Tables but use the IP address of the HDP cluster.
CREATE WRITABLE EXTERNAL TABLE ext_foo_hdp20 (i int, bar text) LOCATION ('gphdfs://192.168.239.215/user/gpadmin/foo_bar') FORMAT 'text' (delimiter '|' null 'null'); INSERT INTO ext_foo_hdp20 SELECT i, 'bar_' || i FROM generate_series(1, 100) AS i; CREATE EXTERNAL TABLE ext_get_foo_hdp20 (i int, bar text) LOCATION ('gphdfs://192.168.239.215/user/gpadmin/foo_bar') FORMAT 'text' (delimiter '|' null 'null');
Here are the files Greenplum created in the Hortonworks cluster.
[gpadmin@sandbox ~]$ hadoop version Hadoop 184.108.40.206.0.6.0-76 Subversion email@example.com:hortonworks/hadoop.git -r 8656b1cfad13b03b29e98cad042626205e7a1c86 Compiled by jenkins on 2013-10-18T00:19Z Compiled with protoc 2.5.0 From source with checksum d23ee1d271c6ac5bd27de664146be2 This command was run using /usr/lib/hadoop/hadoop-common-220.127.116.11.0.6.0-76.jar [gpadmin@sandbox ~]$ hdfs dfs -ls /user/gpadmin/foo_bar/ Found 2 items -rw-r--r-- 3 gpadmin gpadmin 490 2014-08-05 06:19 /user/gpadmin/foo_bar/0_1407266192-0000000091 -rw-r--r-- 3 gpadmin gpadmin 494 2014-08-05 06:19 /user/gpadmin/foo_bar/1_1407266192-0000000091
Here is the data from HDP in Greenplum.
postgres=# SELECT * FROM ext_get_foo_hdp20 ORDER BY i LIMIT 10; i | bar ----+-------- 1 | bar_1 2 | bar_2 3 | bar_3 4 | bar_4 5 | bar_5 6 | bar_6 7 | bar_7 8 | bar_8 9 | bar_9 10 | bar_10 (10 rows) postgres=#
I suspect the integration will get better and better between Greenplum and the various Hadoop distributions because for the most part, all of the distributions rely on the Apache distribution of Hadoop. Look for more to come in the future too!