gplink is a program I wrote that makes it possible to create an External Table in Greenplum that connects to ANY JDBC source through a gpfdist process.

When to Use
Use this for loading data into Greenplum from a JDBC source. This is similar to Outsourcer, but that tool automates far more tasks and only works with SQL Server or Oracle.

You can also create federated queries between Greenplum and a JDBC source. Please use caution when doing this as the External Table does not do predicate push down so the entire query must complete in the source database before any filters can be applied.
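To make the cost of missing predicate push down concrete, here is a minimal sketch. It does not use gplink or Greenplum at all: an in-memory SQLite database stands in for the remote JDBC source, and the `orders` table, its data, and both function names are made up for illustration. The first function mimics the External Table behavior described above (the fixed source query runs in full, then the filter is applied afterward), and the second puts the filter in the source-side query instead.

```python
import sqlite3

# Stand-in for the remote JDBC source (hypothetical table and data).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, region TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, "east" if i % 4 == 0 else "west") for i in range(1000)],
)


def fetch_without_pushdown(conn, region):
    """Run the fixed source query in full, then filter on the receiving side."""
    rows = conn.execute("SELECT id, region FROM orders ORDER BY id").fetchall()
    matched = [row for row in rows if row[1] == region]
    return len(rows), matched


def fetch_with_filter_in_source_query(conn, region):
    """Put the filter in the source-side query so fewer rows are transferred."""
    rows = conn.execute(
        "SELECT id, region FROM orders WHERE region = ? ORDER BY id",
        (region,),
    ).fetchall()
    return len(rows), rows


transferred_all, east_a = fetch_without_pushdown(source, "east")
transferred_few, east_b = fetch_with_filter_in_source_query(source, "east")

# Same answer either way, but very different numbers of rows moved.
print(transferred_all, transferred_few, east_a == east_b)
```

Both paths return the same 250 matching rows, but the first one transfers all 1000 rows to get there. If a federated query only needs a slice of the source data, it is usually worth putting that filter in the query that runs in the source database rather than in the query against the External Table.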

Documentation and Source Code
gplink GitHub Repository

Download Version 1.0.3

6 thoughts on “gplink”

  1. Paul Johnson

    With regards to Teradata, this is not always true: “- You want to use FASTEXPORT for better performance.”

    For small/medium data volumes FEXP is overkill, and is also subject to the system-wide limit on the number of utilities that can be executed concurrently, which some sites intentionally set to a low number (<10).

    I would test with/without FEXP on a case-by-case basis. As always, YMMV.

    Still a great tool though 🙂

  2. Manoj

    1) Is this GP Link production ready code?
    2) The document does not mention how do I execute queries across GP and Oracle. Do I need to pre-define the queries?
    3) Can it run effectively on source tables having 300 Million Records and that can return 500K results?

    1. Jon Post author

      1. Yes. There are several Greenplum customers in production with this tool.
      2. When you select from the External Table, the program will execute a query in the remote database such as Oracle. It runs the predefined query you configure in a SQL file.
      3. Yes, 300M rows is not a problem. I’ve had customers load 2 billion rows with this tool.

      If you are sourcing data from Oracle, you also may want to consider using Outsourcer.

    1. Jon Post author

      PXF JDBC was developed after gplink was created.

      gplink uses gpfdist to open a single JDBC connection to the source machine and lets all of the segments pull data in parallel via gpfdist. You can install gplink on a single host that has network access to both the segment hosts and the source JDBC connection.

      PXF requires installation on all segment hosts in the cluster, and all segment hosts need network access to the JDBC source. PXF has the ability to “partition” the data, which means it can execute multiple JDBC queries to pull the data, whereas gplink doesn’t do that.

      Both use JDBC, so a JVM is required, but gplink uses just one while PXF runs a JVM on every segment host.

      PXF is supported by VMware and part of the commercial distribution of Greenplum while gplink is an open source project that is supported by me, not VMware.

