Category Archives: Greenplum

LDAP Security

Security is always divided into two main actions: Authentication and Authorization.  Authentication is proving you are who you say you are, usually by providing a username and a password.  Authorization is the process of approving or denying access to objects, such as SELECT on a table.  Greenplum always handles Authorization internally; you can’t delegate it to a third party, but you can delegate Authentication.

Authentication is handled in Greenplum just like it is in PostgreSQL.  Edit your pg_hba.conf file on the MASTER and provide information about how users should be authenticated.  Typically, companies use an LDAP solution for security, and it is usually the hub in a hub-and-spoke security model.  Furthermore, Microsoft Active Directory is usually the LDAP solution picked because so many desktops run Windows.

The security hub holds a company’s password rules (failed attempts, password complexity, password expiration, etc.), and you can leverage this very easily with Greenplum’s authentication model.  Simply edit your pg_hba.conf file and, at the very bottom of the file (it is read sequentially), add this:

host all all 0.0.0.0/0 ldap ldapserver=<ldap server> ldapprefix="<domain>\" ldapsuffix=""

For example:

  • LDAP server = ldap1
  • Windows Domain = myco

So your entry at the bottom of the pg_hba.conf file would look like this:

host all all 0.0.0.0/0 ldap ldapserver=ldap1 ldapprefix="myco\" ldapsuffix=""

During a login attempt, the pg_hba.conf file is read and the first matching entry is used.  So if you want to use basic password authentication for an ETL user or trust for local users, simply put those entries before the LDAP entry, as in the sketch below.  The last line is then the LDAP “catch-all” entry that authenticates users against your Active Directory.
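
For example, here is a sketch of how those entries might be ordered (the ETL role name and subnet are made up for illustration):

# trust local connections on the master
local   all   all                      trust

# the ETL account uses a database password instead of LDAP
host    all   etl_user   10.1.1.0/24   md5

# catch-all: everyone else authenticates against Active Directory
host    all   all        0.0.0.0/0     ldap ldapserver=ldap1 ldapprefix="myco\" ldapsuffix=""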

How does it work?  It takes the username and password passed in and attempts to connect to the LDAP server with the ldapprefix (domain) specified.  If that connection succeeds, the user is authenticated and the connection to Greenplum proceeds.  It is that simple.  Once in the database, Greenplum handles Authorization as normal.
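
One thing to keep in mind: LDAP only replaces the password check.  The login still needs a matching role inside Greenplum, and all grants stay in the database.  A minimal sketch (the role and table names here are made up):

--the role must exist in Greenplum even though the password is verified by Active Directory
CREATE ROLE jsmith LOGIN;

--Authorization is still done inside Greenplum with normal grants
GRANT SELECT ON my_table TO jsmith;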

Merge Statements

MERGE is a feature in most databases, but I have never liked it much.  My source table always has duplicates in it, which has prevented me from using it.

Here is an example using Microsoft SQL Server (BTW, Oracle behaves the same way):

CREATE TABLE stage
(
id int identity not null,
provider_id int NOT NULL,
provider text NOT NULL
);

insert into stage (provider_id, provider) values (1, 'bubba');
insert into stage (provider_id, provider) values (2, 'gump');
insert into stage (provider_id, provider) values (3, 'foo');
insert into stage (provider_id, provider) values (4, 'bar');
insert into stage (provider_id, provider) values (4, 'jon');

CREATE TABLE target
(
provider_id int NOT NULL primary key,
provider text NOT NULL
);

My stage table is where new records get loaded from a file, and my target table is what I’m trying to keep up to date.  So using a MERGE statement, I should be able to insert the new records and, if they already exist, update them.

merge target
using (select * from stage) as stage
on (target.provider_id = stage.provider_id)
when matched then
update set provider = stage.provider
when not matched then
insert (provider_id, provider)
values (stage.provider_id, stage.provider);

That is the syntax for MERGE in SQL Server and when executed, I get this error message:

Msg 2627, Level 14, State 1, Line 1

Violation of PRIMARY KEY constraint 'PK__target__00E213107132C993'. Cannot insert duplicate key in object 'dbo.target'.

The statement has been terminated.

Why?  Well, my stage table has two records with the same provider_id.  When this happens, I don’t want an error; instead, I want the last value in the file to be used.  Unfortunately, MERGE doesn’t handle this.  Instead, you have to write code.
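
For completeness, one way you could handle the duplicates in SQL Server itself is to deduplicate inside the USING clause with row_number().  This is just a sketch of that idea, not something from the original example:

merge target
using (
    --keep only the last row per provider_id (highest id) before merging
    select provider_id, provider
    from (
        select s.*,
               row_number() over (partition by s.provider_id order by s.id desc) as rn
        from stage s
    ) as d
    where d.rn = 1
) as stage
on (target.provider_id = stage.provider_id)
when matched then
update set provider = stage.provider
when not matched then
insert (provider_id, provider)
values (stage.provider_id, stage.provider);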

In Greenplum, here is how you do a “smart” MERGE that handles real-world data where the stage (source) table has duplicates:

CREATE TABLE stage
(
id serial not null,
provider_id int NOT NULL,
provider text NOT NULL
)
DISTRIBUTED BY (provider_id);

insert into stage (provider_id, provider) values (1, 'bubba');
insert into stage (provider_id, provider) values (2, 'gump');
insert into stage (provider_id, provider) values (3, 'foo');
insert into stage (provider_id, provider) values (4, 'bar');
insert into stage (provider_id, provider) values (4, 'jon');

CREATE TABLE target
(
provider_id int NOT NULL primary key,
provider text NOT NULL
)
DISTRIBUTED BY (provider_id);

Now create the function:

create or replace function fn_merge_target() returns void as
$$
declare
    v_function text := 'fn_merge_target';
    v_location int;

begin
    v_location := 1000;
    --delete the old records
    delete from target t
    using stage s
    where t.provider_id = s.provider_id
    and t.provider <> s.provider;

    v_location := 2000;
    --inserts
    insert into target (provider_id, provider)
    select sub.provider_id, sub.provider
    from (select s.*,
                 rank() over (partition by s.provider_id order by s.id desc) as rank
          from stage s
          left join target t on s.provider_id = t.provider_id
          where t.provider is null) as sub
    where sub.rank = 1;

exception
    when others then
        raise exception '(%:%:%)', v_function, v_location, sqlerrm;
end;
$$
language plpgsql;

Execute the function:

select fn_merge_target();

And now query target:

select * from target order by provider_id;
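
If everything worked, the output should look roughly like this, with the last staged value winning for provider_id 4:

 provider_id | provider
-------------+----------
           1 | bubba
           2 | gump
           3 | foo
           4 | jon
(4 rows)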

I have 4 rows and for provider_id 4, I have the last version from the stage table rather than failing the command.

In summary, MERGE sounds nice but I’ve never been able to use it.  My source or stage table always has duplicates and, instead of failing, I want to apply the last version of each primary key record.  I demonstrated an easy-to-use way to provide a much smarter MERGE that is optimized for Greenplum and uses one of the built-in analytic functions (rank()).

Archiving Data

A customer recently had a pretty common question about archiving old data in Greenplum.  They wanted to purge old data in their OLTP system so that it remains small and fast.  Historical data that isn’t changing stays in Greenplum.

An easy way to do this is with a view and two tables.  The view would UNION ALL the data between the two tables.  Unfortunately, Greenplum, like most other databases, executes each query in the UNION ALL serially.  The optimal way to handle this is to partition the data.

So let’s say the customer has a schema called edw and a table called auths.

create schema edw;

CREATE TABLE edw.auths
(id int not null,
fname varchar,
lname varchar,
episodedate timestamp)
DISTRIBUTED BY (id);

Insert some sample data.

INSERT INTO edw.auths VALUES (1, 'foo', 'bar', '2001-05-01'::timestamp);
INSERT INTO edw.auths VALUES (2, 'foo', 'bar', '2012-12-01'::timestamp);
INSERT INTO edw.auths VALUES (3, 'too', 'cool', '2010-01-01'::timestamp);
INSERT INTO edw.auths VALUES (4, 'too', 'cool', '2010-01-02'::timestamp);
INSERT INTO edw.auths VALUES (5, 'old', 'data', '2004-01-02'::timestamp);

Next, rename this schema.

ALTER SCHEMA edw RENAME TO edw_archive;

Create the new edw schema.

CREATE SCHEMA edw;

In this case, anything older than 2010-01-01 will be archived and anything newer will be in the current OLTP system.

CREATE TABLE edw.auths
(LIKE edw_archive.auths)
PARTITION BY RANGE (episodedate)
(START ('2000-01-01'::timestamp) INCLUSIVE END ('2010-01-01'::timestamp) EXCLUSIVE WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5),
DEFAULT PARTITION auths_current);

I’m using a pretty high level of compression and column orientation for the archive data.  This will help performance and use less disk space for the archived data.  The default partition has the current data in it, and it remains an uncompressed row store.  This is a good example of Polymorphic storage, where partitions within the same table are stored in different ways.
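
Later, as data in the default partition ages out, you could carve a new archive partition out of it with a SPLIT.  Here is a sketch of what that might look like (the partition name and boundary dates are just for illustration, and you may still want to move the new partition to compressed, column-oriented storage afterward, for example with an EXCHANGE):

ALTER TABLE edw.auths SPLIT DEFAULT PARTITION
START ('2010-01-01'::timestamp) INCLUSIVE
END ('2011-01-01'::timestamp) EXCLUSIVE
INTO (PARTITION auths_2010, DEFAULT PARTITION);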

Insert the old data into the new table.

INSERT INTO edw.auths SELECT * FROM edw_archive.auths;

Just to show that everything works.

SELECT * FROM edw.auths;

With Outsourcer, you will want to change the job that loads edw.auths so it inserts data into the default partition and doesn’t touch the archive partition.  To do this, get the names of the underlying inherited tables.

SELECT * FROM pg_partitions WHERE schemaname = 'edw' AND tablename = 'auths';

You can see the two tables independently.

--new data

SELECT * FROM edw.auths_1_prt_auths_current;

--archive data

SELECT * FROM edw.auths_1_prt_2;

To change Outsourcer, first remove the old job for edw.auths and insert the new one with the target of edw.auths_1_prt_auths_current.

DELETE FROM os.job WHERE (target).schema_name = 'edw' AND (target).table_name = 'auths';

INSERT INTO os.job(
refresh_type, target, source)
VALUES ('refresh',
('edw', 'auths_1_prt_auths_current'),
('sqlserver', 'jonnywin', null, null, 'edw', 'dbo', 'auths', 'sa', 'sa'));

Now when you execute this job, it will truncate the current partition, insert the data from the OLTP system, and then analyze the partition.  The archive partition will be left untouched.  Querying data across the archived and current data will execute much faster than a UNION ALL because it runs in parallel and it is compressed and column oriented.

PL/pgSQL Functions

Overview

Greenplum leverages PostgreSQL internals, which brings great benefits in security, Multi-Version Concurrency Control, and procedural languages.  Greenplum supports a wide variety of languages, but my favorite by far has to be PL/pgSQL.

PL/pgSQL feels a lot like Oracle PL/SQL.  It has the same “Ada” look and feel which makes it easy for an Oracle developer to start writing PL/pgSQL.

Since Greenplum supports the same procedural language that PostgreSQL does, you can find excellent information online about different techniques for writing quality PL/pgSQL functions.  For example, here is good documentation on how to port from Oracle PL/SQL to PL/pgSQL => http://www.postgresql.org/docs/8.2/interactive/plpgsql-porting.html

Writing Functions

I follow the same template for each PL/pgSQL function I write, and below are the conventions I always follow.

Naming Conventions

  • Functions – prefix with fn_ to denote it is a function
  • Table names – all lower case with underscores.  The data dictionary stores objects in all lower case so it is easier to read table names with underscores.
  • Variables – prefix with v_
  • Parameters – prefix with p_

Error Handling

  • Function Name – I add the function name to each function I write so that I can include this in the exception message.  This is very helpful when you have functions executing other functions and you need to know what function really failed.
  • Location – I prefer to add a location variable to functions and denote where I am in the code frequently.  I use the location value to know where the error occurred.
  • Exception Block – I raise the exception with:
    RAISE EXCEPTION '(%:%:%)', v_function_name, v_location, sqlerrm;

The parentheses surround the error message from this function.  If the function executes another function and the second function is the one that failed, the parentheses help indicate which function failed, the location of the error, and the error message that occurred.
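
For example, if a hypothetical fn_outer fails at location 2000 while calling fn_inner, which itself failed at location 1000 on a division by zero, the message that bubbles up would look something like this:

ERROR:  (fn_outer:2000:(fn_inner:1000:division by zero))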

Template


CREATE OR REPLACE FUNCTION fn_name() RETURNS void AS
$$

DECLARE
    v_function_name text := 'fn_name';
    v_location int;

BEGIN
    v_location := 1000;

EXCEPTION
    WHEN OTHERS THEN
        RAISE EXCEPTION '(%:%:%)', v_function_name, v_location, sqlerrm;
END;

$$
LANGUAGE plpgsql;

As you develop your code, increment the v_location variable so that if you do get an error, it will be raised with the approximate location in your code.

More examples of my PL/pgSQL code can be found in the Outsourcer code.  I have about 8 functions that you can reference for examples of writing PL/pgSQL functions.

Ugly Data

It seems that everyone has it: ugly data.  Data in transactional systems that isn’t so clean, like control characters or rarely used characters that are used programmatically.  This ugly data typically errors out during a load process, which requires someone to cleanse the data by writing scripts and then loading it.

This is time consuming and feels like you are re-inventing the wheel over and over again.

Outsourcer does the cleansing for you! As the data is written to Greenplum, control characters are removed and special characters are escaped properly.

Here is a demo of Outsourcer in action loading ugly data.

Oracle Sessions

Just like for SQL Server, you may want to know what is going on in Oracle from inside Greenplum.  Here is an External Web Table that uses Outsourcer to give you that information.

First create a connection for Outsourcer:

INSERT INTO os.ext_connection
(type, server_name, instance_name, port, database_name, user_name, pass)
VALUES
('oracle', 'jonnywin', null, 1521, 'xe', 'os_test', 'os_password');

--get the ID you just inserted
SELECT id 
FROM os.ext_connection 
WHERE type = 'oracle' 
AND server_name = 'jonnywin'
AND port = 1521
AND database_name = 'xe';

--in my example, the value is 3.

Next create the schema in Greenplum for this table:

CREATE SCHEMA oracle;

This will create the External Table in Greenplum for Outsourcer:

--notice the double escaping of the dollar sign.
SELECT os.fn_create_ext_table('oracle.sessions', ARRAY[
'username varchar(30)','osuser varchar(30)','machine varchar(64)',
'terminal varchar(30)','program varchar(48)',
'module varchar(48)','sql_text text','logon_time timestamp',
'service_name varchar(64)'], 
3,
'SELECT s.USERNAME, s.OSUSER, s.MACHINE, s.TERMINAL, s.PROGRAM, s.MODULE,
a.sql_text, s.LOGON_TIME, s.service_name 
FROM v\\\\$session s, v\\\\$sqlarea a 
WHERE a.address = s.sql_address');

Now you can execute the query to get the session information:

select * from oracle.sessions;

More information on External Tables is in the Outsourcer documentation.

Why is SQL Server so slow?

This is a very good question.  When using Outsourcer, you may find a job runs longer or shorter from time to time, and in my experience, it is because the source database is waiting on something.  The source is the bottleneck 99% of the time.  So why is it slow today versus yesterday?

Databases have wait stats and SQL Server has this information too.  You can query SQL Server to find what it is waiting on by logging into SQL Server and executing a SQL command.  With the code provided below, you can leverage Outsourcer to query SQL Server from Greenplum!

First create a connection for Outsourcer:

INSERT INTO os.ext_connection
(type, server_name, instance_name, port, database_name, user_name, pass)
VALUES
('sqlserver', 'jonnywin', null, null, null, 'os_test', 'os_password');

--get the ID you just inserted
SELECT id 
FROM os.ext_connection 
WHERE type = 'sqlserver' 
AND server_name = 'jonnywin';

--in my example, the value is 1.

Next create the schema in Greenplum for this table:

CREATE SCHEMA sqlserver;

This will create the External Table in Greenplum for Outsourcer:

SELECT os.fn_create_ext_table('sqlserver.sessions', 
ARRAY['sql_time timestamp','start_time timestamp','status varchar(30)',
'session_id smallint','sqltext text'], 
1,
'SELECT getdate() as sql_time, req.start_time, req.status, 
req.session_id, sqltext.TEXT as sqltext 
FROM sys.dm_exec_requests req 
CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS sqltext 
order by req.start_time');

Now you can execute the query to get the session information:

select * from sqlserver.sessions;

I have found that the most critical thing to look for from SQL Server is the "status" column.  You will see "Running" when it is actually doing something, but "Suspended" when it is waiting on a resource.

So if you see queries generated by Outsourcer “Suspended” in SQL Server, that is why it is slower today versus yesterday.  Work with your SQL Server DBA to determine what resource(s) SQL Server is waiting on next.
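
If you want to dig deeper on the SQL Server side, here is a sketch of a query your DBA could run directly in SQL Server (not through the External Table) to see what the suspended requests are waiting on; it uses the same sys.dm_exec_requests view as above:

--run directly in SQL Server to see the wait types behind suspended requests
SELECT session_id, status, wait_type, last_wait_type, wait_time
FROM sys.dm_exec_requests
WHERE status = 'suspended'
ORDER BY wait_time DESC;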

More information on External Tables is in the Outsourcer documentation.