Greenplum now has an official blog! It already has some pretty interesting posts about how Greenplum can be used in different industries.
I recently dug into Greenplum Chorus to better understand the collaboration tool. I installed it on my Mac with my Greenplum Database, Greenplum Hadoop, and Greenplum Command Center. In this post, I will review how Chorus was designed and how it facilitates collaboration in an organization.
First off, the image above shows Greenplum Chorus, Database, and Hadoop. I created the image myself to better understand the product, so this isn't a marketing slide given to me.
Starting with Chorus, you have these main features.
- Scheduler: Used to refresh data in Sandbox (more on this below)
- Data Definitions: Connections to Greenplum databases and Hadoop. You can browse the data in both databases and Hadoop and then link this to Workspaces.
- Insights: Comments made in Chorus that are so important to the business that they are shared with everyone! No longer will revelations about data be isolated to a single person or a small group. Now you can share your insights with others!
Next we have Workspaces. In my image, I have a “Sales” Workspace as well as other Workspaces like HR, Marketing, Analytics, and Accounting. This is where people work together around data, and as you can see, the Chorus Administrator can create as many Workspaces as needed for your organization. Each Workspace has these features:
- Linked Data: Data from Greenplum databases and Hadoop that is linked to the workspace. This makes it simple to query this data in Chorus without needing to make a copy of the data.
- Sandbox Data: This is either data copied from Greenplum databases using the scheduler or completely new tables derived using SQL you write. This is very powerful and goes a long way to providing business users the ability to find new value in data.
- Chorus View: This is similar to a database view but the definition resides in Chorus.
- Files: This is basically metadata (Text, Images, other) plus code (SQL) that is stored in the workspace. Versioning is done automatically too. You can execute the SQL directly in Chorus. Very powerful stuff.
- Notes: This is more metadata about anything in your workspace. Notes are also what can be promoted to an Insight for the entire organization. You are building a living and breathing data dictionary with Chorus!
Workspaces also have the ability to visualize data with graphing similar to a BI tool. The visualizations are there to help you quickly understand the data and then take action on it. Maybe the action is to write a Note or Insight, or maybe it is to investigate further with additional queries against the Greenplum database and Hadoop. Chorus isn't meant to replace reporting tools; its aim is quick understanding followed by action.
Logging into Chorus can be handled by LDAP/Active Directory if you like. Hadoop and Database connections can be made public to the Chorus users or you can require users to log into the data sources so security is handled by the Database and Hadoop.
Chorus is a great collaboration tool for Greenplum. I am very impressed with the tool and expect more great things from the product in the future.
Here is another technique worth mentioning regarding building a date dimension. One way to do this is to write a loop that inserts one record per iteration. That is slow in virtually every database, but there is a way to do it with a single INSERT statement.
First create the table:
DROP TABLE IF EXISTS hr.date_dim;

CREATE TABLE hr.date_dim
(
  date_key timestamp without time zone NOT NULL,
  date_week timestamp without time zone NOT NULL,
  date_month timestamp without time zone NOT NULL,
  date_quarter timestamp without time zone NOT NULL,
  date_year timestamp without time zone NOT NULL
) DISTRIBUTED BY (date_key);
And now the magic:
INSERT INTO hr.date_dim
SELECT mydate AS date_key,
       date_trunc('week', mydate) AS date_week,
       date_trunc('month', mydate) AS date_month,
       date_trunc('quarter', mydate) AS date_quarter,
       date_trunc('year', mydate) AS date_year
FROM (
  SELECT '1995-01-01'::timestamp + interval '1 day' *
         generate_series(0, EXTRACT('days' FROM '2007-04-16'::timestamp - '1995-01-01'::timestamp)::int) AS mydate
) AS sub;
My example builds the date dimension with every day populated between January 1, 1995 and April 16, 2007. I chose that range because those were the minimum and maximum dates in my HR table of job history information.
This solution is similar to my other post about using generate_series to avoid a nested loop. generate_series is a very powerful and easy way to dynamically create a dataset without writing a loop.
Here is a common scenario in a data warehouse. You have designed yourself a date dimension and you have another table with a begin and end time for some activity.
The date dimension might look something like this:
CREATE TABLE hr.date_dim
(
  date_key timestamp without time zone NOT NULL,
  date_week timestamp without time zone NOT NULL,
  date_month timestamp without time zone NOT NULL,
  date_quarter timestamp without time zone NOT NULL,
  date_year timestamp without time zone NOT NULL
) DISTRIBUTED BY (date_key);
And your activity table would look something like this:
CREATE TABLE hr.job_history
(
  employee_id numeric NOT NULL,
  start_date timestamp without time zone NOT NULL,
  end_date timestamp without time zone,
  job_id character varying(10),
  department_id numeric,
  CONSTRAINT job_history_pkey PRIMARY KEY (employee_id, start_date)
) DISTRIBUTED BY (employee_id, start_date);
One way to write the query joining the two tables is with BETWEEN. Here is what it looks like:
SELECT * FROM hr.job_history jh JOIN hr.date_dim d ON d.date_key BETWEEN jh.start_date AND jh.end_date;
The explain plan looks like this on my local install on my Mac:
Gather Motion 2:1 (slice2; segments: 2)  (cost=2.42..2301.81 rows=2494 width=78)
  ->  Nested Loop  (cost=2.42..2301.81 rows=2494 width=78)
        Join Filter: d.date_key >= jh.start_date AND d.date_key <= jh.end_date
        ->  Seq Scan on date_dim d  (cost=0.00..54.89 rows=2245 width=40)
        ->  Materialize  (cost=2.42..2.62 rows=10 width=38)
              ->  Broadcast Motion 2:2 (slice1; segments: 2)  (cost=0.00..2.40 rows=10 width=38)
                    ->  Seq Scan on job_history jh  (cost=0.00..2.10 rows=5 width=38)
Notice it is doing a Nested Loop operation. Yuck.
And here is the one that utilizes generate_series:
SELECT *
FROM (
  SELECT *,
         start_date + interval '1 day' *
         generate_series(0, EXTRACT('days' FROM end_date - start_date)::int) AS mydate
  FROM hr.job_history
) AS jh
JOIN hr.date_dim d ON jh.mydate = d.date_key;
Look at the explain plan now!
Gather Motion 2:1 (slice2; segments: 2)  (cost=2.70..68.96 rows=7 width=166)
  ->  Hash Join  (cost=2.70..68.96 rows=7 width=166)
        Hash Cond: d.date_key = jh.mydate
        ->  Seq Scan on date_dim d  (cost=0.00..54.89 rows=2245 width=40)
        ->  Hash  (cost=2.58..2.58 rows=5 width=126)
              ->  Redistribute Motion 2:2 (slice1; segments: 2)  (cost=0.00..2.58 rows=5 width=126)
                    Hash Key: jh.mydate
                    ->  Result  (cost=0.00..2.27 rows=5 width=38)
                          ->  Seq Scan on job_history  (cost=0.00..2.27 rows=5 width=38)
It is doing a Hash Join and the cost has dropped significantly.
The lesson here is to avoid Nested Loops in Greenplum, and one way to do that is to use an equi-join instead of BETWEEN. As demonstrated here, the generate_series function makes that possible.
A customer encouraged me to write a post about how it is possible to use shell scripts to dynamically alter the results from an External Web Table.
For this example, I will create a table in both Greenplum and SQL Server that will contain a list of files. I want to dynamically get just the records from the External Web Table that I need, rather than retrieving all of the rows from SQL Server or recreating the External Web Table. Note: Outsourcer has "Append" jobs for getting just the new records, but it has been designed for integer data types, not character data like in this example.
CREATE TABLE myfiles
(filename TEXT NOT NULL PRIMARY KEY,
 dir TEXT NOT NULL)
DISTRIBUTED BY (filename);

INSERT INTO myfiles VALUES ('2012-01-01_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-02_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-03_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-04_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-05_10-30-00_data.txt', '/data/dump');
SQL Server Table:
CREATE TABLE myfiles
(filename VARCHAR(500) NOT NULL PRIMARY KEY,
 dir VARCHAR(MAX) NOT NULL);

INSERT INTO myfiles VALUES ('2012-01-01_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-02_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-03_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-04_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-05_10-30-00_data.txt', '/data/dump');
--Note that these three rows don't exist in Greenplum
INSERT INTO myfiles VALUES ('2012-01-06_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-07_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-08_10-30-00_data.txt', '/data/dump');
Shell Script: /Users/gpadmin/get_myfiles.sh
Please note that the use of Outsourcer.jar requires creating a connection in os.ext_connection. Review the Outsourcer documentation on creating External Tables for more information.
max=`psql -A -t -c "SELECT MAX(filename) FROM myfiles"`

java -classpath /usr/local/os/jar/Outsourcer.jar:/usr/local/os/jar/sqljdbc4.jar:/usr/local/os/jar/ojdbc6.jar -Xms128m -Xmx256m ExternalData gpdb 5432 1 "SELECT filename, dir FROM poc.dbo.myfiles WHERE filename > '$max'"
Dynamic External Web Table in Greenplum
CREATE EXTERNAL WEB TABLE ext_myfiles
(filename TEXT, dir TEXT)
EXECUTE E'/Users/gpadmin/get_myfiles.sh' ON MASTER
FORMAT 'text' (delimiter '|' null 'null' escape E'\\')
ENCODING 'UTF8';
Insert statement in Greenplum
INSERT INTO myfiles SELECT * FROM ext_myfiles;
This technique will only retrieve the rows from SQL Server that need to be inserted. When dealing with millions of records, this will greatly improve performance.
You can manipulate your script to retrieve data based on anything you want all without having to recreate your External Web Table.
A lot of users will open a connection to a database and then leave it open for days. They forget about these sessions and open more connections. Eventually, they hit the max_connections limit and can no longer create any new connections to Greenplum.
An easy fix for this is to kill sessions programmatically that have been idle for more than an hour. Of course, you can adjust this time period as appropriate.
Here is a simple SQL statement to identify the idle sessions:
SELECT procpid FROM pg_stat_activity WHERE current_query = '<IDLE>' AND clock_timestamp() - query_start > interval '1 hour';
One way to kill these idle sessions is with a single command:
eval `psql -A -t -c "SELECT 'kill ' || procpid FROM pg_stat_activity WHERE current_query = '<IDLE>' AND clock_timestamp() - query_start > interval '1 hour'"`
Put that command into a shell script (be sure to source the greenplum_path.sh script first) and create a cron job that runs every 5 minutes.
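For reference, the crontab entry might look something like this (the script path and log file name here are hypothetical; adjust them to wherever you saved your script):

```shell
# run the idle-session killer every 5 minutes, appending output to a log
*/5 * * * * /home/gpadmin/kill_idle_sessions.sh >> /home/gpadmin/kill_idle_sessions.log 2>&1
```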
If you want to track who you killed and what time, you can use this shell script and again, use a cron job to execute it every 5 minutes.
#!/bin/bash
source /usr/local/greenplum-db/greenplum_path.sh

for i in $( psql -A -t -c "SELECT procpid, usename FROM pg_stat_activity WHERE current_query = '<IDLE>' AND clock_timestamp() - query_start > interval '1 hour'" ); do
  procpid=$(echo $i | cut -d\| -f1)
  usename=$(echo $i | cut -d\| -f2)
  echo `date`\|$procpid\|$usename >> kill.log
  kill $procpid
done
Greenplum’s Polymorphic Storage is a very useful feature of Greenplum Database: tables can be created and altered with a mixture of columnar, compressed, and row storage.
Here is an example of creating one of these tables and then managing this over time.
CREATE TABLE edw.sales_fact
(
  store_id integer,
  sales_person_id integer,
  product_id integer,
  sale_date date,
  sale_price numeric,
  product_cost numeric,
  spiff numeric
)
DISTRIBUTED BY (store_id, sales_person_id, product_id, sale_date)
PARTITION BY RANGE(sale_date)
(
  START ('2011-01-01'::timestamp) INCLUSIVE END ('2011-02-01'::timestamp) EXCLUSIVE
    WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5),
  START ('2011-02-01'::timestamp) INCLUSIVE END ('2011-03-01'::timestamp) EXCLUSIVE
    WITH (appendonly=true, orientation=column, compresstype=quicklz),
  START ('2011-03-01'::timestamp) INCLUSIVE END ('2011-04-01'::timestamp) EXCLUSIVE
);
As you can see, this table has three partitions with each having different storage techniques.
- January 2011 is columnar and compressed with ZLib
- February 2011 is columnar and compressed with QuickLZ
- March 2011 is stored as a row store and is not compressed
As time goes on, the DBA may want to make March 2011 compressed and then add a partition for April 2011. To do this, we first need to move the March 2011 data to a new table and exchange the partition. The last step is to add the new April 2011 partition.
CREATE TABLE edw.sales_fact_march
(
  store_id integer,
  sales_person_id integer,
  product_id integer,
  sale_date date,
  sale_price numeric,
  product_cost numeric,
  spiff numeric
)
WITH (appendonly=true, orientation=column, compresstype=quicklz)
DISTRIBUTED BY (store_id, sales_person_id, product_id, sale_date);

INSERT INTO edw.sales_fact_march
SELECT * FROM edw.sales_fact
WHERE sale_date >= '2011-03-01'
AND sale_date < '2011-04-01';

ANALYZE edw.sales_fact_march;
This next step is optional but you can now TRUNCATE the March 2011 partition because all of the data is in the sales_fact_march table. Plus, we need to exchange the two and it is faster to exchange with a truncated table.
TRUNCATE TABLE edw.sales_fact_1_prt_3;
Note: I got the partition table name from pg_partitions.
Now exchange the row-oriented partition for the columnar, compressed one.
ALTER TABLE edw.sales_fact EXCHANGE PARTITION FOR (RANK(3)) WITH TABLE edw.sales_fact_march WITH VALIDATION;
Note: I got the RANK from pg_partitions.
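As a sketch of those lookups, a query like this against the pg_partitions catalog view should return both the underlying partition table name and the rank (column names are as documented for Greenplum's pg_partitions view):

```sql
SELECT partitiontablename, partitionrank
FROM pg_partitions
WHERE schemaname = 'edw'
AND tablename = 'sales_fact'
ORDER BY partitionrank;
```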
We no longer need the temporary March 2011 data so we can drop this table.
DROP TABLE edw.sales_fact_march;
Lastly, add an empty April 2011 partition.
ALTER TABLE edw.sales_fact ADD PARTITION START ('2011-04-01'::timestamp) INCLUSIVE END ('2011-05-01'::timestamp) EXCLUSIVE;
The new DDL for the table looks like this:
CREATE TABLE edw.sales_fact
(
  store_id integer,
  sales_person_id integer,
  product_id integer,
  sale_date date,
  sale_price numeric,
  product_cost numeric,
  spiff numeric
)
WITH (OIDS=FALSE)
DISTRIBUTED BY (store_id, sales_person_id, product_id, sale_date)
PARTITION BY RANGE(sale_date)
(
  START ('2011-01-01'::date) END ('2011-02-01'::date)
    WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5),
  START ('2011-02-01'::date) END ('2011-03-01'::date)
    WITH (appendonly=true, orientation=column, compresstype=quicklz),
  START ('2011-03-01'::date) END ('2011-04-01'::date)
    WITH (appendonly=true, orientation=column, compresstype=quicklz),
  START ('2011-04-01'::date) END ('2011-05-01'::date)
);
Straight from the Greenplum Admin Guide:
gpfdist is Greenplum’s parallel file distribution program. It is used by readable external tables and gpload to serve external table files to all Greenplum Database segments in parallel. It is used by writable external tables to accept output streams from Greenplum Database segments in parallel and write them out to a file.
In this post, I’m going to demonstrate how to use a readable external table to load data that uses gpfdist. This makes it easy for Greenplum users to load massive amounts of data by just writing SQL!
First, start gpfdist as a background process.
gpfdist -d /Users/gpadmin -p 8080 &
Create my Table in Greenplum.
CREATE TABLE foo (bar_id int not null, bar text, bar_description text) DISTRIBUTED BY (bar_id);
Now create my External Table.
CREATE EXTERNAL TABLE ext_foo (LIKE foo) LOCATION ('gpfdist://localhost:8080/demo/foo.txt') FORMAT 'TEXT' (DELIMITER AS '|' NULL AS 'null');
- I could have spelled out all of the columns but I used a shortcut by using “LIKE foo”
- The location indicates it is using gpfdist and the host is local with port 8080
- My file is located /Users/gpadmin/demo/foo.txt but gpfdist is serving /Users/gpadmin so I need to only specify the demo subdirectory and then the file name
- You can pick TEXT or CSV format, but TEXT will execute faster
- I used pipes as my delimiter and spelled out null for my null values
INSERT INTO foo SELECT * FROM ext_foo;
You are done!
- When creating the External Table, you can also specify an error_table (created dynamically if not found) with a reject limit.
LOG ERRORS INTO error_table SEGMENT REJECT LIMIT count ROWS
- If you have a really large file to load, you can split the file into multiple files and then specify multiple locations in the External Table. You can even specify different hosts for each file with multiple gpfdists running at once.
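Putting those two options together, a sketch of the External Table definition might look like this (the host names, file names, error table name, and reject limit are all hypothetical):

```sql
CREATE EXTERNAL TABLE ext_foo (LIKE foo)
LOCATION ('gpfdist://etl1:8080/demo/foo1.txt',
          'gpfdist://etl2:8080/demo/foo2.txt')
FORMAT 'TEXT' (DELIMITER AS '|' NULL AS 'null')
LOG ERRORS INTO err_foo SEGMENT REJECT LIMIT 100 ROWS;
```

With this definition, the load runs against both gpfdist hosts in parallel, and up to 100 malformed rows per segment land in err_foo instead of failing the whole INSERT.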
And remember, inserting data this way pushes the data directly to the segments. It bypasses the MASTER server, so it scales much better than other databases' loading utilities. Greenplum can load 10TB/hour for a full-rack DCA when using gpfdist!
Greenplum has partnered with Silver Spring Networks to provide a solution for utility companies to manage and mine their smart grid data. The solution developed includes a significant portion of Outsourcer! Check out the white paper from EMC.
Security is always divided into two main actions: authentication and authorization. Authentication is proving you are who you say you are, usually by providing a username and a password. Authorization is the process of approving or denying access to objects, such as SELECT on a table. Greenplum always handles authorization internally; you can't delegate it to a third party, but you can delegate authentication.
Authentication is handled in Greenplum just like it is in PostgreSQL. Edit your pg_hba.conf file on the MASTER and provide information about how users should be authenticated. Typically, companies are using an LDAP solution for security and this is also usually the hub in a hub and spoke security model. Furthermore, Microsoft Active Directory is usually the LDAP solution picked because so many desktops are using Windows.
The security hub has a company’s password rules like failed attempts, password complexity, password expiration, etc. and you can leverage this very easily with Greenplum’s authentication model. Simply edit your pg_hba.conf file and at the very bottom of the file (it is read sequentially), add this:
host all all 0.0.0.0/0 ldap ldapserver=<ldap server> ldapprefix="<domain>\" ldapsuffix=""
- LDAP server = ldap1
- Windows Domain = myco
So your entry at the bottom of the pg_hba.conf file would look like this:
host all all 0.0.0.0/0 ldap ldapserver=ldap1 ldapprefix="myco\" ldapsuffix=""
During a login attempt, the pg_hba.conf file is read and the first matching criteria is used. So if you want to use basic password authentication for an ETL user or trust for local users, simply put these entries before the LDAP entry. Then the last line is the LDAP entry that is the “catch-all” which authenticates users to your Active Directory.
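As an illustration of that ordering, the bottom of pg_hba.conf might look something like this (the etl user and its subnet are hypothetical; the ldap1 server and myco domain are from the example above):

```
local   all   gpadmin                 trust
host    all   etl       10.1.1.0/24   md5
host    all   all       0.0.0.0/0     ldap ldapserver=ldap1 ldapprefix="myco\" ldapsuffix=""
```

The gpadmin and etl entries match first, so those accounts authenticate locally; everyone else falls through to the LDAP catch-all.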
How does it work? It takes the username and password passed in and attempts to connect to the LDAP server with the ldapprefix (domain) specified. If the user can connect to the LDAP server, then the user is authenticated and then connects to Greenplum. It is that simple. Once in the database, Greenplum handles Authorization as normal.
Merge is a feature in most databases but I have never liked it much. My source table always had duplicates in it which prevented me from using it.
Here is an example of such with Microsoft SQL Server (BTW, Oracle behaves the same way):
CREATE TABLE stage
(
  id int identity NOT NULL,
  provider_id int NOT NULL,
  provider text NOT NULL
);

insert into stage (provider_id, provider) values (1, 'bubba');
insert into stage (provider_id, provider) values (2, 'gump');
insert into stage (provider_id, provider) values (3, 'foo');
insert into stage (provider_id, provider) values (4, 'bar');
insert into stage (provider_id, provider) values (4, 'jon');

CREATE TABLE target
(
  provider_id int NOT NULL PRIMARY KEY,
  provider text NOT NULL
);
My stage table is where new records get loaded from a file and then my target table is what I’m trying to keep up to date. So using a MERGE statement, I should be able to insert the new records and if they exist already, update it.
merge target
using (select * from stage) as stage
on (target.provider_id = stage.provider_id)
when matched then
  update set provider = stage.provider
when not matched then
  insert (provider_id, provider)
  values (stage.provider_id, stage.provider);
That is the syntax for MERGE in SQL Server and when executed, I get this error message:
Msg 2627, Level 14, State 1, Line 1
Violation of PRIMARY KEY constraint 'PK__target__00E213107132C993'. Cannot insert duplicate key in object 'dbo.target'.
The statement has been terminated.
Why? Well my stage table has two records with the same provider_id. When this happens, I don’t want to error but instead, I want the last value in the file to be used. Unfortunately, MERGE doesn’t handle this. Instead, you have to write code.
In Greenplum, here is how you do a “smart” MERGE that handles real-world data where the source has duplicates:
CREATE TABLE stage
(
  id serial NOT NULL,
  provider_id int NOT NULL,
  provider text NOT NULL
) DISTRIBUTED BY (provider_id);

insert into stage (provider_id, provider) values (1, 'bubba');
insert into stage (provider_id, provider) values (2, 'gump');
insert into stage (provider_id, provider) values (3, 'foo');
insert into stage (provider_id, provider) values (4, 'bar');
insert into stage (provider_id, provider) values (4, 'jon');

CREATE TABLE target
(
  provider_id int NOT NULL PRIMARY KEY,
  provider text NOT NULL
) DISTRIBUTED BY (provider_id);
Now create the function:
create or replace function fn_merge_target() returns void as
$$
declare
  v_function text := 'fn_merge_target';
  v_location int;
begin
  v_location := 1000;
  --delete the old records
  delete from target t
  using stage s
  where t.provider_id = s.provider_id
  and t.provider <> s.provider;

  v_location := 2000;
  --inserts
  insert into target (provider_id, provider)
  select sub.provider_id, sub.provider
  from (
    select s.*,
           rank() over (partition by s.provider_id order by s.id desc) as rank
    from stage s
    left join target t on s.provider_id = t.provider_id
    where t.provider is null
  ) as sub
  where sub.rank = 1;

exception
  when others then
    raise exception '(%:%:%)', v_function, v_location, sqlerrm;
end;
$$
language plpgsql;
Execute the function:

SELECT fn_merge_target();
And now query target:
select * from target order by provider_id;
I have 4 rows and for provider_id 4, I have the last version from the stage table rather than failing the command.
In summary, MERGE sounds nice but I've never been able to use it. My source or stage table always has duplicates, and instead of failing, I want to apply the last version of each primary key record. I demonstrated an easy-to-use way to provide a much smarter MERGE that is optimized for Greenplum and uses one of the built-in analytical functions (rank()).
A customer recently had a pretty common question about archiving old data in Greenplum. They wanted to purge old data in their OLTP system so that it remains small and fast. Historical data that isn’t changing, stays in Greenplum.
An easy way to do this is with a view over two tables, where the view does a UNION ALL of the data in each. Unfortunately, Greenplum, like most databases, executes each branch of a UNION ALL serially. The optimal way to handle this is to partition the data.
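For reference, the view approach being dismissed would look something like this (the table and view names are hypothetical):

```sql
CREATE VIEW edw.auths_v AS
SELECT * FROM edw.auths_current
UNION ALL
SELECT * FROM edw_archive.auths;
```

It works, but each branch of the UNION ALL is executed one after the other, which is why partitioning is the better choice here.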
So let’s say the customer has a schema called edw and a table called auths.
create schema edw;

CREATE TABLE edw.auths
(id int NOT NULL,
 fname varchar,
 lname varchar,
 episodedate timestamp)
DISTRIBUTED BY (id);
Insert some sample data.
INSERT INTO edw.auths VALUES (1, 'foo', 'bar', '2001-05-01'::timestamp);
INSERT INTO edw.auths VALUES (2, 'foo', 'bar', '2012-12-01'::timestamp);
INSERT INTO edw.auths VALUES (3, 'too', 'cool', '2010-01-01'::timestamp);
INSERT INTO edw.auths VALUES (4, 'too', 'cool', '2010-01-02'::timestamp);
INSERT INTO edw.auths VALUES (5, 'old', 'data', '2004-01-02'::timestamp);
Next, rename this schema.
ALTER SCHEMA edw RENAME TO edw_archive;
Create the new edw schema.
CREATE SCHEMA edw;
In this case, anything older than 2010-01-01 will be archived and anything newer will be in the current OLTP system.
CREATE TABLE edw.auths (LIKE edw_archive.auths)
PARTITION BY RANGE (episodedate)
(START ('2000-01-01'::timestamp) INCLUSIVE END ('2010-01-01'::timestamp) EXCLUSIVE
   WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5),
 DEFAULT PARTITION auths_current);
I’m using a pretty high level of compression and column orientation for the archive data. This will help performance and use less disk space for the archived data. The default partition has the current data in it and it remains a row store that isn’t compressed. This is a good example of Polymorphic storage where partitions within the same table are stored in different ways.
Insert the old data into the new table.
INSERT INTO edw.auths SELECT * FROM edw_archive.auths;
Just to show that everything works.
SELECT * FROM edw.auths;
With Outsourcer, you will want to change the job that loads edw.auths so that it inserts data into the default partition and leaves the archive partition untouched. To do this, get the names of the underlying inherited tables.
SELECT * FROM pg_partitions WHERE schemaname = 'edw' AND tablename = 'auths';
You can see the two tables independently.
--new data
SELECT * FROM edw.auths_1_prt_auths_current;

--archive data
SELECT * FROM edw.auths_1_prt_2;
To change Outsourcer, first remove the old job for edw.auths and insert the new one with the target of edw.auths_1_prt_auths_current.
DELETE FROM os.job
WHERE (target).schema_name = 'edw'
AND (target).table_name = 'auths';

INSERT INTO os.job (refresh_type, target, source)
VALUES ('refresh',
        ('edw', 'auths_1_prt_auths_current'),
        ('sqlserver', 'jonnywin', null, null, 'edw', 'dbo', 'auths', 'sa', 'sa'));
Now when you execute this job, it will truncate the current partition, insert the data from the OLTP system, and then analyze the partition. The archive partition will be left untouched. Querying data across the archived and current data will execute much faster than a UNION ALL because it runs in parallel and it is compressed and column oriented.
Greenplum leverages PostgreSQL internals, which brings great benefits in security, Multi-Version Concurrency Control, and procedural languages. Greenplum supports a wide variety of languages, but my favorite by far has to be PL/pgSQL.
PL/pgSQL feels a lot like Oracle PL/SQL. It has the same “Ada” look and feel which makes it easy for an Oracle developer to start writing PL/pgSQL.
Since Greenplum supports the same language that PostgreSQL does, you can find excellent information online about different techniques for writing quality PL/pgSQL functions. For example, here is good documentation on how to port from Oracle PL/SQL to PL/pgSQL => http://www.postgresql.org/docs/8.2/interactive/plpgsql-porting.html
I follow the same template for each PL/pgSQL function I write, and below are the conventions I always follow.
- Functions: prefix with fn_ to denote it is a function
- Table names: all lower case with underscores. The data dictionary stores objects in all lower case, so it is easier to read table names with underscores.
- Variables: prefix with v_
- Parameters: prefix with p_
- Function Name: I add the function name to each function I write so that I can include it in the exception message. This is very helpful when you have functions executing other functions and you need to know which function really failed.
- Location: I prefer to add a location variable to functions and denote where I am in the code frequently. I use the location value to know where the error occurred.
- Exception Block: I raise the exception with:

RAISE EXCEPTION '(%:%:%)', v_function_name, v_location, sqlerrm;
The parentheses surround the error message from this function. If the function executes another function and the second function is the one that failed, the parentheses help indicate which function failed, the location of the error, and the error message that occurred.
CREATE OR REPLACE FUNCTION fn_name() RETURNS void AS
$$
DECLARE
  v_function_name text := 'fn_name';
  v_location int;
BEGIN
  v_location := 1000;

EXCEPTION
  WHEN OTHERS THEN
    RAISE EXCEPTION '(%:%:%)', v_function_name, v_location, sqlerrm;
END;
$$
LANGUAGE plpgsql;
As you develop your code, increment the v_location variable so that if you do get an error, it will be raised with the approximate location in your code.
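For instance, within the function body you might bump v_location before each statement so that the exception message pinpoints the failing step (the table names here are hypothetical):

```sql
v_location := 1000;
delete from stg_customer;

v_location := 2000;
insert into stg_customer
select * from ext_customer;

v_location := 3000;
analyze stg_customer;
```

If the INSERT fails, the exception reports location 2000, which immediately narrows the problem to that statement.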
More examples of my PL/pgSQL code can be found in the Outsourcer code. I have about 8 functions that you can reference for examples of writing PL/pgSQL functions.
It seems that everyone has it. Ugly data. Data that isn’t so clean in transactional systems like control characters or rarely used characters that are used programmatically. This ugly data typically errors out during a load process. This requires someone to cleanse the data by writing scripts and then loading the data.
This is time consuming and feels like you are re-inventing the wheel over and over again.
Outsourcer does the cleansing for you! As the data is written to Greenplum, control characters are removed and special characters are escaped properly.
Here is a demo of Outsourcer in action loading ugly data.
Just like for SQL Server, you may want to know what is going on in Oracle from inside Greenplum. Here is an External Web Table that uses Outsourcer to give you that information.
First create a connection for Outsourcer:
INSERT INTO os.ext_connection
(type, server_name, instance_name, port, database_name, user_name, pass)
VALUES
('oracle', 'jonnywin', null, 1521, 'xe', 'os_test', 'os_password');

--get the ID you just inserted
SELECT id FROM os.ext_connection
WHERE type = 'oracle'
AND server_name = 'jonnywin'
AND port = 1521
AND database_name = 'xe';

--in my example, the value is 3.
Next create the schema in Greenplum for this table:
CREATE SCHEMA oracle;
This will create the External Table in Greenplum for Outsourcer:
--notice the double escaping of the dollar sign.
SELECT os.fn_create_ext_table('oracle.sessions',
       ARRAY['username varchar(30)','osuser varchar(30)','machine varchar(64)',
             'terminal varchar(30)','program varchar(48)','module varchar(48)',
             'sql_text text','logon_time timestamp','service_name varchar(64)'],
       3,
       'SELECT s.USERNAME, s.OSUSER, s.MACHINE, s.TERMINAL, s.PROGRAM, s.MODULE,
               a.sql_text, s.LOGON_TIME, s.service_name
        FROM v\\\\$session s, v\\\\$sqlarea a
        WHERE a.address = s.sql_address');
Now you can execute the query to get the session information:
select * from oracle.sessions;
More information on External Tables is in the Outsourcer documentation.
This is a very good question. When using Outsourcer, you may find a job run longer or shorter from time to time and in my experience, it is because the source database is waiting on something. It is the bottleneck 99% of the time. So why is it slow today versus yesterday?
Databases have wait stats and SQL Server has this information too. You can query SQL Server to find what it is waiting on by logging into SQL Server and executing a SQL command. With the code provided below, you can leverage Outsourcer to query SQL Server from Greenplum!
First create a connection for Outsourcer:
INSERT INTO os.ext_connection
(type, server_name, instance_name, port, database_name, user_name, pass)
VALUES
('sqlserver', 'jonnywin', null, null, null, 'os_test', 'os_password');

--get the ID you just inserted
SELECT id FROM os.ext_connection
WHERE type = 'sqlserver'
AND server_name = 'jonnywin';

--in my example, the value is 1.
Next create the schema in Greenplum for this table:
CREATE SCHEMA sqlserver;
This will create the External Table in Greenplum for Outsourcer:
SELECT os.fn_create_ext_table('sqlserver.sessions',
       ARRAY['sql_time timestamp','start_time timestamp','status varchar(30)',
             'session_id smallint','sqltext text'],
       1,
       'SELECT getdate() as sql_time, req.start_time, req.status, req.session_id,
               sqltext.TEXT as sqltext
        FROM sys.dm_exec_requests req
        CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS sqltext
        ORDER BY req.start_time');
Now you can execute the query to get the session information:
select * from sqlserver.sessions;
I have found that the most critical thing to look for from SQL Server is the “status”. You will see it “Running” when it is actually doing something but “Suspended” when it is waiting on a resource.
So if you see queries generated by Outsourcer “Suspended” in SQL Server, that is why it is slower today versus yesterday. Work with your SQL Server DBA to determine what resource(s) SQL Server is waiting on.
More information on External Tables is in the Outsourcer documentation.