A customer recently had a pretty common question about archiving old data in Greenplum. They wanted to purge old data from their OLTP system so that it stays small and fast, while historical data that isn't changing stays in Greenplum.
An easy way to do this is with a view and two tables: the view would UNION ALL the data from the two tables. Unfortunately, Greenplum, like most other databases, executes each query separated by a UNION ALL serially. The optimal way to handle this is to partition the data.
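For illustration, here is a minimal sketch of that view approach (the table and view names are hypothetical):
-- Hypothetical two-table design: one table for current rows, one for archived rows
CREATE TABLE auths_current
(id int not null,
fname varchar,
lname varchar,
episodedate timestamp)
DISTRIBUTED BY (id);
CREATE TABLE auths_archive (LIKE auths_current) DISTRIBUTED BY (id);
-- The view stitches the two together, but each branch of the UNION ALL runs serially
CREATE VIEW v_auths AS
SELECT * FROM auths_current
UNION ALL
SELECT * FROM auths_archive;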
So let’s say the customer has a schema called edw and a table called auths.
CREATE SCHEMA edw;
CREATE TABLE edw.auths
(id int not null,
fname varchar,
lname varchar,
episodedate timestamp)
DISTRIBUTED BY (id);
Insert some sample data.
INSERT INTO edw.auths VALUES (1, 'foo', 'bar', '2001-05-01'::timestamp);
INSERT INTO edw.auths VALUES (2, 'foo', 'bar', '2012-12-01'::timestamp);
INSERT INTO edw.auths VALUES (3, 'too', 'cool', '2010-01-01'::timestamp);
INSERT INTO edw.auths VALUES (4, 'too', 'cool', '2010-01-02'::timestamp);
INSERT INTO edw.auths VALUES (5, 'old', 'data', '2004-01-02'::timestamp);
Next, rename this schema.
ALTER SCHEMA edw RENAME TO edw_archive;
Create the new edw schema.
CREATE SCHEMA edw;
In this case, anything before 2010-01-01 will be archived and anything from that date on will be in the current OLTP data.
CREATE TABLE edw.auths
(LIKE edw_archive.auths)
PARTITION BY RANGE (episodedate)
(START ('2000-01-01'::timestamp) INCLUSIVE
END ('2010-01-01'::timestamp) EXCLUSIVE
WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5),
DEFAULT PARTITION auths_current);
I’m using a pretty high level of compression and column orientation for the archive partition, which improves performance and uses less disk space. The default partition holds the current data and remains an uncompressed row store. This is a good example of polymorphic storage, where partitions within the same table are stored in different ways.
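If you want to verify the storage mix, Greenplum exposes it in the catalog. A quick check, assuming the Greenplum 4.x catalog columns (relstorage is 'h' for heap, 'a' for append-only row, 'c' for append-only column; the child table names follow the pattern shown later in this post):
-- Heap partitions won't have a pg_appendonly row, so the compression columns will be null
SELECT c.relname, c.relstorage, a.compresstype, a.compresslevel
FROM pg_class c
LEFT JOIN pg_appendonly a ON a.relid = c.oid
WHERE c.relname LIKE 'auths_1_prt_%';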
Insert the old data into the new table.
INSERT INTO edw.auths SELECT * FROM edw_archive.auths;
Just to show that everything works.
SELECT * FROM edw.auths;
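Another benefit over the UNION ALL view is partition elimination. Something like this should show only the current (default) partition being scanned; the exact plan output varies by Greenplum version:
-- Only the default partition should be scanned for recent dates
EXPLAIN SELECT * FROM edw.auths WHERE episodedate >= '2010-01-01'::timestamp;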
With Outsourcer, you will want to change the job that loads edw.auths so that it inserts data into the default partition and doesn’t touch the archive partition. To do this, get the names of the underlying inherited tables.
SELECT * FROM pg_partitions WHERE schemaname = 'edw' AND tablename = 'auths';
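pg_partitions is quite wide, so you may want just the columns that identify each child table (column names as in the Greenplum catalog):
SELECT partitionname, partitiontablename
FROM pg_partitions
WHERE schemaname = 'edw' AND tablename = 'auths';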
You can see the two tables independently.
--new data
SELECT * FROM edw.auths_1_prt_auths_current;
--archive data
SELECT * FROM edw.auths_1_prt_2;
To change Outsourcer, first remove the old job for edw.auths and insert the new one with the target of edw.auths_1_prt_auths_current.
DELETE FROM os.job WHERE (target).schema_name = 'edw' AND (target).table_name = 'auths';
INSERT INTO os.job(
refresh_type, target, source)
VALUES ('refresh',
('edw', 'auths_1_prt_auths_current'),
('sqlserver', 'jonnywin', null, null, 'edw', 'dbo', 'auths', 'sa', 'sa'));
Now when you execute this job, it will truncate the current partition, insert the data from the OLTP system, and then analyze the partition. The archive partition will be left untouched. Queries spanning the archived and current data will execute much faster than a UNION ALL because the partitions are scanned in parallel and the archive is compressed and column oriented.
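As time goes on, you will probably want to move more data out of the default partition and into the archive. Here is a sketch of how that could look with SPLIT DEFAULT PARTITION (the dates and partition name are hypothetical; verify the syntax against your Greenplum version):
-- Carve 2010 out of the default partition into its own partition
ALTER TABLE edw.auths SPLIT DEFAULT PARTITION
START ('2010-01-01'::timestamp) INCLUSIVE
END ('2011-01-01'::timestamp) EXCLUSIVE
INTO (PARTITION y2010, DEFAULT PARTITION);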