Parquet is a column-oriented storage format for data in HDFS. It is part of the Apache ecosystem and is available in Pivotal HD and HAWQ. HAWQ can store and read data in the Parquet format, and the format is also accessible from the open source components of Pivotal HD such as Pig and MapReduce.
Here is a quick example showing how this works.
First, create a table with some data. You can either use the CTAS method or the more traditional CREATE TABLE followed by INSERT; either works, so it comes down to preference. This example generates only 100 records of fake customer data.
CTAS
CREATE TABLE customer
WITH (appendonly=true, orientation=parquet) AS
SELECT i AS id,
       'jon' || i AS fname,
       'roberts' || i AS lname,
       i::text || ' main street'::text AS address,
       'new york'::text AS city,
       'ny'::text AS state,
       lpad(i, 5, '0') AS zip
  FROM (SELECT generate_series(1, 100) AS i) AS sub
DISTRIBUTED BY (id);
CREATE and then INSERT
CREATE TABLE customer
(
    id integer,
    fname text,
    lname text,
    address text,
    city text,
    state text,
    zip text
)
WITH (APPENDONLY=true, ORIENTATION=parquet, OIDS=FALSE)
DISTRIBUTED BY (id);

INSERT INTO customer
SELECT i AS id,
       'jon' || i AS fname,
       'roberts' || i AS lname,
       i::text || ' main street'::text AS address,
       'new york'::text AS city,
       'ny'::text AS state,
       lpad(i, 5, '0') AS zip
  FROM (SELECT generate_series(1, 100) AS i) AS sub;
Now you have data in the Parquet format in HAWQ. Pretty easy, huh?
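If you want to verify the load, the table queries like any other. Here is a quick sanity check against the customer table created above; there is nothing Parquet-specific about it, just plain SQL:

-- Count the rows and peek at a few of them.
SELECT count(*) FROM customer;

SELECT *
  FROM customer
 ORDER BY id
 LIMIT 5;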
Next, I’ll use a nifty tool that queries the HAWQ catalog to tell me where the Parquet files are.
gpextract -o customer.yaml -W customer -dgpadmin
And here is the customer.yaml file it created.
DBVersion: PostgreSQL 8.2.15 (Greenplum Database 4.2.0 build 1) (HAWQ 1.2.0.1 build 8119) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Apr 23 2014 16:12:32
DFS_URL: hdfs://phd1.pivotalguru.com:8020
Encoding: UTF8
FileFormat: Parquet
Parquet_FileLocations:
  Checksum: false
  CompressionLevel: 0
  CompressionType: null
  EnableDictionary: false
  Files:
  - path: /hawq_data/gpseg0/16385/16554/16622.0
    size: 4493
  - path: /hawq_data/gpseg1/16385/16554/16622.0
    size: 4499
  PageSize: 1048576
  RowGroupSize: 8388608
TableName: public.customer
Version: 1.0.0
Notice the paths to the files, which are stored in HDFS in the Parquet format.
Now you can use a tool like Pig to look at the data.
grunt> A = load '/hawq_data/gpseg{0,1}/16385/16554/16622' USING parquet.pig.ParquetLoader();
grunt> describe A;
A: {id: int,fname: bytearray,lname: bytearray,address: bytearray,city: bytearray,state: bytearray,zip: bytearray}
grunt> B = foreach A generate id, fname, lname, address, city, state, zip;
grunt> dump B;
(2,jon2,roberts2,2 main street,new york,ny,00002)
(4,jon4,roberts4,4 main street,new york,ny,00004)
(6,jon6,roberts6,6 main street,new york,ny,00006)
(8,jon8,roberts8,8 main street,new york,ny,00008)
.....
Parquet is easy to use in HAWQ and doesn’t lock you into a Pivotal HD and HAWQ-only solution. Other tools like Pig or MapReduce can read the Parquet files in your Hadoop cluster just as easily. No vendor lock-in.
Is there a performance benefit to HAWQ’s column orientation versus Parquet?
It depends.
HAWQ, like Greenplum Database, implements columnar storage with a file per column per segment. This is great for reducing I/O because queries read only the smaller files for the columns they need. However, Hadoop doesn’t handle lots and lots of files well, so a column-oriented table may actually hurt performance in HAWQ.
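For comparison, here is a sketch of the same customer table created with the file-per-column storage. It assumes the orientation=column option familiar from Greenplum; the table name customer_column is just for illustration:

-- Column-oriented storage: each segment writes a separate file per column.
CREATE TABLE customer_column
(
    id integer,
    fname text,
    lname text,
    address text,
    city text,
    state text,
    zip text
)
WITH (APPENDONLY=true, ORIENTATION=column)
DISTRIBUTED BY (id);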
For example, let’s say you have a partitioned, column-oriented table in HAWQ with 37 partitions, 100 columns, and 96 segments. In that scenario you have roughly 96 * 100 * 37 = 355,200 files for this one table. With Parquet, you have 96 * 37 = 3,552 files, which is far more manageable.
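To make that scenario concrete, a partitioned Parquet table might look something like the sketch below. The sales table, its columns, and the 37 daily partitions are hypothetical and only illustrate how the file counts multiply:

-- Hypothetical partitioned Parquet table: roughly one data file per segment
-- per partition, instead of one per segment per partition per column.
CREATE TABLE sales
(
    sale_id integer,
    sale_date date,
    amount numeric
    -- imagine ~100 columns in the real table
)
WITH (APPENDONLY=true, ORIENTATION=parquet)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
(
    START (date '2014-01-01') INCLUSIVE
    END (date '2014-02-07') EXCLUSIVE
    EVERY (INTERVAL '1 day')
);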
So one table may perform better as Parquet while another performs better with column orientation, depending on its width, partitioning, and how it is queried.