Occasionally, I see the request to get the filename added to a file read by gpfdist. Here is a way to do it!
First, create a YML file named “transform_config.yml” with the following:
--- VERSION: 1.0.0.1 TRANSFORMATIONS: transformation_input: TYPE: input CONTENT: data COMMAND: /bin/bash transform.sh
Next, create that “transform.sh” file. This is just a simple example that gets all txt files but you can also pass in a parameter to this script. The filename in the external table gets passed to the script.
#!/bin/bash set -e for i in $(ls *.txt); do awk '{print FILENAME"|"$0}' $i; done
Create two test files (test_file_1.txt and test_file_2.txt).
cat test_file_1.txt 1|foo1 2|foo2 3|foo3 4|foo4 5|foo5 6|foo6 7|foo7 8|foo8 9|foo9 10|foo10
cat test_file_2.txt 11|foo11 12|foo12 13|foo13 14|foo14 15|foo15 16|foo16 17|foo17 18|foo18 19|foo19 20|foo20
Start gpfdist in the background.
gpfdist -p 8999 -c transform_config.yml > mylog 2>&1 < mylog &
Create the External Table but be sure to change the hostname. Note that "foo" in the LOCATION is the filename. I'm ignoring it for this example but this is how you can pass parameters to the script. You add %filename% to the YML file as the parameter to the script.
CREATE EXTERNAL TABLE ext_transform_table (filename text, id int, descrption text) LOCATION ('gpfdist://gpdbsne:8999/foo#transform=transformation_input') FORMAT 'text' (DELIMITER '|')
Now select from the External Table.
select * from ext_transform_table; filename | id | descrption -----------------+----+------------ test_file_1.txt | 1 | foo1 test_file_1.txt | 2 | foo2 test_file_1.txt | 3 | foo3 test_file_1.txt | 4 | foo4 test_file_1.txt | 5 | foo5 test_file_1.txt | 6 | foo6 test_file_1.txt | 7 | foo7 test_file_1.txt | 8 | foo8 test_file_1.txt | 9 | foo9 test_file_1.txt | 10 | foo10 test_file_2.txt | 11 | foo11 test_file_2.txt | 12 | foo12 test_file_2.txt | 13 | foo13 test_file_2.txt | 14 | foo14 test_file_2.txt | 15 | foo15 test_file_2.txt | 16 | foo16 test_file_2.txt | 17 | foo17 test_file_2.txt | 18 | foo18 test_file_2.txt | 19 | foo19 test_file_2.txt | 20 | foo20 (20 rows)