TPC-H 데이터베이스 성능 검사

Notice

Recent Posts

Recent Comments

Link

« 2025/07 »
일	월	화	수	목	금	토
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Tags more

Archives

Today

Total

관리 메뉴

멍청해서 기록한다

TPC-H 데이터베이스 성능 검사 본문

Etc

TPC-H 데이터베이스 성능 검사

개발근로자 2020. 1. 8. 09:06

Solution

<Ensure you call out if authorized access / privileges are needed by customers> Following are the steps taken to resolve the issue: 1. <Step 1> – <flags that indicate that Step 1 was a success> 2. <Step 2> – <flags that indicate Step 2 success>

1) Download the freely available dbgen tool from
TPC org: http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp
You will see many different workloads. For our case, download the tool corresponding to TPC-H_Tools
Follow the instructions for registration and download the tool and copy to local mountpoint with free space > 100G.
Note: With Mapr, you can use DFS location mounted using NFS. So you are not constrained by local disk.

2) Build the dbgen tool.
Readme in the dbgen tool has the instruction on how to build the tool.
Make sure that PLATFORM/MACHINE is set to "LINUX" in makefile. You can ignore the DATABASE values even if they are set.
cd tpch-dbgen/ vi makefile make
Once the build is successful, it should generate "dbgen" and "qgen" tools.

3) Generate the data.
By default, dbgen will create the data in ascii text files, one file for each table in TPC-H framework.
The fields are separated by pipe "|" and file with an extension of .tbl

Use the below command to generate approx 50G of data for the "lineitem" table.
for i in `seq 1 50`; do echo $i; ./dbgen -s 50 -S $i -C 50 -T L -v; done;
After successful completion, it will generate nearly 50 files in the current directory with name lineitem.tbl.*
Together, there could be 600M rows.
If you want more data, just increase the scaling factor from 50G to whatever value.

4) Copy all the above generated files to some convenient DFS location on your cluster
For ex: /user/mapr/tpch/lineitem/

5) Create a text/flat table as Hive external table.create database tpch_text; use tpch_text; create external table lineitem (L_ORDERKEY BIGINT, L_PARTKEY BIGINT, L_SUPPKEY BIGINT, L_LINENUMBER INT, L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE, L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING, L_SHIPDATE STRING, L_COMMITDATE STRING, L_RECEIPTDATE STRING, L_SHIPINSTRUCT STRING, L_SHIPMODE STRING, L_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/user/mapr/tpch/lineitem';
6) Now, create a Parquet table in Hive using above external table.
Make sure to set dynamic partitioning in Hive to "nonstrict" before running the Insert statement.
Note, that i am creating the table partitioned "monthly" based on "L_SHIPDATE" column.
By default, if you use date column as is, it will create daily partitions which may create very small parquet files in the
range of few kb's which will not give the best performance in our case where data is not too huge.
So, in order to create optimally sized parquet files, we created monthly partitioned table using substring(L_SHIPDATE,1,7) which returns, for ex: "2017-08" without the date of the month.
create database tpch_parquet; use tpch_parquet; create table lineitem_monthly (L_ORDERKEY BIGINT, L_PARTKEY BIGINT, L_SUPPKEY BIGINT, L_LINENUMBER INT, L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE, L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING, L_COMMITDATE STRING, L_RECEIPTDATE STRING, L_SHIPINSTRUCT STRING, L_SHIPMODE STRING, L_COMMENT STRING) partitioned by (monthly string) stored as PARQUET ; use tpch_parquet; set hive.exec.dynamic.partition.mode=nonstrict; INSERT OVERWRITE TABLE lineitem_monthly partition(monthly) select L_ORDERKEY , L_PARTKEY , L_SUPPKEY , L_LINENUMBER , L_QUANTITY , L_EXTENDEDPRICE , L_DISCOUNT , L_TAX , L_RETURNFLAG , L_LINESTATUS , L_COMMITDATE , L_RECEIPTDATE , L_SHIPINSTRUCT , L_SHIPMODE , L_COMMENT , substring(L_SHIPDATE,1,7) from tpch_text.lineitem ;
The above procedure can be repeated for the remaining tables in the TPC-H schema.

저작자표시 (새창열림)

'Etc' 카테고리의 다른 글

인터넷 끊김 발생시 대처법 (0)	2020.10.16
데몬 모니터링 툴 (0)	2020.04.10
캐시 동작 원리 (0)	2020.03.10
정규표현식 모음 (0)	2020.02.07
TPC-H 데이터베이스 성능 측정 (0)	2020.01.20

'Etc' Related Articles

멍청해서 기록한다

TPC-H 데이터베이스 성능 검사 본문

TPC-H 데이터베이스 성능 검사

'Etc' 카테고리의 다른 글

티스토리툴바