I. Downloading and Installing TDengine
1. Download: grab the server package from the official TDengine website.
2. Upload to the target machine: I uploaded the package from WSL to the cloud server:
songroom@staff-NB-146:/mnt/d$ scp -r -P 19622 TDengine-server-2.4.0.7-Linux-x64.tar.gz click@103.10.3.83:~/taos
click@103.10.3.83's password:
TDengine-server-2.4.0.7-Linux-x64.tar.gz    100%   14MB 508.7KB/s   00:28
3. Extract: in the target directory ~/taos, unpack the archive:
click@iZ9ni05fy7agndgpndc7gsZ:~/taos$ tar xvf TDengine-server-2.4.0.7-Linux-x64.tar
TDengine-server-2.4.0.7/
TDengine-server-2.4.0.7/driver/
TDengine-server-2.4.0.7/driver/libtaos.so.2.4.0.7
TDengine-server-2.4.0.7/driver/vercomp.txt
TDengine-server-2.4.0.7/taos.tar.gz
TDengine-server-2.4.0.7/examples/
TDengine-server-2.4.0.7/examples/nodejs/
TDengine-server-2.4.0.7/examples/nodejs/nodejsChecker.js
TDengine-server-2.4.0.7/examples/nodejs/README-win.md
TDengine-server-2.4.0.7/examples/nodejs/test1970.js
......
4. Install
click@iZ9ni05fy7agndgpndc7gsZ:~/taos$ ls -a
.  ..  TDengine-server-2.4.0.7  TDengine-server-2.4.0.7-Linux-x64.tar
click@iZ9ni05fy7agndgpndc7gsZ:~/taos$ mv TDengine-server-2.4.0.7 TDengine-server
click@iZ9ni05fy7agndgpndc7gsZ:~/taos$ cd TDengine-server
click@iZ9ni05fy7agndgpndc7gsZ:~/taos/TDengine-server$ ll
total 11664
drwxr-xr-x  4 click click     4096 Feb 10 11:58 ./
drwxrwxr-x  3 click click     4096 Feb 16 10:00 ../
drwxr-xr-x  2 click click     4096 Feb 10 11:58 driver/
drwxr-xr-x 10 click click     4096 Feb 10 11:58 examples/
-rwxr-xr-x  1 click click    30890 Feb 10 11:58 install.sh*
-rw-r--r--  1 click click 11894309 Feb 10 11:58 taos.tar.gz
Run the installer:
click@iZ9ni05fy7agndgpndc7gsZ:~/taos/TDengine-server$ sudo ./install.sh
Start to install TDengine...
Created symlink from /etc/systemd/system/multi-user.target.wants/taosd.service to /etc/systemd/system/taosd.service.
System hostname is: iZ9ni05fy7agndgpndc7gsZ
Enter FQDN:port (like h1.taosdata.com:6030) of an existing TDengine cluster node to join
OR leave it blank to build one:
Enter your email address for priority support or enter empty to skip:
To configure TDengine : edit /etc/taos/taos.cfg
To configure taosadapter (if has) : edit /etc/taos/taosadapter.toml
To start TDengine : sudo systemctl start taosd
To access TDengine : taos -h iZ9ni05fy7agndgpndc7gsZ to login into TDengine server
TDengine is installed successfully!
5. Start and Verify
Note: this needs root privileges; if you are not logged in as an administrator, prefix the commands with sudo.
click@iZ9ni05fy7agndgpndc7gsZ:~/taos/TDengine-server$ sudo systemctl start taosd
click@iZ9ni05fy7agndgpndc7gsZ:~/taos/TDengine-server$ sudo systemctl status taosd
● taosd.service - TDengine server service
   Loaded: loaded (/etc/systemd/system/taosd.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2022-02-16 10:15:45 CST; 24s ago
  Process: 1351 ExecStartPre=/usr/local/taos/bin/startPre.sh (code=exited, status=0/SUCCESS)
 Main PID: 1359 (taosd)
   CGroup: /system.slice/taosd.service
           └─1359 /usr/bin/taosd
Feb 16 10:15:45 iZ9ni05fy7agndgpndc7gsZ systemd[1]: Starting TDengine server service...
Feb 16 10:15:45 iZ9ni05fy7agndgpndc7gsZ systemd[1]: Started TDengine server service.
Feb 16 10:15:45 iZ9ni05fy7agndgpndc7gsZ TDengine:[1359]: Starting TDengine service...
Feb 16 10:15:45 iZ9ni05fy7agndgpndc7gsZ TDengine:[1359]: Started TDengine service successfully.
II. Importing Data: A First Attempt Without Super Tables
The data is loaded into TDengine through the Python connector.
The idea: first simulate inserting data for every A-share stock into one big flat table, i.e. an ordinary table that does not use the super-table feature at all, and see how TDengine performs.
# -*- coding: utf-8 -*-
import taos
import pandas as pd
import datetime
import random as rd
import time

# Create the database
def td_create_db(conn, db_name):
    c1 = conn.cursor()
    sql = 'create database ' + db_name
    c1.execute(sql)

# Create one big flat table
def td_create_table(conn, table_name):
    c1 = conn.cursor()
    create_table_sql_0 = (f"create table {table_name} (datetime timestamp, code BINARY(20), "
                          "open float, close float, low float, high float, "
                          "volume float, money float, factor float,")
    create_table_sql_1 = ("high_limit float, low_limit float, avg float, "
                          "pre_close float, paused float, open_interest float)")
    create_table_sql = create_table_sql_0 + create_table_sql_1
    print(create_table_sql)
    c1.execute(create_table_sql)

def td_insert_data(conn, db_name, table_name):
    c1 = conn.cursor()
    c1.execute(f'use {db_name}')
    # Insert the data in batches, one multi-row statement per code
    t0 = time.time()
    for i in range(0, 100000):
        sql_insert = [f'insert into {table_name} values ']
        code = str(100000 + i) + ".XSHG"
        start_time = datetime.datetime(2014, 1, 1)
        time_interval = datetime.timedelta(seconds=1)
        for irow in range(1, 1000):
            start_time += time_interval
            t1 = rd.random() + 10
            t2 = rd.random() + 10
            t3 = rd.random() + 10
            t4 = rd.random() + 10
            t5 = rd.random() + 1000000.0
            t6 = rd.random() + 999999999.0
            t7 = rd.random() + 10
            t8 = rd.random() + 10
            t9 = rd.random() + 10
            t10 = rd.random() + 10
            t11 = rd.random() + 10
            t12 = rd.random() + 10
            t13 = rd.random() + 1000000.0
            sql_insert.append("('%s', '%s', %f, %f, %f, %f, %f, %f, %f, %f, %f, %f, %f, %f, %f)"
                              % (start_time, code, t1, t2, t3, t4, t5, t6, t7,
                                 t8, t9, t10, t11, t12, t13))
        sql = ' '.join(sql_insert)
        c1.execute(sql)
    print(f"insert row finished! cost time:{time.time()-t0}")

def td_fetch_data(conn, db_name, table_name):
    c1 = conn.cursor()
    # c1.execute(f"use {db_name}")
    c1.execute(f"select * from {db_name}.{table_name} where code = '100009.XSHG'")
    for data in c1:
        print(f"{data[0]} {data[1]}")

t0 = time.time()
conn = taos.connect(host="127.0.0.1")
db_name = "td_db_test"
tb_name = "stock_test"
td_fetch_data(conn, db_name, tb_name)
print(f"cost time :{time.time()-t0}")
Looking up the roughly 1,000 rows that satisfy the condition out of 110 million:
After 110 million rows had been inserted into this one big table, query efficiency was not high, which already shows how important super tables are.
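One practical detail in the batched insert above: TDengine limits the length of a single SQL statement (the `maxSQLLength` server setting, around 1 MB by default in 2.x), which is why rows have to be split across multiple INSERT statements. A minimal sketch of a chunker that respects a byte budget; the function name and the 500-byte demo limit are my own, and it is pure string-building, so no server is needed:

```python
def chunk_inserts(table, rows, max_sql_len=1024 * 1024):
    """Split pre-formatted '(...)' value strings into multi-row INSERT
    statements, each staying under max_sql_len characters.
    (A single row longer than the budget is still emitted as-is.)"""
    prefix = f"insert into {table} values "
    stmts, cur = [], prefix
    for row in rows:
        # flush the current statement if adding this row would overflow it
        if len(cur) + len(row) + 1 > max_sql_len and cur != prefix:
            stmts.append(cur)
            cur = prefix
        cur += row + " "
    if cur != prefix:
        stmts.append(cur)
    return stmts

# demo: 60 fake rows, tiny 500-char budget to force several batches
rows = ["('2014-01-01 00:00:%02d', '100001.XSHG', 10.1)" % i for i in range(60)]
stmts = chunk_inserts("stock_test", rows, max_sql_len=500)
```

Each statement in `stmts` can then be passed to `cursor.execute()` in turn, exactly as the loop above does with its single joined string.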
III. Trying Super Tables
Clearly, without super tables, TDengine's numbers are not encouraging.
Architecturally, the super-table design fits the stock market quite well: each individual stock's data is a child table, and each asset class, say A-shares, is a super table. Futures, indexes, and funds can likewise each be a super table, with one child table per specific instrument underneath.
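That mapping (asset class → super table, instrument → child table carrying tags) can be sketched as a pair of DDL string builders. This is an illustrative sketch, not the full script used below; the helper names are mine and the column list is abbreviated:

```python
def super_table_ddl(db, stable):
    """DDL for the asset-class super table: shared columns plus tag schema."""
    cols = ("datetime timestamp, code binary(20), open float, close float, "
            "low float, high float, volume float, money float")
    tags = "list_code nchar(20), asset_type nchar(20), exchange int"
    return f"create table if not exists {db}.{stable} ({cols}) tags ({tags})"

def child_table_ddl(db, table, stable, list_code, asset_type, exchange):
    """DDL for one instrument's child table, bound to the super table via tags."""
    return (f"create table if not exists {db}.{table} using {db}.{stable} "
            f"tags ('{list_code}', '{asset_type}', {exchange})")

ddl = child_table_ddl("td_db_test", "s_100009", "stock", "100009.XSHG", "stock", 0)
```

The tag values live on the child table, not in each row, which is what later lets a `where list_code = ...` filter skip irrelevant child tables entirely.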
Data: simulate 10,000 stocks, 1,000,000 bars per stock, 15 columns per bar, which is close to real production scale. Generating the simulated data takes a long time, nearly 10 hours.
The code above, after modification:
# -*- coding: utf-8 -*-
import taos
import pandas as pd
import datetime
import random as rd
import time

# Create the database
def create_db(conn, db_name):
    c1 = conn.cursor()
    sql = 'create database ' + db_name
    c1.execute(sql)

# Create the super table
def create_super_table(conn, db_name, super_table_name):
    c1 = conn.cursor()
    c1.execute(f'use {db_name}')
    create_table_sql_0 = (f"create table if not exists {db_name}.{super_table_name} "
                          "(datetime timestamp, code BINARY(20), open float, close float, "
                          "low float, high float, volume float, money float, factor float,")
    create_table_sql_1 = (" high_limit float, low_limit float, avg float, "
                          "pre_close float, paused float, open_interest float) ")
    create_table_sql_2 = " tags (list_code nchar(20), asset_type nchar(20), exchange int)"
    sql = create_table_sql_0 + create_table_sql_1 + create_table_sql_2
    print(sql)
    c1.execute(sql)

# Create a child table under the super table
def create_son_table(conn, db_name, son_table_name, super_table_name, tags):
    c1 = conn.cursor()
    # e.g. create table dianbiao1001 using super_dianbiao tags('张三','东城小区','1-1101');
    create_table_sql = (f"create table if not exists {db_name}.{son_table_name} "
                        f"using {super_table_name} ")
    sql = create_table_sql + tags
    print(sql)
    c1.execute(sql)

# Simulate the child-table data: 10,000 tables, 1,000,000 rows each, 15 columns per row
def insert_data(conn, db_name, super_table_name):
    c1 = conn.cursor()
    c1.execute(f'use {db_name}')
    t0 = time.time()
    time_interval = datetime.timedelta(seconds=1)
    code_num = 0
    total_code_nums = 10001
    for i in range(1106, total_code_nums):
        code = str(100001 + i) + ".XSHG"
        print(f"start insert code table: {code} ......")
        start_time = datetime.datetime(2014, 1, 1)
        son_table_name = "s_" + str(100001 + i)
        asset_type = "stock"
        exchange = 0          # Shanghai main board
        if i < 4000:
            exchange = 1      # Shenzhen
        elif i > 8000:
            exchange = 2      # STAR market
        tags = f" tags ('{code}', '{asset_type}', {exchange})"
        # Bind the child table to the super table via its tags
        create_son_table(conn, db_name, son_table_name, super_table_name, tags)
        # One statement can only hold a limited number of rows, so insert in batches
        for j in range(1, 1001):
            sql_insert = [f'insert into {db_name}.{son_table_name} values ']
            for row in range(1, 1001):
                start_time = start_time + time_interval
                t1 = rd.random() + 10
                t2 = rd.random() + 10
                t3 = rd.random() + 10
                t4 = rd.random() + 10
                t5 = rd.random() + 1000000.0
                t6 = rd.random() + 999999999.0
                t7 = rd.random() + 10
                t8 = rd.random() + 10
                t9 = rd.random() + 10
                t10 = rd.random() + 10
                t11 = rd.random() + 10
                t12 = rd.random() + 10
                t13 = rd.random() + 1000000.0
                sql_insert.append("('%s', '%s', %f, %f, %f, %f, %f, %f, %f, %f, %f, %f, %f, %f, %f)"
                                  % (start_time, code, t1, t2, t3, t4, t5, t6, t7,
                                     t8, t9, t10, t11, t12, t13))
            sql = ' '.join(sql_insert)
            c1.execute(sql)
        code_num = code_num + 1
        print(f" code: {code} code_num: {code_num} total_num: {total_code_nums - 1} insert finish!")
    print(f" {code_num} codes finished insert! cost time: {time.time() - t0}")

def fetch_data(conn, db_name, super_table_name):
    c1 = conn.cursor()
    # Sample query
    sql = (f"select * from {db_name}.{super_table_name} where list_code = '100009.XSHG' "
           "AND datetime >= '2014-1-4 00:10:00' AND datetime < '2014-1-4 01:20:59'")
    c1.execute(sql)
    data = c1.fetchall()
    return data

# Generated DDL for reference:
# create table if not exists td_db_test.stock (datetime timestamp, code BINARY(20),
# open float, close float, low float, high float, volume float, money float,
# factor float, high_limit float, low_limit float, avg float, pre_close float,
# paused float, open_interest float)
# tags (list_code nchar(20), asset_type nchar(20), exchange int)
Driver code (mode = 1 builds the super table and generates the simulated child-table data; mode = 0 runs a query):
t0 = time.time()
conn = taos.connect(host="127.0.0.1")
db_name = "td_db_test"
super_table_name = "stock"
mode = 0  # 0 = query data, 1 = generate the simulated tables and data
if mode == 1:
    # Create the super table
    create_super_table(conn, db_name, super_table_name)
    # Create each child table and insert its data
    insert_data(conn, db_name, super_table_name)
else:
    data = fetch_data(conn, db_name, super_table_name)
    print(f"data: {len(data)}")
print(f"cost time :{time.time() - t0}")
To run the simulated queries, set mode = 0 in the same driver.
Data volume: in the end, because of limited disk space, only about 1,000 simulated stock tables were generated, each with roughly 1,000,000 rows.
Tasks:
1. Single-condition, single-child-table query: all rows for one specific code.
2. Multi-condition, single-child-table query: filter on a code plus a date range.
3. Multi-condition query across child tables.
(1) Single-condition single-table query
Speed through the Python connector:
sql = f"select * from {db_name}.{super_table_name} where list_code ='100009.XSHG' "
Query speed in the taos shell (single condition, full table):
The results show:
1. The super table clearly pays off: roughly a 6x speedup on the Python side.
2. The Python connector and the taos shell are far apart; the shell is about 15x faster than the connector.
3. An SSD probably matters a great deal; this cloud server does not have one.
(2) Multi-condition single-table query
Query condition:
sql = f"select * from {db_name}.{super_table_name} where list_code ='100009.XSHG' AND datetime >= '2014-1-4 00:10:00' AND datetime < '2014-1-4 01:20:59'"
Results of the multi-condition query.
Python connector:
click@iZ9ni05fy7agndgpndc7gsZ:~/mytao$ python -u "/home/click/mytao/tao_test.py"
data: 4259
cost time :0.12971186637878418
taos shell:
taos> select * from td_db_test.stock where list_code ='100009.XSHG' AND datetime >= '2014-1-4 00:10:00' AND datetime < '2014-1-4 01:20:59' ;
[100 Rows showed, and more rows are fetching but will not be showed. You can ctrl+c to stop or wait.]
[You can add limit statement to show more or redirect results to specific file to get all.]
Query OK, 4259 row(s) in set (0.017772s)
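The 4,259-row count can be sanity-checked: the simulated bars are exactly 1 second apart, so a half-open time window on one child table should return one row per second:

```python
from datetime import datetime

start = datetime(2014, 1, 4, 0, 10, 0)
end = datetime(2014, 1, 4, 1, 20, 59)  # the query's upper bound is exclusive
expected_rows = int((end - start).total_seconds())
print(expected_rows)  # 4259, matching both the connector and the shell
```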
The numbers above show that, without any tuning at all, the efficiency is already good. From this angle, taos is well suited as the warehouse behind a historical-quotes web API.
For small result sets (say, under 100,000 rows of output, even though from the super table's point of view the underlying data already exceeds a billion rows), query efficiency is very high. Since a web API would never allow huge result sets anyway, taos fits this use case very well.
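Because a web API should cap its response size anyway, one common pattern is to page through results with LIMIT/OFFSET, which TDengine's SQL supports. A sketch of a page-query builder (the function and parameter names are mine):

```python
def paged_query(db, stable, list_code, page, page_size=10000):
    """Build a bounded query for one instrument; page is zero-based."""
    return (f"select * from {db}.{stable} "
            f"where list_code = '{list_code}' "
            f"limit {page_size} offset {page * page_size}")

sql = paged_query("td_db_test", "stock", "100009.XSHG", page=2)
print(sql)
```

Each API call then executes one bounded statement instead of streaming millions of rows to the client.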
Open question: for outputs in the millions of rows (e.g. high-frequency scenarios on a quant investment platform), is there still room to optimize queries of that magnitude?
(3) Multi-condition query across child tables
This scenario arises when querying, say, all CSI 500 constituents or all stocks in a particular industry; the query then has to touch multiple child tables under the super table.
sql = f"select * from {db_name}.{super_table_name} where datetime >= '2014-1-4 00:10:00' AND datetime < '2014-1-4 01:20:59'"
In practice you would add a CSI 500 tag or an industry tag; the query above forces every child table to be scanned.
taos shell output:
taos> select * from td_db_test.stock where datetime >= '2014-1-4 00:00:01' AND datetime < '2014-1-4 00:20:59' ;
[100 Rows showed, and more rows are fetching but will not be showed. You can ctrl+c to stop or wait.]
[You can add limit statement to show more or redirect results to specific file to get all.]
Query OK, 1196358 row(s) in set (10.466470s)
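The 1,196,358-row result is internally consistent: the half-open window spans 1,258 seconds, i.e. 1,258 one-second bars per child table, and the total divides by that exactly, which implies 951 child tables contributed rows (consistent with the "about 1,000 tables" actually generated, assuming every contributing table fully covers the window):

```python
from datetime import datetime

start = datetime(2014, 1, 4, 0, 0, 1)
end = datetime(2014, 1, 4, 0, 20, 59)                # exclusive upper bound
rows_per_table = int((end - start).total_seconds())  # one bar per second
tables = 1196358 // rows_per_table
print(rows_per_table, tables)  # 1258 951
```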
An even larger query:
taos> select * from td_db_test.stock where datetime >= '2014-1-4 00:00:01' AND datetime < '2014-1-4 09:50:59' ;
[100 Rows showed, and more rows are fetching but will not be showed. You can ctrl+c to stop or wait.]
[You can add limit statement to show more or redirect results to specific file to get all.]
Query OK, 33712900 row(s) in set (91.685929s)
Other: queries returning 50,000-100,000 rows
With super tables in place, TDengine performs well on queries in the 50,000-100,000-row range (better than ClickHouse here). This is not purely a matter of single-table query efficiency; it is the result of the super table's tag "index": ClickHouse really does scan the whole database, while TDengine simply does not need to.
Compared with the earlier ClickHouse study, and speaking loosely: ClickHouse has the edge on large cross-code output queries (above the 50-million-row level, similar to TDengine's cross-child-table queries), while TDengine has the edge on small outputs (10,000-200,000 rows).