Contents:
1. Obtaining the dataset
  1.1 Downloading the dataset
  1.2 Horizontal dataset split
2. Running training and prediction jobs via DSL Conf
  2.1 Data input
  2.2 Model training
    2.2.1 Configuring the DSL file
    2.2.2 Runtime configuration (Submit Runtime Conf)
    2.2.3 Submitting the job and training the model
  2.3 Model evaluation
    2.3.1 Modifying the DSL
    2.3.2 Modifying the conf
    2.3.3 Submitting the job

1. Obtaining the dataset

1.1 Downloading the dataset
Dataset: the breast cancer tumor dataset (built into the sklearn library).
from sklearn.datasets import load_breast_cancer
import pandas as pd

breast_dataset = load_breast_cancer()
breast = pd.DataFrame(breast_dataset.data, columns=breast_dataset.feature_names)
breast.head()
Inspecting the data shows 569 samples and 30 features (10 attributes, each reported as a mean, a standard deviation (std), and a worst value), with a binary label (1 = benign tumor, 0 = malignant tumor) at a 1:0 ratio of 357:212.
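The sample, feature, and label counts quoted above can be verified directly from scikit-learn; this quick check uses nothing FATE-specific:

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load the built-in dataset and count samples, features, and label classes
breast_dataset = load_breast_cancer()
n_samples, n_features = breast_dataset.data.shape
label_counts = pd.Series(breast_dataset.target).value_counts()

print(n_samples, n_features)             # 569 30
print(label_counts[1], label_counts[0])  # 357 benign, 212 malignant
```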
1.2 Horizontal dataset split
To simulate a horizontal federated modeling scenario, the dataset is split into horizontally partitioned parts that share the same feature space. Split strategy:

- The first 469 rows are used as training data and the last 100 rows as test data.
- Within the training data, the first 200 rows belong to institution A and are saved as breast_1_train.csv; the remaining 269 rows belong to institution B and are saved as breast_2_train.csv.
- The test data is not split and is saved as breast_eval.csv.

Complete dataset-preparation code:
from sklearn.datasets import load_breast_cancer
import pandas as pd

breast_dataset = load_breast_cancer()
breast = pd.DataFrame(breast_dataset.data, columns=breast_dataset.feature_names)
breast = (breast - breast.mean()) / breast.std()  # z-score standardization
col_names = breast.columns.values.tolist()
columns = {}
for idx, n in enumerate(col_names):
    columns[n] = "x%d" % idx
breast = breast.rename(columns=columns)
breast['y'] = breast_dataset.target
breast['idx'] = range(breast.shape[0])
idx = breast['idx']
breast.drop(labels=['idx'], axis=1, inplace=True)
breast.insert(0, 'idx', idx)
breast = breast.sample(frac=1)  # shuffle the rows

train = breast.iloc[:469]
eval = breast.iloc[469:]
breast_1_train = train.iloc[:200]
breast_2_train = train.iloc[200:]

breast_1_train.to_csv('breast_1_train.csv', index=False, header=True)
breast_2_train.to_csv('breast_2_train.csv', index=False, header=True)
eval.to_csv('breast_eval.csv', index=False, header=True)
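The split sizes produced by the iloc slicing above can be sanity-checked with a stand-in 569-row frame, so the check runs without downloading anything:

```python
import pandas as pd

# Toy frame with the same number of rows as the breast dataset
df = pd.DataFrame({"idx": range(569)})

train, eval_data = df.iloc[:469], df.iloc[469:]
breast_1_train = train.iloc[:200]   # institution A's share
breast_2_train = train.iloc[200:]   # institution B's share

print(len(train), len(eval_data), len(breast_1_train), len(breast_2_train))  # 469 100 200 269
```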
2. Running training and prediction jobs via DSL Conf

A training job can be started by submitting a conf file and a dsl file through the Flow Client. The workflow consists of:

- Data input
- Model training
- Model evaluation

Note: the comments in the configuration files below must be removed before use, otherwise the Flow Client job will fail.

2.1 Data input
Example configuration files for uploading data can be found in the following two directories.

example/dsl/v1

upload_data.json, upload_host.json, or upload_guest.json (the configuration items are identical). The file has the following structure:
{
  "file": "examples/data/breast_hetero_guest.csv",  // path to the data file, relative to the current directory
  "head": 1,        // whether the data file contains a header row, 1: yes, 0: no
  "partition": 16,  // number of partitions used to store the data
  "work_mode": 0,   // work mode, 0: standalone, 1: cluster
  "table_name": "breast_hetero_guest",  // name of the table after conversion to DTable format (the table used in later steps)
  "namespace": "experiment"             // namespace of the DTable table name
}
example/dsl/v2/upload

upload_conf.json or upload_tag_conf.json (the configuration items are identical). The file has the following structure:
{
  "file": "/data/projects/fate/examples/data/breast_hetero_guest.csv",  // path to the data file, relative to the current directory
  "table_name": "breast_hetero_guest",  // name of the table after conversion to DTable format
  "namespace": "experiment",            // namespace of the DTable table name
  "head": 1,        // whether the data file contains a header row, 1: yes, 0: no
  "partition": 8,   // number of partitions used to store the data
  "work_mode": 0,   // work mode, 0: standalone, 1: cluster
  "backend": 0      // backend, 0: EggRoll, 1: Spark + RabbitMQ, 2: Spark + Pulsar
}
The newer DSL v2 version is used here; the configuration files are as follows. workspace/HFL_lr/ is a directory I created under the FATE root directory.

Upload the training data for institution 1 with upload_train_host_conf.json:
{
  "file": "workspace/HFL_lr/breast_1_train.csv",
  "table_name": "homo_breast_1_train",
  "namespace": "homo_host_breast_train",
  "head": 1,
  "partition": 8,
  "work_mode": 0,
  "backend": 0
}
Upload the training data for institution 2 with upload_train_guest_conf.json:

{
  "file": "workspace/HFL_lr/breast_2_train.csv",
  "table_name": "homo_breast_2_train",
  "namespace": "homo_guest_breast_train",
  "head": 1,
  "partition": 8,
  "work_mode": 0,
  "backend": 0
}
Upload the test data for institution 1 with upload_eval_host_conf.json:

{
  "file": "workspace/HFL_lr/breast_eval.csv",
  "table_name": "homo_breast_1_eval",
  "namespace": "homo_host_breast_eval",
  "head": 1,
  "partition": 8,
  "work_mode": 0,
  "backend": 0
}
Upload the test data for institution 2 with upload_eval_guest_conf.json:

{
  "file": "workspace/HFL_lr/breast_eval.csv",
  "table_name": "homo_breast_2_eval",
  "namespace": "homo_guest_breast_eval",
  "head": 1,
  "partition": 8,
  "work_mode": 0,
  "backend": 0
}
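Since the four upload confs differ only in the file, table_name, and namespace fields, they can also be generated programmatically to avoid copy-paste errors. This sketch mirrors the file names, tables, and namespaces used above (workspace/HFL_lr/ is this tutorial's working directory, not a FATE convention):

```python
import json

# Fields shared by all four upload confs
base = {"head": 1, "partition": 8, "work_mode": 0, "backend": 0}

# (data file, table_name, namespace, output conf name), as listed above
jobs = [
    ("breast_1_train.csv", "homo_breast_1_train", "homo_host_breast_train", "upload_train_host_conf.json"),
    ("breast_2_train.csv", "homo_breast_2_train", "homo_guest_breast_train", "upload_train_guest_conf.json"),
    ("breast_eval.csv", "homo_breast_1_eval", "homo_host_breast_eval", "upload_eval_host_conf.json"),
    ("breast_eval.csv", "homo_breast_2_eval", "homo_guest_breast_eval", "upload_eval_guest_conf.json"),
]

confs = {}
for data_file, table, namespace, out_name in jobs:
    confs[out_name] = dict(base, file="workspace/HFL_lr/" + data_file,
                           table_name=table, namespace=namespace)

# Write each conf as strict JSON (no comments)
for out_name, conf in confs.items():
    with open(out_name, "w") as f:
        json.dump(conf, f, indent=2)
```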
Commands to upload the data:
$ flow data upload -c workspace/HFL_lr/upload_train_host_conf.json
$ flow data upload -c workspace/HFL_lr/upload_train_guest_conf.json
$ flow data upload -c workspace/HFL_lr/upload_eval_host_conf.json
$ flow data upload -c workspace/HFL_lr/upload_eval_guest_conf.json
The data is uploaded successfully.
Tip: if the following response appears:
{
  "retcode": 100,
  "retmsg": "Fate flow CLI has not been initialized yet or configured incorrectly. Please initialize it before using CLI at the first time. And make sure the address of fate flow server is configured correctly. The configuration file path is: /ai/xwj/federate_learning/fate_dir/standalone_fate_master_1.6.0/venv/lib/python3.6/site-packages/flow_client/settings.yaml."
}
it means Flow has not been initialized; running flow init --ip 127.0.0.1 --port 9380 fixes it.
2.2 Model training

In the DSL currently provided by the FATE framework, the various task modules can be organized as a directed acyclic graph (DAG), so users can flexibly combine algorithm modules to suit their needs. A logistic regression model is used here.

2.2.1 Configuring the DSL file
Official examples can be found in the following directories.

The modules used in the official example:

- dataio_0: data IO component, converts local data into DTable format
- feature_scale_0: feature engineering component
- homo_lr_0: horizontal (homo) logistic regression component
- evaluation_0: model evaluation component; if no test dataset is provided, the training set is used automatically

/examples/dsl/v1/homo_logistic_regression/test_homolr_train_job_dsl.json (can be used directly)
The references between the configuration files are shown in the figure:
/examples/dsl/v2/homo_logistic_regression

2.2.2 Runtime configuration (Submit Runtime Conf)
Each module has its own parameters to configure, and different parties may need different values for the same module. To simplify this, FATE stores all parties' parameters for each module in a single runtime configuration file (Submit Runtime Conf), which all parties share. In addition to the DSL file, users therefore need to prepare this runtime configuration to set the parameters of each component.

Role descriptions:

- arbiter: assists the parties in joint modeling; its main job is to aggregate gradients or models. For example, in hetero (vertical) LR, each party sends its half of the gradient to the arbiter, which then performs the joint optimization.
- initiator: the party that initiates the job
- host: the data provider
- guest: the data consumer
- local: local jobs; this role is used only in the upload and download stages

Official examples can be found in the following directory:

/examples/dsl/v1/homo_logistic_regression/test_homolr_train_job_conf.json, with parts of it modified as follows:
{
  // the initiator
  "initiator": {
    "role": "guest",
    "party_id": 10000
  },
  "job_parameters": {
    "work_mode": 0
  },
  // all roles participating in this job;
  // each entry maps a role to the party_ids holding it (a list, since one role may be held by several parties)
  "role": {
    "guest": [10000],
    "host": [10000],
    "arbiter": [10000]  // the arbiter
  },
  "role_parameters": {
    "guest": {
      "args": {
        "data": {
          "train_data": [
            {
              "name": "homo_breast_2_train",          // change the DTable name to match the upload conf
              "namespace": "homo_guest_breast_train"  // change the namespace to match the upload conf
            }
          ]
        }
      },
      "dataio_0": {
        "label_name": ["y"]  // add the column name of the label
      }
    },
    "host": {
      "args": {
        "data": {
          "train_data": [
            {
              "name": "homo_breast_1_train",         // changed likewise
              "namespace": "homo_host_breast_train"  // changed likewise
            }
          ]
        }
      },
      "dataio_0": {
        "label_name": ["y"]  // changed likewise
      },
      "evaluation_0": {
        "need_run": [false]  // the host does not need to run evaluation
      }
    }
  },
  // hyperparameters for model training
  "algorithm_parameters": {
    "dataio_0": {
      "with_label": true,
      "label_name": "y",
      "label_type": "int",
      "output_format": "dense"
    },
    "homo_lr_0": {
      "penalty": "L2",
      "optimizer": "sgd",
      "tol": 1e-05,
      "alpha": 0.01,
      "max_iter": 10,
      "early_stop": "diff",
      "batch_size": 500,
      "learning_rate": 0.15,
      "decay": 1,
      "decay_sqrt": true,
      "init_param": {
        "init_method": "zeros"
      },
      "encrypt_param": {
        "method": null
      },
      "cv_param": {
        "n_splits": 4,
        "shuffle": true,
        "random_seed": 33,
        "need_cv": false
      }
    }
  }
}
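As described in the role list above, the arbiter's job in homo LR is to aggregate the parties' gradients or models. A toy, FedAvg-style sketch of that aggregation step is below; the weight vectors and sample counts are made-up numbers, and FATE's actual protocol adds secure aggregation on top of this:

```python
import numpy as np

# Hypothetical local weights after one round of local SGD at each party
w_host = np.array([0.20, -0.10])   # institution A, 200 training samples
w_guest = np.array([0.40, 0.30])   # institution B, 269 training samples
n_host, n_guest = 200, 269

# The arbiter aggregates: a sample-size-weighted average of the party models,
# which then becomes the global model sent back to both parties
w_global = (n_host * w_host + n_guest * w_guest) / (n_host + n_guest)
```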
2.2.3 Submitting the job and training the model

Run the pipeline job:
flow job submit -c ${conf_path} -d ${dsl_path}
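FATE Flow expects strict JSON, so the inline // comments shown in the examples above must be stripped before submitting. A small pre-flight check can catch this before a failed job; assert_strict_json is a hypothetical helper name, not part of the Flow CLI:

```python
import json

def assert_strict_json(path):
    """Raise ValueError if the file is not strict JSON
    (e.g. it still contains // comments)."""
    with open(path) as f:
        json.load(f)

# Typical use before `flow job submit -c conf.json -d dsl.json`:
# assert_strict_json("conf.json"); assert_strict_json("dsl.json")
```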
The run succeeds only if all components pass.

The training results can then be viewed.
v2 version (to be updated)

2.3 Model evaluation
The official examples are in the directory examples/dsl/v1/homo_logistic_regression:

test_homolr_train_eval_job_conf.json
test_homolr_train_eval_job_dsl.json

2.3.1 Modifying the DSL

v1 version:
{
  "components": {
    "dataio_0": {
      "module": "DataIO",
      "input": {
        "data": {
          "data": ["args.train_data"]
        }
      },
      "output": {
        "data": ["train"],
        "model": ["dataio"]
      }
    },
    "dataio_1": {
      "module": "DataIO",
      "input": {
        "data": {
          "data": ["args.eval_data"]
        },
        "model": ["dataio_0.dataio"]
      },
      "output": {
        "data": ["eval_data"]
      }
    },
    "feature_scale_0": {
      "module": "FeatureScale",
      "input": {
        "data": {
          "data": ["dataio_0.train"]
        }
      },
      "output": {
        "data": ["train"],
        "model": ["feature_scale"]
      }
    },
    "feature_scale_1": {
      "module": "FeatureScale",
      "input": {
        "data": {
          "data": ["dataio_1.eval_data"]
        }
      },
      "output": {
        "data": ["eval_data"],
        "model": ["feature_scale"]
      }
    },
    "homo_lr_0": {
      "module": "HomoLR",
      "input": {
        "data": {
          "train_data": ["feature_scale_0.train"]
        }
      },
      "output": {
        "data": ["train"],
        "model": ["homolr"]
      }
    },
    "homo_lr_1": {
      "module": "HomoLR",
      "input": {
        "data": {
          "eval_data": ["feature_scale_1.eval_data"]  // specify the evaluation data
        },
        "model": ["homo_lr_0.homolr"]
      },
      "output": {
        "data": ["eval_data"],
        "model": ["homolr"]
      }
    },
    "evaluation_0": {
      "module": "Evaluation",
      "input": {
        "data": {
          "data": ["homo_lr_0.train"]
        }
      }
    },
    "evaluation_1": {
      "module": "Evaluation",
      "input": {
        "data": {
          "data": ["homo_lr_1.eval_data"]
        }
      }
    }
  }
}
v2 version (to be updated)

2.3.2 Modifying the conf

v1 version:
{
  "initiator": {
    "role": "guest",
    "party_id": 10000
  },
  "job_parameters": {
    "work_mode": 0
  },
  "role": {
    "guest": [10000],
    "host": [10000],
    "arbiter": [10000]
  },
  "role_parameters": {
    "guest": {
      "args": {
        "data": {
          "train_data": [
            {
              "name": "homo_breast_2_train",
              "namespace": "homo_guest_breast_train"
            }
          ],
          "eval_data": [
            {
              "name": "homo_breast_2_eval",
              "namespace": "homo_guest_breast_eval"
            }
          ]
        }
      },
      "dataio_0": {
        "with_label": [true],
        "label_name": ["y"],
        "label_type": ["int"],
        "output_format": ["dense"]
      }
    },
    "host": {
      "args": {
        "data": {
          "train_data": [
            {
              "name": "homo_breast_1_train",
              "namespace": "homo_host_breast_train"
            }
          ],
          "eval_data": [
            {
              "name": "homo_breast_1_eval",
              "namespace": "homo_host_breast_eval"
            }
          ]
        }
      },
      "dataio_0": {
        "with_label": [true],
        "label_name": ["y"],
        "label_type": ["int"],
        "output_format": ["dense"]
      },
      "evaluation_0": {
        "need_run": [false]
      },
      "evaluation_1": {
        "need_run": [false]
      }
    }
  },
  "algorithm_parameters": {
    "homo_lr_0": {
      "penalty": "L2",
      "optimizer": "sgd",
      "tol": 1e-05,
      "alpha": 0.01,
      "max_iter": 20,
      "early_stop": "diff",
      "batch_size": 320,
      "learning_rate": 0.05,
      "validation_freqs": 1,
      "init_param": {
        "init_method": "zeros"
      },
      "encrypt_param": {
        "method": null
      },
      "cv_param": {
        "n_splits": 4,
        "shuffle": true,
        "random_seed": 33,
        "need_cv": false
      }
    },
    "evaluation_0": {
      "eval_type": "binary"
    }
  }
}
v2 version (to be updated)

2.3.3 Submitting the job
flow job submit -c ${conf_path} -d ${dsl_path}