
[Deep Learning] A Summary of Methods for Converting Text Data into Tensors

Date: 2023-05-23

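Contents

Problem description:

Overview of methods:

1. Word-level one-hot encoding

2. Character-level one-hot encoding

3. Word-level one-hot encoding with Keras

4. Word-level one-hot encoding with the hashing trick

References: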

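Problem description:

Deep learning models do not accept raw text as input; they can only process numeric tensors. Text vectorization (vectorize) is the process of converting text into numeric tensors. There are two ways to do it: ① convert each word in the text into a vector; ② convert each character in the text into a vector.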
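Before any encoding happens, the text has to be cut into tokens. The following is a minimal sketch of the two granularities mentioned above, using one of the sample sentences from the later sections; the variable names are only illustrative.

```python
sample = 'The cat sat on the mat.'

# Word-level tokens: split on whitespace (punctuation stays attached to words).
word_tokens = sample.split()

# Character-level tokens: every character, including spaces, is one token.
char_tokens = list(sample)

print(word_tokens)
print(len(char_tokens))
```

Each token is later mapped to an integer index, and that index is what gets turned into a vector.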

Overview of methods:

1. Word-level one-hot encoding
Code

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Build an index of all tokens in the data
token_index = {}
for sample in samples:
    # Tokenize with split(); punctuation stays attached to the words
    for word in sample.split():
        if word not in token_index:
            # Assign a unique index to each unique word (index 0 is never used)
            token_index[word] = len(token_index) + 1

# Only the first max_length words of each sample are encoded
max_length = 10

# The result is stored in results
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.
print(results)

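Because the word-level code tokenizes with a plain split(), 'The' and 'the' get separate indices and 'mat.' keeps its trailing period. Here is a possible variant that lowercases the text and strips punctuation first; the helper name one_hot_words is hypothetical and the normalization rules are an assumption, not part of the original method.

```python
import string

import numpy as np


def one_hot_words(samples, max_length=10):
    """Hypothetical helper: word-level one-hot with simple normalization."""
    # Translation table that deletes ASCII punctuation.
    strip_punct = str.maketrans('', '', string.punctuation)
    tokenized = [s.lower().translate(strip_punct).split() for s in samples]

    # Assign indices in order of first appearance; index 0 stays unused.
    token_index = {}
    for words in tokenized:
        for word in words:
            token_index.setdefault(word, len(token_index) + 1)

    results = np.zeros((len(samples), max_length, len(token_index) + 1))
    for i, words in enumerate(tokenized):
        for j, word in enumerate(words[:max_length]):
            results[i, j, token_index[word]] = 1.0
    return results, token_index


samples = ['The cat sat on the mat.', 'The dog ate my homework.']
results, token_index = one_hot_words(samples)
print(results.shape)
print(token_index)
```

With this normalization the two sample sentences yield 9 distinct words instead of 10, since 'The' and 'the' collapse into one token.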

2. Character-level one-hot encoding
Code

import string

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# All printable ASCII characters
characters = string.printable
# Map each character to its index (keys are characters, values start at 1)
token_index = dict(zip(characters, range(1, len(characters) + 1)))

# Only the first max_length characters of each sample are encoded
max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.
print(results)

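One way to check a character-level encoding is to decode it back into text. The sketch below assumes a character-to-index mapping over string.printable with index 0 left unused; the encode and decode helpers are hypothetical names, and characters outside string.printable would raise a KeyError here.

```python
import string

import numpy as np

characters = string.printable
# Map each character to an index starting at 1; index 0 is left unused.
char_to_index = {ch: i for i, ch in enumerate(characters, start=1)}
index_to_char = {i: ch for ch, i in char_to_index.items()}


def encode(text, max_length=50):
    """One-hot encode the first max_length characters of text."""
    matrix = np.zeros((max_length, len(characters) + 1))
    for j, ch in enumerate(text[:max_length]):
        matrix[j, char_to_index[ch]] = 1.0
    return matrix


def decode(matrix):
    """Recover the text; all-zero rows are padding and are skipped."""
    return ''.join(index_to_char[int(row.argmax())] for row in matrix if row.any())


encoded = encode('The cat sat on the mat.')
print(encoded.shape)
print(decode(encoded))
```

If the round trip returns the original sentence, each character was written to exactly one position of its own row.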

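3. Word-level one-hot encoding with Keras
Code

# Note: Tokenizer belongs to the legacy Keras preprocessing API,
# so this import may not be available in recent Keras releases.
from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Create a tokenizer that only considers the 1000 most common words
tokenizer = Tokenizer(num_words=1000)
# Build the word index
tokenizer.fit_on_texts(samples)

# Convert the strings into lists of integer indices
sequences = tokenizer.texts_to_sequences(samples)

# Get the one-hot (binary) representation directly
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

# Retrieve the computed word index
word_index = tokenizer.word_index
print('found %s unique tokens' % len(word_index))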

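To make it clearer what the binary matrix contains, here is a rough NumPy stand-in for it. This is a sketch under stated assumptions, not the library's implementation: the function name texts_to_binary_matrix is hypothetical, and the lowercasing, punctuation stripping, and frequency-ranked indexing only approximate what the real Tokenizer does, so details may differ.

```python
import string
from collections import Counter

import numpy as np


def texts_to_binary_matrix(texts, num_words=1000):
    """Hypothetical stand-in: one row per text, 1.0 where a word occurs."""
    strip_punct = str.maketrans('', '', string.punctuation)
    tokenized = [t.lower().translate(strip_punct).split() for t in texts]

    # Rank words by frequency; the most frequent word gets index 1.
    counts = Counter(word for words in tokenized for word in words)
    word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

    matrix = np.zeros((len(texts), num_words))
    for row, words in enumerate(tokenized):
        for word in words:
            idx = word_index[word]
            # Words ranked at or beyond num_words are dropped.
            if idx < num_words:
                matrix[row, idx] = 1.0
    return matrix, word_index


samples = ['The cat sat on the mat.', 'The dog ate my homework.']
matrix, word_index = texts_to_binary_matrix(samples)
print(matrix.shape)
print('found %s unique tokens' % len(word_index))
```

Unlike the first two methods, this representation has one vector per sample rather than one per token, so word order is lost.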

4. Word-level one-hot encoding with the hashing trick
Code

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Store each word as a vector of size 1000
dimensionality = 1000
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Hash the word into an index between 0 and dimensionality - 1.
        # Note: in Python 3, hash() of a str is randomized per process
        # unless PYTHONHASHSEED is set, so the indices differ between runs.
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.
print(results)

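The hashing trick needs no explicit word index, but two different words can land on the same position (a hash collision), and Python's built-in hash() for strings is not stable across runs by default. Below is a sketch that swaps in an MD5 digest from hashlib to get reproducible indices; the helper name stable_index is hypothetical, and MD5 is just one possible choice of stable hash.

```python
import hashlib

import numpy as np


def stable_index(word, dimensionality=1000):
    """Hypothetical helper: map a word to a reproducible index."""
    digest = hashlib.md5(word.encode('utf-8')).hexdigest()
    return int(digest, 16) % dimensionality


samples = ['The cat sat on the mat.', 'The dog ate my homework.']
dimensionality = 1000
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()[:max_length]):
        results[i, j, stable_index(word, dimensionality)] = 1.0

print(results.shape)
print(stable_index('cat') == stable_index('cat'))
```

A larger dimensionality relative to the number of distinct words makes collisions less likely, at the cost of bigger vectors.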

References:
Deep Learning with Python (Chinese edition: 《Python深度学习》)
