
NLTK (2): Part-of-Speech Tagging


Natural language is a system of rules that humans have developed through communication. Some rules are stronger than others: informal occasions use colloquial speech, while formal occasions call for written language. To process natural language we must also follow these rules, otherwise the results will be unintelligible. Below is a brief distinction between a few terms.
Grammar: the conventions for writing a language, describing the language and its structure; it covers both syntax and morphology.
Syntax: the rules governing sentence structure and how constituents combine and relate.
Morphology (lexical rules): the rules governing word formation and inflection.

Part-of-speech (POS) tagging is a way of analysing the components of a sentence: it identifies the part of speech of each word.

The table below briefly lists the meanings of common POS tags; for full details see nltk.help.brown_tagset() (a short lookup example follows the table).

Tag   Part of speech       Examples
ADJ   adjective            new, good, high, special, big, local
ADV   adverb               really, already, still, early, now
CONJ  conjunction          and, or, but, if, while, although
DET   determiner           the, a, some, most, every, no
EX    existential there    there, there's
MOD   modal verb           will, can, would, may, must, should
NN    noun                 year, home, costs, time
NNP   proper noun          April, China, Washington
NUM   numeral              fourth, 2016, 09:30
PRON  pronoun              he, they, us
P     preposition          on, over, with, of
TO    the word to          to
UH    interjection         ah, ha, oops
VB    verb
VBD   verb, past tense     made, said, went
VBG   present participle   going, lying, playing
VBN   past participle      taken, given, gone
WH    wh-determiner        who, where, when, what
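As a quick way to check what an individual tag means, nltk.help can print documentation for the tags matching a regular expression; a minimal sketch (the 'tagsets' resource may need a one-time download first):

import nltk

# nltk.download('tagsets')        # one-time download of the tag documentation, if missing
nltk.help.brown_tagset('NN')      # documentation for the Brown 'NN' tag
nltk.help.brown_tagset('VB.*')    # all Brown verb tags matching the pattern
nltk.help.upenn_tagset('VBG')     # same lookup for the Penn Treebank tagset used by nltk.pos_tag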

1. POS tagging English text with NLTK

1.1 A POS tagging example

import nltk

sent = "I am going to Beijing tomorrow."

"""
nltk.sent_tokenize(text)      # split text into sentences (requires the punkt models)
nltk.word_tokenize(sentence)  # split a sentence into words
NLTK tokenizes at the sentence level, so a document should first be split into
sentences, and each sentence then tokenized into words:
"""
# Tokenize the sentence into words
words = nltk.word_tokenize(sent)
print(words)
['I', 'am', 'going', 'to', 'Beijing', 'tomorrow', '.']
# POS tagging
tagged_sent = nltk.pos_tag(words)
tagged_sent
[('I', 'PRP'),
 ('am', 'VBP'),
 ('going', 'VBG'),
 ('to', 'TO'),
 ('Beijing', 'NNP'),
 ('tomorrow', 'NN'),
 ('.', '.')]
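If word_tokenize or pos_tag raises a LookupError, the underlying models usually just need to be downloaded once; a minimal sketch (resource names may differ slightly in newer NLTK versions):

import nltk

nltk.download('punkt')                        # tokenizer models behind word_tokenize / sent_tokenize
nltk.download('averaged_perceptron_tagger')   # default English model behind nltk.pos_tag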

1.2 Pre-tagged data in the corpora

The corpus classes provide the following methods for returning pre-tagged data (a short usage sketch follows the table).

Method                              Description
tagged_words(fileids, categories)   tagged data as a list of words
tagged_sents(fileids, categories)   tagged data as a list of sentences
tagged_paras(fileids, categories)   tagged data as a list of paragraphs
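A minimal sketch of what these methods return, using the Brown corpus 'news' category (tagged_paras assumes the corpus records paragraph breaks, which Brown does):

from nltk.corpus import brown

print(brown.tagged_words(categories='news')[:3])    # [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]
print(brown.tagged_sents(categories='news')[0][:3]) # first three (word, tag) pairs of the first sentence
print(len(brown.tagged_paras(categories='news')))   # number of paragraphs, each a list of tagged sentences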

2. Taggers

2.1 The default tagger

The simplest POS tagger tags every word as a noun (NN). Such a tagger has little value on its own and its accuracy is low, but it establishes a baseline. The example below shows how to use NLTK's default tagger.

import nltk
from nltk.corpus import brown
# Load the data
brown_tagged_sents = brown.tagged_sents(categories='news') # [[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), 
brown_sents = brown.sents(categories='news')
# brown_tagged_sents
# The simplest tagger assigns the same tag to every token. That may seem like a rather
# crude approach, but it establishes an important baseline for tagger performance.
# To get the best result we should use the most likely tag, so the example below
# finds out which tag that is.
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
tags
['AT', 'NP-TL', 'NN-TL', 'JJ-TL', 'NN-TL', 'VBD', 'NR', 'AT', 'NN', 'IN', 'NP$', 'JJ', 'NN', 'NN', 'VBD', '``', 'AT', 'NN', "''", 'CS', 'DTI', 'NNS', 'VBD', 'NN', ...]
tag = nltk.FreqDist(tags).max()
tag
'NN'
# We can now create a tagger that tags every word as NN.
default_tagger = nltk.DefaultTagger('NN')
sent = "I am going to Beijing tomorrow."
default_tagger.tag(nltk.word_tokenize(sent))
[('I', 'NN'),
 ('am', 'NN'),
 ('going', 'NN'),
 ('to', 'NN'),
 ('Beijing', 'NN'),
 ('tomorrow', 'NN'),
 ('.', 'NN')]
default_tagger.evaluate(brown_tagged_sents)
0.13089484257215028

2.2 The regular expression tagger

A regular expression tagger assigns tags to tokens based on matching patterns. For example, we might guess that any word ending in ed is the past form of a verb, and any word ending in 's is a possessive noun. These guesses can be expressed as a list of regular expressions, as in the example below.

patterns = [
    (r'.*ing$', 'VBG'),                                # gerunds
    (r'.*ed$', 'VBD'),                                 # simple past
    (r'.*es$', 'VBZ'),                                 # 3rd singular present
    (r'.*ould$', 'MD'),                                # modals
    (r'.*\'s$', 'NN$'),                                # possessive nouns
    (r'.*s$', 'NNS'),                                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),                  # cardinal numbers
    (r'.*', 'NN')                                      # nouns (default)
]

These patterns are tried in order, and the first one that matches is used. Now build the tagger and use it to tag a sentence.

regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(brown_sents[3])
regexp_tagger.evaluate(brown_tagged_sents)
# 0.20326391789486245 # roughly one tag in five is correct
0.20326391789486245

2.3 The lookup tagger

Many of the most frequent words are not nouns at all. Let's find the 100 most frequent words and store their most likely tags; this information can then be used as the model for a "lookup tagger" (NLTK's UnigramTagger), as in the example below:

# First build a word frequency distribution
fd = nltk.FreqDist(brown.words(categories='news')) # ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

# A conditional frequency distribution collects the frequency distributions of several
# experiments, one per condition; it records how often each sample occurs under a given
# condition. Formally, it is a function mapping each condition to a FreqDist over outcomes.
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
# print(cfd.items()) # dict_items([('The', FreqDist({'AT': 775, 'AT-TL': 28, 'AT-HL': 3})), ('Fulton', FreqDist({'NP-TL': 10, 'NP': 4})), 

# Top 100 frequent words
most_freq_words = fd.keys()  # on Python 3.6+, dict_keys must be converted with list()
most_freq_words = list(most_freq_words)[:100] # ['The','Fulton','County','Grand','Jury','said','Friday','an',

# Dict comprehension: for each of the 100 words, take its most frequent tag as that word's tag
likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
# likely_tags # {'The': 'AT','Fulton': 'NP-TL','County': 'NN-TL','Grand': 'JJ-TL','Jury': 'NN-TL','said': 'VBD','Friday': 'NR',

# UnigramTagger looks up the most likely tag for each word in its model and uses it to tag new text.
baseline_tagger = nltk.UnigramTagger(model=likely_tags)
baseline_tagger = nltk.UnigramTagger(model = likely_tags)

baseline_tagger.evaluate(brown_tagged_sents) # 0.3329355371243312
# brown.tagged_words(categories='news') #[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
baseline_tagger.evaluate([brown.tagged_words(categories='news')]) # tagged_words() must be wrapped in a list to get a 2-D sequence: 0.3329355371243312
baseline_tagger.evaluate([brown.tagged_sents(categories='news')[3]]) # an individual sentence can score very high: 0.972972972972973
0.972972972972973

This result differs from the book, where knowing the tags of just the 100 most frequent words is enough to tag around 45% of tokens correctly. The likely reason is that in NLTK 3 FreqDist.keys() no longer returns words sorted by frequency, so the 100 words selected above are not actually the most frequent ones.
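A likely fix, sketched below: pick the 100 genuinely most frequent words with FreqDist.most_common() instead of the first 100 keys, reusing the fd, cfd and brown_tagged_sents built above. This follows the book's intent and should land much closer to its figure (the exact number depends on the NLTK version):

# Select the true top-100 words by frequency rather than the first 100 dictionary keys
top100 = [word for (word, _count) in fd.most_common(100)]
likely_tags = dict((word, cfd[word].max()) for word in top100)
baseline_tagger = nltk.UnigramTagger(model=likely_tags)
print(baseline_tagger.evaluate(brown_tagged_sents))  # expected to be well above the 0.33 obtained above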

Let's see how it performs on untagged input text:

sent = brown.sents(categories='news')[10] #
baseline_tagger.tag(sent)
[('It', 'PPS'),
 ('urged', None),
 ('that', 'CS'),
 ('the', 'AT'),
 ('city', 'NN'),
 ('``', '``'),
 ('take', None),
 ('steps', None),
 ('to', 'TO'),
 ('remedy', None),
 ("''", "''"),
 ('this', 'DT'),
 ('problem', None),
 ('.', '.')]

Many words are assigned a 'None' tag because they are not among the 100 most frequent words. In those cases we would like to fall back to the default tag NN: first try the lookup table, and if it cannot assign a tag, use the default tagger. This process is called "backoff".

# Set a default tagger to use when no match is found
baseline_tagger = nltk.UnigramTagger(model = likely_tags,backoff = nltk.DefaultTagger('NN'))
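With the backoff in place, the tokens that previously received None now at least get the default tag NN, so accuracy can only improve over the 0.33 measured without backoff; a quick check, reusing the variables from above:

# Re-evaluate the lookup tagger now that unknown words fall back to 'NN'
print(baseline_tagger.evaluate(brown_tagged_sents))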

Finally, let's combine the lookup tagger with the default tagger and see how its performance varies with the size of the lookup model:

def performance(cfd,wordlist):
    lt = dict((word,cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt,backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))
 
def display():
    import pylab
    words_by_freq = list(nltk.FreqDist(brown.words(categories='news')))
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(16)
    prefs = [performance(cfd,words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes,prefs,'-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()
 
display()
[Figure: Lookup Tagger Performance with Varying Model Size]

As the model size grows, performance improves quickly at first and then levels off; once it reaches that plateau, further increases in model size bring only small gains.

3. Training N-gram taggers

3.1 General N-gram tagging

The previous section already used a 1-gram (unigram) tagger. Taking more context into account gives 2-gram and 3-gram taggers, collectively referred to as N-gram taggers. Note that a longer context does not necessarily improve accuracy.
Besides supplying an N-gram tagger with a lookup model, a tagger can also be built by training. The constructor of an N-gram tagger is __init__(train=None, model=None, backoff=None), so a tagged corpus can be passed as training data to build a tagger.

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories = 'news')
train_num = int(len(brown_tagged_sents) * 0.9)
x_train = brown_tagged_sents[0:train_num]
x_test = brown_tagged_sents[train_num:]
tagger = nltk.UnigramTagger(train = x_train)
print(tagger.evaluate(x_test)) # 0.8121200039868434
0.8121200039868434

With a unigram tagger, training on 90% of the data and testing on the remaining 10% gives about 81% accuracy. Switching to a bigram tagger on its own, accuracy drops to roughly 10% (see the sketch below).
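A minimal sketch of why a bigram tagger collapses on its own, reusing x_train and x_test from above: as soon as it meets a (previous tag, word) context it never saw during training it returns None, and since the previous tag is then None for the next word as well, the rest of the sentence fails too:

# BigramTagger trained with no backoff: data sparsity makes it fail on unseen contexts
bigram_tagger = nltk.BigramTagger(train=x_train)
print(bigram_tagger.evaluate(x_test))             # roughly 0.10 on this split
unseen_sent = brown.sents(categories='news')[-1]  # a sentence from the held-out 10%
print(bigram_tagger.tag(unseen_sent))             # tags turn to None after the first unseen context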

3.2 Combining taggers

Using the backoff parameter, several taggers can be chained together to improve tagging accuracy.

import nltk
from nltk.corpus import brown
pattern = [
    (r'.*ing$','VBG'),
    (r'.*ed$','VBD'),
    (r'.*es$','VBZ'),
    (r'.*\'s$','NN$'),
    (r'.*s$','NNS'),
    (r'.*', 'NN')  # anything left unmatched is tagged NN
]
brown_tagged_sents = brown.tagged_sents(categories = 'news')
train_num = int(len(brown_tagged_sents) * 0.9)
x_train =  brown_tagged_sents[0:train_num]
x_test =   brown_tagged_sents[train_num:]

t0 = nltk.RegexpTagger(pattern)
t1 = nltk.UnigramTagger(x_train,backoff = t0)
t2 = nltk.BigramTagger(x_train,backoff = t1)
print(t2.evaluate(x_test)) # 0.8627529153792485
0.8627529153792485

As the results above show, no linguistic knowledge is required: statistics alone can make POS tagging work quite well.
For Chinese, given a tagged corpus, an N-gram tagger can be trained by exactly the same procedure.

4. Going further

nltk.tag.BrillTagger implements transformation-based tagging: starting from the output of a baseline tagger, it applies rule-based corrections to achieve higher accuracy.

import nltk
import nltk.tag.brill
from nltk.corpus import brown

pattern = [
    (r'.*ing$','VBG'),
    (r'.*ed$','VBD'),
    (r'.*es$','VBZ'),
    (r'.*\'s$','NN$'),
    (r'.*s$','NNS'),
    (r'.*', 'NN')  # anything left unmatched is tagged NN
]
# Split the data set
brown_tagged_sents = brown.tagged_sents(categories = ['news'])
train_num = int(len(brown_tagged_sents)*0.9)
x_train = brown_tagged_sents[:train_num]
x_test = brown_tagged_sents[train_num:]
# Baseline: a unigram tagger backed off to the regexp tagger, with a Brill trainer on top
baseline_tagger = nltk.UnigramTagger(x_train,backoff = nltk.RegexpTagger(pattern))
tt = nltk.tag.brill_trainer.BrillTaggerTrainer(baseline_tagger, nltk.tag.brill.brill24())
brill_tagger = tt.train(x_train,max_rules=20,min_acc=0.99)
# Evaluation
print(brill_tagger.evaluate(x_test))# 0.8683344961626632                                    
0.8683344961626632
brown_sents = brown.sents(categories="news")
print(brown_tagged_sents[2007])
print(brill_tagger.tag(brown_sents[2007]))
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
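To see what the Brill trainer actually learned, the transformation rules can be listed after training; a short sketch, assuming the brill_tagger trained above (rules() and print_template_statistics() are BrillTagger methods in NLTK 3):

# The learned transformation rules, in the order they are applied
for rule in brill_tagger.rules():
    print(rule)
# How often each rule template was used during training
brill_tagger.print_template_statistics()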

5. Training a Chinese tagger

Below we train a Chinese POS tagger based on a unigram model, using the People's Daily January 1998 tagged corpus, which can be downloaded from the web.

import nltk
import json

lines = open('./词性标注人民日报199801.txt',
        encoding = 'utf-8'
            ).readlines()
all_tagged_sents = []

for line in lines:
    sent = line.split()
    tagged_sent = []
    for item in sent:
        pair = nltk.str2tuple(item)
        tagged_sent.append(pair)
    
    if len(tagged_sent)>0:
        all_tagged_sents.append(tagged_sent)

train_size = int(len(all_tagged_sents)*0.8)
x_train = all_tagged_sents[:train_size]
x_test = all_tagged_sents[train_size:]

tagger = nltk.UnigramTagger(train=x_train,backoff=nltk.DefaultTagger('n'))
print(tagger.evaluate(x_test)) # 0.8714095491725319
"""
line:
19980101-01-001-001/m  迈向/v  充满/v  希望/n  的/u  新/a  世纪/n  ——/w  一九九八年/t  新年/t  讲话/n  (/w  附/v  图片/n  1/m  张/q  )/w 

line.split():
['19980101-01-001-001/m', '迈向/v', '充满/v', '希望/n', '的/u', '新/a', '世纪/n', '——/w', '一九九八年/t', '新年/t', '讲话/n', '(/w', '附/v', '图片/n', '1/m', '张/q', ')/w']

tagged_sent:
[('19980101-01-001-001', 'M'), ('迈向', 'V'), ('充满', 'V'), ('希望', 'N'), ('的', 'U'), ('新', 'A'), ('世纪', 'N'), ('——', 'W'), ('一九九八年', 'T'), ('新年', 'T'), ('讲话', 'N'), ('(', 'W'), ('附', 'V'), ('图片', 'N'), ('1', 'M'), ('张', 'Q'), (')', 'W')]
"""
0.8714095491725319
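The combination trick from section 3.2 carries over directly; a sketch that stacks a bigram tagger on top of the unigram model for the same Chinese corpus, reusing x_train and x_test from above. Note that nltk.str2tuple() upper-cases the tags, so the default tag is written as 'N' here; the resulting accuracy will depend on the split:

# Chain taggers for the Chinese corpus, mirroring section 3.2
t0 = nltk.DefaultTagger('N')                        # str2tuple() upper-cased the tags, so default to 'N'
t1 = nltk.UnigramTagger(train=x_train, backoff=t0)
t2 = nltk.BigramTagger(train=x_train, backoff=t1)
print(t2.evaluate(x_test))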


6. Brown corpus methods

# List of file ids in the corpus
brown.fileids()
['ca01', 'ca02', 'ca03', 'ca04', ..., 'cp28', 'cp29', 'cr01', 'cr02', ..., 'cr08', 'cr09']
# Return the list of file ids for the given category ('news')
brown.fileids('news')
['ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', ..., 'ca42', 'ca43', 'ca44']
# Return the raw text for the given categories
brown.raw(categories=['news'])

"\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./. ..."

# Return the raw text for the given file ids
brown.raw(fileids=['ca01','ca02'])
"\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.\n\n\n\tThe/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at City/nn-tl of/in-tl ... for/in-hl extension/nn-hl \nOther/ap recommendations/nns made/vbn by/in the/at committee/nn are/ber :/: \n\n\tExtension/nn of/in the/at ADC/nn program/nn to/in all/abn children/nns in/in need/nn living/vbg with/in any/dti relatives/nns ,/, including/in both/abx parents/nns ,/, as/cs a/at means/nns of/in preserving/vbg family/nn unity/nn ./.\n\n\n\tResearch/nn projects/nns as/ql soon/rb as/cs possible/jj on/in the/at causes/nns and/cc prevention/nn of/in dependency/nn and/cc illegitimacy/nn ./.\n\n"
# Return the list of sentences for the given file ids
brown.sents(fileids=['ca01','ca02'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
# Return the list of sentences for the given categories
brown.sents(categories=['news'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
# Return the list of words for the given file id
brown.words('ca01')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
# Return the list of words for the given categories
brown.words(categories=['news'])
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
# Return tagged sentences as a 2-D list of (word, tag) pairs
brown.tagged_sents(categories=['news'])
[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atlanta', 'NP-TL'), ("''", "''"), ('for', 'IN'), ('the', 'AT'), ('manner', 'NN'), ('in', 'IN'), ('which', 'WDT'), ('the', 'AT'), ('election', 'NN'), ('was', 'BEDZ'), ('conducted', 'VBN'), ('.', '.')], ...]
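Two more methods worth knowing; a short sketch (the universal-tagset mapping may require a one-time nltk.download('universal_tagset')):

# List of all category labels in the Brown corpus
print(brown.categories())    # ['adventure', 'belles_lettres', ..., 'news', ...]
# Tagged words mapped onto the coarse universal tagset instead of the raw Brown tags
print(brown.tagged_words(categories='news', tagset='universal')[:5])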
