
NLTK (2): Part-of-Speech Tagging


Natural language is a system of rules that humans have developed through communication. Some rules are stronger than others: informal occasions use colloquial speech, while formal occasions call for written language. To process natural language we must also follow these rules, otherwise the results will be unintelligible. Below is a brief distinction between a few terms.
Grammar: the conventions for writing a language, describing the language and its structure; it covers both syntax and morphology.
Syntax: the rules governing sentence structure and how constituents combine and relate.
Morphology (lexical rules): the rules governing word formation and inflection.

Part-of-speech (POS) tagging is a way of analysing the components of a sentence: it identifies the part of speech of each word.

The table below briefly lists the meanings of common POS tags; for full details see nltk.help.brown_tagset() (a short lookup example follows the table).

Tag   Part of speech       Examples
ADJ   adjective            new, good, high, special, big, local
ADV   adverb               really, already, still, early, now
CONJ  conjunction          and, or, but, if, while, although
DET   determiner           the, a, some, most, every, no
EX    existential there    there, there's
MOD   modal verb           will, can, would, may, must, should
NN    noun                 year, home, costs, time
NNP   proper noun          April, China, Washington
NUM   numeral              fourth, 2016, 09:30
PRON  pronoun              he, they, us
P     preposition          on, over, with, of
TO    the word to          to
UH    interjection         ah, ha, oops
VB    verb
VBD   verb, past tense     made, said, went
VBG   present participle   going, lying, playing
VBN   past participle      taken, given, gone
WH    wh-determiner        who, where, when, what
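As a quick way to check what an individual tag means, nltk.help can print documentation for the tags matching a regular expression; a minimal sketch (the 'tagsets' resource may need a one-time download first):

import nltk

# nltk.download('tagsets')        # one-time download of the tag documentation, if missing
nltk.help.brown_tagset('NN')      # documentation for the Brown 'NN' tag
nltk.help.brown_tagset('VB.*')    # all Brown verb tags matching the pattern
nltk.help.upenn_tagset('VBG')     # same lookup for the Penn Treebank tagset used by nltk.pos_tag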

1. POS tagging English text with NLTK

1.1 A POS tagging example

import nltk

sent = "I am going to Beijing tomorrow."

"""
nltk.sent_tokenize(text)      # split text into sentences (requires the punkt models)
nltk.word_tokenize(sentence)  # split a sentence into words
NLTK tokenizes at the sentence level, so a document should first be split into
sentences, and each sentence then tokenized into words:
"""
# Tokenize the sentence into words
words = nltk.word_tokenize(sent)
print(words)
['I', 'am', 'going', 'to', 'Beijing', 'tomorrow', '.']
# POS tagging
tagged_sent = nltk.pos_tag(words)
tagged_sent
[('I', 'PRP'),
 ('am', 'VBP'),
 ('going', 'VBG'),
 ('to', 'TO'),
 ('Beijing', 'NNP'),
 ('tomorrow', 'NN'),
 ('.', '.')]
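If word_tokenize or pos_tag raises a LookupError, the underlying models usually just need to be downloaded once; a minimal sketch (resource names may differ slightly in newer NLTK versions):

import nltk

nltk.download('punkt')                        # tokenizer models behind word_tokenize / sent_tokenize
nltk.download('averaged_perceptron_tagger')   # default English model behind nltk.pos_tag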

1.2 Pre-tagged data in the corpora

The corpus classes provide the following methods for returning pre-tagged data (a short usage sketch follows the table).

Method                              Description
tagged_words(fileids, categories)   tagged data as a list of words
tagged_sents(fileids, categories)   tagged data as a list of sentences
tagged_paras(fileids, categories)   tagged data as a list of paragraphs
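A minimal sketch of what these methods return, using the Brown corpus 'news' category (tagged_paras assumes the corpus records paragraph breaks, which Brown does):

from nltk.corpus import brown

print(brown.tagged_words(categories='news')[:3])    # [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]
print(brown.tagged_sents(categories='news')[0][:3]) # first three (word, tag) pairs of the first sentence
print(len(brown.tagged_paras(categories='news')))   # number of paragraphs, each a list of tagged sentences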

2. Taggers

2.1 The default tagger

The simplest POS tagger tags every word as a noun (NN). Such a tagger has little value on its own and its accuracy is low, but it establishes a baseline. The example below shows how to use NLTK's default tagger.

import nltk
from nltk.corpus import brown
# Load the data
brown_tagged_sents = brown.tagged_sents(categories='news') # [[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), 
brown_sents = brown.sents(categories='news')
# brown_tagged_sents
# The simplest tagger assigns the same tag to every token. That may seem like a rather
# crude approach, but it establishes an important baseline for tagger performance.
# To get the best result we should use the most likely tag, so the example below
# finds out which tag that is.
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
tags
['AT', 'NP-TL', 'NN-TL', 'JJ-TL', 'NN-TL', 'VBD', 'NR', 'AT', 'NN', 'IN', 'NP$', 'JJ', 'NN', 'NN', 'VBD', '``', 'AT', 'NN', "''", 'CS', 'DTI', 'NNS', 'VBD', 'NN', ...]
tag = nltk.FreqDist(tags).max()
tag
'NN'
# We can now create a tagger that tags every word as NN.
default_tagger = nltk.DefaultTagger('NN')
sent = "I am going to Beijing tomorrow."
default_tagger.tag(nltk.word_tokenize(sent))
[('I', 'NN'),
 ('am', 'NN'),
 ('going', 'NN'),
 ('to', 'NN'),
 ('Beijing', 'NN'),
 ('tomorrow', 'NN'),
 ('.', 'NN')]
default_tagger.evaluate(brown_tagged_sents)
0.13089484257215028

2.2 The regular expression tagger

A regular expression tagger assigns tags to tokens based on matching patterns. For example, we might guess that any word ending in ed is the past form of a verb, and any word ending in 's is a possessive noun. These guesses can be expressed as a list of regular expressions, as in the example below.

patterns = [
    (r'.*ing$', 'VBG'),                                # gerunds
    (r'.*ed$', 'VBD'),                                 # simple past
    (r'.*es$', 'VBZ'),                                 # 3rd singular present
    (r'.*ould$', 'MD'),                                # modals
    (r'.*\'s$', 'NN$'),                                # possessive nouns
    (r'.*s$', 'NNS'),                                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),                  # cardinal numbers
    (r'.*', 'NN')                                      # nouns (default)
]

These patterns are tried in order, and the first one that matches is used. Now build the tagger and use it to tag a sentence.

regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(brown_sents[3])
regexp_tagger.evaluate(brown_tagged_sents)
# 0.20326391789486245 # roughly one tag in five is correct
0.20326391789486245

2.3 The lookup tagger

Many of the most frequent words are not nouns at all. Let's find the 100 most frequent words and store their most likely tags; this information can then be used as the model for a "lookup tagger" (NLTK's UnigramTagger), as in the example below:

# First build a word frequency distribution
fd = nltk.FreqDist(brown.words(categories='news')) # ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

# A conditional frequency distribution collects the frequency distributions of several
# experiments, one per condition; it records how often each sample occurs under a given
# condition. Formally, it is a function mapping each condition to a FreqDist over outcomes.
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
# print(cfd.items()) # dict_items([('The', FreqDist({'AT': 775, 'AT-TL': 28, 'AT-HL': 3})), ('Fulton', FreqDist({'NP-TL': 10, 'NP': 4})), 

# Top 100 frequent words
most_freq_words = fd.keys()  # on Python 3.6+, dict_keys must be converted with list()
most_freq_words = list(most_freq_words)[:100] # ['The','Fulton','County','Grand','Jury','said','Friday','an',

# Dict comprehension: for each of the 100 words, take its most frequent tag as that word's tag
likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
# likely_tags # {'The': 'AT','Fulton': 'NP-TL','County': 'NN-TL','Grand': 'JJ-TL','Jury': 'NN-TL','said': 'VBD','Friday': 'NR',

# UnigramTagger looks up the most likely tag for each word in its model and uses it to tag new text.
baseline_tagger = nltk.UnigramTagger(model=likely_tags)
baseline_tagger = nltk.UnigramTagger(model = likely_tags)

baseline_tagger.evaluate(brown_tagged_sents) # 0.3329355371243312
# brown.tagged_words(categories='news') #[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
baseline_tagger.evaluate([brown.tagged_words(categories='news')]) # tagged_words() must be wrapped in a list to get a 2-D sequence: 0.3329355371243312
baseline_tagger.evaluate([brown.tagged_sents(categories='news')[3]]) # an individual sentence can score very high: 0.972972972972973
0.972972972972973

This result differs from the book, where knowing the tags of just the 100 most frequent words is enough to tag around 45% of tokens correctly. The likely reason is that in NLTK 3 FreqDist.keys() no longer returns words sorted by frequency, so the 100 words selected above are not actually the most frequent ones.
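A likely fix, sketched below: pick the 100 genuinely most frequent words with FreqDist.most_common() instead of the first 100 keys, reusing the fd, cfd and brown_tagged_sents built above. This follows the book's intent and should land much closer to its figure (the exact number depends on the NLTK version):

# Select the true top-100 words by frequency rather than the first 100 dictionary keys
top100 = [word for (word, _count) in fd.most_common(100)]
likely_tags = dict((word, cfd[word].max()) for word in top100)
baseline_tagger = nltk.UnigramTagger(model=likely_tags)
print(baseline_tagger.evaluate(brown_tagged_sents))  # expected to be well above the 0.33 obtained above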

Let's see how it performs on untagged input text:

sent = brown.sents(categories='news')[10] #
baseline_tagger.tag(sent)
[('It', 'PPS'),
 ('urged', None),
 ('that', 'CS'),
 ('the', 'AT'),
 ('city', 'NN'),
 ('``', '``'),
 ('take', None),
 ('steps', None),
 ('to', 'TO'),
 ('remedy', None),
 ("''", "''"),
 ('this', 'DT'),
 ('problem', None),
 ('.', '.')]

Many words are assigned a 'None' tag because they are not among the 100 most frequent words. In those cases we would like to fall back to the default tag NN: first try the lookup table, and if it cannot assign a tag, use the default tagger. This process is called "backoff".

# Set a default tagger to use when no match is found
baseline_tagger = nltk.UnigramTagger(model = likely_tags,backoff = nltk.DefaultTagger('NN'))
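With the backoff in place, the tokens that previously received None now at least get the default tag NN, so accuracy can only improve over the 0.33 measured without backoff; a quick check, reusing the variables from above:

# Re-evaluate the lookup tagger now that unknown words fall back to 'NN'
print(baseline_tagger.evaluate(brown_tagged_sents))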

Finally, let's combine the lookup tagger with the default tagger and see how its performance varies with the size of the lookup model:

def performance(cfd,wordlist):
    lt = dict((word,cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt,backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))
 
def display():
    import pylab
    words_by_freq = list(nltk.FreqDist(brown.words(categories='news')))
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(16)
    prefs = [performance(cfd,words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes,prefs,'-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()
 
display()
[Figure: Lookup Tagger Performance with Varying Model Size]

As the model size grows, performance improves quickly at first and then levels off; once it reaches that plateau, further increases in model size bring only small gains.

3. Training N-gram taggers

3.1 General N-gram tagging

The previous section already used a 1-gram (unigram) tagger. Taking more context into account gives 2-gram and 3-gram taggers, collectively referred to as N-gram taggers. Note that a longer context does not necessarily improve accuracy.
Besides supplying an N-gram tagger with a lookup model, a tagger can also be built by training. The constructor of an N-gram tagger is __init__(train=None, model=None, backoff=None), so a tagged corpus can be passed as training data to build a tagger.

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories = 'news')
train_num = int(len(brown_tagged_sents) * 0.9)
x_train = brown_tagged_sents[0:train_num]
x_test = brown_tagged_sents[train_num:]
tagger = nltk.UnigramTagger(train = x_train)
print(tagger.evaluate(x_test)) # 0.8121200039868434
0.8121200039868434

With a unigram tagger, training on 90% of the data and testing on the remaining 10% gives about 81% accuracy. Switching to a bigram tagger on its own, accuracy drops to roughly 10% (see the sketch below).
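A minimal sketch of why a bigram tagger collapses on its own, reusing x_train and x_test from above: as soon as it meets a (previous tag, word) context it never saw during training it returns None, and since the previous tag is then None for the next word as well, the rest of the sentence fails too:

# BigramTagger trained with no backoff: data sparsity makes it fail on unseen contexts
bigram_tagger = nltk.BigramTagger(train=x_train)
print(bigram_tagger.evaluate(x_test))             # roughly 0.10 on this split
unseen_sent = brown.sents(categories='news')[-1]  # a sentence from the held-out 10%
print(bigram_tagger.tag(unseen_sent))             # tags turn to None after the first unseen context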

3.2 Combining taggers

Using the backoff parameter, several taggers can be chained together to improve tagging accuracy.

import nltk
from nltk.corpus import brown
pattern = [
    (r'.*ing$','VBG'),
    (r'.*ed$','VBD'),
    (r'.*es$','VBZ'),
    (r'.*\'s$','NN$'),
    (r'.*s$','NNS'),
    (r'.*', 'NN')  # anything left unmatched is tagged NN
]
brown_tagged_sents = brown.tagged_sents(categories = 'news')
train_num = int(len(brown_tagged_sents) * 0.9)
x_train =  brown_tagged_sents[0:train_num]
x_test =   brown_tagged_sents[train_num:]

t0 = nltk.RegexpTagger(pattern)
t1 = nltk.UnigramTagger(x_train,backoff = t0)
t2 = nltk.BigramTagger(x_train,backoff = t1)
print(t2.evaluate(x_test)) # 0.8627529153792485
0.8627529153792485

As the results above show, no linguistic knowledge is required: statistics alone can make POS tagging work quite well.
For Chinese, given a tagged corpus, an N-gram tagger can be trained by exactly the same procedure.

4. Going further

nltk.tag.BrillTagger implements transformation-based tagging: starting from the output of a baseline tagger, it applies rule-based corrections to achieve higher accuracy.

import nltk
import nltk.tag.brill
from nltk.corpus import brown

pattern = [
    (r'.*ing$','VBG'),
    (r'.*ed$','VBD'),
    (r'.*es$','VBZ'),
    (r'.*\'s$','NN$'),
    (r'.*s$','NNS'),
    (r'.*', 'NN')  # anything left unmatched is tagged NN
]
# Split the data set
brown_tagged_sents = brown.tagged_sents(categories = ['news'])
train_num = int(len(brown_tagged_sents)*0.9)
x_train = brown_tagged_sents[:train_num]
x_test = brown_tagged_sents[train_num:]
# Baseline: a unigram tagger backed off to the regexp tagger, with a Brill trainer on top
baseline_tagger = nltk.UnigramTagger(x_train,backoff = nltk.RegexpTagger(pattern))
tt = nltk.tag.brill_trainer.BrillTaggerTrainer(baseline_tagger, nltk.tag.brill.brill24())
brill_tagger = tt.train(x_train,max_rules=20,min_acc=0.99)
# Evaluation
print(brill_tagger.evaluate(x_test))# 0.8683344961626632                                    
0.8683344961626632
brown_sents = brown.sents(categories="news")
print(brown_tagged_sents[2007])
print(brill_tagger.tag(brown_sents[2007]))
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
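To see what the Brill trainer actually learned, the transformation rules can be listed after training; a short sketch, assuming the brill_tagger trained above (rules() and print_template_statistics() are BrillTagger methods in NLTK 3):

# The learned transformation rules, in the order they are applied
for rule in brill_tagger.rules():
    print(rule)
# How often each rule template was used during training
brill_tagger.print_template_statistics()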

5. Training a Chinese tagger

Below we train a Chinese POS tagger based on a unigram model, using the People's Daily January 1998 tagged corpus, which can be downloaded from the web.

import nltk
import json

lines = open('./词性标注人民日报199801.txt',
        encoding = 'utf-8'
            ).readlines()
all_tagged_sents = []

for line in lines:
    sent = line.split()
    tagged_sent = []
    for item in sent:
        pair = nltk.str2tuple(item)
        tagged_sent.append(pair)
    
    if len(tagged_sent)>0:
        all_tagged_sents.append(tagged_sent)

train_size = int(len(all_tagged_sents)*0.8)
x_train = all_tagged_sents[:train_size]
x_test = all_tagged_sents[train_size:]

tagger = nltk.UnigramTagger(train=x_train,backoff=nltk.DefaultTagger('n'))
print(tagger.evaluate(x_test)) # 0.8714095491725319
"""
line:
19980101-01-001-001/m  迈向/v  充满/v  希望/n  的/u  新/a  世纪/n  ——/w  一九九八年/t  新年/t  讲话/n  (/w  附/v  图片/n  1/m  张/q  )/w 

line.split():
['19980101-01-001-001/m', '迈向/v', '充满/v', '希望/n', '的/u', '新/a', '世纪/n', '——/w', '一九九八年/t', '新年/t', '讲话/n', '(/w', '附/v', '图片/n', '1/m', '张/q', ')/w']

tagged_sent:
[('19980101-01-001-001', 'M'), ('迈向', 'V'), ('充满', 'V'), ('希望', 'N'), ('的', 'U'), ('新', 'A'), ('世纪', 'N'), ('——', 'W'), ('一九九八年', 'T'), ('新年', 'T'), ('讲话', 'N'), ('(', 'W'), ('附', 'V'), ('图片', 'N'), ('1', 'M'), ('张', 'Q'), (')', 'W')]
"""
0.8714095491725319
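The combination trick from section 3.2 carries over directly; a sketch that stacks a bigram tagger on top of the unigram model for the same Chinese corpus, reusing x_train and x_test from above. Note that nltk.str2tuple() upper-cases the tags, so the default tag is written as 'N' here; the resulting accuracy will depend on the split:

# Chain taggers for the Chinese corpus, mirroring section 3.2
t0 = nltk.DefaultTagger('N')                        # str2tuple() upper-cased the tags, so default to 'N'
t1 = nltk.UnigramTagger(train=x_train, backoff=t0)
t2 = nltk.BigramTagger(train=x_train, backoff=t1)
print(t2.evaluate(x_test))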


6. Brown corpus methods

# List of file ids in the corpus
brown.fileids()
['ca01', 'ca02', 'ca03', 'ca04', ..., 'cp28', 'cp29', 'cr01', 'cr02', ..., 'cr08', 'cr09']
# Return the list of file ids for the given category ('news')
brown.fileids('news')
['ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', ..., 'ca42', 'ca43', 'ca44']
# Return the raw text for the given categories
brown.raw(categories=['news'])

"\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./. ..."

# Return the raw text for the given file ids
brown.raw(fileids=['ca01','ca02'])
"\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.\n\n\n\tThe/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at City/nn-tl of/in-tl ... for/in-hl extension/nn-hl \nOther/ap recommendations/nns made/vbn by/in the/at committee/nn are/ber :/: \n\n\tExtension/nn of/in the/at ADC/nn program/nn to/in all/abn children/nns in/in need/nn living/vbg with/in any/dti relatives/nns ,/, including/in both/abx parents/nns ,/, as/cs a/at means/nns of/in preserving/vbg family/nn unity/nn ./.\n\n\n\tResearch/nn projects/nns as/ql soon/rb as/cs possible/jj on/in the/at causes/nns and/cc prevention/nn of/in dependency/nn and/cc illegitimacy/nn ./.\n\n"
# Return the list of sentences for the given file ids
brown.sents(fileids=['ca01','ca02'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
# Return the list of sentences for the given categories
brown.sents(categories=['news'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
# Return the list of words for the given file id
brown.words('ca01')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
# Return the list of words for the given categories
brown.words(categories=['news'])
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
# Return tagged sentences as a 2-D list of (word, tag) pairs
brown.tagged_sents(categories=['news'])
[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atlanta', 'NP-TL'), ("''", "''"), ('for', 'IN'), ('the', 'AT'), ('manner', 'NN'), ('in', 'IN'), ('which', 'WDT'), ('the', 'AT'), ('election', 'NN'), ('was', 'BEDZ'), ('conducted', 'VBN'), ('.', '.')], ...]
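Two more methods worth knowing; a short sketch (the universal-tagset mapping may require a one-time nltk.download('universal_tagset')):

# List of all category labels in the Brown corpus
print(brown.categories())    # ['adventure', 'belles_lettres', ..., 'news', ...]
# Tagged words mapped onto the coarse universal tagset instead of the raw Brown tags
print(brown.tagged_words(categories='news', tagset='universal')[:5])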
