How can Python parse the English personal names that appear in a string?

Here are four examples; the strings are results from Google Scholar.

str1 = "Jakes, William C., and Donald C. Cox. Microwave mobile communications. Wiley-IEEE Press, 1994."
str2 = "Schlegel, David J., Douglas P. Finkbeiner, and Marc Davis. \"Maps of dust infrared emission for use in estimation of reddening and cosmic microwave background radiation foregrounds.\" The Astrophysical Journal 500, no. 2 (1998): 525."
str3 = "Komatsu, Eiichiro, J. Dunkley, M. R. Nolta, C. L. Bennett, B. Gold, G. Hinshaw, N. Jarosik et al. \"Five-year Wilkinson microwave anisotropy probe (WMAP) observations: cosmological interpretation.\" arXiv preprint arXiv:0803.0547 (2008)."
str4 = "Gonzalez, Guillermo. Microwave transistor amplifiers: analysis and design. Vol. 61. Englewood 3Cliffs, NJ: Prentice-Hall, 1984."

How can I get lists of names like the following?

['Jakes', 'William C.', 'Donald C. Cox']
['Schlegel', 'David J.', 'Douglas P. Finkbeiner', 'Marc Davis']
['Komatsu', 'Eiichiro', 'J. Dunkley', 'M. R. Nolta', 'C. L. Bennett', 'B. Gold', 'G. Hinshaw', 'N. Jarosik']
['Gonzalez', 'Guillermo']

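You could describe the format with a BNF grammar and do the parsing with pyparsing.
You can think of BNF as a more advanced kind of regular expression, just somewhat more readable.
http://pyparsing.wikispaces.com/?responseToken=04d6d64f88c3ab19fc12af2f2887ff075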
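
As a rough illustration of the grammar idea, the author block of the four sample citations might be described by the BNF in the comment below. The sketch implements it with the standard `re` module rather than pyparsing, so it says nothing about pyparsing's own API. The rule names and the function `parse_author_block` are made up for this example, and the grammar is fitted only to the four samples (plain ASCII names, no hyphens or apostrophes).

```python
import re

# Assumed grammar for the author block (hypothetical, fitted to the samples):
#   author_block ::= name ("," ["and"] name)* ["et al"] "."
#   name         ::= word (" " word)*
#   word         ::= UPPER "." | UPPER LOWER+
WORD = r"(?:[A-Z]\.|[A-Z][a-z]+)"
NAME = rf"{WORD}(?: {WORD})*"
AUTHOR_BLOCK = re.compile(rf"({NAME}(?:, (?:and )?{NAME})*)(?: et al)?\.")


def parse_author_block(citation):
    """Return the comma-separated name parts, or [] if the grammar does not match."""
    m = AUTHOR_BLOCK.match(citation)
    if m is None:
        return []
    # Drop the "and " that may precede the last name.
    return [re.sub(r"^and ", "", part) for part in m.group(1).split(", ")]


if __name__ == "__main__":
    print(parse_author_block(
        "Jakes, William C., and Donald C. Cox. Microwave mobile communications. "
        "Wiley-IEEE Press, 1994."))
```

Because the pattern is anchored at the start with `match`, a string that does not fit the grammar gives an empty list instead of a wrong guess.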


Start by cutting off the non-name part. The leading part (the names) can then be matched with a regular expression.


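If it is only these four strings, a few special-case checks combined with split will do. Picking out personal names from arbitrary strings, however, is more of a job for deep neural networks (named-entity recognition).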
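
For just these four strings, the split-plus-special-cases idea could look like the sketch below. The helper name `names_by_split` and its special cases (a single capital letter before a period counts as an initial; a leading `and` and a trailing `et al` are dropped) are assumptions fitted to the samples, not a general rule.

```python
def names_by_split(citation):
    """Split-only sketch: gather the author block, then split it on commas."""
    pieces = []
    for piece in citation.split(". "):
        pieces.append(piece)
        last_token = piece.split(" ")[-1]
        # A lone capital letter means the period belonged to an initial,
        # so the author block continues; anything else ends it.
        if not (len(last_token) == 1 and last_token.isupper()):
            break

    names = []
    for part in ". ".join(pieces).split(","):
        part = part.strip()
        if part.startswith("and "):
            part = part[len("and "):]
        if part.endswith(" et al"):
            part = part[:-len(" et al")]
        names.append(part)
    return names


if __name__ == "__main__":
    print(names_by_split(
        "Gonzalez, Guillermo. Microwave transistor amplifiers: analysis and design. "
        "Vol. 61. Englewood Cliffs, NJ: Prentice-Hall, 1984."))
```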


For these four strings specifically, it can be done.

import re

str1 = "Jakes, William C., and Donald C. Cox. Microwave mobile communications. Wiley-IEEE Press, 1994."
str2 = "Schlegel, David J., Douglas P. Finkbeiner, and Marc Davis. \"Maps of dust infrared emission for use in estimation of reddening and cosmic microwave background radiation foregrounds.\" The Astrophysical Journal 500, no. 2 (1998): 525."
str3 = "Komatsu, Eiichiro, J. Dunkley, M. R. Nolta, C. L. Bennett, B. Gold, G. Hinshaw, N. Jarosik et al. \"Five-year Wilkinson microwave anisotropy probe (WMAP) observations: cosmological interpretation.\" arXiv preprint arXiv:0803.0547 (2008)."
str4 = "Gonzalez, Guillermo. Microwave transistor amplifiers: analysis and design. Vol. 61. Englewood 3Cliffs, NJ: Prentice-Hall, 1984."

str_list = [str1, str2, str3, str4]


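def filter_re(string):
    # Author block: everything up to the first period that is not preceded by a capital letter.
    names = re.search(r'.*?[^A-Z]\.', string).group(0)
    parts = names.rstrip('.').split(',')
    # Remove only a leading "and " or a trailing " et al", so names that merely contain "and" survive.
    return [re.sub(r'^and\s+|\s+et al$', '', p.strip()) for p in parts]

if __name__ == '__main__':
    result = [filter_re(s) for s in str_list]
    print(result)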

output:

[['Jakes', 'William C.', 'Donald C. Cox'], ['Schlegel', 'David J.', 'Douglas P. Finkbeiner', 'Marc Davis'], ['Komatsu', 'Eiichiro', 'J. Dunkley', 'M. R. Nolta', 'C. L. Bennett', 'B. Gold', 'G. Hinshaw', 'N. Jarosik'], ['Gonzalez', 'Guillermo']]

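The idea is to find the first period whose preceding character is not a capital letter and treat everything before that period as the name section. After that it is just splitting. This approach has one problem, though: if the last person's name happens to end in an abbreviation, as in "John D. Microwave mobile communications. Wiley-IEEE Press, 1994.", then "Microwave mobile communications" is also treated as part of a name.
I am not very familiar with the conventions for writing names in journal citations, but I hope this helps.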
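
The limitation described above is easy to reproduce. The snippet re-declares the same regular expression so that it runs on its own; the helper name `author_block` and the shortened first citation are for illustration only.

```python
import re


def author_block(citation):
    # Same pattern as in the answer: stop at the first period
    # that is not preceded by a capital letter.
    return re.search(r'.*?[^A-Z]\.', citation).group(0)


ok = author_block("Gonzalez, Guillermo. Microwave transistor amplifiers: analysis and design.")
bad = author_block("John D. Microwave mobile communications. Wiley-IEEE Press, 1994.")

print(ok)   # Gonzalez, Guillermo.
print(bad)  # John D. Microwave mobile communications.
```

In the second case the period after the initial "D" is skipped, so the match runs on to the end of the title.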
