Mining DBLP Author Collaborations, FP-Growth in Practice (2): From the DBLP Dataset
Published: 2021-05-25 12:40:07 | Category: Big Data | Source: compiled from the web
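Previous post: http://www.voidcn.com/article/p-nsbrwwsu-zv.html (Mining DBLP Author Collaborations, FP-Growth in Practice (1): Extracting the Target Information (Conferences, Authors, etc.) from the DBLP Dataset)

Readers reported that the code from that post was unusable, mainly because it was far too slow. Fair enough, I admit it was slow: it built the whole tree in memory, so that was bound to happen. The version below reads the file line by line instead.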
import re

# fromYear (a year string, e.g. "2000") and confNameDict (the conferences to keep)
# are globals that must be defined before XmlLineParser is called.
def XmlLineParser(fileName):
    authorPattern = re.compile(r'<author>(.*)</author>')
    rf = open(fileName, "r")
    for line in rf:
        if line.startswith("<inproceedings"):
            booktitle = ""
            year = ""
            title = ""
            authorList = ""
            # The inner loop advances the same file iterator, so it consumes this record's lines.
            for line in rf:
                if line.startswith("<author"):
                    authorList += line
                elif line.startswith("<title"):
                    title = line
                elif line.startswith("<year"):
                    year = line[6:10]  # the four digits after "<year>"
                    if year < fromYear:  # string comparison, fine for four-digit years
                        break
                elif line.startswith("<booktitle"):
                    # keep only the first word of the conference name
                    booktitle = line[11:].split("</")[0].split(" ")[0]
                    if booktitle not in confNameDict:
                        break
                elif line.startswith("</inproceedings"):
                    # output format: confName \t year \t title \t author1|author2|..|authorn
                    authors = []
                    for authorLine in authorList.split("\n"):
                        authors += authorPattern.findall(authorLine)
                    localTran = booktitle + "\t" + year + "\t" + title[7:].split("</")[0] + "\t" + "|".join(authors)
                    wf = open("tranDB.txt", "a")
                    wf.write(localTran + "\n")
                    wf.close()
                    break  # do not forget: go back to the outer loop for the next record
    rf.close()
Just call it directly: XmlLineParser(fileName). Help yourself. The code is a bit rough, so please bear with me.
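Since tranDB.txt is what the FP-Growth step will consume, here is a minimal sketch of reading it back into author transactions. It assumes the line format given in the code comment above (confName, year, title and a "|"-separated author list, joined by tabs). The helper names parse_transaction and load_author_transactions are made up for this illustration and are not part of the original code.

```python
def parse_transaction(line):
    """Split one tranDB.txt line into (conference, year, title, authors)."""
    fields = line.rstrip("\n").split("\t")
    conf, year = fields[0], fields[1]
    title = fields[2] if len(fields) > 2 else ""
    # The author field may be missing or empty if a record had no <author> lines.
    authors = fields[3].split("|") if len(fields) > 3 and fields[3] else []
    return conf, year, title, authors


def load_author_transactions(lines):
    """Yield the author list of every record that has at least one author.

    `lines` can be any iterable of strings, for example an open file.
    """
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        authors = parse_transaction(line)[3]
        if authors:
            yield authors
```

Something like `with open("tranDB.txt") as f: transactions = list(load_author_transactions(f))` would then give one author list per paper, which is the usual input shape for a frequent-itemset miner.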


