Mining DBLP Author Collaborations, FP-Growth in Practice (2): From the DBLP Dataset

Published: 2021-05-25 12:40:07 Category: Big Data Source: compiled from the web

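Summary: Previous article: http://www.voidcn.com/article/p-nsbrwwsu-zv.html? (Mining DBLP Author Collaborations, FP-Growth in Practice (1): Extracting target information (conferences, authors, etc.) from the DBLP dataset). Readers reported that the code there was unusable, mainly because it was too slow. Fair enough, I admit it was slow: it built the whole tree in memory, so of course it was! The version below scans the file line by line instead.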

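import re

# NOTE: fromYear and confNameDict are used below but are not defined in this snippet.
# The two values here are placeholders (assumptions); replace them with your own targets.
fromYear = "2000"             # earliest year to keep, as a 4-digit string
confNameDict = {"KDD": True}  # conference names (first word of <booktitle>) to keep

AUTHOR_RE = re.compile(r'<author>(.*?)</author>')

def XmlLineParser(fileName):
    rf = open(fileName, "r")
    wf = open("tranDB.txt", "a")  # open the output once instead of once per record
    for line in rf:
        if line.startswith("<inproceedings"):
            booktitle = ""
            year = ""
            title = ""
            authorList = ""
            # keep consuming the same file iterator until this record ends
            for line in rf:
                if line.startswith("<author"):
                    authorList += line
                elif line.startswith("<title"):
                    title = line
                elif line.startswith("<year"):
                    year = line[6:10]
                    if year < fromYear:  # string comparison of 4-digit years
                        break
                elif line.startswith("<booktitle"):
                    booktitle = line[11:].split("</")[0].split(" ")[0]
                    if booktitle not in confNameDict:
                        break
                elif line.startswith("</inproceedings"):
                    # output line: confName \t year \t title \t author1|author2|..|authorn
                    authors = []
                    for authorLine in authorList.split("\n"):
                        authors.extend(AUTHOR_RE.findall(authorLine))
                    wf.write(booktitle + "\t" + year + "\t"
                             + title[7:].split("</")[0] + "\t"
                             + "|".join(authors) + "\n")
                    break  # do not forget: leave the inner loop once the record is written
    wf.close()
    rf.close()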

Just call it directly:

XmlLineParser(fileName)

Help yourself, no thanks needed. The code is a bit rough, so please bear with me.
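
As a follow-up, here is a minimal sketch of how the lines in tranDB.txt could be read back into per-paper author lists, the "transactions" a frequent-itemset miner such as FP-Growth would consume. It assumes the fields are tab-separated in the order conference, year, title, authors, with authors joined by "|". The names parse_transaction_line and load_author_transactions and the min_authors threshold are hypothetical and are not part of the original code.

```python
def parse_transaction_line(line):
    """Split one tranDB.txt line into (conference, year, title, authors).

    Assumes four tab-separated fields with authors joined by "|".
    Returns None for lines that do not have exactly four fields.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return None
    conf, year, title, author_field = fields
    # drop empty names that a stray "|" would produce
    authors = [a for a in author_field.split("|") if a]
    return conf, year, title, authors


def load_author_transactions(path="tranDB.txt", min_authors=2):
    """Return one sorted, de-duplicated author list per paper.

    min_authors=2 is an assumption: a single-author paper carries
    no collaboration information, so it is skipped by default.
    """
    transactions = []
    with open(path, "r") as f:
        for line in f:
            record = parse_transaction_line(line)
            if record is None:
                continue
            authors = sorted(set(record[3]))
            if len(authors) >= min_authors:
                transactions.append(authors)
    return transactions
```

Each returned list can then be treated as one itemset, with author names as the items, when counting frequent co-author groups.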

