挖掘DBLP作者合作关系,FP-Growth算法实践(2):从DBLP数据集中
发布时间:2021-05-25 19:23:43 所属栏目:大数据 来源:网络整理
导读:副标题#e# 上篇文章:http://www.voidcn.com/article/p-nsbrwwsu-zv.html?(挖掘DBLP作者合作关系,FP-Growth算法实践(1):从DBLP数据集中提取目标信息(会议、作者等)) 大家反映代码不能用,主要是太慢了,好吧,我也承认慢,在内存构造树,肯定的!
|
副标题[/!--empirenews.page--]
上篇文章:http://www.voidcn.com/article/p-nsbrwwsu-zv.html?(挖掘DBLP作者合作关系,FP-Growth算法实践(1):从DBLP数据集中提取目标信息(会议、作者等)) 大家反映代码不能用,主要是太慢了,好吧,我也承认慢,在内存构造树,肯定的! 这次给出另外两种。 为了完整,先给出dom: #do not use this code!
def DomParser():
domTree=parse(fileName)
dblp=domTree.documentElement
inproceedingsList=dblp.getElementsByTagName("inproceedings")
for inproceedings in inproceedingsList:
year=inproceedings.getElementsByTagName("year")[0]
yearStr=str(year.childNodes[0].data)
if yearStr<fromYear:
continue
print "yearStr",yearStr,"=="*20
booktitle=inproceedings.getElementsByTagName("booktitle")[0]
booktitleStr=str(booktitle.childNodes[0].data)
#for "<booktitle>ICML Unsupervised and Transfer Learning</booktitle>"
booktitleStr=booktitleStr.split(" ")[0]
if not confNameDict.has_key(booktitleStr):
continue
print "booktitleStr",booktitleStr,"^^"*20
#allList=[] #"confName t year t title t author1|author2|..|authorn"
#authorDict={} #author: [frequence,yearStart,yearEnd]
allContent=booktitleStr+"t"+yearStr+"t" #confName t year t
title=inproceedings.getElementsByTagName("title")[0]
titleStr=str(title.childNodes[0].data)
allContent+=titleStr+"t" #title t
authorList=inproceedings.getElementsByTagName("author")
for i,author in enumerate(authorList):
authorStr=str(author.childNodes[0].data)
allContent+=authorStr+"|" #authori|
if authorDict.has_key(authorStr):
authorDict[authorStr][0]+=1
if yearStr<authorDict[authorStr][1]:
authorDict[authorStr][1]=yearStr
elif yearStr>authorDict[authorStr][2]:
authorDict[authorStr][2]=yearStr
else:
authorDict[authorStr]=[1,yearStr]
allList.append(allContent)
allContent="n".join(allList)
wf=open("allDB.txt","w")
wf.write(allContent)
wf.close()
authorList=sorted(authorDict.items(),lambda x,y: cmp(x[1],y[1]),reverse=True)
wf=open("authorDB.txt","w")
allContent="n".join([author+"t"+str(frequence)+"t"+yearStart+"t"+yearEnd for author,(frequence,yearEnd) in authorList])
wf.write(allContent)
wf.close()
再给出sax: class SAX_PARSER(xml.sax.ContentHandler):
'''
startDocument()方法
文档启动的时候调用。
endDocument()方法
解析器到达文档结尾时调用。
startElement(name,attrs)方法
遇到XML开始标签时调用,name是标签的名字,attrs是标签的属性值字典。
endElement(name)方法
遇到XML结束标签时调用。
characters(content)方法,调用时机:
从行开始,遇到标签之前,存在字符,content的值为这些字符串。
从一个标签,遇到下一个标签之前, 存在字符,content的值为这些字符串。
从一个标签,遇到行结束符之前,存在字符,content的值为这些字符串。
标签可以是开始标签,也可以是结束标签。
'''
def __init__(self):
self.authorList=""
self.title=""
self.year=""
self.booktitle=""
self.flag=0
self.tag=""
def startDocument(self):
print "Document start","=="*20
def endDocument(self):
print "Document end","=="*20
def startElement(self,tag,attributes):
print "startElement","ss"*20,tag
if tag=="inproceedings":
self.flag=1
elif self.flag==1: #tag!="inproceedings" and self.flag==1,we are now in a subtag of "inproceedings"
self.tag=tag
def endElement(self,tag):
print "endElement","ee"*20,tag
if self.flag==1 and tag=="inproceedings":
if confNameDict.has_key(self.booktitle) and self.year>=fromYear:
#allList=[] #"confName t year t title t author1|author2|..|authorn"
allContent=self.booktitle+"t"+self.year+"t"+self.title+"t"+self.authorList[:-1]+"n" #for the last "|"
wf=open("allDB.txt","a")
wf.write(allContent)
wf.close()
self.authorList=""
self.title=""
self.year=""
self.booktitle=""
self.flag=0
self.tag=""
def characters(self,content):
print "characters","cc"*20,content
if self.flag==1: #we are now in "inproceedings" tag
print self.tag
if self.tag=="author":
self.authorList+=content+"|"
elif self.tag=="title":
self.title=content
elif self.tag=="year":
self.year=content
elif self.tag=="booktitle":
self.booktitle=content.split(" ")[0] #for "<booktitle>ICML Unsupervised and Transfer Learning</booktitle>"
最后给出string,把每行看成字符串来处理的方式: (编辑:黄山站长网) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |
站长推荐


