from bs4 import BeautifulSoup
html='''
<h1> A </h1>
<div >AA </div>
<li>a1</li>
<li>a2</li>
<h2> B </h2>
<div >BB </div>
<li > b1</li>
<li > b2</li>
<li > b3</li>
<h3> C </h3>
<div >CC</div>
<li> c1</li>
'''
soup=BeautifulSoup(html,'lxml',from_encoding='utf-8')
for div in soup.findAll('div'):
    print(div.text,end="")
    for dt in div.find_all_next("li"):
        print("\t",dt.text,end=",")
    print()
Expected output:
AA a1,a2
BB b1,b2,b3
CC c1
But the actual output is:
AA a1, a2, b1, b2, b3, c1,
BB b1, b2, b3, c1,
CC c1,
You don't actually need nested for loops here; a simple if check is enough:
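# The 'lxml' parser wraps the fragment in <html><body>, so soup.body exists
soup = BeautifulSoup(html, 'lxml')
tags = soup.body
for tag in tags:
    if tag.name == 'div':
        # A <div> starts a new output line
        print('\n' + tag.string.strip(), end=' ')
    elif tag.name == 'li':
        print(tag.string.strip(), end=', ')
    else:
        continue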
Result:
AA a1, a2,
BB b1, b2, b3,
CC c1,
This is also much more efficient: it is a single O(n) pass over the nodes, whereas the nested for loops are O(n^2).
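If you want the exact "AA a1,a2" format from the question, the same single pass can collect the items into a dict first and join them at the end. This is only a rough sketch: the helper name group_items is made up for illustration, and it uses the built-in html.parser (which adds no <body>, so the nodes are direct children of the soup) rather than lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

HTML = """
<h1> A </h1>
<div >AA </div>
<li>a1</li>
<li>a2</li>
<h2> B </h2>
<div >BB </div>
<li > b1</li>
<li > b2</li>
<li > b3</li>
<h3> C </h3>
<div >CC</div>
<li> c1</li>
"""


def group_items(html):
    """Map each <div> text to the <li> texts that directly follow it (one pass)."""
    # Assumption: html.parser leaves the fragment unwrapped, so every
    # element is a top-level child of the soup object.
    soup = BeautifulSoup(html, "html.parser")
    groups = {}
    current = None  # text of the <div> we are currently collecting for
    for node in soup.children:
        if not isinstance(node, Tag):
            continue  # skip the whitespace strings between tags
        if node.name == "div":
            current = node.get_text(strip=True)
            groups[current] = []
        elif node.name == "li" and current is not None:
            groups[current].append(node.get_text(strip=True))
        else:
            current = None  # any other tag (h1, h2, ...) ends the group
    return groups


for title, items in group_items(HTML).items():
    print(title, ",".join(items))
# AA a1,a2
# BB b1,b2,b3
# CC c1
```

Joining at the end also avoids the trailing comma that the print-as-you-go version leaves on each line.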
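find_all_next() uses the .next_elements attribute to iterate over every tag and string that comes after the current tag, and returns all nodes that match.
It keeps going straight to the end of the document, so it can't be used this way.
Use find_next_siblings() with an extra check instead: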
for div in soup.findAll('div'):
    print(div.text,end="")
    for dt in div.find_next_siblings():
        if dt.name == 'li':
            print("\t",dt.text,end=",")
        else:
            break
    print()
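To see the difference between the two methods side by side, here is a small comparison on a shortened version of the markup. It is a sketch, not part of the original answer: the one-line HTML string, the html.parser choice, and the use of itertools.takewhile in place of the if/break are all my own assumptions.

```python
from itertools import takewhile

from bs4 import BeautifulSoup

# Shortened sample markup (made up for this comparison)
HTML = "<div>AA</div><li>a1</li><li>a2</li><h2>B</h2><div>BB</div><li>b1</li>"

soup = BeautifulSoup(HTML, "html.parser")
first_div = soup.find("div")

# find_all_next() walks everything after the tag, to the end of the document,
# so it also picks up the <li> that belongs to the second <div>.
all_next = [li.get_text() for li in first_div.find_all_next("li")]

# find_next_siblings(True) yields the following sibling tags in order;
# takewhile stops at the first one that is not an <li> (here the <h2>).
own_items = [
    li.get_text()
    for li in takewhile(lambda t: t.name == "li", first_div.find_next_siblings(True))
]

print(all_next)   # ['a1', 'a2', 'b1']
print(own_items)  # ['a1', 'a2']
```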