Scrapy: how to separate the text inside an HTML element

HeadAboutToExplode

The markup that contains my data:

        <div id="content"><!-- InstanceBeginEditable name="EditRegion3" -->
      <div id="content_div">
    <div class="title" id="content_title_div"><img src="img/banner_outlets.jpg" width="920" height="157" alt="Outlets" /></div>
    <div id="menu_list">
<table border="0" cellpadding="5" cellspacing="5" width="100%">
    <tbody>
        <tr>
            <td valign="top">
                <p>
                    <span class="foodTitle">Century Square</span><br />
                    2 Tampines Central 5<br />
                    #01-44-47 Century Square<br />
                    Singapore 529509</p>
                <p>
                    <br />
                    <strong>Opening Hours:</strong><br />
                    7am to 12am (Sun-Thu &amp;&nbsp;PH)<br />
                    24 Hours (Fri &amp; Sat&nbsp;&amp;</p>
                <p>
                    Eve of PH)<br />
                    Telephone: 6789 0457</p>
            </td>
            <td valign="top">
                <img alt="Century Square" src="/assets/images/outlets/century_sq.jpg" style="width: 260px; height: 140px" /></td>
            <td valign="top">
                <span class="foodTitle">Liat Towers</span><br />
                541 Liat towers #01-01<br />
                Orchard Road<br />
                Singapore 238888<br />
                <br />
                <strong>Opening Hours: </strong><br />
                24 hours (Daily)<br />
                <br />
                Telephone: 6737 8036</td>
            <td valign="top">
                <img alt="Liat Towers" src="/assets/images/outlets/century_liat.jpg" style="width: 260px; height: 140px" /></td>
        </tr>

I want to get:

Place name: Century Square, Liat Towers

Address: 2 Tampines Central 5, 541 Liat towers #01-01

Postal code: Singapore 529509, Singapore 238888

Opening hours: 7am to 12am, 24 hours (Daily)

For example:

The first <p> inside <td valign="top"> holds three of the values I want (name, address, postal code). How do I split them apart?

Here is my spider code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import re
from todo.items import wendyItem

class wendySpider(BaseSpider):
    name = "wendyspider"
    allowed_domains = ["wendys.com.sg"]
    start_urls = ["http://www.wendys.com.sg/outlets.php"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        values = hxs.select('//td')
        items = []
        for value in values:
            item = wendyItem()
            item['name'] = value.select('//span[@class="foodTitle"]/text()').extract()
            item['address'] = value.select().extract()
            item['postal'] = value.select().extract()
            item['hours'] = value.select().extract()
            item['contact'] = value.select().extract()
            items.append(item)
        return items

paul trmbrth

I would select every <td valign="top"> cell that contains a <span class="foodTitle">:

//div[@id="menu_list"]//td[@valign="top"][.//span[@class="foodTitle"]]

Then, for each of these td cells, get all of its text nodes:

.//text()

What you get looks like this:

['\n                ',
 '\n                    ',
 'Century Square',
 '\n                    2 Tampines Central 5',
 '\n                    #01-44-47 Century Square',
 '\n                    Singapore 529509',
 '\n                ',
 '\n                    ',
 'Opening Hours:',
 u'\n                    7am to 12am (Sun-Thu &\xa0PH)',
 u'\n                    24 Hours (Fri & Sat\xa0&',
 '\n                ',
 '\n                    Eve of PH)',
 '\n                    Telephone: 6789 0457',
 '\n            ']

and

['\n                ',
 'Liat Towers',
 '\n                541 Liat towers #01-01',
 '\n                Orchard Road',
 '\n                Singapore 238888',
 'Opening Hours: ',
 '\n                24 hours (Daily)',
 '\n                Telephone: 6737 8036']

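Some of these text nodes contain nothing but whitespace, so strip them and drop the empty ones. Then loop over the remaining lines and use the "Opening Hours" and "Telephone" keywords to decide which field each line belongs to: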

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import re
from todo.items import wendyItem

class wendySpider(BaseSpider):
    name = "wendyspider"
    allowed_domains = ["wendys.com.sg"]
    start_urls = ["http://www.wendys.com.sg/outlets.php"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        cells = hxs.select('//div[@id="menu_list"]//td[@valign="top"][.//span[@class="foodTitle"]]')
        items = []
        for cell in cells:
            item = wendyItem()

            # get all text nodes
            # some lines are blank so .strip() them
            lines = cell.select('.//text()').extract()
            lines = [l.strip() for l in lines if l.strip()]

            # first non-blank line is the place name
            item['name'] = lines.pop(0)

            # for the other lines, check for "Opening hours" and "Telephone"
            # to store lines in correct list container

            address_lines = []
            hours_lines = []
            telephone_lines = []

            opening_hours = False
            telephone = False

            for line in lines:
                if 'Opening Hours' in line:
                    opening_hours = True
                elif 'Telephone' in line:
                    telephone = True
                if telephone:
                    telephone_lines.append(line)
                elif opening_hours:
                    hours_lines.append(line)
                else:
                    address_lines.append(line)

            # last address line is the postal code + town name
            item['address'] = "\n".join(address_lines[:-1])
            item['postal'] = address_lines[-1]

            # omit the "Opening Hours" header (first element in the list)
            item['hours'] = "\n".join(hours_lines[1:])

            item['contact'] = "\n".join(telephone_lines)

            items.append(item)

        return items
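The line-sorting step does not depend on Scrapy, so it can be pulled out into a plain function and checked against the sample lists shown earlier. This is a minimal sketch under that assumption: `split_outlet` is a hypothetical name, and it returns a plain dict rather than a `wendyItem`. It follows the same rules as the spider above (first line is the name, the last address line is the postal code, and the keyword lines switch the target list).

```python
def split_outlet(lines):
    """Sort the text lines of one outlet cell into named fields.

    Assumes the layout seen in the question: name first, then address
    lines, then an "Opening Hours" header, then a "Telephone" line.
    """
    lines = [line.strip() for line in lines if line.strip()]
    name, rest = lines[0], lines[1:]

    address, hours, contact = [], [], []
    bucket = address                      # lines go here until a keyword appears
    for line in rest:
        if line.startswith('Opening Hours'):
            bucket = hours
            continue                      # the header itself is not stored
        if line.startswith('Telephone'):
            bucket = contact
        bucket.append(line)

    return {
        'name': name,
        'address': '\n'.join(address[:-1]),
        'postal': address[-1] if address else '',
        'hours': '\n'.join(hours),
        'contact': '\n'.join(contact),
    }


if __name__ == "__main__":
    sample = [
        '\n                ',
        'Liat Towers',
        '\n                541 Liat towers #01-01',
        '\n                Orchard Road',
        '\n                Singapore 238888',
        'Opening Hours: ',
        '\n                24 hours (Daily)',
        '\n                Telephone: 6737 8036',
    ]
    print(split_outlet(sample))
```

Switching a single `bucket` reference replaces the two boolean flags in the spider. The behaviour is the same for this page, but like the original it relies on the fields appearing in a fixed order.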
