How to deploy

  • scrapyd + supervisord + crontab + redis
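
    A minimal sketch of how these pieces fit together, assuming scrapyd is
    running on its default port (6800) and that a project named myproject with
    a spider named myspider has already been deployed to it (both names are
    placeholders): supervisord keeps the scrapyd daemon alive, crontab invokes
    a small script like the one below to kick off runs through scrapyd's
    schedule.json endpoint, and redis typically serves as the shared request
    queue / dedup store (e.g. via scrapy-redis).

      """Schedule a spider run through scrapyd's HTTP API.

      Meant to be called from a crontab entry, for example:
          0 3 * * * /usr/bin/python3 /opt/crawler/schedule_spider.py
      while supervisord keeps the scrapyd daemon itself running.
      """
      import sys

      import requests

      SCRAPYD_URL = "http://localhost:6800"   # scrapyd's default bind port
      PROJECT = "myproject"                   # placeholder project name
      SPIDER = "myspider"                     # placeholder spider name


      def schedule(project, spider):
          """POST to scrapyd's schedule.json endpoint and return the job id."""
          resp = requests.post(
              f"{SCRAPYD_URL}/schedule.json",
              data={"project": project, "spider": spider},
              timeout=10,
          )
          resp.raise_for_status()
          payload = resp.json()
          if payload.get("status") != "ok":
              sys.exit(f"scrapyd refused the job: {payload}")
          return payload["jobid"]


      if __name__ == "__main__":
          print("scheduled job:", schedule(PROJECT, SPIDER))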

Some useful libs

Distributed crawling

Reference blogs

Getting started

Combining with other tools

Industry resources

Industry demand

  • [lagou] (Lagou, a Chinese tech-hiring site)

Other

  • For example, how to avoid getting banned (a Scrapy settings sketch follows the tip list below)
    Here are some tips to keep in mind when dealing with these kinds of sites:
    - rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
    - disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour
    - use download delays (2 or higher). See DOWNLOAD_DELAY setting.
    - if possible, use Google cache to fetch pages, instead of hitting the sites directly
    - use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh
    - use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera
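
    A sketch of how a few of these tips map onto Scrapy settings plus a tiny
    downloader middleware, assuming a project package named myproject (the
    module path and the shortened user-agent strings are placeholders):

      # settings.py
      DOWNLOAD_DELAY = 2                # download delay of 2s or higher
      RANDOMIZE_DOWNLOAD_DELAY = True   # jitter the delay between requests
      COOKIES_ENABLED = False           # some sites use cookies to spot bots

      DOWNLOADER_MIDDLEWARES = {
          # turn off the stock user-agent middleware so ours takes over
          "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
          "myproject.middlewares.RotateUserAgentMiddleware": 400,
      }

      # middlewares.py
      import random


      class RotateUserAgentMiddleware:
          """Pick a user agent at random for every outgoing request."""

          USER_AGENTS = [
              "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
              "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
              "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
          ]

          def process_request(self, request, spider):
              request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
              return None  # let the request continue down the middleware chain

    Rotating IPs (Tor, ProxyMesh, Crawlera) would be layered on top of this as
    another downloader middleware or an external proxy, as the last two tips
    describe.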