Urllib & Requests

Urllib

Error types:
1. URLError
2. HTTPError (a subclass of URLError), which carries an HTTP status code

Range      Status
100~299    Success
300~399    Handleable (redirects)
400~599    Error

Catch the subclass exception before the parent class, i.e. handle HTTPError first and URLError second, as in the sketch below.
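
A minimal Python 3 sketch of this catch order (in Python 3, urllib2 was split into urllib.request and urllib.error; the fetch helper name is just for illustration):

import urllib.request
import urllib.error

def fetch(url):
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read()
    except urllib.error.HTTPError as e:
        # Subclass first: carries the HTTP status code
        print('HTTP error, status code:', e.code)
    except urllib.error.URLError as e:
        # Parent class second: DNS failures, refused connections, etc.
        print('URL error, reason:', e.reason)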

Usage differs between Python 2 and Python 3.

Python2

import urllib2
from bs4 import BeautifulSoup

# Fetch the HTML source of a URL
def getHtml(url):
    try:
        header = {
            "Accept": "text/html",
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22",
            #'Cookie': 'over18=1',  # parameters can be passed via the header, e.g. PTT Gossiping board -> "I am over 18"
        }
        request = urllib2.Request(url, headers=header)
        soup = BeautifulSoup(urllib2.urlopen(request).read(), 'lxml')  # remember to install the lxml package
        return soup

    except urllib2.HTTPError, e:
        return 'error'
    except urllib2.URLError, e:
        return 'error'
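
A quick usage example (the URL is a placeholder; the Cookie line above would be needed for age-gated boards like PTT Gossiping):

soup = getHtml('https://www.ptt.cc/bbs/Gossiping/index.html')
if soup != 'error':
    print(soup.title.string)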

Requests

# python3
import requests
from bs4 import BeautifulSoup

def getHtml(url):
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/55.0.2883.87 Safari/537.36 '
    }

    res = requests.get(url, headers=header)
    res.encoding = 'utf8'
    res.raise_for_status()  # abort if the request failed (similar to wrapping it in try/except)

    print(res.status_code)  # HTTP status code (int); can also be used to check whether the request succeeded

    soup = BeautifulSoup(res.text, 'lxml')  # remember to install the lxml package

    return soup
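
If you prefer to handle failures yourself rather than let raise_for_status() abort the program, the exceptions can be caught explicitly. A minimal sketch (the getHtmlSafe name is just for illustration); the same ordering rule applies here, because requests.exceptions.HTTPError is a subclass of requests.exceptions.RequestException:

import requests

def getHtmlSafe(url):
    try:
        res = requests.get(url, timeout=10)
        res.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx
        return res.text
    except requests.exceptions.HTTPError as e:
        print('HTTP error, status code:', e.response.status_code)  # subclass first
    except requests.exceptions.RequestException as e:
        print('Connection error:', e)  # parent class second
    return None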

SSL: CERTIFICATE_VERIFY_FAILED

import requests

# Disable the InsecureRequestWarning so it is less noisy
requests.packages.urllib3.disable_warnings()

requests.get(url, timeout=10, verify=False)

Adding verify=False skips SSL certificate verification, which works around the CERTIFICATE_VERIFY_FAILED error.
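
If you still want certificate checking, verify can instead take the path to a trusted CA bundle; a minimal sketch (the URL and bundle path below are placeholders):

import requests

# Safer alternative: keep verification on, but point it at your own CA bundle
res = requests.get('https://example.com', timeout=10,
                   verify='/path/to/cacert.pem')
print(res.status_code)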

References:

URLerror異常處理
Python爬虫入门六之Cookie的使用
https://tw.saowen.com/a/5fc6e9419520438129df8e091c27683af5cc933a01db3e259e8b44fde106b91e
http://docs.python-requests.org/zh_CN/latest/user/quickstart.html
https://blog.csdn.net/zahuopuboss/article/details/52964809
https://www.itread01.com/content/1549509138.html