
Extracting an HTML Page's Contents with Python's BeautifulSoup4

BeautifulSoup's get_text method strips the HTML tags from a page and returns its text content.
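As a small illustration of what get_text returns, the sketch below parses an inline HTML string instead of a downloaded page. The sample markup and the variable names are made up for this example; the separator and strip arguments are optional parameters of get_text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
html = "<html><body><h1>Title</h1><p>Hello <b>world</b>!</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# Default call: text nodes are concatenated with nothing between them.
plain = soup.get_text()

# Optional arguments: strip each text node and join them with a space.
spaced = soup.get_text(separator=" ", strip=True)

print(plain)
print(spaced)
```

With the default call, text from adjacent tags runs together, so passing a separator is often more readable for real pages.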

The html_content.py file looks like this:
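# -*- coding: utf-8 -*-
import sys
import os
from bs4 import BeautifulSoup
import requests

# When output is piped, Python 2 may leave sys.stdout.encoding unset.
# In that case, re-run the script with UTF-8 output forced.
if sys.stdout.encoding is None:
    os.putenv("PYTHONIOENCODING", 'UTF-8')
    os.execv(sys.executable, ['python']+sys.argv)

# The URL to fetch is the first command line argument.
url = sys.argv[1]
page_content = requests.get(url)
# Name the parser explicitly to avoid bs4's "no parser specified" warning.
text = BeautifulSoup(page_content.text, "html.parser").get_text()
# The parenthesized form works in both Python 2 and Python 3.
print(text)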

This Python script takes the URL as a command line argument and can be run like this:
# python html_content.py http://kadirsert.blogspot.com
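The script above assumes a URL is always given and that the request succeeds. A slightly more defensive variant might look like the sketch below. The function names extract_text and main, the usage message, and the 10 second timeout are assumptions made for this example, not part of the original script.

```python
import sys

import requests
from bs4 import BeautifulSoup


def extract_text(html):
    """Return the text of an HTML string with all tags stripped."""
    return BeautifulSoup(html, "html.parser").get_text()


def main(argv):
    """Fetch the URL given in argv[1] and print its text content."""
    if len(argv) != 2:
        sys.stderr.write("usage: python html_content.py URL\n")
        return 2
    # The timeout value is an arbitrary choice for this sketch.
    response = requests.get(argv[1], timeout=10)
    # Raise an exception for 4xx/5xx responses instead of printing an error page.
    response.raise_for_status()
    print(extract_text(response.text))
    return 0


# To use this as a script (performs a real HTTP request):
# sys.exit(main(sys.argv))
```

Keeping the tag stripping in its own function means it can be tried on any HTML string without touching the network.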
