
Extracting an HTML Page's Contents with Python's BeautifulSoup4

BeautifulSoup's get_text() method strips the HTML tags from a page and returns its text content.

The html_content.py file looks like this:
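# -*- coding: utf-8 -*-
import os
import sys

import requests
from bs4 import BeautifulSoup

# When stdout has no encoding (e.g. piped output under Python 2), restart
# the interpreter with PYTHONIOENCODING set so non-ASCII text can be printed.
if sys.stdout.encoding is None:
    os.environ["PYTHONIOENCODING"] = "UTF-8"
    os.execv(sys.executable, [sys.executable] + sys.argv)

# The URL to fetch is the first command-line argument.
url = sys.argv[1]
page_content = requests.get(url)

# Name the parser explicitly; get_text() returns the page text without the tags.
text = BeautifulSoup(page_content.text, "html.parser").get_text()
print(text)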

This Python script takes the URL of the page as a command-line argument:
# python html_content.py http://kadirsert.blogspot.com
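
To try get_text() without a network request, the same call can be applied to an HTML string held in memory. The snippet below is only a minimal sketch: the html_to_text helper and the sample markup are made up for illustration and are not part of the script above. It also removes script and style elements before extracting the text, since their contents are rarely wanted in the output, and uses the separator and strip arguments of get_text() to put each text fragment on its own line.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop <script> and <style> elements so their contents never reach the output.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # separator and strip are get_text() parameters: join the text fragments
    # with newlines and trim the whitespace around each fragment.
    return soup.get_text(separator="\n", strip=True)


# Made-up sample page used only for this illustration.
sample = """
<html>
  <head><title>Demo</title><script>var x = 1;</script></head>
  <body><h1>Hello</h1><p>Some <b>bold</b> text.</p></body>
</html>
"""

print(html_to_text(sample))
```

Because each tag boundary splits the text into separate fragments, "Some", "bold" and "text." end up on separate lines here; passing separator=" " instead keeps the fragments on a single line.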
