Quickly Parse HTML And XML With BeautifulSoup Python Library In Delphi And C++ Windows Apps
We know how to load and display web content or local files in Delphi using TWebBrowser. It supports the basic functions of a browser, such as navigating to a URL, going back, and going forward, along with specific events. How about web scraping in Delphi using the Python BeautifulSoup library? Sounds interesting? With the help of Python4Delphi, we can scrape web pages quickly in a Delphi/C++ Builder app. This post walks through a sample Python script to show how.
Prerequisites.
- If Python and Python4Delphi are not installed on your machine, check this post on how to run a simple Python script in a Delphi application using the Python4Delphi sample app.
- Open a Windows command prompt and type pip install -U bs4 to install BeautifulSoup4. For more information on installing Python modules, check here.
- First, run the Demo1 project, which executes Python scripts in Python for Delphi. Then load the script into the Memo1 field and press the Execute Script button to see the result. Go to GitHub to download the Demo1 source.
procedure TForm1.Button1Click(Sender: TObject);
begin
  PythonEngine1.ExecStrings(Memo1.Lines);
end;
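Before loading the larger sample, it can help to confirm that the interpreter used by the Demo1 app can actually see the bs4 package. The following is a minimal sketch of such a check, assuming it is pasted into Memo1 and run with the Execute Script button; the variable name smoke is only illustrative.

```python
import sys

import bs4
from bs4 import BeautifulSoup

# Report which interpreter is running and which bs4 version it can import.
print("Python:", sys.version.split()[0])
print("bs4:", bs4.__version__)

# Parse a tiny document with the built-in parser as a smoke test.
smoke = BeautifulSoup("<p>ok</p>", "html.parser")
print(smoke.p.string)
```

If the import fails here, the pip command above was most likely run against a different Python installation than the one the Delphi app loads.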
Beautiful Soup Python library sample script details: Beautiful Soup works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. The sample script demonstrates:
- How to transform a complex HTML document into a tree of Python objects. There are four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
- How to navigate within the tree of Python objects: going down, up, sideways, and back and forth, as well as navigating by tag name.
- How to search the parse tree using the two most popular methods, find() and find_all().
- How to modify the tree and write your changes as a new HTML or XML document.
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b id = "boldest"> The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
#Simple Html parsing.
soup = BeautifulSoup(html_doc,'html.parser')
print(soup.title)
print(soup.title.name)
print(soup.title.parent.name)
print(soup.p)
print(soup.p['class'])
print(soup.a)
# --Kinds of objects.---
tag = soup.b
print(type(tag))
# tag name
print(tag.name)
#tag id
print(tag['id'])
# Navigable string corresponds to a bit of text within a tag.
souptag = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag1 = souptag.b
print(tag1.string)
print(type(tag1.string))
#comments
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup1 = BeautifulSoup(markup, 'html.parser')
comment = soup1.b.string
print(type(comment))
#Navigating using tagnames
print(soup.head)
print(soup.title)
# going Up
title_tag = soup.title
print(title_tag)
print(title_tag.parent)
# Search the tree
#find by id
print(soup.find(id="link3"))
# find all with <a> tags
for tag in soup.find_all('a'):
    print(tag)
#Modifying the tree
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
print(tag)
del tag['class']
del tag['id']
print(tag)
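The script above covers going down and up the tree, but not sideways navigation or saving a modified document. The following is a rough sketch of both, using a shortened copy of the same sample markup. The output file name modified_sisters.html and the updated link text are assumptions made for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

html_doc = """<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>"""

soup = BeautifulSoup(html_doc, "html.parser")
first_link = soup.a

# Going sideways: find_next_sibling() skips the text nodes between the tags.
second_link = first_link.find_next_sibling("a")
print(second_link["id"])  # link2

# Going back the other way.
print(second_link.find_previous_sibling("a")["id"])  # link1

# Going down: collect only the <a> tags that are direct children of <p>.
link_ids = [a["id"] for a in soup.p.find_all("a", recursive=False)]
print(link_ids)

# Modify the tree, then serialize the whole document back to a string.
first_link["href"] = "http://example.com/elsie-updated"
first_link.string = "Elsie (updated)"
new_html = str(soup)

# Write the modified markup to a new file (hypothetical name, temp folder).
out_path = os.path.join(tempfile.gettempdir(), "modified_sisters.html")
with open(out_path, "w", encoding="utf-8") as fh:
    fh.write(new_html)
print("Saved to", out_path)
```

Using find_next_sibling("a") rather than .next_sibling avoids landing on the comma and whitespace text nodes that sit between the links.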

- BeautifulSoup has a select() method that runs a CSS selector against a parsed document and returns all the matching elements. Tag has a similar method that runs a CSS selector against the contents of a single tag. Check here for more details.
- You can do much more with this library, such as outputting the Beautiful Soup parse tree as a nicely formatted Unicode string with a separate line for each tag and each string, comparing objects for equality, and copying Beautiful Soup objects.
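As a small illustration of the CSS selector, pretty-printing, copying, and equality features mentioned here, consider the sketch below. It assumes a recent beautifulsoup4 install, where the soupsieve package that backs select() is pulled in automatically; the markup is a shortened version of the sample above.

```python
import copy

from bs4 import BeautifulSoup

html_doc = """<html><body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p></body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# select() returns a list of every element matching the CSS selector.
sisters = soup.select("p.story > a.sister")
names = [a.get_text() for a in sisters]
print(names)

# select_one() returns only the first match (or None if nothing matches).
second = soup.select_one("#link2")
print(second["href"])

# The same method on a single Tag searches only inside that tag.
story = soup.find("p", class_="story")
print(len(story.select("a")))

# copy.copy() creates a detached copy that compares equal to the original.
clone = copy.copy(second)
print(clone == second, clone is second)

# prettify() renders the tree with one tag or string per line.
print(soup.prettify())
```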
Note: The samples used for this demonstration were taken from the Beautiful Soup documentation; the only difference is that the outputs are printed. You can check the APIs and more samples in the same place.
You have now read a quick overview of the Beautiful Soup library. Download the library from here and pull data out of HTML and XML easily in your applications. Check out Python4Delphi to easily build Python GUIs for Windows using Delphi.