Topic: “beautifulsoup4 python – Beautiful Soup 4 Tutorial #1 – Web Scraping With Python”. This article was compiled by https://ph.taphoamini.com from a number of sources on the Internet. The video by Tech With Tim has 167,032 views and 4,910 likes.
Video description for beautifulsoup4 python:
Welcome to a new tutorial series on Beautiful Soup 4! Beautiful Soup 4 is a web scraping module that allows you to get information from HTML documents and modify them as well. It’s very versatile and there are a lot of things to go over. In this video, I’ll be giving an introduction/walkthrough to Beautiful Soup 4.
📄 Resources 📄
Beautiful Soup Docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Code In This Video: https://github.com/techwithtim/Beautiful-Soup-Tutorial
Fix Pip (Mac): https://www.youtube.com/watch?v=E-WhAS6qzsU
Fix Pip (Windows): https://www.youtube.com/watch?v=AdUZArA-kZw&t=7s
NewEgg Link: https://www.newegg.ca/gigabyte-geforce-rtx-3080-ti-gv-n308tgaming-oc-12gd/p/N82E16814932436?Description=3080&cm_re=3080-_-14-932-436-_-Product
📚 Playlist: https://www.youtube.com/watch?v=gRLHr664tXA&list=PLzMcBGfZo4-lSq2IDrA6vpZEV92AmQfJK
⭐️ Timestamps ⭐️
00:00 | Overview
01:26 | Beautiful Soup 4 Setup
02:51 | Reading HTML Files
05:50 | Find By Tag Name
07:45 | Find All By Tag Name
09:44 | Parsing Website HTML
12:50 | Locating Text
13:53 | Beautiful Soup Tree Structure
◼️◼️◼️◼️◼️◼️◼️◼️◼️◼️◼️◼️◼️◼️
⭐️ Tags ⭐️
– Tech With Tim
– Beautiful Soup 4
– Web Scraping
– HTML
– HTML Parsing
– Python
⭐️ Hashtags ⭐️
#TechWithTim #BeautifulSoup4
See some more details on the topic beautifulsoup4 python here:
beautifulsoup4 · PyPI
Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for …
Source: pypi.org
Date Published: 2/15/2022
View: 5549
Web Crawling with BeautifulSoup4 in Python – CodeLearn
Web Crawling with BeautifulSoup4 in Python. “You didn’t write that awful page. You’re just trying to get some data out of it.
Source: codelearn.io
Date Published: 3/19/2022
View: 4723
Beautiful Soup Documentation — Beautiful Soup 4.4.0 …
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, …
Source: beautiful-soup-4.readthedocs.io
Date Published: 11/16/2022
View: 9021
Beautiful Soup 4.9.0 documentation – Crummy
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, …
Source: www.crummy.com
Date Published: 6/10/2022
View: 6135
PyMOTM: Beautiful Soup 4 (Part I) – Viblo
Via APT: sudo apt-get install python-bs4; via pip: sudo pip install beautifulsoup4; via EasyInstall: sudo easy_install beautifulsoup4; from source:.
Source: viblo.asia
Date Published: 9/24/2021
View: 2473
Web Scraping Techniques in Python with Beautiful Soup
You can install Beautiful Soup 4 with pip. The package name is beautifulsoup4. It works on both Python 2 and Python 3.
Source: code.tutsplus.com
Date Published: 2/13/2022
View: 2154
Beautiful Soup – Installation – Tutorialspoint
Beautiful Soup – Installation, As BeautifulSoup is not a standard python library, we need to install it first. We are going to install the BeautifulSoup 4 …
Source: www.tutorialspoint.com
Date Published: 12/8/2022
View: 4620
The Beautiful Soup Library | How Kteam
BeautifulSoup is a Python library for pulling data out of HTML and XML files. It works together with parsers (…
Source: howkteam.vn
Date Published: 8/18/2021
View: 221
Web scraping and parsing with Beautiful Soup 4 Introduction
Welcome to a tutorial on web scraping with Beautiful Soup 4. Beautiful Soup is a Python library aimed at helping programmers who are trying to scrape data …
Source: pythonprogramming.net
Date Published: 5/2/2021
View: 3739
Video details for beautifulsoup4 python
- Author: Tech With Tim
- Views: 167,032
- Likes: 4,910
- Upload date: 2021/09/03
- Video URL: https://www.youtube.com/watch?v=gRLHr664tXA
What is beautifulsoup4 in Python?
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
How do I use beautifulsoup4 in Python?
First, we need to import all the libraries that we are going to use. Next, declare a variable for the url of the page. Then, make use of Python's urllib2 (urllib.request in Python 3) to get the HTML page of the url declared. Finally, parse the page into BeautifulSoup format so we can use BeautifulSoup to work on it.
Is bs4 same as beautifulsoup4?
The official name of the Beautiful Soup package on PyPI is beautifulsoup4. The separate bs4 package on PyPI exists to ensure that if you type pip install bs4 by mistake you will still end up with Beautiful Soup. In code, the module you import is named bs4 in either case.
Is Scrapy better than BeautifulSoup?
Scrapy has built-in support for generating feed exports in multiple formats, as well as for selecting and extracting data from various sources, so it is generally considered faster than Beautiful Soup for larger crawls. Work done with Beautiful Soup can be sped up with multithreading.
Is BeautifulSoup faster than selenium?
Comparing Selenium and BeautifulSoup shows that BeautifulSoup is more user-friendly, is quicker to learn, and makes it easier to get started on smaller web scraping tasks. Selenium, on the other hand, matters when the target website relies heavily on JavaScript in its pages.
How do you scrape data from a website in Python?
- Find the URL that you want to scrape.
- Inspecting the Page.
- Find the data you want to extract.
- Write the code.
- Run the code and extract the data.
- Store the data in the required format.
How do you scrape data from a website?
- Inspect the website HTML that you want to crawl.
- Access URL of the website using code and download all the HTML contents on the page.
- Format the downloaded content into a readable format.
- Extract out useful information and save it into a structured format.
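The steps above can be sketched roughly as follows. To stay self-contained, the example parses an inline HTML string instead of downloading a page; the markup, the `product` class name, and the field names are all made up for illustration and are not taken from any real site.

```python
import json

from bs4 import BeautifulSoup

# Stand-in for HTML that would normally be downloaded (hypothetical markup).
HTML = """
<ul>
  <li class="product"><span class="name">Keyboard</span><span class="price">49.90</span></li>
  <li class="product"><span class="name">Mouse</span><span class="price">19.50</span></li>
</ul>
"""


def extract_products(html):
    """Parse the HTML and return one dict per product item."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("li", class_="product"):
        rows.append({
            "name": item.find("span", class_="name").get_text(strip=True),
            "price": float(item.find("span", class_="price").get_text(strip=True)),
        })
    return rows


# Extract the useful fields and print them in a structured format (JSON here).
products = extract_products(HTML)
print(json.dumps(products, indent=2))
```

In a real scraper the `HTML` string would be replaced by the body of an HTTP response, and the selectors would come from inspecting the target page.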
How do I import a beautifulsoup4 in Jupyter notebook?
- Open a new anaconda prompt.
- Run conda install -c anaconda beautifulsoup4.
- Close and reopen jupyter notebook.
- In jupyter notebook import libraries as following: from bs4 import BeautifulSoup.
How do I install beautifulsoup4 with pip?
- Step 1: Open your command prompt.
- Step 2: Check the version of Python by typing the following command: python --version
- Step 3: Install Beautiful Soup using pip: pip install beautifulsoup4
How do I know if bs4 is installed?
- Open up the Python interpreter in a terminal by using the following command: python.
- Now, we can issue a simple import statement to see whether we have successfully installed Beautiful Soup or not by using the following command: from bs4 import BeautifulSoup.
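The same check can be done from a script without triggering an ImportError. This is a small sketch using the standard library's `importlib.util.find_spec`; the helper name `is_installed` is my own and not part of Beautiful Soup.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("bs4"):
    import bs4
    print("Beautiful Soup version:", bs4.__version__)
else:
    print("bs4 not found; try: pip install beautifulsoup4")
```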
Why is it called BeautifulSoup?
The poorly-formed stuff you saw on the Web was referred to as “tag soup”, and only a web browser could parse it. Beautiful Soup started out as an HTML parser that would take tag soup and make it beautiful, or at least workable.
Should I use Scrapy or Selenium?
Selenium is an excellent automation tool and Scrapy is by far the most robust web scraping framework. When we consider web scraping, in terms of speed and efficiency Scrapy is a better choice. While dealing with JavaScript based websites where we need to make AJAX/PJAX requests, Selenium can work better.
Is Selenium good for scraping?
Selenium wasn’t originally designed for web scraping. In fact, Selenium is a web driver designed to render web pages for test automation of web applications. This makes Selenium great for web scraping because many websites rely on JavaScript to create dynamic content on the page.
Is Scrapy safe?
Is Scrapy safe to use? A security scan of the latest version of Scrapy detected 1 vulnerability. It is advisable to conduct a security review before using this package.
beautifulsoup4
Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
Quick start
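>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p>Some<b>bad<i>HTML")
>>> print(soup.prettify())
<html>
 <body>
  <p>
   Some
   <b>
    bad
    <i>
     HTML
    </i>
   </b>
  </p>
 </body>
</html>
>>> soup.find(text="bad")
'bad'
>>> soup.i
<i>HTML</i>
#
>>> soup = BeautifulSoup("<tag1>Some<tag2/>bad<tag3>XML", "xml")
#
>>> print(soup.prettify())
<?xml version="1.0" encoding="utf-8"?>
<tag1>
 Some
 <tag2/>
 bad
 <tag3>
  XML
 </tag3>
</tag1>
To go beyond the basics, comprehensive documentation is available.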
Note on Python 2 sunsetting
Beautiful Soup’s support for Python 2 was discontinued on December 31, 2020: one year after the sunset date for Python 2 itself. From this point onward, new Beautiful Soup development will exclusively target Python 3. The final release of Beautiful Soup 4 to support Python 2 was 4.9.3.
Supporting the project
If you use Beautiful Soup as part of your professional work, please consider a Tidelift subscription. This will support many of the free software projects your organization depends on, not just Beautiful Soup.
If you use Beautiful Soup for personal projects, the best way to say thank you is to read Tool Safety, a zine I wrote about what Beautiful Soup has taught me about software development.
Building the documentation
The bs4/doc/ directory contains full documentation in Sphinx format. Run make html in that directory to create HTML documentation.
Running the unit tests
Beautiful Soup supports unit test discovery using Pytest:
$ pytest
Web Crawling with BeautifulSoup4 in Python
“You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help.” – from the introduction to beautifulsoup4.
Tired of pressing Ctrl+F and saving every piece of information by hand? Beautiful Soup is a useful tool for screen-scraping HTML, XML, and other markup documents from the websites you are interested in. The library helps you pull out the content you want and discard the unneeded parts of the HTML you fetched from a site.
Setting up the environment
To install beautifulsoup4, make sure Python and pip are already installed.
Python install: https://www.python.org/
get-pip install: https://bootstrap.pypa.io/get-pip.py
Once you have both, a single command installs beautifulsoup4:
pip install beautifulsoup4
After that you also need a parser, which interprets and restructures the HTML pulled from the website. Besides html.parser, which ships with the Python standard library, you can install other parsers such as lxml or html5lib with pip:
pip install parser_name
Personally, I recommend lxml because it processes data quickly, has many useful features, and is easy to use.
Demo
We will start with a simple, gentle demo: getting the top 10 best Oscar-winning movies of all time.
Here we only care about the titles of the movies (the highlighted part).
To start using Beautiful Soup with the lxml parser, you first need to import them.
from bs4 import BeautifulSoup
import lxml
If you feel another parser suits you better, feel free to use it. Here I will only use lxml, because it is what most people use and because it is fast.
Let's start cooking the soup!
1. Ingredients:
To keep things simple and avoid libraries that have not been introduced yet, I will just download the HTML of the page we want to scrape.
To do this, right-click on the page and choose Save as:
Then save the page as HTML:
Also rename the file to something short and meaningful so it is easier to open later; I will call mine “top10”.
With that, our ingredients are ready.
2. Cooking:
Before we start cooking, let's take a closer look at the movie titles, which are our target.
Looking through the downloaded HTML file (Ctrl+F), you can see that the titles are written as <h3> tags and, luckily, they are the only elements written as <h3>!
Except for this one >:(
“Sign up for a trial class or test at this center to make the best choice”
But don't worry; in a moment I will show you how to remove unneeded items that slip in when extracting information.
So for now, to get the movie titles, we just need to find and collect everything inside the <h3> tags, and the lockdown season will be a lot less boring with the good movies you have just found (we will also get rid of the unneeded <h3> >:( ).
To begin, put the necessary ingredients into the soup:
soup = BeautifulSoup(open("top10.html", encoding="utf8"), features="lxml")
Here I opened the file “top10.html”, read it as “utf8”, and used the lxml parser. But watch out, you may run into this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to <undefined>
Or possibly a timeout error…
This is a common encoding error: the encoding you chose is probably wrong, so decoding fails while the file is being read. To fix it, change the encoding to a different one. Common encodings are latin1 and utf8, and there are others such as cp437. To find out which encoding a page uses, open the browser console and type document.characterSet; it returns the encoding you need:
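When you cannot check the page in a browser, one possible workaround is to read the file as raw bytes and try a few candidate encodings in order. The sketch below is only an illustration: the helper name `decode_with_fallback` and the candidate list are my own choices, and because latin-1 can decode any byte sequence it acts as a last resort that may produce wrong characters.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each encoding in turn; return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes, try the next
    raise ValueError("none of the candidate encodings could decode the data")


# Bytes that are valid cp1252/latin-1 but not valid UTF-8.
raw = "café".encode("latin-1")
text, used = decode_with_fallback(raw)
print(used, text)
```

For a saved page, `raw` would come from `open("top10.html", "rb").read()`, and the decoded text can then be passed to `BeautifulSoup`. Beautiful Soup can also be given the raw bytes directly, in which case it tries to detect the encoding itself.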
And now that we have the soup, we use the “find_all” method to find all the <h3> tags:
links = soup.find_all('h3')
Here I searched the soup for every <h3> and stored the results in a Python list called links. To finish, print the elements of this list:
for link in links:
    print(link.text)
I use link.text rather than link because .text gives me the content of those <h3> tags. Without .text, I would get the movie names still wrapped in their tags: “<h3>1/ Titanic (1997)</h3>” instead of “1/ Titanic (1997)”.
Putting all of this together, the full code is:
from bs4 import BeautifulSoup
import lxml

soup = BeautifulSoup(open("top10.html", encoding="utf8"), features="lxml")
links = soup.find_all('h3')
for link in links:
    print(link.text)
Running this file gives:
Oh no, I forgot that we still have that extra <h3> tag. Let me help you fix that right away.
Looking at the HTML file again, we see that all the movie information sits inside an <article> tag with two classes, “article-content school-info”. So the problem becomes simpler: find the .text of every <h3> tag inside an <article> tag that has the two classes “article-content” and “school-info”. This is where the usefulness of beautifulsoup4 really shows: the library lets us navigate to the child elements of an element by following the DOM tree, which makes it easy to get the information we need.
To find the “article” tag with the two classes “article-content school-info” in the soup:
article = soup.find('article', class_='article-content school-info')
And to find the “h3” tags inside article:
links = article.find_all('h3')
Printing the result once more gives the movie list we wanted:
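The difference between searching the whole soup and searching inside one element can be seen on a tiny inline document. This is a sketch only: the markup below is a made-up stand-in for the saved page (the second movie title is invented), and it uses html.parser so it runs without lxml installed. It matches on the single class `article-content`, which Beautiful Soup accepts for elements that carry several classes.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the saved page: one stray <h3> outside the article.
HTML = """
<html><body>
  <h3>Sign up for a trial lesson</h3>
  <article class="article-content school-info">
    <h3>1/ Titanic (1997)</h3>
    <h3>2/ Example Movie (2000)</h3>
  </article>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Searching the whole document also picks up the unwanted heading.
all_titles = [h3.text for h3 in soup.find_all("h3")]

# Searching only inside the <article> element leaves it out.
article = soup.find("article", class_="article-content")
movie_titles = [h3.text for h3 in article.find_all("h3")]

print(all_titles)
print(movie_titles)
```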
Did it work? Now practice by extracting the name, price, and id of the items on a Tiki shopping page: computers and electronic components
Hint: to get the value of an attribute of a tag you have found, use the syntax:
tag_variable['attribute_name']
Suppose you have a variable link holding a div that you found, and that div has an attribute data-name. To get the value of the attribute, write:
link['data-name']
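As a quick illustration of attribute access, the sketch below uses a one-line made-up `div` (the attribute values are invented). It also shows `.get()`, which returns `None` for a missing attribute instead of raising `KeyError`, and the fact that Beautiful Soup returns `class` as a list because it is a multi-valued attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical product element; attribute values are invented for the example.
HTML = '<div class="product-item" data-name="USB cable" data-price="35000"></div>'

soup = BeautifulSoup(HTML, "html.parser")
link = soup.find("div", class_="product-item")

name = link["data-name"]        # dictionary-style access; KeyError if missing
sku = link.get("data-sku")      # .get() returns None when the attribute is absent
classes = link["class"]         # multi-valued attribute, returned as a list

print(name, sku, classes)
```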
Try writing the code for this exercise yourself first, then compare it with mine.
CODE:
from bs4 import BeautifulSoup
import lxml

# Parse the saved Tiki page with the lxml parser
soup = BeautifulSoup(open("tiki.html", encoding="utf8"), features="lxml")
# Each product is a <div> with the class "product-item"
links = soup.find_all('div', class_='product-item')
for link in links:
    print('Name: ' + link['data-title'] + ' | Price: ' + link['data-price'] + ' | ID: ' + link['product-sku'])
And this is the result I got:
In summary, in this article you learned how to read attributes from a tag, search for elements with a given class and id, and navigate to the children of a parent in the DOM tree. This is the foundation for more advanced topics later. beautifulsoup4 can also be combined with libraries such as Requests (to fetch data remotely instead of downloading the HTML file) or with modules that save data as CSV, JSON, and so on.
Closing:
In this article, you worked through a demo and got a basic understanding of how beautifulsoup4 works and how to search for information with it. I hope this article gave you a clearer picture of beautifulsoup4, which is completely new to many people. Thank you very much for taking the time to read this post, and see you in other articles.
bs4 — BeautifulSoup 4 — Python 3.6.1 documentation
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
How to scrape websites with Python and BeautifulSoup
by Justin Yek
There is more information on the Internet than any human can absorb in a lifetime. What you need is not access to that information, but a scalable way to collect, organize, and analyze it.
You need web scraping.
Web scraping automatically extracts data and presents it in a format you can easily make sense of. In this tutorial, we’ll focus on its applications in the financial market, but web scraping can be used in a wide variety of situations.
If you’re an avid investor, getting closing prices every day can be a pain, especially when the information you need is found across several webpages. We’ll make data extraction easier by building a web scraper to retrieve stock indices automatically from the Internet.
Getting Started
We are going to use Python as our scraping language, together with a simple and powerful library, BeautifulSoup.
For Mac users, Python is pre-installed in OS X. Open up Terminal and type python --version. You should see your python version is 2.7.x.
For Windows users, please install Python through the official website.
Next we need to get the BeautifulSoup library using pip , a package management tool for Python.
In the terminal, type:
easy_install pip
pip install BeautifulSoup4
Note: If you fail to execute the above command line, try adding sudo in front of each line.
The Basics
Before we start jumping into the code, let’s understand the basics of HTML and some rules of scraping.
HTML tags
If you already understand HTML tags, feel free to skip this part.
<!DOCTYPE html>
<html>
    <head>
    </head>
    <body>
        <h1> First Scraping </h1>
        <p> Hello World </p>
    </body>
</html>
This is the basic syntax of an HTML webpage. Every <tag> serves a block inside the webpage:
1. <!DOCTYPE html>: HTML documents must start with a type declaration.
2. The HTML document is contained between <html> and </html>.
3. The meta and script declaration of the HTML document is between <head> and </head>.
4. The visible part of the HTML document is between <body> and </body> tags.
5. Title headings are defined with the <h1> through <h6> tags.
6. Paragraphs are defined with the <p> tag.
Other useful tags include <a> for hyperlinks, <table> for tables, <tr> for table rows, and <td> for table columns. Also, HTML tags sometimes come with id or class attributes. The id attribute specifies a unique id for an HTML tag and the value must be unique within the HTML document. The class attribute is used to define equal styles for HTML tags with the same class. We can make use of these ids and classes to help us locate the data we want.
For more information on HTML tags, id and class, please refer to W3Schools Tutorials.
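To make the role of tags, ids, and classes concrete, here is a small sketch that locates elements by each of them with Beautiful Soup. The page content, the id `main-title`, and the class `note` are invented for the example.

```python
from bs4 import BeautifulSoup

# A tiny hypothetical page using the tags described above.
HTML = """
<!DOCTYPE html>
<html>
  <head><title>First Scraping</title></head>
  <body>
    <h1 id="main-title">First Scraping</h1>
    <p class="note">Hello World</p>
    <p class="note">Second paragraph</p>
    <a href="https://example.com">A link</a>
  </body>
</html>
"""

soup = BeautifulSoup(HTML, "html.parser")

title = soup.find(id="main-title").get_text()                     # by unique id
notes = [p.get_text() for p in soup.find_all("p", class_="note")]  # by class
href = soup.find("a")["href"]                                     # by tag name

print(title, notes, href)
```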
Scraping Rules
1. You should check a website’s Terms and Conditions before you scrape it. Be careful to read the statements about legal use of data. Usually, the data you scrape should not be used for commercial purposes.
2. Do not request data from the website too aggressively with your program (also known as spamming), as this may break the website. Make sure your program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so make sure to revisit the site and rewrite your code as needed.
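The "one request per second" guideline can be enforced with a short pause between requests. The sketch below is one possible way to do it; `fetch_politely` is a hypothetical helper, and the `fetch` argument stands for whatever function actually downloads a page. The demo passes a fake fetch function so that no network access is needed.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Demo with a fake fetch function and a very short delay.
pages = fetch_politely(
    ["page-1", "page-2", "page-3"],
    fetch=lambda url: "<html>" + url + "</html>",
    delay_seconds=0.01,
)
print(pages)
```

The `sleep` parameter is injectable only to make the function easy to test; in normal use the default `time.sleep` is enough.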
Inspecting the Page
Let’s take one page from the Bloomberg Quote website as an example.
As someone following the stock market, we would like to get the index name (S&P 500) and its price from this page. First, right-click and open your browser’s inspector to inspect the webpage.
Try hovering your cursor on the price and you should be able to see a blue box surrounding it. If you click it, the related HTML will be selected in the browser console.
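From the result, we can see that the price is inside a few levels of HTML tags, which is <div class="basic-quote"> → <div class="price-container up"> → <div class="price">.
Similarly, if you hover and click the name “S&P 500 Index”, it is inside <div class="basic-quote"> and <h1 class="name">.
Now we know the unique location of our data with the help of class attributes.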
Jump into the Code
Now that we know where our data is, we can start coding our web scraper. Open your text editor now!
First, we need to import all the libraries that we are going to use.
# import libraries
import urllib2
from bs4 import BeautifulSoup
Next, declare a variable for the url of the page.
# specify the url
quote_page = 'http://www.bloomberg.com/quote/SPX:IND'
Then, make use of the Python urllib2 to get the HTML page of the url declared.
# query the website and return the html to the variable 'page'
page = urllib2.urlopen(quote_page)
Finally, parse the page into BeautifulSoup format so we can use BeautifulSoup to work on it.
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
Now we have a variable, soup , containing the HTML of the page. Here’s where we can start coding the part that extracts the data.
Remember the unique layers of our data? BeautifulSoup can help us get into these layers and extract the content with find(). In this case, since the HTML class name is unique on this page, we can simply query <h1 class="name">.
# Take out the <h1> of name and get its value
name_box = soup.find('h1', attrs={'class': 'name'})
After we have the tag, we can get the data by getting its text.
name = name_box.text.strip() # strip() is used to remove leading and trailing whitespace
print name
Similarly, we can get the price too.
# get the index price
price_box = soup.find('div', attrs={'class':'price'})
price = price_box.text
print price
When you run the program, you should be able to see that it prints out the current price of the S&P 500 Index.
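The code above is written for Python 2. In Python 3, `urllib2` no longer exists (its functionality lives in `urllib.request`) and `print` is a function. As a rough Python 3 sketch of the same extraction, the example below parses an inline HTML string so it runs without network access; the markup and the price value are made up, and the current Bloomberg page may be structured differently.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded quote page (hypothetical markup and values).
HTML = """
<div class="basic-quote">
  <h1 class="name"> S&amp;P 500 Index </h1>
  <div class="price-container up"><div class="price">2,399.38</div></div>
</div>
"""


def parse_quote(html):
    """Return (index name, price text) from a quote page."""
    soup = BeautifulSoup(html, "html.parser")
    name_box = soup.find("h1", attrs={"class": "name"})
    price_box = soup.find("div", attrs={"class": "price"})
    # strip() removes leading and trailing whitespace
    return name_box.text.strip(), price_box.text.strip()


name, price = parse_quote(HTML)
print(name, price)
```

With a live page, the `html` argument could come from `urllib.request.urlopen(quote_page).read()`.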
Export to Excel CSV
Now that we have the data, it is time to save it. The Excel Comma Separated Format is a nice choice. It can be opened in Excel so you can see the data and process it easily.
But first, we have to import the Python csv module and the datetime module to get the record date. Insert these lines to your code in the import section.
import csv
from datetime import datetime
At the bottom of your code, add the code for writing data to a csv file.
# open a csv file with append, so old data will not be erased
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([name, price, datetime.now()])
Now if you run your program, you should be able to export an index.csv file, which you can then open with Excel, where you should see a line of data.
So if you run this program every day, you will be able to easily get the S&P 500 Index price without rummaging through the website!
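For Python 3, a sketch of the same append-to-CSV step might look like this. The helper name `append_row` is my own, the sample values are invented, and the demo writes to a temporary directory rather than to `index.csv` in the working directory. Opening the file with `newline=""` is what the `csv` module documentation recommends, and it avoids blank lines between rows on Windows.

```python
import csv
import os
import tempfile
from datetime import datetime


def append_row(path, name, price):
    """Append one (name, price, timestamp) row to the CSV file at path."""
    with open(path, "a", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow([name, price, datetime.now().isoformat()])


# Demo: write two rows to a temporary file and read them back.
demo_path = os.path.join(tempfile.mkdtemp(), "index.csv")
append_row(demo_path, "S&P 500 Index", "2,399.38")
append_row(demo_path, "NASDAQ Composite Index", "6,100.76")

with open(demo_path, newline="", encoding="utf-8") as csv_file:
    rows = list(csv.reader(csv_file))
print(rows)
```

Because the price contains a comma, the writer quotes that field automatically, so it still reads back as a single column.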
Going Further (Advanced uses)
Multiple Indices
So scraping one index is not enough for you, right? We can try to extract multiple indices at the same time.
First, modify the quote_page into an array of URLs.
quote_page = ['http://www.bloomberg.com/quote/SPX:IND', 'http://www.bloomberg.com/quote/CCMP:IND']
Then we change the data extraction code into a for loop, which will process the URLs one by one and store all the data into a variable data in tuples.
# for loop
data = []
for pg in quote_page:
    # query the website and return the html to the variable 'page'
    page = urllib2.urlopen(pg)
    # parse the html using beautiful soup and store in variable `soup`
    soup = BeautifulSoup(page, 'html.parser')
    # Take out the <h1> of name and get its value
    name_box = soup.find('h1', attrs={'class': 'name'})
    name = name_box.text.strip() # strip() is used to remove leading and trailing whitespace
    # get the index price
    price_box = soup.find('div', attrs={'class':'price'})
    price = price_box.text
    # save the data in tuple
    data.append((name, price))
Also, modify the saving section to save data row by row.
# open a csv file with append, so old data will not be erased
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    # The for loop
    for name, price in data:
        writer.writerow([name, price, datetime.now()])
Rerun the program and you should be able to extract two indices at the same time!
Advanced Scraping Techniques
BeautifulSoup is simple and great for small-scale web scraping. But if you are interested in scraping data at a larger scale, you should consider using these other alternatives:
- Scrapy, a powerful Python scraping framework.
- Try to integrate your code with some public APIs. The efficiency of data retrieval is much higher than scraping webpages. For example, take a look at Facebook Graph API, which can help you get hidden data which is not shown on Facebook webpages.
- Consider using a database backend like MySQL to store your data when it gets too large.
Adopt the DRY Method
DRY stands for “Don’t Repeat Yourself”. Try to automate your everyday tasks. Some other fun projects to consider might be keeping track of your Facebook friends’ active time (with their consent of course), or grabbing a list of topics in a forum and trying out natural language processing (which is a hot topic for Artificial Intelligence right now)!
If you have any questions, please feel free to leave a comment below.
Beginner’s guide to Web Scraping in Python using BeautifulSoup
This article was originally published on Altitude Labs’ blog and was written by our software engineer, Leonard Mok. Altitude Labs is a software agency that specializes in personalized, mobile-first React apps.
What is the difference beautifulsoup and bs4
I’m new to Python and I tried to parse some XML files in order to add some new tags and store the new XML file.
python-beautifulsoup seems to be the right package for that. Searching the web for tutorials on how to add a new tag to XML parsed by BeautifulSoup, I found out that the package python-bs4 is used.
Looking at the package description, both packages have the same title:
python-bs4 – error-tolerant HTML parser for Python
python-beautifulsoup – error-tolerant HTML parser for Python
So my question: what is the difference?
Scrapy VS Beautiful Soup: A Comparison Of Web Crawling Tools
One of the most critical assets for data-driven organisations is the set of tools used by their data science professionals. Web crawlers and other web scraping tools are among the tools used to gain meaningful insights. Web scraping allows efficient extraction of data from several web services and helps in converting raw and unstructured data into a structured whole.
There are several tools available for web scraping, such as lxml, BeautifulSoup, MechanicalSoup, Scrapy, Python Requests and others. Among these, Scrapy and Beautiful Soup are popular among developers.
In this article, we will compare these two web scraping tools, and try to understand the differences between them. Before diving deep into the tools, let us first understand what these tools are.
Scrapy
Scrapy is an open-source and collaborative framework for extracting the data you need from websites in a fast and simple manner. This tool can be used for extracting data using APIs. It can also be used as a general-purpose web crawler. Thus, Scrapy is an application framework, which can be used for writing web spiders that crawl websites and extract data from them.
The framework provides a built-in mechanism for extracting data – known as selectors – and can be used for data mining, automated testing, etc. Scrapy is supported under Python 3.5+ under CPython and PyPy starting with PyPy 5.9.
Features of Scrapy:
Scrapy provides built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions
An interactive shell console for trying out the CSS and XPath expressions to scrape data
Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem)
Scraping With Scrapy
Using pip
If you just want to install scrapy globally in your system, you can install scrapy library using the python package ‘pip’. Open your terminal or command prompt and type the following command.
pip install scrapy
Using Conda
If you want Scrapy in your conda environment, run the following command in your terminal:
conda install -c conda-forge scrapy
The Scrapy shell lets you scrape web pages interactively from the command line.
To open the Scrapy shell, type scrapy shell in your terminal.
Scraping with Scrapy Shell
Follow the steps below to start scraping :
1. Open the html file in a web browser and copy the url.
2. Now in the scrapy shell type and execute the following command:
fetch("url")
Replace url with the URL of the HTML file or any webpage, and the fetch command will download the page to your system.
You will see a message similar to this in your console:
[scrapy.core.engine] DEBUG: Crawled (200)
3. Viewing the response
The fetch command stores whatever page or information it fetched in a response object. To view the response, enter the following command.
view(response)
The console returns True, and the webpage that was downloaded with fetch() opens in your default browser.
4. All the data you need is now available locally. You just need to decide which parts of it you want.
5. Scraping the data: Back in the console, you can print the raw HTML behind the webpage that was fetched earlier. Enter the following command:
print(response.text)
Beautiful Soup
Beautiful Soup is one of the most popular Python libraries which helps in parsing HTML or XML documents into a tree structure to find and extract data. This tool features a simple, Pythonic interface and automatic encoding conversion to make it easy to work with website data.
This library provides simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree, and automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
Features of Beautiful Soup:
This Python library provides a few simple methods, as well as Pythonic idioms for navigating, searching, and modifying a parse tree
The library automatically converts incoming and outgoing documents to Unicode and UTF-8, respectively
This library sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility
Scraping With Beautifulsoup
Installing Beautiful Soup 4
The Beautiful Soup library can be installed using pip with a very simple command, and it is available on almost all platforms. Here is a way to install it using Jupyter Notebook.
The library can then be imported, and the parsed document assigned to an object, with the following code.
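The original notebook cells did not survive in this copy of the article. As a minimal sketch, assuming the package was installed with pip install beautifulsoup4 (in a Jupyter cell the command is usually prefixed with ! or %), the import and object creation might look like the following. The sample markup string is a made-up placeholder, not the article's original.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
markup = "<html><head><title>Sample page</title></head><body><p>Hello</p></body></html>"

# Parse with the standard-library parser so no extra install is needed.
soup = BeautifulSoup(markup, "html.parser")

print(type(soup).__name__)
print(soup.title.string)
```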
Getting Started
We will be using this basic, and default, HTML doc to parse the data using Beautiful Soup.
The following code will expand HTML into its hierarchy:
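The HTML document and the code that expanded it are missing from this copy. A sketch of the idea, assuming a small hand-written document (hypothetical, not the one the article used): prettify() returns the markup with one tag or string per line, indented by depth.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article's sample HTML document.
html_doc = """<html><head><title>Sample page</title></head>
<body><p class="intro">Hello, <b>world</b></p></body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns a string showing the document's hierarchy.
pretty = soup.prettify()
print(pretty)
```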
Exploring The Parse Tree
To navigate through the tree, we can use the following commands:
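The navigation commands themselves are missing from this copy. The sketch below shows a few common ones on a hypothetical document (the markup, ids, and URLs are assumptions for illustration): attribute-style access returns the first matching tag, while find_all() and find() search the whole tree.

```python
from bs4 import BeautifulSoup

# Hypothetical document; names and URLs are placeholders.
html_doc = """<html><head><title>Sample page</title></head>
<body><p class="intro">Hello, <b>world</b></p>
<a href="http://example.com/one" id="link1">One</a>
<a href="http://example.com/two" id="link2">Two</a></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string          # text inside <title>
parent_name = soup.title.parent.name    # name of the enclosing tag
first_p_class = soup.p["class"]         # class is multi-valued, so a list
hrefs = [a.get("href") for a in soup.find_all("a")]  # every link target
second_link = soup.find(id="link2")     # look a tag up by its id

print(title_text, parent_name, first_p_class, hrefs, second_link)
```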
Beautiful Soup has many attributes which can be accessed and edited. This extracted parsed data can be saved onto a text file.
Scrapy VS Beautiful Soup
Structure
Scrapy is an open-source framework, whereas Beautiful Soup is a Python library designed for quick-turnaround projects like screen-scraping. A framework inverts the control of the program: it calls the developer's code and dictates what the developer must supply. With a library, the developer calls the library where and when they need it.
Performance
Scrapy generally performs faster than Beautiful Soup on large crawls. It is a complete framework that schedules requests asynchronously and has built-in support for selecting and extracting data from various sources and for generating feed exports in multiple formats. Beautiful Soup only parses documents that something else has fetched, so a Beautiful Soup scraper can be sped up with multithreading.
Extensibility
Beautiful Soup works best on smaller projects. Scrapy, on the other hand, may be the better choice for larger and more complex projects, as the framework lets developers add custom functionality and build pipelines with flexibility and speed.
Beginner-Friendly
For a beginner trying web scraping for the first time, Beautiful Soup is the best choice to start with. Scrapy can also be used for scraping, but it is comparatively more complex.
Community
The developer community of Scrapy is stronger and larger than that of Beautiful Soup. Also, developers can use Beautiful Soup for parsing HTML responses in Scrapy callbacks by feeding the response's body into a BeautifulSoup object and extracting whatever data they need from it.
Beautiful Soup Documentation — Beautiful Soup 4.4.0 documentation
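Quick Start
Here's an HTML document I'll be using as an example throughout this document. It's part of a story from Alice in Wonderland:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Running the "three sisters" document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
Here are some simple ways to navigate that data structure:
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# u'title'
soup.title.string
# u'The Dormouse's story'
soup.title.parent.name
# u'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# u'title'
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
One common task is extracting all the URLs found within a page's <a> tags:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
Another common task is extracting all the text from a page:
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
Does this look like what you need? If so, read on.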
Installing Beautiful Soup¶ If you’re using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager: $ apt-get install python-bs4 (for Python 2) $ apt-get install python3-bs4 (for Python 3) Beautiful Soup 4 is published through PyPi, so if you can’t install it with the system packager, you can install it with easy_install or pip . The package name is beautifulsoup4 , and the same package works on Python 2 and Python 3. Make sure you use the right version of pip or easy_install for your Python version (these may be named pip3 and easy_install3 respectively if you’re using Python 3). $ easy_install beautifulsoup4 $ pip install beautifulsoup4 (The BeautifulSoup package is probably not what you want. That’s the previous major release, Beautiful Soup 3. Lots of software uses BS3, so it’s still available, but if you’re writing new code you should install beautifulsoup4 .) If you don’t have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py . $ python setup.py install If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application’s codebase, and use Beautiful Soup without installing it at all. I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it should work with other recent versions. Problems after installation¶ Beautiful Soup is packaged as Python 2 code. When you install it for use with Python 3, it’s automatically converted to Python 3 code. If you don’t install the package, the code won’t be converted. There have also been reports on Windows machines of the wrong version being installed. If you get the ImportError “No module named HTMLParser”, your problem is that you’re running the Python 2 version of the code under Python 3. 
If you get the ImportError "No module named html.parser", your problem is that you're running the Python 3 version of the code under Python 2. In both cases, your best bet is to completely remove the Beautiful Soup installation from your system (including any directory created when you unzipped the tarball) and try the installation again.
If you get the SyntaxError "Invalid syntax" on the line ROOT_TAG_NAME = u'[document]', you need to convert the Python 2 code to Python 3. You can do this either by installing the package:
$ python3 setup.py install
or by manually running Python's 2to3 conversion script on the bs4 directory:
$ 2to3-3.2 -w bs4
Installing a parser
Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:
$ apt-get install python-lxml
$ easy_install lxml
$ pip install lxml
Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:
$ apt-get install python-html5lib
$ easy_install html5lib
$ pip install html5lib
This table summarizes the advantages and disadvantages of each parser library:
Parser: Python's html.parser
Typical usage: BeautifulSoup(markup, "html.parser")
Advantages: Batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2)
Disadvantages: Not as fast as lxml, less lenient than html5lib
Parser: lxml's HTML parser
Typical usage: BeautifulSoup(markup, "lxml")
Advantages: Very fast; lenient
Disadvantages: External C dependency
Parser: lxml's XML parser
Typical usage: BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
Advantages: Very fast; the only currently supported XML parser
Disadvantages: External C dependency
Parser: html5lib
Typical usage: BeautifulSoup(markup, "html5lib")
Advantages: Extremely lenient; parses pages the same way a web browser does; creates valid HTML5
Disadvantages: Very slow; external Python dependency
If you can, I recommend you install and use lxml for speed. If you're using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it's essential that you install lxml or html5lib, because Python's built-in HTML parser is just not very good in older versions.
Note that if a document is invalid, different parsers will generate different Beautiful Soup trees for it. See Differences between parsers for details.
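Because the available parsers depend on what is installed, one way to act on the comparison above is to try the preferred parser first and fall back to the built-in one. The helper below is a sketch under that assumption: make_soup and its preferred ordering are hypothetical names chosen for illustration, not part of Beautiful Soup. It relies on BeautifulSoup raising bs4.FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper: returns (soup, name_of_parser_used).
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")

# Invalid markup: each parser may build a different tree for it.
soup, parser_used = make_soup("<a></p>")
print(parser_used, soup)
```

Since html.parser ships with Python, the fallback chain always ends in a parser that exists; the printed tree will differ depending on which parser was picked.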
Making the soup
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:
from bs4 import BeautifulSoup
with open("index.html") as fp:
    soup = BeautifulSoup(fp)
soup = BeautifulSoup("<html>data</html>")
First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:
BeautifulSoup("Sacr&eacute; bleu!")
# <html><head></head><body>Sacré bleu!</body></html>
Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See Parsing XML.)
Kinds of objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you'll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
Tag
A Tag object corresponds to an XML or HTML tag in the original document:
soup = BeautifulSoup('<b id="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
Tags have a lot of attributes and methods, and I'll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.
Name
Every tag has a name, accessible as .name:
tag.name
# u'b'
If you change a tag's name, the change will be reflected in any HTML markup generated by Beautiful Soup:
tag.name = "blockquote"
tag
# <blockquote id="boldest">Extremely bold</blockquote>
Attributes
A tag may have any number of attributes. The tag <b id="boldest"> has an attribute "id" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary:
tag['id']
# u'boldest'
You can access that dictionary directly as .attrs:
tag.attrs
# {u'id': 'boldest'}
You can add, remove, and modify a tag's attributes. Again, this is done by treating the tag as a dictionary:
tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag
# <blockquote another-attribute="1" id="verybold">Extremely bold</blockquote>
del tag['id']
del tag['another-attribute']
tag
# <blockquote>Extremely bold</blockquote>
tag['id']
# KeyError: 'id'
print(tag.get('id'))
# None
Multi-valued attributes
HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:
css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]
If an attribute looks like it has more than one value, but it's not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:
id_soup = BeautifulSoup('<p id="my id"></p>')
id_soup.p['id']
# 'my id'
When you turn a tag back into a string, multiple attribute values are consolidated:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
rel_soup.a['rel']
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>
You can disable this by passing multi_valued_attributes=None as a keyword argument into the BeautifulSoup constructor:
no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html', multi_valued_attributes=None)
no_list_soup.p['class']
# u'body strikeout'
You can use get_attribute_list to get a value that's always a list, whether or not it's a multi-valued attribute:
id_soup.p.get_attribute_list('id')
# ["my id"]
If you parse a document as XML, there are no multi-valued attributes:
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
# u'body strikeout'
Again, you can configure this using the multi_valued_attributes argument:
class_is_multi = {'*': 'class'}
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
xml_soup.p['class']
# [u'body', u'strikeout']
You probably won't need to do this, but if you do, use the defaults as a guide. They implement the rules described in the HTML specification:
from bs4.builder import builder_registry
builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES
NavigableString
A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:
tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a Unicode string with unicode():
unicode_string = unicode(tag.string)
unicode_string
# u'Extremely bold'
type(unicode_string)
# <type 'unicode'>
You can't edit a string in place, but you can replace one string with another, using replace_with():
tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>
NavigableString supports most of the features described in Navigating the tree and Searching the tree, but not all of them. In particular, since a string can't contain anything (the way a tag may contain a string or another tag), strings don't support the .contents or .string attributes, or the find() method.
If you want to use a NavigableString outside of Beautiful Soup, you should call unicode() on it to turn it into a normal Python Unicode string. If you don't, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you're done using Beautiful Soup. This is a big waste of memory.
BeautifulSoup
The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.
You can also pass a BeautifulSoup object into one of the methods defined in Modifying the tree, just as you would a Tag. This lets you do things like combine two parsed documents:
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
doc.find(text="INSERT FOOTER HERE").replace_with(footer)
# u'INSERT FOOTER HERE'
print(doc)
# <?xml version="1.0" encoding="utf-8"?>
# <document><content/><footer>Here's the footer</footer></document>
Since the BeautifulSoup object doesn't correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it's useful to look at its .name, so it's been given the special .name "[document]":
soup.name
# u'[document]'
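The four kinds of objects can be seen side by side in a short Python 3 sketch. The markup below is a hypothetical example written for this illustration, and str() is used in place of the Python 2 unicode() call shown above.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup containing a tag, a string, and a comment.
markup = '<p class="body strikeout" id="my id">Text<!--a note--></p>'
soup = BeautifulSoup(markup, "html.parser")

p = soup.p
text, note = p.contents  # one NavigableString and one Comment

kinds = [type(obj).__name__ for obj in (soup, p, text, note)]
print(kinds)

# class is multi-valued (a list); id is not (a plain string).
print(p["class"], p["id"], p.get_attribute_list("id"))

# str() gives a plain string that no longer references the parse tree.
plain = str(text)
print(type(plain).__name__, soup.name)
```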
Navigating the tree
Here's the "Three sisters" HTML document again:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
I'll use this as an example to show you how to move from one part of a document to another.
Going down
Tags may contain strings and other tags. These elements are the tag's children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag's children.
Note that Beautiful Soup strings don't support any of these attributes, because a string can't have children.
Navigating using tag names
The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <head>
tag, just say soup.head : soup . head #The Dormouse’s story soup . title #The Dormouse’s story You can do use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first tag beneath the tag: soup . body . b # The Dormouse’s story Using a tag name as an attribute will give you only the first tag by that name: soup . a # Elsie If you need to get all the tags, or anything more complicated than the first tag with a certain name, you’ll need to use one of the methods described in Searching the tree, such as find_all() : soup . find_all ( ‘a’ ) # [Elsie, # Lacie, # Tillie] .contents and .children ¶ A tag’s children are available in a list called .contents : head_tag = soup . head head_tag #The Dormouse’s story head_tag . contents [ < title > The Dormouse ‘s story] title_tag = head_tag . contents [ 0 ] title_tag #The Dormouse’s story title_tag . contents # [u’The Dormouse’s story’] The BeautifulSoup object itself has children. In this case, the tag is the child of the BeautifulSoup object.: len ( soup . contents ) # 1 soup . contents [ 0 ] . name # u’html’ A string does not have .contents , because it can’t contain anything: text = title_tag . contents [ 0 ] text . contents # AttributeError: ‘NavigableString’ object has no attribute ‘contents’ Instead of getting them as a list, you can iterate over a tag’s children using the .children generator: for child in title_tag . children : print ( child ) # The Dormouse’s story .descendants ¶ The .contents and .children attributes only consider a tag’s direct children. For instance, the tag has a single direct child–thetag: head_tag . contents # [ ” ) print ( sibling_soup . prettify ()) # # # # # text1 # #The Dormouse’s story ] But thetag itself has a child: the string “The Dormouse’s story”. There’s a sense in which that string is also a child of the tag. 
The .descendants attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on: for child in head_tag . descendants : print ( child ) # The Dormouse’s story # The Dormouse’s story The tag has only one child, but it has two descendants: thetag and the tag’s child. The BeautifulSoup object only has one direct child (the tag), but it has a whole lot of descendants: len ( list ( soup . children )) # 1 len ( list ( soup . descendants )) # 25 .string ¶ If a tag has only one child, and that child is a NavigableString , the child is made available as .string : title_tag . string # u’The Dormouse’s story’ If a tag’s only child is another tag, and that tag has a .string , then the parent tag is considered to have the same .string as its child: head_tag . contents # [ The Dormouse’s story ] head_tag . string # u’The Dormouse’s story’ If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None : print ( soup . html . string ) # None .strings and stripped_strings ¶ If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator: for string in soup . strings : print ( repr ( string )) # u”The Dormouse’s story” # u’‘ # u”The Dormouse’s story” # u’
‘ # u’Once upon a time there were three little sisters; and their names were
‘ # u’Elsie’ # u’,
‘ # u’Lacie’ # u’ and
‘ # u’Tillie’ # u’;
and they lived at the bottom of a well.’ # u’
‘ # u’…’ # u’
‘ These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead: for string in soup . stripped_strings : print ( repr ( string )) # u”The Dormouse’s story” # u”The Dormouse’s story” # u’Once upon a time there were three little sisters; and their names were’ # u’Elsie’ # u’,’ # u’Lacie’ # u’and’ # u’Tillie’ # u’;
and they lived at the bottom of a well.’ # u’…’ Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of strings is removed. Going up¶ Continuing the “family tree” analogy, every tag and every string has a parent : the tag that contains it. .parent ¶ You can access an element’s parent with the .parent attribute. In the example “three sisters” document, the
tag is the parent of thetag: title_tag = soup . title title_tag # The Dormouse’s story title_tag . parent #The Dormouse’s story The title string itself has a parent: thetag that contains it: title_tag . string . parent # The Dormouse’s story The parent of a top-level tag like is the BeautifulSoup object itself: html_tag = soup . html type ( html_tag . parent ) #And the .parent of a BeautifulSoup object is defined as None: print ( soup . parent ) # None .parents ¶ You can iterate over all of an element’s parents with .parents . This example uses .parents to travel from an tag buried deep within the document, to the very top of the document: link = soup . a link # Elsie for parent in link . parents : if parent is None : print ( parent ) else : print ( parent . name ) # p # body # html # [document] # None Going sideways¶ Consider a simple document like this: sibling_soup = BeautifulSoup ( “text1 text2 # text2 # # # # The tag and thetag are at the same level: they’re both direct children of the same tag. We call them siblings . When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write. .next_sibling and .previous_sibling ¶ You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree: sibling_soup . b . next_sibling # soup . b . contents # [Don’t, u’ you’,text2 sibling_soup . c . previous_sibling # text1 The tag has a .next_sibling , but no .previous_sibling , because there’s nothing before the tag on the same level of the tree . For the same reason, thetag has a .previous_sibling but no .next_sibling : print ( sibling_soup . b . previous_sibling ) # None print ( sibling_soup . c . next_sibling ) # None The strings “text1” and “text2” are not siblings, because they don’t have the same parent: sibling_soup . b . string # u’text1′ print ( sibling_soup . b . string . 
next_sibling ) # None In real documents, the .next_sibling or .previous_sibling of a tag will usually be a string containing whitespace. Going back to the “three sisters” document: < a href = "http://example.com/elsie" class = "sister" id = "link1" > Elsie a > < a href = "http://example.com/lacie" class = "sister" id = "link2" > Lacie a > < a href = "http://example.com/tillie" class = "sister" id = "link3" > Tillie a > You might think that the .next_sibling of the first tag would be the second tag. But actually, it’s a string: the comma and newline that separate the first tag from the second: link = soup . a link # Elsie link . next_sibling # u’, ‘ The second tag is actually the .next_sibling of the comma: link . next_sibling . next_sibling # Lacie .next_siblings and .previous_siblings ¶ You can iterate over a tag’s siblings with .next_siblings or .previous_siblings : for sibling in soup . a . next_siblings : print ( repr ( sibling )) # u’,
‘ # Lacie # u’ and
‘ # Tillie # u’; and they lived at the bottom of a well.’ # None for sibling in soup . find ( id = “link3” ) . previous_siblings : print ( repr ( sibling )) # ‘ and
‘ # Lacie # u’,
‘ # Elsie # u’Once upon a time there were three little sisters; and their names were
‘ # None Going back and forth¶ Take a look at the beginning of the “three sisters” document: < html >< head >< title > The Dormouse ‘s story < p class = "title" >< b > The Dormouse ‘s story
An HTML parser takes this string of characters and turns it into a series of events: “open an tag”, “open a
tag”, “open atag”, “add a string”, “close the FooBar soup . a . contents # [u’Foo’, u’Bar’] extend() ¶ Starting in Beautiful Soup 4.7.0, Tag also supports a method called .extend() , which works just like calling .extend() on a Python list: soup = BeautifulSoup ( “Soup” ) soup . a . extend ([ “‘s” , ” ” , “on” ]) soup # Soup’s on soup . a . contents # [u’Soup’, u”s’, u’ ‘, u’on’] NavigableString() and .new_tag() ¶ If you need to add a string to a document, no problem–you can pass a Python string in to append() , or you can call the NavigableString constructor: soup = BeautifulSoup ( “” ) tag = soup . b tag . append ( “Hello” ) new_string = NavigableString ( ” there” ) tag . append ( new_string ) tag # Hello there. tag . contents # [u’Hello’, u’ there’] If you want to create a comment or some other subclass of NavigableString , just call the constructor: from bs4 import Comment new_comment = Comment ( “Nice to see you.” ) tag . append ( new_comment ) tag # Hello there tag . contents # [u’Hello’, u’ there’, u’Nice to see you.’] (This is a new feature in Beautiful Soup 4.4.0.) What if you need to create a whole new tag? The best solution is to call the factory method BeautifulSoup.new_tag() : soup = BeautifulSoup ( “” ) original_tag = soup . b new_tag = soup . new_tag ( “a” , href = “http://www.example.com” ) original_tag . append ( new_tag ) original_tag # new_tag . string = “Link text.” original_tag # Link text. Only the first argument, the tag name, is required. insert() ¶ Tag.insert() is just like Tag.append() , except the new element doesn’t necessarily go at the end of its parent’s .contents . It’ll be inserted at whatever numeric position you say. It works just like .insert() on a Python list: markup = ‘I linked to example.com‘ soup = BeautifulSoup ( markup ) tag = soup . a tag . insert ( 1 , “but did not endorse ” ) tag # I linked to but did not endorse example.com tag . 
contents # [u’I linked to ‘, u’but did not endorse’, example.com] insert_before() and insert_after() ¶ The insert_before() method inserts tags or strings immediately before something else in the parse tree: soup = BeautifulSoup ( “stop” ) tag = soup . new_tag ( “i” ) tag . string = “Don’t” soup . b . string . insert_before ( tag ) soup . b # Don’tstop The insert_after() method inserts tags or strings immediately following something else in the parse tree: div = soup . new_tag ( ‘div’ ) div . string = ‘ever’ soup . b . i . insert_after ( ” you ” , div ) soup . b # Don’t youtag”, “open a tag”, and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document. .next_element and .previous_element ¶ The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as .next_sibling , but it’s usually drastically different. Here’s the final tag in the “three sisters” document. Its .next_sibling is a string: the conclusion of the sentence that was interrupted by the start of the tag.: last_a_tag = soup . find ( “a” , id = “link3” ) last_a_tag # Tillie last_a_tag . next_sibling # ‘; and they lived at the bottom of a well.’ But the .next_element of that tag, the thing that was parsed immediately after the tag, is not the rest of that sentence: it’s the word “Tillie”: last_a_tag . next_element # u’Tillie’ That’s because in the original markup, the word “Tillie” appeared before that semicolon. The parser encountered an tag, then the word “Tillie”, then the closing tag, then the semicolon and rest of the sentence. The semicolon is on the same level as the tag, but the word “Tillie” was encountered first. The .previous_element attribute is the exact opposite of .next_element . It points to whatever element was parsed immediately before this one: last_a_tag . previous_element # u’ and
‘ last_a_tag . previous_element . next_element # Tillie .next_elements and .previous_elements ¶ You should get the idea by now. You can use these iterators to move forward or backward in the document as it was parsed: for element in last_a_tag . next_elements : print ( repr ( element )) # u’Tillie’ # u’;
and they lived at the bottom of a well.’ # u’
‘ #
…
# u’…’ # u’
‘ # None
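The navigation attributes described in this section can be tried together in one short script. This is a sketch against a trimmed version of the "three sisters" document, parsed with html.parser under Python 3; the variable names are chosen here for illustration.

```python
from bs4 import BeautifulSoup, Tag

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

link = soup.a

# Going up: walk from the first <a> tag to the top of the document.
ancestors = [parent.name for parent in link.parents]

# Going sideways: siblings include strings, so keep only the tags.
sibling_ids = [s["id"] for s in link.next_siblings if isinstance(s, Tag)]

# Sibling order versus parse order for the last <a> tag.
last_a = soup.find("a", id="link3")
after_by_sibling = last_a.next_sibling   # the rest of the sentence
after_by_parse = last_a.next_element     # the string inside the tag

# Going down: all strings under <p>, with whitespace stripped.
words = list(soup.p.stripped_strings)

print(ancestors, sibling_ids, repr(after_by_sibling), repr(after_by_parse))
```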
Modifying the tree
Beautiful Soup's main strength is in searching the parse tree, but you can also modify the tree and write your changes as a new HTML or XML document.
Changing tag names and attributes
I covered this earlier, in Attributes, but it bears repeating. You can rename a tag, change the values of its attributes, add new attributes, and delete attributes:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>
del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>
Modifying .string

If you set a tag’s .string attribute to a new string, the tag’s contents are replaced with that string:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.a
    tag.string = "New link text."
    tag
    # <a href="http://example.com/">New link text.</a>

Be careful: if the tag contained other tags, they and all their contents will be destroyed.

append()

You can add to a tag’s contents with Tag.append(). It works just like calling .append() on a Python list:

    soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
    soup.a.append("Bar")
    soup
    # <a>FooBar</a>

clear()

Tag.clear() removes the contents of a tag:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.a
    tag.clear()
    tag
    # <a href="http://example.com/"></a>

extract()

PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    a_tag = soup.a
    i_tag = soup.i.extract()
    a_tag
    # <a href="http://example.com/">I linked to </a>
    i_tag
    # <i>example.com</i>
    print(i_tag.parent)
    # None

At this point you effectively have two parse trees: one rooted at the BeautifulSoup object you used to parse the document, and one rooted at the tag that was extracted. You can go on to call extract() on a child of the element you extracted:

    my_string = i_tag.string.extract()
    my_string
    # 'example.com'
    print(my_string.parent)
    # None
    i_tag
    # <i></i>

decompose()

Tag.decompose() removes a tag from the tree, then completely destroys it and its contents:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    a_tag = soup.a
    soup.i.decompose()
    a_tag
    # <a href="http://example.com/">I linked to </a>

replace_with()

PageElement.replace_with() removes a tag or string from the tree, and replaces it with the tag or string of your choice:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    a_tag = soup.a
    new_tag = soup.new_tag("b")
    new_tag.string = "example.net"
    a_tag.i.replace_with(new_tag)
    a_tag
    # <a href="http://example.com/">I linked to <b>example.net</b></a>

replace_with() returns the tag or string that was replaced, so that you can examine it or add it back to another part of the tree.

wrap()

PageElement.wrap() wraps an element in the tag you specify. It returns the new wrapper:

    soup = BeautifulSoup("<p>I wish I was bold.</p>", 'html.parser')
    soup.p.string.wrap(soup.new_tag("b"))
    # <b>I wish I was bold.</b>
    soup.p.wrap(soup.new_tag("div"))
    # <div><p><b>I wish I was bold.</b></p></div>

This method is new in Beautiful Soup 4.0.5.

unwrap()

Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever’s inside that tag. It’s good for stripping out markup:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    a_tag = soup.a
    a_tag.i.unwrap()
    a_tag
    # <a href="http://example.com/">I linked to example.com</a>

Like replace_with(), unwrap() returns the tag that was replaced.

smooth()

After calling a bunch of methods that modify the parse tree, you may end up with two or more NavigableString objects next to each other. Beautiful Soup doesn’t have any problems with this, but since it can’t happen in a freshly parsed document, you might not expect behavior like the following:

    soup = BeautifulSoup("<p>A one</p>", 'html.parser')
    soup.p.append(", a two")
    soup.p.contents
    # ['A one', ', a two']
    print(soup.p.encode())
    # b'<p>A one, a two</p>'
    print(soup.p.prettify())
    # <p>
    #  A one
    #  , a two
    # </p>

You can call Tag.smooth() to clean up the parse tree by consolidating adjacent strings:

    soup.smooth()
    soup.p.contents
    # ['A one, a two']
    print(soup.p.prettify())
    # <p>
    #  A one, a two
    # </p>

The smooth() method is new in Beautiful Soup 4.8.0.
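The tree-modification methods above are often chained. The following is a minimal sketch, assuming Beautiful Soup 4.8.0 or later and the built-in html.parser; the variable names are illustrative only. It replaces a tag, strips the replacement's markup with unwrap(), and then uses smooth() to merge the two adjacent strings that result.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a

# replace_with() swaps <i> for a new <b> tag and returns the removed element.
new_tag = soup.new_tag("b")
new_tag.string = "example.net"
removed = a_tag.i.replace_with(new_tag)
after_replace = str(a_tag)

# unwrap() strips the <b> markup but keeps its text, which leaves two
# adjacent NavigableString objects inside <a>.
a_tag.b.unwrap()
strings_before = len(a_tag.contents)

# smooth() merges adjacent strings into a single NavigableString.
soup.smooth()
strings_after = len(a_tag.contents)

print(after_replace)
print(removed, removed.parent)
print(strings_before, strings_after, a_tag)
```

The removed tag stays usable as the root of its own small tree, which is why its parent is None after the call.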
Output

Pretty-printing

The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    soup.prettify()
    # '<a href="http://example.com/">\n I linked to\n <i>\n  example.com\n </i>\n</a>'
    print(soup.prettify())
    # <a href="http://example.com/">
    #  I linked to
    #  <i>
    #   example.com
    #  </i>
    # </a>

You can call prettify() on the top-level BeautifulSoup object, or on any of its Tag objects:

    print(soup.i.prettify())
    # <i>
    #  example.com
    # </i>

Non-pretty printing

If you just want a string, with no fancy formatting, you can call str() on a BeautifulSoup object, or on a Tag within it:

    str(soup)
    # '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    str(soup.a)
    # '<a href="http://example.com/">I linked to <i>example.com</i></a>'

Under Python 3, str() returns a Unicode string. You can also call encode() to get a bytestring (UTF-8 by default; see Encodings for other options), and decode() to get Unicode.

Output formatters

If you give Beautiful Soup a document that contains HTML entities like “&ldquo;”, they’ll be converted to Unicode characters:

    soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", 'html.parser')
    str(soup)
    # '“Dammit!” he said.'

If you then convert the document to a bytestring, the Unicode characters will be encoded as UTF-8. You won’t get the HTML entities back:

    soup.encode("utf8")
    # b'\xe2\x80\x9cDammit!\xe2\x80\x9d he said.'

By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into “&amp;”, “&lt;”, and “&gt;”, so that Beautiful Soup doesn’t inadvertently generate invalid HTML or XML:

    soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>", 'html.parser')
    soup.p
    # <p>The law firm of Dewey, Cheatem, &amp; Howe</p>

    soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')
    soup.a
    # <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>

You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode(). Beautiful Soup recognizes five possible values for formatter.

The default is formatter="minimal". Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML:

    french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
    soup = BeautifulSoup(french, 'html.parser')
    print(soup.prettify(formatter="minimal"))
    # <p>
    #  Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
    # </p>

If you pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML entities whenever possible:

    print(soup.prettify(formatter="html"))
    # <p>
    #  Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
    # </p>

If you pass in formatter="html5", it’s the same as formatter="html", but Beautiful Soup will omit the closing slash in HTML void tags like “br”:

    br_soup = BeautifulSoup("<br>", 'html.parser')
    print(br_soup.encode(formatter="html"))
    # b'<br/>'
    print(br_soup.encode(formatter="html5"))
    # b'<br>'

If you pass in formatter=None, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples:

    print(soup.prettify(formatter=None))
    # <p>
    #  Il a dit <<Sacré bleu!>>
    # </p>

    link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')
    print(link_soup.a.encode(formatter=None))
    # b'<a href="http://example.com/?foo=val1&bar=val2">A link</a>'

If you need more sophisticated control over your output, you can use Beautiful Soup’s Formatter class. Here’s a formatter that converts strings to uppercase, whether they occur in a text node or in an attribute value:

    from bs4.formatter import HTMLFormatter

    def uppercase(text):
        return text.upper()

    formatter = HTMLFormatter(uppercase)

    print(soup.prettify(formatter=formatter))
    # <p>
    #  IL A DIT <<SACRÉ BLEU!>>
    # </p>

    print(link_soup.a.prettify(formatter=formatter))
    # <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2">
    #  A LINK
    # </a>

Subclassing HTMLFormatter or XMLFormatter will give you even more control over the output. For example, Beautiful Soup sorts the attributes in every tag by default:

    attr_soup = BeautifulSoup(b'<p z="1" m="2" a="3"></p>', 'html.parser')
    print(attr_soup.p.encode())
    # b'<p a="3" m="2" z="1"></p>'

To turn this off, you can override the Formatter.attributes() method in a subclass. That method controls which attributes are output and in what order. This implementation also filters out the attribute called “m” whenever it appears:

    class UnsortedAttributes(HTMLFormatter):
        def attributes(self, tag):
            for k, v in tag.attrs.items():
                if k == 'm':
                    continue
                yield k, v

    print(attr_soup.p.encode(formatter=UnsortedAttributes()))
    # b'<p z="1" a="3"></p>'
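To compare the formatter values side by side, here is a small sketch that renders one paragraph four ways. It assumes the built-in html.parser and uses decode() so the results are plain strings; the helper name shout is hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter

french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french, "html.parser")


def shout(text):
    """Hypothetical substitution function: upper-case every string."""
    return text.upper()


# "minimal" escapes only what is needed for valid markup.
minimal = soup.p.decode(formatter="minimal")
# "html" additionally turns characters such as é back into named entities.
as_html = soup.p.decode(formatter="html")
# None leaves strings untouched, so the output may be invalid HTML.
unescaped = soup.p.decode(formatter=None)
# A Formatter object applies a custom function to each string.
shouted = soup.p.decode(formatter=HTMLFormatter(entity_substitution=shout))

for rendered in (minimal, as_html, unescaped, shouted):
    print(rendered)
```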
One last caveat: if you create a CData object, the text inside that object is always presented exactly as it appears, with no formatting. Beautiful Soup will call your entity substitution function, just in case you’ve written a custom function that counts all the strings in the document or something, but it will ignore the return value:

    from bs4.element import CData
    soup = BeautifulSoup("<a></a>", 'html.parser')
    soup.a.string = CData("one < three")
    print(soup.a.prettify(formatter="xml"))
    # <a>
    #  <![CDATA[one < three]]>
    # </a>

get_text()

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:

    markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
    soup = BeautifulSoup(markup, 'html.parser')

    soup.get_text()
    # '\nI linked to example.com\n'
    soup.i.get_text()
    # 'example.com'

You can specify a string to be used to join the bits of text together:

    soup.get_text("|")
    # '\nI linked to |example.com|\n'

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

    soup.get_text("|", strip=True)
    # 'I linked to|example.com'

But at that point you might want to use the .stripped_strings generator instead, and process the text yourself:

    [text for text in soup.stripped_strings]
    # ['I linked to', 'example.com']
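As a quick check of the three text-extraction styles just described, this sketch runs them on the same markup. It assumes the built-in html.parser; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# Raw text keeps the whitespace that surrounded the tags.
raw_text = soup.get_text()

# A separator plus strip=True joins the trimmed, non-empty pieces.
joined = soup.get_text("|", strip=True)

# stripped_strings yields the same pieces for custom processing.
pieces = list(soup.stripped_strings)

print(repr(raw_text))
print(joined)
print(pieces)
```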
Specifying the parser to use

If you just need to parse some HTML, you can dump the markup into the BeautifulSoup constructor, and it’ll probably be fine. Beautiful Soup will pick a parser for you and parse the data. But there are a few additional arguments you can pass in to the constructor to change which parser is used.

The first argument to the BeautifulSoup constructor is a string or an open filehandle: the markup you want parsed. The second argument is how you’d like the markup parsed.

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser. You can override this by specifying one of the following:

What type of markup you want to parse. Currently supported are “html”, “xml”, and “html5”.
The name of the parser library you want to use. Currently supported options are “lxml”, “html5lib”, and “html.parser” (Python’s built-in HTML parser).

The section Installing a parser contrasts the supported parsers.

If you don’t have an appropriate parser installed, Beautiful Soup will ignore your request and pick a different parser. Right now, the only supported XML parser is lxml. If you don’t have lxml installed, asking for an XML parser won’t give you one, and asking for “lxml” won’t work either.

Differences between parsers

Beautiful Soup presents the same interface to a number of different parsers, but each parser is different. Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers. Here’s a short document, parsed as HTML using Python’s built-in parser:

    BeautifulSoup("<a><b /></a>", "html.parser")
    # <a><b></b></a>

Since an empty <b /> tag is not valid HTML, the parser turns it into a <b></b> tag pair.

Here’s the same document parsed as XML (running this requires that you have lxml installed). Note that the empty <b /> tag is left alone, and that the document is given an XML declaration instead of being put into an <html> tag:

    BeautifulSoup("<a><b /></a>", "xml")
    # <?xml version="1.0" encoding="utf-8"?>
    # <a><b/></a>

There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document.

But if the document is not perfectly-formed, different parsers will give different results. Here’s a short, invalid document parsed using lxml’s HTML parser. Note that the dangling </p> tag is simply ignored:

    BeautifulSoup("<a></p>", "lxml")
    # <html><body><a></a></body></html>

Here’s the same document parsed using html5lib:

    BeautifulSoup("<a></p>", "html5lib")
    # <html><head></head><body><a><p></p></a></body></html>

Instead of ignoring the dangling </p> tag, html5lib pairs it with an opening <p> tag. This parser also adds an empty <head> tag to the document.

Here’s the same document parsed with Python’s built-in HTML parser:

    BeautifulSoup("<a></p>", "html.parser")
    # <a></a>

Like lxml, this parser ignores the closing </p> tag. Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn’t even bother to add an <html> tag.

Since the document “<a></p>” is invalid, none of these techniques is the “correct” way to handle it. The html5lib parser uses techniques that are part of the HTML5 standard, so it has the best claim on being the “correct” way, but all three techniques are legitimate.

Differences between parsers can affect your script. If you’re planning on distributing your script to other people, or running it on multiple machines, you should specify a parser in the BeautifulSoup constructor. That will reduce the chances that your users parse a document differently from the way you parse it.
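Naming the parser explicitly makes the result reproducible across machines. The sketch below pins Python's built-in html.parser, which needs no extra install, and shows how it treats the two awkward documents discussed above. The lxml and html5lib results are not exercised here because those libraries may not be present.

```python
from bs4 import BeautifulSoup

# A self-closing <b /> is not valid HTML, so it becomes an open/close pair.
empty_tag = BeautifulSoup("<a><b /></a>", "html.parser")

# The dangling </p> is dropped, and no <html> or <body> wrapper is added.
dangling = BeautifulSoup("<a></p>", "html.parser")

print(empty_tag)
print(dangling)
```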
Encodings

Any HTML or XML document is written in a specific encoding like ASCII or UTF-8. But when you load that document into Beautiful Soup, you’ll discover it’s been converted to Unicode:

    markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"
    soup = BeautifulSoup(markup, 'html.parser')
    soup.h1
    # <h1>Sacré bleu!</h1>
    soup.h1.string
    # 'Sacr\xe9 bleu!'

It’s not magic. (That sure would be nice.) Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document’s encoding and convert it to Unicode. The autodetected encoding is available as the .original_encoding attribute of the BeautifulSoup object:

    soup.original_encoding
    # 'utf-8'

Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes. Sometimes it guesses correctly, but only after a byte-by-byte search of the document that takes a very long time. If you happen to know a document’s encoding ahead of time, you can avoid mistakes and delays by passing it to the BeautifulSoup constructor as from_encoding.

Here’s a document written in ISO-8859-8. The document is so short that Unicode, Dammit can’t get a lock on it, and misidentifies it as ISO-8859-7:

    markup = b"<h1>\xed\xe5\xec\xf9</h1>"
    soup = BeautifulSoup(markup, 'html.parser')
    soup.h1
    # <h1>νεμω</h1>
    soup.original_encoding
    # 'ISO-8859-7'

We can fix this by passing in the correct from_encoding:

    soup = BeautifulSoup(markup, 'html.parser', from_encoding="iso-8859-8")
    soup.h1
    # <h1>םולש</h1>
    soup.original_encoding
    # 'iso8859-8'

If you don’t know what the correct encoding is, but you know that Unicode, Dammit is guessing wrong, you can pass the wrong guesses in as exclude_encodings:

    soup = BeautifulSoup(markup, 'html.parser', exclude_encodings=["ISO-8859-7"])
    soup.h1
    # <h1>םולש</h1>
    soup.original_encoding
    # 'WINDOWS-1255'

Windows-1255 isn’t 100% correct, but that encoding is a compatible superset of ISO-8859-8, so it’s close enough. (exclude_encodings is a new feature in Beautiful Soup 4.4.0.)

In rare cases (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode may be to replace some characters with the special Unicode character “REPLACEMENT CHARACTER” (U+FFFD, �). If Unicode, Dammit needs to do this, it will set the .contains_replacement_characters attribute to True on the UnicodeDammit or BeautifulSoup object. This lets you know that the Unicode representation is not an exact representation of the original: some data was lost. If a document contains �, but .contains_replacement_characters is False, you’ll know that the � was there originally (as it is in this paragraph) and doesn’t stand in for missing data.

Output encoding

When you write out a document from Beautiful Soup, you get a UTF-8 document, even if the document wasn’t in UTF-8 to begin with. Here’s a document written in the Latin-1 encoding:

    markup = b'''
     <html>
      <head>
       <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
      </head>
      <body>
       <p>Sacr\xe9 bleu!</p>
      </body>
     </html>
    '''

    soup = BeautifulSoup(markup, 'html.parser')
    print(soup.prettify())
    # <html>
    #  <head>
    #   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
    #  </head>
    #  <body>
    #   <p>
    #    Sacré bleu!
    #   </p>
    #  </body>
    # </html>

Note that the <meta> tag has been rewritten to reflect the fact that the document is now in UTF-8.

If you don’t want UTF-8, you can pass an encoding into prettify():

    print(soup.prettify("latin-1"))
    # <html>
    #  <head>
    #   <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
    # ...

You can also call encode() on the BeautifulSoup object, or any element in the soup, just as if it were a Python string:

    soup.p.encode("latin-1")
    # b'<p>Sacr\xe9 bleu!</p>'
    soup.p.encode("utf-8")
    # b'<p>Sacr\xc3\xa9 bleu!</p>'

Any characters that can’t be represented in your chosen encoding will be converted into numeric XML entity references. Here’s a document that includes the Unicode character SNOWMAN:

    markup = "<b>\N{SNOWMAN}</b>"
    snowman_soup = BeautifulSoup(markup, 'html.parser')
    tag = snowman_soup.b

The SNOWMAN character can be part of a UTF-8 document (it looks like ☃), but there’s no representation for that character in ISO-Latin-1 or ASCII, so it’s converted into “&#9731;” for those encodings:

    print(tag.encode("utf-8"))
    # b'<b>\xe2\x98\x83</b>'
    print(tag.encode("latin-1"))
    # b'<b>&#9731;</b>'
    print(tag.encode("ascii"))
    # b'<b>&#9731;</b>'

Unicode, Dammit

You can use Unicode, Dammit without using Beautiful Soup. It’s useful whenever you have data in an unknown encoding and you just want it to become Unicode:

    from bs4 import UnicodeDammit
    dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
    print(dammit.unicode_markup)
    # Sacré bleu!
    dammit.original_encoding
    # 'utf-8'

Unicode, Dammit’s guesses will get a lot more accurate if you install the chardet or cchardet Python libraries. The more data you give Unicode, Dammit, the more accurately it will guess. If you have your own suspicions as to what the encoding might be, you can pass them in as a list:

    dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
    print(dammit.unicode_markup)
    # Sacré bleu!
    dammit.original_encoding
    # 'latin-1'

Unicode, Dammit has two special features that Beautiful Soup doesn’t use.

Smart quotes

You can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML entities:

    markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
    # '<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'

    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
    # '<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'

You can also convert Microsoft smart quotes to ASCII quotes:

    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
    # '<p>I just "love" Microsoft Word\'s smart quotes</p>'

Hopefully you’ll find this feature useful, but Beautiful Soup doesn’t use it. Beautiful Soup prefers the default behavior, which is to convert Microsoft smart quotes to Unicode characters along with everything else:

    UnicodeDammit(markup, ["windows-1252"]).unicode_markup
    # '<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>'

Inconsistent encodings

Sometimes a document is mostly in UTF-8, but contains Windows-1252 characters such as (again) Microsoft smart quotes. This can happen when a website includes data from multiple sources. You can use UnicodeDammit.detwingle() to turn such a document into pure UTF-8. Here’s a simple example:

    snowmen = "\N{SNOWMAN}" * 3
    quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"
    doc = snowmen.encode("utf8") + quote.encode("windows_1252")

This document is a mess. The snowmen are in UTF-8 and the quotes are in Windows-1252. You can display the snowmen or the quotes, but not both:

    print(doc.decode("utf8", errors="replace"))
    # ☃☃☃�I like snowmen!�

    print(doc.decode("windows-1252"))
    # â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”

Decoding the document strictly as UTF-8 raises a UnicodeDecodeError, and decoding it as Windows-1252 gives you gibberish. Fortunately, UnicodeDammit.detwingle() will convert the string to pure UTF-8, allowing you to decode it to Unicode and display the snowmen and quote marks simultaneously:

    new_doc = UnicodeDammit.detwingle(doc)
    print(new_doc.decode("utf8"))
    # ☃☃☃“I like snowmen!”

UnicodeDammit.detwingle() only knows how to handle Windows-1252 embedded in UTF-8 (or vice versa, I suppose), but this is the most common case.

Note that you must know to call UnicodeDammit.detwingle() on your data before passing it into BeautifulSoup or the UnicodeDammit constructor. Beautiful Soup assumes that a document has a single encoding, whatever it might be. If you pass it a document that contains both UTF-8 and Windows-1252, it’s likely to think the whole document is Windows-1252, and the document will come out looking like â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”.

UnicodeDammit.detwingle() is new in Beautiful Soup 4.1.0.
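The encoding features above can be exercised without relying on autodetection, which varies with the libraries installed. This sketch, assuming the built-in html.parser, passes a known input encoding, encodes a character that Latin-1 cannot represent, and repairs a mixed UTF-8/Windows-1252 byte string with detwingle().

```python
from bs4 import BeautifulSoup, UnicodeDammit

# 1. Tell Beautiful Soup the input encoding instead of letting it guess.
hebrew = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(hebrew, "html.parser", from_encoding="iso-8859-8")
heading = soup.h1.string

# 2. Characters missing from the output encoding become numeric references.
snowman_soup = BeautifulSoup("<b>\N{SNOWMAN}</b>", "html.parser")
latin1 = snowman_soup.b.encode("latin-1")

# 3. detwingle() rewrites embedded Windows-1252 bytes as UTF-8.
snowmen = "\N{SNOWMAN}" * 3
quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"
doc = snowmen.encode("utf8") + quote.encode("windows_1252")
fixed = UnicodeDammit.detwingle(doc).decode("utf8")

print(heading)
print(latin1)
print(fixed)
```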
Line numbers

The html.parser and html5lib parsers can keep track of where in the original document each Tag was found. You can access this information as Tag.sourceline (line number) and Tag.sourcepos (position of the start tag within a line):

    markup = "<p\n>Paragraph 1</p>\n    <p>Paragraph 2</p>"
    soup = BeautifulSoup(markup, 'html.parser')
    for tag in soup.find_all('p'):
        print(tag.sourceline, tag.sourcepos, tag.string)
    # 1 0 Paragraph 1
    # 3 4 Paragraph 2

Note that the two parsers mean slightly different things by sourceline and sourcepos. For html.parser, these numbers represent the position of the initial less-than sign. For html5lib, these numbers represent the position of the final greater-than sign:

    soup = BeautifulSoup(markup, 'html5lib')
    for tag in soup.find_all('p'):
        print(tag.sourceline, tag.sourcepos, tag.string)
    # 2 0 Paragraph 1
    # 3 6 Paragraph 2

You can shut off this feature by passing store_line_numbers=False into the BeautifulSoup constructor:

    markup = "<p\n>Paragraph 1</p>\n    <p>Paragraph 2</p>"
    soup = BeautifulSoup(markup, 'html.parser', store_line_numbers=False)
    soup.p.sourceline
    # None

This feature is new in 4.8.1, and the parsers based on lxml don’t support it.
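A short sketch of the position-tracking feature follows. It assumes Beautiful Soup 4.8.1 or later with the built-in html.parser, and uses a simplified two-line document (not the one from the text) so the positions are easy to verify by eye.

```python
from bs4 import BeautifulSoup

# Two paragraphs; the second starts on line 2 after two spaces of indent.
markup = "<p>Paragraph 1</p>\n  <p>Paragraph 2</p>"

soup = BeautifulSoup(markup, "html.parser")
# For html.parser, the position is that of each tag's opening "<".
positions = [(tag.sourceline, tag.sourcepos) for tag in soup.find_all("p")]

# With the feature switched off, no position is recorded.
untracked = BeautifulSoup(markup, "html.parser", store_line_numbers=False)
untracked_line = untracked.p.sourceline

print(positions)
print(untracked_line)
```

Line numbers are counted from 1 and column offsets from 0, which is worth remembering when mapping a tag back to an editor position.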
Comparing objects for equality

Beautiful Soup says that two NavigableString or Tag objects are equal when they represent the same HTML or XML markup. In this example, the two <b> tags are treated as equal, even though they live in different parts of the object tree, because they both look like “<b>pizza</b>”:

    markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
    soup = BeautifulSoup(markup, 'html.parser')
    first_b, second_b = soup.find_all('b')
    print(first_b == second_b)
    # True

    print(first_b.previous_element == second_b.previous_element)
    # False

If you want to see whether two variables refer to exactly the same object, use is:

    print(first_b is second_b)
    # False
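One practical consequence of markup-based equality is that membership tests match on markup too. The sketch below, assuming the built-in html.parser, contrasts == with is and shows an identity-based check for the cases where the exact node matters.

```python
from bs4 import BeautifulSoup

markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, "html.parser")
first_b, second_b = soup.find_all("b")

same_markup = first_b == second_b   # compares the markup each tag represents
same_object = first_b is second_b   # compares object identity

# "in" falls back to ==, so a different node with equal markup matches.
found_by_equality = second_b in [first_b]

# Comparing by identity avoids that when you need one specific node.
found_by_identity = any(tag is second_b for tag in [first_b])

print(same_markup, same_object, found_by_equality, found_by_identity)
```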
Copying Beautiful Soup objects

You can use copy.copy() to create a copy of any Tag or NavigableString:

    import copy
    p_copy = copy.copy(soup.p)
    print(p_copy)
    # <p>I want <b>pizza</b> and more <b>pizza</b>!</p>

The copy is considered equal to the original, since it represents the same markup as the original, but it’s not the same object:

    print(soup.p == p_copy)
    # True

    print(soup.p is p_copy)
    # False

The only real difference is that the copy is completely detached from the original Beautiful Soup object tree, just as if extract() had been called on it:

    print(p_copy.parent)
    # None

This is because two different Tag objects can’t occupy the same space at the same time.
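Because the copy is detached, it can be edited without touching the original tree. A minimal sketch, assuming the built-in html.parser; the replacement text "salad" is arbitrary.

```python
import copy

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I want <b>pizza</b> and more <b>pizza</b>!</p>", "html.parser")

p_copy = copy.copy(soup.p)
equal_before = soup.p == p_copy   # same markup, so equal
detached = p_copy.parent is None  # the copy lives outside the tree

# Changing the copy leaves the original paragraph as it was.
p_copy.b.string = "salad"
original_after = str(soup.p)
copy_after = str(p_copy)

print(equal_before, detached)
print(original_after)
print(copy_after)
```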
Parsing only part of a document

Let’s say you want to use Beautiful Soup to look at a document’s <a> tags. It’s a waste of time and memory to parse the entire document and then go over it again looking for <a> tags. It would be much faster to ignore everything that wasn’t an <a> tag in the first place. The SoupStrainer class allows you to choose which parts of an incoming document are parsed. You just create a SoupStrainer and pass it in to the BeautifulSoup constructor as the parse_only argument.

(Note that this feature won’t work if you’re using the html5lib parser. If you use html5lib, the whole document will be parsed, no matter what. This is because html5lib constantly rearranges the parse tree as it works, and if some part of the document didn’t actually make it into the parse tree, it’ll crash. To avoid confusion, in the examples below I’ll be forcing Beautiful Soup to use Python’s built-in parser.)

SoupStrainer

The SoupStrainer class takes the same arguments as a typical method from Searching the tree: name, attrs, string, and **kwargs. Here are three SoupStrainer objects:

    from bs4 import SoupStrainer

    only_a_tags = SoupStrainer("a")

    only_tags_with_id_link2 = SoupStrainer(id="link2")

    def is_short_string(string):
        return len(string) < 10

    only_short_strings = SoupStrainer(string=is_short_string)

I’m going to bring back the “three sisters” document one more time, and we’ll see what the document looks like when it’s parsed with these three SoupStrainer objects:

    html_doc = """<html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
    # <a class="sister" href="http://example.com/elsie" id="link1">
    #  Elsie
    # </a>
    # <a class="sister" href="http://example.com/lacie" id="link2">
    #  Lacie
    # </a>
    # <a class="sister" href="http://example.com/tillie" id="link3">
    #  Tillie
    # </a>

    print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
    # <a class="sister" href="http://example.com/lacie" id="link2">
    #  Lacie
    # </a>

    print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
    # Elsie
    # ,
    # Lacie
    # and
    # Tillie
    # ...
    #

You can also pass a SoupStrainer into any of the methods covered in Searching the tree. This probably isn’t terribly useful, but I thought I’d mention it:

    soup = BeautifulSoup(html_doc, 'html.parser')
    soup.find_all(only_short_strings)
    # ['\n\n', '\n\n', 'Elsie', ',\n', 'Lacie', ' and\n', 'Tillie',
    #  '\n\n', '...', '\n']
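To see what a strainer leaves out as well as what it keeps, this sketch parses a shortened stand-in for the "three sisters" paragraph (an assumption made to keep the example small) with two different SoupStrainer objects and the built-in html.parser.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Shortened stand-in for the "three sisters" document.
html_doc = """<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>"""

# Keep only <a> tags; the enclosing <p> never enters the tree.
links_only = BeautifulSoup(html_doc, "html.parser", parse_only=SoupStrainer("a"))
ids = [a["id"] for a in links_only.find_all("a")]
paragraph = links_only.find("p")

# Keep only the element whose id is "link2".
link2_only = BeautifulSoup(html_doc, "html.parser", parse_only=SoupStrainer(id="link2"))
names = [a.get_text() for a in link2_only.find_all("a")]

print(ids, paragraph, names)
```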
Troubleshooting

diagnose()

If you’re having trouble understanding what Beautiful Soup does to a document, pass the document into the diagnose() function. (New in Beautiful Soup 4.2.0.) Beautiful Soup will print out a report showing you how different parsers handle the document, and tell you if you’re missing a parser that Beautiful Soup could be using:

    from bs4.diagnose import diagnose
    with open("bad.html") as fp:
        data = fp.read()

    diagnose(data)
    # Diagnostic running on Beautiful Soup 4.2.0
    # Python version 2.7.3 (default, Aug 1 2012, 05:16:07)
    # I noticed that html5lib is not installed. Installing it may help.
    # Found lxml version 2.3.2.0
    #
    # Trying to parse your data with html.parser
    # Here's what html.parser did with the document:
    # ...

Just looking at the output of diagnose() may show you how to solve the problem. Even if not, you can paste the output of diagnose() when asking for help.

Errors when parsing a document

There are two different kinds of parse errors. There are crashes, where you feed a document to Beautiful Soup and it raises an exception, usually an HTMLParser.HTMLParseError. And there is unexpected behavior, where a Beautiful Soup parse tree looks a lot different than the document used to create it.

Almost none of these problems turn out to be problems with Beautiful Soup. This is not because Beautiful Soup is an amazingly well-written piece of software. It’s because Beautiful Soup doesn’t include any parsing code. Instead, it relies on external parsers. If one parser isn’t working on a certain document, the best solution is to try a different parser. See Installing a parser for details and a parser comparison.

The most common parse errors are HTMLParser.HTMLParseError: malformed start tag and HTMLParser.HTMLParseError: bad end tag. These are both generated by Python’s built-in HTML parser library, and the solution is to install lxml or html5lib.

The most common type of unexpected behavior is that you can’t find a tag that you know is in the document. You saw it going in, but find_all() returns [] or find() returns None. This is another common problem with Python’s built-in HTML parser, which sometimes skips tags it doesn’t understand. Again, the solution is to install lxml or html5lib.

Version mismatch problems

SyntaxError: Invalid syntax (on the line ROOT_TAG_NAME = u'[document]') - Caused by running the Python 2 version of Beautiful Soup under Python 3, without converting the code.
ImportError: No module named HTMLParser - Caused by running the Python 2 version of Beautiful Soup under Python 3.
ImportError: No module named html.parser - Caused by running the Python 3 version of Beautiful Soup under Python 2.
ImportError: No module named BeautifulSoup - Caused by running Beautiful Soup 3 code on a system that doesn’t have BS3 installed. Or, by writing Beautiful Soup 4 code without knowing that the package name has changed to bs4.
ImportError: No module named bs4 - Caused by running Beautiful Soup 4 code on a system that doesn’t have BS4 installed.

Parsing XML

By default, Beautiful Soup parses documents as HTML. To parse a document as XML, pass in “xml” as the second argument to the BeautifulSoup constructor:

    soup = BeautifulSoup(markup, "xml")

You’ll need to have lxml installed.

Other parser problems

If your script works on one computer but not another, or in one virtual environment but not another, or outside the virtual environment but not inside, it’s probably because the two environments have different parser libraries available. For example, you may have developed the script on a computer that has lxml installed, and then tried to run it on a computer that only has html5lib installed. See Differences between parsers for why this matters, and fix the problem by mentioning a specific parser library in the BeautifulSoup constructor.

Because HTML tags and attributes are case-insensitive, all three HTML parsers convert tag and attribute names to lowercase. That is, the markup <TAG></TAG> is converted to <tag></tag>. If you want to preserve mixed-case or uppercase tags and attributes, you’ll need to parse the document as XML.

Miscellaneous

UnicodeEncodeError: 'charmap' codec can't encode character '\xfoo' in position bar (or just about any other UnicodeEncodeError) - This is not a problem with Beautiful Soup. This problem shows up in two main situations. First, when you try to print a Unicode character that your console doesn’t know how to display. (See this page on the Python wiki for help.) Second, when you’re writing to a file and you pass in a Unicode character that’s not supported by your default encoding. In this case, the simplest solution is to explicitly encode the Unicode string into UTF-8 with u.encode("utf8").
KeyError: [attr] - Caused by accessing tag['attr'] when the tag in question doesn’t define the attr attribute. The most common errors are KeyError: 'href' and KeyError: 'class'. Use tag.get('attr') if you’re not sure attr is defined, just as you would with a Python dictionary.
AttributeError: 'ResultSet' object has no attribute 'foo' - This usually happens because you expected find_all() to return a single tag or string. But find_all() returns a list of tags and strings: a ResultSet object. You need to iterate over the list and look at the .foo of each one. Or, if you really only want one result, you need to use find() instead of find_all().
AttributeError: 'NoneType' object has no attribute 'foo' - This usually happens because you called find() and then tried to access the .foo attribute of the result. But in your case, find() didn’t find anything, so it returned None, instead of returning a tag or a string. You need to figure out why your find() call isn’t returning anything.

Improving Performance

Beautiful Soup will never be as fast as the parsers it sits on top of. If response time is critical, if you’re paying for computer time by the hour, or if there’s any other reason why computer time is more valuable than programmer time, you should forget about Beautiful Soup and work directly atop lxml.

That said, there are things you can do to speed up Beautiful Soup. If you’re not using lxml as the underlying parser, my advice is to start. Beautiful Soup parses documents significantly faster using lxml than using html.parser or html5lib.

You can speed up encoding detection significantly by installing the cchardet library.

Parsing only part of a document won’t save you much time parsing the document, but it can save a lot of memory, and it’ll make searching the document much faster.
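The KeyError and AttributeError cases listed under Miscellaneous can be avoided with a few defensive habits. The following sketch, assuming the built-in html.parser and a made-up one-line document, shows tag.get() for optional attributes, a None check after find(), and iteration over the result of find_all().

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a id="plain">no link here</a></p>', "html.parser")
anchor = soup.find("a")

# tag.get() returns None for a missing attribute; tag["href"] raises KeyError.
href = anchor.get("href")
try:
    anchor["href"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# find() returns None when nothing matches, so check before using the result.
table = soup.find("table")
table_text = table.get_text() if table is not None else ""

# find_all() returns a list-like ResultSet; read attributes per element.
ids = [tag.get("id") for tag in soup.find_all("a")]

print(href, raised_key_error, repr(table_text), ids)
```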
Translating this documentation¶ New translations of the Beautiful Soup documentation are greatly appreciated. Translations should be licensed under the MIT license, just like Beautiful Soup and its English documentation are. There are two ways of getting your translation into the main code base and onto the Beautiful Soup website: Create a branch of the Beautiful Soup repository, add your translation, and propose a merge with the main branch, the same as you would do with a proposed change to the source code. Send a message to the Beautiful Soup discussion group with a link to your translation, or attach your translation to the message. Use the Chinese or Brazilian Portuguese translations as your model. In particular, please translate the source file doc/source/index.rst , rather than the HTML version of the documentation. This makes it possible to publish the documentation in a variety of formats, not just HTML.
Beautiful Soup Documentation — Beautiful Soup 4.9.0 documentation
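Quick Start

Here's an HTML document I'll be using as an example throughout this document. It's part of a story from Alice in Wonderland:

    html_doc = """<html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

Running the "three sisters" document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')

    print(soup.prettify())
    # <html>
    #  <head>
    #   <title>
    #    The Dormouse's story
    #   </title>
    #  </head>
    #  <body>
    #   <p class="title">
    #    <b>
    #     The Dormouse's story
    #    </b>
    #   </p>
    #   <p class="story">
    #    Once upon a time there were three little sisters; and their names were
    #    <a class="sister" href="http://example.com/elsie" id="link1">
    #     Elsie
    #    </a>
    #    ,
    #    <a class="sister" href="http://example.com/lacie" id="link2">
    #     Lacie
    #    </a>
    #    and
    #    <a class="sister" href="http://example.com/tillie" id="link3">
    #     Tillie
    #    </a>
    #    ; and they lived at the bottom of a well.
    #   </p>
    #   <p class="story">
    #    ...
    #   </p>
    #  </body>
    # </html>

Here are some simple ways to navigate that data structure:

    soup.title
    # <title>The Dormouse's story</title>

    soup.title.name
    # 'title'

    soup.title.string
    # 'The Dormouse's story'

    soup.title.parent.name
    # 'head'

    soup.p
    # <p class="title"><b>The Dormouse's story</b></p>

    soup.p['class']
    # ['title']

    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.find(id="link3")
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

One common task is extracting all the URLs found within a page's <a> tags:

    for link in soup.find_all('a'):
        print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie

Another common task is extracting all the text from a page:

    print(soup.get_text())
    # The Dormouse's story
    #
    # The Dormouse's story
    #
    # Once upon a time there were three little sisters; and their names were
    # Elsie,
    # Lacie and
    # Tillie;
    # and they lived at the bottom of a well.
    #
    # ...

Does this look like what you need? If so, read on.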
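For readers who want to try the quick start end to end, here is a minimal self-contained sketch. It assumes the "three sisters" markup looks like the snippet below (the exact attribute layout is an assumption for this illustration) and collects the link text and URLs into a dictionary; the variable names are mine, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Assumed form of the "three sisters" document, for illustration only.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Navigate by tag name, then read the tag's single string.
title = str(soup.title.string)

# Map each link's text to its href attribute.
links = {a.get_text(): a.get("href") for a in soup.find_all("a")}

# Look up one element by its id attribute.
third_sister = soup.find(id="link3").get_text()

print(title, links, third_sister)
```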
Installing Beautiful Soup

If you're using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:

    $ apt-get install python3-bs4

Beautiful Soup 4 is published through PyPI, so if you can't install it with the system packager, you can install it with easy_install or pip. The package name is beautifulsoup4. Make sure you use the right version of pip or easy_install for your Python version (these may be named pip3 and easy_install3 respectively).

    $ easy_install beautifulsoup4
    $ pip install beautifulsoup4

(The BeautifulSoup package is not what you want. That's the previous major release, Beautiful Soup 3. Lots of software uses BS3, so it's still available, but if you're writing new code you should install beautifulsoup4.)

If you don't have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.

    $ python setup.py install

If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application's codebase, and use Beautiful Soup without installing it at all.

I use Python 3.8 to develop Beautiful Soup, but it should work with other recent versions.

Installing a parser

Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:

    $ apt-get install python-lxml
    $ easy_install lxml
    $ pip install lxml

Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:

    $ apt-get install python-html5lib
    $ easy_install html5lib
    $ pip install html5lib

This table summarizes the advantages and disadvantages of each parser library:

| Parser | Typical usage | Advantages | Disadvantages |
| Python's html.parser | BeautifulSoup(markup, "html.parser") | Batteries included; decent speed; lenient (as of Python 3.2) | Not as fast as lxml, less lenient than html5lib |
| lxml's HTML parser | BeautifulSoup(markup, "lxml") | Very fast; lenient | External C dependency |
| lxml's XML parser | BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml") | Very fast; the only currently supported XML parser | External C dependency |
| html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient; parses pages the same way a web browser does; creates valid HTML5 | Very slow; external Python dependency |

If you can, I recommend you install and use lxml for speed. If you're using a very old version of Python (earlier than 3.2.2), it's essential that you install lxml or html5lib. Python's built-in HTML parser is just not very good in those old versions.

Note that if a document is invalid, different parsers will generate different Beautiful Soup trees for it. See Differences between parsers for details.
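One way to see how parsers differ on invalid markup is to run the same snippet through whichever parsers happen to be installed. The helper below is a hypothetical convenience function written for this illustration (it is not part of Beautiful Soup); it skips any parser whose library is missing by catching FeatureNotFound.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_available(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Return {parser name: serialized tree} for every parser that is installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # The requested parser library is not installed; skip it.
            continue
    return results


# An invalid document: an unclosed <a> tag followed by a stray </p>.
trees = parse_with_available("<a></p>")
for parser_name, tree in trees.items():
    print(parser_name, "->", tree)
```

Only html.parser is guaranteed to be present, since it ships with Python; the other entries appear only if lxml or html5lib is installed.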
Making the soup

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    soup = BeautifulSoup("<html>a web page</html>", 'html.parser')

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

    print(BeautifulSoup("<html><head></head><body>Sacr&eacute; bleu!</body></html>", "html.parser"))
    # <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See Parsing XML.)

Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you'll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.

Tag

A Tag object corresponds to an XML or HTML tag in the original document:

    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
    tag = soup.b
    type(tag)
    # <class 'bs4.element.Tag'>

Tags have a lot of attributes and methods, and I'll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.

Name

Every tag has a name, accessible as .name:

    tag.name
    # 'b'

If you change a tag's name, the change will be reflected in any HTML markup generated by Beautiful Soup:

    tag.name = "blockquote"
    tag
    # <blockquote class="boldest">Extremely bold</blockquote>
Attributes

A tag may have any number of attributes. The tag <b id="boldest"> has an attribute "id" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary:

    tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
    tag['id']
    # 'boldest'

You can access that dictionary directly as .attrs:

    tag.attrs
    # {'id': 'boldest'}

You can add, remove, and modify a tag's attributes. Again, this is done by treating the tag as a dictionary:

    tag['id'] = 'verybold'
    tag['another-attribute'] = 1
    tag
    # <b another-attribute="1" id="verybold">bold</b>

    del tag['id']
    del tag['another-attribute']
    tag
    # <b>bold</b>

    tag['id']
    # KeyError: 'id'
    tag.get('id')
    # None

Multi-valued attributes

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

    css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
    css_soup.p['class']
    # ['body']

    css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
    css_soup.p['class']
    # ['body', 'strikeout']

If an attribute looks like it has more than one value, but it's not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

    id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
    id_soup.p['id']
    # 'my id'

When you turn a tag back into a string, multiple attribute values are consolidated:

    rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')
    rel_soup.a['rel']
    # ['index']
    rel_soup.a['rel'] = ['index', 'contents']
    print(rel_soup.p)
    # <p>Back to the <a rel="index contents">homepage</a></p>

You can disable this by passing multi_valued_attributes=None as a keyword argument into the BeautifulSoup constructor:

    no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser', multi_valued_attributes=None)
    no_list_soup.p['class']
    # 'body strikeout'

You can use get_attribute_list to get a value that's always a list, whether or not it's a multi-valued attribute:

    id_soup.p.get_attribute_list('id')
    # ["my id"]

If you parse a document as XML, there are no multi-valued attributes:

    xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
    xml_soup.p['class']
    # 'body strikeout'

Again, you can configure this using the multi_valued_attributes argument:

    class_is_multi = {'*': 'class'}
    xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
    xml_soup.p['class']
    # ['body', 'strikeout']

You probably won't need to do this, but if you do, use the defaults as a guide. They implement the rules described in the HTML specification:

    from bs4.builder import builder_registry
    builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES

NavigableString

A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
    tag = soup.b
    tag.string
    # 'Extremely bold'
    type(tag.string)
    # <class 'bs4.element.NavigableString'>

A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a Unicode string with str:

    unicode_string = str(tag.string)
    unicode_string
    # 'Extremely bold'
    type(unicode_string)
    # <class 'str'>

You can't edit a string in place, but you can replace one string with another, using replace_with():

    tag.string.replace_with("No longer bold")
    tag
    # <b class="boldest">No longer bold</b>

NavigableString supports most of the features described in Navigating the tree and Searching the tree, but not all of them. In particular, since a string can't contain anything (the way a tag may contain a string or another tag), strings don't support the .contents or .string attributes, or the find() method.

If you want to use a NavigableString outside of Beautiful Soup, you should call str() on it to turn it into a normal Python Unicode string. If you don't, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you're done using Beautiful Soup. This is a big waste of memory.

BeautifulSoup

The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.

You can also pass a BeautifulSoup object into one of the methods defined in Modifying the tree, just as you would a Tag. This lets you do things like combine two parsed documents:

    doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
    footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
    doc.find(text="INSERT FOOTER HERE").replace_with(footer)
    # 'INSERT FOOTER HERE'
    print(doc)
    # <?xml version="1.0" encoding="utf-8"?>
    # <document><content/><footer>Here's the footer</footer></document>

Since the BeautifulSoup object doesn't correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it's useful to look at its .name, so it's been given the special .name "[document]":

    soup.name
    # '[document]'
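The attribute and string behaviour described in this section can be checked with a short sketch like the one below. The markup and variable names are illustrative assumptions; the calls themselves (dictionary-style access, get(), get_attribute_list(), and str() on a NavigableString) are the ones discussed above.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one multi-valued attribute (class) and one plain attribute (id).
tag = BeautifulSoup('<p class="body strikeout" id="my id">text</p>', "html.parser").p

classes = tag["class"]                  # multi-valued attribute -> list
tag_id = tag["id"]                      # ordinary attribute -> single string
id_list = tag.get_attribute_list("id")  # always a list
missing = tag.get("title")              # absent attribute -> None, no KeyError

# Assigning a list to a multi-valued attribute; it is joined with spaces on output.
tag["class"] = ["body", "highlight"]
rendered = str(tag)

# Convert the NavigableString to a plain str so it does not keep the tree alive.
plain = str(tag.string)

print(classes, tag_id, id_list, missing, rendered, plain)
```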
Navigating the tree

Here's the "Three sisters" HTML document again:

    html_doc = """<html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')

I'll use this as an example to show you how to move from one part of a document to another.

Going down

Tags may contain strings and other tags. These elements are the tag's children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag's children.

Note that Beautiful Soup strings don't support any of these attributes, because a string can't have children.

Navigating using tag names

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <head> tag, just say soup.head:

    soup.head
    # <head><title>The Dormouse's story</title></head>

    soup.title
    # <title>The Dormouse's story</title>

You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first <b> tag beneath the <body> tag:

    soup.body.b
    # <b>The Dormouse's story</b>

Using a tag name as an attribute will give you only the first tag by that name:

    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you need to get all the <a> tags, or anything more complicated than the first tag with a certain name, you'll need to use one of the methods described in Searching the tree, such as find_all():

    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.contents and .children

A tag's children are available in a list called .contents:

    head_tag = soup.head
    head_tag
    # <head><title>The Dormouse's story</title></head>

    head_tag.contents
    # [<title>The Dormouse's story</title>]

    title_tag = head_tag.contents[0]
    title_tag
    # <title>The Dormouse's story</title>
    title_tag.contents
    # ['The Dormouse's story']

The BeautifulSoup object itself has children. In this case, the <html> tag is the child of the BeautifulSoup object:

    len(soup.contents)
    # 1
    soup.contents[0].name
    # 'html'

A string does not have .contents, because it can't contain anything:

    text = title_tag.contents[0]
    text.contents
    # AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting them as a list, you can iterate over a tag's children using the .children generator:

    for child in title_tag.children:
        print(child)
    # The Dormouse's story

If you want to modify a tag's children, use the methods described in Modifying the tree. Don't modify the .contents list directly: that can lead to problems that are subtle and difficult to spot.

.descendants

The .contents and .children attributes only consider a tag's direct children. For instance, the <head> tag has a single direct child, the <title> tag:

    head_tag.contents
    # [<title>The Dormouse's story</title>]

But the <title> tag itself has a child: the string "The Dormouse's story". There's a sense in which that string is also a child of the <head> tag.
The .descendants attribute lets you iterate over all of a tag's children, recursively: its direct children, the children of its direct children, and so on:

    for child in head_tag.descendants:
        print(child)
    # <title>The Dormouse's story</title>
    # The Dormouse's story

The <head> tag has only one child, but it has two descendants: the <title> tag and the <title> tag's child. The BeautifulSoup object only has one direct child (the <html> tag), but it has a whole lot of descendants:

    len(list(soup.children))
    # 1
    len(list(soup.descendants))
    # 26

.string

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

    title_tag.string
    # 'The Dormouse's story'

If a tag's only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child:

    head_tag.contents
    # [<title>The Dormouse's story</title>]

    head_tag.string
    # 'The Dormouse's story'

If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None:

    print(soup.html.string)
    # None

.strings and stripped_strings

If there's more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:

    for string in soup.strings:
        print(repr(string))
    # "The Dormouse's story"
    # '\n'
    # '\n'
    # "The Dormouse's story"
    # '\n'
    # 'Once upon a time there were three little sisters; and their names were\n'
    # 'Elsie'
    # ',\n'
    # 'Lacie'
    # ' and\n'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '\n'
    # '...'
    # '\n'

These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead:

    for string in soup.stripped_strings:
        print(repr(string))
    # "The Dormouse's story"
    # "The Dormouse's story"
    # 'Once upon a time there were three little sisters; and their names were'
    # 'Elsie'
    # ','
    # 'Lacie'
    # 'and'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '...'

Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of strings is removed.

Going up

Continuing the "family tree" analogy, every tag and every string has a parent: the tag that contains it.

.parent

You can access an element's parent with the .parent attribute. In the example "three sisters" document, the <head> tag is the parent of the <title> tag:

    title_tag = soup.title
    title_tag
    # <title>The Dormouse's story</title>
    title_tag.parent
    # <head><title>The Dormouse's story</title></head>

The title string itself has a parent: the <title> tag that contains it:

    title_tag.string.parent
    # <title>The Dormouse's story</title>

The parent of a top-level tag like <html> is the BeautifulSoup object itself:

    html_tag = soup.html
    type(html_tag.parent)
    # <class 'bs4.BeautifulSoup'>

And the .parent of a BeautifulSoup object is defined as None:

    print(soup.parent)
    # None

.parents

You can iterate over all of an element's parents with .parents. This example uses .parents to travel from an <a> tag buried deep within the document, to the very top of the document:

    link = soup.a
    link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    for parent in link.parents:
        print(parent.name)
    # p
    # body
    # html
    # [document]

Going sideways

Consider a simple document like this:

    sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
    print(sibling_soup.prettify())
    # <a>
    #  <b>
    #   text1
    #  </b>
    #  <c>
    #   text2
    #  </c>
    # </a>

The <b> tag and the <c> tag are at the same level: they're both direct children of the same tag. We call them siblings. When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.

.next_sibling and .previous_sibling

You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:

    sibling_soup.b.next_sibling
    # <c>text2</c>

    sibling_soup.c.previous_sibling
    # <b>text1</b>

The <b> tag has a .next_sibling, but no .previous_sibling, because there's nothing before the <b> tag on the same level of the tree. For the same reason, the <c> tag has a .previous_sibling but no .next_sibling:

    print(sibling_soup.b.previous_sibling)
    # None
    print(sibling_soup.c.next_sibling)
    # None

The strings "text1" and "text2" are not siblings, because they don't have the same parent:

    sibling_soup.b.string
    # 'text1'

    print(sibling_soup.b.string.next_sibling)
    # None

In real documents, the .next_sibling or .previous_sibling of a tag will usually be a string containing whitespace. Going back to the "three sisters" document:

    # <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    # <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    # <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

You might think that the .next_sibling of the first <a> tag would be the second <a> tag. But actually, it's a string: the comma and newline that separate the first <a> tag from the second:

    link = soup.a
    link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    link.next_sibling
    # ',\n'

The second <a> tag is actually the .next_sibling of the comma:

    link.next_sibling.next_sibling
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

.next_siblings and .previous_siblings

You can iterate over a tag's siblings with .next_siblings or .previous_siblings:

    for sibling in soup.a.next_siblings:
        print(repr(sibling))
    # ',\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # ' and\n'
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    # ';\nand they lived at the bottom of a well.'

    for sibling in soup.find(id="link3").previous_siblings:
        print(repr(sibling))
    # ' and\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # ',\n'
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    # 'Once upon a time there were three little sisters; and their names were\n'

Going back and forth

Take a look at the beginning of the "three sisters" document:

    # <html><head><title>The Dormouse's story</title></head>
    # <p class="title"><b>The Dormouse's story</b></p>

An HTML parser takes this string of characters and turns it into a series of events: "open an <html> tag", "open a <head> tag", "open a <title> tag", "add a string", "close the <title> tag", "open a <p> tag", and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.

.next_element and .previous_element

The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as .next_sibling, but it's usually drastically different.

Here's the final <a> tag in the "three sisters" document. Its .next_sibling is a string: the conclusion of the sentence that was interrupted by the start of the <a> tag:

    last_a_tag = soup.find("a", id="link3")
    last_a_tag
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

    last_a_tag.next_sibling
    # ';\nand they lived at the bottom of a well.'

But the .next_element of that <a> tag, the thing that was parsed immediately after the <a> tag, is not the rest of that sentence: it's the word "Tillie":

    last_a_tag.next_element
    # 'Tillie'

That's because in the original markup, the word "Tillie" appeared before that semicolon. The parser encountered an <a> tag, then the word "Tillie", then the closing </a> tag, then the semicolon and rest of the sentence. The semicolon is on the same level as the <a> tag, but the word "Tillie" was encountered first.

The .previous_element attribute is the exact opposite of .next_element. It points to whatever element was parsed immediately before this one:

    last_a_tag.previous_element
    # ' and\n'
    last_a_tag.previous_element.next_element
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

.next_elements and .previous_elements

You should get the idea by now. You can use these iterators to move forward or backward in the document as it was parsed:

    for element in last_a_tag.next_elements:
        print(repr(element))
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '\n'
    # <p class="story">...</p>
    # '...'
    # '\n'
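The difference between moving up, sideways, and in parse order can be tried out with the sketch below. It assumes the "three sisters" markup has the form shown in the snippet (an assumption made for this illustration), and the variable names are mine.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed form of the "three sisters" document, for illustration only.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")
last_link = soup.find("a", id="link3")

# Going up: names of every ancestor, innermost first.
ancestors = [parent.name for parent in last_link.parents]

# Going sideways: siblings include whitespace strings, so keep only the tags.
sibling_ids = [s["id"] for s in soup.a.next_siblings if isinstance(s, Tag)]

# Parse order versus tree order for the same starting tag.
after_in_parse_order = last_link.next_element   # the string inside the tag
after_in_tree_order = last_link.next_sibling    # the text following the tag

print(ancestors, sibling_ids, repr(after_in_parse_order), repr(after_in_tree_order))
```

Filtering with isinstance(s, Tag) is one simple way to skip the whitespace strings that usually sit between sibling tags.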
Modifying the tree

Beautiful Soup's main strength is in searching the parse tree, but you can also modify the tree and write your changes as a new HTML or XML document.

Changing tag names and attributes

I covered this earlier, in Attributes, but it bears repeating. You can rename a tag, change the values of its attributes, add new attributes, and delete attributes:

    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
    tag = soup.b

    tag.name = "blockquote"
    tag['class'] = 'verybold'
    tag['id'] = 1
    tag
    # <blockquote class="verybold" id="1">Extremely bold</blockquote>

    del tag['class']
    del tag['id']
    tag
    # <blockquote>Extremely bold</blockquote>

Modifying .string

If you set a tag's .string attribute to a new string, the tag's contents are replaced with that string:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')

    tag = soup.a
    tag.string = "New link text."
    tag
    # <a href="http://example.com/">New link text.</a>

Be careful: if the tag contained other tags, they and all their contents will be destroyed.

append()

You can add to a tag's contents with Tag.append(). It works just like calling .append() on a Python list:

    soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
    soup.a.append("Bar")

    soup
    # <a>FooBar</a>
    soup.a.contents
    # ['Foo', 'Bar']

extend()

Starting in Beautiful Soup 4.7.0, Tag also supports a method called .extend(), which adds every element of a list to a Tag, in order:

    soup = BeautifulSoup("<a>Soup</a>", 'html.parser')
    soup.a.extend(["'s", " ", "on"])

    soup
    # <a>Soup's on</a>
    soup.a.contents
    # ['Soup', "'s", ' ', 'on']

NavigableString() and .new_tag()

If you need to add a string to a document, no problem: you can pass a Python string in to append(), or you can call the NavigableString constructor:

    from bs4 import NavigableString

    soup = BeautifulSoup("<b></b>", 'html.parser')
    tag = soup.b
    tag.append("Hello")
    new_string = NavigableString(" there")
    tag.append(new_string)
    tag
    # <b>Hello there</b>
    tag.contents
    # ['Hello', ' there']

If you want to create a comment or some other subclass of NavigableString, just call the constructor:

    from bs4 import Comment
    new_comment = Comment("Nice to see you.")
    tag.append(new_comment)
    tag
    # <b>Hello there<!--Nice to see you.--></b>
    tag.contents
    # ['Hello', ' there', 'Nice to see you.']

(This is a new feature in Beautiful Soup 4.4.0.)

What if you need to create a whole new tag? The best solution is to call the factory method BeautifulSoup.new_tag():

    soup = BeautifulSoup("<b></b>", 'html.parser')
    original_tag = soup.b

    new_tag = soup.new_tag("a", href="http://www.example.com")
    original_tag.append(new_tag)
    original_tag
    # <b><a href="http://www.example.com"></a></b>

    new_tag.string = "Link text."
    original_tag
    # <b><a href="http://www.example.com">Link text.</a></b>

Only the first argument, the tag name, is required.

insert()

Tag.insert() is just like Tag.append(), except the new element doesn't necessarily go at the end of its parent's .contents. It'll be inserted at whatever numeric position you say. It works just like .insert() on a Python list:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.a

    tag.insert(1, "but did not endorse ")
    tag
    # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
    tag.contents
    # ['I linked to ', 'but did not endorse ', <i>example.com</i>]

insert_before() and insert_after()

The insert_before() method inserts tags or strings immediately before something else in the parse tree:

    soup = BeautifulSoup("<b>leave</b>", 'html.parser')
    tag = soup.new_tag("i")
    tag.string = "Don't"
    soup.b.string.insert_before(tag)
    soup.b
    # <b><i>Don't</i>leave</b>

The insert_after() method inserts tags or strings immediately following something else in the parse tree:

    div = soup.new_tag('div')
    div.string = 'ever'
    soup.b.i.insert_after(" you ", div)
    soup.b
    # <b><i>Don't</i> you <div>ever</div>leave</b>
    soup.b.contents
    # [<i>Don't</i>, ' you ', <div>ever</div>, 'leave']

clear()

Tag.clear() removes the contents of a tag:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.a

    tag.clear()
    tag
    # <a href="http://example.com/"></a>

extract()

PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    a_tag = soup.a

    i_tag = soup.i.extract()

    a_tag
    # <a href="http://example.com/">I linked to </a>

    i_tag
    # <i>example.com</i>

    print(i_tag.parent)
    # None

At this point you effectively have two parse trees: one rooted at the BeautifulSoup object you used to parse the document, and one rooted at the tag that was extracted. You can go on to call extract on a child of the element you extracted:

    my_string = i_tag.string.extract()
    my_string
    # 'example.com'

    print(my_string.parent)
    # None
    i_tag
    # <i></i>

decompose()

Tag.decompose() removes a tag from the tree, then completely destroys it and its contents:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    a_tag = soup.a
    i_tag = soup.i

    i_tag.decompose()
    a_tag
    # <a href="http://example.com/">I linked to </a>

The behavior of a decomposed Tag or NavigableString is not defined and you should not use it for anything. If you're not sure whether something has been decomposed, you can check its .decomposed property (new in Beautiful Soup 4.9.0):

    i_tag.decomposed
    # True

    a_tag.decomposed
    # False

replace_with()

PageElement.replace_with() removes a tag or string from the tree, and replaces it with one or more tags or strings of your choice:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    a_tag = soup.a

    new_tag = soup.new_tag("b")
    new_tag.string = "example.com"
    a_tag.i.replace_with(new_tag)

    a_tag
    # <a href="http://example.com/">I linked to <b>example.com</b></a>

    bold_tag = soup.new_tag("b")
    bold_tag.string = "example"
    i_tag = soup.new_tag("i")
    i_tag.string = "net"
    a_tag.b.replace_with(bold_tag, ".", i_tag)

    a_tag
    # <a href="http://example.com/">I linked to <b>example</b>.<i>net</i></a>

replace_with() returns the tag or string that got replaced, so that you can examine it or add it back to another part of the tree.

The ability to pass multiple arguments into replace_with() is new in Beautiful Soup 4.10.0.

wrap()

PageElement.wrap() wraps an element in the tag you specify. It returns the new wrapper:

    soup = BeautifulSoup("<p>I wish I was bold.</p>", 'html.parser')
    soup.p.string.wrap(soup.new_tag("b"))
    # <b>I wish I was bold.</b>

    soup.p.wrap(soup.new_tag("div"))
    # <div><p><b>I wish I was bold.</b></p></div>

This method is new in Beautiful Soup 4.0.5.

unwrap()

Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever's inside that tag. It's good for stripping out markup:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    a_tag = soup.a

    a_tag.i.unwrap()
    a_tag
    # <a href="http://example.com/">I linked to example.com</a>

Like replace_with(), unwrap() returns the tag that was replaced.

smooth()

After calling a bunch of methods that modify the parse tree, you may end up with two or more NavigableString objects next to each other. Beautiful Soup doesn't have any problems with this, but since it can't happen in a freshly parsed document, you might not expect behavior like the following:

    soup = BeautifulSoup("<p>A one</p>", 'html.parser')
    soup.p.append(", a two")

    soup.p.contents
    # ['A one', ', a two']

    print(soup.p.encode())
    # b'<p>A one, a two</p>'

    print(soup.p.prettify())
    # <p>
    #  A one
    #  , a two
    # </p>

You can call Tag.smooth() to clean up the parse tree by consolidating adjacent strings:

    soup.smooth()

    soup.p.contents
    # ['A one, a two']

    print(soup.p.prettify())
    # <p>
    #  A one, a two
    # </p>

This method is new in Beautiful Soup 4.8.0.
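Several of the methods above are often used together. The following sketch, built on illustrative markup of my own choosing, strips a tag with unwrap(), merges the resulting adjacent strings with smooth(), adds a new tag with new_tag() and append(), and then removes it again with extract(). It assumes Beautiful Soup 4.8.0 or later, since it calls smooth().

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a

# unwrap() removes the <i> markup but keeps its text, leaving two adjacent strings.
a_tag.i.unwrap()
pieces_before = len(a_tag.contents)

# smooth() consolidates adjacent NavigableString objects into one.
soup.smooth()
pieces_after = len(a_tag.contents)

# Create a new tag with the factory method and append it to the link.
note = soup.new_tag("b")
note.string = "bold"
a_tag.append(note)
with_note = str(a_tag)

# extract() removes the tag from the tree and hands it back.
removed = a_tag.b.extract()

print(pieces_before, pieces_after, with_note, a_tag, removed)
```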
Output¶ Pretty-printing¶ The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string: markup = ‘
I linked to example.com‘ soup = BeautifulSoup ( markup , ‘html.parser’ ) soup . prettify () # ‘…’ print ( soup . prettify ()) # #
# # # # I linked to # # example.com # # # # You can call prettify() on the top-level BeautifulSoup object, or on any of its Tag objects: print ( soup . a . prettify ()) # # I linked to # # example.com # # Since it adds whitespace (in the form of newlines), prettify() changes the meaning of an HTML document and should not be used to reformat one. The goal of prettify() is to help you visually understand the structure of the documents you work with. Non-pretty printing¶ If you just want a string, with no fancy formatting, you can call str() on a BeautifulSoup object, or on a Tag within it: str ( soup ) # ‘I linked to example.com‘ str ( soup . a ) # ‘I linked to example.com‘ The str() function returns a string encoded in UTF-8. See Encodings for other options. You can also call encode() to get a bytestring, and decode() to get Unicode. Output formatters¶ If you give Beautiful Soup a document that contains HTML entities like “&lquot;”, they’ll be converted to Unicode characters: soup = BeautifulSoup ( ““Dammit!” he said.” , ‘html.parser’ ) str ( soup ) # ‘“Dammit!” he said.’ If you then convert the document to a bytestring, the Unicode characters will be encoded as UTF-8. You won’t get the HTML entities back: soup . encode ( “utf8” ) # b’\xe2\x80\x9cDammit!\xe2\x80\x9d he said.’ By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into “&”, “<”, and “>”, so that Beautiful Soup doesn’t inadvertently generate invalid HTML or XML: soup = BeautifulSoup ( “The law firm of Dewey, Cheatem, & Howe
” , ‘html.parser’ ) soup . p #
The law firm of Dewey, Cheatem, & Howe
    soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')
    soup.a
    # <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>

You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode(). Beautiful Soup recognizes five possible values for formatter.

The default is formatter="minimal". Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML:

    french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
    soup = BeautifulSoup(french, 'html.parser')
    print(soup.prettify(formatter="minimal"))
    # <p>
    #  Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
    # </p>

If you pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML entities whenever possible:

    print(soup.prettify(formatter="html"))
    # <p>
    #  Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
    # </p>

If you pass in formatter="html5", it's similar to formatter="html", but Beautiful Soup will omit the closing slash in HTML void tags like "br":

    br = BeautifulSoup("<br>", 'html.parser').br

    print(br.encode(formatter="html"))
    # b'<br/>'

    print(br.encode(formatter="html5"))
    # b'<br>'

In addition, any attributes whose values are the empty string will become HTML-style boolean attributes:

    option = BeautifulSoup('<option selected=""></option>', 'html.parser').option

    print(option.encode(formatter="html"))
    # b'<option selected=""></option>'

    print(option.encode(formatter="html5"))
    # b'<option selected></option>'

(This behavior is new as of Beautiful Soup 4.10.0.)

If you pass in formatter=None, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples:

    print(soup.prettify(formatter=None))
    # <p>
    #  Il a dit <<Sacré bleu!>>
    # </p>

    link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')
    print(link_soup.a.encode(formatter=None))
    # b'<a href="http://example.com/?foo=val1&bar=val2">A link</a>'

If you need more sophisticated control over your output, you can use Beautiful Soup's Formatter class. Here's a formatter that converts strings to uppercase, whether they occur in a text node or in an attribute value:

    from bs4.formatter import HTMLFormatter

    def uppercase(text):
        # Called for every string in the document, including attribute values.
        return text.upper()

    formatter = HTMLFormatter(uppercase)

    print(soup.prettify(formatter=formatter))
    # <p>
    #  IL A DIT <<SACRÉ BLEU!>>
    # </p>

    print(link_soup.a.prettify(formatter=formatter))
    # <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2">
    #  A LINK
    # </a>

Here's a formatter that increases the indentation when pretty-printing:

    formatter = HTMLFormatter(indent=8)
    print(link_soup.a.prettify(formatter=formatter))
    # <a href="http://example.com/?foo=val1&bar=val2">
    #         A link
    # </a>

Subclassing HTMLFormatter or XMLFormatter will give you even more control over the output. For example, Beautiful Soup sorts the attributes in every tag by default:

    attr_soup = BeautifulSoup(b'<p z="1" m="2" a="3"></p>', 'html.parser')
    print(attr_soup.p.encode())
    # b'<p a="3" m="2" z="1"></p>'
To turn this off, you can subclass HTMLFormatter and override the Formatter.attributes() method, which controls which attributes are output and in what order. This implementation also filters out the attribute called "m" whenever it appears:

    class UnsortedAttributes(HTMLFormatter):
        def attributes(self, tag):
            # Yield attributes in their original order, skipping "m".
            for k, v in tag.attrs.items():
                if k == 'm':
                    continue
                yield k, v

    print(attr_soup.p.encode(formatter=UnsortedAttributes()))
    # b'<p z="1" a="3"></p>'
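The built-in formatter values described earlier can be compared side by side on the same tag. This is a minimal sketch, assuming the bs4 package is installed. The results dictionary is an illustrative name, and decode() is used instead of prettify() only so that each result fits on one line.

```python
from bs4 import BeautifulSoup

french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
p = BeautifulSoup(french, "html.parser").p

# decode() returns a Unicode string and accepts the same formatter values
# as prettify() and encode().
results = {name: p.decode(formatter=name) for name in ("minimal", "html", None)}

for name, markup in results.items():
    print(f"{name!r:>10}: {markup}")
```

Only formatter=None leaves the angle brackets unescaped, which is why that output is not valid HTML.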
One last caveat: if you create a CData object, the text inside that object is always presented exactly as it appears, with no formatting. Beautiful Soup will call your entity substitution function, just in case you've written a custom function that counts all the strings in the document or something, but it will ignore the return value:

    from bs4.element import CData
    soup = BeautifulSoup("<a></a>", 'html.parser')
    soup.a.string = CData("one < three")
    print(soup.a.prettify(formatter="html"))
    # <a>
    #  <![CDATA[one < three]]>
    # </a>

get_text()

If you only want the human-readable text inside a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:

    markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
    soup = BeautifulSoup(markup, 'html.parser')

    soup.get_text()
    # '\nI linked to example.com\n'

    soup.i.get_text()
    # 'example.com'

You can specify a string to be used to join the bits of text together:

    soup.get_text("|")
    # '\nI linked to |example.com|\n'

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

    soup.get_text("|", strip=True)
    # 'I linked to|example.com'

But at that point you might want to use the .stripped_strings generator instead, and process the text yourself:

    [text for text in soup.stripped_strings]
    # ['I linked to', 'example.com']

As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of