Let's look at the topic article "from bs4 import beautifulsoup – BS4 library in python|| Import Error: No module named 'bs4' || python3". Category: Top 716 tips update new. This article was compiled by https://ph.taphoamini.com from many sources on the internet. The accompanying video by Coding Mente has 3,606 views and 44 likes.
On Windows, open the Start menu, type cmd, right-click the cmd icon, click Run as administrator, and then type pip install beautifulsoup4. The install may fail if you skip this step: even though your Windows user is an admin account, it does not run all apps as administrator.
To use Beautiful Soup, you need to install it: $ pip install beautifulsoup4 . Beautiful Soup also relies on a parser. Python's built-in html.parser works out of the box; lxml is a faster third-party option. You may already have lxml, but you should check (open IDLE and attempt to import lxml). If not, do: $ pip install lxml or $ apt-get install python-lxml .
To install Beautiful Soup on Windows, Linux, or any other operating system, you need pip. Once pip is available, the installation takes only a moment.
To check that the installation worked:
- Open up the Python interpreter in a terminal by using the following command: python.
- Now, we can issue a simple import statement to see whether we have successfully installed Beautiful Soup or not by using the following command: from bs4 import BeautifulSoup.
Video description: from bs4 import beautifulsoup – BS4 library in python|| Import Error: No module named 'bs4' || python3
In this video, we discuss one of the most important tools for scraping data: BeautifulSoup.
The Code Snippet
https://codingmente.com/category/blog/python-projects/
Video details for from bs4 import beautifulsoup
- Author: Coding Mente
- Views: 3,606
- Likes: 44
- Upload date: 2021/08/01
- Video URL: https://www.youtube.com/watch?v=bbevgam1MAo
How do I import from BeautifulSoup?
To use Beautiful Soup, you need to install it: $ pip install beautifulsoup4 . Beautiful Soup also relies on a parser. Python's built-in html.parser works out of the box; lxml is a faster third-party option. You may already have lxml, but you should check (open IDLE and attempt to import lxml). If not, do: $ pip install lxml or $ apt-get install python-lxml .
How do I download a bs4 module in Python?
To install Beautiful Soup on Windows, Linux, or any other operating system, you need pip. With pip available, run pip install beautifulsoup4 and Beautiful Soup will be installed shortly.
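A common cause of "No module named 'bs4'" is that pip installed the package into a different Python than the one running your script. The sketch below is one way to avoid that, by invoking pip through the current interpreter. The helper names pip_install_command and install are hypothetical, chosen only for illustration.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command tied to the interpreter running this script."""
    # sys.executable is the path of the Python currently executing,
    # so the package lands in the same environment that will import it.
    return [sys.executable, "-m", "pip", "install", package]


def install(package):
    """Run the install; raises CalledProcessError if pip fails."""
    subprocess.check_call(pip_install_command(package))


# Example (not executed here): install("beautifulsoup4")
print(" ".join(pip_install_command("beautifulsoup4")))
```

Running the same thing by hand is equivalent to typing python -m pip install beautifulsoup4 with the Python you intend to use.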
Is bs4 same as beautifulsoup4?
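The official name of the Beautiful Soup package on PyPI is beautifulsoup4 . A separate dummy package named bs4 also exists on PyPI; it ensures that if you type pip install bs4 by mistake you will still end up with Beautiful Soup . Either way, the module you import in code is bs4 .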
How do I know if bs4 is installed?
- Open up the Python interpreter in a terminal by using the following command: python.
- Now, we can issue a simple import statement to see whether we have successfully installed Beautiful Soup or not by using the following command: from bs4 import BeautifulSoup.
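The same check can be scripted. This is a minimal sketch using only the standard library; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("bs4"):
    print("Beautiful Soup is installed")
else:
    print("bs4 not found; try: python -m pip install beautifulsoup4")
```

Because it only looks the module up, the script reports a missing installation instead of stopping with an ImportError.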
How do I install BeautifulSoup in PyCharm?
- Open File > Settings > Project from the PyCharm menu.
- Select your current project.
- Click the Python Interpreter tab within your project tab.
- Click the small + symbol to add a new library to the project, then search for beautifulsoup4 and click Install Package.
How do I import a beautifulsoup4 in Jupyter notebook?
- Open a new anaconda prompt.
- Run conda install -c anaconda beautifulsoup4.
- Close and reopen jupyter notebook.
- In jupyter notebook import libraries as following: from bs4 import BeautifulSoup.
Is Beautiful Soup faster than selenium?
Comparing Selenium and BeautifulSoup shows that BeautifulSoup is more user-friendly, so you can learn it faster and get started on smaller web scraping tasks more easily. Selenium, on the other hand, is important when the target website relies heavily on JavaScript to render its content.
How do I download pip for Python?
- Step 1: Download PIP get-pip.py.
- Step 2: Installing PIP on Windows.
- Step 3: Verify Installation.
- Step 4: Add Pip to Windows Environment Variables.
- Step 5: Configuration.
How do I add Beautiful Soup to my Mac?
- Step 1: Install latest Python3 in MacOS.
- Step 2: Check if pip3 and python3 are correctly installed.
- Step 3: Upgrade your pip to avoid errors during installation.
- Step 4: Enter the following command to install Beautiful Soup using pip.
- Alternatively, to install from source: download the latest package of Beautiful Soup for python3.
How do I use BS4 in Python?
Jump into the Code
First, we need to import all the libraries that we are going to use. Next, declare a variable for the url of the page. Then, use Python's urllib.request module (urllib2 in Python 2) to get the HTML page of the url declared. Finally, parse the page into BeautifulSoup format so we can use BeautifulSoup to work on it.
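The steps above can be sketched as follows for Python 3. The helper names fetch and page_title and the sample markup are assumptions made for illustration; the parsing step runs on an inline string so the example works without network access.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch(url):
    """Download the raw bytes of a page (requires network access)."""
    with urlopen(url) as response:
        return response.read()


def page_title(html):
    """Parse markup and return the <title> text, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None


# Inline sample instead of a live download; with a real page you would
# call page_title(fetch(url)) for a url of your choice.
sample = "<html><head><title>Example page</title></head><body></body></html>"
print(page_title(sample))
```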
What is BS4?
In this context, bs4 is the Python package name for Beautiful Soup 4: you install it from PyPI as beautifulsoup4 and import it as bs4. (The abbreviation BS4 is also used for India's Bharat Stage 4 vehicle emission norms, which are unrelated to the Python library.)
What is LXML in BeautifulSoup?
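To prevent users from having to choose their parser library in advance, lxml can interface to the parsing capabilities of BeautifulSoup through the lxml.html.soupparser module. It provides three main functions: fromstring() and parse() to parse a string or file using BeautifulSoup into an lxml.html document, and convert_tree() to convert an existing BeautifulSoup tree into a list of top-level Elements. In the other direction, Beautiful Soup can use lxml as its parser when you pass "lxml" as the second argument to the BeautifulSoup constructor.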
Do you have to download BeautifulSoup?
As BeautifulSoup is not a standard python library, we need to install it first.
How can I from bs4 import BeautifulSoup?
On Windows, open the Start menu, type cmd, right-click the cmd icon, click Run as administrator, and then type pip install beautifulsoup4.
The install may fail if you skip this step: even though your Windows user is an admin account, it does not run all apps as administrator.
Notice the difference if you simply open cmd without Run as administrator.
Remember also that the import is case-sensitive, and that the class is named BeautifulSoup, not beautifulsoup4:
from bs4 import beautifulsoup4
will not work, because no such name exists in the bs4 module.
from bs4 import BeautifulSoup
will work correctly.
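To see which spellings the bs4 module actually exposes, a quick check like the following can help. It is only a sketch: it uses hasattr as a stand-in for attempting each import, and the variable name available is arbitrary.

```python
import bs4

candidates = ("BeautifulSoup", "beautifulsoup", "BeautifulSoup4", "beautifulsoup4")

# hasattr tells us whether "from bs4 import <name>" could find that name.
available = {name: hasattr(bs4, name) for name in candidates}

for name, ok in available.items():
    print(f"from bs4 import {name}: {'works' if ok else 'fails'}")
```

Only the exact spelling BeautifulSoup is reported as working.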
beautifulsoup4
Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
Quick start
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p>Some<b>bad<i>HTML")
>>> print(soup.prettify())
<html>
 <body>
  <p>
   Some
   <b>
    bad
    <i>
     HTML
    </i>
   </b>
  </p>
 </body>
</html>
>>> soup.find(text="bad")
'bad'
>>> soup.i
<i>HTML</i>
>>> soup = BeautifulSoup("<tag1>Some<tag2/>bad<tag3>XML", "xml")
>>> print(soup.prettify())
<?xml version="1.0" encoding="utf-8"?>
<tag1>
 Some
 <tag2/>
 bad
 <tag3>
  XML
 </tag3>
</tag1>
To go beyond the basics, comprehensive documentation is available.
Note on Python 2 sunsetting
Beautiful Soup’s support for Python 2 was discontinued on December 31, 2020: one year after the sunset date for Python 2 itself. From this point onward, new Beautiful Soup development will exclusively target Python 3. The final release of Beautiful Soup 4 to support Python 2 was 4.9.3.
Supporting the project
If you use Beautiful Soup as part of your professional work, please consider a Tidelift subscription. This will support many of the free software projects your organization depends on, not just Beautiful Soup.
If you use Beautiful Soup for personal projects, the best way to say thank you is to read Tool Safety, a zine I wrote about what Beautiful Soup has taught me about software development.
Building the documentation
The bs4/doc/ directory contains full documentation in Sphinx format. Run make html in that directory to create HTML documentation.
Running the unit tests
Beautiful Soup supports unit test discovery using Pytest: run pytest from the project directory.
Python Programming Tutorials
Web scraping and parsing with Beautiful Soup 4 Introduction
Welcome to a tutorial on web scraping with Beautiful Soup 4. Beautiful Soup is a Python library aimed at helping programmers who are trying to scrape data from websites.
To use Beautiful Soup, you need to install it: $ pip install beautifulsoup4 . Beautiful Soup also relies on a parser. This tutorial uses lxml, which is a separate third-party package (Python's built-in html.parser is the alternative that needs no install). You may already have lxml, but you should check (open IDLE and attempt to import lxml). If not, do: $ pip install lxml or $ apt-get install python-lxml .
To begin, we need HTML. I have created an example page for us to work with.
To begin, we need to import Beautiful Soup and urllib, and grab source code:
import bs4 as bs
import urllib.request
source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
Then, we create the "soup." This is a Beautiful Soup object:
soup = bs.BeautifulSoup(source, 'lxml')
If you print(soup) and print(source), they look the same, but source is just the raw response data, while soup is an object that we can interact with by tag, like so:
# title of the page
print(soup.title)
# get attributes:
print(soup.title.name)
# get values:
print(soup.title.string)
# beginning navigation:
print(soup.title.parent.name)
# getting specific values:
print(soup.p)
Finding paragraph tags (<p>) is a fairly common task. In the case above, we're just finding the first one. What if we wanted to find them all?
print(soup.find_all('p'))
We can also iterate through them:
for paragraph in soup.find_all('p'):
    print(paragraph.string)
    print(str(paragraph.text))
The difference between string and text is that string produces a NavigableString object, and text is just typical unicode text. Notice that, if there are child tags in the paragraph item that we’re attempting to use .string on, we will get None returned.
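A small self-contained sketch of that difference, using made-up markup rather than the tutorial's example page:

```python
from bs4 import BeautifulSoup

html = "<p>Plain text</p><p>Text with <b>bold</b> child</p>"
soup = BeautifulSoup(html, "html.parser")
simple, nested = soup.find_all("p")

# One string child: .string returns it as a NavigableString.
print(type(simple.string).__name__, repr(str(simple.string)))

# Several children (text, a <b> tag, more text): .string is None,
# while .text joins all the text found inside the tag.
print(nested.string)
print(nested.text)
```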
Another common task is to grab links. For example:
for url in soup.find_all('a'):
    print(url.get('href'))
In this case, if we just grabbed the .text from the tag, you’d get the anchor text, but we actually want the link itself. That’s why we’re using .get(‘href’) to get the true URL.
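The following sketch contrasts the anchor text with the href value on made-up markup. Note that .get('href') returns None for an anchor without that attribute, which is why it is safer here than indexing with ['href'].

```python
from bs4 import BeautifulSoup

html = '<a href="/about">About us</a> <a name="anchor-only">No link</a>'
soup = BeautifulSoup(html, "html.parser")

# Pair each anchor's visible text with its link target (or None).
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]

for text, href in links:
    print(f"{text!r} -> {href!r}")
```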
Finally, you may just want to grab text. You can use .get_text() on a Beautiful Soup object, including the full soup:
print(soup.get_text())
This concludes the introduction to Beautiful Soup. In the next tutorial, we're going to cover navigating a page's elements to get more specifically what you want.
Beautifulsoup Installation
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. At the time the source article was written, the latest version of Beautiful Soup was 4.9.3.
How to install Beautifulsoup?
To install Beautiful Soup on Windows, Linux, or any other operating system, you need pip. If pip is missing, install it for your operating system first.
Now, run a simple command:
pip install beautifulsoup4
Beautiful Soup will be installed shortly.
Install Beautifulsoup4 using Source code
You can also install Beautiful Soup directly from source: download the Beautiful Soup 4 source tarball, extract it, cd into the directory, and run:
python setup.py install
Verifying Installation
To check whether the installation completed, try importing the library in Python:
from bs4 import BeautifulSoup
What is the difference beautifulsoup and bs4
I'm new to Python and I tried to parse some XML files in order to add some new tags and store the new XML file.
python-beautifulsoup seems to be the right package for that. Searching the web for tutorials on how to add a new tag to XML parsed by BeautifulSoup, I found that the package python-bs4 is used.
Looking at the package descriptions, both packages have the same title:
python-bs4 – error-tolerant HTML parser for Python
python-beautifulsoup – error-tolerant HTML parser for Python
So my question: what is the difference?
Beautiful Soup – Installation
As BeautifulSoup is not a standard python library, we need to install it first. We are going to install the BeautifulSoup 4 library (also known as BS4), which is the latest one.
To isolate our working environment so as not to disturb the existing setup, let us first create a virtual environment.
Creating a virtual environment (optional)
A virtual environment allows us to create an isolated working copy of python for a specific project without affecting the outside setup.
The best way to install any Python package is with pip. If pip is not installed already (you can check by running "pip --version" in your command or shell prompt), you can install it with the command below −
Linux environment
$sudo apt-get install python3-pip
Windows environment
To install pip in windows, do the following −
Download the get-pip.py from https://bootstrap.pypa.io/get-pip.py or from the github to your computer.
Open the command prompt and navigate to the folder containing get-pip.py file.
Run the following command −
>python get-pip.py
That’s it, pip is now installed in your windows machine.
You can verify your pip installed by running below command −
>pip --version
pip 19.2.3 from c:\users\yadur\appdata\local\programs\python\python37\lib\site-packages\pip (python 3.7)
Installing virtual environment
Run the below command in your command prompt −
>pip install virtualenv
The command below will create a virtual environment ("myEnv") in your current directory −
>virtualenv myEnv
To activate your virtual environment, run the following command −
>myEnv\Scripts\activate
Once activated, the prompt shows "(myEnv)" as a prefix, which tells us that we are inside the virtual environment "myEnv".
To leave the virtual environment, run deactivate.
(myEnv) C:\Users\yadur>deactivate
C:\Users\yadur>
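As an alternative to the third-party virtualenv tool used above, Python 3 ships a venv module in the standard library. The sketch below swaps in that module and creates the environment from Python code in a temporary directory; the directory location is an assumption for illustration, and with_pip=False is used only to keep the example quick.

```python
import tempfile
import venv
from pathlib import Path

# Create the environment under a throwaway directory for this demo;
# in a real project you would pick a folder next to your code.
env_dir = Path(tempfile.mkdtemp()) / "myEnv"

# with_pip=False skips bootstrapping pip; use with_pip=True if you
# want to run pip install inside the new environment.
venv.create(str(env_dir), with_pip=False)

# A pyvenv.cfg file at the top level marks a valid environment.
print((env_dir / "pyvenv.cfg").exists())
```

From a terminal, the equivalent is python -m venv myEnv, followed by the same activate and deactivate steps.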
As our virtual environment is ready, now let us install beautifulsoup.
Installing BeautifulSoup
As BeautifulSoup is not a standard library, we need to install it. We are going to use the BeautifulSoup 4 package (known as bs4).
Linux Machine
To install bs4 on Debian or Ubuntu Linux using the system package manager, run the command below −
$sudo apt-get install python-bs4 (for python 2.x)
$sudo apt-get install python3-bs4 (for python 3.x)
You can install bs4 using easy_install or pip (in case you have problems installing with the system packager).
$easy_install beautifulsoup4
$pip install beautifulsoup4
(You may need to use easy_install3 or pip3 respectively if you're using python3)
Windows Machine
To install beautifulsoup4 in windows is very simple, especially if you have pip already installed.
>pip install beautifulsoup4
So now beautifulsoup4 is installed in our machine. Let us talk about some problems encountered after installation.
Problems after installation
On a Windows machine, you might run into errors caused by the wrong version being installed −
ImportError "No module named HTMLParser": you are running the Python 2 version of the code under Python 3.
ImportError "No module named html.parser": you are running the Python 3 version of the code under Python 2.
The best way out of both situations is to completely remove the existing installation and re-install BeautifulSoup.
If you get the SyntaxError "Invalid syntax" on the line ROOT_TAG_NAME = u'[document]', then you need to convert the Python 2 code to Python 3, either by installing the package −
$ python3 setup.py install
or by manually running Python's 2to3 conversion script on the bs4 directory −
$ 2to3-3.2 -w bs4
Installing a Parser
By default, Beautiful Soup supports the HTML parser included in Python’s standard library, however it also supports many external third party python parsers like lxml parser or html5lib parser.
To install lxml or html5lib parser, use the command −
Linux Machine
$apt-get install python3-lxml
$apt-get install python3-html5lib
Windows Machine
>pip install lxml
>pip install html5lib
Generally, users choose lxml for speed. If you are using an older version of Python 2 (before 2.7.3) or Python 3 (before 3.2.2), installing lxml or html5lib is recommended, because Python's built-in HTML parser is not very good in those old versions.
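If you are not sure which parsers are installed on a given machine, one approach is to try them in order of preference and fall back to the built-in one. This is a sketch under that assumption; the helper name make_soup and the preference order are illustrative choices, while FeatureNotFound is the exception bs4 raises when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>hi</p>")
print(f"parsed with {used}: {soup.p.get_text()}")
```

Since html.parser ships with Python, the fallback always succeeds on a standard installation. Keep in mind that different parsers can build different trees from invalid markup.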
Running Beautiful Soup
It is time to test our Beautiful Soup package in one of the html pages (taking web page – https://www.tutorialspoint.com/index.htm, you can choose any-other web page you want) and extract some information from it.
In the below code, we are trying to extract the title from the webpage −
from bs4 import BeautifulSoup
import requests
url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")
print(soup.title)
Output
<title>H2O, Colab, Theano, Flutter, KNime, Mean.js, Weka, Solidity, Org.Json, AWS QuickSight, JSON.Simple, Jackson Annotations, Passay, Boon, MuleSoft, Nagios, Matplotlib, Java NIO, PyTorch, SLF4J, Parallax Scrolling, Java Cryptography</title>
One common task is to extract all the URLs within a webpage. For that we just need to add the below lines of code −
for link in soup.find_all('a'):
    print(link.get('href'))
Output
https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/about/about_careers.htm
https://www.tutorialspoint.com/questions/index.php
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/codingground.htm
https://www.tutorialspoint.com/current_affairs.htm
https://www.tutorialspoint.com/upsc_ias_exams.htm
https://www.tutorialspoint.com/tutor_connect/index.php
https://www.tutorialspoint.com/whiteboard.htm
https://www.tutorialspoint.com/netmeeting.php
https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/tutorialslibrary.htm
https://www.tutorialspoint.com/videotutorials/index.php
https://store.tutorialspoint.com
https://www.tutorialspoint.com/gate_exams_tutorials.htm
https://www.tutorialspoint.com/html_online_training/index.asp
https://www.tutorialspoint.com/css_online_training/index.asp
https://www.tutorialspoint.com/3d_animation_online_training/index.asp
https://www.tutorialspoint.com/swift_4_online_training/index.asp
https://www.tutorialspoint.com/blockchain_online_training/index.asp
https://www.tutorialspoint.com/reactjs_online_training/index.asp
https://www.tutorix.com
https://www.tutorialspoint.com/videotutorials/top-courses.php
https://www.tutorialspoint.com/the_full_stack_web_development/index.asp
….
….
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/free_web_graphics.htm
https://www.tutorialspoint.com/online_file_conversion.htm
https://www.tutorialspoint.com/netmeeting.php
https://www.tutorialspoint.com/free_online_whiteboard.htm
http://www.tutorialspoint.com
https://www.facebook.com/tutorialspointindia
https://plus.google.com/u/0/+tutorialspoint
http://www.twitter.com/tutorialspoint
http://www.linkedin.com/company/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://www.tutorialspoint.com/index.htm
/about/about_privacy.htm#cookies
/about/faq.htm
/about/about_helping.htm
/about/contact_us.htm
Similarly, we can extract useful information using beautifulsoup4.
Now let us understand more about “soup” in above example.
Beautiful Soup Documentation — Beautiful Soup 4.9.0 documentation
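Quick Start
Here's an HTML document I'll be using as an example throughout this document. It's part of a story from Alice in Wonderland:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
Running the "three sisters" document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
Here are some simple ways to navigate that data structure:
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# 'title'
soup.title.string
# 'The Dormouse's story'
soup.title.parent.name
# 'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# ['title']
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
One common task is extracting all the URLs found within a page's <a> tags:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
Another common task is extracting all the text from a page:
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
Does this look like what you need? If so, read on.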
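A compact, runnable sketch of the same ideas, using a shortened version of the "three sisters" markup. The class_ keyword used here is Beautiful Soup's way of filtering by CSS class (class itself is a reserved word in Python).

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p></body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Filter by CSS class, then read each tag's text.
names = [a.string for a in soup.find_all("a", class_="sister")]

# Look a tag up by id, then inspect an attribute and its parent.
second = soup.find(id="link2")

print(names)
print(second["href"], second.parent.name)
print(soup.p["class"])  # class is multi-valued, so this is a list
```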
Installing Beautiful Soup
If you're using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:
$ apt-get install python3-bs4
Beautiful Soup 4 is published through PyPi, so if you can't install it with the system packager, you can install it with easy_install or pip. The package name is beautifulsoup4. Make sure you use the right version of pip or easy_install for your Python version (these may be named pip3 and easy_install3 respectively).
$ easy_install beautifulsoup4
$ pip install beautifulsoup4
(The BeautifulSoup package is not what you want. That's the previous major release, Beautiful Soup 3. Lots of software uses BS3, so it's still available, but if you're writing new code you should install beautifulsoup4.)
If you don't have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.
$ python setup.py install
If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application's codebase, and use Beautiful Soup without installing it at all.
I use Python 3.8 to develop Beautiful Soup, but it should work with other recent versions.
Installing a parser
Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:
$ apt-get install python-lxml
$ easy_install lxml
$ pip install lxml
Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:
$ apt-get install python-html5lib
$ easy_install html5lib
$ pip install html5lib
The following summarizes the advantages and disadvantages of each parser library:
- Python's html.parser
  Typical usage: BeautifulSoup(markup, "html.parser")
  Advantages: batteries included; decent speed; lenient (as of Python 3.2)
  Disadvantages: not as fast as lxml, less lenient than html5lib
- lxml's HTML parser
  Typical usage: BeautifulSoup(markup, "lxml")
  Advantages: very fast; lenient
  Disadvantages: external C dependency
- lxml's XML parser
  Typical usage: BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
  Advantages: very fast; the only currently supported XML parser
  Disadvantages: external C dependency
- html5lib
  Typical usage: BeautifulSoup(markup, "html5lib")
  Advantages: extremely lenient; parses pages the same way a web browser does; creates valid HTML5
  Disadvantages: very slow; external Python dependency
If you can, I recommend you install and use lxml for speed. If you're using a very old version of Python – earlier than 3.2.2 – it's essential that you install lxml or html5lib. Python's built-in HTML parser is just not very good in those old versions.
Note that if a document is invalid, different parsers will generate different Beautiful Soup trees for it. See Differences between parsers for details.
Making the soup¶ To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle: from bs4 import BeautifulSoup with open ( “index.html” ) as fp : soup = BeautifulSoup ( fp , ‘html.parser’ ) soup = BeautifulSoup ( “a web page” , ‘html.parser’ ) First, the document is converted to Unicode, and HTML entities are converted to Unicode characters: print ( BeautifulSoup ( “
Sacré bleu!” , “html.parser” )) # Sacré bleu! Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See Parsing XML.)Kinds of objects¶ Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag , NavigableString , BeautifulSoup , and Comment . Tag ¶ A Tag object corresponds to an XML or HTML tag in the original document: soup = BeautifulSoup ( ‘Extremely bold‘ , ‘html.parser’ ) tag = soup . b type ( tag ) #
Tags have a lot of attributes and methods, and I’ll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes. Name¶ Every tag has a name, accessible as .name : tag . name # ‘b’ If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup: tag . name = “blockquote” tag # Extremely bold
Attributes¶ A tag may have any number of attributes. The tag has an attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary: tag = BeautifulSoup ( ‘bold‘ , ‘html.parser’ ) . b tag [ ‘id’ ] # ‘boldest’ You can access that dictionary directly as .attrs : tag . attrs # {‘id’: ‘boldest’} You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary: tag [ ‘id’ ] = ‘verybold’ tag [ ‘another-attribute’ ] = 1 tag # del tag [ ‘id’ ] del tag [ ‘another-attribute’ ] tag # bold tag [ ‘id’ ] # KeyError: ‘id’ tag . get ( ‘id’ ) # None Multi-valued attributes¶ HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel , rev , accept-charset , headers , and accesskey . Beautiful Soup presents the value(s) of a multi-valued attribute as a list: css_soup = BeautifulSoup ( ‘
‘ , ‘html.parser’ ) css_soup . p [ ‘class’ ] # [‘body’] css_soup = BeautifulSoup ( ‘
‘ , ‘html.parser’ ) css_soup . p [ ‘class’ ] # [‘body’, ‘strikeout’] If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone: id_soup = BeautifulSoup ( ‘
‘ , ‘html.parser’ ) id_soup . p [ ‘id’ ] # ‘my id’ When you turn a tag back into a string, multiple attribute values are consolidated: rel_soup = BeautifulSoup ( ‘
Back to the homepage
‘ , ‘html.parser’ ) rel_soup . a [ ‘rel’ ] # [‘index’] rel_soup . a [ ‘rel’ ] = [ ‘index’ , ‘contents’ ] print ( rel_soup . p ) #
Back to the homepage
You can disable this by passing multi_valued_attributes=None as a keyword argument into the BeautifulSoup constructor: no_list_soup = BeautifulSoup ( ‘
‘ , ‘html.parser’ , multi_valued_attributes = None ) no_list_soup . p [ ‘class’ ] # ‘body strikeout’ You can use get_attribute_list to get a value that’s always a list, whether or not it’s a multi-valued atribute: id_soup . p . get_attribute_list ( ‘id’ ) # [“my id”] If you parse a document as XML, there are no multi-valued attributes: xml_soup = BeautifulSoup ( ‘
‘ , ‘xml’ ) xml_soup . p [ ‘class’ ] # ‘body strikeout’ Again, you can configure this using the multi_valued_attributes argument: class_is_multi = { ‘*’ : ‘class’ } xml_soup = BeautifulSoup ( ‘
‘ , ‘xml’ , multi_valued_attributes = class_is_multi ) xml_soup . p [ ‘class’ ] # [‘body’, ‘strikeout’] You probably won’t need to do this, but if you do, use the defaults as a guide. They implement the rules described in the HTML specification: from bs4.builder import builder_registry builder_registry . lookup ( ‘html’ ) . DEFAULT_CDATA_LIST_ATTRIBUTES NavigableString ¶ A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text: soup = BeautifulSoup ( ‘Extremely bold‘ , ‘html.parser’ ) tag = soup . b tag . string # ‘Extremely bold’ type ( tag . string ) #
A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a Unicode string with str : unicode_string = str ( tag . string ) unicode_string # ‘Extremely bold’ type ( unicode_string ) # You can’t edit a string in place, but you can replace one string with another, using replace_with(): tag . string . replace_with ( “No longer bold” ) tag # No longer bold NavigableString supports most of the features described in Navigating the tree and Searching the tree, but not all of them. In particular, since a string can’t contain anything (the way a tag may contain a string or another tag), strings don’t support the .contents or .string attributes, or the find() method. If you want to use a NavigableString outside of Beautiful Soup, you should call unicode() on it to turn it into a normal Python Unicode string. If you don’t, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you’re done using Beautiful Soup. This is a big waste of memory. BeautifulSoup ¶ The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree. You can also pass a BeautifulSoup object into one of the methods defined in Modifying the tree, just as you would a Tag. This lets you do things like combine two parsed documents: doc = BeautifulSoup ( “ Here’s the footer INSERT FOOTER HERE ” , “xml” ) doc . find ( text = “INSERT FOOTER HERE” ) . replace_with ( footer ) # ‘INSERT FOOTER HERE’ print ( doc ) # #
Since the BeautifulSoup object doesn’t correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it’s useful to look at its .name, so it’s been given the special .name “[document]”:

    soup.name
    # '[document]'
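The following is a minimal sketch, not part of the original documentation, of the memory advice above: it converts a NavigableString to a plain str and checks the special name of the BeautifulSoup object. The variable names (node, plain, and so on) are illustrative assumptions.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")

node = soup.b.string   # a NavigableString, still linked to the parse tree
plain = str(node)      # an ordinary Python str with no link to the tree

is_navigable = isinstance(node, NavigableString)
plain_is_exact_str = type(plain) is str
root_name = soup.name  # the BeautifulSoup object's special name

print(is_navigable, plain_is_exact_str, root_name)
```

Keeping only plain after parsing lets the tree be garbage-collected once soup goes out of scope.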
Navigating the tree

Here’s the “Three sisters” HTML document again:

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')

I’ll use this as an example to show you how to move from one part of a document to another.

Going down

Tags may contain strings and other tags. These elements are the tag’s children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag’s children.

Note that Beautiful Soup strings don’t support any of these attributes, because a string can’t have children.

Navigating using tag names

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <head> tag, just say soup.head:

    soup.head
    # <head><title>The Dormouse's story</title></head>

    soup.title
    # <title>The Dormouse's story</title>

You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first <b> tag beneath the <body> tag:

    soup.body.b
    # <b>The Dormouse's story</b>

Using a tag name as an attribute will give you only the first tag by that name:

    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you need to get all the <a> tags, or anything more complicated than the first tag with a certain name, you’ll need to use one of the methods described in Searching the tree, such as find_all():

    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.contents and .children

A tag’s children are available in a list called .contents:

    head_tag = soup.head
    head_tag
    # <head><title>The Dormouse's story</title></head>

    head_tag.contents
    # [<title>The Dormouse's story</title>]

    title_tag = head_tag.contents[0]
    title_tag
    # <title>The Dormouse's story</title>
    title_tag.contents
    # ["The Dormouse's story"]

The BeautifulSoup object itself has children. In this case, the <html> tag is the child of the BeautifulSoup object:

    len(soup.contents)
    # 1
    soup.contents[0].name
    # 'html'

A string does not have .contents, because it can’t contain anything:

    text = title_tag.contents[0]
    text.contents
    # AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting them as a list, you can iterate over a tag’s children using the .children generator:

    for child in title_tag.children:
        print(child)
    # The Dormouse's story

If you want to modify a tag’s children, use the methods described in Modifying the tree. Don’t modify the .contents list directly: that can lead to problems that are subtle and difficult to spot.

.descendants

The .contents and .children attributes only consider a tag’s direct children. For instance, the <head> tag has a single direct child, the <title> tag:

    head_tag.contents
    # [<title>The Dormouse's story</title>]

But the <title> tag itself has a child: the string “The Dormouse’s story”. There’s a sense in which that string is also a child of the <head> tag.
The .descendants attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on:

    for child in head_tag.descendants:
        print(child)
    # <title>The Dormouse's story</title>
    # The Dormouse's story

The <head> tag has only one child, but it has two descendants: the <title> tag and the <title> tag’s child. The BeautifulSoup object only has one direct child (the <html> tag), but it has a whole lot of descendants:

    len(list(soup.children))
    # 1
    len(list(soup.descendants))
    # 26

.string

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

    title_tag.string
    # 'The Dormouse's story'

If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child:

    head_tag.contents
    # [<title>The Dormouse's story</title>]

    head_tag.string
    # 'The Dormouse's story'

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:

    print(soup.html.string)
    # None

.strings and stripped_strings

If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:

    for string in soup.strings:
        print(repr(string))
    # '\n'
    # "The Dormouse's story"
    # '\n'
    # '\n'
    # "The Dormouse's story"
    # '\n'
    # 'Once upon a time there were three little sisters; and their names were\n'
    # 'Elsie'
    # ',\n'
    # 'Lacie'
    # ' and\n'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '\n'
    # '...'
    # '\n'

These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead:

    for string in soup.stripped_strings:
        print(repr(string))
    # "The Dormouse's story"
    # "The Dormouse's story"
    # 'Once upon a time there were three little sisters; and their names were'
    # 'Elsie'
    # ','
    # 'Lacie'
    # 'and'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '...'

Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of strings is removed.

Going up

Continuing the “family tree” analogy, every tag and every string has a parent: the tag that contains it.

.parent

You can access an element’s parent with the .parent attribute. In the example “three sisters” document, the <head> tag is the parent of the <title> tag:

    title_tag = soup.title
    title_tag
    # <title>The Dormouse's story</title>
    title_tag.parent
    # <head><title>The Dormouse's story</title></head>

The title string itself has a parent: the <title> tag that contains it:

    title_tag.string.parent
    # <title>The Dormouse's story</title>

The parent of a top-level tag like <html> is the BeautifulSoup object itself:

    html_tag = soup.html
    type(html_tag.parent)
    # <class 'bs4.BeautifulSoup'>

And the .parent of a BeautifulSoup object is defined as None:

    print(soup.parent)
    # None

.parents

You can iterate over all of an element’s parents with .parents. This example uses .parents to travel from an <a> tag buried deep within the document, to the very top of the document:

    link = soup.a
    link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    for parent in link.parents:
        print(parent.name)
    # p
    # body
    # html
    # [document]

Going sideways

Consider a simple document like this:

    sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
    print(sibling_soup.prettify())
    # <a>
    #  <b>
    #   text1
    #  </b>
    #  <c>
    #   text2
    #  </c>
    # </a>

The <b> tag and the <c> tag are at the same level: they’re both direct children of the same tag. We call them siblings. When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.

.next_sibling and .previous_sibling

You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:

    sibling_soup.b.next_sibling
    # <c>text2</c>

    sibling_soup.c.previous_sibling
    # <b>text1</b>

The <b> tag has a .next_sibling, but no .previous_sibling, because there’s nothing before the <b> tag on the same level of the tree. For the same reason, the <c> tag has a .previous_sibling but no .next_sibling:

    print(sibling_soup.b.previous_sibling)
    # None
    print(sibling_soup.c.next_sibling)
    # None

The strings “text1” and “text2” are not siblings, because they don’t have the same parent:

    sibling_soup.b.string
    # 'text1'

    print(sibling_soup.b.string.next_sibling)
    # None

In real documents, the .next_sibling or .previous_sibling of a tag will usually be a string containing whitespace. Going back to the “three sisters” document:

    # <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    # <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    # <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

You might think that the .next_sibling of the first <a> tag would be the second <a> tag. But actually, it’s a string: the comma and newline that separate the first <a> tag from the second:

    link = soup.a
    link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    link.next_sibling
    # ',\n'

The second <a> tag is actually the .next_sibling of the comma:

    link.next_sibling.next_sibling
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

.next_siblings and .previous_siblings

You can iterate over a tag’s siblings with .next_siblings or .previous_siblings:

    for sibling in soup.a.next_siblings:
        print(repr(sibling))
    # ',\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # ' and\n'
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    # ';\nand they lived at the bottom of a well.'

    for sibling in soup.find(id="link3").previous_siblings:
        print(repr(sibling))
    # ' and\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # ',\n'
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    # 'Once upon a time there were three little sisters; and their names were\n'

Going back and forth

Take a look at the beginning of the “three sisters” document:

    # <html><head><title>The Dormouse's story</title></head>
    # <p class="title"><b>The Dormouse's story</b></p>

An HTML parser takes this string of characters and turns it into a series of events: “open an <html> tag”, “open a <head> tag”, “open a <title> tag”, “add a string”, “close the <title> tag”, “open a <p> tag”, and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.

.next_element and .previous_element

The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as .next_sibling, but it’s usually drastically different.

Here’s the final <a> tag in the “three sisters” document. Its .next_sibling is a string: the conclusion of the sentence that was interrupted by the start of the <a> tag:

    last_a_tag = soup.find("a", id="link3")
    last_a_tag
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

    last_a_tag.next_sibling
    # ';\nand they lived at the bottom of a well.'

But the .next_element of that <a> tag, the thing that was parsed immediately after the <a> tag, is not the rest of that sentence: it’s the word “Tillie”:

    last_a_tag.next_element
    # 'Tillie'

That’s because in the original markup, the word “Tillie” appeared before that semicolon. The parser encountered an <a> tag, then the word “Tillie”, then the closing </a> tag, then the semicolon and rest of the sentence. The semicolon is on the same level as the <a> tag, but the word “Tillie” was encountered first.

The .previous_element attribute is the exact opposite of .next_element. It points to whatever element was parsed immediately before this one:

    last_a_tag.previous_element
    # ' and\n'
    last_a_tag.previous_element.next_element
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

.next_elements and .previous_elements

You should get the idea by now. You can use these iterators to move forward or backward in the document as it was parsed:

    for element in last_a_tag.next_elements:
        print(repr(element))
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '\n'
    # <p class="story">...</p>
    # '...'
    # '\n'
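To make the difference between sibling navigation and parse-order navigation concrete, here is a small self-contained sketch. It uses a shortened, hypothetical document (not the full “three sisters” markup) and illustrative variable names.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical miniature document with two links inside one paragraph.
html = '<p>names: <a id="link1">Elsie</a>, <a id="link2">Lacie</a>; the end.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.find("a", id="link1")

sibling = first.next_sibling   # same tree level: the string after </a>
element = first.next_element   # parse order: the text inside the <a> tag

# Walk upward from the link to the top of the document.
parent_names = [parent.name for parent in first.parents]

# Keep only the Tag siblings, skipping the strings in between.
following_links = [s for s in first.next_siblings if isinstance(s, Tag)]

print(repr(sibling), repr(element), parent_names, following_links)
```

Filtering with isinstance(s, Tag) is one way to skip the whitespace and punctuation strings that sit between sibling tags in real documents.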
Modifying the tree

Beautiful Soup’s main strength is in searching the parse tree, but you can also modify the tree and write your changes as a new HTML or XML document.

Changing tag names and attributes

I covered this earlier, in Attributes, but it bears repeating. You can rename a tag, change the values of its attributes, add new attributes, and delete attributes:

    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
    tag = soup.b

    tag.name = "blockquote"
    tag['class'] = 'verybold'
    tag['id'] = 1
    tag
    # <blockquote class="verybold" id="1">Extremely bold</blockquote>

    del tag['class']
    del tag['id']
    tag
    # <blockquote>Extremely bold</blockquote>
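As a hedged illustration of what happens after an attribute is deleted, the sketch below renames a tag, sets and removes attributes, and then reads a missing attribute with .get(), which returns None instead of raising KeyError. The id value "quote-1" is an arbitrary choice for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

tag.name = "blockquote"   # rename the tag in place
tag["id"] = "quote-1"     # add a new attribute
del tag["class"]          # remove an existing attribute

rendered = str(tag)
missing = tag.get("class")  # None: the attribute no longer exists

try:
    tag["class"]            # dictionary-style access raises for a missing key
    raised = False
except KeyError:
    raised = True

print(rendered, missing, raised)
```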
Modifying .string

If you set a tag’s .string attribute to a new string, the tag’s contents are replaced with that string:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')

    tag = soup.a
    tag.string = "New link text."
    tag
    # <a href="http://example.com/">New link text.</a>

Be careful: if the tag contained other tags, they and all their contents will be destroyed.

append()

You can add to a tag’s contents with Tag.append(). It works just like calling .append() on a Python list:

    soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
    soup.a.append("Bar")

    soup
    # <a>FooBar</a>
    soup.a.contents
    # ['Foo', 'Bar']

extend()

Starting in Beautiful Soup 4.7.0, Tag also supports a method called .extend(), which adds every element of a list to a Tag, in order:

    soup = BeautifulSoup("<a>Soup</a>", 'html.parser')
    soup.a.extend(["'s", " ", "on"])

    soup
    # <a>Soup's on</a>
    soup.a.contents
    # ['Soup', "'s", ' ', 'on']

NavigableString() and .new_tag()

If you need to add a string to a document, no problem: you can pass a Python string in to append(), or you can call the NavigableString constructor:

    from bs4 import NavigableString
    soup = BeautifulSoup("<b></b>", 'html.parser')
    tag = soup.b
    tag.append("Hello")
    new_string = NavigableString(" there")
    tag.append(new_string)
    tag
    # <b>Hello there</b>
    tag.contents
    # ['Hello', ' there']

If you want to create a comment or some other subclass of NavigableString, just call the constructor:

    from bs4 import Comment
    new_comment = Comment("Nice to see you.")
    tag.append(new_comment)
    tag
    # <b>Hello there<!--Nice to see you.--></b>
    tag.contents
    # ['Hello', ' there', 'Nice to see you.']

(This is a new feature in Beautiful Soup 4.4.0.)

What if you need to create a whole new tag? The best solution is to call the factory method BeautifulSoup.new_tag():

    soup = BeautifulSoup("<b></b>", 'html.parser')
    original_tag = soup.b

    new_tag = soup.new_tag("a", href="http://www.example.com")
    original_tag.append(new_tag)
    original_tag
    # <b><a href="http://www.example.com"></a></b>

    new_tag.string = "Link text."
    original_tag
    # <b><a href="http://www.example.com">Link text.</a></b>
Only the first argument, the tag name, is required.

insert()

Tag.insert() is just like Tag.append(), except the new element doesn’t necessarily go at the end of its parent’s .contents. It’ll be inserted at whatever numeric position you say. It works just like .insert() on a Python list:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.a

    tag.insert(1, "but did not endorse ")
    tag
    # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
    tag.contents
    # ['I linked to ', 'but did not endorse ', <i>example.com</i>]

insert_before() and insert_after()

The insert_before() method inserts tags or strings immediately before something else in the parse tree:

    soup = BeautifulSoup("<b>leave</b>", 'html.parser')
    tag = soup.new_tag("i")
    tag.string = "Don't"
    soup.b.string.insert_before(tag)
    soup.b
    # <b><i>Don't</i>leave</b>

The insert_after() method inserts tags or strings immediately following something else in the parse tree:

    div = soup.new_tag('div')
    div.string = 'ever'
    soup.b.i.insert_after(" you ", div)
    soup.b
    # <b><i>Don't</i> you <div>ever</div>leave</b>
    soup.b.contents
    # [<i>Don't</i>, ' you ', <div>ever</div>, 'leave']

clear()

Tag.clear() removes the contents of a tag:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.a

    tag.clear()
    tag
    # <a href="http://example.com/"></a>

extract()

PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    a_tag = soup.a

    i_tag = soup.i.extract()

    a_tag
    # <a href="http://example.com/">I linked to </a>

    i_tag
    # <i>example.com</i>

    print(i_tag.parent)
    # None

At this point you effectively have two parse trees: one rooted at the BeautifulSoup object you used to parse the document, and one rooted at the tag that was extracted. You can go on to call extract on a child of the element you extracted:

    my_string = i_tag.string.extract()
    my_string
    # 'example.com'

    print(my_string.parent)
    # None
    i_tag
    # <i></i>

decompose()

Tag.decompose() removes a tag from the tree, then completely destroys it and its contents:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    a_tag = soup.a
    i_tag = soup.i

    i_tag.decompose()
    a_tag
    # <a href="http://example.com/">I linked to </a>

The behavior of a decomposed Tag or NavigableString is not defined and you should not use it for anything. If you’re not sure whether something has been decomposed, you can check its .decomposed property (new in Beautiful Soup 4.9.0):

    i_tag.decomposed
    # True

    a_tag.decomposed
    # False

replace_with()

PageElement.replace_with() removes a tag or string from the tree, and replaces it with one or more tags or strings of your choice:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    a_tag = soup.a

    new_tag = soup.new_tag("b")
    new_tag.string = "example.com"
    a_tag.i.replace_with(new_tag)

    a_tag
    # <a href="http://example.com/">I linked to <b>example.com</b></a>

    bold_tag = soup.new_tag("b")
    bold_tag.string = "example"
    i_tag = soup.new_tag("i")
    i_tag.string = "net"
    a_tag.b.replace_with(bold_tag, ".", i_tag)

    a_tag
    # <a href="http://example.com/">I linked to <b>example</b>.<i>net</i></a>

replace_with() returns the tag or string that got replaced, so that you can examine it or add it back to another part of the tree.

The ability to pass multiple arguments into replace_with() is new in Beautiful Soup 4.10.0.

wrap()

PageElement.wrap() wraps an element in the tag you specify. It returns the new wrapper:

    soup = BeautifulSoup("<p>I wish I was bold.</p>", 'html.parser')
    soup.p.string.wrap(soup.new_tag("b"))
    # <b>I wish I was bold.</b>

    soup.p.wrap(soup.new_tag("div"))
    # <div><p><b>I wish I was bold.</b></p></div>
This method is new in Beautiful Soup 4.0.5.

unwrap()

Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever’s inside that tag. It’s good for stripping out markup:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    a_tag = soup.a

    a_tag.i.unwrap()
    a_tag
    # <a href="http://example.com/">I linked to example.com</a>

Like replace_with(), unwrap() returns the tag that was replaced.

smooth()

After calling a bunch of methods that modify the parse tree, you may end up with two or more NavigableString objects next to each other. Beautiful Soup doesn’t have any problems with this, but since it can’t happen in a freshly parsed document, you might not expect behavior like the following:

    soup = BeautifulSoup("<p>A one</p>", 'html.parser')
    soup.p.append(", a two")

    soup.p.contents
    # ['A one', ', a two']

    print(soup.p.encode())
    # b'<p>A one, a two</p>'

    print(soup.p.prettify())
    # <p>
    #  A one
    #  , a two
    # </p>

You can call Tag.smooth() to clean up the parse tree by consolidating adjacent strings:

    soup.smooth()

    soup.p.contents
    # ['A one, a two']

    print(soup.p.prettify())
    # <p>
    #  A one, a two
    # </p>
This method is new in Beautiful Soup 4.8.0.
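As a rough sketch of how the methods above interact, the example below strips an inline tag with unwrap(), which leaves two adjacent strings behind, and then merges them with smooth(). It assumes Beautiful Soup 4.8.0 or later (for smooth()); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I linked to <i>example.com</i></p>", "html.parser")

soup.i.unwrap()                # drop the <i> tag but keep its text
before = list(soup.p.contents) # two separate strings sit side by side

soup.smooth()                  # merge adjacent strings throughout the tree
after = list(soup.p.contents)

print(before, after)
```

The serialized markup is the same before and after smooth(); only the shape of .contents changes, which matters if later code indexes into it or relies on .string.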
Output

Pretty-printing

The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string:

    markup = '<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    soup.prettify()
    # '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'

    print(soup.prettify())
    # <html>
    #  <head>
    #  </head>
    #  <body>
    #   <a href="http://example.com/">
    #    I linked to
    #    <i>
    #     example.com
    #    </i>
    #   </a>
    #  </body>
    # </html>

You can call prettify() on the top-level BeautifulSoup object, or on any of its Tag objects:

    print(soup.a.prettify())
    # <a href="http://example.com/">
    #  I linked to
    #  <i>
    #   example.com
    #  </i>
    # </a>

Since it adds whitespace (in the form of newlines), prettify() changes the meaning of an HTML document and should not be used to reformat one. The goal of prettify() is to help you visually understand the structure of the documents you work with.

Non-pretty printing

If you just want a string, with no fancy formatting, you can call str() on a BeautifulSoup object, or on a Tag within it:

    str(soup)
    # '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'

    str(soup.a)
    # '<a href="http://example.com/">I linked to <i>example.com</i></a>'

In Python 3, the str() function returns a Unicode string. See Encodings for other options. You can also call encode() to get a bytestring, and decode() to get Unicode.

Output formatters

If you give Beautiful Soup a document that contains HTML entities like “&ldquo;”, they’ll be converted to Unicode characters:

    soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", 'html.parser')
    str(soup)
    # '“Dammit!” he said.'

If you then convert the document to a bytestring, the Unicode characters will be encoded as UTF-8. You won’t get the HTML entities back:

    soup.encode("utf8")
    # b'\xe2\x80\x9cDammit!\xe2\x80\x9d he said.'

By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into “&amp;”, “&lt;”, and “&gt;”, so that Beautiful Soup doesn’t inadvertently generate invalid HTML or XML:

    soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>", 'html.parser')
    soup.p
    # <p>The law firm of Dewey, Cheatem, &amp; Howe</p>

    soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')
    soup.a
    # <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>

You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode(). Beautiful Soup recognizes five possible values for formatter.

The default is formatter="minimal". Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML:

    french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
    soup = BeautifulSoup(french, 'html.parser')
    print(soup.prettify(formatter="minimal"))
    # <p>
    #  Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
    # </p>

If you pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML entities whenever possible:

    print(soup.prettify(formatter="html"))
    # <p>
    #  Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
    # </p>

If you pass in formatter="html5", it’s similar to formatter="html", but Beautiful Soup will omit the closing slash in HTML void tags like “br”:

    br = BeautifulSoup("<br>", 'html.parser').br

    print(br.encode(formatter="html"))
    # b'<br/>'

    print(br.encode(formatter="html5"))
    # b'<br>'

In addition, any attributes whose values are the empty string will become HTML-style boolean attributes:

    option = BeautifulSoup('<option selected=""></option>', 'html.parser').option
    print(option.encode(formatter="html"))
    # b'<option selected=""></option>'

    print(option.encode(formatter="html5"))
    # b'<option selected></option>'

(This behavior is new as of Beautiful Soup 4.10.0.)

If you pass in formatter=None, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples:

    print(soup.prettify(formatter=None))
    # <p>
    #  Il a dit <<Sacré bleu!>>
    # </p>

    link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')
    print(link_soup.a.encode(formatter=None))
    # b'<a href="http://example.com/?foo=val1&bar=val2">A link</a>'

If you need more sophisticated control over your output, you can use Beautiful Soup’s Formatter class. Here’s a formatter that converts strings to uppercase, whether they occur in a text node or in an attribute value:

    from bs4.formatter import HTMLFormatter

    def uppercase(text):
        return text.upper()

    formatter = HTMLFormatter(uppercase)

    print(soup.prettify(formatter=formatter))
    # <p>
    #  IL A DIT <<SACRÉ BLEU!>>
    # </p>

    print(link_soup.a.prettify(formatter=formatter))
    # <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2">
    #  A LINK
    # </a>

Here’s a formatter that increases the indentation when pretty-printing:

    formatter = HTMLFormatter(indent=8)
    print(link_soup.a.prettify(formatter=formatter))
    # <a href="http://example.com/?foo=val1&bar=val2">
    #         A link
    # </a>

Subclassing HTMLFormatter or XMLFormatter will give you even more control over the output. For example, Beautiful Soup sorts the attributes in every tag by default:

    attr_soup = BeautifulSoup(b'<p z="1" m="2" a="3"></p>', 'html.parser')
    print(attr_soup.p.encode())
    # b'<p a="3" m="2" z="1"></p>'

To turn this off, you can override the Formatter.attributes() method in a subclass; this method controls which attributes are output and in what order. This implementation also filters out the attribute called “m” whenever it appears:

    class UnsortedAttributes(HTMLFormatter):
        def attributes(self, tag):
            for k, v in tag.attrs.items():
                if k == 'm':
                    continue
                yield k, v

    print(attr_soup.p.encode(formatter=UnsortedAttributes()))
    # b'<p z="1" a="3"></p>'
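A minimal, self-contained variant of the idea above is sketched here: a formatter subclass that keeps attributes in the order they are stored rather than sorted, without filtering anything out. The class name SourceOrderFormatter is a hypothetical choice for this example, and decode() is used so the results are plain strings.

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter


class SourceOrderFormatter(HTMLFormatter):
    """Hypothetical formatter that emits attributes in stored order."""

    def attributes(self, tag):
        # tag.attrs is a dict, so this follows insertion order
        # instead of the alphabetical order used by default.
        for key, value in tag.attrs.items():
            yield key, value


soup = BeautifulSoup('<p z="1" m="2" a="3"></p>', "html.parser")

default_out = soup.p.decode()                                  # sorted attributes
custom_out = soup.p.decode(formatter=SourceOrderFormatter())   # stored order

print(default_out)
print(custom_out)
```

Overriding only attributes() leaves the rest of the formatter's behavior as constructed, so the same subclass can also be passed to prettify() or encode().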
One last caveat: if you create a CData object, the text inside that object is always presented exactly as it appears, with no formatting. Beautiful Soup will call your entity substitution function, just in case you’ve written a custom function that counts all the strings in the document or something, but it will ignore the return value:

    from bs4.element import CData
    soup = BeautifulSoup("<a></a>", 'html.parser')
    soup.a.string = CData("one < three")
    print(soup.a.prettify(formatter="html"))
    # <a>
    #  <![CDATA[one < three]]>
    # </a>

get_text()

If you only want the human-readable text inside a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:

    markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
    soup = BeautifulSoup(markup, 'html.parser')

    soup.get_text()
    # '\nI linked to example.com\n'
    soup.i.get_text()
    # 'example.com'

You can specify a string to be used to join the bits of text together:

    soup.get_text("|")
    # '\nI linked to |example.com|\n'

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

    soup.get_text("|", strip=True)
    # 'I linked to|example.com'

But at that point you might want to use the .stripped_strings generator instead, and process the text yourself:

    [text for text in soup.stripped_strings]
    # ['I linked to', 'example.com']

As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of