How To Install Beautiful Soup In Python 3.9 (Windows 10) | Import Beautifulsoup Python Quick Answer

Let's look at the topic article “import beautifulsoup python – How To Install Beautiful Soup In Python 3.9 (Windows 10)”. Category: Top 716 tips update new. This article was compiled by https://ph.taphoamini.com from many sources on the internet. The video by ProgrammingFever has 6,768 views and 81 likes.

To use Beautiful Soup, you need to install it: $ pip install beautifulsoup4. Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it commonly saves programmers hours or days of work. By default, Beautiful Soup supports the HTML parser included in Python's standard library (html.parser), but it also supports third-party parsers such as lxml and html5lib. lxml is not bundled with Beautiful Soup; to check whether you already have it, open IDLE and try import lxml. If the import fails, install it with $ pip install lxml or $ apt-get install python-lxml.

Installing Beautiful Soup using setup.py
  1. Download the Beautiful Soup source archive and unzip it to a folder (for example, BeautifulSoup).
  2. Open a command prompt and navigate to that folder: cd BeautifulSoup
  3. Run python setup.py install to install Beautiful Soup on your system.

Video description for the topic import beautifulsoup python:

How to install Beautiful Soup in Python 3.9 on Windows 10
In this video I will show you how to install Beautiful Soup in Python 3.9.
By the end of this video you will understand how to install Beautiful Soup on Windows 10 using the pip command.
The pip command is used to install Python packages. In this video we will use pip to install the latest version of Beautiful Soup on Windows 10 with Python 3.9.
What is Beautiful Soup?
Here you find the documentation and you can also read how to install Beautiful Soup
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Beautiful Soup (4.9.3) latest version link
https://pypi.org/project/beautifulsoup4/
python (3.9.5) latest version link
https://www.python.org/downloads/
See some more details on the topic import beautifulsoup python here:

  • beautifulsoup4 · PyPI (Source: pypi.org, published 3/29/2022). Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for …
  • Beautiful Soup – Installation – Tutorialspoint (Source: www.tutorialspoint.com, published 5/11/2021). We are going to install the BeautifulSoup 4 library (also known as BS4), … As BeautifulSoup is not a standard python library, we need to install it first.
  • I cannot import beautiful soup on python – Stack Overflow (Source: stackoverflow.com, published 2/25/2021). I installed Beautiful Soup library, and it seems to be well set up as there is the bs4 folder in C:\Python33\Lib\site-packages .
  • Beautiful Soup 4.9.0 documentation – Crummy (Source: www.crummy.com, published 10/1/2021). Beautiful Soup 4 is published through PyPi, so if you can't install it with the system … from bs4 import BeautifulSoup with open("index.html") as fp: soup …
  • Web Scraping Techniques in Python with Beautiful Soup (Source: code.tutsplus.com, published 2/16/2022). The package name is beautifulsoup4. It works on both Python 2 and Python 3. $ pip install beautifulsoup4 …
  • PyMOTM: Beautiful Soup 4 (Part I) – Viblo (Source: viblo.asia, published 8/23/2022). Installation · Via APT: sudo apt-get install python-bs4 · Via PIP: sudo pip install beautifulsoup4 · Via EasyInstall: sudo easy_install beautifulsoup4 · From source: Go to …

About the video on import beautifulsoup python

  • Author: ProgrammingFever
  • Views: 6,768
  • Likes: 81
  • Video upload date: 2021/05/11
  • Video URL: https://www.youtube.com/watch?v=jOfbQnnllCw

How do I add BeautifulSoup to Python?

Installing Beautiful Soup using setup.py
  1. Download the Beautiful Soup source archive and unzip it to a folder (for example, BeautifulSoup).
  2. Open a command prompt and navigate to that folder: cd BeautifulSoup
  3. Run python setup.py install to install Beautiful Soup on your system.

What is from bs4 import BeautifulSoup?

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Does BeautifulSoup come with Python?

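No. Beautiful Soup is not part of Python's standard library, so it has to be installed separately (for example, with pip install beautifulsoup4). What does come with Python is the HTML parser that Beautiful Soup supports by default; it also supports external third-party parsers such as lxml and html5lib.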

How do I know if BeautifulSoup is installed?

To verify the installation, perform the following steps:
  1. Open up the Python interpreter in a terminal by using the following command: python.
  2. Now, we can issue a simple import statement to see whether we have successfully installed Beautiful Soup or not by using the following command: from bs4 import BeautifulSoup.
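
The same check can be scripted instead of typed into the interpreter. The following is a minimal sketch, not an official tool: the helper name is_installed is made up for illustration, and it relies only on the standard-library importlib module.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the current interpreter.

    `is_installed` is a hypothetical helper name used only in this sketch.
    """
    return importlib.util.find_spec(module_name) is not None


if is_installed("bs4"):
    import bs4

    print("Beautiful Soup is installed, version", bs4.__version__)
else:
    print("Beautiful Soup is not installed; try: pip install beautifulsoup4")
```

Run it with the same python command you use for your project, since each interpreter has its own set of installed packages.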

How do I add BeautifulSoup to PyCharm?

Install beautiful soup using PyCharm

Navigate to File >> Settings (Ctrl + Alt + S) and choose Project Interpreter. Click the plus (+) sign to add a new package. Type beautifulsoup, and choose beautifulsoup4 and Install package.

How do I download pip for Python?

How To Install PIP to Manage Python Packages On Windows
  1. Step 1: Download PIP get-pip.py.
  2. Step 2: Installing PIP on Windows.
  3. Step 3: Verify Installation.
  4. Step 4: Add Pip to Windows Environment Variables.
  5. Step 5: Configuration.

How do I install BeautifulSoup on Windows pip?

Steps to Install Beautifulsoup using PIP
  1. Step 1: Open your command prompt.
  2. Step 2: Check the version of Python by typing the following command: python --version
  3. Step 3: Install Beautiful Soup using pip: pip install beautifulsoup4

How do I use BeautifulSoup for web scraping?

We will be using requests and BeautifulSoup for scraping and parsing the data.
  1. Step 1: Find the URL of the webpage that you want to scrape. …
  2. Step 3: Write the code to get the content of the selected elements. …
  3. Step 4: Store the data in the required format.

How do I download a bs4 module in Python?

To install Beautifulsoup on Windows, Linux, or any operating system, one would need pip package. To check how to install pip on your operating system, check out – PIP Installation – Windows || Linux. Wait and relax, Beautifulsoup would be installed shortly.

Is BeautifulSoup faster than selenium?

Comparing selenium vs BeautifulSoup allows you to see that BeautifulSoup is more user-friendly and allows you to learn faster and begin web scraping smaller tasks easier. Selenium on the other hand is important when the target website has a lot of java elements in its code.

How do I install pandas in Python?

Installing Pandas on Windows
  1. Open up the command prompt so you can install Pandas.
  2. Enter the command “pip install pandas” on the terminal. …
  3. Launch the installer that you downloaded from the website, and click the “Next” button.
  4. Next, to agree to the license agreement, press the “I Agree” button.

What is BeautifulSoup Python?

Beautiful Soup is a Python library that is used for web scraping purposes to pull the data out of HTML and XML files. It creates a parse tree from page source code that can be used to extract data in a hierarchical and more readable manner.

How do you use web scraping in Python?

To extract data using web scraping with python, you need to follow these basic steps:
  1. Find the URL that you want to scrape.
  2. Inspecting the Page.
  3. Find the data you want to extract.
  4. Write the code.
  5. Run the code and extract the data.
  6. Store the data in the required format.
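
As a rough illustration of the last three steps (write the code, extract the data, store it), here is a small sketch. It assumes the page has already been downloaded; the html string, the function names extract_links and to_csv, and the CSV column names are all invented for this example.

```python
import csv
import io

from bs4 import BeautifulSoup

# Stand-in for a downloaded page. In a real script this text would come
# from something like requests.get(url).text.
html = """
<html><body>
  <ul id="books">
    <li><a href="/book/1">First Book</a></li>
    <li><a href="/book/2">Second Book</a></li>
  </ul>
</body></html>
"""


def extract_links(markup):
    """Collect the anchor text and href of every <a> tag in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for a in soup.find_all("a"):
        rows.append({"text": a.get_text(strip=True), "href": a.get("href")})
    return rows


def to_csv(rows):
    """Serialize the extracted rows as CSV text (the 'required format' here)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["text", "href"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_links(html)
print(to_csv(rows))
```

CSV is only one possible output format; the same rows could just as well be written to JSON or a database.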

Python Programming Tutorials

Web scraping and parsing with Beautiful Soup 4 Introduction

Welcome to a tutorial on web scraping with Beautiful Soup 4. Beautiful Soup is a Python library aimed at helping programmers who are trying to scrape data from websites.

To use Beautiful Soup, you need to install it: $ pip install beautifulsoup4. Beautiful Soup also relies on a parser. Python's built-in html.parser works out of the box, but this tutorial uses lxml, which is a separate package. You may already have it, but you should check (open IDLE and attempt to import lxml). If not, do: $ pip install lxml or $ apt-get install python-lxml.

To begin, we need HTML. I have created an example page for us to work with.

To begin, we need to import Beautiful Soup and urllib, and grab source code:

    import bs4 as bs
    import urllib.request

    source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()

Then, we create the “soup.” This is a beautiful soup object:

    soup = bs.BeautifulSoup(source, 'lxml')

If you do print(soup) and print(source), the output looks the same, but source is just the raw response data, while soup is an object that we can actually interact with, by tag, like so:

    # title of the page
    print(soup.title)

    # get attributes:
    print(soup.title.name)

    # get values:
    print(soup.title.string)

    # beginning navigation:
    print(soup.title.parent.name)

    # getting specific values:
    print(soup.p)

Finding paragraph tags is a fairly common task. In the case above, we're just finding the first one. What if we wanted to find them all?

    print(soup.find_all('p'))

We can also iterate through them:

    for paragraph in soup.find_all('p'):
        print(paragraph.string)
        print(str(paragraph.text))

The difference between .string and .text is that .string returns a NavigableString object, while .text returns an ordinary Python string containing all the text inside the tag. Note that if the paragraph contains child tags alongside its text, .string has no single string to return, so we get None.
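
A short sketch makes the difference concrete. The two-paragraph HTML snippet below is invented for illustration and is parsed with the built-in html.parser so that no extra package is needed.

```python
from bs4 import BeautifulSoup, NavigableString

# Invented example markup: one plain paragraph, one with a nested <b> tag.
soup = BeautifulSoup(
    "<p>Plain text</p><p>Text with <b>bold</b> child</p>", "html.parser"
)
simple, nested = soup.find_all("p")

# A paragraph with a single text child: .string gives that NavigableString.
print(isinstance(simple.string, NavigableString))
print(simple.string)

# A paragraph mixing text and a child tag: .string is None,
# while .text still joins all the text pieces together.
print(nested.string)
print(nested.text)
```

When you only need the visible text and do not care about the tree structure, .text (or .get_text()) is usually the more forgiving choice.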

Another common task is to grab links. For example:

    for url in soup.find_all('a'):
        print(url.get('href'))

In this case, if we just grabbed the .text from the tag, you'd get the anchor text, but we actually want the link itself. That's why we're using .get('href') to get the true URL.
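
Links on real pages are often relative (for example /about), and some anchor tags have no href at all. The sketch below shows one way to handle both cases using urljoin from the standard library. The base_url value and the HTML snippet are hypothetical and stand in for the page being scraped.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical address of the page the markup was downloaded from.
base_url = "https://example.com/docs/"
html = '<a href="/about">About</a><a href="page2.html">Next</a><a>No link</a>'

soup = BeautifulSoup(html, "html.parser")
links = []
for a in soup.find_all("a"):
    href = a.get("href")  # .get() returns None when the attribute is missing
    if href is not None:
        # Resolve relative links against the page address.
        links.append(urljoin(base_url, href))

print(links)
```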

Finally, you may just want to grab text. You can use .get_text() on a Beautiful Soup object, including the full soup:

print(soup.get_text())

This concludes the introduction to Beautiful Soup. In the next tutorial, we're going to cover navigating a page's elements to get more specifically what you want.

beautifulsoup4

Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

Quick start

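    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup("<p>Some<b>bad<i>HTML")
    >>> print(soup.prettify())
    <html>
     <body>
      <p>
       Some
       <b>
        bad
        <i>
         HTML
        </i>
       </b>
      </p>
     </body>
    </html>
    >>> soup.find(text="bad")
    'bad'
    >>> soup.i
    <i>HTML</i>
    #
    >>> soup = BeautifulSoup("<tag1>Some<tag2/>bad<tag3>XML", "xml")
    #
    >>> print(soup.prettify())
    <?xml version="1.0" encoding="utf-8"?>
    <tag1>
     Some
     <tag2/>
     bad
     <tag3>
      XML
     </tag3>
    </tag1>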

To go beyond the basics, comprehensive documentation is available.

Note on Python 2 sunsetting

Beautiful Soup’s support for Python 2 was discontinued on December 31, 2020: one year after the sunset date for Python 2 itself. From this point onward, new Beautiful Soup development will exclusively target Python 3. The final release of Beautiful Soup 4 to support Python 2 was 4.9.3.

Supporting the project

If you use Beautiful Soup as part of your professional work, please consider a Tidelift subscription. This will support many of the free software projects your organization depends on, not just Beautiful Soup.

If you use Beautiful Soup for personal projects, the best way to say thank you is to read Tool Safety, a zine I wrote about what Beautiful Soup has taught me about software development.

Building the documentation

The bs4/doc/ directory contains full documentation in Sphinx format. Run make html in that directory to create HTML documentation.

Running the unit tests

Beautiful Soup supports unit test discovery using Pytest:

Beautiful Soup – Installation

As BeautifulSoup is not a standard python library, we need to install it first. We are going to install the BeautifulSoup 4 library (also known as BS4), which is the latest one.

To isolate our working environment so as not to disturb the existing setup, let us first create a virtual environment.

Creating a virtual environment (optional)

A virtual environment allows us to create an isolated working copy of python for a specific project without affecting the outside setup.

The best way to install any Python package is with pip. If pip is not installed already (you can check with “pip --version” at your command or shell prompt), you can install it with the commands below −

Linux environment

$ sudo apt-get install python3-pip

Windows environment

To install pip in windows, do the following −

Download the get-pip.py from https://bootstrap.pypa.io/get-pip.py or from the github to your computer.

Open the command prompt and navigate to the folder containing get-pip.py file.

Run the following command −

>python get-pip.py

That’s it, pip is now installed in your windows machine.

You can verify that pip is installed by running the command below −

>pip --version
pip 19.2.3 from c:\users\yadur\appdata\local\programs\python\python37\lib\site-packages\pip (python 3.7)

Installing virtual environment

Run the below command in your command prompt −

>pip install virtualenv

Below command will create a virtual environment (“myEnv”) in your current directory −

>virtualenv myEnv

To activate your virtual environment, run the following command −

>myEnv\Scripts\activate

After activation, the prompt shows “(myEnv)” as a prefix, which tells us that we are inside the virtual environment “myEnv”.

To come out of virtual environment, run deactivate.

(myEnv) C:\Users\yadur>deactivate
C:\Users\yadur>
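
Besides looking at the prompt prefix, you can ask Python itself whether it is running inside a virtual environment. This is a small sketch that assumes an environment created with venv or a recent virtualenv release, where sys.prefix differs from sys.base_prefix; the function name in_virtual_environment is made up for this example.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter runs from a virtual environment.

    Assumption: the environment was made by venv or a recent virtualenv,
    which point sys.prefix at the environment directory while
    sys.base_prefix keeps pointing at the base installation.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Inside a virtual environment:", in_virtual_environment())
print("Interpreter in use:", sys.executable)
```

Packages installed with pip go into whichever interpreter sys.executable points at, so this is a quick way to confirm where beautifulsoup4 will end up.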

As our virtual environment is ready, now let us install beautifulsoup.

Installing BeautifulSoup

As BeautifulSoup is not a standard library, we need to install it. We are going to use the BeautifulSoup 4 package (known as bs4).

Linux Machine

To install bs4 on Debian or Ubuntu linux using system package manager, run the below command −

$ sudo apt-get install python-bs4 (for python 2.x)
$ sudo apt-get install python3-bs4 (for python 3.x)

You can install bs4 using easy_install or pip (in case you find problem in installing using system packager).

$ easy_install beautifulsoup4
$ pip install beautifulsoup4

(You may need to use easy_install3 or pip3 respectively if you’re using python3)

Windows Machine

To install beautifulsoup4 in windows is very simple, especially if you have pip already installed.

>pip install beautifulsoup4

So now beautifulsoup4 is installed in our machine. Let us talk about some problems encountered after installation.

Problems after installation

On a Windows machine, you might run into errors caused by the wrong version of the package being installed, mainly these two −

ImportError “No module named HTMLParser”: you are running the Python 2 version of the code under Python 3.

ImportError “No module named html.parser”: you are running the Python 3 version of the code under Python 2.

The best way out of both situations is to completely remove the existing installation and then install Beautiful Soup again.
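
Both errors come down to a mismatch between the interpreter and the copy of the library it finds, so it helps to confirm which Python you are actually running. The sketch below uses only the standard library; html_parser_module_name is a hypothetical helper that simply encodes the module names mentioned in the two error messages.

```python
import sys


def html_parser_module_name(major_version):
    """Name of the standard-library HTML parser module for a Python major version.

    Python 2 ships it as HTMLParser, Python 3 as html.parser.
    """
    return "html.parser" if major_version >= 3 else "HTMLParser"


major = sys.version_info[0]
print("Running Python", sys.version.split()[0])
print("Standard-library HTML parser module:", html_parser_module_name(major))
```

If the module named here differs from the one in the ImportError, the installed Beautiful Soup code was built for the other major version of Python.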

If you get the SyntaxError “Invalid syntax” on the line ROOT_TAG_NAME = u'[document]', then you need to convert the Python 2 code to Python 3, either by installing the package −

$ python3 setup.py install

or by manually running python’s 2 to 3 conversion script on the bs4 directory −

$ 2to3-3.2 -w bs4

Installing a Parser

By default, Beautiful Soup supports the HTML parser included in Python’s standard library, however it also supports many external third party python parsers like lxml parser or html5lib parser.

To install lxml or html5lib parser, use the command −

Linux Machine

$ apt-get install python-lxml
$ apt-get install python-html5lib

Windows Machine

>pip install lxml
>pip install html5lib

Generally, users choose lxml for speed. Installing lxml or html5lib is recommended if you are using an older version of Python 2 (before 2.7.3) or Python 3 (before 3.2.2), because Python's built-in HTML parser does not work well in those older versions.
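
If a script has to run on machines where lxml may or may not be installed, one option is to choose the parser name at run time. This is a sketch under that assumption, not something Beautiful Soup requires; pick_parser is a hypothetical helper and the preference order is just an example.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser from `preferred`.

    Falls back to "html.parser", which ships with Python.
    """
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


print("Parser that would be used:", pick_parser())
```

The returned name can be passed as the second argument to the constructor, for example BeautifulSoup(markup, pick_parser()). Keep in mind that different parsers can build different trees from invalid markup, so results may vary between machines.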

Running Beautiful Soup

It is time to test our Beautiful Soup package in one of the html pages (taking web page – https://www.tutorialspoint.com/index.htm, you can choose any-other web page you want) and extract some information from it.

In the below code, we are trying to extract the title from the webpage −

    from bs4 import BeautifulSoup
    import requests

    url = "https://www.tutorialspoint.com/index.htm"
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")
    print(soup.title)

Output

<title>H2O, Colab, Theano, Flutter, KNime, Mean.js, Weka, Solidity, Org.Json, AWS QuickSight, JSON.Simple, Jackson Annotations, Passay, Boon, MuleSoft, Nagios, Matplotlib, Java NIO, PyTorch, SLF4J, Parallax Scrolling, Java Cryptography</title>

One common task is to extract all the URLs within a webpage. For that we just need to add the below line of code −

    for link in soup.find_all('a'):
        print(link.get('href'))

Output

https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/about/about_careers.htm
https://www.tutorialspoint.com/questions/index.php
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/codingground.htm
https://www.tutorialspoint.com/current_affairs.htm
https://www.tutorialspoint.com/upsc_ias_exams.htm
https://www.tutorialspoint.com/tutor_connect/index.php
https://www.tutorialspoint.com/whiteboard.htm
https://www.tutorialspoint.com/netmeeting.php
https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/tutorialslibrary.htm
https://www.tutorialspoint.com/videotutorials/index.php
https://store.tutorialspoint.com
https://www.tutorialspoint.com/gate_exams_tutorials.htm
https://www.tutorialspoint.com/html_online_training/index.asp
https://www.tutorialspoint.com/css_online_training/index.asp
https://www.tutorialspoint.com/3d_animation_online_training/index.asp
https://www.tutorialspoint.com/swift_4_online_training/index.asp
https://www.tutorialspoint.com/blockchain_online_training/index.asp
https://www.tutorialspoint.com/reactjs_online_training/index.asp
https://www.tutorix.com
https://www.tutorialspoint.com/videotutorials/top-courses.php
https://www.tutorialspoint.com/the_full_stack_web_development/index.asp
….
….
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/free_web_graphics.htm
https://www.tutorialspoint.com/online_file_conversion.htm
https://www.tutorialspoint.com/netmeeting.php
https://www.tutorialspoint.com/free_online_whiteboard.htm
http://www.tutorialspoint.com
https://www.facebook.com/tutorialspointindia
https://plus.google.com/u/0/+tutorialspoint
http://www.twitter.com/tutorialspoint
http://www.linkedin.com/company/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://www.tutorialspoint.com/index.htm
/about/about_privacy.htm#cookies
/about/faq.htm
/about/about_helping.htm
/about/contact_us.htm

Similarly, we can extract useful information using beautifulsoup4.

Now let us understand more about “soup” in above example.

I cannot import beautiful soup on python

I installed Beautiful Soup library, and it seems to be well set up as there is the bs4 folder in C:\Python33\Lib\site-packages .

(I changed the name into bs4 before installation, and it went the same after install)

But when I type in from bs4 import beautifulsoup in the code, it says there is no such library.

And I don’t see any beautifulsoup.py or something. Isn’t there supposed to be one?

I’m really confused. Anyone help please?
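
One likely cause, assuming the installation itself is fine, is the spelling of the name being imported: Python names are case-sensitive, and the class exported by the bs4 package is BeautifulSoup, not beautifulsoup. There is also no beautifulsoup.py to look for, because bs4 is a package folder and the class lives in its __init__.py. A quick sketch to check this on your own machine:

```python
import bs4

# The package exposes the class under its capitalized name only.
print(hasattr(bs4, "BeautifulSoup"))
print(hasattr(bs4, "beautifulsoup"))

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>It works</p>", "html.parser")
print(soup.p.string)
```

If import bs4 itself fails, the problem is more likely that the package was installed for a different Python interpreter than the one running the script.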

Beautiful Soup Documentation — Beautiful Soup 4.9.0 documentation

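Quick Start

Here's an HTML document I'll be using as an example throughout this document. It's part of a story from Alice in Wonderland:

    html_doc = """<html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

Running the “three sisters” document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')

    print(soup.prettify())
    # <html>
    #  <head>
    #   <title>
    #    The Dormouse's story
    #   </title>
    #  </head>
    #  <body>
    #   <p class="title">
    #    <b>
    #     The Dormouse's story
    #    </b>
    #   </p>
    #   <p class="story">
    #    Once upon a time there were three little sisters; and their names were
    #    <a class="sister" href="http://example.com/elsie" id="link1">
    #     Elsie
    #    </a>
    #    ,
    #    <a class="sister" href="http://example.com/lacie" id="link2">
    #     Lacie
    #    </a>
    #    and
    #    <a class="sister" href="http://example.com/tillie" id="link3">
    #     Tillie
    #    </a>
    #    ; and they lived at the bottom of a well.
    #   </p>
    #   <p class="story">
    #    ...
    #   </p>
    #  </body>
    # </html>

Here are some simple ways to navigate that data structure:

    soup.title
    # <title>The Dormouse's story</title>

    soup.title.name
    # u'title'

    soup.title.string
    # u'The Dormouse's story'

    soup.title.parent.name
    # u'head'

    soup.p
    # <p class="title"><b>The Dormouse's story</b></p>

    soup.p['class']
    # u'title'

    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.find(id="link3")
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

One common task is extracting all the URLs found within a page's <a> tags:

    for link in soup.find_all('a'):
        print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie

Another common task is extracting all the text from a page:

    print(soup.get_text())
    # The Dormouse's story
    #
    # The Dormouse's story
    #
    # Once upon a time there were three little sisters; and their names were
    # Elsie,
    # Lacie and
    # Tillie;
    # and they lived at the bottom of a well.
    #
    # ...

Does this look like what you need? If so, read on.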

Installing Beautiful Soup¶ If you’re using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager: $ apt – get install python3 – bs4 Beautiful Soup 4 is published through PyPi, so if you can’t install it with the system packager, you can install it with easy_install or pip . The package name is beautifulsoup4 . Make sure you use the right version of pip or easy_install for your Python version (these may be named pip3 and easy_install3 respectively). $ easy_install beautifulsoup4 $ pip install beautifulsoup4 (The BeautifulSoup package is not what you want. That’s the previous major release, Beautiful Soup 3. Lots of software uses BS3, so it’s still available, but if you’re writing new code you should install beautifulsoup4 .) If you don’t have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py . $ python setup.py install If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application’s codebase, and use Beautiful Soup without installing it at all. I use Python 3.8 to develop Beautiful Soup, but it should work with other recent versions. Installing a parser¶ Beautiful Soup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands: $ apt – get install python – lxml $ easy_install lxml $ pip install lxml Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. 
Depending on your setup, you might install html5lib with one of these commands:

    $ apt-get install python-html5lib
    $ easy_install html5lib
    $ pip install html5lib

This table summarizes the advantages and disadvantages of each parser library:

Parser: Python’s html.parser
Typical usage: BeautifulSoup(markup, "html.parser")
Advantages: Batteries included; decent speed; lenient (as of Python 3.2)
Disadvantages: Not as fast as lxml, less lenient than html5lib

Parser: lxml’s HTML parser
Typical usage: BeautifulSoup(markup, "lxml")
Advantages: Very fast; lenient
Disadvantages: External C dependency

Parser: lxml’s XML parser
Typical usage: BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
Advantages: Very fast; the only currently supported XML parser
Disadvantages: External C dependency

Parser: html5lib
Typical usage: BeautifulSoup(markup, "html5lib")
Advantages: Extremely lenient; parses pages the same way a web browser does; creates valid HTML5
Disadvantages: Very slow; external Python dependency

If you can, I recommend you install and use lxml for speed. If you’re using a very old version of Python (earlier than 3.2.2), it’s essential that you install lxml or html5lib. Python’s built-in HTML parser is just not very good in those old versions.

Note that if a document is invalid, different parsers will generate different Beautiful Soup trees for it. See Differences between parsers for details.
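The recommendation above (prefer lxml, but work with whatever is installed) could be sketched as follows. The helper name make_soup and the preference order are assumptions for illustration only; FeatureNotFound is the exception bs4 raises when a parser requested by name is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Returns a (soup, parser_name) pair so the caller knows which parser
    actually built the tree.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


if __name__ == "__main__":
    soup, parser_name = make_soup("<p>Hello</p>")
    print(parser_name, soup.p.string)
```

Because html.parser ships with Python, leaving it last in the list means the loop always finds something to use.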

Making the soup

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    soup = BeautifulSoup("<html>a web page</html>", 'html.parser')

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

    print(BeautifulSoup("<html><head></head><body>Sacr&eacute; bleu!</body></html>", "html.parser"))
    # <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See Parsing XML.)
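As a small check of the two input styles described above, the sketch below parses the same markup once from a string and once from a file-like object. An in-memory io.StringIO stands in for an open file here, which is an assumption made so the example needs no file on disk.

```python
import io

from bs4 import BeautifulSoup

markup = "<p>Sacr&eacute; bleu!</p>"

# Parse directly from a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

# Parse from any object with a .read() method, such as an open file.
soup_from_file = BeautifulSoup(io.StringIO(markup), "html.parser")

# Both routes produce the same tree, with the entity converted to "é".
print(soup_from_string.p.string)
print(str(soup_from_string) == str(soup_from_file))
```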

Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.

Tag

A Tag object corresponds to an XML or HTML tag in the original document:

    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
    tag = soup.b
    type(tag)
    # <class 'bs4.element.Tag'>

Tags have a lot of attributes and methods, and I’ll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.

Name

Every tag has a name, accessible as .name:

    tag.name
    # 'b'

If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:

    tag.name = "blockquote"
    tag
    # <blockquote class="boldest">Extremely bold</blockquote>
Attributes

A tag may have any number of attributes. The tag <b id="boldest"> has an attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary:

    tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
    tag['id']
    # 'boldest'

You can access that dictionary directly as .attrs:

    tag.attrs
    # {'id': 'boldest'}

You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:

    tag['id'] = 'verybold'
    tag['another-attribute'] = 1
    tag
    # <b another-attribute="1" id="verybold">bold</b>

    del tag['id']
    del tag['another-attribute']
    tag
    # <b>bold</b>

    tag['id']
    # KeyError: 'id'
    tag.get('id')
    # None

Multi-valued attributes

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

    css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
    css_soup.p['class']
    # ['body']

    css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
    css_soup.p['class']
    # ['body', 'strikeout']

If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

    id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
    id_soup.p['id']
    # 'my id'

When you turn a tag back into a string, multiple attribute values are consolidated:

    rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')
    rel_soup.a['rel']
    # ['index']
    rel_soup.a['rel'] = ['index', 'contents']
    print(rel_soup.p)
    # <p>Back to the <a rel="index contents">homepage</a></p>

You can disable this by passing multi_valued_attributes=None as a keyword argument into the BeautifulSoup constructor:

    no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser', multi_valued_attributes=None)
    no_list_soup.p['class']
    # 'body strikeout'

You can use get_attribute_list to get a value that’s always a list, whether or not it’s a multi-valued attribute:

    id_soup.p.get_attribute_list('id')
    # ["my id"]

If you parse a document as XML, there are no multi-valued attributes:

    xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
    xml_soup.p['class']
    # 'body strikeout'

Again, you can configure this using the multi_valued_attributes argument:

    class_is_multi = {'*': 'class'}
    xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
    xml_soup.p['class']
    # ['body', 'strikeout']

You probably won’t need to do this, but if you do, use the defaults as a guide. They implement the rules described in the HTML specification:

    from bs4.builder import builder_registry
    builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES

NavigableString

A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
    tag = soup.b
    tag.string
    # 'Extremely bold'
    type(tag.string)
    # <class 'bs4.element.NavigableString'>

A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a Unicode string with str:

    unicode_string = str(tag.string)
    unicode_string
    # 'Extremely bold'
    type(unicode_string)
    # <class 'str'>

You can’t edit a string in place, but you can replace one string with another, using replace_with():

    tag.string.replace_with("No longer bold")
    tag
    # <b class="boldest">No longer bold</b>

NavigableString supports most of the features described in Navigating the tree and Searching the tree, but not all of them. In particular, since a string can’t contain anything (the way a tag may contain a string or another tag), strings don’t support the .contents or .string attributes, or the find() method.

If you want to use a NavigableString outside of Beautiful Soup, you should call str() on it to turn it into a normal Python Unicode string. If you don’t, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you’re done using Beautiful Soup. This is a big waste of memory.

BeautifulSoup

The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.
You can also pass a BeautifulSoup object into one of the methods defined in Modifying the tree, just as you would a Tag. This lets you do things like combine two parsed documents:

    doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
    footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
    doc.find(text="INSERT FOOTER HERE").replace_with(footer)
    # 'INSERT FOOTER HERE'
    print(doc)
    # <?xml version="1.0" encoding="utf-8"?>
    # <document><content/><footer>Here's the footer</footer></document>

Since the BeautifulSoup object doesn’t correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it’s useful to look at its .name, so it’s been given the special .name “[document]”:

    soup.name
    # '[document]'
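To see the object kinds and the attribute rules from this section side by side, here is a minimal sketch. The one-line sample markup is an assumption made up for the illustration; everything else uses the attributes and methods described above.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical one-line document containing a tag, a string and a comment.
soup = BeautifulSoup(
    '<p class="body strikeout" id="my id">Hi<!--note--></p>', "html.parser"
)
p = soup.p

# The children of <p> are a NavigableString and a Comment.
kinds = [type(node).__name__ for node in p.contents]

# "class" is multi-valued, so it comes back as a list; "id" is not.
classes = p["class"]
raw_id = p["id"]
id_as_list = p.get_attribute_list("id")

# .get() returns None for a missing attribute instead of raising KeyError.
missing = p.get("title")

print(kinds, classes, raw_id, id_as_list, missing, soup.name)
```

Using .get() for optional attributes avoids wrapping every lookup in a try/except for KeyError.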

Navigating the tree

Here’s the “Three sisters” HTML document again:

    html_doc = """<html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')

I’ll use this as an example to show you how to move from one part of a document to another.

Going down

Tags may contain strings and other tags. These elements are the tag’s children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag’s children.

Note that Beautiful Soup strings don’t support any of these attributes, because a string can’t have children.

Navigating using tag names

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <head> tag, just say soup.head:

    soup.head
    # <head><title>The Dormouse's story</title></head>

    soup.title
    # <title>The Dormouse's story</title>

You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first <b> tag beneath the <body> tag:

    soup.body.b
    # <b>The Dormouse's story</b>

Using a tag name as an attribute will give you only the first tag by that name:

    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you need to get all the <a> tags, or anything more complicated than the first tag with a certain name, you’ll need to use one of the methods described in Searching the tree, such as find_all():

    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.contents and .children

A tag’s children are available in a list called .contents:

    head_tag = soup.head
    head_tag
    # <head><title>The Dormouse's story</title></head>

    head_tag.contents
    # [<title>The Dormouse's story</title>]

    title_tag = head_tag.contents[0]
    title_tag
    # <title>The Dormouse's story</title>
    title_tag.contents
    # ["The Dormouse's story"]

The BeautifulSoup object itself has children. In this case, the <html> tag is the child of the BeautifulSoup object:

    len(soup.contents)
    # 1
    soup.contents[0].name
    # 'html'

A string does not have .contents, because it can’t contain anything:

    text = title_tag.contents[0]
    text.contents
    # AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting them as a list, you can iterate over a tag’s children using the .children generator:

    for child in title_tag.children:
        print(child)
    # The Dormouse's story

If you want to modify a tag’s children, use the methods described in Modifying the tree. Don’t modify the .contents list directly: that can lead to problems that are subtle and difficult to spot.

.descendants

The .contents and .children attributes only consider a tag’s direct children. For instance, the <head> tag has a single direct child, the <title> tag:

    head_tag.contents
    # [<title>The Dormouse's story</title>]

But the <title> tag itself has a child: the string “The Dormouse’s story”. There’s a sense in which that string is also a child of the <head> tag. The .descendants attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on:

    for child in head_tag.descendants:
        print(child)
    # <title>The Dormouse's story</title>
    # The Dormouse's story

The <head> tag has only one child, but it has two descendants: the <title> tag and the <title> tag’s child. The BeautifulSoup object only has one direct child (the <html> tag), but it has a whole lot of descendants:

    len(list(soup.children))
    # 1
    len(list(soup.descendants))
    # 26

.string

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

    title_tag.string
    # "The Dormouse's story"

If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child:

    head_tag.contents
    # [<title>The Dormouse's story</title>]

    head_tag.string
    # "The Dormouse's story"

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:

    print(soup.html.string)
    # None

.strings and stripped_strings

If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:

    for string in soup.strings:
        print(repr(string))
    # '\n'
    # "The Dormouse's story"
    # '\n'
    # '\n'
    # "The Dormouse's story"
    # '\n'
    # 'Once upon a time there were three little sisters; and their names were\n'
    # 'Elsie'
    # ',\n'
    # 'Lacie'
    # ' and\n'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '\n'
    # '...'
    # '\n'

These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead:

    for string in soup.stripped_strings:
        print(repr(string))
    # "The Dormouse's story"
    # "The Dormouse's story"
    # 'Once upon a time there were three little sisters; and their names were'
    # 'Elsie'
    # ','
    # 'Lacie'
    # 'and'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '...'

Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of strings is removed.

Going up

Continuing the “family tree” analogy, every tag and every string has a parent: the tag that contains it.

.parent

You can access an element’s parent with the .parent attribute. In the example “three sisters” document, the <head> tag is the parent of the <title> tag:

    title_tag = soup.title
    title_tag
    # <title>The Dormouse's story</title>
    title_tag.parent
    # <head><title>The Dormouse's story</title></head>

The title string itself has a parent: the <title> tag that contains it:

    title_tag.string.parent
    # <title>The Dormouse's story</title>

The parent of a top-level tag like <html> is the BeautifulSoup object itself:

    html_tag = soup.html
    type(html_tag.parent)
    # <class 'bs4.BeautifulSoup'>

And the .parent of a BeautifulSoup object is defined as None:

    print(soup.parent)
    # None

.parents

You can iterate over all of an element’s parents with .parents. This example uses .parents to travel from an <a> tag buried deep within the document, to the very top of the document:

    link = soup.a
    link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    for parent in link.parents:
        print(parent.name)
    # p
    # body
    # html
    # [document]

Going sideways

Consider a simple document like this:

    sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
    print(sibling_soup.prettify())
    # <a>
    #  <b>
    #   text1
    #  </b>
    #  <c>
    #   text2
    #  </c>
    # </a>

The <b> tag and the <c> tag are at the same level: they’re both direct children of the same tag. We call them siblings. When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.

.next_sibling and .previous_sibling

You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:

    sibling_soup.b.next_sibling
    # <c>text2</c>

    sibling_soup.c.previous_sibling
    # <b>text1</b>

The <b> tag has a .next_sibling, but no .previous_sibling, because there’s nothing before the <b> tag on the same level of the tree.
For the same reason, the <c> tag has a .previous_sibling but no .next_sibling:

    print(sibling_soup.b.previous_sibling)
    # None
    print(sibling_soup.c.next_sibling)
    # None

The strings “text1” and “text2” are not siblings, because they don’t have the same parent:

    sibling_soup.b.string
    # 'text1'

    print(sibling_soup.b.string.next_sibling)
    # None

In real documents, the .next_sibling or .previous_sibling of a tag will usually be a string containing whitespace. Going back to the “three sisters” document:

    # <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    # <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    # <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

You might think that the .next_sibling of the first <a> tag would be the second <a> tag. But actually, it’s a string: the comma and newline that separate the first <a> tag from the second:

    link = soup.a
    link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    link.next_sibling
    # ',\n'

The second <a> tag is actually the .next_sibling of the comma:

    link.next_sibling.next_sibling
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

.next_siblings and .previous_siblings

You can iterate over a tag’s siblings with .next_siblings or .previous_siblings:

    for sibling in soup.a.next_siblings:
        print(repr(sibling))
    # ',\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # ' and\n'
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    # ';\nand they lived at the bottom of a well.'

    for sibling in soup.find(id="link3").previous_siblings:
        print(repr(sibling))
    # ' and\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # ',\n'
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    # 'Once upon a time there were three little sisters; and their names were\n'

Going back and forth

Take a look at the beginning of the “three sisters” document:

    # <html><head><title>The Dormouse's story</title></head>
    # <p class="title"><b>The Dormouse's story</b></p>

An HTML parser takes this string of characters and turns it into a series of events: “open an <html> tag”, “open a <head> tag”, “open a <title> tag”, “add a string”, “close the <title> tag”, “open a <p> tag”, and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.

.next_element and .previous_element

The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as .next_sibling, but it’s usually drastically different.

Here’s the final <a> tag in the “three sisters” document. Its .next_sibling is a string: the conclusion of the sentence that was interrupted by the start of the <a> tag:

    last_a_tag = soup.find("a", id="link3")
    last_a_tag
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

    last_a_tag.next_sibling
    # ';\nand they lived at the bottom of a well.'

But the .next_element of that <a> tag, the thing that was parsed immediately after the <a> tag, is not the rest of that sentence: it’s the word “Tillie”:

    last_a_tag.next_element
    # 'Tillie'

That’s because in the original markup, the word “Tillie” appeared before that semicolon. The parser encountered an <a> tag, then the word “Tillie”, then the closing </a> tag, then the semicolon and rest of the sentence. The semicolon is on the same level as the <a> tag, but the word “Tillie” was encountered first.

The .previous_element attribute is the exact opposite of .next_element. It points to whatever element was parsed immediately before this one:

    last_a_tag.previous_element
    # ' and\n'
    last_a_tag.previous_element.next_element
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

.next_elements and .previous_elements

You should get the idea by now. You can use these iterators to move forward or backward in the document as it was parsed:

    for element in last_a_tag.next_elements:
        print(repr(element))
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '\n'
    # <p class="story">...</p>
    # '...'
    # '\n'

Modifying the tree

Beautiful Soup’s main strength is in searching the parse tree, but you can also modify the tree and write your changes as a new HTML or XML document.

Changing tag names and attributes

I covered this earlier, in Attributes, but it bears repeating. You can rename a tag, change the values of its attributes, add new attributes, and delete attributes:

    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
    tag = soup.b

    tag.name = "blockquote"
    tag['class'] = 'verybold'
    tag['id'] = 1
    tag
    # <blockquote class="verybold" id="1">Extremely bold</blockquote>

    del tag['class']
    del tag['id']
    tag
    # <blockquote>Extremely bold</blockquote>

Modifying .string

If you set a tag’s .string attribute to a new string, the tag’s contents are replaced with that string:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')

    tag = soup.a
    tag.string = "New link text."
    tag
    # <a href="http://example.com/">New link text.</a>

Be careful: if the tag contained other tags, they and all their contents will be destroyed.

append()

You can add to a tag’s contents with Tag.append(). It works just like calling .append() on a Python list:

    soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
    soup.a.append("Bar")

    soup
    # <a>FooBar</a>
    soup.a.contents
    # ['Foo', 'Bar']

extend()

Starting in Beautiful Soup 4.7.0, Tag also supports a method called .extend(), which adds every element of a list to a Tag, in order:

    soup = BeautifulSoup("<a>Soup</a>", 'html.parser')
    soup.a.extend(["'s", " ", "on"])

    soup
    # <a>Soup's on</a>
    soup.a.contents
    # ['Soup', "'s", ' ', 'on']

NavigableString() and .new_tag()

If you need to add a string to a document, no problem: you can pass a Python string in to append(), or you can call the NavigableString constructor:

    from bs4 import NavigableString

    soup = BeautifulSoup("<b></b>", 'html.parser')
    tag = soup.b
    tag.append("Hello")
    new_string = NavigableString(" there")
    tag.append(new_string)
    tag
    # <b>Hello there</b>
    tag.contents
    # ['Hello', ' there']

If you want to create a comment or some other subclass of NavigableString, just call the constructor:

    from bs4 import Comment
    new_comment = Comment("Nice to see you.")
    tag.append(new_comment)
    tag
    # <b>Hello there<!--Nice to see you.--></b>
    tag.contents
    # ['Hello', ' there', 'Nice to see you.']

(This is a new feature in Beautiful Soup 4.4.0.)

What if you need to create a whole new tag? The best solution is to call the factory method BeautifulSoup.new_tag():

    soup = BeautifulSoup("<b></b>", 'html.parser')
    original_tag = soup.b

    new_tag = soup.new_tag("a", href="http://www.example.com")
    original_tag.append(new_tag)
    original_tag
    # <b><a href="http://www.example.com"></a></b>

    new_tag.string = "Link text."
    original_tag
    # <b><a href="http://www.example.com">Link text.</a></b>

Only the first argument, the tag name, is required.

insert()

Tag.insert() is just like Tag.append(), except the new element doesn’t necessarily go at the end of its parent’s .contents. It’ll be inserted at whatever numeric position you say. It works just like .insert() on a Python list:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.a

    tag.insert(1, "but did not endorse ")
    tag
    # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
    tag.contents
    # ['I linked to ', 'but did not endorse ', <i>example.com</i>]

insert_before() and insert_after()

The insert_before() method inserts tags or strings immediately before something else in the parse tree:

    soup = BeautifulSoup("<b>leave</b>", 'html.parser')
    tag = soup.new_tag("i")
    tag.string = "Don't"
    soup.b.string.insert_before(tag)
    soup.b
    # <b><i>Don't</i>leave</b>

The insert_after() method inserts tags or strings immediately following something else in the parse tree:

    div = soup.new_tag('div')
    div.string = 'ever'
    soup.b.i.insert_after(" you ", div)
    soup.b
    # <b><i>Don't</i> you <div>ever</div>leave</b>
    soup.b.contents
    # [<i>Don't</i>, ' you ', <div>ever</div>, 'leave']

clear()

Tag.clear() removes the contents of a tag:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.a

    tag.clear()
    tag
    # <a href="http://example.com/"></a>

extract()

PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    a_tag = soup.a

    i_tag = soup.i.extract()

    a_tag
    # <a href="http://example.com/">I linked to</a>

    i_tag
    # <i>example.com</i>

    print(i_tag.parent)
    # None

At this point you effectively have two parse trees: one rooted at the BeautifulSoup object you used to parse the document, and one rooted at the tag that was extracted. You can go on to call extract on a child of the element you extracted:

    my_string = i_tag.string.extract()
    my_string
    # 'example.com'

    print(my_string.parent)
    # None
    i_tag
    # <i></i>

decompose()

Tag.decompose() removes a tag from the tree, then completely destroys it and its contents:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    a_tag = soup.a
    i_tag = soup.i

    i_tag.decompose()
    a_tag
    # <a href="http://example.com/">I linked to</a>

The behavior of a decomposed Tag or NavigableString is not defined and you should not use it for anything. If you’re not sure whether something has been decomposed, you can check its .decomposed property (new in Beautiful Soup 4.9.0):

    i_tag.decomposed
    # True

    a_tag.decomposed
    # False

replace_with()

PageElement.replace_with() removes a tag or string from the tree, and replaces it with one or more tags or strings of your choice:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    a_tag = soup.a

    new_tag = soup.new_tag("b")
    new_tag.string = "example.com"
    a_tag.i.replace_with(new_tag)

    a_tag
    # <a href="http://example.com/">I linked to <b>example.com</b></a>

    bold_tag = soup.new_tag("b")
    bold_tag.string = "example"
    i_tag = soup.new_tag("i")
    i_tag.string = "net"
    a_tag.b.replace_with(bold_tag, ".", i_tag)

    a_tag
    # <a href="http://example.com/">I linked to <b>example</b>.<i>net</i></a>

replace_with() returns the tag or string that got replaced, so that you can examine it or add it back to another part of the tree.

The ability to pass multiple arguments into replace_with() is new in Beautiful Soup 4.10.0.

wrap()

PageElement.wrap() wraps an element in the tag you specify. It returns the new wrapper:

    soup = BeautifulSoup("<p>I wish I was bold.</p>", 'html.parser')
    soup.p.string.wrap(soup.new_tag("b"))
    # <b>I wish I was bold.</b>

    soup.p.wrap(soup.new_tag("div"))
    # <div><p><b>I wish I was bold.</b></p></div>

This method is new in Beautiful Soup 4.0.5.
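The following sketch chains two of the methods just described, replace_with() and wrap(), on the sample link markup used throughout this section. The variable names are arbitrary; the point is that replace_with() hands back the removed element while wrap() hands back the new wrapper.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

# Build a <b> tag and swap it in for the <i> tag.
bold = soup.new_tag("b")
bold.string = "example.com"
removed = soup.a.i.replace_with(bold)  # returns the detached <i> tag

# Wrap the whole link in a new <div>; wrap() returns the wrapper.
wrapper = soup.a.wrap(soup.new_tag("div"))

result = str(soup)
print(result)
print(removed, removed.parent)
```

The detached <i> tag is still a usable element with no parent, so it could be appended somewhere else in the tree if needed.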
unwrap()

Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever’s inside that tag. It’s good for stripping out markup:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    a_tag = soup.a

    a_tag.i.unwrap()
    a_tag
    # <a href="http://example.com/">I linked to example.com</a>

Like replace_with(), unwrap() returns the tag that was replaced.

smooth()

After calling a bunch of methods that modify the parse tree, you may end up with two or more NavigableString objects next to each other. Beautiful Soup doesn’t have any problems with this, but since it can’t happen in a freshly parsed document, you might not expect behavior like the following:

    soup = BeautifulSoup("<p>A one</p>", 'html.parser')
    soup.p.append(", a two")

    soup.p.contents
    # ['A one', ', a two']

    print(soup.p.encode())
    # b'<p>A one, a two</p>'

    print(soup.p.prettify())
    # <p>
    #  A one
    #  , a two
    # </p>

You can call Tag.smooth() to clean up the parse tree by consolidating adjacent strings:

    soup.smooth()

    soup.p.contents
    # ['A one, a two']

    print(soup.p.prettify())
    # <p>
    #  A one, a two
    # </p>

This method is new in Beautiful Soup 4.8.0.

Output

Pretty-printing

The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string:

    markup = '<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    soup.prettify()
    # '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'

    print(soup.prettify())
    # <html>
    #  <head>
    #  </head>
    #  <body>
    #   <a href="http://example.com/">
    #    I linked to
    #    <i>
    #     example.com
    #    </i>
    #   </a>
    #  </body>
    # </html>

You can call prettify() on the top-level BeautifulSoup object, or on any of its Tag objects:

    print(soup.a.prettify())
    # <a href="http://example.com/">
    #  I linked to
    #  <i>
    #   example.com
    #  </i>
    # </a>

Since it adds whitespace (in the form of newlines), prettify() changes the meaning of an HTML document and should not be used to reformat one. The goal of prettify() is to help you visually understand the structure of the documents you work with.

Non-pretty printing

If you just want a string, with no fancy formatting, you can call str() on a BeautifulSoup object, or on a Tag within it:

    str(soup)
    # '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'

    str(soup.a)
    # '<a href="http://example.com/">I linked to <i>example.com</i></a>'

In Python 3, str() returns a Unicode string. See Encodings for other options.

You can also call encode() to get a bytestring, and decode() to get Unicode.

Output formatters

If you give Beautiful Soup a document that contains HTML entities like “&ldquo;”, they’ll be converted to Unicode characters:

    soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", 'html.parser')
    str(soup)
    # '“Dammit!” he said.'

If you then convert the document to a bytestring, the Unicode characters will be encoded as UTF-8. You won’t get the HTML entities back:

    soup.encode("utf8")
    # b'\xe2\x80\x9cDammit!\xe2\x80\x9d he said.'

By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into “&amp;”, “&lt;”, and “&gt;”, so that Beautiful Soup doesn’t inadvertently generate invalid HTML or XML:

    soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>", 'html.parser')
    soup.p
    # <p>The law firm of Dewey, Cheatem, &amp; Howe</p>

    soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')
    soup.a
    # <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>

You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode().
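As a quick check of the escaping rules just described, this sketch renders one paragraph with three formatter values through decode(). The sample sentence is an assumption invented for the illustration; "minimal", "html", and None are the formatter values named in this section.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one named entity and one bare ampersand.
soup = BeautifulSoup("<p>Caf&eacute; Dewey & Howe</p>", "html.parser")

# Default ("minimal"): only &, < and > are escaped on output.
minimal_out = soup.p.decode(formatter="minimal")

# "html": Unicode characters become named entities where possible.
html_out = soup.p.decode(formatter="html")

# None: strings are written out untouched, which may yield invalid HTML.
raw_out = soup.p.decode(formatter=None)

print(minimal_out)
print(html_out)
print(raw_out)
```

Comparing the three outputs side by side makes it easier to pick a formatter before writing a document back to disk.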
Beautiful Soup recognizes five possible values for formatter.

The default is formatter="minimal". Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML:

    french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
    soup = BeautifulSoup(french, 'html.parser')
    print(soup.prettify(formatter="minimal"))
    # <p>
    #  Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
    # </p>

If you pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML entities whenever possible:

    print(soup.prettify(formatter="html"))
    # <p>
    #  Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
    # </p>

If you pass in formatter="html5", it’s similar to formatter="html", but Beautiful Soup will omit the closing slash in HTML void tags like “br”:

    br = BeautifulSoup("<br>", 'html.parser').br

    print(br.encode(formatter="html"))
    # b'<br/>'

    print(br.encode(formatter="html5"))
    # b'<br>'

In addition, any attributes whose values are the empty string will become HTML-style boolean attributes:

    option = BeautifulSoup('<option selected=""></option>', 'html.parser').option
    print(option.encode(formatter="html"))
    # b'<option selected=""></option>'

    print(option.encode(formatter="html5"))
    # b'<option selected></option>'

(This behavior is new as of Beautiful Soup 4.10.0.)

If you pass in formatter=None, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples:

    print(soup.prettify(formatter=None))
    # <p>
    #  Il a dit <<Sacré bleu!>>
    # </p>

    link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')
    print(link_soup.a.encode(formatter=None))
    # b'<a href="http://example.com/?foo=val1&bar=val2">A link</a>'

If you need more sophisticated control over your output, you can use Beautiful Soup’s Formatter class.
Here’s a formatter that converts strings to uppercase, whether they occur in a text node or in an attribute value:

    from bs4.formatter import HTMLFormatter

    def uppercase(text):
        return text.upper()

    formatter = HTMLFormatter(uppercase)

    print(soup.prettify(formatter=formatter))
    # <p>
    #  IL A DIT <<SACRÉ BLEU!>>
    # </p>

    print(link_soup.a.prettify(formatter=formatter))
    # <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2">
    #  A LINK
    # </a>

Here’s a formatter that increases the indentation when pretty-printing:

    formatter = HTMLFormatter(indent=8)
    print(link_soup.a.prettify(formatter=formatter))
    # <a href="http://example.com/?foo=val1&amp;bar=val2">
    #         A link
    # </a>

Subclassing HTMLFormatter or XMLFormatter will give you even more control over the output. For example, Beautiful Soup sorts the attributes in every tag by default:

    attr_soup = BeautifulSoup(b'<p z="1" m="2" a="3"></p>', 'html.parser')
    print(attr_soup.p.encode())
    # <p a="3" m="2" z="1"></p>

To turn this off, you can subclass HTMLFormatter and override the Formatter.attributes() method, which controls which attributes are output and in what order. This implementation also filters out the attribute called “m” whenever it appears:

    class UnsortedAttributes(HTMLFormatter):
        def attributes(self, tag):
            for k, v in tag.attrs.items():
                if k == 'm':
                    continue
                yield k, v

    print(attr_soup.p.encode(formatter=UnsortedAttributes()))
    # <p z="1" a="3"></p>

One last caveat: if you create a CData object, the text inside that object is always presented exactly as it appears, with no formatting. Beautiful Soup will call your entity substitution function, just in case you’ve written a custom function that counts all the strings in the document or something, but it will ignore the return value:

    from bs4.element import CData
    soup = BeautifulSoup("<a></a>", 'html.parser')
    soup.a.string = CData("one < three")
    print(soup.a.prettify(formatter="html"))
    # <a>
    #  <![CDATA[one < three]]>
    # </a>

get_text()

If you only want the human-readable text inside a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:

    markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
    soup = BeautifulSoup(markup, 'html.parser')

    soup.get_text()
    # '\nI linked to example.com\n'
    soup.i.get_text()
    # 'example.com'

You can specify a string to be used to join the bits of text together:

    soup.get_text("|")
    # '\nI linked to |example.com|\n'

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

    soup.get_text("|", strip=True)
    # 'I linked to|example.com'

But at that point you might want to use the .stripped_strings generator instead, and process the text yourself:

    [text for text in soup.stripped_strings]
    # ['I linked to', 'example.com']

As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are generally not considered to be ‘text’, since those tags are not part of the human-visible content of the page.

As of Beautiful Soup version 4.10.0, you can call get_text(), .strings, or .stripped_strings on a NavigableString object. It will either return the object itself, or nothing, so the only reason to do this is when you’re iterating over a mixed list.

Specifying the parser to use

If you just need to parse some HTML, you can dump the markup into the BeautifulSoup constructor, and it’ll probably be fine. Beautiful Soup will pick a parser for you and parse the data. But there are a few additional arguments you can pass in to the constructor to change which parser is used.

The first argument to the BeautifulSoup constructor is a string or an open filehandle: the markup you want parsed. The second argument is how you’d like the markup parsed.
If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser. You can override this by specifying one of the following:

What type of markup you want to parse. Currently supported are "html", "xml", and "html5".

The name of the parser library you want to use. Currently supported options are "lxml", "html5lib", and "html.parser" (Python's built-in HTML parser).

The section Installing a parser contrasts the supported parsers.

If you don't have an appropriate parser installed, Beautiful Soup will ignore your request and pick a different parser. Right now, the only supported XML parser is lxml. If you don't have lxml installed, asking for an XML parser won't give you one, and asking for "lxml" won't work either.

Differences between parsers

Beautiful Soup presents the same interface to a number of different parsers, but each parser is different. Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers. Here's a short document, parsed as HTML using the parser that comes with Python:

    BeautifulSoup("<a><b/></a>", "html.parser")
    # <a><b></b></a>

Since a standalone <b/> tag is not valid HTML, html.parser turns it into a <b></b> tag pair.

Here's the same document parsed as XML (running this requires that you have lxml installed). Note that the standalone <b/> tag is left alone, and that the document is given an XML declaration instead of being put into an <html> tag:

    print(BeautifulSoup("<a><b/></a>", "xml"))
    # <?xml version="1.0" encoding="utf-8"?>
    # <a><b/></a>

There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won't matter. One parser will be faster than another, but they'll all give you a data structure that looks exactly like the original HTML document.
But if the document is not perfectly-formed, different parsers will give different results. Here's a short, invalid document parsed using lxml's HTML parser. Note that the <a> tag gets wrapped in <body> and <html> tags, and the dangling </p> tag is simply ignored:

    BeautifulSoup("<a></p>", "lxml")
    # <html><body><a></a></body></html>

Here's the same document parsed using html5lib:

    BeautifulSoup("<a></p>", "html5lib")
    # <html><head></head><body><a><p></p></a></body></html>

Instead of ignoring the dangling </p> tag, html5lib pairs it with an opening <p> tag. html5lib also adds an empty <head> tag; lxml didn't bother.

Here's the same document parsed with Python's built-in HTML parser:

    BeautifulSoup("<a></p>", "html.parser")
    # <a></a>

Like lxml, this parser ignores the closing </p> tag. Unlike html5lib or lxml, this parser makes no attempt to create a well-formed HTML document by adding <html> or <body> tags.

Since the document "<a></p>" is invalid, none of these techniques is the 'correct' way to handle it. The html5lib parser uses techniques that are part of the HTML5 standard, so it has the best claim on being the 'correct' way, but all three techniques are legitimate.

Differences between parsers can affect your script. If you're planning on distributing your script to other people, or running it on multiple machines, you should specify a parser in the BeautifulSoup constructor. That will reduce the chances that your users parse a document differently from the way you parse it.

Encodings

Any HTML or XML document is written in a specific encoding like ASCII or UTF-8. But when you load that document into Beautiful Soup, you'll discover it's been converted to Unicode:

    markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"
    soup = BeautifulSoup(markup, 'html.parser')
    soup.h1
    # <h1>Sacré bleu!</h1>
    soup.h1.string
    # 'Sacré bleu!'

It's not magic. (That sure would be nice.)
Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document's encoding and convert it to Unicode. The autodetected encoding is available as the .original_encoding attribute of the BeautifulSoup object:

    soup.original_encoding
    # 'utf-8'

Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes. Sometimes it guesses correctly, but only after a byte-by-byte search of the document that takes a very long time. If you happen to know a document's encoding ahead of time, you can avoid mistakes and delays by passing it to the BeautifulSoup constructor as from_encoding.

Here's a document written in ISO-8859-8. The document is so short that Unicode, Dammit can't get a lock on it, and misidentifies it as ISO-8859-7:

    markup = b"<h1>\xed\xe5\xec\xf9</h1>"
    soup = BeautifulSoup(markup, 'html.parser')
    print(soup.h1)
    # <h1>νεμω</h1>
    print(soup.original_encoding)
    # iso-8859-7

We can fix this by passing in the correct from_encoding:

    soup = BeautifulSoup(markup, 'html.parser', from_encoding="iso-8859-8")
    print(soup.h1)
    # <h1>םולש</h1>
    print(soup.original_encoding)
    # iso8859-8

If you don't know what the correct encoding is, but you know that Unicode, Dammit is guessing wrong, you can pass the wrong guesses in as exclude_encodings:

    soup = BeautifulSoup(markup, 'html.parser', exclude_encodings=["iso-8859-7"])
    print(soup.h1)
    # <h1>םולש</h1>
    print(soup.original_encoding)
    # WINDOWS-1255

Windows-1255 isn't 100% correct, but that encoding is a compatible superset of ISO-8859-8, so it's close enough. (exclude_encodings is a new feature in Beautiful Soup 4.4.0.)

In rare cases (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode may be to replace some characters with the special Unicode character "REPLACEMENT CHARACTER" (U+FFFD, �).
If Unicode, Dammit needs to do this, it will set the .contains_replacement_characters attribute to True on the UnicodeDammit or BeautifulSoup object. This lets you know that the Unicode representation is not an exact representation of the original and that some data was lost. If a document contains �, but .contains_replacement_characters is False, you'll know that the � was there originally (as it is in this paragraph) and doesn't stand in for missing data.

Output encoding

When you write out a document from Beautiful Soup, you get a UTF-8 document, even if the document wasn't in UTF-8 to begin with. Here's a document written in the Latin-1 encoding:

    markup = b'''
     <html>
      <head>
       <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
      </head>
      <body>
       <p>Sacr\xe9 bleu!</p>
      </body>
     </html>
    '''

    soup = BeautifulSoup(markup, 'html.parser')
    print(soup.prettify())
    # <html>
    #  <head>
    #   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
    #  </head>
    #  <body>
    #   <p>
    #    Sacré bleu!
    #   </p>
    #  </body>
    # </html>

Note that the <meta> tag has been rewritten to reflect the fact that the document is now in UTF-8.

If you don't want UTF-8, you can pass an encoding into prettify():

    print(soup.prettify("latin-1"))
    # <html>
    #  <head>
    #   <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
    # ...

You can also call encode() on the BeautifulSoup object, or any element in the soup, just as if it were a Python string:

    soup.p.encode("latin-1")
    # b'<p>Sacr\xe9 bleu!</p>'

    soup.p.encode("utf-8")
    # b'<p>Sacr\xc3\xa9 bleu!</p>'

Any characters that can't be represented in your chosen encoding will be converted into numeric XML entity references. Here's a document that includes the Unicode character SNOWMAN:

    markup = u"<b>\N{SNOWMAN}</b>"
    snowman_soup = BeautifulSoup(markup, 'html.parser')
    tag = snowman_soup.b

The SNOWMAN character can be part of a UTF-8 document (it looks like ☃), but there's no representation for that character in ISO-Latin-1 or ASCII, so it's converted into "&#9731;" for those encodings:

    print(tag.encode("utf-8"))
    # b'<b>\xe2\x98\x83</b>'

    print(tag.encode("latin-1"))
    # b'<b>&#9731;</b>'

    print(tag.encode("ascii"))
    # b'<b>&#9731;</b>'

Unicode, Dammit

You can use Unicode, Dammit without using Beautiful Soup. It's useful whenever you have data in an unknown encoding and you just want it to become Unicode:

    from bs4 import UnicodeDammit
    dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
    print(dammit.unicode_markup)
    # Sacré bleu!
    dammit.original_encoding
    # 'utf-8'

Unicode, Dammit's guesses will get a lot more accurate if you install one of these Python libraries: charset-normalizer, chardet, or cchardet. The more data you give Unicode, Dammit, the more accurately it will guess. If you have your own suspicions as to what the encoding might be, you can pass them in as a list:

    dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
    print(dammit.unicode_markup)
    # Sacré bleu!
    dammit.original_encoding
    # 'latin-1'

Unicode, Dammit has two special features that Beautiful Soup doesn't use.

Smart quotes

You can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML entities:

    markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
    # '<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'

    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
    # '<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'

You can also convert Microsoft smart quotes to ASCII quotes:

    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
    # '<p>I just "love" Microsoft Word\'s smart quotes</p>'

Hopefully you'll find this feature useful, but Beautiful Soup doesn't use it. Beautiful Soup prefers the default behavior, which is to convert Microsoft smart quotes to Unicode characters along with everything else:

    UnicodeDammit(markup, ["windows-1252"]).unicode_markup
    # '<p>I just “love” Microsoft Word’s smart quotes</p>'

Inconsistent encodings

Sometimes a document is mostly in UTF-8, but contains Windows-1252 characters such as (again) Microsoft smart quotes. This can happen when a website includes data from multiple sources. You can use UnicodeDammit.detwingle() to turn such a document into pure UTF-8. Here's a simple example:

    snowmen = (u"\N{SNOWMAN}" * 3)
    quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
    doc = snowmen.encode("utf8") + quote.encode("windows_1252")

This document is a mess. The snowmen are in UTF-8 and the quotes are in Windows-1252. You can display the snowmen or the quotes, but not both:

    print(doc)
    # ☃☃☃�I like snowmen!�

    print(doc.decode("windows-1252"))
    # â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”

Decoding the document as UTF-8 raises a UnicodeDecodeError, and decoding it as Windows-1252 gives you gibberish. Fortunately, UnicodeDammit.detwingle() will convert the string to pure UTF-8, allowing you to decode it to Unicode and display the snowmen and quote marks simultaneously:

    new_doc = UnicodeDammit.detwingle(doc)
    print(new_doc.decode("utf8"))
    # ☃☃☃“I like snowmen!”

UnicodeDammit.detwingle() only knows how to handle Windows-1252 embedded in UTF-8 (or vice versa, I suppose), but this is the most common case.

Note that you must know to call UnicodeDammit.detwingle() on your data before passing it into BeautifulSoup or the UnicodeDammit constructor. Beautiful Soup assumes that a document has a single encoding, whatever it might be.
If you pass it a document that contains both UTF-8 and Windows-1252, it's likely to think the whole document is Windows-1252, and the document will come out looking like â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”.

UnicodeDammit.detwingle() is new in Beautiful Soup 4.1.0.

Line numbers

The html.parser and html5lib parsers can keep track of where in the original document each Tag was found. You can access this information as Tag.sourceline (line number) and Tag.sourcepos (position of the start tag within a line):

    markup = "<p\n>Paragraph 1</p>\n    <p>Paragraph 2</p>"
    soup = BeautifulSoup(markup, 'html.parser')
    for tag in soup.find_all('p'):
        print(repr((tag.sourceline, tag.sourcepos, tag.string)))
    # (1, 0, 'Paragraph 1')
    # (3, 4, 'Paragraph 2')

Note that the two parsers mean slightly different things by sourceline and sourcepos. For html.parser, these numbers represent the position of the initial less-than sign. For html5lib, these numbers represent the position of the final greater-than sign:

    soup = BeautifulSoup(markup, 'html5lib')
    for tag in soup.find_all('p'):
        print(repr((tag.sourceline, tag.sourcepos, tag.string)))
    # (2, 0, 'Paragraph 1')
    # (3, 6, 'Paragraph 2')

You can shut off this feature by passing store_line_numbers=False into the BeautifulSoup constructor:

    markup = "<p\n>Paragraph 1</p>\n    <p>Paragraph 2</p>"
    soup = BeautifulSoup(markup, 'html.parser', store_line_numbers=False)
    print(soup.p.sourceline)
    # None

This feature is new in 4.8.1, and the parsers based on lxml don't support it.

Comparing objects for equality

Beautiful Soup says that two NavigableString or Tag objects are equal when they represent the same HTML or XML markup.
In this example, the two <b> tags are treated as equal, even though they live in different parts of the object tree, because they both look like "<b>pizza</b>":

    markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
    soup = BeautifulSoup(markup, 'html.parser')
    first_b, second_b = soup.find_all('b')
    print(first_b == second_b)
    # True

    print(first_b.previous_element == second_b.previous_element)
    # False

If you want to see whether two variables refer to exactly the same object, use is:

    print(first_b is second_b)
    # False

Copying Beautiful Soup objects

You can use copy.copy() to create a copy of any Tag or NavigableString:

    import copy
    p_copy = copy.copy(soup.p)
    print(p_copy)
    # <p>I want <b>pizza</b> and more <b>pizza</b>!</p>

The copy is considered equal to the original, since it represents the same markup as the original, but it's not the same object:

    print(soup.p == p_copy)
    # True

    print(soup.p is p_copy)
    # False

The only real difference is that the copy is completely detached from the original Beautiful Soup object tree, just as if extract() had been called on it:

    print(p_copy.parent)
    # None

This is because two different Tag objects can't occupy the same space at the same time.

Advanced parser customization

Beautiful Soup offers a number of ways to customize how the parser treats incoming HTML and XML. This section covers the most commonly used customization techniques.

Parsing only part of a document

Let's say you want to use Beautiful Soup to look at a document's <a> tags. It's a waste of time and memory to parse the entire document and then go over it again looking for <a> tags. It would be much faster to ignore everything that wasn't an <a> tag in the first place. The SoupStrainer class allows you to choose which parts of an incoming document are parsed. You just create a SoupStrainer and pass it in to the BeautifulSoup constructor as the parse_only argument.
(Note that this feature won't work if you're using the html5lib parser. If you use html5lib, the whole document will be parsed, no matter what. This is because html5lib constantly rearranges the parse tree as it works, and if some part of the document didn't actually make it into the parse tree, it'll crash. To avoid confusion, in the examples below I'll be forcing Beautiful Soup to use Python's built-in parser.)

SoupStrainer

The SoupStrainer class takes the same arguments as a typical method from Searching the tree: name, attrs, string, and **kwargs. Here are three SoupStrainer objects:

    from bs4 import SoupStrainer

    only_a_tags = SoupStrainer("a")

    only_tags_with_id_link2 = SoupStrainer(id="link2")

    def is_short_string(string):
        return string is not None and len(string) < 10

    only_short_strings = SoupStrainer(string=is_short_string)

I'm going to bring back the "three sisters" document one more time, and we'll see what the document looks like when it's parsed with these three SoupStrainer objects:

    html_doc = """<html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
    # <a class="sister" href="http://example.com/elsie" id="link1">
    #  Elsie
    # </a>
    # <a class="sister" href="http://example.com/lacie" id="link2">
    #  Lacie
    # </a>
    # <a class="sister" href="http://example.com/tillie" id="link3">
    #  Tillie
    # </a>

    print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
    # <a class="sister" href="http://example.com/lacie" id="link2">
    #  Lacie
    # </a>

    print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
    # Elsie
    # ,
    # Lacie
    # and
    # Tillie
    # ...
    #

You can also pass a SoupStrainer into any of the methods covered in Searching the tree. This probably isn't terribly useful, but I thought I'd mention it:

    soup = BeautifulSoup(html_doc, 'html.parser')
    soup.find_all(only_short_strings)
    # ['\n\n', '\n\n', 'Elsie', ',\n', 'Lacie', ' and\n', 'Tillie',
    #  '\n\n', '...', '\n']

Customizing multi-valued attributes

In an HTML document, an attribute like class is given a list of values, and an attribute like id is given a single value, because the HTML specification treats those attributes differently:

    markup = '<a class="cls1 cls2" id="id1 id2">'
    soup = BeautifulSoup(markup, 'html.parser')
    soup.a['class']
    # ['cls1', 'cls2']
    soup.a['id']
    # 'id1 id2'

You can turn this off by passing in multi_valued_attributes=None. Then all attributes will be given a single value:

    soup = BeautifulSoup(markup, 'html.parser', multi_valued_attributes=None)
    soup.a['class']
    # 'cls1 cls2'
    soup.a['id']
    # 'id1 id2'

You can customize this behavior quite a bit by passing in a dictionary for multi_valued_attributes. If you need this, look at HTMLTreeBuilder.DEFAULT_CDATA_LIST_ATTRIBUTES to see the configuration Beautiful Soup uses by default, which is based on the HTML specification.

(This is a new feature in Beautiful Soup 4.8.0.)

Handling duplicate attributes

When using the html.parser parser, you can use the on_duplicate_attribute constructor argument to customize what Beautiful Soup does when it encounters a tag that defines the same attribute more than once:

    markup = '<a href="http://url1/" href="http://url2/">'

The default behavior is to use the last value found for the tag:

    soup = BeautifulSoup(markup, 'html.parser')
    soup.a['href']
    # http://url2/

    soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='replace')
    soup.a['href']
    # http://url2/

With on_duplicate_attribute='ignore' you can tell Beautiful Soup to use the first value found and ignore the rest:

    soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='ignore')
    soup.a['href']
    # http://url1/

(lxml and html5lib always do it this way; their behavior can't be configured from within Beautiful Soup.)
If you need more, you can pass in a function that's called on each duplicate value:

    def accumulate(attributes_so_far, key, value):
        if not isinstance(attributes_so_far[key], list):
            attributes_so_far[key] = [attributes_so_far[key]]
        attributes_so_far[key].append(value)

    soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute=accumulate)
    soup.a['href']
    # ["http://url1/", "http://url2/"]

(This is a new feature in Beautiful Soup 4.9.1.)

Instantiating custom subclasses

When a parser tells Beautiful Soup about a tag or a string, Beautiful Soup will instantiate a Tag or NavigableString object to contain that information. Instead of that default behavior, you can tell Beautiful Soup to instantiate subclasses of Tag or NavigableString, subclasses you define with custom behavior:

    from bs4 import Tag, NavigableString

    class MyTag(Tag):
        pass

    class MyString(NavigableString):
        pass

    markup = "<div>some text</div>"
    soup = BeautifulSoup(markup, 'html.parser')
    isinstance(soup.div, MyTag)
    # False
    isinstance(soup.div.string, MyString)
    # False

    my_classes = {Tag: MyTag, NavigableString: MyString}
    soup = BeautifulSoup(markup, 'html.parser', element_classes=my_classes)
    isinstance(soup.div, MyTag)
    # True
    isinstance(soup.div.string, MyString)
    # True

This can be useful when incorporating Beautiful Soup into a test framework.

(This is a new feature in Beautiful Soup 4.8.1.)

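As a compact recap of the constructor options in this section, here is a minimal sketch that combines parse_only with multi_valued_attributes=None. The sample markup and the variable names are made up for illustration. The sketch assumes Beautiful Soup 4.8.0 or later and uses Python's built-in parser so that no extra library is needed.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup, invented for this illustration.
sample = (
    '<div><a class="nav main" href="/home">Home</a>'
    '<p>Some text</p>'
    '<a class="nav" href="/about">About</a></div>'
)

# Keep only the <a> tags while parsing, and store every attribute
# (including class) as a single plain string.
links_only = BeautifulSoup(
    sample,
    "html.parser",
    parse_only=SoupStrainer("a"),
    multi_valued_attributes=None,
)

hrefs = [a["href"] for a in links_only.find_all("a")]
classes = [a["class"] for a in links_only.find_all("a")]
paragraphs = links_only.find_all("p")  # the <p> tag never reached the tree

print(hrefs)
print(classes)
print(paragraphs)
```

Because the strainer discards everything except the `<a>` tags, a later search for `<p>` finds nothing. That is the intended trade-off: a smaller tree in exchange for losing the rest of the document.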
Troubleshooting

diagnose()

If you're having trouble understanding what Beautiful Soup does to a document, pass the document into the diagnose() function. (New in Beautiful Soup 4.2.0.) Beautiful Soup will print out a report showing you how different parsers handle the document, and tell you if you're missing a parser that Beautiful Soup could be using:

    from bs4.diagnose import diagnose
    with open("bad.html") as fp:
        data = fp.read()

    diagnose(data)

    # Diagnostic running on Beautiful Soup 4.2.0
    # Python version 2.7.3 (default, Aug 1 2012, 05:16:07)
    # I noticed that html5lib is not installed. Installing it may help.
    # Found lxml version 2.3.2.0
    #
    # Trying to parse your data with html.parser
    # Here's what html.parser did with the document:
    # ...

Just looking at the output of diagnose() may show you how to solve the problem. Even if not, you can paste the output of diagnose() when asking for help.

Errors when parsing a document

There are two different kinds of parse errors. There are crashes, where you feed a document to Beautiful Soup and it raises an exception, usually an HTMLParser.HTMLParseError. And there is unexpected behavior, where a Beautiful Soup parse tree looks a lot different than the document used to create it.

Almost none of these problems turn out to be problems with Beautiful Soup. This is not because Beautiful Soup is an amazingly well-written piece of software. It's because Beautiful Soup doesn't include any parsing code. Instead, it relies on external parsers. If one parser isn't working on a certain document, the best solution is to try a different parser. See Installing a parser for details and a parser comparison.

The most common parse errors are HTMLParser.HTMLParseError: malformed start tag and HTMLParser.HTMLParseError: bad end tag. These are both generated by Python's built-in HTML parser library, and the solution is to install lxml or html5lib.
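When it is unclear which parser is responsible for a surprising result, it can help to run the same markup through every parser you have installed and compare the output, much like a small version of diagnose(). The helper below is a sketch of that idea. The function name compare_parsers is hypothetical, and it relies on Beautiful Soup raising FeatureNotFound when a named parser library is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def compare_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: serialized tree}, or None where a parser is missing."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # The parser library is not installed in this environment.
            results[name] = None
    return results


# The invalid document used in "Differences between parsers".
report = compare_parsers("<a></p>")
for name, tree in report.items():
    print(name, "->", tree if tree is not None else "(not installed)")
```

If two installed parsers disagree on your document, choosing one explicitly in the BeautifulSoup constructor keeps the result stable across machines.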
The most common type of unexpected behavior is that you can't find a tag that you know is in the document. You saw it going in, but find_all() returns [] or find() returns None. This is another common problem with Python's built-in HTML parser, which sometimes skips tags it doesn't understand. Again, the best solution is to install lxml or html5lib.

Version mismatch problems

SyntaxError: Invalid syntax (on the line ROOT_TAG_NAME = '[document]') - Caused by running an old Python 2 version of Beautiful Soup under Python 3, without converting the code.

ImportError: No module named HTMLParser - Caused by running an old Python 2 version of Beautiful Soup under Python 3.

ImportError: No module named html.parser - Caused by running the Python 3 version of Beautiful Soup under Python 2.

ImportError: No module named BeautifulSoup - Caused by running Beautiful Soup 3 code on a system that doesn't have BS3 installed. Or, by writing Beautiful Soup 4 code without knowing that the package name has changed to bs4.

ImportError: No module named bs4 - Caused by running Beautiful Soup 4 code on a system that doesn't have BS4 installed.

Parsing XML

By default, Beautiful Soup parses documents as HTML. To parse a document as XML, pass in "xml" as the second argument to the BeautifulSoup constructor:

    soup = BeautifulSoup(markup, "xml")

You'll need to have lxml installed.

Other parser problems

If your script works on one computer but not another, or in one virtual environment but not another, or outside the virtual environment but not inside, it's probably because the two environments have different parser libraries available. For example, you may have developed the script on a computer that has lxml installed, and then tried to run it on a computer that only has html5lib installed. See Differences between parsers for why this matters, and fix the problem by mentioning a specific parser library in the BeautifulSoup constructor.
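A quick way to confirm that two environments differ is to list the parser libraries each one can import. The sketch below uses only the standard library. The function name installed_parsers is made up for this example, and it checks only whether the modules are importable, not whether Beautiful Soup would choose them.

```python
import importlib.util


def installed_parsers():
    """List the parser names that could be passed to BeautifulSoup here."""
    names = ["html.parser"]  # ships with Python's standard library
    for module in ("lxml", "html5lib"):
        # find_spec returns None when the top-level module is not importable.
        if importlib.util.find_spec(module) is not None:
            names.append(module)
    return names


print(installed_parsers())
```

Running this in both environments and comparing the two lists shows whether a missing parser explains the difference.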

Because HTML tags and attributes are case-insensitive, all three HTML parsers convert tag and attribute names to lowercase. That is, the markup <TAG></TAG> is converted to <tag></tag>. If you want to preserve mixed-case or uppercase tags and attributes, you'll need to parse the document as XML.

Miscellaneous

UnicodeEncodeError: 'charmap' codec can't encode character '\xfoo' in position bar (or just about any other UnicodeEncodeError) - This problem shows up in two main situations. First, when you try to print a Unicode character that your console doesn't know how to display. (See this page on the Python wiki for help.) Second, when you're writing to a file and you pass in a Unicode character that's not supported by your default encoding. In this case, the simplest solution is to explicitly encode the Unicode string into UTF-8 with u.encode("utf8").
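The file-writing case can be reproduced and fixed with the standard library alone. In this sketch the sample text and the temporary file name are invented for illustration. It first shows the failure against a narrow codec, then writes the same text using an explicit UTF-8 encoding instead of the platform default.

```python
import os
import tempfile

text = "Sacr\u00e9 bleu! \N{SNOWMAN}"

# A narrow codec cannot represent these characters, which is the same
# failure a limited console or default file encoding produces.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Naming the encoding explicitly removes the dependency on the default.
path = os.path.join(tempfile.mkdtemp(), "output.txt")
with open(path, "w", encoding="utf-8") as fh:
    fh.write(text)

with open(path, encoding="utf-8") as fh:
    round_trip = fh.read()

print(ascii_failed, round_trip == text)
```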

KeyError: [attr] - Caused by accessing tag['attr'] when the tag in question doesn’t define the attr attribute. The most common errors are KeyError: 'href' and KeyError: 'class' . Use tag.get('attr') if you’re not sure attr is defined, just as you would with a Python dictionary.
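The difference between the two access styles is easy to see on a tag that lacks the attribute. This is a small sketch with made-up markup, using Python's built-in parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an <a> tag with no href attribute.
soup = BeautifulSoup('<a id="plain">no link here</a>', "html.parser")
tag = soup.a

href = tag.get("href")           # missing attribute gives None
fallback = tag.get("href", "#")  # or a default of your choosing

try:
    tag["href"]                  # dictionary-style access raises instead
    raised = False
except KeyError:
    raised = True

print(href, fallback, raised)
```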

AttributeError: 'ResultSet' object has no attribute 'foo' - This usually happens because you expected find_all() to return a single tag or string. But find_all() returns a list of tags and strings, a ResultSet object. You need to iterate over the list and look at the .foo of each one. Or, if you really only want one result, you need to use find() instead of find_all().

AttributeError: 'NoneType' object has no attribute 'foo' - This usually happens because you called find() and then tried to access the .foo attribute of the result. But in your case, find() didn't find anything, so it returned None, instead of returning a tag or a string. You need to figure out why your find() call isn't returning anything.
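Both AttributeError cases above come down to what find_all() and find() return. The following sketch, with invented markup, shows one way to handle each: iterate over the ResultSet, and check the result of find() before using it.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

# find_all() returns a list-like ResultSet, so read attributes per element.
texts = [p.string for p in soup.find_all("p")]

# find() returns None when nothing matches, so check before using it.
table = soup.find("table")
caption = table.get_text() if table is not None else "(no table found)"

print(texts, caption)
```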

AttributeError: 'NavigableString' object has no attribute 'foo' - This usually happens because you're treating a string as though it were a tag. You may be iterating over a list, expecting that it contains nothing but tags, when it actually contains both tags and strings.

Improving Performance

Beautiful Soup will never be as fast as the parsers it sits on top of. If response time is critical, if you're paying for computer time by the hour, or if there's any other reason why computer time is more valuable than programmer time, you should forget about Beautiful Soup and work directly atop lxml.

That said, there are things you can do to speed up Beautiful Soup. If you're not using lxml as the underlying parser, my advice is to start. Beautiful Soup parses documents significantly faster using lxml than using html.parser or html5lib.

You can speed up encoding detection significantly by installing the cchardet library.

Parsing only part of a document won't save you much time parsing the document, but it can save a lot of memory, and it'll make searching the document much faster.

Translating this documentation

New translations of the Beautiful Soup documentation are greatly appreciated. Translations should be licensed under the MIT license, just like Beautiful Soup and its English documentation are.

There are two ways of getting your translation into the main code base and onto the Beautiful Soup website:

  1. Create a branch of the Beautiful Soup repository, add your translation, and propose a merge with the main branch, the same as you would do with a proposed change to the source code.
  2. Send a message to the Beautiful Soup discussion group with a link to your translation, or attach your translation to the message.

Use the Chinese or Brazilian Portuguese translations as your model. In particular, please translate the source file doc/source/index.rst, rather than the HTML version of the documentation. This makes it possible to publish the documentation in a variety of formats, not just HTML.

Web Scraping in Python with Beautiful Soup: The Basics


Monty Shokeen Freelancer, India


PyMOTM: Beautiful Soup 4 (Part I)

Beautiful Soup 4

Purpose: parsing HTML and XML, and web scraping

Do you want to parse HTML or XML, or simply scrape data from a website with Python? Viblo already has several posts by Anh Tranngoc about scraping and crawling website data with the Scrapy module, such as "Advanced web scraping and crawling techniques with Scrapy and SQLAlchemy" and "Web scraping and crawling with Scrapy and SQLAlchemy". In this post I would like to introduce another module that can scrape data like Scrapy but is lighter and simpler, for tasks that do not need Scrapy's complexity: Beautiful Soup. Let's take a look at it.

Installation

First, check whether the module is already installed on your machine and, if it is, which version you have, using a short command in the terminal:

python -c "import bs4; print(bs4.__version__)"

If the module is installed, the command prints the version of Beautiful Soup. Otherwise, you will get an error like this:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named bs4
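The same check can be done from inside a script without triggering the ImportError. This is a minimal sketch that assumes Python 3.8 or later, where importlib.metadata is part of the standard library. The helper name bs4_version is made up for this example, and beautifulsoup4 is the distribution name used by pip.

```python
from importlib import metadata


def bs4_version():
    """Return the installed beautifulsoup4 version string, or None if absent."""
    try:
        return metadata.version("beautifulsoup4")
    except metadata.PackageNotFoundError:
        return None


version = bs4_version()
if version is None:
    print("Beautiful Soup 4 is not installed")
else:
    print("Beautiful Soup", version)
```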

There are four ways to install Beautiful Soup 4: through APT, pip, easy_install, or from source. All four are listed below so you can pick the one that suits you. Note that easy_install is deprecated, so pip is the usual choice on current Python versions.

Via APT: sudo apt-get install python-bs4 (this is the Python 2 package; on Debian or Ubuntu the Python 3 package is python3-bs4)

Via pip: sudo pip install beautifulsoup4

Via easy_install: sudo easy_install beautifulsoup4

From source:
  1. Go to the Beautiful Soup download page and download the version you want to use.
  2. Extract the archive: tar -xvf followed by the name of the downloaded archive.
  3. Change into the extracted directory with cd.
  4. Install: python setup.py install

If none of the methods above works, you can still use the library by copying the bs4 directory from the source download into your project's source directory and importing it as usual.

Installing a parser

Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers, such as the lxml parser and the html5lib parser. The steps below cover installing both.

HTML5Lib

To install html5lib, use any one of these three methods:

Via APT: sudo apt-get install python-html5lib

Via pip: pip install html5lib

Via easy_install: easy_install html5lib

LXML

To install the lxml parser, you first need three packages: libxml2-dev, libxslt1-dev, and python-dev.

Install the required packages: sudo apt-get install libxml2-dev libxslt1-dev python-dev

Install the lxml parser:

Via APT: sudo apt-get install python-lxml

Via pip: pip install lxml

Via easy_install: easy_install lxml

Using the parsers

Python html.parser: BeautifulSoup(markup, "html.parser")

lxml HTML parser: BeautifulSoup(markup, "lxml")

lxml XML parser: BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")

html5lib: BeautifulSoup(markup, "html5lib")
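Before choosing one of the parser names above, it can help to see which optional parsers are importable on the current machine. The sketch below uses only the standard library; the function name available_parsers and the OPTIONAL_PARSERS mapping are illustrative assumptions, not part of Beautiful Soup.

```python
import importlib.util

# Maps the name passed to BeautifulSoup(markup, name) to the module that provides it.
# "html.parser" ships with Python, so it needs no check.
OPTIONAL_PARSERS = {"lxml": "lxml", "html5lib": "html5lib"}


def available_parsers():
    """Return parser names usable on this machine, standard-library parser first."""
    found = ["html.parser"]
    for parser_name, module_name in OPTIONAL_PARSERS.items():
        # find_spec returns None when the top-level module cannot be located.
        if importlib.util.find_spec(module_name) is not None:
            found.append(parser_name)
    return found


print(available_parsers())
```

The "lxml-xml" and "xml" names also depend on lxml being installed, so they are usable whenever "lxml" appears in the result.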

Preparing the ingredients

To try Beautiful Soup, create an HTML file named bs4.html in the /tmp directory with the following content:

<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document</title>
    <link rel="stylesheet" href="css.css" type="text/css">
</head>
<body>
    <div class="items-list">
        <div class="item pull-right">
            <p class="title">Item 001</p>
            <p class="price">Price: 01$</p>
            <p><a href="#">Buy</a></p>
        </div>
        <div class="item pull-right">
            <p class="title">Item 002</p>
            <p class="price">Price: 02$</p>
            <p><a href="#">Buy</a></p>
        </div>
        <div class="item pull-right">
            <p class="title">Item 003</p>
            <p class="price">Price: 03$</p>
            <p><a href="#">Buy</a></p>
        </div>
        <div class="item pull-right">
            <p class="title">Item 004</p>
            <p class="price">Price: 04$</p>
            <p><a href="#">Buy</a></p>
        </div>
        <div class="item pull-right">
            <p class="title">Item 005</p>
            <p class="price">Price: 05$</p>
            <p><a href="#">Buy</a></p>
        </div>
    </div>
</body>
</html>
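If you would rather not type the file by hand, a short script can generate it. This is a sketch under a few assumptions: the helper write_sample and both templates are made up for illustration, and the file is written to the system temporary directory (which is /tmp on most Linux systems and a different folder on Windows).

```python
import tempfile
from pathlib import Path

# One product block; {n} is the item number.
ITEM_TEMPLATE = """\
        <div class="item pull-right">
            <p class="title">Item {n:03d}</p>
            <p class="price">Price: {n:02d}$</p>
            <p><a href="#">Buy</a></p>
        </div>
"""

PAGE_TEMPLATE = """\
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document</title>
    <link rel="stylesheet" href="css.css" type="text/css">
</head>
<body>
    <div class="items-list">
{items}    </div>
</body>
</html>
"""


def write_sample(directory=None):
    """Write bs4.html with five items and return its path."""
    directory = Path(directory or tempfile.gettempdir())
    items = "".join(ITEM_TEMPLATE.format(n=n) for n in range(1, 6))
    path = directory / "bs4.html"
    path.write_text(PAGE_TEMPLATE.format(items=items), encoding="utf-8")
    return path


sample_path = write_sample()
print("Sample written to", sample_path)
```

Pass a directory to write_sample if you want the file somewhere other than the temporary directory.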

Making the soup

Now that the ingredients and tools are ready, let's cook the soup. To parse a document, import BeautifulSoup from the bs4 library and pass a file handle or an HTML (XML) string to the BeautifulSoup constructor; the result is ready to use right away. A fairly simple recipe, isn't it?

from bs4 import BeautifulSoup

from_file_handle = BeautifulSoup(open('/tmp/bs4.html'))
from_string = BeautifulSoup('data')

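If you do not pass the name of the parser you want as the second argument to the BeautifulSoup constructor, it defaults to the best parser available on your system. Recent Beautiful Soup releases also emit a warning in that case, because the result can differ from one machine to another.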
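To make the two input styles concrete, here is a minimal sketch that names the parser explicitly and closes the file with a with-block. The markup string and the temporary file are stand-ins chosen for the example; "html.parser" is used only because it needs no extra installation.

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = "<html><head><title>Document</title></head><body><p>data</p></body></html>"

# From a string, naming the parser explicitly.
from_string = BeautifulSoup(markup, "html.parser")

# From a file handle; the with-blocks close the file once parsing is done.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bs4.html")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(markup)
    with open(path, encoding="utf-8") as handle:
        from_file_handle = BeautifulSoup(handle, "html.parser")

print(from_string.title.string)
print(from_file_handle.title.string)
```

Naming the parser keeps the result the same on every machine, whichever optional parsers happen to be installed.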

OK, let's now look at Beautiful Soup in a little more detail. Before that, we will create a simple Python script that reads the bs4.html file prepared earlier. The script uses the pdb module (the Python Debugger) so that we can inspect our objects interactively from the terminal. If possible, I will introduce pdb in the next post of this PyMOTM series. Which parser to use is up to you. I will use html5lib.

from bs4 import BeautifulSoup
import pdb

html_dom = BeautifulSoup(open('/tmp/bs4.html'), 'html5lib')
pdb.set_trace()

Let's go (gogo)!

Beautiful Soup object types

Beautiful Soup converts an HTML document into a tree of Python objects. You only need to care about four kinds of object: Tag, NavigableString, BeautifulSoup, and Comment.

Tag

An HTML (XML) tag.

(Pdb) title_tag = html_dom.find('title')
(Pdb) type(title_tag)
<class 'bs4.element.Tag'>

Name

The name of the tag. You can read it with .name:

(Pdb) title_tag.name
u'title'

You can also rename a tag by assigning a new name to it:

(Pdb) title_tag.name = 'blockquote'
(Pdb) title_tag
<blockquote>Document</blockquote>
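The same steps also work outside the debugger. The sketch below assumes Python 3 and parses a one-line markup string (chosen for the example) with the built-in "html.parser", so the strings print without the u prefix seen in the session above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<title>Document</title>", "html.parser")
title_tag = soup.find("title")

# find() returns a Tag object for the first match.
print(isinstance(title_tag, Tag))

# Read the tag name, then rename the tag in place.
original_name = title_tag.name
title_tag.name = "blockquote"

# The rename is reflected when the tag is rendered back to markup.
renamed_markup = str(title_tag)
print(original_name, "->", renamed_markup)
```

Renaming changes the tree itself, so any markup generated from soup afterwards uses the new name.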

Attributes

A tag can have any number of attributes. You can read the value of an attribute like this:

(Pdb) link_tag = html_dom.find('link')
(Pdb) link_tag['rel']
[u'stylesheet']

You can also get all of a tag's attributes with .attrs. The result is a dictionary:

(Pdb) link_tag.attrs
{u'href': u'css.css', u'type': u'text/css', u'rel': [u'stylesheet']}

You can also add, modify, or delete a tag's attributes:

(Pdb) link_tag['href'] = 'lorem.css'
(Pdb) link_tag
<link href="lorem.css" rel="stylesheet" type="text/css"/>
(Pdb) del link_tag['rel']
(Pdb) link_tag
<link href="lorem.css" type="text/css"/>
(Pdb) link_tag['media'] = 'all'
(Pdb) link_tag
<link href="lorem.css" media="all" type="text/css"/>
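As a script, the attribute operations above might look like the following sketch. It assumes Python 3 and the built-in "html.parser"; the one-line markup is a stand-in for the link tag in bs4.html. Tag.get() is used for the final lookup because indexing a missing attribute with square brackets raises KeyError.

```python
from bs4 import BeautifulSoup

markup = '<link rel="stylesheet" href="css.css" type="text/css">'
link_tag = BeautifulSoup(markup, "html.parser").find("link")

# Read one attribute, then copy all of them as a plain dictionary.
rel_before = link_tag["rel"]
attrs_before = dict(link_tag.attrs)

# Modify, delete, and add attributes the same way as with a dict.
link_tag["href"] = "lorem.css"
del link_tag["rel"]
link_tag["media"] = "all"

# .get() returns None for a missing attribute instead of raising KeyError.
print(rel_before, attrs_before)
print(link_tag.get("rel"), link_tag.attrs)
```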

Multi-valued attributes

HTML4 defines quite a few attributes that may hold more than one value (the values are separated by a space). HTML5 redefines this so that only a few attributes may hold more than one value: class, rel, rev, accept-charset, headers, and accesskey. Beautiful Soup returns a list of values for an attribute in that group, and a plain string for any other attribute:

(Pdb) items = html_dom.select('.items-list .item')
(Pdb) items[0]['class']
[u'item', u'pull-right']

You can also add, modify, or remove a value in that attribute's list of values:

(Pdb) del items[0]['class'][0]
(Pdb) items[0]['class']
[u'pull-right']
(Pdb) items[0]['class'] = ['lorem', 'lipsum']
(Pdb) items[0]['class']
['lorem', 'lipsum']
(Pdb) items[0]['class'][0] = 'lorem_updated'
(Pdb) items[0]['class']
['lorem_updated', 'lipsum']
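The difference between multi-valued and single-valued attributes is easiest to see side by side. In this sketch (Python 3, built-in "html.parser", markup invented for the example) class comes back as a list, while id stays a single string even though it contains a space.

```python
from bs4 import BeautifulSoup

markup = '<div class="item pull-right"></div><p id="my id"></p>'
soup = BeautifulSoup(markup, "html.parser")

# class is multi-valued, so it is split into a list of values.
class_values = soup.div["class"]

# id is not multi-valued, so the value is kept as one string.
id_value = soup.p["id"]

# The list can be edited in place; values are joined with spaces on output.
soup.div["class"].append("featured")
div_markup = str(soup.div)

print(class_values, repr(id_value), div_markup)
```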

NavigableString

The text content of a tag. A NavigableString is just like a Python Unicode string, except that it also supports features such as navigating and searching the tree (covered in the next post). You can convert it to a plain Unicode string with the unicode() function. The session below was recorded on Python 2; on Python 3, unicode() no longer exists and str() does the same job.

(Pdb) title_tag.string
u'Document'
(Pdb) type(title_tag.string)
<class 'bs4.element.NavigableString'>
(Pdb) unicode_title = unicode(title_tag.string)
(Pdb) type(unicode_title)
<type 'unicode'>

You cannot edit a tag's text in place, but you can replace it with the replace_with() function:

(Pdb) title_tag.string
u'Document'
(Pdb) title_tag.string.replace_with('PyMOTM: BeautifulSoup4')
u'Document'
(Pdb) title_tag
<title>PyMOTM: BeautifulSoup4</title>
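A Python 3 version of this session might look like the sketch below, where str() takes the place of unicode(). The one-line markup and the "html.parser" choice are assumptions made for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<title>Document</title>", "html.parser")
title_tag = soup.title

# .string is a NavigableString, which still knows its place in the tree.
text = title_tag.string
print(isinstance(text, NavigableString))

# str() produces a plain Python string that is detached from the tree.
plain_text = str(text)

# replace_with() swaps the string inside the tag and returns the old one.
old_text = title_tag.string.replace_with("PyMOTM: BeautifulSoup4")
print(old_text, "->", title_tag)
```

Converting to a plain string is useful when you want to keep the text after you are done with the parsed document, since the plain copy holds no reference to the tree.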

This post is already fairly long and we have covered less than half of Beautiful Soup, so I will stop here. Part 2 will cover navigating and searching the tree.
