Categories
How To Guides

There’s a White Steam Deck Now, but We Just Want a Steam Deck 2

There won’t be a Steam Deck 2 this year. Valve staff already confirmed as much in a recent interview. The most the company can offer for now is a white Steam Deck OLED. Still, anyone hoping to beat the scalpers should know that once these are gone, Valve won’t be making any more of them.

Valve revealed the upcoming white limited edition Steam Deck on the company’s Twitter page. According to Valve, the white Steam Deck is the exact same OLED-based device Valve released last year, but it comes in a bright, off-white color with gray buttons and thumbsticks. It will cost as much as the previous 1 TB clear plastic Steam Deck OLED at $680, but it also comes with a white carrying case and a white microfiber cleaning cloth (which is, of course, a selling point). The power button still includes the orange accent as the only spot of extra color.

These will go on sale Nov. 18 at 6 p.m. ET, 3 p.m. PT. They’ll be available in the U.S., Australia, Japan, South Korea, Taiwan, and Hong Kong. Valve said it has “limited quantities” of white Steam Decks for all regions. To try to beat the scalpers, Valve said it’s restricting purchases to one per account. Those accounts must have bought something else on Steam before November to be eligible. Still, we don’t imagine that will completely stop resellers from flooding eBay with marked-up white Steam Decks, as they did with the limited edition 30th anniversary PlayStation 5 Pro.

The standard 1 TB Steam Deck goes for $650 MSRP, and if you want to save the extra $30 and put it towards a skin or dock, that’s perfectly reasonable. Valve’s handheld is easily the best of its kind in its price range. The next step up for gaming handhelds is a Lenovo Legion Go or an Asus ROG Ally X, which cost hundreds of dollars more in exchange for more powerful devices that run Windows instead of the Linux-based SteamOS. Windows also allows for easier dual booting and fewer compatibility issues when loading your games from non-Steam launchers. Still, if you want the most straightforward experience, the Steam Deck remains the gold standard for PC handhelds.

I enjoy my Steam Deck as much as the next gamer who can’t be bothered to sit at their desk. Just this week, I spent an entire day off doing nothing but moaning about my aching limbs and playing Metaphor: ReFantazio on that 7.4-inch OLED screen. The new white model looks great, but as with all white plastic, you’ll inevitably find scuffs and dirt marring the pristine exterior. You could instead opt for decals and skins like those from Dbrand if you want to make your Steam Deck look unique.

I also have to admit I’m disappointed with how barebones this limited edition model feels compared to the older clear plastic one (despite reported cracking issues). I’d want something that reminds me of the turrets from Portal, maybe with red face buttons or trackpads. Perhaps the company is saving its energy for an inevitable Steam Deck 2. Valve designer Lawrence Yang told the Australian outlet Reviews.org that the company isn’t planning any sort of yearly release. Instead, it’s looking for a real “generational leap” in computing power without sacrificing battery life. Considering the rumors about the AMD Ryzen Z2, the next-generation successor to the chip that powers the Ally and Legion Go, that generational leap may already be on the horizon.


How to Use Web Scraping with Scrapy

Web scraping is a method for extracting information from websites. It can be used for data analysis, competitive analysis, monitoring, and much more. Scrapy is a powerful and versatile web scraping framework in Python that provides tools to build web scrapers, also known as spiders. This guide will cover everything from setting up Scrapy to performing advanced scraping tasks, including handling JavaScript content, managing requests, and deploying your scraper.

Table of Contents

  1. Introduction to Scrapy
  2. Setting Up the Scrapy Environment
  3. Creating a New Scrapy Project
  4. Understanding Scrapy Components
  5. Writing Your First Spider
  6. Extracting Data
  7. Handling Pagination
  8. Working with Forms
  9. Handling JavaScript Content
  10. Managing Requests and Middleware
  11. Storing Scraped Data
  12. Error Handling and Debugging
  13. Testing Your Spiders
  14. Deploying Your Scraper
  15. Legal and Ethical Considerations
  16. Conclusion

1. Introduction to Scrapy

Scrapy is an open-source web crawling framework designed for web scraping and extracting data from websites. Unlike other scraping tools, Scrapy is designed to handle complex scraping tasks efficiently. It provides a robust architecture to create spiders that crawl websites and extract structured data.

Key Features of Scrapy

  • Powerful and Flexible: Supports scraping complex websites and handling various data formats.
  • Asynchronous: Built on top of Twisted, an asynchronous networking library, for efficient network operations.
  • Built-in Data Export: Allows easy export of scraped data to various formats like CSV, JSON, and XML.
  • Extensible: Provides middleware and pipelines for additional functionality.

2. Setting Up the Scrapy Environment

Installing Python

Ensure Python is installed on your system. You can download it from the official Python website.

Installing Scrapy

Install Scrapy using pip, Python’s package manager:

bash

pip install scrapy

Creating a Virtual Environment (Optional)

Create a virtual environment to manage dependencies:

bash

python -m venv myenv
source myenv/bin/activate # On Windows use `myenv\Scripts\activate`

3. Creating a New Scrapy Project

Starting a New Project

Create a new Scrapy project using the following command:

bash

scrapy startproject myproject
cd myproject

This will generate a project structure with directories for spiders, settings, and more.

Project Structure

  • myproject/: Project directory.
    • myproject/spiders/: Directory for spider files.
    • myproject/items.py: Define item classes here.
    • myproject/middlewares.py: Define custom middleware here.
    • myproject/pipelines.py: Define item pipelines here.
    • myproject/settings.py: Project settings.
    • myproject/__init__.py: Package initializer.
    • scrapy.cfg: Project configuration file.

4. Understanding Scrapy Components

Spiders

Spiders are classes that define how a website should be scraped, including the URLs to start from and how to follow links.

Items

Items are simple containers used to structure the data you extract. They are defined in items.py.

Item Loaders

Item Loaders are used to populate and clean items.

Pipelines

Pipelines process the data extracted by spiders. They are defined in pipelines.py and can be used for tasks like cleaning data or saving it to a database.

Middleware

Middleware allows you to modify requests and responses globally. This is defined in middlewares.py.

5. Writing Your First Spider

Creating a Spider

Create a spider in the spiders directory. For example, create a file named example_spider.py:

python

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        self.log('Visited %s' % response.url)

Running the Spider

Run the spider using:

bash

scrapy crawl example

6. Extracting Data

Extracting Using Selectors

Scrapy uses selectors based on XPath or CSS to extract data. In your parse method:

python

def parse(self, response):
    title = response.css('title::text').get()
    self.log('Page title: %s' % title)

Extracting Multiple Items

Extract multiple items by iterating over a set of elements:

python

def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('span small::text').get(),
        }

7. Handling Pagination

Extracting Pagination Links

Identify the link to the next page and follow it:

python

def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('span small::text').get(),
        }

    next_page = response.css('li.next a::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)

8. Working with Forms

Filling and Submitting Forms

Scrapy can handle form submission. Use the FormRequest class to send form data:

python

def start_requests(self):
    return [scrapy.FormRequest('https://example.com/login', formdata={
        'username': 'myuser',
        'password': 'mypassword'
    }, callback=self.after_login)]

def after_login(self, response):
    # Check if login was successful and continue scraping
    pass

9. Handling JavaScript Content

Using Splash for JavaScript Rendering

Scrapy does not execute JavaScript, so content that is rendered in the browser is missing from the responses it downloads. Use Scrapy-Splash to render JavaScript:

  1. Install Splash: Splash is a headless browser for rendering JavaScript.
    bash

    docker run -p 8050:8050 scrapinghub/splash
  2. Install Scrapy-Splash:
    bash

    pip install scrapy-splash
  3. Configure Scrapy-Splash: Update settings.py:
    python

    SPLASH_URL = 'http://localhost:8050'

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }

    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }

    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

  4. Create a Splash-Enabled Spider:
    python

    import scrapy
    from scrapy_splash import SplashRequest

    class SplashSpider(scrapy.Spider):
        name = 'splash'
        start_urls = ['https://example.com']

        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url, self.parse, args={'wait': 2})

        def parse(self, response):
            self.log('Page title: %s' % response.css('title::text').get())

10. Managing Requests and Middleware

Custom Middleware

You can create custom middleware to process requests and responses:

python

# myproject/middlewares.py
class CustomMiddleware:
    def process_request(self, request, spider):
        # Add custom headers or modify requests
        request.headers['User-Agent'] = 'my-custom-agent'
        return None

    def process_response(self, request, response, spider):
        # Process responses here
        return response

Enabling Middleware

Add your middleware to settings.py:

python

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomMiddleware': 543,
}

11. Storing Scraped Data

Exporting to CSV

To save scraped data to CSV:

bash

# Run the spider with:
scrapy crawl example -o output.csv

Exporting to JSON

To save scraped data to JSON:

bash

# Run the spider with:
scrapy crawl example -o output.json

Exporting to XML

To save scraped data to XML:

bash

# Run the spider with:
scrapy crawl example -o output.xml
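Exports can also be configured in settings.py through the `FEEDS` setting, so the output does not depend on command-line flags. The excerpt below is a sketch and the output paths are hypothetical.

```python
# myproject/settings.py (excerpt)
# Each key is an output path; each value configures that feed.
FEEDS = {
    "output/quotes.json": {
        "format": "json",
        "encoding": "utf8",
        "overwrite": True,  # replace the file on each run instead of appending
    },
    "output/quotes.csv": {"format": "csv"},
}
```

On the command line, recent Scrapy versions distinguish `-o` (append to an existing file) from `-O` (overwrite it).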

Saving Data to a Database

You can use item pipelines to save data to a database:

python

# myproject/pipelines.py
import sqlite3

class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect('data.db')
        self.c = self.conn.cursor()
        self.c.execute('''
            CREATE TABLE IF NOT EXISTS quotes (
                text TEXT,
                author TEXT
            )
        '''
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.c.execute('INSERT INTO quotes (text, author) VALUES (?, ?)', (item['text'], item['author']))
        return item

Add the pipeline to settings.py:

python

ITEM_PIPELINES = {
    'myproject.pipelines.SQLitePipeline': 1,
}

12. Error Handling and Debugging

Handling Errors

Catch and handle errors in your spider:

python

def parse(self, response):
    try:
        title = response.css('title::text').get()
        if not title:
            raise ValueError('Title not found')
    except Exception as e:
        self.log(f'Error occurred: {e}')

Debugging with Logging

Use Scrapy’s built-in logging to debug:

python

import logging

logging.basicConfig(level=logging.DEBUG)

Using Scrapy Shell

Scrapy Shell is a useful tool for testing and debugging your spiders:

bash

scrapy shell 'https://example.com'

13. Testing Your Spiders

Unit Testing

Use Python’s unittest to write unit tests for your spiders:

python

import unittest
from scrapy.http import HtmlResponse
from myproject.spiders.example_spider import ExampleSpider

class TestExampleSpider(unittest.TestCase):
    def setUp(self):
        self.spider = ExampleSpider()

    def test_parse(self):
        url = 'https://example.com'
        response = HtmlResponse(url=url, body='<html><title>Test</title></html>', encoding='utf-8')
        # Assumes parse() yields {'title': ...}; the logging-only spider
        # from section 5 returns None and would need that yield added first.
        results = list(self.spider.parse(response))
        self.assertEqual(results, [{'title': 'Test'}])

Integration Testing

Write integration tests to ensure your spider works with real data and handles edge cases.

14. Deploying Your Scraper

Deploying to Scrapinghub

Scrapinghub (now Zyte) provides Scrapy Cloud, a cloud-based platform for deploying and managing Scrapy spiders.

  1. Install Scrapinghub Command Line Interface (CLI):
    bash

    pip install shub
  2. Configure Scrapinghub:
    bash

    shub login
  3. Deploy Your Project:
    bash

    shub deploy

Deploying to a Server

You can also deploy your Scrapy project to a server or cloud service like AWS:

  1. Prepare Your Server: Set up your server environment with Python and Scrapy.
  2. Transfer Your Project: Use scp or another file transfer method to upload your project files.
  3. Run Your Spider:
    bash

    scrapy crawl example

15. Legal and Ethical Considerations

Respecting robots.txt

Check the robots.txt file of the website to see if scraping is allowed.

Avoiding Overloading Servers

Implement delays between requests and avoid making too many requests in a short period.
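In Scrapy, both robots.txt compliance and request pacing are controlled from settings.py. The excerpt below is a sketch. The setting names are Scrapy's own, while the values are illustrative assumptions to tune per site.

```python
# myproject/settings.py (excerpt)
ROBOTSTXT_OBEY = True               # skip URLs disallowed by robots.txt
DOWNLOAD_DELAY = 1.0                # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap on parallel requests per domain

# AutoThrottle adjusts the delay based on how quickly the server responds
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
```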

Handling Sensitive Data

Ensure that you handle any sensitive data responsibly and comply with data protection regulations.

16. Conclusion

Scrapy is a robust and flexible web scraping framework that allows you to efficiently extract data from websites. This guide has covered the fundamentals of setting up Scrapy, writing and running spiders, extracting and storing data, handling JavaScript content, and deploying your scraper.

By understanding Scrapy’s components and following best practices for web scraping, you can build powerful scrapers to gather valuable data from the web. Always ensure that your scraping activities are legal and ethical, and use the data responsibly. With these skills, you are well-equipped to tackle various web scraping challenges and projects.


How to Use Web Scraping with BeautifulSoup

Web scraping is the process of extracting data from websites. It’s a powerful technique that can be used for a variety of applications, from data analysis to competitive analysis. BeautifulSoup is a popular Python library used for web scraping, allowing you to parse HTML and XML documents easily. This comprehensive guide will cover everything from setting up your environment to performing advanced scraping tasks with BeautifulSoup.

Table of Contents

  1. Introduction to Web Scraping
  2. Setting Up Your Environment
  3. Introduction to BeautifulSoup
  4. Basic Scraping with BeautifulSoup
  5. Navigating the HTML Tree
  6. Searching the HTML Tree
  7. Handling Pagination
  8. Working with Forms
  9. Handling JavaScript Content
  10. Dealing with Cookies and Sessions
  11. Error Handling and Best Practices
  12. Saving Scraped Data
  13. Legal and Ethical Considerations
  14. Conclusion

1. Introduction to Web Scraping

Web scraping involves downloading web pages and extracting data from them. This data can be structured (like tables) or unstructured (like text). Web scraping is commonly used for:

  • Data extraction for research or business intelligence.
  • Monitoring changes on web pages.
  • Aggregating data from multiple sources.

Key Concepts

  • HTML: The standard markup language for documents designed to be displayed in a web browser.
  • DOM (Document Object Model): A programming interface for HTML and XML documents. It represents the structure of the document as a tree of nodes.
  • HTTP Requests: The method used to fetch web pages.

2. Setting Up Your Environment

Installing Python

Ensure you have Python installed. You can download it from the official Python website.

Installing Required Libraries

You’ll need requests for making HTTP requests and beautifulsoup4 for parsing HTML. Install these libraries using pip:

bash

pip install requests beautifulsoup4

Creating a Virtual Environment (Optional)

It’s a good practice to use a virtual environment to manage dependencies. Create and activate one as follows:

bash

python -m venv myenv
source myenv/bin/activate # On Windows use `myenv\Scripts\activate`

3. Introduction to BeautifulSoup

BeautifulSoup is a library for parsing HTML and XML documents. It builds a parse tree from a page’s source that you can navigate and search to extract data.

Installing BeautifulSoup

You can install BeautifulSoup via pip:

bash

pip install beautifulsoup4

Basic Usage

Import BeautifulSoup and requests in your script:

python

from bs4 import BeautifulSoup
import requests

4. Basic Scraping with BeautifulSoup

Fetching a Web Page

Use the requests library to fetch the web page:

python

url = 'https://example.com'
response = requests.get(url)
html_content = response.text

Parsing HTML

Create a BeautifulSoup object and parse the HTML content:

python

soup = BeautifulSoup(html_content, 'html.parser')

Extracting Data

You can extract data using BeautifulSoup’s methods:

python

title = soup.title.text
print("Page Title:", title)

5. Navigating the HTML Tree

BeautifulSoup allows you to navigate and search the HTML tree easily.

Accessing Tags

Access tags directly:

python

h1_tag = soup.h1
print("First <h1> Tag:", h1_tag.text)

Accessing Attributes

Get attributes of tags:

python

link = soup.a
print("Link URL:", link['href'])

Traversing the Tree

Navigate the tree using parent, children, and sibling attributes:

python

# Get the parent tag
parent = soup.h1.parent
print("Parent Tag:", parent.name)
# Get all child tags
children = soup.body.children
for child in children:
    print("Child:", child.name)
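One thing to watch for when traversing is that `.children` also yields text nodes, such as the whitespace between tags. The sketch below uses a made-up HTML string to keep only real tags and to move between siblings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up document; the newlines become text nodes between the tags
html = "<html><body><h1>Title</h1>\n<p>First</p>\n<p>Second</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Keep only Tag children, skipping whitespace strings
tag_children = [child.name for child in soup.body.children if isinstance(child, Tag)]

parent_name = soup.h1.parent.name

# Sibling navigation that skips over text nodes
next_paragraph = soup.h1.find_next_sibling("p").get_text()
all_following = [p.get_text() for p in soup.h1.find_next_siblings("p")]

print(tag_children, parent_name, next_paragraph, all_following)
```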

6. Searching the HTML Tree

BeautifulSoup provides methods for searching the HTML tree.

Finding All Instances

Find all tags that match the given criteria:

python

all_links = soup.find_all('a')
for link in all_links:
    print("Link Text:", link.text)

Finding the First Instance

Find the first tag that matches the given criteria:

python

first_link = soup.find('a')
print("First Link:", first_link.text)

Using CSS Selectors

Select elements using CSS selectors:

python

selected_elements = soup.select('.class-name')
for element in selected_elements:
    print("Selected Element:", element.text)
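To compare the search methods side by side, the sketch below runs `find_all`, `select`, and `select_one` against a small made-up product list. The markup and class names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
html = """
<ul id="products">
  <li class="item"><a href="/p/1">Widget</a><span class="price">9.99</span></li>
  <li class="item sold-out"><a href="/p/2">Gadget</a><span class="price">19.50</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all with an attribute filter: only <a> tags that have an href
links = [a["href"] for a in soup.find_all("a", href=True)]

# CSS selector with id, class, and child combinator
names = [a.get_text(strip=True) for a in soup.select("#products li.item > a")]

# select_one returns the first match (or None)
sold_out = soup.select_one("li.sold-out a").get_text(strip=True)

# class_ is used because "class" is a reserved word in Python
prices = [float(span.get_text()) for span in soup.find_all("span", class_="price")]

print(links, names, sold_out, prices)
```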

7. Handling Pagination

Many websites display data across multiple pages. Handling pagination requires extracting the link to the next page and making subsequent requests.

Finding Pagination Links

Identify the link to the next page:

python

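from urllib.parse import urljoin

next_page = soup.find('a', string='Next')
if next_page:
    # The href may be relative, so resolve it against the current page URL
    next_url = urljoin(url, next_page['href'])
    response = requests.get(next_url)
    next_page_content = response.text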

Looping Through Pages

Loop through pages until no more pagination links are found:

python

from urllib.parse import urljoin

current_url = 'https://example.com'
while current_url:
    response = requests.get(current_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract data from current page
    # …

    next_page = soup.find('a', string='Next')
    if next_page:
        # Resolve relative links against the page we just fetched
        current_url = urljoin(current_url, next_page['href'])
    else:
        break
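Pagination links are often relative, and `requests.get` needs an absolute URL. The sketch below shows how `urllib.parse.urljoin` from the standard library resolves a few common `href` forms. The URLs are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page currently being scraped
base = "https://example.com/catalog/page-1.html"

# Relative to the current directory
same_dir = urljoin(base, "page-2.html")

# Relative to the site root
from_root = urljoin(base, "/catalog/page-2.html")

# Already absolute: returned unchanged
absolute = urljoin(base, "https://example.com/other")

# Query-only link: keeps the current path
query_only = urljoin(base, "?page=2")

print(same_dir, from_root, absolute, query_only, sep="\n")
```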

8. Working with Forms

Web scraping often involves interacting with web forms, such as login forms or search forms.

Sending Form Data

Use requests to send form data:

python

payload = {
    'username': 'myuser',
    'password': 'mypassword'
}
response = requests.post('https://example.com/login', data=payload)

Parsing Form Responses

After sending form data, parse the response as usual with BeautifulSoup:

python

soup = BeautifulSoup(response.text, 'html.parser')
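Login forms frequently include hidden inputs, such as a CSRF token, that must be sent back with the credentials. One possible approach is to read the form with BeautifulSoup and build the payload from it. The form markup and field names below are assumptions.

```python
from bs4 import BeautifulSoup

# Made-up login form; in practice this would come from a GET to the login page
login_html = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

soup = BeautifulSoup(login_html, "html.parser")
form = soup.find("form")

# Start from every named input so hidden fields are preserved
payload = {
    field["name"]: field.get("value", "")
    for field in form.find_all("input")
    if field.get("name")
}

# Then fill in the credentials
payload.update({"username": "myuser", "password": "mypassword"})
print(payload)
```

The resulting dictionary can be passed as `data=payload` to `requests.post`, ideally through a `requests.Session` so cookies set by the login page are sent too.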

9. Handling JavaScript Content

Some websites load content dynamically using JavaScript, which can be challenging for web scraping.

Using Selenium for JavaScript

Selenium is a browser automation tool that can handle JavaScript. Install it via pip:

bash

pip install selenium

You will also need a browser driver (e.g., ChromeDriver). Here’s a basic example:

python

from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('https://example.com')

html_content = driver.page_source
soup = BeautifulSoup(html_content, 'html.parser')
driver.quit()

Scraping JavaScript Content

Once you have the page source, use BeautifulSoup to parse and extract data:

python

data = soup.find_all('div', class_='dynamic-content')
for item in data:
    print(item.text)

10. Dealing with Cookies and Sessions

Some websites use cookies and sessions for managing user states.

Handling Cookies

Use requests to handle cookies:

python

session = requests.Session()
response = session.get('https://example.com')
print(session.cookies.get_dict())

Using Cookies in Requests

Send cookies with your requests:

python

cookies = {'sessionid': 'your-session-id'}
response = session.get('https://example.com/dashboard', cookies=cookies)

11. Error Handling and Best Practices

Error Handling

Handle errors gracefully to avoid disruptions:

python

try:
    response = requests.get('https://example.com')
    response.raise_for_status()
except requests.exceptions.HTTPError as err:
    print(f"HTTP error occurred: {err}")
except Exception as err:
    print(f"Other error occurred: {err}")

Best Practices

  • Respect Robots.txt: Always check the site’s robots.txt file to understand the scraping rules.
  • Rate Limiting: Avoid overwhelming the server with too many requests in a short period. Implement delays between requests.
  • User-Agent: Set a User-Agent header to identify your requests as coming from a browser:

python

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://example.com', headers=headers)
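The robots.txt and rate-limiting advice above can be sketched with the standard library alone. The example parses an inline robots.txt so it runs offline. The rules, the `my-scraper` user agent, and the `polite_crawl` helper are all hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser

# Inline robots.txt so the sketch runs offline; a real scraper would call
# parser.set_url("https://example.com/robots.txt") and parser.read() instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "my-scraper"  # hypothetical identifier

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def polite_crawl(urls, fetch, delay=None):
    """Call fetch(url) for each allowed URL, pausing between calls."""
    if delay is None:
        # Use the site's Crawl-delay if it declares one, else 1 second
        delay = parser.crawl_delay(USER_AGENT) or 1
    results = []
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # disallowed by robots.txt
        results.append(fetch(url))
        time.sleep(delay)
    return results
```

Here `fetch` could be something like `lambda u: requests.get(u, headers=headers)`, which keeps the pacing logic separate from the HTTP client.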

12. Saving Scraped Data

Saving to CSV

You can save the scraped data to a CSV file using Python’s csv module:

python

import csv

data = [
    ['Name', 'Description'],
    ['Item 1', 'Description of item 1'],
    ['Item 2', 'Description of item 2']
]

# newline='' lets the csv module control line endings itself
with open('data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)
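Scraped records are often collected as dictionaries rather than lists. In that case `csv.DictWriter` may be a better fit, since it writes the header and orders the columns for you. The sketch below writes to an in-memory buffer, and the `rows_to_csv` helper is a hypothetical name.

```python
import csv
import io

# Example records, shaped like items a scraper might collect
rows = [
    {"name": "Item 1", "description": "Description of item 1"},
    {"name": "Item 2", "description": "Contains, a comma"},
]


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, ["name", "description"])
print(csv_text)
```

Values containing commas are quoted automatically, which is one reason to prefer the `csv` module over joining strings by hand.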

Saving to JSON

For JSON, use the json module:

python

import json

data = {
    'items': [
        {'name': 'Item 1', 'description': 'Description of item 1'},
        {'name': 'Item 2', 'description': 'Description of item 2'}
    ]
}

with open('data.json', 'w') as file:
    json.dump(data, file, indent=4)

Saving to a Database

To save data to a database, you can use an ORM like SQLAlchemy:

python

from sqlalchemy import create_engine, Column, String
# declarative_base lives in sqlalchemy.orm as of SQLAlchemy 1.4
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Item(Base):
    __tablename__ = 'items'
    id = Column(String, primary_key=True)
    name = Column(String)
    description = Column(String)

engine = create_engine('sqlite:///data.db')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()

item = Item(id='1', name='Item 1', description='Description of item 1')
session.add(item)
session.commit()

13. Legal and Ethical Considerations

Legal Issues

Web scraping may violate a website’s terms of service. Always check the website’s terms and conditions before scraping.

Ethical Issues

  • Respect Data Privacy: Avoid scraping sensitive or personal data.
  • Avoid Overloading Servers: Implement polite scraping practices and avoid making excessive requests.

14. Conclusion

Web scraping with BeautifulSoup is a powerful tool for extracting and analyzing data from websites. With the ability to navigate HTML trees, handle forms, deal with JavaScript content, and manage cookies, BeautifulSoup provides a versatile solution for many web scraping needs.

This guide has covered the essentials of setting up your environment, performing basic and advanced scraping, handling different types of web content, and saving the data you collect. Always remember to adhere to legal and ethical guidelines while scraping to ensure responsible use of this technology.

With these skills, you can tackle a wide range of web scraping projects and extract valuable insights from web data.