Parser

HTML visual code representation

The HTML Visual Code Representation section in ScrapeSuite offers an intuitive interface for identifying and selecting elements on a webpage that you want to parse. This dual-pane view allows users to interact with both the rendered web page and its underlying HTML code, ensuring precise data extraction. Here’s how it works:

Visual Highlighting

Left-click an area of interest on the web page to select it; the selected area is highlighted with solid lines. Suggestions for similar elements are highlighted with dashed lines to help you extend the selection. Together, these cues let you identify exactly the elements you want to parse.

Markup Exploration

Explore the HTML markup visually alongside the web page. This feature allows you to inspect the underlying structure of the page and select elements based on attributes such as class or href. By visually correlating the markup with the rendered page, you can ensure accurate data selection. Areas of interest are highlighted in the HTML markup with the same color as on the web page.

Advanced Select

For more experienced users, ScrapeSuite offers the “Advanced Select” tool. This feature enables you to highlight elements directly on the web page and then specify the desired block by its class in the “HTML Code” tab. This added level of precision provides for the most accurate and targeted data extraction.

JSON Result

The JSON Result section in ScrapeSuite displays the data extracted from a webpage in a structured JSON format. This section is essential for reviewing and verifying the parsed data before using it in further applications or exporting it. Here’s what you need to know:

  • Essential Containers
    • Upon accessing ScrapeSuite, you’ll encounter the “main” container, which is pivotal to the structure. Additional containers can be added to define specific content, providing flexibility in data extraction. By organizing data into containers, you can tailor a parser to your specific requirements.
  • Data Flexibility
    • Customize parsing by adding necessary elements and assigning suitable names. Data can take various forms, including Strings, Numbers, Boolean values, Arrays, Objects, and Null, catering to diverse parsing needs. This flexibility allows you to adapt parsing methods to suit the structure and format of the data you’re extracting.

Note:

JSON serves as a universal data exchange format, facilitating seamless communication between different programming languages. It offers flexibility in defining data structure and format, enhancing parsing capabilities. With ScrapeSuite, you can harness the power of JSON to efficiently extract and manipulate data from web pages.
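
As a rough illustration of the value types listed above, the following Python sketch builds a result with a “main” container and serializes it to JSON. All field names (title, price, in_stock, tags, seller, discount) are invented for the example and do not represent actual ScrapeSuite output.

```python
import json

# Hypothetical parse result: every key below is made up for illustration.
result = {
    "main": {
        "title": "Blue Widget",        # String
        "price": 19.99,                # Number
        "in_stock": True,              # Boolean
        "tags": ["widgets", "blue"],   # Array
        "seller": {"name": "Acme"},    # Object
        "discount": None,              # Null
    }
}

# Serialize to JSON text, then parse it back to confirm the round trip.
encoded = json.dumps(result, indent=2)
decoded = json.loads(encoded)
print(encoded)
```

Python's True and None are written as true and null in the JSON text, which is the form you would see in a JSON viewer.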

Tree of Elements

In ScrapeSuite, the Settings Tree is an essential component for configuring your web scraping tasks. It allows you to define the structure of the elements you want to scrape, how they are selected, and how they are processed. The Settings Tree consists of several key components: the main container, containers, content, selectors, and container types.

Creation of the element

Creating elements in ScrapeSuite is an intuitive process designed to empower users in defining specific parts of a webpage for parsing. Whether selecting directly from the HTML Preview, from the HTML Code, or through the element tree, users have the flexibility to tailor their parsing tasks to their needs.

Steps to Create an Element from the HTML visual preview or HTML code

  1. Select Area of Interest:
    • Click on the area of interest in the HTML visual preview or select the corresponding HTML code in the HTML pane.
  2. Automatic Element Creation:
    • Upon selection, ScrapeSuite will automatically generate a new element, which will be highlighted in both the visual and code views.

Steps to Create an Element from the Element Tree

  1. Element Creation:
    • Click on the pencil icon next to a container in the element tree. Select “Add Element” to create a new element.
  2. Select Area of Interest:
    • Click on the area of interest in the HTML visual preview or select the corresponding HTML code in the HTML pane.

Element Types

  • Element Type: Choose between “Content” or “Container”. This determines if the block will act as a content block that extracts data or as a container that groups multiple elements.

Hidden Elements

The Hidden Elements feature in ScrapeSuite allows you to hide specific elements from the HTML preview while keeping them in the code. This is useful for decluttering the preview and focusing on the elements you want to parse.

How to Use Hidden Elements

  • Accessing Hidden Elements:
    • Click on the Main Container.
    • In the settings, locate the Hidden Elements function.
  • Hiding Elements:
    • Click the plus (+) icon next to the Hidden Elements function.
    • In the HTML preview or the HTML code pane, click on the objects you want to hide.
    • Once an object is successfully hidden, it appears in the Hidden Elements list with the status “Hidden”.
  • Compiling Selection:
    • After hiding all unnecessary objects, click the “Press to compile Selecting” button.

Content element

The “Content” block is used to define specific elements within a container that you want to scrape. Clicking on an area of interest in the HTML visual interface creates a Content block by default.

Settings of Content element:

Name

Assign a name to this content block for easy identification in the JSON Result.

Element Selection

Choose whether to parse all occurrences of the element, only the first one, or only the last one. This is useful when the element appears multiple times on a page.

Is Optional

Indicate whether the element might not be present on the page. If enabled, ScrapeSuite will not fail if the element is missing.
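
The Element Selection and Is Optional settings can be pictured with a small Python sketch. It is only a model of the behavior described above: the function name, the mode strings, and the choice to return an empty list or None for a missing optional element are assumptions, not ScrapeSuite internals.

```python
from typing import List, Optional, Union


def apply_selection(
    matches: List[str], mode: str = "all", is_optional: bool = False
) -> Union[List[str], Optional[str]]:
    """Pick values from `matches` according to a selection mode (hypothetical names)."""
    if mode not in ("all", "first", "last"):
        raise ValueError(f"unknown mode: {mode}")
    if not matches:
        if is_optional:
            # Optional element: produce an empty value instead of failing.
            return [] if mode == "all" else None
        raise LookupError("required element not found")
    if mode == "all":
        return matches
    return matches[0] if mode == "first" else matches[-1]


prices = ["10.00", "12.50", "9.99"]
print(apply_selection(prices, "first"))
print(apply_selection(prices, "last"))
print(apply_selection([], "first", is_optional=True))
```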

Content Source

Specify how data will be extracted from the selected element. Options include extracting the text content or a specific attribute value (e.g., href, src).
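
To make the difference between the two content sources concrete, here is a minimal sketch using Python's standard html.parser module. It reads both the text content and the href attribute of the same link. The HTML snippet and the LinkExtractor class are invented for the example; ScrapeSuite's own extraction is not shown in this document.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the text content and href attribute of the first <a> tag."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.text_parts = []
        self._inside = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and not self._done:
            self._inside = True
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self._inside = False
            self._done = True

    def handle_data(self, data):
        # Text of nested tags (such as <b>) counts as part of the link text.
        if self._inside:
            self.text_parts.append(data)


parser = LinkExtractor()
parser.feed('<p><a href="/products/1" class="title">Blue <b>Widget</b></a></p>')
parser.close()

text_value = "".join(parser.text_parts)  # content source: text
attr_value = parser.href                 # content source: attribute "href"
print(text_value, attr_value)
```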

Selector

The CSS selector used to identify elements on the webpage. It allows precise targeting of elements based on their attributes and structure. Detailed information about supported selectors is available in the CSS Selector section.

Post-Processing

Options for processing data after parsing. Detailed information about post-processing features is available in the Post-Processing Section.

Subordinate Operator

Define actions for subordinate operators (AND, OR, PLUS).

When multiple selectors are added within a Content block, the subordinate operator defines their interaction:

  • AND: All elements are parsed and included in the results.
  • OR: At least one of the elements is parsed and included in the results.
  • PLUS: Elements are parsed sequentially until one is successfully parsed; all successful parses are included.

Example:

You have two selectors within a Content block to capture different product title formats. The subordinate operator ensures they work together as defined.

Content selector element

Content Selectors allow precise data extraction by using CSS code to identify elements on the webpage.

Settings of Content Selector element:

Content Source

Specify how data will be extracted from the selected element. Options include extracting the text content or a specific attribute value (e.g., href, src).

Selector

The CSS selector used to identify elements on the webpage. It allows precise targeting of elements based on their attributes and structure. Detailed information about supported selectors is available in the CSS Selector section.

Post-Processing

Options for processing data after parsing. Detailed information about post-processing features is available in the Post-Processing Section.

Container element

Containers help organize elements into logical groups. Their settings have been updated for improved usability.

Settings of Container element:

Name

Assign a name to the container for easy identification in JSON Result.

Element Selection

Choose whether to parse all occurrences of the element, only the first one, or only the last one. This is useful when the element appears multiple times on a page.

Is Optional

Indicate whether the element might not be present on the page. If enabled, ScrapeSuite will not fail if the element is missing.

Subordinate Operator

Define actions for subordinate operators (AND, OR, PLUS).

When multiple selectors are added within a Container, the subordinate operator defines their interaction:

  • AND: All elements are parsed and included in the results.
  • OR: At least one of the elements is parsed and included in the results.
  • PLUS: Elements are parsed sequentially until one is successfully parsed; all successful parses are included.

Container Type Name

Specify the type name for the container element. This helps in distinguishing between different types of containers in JSON Result.

Selector

The CSS selector used to identify elements on the webpage. It allows precise targeting of elements based on their attributes and structure. Detailed information about supported selectors is available in the CSS Selector section.

Container type element

“Container Type” allows multiple areas of interest within a container. This adds flexibility in defining complex page structures.

Settings of Container type element:

Container Type Name

Assign a name to the container type for easy identification in JSON Result.

Subordinate Operator

Define actions for subordinate operators (AND, OR, PLUS).

When multiple selectors are added within a Container, the subordinate operator defines their interaction:

  • AND: All elements are parsed and included in the results.
  • OR: At least one of the elements is parsed and included in the results.
  • PLUS: Elements are parsed sequentially until one is successfully parsed; all successful parses are included.

Selector

The CSS selector used to identify elements on the webpage. It allows precise targeting of elements based on their attributes and structure. Detailed information about supported selectors is available in the CSS Selector section.

Post-Processing Section

Post-processing lets you clean, format, or transform data after it has been scraped but before it is stored, exported, or used. This step ensures that the data meets your specific requirements and is ready for immediate use, improving the accuracy and usability of the parsing result.

Adjustments in the Post-processing section include the following features:

Without post-processing: Values will not undergo post-processing.

Select this option if no additional processing of values is required after extraction.

Get absolute URL: Obtain the absolute URL from the relative one.

Choose this option if your URLs are extracted in relative form, and you want to add the absolute URL in the correct format.
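
The general technique behind this option is URL resolution against the address of the page being parsed. A minimal Python sketch with the standard urljoin function follows; the page URL and the relative links are made up, and ScrapeSuite's own implementation is not documented here.

```python
from urllib.parse import urljoin

# Hypothetical address of the page that was parsed.
page_url = "https://example.com/catalog/page-2.html"

# Root-relative link: resolved against the site root.
root_relative = urljoin(page_url, "/products/42")
# Document-relative link: resolved against the page's directory.
doc_relative = urljoin(page_url, "item.html")
# Already absolute link: returned unchanged.
already_absolute = urljoin(page_url, "https://cdn.example.com/a.png")

print(root_relative)     # https://example.com/products/42
print(doc_relative)      # https://example.com/catalog/item.html
print(already_absolute)  # https://cdn.example.com/a.png
```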

Get number from string: Specify the index of the number in the string that you want to extract.

Choose this option if you need to extract a number from a string. If the string contains several numbers, you can specify the index of the one to extract. The default index is 0.
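
One way to picture the index is as a position in the list of numbers found in the string. The sketch below is an assumption about the behavior, not ScrapeSuite's code: the function name is hypothetical, and it ignores thousands separators and signs, whose handling the documentation does not specify.

```python
import re
from typing import Optional, Union


def get_number(text: str, index: int = 0) -> Optional[Union[int, float]]:
    """Return the number at position `index` among all numbers in `text`."""
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    if not 0 <= index < len(numbers):
        return None  # no number at that position
    value = numbers[index]
    return float(value) if "." in value else int(value)


print(get_number("Price: 1299 USD, was 1499"))     # index 0 -> 1299
print(get_number("Price: 1299 USD, was 1499", 1))  # index 1 -> 1499
```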

Get substring by regex: Specify the starting index and length to extract a substring using a regular expression. 

Use this option to extract a substring that matches the specified regular expression. For example, it is useful when you need to split a URL into an array based on a given rule.
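
The two ideas mentioned here, extracting a matching substring and splitting a URL by a rule, can be sketched with Python's re module. The URL and the patterns are invented for illustration; the regular-expression dialect ScrapeSuite uses may differ from Python's.

```python
import re

url = "https://example.com/products/42?ref=home"  # hypothetical value

# Extract the first substring that matches the pattern.
match = re.search(r"/products/\d+", url)
matched = match.group(0) if match else None

# Use a capture group to keep only part of the match.
id_match = re.search(r"/products/(\d+)", url)
product_id = id_match.group(1) if id_match else None

# Split a URL path into an array on "/" and "?".
parts = re.split(r"[/?]", "example.com/products/42?ref=home")

print(matched, product_id, parts)
```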

Get substring: Specify the starting index and length to extract a substring. 

Use this option to extract a substring from a value. For example, you can get the first 10 characters, starting from zero.
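
In code terms, a starting index plus a length corresponds to a slice. A short Python sketch follows; the function name and sample value are hypothetical.

```python
def get_substring(value: str, start: int = 0, length: int = 10) -> str:
    """Return `length` characters of `value`, beginning at index `start`."""
    return value[start:start + length]


timestamp = "2024-05-17T10:30:00"
date_part = get_substring(timestamp, 0, 10)  # first 10 characters
time_part = get_substring(timestamp, 11, 5)  # 5 characters from index 11
print(date_part, time_part)
```

If the value is shorter than the requested range, Python slicing simply returns what is available; whether ScrapeSuite behaves the same way is not stated in this document.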

Get a trimmed string: Obtain a trimmed version of the string. 

Use this option to get a trimmed version of the string, removing any extra spaces.
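
Trimming usually means removing whitespace from the start and end of a value only. The Python sketch below shows that behavior with str.strip; it is an assumption that ScrapeSuite trims the same way, and inner spaces are left untouched here.

```python
raw = "  \n  Blue   Widget \t"  # hypothetical scraped value

trimmed = raw.strip()  # removes leading and trailing whitespace only
print(repr(trimmed))
```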

Custom post-processing: Use JavaScript code to process the value.

Select this option if you need custom processing of values using JavaScript code.

CSS Selector

In ScrapeSuite, selectors play a crucial role in identifying and extracting specific elements from a webpage. You can create selectors automatically by clicking on an area of interest in the HTML preview or manually by entering the CSS code in the “Selector” field.

Selectors allow you to pinpoint the exact elements you want to parse. Below is a list of supported selector methods, along with their descriptions and examples of how to construct them in ScrapeSuite.

Supported Selectors:

| Selector | Description | Example |
|---|---|---|
| .class | Selects all elements with the specified class. | .product selects all elements with the class “product”. For example: div.product |
| .class1.class2 | Selects all elements with both class names. | .product.featured selects all elements with both “product” and “featured” classes. For example: div.product.featured |
| .class1 .class2 | Selects all elements with class2 that are descendants of an element with class1. | .product .product-title selects all elements with the class “product-title” inside elements with the class “product”. For example: div.product span.product-title |
| #id | Selects the element with the specified ID. | #main-title selects the element with id="main-title". For example: h1#main-title |
| * | Selects all elements. | * selects all elements on the page. For example: * |
| element | Selects all elements of the specified type. | div selects all <div> elements. For example: div |
| element.class | Selects all elements of the specified type with the specified class. | div.product selects all <div> elements with the class “product”. For example: div.product |
| element,element | Selects all elements of the specified types. | div, p selects all <div> and <p> elements. For example: div, p |
| element element | Selects all elements of the specified type that are descendants of the specified element. | div.related-products p selects all <p> elements inside <div> elements with the class “related-products”. For example: div.related-products p |
| element>element | Selects all elements of the specified type that are direct children of the specified element. | div > p selects all <p> elements where the parent is a <div> element. For example: div > p |
| element+element | Selects the first element of the specified type that is immediately adjacent to the specified element. | div + p selects the first <p> element placed immediately after <div> elements. For example: div + p |
| element1~element2 | Selects every element2 that is preceded by an element1 with the same parent. | p ~ ul selects every <ul> element that is preceded by a <p> element with the same parent. For example: p ~ ul |
| [attribute] | Selects all elements with the specified attribute. | [data-product] selects all elements with a “data-product” attribute. For example: div[data-product] |
| [attribute=value] | Selects all elements with the specified attribute value. | [target="_blank"] selects all elements with target="_blank". For example: a[target="_blank"] |
| [attribute~=value] | Selects all elements with the specified attribute value as a word. | [title~=flower] selects all elements with a title attribute containing the word “flower”. For example: div[title~=flower] |
| [attribute^=value] | Selects all elements with the specified attribute value starting with a specific value. | [href^="https"] selects every element whose href attribute value begins with “https”. For example: a[href^="https"] |
| [attribute$=value] | Selects all elements with the specified attribute value ending with a specific value. | [href$=".pdf"] selects every element whose href attribute value ends with “.pdf”. For example: a[href$=".pdf"] |
| [attribute*=value] | Selects all elements with the specified attribute value containing a specific substring. | [href*="scrapesuite"] selects every element whose href attribute value contains the substring “scrapesuite”. For example: a[href*="scrapesuite"] |
| :first-child | Selects every element that is the first child of its parent. | div:first-child selects every <div> element that is the first child of its parent. For example: div:first-child |
| :last-child | Selects every element that is the last child of its parent. | div:last-child selects every <div> element that is the last child of its parent. For example: div:last-child |
| :not(selector) | Selects every element that does not match the specified selector. | :not(div) selects every element that is not a <div> element. For example: :not(div) |
| :nth-child(n) | Selects every element that is the nth child of its parent. | div:nth-child(2) selects every <div> element that is the second child of its parent. For example: div:nth-child(2) |

Using these selectors, you can fine-tune your data extraction to capture exactly the elements you need from a webpage. Whether you are selecting elements by class, ID, attribute, or position, these methods provide a powerful way to define your parsing targets in ScrapeSuite.
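
The attribute selectors in the list above differ only in how the attribute value is compared. The following Python sketch models those comparison rules as defined by standard CSS; it is not ScrapeSuite code, and the function name attribute_matches is made up for the example.

```python
from typing import Optional


def attribute_matches(actual: Optional[str], operator: str, expected: str = "") -> bool:
    """Model how CSS attribute selectors compare an element's attribute value.

    `actual` is the attribute value on the element, or None if it is absent.
    """
    if actual is None:
        return False                  # element has no such attribute
    if operator == "":
        return True                   # [attribute]: presence is enough
    if operator == "=":
        return actual == expected     # [attribute=value]: exact match
    if operator == "~=":
        return expected in actual.split()  # whitespace-separated word
    # In CSS, an empty value never matches for ^=, $= and *=.
    if operator == "^=":
        return bool(expected) and actual.startswith(expected)
    if operator == "$=":
        return bool(expected) and actual.endswith(expected)
    if operator == "*=":
        return bool(expected) and expected in actual
    raise ValueError(f"unsupported operator: {operator}")


# "~=" matches whole words only, while "*=" matches any substring.
print(attribute_matches("spring flower show", "~=", "flower"))
print(attribute_matches("sunflowers", "~=", "flower"))
print(attribute_matches("sunflowers", "*=", "flower"))
```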

Elements Section

The Elements section in ScrapeSuite simplifies selecting and managing elements for parsing on web pages. It lets you select elements of interest directly from the HTML preview or code view and define your data extraction criteria precisely. Here’s a detailed overview of its functionality:

Elements Displaying: 

  • Users click once on an area of interest within the HTML preview or code view.
  • The Elements block then displays all matching elements found on the page according to the constructed selector.
  • After reviewing the listed elements, users compile their selection by pressing the “Press to compile selection” button. This finalizes the chosen elements for parsing.
  • Once compiled, the selected elements appear in the JSON Result block for further review and refinement if necessary.

Element Status:

Elements included in the JSON Result are marked with the “In result” indicator, allowing users to easily identify which elements have been successfully included in parsing results.

To include selected elements in the JSON Result, use the “Elements” section in the Content and Selector settings.

Timeline

Timeline is a feature that displays the sequence of events and actions related to setting up the parser. This timeline provides users with an overview of all key moments, showing the creation of containers, properties, text selectors, attribute selectors, and CSS selectors. By selecting a specific point on the timeline, you can always go back one or more steps and reconfigure with the changes you need.

IMPORTANT! Note that if you go back one or more steps and make changes to your settings, all subsequent changes on the timeline made after that point will be completely removed. Changes that may be removed after timeline modifications will be displayed in a faded gray color.

Example: Before making changes to the timeline and removal.

After making changes and deletions on the timeline.
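
The Timeline behavior described above, where going back and then making a change removes all later steps, matches a common linear history model. The Python sketch below illustrates that model only; the Timeline class, its method names, and the step labels are hypothetical and do not describe ScrapeSuite's implementation.

```python
from typing import List


class Timeline:
    """Linear history: changing an earlier point discards all later steps."""

    def __init__(self) -> None:
        self.steps: List[str] = []
        self.position = 0  # number of steps currently applied

    def record(self, step: str) -> None:
        # Any steps after the current position are removed first.
        del self.steps[self.position:]
        self.steps.append(step)
        self.position = len(self.steps)

    def go_back_to(self, position: int) -> None:
        if not 0 <= position <= len(self.steps):
            raise IndexError("position outside the timeline")
        self.position = position

    def pending_removal(self) -> List[str]:
        # Steps that would be lost on the next change (shown faded in the UI).
        return self.steps[self.position:]


timeline = Timeline()
for step in ["create container", "add title selector", "add price selector"]:
    timeline.record(step)

timeline.go_back_to(1)
faded = timeline.pending_removal()
timeline.record("add link selector")
print(faded, timeline.steps)
```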

Workflow