Crawler content selectors define how the crawler finds content and in what order that content is handled. This is especially useful when the crawler should, for example, find products in a list and then visit a details page for each product. In that case, define a content selector that finds an anchor tag such as <a class="product-link" href="...">View details</a> using a selector such as: a.product-link->href
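To make the list-to-details pattern concrete, the following Python sketch mimics what a selector such as a.product-link->href conceptually yields: the href of every anchor tag that has the class product-link. It uses only the standard library and is not the crawler's implementation; the class name ProductLinkCollector and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class ProductLinkCollector(HTMLParser):
    """Collects the href of every <a> tag that has the class 'product-link'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # The class attribute may hold several space-separated class names.
        classes = (attributes.get("class") or "").split()
        if "product-link" in classes and attributes.get("href"):
            self.hrefs.append(attributes["href"])


# Hypothetical list page with two products and one unrelated link.
LIST_PAGE = """
<ul>
  <li><a class="product-link" href="/products/1">View details</a></li>
  <li><a class="product-link featured" href="/products/2">View details</a></li>
  <li><a class="nav" href="/about">About</a></li>
</ul>
"""

collector = ProductLinkCollector()
collector.feed(LIST_PAGE)
print(collector.hrefs)  # ['/products/1', '/products/2']
```

Each collected href would then be the address of a details page to visit next.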
Selectors support the following CSS concepts:
- . (class)
- # (id)
- a space and >, to indicate descendant and direct child respectively
- +
- ~
- [someAttr="someValue"]
- [someAttr*="someValue"]
- [someAttr^="someValue"]
- [someAttr$="someValue"]
- :not(...), which matches an element only if the selector inside the parentheses does not match it
- :first-child
- :last-child
- :nth-child(number), where number starts at 1 for the first element (same as CSS)
In addition to selecting elements, the crawler also supports performing many operations on the matched elements, their content and attributes.
Operations are indicated by using either -> or =>.
One example could be to find an anchor tag, select its href attribute, parse it as a URL and extract a specific parameter. This is often useful for storing the ID of an entity shown on a specific URL. The selector could be: a->href->HtmlDecode->ParseUrl->id, which outputs the id URL parameter, e.g. 123 from https://example.com?id=123
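The chain a->href->HtmlDecode->ParseUrl->id can be approximated step by step in Python. This is only a sketch of the idea using the standard library, not the crawler's own code; the function name extract_url_parameter, its arguments and the page URL are assumptions made for the example.

```python
from html import unescape
from urllib.parse import parse_qs, urljoin, urlparse


def extract_url_parameter(raw_href, page_url, parameter):
    """Mimics href -> HtmlDecode -> ParseUrl -> <parameter> for one href value."""
    decoded = unescape(raw_href)                 # HtmlDecode: "&amp;" becomes "&"
    absolute = urljoin(page_url, decoded)        # resolve against the page URL
    query = parse_qs(urlparse(absolute).query)   # ParseUrl: parameters as a dict
    values = query.get(parameter)
    return values[0] if values else None


# A raw href as it might appear in HTML source, with an encoded ampersand.
raw = "/product?id=123&amp;ref=list"
print(extract_url_parameter(raw, "https://example.com/products", "id"))   # 123
print(extract_url_parameter(raw, "https://example.com/products", "ref"))  # list
```

Decoding before parsing matters because HTML source typically contains &amp; rather than & between URL parameters.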
Here are some supported operations:
- simplify: removes almost all HTML except for text, links and images that are shown inline in text. This operation is useful for removing styling and complex HTML structures when you only want the actual content.
- children: returns the children of the HTML tag.
- css: returns all CSS properties inside the style attribute as a dictionary. This is useful when, for example, outputting the value of a specific CSS property, such as a->css->color, outputting red from <a style="color: red;">...</a>
- content: returns all text and children inside the matched tag as an HTML string.
- htmldecode: HTML-decodes strings, e.g. &amp; to &
- limit(number): sets an upper limit on the number of matched elements or on the string length.
- offset(number): similar to limit, but skips elements or characters in a string.
- parsedate: converts a string to a datetime. Formats can optionally be provided, e.g. parsedate(dd-MM-YYYY|dd-mm-YYYY HH:mm), where | separates multiple formats, which are tried in order. Passing formats is rarely needed, since most common formats are supported.
- parseurl or urlparameters: parses a URL and returns its URL parameters as a dictionary. Note that the URL is converted to an absolute URL based on the URL of the page on which it was found. CSS URLs such as url('https://...') are also handled.
- string: converts the currently matched value to a string. HTML elements are converted to HTML strings and datetimes are serialized to strings.
- tohtmlparagraph: converts a string to an HTML paragraph string, wrapping it in a p tag if necessary.
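A few of these operations can be sketched in plain Python to show the kind of value each one produces. The functions below are illustrative stand-ins with made-up names (css_to_dict, offset_then_limit, parse_date), not the crawler's code. Two assumptions are worth stating: offset is applied before limit, and Python's strptime format codes are used in place of the crawler's own date patterns.

```python
from datetime import datetime


def css_to_dict(style):
    """Sketch of 'css': turn 'color: red; margin: 0' into a dictionary."""
    properties = {}
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue  # skip empty or malformed declarations
        name, value = declaration.split(":", 1)
        properties[name.strip().lower()] = value.strip()
    return properties


def offset_then_limit(values, offset=0, limit=None):
    """Sketch of 'offset' and 'limit' on a list of elements or on a string."""
    end = None if limit is None else offset + limit
    return values[offset:end]


def parse_date(text, formats):
    """Sketch of 'parsedate': try each format in order, return the first match."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
    return None


print(css_to_dict("color: red; font-weight: bold;")["color"])      # red
print(offset_then_limit(["a", "b", "c", "d"], offset=1, limit=2))  # ['b', 'c']
print(offset_then_limit("Hello world", limit=5))                   # Hello
# The first format fails on the trailing time, so the second one is used.
print(parse_date("24-12-2023 18:30", ["%d-%m-%Y", "%d-%m-%Y %H:%M"]))  # 2023-12-24 18:30:00
```

Trying formats in order means the most specific pattern does not have to come first, as long as earlier patterns fail cleanly on input they do not fully match.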