Crawling module

Content selectors

Content selectors define how the crawler finds content and in what order that content is handled. This is especially useful when the crawler should, for example, find products in a list and then visit the details page of each product. In that case, define a content selector that finds the anchor tag <a class="product-link" href="...">View details</a>, for example: a.product-link->href
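To make the effect of a.product-link->href concrete, here is a minimal Python sketch of the same idea using only the standard library. It is not the crawler's implementation; the class name ProductLinkCollector, the helper collect_product_links and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class ProductLinkCollector(HTMLParser):
    """Collects href values from <a> tags that carry the class 'product-link'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # The class attribute may hold several space-separated class names.
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if "product-link" in classes and href is not None:
            self.hrefs.append(href)


def collect_product_links(html):
    """Rough equivalent of the selector a.product-link->href (hypothetical helper)."""
    collector = ProductLinkCollector()
    collector.feed(html)
    return collector.hrefs


# Sample markup invented for this example.
sample = """
<ul>
  <li><a class="product-link" href="/products/1">View details</a></li>
  <li><a class="product-link featured" href="/products/2">View details</a></li>
  <li><a class="nav" href="/about">About</a></li>
</ul>
"""
links = collect_product_links(sample)
print(links)
```

Each collected href would then be a candidate details page for the crawler to visit next.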

Selectors support the following CSS concepts:

Classes using .
Id using #
Space and > to indicate descendant and direct child respectively
Element directly after another element using +
Element somewhere after another element using ~
Attribute value equals, such as [someAttr="someValue"]
Attribute value contains, such as [someAttr*="someValue"]
Attribute value starts with, such as [someAttr^="someValue"]
Attribute value ends with, such as [someAttr$="someValue"]
:not(...) which matches an element only if the selector inside the parentheses does not match it
:first-child
:last-child
:nth-child(number) - note that numbering starts at 1 for the first element (same as CSS).
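The attribute operators and the 1-based numbering of :nth-child can be expressed as plain string and list operations. The sketch below is only an illustration of those semantics in Python; the function names attribute_matches and nth_child are hypothetical and do not come from the crawler, and edge cases such as empty expected values may be treated differently by the crawler.

```python
def attribute_matches(actual, operator, expected):
    """Evaluate one attribute test, e.g. [someAttr^="someValue"]."""
    if actual is None:
        # The element does not have the attribute at all.
        return False
    if operator == "=":
        return actual == expected            # value equals
    if operator == "*=":
        return expected in actual            # value contains
    if operator == "^=":
        return actual.startswith(expected)   # value starts with
    if operator == "$=":
        return actual.endswith(expected)     # value ends with
    raise ValueError(f"unsupported operator: {operator}")


def nth_child(children, number):
    """Return the child at a 1-based position, or None if out of range."""
    if 1 <= number <= len(children):
        return children[number - 1]
    return None


print(attribute_matches("product-link", "^=", "product"))
print(nth_child(["first", "second", "third"], 1))
```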

Content operations

In addition to selecting elements, the crawler also supports performing many operations on the matched elements, their content and attributes.

Operations are indicated by using either -> or =>

One example is to find an anchor tag, select its href attribute, parse it as a URL and extract a specific parameter. This is often useful for storing the ID of an entity shown on a specific URL. The selector could be: a->href->HtmlDecode->ParseUrl->id, which outputs the id URL parameter, e.g. 123 from https://example.com?id=123
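The chain href->HtmlDecode->ParseUrl->id can be pictured as three small steps: decode HTML entities, resolve the URL against the page it was found on, then look up one query parameter. The following Python sketch mirrors those steps with the standard library; the function name extract_url_parameter and the sample URLs are assumptions for illustration, not part of the crawler.

```python
from html import unescape
from urllib.parse import parse_qs, urljoin, urlparse


def extract_url_parameter(raw_href, page_url, name):
    """Decode an href, make it absolute and return one query parameter."""
    decoded = unescape(raw_href)              # HtmlDecode: "&amp;" becomes "&"
    absolute = urljoin(page_url, decoded)     # resolve relative to the page URL
    parameters = parse_qs(urlparse(absolute).query)  # ParseUrl: parameter dictionary
    values = parameters.get(name)
    # parse_qs returns a list per parameter; use the first value if present.
    return values[0] if values else None


product_id = extract_url_parameter(
    "/product?id=123&amp;ref=list",   # href as it appears in the HTML source
    "https://example.com/catalog",    # page on which the link was found
    "id",
)
print(product_id)
```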

Here are some supported operations:

simplify which removes almost all html except for text, links and images that are shown inline in text. This operation can be useful for removing styling and complex html structures that are not desired when you just want to get actual content.
children returns the children of the html tag
css returns all CSS properties inside the style attribute as a dictionary. This is useful for outputting the value of a specific CSS property, e.g. a->css->color outputs red from <a style="color: red;">...</a>
content returns all text and children inside the matched tag as a html string
htmldecode is used to html decode strings, e.g. &amp; to &
limit(number) is used to set an upper limit on the number of matched elements or string length
offset(number) is similar to limit, but is used to skip elements or characters in a string
parsedate is used to convert a string to a datetime. Formats can optionally be provided using e.g. parsedate(dd-MM-YYYY|dd-mm-YYYY HH:mm) where | is used as a separator between multiple formats which are tried in order. There is rarely a need for passing formats though since most common formats are supported.
parseurl or urlparameters is used to parse a url and return url parameters as a dictionary. It is worth noting that the url is converted to an absolute url based on the url of the page on which the url was found. CSS urls such as url('https://...') are also handled.
string converts the currently matched value to a string. Html elements are converted to html strings and datetimes are serialized to strings.
tohtmlparagraph converts a string to an html paragraph string, wrapping it in a p-tag if necessary.
If none of the above match, the crawler tries to find an attribute on the current value. For example, if the value is an html tag, its html attributes are looked through.
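Two of the operations above, css and parsedate, are easy to picture in code. The Python sketch below shows one possible reading of them: splitting a style attribute into a dictionary, and trying several date formats in order, separated by |. The function names parse_style and parse_date are hypothetical. The format strings use Python's strptime syntax (%d-%m-%Y), which is an assumption of this example and differs from the dd-MM-YYYY style the crawler accepts.

```python
from datetime import datetime


def parse_style(style):
    """Split a style attribute into a {property: value} dictionary.

    Naive sketch: values that themselves contain ';' (for example some
    data URLs) are not handled.
    """
    declarations = {}
    for declaration in style.split(";"):
        name, separator, value = declaration.partition(":")
        if separator and name.strip():
            declarations[name.strip().lower()] = value.strip()
    return declarations


def parse_date(text, formats):
    """Try each '|'-separated format in order; return None if none match."""
    for fmt in formats.split("|"):
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # this format did not fit, try the next one
    return None


color = parse_style("color: red; font-weight: bold;")["color"]
parsed = parse_date("24-12-2023 18:30", "%d-%m-%Y|%d-%m-%Y %H:%M")
print(color)
print(parsed)
```

In this example the first format is rejected because the time portion is left over, so the second format is the one that produces the result.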