Crawling module

Content selectors

Crawler content selectors define how the crawler finds content and in what order it handles that content. This is especially useful when the crawler needs to, for example, find products in a list and then visit the details page for each product. In that case, define a content selector that finds an anchor tag such as <a class="product-link" href="...">View details</a> and reads its href attribute: a.product-link->href
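
To make the behaviour of a.product-link->href concrete, here is a minimal Python sketch of what such a selector resolves to. It is not the crawler's own implementation: the select helper, the AttributeCollector class and the sample listing HTML are all hypothetical, and the sketch only understands the simple tag.class->attribute shape.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collects one attribute from every start tag matching tag + class."""

    def __init__(self, tag, css_class, attribute):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.attribute = attribute
        self.values = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # A class attribute may hold several space-separated class names.
        classes = (attr_map.get("class") or "").split()
        if tag == self.tag and self.css_class in classes and self.attribute in attr_map:
            self.values.append(attr_map[self.attribute])


def select(html_text, selector):
    # Hypothetical helper: supports only "tag.class->attribute".
    css_part, attribute = selector.split("->", 1)
    tag, css_class = css_part.split(".", 1)
    collector = AttributeCollector(tag, css_class, attribute)
    collector.feed(html_text)
    return collector.values


# Assumed sample markup for a product listing page.
listing_html = """
<ul>
  <li><a class="product-link" href="/products/1">View details</a></li>
  <li><a class="product-link" href="/products/2">View details</a></li>
  <li><a class="other" href="/about">About</a></li>
</ul>
"""

detail_urls = select(listing_html, "a.product-link->href")
print(detail_urls)
```

Each value returned here is a link the crawler could then follow to reach a product details page; the anchor without the product-link class is ignored.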

Selectors support the following CSS concepts:

Content operations

In addition to selecting elements, the crawler can perform operations on the matched elements, their content, and their attributes.

Operations are indicated by using either -> or =>

One example is to find an anchor tag, select its href attribute, HTML-decode the value, parse it as a URL, and extract a specific query parameter. This is often useful for storing the ID of an entity shown at a specific URL. The selector could be: a->href->HtmlDecode->ParseUrl->id, which outputs the id URL parameter, e.g. 123 from https://example.com?id=123
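
The chain HtmlDecode->ParseUrl->id can be pictured as a small pipeline in which each step receives the output of the previous one. The Python sketch below is only an illustration of that idea, not the crawler's actual code: run_operations is a hypothetical helper, it assumes the href value has already been selected, and it assumes that a name following ParseUrl is looked up as a query parameter.

```python
from html import unescape
from urllib.parse import parse_qs, urlparse


def run_operations(value, chain):
    """Apply each operation in a '->'-separated chain to the value, in order."""
    for operation in chain.split("->"):
        if operation == "HtmlDecode":
            # Turn HTML entities such as &amp; back into plain characters.
            value = unescape(value)
        elif operation == "ParseUrl":
            # Keep the query string as a dict of parameter name -> list of values.
            value = parse_qs(urlparse(value).query)
        elif isinstance(value, dict):
            # Assumption: after ParseUrl, any other name is a query parameter.
            value = value[operation][0]
        else:
            raise ValueError(f"Unsupported operation in this sketch: {operation}")
    return value


# An href as it might appear in raw HTML, with the ampersand entity-encoded.
raw_href = "https://example.com?ref=list&amp;id=123"

entity_id = run_operations(raw_href, "HtmlDecode->ParseUrl->id")
print(entity_id)
```

The sample href shows why decoding comes first: without HtmlDecode, the query string would still contain &amp; and the parameter name would be read as amp;id rather than id.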

Here are some supported operations: