Maxim

shared this problem
2 years ago

Employees Involved

SCM

Admin

Statistics

3
Comments
1
Views

13
votes

Article Downloader does not find and grab URLs with "Custom Search Engines"

When using "Article Downloader" with "Custom Search Engines":

The program does not grab URLs that start with "href=/url_page".

URLs that start with "href=http://out-site.com/url_page" are grabbed fine.

Example 1:

  1. www.bakemag.com/Standard-Items/Search-Results.aspx?searchStr=%keyword% /News/

Example 2:

  1. m.bakingbusiness.com/Standard-Items/Search-Results.aspx?Sortby=relevance&searchStr=%keyword% /articles/

Official Answer
Employee
SCM Posted 2 years ago

It can't read relative URLs.

Comments (3)

8

Why?

This is a serious problem.

How can we grab articles from these sites?

It needs a fix.

5

I posted this problem 1 month ago.

But this bug still hasn't been fixed.

Why?

It's easy to fix with a regex.

  1. Our URL filter is "/news/"

  2. href="([^"].*?)/news/([^"].*?)"

Please fix ASAP.
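
The filtering idea in this comment can be sketched in Python. This is only an illustration, not how SCM works internally: the function name `find_filtered_hrefs` and the sample HTML are hypothetical, and the pattern is a tightened variant of the one above (it uses `[^"]*` on both sides so a match cannot run past the closing quote of a single `href` value).

```python
import re


def find_filtered_hrefs(html, url_filter):
    """Return every double-quoted href value that contains url_filter.

    Hypothetical helper: matches both relative ("/news/a.html") and
    absolute ("http://site/news/a.html") links.
    """
    # [^"]* keeps the match inside one attribute value.
    pattern = re.compile(r'href="([^"]*' + re.escape(url_filter) + r'[^"]*)"')
    return pattern.findall(html)


# Made-up sample markup, for illustration only.
sample_html = (
    '<a href="/news/article-1.html">A</a> '
    '<a href="http://out-site.com/news/b.html">B</a> '
    '<a href="/about/">C</a>'
)
links = find_filtered_hrefs(sample_html, "/news/")
```

Here `links` holds the relative link and the absolute link that contain "/news/", and skips "/about/". The relative link still lacks a domain at this point.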

Employee
4

Relative URLs need the domain name attached to them when scraping or it won't work.

SCM can find URLs like "/url.html".

But when the URL gets passed to the scraper, the scraper doesn't know what domain it came from.

Which means I would have to add a new field to the custom scraper to "attach" the correct domain name in front. You can see it's going to be messy.
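
For readers wondering what "attaching" the domain looks like in general, here is a minimal Python sketch using the standard library's `urljoin`. It is not SCM's code. It assumes the URL of the search results page is still known when the links are processed, which is exactly the piece the comment above says the scraper is missing. The function name `resolve_links`, the `http://` scheme, and the article paths are assumptions for illustration.

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs):
    """Resolve each href against the page it was found on.

    Relative hrefs get the page's scheme and domain; hrefs that are
    already absolute come back unchanged.
    """
    return [urljoin(page_url, href) for href in hrefs]


# Search page from Example 1, with an assumed scheme and keyword.
page_url = "http://www.bakemag.com/Standard-Items/Search-Results.aspx?searchStr=bread"

# Hypothetical hrefs of the two kinds described in the original post.
hrefs = ["/News/article-1.html", "http://out-site.com/url_page"]

resolved = resolve_links(page_url, hrefs)
```

With these inputs, the first entry becomes "http://www.bakemag.com/News/article-1.html" and the second stays as it was.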
