TheGypsy

shared this problem
9 months ago

Problem with article downloader

I've been trying to download some content from a site using the CSS selector filter. It works fine most of the time, but sometimes SCM doesn't fetch part of the content.

I filter the content with the .wprm-recipe CSS selector.

For example, these URLs work fine:

http://www.suncakemom.com/treats/cinnamon-yeast-bread-no-sugar-treat/

http://www.suncakemom.com/treats/homemade-sugar-free-crepes-recipe/

http://www.suncakemom.com/treats/sugar-free-ice-cream-recipe-with-raspberries/

But these do not return some of the content that comes immediately after the title:

http://www.suncakemom.com/treats/low-carb-muffins-with-chocolate-chip/

http://www.suncakemom.com/treats/homemade-vanilla-custard-profiteroles/

http://www.suncakemom.com/treats/viennetta-recipe-the-ice-cream-dream/

(please remove the links when checked, thanks)

The content is injected into the pages by a WordPress plugin, so the structure shouldn't differ from one page to another, but for some reason SCM sometimes doesn't fetch the .wprm-recipe-times-container part of the content.
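One way to narrow this down is to check whether the times container is present in the raw HTML at all, inside the .wprm-recipe element, before any filtering happens. The following is a minimal sketch using only the Python standard library. The names ClassTextExtractor and text_for_class and the sample markup are hypothetical, and the depth tracking assumes reasonably well-formed nesting. To test a real page, pass its downloaded HTML in place of the sample string.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside the first element carrying a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0        # > 0 while inside the matched element
        self.done = False     # True once the matched element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.done:
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def text_for_class(html, class_name):
    """Return the visible text of the first element with class_name."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical markup standing in for a downloaded recipe page.
sample = """
<div class="wprm-recipe">
  <h2>Muffins</h2>
  <div class="wprm-recipe-times-container">Prep Time 10 minutes<br>Cook Time 20 minutes</div>
  <p>Mix everything.</p>
</div>
<p>Footer</p>
"""

recipe_text = text_for_class(sample, "wprm-recipe")
times_text = text_for_class(sample, "wprm-recipe-times-container")
print("times container inside recipe text:", times_text in recipe_text)
```

If the times text is present in the raw HTML but missing from the SCM output, the content is probably being dropped after the fetch rather than never being downloaded.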

Comments (5)

Hi, since then I've tried downloading articles with XPath as well. It doesn't seem to matter whether I use XPath or CSS selector filtering: Content Machine skips certain content when scraping the articles.

Is it possible that it removes the content after scraping it because it doesn't think it's valid?

Employee

TheGypsy wrote:

Hi, since then I've tried downloading articles with XPath as well. It doesn't seem to matter whether I use XPath or CSS selector filtering: Content Machine skips certain content when scraping the articles.

Is it possible that it removes the content after scraping it because it doesn't think it's valid?

Yes, it is certainly possible.

Can you paste an example of the XPath and the web page you tried to download content from?

What was missing?

I'm trying to download a recipe from a web page. This is the page:

http://www.suncakemom.com/treats/cinnamon-pull-apart-bread/

This is the XPath selector I tried:

//*[@id="wprm-recipe-container-2701"]

I also tried the CSS selector, and it returns the same results:

.wprm-recipe
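A side note on the two selectors: the XPath above targets one specific id, while the CSS selector targets a class. The sketch below illustrates the difference on two tiny hypothetical pages. It assumes that the numeric suffix of the id differs from recipe to recipe (not verified against the site), and the second id, 9999, is made up. It uses xml.etree.ElementTree, which supports only a small XPath subset and an exact match on the class attribute, so it is an illustration rather than a reproduction of how SCM evaluates selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-ins for two recipe pages.
PAGE_A = ('<html><body><div id="wprm-recipe-container-2701" '
          'class="wprm-recipe">Bread</div></body></html>')
PAGE_B = ('<html><body><div id="wprm-recipe-container-9999" '
          'class="wprm-recipe">Muffins</div></body></html>')

ID_XPATH = ".//*[@id='wprm-recipe-container-2701']"   # page-specific
CLASS_XPATH = ".//*[@class='wprm-recipe']"            # shared by both pages


def match_texts(page, xpath):
    """Return the text of every element in page matched by xpath."""
    root = ET.fromstring(page)
    return [el.text for el in root.findall(xpath)]


pages = {"a": PAGE_A, "b": PAGE_B}
id_hits = {name: match_texts(page, ID_XPATH) for name, page in pages.items()}
class_hits = {name: match_texts(page, CLASS_XPATH) for name, page in pages.items()}

print("id-based XPath:   ", id_hits)
print("class-based XPath:", class_hits)
```

Under these assumptions the id-based expression only matches the page it was copied from, so a class-based selector is the safer choice when the same filter is reused across URLs.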

When I try a different URL that uses the same plugin, SCM returns different results.

On this page it skips the prep time, cook time and total time and starts with the sentence immediately after them. However, it shows the servings number, whereas on the previous URL it skips it:

http://www.suncakemom.com/treats/low-carb-muffins-with-chocolate-chip/

When I compare the scraped text to the original text on the web page, words are missing from the scraped text. None of the scraped results include the Ingredients header, which has the same CSS class as the Instructions header. Some or most of the ingredients are actually missing as well.
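The comparison described above can be made repeatable with a word-level diff. This is a minimal sketch using difflib from the Python standard library. The function name missing_words and the two sample strings are hypothetical; for a real check, replace them with the text copied from the web page and the text SCM returned.

```python
import difflib


def missing_words(original, scraped):
    """Return runs of words present in original but absent from scraped."""
    a = original.split()
    b = scraped.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    missing = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" and "replace" both mean words from the original are gone.
        if op in ("delete", "replace"):
            missing.append(" ".join(a[i1:i2]))
    return missing


# Hypothetical example texts, not taken from the actual pages.
original = "Ingredients 2 eggs 1 cup flour Instructions Mix and bake"
scraped = "2 eggs Instructions Mix and bake"

missing = missing_words(original, scraped)
for run in missing:
    print("missing:", run)
```

Listing the missing runs for a few URLs side by side makes it easier to see whether the same kind of content (headers, times, ingredient lines) is dropped every time.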

Is there any news on why this could be happening?

Or is there a workaround that would stop SCM from filtering the scraped text?
