
Maroso shared this idea 3 years ago

Employees Involved: SCM (Admin)

Statistics: 9 comments · 1 view · 122 votes

Advanced Proxy Solution

When dealing with a huge keyword list, with each keyword on its own task, a single VPN or any dedicated-proxy solution will go down sometimes. SCM then skips that keyword/task, or some pages don't get their content scraped, so we have to start that specific task again. For a huge keyword/task list this is a nightmare.

A better solution is to give SCM more control over proxy usage, with the following options:

1. Change proxy and retry scrape. An option to set the maximum number of retries per task/page, but with a NEW/DIFFERENT proxy each time. If a proxy gets blocked during scraping, SCM changes the proxy and retries the same task with the new one. Set to 1, SCM skips the task; set to more than 1, it tries again with a new proxy. This option would let SCM reload any page that returned an error, whether in Google scraping or in any other page scraping, and would produce more results across the board (see the sketch after this list).

2. Max timeout when loading a page. Set the maximum number of seconds to wait for a proxy to finish loading a page. If the page loads in that time, fine; if not, attempt to load it again with a new proxy using option #1 above. This way, if a proxy is down, slow, or timed out by a domain, we no longer get errors while scraping, because another proxy takes over.

3. Anti-captcha service integration, only when scraping Google. Antigate is one of the best I've used. This would help keep the proxies up and running for longer.

4. Number of threads option. I think this is already in the advanced settings; I'm not 100% sure, but as far as I tested I was able to change it there, so hopefully I didn't get it wrong. If it doesn't exist, please consider adding more flexibility here.

5. Time between Google queries. A pause, in seconds, between requests to Google. As I've read, you currently randomize this between 20-30 seconds per Google query, but if you implement the proxy rotation and retry option we can go faster and lower it, even down to zero. An option to switch from the standard delay to a custom one would help advanced users a lot and speed up the whole system.
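To make options 1, 2 and 5 concrete, here is a minimal sketch of how they could fit together. It is illustrative only: the function names, the settings, and the use of Python's requests library are my assumptions, not how SCM is actually built (SCM drives PhantomJS internally).

```python
import random
import time

import requests  # assumption: a plain HTTP client stands in for SCM's PhantomJS scraper

# Hypothetical settings mirroring options 1, 2 and 5 above
MAX_RETRIES = 3        # option 1: retries per page, each with a different proxy
PAGE_TIMEOUT = 20      # option 2: max seconds to wait for a page to load
QUERY_DELAY = (0, 5)   # option 5: user-chosen pause range between Google queries


def next_proxy(proxies, current=None):
    """Pick a proxy different from the one that just failed."""
    candidates = [p for p in proxies if p != current]
    return random.choice(candidates) if candidates else current


def fetch_with_rotation(url, proxies):
    """Try a URL up to MAX_RETRIES times, switching to a new proxy after every failure."""
    proxy = next_proxy(proxies)
    for _ in range(MAX_RETRIES):
        try:
            resp = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=PAGE_TIMEOUT,  # option 2: don't hang on a slow or dead proxy
            )
            resp.raise_for_status()
            return resp.text           # success: the keyword/task is not skipped
        except requests.RequestException:
            proxy = next_proxy(proxies, current=proxy)  # option 1: retry with a NEW proxy
    return None                        # give up only after all retries failed


def scrape_keywords(keywords, proxies):
    for kw in keywords:
        html = fetch_with_rotation("https://www.google.com/search?q=" + kw, proxies)
        # ... hand `html` to the content-extraction step ...
        time.sleep(random.uniform(*QUERY_DELAY))  # option 5: configurable instead of a fixed 20-30s
```

The point is simply that a failed or slow page load should cost one retry and one proxy, never the whole keyword.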

Official Answer
SCM (Employee) posted 3 years ago

SCM will rotate proxies up to the number of times you specify.

We also have our own decaptcher service for SCM to break Google captchas.


Comments (9)


@SCM the above requests are just normal stuff, nothing fancy. Tools like a-parser, datacol, and other parsers can do this very easily.

In fact, I run datacol (http://websiteextractor.net/) to extract content for 100k keywords using the settings I described in this request, without any problem, not to mention how fast it is.

I don't mean this negatively; I really love the content generated by SCM, the output. But extracting data from Google and other sites (the input SCM needs) is very slow with SCM.

Why pay $700 for an API when there is a better and cheaper solution now?

About your answers:

1. How does SCM rotate the proxy for each task? I always see it scrape and, if the IP is deindexed/burned, switch IP BUT move on to the next keyword, instead of retrying the same keyword/Google scrape/page scrape with the new IP.

2. Good to know :)

3. Not the most important feature, but if you can add it, fine; if not, it's not a big problem either, as long as we have proxy rotation per task/Google scrape/page scrape.

4. Well, yes, it's actually a big thing. I have 100 dedicated proxies, so running 10 threads is no problem. I do it with other Google scrapers when scraping >100k keywords, extracting the top 100 pages and scraping all of them. Lots of SCM users have lots of dedicated proxies, so I don't see a problem here. We run other content scrapers multi-threaded.

5. I think you should reconsider this option. If we have to start paying $700 for scraping content, all SCM users will have a problem. Why the API, when you can improve the proxy solution instead?

As I said, I do 100k-keyword content extraction with datacol, and a-parser and other tools can do it too with an advanced proxy solution, so we don't need to pay that much.

SCM (Employee)

1. It rotates the proxy per web call, not per task. The problem is that Google bans proxies when they are used with PhantomJS, so SCM rotates through all of them and ends up with an error.

The problem is that I don't have a working anti-gate solution for PhantomJS. a-parser does, and they spend every single day running test scripts to make sure it doesn't break. I want to leverage that for the Google API.

Again, Google bans proxies when they are used with PhantomJS (the scraper SCM uses).

The content-scraping part of SCM is already multi-threaded, so running more jobs in parallel is not likely to make it faster; it will just make it more unstable (memory errors).

The bottlenecks are a) the waiting between Google queries and b) the Google bans.

I appreciate your thoughts, and yes, you have pointed out a real problem with Google getting very aggressive about blocking proxies.


SCM wrote:

1. It rotates the proxy per web call, not per task. The problem is that Google bans proxies when they are used with PhantomJS, so SCM rotates through all of them and ends up with an error. [...]

My understanding from a previous post on here (the one that proposes a VPN) was that only one IP (e.g. the server IP) uses PhantomJS, and the proxies are actually used to scrape the sites, to avoid bans. Is this accurate? Or does the proxy list rotate on the initial web call to Google?

SCM (Employee)

Proxies have been enabled on the PhantomJS side as well, so the proxy rotates on every single web call to Google.

What happens then is that SCM burns through ALL the proxies because they fail, and it then falls back to the server IP.

The issue is that Google is really aggressive about banning IPs (VPN or proxy), which means anti-captcha technology is a must.

The problem is that we don't have that developed specifically for PhantomJS. However, we do have a Google API scraper we can use to bypass all of this. But for those who want to do 70k-keyword scrapes, a paid API will be expensive.


@SCM this is a good start, thanks for taking this feature into consideration.

And yes, anti-captcha technology is a very good option besides proxy rotation. It will resolve a lot of unprocessed tasks. If we also get a global page cache (http://vote.seocontentmachine.com/responses/global-content-cache-not-just-per-task), we can run a campaign much faster and proxy usage will drop.

As an example: all SCM users run multiple tasks, one keyword per task, but a huge number of tasks in the same project, and the keywords are usually all from the same niche; nobody builds a multi-niche project. So the chance of the same pages turning up in the Google results for multiple keywords is huge. If the page cache became global, or even per-project, then after SCM has scraped 10-20% of the keywords it would have built up a page cache that the remaining tasks (each with one keyword in the same niche) could use instead of re-downloading the pages. Tasks in the same project would finish much faster, maybe 500% faster for, say, 50k keywords in the same niche/micro-niche (see the sketch at the end of this comment).

Not to mention that proxy burn would be much lower.
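For illustration, here is a rough sketch of what a per-project page cache could look like. The class name, the on-disk layout, and the download_with_proxy helper are hypothetical, made up for this example, not part of SCM:

```python
import hashlib
import os


class ProjectPageCache:
    """Per-project cache: the first task to download a URL stores the HTML,
    and every later task in the same project reuses it instead of re-downloading."""

    def __init__(self, project_dir):
        self.cache_dir = os.path.join(project_dir, "page_cache")
        os.makedirs(self.cache_dir, exist_ok=True)

    def _path(self, url):
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
        return os.path.join(self.cache_dir, name)

    def get(self, url):
        path = self._path(url)
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                return f.read()  # cache hit: no proxy, no download
        return None

    def put(self, url, html):
        with open(self._path(url), "w", encoding="utf-8") as f:
            f.write(html)


# Usage inside a task (download_with_proxy is a hypothetical stand-in for the scraper):
# cache = ProjectPageCache("projects/my-niche")
# html = cache.get(url)
# if html is None:
#     html = download_with_proxy(url)
#     cache.put(url, html)
```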

Maroso

PS: The Google API scraper is a great suggestion for situations where you want readable content like SCC produces, and we don't overuse it, but for huge quantities of content for other tasks the Google API scraper isn't the solution.

SCM (Employee)

FYI, I just fixed the proxy rotation code for the Google scraper. It will now rotate as many times as you specify on the proxy settings screen. Try this; it should make at least the Google scraper code more reliable.

SCM (Employee)

Maroso wrote:

PS: The Google API scraper is a great suggestion for situations where you want readable content like SCC produces, and we don't overuse it, but for huge quantities of content for other tasks the Google API scraper isn't the solution.

How many proxies are you rotating through to do 80,000 searches a month?

~50?

SCM (Employee)

Maroso wrote:

PS: The Google API scraper is a great suggestion for situations where you want readable content like SCC produces, and we don't overuse it, but for huge quantities of content for other tasks the Google API scraper isn't the solution.

I've added BETA support for captchas; I can set you up with some free testing credits. Please send me an email.
