46 by mgliwka | 26 comments on Hacker News.
To comply with the new European legislation, many websites put a GDPR / cookie consent notice in front of their content. Implementations differ. Some are only a modal covering the page or a bar at the bottom of the screen (in both cases the original content is still in the page). Others redirect the user to a totally different (sub-)domain, or even hijack the request and show the consent form instead of the requested content (on the same URL, with a 200 status code).

The latter ones present an issue for my crawler: I cannot access the content of the page without accepting those notices.

Things I'm considering to bypass those notices:

* US IP address. Easy to implement, but some websites also display those notices to US IPs.
* Heuristics to detect those notices and accept them programmatically. This takes some time to implement. A couple of vendors (e.g. OneTrust) offer off-the-shelf solutions which are easy to identify and automate, but there are also many custom-made solutions, so the system would need to understand the concept of a consent form and how to bypass it. Some forms only require pressing the right button; others involve checkboxes or radio buttons.

To collect test data, one solution might be to visit a set of websites once with a US IP and once with an EU IP, and/or with different user agents (browser or Googlebot).

Do you have any ideas on how to approach this problem? Or are you already using some techniques and willing to share them?
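To make the heuristics idea a bit more concrete, here is a minimal sketch in Python of what the detection step could look like. It only flags a response as a likely consent wall; it does not click anything. The function names (`classify`, `looks_blocked`), the vendor markers, the host pattern, and the thresholds are all assumptions for illustration and would need tuning against real test data such as the US/EU pairs described above.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumed vendor fingerprints; verify against real pages before relying on them.
VENDOR_MARKERS = {
    "onetrust": ("onetrust-consent-sdk", "cdn.cookielaw.org"),
}
CONSENT_WORDS = re.compile(r"\b(cookies?|consent|gdpr|privacy)\b", re.I)
# Assumed naming pattern for dedicated consent (sub-)domains.
CONSENT_HOST = re.compile(r"(^|\.)(consent|privacy|gdpr)\.", re.I)


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def visible_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.chunks).split())


def classify(requested_url, final_url, html):
    """Return a list of reasons the response looks like a consent wall."""
    reasons = []
    req_host = urlsplit(requested_url).hostname or ""
    final_host = urlsplit(final_url).hostname or ""
    # Case 1: redirected to a different host that looks consent-related.
    if final_host != req_host and CONSENT_HOST.search(final_host):
        reasons.append("redirect-to-consent-host")
    # Case 2: markup from a known off-the-shelf vendor is present.
    lowered = html.lower()
    for vendor, markers in VENDOR_MARKERS.items():
        if any(marker in lowered for marker in markers):
            reasons.append("vendor:" + vendor)
    # Case 3: a short page dominated by consent vocabulary, which is what a
    # same-URL, status-200 interstitial tends to look like.
    text = visible_text(html)
    words = text.split()
    hits = len(CONSENT_WORDS.findall(text))
    if words and len(words) < 200 and hits / len(words) > 0.05:
        reasons.append("consent-dominated-text")
    return reasons


def looks_blocked(eu_html, us_html, ratio=0.5):
    """Compare two fetches of the same URL (e.g. EU IP vs. US IP)."""
    eu_len = len(visible_text(eu_html))
    us_len = len(visible_text(us_html))
    return us_len > 0 and eu_len < ratio * us_len
```

A vendor hit on its own is only a signal, since a modal or bottom bar can sit on top of content that is fully present in the HTML. The `redirect-to-consent-host` and `consent-dominated-text` reasons, or a positive `looks_blocked` result on a US/EU pair, are the cases where the crawler actually lost the content and an accept step (for example via a headless browser) would be needed.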