Saving image elements locally with Selenium, without a second request


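Selenium is a tool that many crawler engineers rely on. It does a good job of extracting attributes and text from page elements, but it has one drawback: for an img element it only gives you the link, not the image bytes, even though the browser already loaded the image once when the page was visited. To save a particular img locally you have to download it yourself from that link. That costs extra time, and on sites that restrict hotlinking the download may fail. This article describes a way to save images directly from within Selenium.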

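The principle is a bit of a trick: use Selenium's execute_script method to inject a script that converts every img on the page to base64. The binary data of each image then ends up base64-encoded in the src attribute of its <img>. The code is as follows: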

js = """
        // Fetch one image and resolve with its blob, MIME type and index.
        // Note: fetch() is subject to CORS, so cross-origin images may be rejected.
        var _fetch = function(i, src){
          return fetch(src).then(function(response) {
            if(!response.ok) throw new Error("No image in the response");
            var ct = response.headers.get('Content-Type');
            // Fall back to image/png when the server sends no Content-Type.
            var contentType = 'image/png';
            if(ct !== null){
              contentType = ct.split(';')[0];
            }

            return response.blob().then(function(blob){
              return {
                'blob': blob,
                'mime': contentType,
                'i': i
              };
            });
          });
        };

        // Read the blob as a data URL (data:<mime>;base64,...).
        var _read = function(response){
          return new Promise(function(resolve, reject){
            var blob = new Blob([response.blob], {type : response.mime});
            var reader = new FileReader();
            reader.onload = function(e){
              resolve({'data': e.target.result, 'i': response.i});
            };
            reader.onerror = reject;
            reader.readAsDataURL(blob);
          });
        };

        // Write every data URL collected so far back into its <img>.
        var _replace = function(){
          for (var i = 0, len = q.length; i < len; i++) {
            imgs[q[i].item].src = q[i].data;
          }
        };

        var q = [];
        var imgs = document.querySelectorAll('img');
        for (var i = 0, len = imgs.length; i < len; i++) {
          _fetch(i, imgs[i].src).then(_read).then(function(data){
            q.push({
              'data': data.data,
              'item': data.i
            });
          });
        }
        // Give the queue one second to fill before replacing the src attributes.
        setTimeout(_replace, 1000);
        """
driver.execute_script(js)

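When fetch requests an image, the browser normally serves it from its local cache, so in most cases no network traffic occurs (this depends on the cache headers the site sends). _replace is delayed by 1 second to give the queue time to finish loading. The delay is a fixed guess: images that have not been read by then keep their original src.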
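The fixed one-second delay can be avoided by letting the page report when every image has been processed. The following is a minimal sketch, not the original author's method: it assumes Selenium's execute_async_script, which passes a completion callback as the last script argument. The names ASYNC_JS and inline_images are made up for this illustration.

```python
# Sketch: same fetch -> FileReader idea as above, but the script calls back
# only after every image has either been converted or failed.
ASYNC_JS = """
var done = arguments[arguments.length - 1];  // callback supplied by Selenium
var imgs = Array.prototype.slice.call(document.querySelectorAll('img'));

function toDataURL(img) {
  return fetch(img.src).then(function (response) {
    if (!response.ok) throw new Error('HTTP ' + response.status);
    return response.blob();
  }).then(function (blob) {
    return new Promise(function (resolve, reject) {
      var reader = new FileReader();
      reader.onload = function () { img.src = reader.result; resolve(true); };
      reader.onerror = reject;
      reader.readAsDataURL(blob);
    });
  }).catch(function () { return false; });  // e.g. blocked by CORS
}

Promise.all(imgs.filter(function (img) { return img.src; }).map(toDataURL))
  .then(function (results) { done(results.filter(Boolean).length); });
"""


def inline_images(driver, timeout=10):
    """Convert the page's <img> sources to data URIs; return how many succeeded."""
    driver.set_script_timeout(timeout)  # upper bound for the async script
    return driver.execute_async_script(ASYNC_JS)
```

Calling inline_images(driver) after driver.get(URL) blocks until the callback fires or the script timeout expires, so slow pages are not cut off after one second.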

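Run this script after driver.get(URL), and every <img> in the page source carries a base64 data URI. Here is a small Python script that writes such a base64 string to a file: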

# Example value of an <img> src after the script has run (truncated here for display)
imgsrc = "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQcAAABUCAYAAACPxvJKAAAACXBIWXMAABYl..."

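import base64

def base64img2file(imgsrc: str):
    # imgsrc looks like "data:image/png;base64,<payload>"
    header, encoded = imgsrc.split(',', 1)
    # Use the MIME subtype ("png", "jpeg", ...) as the file suffix
    suffix = header.split(';')[0].split('/')[1]
    with open("demo." + suffix, 'wb') as f:
        f.write(base64.b64decode(encoded))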
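To save every converted image rather than a single hard-coded string, the same decoding step can be applied to a list of src values. This is a sketch under assumptions: save_data_uris, the img_<index> file naming and the DATA_URI pattern are illustrative choices, not part of the original article, and src values that are not base64 image data URIs are simply skipped.

```python
import base64
import os
import re

# Matches "data:image/<subtype>;base64,<payload>"
DATA_URI = re.compile(r"^data:image/([\w.+-]+);base64,(.*)$", re.DOTALL)


def save_data_uris(srcs, out_dir):
    """Write each base64 image data URI in srcs to out_dir; return the saved paths."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for index, src in enumerate(srcs):
        match = DATA_URI.match(src or "")
        if match is None:
            continue  # still a plain URL (or empty): nothing to decode
        subtype, encoded = match.groups()
        suffix = subtype.split("+")[0]  # "svg+xml" -> "svg"
        path = os.path.join(out_dir, "img_%d.%s" % (index, suffix))
        with open(path, "wb") as f:
            f.write(base64.b64decode(encoded))
        saved.append(path)
    return saved
```

With Selenium, the input list could be collected as [e.get_attribute("src") for e in driver.find_elements(By.TAG_NAME, "img")], with By imported from selenium.webdriver.common.by.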
