Unable to reuse local Chrome user dir/cookies #39

Open
machawk1 opened this issue Dec 7, 2018 · 7 comments

machawk1 commented Dec 7, 2018

Are you submitting a bug report or a feature request?

Bug report.

What is the current behavior?

https://github.com/N0taN3rd/Squidwarc/blob/master/manual/configuration.md#userdatadir states that a userDataDir attribute can be specified to reuse the user directory of a system's Chrome. I use a logged-in instance of Chrome on my system, so I wanted to leverage my logged-in cookies to crawl content behind authentication using Squidwarc. I specified the following config file for Squidwarc:

{ "use": "puppeteer", "headless": true, "script": "./userFns.js", "mode": "page-all-links", "depth": 1, "seeds": [ "https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly" ], "warc": { "naming": "url", "append": true }, "connect": { "launch": true, "host": "localhost", "port": 9222, "userDataDir": "/Users/machawk1/Library/Application Support/Google/Chrome" }, "crawlControl": { "globalWait": 5000, "inflightIdle": 1000, "numInflight": 2, "navWait": 8000 } }
...in an attempt to preserve https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly, a URI that will provide a login page if not authenticated. I get the following result on stdout:
Running Crawl From Config File /Users/machawk1/Desktop/squidwarcWithCookies.json
With great power comes great responsibility!
Squidwarc is not responsible for ill behaved user supplied scripts!

Crawler Operating In page-all-links mode
Crawler Will Be Preserving 1 Seeds
Crawler Will Be Generating WARC Files Using the filenamified url
Crawler Generated WARCs Will Be Placed At /private/tmp/Squidwarc
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly
Running user script
Crawler Generating WARC
Crawler Has 18 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#column-one
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#column-one
Running user script
Crawler Generating WARC
Crawler Has 17 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#searchInput
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#searchInput
Running user script
Crawler Generating WARC
Crawler Has 16 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly&returntoquery=
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly&returntoquery=
Running user script
Crawler Generating WARC
Crawler Has 15 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Badtitle
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Badtitle
Running user script
Crawler Generating WARC
Crawler Has 14 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly
Running user script
Crawler Generating WARC
Crawler Has 13 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Main_Page
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Main_Page
Running user script
Crawler Generating WARC
Crawler Has 12 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Community_portal
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Community_portal
Running user script
Crawler Generating WARC
Crawler Has 11 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Current_events
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Current_events
Running user script
Crawler Generating WARC
Crawler Has 10 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:RecentChanges
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:RecentChanges
Running user script
Crawler Generating WARC
Crawler Has 9 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Random
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Random
Running user script
Crawler Generating WARC
Crawler Has 8 Seeds Left To Crawl
Crawler Navigating To https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents
Crawler Navigated To https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents
Running user script
Crawler Generating WARC
Crawler Has 7 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Localhelppage
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Localhelppage
Running user script
Crawler Generating WARC
Crawler Has 6 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:SpecialPages
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:SpecialPages
Running user script
Crawler Generating WARC
Crawler Has 5 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:Badtitle&printable=yes
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:Badtitle&printable=yes
Running user script
Crawler Generating WARC
Crawler Has 4 Seeds Left To Crawl
Crawler Navigating To https://www.mediawiki.org/
A Fatal Error Occurred
Error: options.stripFragment is renamed to options.stripHash

  • index.js:35 module.exports
    [Squidwarc]/[normalize-url]/index.js:35:9

  • _createHybrid.js:87 wrapper
    [Squidwarc]/[lodash]/_createHybrid.js:87:15

  • puppeteer.js:155 PuppeteerCrawler.navigate
    /private/tmp/Squidwarc/lib/crawler/puppeteer.js:155:11

Please Inform The Maintainer Of This Project About It. Information In package.json

The resulting WARC does not contain any records related to the specified URI, which is odd, since anonymous access results in an HTTP 200. The URI https://ws-dl.cs.odu.edu/wiki/index.php/Special:Random, however, is present in the WARC. Replaying this page shows a login interface, indicating that my browser's cookies were not used.

What is the expected behavior?

Squidwarc uses my local Chrome's cookies and captures the page behind authentication, per the manual.

What's your environment?

macOS 10.14.2
Squidwarc a402335 (current master)
node v10.12.0

Other information

We discussed this informally via Slack. Previously, I experienced this config borking my Chrome's user directory (i.e., conventionally using Chrome would no longer allow creds to "stick"), but I can no longer replicate this.

@machawk1 machawk1 added the bug label Dec 7, 2018

machawk1 commented Dec 7, 2018

An update: after running the above, it appears that the cookies for the wiki site at the target URI of the crawl have been removed and I needed to log in again. This is a case of crawling-considered-harmful and an unfortunate side effect.

EDIT: It appears to have affected the retention of other site cookies (e.g., facebook.com) as well.

@N0taN3rd N0taN3rd added this to To do in 1.3.0 via automation Dec 7, 2018
@N0taN3rd N0taN3rd moved this from To do to In progress in 1.3.0 Dec 7, 2018
N0taN3rd added a commit that referenced this issue Dec 16, 2018

machawk1 commented Dec 17, 2018

@N0taN3rd Per your suggestion, I pulled 9bbc461 and re-installed with the same (above) config.json.

The crawl finished within a couple of minutes with an error, which I did not mention in the ticket description but which may be relevant for debugging:

Crawler Has 8 Seeds Left To Crawl
Crawler Navigating To https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents
A Fatal Error Occurred
  Error: options.stripFragment is renamed to options.stripHash

  - index.js:35 module.exports
    [Squidwarc]/[normalize-url]/index.js:35:9

  - _createHybrid.js:87 wrapper
    [Squidwarc]/[lodash]/_createHybrid.js:87:15

  - puppeteer.js:155 PuppeteerCrawler.navigate
    /private/tmp/Squidwarc/lib/crawler/puppeteer.js:155:11


Please Inform The Maintainer Of This Project About It. Information In package.json

Upon re-launching Chrome, some sites where I would have a cookie (inclusive of the ws-dl wiki site) showed that I was no longer logged in. Others, e.g., gmail.com, retained my cookie. EDIT: Google reported a cookie error on subsequent logins (pic).

Viewing the WARC showed that the URI specified to be archived was not present but a capture of the wiki login page was present and replay-able.


machawk1 commented Feb 2, 2019

As discussed via Slack, making a duplicate of my profile might help resolve this issue. I did so via:

cp -r "/Users/machawk1/Library/Application Support/Google/Chrome" /tmp/Chrome

...then ran ./bootstrap.sh and ./run-crawler.sh -c wsdlwiki.config from the root of my Squidwarc working directory at current master a2f1d63.

My macOS 10.14.2 Chrome reports version 72.0.3626.81.

wsdlwiki.config is the same as above but with the path changed to /tmp/Chrome.
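
For reference, only the connect block should differ from the config above; under that assumption it now reads:

  "connect": {
    "launch": true,
    "host": "localhost",
    "port": 9222,
    "userDataDir": "/tmp/Chrome"
  }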

./run-crawler.sh -c wsdlwiki.config
Running Crawl From Config File wsdlwiki.config
With great power comes great responsibility!
Squidwarc is not responsible for ill behaved user supplied scripts!

Crawler Operating In page-only mode
Crawler Will Be Preserving 1 Seeds
Crawler Generated WARCs Will Be Placed At /private/tmp/Squidwarc in appending mode
Crawler Will Be Generating WARC Files Using the filenamified url
A Fatal Error Occurred
  Error: Failed to launch chrome!
  dlopen /private/tmp/Squidwarc/node_modules/puppeteer/.local-chromium/mac-624487/chrome-mac/Chromium.app/Contents/MacOS/../Versions/73.0.3679.0/Chromium Framework.framework/Chromium Framework: dlopen(/private/tmp/Squidwarc/node_modules/puppeteer/.local-chromium/mac-624487/chrome-mac/Chromium.app/Contents/MacOS/../Versions/73.0.3679.0/Chromium Framework.framework/Chromium Framework, 261): image not found
  TROUBLESHOOTING: https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md

  - Launcher.js:360 onClose
    [Squidwarc]/[puppeteer]/lib/Launcher.js:360:14

  - Launcher.js:349 Interface.helper.addEventListener
    [Squidwarc]/[puppeteer]/lib/Launcher.js:349:50

  - events.js:187 Interface.emit
    events.js:187:15

  - readline.js:379 Interface.close
    readline.js:379:8

  - readline.js:157 Socket.onend
    readline.js:157:10

  - events.js:187 Socket.emit
    events.js:187:15

  - _stream_readable.js:1094 endReadableNT
    _stream_readable.js:1094:12

  - next_tick.js:63 process._tickCallback
    internal/process/next_tick.js:63:19


Please Inform The Maintainer Of This Project About It. Information In package.json

It's interesting and potentially problematic that Squidwarc/puppeteer is trying to use Chromium 73.0.3679.0, per the error. Do you think the version difference is the issue, @N0taN3rd, or something else?


N0taN3rd commented Feb 5, 2019

TBH I am completely unsure at this point.

I have had success on Linux using the same browser (though I do have to re-sign in every time :goberserk:), but I do believe that switching between stable <-> dev <-> unstable can cause some issues.

The best bet I can think of right now is to use a completely new user data dir: initially launch the version of Chrome you want with --user-data-dir=<path to wherever you want it>, sign into your Google profile in Chrome, and then sign into any of the sites you want to crawl.

That way, when you start the crawl, that completely new user data dir is unique to that browser.
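
For example, on macOS something along these lines should do it (stock install path assumed; the profile path is arbitrary):

# Launch the installed Chrome with a brand-new user data dir
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir=/tmp/squidwarc-profile
# Sign into the sites you want to crawl, quit Chrome,
# then point connect.userDataDir in the Squidwarc config at /tmp/squidwarc-profile.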


N0taN3rd commented Feb 5, 2019

This will require some additional changes to Squidwarc, but I suspect that the issue is with setting the user data dir itself rather than letting the browser's normal resolution of that directory's path take place.

So if there were a config option to not do anything data-dir/password related, the browser would figure it out correctly.

//cc @N0taN3rd


machawk1 commented Feb 5, 2019

Having to re-sign in somewhat defeats the purpose of reusing the user data directory. It's reminiscent of the Webrecorder approach (:P) and is not nearly as powerful as reusing existing cookies/logins, if possible.

With regard to the delta between the system's Chrome version and the one used by Squidwarc, is there currently a way to tell Squidwarc to use a certain version of Chrome(ium)? Having those match up while reusing the data dir might be one needed test to see if the problem persists.
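
For reference, puppeteer itself can be pointed at an existing browser binary via the PUPPETEER_EXECUTABLE_PATH environment variable; I have not verified whether Squidwarc's launch path honors it, but something like this might be worth a try:

# Point puppeteer at the system Chrome instead of its bundled Chromium (unverified with Squidwarc)
export PUPPETEER_EXECUTABLE_PATH="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
./run-crawler.sh -c wsdlwiki.config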


Mauville commented Nov 8, 2021

For the truly desperate, I was able to load some cookies by doing the following:

  1. Set a breakpoint on a line in the project and start debugging
  2. Wait for the browser to load
  3. Manually log into the sites you want in another tab to store the cookie
  4. Resume execution

This lets you "load" cookies into the session.
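
One way to get that breakpoint is to start the crawler under Node's inspector, paused on the first line; the entry script name below is an assumption, so check run-crawler.sh for the one your checkout actually invokes:

# Start the crawler paused under the inspector (entry script name assumed)
node --inspect-brk run-crawler.js -c wsdlwiki.config
# Attach a debugger (e.g. via chrome://inspect), set the breakpoint, then continue.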
