Original Response headers (i.e., start with X-Archive-Orig-...) are modified #14

Open
maturban opened this issue Aug 14, 2017 · 1 comment

@maturban

Are you submitting a bug report or a feature request?

A bug report.

What is the current behavior?

Generate a WARC file for https://web.archive.org/web/20170705235134/http://www.cs.odu.edu/~maturban/ .

What is the expected behavior?

The response headers returned for https://web.archive.org/web/20170705235134/http://www.cs.odu.edu/~maturban/ should be as follows:

Content-Encoding: gzip
X-App-Server: wwwb-app42
X-location: All
Transfer-Encoding: chunked
X-Archive-Playback: 0
X-Archive-Orig-vary: Accept-Encoding
Memento-Datetime: Wed, 05 Jul 2017 23:51:34 GMT
X-ts: ----
X-Archive-Orig-server: nginx
Server: Tengine/2.1.0
X-Archive-Guessed-Charset: utf-8
Content-Type: text/html; charset=utf-8
Connection: keep-alive
X-Page-Cache: MISS
X-Archive-Orig-connection: close
X-Archive-Orig-date: Wed, 05 Jul 2017 23:51:39 GMT
X-Archive-Orig-content-length: 11603
Link: <http://www.cs.odu.edu/~maturban/>; rel="original", <https://web.archive.org/web/timemap/link/http://www.cs.odu.edu/~maturban/>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/http://www.cs.odu.edu/~maturban/>; rel="timegate", <https://web.archive.org/web/20140917205517/http://www.cs.odu.edu/~maturban/>; rel="first memento"; datetime="Wed, 17 Sep 2014 20:55:17 GMT", <https://web.archive.org/web/20170614104612/http://www.cs.odu.edu/~maturban/>; rel="prev memento"; datetime="Wed, 14 Jun 2017 10:46:12 GMT", <https://web.archive.org/web/20170705235134/http://www.cs.odu.edu/~maturban/>; rel="memento"; datetime="Wed, 05 Jul 2017 23:51:34 GMT", <https://web.archive.org/web/20170710100858/http://www.cs.odu.edu/~maturban>; rel="next memento"; datetime="Mon, 10 Jul 2017 10:08:58 GMT", <https://web.archive.org/web/20170710100917/http://www.cs.odu.edu/~maturban/>; rel="last memento"; datetime="Mon, 10 Jul 2017 10:09:17 GMT"

But we got:

Date: Mon, 14 Aug 2017 03:41:43 GMT
X-App-Server: wwwb-app42
X-location: All
Transfer-Encoding: chunked
X-Archive-Playback: 0
X-Archive-Orig-vary: Accept-Encoding
Memento-Datetime: Wed, 05 Jul 2017 23:51:34 GMT
X-ts: ----
X-Archive-Orig-server: nginx
Server: Tengine/2.1.0
X-Archive-Guessed-Charset: utf-8
Content-Type: text/html; charset=utf-8
Connection: keep-alive
X-Page-Cache: MISS
X-Archive-Orig-connection: close
X-Archive-Orig-date: Wed, 05 Jul 2017 23:51:39 GMT
X-Archive-Orig-Content-Length: 22495
Link: <http://www.cs.odu.edu/~maturban/>; rel="original", <https://web.archive.org/web/timemap/link/http://www.cs.odu.edu/~maturban/>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/http://www.cs.odu.edu/~maturban/>; rel="timegate", <https://web.archive.org/web/20140917205517/http://www.cs.odu.edu/~maturban/>; rel="first memento"; datetime="Wed, 17 Sep 2014 20:55:17 GMT", <https://web.archive.org/web/20170614104612/http://www.cs.odu.edu/~maturban/>; rel="prev memento"; datetime="Wed, 14 Jun 2017 10:46:12 GMT", <https://web.archive.org/web/20170705235134/http://www.cs.odu.edu/~maturban/>; rel="memento"; datetime="Wed, 05 Jul 2017 23:51:34 GMT", <https://web.archive.org/web/20170710100858/http://www.cs.odu.edu/~maturban>; rel="next memento"; datetime="Mon, 10 Jul 2017 10:08:58 GMT", <https://web.archive.org/web/20170710100917/http://www.cs.odu.edu/~maturban/>; rel="last memento"; datetime="Mon, 10 Jul 2017 10:09:17 GMT"

The issue is that the value of one of the original response headers (X-Archive-Orig-content-length) has been changed from 11603 to 22495. In general, I think none of the original response headers (i.e., those whose names start with "X-Archive-Orig-") should be modified.

What's your environment?

macOS Sierra

Other information

I think the issue is from the following lines of code:
File: .../node-modules/node-warc/lib/writers/remoteChrome.js
Lines: 767 and 768
The code:

          responseHeaders = responseHeaders.replace(noGZ, '')
          responseHeaders = responseHeaders.replace(replaceContentLen, `Content-Length: ${Buffer.byteLength(resData, 'utf8')}${CRLF}`)
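
A minimal sketch of how a replacement like the one above could end up rewriting an X-Archive-Orig-... header. The two patterns here are hypothetical stand-ins (the actual `noGZ` and `replaceContentLen` definitions in node-warc are not shown in this report), and the header block is shortened for illustration. It assumes the pattern is case-insensitive and not anchored to the start of a line, which would also explain why the header name came back as `X-Archive-Orig-Content-Length`.

```javascript
const CRLF = '\r\n'

// Hypothetical patterns, NOT the ones defined in node-warc.
// Unanchored: can match in the middle of a longer header name.
const looseContentLen = /Content-Length:.*\r\n/i
// Anchored: only matches a header line that begins with "Content-Length:".
const anchoredContentLen = /^Content-Length:.*\r\n/im

// Shortened, illustrative header block.
const headers = [
  'HTTP/1.1 200 OK',
  'X-Archive-Orig-content-length: 11603',
  'Content-Length: 11603',
  ''
].join(CRLF)

// Stand-in for the decoded response body.
const resData = '<html>...</html>'
const replacement = `Content-Length: ${Buffer.byteLength(resData, 'utf8')}${CRLF}`

// The first match is inside the X-Archive-Orig-... line, so that line is rewritten.
const loose = headers.replace(looseContentLen, replacement)

// The X-Archive-Orig-... line is left alone; only the real header is rewritten.
const anchored = headers.replace(anchoredContentLen, replacement)

console.log(JSON.stringify(loose))
console.log(JSON.stringify(anchored))
```

With the `m` flag, `^` matches at the start of every line, so the anchored pattern leaves any header that merely ends in "content-length" untouched.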

@N0taN3rd (Owner)

@maturban

Thank you for pointing that out. I really should go ahead and add re-gzip or re-deflate via zlib
rather than tightening up the regex used.
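
A rough sketch of what the re-gzip approach could look like, using Node's built-in `zlib` module. The helper name `regzipBody` is hypothetical and is not part of node-warc. The idea is that if the decoded body is compressed again before it is written, the `Content-Encoding: gzip` header no longer has to be stripped, and any length that is recorded can be taken from the compressed bytes.

```javascript
const zlib = require('zlib')

// Hypothetical helper, not part of node-warc: re-compress a decoded body
// and report the byte length of the compressed payload.
function regzipBody (decodedBody) {
  const compressed = zlib.gzipSync(Buffer.from(decodedBody, 'utf8'))
  return { body: compressed, contentLength: compressed.length }
}

const { body, contentLength } = regzipBody('<html>hello</html>')

// Round trip: decompressing yields the original text again.
console.log(zlib.gunzipSync(body).toString('utf8'))
console.log(contentLength === body.length)
```

Note that the re-compressed length will generally differ from the length the origin server sent, since the compressor and its settings are not the same, so the X-Archive-Orig-... values should still be left as received.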

One for the node-warc issue
