
Commit b90a9a5

Authored Mar 19, 2025
4482 – Handle Archive.org HTML errors (#533)
Context

- Sometimes Archive.org will send an HTML response. These are usually issues on their side, like "429 Too Many Requests" and "Item Not Available", but they also caused an error on our side, since we always expected JSON.
- We have two jobs that run for Archive.org; both make requests, so this happens at different places and times, but it needs very similar error handling.
- We also wanted to take this opportunity to centralize more of our error handling.

How

- Handling HTML errors: I extracted a json_parse method we can use for both requests. It rescues JSON::ParserError and re-raises it as a more specific error or as itself. Then handle_archiving_exceptions deals with it (a simplified sketch of this flow follows below).
- Other errors: to make this more cohesive, I updated these to have the same behavior as the HTML errors, so we also re-raise them and handle_archiving_exceptions deals with them.
- On handle_archiving_exceptions: we had been dealing with Archive.org errors directly in its module, but even though they are more Archive.org related, I think it makes sense to centralize everything in handle_archiving_exceptions.
- I did some work before where we moved error handling inside get_archive_org_status (28941ec). I think this might not be needed, but even if it is, it might make more sense for us to wrap that job with handle_archiving_exceptions, since it would be easier for us to avoid duplication.

References: CV2-4482, PR 533
1 parent c0b74a0 commit b90a9a5
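
A minimal sketch of the flow described above, for orientation only; the real implementation is in the diffs below. The exception classes mirror the ones in this commit, and the sketch relies on the same detail the commit relies on: the JSON::ParserError message contains the unparseable body text.

    require 'json'

    module Pender
      module Exception
        class RateLimitExceeded < StandardError; end
        class ItemNotAvailable < StandardError; end
      end
    end

    # json_parse: re-raise HTML error pages as more specific exceptions,
    # or as JSON::ParserError itself.
    def json_parse(body)
      JSON.parse(body)
    rescue JSON::ParserError => error
      raise Pender::Exception::RateLimitExceeded, error.message if error.message.include?('Too Many Requests')
      raise Pender::Exception::ItemNotAvailable, error.message if error.message.include?('Item Not Available')
      raise
    end

    # handle_archiving_exceptions: both Archive.org jobs re-raise, and this
    # single rescue deals with the errors in one place.
    def handle_archiving_exceptions
      yield
    rescue Pender::Exception::RateLimitExceeded, Pender::Exception::ItemNotAvailable, JSON::ParserError => error
      puts "handled centrally: #{error.class}"
    end

    handle_archiving_exceptions { json_parse('<html><body><h1>429 Too Many Requests</h1></body></html>') }
    # => handled centrally: Pender::Exception::RateLimitExceeded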

File tree

4 files changed: +149 -51 lines changed

app/models/concerns/media_archive_org_archiver.rb

Lines changed: 29 additions & 37 deletions

@@ -25,36 +25,17 @@ def send_to_archive_org(url, key_id, _supported = nil)
 
       Rails.logger.info level: 'INFO', message: '[archive_org] Sent URL to archive', url: url, code: response.code, response: response.message
 
-      if response&.body&.include?("<html><body><h1>429 Too Many Requests</h1>") && response&.code == "500"
-        data = snapshot_data.to_h.merge({ error: { message: "(#{response.body}) #{response}", code: Lapis::ErrorCodes::const_get('ARCHIVER_ERROR') }})
-        Media.notify_webhook_and_update_cache('archive_org', url, data, key_id)
-        PenderSentry.notify(
-          Pender::Exception::RateLimitExceeded.new("429 Too Many Requests"),
-          url: url,
-          response_body: response.body
-        )
-        return
-      end
-
-      body = JSON.parse(response.body)
+      body = json_parse(response)
       if body['job_id']
         ArchiverStatusJob.perform_in(2.minutes, body['job_id'], url, key_id)
       else
         data = snapshot_data.to_h.merge({ error: { message: "(#{body['status_ext']}) #{body['message']}", code: Lapis::ErrorCodes::const_get('ARCHIVER_ERROR') }})
         Media.notify_webhook_and_update_cache('archive_org', url, data, key_id)
 
-        if body['message']&.include?('The same snapshot') || body['status_ext'] == 'error:too-many-daily-captures'
-          PenderSentry.notify(
-            Pender::Exception::TooManyCaptures.new(body['message']),
-            url: url,
-            response_body: body
-          )
+        if body['message']&.include?('The same snapshot') || body['status_ext'] == 'error:too-many-daily-captures'
+          raise Pender::Exception::TooManyCaptures, body['message']
         elsif body['status_ext'] == 'error:blocked-url'
-          PenderSentry.notify(
-            Pender::Exception::BlockedUrl.new(body['message']),
-            url: url,
-            response_body: body
-          )
+          raise Pender::Exception::BlockedUrl, body['message']
         else
          raise Pender::Exception::ArchiveOrgError, "(#{body['status_ext']}) #{body['message']}"
        end
@@ -78,20 +59,17 @@ def get_available_archive_org_snapshot(url, key_id)
     end
 
     def get_archive_org_status(job_id, url, key_id)
-      begin
-        http, request = Media.archive_org_request("https://web.archive.org/save/status/#{job_id}", 'Get')
-        response = http.request(request)
-        body = JSON.parse(response.body)
-        if body['status'] == 'success'
-          location = "https://web.archive.org/web/#{body['timestamp']}/#{url}"
-          data = { location: location }
-          Media.notify_webhook_and_update_cache('archive_org', url, data, key_id)
-        else
-          message = body['status'] == 'pending' ? 'Capture is pending' : "(#{body['status_ext']}) #{body['message']}"
-          raise Pender::Exception::RetryLater, message
-        end
-      rescue StandardError => error
-        raise Pender::Exception::RetryLater, error.message
+      http, request = Media.archive_org_request("https://web.archive.org/save/status/#{job_id}", 'Get')
+      response = http.request(request)
+      body = json_parse(response)
+
+      if body['status'] == 'success'
+        location = "https://web.archive.org/web/#{body['timestamp']}/#{url}"
+        data = { location: location }
+        Media.notify_webhook_and_update_cache('archive_org', url, data, key_id)
+      else
+        message = body['status'] == 'pending' ? 'Capture is pending' : "(#{body['status_ext']}) #{body['message']}"
+        raise Pender::Exception::RetryLater, message
       end
     end
 
@@ -106,5 +84,19 @@ def archive_org_request(request_url, verb)
       }
       [http, "Net::HTTP::#{verb}".constantize.new(uri, headers)]
     end
+
+    def json_parse(response)
+      begin
+        JSON.parse(response.body)
+      rescue JSON::ParserError => error
+        if error.message.include?("Too Many Requests")
+          raise Pender::Exception::RateLimitExceeded, error.message
+        elsif error.message.include?("Item Not Available")
+          raise Pender::Exception::ItemNotAvailable, error.message
+        else
+          raise JSON::ParserError, error.message
+        end
+      end
+    end
   end
 end
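
As an aside, the json_parse method above leans on the fact that Ruby's JSON parser includes the start of the unparseable input in the exception message, so the text of an Archive.org HTML error page ("429 Too Many Requests", "Item Not Available") surfaces in error.message. A small illustration; the exact message wording varies across json gem versions:

    require 'json'

    begin
      JSON.parse('<html><body><h1>429 Too Many Requests</h1></body></html>')
    rescue JSON::ParserError => error
      puts error.message
      # e.g. "unexpected token at '<html><body><h1>429 Too Many Requests</h1></body></html>'"
      puts error.message.include?('Too Many Requests')  # => true
    end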

app/models/concerns/media_archiver.rb

Lines changed: 27 additions & 4 deletions

@@ -95,15 +95,38 @@ def handle_archiving_exceptions(archiver, params)
       yield
     rescue Pender::Exception::RetryLater => error
       retry_archiving_after_failure(archiver, { message: error.message })
+    rescue Pender::Exception::BlockedUrl,
+           Pender::Exception::TooManyCaptures,
+           Pender::Exception::ItemNotAvailable,
+           Pender::Exception::RateLimitExceeded,
+           JSON::ParserError => error
+      post_error_tasks(archiver, params, error)
     rescue StandardError => error
-      error_type = 'ARCHIVER_ERROR'
-      params.merge!({code: Lapis::ErrorCodes::const_get(error_type), message: error.message})
-      data = { error: { message: params[:message], code: Lapis::ErrorCodes::const_get(error_type) }}
-      Media.notify_webhook_and_update_cache(archiver, params[:url], data, params[:key_id])
+      post_error_tasks(archiver, params, error, false)
       retry_archiving_after_failure(archiver, params)
     end
   end
 
+  def post_error_tasks(archiver, params, error, notify_sentry = true)
+    error_type = 'ARCHIVER_ERROR'
+    if notify_sentry then Media.notify_sentry(archiver, params[:url], error) end
+    data = Media.updated_errored_data(archiver, params, error, error_type = 'ARCHIVER_ERROR')
+    Media.notify_webhook_and_update_cache(archiver, params[:url], data, params[:key_id])
+  end
+
+  def notify_sentry(archiver, url, error)
+    PenderSentry.notify(
+      error.class.new("#{archiver}: #{error.message}"),
+      url: url,
+      response_body: error.message
+    )
+  end
+
+  def updated_errored_data(archiver, params, error, error_type = 'ARCHIVER_ERROR')
+    params.merge!({code: Lapis::ErrorCodes::const_get(error_type), message: error.message})
+    { error: { message: params[:message], code: Lapis::ErrorCodes::const_get(error_type) }}
+  end
+
   def retry_archiving_after_failure(archiver, params)
     Rails.logger.warn level: 'WARN', message: "#{params[:message]}", url: params[:url], archiver: archiver, error_code: params[:code], error_message: params[:message]
     raise Pender::Exception::RetryLater, "[#{archiver}]: #{params[:message]}"
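
The description above floats wrapping the status job itself with handle_archiving_exceptions instead of handling errors inside get_archive_org_status. A hypothetical sketch of that idea, not part of this commit: the perform signature is taken from the ArchiverStatusJob.perform_in call above, while the Sidekiq::Worker include and calling these helpers as class methods on Media are assumptions.

    class ArchiverStatusJob
      include Sidekiq::Worker

      def perform(job_id, url, key_id)
        # Any error raised while checking the capture status (RetryLater,
        # RateLimitExceeded, ItemNotAvailable, JSON::ParserError, ...) would be
        # rescued in one place by handle_archiving_exceptions.
        Media.handle_archiving_exceptions('archive_org', { url: url, key_id: key_id }) do
          Media.get_archive_org_status(job_id, url, key_id)
        end
      end
    end
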
Lines changed: 5 additions & 0 deletions (new file)

@@ -0,0 +1,5 @@
+module Pender
+  module Exception
+    class ItemNotAvailable < StandardError; end
+  end
+end

test/models/archiver_test.rb

Lines changed: 88 additions & 10 deletions

@@ -159,7 +159,7 @@ def create_api_key_with_webhook_for_perma_cc
     end
   end
 
-  test "when Archive.org fails with Pender::Exception::ArchiveOrgError it should retry, update data with snapshot (if available) and error" do
+  test "when Archive.org fails with Pender::Exception::ArchiveOrgError it should not notify Sentry, it should retry, update data with snapshot (if available) and error" do
     api_key = create_api_key_with_webhook
     url = 'https://example.com/'
 
@@ -171,10 +171,19 @@ def create_api_key_with_webhook_for_perma_cc
     WebMock.stub_request(:post, /example.com\/webhook/).to_return(status: 200, body: '')
     WebMock.stub_request(:post, /web.archive.org\/save/).to_return_json(status: 500, body: { status_ext: '500', message: 'Random Error.', url: url})
 
+    sentry_call_count = 0
+    arguments_checker = Proc.new do |e|
+      sentry_call_count += 1
+    end
+
     media = create_media url: url, key: api_key
-    assert_raises StandardError do
-      media.as_json(archivers: 'archive_org')
+    PenderSentry.stub(:notify, arguments_checker) do
+      assert_raises StandardError do
+        media.as_json(archivers: 'archive_org')
+      end
     end
+
+    assert_equal 0, sentry_call_count
     media_data = Pender::Store.current.read(Media.get_id(url), :json)
     assert_equal '(500) Random Error.', media_data.dig('archives', 'archive_org', 'error', 'message')
     assert_equal "https://web.archive.org/web/timestamp/#{url}", media_data.dig('archives', 'archive_org', 'location')
@@ -729,25 +738,27 @@ def create_api_key_with_webhook_for_perma_cc
     assert_equal({ 'location' => 'http://perma.cc/perma-cc-guid-1'}, cached['perma_cc'])
   end
 
-  test "when Archive.org returns 429 Too Many Requests it should notify Sentry with RateLimitExceeded" do
+  test "when Archive.org status returns 429 Too Many Requests it should notify Sentry with RateLimitExceeded" do
     api_key = create_api_key_with_webhook
     url = 'https://example.com/'
 
     Media.any_instance.unstub(:archive_to_archive_org)
-    Media.stubs(:get_available_archive_org_snapshot).returns({ status_ext: 'error:rate-limit', message: '429 Too Many Requests', url: url })
-
+
     WebMock.stub_request(:get, url).to_return(status: 200, body: '<html>A page</html>')
     WebMock.stub_request(:post, /safebrowsing\.googleapis\.com/).to_return(status: 200, body: '{}')
     WebMock.stub_request(:post, /example.com\/webhook/).to_return(status: 200, body: '')
-    WebMock.stub_request(:post, /archive.org\/save/).to_return(status: 500, body: '<html><body><h1>429 Too Many Requests</h1>')
+    # This response comes from ArchiveStatusJob, in order to call it we need to get a job_id
+    WebMock.stub_request(:post, /web.archive.org\/save/).to_return_json(body: {url: url, job_id: 'ebb13d31-7fcf-4dce-890c-c256e2823ca0' })
+    WebMock.stub_request(:get, /archive.org\/wayback/).to_return_json(body: {"archived_snapshots":{}}, headers: {})
+    WebMock.stub_request(:get, /archive.org\/save\/status/).to_return(body: '429 Too Many Requests')
 
     m = Media.new url: url, key: api_key
 
     sentry_call_count = 0
     arguments_checker = Proc.new do |e|
       sentry_call_count += 1
       assert_instance_of Pender::Exception::RateLimitExceeded, e
-      assert_equal '429 Too Many Requests', e.message
+      assert_includes e.message, 'Too Many Requests'
     end
 
     PenderSentry.stub(:notify, arguments_checker) do
@@ -759,8 +770,75 @@ def create_api_key_with_webhook_for_perma_cc
     assert_equal 1, sentry_call_count
 
     media_data = Pender::Store.current.read(Media.get_id(url), :json)
-    expected_error_message = "<html><body><h1>429 Too Many Requests</h1>"
+    expected_error_message = "Too Many Requests"
     assert_includes media_data.dig('archives', 'archive_org', 'error', 'message'), expected_error_message
     assert_equal Lapis::ErrorCodes::const_get('ARCHIVER_ERROR'), media_data.dig('archives', 'archive_org', 'error', 'code')
-  end
+  end
+
+  test "when Archive.org returns HTML response it should notify Sentry with JSON::ParserError" do
+    api_key = create_api_key_with_webhook
+    url = 'https://example.com/'
+
+    Media.any_instance.unstub(:archive_to_archive_org)
+
+    WebMock.stub_request(:get, url).to_return(status: 200, body: '<html>A page</html>')
+    WebMock.stub_request(:post, /safebrowsing\.googleapis\.com/).to_return(status: 200, body: '{}')
+    WebMock.stub_request(:post, /example.com\/webhook/).to_return(status: 200, body: '')
+    WebMock.stub_request(:post, /web.archive.org\/save/).to_return_json(body: '<html>A html response</html>' )
+    WebMock.stub_request(:get, /archive.org\/wayback/).to_return_json(body: {"archived_snapshots":{}}, headers: {})
+
+    m = Media.new url: url, key: api_key
+
+    sentry_call_count = 0
+    arguments_checker = Proc.new do |e|
+      sentry_call_count += 1
+      assert_instance_of JSON::ParserError, e
+    end
+
+    PenderSentry.stub(:notify, arguments_checker) do
+      assert_nothing_raised do
+        m.as_json(archivers: 'archive_org')
+      end
+    end
+
+    assert_equal 1, sentry_call_count
+
+    media_data = Pender::Store.current.read(Media.get_id(url), :json)
+    assert_equal Lapis::ErrorCodes::const_get('ARCHIVER_ERROR'), media_data.dig('archives', 'archive_org', 'error', 'code')
+  end
+
+  test "when Archive.org returns 'Item Not Available' response it should notify Sentry with Pender::Exception::ItemNotAvailable" do
+    api_key = create_api_key_with_webhook
+    url = 'https://example.com/'
+
+    Media.any_instance.unstub(:archive_to_archive_org)
+
+    WebMock.stub_request(:get, url).to_return(status: 200, body: '<html>A page</html>')
+    WebMock.stub_request(:post, /safebrowsing\.googleapis\.com/).to_return(status: 200, body: '{}')
+    WebMock.stub_request(:post, /example.com\/webhook/).to_return(status: 200, body: '')
+    WebMock.stub_request(:post, /web.archive.org\/save/).to_return_json(body: 'Item Not Available' )
+    WebMock.stub_request(:get, /archive.org\/wayback/).to_return_json(body: {"archived_snapshots":{}}, headers: {})
+
+    m = Media.new url: url, key: api_key
+
+    sentry_call_count = 0
+    arguments_checker = Proc.new do |e|
+      sentry_call_count += 1
+      assert_instance_of Pender::Exception::ItemNotAvailable, e
+      assert_includes e.message, 'Item Not Available'
+    end
+
+    PenderSentry.stub(:notify, arguments_checker) do
+      assert_nothing_raised do
+        m.as_json(archivers: 'archive_org')
+      end
+    end
+
+    assert_equal 1, sentry_call_count
+
+    media_data = Pender::Store.current.read(Media.get_id(url), :json)
+    expected_error_message = "Item Not Available"
+    assert_includes media_data.dig('archives', 'archive_org', 'error', 'message'), expected_error_message
+    assert_equal Lapis::ErrorCodes::const_get('ARCHIVER_ERROR'), media_data.dig('archives', 'archive_org', 'error', 'code')
+  end
 end
