Description
Hi,
I have configured https://github.com/dbpedia/marvin-config to extract the German Wikipedia. A first run worked for the 20220401 dump.
Today I ran it again to extract the 20220601 dump, but the extraction framework only partly worked: after some time, https://de.wikipedia.org/w/api.php returned only HTTP 429 responses.
Exception; de; Main Extraction at 00:00.957s for 62 datasets; Main Extraction failed for instance http://de.dbpedia.org/resource/Liste_von_Autoren/J: Server returned HTTP response code: 429 for URL: https://de.wikipedia.org/w/api.php
java.io.IOException: Server returned HTTP response code: 429 for URL: https://de.wikipedia.org/w/api.php
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1902)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
    at org.dbpedia.extraction.util.MediaWikiConnector$$anonfun$retrievePage$1.apply$mcVI$sp(MediaWikiConnector.scala:97)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:166)
    ...
I used the following settings in extractionConfiguration/extraction.de.properties:
mwc-apiUrl=https://{{LANG}}.wikipedia.org/w/api.php
mwc-maxRetries=5
mwc-connectMs=4000
mwc-readMs=30000
mwc-sleepFactor=2000
It seems the extraction-framework does not handle this HTTP error properly. It would be great if the Retry-After
HTTP header were used to handle such errors. Any suggestions on which properties to adjust for this problem?
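
To make the request concrete, here is a rough sketch of what I mean by honoring Retry-After. It is not the framework's actual code: the class RetryAfter and the method delayMs are hypothetical names, and the fallback of waiting sleepFactor times the attempt number is only my assumption about how mwc-sleepFactor is meant to be used. The header itself can carry either a number of seconds or an HTTP date, so both forms are handled.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not part of the extraction-framework.
final class RetryAfter {

    /**
     * Returns how many milliseconds to wait before the next attempt.
     *
     * @param retryAfter    raw Retry-After header value, or null if absent
     * @param attempt       1-based number of the attempt that just failed
     * @param sleepFactorMs fallback factor (assumed to mirror mwc-sleepFactor)
     * @param now           current time, passed in so the method is testable
     */
    static long delayMs(String retryAfter, int attempt, long sleepFactorMs, Instant now) {
        if (retryAfter != null) {
            String value = retryAfter.trim();
            try {
                // Form 1: delay in seconds, e.g. "120"
                return Math.max(0L, Long.parseLong(value)) * 1000L;
            } catch (NumberFormatException ignored) {
                // not a number, try the date form next
            }
            try {
                // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
                ZonedDateTime when =
                        ZonedDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME);
                return Math.max(0L, Duration.between(now, when.toInstant()).toMillis());
            } catch (DateTimeParseException ignored) {
                // unparseable header, fall through to the fallback
            }
        }
        // Assumed fallback: linear backoff when the header is missing or invalid.
        return sleepFactorMs * attempt;
    }
}
```

In the retry loop, the idea would be to check HttpURLConnection.getResponseCode() for 429 before calling getInputStream(), read the header with getHeaderField("Retry-After"), and sleep for the value returned by delayMs before the next attempt.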