Hi,

When I wget the page "absolute_instrument", I get a gzipped version of it:

    $ file absolute_instrument
    absolute_instrument: gzip compressed data, from Unix

This is unlike the example "-a", which is not gzipped but served as plain HTML right away. Hence the former may look garbled to you unless you run "gunzip" on it first to remove the compression. (If gzip complains about "unknown suffix", rename the file to *.gz.) Then you should get regular HTML.

Here's an example of how to gunzip in Java:
http://code.hammerpig.com/how-to-gunzip-files-with-java.html

I am not sure, however, how the server side decides whether to compress it or not.

Hope that helps anyway,

Daniel

On Fri, Jul 29, 2011 at 2:58 PM, Matthew Pocock <[email protected]> wrote:

> Hi,
>
> I've been pulling down pages from wiktionary in a Java application. The
> majority of pages seem to work fine (e.g. http://en.wiktionary.org//wiki/-a).
> I can load them in Java, and if I wget them, I end up with a file containing
> what I'd expect.
>
> However, some pages seem not to work (e.g.
> http://en.wiktionary.org/wiki/absolute_instrument). In Java, I get a codec
> exception and when using wget, the resulting downloaded file is garbled. I
> think this is because although they claim to be UTF-8 encoded, they are not.
> These pages show up fine in my browser, but it isn't telling me what charset
> it uses to decode the text.
>
> Is this a known issue? Is there any workaround for this? Can it be fixed
> server-side?
>
> Thanks,
>
> Matthew
>
> --
> Dr Matthew Pocock
> Visitor, School of Computing Science, Newcastle University
> _______________________________________________
> Wiktionary-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wiktionary-l

--
Daniel Zahn <[email protected]>
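
P.S. For the Java side, here is a minimal sketch of the gunzip step described in the reply above. It assumes the page body has already been read into a byte array; it checks for the gzip magic bytes (0x1f 0x8b) and only decompresses when they are present, then decodes the result as UTF-8. The class name GunzipIfNeeded and the methods looksGzipped and decode are made up for illustration, and the main method compresses a small string in memory rather than fetching anything from Wiktionary.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not taken from the linked article.
public class GunzipIfNeeded {

    /** Returns true if the data starts with the gzip magic bytes 0x1f 0x8b. */
    static boolean looksGzipped(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xff) == 0x1f
                && (data[1] & 0xff) == 0x8b;
    }

    /** Decompresses the body if it is gzipped, then decodes it as UTF-8. */
    static String decode(byte[] body) throws IOException {
        if (!looksGzipped(body)) {
            // Plain body (like the "-a" page): decode directly.
            return new String(body, StandardCharsets.UTF_8);
        }
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(body));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            // Copy the decompressed stream into memory.
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a gzipped page body in memory instead of using the network.
        byte[] html = "<html><body>absolute instrument</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write(html);
        }
        // Both calls should print the same HTML.
        System.out.println(decode(compressed.toByteArray()));
        System.out.println(decode(html));
    }
}
```

Checking the magic bytes rather than trusting the declared charset means the same code path handles both kinds of page, which seems to match the situation here where only some pages come back compressed.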
