Fixing garbled Chinese text when fetching pages in Java
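When a Java program fetches a page from a URL, the Chinese text sometimes comes out garbled. The main cause is an encoding mismatch. Most web pages are stored as UTF-8, and a page fetched over the network arrives locally as a stream of bytes, so those bytes have to be decoded back into text as UTF-8. If you do not specify the encoding (Java then falls back to the platform default charset, which may not be UTF-8), or you decode with some other encoding, the Chinese characters turn into garbage.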
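To make the cause concrete, here is a minimal sketch that needs no network access. It encodes a two-character Chinese string to UTF-8 bytes, as a server would send them, and then decodes those bytes twice: once as UTF-8 and once as ISO-8859-1, which stands in for any mismatched charset. The class name `EncodingDemo` and the helper `decode` are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: decode raw bytes with an explicitly chosen charset.
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character Chinese word for "Chinese",
        // written with escapes so the source file encoding does not matter.
        String original = "\u4e2d\u6587";

        // What travels over the network: the UTF-8 bytes (3 per character here).
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        String right = decode(wire, StandardCharsets.UTF_8);
        String wrong = decode(wire, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false: every byte became its own character
        System.out.println(wire.length + " bytes, " + wrong.length() + " chars when misread");
    }
}
```

The same thing happens with `new String(bytes)` whenever the platform default charset is not the one the page was written in.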
Here is the code I wrote originally:
// Fetch the source of the page at the given URL
public String getdata(String url) {
    String data = null;
    org.apache.commons.httpclient.HttpClient client = new HttpClient();
    GetMethod getMethod = new GetMethod(url);
    // The HTTP header name is "User-Agent" (hyphen, not underscore)
    getMethod.setRequestHeader("User-Agent",
            "Mozilla/5.0(Windows NT 6.1;Win64;x64;rv:39.0) Gecko/20100101 Firefox/39.0");
    getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
            new DefaultHttpMethodRetryHandler()); // the default retry policy
    try {
        int statusCode = client.executeMethod(getMethod);
        if (statusCode != HttpStatus.SC_OK) {
            System.out.println("Wrong");
        }
        byte[] responseBody = getMethod.getResponseBody();
        // Problem line: no charset given, so the platform default is used
        data = new String(responseBody);
        return data;
    } catch (HttpException e) {
        System.out.println("Please check your provided http address!");
        data = "";
        e.printStackTrace();
    } catch (IOException e) {
        data = "";
        e.printStackTrace();
    } finally {
        getMethod.releaseConnection();
    }
    return data;
}
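Pay attention to the line data = new String(responseBody); (highlighted in red in the original post). Written this way, every Chinese character is garbled when the program runs and prints the result:
[Figure: console output with garbled Chinese text]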
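Change the code to decode with UTF-8: String data = new String(responseBody, "utf8");
The Chinese text now displays correctly. The complete code follows; the changed line is data = new String(responseBody, "utf8");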
// Fetch the page source
public String getdata(String url) {
    String data = null;
    org.apache.commons.httpclient.HttpClient client = new HttpClient();
    GetMethod getMethod = new GetMethod(url);
    // The HTTP header name is "User-Agent" (hyphen, not underscore)
    getMethod.setRequestHeader("User-Agent",
            "Mozilla/5.0(Windows NT 6.1;Win64;x64;rv:39.0) Gecko/20100101 Firefox/39.0");
    getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
            new DefaultHttpMethodRetryHandler()); // the default retry policy
    try {
        int statusCode = client.executeMethod(getMethod);
        if (statusCode != HttpStatus.SC_OK) {
            System.out.println("Wrong");
        }
        byte[] responseBody = getMethod.getResponseBody();
        // Fix: decode the response bytes explicitly as UTF-8
        data = new String(responseBody, "utf8");
        return data;
    } catch (HttpException e) {
        System.out.println("Please check your provided http address!");
        data = "";
        e.printStackTrace();
    } catch (IOException e) {
        data = "";
        e.printStackTrace();
    } finally {
        getMethod.releaseConnection();
    }
    return data;
}
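After running the code, the printed output looks like this:
[Figure: console output with the Chinese text displayed correctly]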
Problem solved.
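Since only most pages, not all, are UTF-8, hardcoding "utf8" can still produce garbage on a page served in another encoding. One possible refinement, offered as a sketch and not as part of the original fix, is to read the charset from the response's Content-Type header and fall back to UTF-8 when it is missing or unrecognized. The class `CharsetPicker` and the method `charsetFrom` are hypothetical names, and how the header value is obtained depends on the HTTP client in use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class CharsetPicker {
    // Hypothetical helper: extract the charset parameter from a Content-Type
    // value such as "text/html; charset=GBK". Returns UTF-8 as the fallback
    // when the header is null, has no charset, or names an unsupported one.
    public static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length())
                            .trim()
                            .replace("\"", ""); // some servers quote the value
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported charset name: use the fallback
                        break;
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("text/html; charset=ISO-8859-1"));
        System.out.println(charsetFrom("text/html"));
        System.out.println(charsetFrom(null));
    }
}
```

With a helper like this, the decoding line would become new String(responseBody, charsetFrom(headerValue)), where headerValue is the Content-Type string taken from the response. Pages that declare their encoding only in an HTML meta tag would still need separate handling.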