I've run a web crawl and gathered a lot of HTML and XML pages. My goal is to extract all RSS/Atom feeds from them. I noticed that many sites simply send "text/xml" as the Content-Type header, so I can't tell a feed apart from any other kind of XML by the header alone. So I wrote this piece of code:
    public boolean isFeed(String content) {
        Document doc = Jsoup.parse(content);
        Elements feed = doc.getElementsByTag("feed");
        Elements channel = doc.getElementsByTag("channel");
        if (feed != null) {
            if (!feed.isEmpty()) {
                return true;
            }
        }
        if (channel != null) {
            if (!channel.isEmpty()) {
                return true;
            }
        }
        return false;
    }
Is there anything missing here? Any problem with it?
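
For comparison, here is a minimal sketch of a stricter check that looks only at the root element instead of searching the whole document for a "feed" or "channel" tag. It is not the method above: it swaps Jsoup for the JDK's built-in DOM parser, and the class name FeedSniffer is made up for illustration. It assumes the usual root elements: "feed" for Atom, "rss" for RSS 0.9x/2.0, and "rdf:RDF" containing a "channel" for RSS 1.0.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original code.
final class FeedSniffer {

    /** Returns true if the root element of the XML looks like an RSS or Atom feed. */
    static boolean isFeed(String content) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Crawled content is untrusted, so refuse DOCTYPE declarations (XXE hardening).
            // Trade-off: a feed that carries a DOCTYPE is rejected as well.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);

            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler ignores warnings and rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());

            Element root = builder
                    .parse(new InputSource(new StringReader(content)))
                    .getDocumentElement();
            String name = root.getLocalName() != null ? root.getLocalName() : root.getNodeName();

            if ("feed".equals(name) || "rss".equals(name)) {
                return true; // Atom, or RSS 0.9x / 2.0
            }
            // RSS 1.0: <rdf:RDF> root with a <channel> somewhere below it.
            return "RDF".equals(name)
                    && root.getElementsByTagNameNS("*", "channel").getLength() > 0;
        } catch (Exception e) {
            // Not well-formed XML (for example an HTML page), or the parser could not be configured.
            return false;
        }
    }
}
```

With this variant, a non-feed XML file that merely contains a "channel" element somewhere is no longer reported as a feed. The cost is that a strict XML parser also rejects feeds that are not well-formed, so whether this fits depends on how tolerant the crawl needs to be.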