Channel: Active questions tagged atom-feed+feed+rss - Stack Overflow

How to identify feeds in a web crawl?


I've run a web crawl and gathered a lot of HTML and XML pages. My goal is to extract all RSS/Atom feeds from them. I noticed that many sites simply send "text/xml" in the Content-Type header, so I can't distinguish a feed from any other kind of XML based on the header alone. So I wrote this piece of code:

public boolean isFeed(String content){
    Document doc = Jsoup.parse(content);
    Elements feed = doc.getElementsByTag("feed");
    Elements channel = doc.getElementsByTag("channel");
    if(feed!=null){
        if(!feed.isEmpty()){
            return true;
        }
    }
    if(channel!=null){
        if(!channel.isEmpty()){
            return true;
        }
    }
    return false;
}

Is there anything missing here? Any problem with it?
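
One thing worth noting about the snippet above: Jsoup.parse(String) parses the content as HTML, and getElementsByTag searches the whole tree, so any page that merely contains a "feed" or "channel" tag somewhere would be reported as a feed. A possible stricter check is to parse the content as XML and look only at the root element. The following is a minimal sketch of that idea using the JDK's built-in XML parser instead of jsoup. The class name FeedSniffer and the method name looksLikeFeed are hypothetical, and the set of recognized formats (RSS 0.9x/2.0, Atom 1.0, RSS 1.0) is an assumption about what the crawl should accept.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: decides whether a document is a feed by its root element.
public class FeedSniffer {
    static final String ATOM_NS = "http://www.w3.org/2005/Atom";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String RSS1_NS = "http://purl.org/rss/1.0/";

    public static boolean looksLikeFeed(String content) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Crawled input is untrusted, so DOCTYPE declarations are refused.
            // Assumption: the JDK's default parser is in use (it supports this feature).
            // Trade-off: a feed that carries a DOCTYPE is rejected too.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());

            Element root = builder
                    .parse(new InputSource(new StringReader(content)))
                    .getDocumentElement();
            String name = root.getLocalName();
            String ns = root.getNamespaceURI();

            // RSS 0.9x / 2.0: the root element is <rss>.
            if ("rss".equals(name)) {
                return true;
            }
            // Atom 1.0: the root element is <feed> in the Atom namespace.
            if ("feed".equals(name) && ATOM_NS.equals(ns)) {
                return true;
            }
            // RSS 1.0: the root is <rdf:RDF> and it contains an RSS 1.0 <channel>.
            if ("RDF".equals(name) && RDF_NS.equals(ns)) {
                return root.getElementsByTagNameNS(RSS1_NS, "channel").getLength() > 0;
            }
            return false;
        } catch (Exception e) {
            // Not well-formed XML (most HTML pages end up here), so not a feed.
            return false;
        }
    }
}
```

With this sketch, an HTML page that happens to contain a "channel" tag is rejected because its root element is not one of the feed roots. Feeds that are not well-formed XML are rejected as well, so a more lenient parser might be preferable if the crawl contains many sloppy feeds.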

