Tuesday, August 27, 2013

Web crawler

A Web crawler is a computer program that, given one or more seed URLs, downloads the data associated with those URLs from the World Wide Web, extracts any hyperlinks contained in it, and recursively continues to download the information identified by those hyperlinks. Web content changes rapidly and without notice, so a crawler revisits the web to find new or updated information. Web crawlers are an important component of web search engines, where they are used to collect the corpus of web pages indexed by the search engine. They are also used in many other applications that process large numbers of web pages, such as web data mining and comparison shopping engines.
A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider, an ant, an automatic indexer, or a Web scutter.

An Internet bot, also known as a web robot, WWW robot, or simply a bot, is a software application that runs automated tasks over the Internet. Typically, bots perform tasks that are both simple and structurally repetitive, at a much higher rate than would be possible for a human alone.
Why do we need a web crawler?
  • To maintain mirror sites for popular Web sites.
  • To test web pages and links for valid syntax and structure.
  • To monitor sites to see when their structure or contents change.
  • To search for copyright infringements.
  • To build a special-purpose index. For example, one that has some understanding of the content stored in multimedia files on the Web.
How does a web crawler work?
A typical web crawler starts by parsing a specified web page and noting any hypertext links on that page that point to other web pages. The crawler then parses those pages for new links, and so on, recursively. A crawler is a piece of software, a script, or an automated program that resides on a single machine. It simply sends HTTP requests for documents to other machines on the Internet, just as a web browser does when the user clicks on links. All the crawler really does is automate the process of following links. The picture below shows the architecture of a web crawler.
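The link-following loop described in this section can be sketched in Python roughly as follows. This is a minimal illustration, not the architecture of any crawler listed here. The names `LinkExtractor`, `extract_links`, `fetch_http`, and `crawl`, as well as the `max_pages` limit, are assumptions made for the example. It uses a queue of pending URLs instead of literal recursion, and it leaves out things a real crawler would need, such as politeness delays and robots.txt handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for the hyperlinks found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links against the page URL and drop #fragments.
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def fetch_http(url):
    """Download a page with an HTTP request, much as a browser would."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seeds, fetch=fetch_http, max_pages=100):
    """Follow links breadth-first from the seed URLs; returns {url: html}."""
    frontier = deque(seeds)   # URLs waiting to be downloaded
    seen = set(seeds)         # URLs already queued, to avoid loops
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

Passing a different `fetch` function, for example one that reads from a dictionary of saved pages, makes it possible to try the loop without touching the network.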
List of published crawler architectures for general-purpose crawlers:
  1. Yahoo! Slurp
  2. Bingbot
  3. Googlebot
  4. PolyBot
  5. RBSE
  6. WebCrawler
  7. World Wide Web Worm
  8. WebFountain
  9. WebRACE

Tuesday, August 6, 2013

Agile Software Development using Scrum

Many software development organizations are striving to become more agile, because successful agile teams are producing higher-quality software that better meets user needs more quickly and at a lower cost than are traditional teams.
The following attributes make the transition to Scrum more difficult than other changes:
  • Successful change is not entirely top-down or bottom-up.
  • The end state is unpredictable.
  • Scrum is pervasive.
  • Scrum is dramatically different.
  • Change is coming more quickly than ever before.
  • Best practices are dangerous.
Despite all the reasons why transitioning to Scrum can be particularly difficult, it is worth the effort because the higher productivity of agile teams reduces time to market. The following reasons show why transitioning to an agile process like Scrum is worthwhile:
  • Higher productivity and lower costs
  • Improved employee engagement and job satisfaction
  • Faster time to market
  • Higher quality
  • Improved stakeholder satisfaction
  • What we've been doing no longer works
The five common activities necessary for a successful and lasting Scrum adoption:
  • Awareness that the current process is not delivering acceptable results
  • Desire to adopt Scrum as a way to address current problems
  • Ability to succeed with Scrum
  • Promotion of Scrum through sharing experiences so that we remember and others can see our successes
  • Transfer of the implications of using Scrum throughout the company
Conveniently, these five activities - Awareness, Desire, Ability, Promotion, and Transfer - can be remembered by the acronym ADAPT. These activities are also summarized in the figure below.