Web Crawling and Data Mining with Apache Nutch
Year: 2013
Authors: Dr. Zakir Laliwala, Abdulbasit Shaikh
Publisher: Packt Publishing
ISBN: 978-1-78328-685-0
Language: English
Format: PDF/EPUB/MOBI
Quality: Originally digital (eBook)
Pages: 136
Description:
In Detail
Apache Nutch helps you create your own search engine and customize it to your needs. It integrates easily with your existing application and with components such as Apache Hadoop, Eclipse, and MySQL.
"Web Crawling and Data Mining with Apache Nutch" shows you all the steps needed to crawl webpages for your application and use them to make searching in your application more efficient. You will create your own search engine and will be able to improve your application's page rank in search results.
"Web Crawling and Data Mining with Apache Nutch" starts with the basics of crawling webpages for your application. You will learn to deploy Apache Solr on a server containing data crawled by Apache Nutch and perform sharding with Apache Nutch using Apache Solr. You will integrate your application with databases such as MySQL, HBase, and Accumulo, and also with Apache Solr, which is used as a searcher.
With this book, you will gain the skills needed to create your own search engine. You will also perform link analysis and scoring, which help improve the rank of your application's page.
What you will learn from this book
  Carry out web crawling for your application
  Make searching in your application efficient by integrating it with Apache Solr
  Integrate your application with different databases for data storage
  Run your application in a cluster environment by integrating it with Apache Hadoop
  Perform crawling operations from the Eclipse IDE instead of the command line
  Create your own plugin in Apache Nutch
  Integrate Apache Solr with Apache Nutch, and deploy Apache Solr on Apache Tomcat
  Apply sharding on Apache Tomcat to get good results from Apache Solr while searching
Approach
This book is a user-friendly guide that covers all the necessary steps and examples related to web crawling and data mining using Apache Nutch.
Who this book is written for
"Web Crawling and Data Mining with Apache Nutch" is aimed at data analysts, application developers, web mining engineers, and data scientists. It is a good start for those who want to learn how web crawling and data mining are applied in the current business world. Some prior knowledge of web crawling and data mining is an added benefit.
Table of contents
Preface
Chapter 1: Getting Started with Apache Nutch
  Introduction to Apache Nutch
  Installing and configuring Apache Nutch
  Installation dependencies
  Verifying your Apache Nutch installation
  Crawling your first website
  Installing Apache Solr
  Integration of Solr with Nutch
  Crawling your website using the crawl script
  Crawling the Web, the CrawlDb, and URL filters
  InjectorJob
  GeneratorJob
  FetcherJob
  ParserJob
  DbUpdaterJob
  Invertlinks
  Indexing with Apache Solr
  Parsing and parse filters
  Webgraph
  Loops
  LinkRank
  ScoreUpdater
  A scoring example
  The Apache Nutch plugin
  The Apache Nutch plugin example
  Modifying plugin.xml
  Describing dependencies with the ivy module
  The Indexer extension program
  The Scoring extension program
  Using your plugin with Apache Nutch
  Compiling your plugin
  Understanding the Nutch Plugin architecture
Chapter 2: Deployment, Sharding, and AJAX Solr with Apache Nutch
  Deployment of Apache Solr
  Introduction of deployment
  Need of Apache Solr deployment
  Setting up Java Development Kit
  Setting up Tomcat
  Setting up Apache Solr
  Running Solr on Tomcat
  Sharding using Apache Solr
  Introduction to sharding
  Use of sharding with Apache Nutch
  Distributing documents across shards
  Sharding Apache Solr indexes
  Single cluster
  Splitting shards with Apache Nutch
  Cleaning up with Apache Nutch
  Splitting cluster shards
  Checking statistics of sharding with Apache Nutch
  The final test with Apache Nutch
  Working with AJAX Solr
  Architectural overview of AJAX Solr
  Applying AJAX Solr on Reuters' data
  Running AJAX Solr
Chapter 3: Integration of Apache Nutch with Apache Hadoop and Eclipse
  Integrating Apache Nutch with Apache Hadoop
  Introducing Apache Hadoop
  Installing Apache Hadoop and Apache Nutch
  Downloading Apache Hadoop and Apache Nutch
  Setting up Apache Hadoop with the cluster
  Installing Java
  Downloading Apache Hadoop
  Configuring SSH
  Disabling IPv6
  Installing Apache Hadoop
  Required ownerships and permissions
  The configuration required for Hadoop_HOME/conf/*
  Formatting the HDFS filesystem using the NameNode
  Setting up the deployment architecture of Apache Nutch
  Installing Apache Nutch
  Key points of the Apache Nutch installation
  Starting the cluster
  Performing crawling on the Apache Hadoop cluster
  Configuring Apache Nutch with Eclipse
  Introducing Apache Nutch configuration with Eclipse
  Installation and building Apache Nutch with Eclipse
  Crawling in Eclipse
Chapter 4: Apache Nutch with Gora, Accumulo, and MySQL
  Introduction to Apache Accumulo
  Main features of Apache Accumulo
  Introduction to Apache Gora
  Supported data stores
  Use of Apache Gora
  Integration of Apache Nutch with Apache Accumulo
  Configuring Apache Gora with Apache Nutch
  Setting up Apache Hadoop and Apache ZooKeeper
  Installing and configuring Apache Accumulo
  Testing Apache Accumulo
  Crawling with Apache Nutch on Apache Accumulo
  Integration of Apache Nutch with MySQL
  Introduction to MySQL
  Benefits of integrating MySQL with Apache Nutch
  Configuring MySQL with Apache Nutch
  Crawling with Apache Nutch on MySQL
Index